LDA-Topic-Modeling

From Teaching Prototype to Industrial-Grade Pipeline: Optimization and Verification of LDA Topic Models. 从教学原型到工业级管线：LDA 主题模型的工程化优化与验证。

🇬🇧 English Documentation

📖 Introduction

Latent Dirichlet Allocation (LDA) remains a cornerstone of unsupervised text mining. While Large Language Models (LLMs) dominate semantic understanding, they face bottlenecks in cost, privacy, and determinism. This project transforms a basic LDA implementation (initially released in 2018) into a rigorous, industrial-grade tool. By employing Stratified Random Sampling, Control Variable Experiments, and Strict Preprocessing Pipelines, we demonstrate that a properly tuned LDA model can achieve a $C_V$ coherence score of 0.65, significantly outperforming baselines and offering a robust, explainable alternative for massive text corpora.

🚀 Key Features & Conclusions

Feature	Detail
🧪 High Coherence	Validated via Multi-Seed (42, 88, 99) experiments. Baseline: 0.59 $\to$ Optimized: 0.65.
🧹 Strict Preprocessing	Stopword Filtering must precede N-gram extraction. Reverse order causes PMI noise.
🚫 Avoid TF-IDF	TF-IDF breaks the Dirichlet-Multinomial conjugacy, causing inference failure.
🤝 LLM Hybrid Arch	LDA for clustering + LLM for labeling. Reduces API costs by 99.9%.
🏭 Industrial CLI	Complete toolkit: `analyze`, `verify`, `find-topics`, `tokenize`.
⚡ Streaming API	Memory-efficient processing for corpora >100MB.

📊 Experimental Results

Validated on THUCNews dataset (14,000 documents):

Configuration	Score ($C_V$)	Conclusion
Baseline (Raw)	0.5332	Early 2018 prototype
Baseline (14k Docs)	0.5902	Sub-optimal convergence
Optimized Pipeline	0.6245 ± 0.02	✅ Significant improvement
Best Run (Seed 99)	0.6505	✅ Theoretical upper bound

🛠️ Getting Started

Installation

pip install -r requirements.txt

Python API

from topic_model.lda_model import LDATopicModel

# Initialize
model = LDATopicModel(num_topics=14, passes=20, random_state=99)

# Load & Train
model.load_corpus("data/thuc_corpus.txt")
model.build_dictionary_and_corpus()
model.train_model()

# Evaluate
score = model.evaluate_model()
print(f"Coherence Score: {score:.4f}")

CLI Commands

# Topic modeling analysis
python -m topic_model.cli analyze data/corpus.txt -k 14 -o results

# Verify experiments (reproduce paper results)
python -m topic_model.cli verify data/corpus.txt --seed 99 -k 14

# Find optimal topic count
python -m topic_model.cli find-topics data/corpus.txt --min 5 --max 20

# Chinese tokenization
python -m topic_model.cli tokenize -s data/stopwords.txt input.txt > output.txt

Docker Deployment

# Build image
docker build -t lda-topic-model .

# Run analysis
docker run -v $(pwd)/data:/app/data -v $(pwd)/results:/app/results \
  lda-topic-model analyze /app/data/corpus.txt -k 14 -o /app/results

📂 Project Structure

LDA-Topic-Modeling/
├── data/                  # Datasets & processing
│   ├── THUCNews/          # Raw THUCNews dataset
│   ├── thuc_corpus.txt    # Processed corpus (14k docs)
│   ├── sample_corpus.txt  # Sample corpus for testing
│   ├── stopwords.txt      # Chinese stopwords
│   └── process_thucnews.py # Data preprocessing script
├── docs/
│   └── REVISITING_LDA.md  # Comprehensive engineering analysis
├── topic_model/           # Core Python package
│   ├── __init__.py        # Package init & logging
│   ├── __main__.py        # Entry point
│   ├── cli.py             # CLI commands
│   └── lda_model.py       # Core LDA class
├── tests/                 # Unit tests
├── results/               # Model outputs
│   ├── model/             # Trained LDA model files
│   ├── report.json        # Analysis report
│   ├── classifications.csv # Document classifications
│   └── lda_visualization.html # Interactive visualization
├── .github/workflows/     # CI/CD pipeline
├── Dockerfile             # Multi-stage Docker build
├── README.md              # This file
└── requirements.txt       # Dependencies

🔬 Key Findings

N-gram Sequence Dependency: Stopwords must be removed before Bigram detection to avoid noise phrases (e.g., "the_of").
TF-IDF Failure: Weighting breaks LDA's mathematical foundation (Dirichlet-Multinomial Conjugacy).
Multi-Seed Validation: Single runs are insufficient; at least 3 independent seeds are required for statistical significance.

🇨🇳 中文文档

📖 项目简介

隐狄利克雷分配（LDA）依然是无监督文本挖掘的基石。虽然大语言模型（LLM）在语义理解上占据主导，但在成本、隐私和确定性方面面临瓶颈。本项目将 2018 年的基础 LDA 实现重构为严谨的工业级工具。通过采用分层随机抽样、控制变量实验和严格的预处理流水线，我们证明了经过优化的 LDA 模型可以达到 0.65 的 $C_V$ 一致性得分，显著优于基线，并为海量文本语料提供了一种鲁棒、可解释的替代方案。

🚀 核心特性与结论

特性	细节
🧪 高一致性	多随机种子（42, 88, 99）验证克服局部最优。基线: 0.59 $\to$ 优化: 0.65。
🧹 严格预处理	停用词过滤必须在 N-gram 提取之前。反序会导致 PMI 噪声和得分下降。
🚫 避免 TF-IDF	实验证实 TF-IDF 破坏狄利克雷-多项共轭结构，导致后验推断失败。
🤝 LLM 混合架构	LDA 用于降维聚类 + LLM 用于语义标注。API 成本降低 99.9%。
🏭 工业级 CLI	完整命令行工具：`analyze`、`verify`、`find-topics`、`tokenize`。
⚡ 流式 API	支持 >100MB 大语料的内存高效处理。

📊 实验结果

基于 THUCNews 数据集（1.4 万篇文档）验证：

配置	得分 ($C_V$)	结论
基线（原始）	0.5332	2018 年初期原型
基线（1.4 万篇）	0.5902	次优收敛
优化流水线	0.6245 ± 0.02	✅ 显著提升
最佳运行（种子 99）	0.6505	✅ 理论上限

🛠️ 快速开始

安装

pip install -r requirements.txt

Python API

from topic_model.lda_model import LDATopicModel

# 初始化
model = LDATopicModel(num_topics=14, passes=20, random_state=99)

# 加载与训练
model.load_corpus("data/thuc_corpus.txt")
model.build_dictionary_and_corpus()
model.train_model()

# 评估
score = model.evaluate_model()
print(f"一致性得分: {score:.4f}")

命令行工具

# 执行主题建模分析
python -m topic_model.cli analyze data/corpus.txt -k 14 -o results

# 验证实验（复现论文结果）
python -m topic_model.cli verify data/corpus.txt --seed 99 -k 14

# 寻找最优主题数
python -m topic_model.cli find-topics data/corpus.txt --min 5 --max 20

# 中文分词
python -m topic_model.cli tokenize -s data/stopwords.txt input.txt > output.txt

Docker 部署

# 构建镜像
docker build -t lda-topic-model .

# 运行分析
docker run -v $(pwd)/data:/app/data -v $(pwd)/results:/app/results \
  lda-topic-model analyze /app/data/corpus.txt -k 14 -o /app/results

📂 目录结构

LDA-Topic-Modeling/
├── data/                  # 数据集与处理
│   ├── THUCNews/          # 原始 THUCNews 数据集
│   ├── thuc_corpus.txt    # 处理后语料（1.4万篇）
│   ├── sample_corpus.txt  # 测试用示例语料
│   ├── stopwords.txt      # 中文停用词表
│   └── process_thucnews.py # 数据预处理脚本
├── docs/
│   └── REVISITING_LDA.md  # 全面工程分析报告
├── topic_model/           # 核心 Python 包
│   ├── __init__.py        # 包初始化与日志配置
│   ├── __main__.py        # 入口点
│   ├── cli.py             # CLI 命令
│   └── lda_model.py       # LDA 核心类
├── tests/                 # 单元测试
├── results/               # 模型输出
│   ├── model/             # 训练好的 LDA 模型文件
│   ├── report.json        # 分析报告
│   ├── classifications.csv # 文档分类结果
│   └── lda_visualization.html # 交互式可视化
├── .github/workflows/     # CI/CD 流水线
├── Dockerfile             # 多阶段 Docker 构建
├── README.md              # 本文件
└── requirements.txt       # 依赖项

🔬 关键发现

N-gram 序列依赖性： 必须在 Bigram 检测前移除停用词，以避免"噪声短语"（如"的_了"）。
TF-IDF 失效： 加权破坏了 LDA 的数学基础（狄利克雷-多项共轭）。
多种子验证： 单次运行结果不具备统计代表性，至少需要 3 个独立随机种子。

🤝 Contribution & Contact / 贡献与联系

Author: Mal-Suen
Blog: Mal-Suen's Blog
GitHub: https://github.com/Mal-Suen/LDA-Topic-Modeling

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LDA-Topic-Modeling

🇬🇧 English Documentation

📖 Introduction

🚀 Key Features & Conclusions

📊 Experimental Results

🛠️ Getting Started

Installation

Python API

CLI Commands

Docker Deployment

📂 Project Structure

🔬 Key Findings

🇨🇳 中文文档

📖 项目简介

🚀 核心特性与结论

📊 实验结果

🛠️ 快速开始

安装

Python API

命令行工具

Docker 部署

📂 目录结构

🔬 关键发现

🤝 Contribution & Contact / 贡献与联系

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github/workflows		.github/workflows
data		data
docs		docs
tests		tests
topic_model		topic_model
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

LDA-Topic-Modeling

🇬🇧 English Documentation

📖 Introduction

🚀 Key Features & Conclusions

📊 Experimental Results

🛠️ Getting Started

Installation

Python API

CLI Commands

Docker Deployment

📂 Project Structure

🔬 Key Findings

🇨🇳 中文文档

📖 项目简介

🚀 核心特性与结论

📊 实验结果

🛠️ 快速开始

安装

Python API

命令行工具

Docker 部署

📂 目录结构

🔬 关键发现

🤝 Contribution & Contact / 贡献与联系

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages