Skip to content

Mal-Suen/LDA-Topic-Modeling

Repository files navigation

LDA-Topic-Modeling

From Teaching Prototype to Industrial-Grade Pipeline: Optimization and Verification of LDA Topic Models. 从教学原型到工业级管线:LDA 主题模型的工程化优化与验证。

GitHub license Python 3.9+ Gensim 4.4+ CI/CD


🇬🇧 English Documentation

📖 Introduction

Latent Dirichlet Allocation (LDA) remains a cornerstone of unsupervised text mining. While Large Language Models (LLMs) dominate semantic understanding, they face bottlenecks in cost, privacy, and determinism. This project transforms a basic LDA implementation (initially released in 2018) into a rigorous, industrial-grade tool. By employing Stratified Random Sampling, Control Variable Experiments, and Strict Preprocessing Pipelines, we demonstrate that a properly tuned LDA model can achieve a $C_V$ coherence score of 0.65, significantly outperforming baselines and offering a robust, explainable alternative for massive text corpora.

🚀 Key Features & Conclusions

Feature Detail
🧪 High Coherence Validated via Multi-Seed (42, 88, 99) experiments. Baseline: 0.59 $\to$ Optimized: 0.65.
🧹 Strict Preprocessing Stopword Filtering must precede N-gram extraction. Reverse order causes PMI noise.
🚫 Avoid TF-IDF TF-IDF breaks the Dirichlet-Multinomial conjugacy, causing inference failure.
🤝 LLM Hybrid Arch LDA for clustering + LLM for labeling. Reduces API costs by 99.9%.
🏭 Industrial CLI Complete toolkit: analyze, verify, find-topics, tokenize.
⚡ Streaming API Memory-efficient processing for corpora >100MB.

📊 Experimental Results

Validated on THUCNews dataset (14,000 documents):

Configuration Score ($C_V$) Conclusion
Baseline (Raw) 0.5332 Early 2018 prototype
Baseline (14k Docs) 0.5902 Sub-optimal convergence
Optimized Pipeline 0.6245 ± 0.02 ✅ Significant improvement
Best Run (Seed 99) 0.6505 ✅ Theoretical upper bound

🛠️ Getting Started

Installation

pip install -r requirements.txt

Python API

from topic_model.lda_model import LDATopicModel

# Initialize
model = LDATopicModel(num_topics=14, passes=20, random_state=99)

# Load & Train
model.load_corpus("data/thuc_corpus.txt")
model.build_dictionary_and_corpus()
model.train_model()

# Evaluate
score = model.evaluate_model()
print(f"Coherence Score: {score:.4f}")

CLI Commands

# Topic modeling analysis
python -m topic_model.cli analyze data/corpus.txt -k 14 -o results

# Verify experiments (reproduce paper results)
python -m topic_model.cli verify data/corpus.txt --seed 99 -k 14

# Find optimal topic count
python -m topic_model.cli find-topics data/corpus.txt --min 5 --max 20

# Chinese tokenization
python -m topic_model.cli tokenize -s data/stopwords.txt input.txt > output.txt

Docker Deployment

# Build image
docker build -t lda-topic-model .

# Run analysis
docker run -v $(pwd)/data:/app/data -v $(pwd)/results:/app/results \
  lda-topic-model analyze /app/data/corpus.txt -k 14 -o /app/results

📂 Project Structure

LDA-Topic-Modeling/
├── data/                  # Datasets & processing
│   ├── THUCNews/          # Raw THUCNews dataset
│   ├── thuc_corpus.txt    # Processed corpus (14k docs)
│   ├── sample_corpus.txt  # Sample corpus for testing
│   ├── stopwords.txt      # Chinese stopwords
│   └── process_thucnews.py # Data preprocessing script
├── docs/
│   └── REVISITING_LDA.md  # Comprehensive engineering analysis
├── topic_model/           # Core Python package
│   ├── __init__.py        # Package init & logging
│   ├── __main__.py        # Entry point
│   ├── cli.py             # CLI commands
│   └── lda_model.py       # Core LDA class
├── tests/                 # Unit tests
├── results/               # Model outputs
│   ├── model/             # Trained LDA model files
│   ├── report.json        # Analysis report
│   ├── classifications.csv # Document classifications
│   └── lda_visualization.html # Interactive visualization
├── .github/workflows/     # CI/CD pipeline
├── Dockerfile             # Multi-stage Docker build
├── README.md              # This file
└── requirements.txt       # Dependencies

🔬 Key Findings

  1. N-gram Sequence Dependency: Stopwords must be removed before Bigram detection to avoid noise phrases (e.g., "the_of").
  2. TF-IDF Failure: Weighting breaks LDA's mathematical foundation (Dirichlet-Multinomial Conjugacy).
  3. Multi-Seed Validation: Single runs are insufficient; at least 3 independent seeds are required for statistical significance.

🇨🇳 中文文档

📖 项目简介

隐狄利克雷分配(LDA)依然是无监督文本挖掘的基石。虽然大语言模型(LLM)在语义理解上占据主导,但在成本隐私确定性方面面临瓶颈。本项目将 2018 年的基础 LDA 实现重构为严谨的工业级工具。通过采用分层随机抽样控制变量实验严格的预处理流水线,我们证明了经过优化的 LDA 模型可以达到 0.65 的 $C_V$ 一致性得分,显著优于基线,并为海量文本语料提供了一种鲁棒、可解释的替代方案。

🚀 核心特性与结论

特性 细节
🧪 高一致性 多随机种子(42, 88, 99)验证克服局部最优。基线: 0.59 $\to$ 优化: 0.65
🧹 严格预处理 停用词过滤必须在 N-gram 提取之前。反序会导致 PMI 噪声和得分下降。
🚫 避免 TF-IDF 实验证实 TF-IDF 破坏狄利克雷-多项共轭结构,导致后验推断失败。
🤝 LLM 混合架构 LDA 用于降维聚类 + LLM 用于语义标注。API 成本降低 99.9%。
🏭 工业级 CLI 完整命令行工具:analyzeverifyfind-topicstokenize
⚡ 流式 API 支持 >100MB 大语料的内存高效处理。

📊 实验结果

基于 THUCNews 数据集(1.4 万篇文档)验证:

配置 得分 ($C_V$) 结论
基线(原始) 0.5332 2018 年初期原型
基线(1.4 万篇) 0.5902 次优收敛
优化流水线 0.6245 ± 0.02 ✅ 显著提升
最佳运行(种子 99) 0.6505 ✅ 理论上限

🛠️ 快速开始

安装

pip install -r requirements.txt

Python API

from topic_model.lda_model import LDATopicModel

# 初始化
model = LDATopicModel(num_topics=14, passes=20, random_state=99)

# 加载与训练
model.load_corpus("data/thuc_corpus.txt")
model.build_dictionary_and_corpus()
model.train_model()

# 评估
score = model.evaluate_model()
print(f"一致性得分: {score:.4f}")

命令行工具

# 执行主题建模分析
python -m topic_model.cli analyze data/corpus.txt -k 14 -o results

# 验证实验(复现论文结果)
python -m topic_model.cli verify data/corpus.txt --seed 99 -k 14

# 寻找最优主题数
python -m topic_model.cli find-topics data/corpus.txt --min 5 --max 20

# 中文分词
python -m topic_model.cli tokenize -s data/stopwords.txt input.txt > output.txt

Docker 部署

# 构建镜像
docker build -t lda-topic-model .

# 运行分析
docker run -v $(pwd)/data:/app/data -v $(pwd)/results:/app/results \
  lda-topic-model analyze /app/data/corpus.txt -k 14 -o /app/results

📂 目录结构

LDA-Topic-Modeling/
├── data/                  # 数据集与处理
│   ├── THUCNews/          # 原始 THUCNews 数据集
│   ├── thuc_corpus.txt    # 处理后语料(1.4万篇)
│   ├── sample_corpus.txt  # 测试用示例语料
│   ├── stopwords.txt      # 中文停用词表
│   └── process_thucnews.py # 数据预处理脚本
├── docs/
│   └── REVISITING_LDA.md  # 全面工程分析报告
├── topic_model/           # 核心 Python 包
│   ├── __init__.py        # 包初始化与日志配置
│   ├── __main__.py        # 入口点
│   ├── cli.py             # CLI 命令
│   └── lda_model.py       # LDA 核心类
├── tests/                 # 单元测试
├── results/               # 模型输出
│   ├── model/             # 训练好的 LDA 模型文件
│   ├── report.json        # 分析报告
│   ├── classifications.csv # 文档分类结果
│   └── lda_visualization.html # 交互式可视化
├── .github/workflows/     # CI/CD 流水线
├── Dockerfile             # 多阶段 Docker 构建
├── README.md              # 本文件
└── requirements.txt       # 依赖项

🔬 关键发现

  1. N-gram 序列依赖性: 必须在 Bigram 检测前移除停用词,以避免"噪声短语"(如"的_了")。
  2. TF-IDF 失效: 加权破坏了 LDA 的数学基础(狄利克雷-多项共轭)。
  3. 多种子验证: 单次运行结果不具备统计代表性,至少需要 3 个独立随机种子。

🤝 Contribution & Contact / 贡献与联系

Copyright © 2018-2026 Mal-Suen. Released under MIT License.

About

LDA topic modeling toolkit for Chinese text analysis. Achieves C_V coherence of 0.65 on THUCNews. Features CLI tools and multi-seed validation.中文 LDA 主题建模工具。在 THUCNews 数据集上达到 0.65 的 C_V 一致性得分。提供命令行工具和多种子验证。

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors