Critic-CoT Reasoning Pipeline with Data Engineering
A data-engineered Critic-CoT system that runs LLM reasoning strategies on GSM8K, stores reasoning traces in SQLite, exports CSV files, and generates automated evaluation reports for accuracy, latency, token usage, and critique analysis.
Large Language Models can solve reasoning problems using chain-of-thought reasoning, but the generated reasoning is not always reliable. A model may produce a final answer while making hidden mistakes in intermediate steps.
This project implements the Critic-CoT approach and extends it into a complete data engineering pipeline. Instead of keeping reasoning as plain text inside a notebook, the project stores reasoning traces, steps, critiques, revisions, latency, token usage, and accuracy as structured data.
The final system can ingest GSM8K samples, run multiple reasoning strategies through OpenRouter, save results to SQLite, export CSV files, generate reports, and support scheduled evaluation runs.
Large Language Models generate answers using chain-of-thought reasoning, but this reasoning is often inconsistent or logically flawed. The Critic-CoT approach introduces a critic model that evaluates and improves the reasoning process. This project implements the Critic-CoT methodology and extends it into a data-engineered system in which reasoning steps, critiques, and revisions are treated as structured data rather than plain text.
Data engineering pipelines automate ingestion, processing, iteration, and storage of reasoning traces, which makes the reasoning process repeatable, traceable, and observable. Automation removes manual intervention and reduces hidden errors, and monitoring tools analyze reasoning quality over multiple runs. The project bridges research-level reasoning methods with system-level engineering practices and demonstrates how LLM reasoning can be made more reliable and production-ready.
- Loads GSM8K math reasoning data from HuggingFace.
- Caches and normalizes samples for repeatable processing.
- Runs four reasoning strategies: baseline, iterative refinement, critic as filter, and majority vote.
- Uses OpenRouter API for LLM calls.
- Extracts final answers from generated reasoning.
- Verifies arithmetic expressions when possible.
- Stores complete reasoning traces in SQLite.
- Stores individual reasoning steps and critic feedback separately.
- Exports traces, steps, critiques, and metrics to CSV.
- Generates evaluation reports in CSV, JSON, and Markdown.
- Supports one-command execution through PowerShell.
- Supports scheduled one-time or continuous runs.
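
The answer-extraction and arithmetic-verification features in the list above could look roughly like the following sketch. It is an illustration under stated assumptions, not the project's code: `extract_final_answer`, `safe_eval`, and `verify_step` are hypothetical names, and the actual logic lives in `critic_cot_wrapper.py`. The sketch prefers the `#### <number>` marker used by GSM8K reference solutions and otherwise falls back to the last number in the text.

```python
import ast
import operator
import re

# Binary operators allowed during verification; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extract_final_answer(text):
    """Return the final number in the text as a string without commas, or None."""
    # Prefer the GSM8K-style "#### 42" marker when it is present.
    marked = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    candidates = marked or re.findall(r"-?[\d,]*\.?\d+", text)
    if not candidates:
        return None
    return candidates[-1].replace(",", "")


def safe_eval(expr):
    """Evaluate +, -, *, / arithmetic by walking the AST instead of calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval").body)


def verify_step(step, tol=1e-6):
    """Check a step shaped like '16 - 3 - 4 = 9'.

    Returns True or False when the step can be checked, and None otherwise.
    """
    match = re.fullmatch(r"\s*([\d\s.+\-*/()]+)=\s*(-?\d+(?:\.\d+)?)\s*", step)
    if not match:
        return None
    try:
        left = safe_eval(match.group(1).strip())
        return abs(left - float(match.group(2))) <= tol
    except (ValueError, SyntaxError, ZeroDivisionError):
        return None
```

Returning `None` for steps that cannot be parsed keeps "not verifiable" separate from "verified wrong", which matches the "when possible" wording of the feature list.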
GSM8K Dataset
|
v
Data Ingestion Layer
|
v
Standardized Question Format
|
v
Critic-CoT Wrapper
|
v
Reasoning Strategies
|
v
SQLite Storage
|
v
CSV Exports and Evaluation Reports
|
v
Faculty Review / Analysis / Visualization
| Strategy | Description |
|---|---|
| baseline | Generates one direct chain-of-thought answer and checks the final answer. |
| iter_refine | Generates an answer, critiques it, and refines the reasoning for a fixed number of iterations. |
| filter | Generates multiple candidate answers and uses critic feedback to select the strongest one. |
| majority | Generates multiple answers and selects the most frequent normalized final answer. |
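
As an illustration of the `majority` strategy, the following sketch normalizes candidate answers and picks the most frequent one. The function names `normalize_answer` and `majority_vote` are hypothetical, and the normalization rules (strip whitespace, commas, a leading `$`, and trailing zeros) are assumptions; the project's own normalization may differ.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation


def normalize_answer(raw):
    """Map variants such as '18', '18.0', and '$18' to one canonical string."""
    cleaned = raw.strip().replace(",", "").lstrip("$")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        # Non-numeric answers are compared case-insensitively as text.
        return cleaned.lower()
    # normalize() drops trailing zeros; the "f" format avoids exponent notation.
    return format(value.normalize(), "f")


def majority_vote(answers):
    """Return (winning_answer, vote_share), or (None, 0.0) with no usable answers."""
    normalized = [normalize_answer(a) for a in answers if a is not None]
    if not normalized:
        return None, 0.0
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Three hypothetical model outputs for the same question.
winner, share = majority_vote(["18", "18.0", "20"])
print(winner, round(share, 2))
```

The vote share is returned alongside the winner because a narrow majority is a useful signal to store with the trace.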
| Component | File | Purpose |
|---|---|---|
| Storage Layer | data_engineering/storage/reasoning_db.py | Creates SQLite tables and stores traces, steps, critiques, and daily metrics. |
| Data Ingestion | data_engineering/pipeline/data_ingestion.py | Loads GSM8K, caches data, and converts records into pipeline format. |
| Critic-CoT Wrapper | data_engineering/pipeline/critic_cot_wrapper.py | Handles OpenRouter calls, answer extraction, verification, critique, refinement, and strategies. |
| Main Pipeline | data_engineering/pipeline/simple_pipeline.py | Runs strategies, stores outputs, exports CSV files, and tracks metrics. |
| Evaluation Runner | data_engineering/pipeline/run_evaluation.py | Provides CLI support for sample size, strategy selection, and automated reports. |
| Scheduler | data_engineering/pipeline/scheduler.py | Runs daily or one-time evaluations and logs results to CSV. |
| Configuration | data_engineering/config/settings.py | Loads API key, model, paths, and runtime settings safely. |
The project uses SQLite as the storage layer. The database contains four main tables.
| Table | Description |
|---|---|
| traces | Stores complete reasoning traces, final answers, correctness, latency, tokens, cost, and raw trace JSON. |
| steps | Stores extracted reasoning steps for each trace and iteration. |
| critiques | Stores critic feedback, detected error step, error flag, and verification result. |
| daily_metrics | Stores aggregated metrics for scheduled or repeated runs. |
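
To make the four tables concrete, here is a minimal schema sketch using Python's built-in `sqlite3` module. Only the table names come from this README; every column name and type below is an assumption chosen to match the descriptions, and the real definitions are in `reasoning_db.py`.

```python
import json
import sqlite3

# Hypothetical columns; only the four table names are taken from the README.
SCHEMA = """
CREATE TABLE IF NOT EXISTS traces (
    trace_id     INTEGER PRIMARY KEY,
    question_id  TEXT NOT NULL,
    strategy     TEXT NOT NULL,
    final_answer TEXT,
    is_correct   INTEGER,
    latency_ms   REAL,
    total_tokens INTEGER,
    cost_usd     REAL,
    raw_trace    TEXT
);
CREATE TABLE IF NOT EXISTS steps (
    step_id    INTEGER PRIMARY KEY,
    trace_id   INTEGER NOT NULL REFERENCES traces(trace_id),
    iteration  INTEGER NOT NULL,
    step_index INTEGER NOT NULL,
    content    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS critiques (
    critique_id INTEGER PRIMARY KEY,
    trace_id    INTEGER NOT NULL REFERENCES traces(trace_id),
    iteration   INTEGER NOT NULL,
    feedback    TEXT,
    error_step  INTEGER,
    has_error   INTEGER,
    verified    INTEGER
);
CREATE TABLE IF NOT EXISTS daily_metrics (
    run_date       TEXT NOT NULL,
    strategy       TEXT NOT NULL,
    samples        INTEGER,
    accuracy       REAL,
    avg_latency_ms REAL,
    PRIMARY KEY (run_date, strategy)
);
"""


def init_db(path=":memory:"):
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def save_trace(conn, question_id, strategy, final_answer, is_correct, raw):
    """Insert one trace row and return its generated trace_id."""
    cur = conn.execute(
        "INSERT INTO traces (question_id, strategy, final_answer, is_correct, raw_trace) "
        "VALUES (?, ?, ?, ?, ?)",
        (question_id, strategy, final_answer, int(is_correct), json.dumps(raw)),
    )
    conn.commit()
    return cur.lastrowid


conn = init_db()
trace_id = save_trace(conn, "gsm8k-0", "baseline", "18", True, {"steps": ["9 * 2 = 18"]})
```

Keeping `steps` and `critiques` in their own tables, keyed by `trace_id`, is what allows per-step and per-critique analysis without parsing the raw trace JSON again.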
project/
|-- critic_cot1.ipynb
|-- README.md
|-- run_once.ps1
|-- run_once.bat
|-- config.example.py
|-- .env.example
|-- generate_report_docx.py
|-- submission/
| |-- Project_Report_Critic_CoT_Data_Engineering.docx
|-- data_engineering/
| |-- README.md
| |-- requirements.txt
| |-- config/
| | |-- settings.py
| |-- pipeline/
| | |-- data_ingestion.py
| | |-- critic_cot_wrapper.py
| | |-- simple_pipeline.py
| | |-- run_evaluation.py
| | |-- scheduler.py
| |-- storage/
| | |-- reasoning_db.py
| |-- data/
| | |-- reasoning_traces.db
| | |-- baseline_results.csv
| | |-- iter_refine_results.csv
| | |-- filter_results.csv
| | |-- majority_results.csv
| | |-- exports/
| | | |-- traces.csv
| | | |-- steps.csv
| | | |-- critiques.csv
| | | |-- daily_metrics.csv
| | |-- reports/
| | | |-- evaluation_report_*.csv
| | | |-- evaluation_report_*.json
| | | |-- evaluation_report_*.md
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r .\data_engineering\requirements.txt
Provide the OpenRouter API key using one of the following methods.
Method 1: Environment variable
$env:OPENROUTER_API_KEY="your-openrouter-api-key"
Method 2: config.py
Copy-Item .\config.example.py .\config.py
Then edit config.py and add your API key.
Important: Do not upload config.py to GitHub. It is ignored by .gitignore.
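
The two methods above suggest a lookup order like the one in this sketch: environment variable first, then `config.py`. This is an assumption about how `settings.py` might behave, not a copy of it. The helper names `load_api_key` and `mask` are hypothetical, and so is the idea that `config.py` exposes a variable named `OPENROUTER_API_KEY`.

```python
import importlib
import os


def load_api_key(env_var="OPENROUTER_API_KEY", config_module="config"):
    """Return the key from the environment, falling back to config.py."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    try:
        module = importlib.import_module(config_module)
    except ImportError:
        module = None  # config.py is optional
    # Assumed: config.py defines a variable with the same name as the env var.
    key = str(getattr(module, env_var, "") or "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} or define it in {config_module}.py")
    return key


def mask(key):
    """Hide all but the last four characters so the key is safe to log."""
    return "*" * max(len(key) - 4, 0) + key[-4:]
```

Failing early with a clear error is preferable to sending an empty key to the API, and masking keeps the key out of terminal output and log files.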
For a simple one-time demo run:
powershell -ExecutionPolicy Bypass -File .\run_once.ps1
This runs one GSM8K sample using all four strategies:
- baseline
- iter_refine
- filter
- majority
The terminal may show no output for a few minutes while the program waits for live LLM API responses. This is normal.
To run a configurable evaluation:
.\.venv\Scripts\python.exe .\data_engineering\pipeline\run_evaluation.py --samples 5 --strategies baseline,iter_refine,filter,majority --max-iterations 1 --filter-samples 2 --majority-samples 3
You can change --samples 5 to a higher number for a larger evaluation. Larger runs take more time and may hit free API rate limits.
Run once through the scheduler:
.\.venv\Scripts\python.exe .\data_engineering\pipeline\scheduler.py --mode once --samples 5
Run continuously every day at 09:00:
.\.venv\Scripts\python.exe .\data_engineering\pipeline\scheduler.py --mode continuous --time 09:00 --samples 5
After running the pipeline, the following files are generated.
| Output | Location |
|---|---|
| SQLite database | data_engineering/data/reasoning_traces.db |
| Complete traces | data_engineering/data/exports/traces.csv |
| Reasoning steps | data_engineering/data/exports/steps.csv |
| Critic feedback | data_engineering/data/exports/critiques.csv |
| Daily metrics | data_engineering/data/exports/daily_metrics.csv |
| Evaluation reports | data_engineering/data/reports/ |
| Strategy result CSVs | data_engineering/data/ |
The following is a representative one-sample validation run. It demonstrates that the full pipeline works end to end. Because it covers a single sample, it should be treated as functional validation rather than a benchmark.
| Strategy | Samples | Correct | Accuracy | Avg Latency |
|---|---|---|---|---|
| baseline | 1 | 1 | 100.0% | 19304.47 ms |
| iter_refine | 1 | 1 | 100.0% | 81151.47 ms |
| filter | 1 | 1 | 100.0% | 71005.80 ms |
| majority | 1 | 1 | 100.0% | 66749.56 ms |
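
A summary table like the one above can be derived from per-trace rows with a small aggregation. The sketch below assumes an export with `strategy`, `is_correct`, and `latency_ms` columns; these column names are hypothetical, and the rows are toy values for illustration, not results from the pipeline.

```python
import csv
import io
from collections import defaultdict


def summarize(rows):
    """Aggregate sample count, correct count, accuracy, and mean latency per strategy."""
    totals = defaultdict(lambda: {"samples": 0, "correct": 0, "latency": 0.0})
    for row in rows:
        bucket = totals[row["strategy"]]
        bucket["samples"] += 1
        bucket["correct"] += int(row["is_correct"])
        bucket["latency"] += float(row["latency_ms"])
    return {
        name: {
            "samples": b["samples"],
            "correct": b["correct"],
            "accuracy": b["correct"] / b["samples"],
            "avg_latency_ms": b["latency"] / b["samples"],
        }
        for name, b in totals.items()
    }


# Toy rows with assumed column names; not real evaluation data.
TOY_CSV = """strategy,is_correct,latency_ms
baseline,1,100.0
baseline,0,300.0
majority,1,900.0
"""

summary = summarize(csv.DictReader(io.StringIO(TOY_CSV)))
for name, stats in summary.items():
    print(f"{name}: {stats['accuracy']:.1%} accuracy, {stats['avg_latency_ms']:.2f} ms average")
```

To run it against a real export, replace the `io.StringIO` source with an open file handle for `traces.csv` and adjust the column names to match the actual header.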
- Open data_engineering/data/reports/evaluation_report_*.md for the summary table.
- Open data_engineering/data/exports/traces.csv to show complete reasoning traces.
- Open data_engineering/data/exports/steps.csv to show step-by-step reasoning.
- Open data_engineering/data/exports/critiques.csv to show critic feedback.
- Open data_engineering/data/reasoning_traces.db in a SQLite viewer to show the database tables.
- Show the terminal output after running run_once.ps1 to prove end-to-end execution.
This project is not only a prompt engineering experiment. It converts Critic-CoT reasoning into a data engineering system. Each reasoning attempt becomes a trace record. Each reasoning step becomes a row in the steps table. Each critic response becomes a row in the critiques table. Each run produces measurable accuracy, latency, token, and cost metrics.
The main data engineering contribution is repeatability and observability. The system can be rerun with different sample sizes, strategies, and schedules. The results can be stored, exported, analyzed, and compared across runs.
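
The idea that each reasoning step becomes a row can be sketched as follows. This is a hypothetical illustration: `split_steps` and `to_step_rows` are not names from the project, the row keys are assumed, and splitting on line breaks is only one plausible way to segment a chain of thought.

```python
import re

# Leading markers such as "Step 1:", "2." or "-" that are not part of the step text.
# The numbered and bullet forms require trailing whitespace so that a line starting
# with "3.5 * 2" or "-5 + 3" keeps its number intact.
_MARKER = re.compile(r"^\s*(?:step\s*\d+\s*[:.)]\s*|\d+[.)]\s+|[-*]\s+)", re.IGNORECASE)


def split_steps(reasoning):
    """Split reasoning text into non-empty steps, one per line, without list markers."""
    steps = []
    for line in reasoning.splitlines():
        text = _MARKER.sub("", line).strip()
        if text:
            steps.append(text)
    return steps


def to_step_rows(trace_id, iteration, reasoning):
    """Build one dictionary per step, ready to insert into a steps table."""
    return [
        {"trace_id": trace_id, "iteration": iteration, "step_index": index, "content": text}
        for index, text in enumerate(split_steps(reasoning), start=1)
    ]


rows = to_step_rows(7, 0, "Step 1: 16 - 3 = 13\n\n2. 13 * 2 = 26\n- The answer is 26")
for row in rows:
    print(row)
```

Storing `iteration` with each step makes it possible to compare the reasoning before and after a critic-driven revision of the same trace.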
The Word report is included in:
submission/Project_Report_Critic_CoT_Data_Engineering.docx
The report contains the project explanation, abstract, methodology, architecture, data engineering design, database schema, results, run instructions, limitations, and conclusion.
Do not upload secrets or local runtime folders.
The following files and folders should not be committed:
- config.py
- .env
- .venv/
- venv/
- .tmp/
- .virtualenv-app-data/
- __pycache__/
- data_engineering/data/cache/
Before pushing to GitHub, check:
git status
The status output should not show config.py, .env, .venv, venv, or cache folders.
- Free OpenRouter models may be slower or rate-limited.
- LLM responses can vary between runs.
- Small sample runs are for demonstration, not final benchmark claims.
- Larger evaluations require more API time and stable model availability.
- The current pipeline is focused on GSM8K-style mathematical reasoning.
- Run larger evaluations with 20, 50, or 100 samples.
- Add a dashboard using Streamlit, Power BI, or Tableau.
- Compare multiple LLM models using the same pipeline.
- Add detailed error categories such as arithmetic error, missing reasoning step, and final answer extraction error.
- Store results in PostgreSQL or cloud storage.
- Deploy the scheduler as a daily monitoring job.
The project successfully transforms Critic-CoT from a notebook-based reasoning method into a complete data-engineered evaluation system. It automates dataset ingestion, LLM reasoning, critic-based checking, structured storage, CSV export, report generation, and scheduled execution. This makes LLM reasoning more transparent, repeatable, measurable, and suitable for production-style evaluation.