Critic-CoT Reasoning Pipeline with Data Engineering

GitHub Title

GitHub Description

A data-engineered Critic-CoT system that runs LLM reasoning strategies on GSM8K, stores reasoning traces in SQLite, exports CSV files, and generates automated evaluation reports for accuracy, latency, token usage, and critique analysis.

Overview

Large Language Models can solve reasoning problems using chain-of-thought reasoning, but the generated reasoning is not always reliable. A model may produce a final answer while making hidden mistakes in intermediate steps.

This project implements the Critic-CoT approach and extends it into a complete data engineering pipeline. Instead of keeping reasoning as plain text inside a notebook, the project stores reasoning traces, steps, critiques, revisions, latency, token usage, and accuracy as structured data.

The final system can ingest GSM8K samples, run multiple reasoning strategies through OpenRouter, save results to SQLite, export CSV files, generate reports, and support scheduled evaluation runs.

Project Explanation

Chain-of-thought reasoning lets a Large Language Model show its intermediate steps, but those steps can be inconsistent or logically flawed even when the final answer looks plausible. The Critic-CoT approach introduces a critic model that evaluates and improves the reasoning process. This project implements the Critic-CoT methodology and extends it into a data-engineered system in which reasoning steps, critiques, and revisions are treated as structured data rather than plain text.

Data engineering pipelines automate ingestion, processing, iteration, and storage of reasoning traces, which makes the reasoning process repeatable, traceable, and observable. Automation removes manual intervention between stages and reduces the chance of unnoticed errors. Stored metrics and evaluation reports are used to analyze reasoning quality across multiple runs. In this way the project connects a research-level reasoning method with system-level engineering practice and shows how LLM reasoning can be made more reliable and easier to evaluate in a production-style setting.
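The critique-and-revise loop described above can be sketched as follows. This is a minimal illustration, not the code in `critic_cot_wrapper.py`: `generate`, `critique`, and `refine` are hypothetical callables standing in for LLM calls, and the `Critique` field names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    has_error: bool             # did the critic flag a mistake?
    error_step: Optional[int]   # 1-based index of the first flawed step, if any
    feedback: str               # critic's explanation


def iterative_refine(
    question: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    refine: Callable[[str, str, Critique], str],
    max_iterations: int = 1,
) -> tuple[str, list[Critique]]:
    """Generate an answer, then critique and revise it up to max_iterations times."""
    answer = generate(question)
    history: list[Critique] = []
    for _ in range(max_iterations):
        verdict = critique(question, answer)
        history.append(verdict)
        if not verdict.has_error:
            break  # the critic accepts the reasoning, so stop early
        answer = refine(question, answer, verdict)
    return answer, history


# Toy stand-ins for the LLM calls, used only to exercise the loop.
def fake_generate(question: str) -> str:
    return "Step 1: 2 + 3 = 6. Final answer: 6"


def fake_critique(question: str, answer: str) -> Critique:
    wrong = "= 6" in answer
    return Critique(wrong, 1 if wrong else None, "2 + 3 is 5" if wrong else "ok")


def fake_refine(question: str, answer: str, verdict: Critique) -> str:
    return "Step 1: 2 + 3 = 5. Final answer: 5"


final, critiques = iterative_refine(
    "What is 2 + 3?", fake_generate, fake_critique, fake_refine, max_iterations=2
)
```

Returning the critique history alongside the answer is what allows each critic response to be stored as its own record later.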

Key Features

Loads GSM8K math reasoning data from HuggingFace.
Caches and normalizes samples for repeatable processing.
Runs four reasoning strategies: baseline, iterative refinement, critic as filter, and majority vote.
Uses OpenRouter API for LLM calls.
Extracts final answers from generated reasoning.
Verifies arithmetic expressions when possible.
Stores complete reasoning traces in SQLite.
Stores individual reasoning steps and critic feedback separately.
Exports traces, steps, critiques, and metrics to CSV.
Generates evaluation reports in CSV, JSON, and Markdown.
Supports one-command execution through PowerShell.
Supports scheduled one-time or continuous runs.
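As an illustration of the answer-extraction and arithmetic-verification features listed above, here is a minimal sketch. It assumes the final answer is the last number in the text (GSM8K reference solutions mark it with `####`), and that a reasoning step can be reduced to a single `left = right` claim. The function names are hypothetical and are not taken from the repository.

```python
import ast
import operator
import re
from typing import Optional

_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def extract_final_answer(text: str) -> Optional[str]:
    """Return the last number in the text, without thousands separators."""
    marker = text.rfind("####")
    tail = text[marker + 4:] if marker != -1 else text
    matches = _NUMBER.findall(tail)
    return matches[-1].replace(",", "") if matches else None


def _evaluate(node: ast.AST) -> float:
    """Evaluate a small arithmetic AST without calling eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def check_equation(equation: str, tol: float = 1e-6) -> Optional[bool]:
    """Check a claim such as '16 - 3 - 4 = 9'; return None if it cannot be parsed."""
    if equation.count("=") != 1:
        return None
    left, right = equation.split("=")
    try:
        lhs = _evaluate(ast.parse(left.strip(), mode="eval").body)
        rhs = _evaluate(ast.parse(right.strip(), mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None
    return abs(lhs - rhs) <= tol
```

Returning `None` for unparseable input matches the "when possible" wording: a step that is not plain arithmetic is left unverified rather than marked wrong.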

System Architecture

GSM8K Dataset
    |
    v
Data Ingestion Layer
    |
    v
Standardized Question Format
    |
    v
Critic-CoT Wrapper
    |
    v
Reasoning Strategies
    |
    v
SQLite Storage
    |
    v
CSV Exports and Evaluation Reports
    |
    v
Faculty Review / Analysis / Visualization

Reasoning Strategies

Strategy	Description
`baseline`	Generates one direct chain-of-thought answer and checks the final answer.
`iter_refine`	Generates an answer, critiques it, and refines the reasoning for a fixed number of iterations.
`filter`	Generates multiple candidate answers and uses critic feedback to select the strongest one.
`majority`	Generates multiple answers and selects the most frequent normalized final answer.
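The `majority` strategy depends on normalizing answers before counting them, so that `18`, `18.0`, and `$18` count as the same vote. The sketch below shows one way to do that. The normalization rules and function names are assumptions for illustration, not the project's exact logic.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation
from typing import Iterable, Optional


def normalize_answer(raw: str) -> Optional[str]:
    """Canonicalize a numeric answer; return None if it is not a number."""
    cleaned = raw.strip().replace(",", "").lstrip("$").rstrip("%.")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # normalize() drops trailing zeros; the "f" format avoids exponent notation.
    return format(value.normalize(), "f")


def majority_vote(answers: Iterable[str]) -> Optional[str]:
    """Return the most frequent normalized answer, or None if none are numeric."""
    normalized = [n for n in map(normalize_answer, answers) if n is not None]
    if not normalized:
        return None
    # Among tied answers, Counter.most_common keeps first-seen order.
    return Counter(normalized).most_common(1)[0][0]


winner = majority_vote(["18", "18.0", "$20", "18."])
```

Non-numeric candidates are dropped before voting, so a failed extraction does not count as a vote.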

Data Engineering Components

Component	File	Purpose
Storage Layer	`data_engineering/storage/reasoning_db.py`	Creates SQLite tables and stores traces, steps, critiques, and daily metrics.
Data Ingestion	`data_engineering/pipeline/data_ingestion.py`	Loads GSM8K, caches data, and converts records into pipeline format.
Critic-CoT Wrapper	`data_engineering/pipeline/critic_cot_wrapper.py`	Handles OpenRouter calls, answer extraction, verification, critique, refinement, and strategies.
Main Pipeline	`data_engineering/pipeline/simple_pipeline.py`	Runs strategies, stores outputs, exports CSV files, and tracks metrics.
Evaluation Runner	`data_engineering/pipeline/run_evaluation.py`	Provides CLI support for sample size, strategy selection, and automated reports.
Scheduler	`data_engineering/pipeline/scheduler.py`	Runs daily or one-time evaluations and logs results to CSV.
Configuration	`data_engineering/config/settings.py`	Loads API key, model, paths, and runtime settings safely.

Database Design

The project uses SQLite as the storage layer. The database contains four main tables.

Table	Description
`traces`	Stores complete reasoning traces, final answers, correctness, latency, tokens, cost, and raw trace JSON.
`steps`	Stores extracted reasoning steps for each trace and iteration.
`critiques`	Stores critic feedback, detected error step, error flag, and verification result.
`daily_metrics`	Stores aggregated metrics for scheduled or repeated runs.
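A possible shape for these four tables is sketched below with Python's built-in `sqlite3` module. The column names and types are assumptions inferred from the table descriptions. The actual schema is defined in `reasoning_db.py` and may differ. The inserted rows are placeholders, not results.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS traces (
    trace_id      INTEGER PRIMARY KEY,
    question_id   TEXT NOT NULL,
    strategy      TEXT NOT NULL,
    final_answer  TEXT,
    is_correct    INTEGER,
    latency_ms    REAL,
    total_tokens  INTEGER,
    cost_usd      REAL,
    raw_trace     TEXT
);
CREATE TABLE IF NOT EXISTS steps (
    step_id     INTEGER PRIMARY KEY,
    trace_id    INTEGER NOT NULL REFERENCES traces(trace_id),
    iteration   INTEGER NOT NULL,
    step_index  INTEGER NOT NULL,
    content     TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS critiques (
    critique_id  INTEGER PRIMARY KEY,
    trace_id     INTEGER NOT NULL REFERENCES traces(trace_id),
    iteration    INTEGER NOT NULL,
    feedback     TEXT,
    error_step   INTEGER,
    has_error    INTEGER,
    verified     INTEGER
);
CREATE TABLE IF NOT EXISTS daily_metrics (
    run_date        TEXT NOT NULL,
    strategy        TEXT NOT NULL,
    samples         INTEGER,
    accuracy        REAL,
    avg_latency_ms  REAL,
    PRIMARY KEY (run_date, strategy)
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
conn.executescript(SCHEMA)

# Placeholder trace with two reasoning steps.
cur = conn.execute(
    "INSERT INTO traces (question_id, strategy, final_answer, is_correct, raw_trace) "
    "VALUES (?, ?, ?, ?, ?)",
    ("q-0", "baseline", "18", 1, json.dumps({"steps": ["16 - 3 - 4 = 9", "9 * 2 = 18"]})),
)
trace_id = cur.lastrowid
conn.executemany(
    "INSERT INTO steps (trace_id, iteration, step_index, content) VALUES (?, ?, ?, ?)",
    [(trace_id, 0, 1, "16 - 3 - 4 = 9"), (trace_id, 0, 2, "9 * 2 = 18")],
)
conn.commit()

step_count = conn.execute(
    "SELECT COUNT(*) FROM steps WHERE trace_id = ?", (trace_id,)
).fetchone()[0]
```

Keeping steps and critiques in their own tables, keyed by `trace_id` and `iteration`, lets them be queried without parsing the raw trace JSON.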

Project Structure

project/
|-- critic_cot1.ipynb
|-- README.md
|-- run_once.ps1
|-- run_once.bat
|-- config.example.py
|-- .env.example
|-- generate_report_docx.py
|-- submission/
|   |-- Project_Report_Critic_CoT_Data_Engineering.docx
|-- data_engineering/
|   |-- README.md
|   |-- requirements.txt
|   |-- config/
|   |   |-- settings.py
|   |-- pipeline/
|   |   |-- data_ingestion.py
|   |   |-- critic_cot_wrapper.py
|   |   |-- simple_pipeline.py
|   |   |-- run_evaluation.py
|   |   |-- scheduler.py
|   |-- storage/
|   |   |-- reasoning_db.py
|   |-- data/
|   |   |-- reasoning_traces.db
|   |   |-- baseline_results.csv
|   |   |-- iter_refine_results.csv
|   |   |-- filter_results.csv
|   |   |-- majority_results.csv
|   |   |-- exports/
|   |   |   |-- traces.csv
|   |   |   |-- steps.csv
|   |   |   |-- critiques.csv
|   |   |   |-- daily_metrics.csv
|   |   |-- reports/
|   |       |-- evaluation_report_*.csv
|   |       |-- evaluation_report_*.json
|   |       |-- evaluation_report_*.md

Setup

1. Clone the repository

git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME

2. Create a virtual environment

python -m venv .venv
.\.venv\Scripts\Activate.ps1

3. Install dependencies

pip install -r .\data_engineering\requirements.txt

4. Add OpenRouter API key

Use one of the following methods.

Method 1: Environment variable

$env:OPENROUTER_API_KEY="your-openrouter-api-key"

Method 2: config.py

Copy-Item .\config.example.py .\config.py

Then edit config.py and add your API key.

Important: Do not upload config.py to GitHub. It is ignored by .gitignore.
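The two methods above suggest a lookup order: environment variable first, then `config.py`. A minimal sketch of that order is shown here. It assumes `config.py` exposes a module-level `OPENROUTER_API_KEY` string, which is a guess. The real loader lives in `data_engineering/config/settings.py`.

```python
import importlib
import os
from typing import Mapping, Optional


def load_api_key(
    env: Mapping[str, str] = os.environ,
    config_module: str = "config",
) -> Optional[str]:
    """Return the API key from the environment, else from config.py, else None."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if key:
        return key
    try:
        module = importlib.import_module(config_module)
    except ImportError:
        return None  # no config.py on the import path
    value = getattr(module, "OPENROUTER_API_KEY", "")  # assumed attribute name
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def mask(key: str) -> str:
    """Show only the last four characters, for safe logging."""
    return "*" * max(len(key) - 4, 0) + key[-4:]
```

Logging `mask(key)` instead of the key itself keeps the secret out of terminal output and CSV logs.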

Easy Run

For a simple one-time demo run:

powershell -ExecutionPolicy Bypass -File .\run_once.ps1

This runs one GSM8K sample using all four strategies:

baseline
iter_refine
filter
majority

The terminal may appear idle for a few minutes while the program waits for live LLM API responses. This is normal.

Manual Evaluation Command

To run a configurable evaluation:

.\.venv\Scripts\python.exe .\data_engineering\pipeline\run_evaluation.py --samples 5 --strategies baseline,iter_refine,filter,majority --max-iterations 1 --filter-samples 2 --majority-samples 3

You can change --samples 5 to a higher number for a larger evaluation. Larger runs take more time and may hit free API rate limits.
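For readers who want to see how the flags in that command could be parsed, here is a hedged `argparse` sketch. The flag names come from the command above. The defaults, the validation, and the function names are assumptions and may not match `run_evaluation.py`.

```python
import argparse

STRATEGIES = ("baseline", "iter_refine", "filter", "majority")


def parse_strategies(value: str) -> list[str]:
    """Split a comma-separated strategy list and reject unknown names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in STRATEGIES]
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown strategies: {', '.join(unknown)}")
    return names


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run a Critic-CoT evaluation")
    # Defaults mirror the example command; they are illustrative only.
    parser.add_argument("--samples", type=int, default=5)
    parser.add_argument("--strategies", type=parse_strategies, default=list(STRATEGIES))
    parser.add_argument("--max-iterations", type=int, default=1)
    parser.add_argument("--filter-samples", type=int, default=2)
    parser.add_argument("--majority-samples", type=int, default=3)
    return parser


args = build_parser().parse_args(["--samples", "5", "--strategies", "baseline,majority"])
```

Validating strategy names at parse time fails fast, before any API call is made.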

Scheduler Commands

Run once through the scheduler:

.\.venv\Scripts\python.exe .\data_engineering\pipeline\scheduler.py --mode once --samples 5

Run continuously every day at 09:00:

.\.venv\Scripts\python.exe .\data_engineering\pipeline\scheduler.py --mode continuous --time 09:00 --samples 5
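The core of a daily schedule is working out when the next run is due. The sketch below uses only the standard library. `next_run` and `seconds_until` are hypothetical helper names, and `scheduler.py` may be implemented differently, for example with a scheduling library.

```python
from datetime import datetime, timedelta


def next_run(now: datetime, at: str) -> datetime:
    """Return the next occurrence of an HH:MM time strictly after `now`."""
    hour, minute = (int(part) for part in at.split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate


def seconds_until(now: datetime, at: str) -> float:
    """Seconds to sleep before the next scheduled run."""
    return (next_run(now, at) - now).total_seconds()


morning = next_run(datetime(2024, 1, 1, 8, 30), "09:00")
rollover = next_run(datetime(2024, 1, 1, 9, 0), "09:00")
```

A continuous mode could sleep for `seconds_until(datetime.now(), "09:00")`, run the evaluation, and repeat.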

Output Files

After running the pipeline, the following files are generated.

Output	Location
SQLite database	`data_engineering/data/reasoning_traces.db`
Complete traces	`data_engineering/data/exports/traces.csv`
Reasoning steps	`data_engineering/data/exports/steps.csv`
Critic feedback	`data_engineering/data/exports/critiques.csv`
Daily metrics	`data_engineering/data/exports/daily_metrics.csv`
Evaluation reports	`data_engineering/data/reports/`
Strategy result CSVs	`data_engineering/data/`
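The CSV exports correspond one-to-one with the database tables. A minimal sketch of such an export, using the standard `sqlite3` and `csv` modules, is shown below. `export_table` is a hypothetical helper, and the demo table uses placeholder columns rather than the project's real schema.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

ALLOWED_TABLES = {"traces", "steps", "critiques", "daily_metrics"}


def export_table(conn: sqlite3.Connection, table: str, out_path: Path) -> int:
    """Write every row of `table` to a CSV file with a header; return the row count."""
    if table not in ALLOWED_TABLES:
        # Table names cannot be bound as SQL parameters, so check them explicitly.
        raise ValueError(f"unexpected table: {table}")
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    headers = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)


# Demo with an in-memory database and a temporary output folder.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traces (strategy TEXT, is_correct INTEGER)")
conn.execute("INSERT INTO traces VALUES ('baseline', 1)")
out_file = Path(tempfile.mkdtemp()) / "exports" / "traces.csv"
row_count = export_table(conn, "traces", out_file)
```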

Sample Demo Result

The following is a representative one-sample validation run. It shows that the full pipeline works end to end. Because the sample size is one, the figures should be read as functional validation, not as a benchmark.

Strategy	Samples	Correct	Accuracy	Avg Latency
baseline	1	1	100.0%	19304.47 ms
iter_refine	1	1	100.0%	81151.47 ms
filter	1	1	100.0%	71005.80 ms
majority	1	1	100.0%	66749.56 ms
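The columns in this table can be derived from per-trace records with a simple group-by. The sketch below assumes each record carries `strategy`, `is_correct`, and `latency_ms` keys, which are hypothetical field names. The two demo records reuse the baseline and iter_refine rows from the table above.

```python
from collections import defaultdict
from statistics import mean


def summarize(records: list[dict]) -> dict[str, dict]:
    """Group trace records by strategy and compute accuracy and mean latency."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        grouped[record["strategy"]].append(record)

    summary = {}
    for strategy, rows in grouped.items():
        correct = sum(1 for row in rows if row["is_correct"])
        summary[strategy] = {
            "samples": len(rows),
            "correct": correct,
            "accuracy_pct": round(100.0 * correct / len(rows), 1),
            "avg_latency_ms": round(mean(row["latency_ms"] for row in rows), 2),
        }
    return summary


records = [
    {"strategy": "baseline", "is_correct": True, "latency_ms": 19304.47},
    {"strategy": "iter_refine", "is_correct": True, "latency_ms": 81151.47},
]
report = summarize(records)
```

With one sample per strategy, accuracy can only be 0% or 100%, which is why larger runs are needed before comparing strategies.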

How to View or Present the Results

Open data_engineering/data/reports/evaluation_report_*.md for the summary table.
Open data_engineering/data/exports/traces.csv to show complete reasoning traces.
Open data_engineering/data/exports/steps.csv to show step-by-step reasoning.
Open data_engineering/data/exports/critiques.csv to show critic feedback.
Open data_engineering/data/reasoning_traces.db in a SQLite viewer to show the database tables.
Show the terminal output after running run_once.ps1 to demonstrate end-to-end execution.

Faculty Explanation

This project is not only a prompt engineering experiment. It converts Critic-CoT reasoning into a data engineering system. Each reasoning attempt becomes a trace record. Each reasoning step becomes a row in the steps table. Each critic response becomes a row in the critiques table. Each run produces measurable accuracy, latency, token, and cost metrics.

The main data engineering contribution is repeatability and observability. The system can be rerun with different sample sizes, strategies, and schedules. The results can be stored, exported, analyzed, and compared across runs.

Report

The Word report is included in:

submission/Project_Report_Critic_CoT_Data_Engineering.docx

The report contains the project explanation, abstract, methodology, architecture, data engineering design, database schema, results, run instructions, limitations, and conclusion.

GitHub Upload Safety

Do not upload secrets or local runtime folders.

The following files and folders should not be committed:

config.py
.env
.venv/
venv/
.tmp/
.virtualenv-app-data/
__pycache__/
data_engineering/data/cache/

Before pushing to GitHub, check:

git status

The status output should not show config.py, .env, .venv, venv, or cache folders.

Limitations

Free OpenRouter models may be slower or rate-limited.
LLM responses can vary between runs.
Small sample runs are for demonstration, not final benchmark claims.
Larger evaluations require more API time and stable model availability.
The current pipeline is focused on GSM8K-style mathematical reasoning.

Future Enhancements

Run larger evaluations with 20, 50, or 100 samples.
Add a dashboard using Streamlit, Power BI, or Tableau.
Compare multiple LLM models using the same pipeline.
Add detailed error categories such as arithmetic error, missing reasoning step, and final answer extraction error.
Store results in PostgreSQL or cloud storage.
Deploy the scheduler as a daily monitoring job.

Conclusion

The project successfully transforms Critic-CoT from a notebook-based reasoning method into a complete data-engineered evaluation system. It automates dataset ingestion, LLM reasoning, critic-based checking, structured storage, CSV export, report generation, and scheduled execution. This makes LLM reasoning more transparent, repeatable, measurable, and suitable for production-style evaluation.
