Madhumasa84/critic-cot-implementation

Critic-CoT Reasoning Pipeline with Data Engineering

GitHub Description

A data-engineered Critic-CoT system that runs LLM reasoning strategies on GSM8K, stores reasoning traces in SQLite, exports CSV files, and generates automated evaluation reports for accuracy, latency, token usage, and critique analysis.

Overview

Large Language Models can solve reasoning problems using chain-of-thought reasoning, but the generated reasoning is not always reliable. A model may produce a final answer while making hidden mistakes in intermediate steps.

This project implements the Critic-CoT approach and extends it into a complete data engineering pipeline. Instead of keeping reasoning as plain text inside a notebook, the project stores reasoning traces, steps, critiques, revisions, latency, token usage, and accuracy as structured data.

The final system can ingest GSM8K samples, run multiple reasoning strategies through OpenRouter, save results to SQLite, export CSV files, generate reports, and support scheduled evaluation runs.

Project Explanation

Chain-of-thought reasoning produced by Large Language Models is often inconsistent or logically flawed. The Critic-CoT approach introduces a critic model that evaluates the reasoning and guides its revision. This project implements that methodology and extends it into a data-engineered system in which reasoning steps, critiques, and revisions are treated as structured data rather than plain text.

Data engineering pipelines automate the ingestion, processing, iteration, and storage of reasoning traces, so each run can be repeated, traced, and inspected without manual intervention. Stored metrics and generated reports make it possible to compare reasoning quality across runs. In this way the project connects research-level reasoning methods with system-level engineering practice and shows how LLM reasoning can be made more reliable and better suited to production-style evaluation.

Key Features

  • Loads GSM8K math reasoning data from HuggingFace.
  • Caches and normalizes samples for repeatable processing.
  • Runs four reasoning strategies: baseline, iterative refinement, critic as filter, and majority vote.
  • Uses OpenRouter API for LLM calls.
  • Extracts final answers from generated reasoning.
  • Verifies arithmetic expressions when possible.
  • Stores complete reasoning traces in SQLite.
  • Stores individual reasoning steps and critic feedback separately.
  • Exports traces, steps, critiques, and metrics to CSV.
  • Generates evaluation reports in CSV, JSON, and Markdown.
  • Supports one-command execution through PowerShell.
  • Supports scheduled one-time or continuous runs.
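
The answer-extraction and arithmetic-verification features can be illustrated with a minimal sketch. The function names, the "last number in the text" heuristic, and the `a op b = c` pattern below are assumptions made for illustration; they are not taken from critic_cot_wrapper.py, which may use different rules.

```python
import ast
import operator
import re

# Binary operators the checker is willing to evaluate (assumed set).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

# A number, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"(?<!\d)-?\d[\d,]*(?:\.\d+)?")

# "a op b [op c ...] = result", for example "48 / 2 = 24".
_EQUATION = re.compile(
    r"(\d+(?:\.\d+)?(?:\s*[-+*/]\s*\d+(?:\.\d+)?)+)\s*=\s*(-?\d+(?:\.\d+)?)"
)


def extract_final_answer(text):
    """Return the last number in the text as a normalized string, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    return str(int(value)) if value.is_integer() else str(value)


def _evaluate(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def check_arithmetic(text, tol=1e-6):
    """Return (expression, claimed_value, is_consistent) for each equation found."""
    results = []
    for expr, claimed in _EQUATION.findall(text):
        try:
            actual = _evaluate(ast.parse(expr, mode="eval").body)
            ok = abs(actual - float(claimed)) <= tol
        except (SyntaxError, ValueError, ZeroDivisionError):
            ok = False
        results.append((expr, float(claimed), ok))
    return results
```

Parsing with `ast` rather than calling `eval` keeps the check limited to plain arithmetic, which matters when the text being checked comes from a model.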

System Architecture

GSM8K Dataset
    |
    v
Data Ingestion Layer
    |
    v
Standardized Question Format
    |
    v
Critic-CoT Wrapper
    |
    v
Reasoning Strategies
    |
    v
SQLite Storage
    |
    v
CSV Exports and Evaluation Reports
    |
    v
Faculty Review / Analysis / Visualization

Reasoning Strategies

| Strategy | Description |
| --- | --- |
| baseline | Generates one direct chain-of-thought answer and checks the final answer. |
| iter_refine | Generates an answer, critiques it, and refines the reasoning for a fixed number of iterations. |
| filter | Generates multiple candidate answers and uses critic feedback to select the strongest one. |
| majority | Generates multiple answers and selects the most frequent normalized final answer. |
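
As a rough sketch of two of these strategies, the code below shows a majority vote over normalized answers and an iterative refinement loop. The callables `generate`, `critique`, and `refine`, the `has_error` key, and the early stop when the critic reports no error are all hypothetical stand-ins; the real wrapper's interfaces may differ.

```python
from collections import Counter


def normalize(answer):
    """Normalize a numeric answer string so '18', '18.0', and ' 18 ' compare equal."""
    try:
        value = float(str(answer).replace(",", "").replace("$", "").strip())
    except ValueError:
        return None
    return str(int(value)) if value.is_integer() else str(value)


def majority_vote(answers):
    """Return the most frequent non-empty answer; ties go to the earliest seen."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def iterative_refine(question, generate, critique, refine, max_iterations=1):
    """Generate an answer, then critique and refine it up to max_iterations times.

    Stopping early when the critic finds no error is an assumption of this sketch.
    """
    answer = generate(question)
    history = [answer]
    for _ in range(max_iterations):
        feedback = critique(question, answer)
        if not feedback["has_error"]:
            break
        answer = refine(question, answer, feedback)
        history.append(answer)
    return answer, history
```

Passing the model calls in as functions keeps the control flow testable without any API access.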

Data Engineering Components

| Component | File | Purpose |
| --- | --- | --- |
| Storage Layer | data_engineering/storage/reasoning_db.py | Creates SQLite tables and stores traces, steps, critiques, and daily metrics. |
| Data Ingestion | data_engineering/pipeline/data_ingestion.py | Loads GSM8K, caches data, and converts records into pipeline format. |
| Critic-CoT Wrapper | data_engineering/pipeline/critic_cot_wrapper.py | Handles OpenRouter calls, answer extraction, verification, critique, refinement, and strategies. |
| Main Pipeline | data_engineering/pipeline/simple_pipeline.py | Runs strategies, stores outputs, exports CSV files, and tracks metrics. |
| Evaluation Runner | data_engineering/pipeline/run_evaluation.py | Provides CLI support for sample size, strategy selection, and automated reports. |
| Scheduler | data_engineering/pipeline/scheduler.py | Runs daily or one-time evaluations and logs results to CSV. |
| Configuration | data_engineering/config/settings.py | Loads API key, model, paths, and runtime settings safely. |
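
The ingestion step (load, cache, convert to pipeline format) might look roughly like the sketch below. GSM8K records carry a `question` field and an `answer` field whose last line has the form `#### <number>`; everything else here, including the output field names, the cache layout, and the `fetch_records` callable, is an assumption for illustration. In a real run `fetch_records` would wrap the HuggingFace `datasets` loader; the demo uses an in-memory stand-in so that it needs no network access.

```python
import json
import tempfile
from pathlib import Path


def normalize_record(record, index):
    """Split a GSM8K record into question, reference rationale, and final answer."""
    rationale, _, final = record["answer"].partition("####")
    return {
        "question_id": f"gsm8k-{index}",  # hypothetical id scheme
        "question": record["question"].strip(),
        "reference_rationale": rationale.strip(),
        "reference_answer": final.strip().replace(",", ""),
    }


def load_or_build_cache(cache_path, fetch_records):
    """Return cached samples if present; otherwise fetch, normalize, and cache them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    samples = [normalize_record(r, i) for i, r in enumerate(fetch_records())]
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(samples, indent=2), encoding="utf-8")
    return samples


# Demo with a made-up record in GSM8K's format (not real dataset content).
raw = [{"question": " How many eggs are left? ", "answer": "16 - 3 - 4 = 9\n#### 9"}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "gsm8k.json"
    first = load_or_build_cache(path, lambda: raw)
    second = load_or_build_cache(path, lambda: [])  # served from the cache file
```

The second call returns the same samples even though its fetcher yields nothing, which is the repeatability property the cache is meant to provide.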

Database Design

The project uses SQLite as the storage layer. The database contains four main tables.

| Table | Description |
| --- | --- |
| traces | Stores complete reasoning traces, final answers, correctness, latency, tokens, cost, and raw trace JSON. |
| steps | Stores extracted reasoning steps for each trace and iteration. |
| critiques | Stores critic feedback, detected error step, error flag, and verification result. |
| daily_metrics | Stores aggregated metrics for scheduled or repeated runs. |
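
One possible shape for these four tables is sketched below using Python's built-in `sqlite3` module. The table names come from this README, but every column name and type is an assumption inferred from the descriptions above; the actual schema in reasoning_db.py may differ.

```python
import sqlite3

# Hypothetical schema: column names are illustrative, not the project's own.
SCHEMA = """
CREATE TABLE IF NOT EXISTS traces (
    trace_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    question_id    TEXT NOT NULL,
    strategy       TEXT NOT NULL,
    final_answer   TEXT,
    is_correct     INTEGER,
    latency_ms     REAL,
    total_tokens   INTEGER,
    cost_usd       REAL,
    raw_trace_json TEXT
);
CREATE TABLE IF NOT EXISTS steps (
    step_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    trace_id   INTEGER NOT NULL REFERENCES traces(trace_id),
    iteration  INTEGER NOT NULL,
    step_index INTEGER NOT NULL,
    step_text  TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS critiques (
    critique_id         INTEGER PRIMARY KEY AUTOINCREMENT,
    trace_id            INTEGER NOT NULL REFERENCES traces(trace_id),
    iteration           INTEGER NOT NULL,
    feedback            TEXT,
    error_step          INTEGER,
    has_error           INTEGER,
    verification_passed INTEGER
);
CREATE TABLE IF NOT EXISTS daily_metrics (
    run_date       TEXT NOT NULL,
    strategy       TEXT NOT NULL,
    samples        INTEGER,
    accuracy       REAL,
    avg_latency_ms REAL,
    PRIMARY KEY (run_date, strategy)
);
"""


def init_db(path=":memory:"):
    """Open the database and create the tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def insert_trace(conn, question_id, strategy, final_answer, is_correct, latency_ms, steps):
    """Insert one trace and its first-iteration steps; return the new trace id."""
    cur = conn.execute(
        "INSERT INTO traces (question_id, strategy, final_answer, is_correct, latency_ms) "
        "VALUES (?, ?, ?, ?, ?)",
        (question_id, strategy, final_answer, int(is_correct), latency_ms),
    )
    trace_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO steps (trace_id, iteration, step_index, step_text) VALUES (?, 0, ?, ?)",
        [(trace_id, i, text) for i, text in enumerate(steps)],
    )
    conn.commit()
    return trace_id
```

Keeping steps and critiques in their own tables, keyed by `trace_id`, allows queries such as "which step index does the critic flag most often" without parsing the raw trace JSON.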

Project Structure

project/
|-- critic_cot1.ipynb
|-- README.md
|-- run_once.ps1
|-- run_once.bat
|-- config.example.py
|-- .env.example
|-- generate_report_docx.py
|-- submission/
|   |-- Project_Report_Critic_CoT_Data_Engineering.docx
|-- data_engineering/
|   |-- README.md
|   |-- requirements.txt
|   |-- config/
|   |   |-- settings.py
|   |-- pipeline/
|   |   |-- data_ingestion.py
|   |   |-- critic_cot_wrapper.py
|   |   |-- simple_pipeline.py
|   |   |-- run_evaluation.py
|   |   |-- scheduler.py
|   |-- storage/
|   |   |-- reasoning_db.py
|   |-- data/
|   |   |-- reasoning_traces.db
|   |   |-- baseline_results.csv
|   |   |-- iter_refine_results.csv
|   |   |-- filter_results.csv
|   |   |-- majority_results.csv
|   |   |-- exports/
|   |   |   |-- traces.csv
|   |   |   |-- steps.csv
|   |   |   |-- critiques.csv
|   |   |   |-- daily_metrics.csv
|   |   |-- reports/
|   |       |-- evaluation_report_*.csv
|   |       |-- evaluation_report_*.json
|   |       |-- evaluation_report_*.md

Setup

1. Clone the repository

git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME

2. Create a virtual environment

python -m venv .venv
.\.venv\Scripts\Activate.ps1

3. Install dependencies

pip install -r .\data_engineering\requirements.txt

4. Add OpenRouter API key

Use one of the following methods.

Method 1: Environment variable

$env:OPENROUTER_API_KEY="your-openrouter-api-key"

Method 2: config.py

Copy-Item .\config.example.py .\config.py

Then edit config.py and add your API key.

Important: Do not upload config.py to GitHub. It is ignored by .gitignore.
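
The two methods above suggest a lookup order like the one sketched here: environment variable first, then config.py. The `OPENROUTER_API_KEY` environment variable name comes from this README; the attribute of the same name inside config.py and the function names are assumptions, so check config.example.py for the real variable name.

```python
import importlib
import os


def load_api_key(env=None, config_module="config"):
    """Return the API key from the environment, falling back to a config module."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if key:
        return key
    try:
        module = importlib.import_module(config_module)
    except ImportError:
        module = None
    # Assumed attribute name; the real config.py may name it differently.
    key = getattr(module, "OPENROUTER_API_KEY", "") if module else ""
    if not key:
        raise RuntimeError(
            "No API key found: set OPENROUTER_API_KEY or create config.py"
        )
    return key


def auth_headers(api_key):
    """Build bearer-token headers for an HTTP API request."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing with a clear error when no key is found is preferable to sending an unauthenticated request and debugging the API's response.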

Easy Run

For a simple one-time demo run:

powershell -ExecutionPolicy Bypass -File .\run_once.ps1

This runs one GSM8K sample using all four strategies:

  • baseline
  • iter_refine
  • filter
  • majority

The terminal may appear idle for a few minutes while the program waits for live LLM API responses. This is normal.

Manual Evaluation Command

To run a configurable evaluation:

.\.venv\Scripts\python.exe .\data_engineering\pipeline\run_evaluation.py --samples 5 --strategies baseline,iter_refine,filter,majority --max-iterations 1 --filter-samples 2 --majority-samples 3

You can change --samples 5 to a higher number for a larger evaluation. Larger runs take more time and may hit free API rate limits.
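
For readers who want to see how such flags could be parsed, here is a minimal `argparse` sketch. The flag names match the command above; the default values and the validation of strategy names are assumptions, not the behavior of run_evaluation.py.

```python
import argparse

STRATEGIES = ("baseline", "iter_refine", "filter", "majority")


def parse_strategies(value):
    """Turn 'baseline,majority' into a validated list of strategy names."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in STRATEGIES]
    if not names or unknown:
        raise argparse.ArgumentTypeError(
            f"expected a comma-separated subset of {', '.join(STRATEGIES)}"
        )
    return names


def build_parser():
    """Build a parser for the evaluation flags (defaults are illustrative)."""
    parser = argparse.ArgumentParser(description="Run a Critic-CoT evaluation.")
    parser.add_argument("--samples", type=int, default=5)
    parser.add_argument("--strategies", type=parse_strategies, default=list(STRATEGIES))
    parser.add_argument("--max-iterations", type=int, default=1)
    parser.add_argument("--filter-samples", type=int, default=2)
    parser.add_argument("--majority-samples", type=int, default=3)
    return parser
```

Validating strategy names at parse time rejects a typo before any API calls are made.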

Scheduler Commands

Run once through the scheduler:

.\.venv\Scripts\python.exe .\data_engineering\pipeline\scheduler.py --mode once --samples 5

Run continuously every day at 09:00:

.\.venv\Scripts\python.exe .\data_engineering\pipeline\scheduler.py --mode continuous --time 09:00 --samples 5
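
The continuous mode can be pictured as a loop that sleeps until the next occurrence of the configured time and then runs the job. The sketch below is one way to write that with the standard library; the function names and the injected `sleep` and `now` parameters are assumptions for illustration and say nothing about how scheduler.py is actually implemented.

```python
import time
from datetime import datetime, timedelta


def next_run(now, hhmm):
    """Return the next datetime strictly after `now` that falls on HH:MM."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def run_continuously(job, hhmm, sleep=time.sleep, now=datetime.now, max_runs=None):
    """Sleep until each scheduled time, then call job(); stop after max_runs if set."""
    runs = 0
    while max_runs is None or runs < max_runs:
        current = now()
        wait_seconds = (next_run(current, hhmm) - current).total_seconds()
        sleep(wait_seconds)
        job()
        runs += 1
```

Injecting `sleep` and `now` lets the loop be exercised in tests without waiting for real time to pass.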

Output Files

After running the pipeline, the following files are generated.

| Output | Location |
| --- | --- |
| SQLite database | data_engineering/data/reasoning_traces.db |
| Complete traces | data_engineering/data/exports/traces.csv |
| Reasoning steps | data_engineering/data/exports/steps.csv |
| Critic feedback | data_engineering/data/exports/critiques.csv |
| Daily metrics | data_engineering/data/exports/daily_metrics.csv |
| Evaluation reports | data_engineering/data/reports/ |
| Strategy result CSVs | data_engineering/data/ |
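
Exporting each SQLite table to its own CSV file can be done with the standard library alone, as in the sketch below. The table names match this README; the function name and the allow-list check are assumptions, and the demo runs against a throwaway in-memory table rather than the project's database.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

TABLES = ("traces", "steps", "critiques", "daily_metrics")


def export_tables(conn, out_dir, tables=TABLES):
    """Write each table to <out_dir>/<table>.csv with a header row; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for table in tables:
        # Table names cannot be bound as SQL parameters, so restrict them to a fixed list.
        if table not in TABLES:
            raise ValueError(f"unexpected table name: {table}")
        cursor = conn.execute(f"SELECT * FROM {table}")
        header = [column[0] for column in cursor.description]
        path = out_dir / f"{table}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(cursor)
        written[table] = path
    return written


# Demo against a throwaway in-memory database and temporary directory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traces (trace_id INTEGER, strategy TEXT)")
conn.execute("INSERT INTO traces VALUES (1, 'baseline')")
with tempfile.TemporaryDirectory() as tmp:
    paths = export_tables(conn, tmp, tables=("traces",))
    exported = paths["traces"].read_text(encoding="utf-8")
```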

Sample Demo Result

The following is a representative one-sample validation run. It shows that the full pipeline works end to end. Because the sample size is one, it should be treated as functional validation rather than a benchmark.

| Strategy | Samples | Correct | Accuracy | Avg Latency (ms) |
| --- | --- | --- | --- | --- |
| baseline | 1 | 1 | 100.0% | 19304.47 |
| iter_refine | 1 | 1 | 100.0% | 81151.47 |
| filter | 1 | 1 | 100.0% | 71005.80 |
| majority | 1 | 1 | 100.0% | 66749.56 |
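
A summary like this can be computed from per-trace rows by grouping on strategy, as the sketch below shows. The row keys (`strategy`, `is_correct`, `latency_ms`) and the function names are assumptions for illustration; the project's report generator may compute and format these metrics differently.

```python
from collections import defaultdict


def summarize(rows):
    """Group trace rows by strategy and compute sample count, accuracy, and mean latency."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["strategy"]].append(row)
    summary = {}
    for strategy, items in grouped.items():
        correct = sum(1 for item in items if item["is_correct"])
        summary[strategy] = {
            "samples": len(items),
            "correct": correct,
            "accuracy_pct": round(100.0 * correct / len(items), 1),
            "avg_latency_ms": round(
                sum(item["latency_ms"] for item in items) / len(items), 2
            ),
        }
    return summary


def format_row(strategy, stats):
    """Render one summary entry as a single report line."""
    return (
        f"{strategy} {stats['samples']} {stats['correct']} "
        f"{stats['accuracy_pct']:.1f}% {stats['avg_latency_ms']:.2f} ms"
    )
```

With a single sample per strategy, accuracy can only be 0% or 100%, which is why a one-sample run works as a smoke test but not as a measurement.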

How to View or Present the Results

  • Open data_engineering/data/reports/evaluation_report_*.md for the summary table.
  • Open data_engineering/data/exports/traces.csv to show complete reasoning traces.
  • Open data_engineering/data/exports/steps.csv to show step-by-step reasoning.
  • Open data_engineering/data/exports/critiques.csv to show critic feedback.
  • Open data_engineering/data/reasoning_traces.db in a SQLite viewer to show the database tables.
  • Show the terminal output after running run_once.ps1 to prove end-to-end execution.

Faculty Explanation

This project is not only a prompt engineering experiment. It converts Critic-CoT reasoning into a data engineering system. Each reasoning attempt becomes a trace record. Each reasoning step becomes a row in the steps table. Each critic response becomes a row in the critiques table. Each run produces measurable accuracy, latency, token, and cost metrics.

The main data engineering contribution is repeatability and observability. The system can be rerun with different sample sizes, strategies, and schedules. The results can be stored, exported, analyzed, and compared across runs.

Report

The Word report is included in:

submission/Project_Report_Critic_CoT_Data_Engineering.docx

The report contains the project explanation, abstract, methodology, architecture, data engineering design, database schema, results, run instructions, limitations, and conclusion.

GitHub Upload Safety

Do not upload secrets or local runtime folders.

The following files and folders should not be committed:

  • config.py
  • .env
  • .venv/
  • venv/
  • .tmp/
  • .virtualenv-app-data/
  • __pycache__/
  • data_engineering/data/cache/

Before pushing to GitHub, check:

git status

The status output should not show config.py, .env, .venv, venv, or cache folders.
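
This check can also be automated. The sketch below scans the output of `git status --porcelain` for the paths listed above; the function names and matching rules are assumptions for illustration, and rename entries (`old -> new`) are not handled specially.

```python
import subprocess

# Entries ending in "/" are treated as directories, the rest as file names.
FORBIDDEN = (
    "config.py",
    ".env",
    ".venv/",
    "venv/",
    ".tmp/",
    ".virtualenv-app-data/",
    "__pycache__/",
    "data_engineering/data/cache/",
)


def risky_paths(porcelain_output, forbidden=FORBIDDEN):
    """Return paths from `git status --porcelain` output that match a forbidden entry."""
    risky = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        # Porcelain lines are two status characters, a space, then the path.
        path = line[3:].strip().strip('"')
        for pattern in forbidden:
            if pattern.endswith("/"):
                hit = f"/{pattern}" in f"/{path}"
            else:
                hit = path == pattern or path.endswith("/" + pattern)
            if hit:
                risky.append(path)
                break
    return risky


def check_repo():
    """Run git in the current directory and return any risky paths it reports."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,
    )
    return risky_paths(result.stdout)
```

An empty list from `check_repo()` means none of the listed files or folders appear in the status output. Note that `.env.example` is deliberately not flagged, since only an exact `.env` file name matches.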

Limitations

  • Free OpenRouter models may be slower or rate-limited.
  • LLM responses can vary between runs.
  • Small sample runs are for demonstration, not final benchmark claims.
  • Larger evaluations require more API time and stable model availability.
  • The current pipeline is focused on GSM8K-style mathematical reasoning.

Future Enhancements

  • Run larger evaluations with 20, 50, or 100 samples.
  • Add a dashboard using Streamlit, Power BI, or Tableau.
  • Compare multiple LLMs using the same pipeline.
  • Add detailed error categories such as arithmetic error, missing reasoning step, and final answer extraction error.
  • Store results in PostgreSQL or cloud storage.
  • Deploy the scheduler as a daily monitoring job.

Conclusion

The project successfully transforms Critic-CoT from a notebook-based reasoning method into a complete data-engineered evaluation system. It automates dataset ingestion, LLM reasoning, critic-based checking, structured storage, CSV export, report generation, and scheduled execution. This makes LLM reasoning more transparent, repeatable, measurable, and suitable for production-style evaluation.

About

Implementation of Critic-CoT: Chain-of-Thought Self-Critique for LLM Reasoning
