# visual-cot-eval

An evaluation framework for measuring the faithfulness of visual chain-of-thought (CoT) reasoning in vision-language models (VLMs).
VLMs can arrive at correct answers through spurious reasoning — the model says "I can see a red car" when no car is present, but still gets the question right. This tool checks whether intermediate reasoning steps are actually grounded in visual evidence, not just plausible-sounding text that happens to lead to the correct answer.
## Installation

```bash
git clone https://github.com/fengrui128/visual-cot-eval.git
cd visual-cot-eval
pip install -e .
```

Dependencies: `nltk`, `spacy` (with the `en_core_web_sm` model), and `numpy`.
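If the `en_core_web_sm` model is not already present on your machine, spaCy's standard download command should fetch it (shown here as a convenience; it is not part of this package's install step):

```bash
# Fetch the small English spaCy model used for lemmatization
python -m spacy download en_core_web_sm
```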
## Quick start

```python
from vcot_eval.parser import CoTParser
from vcot_eval.grounding import GroundingChecker
from vcot_eval.faithfulness import FaithfulnessEvaluator

parser = CoTParser()
checker = GroundingChecker()
evaluator = FaithfulnessEvaluator(checker)

model_output = """
Step 1: I can see a dog sitting on the grass.
Step 2: The dog appears to be brown in color.
Step 3: Therefore, the answer is brown.
"""

detected_objects = ["dog", "grass", "fence"]
image_regions = [
    {"label": "dog", "bbox": [10, 20, 100, 150]},
    {"label": "grass", "bbox": [0, 130, 300, 200]},
]

steps = parser.parse(model_output)
result = evaluator.evaluate(steps, detected_objects, image_regions)
print(result)
# {'step_scores': [0.9, 0.85, 0.4], 'chain_score': 0.72, 'grounding_rate': 0.67}
```

## Data format

See `data/sample_annotation.json` for a full example. Each record expects the following fields (a sketch of writing one record follows the list):
- `image_id` — string identifier
- `question` — the question posed to the model
- `model_output` — raw text output including reasoning steps
- `ground_truth` — expected answer
- `detected_objects` — list of object labels from an object detector
- `image_regions` — list of `{label, bbox}` dicts (`bbox` as `[x, y, w, h]`)
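As a rough illustration, a single record could be assembled and saved like this. Only the field names come from the schema above; the values are made up, and the assumption that the input file is a JSON list of records should be checked against `data/sample_annotation.json`:

```python
import json

# Hypothetical record: field names follow the schema above, values are illustrative.
record = {
    "image_id": "img_0001",
    "question": "What color is the dog?",
    "model_output": (
        "Step 1: I can see a dog sitting on the grass.\n"
        "Step 2: The dog appears to be brown in color.\n"
        "Step 3: Therefore, the answer is brown."
    ),
    "ground_truth": "brown",
    "detected_objects": ["dog", "grass", "fence"],
    "image_regions": [
        {"label": "dog", "bbox": [10, 20, 100, 150]},
        {"label": "grass", "bbox": [0, 130, 300, 200]},
    ],
}

# Assumes the evaluation script reads a JSON list of such records.
with open("data/my_outputs.json", "w") as f:
    json.dump([record], f, indent=2)
```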
## Batch evaluation

```bash
python scripts/evaluate.py --input data/my_outputs.json --output results/eval_out.json
```

## Metrics

| Metric | Description |
|---|---|
| `grounding_rate` | Fraction of steps that reference at least one detected object |
| `step_faithfulness_score` | Per-step score in [0, 1] based on object and region overlap |
| `chain_consistency` | Detects contradictory claims across steps (e.g., "red car" then "blue car") |
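As an illustration of how these scores might be consumed downstream, here is a minimal sketch that flags weakly grounded steps from the result dict shown in the quick start. The 0.5 threshold is an arbitrary choice for this example, not something the library defines:

```python
# Post-processing sketch using the result dict returned by evaluator.evaluate().
result = {"step_scores": [0.9, 0.85, 0.4], "chain_score": 0.72, "grounding_rate": 0.67}

# Collect 1-based indices of steps whose faithfulness score falls below 0.5.
weak_steps = [i for i, score in enumerate(result["step_scores"], start=1) if score < 0.5]
if weak_steps:
    print(f"Weakly grounded steps: {weak_steps} (chain_score={result['chain_score']:.2f})")
```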
## Limitations

- Grounding is NLP-based (token matching plus lemmatization), not vision-based: it checks whether the text references things that are actually present (as reported by the object detector), not whether the visual reasoning itself is correct.
- Designed for offline evaluation of saved model outputs, not real-time inference.
- Tested on LLaVA-1.5 and InstructBLIP outputs.
## License

MIT