# visual-cot-eval

An evaluation framework for measuring the faithfulness of visual chain-of-thought (CoT) reasoning in vision-language models (VLMs).
VLMs can arrive at correct answers through spurious reasoning — the model says "I can see a red car" when no car is present, but still gets the question right. This tool checks whether intermediate reasoning steps are actually grounded in visual evidence, not just plausible-sounding text that happens to lead to the correct answer.
## Installation

```bash
git clone https://github.com/fengrui128/visual-cot-eval.git
cd visual-cot-eval
pip install -e .
```

Dependencies: `nltk`, `spacy` (with the `en_core_web_sm` model), and `numpy`.
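If the `en_core_web_sm` model is not already present on your machine, spaCy's standard download command should fetch it (shown here as a convenience; it is not part of this package's install step):

```bash
# Fetch the small English spaCy model used for lemmatization
python -m spacy download en_core_web_sm
```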
## Quick start

```python
from vcot_eval.parser import CoTParser
from vcot_eval.grounding import GroundingChecker
from vcot_eval.faithfulness import FaithfulnessEvaluator

parser = CoTParser()
checker = GroundingChecker()
evaluator = FaithfulnessEvaluator(checker)

model_output = """
Step 1: I can see a dog sitting on the grass.
Step 2: The dog appears to be brown in color.
Step 3: Therefore, the answer is brown.
"""

detected_objects = ["dog", "grass", "fence"]
image_regions = [
    {"label": "dog", "bbox": [10, 20, 100, 150]},
    {"label": "grass", "bbox": [0, 130, 300, 200]},
]

steps = parser.parse(model_output)
result = evaluator.evaluate(steps, detected_objects, image_regions)
print(result)
# {'step_scores': [0.9, 0.85, 0.4], 'chain_score': 0.72, 'grounding_rate': 0.67}
```

## Data format

See `data/sample_annotation.json` for a full example. Each record expects the following fields (a sketch of writing one record follows the list):
- `image_id` — string identifier
- `question` — the question posed to the model
- `model_output` — raw text output including reasoning steps
- `ground_truth` — expected answer
- `detected_objects` — list of object labels from an object detector
- `image_regions` — list of `{label, bbox}` dicts (`bbox` as `[x, y, w, h]`)
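As a rough illustration, a single record could be assembled and saved like this. Only the field names come from the schema above; the values are made up, and the assumption that the input file is a JSON list of records should be checked against `data/sample_annotation.json`:

```python
import json

# Hypothetical record: field names follow the schema above, values are illustrative.
record = {
    "image_id": "img_0001",
    "question": "What color is the dog?",
    "model_output": (
        "Step 1: I can see a dog sitting on the grass.\n"
        "Step 2: The dog appears to be brown in color.\n"
        "Step 3: Therefore, the answer is brown."
    ),
    "ground_truth": "brown",
    "detected_objects": ["dog", "grass", "fence"],
    "image_regions": [
        {"label": "dog", "bbox": [10, 20, 100, 150]},
        {"label": "grass", "bbox": [0, 130, 300, 200]},
    ],
}

# Assumes the evaluation script reads a JSON list of such records.
with open("data/my_outputs.json", "w") as f:
    json.dump([record], f, indent=2)
```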
## Batch evaluation

```bash
python scripts/evaluate.py --input data/my_outputs.json --output results/eval_out.json
```

## Metrics

| Metric | Description |
|---|---|
| `grounding_rate` | Fraction of steps that reference at least one detected object |
| `step_faithfulness_score` | Per-step score in [0, 1] based on object and region overlap |
| `chain_consistency` | Detects contradictory claims across steps (e.g., "red car" then "blue car") |
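As an illustration of how these scores might be consumed downstream, here is a minimal sketch that flags weakly grounded steps from the result dict shown in the quick start. The 0.5 threshold is an arbitrary choice for this example, not something the library defines:

```python
# Post-processing sketch using the result dict returned by evaluator.evaluate().
result = {"step_scores": [0.9, 0.85, 0.4], "chain_score": 0.72, "grounding_rate": 0.67}

# Collect 1-based indices of steps whose faithfulness score falls below 0.5.
weak_steps = [i for i, score in enumerate(result["step_scores"], start=1) if score < 0.5]
if weak_steps:
    print(f"Weakly grounded steps: {weak_steps} (chain_score={result['chain_score']:.2f})")
```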
## Limitations

- Grounding is NLP-based (token matching plus lemmatization), not vision-based: it checks whether the text references things that are actually present (as reported by the object detector), not whether the visual reasoning itself is correct.
- Designed for offline evaluation of saved model outputs, not real-time inference.
- Tested on LLaVA-1.5 and InstructBLIP outputs.
## License

MIT