grad student at UWA (Perth, Australia) — working on multimodal learning, mostly vision-language models and why they confidently describe things that aren't there.
my research sits somewhere between computer vision and NLP. right now i'm spending most of my time thinking about:
- hallucination in VLMs — when does a model "see" something vs. invent it? how do we measure that reliably? (there's a tiny sketch of what i mean just below this list)
- cross-modal attention — what's actually happening when a model aligns a word with a region in an image?
- visual reasoning chains — can we evaluate whether a model's intermediate steps are grounded, not just the final answer?
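
on the "measure that reliably" part: the baseline i keep coming back to is a CHAIR-style object check, i.e. did the caption mention an object that the annotation doesn't contain. a minimal sketch of that idea (the synonym map and function names are made up for illustration, not the actual vlm-hallu-probe API):

```python
from typing import Iterable

# tiny illustrative synonym map; a real vocab would come from e.g. COCO categories
SYNONYMS = {
    "person": {"person", "man", "woman", "people"},
    "bicycle": {"bicycle", "bike"},
    "couch": {"couch", "sofa"},
}

def mentioned_objects(caption: str, vocab: dict) -> set:
    """canonical object names whose surface forms appear in the caption."""
    tokens = set(caption.lower().replace(",", " ").replace(".", " ").split())
    return {obj for obj, forms in vocab.items() if tokens & forms}

def object_hallucination_rate(caption: str, gt_objects: Iterable, vocab: dict = SYNONYMS) -> float:
    """fraction of mentioned objects that aren't in the ground-truth annotation."""
    mentioned = mentioned_objects(caption, vocab)
    if not mentioned:
        return 0.0
    return len(mentioned - set(gt_objects)) / len(mentioned)

# the model invents a bike that isn't annotated: 1 of 3 mentioned objects is hallucinated
print(object_hallucination_rate("a man sitting on a couch next to a bike",
                                gt_objects=["person", "couch"]))   # 0.333...
```

object-level is the easy case; attribute and relation checks follow the same pattern but need much fuzzier matching, which is where it stops being a ten-line function.
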
current status: debugging something that was working yesterday
some things i've built / been building:
| project | what it does |
|---|---|
| vlm-hallu-probe | lightweight toolkit for probing hallucination patterns in VLMs — object, attribute, relation levels |
| attn-scope | attention map analysis + visualization for multimodal transformers, because looking at attention weights is half of debugging |
| visual-cot-eval | evaluating whether visual chain-of-thought reasoning steps are actually grounded in the image |
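
the core loop behind attn-scope isn't complicated: grab per-head attention weights, average the heads, reshape the patch dimension back into a grid, look at it. a bare-bones version with a random tensor standing in for real weights (with a real model you'd typically get these by passing `output_attentions=True` to a transformers forward call):

```python
import torch
import matplotlib.pyplot as plt

num_heads, num_patches = 12, 196            # 14x14 patch grid for a 224px ViT
attn = torch.rand(num_heads, num_patches + 1, num_patches + 1)   # +1 for the CLS token
attn = attn / attn.sum(dim=-1, keepdim=True)                     # rows sum to 1, like softmax output

# attention from the CLS query to every image patch, averaged over heads
cls_to_patches = attn.mean(dim=0)[0, 1:]    # shape: (196,)
grid = cls_to_patches.reshape(14, 14)

plt.imshow(grid, cmap="viridis")
plt.title("CLS -> patch attention, head-averaged")
plt.colorbar()
plt.savefig("attn_map.png")                 # plt.show() if you're in a notebook
```

most of the real work is bookkeeping: which layer, which head, which query token, and keeping the patch grid aligned with the original image.
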
recently been reading / thinking about:
- the gap between automated VQA benchmarks and what "understanding" actually means
- token merging strategies in ViTs and whether they hurt grounding
- whether RLHF-aligned models are more or less prone to hallucination (the answer is complicated)
not much else to say here. i mostly keep notes in obsidian, break things in jupyter notebooks, and occasionally remember to commit my experiments before losing them.
