This repository contains the code for
“Deep Language Geometry: Constructing a Metric Space from LLM Weights.”
We construct binary “language vectors” from LLM weight importance (via OBS-style estimates), then measure inter-language distances (e.g., Hamming), enabling analysis, visualization, and downstream transfer heuristics.
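As a rough illustration of the distance step described above, the sketch below computes Hamming distances between binary vectors with NumPy. The vectors are toy values invented for this example, and the helper names `hamming_distance` and `pairwise_hamming` are hypothetical; this is not the repository's implementation, only a minimal sketch of the metric itself.

```python
import numpy as np


def hamming_distance(a, b):
    """Count the positions at which two equal-length binary vectors differ."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("vectors must have the same shape")
    return int(np.count_nonzero(a != b))


def pairwise_hamming(vectors):
    """Return the symmetric matrix of Hamming distances between all rows."""
    stacked = np.asarray(vectors, dtype=bool)
    # Broadcasting compares every row with every other row in one step.
    return (stacked[:, None, :] != stacked[None, :, :]).sum(axis=-1)


# Toy binary "language vectors" (made-up values, for illustration only).
toy_vectors = {
    "lang_a": [1, 0, 1, 1, 0, 0, 1, 0],
    "lang_b": [1, 0, 0, 1, 0, 1, 1, 0],
    "lang_c": [0, 1, 0, 0, 1, 1, 0, 1],
}

names = list(toy_vectors)
distances = pairwise_hamming([toy_vectors[name] for name in names])

for i, row_name in enumerate(names):
    for j, col_name in enumerate(names):
        if i < j:
            print(f"{row_name} vs {col_name}: {distances[i, j]}")
```

In this toy setup a smaller distance means the two vectors mark more of the same positions, which is the sense in which two languages would count as closer.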
- Accepted for a long presentation at RANLP 2025! 🎉
To calculate and save a binary vector from a model and a dataset, run:
python main.py --model <your model> --dataset <your dataset>

The arguments:
- --model: The identifier for the model from the Hugging Face model hub.
- --dataset: Calibration dataset name.
- --seed: Seed for sampling the calibration data.
For more usage examples, see launch.sh.
The calculated binary vectors, Euclidean vectors, and distances are published as a Hugging Face dataset: mshamrai/language-metric-data.
The Gradio analysis tool is also published as a Hugging Face Space: mshamrai/language-metric-analysis.
If you use this repo, dataset, or space, please cite:
@article{shamrai2025deep,
title = {Deep Language Geometry: Constructing a Metric Space from LLM Weights},
author = {Maksym Shamrai and Vladyslav Hamolia},
journal = {arXiv preprint arXiv:2508.11676},
year = {2025},
url = {https://arxiv.org/abs/2508.11676}
}
This project is licensed under the MIT License. Feel free to use and modify the code for academic and research purposes.