Terminology Harmonization using Late-Interaction Reranker With Alignment-tuned Transformers
- Python 3.10+
- uv
- Request and download the standard concepts in csv format from https://athena.ohdsi.org/
- Convert the csv files into a DuckDB database using sidataplus/athena2duckdb
Fine-tuned reranker models are hosted on Hugging Face.
Pre-built indexes will be made available soon.
pip install thirawat-mapper
# or (recommended for global CLI installs)
pipx install thirawat-mapper
thirawat --help
Command mapping:
- thirawat index build ... → python -m thirawat_mapper.index.build ...
- thirawat infer bulk ... → python -m thirawat_mapper.infer.bulk ...
- thirawat infer query ... → python -m thirawat_mapper.infer.query ...
thirawat index build \
--duckdb data/derived/concepts.duckdb \
--profiles-table concept_profiles \
--concepts-table concept \
--domain-id Drug \
--concept-class-id "Clinical Drug,Quant Clinical Drug,Clinical Drug Comp,Clinical Drug Form,Ingredient" \
--exclude-concept-class-id "Clinical Drug Box,Branded Drug Box,Branded Pack Box,Clinical Pack Box,Marketed Product,Quant Branded Box,Quant Clinical Box" \
--extra-column "concept_name,domain_id,vocabulary_id,concept_class_id" \
--out-db data/lancedb/db \
--table concepts_drug \
--batch-size 256 \
--device cuda
Key options:
- --duckdb: DuckDB file produced by sidataplus/athena2duckdb.
- --profiles-table: Preferred table containing concept_id and profile_text. If the table is missing, the builder falls back to generating profiles inline from concept (and concept_synonym when available).
- --concepts-table: OMOP concept table (defaults to concept). The builder always joins to this table and keeps only standard, valid concepts (standard_concept = 'S' AND invalid_reason IS NULL).
- --domain-id, --concept-class-id: Optional filters; accept comma-separated lists or repeated flags.
- --exclude-concept-class-id: Exclude specific classes (comma-separated or repeat flag). Default empty; recommended exclusions: Clinical Drug Box, Branded Drug Box, Branded Pack Box, Clinical Pack Box, Marketed Product, Quant Branded Box, Quant Clinical Box.
- --extra-column: Carry additional columns from the profiles table into LanceDB (repeat flag).
- --max-synonyms: Number of synonyms appended when inline profile generation is used.
- --include-codes-in-text: Include concept_code in generated inline profile text.
- --model-id, --pooling, --max-length: Encoder controls for building the index vectors (also written into the index manifest for inference defaults).
- --out-db / --table: Target LanceDB directory and table name.
If your Athena-to-DuckDB file does not contain a concept_profiles table, the command still works via inline profile generation:
thirawat index build \
--duckdb data/derived/concepts.duckdb \
--profiles-table concept_profiles \
--concepts-table concept \
--out-db data/lancedb/db \
--table concepts_drug \
--max-synonyms 3 \
--include-codes-in-text
Device matrix:
- index build --device: explicit cuda|mps|cpu. If omitted, the encoder uses cuda when available, otherwise cpu.
- infer bulk/query --device: auto|cuda|mps|cpu (default cpu for stability; auto prefers cuda, then mps, then cpu).
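The auto preference order for infer bulk/query can be sketched as a small pure function. This is only an illustration of the stated order (cuda, then mps, then cpu), not the CLI's actual code; the name resolve_device and its boolean parameters are hypothetical. In a real run the two availability flags would typically come from torch.cuda.is_available() and torch.backends.mps.is_available().

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Resolve a --device value to a concrete device name.

    Hypothetical helper: explicit choices pass through unchanged, while
    'auto' follows the documented preference cuda -> mps -> cpu.
    """
    allowed = {"auto", "cuda", "mps", "cpu"}
    if requested not in allowed:
        raise ValueError(f"unsupported device: {requested!r}")
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# On an Apple Silicon machine without CUDA, 'auto' would pick mps.
print(resolve_device("auto", cuda_available=False, mps_available=True))
```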
Apple Silicon example:
thirawat index build \
--duckdb data/derived/concepts.duckdb \
--profiles-table concept_profiles \
--concepts-table concept \
--out-db data/lancedb/db \
--table concepts_drug \
--device mps
The command will:
- Load profiles (and apply filters if provided).
- Normalize profile_text and embed it with SapBERT (via transformers; pooling configurable).
- Write a LanceDB table where vector is a FixedSizeList<float32>[768] column.
- Emit a <table>_manifest.json manifest describing the build (model id, filters, counts).
thirawat infer query \
--db data/lancedb/db \
--table concepts_drug \
--device cpu \
--reranker-id sidataplus/THIRAWAT-BioLORD # optional override; defaults to sidataplus/THIRAWAT-SapBERT
Type a query and press Enter to see the post-scored top results:
query> amoxicillin clavulanate 875 mg
concept_id | score | s_sim | name
--------------------------------------------------------------------------------
123456 | 0.841 | 0.990 | Amoxicillin / Clavulanate 875 MG Oral Tablet
...
Commands:
- Type :q, :quit, or :exit to leave.
- Use --candidate-topk to change the candidate pool and --show-topk to limit display rows.
- --reranker-id works here too if you want to test a local or alternative reranker in the REPL.
export TOKENIZERS_PARALLELISM=false
thirawat infer bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/usagi.csv \
--out runs/mapping \
--candidate-topk 200 \
--n-limit 20 \
--device cuda
Add --reranker-id to point at a different reranker checkpoint. The flag accepts either a Hugging Face model ID or a local path, e.g. --reranker-id models/nde_biolord.
Input formats: CSV, TSV, Parquet, or Excel. By default the CLI expects the following columns (override via flags):
- sourceName (required)
- sourceCode (optional)
- conceptId (optional ground truth)
- mappingStatus (used for Usagi detection)
When the input already follows the Usagi CSV schema (see data/eval/tmt_to_rxnorm.csv), the CLI validates a sample of rows through a Pydantic schema and surfaces a clear error if the structure is invalid. Otherwise, it synthesizes a minimal Usagi row per record so downstream exports stay consistent.
Selected flags:
- --source-name-column, --source-code-column: Override input headers.
- --label-column: Column containing gold concept IDs (optional, default conceptId).
- --status-column, --approved-value: Configure Usagi approval detection.
- --batch-size: Query embedding batch size (increase for better GPU throughput).
- --n-limit: Limit to the first N rows (smoke runs).
- --where: Optional LanceDB filter, e.g., vocabulary_id = 'RxNorm' AND concept_class_id != 'Ingredient' (when those columns exist in the index).
- --device: auto|cuda|mps|cpu (default cpu for stability; use auto to prefer cuda, then mps, then cpu).
- --encoder-model-id, --encoder-pooling, --encoder-max-length: Override the query encoder used for retrieval (defaults to the index manifest when present).
- --post-mode: Post-score behavior: blend|tiebreak|lex (default tiebreak).
- --post-weight: Blend weight (only when --post-mode blend, default 0.05).
- --tiebreak-eps, --tiebreak-topn: Controls near-tie grouping for --post-mode tiebreak.
- --brand-strict: For bracketed brand queries, drop brand-mismatched candidates when possible.
- --inn2usan/--no-inn2usan: Normalize INN/BAN drug names to USAN during inference (default enabled).
- --atc-scope: Boost candidates matching per-row atc_ids/atc_codes (requires --vocab or a DuckDB path in the index manifest).
- --reranker-id: Override the default reranker (sidataplus/THIRAWAT-SapBERT) with another HF model ID or a local directory/filename. Relative paths are resolved to absolute paths so you can pass models/nde_biolord.
--post-mode controls how post features influence ranking:
- tiebreak (default): keeps the ML relevance ordering globally, but reorders only near-tied candidates (gap <= --tiebreak-eps) within the first --tiebreak-topn rows.
- lex: full lexicographic sort by relevance + post features across all rows.
- blend: computes a weighted final score.
For blend, the score is:
final_score = (1 - post_weight) * relevance + post_weight * post_score
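As a worked example of the blend formula, the snippet below is a minimal sketch rather than the CLI's own implementation. The function name blend_score is hypothetical, and restricting post_weight to the range [0, 1] is an assumption; the default of 0.05 mirrors the documented --post-weight default.

```python
def blend_score(relevance: float, post_score: float, post_weight: float = 0.05) -> float:
    """Weighted blend of the reranker relevance and the post-score.

    Hypothetical helper mirroring:
        final_score = (1 - post_weight) * relevance + post_weight * post_score
    """
    # Assumption: the weight is a convex-combination factor in [0, 1].
    if not 0.0 <= post_weight <= 1.0:
        raise ValueError("post_weight must be between 0 and 1")
    return (1.0 - post_weight) * relevance + post_weight * post_score


# 0.95 * 0.80 + 0.05 * 0.40 is approximately 0.78
print(round(blend_score(0.80, 0.40), 4))
# With post_weight = 0.0 the post-score has no influence on the result.
print(blend_score(0.80, 0.40, post_weight=0.0))
```

Setting the weight to zero returns the relevance unchanged, which is why a zero weight in blend mode effectively turns post-scoring off.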
For lex and tiebreak, tie-break keys are applied in this deterministic order (descending):
- brand_strength_exact
- top20_strength_form_exact
- brand_score
- rerank_top20
- strength_exact
- strength_sim
- form_route_score
- release_score
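The tiebreak mode can be pictured with the following sketch. It is an illustration under stated assumptions, not the project's code: the function names, the candidate dictionary layout, and the default eps and topn values are hypothetical, and "near-tied" is interpreted here as a run of adjacent candidates whose relevance gap to the previous row is at most eps. The example uses only the first and last keys from the list above.

```python
from typing import Dict, List, Sequence


def _sort_group(group: List[Dict], keys: Sequence[str]) -> List[Dict]:
    # Descending sort on the post-feature keys in order, then on relevance.
    return sorted(
        group,
        key=lambda c: tuple(c.get(k, 0.0) for k in keys) + (c["relevance"],),
        reverse=True,
    )


def tiebreak_rerank(
    candidates: List[Dict],
    keys: Sequence[str],
    eps: float = 0.01,   # assumed value, not the CLI default
    topn: int = 20,      # assumed value, not the CLI default
) -> List[Dict]:
    """Reorder only near-tied candidates within the first `topn` rows."""
    ranked = sorted(candidates, key=lambda c: c["relevance"], reverse=True)
    head, tail = ranked[:topn], ranked[topn:]
    out: List[Dict] = []
    group: List[Dict] = []
    for cand in head:
        # A gap larger than eps closes the current near-tie group.
        if group and group[-1]["relevance"] - cand["relevance"] > eps:
            out.extend(_sort_group(group, keys))
            group = []
        group.append(cand)
    out.extend(_sort_group(group, keys))
    return out + tail  # rows beyond topn keep their relevance order


rows = [
    {"id": 1, "relevance": 0.900, "brand_strength_exact": 0, "release_score": 0.0},
    {"id": 2, "relevance": 0.895, "brand_strength_exact": 1, "release_score": 0.0},
    {"id": 3, "relevance": 0.700, "brand_strength_exact": 1, "release_score": 1.0},
]
order = tiebreak_rerank(rows, keys=["brand_strength_exact", "release_score"])
print([r["id"] for r in order])
```

In this toy input, candidates 1 and 2 are within eps of each other, so the post features decide their order, while candidate 3 stays last because its relevance gap is far larger than eps.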
Pipeline steps per row:
- Build query text (sourceName with sourceCode appended in parentheses when present).
- Embed with SapBERT.
- Vector search (cosine) against the LanceDB table to gather --candidate-topk entries.
- Rerank with the THIRAWAT reranker. Beta is vector-only; no FTS/BM25/hybrid.
- Apply post-scoring per --post-mode (default tiebreak: only reorders within near-ties of the ML score). Disable post-scoring via --post-mode blend --post-weight 0.0.
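The first pipeline step can be sketched in a few lines. This is a hedged illustration, not the package's code: the helper names are hypothetical, and the exact normalization details (lower-casing the appended code as well, stripping leading and trailing whitespace) are assumptions beyond the documented "lower-cased, whitespace collapsed" behavior.

```python
import re
from typing import Optional


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace and lower-case the result (assumed details)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def build_query_text(source_name: str, source_code: Optional[str] = None) -> str:
    """Build the query string: sourceName, plus sourceCode in parentheses if present."""
    text = source_name
    if source_code:
        text = f"{source_name} ({source_code})"
    return normalize_text(text)


print(build_query_text("  Amoxicillin   875 MG ", "A01"))
print(build_query_text("Paracetamol 500 mg"))
```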
Outputs (written to --out):
- results.csv: Classic relabel layout (wide, block-per-query). Columns: leading rank1..K, then for each query three adjacent columns [match_rank_or_unmatched, source_concept_name, source_concept_code] with K rows beneath. Non-Usagi inputs preserve the original row order; Usagi inputs continue to sort matched rows first so reviewers can focus on confirmed gold IDs.
- results_with_input.csv: Original input row with candidate columns appended.
- results_usagi.csv: Always emitted. Each processed row is coerced into the Usagi schema (using the sample in data/eval/tmt_to_rxnorm.csv as ground truth). The top candidate populates conceptId, conceptName, domainId, and matchScore when available; otherwise those fields remain blank. Every row is marked mappingStatus=UNCHECKED, statusSetBy=THIRAWAT-mapper, mappingType=MAPS_TO so reviewers can import the file directly into Usagi even when the source sheet was not originally in that format.
- metrics.json: When ground-truth IDs are available (either via conceptId or Usagi rows with mappingStatus == APPROVED) the file reports Hit@{1,2,5,10,20,50,100}, MRR@100, coverage, and counts.
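For readers unfamiliar with the ranking metrics named above, the sketch below shows the standard definitions of Hit@k and MRR@100 over a list of (ranked concept IDs, gold ID) pairs. It is not the code that writes metrics.json: the function names and the output key names are hypothetical, and coverage is left out because its exact definition is not stated here.

```python
from typing import Dict, List, Sequence, Tuple


def hit_at_k(ranked_ids: Sequence[int], gold_id: int, k: int) -> bool:
    """True when the gold concept appears among the first k candidates."""
    return gold_id in ranked_ids[:k]


def reciprocal_rank(ranked_ids: Sequence[int], gold_id: int, cutoff: int = 100) -> float:
    """1 / rank of the gold concept within the cutoff, or 0.0 if absent."""
    for rank, concept_id in enumerate(ranked_ids[:cutoff], start=1):
        if concept_id == gold_id:
            return 1.0 / rank
    return 0.0


def summarize(
    rows: List[Tuple[Sequence[int], int]],
    ks: Sequence[int] = (1, 2, 5, 10, 20, 50, 100),
) -> Dict[str, float]:
    """Average the per-row metrics; key names here are illustrative only."""
    n = len(rows)
    if n == 0:
        return {"count": 0}
    report: Dict[str, float] = {"count": n}
    for k in ks:
        report[f"hit@{k}"] = sum(hit_at_k(r, g, k) for r, g in rows) / n
    report["mrr@100"] = sum(reciprocal_rank(r, g) for r, g in rows) / n
    return report


example = [([10, 20, 30], 10), ([10, 20, 30], 30), ([10, 20], 99)]
print(summarize(example))
```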
Bulk inference can optionally send the top reranked candidates to an LLM for tie-breaking or abstention logic. Enable this flow with --rag-provider and supply provider-specific flags. The CLI saves every prompt/response pair to rag_prompts.md under the chosen --out directory so you can audit exactly what was sent.
LLM output must be structured JSON with a concept_ids array, e.g. {"concept_ids":[123,456,789]}. If a provider returns invalid JSON for a query, that query falls back to the non-LLM ranking and logs an error.
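The validate-or-fall-back behavior described above can be sketched as follows. This is an assumption-laden illustration, not the CLI's implementation: the function names are hypothetical, and the merge rule (LLM-chosen IDs first, unknown IDs ignored, remaining candidates kept in reranker order) is one plausible choice rather than documented behavior.

```python
import json
from typing import List, Optional, Sequence


def parse_concept_ids(raw: str) -> Optional[List[int]]:
    """Return the concept_ids array, or None when the payload is not valid."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    ids = payload.get("concept_ids")
    if not isinstance(ids, list):
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if not all(isinstance(i, int) and not isinstance(i, bool) for i in ids):
        return None
    return ids


def apply_llm_order(ranked_ids: Sequence[int], raw: str) -> List[int]:
    """Apply the LLM ordering, falling back to the non-LLM ranking on bad JSON."""
    llm_ids = parse_concept_ids(raw)
    if llm_ids is None:
        return list(ranked_ids)
    known = set(ranked_ids)
    picked: List[int] = []
    for concept_id in llm_ids:
        # Assumption: IDs outside the candidate pool and duplicates are ignored.
        if concept_id in known and concept_id not in picked:
            picked.append(concept_id)
    rest = [cid for cid in ranked_ids if cid not in picked]
    return picked + rest


print(apply_llm_order([1, 2, 3], '{"concept_ids":[3,1]}'))
print(apply_llm_order([1, 2, 3], "not json"))
```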
General RAG knobs:
--rag-provider {ollama,llamacpp,openrouter,cloudflare}
--rag-model MODEL_ID # default openai/gpt-oss-20b
--rag-candidate-limit 50 # number of reranked candidates passed to the LLM
--rag-profile-char-limit 512 # truncate long profile_text snippets
--rag-include-retrieval-score/--no-rag-include-retrieval-score
--rag-include-final-score/--no-rag-include-final-score
--rag-extra-context-column COLUMN # optional extra context column from the input sheet
--rag-stop-sequence TEXT (repeatable)
--rag-use-normalized-query/--no-rag-use-normalized-query
Tip: RAG is isolated to infer.bulk. The interactive REPL intentionally remains retrieval-only in this beta.
thirawat infer bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/input/usagi.csv \
--out runs/ollama_rag \
--n-limit 100 \
--rag-provider ollama \
--ollama-base-url http://localhost:11434 \
--ollama-model "gpt-oss:20b"
Ollama-specific flags:
--ollama-base-url URL # default http://localhost:11434
--ollama-model MODEL_TAG # defaults to --rag-model value
--ollama-timeout 120 # seconds
--ollama-keep-alive "5m" # optional keep-alive hint sent to server
Use --rag-provider llamacpp only when a llama.cpp llama-server process is already running (default http://127.0.0.1:8080). Launch the server separately with your desired context and batching flags (for example: llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -fa on). Point the CLI at that HTTP endpoint, not at GGUF files directly:
thirawat infer bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/input/usagi.csv \
--out runs/llamacpp_rag \
--rag-provider llamacpp \
--llamacpp-base-url http://127.0.0.1:8080 \
--rag-model ggml-org/gpt-oss-20b-GGUF
llama.cpp flags:
--llamacpp-base-url URL # default http://127.0.0.1:8080
--llamacpp-timeout 120 # HTTP timeout in seconds
--llamacpp-chat-format FORMAT # e.g., qwen, llama
--llamacpp-system-prompt TEXT # optional instruction prefix
--llamacpp-n-ctx 8192 # forwarded via query parameters when supported
--llamacpp-model-path /path/model.gguf # fallback to llama-cpp-python bindings when no base URL is set
If you omit --llamacpp-base-url, the CLI falls back to the Python bindings and expects --llamacpp-model-path to point to a local GGUF file (plus any --llamacpp-n-* overrides). In that mode, the --rag-model flag is ignored and the file name controls which model loads.
For all providers, the CLI logs each prompt/response pair and the parsed candidate ordering to rag_prompts.md in the --out directory for downstream review.
export OPENROUTER_API_KEY=<YOUR_KEY>
thirawat infer bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/input/usagi.csv \
--out runs/openrouter_rag \
--rag-provider openrouter \
--rag-model openrouter/polaris-alpha
Set OPENROUTER_API_KEY in your environment; the CLI will refuse to call OpenRouter without it.
export CLOUDFLARE_ACCOUNT_ID=<ACCOUNT_ID>
export CLOUDFLARE_API_TOKEN=<API_TOKEN>
thirawat infer bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/input/usagi.csv \
--out runs/cf_rag \
--n-limit 100 \
--rag-provider cloudflare \
--rag-model openai/gpt-oss-20b
Cloudflare-specific flags:
--cloudflare-base-url https://api.cloudflare.com/client/v4
--cloudflare-use-responses-api / --no-cloudflare-use-responses-api
--gpt-reasoning-effort {low,medium,high}
--cf-reasoning-summary {auto,concise,detailed}
Set CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN in your environment before invoking the Cloudflare provider; the CLI reads only from those variables.
- Models under @cf/openai/* (for example @cf/openai/gpt-oss-120b) use the Workers AI Responses API, so leave --cloudflare-use-responses-api enabled to send the prompt as an input payload.
- Meta's @cf/meta/llama-4-* family is served via the /ai/run/<model> endpoint; pass --no-cloudflare-use-responses-api when targeting those models so the CLI emits the messages payload the endpoint expects.
# 1. Install dependencies into a local virtual environment (creates .venv/)
uv sync
# 2. (Optional) Activate the environment for interactive shells
source .venv/bin/activate
# 3. Or just run commands directly via uv
uv run python -m thirawat_mapper.index.build --help
uv sync reads the project metadata and installs the required packages (PyTorch, LanceDB, transformers, etc.) against Python 3.10+. Subsequent uv run ... invocations will reuse the same environment. Replace paths in the examples to match your workspace. All text used for indexing and inference is normalized (lower-cased, whitespace collapsed) for stable matching.
- Vector-only retrieval + reranking (no FTS/BM25/hybrid in beta).
- Text is normalized (lowercase + collapsed whitespace) for indexing and inference.
- The reranker default is sidataplus/THIRAWAT-SapBERT. As verified on February 10, 2026 via the Hugging Face model API, this model is public (gated=false, private=false). If upstream access settings change later, authenticate with Hugging Face as needed.
- LanceDB tables must expose a float32 fixed-size vector column (named vector when built with this CLI).
- Index build keeps only standard, valid OMOP concepts (standard_concept='S' AND invalid_reason IS NULL).
- This beta uses the transformers encoder path directly (no --backend st switch in this CLI).
You may see this warning while loading SapBERT-related components:
No sentence-transformers model found with name cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR.
In this project that warning is usually benign: it reflects fallback behavior when the model is loaded through transformers/ColBERT wrappers. Treat it as an error only when model loading or inference actually fails (for example, a raised exception, process exit, or no embeddings produced).