
THIRAWAT Mapper

Terminology Harmonization using Late-Interaction Reranker With Alignment-tuned Transformers

Prerequisites

Environment

  1. Python 3.10+
  2. uv

OHDSI Standard Concepts

  1. Request and download the standard concepts in CSV format from https://athena.ohdsi.org/
  2. Convert the CSV files into a DuckDB database using sidataplus/athena2duckdb

Models

Fine-tuned reranker models are hosted on Hugging Face.

Pre-built indexes will be made available soon.

Install from PyPI

pip install thirawat-mapper
# or (recommended for global CLI installs)
pipx install thirawat-mapper

thirawat --help

Command mapping:

  • thirawat index build ... → python -m thirawat_mapper.index.build ...
  • thirawat infer bulk ... → python -m thirawat_mapper.infer.bulk ...
  • thirawat infer query ... → python -m thirawat_mapper.infer.query ...

1. Build a LanceDB Index

thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --domain-id Drug \
  --concept-class-id "Clinical Drug,Quant Clinical Drug,Clinical Drug Comp,Clinical Drug Form,Ingredient" \
  --exclude-concept-class-id "Clinical Drug Box,Branded Drug Box,Branded Pack Box,Clinical Pack Box,Marketed Product,Quant Branded Box,Quant Clinical Box" \
  --extra-column "concept_name,domain_id,vocabulary_id,concept_class_id" \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --batch-size 256 \
  --device cuda

Key options:

  • --duckdb - DuckDB file produced by sidataplus/athena2duckdb.
  • --profiles-table - Preferred table containing concept_id and profile_text. If the table is missing, the builder falls back to generating profiles inline from concept (and concept_synonym when available).
  • --concepts-table - OMOP concept table (defaults to concept). The builder always joins to this table and keeps only standard, valid concepts (standard_concept = 'S' AND invalid_reason IS NULL).
  • --domain-id, --concept-class-id - Optional filters; accept comma-separated lists or repeated flags.
  • --exclude-concept-class-id - Exclude specific classes (comma-separated or repeat flag). Default empty; recommended exclusions: Clinical Drug Box, Branded Drug Box, Branded Pack Box, Clinical Pack Box, Marketed Product, Quant Branded Box, Quant Clinical Box.
  • --extra-column - Carry additional columns from the profiles table into LanceDB (repeat flag).
  • --max-synonyms - Number of synonyms appended when inline profile generation is used.
  • --include-codes-in-text - Include concept_code in generated inline profile text.
  • --model-id, --pooling, --max-length - Encoder controls for building the index vectors (also written into the index manifest for inference defaults).
  • --out-db / --table - Target LanceDB directory and table name.

If your Athena-to-DuckDB file does not contain a concept_profiles table, the command still works via inline profile generation:

thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --max-synonyms 3 \
  --include-codes-in-text

Device matrix:

  • index build --device: explicit cuda|mps|cpu. If omitted, the encoder uses cuda when available, otherwise cpu.
  • infer bulk/query --device: auto|cuda|mps|cpu (default cpu for stability; auto prefers cuda, then mps, then cpu).
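The resolution rules above can be sketched as a small pure function. This is a hypothetical helper, not the package's actual code; the real CLI presumably wires the availability checks to torch, while here they are plain booleans:

```python
def resolve_device(spec: str, cuda_ok: bool, mps_ok: bool) -> str:
    """Resolve a --device value to a concrete device string.

    Mirrors the documented behavior: explicit cuda|mps|cpu pass through
    unchanged; 'auto' prefers cuda, then mps, then cpu.
    """
    if spec in ("cuda", "mps", "cpu"):
        return spec
    if spec == "auto":
        if cuda_ok:
            return "cuda"
        if mps_ok:
            return "mps"
        return "cpu"
    raise ValueError(f"unknown device spec: {spec!r}")
```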

Apple Silicon example:

thirawat index build \
  --duckdb data/derived/concepts.duckdb \
  --profiles-table concept_profiles \
  --concepts-table concept \
  --out-db data/lancedb/db \
  --table concepts_drug \
  --device mps

The command will:

  1. Load profiles (and apply filters if provided).
  2. Normalize profile_text and embed with SapBERT vectors (via transformers; pooling configurable).
  3. Write a LanceDB table where vector is a FixedSizeList<float32>[768] column.
  4. Emit a <table>_manifest.json manifest describing the build (model id, filters, counts).

2. Interactive Query (REPL)

thirawat infer query \
  --db data/lancedb/db \
  --table concepts_drug \
  --device cpu \
  --reranker-id sidataplus/THIRAWAT-BioLORD  # optional override; defaults to sidataplus/THIRAWAT-SapBERT

Type a query and press Enter to see the post-scored top results:

query> amoxicillin clavulanate 875 mg
concept_id   | score  | s_sim | name
--------------------------------------------------------------------------------
123456       | 0.841  | 0.990 | Amoxicillin / Clavulanate 875 MG Oral Tablet
...

Commands:

  • Type :q, :quit, or :exit to leave.
  • Use --candidate-topk to change the candidate pool and --show-topk to limit display rows.
  • --reranker-id works here too if you want to test a local or alternative reranker in the REPL.

3. Bulk Inference

export TOKENIZERS_PARALLELISM=false

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/usagi.csv \
  --out runs/mapping \
  --candidate-topk 200 \
  --n-limit 20 \
  --device cuda

Add --reranker-id to point at a different reranker checkpoint. The flag accepts either a Hugging Face model ID or a local path, e.g. --reranker-id models/nde_biolord.

Input formats: CSV, TSV, Parquet, or Excel. By default the CLI expects the following columns (override via flags):

  • sourceName (required)
  • sourceCode (optional)
  • conceptId (optional ground truth)
  • mappingStatus (used for Usagi detection). When the input already follows the Usagi CSV schema (see data/eval/tmt_to_rxnorm.csv), the CLI validates a sample of rows through a Pydantic schema and surfaces a clear error if the structure is invalid. Otherwise, it synthesizes a minimal Usagi row per record so downstream exports stay consistent.

Selected flags:

  • --source-name-column, --source-code-column - Override input headers.
  • --label-column - Column containing gold concept IDs (optional, default conceptId).
  • --status-column, --approved-value - Configure Usagi approval detection.
  • --batch-size - Query embedding batch size (increase for better GPU throughput).
  • --n-limit - Limit to the first N rows (smoke runs).
  • --where - Optional LanceDB filter, e.g., vocabulary_id = 'RxNorm' AND concept_class_id != 'Ingredient' (when those columns exist in the index).
  • --device - auto|cuda|mps|cpu (default cpu for stability; use auto to prefer cuda, then mps, then cpu).
  • --encoder-model-id, --encoder-pooling, --encoder-max-length - Override the query encoder used for retrieval (defaults to the index manifest when present).
  • --post-mode - Post-score behavior: blend|tiebreak|lex (default tiebreak).
  • --post-weight - Blend weight (only when --post-mode blend, default 0.05).
  • --tiebreak-eps, --tiebreak-topn - Controls near-tie grouping for --post-mode tiebreak.
  • --brand-strict - For bracketed brand queries, drop brand-mismatched candidates when possible.
  • --inn2usan/--no-inn2usan - Normalize INN/BAN drug names to USAN during inference (default enabled).
  • --atc-scope - Boost candidates matching per-row atc_ids/atc_codes (requires --vocab or a DuckDB path in the index manifest).
  • --reranker-id - Override the default reranker (sidataplus/THIRAWAT-SapBERT) with another HF model ID or a local directory/filename. Relative paths are resolved to absolute paths so you can pass models/nde_biolord.

Deterministic post-ranking modes

--post-mode controls how post features influence ranking:

  • tiebreak (default): keeps the ML relevance ordering globally, but reorders only near-tied candidates (gap <= --tiebreak-eps) within the first --tiebreak-topn rows.
  • lex: full lexicographic sort by relevance + post features across all rows.
  • blend: computes a weighted final score.

For blend, the score is:

final_score = (1 - post_weight) * relevance + post_weight * post_score
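In Python terms, the blend is a one-line convex combination (a sketch of the documented formula; `post_weight` mirrors `--post-weight`):

```python
def blend_score(relevance: float, post_score: float, post_weight: float = 0.05) -> float:
    """Weighted blend of ML relevance and deterministic post features,
    as applied by --post-mode blend."""
    return (1 - post_weight) * relevance + post_weight * post_score
```

Note that with `post_weight = 0.0` the post features contribute nothing, which matches the documented way to disable post-scoring (`--post-mode blend --post-weight 0.0`).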

For lex and tiebreak, tie-break keys are applied in this deterministic order (descending):

  1. brand_strength_exact
  2. top20_strength_form_exact
  3. brand_score
  4. rerank_top20
  5. strength_exact
  6. strength_sim
  7. form_route_score
  8. release_score

Pipeline steps per row:

  1. Build query text (sourceName with sourceCode appended in parentheses when present).
  2. Embed with SapBERT.
  3. Vector search (cosine) against the LanceDB table to gather --candidate-topk entries.
  4. Rerank with the THIRAWAT reranker. Beta is vector-only; no FTS/BM25/hybrid.
  5. Apply post-scoring per --post-mode (default tiebreak: only reorders within near-ties of the ML score). Disable post-scoring via --post-mode blend --post-weight 0.0.

Outputs (written to --out):

  • results.csv - Classic relabel layout (wide, block-per-query). Columns: leading rank 1..K, then for each query three adjacent columns [match_rank_or_unmatched, source_concept_name, source_concept_code] with K rows beneath. Non-Usagi inputs preserve the original row order; Usagi inputs continue to sort matched rows first so reviewers can focus on confirmed gold IDs.
  • results_with_input.csv - Original input row with candidate columns appended.
  • results_usagi.csv - Always emitted. Each processed row is coerced into the Usagi schema (using the sample in data/eval/tmt_to_rxnorm.csv as ground truth). The top candidate populates conceptId, conceptName, domainId, and matchScore when available; otherwise those fields remain blank. Every row is marked mappingStatus=UNCHECKED, statusSetBy=THIRAWAT-mapper, mappingType=MAPS_TO so reviewers can import the file directly into Usagi even when the source sheet was not originally in that format.
  • metrics.json - When ground-truth IDs are available (either via conceptId or Usagi rows with mappingStatus == APPROVED) the file reports Hit@{1,2,5,10,20,50,100}, MRR@100, coverage, and counts.

LLM-assisted RAG reranking

Bulk inference can optionally send the top reranked candidates to an LLM for tie-breaking or abstention logic. Enable this flow with --rag-provider and supply provider-specific flags. The CLI saves every prompt/response pair to rag_prompts.md under the chosen --out directory so you can audit exactly what was sent.

LLM output must be structured JSON with a concept_ids array, e.g. {"concept_ids":[123,456,789]}. If a provider returns invalid JSON for a query, that query falls back to the non-LLM ranking and logs an error.

General RAG knobs:

--rag-provider {ollama,llamacpp,openrouter,cloudflare}
--rag-model MODEL_ID                # default openai/gpt-oss-20b
--rag-candidate-limit 50            # number of reranked candidates passed to the LLM
--rag-profile-char-limit 512        # truncate long profile_text snippets
--rag-include-retrieval-score/--no-rag-include-retrieval-score
--rag-include-final-score/--no-rag-include-final-score
--rag-extra-context-column COLUMN   # optional extra context column from the input sheet
--rag-stop-sequence TEXT (repeatable)
--rag-use-normalized-query/--no-rag-use-normalized-query

Tip: RAG is isolated to infer.bulk. The interactive REPL intentionally remains retrieval-only in this beta.

Ollama (local GGUF/chat server)

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/ollama_rag \
  --n-limit 100 \
  --rag-provider ollama \
  --ollama-base-url http://localhost:11434 \
  --ollama-model "gpt-oss:20b"

Ollama-specific flags:

--ollama-base-url URL          # default http://localhost:11434
--ollama-model MODEL_TAG       # defaults to --rag-model value
--ollama-timeout 120           # seconds
--ollama-keep-alive "5m"       # optional keep-alive hint sent to server

llama.cpp server (local HTTP API)

Use --rag-provider llamacpp only when a llama.cpp llama-server process is already running (default http://127.0.0.1:8080). Launch the server separately with your desired context and batching flags (for example: llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -fa on). Point the CLI at that HTTP endpoint, not at GGUF files directly:

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/llamacpp_rag \
  --rag-provider llamacpp \
  --llamacpp-base-url http://127.0.0.1:8080 \
  --rag-model ggml-org/gpt-oss-20b-GGUF

llama.cpp flags:

--llamacpp-base-url URL          # default http://127.0.0.1:8080
--llamacpp-timeout 120           # HTTP timeout in seconds
--llamacpp-chat-format FORMAT    # e.g., qwen, llama
--llamacpp-system-prompt TEXT    # optional instruction prefix
--llamacpp-n-ctx 8192            # forwarded via query parameters when supported
--llamacpp-model-path /path/model.gguf   # fallback to llama-cpp-python bindings when no base URL is set

If you omit --llamacpp-base-url, the CLI falls back to the llama-cpp-python bindings and expects --llamacpp-model-path to point to a local GGUF file (plus any --llamacpp-n-* overrides). In that mode, the --rag-model flag is ignored and the file name determines which model loads.

For all providers, the CLI logs each prompt/response pair and the parsed candidate ordering to rag_prompts.md in the --out directory for downstream review.

OpenRouter (hosted multi-model API)

export OPENROUTER_API_KEY=<YOUR_KEY>

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/openrouter_rag \
  --rag-provider openrouter \
  --rag-model openrouter/polaris-alpha

Set OPENROUTER_API_KEY in your environment; the CLI will refuse to call OpenRouter without it.

Cloudflare Workers AI (remote)

export CLOUDFLARE_ACCOUNT_ID=<ACCOUNT_ID>
export CLOUDFLARE_API_TOKEN=<API_TOKEN>

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/input/usagi.csv \
  --out runs/cf_rag \
  --n-limit 100 \
  --rag-provider cloudflare \
  --rag-model openai/gpt-oss-20b

Cloudflare-specific flags:

--cloudflare-base-url https://api.cloudflare.com/client/v4
--cloudflare-use-responses-api / --no-cloudflare-use-responses-api
--gpt-reasoning-effort {low,medium,high}
--cf-reasoning-summary {auto,concise,detailed}

Set CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN in your environment before invoking the Cloudflare provider; the CLI reads only from those variables.

  • Models under @cf/openai/* (for example @cf/openai/gpt-oss-120b) use the Workers AI Responses API, so leave --cloudflare-use-responses-api enabled to send the prompt as an input payload.
  • Meta's @cf/meta/llama-4-* family is served via the /ai/run/<model> endpoint; pass --no-cloudflare-use-responses-api when targeting those models so the CLI emits the messages payload the endpoint expects.

Development

# 1. Install dependencies into a local virtual environment (creates .venv/)
uv sync

# 2. (Optional) Activate the environment for interactive shells
source .venv/bin/activate

# 3. Or just run commands directly via uv
uv run python -m thirawat_mapper.index.build --help

uv sync reads the project metadata and installs the required packages (PyTorch, LanceDB, transformers, etc.) against Python 3.10+. Subsequent uv run ... invocations reuse the same environment. Replace the paths in the examples above to match your workspace. All text used for indexing and inference is normalized (lower-cased, whitespace collapsed) for stable matching.

Notes & Requirements

  • Vector-only retrieval + reranking (no FTS/BM25/hybrid in beta).
  • Text is normalized (lowercase + collapsed whitespace) for indexing and inference.
  • The reranker default is sidataplus/THIRAWAT-SapBERT. As verified on February 10, 2026 via the Hugging Face model API, this model is public (gated=false, private=false). If upstream access settings change later, authenticate with Hugging Face as needed.
  • LanceDB tables must expose a float32 fixed-size vector column (named vector when built with this CLI).
  • Index build keeps only standard, valid OMOP concepts (standard_concept='S' AND invalid_reason IS NULL).
  • This beta uses the transformers encoder path directly (no --backend st switch in this CLI).
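The normalization described above (lowercase plus collapsed whitespace) is roughly equivalent to the following sketch; the function name is an assumption and the package's exact rules may differ:

```python
import re

def normalize_text(text: str) -> str:
    """Lower-case and collapse runs of whitespace to a single space,
    mirroring the normalization applied for indexing and inference."""
    return re.sub(r"\s+", " ", text.lower()).strip()
```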

Troubleshooting: SapBERT warning during startup

You may see this warning while loading SapBERT-related components:

No sentence-transformers model found with name cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR.

In this project that warning is often benign fallback behavior when loading through transformers/ColBERT wrappers. Treat it as an error only when model loading or inference actually fails (for example, a raised exception, process exit, or no embeddings produced).
