feat: add lexical baseline models for ranking tasks by federetyk · Pull Request #36 · techwolf-ai/workrb

federetyk · 2026-01-16T12:35:53Z

Addresses #35

Description

This PR adds lexical baseline models to WorkRB for establishing performance bounds on ranking tasks. These baselines complement the existing neural embedding models (BiEncoderModel, JobBERTModel, etc.) by providing lower-bound reference points and enabling future two-stage retrieval pipelines with candidate generation followed by neural re-ranking.

Four models are introduced, all inheriting from ModelInterface and implementing the standard ranking/classification interface. The models accept but ignore ModelInputType parameters, as lexical methods are input-type agnostic. Classification is handled by delegating to ranking, following the same pattern as BiEncoderModel.
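As a rough illustration of the pattern described above, the sketch below shows an input-type-agnostic baseline whose classification method delegates to ranking. The class name `RandomRankingSketch`, the method names `compute_rankings` and `compute_classification`, and the `*_input_type` arguments are hypothetical stand-ins chosen for this example; they are not taken from WorkRB's actual ModelInterface.

```python
from __future__ import annotations

import random
from typing import Optional, Sequence


class RandomRankingSketch:
    """Hypothetical stand-in for a baseline model; not WorkRB's real API."""

    def __init__(self, seed: Optional[int] = None) -> None:
        # A dedicated RNG keeps results reproducible without touching global state.
        self._rng = random.Random(seed)

    def compute_rankings(
        self,
        queries: Sequence[str],
        targets: Sequence[str],
        query_input_type: object = None,   # accepted but ignored
        target_input_type: object = None,  # accepted but ignored
    ) -> list[list[float]]:
        # One random score per (query, target) pair.
        return [[self._rng.random() for _ in targets] for _ in queries]

    def compute_classification(
        self,
        texts: Sequence[str],
        labels: Sequence[str],
        text_input_type: object = None,
        label_input_type: object = None,
    ) -> list[list[float]]:
        # Classification reuses ranking: the labels are scored as ranking targets.
        return self.compute_rankings(texts, labels, text_input_type, label_input_type)
```

With a fixed seed, two fresh instances produce identical score matrices, which is what makes a random baseline usable as a reproducible sanity check.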

The implementations are adapted from the MELO Benchmark repository.

Changes:

Add BM25Model: BM25 Okapi probabilistic ranking using the rank-bm25 library
Add TfIdfModel: TF-IDF with cosine similarity, supporting word-level or character n-gram tokenization
Add EditDistanceModel: Levenshtein ratio for string similarity using the rapidfuzz library
Add RandomRankingModel: Random score generation for sanity checking, with optional seed for reproducibility
Add shared preprocessing: Unicode normalization (NFKD) and configurable lowercasing across all models
Add rank-bm25 and rapidfuzz dependencies to pyproject.toml
Add unit tests covering initialization and ranking computation
Export new models in src/workrb/models/__init__.py
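
To make the shared preprocessing and the edit-distance idea concrete, here is a minimal standard-library sketch. It assumes NFKD normalization is applied before optional lowercasing, and it uses a plain Levenshtein distance normalized by the longer string's length. The function names are hypothetical, and the scoring may differ from the ratio that rapidfuzz computes in the actual EditDistanceModel.

```python
from __future__ import annotations

import unicodedata


def preprocess(text: str, lowercase: bool = True) -> str:
    # NFKD decomposes compatibility characters (e.g. the "fi" ligature U+FB01 -> "fi").
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.lower() if lowercase else normalized


def levenshtein(a: str, b: str) -> int:
    # Two-row dynamic programme; insertion, deletion and substitution all cost 1.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    # Normalized to [0, 1]; 1.0 means the strings are identical.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank(query: str, targets: list[str], lowercase: bool = True) -> list[str]:
    # Sort targets from most to least similar to the preprocessed query.
    q = preprocess(query, lowercase)
    return sorted(
        targets,
        key=lambda t: similarity(q, preprocess(t, lowercase)),
        reverse=True,
    )
```

For example, `rank("Data Engineer", ["nurse", "data engineer", "date engineer"])` places the case-insensitive exact match first, because lowercasing happens before the distance is computed.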

Checklist

Added new tests for new functionality
Tested locally with example tasks
Code follows project style guidelines
Documentation updated
No new warnings introduced

federetyk added 5 commits January 8, 2026 20:47

feat: add lexical baselines for ranking

9867830

feat: add unicode normalization to lexical baseline preprocessing

0a7f2b1

Merge branch 'techwolf-ai:main' into feat/lexical-ranking-baselines

8a38014

Merge branch 'techwolf-ai:main' into feat/lexical-ranking-baselines

3c719a7

fix: include lowercase setting in lexical baseline model names

b0e49db

federetyk wants to merge 5 commits into techwolf-ai:main from federetyk:feat/lexical-ranking-baselines
