
perf: add FastBPETrainer using word-frequency counting #13

Closed
AmitMY wants to merge 1 commit into main from perf/incremental-counting


Conversation


@AmitMY AmitMY commented Apr 8, 2026

Summary

  • Add FastBPETrainer that flattens words into byte tuples and counts word frequencies
  • Pair counting operates on word-frequency dict instead of traversing the full graph
  • Produces identical merges to graph-based BPE (verified in tests)

Stacked on #12.
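The idea can be sketched roughly as follows. This is a minimal illustration of word-frequency BPE counting, not the PR's actual `FastBPETrainer` code: the function name `train_fast_bpe` and its structure are assumptions. Each distinct word is flattened into a tuple of single-byte tokens and counted once; every subsequent pair count is weighted by the word's frequency instead of re-traversing the corpus.

```python
# Hypothetical sketch of word-frequency BPE training (not the PR's code).
from collections import Counter


def train_fast_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    # Count each distinct word once; all pair counting below is
    # weighted by these frequencies instead of re-scanning the corpus.
    word_freqs = Counter(text.split())
    # Flatten each word into a tuple of single-byte tokens.
    words = {w: tuple(bytes([b]) for b in w.encode("utf-8")) for w in word_freqs}
    merges: list[tuple[bytes, bytes]] = []
    for _ in range(num_merges):
        # Pair counting over the word-frequency dict, not the full corpus.
        pair_counts: Counter = Counter()
        for w, freq in word_freqs.items():
            toks = words[w]
            for pair in zip(toks, toks[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the winning merge once per distinct word.
        new_words = {}
        for w, toks in words.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(toks[i] + toks[i + 1])
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_words[w] = tuple(out)
        words = new_words
    return merges
```

Because the inner loops run over distinct words rather than every occurrence, text with heavy repetition (the case the PR targets) is handled in time proportional to the vocabulary, while producing the same merge sequence as an occurrence-by-occurrence count.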

What improved

  • Significantly faster on text with repeated patterns (words counted once, not traversed repeatedly)
  • Performance test verifies FastBPE is faster than regular graph-based BPE

Test plan

  • 5 tests: correctness match against regular graph-based BPE, empty-input and single-character edge cases, and a performance comparison
  • ruff check . passes

🤖 Generated with Claude Code

@AmitMY AmitMY force-pushed the perf/incremental-counting branch 11 times, most recently from f364502 to 5c951a9 on April 8, 2026 at 17:20
- FastBPETrainer flattens words into byte tuples and counts word
  frequencies, avoiding repeated graph traversal
- Pair counting operates on word-freq dict instead of full corpus
- Produces identical merges to graph-based BPE (tested)
- Significantly faster on repeated text patterns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AmitMY AmitMY force-pushed the perf/incremental-counting branch from 5c951a9 to 2aa3dcc on April 8, 2026 at 17:22

AmitMY commented Apr 8, 2026

Closing — if we want a flat word-frequency BPE implementation, we'd just use HuggingFace tokenizers directly.

