@antalvdb
Member

Pull Request: Optimize UnicodeHash lookup performance using std::unordered_map

Description

This PR optimizes the Hash::UnicodeHash class by replacing the internal custom Trie implementation (Tries::UniTrie) with a standard std::unordered_map.

Profiling of the dependent timbl application revealed that UnicodeHash::lookup and UnicodeHash::hash were significant performance bottlenecks, consuming over 50% of the CPU time during the learning phase on large datasets. This was due to the Trie structure, where each string lookup walks one node per character and therefore costs O(L) node traversals for a key of length L. Switching to a hash map provides O(1) average-case lookups (the key is still read once to compute its hash, but no per-character node traversal is needed).

Changes

  • include/ticcutils/UniHash.h:
    • Replaced Tries::UniTrie<UniInfo> _tree with std::unordered_map<icu::UnicodeString, UniInfo*, UnicodeStringHash> _map.
    • Added a custom UnicodeStringHash struct to support icu::UnicodeString keys.
  • src/UniHash.cxx:
    • Refactored hash() and lookup() to use std::unordered_map API (find, insert).
    • Added a conditional check Normalizer2::isNormalized to avoid expensive normalization if the input string is already in NFC format.
    • Updated the destructor ~UnicodeHash() to explicitly delete UniInfo pointers stored in the map to prevent memory leaks.
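
The shape of these changes can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: to keep it buildable without ICU, `std::u16string` stands in for `icu::UnicodeString`, and `SketchHash`, `UStringHash`, and the fields of `UniInfo` are hypothetical names chosen for the example. The FNV-1a-style hasher is likewise an assumption; with ICU available, a hasher could instead delegate to `icu::UnicodeString::hashCode()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Stand-in for icu::UnicodeString so the sketch builds without ICU (assumption).
using UString = std::u16string;

// Hypothetical payload; the real UniInfo layout is not shown in this PR text.
struct UniInfo {
    UString name;
    unsigned int index;
};

// Custom hasher: FNV-1a-style mixing over the UTF-16 code units (assumption).
struct UStringHash {
    std::size_t operator()(const UString& s) const noexcept {
        std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
        for (char16_t c : s) {
            h ^= static_cast<std::uint64_t>(c);
            h *= 1099511628211ULL;                  // FNV prime
        }
        return static_cast<std::size_t>(h);
    }
};

class SketchHash {
public:
    SketchHash() = default;
    // The map owns raw pointers, so copying is disabled to avoid double deletes.
    SketchHash(const SketchHash&) = delete;
    SketchHash& operator=(const SketchHash&) = delete;

    // Explicitly release every UniInfo stored in the map.
    ~SketchHash() {
        for (auto& kv : _map) {
            delete kv.second;
        }
    }

    // Return the index for value, assigning the next free index on first sight.
    unsigned int hash(const UString& value) {
        auto it = _map.find(value);
        if (it != _map.end()) {
            return it->second->index;
        }
        auto* info = new UniInfo{value, ++_next};
        _map.emplace(value, info);
        return info->index;
    }

    // Return the index for value, or 0 when the value is unknown.
    unsigned int lookup(const UString& value) const {
        auto it = _map.find(value);
        return it == _map.end() ? 0u : it->second->index;
    }

    std::size_t size() const { return _map.size(); }

private:
    std::unordered_map<UString, UniInfo*, UStringHash> _map;
    unsigned int _next = 0;
};
```

In this sketch, calling `hash(u"appel")` twice yields the same index both times, and `lookup` on an unseen string yields 0. Storing `std::unique_ptr<UniInfo>` as the mapped type would remove the need for the explicit destructor loop; raw pointers are used here only to mirror the change list above.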

Performance Analysis

Benchmarks were run using timbl on the edufineweb_train_000001-100k dataset (~4.9M lines).

| Metric | Original (UniTrie) | Optimized (unordered_map) | Improvement |
|---|---|---|---|
| Total Learning Time | ~274s | ~125s | 2.2x Speedup |
| Lookup + Hash Time | ~145.3s | ~3.4s | ~42x Speedup |

Profiling Details (Top Functions)

Before (Original):
```text
% cumulative self name
38.44 101.33 101.33 Hash::UnicodeHash::lookup
16.66 145.26 43.93 Hash::UnicodeHash::hash
```

After (Optimized):
```text
% cumulative self name
1.92 96.51 2.19 Hash::UnicodeHash::lookup
1.05 101.13 1.20 Hash::UnicodeHash::hash
```
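
As a quick consistency check, the "Lookup + Hash Time" figures appear to be the sums of the `self` columns in the two profiles above, and the speedups follow from the reported times. The arithmetic below uses only numbers already quoted in this description. The `cumulative` column is a running total over all functions listed so far, so it only equals the sum when the two functions are the top entries, as in the "Before" profile.

```math
\begin{aligned}
101.33 + 43.93 &= 145.26\ \text{s} \approx 145.3\ \text{s} \\
2.19 + 1.20 &= 3.39\ \text{s} \approx 3.4\ \text{s} \\
145.3 / 3.4 &\approx 42.7 \\
274 / 125 &\approx 2.19
\end{aligned}
```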

The bottleneck has been effectively eliminated, shifting the primary processing time to the core algorithm logic (ClassDistribution::IncFreq).
ticcutils_optimization_pr.zip

    modernizing, expanded tests, cleanup
    cleaning up XmlTools.
    update FileUtils to use filesystem::remove
    updating enum_types handling.
    reworking LogStream
    C++ code quality (C++17)
@kosloot kosloot self-assigned this Nov 22, 2025
kosloot added a commit that referenced this pull request Nov 22, 2025
Date:   Sat Nov 22 16:07:08 2025 +0100

    updated as suggested by #30

    entering 2025

    updated nlohmann json.hpp
@kosloot kosloot merged commit 244ee10 into develop Nov 22, 2025
5 checks passed