C: Use hb_string_T for token_T.value #687

Merged
marcoroth merged 47 commits into marcoroth:main from timkaechele:token-value-string
Mar 6, 2026

@timkaechele timkaechele commented Oct 19, 2025

This pull request changes token_T.value to use hb_string_T and adapts the call sites. Token values become non-owning slices into the source buffer, which eliminates the per-token strdup calls during lexing and parsing. As a result, token_copy becomes a shallow struct copy and token_free no longer frees the value.
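
To illustrate the idea of a non-owning token value, here is a minimal sketch. The names `slice_T`, `sketch_token_T`, `slice_from_range`, `sketch_token_copy`, and `slice_equals_cstr` are hypothetical stand-ins invented for this example; they are not the actual definitions of hb_string_T or token_T in this pull request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for hb_string_T: a pointer plus a length, no ownership. */
typedef struct {
  const char* data;
  uint32_t length;
} slice_T;

/* Hypothetical, simplified token; the real token_T carries more fields. */
typedef struct {
  slice_T value;
  int type;
} sketch_token_T;

/* Build a slice pointing into `source` without copying any bytes. */
static slice_T slice_from_range(const char* source, uint32_t from, uint32_t to) {
  assert(source != NULL && from <= to);
  slice_T slice = { source + from, to - from };
  return slice;
}

/* Copying a token is a plain struct copy: both copies share the same bytes. */
static sketch_token_T sketch_token_copy(const sketch_token_T* token) {
  return *token;
}

/* Slices are not NUL-terminated, so compare by length and bytes. */
static int slice_equals_cstr(slice_T slice, const char* expected) {
  size_t expected_length = strlen(expected);
  return slice.length == expected_length
      && memcmp(slice.data, expected, expected_length) == 0;
}
```

Under these assumptions, slicing bytes 1 to 4 of `"<div>hello</div>"` yields a value that compares equal to `"div"` without any allocation, and a copied token points at the same bytes as the original.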

Two constants, HB_STRING_EMPTY and HB_STRING_NULL, are introduced to distinguish a valid empty string from the absence of a value. Call sites that previously checked token->value == NULL now use hb_string_is_null, while sites that checked for empty content use hb_string_is_empty.
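
The distinction between "no value" and "empty value" could look roughly like the sketch below. `SLICE_NULL`, `SLICE_EMPTY`, `slice_is_null`, `slice_is_empty`, and `slice_describe` are hypothetical names mirroring the roles of HB_STRING_NULL, HB_STRING_EMPTY, hb_string_is_null, and hb_string_is_empty; the real definitions may differ. In particular, this sketch assumes that a null slice also reports as empty, which the description above does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for hb_string_T. */
typedef struct {
  const char* data;
  uint32_t length;
} slice_T;

/* Absence of a value: no backing bytes at all. */
#define SLICE_NULL ((slice_T) { NULL, 0 })
/* A valid value that happens to contain zero bytes. */
#define SLICE_EMPTY ((slice_T) { "", 0 })

/* True only when there is no value at all. */
static bool slice_is_null(slice_T slice) {
  return slice.data == NULL;
}

/* True when there are no bytes (assumption: a null slice also counts as empty). */
static bool slice_is_empty(slice_T slice) {
  return slice.length == 0;
}

/* Classify a slice; null is checked first because it also reports as empty here. */
static const char* slice_describe(slice_T slice) {
  assert(slice.data != NULL || slice.length == 0); /* a null slice has no length */
  if (slice_is_null(slice)) { return "absent"; }
  if (slice_is_empty(slice)) { return "empty"; }
  return "present";
}
```

A call site that used to test `token->value == NULL` corresponds to `slice_is_null` in this sketch, while a check for empty content corresponds to `slice_is_empty`.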

The public API is simplified by removing herb_lex_file and herb_lex_to_buffer.

herb_lex_file had an inconsistent lifetime contract: it read a file, lexed it, then freed the source, leaving tokens with dangling pointers. Callers should now read the file themselves and pass the source to herb_lex. herb_lex_to_buffer was only used by the C CLI and the C tests, so it moves to an internal lex_helpers.h header.
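
For callers migrating away from herb_lex_file, reading the file themselves might look like the sketch below. `read_source_file` is a hypothetical helper written for this example with standard C I/O only; the call to herb_lex is left as a comment because its exact signature is not shown in this description.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a heap-allocated, NUL-terminated buffer.
 * Returns NULL on failure. The caller owns the buffer and must keep it
 * alive until the tokens lexed from it are no longer needed. */
static char* read_source_file(const char* path) {
  assert(path != NULL);

  FILE* file = fopen(path, "rb");
  if (file == NULL) { return NULL; }

  /* Determine the file size by seeking to the end. */
  if (fseek(file, 0, SEEK_END) != 0) { fclose(file); return NULL; }
  long size = ftell(file);
  if (size < 0 || fseek(file, 0, SEEK_SET) != 0) { fclose(file); return NULL; }

  char* source = malloc((size_t) size + 1);
  if (source == NULL) { fclose(file); return NULL; }

  size_t bytes_read = fread(source, 1, (size_t) size, file);
  fclose(file);
  if (bytes_read != (size_t) size) { free(source); return NULL; }

  source[size] = '\0';
  return source;
}

/* Intended call order (herb_lex signature not shown here):
 *   1. source = read_source_file(path);
 *   2. pass source to herb_lex and use the tokens;
 *   3. release the tokens;
 *   4. free(source) last, since token values point into it. */
```

The point of the ordering is that the buffer is freed by the same caller that allocated it, after the tokens are gone, which is the contract herb_lex_file could not offer.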

Since token values are non-owning, callers must keep the source string alive for as long as tokens or AST nodes are in use. All existing bindings already satisfy this naturally: they hold a reference to the source (Ruby string, JNI string, std::string, etc.) for the duration of the operation and convert to native objects before releasing it.
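
The "convert before releasing" pattern can be sketched in plain C as follows. `slice_T` and `slice_to_owned` are hypothetical names for this example, not part of the library; the sketch only shows that a value must be copied into owned storage before the source buffer is freed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for hb_string_T: points into a buffer it does not own. */
typedef struct {
  const char* data;
  uint32_t length;
} slice_T;

/* Copy the slice into a fresh NUL-terminated string owned by the caller.
 * After this call the result no longer depends on the source buffer. */
static char* slice_to_owned(slice_T slice) {
  assert(slice.data != NULL || slice.length == 0);

  char* copy = malloc((size_t) slice.length + 1);
  if (copy == NULL) { return NULL; }

  if (slice.length > 0) {
    memcpy(copy, slice.data, slice.length);
  }
  copy[slice.length] = '\0';
  return copy;
}
```

A caller would take the owned copy first and free the source afterwards; reading `slice.data` after the source has been freed would be a use-after-free.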

Comparison

make bench_allocs

Before (current main 2990a34cc77681bcd45bb21b4c4320065c5ea129)
=== Allocation Benchmark ===

[small] (35 bytes input)
  lex  small       allocs: 32      deallocs: 0       bytes_alloc: 691       tokens: 16
  parse small      allocs: 106     deallocs: 68      bytes_alloc: 2597

[medium] (650 bytes input)
  lex  medium      allocs: 420     deallocs: 0       bytes_alloc: 9260      tokens: 210
  parse medium     allocs: 1355    deallocs: 897     bytes_alloc: 33354

[large] (2878 bytes input)
  lex  large       allocs: 1384    deallocs: 0       bytes_alloc: 31250     tokens: 692
  parse large      allocs: 4394    deallocs: 2963    bytes_alloc: 110160
After (this pull request)
=== Allocation Benchmark ===

[small] (35 bytes input)
  lex  small       allocs: 16      deallocs: 0       bytes_alloc: 768       tokens: 16
  parse small      allocs: 59      deallocs: 34      bytes_alloc: 2844

[medium] (650 bytes input)
  lex  medium      allocs: 210     deallocs: 0       bytes_alloc: 10080     tokens: 210
  parse medium     allocs: 764     deallocs: 451     bytes_alloc: 36038

[large] (2878 bytes input)
  lex  large       allocs: 692     deallocs: 0       bytes_alloc: 33216     tokens: 692
  parse large      allocs: 2465    deallocs: 1491    bytes_alloc: 117003
Conclusion

This pull request makes 50% fewer allocations while lexing and roughly 44% fewer allocations while parsing, at the cost of about 6-11% more total bytes allocated.
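
The percentages can be re-derived from the benchmark output above with a small helper. `percent_change` is a hypothetical function written for this check, not part of the benchmark tooling.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Negative values mean a reduction, positive values an increase. */
static double percent_change(long before, long after) {
  assert(before > 0);
  return ((double) (after - before) / (double) before) * 100.0;
}
```

For the large input, `percent_change(1384, 692)` gives -50 for lex allocations, `percent_change(4394, 2465)` gives about -43.9 for parse allocations, and `percent_change(31250, 33216)` gives about +6.3 for bytes allocated while lexing.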

@marcoroth marcoroth changed the title from "WIP: C: Use hb_string_T for token_T.value`" to "WIP: C: Use hb_string_T for token_T.value" Oct 19, 2025
@timkaechele timkaechele force-pushed the token-value-string branch 3 times, most recently from cad26cb to 9488839 Compare October 20, 2025 19:19
@timkaechele timkaechele force-pushed the token-value-string branch 2 times, most recently from b29d6cc to b867795 Compare October 29, 2025 16:28
@github-actions github-actions bot added the cpp label Mar 2, 2026
@marcoroth marcoroth added this to the v1.0.0 milestone Mar 2, 2026
@marcoroth marcoroth force-pushed the token-value-string branch from d8870c7 to 3e47f21 Compare March 2, 2026 02:44
@marcoroth marcoroth marked this pull request as ready for review March 2, 2026 06:07
@marcoroth marcoroth force-pushed the token-value-string branch from 3fecb53 to 8b26b90 Compare March 6, 2026 14:23
@marcoroth marcoroth merged commit 5d735c9 into marcoroth:main Mar 6, 2026
37 of 48 checks passed
@timkaechele timkaechele deleted the token-value-string branch March 6, 2026 15:07
marcoroth added a commit that referenced this pull request Mar 6, 2026
marcoroth added a commit that referenced this pull request Mar 6, 2026
marcoroth added a commit that referenced this pull request Mar 7, 2026
marcoroth added a commit that referenced this pull request Mar 7, 2026