Fix non-deterministic code generation caused by HashSet iteration order #232

Open
jensdietrich wants to merge 1 commit into antlr:master from jensdietrich:fix/deterministic-code-generation

Summary

Replaces HashSet with LinkedHashSet in two places where rewrite-rule
element references are collected during grammar analysis. Because HashSet
iteration order follows element hash codes, which here come from
System.identityHashCode() and vary between JVM runs, the order in which
rewrite stream variables were emitted into generated parsers was
non-deterministic.

Affected locations:

  • DefineGrammarItemsWalker.g: rewriteRefsDeep and rewriteRefsShallow
    on GrammarAST nodes (3 allocations)
  • Grammar.java: the labels set returned by getLabels()

LinkedHashSet preserves insertion order (i.e. the order in which tokens and
rule references appear in the grammar source), making the generated output
stable across runs.
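
The difference can be illustrated with a minimal, self-contained sketch. It is
not the ANTLR code: Node is a hypothetical stand-in for an AST node that, by
assumption, does not override hashCode(), and iterationOrder is a helper
invented for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class InsertionOrderDemo {
    /** Hypothetical stand-in for an AST node; it inherits Object.hashCode(). */
    static final class Node {
        final String text;
        Node(String text) { this.text = text; }
    }

    /** Adds one node per name to a set and returns the names in iteration order. */
    static List<String> iterationOrder(boolean linked, String... names) {
        Set<Node> set;
        if (linked) {
            set = new LinkedHashSet<Node>(); // iterates in insertion order
        } else {
            set = new HashSet<Node>();       // iterates in hash-bucket order
        }
        for (String name : names) {
            set.add(new Node(name));
        }
        List<String> order = new ArrayList<>();
        for (Node node : set) {
            order.add(node.text);
        }
        return order;
    }

    public static void main(String[] args) {
        // The HashSet line may change from one JVM run to the next;
        // the LinkedHashSet line always follows the argument order.
        System.out.println("HashSet:       " + iterationOrder(false, "ELSEIF", "c2", "t2"));
        System.out.println("LinkedHashSet: " + iterationOrder(true, "ELSEIF", "c2", "t2"));
    }
}
```

Running the program several times should show the LinkedHashSet line staying
fixed, while the HashSet line is free to change between runs.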

Motivation: reproducible builds

This bug causes generated parser sources to differ between builds. Concrete
symptoms visible in the generated Java code:

// ordering of hasNext() checks varies between runs:
while ( stream_ELSEIF.hasNext()||stream_c2.hasNext()||stream_t2.hasNext() ) {

// ordering of reset() calls varies between runs:
stream_ELSE.reset();
stream_t3.reset();

Non-deterministic generated sources undermine reproducible builds, which are
an important defence in software supply chain security. The Java compiler
itself treats non-determinism in generated code as a bug; see, for example,
JDK-8264306 and JDK-8295024.

The issue was reported downstream in
antlr/stringtemplate4#325,
where STParser.java was observed to produce a different SHA-256 hash on
every build.

Performance impact

The performance impact of this change is expected to be negligible.
LinkedHashSet offers the same O(1) amortised complexity as HashSet for
add(), contains(), and remove() — the only difference is a small
constant overhead per insertion to maintain the doubly-linked list that
preserves insertion order. This is corroborated by benchmarks in
Performance: TreeSet vs HashSet vs LinkedHashSet,
which show that LinkedHashSet and HashSet perform virtually identically,
while TreeSet (O(log n)) is measurably slower. The sets affected by this
change are small — they hold only the token and rule references appearing in a
single grammar rewrite rule — so even the constant overhead is inconsequential
in practice.

Remaining known limitation

After this fix, generated files still differ between runs in exactly one
place: the wall-clock timestamp ANTLR3 writes into the header comment of
every generated file:

// $ANTLR 3.5.3 org/stringtemplate/v4/compiler/STParser.g 2026-04-14 08:54:49

This does not affect compiled artefacts (.class files / JARs): the
timestamp lives in a comment, which the Java compiler discards, so
binary outputs remain reproducible. Eliminating the timestamp entirely would
be a separate follow-up change.
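
Until the timestamp is removed, source-level reproducibility can still be
checked by ignoring the header line before hashing. The following is a rough
sketch, not part of this PR: the class and method names are hypothetical, and
it assumes the header is the first line of the file and starts with
"// $ANTLR ".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class GeneratedSourceDigest {
    /** Drops the first line if it looks like the ANTLR header (assumed prefix). */
    static String stripHeader(String source) {
        if (!source.startsWith("// $ANTLR ")) {
            return source;
        }
        int eol = source.indexOf('\n');
        return eol < 0 ? "" : source.substring(eol + 1);
    }

    /** Returns the lowercase hex SHA-256 of the text, encoded as UTF-8. */
    static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /** Hash of the generated source with the timestamped header line ignored. */
    static String normalizedDigest(String source) {
        return sha256(stripHeader(source));
    }

    public static void main(String[] args) {
        String generated =
            "// $ANTLR 3.5.3 org/stringtemplate/v4/compiler/STParser.g 2026-04-14 08:54:49\n"
            + "public class STParser {}\n";
        System.out.println(normalizedDigest(generated));
    }
}
```

Two builds whose generated files differ only in the header timestamp should
produce the same normalizedDigest value.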