Skip to content

[java] Tighten JSON number lexer to match RFC 8259#17739

Open
shs96c wants to merge 3 commits into
SeleniumHQ:trunkfrom
shs96c:shs-json-number-state-machine
Open

[java] Tighten JSON number lexer to match RFC 8259#17739
shs96c wants to merge 3 commits into
SeleniumHQ:trunkfrom
shs96c:shs-json-number-state-machine

Conversation

@shs96c

@shs96c shs96c commented Jul 2, 2026

Copy link
Copy Markdown
Member

nextNumber() collected any of - + 0-9 . e E and handed the string to Long.valueOf / BigDecimal. It accepted several spec-invalid forms:

  • +5 (leading plus)
  • 01, 007 (leading zeros)
  • .5 (no digit before decimal)
  • 5. (no digit after decimal)
  • 1e, - (sign / exponent without digits)

Very large exponents (1e9999) also silently produced Double.POSITIVE_INFINITY, which JSON cannot represent.

We now:

  • Have replaced the collector with a state machine that follows number = [minus] int [frac] [exp] (RFC 8259 §6).
  • Reject non-finite double values with a clear message.
  • Drop + from peek()'s NUMBER classifier — it never was a valid JSON number start.

@selenium-ci selenium-ci added the C-java Java Bindings label Jul 2, 2026
The previous nextNumber() collected any subset of '- + 0-9 . e E' and
then handed the string to Long.valueOf or BigDecimal. That accepted
several spec-invalid forms: '+5' (leading plus), '01' (leading zero),
'.5' (no digit before decimal), '5.' (no digit after decimal), '1e'
(exponent without digits), '-' alone. Very large exponents also
silently produced Double.POSITIVE_INFINITY, for which JSON has no
representation.

Replace the collector with a proper state machine that follows the
grammar 'number = [minus] int [frac] [exp]', and reject non-finite
doubles. Also drop '+' from the peek() classifier so it no longer
claims to see a JSON number.
@shs96c shs96c force-pushed the shs-json-number-state-machine branch from 53d5bf5 to ee9fa0c Compare July 2, 2026 12:47
@shs96c shs96c marked this pull request as ready for review July 2, 2026 13:12
@qodo-code-review

Copy link
Copy Markdown
Contributor

PR Summary by Qodo

Tighten JsonInput number lexing to RFC 8259 grammar and reject non-finite doubles

🐞 Bug fix 🧪 Tests 🕐 20-40 Minutes

Grey Divider

AI Description

• Replace permissive number token collection with an RFC 8259 §6 state machine.
• Reject spec-invalid numeric forms (leading '+', leading zeros, missing digits).
• Fail fast on double overflow/NaN and add regression coverage for valid/invalid cases.
Diagram

graph TD
  T["JsonInputTest"] --> N["JsonInput.nextNumber()"] --> SM{"RFC 8259 number grammar"} --> OUT["Number (Long/double)"]
  N --> IN[("CharInput peek/read")]
  SM --> EX{{"JsonException"}}

  subgraph Legend
    direction LR
    _code["Code"] ~~~ _io[("Input")] ~~~ _decision{"Decision"} ~~~ _err{{"Error"}}
  end
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Regex/grammar pre-check before parsing
  • ➕ Compact validation logic that can be easier to audit against RFC grammar
  • ➕ Keeps parsing (Long/BigDecimal) separate from validation concerns
  • ➖ Harder to produce precise, localized error messages during streaming reads
  • ➖ Still needs careful integration with incremental input consumption
2. Always parse as BigDecimal and validate by re-serializing
  • ➕ Avoids double overflow concerns by keeping arbitrary precision longer
  • ➕ Potentially simplifies numeric type decisions
  • ➖ Behavioral change: callers currently receive Long for integers and double for decimals/exponents
  • ➖ Re-serialization/normalization does not reliably detect all original lexical invalid forms (e.g., leading zeros) without extra checks
3. Delegate to a standard JSON parser for tokenization
  • ➕ Leverages mature RFC-compliant number lexer implementations
  • ➕ Reduces maintenance surface area for lexer correctness
  • ➖ Likely invasive change to JsonInput streaming API and existing call sites
  • ➖ Adds dependency/behavioral differences not aligned with current lightweight parser

Recommendation: The PR’s state-machine approach is the best fit for a streaming lexer: it enforces RFC 8259 at read time, preserves existing return-type behavior (Long fast-path vs double), and provides targeted failures for invalid forms and non-finite doubles. Alternatives either complicate streaming consumption (regex) or require broader API/behavior changes (BigDecimal-only or external parser).

Files changed (2) +115 / -36

Bug fix (1) +67 / -36
JsonInput.javaEnforce RFC 8259 number grammar and reject non-finite doubles +67/-36

Enforce RFC 8259 number grammar and reject non-finite doubles

• Removes '+' from NUMBER detection in peek(). Replaces the permissive character collector in nextNumber() with a grammar-driven state machine that rejects leading zeros, missing digits around '.'/exponent, and missing integer parts. Adds a finite-check for BigDecimal-to-double conversion and introduces a small digit helper.

java/src/org/openqa/selenium/json/JsonInput.java

Tests (1) +48 / -0
JsonInputTest.javaAdd coverage for spec-invalid numbers and double overflow +48/-0

Add coverage for spec-invalid numbers and double overflow

• Introduces tests that assert JsonInput rejects previously-accepted RFC-invalid numeric lexemes and still accepts valid forms. Adds an explicit regression test ensuring extreme exponents that overflow to Infinity are rejected with an out-of-range error message.

java/test/org/openqa/selenium/json/JsonInputTest.java

@qodo-code-review

qodo-code-review Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📜 Skill insights (0)

Context used
✅ Compliance rules (platform): 18 rules

Grey Divider


Informational

1. EOF shown as U+FFFF ✓ Resolved 🐞 Bug ◔ Observability
Description
In nextNumber(), several exception messages cast input.peek() (an int) to char; when the stream is
exhausted this renders EOF (-1) as '\uFFFF', producing misleading diagnostics for inputs like "-",
"5.", or "1e".
This is especially confusing because the codebase explicitly treats U+FFFF as a valid literal
character, so the error text can look like it saw a real character rather than end-of-input.
Code

java/src/org/openqa/selenium/json/JsonInput.java[R234-247]

+    int first = input.peek();
+    if (first == '0') {
+      builder.append((char) input.read());
+      // Leading zeros ("00", "01", ...) are not allowed.
+      if (isDigit(input.peek())) {
+        throw new JsonException("Leading zeros are not permitted in JSON numbers. " + input);
      }
-    } while (read);
+    } else if (first >= '1' && first <= '9') {
+      while (isDigit(input.peek())) {
+        builder.append((char) input.read());
+      }
+    } else {
+      throw new JsonException("Expected digit but saw '" + (char) first + "'. " + input);
+    }
Evidence
Input.peek() returns Input.EOF (-1) at end-of-input, and nextNumber() now casts these ints to
char inside error messages, which turns -1 into \uFFFF. The test suite also explicitly asserts
that literal U+FFFF is valid input, so reporting EOF as \uFFFF is ambiguous and confusing during
debugging.

java/src/org/openqa/selenium/json/JsonInput.java[223-281]
java/src/org/openqa/selenium/json/Input.java[31-38]
java/test/org/openqa/selenium/json/JsonInputTest.java[294-309]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`JsonInput.nextNumber()` builds error messages by casting `input.peek()` (an `int`) to `char`. When `peek()` returns `Input.EOF` (`-1`), the cast becomes `\uFFFF`, so errors at end-of-input misleadingly report that a U+FFFF character was seen.

### Issue Context
`Input.EOF` is intentionally `-1` to avoid colliding with any UTF-16 code unit, including U+FFFF; and there is a unit test ensuring U+FFFF is treated as a literal character.

### Fix Focus Areas
- java/src/org/openqa/selenium/json/JsonInput.java[223-281]
- java/src/org/openqa/selenium/json/Input.java[31-38]

### Suggested fix
- Add a small helper (e.g., `describeChar(int c)`) that returns `"<EOF>"` when `c == Input.EOF`, otherwise returns a quoted printable character.
- Use that helper in the three `JsonException` messages added in `nextNumber()` (expected digit, expected digit after '.', expected exponent digit), so EOF is reported clearly as EOF rather than `\uFFFF`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Previous review results

Review updated until commit e1f3ed2

Results up to commit 9f6599b


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)


Informational
1. EOF shown as U+FFFF ✓ Resolved 🐞 Bug ◔ Observability
Description
In nextNumber(), several exception messages cast input.peek() (an int) to char; when the stream is
exhausted this renders EOF (-1) as '\uFFFF', producing misleading diagnostics for inputs like "-",
"5.", or "1e".
This is especially confusing because the codebase explicitly treats U+FFFF as a valid literal
character, so the error text can look like it saw a real character rather than end-of-input.
Code

java/src/org/openqa/selenium/json/JsonInput.java[R234-247]

+    int first = input.peek();
+    if (first == '0') {
+      builder.append((char) input.read());
+      // Leading zeros ("00", "01", ...) are not allowed.
+      if (isDigit(input.peek())) {
+        throw new JsonException("Leading zeros are not permitted in JSON numbers. " + input);
      }
-    } while (read);
+    } else if (first >= '1' && first <= '9') {
+      while (isDigit(input.peek())) {
+        builder.append((char) input.read());
+      }
+    } else {
+      throw new JsonException("Expected digit but saw '" + (char) first + "'. " + input);
+    }
Evidence
Input.peek() returns Input.EOF (-1) at end-of-input, and nextNumber() now casts these ints to
char inside error messages, which turns -1 into \uFFFF. The test suite also explicitly asserts
that literal U+FFFF is valid input, so reporting EOF as \uFFFF is ambiguous and confusing during
debugging.

java/src/org/openqa/selenium/json/JsonInput.java[223-281]
java/src/org/openqa/selenium/json/Input.java[31-38]
java/test/org/openqa/selenium/json/JsonInputTest.java[294-309]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`JsonInput.nextNumber()` builds error messages by casting `input.peek()` (an `int`) to `char`. When `peek()` returns `Input.EOF` (`-1`), the cast becomes `\uFFFF`, so errors at end-of-input misleadingly report that a U+FFFF character was seen.

### Issue Context
`Input.EOF` is intentionally `-1` to avoid colliding with any UTF-16 code unit, including U+FFFF; and there is a unit test ensuring U+FFFF is treated as a literal character.

### Fix Focus Areas
- java/src/org/openqa/selenium/json/JsonInput.java[223-281]
- java/src/org/openqa/selenium/json/Input.java[31-38]

### Suggested fix
- Add a small helper (e.g., `describeChar(int c)`) that returns `"<EOF>"` when `c == Input.EOF`, otherwise returns a quoted printable character.
- Use that helper in the three `JsonException` messages added in `nextNumber()` (expected digit, expected digit after '.', expected exponent digit), so EOF is reported clearly as EOF rather than `\uFFFF`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Qodo Logo

Comment thread java/src/org/openqa/selenium/json/JsonInput.java
Casting Input.peek() to (char) turned end-of-input (-1) into U+FFFF,
which is misleading now that U+FFFF is treated as a legitimate string
character. Route the three sites in nextNumber() through a small helper
that renders EOF as <EOF> and printable code units in quotes.
@qodo-code-review

Copy link
Copy Markdown
Contributor

Code review by qodo was updated up to the latest commit e1f3ed2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-java Java Bindings

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants