Skip to content

feat: compute iso-Spectral source ranges for YAML ruleset diagnostics (#32)#33

Open
eskenazit wants to merge 3 commits into
mainfrom
fix/issue-32-ruleset-source-ranges
Open

feat: compute iso-Spectral source ranges for YAML ruleset diagnostics (#32)#33
eskenazit wants to merge 3 commits into
mainfrom
fix/issue-32-ruleset-source-ranges

Conversation

@eskenazit

@eskenazit eskenazit commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Related Issue

Closes #32

#32 is scoped to YAML by design (it is a sub-issue of the #38 epic, which tracks the
other formats). This PR delivers the full YAML scope — built-in and polyglot ruleset
functions — so it closes #32. Non-YAML formats remain tracked under #38 (#34/#35/#36/#37).


What does this PR do?

Implements #32compute source ranges for ruleset diagnostics (path → SourceRange) — in
full for YAML documents, across both built-in and polyglot (JS/Python/Groovy) functions.

Until now every ruleset diagnostic carried range = null and a path equal to the rule's
given selector (e.g. $.consumes[*].baseUri), so consumers (CLI, SARIF, SDKs, the Crafter
editor extension) could not map a violation back to a precise location.

Built-in functions — real SourceMap + concrete paths

  • Real SourceMap for structured documentsJacksonSourceMap builds a dot-notation
    path → SourceRange index by streaming the YAML/JSON content with Jackson and recording
    each token's JsonLocation. Document.fromString(...) attaches it for structured
    documents (non-structured branch explicitly returns SourceMap.NONE).
  • 0-based ranges, iso-Spectral — ranges are 0-based on both axes with an exclusive
    end
    , matching Spectral and the LSP / vscode.Position convention the Crafter extension
    consumes. A quoted scalar starts at the opening quote and ends just past the closing quote.
  • Full YAML scalar spans (start and end) — both the start and the real end of each
    scalar (plain, single/double-quoted, block | / >), so a violation underlines the whole
    value. Quoting/escaping (\", \\, '') and trailing # comment / whitespace are honoured.
  • Concrete pathsJsonPathEvaluator enumerates the concrete matched path per violation
    (e.g. $.consumes[0].baseUri) via JsonPath AS_PATH_LIST, converted to dot notation.
  • Wiring in RuleExecutor / RulesetValidator — the resolved path looks up the
    SourceRange in the document's SourceMap and is used as the diagnostic path. An
    unresolvable path falls back to range = null, exactly as before — no regression.

Polyglot functions — per-violation path carried through the SPI (Layer 1)

  • RuleFunction SPI evolved to List<Violation> — a new Violation(message, path) record
    and a default evaluateViolations(...) let custom functions report a per-violation relative
    path. The 12 built-ins inherit the default (path null), so they are unchanged.
  • FunctionRegistry wires the ruleset context — custom providers are discovered via
    ServiceLoader and receive the ruleset's functionsDir / functions, so polyglot functions
    now actually execute in production (they were silently ignored before).
  • PolyglotRuleFunction reads message + path from each returned item; the path is
    combined with the rule base path and resolved against the SourceMap, so polyglot-function
    violations now carry a real SourceRange instead of null.

Scope — what is NOT in this PR


Tests

  • Source map (JacksonSourceMapTest) — resolves a known path to the expected 0-based,
    end-exclusive
    range for plain, quoted, escaped and block scalars; unknown paths and empty
    content return null. Targets are frozen against a Spectral provenance fixture.
  • End-scan helpers (JacksonSourceMapEndScanTest) — branch-level coverage of the YAML
    end-of-scalar scanners, including bounds guards and a terminal # (empty trailing
    comment, with one/two leading spaces and glued to the value) so the comment-marker scan
    reaches the last character, iso-Spectral.
  • Polyglot SPIViolationTest, FunctionRegistryTest, RuleExecutorTest
    (combinePath + a custom-function violation resolving to a concrete path and range),
    RulesetValidatorTest, PolyglotPathExtractionTest (extractPath / extractViolations
    edge cases), and PolyglotFunctionProviderTest (context-aware loading).
  • End-to-endPolyglotRulesetEndToEndTest drives a custom JS rule through the public
    validator API and asserts the diagnostic carries the exact path $.consumes[1].namespace
    and the precise SourceRange(3,15,3,18); CliIntegrationTest checks lint … --format json
    emits a populated range and a concrete path.

Full reactor mvn clean verify is green; JaCoCo 100% line + branch coverage holds for every
touched module.


Checklist

  • CI is green (build + tests)
  • Rebased on latest main
  • Small and focused — one concern per PR
  • Commit messages follow Conventional Commits
  • Apache 2.0 license header present on any new Java file

Agent Context (optional)

agent_name: GitHub Copilot
llm: Naftiko Claude Opus 4.8
tool: VS Code
confidence: high
source_event: Crafter VS Code extension migration from Spectral to Polychro (YAML scope of #32)
discovery_method: code_review
files_suspected:

  • polychro-api/src/main/java/io/polychro/spi/Document.java
  • polychro-api/src/main/java/io/polychro/spi/JacksonSourceMap.java
  • polychro-ruleset/src/main/java/io/polychro/ruleset/RuleExecutor.java
  • polychro-ruleset/src/main/java/io/polychro/ruleset/RulesetValidator.java
  • polychro-ruleset/src/main/java/io/polychro/ruleset/FunctionRegistry.java
  • polychro-ruleset-polyglot/src/main/java/io/polychro/ruleset/polyglot/PolyglotRuleFunction.java

eskenazit added a commit that referenced this pull request Jun 8, 2026
Document the intrinsic ambiguity of paths that traverse a key containing a
dot: a flat "x-meta.owner" key is indistinguishable from a nested
x-meta: { owner }, both keyed as $.x-meta.owner. The note is added to
JacksonSourceMap and to JsonPathEvaluator.toDotNotation.

Reword the existing dotted-key test to state honestly that it proves the
round-trip coincides for an isolated dotted key, not that the path is
unambiguous. Add a collision test pinning the putIfAbsent "first token wins"
behaviour when a flat dotted key and a nested structure both map to
$.x-meta.owner.

Add an inline comment documenting the putIfAbsent invariant, and document
that RuleExecutor.effectivePath's field argument must be a simple identifier
(not a JSONPath or bracket expression).

Refs #32
@eskenazit

Copy link
Copy Markdown
Contributor Author

Review feedback addressed — commit 504f3f6

Thanks for the review. All three findings are now addressed in 504f3f6:

🔴 Finding 1 — dotted-key ambiguity ($.x-meta.owner)
The existing validateShouldResolveSourceRangeWhenKeyContainsDot test passed by coincidence: a flat key "x-meta.owner" and a nested x-meta: { owner } both key to $.x-meta.owner, so the round-trip is consistent but the path is not unambiguous. Treated with the full option:

  • Documented the ambiguity on the JacksonSourceMap class Javadoc and on JsonPathEvaluator.toDotNotation.
  • Reworded the existing test (name kept, comment clarified) to state it proves the round-trip coincides for an isolated dotted key, not that the path is unambiguous.
  • Added validateShouldKeepFirstLocationWhenDottedKeyCollidesWithNestedPath, a collision test that puts a flat dotted key and a nested structure in the same document and pins the real behaviour: putIfAbsent keeps the first token encountered (the flat key on line 2 wins). This guards against a future change to the keying or to putIfAbsent → put.

🟡 Finding 2 — putIfAbsent invariant (JacksonSourceMap.index)
Added an inline comment: a given path is claimed by at most one token in a valid document; putIfAbsent guards malformed/dotted-key-colliding input from silently overwriting a prior valid location (first token wins).

🔵 Finding 3 — RuleExecutor.effectivePath contract
Extended the Javadoc to document that field must be a simple identifier (a plain object key), never a JSONPath or bracket expression, since it is appended verbatim after a ..

Validation: mvn clean verify -pl polychro-api,polychro-ruleset -am is green — 101 (api) + 419 (ruleset) tests pass, and JaCoCo reports "All coverage checks have been met" on both modules.

This PR remains a first step toward #32 (built-in functions only); the polyglot RuleFunction SPI path-carrying layer is still open, so it does not close the issue.

eskenazit added a commit that referenced this pull request Jun 8, 2026
Document the intrinsic ambiguity of paths that traverse a key containing a
dot: a flat "x-meta.owner" key is indistinguishable from a nested
x-meta: { owner }, both keyed as $.x-meta.owner. The note is added to
JacksonSourceMap and to JsonPathEvaluator.toDotNotation.

Reword the existing dotted-key test to state honestly that it proves the
round-trip coincides for an isolated dotted key, not that the path is
unambiguous. Add a collision test pinning the putIfAbsent "first token wins"
behaviour when a flat dotted key and a nested structure both map to
$.x-meta.owner.

Add an inline comment documenting the putIfAbsent invariant, and document
that RuleExecutor.effectivePath's field argument must be a simple identifier
(not a JSONPath or bracket expression).

Refs #32
@eskenazit eskenazit force-pushed the fix/issue-32-ruleset-source-ranges branch from 504f3f6 to 9799e8f Compare June 8, 2026 16:52
eskenazit added a commit that referenced this pull request Jun 8, 2026
…st (#32)

Review feedback (PR #33, @eskenazit): forContentShouldLocateScalarInside
ArrayObjectOnYaml only asserted startLine for $.consumes[0].baseUri,
while every other golden scalar (unquoted, quoted, escaped, block)
asserts both startColumn and endColumn. A regression in column
computation for array-nested scalars would have gone undetected.

Add startColumn (13, the opening quote), endColumn (35, exclusive past
the closing quote), endLine, and an endColumn > startColumn guard so the
array-nested case carries the same column coverage as the rest of the
golden fixture. Values measured empirically against JacksonSourceMap,
iso-Spectral. No production change.
@eskenazit eskenazit requested a review from jlouvel June 8, 2026 17:07
@eskenazit eskenazit changed the title feat: compute source ranges for built-in ruleset diagnostics (first step of #32) feat: compute iso-Spectral source ranges for YAML ruleset diagnostics (#32) Jun 9, 2026
eskenazit added 3 commits June 9, 2026 15:48
…agnostics (#32)

Compute 0-based, end-exclusive source ranges (iso-Spectral) for built-in
ruleset diagnostics on YAML capabilities. JacksonSourceMap spans each scalar's
full token (plain, quoted, escaped, block, array-nested), and the range is
propagated through RulesetValidator to every Diagnostic. Non-YAML formats keep
SourceMap.NONE pending #34-#37.

Refs #32
…nges (#32)

Wire polyglot (JS) ruleset functions into production validation and resolve a
source range per violation. The SPI carries a Violation record (message + path),
FunctionRegistry resolves functions per ruleset, and PolyglotRuleFunction
extracts the JsonPath of each violation so the diagnostic carries an
iso-Spectral range. Includes golden and path-extraction tests.

Refs #32
@eskenazit eskenazit force-pushed the fix/issue-32-ruleset-source-ranges branch from d2bf272 to a45106b Compare June 9, 2026 13:49
@eskenazit

eskenazit commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@jlouvel this is now OK for review !

FYI this issue is blocking crafter using Polychro

@jlouvel jlouvel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thorough, high-quality PR for issue #32. The iso-Spectral golden fixtures with documented Spectral provenance, the exhaustive YAML end-scan edge-case coverage, the graceful degradation (unparseable content -> SourceMap.NONE, unknown path -> null range), and the backward-compatible FunctionProvider SPI evolution (deprecated functions() with a default delegating to the new context-aware overload) are all excellent.

No blocking issues. The 5 notes below are polish / hardening suggestions only:

  1. RuleExecutor: evaluatePaths runs a second JSONPath traversal alongside evaluate, doubling cost per rule.
  2. JacksonSourceMap: content.split is re-run per YAML scalar, risking O(n^2) on large documents.
  3. JacksonSourceMap: block-scalar end +1 is intentional but easy to misread at the call site.
  4. RuleExecutor: the synthetic Document Javadoc claims the null format is never read; worth hardening the note.
  5. PolyglotRuleFunction: the retained extractMessages shim may be unused now that the executor only calls evaluateViolations.

for (String given : rule.given()) {
List<JsonNode> matches = evaluator.evaluate(root, given);
for (JsonNode match : matches) {
List<String> matchPaths = evaluator.evaluatePaths(root, given);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For every given selector, evaluatePaths(root, given) is called immediately after evaluate(root, given), performing a second full JSONPath traversal per rule. On large documents this roughly doubles the JSONPath cost. The correctness relies on Jayway returning the value-list and the AS_PATH_LIST path-list in identical order and cardinality (which the pathAt fallback hints is not guaranteed). Consider deriving both from a single traversal, or add a comment confirming the order/cardinality contract. Not blocking.

* @return a two-element array {@code [endLine, endColumn]} (both 0-based, end exclusive)
*/
static int[] endOfYamlScalar(String content, int startLine, int startColumn) {
String[] lines = content.split("\n", -1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

endOfYamlScalar calls content.split("\n", -1) and is invoked once per YAML scalar via toRange. For a document with many scalars this re-splits the entire content on every call, giving O(n^2) behaviour on large files. Consider splitting the content into lines once in index() and threading the line array down to the end-scan helpers. Minor / perf only.

trimmed--;
}
lastLine = i;
// include the trailing newline folded into the block scalar (iso-Spectral).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lastCol = trimmed + 1 to include the folded trailing newline is correct and iso-Spectral, but the + 1 differs from the plain mono-line scalar end (no +1) and produces an end column one past the physical line length. The class Javadoc explains the intent; a one-line comment right here reaffirming the +1 is the deliberate folded-newline inclusion would make the call site harder to misread. Nit.

* {@link Document}; production validation goes through {@link #execute(Rule, Document)}.
*
* <p>The synthetic {@link Document} carries {@link SourceMap#NONE} precisely so range resolution
* is skipped: every lookup returns a {@code null} range and the {@code null} format is never

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Javadoc states the synthetic Document(root, null, null, SourceMap.NONE, null) is safe because the null format is never read. That holds today only because SourceMap.NONE.resolve(...) short-circuits before any format-dependent logic runs. A brief note that this invariant depends on SourceMap.NONE never triggering format resolution would harden it against a future change to the NONE path. Nit.


/**
* Convert a script's result array into the {@code message} strings only, dropping any per-result
* {@code path}. Retained as a thin shim over {@link #extractViolations(Value)} for callers (and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that evaluate() delegates to evaluateViolations() and RuleExecutor only calls evaluateViolations, the retained extractMessages(Value) shim appears to have no production caller left. If it is only used by tests, a short note to that effect (or removing it) would avoid the impression that messages and violations are extracted on two separate passes. Confirm no path executes the sandboxed JS twice. Nit.

@jlouvel jlouvel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add minor comments, not blocking so approving

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: compute source ranges for ruleset diagnostics (path → SourceRange)

2 participants