cgr's C frontend (tree-sitter, the default CPP_FRONTEND=treesitter) silently drops real function definitions that follow a #define macro whose body contains a block comment. The root cause is a bug in the tree-sitter-c grammar, present through the latest release (0.24.2). This issue tracks the investigation so we can file a well-supported upstream report against tree-sitter/tree-sitter-c after confirming the analysis.
Root cause (minimal reproduction)
A block comment /* */ embedded between tokens inside a #define macro body makes tree-sitter-c emit an ERROR node, and the parser fails to recover, so the next top-level declaration is dropped from the tree:
#define M do { /*c*/; } while (0)
void after(void) {}
Parsing the above yields (translation_unit (preproc_def ...) (ERROR ...)) with no function_definition for after. The same macro without the comment, or with the comment at the very end of the body, parses cleanly.
GNU bison emits exactly this shape. Its generated parser contains a macro such as:
#define FAIL(loc, msg) \
    do { \
        location l = loc; \
        yyerror(&l, answer, errors, locations, lexer_param_ptr, msg); \
        /*YYERROR*/; \
    } while (0)
void yyerror(YYLTYPE *loc, ...) { ... }
The /*YYERROR*/ comment inside FAIL triggers the bug, and the yyerror definition immediately after it is never registered.
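As a sanity check that the input is valid C, the three variants of the minimal reproduction can be placed in one translation unit and compiled. The sketch below is illustrative only: the names `M_MID`, `M_NONE`, `M_END`, `after_*`, and `bodies_match` are hypothetical and do not come from cgr or tree-sitter-c. It relies on the fact that the C standard replaces each comment with a single space (translation phase 3) before preprocessing directives are executed (phase 4), so a conforming compiler sees the same replacement list for all three macros. A grammar that parses the raw text gets no such help and has to handle the comment itself.

```c
#include <string.h>

/* Two-level stringification so the macro argument is expanded first. */
#define STR(x) #x
#define XSTR(x) STR(x)

/* Variant 1: block comment embedded between tokens (the reported trigger). */
#define M_MID do { /*c*/; } while (0)
int after_mid(void) { M_MID; return 1; }

/* Variant 2: the same macro with no comment. */
#define M_NONE do { ; } while (0)
int after_none(void) { M_NONE; return 2; }

/* Variant 3: comment at the very end of the macro body. */
#define M_END do { ; } while (0) /*c*/
int after_end(void) { M_END; return 3; }

/* Returns 1 if all three macros expand to the same token spelling. */
int bodies_match(void) {
    const char *mid = XSTR(M_MID);
    const char *none = XSTR(M_NONE);
    const char *end = XSTR(M_END);
    return strcmp(mid, none) == 0 && strcmp(none, end) == 0;
}
```

If a compiler accepts this file and `bodies_match()` returns 1, the three variants are equivalent after preprocessing, which supports treating the dropped `function_definition` as a grammar defect and not a property of the source.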
Impact on cgr
Surfaced by the C multi-language retrieval eval (PR #554) on the jq codebase. In jq's bison-generated parser.c, tree-sitter produces 20 ERROR nodes, and the void yyerror(...) definition at parser.c:406 falls inside one of them. cgr therefore never creates a yyerror function node and cannot resolve any call to it. Any function definition that follows a comment-bearing #define macro in machine-generated C is affected. Hand-written C, and most source in general, is unaffected; the practical blast radius is generated parsers and lexers.
cgr's resolution logic is correct (zero misresolutions were found); this is purely a parse-coverage hole inherited from the grammar.
Version findings
Reproduces on the pinned tree-sitter-c 0.24.1 and on the latest 0.24.2, so a version bump does not fix it.
Distinct from existing upstream issues: #235 is a comment after the macro value being absorbed into preproc_arg with no downstream corruption, and #55 (closed) concerns a multiline comment. This case is an embedded block comment that produces an ERROR node and corrupts the following declaration.
Action items
Reduce the grammar trigger further and confirm against a fresh tree-sitter-c build from master.
File the upstream issue on tree-sitter/tree-sitter-c with the minimal reproduction and parse trees.
Optionally prepare an upstream grammar patch with a corpus test.
Once filed, link the upstream issue from evals/README.md where the limitation is documented.
Notes
A heuristic that scrapes definitions out of ERROR regions is deliberately avoided as a workaround. The fix belongs in the grammar. An alternative for affected projects is to route C through cgr's existing libclang frontend (CPP_FRONTEND=libclang), which preprocesses and parses generated code correctly.