
tree-sitter-c drops function definitions after a #define whose body contains a block comment (affects C frontend on generated parsers) #555

Description

@vitali87

Summary

cgr's C frontend (tree-sitter, the default CPP_FRONTEND=treesitter) silently drops real function definitions that follow a #define macro whose body contains a block comment. The root cause is a bug in the tree-sitter-c grammar, present through the latest release (0.24.2). This issue tracks the investigation so we can file a well-supported upstream report against tree-sitter/tree-sitter-c after confirming the analysis.

Root cause (minimal reproduction)

A block comment /* */ embedded between tokens inside a #define macro body makes tree-sitter-c emit an ERROR node, and the parser fails to recover, so the next top-level declaration is dropped from the tree:

#define M do { /*c*/; } while (0)
void after(void) {}

Parsing the above yields (translation_unit (preproc_def ...) (ERROR ...)) with no function_definition for after. The same macro without the comment, or with the comment at the very end of the body, parses cleanly.

GNU bison emits exactly this shape. Its generated parser contains a macro such as:

#define FAIL(loc, msg)                                             \
  do {                                                             \
    location l = loc;                                              \
    yyerror(&l, answer, errors, locations, lexer_param_ptr, msg);  \
    /*YYERROR*/;                                                   \
  } while (0)

void yyerror(YYLTYPE* loc, ...) { ... }

The /*YYERROR*/ comment inside FAIL triggers the bug, and the yyerror definition immediately after it is never registered.
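To gauge how many macros in a codebase have this shape before the grammar is fixed, a rough standard-library scan can help. The sketch below is not part of cgr; the names `logical_lines` and `macros_with_embedded_comment` are hypothetical. It assumes the trigger is "a block comment followed by further tokens in the same #define", which matches the two cases above but may over-report, and it does not account for `/*` inside string literals.

```python
import re

# Hypothetical helper, not part of cgr or tree-sitter.
_DEFINE = re.compile(r"^[ \t]*#[ \t]*define\b")
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def logical_lines(source):
    """Yield (first_line_number, text) with backslash continuations joined."""
    lines = source.split("\n")
    i = 0
    while i < len(lines):
        start = i + 1  # 1-based line number where this logical line begins
        parts = [lines[i]]
        # Trailing whitespace after the backslash is tolerated here.
        while parts[-1].rstrip().endswith("\\") and i + 1 < len(lines):
            parts[-1] = parts[-1].rstrip()[:-1]  # drop the backslash
            i += 1
            parts.append(lines[i])
        i += 1
        yield start, " ".join(parts)


def macros_with_embedded_comment(source):
    """Return start lines of #defines with a block comment before more tokens."""
    hits = []
    for line_no, text in logical_lines(source):
        if not _DEFINE.match(text):
            continue
        for match in _BLOCK_COMMENT.finditer(text):
            # Ignore any further comments; only real tokens after the
            # comment make it "embedded" rather than trailing.
            rest = _BLOCK_COMMENT.sub("", text[match.end():]).strip()
            if rest:
                hits.append(line_no)
                break
    return hits
```

Running `macros_with_embedded_comment` over each generated `.c` file gives a list of candidate macro locations to compare against the ERROR nodes tree-sitter actually reports; it is a triage aid only, not a substitute for parsing.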

Impact on cgr

Surfaced by the C multi-language retrieval eval (PR #554) on the jq codebase. In jq's bison-generated parser.c, tree-sitter produces 20 ERROR nodes, and the void yyerror(...) definition at parser.c:406 falls inside one. cgr therefore never creates a yyerror function node and cannot resolve any call to it. Any function definition that follows a comment-bearing #define macro in machine-generated C is affected. Hand-written C, and most source in general, is unaffected; the practical blast radius is generated parsers and lexers.

cgr's resolution logic is correct (zero misresolutions were found); this is purely a parse coverage hole inherited from the grammar.
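To confirm that a specific definition line sits inside an ERROR region, the parse tree can be walked for its outermost ERROR nodes. This is a diagnostic sketch only, not a workaround that recovers definitions. It assumes node objects shaped like py-tree-sitter's `Node` (attributes `type`, `children`, `start_point`, `end_point`, with 0-based rows); the function names `error_spans` and `line_in_error` are hypothetical.

```python
def error_spans(node):
    """Return (start_row, end_row) for each outermost ERROR node under node.

    Rows are 0-based, following the tree-sitter convention.
    """
    spans = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.type == "ERROR":
            spans.append((current.start_point[0], current.end_point[0]))
            continue  # do not descend: nested ERROR nodes are already covered
        # Reverse so children are visited in source order.
        stack.extend(reversed(current.children))
    return spans


def line_in_error(spans, line_no):
    """True if the 1-based source line falls inside any ERROR span."""
    row = line_no - 1
    return any(start <= row <= end for start, end in spans)
```

With a real tree, `error_spans(tree.root_node)` would give the spans, and `line_in_error(spans, 406)` would check the yyerror line cited above.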

Version findings

  • Reproduces on the pinned tree-sitter-c 0.24.1 and on the latest 0.24.2, so a version bump does not fix it.
  • Distinct from existing upstream issues: tree-sitter-c #235 is a comment after the macro value being absorbed into preproc_arg with no downstream corruption, and tree-sitter-c #55 (closed) is a multiline comment. This case is an embedded block comment that errors and corrupts the following declaration.

Action items

  • Reduce the grammar trigger further and confirm against a fresh tree-sitter-c build from master.
  • File the upstream issue on tree-sitter/tree-sitter-c with the minimal reproduction and parse trees.
  • Optionally prepare an upstream grammar patch with a corpus test.
  • Once filed, link the upstream issue from evals/README.md where the limitation is documented.

Notes

A heuristic that scrapes definitions out of ERROR regions is deliberately avoided as a workaround. The fix belongs in the grammar. An alternative for affected projects is to route C through cgr's existing libclang frontend (CPP_FRONTEND=libclang), which preprocesses and parses generated code correctly.
