fix(grammar): refactor CFG generation to use llguidance suffix syntax #167

Merged: Ki-Seki merged 6 commits into main from perf/build-cfg on Jan 23, 2026
Ki-Seki commented Jan 23, 2026

Refactor the `build_cfg` function to generate a flattened grammar structure compatible with llguidance.

Key changes:

- Flatten the `start` rule so that it sequences the tags and whitespace directly.
- Add `[suffix="..."]` and `[capture]` attributes to the lowercase rules (`m_i`) so that closing tags are consumed correctly.
- Move the regex definitions into uppercase terminals (`M_i`), which makes the greedy pattern `/(?s:.*)/` safe to use.
- Add the required `%llguidance {}` header.

This resolves grammar validation errors and prevents greedy regexes from swallowing closing tags or producing duplicate closing tags.
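
To illustrate the last point, here is a minimal sketch that uses Python's `re` module rather than llguidance itself. It assumes a single masked span whose content is `Paris`; the helper `split_on_suffix` is hypothetical and only mimics the suffix behaviour described in this PR (stop at the closing tag and consume it), so it should not be read as the engine's actual implementation.

```python
import re

TAG_END = "<|/MASKED|>"  # closing tag literal, taken from the example output in this PR

# Hypothetical model output: the content of one masked span plus its closing tag.
text = "Paris" + TAG_END

# On its own, a greedy "match anything" regex also accepts the closing tag as content.
greedy_content = re.match(r"(?s:.*)", text).group(0)


def split_on_suffix(s: str, suffix: str) -> tuple[str, str]:
    """Return (content, rest): content ends at the first suffix, which is consumed."""
    content, sep, rest = s.partition(suffix)
    if not sep:
        raise ValueError("closing tag not found")
    return content, rest


content, rest = split_on_suffix(text, TAG_END)
print(repr(greedy_content))  # the greedy match still contains the closing tag
print(repr(content), repr(rest))  # the suffix-style split keeps the tag out of the content
```

The contrast is the motivation for pairing the greedy terminal with a `suffix` attribute: the terminal may match anything, while the suffix decides where the content ends.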

Copilot AI review requested due to automatic review settings January 23, 2026 16:51

Ki-Seki commented Jan 23, 2026

Before/After Example

Query

```python
query = '<|GIM_QUERY|>The capital of <|MASKED desc="single word" regex="中国|法国"|><|/MASKED|> is Beijing<|MASKED desc="punctuation mark" regex="\\."|><|/MASKED|><|/GIM_QUERY|>'
```
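
For readers unfamiliar with the query format, the sketch below pulls the attributes out of each opening `MASKED` tag so that it is clear which `regex` ends up in which rule. The patterns `OPEN_TAG` and `ATTR_KV` and the function `extract_tag_attrs` are hypothetical and written only for this illustration; the project's own query parser is not shown in this PR and may behave differently.

```python
import re

query = '<|GIM_QUERY|>The capital of <|MASKED desc="single word" regex="中国|法国"|><|/MASKED|> is Beijing<|MASKED desc="punctuation mark" regex="\\."|><|/MASKED|><|/GIM_QUERY|>'

# Hypothetical patterns: an attribute is key="value", where the value may
# contain backslash-escaped characters and "|" but no unescaped double quote.
ATTR = r'\s+\w+="(?:[^"\\]|\\.)*"'
OPEN_TAG = re.compile(r'<\|MASKED((?:' + ATTR + r')*)\|>')
ATTR_KV = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def extract_tag_attrs(q: str) -> list[dict[str, str]]:
    """Return one attribute dict per opening MASKED tag, in document order."""
    return [dict(ATTR_KV.findall(m.group(1))) for m in OPEN_TAG.finditer(q)]


for i, attrs in enumerate(extract_tag_attrs(query)):
    # The index i corresponds to the m_i / M_i rule names used in the grammar.
    print(i, attrs)
```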

Old build_cfg and CFG

```python
def build_cfg(query: Query) -> str:
    """Build an LLGuidance context-free grammar (CFG) string based on the query object.

    LLGuidance syntax reference: https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md
    """
    num_tags = len(query.tags)
    grammar_first_line = f'''start: "{RESPONSE_PREFIX}" {" ".join(f"tag{i}" for i in range(num_tags))} "{RESPONSE_SUFFIX}"'''

    grammar_rest_lines = []
    for i, tag in enumerate(query.tags):
        # `/(?s:.)*?/` is a non-greedy match for any character including newlines
        content_pattern = f"/{tag.regex}/" if tag.regex else "/(?s:.)*?/"
        grammar_rest_lines.append(
            f'tag{i}: "{TAG_OPEN_LEFT} id=\\"m_{i}\\"{TAG_OPEN_RIGHT}" {content_pattern} "{TAG_END}"'
        )

    grammar = grammar_first_line + "\n" + "\n".join(grammar_rest_lines)

    is_error, msgs = validate_grammar_spec(get_grammar_spec(grammar))
    if is_error:
        raise ValueError(
            "Invalid CFG grammar constructed from the query object:\n"
            + "\n".join(msgs)
            + "\nWe recommend checking the syntax documentation at https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md"
        )
    return grammar
```

```python
old_cfg = 'start: "<|GIM_RESPONSE|>" tag0 tag1 "<|/GIM_RESPONSE|>"\ntag0: "<|MASKED id=\\"m_0\\"|>" /中国|法国/ "<|/MASKED|>"\ntag1: "<|MASKED id=\\"m_1\\"|>" /\\./ "<|/MASKED|>"'
```

New build_cfg and CFG

```python
def build_cfg(query: Query) -> str:
    """Build an LLGuidance context-free grammar (CFG) string based on the query object.

    Constructs a flattened grammar structure compatible with LLGuidance's suffix/capture logic.

    Ref:
    - https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md: Incomplete documentation of llguidance grammar syntax
    - https://github.com/guidance-ai/guidance/blob/main/guidance/_ast.py: LarkSerializer implementation
    - https://github.com/guidance-ai/llguidance: Source code
    """
    num_tags = len(query.tags)

    # 1. Header declaration
    lines = ["%llguidance {}"]

    # 2. Build start rule
    # Target format: start: "PREFIX" REGEX "OPEN_TAG_0" m_0 REGEX "OPEN_TAG_1" m_1 ... REGEX "SUFFIX"
    start_parts = [f'"{RESPONSE_PREFIX}"']

    for i in range(num_tags):
        # Add whitespace rule reference
        start_parts.append("REGEX")

        # Add opening tag literal, e.g.: "<|MASKED id=\"m_0\"|>"
        # Note escaping: id=\"m_{i}\"
        open_tag_str = f'"{TAG_OPEN_LEFT} id=\\"m_{i}\\"{TAG_OPEN_RIGHT}"'
        start_parts.append(open_tag_str)

        # Add content rule reference (lowercase m_i)
        start_parts.append(f"m_{i}")

    # Add trailing whitespace and suffix
    start_parts.append("REGEX")
    start_parts.append(f'"{RESPONSE_SUFFIX}"')

    lines.append(f"start: {' '.join(start_parts)}")

    # 3. Define whitespace rule (named REGEX to match examples, usually can also be called WS)
    lines.append(r"REGEX: /\s*/")

    # 4. Generate specific rules for each tag
    for i, tag in enumerate(query.tags):
        # Note: When used with suffix, using greedy match /(?s:.*)/ instead of /(?s:.)*?/ is correct and legal.
        pattern = f"/{tag.regex}/" if tag.regex else "/(?s:.*)/"

        # Rule m_i (logical layer):
        # - capture: tells the engine to capture this part.
        # - suffix: specifies the ending tag, the engine stops and consumes it when encountered.
        # Note: Here we reference the TAG_END constant (i.e., "<|/MASKED|>")
        lines.append(f'm_{i}[capture, suffix="{TAG_END}"]: M_{i}')

        # Rule M_i (regex layer):
        # Define the actual matching pattern for this tag.
        lines.append(f"M_{i}: {pattern}")

        # TODO: There may be many tags with "/(?s:.*)/" pattern, which can be inefficient.

    # 5. Assemble final string
    grammar = "\n".join(lines) + "\n"

    is_error, msgs = validate_grammar_spec(get_grammar_spec(grammar))
    if is_error:
        raise ValueError(
            "Invalid CFG grammar constructed from the query object:\n"
            + "\n".join(msgs)
            + "\nWe recommend checking the syntax documentation at https://github.com/guidance-ai/llguidance/blob/main/docs/syntax.md"
        )
    return grammar
```

```python
new_cfg = '%llguidance {}\nstart: "<|GIM_RESPONSE|>" REGEX "<|MASKED id=\\"m_0\\"|>" m_0 REGEX "<|MASKED id=\\"m_1\\"|>" m_1 REGEX "<|/GIM_RESPONSE|>"\nREGEX: /\\s*/\nm_0[capture, suffix="<|/MASKED|>"]: M_0\nM_0: /中国|法国/\nm_1[capture, suffix="<|/MASKED|>"]: M_1\nM_1: /\\./\n'
```
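
Both tags in the example carry a `regex`, so the `/(?s:.*)/` fallback is never exercised there. The standalone sketch below mirrors the new rule layout for a plain list of optional regex strings and prints the grammar line by line. The constant values are inferred from the example output in this comment and are assumptions, `build_cfg_sketch` is a hypothetical name, and the `validate_grammar_spec` step is left out so that the block runs without the project's dependencies.

```python
from typing import Optional

# Assumed values, inferred from the example output in this comment.
RESPONSE_PREFIX = "<|GIM_RESPONSE|>"
RESPONSE_SUFFIX = "<|/GIM_RESPONSE|>"
TAG_OPEN_LEFT = "<|MASKED"
TAG_OPEN_RIGHT = "|>"
TAG_END = "<|/MASKED|>"


def build_cfg_sketch(regexes: list[Optional[str]]) -> str:
    """Hypothetical, validation-free mirror of the new grammar layout."""
    lines = ["%llguidance {}"]

    # start: prefix, then (whitespace, opening tag literal, content rule) per tag.
    start_parts = [f'"{RESPONSE_PREFIX}"']
    for i in range(len(regexes)):
        start_parts.append("REGEX")
        start_parts.append(f'"{TAG_OPEN_LEFT} id=\\"m_{i}\\"{TAG_OPEN_RIGHT}"')
        start_parts.append(f"m_{i}")
    start_parts.append("REGEX")
    start_parts.append(f'"{RESPONSE_SUFFIX}"')
    lines.append(f"start: {' '.join(start_parts)}")

    # Whitespace terminal shared by all positions.
    lines.append(r"REGEX: /\s*/")

    for i, regex in enumerate(regexes):
        # Tags without a regex fall back to the greedy "match anything" pattern.
        pattern = f"/{regex}/" if regex else "/(?s:.*)/"
        lines.append(f'm_{i}[capture, suffix="{TAG_END}"]: M_{i}')
        lines.append(f"M_{i}: {pattern}")

    return "\n".join(lines) + "\n"


# One constrained tag and one unconstrained tag.
print(build_cfg_sketch(["中国|法国", None]))
```

Printing the result rather than inspecting the escaped string literal makes it easier to see that each `m_i` rule only adds the `capture` and `suffix` attributes, while the pattern itself lives in the matching `M_i` terminal.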


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings January 23, 2026 17:04


Copilot AI review requested due to automatic review settings January 23, 2026 17:46
codecov bot commented Jan 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Ki-Seki added the `good first issue` label Jan 23, 2026


Ki-Seki merged commit 3fa5e49 into main Jan 23, 2026
16 checks passed
Ki-Seki deleted the perf/build-cfg branch January 23, 2026 17:57