Skip to content

Expression parser docs and pyparsing notes #2860

@ptmcg

Description

@ptmcg
Image [pyiceberg_diag.html](https://github.com/user-attachments/files/24314065/pyiceberg_diag.html)

Feature Request / Improvement

I'm happy to see that pyparsing is part of the pyiceberg project. I've attached two files that are railroad diagrams for this parser:

  • pyiceberg_diag.html - browseable railroad diagram (SVG), non-terrminal elements are links to subdiagrams
  • pyiceberg_parser.png - static PNG version of the same diagram

Overall I think the parser is well-structured, and makes good use of the infix_notation, one_of, and DelimitedList helpers.

My suggestions are largely cosmetic, but may add some parse-time performance improvements.

Recommendations to use set_name are either for better diagramming, better exception messages, or both.

set_results_name() vs. set_name()

I'd like to point out the distinction between set_name and set_results_name. set_name is most useful to describe the expression itself, in abstract. set_results_name should be used to mark expressions with some contextual name, so that their parsed value can be retrieved by that name. I usually think of set_name describes what the expression is, and set_results_name is how the expression is used.

For example, these expressions should probably use set_name, not set_results_name:

boolean = one_of(["true", "false"], caseless=True).set_results_name("boolean")
string = sgl_quoted_string.set_results_name("raw_quoted_string")
decimal = common.real().set_results_name("decimal")
integer = common.signed_integer().set_results_name("integer")
literal = Group(string | decimal | integer | boolean).set_results_name("literal")
literal_set = Group(
    DelimitedList(string) | DelimitedList(decimal) | DelimitedList(integer) | DelimitedList(boolean)
).set_results_name("literal_set")

You'll also find that exception messages are less cryptic with more use of set_name.

Word(nums).parse_string("NAN")
# raises ParseException: Expected W:(0-9), found 'NAN'

Word(nums).set_name("integer").parse_string("NAN")
# raises ParseException: Expected integer, found 'NAN'

There are 11 calls to set_results_name.

Many of the set_results_name calls are probably better as set_name calls. I try to reserve set_results_name for expressions that need to be referenced in a parse action, or other post-parsing code. "op" is a good example of using set_results_name.

"column" is an excellent example of being an exception to this guideline. It is used in many places, but always as the subject of some larger expression. "column" should be defined using both set_results_name and set_name.

Lastly, you can keep your set_name and set_results_name calls to a minimum:

  • insert a call to pyparsing.autoname_elements(), which will call set_name on all locally defined ParserElements that have not already been given a name (making this call before calling the infix_notation function will use the names as part of that functions internal expression building)
  • use the expr("results_name") short-cut for expr.set_results_name("results_name")

Collapse IS/IS NOT calls

You can reduce the number of terms in your parser by merging IS and IS NOT expressions (fewer terms can translate into faster parsing). Here you define separate is_null and not_null expressions in building null_check:

is_null = column + IS + NULL
not_null = column + IS + NOT + NULL
null_check = is_null | not_null

I propose defining an IS_NOT expression, so that null_check can simplify down (and also reduce 2 parse action functions down to one):

IS_NOT = (IS + NOT).add_parse_action(lambda: "IS_NOT")
null_check = column + (IS_NOT | IS)("op") + NULL

@null_check.set_parse_action
def _(result: ParseResults) -> BooleanExpression:
    expr_class = IsNull if result.op == "IS" else NotNull
    return expr_class(result.column)

Similar treatment can be given to IN/NOT IN and LIKE/NOT LIKE.

Creating the diagram

I wrote this little script to build the attached diagrams:

import pyiceberg.expressions.parser as ibp

ibp.boolean_expression.create_diagram("pyiceberg_parser_diag.html")

Run this before making any changes, and you'll get a diagram that is difficult to navigate, probably even difficult to view! Make a few set_name calls (like adding set_name("column") to column) and regenerate the diagram and see how the structure becomes clearer. The non-terminals in the diagram are actually links to their corresponding sub-diagrams, so navigation in a large diagram is much easier. I've gone back and refactored a number of the pyparsing examples, after generating their diagrams.

Other notes

Not really a pyparsing note, but just offering this alternative style for your if-elif-else chains:

@left_ref.set_parse_action
def _(result: ParseResults) -> BooleanExpression:
    op_classes = {
        ">": GreaterThan,
        ">=": GreaterThanOrEqual,
        "<": LessThan,
        "<=": LessThanOrEqual,
        "=": EqualTo,
        "==": EqualTo,
        "!=": NotEqualTo,
        "<>": NotEqualTo,
    }
    op_class = op_classes.get(result.op)
    if op_class is not None:
        return op_class(result.column, result.literal)
    raise ValueError(f"Unsupported operation type: {result.op!r}")

I encourage people at work, when echoing erroneous values in an exception, to add the "!r" format marker. This encloses the value in single quotes and expands any non-printing characters to hex notation. A big help for when the parser accidentally includes trailing space in the operator string, and so it doesn't match any of the expected patterns.

Finally

Again, your parser is fine as-is, these suggestions just may make it a bit easier to maintain, and even a bit faster.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions