Unable to COUNT(DISTINCT x) on nodes #108

Description

@prrao87

A common query pattern in Cypher benchmarks is to count distinct nodes and return that count in the projection.

Issue

The following query works in other graph systems that support Cypher (Neo4j, Kuzu, Ladybug), but fails at the parsing stage in lance-graph.

MATCH (p:Person)-[:workAt]->(o:Organisation)
RETURN COUNT(DISTINCT p.id) AS num_e, o.id
ORDER BY num_e DESC
LIMIT 1

This fails:

Error: ValueError: Cypher parse error at position 74: Unexpected input after query: (DISTINCT p.id) AS num_e, o.id
        ORDER BY num_e DESC
        LIMIT 1

The usual workaround is to deduplicate in a WITH clause first, but that doesn't work in lance-graph either (until we have a new release) because of #102. It fails as shown below.

MATCH (p:Person)-[:workAt]->(o:Organisation)
WITH DISTINCT p.id AS pid, o.id AS oid
RETURN COUNT(pid) AS num_e, oid
ORDER BY num_e DESC
LIMIT 1

Returns:

Error: ValueError: Cypher parse error at position 0: Failed to parse Cypher query: Parsing Error: Error { input: "WITH DISTINCT p.id AS pid, o.id AS oid\n        RETURN COUNT(pid) AS num_e, oid\n        ORDER BY num_e DESC\n        LIMIT 1\n    ", code: Tag }

Script to repro

Here's a minimal script to repro:

from __future__ import annotations

import pyarrow as pa
from lance_graph import CypherQuery, GraphConfig


def main() -> None:
    # Minimal in-memory graph: Persons workAt Organisations.
    persons = pa.table({"id": [1, 2]})
    orgs = pa.table({"id": [10, 11], "type": ["company", "company"]})
    work_at = pa.table({"src": [1, 2, 1], "dst": [10, 10, 11]})

    cfg = (
        GraphConfig.builder()
        .with_node_label("Person", "id")
        .with_node_label("Organisation", "id")
        .with_relationship("workAt", "src", "dst")
        .build()
    )

    datasets = {
        "Person": persons,
        "Organisation": orgs,
        "workAt": work_at,
    }

    query = """
        MATCH (p:Person)-[:workAt]->(o:Organisation)
        RETURN COUNT(DISTINCT p.id) AS num_e, o.id
        ORDER BY num_e DESC
        LIMIT 1
    """

    print(query)
    try:
        result = CypherQuery(query).with_config(cfg).execute(datasets)
        print(result)
    except Exception as exc:
        print(f"Error: {type(exc).__name__}: {exc}")


if __name__ == "__main__":
    main()
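
For reference, here is a rough sketch of the result I'd expect on the repro data, worked out in plain Python without lance-graph. The helper names (`count_distinct_direct`, `count_via_with_distinct`, `top_org`) are made up for illustration, and it assumes `p.id` is never null. It also checks that the WITH DISTINCT rewrite gives the same counts as COUNT(DISTINCT ...) on this data.

```python
from collections import defaultdict

# Same edges as the repro script: (person id, organisation id).
work_at = [(1, 10), (2, 10), (1, 11)]


def count_distinct_direct(edges):
    """Mimics COUNT(DISTINCT p.id) grouped by o.id."""
    persons_by_org = defaultdict(set)
    for pid, oid in edges:
        persons_by_org[oid].add(pid)
    return {oid: len(pids) for oid, pids in persons_by_org.items()}


def count_via_with_distinct(edges):
    """Mimics WITH DISTINCT p.id, o.id followed by COUNT(pid) per oid."""
    counts = defaultdict(int)
    for _pid, oid in set(edges):  # WITH DISTINCT deduplicates (pid, oid) pairs
        counts[oid] += 1
    return dict(counts)


def top_org(counts):
    """Mimics ORDER BY num_e DESC LIMIT 1, returned as (num_e, o.id)."""
    oid, num_e = max(counts.items(), key=lambda kv: kv[1])
    return num_e, oid


direct = count_distinct_direct(work_at)
rewritten = count_via_with_distinct(work_at)
print(top_org(direct))  # (2, 10): organisation 10 has two distinct persons
```

So on this data both forms of the query should return a single row with num_e = 2 and o.id = 10.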

Expectation

Counting the number of distinct nodes via the above pattern is essential for some upcoming LDBC benchmarks I plan to run in lance-graph. I think this would be a great addition to the query parser's repertoire, and I would really appreciate it if this particular issue could be prioritized, so that we can expand the benchmarks we test with lance-graph and draw more community members in. Thank you!

cc @ChunxuTang @beinan
