Adds subtree (descendant) matching to taxonomy where filters #1648

Draft
MA2153 wants to merge 12 commits into emdash-cms:main from MA2153:feat/taxonomy-subtree-where-filter

@MA2153 MA2153 commented Jun 29, 2026

Contributor

What does this PR do?

Adds a subtree operator to collection where taxonomy filters so selecting a parent term matches that term and all of its descendants, resolved in SQL:

where: { region: { subtree: "europe" } } // matches europe + every descendant region

Today the taxonomy where filter matches by exact slug only, so "this term or anything filed under it" (selecting a parent category in a faceted browse UI) is inexpressible. The only workaround — enumerating the subtree client-side and passing every descendant slug — expands to one bound parameter per slug and overflows D1's 100-bind-parameter cap (D1_ERROR: too many SQL variables) on deep hierarchies. It also can't be chunked without breaking keyset pagination.
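A minimal sketch of why the enumeration workaround overflows, assuming a plain `slug IN (...)` filter with one placeholder per slug. The helper names and the `D1_MAX_BOUND_PARAMS` constant are hypothetical illustrations, not code from this PR; the constant mirrors the 100-parameter cap cited above.

```typescript
// Hypothetical constant mirroring the D1 cap described in the PR text.
const D1_MAX_BOUND_PARAMS = 100;

interface BuiltFilter {
  sql: string;
  params: string[];
}

// Builds the filter the client-side workaround needs: one "?" per slug,
// so the bound-parameter count grows linearly with subtree size.
function buildEnumeratedSlugFilter(slugs: string[]): BuiltFilter {
  const placeholders = slugs.map(() => "?").join(", ");
  return { sql: `slug IN (${placeholders})`, params: [...slugs] };
}

// True when the filter alone would already exceed the cap.
function exceedsD1Cap(filter: BuiltFilter): boolean {
  return filter.params.length > D1_MAX_BOUND_PARAMS;
}

// A shallow tree fits; a 250-term subtree does not.
const small = buildEnumeratedSlugFilter(["europe", "france", "paris"]);
const deep = buildEnumeratedSlugFilter(
  Array.from({ length: 250 }, (_, i) => `region-${i}`),
);

console.log(small.sql, exceedsD1Cap(small), exceedsD1Cap(deep));
```

Splitting `deep` into several queries would keep each one under the cap, but each chunk would then sort and paginate independently, which is the keyset-pagination problem noted above.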

This resolves the subtree server-side from a single root slug via a recursive CTE over taxonomies.parent_id, so the bound-parameter count is independent of subtree size. After #1646 both parent_id and content_taxonomies.taxonomy_id live in translation_group space, so the walk is locale-correct and matches taxonomy_id directly.
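To make the semantics of the walk concrete, here is an in-memory equivalent of what the recursive CTE computes: seed with the root slugs, then repeatedly follow parent links downward. This is only an illustrative sketch; the `TermRow` shape and field names are assumptions (`group` stands in for translation_group, `parentGroup` for taxonomies.parent_id), and the real resolution happens in SQL.

```typescript
// Hypothetical row shape for illustration only.
interface TermRow {
  group: string; // stands in for translation_group
  slug: string;
  parentGroup: string | null; // stands in for taxonomies.parent_id
}

// Returns the set of groups for the roots plus every descendant.
function resolveSubtreeGroups(terms: TermRow[], rootSlugs: string[]): Set<string> {
  // Index child groups by parent group (the "recursive step" lookup).
  const childrenByParent = new Map<string, string[]>();
  for (const t of terms) {
    if (t.parentGroup === null) continue;
    const list = childrenByParent.get(t.parentGroup) ?? [];
    list.push(t.group);
    childrenByParent.set(t.parentGroup, list);
  }

  // Seed with the groups of the root slugs (the "anchor" step).
  const roots = new Set(rootSlugs);
  const queue: string[] = terms.filter((t) => roots.has(t.slug)).map((t) => t.group);

  const seen = new Set<string>();
  while (queue.length > 0) {
    const g = queue.shift() as string;
    if (seen.has(g)) continue; // skip duplicates and guard against cycles
    seen.add(g);
    for (const child of childrenByParent.get(g) ?? []) queue.push(child);
  }
  return seen;
}

const terms: TermRow[] = [
  { group: "g-eu", slug: "europe", parentGroup: null },
  { group: "g-fr", slug: "france", parentGroup: "g-eu" },
  { group: "g-fr", slug: "frankreich", parentGroup: "g-eu" }, // same group, other locale
  { group: "g-paris", slug: "paris", parentGroup: "g-fr" },
  { group: "g-asia", slug: "asia", parentGroup: null },
];

const europeSubtree = resolveSubtreeGroups(terms, ["europe"]);
const noRoots = resolveSubtreeGroups(terms, []);
console.log(Array.from(europeSubtree), noRoots.size);
```

Because the result is a set of groups rather than slugs, translated rows that share a group collapse to one entry, which is the locale behavior the paragraph above describes.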

Also adds an opt-in rollup option to getTaxonomyTerms (and the admin terms endpoint via ?rollup=1) returning distinct-entry subtree counts, so a facet badge equals what selecting that facet returns. Default behavior (exact-slug filter, exact-term counts) is unchanged.
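As an illustration of what "distinct-entry subtree counts" means, the sketch below credits each tagged entry to its term and every ancestor, using a set per term so an entry tagged at both a parent and a child is counted once. The data shapes and the `rollupCounts` name are hypothetical; the PR computes these counts in SQL.

```typescript
// Hypothetical shapes for illustration only.
interface Tagging {
  entryId: string;
  group: string;
}

// parentOf maps each term group to its parent group (null for roots).
function rollupCounts(
  parentOf: Map<string, string | null>,
  taggings: Tagging[],
): Map<string, number> {
  const entriesBySubtree = new Map<string, Set<string>>();
  parentOf.forEach((_parent, group) => entriesBySubtree.set(group, new Set<string>()));

  for (const { entryId, group } of taggings) {
    // Walk from the tagged term up to the root, crediting each ancestor.
    let current: string | null | undefined = group;
    const visited = new Set<string>(); // cycle guard
    while (current != null && !visited.has(current)) {
      visited.add(current);
      entriesBySubtree.get(current)?.add(entryId); // Set dedupes repeat entries
      current = parentOf.get(current);
    }
  }

  const counts = new Map<string, number>();
  entriesBySubtree.forEach((entries, group) => counts.set(group, entries.size));
  return counts;
}

const parentOf = new Map<string, string | null>([
  ["europe", null],
  ["france", "europe"],
  ["asia", null],
]);

const counts = rollupCounts(parentOf, [
  { entryId: "e1", group: "europe" },
  { entryId: "e1", group: "france" }, // same entry, tagged at parent and child
  { entryId: "e2", group: "france" },
  { entryId: "e3", group: "asia" },
]);

console.log(counts.get("europe"), counts.get("france"), counts.get("asia"));
```

Summing per-term counts instead (1 for europe plus 2 for france) would report 3 for the europe facet, while selecting that facet returns only two entries.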

Related: Discussion #1647 (opened in Ideas; awaiting maintainer approval — opening the PR early to share the implementation, happy to adjust the operator surface, e.g. { subtree } vs { descendantsOf }, per the discussion).

Note for reviewers — query-count snapshot: CI may report +1 query on GET /posts/building-for-the-long-term (17→18). This is pre-existing and unrelated to this PR: the snapshot was last refreshed by #1619, and #1577 (offset pagination, which changed bucketFilter) merged afterward without refreshing it. This branch's diff does not touch the loader's bucketing/cache-key path (only a WhereSubtree type re-export in query.ts). I deliberately did not update the snapshot here to avoid misattributing the drift; it belongs to a separate fix.

Type of change

Checklist

AI-generated code disclosure

  • This PR includes AI-generated code — model/tool: Claude Opus 4.8 (Claude Code)

Screenshots / test output

Dialect-parity tests run under describeEachDialect (SQLite locally; Postgres in CI via PG_CONNECTION_STRING). Highlights:

  • loader-taxonomy-subtree-filter.test.ts: single-root and multi-root match, >999-descendant overflow guard (would exceed SQLite's bind limit if descendants were enumerated rather than walked in-SQL), mixed exact + subtree across taxonomies, empty-roots short-circuit, cross-locale (match by translation_group), keyset pagination.
  • taxonomy-subtree-counts.test.ts: distinct-entry rollup honesty (an entry tagged at both a parent and its child counts once), getTaxonomyTerms({ rollup }), and handleTermList({ rollup }).
Test Files  2 passed (2)
      Tests  10 passed (10)

🤖 Generated with Claude Code

MA2153 and others added 9 commits June 29, 2026 12:35
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
changeset-bot Bot commented Jun 29, 2026


🦋 Changeset detected

Latest commit: 399fe8e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 16 packages
| Name | Type |
| --- | --- |
| emdash | Minor |
| @emdash-cms/cloudflare | Minor |
| @emdash-cms/sandbox-workerd | Patch |
| @emdash-cms/fixture-perf-site | Patch |
| @emdash-cms/perf-demo-site | Patch |
| @emdash-cms/cache-demo-site | Patch |
| @emdash-cms/do-demo-site | Patch |
| @emdash-cms/do-solo-demo-site | Patch |
| @emdash-cms/admin | Minor |
| @emdash-cms/auth | Minor |
| @emdash-cms/blocks | Minor |
| @emdash-cms/gutenberg-to-portable-text | Minor |
| @emdash-cms/x402 | Minor |
| create-emdash | Minor |
| @emdash-cms/auth-atproto | Patch |
| @emdash-cms/plugin-embeds | Patch |

@github-actions github-actions Bot added the review/needs-review (No maintainer or bot review yet), area/core, and size/XL labels Jun 29, 2026
@github-actions

Contributor

Scope check

This PR changes 567 lines across 12 files. Large PRs are harder to review and more likely to be closed without review.

If this scope is intentional, no action needed. A maintainer will review it. If not, please consider splitting this into smaller PRs.

See CONTRIBUTING.md for contribution guidelines.

pkg-pr-new Bot commented Jun 29, 2026

@emdash-cms/admin

npm i https://pkg.pr.new/@emdash-cms/admin@1648

@emdash-cms/auth

npm i https://pkg.pr.new/@emdash-cms/auth@1648

@emdash-cms/auth-atproto

npm i https://pkg.pr.new/@emdash-cms/auth-atproto@1648

@emdash-cms/blocks

npm i https://pkg.pr.new/@emdash-cms/blocks@1648

@emdash-cms/cloudflare

npm i https://pkg.pr.new/@emdash-cms/cloudflare@1648

@emdash-cms/contentful-to-portable-text

npm i https://pkg.pr.new/@emdash-cms/contentful-to-portable-text@1648

emdash

npm i https://pkg.pr.new/emdash@1648

create-emdash

npm i https://pkg.pr.new/create-emdash@1648

@emdash-cms/gutenberg-to-portable-text

npm i https://pkg.pr.new/@emdash-cms/gutenberg-to-portable-text@1648

@emdash-cms/plugin-cli

npm i https://pkg.pr.new/@emdash-cms/plugin-cli@1648

@emdash-cms/plugin-types

npm i https://pkg.pr.new/@emdash-cms/plugin-types@1648

@emdash-cms/registry-client

npm i https://pkg.pr.new/@emdash-cms/registry-client@1648

@emdash-cms/registry-lexicons

npm i https://pkg.pr.new/@emdash-cms/registry-lexicons@1648

@emdash-cms/sandbox-workerd

npm i https://pkg.pr.new/@emdash-cms/sandbox-workerd@1648

@emdash-cms/x402

npm i https://pkg.pr.new/@emdash-cms/x402@1648

@emdash-cms/plugin-ai-moderation

npm i https://pkg.pr.new/@emdash-cms/plugin-ai-moderation@1648

@emdash-cms/plugin-atproto

npm i https://pkg.pr.new/@emdash-cms/plugin-atproto@1648

@emdash-cms/plugin-audit-log

npm i https://pkg.pr.new/@emdash-cms/plugin-audit-log@1648

@emdash-cms/plugin-color

npm i https://pkg.pr.new/@emdash-cms/plugin-color@1648

@emdash-cms/plugin-embeds

npm i https://pkg.pr.new/@emdash-cms/plugin-embeds@1648

@emdash-cms/plugin-field-kit

npm i https://pkg.pr.new/@emdash-cms/plugin-field-kit@1648

@emdash-cms/plugin-forms

npm i https://pkg.pr.new/@emdash-cms/plugin-forms@1648

@emdash-cms/plugin-webhook-notifier

npm i https://pkg.pr.new/@emdash-cms/plugin-webhook-notifier@1648

commit: 399fe8e

@emdashbot emdashbot Bot left a comment

Contributor

Approach

This is the right change for the right problem. Exact-slug taxonomy filters cannot express "this term or anything under it", and the only workaround (enumerating every descendant slug) blows past D1's bind-parameter limit on deep trees. Adding a first-class { subtree } operator and resolving descendants in SQL with a recursive CTE fits EmDash's architecture cleanly: it reuses the existing translation_group-aware model from migration 045, keeps the parameter count constant, and doesn't disturb the default exact-slug behavior.

What I checked

  • SQL safety: The recursive CTEs use Kysely's sql tagged template for values and sql.ref() for identifiers; the dynamic table/collection names are validated via getTableName/getTaxonomyNames. No raw interpolation of user slugs.
  • Locale / i18n correctness: The subtree walk uses translation_group/parent_id, so matches are locale-agnostic in the same way content_taxonomies.taxonomy_id is. The loader's outer locale filter still scopes the returned entries.
  • Authorization: The admin terms route still checks taxonomies:read; the new ?rollup query param is just passed through.
  • Cache invalidation: getTaxonomyTerms({ rollup: true }) is still wrapped in the existing cachedQuery/requestCached layers with a cache key that includes r1, so term mutations via invalidateTermCache() bust it correctly.
  • Tests: Dialect-parametric tests cover single root, multiple roots, the >999-descendant overflow guard, mixing exact + subtree filters, empty-root short-circuit, cross-locale group matching, keyset pagination, and distinct-entry rollup counts. A changeset is present.

Headline conclusion

The code is clean and I don't see any blocking bugs. I have one small suggestion: the public where docstrings in loader.ts and query.ts list exact/array/range examples but don't mention the new subtree operator, so developers won't discover it from autocomplete/docs.

Process note: Per AGENTS.md, new features require a maintainer-approved Discussion. This PR links to Discussion #1647, which is currently in Ideas and awaiting approval. The implementation looks ready, but merge should wait for that approval.


Findings

  • [suggestion] packages/core/src/loader.ts:643

    The public where docstring lists exact, byline, field, and range examples but omits the new subtree operator. Add a usage example so callers discover the feature through API docs/autocomplete.

    	 * @example { published_at: { gte: '2024-01-01', lt: '2025-01-01' } } - date range
    	 * @example { category: { subtree: 'news' } } - match a term and all descendants
    
  • [suggestion] packages/core/src/query.ts:124

    The public where docstring lists exact/array/byline/field/range examples but does not mention the new subtree operator added by this PR. Add an example so the public API surface is documented.

    	 * @example { published_at: { gte: '2024-01-01', lt: '2025-01-01' } } - Date range
    	 * @example { category: { subtree: 'news' } } - Match a term and all its descendants
    

Postgres COUNT() returns bigint as a string, so getTaxonomyTermCounts
returned "1" instead of 1 under the pg driver, failing the rollup test's
exact-count assertion. Coerce with Number(), matching countEntriesForSubtrees.
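A small sketch of the coercion described above, assuming the driver may hand back a count as either a number (SQLite) or a numeric string (Postgres bigint). The `toCount` helper is a hypothetical illustration, not the function changed in this commit; it adds validation that a bare Number() call does not perform.

```typescript
// Drivers may return COUNT(*) as a number or as a numeric string.
type RawCount = number | string;

// Normalizes a raw count to a JS number, rejecting values that are not
// non-negative safe integers so a bad row fails loudly instead of silently.
function toCount(raw: RawCount): number {
  if (typeof raw === "string" && raw.trim() === "") {
    throw new RangeError("Unexpected empty count value");
  }
  const n = Number(raw);
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError(`Unexpected count value: ${String(raw)}`);
  }
  return n;
}

// The pg-style string and the SQLite-style number now compare equal.
const fromPostgres = toCount("1");
const fromSqlite = toCount(1);
console.log(fromPostgres === fromSqlite);
```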

Also document the new `subtree` where-operator in the loader/query docstrings
(review suggestion).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added the review/needs-rereview label (Author pushed changes since the last review) and removed the review/needs-review label (No maintainer or bot review yet) Jun 29, 2026
@MA2153 MA2153 marked this pull request as draft June 29, 2026 13:16

@MA2153 MA2153 left a comment

Contributor Author

Production note: the subtree filter's EXISTS plan is entry-driven and scans the whole collection at low selectivity

Caught this in a production D1 trace with the feature live behind a faceted-browse UI. Sharing here since this PR is the source of the SQL — it's not a blocker for the operator surface, but it's worth weighing before this lands.

What the trace showed

One faceted-browse request — a single collection filtered by { <tax>: { subtree: [<two sibling leaf terms>] } }, first page (LIMIT 24) — issued one D1 query that read 43,986 rows to return ≤24 cards (sql_duration_ms ≈ 68, served_by_primary). It dominated the request by ~4 orders of magnitude: every other span read 0–1 rows, and the rollup counts were served from the KV object cache and cost no D1 at all.

The query is exactly the subtreeCond block added in loader.ts:

... FROM <collection>
WHERE <status> AND locale = ?
  AND EXISTS (
    SELECT 1 FROM content_taxonomies ct
    WHERE ct.collection = ? AND ct.entry_id = <collection>.id
      AND ct.taxonomy_id IN (WITH RECURSIVE sub(grp) AS (...) SELECT grp FROM sub)
  )
ORDER BY created_at DESC, id DESC
LIMIT 24

Why it reads so much

The filter is an EXISTS correlated to <collection>.id, so the plan is entry-driven: walk the collection in created_at DESC order and probe content_taxonomies per candidate until 24 rows pass EXISTS. When the selected subtree is sparse relative to the sort order (here the chosen leaf terms tag ~0.1% of the collection, with none near the top of the recency order), the engine pages through tens of thousands of entries to fill a single page.

Indexes aren't the problem — content_taxonomies PK (collection, entry_id, taxonomy_id) makes each per-entry probe an index hit. The problem is the plan visits every candidate entry. The one index that could drive selection — idx_content_taxonomies_term (taxonomy_id) — sits on the side this entry-driven plan never reaches.

This shape isn't unique to subtree; the pre-existing exact-slug taxonomyCond is the same EXISTS-per-row pattern. But subtree is the operator built for faceted browse over deep hierarchies — exactly the case where collections are large and a parent/section selection is sparse-or-old against a recency sort — so it's where the plan degrades toward O(table) reads. (Worth confirming with EXPLAIN QUERY PLAN on a representative dataset; the row counts strongly indicate the above.)
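The read amplification can be reproduced with a toy model of the entry-driven plan: walk entries in recency order, probe each one, and stop once the page is full. This is a simulation under assumed numbers (40,000 entries, 40 matches, all among the oldest), not a measurement of D1, and every name in it is hypothetical.

```typescript
interface Entry {
  id: number;
  createdAt: number;
}

// Models the EXISTS-per-row plan: visit entries newest-first and probe
// each candidate until `limit` rows have passed the filter.
function entryDrivenPage(
  entriesByRecency: Entry[], // assumed pre-sorted created_at DESC, id DESC
  matchingIds: Set<number>,
  limit: number,
): { page: Entry[]; rowsVisited: number } {
  const page: Entry[] = [];
  let rowsVisited = 0;
  for (const e of entriesByRecency) {
    if (page.length >= limit) break; // page is full, stop early
    rowsVisited++;
    if (matchingIds.has(e.id)) page.push(e);
  }
  return { page, rowsVisited };
}

// 40,000 entries, newest first (higher id means more recent).
const entries: Entry[] = Array.from({ length: 40000 }, (_, i) => ({
  id: 40000 - i,
  createdAt: 40000 - i,
}));

// Sparse and old: only the 40 oldest entries (0.1%) carry the selected terms.
const sparseMatches = new Set<number>(Array.from({ length: 40 }, (_, i) => i + 1));
// Dense: every entry matches.
const denseMatches = new Set<number>(entries.map((e) => e.id));

const sparse = entryDrivenPage(entries, sparseMatches, 24);
const dense = entryDrivenPage(entries, denseMatches, 24);
console.log(sparse.rowsVisited, dense.rowsVisited);
```

In this model the sparse case visits 39,984 entries to fill one page, while the dense case stops after 24, which matches the point that the same plan is cheap or expensive depending on selectivity against the sort order.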

Suggestion — a pivot-driven plan for taxonomy/subtree filters

Resolve matching entry_ids from the pivot first, then fetch/sort entries:

WITH RECURSIVE sub(grp) AS ( ...recursive subtree... ),
     matched AS (
       SELECT DISTINCT entry_id FROM content_taxonomies
       WHERE collection = ? AND taxonomy_id IN (SELECT grp FROM sub)
     )
SELECT ... FROM <collection>
JOIN matched ON matched.entry_id = <collection>.id
WHERE <status> AND locale = ?
ORDER BY created_at DESC, id DESC LIMIT 24

matched reads only the taggings under the subtree via idx_content_taxonomies_term (hundreds, not the whole table). I don't think it's a free swap, though — tradeoffs to weigh:

  • Selectivity cuts both ways. Pivot-first wins when the subtree is sparse (the faceted-browse common case); the current entry-driven plan can win when the subtree matches most rows and the recency index lets it stop early after 24. A cost-based choice — or at least a documented heuristic — may be warranted rather than always doing one or the other.
  • Multi-facet AND (several taxonomy keys in one where) becomes an intersection of matched sets rather than independent EXISTS clauses.
  • Keyset pagination must still order by the entry's (created_at, id); the JOIN form preserves that.
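The pivot-first shape and the last two tradeoffs can be sketched in memory: resolve a matched entry-id set per taxonomy key, intersect the sets for a multi-facet AND, then sort and page by (created_at, id). All names and data here are hypothetical, and this is a model of the proposed plan rather than the loader's code.

```typescript
interface Entry {
  id: number;
  createdAt: number;
}
interface Tagging {
  entryId: number;
  group: string;
}
interface Cursor {
  createdAt: number;
  id: number;
}

// Pivot step: read only taggings whose term is in the resolved subtree.
function matchedEntryIds(taggings: Tagging[], groups: Set<string>): Set<number> {
  const ids = new Set<number>();
  for (const t of taggings) if (groups.has(t.group)) ids.add(t.entryId);
  return ids;
}

// Multi-facet AND: an entry must appear in every matched set.
function intersect(sets: Set<number>[]): Set<number> {
  const out = new Set<number>();
  if (sets.length === 0) return out;
  sets[0].forEach((id) => {
    if (sets.every((s) => s.has(id))) out.add(id);
  });
  return out;
}

// Fetch matched entries, order by (createdAt DESC, id DESC), apply keyset cursor.
function pivotFirstPage(
  entriesById: Map<number, Entry>,
  matched: Set<number>,
  limit: number,
  after?: Cursor,
): Entry[] {
  const rows: Entry[] = [];
  matched.forEach((id) => {
    const e = entriesById.get(id);
    if (e) rows.push(e);
  });
  rows.sort((a, b) => b.createdAt - a.createdAt || b.id - a.id);
  const remaining = after
    ? rows.filter(
        (e) =>
          e.createdAt < after.createdAt ||
          (e.createdAt === after.createdAt && e.id < after.id),
      )
    : rows;
  return remaining.slice(0, limit);
}

const createdAtById: Array<[number, number]> = [
  [1, 100], [2, 100], [3, 90], [4, 80], [5, 70], [6, 60],
];
const entriesById = new Map<number, Entry>(
  createdAtById.map(([id, createdAt]): [number, Entry] => [id, { id, createdAt }]),
);

const regionTaggings: Tagging[] = [
  { entryId: 1, group: "g-fr" },
  { entryId: 2, group: "g-paris" },
  { entryId: 3, group: "g-fr" },
  { entryId: 5, group: "g-paris" },
  { entryId: 6, group: "g-asia" },
];
const topicTaggings: Tagging[] = [
  { entryId: 1, group: "t-news" },
  { entryId: 2, group: "t-news" },
  { entryId: 3, group: "t-sports" },
  { entryId: 5, group: "t-news" },
];

const regionMatched = matchedEntryIds(regionTaggings, new Set(["g-eu", "g-fr", "g-paris"]));
const topicMatched = matchedEntryIds(topicTaggings, new Set(["t-news"]));
const matched = intersect([regionMatched, topicMatched]);

const page1 = pivotFirstPage(entriesById, matched, 2);
const last = page1[page1.length - 1];
const page2 = pivotFirstPage(entriesById, matched, 2, { createdAt: last.createdAt, id: last.id });
console.log(page1.map((e) => e.id), page2.map((e) => e.id));
```

The cursor compares on the entry's own (createdAt, id) pair, so paging behaves the same as in the entry-driven form; only the order in which rows are discovered changes.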

Smaller, orthogonal win regardless of plan: hoist the recursive sub resolution into a single top-level CTE shared by the filter (and reuse it for the rollup count path in taxonomy.ts), so the subtree group set is resolved once.

Not a blocker

The headline win here — resolving the subtree server-side from root slugs so the bound-parameter count is independent of subtree size — is the right call, and the recursive CTE is the correct tool for it. This is purely about the execution plan of the resulting EXISTS, which I'm flagging because in the motivating use case the production cost of the whole request landed almost entirely on this one query.

@github-actions

Contributor

Overlapping PRs

This PR modifies files that are also changed by other open PRs:

This may cause merge conflicts or duplicated work. A maintainer will coordinate.

@github-actions github-actions Bot removed the review/needs-rereview label (Author pushed changes since the last review) Jun 29, 2026