Adds subtree (descendant) matching to taxonomy where filters#1648
Adds subtree (descendant) matching to taxonomy where filters#1648MA2153 wants to merge 12 commits into
where filters#1648Conversation
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nation Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
🦋 Changeset detectedLatest commit: 399fe8e The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Scope checkThis PR changes 567 lines across 12 files. Large PRs are harder to review and more likely to be closed without review. If this scope is intentional, no action needed. A maintainer will review it. If not, please consider splitting this into smaller PRs. See CONTRIBUTING.md for contribution guidelines. |
@emdash-cms/admin
@emdash-cms/auth
@emdash-cms/auth-atproto
@emdash-cms/blocks
@emdash-cms/cloudflare
@emdash-cms/contentful-to-portable-text
emdash
create-emdash
@emdash-cms/gutenberg-to-portable-text
@emdash-cms/plugin-cli
@emdash-cms/plugin-types
@emdash-cms/registry-client
@emdash-cms/registry-lexicons
@emdash-cms/sandbox-workerd
@emdash-cms/x402
@emdash-cms/plugin-ai-moderation
@emdash-cms/plugin-atproto
@emdash-cms/plugin-audit-log
@emdash-cms/plugin-color
@emdash-cms/plugin-embeds
@emdash-cms/plugin-field-kit
@emdash-cms/plugin-forms
@emdash-cms/plugin-webhook-notifier
commit: |
There was a problem hiding this comment.
Approach
This is the right change for the right problem. Exact-slug taxonomy filters cannot express "this term or anything under it", and the only workaround (enumerating every descendant slug) blows past D1's bind-parameter limit on deep trees. Adding a first-class { subtree } operator and resolving descendants in SQL with a recursive CTE fits EmDash's architecture cleanly: it reuses the existing translation_group-aware model from migration 045, keeps the parameter count constant, and doesn't disturb the default exact-slug behavior.
What I checked
- SQL safety: The recursive CTEs use Kysely's
sqltagged template for values andsql.ref()for identifiers; the dynamic table/collection names are validated viagetTableName/getTaxonomyNames. No raw interpolation of user slugs. - Locale / i18n correctness: The subtree walk uses
translation_group/parent_id, so matches are locale-agnostic in the same waycontent_taxonomies.taxonomy_idis. The loader's outerlocalefilter still scopes the returned entries. - Authorization: The admin terms route still checks
taxonomies:read; the new?rollupquery param is just passed through. - Cache invalidation:
getTaxonomyTerms({ rollup: true })is still wrapped in the existingcachedQuery/requestCachedlayers with a cache key that includesr1, so term mutations viainvalidateTermCache()bust it correctly. - Tests: Dialect-parametric tests cover single root, multiple roots, the >999-descendant overflow guard, mixing exact + subtree filters, empty-root short-circuit, cross-locale group matching, keyset pagination, and distinct-entry rollup counts. A changeset is present.
Headline conclusion
The code is clean and I don't see any blocking bugs. I have one small suggestion: the public where docstrings in loader.ts and query.ts list exact/array/range examples but don't mention the new subtree operator, so developers won't discover it from autocomplete/docs.
Process note: Per AGENTS.md, new features require a maintainer-approved Discussion. This PR links to Discussion #1647, which is currently in Ideas and awaiting approval. The implementation looks ready, but merge should wait for that approval.
Findings
-
[suggestion]
packages/core/src/loader.ts:643The public
wheredocstring lists exact, byline, field, and range examples but omits the newsubtreeoperator. Add a usage example so callers discover the feature through API docs/autocomplete.* @example { published_at: { gte: '2024-01-01', lt: '2025-01-01' } } - date range * @example { category: { subtree: 'news' } } - match a term and all descendants -
[suggestion]
packages/core/src/query.ts:124The public
wheredocstring lists exact/array/byline/field/range examples but does not mention the newsubtreeoperator added by this PR. Add an example so the public API surface is documented.* @example { published_at: { gte: '2024-01-01', lt: '2025-01-01' } } - Date range * @example { category: { subtree: 'news' } } - Match a term and all its descendants
Postgres COUNT() returns bigint as a string, so getTaxonomyTermCounts returned "1" instead of 1 under the pg driver, failing the rollup test's exact-count assertion. Coerce with Number(), matching countEntriesForSubtrees. Also document the new `subtree` where-operator in the loader/query docstrings (review suggestion). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MA2153
left a comment
There was a problem hiding this comment.
Production note: the subtree filter's EXISTS plan is entry-driven and scans the whole collection at low selectivity
Caught this in a production D1 trace with the feature live behind a faceted-browse UI. Sharing here since this PR is the source of the SQL — it's not a blocker for the operator surface, but it's worth weighing before this lands.
What the trace showed
One faceted-browse request — a single collection filtered by { <tax>: { subtree: [<two sibling leaf terms>] } }, first page (LIMIT 24) — issued one D1 query that read 43,986 rows to return ≤24 cards (sql_duration_ms ≈ 68, served_by_primary). It dominated the request by ~4 orders of magnitude: every other span read 0–1 rows, and the rollup counts were served from the KV object cache and cost no D1 at all.
The query is exactly the subtreeCond block added in loader.ts:
... FROM <collection>
WHERE <status> AND locale = ?
AND EXISTS (
SELECT 1 FROM content_taxonomies ct
WHERE ct.collection = ? AND ct.entry_id = <collection>.id
AND ct.taxonomy_id IN (WITH RECURSIVE sub(grp) AS (...) SELECT grp FROM sub)
)
ORDER BY created_at DESC, id DESC
LIMIT 24Why it reads so much
The filter is an EXISTS correlated to <collection>.id, so the plan is entry-driven: walk the collection in created_at DESC order and probe content_taxonomies per candidate until 24 rows pass EXISTS. When the selected subtree is sparse relative to the sort order (here the chosen leaf terms tag ~0.1% of the collection, with none near the top of the recency order), the engine pages through tens of thousands of entries to fill a single page.
Indexes aren't the problem — content_taxonomies PK (collection, entry_id, taxonomy_id) makes each per-entry probe an index hit. The problem is the plan visits every candidate entry. The one index that could drive selection — idx_content_taxonomies_term (taxonomy_id) — sits on the side this entry-driven plan never reaches.
This shape isn't unique to subtree; the pre-existing exact-slug taxonomyCond is the same EXISTS-per-row pattern. But subtree is the operator built for faceted browse over deep hierarchies — exactly the case where collections are large and a parent/section selection is sparse-or-old against a recency sort — so it's where the plan degrades toward O(table) reads. (Worth confirming with EXPLAIN QUERY PLAN on a representative dataset; the row counts strongly indicate the above.)
Suggestion — a pivot-driven plan for taxonomy/subtree filters
Resolve matching entry_ids from the pivot first, then fetch/sort entries:
WITH sub(grp) AS ( ...recursive subtree... ),
matched AS (
SELECT DISTINCT entry_id FROM content_taxonomies
WHERE collection = ? AND taxonomy_id IN (SELECT grp FROM sub)
)
SELECT ... FROM <collection>
JOIN matched ON matched.entry_id = <collection>.id
WHERE <status> AND locale = ?
ORDER BY created_at DESC, id DESC LIMIT 24matched reads only the taggings under the subtree via idx_content_taxonomies_term (hundreds, not the whole table). I don't think it's a free swap, though — tradeoffs to weigh:
- Selectivity cuts both ways. Pivot-first wins when the subtree is sparse (the faceted-browse common case); the current entry-driven plan can win when the subtree matches most rows and the recency index lets it stop early after 24. A cost-based choice — or at least a documented heuristic — may be warranted rather than always doing one or the other.
- Multi-facet AND (several taxonomy keys in one
where) becomes an intersection ofmatchedsets rather than independentEXISTSclauses. - Keyset pagination must still order by the entry's
(created_at, id); the JOIN form preserves that.
Smaller, orthogonal win regardless of plan: hoist the recursive sub resolution into a single top-level CTE shared by the filter (and reuse it for the rollup count path in taxonomy.ts), so the subtree group set is resolved once.
Not a blocker
The headline win here — resolving the subtree server-side from root slugs so the bound-parameter count is independent of subtree size — is the right call, and the recursive CTE is the correct tool for it. This is purely about the execution plan of the resulting EXISTS, which I'm flagging because in the motivating use case the production cost of the whole request landed almost entirely on this one query.
Overlapping PRsThis PR modifies files that are also changed by other open PRs:
This may cause merge conflicts or duplicated work. A maintainer will coordinate. |
What does this PR do?
Adds a
subtreeoperator to collectionwheretaxonomy filters so selecting a parent term matches that term and all of its descendants, resolved in SQL:Today the taxonomy
wherefilter matches by exact slug only, so "this term or anything filed under it" (selecting a parent category in a faceted browse UI) is inexpressible. The only workaround — enumerating the subtree client-side and passing every descendant slug — expands to one bound parameter per slug and overflows D1's 100-bind-parameter cap (D1_ERROR: too many SQL variables) on deep hierarchies. It also can't be chunked without breaking keyset pagination.This resolves the subtree server-side from a single root slug via a recursive CTE over
taxonomies.parent_id, so the bound-parameter count is independent of subtree size. After #1646 bothparent_idandcontent_taxonomies.taxonomy_idlive in translation_group space, so the walk is locale-correct and matchestaxonomy_iddirectly.Also adds an opt-in
rollupoption togetTaxonomyTerms(and the admin terms endpoint via?rollup=1) returning distinct-entry subtree counts, so a facet badge equals what selecting that facet returns. Default behavior (exact-slug filter, exact-term counts) is unchanged.Related: Discussion #1647 (opened in Ideas; awaiting maintainer approval — opening the PR early to share the implementation, happy to adjust the operator surface, e.g.
{ subtree }vs{ descendantsOf }, per the discussion).Type of change
Checklist
pnpm typecheckpassespnpm lintpasses (0 diagnostics)pnpm testpasses (targeted: the new subtree filter + subtree count suites, 10 tests + the schema-coercion unit tests)pnpm formathas been runmessages.pochanges included (n/a for i18n)emdash: minor)AI-generated code disclosure
Screenshots / test output
Dialect-parity tests run under
describeEachDialect(SQLite locally; Postgres in CI viaPG_CONNECTION_STRING). Highlights:loader-taxonomy-subtree-filter.test.ts— single-root and multi-root match, >999-descendant overflow guard (would exceed SQLite's bind limit if descendants were enumerated rather than walked in-SQL), mixed exact + subtree across taxonomies, empty-roots short-circuit, cross-locale (match by translation_group), keyset pagination.taxonomy-subtree-counts.test.ts— distinct-entry rollup honesty (an entry tagged at both a parent and its child counts once),getTaxonomyTerms({ rollup }), andhandleTermList({ rollup }).🤖 Generated with Claude Code