perf(origdatablocks): add datasetId index to OrigDatablock and Datablock #2725

Merged
alubbock merged 4 commits into SciCatProject:master from rosalindfranklininstitute:perf/001-datasetid-index on May 19, 2026

@alubbock alubbock (Member) commented May 8, 2026

Description

Adds a MongoDB index on datasetId to the OrigDatablock and Datablock collections.

Motivation

All $lookup pipelines that join datasets to their blocks filter on datasetId, but neither collection had an index on that field. Without one, each lookup performs a full collection scan, which costs O(n) in the number of block documents for every parent dataset document. At scale this is the dominant cost of the dataset detail view (which eagerly joins origdatablocks and datablocks by default) and of any archival workflow that loads blocks by dataset.
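
To make the cost argument concrete, here is a toy model in TypeScript that counts how many block documents are examined per dataset with and without an index. It is only a sketch: MongoDB uses B-tree indexes rather than a hash map, and the `Block` shape, the helper names, and the generated data are all hypothetical.

```typescript
// Toy model only: a Map stands in for MongoDB's B-tree index.
interface Block {
  _id: string;
  datasetId: string;
}

interface LookupResult {
  matches: Block[];
  examined: number; // how many block documents were inspected
}

// Without an index: every block is examined for each parent dataset.
function lookupByScan(blocks: Block[], datasetId: string): LookupResult {
  let examined = 0;
  const matches: Block[] = [];
  for (const block of blocks) {
    examined++;
    if (block.datasetId === datasetId) matches.push(block);
  }
  return { matches, examined };
}

// With an index: built once, then each lookup touches only matching blocks.
function buildIndex(blocks: Block[]): Map<string, Block[]> {
  const index = new Map<string, Block[]>();
  for (const block of blocks) {
    const bucket = index.get(block.datasetId);
    if (bucket) bucket.push(block);
    else index.set(block.datasetId, [block]);
  }
  return index;
}

function lookupByIndex(index: Map<string, Block[]>, datasetId: string): LookupResult {
  const matches = index.get(datasetId) ?? [];
  return { matches, examined: matches.length };
}

// Hypothetical data: 1000 datasets with 3 blocks each (3000 blocks in total).
const blocks: Block[] = [];
for (let d = 0; d < 1000; d++) {
  for (let b = 0; b < 3; b++) {
    blocks.push({ _id: `block-${d}-${b}`, datasetId: `dataset-${d}` });
  }
}

const scan = lookupByScan(blocks, "dataset-42");
const indexed = lookupByIndex(buildIndex(blocks), "dataset-42");
console.log(scan.examined, indexed.examined); // 3000 3
```

Both lookups return the same three blocks; only the number of documents examined differs, and that gap is multiplied by the number of parent datasets in a join.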

Changes:

  • src/origdatablocks/schemas/origdatablock.schema.ts -- OrigDatablockSchema.index({ datasetId: 1 })
  • src/datablocks/schemas/datablock.schema.ts -- DatablockSchema.index({ datasetId: 1 })

Tests included

  • Included for each change/fix?
  • Passing?

Two new schema regression tests (origdatablock.schema.spec.ts, datablock.schema.spec.ts) verify that the index definition is present on the compiled Mongoose schema, guarding against accidental removal.
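
One way such a regression check could be written is sketched below. It assumes the test reads the list that Mongoose's `schema.indexes()` returns, an array of `[fields, options]` pairs, and it does not reproduce the spec files in this PR. The `hasSingleFieldIndex` helper and the `declared` sample are hypothetical, and the sample stands in for the real schema so the snippet runs without Mongoose.

```typescript
// Shape of one entry returned by Mongoose's schema.indexes(): [fields, options].
type IndexSpec = [Record<string, unknown>, Record<string, unknown>];

// Returns true if some declared index is a single-field index on `field`
// with the given sort direction (ascending by default).
function hasSingleFieldIndex(
  indexes: IndexSpec[],
  field: string,
  direction: 1 | -1 = 1,
): boolean {
  return indexes.some(([fields]) => {
    const keys = Object.keys(fields);
    return keys.length === 1 && keys[0] === field && fields[field] === direction;
  });
}

// Hypothetical stand-in for the value of OrigDatablockSchema.indexes().
const declared: IndexSpec[] = [[{ datasetId: 1 }, {}]];

console.log(hasSingleFieldIndex(declared, "datasetId")); // true
console.log(hasSingleFieldIndex(declared, "ownerGroup")); // false
```

In a real spec the `declared` array would be replaced by the compiled schema's own index list, so removing the `index({ datasetId: 1 })` call would make the assertion fail.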

Documentation

  • swagger documentation updated (required for API changes) -- n/a, no API change
  • official documentation updated -- n/a, internal index definition only

Summary by Sourcery

Add a MongoDB index on datasetId to datablock-related collections and guard it with schema regression tests.

New Features:

  • Introduce a datasetId index on the Datablock collection.
  • Introduce a datasetId index on the OrigDatablock collection.

Tests:

  • Add schema regression tests to verify the datasetId index exists on Datablock and OrigDatablock Mongoose schemas.

alubbock added 2 commits May 8, 2026 22:07
All $lookup pipelines joining datasets to their blocks filter on
datasetId, causing a full collection scan per parent document without
this index (PERF-001).
@alubbock alubbock requested a review from a team as a code owner May 8, 2026 21:55

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've reviewed your changes and they look great!

@Junjiequan Junjiequan (Member) left a comment

lgtm

@alubbock alubbock enabled auto-merge May 19, 2026 08:49
@alubbock alubbock merged commit 006da4b into SciCatProject:master May 19, 2026
20 of 21 checks passed
@alubbock alubbock deleted the perf/001-datasetid-index branch May 19, 2026 17:18

3 participants