Skip to content

fix(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles#2726

Open
alubbock wants to merge 3 commits into
SciCatProject:masterfrom
rosalindfranklininstitute:perf/002-aggregate-size-calculation
Open

fix(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles#2726
alubbock wants to merge 3 commits into
SciCatProject:masterfrom
rosalindfranklininstitute:perf/002-aggregate-size-calculation

Conversation

@alubbock
Copy link
Copy Markdown
Member

@alubbock alubbock commented May 8, 2026

Description

Replaces the updateDatasetSizeAndFiles helper in OrigDatablocksController with a single MongoDB $group aggregation.

Motivation

Correctness

The previous implementation called findAll with no limits, which caused parseLimitFilters to apply its default of limit: 100. For any dataset backed by more than 100 origdatablocks, only the first 100 were included in the sum, silently understating size and numberOfFiles on the dataset record. This bug is triggered on every origdatablock create, update, and delete.

The aggregation has no such cap -- it processes every matching document.

Performance

Beyond correctness, the previous approach fetched full origdatablock documents into application memory, including their complete dataFileList arrays, and computed the totals in JavaScript. The replacement pushes both calculations to the database using $sum and returns only the two scalars needed.

Related upstream issues: #2250 (double-counting of size/files when creating datablocks) and #1688 (size reset to 0 after dataset PATCH) both involve the same size-tracking logic.

Changes:

  • src/origdatablocks/origdatablocks.service.ts -- new aggregateSizeAndFileCount(datasetId) method using $match + $group; dataFileList wrapped in $ifNull so documents with a missing or null field return 0 rather than erroring; returns { size, numberOfFiles }, stripping the internal _id: null field from the $group result
  • src/origdatablocks/origdatablocks.controller.ts -- updateDatasetSizeAndFiles reduced to a single call to the new service method; unused PartialUpdateDatasetDto import removed

Tests included

  • Included for each change/fix?
  • Passing?

Three new service-level tests for aggregateSizeAndFileCount:

  • returns summed totals when origdatablocks exist
  • returns { size: 0, numberOfFiles: 0 } when none exist for the dataset
  • passes the correct datasetId to the $match stage

Two new controller-level tests for updateDatasetSizeAndFiles:

  • verifies the aggregation result flows through to datasetsService.findByIdAndUpdate
  • verifies zero totals are propagated correctly

Documentation

  • swagger documentation updated (required for API changes) -- n/a, no API change
  • official documentation updated -- n/a, internal implementation change only

Summary by Sourcery

Replace dataset size and file count recalculation for origdatablocks with a MongoDB aggregation-based implementation to ensure correct totals for all datasets.

Bug Fixes:

  • Fix dataset size and numberOfFiles being understated when more than 100 origdatablocks exist by eliminating the implicit query limit during recalculation.

Enhancements:

  • Introduce an aggregateSizeAndFileCount service method that computes dataset size and file count in MongoDB using a $group aggregation and handles missing dataFileList fields.
  • Simplify OrigDatablocksController.updateDatasetSizeAndFiles to delegate to the new aggregation-based service method and update the dataset in a single step.

Tests:

  • Add unit tests for aggregateSizeAndFileCount covering normal aggregation, empty results, and correct datasetId matching in the aggregation pipeline.
  • Add controller tests to verify updateDatasetSizeAndFiles uses the aggregation results to update datasets, including propagation of zero totals.

alubbock added 2 commits May 8, 2026 22:09
…iles

The previous implementation fetched all OrigDatablock documents for a
dataset into memory — including full dataFileList arrays — then summed
size and file count in JavaScript. This sent unbounded data across the
wire on every origdatablock create/update/delete.

Replaced with a single $group aggregation that computes both totals
server-side (PERF-002).
…dFiles

- Service: three cases covering normal totals, empty dataset, and
  correct $match datasetId propagation
- Controller: two cases verifying the aggregation result flows through
  to datasetsService.findByIdAndUpdate (including zero-total case)
- Fix: destructure only {size, numberOfFiles} from the $group result
  so callers don't receive the internal _id:null field
@alubbock alubbock requested a review from a team as a code owner May 8, 2026 21:57
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In aggregateSizeAndFileCount, consider making the $size expression more defensive (e.g. wrapping "$dataFileList" in $ifNull or $cond to handle missing or non-array values) so that the aggregation remains robust even if legacy or malformed documents lack dataFileList.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `aggregateSizeAndFileCount`, consider making the `$size` expression more defensive (e.g. wrapping `"$dataFileList"` in `$ifNull` or `$cond` to handle missing or non-array values) so that the aggregation remains robust even if legacy or malformed documents lack `dataFileList`.

## Individual Comments

### Comment 1
<location path="src/origdatablocks/origdatablocks.controller.spec.ts" line_range="167" />
<code_context>
     });
   });
+
+  describe("updateDatasetSizeAndFiles", () => {
+    let controllerWithRealMethod: OrigDatablocksController;
+    let realOrigDatablocksService: OrigDatablocksServiceMock;
</code_context>
<issue_to_address>
**issue (complexity):** Consider reusing the existing test module/controller and, optionally, driving assertions through the public update handler to reduce duplication and private-method coupling while keeping the same coverage.

You can keep the new coverage while avoiding the extra module setup, second controller instance, and private‑method coupling duplication.

Two concrete changes:

1. **Reuse the existing module/controller for all tests**  
   Move the internal method mocking into the `describe("update")` block so the controller’s real `updateDatasetSizeAndFiles` is available for the new tests:

```ts
describe("OrigDatablocksController", () => {
  let controller: OrigDatablocksController;
  let origDatablocksService: OrigDatablocksServiceMock;
  let datasetsService: DatasetsServiceMock;

  beforeEach(async () => {
    const module: TestingModule = await Test.createTestingModule({
      controllers: [OrigDatablocksController],
      imports: [ConfigModule],
      providers: [
        { provide: OrigDatablocksService, useClass: OrigDatablocksServiceMock },
        { provide: DatasetsService, useClass: DatasetsServiceMock },
        { provide: CaslAbilityFactory, useClass: CaslAbilityFactoryMock },
      ],
    }).compile();

    controller = module.get(OrigDatablocksController);
    origDatablocksService = module.get(OrigDatablocksService) as any;
    datasetsService = module.get(DatasetsService) as any;
  });

  describe("update", () => {
    beforeEach(() => {
      // Mock internal helper only for update handler tests
      controller["updateDatasetSizeAndFiles"] = jest.fn();
    });

    // existing update tests...
  });

  describe("updateDatasetSizeAndFiles", () => {
    it("should use aggregateSizeAndFileCount and update the dataset", async () => {
      origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
        size: 2000,
        numberOfFiles: 5,
      });

      await controller["updateDatasetSizeAndFiles"]("testPid");

      expect(
        origDatablocksService.aggregateSizeAndFileCount,
      ).toHaveBeenCalledWith("testPid");
      expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("testPid", {
        size: 2000,
        numberOfFiles: 5,
      });
    });

    it("should propagate zero totals when no origdatablocks exist", async () => {
      origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
        size: 0,
        numberOfFiles: 0,
      });

      await controller["updateDatasetSizeAndFiles"]("emptyPid");

      expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("emptyPid", {
        size: 0,
        numberOfFiles: 0,
      });
    });
  });
});
```

This removes the second `beforeEach`, the `controllerWithRealMethod` variable, and duplicate providers, while preserving behavior and coverage.

2. **Optional: reduce private‑method coupling by testing via public API**  
If feasible, add/adjust one test that drives `update` and asserts on the dataset service call instead of calling the private helper directly:

```ts
it("update should aggregate and update dataset size/files", async () => {
  origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
    size: 2000,
    numberOfFiles: 5,
  });

  await controller.update(mockRequest as any, "123", mockDto);

  expect(origDatablocksService.aggregateSizeAndFileCount).toHaveBeenCalledWith(
    "ds1", // or appropriate dataset id
  );
  expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("ds1", {
    size: 2000,
    numberOfFiles: 5,
  });
});
```

You can keep the private‑method tests if you want the tighter unit checks, but the first change alone substantially reduces complexity without losing functionality.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/origdatablocks/origdatablocks.controller.spec.ts
- Wrap dataFileList in $ifNull in aggregateSizeAndFileCount so that
  documents with a missing or null field do not cause the $size
  operator to error
- Collapse controller spec to a single module setup; move the
  updateDatasetSizeAndFiles stub into the update describe block so
  the helper's own tests can use the real implementation directly,
  removing the duplicate beforeEach and extra controller instance
@alubbock alubbock changed the title perf(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles fix(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles May 8, 2026
@alubbock
Copy link
Copy Markdown
Member Author

alubbock commented May 8, 2026

@sourcery-ai review

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In aggregateSizeAndFileCount, consider mirroring the $ifNull pattern for size (e.g. $sum: { $ifNull: ['$size', 0] }) so that missing or null size fields don’t cause unexpected aggregation behavior over legacy or partially-migrated documents.
  • You might want to add an explicit $project stage after $group to limit the fields returned to just size and numberOfFiles, which can help keep the aggregation contract tight and avoid accidental leakage of intermediary fields if the pipeline is extended later.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `aggregateSizeAndFileCount`, consider mirroring the `$ifNull` pattern for `size` (e.g. `$sum: { $ifNull: ['$size', 0] }`) so that missing or null `size` fields don’t cause unexpected aggregation behavior over legacy or partially-migrated documents.
- You might want to add an explicit `$project` stage after `$group` to limit the fields returned to just `size` and `numberOfFiles`, which can help keep the aggregation contract tight and avoid accidental leakage of intermediary fields if the pipeline is extended later.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@alubbock
Copy link
Copy Markdown
Member Author

alubbock commented May 8, 2026

@sourcery-ai Thanks for the follow-up review. Two responses:

$ifNull on size — not needed here. MongoDB's $sum already treats missing or null numeric fields as 0, so the behaviour is identical to wrapping in $ifNull. This is different from the dataFileList case, where a missing array causes $size to error. size is also declared required in the schema, making the defensive case unlikely. No behaviour change would result from adding it.

Explicit $project stage — the field stripping is already handled in TypeScript before the result is returned:

return result
  ? { size: result.size, numberOfFiles: result.numberOfFiles }
  : { size: 0, numberOfFiles: 0 };

The $group stage only ever emits _id, size, and numberOfFiles, so there are no intermediary fields to leak regardless. Adding a $project would be a no-op pipeline stage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant