fix(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles by alubbock · Pull Request #2726 · SciCatProject/backend

alubbock · 2026-05-08T21:57:34Z

Description

Replaces the updateDatasetSizeAndFiles helper in OrigDatablocksController with a single MongoDB $group aggregation.

Motivation

Correctness

The previous implementation called findAll with no limits, which caused parseLimitFilters to apply its default of limit: 100. For any dataset backed by more than 100 origdatablocks, only the first 100 were included in the sum, silently understating size and numberOfFiles on the dataset record. This bug is triggered on every origdatablock create, update, and delete.

The aggregation has no such cap -- it processes every matching document.

Performance

Beyond correctness, the previous approach fetched full origdatablock documents into application memory, including their complete dataFileList arrays, and computed the totals in JavaScript. The replacement pushes both calculations to the database using $sum and returns only the two scalars needed.

Related upstream issues: #2250 (double-counting of size/files when creating datablocks) and #1688 (size reset to 0 after dataset PATCH) both involve the same size-tracking logic.

Changes:

src/origdatablocks/origdatablocks.service.ts -- new aggregateSizeAndFileCount(datasetId) method using $match + $group; dataFileList wrapped in $ifNull so documents with a missing or null field return 0 rather than erroring; returns { size, numberOfFiles }, stripping the internal _id: null field from the $group result
src/origdatablocks/origdatablocks.controller.ts -- updateDatasetSizeAndFiles reduced to a single call to the new service method; unused PartialUpdateDatasetDto import removed

Tests included

Included for each change/fix?
Passing?

Three new service-level tests for aggregateSizeAndFileCount:

returns summed totals when origdatablocks exist
returns { size: 0, numberOfFiles: 0 } when none exist for the dataset
passes the correct datasetId to the $match stage

Two new controller-level tests for updateDatasetSizeAndFiles:

verifies the aggregation result flows through to datasetsService.findByIdAndUpdate
verifies zero totals are propagated correctly

Documentation

swagger documentation updated (required for API changes) -- n/a, no API change
official documentation updated -- n/a, internal implementation change only

Summary by Sourcery

Replace dataset size and file count recalculation for origdatablocks with a MongoDB aggregation-based implementation to ensure correct totals for all datasets.

Bug Fixes:

Fix dataset size and numberOfFiles being understated when more than 100 origdatablocks exist by eliminating the implicit query limit during recalculation.

Enhancements:

Introduce an aggregateSizeAndFileCount service method that computes dataset size and file count in MongoDB using a $group aggregation and handles missing dataFileList fields.
Simplify OrigDatablocksController.updateDatasetSizeAndFiles to delegate to the new aggregation-based service method and update the dataset in a single step.

Tests:

Add unit tests for aggregateSizeAndFileCount covering normal aggregation, empty results, and correct datasetId matching in the aggregation pipeline.
Add controller tests to verify updateDatasetSizeAndFiles uses the aggregation results to update datasets, including propagation of zero totals.

…iles The previous implementation fetched all OrigDatablock documents for a dataset into memory — including full dataFileList arrays — then summed size and file count in JavaScript. This sent unbounded data across the wire on every origdatablock create/update/delete. Replaced with a single $group aggregation that computes both totals server-side (PERF-002).

…dFiles - Service: three cases covering normal totals, empty dataset, and correct $match datasetId propagation - Controller: two cases verifying the aggregation result flows through to datasetsService.findByIdAndUpdate (including zero-total case) - Fix: destructure only {size, numberOfFiles} from the $group result so callers don't receive the internal _id:null field

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

In aggregateSizeAndFileCount, consider making the $size expression more defensive (e.g. wrapping "$dataFileList" in $ifNull or $cond to handle missing or non-array values) so that the aggregation remains robust even if legacy or malformed documents lack dataFileList.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `aggregateSizeAndFileCount`, consider making the `$size` expression more defensive (e.g. wrapping `"$dataFileList"` in `$ifNull` or `$cond` to handle missing or non-array values) so that the aggregation remains robust even if legacy or malformed documents lack `dataFileList`.

## Individual Comments

### Comment 1
<location path="src/origdatablocks/origdatablocks.controller.spec.ts" line_range="167" />
<code_context>
     });
   });
+
+  describe("updateDatasetSizeAndFiles", () => {
+    let controllerWithRealMethod: OrigDatablocksController;
+    let realOrigDatablocksService: OrigDatablocksServiceMock;
</code_context>
<issue_to_address>
**issue (complexity):** Consider reusing the existing test module/controller and, optionally, driving assertions through the public update handler to reduce duplication and private-method coupling while keeping the same coverage.

You can keep the new coverage while avoiding the extra module setup, second controller instance, and private‑method coupling duplication.

Two concrete changes:

1. **Reuse the existing module/controller for all tests**  
   Move the internal method mocking into the `describe("update")` block so the controller’s real `updateDatasetSizeAndFiles` is available for the new tests:

```ts
describe("OrigDatablocksController", () => {
  let controller: OrigDatablocksController;
  let origDatablocksService: OrigDatablocksServiceMock;
  let datasetsService: DatasetsServiceMock;

  beforeEach(async () => {
    const module: TestingModule = await Test.createTestingModule({
      controllers: [OrigDatablocksController],
      imports: [ConfigModule],
      providers: [
        { provide: OrigDatablocksService, useClass: OrigDatablocksServiceMock },
        { provide: DatasetsService, useClass: DatasetsServiceMock },
        { provide: CaslAbilityFactory, useClass: CaslAbilityFactoryMock },
      ],
    }).compile();

    controller = module.get(OrigDatablocksController);
    origDatablocksService = module.get(OrigDatablocksService) as any;
    datasetsService = module.get(DatasetsService) as any;
  });

  describe("update", () => {
    beforeEach(() => {
      // Mock internal helper only for update handler tests
      controller["updateDatasetSizeAndFiles"] = jest.fn();
    });

    // existing update tests...
  });

  describe("updateDatasetSizeAndFiles", () => {
    it("should use aggregateSizeAndFileCount and update the dataset", async () => {
      origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
        size: 2000,
        numberOfFiles: 5,
      });

      await controller["updateDatasetSizeAndFiles"]("testPid");

      expect(
        origDatablocksService.aggregateSizeAndFileCount,
      ).toHaveBeenCalledWith("testPid");
      expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("testPid", {
        size: 2000,
        numberOfFiles: 5,
      });
    });

    it("should propagate zero totals when no origdatablocks exist", async () => {
      origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
        size: 0,
        numberOfFiles: 0,
      });

      await controller["updateDatasetSizeAndFiles"]("emptyPid");

      expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("emptyPid", {
        size: 0,
        numberOfFiles: 0,
      });
    });
  });
});
```

This removes the second `beforeEach`, the `controllerWithRealMethod` variable, and duplicate providers, while preserving behavior and coverage.

2. **Optional: reduce private‑method coupling by testing via public API**  
If feasible, add/adjust one test that drives `update` and asserts on the dataset service call instead of calling the private helper directly:

```ts
it("update should aggregate and update dataset size/files", async () => {
  origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
    size: 2000,
    numberOfFiles: 5,
  });

  await controller.update(mockRequest as any, "123", mockDto);

  expect(origDatablocksService.aggregateSizeAndFileCount).toHaveBeenCalledWith(
    "ds1", // or appropriate dataset id
  );
  expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("ds1", {
    size: 2000,
    numberOfFiles: 5,
  });
});
```

You can keep the private‑method tests if you want the tighter unit checks, but the first change alone substantially reduces complexity without losing functionality.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

- Wrap dataFileList in $ifNull in aggregateSizeAndFileCount so that documents with a missing or null field do not cause the $size operator to error - Collapse controller spec to a single module setup; move the updateDatasetSizeAndFiles stub into the update describe block so the helper's own tests can use the real implementation directly, removing the duplicate beforeEach and extra controller instance

alubbock · 2026-05-08T22:16:32Z

@sourcery-ai review

sourcery-ai

Hey - I've left some high level feedback:

In aggregateSizeAndFileCount, consider mirroring the $ifNull pattern for size (e.g. $sum: { $ifNull: ['$size', 0] }) so that missing or null size fields don’t cause unexpected aggregation behavior over legacy or partially-migrated documents.
You might want to add an explicit $project stage after $group to limit the fields returned to just size and numberOfFiles, which can help keep the aggregation contract tight and avoid accidental leakage of intermediary fields if the pipeline is extended later.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `aggregateSizeAndFileCount`, consider mirroring the `$ifNull` pattern for `size` (e.g. `$sum: { $ifNull: ['$size', 0] }`) so that missing or null `size` fields don’t cause unexpected aggregation behavior over legacy or partially-migrated documents.
- You might want to add an explicit `$project` stage after `$group` to limit the fields returned to just `size` and `numberOfFiles`, which can help keep the aggregation contract tight and avoid accidental leakage of intermediary fields if the pipeline is extended later.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

alubbock · 2026-05-08T22:21:39Z

@sourcery-ai Thanks for the follow-up review. Two responses:

$ifNull on size — not needed here. MongoDB's $sum already treats missing or null numeric fields as 0, so the behaviour is identical to wrapping in $ifNull. This is different from the dataFileList case, where a missing array causes $size to error. size is also declared required in the schema, making the defensive case unlikely. No behaviour change would result from adding it.

Explicit $project stage — the field stripping is already handled in TypeScript before the result is returned:

return result
  ? { size: result.size, numberOfFiles: result.numberOfFiles }
  : { size: 0, numberOfFiles: 0 };

The $group stage only ever emits _id, size, and numberOfFiles, so there are no intermediary fields to leak regardless. Adding a $project would be a no-op pipeline stage.

alubbock added 2 commits May 8, 2026 22:09

alubbock requested a review from a team as a code owner May 8, 2026 21:57

sourcery-ai Bot reviewed May 8, 2026

View reviewed changes

Comment thread src/origdatablocks/origdatablocks.controller.spec.ts

alubbock changed the title ~~perf(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles~~ fix(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles May 8, 2026

sourcery-ai Bot reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles#2726

fix(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles#2726
alubbock wants to merge 3 commits into
SciCatProject:masterfrom
rosalindfranklininstitute:perf/002-aggregate-size-calculation

alubbock commented May 8, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

alubbock commented May 8, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

alubbock commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alubbock commented May 8, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation

Correctness

Performance

Changes:

Tests included

Documentation

Summary by Sourcery

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alubbock commented May 8, 2026

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

alubbock commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

alubbock commented May 8, 2026 •

edited by sourcery-ai Bot

Loading