fix(origdatablocks): replace JS map/reduce with aggregation in updateDatasetSizeAndFiles#2726
Conversation
…iles The previous implementation fetched all OrigDatablock documents for a dataset into memory — including full dataFileList arrays — then summed size and file count in JavaScript. This sent unbounded data across the wire on every origdatablock create/update/delete. Replaced with a single $group aggregation that computes both totals server-side (PERF-002).
…dFiles
- Service: three cases covering normal totals, empty dataset, and
correct $match datasetId propagation
- Controller: two cases verifying the aggregation result flows through
to datasetsService.findByIdAndUpdate (including zero-total case)
- Fix: destructure only {size, numberOfFiles} from the $group result
so callers don't receive the internal _id:null field
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In
aggregateSizeAndFileCount, consider making the$sizeexpression more defensive (e.g. wrapping"$dataFileList"in$ifNullor$condto handle missing or non-array values) so that the aggregation remains robust even if legacy or malformed documents lackdataFileList.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `aggregateSizeAndFileCount`, consider making the `$size` expression more defensive (e.g. wrapping `"$dataFileList"` in `$ifNull` or `$cond` to handle missing or non-array values) so that the aggregation remains robust even if legacy or malformed documents lack `dataFileList`.
## Individual Comments
### Comment 1
<location path="src/origdatablocks/origdatablocks.controller.spec.ts" line_range="167" />
<code_context>
});
});
+
+ describe("updateDatasetSizeAndFiles", () => {
+ let controllerWithRealMethod: OrigDatablocksController;
+ let realOrigDatablocksService: OrigDatablocksServiceMock;
</code_context>
<issue_to_address>
**issue (complexity):** Consider reusing the existing test module/controller and, optionally, driving assertions through the public update handler to reduce duplication and private-method coupling while keeping the same coverage.
You can keep the new coverage while avoiding the extra module setup, second controller instance, and private‑method coupling duplication.
Two concrete changes:
1. **Reuse the existing module/controller for all tests**
Move the internal method mocking into the `describe("update")` block so the controller’s real `updateDatasetSizeAndFiles` is available for the new tests:
```ts
describe("OrigDatablocksController", () => {
let controller: OrigDatablocksController;
let origDatablocksService: OrigDatablocksServiceMock;
let datasetsService: DatasetsServiceMock;
beforeEach(async () => {
const module: TestingModule = await Test.createTestingModule({
controllers: [OrigDatablocksController],
imports: [ConfigModule],
providers: [
{ provide: OrigDatablocksService, useClass: OrigDatablocksServiceMock },
{ provide: DatasetsService, useClass: DatasetsServiceMock },
{ provide: CaslAbilityFactory, useClass: CaslAbilityFactoryMock },
],
}).compile();
controller = module.get(OrigDatablocksController);
origDatablocksService = module.get(OrigDatablocksService) as any;
datasetsService = module.get(DatasetsService) as any;
});
describe("update", () => {
beforeEach(() => {
// Mock internal helper only for update handler tests
controller["updateDatasetSizeAndFiles"] = jest.fn();
});
// existing update tests...
});
describe("updateDatasetSizeAndFiles", () => {
it("should use aggregateSizeAndFileCount and update the dataset", async () => {
origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
size: 2000,
numberOfFiles: 5,
});
await controller["updateDatasetSizeAndFiles"]("testPid");
expect(
origDatablocksService.aggregateSizeAndFileCount,
).toHaveBeenCalledWith("testPid");
expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("testPid", {
size: 2000,
numberOfFiles: 5,
});
});
it("should propagate zero totals when no origdatablocks exist", async () => {
origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
size: 0,
numberOfFiles: 0,
});
await controller["updateDatasetSizeAndFiles"]("emptyPid");
expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("emptyPid", {
size: 0,
numberOfFiles: 0,
});
});
});
});
```
This removes the second `beforeEach`, the `controllerWithRealMethod` variable, and duplicate providers, while preserving behavior and coverage.
2. **Optional: reduce private‑method coupling by testing via public API**
If feasible, add/adjust one test that drives `update` and asserts on the dataset service call instead of calling the private helper directly:
```ts
it("update should aggregate and update dataset size/files", async () => {
origDatablocksService.aggregateSizeAndFileCount.mockResolvedValue({
size: 2000,
numberOfFiles: 5,
});
await controller.update(mockRequest as any, "123", mockDto);
expect(origDatablocksService.aggregateSizeAndFileCount).toHaveBeenCalledWith(
"ds1", // or appropriate dataset id
);
expect(datasetsService.findByIdAndUpdate).toHaveBeenCalledWith("ds1", {
size: 2000,
numberOfFiles: 5,
});
});
```
You can keep the private‑method tests if you want the tighter unit checks, but the first change alone substantially reduces complexity without losing functionality.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
- Wrap dataFileList in $ifNull in aggregateSizeAndFileCount so that documents with a missing or null field do not cause the $size operator to error - Collapse controller spec to a single module setup; move the updateDatasetSizeAndFiles stub into the update describe block so the helper's own tests can use the real implementation directly, removing the duplicate beforeEach and extra controller instance
|
@sourcery-ai review |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In
aggregateSizeAndFileCount, consider mirroring the$ifNullpattern forsize(e.g.$sum: { $ifNull: ['$size', 0] }) so that missing or nullsizefields don’t cause unexpected aggregation behavior over legacy or partially-migrated documents. - You might want to add an explicit
$projectstage after$groupto limit the fields returned to justsizeandnumberOfFiles, which can help keep the aggregation contract tight and avoid accidental leakage of intermediary fields if the pipeline is extended later.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `aggregateSizeAndFileCount`, consider mirroring the `$ifNull` pattern for `size` (e.g. `$sum: { $ifNull: ['$size', 0] }`) so that missing or null `size` fields don’t cause unexpected aggregation behavior over legacy or partially-migrated documents.
- You might want to add an explicit `$project` stage after `$group` to limit the fields returned to just `size` and `numberOfFiles`, which can help keep the aggregation contract tight and avoid accidental leakage of intermediary fields if the pipeline is extended later.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
@sourcery-ai Thanks for the follow-up review. Two responses:
Explicit return result
? { size: result.size, numberOfFiles: result.numberOfFiles }
: { size: 0, numberOfFiles: 0 };The |
Description
Replaces the
updateDatasetSizeAndFileshelper inOrigDatablocksControllerwith a single MongoDB$groupaggregation.Motivation
Correctness
The previous implementation called
findAllwith nolimits, which causedparseLimitFiltersto apply its default oflimit: 100. For any dataset backed by more than 100 origdatablocks, only the first 100 were included in the sum, silently understatingsizeandnumberOfFileson the dataset record. This bug is triggered on every origdatablock create, update, and delete.The aggregation has no such cap -- it processes every matching document.
Performance
Beyond correctness, the previous approach fetched full origdatablock documents into application memory, including their complete
dataFileListarrays, and computed the totals in JavaScript. The replacement pushes both calculations to the database using$sumand returns only the two scalars needed.Related upstream issues: #2250 (double-counting of size/files when creating datablocks) and #1688 (size reset to 0 after dataset PATCH) both involve the same size-tracking logic.
Changes:
src/origdatablocks/origdatablocks.service.ts-- newaggregateSizeAndFileCount(datasetId)method using$match+$group;dataFileListwrapped in$ifNullso documents with a missing or null field return 0 rather than erroring; returns{ size, numberOfFiles }, stripping the internal_id: nullfield from the$groupresultsrc/origdatablocks/origdatablocks.controller.ts--updateDatasetSizeAndFilesreduced to a single call to the new service method; unusedPartialUpdateDatasetDtoimport removedTests included
Three new service-level tests for
aggregateSizeAndFileCount:{ size: 0, numberOfFiles: 0 }when none exist for the datasetdatasetIdto the$matchstageTwo new controller-level tests for
updateDatasetSizeAndFiles:datasetsService.findByIdAndUpdateDocumentation
Summary by Sourcery
Replace dataset size and file count recalculation for origdatablocks with a MongoDB aggregation-based implementation to ensure correct totals for all datasets.
Bug Fixes:
Enhancements:
Tests: