Skip to content

refactor(metadatakeys): merge duplicate keys across datasets and track group references#2697

Open
Junjiequan wants to merge 31 commits into
masterfrom
refactor-metadatakeys-service
Open

refactor(metadatakeys): merge duplicate keys across datasets and track group references#2697
Junjiequan wants to merge 31 commits into
masterfrom
refactor-metadatakeys-service

Conversation

@Junjiequan
Copy link
Copy Markdown
Member

@Junjiequan Junjiequan commented Apr 20, 2026

Description

MetadataKeys are now deduplicated, instead of one entry per key per dataset, a single entry is shared across all datasets containing the same key, with userGroups, usageCount, and userGroupCounts merged.

userGroupCounts

New internal field that tracks how many datasets per group reference this key:

dataset1: {ownerGroup:"group1", userGroups:["group2"]},
dataset2: {ownerGroup:"group1"}
dataset3: {ownerGroup:"group1"}

"userGroupCounts": { "group-1": 3, "group-2": 1 },
"userGroups": ["group-1", "group-2"],
"usageCount": 3

When a group's count hits 0 it is removed from userGroups. When usageCount hits 0 the entry is deleted.

Changes

  • Parent document (dataset, proposal) permission changes (ownerGroup, accessGroups) update userGroups and userGroupCounts accordingly
  • If any parent document has isPublished: true, the metadata key will have isPublished:true as well and is visible to all users
  • Migration script included with documentation explaining the aggregation pipeline to populate MetadataKeys from existing Dataset documents — must be run manually before deploying (duration depends on dataset count and scientificMetadata count per dataset). Keys used by more than 1000 datasets are automatically made public.

Motivation

Fixes

  • Bug fixed (#X)

Tests included

  • Included for each change/fix?
  • Passing?

Documentation

  • swagger documentation updated (required for API changes)
  • official documentation updated

official documentation info

Summary by Sourcery

Track metadata keys at a global level across datasets and other entities, aggregating usage and access control by key rather than per-source document.

New Features:

  • Expose metadata key usage counts and per-group reference counts, enabling queries over how often and by whom metadata keys are used.
  • Add a migration to backfill the MetadataKeys collection from existing dataset scientificMetadata, merging keys across datasets and computing group usage.
  • Extend metadata key search capabilities with filtering on humanReadableName, regex matching, field projection, and pagination semantics.

Enhancements:

  • Refine metadata key aggregation to use sourceType plus key and humanReadableName as the identifier, allowing the same key to be merged across multiple datasets and groups.
  • Update datasets, samples, proposals, and instruments services to emit unified MetadataSourceDoc structures that carry resolved userGroups and drive consistent metadata key bookkeeping.
  • Strengthen access-control semantics so metadata keys respect publication status and user group visibility while avoiding exposure of group-private keys to unauthorized users.
  • Improve documentation for the MetadataKeys module and its migration, including examples of filters and an architectural overview of synchronization behavior.

Documentation:

  • Expand developer documentation for the MetadataKeys module to describe the new schema, synchronization flow, filter options, and initial migration requirements.
  • Document the new migration that rebuilds the MetadataKeys collection from dataset scientificMetadata, including pipeline details and operational guidance.

Tests:

  • Add extensive unit tests for MetadataKeysService covering aggregation pipelines, insert/delete/replace behaviors, and group/publish state transitions.
  • Add end-to-end tests around the /metadatakeys API to validate access control, search behavior, pagination, projection, and usageCount semantics across different user roles.

@Junjiequan Junjiequan requested a review from a team as a code owner April 20, 2026 18:41
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 6 issues, and left some high level feedback:

  • The migration uses aggregate(...).toArray() with $merge, which will materialize all results in memory; consider iterating the cursor (or omitting toArray() entirely when $merge is the terminal stage) to avoid potential OOM issues on large datasets.
  • Several unit tests for MetadataKeysService assert on specific updateMany call indices (e.g. mock.calls[1]); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
  • MetadataSourceDoc.sourceType is now derived from model.collection.name (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the sourceType convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The migration uses `aggregate(...).toArray()` with `$merge`, which will materialize all results in memory; consider iterating the cursor (or omitting `toArray()` entirely when `$merge` is the terminal stage) to avoid potential OOM issues on large datasets.
- Several unit tests for `MetadataKeysService` assert on specific `updateMany` call indices (e.g. `mock.calls[1]`); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
- `MetadataSourceDoc.sourceType` is now derived from `model.collection.name` (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the `sourceType` convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.

## Individual Comments

### Comment 1
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="274-283" />
<code_context>
   private async updateSharedKeys(
</code_context>
<issue_to_address>
**issue (bug_risk):** Keys whose humanReadableName changed are updated under the new _id before being treated as delete+insert, which can cause inconsistent counters and partial documents.

In `updateSharedKeys`, the group/publish update block runs for all `sharedKeys`, including those with changed `human_name`. For these, `ids` are built from the new `humanReadableName`, so `updateMany` may target documents that don’t yet exist (and should instead be created via `insertManyFromSource` with full `$setOnInsert`). Later, `humanNameChangedKeys` are processed again via `deleteMany(oldDoc)` + `insertManyFromSource(newDoc)`, so the same logical key is updated twice under different assumptions. To prevent double-counting and partial documents, exclude `humanNameChangedKeys` from the initial group/publish updates and rely only on the delete+insert path for name changes.
</issue_to_address>

### Comment 2
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="297-307" />
<code_context>
+    const publishedFlippedOn = !oldDoc.isPublished && newDoc.isPublished;
+    const hasGroupChanges = addedGroups.length > 0 || removedGroups.length > 0;
+
+    if (hasGroupChanges || publishedFlippedOn) {
+      await this.metadataKeyModel.updateMany(filter, {
+        ...(addedGroups.length > 0 && {
+          $addToSet: { userGroups: { $each: addedGroups } },
+        }),
+        $inc: {
+          ...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
+          ...(removedGroups.length > 0 &&
+            this.groupCountDeltas(removedGroups, -1)),
+        },
+        ...(publishedFlippedOn && { $set: { isPublished: true } }),
+      });
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The `$inc` operator is always included, even when no groups changed, which risks sending an empty `$inc` object.

In this branch, when `publishedFlippedOn` is true but there are no added/removed groups, the update still includes `$inc` with no fields. MongoDB may reject or mis-handle an empty `$inc`. Consider constructing the `$inc` object separately and only spreading it into the update when it has at least one key (i.e., when there is a non-empty group delta).

```suggestion
      const incUpdate = {
        ...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
        ...(removedGroups.length > 0 &&
          this.groupCountDeltas(removedGroups, -1)),
      };

      await this.metadataKeyModel.updateMany(filter, {
        ...(addedGroups.length > 0 && {
          $addToSet: { userGroups: { $each: addedGroups } },
        }),
        ...(Object.keys(incUpdate).length > 0 && { $inc: incUpdate }),
        ...(publishedFlippedOn && { $set: { isPublished: true } }),
      });
```
</issue_to_address>

### Comment 3
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="8-9" />
<code_context>
+List metadata keys visible to the current user for a given source type:

 ```json
 {
   "where": {
-    "sourceType": "dataset",
</code_context>
<issue_to_address>
**issue (bug_risk):** The combination of `preserveNullAndEmptyArrays` and the subsequent `$match` drops group-less datasets from `usageCount`, contrary to the comment.

Using `$unwind` with `preserveNullAndEmptyArrays: true` creates records with `userGroups: null` for group-less datasets, but the subsequent `$match: { userGroups: { $nin: [null, ""] } }` removes them. As a result, datasets without groups never reach the later grouping and are missing from `usageCount`, which contradicts the intent described in the comment. If group-less datasets should still contribute to `usageCount`, the pipeline needs to retain them (e.g., by handling `userGroups: null` explicitly or isolating the group-based logic so it doesn’t drop these records).
</issue_to_address>

### Comment 4
<location path="test/MetadataKeys.js" line_range="514-419" />
<code_context>
+  it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test that updates `accessGroups` (not just `ownerGroup`) and asserts that `userGroups` and access control on keys are updated accordingly.

Since `MetadataKeys` derives `userGroups` from both `ownerGroup` and `accessGroups`, and the service updates counts based on `userGroups`, this path should be tested as well. Please also add a test that:

- Starts with a dataset with `accessGroups: ["group1"]` and a key
- Patches the dataset to change `accessGroups` (e.g. to `["group2"]`)
- Asserts via `/metadatakeys` that:
  - `userGroups` on the key reflects the updated access groups
  - A user who only belonged to the removed group can no longer see the key

This will cover the `accessGroups` mutation path and its impact on access control and group reference counting.

Suggested implementation:

```javascript
    await deleteDataset(pid);
  });

  it("0641: changing accessGroups updates userGroups and key visibility", async () => {
    const pid = await createDataset({
      ...TestData.DatasetWithScientificMetadataV4,
      ownerGroup: "group-owner",
      accessGroups: ["group1"],
      isPublished: false,
      scientificMetadata: { access_group_change_key: { value: 42 } },
    });

    const filterKey = {
      where: { sourceType: "Dataset", key: "access_group_change_key" },
    };

    // user belonging to group1 can see the key initially
    const keysForGroup1Before = await getMetadataKeys(
      filterKey,
      accessTokenGroup1
    );
    keysForGroup1Before.body.should.have.lengthOf(1);
    keysForGroup1Before.body[0].userGroups.should.contain("group1");
    keysForGroup1Before.body[0].userGroups.should.not.contain("group2");

    // patch dataset to switch accessGroups from group1 to group2
    await patchDataset(
      pid,
      { accessGroups: ["group2"] },
      accessTokenAdmin
    );

    // after patch, key should reflect updated userGroups (derived from updated accessGroups)
    const keysForGroup2After = await getMetadataKeys(
      filterKey,
      accessTokenGroup2
    );
    keysForGroup2After.body.should.have.lengthOf(1);
    keysForGroup2After.body[0].userGroups.should.not.contain("group1");
    keysForGroup2After.body[0].userGroups.should.contain("group2");

    // user that only belonged to removed group1 can no longer see the key
    const keysForGroup1After = await getMetadataKeys(
      filterKey,
      accessTokenGroup1
    );
    keysForGroup1After.body.should.have.lengthOf(0);

    await deleteDataset(pid);
  });

  it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {

```

To make this compile and align with your existing tests, you may need to:
1. Ensure `accessTokenGroup1` and `accessTokenGroup2` exist and represent users who:
   - Both can see datasets/keys owned by `group-owner`.
   - `accessTokenGroup1` has group membership `group1` but not `group2`.
   - `accessTokenGroup2` has group membership `group2` (and ideally not `group1`).
   If your test suite uses a different naming convention (e.g. `accessTokenUserGroup1`), adjust the token names in the new test accordingly.
2. Confirm that `patchDataset` is available in this file and patches `accessGroups` without resetting other fields. If the helper has a different signature, update the call in the new test to match.
3. If your `getMetadataKeys` helper or the `/metadatakeys` response shape differs (e.g. `userGroups` lives under a nested property), adapt the assertions to reference the correct path.
</issue_to_address>

### Comment 5
<location path="docs/developer-guide/migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.md" line_range="143" />
<code_context>
+{ datasetId: "uuid-B", key: "temperature", isPublished: false, humanReadableName: "Temperature", userGroups: ["group-1"] }
+```
+
+> `ownerGroup` is mandatory field so no null fallback is needed. `accessGroups` is optional so it falls back to `[]`.
+
+---
</code_context>
<issue_to_address>
**nitpick (typo):** Add an article in “ownerGroup is mandatory field” for correct grammar.

You could rephrase this as: “`ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.”

```suggestion
> `ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.
```
</issue_to_address>

### Comment 6
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="3" />
<code_context>
+const SOURCE_COLLECTIONS = ["Dataset"];
+
+function buildPipeline(sourceType) {
+  return [
+    // -------------------------------------------------------------------------
</code_context>
<issue_to_address>
**issue (complexity):** Consider decomposing the monolithic aggregation pipeline into smaller helper functions and a dedicated merge-stage builder so each phase of the migration is easier to understand and maintain.

You can keep all current behavior and significantly reduce cognitive load by decomposing the pipeline and isolating the merge logic.

### 1. Split the monolithic `buildPipeline` into small, named helpers

Right now `buildPipeline` mixes projection, grouping, and merge configuration in one large array. You can keep the exact aggregation semantics but break it up into helpers so each concern is localized and easier to test and review.

For example:

```js
function buildInitialStages() {
  return [
    { $match: { scientificMetadata: { $exists: true, $type: "object" } } },
    {
      $project: {
        datasetId: "$_id",
        ownerGroup: 1,
        accessGroups: 1,
        isPublished: 1,
        metaArr: { $objectToArray: "$scientificMetadata" },
      },
    },
    { $unwind: "$metaArr" },
    {
      $project: {
        datasetId: 1,
        key: "$metaArr.k",
        isPublished: 1,
        humanReadableName: { $ifNull: ["$metaArr.v.human_name", ""] },
        userGroups: {
          $setUnion: [["$ownerGroup"], { $ifNull: ["$accessGroups", []] }],
        },
      },
    },
    {
      $unwind: {
        path: "$userGroups",
        preserveNullAndEmptyArrays: true,
      },
    },
    { $match: { userGroups: { $nin: [null, ""] } } },
  ];
}

function groupByKeyAndGroup(sourceType) {
  return [
    {
      $group: {
        _id: {
          metaKeyId: {
            $concat: [`${sourceType}_`, "$key", "_", "$humanReadableName"],
          },
          group: "$userGroups",
        },
        key: { $first: "$key" },
        humanReadableName: { $first: "$humanReadableName" },
        isPublished: { $max: "$isPublished" },
        groupCount: { $sum: 1 },
        datasetIds: { $addToSet: "$datasetId" },
      },
    },
    {
      $group: {
        _id: "$_id.metaKeyId",
        key: { $first: "$key" },
        humanReadableName: { $first: "$humanReadableName" },
        isPublished: { $max: "$isPublished" },
        userGroups: { $push: "$_id.group" },
        userGroupCountsArr: { $push: { k: "$_id.group", v: "$groupCount" } },
        datasetIdSets: { $push: "$datasetIds" },
      },
    },
  ];
}

function projectFinalShape(sourceType) {
  return [
    {
      $project: {
        _id: 1,
        key: 1,
        sourceType: { $literal: sourceType },
        humanReadableName: 1,
        isPublished: 1,
        userGroups: 1,
        userGroupCounts: { $arrayToObject: "$userGroupCountsArr" },
        usageCount: {
          $size: {
            $reduce: {
              input: "$datasetIdSets",
              initialValue: [],
              in: { $setUnion: ["$$value", "$$this"] },
            },
          },
        },
        createdBy: { $literal: "migration" },
        createdAt: { $toDate: "$$NOW" },
      },
    },
  ];
}

function buildPipeline(sourceType) {
  return [
    ...buildInitialStages(),
    ...groupByKeyAndGroup(sourceType),
    ...projectFinalShape(sourceType),
    buildMergeStage(),
  ];
}
```

This keeps the same pipeline stages but makes each phase independently understandable.

### 2. Extract the `$merge` `whenMatched` pipeline into a dedicated builder

The `whenMatched` expression is dense and currently buried at the bottom of `buildPipeline`. Extracting it into its own function makes it clearer that it’s “just” merge policy and decouples it from the shape-building stages.

```js
function buildWhenMatchedPipeline() {
  return [
    {
      $replaceWith: {
        $mergeObjects: [
          "$$new",
          {
            createdAt: { $ifNull: ["$createdAt", "$$new.createdAt"] },
            createdBy: { $ifNull: ["$createdBy", "$$new.createdBy"] },
            updatedBy: { $literal: "migration" },
            updatedAt: { $toDate: "$$NOW" },

            userGroups: { $setUnion: ["$userGroups", "$$new.userGroups"] },

            userGroupCounts: {
              $arrayToObject: {
                $map: {
                  input: {
                    $setUnion: [
                      {
                        $map: {
                          input: { $objectToArray: "$userGroupCounts" },
                          as: "e",
                          in: "$$e.k",
                        },
                      },
                      {
                        $map: {
                          input: { $objectToArray: "$$new.userGroupCounts" },
                          as: "e",
                          in: "$$e.k",
                        },
                      },
                    ],
                  },
                  as: "group",
                  in: {
                    k: "$$group",
                    v: {
                      $add: [
                        {
                          $ifNull: [
                            {
                              $getField: {
                                field: "$$group",
                                input: "$userGroupCounts",
                              },
                            },
                            0,
                          ],
                        },
                        {
                          $ifNull: [
                            {
                              $getField: {
                                field: "$$group",
                                input: "$$new.userGroupCounts",
                              },
                            },
                            0,
                          ],
                        },
                      ],
                    },
                  },
                },
              },
            },

            isPublished: { $or: ["$isPublished", "$$new.isPublished"] },
            usageCount: { $add: ["$usageCount", "$$new.usageCount"] },
          },
        ],
      },
    },
  ];
}

function buildMergeStage() {
  return {
    $merge: {
      into: "MetadataKeys",
      on: "_id",
      whenMatched: buildWhenMatchedPipeline(),
      whenNotMatched: "insert",
    },
  };
}
```

This preserves all existing merge behavior (including future multi-source support), but makes the migration file far easier to scan: `buildPipeline` reads as “shape → aggregate → finalize → merge”, and the complex merge internals live in a clearly named helper that can be documented and unit-tested separately.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread src/metadata-keys/metadatakeys.service.ts Outdated
Comment thread src/metadata-keys/metadatakeys.service.ts Outdated
Comment thread migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js Outdated
Comment thread test/MetadataKeys.js
@Junjiequan Junjiequan force-pushed the refactor-metadatakeys-service branch 2 times, most recently from 8c915ea to 7273ab8 Compare April 27, 2026 13:11
@Junjiequan Junjiequan force-pushed the refactor-metadatakeys-service branch from 7273ab8 to 9a41f76 Compare April 28, 2026 07:51
Copy link
Copy Markdown
Member

@rkweehinzmann rkweehinzmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reimplementation of this service!

One other question:
Are nested scientific metadata values handled that have themselves nested values?

Thank you.

- Stability: Crashes occurred when retrieval limits were missing or improperly configured.
- Risks: Users could see metadata keys they did not have permissions to access.

---
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain in which cases this can happen that a user sees metadata keys they do not have permissions for?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the metadata keys returned from /api/v3/datasets/metadataKeys are not filterd by permission

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but what does it mean, eg if /api/v4/datasets/metadatakeys is used they will always be filtered by permissions? From your initial description I thought it can have to do with setting limits wrongly ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new endpoint /api/v4/metadataKeys/findAll?{query} will always return filtered metadatakeys by permissions.

I'm not sure I understand your second quesion

/**
* Called when a dataset is updated.
*
* Diffs the old and new metadata to produce three disjoint key sets:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who will know what old and new metadata is in 2 months or 2 years (or even now)? Could you add some short description of what is called old and what new, please? Thank you.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

old metadata is the metadata befor change, new metadata is the metadata after change

so if I change metadata from {"metadataKey1":"text"} to {"meatadataKey2":"text"}
the former is old metadata, latter is the new metadata

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want me to see that metadataKey1 and metadataKey2 are different names, but both refer to the same value (which is the actual meta data and did not change)? I meant if you call something old or new, you're referring of to an old/new way of handling things. What was the old way, what the new way? Apologies, probably too obvious for the programmer!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify what it is used for.

Consider a data provider A and a data receiver B.
Initially, A sends {key1}[old] to B, and B stores it. Later, A renames the key from {key1}[old] to {key2}[new]. If A only sends {key2}[new], B ends up holding both {key1}[old] and {key2}[new] — even though the old key no longer exists on A's side. This creates a sync inconsistency.
To prevent this, when A changes {key1}[old] to {key2}[new], it must provide both the old and new state to B, effectively saying:

"Here is what changed — discard [old] and replace it with [new]."

This is exactly why replaceManyFromSource requires both oldDoc (pre-update) and newDoc (post-update) — so B can recompute the difference and stay in sync with A.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also updated the comment to state that, old is before change and new is after change. There's many terms to describe it, but old and new is probably is the shortest term and it should be self-explanatory


Rebuilds the `MetadataKeys` collection from scratch by scanning all `Dataset` documents and extracting every key found in `scientificMetadata`.

Each unique `(sourceType, key, humanReadableName)` combination becomes one `MetadataKey` document. If the same key appears across multiple datasets, their `userGroups` and counts are merged into a single entry.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need humanReadableName? Isn't sourceType, key already unique?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same key might have completely different humanReadableName, so the uniqueness is based on sourceType+key+humanReadableName

```js
// MetadataKey document
{
_id: "Dataset_temperature_Temperature", // ${sourceType}_${key}_${humanReadableName}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed offline, I would keep uuid as ID and eventually add an additional field with the concat. This will make easier to have multiple documents with the same combination in case we will ever want to split big documents

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the documentation later, I changed the id to use UUID but forgot to change doc here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs updated

humanReadableName: "Temperature",
sourceType: "Dataset",
isPublished: true,
usageCount: 2,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this? Isn't userGroupCounts enough?

Copy link
Copy Markdown
Member Author

@Junjiequan Junjiequan May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This avoids edge cases where a dataset has no groups but isPublished: true (unlikely, but ownerGroup is not a required field in OwnableDto).
Additionally, userGroupCounts cannot replace usageCount because a dataset can belong to multiple groups, whereas usageCount counts each dataset only once.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about this:

function buildPipeline(sourceType) {
  return [
    {
      $project: {
        isPublished: 1,
        groups: { $setUnion: [["$ownerGroup"], { $ifNull: ["$accessGroups", []] }] },
        meta: { $objectToArray: "$scientificMetadata" },
      }
    },
    { $unwind: "$meta" },
    { $unwind: "$groups" },
    { $match: { groups: { $ne: null, $ne: "" } } },
    {
      $group: {
        _id: { 
          k: "$meta.k", 
          g: "$groups", 
          n: { $ifNull: ["$meta.v.human_name", ""] } 
        },
        count: { $sum: 1 },
        isPublished: { $max: "$isPublished" }
      }
    }
  ];
}

module.exports = {
  async up(db) {
    const start = Date.now();
    const elapsed = () => `${((Date.now() - start) / 1000).toFixed(1)}s`;

    await db.collection("MetadataKeys").deleteMany({});
    await db.collection("MetadataKeys").createIndex({ metaKeyId: 1 }, { unique: true });

    for (const collectionName of SOURCE_COLLECTIONS) {
      console.log(`[${elapsed()}] Starting stream for ${collectionName}...`);

      // 1. Open a single cursor for the entire collection
      const cursor = db.collection(collectionName).aggregate(buildPipeline(collectionName), { 
        allowDiskUse: true 
      });

      let operations = [];
      let totalProcessed = 0;

      // 2. Iterate over the cursor without toArray()
      for await (const doc of cursor) {
        const metaKeyId = `${collectionName}_${doc._id.k}_${doc._id.n}`;
        
        operations.push({
          updateOne: {
            filter: { metaKeyId },
            update: {
              $set: {
                key: doc._id.k,
                humanReadableName: doc._id.n,
                sourceType: collectionName,
                updatedBy: "migration",
                updatedAt: new Date()
              },
              $setOnInsert: { createdAt: new Date(), createdBy: "migration" },
              $max: { isPublished: doc.isPublished },
              $addToSet: { userGroups: doc._id.g },
              $inc: { 
                usageCount: 1,
                [`userGroupCounts.${doc._id.g.replace(/\./g, '_')}`]: doc.count 
              }
            },
            upsert: true
          }
        });

        // 3. Periodic Flush: execute bulk write every BATCH_SIZE items
        if (operations.length >= BATCH_SIZE) {
          await db.collection("MetadataKeys").bulkWrite(operations);
          totalProcessed += operations.length;
          console.log(`[${elapsed()}] ${collectionName}: Flushed ${totalProcessed.toLocaleString()} keys...`);
          operations = []; // Clear the bucket
        }
      }

      // 4. Final flush for any remaining operations
      if (operations.length > 0) {
        await db.collection("MetadataKeys").bulkWrite(operations);
        totalProcessed += operations.length;
      }

      console.log(`[${elapsed()}] ✅ ${collectionName} complete. Total keys: ${totalProcessed}`);
    }

    // Clean up: switch metaKeyId to the official _id
    await db.collection("MetadataKeys").updateMany({}, [
      { $set: { id: "$metaKeyId" } }, 
      { $unset: "metaKeyId" }
    ]);
  }
};

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(to be changed with the right key inplace of this)

      const metaKeyId = `${sourceType}_${key}_${humanName}`;

Copy link
Copy Markdown
Member Author

@Junjiequan Junjiequan May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid to use cursor for the migration script, it will be extremly slow to retrive data store in the memory and process them. since each doc transferred into the mongoDB needs to be encoded

Example of comparison of the performances using cursor vs aggregate + $mege to handle everything:

image

This testing was using 20k datasets with 5~10 ScientificMetadata per datasets

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but doesn't that depend largely on the BATCH_SIZE? The thing is that the current migration is hardly maintainable, it's extremely complex if we need to change anything

Copy link
Copy Markdown
Member Author

@Junjiequan Junjiequan May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the bigger BATCH_SIZE the faster with aggregate + $merge, .

I would expect this migration script not to be touched and any further maintainance should be a new migration patch on top of this migration.
But TBH, I don't really mind the script to be slow, as long as it's robust, since we need it to run only once. So, if you suggest that readability should be more prioratized in this case, Im fine with it.
I also felt like current implementation is hard to read and that's why I included a doc to explain each step.

Copy link
Copy Markdown
Member Author

@Junjiequan Junjiequan May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this migration script not to be touched

By this I meant once the migration is merged I expect it not to be changed, not that im saying current code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The readability is improved by removing some unnecessary fields, such as check existing fields then overwrite logic, which is not needed, since we wipe everything when migration starts.
Documentation is also updated with better explanation for aggregation pipe

export const MetadataKeySchema = SchemaFactory.createForClass(MetadataKeyClass);

MetadataKeySchema.index({ sourceType: 1, sourceId: 1 });
MetadataKeySchema.index({ sourceType: 1, isPublished: 1, key: 1 });
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need so many indexes?

Copy link
Copy Markdown
Member Author

@Junjiequan Junjiequan May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indexs are to handle queries from unauthenticated users and authenticated users.

for authenticated users the query will be

MetadataKey.find({
  sourceType: 'dataset',
  $and: [
    {
      $or: [
        { isPublished: true },
        { userGroups: { $in: user.groups } }
      ]
    },
    {
      $or: [
        { key: job_id },
        { humanReadableName: JOB_ID }
      ]
    }
  ]
});

for unauthenticated users the query will be:

MetadataKey.find({
  sourceType: 'dataset',
  $and: [
    {
      $or: [
        { isPublished: true },
      ]
    },
    {
      $or: [
        { key: job_id },
        { humanReadableName: JOB_ID }
      ]
    }
  ]
});

Comment thread src/samples/samples.service.ts Outdated
@@ -45,10 +45,10 @@ export class SamplesService {
doc: UpdateQuery<SampleDocument>,
): MetadataSourceDoc {
const source: MetadataSourceDoc = {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this snippet is replicated in all collections, worth making a util function that takes doc and name as input?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, maybe it's worth to have a shared function here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 1d4f99a

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about something like this:

@Injectable({ scope: Scope.REQUEST })
export class MetadataKeysService {
  constructor(
    @InjectModel(MetadataKeyClass.name)
    private metadataKeyModel: Model<MetadataKeyDocument>,
  ) {}

  // Centralized logic for changing counts
  private async adjustCounts(doc: MetadataSourceDoc, delta: 1 | -1): Promise<void> {
    const { sourceType, userGroups, metadata } = doc;
    const ops = Object.entries(metadata).map(([key, entry]) => {
      const humanName = (entry as ScientificMetadataEntry).human_name ?? "";
      const metaKeyId = `${sourceType}_${key}_${humanName}`;

      return {
        updateOne: {
          filter: { metaKeyId },
          update: {
            $set: { key, sourceType, humanReadableName: humanName },
            $inc: {
              usageCount: delta,
              ...Object.fromEntries(userGroups.map(g => [`userGroupCounts.${g.replace(/\./g, '_')}`, delta]))
            },
            ...(delta === 1 && { $max: { isPublished: doc.isPublished } }),
            $setOnInsert: { createdAt: new Date() }
          },
          upsert: delta === 1, // Only upsert when adding
        }
      };
    });

    if (ops.length > 0) {
      await this.metadataKeyModel.bulkWrite(ops);
      
      // Cleanup: Only run when decrementing
      if (delta === -1) {
        await this.metadataKeyModel.deleteMany({ usageCount: { $lte: 0 } });
      }
    }
  }

  async insertManyFromSource(doc: MetadataSourceDoc) {
    await this.adjustCounts(doc, 1);
  }

  async deleteMany(doc: MetadataSourceDoc) {
    await this.adjustCounts(doc, -1);
  }

  async replaceManyFromSource(oldDoc: MetadataSourceDoc, newDoc: MetadataSourceDoc) {
    // Standard "out with the old, in with the new"
    // MongoDB handles the interleaving safely
    await this.deleteMany(oldDoc);
    await this.insertManyFromSource(newDoc);
  }
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(to be changed with the right key inplace of this)

      const metaKeyId = `${sourceType}_${key}_${humanName}`;

Copy link
Copy Markdown
Member Author

@Junjiequan Junjiequan May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code you suggested here looks very good, I will test it locally and see if there's any missing edge cases

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in 32f9540

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants