refactor(metadatakeys): merge duplicate keys across datasets and track group references by Junjiequan · Pull Request #2697 · SciCatProject/backend

Junjiequan · 2026-04-20T18:41:11Z

Description

MetadataKeys are now deduplicated, instead of one entry per key per dataset, a single entry is shared across all datasets containing the same key, with userGroups, usageCount, and userGroupCounts merged.

userGroupCounts

New internal field that tracks how many datasets per group reference this key:

dataset1: {ownerGroup:"group1", userGroups:["group2"]},
dataset2: {ownerGroup:"group1"}
dataset3: {ownerGroup:"group1"}

"userGroupCounts": { "group-1": 3, "group-2": 1 },
"userGroups": ["group-1", "group-2"],
"usageCount": 3

When a group's count hits 0 it is removed from userGroups. When usageCount hits 0 the entry is deleted.

Changes

Parent document (dataset, proposal) permission changes (ownerGroup, accessGroups) update userGroups and userGroupCounts accordingly
If any parent document has isPublished: true, the metadata key will have isPublished:true as well and is visible to all users
Migration script included with documentation explaining the aggregation pipeline to populate MetadataKeys from existing Dataset documents — must be run manually before deploying (duration depends on dataset count and scientificMetadata count per dataset). Keys used by more than 1000 datasets are automatically made public.

Motivation

Fixes

Bug fixed (#X)

Tests included

Included for each change/fix?
Passing?

Documentation

swagger documentation updated (required for API changes)
official documentation updated

official documentation info

Summary by Sourcery

Track metadata keys at a global level across datasets and other entities, aggregating usage and access control by key rather than per-source document.

New Features:

Expose metadata key usage counts and per-group reference counts, enabling queries over how often and by whom metadata keys are used.
Add a migration to backfill the MetadataKeys collection from existing dataset scientificMetadata, merging keys across datasets and computing group usage.
Extend metadata key search capabilities with filtering on humanReadableName, regex matching, field projection, and pagination semantics.

Enhancements:

Refine metadata key aggregation to use sourceType plus key and humanReadableName as the identifier, allowing the same key to be merged across multiple datasets and groups.
Update datasets, samples, proposals, and instruments services to emit unified MetadataSourceDoc structures that carry resolved userGroups and drive consistent metadata key bookkeeping.
Strengthen access-control semantics so metadata keys respect publication status and user group visibility while avoiding exposure of group-private keys to unauthorized users.
Improve documentation for the MetadataKeys module and its migration, including examples of filters and an architectural overview of synchronization behavior.

Documentation:

Expand developer documentation for the MetadataKeys module to describe the new schema, synchronization flow, filter options, and initial migration requirements.
Document the new migration that rebuilds the MetadataKeys collection from dataset scientificMetadata, including pipeline details and operational guidance.

Tests:

Add extensive unit tests for MetadataKeysService covering aggregation pipelines, insert/delete/replace behaviors, and group/publish state transitions.
Add end-to-end tests around the /metadatakeys API to validate access control, search behavior, pagination, projection, and usageCount semantics across different user roles.

sourcery-ai

Hey - I've found 6 issues, and left some high level feedback:

The migration uses aggregate(...).toArray() with $merge, which will materialize all results in memory; consider iterating the cursor (or omitting toArray() entirely when $merge is the terminal stage) to avoid potential OOM issues on large datasets.
Several unit tests for MetadataKeysService assert on specific updateMany call indices (e.g. mock.calls[1]); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
MetadataSourceDoc.sourceType is now derived from model.collection.name (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the sourceType convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The migration uses `aggregate(...).toArray()` with `$merge`, which will materialize all results in memory; consider iterating the cursor (or omitting `toArray()` entirely when `$merge` is the terminal stage) to avoid potential OOM issues on large datasets.
- Several unit tests for `MetadataKeysService` assert on specific `updateMany` call indices (e.g. `mock.calls[1]`); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
- `MetadataSourceDoc.sourceType` is now derived from `model.collection.name` (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the `sourceType` convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.

## Individual Comments

### Comment 1
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="274-283" />
<code_context>
   private async updateSharedKeys(
</code_context>
<issue_to_address>
**issue (bug_risk):** Keys whose humanReadableName changed are updated under the new _id before being treated as delete+insert, which can cause inconsistent counters and partial documents.

In `updateSharedKeys`, the group/publish update block runs for all `sharedKeys`, including those with changed `human_name`. For these, `ids` are built from the new `humanReadableName`, so `updateMany` may target documents that don’t yet exist (and should instead be created via `insertManyFromSource` with full `$setOnInsert`). Later, `humanNameChangedKeys` are processed again via `deleteMany(oldDoc)` + `insertManyFromSource(newDoc)`, so the same logical key is updated twice under different assumptions. To prevent double-counting and partial documents, exclude `humanNameChangedKeys` from the initial group/publish updates and rely only on the delete+insert path for name changes.
</issue_to_address>

### Comment 2
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="297-307" />
<code_context>
+    const publishedFlippedOn = !oldDoc.isPublished && newDoc.isPublished;
+    const hasGroupChanges = addedGroups.length > 0 || removedGroups.length > 0;
+
+    if (hasGroupChanges || publishedFlippedOn) {
+      await this.metadataKeyModel.updateMany(filter, {
+        ...(addedGroups.length > 0 && {
+          $addToSet: { userGroups: { $each: addedGroups } },
+        }),
+        $inc: {
+          ...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
+          ...(removedGroups.length > 0 &&
+            this.groupCountDeltas(removedGroups, -1)),
+        },
+        ...(publishedFlippedOn && { $set: { isPublished: true } }),
+      });
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The `$inc` operator is always included, even when no groups changed, which risks sending an empty `$inc` object.

In this branch, when `publishedFlippedOn` is true but there are no added/removed groups, the update still includes `$inc` with no fields. MongoDB may reject or mis-handle an empty `$inc`. Consider constructing the `$inc` object separately and only spreading it into the update when it has at least one key (i.e., when there is a non-empty group delta).

```suggestion
      const incUpdate = {
        ...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
        ...(removedGroups.length > 0 &&
          this.groupCountDeltas(removedGroups, -1)),
      };

      await this.metadataKeyModel.updateMany(filter, {
        ...(addedGroups.length > 0 && {
          $addToSet: { userGroups: { $each: addedGroups } },
        }),
        ...(Object.keys(incUpdate).length > 0 && { $inc: incUpdate }),
        ...(publishedFlippedOn && { $set: { isPublished: true } }),
      });
```
</issue_to_address>

### Comment 3
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="8-9" />
<code_context>
+List metadata keys visible to the current user for a given source type:

 ```json
 {
   "where": {
-    "sourceType": "dataset",
</code_context>
<issue_to_address>
**issue (bug_risk):** The combination of `preserveNullAndEmptyArrays` and the subsequent `$match` drops group-less datasets from `usageCount`, contrary to the comment.

Using `$unwind` with `preserveNullAndEmptyArrays: true` creates records with `userGroups: null` for group-less datasets, but the subsequent `$match: { userGroups: { $nin: [null, ""] } }` removes them. As a result, datasets without groups never reach the later grouping and are missing from `usageCount`, which contradicts the intent described in the comment. If group-less datasets should still contribute to `usageCount`, the pipeline needs to retain them (e.g., by handling `userGroups: null` explicitly or isolating the group-based logic so it doesn’t drop these records).
</issue_to_address>

### Comment 4
<location path="test/MetadataKeys.js" line_range="514-419" />
<code_context>
+  it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test that updates `accessGroups` (not just `ownerGroup`) and asserts that `userGroups` and access control on keys are updated accordingly.

Since `MetadataKeys` derives `userGroups` from both `ownerGroup` and `accessGroups`, and the service updates counts based on `userGroups`, this path should be tested as well. Please also add a test that:

- Starts with a dataset with `accessGroups: ["group1"]` and a key
- Patches the dataset to change `accessGroups` (e.g. to `["group2"]`)
- Asserts via `/metadatakeys` that:
  - `userGroups` on the key reflects the updated access groups
  - A user who only belonged to the removed group can no longer see the key

This will cover the `accessGroups` mutation path and its impact on access control and group reference counting.

Suggested implementation:

```javascript
    await deleteDataset(pid);
  });

  it("0641: changing accessGroups updates userGroups and key visibility", async () => {
    const pid = await createDataset({
      ...TestData.DatasetWithScientificMetadataV4,
      ownerGroup: "group-owner",
      accessGroups: ["group1"],
      isPublished: false,
      scientificMetadata: { access_group_change_key: { value: 42 } },
    });

    const filterKey = {
      where: { sourceType: "Dataset", key: "access_group_change_key" },
    };

    // user belonging to group1 can see the key initially
    const keysForGroup1Before = await getMetadataKeys(
      filterKey,
      accessTokenGroup1
    );
    keysForGroup1Before.body.should.have.lengthOf(1);
    keysForGroup1Before.body[0].userGroups.should.contain("group1");
    keysForGroup1Before.body[0].userGroups.should.not.contain("group2");

    // patch dataset to switch accessGroups from group1 to group2
    await patchDataset(
      pid,
      { accessGroups: ["group2"] },
      accessTokenAdmin
    );

    // after patch, key should reflect updated userGroups (derived from updated accessGroups)
    const keysForGroup2After = await getMetadataKeys(
      filterKey,
      accessTokenGroup2
    );
    keysForGroup2After.body.should.have.lengthOf(1);
    keysForGroup2After.body[0].userGroups.should.not.contain("group1");
    keysForGroup2After.body[0].userGroups.should.contain("group2");

    // user that only belonged to removed group1 can no longer see the key
    const keysForGroup1After = await getMetadataKeys(
      filterKey,
      accessTokenGroup1
    );
    keysForGroup1After.body.should.have.lengthOf(0);

    await deleteDataset(pid);
  });

  it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {

```

To make this compile and align with your existing tests, you may need to:
1. Ensure `accessTokenGroup1` and `accessTokenGroup2` exist and represent users who:
   - Both can see datasets/keys owned by `group-owner`.
   - `accessTokenGroup1` has group membership `group1` but not `group2`.
   - `accessTokenGroup2` has group membership `group2` (and ideally not `group1`).
   If your test suite uses a different naming convention (e.g. `accessTokenUserGroup1`), adjust the token names in the new test accordingly.
2. Confirm that `patchDataset` is available in this file and patches `accessGroups` without resetting other fields. If the helper has a different signature, update the call in the new test to match.
3. If your `getMetadataKeys` helper or the `/metadatakeys` response shape differs (e.g. `userGroups` lives under a nested property), adapt the assertions to reference the correct path.
</issue_to_address>

### Comment 5
<location path="docs/developer-guide/migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.md" line_range="143" />
<code_context>
+{ datasetId: "uuid-B", key: "temperature", isPublished: false, humanReadableName: "Temperature", userGroups: ["group-1"] }
+```
+
+> `ownerGroup` is mandatory field so no null fallback is needed. `accessGroups` is optional so it falls back to `[]`.
+
+---
</code_context>
<issue_to_address>
**nitpick (typo):** Add an article in “ownerGroup is mandatory field” for correct grammar.

You could rephrase this as: “`ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.”

```suggestion
> `ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.
```
</issue_to_address>

### Comment 6
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="3" />
<code_context>
+const SOURCE_COLLECTIONS = ["Dataset"];
+
+function buildPipeline(sourceType) {
+  return [
+    // -------------------------------------------------------------------------
</code_context>
<issue_to_address>
**issue (complexity):** Consider decomposing the monolithic aggregation pipeline into smaller helper functions and a dedicated merge-stage builder so each phase of the migration is easier to understand and maintain.

You can keep all current behavior and significantly reduce cognitive load by decomposing the pipeline and isolating the merge logic.

### 1. Split the monolithic `buildPipeline` into small, named helpers

Right now `buildPipeline` mixes projection, grouping, and merge configuration in one large array. You can keep the exact aggregation semantics but break it up into helpers so each concern is localized and easier to test and review.

For example:

```js
function buildInitialStages() {
  return [
    { $match: { scientificMetadata: { $exists: true, $type: "object" } } },
    {
      $project: {
        datasetId: "$_id",
        ownerGroup: 1,
        accessGroups: 1,
        isPublished: 1,
        metaArr: { $objectToArray: "$scientificMetadata" },
      },
    },
    { $unwind: "$metaArr" },
    {
      $project: {
        datasetId: 1,
        key: "$metaArr.k",
        isPublished: 1,
        humanReadableName: { $ifNull: ["$metaArr.v.human_name", ""] },
        userGroups: {
          $setUnion: [["$ownerGroup"], { $ifNull: ["$accessGroups", []] }],
        },
      },
    },
    {
      $unwind: {
        path: "$userGroups",
        preserveNullAndEmptyArrays: true,
      },
    },
    { $match: { userGroups: { $nin: [null, ""] } } },
  ];
}

function groupByKeyAndGroup(sourceType) {
  return [
    {
      $group: {
        _id: {
          metaKeyId: {
            $concat: [`${sourceType}_`, "$key", "_", "$humanReadableName"],
          },
          group: "$userGroups",
        },
        key: { $first: "$key" },
        humanReadableName: { $first: "$humanReadableName" },
        isPublished: { $max: "$isPublished" },
        groupCount: { $sum: 1 },
        datasetIds: { $addToSet: "$datasetId" },
      },
    },
    {
      $group: {
        _id: "$_id.metaKeyId",
        key: { $first: "$key" },
        humanReadableName: { $first: "$humanReadableName" },
        isPublished: { $max: "$isPublished" },
        userGroups: { $push: "$_id.group" },
        userGroupCountsArr: { $push: { k: "$_id.group", v: "$groupCount" } },
        datasetIdSets: { $push: "$datasetIds" },
      },
    },
  ];
}

function projectFinalShape(sourceType) {
  return [
    {
      $project: {
        _id: 1,
        key: 1,
        sourceType: { $literal: sourceType },
        humanReadableName: 1,
        isPublished: 1,
        userGroups: 1,
        userGroupCounts: { $arrayToObject: "$userGroupCountsArr" },
        usageCount: {
          $size: {
            $reduce: {
              input: "$datasetIdSets",
              initialValue: [],
              in: { $setUnion: ["$$value", "$$this"] },
            },
          },
        },
        createdBy: { $literal: "migration" },
        createdAt: { $toDate: "$$NOW" },
      },
    },
  ];
}

function buildPipeline(sourceType) {
  return [
    ...buildInitialStages(),
    ...groupByKeyAndGroup(sourceType),
    ...projectFinalShape(sourceType),
    buildMergeStage(),
  ];
}
```

This keeps the same pipeline stages but makes each phase independently understandable.

### 2. Extract the `$merge` `whenMatched` pipeline into a dedicated builder

The `whenMatched` expression is dense and currently buried at the bottom of `buildPipeline`. Extracting it into its own function makes it clearer that it’s “just” merge policy and decouples it from the shape-building stages.

```js
function buildWhenMatchedPipeline() {
  return [
    {
      $replaceWith: {
        $mergeObjects: [
          "$$new",
          {
            createdAt: { $ifNull: ["$createdAt", "$$new.createdAt"] },
            createdBy: { $ifNull: ["$createdBy", "$$new.createdBy"] },
            updatedBy: { $literal: "migration" },
            updatedAt: { $toDate: "$$NOW" },

            userGroups: { $setUnion: ["$userGroups", "$$new.userGroups"] },

            userGroupCounts: {
              $arrayToObject: {
                $map: {
                  input: {
                    $setUnion: [
                      {
                        $map: {
                          input: { $objectToArray: "$userGroupCounts" },
                          as: "e",
                          in: "$$e.k",
                        },
                      },
                      {
                        $map: {
                          input: { $objectToArray: "$$new.userGroupCounts" },
                          as: "e",
                          in: "$$e.k",
                        },
                      },
                    ],
                  },
                  as: "group",
                  in: {
                    k: "$$group",
                    v: {
                      $add: [
                        {
                          $ifNull: [
                            {
                              $getField: {
                                field: "$$group",
                                input: "$userGroupCounts",
                              },
                            },
                            0,
                          ],
                        },
                        {
                          $ifNull: [
                            {
                              $getField: {
                                field: "$$group",
                                input: "$$new.userGroupCounts",
                              },
                            },
                            0,
                          ],
                        },
                      ],
                    },
                  },
                },
              },
            },

            isPublished: { $or: ["$isPublished", "$$new.isPublished"] },
            usageCount: { $add: ["$usageCount", "$$new.usageCount"] },
          },
        ],
      },
    },
  ];
}

function buildMergeStage() {
  return {
    $merge: {
      into: "MetadataKeys",
      on: "_id",
      whenMatched: buildWhenMatchedPipeline(),
      whenNotMatched: "insert",
    },
  };
}
```

This preserves all existing merge behavior (including future multi-source support), but makes the migration file far easier to scan: `buildPipeline` reads as “shape → aggregate → finalize → merge”, and the complex merge internals live in a clearly named helper that can be documented and unit-tested separately.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

… into a single metadata key with combined userGroups

…k group references

Co-authored-by: Copilot <copilot@github.com>

rkweehinzmann

Thanks for the reimplementation of this service!

One other question:
Are nested scientific metadata values handled that have themselves nested values?

Thank you.

rkweehinzmann · 2026-04-28T09:48:31Z

 - Stability: Crashes occurred when retrieval limits were missing or improperly configured.
 - Risks: Users could see metadata keys they did not have permissions to access.

+---


Could you explain in which cases this can happen that a user sees metadata keys they do not have permissions for?

All the metadata keys returned from /api/v3/datasets/metadataKeys are not filterd by permission

ok, but what does it mean, eg if /api/v4/datasets/metadatakeys is used they will always be filtered by permissions? From your initial description I thought it can have to do with setting limits wrongly ?

The new endpoint /api/v4/metadataKeys/findAll?{query} will always return filtered metadatakeys by permissions.

I'm not sure I understand your second quesion

rkweehinzmann · 2026-04-28T09:51:56Z

+  /**
+   * Called when a dataset is updated.
+   *
+   * Diffs the old and new metadata to produce three disjoint key sets:


Who will know what old and new metadata is in 2 months or 2 years (or even now)? Could you add some short description of what is called old and what new, please? Thank you.

old metadata is the metadata befor change, new metadata is the metadata after change

so if I change metadata from {"metadataKey1":"text"} to {"meatadataKey2":"text"}
the former is old metadata, latter is the new metadata

Did you want me to see that metadataKey1 and metadataKey2 are different names, but both refer to the same value (which is the actual meta data and did not change)? I meant if you call something old or new, you're referring of to an old/new way of handling things. What was the old way, what the new way? Apologies, probably too obvious for the programmer!

Just to clarify what it is used for.

Consider a data provider A and a data receiver B.
Initially, A sends {key1}[old] to B, and B stores it. Later, A renames the key from {key1}[old] to {key2}[new]. If A only sends {key2}[new], B ends up holding both {key1}[old] and {key2}[new] — even though the old key no longer exists on A's side. This creates a sync inconsistency.
To prevent this, when A changes {key1}[old] to {key2}[new], it must provide both the old and new state to B, effectively saying:

"Here is what changed — discard [old] and replace it with [new]."

This is exactly why replaceManyFromSource requires both oldDoc (pre-update) and newDoc (post-update) — so B can recompute the difference and stay in sync with A.

I also updated the comment to state that, old is before change and new is after change. There's many terms to describe it, but old and new is probably is the shortest term and it should be self-explanatory

This reverts commit 8320494.

modify unit test accordingly

…ion script

…ableName)

minottic · 2026-05-04T14:18:09Z

+
+Rebuilds the `MetadataKeys` collection from scratch by scanning all `Dataset` documents and extracting every key found in `scientificMetadata`.
+
+Each unique `(sourceType, key, humanReadableName)` combination becomes one `MetadataKey` document. If the same key appears across multiple datasets, their `userGroups` and counts are merged into a single entry.


why do we need humanReadableName? Isn't sourceType, key already unique?

Same key might have completely different humanReadableName, so the uniqueness is based on sourceType+key+humanReadableName

minottic · 2026-05-04T14:19:51Z

+```js
+// MetadataKey document
+{
+  _id: "Dataset_temperature_Temperature",  // ${sourceType}_${key}_${humanReadableName}


as discussed offline, I would keep uuid as ID and eventually add an additional field with the concat. This will make easier to have multiple documents with the same combination in case we will ever want to split big documents

I will update the documentation later, I changed the id to use UUID but forgot to change doc here.

docs updated

minottic · 2026-05-04T14:20:43Z

+  humanReadableName: "Temperature",
+  sourceType: "Dataset",
+  isPublished: true,
+  usageCount: 2,


why do you need this? Isn't userGroupCounts enough?

This avoids edge cases where a dataset has no groups but isPublished: true (unlikely, but ownerGroup is not a required field in OwnableDto).
Additionally, userGroupCounts cannot replace usageCount because a dataset can belong to multiple groups, whereas usageCount counts each dataset only once.

minottic · 2026-05-04T20:19:16Z

what about this:

function buildPipeline(sourceType) { return [ { $project: { isPublished: 1, groups: { $setUnion: [["$ownerGroup"], { $ifNull: ["$accessGroups", []] }] }, meta: { $objectToArray: "$scientificMetadata" }, } }, { $unwind: "$meta" }, { $unwind: "$groups" }, { $match: { groups: { $ne: null, $ne: "" } } }, { $group: { _id: { k: "$meta.k", g: "$groups", n: { $ifNull: ["$meta.v.human_name", ""] } }, count: { $sum: 1 }, isPublished: { $max: "$isPublished" } } } ]; } module.exports = { async up(db) { const start = Date.now(); const elapsed = () => `${((Date.now() - start) / 1000).toFixed(1)}s`; await db.collection("MetadataKeys").deleteMany({}); await db.collection("MetadataKeys").createIndex({ metaKeyId: 1 }, { unique: true }); for (const collectionName of SOURCE_COLLECTIONS) { console.log(`[${elapsed()}] Starting stream for ${collectionName}...`); // 1. Open a single cursor for the entire collection const cursor = db.collection(collectionName).aggregate(buildPipeline(collectionName), { allowDiskUse: true }); let operations = []; let totalProcessed = 0; // 2. Iterate over the cursor without toArray() for await (const doc of cursor) { const metaKeyId = `${collectionName}_${doc._id.k}_${doc._id.n}`; operations.push({ updateOne: { filter: { metaKeyId }, update: { $set: { key: doc._id.k, humanReadableName: doc._id.n, sourceType: collectionName, updatedBy: "migration", updatedAt: new Date() }, $setOnInsert: { createdAt: new Date(), createdBy: "migration" }, $max: { isPublished: doc.isPublished }, $addToSet: { userGroups: doc._id.g }, $inc: { usageCount: 1, [`userGroupCounts.${doc._id.g.replace(/\./g, '_')}`]: doc.count } }, upsert: true } }); // 3. Periodic Flush: execute bulk write every BATCH_SIZE items if (operations.length >= BATCH_SIZE) { await db.collection("MetadataKeys").bulkWrite(operations); totalProcessed += operations.length; console.log(`[${elapsed()}] ${collectionName}: Flushed ${totalProcessed.toLocaleString()} keys...`); operations = []; // Clear the bucket } } // 4. Final flush for any remaining operations if (operations.length > 0) { await db.collection("MetadataKeys").bulkWrite(operations); totalProcessed += operations.length; } console.log(`[${elapsed()}] ✅ ${collectionName} complete. Total keys: ${totalProcessed}`); } // Clean up: switch metaKeyId to the official _id await db.collection("MetadataKeys").updateMany({}, [ { $set: { id: "$metaKeyId" } }, { $unset: "metaKeyId" } ]); } };

(to be changed with the right key inplace of this)

const metaKeyId = `${sourceType}_${key}_${humanName}`;

I would avoid to use cursor for the migration script, it will be extremly slow to retrive data store in the memory and process them. since each doc transferred into the mongoDB needs to be encoded

Example of comparison of the performances using cursor vs aggregate + $mege to handle everything:

This testing was using 20k datasets with 5~10 ScientificMetadata per datasets

but doesn't that depend largely on the BATCH_SIZE? The thing is that the current migration is hardly maintainable, it's extremely complex if we need to change anything

In this case, the bigger BATCH_SIZE the faster with aggregate + $merge, .

I would expect this migration script not to be touched and any further maintainance should be a new migration patch on top of this migration.
But TBH, I don't really mind the script to be slow, as long as it's robust, since we need it to run only once. So, if you suggest that readability should be more prioratized in this case, Im fine with it.
I also felt like current implementation is hard to read and that's why I included a doc to explain each step.

I would expect this migration script not to be touched

By this I meant once the migration is merged I expect it not to be changed, not that im saying current code.

The readability is improved by removing some unnecessary fields, such as check existing fields then overwrite logic, which is not needed, since we wipe everything when migration starts.
Documentation is also updated with better explanation for aggregation pipe

minottic · 2026-05-04T20:29:38Z

 export const MetadataKeySchema = SchemaFactory.createForClass(MetadataKeyClass);

-MetadataKeySchema.index({ sourceType: 1, sourceId: 1 });
+MetadataKeySchema.index({ sourceType: 1, isPublished: 1, key: 1 });


why do we need so many indexes?

The indexs are to handle queries from unauthenticated users and authenticated users.

for authenticated users the query will be

MetadataKey.find({ sourceType: 'dataset', $and: [ { $or: [ { isPublished: true }, { userGroups: { $in: user.groups } } ] }, { $or: [ { key: job_id }, { humanReadableName: JOB_ID } ] } ] });

for unauthenticated users the query will be:

MetadataKey.find({ sourceType: 'dataset', $and: [ { $or: [ { isPublished: true }, ] }, { $or: [ { key: job_id }, { humanReadableName: JOB_ID } ] } ] });

minottic · 2026-05-04T20:33:46Z

@@ -45,10 +45,10 @@ export class SamplesService {
    doc: UpdateQuery<SampleDocument>,
  ): MetadataSourceDoc {
    const source: MetadataSourceDoc = {


this snippet is replicated in all collections, worth making a util function that takes doc and name as input?

Yes, maybe it's worth to have a shared function here

addressed in 1d4f99a

minottic · 2026-05-04T20:52:08Z

what about something like this:

@Injectable({ scope: Scope.REQUEST }) export class MetadataKeysService { constructor( @InjectModel(MetadataKeyClass.name) private metadataKeyModel: Model<MetadataKeyDocument>, ) {} // Centralized logic for changing counts private async adjustCounts(doc: MetadataSourceDoc, delta: 1 | -1): Promise<void> { const { sourceType, userGroups, metadata } = doc; const ops = Object.entries(metadata).map(([key, entry]) => { const humanName = (entry as ScientificMetadataEntry).human_name ?? ""; const metaKeyId = `${sourceType}_${key}_${humanName}`; return { updateOne: { filter: { metaKeyId }, update: { $set: { key, sourceType, humanReadableName: humanName }, $inc: { usageCount: delta, ...Object.fromEntries(userGroups.map(g => [`userGroupCounts.${g.replace(/\./g, '_')}`, delta])) }, ...(delta === 1 && { $max: { isPublished: doc.isPublished } }), $setOnInsert: { createdAt: new Date() } }, upsert: delta === 1, // Only upsert when adding } }; }); if (ops.length > 0) { await this.metadataKeyModel.bulkWrite(ops); // Cleanup: Only run when decrementing if (delta === -1) { await this.metadataKeyModel.deleteMany({ usageCount: { $lte: 0 } }); } } } async insertManyFromSource(doc: MetadataSourceDoc) { await this.adjustCounts(doc, 1); } async deleteMany(doc: MetadataSourceDoc) { await this.adjustCounts(doc, -1); } async replaceManyFromSource(oldDoc: MetadataSourceDoc, newDoc: MetadataSourceDoc) { // Standard "out with the old, in with the new" // MongoDB handles the interleaving safely await this.deleteMany(oldDoc); await this.insertManyFromSource(newDoc); } }

(to be changed with the right key inplace of this)

const metaKeyId = `${sourceType}_${key}_${humanName}`;

I think the code you suggested here looks very good, I will test it locally and see if there's any missing edge cases

addressed in 32f9540

documentation

…ments

…etadataKeys

Junjiequan requested a review from a team as a code owner April 20, 2026 18:41

sourcery-ai Bot reviewed Apr 20, 2026

View reviewed changes

Junjiequan added the review: please label Apr 23, 2026

Junjiequan force-pushed the refactor-metadatakeys-service branch 2 times, most recently from 8c915ea to 7273ab8 Compare April 27, 2026 13:11

Junjiequan and others added 11 commits April 28, 2026 09:51

refactor(metadatakeys): merge duplicate metadata keys across datasets…

52b13ae

… into a single metadata key with combined userGroups

refactor(metadatakeys): merge duplicate keys across datasets and trac…

90cc22f

…k group references

trim migration script comment and fix failing tests

7daa0c1

fix typo and api test

af3fef7

set ownerGroup as required with nonEmpty value

8320494

update doc

386ccf0

fix api test

34e0a24

fix api test

74851db

fix api test

e66ebac

fix api test

125c507

address sourcer comment

9a41f76

Co-authored-by: Copilot <copilot@github.com>

Junjiequan force-pushed the refactor-metadatakeys-service branch from 7273ab8 to 9a41f76 Compare April 28, 2026 07:51

remove unncessary request import

22e8a2f

rkweehinzmann requested changes Apr 28, 2026

View reviewed changes

Junjiequan added 11 commits April 28, 2026 14:26

add batch for scientificMetadata sync migration

352437d

address comment

40a0c6d

minor fix

b5446ed

Revert "set ownerGroup as required with nonEmpty value"

d51d069

This reverts commit 8320494.

Merge branch 'master' into refactor-metadatakeys-service

add29b3

revert existingDataset in findByIdAndUpdate

ff1a457

remove unnecessary comment

43ea114

add back existingDataset &

921d11b

modify unit test accordingly

fix migration pipe

d90c3cd

Merge branch 'master' into refactor-metadatakeys-service

d71cb40

include MAX_USER_GROUPS_PER_METADATA_KEY logic for metdataKeys migrat…

1aee8bc

…ion script

Junjiequan added 2 commits May 4, 2026 17:05

update seed migration script

c6905fa

Replace concat _id with compound filter (sourceType + key + humanRead…

0aeb977

…ableName)

minottic reviewed May 5, 2026

View reviewed changes

Junjiequan added 6 commits May 11, 2026 12:39

improve migration script and

3b10fdc

documentation

improve metadataKeys doc update logic and add unit tests based on com…

32f9540

…ments

set createdBy, updatedBy on insert and updatedAt on every upsert in M…

665f392

…etadataKeys

fix minor typescript import errors

fb5dfb0

reorder imports

6ff971d

add shared createMetadataKeysInstance function to reduce duplication

1d4f99a


		Rebuilds the `MetadataKeys` collection from scratch by scanning all `Dataset` documents and extracting every key found in `scientificMetadata`.

		Each unique `(sourceType, key, humanReadableName)` combination becomes one `MetadataKey` document. If the same key appears across multiple datasets, their `userGroups` and counts are merged into a single entry.

Conversation

Junjiequan commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

userGroupCounts

Changes

Motivation

Fixes

Tests included

Documentation

official documentation info

Summary by Sourcery

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rkweehinzmann left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Junjiequan May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Junjiequan May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Junjiequan May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Junjiequan May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Junjiequan commented Apr 20, 2026 •

edited

Loading

Junjiequan May 8, 2026 •

edited

Loading

Junjiequan May 8, 2026 •

edited

Loading

Junjiequan May 8, 2026 •

edited

Loading

Junjiequan May 8, 2026 •

edited

Loading

Junjiequan May 8, 2026 •

edited

Loading

Junjiequan May 8, 2026 •

edited

Loading