refactor(metadatakeys): merge duplicate keys across datasets and track group references#2697
refactor(metadatakeys): merge duplicate keys across datasets and track group references#2697Junjiequan wants to merge 31 commits into
Conversation
There was a problem hiding this comment.
Hey - I've found 6 issues, and left some high level feedback:
- The migration uses
aggregate(...).toArray()with$merge, which will materialize all results in memory; consider iterating the cursor (or omittingtoArray()entirely when$mergeis the terminal stage) to avoid potential OOM issues on large datasets. - Several unit tests for
MetadataKeysServiceassert on specificupdateManycall indices (e.g.mock.calls[1]); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust. MetadataSourceDoc.sourceTypeis now derived frommodel.collection.name(e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize thesourceTypeconvention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The migration uses `aggregate(...).toArray()` with `$merge`, which will materialize all results in memory; consider iterating the cursor (or omitting `toArray()` entirely when `$merge` is the terminal stage) to avoid potential OOM issues on large datasets.
- Several unit tests for `MetadataKeysService` assert on specific `updateMany` call indices (e.g. `mock.calls[1]`); this makes them brittle as the implementation evolves—matching calls by argument shape instead of positional index would make the tests more robust.
- `MetadataSourceDoc.sourceType` is now derived from `model.collection.name` (e.g. for datasets), while docs/tests use capitalized literals like "Dataset"; it would be good to double-check and standardize the `sourceType` convention across services, migrations, and tests to avoid subtle mismatches in queries and IDs.
## Individual Comments
### Comment 1
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="274-283" />
<code_context>
private async updateSharedKeys(
</code_context>
<issue_to_address>
**issue (bug_risk):** Keys whose humanReadableName changed are updated under the new _id before being treated as delete+insert, which can cause inconsistent counters and partial documents.
In `updateSharedKeys`, the group/publish update block runs for all `sharedKeys`, including those with changed `human_name`. For these, `ids` are built from the new `humanReadableName`, so `updateMany` may target documents that don’t yet exist (and should instead be created via `insertManyFromSource` with full `$setOnInsert`). Later, `humanNameChangedKeys` are processed again via `deleteMany(oldDoc)` + `insertManyFromSource(newDoc)`, so the same logical key is updated twice under different assumptions. To prevent double-counting and partial documents, exclude `humanNameChangedKeys` from the initial group/publish updates and rely only on the delete+insert path for name changes.
</issue_to_address>
### Comment 2
<location path="src/metadata-keys/metadatakeys.service.ts" line_range="297-307" />
<code_context>
+ const publishedFlippedOn = !oldDoc.isPublished && newDoc.isPublished;
+ const hasGroupChanges = addedGroups.length > 0 || removedGroups.length > 0;
+
+ if (hasGroupChanges || publishedFlippedOn) {
+ await this.metadataKeyModel.updateMany(filter, {
+ ...(addedGroups.length > 0 && {
+ $addToSet: { userGroups: { $each: addedGroups } },
+ }),
+ $inc: {
+ ...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
+ ...(removedGroups.length > 0 &&
+ this.groupCountDeltas(removedGroups, -1)),
+ },
+ ...(publishedFlippedOn && { $set: { isPublished: true } }),
+ });
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The `$inc` operator is always included, even when no groups changed, which risks sending an empty `$inc` object.
In this branch, when `publishedFlippedOn` is true but there are no added/removed groups, the update still includes `$inc` with no fields. MongoDB may reject or mis-handle an empty `$inc`. Consider constructing the `$inc` object separately and only spreading it into the update when it has at least one key (i.e., when there is a non-empty group delta).
```suggestion
const incUpdate = {
...(addedGroups.length > 0 && this.groupCountDeltas(addedGroups, 1)),
...(removedGroups.length > 0 &&
this.groupCountDeltas(removedGroups, -1)),
};
await this.metadataKeyModel.updateMany(filter, {
...(addedGroups.length > 0 && {
$addToSet: { userGroups: { $each: addedGroups } },
}),
...(Object.keys(incUpdate).length > 0 && { $inc: incUpdate }),
...(publishedFlippedOn && { $set: { isPublished: true } }),
});
```
</issue_to_address>
### Comment 3
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="8-9" />
<code_context>
+List metadata keys visible to the current user for a given source type:
```json
{
"where": {
- "sourceType": "dataset",
</code_context>
<issue_to_address>
**issue (bug_risk):** The combination of `preserveNullAndEmptyArrays` and the subsequent `$match` drops group-less datasets from `usageCount`, contrary to the comment.
Using `$unwind` with `preserveNullAndEmptyArrays: true` creates records with `userGroups: null` for group-less datasets, but the subsequent `$match: { userGroups: { $nin: [null, ""] } }` removes them. As a result, datasets without groups never reach the later grouping and are missing from `usageCount`, which contradicts the intent described in the comment. If group-less datasets should still contribute to `usageCount`, the pipeline needs to retain them (e.g., by handling `userGroups: null` explicitly or isolating the group-based logic so it doesn’t drop these records).
</issue_to_address>
### Comment 4
<location path="test/MetadataKeys.js" line_range="514-419" />
<code_context>
+ it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {
</code_context>
<issue_to_address>
**suggestion (testing):** Add a test that updates `accessGroups` (not just `ownerGroup`) and asserts that `userGroups` and access control on keys are updated accordingly.
Since `MetadataKeys` derives `userGroups` from both `ownerGroup` and `accessGroups`, and the service updates counts based on `userGroups`, this path should be tested as well. Please also add a test that:
- Starts with a dataset with `accessGroups: ["group1"]` and a key
- Patches the dataset to change `accessGroups` (e.g. to `["group2"]`)
- Asserts via `/metadatakeys` that:
- `userGroups` on the key reflects the updated access groups
- A user who only belonged to the removed group can no longer see the key
This will cover the `accessGroups` mutation path and its impact on access control and group reference counting.
Suggested implementation:
```javascript
await deleteDataset(pid);
});
it("0641: changing accessGroups updates userGroups and key visibility", async () => {
const pid = await createDataset({
...TestData.DatasetWithScientificMetadataV4,
ownerGroup: "group-owner",
accessGroups: ["group1"],
isPublished: false,
scientificMetadata: { access_group_change_key: { value: 42 } },
});
const filterKey = {
where: { sourceType: "Dataset", key: "access_group_change_key" },
};
// user belonging to group1 can see the key initially
const keysForGroup1Before = await getMetadataKeys(
filterKey,
accessTokenGroup1
);
keysForGroup1Before.body.should.have.lengthOf(1);
keysForGroup1Before.body[0].userGroups.should.contain("group1");
keysForGroup1Before.body[0].userGroups.should.not.contain("group2");
// patch dataset to switch accessGroups from group1 to group2
await patchDataset(
pid,
{ accessGroups: ["group2"] },
accessTokenAdmin
);
// after patch, key should reflect updated userGroups (derived from updated accessGroups)
const keysForGroup2After = await getMetadataKeys(
filterKey,
accessTokenGroup2
);
keysForGroup2After.body.should.have.lengthOf(1);
keysForGroup2After.body[0].userGroups.should.not.contain("group1");
keysForGroup2After.body[0].userGroups.should.contain("group2");
// user that only belonged to removed group1 can no longer see the key
const keysForGroup1After = await getMetadataKeys(
filterKey,
accessTokenGroup1
);
keysForGroup1After.body.should.have.lengthOf(0);
await deleteDataset(pid);
});
it("0640: changing ownerGroup updates userGroups — old group removed, new group added", async () => {
```
To make this compile and align with your existing tests, you may need to:
1. Ensure `accessTokenGroup1` and `accessTokenGroup2` exist and represent users who:
- Both can see datasets/keys owned by `group-owner`.
- `accessTokenGroup1` has group membership `group1` but not `group2`.
- `accessTokenGroup2` has group membership `group2` (and ideally not `group1`).
If your test suite uses a different naming convention (e.g. `accessTokenUserGroup1`), adjust the token names in the new test accordingly.
2. Confirm that `patchDataset` is available in this file and patches `accessGroups` without resetting other fields. If the helper has a different signature, update the call in the new test to match.
3. If your `getMetadataKeys` helper or the `/metadatakeys` response shape differs (e.g. `userGroups` lives under a nested property), adapt the assertions to reference the correct path.
</issue_to_address>
### Comment 5
<location path="docs/developer-guide/migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.md" line_range="143" />
<code_context>
+{ datasetId: "uuid-B", key: "temperature", isPublished: false, humanReadableName: "Temperature", userGroups: ["group-1"] }
+```
+
+> `ownerGroup` is mandatory field so no null fallback is needed. `accessGroups` is optional so it falls back to `[]`.
+
+---
</code_context>
<issue_to_address>
**nitpick (typo):** Add an article in “ownerGroup is mandatory field” for correct grammar.
You could rephrase this as: “`ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.”
```suggestion
> `ownerGroup` is a mandatory field, so no null fallback is needed. `accessGroups` is optional, so it falls back to `[]`.
```
</issue_to_address>
### Comment 6
<location path="migrations/20260420145401-sync-dataset-scientificMetadata-to-metadatakeys.js" line_range="3" />
<code_context>
+const SOURCE_COLLECTIONS = ["Dataset"];
+
+function buildPipeline(sourceType) {
+ return [
+ // -------------------------------------------------------------------------
</code_context>
<issue_to_address>
**issue (complexity):** Consider decomposing the monolithic aggregation pipeline into smaller helper functions and a dedicated merge-stage builder so each phase of the migration is easier to understand and maintain.
You can keep all current behavior and significantly reduce cognitive load by decomposing the pipeline and isolating the merge logic.
### 1. Split the monolithic `buildPipeline` into small, named helpers
Right now `buildPipeline` mixes projection, grouping, and merge configuration in one large array. You can keep the exact aggregation semantics but break it up into helpers so each concern is localized and easier to test and review.
For example:
```js
function buildInitialStages() {
return [
{ $match: { scientificMetadata: { $exists: true, $type: "object" } } },
{
$project: {
datasetId: "$_id",
ownerGroup: 1,
accessGroups: 1,
isPublished: 1,
metaArr: { $objectToArray: "$scientificMetadata" },
},
},
{ $unwind: "$metaArr" },
{
$project: {
datasetId: 1,
key: "$metaArr.k",
isPublished: 1,
humanReadableName: { $ifNull: ["$metaArr.v.human_name", ""] },
userGroups: {
$setUnion: [["$ownerGroup"], { $ifNull: ["$accessGroups", []] }],
},
},
},
{
$unwind: {
path: "$userGroups",
preserveNullAndEmptyArrays: true,
},
},
{ $match: { userGroups: { $nin: [null, ""] } } },
];
}
function groupByKeyAndGroup(sourceType) {
return [
{
$group: {
_id: {
metaKeyId: {
$concat: [`${sourceType}_`, "$key", "_", "$humanReadableName"],
},
group: "$userGroups",
},
key: { $first: "$key" },
humanReadableName: { $first: "$humanReadableName" },
isPublished: { $max: "$isPublished" },
groupCount: { $sum: 1 },
datasetIds: { $addToSet: "$datasetId" },
},
},
{
$group: {
_id: "$_id.metaKeyId",
key: { $first: "$key" },
humanReadableName: { $first: "$humanReadableName" },
isPublished: { $max: "$isPublished" },
userGroups: { $push: "$_id.group" },
userGroupCountsArr: { $push: { k: "$_id.group", v: "$groupCount" } },
datasetIdSets: { $push: "$datasetIds" },
},
},
];
}
function projectFinalShape(sourceType) {
return [
{
$project: {
_id: 1,
key: 1,
sourceType: { $literal: sourceType },
humanReadableName: 1,
isPublished: 1,
userGroups: 1,
userGroupCounts: { $arrayToObject: "$userGroupCountsArr" },
usageCount: {
$size: {
$reduce: {
input: "$datasetIdSets",
initialValue: [],
in: { $setUnion: ["$$value", "$$this"] },
},
},
},
createdBy: { $literal: "migration" },
createdAt: { $toDate: "$$NOW" },
},
},
];
}
function buildPipeline(sourceType) {
return [
...buildInitialStages(),
...groupByKeyAndGroup(sourceType),
...projectFinalShape(sourceType),
buildMergeStage(),
];
}
```
This keeps the same pipeline stages but makes each phase independently understandable.
### 2. Extract the `$merge` `whenMatched` pipeline into a dedicated builder
The `whenMatched` expression is dense and currently buried at the bottom of `buildPipeline`. Extracting it into its own function makes it clearer that it’s “just” merge policy and decouples it from the shape-building stages.
```js
function buildWhenMatchedPipeline() {
return [
{
$replaceWith: {
$mergeObjects: [
"$$new",
{
createdAt: { $ifNull: ["$createdAt", "$$new.createdAt"] },
createdBy: { $ifNull: ["$createdBy", "$$new.createdBy"] },
updatedBy: { $literal: "migration" },
updatedAt: { $toDate: "$$NOW" },
userGroups: { $setUnion: ["$userGroups", "$$new.userGroups"] },
userGroupCounts: {
$arrayToObject: {
$map: {
input: {
$setUnion: [
{
$map: {
input: { $objectToArray: "$userGroupCounts" },
as: "e",
in: "$$e.k",
},
},
{
$map: {
input: { $objectToArray: "$$new.userGroupCounts" },
as: "e",
in: "$$e.k",
},
},
],
},
as: "group",
in: {
k: "$$group",
v: {
$add: [
{
$ifNull: [
{
$getField: {
field: "$$group",
input: "$userGroupCounts",
},
},
0,
],
},
{
$ifNull: [
{
$getField: {
field: "$$group",
input: "$$new.userGroupCounts",
},
},
0,
],
},
],
},
},
},
},
},
isPublished: { $or: ["$isPublished", "$$new.isPublished"] },
usageCount: { $add: ["$usageCount", "$$new.usageCount"] },
},
],
},
},
];
}
function buildMergeStage() {
return {
$merge: {
into: "MetadataKeys",
on: "_id",
whenMatched: buildWhenMatchedPipeline(),
whenNotMatched: "insert",
},
};
}
```
This preserves all existing merge behavior (including future multi-source support), but makes the migration file far easier to scan: `buildPipeline` reads as “shape → aggregate → finalize → merge”, and the complex merge internals live in a clearly named helper that can be documented and unit-tested separately.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
8c915ea to
7273ab8
Compare
… into a single metadata key with combined userGroups
…k group references
Co-authored-by: Copilot <copilot@github.com>
7273ab8 to
9a41f76
Compare
rkweehinzmann
left a comment
There was a problem hiding this comment.
Thanks for the reimplementation of this service!
One other question:
Are nested scientific metadata values handled that have themselves nested values?
Thank you.
| - Stability: Crashes occurred when retrieval limits were missing or improperly configured. | ||
| - Risks: Users could see metadata keys they did not have permissions to access. | ||
|
|
||
| --- |
There was a problem hiding this comment.
Could you explain in which cases this can happen that a user sees metadata keys they do not have permissions for?
There was a problem hiding this comment.
All the metadata keys returned from /api/v3/datasets/metadataKeys are not filterd by permission
There was a problem hiding this comment.
ok, but what does it mean, eg if /api/v4/datasets/metadatakeys is used they will always be filtered by permissions? From your initial description I thought it can have to do with setting limits wrongly ?
There was a problem hiding this comment.
The new endpoint /api/v4/metadataKeys/findAll?{query} will always return filtered metadatakeys by permissions.
I'm not sure I understand your second quesion
| /** | ||
| * Called when a dataset is updated. | ||
| * | ||
| * Diffs the old and new metadata to produce three disjoint key sets: |
There was a problem hiding this comment.
Who will know what old and new metadata is in 2 months or 2 years (or even now)? Could you add some short description of what is called old and what new, please? Thank you.
There was a problem hiding this comment.
old metadata is the metadata befor change, new metadata is the metadata after change
so if I change metadata from {"metadataKey1":"text"} to {"meatadataKey2":"text"}
the former is old metadata, latter is the new metadata
There was a problem hiding this comment.
Did you want me to see that metadataKey1 and metadataKey2 are different names, but both refer to the same value (which is the actual meta data and did not change)? I meant if you call something old or new, you're referring of to an old/new way of handling things. What was the old way, what the new way? Apologies, probably too obvious for the programmer!
There was a problem hiding this comment.
Just to clarify what it is used for.
Consider a data provider A and a data receiver B.
Initially, A sends {key1}[old] to B, and B stores it. Later, A renames the key from {key1}[old] to {key2}[new]. If A only sends {key2}[new], B ends up holding both {key1}[old] and {key2}[new] — even though the old key no longer exists on A's side. This creates a sync inconsistency.
To prevent this, when A changes {key1}[old] to {key2}[new], it must provide both the old and new state to B, effectively saying:
"Here is what changed — discard [old] and replace it with [new]."
This is exactly why replaceManyFromSource requires both oldDoc (pre-update) and newDoc (post-update) — so B can recompute the difference and stay in sync with A.
There was a problem hiding this comment.
I also updated the comment to state that, old is before change and new is after change. There's many terms to describe it, but old and new is probably is the shortest term and it should be self-explanatory
This reverts commit 8320494.
modify unit test accordingly
|
|
||
| Rebuilds the `MetadataKeys` collection from scratch by scanning all `Dataset` documents and extracting every key found in `scientificMetadata`. | ||
|
|
||
| Each unique `(sourceType, key, humanReadableName)` combination becomes one `MetadataKey` document. If the same key appears across multiple datasets, their `userGroups` and counts are merged into a single entry. |
There was a problem hiding this comment.
why do we need humanReadableName? Isn't sourceType, key already unique?
There was a problem hiding this comment.
Same key might have completely different humanReadableName, so the uniqueness is based on sourceType+key+humanReadableName
| ```js | ||
| // MetadataKey document | ||
| { | ||
| _id: "Dataset_temperature_Temperature", // ${sourceType}_${key}_${humanReadableName} |
There was a problem hiding this comment.
as discussed offline, I would keep uuid as ID and eventually add an additional field with the concat. This will make easier to have multiple documents with the same combination in case we will ever want to split big documents
There was a problem hiding this comment.
I will update the documentation later, I changed the id to use UUID but forgot to change doc here.
| humanReadableName: "Temperature", | ||
| sourceType: "Dataset", | ||
| isPublished: true, | ||
| usageCount: 2, |
There was a problem hiding this comment.
why do you need this? Isn't userGroupCounts enough?
There was a problem hiding this comment.
This avoids edge cases where a dataset has no groups but isPublished: true (unlikely, but ownerGroup is not a required field in OwnableDto).
Additionally, userGroupCounts cannot replace usageCount because a dataset can belong to multiple groups, whereas usageCount counts each dataset only once.
There was a problem hiding this comment.
what about this:
function buildPipeline(sourceType) {
return [
{
$project: {
isPublished: 1,
groups: { $setUnion: [["$ownerGroup"], { $ifNull: ["$accessGroups", []] }] },
meta: { $objectToArray: "$scientificMetadata" },
}
},
{ $unwind: "$meta" },
{ $unwind: "$groups" },
{ $match: { groups: { $ne: null, $ne: "" } } },
{
$group: {
_id: {
k: "$meta.k",
g: "$groups",
n: { $ifNull: ["$meta.v.human_name", ""] }
},
count: { $sum: 1 },
isPublished: { $max: "$isPublished" }
}
}
];
}
module.exports = {
async up(db) {
const start = Date.now();
const elapsed = () => `${((Date.now() - start) / 1000).toFixed(1)}s`;
await db.collection("MetadataKeys").deleteMany({});
await db.collection("MetadataKeys").createIndex({ metaKeyId: 1 }, { unique: true });
for (const collectionName of SOURCE_COLLECTIONS) {
console.log(`[${elapsed()}] Starting stream for ${collectionName}...`);
// 1. Open a single cursor for the entire collection
const cursor = db.collection(collectionName).aggregate(buildPipeline(collectionName), {
allowDiskUse: true
});
let operations = [];
let totalProcessed = 0;
// 2. Iterate over the cursor without toArray()
for await (const doc of cursor) {
const metaKeyId = `${collectionName}_${doc._id.k}_${doc._id.n}`;
operations.push({
updateOne: {
filter: { metaKeyId },
update: {
$set: {
key: doc._id.k,
humanReadableName: doc._id.n,
sourceType: collectionName,
updatedBy: "migration",
updatedAt: new Date()
},
$setOnInsert: { createdAt: new Date(), createdBy: "migration" },
$max: { isPublished: doc.isPublished },
$addToSet: { userGroups: doc._id.g },
$inc: {
usageCount: 1,
[`userGroupCounts.${doc._id.g.replace(/\./g, '_')}`]: doc.count
}
},
upsert: true
}
});
// 3. Periodic Flush: execute bulk write every BATCH_SIZE items
if (operations.length >= BATCH_SIZE) {
await db.collection("MetadataKeys").bulkWrite(operations);
totalProcessed += operations.length;
console.log(`[${elapsed()}] ${collectionName}: Flushed ${totalProcessed.toLocaleString()} keys...`);
operations = []; // Clear the bucket
}
}
// 4. Final flush for any remaining operations
if (operations.length > 0) {
await db.collection("MetadataKeys").bulkWrite(operations);
totalProcessed += operations.length;
}
console.log(`[${elapsed()}] ✅ ${collectionName} complete. Total keys: ${totalProcessed}`);
}
// Clean up: switch metaKeyId to the official _id
await db.collection("MetadataKeys").updateMany({}, [
{ $set: { id: "$metaKeyId" } },
{ $unset: "metaKeyId" }
]);
}
};There was a problem hiding this comment.
(to be changed with the right key inplace of this)
const metaKeyId = `${sourceType}_${key}_${humanName}`;There was a problem hiding this comment.
I would avoid to use cursor for the migration script, it will be extremly slow to retrive data store in the memory and process them. since each doc transferred into the mongoDB needs to be encoded
Example of comparison of the performances using cursor vs aggregate + $mege to handle everything:
This testing was using 20k datasets with 5~10 ScientificMetadata per datasets
There was a problem hiding this comment.
but doesn't that depend largely on the BATCH_SIZE? The thing is that the current migration is hardly maintainable, it's extremely complex if we need to change anything
There was a problem hiding this comment.
In this case, the bigger BATCH_SIZE the faster with aggregate + $merge, .
I would expect this migration script not to be touched and any further maintainance should be a new migration patch on top of this migration.
But TBH, I don't really mind the script to be slow, as long as it's robust, since we need it to run only once. So, if you suggest that readability should be more prioratized in this case, Im fine with it.
I also felt like current implementation is hard to read and that's why I included a doc to explain each step.
There was a problem hiding this comment.
I would expect this migration script not to be touched
By this I meant once the migration is merged I expect it not to be changed, not that im saying current code.
There was a problem hiding this comment.
The readability is improved by removing some unnecessary fields, such as check existing fields then overwrite logic, which is not needed, since we wipe everything when migration starts.
Documentation is also updated with better explanation for aggregation pipe
| export const MetadataKeySchema = SchemaFactory.createForClass(MetadataKeyClass); | ||
|
|
||
| MetadataKeySchema.index({ sourceType: 1, sourceId: 1 }); | ||
| MetadataKeySchema.index({ sourceType: 1, isPublished: 1, key: 1 }); |
There was a problem hiding this comment.
why do we need so many indexes?
There was a problem hiding this comment.
The indexs are to handle queries from unauthenticated users and authenticated users.
for authenticated users the query will be
MetadataKey.find({
sourceType: 'dataset',
$and: [
{
$or: [
{ isPublished: true },
{ userGroups: { $in: user.groups } }
]
},
{
$or: [
{ key: job_id },
{ humanReadableName: JOB_ID }
]
}
]
});for unauthenticated users the query will be:
MetadataKey.find({
sourceType: 'dataset',
$and: [
{
$or: [
{ isPublished: true },
]
},
{
$or: [
{ key: job_id },
{ humanReadableName: JOB_ID }
]
}
]
});| @@ -45,10 +45,10 @@ export class SamplesService { | |||
| doc: UpdateQuery<SampleDocument>, | |||
| ): MetadataSourceDoc { | |||
| const source: MetadataSourceDoc = { | |||
There was a problem hiding this comment.
this snippet is replicated in all collections, worth making a util function that takes doc and name as input?
There was a problem hiding this comment.
Yes, maybe it's worth to have a shared function here
There was a problem hiding this comment.
what about something like this:
@Injectable({ scope: Scope.REQUEST })
export class MetadataKeysService {
constructor(
@InjectModel(MetadataKeyClass.name)
private metadataKeyModel: Model<MetadataKeyDocument>,
) {}
// Centralized logic for changing counts
private async adjustCounts(doc: MetadataSourceDoc, delta: 1 | -1): Promise<void> {
const { sourceType, userGroups, metadata } = doc;
const ops = Object.entries(metadata).map(([key, entry]) => {
const humanName = (entry as ScientificMetadataEntry).human_name ?? "";
const metaKeyId = `${sourceType}_${key}_${humanName}`;
return {
updateOne: {
filter: { metaKeyId },
update: {
$set: { key, sourceType, humanReadableName: humanName },
$inc: {
usageCount: delta,
...Object.fromEntries(userGroups.map(g => [`userGroupCounts.${g.replace(/\./g, '_')}`, delta]))
},
...(delta === 1 && { $max: { isPublished: doc.isPublished } }),
$setOnInsert: { createdAt: new Date() }
},
upsert: delta === 1, // Only upsert when adding
}
};
});
if (ops.length > 0) {
await this.metadataKeyModel.bulkWrite(ops);
// Cleanup: Only run when decrementing
if (delta === -1) {
await this.metadataKeyModel.deleteMany({ usageCount: { $lte: 0 } });
}
}
}
async insertManyFromSource(doc: MetadataSourceDoc) {
await this.adjustCounts(doc, 1);
}
async deleteMany(doc: MetadataSourceDoc) {
await this.adjustCounts(doc, -1);
}
async replaceManyFromSource(oldDoc: MetadataSourceDoc, newDoc: MetadataSourceDoc) {
// Standard "out with the old, in with the new"
// MongoDB handles the interleaving safely
await this.deleteMany(oldDoc);
await this.insertManyFromSource(newDoc);
}
}There was a problem hiding this comment.
(to be changed with the right key inplace of this)
const metaKeyId = `${sourceType}_${key}_${humanName}`;There was a problem hiding this comment.
I think the code you suggested here looks very good, I will test it locally and see if there's any missing edge cases
documentation
Description
MetadataKeys are now deduplicated, instead of one entry per key per dataset, a single entry is shared across all datasets containing the same key, with
userGroups,usageCount, anduserGroupCountsmerged.userGroupCounts
New internal field that tracks how many datasets per group reference this key:
When a group's count hits 0 it is removed from userGroups. When usageCount hits 0 the entry is deleted.
Changes
dataset,proposal) permission changes (ownerGroup,accessGroups) update userGroups and userGroupCounts accordinglyisPublished: true, the metadata key will haveisPublished:trueas well and is visible to all usersMotivation
Fixes
Tests included
Documentation
official documentation info
Summary by Sourcery
Track metadata keys at a global level across datasets and other entities, aggregating usage and access control by key rather than per-source document.
New Features:
Enhancements:
Documentation:
Tests: