31 commits
52b13ae
refactor(metadatakeys): merge duplicate metadata keys across dataset…
Junjiequan Apr 16, 2026
90cc22f
refactor(metadatakeys): merge duplicate keys across datasets and trac…
Junjiequan Apr 20, 2026
7daa0c1
trim migration script comment and fix failing tests
Junjiequan Apr 20, 2026
af3fef7
fix typo and api test
Junjiequan Apr 20, 2026
8320494
set ownerGroup as required with nonEmpty value
Junjiequan Apr 20, 2026
386ccf0
update doc
Junjiequan Apr 20, 2026
34e0a24
fix api test
Junjiequan Apr 20, 2026
74851db
fix api test
Junjiequan Apr 20, 2026
e66ebac
fix api test
Junjiequan Apr 21, 2026
125c507
fix api test
Junjiequan Apr 21, 2026
9a41f76
address sourcer comment
Junjiequan Apr 23, 2026
22e8a2f
remove unncessary request import
Junjiequan Apr 28, 2026
352437d
add batch for scientificMetadata sync migration
Junjiequan Apr 28, 2026
40a0c6d
address comment
Junjiequan Apr 28, 2026
b5446ed
minor fix
Junjiequan Apr 28, 2026
d51d069
Revert "set ownerGroup as required with nonEmpty value"
Junjiequan Apr 29, 2026
add29b3
Merge branch 'master' into refactor-metadatakeys-service
Junjiequan Apr 29, 2026
ff1a457
revert existingDataset in findByIdAndUpdate
Junjiequan Apr 29, 2026
43ea114
remove unnecessary comment
Junjiequan Apr 29, 2026
921d11b
add back existingDataset &
Junjiequan Apr 30, 2026
d90c3cd
fix migration pipe
Junjiequan Apr 30, 2026
d71cb40
Merge branch 'master' into refactor-metadatakeys-service
Junjiequan Apr 30, 2026
1aee8bc
include MAX_USER_GROUPS_PER_METADATA_KEY logic for metdataKeys migrat…
Junjiequan Apr 30, 2026
c6905fa
update seed migration script
Junjiequan May 4, 2026
0aeb977
Replace concat _id with compound filter (sourceType + key + humanRead…
Junjiequan May 4, 2026
3b10fdc
improve migration script and
Junjiequan May 11, 2026
32f9540
improve metadataKeys doc update logic and add unit tests based on com…
Junjiequan May 11, 2026
665f392
set createdBy, updatedBy on insert and updatedAt on every upsert in M…
Junjiequan May 11, 2026
fb5dfb0
fix minor typescript import errors
Junjiequan May 11, 2026
6ff971d
reorder imports
Junjiequan May 11, 2026
1d4f99a
add shared createMetadataKeysInstance function to reduce duplication
Junjiequan May 11, 2026
165 changes: 135 additions & 30 deletions docs/developer-guide/metadatakeys-module.md
@@ -17,74 +17,179 @@ The previous implementation in the Datasets service lacked a permission-based fi
- Stability: Crashes occurred when retrieval limits were missing or improperly configured.
- Risks: Users could see metadata keys they did not have permissions to access.

---
Member

Could you explain in which cases this can happen that a user sees metadata keys they do not have permissions for?

Member Author

The metadata keys returned from /api/v3/datasets/metadataKeys are not filtered by permission.

Member

OK, but what does that mean? For example, if /api/v4/datasets/metadatakeys is used, will the keys always be filtered by permissions? From your initial description I thought it might have to do with setting limits wrongly.

Member Author

The new endpoint /api/v4/metadataKeys/findAll?{query} will always return metadata keys filtered by permissions.

I'm not sure I understand your second question.


## Module Architecture

This module consists of a dedicated Controller and Service layer that implements a robust permission-aware logic.

### MetadataKeysController

Provides the API interface for searching keys. Allowed filters can be found in `src/metadata-keys/metadatakeys.service.ts` and exmaple can be find in `src/metadata-keys/types/metadatakeys-filter-content.ts`
Provides the API interface for searching metadata keys.

- `Endpoint`: GET /metadatakeys (replaces /datasets/metadataKeys)
- `Method`: findAll
- `Endpoint Access`: Endpoint can be Accessed by any users
- **Endpoint**: `GET /metadatakeys` (replaces `GET /datasets/metadataKeys`)
- **Method**: `findAll`
- **Access**: Any authenticated user (permission filtering is applied server-side)
- Allowed filter fields: see `src/metadata-keys/types/metadatakeys-lookup.ts`
- Filter examples: see `src/metadata-keys/types/metadatakeys-filter-content.ts`

---

### MetadataKeysService

This handles the business logic and talks to the database. It is divided into user-facing search logic and internal data synchronization.
Handles business logic and database access. Split into two concerns:

#### 1. User-facing search — `findAll`

Applies CASL permission filters before querying:

| User type | Visible keys |
| -------------------- | -------------------------------------------------------- |
| Admin | All keys in the system |
| Authenticated user | Keys where they belong to `ownerGroup` or `accessGroups` |
| Unauthenticated user | Keys marked `isPublished: true` |

Results default to 100 per page if no limit is provided.
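
The table above can be read as a function from requester type to an extra query condition. The following is a minimal sketch under stated assumptions, not the service's actual code: the `Requester` type, `permissionFilter`, and `buildQuery` are hypothetical names, and the real service derives the condition from CASL abilities rather than a hand-written switch. It follows the table literally, so whether authenticated users also see published keys is left open here.

```typescript
// Hypothetical requester shapes, for illustration only.
type Requester =
  | { kind: "admin" }
  | { kind: "authenticated"; groups: string[] }
  | { kind: "anonymous" };

type Filter = Record<string, unknown>;

// Default page size stated in the documentation above.
const DEFAULT_LIMIT = 100;

// Returns the condition that restricts which keys the requester may see.
function permissionFilter(user: Requester): Filter {
  switch (user.kind) {
    case "admin":
      return {}; // no restriction
    case "authenticated":
      return { userGroups: { $in: user.groups } };
    case "anonymous":
      return { isPublished: true };
  }
}

// Combines the caller's `where` clause with the permission condition
// and falls back to the default limit when none is given.
function buildQuery(user: Requester, where: Filter, limit?: number) {
  return {
    filter: { $and: [where, permissionFilter(user)] },
    limit: limit ?? DEFAULT_LIMIT,
  };
}
```

Because the permission condition is always joined with `$and`, a caller-supplied `where` clause can narrow the result but cannot widen it.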

#### 2. Internal synchronization

These methods are called internally when source documents are created, updated, or deleted. They are never called directly from the controller.

##### `insertManyFromSource(doc)`

Called when a dataset is **created** or **gains new metadata keys**.

For each key in `scientificMetadata`:

- Upserts a `MetadataKey` document identified by `${sourceType}_${key}_${humanReadableName}`
- Increments `usageCount` (total datasets referencing this key)
- Increments per-group reference counts in `userGroupCounts`
- Adds new groups to the `userGroups` query array via `$addToSet`
- Sets `isPublished: true` if the source dataset is published (never unsets inline — the cronjob handles the `true → false` transition)
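
As an illustration of the bullets above, the sketch below builds one MongoDB-style update document per metadata key. It is a sketch under assumptions: `SourceDoc`, `UpsertOp`, and `buildUpserts` are hypothetical names, and taking the groups from `ownerGroup` plus `accessGroups` is an assumption about where the groups come from. It only constructs plain objects and does not talk to a database.

```typescript
// Hypothetical, trimmed-down view of a source dataset.
interface SourceDoc {
  ownerGroup: string;
  accessGroups: string[];
  isPublished: boolean;
  scientificMetadata: Record<string, { human_name?: string }>;
}

interface UpsertOp {
  id: string;
  update: {
    $inc: Record<string, number>;
    $addToSet: { userGroups: { $each: string[] } };
    $set?: { isPublished: true };
  };
}

// Builds one upsert description per key in scientificMetadata.
function buildUpserts(sourceType: string, doc: SourceDoc): UpsertOp[] {
  // Deduplicate so each group is counted once per dataset.
  const groups = Array.from(new Set([doc.ownerGroup, ...doc.accessGroups]));
  return Object.keys(doc.scientificMetadata).map((key) => {
    const humanReadableName = doc.scientificMetadata[key].human_name ?? "";
    // usageCount and every per-group counter go up by one.
    const $inc: Record<string, number> = { usageCount: 1 };
    for (const g of groups) $inc[`userGroupCounts.${g}`] = 1;
    const update: UpsertOp["update"] = {
      $inc,
      $addToSet: { userGroups: { $each: groups } },
    };
    // Only ever set to true here; the true -> false transition is left to the cronjob.
    if (doc.isPublished) update.$set = { isPublished: true };
    return { id: `${sourceType}_${key}_${humanReadableName}`, update };
  });
}
```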

##### `deleteMany(doc)`

#### Permission Layer (Applies to findAll only):
Called when a dataset is **deleted** or **loses metadata keys**.

When a user searches for keys, the service uses accessibleBy to automatically append access filters based on CASL permissions:
Runs three sequential steps:

- `Admins`: Can search and get all metadata keys in the system.
- `Authenticated Users`: Can only get keys where they are part of the ownerGroup or accessGroups.
- `Unauthenticated Users`: Can only get keys that are marked as isPublished.
1. Decrements `usageCount` and per-group counts in `userGroupCounts`
2. Recomputes the `userGroups` array from the updated counts — drops any group whose count reached zero
3. Deletes `MetadataKey` documents where `usageCount <= 0`

`usageCount` is the authoritative deletion signal. A dataset with no `userGroups` and `isPublished: false` would be invisible to both `userGroupCounts` and `isPublished` checks, so neither alone can substitute for it.
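
The three steps can be sketched against a single in-memory record. This is an illustration only: `KeyState` and `releaseReference` are hypothetical names, and the real service performs these steps as database operations across many documents. Whether zero-count entries are also removed from `userGroupCounts` is not stated above, so the sketch leaves them in place.

```typescript
// Hypothetical in-memory view of the counters on one MetadataKey document.
interface KeyState {
  usageCount: number;
  userGroupCounts: Record<string, number>;
  userGroups: string[];
}

// Applies the three steps for one dataset that stops referencing the key.
// Returns null when the document should be deleted.
function releaseReference(state: KeyState, groups: string[]): KeyState | null {
  // Step 1: decrement usageCount and the per-group counts.
  const counts: Record<string, number> = { ...state.userGroupCounts };
  for (const g of groups) counts[g] = (counts[g] ?? 0) - 1;
  const usageCount = state.usageCount - 1;

  // Step 2: recompute userGroups, dropping groups whose count reached zero.
  const userGroups = Object.keys(counts).filter((g) => (counts[g] ?? 0) > 0);

  // Step 3: usageCount alone decides deletion.
  if (usageCount <= 0) return null;
  return { usageCount, userGroupCounts: counts, userGroups };
}
```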

#### Service Methods:
##### `replaceManyFromSource(oldDoc, newDoc)`

- `findAll`: The only public-facing method. It applies the permission layer and then uses a database aggregation pipeline to find and return the specific keys requested by the user. Every search is limited to 100 results by default, if limit is not provided.
- `insertManyFromSource`: An internal method that takes an original document (like a Dataset), extracts fields from **scientificMetadata**, **metadata**, and **customMetadata**, and creates new records in the Metadata Keys collection.
- `deleteMany`: Removes metadata key entries associated with a source document when that document is deleted from the system.
- `replaceManyFromSource`: Triggered when a source document (e.g., a Dataset or Proposal) is updated. It calls `deleteMany` and `insertManyFromSource` sequentially.
Called when a dataset is **updated**. Diffs the old and new `scientificMetadata` to produce three disjoint key sets:

## Usage Example
| Set | Keys | Action |
| ------- | ---------------- | -------------------------------------------------------------------- |
| Added | Only in `newDoc` | `insertManyFromSource` |
| Removed | Only in `oldDoc` | `deleteMany` |
| Shared | In both | `updateSharedKeys` (group / isPublished / humanReadableName changes) |

To list all metadata keys associated with a dataset, the user must provide the sourceType and sourceId. If the fields array is provided, only those specific fields will be returned:
The three sets are disjoint by `_id` so they run in parallel via `Promise.all`.
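
A minimal sketch of the key diff, assuming `scientificMetadata` is a plain object keyed by metadata key. `diffKeys`, `replaceKeys`, and the `Handler` callbacks are hypothetical names standing in for the service methods listed in the table.

```typescript
interface KeyDiff {
  added: string[];
  removed: string[];
  shared: string[];
}

// Splits the keys of two metadata objects into three disjoint sets.
function diffKeys(
  oldMeta: Record<string, unknown>,
  newMeta: Record<string, unknown>,
): KeyDiff {
  const oldKeys = new Set(Object.keys(oldMeta));
  const newKeys = new Set(Object.keys(newMeta));
  return {
    added: Object.keys(newMeta).filter((k) => !oldKeys.has(k)),
    removed: Object.keys(oldMeta).filter((k) => !newKeys.has(k)),
    shared: Object.keys(newMeta).filter((k) => oldKeys.has(k)),
  };
}

// Hypothetical stand-ins for insertManyFromSource, deleteMany and updateSharedKeys.
type Handler = (keys: string[]) => Promise<void>;

// Because no key appears in more than one set, the handlers can run concurrently.
async function replaceKeys(
  oldMeta: Record<string, unknown>,
  newMeta: Record<string, unknown>,
  handlers: { insert: Handler; remove: Handler; update: Handler },
): Promise<KeyDiff> {
  const diff = diffKeys(oldMeta, newMeta);
  await Promise.all([
    handlers.insert(diff.added),
    handlers.remove(diff.removed),
    handlers.update(diff.shared),
  ]);
  return diff;
}
```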

For shared keys, three things are handled independently:

- **userGroups changed** — added groups are incremented, removed groups are decremented, then `userGroups` array is recomputed from the updated counts
- **isPublished flipped true** — sets `isPublished: true` inline; `false` is left to the cronjob
- **humanReadableName changed** — since `humanReadableName` is part of `_id`, this is treated as a delete of the old document + insert of a new one
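
For the group change on a shared key, the increments and decrements can be expressed as a single delta map. This is a sketch with a hypothetical helper name, `groupDelta`; it only computes the deltas and does not apply them to any document.

```typescript
// Returns +1 for groups that were added and -1 for groups that were removed.
// Groups present in both lists produce no entry, so their counts stay untouched.
function groupDelta(oldGroups: string[], newGroups: string[]): Record<string, number> {
  const oldSet = new Set(oldGroups);
  const newSet = new Set(newGroups);
  const delta: Record<string, number> = {};
  newSet.forEach((g) => {
    if (!oldSet.has(g)) delta[g] = 1;
  });
  oldSet.forEach((g) => {
    if (!newSet.has(g)) delta[g] = -1;
  });
  return delta;
}
```

After the deltas are applied to `userGroupCounts`, the `userGroups` array would be rebuilt from the groups whose count is still positive, as described above.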

---

## Schema

Each `MetadataKey` document has the following key fields:

| Field | Type | Description |
| ------------------- | --------------------- | ---------------------------------------------------------------------------------------- |
| `_id` | `string` | Composite key: `${sourceType}_${key}_${humanReadableName}` |
| `key` | `string` | The raw metadata key name |
| `humanReadableName` | `string` | Human-readable label from `human_name`, empty string if absent |
| `sourceType` | `string` | Source collection: `Dataset`, `Proposal`, `Sample`, etc. |
| `userGroups` | `string[]` | Groups that can see this key — kept in sync with `userGroupCounts` for query performance |
| `userGroupCounts` | `Map<string, number>` | Per-group reference counts — source of truth for safe group removal |
| `usageCount` | `number` | Total datasets referencing this key — authoritative deletion signal |
| `isPublished` | `boolean` | True if any contributing dataset is published |

`userGroups` and `userGroupCounts` are intentionally redundant. `userGroupCounts` owns the truth and enables safe atomic decrements. `userGroups` is a denormalized array kept for query performance — MongoDB's multikey index on `userGroups` makes `{ userGroups: { $in: [...] } }` efficient in a way that querying Map keys directly is not.
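
The relationship between the two fields can be stated as an invariant: `userGroups` should contain exactly the groups whose count in `userGroupCounts` is positive. The sketch below mirrors the table as a TypeScript interface and checks that invariant. The interface is a reading of the table, not the actual Mongoose schema, and `groupsFromCounts` and `isInSync` are hypothetical helpers.

```typescript
// Mirrors the field table above; not the real schema class.
interface MetadataKey {
  _id: string;
  key: string;
  humanReadableName: string;
  sourceType: string;
  userGroups: string[];
  userGroupCounts: Map<string, number>;
  usageCount: number;
  isPublished: boolean;
}

// Derives the denormalized array from the source-of-truth counts.
function groupsFromCounts(counts: Map<string, number>): string[] {
  const groups: string[] = [];
  counts.forEach((count, group) => {
    if (count > 0) groups.push(group);
  });
  return groups.sort();
}

// True when userGroups matches what the counts imply.
function isInSync(doc: MetadataKey): boolean {
  const expected = groupsFromCounts(doc.userGroupCounts);
  const actual = doc.userGroups.slice().sort();
  return expected.length === actual.length && expected.every((g, i) => g === actual[i]);
}
```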

---

## Filter Examples

List metadata keys visible to the current user for a given source type:

```json
{
"where": {
"sourceType": "dataset",
"sourceId": "datasetId"
"sourceType": "Dataset"
},
"fields": ["humanreadableName", "key"],
"fields": ["key", "humanReadableName"],
"limits": {
"limit": 10,
"skip": 0,
"sort": {
"createdAt": "asc | desc"
"createdAt": "desc"
}
}
}
```

To retrieve a specific metadata key, use the following filter:
Find a specific key by name:

```json
{
"where": {
"sourceType": "Dataset",
"key": "temperature"
},
"limits": {
"limit": 1,
"skip": 0
}
}
```

Partial search on `key`:

```json
{
"where": {
"sourceType": "dataset",
"sourceId": "datasetId",
"key": "metadata_key_name"
"sourceType": "Dataset",
"key": { "$regex": "temp", "$options": "i" }
},
"fields": ["key"],
"limits": {
"limit": 10,
"skip": 0,
"sort": {
"createdAt": "asc | desc"
}
"skip": 0
}
}
```

Partial search on `humanReadableName`:

```json
{
"where": {
"sourceType": "Dataset",
"humanReadableName": { "$regex": "temp", "$options": "i" }
},
"limits": {
"limit": 10,
"skip": 0
}
}
```
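
When the search term comes from user input, the partial-search filters above could be built programmatically. The following is a sketch under assumptions: `MetadataKeysFilter`, `escapeRegex`, and `partialSearch` are hypothetical names, the filter shape is inferred from the JSON examples, and escaping regex metacharacters is a precaution added here rather than documented behavior of the endpoint.

```typescript
// Filter shape inferred from the JSON examples above.
interface MetadataKeysFilter {
  where: Record<string, unknown>;
  fields?: string[];
  limits?: { limit: number; skip: number; sort?: Record<string, "asc" | "desc"> };
}

// Escapes regex metacharacters so the term is matched literally.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a case-insensitive partial-match filter on `key` or `humanReadableName`.
function partialSearch(
  sourceType: string,
  field: "key" | "humanReadableName",
  term: string,
  limit = 10,
): MetadataKeysFilter {
  return {
    where: { sourceType, [field]: { $regex: escapeRegex(term), $options: "i" } },
    limits: { limit, skip: 0 },
  };
}
```

Calling `partialSearch("Dataset", "key", "temp")` produces the same object as the "Partial search on `key`" example.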

---

## Initial Migration

The `MetadataKeys` collection is populated by a migration script that must be run manually before the service is deployed for the first time.

See: `migrations/20260417145401-sync-dataset-scientificMetadata-to-metadatakeys.js`

Documentation: `migrations/20260417145401-sync-dataset-scientificMetadata-to-metadatakeys.md`

> ⚠️ The application will start normally without the migration, but the MetadataKeys service will return empty results until it is run.