test(datasets): add seed script for generating mock datasets#2728

Open: Junjiequan wants to merge 2 commits into master from seed-datasets-script

@Junjiequan Junjiequan commented May 11, 2026

Description

First draft of a seed script for local development. It generates mock
datasets with random scientific metadata for query performance testing.
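
A minimal sketch of how such generation could look, for readers without the diff at hand. The helper names `randomInt`, `randomItem`, `randomString` and `UNITS` mirror those visible in the review excerpt, but their bodies, the `mulberry32` seedable generator, the sample units, and the `datasetName` / `scientificMetadata` field names are assumptions for illustration, not the script's actual implementation.

```javascript
// Sketch only: helper bodies and field names are assumptions, not the PR's code.

// Small seedable PRNG (mulberry32) so that a run can be repeated exactly.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

const UNITS = ["K", "bar", "mm", "s", "eV"]; // assumed sample units

// Build the random helpers on top of a given random source.
function makeHelpers(rand) {
  // Integer in [min, max], both ends inclusive.
  const randomInt = (min, max) => Math.floor(rand() * (max - min + 1)) + min;
  const randomItem = (items) => items[randomInt(0, items.length - 1)];
  // Lowercase ASCII string of the requested length.
  const randomString = (length) =>
    Array.from({ length }, () =>
      String.fromCharCode(97 + randomInt(0, 25))
    ).join("");
  return { randomInt, randomItem, randomString };
}

// Build one mock dataset document with `fieldCount` metadata entries.
function makeMockDataset(rand, fieldCount) {
  const { randomInt, randomItem, randomString } = makeHelpers(rand);
  const scientificMetadata = {};
  // Loop until enough distinct keys exist (random keys may collide).
  while (Object.keys(scientificMetadata).length < fieldCount) {
    const key = randomString(randomInt(5, 15));
    scientificMetadata[key] = {
      value: randomInt(1, 999999),
      unit: randomItem(UNITS),
    };
  }
  return { datasetName: `mock-${randomString(8)}`, scientificMetadata };
}

const first = makeMockDataset(mulberry32(42), 3);
const second = makeMockDataset(mulberry32(42), 3);
// Same seed, same document.
console.log(JSON.stringify(first) === JSON.stringify(second)); // true
```

Passing the random source in as a parameter keeps the generator testable; replacing `mulberry32(42)` with `Math.random` gives non-repeatable output with the same code.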

Motivation

Fixes

  • Bug fixed (#X)

Changes:

  • changes made

Tests included

  • Included for each change/fix?
  • Passing?

Documentation

  • swagger documentation updated (required for API changes)
  • official documentation updated

official documentation info

Summary by Sourcery

Add a MongoDB seed script for generating large volumes of mock datasets with realistic scientific metadata for local performance testing.

New Features:

  • Introduce a dataset seeding script that populates a MongoDB collection with configurable numbers of mock datasets and metadata for performance and query testing.

Documentation:

  • Document how to use and configure the dataset seed script, including parameters and example invocations.

@Junjiequan Junjiequan requested a review from a team as a code owner May 11, 2026 13:36

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • Consider parameterizing MONGO_URI, DB_NAME, COLLECTION, and possibly TOTAL via environment variables or CLI flags instead of hard-coding them so the script can be reused across different environments without edits.
  • The computation of TOTAL_GROUPS uses randomInt(5, 10) at module load time, which makes the number of groups vary between runs; if you want reproducible seeding or easier reasoning about group distribution, derive this from a fixed value or make it configurable.
  • Wrap the Mongo client lifecycle in a try/finally (or similar) to ensure client.close() is always called even if an error occurs during batch inserts or sampling.
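
The first and third points could be sketched roughly as follows. `readConfig` and `seedWithClient` are hypothetical helper names, the default values are placeholders rather than the script's real settings, and `TOTAL_GROUPS` is read as a fixed, configurable value per the second point. The client is only assumed to expose `connect()` and `close()`, as the `mongodb` driver's `MongoClient` does.

```javascript
// Sketch only: readConfig and seedWithClient are hypothetical helpers,
// and every default below is a placeholder, not the script's real value.

// Parse a positive integer setting, failing loudly on bad input.
function parsePositiveInt(name, raw, fallback) {
  const value = Number.parseInt(raw ?? String(fallback), 10);
  if (!Number.isInteger(value) || value < 1) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Read settings from the environment, falling back to local defaults.
function readConfig(env = process.env) {
  return {
    mongoUri: env.MONGO_URI ?? "mongodb://localhost:27017", // placeholder
    dbName: env.DB_NAME ?? "seed-test", // placeholder
    collection: env.COLLECTION ?? "Dataset", // placeholder
    total: parsePositiveInt("TOTAL", env.TOTAL, 1000),
    // Fixed by configuration instead of randomInt(5, 10) at load time.
    totalGroups: parsePositiveInt("TOTAL_GROUPS", env.TOTAL_GROUPS, 8),
  };
}

// Run `work` with a connected client and always close it afterwards,
// even when connecting or the work itself throws.
async function seedWithClient(client, work) {
  try {
    await client.connect();
    return await work(client);
  } finally {
    await client.close();
  }
}
```

With the real driver this would be called as `seedWithClient(new MongoClient(cfg.mongoUri), async (client) => { ... })`, with the batch inserts placed inside the callback.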
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Consider parameterizing `MONGO_URI`, `DB_NAME`, `COLLECTION`, and possibly `TOTAL` via environment variables or CLI flags instead of hard-coding them so the script can be reused across different environments without edits.
- The computation of `TOTAL_GROUPS` uses `randomInt(5, 10)` at module load time, which makes the number of groups vary between runs; if you want reproducible seeding or easier reasoning about group distribution, derive this from a fixed value or make it configurable.
- Wrap the Mongo client lifecycle in a `try/finally` (or similar) to ensure `client.close()` is always called even if an error occurs during batch inserts or sampling.

## Individual Comments

### Comment 1
<location path="scripts/seed/seed-datasets.js" line_range="108-111" />
<code_context>
+        const humanName = Math.random() < 0.5 && {
+          human_name: randomString(randomInt(5, 15)),
+        };
+        metadata[key] = {
+          value: randomInt(1, 999999),
+          unit: randomItem(UNITS),
+          ...(humanName !== false && humanName),
+        };
+      }
</code_context>
<issue_to_address>
**suggestion:** `...(humanName !== false && humanName)` can evaluate to `...false`. Object spread copies no properties from `false` and does not throw, so this is not a runtime error, but the guard is redundant and obscures the intent.

In this branch `humanName` is either an object or `false`, so `humanName !== false && humanName` evaluates to that same object or `false` and the extra comparison adds nothing. Building an object and spreading it states the intent more directly, for example:

```js
const humanName = Math.random() < 0.5
  ? { human_name: randomString(randomInt(5, 15)) }
  : {};

metadata[key] = {
  value: randomInt(1, 999999),
  unit: randomItem(UNITS),
  ...humanName,
};
```

or keep the short-circuit style but normalize to an object:

```js
const humanName = Math.random() < 0.5 && {
  human_name: randomString(randomInt(5, 15)),
};

metadata[key] = {
  value: randomInt(1, 999999),
  unit: randomItem(UNITS),
  ...(humanName || {}),
};
```

The shared-key branch should be adjusted in the same way.
</issue_to_address>
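
For reference, a quick check of how spread treats `false`. Under the ECMAScript object spread semantics, spreading `false`, `null` or `undefined` into an object literal copies no properties and does not throw; it is array spread that requires an iterable. The variable names below are illustrative and not taken from the PR.

```javascript
// Object spread ignores falsy primitives; array spread requires an iterable.
const base = { value: 1, unit: "K" };

const fromFalse = { ...base, ...false };
const fromNull = { ...base, ...null };
const fromUndefined = { ...base, ...undefined };
const fromObject = { ...base, ...{ human_name: "abc" } };

console.log(Object.keys(fromFalse)); // [ 'value', 'unit' ]
console.log(Object.keys(fromObject)); // [ 'value', 'unit', 'human_name' ]

// Array spread of a non-iterable is the case that throws.
let arraySpreadThrows = false;
try {
  [...false];
} catch (err) {
  arraySpreadThrows = err instanceof TypeError;
}
console.log(arraySpreadThrows); // true
```

Either rewrite suggested in the comment is therefore a readability change rather than a crash fix.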
