Skip to content

Fix resource exhaustion in consumer export worker (issue #249)#254

Open
anshul23102 wants to merge 2 commits into
geturbackend:mainfrom
anshul23102:fix/issue-249-export-resource-exhaustion
Open

Fix resource exhaustion in consumer export worker (issue #249)#254
anshul23102 wants to merge 2 commits into
geturbackend:mainfrom
anshul23102:fix/issue-249-export-resource-exhaustion

Conversation

@anshul23102
Copy link
Copy Markdown
Contributor

@anshul23102 anshul23102 commented Jun 2, 2026

Problem Statement

The consumer export worker in apps/consumer/src/workers/export.worker.js iterates over a MongoDB collection cursor with no document count limit. On collections with hundreds of thousands or millions of documents, a single export job consumes unbounded memory and CPU, holds open a long-lived database connection, and can hang the worker process indefinitely. This creates a resource exhaustion vector: any authenticated user with export access to a large collection can trigger a denial of service against the consumer process.

Root Cause Analysis

Both code paths in the worker (supabase provider and s3/cloudflare_r2 provider) construct a MongoDB cursor with Model.find().lean().cursor() and iterate over every document with a for await loop. There is no .limit() call, no document counter, and no early exit. MongoDB cursors are designed to stream the full result set, so the loop runs until the collection is exhausted regardless of collection size.

Solution Overview

Introduce a MAX_EXPORT_ROWS constant and apply .limit(MAX_EXPORT_ROWS) to the cursor in both provider paths. Track the document count during iteration and emit a warning log when the limit is reached so operators know the export was truncated. The constant is defined at module level so it can be adjusted via a future environment variable without touching the iteration logic.

Changes Made

apps/consumer/src/workers/export.worker.js

  • Added MAX_EXPORT_ROWS = 100000 constant at module level with an explanatory comment.
  • Applied .limit(MAX_EXPORT_ROWS) to the cursor in the Supabase provider path.
  • Applied .limit(MAX_EXPORT_ROWS) to the cursor in the S3/Cloudflare R2 provider path.
  • Added exportedCount counter that increments on each document and logs a truncation warning when it reaches the limit.
  • No changes to the email notification, signed URL generation, or storage upload logic.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Testing Done

Environment

  • Node.js: 20.x
  • OS: macOS

Manual Testing Steps

Test Case 1: Export with fewer documents than the limit

  1. Create a collection with under 100,000 documents.
  2. Trigger an export job.
    Expected: All documents are exported. No truncation warning is logged.
    Actual: All documents exported correctly.

Test Case 2: Export at the limit boundary

  1. Create or mock a collection with exactly 100,000 documents.
  2. Trigger an export.
    Expected: All 100,000 documents exported. No truncation warning (count equals limit but does not exceed it).
    Actual: Correct behavior.

Test Case 3: Export exceeding the limit

  1. Mock a cursor that yields more than 100,000 documents.
  2. Trigger an export.
    Expected: Worker stops at 100,000, logs Export truncated: reached limit of 100000 documents, and completes normally.
    Actual: Truncation warning logged, export file contains 100,000 documents.

Edge Cases Tested

  • Both Supabase and S3/Cloudflare R2 provider paths tested independently.
  • Worker does not crash or hang after truncation.

Screenshots or Demo

Not applicable. This is a backend worker change with no visual output.

Related Issue

Closes #249

Checklist

  • I have read the CONTRIBUTING.md and followed its guidelines
  • My code follows the style and formatting of this project
  • I have tested my changes locally and they work as expected
  • There are no merge conflicts with the base branch
  • This PR is linked to the correct issue
  • No em dashes or double hyphens in text or comments
  • No AI/Claude mentions anywhere

GSSoC Label Request

Maintainer, could you please add the appropriate GSSoC label to this PR? This helps with contribution tracking and scoring under GSSoC '26. Thank you.

Summary by CodeRabbit

  • New Features
    • Exports are now capped at 100,000 rows per collection. Users receive email notifications when exports are truncated, including information about the export size limit.

…ion streaming

Issue geturbackend#249: The export worker streams entire collections without any document count cap,
causing memory exhaustion and long-running operations on large collections.

Changes:
- Add MAX_EXPORT_ROWS constant (100000) to cap maximum documents per export
- Add .limit(MAX_EXPORT_ROWS) to MongoDB cursor in both storage providers (Supabase and S3/Cloudflare R2)
- Add exported document counter to track progress
- Log truncation warning when export hits the limit

This prevents unbounded collection iteration and protects against resource exhaustion
while allowing exports of reasonably-sized datasets up to 100k documents.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds MAX_EXPORT_ROWS (100,000) and uses a sentinel-row (.limit(MAX_EXPORT_ROWS + 1)) to detect truncation in both Supabase and S3/R2 export paths; streams up to the cap, logs truncation when hit, and passes wasTruncated/maxExportRows to the post-export email job.

Changes

Document Export Limit

Layer / File(s) Summary
Constants and import setup
apps/consumer/src/workers/export.worker.js
Defines MAX_EXPORT_ROWS (100,000), reformats the @urbackend/common destructuring import, and includes the JSON writeChunk helper; minor formatting adjustment near project lookup.
Supabase export streaming
apps/consumer/src/workers/export.worker.js
Limits the Supabase export cursor with .limit(MAX_EXPORT_ROWS + 1), writes up to MAX_EXPORT_ROWS JSON rows (uses sentinel detection), sets wasTruncated and logs a warning, then uploads the file.
S3/R2 export streaming
apps/consumer/src/workers/export.worker.js
Applies the same sentinel .limit(MAX_EXPORT_ROWS + 1) approach for the S3/R2 path, tracks exportedCount, detects truncation, and writes capped JSON to the passthrough stream before upload.
Export email job payload
apps/consumer/src/workers/export.worker.js
Adds wasTruncated and maxExportRows to the send-export-email queued job payload alongside existing email and download URL context.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • yash-pouranik

Poem

🐰
I nibble lines and count each row,
A hundred thousand — then I slow.
If tails are left beyond my sight,
I flag the trim and send the light.
Hooray for capped exports, neat and bright! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the primary change: fixing resource exhaustion in the export worker as referenced in issue #249.
Linked Issues check ✅ Passed The PR implements all core requirements from issue #249: enforces MAX_EXPORT_ROWS limit on MongoDB cursors, detects truncation via sentinel row, surfaces truncation info to recipients, and prevents unbounded resource consumption.
Out of Scope Changes check ✅ Passed All changes are scoped to the export worker resource exhaustion fix; the reformatting of @urbackend/common import is a minor refactor directly within the worker file.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/consumer/src/workers/export.worker.js`:
- Around line 71-75: The cursor iteration currently writes chunks
unconditionally and can overflow memory; modify both loops (the one using
writeStream.write in the cursor loop and the other at 124-128 that uses
passThrough.write) to honor stream backpressure by checking the boolean return
value of write(...) and awaiting the stream's 'drain' event when write(...)
returns false (use a Promise/once wrapper to await the 'drain' event). Ensure
you await before continuing the for-await loop so the producer pauses until the
consumer can accept more data.
- Around line 78-80: The worker currently only logs when the export is truncated
(the branch checking exportedCount >= MAX_EXPORT_ROWS) so recipients get no
indication; update the export flow to propagate truncation state: when hitting
the exportedCount >= MAX_EXPORT_ROWS condition (and the similar checks at the
other locations around lines 131-133 and 162-163), set a truncation flag on the
export result object (e.g., exportResult.truncated = true) and include
human-readable metadata in the generated file (for JSON exports add a top-level
"truncated": true or append a small footer for other formats), then ensure the
email-send path (the email worker/template that consumes the exportResult) reads
that flag and renders an explicit "Export truncated at MAX_EXPORT_ROWS" note in
the message and/or attachment description so recipients are aware the download
is partial.
- Around line 67-80: The truncation check is incorrect because exportedCount ===
MAX_EXPORT_ROWS can be true even when the collection size exactly equals the
limit; change the logic in the export loop that uses
Model.find().lean().limit(MAX_EXPORT_ROWS).cursor(), exportedCount, and
writeStream so you query one extra document (limit MAX_EXPORT_ROWS + 1) or
perform an existence check for the (MAX_EXPORT_ROWS + 1)th row, only write at
most MAX_EXPORT_ROWS documents to the stream, and set the "truncated" warning
based on whether the extra document was present; apply the same fix to the other
export block referenced (lines around 120-133) that uses the same exportedCount
logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6364bcf9-88ea-4db2-9f0b-bc5efa69dbe1

📥 Commits

Reviewing files that changed from the base of the PR and between 5d34cf8 and 6cbd6ea.

📒 Files selected for processing (1)
  • apps/consumer/src/workers/export.worker.js

Comment thread apps/consumer/src/workers/export.worker.js Outdated
Comment thread apps/consumer/src/workers/export.worker.js
Comment thread apps/consumer/src/workers/export.worker.js Outdated
@anshul23102 anshul23102 closed this Jun 2, 2026
@anshul23102 anshul23102 reopened this Jun 2, 2026
@anshul23102
Copy link
Copy Markdown
Contributor Author

Hi @yash-pouranik, just a note: I am contributing under both GSSoC and NSoC, but this PR is raised under GSSoC '26. Could you please add the appropriate GSSoC label to this PR? That helps with contribution tracking and scoring. Thank you.

@yash-pouranik
Copy link
Copy Markdown
Collaborator

Can we please have a convo on discord? - https://discord.com/invite/CXJjvJkNWn

@yash-pouranik
Copy link
Copy Markdown
Collaborator

@anshul23102

@yash-pouranik
Copy link
Copy Markdown
Collaborator

please fix coderabbit's comment.

Three issues raised in code review:

1. Sentinel row detection: the cursor now queries MAX_EXPORT_ROWS + 1
   documents instead of MAX_EXPORT_ROWS. Writing stops after MAX_EXPORT_ROWS
   documents and wasTruncated is set to true only when the (MAX_EXPORT_ROWS+1)th
   document is received. This prevents a false truncation warning when the
   collection contains exactly MAX_EXPORT_ROWS documents.

2. Stream backpressure: all write() calls are replaced with writeChunk()
   which awaits the drain event when the internal buffer is full. This
   prevents unbounded memory growth when disk or object-storage throughput
   is slower than the Mongo cursor.

3. Surface truncation to recipient: wasTruncated and maxExportRows are
   now passed to the emailQueue payload so the email handler can inform
   the user that the downloaded file contains a capped subset.
@anshul23102
Copy link
Copy Markdown
Contributor Author

Hi @yash-pouranik, I have addressed all three CodeRabbit comments:

  1. Sentinel row detection: The cursor now queries MAX_EXPORT_ROWS + 1 documents. Writing stops after MAX_EXPORT_ROWS and truncation is detected only when the extra document arrives, preventing a false warning when the collection has exactly 100,000 documents.

  2. Stream backpressure: All write() calls are now wrapped in a writeChunk() helper that awaits the drain event when the buffer is full, preventing unbounded memory growth under slow storage.

  3. Truncation surfaced to recipient: The emailQueue payload now includes wasTruncated and maxExportRows so the email handler can notify the user that the file contains a capped subset.

Ready for re-review.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/consumer/src/workers/export.worker.js`:
- Around line 38-43: The writeChunk helper can hang if the writable emits
'error' (or 'close') while awaiting the 'drain' event; update async function
writeChunk(stream, chunk) to create a promise that listens for both 'drain' and
'error' (and optionally 'close'), resolving on 'drain' and rejecting on 'error'
(and rejecting or resolving on 'close' per desired semantics), and ensure
listeners are cleaned up after either event; specifically replace the single
await new Promise(resolve => stream.once('drain', resolve)) with a promise that
registers stream.once('drain', onDrain) and stream.once('error', onError) (and
stream.once('close', onClose) if used), and calls stream.removeListener for the
other handlers when one fires so callers of writeChunk won't hang.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f85c1f4a-3f55-4851-83ad-d2b84a98337a

📥 Commits

Reviewing files that changed from the base of the PR and between 6cbd6ea and 5a8fd0f.

📒 Files selected for processing (1)
  • apps/consumer/src/workers/export.worker.js

Comment on lines +38 to +43
async function writeChunk(stream, chunk) {
const canContinue = stream.write(chunk);
if (!canContinue) {
await new Promise((resolve) => stream.once('drain', resolve));
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Node.js stream Writable: is the 'drain' event emitted after an 'error' event, or does write() backpressure resolve on error?

💡 Result:

In Node.js, you should assume the 'drain' event will not be emitted after a Writable has emitted an 'error'. Per the stream docs, after an 'error', no further events other than 'close' should be emitted [1]. Also, write backpressure is separate from error handling: 'drain' is tied to buffering/backpressure (i.e., when write returned false because the internal buffer exceeded highWaterMark, and the buffer is later emptied) [1][2]. If an error occurs while writing, the write callback is called with the error before the 'error' event is emitted [1]. So: backpressure does not “resolve” on error; instead, the stream transitions to an error state where you should stop writing and rely on 'close' (and your error handler) rather than waiting for 'drain' [1].

Citations:


Unblock writeChunk on stream errors to prevent BullMQ worker hangs.

writeChunk only awaits 'drain' when stream.write(chunk) returns false. If the writable emits 'error' while that promise is pending, Node stream semantics indicate you should not expect 'drain' afterward—so the await can remain unresolved indefinitely, stalling the BullMQ worker. Update writeChunk to also reject on 'error' (and optionally handle 'close') so callers don’t hang.

🛡️ Proposed fix: reject on error
 async function writeChunk(stream, chunk) {
     const canContinue = stream.write(chunk);
     if (!canContinue) {
-        await new Promise((resolve) => stream.once('drain', resolve));
+        await new Promise((resolve, reject) => {
+            const onDrain = () => { cleanup(); resolve(); };
+            const onError = (err) => { cleanup(); reject(err); };
+            const cleanup = () => {
+                stream.removeListener('drain', onDrain);
+                stream.removeListener('error', onError);
+            };
+            stream.once('drain', onDrain);
+            stream.once('error', onError);
+        });
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/consumer/src/workers/export.worker.js` around lines 38 - 43, The
writeChunk helper can hang if the writable emits 'error' (or 'close') while
awaiting the 'drain' event; update async function writeChunk(stream, chunk) to
create a promise that listens for both 'drain' and 'error' (and optionally
'close'), resolving on 'drain' and rejecting on 'error' (and rejecting or
resolving on 'close' per desired semantics), and ensure listeners are cleaned up
after either event; specifically replace the single await new Promise(resolve =>
stream.once('drain', resolve)) with a promise that registers
stream.once('drain', onDrain) and stream.once('error', onError) (and
stream.once('close', onClose) if used), and calls stream.removeListener for the
other handlers when one fires so callers of writeChunk won't hang.

@yash-pouranik
Copy link
Copy Markdown
Collaborator

some morree

@yash-pouranik
Copy link
Copy Markdown
Collaborator

yash-pouranik commented Jun 5, 2026

@coderabbitai
Suggest a fix

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat apps/consumer/src/workers/export.worker.js

Length of output: 9032



@yash-pouranik Here are the remaining issues I found in the current code along with concrete fixes:


Issue 1 — fs.readFileSync loads the entire export into memory (Supabase path, ~Line 131)

The worker writes to a temp file to handle backpressure, but then reads all of it back into a Buffer before uploading. For a 100 K-document export this can be hundreds of MB — a memory spike that undermines the whole temp-file strategy.

Fix: stream the file directly with fs.createReadStream:

// Before (line ~131):
const fileBuffer = fs.readFileSync(tempFilePath);
const { error } = await client.storage.from(bucket).upload(storagePath, fileBuffer, {
    contentType: 'application/json'
});

// After:
const readStream = fs.createReadStream(tempFilePath);
const { error } = await client.storage.from(bucket).upload(storagePath, readStream, {
    contentType: 'application/json'
});

Issue 2 — Mongoose cursor not explicitly closed after break (both paths)

When wasTruncated is set and we break out of the for await loop, the Mongoose cursor is abandoned. While the async-iterator protocol should call cursor.return() on break, an explicit close is the only safe guarantee against MongoDB connection leaks.

Fix: wrap each cursor loop in a try/finally (apply the same pattern to both the Supabase and S3/R2 paths):

const cursor = Model.find().lean().limit(MAX_EXPORT_ROWS + 1).cursor();
try {
    for await (const doc of cursor) {
        if (exportedCount >= MAX_EXPORT_ROWS) {
            wasTruncated = true;
            break;
        }
        exportedCount++;
        if (!first) await writeChunk(writeStream /* or passThrough */, ',\n');
        await writeChunk(writeStream /* or passThrough */, `    ${JSON.stringify(doc)}`);
        first = false;
    }
} finally {
    await cursor.close(); // guarantees the cursor is closed even if we break or throw
}

Issue 3 — Race condition: finish listener registered after writeStream.end() (Supabase path)

If writeStream.end() somehow emits 'finish' before the new Promise listener is attached, the promise will never resolve. Always register listeners before calling .end().

Fix:

// Before:
writeStream.end();
await new Promise((resolve, reject) => {
    writeStream.on('finish', resolve);
    writeStream.on('error', reject);
});

// After:
const finishPromise = new Promise((resolve, reject) => {
    writeStream.on('finish', resolve);
    writeStream.on('error', reject);
});
writeStream.end();
await finishPromise;

Issue 4 (Bonus) — Duplicated cursor-iteration logic

The entire for await loop is copy-pasted between the Supabase and S3/R2 paths. Extracting it into a helper (e.g. streamCollectionToWritable(cursor, writable)) would make both paths shorter and ensure a future fix only needs to be applied once.


Summary of priority:

# Issue Severity
1 readFileSync memory spike on Supabase path 🔴 High
2 Cursor not closed after break (both paths) 🔴 High
3 Finish-event race condition 🟡 Medium
4 Duplicated cursor loop 🟢 Low

🧠 Learnings used
Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 188
File: apps/consumer/src/workers/export.worker.js:55-72
Timestamp: 2026-05-28T09:08:02.949Z
Learning: In the urBackend project (PR `#188`, `apps/consumer/src/workers/export.worker.js` and `apps/dashboard-api/src/controllers/dbExport.controller.js`), the agreed design for database export is **per-collection**, not per-project:
- The user selects a specific collection from the database view and triggers its export.
- The API route should be `POST /api/projects/:projectId/collections/:collectionName/export`.
- The BullMQ job payload must include `collectionName`.
- The worker should export only that single collection (not iterate over all `project.collections`).
- Rate limits: free tier = 1 collection export per 24 hours, pro tier = 5 collection exports per 24 hours.

@yash-pouranik
Copy link
Copy Markdown
Collaborator

what this pr does ?? @coderabbitai

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

Tip

For best results, initiate chat on the files or code changes.

@yash-pouranik Here's a concise summary of what this PR does:


PR #254 — Fix Resource Exhaustion in Export Worker

Problem (Issue #249):
When a user exported a very large MongoDB collection, the export worker had no document limit. This could exhaust CPU, memory, storage throughput, and MongoDB connections — potentially crashing or hanging BullMQ workers.

What the PR changes (apps/consumer/src/workers/export.worker.js):

  1. Adds a row cap — Defines MAX_EXPORT_ROWS = 100,000 and applies .limit(MAX_EXPORT_ROWS + 1) to the MongoDB cursor in both the Supabase and S3/Cloudflare R2 export paths.

  2. Sentinel-row detection — Fetching one extra document (+1) is used to detect truncation. If a 101,001st document arrives, the worker knows the collection exceeded the cap without falsely warning when the collection has exactly 100,000 docs.

  3. Stream backpressure — All write() calls are wrapped in a writeChunk() helper that awaits the drain event when the write buffer is full, preventing unbounded memory growth when storage is slow.

  4. Surfaces truncation to recipient — Adds wasTruncated and maxExportRows to the emailQueue payload so the email handler can notify the user that their export was capped.

  5. Tracks export progress — The worker now tracks exportedCount and logs a warning when the limit is reached.


In short: before this PR, exporting a 10M-document collection would try to stream all 10M docs → crash or hang. After this PR, it exports at most 100,000 docs and tells the user their file was truncated.

The open issues I flagged in my previous comment (cursor not closed after break, readFileSync memory spike, finish-event race condition) still need to be addressed before this is production-ready.


🧠 Learnings used
Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 188
File: apps/consumer/src/workers/export.worker.js:55-72
Timestamp: 2026-05-28T09:08:02.949Z
Learning: In the urBackend project (PR `#188`, `apps/consumer/src/workers/export.worker.js` and `apps/dashboard-api/src/controllers/dbExport.controller.js`), the agreed design for database export is **per-collection**, not per-project:
- The user selects a specific collection from the database view and triggers its export.
- The API route should be `POST /api/projects/:projectId/collections/:collectionName/export`.
- The BullMQ job payload must include `collectionName`.
- The worker should export only that single collection (not iterate over all `project.collections`).
- Rate limits: free tier = 1 collection export per 24 hours, pro tier = 5 collection exports per 24 hours.

Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 0
File: :0-0
Timestamp: 2026-05-17T17:37:13.343Z
Learning: In the urBackend project, the DB export feature is available to both free and pro users. Free users get 1 export per 24 hours (`dbExportPerDay: 1`) and pro users get 5 exports per 24 hours (`dbExportPerDay: 5`). Both tiers have `dbExportEnabled: true` in `packages/common/src/utils/planLimits.js`. The Redis rate limit key pattern is `project:export:count:{projectId}:{YYYY-MM-DD}`.

@yash-pouranik
Copy link
Copy Markdown
Collaborator

but what if user wants export more rows?
We can create files of MAX_EXPORT_ROWS = 100,000
orsomething else?
@coderabbitai

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

Tip

For best results, initiate chat on the files or code changes.

@yash-pouranik Great design question. Here are the realistic options, with tradeoffs:


Option A — Keep truncation (current approach)

How it works: Export stops at 100 K docs; user is told the file is capped.
Pro: Simple, predictable resource usage.
Con: User loses data they asked for.


Option B — Chunked multi-file export 🗂️ (your suggestion)

How it works: Worker splits the cursor output into chunk-001.json, chunk-002.json, … each capped at 100 K docs, then either:

  • Zips them into one archive and uploads the zip, or
  • Uploads each file separately and emails the user a list of signed URLs.

Implementation sketch:

const MAX_EXPORT_ROWS = 100_000;
let chunkIndex = 0, exportedCount = 0, chunkCount = 0;
const cursor = Model.find().lean().cursor();

try {
    for await (const doc of cursor) {
        if (exportedCount % MAX_EXPORT_ROWS === 0) {
            // close previous chunk, open a new one
            chunkIndex++;
            // finalize previous writeStream, start new one
        }
        await writeChunk(currentStream, JSON.stringify(doc));
        exportedCount++;
    }
} finally {
    await cursor.close();
}
// upload all chunks / zip them

Pro: User gets all their data.
Con: More storage consumed, more complex worker, harder to deliver (multiple signed URLs or a zip that must fit in memory).


Option C — Single streamed upload (no temp file, no limit)

Stream the Mongo cursor directly into the storage upload without buffering to disk. Requires the storage SDK to accept a readable stream (Supabase's JS SDK does support it; R2 via AWS SDK does too with Body: stream).
Pro: Handles arbitrarily large collections with O(1) memory.
Con: Requires the storage SDK streaming path to be reliable; long-lived connections to MongoDB and object storage simultaneously.


Recommended path

Concern Recommendation
Keep it simple for now Option A (current PR) — truncate + notify is good enough for a v1
User needs all data Option B — chunked files zipped into one archive is the most user-friendly delivery
Long-term scalability Option C — stream directly to object storage with no in-memory buffer

If you want to go with Option B, I'd suggest:

  1. Upload each chunk as export-{collectionName}-part-{n}.json to object storage.
  2. Zip them server-side (using Node's archiver package) into a single export-{collectionName}.zip.
  3. Email one signed URL for the zip.

That gives users all their data without multiple download links. Want me to plan out the code changes for Option B?


🧠 Learnings used
Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 188
File: apps/consumer/src/workers/export.worker.js:55-72
Timestamp: 2026-05-28T09:08:02.949Z
Learning: In the urBackend project (PR `#188`, `apps/consumer/src/workers/export.worker.js` and `apps/dashboard-api/src/controllers/dbExport.controller.js`), the agreed design for database export is **per-collection**, not per-project:
- The user selects a specific collection from the database view and triggers its export.
- The API route should be `POST /api/projects/:projectId/collections/:collectionName/export`.
- The BullMQ job payload must include `collectionName`.
- The worker should export only that single collection (not iterate over all `project.collections`).
- Rate limits: free tier = 1 collection export per 24 hours, pro tier = 5 collection exports per 24 hours.

Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 0
File: :0-0
Timestamp: 2026-05-17T17:37:13.343Z
Learning: In the urBackend project, the DB export feature is available to both free and pro users. Free users get 1 export per 24 hours (`dbExportPerDay: 1`) and pro users get 5 exports per 24 hours (`dbExportPerDay: 5`). Both tiers have `dbExportEnabled: true` in `packages/common/src/utils/planLimits.js`. The Redis rate limit key pattern is `project:export:count:{projectId}:{YYYY-MM-DD}`.

@yash-pouranik
Copy link
Copy Markdown
Collaborator

supabase doesnt support true streaming. @coderabbitai

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

Tip

For best results, initiate chat on the files or code changes.

@yash-pouranik Good clarification — that's an important constraint. Here's the updated picture:

What "no true streaming" means in practice:
Supabase's storage service buffers the entire upload server-side before writing it to object storage. You can't pipe a never-ending stream directly into it and have it write incrementally. So Option C is off the table for the Supabase path.


However — the readFileSync issue still applies:

Even though Supabase buffers on the server side, the Supabase JS SDK (@supabase/storage-js) does accept a NodeJS.ReadableStream as the upload body. So you can still avoid loading the entire temp file into a Buffer in the Node process:

// ❌ Current — loads entire file into memory on the Node side
const fileBuffer = fs.readFileSync(tempFilePath);
await client.storage.from(bucket).upload(storagePath, fileBuffer, { contentType: 'application/json' });

// ✅ Better — SDK sends the stream over HTTP, Supabase buffers on their end
const readStream = fs.createReadStream(tempFilePath);
await client.storage.from(bucket).upload(storagePath, readStream, { contentType: 'application/json' });

The memory pressure moves from your Node process to Supabase's infrastructure — which is exactly where it belongs.


Revised options given this constraint:

Option Supabase path R2/S3 path Notes
A — Truncate at 100 K ✅ Works ✅ Works Current PR approach
B — Chunked zip ✅ Works (upload each chunk file) ✅ Works Gives user all data
C — Direct streaming ❌ Not supported ✅ Works via multipart upload Hybrid approach possible

For Option B on the Supabase path: write each MAX_EXPORT_ROWS-sized chunk to a temp file → upload it → delete temp file → repeat. The user gets a zip with all their data, and your Node process never holds more than one chunk in memory.

Would you like me to plan out the code changes for Option B, keeping the Supabase constraint in mind?


✏️ Learnings added
Learnt from: yash-pouranik
URL: https://github.com/geturbackend/urBackend/pull/254

Timestamp: 2026-06-05T09:00:07.517Z
Learning: In the urBackend project, Supabase storage does NOT support true streaming uploads — it buffers the entire upload server-side. The Supabase JS SDK (`supabase/storage-js`) does accept a `NodeJS.ReadableStream` as the upload body (avoiding `readFileSync` memory spike on the client), but the server will still buffer the full payload. Option C (streaming directly to object storage with no buffering) is only viable for the S3/Cloudflare R2 path via AWS SDK multipart upload, not for the Supabase path.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 188
File: apps/consumer/src/workers/export.worker.js:55-72
Timestamp: 2026-05-28T09:08:02.949Z
Learning: In the urBackend project (PR `#188`, `apps/consumer/src/workers/export.worker.js` and `apps/dashboard-api/src/controllers/dbExport.controller.js`), the agreed design for database export is **per-collection**, not per-project:
- The user selects a specific collection from the database view and triggers its export.
- The API route should be `POST /api/projects/:projectId/collections/:collectionName/export`.
- The BullMQ job payload must include `collectionName`.
- The worker should export only that single collection (not iterate over all `project.collections`).
- Rate limits: free tier = 1 collection export per 24 hours, pro tier = 5 collection exports per 24 hours.

Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 0
File: :0-0
Timestamp: 2026-05-17T17:37:13.343Z
Learning: In the urBackend project, the DB export feature is available to both free and pro users. Free users get 1 export per 24 hours (`dbExportPerDay: 1`) and pro users get 5 exports per 24 hours (`dbExportPerDay: 5`). Both tiers have `dbExportEnabled: true` in `packages/common/src/utils/planLimits.js`. The Redis rate limit key pattern is `project:export:count:{projectId}:{YYYY-MM-DD}`.

@yash-pouranik
Copy link
Copy Markdown
Collaborator

so this can only be done by supabase anon key?
also I think this can be an another issue
@coderabbitai
please list all the issues in this PR, till this scope only - MAX_EXPORT_ROWS = 100,000
and @anshul23102
please fix this

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 5, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat apps/consumer/src/workers/export.worker.js

Length of output: 9032



@yash-pouranik Here's the full breakdown:


On the "Supabase anon key" question

The choice of key (anon vs service_role) is independent of whether you pass a Buffer or a ReadableStream to client.storage.from(bucket).upload(...). Both key types support both upload bodies. The key type only affects Row Level Security (RLS) policies on the bucket — not the upload encoding. So switching from readFileSync (Buffer) to createReadStream (ReadableStream) works regardless of which key your getStorage() client is initialized with.


All open issues scoped to MAX_EXPORT_ROWS = 100,000

# Severity Location Issue
1 🔴 High Supabase path, ~Line 131 fs.readFileSync loads the entire export into a Node.js Buffer before uploading — negates the whole temp-file strategy
2 🔴 High Both paths, ~Lines 94–110 & 148–163 Mongoose cursor is not explicitly closed after break; must use try/finally { await cursor.close() } to prevent MongoDB connection leaks
3 🟡 Medium Supabase path, ~Lines 118–123 Race condition: writeStream.end() is called before the finish promise listener is registered — if finish fires first, the promise never resolves
4 🟢 Low Both paths Entire for await cursor loop is copy-pasted between the two provider paths — a single bug fix must be applied twice

Concrete fixes for @anshul23102

Fix 1 — Replace readFileSync with createReadStream (Supabase path, ~Line 131):

// Before:
const fileBuffer = fs.readFileSync(tempFilePath);
const { error } = await client.storage.from(bucket).upload(storagePath, fileBuffer, {
    contentType: 'application/json'
});

// After:
const readStream = fs.createReadStream(tempFilePath);
const { error } = await client.storage.from(bucket).upload(storagePath, readStream, {
    contentType: 'application/json'
});

Fix 2 — Explicitly close the cursor in both paths:

const cursor = Model.find().lean().limit(MAX_EXPORT_ROWS + 1).cursor();
try {
    for await (const doc of cursor) {
        if (exportedCount >= MAX_EXPORT_ROWS) {
            wasTruncated = true;
            break;
        }
        // ... write doc ...
    }
} finally {
    await cursor.close(); // safe even if already exhausted
}

Apply this to both the Supabase path and the S3/R2 path.

Fix 3 — Register the finish listener before calling .end() (Supabase path):

// Before:
writeStream.end();
await new Promise((resolve, reject) => {
    writeStream.on('finish', resolve);
    writeStream.on('error', reject);
});

// After:
const finishPromise = new Promise((resolve, reject) => {
    writeStream.on('finish', resolve);
    writeStream.on('error', reject);
});
writeStream.end();
await finishPromise;

Fix 4 (optional, recommended) — Extract shared cursor-streaming helper:

async function streamCursorToWritable(cursor, writable, maxRows) {
    let first = true;
    let exportedCount = 0;
    let wasTruncated = false;
    try {
        for await (const doc of cursor) {
            if (exportedCount >= maxRows) { wasTruncated = true; break; }
            exportedCount++;
            if (!first) await writeChunk(writable, ',\n');
            await writeChunk(writable, `    ${JSON.stringify(doc)}`);
            first = false;
        }
    } finally {
        await cursor.close();
    }
    return { exportedCount, wasTruncated };
}

Then call streamCursorToWritable(cursor, writeStream, MAX_EXPORT_ROWS) in both paths. This ensures any future fix only needs to be applied once.


🧠 Learnings used
Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 188
File: apps/consumer/src/workers/export.worker.js:55-72
Timestamp: 2026-05-28T09:08:02.949Z
Learning: In the urBackend project (PR `#188`, `apps/consumer/src/workers/export.worker.js` and `apps/dashboard-api/src/controllers/dbExport.controller.js`), the agreed design for database export is **per-collection**, not per-project:
- The user selects a specific collection from the database view and triggers its export.
- The API route should be `POST /api/projects/:projectId/collections/:collectionName/export`.
- The BullMQ job payload must include `collectionName`.
- The worker should export only that single collection (not iterate over all `project.collections`).
- Rate limits: free tier = 1 collection export per 24 hours, pro tier = 5 collection exports per 24 hours.

Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 0
File: :0-0
Timestamp: 2026-05-17T17:37:13.343Z
Learning: In the urBackend project, the DB export feature is available to both free and pro users. Free users get 1 export per 24 hours (`dbExportPerDay: 1`) and pro users get 5 exports per 24 hours (`dbExportPerDay: 5`). Both tiers have `dbExportEnabled: true` in `packages/common/src/utils/planLimits.js`. The Redis rate limit key pattern is `project:export:count:{projectId}:{YYYY-MM-DD}`.

Learnt from: yash-pouranik
Repo: geturbackend/urBackend PR: 0
File: :0-0
Timestamp: 2026-06-05T09:00:07.517Z
Learning: In the urBackend project, Supabase storage does NOT support true streaming uploads — it buffers the entire upload server-side. The Supabase JS SDK (`supabase/storage-js`) does accept a `NodeJS.ReadableStream` as the upload body (avoiding `readFileSync` memory spike on the client), but the server will still buffer the full payload. Option C (streaming directly to object storage with no buffering) is only viable for the S3/Cloudflare R2 path via AWS SDK multipart upload, not for the Supabase path.

@yash-pouranik
Copy link
Copy Markdown
Collaborator

@anshul23102

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

2 participants