feat: [document-compression-updater]- Improvements: Auto-remove dummy compression field, drop tracker collection and more #182

Open
NestorRV wants to merge 9 commits into awslabs:master from NestorRV:compression-updater-improvements
@NestorRV NestorRV commented Apr 1, 2026

Summary

  • Add connection retry logic and startup validation
  • Fix UnboundLocalError on empty collection in setup()
  • Replace bare print() calls with printLog() throughout
  • Add per-batch ASCII progress bar, elapsed time, rate, and ETA logging
  • Auto-remove dummy compression field after each batch via $unset
  • Add --skip-cleanup and --append-log flags
  • Drop tracker collection on successful completion
  • Remove dead multiprocessing queue code and unused imports

Testing

  • I've run the tool against a testing collection and I can see the tool doing the work:
[2026-04-06T14:12:33.049971] [██████░░░░░░░░░░░░░░] 32.8% | 1,650,000/5,029,703 | batch: 10,000 docs in 22.4s | rate: 73697 docs/s | ETA: 0:00:45 | sleeping for 0 seconds
[2026-04-06T14:12:56.425212] [██████░░░░░░░░░░░░░░] 33.0% | 1,660,000/5,029,703 | batch: 10,000 docs in 23.4s | rate: 36273 docs/s | ETA: 0:01:32 | sleeping for 0 seconds
[2026-04-06T14:13:19.677014] [██████░░░░░░░░░░░░░░] 33.2% | 1,670,000/5,029,703 | batch: 10,000 docs in 23.2s | rate: 24197 docs/s | ETA: 0:02:18 | sleeping for 0 seconds
[2026-04-06T14:13:44.077546] [██████░░░░░░░░░░░░░░] 33.4% | 1,680,000/5,029,703 | batch: 10,000 docs in 24.4s | rate: 17984 docs/s | ETA: 0:03:06 | sleeping for 0 seconds
[2026-04-06T14:14:07.242763] [██████░░░░░░░░░░░░░░] 33.6% | 1,690,000/5,029,703 | batch: 10,000 docs in 23.2s | rate: 14496 docs/s | ETA: 0:03:50 | sleeping for 0 seconds
[2026-04-06T14:14:30.535060] [██████░░░░░░░░░░░░░░] 33.8% | 1,700,000/5,029,703 | batch: 10,000 docs in 23.3s | rate: 12154 docs/s | ETA: 0:04:33 | sleeping for 0 seconds
[2026-04-06T14:14:53.894348] [██████░░░░░░░░░░░░░░] 34.0% | 1,710,000/5,029,703 | batch: 10,000 docs in 23.4s | rate: 10476 docs/s | ETA: 0:05:16 | sleeping for 0 seconds
[2026-04-06T14:15:17.153046] [██████░░░░░░░░░░░░░░] 34.2% | 1,720,000/5,029,703 | batch: 10,000 docs in 23.3s | rate: 9223 docs/s | ETA: 0:05:58 | sleeping for 0 seconds
[2026-04-06T14:15:41.443746] [██████░░░░░░░░░░░░░░] 34.4% | 1,730,000/5,029,703 | batch: 10,000 docs in 24.3s | rate: 8207 docs/s | ETA: 0:06:42 | sleeping for 0 seconds
[2026-04-06T14:16:04.822491] [██████░░░░░░░░░░░░░░] 34.6% | 1,740,000/5,029,703 | batch: 10,000 docs in 23.4s | rate: 7431 docs/s | ETA: 0:07:22 | sleeping for 0 seconds
[2026-04-06T14:16:28.102087] [██████░░░░░░░░░░░░░░] 34.8% | 1,750,000/5,029,703 | batch: 10,000 docs in 23.3s | rate: 6798 docs/s | ETA: 0:08:02 | sleeping for 0 seconds

Changes

Error Handling

  • Added get_mongo_client() helper with retry logic (up to 3 attempts, 5s delay) used everywhere a connection is needed
  • Added validate_connection() called at startup to verify the URI is reachable and the target database/collection exist before any work begins
  • Fixed an UnboundLocalError crash in setup() when the collection is empty
  • Wrapped all MongoDB operations in setup() and task_worker() with try/except, with reconnect logic in the worker's batch loop
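The retry behaviour described above (up to 3 attempts, 5s delay) could look roughly like the sketch below. This is not the PR's actual `get_mongo_client()`; the name `connect_with_retry`, the injectable `connect` callable, and the `sleep` parameter are assumptions made so the example runs without a database. In the real tool, `connect` would presumably build a `pymongo.MongoClient` and ping the server, and the `except` clause would likely catch `pymongo.errors.PyMongoError` instead of `Exception`.

```python
import time


def connect_with_retry(connect, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up.

    connect is a hypothetical zero-argument callable that returns a client
    or raises on failure. sleep is injectable so tests need not wait.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:  # a real tool would catch a narrower error type
            last_error = exc
            if attempt < attempts:
                # Wait before the next attempt, but not after the final one.
                sleep(delay_seconds)
    raise ConnectionError(
        f"could not connect after {attempts} attempts"
    ) from last_error
```

Raising a single `ConnectionError` at the end keeps the caller's error handling simple while still chaining the last underlying failure for the log.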

Code Quality

  • Removed unused imports: threading, string
  • Removed redundant variable assignments at the top of task_worker()
  • Added --append-log flag: when it is set, new output is appended to the existing log file; when it is omitted, the log file is still deleted on startup as before

Observability

  • Replaced all bare print() calls with printLog() for consistent output to both stdout and the log file
  • Added per-batch ASCII progress bar, elapsed time, docs/sec rate, and ETA after each batch
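For illustration, a progress line in the style of the Testing output can be assembled as in the sketch below. The function name `format_progress` and the 20-character bar width are assumptions inferred from the sample log, not taken from the PR's code. The ETA here is simply the remaining documents divided by the supplied rate, truncated to whole seconds, which is consistent with the sample lines but may not be exactly how the tool computes it.

```python
from datetime import timedelta

BAR_WIDTH = 20  # assumed from the sample log output


def format_progress(done, total, rate):
    """Return a one-line progress summary: bar, percent, counts, rate, ETA."""
    fraction = min(done / total, 1.0) if total else 1.0
    filled = int(fraction * BAR_WIDTH)  # round down to whole bar cells
    bar = "█" * filled + "░" * (BAR_WIDTH - filled)
    remaining = max(total - done, 0)
    if rate > 0:
        eta = str(timedelta(seconds=int(remaining / rate)))
    else:
        eta = "unknown"
    return (
        f"[{bar}] {fraction * 100:.1f}% | {done:,}/{total:,} "
        f"| rate: {rate:.0f} docs/s | ETA: {eta}"
    )


print(format_progress(1_650_000, 5_029_703, 73_697))
```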

Correctness

  • The dummy field used to trigger compression is now automatically removed from each document immediately after each batch via a second bulk_write with $unset
  • Added --skip-cleanup flag for cases where removing the field is not required
  • Each tracker entry now includes a cleanupComplete boolean
  • On successful completion the tracker collection is automatically dropped
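The two-pass update described above can be sketched as follows. The field name `_compressionDummy`, the value written to it, and the function name `build_batch_updates` are hypothetical; the PR text does not state them. The sketch only builds the (filter, update) pairs as plain Python data so it runs without a database. With pymongo, each pair would typically be wrapped in `pymongo.UpdateOne(filter, update)` and each list passed to `collection.bulk_write(...)`.

```python
DUMMY_FIELD = "_compressionDummy"  # hypothetical field name, for illustration only


def build_batch_updates(doc_ids, skip_cleanup=False):
    """Build the update pairs for one batch.

    Returns (set_ops, unset_ops). The first list adds the dummy field so each
    document is rewritten; the second removes it again. With skip_cleanup the
    second list is empty, mirroring the --skip-cleanup flag.
    """
    set_ops = [({"_id": doc_id}, {"$set": {DUMMY_FIELD: 1}}) for doc_id in doc_ids]
    if skip_cleanup:
        unset_ops = []
    else:
        unset_ops = [
            ({"_id": doc_id}, {"$unset": {DUMMY_FIELD: ""}}) for doc_id in doc_ids
        ]
    return set_ops, unset_ops


set_ops, unset_ops = build_batch_updates([1, 2, 3])
```

A tracker entry could then record `cleanupComplete` as true only once the second list has been applied successfully.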

README

  • Updated to reflect all new flags: --append-log, --skip-cleanup
  • Added notes on automatic dummy field cleanup and tracker collection drop behaviour
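As a rough sketch of how the two new flags might be declared, assuming the tool uses `argparse` (the PR text does not say so), both can be simple boolean switches. The helper `log_open_mode` is hypothetical, and opening with mode `"w"` stands in for the delete-on-startup behaviour described earlier.

```python
import argparse


def build_parser():
    """Declare only the two flags named in this PR; other options are omitted."""
    parser = argparse.ArgumentParser(description="compression updater (sketch)")
    parser.add_argument(
        "--append-log",
        action="store_true",
        help="append to the existing log file instead of starting a new one",
    )
    parser.add_argument(
        "--skip-cleanup",
        action="store_true",
        help="leave the dummy compression field in place after each batch",
    )
    return parser


def log_open_mode(append_log):
    # "a" keeps earlier log content; "w" starts the log from scratch.
    return "a" if append_log else "w"


args = build_parser().parse_args(["--append-log"])
print(args.append_log, args.skip_cleanup, log_open_mode(args.append_log))
```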

🤖 Generated with Claude Code

… compression field, drop tracker collection and more

- Add connection retry logic and startup validation
- Fix UnboundLocalError on empty collection in setup()
- Replace bare print() calls with printLog() throughout
- Add per-batch progress bar, elapsed time, rate, and ETA logging
- Auto-remove dummy compression field after each batch via $unset
- Add --skip-cleanup and --append-log flags
- Drop tracker collection on successful completion
- Remove dead multiprocessing queue code and unused imports
@NestorRV NestorRV changed the title from "Improve document-compression-updater: error handling, observability, and correctness" to "feat: [document-compression-updater]- Improvements: Auto-remove dummy compression field, drop tracker collection and more" on Apr 1, 2026
NestorRV and others added 3 commits April 6, 2026 15:09
…ent incompatibility

Replace dict arguments in .sort() calls with list-of-tuples to support pymongo 3.x.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
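To illustrate the `.sort()` change in the commit above: a list of `(key, direction)` pairs is the form pymongo's `Cursor.sort` has long accepted, so converting a dict spec to that form is a small step. The helper name `to_sort_pairs` is an assumption for this sketch and does not appear in the PR.

```python
def to_sort_pairs(sort_spec):
    """Normalise a sort spec to a list of (key, direction) tuples.

    Accepts either a dict such as {"_id": 1} or an iterable of pairs.
    Dict insertion order is preserved, so key priority stays the same.
    """
    if isinstance(sort_spec, dict):
        return list(sort_spec.items())
    return [tuple(pair) for pair in sort_spec]


print(to_sort_pairs({"_id": 1}))
```

The result could then be passed as `cursor.sort(to_sort_pairs(spec))`.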
… output

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@tmcallaghan tmcallaghan left a comment

Please review and address comments. Always log when possible. Also, please avoid whitespace beautification (adding/removing spaces) unless you are also modifying that line of code for the purpose of the PR.

NestorRV and others added 5 commits April 7, 2026 10:27
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Author

NestorRV commented Apr 7, 2026

@tmcallaghan - I've addressed all your comments here! I've run the tool locally and it's still working fine. Thanks!

@NestorRV NestorRV requested a review from tmcallaghan April 7, 2026 10:17