feat: Add `collect batch` command #59

Snider · 2026-02-02T00:40:55Z

This change introduces a new collect batch command that allows users to collect multiple URLs from a file, stdin, or a JSON registry. The command supports parallel downloads, rate limiting, and progress reporting, providing a powerful and efficient way to collect a large number of resources.

Fixes #23

PR created automatically by Jules for task 10729746708703034073 started by @Snider

This commit introduces a new `collect batch` command that allows users to collect multiple URLs from a file, stdin, or a JSON registry. The command supports the following features: - Parallel downloads with a configurable number of workers. - Rate limiting with a configurable delay between requests. - The ability to skip already downloaded files. - Progress reporting with a progress bar. - Reading URLs from a file, stdin, or a JSON file with a `jq` filter. Co-authored-by: Snider <631881+Snider@users.noreply.github.com>

google-labs-jules · 2026-02-02T00:40:56Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

gemini-code-assist · 2026-02-02T00:41:11Z

Summary of Changes

Hello @Snider, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the tool's data collection capabilities by introducing a new collect batch command. This command provides a robust and efficient way to download multiple resources, offering flexibility in how URLs are provided (files, stdin, or JSON with jq filtering). It also incorporates essential features for managing large-scale collections, such as parallel processing, configurable delays for rate limiting, and visual progress feedback, along with the ability to resume interrupted operations.

Highlights

New collect batch command: Introduced a new command collect batch under the collect subcommand, enabling the collection of multiple URLs.
Flexible URL input: The command supports reading URLs from a specified file, standard input, or by applying a jq filter to JSON input.
Parallel downloads and rate limiting: Resources can be downloaded concurrently using the --parallel flag, and a --delay flag allows for rate limiting between requests.
Progress reporting and resume functionality: A progress bar is displayed for interactive sessions, and the --continue flag allows skipping already downloaded files.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-02-02T00:41:13Z

Warning

Rate limit exceeded

@Snider has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 52 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Walkthrough

A new batch collection command is introduced, enabling concurrent downloads of URLs from files, JSON input with jq filtering, or stdin. The implementation includes configurable parallelism, rate limiting, file deduplication, and progress tracking.

Changes

Cohort / File(s)	Summary
Batch Collection Command `cmd/collect_batch.go`	Implements a new Cobra command for batch URL collection. Reads URLs from files or stdin with optional jq filtering, validates and normalises filenames, performs concurrent HTTP downloads with configurable worker pools, delay between requests, and file skip logic. Includes progress bar visualisation and comprehensive error handling.
Test Suite `cmd/collect_batch_test.go`	Comprehensive test coverage for batch collection including URL parsing (plain text and JSON with jq), filename derivation edge cases, parallel and sequential download workflows, stdin handling, file skipping with the `--continue` flag, delay enforcement, and error cases for invalid inputs.
Dependencies `go.mod`	Adds indirect dependencies: `github.com/itchyny/gojq v0.12.18` and `github.com/itchyny/timefmt-go v0.1.7` to support jq expression evaluation.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Command
    participant Parser
    participant URLPool
    participant Worker
    participant HTTPClient
    participant FileSystem
    participant ProgressBar

    User->>Command: Run collect batch (file/stdin + flags)
    Command->>Parser: Read and parse URLs
    Parser-->>Command: URLs list
    Command->>ProgressBar: Initialise progress (if TTY)
    Command->>URLPool: Create worker pool (N workers)
    
    loop For each URL
        Command->>URLPool: Send URL to channel
    end
    
    par Worker Processing
        Worker->>HTTPClient: GET URL
        HTTPClient-->>Worker: Response body
        Worker->>FileSystem: Create/check file
        Worker->>FileSystem: Write response to disk
        Worker->>ProgressBar: Update progress
    and Worker Processing
        Worker->>HTTPClient: GET URL
        HTTPClient-->>Worker: Response body
        Worker->>FileSystem: Create/check file
        Worker->>FileSystem: Write response to disk
        Worker->>ProgressBar: Update progress
    end
    
    Command->>URLPool: Wait for completion
    URLPool-->>Command: All downloads finished
    Command-->>User: Report results

Poem

🐰 Hops of joy for URLs by batch,
Workers racing—what a match!
From jq filters to files in flight,
Progress bars glowing in the night! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main addition of a new batch collection command.
Description check	✅ Passed	The description accurately relates to the changeset, explaining the new collect batch command with its key features including parallel downloads, rate limiting, and progress reporting.
Linked Issues check	✅ Passed	All coding requirements from issue `#23` are addressed: URLs from text files, JSON with jq filter, stdin support, parallel downloads (--parallel flag), rate limiting (--delay flag), and progress reporting are implemented.
Out of Scope Changes check	✅ Passed	All changes are scoped to implementing the batch collection command with supporting tests and dependencies, with no extraneous modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat-collect-batch-10729746708703034073

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a collect batch command, which is a great addition for downloading multiple files. The implementation correctly handles parallel downloads, input from files or stdin, and JSON parsing with jq. The test coverage is also good.

I've found a few issues to address:

A critical race condition in concurrent file downloads that can lead to corrupted data.
The directory creation uses overly permissive file modes.
The rate-limiting implementation isn't effective for parallel downloads.
Potential for high memory usage when parsing large JSON files.

Details and suggestions are in the specific comments. After addressing these points, the feature will be much more robust and efficient.

gemini-code-assist · 2026-02-02T00:43:24Z

cmd/collect_batch.go

+	return collectBatchCmd
+}
+
+func downloadURL(cmd *cobra.Command, u, outputDir string, skipExisting bool, delayDuration time.Duration, bar *progressbar.ProgressBar, outMutex *sync.Mutex) {


There is a critical race condition in this function. When multiple parallel workers try to download URLs that map to the same file path, they can interfere with each other, leading to corrupted files. For example, one worker might be writing to a file while another truncates it by calling os.Create.

To fix this, you should serialize all file operations for a given path. You can achieve this using a sync.Map to hold a mutex for each file path.

Add this package-level variable:
var fileLocks sync.Map

Add locking logic inside downloadURL:
func downloadURL(...) { // ... after getting filePath filePath := filepath.Join(outputDir, fileName) mu, _ := fileLocks.LoadOrStore(filePath, &sync.Mutex{}) mu.(*sync.Mutex).Lock() defer mu.(*sync.Mutex).Unlock() // ... rest of the original function body }

cmd/collect_batch.go

gemini-code-assist · 2026-02-02T00:43:24Z

cmd/collect_batch.go

+	if delayDuration > 0 {
+		time.Sleep(delayDuration)
+	}


The current implementation of --delay is not ideal for rate-limiting parallel downloads. It causes each worker to sleep independently, leading to request bursts rather than a steady global rate.
A better approach is to use a global rate limiter that all workers respect. This time.Sleep should be removed, and a rate-limiting mechanism (e.g., using time.Ticker) should be added in the worker dispatch loop in RunE.

Example:

// In RunE: var limiter <-chan time.Time if delayDuration > 0 { ticker := time.NewTicker(delayDuration) defer ticker.Stop() limiter = ticker.C } // ... // In worker goroutine: for u := range urlsChan { if limiter != nil { <-limiter } downloadURL(...) // ... }

gemini-code-assist · 2026-02-02T00:43:24Z

cmd/collect_batch.go

+
+	var input interface{}
+	decoder := json.NewDecoder(reader)
+	if err := decoder.Decode(&input); err != nil {


decoder.Decode(&input) reads the entire JSON content into memory before processing. This can lead to high memory consumption and performance issues with very large JSON files. Consider using a streaming JSON parser if large inputs are expected, to process the data more efficiently without loading it all into memory at once.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@cmd/collect_batch_test.go`:
- Around line 180-203: The test reuses the same Cobra command instance `cmd`
between runs which can retain flag/state; to fix, instantiate a fresh command
for the second run by calling `NewCollectBatchCmd()` again (e.g., newCmd :=
NewCollectBatchCmd()) and then call `newCmd.SetOut(&out)`,
`newCmd.SetArgs([]string{urlsFile, "--output-dir", outputDir, "--continue"})`
and `newCmd.Execute()` so flags/state do not leak from the first invocation of
`cmd`.

In `@cmd/collect_batch.go`:
- Around line 131-136: Replace the use of http.Get(...) with an http.Client that
has a sane Timeout to avoid hanging; use that client to perform the GET for URL
variable u, capture resp and err as you do now, and on error call
logMessage(cmd, fmt.Sprintf(...), bar, outMutex) and return; after a successful
request check resp.StatusCode and treat any status >= 400 as an error
(logMessage and return) before proceeding to read/write the body; ensure you
still defer resp.Body.Close() after confirming resp is non-nil.

🧹 Nitpick comments (4)

cmd/collect_batch.go (3)
70-72: Consider using more restrictive directory permissions.

os.ModePerm (0777) grants full permissions to all users. Typically, 0755 is preferred for directories to restrict write access.
♻️ Suggested change
-		if err := os.MkdirAll(outputDir, os.ModePerm); err != nil {
+		if err := os.MkdirAll(outputDir, 0755); err != nil {
212-224: URL query strings are preserved in filenames.

filepath.Base(parsedURL.Path) does not account for query strings in the raw URL if they were somehow included in the path, and more importantly, parsedURL.RawQuery content is ignored. While url.Parse separates query strings correctly, consider using path.Base for URL paths and stripping query parameters explicitly for robustness.

Additionally, multiple URLs with the same filename (e.g., different domains) would overwrite each other silently.
♻️ Suggested improvement
+import "path"
+
 func getFileNameFromURL(rawURL string) (string, error) {
 	parsedURL, err := url.Parse(rawURL)
 	if err != nil {
 		return "", err
 	}
 	if parsedURL.Scheme != "http" && parsedURL.Scheme != "https" {
 		return "", fmt.Errorf("invalid URL scheme: %s", parsedURL.Scheme)
 	}
 	if parsedURL.Path == "" || parsedURL.Path == "/" {
 		return "index.html", nil
 	}
-	return filepath.Base(parsedURL.Path), nil
+	return path.Base(parsedURL.Path), nil
 }
Consider documenting the filename collision behaviour or adding hostname-based prefixing as a future enhancement.

205-208: Non-string jq results are silently ignored.

When the jq filter returns non-string values (e.g., numbers, objects), they are silently skipped. This could be confusing if users misconfigure their filter. Consider logging a warning or returning an error for unexpected types.
♻️ Optional: warn on non-string values
 	for {
 		v, ok := iter.Next()
 		if !ok {
 			break
 		}
 		if err, ok := v.(error); ok {
 			return nil, fmt.Errorf("error executing jq filter: %w", err)
 		}
 		if s, ok := v.(string); ok {
 			urls = append(urls, s)
+		} else if v != nil {
+			// Optionally log: non-string value ignored
 		}
 	}
cmd/collect_batch_test.go (1)
164-165: Pipe creation error is ignored.

The error from os.Pipe() is discarded. Whilst unlikely to fail in tests, checking the error improves robustness.
♻️ Suggested change
-		r, w, _ := os.Pipe()
+		r, w, err := os.Pipe()
+		if err != nil {
+			t.Fatalf("failed to create pipe: %v", err)
+		}

cmd/collect_batch_test.go

cmd/collect_batch.go

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

This commit resolves the CI failure by correcting a typo in the `downloadURL` function. The `httpClient.Get(u)` call was replaced with the correct `http.Get(u)` call, and an unused import was removed.

google-labs-jules bot mentioned this pull request Feb 2, 2026

feat: Batch collection from URL list file #23

Open

6 tasks

gemini-code-assist bot reviewed Feb 2, 2026

View reviewed changes

coderabbitai bot requested changes Feb 2, 2026

View reviewed changes

cmd/collect_batch_test.go Show resolved Hide resolved

cmd/collect_batch.go Show resolved Hide resolved

Copilot AI mentioned this pull request Feb 2, 2026

[WIP] Combine multiple PRs into a single squash commit #112

Merged

9 tasks

Snider and others added 3 commits February 2, 2026 06:44

Update cmd/collect_batch.go

0521d83

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Update cmd/collect_batch.go

2f45e58

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update cmd/collect_batch_test.go

ef08cdf

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

coderabbitai bot approved these changes Feb 2, 2026

View reviewed changes

fix: Resolve CI failure

dad5cd2

This commit resolves the CI failure by correcting a typo in the `downloadURL` function. The `httpClient.Get(u)` call was replaced with the correct `http.Get(u)` call, and an unused import was removed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add `collect batch` command #59

feat: Add `collect batch` command #59

Uh oh!

Snider commented Feb 2, 2026

Uh oh!

google-labs-jules bot commented Feb 2, 2026

Uh oh!

gemini-code-assist bot commented Feb 2, 2026

Uh oh!

coderabbitai bot commented Feb 2, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Poem

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Uh oh!

Uh oh!

gemini-code-assist bot Feb 2, 2026

Uh oh!

gemini-code-assist bot Feb 2, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: Add collect batch command #59

Are you sure you want to change the base?

feat: Add collect batch command #59

Uh oh!

Conversation

Snider commented Feb 2, 2026

Uh oh!

google-labs-jules bot commented Feb 2, 2026

Uh oh!

gemini-code-assist bot commented Feb 2, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

coderabbitai bot commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Poem

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: Add `collect batch` command #59

feat: Add `collect batch` command #59

coderabbitai bot commented Feb 2, 2026 •

edited

Loading