[TE-5476] Add bktec backfill-commit-metadata subcommand (#471)
Conversation
Force-pushed from 5f60e10 to 570aed5.
Add the git abstraction layer for the backfill-commit-metadata command:
- GitRunner interface with ExecGitRunner production implementation for testability via dependency injection
- DetectDefaultBranch with origin/HEAD -> origin/main -> origin/master fallback chain
- 3-strategy fork-point detection (--fork-point, mainline parent fallback with MainlineCache, plain merge-base) ported from reporummage
- FetchBulkMetadata using git log --no-walk --stdin with ASCII unit/record separators for single-call bulk extraction
- FilterExistingCommits via git cat-file --batch-check for efficient missing commit detection
- CollectDiffs with 10-goroutine worker pool for concurrent per-commit diff collection (--name-only, --numstat, full diff, --raw)
- Comprehensive unit tests using FakeGitRunner for all components
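A minimal sketch of the abstraction this commit describes, assuming simplified method names and fallback logic (the real interface in internal/git has more methods and takes a remote parameter in later commits):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// GitRunner abstracts git invocation so every component can be exercised
// with a fake in tests (dependency injection).
type GitRunner interface {
	Output(args ...string) (string, error)
}

// ExecGitRunner is the production implementation: it shells out to git.
type ExecGitRunner struct{}

func (ExecGitRunner) Output(args ...string) (string, error) {
	out, err := exec.Command("git", args...).Output()
	return strings.TrimSpace(string(out)), err
}

// FakeGitRunner returns canned responses keyed by the joined argument list.
type FakeGitRunner struct {
	Responses map[string]string
}

func (f FakeGitRunner) Output(args ...string) (string, error) {
	if out, ok := f.Responses[strings.Join(args, " ")]; ok {
		return out, nil
	}
	return "", fmt.Errorf("no fake response for %q", strings.Join(args, " "))
}

// DetectDefaultBranch follows the origin/HEAD -> origin/main -> origin/master
// fallback chain described above.
func DetectDefaultBranch(r GitRunner) (string, error) {
	if ref, err := r.Output("symbolic-ref", "refs/remotes/origin/HEAD"); err == nil {
		return strings.TrimPrefix(ref, "refs/remotes/"), nil
	}
	for _, branch := range []string{"origin/main", "origin/master"} {
		if _, err := r.Output("rev-parse", "--verify", branch); err == nil {
			return branch, nil
		}
	}
	return "", fmt.Errorf("could not detect default branch")
}

func main() {
	fake := FakeGitRunner{Responses: map[string]string{
		"symbolic-ref refs/remotes/origin/HEAD": "refs/remotes/origin/main",
	}}
	branch, _ := DetectDefaultBranch(fake)
	fmt.Println(branch)
}
```

Because the fallback logic only sees the interface, the unit tests can drive all three branches without a real repository.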
Add JSONL + tar.gz packaging for commit metadata export:
- CommitRecord type with 14 fields matching the export schema (schema_version, commit_sha, parent_shas, author/committer info, message, diffs)
- ArchiveMetadata type for the metadata.json manifest
- CreateTarball writes commit-metadata.jsonl and metadata.json to a gzipped tar archive in a temp file, using stdlib archive/tar and compress/gzip
- GitDiff and GitDiffRaw use omitempty for smaller output with --skip-diffs
- Unit tests covering structure, JSONL content, metadata, empty records, omitempty behaviour, and schema version
Add presigned S3 POST upload using multipart form encoding:
- PresignedUploadForm type matching the Buildkite API response shape
- UploadToS3 streams file content via io.Pipe to avoid buffering the entire tarball in memory
- Form data fields are sent before the file field (an S3 requirement)
- Fields sent in deterministic order for testability
- No auth headers needed (presigned form fields handle S3 authorisation)
- Unit tests with an httptest server verifying field presence, file content, field ordering, server errors, and missing file handling
Add FetchCommitList and PresignUpload API client methods, plus config changes for the backfill-commit-metadata command:
- FetchCommitList requests text/plain from the commits endpoint (TE-5489) and parses newline-delimited SHAs. Bypasses DoWithRetry to set a custom Accept header and handle text responses directly.
- PresignUpload requests a presigned S3 POST form from the backfill upload endpoint (TE-5441). Uses DoWithRetry for standard JSON handling.
- Add Days, Output, and SkipDiffs fields to the Config struct
- Add ValidateForBackfillCommitMetadata with minimal validation (only API connection fields required, no runner/parallelism/build env)
Add the backfill-commit-metadata subcommand, hidden behind the BKTEC_ENABLE_BACKFILL=1 env var (following the BKTEC_PREVIEW_SELECTION pattern):
- BackfillCommitMetadata orchestrates the full pipeline: fetch commit list from API, detect default branch, build mainline cache, filter existing commits, bulk-fetch metadata, collect diffs concurrently, assemble JSONL records, package as tar.gz, and upload via presigned S3
- CLI flags: --skip-diffs, --output, --days (reuses existing auth flags)
- buildCommands() conditionally registers the command when the env var is set
- ValidateForBackfillCommitMetadata used for minimal config validation
- Progress reporting to stderr (every 100 commits + completion)
- Tests covering happy path, skip-diffs, no commits, all commits missing, API error, days parameter, and S3 upload flow
S3 presigned POST uploads require a Content-Length header. The previous io.Pipe streaming approach produced a body with unknown length, causing S3 to reject the request with 411 MissingContentLength. Switch from io.Pipe to bytes.Buffer so the HTTP client can compute and set Content-Length automatically. This buffers the tarball in memory which is acceptable for backfill uploads (typically tens of MB).
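The bytes.Buffer approach can be sketched as follows; the function and form-field names are illustrative, not the actual internal/upload API:

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"sort"
)

// buildUploadRequest assembles the presigned POST body in a bytes.Buffer so
// net/http can compute Content-Length itself (S3 rejects bodies of unknown
// length with 411 MissingContentLength).
func buildUploadRequest(url string, fields map[string]string, filename string, content []byte) (*http.Request, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)

	// S3 requires all form fields to precede the file field; sorting the
	// keys keeps the order deterministic for tests.
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if err := w.WriteField(k, fields[k]); err != nil {
			return nil, err
		}
	}

	fw, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := fw.Write(content); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}

	// *bytes.Buffer is one of the body types for which http.NewRequest
	// sets ContentLength automatically; an io.Pipe reader is not.
	req, err := http.NewRequest(http.MethodPost, url, &buf)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, _ := buildUploadRequest("https://bucket.example.com",
		map[string]string{"key": "abc", "policy": "xyz"}, "data.tar.gz", []byte("tarball bytes"))
	fmt.Println(req.ContentLength > 0)
}
```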
Remove the os.Rename-with-fallback pattern and always copy the temp tarball to the output path. The rename only saved a syscall when src and dst were on the same filesystem, which is the common case but not worth the extra code path.
…SHAs from filter

Update the internal/git package with three related changes:
- FilterExistingCommits now returns the missing SHA list ([]string) instead of just a count, so callers can pass the list to FetchMissingCommits
- DetectDefaultBranch accepts a remote parameter instead of hardcoding 'origin', enabling --remote flag support for non-standard remotes
- Add FetchMissingCommits with chunked fetching (1000 per batch) and recursive bisect-on-error to isolate unfetchable SHAs (ported from reporummage's fetchCommitsInChunks + fetchCommitsBisectOnError)
- Add TestDetectDefaultBranch_CustomRemote and four FetchMissingCommits tests (all succeed, bisect on error, all unfetchable, empty list)
Wire the new git package changes into the CLI and command orchestration:
- Add Remote config field and --remote CLI flag (default 'origin', env var BUILDKITE_TEST_ENGINE_BACKFILL_REMOTE)
- Pass cfg.Remote to DetectDefaultBranch
- Insert a fetch-missing step between filter and mainline cache: after the initial filter, attempt to fetch missing commits from the remote, then re-filter to pick up newly available commits
- Update SkippedCommits in archive metadata to use len(missingCommits)
- Restore backfillEnabled() and the BKTEC_ENABLE_BACKFILL env var gating that was lost during unstaged file restoration
After writing the tar.gz, rename the temp file from the random os.CreateTemp suffix to include a UTC timestamp (e.g. bktec-commit-metadata-20260402T100000Z.tar.gz). This makes leftover temp files identifiable if cleanup is skipped. Falls back to the original temp path if the rename fails.
Add workerCount parameter to CollectDiffs instead of using a hardcoded constant. Export DefaultWorkerCount (10) for callers that want the standard pool size. This allows tuning concurrency for environments with limited resources or when processing very large commit lists.
Add --concurrency flag (default 10, env BUILDKITE_TEST_ENGINE_BACKFILL_CONCURRENCY) to control the number of concurrent git operations during diff collection. Useful for tuning on resource-constrained environments or when processing very large commit lists.
The map stores each commit's first parent only, not multiple parents. Rename the field to singular to accurately reflect the 1:1 mapping.
Move FakeGitRunner from git_test.go to internal/git/fake_runner.go (non-test file under internal/) so it can be imported by tests in other packages. Remove the duplicate fakeGitRunner from the command tests and use git.FakeGitRunner directly.
De-duplicate Output and OutputWithStdin by extracting a common run method that handles command setup, debug logging, execution, and error formatting. The only difference (stdin) is passed as an io.Reader parameter (nil for Output, strings.NewReader for OutputWithStdin). Also adds the package-level doc comment for the git package.
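A sketch of the de-duplication, assuming simplified signatures (the real method likely also handles debug logging as described):

```go
package main

import (
	"fmt"
	"io"
	"os/exec"
	"strings"
)

// ExecGitRunner sketch: Output and OutputWithStdin share one private run
// method; the only difference is the stdin reader.
type ExecGitRunner struct{}

func (r ExecGitRunner) run(stdin io.Reader, args ...string) (string, error) {
	cmd := exec.Command("git", args...)
	cmd.Stdin = stdin // nil for Output: no stdin attached
	out, err := cmd.Output()
	if err != nil {
		// Shared error formatting includes the full argument list.
		return "", fmt.Errorf("git %s: %w", strings.Join(args, " "), err)
	}
	return string(out), nil
}

func (r ExecGitRunner) Output(args ...string) (string, error) {
	return r.run(nil, args...)
}

func (r ExecGitRunner) OutputWithStdin(stdin string, args ...string) (string, error) {
	return r.run(strings.NewReader(stdin), args...)
}

func main() {
	// A bogus subcommand fails whether or not git is installed; run wraps
	// the error with the argument list either way.
	_, err := ExecGitRunner{}.Output("definitely-not-a-subcommand")
	fmt.Println(err != nil)
}
```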
Replace the dense single-line format string with named constants for each git format placeholder (fmtHash, fmtAuthorN, etc.) joined by the field separator. Each placeholder is documented with its meaning, making the field order and separators self-evident.
Export metadataFormat as MetadataFormat so tests in other packages can reference the single source of truth. Remove the hardcoded copy (testMetadataFormat) from the command tests.
Force-pushed from 21d1044 to 9d19c16.
…ELECTION

Replace the BKTEC_ENABLE_BACKFILL env var gating with the existing BKTEC_PREVIEW_SELECTION env var. The backfill-commit-metadata command is always registered but hidden from bktec --help unless BKTEC_PREVIEW_SELECTION is enabled. Users who know about it can invoke it directly regardless of the env var.

Remove backfillEnvVar, backfillEnabled(), and buildCommands(). The command list is now a static declaration on the cliCommand var.
Extend ArchiveMetadata with fields reflecting the export configuration and data provenance:
- Days, Remote, SkippedDiffs: record the config options used, so the downstream pipeline knows the parameters of the export
- MinCommitDate, MaxCommitDate: ISO 8601 date range of the commits in the archive, computed from CommitterDate during record assembly (lexicographic comparison works for ISO 8601 strings)

These help the ingestion pipeline understand what data the archive contains without needing to scan the JSONL.
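The running min/max can be sketched as below; lexicographic string comparison is valid because ISO 8601 dates in a fixed UTC format are most-significant-first. The function name is illustrative:

```go
package main

import "fmt"

// trackDateRange keeps a running min/max of ISO 8601 committer dates.
// Empty string means "no date seen yet".
func trackDateRange(min, max, date string) (string, string) {
	if min == "" || date < min { // plain string comparison suffices
		min = date
	}
	if max == "" || date > max {
		max = date
	}
	return min, max
}

func main() {
	var min, max string
	for _, d := range []string{
		"2026-04-02T10:00:00Z",
		"2026-01-15T09:30:00Z",
		"2026-03-01T23:59:59Z",
	} {
		min, max = trackDateRange(min, max, d)
	}
	fmt.Println(min, max)
}
```

Note this shortcut assumes all dates share one timezone representation; mixed offsets would need parsing with time.Parse instead.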
Fast-fail with a clear error if the API token is missing required scopes, rather than spending minutes on git metadata collection only to get a 403 at upload time.
- Add VerifyTokenScopes via GET /v2/access-token, checking for read_suites (commit list) and write_suites (presigned upload)
- When --output is set, only require read_suites since upload is skipped; downgrade missing write_suites to a stderr warning
- Scope check runs immediately after API client creation (step 2), before any git commands
- Tests: _ScopeCheckFails (missing write_suites without --output returns error), _ScopeCheckWarnsWithOutput (missing write_suites with --output warns but succeeds)
On presign or S3 upload failure, keep the temp tarball and print its path to stderr instead of deleting it. This avoids forcing the user to re-run the entire git metadata collection (which can take minutes) just to retry the upload step. The retained file can be uploaded later with the upload-commit-metadata command.
Move backfill-commit-metadata from a top-level hidden command to bktec tools backfill-commit-metadata using urfave/cli/v3 native command nesting. The tools parent is hidden from bktec --help unless BKTEC_PREVIEW_SELECTION is enabled, but is always invocable directly. This keeps the top-level namespace clean for run/plan and provides a natural home for future utility commands (e.g. upload-commit-metadata). Global flags like --debug propagate via cmd.Root().Bool().
Add a command to upload a previously generated commit metadata tarball to Buildkite, for cases where generation and upload happen in separate steps (e.g. air-gapped environments or split CI pipelines).
- UploadCommitMetadata command: verify file exists, check write_suites token scope, presign, upload to S3
- UploadFile config field and ValidateForUploadCommitMetadata validator (requires API connection fields + --file path, no suite slug needed)
- --file flag (required) under bktec tools upload-commit-metadata
- Shares the existing PresignUpload + UploadToS3 logic with the backfill command
- Tests: happy path (end-to-end with mock S3), file not found, scope check failure (missing write_suites)
Force-pushed from 9d19c16 to f18908a.
Output a human-friendly file size (bytes, KiB, or MiB) to stdout after creating the tarball, making it easy to see the export size at a glance or capture programmatically.
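A sketch of such a formatter, with illustrative thresholds and function name:

```go
package main

import "fmt"

// humanSize renders a byte count as B, KiB, or MiB (binary units, matching
// the bytes/KiB/MiB labels described above).
func humanSize(n int64) string {
	switch {
	case n < 1024:
		return fmt.Sprintf("%d B", n)
	case n < 1024*1024:
		return fmt.Sprintf("%.1f KiB", float64(n)/1024)
	default:
		return fmt.Sprintf("%.1f MiB", float64(n)/(1024*1024))
	}
}

func main() {
	fmt.Println(humanSize(512), humanSize(10*1024), humanSize(3*1024*1024))
}
```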
The return value from runner.Output was captured and immediately discarded. Use _ directly in the assignment instead.
The processed counter is only read and written from the single goroutine that ranges over resultCh, so atomics are unnecessary.
Replace the parenthetical referencing an internal tool name with a description of what the symbolic-ref call actually does.
Pass GitRunner as a parameter to BackfillCommitMetadata instead of using a package-level mutable global. Tests now pass the fake runner directly, removing the setGitRunnerFactory helper and the implicit coupling via shared mutable state. Also renumber step comments (4-12) after the removed step.
Remove the deferred out.Close() and close explicitly in both the error and success paths. Previously the file was closed twice on success, with the deferred close silently discarding any error.
Force-pushed from b672f29 to 26e7b52.
…support

Add a context.Context parameter to UploadToS3 and use http.NewRequestWithContext so the S3 upload can be cancelled or timed out externally.
Force-pushed from d6240b1 to b38adad.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b38adad311
I wonder if:

# all in one
bktec tools backfill-commit-metadata

# or, produce locally…
bktec tools backfill-commit-metadata --output data.tar.gz

# … and then upload:
bktec tools backfill-commit-metadata --upload data.tar.gz
…as --upload flag

Replace the separate upload-commit-metadata subcommand with a --upload flag on backfill-commit-metadata, per review feedback. The two operations are the same workflow (generate + upload) split across invocations, so a single command with mode dispatch is a better UX.

When --upload is set, BackfillCommitMetadata dispatches to a private uploadOnly helper that skips all git work and goes straight to scope verification, presign, and S3 upload. Validation branches on UploadFile: upload mode only requires API connection fields (no suite slug, days, concurrency).
- Add --upload flag to backfill-commit-metadata (replaces --file on the removed upload-commit-metadata subcommand)
- Remove the upload-commit-metadata subcommand, handler, validation method, flags, and dedicated command/test files
- Update ValidateForBackfillCommitMetadata to branch on UploadFile
- Add uploadOnly private function with the upload-only pipeline
- Move upload tests into the backfill test file (3 tests: happy path, file not found, scope check failure)
- Update docs and README to reference --upload instead of the separate command
I like this and it's been implemented in 3549938.
FetchCommitList and UploadToS3 had no per-request timeouts, so a stalled connection could block the command indefinitely.
- Add a 30s context timeout to FetchCommitList before the HTTP request
- Replace http.DefaultClient in UploadToS3 with a client using a 5 minute timeout (supports up to 1GB uploads at ~30 Mbps)
…llisions

The temp tarball is renamed to include a UTC timestamp for identification, but second-level granularity means concurrent runs started in the same second would collide. Add milliseconds to the format string to narrow the collision window.
Without args, … Setting those in env does work, but for a CLI tool I think the more conventional solution is to pass … These flags both work, but only one of them is mentioned in … So, ideally:
Nest commit-metadata.jsonl and metadata.json inside a directory named backfill-<org>-<suite>-<timestamp> (e.g. backfill-my-org-my-suite-20260402T100000.000Z/) instead of placing them at the tar root. Having files at the top level of a tar archive is considered bad practice as extracting without -C can pollute the working directory. The directory entry is written as an explicit tar.TypeDir header before the two file entries. The directory name includes org slug, suite slug, and a UTC timestamp for uniqueness across multiple archives. Tests updated to use suffix-based file lookup since the directory name includes a runtime timestamp. New TestCreateTarball_DirectoryStructure validates the directory entry type, naming convention, and file ordering.
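The nesting can be sketched with stdlib archive/tar, writing the explicit tar.TypeDir header first; helper names and file contents are illustrative:

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"strings"
)

type entry struct {
	name string
	data []byte
}

// writeArchive nests files under a single top-level directory so extracting
// without -C does not pollute the working directory.
func writeArchive(dir string, entries []entry) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)

	// Explicit directory entry written before the file entries.
	if err := tw.WriteHeader(&tar.Header{Name: dir + "/", Typeflag: tar.TypeDir, Mode: 0o755}); err != nil {
		return nil, err
	}
	for _, e := range entries {
		hdr := &tar.Header{Name: dir + "/" + e.name, Mode: 0o644, Size: int64(len(e.data))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(e.data); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// listEntries reads back the archive's entry names in order.
func listEntries(archive []byte) ([]string, error) {
	gz, err := gzip.NewReader(bytes.NewReader(archive))
	if err != nil {
		return nil, err
	}
	tr := tar.NewReader(gz)
	var names []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
	return names, nil
}

func main() {
	dir := "backfill-my-org-my-suite-20260402T100000.000Z"
	archive, _ := writeArchive(dir, []entry{
		{"commit-metadata.jsonl", []byte(`{"commit_sha":"abc"}` + "\n")},
		{"metadata.json", []byte(`{"schema_version":1}`)},
	})
	names, _ := listEntries(archive)
	fmt.Println(len(names), strings.HasSuffix(names[0], "/"))
}
```

This mirrors why the tests switched to suffix-based lookup: the directory component of each entry name contains a runtime timestamp.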
…rs.go

De-duplicate ReadTarball, FindTarEntry, and HasTarEntry, which were defined independently in both internal/packaging/tarball_test.go and internal/command/backfill_commit_metadata_test.go. Export them from internal/packaging/test_helpers.go (a non-test file under internal/, following the FakeGitRunner pattern in internal/git/) so both packages can import a single implementation.

Also simplifies the SkipDiffs command test by replacing a manual tar-reading loop with ReadTarball + FindTarEntry, removing the archive/tar and compress/gzip imports from that file.
…validation errors

The backfill command is CLI-first (run manually outside CI), but --organization-slug was hidden from --help because the shared flag variable has Hidden: true for the run/plan commands, where BUILDKITE_ORGANIZATION_SLUG is auto-populated by the Buildkite agent. Add a non-hidden backfillOrganizationSlugFlag under the TEST ENGINE category so it appears alongside --access-token and --suite-slug in bktec tools backfill-commit-metadata --help.

Update ValidateForBackfillCommitMetadata() error messages to reference flag names first (e.g. "--organization-slug / BUILDKITE_ORGANIZATION_SLUG must not be blank") instead of the env-var-only format. This is consistent with how --days and --concurrency are already validated in the same function, and more helpful for CLI users who may not know the env var names.

Add validation tests for missing org slug, suite slug, access token, and upload-only mode skipping suite slug. Update docs and README examples to show flag-based usage.
Now it is clear in …
jameshill
left a comment
Low risk — purely additive. New command is hidden behind BKTEC_PREVIEW_SELECTION and touches nothing in the existing run/split/plan paths. Config struct additions are all json:"-" so inert to existing customers. Happy for this to go in for internal usage.

Description
The test prediction training pipeline needs historical changeset data to train models. We don't have access to customer repos, so customers need a tool they can run themselves inside their own checkout.
This PR adds a new subcommand under bktec tools (hidden from bktec --help unless BKTEC_PREVIEW_SELECTION is enabled): bktec tools backfill-commit-metadata.

It collects historical git commit metadata from a customer's repository and uploads it to Buildkite via presigned S3. The pipeline:
1. Verify token scopes (read_suites + write_suites) via GET /v2/access-token -- fast-fail before expensive git work. When --output is set, missing write_suites downgrades to a warning.
2. Fetch the commit list from the API (GET .../suites/:suite/commits?days=N)
3. Detect the default branch via the <remote>/HEAD -> <remote>/main -> <remote>/master fallback
4. Filter to commits missing locally via git cat-file --batch-check
5. Fetch missing commits with chunked git fetch + recursive bisect-on-error to isolate unfetchable SHAs
6. Build the mainline cache (bounded by --since to match the lookback window) for 3-strategy fork-point detection (--fork-point, mainline parent fallback, plain merge-base)
7. Bulk-fetch commit metadata via git log --no-walk --stdin (single process)
8. Collect diffs concurrently (--name-only, --numstat, full diff, --raw)
9. Package as tar.gz inside a backfill-<org>-<suite>-<timestamp> directory: commit-metadata.jsonl + metadata.json (with config options and commit date range)
10. Upload via presigned S3 POST, or write locally with --output

On upload failure, the tarball is retained locally and the path is printed so the user can retry with
--upload.

--upload flag

The --upload <file> flag provides an upload-only mode that skips all git work and uploads a previously generated tarball directly. This supports workflows where generation and upload happen in separate steps (e.g. air-gapped environments, split CI pipelines, or retrying a failed upload).

Architecture
The implementation adds four new packages under internal/, all using stdlib only (no new module dependencies):
- internal/git/ -- GitRunner interface (with dependency injection), default branch detection, fork-point, bulk metadata, filter, fetch-missing, concurrent diff collection
- internal/packaging/ -- tar.gz archive creation with JSONL + metadata.json
- internal/upload/ -- presigned S3 POST with multipart form (context-aware for cancellation)
- internal/api/ -- FetchCommitList, PresignUpload, VerifyTokenScopes client methods

Context
Resolves TE-5476
Depends on:
Changes
- internal/git package: GitRunner interface with ExecGitRunner production implementation and shared FakeGitRunner for tests, default branch detection with <remote>/HEAD fallback chain, 3-strategy fork-point detection with MainlineCache (bounded by --since to match the lookback window), bulk metadata fetch via git log --no-walk --stdin, local commit filtering via git cat-file --batch-check, chunked missing commit fetch with recursive bisect-on-error, concurrent diff collection with configurable worker pool and context cancellation
- internal/packaging package: tar.gz archive creation with files nested inside a backfill-<org>-<suite>-<timestamp> directory containing commit-metadata.jsonl (one JSON object per line) and metadata.json manifest with config options (days, remote, skip_diffs) and commit date range (min_commit_date, max_commit_date)
- internal/upload package: presigned S3 POST upload with buffered multipart form encoding (Content-Length required by S3), accepts context.Context for cancellation/timeout support
- API client methods: FetchCommitList (text/plain content negotiation for newline-delimited SHAs, URL path segments escaped), PresignUpload (presigned S3 POST form, URL path segments escaped), VerifyTokenScopes (fast-fail scope check via GET /v2/access-token)
- bktec tools backfill-commit-metadata subcommand: full pipeline from commit list fetch through to S3 upload, with --days, --remote, --skip-diffs, --output, --upload, --concurrency flags
- --upload flag dispatches to an upload-only path (scope check, presign, upload) -- replaces the former separate upload-commit-metadata subcommand
- Config fields (Days, Remote, Concurrency, SkipDiffs, Output, UploadFile) and validation that branches on --upload (upload mode only requires API connection fields)
- tools parent command visibility gated on BKTEC_PREVIEW_SELECTION (always invocable, hidden from --help unless the env var is set)
- Token scope check for read_suites + write_suites, downgrading missing write_suites to a warning when --output is set
- On upload failure, the tarball is retained for retry with --upload
- docs/commit-metadata-backfill.md guide and README section
- GitRunner passed as a parameter, not a global
- Non-hidden --organization-slug for the backfill command (visible in --help under the TEST ENGINE category) since the command is CLI-first, not CI-only like run/plan
- Flag-first error messages in ValidateForBackfillCommitMetadata() (e.g. --organization-slug / BUILDKITE_ORGANIZATION_SLUG must not be blank) instead of the env-var-only format

Verification
Backed by specs. AI assisted.
Verified working e2e locally.
Alternatively, it can be invoked without env vars:
Deployment
Low risk. The command is hidden behind the tools parent command, which is only visible when BKTEC_PREVIEW_SELECTION is enabled. The command is always registered (invocable directly) but not discoverable without the env var. No changes to existing run or plan behaviour.

Rollback
Yes