
Commit 5c3cae4

brotli-compress .socket.facts.json on upload (#1291)
* refactor: brotli-compress .socket.facts.json on upload

Compress `.socket.facts.json` with brotli at upload time, just before `fetchCreateOrgFullScan` POSTs the multipart `/full-scans` body. Coana keeps writing plain JSON; the local readers (`extractTier1ReachabilityScanId`, `extractReachabilityErrors`) keep reading plain JSON; only the on-the-wire bytes between socket-cli and depscan change. depscan transparently decompresses at the api-v0 multipart ingest boundary in a coordinated change.

Why:
* Mono-repo `.socket.facts.json` files routinely exceed 60MB. On a representative simple-npm fixture, brotli compresses 150,748 bytes to 19,971 (~86.8% reduction); production mono-repos see a much larger absolute saving.
* Coana has a second consumer downstream - teaching coana to write `.br` would break it. Brotli on the wire-only path keeps coana's contract invariant.
* `tier1ReachabilityScanId` and reachability-error reporting still read plain JSON locally; no brotli round-trip on those paths.
* Compression is a transport detail owned by the upload site; cleanup is one `finally` block.

Implementation:
* `src/utils/coana.mts` - new exported `compressSocketFactsForUpload(scanPaths)` returning `{ paths, cleanup }`. Per `.socket.facts.json` entry, `mkdtempSync` a fresh `.socket-br-XXX/` directory inside the source's parent dir (NOT under `os.tmpdir()` - see below), `brotliCompressSync` bytes to `${td}/.socket.facts.json.br`, swap the path. Other paths pass through. Missing facts paths pass through. Cleanup is idempotent with `force: true`.
* `src/commands/scan/handle-create-new-scan.mts` - wraps the `fetchCreateOrgFullScan` call in `try { compress; upload; } finally { cleanup(); }`. Cleanup runs on success, throw, or any abort path.

Why the temp lives next to the source:
The SDK computes `path.relative(cwd, brPath)` for the multipart name field. depscan's multipart ingest (`addStreamEntry`) checks `containsTraversal(unixified)` and bails on any `..` segment. A tmpdir-based path resolves to `../../../var/folders/...`, gets dropped to `unmatchedFiles`, and the SocketFacts content never lands in the scan. Putting the temp inside `path.dirname(originalFactsPath)` produces a relative path like `.socket-br-XXX/.socket.facts.json.br` - traversal-free, ingests cleanly.

Tests:
* `src/utils/coana.test.mts` - 16 cases.
  - `compressSocketFactsForUpload` x 5: round-trip JSON via `brotliDecompressSync`, basename swap to `.socket.facts.json.br`, non-facts paths pass through, missing facts paths pass through, cleanup idempotent, mixed-entry order. Pins the contract that the temp lives in a subdir of the source's parent (traversal-free).
  - `extractTier1ReachabilityScanId` x 7: plain JSON, missing file, missing field, null, empty/blank, trim, numeric coercion.
  - `extractReachabilityErrors` x 4: extraction, missing file, no-components, no-inner-reachability.

This change requires the matching depscan multipart decompression patch on the receiving side; that change ships first.

* fix: write brotli .br as sibling of source, not in temp subdir

Previous form wrapped each `.br` in a per-scan `mkdtempSync` subdir under the source dir for concurrency isolation. That created a directory-handling asymmetry on depscan's side: a wire path of `dirA/.socket.facts.json.br` flattened to `.socket.facts.json` at root via depscan's basename-strip, while plain `dirA/.socket.facts.json` preserved the dir. depscan dropped the basename-strip in the corresponding PR; switch to writing `.br` as a sibling of the source so wire and storage paths match for both branches.

Net effect: a brotli upload from `<source>/.socket.facts.json` lands at `<source>/.socket.facts.json` in the manifest tarball - identical to plain.

Concurrent-scan note: coana already writes to a single source path, so the source `.socket.facts.json` itself is racy under concurrent runs against the same dir; the sibling `.br` doesn't introduce a new race that wasn't already there.

* `src/utils/coana.mts`: `compressSocketFactsForUpload` writes `${p}.br` next to the source; `cleanup()` does `rm(brPath, { force: true })` per file. Drops `mkdtemp` import that's no longer used.
* `src/utils/coana.test.mts`: directory-shape assertion replaced with `swappedPath === ${input}.br` (sibling). First test now also asserts the source survives `cleanup()`. 16 cases.

* fix: clean up partial brotli siblings on compress failure

Per Cursor Bugbot finding on src/utils/coana.mts: if any brotli pipeline in the `Promise.all` batch rejects, the helper rejected before returning its `cleanup` callback, so already-completed sibling `.br` files were orphaned next to their source.

Three changes that together make the helper failure-safe:
* Track the intended `.br` path BEFORE the pipeline starts. Previous code pushed to `brPaths` after `await pipeline(...)`, so a sibling that the pipeline started writing but didn't finish was invisible to cleanup. `rm({ force: true })` no-ops on missing paths, so tracking the intent is safe.
* Switch the parallel batch from `Promise.all` to `Promise.allSettled`. With `all`, the first rejection bubbled out while sibling pipelines were still writing bytes; calling `cleanup()` then would race the live writes and could `rm` a sibling only to have it re-created immediately. `allSettled` waits for every pipeline to settle, so cleanup sees a stable set of files to remove.
* On batch failure, run `cleanup()` internally before re-throwing so the caller (which only gets a chance to call cleanup when the helper resolves) doesn't have to.

Add `recursive: true` to the rm call so a defensively-occupied directory at a `.br` path doesn't halt cleanup partway through the list.

Test added: `removes partial .br siblings if compression fails mid-batch`. It pre-occupies entry B's `.br` destination with a directory so its `createWriteStream` rejects with EISDIR, then asserts that entry A's `.br` is not orphaned on disk after the batch rejects and both source files survive untouched.

* chore(release): 1.1.98

Bump version to 1.1.98 and add changelog entry for the brotli upload-compression refactor in this PR.
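The traversal problem described in the commit message can be illustrated with a small sketch. It assumes the multipart entry name is `path.relative(cwd, filePath)` as the message states; `containsTraversal` here is a hypothetical stand-in written for illustration (depscan's real check is not shown in this commit), and the directory names are made up.

```typescript
import { posix } from 'node:path'

// Hypothetical stand-in for depscan's `containsTraversal`: flags any `..` segment.
function containsTraversal(unixPath: string): boolean {
  return unixPath.split('/').includes('..')
}

// Entry name as the commit message describes it: relative to the scan cwd.
function multipartName(cwd: string, filePath: string): string {
  return posix.relative(cwd, filePath)
}

// Illustrative paths only.
const cwd = '/home/dev/repo'

// A file under an OS temp dir sits outside cwd, so the relative name climbs out.
const tmpName = multipartName(cwd, '/tmp/socket-br-abc/.socket.facts.json.br')

// A sibling of the source stays inside cwd, so the name has no `..` segment.
const siblingName = multipartName(
  cwd,
  '/home/dev/repo/packages/app/.socket.facts.json.br',
)

console.log(tmpName, containsTraversal(tmpName))
console.log(siblingName, containsTraversal(siblingName))
```

Under these assumptions the temp-dir name would be rejected while the sibling name passes, which is the motivation the message gives for writing the `.br` next to its source.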
1 parent 4c49d6e commit 5c3cae4

5 files changed

Lines changed: 446 additions & 23 deletions


CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -10,6 +10,11 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- **`socket manifest bazel [beta]`** — Generate Bazel JVM SBOM manifests by running `bazel query` against discovered Maven repos in a Bazel workspace. Closes the inline-Maven-declaration gap that lockfile-only parsing misses for repos like envoy, ray, tensorflow, tink-java, and or-tools. Auto-detects Bzlmod and legacy `WORKSPACE`.
- **`socket scan create --auto-manifest`** now covers Bazel workspaces in addition to Gradle/Scala/Kotlin/Conda. Repos with `MODULE.bazel`, `WORKSPACE`, or `WORKSPACE.bazel` are detected automatically and their Maven dependencies extracted as part of the standard scan-create flow.
 
+## [1.1.98](https://github.com/SocketDev/socket-cli/releases/tag/v1.1.98) - 2026-05-20
+
+### Changed
+- `socket scan create --reach` now uploads the reachability facts file as brotli on the wire, shrinking mono-repo upload sizes by roughly 85% with no change to the on-disk or stored format. Faster scan submissions on slow connections.
+
 ## [1.1.97](https://github.com/SocketDev/socket-cli/releases/tag/v1.1.97) - 2026-05-18
 
 ### Changed

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "socket",
-  "version": "1.1.97",
+  "version": "1.1.98",
   "description": "CLI for Socket.dev",
   "homepage": "https://github.com/SocketDev/socket-cli",
   "license": "MIT AND OFL-1.1",

src/commands/scan/handle-create-new-scan.mts

Lines changed: 35 additions & 22 deletions
@@ -15,6 +15,7 @@ import { outputCreateNewScan } from './output-create-new-scan.mts'
 import { performReachabilityAnalysis } from './perform-reachability-analysis.mts'
 import constants from '../../constants.mts'
 import { checkCommandInput } from '../../utils/check-input.mts'
+import { compressSocketFactsForUpload } from '../../utils/coana.mts'
 import { findSocketYmlSync } from '../../utils/config.mts'
 import { getPackageFilesForScan } from '../../utils/path-resolve.mts'
 import { readOrDefaultSocketJson } from '../../utils/socket-json.mts'
@@ -279,28 +280,40 @@ export async function handleCreateNewScan({
     tier1ReachabilityScanId = reachResult.data?.tier1ReachabilityScanId
   }
 
-  const fullScanCResult = await fetchCreateOrgFullScan(
-    scanPaths,
-    orgSlug,
-    {
-      commitHash,
-      commitMessage,
-      committers,
-      pullRequest,
-      repoName,
-      branchName,
-      scanType: reach.runReachabilityAnalysis
-        ? constants.SCAN_TYPE_SOCKET_TIER1
-        : constants.SCAN_TYPE_SOCKET,
-      workspace,
-    },
-    {
-      cwd,
-      defaultBranch,
-      pendingHead,
-      tmp,
-    },
-  )
+  // Brotli-compress any .socket.facts.json paths in scanPaths just before
+  // upload. depscan's api-v0 multipart boundary streams brotli decode based
+  // on the .br filename suffix. Coana keeps writing plain .socket.facts.json
+  // on disk, so the local read paths (extractTier1ReachabilityScanId,
+  // extractReachabilityErrors) stay correct. The cleanup() in the finally
+  // block removes the sibling .br files whether the upload succeeded or threw.
+  const compressed = await compressSocketFactsForUpload(scanPaths)
+  let fullScanCResult: Awaited<ReturnType<typeof fetchCreateOrgFullScan>>
+  try {
+    fullScanCResult = await fetchCreateOrgFullScan(
+      compressed.paths,
+      orgSlug,
+      {
+        commitHash,
+        commitMessage,
+        committers,
+        pullRequest,
+        repoName,
+        branchName,
+        scanType: reach.runReachabilityAnalysis
+          ? constants.SCAN_TYPE_SOCKET_TIER1
+          : constants.SCAN_TYPE_SOCKET,
+        workspace,
+      },
+      {
+        cwd,
+        defaultBranch,
+        pendingHead,
+        tmp,
+      },
+    )
+  } finally {
+    await compressed.cleanup()
+  }
 
   const scanId = fullScanCResult.ok ? fullScanCResult.data?.id : undefined
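The compress / upload / cleanup shape in this hunk can be isolated as a small generic helper. This is only a sketch: `uploadWithCleanup` and the `Compressed` alias are hypothetical names introduced here, and the commit itself inlines the `try`/`finally` rather than using a wrapper.

```typescript
// Shape mirrors the `{ paths, cleanup }` result described in the commit message.
type Compressed = { paths: string[]; cleanup: () => Promise<void> }

// Hypothetical wrapper: compress first, upload the swapped paths, always clean up.
async function uploadWithCleanup<T>(
  compress: () => Promise<Compressed>,
  upload: (paths: string[]) => Promise<T>,
): Promise<T> {
  // If compress() itself rejects there is nothing to clean up here; the helper
  // in the commit removes its own partial output before re-throwing.
  const compressed = await compress()
  try {
    return await upload(compressed.paths)
  } finally {
    // Runs whether upload() resolved or threw.
    await compressed.cleanup()
  }
}
```

Keeping `compress()` outside the `try` matches the diff: `cleanup` only exists once compression has resolved, so the `finally` block never references an undefined callback.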

src/utils/coana.mts

Lines changed: 102 additions & 0 deletions
@@ -3,6 +3,11 @@
  * Manages reachability analysis via Coana tech CLI.
  *
  * Key Functions:
+ * - compressSocketFactsForUpload: Brotli-compress any .socket.facts.json
+ *   entries in scanPaths just before upload, returning swapped paths plus a
+ *   cleanup callback. Coana keeps writing plain JSON; the on-the-wire form
+ *   to depscan is brotli (api-v0 decodes at the multipart boundary).
+ * - extractReachabilityErrors: Extract per-component reachability errors
  * - extractTier1ReachabilityScanId: Extract scan ID from socket facts file
  *
  * Integration:
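As a rough illustration of the "plain JSON on disk, brotli on the wire" contract in the comment above, the following sketch round-trips a payload through Node's synchronous brotli helpers. The payload shape is hypothetical (only the `tier1ReachabilityScanId` field name comes from this commit), and the real helper streams with `createBrotliCompress` rather than using the sync API.

```typescript
import { brotliCompressSync, brotliDecompressSync } from 'node:zlib'

// Hypothetical facts payload; real `.socket.facts.json` files are written by Coana.
const facts = { tier1ReachabilityScanId: 'scan-123', components: [] as unknown[] }
const plain = Buffer.from(JSON.stringify(facts), 'utf8')

// What the sender would put on the wire as `.socket.facts.json.br`.
const wire = brotliCompressSync(plain)

// What a receiver would recover after decoding the `.br` entry.
const decoded = brotliDecompressSync(wire)
const roundTripped = JSON.parse(decoded.toString('utf8'))

console.log(decoded.equals(plain), roundTripped.tier1ReachabilityScanId)
```

Because the decoded bytes equal the original, local readers of the plain file and the receiving side after decoding would see the same JSON.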
@@ -11,8 +16,105 @@
  * - Extracts tier 1 reachability scan identifiers
  */
 
+import { createReadStream, createWriteStream, existsSync } from 'node:fs'
+import { rm } from 'node:fs/promises'
+import path from 'node:path'
+import { pipeline } from 'node:stream/promises'
+import { createBrotliCompress } from 'node:zlib'
+
 import { readJsonSync } from '@socketsecurity/registry/lib/fs'
 
+import constants from '../constants.mts'
+
+const { DOT_SOCKET_DOT_FACTS_JSON } = constants
+
+export type CompressedScanPaths = {
+  paths: string[]
+  cleanup: () => Promise<void>
+}
+
+/**
+ * For each `.socket.facts.json` in `scanPaths`, stream-brotli-compress a
+ * sibling `.socket.facts.json.br` next to the original file and swap its
+ * path in. Other paths pass through unchanged. Missing files also pass
+ * through unchanged (the upload will fail downstream with the same error
+ * it would have).
+ *
+ * Streaming + worker-thread compression keeps the event loop responsive:
+ * default brotli quality (11) on a 60+MB facts file takes multiple seconds
+ * of CPU, which would otherwise freeze the spinner / signal handlers /
+ * any concurrent work.
+ *
+ * The `.br` lives next to the source rather than under the OS temp dir
+ * because depscan's multipart ingest (`addStreamEntry`) rejects entries
+ * whose names contain `..` traversal segments. The SDK computes the
+ * multipart entry name via `path.relative(cwd, brPath)`, so an OS-tmpdir
+ * temp path turns into `../../../var/folders/...` and gets dropped as
+ * `unmatchedFiles`. Sibling-write keeps the relative path inside cwd, and
+ * keeps the directory shape symmetric with the plain `.socket.facts.json`
+ * upload (depscan strips only the `.br` suffix at ingest, so
+ * `<dir>/.socket.facts.json.br` and `<dir>/.socket.facts.json` resolve to
+ * the same storage path).
+ *
+ * Concurrent scans against the same source directory are already racy on
+ * `.socket.facts.json` itself (coana writes to a single path), so the
+ * sibling `.br` doesn't introduce a new race.
+ *
+ * Caller MUST `await cleanup()` (typically in a `finally` block) once the
+ * upload completes, successful or not, to remove the sibling files.
+ */
+export async function compressSocketFactsForUpload(
+  scanPaths: string[],
+): Promise<CompressedScanPaths> {
+  const brPaths: string[] = []
+  const cleanup = async () => {
+    const targets = brPaths.splice(0)
+    // `recursive: true` guards against the unlikely case where a sibling
+    // path was somehow created as a directory; `rm` would otherwise throw
+    // on the first such entry and skip the rest. `force: true` no-ops on
+    // missing paths so the function stays idempotent.
+    await Promise.all(targets.map(t => rm(t, { recursive: true, force: true })))
+  }
+  // Use `allSettled` (not `all`) so a failure in one entry doesn't leak the
+  // others' in-flight pipelines past our `catch`. If we used `all`, the
+  // first rejection would bubble out while sibling pipelines were still
+  // writing bytes; `cleanup()` would race with those writes and could
+  // remove a `.br` only to have it re-created after we returned.
+  const results = await Promise.allSettled(
+    scanPaths.map(async p => {
+      if (path.basename(p) !== DOT_SOCKET_DOT_FACTS_JSON) {
+        return p
+      }
+      if (!existsSync(p)) {
+        return p
+      }
+      const brPath = `${p}.br`
+      // Track the sibling path BEFORE the pipeline starts so a
+      // partially-written `.br` is removed even if the pipeline rejects.
+      // `rm({ force: true })` no-ops on missing files, so tracking before
+      // creation is safe.
+      brPaths.push(brPath)
+      await pipeline(
+        createReadStream(p),
+        createBrotliCompress(),
+        createWriteStream(brPath),
+      )
+      return brPath
+    }),
+  )
+  const failure = results.find(
+    (r): r is PromiseRejectedResult => r.status === 'rejected',
+  )
+  if (failure) {
+    // All pipelines have settled, so cleanup() can safely remove every
+    // `.br` we tracked (succeeded or partial) without racing live writes.
+    await cleanup()
+    throw failure.reason
+  }
+  const paths = results.map(r => (r as PromiseFulfilledResult<string>).value)
+  return { paths, cleanup }
+}
+
 export type ReachabilityError = {
   componentName: string
   componentVersion: string
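The `Promise.all` versus `Promise.allSettled` reasoning in the helper's comments can be reproduced in isolation. In this sketch the two tasks are hypothetical stand-ins for per-file brotli pipelines (one fails immediately, one is still "writing" for 20 ms), and `runBatch` is a name introduced only for illustration.

```typescript
// Hypothetical delay helper standing in for a pipeline that is still writing.
const sleep = (ms: number) =>
  new Promise<void>(resolve => setTimeout(resolve, ms))

async function runBatch(
  mode: 'all' | 'allSettled',
): Promise<{ slowFinishedBeforeReturn: boolean }> {
  let slowFinished = false
  // Simulated failing entry, e.g. a destination that cannot be opened.
  const fast = async () => {
    throw new Error('EISDIR (simulated)')
  }
  // Simulated entry that is still in flight when the other one fails.
  const slow = async () => {
    await sleep(20)
    slowFinished = true
  }
  const tasks = [fast(), slow()]
  if (mode === 'all') {
    // Rejects as soon as `fast` fails; `slow` keeps running in the background.
    await Promise.all(tasks).catch(() => undefined)
  } else {
    // Waits until every task has either fulfilled or rejected.
    await Promise.allSettled(tasks)
  }
  return { slowFinishedBeforeReturn: slowFinished }
}
```

With `'all'`, control returns while the slow task is still in flight, which is the window in which a cleanup could race a live write; with `'allSettled'`, the slow task has finished by the time cleanup would run.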
