fix(aws/s3): store object bodies as base64 so binary uploads survive #168

Open
andthezhang wants to merge 1 commit into
vercel-labs:mainfrom
andthezhang:fix/s3-binary-safe-objects

@andthezhang


Problem

The S3 emulator corrupts any non-text upload. PutObject reads the request body with c.req.text() (and the presigned-POST path with file.text()), which decodes the bytes as UTF-8 and replaces every invalid byte sequence with the replacement character U+FFFD (EF BF BD when re-encoded as UTF-8).

As a result, any binary object — audio, images, gzip, protobuf, etc. — comes back corrupted on GetObject, and Content-Length is inflated (each U+FFFD re-encodes to 3 bytes).
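
To make the failure mode concrete, here is a minimal sketch of the lossy round trip. It assumes that reading a body as text behaves like a non-fatal UTF-8 decode (approximated here with TextDecoder); the sample payload is illustrative and not taken from the emulator.

```typescript
// Illustrative payload: PNG magic prefix, NUL, and bytes that are invalid as UTF-8.
const original = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x00, 0x80, 0xff, 0xfe]);

// Approximation of a text() read: a non-fatal UTF-8 decode.
// Each invalid byte (0x89, 0x80, 0xff, 0xfe) becomes U+FFFD.
const asText = new TextDecoder("utf-8").decode(original);

// Re-encoding the string shows what would be served back on a later read.
const roundTripped = new TextEncoder().encode(asText);

console.log(original.length);     // 8
console.log(roundTripped.length); // 16: four U+FFFD at 3 bytes each, plus 4 intact bytes
```

The four invalid bytes each grow to three, which is the Content-Length inflation described above.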

Fix

  • PutObject / presigned POST now read the raw bytes via arrayBuffer() and persist them base64-encoded.
  • ETag is computed from the raw bytes, and Content-Length is the true byte count.
  • GetObject decodes the stored base64 back to a byte-exact Uint8Array (sent via c.body(...)).
  • CopyObject already round-trips the stored body verbatim, so it inherits the fix.
  • md5() now accepts Buffer/Uint8Array so it can hash raw bytes directly.

No new dependencies; no behavioral change for text objects.
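
The storage approach listed above can be sketched roughly as follows. This is not the PR's actual code: StoredObject, storeObject, and loadObject are hypothetical names, the md5 signature is a guess at the widened helper, and the ETag is left as a bare hex digest without any quoting the emulator may apply.

```typescript
import { Buffer } from "node:buffer";
import { createHash } from "node:crypto";

// Hypothetical record shape; the emulator's real store may differ.
interface StoredObject {
  bodyBase64: string;
  etag: string;
  contentLength: number;
}

// Hash a string or raw bytes; passing bytes avoids any text decoding.
function md5(data: string | Uint8Array): string {
  return createHash("md5").update(data).digest("hex");
}

// A caller would pass something like new Uint8Array(await c.req.arrayBuffer()).
function storeObject(bytes: Uint8Array): StoredObject {
  return {
    bodyBase64: Buffer.from(bytes).toString("base64"),
    etag: md5(bytes),                // computed from the raw bytes
    contentLength: bytes.byteLength, // true byte count
  };
}

// Decode the stored base64 back to the exact original bytes.
function loadObject(stored: StoredObject): Uint8Array {
  return new Uint8Array(Buffer.from(stored.bodyBase64, "base64"));
}

// Usage: binary bytes survive the round trip unchanged.
const payload = new Uint8Array([0x00, 0x80, 0xff, 0xfe]);
const stored = storeObject(payload);
const restored = loadObject(stored);
console.log(Buffer.from(restored).equals(Buffer.from(payload))); // true
```

Base64 adds about a third to the stored size, but it keeps the stored body a plain string, which would explain why text objects and CopyObject need no special handling.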

Test

Adds a round-trip test that uploads a binary object (NUL, 0x80/0xff/0xfe, PNG magic), reads it back, and asserts it is byte-for-byte identical and that Content-Length equals the raw byte count.

pnpm --filter @emulators/aws test → all green; type-check clean.

PutObject read the request body with `.text()` (and the presigned-POST
path with `file.text()`), which decodes bytes as UTF-8 and replaces every
non-UTF-8 byte with U+FFFD (EF BF BD). Any binary object — audio, images,
gzip, protobuf — came back corrupted on GetObject, and Content-Length was
inflated because each replacement char re-encodes to 3 bytes.

Read the raw bytes via `arrayBuffer()` instead and persist them base64.
ETag is now md5 of the raw bytes, Content-Length is the true byte count,
and GetObject decodes the base64 back to a byte-exact Uint8Array. Copy
already round-trips the stored body verbatim, so it inherits the fix.

`md5()` now accepts Buffer/Uint8Array so it can hash raw bytes directly.

Test: round-trips a binary object (NUL, 0x80/0xff/0xfe, PNG magic)
byte-for-byte and asserts Content-Length is the raw byte count.
@vercel

vercel Bot commented May 29, 2026

@andthezhang is attempting to deploy a commit to the Vercel Labs Team on Vercel.

A member of the Team first needs to authorize it.
