3 changes: 3 additions & 0 deletions umbra-s3/.gitignore
@@ -0,0 +1,3 @@
data/
db/
.s3-env
81 changes: 81 additions & 0 deletions umbra-s3/README.md
@@ -0,0 +1,81 @@
# Umbra (S3)

ClickBench for [Umbra](https://umbra-db.com/) with the `hits` table stored on
**Amazon S3** (`backend=cloud`) instead of local disk. It is the same Umbra
benchmark as [`../umbra`](../umbra), with two differences:

- `create.sql` registers an S3 bucket as Umbra remote storage and creates the
table with `backend=cloud`, so table data lives in the bucket.
- You must provision that bucket first with [`./create-bucket`](#1-create-the-s3-bucket).

The dataset (`hits.parquet`) is still ingested from a local copy via
`umbra.parquetview`; only the resulting table is stored in S3.

## Prerequisites

- A fresh Ubuntu 24.04+ VM (the scripts `sudo apt-get install` Docker, the
Postgres client, and the AWS CLI as needed).
- Docker access (the default flow runs `umbradb/umbra` in a container).
- **AWS credentials that can create and write an S3 bucket.** `create-bucket`
picks them up, in order, from:
1. keys recorded in `.s3-env` by a previous run,
2. `$AWS_ACCESS_KEY_ID` / `$AWS_SECRET_ACCESS_KEY` in the environment,
3. whatever `aws configure` has stored,
4. an interactive prompt (only if none of the above can reach S3).

The *same static keys* are handed to Umbra's `create remote storage`
statement, so they must allow normal S3 data access (not just bucket
creation). No IAM user/role is created.
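
The lookup order above amounts to "first non-empty source wins". A minimal sketch of that rule follows; `first_nonempty` is a hypothetical helper used only for illustration, not a function that `create-bucket` defines.

```bash
#!/bin/bash
# Hypothetical helper: print the first non-empty argument.
# Returns non-zero if every argument is empty.
first_nonempty() {
    local candidate
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (the second argument stands in for a value read from
# `aws configure get aws_access_key_id`):
#   key_id="$(first_nonempty "${AWS_ACCESS_KEY_ID:-}" "$stored_key_id" || true)"
```

An empty result would then be the signal to fall through to the interactive prompt.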

## 1. Create the S3 bucket

```bash
cd umbra-s3
./create-bucket
```

This:

- ensures the AWS CLI is installed,
- resolves working AWS credentials (see above),
- generates a globally unique bucket name `clickbench-umbra-s3-<YYYYMMDD>-<uuid>`
and creates it in your region,
- writes everything Umbra needs to **`.s3-env`** (storage URI, bucket, region, key ID, secret key).

`.s3-env` is gitignored and created with mode `600`. **`./load` sources it automatically**,
so once `create-bucket` has run you do not need to export anything by hand.
Re-running `create-bucket` reuses the bucket/credentials already in `.s3-env`.
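
For reference, the bucket name can be reproduced and sanity-checked offline. The sketch below is illustrative: `make_bucket_name` and `is_valid_bucket_name` are hypothetical helpers, and the validity check covers only a conservative subset of S3's naming rules (dots are also legal in bucket names but are not used here).

```bash
#!/bin/bash
# Hypothetical helper: build clickbench-umbra-s3-<YYYYMMDD>-<uuid>.
# $1 = date stamp (YYYYMMDD), $2 = raw UUID in any case.
make_bucket_name() {
    local stamp="$1" raw_uuid="$2" short
    # Lowercase, keep hex digits only, truncate to 12 characters.
    short="$(printf '%s' "$raw_uuid" | tr 'A-Z' 'a-z' | tr -cd 'a-f0-9' | cut -c1-12)"
    printf 'clickbench-umbra-s3-%s-%s\n' "$stamp" "$short"
}

# Hypothetical check: 3-63 characters, lowercase letters, digits and
# hyphens, starting and ending with a letter or digit.
is_valid_bucket_name() {
    local name="$1"
    [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] &&
        [[ "$name" =~ ^[a-z0-9][a-z0-9-]*[a-z0-9]$ ]]
}
```

With a 20-character prefix, an 8-digit date, a hyphen, and 12 hex digits, the generated name is 41 characters long, well inside the 63-character limit.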

### Region and path

- Region: `$UMBRA_S3_REGION`, else `$AWS_DEFAULT_REGION`, else `us-east-1`.
- Path prefix inside the bucket: `$UMBRA_S3_PATH` (default `umbra`).

Umbra addresses the bucket as `s3://<bucket>:<region>/<path>` — the region is
part of the URI, not a separate option.
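
The defaults and the URI shape described above can be sketched as a small function. `umbra_s3_uri` is a hypothetical name for illustration; the variable names and fallbacks are the ones listed in this section.

```bash
#!/bin/bash
# Hypothetical helper: compose the storage URI from a bucket name,
# applying the documented region and path defaults.
umbra_s3_uri() {
    local bucket="$1"
    local region="${UMBRA_S3_REGION:-${AWS_DEFAULT_REGION:-us-east-1}}"
    local path="${UMBRA_S3_PATH:-umbra}"
    # The region sits between the bucket and the path, separated by ':'.
    printf 's3://%s:%s/%s\n' "$bucket" "$region" "$path"
}
```

For example, with none of the variables set, `umbra_s3_uri my-bucket` prints `s3://my-bucket:us-east-1/umbra`.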

## 2. Run the benchmark

Run the standard ClickBench driver directly from this directory:

```bash
cd umbra-s3
./benchmark.sh
```

The driver (`../lib/benchmark-common.sh`, via `benchmark.sh`) runs the
primitives in order: `install` → `start` → `load` → the 43 queries
(cold + 2 warm each) → `stop`. `install` downloads `hits.parquet` into `data/`
(kept across runs and excluded from the measured load time); `load` registers the S3
remote storage, creates the `backend=cloud` table, and ingests it.

> Run `./create-bucket` **before** the benchmark. `load` fails fast with a
> clear message if `UMBRA_S3_*` are unset (i.e. no `.s3-env`).
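
A fail-fast check of this kind could look like the sketch below. It is an assumption about the shape of the guard, not the actual contents of `load`; `require_s3_env` is a hypothetical name, and the variables checked are the ones `create-bucket` writes to `.s3-env`.

```bash
#!/bin/bash
# Hypothetical guard: report every missing variable, then fail.
require_s3_env() {
    local name missing=0
    for name in UMBRA_S3_URI UMBRA_S3_ACCESS_KEY_ID UMBRA_S3_ACCESS_KEY; do
        # ${!name} is bash indirect expansion: the value of the
        # variable whose name is stored in $name.
        if [ -z "${!name:-}" ]; then
            echo "load: $name is unset; run ./create-bucket first" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Checking all variables before returning means a single run lists everything that is missing, rather than stopping at the first gap.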

## 3. Tear down

```bash
./delete-bucket
```

Empties and deletes the bucket recorded in `.s3-env`, then removes `.s3-env`.
Idempotent, and touches no IAM resources.
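
The teardown behaviour described above could be sketched as follows. This is an illustration under assumptions, not the contents of `delete-bucket`: `teardown_bucket` is a hypothetical name, and the sketch relies on `aws s3 rb --force`, which removes the objects in a bucket before removing the bucket itself.

```bash
#!/bin/bash
# Hypothetical teardown helper; $1 is the path to the env file.
teardown_bucket() {
    local envfile="$1"
    # Nothing recorded means nothing to delete, so re-runs are harmless.
    [ -f "$envfile" ] || return 0
    # shellcheck disable=SC1090
    . "$envfile"
    # Use the keys recorded for Umbra; keep the env file if deletion fails.
    AWS_ACCESS_KEY_ID="$UMBRA_S3_ACCESS_KEY_ID" \
    AWS_SECRET_ACCESS_KEY="$UMBRA_S3_ACCESS_KEY" \
        aws s3 rb "s3://$UMBRA_S3_BUCKET" --force --region "$UMBRA_S3_REGION" || return 1
    rm -f "$envfile"
}
```

Removing `.s3-env` only after the bucket is gone means a failed deletion can simply be retried.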
5 changes: 5 additions & 0 deletions umbra-s3/benchmark.sh
@@ -0,0 +1,5 @@
#!/bin/bash
# Thin shim — actual flow is in lib/benchmark-common.sh.
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-single"
export BENCH_DURABLE=yes
exec ../lib/benchmark-common.sh
4 changes: 4 additions & 0 deletions umbra-s3/check
@@ -0,0 +1,4 @@
#!/bin/bash
set -e

PGPASSWORD=postgres psql -p 5432 -h 127.0.0.1 -U postgres -c 'SELECT 1' >/dev/null
103 changes: 103 additions & 0 deletions umbra-s3/create-bucket
@@ -0,0 +1,103 @@
#!/bin/bash
set -eu

# Create the S3 bucket that backs Umbra's (backend=cloud) hits table and record
# the credentials Umbra needs in .s3-env. No IAM user is created — this uses the
# access key the caller already has (env AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY,
# else `aws configure`). Those same static keys are what Umbra's
# create remote storage s3 using '<bucket>' with secret '<keyId>' '<key>'
# needs, so we verify they can do S3 and, only if they can't, prompt for keys.
#
# The bucket name is always generated as clickbench-umbra-s3-<date>-<uuid>;
# re-running reuses the bucket/credentials recorded in .s3-env.

here="$(cd "$(dirname "$0")" && pwd)"
envfile="$here/.s3-env"

region="${UMBRA_S3_REGION:-${AWS_DEFAULT_REGION:-us-east-1}}"

# Ensure the AWS CLI is available (the base image / install step doesn't ship
# it). Install it on first use.
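if ! command -v aws >/dev/null 2>&1; then
    sudo apt-get update -y
    # The `awscli` apt package may be unavailable on Ubuntu 24.04+, so fall
    # back to the snap package if apt cannot install it.
    sudo apt-get install -y awscli || sudo snap install aws-cli --classic
fi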

# True if the given key id/secret can talk to S3 (lists buckets).
creds_work() {
    AWS_ACCESS_KEY_ID="$1" AWS_SECRET_ACCESS_KEY="$2" \
        AWS_DEFAULT_REGION="$region" aws s3 ls >/dev/null 2>&1
}

# --- credentials -----------------------------------------------------------
# Reuse keys from a previous run if present.
# shellcheck disable=SC1091
[ -f "$envfile" ] && . "$envfile"

key_id="${UMBRA_S3_ACCESS_KEY_ID:-${AWS_ACCESS_KEY_ID:-}}"
key_secret="${UMBRA_S3_ACCESS_KEY:-${AWS_SECRET_ACCESS_KEY:-}}"

# Fall back to whatever `aws configure` has stored.
if [ -z "$key_id" ] || [ -z "$key_secret" ]; then
    key_id="$(aws configure get aws_access_key_id 2>/dev/null || true)"
    key_secret="$(aws configure get aws_secret_access_key 2>/dev/null || true)"
fi

# If the current credentials don't exist or can't reach S3, ask for keys.
if [ -z "$key_id" ] || [ -z "$key_secret" ] || ! creds_work "$key_id" "$key_secret"; then
    echo "create-bucket: current AWS credentials can't access S3." >&2
    read -r -p "AWS Access Key ID: " key_id
    read -r -s -p "AWS Secret Access Key: " key_secret; echo
    if ! creds_work "$key_id" "$key_secret"; then
        echo "create-bucket: those keys can't access S3 either; aborting." >&2
        exit 1
    fi
fi

export AWS_ACCESS_KEY_ID="$key_id"
export AWS_SECRET_ACCESS_KEY="$key_secret"
export AWS_DEFAULT_REGION="$region"

# --- bucket name -----------------------------------------------------------
# Reuse a previous run's bucket if .s3-env carried one over; else generate
# clickbench-umbra-s3-<YYYYMMDD>-<short-uuid>
# (global, <=63 chars, lowercase alphanumerics + hyphens).
bucket="${UMBRA_S3_BUCKET:-}"
if [ -z "$bucket" ]; then
    uuid="$( (uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid) \
        | tr 'A-Z' 'a-z' | tr -cd 'a-f0-9' | cut -c1-12)"
    bucket="clickbench-umbra-s3-$(date +%Y%m%d)-$uuid"
    echo "create-bucket: generated bucket name $bucket"
fi

# --- create ----------------------------------------------------------------
if aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1; then
    echo "create-bucket: s3://$bucket already exists"
else
    # us-east-1 must NOT be passed as a LocationConstraint (the API rejects
    # it); every other region requires it.
    if [ "$region" = "us-east-1" ]; then
        aws s3api create-bucket --bucket "$bucket" >/dev/null
    else
        aws s3api create-bucket --bucket "$bucket" \
            --create-bucket-configuration "LocationConstraint=$region" >/dev/null
    fi
    aws s3api wait bucket-exists --bucket "$bucket"
    echo "create-bucket: created s3://$bucket in $region"
fi

# --- persist ---------------------------------------------------------------
# UMBRA_S3_URI is the full storage URI load passes through verbatim — the AWS
# s3://<bucket>:<region>/<path> form (region is part of the URI, path follows).
path="${UMBRA_S3_PATH:-umbra}"
umask 077
cat > "$envfile" <<EOF
# Generated by create-bucket — credentials for Umbra S3 remote storage.
# Sourced by load. Do not commit. Run delete-bucket to clean up.
export UMBRA_S3_URI=s3://$bucket:$region/$path
export UMBRA_S3_BUCKET=$bucket
export UMBRA_S3_REGION=$region
export UMBRA_S3_ACCESS_KEY_ID=$key_id
export UMBRA_S3_ACCESS_KEY=$key_secret
EOF
echo "create-bucket: done -> $envfile (load sources this for you)."