Open
86 commits
ede542f
Add formally verified Flare operator in pure Lean 4
junjihashimoto Mar 9, 2026
54c6e0d
Add ESR (Eventually Stable Reconciliation) liveness proofs
junjihashimoto Mar 9, 2026
a21f06d
Add I/O binding, NodeState transition, C++ K8s operator mode, Dockerf…
junjihashimoto Mar 9, 2026
c02ab9c
Fix state persistence: rebuild partitionMap from nodeMap as single so…
junjihashimoto Mar 10, 2026
cc01d1f
Add cluster replication (Blue/Green migration) operator logic and E2E…
junjihashimoto Mar 10, 2026
329d81c
Fix broken sharding: sync partitionSize from CRD and fix lease timest…
junjihashimoto Mar 10, 2026
c97ac3c
Fix failover bugs and add E2E failover test
junjihashimoto Mar 10, 2026
105a16b
Add state tracing and Lean 4 E2E test framework
junjihashimoto Mar 10, 2026
ea1e664
Fix race condition, JSON parse, and E2E test reliability
junjihashimoto Mar 10, 2026
6231f71
Add native TCP topology broadcast to fix inconsistent node views
junjihashimoto Mar 11, 2026
e6bbded
Fix key distribution by implementing correct topology broadcast and p…
junjihashimoto Mar 13, 2026
25883a4
Add comprehensive documentation for Flare Operator
junjihashimoto Mar 13, 2026
1d2b578
Update operator image tag to flare-operator:test for testing
junjihashimoto Mar 13, 2026
92b648b
Fix E2E test assertion for partition-size
junjihashimoto Mar 13, 2026
1f0abc9
Add formal verification for distributed system safety
junjihashimoto Mar 13, 2026
e5777c0
Update documentation for formal verification completion
junjihashimoto Mar 13, 2026
ac5b730
Add Prometheus metrics support for operator observability
junjihashimoto Mar 13, 2026
8cf72eb
Add exponential backoff retry logic for K8s API resilience
junjihashimoto Mar 13, 2026
781ff7a
Update documentation for production readiness features
junjihashimoto Mar 13, 2026
ad80a71
Add health check endpoints for Kubernetes probes
junjihashimoto Mar 13, 2026
b086b8c
Integrate Prometheus metrics into operator reconcile loop
junjihashimoto Mar 13, 2026
c6c11d3
Integrate health check endpoints into operator lifecycle
junjihashimoto Mar 13, 2026
a3ae3a2
Update documentation for completed production feature integrations
junjihashimoto Mar 13, 2026
10ffa97
Add automatic partition reduction detection and safe migration guide
junjihashimoto Mar 13, 2026
8e57ccf
Add E2E test for partition reduction detection
junjihashimoto Mar 13, 2026
46283ea
Improve partition reduction E2E test reliability
junjihashimoto Mar 14, 2026
c7f9c54
Add Helm chart for Flare Operator
junjihashimoto Mar 14, 2026
da35268
Add tmpfs support and examples to Helm chart
junjihashimoto Mar 14, 2026
e4c6a6d
Phase 1 & 2: Complete FSM with all 5 critical domain requirements
junjihashimoto Mar 14, 2026
6b9519d
Phase 3: IO interpreters for FSM effects
junjihashimoto Mar 14, 2026
c2a1386
Phase 4: FSM driver loop with error handling (all 5 requirements)
junjihashimoto Mar 14, 2026
ea59e5e
Wire FSM driver to production operator loop
junjihashimoto Mar 14, 2026
961f9a3
E2E test isolation: unique namespaces per test
junjihashimoto Mar 14, 2026
b632aa7
Add Proxy Pool throttling to prevent thundering herd
junjihashimoto Mar 14, 2026
6f25851
Add Blast Radius Circuit Breaker for AZ-level failure protection
junjihashimoto Mar 15, 2026
2fc140f
Add .lake/ to .gitignore to exclude Lean build artifacts
junjihashimoto Mar 15, 2026
d3de2a0
Add configurable circuit breaker with simulation and analysis
junjihashimoto Mar 15, 2026
9e5842b
Prepare for PR: clean up files and add comprehensive documentation
junjihashimoto Mar 15, 2026
28d9bb7
Fix GitHub Actions E2E workflow and Dockerfile
junjihashimoto Mar 15, 2026
bdf6660
Remove redundant shell-based E2E tests
junjihashimoto Mar 15, 2026
6b8625f
Fix Dockerfile: avoid conflict with existing operator group
junjihashimoto Mar 15, 2026
85f799f
Remove test_partition_safety.sh wrapper script
junjihashimoto Mar 15, 2026
46a91fa
Fix Dockerfile: use auto-assigned UID for flare-operator user
junjihashimoto Mar 15, 2026
edc951d
Fix CI: replace Helm with direct manifest deployment
junjihashimoto Mar 15, 2026
b5117f3
Fix CI: create namespace before RBAC installation
junjihashimoto Mar 28, 2026
bf4da7a
Fix E2E tests: create namespace-specific RBAC resources
junjihashimoto Mar 28, 2026
4be02f7
Add debug output to E2E test failures
junjihashimoto Mar 28, 2026
daf3ee3
Fix E2E: throw error when operator deployment fails
junjihashimoto Mar 28, 2026
794a2a4
E2E: dump operator logs after grace period
junjihashimoto Mar 28, 2026
b15950e
Improve FSM error message for AfterListPods
junjihashimoto Mar 28, 2026
e5283d5
Fix critical FSM bug: handle chained requests properly
junjihashimoto Mar 29, 2026
78a5237
Fix circuit breaker: use FQDN for pod matching
junjihashimoto Mar 29, 2026
ad0e954
Fix E2E multi-cluster setup: complete deploySecondCluster implementation
junjihashimoto Mar 29, 2026
c806ba5
Add debug logging for proxy node assignment
junjihashimoto Mar 29, 2026
595633e
Increase E2E test timeout from 20min to 40min
junjihashimoto Mar 29, 2026
106b723
Fix scale-out-slave: remove throttling for replica scale-out
junjihashimoto Mar 30, 2026
974f001
Add debug logging for cluster replication ConfigMap issue
junjihashimoto Mar 30, 2026
c731039
Fix cluster replication ConfigMap persistence issue
junjihashimoto Mar 30, 2026
d046c67
Add RocksDB storage backend with WAL-based incremental replication
junjihashimoto Apr 8, 2026
c231fee
Merge branch 'feature/rocksdb' into flare-operator
junjihashimoto Apr 11, 2026
c1c0165
nix: fix cutter build under GCC 14
junjihashimoto Apr 11, 2026
987dd8b
nix: add shell.nix for plain nix-shell entry
junjihashimoto Apr 11, 2026
743bb18
docs: add E2E test issues summary for RocksDB work
junjihashimoto Apr 11, 2026
58a0542
operator: propagate spec.rocksdb.* to flared ConfigMap (G10)
junjihashimoto Apr 11, 2026
eb9b0e8
e2e: add strict-durability suite for rocksdb-sync-writes (G11)
junjihashimoto Apr 11, 2026
a864eca
e2e: add wal-bandwidth-throttle suite (G12)
junjihashimoto Apr 11, 2026
747f449
e2e: add RocksDB-enabled flared image + ConfigMap mount wiring
junjihashimoto Apr 11, 2026
72b149b
e2e: fail fast on kubectl apply errors + exclude .lake from docker co…
junjihashimoto Apr 12, 2026
2e22db3
fix: wire handleRocksdbConfig into reconcileOnceFSM (not just legacy …
junjihashimoto Apr 12, 2026
85ca409
e2e: fix G10/G11 test 5 to check rocksdb_master_id, not config values
junjihashimoto Apr 12, 2026
ec88124
Harden dead-node detection for production workloads
junjihashimoto Apr 12, 2026
1ddc9c0
docs: update e2e-test-issues with G10-G12 results and production hard…
junjihashimoto Apr 12, 2026
ac70296
docs: update e2e-test-issues with G10-G12 results and production hard…
junjihashimoto Apr 13, 2026
04c526e
e2e: document flared reload() limitation for walSyncBwlimit (G12 test 5)
junjihashimoto Apr 14, 2026
80d1dc7
e2e: fix deployCluster for fresh kind clusters
junjihashimoto Apr 16, 2026
c1bf31c
docs: update e2e-test-issues with G10-G12 results and production hard…
junjihashimoto Apr 16, 2026
5235f13
e2e: retry transient etcd errors + extend SA timeout to 120s
junjihashimoto Apr 16, 2026
3372350
docs: update e2e-test-issues with G10-G12 results and production hard…
junjihashimoto Apr 17, 2026
07a852a
e2e: add G1 WAL sync, G2 WAL purge, G5 self-demote, G7 orphan scan/purge
junjihashimoto Apr 17, 2026
76eeb37
e2e: fix G1 test 5 to SKIP on emptyDir (no PVC)
junjihashimoto Apr 17, 2026
3d795f4
docs: final e2e results — all 15 suites (96 tests) verified
junjihashimoto Apr 17, 2026
9d4c659
docs: add OSS bug analysis and logging gap assessment
junjihashimoto Apr 18, 2026
74fa9d2
improve operator logging for production debugging
junjihashimoto Apr 18, 2026
1c047f8
e2e: add terminating-pod-handling and failover-during-replication tests
junjihashimoto Apr 18, 2026
e110c81
docs: update OSS analysis with test results + logging status
junjihashimoto Apr 20, 2026
103bc9a
docs: add design review document for operator + RocksDB replication
junjihashimoto Apr 20, 2026
20 changes: 20 additions & 0 deletions .dockerignore
@@ -0,0 +1,20 @@
# Keep build contexts small and force the docker builders to always
# compile from source. Without this, Dockerfile.operator's COPY picks
# up the host's Nix-built `.lake/build/bin/flare_operator` (which has
# a /nix/store glibc interpreter baked into its ELF header) and simply
# re-packages it, producing an image that crashes at startup with
# "exec /usr/local/bin/flare_operator: no such file or directory".
#
# Excluding .lake/ forces lake to build inside the Ubuntu builder stage
# with /lib64/ld-linux-x86-64.so.2, which is what actually exists in
# the runtime image.
flare_operator/.lake/
flare_operator/.lake~/

# Git metadata and editor/OS junk -- not useful inside any image and
# wasteful to ship as build context.
.git/
.gitignore
*.swp
*~
.DS_Store
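The failure mode described in the `.dockerignore` comment can be checked before an image is built. The following is a minimal sketch, assuming `readelf` from binutils is available on the build host. The helper names `elf_interp` and `interp_is_portable` are hypothetical and not part of this PR.

```shell
# Hypothetical helpers (not part of the PR) for spotting a Nix-built binary.

# Print the interpreter path found in `readelf -l` output read from stdin.
# Prints nothing when there is no interpreter entry (e.g. a static binary).
elf_interp() {
  sed -n 's/.*\[Requesting program interpreter: \(.*\)]/\1/p'
}

# Return 0 unless the interpreter lives under /nix/store, a path that
# would not exist inside a stock Ubuntu runtime image.
interp_is_portable() {
  case "$1" in
    /nix/store/*) return 1 ;;
    *) return 0 ;;
  esac
}
```

A possible use is `interp_is_portable "$(readelf -l flare_operator/.lake/build/bin/flare_operator | elf_interp)"`. A non-zero exit status would suggest the binary was built on the Nix host rather than inside the builder stage.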
114 changes: 114 additions & 0 deletions .github/workflows/e2e-tests.yaml
@@ -0,0 +1,114 @@
name: E2E Tests

on:
  push:
    branches:
      - flare-operator
      - master
  pull_request:
    branches:
      - flare-operator
      - master

jobs:
  e2e-test:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          submodules: recursive

      - name: Install elan (Lean version manager)
        run: |
          curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh -s -- -y
          echo "$HOME/.elan/bin" >> $GITHUB_PATH

      - name: Install kind
        run: |
          curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.24.0/kind-linux-amd64
          chmod +x ./kind
          sudo mv ./kind /usr/local/bin/kind

      - name: Install kubectl
        run: |
          curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
          chmod +x kubectl
          sudo mv kubectl /usr/local/bin/kubectl

      - name: Install helm
        run: |
          curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

      - name: Build Lean operator
        run: |
          cd flare_operator
          lake build flare_operator flare_e2e

      - name: Create kind cluster
        run: kind create cluster --name flare-e2e --wait 120s

      - name: Build Docker images
        run: |
          docker build -t flare-operator:test -f Dockerfile.operator .
          docker build -t flare-node:test -f Dockerfile.flare-node .

      - name: Load images into kind
        run: |
          kind load docker-image flare-operator:test --name flare-e2e
          kind load docker-image flare-node:test --name flare-e2e

      - name: Create namespace
        run: kubectl create namespace flare-system --dry-run=client -o yaml | kubectl apply -f -

      - name: Install CRD
        run: kubectl apply -f deploy/crd.yaml

      - name: Install RBAC
        run: kubectl apply -f deploy/rbac.yaml

      - name: Deploy operator
        run: |
          # Use simple deployment manifest instead of Helm to avoid timeout issues in CI
          sed 's|flare-operator:latest|flare-operator:test|;s|IfNotPresent|Never|' \
            deploy/operator.yaml | kubectl apply -f -

      - name: Wait for operator ready
        run: |
          kubectl rollout status deployment/flare-operator -n flare-system --timeout=300s
          echo "=== Operator pods ==="
          kubectl get pods -n flare-system
          echo "=== Operator logs (initial) ==="
          kubectl logs -n flare-system -l app=flare-operator --tail=50

      - name: Run E2E tests
        run: |
          cd flare_operator
          export KUBECONFIG=$HOME/.kube/config
          timeout 2400 .lake/build/bin/flare_e2e

      - name: Dump operator logs on failure
        if: failure()
        run: |
          echo "=== Operator Logs ==="
          kubectl logs -n flare-system -l app=flare-operator --tail=500 || true

      - name: Dump K8s state on failure
        if: failure()
        run: |
          echo "=== All Namespaces ==="
          kubectl get ns || true
          echo ""
          echo "=== All Pods ==="
          kubectl get pods --all-namespaces || true
          echo ""
          echo "=== Events (all namespaces) ==="
          kubectl get events --all-namespaces --sort-by='.lastTimestamp' || true
          echo ""
          echo "=== FlareCluster CRs ==="
          kubectl get flareclusters --all-namespaces -o yaml || true

      - name: Delete kind cluster
        if: always()
        run: kind delete cluster --name flare-e2e
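The "Deploy operator" step rewrites the manifest on the fly so the cluster uses the images loaded with `kind load docker-image` instead of trying to pull them. Below is a small sketch of what those two `sed` substitutions do. The function name `rewrite_for_kind` and the sample manifest fragment are hypothetical, since `deploy/operator.yaml` itself is not shown in this part of the diff.

```shell
# Same two substitutions as the "Deploy operator" step, applied to stdin.
# Without a `g` flag, each substitution touches only the first match per line.
rewrite_for_kind() {
  sed 's|flare-operator:latest|flare-operator:test|;s|IfNotPresent|Never|'
}

# Illustrative fragment only; the real manifest may look different.
sample='        image: flare-operator:latest
        imagePullPolicy: IfNotPresent'

# Prints the fragment with the :test tag and the Never pull policy.
printf '%s\n' "$sample" | rewrite_for_kind
```

Setting the pull policy to `Never` makes a missing image fail fast in the pod status instead of falling back to a registry pull, which is presumably the intent for a CI-only image tag.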
36 changes: 34 additions & 2 deletions .github/workflows/nix-linux.yml
@@ -3,7 +3,8 @@ name: nix-linux
 on: [push, pull_request]
 
 jobs:
-  build-cache:
+  test-legacy:
+    name: Test Legacy Build (without RocksDB)
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
@@ -12,5 +13,36 @@ jobs:
           extra_nix_config: |
             experimental-features = nix-command flakes
           nix_path: nixpkgs=channel:nixos-unstable
-      - run: |
+      - name: Build and test legacy Flare
+        run: |
           nix build .#test-flare -L
+      - name: Verify RocksDB is NOT compiled
+        run: |
+          nix build .#flare -L
+          if nm result/bin/flared | grep -q rocksdb; then
+            echo "ERROR: RocksDB symbols found in legacy build"
+            exit 1
+          fi
+          echo "SUCCESS: Legacy build has no RocksDB dependencies"
+
+  test-rocksdb:
+    name: Test RocksDB Build
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: cachix/install-nix-action@v22
+        with:
+          extra_nix_config: |
+            experimental-features = nix-command flakes
+          nix_path: nixpkgs=channel:nixos-unstable
+      - name: Build and test Flare with RocksDB
+        run: |
+          nix build .#test-flare-rocksdb -L
+      - name: Verify RocksDB is compiled
+        run: |
+          nix build .#flare-rocksdb -L
+          if ! nm result/bin/flared | grep -q rocksdb; then
+            echo "ERROR: RocksDB symbols NOT found in RocksDB build"
+            exit 1
+          fi
+          echo "SUCCESS: RocksDB build has RocksDB dependencies"
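Both jobs apply the same symbol check with opposite expectations. The sketch below factors that check into one helper that reads a symbol listing on stdin. The names `has_rocksdb_symbols` and `check_build` are hypothetical and not part of the workflow.

```shell
# Return 0 if any symbol line on stdin mentions "rocksdb".
has_rocksdb_symbols() {
  grep -q rocksdb
}

# check_build <yes|no>: compare what is found on stdin with the expectation.
check_build() {
  expect="$1"
  if has_rocksdb_symbols; then found=yes; else found=no; fi
  if [ "$found" = "$expect" ]; then
    echo "SUCCESS: rocksdb symbols present=$found"
  else
    echo "ERROR: expected rocksdb symbols present=$expect, got $found"
    return 1
  fi
}
```

Under these assumptions the legacy job would run `nm result/bin/flared | check_build no` and the RocksDB job `nm result/bin/flared | check_build yes`. One caveat worth keeping in mind: if `flared` is stripped, plain `nm` lists no symbols, so the legacy check would pass vacuously and the RocksDB check would fail. Listing dynamic symbols with `nm -D` may be more robust when RocksDB is linked as a shared library.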
3 changes: 3 additions & 0 deletions .gitignore
@@ -19,3 +19,6 @@ Makefile.in
 /test/*.log
 /test/*.trs
 /test-driver
+
+# Lean build artifacts
+.lake/