146 changes: 73 additions & 73 deletions .github/workflows/ci.yml
@@ -2,58 +2,58 @@ name: CI

on:
push:
branches: [ main ]
branches: [main]
pull_request:
branches: [ main ]
branches: [main]

jobs:
test:
name: Test and Lint
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v4
with:
go-version-file: 'go.mod'
- name: Cache Go modules
uses: actions/cache@v4
with:
path: |
~/.cache/go-build
~/go/pkg/mod
key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
restore-keys: |
${{ runner.os }}-go-
- name: Install dependencies
run: make deps
- name: golangci-lint
uses: golangci/golangci-lint-action@v8
with:
version: v2.6.0

- name: Run checks (vet, fmt-check, test)
run: make vet fmt-check test
# Note: Validation tests with real cloud providers run in separate workflows
# See .github/workflows/validation-*.yml for provider-specific validation tests
- name: Run security scan
run: make security
continue-on-error: true
- name: Run tests with coverage
run: make test-coverage
- name: Upload coverage reports
uses: actions/upload-artifact@v4
with:
name: coverage-report
path: coverage/
- uses: actions/checkout@v4

- name: Set up Go
uses: actions/setup-go@v4
with:
go-version-file: "go.mod"

- name: Cache Go modules
uses: actions/cache@v4
with:
path: |
~/.cache/go-build
~/go/pkg/mod
key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
restore-keys: |
${{ runner.os }}-go-

- name: Install dependencies
run: make deps

- name: golangci-lint
uses: golangci/golangci-lint-action@v8
with:
version: v2.7.1

- name: Run checks (vet, fmt-check, test)
run: make vet fmt-check test

# Note: Validation tests with real cloud providers run in separate workflows
# See .github/workflows/validation-*.yml for provider-specific validation tests

- name: Run security scan
run: make security
continue-on-error: true

- name: Run tests with coverage
run: make test-coverage

- name: Upload coverage reports
uses: actions/upload-artifact@v4
with:
name: coverage-report
path: coverage/

build:
name: Cross-platform Build
@@ -63,31 +63,31 @@ jobs:
matrix:
target: [linux, darwin, windows]
steps:
- uses: actions/checkout@v4
- name: Set up Go
uses: actions/setup-go@v4
with:
go-version-file: 'go.mod'
- name: Cache Go modules
uses: actions/cache@v4
with:
path: |
~/.cache/go-build
~/go/pkg/mod
key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
restore-keys: |
${{ runner.os }}-go-
- name: Install dependencies
run: make deps
- name: Build for ${{ matrix.target }}
run: make build-${{ matrix.target }}
- name: Upload build artifacts
uses: actions/upload-artifact@v4
with:
name: build-${{ matrix.target }}
path: build/
- uses: actions/checkout@v4

- name: Set up Go
uses: actions/setup-go@v4
with:
go-version-file: "go.mod"

- name: Cache Go modules
uses: actions/cache@v4
with:
path: |
~/.cache/go-build
~/go/pkg/mod
key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
restore-keys: |
${{ runner.os }}-go-

- name: Install dependencies
run: make deps

- name: Build for ${{ matrix.target }}
run: make build-${{ matrix.target }}

- name: Upload build artifacts
uses: actions/upload-artifact@v4
with:
name: build-${{ matrix.target }}
path: build/
4 changes: 2 additions & 2 deletions README.md
@@ -7,6 +7,7 @@ An early-stage, vendor-agnostic Go SDK for managing **clusterable, GPU-accelerat
## Project Goals

- Define a clean, minimal interface for cloud compute primitives:

- `Instance`
- `Storage`
- `FirewallRule`
@@ -41,7 +42,7 @@ See [SECURITY.md](docs/SECURITY.md) for complete security specifications and imp

- **Operating System**: Currently supports Ubuntu 22 only
- **Architecture**: Designed for GPU-accelerated compute workloads
- **Access Method**: Requires SSH server and SSH key-based authentication
- **Access Method**: Requires SSH server and SSH key-based authentication. Supports `TunneledSSH`, which indicates whether connections must be routed through a client-side tunnel proxy; this is required for instances that do not have public IP addresses.
- **System Requirements**: Requires systemd to be running and accessible

---
@@ -65,4 +66,3 @@ See [SECURITY.md](docs/SECURITY.md) for complete security specifications and imp
## Get Involved

This is a foundation — we're opening it early to **learn with the community** and shape a clean, composable `v2`. If you're building GPU compute infrastructure or tooling, we'd love your input.

6 changes: 3 additions & 3 deletions docs/SECURITY.md
@@ -65,8 +65,8 @@ This document outlines the security requirements and best practices for implemen
1. **Default State**: All inbound traffic must be blocked by default (an exemption may be made for port 22, though the SDK prefers to set this explicitly)
2. **Explicit Allow**: Inbound access must be explicitly granted through `FirewallRule` resources
3. **Outbound Freedom**: Outbound traffic should be unrestricted by default
5. **Security Groups**: Use cloud provider security groups or equivalent (AWS Security Groups, GCP Firewall Rules, Azure NSGs) for network isolation
6. **Default Deny**: Configure security groups with default deny rules for all inbound traffic
4. **Security Groups**: Use cloud provider security groups or equivalent (AWS Security Groups, GCP Firewall Rules, Azure NSGs) for network isolation
5. **Default Deny**: Configure security groups with default deny rules for all inbound traffic

### Cluster Security

@@ -137,4 +137,4 @@ For security issues, vulnerabilities, or questions:

---

**Note**: This document is a living document and will be updated as security requirements evolve. All cloud integrations must comply with these requirements to ensure the security and integrity of the Brev Compute SDK ecosystem.
**Note**: This document is a living document and will be updated as security requirements evolve. All cloud integrations must comply with these requirements to ensure the security and integrity of the Brev Compute SDK ecosystem.
35 changes: 29 additions & 6 deletions docs/how-to-add-a-provider.md
@@ -3,18 +3,21 @@
This guide explains how to add a new cloud provider to the Brev Cloud SDK (v1). The Lambda Labs provider is the best working, well-tested example—use it as your canonical reference.

Goals:

- Implement a provider-specific CloudCredential (factory) and CloudClient (implementation) that satisfy pkg/v1 interfaces.
- Accurately declare Capabilities based on the provider’s API surface.
- Implement at least instance lifecycle and instance types, adhering to security requirements.
- Add validation tests and (optionally) a GitHub Actions workflow to run them with real credentials.

Helpful background:

- Architecture overview: ../docs/ARCHITECTURE.md
- Security requirements: ../docs/SECURITY.md
- Validation testing framework: ../docs/VALIDATION_TESTING.md
- v1 design notes: ../pkg/v1/V1_DESIGN_NOTES.md

Provider examples:

- Lambda Labs (canonical): ../internal/lambdalabs/v1/README.md
- Nebius (in progress): ../internal/nebius/v1/README.md
- Fluidstack (in progress): ../internal/fluidstack/v1/README.md
@@ -32,13 +35,15 @@ CloudClient is a composed interface of provider capabilities. You don’t need t
- Instance types and validation helpers: ../pkg/v1/instancetype.go

Patterns to follow:

- Embed v1.NotImplCloudClient in your client so unsupported methods gracefully return ErrNotImplemented (see ../pkg/v1/notimplemented.go).
- Accurately return capability flags that match your provider’s real API.
- Prefer stable, provider-native identifiers; otherwise use MakeGenericInstanceTypeID/MakeGenericInstanceTypeIDFromInstance.
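
The embedding pattern above can be sketched as follows. This is a minimal, self-contained illustration rather than the SDK's actual code: `CloudClient`, `NotImplCloudClient`, and `ErrNotImplemented` are reduced local stand-ins for the real `pkg/v1` types (which have more methods and different signatures), and `ExampleClient` is a hypothetical provider.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotImplemented is a local stand-in for the SDK's sentinel error.
var ErrNotImplemented = errors.New("not implemented")

// CloudClient is a trimmed-down, hypothetical stand-in for the composed v1 interface.
type CloudClient interface {
	RebootInstance(id string) error
	StopInstance(id string) error
}

// NotImplCloudClient mimics the role of v1.NotImplCloudClient: every method
// returns ErrNotImplemented unless the embedding type overrides it.
type NotImplCloudClient struct{}

func (NotImplCloudClient) RebootInstance(string) error { return ErrNotImplemented }
func (NotImplCloudClient) StopInstance(string) error   { return ErrNotImplemented }

// ExampleClient is a hypothetical provider client that only supports reboot.
type ExampleClient struct {
	NotImplCloudClient
}

// RebootInstance overrides the embedded default with a real (here trivial) implementation.
func (c *ExampleClient) RebootInstance(id string) error {
	if id == "" {
		return errors.New("instance id is required")
	}
	return nil
}

func main() {
	var client CloudClient = &ExampleClient{}
	// Overridden method: succeeds.
	fmt.Println(client.RebootInstance("i-123"))
	// Method promoted from the embedded struct: reports ErrNotImplemented.
	fmt.Println(errors.Is(client.StopInstance("i-123"), ErrNotImplemented))
}
```

Because the embedded struct supplies every method, the provider type satisfies the interface from day one and can grow support one method at a time.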

---

---

## Compute Brokers & Marketplaces (Aggregators)

This SDK supports providers that aggregate compute from multiple upstream sources (multi-cloud brokers, marketplaces, or exchanges). When implementing an aggregator, use these to differentiate where the compute comes from while keeping the interface consistent:
@@ -49,9 +54,11 @@ This SDK supports providers that aggregate compute from multiple upstream source
- InstanceType attributes (recommended): Use instance type attributes to delineate behavior differences across upstream sources (e.g., performance, network, storage, locality). There is also a `provider` attribute on the instance type you can use to indicate the originating vendor/source.

Notes:

- Capabilities represent what your broker can support. Differences between upstream vendors should be reflected in instance type attributes rather than reducing declared capabilities to the lowest common denominator.
- Keep your `Location`/`SubLocation` stable even if upstream identifiers change; translate upstream → broker-stable naming.
- Conform to the default-deny inbound model; document any upstream limitations under `internal/{provider}/SECURITY.md`.
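
One way to keep `Location` stable while upstream identifiers change is a small translation table. The sketch below is an assumption about how a broker might do this, not something the SDK prescribes; the vendor names, region identifiers, and the `stableLocation` helper are all made up for illustration.

```go
package main

import "fmt"

// upstreamKey identifies a region as named by a specific upstream vendor.
type upstreamKey struct {
	Vendor string
	Region string
}

// stableLocations maps upstream (vendor, region) pairs to broker-stable
// location names. Every entry here is invented for the example.
var stableLocations = map[upstreamKey]string{
	{"vendor-a", "us-east-1"}: "us-east",
	{"vendor-a", "use1"}:      "us-east", // upstream renamed its identifier
	{"vendor-b", "nyc-2"}:     "us-east",
	{"vendor-b", "ams-1"}:     "eu-west",
}

// stableLocation translates an upstream identifier into the broker's own
// naming. The boolean is false when the upstream region is unknown.
func stableLocation(vendor, region string) (string, bool) {
	loc, ok := stableLocations[upstreamKey{Vendor: vendor, Region: region}]
	return loc, ok
}

func main() {
	for _, k := range []upstreamKey{
		{"vendor-a", "use1"},
		{"vendor-b", "ams-1"},
		{"vendor-c", "unknown"},
	} {
		loc, ok := stableLocation(k.Vendor, k.Region)
		fmt.Printf("%s/%s -> %q (known=%t)\n", k.Vendor, k.Region, loc, ok)
	}
}
```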

## Directory Layout

Create a new provider folder:
@@ -68,6 +75,7 @@ Create a new provider folder:
- validation_test.go (validation suite entry point)

Use Lambda Labs as the pattern:

- ../internal/lambdalabs/v1/client.go
- ../internal/lambdalabs/v1/instance.go
- ../internal/lambdalabs/v1/capabilities.go
@@ -220,6 +228,7 @@ func (c *{Provider}Client) MergeInstanceTypeForUpdate(_ v1.InstanceType, newIt v
```

See the canonical mapping and conversion logic in Lambda Labs:

- Create/terminate/list/reboot: ../internal/lambdalabs/v1/instance.go
- Capabilities: ../internal/lambdalabs/v1/capabilities.go
- Client/credential + NotImpl: ../internal/lambdalabs/v1/client.go
@@ -242,38 +251,49 @@ Implement instance types in internal/{provider}/v1/instancetype.go:
The SDK uses a three-level capability system to accurately represent what operations are supported:

### 1. Provider-Level Capabilities

These are high-level features that your cloud provider's API supports, declared in your `GetCapabilities()` method. Capability flags live in ../pkg/v1/capabilities.go. Only include capabilities your API actually supports. For example, Lambda Labs supports:

- Create/terminate/reboot instance (`CapabilityCreateInstance`, `CapabilityTerminateInstance`, `CapabilityRebootInstance`)
- Does not (currently) support stop/start, resize volume, machine image, tags
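
As a rough sketch of declaring only what the API supports: the constant names below come from this guide, but the `Capability` type, its string values, the `Capabilities` slice shape, and the `Supports` helper are assumptions made so the example compiles on its own. Check ../pkg/v1/capabilities.go for the real definitions.

```go
package main

import "fmt"

// Capability is a hypothetical stand-in for the flag type in pkg/v1/capabilities.go.
type Capability string

// Constant names follow this guide; the string values are invented for illustration.
const (
	CapabilityCreateInstance    Capability = "create-instance"
	CapabilityTerminateInstance Capability = "terminate-instance"
	CapabilityRebootInstance    Capability = "reboot-instance"
	CapabilityModifyFirewall    Capability = "modify-firewall"
)

// Capabilities is a simple list of supported flags (assumed shape).
type Capabilities []Capability

// Supports reports whether the list declares the given capability.
func (c Capabilities) Supports(want Capability) bool {
	for _, have := range c {
		if have == want {
			return true
		}
	}
	return false
}

// getCapabilities declares only what this hypothetical provider's API can do:
// create, terminate, and reboot, with no firewall modification.
func getCapabilities() Capabilities {
	return Capabilities{
		CapabilityCreateInstance,
		CapabilityTerminateInstance,
		CapabilityRebootInstance,
	}
}

func main() {
	caps := getCapabilities()
	fmt.Println("reboot:", caps.Supports(CapabilityRebootInstance))
	fmt.Println("modify firewall:", caps.Supports(CapabilityModifyFirewall))
}
```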

### 2. Instance Type Capabilities
### 2. Instance Type Capabilities

These are hardware-specific features that vary by instance configuration, expressed as boolean fields on the `InstanceType` struct:

- `Stoppable`: Whether instances of this type can be stopped/started
- `Rebootable`: Whether instances of this type can be rebooted
- `CanModifyFirewallRules`: Whether firewall rules can be modified for this instance type
- `Preemptible`: Whether this instance type supports spot/preemptible pricing
- `TunneledSSH`: Whether connections must be routed through a client-side tunnel proxy. This is required for instances that do not have public IP addresses.

### 3. Instance Capabilities

These are capability boolean fields replicated on individual `Instance` objects, similar to Instance Type capabilities but applied to the running instance rather than the type template. While these fields could theoretically be derived from the associated `InstanceType`, they are duplicated on the instance for performance and convenience reasons. Examples include:

- `Stoppable`: Whether this specific instance can be stopped/started
- `Rebootable`: Whether this specific instance can be rebooted
- `TunneledSSH`: Whether connections must be routed through a client-side tunnel proxy. This is required for instances that do not have public IP addresses.

These fields must be kept accurate and in sync with the corresponding InstanceType capabilities, even though they appear redundant. They can also reflect runtime state-dependent variations - for example, a running instance might support certain operations that a stopped instance cannot, based on the current `LifecycleStatus`.
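
Keeping instance flags in sync with the type can be sketched like this. The structs are reduced, hypothetical versions of `InstanceType` and `Instance` with only the fields named above, `LifecycleStatus` is modeled as a plain string, and the rule that only a running instance is rebootable is an assumption chosen to illustrate state-dependent narrowing.

```go
package main

import "fmt"

// InstanceType is a reduced stand-in holding only the capability fields discussed here.
type InstanceType struct {
	ID          string
	Stoppable   bool
	Rebootable  bool
	TunneledSSH bool
}

// Instance is a reduced stand-in; LifecycleStatus is a plain string for the example.
type Instance struct {
	ID              string
	InstanceTypeID  string
	LifecycleStatus string
	Stoppable       bool
	Rebootable      bool
	TunneledSSH     bool
}

// applyCapabilities copies the type-level flags onto the instance, then
// narrows them based on the instance's current lifecycle status.
func applyCapabilities(inst Instance, it InstanceType) Instance {
	inst.Stoppable = it.Stoppable
	inst.Rebootable = it.Rebootable
	inst.TunneledSSH = it.TunneledSSH

	// Assumption for illustration: only a running instance can be rebooted.
	if inst.LifecycleStatus != "running" {
		inst.Rebootable = false
	}
	return inst
}

func main() {
	it := InstanceType{ID: "gpu-1x", Rebootable: true, TunneledSSH: true}

	running := applyCapabilities(Instance{ID: "i-1", InstanceTypeID: it.ID, LifecycleStatus: "running"}, it)
	stopped := applyCapabilities(Instance{ID: "i-2", InstanceTypeID: it.ID, LifecycleStatus: "stopped"}, it)

	fmt.Printf("%s rebootable=%t tunneled=%t\n", running.ID, running.Rebootable, running.TunneledSSH)
	fmt.Printf("%s rebootable=%t tunneled=%t\n", stopped.ID, stopped.Rebootable, stopped.TunneledSSH)
}
```

Centralizing the copy in one helper means a newly added capability field only has to be wired up in one place.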

Reference:

- Lambda capabilities: ../internal/lambdalabs/v1/capabilities.go

---

## Security Requirements

All providers must conform to ../docs/SECURITY.md:

- Default deny all inbound, allow all outbound
- SSH server must be available with key-based auth
- Firewall rules should be explicitly configured via FirewallRule when supported
- If your provider’s firewall model is global/project-scoped rather than per-instance, document limitations in internal/{provider}/SECURITY.md and reflect that by omitting CapabilityModifyFirewall if applicable.

Provider-specific security doc examples:

- Lambda Labs: ../internal/lambdalabs/SECURITY.md
- Nebius: ../internal/nebius/SECURITY.md
- Fluidstack: ../internal/fluidstack/v1/SECURITY.md
@@ -288,7 +308,8 @@ Use the shared validation suite to test your provider with real credentials.
- Shared package: ../internal/validation/suite.go

Steps:
1) Create internal/{provider}/v1/validation_test.go:

1. Create internal/{provider}/v1/validation_test.go:

```go
package v1
@@ -317,12 +338,14 @@ func TestValidationFunctions(t *testing.T) {
}
```

2) Local runs:
- make test # skips validation (short)
2. Local runs:

- make test # skips validation (short)
- make test-validation # runs validation (long)
- make test-all # runs everything
- make test-all # runs everything

3. CI workflow (recommended):

3) CI workflow (recommended):
- Add .github/workflows/validation-{provider}.yml (copy Lambda Labs workflow if available or follow VALIDATION_TESTING.md).
- Store secrets in GitHub Actions (e.g., YOUR_PROVIDER_API_KEY).

@@ -1,4 +1,4 @@
package errors
package clouderrors

import (
stderrors "errors"
3 changes: 3 additions & 0 deletions v1/V1_DESIGN_NOTES.md
@@ -51,3 +51,6 @@ The terminology around instance-attached storage is one of the more confusing pa
- Instance management is treated as individual resources rather than as part of a larger distributed system.
- Missing concepts like cluster membership, inter-instance communication, shared state, or cluster lifecycle management.
- For support to be added we may need to more formally implement networks/VPCs or instance groups.

### SSH Connectivity Patterns
- `TunneledSSH`: Indicates whether connections must be routed through a client-side tunnel proxy. This is required for instances that do not have public IP addresses. This is currently implemented as a field on both `InstanceType` and `Instance`.
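
A caller might consume the flag roughly as follows. This is a sketch under stated assumptions: the reduced `Instance` struct, its `PublicIP` field, the `sshTarget` helper, and the idea that the tunnel proxy exposes a local `host:port` are all hypothetical and not taken from the SDK.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Instance is a reduced, hypothetical stand-in with only the fields needed here.
type Instance struct {
	PublicIP    string
	TunneledSSH bool
}

// sshTarget returns the host:port an SSH client should dial. For tunneled
// instances it returns the address of a client-side tunnel proxy instead of
// the instance's own address. proxyAddr is assumed to be a local host:port.
func sshTarget(inst Instance, proxyAddr string) (string, error) {
	if inst.TunneledSSH {
		if proxyAddr == "" {
			return "", errors.New("instance requires a tunnel proxy but none is configured")
		}
		return proxyAddr, nil
	}
	if inst.PublicIP == "" {
		return "", errors.New("instance has no public IP and is not marked TunneledSSH")
	}
	return net.JoinHostPort(inst.PublicIP, "22"), nil
}

func main() {
	direct, err := sshTarget(Instance{PublicIP: "203.0.113.10"}, "")
	fmt.Println(direct, err)

	tunneled, err := sshTarget(Instance{TunneledSSH: true}, "127.0.0.1:2222")
	fmt.Println(tunneled, err)
}
```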