diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f48ae76..735cf43 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -15,8 +15,8 @@ Thank you for your interest in contributing to AUP Learning Cloud! This document 1. **Clone the repository** ```bash - git clone https://github.com/AMDResearch/aup-learning-cloud-dev.git - cd aup-learning-cloud-dev + git clone https://github.com/AMDResearch/aup-learning-cloud.git + cd aup-learning-cloud ``` 2. **Install Python dependencies** @@ -26,7 +26,7 @@ Thank you for your interest in contributing to AUP Learning Cloud! This document 3. **Install frontend dependencies** ```bash - cd runtime/admin/frontend/admin-frontend + cd runtime/hub/frontend pnpm install cd - ``` @@ -63,11 +63,11 @@ ruff format . - **ESLint**: JavaScript/TypeScript/Vue linter - **Prettier**: Code formatter - **TypeScript**: Type checking -- **Config**: `runtime/admin/frontend/admin-frontend/eslint.config.js`, `.prettierrc` +- **Config**: `runtime/hub/frontend/eslint.config.js`, `.prettierrc` Run checks: ```bash -cd runtime/admin/frontend/admin-frontend +cd runtime/hub/frontend # Lint pnpm run lint @@ -140,7 +140,7 @@ find . -name "*.sh" -o -name "*.bash" | \ # YAML yamllint . - # Frontend (from runtime/admin/frontend/admin-frontend) + # Frontend (from runtime/hub/frontend) pnpm run lint pnpm run format:check pnpm run type-check @@ -196,6 +196,6 @@ Standard production code rules apply to: ## Questions? - Open an issue for bugs or feature requests -- Check existing documentation in `docs/` +- Check existing documentation at https://amdresearch.github.io/aup-learning-cloud/ Thank you for contributing! diff --git a/README.md b/README.md index 76975dd..566f914 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,7 @@ SOFTWARE. AUP Learning Cloud is a tailored JupyterHub deployment designed to provide an intuitive and hands-on AI learning experience. 
It features a comprehensive suite of AI toolkits running on AMD hardware acceleration, enabling users to learn and experiment with ease. -![Software Architecture](docs/imgs/software-stack.png) +![Software Architecture](deploy/docs/images/software-stack.png) ## Quick Start @@ -79,9 +79,6 @@ For multi-node cluster installation or need more control over the deployment pro AUP Learning Cloud offers the following Learning Toolkits: -> [!IMPORTANT] -> Only [**Deep Learning**](projects/DL) and [**Large Language Model from Scratch**](projects/LLM) are available in the v1.0 release. - - [**Computer Vision**](projects/CV) \ Includes 10 hands-on labs covering common computer vision concepts and techniques. @@ -91,6 +88,9 @@ Includes 12 hands-on labs covering common deep learning concepts and techniques. - [**Large Language Model from Scratch**](projects/LLM) \ Includes 9 hands-on labs designed to teach LLM development from scratch. +- [**Physical Simulation**](projects/PhySim) \ +Hands-on labs for physics simulation and robotics using Genesis. + ## Key Features ### Hardware Acceleration @@ -118,27 +118,30 @@ Dynamic NFS provisioning ensures scalable and persistent storage for user data, ## Available Notebook Environments -Current environments are set up as `RESOURCE_IMAGES` in `runtime/chart/files/hub`. These settings should be consistent with `Prepullers` in `runtime/values.yaml`. +Current environments are configured via `custom.resources.images` in `runtime/values.yaml`. These settings should be consistent with `prePuller.extraImages`. 
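To illustrate how the two settings line up, a hypothetical `runtime/values.yaml` fragment might pair them as below. The field layout under `custom.resources.images` and the `prePuller.extraImages` entry shape are assumptions here; check the shipped `runtime/values.yaml` for the authoritative structure.

```yaml
# Hypothetical sketch: each image offered under custom.resources.images should
# have a matching prePuller.extraImages entry so nodes pre-pull it.
custom:
  resources:
    images:
      - name: ghcr.io/amdresearch/auplc-default   # Base CPU
      - name: ghcr.io/amdresearch/auplc-dl        # DL COURSE (GPU)
prePuller:
  extraImages:
    auplc-default:
      name: ghcr.io/amdresearch/auplc-default
    auplc-dl:
      name: ghcr.io/amdresearch/auplc-dl
```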
-| Environment | Image | Version | Hardware | -| ----------- | ---------------------------------------- | ------- | ------------------------------- | -| Base CPU | `ghcr.io/amdresearch/auplc-default` | v1.0 | CPU | -| CV COURSE | `ghcr.io/amdresearch/auplc-cv` | v1.0 | GPU (Strix-Halo) | -| DL COURSE | `ghcr.io/amdresearch/auplc-dl` | v1.0 | GPU (Strix-Halo) | -| LLM COURSE | `ghcr.io/amdresearch/auplc-llm` | v1.0 | GPU (Strix-Halo) | +| Environment | Image | Hardware | +| ----------- | ---------------------------------------- | ------------------------------- | +| Base CPU | `ghcr.io/amdresearch/auplc-default` | CPU | +| GPU Base | `ghcr.io/amdresearch/auplc-base` | GPU | +| CV COURSE | `ghcr.io/amdresearch/auplc-cv` | GPU | +| DL COURSE | `ghcr.io/amdresearch/auplc-dl` | GPU | +| LLM COURSE | `ghcr.io/amdresearch/auplc-llm` | GPU | +| PhySim COURSE | `ghcr.io/amdresearch/auplc-physim` | GPU | ## Documentation -- [JupyterHub Configuration](docs/jupyterhub/README.md) - Detailed JupyterHub settings -- [Authentication Guide](docs/jupyterhub/authentication-guide.md) - GitHub OAuth and native authentication -- [User Management Guide](docs/jupyterhub/user-management.md) - Batch user operations with scripts -- [User Quota System](docs/jupyterhub/quota-system.md) - Resource usage tracking and quota management -- [GitHub OAuth Setup](docs/jupyterhub/How_to_Setup_GitHub_OAuth.md) - OAuth configuration -- [Maintenance Manual](docs/user-manual/aup-remote-lab-user-manual-admin-new.md) - Operations guide +Full documentation is available at: **https://amdresearch.github.io/aup-learning-cloud/** + +- [Deployment Guide](deploy/README.md) - Single-node and multi-node deployment +- [Configuration Reference](https://amdresearch.github.io/aup-learning-cloud/jupyterhub/configuration-reference.html) - `runtime/values.yaml` field reference +- [Authentication Guide](https://amdresearch.github.io/aup-learning-cloud/jupyterhub/authentication-guide.html) - GitHub OAuth and native 
authentication +- [User Management Guide](https://amdresearch.github.io/aup-learning-cloud/jupyterhub/user-management.html) - Batch user operations with scripts +- [User Quota System](https://amdresearch.github.io/aup-learning-cloud/jupyterhub/quota-system.html) - Resource usage tracking and quota management ## Contributing -Please refer to [CONTRIBUTING.md](docs/contribute.md) for details on how to contribute to the project. +Please refer to [CONTRIBUTING.md](CONTRIBUTING.md) for details on how to contribute to the project. ## Acknowledgment diff --git a/deploy/README.md b/deploy/README.md index 39d3ae8..c1a8e84 100644 --- a/deploy/README.md +++ b/deploy/README.md @@ -20,500 +20,51 @@ SOFTWARE. --> -# Deployment Guide +# Deployment -This guide covers the complete deployment of AUP Learning Cloud, including Kubernetes infrastructure and JupyterHub application. +This directory contains infrastructure code for deploying AUP Learning Cloud. -## Prerequisites +## Directory Structure -Before starting the deployment, clone this repository: - -```bash -git clone https://github.com/AMDResearch/aup-learning-cloud.git -cd aup-learning-cloud -``` - -## Deployment Modes - -| Mode | Use Case | Tools | Storage | -|------|----------|-------|---------| -| [Single Node](#single-node-deployment) | Development, demo, small-scale teaching (< 10 users) | Shell + Helm | local-path | -| [Multi-Node Cluster](#multi-node-cluster-deployment) | Production, large-scale deployment | Ansible + Helm | NFS | - ---- - -## Single Node Deployment - -A simplified deployment for development, demos, or small-scale teaching environments. - -> **πŸ’‘ Tip**: If you need to use alternative container registries or package mirrors, see [Mirror Configuration](#mirror-configuration). 
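To make the mirror behavior concrete before getting into the steps: the Mirror Configuration section later in this guide describes a universal prefix that is prepended verbatim to every image reference. A minimal sketch of that rewrite (the function name is illustrative, not part of the installer):

```bash
# Sketch of the rewrite MIRROR_PREFIX implies (see "Mirror Configuration"):
# the mirror prefix is prepended verbatim to the full image reference.
prefix_image() {
  local prefix="$1" image="$2"
  printf '%s/%s\n' "$prefix" "$image"
}
```

For example, `prefix_image mirror.example.com quay.io/jupyterhub/k8s-hub:4.1.0` produces `mirror.example.com/quay.io/jupyterhub/k8s-hub:4.1.0`, matching the example in the Mirror Configuration section.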
- -### Prerequisites - -**Hardware** -- AMD Strix / Strix-Halo AIPC or dGPU system -- 32GB+ RAM (64GB recommended) -- 500GB+ SSD - -**Software** -- Ubuntu 24.04.2 LTS -- Helm 3.2.0+ - -### Step 1: Install K3s - -```bash -# Install K3s with readable kubeconfig permissions -curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.32.3+k3s1 K3S_KUBECONFIG_MODE="644" sh - - -# Configure kubectl -mkdir -p ~/.kube -sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config -sudo chown $(id -u):$(id -g) ~/.kube/config -``` - -> **πŸ’‘ Tip**: The `K3S_KUBECONFIG_MODE="644"` environment variable sets the kubeconfig file permissions to be readable by all users. This prevents common "permission denied" errors when running kubectl commands. See [K3s Cluster Access](https://docs.k3s.io/cluster-access) for more details. - -K3s includes a built-in `local-path` StorageClass. No additional storage setup required. - -Verify installation: -```bash -kubectl get nodes -kubectl get storageclass -``` - -**Official documentation**: https://docs.k3s.io/installation - -### Step 2: Install Helm - -Helm is required to deploy JupyterHub. 
Install it after K3s is running: - -```bash -# Download Helm v3.17.2 or later -wget https://get.helm.sh/helm-v3.17.2-linux-amd64.tar.gz -O /tmp/helm-linux-amd64.tar.gz - -# Extract and install -cd /tmp && tar -zxvf helm-linux-amd64.tar.gz -sudo mv /tmp/linux-amd64/helm /usr/local/bin/helm -rm /tmp/helm-linux-amd64.tar.gz - -# Verify installation -helm version -``` - -**Official documentation**: https://helm.sh/docs/intro/install/ - -### Step 3: Install ROCm Driver - -Follow the official AMD ROCm installation guide for Ubuntu 24.04: -https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html - -Verify installation: -```bash -rocminfo -rocm-smi -``` - -### Step 4: Deploy ROCm Device Plugin - -```bash -kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml -``` - -Verify GPU is detected: -```bash -kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | grep amd -``` - -**Official documentation**: https://github.com/ROCm/k8s-device-plugin - -### Step 5: Label Node - -Label your node based on GPU type. - -First, get the actual node name: -```bash -# List all nodes to see the actual node name -kubectl get nodes - -# Store the node name in a variable -NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}') -echo "Node name: $NODE_NAME" -``` - -Then apply the appropriate label based on your GPU type: -```bash -# For Strix-Halo -kubectl label nodes $NODE_NAME node-type=strix-halo - -# For Strix -kubectl label nodes $NODE_NAME node-type=strix - -# For discrete GPU -kubectl label nodes $NODE_NAME node-type=dgpu -``` - -Verify the label was applied: -```bash -kubectl get nodes --show-labels | grep node-type -``` - -### Step 6: Deploy JupyterHub - -```bash -cd runtime - -# Deploy JupyterHub -helm install jupyterhub ./chart \ - --namespace jupyterhub \ - --create-namespace \ - -f values.yaml ``` - -### Step 7: Access JupyterHub - -Open http://localhost:30890 in your browser. 
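Before opening the browser, it can help to confirm all pods have come up. A small helper, assuming the `jupyterhub` namespace used in the deploy step above (the function itself is illustrative, not part of the tooling):

```bash
# Sketch (assumes the "jupyterhub" namespace from the helm install step):
# counts pods in the namespace that are not yet in the Running phase.
pods_not_running() {
  local ns="${1:-jupyterhub}"
  kubectl get pods -n "$ns" --no-headers 2>/dev/null | awk '$3 != "Running"' | wc -l
}
# On the cluster node: [ "$(pods_not_running)" -eq 0 ] && echo "hub is up"
```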
- -The default configuration uses auto-login mode - you will be automatically logged in as 'student' without requiring credentials. - ---- - -## Multi-Node Cluster Deployment - -For production environments with multiple nodes. Uses Ansible for automated cluster setup. - -![Architecture](./docs/images/overview.png) - -### Prerequisites - -#### Hardware - -- A 15U mini rack (standard size) -- AI370 Strix Mini PC or AI395 Strix-Halo AIPC (Example: Minisforum with dual 2.5G Ethernet) - - Node requirement: 3 nodes minimum - - The BIOS needs to be updated to the newest - - You can check the BIOS update via their official website: [MinisForum](https://www.minisforum.com/pages/product-info) [GMKtec](https://www.gmktec.com/pages/drivers-and-software) -- A 1GbE or 2.5GbE network switch -- A firewall router (Optional, Mikrotik RB5009 in this case) - -#### Software - -All components are installed using Ansible playbooks: -- Ubuntu 24.04.2 LTS (Linux Kernel: 6.14) or above -- ROCm 7.1.0 - [Installation Guide](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html) -- Ansible: 2.18.3 (Only on controller node) - [Installation Guide](https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html) -- Helm: 3.2.0+ - [Installation Guide](https://helm.sh/docs/intro/install/) -- K3s v1.32.3+k3s1 - [Official Documentation](https://docs.k3s.io/) - -### Project Structure - -```bash deploy/ -β”œβ”€β”€ ansible/ # Host configuration and K3s cluster setup +β”œβ”€β”€ ansible/ # Ansible playbooks for K3s cluster setup β”œβ”€β”€ k8s/ # Kubernetes components (NFS provisioner, device plugins) -└── docs/ # Documentation and images +β”œβ”€β”€ scripts/ # Helper scripts for cluster setup +└── docs/ # Architecture diagrams ``` -### Step 1: Ansible Playbooks +## Documentation -Configure inventory and run playbooks to set up K3s cluster. 
+For full deployment instructions, see the documentation site: -See [ansible/README.md](./ansible/README.md) for detailed instructions. +- [Single-Node Deployment](https://amdresearch.github.io/aup-learning-cloud/installation/single-node.html) +- [Multi-Node Cluster Deployment](https://amdresearch.github.io/aup-learning-cloud/installation/multi-node.html) +- [Configuration Reference](https://amdresearch.github.io/aup-learning-cloud/jupyterhub/configuration-reference.html) -### Step 2: Install Helm +## Quick Start -After K3s cluster is deployed, install Helm on the controller node: +### Single Node ```bash -# Download Helm v3.17.2 or later -wget https://get.helm.sh/helm-v3.17.2-linux-amd64.tar.gz -O /tmp/helm-linux-amd64.tar.gz - -# Extract and install -cd /tmp && tar -zxvf helm-linux-amd64.tar.gz -sudo mv /tmp/linux-amd64/helm /usr/local/bin/helm -rm /tmp/helm-linux-amd64.tar.gz - -# Verify installation -helm version +cd .. +sudo ./auplc-installer install ``` -**Official documentation**: https://helm.sh/docs/intro/install/ - -### Step 3: Verify GPU/NPU Drivers +### Multi-Node Cluster -Before proceeding, ensure hardware drivers are working: - -**For GPUs:** ```bash -rocminfo -rocm-smi -``` +# 1. Configure Ansible inventory +cd ansible +vim inventory.yml -**For NPUs (TODO):** -```bash -/opt/xilinx/xrt/bin/xrt-smi validate -``` - -### Step 4: Kubernetes Components - -Set up NFS provisioner, device plugins, and node labels. - -See [k8s/README.md](./k8s/README.md) for detailed instructions. - -### Step 5: Deploy JupyterHub - -#### Configure Multi-Node Deployment +# 2. Run playbooks +sudo ansible-playbook playbooks/pb-base.yml +sudo ansible-playbook playbooks/pb-k3s-site.yml -For multi-node cluster deployments, copy and customize the multi-node configuration: - -```bash -cd runtime - -# Copy multi-node configuration template +# 3. 
Deploy JupyterHub +cd ../../runtime cp values-multi-nodes.yaml.example values-multi-nodes.yaml -``` - -Edit `values-multi-nodes.yaml` to customize for your environment. Key settings include: - -```yaml -# Hub Database Storage -hub: - db: - pvc: - storageClassName: nfs-client - -# User Environment -singleuser: - storage: - dynamic: - storageClass: nfs-client - -# Ingress Configuration -ingress: - enabled: true - ingressClassName: traefik - hosts: - - your-domain.com - tls: - - hosts: - - your-domain.com - secretName: jupyter-tls-cert -``` - -#### Configure Authentication - -##### Option 1: Auto-Login (Default for Single-Node) - -Set in `runtime/values.yaml`: - -```yaml -custom: - authMode: "auto-login" -``` - -No credentials required. Users are automatically logged in as 'student'. Suitable for single-node, personal learning environments. - -##### Option 2: GitHub OAuth - -Update [runtime/values.yaml](../runtime/values.yaml): - -```yaml -custom: - authMode: "github" - teams: - mapping: - cpu: [cpu] - gpu: [Course-CV, Course-DL, Course-LLM] - -hub: - config: - GitHubOAuthenticator: - oauth_callback_url: "https:///hub/oauth_callback" - client_id: "YOUR-CLIENT-ID" - client_secret: "YOUR-CLIENT-SECRET" - allowed_organizations: - - - scope: - - read:user - - read:org -``` - -##### Option 3: Multi-Login (GitHub + Native) - -Update [runtime/values.yaml](../runtime/values.yaml): - -```yaml -custom: - authMode: "multi" - teams: - mapping: - native-users: [cpu, Course-CV, Course-DL] - -hub: - config: - GitHubOAuthenticator: - # Same as Option 2 -``` - -> For more details, see [JupyterHub Configuration Guide](../docs/jupyterhub/README.md) - -#### Deploy - -```bash -cd runtime - -helm install jupyterhub ./chart \ - --namespace jupyterhub \ - --create-namespace \ - -f values-multi-nodes.yaml -``` - -> **Note**: Due to prepuller settings, initial deployment may take a while to pull all images. 
If you see timeout issues, wait for prepuller pods to complete, then run `helm upgrade`. See [issue #42](https://github.com/AMDResearch/aup-learning-cloud/issues/42) for details. - -#### Access - -| Service Type | URL | -|--------------|-----| -| NodePort | http://localhost:30890 | -| LoadBalancer | Check `kubectl get svc -n jupyterhub` | -| Ingress | https://your-domain.com | - ---- - -## Maintenance - -### Update Configuration - -```bash -cd runtime - -# Upgrade JupyterHub -helm upgrade jupyterhub ./chart \ - --namespace jupyterhub \ - -f values-multi-nodes.yaml -``` - -When updating images, the prepuller will take time to pull. If a node halts, SSH into it and run `sudo systemctl restart k3s-agent`. - -### Uninstall JupyterHub - -```bash -helm uninstall jupyterhub --namespace jupyterhub -``` - -### Uninstall K3s (Single-Node Only) - -```bash -/usr/local/bin/k3s-uninstall.sh -``` - -> For more maintenance details, see [Maintenance Manual](../docs/user-manual/aup-remote-lab-user-manual-admin-new.md) - ---- - -## Troubleshooting - -### kubectl permission denied error - -If you encounter errors like: -``` -error: error loading config file "/etc/rancher/k3s/k3s.yaml": open /etc/rancher/k3s/k3s.yaml: permission denied -``` - -**For single-node deployments:** -The K3s installation command in this guide already includes `K3S_KUBECONFIG_MODE="644"` to prevent this issue. If you still encounter permission errors: - -```bash -# Verify file permissions -ls -la /etc/rancher/k3s/k3s.yaml - -# If needed, copy to user directory -mkdir -p ~/.kube -sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config -sudo chown $(id -u):$(id -g) ~/.kube/config -``` - -**For multi-node (Ansible) deployments:** -Add the following to your `inventory.yml` before running the playbook: -```yaml -k3s_cluster: - vars: - extra_server_args: "--write-kubeconfig-mode=644" -``` - -See [K3s Cluster Access](https://docs.k3s.io/cluster-access) for official documentation. 
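The manual checks above can be condensed into a small guard that scripts can call before running kubectl. A sketch (the helper name is illustrative):

```bash
# Sketch: succeeds when the given kubeconfig exists and is readable by the
# current user; the path defaults to the K3s location discussed above.
kubeconfig_readable() {
  [ -r "${1:-/etc/rancher/k3s/k3s.yaml}" ]
}
# Example: kubeconfig_readable || echo "copy it to ~/.kube/config (see above)"
```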
- -### Helm command not found - -If `helm` command is not found, verify the installation: -```bash -# Check if helm is in PATH -which helm - -# If not, ensure /usr/local/bin is in PATH -echo $PATH - -# Reinstall helm if needed (see Step 2 in deployment guides above) -``` - ---- - -## Advanced Configuration - -### Offline / Portable Operation - -The single-node deployment script automatically configures the system for offline and portable operation. When you run `./single-node.sh install`, it: - -1. **Creates a dummy network interface** (`dummy0`) with a stable IP address (`10.255.255.1`) -2. **Binds K3s to the dummy interface** using `--node-ip` and `--flannel-iface` -3. **Pre-pulls all required container images** to local storage -4. **Configures K3s to use local images** from `/var/lib/rancher/k3s/agent/images/` - -This ensures the cluster remains fully functional even when: -- External network is disconnected (network cable unplugged) -- WiFi network changes (connecting to different access points) -- No network is available at all - -**How it works**: K3s is bound to a stable dummy interface IP instead of the physical network interface. This means K3s doesn't care about external network changes - it always uses the same internal IP for cluster communication. - -**Reference**: [K3s Air-Gap Installation](https://docs.k3s.io/installation/airgap) - -### Mirror Configuration - -If you need to use alternative mirrors for container registries or package managers, you can configure them via environment variables when running the deployment script. - -#### Container Registry Mirror - -Set `MIRROR_PREFIX` to use a registry mirror that supports the **universal prefix mode**. 
The prefix will be prepended to all container image references: - -```bash -# Example: quay.io/jupyterhub/k8s-hub:4.1.0 becomes -# mirror.example.com/quay.io/jupyterhub/k8s-hub:4.1.0 - -MIRROR_PREFIX="mirror.example.com" ./single-node.sh install -``` - -> **Note**: This configuration works with registry mirrors that support the universal prefix pattern (e.g., `mirror/registry.k8s.io/image`). Some mirror services use per-registry subdomains instead (e.g., `k8s.mirror.org/image`), which require manual K3s registry configuration. See [K3s Private Registry Configuration](https://docs.k3s.io/installation/private-registry) for details. - -#### Package Manager Mirrors - -Set `MIRROR_PIP` and `MIRROR_NPM` to use alternative package repositories during image builds: - -```bash -MIRROR_PIP="https://pypi.example.com/simple" \ -MIRROR_NPM="https://registry.example.com" \ -./single-node.sh build-images -``` - -#### Combined Example - -```bash -MIRROR_PREFIX="mirror.example.com" \ -MIRROR_PIP="https://pypi.example.com/simple" \ -MIRROR_NPM="https://registry.example.com" \ -./single-node.sh install -``` - -For available environment variables, run: -```bash -./single-node.sh help +vim values-multi-nodes.yaml +helm upgrade --install jupyterhub ./chart -n jupyterhub --create-namespace -f values-multi-nodes.yaml ``` diff --git a/deploy/ansible/README.md b/deploy/ansible/README.md index f9d9887..3608aec 100644 --- a/deploy/ansible/README.md +++ b/deploy/ansible/README.md @@ -20,211 +20,40 @@ SOFTWARE. --> -# Ryzen AI PC Cluster Ansible Playbooks +# Ansible Playbooks -Note: This k3s ansible role is a modification based on [k3s-ansible](https://github.com/k3s-io/k3s-ansible/tree/master). +K3s cluster setup playbooks based on [k3s-ansible](https://github.com/k3s-io/k3s-ansible/tree/master). 
-**Official documentation**: -- [Ansible Documentation](https://docs.ansible.com/ansible/latest/index.html) -- [K3s Documentation](https://docs.k3s.io/) +For full instructions, see [Multi-Node Cluster Deployment](https://amdresearch.github.io/aup-learning-cloud/installation/multi-node.html). - - -## Prerequisites - -Before you start make sure you have setup these requirements: - -* **Ansible:** 2.18.3 or later (only required on the kube-controller node) -* **Python:** 3.12 -* **SSH:** root login allowed, and SSH keys synced across nodes -* **Hosts:** All nodes share the same `/etc/hosts` entries - -> **Tip:** It's recommended to test SSH connectivity from the controller node to all other nodes using key-based authentication with account `root`. - - - -## Initial Setup - -### 1. Select the K8s Controller Node - -Choose one node as the **Kubernetes controller**. This node will also act as the **Ansible control host**. - -### 2. Update `/etc/hosts` on Controller - -On the controller node, ensure `/etc/hosts` contains all cluster nodes -``` - host - node1 - node2 - node3 -... -``` - -This allows each node to resolve the others by hostname. - -> You can automate this with Ansible using a task that loops through all nodes and adds them to `/etc/hosts` (idempotent). - -### 3. Allow Root Login via SSH Key - -Ensure the root account on the controller node can login to all nodes using SSH keys: - -1. Create `.ssh` directory if it does not exist: - -```bash -sudo mkdir -p /root/.ssh -sudo chmod 700 /root/.ssh -``` - -2. Add your public key to `authorized_keys`: +## Quick Reference ```bash -sudo sh -c 'cat /home//.ssh/id_rsa.pub >> /root/.ssh/authorized_keys' -sudo chmod 600 /root/.ssh/authorized_keys -``` +# Configure inventory +vim inventory.yml -> This ensures Ansible can run tasks as root across all nodes without password prompts. - -Also you can use [this script](../scripts/setup_ssh_root_access.sh) to help you spread the public key to all nodes for root login. 
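The manual key-setup steps above amount to the loop sketched below. Node names are placeholders and the function is illustrative only; the linked `setup_ssh_root_access.sh` script remains the supported path.

```bash
# Sketch of what the key-spreading steps do per node (hypothetical node list;
# prefer ../scripts/setup_ssh_root_access.sh for real clusters).
spread_root_key() {
  local pubkey_file="$1"; shift
  local node
  for node in "$@"; do
    # create /root/.ssh if missing, append the key, fix permissions
    ssh "root@${node}" 'mkdir -p /root/.ssh && chmod 700 /root/.ssh && cat >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys' < "$pubkey_file"
  done
}
# Usage: spread_root_key ~/.ssh/id_rsa.pub node1 node2 node3
```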
- -### Notes - -* The controller node is the only one where Ansible needs to be installed. -* All other nodes only require Python 3.12 and SSH access for `root`. -* Maintaining consistent `/etc/hosts` entries is crucial for K3s cluster networking. - - - - -## Host Inventory -Config the host inventory in `inventory.yml` file. - -The server and agent nodes should be defined in the `inventory.yml` file. The server node is the node you install `ansible` and host `NFS` on (suggested). -Currently we only support one server node and multiple agent nodes. -The agent nodes are the nodes you want to join to the cluster which are required to have root access from the server node. -```ini -#Example ---- -k3s_cluster: - children: - server: - hosts: - : - host: - agent: - hosts: - : - strix1: - - # Required Vars - vars: - ansible_port: 22 - ansible_user: root - k3s_version: v1.32.3+k3s1 - # The token should be a random string of reasonable length. You can generate - # one with the following commands: - # - openssl rand -base64 64 - # - pwgen -s 64 1 - # You can use ansible-vault to encrypt this value / keep it secret. - # Or you can omit it if not using Vagrant and let the first server automatically generate one. - token: "changeme!" - api_endpoint: "{{ hostvars[groups['server'][0]]['ansible_host'] | default(groups['server'][0]) }}" -``` - -## Setup with k3s - -```bash -# Setup new nodes -cd deploy/ansible/ +# Base setup sudo ansible-playbook playbooks/pb-base.yml -sudo ansible-playbook playbooks/pb-k3s-site.yml -``` -> **πŸ’‘ Tip - Kubeconfig Permissions**: By default, K3s creates `/etc/rancher/k3s/k3s.yaml` with `600` permissions (root-only). 
To avoid "permission denied" errors, you can configure K3s to generate the kubeconfig with readable permissions by adding the following to your `inventory.yml`: -> -> ```yaml -> k3s_cluster: -> vars: -> extra_server_args: "--write-kubeconfig-mode=644" -> ``` -> -> This sets the kubeconfig file permissions to `644`, allowing all users to read the file. See [K3s Cluster Access](https://docs.k3s.io/cluster-access) for more details. - -## Add a new node -Make sure your private key has access to the node with root permission. -Insert new hostnames to agent section inside `inventory.yml`, add run site command again. -```bash +# Deploy K3s cluster sudo ansible-playbook playbooks/pb-k3s-site.yml -``` -## Reset -If you want to reset the entire cluster. All your data and config will be removed. -```bash -sudo ansible-playbook playbooks/pb-k3s-reset.yml -``` -After resetting whole cluster, do remove the folder `~/.kube`. - -Only reset a node.f -```bash -sudo ansible-playbook playbooks/pb-k3s-reset.yml --limit -``` - -## Install GPU driver - -You should both install GPU driver and ROCm software on each GPU node. -We are using ROCm 7.1.0, you can change the version in `pb-rocm.yml` file. -```bash +# Install ROCm GPU drivers sudo ansible-playbook playbooks/pb-rocm.yml -``` - -**Official documentation**: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html - -## Install NPU driver (TODO) -You should both install NPU driver and ROCm software on each NPU node. -Please check the exact filename in the `pb-npu.yml` file. -Here NPU-related settings are not ready for open-source review, this section remains for future use. - -```bash -#sudo ansible-playbook playbooks/pb-npu.yml -``` - -## Troubleshooting -### The Ansible process halts at `TASK [k3s_agent : Enable and check K3s service] ` - -Do check if the k3s service is running on the agent node. 
-```bash -ssh -sudo systemctl status k3s-agent.service -``` -If the service is not running, you can try to restart the service. -If the service is running but spamming connection error to the `k3s-host` machine, do check if you have setup the `/etc/hosts` file on `k3s-agent` machine correctly. - -### kubectl commands fail with "permission denied" error - -If you encounter errors like: -``` -error: error loading config file "/etc/rancher/k3s/k3s.yaml": open /etc/rancher/k3s/k3s.yaml: permission denied -``` - -**Solution 1 (Recommended)**: Configure K3s to generate kubeconfig with readable permissions by adding to `inventory.yml`: -```yaml -k3s_cluster: - vars: - extra_server_args: "--write-kubeconfig-mode=644" -``` - -Then re-run the playbook: -```bash +# Add new nodes (update inventory.yml first) sudo ansible-playbook playbooks/pb-k3s-site.yml -``` -**Solution 2**: Copy kubeconfig to user directory (already done by playbook if `user_kubectl: true`): -```bash -mkdir -p ~/.kube -sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config -sudo chown $(id -u):$(id -g) ~/.kube/config -export KUBECONFIG=~/.kube/config +# Reset cluster +sudo ansible-playbook playbooks/pb-k3s-reset.yml + +# Reset single node +sudo ansible-playbook playbooks/pb-k3s-reset.yml --limit ``` -See [K3s Cluster Access](https://docs.k3s.io/cluster-access) for more details. +## Prerequisites +- **Ansible**: 2.18.3+ (on controller node only) +- **Python**: 3.12 +- **SSH**: Root login with key-based auth to all nodes +- **Hosts**: Consistent `/etc/hosts` entries across all nodes diff --git a/docs/imgs/software-stack.png b/deploy/docs/images/software-stack.png similarity index 100% rename from docs/imgs/software-stack.png rename to deploy/docs/images/software-stack.png diff --git a/deploy/k8s/README.md b/deploy/k8s/README.md index 7b0bf1e..216258f 100644 --- a/deploy/k8s/README.md +++ b/deploy/k8s/README.md @@ -20,140 +20,34 @@ SOFTWARE. 
--> -# Ryzen AI PC Cluster Kubernetes Configuration +# Kubernetes Components -## Prerequisites +Kubernetes resource configurations for AUP Learning Cloud cluster. -### Install K9s (optional but recommended) -[K9s](https://github.com/derailed/k9s) provides a nice command line dashboard for cluster inspection. +For full instructions, see [Multi-Node Cluster Deployment](https://amdresearch.github.io/aup-learning-cloud/installation/multi-node.html). -```bash -wget https://github.com/derailed/k9s/releases/latest/download/k9s_linux_amd64.deb && \ -sudo apt install ./k9s_linux_amd64.deb && \ -rm k9s_linux_amd64.deb -``` - -### Install Helm - -Every release of Helm provides binary releases for a variety of OSes. These binary versions can be manually downloaded and installed. - -1. Download your [desired version](https://github.com/helm/helm/releases) -2. Unpack it (`tar -zxvf helm-linux-amd64.tar.gz`) -3. Find the helm binary in the unpacked directory, and move it to its desired destination (mv linux-amd64/helm /usr/local/bin/helm) - -Example: -```bash -wget https://get.helm.sh/helm-v3.17.2-linux-amd64.tar.gz -O /tmp/helm-linux-amd64.tar.gz -cd /tmp && tar -zxvf helm-linux-amd64.tar.gz -sudo mv /tmp/linux-amd64/helm /usr/local/bin/helm -rm /tmp/helm-linux-amd64.tar.gz -``` - -**Official documentation**: https://helm.sh/docs/intro/install/ - -Copy kube config to current user - -```bash -mkdir -p $HOME/.kube -sudo cp -i /etc/rancher/k3s/k3s.yaml $HOME/.kube/config -sudo chown $(id -u):$(id -g) $HOME/.kube/config -``` - -and then export KUBECONFIG in your `.bashrc` - -```bash -echo "export KUBECONFIG=$HOME/.kube/config" >> $HOME/.bashrc -source $HOME/.bashrc -``` +## Contents -## Deploy AMD GPU k8s Device Plugin +- `nfs-provisioner/` β€” NFS dynamic provisioner Helm values -To schedule GPU pods, we need to deploy the AMD GPU k8s device plugin. 
+## Quick Reference
 ```bash
+# Deploy AMD GPU device plugin
 kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
-```
-
-**Official documentation**: https://github.com/ROCm/k8s-device-plugin
-
-After deployment, you can see a new resource `amd.com/gpu` in `kubectl describe node`.
-
-```bash
-kubectl describe node | grep amd.com/gpu:
-  amd.com/gpu:  1
-```
-
-## Label Nodes According to GPU Architecture
+# Label nodes by GPU type
+kubectl label nodes <node-name> node-type=strix-halo
-Proper node labeling is essential for **scheduling workloads on the correct hardware**. Labels allow Kubernetes to distinguish between CPU-only nodes, discrete GPUs, and different GPU groups.
-
-> **⚠️ Important**: Always verify the actual node names first using `kubectl get nodes` before applying labels.
-
-The general syntax is: `kubectl label nodes <node-name> node-type=<label>`
-
-**Step 1: Get actual node names**
-```bash
-# List all nodes with their actual names
-kubectl get nodes
-
-# Example output:
-# NAME           STATUS   ROLES                  AGE   VERSION
-# phx-1          Ready    <none>                 1d    v1.32.3+k3s1
-# strix-1        Ready    <none>                 1d    v1.32.3+k3s1
-# strix-halo-1   Ready    control-plane,master   1d    v1.32.3+k3s1
-```
-
-**Step 2: Label nodes based on GPU/CPU architecture**
-
-You can use the provided script `scripts/label-node.sh` or label nodes manually:
-
-```bash
-#!/bin/bash
-
-# Label CPU nodes
-kubectl label nodes phx-1 node-type=phx
-kubectl label nodes phx-64g node-type=phx
-
-# Label discrete GPU nodes
-kubectl label nodes rdna4-1 node-type=dgpu
-
-# Label Strix GPU nodes
-kubectl label nodes strix-1 node-type=strix
-kubectl label nodes strix-2 node-type=strix
-kubectl label nodes strix-3 node-type=strix
-
-# Label Strix Halo GPU nodes
-kubectl label nodes strix-halo-1 node-type=strix-halo
-
-# Verify labels
-kubectl get nodes --show-labels | grep node-type
+# Verify GPU detection
+kubectl describe node | grep amd.com/gpu
 ```
-### Label Mapping
-
-| Node Group | Example Nodes     | node-type Label | Hardware Description                              |
-| ---------- | ----------------- | --------------- | ------------------------------------------------- |
-| phx        | phx-1, phx-64g    | `phx`           | Phoenix nodes (AMD 7940HS 7640HS)                 |
-| dgpu       | rdna4-1           | `dgpu`          | Discrete GPUs (AMD Radeon 7900XTX, 9070XT, W9700) |
-| strix      | strix-1 ~ strix-3 | `strix`         | Strix nodes (AMD AI 370 350)                      |
-| strix-halo | strix-halo-1      | `strix-halo`    | Strix-Halo nodes (AMD AI MAX 395)                 |
-
-> Using these labels, workloads can specify `nodeSelector` to schedule pods on nodes with the desired GPU type.
-
-
-
-## (Experimental) NPU non-sudo support in K8s
-
-To enable NPU non-sudo support in K3s, perform the following steps:
-1. For every NPU node, edit its `k3s-agent.service` unit to resolve the `ulimit` issue.
-2. Add the line `LimitMEMLOCK=infinity` to the service file.
-3. Reload systemd and restart the K3s service.
-4. The NPU can then be used inside K3s containers. Known issues include sudden hang-ups with `signal 15`.
-
-## Deploy Kubernetes components
-
-1. nfs-provisioner [nfs-provisioner/README.md](./nfs-provisioner/README.md)
-
+### Node Label Mapping
-Then you can deploy other applications you like.
+| node-type | Hardware |
+|---------------|----------|
+| `phx` | Phoenix (AMD 7940HS / 7640HS) |
+| `dgpu` | Discrete GPU (Radeon 9070XT, W9700) |
+| `strix` | Strix (AMD AI 370 / 350) |
+| `strix-halo` | Strix-Halo (AMD AI MAX 395) |
diff --git a/deploy/k8s/nfs-provisioner/README.md b/deploy/k8s/nfs-provisioner/README.md
index 4c7058b..0a03fc4 100644
--- a/deploy/k8s/nfs-provisioner/README.md
+++ b/deploy/k8s/nfs-provisioner/README.md
@@ -22,61 +22,20 @@ SOFTWARE.
 # NFS Provisioner
-## Install NFS Server
-On the kube-controller node:
-```bash
-sudo apt install nfs-kernel-server
-```
+Helm values for [nfs-subdir-external-provisioner](https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner).
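Once the provisioner chart is installed, workloads obtain persistent storage simply by referencing its StorageClass. A minimal claim sketch β€” the claim name is a placeholder, not from this repo:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim            # placeholder name
spec:
  storageClassName: nfs-client   # class created by the provisioner chart
  accessModes:
    - ReadWriteMany              # NFS-backed volumes can be shared across pods
  resources:
    requests:
      storage: 1Gi
```

If `nfs-client` is patched to be the default StorageClass, the `storageClassName` line can even be omitted.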
-## Create NFS Share
-```bash
-sudo mkdir -p /nfs
-sudo chown -R nobody:nogroup /nfs
-sudo chmod 777 /nfs
-```
+For full NFS setup instructions, see [Multi-Node Cluster Deployment](https://amdresearch.github.io/aup-learning-cloud/installation/multi-node.html#configure-nfs-storage).
-## Configure NFS Server
-```bash
-sudo nano /etc/exports
-```
+## Quick Reference
-Add the following line:
 ```bash
-/nfs (rw,sync,no_subtree_check,no_root_squash,insecure)
-```
-
-## Restart NFS Server
-```bash
-sudo systemctl restart nfs-kernel-server
-```
-
-## Install NFS Client
-On all nodes (via the ansible base role):
-```bash
-sudo apt install nfs-common
-```
-
-## Install NFS Provisioner
-
-![NFS Provisioner](../../docs/images/nfs.jpg)
-
-```bash
-## Add repository
+# Install
 helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
 helm repo update
-
 helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
-  --namespace nfs-provisioner \
-  --create-namespace \
-  -f deploy/k8s/nfs-provisioner/values.yaml
-```
-
-## Set Default StorageClass
+  --namespace nfs-provisioner --create-namespace \
+  -f values.yaml
-```bash
-# Set nfs-client as the default StorageClass
+# Set as default StorageClass
 kubectl patch storageclass nfs-client -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
-
-# Verify the default StorageClass
-kubectl get storageclass
 ```
diff --git a/docs/imgs/github-1.png b/docs/imgs/github-1.png deleted file mode 100644 index 9b96388..0000000 Binary files a/docs/imgs/github-1.png and /dev/null differ diff --git a/docs/imgs/github-10.png b/docs/imgs/github-10.png deleted file mode 100644 index 8723c95..0000000 Binary files a/docs/imgs/github-10.png and /dev/null differ diff --git a/docs/imgs/github-11.png b/docs/imgs/github-11.png deleted file mode 100644 index 2a9ed18..0000000 Binary files a/docs/imgs/github-11.png and /dev/null
differ diff --git a/docs/imgs/github-12.png b/docs/imgs/github-12.png deleted file mode 100644 index 101d0b0..0000000 Binary files a/docs/imgs/github-12.png and /dev/null differ diff --git a/docs/imgs/github-13.png b/docs/imgs/github-13.png deleted file mode 100644 index 64603d7..0000000 Binary files a/docs/imgs/github-13.png and /dev/null differ diff --git a/docs/imgs/github-2.png b/docs/imgs/github-2.png deleted file mode 100644 index b8ad0e2..0000000 Binary files a/docs/imgs/github-2.png and /dev/null differ diff --git a/docs/imgs/github-3.png b/docs/imgs/github-3.png deleted file mode 100644 index 4c61a72..0000000 Binary files a/docs/imgs/github-3.png and /dev/null differ diff --git a/docs/imgs/github-4.png b/docs/imgs/github-4.png deleted file mode 100644 index 1c367d5..0000000 Binary files a/docs/imgs/github-4.png and /dev/null differ diff --git a/docs/imgs/github-5.png b/docs/imgs/github-5.png deleted file mode 100644 index 73d9d83..0000000 Binary files a/docs/imgs/github-5.png and /dev/null differ diff --git a/docs/imgs/github-6.png b/docs/imgs/github-6.png deleted file mode 100644 index 47812b5..0000000 Binary files a/docs/imgs/github-6.png and /dev/null differ diff --git a/docs/imgs/github-7.png b/docs/imgs/github-7.png deleted file mode 100644 index 5cd542b..0000000 Binary files a/docs/imgs/github-7.png and /dev/null differ diff --git a/docs/imgs/github-8.png b/docs/imgs/github-8.png deleted file mode 100644 index 154e72c..0000000 Binary files a/docs/imgs/github-8.png and /dev/null differ diff --git a/docs/imgs/github-9.png b/docs/imgs/github-9.png deleted file mode 100644 index 8631302..0000000 Binary files a/docs/imgs/github-9.png and /dev/null differ diff --git a/docs/imgs/jupyterhub/quota-1-users-page.png b/docs/imgs/jupyterhub/quota-1-users-page.png deleted file mode 100644 index d9690a2..0000000 Binary files a/docs/imgs/jupyterhub/quota-1-users-page.png and /dev/null differ diff --git a/docs/imgs/jupyterhub/quota-2-inline-edit.png 
b/docs/imgs/jupyterhub/quota-2-inline-edit.png deleted file mode 100644 index 6994669..0000000 Binary files a/docs/imgs/jupyterhub/quota-2-inline-edit.png and /dev/null differ diff --git a/docs/imgs/jupyterhub/quota-3-batch-select.png b/docs/imgs/jupyterhub/quota-3-batch-select.png deleted file mode 100644 index 4ba353c..0000000 Binary files a/docs/imgs/jupyterhub/quota-3-batch-select.png and /dev/null differ diff --git a/docs/imgs/jupyterhub/quota-4-batch-modal.png b/docs/imgs/jupyterhub/quota-4-batch-modal.png deleted file mode 100644 index 2f233f5..0000000 Binary files a/docs/imgs/jupyterhub/quota-4-batch-modal.png and /dev/null differ diff --git a/docs/jupyterhub/How_to_Setup_GitHub_OAuth.md b/docs/jupyterhub/How_to_Setup_GitHub_OAuth.md deleted file mode 100644 index 13a1177..0000000 --- a/docs/jupyterhub/How_to_Setup_GitHub_OAuth.md +++ /dev/null @@ -1,265 +0,0 @@ - - - -# How to Setup GitHub OAuth for JupyterHub - -This guide will walk you through the process of setting up GitHub OAuth for your JupyterHub deployment. - -## Prerequisites - -1. A GitHub account -2. Administrative access to your JupyterHub deployment -3. Your JupyterHub domain/URL - -## Step 1: Create a New GitHub Organization - -1. Go to [github.com](https://github.com) and click on `+` icon in the top right -2. Click **New Organization** from the dropdown menu - - ![New Organization Option](../imgs/github-1.png) - -3. Fill in the organization details: - - Enter your **Organization name** (e.g., "AUP-INT-TEST") - - Provide a **Contact email** - - Select whether this organization belongs to "My personal account" or "A business or institution" - - Complete the verification puzzle - - Accept the Terms of Service - - Click **Next** to create the organization - - ![Organization Setup Form](../imgs/github-2.png) - -## Step 2: Create Teams to Assign Different Permissions - -Teams allow you to organize members and control access to different resources in your JupyterHub deployment. - -1. 
Navigate to your organization's **Teams** page -2. Click the **New team** button in the top right - - ![Teams Page](../imgs/github-3.png) - -3. Fill in the team creation form: - - **Team name**: Use the same name as the key in your `jupyterhub_config.py` (e.g., "cpu", "gpu", "npu", "official") - - **Description**: Add a description of what this team is for - - **Team visibility**: Select **Visible** (recommended) - this allows all organization members to see the team - - **Team notifications**: Choose whether to enable notifications - - Click **Create team** - - ![Create Team Form](../imgs/github-4.png) - -4. Repeat this process to create all the teams you need for your resource mapping (e.g., cpu, gpu, npu, official, public, test) - -## Step 3: Add Members to the Organization - -1. Go to the **People** tab in your organization -2. Click the **Invite member** button in the top right - - ![People Page](../imgs/github-5.png) - -3. In the invitation dialog: - - Enter the member's **email address or GitHub username** - - Click **Invite** - - ![Invite Member Dialog](../imgs/github-6.png) - -4. Assign the member to appropriate teams and roles: - - **Role in the organization**: - - Select **Member** for normal users (can see all members and be granted access to repositories) - - Select **Owner** for admin users (full administrative rights to the organization) - - **Teams**: Select the teams this member should belong to (e.g., cpu, gpu, official) - - Click **Send invitation** - - ![Role and Team Assignment](../imgs/github-7.png) - -5. Repeat this process for all members you want to add to your organization - -## Step 4: Create a GitHub App - -> **Note**: GitHub Apps are the recommended way to integrate with GitHub. They are created under the organization (not a personal account), support fine-grained permissions, and enable private repository access for users. - -1. 
Go to your organization's GitHub App creation page:
-   `https://github.com/organizations/<your-org>/settings/apps/new`
-
-   ![Organization Settings β†’ Developer settings β†’ GitHub Apps](../imgs/github-app-1.png)
-
-2. Fill in the basic information:
-   - **GitHub App name**: A unique name (e.g., "auplc-hub")
-   - **Homepage URL**: Your JupyterHub URL (e.g., `https://your.domain.com`)
-   - **Callback URL**: Your OAuth callback URL
-     - Single auth: `https://<your-domain>/hub/oauth_callback`
-     - Multi auth: `https://<your-domain>/hub/github/oauth_callback`
-
-   ![Basic information form](../imgs/github-app-2.png)
-
-   - **Expire user authorization tokens**: Check (recommended)
-   - **Request user authorization (OAuth) during installation**: Check
-   - **Webhook β†’ Active**: Uncheck (not needed)
-
-   ![Token expiration and webhook settings](../imgs/github-app-3.png)
-
-3. Set permissions:
-   - **Repository permissions**:
-     - `Contents`: Read-only (for cloning private repos)
-     - `Metadata`: Read-only (selected by default)
-
-   ![Repository permissions](../imgs/github-app-4.png)
-
-   - **Organization permissions**:
-     - `Members`: Read-only (for team-based access control)
-
-   ![Organization permissions](../imgs/github-app-5.png)
-
-4. Installation scope:
-   - **Where can this GitHub App be installed?**: Any account
-   - Click **Create GitHub App**
-
-   ![Installation scope and Create button](../imgs/github-app-6.png)
-
-5. After creation, note down the following:
-   - **Client ID**: Displayed on the App's General page (e.g., `Iv23liXXXXXX`)
-   - **Client secret**: Click **Generate a new client secret** β€” copy it immediately
-   - **App slug**: The URL-safe name in the App's URL (e.g., `auplc-hub`)
-
-   ![Client ID and secret generation](../imgs/github-app-7.png)
-
-## Step 5: Configure JupyterHub
-
-1. Open your deployment configuration file (`runtime/values.yaml` or an environment-specific override)
-
-2. Add the GitHub App configuration:
-
-   ```yaml
-   custom:
-     gitClone:
-       githubAppName: "your-app-slug"  # Enables private repo access & repo picker
-
-   hub:
-     config:
-       GitHubOAuthenticator:
-         oauth_callback_url: "https://<your-domain>/hub/github/oauth_callback"
-         client_id: "<client-id>"
-         client_secret: "<client-secret>"
-         allowed_organizations:
-           - <your-org>
-         scope: []  # GitHub App uses App-level permissions, not OAuth scopes
-   ```
-
-   > **Note**: `scope: []` is correct for GitHub Apps. Permissions (Contents, Members, etc.) are configured in the App settings on GitHub, not via OAuth scopes.
-
-3. Configure team-to-resource mapping in `values.yaml`:
-
-   ```yaml
-   custom:
-     teams:
-       mapping:
-         cpu:
-           - cpu
-         gpu:
-           - Course-CV
-           - Course-DL
-           - Course-LLM
-         official:
-           - cpu
-           - Course-CV
-           - Course-DL
-           - Course-LLM
-   ```
-
-4. Deploy:
-
-   ```bash
-   helm upgrade jupyterhub ./chart -n jupyterhub -f values.yaml
-   ```
-
-## Verification
-
-1. Navigate to your JupyterHub URL
-2. You should see a "Sign in with GitHub" button
-3. Click it and authorize the application
-4. You should be redirected back to JupyterHub and logged in
-5. Verify that users can only access resources based on their team membership
-
-## Troubleshooting
-
-- **OAuth callback error**: Ensure your callback URL exactly matches what you configured in GitHub (including HTTPS)
-- **Organization not found**: Verify the organization name in your configuration matches your GitHub organization exactly
-- **Users can't access resources**: Check that users are added to the correct teams in GitHub
-- **Authentication fails**: Verify your Client ID and Client Secret are correct and the secret hasn't expired
-
-## Migrating from OAuth App to GitHub App
-
-If you are currently using a legacy GitHub OAuth App, follow these steps to migrate:
-
-### Why Migrate?
- -| | OAuth App | GitHub App | -|---|---|---| -| **Ownership** | Personal account only | Organization-level | -| **Permissions** | Coarse OAuth scopes (`repo` = full read/write to ALL repos) | Fine-grained per-permission (e.g. Contents: read-only) | -| **Private repo access** | Requires `repo` scope (overly broad) | Per-repo authorization by user | -| **Staff changes** | App lost if owner leaves | Org admins retain control | - -### Migration Steps - -1. **Create a GitHub App** under your organization (see [Step 4](#step-4-create-a-github-app) above) - -2. **Update `values.yaml`** β€” change 3 fields, add 1: - - ```yaml - custom: - gitClone: - githubAppName: "your-app-slug" # NEW β€” add this - - hub: - config: - GitHubOAuthenticator: - client_id: "" # CHANGE β€” from OAuth App's ID - client_secret: "" # CHANGE β€” from OAuth App's secret - scope: [] # CHANGE β€” was [read:user, read:org] - # allowed_organizations, oauth_callback_url β€” keep unchanged - ``` - -3. **Deploy**: - - ```bash - helm upgrade jupyterhub ./chart -n jupyterhub -f values.yaml - ``` - -4. **User impact**: - - Existing logged-in sessions continue to work - - On next login, users go through the new GitHub App OAuth flow (same experience) - - Users who want private repo access can authorize repos on the spawn page - -5. **Clean up**: Once all users have re-logged, delete the old OAuth App from GitHub (Settings β†’ Developer settings β†’ OAuth Apps) - -## Security Best Practices - -1. Always use HTTPS for your JupyterHub deployment -2. Keep your Client Secret secure and never commit it to version control -3. Regularly review organization members and their team assignments -4. Use environment variables or secret management systems for storing OAuth credentials -5. Create the GitHub App under the organization (not a personal account) so it survives staff changes -6. 
Set minimal App permissions β€” Contents (read-only) and Members (read-only) are sufficient - -## Additional Resources - -- [JupyterHub Documentation](https://jupyterhub.readthedocs.io/) -- [GitHub Apps Documentation](https://docs.github.com/en/apps) -- [OAuthenticator Documentation](https://oauthenticator.readthedocs.io/) diff --git a/docs/jupyterhub/README.md b/docs/jupyterhub/README.md deleted file mode 100644 index 435b86a..0000000 --- a/docs/jupyterhub/README.md +++ /dev/null @@ -1,385 +0,0 @@ - - - -# JupyterHub Configuration Guide - -## Documentation - -- [Authentication Guide](./authentication-guide.md) - Setup GitHub OAuth and native authentication -- [User Management Guide](./user-management.md) - Batch user operations with scripts -- [User Quota System](./quota-system.md) - Resource usage tracking and quota management -- [GitHub OAuth Setup](./How_to_Setup_GitHub_OAuth.md) - Step-by-step OAuth configuration - ---- - -## Configuration Files Overview - -The Helm chart uses a layered configuration approach: - -| File | Purpose | -|------|---------| -| `runtime/chart/values.yaml` | Chart defaults (accelerators, resources, teams, quota settings) | -| `runtime/values.yaml` | Deployment overrides (environment-specific settings) | -| `runtime/values.local.yaml` | Local development overrides (gitignored) | - -### Helm Merge Behavior - -- **Maps/Objects**: Deep merge (new keys added, same keys override) -- **Arrays/Lists**: Complete replacement - -Deploy with: -```bash -# Production -helm upgrade jupyterhub ./chart -n jupyterhub -f values.yaml - -# Local development -helm upgrade jupyterhub ./chart -n jupyterhub -f values.yaml -f values.local.yaml -``` - ---- - -## Custom Configuration - -All custom settings are under the `custom` section. Chart defaults are in `runtime/chart/values.yaml`. 
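The merge rules above have a practical consequence: overriding a map adds or overrides individual keys, while overriding a list silently discards the chart's entries. An illustrative pair of files (the values are invented for the example):

```yaml
# chart/values.yaml (chart defaults)
custom:
  quota:
    defaultQuota: 0
  teams:
    mapping:
      gpu: [Course-CV, Course-DL]

# values.yaml (deployment override)
custom:
  quota:
    minimumToStart: 10      # map merge: defaultQuota is kept, this key is added
  teams:
    mapping:
      gpu: [Course-LLM]     # list replacement: Course-CV and Course-DL are gone
```

So to extend a team's resource list you must restate the full list, not just the additions.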
- -### Authentication Mode - -```yaml -custom: - authMode: "auto-login" # auto-login | dummy | github | multi -``` - -| Mode | Description | -|------|-------------| -| `auto-login` | No credentials, auto-login as 'student' (single-node dev) | -| `dummy` | Accept any username/password (testing) | -| `github` | GitHub OAuth only | -| `multi` | GitHub OAuth + Local accounts | - -### Admin User Auto-Creation - -```yaml -custom: - adminUser: - enabled: true # Generate admin credentials on install -``` - -When enabled, credentials are stored in a Kubernetes secret: -```bash -# Get admin password -kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o jsonpath='{.data.admin-password}' | base64 -d - -# Get API token -kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o jsonpath='{.data.api-token}' | base64 -d -``` - -### Accelerators (GPU/NPU) - -Define available hardware accelerators: - -```yaml -custom: - accelerators: - phx: - displayName: "AMD Radeon 780M (Phoenix Point iGPU)" - description: "RDNA 3.0 (gfx1103) | Compute Units 12 | 4GB LPDDR5X" - nodeSelector: - node-type: phx - env: - HSA_OVERRIDE_GFX_VERSION: "11.0.0" - quotaRate: 2 - my-custom-gpu: - displayName: "My Custom GPU" - nodeSelector: - node-type: my-gpu - quotaRate: 3 -``` - -### Resources (Images & Requirements) - -Define container images and resource requirements: - -```yaml -custom: - resources: - images: - cpu: "ghcr.io/amdresearch/auplc-default:latest" - Course-CV: "ghcr.io/amdresearch/auplc-cv:latest" - my-course: "my-registry/my-image:latest" - - requirements: - cpu: - cpu: "2" - memory: "4Gi" - memory_limit: "6Gi" - Course-CV: - cpu: "4" - memory: "16Gi" - memory_limit: "24Gi" - amd.com/gpu: "1" - my-course: - cpu: "4" - memory: "8Gi" -``` - -### Teams Mapping - -Map teams to allowed resources: - -```yaml -custom: - teams: - mapping: - cpu: - - cpu - gpu: - - Course-CV - - Course-DL - native-users: - - cpu - - Course-CV -``` - -**Note**: Arrays are completely replaced 
when overriding. If you override `teams.mapping.gpu`, the entire list is replaced, not merged. - -### Quota System - -```yaml -custom: - quota: - enabled: null # null = auto-detect based on authMode - cpuRate: 1 # Quota rate for CPU-only containers - minimumToStart: 10 # Minimum quota to start a container - defaultQuota: 0 # Default quota for new users - - refreshRules: - daily-topup: - enabled: true - schedule: "0 0 * * *" - action: add - amount: 100 - maxBalance: 500 - targets: - includeUnlimited: false - balanceBelow: 400 -``` - -See [quota-system.md](./quota-system.md) for detailed documentation. - ---- - -## Hub Configuration - -### Hub Image - -```yaml -hub: - image: - name: ghcr.io/amdresearch/auplc-hub - tag: latest - pullPolicy: IfNotPresent -``` - -### Login Page Announcement - -```yaml -hub: - extraFiles: - announcement.txt: - mountPath: /usr/local/share/jupyterhub/static/announcement.txt - stringData: | -
-      Welcome!
-
-      Your announcement here.
-```
-
-### GitHub Authentication
-
-GitHub authentication uses a [GitHub App](https://docs.github.com/en/apps) for login and optional private repository access.
-
-```yaml
-custom:
-  gitClone:
-    githubAppName: "your-app-slug"  # Enables private repo access & repo picker
-
-hub:
-  config:
-    GitHubOAuthenticator:
-      oauth_callback_url: "https://<your-domain>/hub/github/oauth_callback"
-      client_id: "<client-id>"
-      client_secret: "<client-secret>"
-      allowed_organizations:
-        - YOUR-ORG-NAME
-      scope: []  # GitHub App uses App-level permissions, not OAuth scopes
-```
-
-See [How_to_Setup_GitHub_OAuth.md](./How_to_Setup_GitHub_OAuth.md) for setup instructions.
-
-### Git Repository Cloning
-
-Resources with `allowGitClone: true` show a Git URL input on the spawn page. Users can clone a repository at container startup.
-
-```yaml
-custom:
-  gitClone:
-    allowedProviders:
-      - github.com
-      - gitlab.com
-    maxCloneTimeout: 300
-```
-
-#### Private Repository Access
-
-Two independent mechanisms (they can be used together):
-
-**1. GitHub App** β€” for GitHub OAuth users
-
-Set `githubAppName` (see above). Users authorize per-repo read-only access via the GitHub App UI. A repo picker appears on the spawn page showing authorized repositories.
-
-**2. Default Access Token** β€” for all users, including auto-login
-
-```yaml
-custom:
-  gitClone:
-    defaultAccessToken: "ghp_xxxx"  # Bot/service account PAT
-```
-
-An admin-configured PAT is applied transparently to all users, and Helm auto-creates a K8s Secret for it. This is useful for classroom / single-node setups where everyone needs access to the same private repos without GitHub login.
- -Token priority: OAuth token (GitHub App) > defaultAccessToken > none (public only) - ---- - -## Network Settings - -### NodePort Access - -```yaml -proxy: - service: - type: NodePort - nodePorts: - http: 30890 - -ingress: - enabled: false -``` - -Access via `http://:30890` - -### Domain Access with Ingress - -```yaml -proxy: - service: - type: ClusterIP - -ingress: - enabled: true - ingressClassName: traefik - hosts: - - your.domain.com - tls: - - hosts: - - your.domain.com - secretName: jupyter-tls-cert -``` - ---- - -## Storage Settings - -### NFS Storage (Production) - -```yaml -hub: - db: - pvc: - storageClassName: nfs-client - -singleuser: - storage: - dynamic: - storageClass: nfs-client -``` - -See [deploy/k8s/nfs-provisioner](../../deploy/k8s/nfs-provisioner) for NFS setup. - -### Local Storage (Development) - -```yaml -hub: - db: - pvc: - storageClassName: hostpath - -singleuser: - storage: - dynamic: - storageClass: hostpath -``` - ---- - -## PrePuller Settings - -Pre-download images to all nodes for faster container startup: - -```yaml -prePuller: - hook: - enabled: true - continuous: - enabled: true - extraImages: - aup-cpu-notebook: - name: ghcr.io/amdresearch/auplc-default - tag: latest -``` - -For faster deployment (images pulled on-demand): - -```yaml -prePuller: - hook: - enabled: false - continuous: - enabled: false -``` - ---- - -## Applying Changes - -After modifying configuration: - -```bash -helm upgrade jupyterhub ./chart -n jupyterhub -f values.yaml -``` - -Or use the helper script: - -```bash -./scripts/helm_upgrade.bash -``` diff --git a/docs/jupyterhub/authentication-guide.md b/docs/jupyterhub/authentication-guide.md deleted file mode 100644 index 076fd19..0000000 --- a/docs/jupyterhub/authentication-guide.md +++ /dev/null @@ -1,443 +0,0 @@ -# Authentication Guide - -This guide covers the dual authentication system and user management for AUP Learning Cloud. 
- -## Table of Contents - -- [Overview](#overview) -- [Authentication Methods](#authentication-methods) -- [Configuration](#configuration) -- [Admin Management](#admin-management) -- [User Management](#user-management) -- [Deployment](#deployment) -- [Troubleshooting](#troubleshooting) - -## Overview - -AUP Learning Cloud supports **dual authentication** to accommodate different user types: - -1. **GitHub OAuth**: For technical teams and organization members -2. **Native Authenticator**: For students and external users (admin-managed accounts) - -### Key Features - -- **Auto-admin on install**: Initial admin created automatically with random password -- **No self-registration**: Only admins can create native accounts -- **Individual passwords**: Each user has their own password (can be changed) -- **Unified admin panel**: All users managed in `/hub/admin` -- **Batch operations**: CSV/Excel-based bulk user management -- **Script-based admin management**: Use `set-admin` command - -## Authentication Methods - -### Comparison - -| Feature | GitHub OAuth | Native Authenticator | -|---------|--------------|---------------------| -| **User Type** | Technical teams, org members | Students, external users | -| **Account Creation** | GitHub organization invite | Admin creates manually | -| **Password** | Managed by GitHub | User-defined (changeable) | -| **Access Control** | Based on GitHub teams | Based on username patterns or groups | -| **Best For** | Staff, developers, researchers | Course students, temporary users | - -### Login Flow - -Users see a combined login page with GitHub OAuth button and native login form: - -``` -β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” -β”‚ JupyterHub Login β”‚ -β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ -β”‚ [ Sign in with GitHub ] β”‚ ← GitHub OAuth -β”‚ β”‚ -β”‚ ─── Or use local account ─── β”‚ -β”‚ β”‚ -β”‚ 
Username: [____________] β”‚ -β”‚ Password: [____________] β”‚ ← Native Auth -β”‚ [ Sign In ] β”‚ -β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ -``` - -## Configuration - -### Enable Auto-Admin (Recommended) - -In `runtime/values.yaml` or `runtime/values-local.yaml`: - -```yaml -custom: - adminUser: - enabled: true # Auto-create admin on helm install -``` - -This automatically: -1. Generates random API token and admin password -2. Creates `jupyterhub-admin-credentials` secret -3. Configures `admin` user with admin privileges on hub startup - -The API token is associated with the `admin` user for script operations. - -### Get Credentials After Install - -```bash -# Get admin password -kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o go-template='{{index .data "admin-password" | base64decode}}' - -# Get API token for scripts -export JUPYTERHUB_TOKEN=$(kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o go-template='{{index .data "api-token" | base64decode}}') -``` - -### Resource Access Mapping - -Configure which resources different user groups can access in `jupyterhub_config.py`: - -```python -TEAM_RESOURCE_MAPPING = { - "cpu": ["cpu"], - "gpu": ["Course-CV","Course-DL","Course-LLM"], - "official": ["cpu", "Course-CV","Course-DL","Course-LLM"], - "AUP": ["Course-CV","Course-DL","Course-LLM"], - "native-users": ["Course-CV","Course-DL","Course-LLM"] -} -``` - -### Native Authenticator Settings - -The `CustomFirstUseAuthenticator` class settings: - -```python -class CustomFirstUseAuthenticator(FirstUseAuthenticator): - service_name = "Native" - create_users = False # Only admin-created users can login -``` - -**Important**: `create_users = False` prevents users from creating accounts themselves. Users must be created by an admin first. - -## Admin Management - -### Initial Admin - -Created automatically on `helm install` when `custom.adminUser.enabled: true`. 
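Regarding the `TEAM_RESOURCE_MAPPING` shown earlier: a user who belongs to several teams effectively gets the union of the mapped resource lists. The hub's actual resolution code is not shown in this guide; a sketch of the idea, where the helper name is ours and not from the repo:

```python
TEAM_RESOURCE_MAPPING = {
    "cpu": ["cpu"],
    "gpu": ["Course-CV", "Course-DL", "Course-LLM"],
    "official": ["cpu", "Course-CV", "Course-DL", "Course-LLM"],
}

def allowed_resources(user_teams):
    """Order-preserving union of resources across a user's teams."""
    seen = []
    for team in user_teams:
        for resource in TEAM_RESOURCE_MAPPING.get(team, []):
            if resource not in seen:
                seen.append(resource)
    return seen

print(allowed_resources(["cpu", "gpu"]))
# ['cpu', 'Course-CV', 'Course-DL', 'Course-LLM']
```

Unknown teams contribute nothing, which mirrors the fail-closed behavior you want for access control.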
- -### Adding More Admins - -Use the `set-admin` command (NOT config files): - -```bash -# Grant admin to users -python scripts/manage_users.py set-admin teacher01 teacher02 - -# Grant admin from file -python scripts/manage_users.py set-admin -f new_admins.csv - -# Revoke admin -python scripts/manage_users.py set-admin --revoke student01 -``` - -Or via Admin Panel (`/hub/admin`): -1. Click username -2. Check "Admin" checkbox -3. Save - -### Admin Users Summary - -| Method | When to Use | -|--------|-------------| -| `custom.adminUser.enabled: true` | Initial admin on install | -| `manage_users.py set-admin` | Add/remove admins later | -| Admin Panel `/hub/admin` | Quick single-user changes | - -## User Management - -### Prerequisites - -Install required dependencies: - -```bash -pip install pandas openpyxl requests -``` - -Set environment variables: - -```bash -export JUPYTERHUB_URL="http://localhost:30890" # Or your hub URL -export JUPYTERHUB_TOKEN=$(kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o go-template='{{index .data "api-token" | base64decode}}') -``` - -### Batch User Operations - -For detailed batch user management, see [User Management Guide](user-management.md). - -**Quick Reference**: - -```bash -# Generate user template -python scripts/generate_users_template.py --prefix student --count 50 --output users.csv - -# Create users from file -python scripts/manage_users.py create users.csv - -# List all users -python scripts/manage_users.py list - -# Grant admin privileges -python scripts/manage_users.py set-admin teacher01 teacher02 - -# Revoke admin privileges -python scripts/manage_users.py set-admin --revoke student01 - -# Export/backup users -python scripts/manage_users.py export backup.xlsx - -# Delete users -python scripts/manage_users.py delete remove_list.csv -``` - -### Manual User Management - -Via JupyterHub Admin Panel (`/hub/admin`): - -1. **Add single user**: Click "Add Users" button -2. 
**Delete user**: Click username β†’ "Delete User" -3. **Make admin**: Click username β†’ Check "Admin" β†’ "Save" -4. **Stop user server**: Click "Stop Server" button - -## Deployment - -### 1. Update Hub Image - -The Hub Docker image must include the `jupyterhub-nativeauthenticator` dependency. - -**File**: `dockerfiles/Hub/Dockerfile` - -```dockerfile -RUN pip install jupyterhub-multiauthenticator jupyterhub-nativeauthenticator -``` - -### 2. Rebuild Hub Image - -```bash -cd dockerfiles/Hub -./build.sh - -# Or manually: -docker build -t ghcr.io/amdresearch/aup-jupyterhub-hub:v1.4.0-dual-auth . -docker push ghcr.io/amdresearch/aup-jupyterhub-hub:v1.4.0-dual-auth -``` - -### 3. Update Helm Values - -**File**: `runtime/values.yaml` - -```yaml -custom: - adminUser: - enabled: true - -hub: - image: - name: ghcr.io/amdresearch/aup-jupyterhub-hub - tag: v1.4.0-dual-auth -``` - -### 4. Deploy or Upgrade - -**Production (K3s)**: - -```bash -cd runtime -bash scripts/helm_upgrade.bash -``` - -**Local Development (Docker Desktop)**: - -```bash -cd runtime -helm upgrade jupyterhub ./chart --namespace jupyterhub -f values-local.yaml -``` - -### 5. Verify Deployment - -```bash -# Check hub pod is running -kubectl --namespace=jupyterhub get pods - -# Check hub logs for admin setup -kubectl --namespace=jupyterhub logs deployment/hub | grep -i admin - -# Get admin password -kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o jsonpath='{.data.admin-password}' | base64 -d && echo -``` - -## Troubleshooting - -### Issue: "Module 'nativeauthenticator' not found" - -**Cause**: Hub image doesn't have `jupyterhub-nativeauthenticator` installed. - -**Solution**: -```bash -# Rebuild Hub image with updated Dockerfile -cd dockerfiles/Hub -./build.sh -``` - -### Issue: Users can self-register - -**Cause**: `create_users` is not set to `False`. 
- -**Solution**: Verify in `jupyterhub_config.py`: -```python -class CustomFirstUseAuthenticator(FirstUseAuthenticator): - create_users = False -``` - -### Issue: API token authentication fails - -**Symptoms**: -``` -❌ Connection failed with status 403 -``` - -**Solutions**: - -1. **Verify secret exists**: - ```bash - kubectl -n jupyterhub get secret jupyterhub-admin-credentials - ``` - -2. **Check token is loaded**: - ```bash - kubectl -n jupyterhub logs deployment/hub | grep "API token" - ``` - -3. **Regenerate credentials**: - ```bash - kubectl -n jupyterhub delete secret jupyterhub-admin-credentials - helm upgrade jupyterhub ./chart -n jupyterhub -f values-local.yaml - ``` - -### Issue: Admin not created on install - -**Symptoms**: No admin user after helm install. - -**Solutions**: - -1. **Verify config**: - ```yaml - custom: - adminUser: - enabled: true # Must be true - ``` - -2. **Check hub logs**: - ```bash - kubectl -n jupyterhub logs deployment/hub | grep -i "admin" - ``` - -3. **Restart hub**: - ```bash - kubectl -n jupyterhub rollout restart deployment/hub - ``` - -### Issue: Wrong resource permissions - -**Symptoms**: User sees courses they shouldn't access, or missing expected courses. - -**Solution**: Check resource mapping in `jupyterhub_config.py`: - -```python -# Verify user's group assignment in CustomFirstUseAuthenticator -async def authenticate(self, handler, data): - result = await super().authenticate(handler, data) - if result: - username = result.get("name", "").strip().upper() - if "AUP" in username: - result["group"] = "AUP" - # ... - -# Verify group has correct resources -TEAM_RESOURCE_MAPPING = { - "AUP": ["Course-CV","Course-DL","Course-LLM"], - # ... -} -``` - -### Issue: Users created but can't login - -**Cause**: Native users must set password on first login. - -**Solution**: - -1. User should go to login page -2. Enter username created by admin -3. Enter desired password -4. 
Password is saved for future logins - -**Alternative**: Set passwords via script: -```bash -python scripts/manage_users.py set-passwords users.csv --generate -``` - -### Issue: Batch script fails to connect - -**Symptoms**: -``` -❌ Connection error: Connection refused -``` - -**Solutions**: - -1. **Verify JupyterHub URL**: - ```bash - # For local dev - export JUPYTERHUB_URL="http://localhost:30890" - - # For production - export JUPYTERHUB_URL="https://your-domain.com" - ``` - -2. **Check if hub is accessible**: - ```bash - curl $JUPYTERHUB_URL/hub/api/ - ``` - -3. **Verify namespace and port forwarding** (if using kubectl): - ```bash - kubectl --namespace=jupyterhub port-forward service/hub 8081:8081 - export JUPYTERHUB_URL="http://localhost:8081" - ``` - -## Security Best Practices - -1. **Protect API tokens**: - - Tokens are stored in Kubernetes secrets - - Never commit tokens to git - - Rotate tokens by deleting and recreating the secret - -2. **User password policy**: - - Encourage strong passwords - - Consider adding password complexity requirements - -3. **Admin accounts**: - - Limit number of admin users - - Use `set-admin` command to manage (auditable) - - Review admin list regularly - -4. 
**GitHub App**: - - Create the App under the organization, not a personal account - - Keep GitHub organization membership updated - - Review team permissions regularly - - Set `scope: []` β€” permissions are configured in the App settings - -## Additional Resources - -- [User Management Guide](user-management.md) - Batch user operations and scripts -- [JupyterHub Documentation](https://jupyterhub.readthedocs.io/) -- [NativeAuthenticator Documentation](https://native-authenticator.readthedocs.io/) -- [JupyterHub REST API](https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html) -- [OAuthenticator Documentation](https://oauthenticator.readthedocs.io/) diff --git a/docs/jupyterhub/quota-system.md b/docs/jupyterhub/quota-system.md deleted file mode 100644 index eb30f23..0000000 --- a/docs/jupyterhub/quota-system.md +++ /dev/null @@ -1,626 +0,0 @@ -# User Quota System - -The quota system manages and tracks resource usage for JupyterHub users. It enables administrators to allocate, monitor, and control how much compute time users can consume. - -## Overview - -### Key Concepts - -- **Balance**: The amount of quota credits a user has available -- **Rate**: The cost per minute for different resource types (CPU, GPU, NPU) -- **Session**: A tracked period of container usage that consumes quota -- **Unlimited Quota**: Special status exempting users from quota deductions - -### How It Works - -1. User requests to start a container -2. System checks if user has sufficient quota for the estimated runtime -3. If sufficient, a usage session begins tracking time -4. When container stops, actual usage is calculated and quota is deducted -5. 
Formula: `quota_consumed = rate × duration_minutes`
-
-## Configuration
-
-Configure the quota system in your Helm `values.yaml` under `custom.quota`:
-
-```yaml
-custom:
-  quota:
-    enabled: null        # Enable/disable quota system (null = auto-detect based on authMode)
-    cpuRate: 1           # Cost per minute for CPU-only containers
-    minimumToStart: 10   # Minimum quota required to start any container
-    defaultQuota: 0      # Default quota granted to new users (0 = no initial allocation)
-```
-
-### New User Default Quota
-
-The `defaultQuota` setting controls how much quota new users receive automatically:
-
-- **Value `0` (default)**: New users start with zero quota and must be manually granted quota by an administrator
-- **Value `> 0`**: New users are automatically granted this amount when they first attempt to use the system
-
-**Example configuration:**
-```yaml
-custom:
-  quota:
-    defaultQuota: 100  # Automatically grant 100 quota units to new users
-```
-
-#### How It Works
-
-This automatic allocation happens when a new user first tries to start a container. The system will:
-1. Check if the user has a quota record in the database
-2. If not found and `defaultQuota > 0`, create a record with the default amount
-3. Record this as an "initial_grant" transaction in the audit log
-
-### Unlimited Quota
-
-Unlimited quota status is managed via the Admin UI. To grant unlimited quota to a user:
-1. Go to the Admin Panel (`/hub/admin/users`)
-2. Click on the user's quota value
-3. Enter `-1`, `∞`, or `unlimited` in the input field
-4. 
Click Save - -### Quota Rates by Resource Type - -Each accelerator type can have a different quota rate defined in `custom.accelerators`: - -```yaml -custom: - accelerators: - phx: - quotaRate: 2 # Phoenix iGPU: 2 quota/minute - strix: - quotaRate: 2 # Strix iGPU: 2 quota/minute - strix-halo: - quotaRate: 3 # Strix Halo iGPU: 3 quota/minute - dgpu: - quotaRate: 4 # Discrete GPU: 4 quota/minute - strix-npu: - quotaRate: 1 # NPU: 1 quota/minute -``` - -### Auto-Enable Behavior - -The quota system is automatically enabled/disabled based on authentication mode: -- **Enabled**: `github`, `multi` authentication modes -- **Disabled**: `auto-login`, `dummy` modes (typically for development) - -### Scheduled Quota Refresh (CronJob) - -You can configure automatic quota refresh rules that run on a schedule. Each rule creates a Kubernetes CronJob. - -#### Basic Setup - -Add refresh rules to your `values.yaml`: - -```yaml -custom: - quota: - refreshRules: - daily-topup: - enabled: true - schedule: "0 0 * * *" # Every day at midnight - action: add # add or set - amount: 100 # Add 100 quota daily - maxBalance: 500 # Cap balance at 500 - targets: - includeUnlimited: false # Skip unlimited users -``` - -#### Configuration Options - -| Field | Type | Description | -|-------|------|-------------| -| `enabled` | bool | Enable/disable this rule | -| `schedule` | string | Cron expression (e.g., `"0 0 * * *"` = daily at midnight) | -| `action` | string | `add` (default) or `set` | -| `amount` | int | Quota amount (positive to add, negative to deduct) | -| `maxBalance` | int | Maximum balance cap (for `add` action) | -| `minBalance` | int | Minimum balance floor (prevents going negative) | -| `targets` | object | Targeting rules (see below) | - -#### Targeting Options - -Filter which users are affected: - -| Target | Type | Description | -|--------|------|-------------| -| `includeUnlimited` | bool | Include users with unlimited quota | -| `balanceBelow` | int | Only users with balance 
below this value |
-| `balanceAbove` | int | Only users with balance above this value |
-| `includeUsers` | list | Only these specific usernames |
-| `excludeUsers` | list | Exclude these usernames |
-| `usernamePattern` | string | Regex pattern (e.g., `"^student_.*"`) |
-
-#### Example Rules
-
-```yaml
-custom:
-  quota:
-    refreshRules:
-      # Daily top-up: Add 100 to users below 400
-      daily-topup:
-        enabled: true
-        schedule: "0 0 * * *"
-        action: add
-        amount: 100
-        maxBalance: 500
-        targets:
-          includeUnlimited: false
-          balanceBelow: 400
-
-      # Monthly reset: Set everyone to 500
-      monthly-reset:
-        enabled: true
-        schedule: "0 0 1 * *"  # 1st of each month
-        action: set
-        amount: 500
-        targets:
-          includeUnlimited: false
-
-      # Weekly decay: Deduct 50 from users above 100
-      weekly-decay:
-        enabled: false
-        schedule: "0 0 * * 0"  # Every Sunday
-        amount: -50
-        minBalance: 0
-        targets:
-          balanceAbove: 100
-```
-
-#### Cron Schedule Syntax
-
-```
-┌───────────── minute (0-59)
-│ ┌───────────── hour (0-23)
-│ │ ┌───────────── day of month (1-31)
-│ │ │ ┌───────────── month (1-12)
-│ │ │ │ ┌───────────── day of week (0-6, Sun=0)
-│ │ │ │ │
-* * * * *
-```
-
-Common examples:
-- `"0 0 * * *"` - Daily at midnight
-- `"0 8 * * 1-5"` - Weekdays at 8:00 AM
-- `"0 0 1 * *"` - 1st of each month at midnight
-- `"0 0 * * 0"` - Every Sunday at midnight
-- `"*/30 * * * *"` - Every 30 minutes
-
-#### Verify CronJobs
-
-After `helm upgrade`, check if CronJobs are created:
-
-```bash
-# List quota refresh CronJobs
-kubectl -n jupyterhub get cronjobs -l app.kubernetes.io/component=quota-refresh
-
-# Check recent job runs
-kubectl -n jupyterhub get jobs -l app.kubernetes.io/component=quota-refresh
-
-# View logs from last run
-kubectl -n jupyterhub logs -l app.kubernetes.io/component=quota-refresh --tail=50
-```
-
-## Admin 
Operations
-
-### Web Interface (Admin Panel)
-
-The Admin Panel provides a graphical interface for quota management. Access it at `/hub/admin/users`.
-
-![Users Page with Quota](../imgs/jupyterhub/quota-1-users-page.png)
-
-#### Quota Column
-
-When the quota system is enabled, the Users page displays a **Quota** column showing each user's balance.
-
-#### Inline Editing
-
-Click directly on a user's quota value to edit:
-- Enter a new number and press **Enter** to save
-- Press **Escape** to cancel
-- Enter `-1`, `∞`, or `unlimited` to grant unlimited status
-
-![Inline Quota Editing](../imgs/jupyterhub/quota-2-inline-edit.png)
-
-#### Batch Operations
-
-1. Select multiple users using checkboxes
-2. Click the "Set Quota" button (appears when users are selected)
-
-![Batch Select Users](../imgs/jupyterhub/quota-3-batch-select.png)
-
-3. Enter the quota value:
-   - A number (e.g., `500`) to set exact balance
-   - `-1`, `∞`, or `unlimited` to grant unlimited status
-4. Click "Apply" to update all selected users
-
-![Batch Quota Modal](../imgs/jupyterhub/quota-4-batch-modal.png)
-
-### REST API Endpoints
-
-All admin endpoints require authentication. Admin-specific operations require admin privileges.
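A script authenticates by sending the admin API token in an `Authorization: token …` header. As a quick illustration, this standard-library Python sketch builds (but does not send) such a request; the helper name and default path here are illustrative:

```python
import urllib.request


def quota_request(hub_url: str, token: str, path: str = "/admin/api/quota/") -> urllib.request.Request:
    """Build an authenticated request for the quota admin API (nothing is sent yet)."""
    return urllib.request.Request(
        hub_url.rstrip("/") + path,
        headers={"Authorization": f"token {token}"},
    )


req = quota_request("http://localhost:30890", "abc123")
assert req.full_url == "http://localhost:30890/admin/api/quota/"
assert req.get_header("Authorization") == "token abc123"
# Send with: urllib.request.urlopen(req)
```

The same header works with `curl` or the `requests` library; only the token transport matters.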
-
-#### List All User Quotas
-
-```
-GET /admin/api/quota/
-```
-
-**Response:**
-```json
-{
-  "users": [
-    {"username": "user1", "balance": 500, "unlimited": false, "updated_at": "2026-01-15T10:00:00"},
-    {"username": "user2", "balance": 1000, "unlimited": true, "updated_at": "2026-01-15T09:30:00"}
-  ]
-}
-```
-
-#### Get Single User Quota
-
-```
-GET /admin/api/quota/{username}
-```
-
-**Response:**
-```json
-{
-  "username": "user1",
-  "balance": 500,
-  "unlimited": false,
-  "recent_transactions": [
-    {
-      "id": 123,
-      "username": "user1",
-      "amount": -10,
-      "transaction_type": "usage",
-      "resource_type": "cpu",
-      "description": "Session 456: 10 minutes",
-      "balance_before": 510,
-      "balance_after": 500,
-      "created_at": "2026-01-15T10:00:00",
-      "created_by": null
-    }
-  ]
-}
-```
-
-#### Modify User Quota
-
-```
-POST /admin/api/quota/{username}
-Content-Type: application/json
-```
-
-```json
-{
-  "action": "set",
-  "amount": 100,
-  "unlimited": true,
-  "description": "Monthly allocation"
-}
-```
-
-- `action`: `"set"`, `"add"`, `"deduct"`, or `"set_unlimited"`
-- `amount`: Required for set/add/deduct
-- `unlimited`: Required for set_unlimited
-
-**Actions:**
-- `set`: Set balance to exact amount
-- `add`: Add amount to current balance
-- `deduct`: Subtract amount from balance
-- `set_unlimited`: Mark/unmark user as having unlimited quota
-
-**Response:**
-```json
-{
-  "username": "user1",
-  "balance": 100,
-  "action": "set",
-  "amount": 100
-}
-```
-
-#### Batch Set Quota
-
-```
-POST /admin/api/quota/batch
-Content-Type: application/json
-```
-
-```json
-{
-  "users": [
-    {"username": "user1", "amount": 100},
-    {"username": "user2", "amount": 200}
-  ]
-}
-```
-
-**Response:**
-```json
-{
-  "success": 2,
-  "failed": 0,
-  "details": [
-    {"username": "user1", "status": "success", "balance": 100},
-    {"username": "user2", "status": "success", "balance": 200}
-  ]
-}
-```
-
-#### Batch Refresh Quota
-
-Flexible batch operation with targeting rules:
-
-```
-POST /admin/api/quota/refresh
-Content-Type: application/json -``` - -```json -{ - "rule_name": "weekly_reset", - "action": "add", - "amount": 100, - "max_balance": 1000, - "min_balance": 0, - "targets": { - "includeUnlimited": false, - "balanceBelow": 500, - "balanceAbove": 0, - "includeUsers": ["user1"], - "excludeUsers": ["admin"], - "usernamePattern": "^student_.*" - } -} -``` - -Request fields: -- `action`: `"add"` or `"set"` -- `max_balance`: Optional cap for balance -- `min_balance`: Optional floor for balance - -**Targeting Rules (AND logic):** -- `includeUnlimited`: Whether to include users marked as unlimited -- `balanceBelow`: Only affect users with balance below this value -- `balanceAbove`: Only affect users with balance above this value -- `includeUsers`: Whitelist of specific usernames -- `excludeUsers`: Blacklist of usernames to skip -- `usernamePattern`: Regex pattern to match usernames - -**Response:** -```json -{ - "users_updated": 25, - "total_change": 2500, - "skipped": 5, - "action": "add", - "rule_name": "weekly_reset" -} -``` - -### CLI Commands (manage_users.py) - -The `manage_users.py` script provides command-line quota management via kubectl. 
- -#### Set Quota - -Set quota to a specific amount: - -```bash -# Set quota for specific users -python scripts/manage_users.py set-quota user1 user2 --amount 1000 - -# Set quota from file (requires username column, optional quota column) -python scripts/manage_users.py set-quota -f users.csv --amount 500 - -# File with per-user amounts -python scripts/manage_users.py set-quota -f users_with_quota.csv -``` - -**File format (users_with_quota.csv):** -```csv -username,quota -student01,500 -student02,1000 -teacher01,2000 -``` - -#### Add Quota - -Add quota to existing balance: - -```bash -# Add to specific users -python scripts/manage_users.py add-quota user1 user2 --amount 100 - -# Add to users from file -python scripts/manage_users.py add-quota -f users.csv --amount 50 -``` - -#### List Quota Balances - -Display all user quota balances: - -```bash -python scripts/manage_users.py list-quota -``` - -**Output:** -``` -πŸ“‹ Quota Balances (25 users): - -Username Balance Last Updated ------------------------------------------------------------------ -student01 450 2026-01-15 10:30:00 -student02 800 2026-01-15 09:15:00 -teacher01 1500 2026-01-14 16:45:00 -``` - -## User Experience - -### Viewing Personal Quota - -Users can check their quota via the API: - -``` -GET /api/quota/me -``` - -**Response:** -```json -{ - "username": "student01", - "balance": 450, - "unlimited": false, - "rates": {"cpu": 1, "phx": 2, "strix": 2}, - "enabled": true -} -``` - -To get accelerator options, use the separate `/api/accelerators` endpoint. 
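Combining the fields returned by `/api/quota/me` and `/api/quota/rates`, a client can estimate up front whether a launch will be accepted. A minimal Python sketch of that check follows; the server-side rule is authoritative and may combine the values slightly differently:

```python
def can_start(balance: int, unlimited: bool, rate: int, minutes: int, minimum_to_start: int) -> bool:
    """Estimate whether the hub would allow a container launch."""
    if unlimited:
        return True  # unlimited users are exempt from quota checks
    if balance < minimum_to_start:
        return False  # below the global floor for starting any container
    return balance >= rate * minutes  # estimated cost = rate x requested runtime


# Balance 5 cannot cover 60 minutes at 2 quota/min (estimated cost 120):
assert not can_start(balance=5, unlimited=False, rate=2, minutes=60, minimum_to_start=10)
assert can_start(balance=450, unlimited=False, rate=2, minutes=60, minimum_to_start=10)
```

The numbers above mirror the "Insufficient quota" error example shown later in this page.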
-
-### Quota Rates API
-
-Get quota rates and configuration (available to all authenticated users):
-
-```
-GET /api/quota/rates
-```
-
-**Response:**
-```json
-{
-  "enabled": true,
-  "rates": {"cpu": 1, "phx": 2, "strix": 2},
-  "minimum_to_start": 10
-}
-```
-
-- `enabled`: Whether the quota system is active
-- `rates`: Quota consumption rate per minute for each resource type
-- `minimum_to_start`: Minimum quota required to start any container
-
-### Accelerators API
-
-Get available accelerator options (always available, independent of the quota system):
-
-```
-GET /api/accelerators
-```
-
-**Response:**
-```json
-{
-  "accelerators": {
-    "phx": {
-      "displayName": "AMD Radeon™ 780M (Phoenix Point iGPU)",
-      "description": "RDNA 3.0 (gfx1103) | Compute Units 12 | 4GB LPDDR5X",
-      "nodeSelector": {"accelerator": "phx"},
-      "quotaRate": 2
-    }
-  }
-}
-```
-
-- `accelerators`: Available accelerator options with display name, description, node selector, and quota rate
-
-### Insufficient Quota
-
-When a user attempts to start a container without sufficient quota:
-
-1. The system calculates the estimated cost: `rate × requested_runtime`
-2. Compares it with the current balance and minimum requirement
-3. If insufficient, returns an HTTP 403 error with a message such as:
-
-```
-Cannot start container: Insufficient quota. Current balance: 5,
-estimated cost: 120 (2 quota/min × 60 min).
-Please contact administrator to add quota.
-```
-
-### Quota Deduction
-
-Quota is deducted when a container stops:
-
-1. The system records the actual duration (minimum 1 minute)
-2. Calculates the actual cost: `rate × actual_duration_minutes`
-3. Deducts it from the user's balance
-4. 
Records transaction in audit log - -## Technical Details - -### Stale Session Cleanup - -Sessions stuck for more than 8 hours are automatically cleaned up on hub startup: - -- Marks session as `cleaned_up` -- Records duration but does NOT deduct quota (avoids charging for crashed sessions) -- Logs cleanup for auditing - -### Unlimited Quota Logic - -A user has unlimited quota if marked `unlimited: true` in the database. - -## Troubleshooting - -### User Cannot Start Container - -**Symptom:** "Insufficient quota" error - -**Solutions:** -1. Check user's current balance: - ```bash - python scripts/manage_users.py list-quota | grep username - ``` - -2. Add quota to user: - ```bash - python scripts/manage_users.py add-quota username --amount 500 - ``` - -3. Or grant unlimited quota via Admin UI (click quota value and enter `∞`) - -### Quota Not Being Deducted - -**Symptom:** User's balance doesn't decrease after container use - -**Possible causes:** -1. User has unlimited quota (set via Admin UI) -2. Quota system is disabled (`custom.quota.enabled: false`) -3. 
Session ended abnormally (cleaned up as stale) - -**Check:** -```bash -# View user's transaction history -curl "$JUPYTERHUB_URL/admin/api/quota/username" \ - -H "Authorization: token $JUPYTERHUB_TOKEN" -``` - -### Session Stuck as Active - -**Symptom:** Old sessions show as active even though containers stopped - -**Solution:** Sessions are automatically cleaned on hub restart, or manually: -```bash -# Restart hub to trigger cleanup -kubectl -n jupyterhub rollout restart deployment/hub -``` - -### Database Issues - -**Location:** `/srv/jupyterhub/quota.sqlite` (inside hub pod) - -**Access for debugging:** -```bash -kubectl -n jupyterhub exec deployment/hub -- sqlite3 /srv/jupyterhub/quota.sqlite \ - "SELECT * FROM user_quota LIMIT 10;" -``` - -## See Also - -- [User Management Guide](./user-management.md) - Batch user operations -- [Authentication Guide](./authentication-guide.md) - User authentication setup diff --git a/docs/jupyterhub/user-management.md b/docs/jupyterhub/user-management.md deleted file mode 100644 index 75289f1..0000000 --- a/docs/jupyterhub/user-management.md +++ /dev/null @@ -1,387 +0,0 @@ -# User Management Guide - -This guide covers user management via the Admin Panel (web interface) and CLI scripts. - -## Prerequisites - -```bash -# Install dependencies -pip install pandas openpyxl requests -``` - -## Initial Setup - -### Auto-Admin on Helm Install - -When `custom.adminUser.enabled: true` is set in values.yaml, helm install automatically: - -1. Generates random API token and admin password -2. Creates `jupyterhub-admin-credentials` secret -3. Configures `admin` user with admin privileges on hub startup - -The API token is associated with the `admin` user for script operations. 
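Both credentials are plain random secrets. If you ever need to reproduce them outside Helm, the Python equivalents of the `openssl rand` commands used in the manual setup below are (a sketch, not the chart's actual implementation):

```python
import base64
import os
import secrets

# Equivalent of `openssl rand -hex 32`: 32 random bytes, hex-encoded
api_token = secrets.token_hex(32)

# Equivalent of `openssl rand -base64 12`: 12 random bytes, base64-encoded
admin_password = base64.b64encode(os.urandom(12)).decode()

assert len(api_token) == 64        # 32 bytes -> 64 hex characters
assert len(admin_password) == 16   # 12 bytes -> 16 base64 characters, no padding
```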
- -**Get credentials after install:** - -```bash -# Get admin password -kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o go-template='{{index .data "admin-password" | base64decode}}' - -# Get API token -export JUPYTERHUB_TOKEN=$(kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o go-template='{{index .data "api-token" | base64decode}}') -``` - -### Manual Setup (if auto-admin disabled) - -If `custom.adminUser.enabled: false`, create credentials manually: - -```bash -# Create secret with random token and password -kubectl -n jupyterhub create secret generic jupyterhub-admin-credentials \ - --from-literal=api-token=$(openssl rand -hex 32) \ - --from-literal=admin-password=$(openssl rand -base64 12) - -# Restart hub to apply -kubectl -n jupyterhub rollout restart deployment/hub -``` - -## Daily Usage - -### Set Environment Variables - -```bash -# Get token from Kubernetes secret -export JUPYTERHUB_TOKEN=$(kubectl -n jupyterhub get secret jupyterhub-admin-credentials \ - -o go-template='{{index .data "api-token" | base64decode}}') -export JUPYTERHUB_URL="http://localhost:30890" # Or your hub URL -``` - -Or add to your shell profile (~/.bashrc or ~/.zshrc): - -```bash -alias jhtoken='export JUPYTERHUB_TOKEN=$(kubectl -n jupyterhub get secret jupyterhub-admin-credentials -o go-template="{{index .data \"api-token\" | base64decode}}")' -``` - -## Web Interface (Admin Panel) - -Access the Admin Panel at `/hub/admin/users`. - - - - -### User Table - -The user table displays: -- **Username** - Click to view user details -- **Admin** - Admin status badge -- **Quota** - Current quota balance (if quota system enabled) -- **Server** - Server status (Running/Stopped) -- **Last Activity** - Last login or activity time - -### Create User - -1. Click **"Create Users"** button -2. Enter usernames (one per line for batch creation) -3. 
Choose password option: - - **Generate random passwords** (default) - Each user gets a unique password - - **Set password** - Enter a password for all users -4. (Optional) Check **"Force password change on first login"** -5. (Optional) Check **"Grant admin privileges"** -6. Click **"Create Users"** -7. After creation, copy the username/password table to share with users - - - - -### Edit User - -1. Click the **pencil icon** on a user row to view user details -2. Click **"Edit User"** to enter edit mode -3. Available modifications: - - **Username** - Rename the user (warning: user must login with new name) - - **Admin status** - Grant or revoke admin privileges - - **Groups** - Assign user to groups -4. Click **"Save Changes"** - -### Set Password - -1. Click the **key icon** on a user row -2. Enter new password, or click **"Generate"** for a random password -3. (Optional) Check **"Force password change on next login"** -4. Click **"Set Password"** -5. After success, copy the password to share with the user - - - - -### Server Control - -- **Stop**: Click the red stop button to stop a user's running server -- **Start**: Click the green play button to start a user's server - -### Batch Operations - -1. Select multiple users using checkboxes -2. Available batch actions: - - **Start All Selected** - Start servers for selected users - - **Stop All Selected** - Stop servers for selected users - - **Set Quota** - Set quota for selected users (see [Quota System](./quota-system.md)) - - - - -### Search and Filter - -- Use the search box to filter users by username -- Toggle "Only Active Servers" to show only users with running servers -- Click column headers to sort by that column - -### Manage Groups - -Click **"Manage Groups"** to navigate to the group management page. - -## Script Usage - -### 1. Generate User Templates - -Create user list templates for batch operations. 
- -```bash -# Generate 50 students -python scripts/generate_users_template.py --prefix student --count 50 --output users.csv - -# Generate AUP users (AUP01, AUP02, ...) -python scripts/generate_users_template.py --prefix AUP --count 30 --start 1 --output aup_users.xlsx - -# Generate with 3-digit padding -python scripts/generate_users_template.py --prefix student --count 100 --digits 3 --output users.csv - -# Custom usernames -python scripts/generate_users_template.py --names alice bob charlie --output custom.csv -``` - -**Output format (CSV/Excel):** - -```csv -username,admin -student01,false -student02,false -student03,false -``` - -### 2. Manage Users - -Perform batch operations via JupyterHub API. - -```bash -# Create users from file -python scripts/manage_users.py create users.csv - -# List all users -python scripts/manage_users.py list - -# Export users to backup file -python scripts/manage_users.py export backup.xlsx - -# Delete users (with confirmation) -python scripts/manage_users.py delete remove_list.csv - -# Delete users (skip confirmation) -python scripts/manage_users.py delete remove_list.csv --yes -``` - -### 3. Manage Admin Privileges - -Grant or revoke admin privileges for existing users. - -```bash -# Grant admin to single user -python scripts/manage_users.py set-admin teacher01 - -# Grant admin to multiple users -python scripts/manage_users.py set-admin teacher01 teacher02 teacher03 - -# Grant admin from file -python scripts/manage_users.py set-admin -f admins.csv - -# Revoke admin privileges -python scripts/manage_users.py set-admin --revoke student01 - -# Batch revoke -python scripts/manage_users.py set-admin --revoke -f demote_list.csv -``` - -### 4. Set Passwords - -Set default passwords for users (requires kubectl access). - -> **Note:** For quota management commands (`set-quota`, `add-quota`, `list-quota`), see the [User Quota System](./quota-system.md) documentation. 
- -```bash -# Set passwords from file with password column -python scripts/manage_users.py set-passwords users_with_passwords.csv - -# Generate random passwords for users -python scripts/manage_users.py set-passwords users.csv --generate -o passwords_output.csv - -# Set same default password for all users -python scripts/manage_users.py set-passwords users.csv --generate --default-password "Welcome123" - -# Set passwords without forcing change on first login -python scripts/manage_users.py set-passwords users.csv --no-force-change -``` - -## Common Workflows - -### Create New Users - -```bash -# Step 1: Generate template -python scripts/generate_users_template.py \ - --prefix student \ - --count 50 \ - --output new_students.csv - -# Step 2: (Optional) Edit CSV to customize -# Edit new_students.csv if needed - -# Step 3: Create users in JupyterHub -python scripts/manage_users.py create new_students.csv -``` - -**Expected output:** -``` -βœ… Connected to JupyterHub at http://localhost:30890 -πŸ“„ Loaded 50 users from new_students.csv - -πŸ”„ Creating 50 users... - βœ… Created user: student01 (admin=False) - βœ… Created user: student02 (admin=False) - ... 
- -================================================== -πŸ“Š Results: - βœ… Created: 50 - ⚠️ Already exist: 0 - ❌ Failed: 0 -================================================== -``` - -### Promote Users to Admin - -```bash -# Single user -python scripts/manage_users.py set-admin teacher01 - -# Multiple users -python scripts/manage_users.py set-admin teacher01 teacher02 - -# From file (any CSV with username column) -python scripts/manage_users.py set-admin -f new_admins.csv -``` - -### Backup Users - -```bash -# Backup current users -python scripts/manage_users.py export users_backup_$(date +%Y%m%d).xlsx - -# Later: restore users -python scripts/manage_users.py create users_backup_20260113.xlsx -``` - -### Remove Users - -```bash -# Step 1: Export current users -python scripts/manage_users.py export all_users.csv - -# Step 2: Edit CSV, keep only users to delete -# Create remove_list.csv with usernames to delete - -# Step 3: Delete users -python scripts/manage_users.py delete remove_list.csv -``` - -## File Format - -Both scripts support **CSV** and **Excel** (.xlsx) formats. 
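Before running a large batch, it can be worth sanity-checking the file yourself. A standard-library sketch for the CSV case (the helper is illustrative, not part of the scripts; Excel files would need pandas/openpyxl as listed in the prerequisites):

```python
import csv
import io


def validate_user_rows(text: str) -> list:
    """Basic pre-flight check for a CSV user list before passing it to manage_users.py."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows or "username" not in rows[0]:
        raise ValueError("File must contain 'username' column")
    for row in rows:
        row.setdefault("admin", "false")  # optional column, defaults to false
    return rows


rows = validate_user_rows("username,admin\nstudent01,false\nteacher01,true\n")
assert [r["username"] for r in rows] == ["student01", "teacher01"]
```

A missing `username` column raises the same "File must contain 'username' column" error the scripts report.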
- -**Required column:** -- `username` - Username to create/delete - -**Optional columns:** -- `admin` - Set to `true` for admin users (default: `false`) -- `password` - Password for set-passwords command - -**Example CSV:** - -```csv -username,admin -student01,false -student02,false -teacher01,true -``` - -## Troubleshooting - -### Connection Issues - -**Symptom:** "Connection refused" - -**Solutions:** -```bash -# Check hub is running -kubectl --namespace=jupyterhub get pods - -# For local dev -export JUPYTERHUB_URL="http://localhost:30890" - -# For production -export JUPYTERHUB_URL="https://your-domain.com" - -# Test connection -curl $JUPYTERHUB_URL/hub/api/ -``` - -### Authentication Issues - -**Symptom:** "Authentication failed" - -**Solutions:** -```bash -# Verify token is set -echo $JUPYTERHUB_TOKEN - -# Test token -curl -X GET $JUPYTERHUB_URL/hub/api/ \ - -H "Authorization: token $JUPYTERHUB_TOKEN" -``` - -### Common Errors - -**"User already exists"** -- This is just a warning, not an error -- The user is already in the system - -**"File must contain 'username' column"** -- Ensure your file has a `username` column header - -## Notes - -- **Auto-Admin**: Initial admin is created automatically on helm install -- **Additional Admins**: Use `set-admin` command or Admin Panel -- **First Login**: Native users must set their password on first login -- **Safety**: Delete operations ask for confirmation (use `--yes` to skip) -- **Batch Size**: No limit, but larger batches (1000+) may be slow - -## Help - -Get detailed help for each script: - -```bash -python scripts/generate_users_template.py --help -python scripts/manage_users.py --help -``` diff --git a/docs/user-manual/aup-remote-lab-user-manual-admin.pdf b/docs/user-manual/aup-remote-lab-user-manual-admin.pdf deleted file mode 100644 index feb1d1f..0000000 Binary files a/docs/user-manual/aup-remote-lab-user-manual-admin.pdf and /dev/null differ