Merged
122 changes: 120 additions & 2 deletions setup.KubeConEU25/README.md
@@ -94,7 +94,109 @@ nfs-client-simplenfs k8s-sigs.io/simplenfs-nfs-subdir-external-provisioner D

### Prometheus Setup

TODO
We follow the setup provided by the `prometheus-community/kube-prometheus-stack` Helm chart.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
```

The chart installs Prometheus, Grafana, Alertmanager, Prometheus Node Exporter, and Kube State Metrics. We configure it with the following changes:

- Enable persistent storage for Prometheus, Grafana, and Alertmanager;
- Override the Prometheus Node Exporter port;
- Disable CRD creation, as the CRDs are already present.

You may leave CRD creation enabled and keep the default Node Exporter port. These two changes are only needed when deploying a separate Prometheus instance on OpenShift.

```bash
cat << EOF >> config.yaml
crds:
  enabled: false

prometheus-node-exporter:
  service:
    port: 9110

alertmanager:
  alertmanagerSpec:
    persistentVolumeClaimRetentionPolicy:
      whenDeleted: Retain
      whenScaled: Retain
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client-pokprod
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

prometheus:
  prometheusSpec:
    persistentVolumeClaimRetentionPolicy:
      whenDeleted: Retain
      whenScaled: Retain
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client-pokprod
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
      emptyDir:
        medium: Memory

grafana:
  persistence:
    enabled: true
    type: sts
    storageClassName: "nfs-client-pokprod"
    accessModes:
      - ReadWriteOnce
    size: 20Gi
    finalizers:
      - kubernetes.io/pvc-protection
EOF

helm upgrade -i kube-prometheus-stack -n prometheus prometheus-community/kube-prometheus-stack --create-namespace -f config.yaml
```

If deploying on OpenShift-based systems, you need to assign the `privileged` security context constraint (SCC) to the service accounts created by the Helm chart.

```bash
oc adm policy add-scc-to-user privileged \
  system:serviceaccount:prometheus:kube-prometheus-stack-admission \
  system:serviceaccount:prometheus:kube-prometheus-stack-alertmanager \
  system:serviceaccount:prometheus:kube-prometheus-stack-grafana \
  system:serviceaccount:prometheus:kube-prometheus-stack-kube-state-metrics \
  system:serviceaccount:prometheus:kube-prometheus-stack-operator \
  system:serviceaccount:prometheus:kube-prometheus-stack-prometheus \
  system:serviceaccount:prometheus:kube-prometheus-stack-prometheus-node-exporter
```
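
The service account names passed to `oc` all share the same prefix, so the list can be generated rather than typed out. The following is a minimal sketch, not part of the chart or of `oc`: the function name `scc_subjects` is hypothetical, and it assumes the `prometheus` namespace and the `kube-prometheus-stack` release name used in this guide.

```bash
# Hypothetical helper: print one fully qualified service account per line.
# Defaults assume the namespace and release name used in this guide.
scc_subjects() {
  local namespace="${1:-prometheus}"
  local release="${2:-kube-prometheus-stack}"
  local suffix
  for suffix in admission alertmanager grafana kube-state-metrics \
                operator prometheus prometheus-node-exporter; do
    printf 'system:serviceaccount:%s:%s-%s\n' "$namespace" "$release" "$suffix"
  done
}
```

With this defined, `oc adm policy add-scc-to-user privileged $(scc_subjects)` should expand to the same seven subjects. If you installed the chart under a different namespace or release name, pass them as the first and second arguments.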

You should expect the following pods:

```bash
kubectl --namespace prometheus get pods
```
```bash
NAME                                                        READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running   0          16m
kube-prometheus-stack-grafana-0                             3/3     Running   0          16m
kube-prometheus-stack-kube-state-metrics-6f76b98d89-pxs69   1/1     Running   0          16m
kube-prometheus-stack-operator-7fbfc985bb-mm9bk             1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-44llp        1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-95gp8        1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-dxf5f        1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-f45dx        1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-pfrzk        1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-zpfzb        1/1     Running   0          16m
prometheus-kube-prometheus-stack-prometheus-0               2/2     Running   0          16m
```
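
On a larger cluster the pod list can be long, so it may help to print only the pods that are not yet healthy. The sketch below is an illustration, not part of the chart: `unready_pods` is a hypothetical name, and it assumes the default column layout of `kubectl get pods --no-headers` (name, `READY` as `x/y`, then `STATUS`).

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the name of every pod that is not fully ready and Running.
unready_pods() {
  awk '{
    split($2, ready, "/")   # READY column, e.g. "2/2"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}
```

A possible invocation is `kubectl --namespace prometheus get pods --no-headers | unready_pods`; empty output means every listed pod is ready. Note that this simple check also reports pods in the `Completed` state.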

To access the Grafana dashboard on `localhost:3000`, retrieve the admin password and then port-forward the Grafana pod:

```bash
kubectl --namespace prometheus get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
```
```bash
export POD_NAME=$(kubectl --namespace prometheus get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prometheus-stack" -oname)
kubectl --namespace prometheus port-forward $POD_NAME 3000
```
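
Once the port-forward is running, you can check from a second terminal that Grafana is answering before opening the browser. This is a minimal sketch under a few assumptions: `grafana_healthy` is a hypothetical name, `curl` is installed, and Grafana's `/api/health` endpoint reports `"database": "ok"` when it is up.

```bash
# Hypothetical helper: succeed if Grafana answers on the forwarded port.
# Assumes the port-forward above is active in another terminal.
grafana_healthy() {
  local url="${1:-http://localhost:3000}"
  curl -fsS "$url/api/health" \
    | grep -q '"database"[[:space:]]*:[[:space:]]*"ok"'
}
```

For example, `grafana_healthy && echo "Grafana is up"` prints the message only when the health check passes.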

### MLBatch Cluster Setup

@@ -184,7 +286,23 @@ We reserve 8 GPUs out of 24 for MLBatch's slack queue.

### Autopilot Extended Setup

TODO
Autopilot can be configured to test PVC creation and deletion for a given storage class.

```bash
cat << EOF >> autopilot-extended.yaml
env:
  - name: "PERIODIC_CHECKS"
    value: "pciebw,remapped,dcgm,ping,gpupower,pvc"
  - name: "PVC_TEST_STORAGE_CLASS"
    value: "nfs-client-pokprod"
EOF
```

Then reapply the Helm chart; this triggers a rolling update.

```bash
helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f autopilot-extended.yaml
```
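
The `PERIODIC_CHECKS` value above is a comma-separated list, so enabling the PVC test amounts to adding `pvc` to whatever checks you already run. The sketch below shows one way to do that without duplicating an entry. The function name `add_check` is hypothetical and is not part of Autopilot.

```bash
# Hypothetical helper: append a check to a comma-separated list
# unless it is already present. Prints the resulting list.
add_check() {
  local checks="$1" new="$2"
  case ",$checks," in
    *",$new,"*) printf '%s\n' "$checks" ;;                  # already listed
    *)          printf '%s\n' "${checks:+$checks,}$new" ;;  # append
  esac
}
```

For instance, `add_check "pciebw,remapped,dcgm,ping,gpupower" pvc` yields the value used in `autopilot-extended.yaml`, and running it again on that result leaves the list unchanged.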

### MLBatch Teams Setup
