Air-Gapped Predictive Network Operations Center (NOC) Copilot
⭐ If you like this project, star it on GitHub!
Intention • System Architecture • Current Progress (v1) • Getting Started & Instructions • Testing & Verification • Roadmap
A secure, offline, and air-gapped Predictive Network Operations Center (NOC) Copilot framework. gap-moe combines machine learning (Isolation Forests), real-time timeseries telemetry forecasting (linear regression), and Retrieval-Augmented Generation (RAG) using local LLM/embeddings to detect network anomalies, forecast time-to-impact, and automatically recommend precise mitigation commands.
It operates on top of a simulated dynamic OSPF hub-and-spoke topology managed by Containerlab and monitored with Prometheus/cAdvisor.
In secure, high-availability, air-gapped network environments, traditional cloud-dependent AI tools are not an option. Network anomalies—such as link congestion, interface flapping, latency degradation, or routing engine memory leaks—require instant diagnostics and SOP (Standard Operating Procedure) runbook matching without exposing network data or configuration topologies to external endpoints.
gap-moe is built to resolve this by providing:
- Unsupervised Anomaly Detection: Isolation Forests train locally on normal baseline telemetry to detect anomalous states without pre-defined static thresholds.
- Deterministic Forecasting: Real-time trend projection via linear regression to predict the estimated Time-To-Impact (TTI) before a system failure (e.g., OOM or link capacity saturation) occurs.
- Retrieval-Augmented Generation (RAG): A local persistent vector database (ChromaDB) stores technical SOP runbooks and network topology context. When an alert fires, the top matching runbooks are retrieved.
- Local LLM Orchestration: Combined with local network context, Ollama coordinates local models (
llama3.2:3bandnomic-embed-text) to synthesize precise, step-by-step diagnostic hypotheses and container-level mitigation commands (such as OSPF dead-interval relaxation, traffic rate-limiting, or routing daemon reloads) for the NOC operator.
The project consists of three core layers:
graph TD
hub[hub<br>10.0.0.1/30] <-->|eth1| transit[transit<br>10.0.0.2/30<br>10.0.1.1/30<br>10.0.2.1/30]
transit <-->|eth2| branch-1[branch-1<br>10.0.1.2/30]
transit <-->|eth3| branch-2[branch-2<br>10.0.2.2/30]
- Simulation & Telemetry Stack:
- Four virtual FRRouting (FRR) routers configured with OSPF routing across Area 0.
- cAdvisor monitors container resource metrics (CPU, Memory).
- Prometheus aggregates telemetry (CPU, Memory, and network traffic interface stats) in real-time.
- Predictive Analytics & Forecasting:
scripts/export_telemetry.pyperiodically harvests scraped Prometheus metrics.scripts/train_models.pybuilds Isolation Forest models for each individual router.scripts/predictive_engine.pypolls Prometheus, runs live anomaly inference, calculates TTI projection, and publishes live alerts to a local Redis queue and state database.
- Offline Copilot Orchestrator (RAG):
scripts/populate_kb.pygenerates local vector representations of playbooks (knowledge/) and stores them in ChromaDB.scripts/copilot_orchestrator.pyruns as a daemon subscribing to the Redis alert channel (or can be run manually), querying ChromaDB for relevant SOP playbooks and compiling context prompts for LLM remediation.
sequenceDiagram
participant NetLab as Containerlab OSPF Nodes
participant Prom as Prometheus (cAdvisor)
participant ML as Anomaly Engine
participant Redis as Redis (Queue & State DB)
participant DB as ChromaDB
participant Ollama as Ollama (Llama-3.2)
participant Copilot as Copilot Orchestrator (Daemon / CLI)
NetLab->>Prom: Scrape metrics (CPU, Mem, Tx/Rx)
ML->>Prom: Query live metrics
Note over ML: Isolation Forest Inference & TTI Forecast
ML->>Redis: Publish to 'alerts' & set 'latest_alert'
Redis-->>Copilot: Pub/Sub message received (Daemon) or State DB lookup (CLI)
Copilot->>DB: Query relevant SOP runbooks
DB-->>Copilot: Return top-2 runbooks
Copilot->>Ollama: Generate mitigation guidance
Ollama-->>Copilot: Return action plan
- Network Topology & Convergence [100%]: Fully verified Containerlab OSPF topology. Static configuration templates, vtysh warning suppression, and dynamic neighbor adjacencies are operational.
- Telemetry & Queue Integration [100%]: Docker Compose telemetry services (cAdvisor, Prometheus, and Redis) are configured to support real-time, low-latency metric collection and pub/sub message propagation.
- Predictive Engine [100%]: Functional pipeline for telemetry export, machine learning model training, and continuous real-time forecasting. Polling interval optimized to 1 second for instant anomaly detection.
- RAG Knowledge Base [100%]: ChromaDB vector storage populated with OSPF topology maps and Standard Operating Procedures (SOPs).
- Copilot Orchestration [100%]: LLM-guided agent integrated with Redis Pub/Sub subscription daemon, automatically triggering RAG-based mitigation plans upon receiving live network alerts.
- Interactive Dashboard & NOC Chaos Panel [100%]: Live Streamlit dashboard optimized with a 1-second auto-refresh rate, reading directly from the Redis state database and providing real-time telemetry charts and system warning banners.
- Security Hardening [100%]: Implemented cryptographic model integrity checks (HMAC-SHA256), strict symlink/path traversal verification guards (
os.path.realpath) inpopulate_kb.py, and XML-delimited prompt injection isolation constraints incopilot_orchestrator.py&dashboard.py.
Ensure your host machine has the following tools installed:
- Docker and Docker Compose
- Containerlab (version 0.40+)
- Python 3.8+
- Ollama (running locally on default port
11434)
Pull the necessary models in Ollama:
ollama pull nomic-embed-text
ollama pull llama3.2:3bClone the repository and install Python dependencies:
python3 -m venv .gap
source .gap/bin/activate
# Make sure to install dependencies from requirements.txt
pip install -r requirements.txt- Deploy Containerlab:
sudo containerlab deploy -t clab/topology.clab.yml
- Start Monitoring Services:
docker compose -f telemetry/docker-compose.yml up -d
- Generate Traffic:
python3 scripts/traffic_generator.py start
-
Collect Baseline Metrics: Allow the exporter to collect baseline traffic telemetry for a few minutes:
python3 scripts/export_telemetry.py
(Exit with
Ctrl+Cwhen you have accumulated sufficient rows innetwork_telemetry.csv) -
Train Anomaly Models:
python3 scripts/train_models.py
(This outputs trained
.pklmodels tomodels/) -
Start the Predictive Inference Loop:
python3 scripts/predictive_engine.py
(This script polls Prometheus every 1 second, runs real-time anomaly inference, and publishes active alert payloads to Redis)
-
Populate the Knowledge Base:
python3 scripts/populate_kb.py
(Encodes and indexes the markdown files from
knowledge/into the local ChromaDB databasechroma_db/) -
Execute the Copilot Assistant:
The Copilot orchestrator can run in daemon mode listening to the Redis queue, or as a one-off CLI tool:
# Run in daemon mode subscribing to Redis pub/sub for real-time alerts python3 scripts/copilot_orchestrator.py --daemon # Or run as a single CLI query matching the latest active alert in Redis python3 scripts/copilot_orchestrator.py
(Matches retrieved SOP runbooks, queries
llama3.2:3b, and formats recommended mitigation actions) -
Launch the Interactive Dashboard & NOC Chaos Panel:
For a unified operator experience, run the Streamlit dashboard:
streamlit run dashboard.py
This serves a production-grade UI at
http://localhost:8501, featuring:- NOC Chaos Panel (Sidebar): Start/Stop background traffic streams, inject network faults (congestion, flapping, latency degradation, memory leaks), toggle inference daemon, and trigger master resets.
- Dynamic Telemetry Observability: Live charts for CPU usage, memory consumption, and transmit/receive bandwidth query metrics in real-time from Prometheus.
- Predictive System Warnings: Status banners alerting operators to anomaly forecasts, severity levels, and Time-to-Impact projections.
- AI Copilot Orchestration: RAG-driven remediation plan generator integrating the custom local vector index and Ollama models (
llama3.2:3b) with prompt-hardening and unit scaling. - Cryptographic Model Verification: Live integrity check footer monitoring HMAC signatures on trained models.
To test the end-to-end predictive alert and RAG copilot flow, you can inject various network issues using the chaos script:
# Inject 120ms latency delay on transit-branch2 interface
python3 scripts/chaos_injector.py degradation warn
# Inject link congestion (limits transit-hub link to 500Kbps)
python3 scripts/chaos_injector.py congestion warn
# Inject OSPF flapping packet drops (10% drop on transit-branch1)
python3 scripts/chaos_injector.py flapping warn
# Inject a memory leak inside the transit router container
python3 scripts/chaos_injector.py leak warnTo clean up all active network faults and return the simulation to normal:
python3 scripts/chaos_injector.py clearPlanned enhancements for Version 2:
- 🎛️ Playbook Schema Standardization: Adopt a strict syntax schema in playbooks incorporating
Metadata,Diagnostics,Mitigation, andRollbackkeys. - 🤖 Auto-Remediation Execution: Implement interactive confirmation options to execute the recommended commands directly on target Containerlab containers.