Skip to content

null-dreams/gap-moe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gap Moe

Air-Gapped Predictive Network Operations Center (NOC) Copilot

Python Containerlab ChromaDB Ollama

⭐ If you like this project, star it on GitHub!

IntentionSystem ArchitectureCurrent Progress (v1)Getting Started & InstructionsTesting & VerificationRoadmap


A secure, offline, and air-gapped Predictive Network Operations Center (NOC) Copilot framework. gap-moe combines machine learning (Isolation Forests), real-time timeseries telemetry forecasting (linear regression), and Retrieval-Augmented Generation (RAG) using local LLM/embeddings to detect network anomalies, forecast time-to-impact, and automatically recommend precise mitigation commands.

It operates on top of a simulated dynamic OSPF hub-and-spoke topology managed by Containerlab and monitored with Prometheus/cAdvisor.


Intention

In secure, high-availability, air-gapped network environments, traditional cloud-dependent AI tools are not an option. Network anomalies—such as link congestion, interface flapping, latency degradation, or routing engine memory leaks—require instant diagnostics and SOP (Standard Operating Procedure) runbook matching without exposing network data or configuration topologies to external endpoints.

gap-moe is built to resolve this by providing:

  1. Unsupervised Anomaly Detection: Isolation Forests train locally on normal baseline telemetry to detect anomalous states without pre-defined static thresholds.
  2. Deterministic Forecasting: Real-time trend projection via linear regression to predict the estimated Time-To-Impact (TTI) before a system failure (e.g., OOM or link capacity saturation) occurs.
  3. Retrieval-Augmented Generation (RAG): A local persistent vector database (ChromaDB) stores technical SOP runbooks and network topology context. When an alert fires, the top matching runbooks are retrieved.
  4. Local LLM Orchestration: Combined with local network context, Ollama coordinates local models (llama3.2:3b and nomic-embed-text) to synthesize precise, step-by-step diagnostic hypotheses and container-level mitigation commands (such as OSPF dead-interval relaxation, traffic rate-limiting, or routing daemon reloads) for the NOC operator.

System Architecture

The project consists of three core layers:

graph TD
    hub[hub<br>10.0.0.1/30] <-->|eth1| transit[transit<br>10.0.0.2/30<br>10.0.1.1/30<br>10.0.2.1/30]
    transit <-->|eth2| branch-1[branch-1<br>10.0.1.2/30]
    transit <-->|eth3| branch-2[branch-2<br>10.0.2.2/30]
Loading
  1. Simulation & Telemetry Stack:
    • Four virtual FRRouting (FRR) routers configured with OSPF routing across Area 0.
    • cAdvisor monitors container resource metrics (CPU, Memory).
    • Prometheus aggregates telemetry (CPU, Memory, and network traffic interface stats) in real-time.
  2. Predictive Analytics & Forecasting:
    • scripts/export_telemetry.py periodically harvests scraped Prometheus metrics.
    • scripts/train_models.py builds Isolation Forest models for each individual router.
    • scripts/predictive_engine.py polls Prometheus, runs live anomaly inference, calculates TTI projection, and publishes live alerts to a local Redis queue and state database.
  3. Offline Copilot Orchestrator (RAG):
    • scripts/populate_kb.py generates local vector representations of playbooks (knowledge/) and stores them in ChromaDB.
    • scripts/copilot_orchestrator.py runs as a daemon subscribing to the Redis alert channel (or can be run manually), querying ChromaDB for relevant SOP playbooks and compiling context prompts for LLM remediation.
sequenceDiagram
    participant NetLab as Containerlab OSPF Nodes
    participant Prom as Prometheus (cAdvisor)
    participant ML as Anomaly Engine
    participant Redis as Redis (Queue & State DB)
    participant DB as ChromaDB
    participant Ollama as Ollama (Llama-3.2)
    participant Copilot as Copilot Orchestrator (Daemon / CLI)

    NetLab->>Prom: Scrape metrics (CPU, Mem, Tx/Rx)
    ML->>Prom: Query live metrics
    Note over ML: Isolation Forest Inference & TTI Forecast
    ML->>Redis: Publish to 'alerts' & set 'latest_alert'
    Redis-->>Copilot: Pub/Sub message received (Daemon) or State DB lookup (CLI)
    Copilot->>DB: Query relevant SOP runbooks
    DB-->>Copilot: Return top-2 runbooks
    Copilot->>Ollama: Generate mitigation guidance
    Ollama-->>Copilot: Return action plan
Loading

Current Progress (v1.3)

  • Network Topology & Convergence [100%]: Fully verified Containerlab OSPF topology. Static configuration templates, vtysh warning suppression, and dynamic neighbor adjacencies are operational.
  • Telemetry & Queue Integration [100%]: Docker Compose telemetry services (cAdvisor, Prometheus, and Redis) are configured to support real-time, low-latency metric collection and pub/sub message propagation.
  • Predictive Engine [100%]: Functional pipeline for telemetry export, machine learning model training, and continuous real-time forecasting. Polling interval optimized to 1 second for instant anomaly detection.
  • RAG Knowledge Base [100%]: ChromaDB vector storage populated with OSPF topology maps and Standard Operating Procedures (SOPs).
  • Copilot Orchestration [100%]: LLM-guided agent integrated with Redis Pub/Sub subscription daemon, automatically triggering RAG-based mitigation plans upon receiving live network alerts.
  • Interactive Dashboard & NOC Chaos Panel [100%]: Live Streamlit dashboard optimized with a 1-second auto-refresh rate, reading directly from the Redis state database and providing real-time telemetry charts and system warning banners.
  • Security Hardening [100%]: Implemented cryptographic model integrity checks (HMAC-SHA256), strict symlink/path traversal verification guards (os.path.realpath) in populate_kb.py, and XML-delimited prompt injection isolation constraints in copilot_orchestrator.py & dashboard.py.

Getting Started & Instructions

1. Prerequisites

Ensure your host machine has the following tools installed:

  • Docker and Docker Compose
  • Containerlab (version 0.40+)
  • Python 3.8+
  • Ollama (running locally on default port 11434)

Pull the necessary models in Ollama:

ollama pull nomic-embed-text
ollama pull llama3.2:3b

2. Environment Installation

Clone the repository and install Python dependencies:

python3 -m venv .gap
source .gap/bin/activate
# Make sure to install dependencies from requirements.txt
pip install -r requirements.txt

3. Deploying the Network Lab & Telemetry

  1. Deploy Containerlab:
    sudo containerlab deploy -t clab/topology.clab.yml
  2. Start Monitoring Services:
    docker compose -f telemetry/docker-compose.yml up -d
  3. Generate Traffic:
    python3 scripts/traffic_generator.py start

4. Running the Predictive and Orchestration Engine

  1. Collect Baseline Metrics: Allow the exporter to collect baseline traffic telemetry for a few minutes:

    python3 scripts/export_telemetry.py

    (Exit with Ctrl+C when you have accumulated sufficient rows in network_telemetry.csv)

  2. Train Anomaly Models:

    python3 scripts/train_models.py

    (This outputs trained .pkl models to models/)

  3. Start the Predictive Inference Loop:

    python3 scripts/predictive_engine.py

    (This script polls Prometheus every 1 second, runs real-time anomaly inference, and publishes active alert payloads to Redis)

  4. Populate the Knowledge Base:

    python3 scripts/populate_kb.py

    (Encodes and indexes the markdown files from knowledge/ into the local ChromaDB database chroma_db/)

  5. Execute the Copilot Assistant:

    The Copilot orchestrator can run in daemon mode listening to the Redis queue, or as a one-off CLI tool:

    # Run in daemon mode subscribing to Redis pub/sub for real-time alerts
    python3 scripts/copilot_orchestrator.py --daemon
    
    # Or run as a single CLI query matching the latest active alert in Redis
    python3 scripts/copilot_orchestrator.py

    (Matches retrieved SOP runbooks, queries llama3.2:3b, and formats recommended mitigation actions)

  6. Launch the Interactive Dashboard & NOC Chaos Panel:

    For a unified operator experience, run the Streamlit dashboard:

    streamlit run dashboard.py

    This serves a production-grade UI at http://localhost:8501, featuring:

    • NOC Chaos Panel (Sidebar): Start/Stop background traffic streams, inject network faults (congestion, flapping, latency degradation, memory leaks), toggle inference daemon, and trigger master resets.
    • Dynamic Telemetry Observability: Live charts for CPU usage, memory consumption, and transmit/receive bandwidth query metrics in real-time from Prometheus.
    • Predictive System Warnings: Status banners alerting operators to anomaly forecasts, severity levels, and Time-to-Impact projections.
    • AI Copilot Orchestration: RAG-driven remediation plan generator integrating the custom local vector index and Ollama models (llama3.2:3b) with prompt-hardening and unit scaling.
    • Cryptographic Model Verification: Live integrity check footer monitoring HMAC signatures on trained models.

Testing & Verification

Simulated Fault Injection (Chaos)

To test the end-to-end predictive alert and RAG copilot flow, you can inject various network issues using the chaos script:

# Inject 120ms latency delay on transit-branch2 interface
python3 scripts/chaos_injector.py degradation warn

# Inject link congestion (limits transit-hub link to 500Kbps)
python3 scripts/chaos_injector.py congestion warn

# Inject OSPF flapping packet drops (10% drop on transit-branch1)
python3 scripts/chaos_injector.py flapping warn

# Inject a memory leak inside the transit router container
python3 scripts/chaos_injector.py leak warn

To clean up all active network faults and return the simulation to normal:

python3 scripts/chaos_injector.py clear

Roadmap

Planned enhancements for Version 2:

  • 🎛️ Playbook Schema Standardization: Adopt a strict syntax schema in playbooks incorporating Metadata, Diagnostics, Mitigation, and Rollback keys.
  • 🤖 Auto-Remediation Execution: Implement interactive confirmation options to execute the recommended commands directly on target Containerlab containers.

About

A secure, offline, and air-gapped Predictive Network Operations Center (NOC) Copilot framework.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages