Infinite Context. Zero Latency.
A efficient LLM compression system engineered for the Hack 60 Advanced AI Hackathon.
A highly modular two-tier compression architecture designed to run large-scale context agents securely on constrained consumer hardware (6–12 GB VRAM). By unifying a strictly typed UI with a cutting-edge Chain-of-Thought (CoT) LoRA adapter, Kinetic_SYS maintains >95% goal state accuracy across +30 chat turns.
Aesthetics: The UI relies on a brutalist, strictly monochrome aesthetic. Uncompromised scaling constraints ensure data displays dynamically via SVGs across kvCacheMetrics natively wrapped over a 3D Spline background.
We broke from generic frameworks to embrace a highly functional, pure React engine.
- Dynamic Live Metrics: Active tracking of compression ratios via dynamic mathematical SVG tracing natively connected to the WebSockets.
- Zustand AppStore: Singleton memory architecture managing
activeConstraintsdirectly in real-time. - Spline WASM Isolation: Robust error boundary architecture (
SplineErrorBoundary) catching 100k+ WASM particle overflows, gracefully scaling the application on low-end hardware without halting the DOM render.
- Attention Sink KV-Cache: PyTorch tensor slicing guarantees the anchor system prompt retains zero attention decay.
- Qwen2.5-1.5B Fine-tune: A state-of-the-art PEFT model explicitly fine-tuned via
LoRAparameters. The model handles all contextual parsing organically, stripping away legacy TF-IDF or regex dependencies resulting in pure, inferential constraints execution. - Persistent Chat History (SQLite): New session-based architecture that auto-saves conversation state (messages, memory, telemetry) to a local SQLite database, allowing users to resume historical threads seamlessly.
- Async WebSocket Duplex: Streams data points simultaneously down the wire to the frontend to provide immediate telemetry on layer times and latency.
| Layer | Framework/Tech | Usage |
|---|---|---|
| Core UI | React + Vite |
Instant hot module replacement and DOM mapping. |
| Styling | Tailwind CSS |
Strictly bound app.css utilizing custom monochromatic tracking layers. |
| Animation | Framer Motion |
Granular bounds scaling, expanding dynamic structural divs (e.g., ConstraintsSidebar). |
| 3D Engine | @splinetool/react-spline |
Heavyweight WebGL overlay for premium user transitions. |
| Backend API | FastAPI |
Asynchronous WebSocket telemetry handling. |
| Model | Qwen2.5-1.5B (HuggingFace) |
Lightweight SLM fine-tuned specifically for entity relation. |
Clone the repository:
git clone https://github.com/Shrestha-Kumar/context-compression.git
cd context-compression-moduleSince the model is strictly bound to PyTorch and QLoRA, ensure CUDA 12.1+ and a Python 3.10+ environment are ready.
Standard Setup:
cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtStart the local PyTorch ASGI Socket:
# From the root directory:
PYTHONPATH=. python backend/app.pySpin up the Vite pipeline on another terminal natively:
cd frontend
npm install
npm run devThe application will map directly to http://localhost:5173.
Upon loading, the WASM engine allocates 100,000 particle limits for the Spline rendering engine. Click Deploy Pipeline. The error handler guarantees safe load execution across devices.
Run the automated benchmarking scripts for your presentation phase.
python -m backend.evaluation.benchmark --mode bothWatch the agent perfectly parse the needle test! The compression preserves >90% token reduction directly visible via the MetricsPanel in the frontend dashboard.