ChronoQuant

Time-aware embedding compression middleware for Qdrant.

Store vectors at full precision when fresh. Automatically compress them as they age — saving 70%+ Qdrant storage with ≥95% cosine recall preserved.

app  →  ChronoQuant proxy  →  Qdrant
              ↓
     tier-routes + compresses
     embeddings by age + importance

How it works

Every vector gets routed to one of five tiers on write. A background compressor demotes vectors as they age, applying progressively lossy quantization while keeping semantic similarity intact.

Tier    Encoding            Age threshold  Recall  Storage vs float32
anchor  float16             never demoted  100%    −50%
hot     float16             < 7 days       100%    −50%
warm    int8 (PolarQuant)   7–30 days      ≥99%    −75%
cool    4-bit (PolarQuant)  30–90 days     ≥97%    −87.5%
cold    3-bit (PolarQuant)  > 90 days      ≥95%    −90.6%

Anchor tier: any vector with importance ≥ 0.7 is pinned here forever — never compressed, never demoted.
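
The routing rule can be sketched in a few lines. This is a minimal illustration, not the project's tier_router.py: the name route_tier is hypothetical, and the boundary handling (a vector aged exactly 7, 30, or 90 days lands in the colder tier) is an assumption, since the table leaves those edges open.

```python
ANCHOR_IMPORTANCE = 0.7  # from the text: importance >= 0.7 pins a vector to anchor

# (tier name, exclusive upper age bound in days); the edge handling is an assumption
AGE_TIERS = [("hot", 7.0), ("warm", 30.0), ("cool", 90.0)]


def route_tier(age_days: float, importance: float) -> str:
    """Map a vector's age and importance to a tier name."""
    if importance >= ANCHOR_IMPORTANCE:
        return "anchor"  # importance wins over age
    for tier, upper_bound in AGE_TIERS:
        if age_days < upper_bound:
            return tier
    return "cold"


if __name__ == "__main__":
    for age, importance in [(0, 0.75), (3, 0.2), (12, 0.2), (45, 0.1), (400, 0.1)]:
        print(f"age={age:>3}d importance={importance} -> {route_tier(age, importance)}")
```

Checking importance first means an old but important vector never reaches the age comparison, which matches the "never demoted" rule.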

PolarQuant: Randomized Hadamard Transform + scalar quantization + bit-packing. Distributes energy uniformly before quantization, maximizing precision per bit.
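
To make the idea concrete, here is a rough pure-Python sketch of a randomized Hadamard rotation followed by symmetric scalar quantization. It is not the code in polar_quant.py: the function names, the max-abs scaling, and the rounding scheme are all assumptions, and bit-packing of the integer codes is left out.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def random_signs(n, seed):
    """Deterministic +/-1 diagonal; the same seed must be used to decode."""
    rng = random.Random(seed)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(n)]


def encode(vec, bits, seed=42):
    """Return (integer codes, scale) for `vec`; requires bits >= 2."""
    n = 1 << (len(vec) - 1).bit_length()        # pad to the next power of two
    padded = list(vec) + [0.0] * (n - len(vec))
    signs = random_signs(n, seed)
    rotated = fwht([v * s for v, s in zip(padded, signs)])
    peak = max(abs(v) for v in rotated) or 1.0  # assumed max-abs scaling
    levels = (1 << (bits - 1)) - 1              # symmetric range [-levels, levels]
    return [round(v / peak * levels) for v in rotated], peak


def decode(codes, peak, bits, dim, seed=42):
    """Invert encode(): dequantize, undo the rotation, drop the padding."""
    levels = (1 << (bits - 1)) - 1
    rotated = [c / levels * peak for c in codes]
    restored = fwht(rotated)                    # the orthonormal transform is its own inverse
    signs = random_signs(len(codes), seed)
    return [v * s for v, s in zip(restored, signs)][:dim]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


if __name__ == "__main__":
    rng = random.Random(0)
    vec = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    for bits in (8, 4, 3):
        codes, peak = encode(vec, bits)
        print(bits, "bits -> cosine", round(cosine(vec, decode(codes, peak, bits, len(vec))), 4))
```

The rotation spreads any large coordinate across all outputs, so a single scale factor wastes fewer quantization levels on outliers. The seed matters because decoding needs the same sign pattern.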


Storage savings

At 1 million vectors with a realistic age distribution (5% anchor, 15% hot, 25% warm, 30% cool, 25% cold):

Dim   Raw Qdrant (float32)  ChronoQuant  Saved
384   1.4 GiB               0.4 GiB      70%
768   2.9 GiB               0.9 GiB      70%
1536  5.7 GiB               1.7 GiB      70%
3072  11.4 GiB              3.4 GiB      70%

Savings grow as your corpus ages. A collection where 80%+ of vectors are > 30 days old approaches 85–90% reduction.
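
For a quick back-of-envelope check, the tier table's bytes-per-dimension can be weighted by the age mix. The sketch below counts vector bytes only and ignores padding, scale factors, and payload, so it gives an upper bound rather than the benchmark's figure. The helper name ideal_saving is hypothetical.

```python
# Bytes per stored dimension, taken from the tier table (float32 = 4 bytes).
BYTES_PER_DIM = {"anchor": 2.0, "hot": 2.0, "warm": 1.0, "cool": 0.5, "cold": 0.375}

# Age distribution quoted for the 1M-vector example.
AGE_MIX = {"anchor": 0.05, "hot": 0.15, "warm": 0.25, "cool": 0.30, "cold": 0.25}


def ideal_saving(mix: dict) -> float:
    """Fraction of float32 vector storage saved, ignoring all overhead."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("tier fractions must sum to 1")
    weighted_bytes = sum(share * BYTES_PER_DIM[tier] for tier, share in mix.items())
    return 1.0 - weighted_bytes / 4.0


if __name__ == "__main__":
    print(f"quoted age mix:     {ideal_saving(AGE_MIX):.1%}")
    print(f"half cool, half cold: {ideal_saving({'cool': 0.5, 'cold': 0.5}):.1%}")
```

The idealized number for the quoted mix comes out near 78%, somewhat above the 70% in the table. The gap is presumably overhead that this sketch does not model; the benchmark script remains the authority.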

Run the benchmark yourself:

poetry run python benchmarks/storage_savings.py

Quick start

Requirements: Docker, Python 3.11+, Poetry

git clone https://github.com/bs258q/chronoquant
cd chronoquant
poetry install

# Start Qdrant + Redis
docker compose up -d

# Run the demo (real sentence-transformer embeddings)
poetry run python demo.py

Tier aging demo

Watch BGCompressor demote vectors across tiers in real time:

poetry run python tier_aging_demo.py

Output:

Tier counts BEFORE compression:
  anchor  ██           2
  hot     ██████       6
  warm    ██           2
  cool    ██           2
  cold                 0

Running BGCompressor...

Tier counts AFTER compression:
  anchor  ██           2
  hot     ██████       6  (-4)
  warm    █████        5  (+5)
  cool    ████         4  (+4)
  cold    ██           2  (+2)

Score drift (before vs after compression):
  [anchor] +0.0000  Transformers use self-attention...
  [hot   ] +0.0000  Flash Attention rewrites the softmax...
  [warm  ] +0.0031  Sparse autoencoders decompose...
  [cold  ] +0.0052  Gradient checkpointing trades compute...

Takeaway: semantic similarity preserved despite lossy compression.

SDK usage

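import asyncio

import numpy as np
from chronoquant.sdk import ChronoQuantClient


def random_unit_vector(dim: int) -> np.ndarray:
    """Stand-in for a real embedding: random float32 vector, L2-normalized."""
    vec = np.random.default_rng().standard_normal(dim).astype(np.float32)
    return vec / np.linalg.norm(vec)


async def main() -> None:
    client = ChronoQuantClient(
        qdrant_url="http://localhost:6333",
        redis_url="redis://localhost:6379",
        embedding_dim=1536,
    )

    # Replace these with the output of your embedding model
    my_embedding = random_unit_vector(1536)
    query_vec = random_unit_vector(1536)

    # Store a vector
    vec_id = client.store(
        text="Flash Attention rewrites the softmax kernel for IO-bound efficiency.",
        embedding=my_embedding,   # np.ndarray, float32, normalized
        importance=0.75,          # ≥0.7 → anchor tier (never compressed)
    )

    # Search across all tiers
    results = await client.search(query_vec, top_k=10)
    for r in results:
        print(r["score"], r["payload"]["raw_text_ref"])

    # Run compression manually (normally a background job)
    await client.run_compressor()


# search() and run_compressor() are coroutines, so they need an event loop
asyncio.run(main())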

FastAPI proxy

Drop ChronoQuant in as an HTTP proxy between your app and Qdrant:

poetry run uvicorn chronoquant.proxy.server:app --port 8765

Endpoints:

Method  Path       Body
POST    /store     {text, embedding, importance}
POST    /search    {embedding, top_k, include_cold}
POST    /compress  {} (triggers BGCompressor manually)
GET     /health    none (liveness check)
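
Any HTTP client can talk to these endpoints. Below is a small standard-library sketch that builds the POST requests. The field names come from the table above; the helper names, the 2-second timeout, and the assumption that responses are JSON are illustrative only.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:8765"  # port used by the uvicorn command above


def build_request(path: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request for a ChronoQuant proxy endpoint."""
    return urllib.request.Request(
        PROXY_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(path: str, body: dict, timeout: float = 2.0):
    """Send the request and parse the reply (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(path, body), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def store(text, embedding, importance=0.0):
    return post("/store", {"text": text, "embedding": list(embedding), "importance": importance})


def search(embedding, top_k=10, include_cold=False):
    return post("/search", {"embedding": list(embedding), "top_k": top_k, "include_cold": include_cold})
```

With the proxy running, search([0.1] * 1536, top_k=5) would issue the POST /search call; without it, urlopen raises URLError.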

Architecture

┌─────────────────────────────────────────────────────┐
│                  ChronoQuantClient                  │
│                                                     │
│  store() → TierRouter → QdrantAdapter → Qdrant     │
│          → WriteAheadLog (SQLite)                   │
│                                                     │
│  search() → fan-out all tiers → ScoreCalibrator    │
│           → AccessTracker (Redis sorted set)        │
│                                                     │
│  BGCompressor (background)                          │
│    Redlock (Redis) — prevents concurrent runs      │
│    For each tier, for each aged vector:             │
│      PolarQuant.encode() → shadow-write →          │
│      verify recall ≥ 0.95 → delete source          │
└─────────────────────────────────────────────────────┘

BGCompressor safety: shadow-write pattern — encode, upsert to target tier, verify the point exists and recall ≥ 0.95, then delete from source. A failed recall check aborts the migration for that vector; it stays in its current tier.
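
The ordering is the important part, and it can be shown with two dicts standing in for the source and target tiers. This is a sketch only: demote is a hypothetical name, the recall check is approximated by cosine similarity between the original and the decoded copy, and removing the shadow copy after a failed check is an assumption.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def demote(vec_id, source, target, encode, decode, min_recall=0.95):
    """Move one vector from `source` to `target`; return True on success."""
    original = source[vec_id]
    target[vec_id] = encode(original)   # 1. shadow-write the compressed copy
    stored = target.get(vec_id)         # 2. verify the point exists ...
    if stored is None or cosine(original, decode(stored)) < min_recall:
        target.pop(vec_id, None)        # assumption: discard the failed shadow copy
        return False                    # the vector stays in its current tier
    del source[vec_id]                  # 3. only now delete from the source tier
    return True


if __name__ == "__main__":
    hot, warm = {"v1": [0.6, 0.8]}, {}
    coarse = lambda v: [round(x, 1) for x in v]  # toy stand-in for PolarQuant.encode
    print(demote("v1", hot, warm, coarse, list), hot, warm)
```

Because the source copy is deleted last, a crash at any earlier step leaves the original vector intact.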

WriteAheadLog: SQLite-backed, INSERT OR IGNORE idempotent. Survives process crashes. Pending entries can be replayed on startup.
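
The idempotency comes from SQLite itself: INSERT OR IGNORE silently skips a row whose primary key already exists. A minimal sketch of that pattern follows; the table layout, class name, and status column are assumptions, not the schema used by write_ahead_log.py.

```python
import sqlite3


class WriteAheadLogSketch:
    """Toy WAL: one row per vec_id, replayable until marked done."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS wal ("
            " vec_id TEXT PRIMARY KEY,"
            " payload TEXT NOT NULL,"
            " status TEXT NOT NULL DEFAULT 'pending')"
        )
        self.conn.commit()

    def append(self, vec_id: str, payload: str) -> bool:
        """Log a write; returns False if this vec_id was already logged."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO wal (vec_id, payload) VALUES (?, ?)",
            (vec_id, payload),
        )
        self.conn.commit()
        return cur.rowcount == 1

    def mark_done(self, vec_id: str) -> None:
        self.conn.execute("UPDATE wal SET status = 'done' WHERE vec_id = ?", (vec_id,))
        self.conn.commit()

    def pending(self) -> list:
        """Entries to replay on startup, in insertion order."""
        rows = self.conn.execute(
            "SELECT vec_id, payload FROM wal WHERE status = 'pending' ORDER BY rowid"
        )
        return rows.fetchall()
```

Appending the same vec_id twice is harmless, so a retry after a crash cannot create duplicate entries.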

ScoreCalibrator: Applies per-tier cosine score offsets learned from ≥500 samples. Normalizes compressed-tier scores to be directly comparable with float16 scores.
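
One simple way to realize such a calibrator is a running mean of the gap between exact and compressed scores per tier, applied additively once enough samples exist. The sketch below assumes exactly that; the class and method names are hypothetical and the real score_calibrator.py may learn its offsets differently.

```python
from collections import defaultdict

MIN_SAMPLES = 500  # from the text: offsets are learned from at least 500 samples


class ScoreCalibratorSketch:
    """Additive per-tier offset = mean(exact score - compressed score)."""

    def __init__(self, min_samples: int = MIN_SAMPLES) -> None:
        self.min_samples = min_samples
        self._gap_sum = defaultdict(float)
        self._count = defaultdict(int)

    def observe(self, tier: str, exact_score: float, compressed_score: float) -> None:
        """Record one paired measurement for a tier."""
        self._gap_sum[tier] += exact_score - compressed_score
        self._count[tier] += 1

    def offset(self, tier: str) -> float:
        """Learned offset, or 0.0 until the tier has enough samples."""
        n = self._count.get(tier, 0)
        if n < self.min_samples:
            return 0.0
        return self._gap_sum[tier] / n

    def calibrate(self, tier: str, score: float) -> float:
        return score + self.offset(tier)
```

Returning 0.0 below the sample threshold keeps early, noisy estimates from distorting the ranking across tiers.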


agentmemory plugin

ChronoQuant ships a ready-made plugin that intercepts agentmemory save/search calls and routes them through the compression proxy. Embeddings are automatically tiered by age — your agent's memory gets smaller over time without losing semantic quality.

Install

# In your agentmemory project
npm install /path/to/chronoquant/plugins/agentmemory
# or copy plugins/agentmemory/ into your project directly

Wire it up

const { memory_save_hook, memory_search_hook } = require('chronoquant-agentmemory-plugin');

// Set proxy URL (default: http://localhost:8765)
process.env.CHRONOQUANT_PROXY_URL = 'http://localhost:8765';

// CommonJS modules have no top-level await, so the calls live in an async function.
// `embedding` and `queryEmbedding` come from your own embedding model.
async function rememberAndRecall(embedding, queryEmbedding) {
  // Replace agentmemory's default store call:
  await memory_save_hook({
    text: "Flash Attention rewrites the softmax kernel for IO-bound efficiency.",
    embedding,                 // plain JS array or TypedArray
    importance: 0.6,           // 0.0–1.0; ≥0.7 pins to anchor tier permanently
  });

  // Replace agentmemory's default search call:
  const results = await memory_search_hook({
    embedding: queryEmbedding,
    top_k: 10,
    include_cold: false,  // set true to search 90d+ tier (slower, more thorough)
  });
  // results: [{ vec_id, score, payload: { text, tier, importance, ... } }, ...]
  // or null if the proxy is unreachable
  return results;
}

How it behaves

  • Writes are fire-and-forget with a 2 s timeout. A failed ChronoQuant write logs a warning and does not crash your agent.
  • Searches fall back gracefully: if the proxy is unreachable, memory_search_hook returns null and your code can fall back to a direct Qdrant call.
  • No agentmemory internals are modified — the plugin is pure pass-through hooks.

Start the proxy alongside your agent

# terminal 1: infrastructure
docker compose up -d

# terminal 2: ChronoQuant proxy
poetry run uvicorn chronoquant.proxy.server:app --port 8765

# terminal 3: your agent (its writes are now tiered and compressed automatically)
node your-agent.js

Using ChronoQuant as a proxy for other vector databases

ChronoQuant's proxy layer is database-agnostic. The HTTP proxy speaks a simple REST protocol; the backend adapter is swappable. To add a new vector DB:

1. Implement the adapter interface

# chronoquant/adapters/my_db_adapter.py
from __future__ import annotations
import numpy as np

class MyDBAdapter:
    """Implement these seven methods and ChronoQuant works out of the box."""

    def upsert(self, collection: str, vec_id: str, vector: np.ndarray, payload: dict) -> None:
        ...

    def search(self, collection: str, query: np.ndarray, limit: int) -> list:
        # Return objects with .id, .score, .payload, .vector attributes
        ...

    def get(self, collection: str, vec_id: str):
        # Return single point or None
        ...

    def delete(self, collection: str, vec_ids: list[str]) -> None:
        ...

    def count(self, collection: str) -> int:
        ...

    def scroll(self, collection: str, limit: int = 1000) -> list:
        # Return all points (used by BGCompressor to scan for aged vectors)
        ...

    def set_payload(self, collection: str, vec_id: str, patch: dict) -> None:
        # Merge-update payload fields without overwriting the whole document
        ...
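
As a reference point, here is a hypothetical dict-backed adapter that fills in all seven methods. It is a sketch for local experiments, not part of ChronoQuant: plain lists stand in for np.ndarray, and the Point record is an assumption about what "objects with .id, .score, .payload, .vector" might look like.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Point:
    """Minimal record exposing the attributes the interface expects."""
    id: str
    vector: list
    payload: dict = field(default_factory=dict)
    score: float = 0.0


def _cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryAdapter:
    """Hypothetical adapter that keeps every collection in a dict."""

    def __init__(self) -> None:
        self._data = {}  # collection name -> {vec_id: Point}

    def upsert(self, collection, vec_id, vector, payload) -> None:
        self._data.setdefault(collection, {})[vec_id] = Point(vec_id, list(vector), dict(payload))

    def search(self, collection, query, limit) -> list:
        points = self._data.get(collection, {}).values()
        scored = [Point(p.id, p.vector, p.payload, _cosine(query, p.vector)) for p in points]
        return sorted(scored, key=lambda p: p.score, reverse=True)[:limit]

    def get(self, collection, vec_id):
        return self._data.get(collection, {}).get(vec_id)

    def delete(self, collection, vec_ids) -> None:
        for vec_id in vec_ids:
            self._data.get(collection, {}).pop(vec_id, None)

    def count(self, collection) -> int:
        return len(self._data.get(collection, {}))

    def scroll(self, collection, limit=1000) -> list:
        return list(self._data.get(collection, {}).values())[:limit]

    def set_payload(self, collection, vec_id, patch) -> None:
        point = self.get(collection, vec_id)
        if point is not None:
            point.payload.update(patch)  # merge, do not overwrite
```

An adapter like this is also handy in unit tests, where starting a real vector database would be overkill.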

2. Pass your adapter to the SDK

from chronoquant.sdk import ChronoQuantClient
from my_project.adapters.pinecone_adapter import PineconeAdapter

# Bypass built-in Qdrant wiring
client = ChronoQuantClient.__new__(ChronoQuantClient)
client._qdrant = PineconeAdapter(api_key="...", index="my-index")
# ... init remaining fields (pq, router, wal, etc.) as needed

3. Or pass it directly to BGCompressor

For lighter-weight integration (compression only, no full SDK):

import asyncio

from chronoquant.engine.bg_compressor import BGCompressor
from chronoquant.engine.polar_quant import PolarQuant
from my_project.adapters.weaviate_adapter import WeaviateAdapter


async def main() -> None:
    compressor = BGCompressor(
        qdrant=WeaviateAdapter(url="http://localhost:8080"),  # any adapter with the seven methods
        pq=PolarQuant(seed=42),
        redis=None,  # optional; None skips the distributed lock
    )
    await compressor.run_compression_job()


# run_compression_job() is a coroutine, so it needs an event loop
asyncio.run(main())

Adapter examples (community / bring-your-own)

Database  Adapter status  Notes
Qdrant    ✅ built-in      chronoquant.adapters.qdrant_adapter
Pinecone  PRs welcome     Needs upsert + query + fetch + delete
Weaviate  PRs welcome     GraphQL query → adapt .score field name
pgvector  PRs welcome     SQL adapter; scroll = SELECT * WHERE collection = ?
Chroma    PRs welcome     collection.query() returns distances, invert to similarity
Milvus    PRs welcome     Flush before scroll

The only hard constraint: the search() method must return objects with .score as cosine similarity (nominally 0–1, higher = more similar). If your DB returns cosine distances, convert with score = 1 - distance before returning; other distance metrics need their own conversion.
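
The conversion depends on which distance the database reports. Two common cases are sketched below with hypothetical helper names. The squared-L2 formula only holds when both vectors are unit length, which follows from ||a - b||² = 2 - 2·cos(a, b).

```python
def cosine_distance_to_similarity(distance: float) -> float:
    """Cosine distance is defined as 1 - cosine similarity."""
    return 1.0 - distance


def squared_l2_to_similarity(squared_distance: float) -> float:
    """Valid for unit-length vectors only: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - squared_distance / 2.0


if __name__ == "__main__":
    print(cosine_distance_to_similarity(0.25))
    print(squared_l2_to_similarity(0.5))
```

Check which metric your collection was created with before picking a conversion; applying the cosine formula to L2 distances silently skews the ranking.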


Schema migration

Existing Qdrant collections on schema v1 can be patched to v2 in-place:

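from migrations.v1_to_v2 import migrate
from chronoquant.adapters.qdrant_adapter import QdrantAdapter

# qdrant_adapter: a QdrantAdapter instance already connected to your Qdrant deployment
counts = migrate(qdrant_adapter)
# Per-tier counts, e.g. {"anchor": 0, "hot": 42, "warm": 17, ...}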

Adds compressed_at field and bumps schema_version to 2. Safe to run multiple times (idempotent).
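
Idempotency here usually comes down to checking the version before touching a payload. A sketch of that pattern on a plain dict is shown below; the function name, the default of None for compressed_at, and treating a missing schema_version as v1 are assumptions, not the behaviour of migrations/v1_to_v2.py.

```python
def migrate_payload(payload: dict) -> bool:
    """Upgrade one payload dict to schema v2 in place; return True if it changed."""
    if payload.get("schema_version", 1) >= 2:
        return False                          # already migrated: nothing to do
    payload.setdefault("compressed_at", None)  # assumed default for never-compressed vectors
    payload["schema_version"] = 2
    return True


if __name__ == "__main__":
    point = {"tier": "hot", "importance": 0.3}
    print(migrate_payload(point), point)
    print(migrate_payload(point), point)      # second run is a no-op
```

Using setdefault rather than plain assignment keeps an existing compressed_at value intact if a payload was partially migrated.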


Tests

poetry run pytest -v

32 tests across all components. Recall validation suite:

poetry run pytest benchmarks/recall_validation.py -v

Validates PolarQuant meets recall thresholds on 1000 random 1536-dim unit vectors:

test_int8_recall  PASSED  (mean cosine ≥ 0.99)
test_4bit_recall  PASSED  (mean cosine ≥ 0.97)
test_3bit_recall  PASSED  (mean cosine ≥ 0.95)

Project layout

chronoquant/
├── chronoquant/
│   ├── engine/
│   │   ├── polar_quant.py       # RHT + scalar quantization + bit-packing
│   │   ├── tier_router.py       # age + importance → tier name
│   │   ├── anchor_detector.py   # importance threshold check
│   │   ├── bg_compressor.py     # Redlock + shadow-write demotion loop
│   │   └── score_calibrator.py  # per-tier cosine offset correction
│   ├── adapters/
│   │   └── qdrant_adapter.py    # Qdrant CRUD + UUID hashing
│   ├── tracking/
│   │   └── access_tracker.py    # Redis sorted set access log
│   ├── wal/
│   │   └── write_ahead_log.py   # SQLite WAL with idempotent replay
│   ├── proxy/
│   │   └── server.py            # FastAPI proxy (port 8765)
│   └── sdk.py                   # ChronoQuantClient
├── migrations/
│   └── v1_to_v2.py
├── plugins/
│   └── agentmemory/             # agentmemory JS plugin
├── benchmarks/
│   ├── storage_savings.py       # storage reduction + throughput table
│   └── recall_validation.py     # pytest recall thresholds
├── tests/                       # 32 unit + integration tests
├── demo.py                      # sentence-transformers end-to-end demo
├── tier_aging_demo.py           # BGCompressor tier aging visualization
└── docker-compose.yml           # Qdrant + Redis

Configuration

Parameter      Default                Description
embedding_dim  1536                   Vector dimension (must match Qdrant collection)
qdrant_url     http://localhost:6333  Qdrant endpoint
redis_url      None                   Redis endpoint (optional; disables Redlock + access tracking if absent)
wal_db_path    chronoquant_wal.db     SQLite WAL path (:memory: for testing)
pq_seed        42                     PolarQuant RNG seed (must be consistent across restarts)

BGCompressor age thresholds (days):

Demotion     Threshold
hot → warm   7 days
warm → cool  30 days
cool → cold  90 days

Limitations

  • Throughput: PolarQuant encode/decode is ~160–270 vectors/sec single-threaded Python at 1536-dim. This is a background compression job — not on the search hot path.
  • Dim constraint: PolarQuant pads to next power-of-2. A 1000-dim vector is stored as 1024-dim coefficients, adding ~2.4% overhead.
  • Redis optional but recommended: Without Redis, BGCompressor skips distributed locking (safe for single-process deployments). Access tracking and score calibration are degraded.
  • No automatic re-promotion: Vectors do not move back up tiers if accessed frequently. Access counts are tracked but re-promotion is not yet implemented.
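
The padding cost in the dim-constraint bullet is easy to estimate for any dimension. A small sketch, assuming padding goes to the next power of two exactly as described there (the helper names are hypothetical):

```python
def padded_dim(dim: int) -> int:
    """Smallest power of two that is >= dim (dim must be >= 1)."""
    return 1 << (dim - 1).bit_length()


def padding_overhead(dim: int) -> float:
    """Extra coefficients stored, as a fraction of the original dimension."""
    return padded_dim(dim) / dim - 1.0


if __name__ == "__main__":
    for dim in (1000, 1024, 384, 768, 1536, 3072):
        print(f"{dim:>5} -> {padded_dim(dim):>5}  (+{padding_overhead(dim):.1%})")
```

Under that assumption, dimensions that sit midway between powers of two, such as 1536 (padded to 2048), pay about 33% rather than 2.4%, which is worth factoring into storage estimates for the quantized tiers.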

License

MIT
