Time-aware embedding compression middleware for Qdrant.
Store vectors at float16 precision while they are fresh, then compress them automatically as they age. This saves 70%+ of Qdrant storage while keeping mean cosine similarity to the original vectors at ≥ 0.95.
```
app → ChronoQuant proxy → Qdrant
              ↓
    tier-routes + compresses
  embeddings by age + importance
```
Every vector is routed to one of five tiers on write. A background compressor demotes vectors as they age, applying progressively lossier quantization while keeping cosine similarity to the original above each tier's threshold.
| Tier | Encoding | Age threshold | Recall (mean cosine vs. original) | Storage vs float32 |
|---|---|---|---|---|
| anchor | float16 | never demoted | 100% | −50% |
| hot | float16 | < 7 days | 100% | −50% |
| warm | int8 (PolarQuant) | 7–30 days | ≥99% | −75% |
| cool | 4-bit (PolarQuant) | 30–90 days | ≥97% | −87.5% |
| cold | 3-bit (PolarQuant) | > 90 days | ≥95% | −90.6% |
Anchor tier: any vector with importance ≥ 0.7 is pinned here forever — never compressed, never demoted.
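The routing rule from the table and the anchor note can be written as a small pure function. This is a minimal sketch, not ChronoQuant's `TierRouter`. The name `route_tier` is hypothetical, and so is the boundary handling at exactly 7, 30 and 90 days, because the table does not say which side of each boundary is inclusive.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

ANCHOR_IMPORTANCE = 0.7  # importance >= 0.7 pins a vector to the anchor tier

# (exclusive upper age bound in days, tier); boundary handling is an assumption
AGE_TIERS = [(7, "hot"), (30, "warm"), (90, "cool")]


def route_tier(importance: float, created_at: datetime, now: datetime | None = None) -> str:
    """Hypothetical age + importance -> tier mapping, following the tier table."""
    if importance >= ANCHOR_IMPORTANCE:
        return "anchor"  # never compressed, never demoted
    now = now or datetime.now(timezone.utc)
    age_days = (now - created_at).total_seconds() / 86400
    for upper_bound, tier in AGE_TIERS:
        if age_days < upper_bound:
            return tier
    return "cold"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(route_tier(0.75, now - timedelta(days=400), now))  # anchor
print(route_tier(0.2, now - timedelta(days=45), now))    # cool
```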
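PolarQuant: a Randomized Hadamard Transform (RHT), followed by scalar quantization and bit-packing. The random rotation spreads each vector's energy evenly across coordinates. As a result, no single large coordinate stretches the quantization range, and each bit carries more precision.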
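Below is a minimal NumPy sketch of the encode/decode round trip described above. It has four steps:

1. Random ±1 signs derived from an integer seed.
2. An orthonormal fast Walsh-Hadamard transform.
3. Per-vector min-max scalar quantization to `bits` bits.
4. Bit-packing with `np.packbits`.

The function names, the min-max scaling, and storing `(lo, hi)` as side metadata are all assumptions made for illustration. ChronoQuant's actual PolarQuant may differ on every one of these points.

```python
import numpy as np


def next_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()


def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform; applying it twice returns x."""
    n = x.shape[0]
    y = x.astype(np.float64)
    h = 1
    while h < n:
        y = y.reshape(-1, 2, h)
        # butterfly: each block of 2h values (a, b) becomes (a + b, a - b)
        y = np.stack((y[:, 0] + y[:, 1], y[:, 0] - y[:, 1]), axis=1).reshape(n)
        h *= 2
    return y / np.sqrt(n)


def random_signs(n: int, seed: int) -> np.ndarray:
    # The seed fully determines the rotation, so decoding must reuse it
    return np.random.default_rng(seed).choice(np.array([-1.0, 1.0]), size=n)


def pack_bits(codes: np.ndarray, bits: int) -> np.ndarray:
    # Expand each code into `bits` bits (MSB first), then pack 8 bits per byte
    bit_matrix = (codes[:, None] >> np.arange(bits - 1, -1, -1)) & 1
    return np.packbits(bit_matrix.astype(np.uint8).ravel())


def unpack_bits(packed: np.ndarray, bits: int, count: int) -> np.ndarray:
    flat = np.unpackbits(packed)[: count * bits].reshape(count, bits)
    return (flat.astype(np.int64) << np.arange(bits - 1, -1, -1)).sum(axis=1)


def rht_encode(x: np.ndarray, bits: int, seed: int) -> tuple[np.ndarray, float, float]:
    n = next_pow2(x.shape[0])
    padded = np.zeros(n)
    padded[: x.shape[0]] = x
    z = fwht(padded * random_signs(n, seed))  # randomized Hadamard rotation
    lo, hi = float(z.min()), float(z.max())
    scale = (hi - lo) or 1.0  # guard against a constant vector
    codes = np.round((z - lo) / scale * (2**bits - 1)).astype(np.uint8)  # min-max quantization
    return pack_bits(codes, bits), lo, hi


def rht_decode(packed: np.ndarray, lo: float, hi: float, bits: int, dim: int, seed: int) -> np.ndarray:
    n = next_pow2(dim)
    codes = unpack_bits(packed, bits, n)
    z = lo + codes * ((hi - lo) / (2**bits - 1))
    # The normalized Hadamard transform is self-inverse, and sign flips undo themselves
    return (fwht(z) * random_signs(n, seed))[:dim]
```

Because `random_signs` regenerates the rotation from `seed`, decoding with a different seed than the one used to encode produces a meaningless vector.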
At 1 million vectors with an illustrative age distribution (5% anchor, 15% hot, 25% warm, 30% cool, 25% cold):
| Dim | Raw Qdrant (float32) | ChronoQuant | Saved |
|---|---|---|---|
| 384 | 1.4 GiB | 0.4 GiB | 70% |
| 768 | 2.9 GiB | 0.9 GiB | 70% |
| 1536 | 5.7 GiB | 1.7 GiB | 70% |
| 3072 | 11.4 GiB | 3.4 GiB | 70% |
Savings grow as your corpus ages. A collection where 80%+ of vectors are > 30 days old approaches 85–90% reduction.
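As a back-of-the-envelope check on the savings table, the sketch below computes the average raw vector bytes for the stated tier mix. It has two limits:

- It counts vector bytes only. Per-vector quantization metadata, payloads and index overhead are left out.
- Treating only the PolarQuant tiers (warm, cool, cold) as padded to a power of two is an assumption.

The result for 1536 dims is roughly 74 to 78%. That is higher than the measured ~70%, presumably because the benchmark includes overhead this estimate ignores.

```python
# Bytes per stored dimension, from the encodings in the tier table
TIER_BYTES_PER_DIM = {
    "anchor": 2.0,  # float16
    "hot": 2.0,     # float16
    "warm": 1.0,    # int8
    "cool": 0.5,    # 4-bit
    "cold": 0.375,  # 3-bit
}
PADDED_TIERS = {"warm", "cool", "cold"}  # assumption: only PolarQuant tiers are padded


def next_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()


def bytes_per_vector(dim: int, mix: dict[str, float], pad: bool = False) -> float:
    """Average raw vector bytes for a tier mix given as fractions summing to 1."""
    total = 0.0
    for tier, fraction in mix.items():
        stored_dim = next_pow2(dim) if pad and tier in PADDED_TIERS else dim
        total += fraction * TIER_BYTES_PER_DIM[tier] * stored_dim
    return total


def savings(dim: int, mix: dict[str, float], pad: bool = False) -> float:
    return 1.0 - bytes_per_vector(dim, mix, pad) / (4.0 * dim)  # vs float32


MIX = {"anchor": 0.05, "hot": 0.15, "warm": 0.25, "cool": 0.30, "cold": 0.25}
print(f"{savings(1536, MIX):.1%}")            # 77.7% (no padding)
print(f"{savings(1536, MIX, pad=True):.1%}")  # 73.5% (1536 padded to 2048)
```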
Run the benchmark yourself:

```bash
poetry run python benchmarks/storage_savings.py
```

Requirements: Docker, Python 3.11+, Poetry
```bash
git clone https://github.com/your-org/chronoquant
cd chronoquant
poetry install

# Start Qdrant + Redis
docker compose up -d

# Run the demo (real sentence-transformer embeddings)
poetry run python demo.py
```

Watch BGCompressor demote vectors across tiers in real time:
```bash
poetry run python tier_aging_demo.py
```

Output:

```
Tier counts BEFORE compression:
anchor  ██ 2
hot     ██████ 6
warm    ██ 2
cool    ██ 2
cold    0

Running BGCompressor...

Tier counts AFTER compression:
anchor  ██ 2
hot     ██████ 6 (-4)
warm    █████ 5 (+5)
cool    ████ 4 (+4)
cold    ██ 2 (+2)

Score drift (before vs after compression):
[anchor] +0.0000  Transformers use self-attention...
[hot   ] +0.0000  Flash Attention rewrites the softmax...
[warm  ] +0.0031  Sparse autoencoders decompose...
[cold  ] +0.0052  Gradient checkpointing trades compute...

Takeaway: semantic similarity preserved despite lossy compression.
```
```python
import asyncio

import numpy as np
from chronoquant.sdk import ChronoQuantClient


async def main() -> None:
    client = ChronoQuantClient(
        qdrant_url="http://localhost:6333",
        redis_url="redis://localhost:6379",
        embedding_dim=1536,
    )

    # Placeholder unit vectors; substitute your model's float32 embeddings
    rng = np.random.default_rng(0)
    my_embedding, query_vec = rng.standard_normal((2, 1536)).astype(np.float32)
    my_embedding /= np.linalg.norm(my_embedding)
    query_vec /= np.linalg.norm(query_vec)

    # Store a vector
    vec_id = client.store(
        text="Flash Attention rewrites the softmax kernel for IO-bound efficiency.",
        embedding=my_embedding,  # np.ndarray, float32, normalized
        importance=0.75,  # ≥0.7 → anchor tier (never compressed)
    )

    # Search across all tiers
    results = await client.search(query_vec, top_k=10)
    for r in results:
        print(r["score"], r["payload"]["raw_text_ref"])

    # Run compression manually (normally a background job)
    await client.run_compressor()


asyncio.run(main())
```

Drop ChronoQuant in as an HTTP proxy between your app and Qdrant:
```bash
poetry run uvicorn chronoquant.proxy.server:app --port 8765
```

Endpoints:
| Method | Path | Body / purpose |
|---|---|---|
| POST | /store | `{text, embedding, importance}` |
| POST | /search | `{embedding, top_k, include_cold}` |
| POST | /compress | `{}`; triggers BGCompressor manually |
| GET | /health | Liveness check |
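Below is a minimal Python client for the endpoints above, using only the standard library. It assumes the proxy accepts and returns JSON with the field names listed in the table. `PROXY_URL`, `build_request` and `call` are hypothetical helpers and are not part of ChronoQuant.

```python
from __future__ import annotations

import json
import urllib.request

PROXY_URL = "http://localhost:8765"  # port from the uvicorn command above


def build_request(path: str, body: dict | None = None) -> urllib.request.Request:
    """JSON POST when a body is given, plain GET otherwise (e.g. /health)."""
    url = f"{PROXY_URL}{path}"
    if body is None:
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, body: dict | None = None, timeout: float = 5.0):
    """Send the request and decode the JSON response (raises URLError if the proxy is down)."""
    with urllib.request.urlopen(build_request(path, body), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, with a running proxy you would call `call("/search", {"embedding": query_embedding, "top_k": 10, "include_cold": False})`. Pass embeddings as plain lists (e.g. `vec.tolist()` for NumPy arrays), because `json.dumps` cannot serialize an `ndarray`.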
```
┌──────────────────────────────────────────────────┐
│ ChronoQuantClient                                │
│                                                  │
│ store()  → TierRouter → QdrantAdapter → Qdrant   │
│          → WriteAheadLog (SQLite)                │
│                                                  │
│ search() → fan-out all tiers → ScoreCalibrator   │
│          → AccessTracker (Redis sorted set)      │
│                                                  │
│ BGCompressor (background)                        │
│   Redlock (Redis): prevents concurrent runs      │
│   For each tier, for each aged vector:           │
│     PolarQuant.encode() → shadow-write →         │
│     verify recall ≥ 0.95 → delete source         │
└──────────────────────────────────────────────────┘
```
BGCompressor safety: shadow-write pattern — encode, upsert to target tier, verify the point exists and recall ≥ 0.95, then delete from source. A failed recall check aborts the migration for that vector; it stays in its current tier.
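The safety sequence can be sketched with in-memory dicts standing in for tier collections. Everything in this sketch is hypothetical:

- `TierStore` and `demote` are illustrative names.
- The `round_trip` callable stands in for PolarQuant encode plus decode.
- Discarding the shadow copy after a failed check is an assumption. The text above only says the vector stays in its current tier.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

import numpy as np


@dataclass
class Point:
    vector: np.ndarray
    payload: dict


@dataclass
class TierStore:
    """Hypothetical in-memory stand-in for one tier's collection."""
    points: dict[str, Point] = field(default_factory=dict)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0


def demote(
    vec_id: str,
    source: TierStore,
    target: TierStore,
    round_trip: Callable[[np.ndarray], np.ndarray],
    min_cosine: float = 0.95,
) -> bool:
    """Shadow-write vec_id from source to target; return True only if it migrated."""
    original = source.points[vec_id]
    approx = round_trip(original.vector)                           # 1. encode (decoded for checking)
    target.points[vec_id] = Point(approx, dict(original.payload))  # 2. upsert into target tier
    stored = target.points.get(vec_id)                             # 3a. verify the point exists
    if stored is None or cosine(original.vector, stored.vector) < min_cosine:  # 3b. verify recall
        target.points.pop(vec_id, None)  # assumption: discard the shadow copy on failure
        return False                     # vector stays in its current tier
    del source.points[vec_id]            # 4. delete from source only after verification
    return True
```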
WriteAheadLog: SQLite-backed. Entries are written with INSERT OR IGNORE, so re-appending the same entry is a no-op and replay is idempotent. The log survives process crashes, and pending entries can be replayed on startup.
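The README does not document the WAL's table layout. The sketch below assumes one row per `vec_id` plus a `done` flag, and shows how INSERT OR IGNORE makes a retried append harmless. `WalSketch` and its methods are hypothetical.

```python
import json
import sqlite3


class WalSketch:
    """Hypothetical SQLite write-ahead log keyed by vec_id (not ChronoQuant's schema)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS wal ("
            " vec_id TEXT PRIMARY KEY,"
            " op TEXT NOT NULL,"
            " body TEXT NOT NULL,"
            " done INTEGER NOT NULL DEFAULT 0)"
        )
        self.conn.commit()

    def append(self, vec_id: str, op: str, body: dict) -> None:
        # INSERT OR IGNORE: a retried append for the same vec_id is a silent no-op
        self.conn.execute(
            "INSERT OR IGNORE INTO wal (vec_id, op, body) VALUES (?, ?, ?)",
            (vec_id, op, json.dumps(body)),
        )
        self.conn.commit()

    def mark_done(self, vec_id: str) -> None:
        self.conn.execute("UPDATE wal SET done = 1 WHERE vec_id = ?", (vec_id,))
        self.conn.commit()

    def pending(self) -> list[tuple[str, str, dict]]:
        # Entries written but never confirmed, in insertion order, for replay on startup
        rows = self.conn.execute("SELECT vec_id, op, body FROM wal WHERE done = 0 ORDER BY rowid")
        return [(vec_id, op, json.loads(body)) for vec_id, op, body in rows]
```

On startup, a replay loop would do three things for each entry returned by `pending()`:

1. Re-apply the entry to the vector store.
2. Call `mark_done()` on it.
3. Rely on upserts being idempotent, so an entry applied twice is harmless.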
ScoreCalibrator: Applies per-tier cosine score offsets, learned from at least 500 samples, so that scores from compressed tiers are directly comparable with float16 scores.
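The README does not say how the offsets are learned. One plausible reading, sketched below as an assumption, works as follows:

- For each tier, record the gap between the float16 (reference) score and the compressed-tier score for the same query and vector pair.
- The offset is the mean of those gaps.
- The offset is applied only once the tier has at least 500 samples.

`CalibratorSketch` is hypothetical.

```python
from collections import defaultdict
from statistics import fmean

MIN_SAMPLES = 500  # offsets are only trusted after this many samples (per the README)


class CalibratorSketch:
    """Hypothetical additive per-tier offset: mean(float16 score - compressed score)."""

    def __init__(self, min_samples: int = MIN_SAMPLES) -> None:
        self.min_samples = min_samples
        self._gaps: defaultdict[str, list[float]] = defaultdict(list)

    def observe(self, tier: str, reference_score: float, compressed_score: float) -> None:
        # reference_score: float16 score for the same query/vector pair
        self._gaps[tier].append(reference_score - compressed_score)

    def offset(self, tier: str) -> float:
        gaps = self._gaps.get(tier, [])
        if len(gaps) < self.min_samples:
            return 0.0  # too few samples: leave scores unadjusted
        return fmean(gaps)

    def calibrate(self, tier: str, score: float) -> float:
        return score + self.offset(tier)
```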
ChronoQuant ships a ready-made plugin that intercepts agentmemory save/search calls and routes them through the compression proxy. Embeddings are tiered by age automatically, so your agent's memory shrinks over time while retrieval quality stays within the per-tier cosine thresholds.
```bash
# In your agentmemory project
npm install /path/to/chronoquant/plugins/agentmemory
# or copy plugins/agentmemory/ into your project directly
```

```javascript
// Set the proxy URL before loading the plugin, in case it reads the variable at load time
// (default: http://localhost:8765)
process.env.CHRONOQUANT_PROXY_URL = 'http://localhost:8765';

const { memory_save_hook, memory_search_hook } = require('chronoquant-agentmemory-plugin');

// CommonJS modules can't use top-level await, so the calls live in an async function
async function rememberAndRecall(embedding, queryEmbedding) {
  // Replace agentmemory's default store call:
  await memory_save_hook({
    text: "Flash Attention rewrites the softmax kernel for IO-bound efficiency.",
    embedding, // your embedding as a plain JS array or TypedArray
    importance: 0.6, // 0.0–1.0; ≥0.7 pins to anchor tier permanently
  });

  // Replace agentmemory's default search call:
  const results = await memory_search_hook({
    embedding: queryEmbedding,
    top_k: 10,
    include_cold: false, // set true to also search the 90d+ cold tier (slower, more thorough)
  });
  // results: [{ vec_id, score, payload: { text, tier, importance, ... } }, ...], or null if the proxy is unreachable
  return results;
}
```

- Writes are fire-and-forget with a 2 s timeout. A failed ChronoQuant write logs a warning and does not crash your agent.
- Searches fall back gracefully: if the proxy is unreachable, `memory_search_hook` returns `null` and your code can fall back to a direct Qdrant call.
- No agentmemory internals are modified; the plugin consists purely of pass-through hooks.
```bash
# terminal 1: infrastructure
docker compose up -d

# terminal 2: ChronoQuant proxy
poetry run uvicorn chronoquant.proxy.server:app --port 8765

# terminal 3: your agent (its memory writes now go through the compression proxy)
node your-agent.js
```

ChronoQuant's proxy layer is database-agnostic. The HTTP proxy speaks a simple REST protocol, and the backend adapter is swappable. To add a new vector DB, implement an adapter class:
```python
# chronoquant/adapters/my_db_adapter.py
from __future__ import annotations

import numpy as np


class MyDBAdapter:
    """Implement these seven methods and ChronoQuant works out of the box."""

    def upsert(self, collection: str, vec_id: str, vector: np.ndarray, payload: dict) -> None:
        ...

    def search(self, collection: str, query: np.ndarray, limit: int) -> list:
        # Return objects with .id, .score, .payload, .vector attributes
        ...

    def get(self, collection: str, vec_id: str):
        # Return a single point, or None if it does not exist
        ...

    def delete(self, collection: str, vec_ids: list[str]) -> None:
        ...

    def count(self, collection: str) -> int:
        ...

    def scroll(self, collection: str, limit: int = 1000) -> list:
        # Return all points (used by BGCompressor to scan for aged vectors)
        ...

    def set_payload(self, collection: str, vec_id: str, patch: dict) -> None:
        # Merge-update payload fields without overwriting the whole document
        ...
```

Then point the SDK at your adapter:

```python
from chronoquant.sdk import ChronoQuantClient
from my_project.adapters.pinecone_adapter import PineconeAdapter

# Bypass built-in Qdrant wiring
client = ChronoQuantClient.__new__(ChronoQuantClient)
client._qdrant = PineconeAdapter(api_key="...", index="my-index")
# ... init remaining fields (pq, router, wal, etc.) as needed
```

For lighter-weight integration (compression only, no full SDK):
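```python
import asyncio

from chronoquant.engine.bg_compressor import BGCompressor
from chronoquant.engine.polar_quant import PolarQuant
from my_project.adapters.weaviate_adapter import WeaviateAdapter

compressor = BGCompressor(
    qdrant=WeaviateAdapter(url="http://localhost:8080"),  # any adapter implementing the interface above
    pq=PolarQuant(seed=42),  # use the same seed that encoded your existing vectors
    redis=None,  # optional: skip distributed lock
)

# run_compression_job() is a coroutine, so drive it with an event loop
asyncio.run(compressor.run_compression_job())
```

| Database | Adapter status | Notes |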
|---|---|---|
| Qdrant | ✅ built-in | chronoquant.adapters.qdrant_adapter |
| Pinecone | PRs welcome | Needs upsert + query + fetch + delete |
| Weaviate | PRs welcome | GraphQL query → adapt .score field name |
| pgvector | PRs welcome | SQL adapter; scroll = `SELECT * FROM <your_table> WHERE collection = ?` |
| Chroma | PRs welcome | collection.query() returns distances; convert them to similarity |
| Milvus | PRs welcome | Flush before scroll |
The only hard constraint: search() must return objects whose .score is cosine similarity (range -1 to 1, higher = more similar). If your DB returns cosine distance, convert it with score = 1 - distance before returning. Other distance metrics need their own conversion.
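An in-memory adapter is a convenient way to exercise the seven-method interface in tests. The sketch below is hypothetical and does not ship with ChronoQuant. It stores vectors in dicts and scores hits by exact cosine similarity, as the constraint above requires. It also includes two conversion helpers:

- `1 - d` for cosine distance.
- `1 - d/2` for squared Euclidean distance between unit vectors. This works because ‖a − b‖² = 2 − 2·cos(a, b) when ‖a‖ = ‖b‖ = 1.

```python
from __future__ import annotations

from dataclasses import dataclass

import numpy as np


@dataclass
class Point:
    id: str
    payload: dict
    vector: np.ndarray
    score: float = 0.0  # cosine similarity for search hits


def cosine_distance_to_similarity(d: float) -> float:
    return 1.0 - d  # valid only for cosine distance (1 - cos)


def squared_l2_to_similarity(d2: float) -> float:
    return 1.0 - d2 / 2.0  # valid only for unit vectors: |a - b|^2 = 2 - 2 cos


class InMemoryAdapter:
    """Hypothetical test adapter implementing the seven-method interface."""

    def __init__(self) -> None:
        self._data: dict[str, dict[str, Point]] = {}

    def _coll(self, collection: str) -> dict[str, Point]:
        return self._data.setdefault(collection, {})

    def upsert(self, collection: str, vec_id: str, vector: np.ndarray, payload: dict) -> None:
        self._coll(collection)[vec_id] = Point(vec_id, dict(payload), np.asarray(vector, dtype=np.float32))

    def search(self, collection: str, query: np.ndarray, limit: int) -> list[Point]:
        q = np.asarray(query, dtype=np.float32)
        hits = []
        for p in self._coll(collection).values():
            denom = float(np.linalg.norm(q) * np.linalg.norm(p.vector)) or 1.0
            hits.append(Point(p.id, p.payload, p.vector, float(q @ p.vector) / denom))
        return sorted(hits, key=lambda h: h.score, reverse=True)[:limit]

    def get(self, collection: str, vec_id: str) -> Point | None:
        return self._coll(collection).get(vec_id)

    def delete(self, collection: str, vec_ids: list[str]) -> None:
        for vid in vec_ids:
            self._coll(collection).pop(vid, None)

    def count(self, collection: str) -> int:
        return len(self._coll(collection))

    def scroll(self, collection: str, limit: int = 1000) -> list[Point]:
        return list(self._coll(collection).values())[:limit]

    def set_payload(self, collection: str, vec_id: str, patch: dict) -> None:
        self._coll(collection)[vec_id].payload.update(patch)  # merge, don't overwrite
```

Check your backend's documentation for which distance it reports before picking a conversion.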
Existing Qdrant collections on schema v1 can be patched to v2 in place:

```python
from chronoquant.adapters.qdrant_adapter import QdrantAdapter
from migrations.v1_to_v2 import migrate

qdrant_adapter = QdrantAdapter(...)  # connection arguments elided; configure for your Qdrant instance
counts = migrate(qdrant_adapter)
# counts per tier, e.g. {"anchor": 0, "hot": 42, "warm": 17, ...}
```

Adds a compressed_at field and bumps schema_version to 2. Safe to run multiple times (idempotent).
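The idempotency comes from checking the stored schema version before patching. Below is a minimal sketch of that check built on the adapter interface's `scroll()` and `set_payload()`. It assumes one collection per tier, named after the tier, and uses `compressed_at = None` as the v2 default. `migrate_payload` and `migrate_sketch` are hypothetical and are not the contents of `migrations/v1_to_v2.py`.

```python
TIERS = ("anchor", "hot", "warm", "cool", "cold")


def migrate_payload(payload: dict) -> dict:
    """Return the patch that brings one payload to schema v2 ({} if already v2)."""
    if payload.get("schema_version", 1) >= 2:
        return {}  # already migrated: this is what makes re-runs safe
    patch: dict = {"schema_version": 2}
    if "compressed_at" not in payload:
        patch["compressed_at"] = None  # assumed v2 default: not compressed yet
    return patch


def migrate_sketch(adapter, collections=TIERS) -> dict[str, int]:
    """Patch every v1 point; return how many points were changed per collection."""
    counts = {name: 0 for name in collections}
    for name in collections:
        # assumes scroll() returns every point; page through if your adapter caps it at `limit`
        for point in adapter.scroll(name):
            patch = migrate_payload(point.payload)
            if patch:
                adapter.set_payload(name, point.id, patch)
                counts[name] += 1
    return counts
```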
```bash
poetry run pytest -v
```

32 tests across all components. Recall validation suite:

```bash
poetry run pytest benchmarks/recall_validation.py -v
```

Validates that PolarQuant's reconstructions meet the recall thresholds (mean cosine similarity between each original and decoded vector) on 1000 random 1536-dim unit vectors:

```
test_int8_recall PASSED (mean cosine ≥ 0.99)
test_4bit_recall PASSED (mean cosine ≥ 0.97)
test_3bit_recall PASSED (mean cosine ≥ 0.95)
```
```
chronoquant/
├── chronoquant/
│   ├── engine/
│   │   ├── polar_quant.py        # RHT + scalar quantization + bit-packing
│   │   ├── tier_router.py        # age + importance → tier name
│   │   ├── anchor_detector.py    # importance threshold check
│   │   ├── bg_compressor.py      # Redlock + shadow-write demotion loop
│   │   └── score_calibrator.py   # per-tier cosine offset correction
│   ├── adapters/
│   │   └── qdrant_adapter.py     # Qdrant CRUD + UUID hashing
│   ├── tracking/
│   │   └── access_tracker.py     # Redis sorted set access log
│   ├── wal/
│   │   └── write_ahead_log.py    # SQLite WAL with idempotent replay
│   ├── proxy/
│   │   └── server.py             # FastAPI proxy (port 8765)
│   └── sdk.py                    # ChronoQuantClient
├── migrations/
│   └── v1_to_v2.py
├── plugins/
│   └── agentmemory/              # agentmemory JS plugin
├── benchmarks/
│   ├── storage_savings.py        # storage reduction + throughput table
│   └── recall_validation.py      # pytest recall thresholds
├── tests/                        # 32 unit + integration tests
├── demo.py                       # sentence-transformers end-to-end demo
├── tier_aging_demo.py            # BGCompressor tier aging visualization
└── docker-compose.yml            # Qdrant + Redis
```
| Parameter | Default | Description |
|---|---|---|
| `embedding_dim` | `1536` | Vector dimension (must match the Qdrant collection) |
| `qdrant_url` | `http://localhost:6333` | Qdrant endpoint |
| `redis_url` | `None` | Redis endpoint (optional; Redlock and access tracking are disabled if absent) |
| `wal_db_path` | `chronoquant_wal.db` | SQLite WAL path (`:memory:` for testing) |
| `pq_seed` | `42` | PolarQuant RNG seed; must stay the same across restarts, since it determines the random rotation used to encode stored vectors |
BGCompressor age thresholds (days):
| Demotion | Threshold |
|---|---|
| hot → warm | 7 days |
| warm → cool | 30 days |
| cool → cold | 90 days |
- Throughput: PolarQuant encode/decode is ~160–270 vectors/sec single-threaded Python at 1536-dim. This is a background compression job — not on the search hot path.
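- Dim constraint: PolarQuant pads to the next power of 2. A 1000-dim vector is stored as 1024-dim coefficients, adding ~2.4% overhead. Common embedding sizes such as 384, 768, 1536 and 3072 are not powers of 2 and are padded by ~33% (e.g. 1536 → 2048).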
- Redis optional but recommended: without Redis, BGCompressor skips distributed locking, which is safe only for single-process deployments. Access tracking is also disabled, and score calibration is degraded.
- No automatic re-promotion: Vectors do not move back up tiers if accessed frequently. Access counts are tracked but re-promotion is not yet implemented.
License: MIT