Warning
Status: Experimental Proof-of-Concept / RFC
This repository is a functional Request for Comments (RFC) demonstrating Polar Quantization mapping of embeddings via native C++ bindings. It intercepts Float32 arrays natively without incurring Node.js garbage-collection latency.
Note: While the quantization math and the Genkit semantic retrieval logic work end to end in this demo, the full 32x memory compression is constrained by Genkit's current `EmbedResponse` schema (which mandates `number[]`). This repository builds the algorithmic bridge for when the Genkit ecosystem officially supports native binary (`Uint8Array`) schemas and Hamming-distance vector database plugins.
The TurboQuant Embedder is a high-performance Firebase Genkit middleware plugin that bridges your Cloud Vector Database and standard LLM embedding models (like Google's `text-embedding-004`).
By intercepting high-dimensional Float32 embeddings and routing them through a native C++ N-API binary, this plugin performs 1-bit Polar Quantization in the background: it discards the floating-point mantissa and exponent and collapses each dimension to a single bit (1 or 0) based on its sign.
The result? Your vectors are compressed by roughly 96.9% before they ever hit the Vector Database, dramatically loosening memory constraints for RAG pipelines while largely preserving similarity structure.
- True Native C++ Hardware Quantization: Computes the 1-bit logic through `node-gyp`-compiled native extensions via N-API, running asynchronously without blocking the Node.js event loop.
- Next.js Turbopack Optimization: Uses a `new Function` dynamic-require workaround so the native `turboquant.node` binary loads cleanly in modern Next.js Edge/Server endpoints without breaking Turbopack's static analyzer.
- Genkit Middleware Interceptor: Structurally mimics the normal `EmbedResponse` types from `@genkit-ai/ai`, so your existing Indexers and Retrievers never know the data was swapped out from under them.
- Hybrid Semantic Grounding: Fuses chronological short-term conversational memory with long-term Vector Search retrieval in the LLM prompt block to prevent RAG semantic drift.
- Multi-Tenant Vector Isolation Sandbox: A built-in User Switcher in the UI routes the Next.js frontend state to explicitly filtered RAG instances on Vertex AI, keeping namespaces cleanly separated.
By intercepting raw Float32 embedding vectors and converting them to 1-bit precision, the plugin achieves striking efficiency across cloud Vector Database providers:
| Data Type | Dimensions | Bytes Per Vector | RAM Estimate (1M Vectors) |
|---|---|---|---|
| Float32 (Standard) | 768 | ~3,072 Bytes | ~3,072 MB ($$$) |
| TurboQuant (1-bit) | 768 | ~96 Bytes | ~96 MB ($) |
| Savings | - | ~96.9% Less Space | ~96.9% DB Cost Reduction |
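The table's numbers follow directly from the dimensionality. Here is a back-of-the-envelope sketch of the arithmetic in TypeScript, assuming 768 dimensions and 4 bytes per Float32 value:

```ts
// Back-of-the-envelope arithmetic behind the table above.
const dims = 768;

const float32Bytes = dims * 4;     // 4 bytes per Float32 dimension -> 3,072 bytes per vector
const quantizedBytes = dims / 8;   // 1 bit per dimension, packed 8 per byte -> 96 bytes per vector

const ratio = float32Bytes / quantizedBytes;        // 32x smaller
const savings = 1 - quantizedBytes / float32Bytes;  // ~0.969, i.e. ~96.9% less space

console.log({ float32Bytes, quantizedBytes, ratio, savings });
```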
By running this plug-in:
- Index sizes remain lightweight, keeping database I/O latency ultra-low.
- Serverless operations are significantly cheaper as retrieval payload sizes shrink by nearly 32x.
- Client bandwidth required to transfer indices over REST fetches is negligible.
Important
Production Requirement for True Memory Reduction
The C++ native core converts Float32 vectors into 1s and 0s. However, because standard JSON-based local test databases (`dev-local-vectorstore`) cannot accept packed binary payloads, the backend currently maps those 1s and 0s back out to standard JavaScript `Number` values.
To achieve the full ~96.9% compression footprint in production, you must connect Genkit to an enterprise Vector Database (e.g., Google Vertex AI Vector Search, Pinecone Serverless, or Qdrant) that officially supports Binary Quantized Vectors. These databases ingest the 1s and 0s generated by the C++ core and automatically bit-pack 8 boolean flags into a single byte.
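For illustration, here is a minimal TypeScript sketch of that bit-packing step. The function name and shape are hypothetical (this is what a binary-capable database does internally, not part of this plugin's API):

```ts
// Hypothetical illustration: pack an array of 0/1 sign bits into a Uint8Array,
// 8 boolean flags per byte, as a binary-quantization-capable database would store them.
function bitPack(bits: number[]): Uint8Array {
  const packed = new Uint8Array(Math.ceil(bits.length / 8));
  bits.forEach((bit, i) => {
    if (bit) packed[i >> 3] |= 1 << (i & 7); // set bit (i % 8) of byte (i / 8)
  });
  return packed;
}

// 768 one-bit dimensions collapse into 96 bytes.
const packed = bitPack(Array.from({ length: 768 }, () => Math.round(Math.random())));
console.log(packed.length); // 96
```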
The core mechanism behind the 32x compression is a custom C++ module bridged to Node.js through `node-addon-api` (N-API).
Because JavaScript garbage collection becomes a bottleneck when looping over millions of high-precision floating-point numbers, the hot path drops down to native code. The `CompressToPolarQuant` function in `src/turboquant.cpp` receives the arrays, evaluates the sign of each dimension, and maps it to a 1 or 0 without engaging the V8 JavaScript engine.
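The sign-bit rule itself is simple. Below is an illustrative TypeScript equivalent of the mapping (the real work happens in the native C++ routine; treating zero and positive values as 1 is an assumption for this sketch):

```ts
// Illustrative TypeScript equivalent of the native sign-bit mapping:
// each Float32 dimension becomes 1 if it is non-negative, otherwise 0.
function polarQuantize(vector: Float32Array): number[] {
  return Array.from(vector, (v) => (v >= 0 ? 1 : 0));
}

const demo = new Float32Array([0.12, -0.87, 0.0, -0.003]);
console.log(polarQuantize(demo)); // [1, 0, 1, 0]
```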
If you modify `src/turboquant.cpp`, run the included node-gyp script to rebuild the native binding:

```bash
npm run build:cpp
```

Because modern bundlers (like Next.js's Turbopack) attempt to statically analyze and bundle backend project files, importing `.node` binaries with standard JavaScript `require()` statements causes build failures.
To resolve this, `turboquant-embedder.ts` uses dynamic module resolution:

```ts
import { createRequire } from 'module';

// Obfuscated dynamic load specifically preventing Turbopack from tracing the C++ node runtime
const bypassRequire = new Function('require', 'return require')(createRequire(import.meta.url));
const turboquant = bypassRequire('./build/Release/turboquant.node');
```

To test true hardware compression locally through Google Cloud, you must provision an enterprise Vector Search cluster via the gcloud CLI.
- Save the following GCP Index configuration to a local file named `index_metadata.json`:
```json
{
  "contentsDeltaUri": "",
  "config": {
    "dimensions": 768,
    "approximateNeighborsCount": 150,
    "distanceMeasureType": "DOT_PRODUCT_DISTANCE",
    "algorithmConfig": { "treeAhConfig": { "leafNodeEmbeddingCount": 500, "leafNodesToSearchPercent": 7 } }
  }
}
```

- Authenticate and create your infrastructure (WARNING: Step 4 takes ~40 minutes because Google provisions a dedicated serverless shard!):
```bash
# 1. Login and set billing project
gcloud auth login
gcloud config set project YOUR_PROJECT_ID

# 2. Create the Database Index (MUST specify --index-update-method=stream_update)
gcloud ai indexes create --display-name="turboquant-demo-index-stream" --description="TurboQuant" --metadata-file=index_metadata.json --index-update-method=stream_update --region="us-central1"

# 3. Create a Public HTTP Endpoint Server (Save the ENDPOINT_ID returned!)
gcloud ai index-endpoints create --display-name="turboquant-demo-endpoint" --public-endpoint-enabled --region="us-central1"

# 4. Bind the Index to the Endpoint (Long Running Operation)
gcloud ai index-endpoints deploy-index YOUR_ENDPOINT_ID --deployed-index-id="turboquant_demo_stream" --display-name="turboquant-demo-deployed" --index=YOUR_INDEX_ID --region="us-central1"
```

Once provisioned, bind your Next.js application to `@genkit-ai/vertexai` to begin scaling hardware vector quantization!
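As a rough sketch, registering the plugin could look like the following. This assumes a recent Genkit API surface (`genkit` package plus the `@genkit-ai/vertexai` plugin); the exact options in this repository's configuration may differ:

```ts
import { genkit } from 'genkit';
import { vertexAI } from '@genkit-ai/vertexai';

// Minimal sketch: register the Vertex AI plugin so Genkit indexers/retrievers
// can reach the index and endpoint provisioned above. Option values are placeholders.
const ai = genkit({
  plugins: [
    vertexAI({ projectId: 'YOUR_PROJECT_ID', location: 'us-central1' }),
  ],
});
```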
If you are testing this application using your own Google Cloud project, you must configure your Vertex AI Vector Search index with Streaming Updates enabled. Genkit's `ai.index` command writes the vectors directly to Vertex AI, and it strictly requires `StreamUpdate` capabilities instead of `BatchUpdate`.
When provisioning your cluster on Google Cloud:
- Choose Vertex AI Search and Conversation -> Vector Search.
- When creating the index, under the "Update Method" configuration, ensure you select Streaming Update (not Batch).
- If you get `Error upserting datapoints into index [...] Bad Request. StreamUpdate is not enabled on this index`, you must recreate the index with the Streaming option toggled on.
- Set your GCP `PROJECT_ID`, `GCLOUD_PROJECT`, and standard `GOOGLE_APPLICATION_CREDENTIALS` in your `.env.local`.
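For reference, a minimal `.env.local` could look like this (the variable names come from the steps above and the demo setup below; the values shown are placeholders):

```bash
# .env.local — placeholder values only, substitute your own project details
PROJECT_ID=your-gcp-project-id
GCLOUD_PROJECT=your-gcp-project-id
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json

# LLM credentials used by the demo (see the demo setup steps below)
GEMINI_API_KEY=your_key
```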
Because Vertex AI Vector Search is highly optimized for pure vector calculation, it intentionally does not store the actual human-readable strings associated with your vectors (such as User Chat messages).
Genkit requires you to pair Vertex AI with a dedicated Document Store. In `demo/src/genkit/chatFlow.ts`, this repository bridges the vectors to Google Cloud Firestore, which holds the actual chat payloads.
To run the demo end to end, make sure Firestore is initialized in your GCP Project:
- Go to the Firestore tab in the GCP Console.
- Initialize a Native database.
- The demo application will automatically create a `User_Chats` collection, save your chat messages, and link each document to its vector embedding under the hood (see the sketch below).
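A minimal sketch of that pairing pattern with `firebase-admin` follows. The collection name comes from the demo; the field names and the exact wiring in `chatFlow.ts` are assumptions for illustration:

```ts
import { initializeApp, applicationDefault } from 'firebase-admin/app';
import { getFirestore } from 'firebase-admin/firestore';

initializeApp({ credential: applicationDefault() });
const db = getFirestore();

// Sketch: store the human-readable chat message in Firestore, then reuse the
// generated document ID as the datapoint ID in the Vertex AI vector index,
// so a vector hit can be mapped back to its original text.
async function saveChatMessage(userId: string, text: string): Promise<string> {
  const doc = await db.collection('User_Chats').add({
    userId,                 // multi-tenant isolation filter (assumed field name)
    text,                   // the payload Vertex AI Vector Search does not store
    createdAt: new Date(),
  });
  return doc.id; // use this ID when upserting the quantized vector
}
```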
Because Vertex AI Vector Search stores only the mathematical vectors, datapoints are not typically viewable in the GCP Console UI. If you want to confirm your vectors are saving correctly, you can read them back via the REST API:
```bash
TOKEN=$(gcloud auth print-access-token)
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://REGION-PROJECT_NUMBER.vdb.vertexai.goog/v1/projects/PROJECT_NUMBER/locations/REGION/indexEndpoints/YOUR_ENDPOINT_ID:readIndexDatapoints \
  -d '{"deployedIndexId": "turboquant_demo_stream", "ids": ["YOUR_FIRESTORE_DOCUMENT_ID"]}'
```

When you need to tear down or replace an index:

```bash
# First undeploy the old index
gcloud ai index-endpoints undeploy-index YOUR_ENDPOINT_ID --deployed-index-id=OLD_DEPLOYED_ID --region="us-central1"

# Once successful, delete the configuration
gcloud ai indexes delete OLD_INDEX_ID --region="us-central1"
```

A Next.js full-stack demonstration is included in this repository to showcase "Compressed Chat Memory".
You can run the full chat interface and watch underlying vector hits via the Genkit Developer UI locally!
- Clone the repo, then build the native C++ extension from the repository root: `npm run build:cpp`
- Move into the demo directory: `cd demo`
- Install the runtime dependencies: `npm install`
- Add a `.env.local` containing your LLM credentials (e.g. `GEMINI_API_KEY=your_key`)
- Spin up the Genkit Developer UI linked directly to the running Next.js application: `npm run genkit`

Visit http://localhost:3000 to interact with the Next.js chatbot and its live savings analytics, and visit http://localhost:4000 to inspect the live backend Vector DB traces.