
Commit 303ee58

feat: Add comprehensive benchmarking tool and improvements
Benchmarking Features:
- Created scripts/benchmark.py for performance testing
- Support 4 backends (pgvector, astradb, milvus, chroma)
- Support 2 embedding providers (OpenAI, Gemini)
- Test 7 operations: bulk/individual create, vector/metadata search, Query DSL, update, delete
- Added --skip-slow flag to skip cloud backends for faster testing
- Smart Query DSL optimization: 4 operators for slow backends, 10 for fast
- Performance summary shows tested vs skipped backends
- Comprehensive markdown reports with detailed metrics

Engine Enhancements:
- Added VectorEngine.drop_collection() method for cleanup
- Fixed DEFAULT_COLLECTION_NAME to use api_settings.VECTOR_COLLECTION_NAME

Architecture Improvements:
- Enhanced ABC base class with unified initialization
- Improved adapter collection name handling
- Better error reporting in benchmarks

Documentation:
- Added benchmarking section to README.md (102 lines)
- Created docs/benchmarking.md (385 lines complete guide)
- Updated docs/contributing.md with benchmarking workflow
- Added usage examples and best practices

Testing:
- All 365 unit tests passing
- Updated tests for collection name changes
- 40% overall coverage (core components 70-100%)

Breaking Changes:
- Removed DEFAULT_COLLECTION_NAME class constant (use settings instead)
1 parent b095fbe commit 303ee58


44 files changed: +5081 -1286 lines changed

.env.example

Lines changed: 55 additions & 12 deletions
@@ -1,35 +1,78 @@
-# OpenAI (for embeddings)
+# ===================================================================
+# Embedding Providers
+# ===================================================================
+
+# OpenAI
 OPENAI_API_KEY=sk-your-key-here
-OPENAI_EMBEDDING_MODEL=text-embedding-3-small
+
+# Gemini (Google)
+GEMINI_API_KEY=your-gemini-api-key
+
+# Embedding Model (optional, shared across providers)
+# If not set, each adapter uses its own default:
+# - OpenAI: text-embedding-3-small
+# - Gemini: gemini-embedding-001
+# OpenAI options: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
+# Gemini options: gemini-embedding-001, text-embedding-004, text-embedding-005
+# VECTOR_EMBEDDING_MODEL=gemini-embedding-001
+
+# ===================================================================
+# Vector Databases
+# ===================================================================
 
 # AstraDB
 ASTRA_DB_APPLICATION_TOKEN=AstraCS:your-token-here
 ASTRA_DB_API_ENDPOINT=https://your-id.apps.astra.datastax.com
-ASTRA_DB_COLLECTION_NAME=vector_documents
 
-# ChromaDB Cloud (optional)
+# ChromaDB Cloud
+# Note: Choose ONE deployment mode (Cloud, HTTP, or Local)
 CHROMA_API_KEY=your-chroma-api-key
 CHROMA_TENANT=your-tenant
 CHROMA_DATABASE=your-database
 
-# ChromaDB HTTP Server (optional)
+# ChromaDB HTTP Server
+# Important: Cannot set both CHROMA_HOST and CHROMA_PERSIST_DIR
 CHROMA_HOST=localhost
 CHROMA_PORT=8000
 
-# ChromaDB Local (optional)
+# ChromaDB Local Persistence
+# Important: Cannot set both CHROMA_HOST and CHROMA_PERSIST_DIR
 CHROMA_PERSIST_DIR=./chroma_data
 
-# Milvus
-MILVUS_API_ENDPOINT=https://your-endpoint.zillizcloud.com
-MILVUS_USER=your-user
-MILVUS_PASSWORD=your-password
+# Milvus / Zilliz Cloud
+MILVUS_API_ENDPOINT=http://localhost:19530
+MILVUS_API_KEY=your-milvus-api-key
 
 # PGVector (PostgreSQL with pgvector extension)
 PGVECTOR_HOST=localhost
 PGVECTOR_PORT=5432
-PGVECTOR_DBNAME=vectordb
 PGVECTOR_USER=postgres
 PGVECTOR_PASSWORD=your-password
 
-# Vector metric (cosine, dot_product, euclidean)
+# ===================================================================
+# Vector Engine Settings
+# ===================================================================
+
+# Database name (used by PGVector and collection naming)
+VECTOR_COLLECTION_NAME=vector_db
+
+# Distance metric: cosine, dot_product, euclidean
 VECTOR_METRIC=cosine
+
+# Store original text with vectors (true/false)
+VECTOR_STORE_TEXT=false
+
+# Vector embedding dimension
+VECTOR_DIM=1536
+
+# Default search result limit
+VECTOR_SEARCH_LIMIT=10
+
+# Primary key generation mode: uuid, hash_text, hash_vector, int64, auto
+PRIMARY_KEY_MODE=uuid
+
+# Optional: Custom PK factory (dotted path to callable)
+# PRIMARY_KEY_FACTORY=mymodule.custom_pk_function
+
+# Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
+LOG_LEVEL=INFO
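
Reviewer note: the comments added to `.env.example` say that ChromaDB has three deployment modes and that `CHROMA_HOST` and `CHROMA_PERSIST_DIR` cannot both be set. The sketch below shows one way such a check could look. It is not CrossVector's implementation: `resolve_chroma_mode` is a hypothetical helper, and the precedence it applies (cloud before HTTP before local) is an assumption, since the diff does not show the adapter code.

```python
from typing import Mapping


def resolve_chroma_mode(env: Mapping[str, str]) -> str:
    """Return "cloud", "http", or "local" for an environment mapping.

    Hypothetical helper for illustration; the precedence order is an assumption.
    """
    host = env.get("CHROMA_HOST")
    persist_dir = env.get("CHROMA_PERSIST_DIR")

    # The one rule stated in .env.example: these two settings are mutually exclusive.
    if host and persist_dir:
        raise ValueError(
            "Cannot set both CHROMA_HOST and CHROMA_PERSIST_DIR; "
            "choose one deployment mode."
        )

    if env.get("CHROMA_API_KEY"):
        return "cloud"
    if host:
        return "http"
    if persist_dir:
        return "local"
    raise ValueError("No ChromaDB deployment mode is configured.")
```

In practice the mapping would be `os.environ`; passing a plain dict keeps the check easy to unit test.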

README.md

Lines changed: 161 additions & 28 deletions
@@ -36,12 +36,14 @@ CrossVector provides a consistent, high-level API across multiple vector databas
 - **4 Vector Databases**: AstraDB, ChromaDB, Milvus, PgVector
 - **2 Embedding Providers**: OpenAI, Gemini
 - Switch backends without code changes
+- Lazy initialization pattern for optimal resource usage
 
 ### 🎯 Unified API
 
 - Consistent interface across all adapters
 - Django-style `get`, `get_or_create`, `update_or_create` semantics
 - Flexible document input formats: `str`, `dict`, or `VectorDocument`
+- Standardized error handling with contextual exceptions
 
 ### 🔍 Advanced Querying
 
@@ -55,18 +57,21 @@ CrossVector provides a consistent, high-level API across multiple vector databas
 - Automatic batch embedding generation
 - Bulk operations: `bulk_create`, `bulk_update`, `upsert`
 - Configurable batch sizes and conflict resolution
+- Lazy client initialization for faster startup
 
 ### 🛡️ Type-Safe & Validated
 
-- Full Pydantic validation
+- Full Pydantic v2 validation
 - Structured exceptions with detailed context
 - Centralized logging with configurable levels
+- Explicit configuration validation with helpful error messages
 
 ### ⚙️ Flexible Configuration
 
 - Environment variable support via `.env`
 - Multiple primary key strategies: UUID, hash-based, int64, custom
 - Optional text storage to optimize space
+- Strict config validation prevents silent failures
 
 ---
 
@@ -118,9 +123,9 @@ from crossvector import VectorEngine
 from crossvector.embeddings.openai import OpenAIEmbeddingAdapter
 from crossvector.dbs.pgvector import PgVectorAdapter
 
-# Initialize engine
+# Initialize engine (uses default models if not specified)
 engine = VectorEngine(
-    embedding=OpenAIEmbeddingAdapter(model_name="text-embedding-3-small"),
+    embedding=OpenAIEmbeddingAdapter(),  # Uses text-embedding-3-small by default
     db=PgVectorAdapter(),
     collection_name="my_documents",
     store_text=True
@@ -359,7 +364,10 @@ Create a `.env` file in your project root:
 OPENAI_API_KEY=sk-...
 
 # Gemini
-GOOGLE_API_KEY=AI...
+GEMINI_API_KEY=AI...
+
+# Optional: Override default embedding model (each adapter has its own default)
+# VECTOR_EMBEDDING_MODEL=gemini-embedding-001
 
 # AstraDB
 ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
@@ -371,18 +379,24 @@ CHROMA_API_KEY=...
 CHROMA_TENANT=...
 CHROMA_DATABASE=...
 
-# ChromaDB (Self-hosted)
+# ChromaDB (Self-hosted HTTP)
 CHROMA_HOST=localhost
 CHROMA_PORT=8000
 
+# ChromaDB (Local persistence)
+CHROMA_PERSIST_DIR=./chroma_data
+
+# Note: Cannot set both CHROMA_HOST and CHROMA_PERSIST_DIR
+# Choose one based on deployment mode
+
 # Milvus
 MILVUS_API_ENDPOINT=https://...
 MILVUS_API_KEY=...
 
 # PgVector
 PGVECTOR_HOST=localhost
 PGVECTOR_PORT=5432
-PGVECTOR_DBNAME=vector_db
+VECTOR_COLLECTION_NAME=vector_db
 PGVECTOR_USER=postgres
 PGVECTOR_PASSWORD=postgres
 
@@ -458,21 +472,26 @@ engine = VectorEngine(embedding=embedding, db=db)
 ```python
 from crossvector.dbs.chroma import ChromaAdapter
 
-# Cloud mode
-db = ChromaAdapter()  # Uses CHROMA_API_KEY from env
+# Cloud mode (requires CHROMA_API_KEY)
+db = ChromaAdapter()
 
-# Self-hosted mode
-db = ChromaAdapter()  # Uses CHROMA_HOST/PORT from env
+# Self-hosted HTTP mode (requires CHROMA_HOST, must not set CHROMA_PERSIST_DIR)
+db = ChromaAdapter()
 
-# Local persistence mode
-db = ChromaAdapter()  # Uses CHROMA_PERSIST_DIR from env
+# Local persistence mode (requires CHROMA_PERSIST_DIR, must not set CHROMA_HOST)
+db = ChromaAdapter()
 
 engine = VectorEngine(embedding=embedding, db=db)
 
 # Features:
 # - Multiple deployment modes (cloud/HTTP/local)
-# - Automatic client fallback
+# - Strict config validation (prevents conflicting settings)
+# - Explicit import pattern for better code clarity
 # - Flattened metadata with dot-notation support
+# - Lazy client initialization
+
+# Important: Cannot set both CHROMA_HOST and CHROMA_PERSIST_DIR
+# Choose one deployment mode explicitly to avoid errors
 ```
 
 ### Milvus
@@ -519,33 +538,45 @@ from crossvector.embeddings.openai import OpenAIEmbeddingAdapter
 # Default model (text-embedding-3-small, 1536 dims)
 embedding = OpenAIEmbeddingAdapter()
 
-# Larger model (text-embedding-3-large, 3072 dims)
-embedding = OpenAIEmbeddingAdapter(model_name="text-embedding-3-large")
+# Or use VECTOR_EMBEDDING_MODEL from .env
+# VECTOR_EMBEDDING_MODEL=text-embedding-3-large
+embedding = OpenAIEmbeddingAdapter()  # Uses env var
 
-# Legacy model (text-embedding-ada-002, 1536 dims)
-embedding = OpenAIEmbeddingAdapter(model_name="text-embedding-ada-002")
+# Explicit model override
+embedding = OpenAIEmbeddingAdapter(model_name="text-embedding-3-large")
 ```
 
+**Supported Models:**
+- `text-embedding-3-small` (1536 dims, default)
+- `text-embedding-3-large` (3072 dims)
+- `text-embedding-ada-002` (1536 dims, legacy)
+
 ### Gemini
 
 ```python
 from crossvector.embeddings.gemini import GeminiEmbeddingAdapter
 
-# Default model (gemini-embedding-001)
+# Default model (gemini-embedding-001, 1536 dims)
 embedding = GeminiEmbeddingAdapter()
 
+# Or use VECTOR_EMBEDDING_MODEL from .env
+# VECTOR_EMBEDDING_MODEL=gemini-embedding-001
+embedding = GeminiEmbeddingAdapter()  # Uses env var
+
 # With custom dimensions (768, 1536, 3072)
-embedding = GeminiEmbeddingAdapter(
-    model_name="gemini-embedding-001",
-    dim=1536
-)
+embedding = GeminiEmbeddingAdapter(dim=768)
 
 # With task type
 embedding = GeminiEmbeddingAdapter(
     task_type="retrieval_document"  # or "retrieval_query", "semantic_similarity"
 )
 ```
 
+**Supported Models:**
+- `gemini-embedding-001` (768-3072 dims, default, recommended)
+- `text-embedding-005` (768 dims)
+- `text-embedding-004` (768 dims, legacy)
+
 ---
 
 ## Error Handling
@@ -658,7 +689,7 @@ export MILVUS_API_TOKEN=...
 # PgVector
 export PGVECTOR_HOST=localhost
 export PGVECTOR_PORT=5432
-export PGVECTOR_DBNAME=vectordb
+export VECTOR_COLLECTION_NAME=vectordb
 export PGVECTOR_USER=postgres
 export PGVECTOR_PASSWORD=postgres
 ```
@@ -675,13 +706,115 @@ pytest tests/test_engine.py
 # With coverage
 pytest tests/ --cov=crossvector --cov-report=html
 
-# Integration tests (requires real backends)
-python scripts/backend.py --backend pgvector --embedding-provider openai
-python scripts/backend.py --backend astradb --embedding-provider openai
-python scripts/backend.py --backend milvus --embedding-provider openai
-python scripts/backend.py --backend chroma --embedding-provider openai
+# Integration tests with real backends (requires credentials)
+pytest scripts/tests/test_pgvector.py -v
+pytest scripts/tests/test_astradb.py -v
+pytest scripts/tests/test_milvus.py -v
+pytest scripts/tests/test_chroma.py -v
+```
+
+---
+
+## Benchmarking
+
+CrossVector includes a comprehensive benchmarking tool to compare performance across different database backends and embedding providers.
+
+### Quick Start
+
+```bash
+# Quick test with 10 documents (recommended first run)
+python scripts/benchmark.py --num-docs 10
+
+# Full benchmark with 1000 documents
+python scripts/benchmark.py
+
+# Test specific backends and embeddings
+python scripts/benchmark.py --backends pgvector milvus --embedding-providers openai
+
+# Custom output file
+python scripts/benchmark.py --output results/my_benchmark.md
+```
+
+### What Gets Benchmarked
+
+The benchmark tool measures performance across 7 key operations:
+
+1. **Bulk Create** - Batch insertion with automatic embedding generation
+2. **Individual Create** - Single document creation performance
+3. **Vector Search** - Semantic similarity search with embeddings
+4. **Metadata-Only Search** - Filtering without vector similarity
+5. **Query DSL Operators** - Testing all 10 operators (eq, ne, gt, gte, lt, lte, in, nin, and, or)
+6. **Update Operations** - Document update performance
+7. **Delete Operations** - Batch deletion throughput
+
+### Supported Backends
+
+- **PgVector** - PostgreSQL with vector extension
+- **AstraDB** - DataStax Astra vector database
+- **Milvus** - Open-source vector database
+- **ChromaDB** - Embedded vector database
+
+### Supported Embeddings
+
+- **OpenAI** - `text-embedding-3-small` (1536 dimensions)
+- **Gemini** - `text-embedding-004` (768 dimensions)
+
+### Sample Results
+
+```markdown
+| Backend  | Embedding | Bulk Create | Search (avg) | Update (avg) | Delete (batch) |
+|----------|-----------|-------------|--------------|--------------|----------------|
+| pgvector | openai    | 1.37s       | 434ms        | 6.20ms       | 0.54ms         |
+| pgvector | gemini    | 3.64s       | 321ms        | 3.16ms       | 0.47ms         |
+| milvus   | openai    | 0.95s       | 156ms        | 4.12ms       | 0.31ms         |
+| chroma   | gemini    | 2.14s       | 287ms        | 5.43ms       | 0.89ms         |
+```
+
+### Requirements
+
+**Environment Variables:**
+
+```bash
+# Embedding providers (at least one required)
+OPENAI_API_KEY=sk-...
+GOOGLE_API_KEY=...
+
+# Database backends (optional, script will skip if not configured)
+PGVECTOR_CONNECTION_STRING=postgresql://...
+ASTRADB_API_ENDPOINT=https://...
+ASTRADB_APPLICATION_TOKEN=AstraCS:...
+MILVUS_API_ENDPOINT=https://...
+MILVUS_API_TOKEN=...
+```
+
+### Recommended Workflow
+
+```bash
+# Step 1: Quick verification (1-2 minutes)
+python scripts/benchmark.py --num-docs 1 --backends pgvector --embedding-providers openai
+
+# Step 2: Fast comparison with 10 docs (5-10 minutes)
+python scripts/benchmark.py --num-docs 10
+
+# Step 3: Production benchmark with 1000 docs (30-60 minutes)
+python scripts/benchmark.py --num-docs 1000 --output benchmark_full.md
 ```
 
+### Output
+
+Results are saved to `benchmark.md` (or custom path) with:
+- Performance summary table comparing all combinations
+- Detailed metrics for each backend + embedding pair
+- Query DSL operator test results
+- Timestamps and configuration details
+
+**Example output:**
+```
+📄 Markdown report saved to: benchmark.md
+```
+
+See [benchmarking documentation](docs/benchmarking.md) for more details.
+
 ---
 
 ## Examples
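
Reviewer note: the README section added above reports per-operation timings such as "Search (avg)" but does not show how `scripts/benchmark.py` measures them. The sketch below is one plausible way to time an operation and format the result like the sample table. It is an illustration only: `time_operation` and `format_duration` are hypothetical names, and the formatting thresholds are inferred from the sample values (`1.37s`, `434ms`, `6.20ms`), not taken from the script.

```python
import statistics
import time
from typing import Callable, Dict


def time_operation(fn: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Call fn `repeats` times and return avg/min/max wall-clock time in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations_ms = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "avg_ms": statistics.mean(durations_ms),
        "min_ms": min(durations_ms),
        "max_ms": max(durations_ms),
    }


def format_duration(ms: float) -> str:
    """Format milliseconds in the style of the sample table (assumed thresholds)."""
    if ms >= 1000:
        return f"{ms / 1000:.2f}s"
    if ms >= 100:
        return f"{ms:.0f}ms"
    return f"{ms:.2f}ms"
```

A call such as `time_operation(lambda: engine.search("query"), repeats=10)` would wrap a real backend call; the `engine.search` signature here is assumed, not shown in this diff.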
