Commit 6224f1c

Merge pull request #5 from thewebscraping/docs-updates
docs: fix documentation accuracy and remove emoji
2 parents 824e18c + f65e000 commit 6224f1c

17 files changed, with 803 additions and 505 deletions

README.md

Lines changed: 53 additions & 48 deletions
@@ -9,19 +9,19 @@
99

1010
CrossVector provides a consistent, high-level API across multiple vector databases (AstraDB, ChromaDB, Milvus, PgVector) and embedding providers (OpenAI, Gemini), allowing you to switch between backends without rewriting your application code.
1111

12-
## 🎯 Recommended Backends
12+
## Recommended Backends
1313

1414
Based on our comprehensive benchmarking, we recommend:
1515

1616
### **For Production:**
1717

18-
- **🥇 ChromaDB Cloud** - Best for cloud deployments
18+
- **ChromaDB Cloud** - Best for cloud deployments
1919
- Hosted solution with excellent performance
2020
- Easy setup and management
2121
- Built-in scaling and backups
2222
- Good for: SaaS applications, MVPs, rapid prototyping
2323

24-
- **🥈 PgVector** - Best for self-hosted/on-premise
24+
- **PgVector** - Best for self-hosted/on-premise
2525
- Excellent performance (6-10 docs/sec bulk insert)
2626
- Very fast metadata queries (<1ms)
2727
- PostgreSQL reliability and ecosystem
@@ -38,57 +38,64 @@ See our [benchmarking guide](docs/benchmarking.md) for detailed performance comp
3838

3939
| Backend | Embedding | Model | Dim | Upsert | Search (avg) | Update (avg) | Delete (batch) | Status |
4040
|---------|-----------|-------|-----|--------|--------------|--------------|----------------|--------|
41-
| pgvector | openai | text-embedding-3-small | 1536 | 7.06s | 21.26ms | 6.21ms | 22.63ms ||
42-
| astradb | openai | text-embedding-3-small | 1536 | 18.89s | 23.86s | 1.11s | 15.15s ||
43-
| milvus | openai | text-embedding-3-small | 1536 | 7.94s | 654.43ms | 569.52ms | 2.17s ||
44-
| chroma | openai | text-embedding-3-small | 1536 | 17.08s | 654.76ms | 1.23s | 4.73s ||
45-
| pgvector | gemini | models/gemini-embedding-001 | 1536 | 6.65s | 18.72ms | 6.40ms | 20.25ms ||
46-
| astradb | gemini | models/gemini-embedding-001 | 1536 | 11.25s | 6.71s | 903.37ms | 15.05s ||
47-
| milvus | gemini | models/gemini-embedding-001 | 1536 | 6.14s | 571.90ms | 561.38ms | 1.91s ||
48-
| chroma | gemini | models/gemini-embedding-001 | 1536 | 18.93s | 417.28ms | 1.24s | 4.63s ||
41+
| pgvector | openai | text-embedding-3-small | 1536 | 7.06s | 21.26ms | 6.21ms | 22.63ms | Yes |
42+
| astradb | openai | text-embedding-3-small | 1536 | 18.89s | 23.86s | 1.11s | 15.15s | Yes |
43+
| milvus | openai | text-embedding-3-small | 1536 | 7.94s | 654.43ms | 569.52ms | 2.17s | Yes |
44+
| chroma | openai | text-embedding-3-small | 1536 | 17.08s | 654.76ms | 1.23s | 4.73s | Yes |
45+
| pgvector | gemini | models/gemini-embedding-001 | 1536 | 6.65s | 18.72ms | 6.40ms | 20.25ms | Yes |
46+
| astradb | gemini | models/gemini-embedding-001 | 1536 | 11.25s | 6.71s | 903.37ms | 15.05s | Yes |
47+
| milvus | gemini | models/gemini-embedding-001 | 1536 | 6.14s | 571.90ms | 561.38ms | 1.91s | Yes |
48+
| chroma | gemini | models/gemini-embedding-001 | 1536 | 18.93s | 417.28ms | 1.24s | 4.63s | Yes |
49+
50+
> **Important Benchmark Notes:**
51+
>
52+
> - **PgVector**: Benchmarks run against a **local PostgreSQL instance**, providing optimal latency. For fair comparison with cloud backends, ensure PgVector is deployed in the **same region and network environment**.
53+
> - **Cloud Backends** (AstraDB, Milvus, ChromaDB): Results include **network latency** and depend on **regional proximity**. A cloud-hosted PgVector instance would likewise perform differently from the local figures above, depending on region and network conditions.
54+
> - **Recommendations**: When comparing results, ensure all backends are deployed in the **same region** and **similar network conditions** for objective evaluation.
55+
> - For production deployments, conduct benchmarks in your **actual production environment** with real network conditions.
4956
5057
Full results: [`benchmark.md`](benchmark.md).
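
The table mixes seconds and milliseconds, which makes the columns hard to compare by eye. The sketch below is not part of CrossVector or its benchmark tool; it is a small standalone helper (the names `to_ms` and `search_avg` are hypothetical) that normalizes the duration strings to milliseconds, including the `1m 2.65s` form used elsewhere in the README, and ranks the OpenAI rows by average search time.

```python
import re

# Optional minutes ("1m ") followed by a number and a unit of "s" or "ms".
_DURATION = re.compile(r"^(?:(\d+)m\s+)?(\d+(?:\.\d+)?)(ms|s)$")


def to_ms(text: str) -> float:
    """Convert strings such as '21.26ms', '23.86s' or '1m 2.65s' to milliseconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, value, unit = match.groups()
    ms = float(value) * (1000.0 if unit == "s" else 1.0)
    if minutes:
        ms += int(minutes) * 60_000.0
    return ms


# "Search (avg)" values copied from the OpenAI rows of the table above.
search_avg = {
    "pgvector": "21.26ms",
    "astradb": "23.86s",
    "milvus": "654.43ms",
    "chroma": "654.76ms",
}

search_ms = {name: to_ms(value) for name, value in search_avg.items()}
fastest = min(search_ms, key=search_ms.get)

for name, ms in sorted(search_ms.items(), key=lambda item: item[1]):
    ratio = ms / search_ms[fastest]
    print(f"{name:<9} {ms:>10.2f} ms  ({ratio:.1f}x {fastest})")
```

Because the PgVector figures come from a local instance, the ratios printed here describe this particular benchmark setup rather than the backends in general.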
5158

5259
---
5360

5461
## Features
5562

56-
### 🔌 Pluggable Architecture
63+
### Pluggable Architecture
5764

5865
- **4 Vector Databases**: AstraDB, ChromaDB, Milvus, PgVector
5966
- **2 Embedding Providers**: OpenAI, Gemini
6067
- Switch backends without code changes
6168
- Lazy initialization pattern for optimal resource usage
6269

63-
### 🎯 Unified API
70+
### Unified API
6471

6572
- Consistent interface across all adapters
6673
- Django-style `get`, `get_or_create`, `update_or_create` semantics
6774
- Flexible document input formats: `str`, `dict`, or `VectorDocument`
6875
- Standardized error handling with contextual exceptions
6976

70-
### 🔍 Advanced Querying
77+
### Advanced Querying
7178

7279
- **Query DSL**: Type-safe filter composition with `Q` objects
7380
- **Universal operators**: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`
7481
- **Nested metadata**: Dot-notation paths for hierarchical data
7582
- **Metadata-only search**: Query without vector similarity (where supported)
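
To make the operator list and dot-notation paths concrete, here is a minimal, library-independent sketch of how such a filter could be evaluated against one document's metadata. It does not use CrossVector's `Q` objects and is not its implementation; the Mongo-style `{path: {operator: operand}}` filter shape, the helper names `resolve` and `matches`, and the rule that a missing field fails every condition are all assumptions made for illustration.

```python
import operator
from typing import Any, Callable, Dict

_MISSING = object()  # sentinel for "path not present in the metadata"

# The eight operators listed above, mapped to plain Python comparisons.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def resolve(metadata: Dict[str, Any], path: str) -> Any:
    """Follow a dot-notation path such as 'author.stats.score' through nested dicts."""
    current: Any = metadata
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def matches(metadata: Dict[str, Any], where: Dict[str, Dict[str, Any]]) -> bool:
    """Return True only if every condition on every path holds (implicit AND)."""
    for path, conditions in where.items():
        value = resolve(metadata, path)
        if value is _MISSING:
            return False  # assumption: a missing field never matches
        for op, operand in conditions.items():
            if op not in OPERATORS:
                raise ValueError(f"unsupported operator: {op}")
            if not OPERATORS[op](value, operand):
                return False
    return True


doc = {"author": {"name": "Ada", "stats": {"score": 9}}, "tag": "db"}
print(matches(doc, {"author.stats.score": {"$gte": 5, "$lt": 10}}))  # True
print(matches(doc, {"tag": {"$in": ["ml", "nlp"]}}))  # False
```

A real backend translates the same conditions into its own query language instead of filtering in Python, which is why metadata-only search is qualified with "where supported".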
7683

77-
### 🚀 Performance Optimized
84+
### Performance Optimized
7885

7986
- Automatic batch embedding generation
8087
- Bulk operations: `bulk_create`, `bulk_update`, `upsert`
8188
- Configurable batch sizes and conflict resolution
8289
- Lazy client initialization for faster startup
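
The idea behind batch embedding with a configurable batch size can be sketched without any vector database at all. The code below is an illustration only, not CrossVector's internals: `batched`, `embed_in_batches`, and the `fake_embed` stand-in for a provider call are hypothetical names, and the default batch size of 100 is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def embed_in_batches(
    texts: Iterable[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 100,  # assumed default, chosen for illustration
) -> List[Vector]:
    """Call the embedding function once per chunk instead of once per text."""
    vectors: List[Vector] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_batch(chunk))
    return vectors


# Stand-in for a real embedding provider: records each call's batch size
# and returns a one-dimensional "vector" per text.
calls: List[int] = []


def fake_embed(chunk: Sequence[str]) -> List[Vector]:
    calls.append(len(chunk))
    return [[float(len(text))] for text in chunk]


vectors = embed_in_batches([f"doc {i}" for i in range(250)], fake_embed, batch_size=100)
print(len(vectors), calls)  # 250 [100, 100, 50]
```

With 250 texts and a batch size of 100, the provider is called three times rather than 250 times, which is where most of the bulk-insert savings would come from when the provider is a remote API.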
8390

84-
### 🛡️ Type-Safe & Validated
91+
### Type-Safe & Validated
8592

8693
- Full Pydantic v2 validation
8794
- Structured exceptions with detailed context
8895
- Centralized logging with configurable levels
8996
- Explicit configuration validation with helpful error messages
9097

91-
### ⚙️ Flexible Configuration
98+
### Flexible Configuration
9299

93100
- Environment variable support via `.env`
94101
- Multiple primary key strategies: UUID, hash-based, int64, custom
@@ -142,7 +149,7 @@ pip install crossvector[astradb,all-embeddings]
142149

143150
## Quick Start
144151

145-
> 💡 **Recommended**: Use `GeminiEmbeddingAdapter` for most use cases - free tier, faster search (1.5x), smaller vectors (768 vs 1536 dims). See [benchmarks](benchmark.md) for details.
152+
> **Recommended**: Use `GeminiEmbeddingAdapter` for most use cases - free tier, faster search (1.5x). See [benchmarks](benchmark.md) for details.
146153
147154
### Basic Usage
148155

@@ -153,7 +160,7 @@ from crossvector.dbs.pgvector import PgVectorAdapter
153160

154161
# Initialize engine with Gemini (recommended: free tier, fast performance)
155162
engine = VectorEngine(
156-
embedding=GeminiEmbeddingAdapter(), # Free tier, 1536-dim vectors
163+
embedding=GeminiEmbeddingAdapter(), # Free tier, 1536-dim vectors (default)
157164
db=PgVectorAdapter(),
158165
collection_name="my_documents",
159166
store_text=True
@@ -476,11 +483,11 @@ Different backends have varying feature support:
476483

477484
| Feature | AstraDB | ChromaDB | Milvus | PgVector |
478485
|---------|---------|----------|--------|----------|
479-
| Vector Search | | | | |
480-
| Metadata-Only Search | | | | |
481-
| Nested Metadata | | | | |
482-
| Numeric Comparisons | | | | |
483-
| Text Storage | | | | |
486+
| Vector Search | Yes | Yes | Yes | Yes |
487+
| Metadata-Only Search | Yes | Yes | Yes | Yes |
488+
| Nested Metadata | Yes | Yes | Yes | Yes |
489+
| Numeric Comparisons | Yes | Yes | Yes | Yes |
490+
| Text Storage | Yes | Yes | Yes | Yes |
484491

485492
*ChromaDB supports nested metadata via dot-notation when metadata is flattened.
486493

@@ -562,7 +569,7 @@ engine = VectorEngine(embedding=embedding, db=db)
562569

563570
## Embedding Providers
564571

565-
> 💡 **Recommended**: Start with **Gemini** for free tier and faster performance. See [benchmark comparison](benchmark.md).
572+
> **Recommended**: Start with **Gemini** for free tier and faster performance. See [benchmark comparison](benchmark.md).
566573
567574
### Gemini (Recommended)
568575

@@ -577,19 +584,21 @@ embedding = GeminiEmbeddingAdapter(model_name="models/text-embedding-004", dim=7
577584
```
578585

579586
**Why Choose Gemini:**
580-
- **Free tier**: 1,500 requests/min (vs OpenAI paid only)
581-
- **Faster search**: 234ms avg (1.5x faster than OpenAI)
582-
- **Efficient**: 768 dims = 50% less storage than OpenAI
583-
- **Quality**: Comparable accuracy to OpenAI
587+
- **Free tier**: 1,500 requests/min (vs OpenAI paid only)
588+
- **Faster search**: 234ms avg (1.5x faster than OpenAI)
589+
- **Flexible dims**: 768, 1536, or 3072 with gemini-embedding-001
590+
- **Quality**: Comparable accuracy to OpenAI
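
Choosing between 768, 1536, and 3072 dimensions is mostly a storage and bandwidth trade-off. As a rough worked example, assuming each component is stored as a 4-byte float32 and ignoring index structures, metadata, and any backend-specific overhead (the helper name `raw_vector_bytes` is hypothetical):

```python
BYTES_PER_FLOAT32 = 4  # assumption: components stored as 32-bit floats


def raw_vector_bytes(dim: int, num_vectors: int = 1) -> int:
    """Raw payload size of `num_vectors` embeddings with `dim` components each."""
    return dim * BYTES_PER_FLOAT32 * num_vectors


for dim in (768, 1536, 3072):
    per_vector = raw_vector_bytes(dim)
    per_million_gib = raw_vector_bytes(dim, 1_000_000) / 1024**3
    print(f"{dim:>4} dims: {per_vector:>6} bytes/vector, ~{per_million_gib:.2f} GiB per million vectors")
```

Halving the dimension halves the raw payload, so 768-dimensional vectors take half the space of the 1536-dimensional default used in the benchmarks; actual on-disk size depends on the backend.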
584591

585592
**Configuration:**
586593
```bash
587594
GEMINI_API_KEY=AI... # Get free key at https://makersuite.google.com/app/apikey
588595
```
589596

590597
**Supported Models:**
591-
- `gemini-embedding-001` (768 dims, **recommended**)
592-
- `models/text-embedding-004` (768 dims)
598+
- `gemini-embedding-001` (1536 dims default, supports 768/1536/3072, **recommended**)
599+
- `text-embedding-005` (768 dims, English and code)
600+
- `text-multilingual-embedding-002` (768 dims, multilingual)
601+
- `text-embedding-004` (768 dims, legacy English)
593602

594603
### OpenAI (Alternative)
595604

@@ -604,9 +613,9 @@ embedding = OpenAIEmbeddingAdapter(model_name="text-embedding-3-large")
604613
```
605614

606615
**When to Use OpenAI:**
607-
- Need 1536 or 3072 dimensions
608-
- Already have OpenAI API budget
609-
- Prefer OpenAI ecosystem integration
616+
- Need 1536 or 3072 dimensions
617+
- Already have OpenAI API budget
618+
- Prefer OpenAI ecosystem integration
610619

611620
**Configuration:**
612621
```bash
@@ -618,10 +627,6 @@ OPENAI_API_KEY=sk-... # Paid API key from https://platform.openai.com
618627
- `text-embedding-3-large` (3072 dims)
619628
- `text-embedding-ada-002` (1536 dims, legacy)
620629

621-
- `gemini-embedding-001` (1536 dims, default)
622-
- `text-embedding-005` (768 dims)
623-
- `text-embedding-004` (768 dims, legacy)
624-
625630
---
626631

627632
## Error Handling
@@ -809,14 +814,14 @@ The benchmark tool measures performance across 7 key operations:
809814
```markdown
810815
| Backend | Embedding | Model | Dim | Bulk Create | Search (avg) | Update (avg) | Delete (batch) | Status |
811816
|---------|-----------|-------|-----|-------------|--------------|--------------|----------------|--------|
812-
| pgvector | openai | text-embedding-3-small | 1536 | 2.68s | 515.47ms | 6.48ms | 1.76ms | |
813-
| astradb | openai | text-embedding-3-small | 1536 | 32.56s | 1.09s | 875.63ms | 1.44s | |
814-
| milvus | openai | text-embedding-3-small | 1536 | 21.24s | 1.04s | 551.36ms | 180.25ms | |
815-
| chroma | openai | text-embedding-3-small | 1536 | 36.08s | 900.75ms | 2.51s | 521.35ms | |
816-
| pgvector | gemini | models/gemini-embedding-001 | 1536 | 31.50s | 65.29ms | 6.14ms | 1.78ms | |
817-
| astradb | gemini | models/gemini-embedding-001 | 1536 | 1m 2.65s | 882.48ms | 818.93ms | 1.44s | |
818-
| milvus | gemini | models/gemini-embedding-001 | 1536 | 50.26s | 835.50ms | 572.62ms | 224.16ms | |
819-
| chroma | gemini | models/gemini-embedding-001 | 1536 | 1m 3.39s | 628.08ms | 3.16s | 394.21ms | |
817+
| pgvector | openai | text-embedding-3-small | 1536 | 2.68s | 515.47ms | 6.48ms | 1.76ms | Yes |
818+
| astradb | openai | text-embedding-3-small | 1536 | 32.56s | 1.09s | 875.63ms | 1.44s | Yes |
819+
| milvus | openai | text-embedding-3-small | 1536 | 21.24s | 1.04s | 551.36ms | 180.25ms | Yes |
820+
| chroma | openai | text-embedding-3-small | 1536 | 36.08s | 900.75ms | 2.51s | 521.35ms | Yes |
821+
| pgvector | gemini | models/gemini-embedding-001 | 1536 | 31.50s | 65.29ms | 6.14ms | 1.78ms | Yes |
822+
| astradb | gemini | models/gemini-embedding-001 | 1536 | 1m 2.65s | 882.48ms | 818.93ms | 1.44s | Yes |
823+
| milvus | gemini | models/gemini-embedding-001 | 1536 | 50.26s | 835.50ms | 572.62ms | 224.16ms | Yes |
824+
| chroma | gemini | models/gemini-embedding-001 | 1536 | 1m 3.39s | 628.08ms | 3.16s | 394.21ms | Yes |
820825
```
821826

822827
### Requirements
@@ -859,7 +864,7 @@ Results are saved to `benchmark.md` (or custom path) with:
859864

860865
**Example output:**
861866
```
862-
📄 Markdown report saved to: benchmark.md
867+
Markdown report saved to: benchmark.md
863868
```
864869

865870
See [benchmarking documentation](docs/benchmarking.md) for more details.
@@ -1070,4 +1075,4 @@ See [CHANGELOG.md](CHANGELOG.md) for version history and migration guides.
10701075

10711076
---
10721077

1073-
**Made with ❤️ by the [Two Farm](https://www.linkedin.com/in/thetwofarm/)**
1078+
**Made with ❤️ by [The Two Farm](https://www.linkedin.com/in/thetwofarm/)**
