Commit 930ba09

Merge pull request #3 from thewebscraping/standard-dbs
Release v1.0.0 - First Production Release!
2 parents b095fbe + 5acf0e0 commit 930ba09

46 files changed: +5376 -1318 lines changed

.env.example

Lines changed: 55 additions & 12 deletions
@@ -1,35 +1,78 @@
-# OpenAI (for embeddings)
+# ===================================================================
+# Embedding Providers
+# ===================================================================
+
+# OpenAI
 OPENAI_API_KEY=sk-your-key-here
-OPENAI_EMBEDDING_MODEL=text-embedding-3-small
+
+# Gemini (Google)
+GEMINI_API_KEY=your-gemini-api-key
+
+# Embedding Model (optional, shared across providers)
+# If not set, each adapter uses its own default:
+# - OpenAI: text-embedding-3-small
+# - Gemini: gemini-embedding-001
+# OpenAI options: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
+# Gemini options: gemini-embedding-001, text-embedding-004, text-embedding-005
+# VECTOR_EMBEDDING_MODEL=gemini-embedding-001
+
+# ===================================================================
+# Vector Databases
+# ===================================================================
 
 # AstraDB
 ASTRA_DB_APPLICATION_TOKEN=AstraCS:your-token-here
 ASTRA_DB_API_ENDPOINT=https://your-id.apps.astra.datastax.com
-ASTRA_DB_COLLECTION_NAME=vector_documents
 
-# ChromaDB Cloud (optional)
+# ChromaDB Cloud
+# Note: Choose ONE deployment mode (Cloud, HTTP, or Local)
 CHROMA_API_KEY=your-chroma-api-key
 CHROMA_TENANT=your-tenant
 CHROMA_DATABASE=your-database
 
-# ChromaDB HTTP Server (optional)
+# ChromaDB HTTP Server
+# Important: Cannot set both CHROMA_HOST and CHROMA_PERSIST_DIR
 CHROMA_HOST=localhost
 CHROMA_PORT=8000
 
-# ChromaDB Local (optional)
+# ChromaDB Local Persistence
+# Important: Cannot set both CHROMA_HOST and CHROMA_PERSIST_DIR
 CHROMA_PERSIST_DIR=./chroma_data
 
-# Milvus
-MILVUS_API_ENDPOINT=https://your-endpoint.zillizcloud.com
-MILVUS_USER=your-user
-MILVUS_PASSWORD=your-password
+# Milvus / Zilliz Cloud
+MILVUS_API_ENDPOINT=http://localhost:19530
+MILVUS_API_KEY=your-milvus-api-key
 
 # PGVector (PostgreSQL with pgvector extension)
 PGVECTOR_HOST=localhost
 PGVECTOR_PORT=5432
-PGVECTOR_DBNAME=vectordb
 PGVECTOR_USER=postgres
 PGVECTOR_PASSWORD=your-password
 
-# Vector metric (cosine, dot_product, euclidean)
+# ===================================================================
+# Vector Engine Settings
+# ===================================================================
+
+# Database name (used by PGVector and collection naming)
+VECTOR_COLLECTION_NAME=vector_db
+
+# Distance metric: cosine, dot_product, euclidean
 VECTOR_METRIC=cosine
+
+# Store original text with vectors (true/false)
+VECTOR_STORE_TEXT=false
+
+# Vector embedding dimension
+VECTOR_DIM=1536
+
+# Default search result limit
+VECTOR_SEARCH_LIMIT=10
+
+# Primary key generation mode: uuid, hash_text, hash_vector, int64, auto
+PRIMARY_KEY_MODE=uuid
+
+# Optional: Custom PK factory (dotted path to callable)
+# PRIMARY_KEY_FACTORY=mymodule.custom_pk_function
+
+# Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL
+LOG_LEVEL=INFO
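
The ChromaDB comments in `.env.example` say to choose one deployment mode and that `CHROMA_HOST` and `CHROMA_PERSIST_DIR` cannot both be set. A minimal sketch of such a check follows. The function name `chroma_deployment_mode` is hypothetical and this is not the project's actual validation code. Treating `CHROMA_API_KEY` as the marker for Cloud mode, and rejecting any combination of two modes, are assumptions that go beyond the one conflict the file names.

```python
from typing import Mapping


def chroma_deployment_mode(env: Mapping[str, str]) -> str:
    """Return 'cloud', 'http', or 'local' from CHROMA_* settings (hypothetical helper)."""
    # Assumption: each mode is identified by one key from .env.example.
    markers = {
        "cloud": env.get("CHROMA_API_KEY", "").strip(),
        "http": env.get("CHROMA_HOST", "").strip(),
        "local": env.get("CHROMA_PERSIST_DIR", "").strip(),
    }
    configured = [mode for mode, value in markers.items() if value]

    # The file states CHROMA_HOST and CHROMA_PERSIST_DIR conflict; rejecting
    # every other pair as well is an assumption based on "Choose ONE".
    if len(configured) > 1:
        raise ValueError(f"Conflicting ChromaDB modes configured: {configured}")
    if not configured:
        raise ValueError("No ChromaDB deployment mode configured")
    return configured[0]
```

Passing a plain mapping instead of reading `os.environ` directly keeps the check easy to test. A caller could pass `os.environ` as-is.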

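`PRIMARY_KEY_MODE` lists the modes `uuid`, `hash_text`, `hash_vector`, `int64`, and `auto`, but the diff does not show how keys are derived. The sketch below is one plausible reading, not the library's implementation. The function name `generate_pk`, the choice of SHA-256, the little-endian double packing of vectors, and the random 63-bit integer are all assumptions. `auto` is left out because its semantics are not stated.

```python
import hashlib
import secrets
import struct
import uuid
from typing import Optional, Sequence, Union


def generate_pk(
    mode: str,
    text: Optional[str] = None,
    vector: Optional[Sequence[float]] = None,
) -> Union[str, int]:
    """Generate a primary key for one document (hypothetical helper)."""
    if mode == "uuid":
        return str(uuid.uuid4())
    if mode == "hash_text":
        if text is None:
            raise ValueError("hash_text mode requires text")
        # Assumption: SHA-256 over the UTF-8 bytes; same text gives same key.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    if mode == "hash_vector":
        if vector is None:
            raise ValueError("hash_vector mode requires a vector")
        # Assumption: pack components as little-endian doubles before hashing.
        packed = struct.pack(f"<{len(vector)}d", *vector)
        return hashlib.sha256(packed).hexdigest()
    if mode == "int64":
        # Assumption: a random non-negative value that fits a signed 64-bit column.
        return secrets.randbits(63)
    raise ValueError(f"Unsupported PRIMARY_KEY_MODE: {mode}")
```

Under these assumptions the hash modes are deterministic, so re-inserting identical content would produce the same key, while `uuid` and `int64` produce a fresh key on every call.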
CHANGELOG.md

Lines changed: 84 additions & 0 deletions
@@ -1,5 +1,89 @@
 # CrossVector - Changelog
 
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [1.0.0] - 2025-12-06 🎉
+
+**First Production Release!**
+
+### Added
+
+**Benchmarking System:**
+- Created comprehensive `scripts/benchmark.py` tool for performance testing
+- Support for 4 database backends (pgvector, astradb, milvus, chroma)
+- Support for 2 embedding providers (OpenAI, Gemini)
+- 7 operation types tested: bulk/individual create, vector/metadata search, Query DSL operators, update, delete
+- `--skip-slow` flag to skip cloud backends for faster local testing
+- Smart Query DSL optimization: 4 operators for slow backends, 10 for fast backends
+- Detailed markdown reports with performance metrics
+- Performance summary shows tested vs skipped backends clearly
+
+**Engine Improvements:**
+- Added `VectorEngine.drop_collection()` method for collection cleanup
+- Better collection lifecycle management
+
+**Documentation:**
+- Added benchmarking section to README.md (102 lines)
+- Created comprehensive `docs/benchmarking.md` guide (385 lines)
+- Updated `docs/contributing.md` with benchmarking workflow
+- Added usage examples and best practices
+- Cost estimation and troubleshooting guides
+
+**Testing:**
+- Added 50+ new unit tests
+- Test coverage for ABC adapters (82%)
+- Test coverage for logger (100%)
+- Extended engine tests
+- Schema, utils, and Q object coverage tests
+- Total: 365 tests passing (from ~300)
+
+**Architecture:**
+- Enhanced ABC base class with unified initialization
+- Improved adapter architecture
+- Better error reporting in benchmarks
+- Truncated error messages in reports for readability
+
+### Changed
+
+- Collection name defaults now use `api_settings.VECTOR_COLLECTION_NAME` instead of class constant
+- Improved Milvus metadata-only search support verification
+- Updated all adapter documentation
+- Modernized contributing.md with uv, pre-commit, ruff
+
+### Removed
+
+- Removed `scripts/e2e.py` (replaced with `pytest scripts/tests`)
+- Removed `DEFAULT_COLLECTION_NAME` class constant from adapters
+
+### Fixed
+
+- Fixed Milvus tests to verify metadata-only search functionality
+- Fixed collection name handling across all adapters
+- Better error messages in benchmark reports
+- Proper cleanup in benchmark tests
+
+### Breaking Changes
+
+- `DEFAULT_COLLECTION_NAME` class constant removed - use `api_settings.VECTOR_COLLECTION_NAME` in settings instead
+- Stricter ChromaDB config validation (prevents conflicting settings)
+
+### Performance
+
+- Benchmark results show ~60% reduction in API calls for cloud backends with optimization
+- Local testing with `--skip-slow`: ~2-3 minutes vs 10+ minutes
+- PgVector: ~6-10 docs/sec bulk create, ~0.5ms metadata queries
+- Gemini: 1.5x faster search vs OpenAI for same operations
+
+### Documentation Updates
+
+- Repository URLs and references updated
+- Enhanced architecture diagrams
+- Improved API documentation
+- Fixed all broken links
+
 ## [0.1.3] - 2025-11-30
 
 ### Testing Infrastructure
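
The breaking change above moves the default collection name from a per-adapter class constant to the `VECTOR_COLLECTION_NAME` setting. The sketch below illustrates that kind of fallback chain. It does not use the library's `api_settings` object, whose interface is not shown in this diff. The function `resolve_collection_name` and the constant `FALLBACK_COLLECTION_NAME` are hypothetical, and reusing the `vector_db` value from `.env.example` as the final fallback is an assumption.

```python
import os
from typing import Mapping, Optional

# Assumption: reuse the example value from .env.example as the last resort.
FALLBACK_COLLECTION_NAME = "vector_db"


def resolve_collection_name(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a collection name: explicit argument, then setting, then fallback."""
    settings = os.environ if env is None else env
    if explicit:
        return explicit
    # An empty or missing setting falls through to the fallback value.
    return settings.get("VECTOR_COLLECTION_NAME") or FALLBACK_COLLECTION_NAME
```

Code that previously relied on an adapter's `DEFAULT_COLLECTION_NAME` would, under this reading, either pass a name explicitly or set `VECTOR_COLLECTION_NAME` once in the environment.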
