
Commit 560ec4b

docs(deprecated): archive llama-cpp-python integration attempt
Archive failed llama-cpp-python embedding/reranker integration due to:
- Model compatibility issues (Qwen3-Reranker, Gemma-embedding not supported)
- No viable upgrade path (llama-cpp-python 0.3.16 too old for newer models)
- Architectural mismatch with embedding/reranker requirements

Moved to docs/deprecated/llama-cpp-python/:
- Performance test scripts (embedding, reranker benchmarks)
- Test results and reports (2402 lines of documentation)
- GGUF model downloaders
- Analysis and lessons learned

Decision: Continue with PyTorch + fastembed for production use.
- Better model quality (FP16 vs Q8)
- Supports all modern architectures
- Active maintenance and ecosystem
1 parent 47a1cc0 commit 560ec4b

13 files changed

Lines changed: 2402 additions & 0 deletions

docs/deprecated/README.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# Deprecated Approaches Archive

This directory contains failed or abandoned experimental approaches that were explored during QMD-Python development.

## 📁 Current Archives

### [llama-cpp-python/](./llama-cpp-python/)

**Status**: ❌ ABANDONED (2026-02-19)

**Summary**: Attempted to use llama-cpp-python as an alternative embedding engine for better performance and lower resource usage.

**Why It Failed**:
- Model compatibility issues (Qwen3-Reranker, Gemma-embedding not supported)
- No viable upgrade path (llama-cpp-python 0.3.16 too old)
- Architectural mismatch with embedding/reranker models

**What Worked**:
- BGE Small English v1.5 (Q8_0) performed well (5-15 ms latency)
- GPU acceleration worked on a GTX 1660 Ti
- 3.5x smaller model size (35 MB vs 130 MB)
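The working BGE path can be sketched as below. This is not the archived script; the model path and settings are illustrative, and the cosine helper is added here for a usage example.

```python
# Sketch of the path that worked: a BGE Small v1.5 Q8_0 GGUF model
# served through llama-cpp-python. The model path is illustrative.
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def embed_texts(model_path: str, texts: list[str]) -> list[list[float]]:
    # embedding=True puts llama.cpp into embedding mode; n_gpu_layers=-1
    # offloads all layers to the GPU (a GTX 1660 Ti in the benchmarks).
    from llama_cpp import Llama  # lazy import: heavy optional dependency
    llm = Llama(model_path=model_path, embedding=True,
                n_gpu_layers=-1, verbose=False)
    out = llm.create_embedding(texts)
    return [item["embedding"] for item in out["data"]]


# Usage (requires a downloaded GGUF model):
# vecs = embed_texts("models/bge-small-en-v1.5-q8_0.gguf", ["query", "doc"])
# score = cosine(vecs[0], vecs[1])
```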
**What Didn't**:
- Qwen3-Reranker-0.6B failed to load (tensor count mismatch)
- Gemma-embedding models not supported
- No support for modern SOTA embedding architectures

**Decision**: Stick with PyTorch + fastembed for production use.
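The retained fastembed path looks roughly like this. The model name and the batching helper are assumptions for illustration, not the project's actual configuration.

```python
# Minimal sketch of the retained approach: fastembed for embeddings.
# "BAAI/bge-small-en-v1.5" is fastembed's default-tier model, used
# here as a placeholder for whatever QMD-Python actually configures.
from typing import Iterable, List


def batched(items: List[str], size: int) -> Iterable[List[str]]:
    # Yield fixed-size batches so large corpora don't exhaust memory.
    for i in range(0, len(items), size):
        yield items[i:i + size]


def embed_with_fastembed(texts: List[str], batch_size: int = 32):
    from fastembed import TextEmbedding  # lazy import: heavy dependency
    model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(model.embed(batch))  # yields numpy arrays
    return vectors
```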
**Files**:
- `scripts/`: Test and benchmark scripts
- `wheels/`: Pre-built llama-cpp-python packages
- `reports/`: Performance analysis and results
- `README.md`: Detailed analysis and lessons learned

---
## 📝 Purpose of This Archive

### Why Keep Failed Experiments?

1. **Documentation**: Record what was tried and why it failed
2. **Learning**: Capture lessons learned for future reference
3. **Re-evaluation**: Allow re-assessment if the technology improves
4. **Transparency**: Show the development process and its decisions

### When to Revisit?

Re-evaluate archived approaches if:
- New versions ship with substantial improvements
- Project requirements change significantly
- New information suggests viability
- Time and resources allow deeper investigation

---
## 🔍 Quick Reference

| Archive | Date | Status | Main Issue |
|---------|------|--------|------------|
| llama-cpp-python | 2026-02-19 | ❌ Abandoned | Model compatibility |
| (Future archives) | - | - | - |

---
## 💡 Archive Maintenance

**Adding New Archives**:
1. Create a subdirectory with a descriptive name
2. Move all related files into it
3. Write a comprehensive README.md explaining:
   - What was attempted
   - Why it failed
   - What was learned
   - When/if to re-evaluate
4. Update this index
**Removing Archives**:
- **DON'T** - keep archives for historical reference
- Exception: security vulnerabilities (if files are deleted, document why)

---

**Archive Maintained**: 2026-02-19
**Total Archives**: 1
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
=============================================================================
QMD-Python: Llama-cpp-python Integration Archive Summary
=============================================================================

Date: 2026-02-19
Status: COMPLETED (Archived)

=============================================================================
WHAT WAS DONE
=============================================================================

All llama-cpp-python related files have been moved to:
  docs/deprecated/llama-cpp-python/

Archive Contents:
- 12 files total
- 102 MB (mostly wheel files)
- Complete test/benchmark scripts
- Performance reports and documentation
=============================================================================
FILE STRUCTURE
=============================================================================

docs/deprecated/llama-cpp-python/
├── README.md                          # Main analysis (8 KB)
├── scripts/                           # Test scripts (72 KB)
│   ├── test_llama_embedding.py        # Embedding benchmark
│   ├── test_llama_reranker.py         # Reranker benchmark
│   ├── test_model_loading.py          # Model loading tests
│   ├── test_qwen_reranker_loading.py  # Qwen3 compatibility test
│   ├── test_bge_model.py              # BGE model test
│   ├── demo_llama_embedding.py        # Usage demo
│   ├── download_gguf_model.py         # Model downloader
│   └── download_gguf_simple.py        # Simplified downloader
├── wheels/                            # Pre-built packages (102 MB)
│   ├── llama_cpp_python-0.3.16-cp310-cp310-win_amd64.whl
│   └── llama_cpp_python-0.3.4-cp310-cp310-win_amd64.whl
└── reports/                           # Performance data (12 KB)
    ├── LLAMA_CPP_PERFORMANCE_REPORT.md   # Detailed analysis
    └── embedding_benchmark_results.json  # Raw benchmark data
=============================================================================
KEY FINDINGS
=============================================================================

✅ What Worked:
- BGE Small English v1.5 (Q8_0) embedding model
- GPU acceleration on GTX 1660 Ti
- 5-15 ms latency (excellent performance)
- 35 MB model size (3.5x smaller than PyTorch)
- 22 MB GPU memory (20x less than PyTorch)
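Latency figures like "5-15 ms" are typically collected with a harness along these lines; the `embed` callable here stands in for whichever model wrapper the archived benchmark actually timed.

```python
# Sketch of a per-call latency measurement, the kind of harness behind
# the ms figures above. The embed callable is a placeholder.
import time


def measure_latency(embed, texts, warmup=3, runs=20):
    # Warm-up calls absorb one-time costs (model load, CUDA context).
    for t in texts[:warmup]:
        embed(t)
    samples = []
    for _ in range(runs):
        for t in texts:
            t0 = time.perf_counter()
            embed(t)
            samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    samples.sort()
    return {"min": samples[0],
            "p50": samples[len(samples) // 2],
            "max": samples[-1]}
```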
❌ What Didn't:
- Qwen3-Reranker-0.6B (model compatibility)
- Gemma-embedding models (architecture not supported)
- Any modern SOTA embedding/reranker models
- No upgrade path (llama-cpp-python 0.3.16 is the latest)
=============================================================================
DECISION
=============================================================================

Abandoned the llama-cpp-python approach due to:
1. Model compatibility issues (critical blocker)
2. No viable upgrade path
3. Architectural mismatch with requirements

Current Solution: PyTorch + fastembed
- Better model quality (FP16 vs Q8)
- Supports all modern architectures
- Active maintenance and ecosystem
- Acceptable performance (~15-20 ms latency)
=============================================================================
LESSONS LEARNED
=============================================================================

1. Check model compatibility BEFORE implementation
2. Not all "optimized" solutions are better
3. Version constraints can be critical blockers
4. PyTorch remains the best choice for embeddings
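Lesson 1 can be turned into a cheap pre-flight probe: try loading the GGUF model before building anything on it, so failures like the Qwen3-Reranker tensor-count mismatch surface immediately. This helper is a sketch, not an archived script; it catches broadly because llama-cpp-python's failure modes vary.

```python
# Sketch of a model-compatibility pre-check (lesson 1 above): attempt a
# load and report success, rather than discovering breakage mid-pipeline.
def gguf_model_loads(model_path: str) -> bool:
    try:
        from llama_cpp import Llama  # also fails cleanly if not installed
        Llama(model_path=model_path, embedding=True,
              n_ctx=512, verbose=False)
        return True
    except Exception:
        # Missing file, unsupported architecture, tensor mismatch, etc.
        return False
```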
=============================================================================
NEXT STEPS
=============================================================================

✅ DONE:
- Archive all llama-cpp-python files
- Document reasons for abandonment
- Clean up project root directory
- Update deprecated index

🔄 CONTINUE:
- Use PyTorch + fastembed for production
- Monitor llama-cpp-python for future improvements
- Focus on other optimization opportunities
=============================================================================
ARCHIVE INTEGRITY
=============================================================================

Root directory: CLEAN (no llama-cpp files)
Archive location: docs/deprecated/llama-cpp-python/
Total files: 12
Total size: ~102 MB
Documentation: Complete

=============================================================================
For questions, refer to: docs/deprecated/llama-cpp-python/README.md
=============================================================================
