LADbridge is a distributed, LLM-powered Retrieval-Augmented Generation (RAG) orchestration platform for document question-answering and auto-filling. It dynamically decomposes natural-language queries into atomic tasks, discovers available microservices via semantic vector search, and orchestrates their execution, all driven by an LLM planner.
This system is associated with the paper: "LADbridge: LAnguage-Driven composition of APIs on Distributed Edge Devices".
┌──────────────────────────────────────────────────────────────────┐
│ Control Unit │
│ (LLM orchestrator — b4rtaz/DeepSeek-R1-8B) │
│ POST /api/control/invoke (input + optional PDF file) │
└──────────────┬───────────────────────────┬───────────────────────┘
│ │
Consul API Semantic search
GET /v1/agent/services POST /index/search
│ │
┌───────▼───────────┐ ┌──────────▼─────────────┐
│ Registry │ │ Catalog Gateway │
│ (Consul :8500) │ │ (Flask + Cheroot) │
│ Service health │ │ MongoDB + Qdrant │
│ checks + meta │ │ Qwen3-Embedding-0.6B │
└───────────────────┘ │ cross-encoder rerank │
| └────────────────────────┘
│
┌───────────────────────┼──────────────────────────────┐
│ │ │
┌─────▼─────────────┐ ┌──────▼──────────────┐ ┌──────────▼───────────┐
│Question Answering │ │ Document Autofiller │ │ Distributed Llama │
│ (Port 5600) │ │ (Port 5700) │ │ (cluster inference) │
│ │ │ │ │ dllama-api │
│ POST /api/qa/ │ │ POST /api/filler/ │ │ POST /v1/chat/ │
│ invoke, upload, │ │ convert, datadoc, │ │ completions │
│ register, health │ │ tofilldoc, fill, │ └──────────────────────┘
└────────┬──────────┘ │ register, health │
│ └─────────────────────┘
┌────────▼────────┐
│ ChromaDB │
│ (paragraph vec) │
│ multilingual- │
│ e5-large-inst. │
└─────────────────┘
- User sends a query (text + optional PDF) to
control-unit - control-unit discovers registered services via Consul
- control-unit performs semantic search on the catalog-gateway, which queries Qdrant for vector similarity and fetches metadata from MongoDB, then re-ranks results with a cross-encoder
- The LLM (phi4-reasoning:14b via Ollama) receives the discovered services and user query, and returns a JSON execution plan decomposing the request into subtasks
- control-unit executes each subtask synchronously against the appropriate worker service
- Results are collected and returned as JSON (or a PDF file if any task produced one)
| Service | Container | Port | Description |
|---|---|---|---|
| registry | registry |
8500 | HashiCorp Consul service registry |
| catalog-data | catalog-data |
27017 | MongoDB — service metadata store |
| catalog-vector | catalog-vector |
6333 | Qdrant — vector search for capabilities |
| catalog-gui | catalog-gui |
27018 | Mongo-Express Web UI (admin/admin) |
| catalog-gateway | catalog-gateway |
5000 | Flask API gateway for catalog (search + CRUD) |
| control-unit | control-unit |
5500 | Main orchestrator — LLM planner + task execution |
| question-answering | question-answering |
5600 | PDF upload, chunking, embedding + Q&A |
| document-autofiller | document-autofiller |
5700 | PDF form field detection + data-driven filling |
- Docker and Docker Compose (Compose V2)
- LLM backend — choose one:
- Ollama server accessible from the Docker host, with the
phi4-reasoning:14bmodel pulled - Distributed Llama cluster deployed across edge devices (see Cluster Deployment below)
- Ollama server accessible from the Docker host, with the
docker compose up -dEach worker must register itself with Consul and the catalog gateway:
curl -X POST http://localhost:5600/api/qa/register
curl -X POST http://localhost:5700/api/filler/registercurl -X POST http://localhost:5500/api/control/invoke \
-F "input=Summarize the paper" \
-F "file=@document.pdf"Distributed Llama enables LLM inference across a cluster of edge devices (2ⁿ nodes) using tensor parallelism over Ethernet. This replaces the single-node Ollama server with a distributed inference backend.
[🔀 SWITCH OR ROUTER]
| | |
| | |_______ 🔸 device1 (ROOT / dllama-api) 10.0.0.1:9999
| |_________ 🔹 device2 (WORKER) 10.0.0.2:9999
|_____________ 🔹 device3 (WORKER) 10.0.0.3:9999
🔹 device4 (WORKER) 10.0.0.4:9999
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-apipython3 launch.py qwen3_8b_q40Workers do not need the model files — they receive slices over the network.
./dllama worker --port 9999 --nthreads <N_CORES>./dllama-api \
--port 9999 \
--model models/<model_dir>/dllama_model_<model_name>.m \
--tokenizer models/<model_dir>/dllama_tokenizer_<model_name>.t \
--buffer-float-type q80 \
--nthreads <N_CORES> \
--max-seq-len 4096 \
--workers <WORKER1_IP>:9999 <WORKER2_IP>:9999 ...Set the OLLAMA_API_URL environment variable on all LADbridge services to point to the dllama-api root node. The control unit uses the OpenAI-compatible endpoint (/v1/chat/completions) and is directly compatible.
In Compose.yaml:
environment:
- OLLAMA_API_URL=http://10.0.0.1:9999 # point to dllama-api rootThe question-answering and document-autofiller services use Ollama's native
/api/generateendpoint, which is not provided bydllama-api. For full compatibility, run an Ollama instance alongside the cluster for those services, or modify theirquery_ollama()calls to use the OpenAI-compatible chat format.
Then proceed with the standard Quick Start.
- distributed-llama requires 2ⁿ nodes (1, 2, 4, 8...)
- Maximum nodes equals the number of KV heads in the model
- The control unit's OpenAI-compatible calls (
/v1/chat/completions) work directly withdllama-api; the worker services' Ollama-native calls (/api/generate) do not
| Endpoint | Description |
|---|---|
POST /index/search |
Semantic search — embed query, search Qdrant, re-rank with cross-encoder |
POST /service |
Register or update a service (MongoDB + Qdrant) |
GET /services |
List all registered services |
GET /services/<id> |
Get a specific service |
DELETE /services/<id> |
Delete a service |
GET /health |
Health check (model loaded + DB connected) |
- Parameters:
input(text, required) +file(PDF, optional) - Returns: JSON with
execution_planandexecution_results, or a PDF download if any worker produced one
| Endpoint | Description |
|---|---|
POST /api/qa/invoke |
Ask a question against uploaded documents |
POST /api/qa/upload |
Upload a PDF to the knowledge base |
POST /api/qa/register |
Register service with Consul + catalog |
GET /api/qa/health |
Health check |
| Endpoint | Description |
|---|---|
POST /api/filler/convert |
Detect fillable fields in a PDF |
POST /api/filler/datadoc |
Upload data document (CSV/Excel) |
POST /api/filler/tofilldoc |
Upload PDF template to fill |
POST /api/filler/fill |
Fill the PDF template with data |
POST /api/filler/register |
Register service with Consul + catalog |
Each service exposes interactive Swagger UI at /swagger.
| Variable | Default | Service |
|---|---|---|
OLLAMA_API_URL |
http://172.31.20.20:5000 |
All services |
REGISTRY_URL |
http://registry:8500 |
control-unit |
CATALOG_URL |
http://catalog-gateway:5000 |
control-unit |
MONGO_USER / MONGO_PASS |
admin / admin |
catalog-gateway |
MONGO_HOST / MONGO_PORT |
catalog-data / 27017 |
catalog-gateway |
MONGO_DB |
LADbridge |
catalog-gateway |
QDRANT_HOST / QDRANT_PORT |
catalog-vector / 6333 |
catalog-gateway |
QDRANT_COLLECTION |
services |
catalog-gateway |
SERVICE_ID |
"question-answering" / "document-autofiller" |
Workers |
SERVICE_HOST / SERVICE_PORT |
(per service) | Workers |
GATEWAY_HOST / GATEWAY_PORT |
catalog-gateway / 5000 |
Workers |
CONSUL_HOST / CONSUL_PORT |
registry / 8500 |
Workers |
| Component | Technology |
|---|---|
| Backend | Python 3.10, Flask 3.1, Cheroot 10.0 |
| API docs | flask-restx 1.3 (Swagger UI) |
| Service registry | HashiCorp Consul |
| Document store | MongoDB 8.0 |
| Vector store (catalog) | Qdrant |
| Vector store (QA) | ChromaDB (local) |
| LLM backend | Distributed Llama |
| Embeddings (catalog) | Qwen/Qwen3-Embedding-0.6B |
| Embeddings (QA) | intfloat/multilingual-e5-large-instruct |
| Re-ranker | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| PDF processing | PyMuPDF (fitz) |
| Orchestration | Docker Compose |
Each service can be run independently:
cd <service-dir>
pip install -r requirements.txt
python app.pyThe results/ and results-with-performance/ directories contain benchmarking infrastructure for comparing centralized vs. distributed deployments (1, 2, 4, or 8 nodes). See the submitter.py script for automated benchmark execution with Prometheus Node Exporter metrics sampling.