LADbridge: LAnguage-Driven composition of APIs on Distributed Edge Devices

LADbridge is a distributed, LLM-powered Retrieval-Augmented Generation (RAG) orchestration platform for document question-answering and auto-filling. It dynamically decomposes natural-language queries into atomic tasks, discovers available microservices via semantic vector search, and orchestrates their execution, all driven by an LLM planner.

This system is associated with the paper: "LADbridge: LAnguage-Driven composition of APIs on Distributed Edge Devices".

Architecture

                                    ┌──────────────────────────────────────────────────────────────────┐
                                    │                       Control Unit                               │
                                    │           (LLM orchestrator — b4rtaz/DeepSeek-R1-8B)             │
                                    │     POST /api/control/invoke  (input + optional PDF file)        │
                                    └──────────────┬───────────────────────────┬───────────────────────┘
                                                   │                           │
                                           Consul API                  Semantic search
                                     GET /v1/agent/services        POST /index/search
                                                   │                           │
                                           ┌───────▼───────────┐    ┌──────────▼─────────────┐
                                           │     Registry      │    │    Catalog Gateway     │
                                           │  (Consul :8500)   │    │  (Flask + Cheroot)     │
                                           │  Service health   │    │  MongoDB + Qdrant      │
                                           │  checks + meta    │    │  Qwen3-Embedding-0.6B  │
                                           └───────────────────┘    │  cross-encoder rerank  │
                                                   |                └────────────────────────┘
                                                   │
                           ┌───────────────────────┼──────────────────────────────┐
                           │                       │                              │
                     ┌─────▼─────────────┐  ┌──────▼──────────────┐    ┌──────────▼───────────┐
                     │Question Answering │  │ Document Autofiller │    │  Distributed Llama   │
                     │   (Port 5600)     │  │    (Port 5700)      │    │  (cluster inference) │
                     │                   │  │                     │    │  dllama-api          │
                     │ POST /api/qa/     │  │ POST /api/filler/   │    │  POST /v1/chat/      │
                     │  invoke, upload,  │  │  convert, datadoc,  │    │  completions         │
                     │  register, health │  │  tofilldoc, fill,   │    └──────────────────────┘
                     └────────┬──────────┘  │  register, health   │
                              │             └─────────────────────┘
                     ┌────────▼────────┐
                     │   ChromaDB      │
                     │ (paragraph vec) │
                     │ multilingual-   │
                     │ e5-large-inst.  │
                     └─────────────────┘

User sends a query (text + optional PDF) to control-unit
control-unit discovers registered services via Consul
control-unit performs semantic search on the catalog-gateway, which queries Qdrant for vector similarity and fetches metadata from MongoDB, then re-ranks results with a cross-encoder
The LLM (phi4-reasoning:14b via Ollama) receives the discovered services and user query, and returns a JSON execution plan decomposing the request into subtasks
control-unit executes each subtask synchronously against the appropriate worker service
Results are collected and returned as JSON (or a PDF file if any task produced one)

Services

Service	Container	Port	Description
registry	`registry`	8500	HashiCorp Consul service registry
catalog-data	`catalog-data`	27017	MongoDB — service metadata store
catalog-vector	`catalog-vector`	6333	Qdrant — vector search for capabilities
catalog-gui	`catalog-gui`	27018	Mongo-Express Web UI (admin/admin)
catalog-gateway	`catalog-gateway`	5000	Flask API gateway for catalog (search + CRUD)
control-unit	`control-unit`	5500	Main orchestrator — LLM planner + task execution
question-answering	`question-answering`	5600	PDF upload, chunking, embedding + Q&A
document-autofiller	`document-autofiller`	5700	PDF form field detection + data-driven filling

Prerequisites

Docker and Docker Compose (Compose V2)
LLM backend — choose one:
- Ollama server accessible from the Docker host, with the phi4-reasoning:14b model pulled
- Distributed Llama cluster deployed across edge devices (see Cluster Deployment below)

Quick Start

1. Start all services

docker compose up -d

2. Register worker services

Each worker must register itself with Consul and the catalog gateway:

curl -X POST http://localhost:5600/api/qa/register
curl -X POST http://localhost:5700/api/filler/register

3. Send a query

curl -X POST http://localhost:5500/api/control/invoke \
  -F "input=Summarize the paper" \
  -F "file=@document.pdf"

Cluster Deployment with Distributed Llama

Distributed Llama enables LLM inference across a cluster of edge devices (2ⁿ nodes) using tensor parallelism over Ethernet. This replaces the single-node Ollama server with a distributed inference backend.

Architecture

[🔀 SWITCH OR ROUTER]
      |   |   |
      |   |   |_______ 🔸 device1 (ROOT / dllama-api)  10.0.0.1:9999
      |   |_________  🔹 device2 (WORKER)             10.0.0.2:9999
      |_____________ 🔹 device3 (WORKER)             10.0.0.3:9999
                    🔹 device4 (WORKER)             10.0.0.4:9999

1. Build on all devices

git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api

2. Download a model (root device only)

python3 launch.py qwen3_8b_q40

Workers do not need the model files — they receive slices over the network.

3. Start workers on all worker devices

./dllama worker --port 9999 --nthreads <N_CORES>

4. Start the API server on the root device

./dllama-api \
  --port 9999 \
  --model models/<model_dir>/dllama_model_<model_name>.m \
  --tokenizer models/<model_dir>/dllama_tokenizer_<model_name>.t \
  --buffer-float-type q80 \
  --nthreads <N_CORES> \
  --max-seq-len 4096 \
  --workers <WORKER1_IP>:9999 <WORKER2_IP>:9999 ...

5. Configure LADbridge

Set the OLLAMA_API_URL environment variable on all LADbridge services to point to the dllama-api root node. The control unit uses the OpenAI-compatible endpoint (/v1/chat/completions) and is directly compatible.

In Compose.yaml:

environment:
  - OLLAMA_API_URL=http://10.0.0.1:9999    # point to dllama-api root

The question-answering and document-autofiller services use Ollama's native /api/generate endpoint, which is not provided by dllama-api. For full compatibility, run an Ollama instance alongside the cluster for those services, or modify their query_ollama() calls to use the OpenAI-compatible chat format.

Then proceed with the standard Quick Start.

Known Limitations

distributed-llama requires 2ⁿ nodes (1, 2, 4, 8...)
Maximum nodes equals the number of KV heads in the model
The control unit's OpenAI-compatible calls (/v1/chat/completions) work directly with dllama-api; the worker services' Ollama-native calls (/api/generate) do not

API Overview

catalog-gateway

Endpoint	Description
`POST /index/search`	Semantic search — embed query, search Qdrant, re-rank with cross-encoder
`POST /service`	Register or update a service (MongoDB + Qdrant)
`GET /services`	List all registered services
`GET /services/<id>`	Get a specific service
`DELETE /services/<id>`	Delete a service
`GET /health`	Health check (model loaded + DB connected)

control-unit (`POST /api/control/invoke`)

Parameters: input (text, required) + file (PDF, optional)
Returns: JSON with execution_plan and execution_results, or a PDF download if any worker produced one

question-answering

Endpoint	Description
`POST /api/qa/invoke`	Ask a question against uploaded documents
`POST /api/qa/upload`	Upload a PDF to the knowledge base
`POST /api/qa/register`	Register service with Consul + catalog
`GET /api/qa/health`	Health check

document-autofiller

Endpoint	Description
`POST /api/filler/convert`	Detect fillable fields in a PDF
`POST /api/filler/datadoc`	Upload data document (CSV/Excel)
`POST /api/filler/tofilldoc`	Upload PDF template to fill
`POST /api/filler/fill`	Fill the PDF template with data
`POST /api/filler/register`	Register service with Consul + catalog

Each service exposes interactive Swagger UI at /swagger.

Configuration

Environment Variables

Variable	Default	Service
`OLLAMA_API_URL`	`http://172.31.20.20:5000`	All services
`REGISTRY_URL`	`http://registry:8500`	control-unit
`CATALOG_URL`	`http://catalog-gateway:5000`	control-unit
`MONGO_USER` / `MONGO_PASS`	`admin` / `admin`	catalog-gateway
`MONGO_HOST` / `MONGO_PORT`	`catalog-data` / `27017`	catalog-gateway
`MONGO_DB`	`LADbridge`	catalog-gateway
`QDRANT_HOST` / `QDRANT_PORT`	`catalog-vector` / `6333`	catalog-gateway
`QDRANT_COLLECTION`	`services`	catalog-gateway
`SERVICE_ID`	`"question-answering"` / `"document-autofiller"`	Workers
`SERVICE_HOST` / `SERVICE_PORT`	(per service)	Workers
`GATEWAY_HOST` / `GATEWAY_PORT`	`catalog-gateway` / `5000`	Workers
`CONSUL_HOST` / `CONSUL_PORT`	`registry` / `8500`	Workers

Tech Stack

Component	Technology
Backend	Python 3.10, Flask 3.1, Cheroot 10.0
API docs	flask-restx 1.3 (Swagger UI)
Service registry	HashiCorp Consul
Document store	MongoDB 8.0
Vector store (catalog)	Qdrant
Vector store (QA)	ChromaDB (local)
LLM backend	Distributed Llama
Embeddings (catalog)	`Qwen/Qwen3-Embedding-0.6B`
Embeddings (QA)	`intfloat/multilingual-e5-large-instruct`
Re-ranker	`cross-encoder/ms-marco-MiniLM-L-6-v2`
PDF processing	PyMuPDF (fitz)
Orchestration	Docker Compose

Development

Each service can be run independently:

cd <service-dir>
pip install -r requirements.txt
python app.py

Running Experiments

The results/ and results-with-performance/ directories contain benchmarking infrastructure for comparing centralized vs. distributed deployments (1, 2, 4, or 8 nodes). See the submitter.py script for automated benchmark execution with Prometheus Node Exporter metrics sampling.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
control-unit		control-unit
db-gateway		db-gateway
document-autofiller		document-autofiller
document-qa		document-qa
results-with-performance		results-with-performance
results		results
Compose.yaml		Compose.yaml
README.md		README.md
install_node-exporter.sh		install_node-exporter.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LADbridge: LAnguage-Driven composition of APIs on Distributed Edge Devices

Architecture

Services

Prerequisites

Quick Start

1. Start all services

2. Register worker services

3. Send a query

Cluster Deployment with Distributed Llama

Architecture

1. Build on all devices

2. Download a model (root device only)

3. Start workers on all worker devices

4. Start the API server on the root device

5. Configure LADbridge

Known Limitations

API Overview

catalog-gateway

control-unit (`POST /api/control/invoke`)

question-answering

document-autofiller

Configuration

Environment Variables

Tech Stack

Development

Running Experiments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LADbridge: LAnguage-Driven composition of APIs on Distributed Edge Devices

Architecture

Services

Prerequisites

Quick Start

1. Start all services

2. Register worker services

3. Send a query

Cluster Deployment with Distributed Llama

Architecture

1. Build on all devices

2. Download a model (root device only)

3. Start workers on all worker devices

4. Start the API server on the root device

5. Configure LADbridge

Known Limitations

API Overview

catalog-gateway

control-unit (POST /api/control/invoke)

question-answering

document-autofiller

Configuration

Environment Variables

Tech Stack

Development

Running Experiments

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

control-unit (`POST /api/control/invoke`)

Packages