Skip to content

FrancescoMazzitelli/LADbridge

Repository files navigation

LADbridge: LAnguage-Driven composition of APIs on Distributed Edge Devices

LADbridge is a distributed, LLM-powered Retrieval-Augmented Generation (RAG) orchestration platform for document question-answering and auto-filling. It dynamically decomposes natural-language queries into atomic tasks, discovers available microservices via semantic vector search, and orchestrates their execution, all driven by an LLM planner.

This system is associated with the paper: "LADbridge: LAnguage-Driven composition of APIs on Distributed Edge Devices".

Architecture

                                    ┌──────────────────────────────────────────────────────────────────┐
                                    │                       Control Unit                               │
                                    │           (LLM orchestrator — b4rtaz/DeepSeek-R1-8B)             │
                                    │     POST /api/control/invoke  (input + optional PDF file)        │
                                    └──────────────┬───────────────────────────┬───────────────────────┘
                                                   │                           │
                                           Consul API                  Semantic search
                                     GET /v1/agent/services        POST /index/search
                                                   │                           │
                                           ┌───────▼───────────┐    ┌──────────▼─────────────┐
                                           │     Registry      │    │    Catalog Gateway     │
                                           │  (Consul :8500)   │    │  (Flask + Cheroot)     │
                                           │  Service health   │    │  MongoDB + Qdrant      │
                                           │  checks + meta    │    │  Qwen3-Embedding-0.6B  │
                                           └───────────────────┘    │  cross-encoder rerank  │
                                                   |                └────────────────────────┘
                                                   │
                           ┌───────────────────────┼──────────────────────────────┐
                           │                       │                              │
                     ┌─────▼─────────────┐  ┌──────▼──────────────┐    ┌──────────▼───────────┐
                     │Question Answering │  │ Document Autofiller │    │  Distributed Llama   │
                     │   (Port 5600)     │  │    (Port 5700)      │    │  (cluster inference) │
                     │                   │  │                     │    │  dllama-api          │
                     │ POST /api/qa/     │  │ POST /api/filler/   │    │  POST /v1/chat/      │
                     │  invoke, upload,  │  │  convert, datadoc,  │    │  completions         │
                     │  register, health │  │  tofilldoc, fill,   │    └──────────────────────┘
                     └────────┬──────────┘  │  register, health   │
                              │             └─────────────────────┘
                     ┌────────▼────────┐
                     │   ChromaDB      │
                     │ (paragraph vec) │
                     │ multilingual-   │
                     │ e5-large-inst.  │
                     └─────────────────┘
  1. User sends a query (text + optional PDF) to control-unit
  2. control-unit discovers registered services via Consul
  3. control-unit performs semantic search on the catalog-gateway, which queries Qdrant for vector similarity and fetches metadata from MongoDB, then re-ranks results with a cross-encoder
  4. The LLM (phi4-reasoning:14b via Ollama) receives the discovered services and user query, and returns a JSON execution plan decomposing the request into subtasks
  5. control-unit executes each subtask synchronously against the appropriate worker service
  6. Results are collected and returned as JSON (or a PDF file if any task produced one)

Services

Service Container Port Description
registry registry 8500 HashiCorp Consul service registry
catalog-data catalog-data 27017 MongoDB — service metadata store
catalog-vector catalog-vector 6333 Qdrant — vector search for capabilities
catalog-gui catalog-gui 27018 Mongo-Express Web UI (admin/admin)
catalog-gateway catalog-gateway 5000 Flask API gateway for catalog (search + CRUD)
control-unit control-unit 5500 Main orchestrator — LLM planner + task execution
question-answering question-answering 5600 PDF upload, chunking, embedding + Q&A
document-autofiller document-autofiller 5700 PDF form field detection + data-driven filling

Prerequisites

  • Docker and Docker Compose (Compose V2)
  • LLM backend — choose one:
    • Ollama server accessible from the Docker host, with the phi4-reasoning:14b model pulled
    • Distributed Llama cluster deployed across edge devices (see Cluster Deployment below)

Quick Start

1. Start all services

docker compose up -d

2. Register worker services

Each worker must register itself with Consul and the catalog gateway:

curl -X POST http://localhost:5600/api/qa/register
curl -X POST http://localhost:5700/api/filler/register

3. Send a query

curl -X POST http://localhost:5500/api/control/invoke \
  -F "input=Summarize the paper" \
  -F "file=@document.pdf"

Cluster Deployment with Distributed Llama

Distributed Llama enables LLM inference across a cluster of edge devices (2ⁿ nodes) using tensor parallelism over Ethernet. This replaces the single-node Ollama server with a distributed inference backend.

Architecture

[🔀 SWITCH OR ROUTER]
      |   |   |
      |   |   |_______ 🔸 device1 (ROOT / dllama-api)  10.0.0.1:9999
      |   |_________  🔹 device2 (WORKER)             10.0.0.2:9999
      |_____________ 🔹 device3 (WORKER)             10.0.0.3:9999
                    🔹 device4 (WORKER)             10.0.0.4:9999

1. Build on all devices

git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api

2. Download a model (root device only)

python3 launch.py qwen3_8b_q40

Workers do not need the model files — they receive slices over the network.

3. Start workers on all worker devices

./dllama worker --port 9999 --nthreads <N_CORES>

4. Start the API server on the root device

./dllama-api \
  --port 9999 \
  --model models/<model_dir>/dllama_model_<model_name>.m \
  --tokenizer models/<model_dir>/dllama_tokenizer_<model_name>.t \
  --buffer-float-type q80 \
  --nthreads <N_CORES> \
  --max-seq-len 4096 \
  --workers <WORKER1_IP>:9999 <WORKER2_IP>:9999 ...

5. Configure LADbridge

Set the OLLAMA_API_URL environment variable on all LADbridge services to point to the dllama-api root node. The control unit uses the OpenAI-compatible endpoint (/v1/chat/completions) and is directly compatible.

In Compose.yaml:

environment:
  - OLLAMA_API_URL=http://10.0.0.1:9999    # point to dllama-api root

The question-answering and document-autofiller services use Ollama's native /api/generate endpoint, which is not provided by dllama-api. For full compatibility, run an Ollama instance alongside the cluster for those services, or modify their query_ollama() calls to use the OpenAI-compatible chat format.

Then proceed with the standard Quick Start.

Known Limitations

  • distributed-llama requires 2ⁿ nodes (1, 2, 4, 8...)
  • Maximum nodes equals the number of KV heads in the model
  • The control unit's OpenAI-compatible calls (/v1/chat/completions) work directly with dllama-api; the worker services' Ollama-native calls (/api/generate) do not

API Overview

catalog-gateway

Endpoint Description
POST /index/search Semantic search — embed query, search Qdrant, re-rank with cross-encoder
POST /service Register or update a service (MongoDB + Qdrant)
GET /services List all registered services
GET /services/<id> Get a specific service
DELETE /services/<id> Delete a service
GET /health Health check (model loaded + DB connected)

control-unit (POST /api/control/invoke)

  • Parameters: input (text, required) + file (PDF, optional)
  • Returns: JSON with execution_plan and execution_results, or a PDF download if any worker produced one

question-answering

Endpoint Description
POST /api/qa/invoke Ask a question against uploaded documents
POST /api/qa/upload Upload a PDF to the knowledge base
POST /api/qa/register Register service with Consul + catalog
GET /api/qa/health Health check

document-autofiller

Endpoint Description
POST /api/filler/convert Detect fillable fields in a PDF
POST /api/filler/datadoc Upload data document (CSV/Excel)
POST /api/filler/tofilldoc Upload PDF template to fill
POST /api/filler/fill Fill the PDF template with data
POST /api/filler/register Register service with Consul + catalog

Each service exposes interactive Swagger UI at /swagger.

Configuration

Environment Variables

Variable Default Service
OLLAMA_API_URL http://172.31.20.20:5000 All services
REGISTRY_URL http://registry:8500 control-unit
CATALOG_URL http://catalog-gateway:5000 control-unit
MONGO_USER / MONGO_PASS admin / admin catalog-gateway
MONGO_HOST / MONGO_PORT catalog-data / 27017 catalog-gateway
MONGO_DB LADbridge catalog-gateway
QDRANT_HOST / QDRANT_PORT catalog-vector / 6333 catalog-gateway
QDRANT_COLLECTION services catalog-gateway
SERVICE_ID "question-answering" / "document-autofiller" Workers
SERVICE_HOST / SERVICE_PORT (per service) Workers
GATEWAY_HOST / GATEWAY_PORT catalog-gateway / 5000 Workers
CONSUL_HOST / CONSUL_PORT registry / 8500 Workers

Tech Stack

Component Technology
Backend Python 3.10, Flask 3.1, Cheroot 10.0
API docs flask-restx 1.3 (Swagger UI)
Service registry HashiCorp Consul
Document store MongoDB 8.0
Vector store (catalog) Qdrant
Vector store (QA) ChromaDB (local)
LLM backend Distributed Llama
Embeddings (catalog) Qwen/Qwen3-Embedding-0.6B
Embeddings (QA) intfloat/multilingual-e5-large-instruct
Re-ranker cross-encoder/ms-marco-MiniLM-L-6-v2
PDF processing PyMuPDF (fitz)
Orchestration Docker Compose

Development

Each service can be run independently:

cd <service-dir>
pip install -r requirements.txt
python app.py

Running Experiments

The results/ and results-with-performance/ directories contain benchmarking infrastructure for comparing centralized vs. distributed deployments (1, 2, 4, or 8 nodes). See the submitter.py script for automated benchmark execution with Prometheus Node Exporter metrics sampling.

About

LAnguage-Driven composition of APIs on Distributed Edge Devices

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages