Public portfolio release · Multi-agent RAG system for football scouting
ScoutingInteligenteBot is an end-to-end AI system for football scouting. It combines vector retrieval, deterministic filtering and statistical scoring with LLM-generated explanations and automated visualizations, exposed through a Telegram bot interface.
The project started as my MSc thesis and evolved into a productized side project focused on reliable AI system design: each stage has a clear responsibility, and the LLM explains the output but does not decide the ranking.
This repository is a public, data-free portfolio release of the system.
It includes the application architecture, ETL structure, multi-agent pipeline, scoring logic, visualization modules, Docker setup and Telegram bot integration.
It does not include:
- raw third-party player data,
- the private local data snapshot,
- processed player databases,
- generated FAISS indices,
- API keys, bot tokens or private environment files.
The original system was developed and validated on a private local snapshot of 14,508 players as of 2026-04-01: 13,020 outfield players and 1,488 goalkeepers. That snapshot is not redistributed in this repository because the underlying data comes from third-party sources.
To run the system end-to-end, you must provide your own authorized data and generate compatible processed databases and FAISS indices.
Active private development may continue separately beyond this public release.
A scout can ask a natural-language query such as:
“Right-footed U23 right winger, strong in dribbling and crossing, max market value 500k, less than two years left on contract”
The system then:
- retrieves a broad candidate pool using FAISS and multilingual embeddings;
- applies hard filters for explicit constraints such as age, position, market value, foot, height, contract, league and nationality;
- scores candidates using deterministic statistical logic based on percentiles, minutes and league coefficients;
- generates a natural-language explanation of the Top-3 candidates;
- produces comparative radar charts and individual player profile visualizations.
The ranking is produced by deterministic scoring logic. The LLM is used for explanation, not for deciding the final ranking.
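The division of responsibilities above can be sketched as a chain of plain functions. This is a minimal illustration, not the repository's LangGraph graph: the stage functions, the `State` dictionary and the toy players are all hypothetical, and retrieval is replaced by a pass-through.

```python
from typing import Callable

State = dict  # hypothetical shared state passed from stage to stage

def retrieve(state: State) -> State:
    # Stand-in for vector retrieval: return every known player as the broad pool.
    return {**state, "candidates": list(state["players"])}

def hard_filter(state: State) -> State:
    # Deterministic constraint: keep players at or below the maximum age.
    kept = [p for p in state["candidates"] if p["age"] <= state["max_age"]]
    return {**state, "candidates": kept}

def score(state: State) -> State:
    # Deterministic ranking: sort by a precomputed numeric score, keep the Top-3.
    ranked = sorted(state["candidates"], key=lambda p: p["score"], reverse=True)
    return {**state, "top3": ranked[:3]}

def explain(state: State) -> State:
    # The explanation stage only describes the ranking; it never reorders it.
    lines = [
        f"{i}. {p['name']} (score {p['score']})"
        for i, p in enumerate(state["top3"], 1)
    ]
    return {**state, "report": "\n".join(lines)}

STAGES: list[Callable[[State], State]] = [retrieve, hard_filter, score, explain]

def run_pipeline(state: State) -> State:
    for stage in STAGES:
        state = stage(state)
    return state

result = run_pipeline({
    "max_age": 23,
    "players": [
        {"name": "A", "age": 21, "score": 71.0},
        {"name": "B", "age": 27, "score": 90.0},
        {"name": "C", "age": 22, "score": 84.5},
    ],
})
print(result["report"])
```

Player B has the highest score but is removed by the hard filter before ranking, which is the point of ordering the stages this way.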
The system follows a retrieval → filtering → scoring → explanation → visualization pipeline, built on the following stack:
- LangGraph for multi-agent orchestration.
- FAISS for vector retrieval.
- SentenceTransformers with multilingual embeddings.
- Polars for dataframe processing.
- OpenAI API for natural-language explanations.
- Plotly and mplsoccer for radar/player profile visualizations.
- Docker / Docker Compose for reproducible execution.
- Telegram Bot API as the user interface.
ScoutingInteligenteBot/
├── data/                  # Local data folder, not versioned
│   └── processed/
│       ├── merged/        # Expected processed player databases
│       └── indices/       # Expected FAISS indices + metadata
├── src/
│   └── scouting/
│       ├── etl/           # ETL, merge, enrichment and indexing modules
│       └── agents/        # Multi-agent pipeline implementation and Telegram integration
├── figuras/               # README and demo images
├── .env.example           # Environment variable template
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml
└── README.md
This repository does not ship with the player database or generated FAISS indices.
Expected local files for an end-to-end run:
data/
└─ processed/
   ├─ merged/
   │  ├─ db_jugadores.json
   │  └─ db_porteros.json
   └─ indices/current/
      ├─ faiss_jugadores.index
      ├─ faiss_porteros.index
      ├─ metadata_jugadores.json
      └─ metadata_porteros.json
These files must be generated from your own authorized data sources or adapted from your own internal databases.
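A quick preflight check can confirm that this layout is in place before starting the bot. The sketch below is not part of the repository: `missing_files` and `EXPECTED_FILES` are hypothetical names, and only the file paths come from the listing above.

```python
from pathlib import Path

# Paths relative to the data/ folder, copied from the expected layout.
EXPECTED_FILES = [
    "processed/merged/db_jugadores.json",
    "processed/merged/db_porteros.json",
    "processed/indices/current/faiss_jugadores.index",
    "processed/indices/current/faiss_porteros.index",
    "processed/indices/current/metadata_jugadores.json",
    "processed/indices/current/metadata_porteros.json",
]

def missing_files(data_dir: str) -> list[str]:
    """Return the expected files that are absent under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_files("data")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("All expected files are present.")
```

The check only verifies that the files exist. It does not validate their contents or the metadata schema.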
The ETL modules included in src/scouting/etl/ document the data engineering approach used by the system: collection, normalization, identity matching, merge and FAISS indexing. They are provided as a reference implementation that can be adapted to compatible datasets.
Users are responsible for complying with the terms of service, licensing conditions and legal requirements of any data provider they use.
git clone https://github.com/jorgeeegj/ScoutingInteligenteBot.git
cd ScoutingInteligenteBot
cp .env.example .env
Edit .env with your own credentials:
OPENAI_API_KEY=...
TELEGRAM_BOT_TOKEN=...
docker compose build --no-cache --pull
The bot expects processed databases and FAISS indices under:
data/processed/merged/
data/processed/indices/current/
These files are not included in the public repository.
docker compose up -d bot
docker compose logs -f bot
Then send a scouting query to your Telegram bot, for example:
“Extremo izquierdo joven, habilidoso, con regate y gol. Precio máximo 500k” (in English: “Young, skilful left winger with dribbling and goals. Maximum price 500k”)
The ETL layer is included to show how the system was designed and how compatible data can be prepared.
Main stages:
- Collection: source-specific collectors gather player statistics, identity fields, market information and contract-related attributes.
- Merge and identity matching: records from different sources are cleaned, normalized and matched without relying on a single shared ID. The matching logic combines names, dates of birth, height, dominant foot, teams, leagues and other identity signals.
- Vector indexing: player metadata is embedded and indexed in FAISS for broad semantic retrieval.
- Runtime query pipeline: at runtime, the bot does not fetch live data. It queries the local processed snapshot and indices.
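To illustrate the identity-matching stage, here is a minimal sketch of combining several identity signals into one match score when no shared ID exists. It is an assumption-laden toy, not the repository's matching logic: the field names, the weights, the 2 cm height tolerance and the fictional players are all hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize_name(name: str) -> str:
    """Lowercase, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())

def match_score(a: dict, b: dict) -> float:
    """Combine identity signals into a score in [0, 1]. Weights are illustrative."""
    name_sim = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    same_dob = 1.0 if a["dob"] == b["dob"] else 0.0
    same_foot = 1.0 if a["foot"] == b["foot"] else 0.0
    close_height = 1.0 if abs(a["height_cm"] - b["height_cm"]) <= 2 else 0.0
    return 0.5 * name_sim + 0.3 * same_dob + 0.1 * same_foot + 0.1 * close_height

# Fictional records, as they might appear in two different sources.
source_a = {"name": "Álvaro Núñez", "dob": "2003-05-14", "foot": "right", "height_cm": 178}
source_b = {"name": "Alvaro Nunez", "dob": "2003-05-14", "foot": "right", "height_cm": 177}
source_c = {"name": "Álvaro Muñoz", "dob": "2001-02-03", "foot": "right", "height_cm": 179}

print(match_score(source_a, source_b))  # same person, different spelling
print(match_score(source_a, source_c))  # similar name, different person
```

A real merge would typically also compare teams and leagues and apply a decision threshold. Those are left out here to keep the sketch short.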
Typical commands:
# Full ETL pipeline
docker compose run --rm etl python -m scouting.etl.main_etl
# Merge data
docker compose run --rm etl python -m scouting.etl.merge_data
# Build FAISS indices
docker compose run --rm etl python -m scouting.etl.vector_store_indexing
The exact collectors may require maintenance as websites, APIs and data providers change.
For professional or production use, the recommended path is to adapt the schema and transformation logic to your own authorized data sources or internal databases.
The core implementation lives in src/scouting/agents/.
src/scouting/agents/
├── agent0.py # Vector Retriever
├── agent1.py # Hard Filter
├── agent2.py # Score Evaluator
├── agent3/ # LLM Explanation layer
├── agent4/ # Visualization / report compositor
├── common.py # Shared contracts, config and utilities
├── pipeline.py # LangGraph multi-agent pipeline
└── telegram_bot.py # Telegram polling interface
Agent 0 (Vector Retriever) retrieves a broad candidate pool from FAISS using multilingual embeddings.
Main responsibilities:
- encode the normalized scouting query;
- search separate FAISS indices for outfield players and goalkeepers;
- return a Polars dataframe with structured player metadata;
- preserve enough recall for later deterministic filtering.
Default retrieval is intentionally broad so that hard filters and scoring can operate on a large candidate set.
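The retrieval step can be sketched with a brute-force cosine search. This is a stand-in for FAISS, not its API, and it uses plain lists instead of a Polars dataframe. The toy three-dimensional vectors, the player IDs and the `top_k` helper are hypothetical; a real run would use embeddings from the multilingual model.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k most similar entries, best first."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Two separate toy indices mirror the outfield/goalkeeper split.
outfield_index = {
    "winger_1": [0.9, 0.1, 0.0],
    "striker_1": [0.6, 0.4, 0.0],
    "defender_1": [0.1, 0.9, 0.0],
}
goalkeeper_index = {"keeper_1": [0.0, 0.1, 0.9]}

query = [1.0, 0.0, 0.0]  # pretend this encodes a winger-style query
pool = top_k(query, outfield_index, k=2) + top_k(query, goalkeeper_index, k=1)
print([pid for pid, _ in pool])
```

Keeping `k` large relative to the final Top-3 is what preserves recall: candidates that are only loosely similar still reach the filtering and scoring stages.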
Agent 1 (Hard Filter) applies explicit constraints from the query.
Supported filter families include:
- age,
- position,
- market value,
- dominant foot,
- height,
- contract duration,
- nationality,
- league and country of league.
This stage is deterministic and exists to enforce hard user constraints before ranking.
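A hard filter of this kind can be sketched as a set of optional constraints that a player must satisfy in full. The `Constraints` class, the field names and the sample players below are hypothetical, and only a few of the filter families are covered. Reading "U23" as "age under 23" is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Constraints:
    # Hypothetical container for parsed constraints; None means "not requested".
    max_age: Optional[int] = None
    position: Optional[str] = None
    max_market_value: Optional[int] = None
    foot: Optional[str] = None
    contract_years_below: Optional[float] = None  # exclusive upper bound

def passes(player: dict, c: Constraints) -> bool:
    """Return True only if the player satisfies every requested constraint."""
    if c.max_age is not None and player["age"] > c.max_age:
        return False
    if c.position is not None and player["position"] != c.position:
        return False
    if c.max_market_value is not None and player["market_value"] > c.max_market_value:
        return False
    if c.foot is not None and player["foot"] != c.foot:
        return False
    if c.contract_years_below is not None and player["contract_years_left"] >= c.contract_years_below:
        return False
    return True

# "Right-footed U23 right winger, max market value 500k, less than two years left"
constraints = Constraints(
    max_age=22, position="RW", max_market_value=500_000,
    foot="right", contract_years_below=2.0,
)
players = [
    {"name": "p1", "age": 21, "position": "RW", "market_value": 400_000, "foot": "right", "contract_years_left": 1.5},
    {"name": "p2", "age": 24, "position": "RW", "market_value": 300_000, "foot": "right", "contract_years_left": 1.0},
    {"name": "p3", "age": 20, "position": "RW", "market_value": 450_000, "foot": "left", "contract_years_left": 1.0},
    {"name": "p4", "age": 22, "position": "RW", "market_value": 500_000, "foot": "right", "contract_years_left": 2.0},
]
kept = [p["name"] for p in players if passes(p, constraints)]
print(kept)
```

Because every check is a plain comparison, the same query and the same snapshot always yield the same filtered set.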
Agent 2 (Score Evaluator) ranks the filtered candidates using statistical logic.
The scoring layer combines:
- query-relevant metric selection,
- percentiles and per-90 values,
- minutes-aware adjustments,
- league coefficients,
- role-specific metric profiles.
The LLM does not rank players. Ranking is produced by this deterministic scoring layer.
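One way such a scoring layer could combine these ingredients is shown below. The formula is an illustrative assumption, not the repository's actual scoring: it averages the percentiles of the query-relevant metrics, discounts players with few minutes, and multiplies by a league coefficient. The 900-minute threshold, the coefficients and the players are invented for the example.

```python
def minutes_factor(minutes: int, full_sample: int = 900) -> float:
    """Scale from 0 to 1; players below full_sample minutes are discounted."""
    return min(1.0, minutes / full_sample)

def score(player: dict, metrics: list[str], league_coef: dict[str, float]) -> float:
    """Mean percentile over the selected metrics, adjusted for minutes and league."""
    mean_pct = sum(player["percentiles"][m] for m in metrics) / len(metrics)
    return mean_pct * minutes_factor(player["minutes"]) * league_coef[player["league"]]

league_coef = {"league_a": 1.0, "league_b": 0.8}  # hypothetical coefficients
metrics = ["dribbles", "crosses"]                 # selected from the query

players = [
    {"name": "A", "league": "league_a", "minutes": 1800, "percentiles": {"dribbles": 90, "crosses": 80}},
    {"name": "B", "league": "league_a", "minutes": 450, "percentiles": {"dribbles": 95, "crosses": 95}},
    {"name": "C", "league": "league_b", "minutes": 900, "percentiles": {"dribbles": 90, "crosses": 90}},
]

ranking = sorted(players, key=lambda p: score(p, metrics, league_coef), reverse=True)
print([p["name"] for p in ranking])
```

Player B has the best raw percentiles but only half the minutes threshold, so the minutes adjustment pushes B below both A and C.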
Agent 3 (LLM Explanation layer) generates a professional natural-language scouting explanation for the selected candidates.
The explanation agent receives structured evidence from Agent 2 and writes a user-facing report in the language of the original query.
It is constrained to explain the available data rather than invent new attributes.
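A common way to impose this kind of constraint is to pass the model only the structured evidence and state the restriction in the prompt. The sketch below builds such a prompt without calling any API. The wording, the `build_explanation_prompt` helper and the evidence fields are hypothetical and are not taken from the repository.

```python
import json

def build_explanation_prompt(query: str, evidence: list[dict]) -> str:
    """Assemble a prompt that restricts the model to the supplied evidence."""
    return "\n".join([
        "You are a football scouting analyst.",
        "Explain the ranking below in the language of the user's query.",
        "Use only the fields in EVIDENCE. Do not add attributes that are not listed.",
        "Do not change the order of the players.",
        f"QUERY: {query}",
        "EVIDENCE: " + json.dumps(evidence, ensure_ascii=False),
    ])

# Evidence as it might arrive from the scoring stage, already ranked.
evidence = [
    {"rank": 1, "name": "Player A", "score": 85.0, "dribbles_pct": 90},
    {"rank": 2, "name": "Player B", "score": 72.0, "dribbles_pct": 90},
]
prompt = build_explanation_prompt("U23 right winger, strong in dribbling", evidence)
print(prompt)
```

The ranking reaches the model as fixed input, so the explanation can describe it but has no channel through which to change it.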
Agent 4 (Visualization / report compositor) generates comparative visual outputs:
- radar charts for selected key metrics,
- per-90 radar comparisons,
- player profile visualizations,
- report-ready image artifacts for Telegram.
Visualizations are generated with Python libraries, not by an LLM.
The Telegram interface uses polling and delegates each message to the LangGraph pipeline.
Run:
docker compose up -d bot
docker compose logs -f bot
The bot:
- reads updates through the Telegram Bot API;
- stores the last processed update ID to avoid repeated messages;
- applies an optional chat allowlist;
- sends back the scouting explanation and generated images.
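The de-duplication step can be sketched as follows. This is an assumption about how persisting the last update ID might work, not the code in `telegram_bot.py`: the function names and the in-memory `updates` list are hypothetical, and no network calls are made. The `update_id` field itself is part of Telegram's Update object.

```python
from pathlib import Path

def load_last_update_id(path: Path) -> int:
    """Return the stored update ID, or 0 if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0

def process_updates(updates: list[dict], path: Path, handle) -> int:
    """Handle only updates newer than the stored ID, persisting progress each time."""
    last_id = load_last_update_id(path)
    for update in sorted(updates, key=lambda u: u["update_id"]):
        if update["update_id"] <= last_id:
            continue  # already processed in a previous run
        handle(update)
        last_id = update["update_id"]
        path.write_text(str(last_id))
    return last_id
```

With the real Bot API, the same stored value is typically passed to `getUpdates` as `offset = last_id + 1`, so that Telegram itself stops returning updates that were already handled.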
Relevant environment variables:
OPENAI_API_KEY=...
TELEGRAM_BOT_TOKEN=...
OPENAI_MODEL_SUPERVISOR=gpt-4o
INDICES_DIR=/data/processed/indices/current
TELEGRAM_ALLOWED_CHAT_IDS=
TELEGRAM_ALLOWED_CHAT_IDS_FILE=
TELEGRAM_ALLOW_ALL=0
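The allowlist variables could be interpreted along the following lines. The semantics here are assumptions rather than documented behavior: chat IDs are taken to be comma-separated integers, `TELEGRAM_ALLOW_ALL=1` is taken to disable the check, and an empty allowlist is taken to deny every chat. `TELEGRAM_ALLOWED_CHAT_IDS_FILE` is not handled in this sketch.

```python
def parse_chat_ids(raw: str) -> set[int]:
    """Parse a comma-separated list of chat IDs, ignoring empty entries."""
    return {int(part) for part in raw.split(",") if part.strip()}

def is_allowed(chat_id: int, env: dict[str, str]) -> bool:
    """Decide whether a chat may use the bot, given environment-style settings."""
    if env.get("TELEGRAM_ALLOW_ALL", "0") == "1":
        return True
    allowed = parse_chat_ids(env.get("TELEGRAM_ALLOWED_CHAT_IDS", ""))
    return chat_id in allowed

# Example settings; in the bot these would come from os.environ.
settings = {"TELEGRAM_ALLOW_ALL": "0", "TELEGRAM_ALLOWED_CHAT_IDS": "12345, -100987"}
print(is_allowed(12345, settings))
print(is_allowed(555, settings))
```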
Example real interaction through Telegram:
Comparative radar chart:
Individual player profile visualizations:
The demo shows how a natural-language scouting query is transformed into:
- a reasoned Top-3 of player candidates;
- a natural-language scouting explanation;
- comparative radar charts;
- individual player profile visualizations.
Check that Chrome or Chromium is available inside the container:
docker compose exec bot sh -lc 'which google-chrome-stable || which google-chrome || which chromium || which chromium-browser'
If no browser is found, rebuild the image.
If the bot reprocesses messages it has already answered, check that /data/last_update_id.txt exists and that the ./data:/data volume is mounted.
Check that:
- processed databases exist under data/processed/merged/;
- FAISS indices and metadata exist under data/processed/indices/current/;
- metadata files contain the fields expected by the filtering and scoring layers.
wsl --shutdown
Then reopen Docker Desktop.
This repository intentionally excludes:
- .env files,
- API keys,
- Telegram bot tokens,
- private chat IDs,
- raw data,
- processed player databases,
- generated FAISS indices.
Use .env.example as a template and keep real credentials local.
This project is released under the MIT License. See LICENSE.
Developed by Jorge Gómez Jerez as a public portfolio release of an end-to-end AI scouting system.
GitHub: github.com/jorgeeegj




