Symptom
Ingestion of a large markdown file appears to dispatch chunks strictly serially. Observed live in kg-api-dev logs:
04:01:27 [Chunk 17] 848 words → 04:01:45 18s extraction
04:01:47 [Chunk 18] 965 words → 04:03:04 77s extraction
04:03:10 [Chunk 19] ...
No 429 errors, no retry-after warnings — just raw Anthropic Sonnet API latency on max_tokens=16384, one chunk at a time. The DB has kg_api.ai_extraction_config.max_concurrent_requests = 8, but raising it changes nothing.
Investigation
Two distinct findings — they look like one bug but they are two.
Finding 1 — Within-job serial chunks is by design
api/app/workers/ingestion_worker.py:518:
for i, chunk in enumerate(chunks, 1):
...
recent_concept_ids = process_chunk(
chunk=chunk,
...
recent_concept_ids=recent_concept_ids, # ← sequential dependency
...
)
The loop is plain for. Each process_chunk returns recent_concept_ids that feed into chunk N+1's call so the LLM can deduplicate against concepts that chunk N just created. Parallelizing naively breaks dedup quality (chunk N+1 extracts duplicates of chunk N's concepts because it never saw them).
This is consistent with 1ce64eee feat(ADR-031): Add service token authorization and worker concurrency, which placed the ThreadPoolExecutor at the job-queue level (multiple ingestions in parallel), not at the chunk level (one ingestion's chunks in parallel). The design assumption is: scale via more jobs, not bigger jobs.
That assumption breaks when the user has one large document. They feel N × ~30-80s wall time with nothing they can tune.
Finding 2 — Per-provider semaphore (ADR-049) is never wired
api/app/lib/rate_limiter.py:199 defines get_provider_semaphore(provider_name). ADR-049 §2 says it gates concurrent API calls per provider — "OpenAI 8, Anthropic 4, Ollama 1".
Pickaxe across all branches finds zero callers, ever:
git log --all -S "get_provider_semaphore(" --pickaxe-regex
# (empty)
So the helper exists, the DB column max_concurrent_requests exists, ADR-049 documents the intent — but no call site ever acquires the semaphore. This is latent: when ≥2 ingestion jobs run concurrently, they currently hit the provider without any coordination, and the SDK's retry budget (8) absorbs the 429s silently. The user sees slow ingestions and no errors, even though the gate they configured isn't doing anything.
This is not a regression from the ADR-206 vocab-migration branch (PR pending) — the call_with_tools facade added there is annealing-only and doesn't touch extract_concepts. The gap predates that work.
Repro
- Run any ingestion:
kg ingest <large_file>. Watch kg-api-dev logs and confirm chunks process serially.
- Run two ingestions back-to-back:
kg ingest a.md & kg ingest b.md. Confirm via httpx log lines that requests to api.anthropic.com overlap without semaphore gating.
grep -rn 'get_provider_semaphore(' api/app/ returns only the definition site, no callers.
Suggested scope
Cheap, high-value
- Wire
get_provider_semaphore into the extract_concepts path. Wrap the SDK call inside the semaphore (with get_provider_semaphore(provider_name): client.messages.create(...)). Restores ADR-049's intended behaviour for concurrent jobs. Pure addition, no architectural change.
- Surface in CLI/UI: the admin tab already shows
max_concurrent_requests from the config row — but the value is currently advisory only. Either wire the semaphore (above) or add a footnote.
Medium, real engineering
- Sliding-window in-job concurrency. Dispatch chunk N+1 while N is still extracting, but pass an outdated
recent_concept_ids snapshot. Trades modest dedup quality (more "duplicate concept" cleanup later) for ~3-5x speedup on large single-file ingestions. Needs an experiment to quantify the dedup hit before committing.
- Per-call
max_tokens budget. Sonnet latency scales with the generation budget, not output size. Dropping max_tokens=16384 → 8192 should ~halve latency at zero quality risk for typical chunks. Could be the cheapest win of all.
Out of scope here
- Switching extraction models (Haiku 4.5 etc.) — separate experiment.
References
- ADR-049 — Rate limiting and concurrency (the design that was partially implemented)
- ADR-031 — Worker concurrency at job-queue level (the design that was implemented)
api/app/lib/rate_limiter.py:199 — orphaned get_provider_semaphore
api/app/workers/ingestion_worker.py:518 — serial chunk loop with recent_concept_ids dependency
kg_api.ai_extraction_config.max_concurrent_requests — DB column read by nothing in the runtime path
Symptom
Ingestion of a large markdown file appears to dispatch chunks strictly serially. Observed live in
kg-api-devlogs:No
429errors, no retry-after warnings — just raw Anthropic Sonnet API latency onmax_tokens=16384, one chunk at a time. The DB haskg_api.ai_extraction_config.max_concurrent_requests = 8, but raising it changes nothing.Investigation
Two distinct findings — they look like one bug but they are two.
Finding 1 — Within-job serial chunks is by design
api/app/workers/ingestion_worker.py:518:The loop is plain
for. Eachprocess_chunkreturnsrecent_concept_idsthat feed into chunk N+1's call so the LLM can deduplicate against concepts that chunk N just created. Parallelizing naively breaks dedup quality (chunk N+1 extracts duplicates of chunk N's concepts because it never saw them).This is consistent with
1ce64eee feat(ADR-031): Add service token authorization and worker concurrency, which placed theThreadPoolExecutorat the job-queue level (multiple ingestions in parallel), not at the chunk level (one ingestion's chunks in parallel). The design assumption is: scale via more jobs, not bigger jobs.That assumption breaks when the user has one large document. They feel
N × ~30-80swall time with nothing they can tune.Finding 2 — Per-provider semaphore (ADR-049) is never wired
api/app/lib/rate_limiter.py:199definesget_provider_semaphore(provider_name). ADR-049 §2 says it gates concurrent API calls per provider — "OpenAI 8, Anthropic 4, Ollama 1".Pickaxe across all branches finds zero callers, ever:
So the helper exists, the DB column
max_concurrent_requestsexists, ADR-049 documents the intent — but no call site ever acquires the semaphore. This is latent: when ≥2 ingestion jobs run concurrently, they currently hit the provider without any coordination, and the SDK's retry budget (8) absorbs the 429s silently. The user sees slow ingestions and no errors, even though the gate they configured isn't doing anything.This is not a regression from the ADR-206 vocab-migration branch (PR pending) — the
call_with_toolsfacade added there is annealing-only and doesn't touchextract_concepts. The gap predates that work.Repro
kg ingest <large_file>. Watchkg-api-devlogs and confirm chunks process serially.kg ingest a.md & kg ingest b.md. Confirm viahttpxlog lines that requests toapi.anthropic.comoverlap without semaphore gating.grep -rn 'get_provider_semaphore(' api/app/returns only the definition site, no callers.Suggested scope
Cheap, high-value
get_provider_semaphoreinto the extract_concepts path. Wrap the SDK call inside the semaphore (with get_provider_semaphore(provider_name): client.messages.create(...)). Restores ADR-049's intended behaviour for concurrent jobs. Pure addition, no architectural change.max_concurrent_requestsfrom the config row — but the value is currently advisory only. Either wire the semaphore (above) or add a footnote.Medium, real engineering
recent_concept_idssnapshot. Trades modest dedup quality (more "duplicate concept" cleanup later) for ~3-5x speedup on large single-file ingestions. Needs an experiment to quantify the dedup hit before committing.max_tokensbudget. Sonnet latency scales with the generation budget, not output size. Droppingmax_tokens=16384→8192should ~halve latency at zero quality risk for typical chunks. Could be the cheapest win of all.Out of scope here
References
api/app/lib/rate_limiter.py:199— orphanedget_provider_semaphoreapi/app/workers/ingestion_worker.py:518— serial chunk loop withrecent_concept_idsdependencykg_api.ai_extraction_config.max_concurrent_requests— DB column read by nothing in the runtime path