Skip to content

Ingestion concurrency: per-provider semaphore (ADR-049) never wired; single-job chunks serial by design #417

@aaronsb

Description

@aaronsb

Symptom

Ingestion of a large markdown file appears to dispatch chunks strictly serially. Observed live in kg-api-dev logs:

04:01:27 [Chunk 17] 848 words   → 04:01:45  18s extraction
04:01:47 [Chunk 18] 965 words   → 04:03:04  77s extraction
04:03:10 [Chunk 19] ...

No 429 errors, no retry-after warnings — just raw Anthropic Sonnet API latency on max_tokens=16384, one chunk at a time. The DB has kg_api.ai_extraction_config.max_concurrent_requests = 8, but raising it changes nothing.

Investigation

Two distinct findings — they look like one bug but they are two.

Finding 1 — Within-job serial chunks is by design

api/app/workers/ingestion_worker.py:518:

for i, chunk in enumerate(chunks, 1):
    ...
    recent_concept_ids = process_chunk(
        chunk=chunk,
        ...
        recent_concept_ids=recent_concept_ids,   # ← sequential dependency
        ...
    )

The loop is plain for. Each process_chunk returns recent_concept_ids that feed into chunk N+1's call so the LLM can deduplicate against concepts that chunk N just created. Parallelizing naively breaks dedup quality (chunk N+1 extracts duplicates of chunk N's concepts because it never saw them).

This is consistent with 1ce64eee feat(ADR-031): Add service token authorization and worker concurrency, which placed the ThreadPoolExecutor at the job-queue level (multiple ingestions in parallel), not at the chunk level (one ingestion's chunks in parallel). The design assumption is: scale via more jobs, not bigger jobs.

That assumption breaks when the user has one large document. They feel N × ~30-80s wall time with nothing they can tune.

Finding 2 — Per-provider semaphore (ADR-049) is never wired

api/app/lib/rate_limiter.py:199 defines get_provider_semaphore(provider_name). ADR-049 §2 says it gates concurrent API calls per provider — "OpenAI 8, Anthropic 4, Ollama 1".

Pickaxe across all branches finds zero callers, ever:

git log --all -S "get_provider_semaphore(" --pickaxe-regex
# (empty)

So the helper exists, the DB column max_concurrent_requests exists, ADR-049 documents the intent — but no call site ever acquires the semaphore. This is latent: when ≥2 ingestion jobs run concurrently, they currently hit the provider without any coordination, and the SDK's retry budget (8) absorbs the 429s silently. The user sees slow ingestions and no errors, even though the gate they configured isn't doing anything.

This is not a regression from the ADR-206 vocab-migration branch (PR pending) — the call_with_tools facade added there is annealing-only and doesn't touch extract_concepts. The gap predates that work.

Repro

  1. Run any ingestion: kg ingest <large_file>. Watch kg-api-dev logs and confirm chunks process serially.
  2. Run two ingestions back-to-back: kg ingest a.md & kg ingest b.md. Confirm via httpx log lines that requests to api.anthropic.com overlap without semaphore gating.
  3. grep -rn 'get_provider_semaphore(' api/app/ returns only the definition site, no callers.

Suggested scope

Cheap, high-value

  • Wire get_provider_semaphore into the extract_concepts path. Wrap the SDK call inside the semaphore (with get_provider_semaphore(provider_name): client.messages.create(...)). Restores ADR-049's intended behaviour for concurrent jobs. Pure addition, no architectural change.
  • Surface in CLI/UI: the admin tab already shows max_concurrent_requests from the config row — but the value is currently advisory only. Either wire the semaphore (above) or add a footnote.

Medium, real engineering

  • Sliding-window in-job concurrency. Dispatch chunk N+1 while N is still extracting, but pass an outdated recent_concept_ids snapshot. Trades modest dedup quality (more "duplicate concept" cleanup later) for ~3-5x speedup on large single-file ingestions. Needs an experiment to quantify the dedup hit before committing.
  • Per-call max_tokens budget. Sonnet latency scales with the generation budget, not output size. Dropping max_tokens=163848192 should ~halve latency at zero quality risk for typical chunks. Could be the cheapest win of all.

Out of scope here

  • Switching extraction models (Haiku 4.5 etc.) — separate experiment.

References

  • ADR-049 — Rate limiting and concurrency (the design that was partially implemented)
  • ADR-031 — Worker concurrency at job-queue level (the design that was implemented)
  • api/app/lib/rate_limiter.py:199 — orphaned get_provider_semaphore
  • api/app/workers/ingestion_worker.py:518 — serial chunk loop with recent_concept_ids dependency
  • kg_api.ai_extraction_config.max_concurrent_requests — DB column read by nothing in the runtime path

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingenhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions