Ingestion concurrency: per-provider semaphore (ADR-049) never wired; single-job chunks serial by design

## Symptom

Ingestion of a large markdown file appears to dispatch chunks strictly serially. Observed live in `kg-api-dev` logs:

```
04:01:27 [Chunk 17] 848 words   → 04:01:45  18s extraction
04:01:47 [Chunk 18] 965 words   → 04:03:04  77s extraction
04:03:10 [Chunk 19] ...
```

No `429` errors, no retry-after warnings — just raw Anthropic Sonnet API latency on `max_tokens=16384`, one chunk at a time. The DB has `kg_api.ai_extraction_config.max_concurrent_requests = 8`, but raising it changes nothing.

## Investigation

Two distinct findings — they look like one bug but they are two.

### Finding 1 — Within-job serial chunks is **by design**

`api/app/workers/ingestion_worker.py:518`:

```python
for i, chunk in enumerate(chunks, 1):
    ...
    recent_concept_ids = process_chunk(
        chunk=chunk,
        ...
        recent_concept_ids=recent_concept_ids,   # ← sequential dependency
        ...
    )
```

The loop is plain `for`. Each `process_chunk` returns `recent_concept_ids` that feed into chunk N+1's call so the LLM can deduplicate against concepts that chunk N just created. Parallelizing naively breaks dedup quality (chunk N+1 extracts duplicates of chunk N's concepts because it never saw them).

This is consistent with `1ce64eee feat(ADR-031): Add service token authorization and worker concurrency`, which placed the `ThreadPoolExecutor` at the **job-queue level** (multiple ingestions in parallel), not at the chunk level (one ingestion's chunks in parallel). The design assumption is: scale via more jobs, not bigger jobs.

That assumption breaks when the user has one large document. They feel `N × ~30-80s` wall time with nothing they can tune.

### Finding 2 — Per-provider semaphore (ADR-049) is never wired

`api/app/lib/rate_limiter.py:199` defines `get_provider_semaphore(provider_name)`. ADR-049 §2 says it gates concurrent API calls per provider — "OpenAI 8, Anthropic 4, Ollama 1".

Pickaxe across all branches finds **zero callers**, ever:

```bash
git log --all -S "get_provider_semaphore(" --pickaxe-regex
# (empty)
```

So the helper exists, the DB column `max_concurrent_requests` exists, ADR-049 documents the intent — but no call site ever acquires the semaphore. This is latent: when ≥2 ingestion jobs run concurrently, they currently hit the provider without any coordination, and the SDK's retry budget (8) absorbs the 429s silently. The user sees slow ingestions and no errors, even though the gate they configured isn't doing anything.

This is **not** a regression from the ADR-206 vocab-migration branch (PR pending) — the `call_with_tools` facade added there is annealing-only and doesn't touch `extract_concepts`. The gap predates that work.

## Repro

1. Run any ingestion: `kg ingest <large_file>`. Watch `kg-api-dev` logs and confirm chunks process serially.
2. Run two ingestions back-to-back: `kg ingest a.md & kg ingest b.md`. Confirm via `httpx` log lines that requests to `api.anthropic.com` overlap without semaphore gating.
3. `grep -rn 'get_provider_semaphore(' api/app/` returns only the definition site, no callers.

## Suggested scope

### Cheap, high-value
- **Wire `get_provider_semaphore` into the extract_concepts path.** Wrap the SDK call inside the semaphore (`with get_provider_semaphore(provider_name): client.messages.create(...)`). Restores ADR-049's intended behaviour for concurrent jobs. Pure addition, no architectural change.
- **Surface in CLI/UI:** the admin tab already shows `max_concurrent_requests` from the config row — but the value is currently advisory only. Either wire the semaphore (above) or add a footnote.

### Medium, real engineering
- **Sliding-window in-job concurrency.** Dispatch chunk N+1 while N is still extracting, but pass an outdated `recent_concept_ids` snapshot. Trades modest dedup quality (more "duplicate concept" cleanup later) for ~3-5x speedup on large single-file ingestions. Needs an experiment to quantify the dedup hit before committing.
- **Per-call `max_tokens` budget.** Sonnet latency scales with the generation budget, not output size. Dropping `max_tokens=16384` → `8192` should ~halve latency at zero quality risk for typical chunks. Could be the cheapest win of all.

### Out of scope here
- Switching extraction models (Haiku 4.5 etc.) — separate experiment.

## References

- ADR-049 — Rate limiting and concurrency (the design that was partially implemented)
- ADR-031 — Worker concurrency at job-queue level (the design that *was* implemented)
- `api/app/lib/rate_limiter.py:199` — orphaned `get_provider_semaphore`
- `api/app/workers/ingestion_worker.py:518` — serial chunk loop with `recent_concept_ids` dependency
- `kg_api.ai_extraction_config.max_concurrent_requests` — DB column read by nothing in the runtime path

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ingestion concurrency: per-provider semaphore (ADR-049) never wired; single-job chunks serial by design #417

Symptom

Investigation

Finding 1 — Within-job serial chunks is by design

Finding 2 — Per-provider semaphore (ADR-049) is never wired

Repro

Suggested scope

Cheap, high-value

Medium, real engineering

Out of scope here

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Ingestion concurrency: per-provider semaphore (ADR-049) never wired; single-job chunks serial by design #417

Description

Symptom

Investigation

Finding 1 — Within-job serial chunks is by design

Finding 2 — Per-provider semaphore (ADR-049) is never wired

Repro

Suggested scope

Cheap, high-value

Medium, real engineering

Out of scope here

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions