diff --git a/DOCS.md b/DOCS.md index 9e1d432..7a75180 100644 --- a/DOCS.md +++ b/DOCS.md @@ -302,12 +302,49 @@ Counters are maintained per level. When a new Level 1 appears, all deeper counte | Build from pages | `TreeDex.from_pages(pages, llm, **opts)` | `await TreeDex.fromPages(pages, llm, opts?)` | From pre-extracted pages | | Build from tree | `TreeDex.from_tree(tree, pages, llm)` | `TreeDex.fromTree(tree, pages, llm)` | From existing tree (no LLM) | | Query | `index.query(q, llm=, agentic=)` | `await index.query(q, {llm?, agentic?})` | Retrieve relevant sections | +| **Multi-index query** | **`TreeDex.query_all(indexes, q, ...)`** | **`await TreeDex.queryAll(indexes, q, ...)`** | **Query multiple indexes simultaneously** | | Save | `index.save(path)` | `await index.save(path)` | Export to JSON | | Load | `TreeDex.load(path, llm)` | `await TreeDex.load(path, llm)` | Import from JSON | | Show tree | `index.show_tree()` | `index.showTree()` | Pretty-print | | Stats | `index.stats()` | `index.stats()` | Return `{total_pages, total_tokens, ...}` | | Find large | `index.find_large_sections(**opts)` | `index.findLargeSections(opts?)` | Nodes exceeding thresholds | +#### `query_all` / `queryAll` Options + +Query multiple TreeDex indexes simultaneously. Indexes are queried in parallel +(Node.js) or sequentially (Python), then results are merged into a single +`MultiQueryResult` with `[Label]` separators between sources. + +| Parameter | Python | Node.js | Default | Description | +|-----------|--------|---------|---------|-------------| +| indexes | `list[TreeDex]` | `TreeDex[]` | required | List of indexes to query | +| question | `str` | `string` | required | The user's question | +| llm | `llm=` | `{ llm? }` | `None` | Shared LLM override; falls back to each index's own LLM | +| agentic | `agentic=` | `{ agentic? }` | `False` | Generate a single answer over the combined context | +| labels | `labels=` | `{ labels? 
}` | `["Document 1", ...]` | Human-readable name per index | + +```python +multi = TreeDex.query_all( + [index_a, index_b, index_c], + "What are the safety guidelines?", + labels=["Manual A", "Manual B", "Manual C"], + agentic=True, +) +print(multi.combined_context) # merged context with [Manual A]/[Manual B] headers +print(multi.answer) # single answer over all sources +print(multi.results[0].pages_str) # pages from Manual A +``` + +```typescript +const multi = await TreeDex.queryAll( + [indexA, indexB, indexC], + "What are the safety guidelines?", + { labels: ["Manual A", "Manual B", "Manual C"], agentic: true }, +); +console.log(multi.combinedContext); +console.log(multi.answer); +``` + #### `from_file` Options | Parameter | Python | Node.js | Default | Description | @@ -331,6 +368,17 @@ Counters are maintained per level. When a new Level 1 appears, all deeper counte | Reasoning | `.reasoning` | `.reasoning` | `str` | LLM's selection explanation | | Answer | `.answer` | `.answer` | `str` | Generated answer (agentic mode only) | +### MultiQueryResult + +Returned by `TreeDex.query_all()` / `TreeDex.queryAll()`. + +| Property | Python | Node.js | Type | Description | +|----------|--------|---------|------|-------------| +| Per-index results | `.results` | `.results` | `list[QueryResult]` | One result per index, in input order | +| Labels | `.labels` | `.labels` | `list[str]` | Human-readable name for each index | +| Combined context | `.combined_context` | `.combinedContext` | `str` | All contexts merged with `[Label]` headers and `---` separators | +| Answer | `.answer` | `.answer` | `str` | Single answer over all sources (agentic only) | + ### Hierarchy Utilities | Function | Python | Node.js | Description | diff --git a/README.md b/README.md index 45abc8b..5c0abf6 100644 --- a/README.md +++ b/README.md @@ -383,6 +383,7 @@ Use `auto_loader(path)` / `autoLoader(path)` for automatic format detection. 
| Create from tree | `TreeDex.from_tree(tree, pages)` | `TreeDex.fromTree(tree, pages)` | | Query | `index.query(question)` | `await index.query(question)` | | Agentic query | `index.query(q, agentic=True)` | `await index.query(q, { agentic: true })` | +| **Multi-index query** | **`TreeDex.query_all(indexes, q)`** | **`await TreeDex.queryAll(indexes, q)`** | | Save | `index.save(path)` | `await index.save(path)` | | Load | `TreeDex.load(path, llm)` | `await TreeDex.load(path, llm)` | | Show tree | `index.show_tree()` | `index.showTree()` | @@ -400,6 +401,41 @@ Use `auto_loader(path)` / `autoLoader(path)` for automatic format detection. | Reasoning | `.reasoning` | `.reasoning` | LLM's explanation for selection | | Answer | `.answer` | `.answer` | LLM-generated answer (agentic mode only) | +### `MultiQueryResult` + +Returned by `TreeDex.query_all()` / `TreeDex.queryAll()` when querying multiple indexes at once. + +| Property | Python | Node.js | Description | +|----------|--------|---------|-------------| +| Per-index results | `.results` | `.results` | One `QueryResult` per index, in input order | +| Labels | `.labels` | `.labels` | Human-readable name for each index | +| Combined context | `.combined_context` | `.combinedContext` | All contexts merged with `[Label]` headers | +| Answer | `.answer` | `.answer` | Single answer over all sources (agentic only) | + +**Example:** + +```python +multi = TreeDex.query_all( + [index_a, index_b], + "What are the safety guidelines?", + labels=["Manual A", "Manual B"], + agentic=True, +) +print(multi.combined_context) # [Manual A]\n...\n---\n[Manual B]\n... 
+print(multi.answer) # unified answer across both documents +print(multi.results[0].pages_str) # pages matched in Manual A +``` + +```typescript +const multi = await TreeDex.queryAll( + [indexA, indexB], + "What are the safety guidelines?", + { labels: ["Manual A", "Manual B"], agentic: true }, +); +console.log(multi.combinedContext); +console.log(multi.answer); +``` + ### Hierarchy Utilities | Function | Python | Node.js | Description | diff --git a/docs/api.md b/docs/api.md index 64b0b02..d2550d0 100644 --- a/docs/api.md +++ b/docs/api.md @@ -1,7 +1,7 @@ --- layout: default title: API Reference -nav_order: 3 +nav_order: 4 --- # API Reference @@ -123,6 +123,57 @@ const result = await index.query(question, { // Or shorthand: await index.query(question, llm) ``` +#### `query_all` / `queryAll` _(static)_ + +Query **multiple indexes simultaneously** and merge results into a single +`MultiQueryResult`. All indexes are queried in parallel (Node.js) or +sequentially (Python). Each context gets a `[Label]` header (default `[Document N]`), +with `---` between sources, so downstream LLMs or users can tell them apart. 
+ +```python +multi = TreeDex.query_all( + indexes: list[TreeDex], + question: str, + llm=None, # Shared LLM override (falls back to each index's LLM) + agentic: bool = False, # Generate one answer over the combined context + labels: list[str] | None = None # Human-readable names (default: "Document 1", "Document 2", …) +) -> MultiQueryResult +``` + +```typescript +const multi = await TreeDex.queryAll(indexes, question, { + llm?, // Shared LLM override + agentic?, // Generate one answer over combined context + labels?, // Human-readable names per index +}); +``` + +**Example:** + +```python +multi = TreeDex.query_all( + [index_a, index_b, index_c], + "What are the safety guidelines?", + llm=llm, + labels=["Manual A", "Manual B", "Manual C"], + agentic=True, +) +print(multi.combined_context) # merged text with [Manual A] / [Manual B] headers +print(multi.answer) # single LLM-generated answer over all sources +print(multi.results[0].pages_str) # pages matched in Manual A +``` + +```typescript +const multi = await TreeDex.queryAll( + [indexA, indexB, indexC], + "What are the safety guidelines?", + { llm, labels: ["Manual A", "Manual B", "Manual C"], agentic: true }, +); +console.log(multi.combinedContext); +console.log(multi.answer); +console.log(multi.results[0].pagesStr); +``` + #### `save` Export the index to a JSON file. @@ -198,6 +249,33 @@ Returned by `index.query()`. --- +## MultiQueryResult + +Returned by `TreeDex.query_all()` / `TreeDex.queryAll()`. 
+ +| Property | Python | Node.js | Type | Description | +|----------|--------|---------|------|-------------| +| Per-index results | `.results` | `.results` | `list[QueryResult]` | One `QueryResult` per index, in input order | +| Labels | `.labels` | `.labels` | `list[str]` | Human-readable name for each index | +| Combined context | `.combined_context` | `.combinedContext` | `str` | All contexts merged with `[Label]` headers and `---` separators | +| Answer | `.answer` | `.answer` | `str` | Single LLM-generated answer over all sources (agentic mode only) | + +**Combined context format:** + +``` +[Manual A] +[Section: Safety Guidelines] +Content from Manual A... + +--- + +[Manual B] +[Section: Hazard Procedures] +Content from Manual B... +``` + +--- + ## PDF Parser Functions ### `extract_toc` / `extractToc` diff --git a/docs/benchmark-report.md b/docs/benchmark-report.md index 99ad619..72be0a7 100644 --- a/docs/benchmark-report.md +++ b/docs/benchmark-report.md @@ -1,7 +1,7 @@ --- layout: default title: Benchmark Report -nav_order: 8 +nav_order: 9 --- # Benchmark Report: TreeDex vs Vector RAG diff --git a/docs/benchmarks.md b/docs/benchmarks.md index b899b87..9c98c58 100644 --- a/docs/benchmarks.md +++ b/docs/benchmarks.md @@ -1,7 +1,7 @@ --- layout: default title: Benchmarks -nav_order: 5 +nav_order: 6 --- # Benchmarks diff --git a/docs/case-studies.md b/docs/case-studies.md index ca775f8..0255412 100644 --- a/docs/case-studies.md +++ b/docs/case-studies.md @@ -1,7 +1,7 @@ --- layout: default title: Case Studies -nav_order: 6 +nav_order: 7 --- # Case Studies diff --git a/docs/configuration.md b/docs/configuration.md index 96d7da0..061013b 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1,7 +1,7 @@ --- layout: default title: Configuration -nav_order: 7 +nav_order: 8 --- # Configuration & Tuning diff --git a/docs/how-it-works.md b/docs/how-it-works.md new file mode 100644 index 0000000..6520335 --- /dev/null +++ b/docs/how-it-works.md @@ -0,0 
+1,182 @@ +--- +layout: default +title: How It Works +nav_order: 3 +--- + +# How It Works + +A step-by-step walkthrough of what TreeDex does from the moment you load a document to the moment you get an answer. + +--- + +## Phase 1 — Loading + +```python +index = TreeDex.from_file("document.pdf", llm) +``` + +TreeDex reads the document and converts it into a flat list of **pages**, each with: +- `page_num` — position in the document +- `text` — extracted text content +- `token_count` — pre-computed token length +- `images` *(optional)* — base64-encoded images if `extract_images=True` + +Supported formats: **PDF, TXT, HTML, DOCX**. Use `auto_loader` / `autoLoader` to detect format automatically. + +--- + +## Phase 2 — Structure Detection + +TreeDex tries three strategies, in order: + +### Strategy A — PDF Bookmarks (zero LLM calls) +If the PDF has native bookmarks/outline, the tree is built directly from them. This is the fastest and most accurate path — no LLM is needed at all. + +### Strategy B — Font-Size Heading Detection +If no bookmarks exist, TreeDex analyzes font sizes across up to 50 pages. It identifies the body text size (the most frequent) and maps larger sizes to `[H1]`, `[H2]`, `[H3]` markers, which are injected into the page text before the LLM sees it. + +```text +[H1] Chapter 1: Introduction +This chapter covers the background... + +[H2] 1.1 Motivation +The problem we are solving is... +``` + +### Strategy C — LLM Structure Extraction +Pages are grouped into token-capped chunks. The LLM is given each chunk and asked to extract a hierarchical structure — section titles, their nesting level, and their page positions. + +For large documents, instead of passing the full growing section list back to the LLM on each chunk, TreeDex sends a **capped summary** (top-level sections + last 30 sections), preventing prompt bloat. 
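The capped summary described for Strategy C can be pictured with a small sketch. This is an illustration only, not TreeDex's actual implementation: the helper name `cap_section_summary`, the `tail` parameter, and the sample data are hypothetical, and it assumes each section is a dict whose `structure` field uses dot notation, so a top-level section is one with no dot in its ID.

```python
def cap_section_summary(sections, tail=30):
    """Keep every top-level section plus the last `tail` sections, in original order.

    Hypothetical helper for illustration; TreeDex's real logic may differ.
    """
    tail_start = max(len(sections) - tail, 0)
    capped = []
    for i, section in enumerate(sections):
        is_top_level = "." not in section["structure"]
        # A section survives if it is a chapter-level entry or falls in the tail window.
        if is_top_level or i >= tail_start:
            capped.append(section)
    return capped


# Made-up data: 3 chapters with 20 subsections each (63 sections in total).
sections = []
for chapter in range(1, 4):
    sections.append({"structure": str(chapter), "title": f"Chapter {chapter}"})
    for sub in range(1, 21):
        sections.append(
            {"structure": f"{chapter}.{sub}", "title": f"Section {chapter}.{sub}"}
        )

summary = cap_section_summary(sections)
print(len(sections), "->", len(summary))  # 63 -> 32
```

With this shape the summary grows with the number of top-level sections rather than with the total section count, which is the property that keeps the prompt small for long documents.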
+ +--- + +## Phase 3 — Tree Construction + +The raw section list from the LLM is a flat list like: + +```json +[ + { "structure": "1", "title": "Introduction", "physical_index": 0 }, + { "structure": "1.1", "title": "Background", "physical_index": 1 }, + { "structure": "2", "title": "Methods", "physical_index": 4 } +] +``` + +TreeDex then: + +1. **Repairs orphans** — if `"2.3.1"` exists but `"2.3"` doesn't, a synthetic `"2.3"` parent is inserted automatically. +2. **Builds the tree** — dot-notation IDs (`"1.2.3"`) are converted into a nested `TreeNode[]` hierarchy using a hash map. +3. **Assigns page ranges** — each node gets a `start_index` and `end_index` covering all pages it spans. +4. **Assigns node IDs** — each node gets a short unique ID (`"0001"`, `"0002"`, …) used during retrieval. +5. **Embeds text** — actual page content is attached to each node based on its page range. + +--- + +## Phase 4 — Querying + +```python +result = index.query("What are the safety guidelines?") +``` + +### Step 1 — Strip content from the tree +A deep clone of the tree is made with **all `text` fields removed**. Only titles, structure IDs, and page ranges remain. This is the "skeleton". + +### Step 2 — LLM navigates the skeleton +The skeleton (typically 1–2k tokens even for 300-page docs) is sent to the LLM along with your question: + +```text +You are a document retrieval system. Given this tree structure, pick +the node_ids most relevant to the query. Return JSON only. + +{ "node_ids": ["0005"], "reasoning": "Safety section covers this." } +``` + +The LLM reasons over **titles and hierarchy only** — it never sees page content at this stage. + +### Step 3 — Content fetched via hash map +Selected `node_ids` are looked up in a pre-built `O(1)` hash map. The actual page text for those nodes is retrieved and formatted: + +```text +[2: Safety Guidelines] +Section 2 covers personal protective equipment... 
+``` + +### Step 4 — Return result +A `QueryResult` is returned with: +- `.context` — the retrieved text +- `.node_ids` — which nodes were selected +- `.pages_str` — e.g. `"pages 5-9"` +- `.reasoning` — the LLM's explanation + +--- + +## Phase 5 — Agentic Mode (optional) + +```python +result = index.query("What are the safety guidelines?", agentic=True) +``` + +A **second LLM call** is made — this time passing the retrieved context and asking for a direct answer. Without agentic mode you get the raw context and control the final step yourself. + +| Mode | LLM calls | You get | +|---|---|---| +| Default | 1 | Raw context + page references | +| `agentic=True` | 2 | Raw context + a generated answer | + +--- + +## Multi-Index Querying + +```python +multi = TreeDex.query_all([index_a, index_b], "question", labels=["Doc A", "Doc B"]) +``` + +Each index is queried independently. Results are merged into a single `MultiQueryResult` with clear `[Doc A]` / `[Doc B]` headers separating each source's context. Optionally, a single agentic answer is generated over the combined context. + +--- + +## Full Flow Diagram + +```text +Document (PDF / TXT / HTML / DOCX) + │ + ▼ +[ Loader ] ──────────────────────────────→ Pages[] + │ + ├── PDF bookmarks found? 
+ │ YES → tocToSections() (0 LLM calls) + │ NO → font-size heading detection + │ → chunked LLM structure extraction + │ + ▼ +repairOrphans() → listToTree() + │ + ▼ +assignPageRanges() → assignNodeIds() → embedTextInTree() + │ + ▼ + TreeDex index ready (save/load as JSON) + +──────────────────────────── QUERY TIME ──────────────────────────── + +index.query("your question") + │ + ▼ +[1] stripTextFromTree() ← skeleton only, no content + │ + ▼ +[2] retrievalPrompt → LLM ← 1st LLM call + │ returns: { node_ids, reasoning } + ▼ +[3] extractJson() ← robust parse + │ + ▼ +[4] collectNodeTexts(_nodeMap) ← O(1) hash lookup + │ + ▼ +[5] answerPrompt → LLM ← 2nd LLM call (agentic only) + │ + ▼ +QueryResult { context, nodeIds, pageRanges, reasoning, answer } +``` diff --git a/docs/llm-backends.md b/docs/llm-backends.md index 910912e..7b63175 100644 --- a/docs/llm-backends.md +++ b/docs/llm-backends.md @@ -1,7 +1,7 @@ --- layout: default title: LLM Backends -nav_order: 4 +nav_order: 5 --- # LLM Backends diff --git a/examples/multi_index.py b/examples/multi_index.py new file mode 100644 index 0000000..17eaacd --- /dev/null +++ b/examples/multi_index.py @@ -0,0 +1,59 @@ +"""Example: Query multiple TreeDex indexes simultaneously. + +Demonstrates building separate indexes for multiple documents and querying +them all at once with TreeDex.query_all(). +""" + +import os + +from treedex import TreeDex, GeminiLLM + + +def main(): + llm = GeminiLLM(api_key=os.environ["GEMINI_API_KEY"]) + + # --- Build indexes for multiple documents --- + print("Building indexes...") + index_a = TreeDex.from_file("manual_a.pdf", llm=llm) + index_b = TreeDex.from_file("manual_b.pdf", llm=llm) + index_c = TreeDex.from_file("manual_c.pdf", llm=llm) + + # Optionally save and reload later + index_a.save("manual_a.json") + index_b.save("manual_b.json") + index_c.save("manual_c.json") + + # --- Query all indexes at once --- + question = "What are the safety guidelines?" 
+ + print(f"\nQuerying all indexes: '{question}'") + multi = TreeDex.query_all( + [index_a, index_b, index_c], + question, + labels=["Manual A", "Manual B", "Manual C"], + ) + + # Inspect per-index results + for i, result in enumerate(multi.results): + print(f"\n[{multi.labels[i]}]") + print(f" Nodes: {result.node_ids}") + print(f" Pages: {result.pages_str}") + print(f" Reason: {result.reasoning}") + + # Combined context with [Manual A] / [Manual B] / [Manual C] headers + print("\n--- Combined Context ---") + print(multi.combined_context[:500], "...") + + # --- Agentic mode: one answer across all documents --- + print("\nQuerying with agentic answer...") + multi_agentic = TreeDex.query_all( + [index_a, index_b, index_c], + question, + labels=["Manual A", "Manual B", "Manual C"], + agentic=True, + ) + print(f"\nAnswer:\n{multi_agentic.answer}") + + +if __name__ == "__main__": + main() diff --git a/examples/node/multi-index.ts b/examples/node/multi-index.ts new file mode 100644 index 0000000..371ace3 --- /dev/null +++ b/examples/node/multi-index.ts @@ -0,0 +1,60 @@ +/** + * Example: Query multiple TreeDex indexes simultaneously (Node.js) + * + * Demonstrates building separate indexes for multiple documents and querying + * them all at once with TreeDex.queryAll(). 
+ */ + +import { TreeDex, GeminiLLM } from "../../src/index.js"; + +async function main() { + const apiKey = process.env.GEMINI_API_KEY; + if (!apiKey) { + throw new Error("Missing GEMINI_API_KEY environment variable."); + } + const llm = new GeminiLLM(apiKey); + + // --- Build indexes for multiple documents --- + console.log("Building indexes..."); + const indexA = await TreeDex.fromFile("manual_a.pdf", llm); + const indexB = await TreeDex.fromFile("manual_b.pdf", llm); + const indexC = await TreeDex.fromFile("manual_c.pdf", llm); + + // Optionally save and reload later + await indexA.save("manual_a.json"); + await indexB.save("manual_b.json"); + await indexC.save("manual_c.json"); + + // --- Query all indexes at once --- + const question = "What are the safety guidelines?"; + + console.log(`\nQuerying all indexes: '${question}'`); + const multi = await TreeDex.queryAll( + [indexA, indexB, indexC], + question, + { labels: ["Manual A", "Manual B", "Manual C"] }, + ); + + // Inspect per-index results + multi.results.forEach((result, i) => { + console.log(`\n[${multi.labels[i]}]`); + console.log(` Nodes: ${JSON.stringify(result.nodeIds)}`); + console.log(` Pages: ${result.pagesStr}`); + console.log(` Reason: ${result.reasoning}`); + }); + + // Combined context with [Manual A] / [Manual B] / [Manual C] headers + console.log("\n--- Combined Context ---"); + console.log(multi.combinedContext.slice(0, 500), "..."); + + // --- Agentic mode: one answer across all documents --- + console.log("\nQuerying with agentic answer..."); + const multiAgentic = await TreeDex.queryAll( + [indexA, indexB, indexC], + question, + { labels: ["Manual A", "Manual B", "Manual C"], agentic: true }, + ); + console.log(`\nAnswer:\n${multiAgentic.answer}`); +} + +main().catch(console.error); diff --git a/package-lock.json b/package-lock.json index c2cab27..e791960 100644 --- a/package-lock.json +++ b/package-lock.json @@ -1,12 +1,12 @@ { "name": "treedex", - "version": "0.1.4", + "version": 
"0.1.5", "lockfileVersion": 3, "requires": true, "packages": { "": { "name": "treedex", - "version": "0.1.4", + "version": "0.1.5", "license": "MIT", "dependencies": { "gpt-tokenizer": "^2.8.0", diff --git a/src/core.ts b/src/core.ts index a2c9953..5c60b0e 100644 --- a/src/core.ts +++ b/src/core.ts @@ -142,6 +142,37 @@ export class QueryResult { } } +/** Result of querying multiple TreeDex indexes simultaneously. */ +export class MultiQueryResult { + /** Per-index results in the same order as the input indexes. */ + readonly results: QueryResult[]; + /** Labels for each index (e.g. filenames or custom names). */ + readonly labels: string[]; + /** All per-index contexts merged with [Document N] separators. */ + readonly combinedContext: string; + /** Optional agentic answer generated over the combined context. */ + readonly answer: string; + + constructor( + results: QueryResult[], + labels: string[], + combinedContext: string, + answer: string = "", + ) { + this.results = results; + this.labels = labels; + this.combinedContext = combinedContext; + this.answer = answer; + } + + toString(): string { + const parts = this.results.map( + (r, i) => `[${this.labels[i]}] ${r.toString()}`, + ); + return `MultiQueryResult(\n ${parts.join(",\n ")}\n)`; + } +} + /** Tree-based document index for RAG retrieval. */ export class TreeDex { readonly tree: TreeNode[]; @@ -471,4 +502,78 @@ export class TreeDex { pages: this.pages, }); } + + /** + * Query multiple TreeDex indexes simultaneously and merge results. + * + * All indexes are queried in parallel. Results are combined into a single + * `MultiQueryResult` with a `combinedContext` that labels each source. + * + * @param indexes - Array of TreeDex instances to query. + * @param question - The question to ask each index. + * @param options - Shared LLM, optional per-index labels, and agentic mode. 
+ * + * @example + * const multi = await TreeDex.queryAll( + * [index1, index2, index3], + * "What are the safety guidelines?", + * { llm, labels: ["Manual A", "Manual B", "Manual C"], agentic: true }, + * ); + * console.log(multi.combinedContext); + * console.log(multi.answer); + */ + static async queryAll( + indexes: TreeDex[], + question: string, + options?: { + llm?: BaseLLM; + agentic?: boolean; + labels?: string[]; + }, + ): Promise<MultiQueryResult> { + if (indexes.length === 0) { + throw new Error("queryAll requires at least one index."); + } + + const llm = options?.llm; + const agentic = options?.agentic ?? false; + const labels = options?.labels ?? indexes.map((_, i) => `Document ${i + 1}`); + + if (labels.length !== indexes.length) { + throw new Error("labels length must match indexes length."); + } + + if (!llm && indexes.some((idx) => idx.llm === null)) { + throw new Error( + "No LLM provided for one or more indexes. Pass options.llm or set llm on every TreeDex.", + ); + } + + // Query all indexes in parallel (no agentic per-index; answer at the end) + const results = await Promise.all( + indexes.map((idx) => idx.query(question, { llm, agentic: false })), + ); + + // Merge contexts with clear document separators + const sections: string[] = []; + for (let i = 0; i < results.length; i++) { + const ctx = results[i].context.trim(); + if (ctx.length > 0) { + sections.push(`[${labels[i]}]\n${ctx}`); + } + } + const combinedContext = sections.join("\n\n---\n\n"); + + // Optional: generate a single answer over all combined context + let answer = ""; + if (agentic && combinedContext.length > 0) { + const activeLlm = llm ?? indexes.find((idx) => idx.llm)?.llm ?? 
null; + if (!activeLlm) { + throw new Error("No LLM provided for agentic mode in queryAll."); + } + answer = await activeLlm.generate(answerPrompt(combinedContext, question)); + } + + return new MultiQueryResult(results, labels, combinedContext, answer); + } } diff --git a/src/index.ts b/src/index.ts index 1d01770..1e3d0e4 100644 --- a/src/index.ts +++ b/src/index.ts @@ -1,6 +1,6 @@ /** TreeDex: Tree-based document RAG framework. */ -export { TreeDex, QueryResult } from "./core.js"; +export { TreeDex, QueryResult, MultiQueryResult } from "./core.js"; export { PDFLoader, TextLoader, diff --git a/test/core.test.ts b/test/core.test.ts index 068b641..20578de 100644 --- a/test/core.test.ts +++ b/test/core.test.ts @@ -2,7 +2,7 @@ import { describe, it, expect, vi } from "vitest"; import { writeFile, mkdir, rm, readFile } from "node:fs/promises"; import { join } from "node:path"; import { tmpdir } from "node:os"; -import { TreeDex, QueryResult } from "../src/core.js"; +import { TreeDex, QueryResult, MultiQueryResult } from "../src/core.js"; import { FunctionLLM } from "../src/llm-backends.js"; import { listToTree, @@ -225,3 +225,125 @@ describe("TreeDex.findLargeSections", () => { expect(large.length).toBe(0); }); }); + +describe("MultiQueryResult", () => { + it("should store results and labels", () => { + const r1 = new QueryResult("ctx1", ["0001"], [[0, 1]], "reason1"); + const r2 = new QueryResult("ctx2", ["0002"], [[2, 3]], "reason2"); + const multi = new MultiQueryResult( + [r1, r2], + ["Doc A", "Doc B"], + "[Doc A]\nctx1\n\n---\n\n[Doc B]\nctx2", + ); + expect(multi.results).toHaveLength(2); + expect(multi.labels).toEqual(["Doc A", "Doc B"]); + expect(multi.combinedContext).toContain("[Doc A]"); + expect(multi.combinedContext).toContain("[Doc B]"); + expect(multi.answer).toBe(""); + }); + + it("should store agentic answer", () => { + const r = new QueryResult("ctx", ["0001"], [[0, 0]], "reason"); + const multi = new MultiQueryResult([r], ["Doc A"], "[Doc A]\nctx", 
"My answer"); + expect(multi.answer).toBe("My answer"); + }); + + it("should have toString", () => { + const r = new QueryResult("ctx", ["0001"], [[0, 0]], "reason"); + const multi = new MultiQueryResult([r], ["Doc A"], "[Doc A]\nctx"); + expect(multi.toString()).toContain("Doc A"); + }); +}); + +describe("TreeDex.queryAll", () => { + it("should query multiple indexes and merge contexts", async () => { + const llm = makeMockLlm(); + const { tree, pages } = makeTreeAndPages(); + const index1 = TreeDex.fromTree(tree, pages, llm); + const index2 = TreeDex.fromTree(tree, pages, llm); + + const multi = await TreeDex.queryAll([index1, index2], "What methods were used?"); + + expect(multi.results).toHaveLength(2); + expect(multi.labels).toEqual(["Document 1", "Document 2"]); + expect(multi.combinedContext).toContain("[Document 1]"); + expect(multi.combinedContext).toContain("[Document 2]"); + expect(multi.combinedContext).toContain("---"); + }); + + it("should use custom labels", async () => { + const llm = makeMockLlm(); + const { tree, pages } = makeTreeAndPages(); + const index1 = TreeDex.fromTree(tree, pages, llm); + const index2 = TreeDex.fromTree(tree, pages, llm); + + const multi = await TreeDex.queryAll( + [index1, index2], + "question", + { labels: ["Manual A", "Manual B"] }, + ); + + expect(multi.labels).toEqual(["Manual A", "Manual B"]); + expect(multi.combinedContext).toContain("[Manual A]"); + expect(multi.combinedContext).toContain("[Manual B]"); + }); + + it("should use shared override LLM", async () => { + const { tree, pages } = makeTreeAndPages(); + const index1 = TreeDex.fromTree(tree, pages); // no llm + const index2 = TreeDex.fromTree(tree, pages); // no llm + const llm = makeMockLlm(); + + const multi = await TreeDex.queryAll([index1, index2], "question", { llm }); + + expect(multi.results).toHaveLength(2); + expect(multi.combinedContext.length).toBeGreaterThan(0); + }); + + it("should generate agentic answer over combined context", async () => { + 
const baseLlm = makeMockLlm(); + const agenticLlm = new FunctionLLM((prompt: string) => { + if (prompt.includes("retrieval system")) { + return JSON.stringify({ node_ids: ["0001"], reasoning: "relevant" }); + } + return "The combined answer from all documents."; + }); + + const { tree, pages } = makeTreeAndPages(); + const index1 = TreeDex.fromTree(tree, pages, agenticLlm); + + const multi = await TreeDex.queryAll( + [index1], + "What is the answer?", + { agentic: true }, + ); + + expect(multi.answer).toBe("The combined answer from all documents."); + }); + + it("should throw for empty indexes array", async () => { + await expect(TreeDex.queryAll([], "question")).rejects.toThrow( + "queryAll requires at least one index", + ); + }); + + it("should throw when labels length mismatches", async () => { + const llm = makeMockLlm(); + const { tree, pages } = makeTreeAndPages(); + const index = TreeDex.fromTree(tree, pages, llm); + + await expect( + TreeDex.queryAll([index], "question", { labels: ["A", "B"] }), + ).rejects.toThrow("labels length must match"); + }); + + it("should have empty answer when not agentic", async () => { + const llm = makeMockLlm(); + const { tree, pages } = makeTreeAndPages(); + const index = TreeDex.fromTree(tree, pages, llm); + + const multi = await TreeDex.queryAll([index], "question"); + + expect(multi.answer).toBe(""); + }); +}); diff --git a/tests/test_core.py b/tests/test_core.py index ac55ce6..4c2d6ae 100644 --- a/tests/test_core.py +++ b/tests/test_core.py @@ -6,7 +6,7 @@ import pytest -from treedex.core import TreeDex, QueryResult +from treedex.core import TreeDex, QueryResult, MultiQueryResult from treedex.llm_backends import FunctionLLM from treedex.tree_builder import list_to_tree, assign_page_ranges, assign_node_ids, embed_text_in_tree @@ -182,3 +182,107 @@ def test_none_large(self): index = TreeDex.from_tree(tree, pages) large = index.find_large_sections(max_pages=100) assert len(large) == 0 + + +class TestMultiQueryResult: + def 
test_stores_results_and_labels(self): + r1 = QueryResult("ctx1", ["0001"], [(0, 1)], "reason1") + r2 = QueryResult("ctx2", ["0002"], [(2, 3)], "reason2") + multi = MultiQueryResult( + [r1, r2], + ["Doc A", "Doc B"], + "[Doc A]\nctx1\n\n---\n\n[Doc B]\nctx2", + ) + assert len(multi.results) == 2 + assert multi.labels == ["Doc A", "Doc B"] + assert "[Doc A]" in multi.combined_context + assert "[Doc B]" in multi.combined_context + assert multi.answer == "" + + def test_stores_agentic_answer(self): + r = QueryResult("ctx", ["0001"], [(0, 0)], "reason") + multi = MultiQueryResult([r], ["Doc A"], "[Doc A]\nctx", "My answer") + assert multi.answer == "My answer" + + def test_repr(self): + r = QueryResult("ctx", ["0001"], [(0, 0)], "reason") + multi = MultiQueryResult([r], ["Doc A"], "[Doc A]\nctx") + assert "Doc A" in repr(multi) + + +class TestQueryAll: + def test_queries_multiple_indexes(self): + llm = _make_mock_llm() + tree, pages = _make_tree_and_pages() + index1 = TreeDex.from_tree(tree, pages, llm) + index2 = TreeDex.from_tree(tree, pages, llm) + + multi = TreeDex.query_all([index1, index2], "What methods were used?") + + assert len(multi.results) == 2 + assert multi.labels == ["Document 1", "Document 2"] + assert "[Document 1]" in multi.combined_context + assert "[Document 2]" in multi.combined_context + assert "---" in multi.combined_context + + def test_custom_labels(self): + llm = _make_mock_llm() + tree, pages = _make_tree_and_pages() + index1 = TreeDex.from_tree(tree, pages, llm) + index2 = TreeDex.from_tree(tree, pages, llm) + + multi = TreeDex.query_all( + [index1, index2], + "question", + labels=["Manual A", "Manual B"], + ) + + assert multi.labels == ["Manual A", "Manual B"] + assert "[Manual A]" in multi.combined_context + assert "[Manual B]" in multi.combined_context + + def test_shared_override_llm(self): + tree, pages = _make_tree_and_pages() + index1 = TreeDex.from_tree(tree, pages) # no llm + index2 = TreeDex.from_tree(tree, pages) # no llm + llm = 
_make_mock_llm() + + multi = TreeDex.query_all([index1, index2], "question", llm=llm) + + assert len(multi.results) == 2 + assert len(multi.combined_context) > 0 + + def test_agentic_answer(self): + def agentic_generate(prompt: str) -> str: + if "retrieval system" in prompt: + return json.dumps({"node_ids": ["0001"], "reasoning": "relevant"}) + return "The combined answer from all documents." + + llm = FunctionLLM(agentic_generate) + tree, pages = _make_tree_and_pages() + index = TreeDex.from_tree(tree, pages, llm) + + multi = TreeDex.query_all([index], "What is the answer?", agentic=True) + + assert multi.answer == "The combined answer from all documents." + + def test_raises_for_empty_indexes(self): + with pytest.raises(ValueError, match="at least one index"): + TreeDex.query_all([], "question") + + def test_raises_for_label_mismatch(self): + llm = _make_mock_llm() + tree, pages = _make_tree_and_pages() + index = TreeDex.from_tree(tree, pages, llm) + + with pytest.raises(ValueError, match="labels length must match"): + TreeDex.query_all([index], "question", labels=["A", "B"]) + + def test_no_answer_when_not_agentic(self): + llm = _make_mock_llm() + tree, pages = _make_tree_and_pages() + index = TreeDex.from_tree(tree, pages, llm) + + multi = TreeDex.query_all([index], "question") + + assert multi.answer == "" diff --git a/treedex/__init__.py b/treedex/__init__.py index 83c0c34..b2137be 100644 --- a/treedex/__init__.py +++ b/treedex/__init__.py @@ -1,6 +1,6 @@ """TreeDex: Tree-based document RAG framework.""" -from treedex.core import TreeDex, QueryResult +from treedex.core import TreeDex, QueryResult, MultiQueryResult from treedex.loaders import PDFLoader, TextLoader, HTMLLoader, DOCXLoader, auto_loader from treedex.llm_backends import ( BaseLLM, @@ -29,6 +29,7 @@ # Core "TreeDex", "QueryResult", + "MultiQueryResult", # Loaders "PDFLoader", "TextLoader", diff --git a/treedex/core.py b/treedex/core.py index 56b8567..cece8f4 100644 --- a/treedex/core.py +++ 
b/treedex/core.py @@ -123,6 +123,35 @@ def __repr__(self): ) +class MultiQueryResult: + """Result of querying multiple TreeDex indexes in a single call. + + Attributes: + results: Per-index QueryResult objects in the same order as the + input indexes. + labels: Human-readable label for each index (e.g. filenames). + combined_context: All non-empty per-index contexts, each under a + ``[Label]`` header and joined by ``---`` separators. + answer: Optional agentic answer generated over the combined context. + """ + + def __init__( + self, + results: list[QueryResult], + labels: list[str], + combined_context: str, + answer: str = "", + ): + self.results = results + self.labels = labels + self.combined_context = combined_context + self.answer = answer + + def __repr__(self) -> str: + parts = [f"[{self.labels[i]}] {repr(r)}" for i, r in enumerate(self.results)] + return "MultiQueryResult(\n " + ",\n ".join(parts) + "\n)" + + class TreeDex: """Tree-based document index for RAG retrieval.""" @@ -368,3 +397,86 @@ def find_large_sections(self, max_pages: int = 10, self.tree, max_pages=max_pages, max_tokens=max_tokens, pages=self.pages ) + + @classmethod + def query_all( + cls, + indexes: list["TreeDex"], + question: str, + llm=None, + agentic: bool = False, + labels: list[str] | None = None, + ) -> "MultiQueryResult": + """Query multiple TreeDex indexes in one call and merge results. + + Indexes are queried sequentially and the results are combined into a + single :class:`MultiQueryResult` whose ``combined_context`` labels + each source document clearly. + + Args: + indexes: List of TreeDex instances to query. + question: The question to ask every index. + llm: Shared LLM instance. Falls back to each index's own LLM when + not provided. + agentic: If ``True``, generate a single answer over the combined + context using the LLM after all retrievals are done. + labels: Human-readable names for each index (e.g. filenames). 
+ Defaults to ``"Document 1"``, ``"Document 2"``, … + + Returns: + A :class:`MultiQueryResult` with per-index results, merged + context, and an optional agentic answer. + + Example:: + + multi = TreeDex.query_all( + [index1, index2, index3], + "What are the safety guidelines?", + llm=llm, + labels=["Manual A", "Manual B", "Manual C"], + agentic=True, + ) + print(multi.combined_context) + print(multi.answer) + """ + if not indexes: + raise ValueError("query_all requires at least one index.") + + if labels is None: + labels = [f"Document {i + 1}" for i in range(len(indexes))] + + if len(labels) != len(indexes): + raise ValueError("labels length must match indexes length.") + + if llm is None and any(idx.llm is None for idx in indexes): + raise ValueError( + "No LLM provided for one or more indexes. Pass llm= to " + "query_all() or set llm on every TreeDex instance." + ) + + # Query each index (no per-index agentic; single answer at the end) + results = [ + idx.query(question, llm=llm, agentic=False) + for idx in indexes + ] + + # Merge contexts with clear document separators + sections = [] + for i, result in enumerate(results): + ctx = result.context.strip() + if ctx: + sections.append(f"[{labels[i]}]\n{ctx}") + combined_context = "\n\n---\n\n".join(sections) + + # Optional: generate one answer over the combined context + answer = "" + if agentic and combined_context: + active_llm = llm or next( + (idx.llm for idx in indexes if idx.llm is not None), None + ) + if active_llm is None: + raise ValueError("No LLM provided for agentic mode in query_all.") + prompt = ANSWER_PROMPT.format(context=combined_context, query=question) + answer = active_llm.generate(prompt) + + return MultiQueryResult(results, labels, combined_context, answer)
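
The merge step at the end of `query_all` can be tried in isolation. What follows is a minimal sketch rather than the library's code: `merge_contexts` is a hypothetical helper, not part of TreeDex, that mirrors the default-label, length-check, and separator logic from the hunk above while taking plain strings instead of `QueryResult` objects.

```python
from __future__ import annotations


def merge_contexts(contexts: list[str], labels: list[str] | None = None) -> str:
    """Hypothetical stand-in for the context-merging step of query_all."""
    if not contexts:
        raise ValueError("merge_contexts requires at least one context.")

    # Default labels follow the "Document N" pattern, numbered from 1.
    if labels is None:
        labels = [f"Document {i + 1}" for i in range(len(contexts))]

    if len(labels) != len(contexts):
        raise ValueError("labels length must match contexts length.")

    sections = []
    for label, ctx in zip(labels, contexts):
        ctx = ctx.strip()
        if ctx:  # contexts that are empty after stripping get no section
            sections.append(f"[{label}]\n{ctx}")

    # Sections are joined by a blank line, a "---" rule, and another blank line.
    return "\n\n---\n\n".join(sections)


merged = merge_contexts(["ctx1", "   ", "ctx3"], ["Doc A", "Doc B", "Doc C"])
print(merged)
```

In this sketch the whitespace-only second context produces no `[Doc B]` section. The same filtering in `query_all` means `combined_context` can contain fewer headers than there are entries in `labels` and `results`.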