
feat(memory): add #1538 benchmark fixtures for retrieval scenarios#1561

Open
unn-Known1 wants to merge 7 commits into tinyhumansai:main from unn-Known1:feat/1538-memory-benchmark-fixtures

Conversation

@unn-Known1
Contributor

@unn-Known1 unn-Known1 commented May 12, 2026

Fixes #1538

Adds deterministic memory retrieval benchmark fixtures covering:

Scenario | Test | Description
1 | bench_cross_chat_recall | Verify relevant source retrieval across chat scopes
1b | bench_cross_chat_entity_discoverable | Entities discoverable across chats
2 | bench_citation_bundle_provenance | Retrieval returns chunk_ids + source_ref provenance
2b | bench_citation_fetch_leaves_hydrates | fetch_leaves hydrates exact chunk IDs with content
3 | bench_stale_preference_newer_supersedes | Newer explicit correction coexists with history
4 | bench_contradiction_surfaces_both_with_provenance | Disagreeing sources surface with provenance labels
5 | bench_long_source_retrieves_exact_leaf | Large source returns compact hits, not full dumps
5b | bench_drill_down_isolates_children | Drill-down scope isolation verified
6 | bench_scale_ingest_20_sources_no_real_data | Scale soak: 20 sources, no real PII
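The scale scenario above depends on deterministic, PII-free fixture data. A minimal standalone sketch of how such data can be generated (a hypothetical helper, not the PR's actual code; the generic author names match the ones the fixtures use):

```rust
// Hypothetical fixture generator: same input always yields the same
// synthetic (author, text) pairs, with no real user data involved.
fn synthetic_messages(count: usize) -> Vec<(String, String)> {
    let authors = ["alice", "bob", "carol", "dave", "eve"];
    (0..count)
        .map(|i| {
            // Authors cycle through the five generic fixture names.
            let author = authors[i % authors.len()].to_string();
            // Deterministic text keyed only on the index.
            let text = format!("synthetic update {} about project milestones", i);
            (author, text)
        })
        .collect()
}

fn main() {
    let msgs = synthetic_messages(20);
    assert_eq!(msgs.len(), 20);
    assert_eq!(msgs[0].0, "alice");
    assert_eq!(msgs[5].0, "alice"); // cycles every 5 authors
    println!("ok");
}
```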

Acceptance criteria

  • Deterministic fixture data (no real user data)
  • Benchmark runner: cargo test -p openhuman_core -- retrieval_benchmarks
  • Source provenance asserted (chunk_ids, source_ref)
  • Scale fixture (20 synthetic sources)
  • CI wiring via existing #[cfg(test)] module

Files changed

  • src/openhuman/memory/tree/retrieval/benchmarks.rs — 7 new test scenarios
  • src/openhuman/memory/tree/retrieval/mod.rs — benchmark module wired in

Run tests

cargo test -p openhuman_core -- retrieval_benchmarks

Rust + Tauri validation: GitHub Actions CI

Summary by CodeRabbit

  • Tests
    • Added nine deterministic retrieval scenario tests validating topical recall across chats, entity discovery by canonical identifiers, citation/provenance population for query hits, hydration of requested content, stale-preference semantics, contradiction surfacing with provenance and dates, bounded long-source summaries, channel/scope drill-down isolation, and a scale/soak ingestion check across many synthetic sources.
    • Tests run in isolated temporary workspaces with deterministic timestamps and wait for background jobs to quiesce before assertions.

Review Change Stack

…scenarios

Add deterministic test fixtures covering:
- Cross-chat recall: verify relevant source retrieval across chat scopes
- Citation bundle: provenance chain (chunk_ids, source_ref) on retrieval
- Stale preference: newer explicit correction coexists with history
- Contradiction handling: disagreeing sources surface with provenance
- Long-source compression: large source returns compact hits, not dumps
- Scale soak: 20-source ingest with no real user data

Tests live in src/openhuman/memory/tree/retrieval/benchmarks.rs
and run via: cargo test -p openhuman_core -- retrieval_benchmarks
@unn-Known1 unn-Known1 requested a review from a team May 12, 2026 14:46
@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a test-only benchmarks module with deterministic fixtures and nine tokio tests that seed synthetic chats, ingest them deterministically, drain background jobs, and assert retrieval APIs (query_topic/query_source/query_global/search_entities/fetch_leaves) return provenance, hydration, isolation, staleness, contradiction, and scale properties.

Changes

Memory Retrieval Benchmarks

Layer / File(s) | Summary
Test framework and helpers (src/openhuman/memory/tree/retrieval/benchmarks.rs, src/openhuman/memory/tree/retrieval/mod.rs) | Module docs + registration under #[cfg(test)], bench_config() creating an isolated temp workspace with embeddings disabled, and ingest_chat_batch() to deterministically build/ingest chats with per-message timestamps and drain background jobs.
Retrieval scenario tests (src/openhuman/memory/tree/retrieval/benchmarks.rs) | Nine #[tokio::test] scenarios exercising retrieval APIs: bench_cross_chat_recall, bench_cross_chat_entity_discoverable, bench_citation_bundle_provenance, bench_citation_fetch_leaves_hydrates, bench_stale_preference_newer_supersedes, bench_contradiction_surfaces_both_with_provenance, bench_long_source_retrieves_exact_leaf, bench_drill_down_isolates_children, bench_scale_ingest_20_sources_no_real_data.
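The per-message timestamps the ingest helper assigns can be sketched in isolation. This standalone fragment mirrors the base_ts_millis + i * 60_000 spacing visible in the review diff below (illustrative only; the real helper builds chrono DateTimes from these millisecond values):

```rust
// Deterministic timestamp scheme: each message lands exactly one minute
// after the previous one, starting from a fixed base, so every test run
// produces identical fixture timestamps.
fn message_timestamps(base_ts_millis: i64, count: usize) -> Vec<i64> {
    (0..count as i64)
        .map(|i| base_ts_millis + i * 60_000) // one minute apart
        .collect()
}

fn main() {
    let ts = message_timestamps(1_000, 3);
    assert_eq!(ts, vec![1_000, 61_000, 121_000]);
    println!("ok");
}
```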

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant bench_config
  participant ingest_chat_batch
  participant IngestService
  participant BackgroundJobs
  participant QueryAPI
  participant fetch_leaves
  participant search_entities
  participant Assertions
  Test->>bench_config: create isolated config (embeddings disabled)
  bench_config-->>Test: return deterministic env
  Test->>ingest_chat_batch: build ChatBatch with deterministic timestamps
  ingest_chat_batch->>IngestService: ingest_chat
  IngestService->>BackgroundJobs: schedule background processing
  BackgroundJobs->>ingest_chat_batch: drain until idle
  Test->>QueryAPI: call query_topic/query_source/query_global
  QueryAPI->>fetch_leaves: hydrate requested chunk IDs
  QueryAPI->>search_entities: resolve canonical entities
  QueryAPI-->>Test: return hits with source_ref, node_id, content
  Test->>Assertions: verify provenance, hydration, isolation, staleness, contradiction, scale

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • senamakel

Poem

🐰 In a tidy test-lined glen I poke,
I sow timestamped seeds and watch them soak,
I fetch the leaves, the sources sing,
Cross-chat echoes, provenance ring,
Benchmarks hop — assertions woke!

🚥 Pre-merge checks: ✅ 5 passed
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately describes the main change: adding benchmark fixtures for retrieval scenarios tied to issue #1538.
Linked Issues check | ✅ Passed | The PR implements all coding-related requirements from #1538: five+ benchmark scenarios with deterministic fixtures, pass/fail assertions, source/chunk provenance validation, scale fixture with 20 sources, and CI wiring via #[cfg(test)].
Out of Scope Changes check | ✅ Passed | All changes are scoped to the memory retrieval benchmark module; no unrelated code modifications were introduced.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.





Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/openhuman/memory/tree/retrieval/benchmarks.rs (1)

45-70: ⚡ Quick win

Consider adding debug logging at pipeline boundaries.

While test code doesn't always need extensive logging, these benchmarks exercise the full memory-tree pipeline (ingest → extract → seal → retrieve). Adding log::debug! or tracing::debug! statements at key boundaries (ingest entry, drain completion, query execution) would help troubleshoot failures when scenarios don't meet expectations.

As per coding guidelines: "Add copious debug/trace logs while implementing features; use log/tracing crates at debug/trace level on RPC entry/exit, error paths, state transitions, and hard-to-infer branches."

📊 Example logging additions
 async fn ingest_chat_batch(
     cfg: &Config,
     scope: &str,
     owner: &str,
     messages: Vec<(String, String)>,
     base_ts_millis: i64,
 ) -> Vec<String> {
+    log::debug!("[bench] ingest_chat_batch: scope={}, owner={}, msg_count={}", scope, owner, messages.len());
     let batch = ChatBatch {
         platform: "slack".into(),
         channel_label: scope.into(),
         messages: messages
             .into_iter()
             .enumerate()
             .map(|(i, (author, text))| ChatMessage {
                 author,
                 timestamp: Utc.timestamp_millis_opt(base_ts_millis + (i as i64) * 60_000).unwrap(),
                 text,
                 source_ref: None,
             })
             .collect(),
     };
     let result = ingest_chat(cfg, scope, owner, vec![], batch).await.unwrap();
+    log::debug!("[bench] drain_until_idle: starting");
     drain_until_idle(cfg).await.unwrap();
+    log::debug!("[bench] drain_until_idle: complete, chunk_ids={}", result.chunk_ids.len());
     result.chunk_ids
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/retrieval/benchmarks.rs` around lines 45 - 70, Add
debug-level logging to the ingest_chat_batch pipeline boundaries: log at
function entry (include scope, owner, base_ts_millis and number of messages),
log immediately after awaiting ingest_chat (include result or a brief summary),
and log after drain_until_idle completes before returning chunk_ids. Use
log::debug! or tracing::debug! and reference ingest_chat_batch, ingest_chat, and
drain_until_idle so the logs appear at ingest entry, post-ingest, and post-drain
to aid debugging when benchmarks fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 445-448: The assertion currently uses a logical OR and will pass
if only one date appears; change the check on the combined string content_all
(and its to_lowercase() call) to require both date tokens by using a logical AND
so the test verifies that content_all contains "june" AND contains "jul" (or
"july" if you prefer full month spelling) to ensure both disagreeing sources are
present.

---

Nitpick comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 45-70: Add debug-level logging to the ingest_chat_batch pipeline
boundaries: log at function entry (include scope, owner, base_ts_millis and
number of messages), log immediately after awaiting ingest_chat (include result
or a brief summary), and log after drain_until_idle completes before returning
chunk_ids. Use log::debug! or tracing::debug! and reference ingest_chat_batch,
ingest_chat, and drain_until_idle so the logs appear at ingest entry,
post-ingest, and post-drain to aid debugging when benchmarks fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ef7d8b90-8b10-4d27-93f3-e85a17b11599

📥 Commits

Reviewing files that changed from the base of the PR and between 15c7442 and c28a65c.

📒 Files selected for processing (2)
  • src/openhuman/memory/tree/retrieval/benchmarks.rs
  • src/openhuman/memory/tree/retrieval/mod.rs

Comment thread src/openhuman/memory/tree/retrieval/benchmarks.rs
The platforms.iter().cycle().take().enumerate() destructuring
yields &&str, not &str — convert to owned String before passing
to ingest_chat_batch which expects &str.
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
src/openhuman/memory/tree/retrieval/benchmarks.rs (1)

445-448: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Strengthen contradiction assertion to verify both dates are present.

The current logic checks contains("june") || contains("jul") which passes if only ONE date is present. To properly validate the contradiction scenario (both disagreeing sources surface), verify both dates appear in the combined content.

🔍 Proposed fix
-    assert!(
-        content_all.to_lowercase().contains("june") || content_all.to_lowercase().contains("jul"),
-        "contradiction hits should include both date references"
-    );
+    let content_lower = content_all.to_lowercase();
+    assert!(
+        content_lower.contains("june") && content_lower.contains("jul"),
+        "contradiction hits should include both date references (June 15 and July 30), got: {}",
+        content_all.chars().take(200).collect::<String>()
+    );
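The operator difference behind this fix can be checked in isolation: with only one date present, the || form accepts the content while the && form rejects it. A standalone sketch (not the benchmark code itself):

```rust
// Weak check: passes when EITHER date token is present.
fn has_either(content: &str) -> bool {
    let c = content.to_lowercase();
    c.contains("june") || c.contains("jul")
}

// Strengthened check: requires BOTH disputed dates to surface.
fn has_both(content: &str) -> bool {
    let c = content.to_lowercase();
    c.contains("june") && c.contains("jul")
}

fn main() {
    // Only one date present: the || form passes, the && form fails.
    assert!(has_either("The launch is on June 15."));
    assert!(!has_both("The launch is on June 15."));
    // Both dates present: the strengthened assertion passes too.
    assert!(has_both("June 15 per chat A, July 30 per chat B"));
    println!("ok");
}
```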
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/tree/retrieval/benchmarks.rs` around lines 445 - 448,
The assertion currently allows either date to be present; change it to require
both by asserting that content_all.to_lowercase().contains("june") AND
content_all.to_lowercase().contains("jul") (or "july" if you prefer full month
name) so the contradiction benchmark verifies both disagreeing date references
appear; update the assertion that references content_all accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 445-448: The assertion currently allows either date to be present;
change it to require both by asserting that
content_all.to_lowercase().contains("june") AND
content_all.to_lowercase().contains("jul") (or "july" if you prefer full month
name) so the contradiction benchmark verifies both disagreeing date references
appear; update the assertion that references content_all accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5a36fd38-df37-4b3f-b914-1928628d90d1

📥 Commits

Reviewing files that changed from the base of the PR and between c28a65c and 3f6f9bd.

📒 Files selected for processing (1)
  • src/openhuman/memory/tree/retrieval/benchmarks.rs

CodeRabbit review: June and July must both appear — change || to &&
to ensure the contradiction scenario correctly verifies both sources.
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 12, 2026
Contributor

@graycyrus graycyrus left a comment


Review — PR #1561: feat(memory): add #1538 benchmark fixtures for retrieval scenarios

Walkthrough

This PR adds nine deterministic retrieval benchmark tests for the memory tree pipeline under src/openhuman/memory/tree/retrieval/benchmarks.rs, wired into the module via #[cfg(test)] mod benchmarks in mod.rs. The approach is structurally sound: an isolated TempDir workspace per test, inert embedder for determinism, drain_until_idle to settle the async job queue before asserting, and synthetic fixture data with no real PII. The six retrieval scenarios (cross-chat recall, citation bundle, stale preference, contradiction, long-source compression, scale soak) map well onto the acceptance criteria in #1538.

The main issue is a systematic misuse of query_topic: several tests pass bare keywords like "phoenix" or "alice" where the function contract requires a canonical entity ID (e.g. "topic:phoenix", "email:alice@example.com"). lookup_entity performs an exact WHERE entity_id = ?1 match — bare keywords will never hit anything in the index, causing those tests to assert against empty result sets and trivially pass on the wrong condition. One scenario (query_global in the scale test) also carries a latent false-pass risk. The missing docs/TEST-COVERAGE-MATRIX.md update flagged in #1538 is unaddressed.


Changes

File | Summary
src/openhuman/memory/tree/retrieval/benchmarks.rs | New file: 9 deterministic test scenarios for the retrieval pipeline
src/openhuman/memory/tree/retrieval/mod.rs | Wires mod benchmarks under #[cfg(test)]

Actionable comments (5)

[critical] benchmarks.rs:149: query_topic("phoenix", ...) uses a bare keyword, not a canonical entity ID

query_topic delegates to lookup_entity, which executes WHERE entity_id = ?1 — an exact match against the entity index. The index stores canonical IDs like topic:phoenix and email:alice@example.com, never bare keywords. Calling query_topic(&cfg, "phoenix", ...) will return zero rows from the index, so topic_resp.hits will be empty and the !phoenix_hits.is_empty() assertion on line 159 will trivially panic with the "at least one hit should mention 'phoenix'" message — or worse, silently pass if the scorer chose not to extract the topic and the test is accidentally relying on the wrong failure mode.

The same bug affects bench_stale_preference_newer_supersedes (line 273, "alice") and bench_contradiction_surfaces_both_with_provenance (line 363, "phoenix").

Existing usage for reference — from integration_test.rs:90:

// Correct — uses the canonical id form:
let by_email = query_topic(&cfg, "email:alice@example.com", None, None, 20).await.unwrap();

What to change:

bench_cross_chat_recall (line 149):

// before
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();

// after
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();

bench_stale_preference_newer_supersedes (line 273 — querying alice's preference):

// before
let topic_resp = query_topic(&cfg, "alice", None, None, 20).await.unwrap();

// after
let topic_resp = query_topic(&cfg, "email:alice@example.com", None, None, 20).await.unwrap();
// and update the fixture messages to include alice@example.com so the entity extractor fires

bench_contradiction_surfaces_both_with_provenance (line 363):

// before
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();

// after
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();

Also note: topic:phoenix entity-index promotion depends on the scorer extracting Phoenix as a topic from the ingested text. The existing tests (topic.rs:413-429) document that this may or may not fire depending on scorer config, and they deliberately do not hard-assert non-empty hits for topic:phoenix. For robustness these benchmarks should either (a) use an email entity that is guaranteed to be extracted (like alice@example.com), or (b) add a guard: if topic_resp.hits.is_empty() { return; } with a clear comment that topic extraction is scorer-dependent.
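The exact-match semantics described above can be modeled in a few lines. This is an illustrative stand-in for the two SQL lookup paths, not the project's code: lookup_entity's WHERE entity_id = ?1 is an exact comparison, while search_entities' LIKE '%..%' is a substring match.

```rust
// Stand-in for `WHERE entity_id = ?1`: exact match against canonical IDs.
fn lookup_entity<'a>(index: &'a [&'a str], id: &str) -> Option<&'a str> {
    index.iter().copied().find(|e| *e == id)
}

// Stand-in for `LIKE '%needle%'`: substring match.
fn search_entities<'a>(index: &'a [&'a str], needle: &str) -> Vec<&'a str> {
    index.iter().copied().filter(|e| e.contains(needle)).collect()
}

fn main() {
    // The index stores canonical IDs, never bare keywords.
    let index = ["topic:phoenix", "email:alice@example.com"];

    // Bare keyword: zero rows, so the benchmark asserts against nothing.
    assert!(lookup_entity(&index, "phoenix").is_none());
    // Canonical ID: hit.
    assert!(lookup_entity(&index, "topic:phoenix").is_some());

    // Substring search still finds "alice" inside the canonical email ID.
    assert_eq!(search_entities(&index, "alice"), vec!["email:alice@example.com"]);
    println!("ok");
}
```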


[critical] benchmarks.rs:193: bench_cross_chat_entity_discoverable queries "alice" not "alice@example.com"

search_entities(&cfg, "alice", None, 10) performs a LIKE '%alice%' match which will correctly fire on entity_id LIKE '%alice%' rows — this one is actually fine as written because search_entities uses substring matching, not exact lookup. However, the assertion on line 205 then calls .find(|m| m.canonical_id.contains("alice@example.com")), which is correct.

The real issue is that the assertion alice.mention_count >= 2 on line 210 assumes the entity extractor fires for both ingested chat batches. The first batch text is "alice@example.com is leading the Phoenix migration." and the second is "alice@example.com confirmed the Friday timeline." — both explicitly include the email so extraction should be reliable. This test is fine; just noting for completeness.


[major] benchmarks.rs:421: bench_long_source_retrieves_exact_leaf conflates "summaries" with query_source and hard-codes total <= 10 without proof

query_source only returns hits from sealed source trees (trees with max_level > 0 and a root summary). From source.rs:121-122:

if tree.max_level == 0 && tree.root_id.is_none() {
    continue;
}

Whether 30 messages of ~200 tokens seal into one or multiple summaries depends on INPUT_TOKEN_BUDGET — a value the test doesn't control. If the budget is large enough that all 30 messages don't trigger a seal during drain_until_idle, source_resp.total will be 0, not <= 10, and the assertion passes trivially for the wrong reason (no summaries were produced, so no dump occurred).

The test's intent — "a large source does not dump all 30 raw chunks" — is valid, but it should assert that a seal actually happened before asserting the bound:

assert!(
    source_resp.total >= 1,
    "30 messages should have sealed into at least one summary"
);
assert!(
    source_resp.total <= 10,
    "long source should not dump all chunks; expected <= 10 summaries, got {}",
    source_resp.total
);

[major] benchmarks.rs:499: bench_scale_ingest_20_sources_no_real_data asserts query_global returns hits but global digests require end_of_day_digest

query_global returns the recap from the global digest tree, which is only populated when end_of_day_digest is called explicitly. The existing integration_test.rs:114-119 documents this clearly:

// query_global: no daily digest has been built yet → empty.
let global = query_global(&cfg, 7).await.unwrap();
assert!(global.hits.is_empty(), "end_of_day_digest hasn't been called, so global is empty");

This benchmark calls query_global(&cfg, 30) immediately after drain_until_idle without calling end_of_day_digest, then asserts !global_resp.hits.is_empty(). This will always fail (or if it silently passes, it's passing for wrong reasons). The test description says "verify retrieval remains correct at scale" — the right thing to assert after 20-source ingestion is scope coverage via query_source, not query_global.

Either call end_of_day_digest before the global assertion (requires importing and driving end_of_day_digest + a DigestOutcome::Emitted guard), or replace the global assertion with one that actually exercises post-ingest state, for example:

// Show at least one source tree was produced across 5 platforms
let source_resp = query_source(&cfg, None, None, None, None, 50).await.unwrap();
assert!(
    source_resp.total >= 5,
    "scale ingest should produce source trees for each distinct scope"
);

[minor] benchmarks.rs:60-67: bench_config() duplicates test_config() from integration_test.rs

The only difference from integration_test.rs's test_config() is the name. This is a small thing, but it adds a maintenance burden — if the embedding config structure changes, both places need updating. Consider extracting a shared crate::openhuman::memory::tree::retrieval::integration_test::test_config by making it pub(crate) (it's already #![cfg(test)] gated so there's no production-code exposure), or just leave a comment cross-referencing the two functions.
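The extraction idea can be sketched with std types only. This is a hedged, standalone illustration of the shared-helper shape the comment suggests; the real Config, TempDir, and embedding settings are crate-specific and are only mimicked here:

```rust
use std::path::PathBuf;

// Stand-in for the crate's Config: only the two fields the tests care about.
struct Config {
    workspace_dir: PathBuf,
    embeddings_enabled: bool,
}

// One shared pub(crate)-style helper both test files could import:
// isolated workspace per test, embeddings disabled for determinism.
fn test_config(unique: &str) -> Config {
    let mut dir = std::env::temp_dir();
    dir.push(format!("bench-{}", unique));
    Config { workspace_dir: dir, embeddings_enabled: false }
}

fn main() {
    let a = test_config("a");
    let b = test_config("b");
    assert_ne!(a.workspace_dir, b.workspace_dir); // isolation between tests
    assert!(!a.embeddings_enabled); // deterministic: no live embedder
    println!("ok");
}
```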


Nitpicks

  • benchmarks.rs:30 — The drain_until_idle import uses the jobs::testing path while integration_test.rs:167 imports the non-testing path jobs::drain_until_idle. Worth checking which is the canonical test path (the comment in integration_test.rs suggests the testing module is intentional for the test harness).
  • benchmarks.rs:86 — The comment says "Scenario 1 — Cross-chat recall" covers tests 1 and 1b, but the doc table at the top of the file only lists 5 scenarios while the PR description lists 6. The numbering in the file (1, 1b, 2, 2b, 3, 4, 5, 5b, 6) vs the doc table (1-5) is confusing. Renaming to a flat bench_<scenario>_<variant> prefix or updating the header table would reduce confusion.
  • benchmarks.rs:1 — The //! doc comment says "Run with: cargo test --package openhuman_core -- retrieval_benchmarks". The --package flag value should match the crate name in Cargo.toml — worth verifying it's openhuman_core and not openhuman.

Missing: docs/TEST-COVERAGE-MATRIX.md

Issue #1538 explicitly lists updating docs/TEST-COVERAGE-MATRIX.md as an acceptance criterion (unchecked in the PR description). This is a documentation gap — new scenarios should be logged in the matrix so reviewers and CI auditors can track coverage against the spec without reading test code.


Verified / looks good

  • Module wiring in mod.rs is correct: #[cfg(test)] mod benchmarks placed before the existing integration_test entry, consistent with project convention.
  • Fixture data uses generic names (alice, bob, carol, dave, eve, alice@example.com) throughout — no real PII.
  • TempDir is kept alive for the duration of each test via let (_tmp, cfg) binding — no accidental early drop.
  • drain_until_idle is called before each assertion set, correctly sequencing the async job queue.
  • The bench_citation_fetch_leaves_hydrates test correctly captures chunk_ids from the ingest return value and uses them as the fetch input — solid provenance chain.
  • bench_drill_down_isolates_children correctly seeds two distinct scopes and asserts the negative (no cross-scope bleed) — good guard.
  • The && fix for the June/July assertion (previously ||) is present in the current diff at line 445-448. Not repeating this finding.
  • No dynamic imports, no hardcoded secrets, no PII in log statements.

Contributor

@graycyrus graycyrus left a comment


Review — PR #1561

Good test coverage concept — the 9 benchmark scenarios are well-motivated and map to real retrieval quality concerns. A few correctness issues need addressing before this can land.

2 critical · 2 major · see inline comments below.

"slack:#ops",
"carol",
vec![
(
Contributor


[critical] query_topic called with bare keyword — will return zero hits.

query_topic delegates to lookup_entity which does WHERE entity_id = ?1 (exact match). The entity index stores canonical IDs like topic:phoenix, not bare keywords.

This test (and the same pattern in bench_stale_preference_newer_supersedes line 318 and bench_contradiction_surfaces_both_with_provenance line 395) will return zero rows, causing the !topic_resp.hits.is_empty() assertion to always panic — meaning the test can never pass as written.

The existing integration_test.rs:90 uses the correct pattern. Either use the canonical entity ID or use query_source with a scope filter.

// Instead of:
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();
// Use canonical ID:
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();

"drill_down / query_topic should not cross scope into unrelated channels. \
Found 'ops' content in eng query: {}",
eng_content.chars().take(200).collect::<String>()
);
Contributor


[critical] query_global requires end_of_day_digest — assertion will always fail.

query_global only returns hits after the daily digest has been built. The existing integration_test.rs:114-119 explicitly documents this:

"no daily digest has been built yet → empty."

Without calling end_of_day_digest first, global_resp.hits will always be empty and this assertion will always panic.

// Add before the query_global call:
use crate::openhuman::memory::tree::tree_global::seal::end_of_day_digest;
end_of_day_digest(&cfg).await.unwrap();

specific implementation details for iteration {}.",
i, i
),
)
Contributor


[major] Long-source test can pass trivially with zero results.

The assertion source_resp.total <= 10 passes when total == 0, which would happen if no seal/summary was produced (e.g. if the ingest volume doesn't cross the seal budget). This means the test could pass for the wrong reason — it verifies an upper bound but no lower bound.

Add a lower-bound guard:

assert!(
    source_resp.total >= 1,
    "long source should produce at least one summary hit, got 0"
);
assert!(
    source_resp.total <= 10,
    "long source should not dump all chunks; expected <= 10 summaries, got {}",
    source_resp.total
);

fn bench_config() -> (TempDir, Config) {
let tmp = TempDir::new().unwrap();
let mut cfg = Config::default();
cfg.workspace_dir = tmp.path().to_path_buf();
Contributor


[major] bench_config() duplicates test_config() from integration_test.rs.

This is an exact copy of test_config() (lines 24-33 in integration_test.rs) with only the name changed. Having two identical setup functions in the same module directory adds maintenance surface — if the config shape changes, both need updating.

Consider extracting the shared helper into a #[cfg(test)] block in the parent mod.rs or a shared test utilities module, then importing from both test files.

- query_topic: use canonical form 'topic:phoenix' instead of bare keyword
- query_topic: use canonical form 'email:alice@example.com' for alice
- Add lower-bound guard for source_resp.total before upper-bound assertion
- Replace query_global (requires end_of_day_digest) with query_source

Closes graycyrus review comments on PR tinyhumansai#1561


Development

Successfully merging this pull request may close these issues.

Add memory retrieval benchmark fixtures for cross-chat recall, citations, and stale/contradictory facts

2 participants