feat(memory): add #1538 benchmark fixtures for retrieval scenarios #1561
unn-Known1 wants to merge 7 commits into
Conversation
…scenarios

Add deterministic test fixtures covering:

- Cross-chat recall: verify relevant source retrieval across chat scopes
- Citation bundle: provenance chain (chunk_ids, source_ref) on retrieval
- Stale preference: newer explicit correction coexists with history
- Contradiction handling: disagreeing sources surface with provenance
- Long-source compression: large source returns compact hits, not dumps
- Scale soak: 20-source ingest with no real user data

Tests live in src/openhuman/memory/tree/retrieval/benchmarks.rs and run via:
cargo test -p openhuman_core -- retrieval_benchmarks
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments from an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be changed in the review settings.
📝 Walkthrough

Adds a test-only benchmarks module with deterministic fixtures and nine tokio tests that seed synthetic chats, ingest them deterministically, drain background jobs, and assert that the retrieval APIs (query_topic/query_source/query_global/search_entities/fetch_leaves) return provenance, hydration, isolation, staleness, contradiction, and scale properties.

Changes: Memory Retrieval Benchmarks
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test
    participant bench_config
    participant ingest_chat_batch
    participant IngestService
    participant BackgroundJobs
    participant QueryAPI
    participant fetch_leaves
    participant search_entities
    participant Assertions
    Test->>bench_config: create isolated config (embeddings disabled)
    bench_config-->>Test: return deterministic env
    Test->>ingest_chat_batch: build ChatBatch with deterministic timestamps
    ingest_chat_batch->>IngestService: ingest_chat
    IngestService->>BackgroundJobs: schedule background processing
    BackgroundJobs->>ingest_chat_batch: drain until idle
    Test->>QueryAPI: call query_topic/query_source/query_global
    QueryAPI->>fetch_leaves: hydrate requested chunk IDs
    QueryAPI->>search_entities: resolve canonical entities
    QueryAPI-->>Test: return hits with source_ref, node_id, content
    Test->>Assertions: verify provenance, hydration, isolation, staleness, contradiction, scale
```
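The drain-until-idle step in the diagram is the part most likely to trip readers up: background jobs can schedule follow-up jobs, so the harness must loop until the queue is empty rather than process a fixed count. A minimal std-only sketch of that pattern (the `JobQueue` type and the job-chaining rule are illustrative assumptions, not the real jobs module):

```rust
use std::collections::VecDeque;

// Toy model of the drain-until-idle pattern: jobs may enqueue follow-ups,
// so draining loops until the queue is empty, not for a fixed iteration count.
struct JobQueue {
    jobs: VecDeque<u32>,
    processed: Vec<u32>,
}

impl JobQueue {
    fn new(seed: Vec<u32>) -> Self {
        Self { jobs: seed.into(), processed: Vec::new() }
    }

    // Assumption for illustration: each job with value > 0 schedules one
    // follow-up job (value - 1), mimicking ingest -> extract -> seal chains.
    fn drain_until_idle(&mut self) {
        while let Some(job) = self.jobs.pop_front() {
            if job > 0 {
                self.jobs.push_back(job - 1);
            }
            self.processed.push(job);
        }
    }
}

fn main() {
    let mut q = JobQueue::new(vec![2, 1]);
    q.drain_until_idle();
    // Chains 2 -> 1 -> 0 and 1 -> 0 produce five processed jobs in total.
    println!("processed {} jobs", q.processed.len());
    assert!(q.jobs.is_empty());
}
```

Asserting only after the queue is fully idle is what makes the benchmarks deterministic despite async scheduling.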
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/memory/tree/retrieval/benchmarks.rs (1)
45-70: ⚡ Quick win: Consider adding debug logging at pipeline boundaries.

While test code doesn't always need extensive logging, these benchmarks exercise the full memory-tree pipeline (ingest → extract → seal → retrieve). Adding log::debug! or tracing::debug! statements at key boundaries (ingest entry, drain completion, query execution) would help troubleshoot failures when scenarios don't meet expectations. As per coding guidelines: "Add copious debug/trace logs while implementing features; use log/tracing crates at debug/trace level on RPC entry/exit, error paths, state transitions, and hard-to-infer branches."
📊 Example logging additions
```diff
 async fn ingest_chat_batch(
     cfg: &Config,
     scope: &str,
     owner: &str,
     messages: Vec<(String, String)>,
     base_ts_millis: i64,
 ) -> Vec<String> {
+    log::debug!("[bench] ingest_chat_batch: scope={}, owner={}, msg_count={}", scope, owner, messages.len());
     let batch = ChatBatch {
         platform: "slack".into(),
         channel_label: scope.into(),
         messages: messages
             .into_iter()
             .enumerate()
             .map(|(i, (author, text))| ChatMessage {
                 author,
                 timestamp: Utc.timestamp_millis_opt(base_ts_millis + (i as i64) * 60_000).unwrap(),
                 text,
                 source_ref: None,
             })
             .collect(),
     };
     let result = ingest_chat(cfg, scope, owner, vec![], batch).await.unwrap();
+    log::debug!("[bench] drain_until_idle: starting");
     drain_until_idle(cfg).await.unwrap();
+    log::debug!("[bench] drain_until_idle: complete, chunk_ids={}", result.chunk_ids.len());
     result.chunk_ids
 }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/tree/retrieval/benchmarks.rs` around lines 45 - 70, Add debug-level logging to the ingest_chat_batch pipeline boundaries: log at function entry (include scope, owner, base_ts_millis and number of messages), log immediately after awaiting ingest_chat (include result or a brief summary), and log after drain_until_idle completes before returning chunk_ids. Use log::debug! or tracing::debug! and reference ingest_chat_batch, ingest_chat, and drain_until_idle so the logs appear at ingest entry, post-ingest, and post-drain to aid debugging when benchmarks fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 445-448: The assertion currently uses a logical OR and will pass
if only one date appears; change the check on the combined string content_all
(and its to_lowercase() call) to require both date tokens by using a logical AND
so the test verifies that content_all contains "june" AND contains "jul" (or
"july" if you prefer full month spelling) to ensure both disagreeing sources are
present.
---
Nitpick comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 45-70: Add debug-level logging to the ingest_chat_batch pipeline
boundaries: log at function entry (include scope, owner, base_ts_millis and
number of messages), log immediately after awaiting ingest_chat (include result
or a brief summary), and log after drain_until_idle completes before returning
chunk_ids. Use log::debug! or tracing::debug! and reference ingest_chat_batch,
ingest_chat, and drain_until_idle so the logs appear at ingest entry,
post-ingest, and post-drain to aid debugging when benchmarks fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ef7d8b90-8b10-4d27-93f3-e85a17b11599
📒 Files selected for processing (2)
src/openhuman/memory/tree/retrieval/benchmarks.rs
src/openhuman/memory/tree/retrieval/mod.rs
The platforms.iter().cycle().take().enumerate() destructuring yields &&str, not &str — convert to owned String before passing to ingest_chat_batch which expects &str.
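This typing pitfall is easy to reproduce in isolation: `.iter()` over a slice of `&str` yields `&&str`, and `.cycle().take(n).enumerate()` preserves that item type. The sketch below is illustrative (the `cycled_platforms` helper is hypothetical, not code from the PR) and shows both fixes — an explicit deref into an owned `String`, and `.copied()` to flatten the reference up front:

```rust
// Hypothetical helper showing the &&str -> String conversion the review asks for.
fn cycled_platforms(platforms: &[&str], n: usize) -> Vec<String> {
    platforms
        .iter()                    // Iterator<Item = &&str>
        .cycle()
        .take(n)
        .map(|p| (*p).to_string()) // deref &&str to &str, then take ownership
        .collect()
}

fn main() {
    let platforms = ["slack", "discord", "email"];

    for (i, p) in platforms.iter().cycle().take(4).enumerate() {
        let _: &&str = p; // compiles: p really is &&str here, not &str
        println!("{i}: {p}");
    }

    let owned = cycled_platforms(&platforms, 5);
    assert_eq!(owned, ["slack", "discord", "email", "slack", "discord"]);

    // Alternative: .copied() flattens &&str back to &str before any mapping.
    let flat: Vec<&str> = platforms.iter().cycle().take(2).copied().collect();
    assert_eq!(flat, ["slack", "discord"]);
}
```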
♻️ Duplicate comments (1)
src/openhuman/memory/tree/retrieval/benchmarks.rs (1)
445-448: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win: Strengthen contradiction assertion to verify both dates are present.

The current logic checks contains("june") || contains("jul"), which passes if only ONE date is present. To properly validate the contradiction scenario (both disagreeing sources surface), verify both dates appear in the combined content.

🔍 Proposed fix
```diff
-    assert!(
-        content_all.to_lowercase().contains("june") || content_all.to_lowercase().contains("jul"),
-        "contradiction hits should include both date references"
-    );
+    let content_lower = content_all.to_lowercase();
+    assert!(
+        content_lower.contains("june") && content_lower.contains("jul"),
+        "contradiction hits should include both date references (June 15 and July 30), got: {}",
+        content_all.chars().take(200).collect::<String>()
+    );
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/tree/retrieval/benchmarks.rs` around lines 445 - 448, The assertion currently allows either date to be present; change it to require both by asserting that content_all.to_lowercase().contains("june") AND content_all.to_lowercase().contains("jul") (or "july" if you prefer full month name) so the contradiction benchmark verifies both disagreeing date references appear; update the assertion that references content_all accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 445-448: The assertion currently allows either date to be present;
change it to require both by asserting that
content_all.to_lowercase().contains("june") AND
content_all.to_lowercase().contains("jul") (or "july" if you prefer full month
name) so the contradiction benchmark verifies both disagreeing date references
appear; update the assertion that references content_all accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5a36fd38-df37-4b3f-b914-1928628d90d1
📒 Files selected for processing (1)
src/openhuman/memory/tree/retrieval/benchmarks.rs
CodeRabbit review: June and July must both appear — change || to && to ensure the contradiction scenario correctly verifies both sources.
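The difference between the two assertion forms is easy to demonstrate in isolation. The helpers below are illustrative stand-ins for the benchmark's assertion, not its actual code:

```rust
// OR-form: satisfied when EITHER date token appears — passes even when only
// one of the two disagreeing sources surfaced.
fn has_either_date(content: &str) -> bool {
    let c = content.to_lowercase();
    c.contains("june") || c.contains("jul")
}

// AND-form: satisfied only when BOTH date tokens appear — the behavior the
// contradiction scenario actually needs to verify.
fn has_both_dates(content: &str) -> bool {
    let c = content.to_lowercase();
    c.contains("june") && c.contains("jul")
}

fn main() {
    let one_source = "Launch is June 15.";
    let both_sources = "Launch is June 15. Launch is July 30.";

    // The OR form passes with only one source present — the false pass:
    assert!(has_either_date(one_source));
    // The AND form correctly rejects the single-source case:
    assert!(!has_both_dates(one_source));
    assert!(has_both_dates(both_sources));
    println!("ok");
}
```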
graycyrus
left a comment
Review — PR #1561: feat(memory): add #1538 benchmark fixtures for retrieval scenarios
Walkthrough
This PR adds nine deterministic retrieval benchmark tests for the memory tree pipeline under src/openhuman/memory/tree/retrieval/benchmarks.rs, wired into the module via #[cfg(test)] mod benchmarks in mod.rs. The approach is structurally sound: an isolated TempDir workspace per test, inert embedder for determinism, drain_until_idle to settle the async job queue before asserting, and synthetic fixture data with no real PII. The six retrieval scenarios (cross-chat recall, citation bundle, stale preference, contradiction, long-source compression, scale soak) map well onto the acceptance criteria in #1538.
The main issue is a systematic misuse of query_topic: several tests pass bare keywords like "phoenix" or "alice" where the function contract requires a canonical entity ID (e.g. "topic:phoenix", "email:alice@example.com"). lookup_entity performs an exact WHERE entity_id = ?1 match — bare keywords will never hit anything in the index, causing those tests to assert against empty results sets and trivially pass on the wrong condition. One scenario (query_global in the scale test) also carries a latent false-pass risk. The missing docs/TEST-COVERAGE-MATRIX.md update flagged in #1538 is unaddressed.
Changes
| File | Summary |
|---|---|
| src/openhuman/memory/tree/retrieval/benchmarks.rs | New file — 9 deterministic test scenarios for the retrieval pipeline |
| src/openhuman/memory/tree/retrieval/mod.rs | Wires mod benchmarks under #[cfg(test)] |
Actionable comments (5)
[critical] benchmarks.rs:149 — query_topic("phoenix", ...) uses a bare keyword, not a canonical entity ID
query_topic delegates to lookup_entity, which executes WHERE entity_id = ?1 — an exact match against the entity index. The index stores canonical IDs like topic:phoenix and email:alice@example.com, never bare keywords. Calling query_topic(&cfg, "phoenix", ...) will return zero rows from the index, so topic_resp.hits will be empty and the !phoenix_hits.is_empty() assertion on line 159 will trivially panic with the "at least one hit should mention 'phoenix'" message — or worse, silently pass if the scorer chose not to extract the topic and the test is accidentally relying on the wrong failure mode.
The same bug affects bench_stale_preference_newer_supersedes (line 273, "alice") and bench_contradiction_surfaces_both_with_provenance (line 363, "phoenix").
Existing usage for reference — from integration_test.rs:90:
```rust
// Correct — uses the canonical id form:
let by_email = query_topic(&cfg, "email:alice@example.com", None, None, 20).await.unwrap();
```

What to change:

bench_cross_chat_recall (line 149):

```rust
// before
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();
// after
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();
```

bench_stale_preference_newer_supersedes (line 273 — querying alice's preference):

```rust
// before
let topic_resp = query_topic(&cfg, "alice", None, None, 20).await.unwrap();
// after
let topic_resp = query_topic(&cfg, "email:alice@example.com", None, None, 20).await.unwrap();
// and update the fixture messages to include alice@example.com so the entity extractor fires
```

bench_contradiction_surfaces_both_with_provenance (line 363):

```rust
// before
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();
// after
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();
```

Also note: topic:phoenix entity-index promotion depends on the scorer extracting Phoenix as a topic from the ingested text. The existing tests (topic.rs:413-429) document that this may or may not fire depending on scorer config, and they deliberately do not hard-assert non-empty hits for topic:phoenix. For robustness these benchmarks should either (a) use an email entity that is guaranteed to be extracted (like alice@example.com), or (b) add a guard: if topic_resp.hits.is_empty() { return; } with a clear comment that topic extraction is scorer-dependent.
[critical] benchmarks.rs:193 — bench_cross_chat_entity_discoverable queries "alice" not "alice@example.com"
search_entities(&cfg, "alice", None, 10) performs a LIKE '%alice%' match which will correctly fire on entity_id LIKE '%alice%' rows — this one is actually fine as written because search_entities uses substring matching, not exact lookup. However, the assertion on line 205 then calls .find(|m| m.canonical_id.contains("alice@example.com")), which is correct.
The real issue is that the assertion alice.mention_count >= 2 on line 210 assumes the entity extractor fires for both ingested chat batches. The first batch text is "alice@example.com is leading the Phoenix migration." and the second is "alice@example.com confirmed the Friday timeline." — both explicitly include the email so extraction should be reliable. This test is fine; just noting for completeness.
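The exact-match vs substring-match distinction driving these findings can be modeled with a plain HashMap. This is an assumed stand-in for the real SQLite index — `lookup_entity` and `search_entities` here mimic the review's description of `WHERE entity_id = ?1` and `LIKE '%q%'` respectively, not the actual implementations:

```rust
use std::collections::HashMap;

// Exact-match lookup: a bare keyword never hits a canonical-ID index.
fn lookup_entity(index: &HashMap<String, u32>, id: &str) -> Option<u32> {
    index.get(id).copied() // models WHERE entity_id = ?1
}

// Substring search: matches any canonical ID containing the query.
fn search_entities(index: &HashMap<String, u32>, q: &str) -> Vec<String> {
    let mut hits: Vec<String> = index
        .keys()
        .filter(|k| k.contains(q)) // models LIKE '%q%'
        .cloned()
        .collect();
    hits.sort();
    hits
}

fn main() {
    let mut index = HashMap::new();
    index.insert("topic:phoenix".to_string(), 3);
    index.insert("email:alice@example.com".to_string(), 2);

    // Bare keyword misses the exact-match index entirely...
    assert_eq!(lookup_entity(&index, "phoenix"), None);
    // ...the canonical ID hits it...
    assert_eq!(lookup_entity(&index, "topic:phoenix"), Some(3));
    // ...while substring search finds either form.
    assert_eq!(search_entities(&index, "alice"), vec!["email:alice@example.com"]);
    println!("ok");
}
```

This is why search_entities("alice", ...) works as written while query_topic("phoenix", ...) returns zero hits.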
[major] benchmarks.rs:421 — bench_long_source_retrieves_exact_leaf conflates "summaries" with query_source and hard-codes total <= 10 without proof
query_source only returns hits from sealed source trees (trees with max_level > 0 and a root summary). From source.rs:121-122:
```rust
if tree.max_level == 0 && tree.root_id.is_none() {
    continue;
}
```
The test's intent — "a large source does not dump all 30 raw chunks" — is valid, but it should assert that a seal actually happened before asserting the bound:
```rust
assert!(
    source_resp.total >= 1,
    "30 messages should have sealed into at least one summary"
);
assert!(
    source_resp.total <= 10,
    "long source should not dump all chunks; expected <= 10 summaries, got {}",
    source_resp.total
);
```

[major] benchmarks.rs:499 — bench_scale_ingest_20_sources_no_real_data asserts query_global returns hits, but global digests require end_of_day_digest
query_global returns the recap from the global digest tree, which is only populated when end_of_day_digest is called explicitly. The existing integration_test.rs:114-119 documents this clearly:
```rust
// query_global: no daily digest has been built yet → empty.
let global = query_global(&cfg, 7).await.unwrap();
assert!(global.hits.is_empty(), "end_of_day_digest hasn't been called, so global is empty");
```

This benchmark calls query_global(&cfg, 30) immediately after drain_until_idle without calling end_of_day_digest, then asserts !global_resp.hits.is_empty(). This will always fail (or if it silently passes, it's passing for the wrong reasons). The test description says "verify retrieval remains correct at scale" — the right thing to assert after 20-source ingestion is scope coverage via query_source, not query_global.
Either call end_of_day_digest before the global assertion (requires importing and driving end_of_day_digest + a DigestOutcome::Emitted guard), or replace the global assertion with one that actually exercises post-ingest state, for example:
```rust
// Show at least one source tree was produced across 5 platforms
let source_resp = query_source(&cfg, None, None, None, None, 50).await.unwrap();
assert!(
    source_resp.total >= 5,
    "scale ingest should produce source trees for each distinct scope"
);
```

[minor] benchmarks.rs:60-67 — bench_config() duplicates test_config() from integration_test.rs
The only difference from integration_test.rs's test_config() is the name. This is a small thing, but it adds a maintenance burden — if the embedding config structure changes, both places need updating. Consider extracting a shared crate::openhuman::memory::tree::retrieval::integration_test::test_config by making it pub(crate) (it's already #![cfg(test)] gated so there's no production-code exposure), or just leave a comment cross-referencing the two functions.
Nitpicks
- benchmarks.rs:30 — The drain_until_idle import uses the jobs::testing path while integration_test.rs:167 imports the non-testing path jobs::drain_until_idle. Worth checking which is the canonical test path (the comment in integration_test.rs suggests the testing module is intentional for the test harness).
- benchmarks.rs:86 — The comment says "Scenario 1 — Cross-chat recall" covers tests 1 and 1b, but the doc table at the top of the file only lists 5 scenarios while the PR description lists 6. The numbering in the file (1, 1b, 2, 2b, 3, 4, 5, 5b, 6) vs the doc table (1-5) is confusing. Renaming to a flat bench_<scenario>_<variant> prefix or updating the header table would reduce confusion.
- benchmarks.rs:1 — The //! doc comment says "Run with: cargo test --package openhuman_core -- retrieval_benchmarks". The --package flag value should match the crate name in Cargo.toml — worth verifying it's openhuman_core and not openhuman.
Missing: docs/TEST-COVERAGE-MATRIX.md
Issue #1538 explicitly lists updating docs/TEST-COVERAGE-MATRIX.md as an acceptance criterion (unchecked in the PR description). This is a documentation gap — new scenarios should be logged in the matrix so reviewers and CI auditors can track coverage against the spec without reading test code.
Verified / looks good

- Module wiring in mod.rs is correct: #[cfg(test)] mod benchmarks placed before the existing integration_test entry, consistent with project convention.
- Fixture data uses generic names (alice, bob, carol, dave, eve, alice@example.com) throughout — no real PII.
- TempDir is kept alive for the duration of each test via the let (_tmp, cfg) binding — no accidental early drop.
- drain_until_idle is called before each assertion set, correctly sequencing the async job queue.
- The bench_citation_fetch_leaves_hydrates test correctly captures chunk_ids from the ingest return value and uses them as the fetch input — solid provenance chain.
- bench_drill_down_isolates_children correctly seeds two distinct scopes and asserts the negative (no cross-scope bleed) — good guard.
- The && fix for the June/July assertion (previously ||) is present in the current diff at lines 445-448. Not repeating this finding.
- No dynamic imports, no hardcoded secrets, no PII in log statements.
graycyrus
left a comment
Review — PR #1561
Good test coverage concept — the 9 benchmark scenarios are well-motivated and map to real retrieval quality concerns. A few correctness issues need addressing before this can land.
2 critical · 2 major · see inline comments below.
| "slack:#ops", | ||
| "carol", | ||
| vec![ | ||
| ( |
[critical] query_topic called with bare keyword — will return zero hits.
query_topic delegates to lookup_entity which does WHERE entity_id = ?1 (exact match). The entity index stores canonical IDs like topic:phoenix, not bare keywords.
This test (and the same pattern in bench_stale_preference_newer_supersedes line 318 and bench_contradiction_surfaces_both_with_provenance line 395) will return zero rows, causing the !topic_resp.hits.is_empty() assertion to always panic — meaning the test can never pass as written.
The existing integration_test.rs:90 uses the correct pattern. Either use the canonical entity ID or use query_source with a scope filter.
```rust
// Instead of:
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();
// Use canonical ID:
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();
```

```rust
        "drill_down / query_topic should not cross scope into unrelated channels. \
         Found 'ops' content in eng query: {}",
        eng_content.chars().take(200).collect::<String>()
    );
```
[critical] query_global requires end_of_day_digest — assertion will always fail.
query_global only returns hits after the daily digest has been built. The existing integration_test.rs:114-119 explicitly documents this:
"no daily digest has been built yet → empty."
Without calling end_of_day_digest first, global_resp.hits will always be empty and this assertion will always panic.
```rust
// Add before the query_global call:
use crate::openhuman::memory::tree::tree_global::seal::end_of_day_digest;
end_of_day_digest(&cfg).await.unwrap();
```

```rust
                specific implementation details for iteration {}.",
                i, i
            ),
        )
```
[major] Long-source test can pass trivially with zero results.
The assertion source_resp.total <= 10 passes when total == 0, which would happen if no seal/summary was produced (e.g. if the ingest volume doesn't cross the seal budget). This means the test could pass for the wrong reason — it verifies an upper bound but no lower bound.
Add a lower-bound guard:
```rust
assert!(
    source_resp.total >= 1,
    "long source should produce at least one summary hit, got 0"
);
assert!(
    source_resp.total <= 10,
    "long source should not dump all chunks; expected <= 10 summaries, got {}",
    source_resp.total
);
```

```rust
    fn bench_config() -> (TempDir, Config) {
        let tmp = TempDir::new().unwrap();
        let mut cfg = Config::default();
        cfg.workspace_dir = tmp.path().to_path_buf();
```
[major] bench_config() duplicates test_config() from integration_test.rs.
This is an exact copy of test_config() (lines 24-33 in integration_test.rs) with only the name changed. Having two identical setup functions in the same module directory adds maintenance surface — if the config shape changes, both need updating.
Consider extracting the shared helper into a #[cfg(test)] block in the parent mod.rs or a shared test utilities module, then importing from both test files.
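A sketch of what the extracted helper could look like. The Config field names, the helper name, and its signature are assumptions based on the review's description (isolated workspace, inert embedder), not the real types:

```rust
use std::path::PathBuf;

// Hypothetical stand-in for the real Config; only the fields the review
// discusses are modeled here.
#[derive(Default)]
struct Config {
    workspace_dir: PathBuf,
    embeddings_enabled: bool,
}

// One pub(crate)-style constructor both integration_test.rs and benchmarks.rs
// could import, instead of maintaining two identical copies.
fn shared_test_config(workspace: PathBuf) -> Config {
    Config {
        workspace_dir: workspace,
        embeddings_enabled: false, // inert embedder keeps test runs deterministic
    }
}

fn main() {
    let cfg = shared_test_config(std::env::temp_dir().join("bench_ws"));
    assert!(!cfg.embeddings_enabled);
    println!("workspace: {}", cfg.workspace_dir.display());
}
```

Gating the shared helper behind #[cfg(test)] in the parent mod.rs keeps it out of production builds while removing the duplication.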
- query_topic: use canonical form 'topic:phoenix' instead of bare keyword
- query_topic: use canonical form 'email:alice@example.com' for alice
- Add lower-bound guard for source_resp.total before upper-bound assertion
- Replace query_global (requires end_of_day_digest) with query_source

Closes graycyrus review comments on PR tinyhumansai#1561
Fixes #1538
Adds deterministic memory retrieval benchmark fixtures covering:
- bench_cross_chat_recall
- bench_cross_chat_entity_discoverable
- bench_citation_bundle_provenance
- bench_citation_fetch_leaves_hydrates
- bench_stale_preference_newer_supersedes
- bench_contradiction_surfaces_both_with_provenance
- bench_long_source_retrieves_exact_leaf
- bench_drill_down_isolates_children
- bench_scale_ingest_20_sources_no_real_data

Acceptance criteria

- Runs via cargo test -p openhuman_core -- retrieval_benchmarks
- Lives in a #[cfg(test)] module

Files changed

- src/openhuman/memory/tree/retrieval/benchmarks.rs — 9 new test scenarios
- src/openhuman/memory/tree/retrieval/mod.rs — benchmark module wired in

Run tests

cargo test -p openhuman_core -- retrieval_benchmarks

Rust + Tauri validation: GitHub Actions CI