feat(memory): add #1538 benchmark fixtures for retrieval scenarios #1561
unn-Known1 wants to merge 7 commits into
Conversation
…scenarios

Add deterministic test fixtures covering:

- Cross-chat recall: verify relevant source retrieval across chat scopes
- Citation bundle: provenance chain (chunk_ids, source_ref) on retrieval
- Stale preference: newer explicit correction coexists with history
- Contradiction handling: disagreeing sources surface with provenance
- Long-source compression: large source returns compact hits, not dumps
- Scale soak: 20-source ingest with no real user data

Tests live in src/openhuman/memory/tree/retrieval/benchmarks.rs and run via:
cargo test -p openhuman_core -- retrieval_benchmarks
Note: Reviews paused. This branch is under active development; to avoid overwhelming you with review comments from an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be changed in the review settings.
📝 Walkthrough

Adds a test-only benchmarks module with deterministic fixtures and nine tokio tests that seed synthetic chats, ingest them deterministically, drain background jobs, and assert that the retrieval APIs (query_topic/query_source/query_global/search_entities/fetch_leaves) return provenance, hydration, isolation, staleness, contradiction, and scale properties.

Changes: Memory Retrieval Benchmarks
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test
    participant bench_config
    participant ingest_chat_batch
    participant IngestService
    participant BackgroundJobs
    participant QueryAPI
    participant fetch_leaves
    participant search_entities
    participant Assertions
    Test->>bench_config: create isolated config (embeddings disabled)
    bench_config-->>Test: return deterministic env
    Test->>ingest_chat_batch: build ChatBatch with deterministic timestamps
    ingest_chat_batch->>IngestService: ingest_chat
    IngestService->>BackgroundJobs: schedule background processing
    BackgroundJobs->>ingest_chat_batch: drain until idle
    Test->>QueryAPI: call query_topic/query_source/query_global
    QueryAPI->>fetch_leaves: hydrate requested chunk IDs
    QueryAPI->>search_entities: resolve canonical entities
    QueryAPI-->>Test: return hits with source_ref, node_id, content
    Test->>Assertions: verify provenance, hydration, isolation, staleness, contradiction, scale
```
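The drain-until-idle step in the diagram is the part most likely to trip readers up: background jobs can schedule follow-up jobs, so the harness must loop until the queue is empty rather than process a fixed count. A minimal std-only sketch of that pattern (the `JobQueue` type and the job-chaining rule are illustrative assumptions, not the real jobs module):

```rust
use std::collections::VecDeque;

// Toy model of the drain-until-idle pattern: jobs may enqueue follow-ups,
// so draining loops until the queue is empty, not for a fixed iteration count.
struct JobQueue {
    jobs: VecDeque<u32>,
    processed: Vec<u32>,
}

impl JobQueue {
    fn new(seed: Vec<u32>) -> Self {
        Self { jobs: seed.into(), processed: Vec::new() }
    }

    // Assumption for illustration: each job with value > 0 schedules one
    // follow-up job (value - 1), mimicking ingest -> extract -> seal chains.
    fn drain_until_idle(&mut self) {
        while let Some(job) = self.jobs.pop_front() {
            if job > 0 {
                self.jobs.push_back(job - 1);
            }
            self.processed.push(job);
        }
    }
}

fn main() {
    let mut q = JobQueue::new(vec![2, 1]);
    q.drain_until_idle();
    // Chains 2 -> 1 -> 0 and 1 -> 0 produce five processed jobs in total.
    println!("processed {} jobs", q.processed.len());
    assert!(q.jobs.is_empty());
}
```

Asserting only after the queue is fully idle is what makes the benchmarks deterministic despite async scheduling.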
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/memory/tree/retrieval/benchmarks.rs (1)
45-70: ⚡ Quick win: Consider adding debug logging at pipeline boundaries.

While test code doesn't always need extensive logging, these benchmarks exercise the full memory-tree pipeline (ingest → extract → seal → retrieve). Adding log::debug! or tracing::debug! statements at key boundaries (ingest entry, drain completion, query execution) would help troubleshoot failures when scenarios don't meet expectations. As per coding guidelines: "Add copious debug/trace logs while implementing features; use log/tracing crates at debug/trace level on RPC entry/exit, error paths, state transitions, and hard-to-infer branches."
📊 Example logging additions
```diff
 async fn ingest_chat_batch(
     cfg: &Config,
     scope: &str,
     owner: &str,
     messages: Vec<(String, String)>,
     base_ts_millis: i64,
 ) -> Vec<String> {
+    log::debug!("[bench] ingest_chat_batch: scope={}, owner={}, msg_count={}", scope, owner, messages.len());
     let batch = ChatBatch {
         platform: "slack".into(),
         channel_label: scope.into(),
         messages: messages
             .into_iter()
             .enumerate()
             .map(|(i, (author, text))| ChatMessage {
                 author,
                 timestamp: Utc.timestamp_millis_opt(base_ts_millis + (i as i64) * 60_000).unwrap(),
                 text,
                 source_ref: None,
             })
             .collect(),
     };
     let result = ingest_chat(cfg, scope, owner, vec![], batch).await.unwrap();
+    log::debug!("[bench] drain_until_idle: starting");
     drain_until_idle(cfg).await.unwrap();
+    log::debug!("[bench] drain_until_idle: complete, chunk_ids={}", result.chunk_ids.len());
     result.chunk_ids
 }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/tree/retrieval/benchmarks.rs` around lines 45 - 70, Add debug-level logging to the ingest_chat_batch pipeline boundaries: log at function entry (include scope, owner, base_ts_millis and number of messages), log immediately after awaiting ingest_chat (include result or a brief summary), and log after drain_until_idle completes before returning chunk_ids. Use log::debug! or tracing::debug! and reference ingest_chat_batch, ingest_chat, and drain_until_idle so the logs appear at ingest entry, post-ingest, and post-drain to aid debugging when benchmarks fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 445-448: The assertion currently uses a logical OR and will pass
if only one date appears; change the check on the combined string content_all
(and its to_lowercase() call) to require both date tokens by using a logical AND
so the test verifies that content_all contains "june" AND contains "jul" (or
"july" if you prefer full month spelling) to ensure both disagreeing sources are
present.
---
Nitpick comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 45-70: Add debug-level logging to the ingest_chat_batch pipeline
boundaries: log at function entry (include scope, owner, base_ts_millis and
number of messages), log immediately after awaiting ingest_chat (include result
or a brief summary), and log after drain_until_idle completes before returning
chunk_ids. Use log::debug! or tracing::debug! and reference ingest_chat_batch,
ingest_chat, and drain_until_idle so the logs appear at ingest entry,
post-ingest, and post-drain to aid debugging when benchmarks fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ef7d8b90-8b10-4d27-93f3-e85a17b11599
📒 Files selected for processing (2)
src/openhuman/memory/tree/retrieval/benchmarks.rs
src/openhuman/memory/tree/retrieval/mod.rs
The platforms.iter().cycle().take().enumerate() destructuring yields &&str, not &str — convert to owned String before passing to ingest_chat_batch which expects &str.
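This typing pitfall is easy to reproduce in isolation: `.iter()` over a slice of `&str` yields `&&str`, and `.cycle().take(n).enumerate()` preserves that item type. The sketch below is illustrative (the `cycled_platforms` helper is hypothetical, not code from the PR) and shows both fixes — an explicit deref into an owned `String`, and `.copied()` to flatten the reference up front:

```rust
// Hypothetical helper showing the &&str -> String conversion the review asks for.
fn cycled_platforms(platforms: &[&str], n: usize) -> Vec<String> {
    platforms
        .iter()                    // Iterator<Item = &&str>
        .cycle()
        .take(n)
        .map(|p| (*p).to_string()) // deref &&str to &str, then take ownership
        .collect()
}

fn main() {
    let platforms = ["slack", "discord", "email"];

    for (i, p) in platforms.iter().cycle().take(4).enumerate() {
        let _: &&str = p; // compiles: p really is &&str here, not &str
        println!("{i}: {p}");
    }

    let owned = cycled_platforms(&platforms, 5);
    assert_eq!(owned, ["slack", "discord", "email", "slack", "discord"]);

    // Alternative: .copied() flattens &&str back to &str before any mapping.
    let flat: Vec<&str> = platforms.iter().cycle().take(2).copied().collect();
    assert_eq!(flat, ["slack", "discord"]);
}
```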
♻️ Duplicate comments (1)
src/openhuman/memory/tree/retrieval/benchmarks.rs (1)
445-448: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win: Strengthen contradiction assertion to verify both dates are present.

The current logic checks contains("june") || contains("jul"), which passes if only ONE date is present. To properly validate the contradiction scenario (both disagreeing sources surface), verify both dates appear in the combined content.

🔍 Proposed fix
```diff
-    assert!(
-        content_all.to_lowercase().contains("june") || content_all.to_lowercase().contains("jul"),
-        "contradiction hits should include both date references"
-    );
+    let content_lower = content_all.to_lowercase();
+    assert!(
+        content_lower.contains("june") && content_lower.contains("jul"),
+        "contradiction hits should include both date references (June 15 and July 30), got: {}",
+        content_all.chars().take(200).collect::<String>()
+    );
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/memory/tree/retrieval/benchmarks.rs` around lines 445 - 448, The assertion currently allows either date to be present; change it to require both by asserting that content_all.to_lowercase().contains("june") AND content_all.to_lowercase().contains("jul") (or "july" if you prefer full month name) so the contradiction benchmark verifies both disagreeing date references appear; update the assertion that references content_all accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/openhuman/memory/tree/retrieval/benchmarks.rs`:
- Around line 445-448: The assertion currently allows either date to be present;
change it to require both by asserting that
content_all.to_lowercase().contains("june") AND
content_all.to_lowercase().contains("jul") (or "july" if you prefer full month
name) so the contradiction benchmark verifies both disagreeing date references
appear; update the assertion that references content_all accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5a36fd38-df37-4b3f-b914-1928628d90d1
📒 Files selected for processing (1)
src/openhuman/memory/tree/retrieval/benchmarks.rs
CodeRabbit review: June and July must both appear — change || to && to ensure the contradiction scenario correctly verifies both sources.
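The difference between the two assertion forms is easy to demonstrate in isolation. The helpers below are illustrative stand-ins for the benchmark's assertion, not its actual code:

```rust
// OR-form: satisfied when EITHER date token appears — passes even when only
// one of the two disagreeing sources surfaced.
fn has_either_date(content: &str) -> bool {
    let c = content.to_lowercase();
    c.contains("june") || c.contains("jul")
}

// AND-form: satisfied only when BOTH date tokens appear — the behavior the
// contradiction scenario actually needs to verify.
fn has_both_dates(content: &str) -> bool {
    let c = content.to_lowercase();
    c.contains("june") && c.contains("jul")
}

fn main() {
    let one_source = "Launch is June 15.";
    let both_sources = "Launch is June 15. Launch is July 30.";

    // The OR form passes with only one source present — the false pass:
    assert!(has_either_date(one_source));
    // The AND form correctly rejects the single-source case:
    assert!(!has_both_dates(one_source));
    assert!(has_both_dates(both_sources));
    println!("ok");
}
```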
graycyrus
left a comment
Review — PR #1561: feat(memory): add #1538 benchmark fixtures for retrieval scenarios
Walkthrough
This PR adds nine deterministic retrieval benchmark tests for the memory tree pipeline under src/openhuman/memory/tree/retrieval/benchmarks.rs, wired into the module via #[cfg(test)] mod benchmarks in mod.rs. The approach is structurally sound: an isolated TempDir workspace per test, inert embedder for determinism, drain_until_idle to settle the async job queue before asserting, and synthetic fixture data with no real PII. The six retrieval scenarios (cross-chat recall, citation bundle, stale preference, contradiction, long-source compression, scale soak) map well onto the acceptance criteria in #1538.
The main issue is a systematic misuse of query_topic: several tests pass bare keywords like "phoenix" or "alice" where the function contract requires a canonical entity ID (e.g. "topic:phoenix", "email:alice@example.com"). lookup_entity performs an exact WHERE entity_id = ?1 match — bare keywords will never hit anything in the index, causing those tests to assert against empty results sets and trivially pass on the wrong condition. One scenario (query_global in the scale test) also carries a latent false-pass risk. The missing docs/TEST-COVERAGE-MATRIX.md update flagged in #1538 is unaddressed.
Changes
| File | Summary |
|---|---|
| src/openhuman/memory/tree/retrieval/benchmarks.rs | New file — 9 deterministic test scenarios for the retrieval pipeline |
| src/openhuman/memory/tree/retrieval/mod.rs | Wires mod benchmarks under #[cfg(test)] |
Actionable comments (5)
[critical] benchmarks.rs:149 — query_topic("phoenix", ...) uses a bare keyword, not a canonical entity ID
query_topic delegates to lookup_entity, which executes WHERE entity_id = ?1 — an exact match against the entity index. The index stores canonical IDs like topic:phoenix and email:alice@example.com, never bare keywords. Calling query_topic(&cfg, "phoenix", ...) will return zero rows from the index, so topic_resp.hits will be empty and the !phoenix_hits.is_empty() assertion on line 159 will trivially panic with the "at least one hit should mention 'phoenix'" message — or worse, silently pass if the scorer chose not to extract the topic and the test is accidentally relying on the wrong failure mode.
The same bug affects bench_stale_preference_newer_supersedes (line 273, "alice") and bench_contradiction_surfaces_both_with_provenance (line 363, "phoenix").
Existing usage for reference — from integration_test.rs:90:
```rust
// Correct — uses the canonical id form:
let by_email = query_topic(&cfg, "email:alice@example.com", None, None, 20).await.unwrap();
```

What to change:

bench_cross_chat_recall (line 149):

```rust
// before
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();
// after
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();
```

bench_stale_preference_newer_supersedes (line 273 — querying alice's preference):

```rust
// before
let topic_resp = query_topic(&cfg, "alice", None, None, 20).await.unwrap();
// after
let topic_resp = query_topic(&cfg, "email:alice@example.com", None, None, 20).await.unwrap();
// and update the fixture messages to include alice@example.com so the entity extractor fires
```

bench_contradiction_surfaces_both_with_provenance (line 363):

```rust
// before
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();
// after
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();
```

Also note: topic:phoenix entity-index promotion depends on the scorer extracting Phoenix as a topic from the ingested text. The existing tests (topic.rs:413-429) document that this may or may not fire depending on scorer config, and they deliberately do not hard-assert non-empty hits for topic:phoenix. For robustness these benchmarks should either (a) use an email entity that is guaranteed to be extracted (like alice@example.com), or (b) add a guard: if topic_resp.hits.is_empty() { return; } with a clear comment that topic extraction is scorer-dependent.
[critical] benchmarks.rs:193 — bench_cross_chat_entity_discoverable queries "alice" not "alice@example.com"
search_entities(&cfg, "alice", None, 10) performs a LIKE '%alice%' match which will correctly fire on entity_id LIKE '%alice%' rows — this one is actually fine as written because search_entities uses substring matching, not exact lookup. However, the assertion on line 205 then calls .find(|m| m.canonical_id.contains("alice@example.com")), which is correct.
The real issue is that the assertion alice.mention_count >= 2 on line 210 assumes the entity extractor fires for both ingested chat batches. The first batch text is "alice@example.com is leading the Phoenix migration." and the second is "alice@example.com confirmed the Friday timeline." — both explicitly include the email so extraction should be reliable. This test is fine; just noting for completeness.
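The exact-match vs substring-match distinction driving these findings can be modeled with a plain HashMap. This is an assumed stand-in for the real SQLite index — `lookup_entity` and `search_entities` here mimic the review's description of `WHERE entity_id = ?1` and `LIKE '%q%'` respectively, not the actual implementations:

```rust
use std::collections::HashMap;

// Exact-match lookup: a bare keyword never hits a canonical-ID index.
fn lookup_entity(index: &HashMap<String, u32>, id: &str) -> Option<u32> {
    index.get(id).copied() // models WHERE entity_id = ?1
}

// Substring search: matches any canonical ID containing the query.
fn search_entities(index: &HashMap<String, u32>, q: &str) -> Vec<String> {
    let mut hits: Vec<String> = index
        .keys()
        .filter(|k| k.contains(q)) // models LIKE '%q%'
        .cloned()
        .collect();
    hits.sort();
    hits
}

fn main() {
    let mut index = HashMap::new();
    index.insert("topic:phoenix".to_string(), 3);
    index.insert("email:alice@example.com".to_string(), 2);

    // Bare keyword misses the exact-match index entirely...
    assert_eq!(lookup_entity(&index, "phoenix"), None);
    // ...the canonical ID hits it...
    assert_eq!(lookup_entity(&index, "topic:phoenix"), Some(3));
    // ...while substring search finds either form.
    assert_eq!(search_entities(&index, "alice"), vec!["email:alice@example.com"]);
    println!("ok");
}
```

This is why search_entities("alice", ...) works as written while query_topic("phoenix", ...) returns zero hits.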
[major] benchmarks.rs:421 — bench_long_source_retrieves_exact_leaf conflates "summaries" with query_source and hard-codes total <= 10 without proof
query_source only returns hits from sealed source trees (trees with max_level > 0 and a root summary). From source.rs:121-122:
```rust
if tree.max_level == 0 && tree.root_id.is_none() {
    continue;
}
```
The test's intent — "a large source does not dump all 30 raw chunks" — is valid, but it should assert that a seal actually happened before asserting the bound:
```rust
assert!(
    source_resp.total >= 1,
    "30 messages should have sealed into at least one summary"
);
assert!(
    source_resp.total <= 10,
    "long source should not dump all chunks; expected <= 10 summaries, got {}",
    source_resp.total
);
```

[major] benchmarks.rs:499 — bench_scale_ingest_20_sources_no_real_data asserts query_global returns hits, but global digests require end_of_day_digest
query_global returns the recap from the global digest tree, which is only populated when end_of_day_digest is called explicitly. The existing integration_test.rs:114-119 documents this clearly:
```rust
// query_global: no daily digest has been built yet → empty.
let global = query_global(&cfg, 7).await.unwrap();
assert!(global.hits.is_empty(), "end_of_day_digest hasn't been called, so global is empty");
```

This benchmark calls query_global(&cfg, 30) immediately after drain_until_idle without calling end_of_day_digest, then asserts !global_resp.hits.is_empty(). This will always fail (or if it silently passes, it's passing for the wrong reasons). The test description says "verify retrieval remains correct at scale" — the right thing to assert after 20-source ingestion is scope coverage via query_source, not query_global.
Either call end_of_day_digest before the global assertion (requires importing and driving end_of_day_digest + a DigestOutcome::Emitted guard), or replace the global assertion with one that actually exercises post-ingest state, for example:
```rust
// Show at least one source tree was produced across 5 platforms
let source_resp = query_source(&cfg, None, None, None, None, 50).await.unwrap();
assert!(
    source_resp.total >= 5,
    "scale ingest should produce source trees for each distinct scope"
);
```

[minor] benchmarks.rs:60-67 — bench_config() duplicates test_config() from integration_test.rs
The only difference from integration_test.rs's test_config() is the name. This is a small thing, but it adds a maintenance burden — if the embedding config structure changes, both places need updating. Consider extracting a shared crate::openhuman::memory::tree::retrieval::integration_test::test_config by making it pub(crate) (it's already #![cfg(test)] gated so there's no production-code exposure), or just leave a comment cross-referencing the two functions.
Nitpicks
- benchmarks.rs:30 — The drain_until_idle import uses the jobs::testing path while integration_test.rs:167 imports the non-testing path jobs::drain_until_idle. Worth checking which is the canonical test path (the comment in integration_test.rs suggests the testing module is intentional for the test harness).
- benchmarks.rs:86 — The comment says "Scenario 1 — Cross-chat recall" covers tests 1 and 1b, but the doc table at the top of the file only lists 5 scenarios while the PR description lists 6. The numbering in the file (1, 1b, 2, 2b, 3, 4, 5, 5b, 6) vs the doc table (1-5) is confusing. Renaming to a flat bench_<scenario>_<variant> prefix or updating the header table would reduce confusion.
- benchmarks.rs:1 — The //! doc comment says "Run with: cargo test --package openhuman_core -- retrieval_benchmarks". The --package flag value should match the crate name in Cargo.toml — worth verifying it's openhuman_core and not openhuman.
Missing: docs/TEST-COVERAGE-MATRIX.md
Issue #1538 explicitly lists updating docs/TEST-COVERAGE-MATRIX.md as an acceptance criterion (unchecked in the PR description). This is a documentation gap — new scenarios should be logged in the matrix so reviewers and CI auditors can track coverage against the spec without reading test code.
Verified / looks good

- Module wiring in mod.rs is correct: #[cfg(test)] mod benchmarks placed before the existing integration_test entry, consistent with project convention.
- Fixture data uses generic names (alice, bob, carol, dave, eve, alice@example.com) throughout — no real PII.
- TempDir is kept alive for the duration of each test via the let (_tmp, cfg) binding — no accidental early drop.
- drain_until_idle is called before each assertion set, correctly sequencing the async job queue.
- The bench_citation_fetch_leaves_hydrates test correctly captures chunk_ids from the ingest return value and uses them as the fetch input — solid provenance chain.
- bench_drill_down_isolates_children correctly seeds two distinct scopes and asserts the negative (no cross-scope bleed) — good guard.
- The && fix for the June/July assertion (previously ||) is present in the current diff at lines 445-448. Not repeating this finding.
- No dynamic imports, no hardcoded secrets, no PII in log statements.
graycyrus
left a comment
Review — PR #1561
Good test coverage concept — the 9 benchmark scenarios are well-motivated and map to real retrieval quality concerns. A few correctness issues need addressing before this can land.
2 critical · 2 major · see inline comments below.
| "slack:#ops", | ||
| "carol", | ||
| vec![ | ||
| ( |
[critical] query_topic called with bare keyword — will return zero hits.
query_topic delegates to lookup_entity which does WHERE entity_id = ?1 (exact match). The entity index stores canonical IDs like topic:phoenix, not bare keywords.
This test (and the same pattern in bench_stale_preference_newer_supersedes line 318 and bench_contradiction_surfaces_both_with_provenance line 395) will return zero rows, causing the !topic_resp.hits.is_empty() assertion to always panic — meaning the test can never pass as written.
The existing integration_test.rs:90 uses the correct pattern. Either use the canonical entity ID or use query_source with a scope filter.
```rust
// Instead of:
let topic_resp = query_topic(&cfg, "phoenix", None, None, 20).await.unwrap();
// Use canonical ID:
let topic_resp = query_topic(&cfg, "topic:phoenix", None, None, 20).await.unwrap();
```

```rust
        "drill_down / query_topic should not cross scope into unrelated channels. \
         Found 'ops' content in eng query: {}",
        eng_content.chars().take(200).collect::<String>()
    );
```
[critical] query_global requires end_of_day_digest — assertion will always fail.
query_global only returns hits after the daily digest has been built. The existing integration_test.rs:114-119 explicitly documents this:
"no daily digest has been built yet → empty."
Without calling end_of_day_digest first, global_resp.hits will always be empty and this assertion will always panic.
```rust
// Add before the query_global call:
use crate::openhuman::memory::tree::tree_global::seal::end_of_day_digest;
end_of_day_digest(&cfg).await.unwrap();
```

```rust
                specific implementation details for iteration {}.",
                i, i
            ),
        )
```
[major] Long-source test can pass trivially with zero results.
The assertion source_resp.total <= 10 passes when total == 0, which would happen if no seal/summary was produced (e.g. if the ingest volume doesn't cross the seal budget). This means the test could pass for the wrong reason — it verifies an upper bound but no lower bound.
Add a lower-bound guard:
```rust
assert!(
    source_resp.total >= 1,
    "long source should produce at least one summary hit, got 0"
);
assert!(
    source_resp.total <= 10,
    "long source should not dump all chunks; expected <= 10 summaries, got {}",
    source_resp.total
);
```

```rust
    fn bench_config() -> (TempDir, Config) {
        let tmp = TempDir::new().unwrap();
        let mut cfg = Config::default();
        cfg.workspace_dir = tmp.path().to_path_buf();
```
[major] bench_config() duplicates test_config() from integration_test.rs.
This is an exact copy of test_config() (lines 24-33 in integration_test.rs) with only the name changed. Having two identical setup functions in the same module directory adds maintenance surface — if the config shape changes, both need updating.
Consider extracting the shared helper into a #[cfg(test)] block in the parent mod.rs or a shared test utilities module, then importing from both test files.
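A sketch of what the extracted helper could look like. The Config field names, the helper name, and its signature are assumptions based on the review's description (isolated workspace, inert embedder), not the real types:

```rust
use std::path::PathBuf;

// Hypothetical stand-in for the real Config; only the fields the review
// discusses are modeled here.
#[derive(Default)]
struct Config {
    workspace_dir: PathBuf,
    embeddings_enabled: bool,
}

// One pub(crate)-style constructor both integration_test.rs and benchmarks.rs
// could import, instead of maintaining two identical copies.
fn shared_test_config(workspace: PathBuf) -> Config {
    Config {
        workspace_dir: workspace,
        embeddings_enabled: false, // inert embedder keeps test runs deterministic
    }
}

fn main() {
    let cfg = shared_test_config(std::env::temp_dir().join("bench_ws"));
    assert!(!cfg.embeddings_enabled);
    println!("workspace: {}", cfg.workspace_dir.display());
}
```

Gating the shared helper behind #[cfg(test)] in the parent mod.rs keeps it out of production builds while removing the duplication.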
- query_topic: use canonical form 'topic:phoenix' instead of bare keyword
- query_topic: use canonical form 'email:alice@example.com' for alice
- Add lower-bound guard for source_resp.total before upper-bound assertion
- Replace query_global (requires end_of_day_digest) with query_source

Closes graycyrus review comments on PR tinyhumansai#1561
Fixes #1538
Adds deterministic memory retrieval benchmark fixtures covering:
- bench_cross_chat_recall
- bench_cross_chat_entity_discoverable
- bench_citation_bundle_provenance
- bench_citation_fetch_leaves_hydrates
- bench_stale_preference_newer_supersedes
- bench_contradiction_surfaces_both_with_provenance
- bench_long_source_retrieves_exact_leaf
- bench_drill_down_isolates_children
- bench_scale_ingest_20_sources_no_real_data

Acceptance criteria

- Runs via cargo test -p openhuman_core -- retrieval_benchmarks
- Lives in a #[cfg(test)] module

Files changed

- src/openhuman/memory/tree/retrieval/benchmarks.rs — 9 new test scenarios
- src/openhuman/memory/tree/retrieval/mod.rs — benchmark module wired in

Run tests

cargo test -p openhuman_core -- retrieval_benchmarks

Rust + Tauri validation: GitHub Actions CI