
Conversation

Contributor

@nikhilsinhaparseable commented Jan 27, 2026

Server now always collects field-level stats.
Adds an endpoint to fetch stats for one or more fields, with offset and limit for pagination.

Summary by CodeRabbit

  • New Features

    • Added a /dataset_stats API to query dataset field statistics with optional field filtering and pagination.
  • Chores

    • Removed the command-line flag for collecting dataset stats; statistics are now computed automatically.
  • Bug Fixes

    • Simplified shutdown to perform a single sync operation and streamlined automatic field-stat calculation (excluding the stats stream).


Contributor

coderabbitai bot commented Jan 27, 2026

Walkthrough

Adds a dataset statistics API (types, SQL builder, handler, and routing), introduces indexmap dependency, removes the CLI collect_dataset_stats flag and its runtime guard, makes field-stat calculation unconditional (except for dataset stats stream), and simplifies shutdown sync to a single unconditional call.
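
To make the shapes concrete, here is a hedged sketch of what the request and response types described in the walkthrough might look like; the field names and optionality are assumptions, not taken from the diff:

```rust
use indexmap::IndexMap;
use serde::{Deserialize, Serialize};

// Hypothetical sketch of the types named in the walkthrough; field names
// and optionality are assumptions, not copied from the actual diff.
#[derive(Deserialize)]
pub struct DataSetStatsRequest {
    pub dataset: String,
    pub fields: Option<Vec<String>>, // None => stats for all fields
    pub offset: Option<usize>,       // pagination into distinct values
    pub limit: Option<usize>,
}

#[derive(Serialize)]
pub struct FieldStats {
    pub total_count: u64,
    pub distinct_count: u64,
    // IndexMap (serde feature enabled in Cargo.toml) keeps distinct values
    // in insertion order when serialized to JSON.
    pub distinct_values: IndexMap<String, u64>,
}
```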

Changes

  • Dependency Management (Cargo.toml)
    Added indexmap = { version = "2.13.0", features = ["serde"] }.
  • CLI Configuration (src/cli.rs)
    Removed the collect_dataset_stats field from Options (the P_COLLECT_DATASET_STATS CLI flag is gone).
  • HTTP Routing & Server (src/handlers/http/modal/query_server.rs, src/handlers/http/modal/server.rs)
    Registered the new /dataset_stats web scope; added get_dataset_stats_webscope() and wired the get_dataset_stats handler into the base API routing.
  • Health Check / Shutdown (src/handlers/http/health_check.rs)
    Removed the conditional extra sync; shutdown now performs a single unconditional perform_sync_operations().await.
  • Dataset Statistics API Implementation (src/storage/field_stats.rs)
    Added public types DataSetStatsRequest, FieldStats, and QueryRow; added the get_dataset_stats handler, build_stats_sql, and transform_query_results to compute per-field totals, distinct counts, and ordered distinct-value maps (using IndexMap).
  • Object Storage Stats Invocation (src/storage/object_storage.rs)
    Removed the runtime guard on options.collect_dataset_stats; calculate_field_stats is now called unconditionally for every stream except the dataset stats stream.
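
For orientation, a minimal sketch of how such a resource could be wired in actix-web; the handler body is a placeholder and the exact scope layout is an assumption, not the PR's actual code:

```rust
use actix_web::{web, HttpResponse, Resource};

// Hypothetical wiring sketch; the real get_dataset_stats handler lives in
// src/storage/field_stats.rs and does the SQL build / query / transform work.
pub fn get_dataset_stats_webscope() -> Resource {
    web::resource("/dataset_stats").route(web::post().to(get_dataset_stats))
}

async fn get_dataset_stats(body: web::Json<serde_json::Value>) -> HttpResponse {
    // Placeholder: echo the request; the actual handler returns per-field stats.
    HttpResponse::Ok().json(body.into_inner())
}
```

A client would then POST a body along the lines of {"dataset": "weblogs", "fields": ["status"], "offset": 0, "limit": 50} (field names assumed, matching the request type sketched earlier).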

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Handler as Handler\n(get_dataset_stats)
    participant SQLBuilder as SQL Builder\n(build_stats_sql)
    participant Query as Query Engine
    participant Transform as Transform\n(transform_query_results)
    participant Response as JSON Response

    Client->>Handler: POST /dataset_stats\n(DataSetStatsRequest)
    Handler->>SQLBuilder: build_stats_sql(dataset, fields, offset, limit)
    SQLBuilder-->>Handler: SQL string
    Handler->>Query: Execute SQL
    Query-->>Handler: Rows (QueryRow[])
    Handler->>Transform: transform_query_results(rows)
    Transform-->>Handler: HashMap<String, FieldStats>
    Handler->>Response: Serialize to JSON
    Response-->>Client: FieldStats JSON (ordered distinct values)
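
The "ordered distinct values" step is the reason for the new indexmap dependency; a plain HashMap would lose the SQL ORDER BY ordering. A hedged sketch of the kind of aggregation transform_query_results performs (QueryRow's fields are assumed from the walkthrough, not copied from the diff):

```rust
use indexmap::IndexMap;
use std::collections::HashMap;

// Hypothetical shapes; real field names come from the stats SQL.
struct QueryRow {
    field_name: String,
    distinct_value: String,
    value_count: u64,
    total_count: u64,
    distinct_count: u64,
}

struct FieldStats {
    total_count: u64,
    distinct_count: u64,
    // IndexMap preserves insertion order, so the distinct values stay in
    // the order the query returned them.
    distinct_values: IndexMap<String, u64>,
}

fn transform_query_results(rows: Vec<QueryRow>) -> HashMap<String, FieldStats> {
    let mut out: HashMap<String, FieldStats> = HashMap::new();
    for QueryRow { field_name, distinct_value, value_count, total_count, distinct_count } in rows {
        let entry = out.entry(field_name).or_insert_with(|| FieldStats {
            total_count,
            distinct_count,
            distinct_values: IndexMap::new(),
        });
        entry.distinct_values.insert(distinct_value, value_count);
    }
    out
}
```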

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

for next release

Suggested reviewers

  • parmesant
  • praveen5959

Poem

🐰 I hopped through code with a cheerful clatter,
Counting fields and values that matter,
IndexMap ordered, stats now in sight,
Flags trimmed away, the flow runs light,
A rabbit's cheer for results done right.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning
    Docstring coverage is 66.67%, below the required 80.00% threshold.
    Resolution: write docstrings for the functions missing them.
  • Description check: ❓ Inconclusive
    The description is minimal but accurately conveys the key changes; it lacks the template's structure (testing confirmation, comments about intent, documentation additions).
    Resolution: add the template structure, with explicit confirmation of testing, code comments explaining intent, and documentation for the new statistics endpoint.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed
    The title accurately describes the main change (always-on dataset statistics collection plus a retrieval endpoint) and aligns with the core modifications across multiple files.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around lines 614-621: the field_totals CTE uses SUM(DISTINCT field_stats_count), which undercounts whenever multiple parquet files produce identical counts. In the query that references DATASET_STATS_STREAM_NAME and dataset_name, change the aggregation to SUM(field_stats_count) so that total_field_count sums all field_stats_count rows: find the field_totals CTE and replace the SUM(DISTINCT field_stats_count) expression with SUM(field_stats_count).
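
A concrete illustration (not from the PR): if three parquet files each report field_stats_count = 100 for the same field, SUM(DISTINCT field_stats_count) collapses them to 100, while SUM(field_stats_count) correctly yields 300. As a hedged sketch, the corrected CTE might be built roughly like this inside build_stats_sql; the column and variable names follow the review comment, the surrounding query shape is assumed:

```rust
// Hypothetical fragment of build_stats_sql. Note: dataset_name must be
// escaped before interpolation; see the injection fix later in this review.
fn field_totals_cte(stats_stream: &str, dataset_name: &str) -> String {
    format!(
        "field_totals AS (
            SELECT field_name, SUM(field_stats_count) AS total_field_count
            FROM \"{stats_stream}\"
            WHERE dataset_name = '{dataset_name}'
            GROUP BY field_name
        )"
    )
}
```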
🧹 Nitpick comments (3)
src/handlers/http/health_check.rs (1)

65-70: Double sync for stats data flush is intentional but warrants clarification.

The first perform_sync_operations() call triggers field stats calculation during parquet upload. The generated stats are buffered in the pstats stream. The second call is necessary to flush and upload this stats data. The comment could be more explicit about this two-phase dependency.

📝 Consider clarifying the comment
     // Perform sync operations
     perform_sync_operations().await;

-    // This is to ensure that all stats data is synced before the server shuts down
+    // Second sync required: the first sync generates field stats during parquet uploads,
+    // which are buffered in the pstats stream. This second sync flushes those stats.
     perform_sync_operations().await;
src/handlers/http/modal/server.rs (1)

731-735: Consider renaming for consistency with return type.

get_dataset_stats_webscope() returns Resource but the naming convention *_webscope in this file typically returns Scope (e.g., get_demo_data_webscope(), get_metrics_webscope()). Consider renaming to get_dataset_stats_factory() to match the pattern used by other Resource-returning methods like get_query_factory() and get_about_factory().

src/storage/field_stats.rs (1)

537-547: Consider preserving error context in conversions.

The error handling chain loses context by using generic error messages. Consider including the original error details for better debugging.

♻️ Suggested improvement
     let body_bytes = response.into_body().try_into_bytes().map_err(|_| {
-        QueryError::CustomError("error in converting response to bytes".to_string())
+        QueryError::CustomError("failed to extract response body bytes".to_string())
     })?;

-    let body_str = std::str::from_utf8(&body_bytes).map_err(|_| {
-        QueryError::CustomError("error in converting response bytes to string".to_string())
+    let body_str = std::str::from_utf8(&body_bytes).map_err(|e| {
+        QueryError::CustomError(format!("invalid UTF-8 in response: {e}"))
     })?;

     let rows: Vec<QueryRow> =
-        serde_json::from_str(body_str).map_err(|e| QueryError::CustomError(e.to_string()))?;
+        serde_json::from_str(body_str).map_err(|e| {
+            QueryError::CustomError(format!("failed to parse query response: {e}"))
+        })?;

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around lines 622-636: the ranked_values CTE uses MAX(field_stats_distinct_count), which yields the largest per-file distinct count rather than the true distinct count across files. Compute the true value by replacing MAX(field_stats_distinct_count) with COUNT(DISTINCT field_stats_distinct_value), or by deriving distinct_count from the grouped distinct_value rows, and make sure the field_stats_distinct_count / distinct_count alias reflects the change. If the per-file max is intentionally kept as an approximation, add a clear comment or docstring near the ranked_values CTE explaining that limitation instead.
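
To see why MAX undercounts: if file A saw values {x, y} and file B saw {y, z}, each file reports a distinct count of 2, so MAX gives 2, but the true cross-file distinct count over {x, y, z} is 3. A hedged sketch of the corrected aggregation (column names follow the review comment, the rest is assumed):

```rust
// Hypothetical fragment of build_stats_sql: derive the distinct count from
// the grouped distinct-value rows ("field_rows" is an assumed upstream CTE)
// instead of taking the per-file maximum.
fn ranked_values_cte() -> String {
    "ranked_values AS (
        SELECT field_name,
               COUNT(DISTINCT field_stats_distinct_value) AS distinct_count
        FROM field_rows
        GROUP BY field_name
    )"
    .to_string()
}
```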
🧹 Nitpick comments (2)
src/storage/field_stats.rs (2)

508-553: Consider direct query execution to avoid double serialization.

The current implementation serializes the query result to JSON (in query()), converts to bytes, then deserializes back into QueryRow. While functional, this adds unnecessary overhead.

Consider using the query session directly to execute the SQL and process RecordBatch results, similar to how calculate_single_field_stats works. This would avoid the intermediate JSON round-trip.
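
A hedged sketch of what direct execution could look like, assuming a DataFusion SessionContext is at hand (Parseable's actual query-session API may differ):

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// Hypothetical sketch: run the stats SQL directly and consume RecordBatches,
// skipping the JSON round-trip through the HTTP query response.
async fn run_stats_sql(ctx: &SessionContext, sql: &str) -> Result<Vec<RecordBatch>> {
    let df = ctx.sql(sql).await?;
    df.collect().await
}
```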


650-651: Missing tests for new API functions.

The new functions get_dataset_stats, transform_query_results, and build_stats_sql lack unit tests. Consider adding tests for:

  • build_stats_sql: SQL generation with various field filters, pagination values, and special characters
  • transform_query_results: Aggregation logic, sorting behavior, empty input handling

Would you like me to generate unit tests for these functions?
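
For instance, a minimal test along these lines, assuming a build_stats_sql(dataset, fields, limit, offset) -> String signature (a guess at the real API):

```rust
#[cfg(test)]
mod tests {
    use super::build_stats_sql; // assumed: fn(&str, &Option<Vec<String>>, usize, usize) -> String

    #[test]
    fn sql_includes_field_filter_and_pagination() {
        let fields = Some(vec!["status".to_string()]);
        let sql = build_stats_sql("weblogs", &fields, 50, 10);
        // Assumed output shape: the generated SQL should mention the
        // requested field, the limit, and the offset.
        assert!(sql.contains("'status'"));
        assert!(sql.contains("LIMIT 50"));
        assert!(sql.contains("OFFSET 10"));
    }
}
```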

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around lines 607-640: the SQL string injects dataset_name into a literal but only escapes double quotes. Escape single quotes in dataset_name before the format! call (e.g., replace each ' with '', or chain it onto the existing replace) so the formatted SQL uses a safely escaped dataset_name, and use that escaped variable in the big format!(...) block to prevent SQL injection when building the query.
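
A hedged sketch of the suggested fix; the function and variable names are assumptions:

```rust
/// Hypothetical helper: escape dataset_name for safe interpolation into the
/// stats SQL. The double-quote escaping mirrors what the review says already
/// exists; the single-quote escaping is the suggested addition.
fn escape_dataset_name(raw: &str) -> String {
    raw.replace('"', "\"\"").replace('\'', "''")
}
```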
🧹 Nitpick comments (1)
src/storage/field_stats.rs (1)

512-523: Consider clamping limit (and possibly offset) to a safe upper bound.

Unbounded limits can drive heavy scans and large responses for popular datasets. A small guard using an existing config (if available) would protect the endpoint from accidental or abusive large requests.
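
For example (DEFAULT_LIMIT and MAX_LIMIT are hypothetical names, not existing Parseable config):

```rust
// Hypothetical guard: clamp user-supplied pagination to sane bounds.
const DEFAULT_LIMIT: usize = 100;
const MAX_LIMIT: usize = 1_000;

fn effective_limit(requested: Option<usize>) -> usize {
    requested.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT)
}
```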

@nitisht nitisht merged commit a0d9e19 into parseablehq:main Jan 28, 2026
12 checks passed