
Conversation

Contributor

@nikhilsinhaparseable commented Jan 27, 2026

Server now always collects field-level stats.
Adds an endpoint to fetch stats for one or more fields, with offset and limit for pagination.

Summary by CodeRabbit

  • New Features

    • Added a /dataset_stats API to query dataset field statistics with optional field filtering and pagination.
  • Chores

    • Removed the command-line flag for collecting dataset stats; statistics are now computed automatically.
  • Bug Fixes

    • Simplified shutdown to perform a single sync operation and streamlined automatic field-stat calculation (excluding the stats stream).


Contributor

coderabbitai bot commented Jan 27, 2026

Walkthrough

Adds a dataset statistics API (types, SQL builder, handler, and routing), introduces indexmap dependency, removes the CLI collect_dataset_stats flag and its runtime guard, makes field-stat calculation unconditional (except for dataset stats stream), and simplifies shutdown sync to a single unconditional call.
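
To make the shapes concrete, here is a hedged sketch of what the request and response types described in the walkthrough might look like; the field names and optionality are assumptions, not taken from the diff:

```rust
use indexmap::IndexMap;
use serde::{Deserialize, Serialize};

// Hypothetical sketch of the types named in the walkthrough; field names
// and optionality are assumptions, not copied from the actual diff.
#[derive(Deserialize)]
pub struct DataSetStatsRequest {
    pub dataset: String,
    pub fields: Option<Vec<String>>, // None => stats for all fields
    pub offset: Option<usize>,       // pagination into distinct values
    pub limit: Option<usize>,
}

#[derive(Serialize)]
pub struct FieldStats {
    pub total_count: u64,
    pub distinct_count: u64,
    // IndexMap (serde feature enabled in Cargo.toml) keeps distinct values
    // in insertion order when serialized to JSON.
    pub distinct_values: IndexMap<String, u64>,
}
```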

Changes

  • Dependency Management (Cargo.toml)
    Added indexmap = { version = "2.13.0", features = ["serde"] }.
  • CLI Configuration (src/cli.rs)
    Removed the collect_dataset_stats field from Options (the P_COLLECT_DATASET_STATS CLI flag is gone).
  • HTTP Routing & Server (src/handlers/http/modal/query_server.rs, src/handlers/http/modal/server.rs)
    Registered the new /dataset_stats web scope; added get_dataset_stats_webscope() and wired the get_dataset_stats handler into the base API routing.
  • Health Check / Shutdown (src/handlers/http/health_check.rs)
    Removed the conditional extra sync; shutdown now performs a single unconditional perform_sync_operations().await.
  • Dataset Statistics API Implementation (src/storage/field_stats.rs)
    Added public types DataSetStatsRequest, FieldStats, and QueryRow; added the get_dataset_stats handler, build_stats_sql, and transform_query_results to compute per-field totals, distinct counts, and ordered distinct-value maps (using IndexMap).
  • Object Storage Stats Invocation (src/storage/object_storage.rs)
    Removed the runtime guard on options.collect_dataset_stats; calculate_field_stats is now called unconditionally for every stream except the dataset stats stream.
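
For orientation, a minimal sketch of how such a resource could be wired in actix-web; the handler body is a placeholder and the exact scope layout is an assumption, not the PR's actual code:

```rust
use actix_web::{web, HttpResponse, Resource};

// Hypothetical wiring sketch; the real get_dataset_stats handler lives in
// src/storage/field_stats.rs and does the SQL build / query / transform work.
pub fn get_dataset_stats_webscope() -> Resource {
    web::resource("/dataset_stats").route(web::post().to(get_dataset_stats))
}

async fn get_dataset_stats(body: web::Json<serde_json::Value>) -> HttpResponse {
    // Placeholder: echo the request; the actual handler returns per-field stats.
    HttpResponse::Ok().json(body.into_inner())
}
```

A client would then POST a body along the lines of {"dataset": "weblogs", "fields": ["status"], "offset": 0, "limit": 50} (field names assumed, matching the request type sketched earlier).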

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Handler as Handler\n(get_dataset_stats)
    participant SQLBuilder as SQL Builder\n(build_stats_sql)
    participant Query as Query Engine
    participant Transform as Transform\n(transform_query_results)
    participant Response as JSON Response

    Client->>Handler: POST /dataset_stats\n(DataSetStatsRequest)
    Handler->>SQLBuilder: build_stats_sql(dataset, fields, offset, limit)
    SQLBuilder-->>Handler: SQL string
    Handler->>Query: Execute SQL
    Query-->>Handler: Rows (QueryRow[])
    Handler->>Transform: transform_query_results(rows)
    Transform-->>Handler: HashMap<String, FieldStats>
    Handler->>Response: Serialize to JSON
    Response-->>Client: FieldStats JSON (ordered distinct values)
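
The "ordered distinct values" step is the reason for the new indexmap dependency; a plain HashMap would lose the SQL ORDER BY ordering. A hedged sketch of the kind of aggregation transform_query_results performs (QueryRow's fields are assumed from the walkthrough, not copied from the diff):

```rust
use indexmap::IndexMap;
use std::collections::HashMap;

// Hypothetical shapes; real field names come from the stats SQL.
struct QueryRow {
    field_name: String,
    distinct_value: String,
    value_count: u64,
    total_count: u64,
    distinct_count: u64,
}

struct FieldStats {
    total_count: u64,
    distinct_count: u64,
    // IndexMap preserves insertion order, so the distinct values stay in
    // the order the query returned them.
    distinct_values: IndexMap<String, u64>,
}

fn transform_query_results(rows: Vec<QueryRow>) -> HashMap<String, FieldStats> {
    let mut out: HashMap<String, FieldStats> = HashMap::new();
    for QueryRow { field_name, distinct_value, value_count, total_count, distinct_count } in rows {
        let entry = out.entry(field_name).or_insert_with(|| FieldStats {
            total_count,
            distinct_count,
            distinct_values: IndexMap::new(),
        });
        entry.distinct_values.insert(distinct_value, value_count);
    }
    out
}
```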

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

for next release

Suggested reviewers

  • parmesant
  • praveen5959

Poem

🐰 I hopped through code with a cheerful clatter,
Counting fields and values that matter,
IndexMap ordered, stats now in sight,
Flags trimmed away, the flow runs light,
A rabbit's cheer for results done right.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning
    Docstring coverage is 66.67%, below the required 80.00% threshold.
    Resolution: write docstrings for the functions missing them.
  • Description check: ❓ Inconclusive
    The description is minimal but accurately conveys the key changes; it lacks the template's structure (testing confirmation, comments about intent, documentation additions).
    Resolution: add the template structure, with explicit confirmation of testing, code comments explaining intent, and documentation for the new statistics endpoint.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed
    The title accurately describes the main change (always-on dataset statistics collection plus a retrieval endpoint) and aligns with the core modifications across multiple files.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around lines 614-621: the field_totals CTE uses SUM(DISTINCT field_stats_count), which undercounts whenever multiple parquet files produce identical counts. In the query that references DATASET_STATS_STREAM_NAME and dataset_name, change the aggregation to SUM(field_stats_count) so that total_field_count sums all field_stats_count rows: find the field_totals CTE and replace the SUM(DISTINCT field_stats_count) expression with SUM(field_stats_count).
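
A concrete illustration (not from the PR): if three parquet files each report field_stats_count = 100 for the same field, SUM(DISTINCT field_stats_count) collapses them to 100, while SUM(field_stats_count) correctly yields 300. As a hedged sketch, the corrected CTE might be built roughly like this inside build_stats_sql; the column and variable names follow the review comment, the surrounding query shape is assumed:

```rust
// Hypothetical fragment of build_stats_sql. Note: dataset_name must be
// escaped before interpolation; see the injection fix later in this review.
fn field_totals_cte(stats_stream: &str, dataset_name: &str) -> String {
    format!(
        "field_totals AS (
            SELECT field_name, SUM(field_stats_count) AS total_field_count
            FROM \"{stats_stream}\"
            WHERE dataset_name = '{dataset_name}'
            GROUP BY field_name
        )"
    )
}
```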
🧹 Nitpick comments (3)
src/handlers/http/health_check.rs (1)

65-70: Double sync for stats data flush is intentional but warrants clarification.

The first perform_sync_operations() call triggers field stats calculation during parquet upload. The generated stats are buffered in the pstats stream. The second call is necessary to flush and upload this stats data. The comment could be more explicit about this two-phase dependency.

📝 Consider clarifying the comment
     // Perform sync operations
     perform_sync_operations().await;

-    // This is to ensure that all stats data is synced before the server shuts down
+    // Second sync required: the first sync generates field stats during parquet uploads,
+    // which are buffered in the pstats stream. This second sync flushes those stats.
     perform_sync_operations().await;
src/handlers/http/modal/server.rs (1)

731-735: Consider renaming for consistency with return type.

get_dataset_stats_webscope() returns Resource but the naming convention *_webscope in this file typically returns Scope (e.g., get_demo_data_webscope(), get_metrics_webscope()). Consider renaming to get_dataset_stats_factory() to match the pattern used by other Resource-returning methods like get_query_factory() and get_about_factory().

src/storage/field_stats.rs (1)

537-547: Consider preserving error context in conversions.

The error handling chain loses context by using generic error messages. Consider including the original error details for better debugging.

♻️ Suggested improvement
     let body_bytes = response.into_body().try_into_bytes().map_err(|_| {
-        QueryError::CustomError("error in converting response to bytes".to_string())
+        QueryError::CustomError("failed to extract response body bytes".to_string())
     })?;

-    let body_str = std::str::from_utf8(&body_bytes).map_err(|_| {
-        QueryError::CustomError("error in converting response bytes to string".to_string())
+    let body_str = std::str::from_utf8(&body_bytes).map_err(|e| {
+        QueryError::CustomError(format!("invalid UTF-8 in response: {e}"))
     })?;

     let rows: Vec<QueryRow> =
-        serde_json::from_str(body_str).map_err(|e| QueryError::CustomError(e.to_string()))?;
+        serde_json::from_str(body_str).map_err(|e| {
+            QueryError::CustomError(format!("failed to parse query response: {e}"))
+        })?;

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around lines 622-636: the ranked_values CTE uses MAX(field_stats_distinct_count), which yields the largest per-file distinct count rather than the true distinct count across files. Compute the true value by replacing MAX(field_stats_distinct_count) with COUNT(DISTINCT field_stats_distinct_value), or by deriving distinct_count from the grouped distinct_value rows, and make sure the field_stats_distinct_count / distinct_count alias reflects the change. If the per-file max is intentionally kept as an approximation, add a clear comment or docstring near the ranked_values CTE explaining that limitation instead.
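
To see why MAX undercounts: if file A saw values {x, y} and file B saw {y, z}, each file reports a distinct count of 2, so MAX gives 2, but the true cross-file distinct count over {x, y, z} is 3. A hedged sketch of the corrected aggregation (column names follow the review comment, the rest is assumed):

```rust
// Hypothetical fragment of build_stats_sql: derive the distinct count from
// the grouped distinct-value rows ("field_rows" is an assumed upstream CTE)
// instead of taking the per-file maximum.
fn ranked_values_cte() -> String {
    "ranked_values AS (
        SELECT field_name,
               COUNT(DISTINCT field_stats_distinct_value) AS distinct_count
        FROM field_rows
        GROUP BY field_name
    )"
    .to_string()
}
```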
🧹 Nitpick comments (2)
src/storage/field_stats.rs (2)

508-553: Consider direct query execution to avoid double serialization.

The current implementation serializes the query result to JSON (in query()), converts to bytes, then deserializes back into QueryRow. While functional, this adds unnecessary overhead.

Consider using the query session directly to execute the SQL and process RecordBatch results, similar to how calculate_single_field_stats works. This would avoid the intermediate JSON round-trip.
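
A hedged sketch of what direct execution could look like, assuming a DataFusion SessionContext is at hand (Parseable's actual query-session API may differ):

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// Hypothetical sketch: run the stats SQL directly and consume RecordBatches,
// skipping the JSON round-trip through the HTTP query response.
async fn run_stats_sql(ctx: &SessionContext, sql: &str) -> Result<Vec<RecordBatch>> {
    let df = ctx.sql(sql).await?;
    df.collect().await
}
```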


650-651: Missing tests for new API functions.

The new functions get_dataset_stats, transform_query_results, and build_stats_sql lack unit tests. Consider adding tests for:

  • build_stats_sql: SQL generation with various field filters, pagination values, and special characters
  • transform_query_results: Aggregation logic, sorting behavior, empty input handling

Would you like me to generate unit tests for these functions?
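
For instance, a minimal test along these lines, assuming a build_stats_sql(dataset, fields, limit, offset) -> String signature (a guess at the real API):

```rust
#[cfg(test)]
mod tests {
    use super::build_stats_sql; // assumed: fn(&str, &Option<Vec<String>>, usize, usize) -> String

    #[test]
    fn sql_includes_field_filter_and_pagination() {
        let fields = Some(vec!["status".to_string()]);
        let sql = build_stats_sql("weblogs", &fields, 50, 10);
        // Assumed output shape: the generated SQL should mention the
        // requested field, the limit, and the offset.
        assert!(sql.contains("'status'"));
        assert!(sql.contains("LIMIT 50"));
        assert!(sql.contains("OFFSET 10"));
    }
}
```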

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/storage/field_stats.rs`:
- Around lines 607-640: the SQL string injects dataset_name into a literal but only escapes double quotes. Escape single quotes in dataset_name before the format! call (e.g., replace each ' with '', or chain it onto the existing replace) so the formatted SQL uses a safely escaped dataset_name, and use that escaped variable in the big format!(...) block to prevent SQL injection when building the query.
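
A hedged sketch of the suggested fix; the function and variable names are assumptions:

```rust
/// Hypothetical helper: escape dataset_name for safe interpolation into the
/// stats SQL. The double-quote escaping mirrors what the review says already
/// exists; the single-quote escaping is the suggested addition.
fn escape_dataset_name(raw: &str) -> String {
    raw.replace('"', "\"\"").replace('\'', "''")
}
```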
🧹 Nitpick comments (1)
src/storage/field_stats.rs (1)

512-523: Consider clamping limit (and possibly offset) to a safe upper bound.

Unbounded limits can drive heavy scans and large responses for popular datasets. A small guard using an existing config (if available) would protect the endpoint from accidental or abusive large requests.
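
For example (DEFAULT_LIMIT and MAX_LIMIT are hypothetical names, not existing Parseable config):

```rust
// Hypothetical guard: clamp user-supplied pagination to sane bounds.
const DEFAULT_LIMIT: usize = 100;
const MAX_LIMIT: usize = 1_000;

fn effective_limit(requested: Option<usize>) -> usize {
    requested.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT)
}
```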

@nitisht nitisht merged commit a0d9e19 into parseablehq:main Jan 28, 2026
12 checks passed