feat(CLP-JSON): Add Messages and Files stats display to CLP-JSON Web UI Ingest page. #2293

Draft
junhaoliao wants to merge 2 commits into y-scope:main from junhaoliao:clp-json-stats

Conversation

@junhaoliao (Member) commented May 23, 2026

Description

Problem

The CLP-JSON (clp-s) web UI Ingest page Details component only shows the TimeRange card, while CLP-Text additionally renders Messages (total log event count) and Files (total file count) cards. The root cause is that CLP-S doesn't populate the same data pipeline that CLP-Text relies on:

| Aspect | CLP-Text | CLP-S (before this PR) |
|---|---|---|
| num_messages source | `SUM(num_messages)` from files table (per-file, NOT NULL) | Not tracked; ArchiveStats omits it |
| num_files source | `COUNT(DISTINCT orig_file_id)` from files table | Not tracked; ArchiveStats omits it |
| Archives table columns | num_messages, num_files left as NULL by C++ inserter | Same columns left as NULL |
| Web UI SQL | Joins archives + files tables | Joins archives + files tables, but files table is empty for CLP-S |

Because CLP-S never populates the files table, the join always yields num_messages=0 and num_files=0.

Solution

C++ layer — Extend ArchiveStats to include num_messages and num_files, matching how CLP-Text's data ultimately reaches the UI (but via a different source):

| Aspect | CLP-Text | CLP-S (after this PR) |
|---|---|---|
| Stats class | ArchiveMetadata: no num_messages/num_files fields | ArchiveStats: now includes both |
| num_messages computed by | Per-file in SQLite/MySQL files table | `ArchiveWriter::m_next_log_event_id` (running log-event count) |
| num_files computed by | `COUNT(DISTINCT orig_file_id)` in SQL from files table | `ArchiveWriter::m_num_files` (range-index range count) |
| JSON output | `{id, uncompressed_size, size}` only | Now includes num_messages and num_files |
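
As a rough illustration of the extended stats output, the sketch below parses one JSON stats line on the consumer side. The field names follow the table above; the sample values, the Python `ArchiveStats` dataclass, and the `parse_archive_stats` helper are hypothetical and are not the PR's code. Mapping absent counters to `None` is an assumption made here so that output from an older binary would still parse.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArchiveStats:
    """Python-side mirror of the JSON stats fields (illustrative only)."""

    id: str
    uncompressed_size: int
    size: int
    num_messages: Optional[int] = None
    num_files: Optional[int] = None


def parse_archive_stats(line: str) -> ArchiveStats:
    """Parse one JSON stats line; absent counters become None."""
    raw = json.loads(line)
    return ArchiveStats(
        id=raw["id"],
        uncompressed_size=int(raw["uncompressed_size"]),
        size=int(raw["size"]),
        num_messages=raw.get("num_messages"),
        num_files=raw.get("num_files"),
    )


# Hypothetical sample lines: one with the new counters, one without.
new_line = (
    '{"id": "a1", "uncompressed_size": 1000, "size": 100, '
    '"num_messages": 42, "num_files": 1}'
)
old_line = '{"id": "a0", "uncompressed_size": 1000, "size": 100}'
new_stats = parse_archive_stats(new_line)
old_stats = parse_archive_stats(old_line)
```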

Database layer — Add nullable BIGINT columns num_messages and num_files to the archives table schema. The Python compression task's update_archive_metadata() now inserts these values from the C++ stats output, while CLP-Text's C++ inserter continues leaving them NULL (hence nullable).
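
To make the nullable-column choice concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the MySQL metadata database. The reduced table schema, the `insert_archive` helper, and the row values are assumptions for illustration only; the real `update_archive_metadata()` may be structured differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced stand-in for the archives table; the real table has more columns.
conn.execute(
    """
    CREATE TABLE clp_default_archives (
        id TEXT PRIMARY KEY,
        uncompressed_size INTEGER NOT NULL,
        size INTEGER NOT NULL,
        num_messages INTEGER NULL DEFAULT NULL,
        num_files INTEGER NULL DEFAULT NULL
    )
    """
)


def insert_archive(db: sqlite3.Connection, stats: dict) -> None:
    """Insert one archive row; counters missing from stats are stored as NULL."""
    db.execute(
        "INSERT INTO clp_default_archives "
        "(id, uncompressed_size, size, num_messages, num_files) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            stats["id"],
            stats["uncompressed_size"],
            stats["size"],
            stats.get("num_messages"),
            stats.get("num_files"),
        ),
    )


# A CLP-S style row carries both counters.
insert_archive(
    conn,
    {"id": "clp-s-archive", "uncompressed_size": 1000, "size": 100,
     "num_messages": 1000000, "num_files": 1},
)
# A CLP-Text style row leaves them unset, so they stay NULL.
insert_archive(conn, {"id": "clp-text-archive", "uncompressed_size": 500, "size": 50})

rows = conn.execute(
    "SELECT id, num_messages, num_files FROM clp_default_archives ORDER BY id"
).fetchall()
```

Because the columns are nullable, both kinds of rows can share one table, and readers have to treat NULL as "not recorded" rather than zero.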

Web UI layer — Rewrite buildMultiDatasetDetailsSql() to query only from the archives table, eliminating the dependency on the always-empty files table:

| Aspect | CLP-Text | CLP-S (after this PR) |
|---|---|---|
| Tables queried | clp_archives + clp_files (join) | `<prefix>_<dataset>_archives` only (UNION ALL across datasets) |
| num_messages SQL | `SUM(num_messages)` from files | `COALESCE(SUM(num_messages), 0)` from archives |
| num_files SQL | `COUNT(DISTINCT orig_file_id)` from files | `COALESCE(SUM(num_files), 0)` from archives |
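
The archives-only aggregation in the table above can be sketched as follows. This is a Python approximation, not the TypeScript `buildMultiDatasetDetailsSql()` from the PR: the function name, its parameters, and the exact query shape are assumptions, it covers only the two counters, and it assumes the prefix and dataset names have already been validated because they are interpolated into the SQL text. SQLite stands in for MySQL in the demonstration.

```python
import sqlite3
from typing import Sequence


def build_multi_dataset_details_sql(prefix: str, datasets: Sequence[str]) -> str:
    """Build a query that sums archive-level counters across datasets."""
    if not datasets:
        raise ValueError("at least one dataset is required")
    per_dataset = [
        f"SELECT num_messages, num_files FROM {prefix}_{name}_archives"
        for name in datasets
    ]
    union = " UNION ALL ".join(per_dataset)
    # COALESCE turns the NULL that SUM returns for no rows (or all-NULL rows) into 0.
    return (
        "SELECT COALESCE(SUM(num_messages), 0) AS num_messages, "
        "COALESCE(SUM(num_files), 0) AS num_files "
        f"FROM ({union}) AS all_archives"
    )


# Demonstration against an in-memory SQLite database with made-up rows.
conn = sqlite3.connect(":memory:")
for name in ("default", "other"):
    conn.execute(
        f"CREATE TABLE clp_{name}_archives (num_messages INTEGER, num_files INTEGER)"
    )
conn.execute("INSERT INTO clp_default_archives VALUES (10, 1), (NULL, NULL)")
conn.execute("INSERT INTO clp_other_archives VALUES (5, 2)")

sql = build_multi_dataset_details_sql("clp", ["default", "other"])
totals = conn.execute(sql).fetchone()
```

Rows whose counters are NULL (for example, archives written before the columns existed) are skipped by `SUM`, so they contribute nothing to the totals rather than breaking the query.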

Update the CLP-S Details component to render Messages and Files cards using the same grid layout as CLP-Text.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Build

Task: Verify the full project builds successfully including the new C++ changes and web UI.

Command:

task

Output:

(Exit code 0 — build completed successfully, clp-package produced)

End-to-end compression test (CLP-S)

Task: Verify the full compression pipeline — clp-s now outputs num_messages and num_files in its JSON stats, the Python task writes them to the DB, and the archives table contains non-NULL values.

Command:

cd build/clp-package
./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl

Output:

2026-05-23T02:31:42.871 INFO [compress] Compression job 1 submitted.
2026-05-23T02:31:44.875 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 198.26MB/s.
2026-05-23T02:31:45.376 INFO [compress] Compression finished.
2026-05-23T02:31:45.377 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 178.34MB/s.

Database verification

Task: Verify the num_messages and num_files columns in the archives table contain non-NULL, non-zero values after compression.

Command:

docker exec <database-container> mysql -u clp-user -p<password> clp-db -e "SELECT id, num_messages, num_files FROM clp_default_archives LIMIT 5;"

Output:

id	num_messages	num_files
ac317804-58aa-4a20-8e71-32321cf9ffbd	1000000	1

Explanation: Both num_messages (1,000,000 log events) and num_files (1 file) are populated with non-NULL values, confirming the C++ → Python → DB pipeline works correctly.

Web UI browser validation (CLP-S Ingest page)

Task: Verify the CLP-JSON Ingest page now displays Messages and Files cards alongside the TimeRange card, using a browser.

Navigated to http://localhost:4000/ and confirmed the Details section renders all three cards with correct values:

[Screenshot: CLP-JSON Ingest page, Details section]

Explanation: The screenshot shows the Time Range, Messages (1,000,000), and Files (1) cards all rendering correctly. The grid layout has 3 children (timeRange spanning 2 columns, Messages and Files each in 1 column), matching the CLP-Text layout.

Details grid layout verification

Task: Verify the CSS grid layout structure matches CLP-Text (2-column grid with Time Range spanning full width).

Inspected the computed styles of the details grid element in the browser:

[Screenshot: CLP-JSON Ingest page, full page]

Explanation: The grid uses a 2-column layout with 8px gap. The Time Range child spans both columns (via grid-column: span 2 CSS), while Messages and Files each occupy one column — matching the CLP-Text Details layout exactly.

Summary by CodeRabbit

Release Notes

  • New Features

    • Archives now track message and file counts as metadata metrics available in the interface.
  • UI Changes

    • Archive details page now displays message and file counts consistently across all archive types.
    • Removed conditional layout rendering to provide a unified details view.
  • Improvements

    • Optimized metrics retrieval by aggregating message and file counts directly from archive metadata.

@junhaoliao junhaoliao requested review from a team and gibber9809 as code owners May 23, 2026 04:53
@coderabbitai (Bot, Contributor) commented May 23, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d6c72693-9eb4-465d-94fa-f28e793fe71a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR extends the archive metadata pipeline to track and persist num_messages and num_files counters across compression, storage, and web UI components. The database schema and C++ structures are updated to define and collect these metrics, which then flow through job orchestration persistence and into SQL queries powering the ingest details page.

Changes

Archive Statistics Tracking Enhancement

| Layer / File(s) | Summary |
|---|---|
| Schema and metadata field definitions<br>components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py, components/core/src/clp/streaming_archive/Constants.hpp, components/webui/client/src/pages/IngestPage/sqlConfig.ts | Database schema adds nullable num_messages and num_files BIGINT columns; C++ metadata constants and SQL column enum define the field names for use throughout the pipeline. |
| Archive statistics collection in C++<br>components/core/src/clp_s/ArchiveWriter.hpp, components/core/src/clp_s/ArchiveWriter.cpp | ArchiveWriter increments m_num_files when opening ranges, and ArchiveStats now accepts and exposes these counters via constructor parameters, JSON serialization, and public getter methods. |
| Statistics persistence to database<br>components/job-orchestration/job_orchestration/executor/compress/compression_task.py | Archive statistics insertion now requires and persists num_messages and num_files to the archives table. |
| Query and UI integration<br>components/webui/client/src/pages/IngestPage/Details/sql.ts, components/webui/client/src/pages/IngestPage/Details/index.tsx | Multi-dataset details SQL now aggregates counts from the archives table directly; Details component uses a single unified layout rendering time range, messages, and files with the new metrics. |

Sequence Diagram

sequenceDiagram
  participant ArchiveWriter
  participant ArchiveStats
  participant CompressionTask
  participant Database
  participant DetailsUI as Details UI
  ArchiveWriter->>ArchiveWriter: increment m_num_files on range open
  ArchiveWriter->>ArchiveStats: construct with m_num_files, m_num_messages
  ArchiveStats->>ArchiveStats: serialize to JSON
  CompressionTask->>Database: insert archive stats with num_messages, num_files
  DetailsUI->>Database: query archives table
  Database->>DetailsUI: return aggregated SUM(num_messages), SUM(num_files)
  DetailsUI->>DetailsUI: render unified Messages and Files components

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change (adding Messages and Files stats display to the CLP-JSON Web UI Ingest page), which aligns with the comprehensive changeset across C++, database, and Web UI layers. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Bot, Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (1)

23-42: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add an explicit schema migration path for existing archives tables.

CREATE TABLE IF NOT EXISTS won’t add new columns to already-existing tables. With the new inserts expecting num_messages and num_files, upgrades can fail at runtime unless these columns are backfilled via migration.

Suggested fix direction
 def _create_archives_table(db_cursor, archives_table_name: str) -> None:
     db_cursor.execute(
         f"""
         CREATE TABLE IF NOT EXISTS `{archives_table_name}` (
             ...
             `num_messages` BIGINT NULL DEFAULT NULL,
             `num_files` BIGINT NULL DEFAULT NULL,
             ...
         )
         """
     )
+
+    # Ensure upgrades add newly introduced nullable columns on existing tables.
+    # Implement with your existing migration strategy/dialect-safe checks.
+    # Example: conditionally ALTER TABLE when column is missing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py` around lines
23 - 42, The current _create_archives_table uses CREATE TABLE IF NOT EXISTS
which does not add new columns to existing tables; add an idempotent migration
step after the CREATE that checks information_schema (or equivalent) for the
presence of the columns `num_messages` and `num_files` on the table named by
archives_table_name and, if missing, runs an ALTER TABLE to add them with the
intended types (BIGINT NULL DEFAULT NULL) and constraints; ensure this migration
is safe to re-run (skip ALTER if column exists), backfill any needed default
values if desired, and log/report failures so runtime inserts referencing
num_messages/num_files will not error.
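
One way the idempotent migration described in this finding could look is sketched below. It is only a sketch under stated assumptions: the helper name `ensure_count_columns` is hypothetical, the cursor is assumed to be a DB-API cursor for MySQL/MariaDB that returns tuple rows and uses `%s` placeholders, and the table name is assumed to be trusted because it is interpolated into the `ALTER TABLE` statement.

```python
from typing import Any, List

# Columns introduced by this PR, both BIGINT NULL DEFAULT NULL.
NEW_COLUMNS = ("num_messages", "num_files")


def ensure_count_columns(db_cursor: Any, archives_table_name: str) -> List[str]:
    """Add any missing counter columns and return the names that were added."""
    # Look up the columns that already exist on the table in the current schema.
    db_cursor.execute(
        "SELECT COLUMN_NAME FROM information_schema.COLUMNS "
        "WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = %s",
        (archives_table_name,),
    )
    existing = {row[0] for row in db_cursor.fetchall()}

    added = []
    for column in NEW_COLUMNS:
        if column in existing:
            # Already present, so re-running the migration is a no-op.
            continue
        db_cursor.execute(
            f"ALTER TABLE `{archives_table_name}` "
            f"ADD COLUMN `{column}` BIGINT NULL DEFAULT NULL"
        )
        added.append(column)
    return added
```

Calling this right after the `CREATE TABLE IF NOT EXISTS` statement would cover both fresh installs (nothing to add) and upgrades (columns added once). Existing rows keep NULL in the new columns, which the nullable schema already allows.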
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7cc1c17f-5180-4fe2-9b6a-bffef29ebbad

📥 Commits

Reviewing files that changed from the base of the PR and between 85eaa70 and abffa6a.

📒 Files selected for processing (8)
  • components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
  • components/core/src/clp/streaming_archive/Constants.hpp
  • components/core/src/clp_s/ArchiveWriter.cpp
  • components/core/src/clp_s/ArchiveWriter.hpp
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
  • components/webui/client/src/pages/IngestPage/Details/index.tsx
  • components/webui/client/src/pages/IngestPage/Details/sql.ts
  • components/webui/client/src/pages/IngestPage/sqlConfig.ts

@junhaoliao junhaoliao marked this pull request as draft May 23, 2026 05:27
…um_files` stat.

Previously, `m_num_files` was incremented in `add_field_to_current_range()`
every time a new range was opened, which over-counted files when a single
large input file was split across multiple archives. Now the increment is
moved to the callers (`ingest_json`/`ingest_kvier`) where `file_split_number`
is available, so only the first split (split 0) of each input file is counted.
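
The counting rule in this commit message can be sketched in a few lines. The real change lives in the C++ ingestion paths; the Python helper `count_input_files` and the sample file names below are hypothetical and only illustrate why counting split 0 avoids the over-count.

```python
from typing import Iterable, Tuple


def count_input_files(splits: Iterable[Tuple[str, int]]) -> int:
    """Count input files by counting only their first split (split 0)."""
    return sum(1 for _path, file_split_number in splits if file_split_number == 0)


# One large file split into three parts, plus one small unsplit file.
splits = [
    ("big.jsonl", 0),
    ("big.jsonl", 1),
    ("big.jsonl", 2),
    ("small.jsonl", 0),
]
num_files = count_input_files(splits)  # counts each input file once
naive_num_files = len(splits)          # counts every split, over-counting big.jsonl
```

When later splits land in different archives, each of those archives adds nothing for that file, so summing `num_files` across archives still counts each input file once.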