feat(CLP-JSON): Add Messages and Files stats display to CLP-JSON Web UI Ingest page. by junhaoliao · Pull Request #2293 · y-scope/clp

junhaoliao · 2026-05-23T04:53:34Z

Description

Problem

The CLP-JSON (clp-s) web UI Ingest page Details component only shows the TimeRange card, while CLP-Text additionally renders Messages (total log event count) and Files (total file count) cards. The root cause is that CLP-S doesn't populate the same data pipeline that CLP-Text relies on:

| Aspect | CLP-Text | CLP-S (before this PR) |
| --- | --- | --- |
| `num_messages` source | `SUM(num_messages)` from files table (per-file, `NOT NULL`) | Not tracked; `ArchiveStats` omits it |
| `num_files` source | `COUNT(DISTINCT orig_file_id)` from files table | Not tracked; `ArchiveStats` omits it |
| Archives table columns | `num_messages`, `num_files` left as `NULL` by C++ inserter | Same columns left as `NULL` |
| Web UI SQL | Joins `archives` + `files` tables | Joins `archives` + `files` tables, but files table is empty for CLP-S |

Because CLP-S never populates the files table, the join always yields num_messages=0 and num_files=0.

Solution

C++ layer — Extend ArchiveStats to include num_messages and num_files, matching how CLP-Text's data ultimately reaches the UI (but via a different source):

| Aspect | CLP-Text | CLP-S (after this PR) |
| --- | --- | --- |
| Stats class | `ArchiveMetadata`, with no `num_messages`/`num_files` fields | `ArchiveStats`, which now includes both |
| `num_messages` computed by | Per-file in SQLite/MySQL `files` table | `ArchiveWriter::m_next_log_event_id` (running log-event count) |
| `num_files` computed by | `COUNT(DISTINCT orig_file_id)` in SQL from `files` table | `ArchiveWriter::m_num_files` (range-index range count) |
| JSON output | `{id, uncompressed_size, size}` only | Now includes `num_messages` and `num_files` |

Database layer — Add nullable BIGINT columns num_messages and num_files to the archives table schema. The Python compression task's update_archive_metadata() now inserts these values from the C++ stats output, while CLP-Text's C++ inserter continues leaving them NULL (hence nullable).
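The nullable-column behaviour described above can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` as a stand-in for the MySQL metadata database; the table name `demo_archives`, the helper `insert_archive_stats`, and the reduced column set are assumptions for the example, not the actual `update_archive_metadata()` code.

```python
import sqlite3


def create_archives_table(conn: sqlite3.Connection, table: str) -> None:
    # Hypothetical, reduced schema. The PR describes the new columns as
    # nullable BIGINTs; only a few of the real columns are reproduced here.
    conn.execute(
        f"""
        CREATE TABLE IF NOT EXISTS "{table}" (
            id TEXT PRIMARY KEY,
            uncompressed_size BIGINT NOT NULL,
            size BIGINT NOT NULL,
            num_messages BIGINT DEFAULT NULL,
            num_files BIGINT DEFAULT NULL
        )
        """
    )


def insert_archive_stats(conn: sqlite3.Connection, table: str, stats: dict) -> None:
    # `stats` mimics the JSON stats object emitted per archive. Keys that
    # are absent (as for a writer that does not report them) become NULL.
    conn.execute(
        f'INSERT INTO "{table}" '
        "(id, uncompressed_size, size, num_messages, num_files) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            stats["id"],
            stats["uncompressed_size"],
            stats["size"],
            stats.get("num_messages"),
            stats.get("num_files"),
        ),
    )


conn = sqlite3.connect(":memory:")
create_archives_table(conn, "demo_archives")
insert_archive_stats(
    conn,
    "demo_archives",
    {"id": "a1", "uncompressed_size": 1000, "size": 100,
     "num_messages": 42, "num_files": 1},
)
# A row written without the new stats keeps NULL in both columns.
insert_archive_stats(
    conn, "demo_archives", {"id": "a2", "uncompressed_size": 2000, "size": 150}
)
rows = conn.execute(
    'SELECT id, num_messages, num_files FROM "demo_archives" ORDER BY id'
).fetchall()
print(rows)
```

Keeping the columns nullable lets both writers share one schema: a writer that reports the counts stores them, and one that does not simply leaves `NULL`.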

Web UI layer — Rewrite buildMultiDatasetDetailsSql() to query only from the archives table, eliminating the dependency on the always-empty files table:

| Aspect | CLP-Text | CLP-S (after this PR) |
| --- | --- | --- |
| Tables queried | `clp_archives` + `clp_files` (join) | `<prefix>_<dataset>_archives` only (UNION ALL across datasets) |
| `num_messages` SQL | `SUM(num_messages)` from files | `COALESCE(SUM(num_messages), 0)` from archives |
| `num_files` SQL | `COUNT(DISTINCT orig_file_id)` from files | `COALESCE(SUM(num_files), 0)` from archives |
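The archives-only aggregation can be illustrated with a short sketch. It is written in Python against an in-memory SQLite database rather than in the web UI's TypeScript; the function name `build_details_sql`, the `clp` prefix, and the dataset names are assumptions for the example, and the real `buildMultiDatasetDetailsSql()` may be structured differently.

```python
import sqlite3


def build_details_sql(prefix: str, datasets: list[str]) -> str:
    """Build one aggregate query over every dataset's archives table."""
    if not datasets:
        raise ValueError("at least one dataset is required")
    # Table names are interpolated directly, so they are assumed trusted.
    per_dataset = [
        f'SELECT num_messages, num_files FROM "{prefix}_{name}_archives"'
        for name in datasets
    ]
    union = "\nUNION ALL\n".join(per_dataset)
    return (
        "SELECT COALESCE(SUM(num_messages), 0) AS num_messages,\n"
        "       COALESCE(SUM(num_files), 0) AS num_files\n"
        f"FROM (\n{union}\n) AS all_archives"
    )


conn = sqlite3.connect(":memory:")
for name in ("default", "other"):
    conn.execute(
        f'CREATE TABLE "clp_{name}_archives" (num_messages BIGINT, num_files BIGINT)'
    )

# With no archives, SUM() yields NULL and COALESCE turns it into 0.
empty_totals = conn.execute(build_details_sql("clp", ["default", "other"])).fetchone()

# A NULL row stands in for an archive written without the new stats;
# SUM() skips it.
conn.execute('INSERT INTO "clp_default_archives" VALUES (10, 1), (NULL, NULL)')
conn.execute('INSERT INTO "clp_other_archives" VALUES (5, 2)')
totals = conn.execute(build_details_sql("clp", ["default", "other"])).fetchone()
print(empty_totals, totals)
```

Because each archive row already carries its own counts, a plain `SUM` over the union is enough and no join against a files table is needed.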

Update the CLP-S Details component to render Messages and Files cards using the same grid layout as CLP-Text.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Build

Task: Verify the full project builds successfully including the new C++ changes and web UI.

Command:

task

Output:

(Exit code 0 — build completed successfully, clp-package produced)

End-to-end compression test (CLP-S)

Task: Verify the full compression pipeline — clp-s now outputs num_messages and num_files in its JSON stats, the Python task writes them to the DB, and the archives table contains non-NULL values.

Command:

cd build/clp-package
./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl

Output:

2026-05-23T02:31:42.871 INFO [compress] Compression job 1 submitted.
2026-05-23T02:31:44.875 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 198.26MB/s.
2026-05-23T02:31:45.376 INFO [compress] Compression finished.
2026-05-23T02:31:45.377 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 178.34MB/s.

Database verification

Task: Verify the num_messages and num_files columns in the archives table contain non-NULL, non-zero values after compression.

Command:

docker exec <database-container> mysql -u clp-user -p<password> clp-db -e "SELECT id, num_messages, num_files FROM clp_default_archives LIMIT 5;"

Output:

id	num_messages	num_files
ac317804-58aa-4a20-8e71-32321cf9ffbd	1000000	1

Explanation: Both num_messages (1,000,000 log events) and num_files (1 file) are populated with non-NULL values, confirming the C++ → Python → DB pipeline works correctly.

Web UI browser validation (CLP-S Ingest page)

Task: Verify the CLP-JSON Ingest page now displays Messages and Files cards alongside the TimeRange card, using a browser.

Navigated to http://localhost:4000/ and confirmed the Details section renders all three cards with correct values:

Explanation: The screenshot shows the Time Range, Messages (1,000,000), and Files (1) cards all rendering correctly. The grid layout has 3 children (timeRange spanning 2 columns, Messages and Files each in 1 column), matching the CLP-Text layout.

Details grid layout verification

Task: Verify the CSS grid layout structure matches CLP-Text (2-column grid with Time Range spanning full width).

Inspected the computed styles of the details grid element in the browser:

Explanation: The grid uses a 2-column layout with 8px gap. The Time Range child spans both columns (via grid-column: span 2 CSS), while Messages and Files each occupy one column — matching the CLP-Text Details layout exactly.

Summary by CodeRabbit

Release Notes

New Features
- Archives now track message and file counts as metadata metrics available in the interface.
UI Changes
- Archive details page now displays message and file counts consistently across all archive types.
- Removed conditional layout rendering to provide a unified details view.
Improvements
- Optimized metrics retrieval by aggregating message and file counts directly from archive metadata.

coderabbitai · 2026-05-23T04:53:48Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d6c72693-9eb4-465d-94fa-f28e793fe71a

Walkthrough

This PR extends the archive metadata pipeline to track and persist num_messages and num_files counters across compression, storage, and web UI components. The database schema and C++ structures are updated to define and collect these metrics, which then flow through job orchestration persistence and into SQL queries powering the ingest details page.

Changes

Archive Statistics Tracking Enhancement

| Layer / File(s) | Summary |
| --- | --- |
| Schema and metadata field definitions: `components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py`, `components/core/src/clp/streaming_archive/Constants.hpp`, `components/webui/client/src/pages/IngestPage/sqlConfig.ts` | Database schema adds nullable `num_messages` and `num_files` BIGINT columns; C++ metadata constants and SQL column enum define the field names for use throughout the pipeline. |
| Archive statistics collection in C++: `components/core/src/clp_s/ArchiveWriter.hpp`, `components/core/src/clp_s/ArchiveWriter.cpp` | ArchiveWriter increments `m_num_files` when opening ranges, and ArchiveStats now accepts and exposes these counters via constructor parameters, JSON serialization, and public getter methods. |
| Statistics persistence to database: `components/job-orchestration/job_orchestration/executor/compress/compression_task.py` | Archive statistics insertion now requires and persists `num_messages` and `num_files` to the archives table. |
| Query and UI integration: `components/webui/client/src/pages/IngestPage/Details/sql.ts`, `components/webui/client/src/pages/IngestPage/Details/index.tsx` | Multi-dataset details SQL now aggregates counts from the archives table directly; Details component uses a single unified layout rendering time range, messages, and files with the new metrics. |

Sequence Diagram

sequenceDiagram
  participant ArchiveWriter
  participant ArchiveStats
  participant CompressionTask
  participant Database
  participant DetailsUI as Details UI
  ArchiveWriter->>ArchiveWriter: increment m_num_files on range open
  ArchiveWriter->>ArchiveStats: construct with m_num_files, m_next_log_event_id
  ArchiveStats->>ArchiveStats: serialize to JSON
  CompressionTask->>Database: insert archive stats with num_messages, num_files
  DetailsUI->>Database: query archives table
  Database->>DetailsUI: return aggregated SUM(num_messages), SUM(num_files)
  DetailsUI->>DetailsUI: render unified Messages and Files components

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change (adding Messages and Files stats display to the CLP-JSON Web UI Ingest page), which aligns with the comprehensive changeset across C++, database, and Web UI layers. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (1)

23-42: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add an explicit schema migration path for existing archives tables.

CREATE TABLE IF NOT EXISTS won’t add new columns to already-existing tables. With the new inserts expecting num_messages and num_files, upgrades can fail at runtime unless these columns are backfilled via migration.

Suggested fix direction

 def _create_archives_table(db_cursor, archives_table_name: str) -> None:
     db_cursor.execute(
         f"""
         CREATE TABLE IF NOT EXISTS `{archives_table_name}` (
             ...
             `num_messages` BIGINT NULL DEFAULT NULL,
             `num_files` BIGINT NULL DEFAULT NULL,
             ...
         )
         """
     )
+
+    # Ensure upgrades add newly introduced nullable columns on existing tables.
+    # Implement with your existing migration strategy/dialect-safe checks.
+    # Example: conditionally ALTER TABLE when column is missing.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py` around lines
23 - 42, The current _create_archives_table uses CREATE TABLE IF NOT EXISTS
which does not add new columns to existing tables; add an idempotent migration
step after the CREATE that checks information_schema (or equivalent) for the
presence of the columns `num_messages` and `num_files` on the table named by
archives_table_name and, if missing, runs an ALTER TABLE to add them with the
intended types (BIGINT NULL DEFAULT NULL) and constraints; ensure this migration
is safe to re-run (skip ALTER if column exists), backfill any needed default
values if desired, and log/report failures so runtime inserts referencing
num_messages/num_files will not error.
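The idempotent migration step described in this prompt can be sketched as follows. This is an illustration under stated assumptions, not the project's migration code: it runs against SQLite and uses `PRAGMA table_info` where a MySQL implementation would query `information_schema.COLUMNS`, and the helper names `existing_columns` and `add_missing_columns` are hypothetical.

```python
import sqlite3

# Columns introduced by the PR. On MySQL the review suggests
# `BIGINT NULL DEFAULT NULL`; the SQLite stand-in uses an equivalent form.
NEW_COLUMNS = {
    "num_messages": "BIGINT DEFAULT NULL",
    "num_files": "BIGINT DEFAULT NULL",
}


def existing_columns(conn: sqlite3.Connection, table: str) -> set[str]:
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    return {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}


def add_missing_columns(conn: sqlite3.Connection, table: str) -> list[str]:
    """Add any missing new column; return the names that were added."""
    present = existing_columns(conn, table)
    added = []
    for name, ddl in NEW_COLUMNS.items():
        if name not in present:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN {name} {ddl}')
            added.append(name)
    return added


conn = sqlite3.connect(":memory:")
# Simulate a table created before the upgrade, without the new columns.
conn.execute(
    'CREATE TABLE "clp_default_archives" (id TEXT PRIMARY KEY, size BIGINT NOT NULL)'
)
conn.execute("""INSERT INTO "clp_default_archives" VALUES ('old-archive', 100)""")

first_run = add_missing_columns(conn, "clp_default_archives")
second_run = add_missing_columns(conn, "clp_default_archives")  # no-op on re-run
old_row = conn.execute(
    'SELECT id, num_messages, num_files FROM "clp_default_archives"'
).fetchone()
print(first_run, second_run, old_row)
```

Rows that predate the migration keep `NULL` in both columns, which the nullable column definition and a `COALESCE`-based query can tolerate.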

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7cc1c17f-5180-4fe2-9b6a-bffef29ebbad

📥 Commits

Reviewing files that changed from the base of the PR and between 85eaa70 and abffa6a.

📒 Files selected for processing (8)

components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
components/core/src/clp/streaming_archive/Constants.hpp
components/core/src/clp_s/ArchiveWriter.cpp
components/core/src/clp_s/ArchiveWriter.hpp
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
components/webui/client/src/pages/IngestPage/Details/index.tsx
components/webui/client/src/pages/IngestPage/Details/sql.ts
components/webui/client/src/pages/IngestPage/sqlConfig.ts

…um_files` stat. Previously, `m_num_files` was incremented in `add_field_to_current_range()` every time a new range was opened, which over-counted files when a single large input file was split across multiple archives. Now the increment is moved to the callers (`ingest_json`/`ingest_kvier`) where `file_split_number` is available, so only the first split (split 0) of each input file is counted.
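The difference between the two counting rules can be shown with a small Python sketch. The data model here is an assumption made for illustration: each archive is represented as a list of `(path, file_split_number)` pairs, with one range opened per pair; the function names are hypothetical and do not mirror the C++ code.

```python
def count_range_opens(archive_splits: list[tuple[str, int]]) -> int:
    # Earlier rule: one increment every time a new range is opened.
    return len(archive_splits)


def count_first_splits(archive_splits: list[tuple[str, int]]) -> int:
    # Revised rule: count an input file only where its split 0 lands.
    return sum(1 for _path, split_number in archive_splits if split_number == 0)


# One large file split across three archives, plus one small file.
archives = [
    [("big.jsonl", 0)],
    [("big.jsonl", 1)],
    [("big.jsonl", 2), ("small.jsonl", 0)],
]

old_total = sum(count_range_opens(a) for a in archives)
new_total = sum(count_first_splits(a) for a in archives)
print(old_total, new_total)
```

Summed across archives, the revised rule yields the number of distinct input files (2 here), while counting every range open reports 4.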

feat(CLP-JSON): Add Messages and Files stats display to CLP-JSON Web UI Ingest page.

abffa6a

junhaoliao requested review from a team and gibber9809 as code owners May 23, 2026 04:53

coderabbitai Bot reviewed May 23, 2026


junhaoliao marked this pull request as draft May 23, 2026 05:27

junhaoliao wants to merge 2 commits into y-scope:main from junhaoliao:clp-json-stats
