Skip to content

feat(clp-s): Add support for ingesting inputs to clp-s using LibArchive; add support for ingesting inputs to log-converter using LibArchive.#2295

Open
gibber9809 wants to merge 16 commits into
y-scope:mainfrom
gibber9809:libarchive-support-with-tests
Open

feat(clp-s): Add support for ingesting inputs to clp-s using LibArchive; add support for ingesting inputs to log-converter using LibArchive.#2295
gibber9809 wants to merge 16 commits into
y-scope:mainfrom
gibber9809:libarchive-support-with-tests

Conversation

@gibber9809
Copy link
Copy Markdown
Contributor

@gibber9809 gibber9809 commented May 26, 2026

Description

This PR adds support for ingesting data using libarchive in both clp-s and log-converter. This allows us to support .tar.gz inputs, as well as the various other formats that libarchive supports. Since libarchive supports such a wide variety of formats and deduces input types itself, we integrate libarchive as a fallback for when our normal input type deduction can not deduce the type of an input.

This change is implemented by adding a utility (try_process_archive_with_libarchive) in InputConfig that allows passing an arbitrary reader as well as handlers for various input types (log-text, json, kv-ir, etc.). This utility attempts to process the input with libarchive, and if successful invokes the relevant handlers for each file in the archive (or single file if e.g. the input is a single gzipped file) after deducing the file type. The utility function skips empty directories and other non-file content in an archive.

We also add some basic integration tests which mimic existing log-converter and clp-s binary integration tests after first converting test inputs into .tar.gz files.

Since libarchive requires linking against a litany of extra dependencies, we add a CLP_BUILD_CLP_S_ENABLE_LIBARCHIVE cmake option to disable building clp-s with libarchive support. This should allow us to reduce the number of dependencies we need to manage in projects like our presto plugin, where libarchive is unneeded anyway.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Added integration tests for ingesting .tar.gz files into clp-s and log-converter
  • Manually validated that log-converter and clp-s can now directly the download links for our open-source datasets on zenodo as inputs.

Summary by CodeRabbit

  • New Features

    • Added archive (.tar.gz) input support: archives are auto-extracted and members (JSON, KV-IR, log text, empty files) are processed during compression/conversion.
  • Tests

    • New integration tests verify lossless round-trip compression and log-conversion workflows using .tar.gz archives; added archive creation utility and a search-verification helper.
  • Documentation

    • input-path docs updated to list ndjson, KV-IR, tarballs, and gzip/Zstd variants.
  • Bug Fixes

    • Improved archive handling and clearer failure behaviour for unsupported or network-related input errors.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds libarchive-based processing so clp-s and log-converter accept tar.gz inputs, extract members, deduce member file types, and dispatch members to JSON, KV-IR, log-text, or empty-file handlers; includes build/linking, public API, implementations, parser/converter wiring, tests, and docs.

Changes

Archive input support for clp-s and log-converter

Layer / File(s) Summary
CMake build system integration
components/core/src/clp_s/CMakeLists.txt, components/core/src/clp_s/indexer/CMakeLists.txt
Adds libarchive reader source files and LibArchive::LibArchive linkage to clp_s_clp_dependencies, clp_s_io, and the indexer executable.
Archive processing API declaration
components/core/src/clp_s/InputConfig.hpp
Declares try_process_archive_with_libarchive(...) with std::function handlers for JSON, KV-IR, log-text, and empty members; adds <functional> and updates try_deduce_reader_type docs.
Archive processing implementation
components/core/src/clp_s/InputConfig.cpp
Implements try_process_archive_with_libarchive using LibarchiveReader/LibarchiveFileReader, iterates regular entries, deduces nested FileType, invokes handlers, and manages nested reader lifecycle; updates includes and nested-type switch.
JsonParser archive integration
components/core/src/clp_s/JsonParser.cpp
Updates JsonParser::ingest() to attempt archive processing for FileType::Unknown, wiring JSON/KV-IR/logtext handlers and checking network errors for nested readers.
log_converter archive integration
components/core/src/clp_s/log_converter/log_converter.cpp
Refactors convert_files to route LogText into conversion logic, add explicit Unknown handling that can process archives via try_process_archive_with_libarchive, and consolidate conversion result handling.
Test utilities and integration tests
integration-tests/tests/utils/archive.py, integration-tests/tests/binary_tests/test_identity_transformation.py, integration-tests/tests/binary_tests/test_log_converter.py
Adds create_tar_gz_from_dir utility, a clp-s tar.gz identity round-trip test, a log-converter tar.gz integration test, refactors search-result verification into _verify_search_results, and updates docs for input-path support.
CMake options flags
components/core/cmake/Options/options.cmake
Adds CLP_NEED_LIBARCHIVE to dependency flag lists for clp_s_clp_dependencies and clp_s_io.

Sequence Diagram

sequenceDiagram
  participant InputReader as InputReader
  participant ArchiveAPI as try_process_archive_with_libarchive
  participant Libarchive as LibarchiveReader/LibarchiveFileReader
  participant JsonParser as JsonParser::ingest
  participant LogConverter as log_converter::convert_files
  InputReader->>ArchiveAPI: open(reader, path, default_file_path)
  ArchiveAPI->>Libarchive: initialize and iterate entries
  Libarchive->>ArchiveAPI: emit regular entry (entry_path)
  ArchiveAPI->>JsonParser: json_handler(nested_reader, entry_path)
  ArchiveAPI->>JsonParser: kvir_handler(nested_reader, entry_path)
  ArchiveAPI->>LogConverter: logtext_handler(nested_reader, entry_path)
  ArchiveAPI->>InputReader: close nested readers on completion/failure
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • y-scope/clp#2240: Modifies log_converter/log_converter.cpp conversion flow; related because both PRs change archive/Unknown handling and conversion logic.

Suggested reviewers

  • LinZhihao-723
  • kirkrodrigues
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding LibArchive support for ingesting inputs in both clp-s and log-converter, which is the central purpose of the entire changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gibber9809 gibber9809 marked this pull request as ready for review May 26, 2026 19:59
@gibber9809 gibber9809 requested a review from a team as a code owner May 26, 2026 19:59
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/core/src/clp_s/log_converter/log_converter.cpp`:
- Around line 70-76: The empty CURL error handling in the
clp_s::FileType::Unknown branch should mirror JsonParser.cpp: when
clp_s::NetworkUtils::check_and_log_curl_error(path.path,
nested_readers.front().get()) returns true, close/cleanup the nested reader(s)
(e.g., nested_readers.front()->close() or equivalent) and return early from the
function instead of falling through to archive processing; update the case
handling around clp_s::FileType::Unknown to perform the cleanup and immediate
return so network errors are reported correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 425c9d36-fdbf-4304-b44b-b7833ee133ef

📥 Commits

Reviewing files that changed from the base of the PR and between 5e48c13 and 8187cc4.

📒 Files selected for processing (8)
  • components/core/src/clp_s/CMakeLists.txt
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/InputConfig.hpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/log_converter/log_converter.cpp
  • integration-tests/tests/binary_tests/test_identity_transformation.py
  • integration-tests/tests/binary_tests/test_log_converter.py
  • integration-tests/tests/utils/archive.py

Comment on lines +70 to +76
case clp_s::FileType::Unknown: {
if (false == nested_readers.empty()
&& clp_s::NetworkUtils::check_and_log_curl_error(
path.path,
nested_readers.front().get()
))
{}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Empty CURL error handling block does nothing.

The CURL error check at lines 71-76 has an empty block {}. If check_and_log_curl_error returns true, no action is taken and execution continues to attempt archive processing. This is inconsistent with JsonParser.cpp (lines 679-689), which closes readers and returns early when a CURL error is detected.

This likely causes confusing error messages when a network error occurs, as the code will try archive processing, fail, and then log "Received input that was not unstructured logtext" instead of the actual network error.

Proposed fix
             case clp_s::FileType::Unknown: {
                 if (false == nested_readers.empty()
                     && clp_s::NetworkUtils::check_and_log_curl_error(
                             path.path,
                             nested_readers.front().get()
                     ))
-                {}
+                {
+                    clp_s::close_nested_readers(nested_readers);
+                    return false;
+                }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/core/src/clp_s/log_converter/log_converter.cpp` around lines 70 -
76, The empty CURL error handling in the clp_s::FileType::Unknown branch should
mirror JsonParser.cpp: when
clp_s::NetworkUtils::check_and_log_curl_error(path.path,
nested_readers.front().get()) returns true, close/cleanup the nested reader(s)
(e.g., nested_readers.front()->close() or equivalent) and return early from the
function instead of falling through to archive processing; update the case
handling around clp_s::FileType::Unknown to perform the cleanup and immediate
return so network errors are reported correctly.

@gibber9809 gibber9809 changed the title Libarchive support with tests feat(clp-s): Add support for ingesting inputs to clp-s using LibArchive; add support for ingesting inputs to log-converter using LibArchive. May 26, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
components/core/src/clp_s/log_converter/log_converter.cpp (1)

70-78: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Close nested_readers before this early return.

This now stops on CURL errors, but Line 77 still exits without cleaning up the reader chain. That leaves the Unknown error path inconsistent with JsonParser.cpp and can leak handles / skip close-time error propagation.

Proposed fix
                 if (false == nested_readers.empty()
                     && clp_s::NetworkUtils::check_and_log_curl_error(
                             path.path,
                             nested_readers.front().get()
                     ))
                 {
+                    clp_s::close_nested_readers(nested_readers);
                     return false;
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/core/src/clp_s/log_converter/log_converter.cpp` around lines 70 -
78, In the FileType::Unknown branch, ensure the nested_readers chain is closed
before the early return: call close() or otherwise clear/flush each reader in
nested_readers (the same cleanup behavior used in JsonParser.cpp) immediately
after clp_s::NetworkUtils::check_and_log_curl_error(...) returns true and before
returning false; update the block around nested_readers.front().get() to perform
that cleanup so handles are not leaked and close-time errors can propagate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@components/core/src/clp_s/log_converter/log_converter.cpp`:
- Around line 70-78: In the FileType::Unknown branch, ensure the nested_readers
chain is closed before the early return: call close() or otherwise clear/flush
each reader in nested_readers (the same cleanup behavior used in JsonParser.cpp)
immediately after clp_s::NetworkUtils::check_and_log_curl_error(...) returns
true and before returning false; update the block around
nested_readers.front().get() to perform that cleanup so handles are not leaked
and close-time errors can propagate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9610a4bf-70f0-415a-99c3-78b66f998f12

📥 Commits

Reviewing files that changed from the base of the PR and between 8e6b812 and e7ada8e.

📒 Files selected for processing (1)
  • components/core/src/clp_s/log_converter/log_converter.cpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants