feat(clp-s): Add support for ingesting inputs to clp-s using LibArchive; add support for ingesting inputs to log-converter using LibArchive.#2295
Conversation
…-scope#2251) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Quinn Taylor Mitchell <q.mitchell@mail.utoronto.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com>
…using libarchive when we can't deduce their type.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds libarchive-based processing so clp-s and log-converter accept tar.gz inputs, extract members, deduce member file types, and dispatch members to JSON, KV-IR, log-text, or empty-file handlers; includes build/linking, public API, implementations, parser/converter wiring, tests, and docs. ChangesArchive input support for clp-s and log-converter
Sequence DiagramsequenceDiagram
participant InputReader as InputReader
participant ArchiveAPI as try_process_archive_with_libarchive
participant Libarchive as LibarchiveReader/LibarchiveFileReader
participant JsonParser as JsonParser::ingest
participant LogConverter as log_converter::convert_files
InputReader->>ArchiveAPI: open(reader, path, default_file_path)
ArchiveAPI->>Libarchive: initialize and iterate entries
Libarchive->>ArchiveAPI: emit regular entry (entry_path)
ArchiveAPI->>JsonParser: json_handler(nested_reader, entry_path)
ArchiveAPI->>JsonParser: kvir_handler(nested_reader, entry_path)
ArchiveAPI->>LogConverter: logtext_handler(nested_reader, entry_path)
ArchiveAPI->>InputReader: close nested readers on completion/failure
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/core/src/clp_s/log_converter/log_converter.cpp`:
- Around line 70-76: The empty CURL error handling in the
clp_s::FileType::Unknown branch should mirror JsonParser.cpp: when
clp_s::NetworkUtils::check_and_log_curl_error(path.path,
nested_readers.front().get()) returns true, close/cleanup the nested reader(s)
(e.g., nested_readers.front()->close() or equivalent) and return early from the
function instead of falling through to archive processing; update the case
handling around clp_s::FileType::Unknown to perform the cleanup and immediate
return so network errors are reported correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 425c9d36-fdbf-4304-b44b-b7833ee133ef
📒 Files selected for processing (8)
components/core/src/clp_s/CMakeLists.txtcomponents/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/InputConfig.hppcomponents/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/log_converter/log_converter.cppintegration-tests/tests/binary_tests/test_identity_transformation.pyintegration-tests/tests/binary_tests/test_log_converter.pyintegration-tests/tests/utils/archive.py
| case clp_s::FileType::Unknown: { | ||
| if (false == nested_readers.empty() | ||
| && clp_s::NetworkUtils::check_and_log_curl_error( | ||
| path.path, | ||
| nested_readers.front().get() | ||
| )) | ||
| {} |
There was a problem hiding this comment.
Empty CURL error handling block does nothing.
The CURL error check at lines 71-76 has an empty block {}. If check_and_log_curl_error returns true, no action is taken and execution continues to attempt archive processing. This is inconsistent with JsonParser.cpp (lines 679-689), which closes readers and returns early when a CURL error is detected.
This likely causes confusing error messages when a network error occurs, as the code will try archive processing, fail, and then log "Received input that was not unstructured logtext" instead of the actual network error.
Proposed fix
case clp_s::FileType::Unknown: {
if (false == nested_readers.empty()
&& clp_s::NetworkUtils::check_and_log_curl_error(
path.path,
nested_readers.front().get()
))
- {}
+ {
+ clp_s::close_nested_readers(nested_readers);
+ return false;
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@components/core/src/clp_s/log_converter/log_converter.cpp` around lines 70 -
76, The empty CURL error handling in the clp_s::FileType::Unknown branch should
mirror JsonParser.cpp: when
clp_s::NetworkUtils::check_and_log_curl_error(path.path,
nested_readers.front().get()) returns true, close/cleanup the nested reader(s)
(e.g., nested_readers.front()->close() or equivalent) and return early from the
function instead of falling through to archive processing; update the case
handling around clp_s::FileType::Unknown to perform the cleanup and immediate
return so network errors are reported correctly.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
components/core/src/clp_s/log_converter/log_converter.cpp (1)
70-78:⚠️ Potential issue | 🟠 Major | ⚡ Quick winClose
nested_readersbefore this early return.This now stops on CURL errors, but Line 77 still exits without cleaning up the reader chain. That leaves the
Unknownerror path inconsistent withJsonParser.cppand can leak handles / skip close-time error propagation.Proposed fix
if (false == nested_readers.empty() && clp_s::NetworkUtils::check_and_log_curl_error( path.path, nested_readers.front().get() )) { + clp_s::close_nested_readers(nested_readers); return false; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/core/src/clp_s/log_converter/log_converter.cpp` around lines 70 - 78, In the FileType::Unknown branch, ensure the nested_readers chain is closed before the early return: call close() or otherwise clear/flush each reader in nested_readers (the same cleanup behavior used in JsonParser.cpp) immediately after clp_s::NetworkUtils::check_and_log_curl_error(...) returns true and before returning false; update the block around nested_readers.front().get() to perform that cleanup so handles are not leaked and close-time errors can propagate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@components/core/src/clp_s/log_converter/log_converter.cpp`:
- Around line 70-78: In the FileType::Unknown branch, ensure the nested_readers
chain is closed before the early return: call close() or otherwise clear/flush
each reader in nested_readers (the same cleanup behavior used in JsonParser.cpp)
immediately after clp_s::NetworkUtils::check_and_log_curl_error(...) returns
true and before returning false; update the block around
nested_readers.front().get() to perform that cleanup so handles are not leaked
and close-time errors can propagate.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9610a4bf-70f0-415a-99c3-78b66f998f12
📒 Files selected for processing (1)
components/core/src/clp_s/log_converter/log_converter.cpp
Description
This PR adds support for ingesting data using libarchive in both clp-s and log-converter. This allows us to support .tar.gz inputs, as well as the various other formats that libarchive supports. Since libarchive supports such a wide variety of formats and deduces input types itself, we integrate libarchive as a fallback for when our normal input type deduction can not deduce the type of an input.
This change is implemented by adding a utility (
try_process_archive_with_libarchive) in InputConfig that allows passing an arbitrary reader as well as handlers for various input types (log-text, json, kv-ir, etc.). This utility attempts to process the input with libarchive, and if successful invokes the relevant handlers for each file in the archive (or single file if e.g. the input is a single gzipped file) after deducing the file type. The utility function skips empty directories and other non-file content in an archive.We also add some basic integration tests which mimic existing log-converter and clp-s binary integration tests after first converting test inputs into .tar.gz files.
Since libarchive requires linking against a litany of extra dependencies, we add a
CLP_BUILD_CLP_S_ENABLE_LIBARCHIVEcmake option to disable building clp-s with libarchive support. This should allow us to reduce the number of dependencies we need to manage in projects like our presto plugin, where libarchive is unneeded anyway.Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Tests
Documentation
Bug Fixes