@inoray (Collaborator) commented Jan 6, 2026

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@inoray inoray linked an issue Jan 6, 2026 that may be closed by this pull request
@gemini-code-assist

Summary of Changes

Hello @inoray, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on improving the robustness and configurability of document processing within the genon preprocessor. It introduces a more flexible logging system, ensures graceful handling of optional dependencies, and significantly refines the document chunking strategy, particularly for legal documents, to better manage token limits and section segmentation.

Highlights

  • Enhanced Logging: Introduced a standardized setup_logging method across all main processor facades (attachment_processor, basic_processor, intelligent_processor, intelligent_processor_law, intelligent_processor_ocr). This allows dynamic configuration of logging levels via log_level keyword arguments and provides informative log messages at the start of processing.
  • Robust upload_files Integration: The upload_files utility from genos_utils is now conditionally imported using a try-except ImportError block. This ensures that the system remains functional even if genos_utils is not installed, preventing runtime errors. Calls to upload_files are also made conditional on its successful import.
  • Refined Document Chunking Logic: The intelligent_processor_law.py file has undergone significant changes to its chunking mechanism. The HybridChunker has been replaced by GenosBucketChunker, which includes new methods for token counting and evenly splitting document items into chunks based on token limits. This allows for more granular control over chunk sizes, especially for long sections.
  • Configurable Text Splitter Parameters: The RecursiveCharacterTextSplitter in attachment_processor.py and attachment_processor_origin.py now explicitly accepts chunk_size and chunk_overlap parameters from kwargs with default values, providing more flexibility in how documents are split into smaller text segments.
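
A minimal sketch of how the logging, optional-import, and kwargs-default patterns from the highlights might fit together in one facade. Only `upload_files`, `genos_utils`, `setup_logging`, `log_level`, `chunk_size`, and `chunk_overlap` come from this pull request; the `DocumentProcessor` class, the other method names, and the default values are assumptions made for illustration, and the real signature of `upload_files` is not shown here.

```python
import logging

_log = logging.getLogger(__name__)

# Optional dependency: keep the module importable when genos_utils is absent.
try:
    from genos_utils import upload_files
except ImportError:
    upload_files = None


class DocumentProcessor:
    """Hypothetical facade; not the class used in the PR."""

    def setup_logging(self, **kwargs) -> int:
        # Accept a level name such as "DEBUG"; fall back to INFO if unknown.
        level_name = str(kwargs.get("log_level", "INFO")).upper()
        level = logging.getLevelName(level_name)
        if not isinstance(level, int):
            level = logging.INFO
        logging.basicConfig(level=level)
        _log.setLevel(level)
        return level

    def splitter_params(self, **kwargs) -> dict:
        # Default values here are assumed, not taken from the PR.
        return {
            "chunk_size": kwargs.get("chunk_size", 1000),
            "chunk_overlap": kwargs.get("chunk_overlap", 100),
        }

    def maybe_upload(self, *args, **kwargs) -> bool:
        # Only call upload_files when the import above succeeded.
        if upload_files is None:
            _log.info("genos_utils not installed; skipping upload")
            return False
        upload_files(*args, **kwargs)
        return True
```

The dictionary returned by `splitter_params` could then be unpacked into a text splitter constructor that accepts `chunk_size` and `chunk_overlap`.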


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces logging functionality across multiple processor files (attachment_processor.py, basic_processor.py, intelligent_processor.py, intelligent_processor_law.py, intelligent_processor_ocr.py) by adding logging imports, _log instances, and a setup_logging method to configure logging levels dynamically. It also modifies the import of upload_files to use a try-except ImportError block, making it optional, and uncomments calls to upload_files within compose_vectors methods, conditionally executing them based on the upload_files import status. Whitespace issues were corrected in several files.

In intelligent_processor_law.py, the chunking logic was significantly refactored: HierarchicalChunker was renamed to GenosBucketChunker, the HybridChunker class was removed, and new methods for token-based splitting (get_text_from_item, split_items_evenly_by_tokens) were introduced to handle oversized chunks, with split_documents now using GenosBucketChunker with max_tokens=0. For test.py, a sys.path modification was added for local testing and commented-out WMF conversion parameters were removed.

Review comments highlight that the setup_logging method, which calls logging.basicConfig(), is duplicated across multiple files and should be configured once at application startup, ideally moved to a common utility. Additionally, the refactoring of the chunking logic in intelligent_processor_law.py is noted as potentially incomplete, with concerns that the new oversized chunk splitting logic might not be correctly invoked or applied due to the max_tokens=0 setting, which could lead to a bug where large sections are not split. The direct modification of sys.path in test.py is also flagged as a brittle practice for dependency management.
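
To make the max_tokens=0 concern concrete, here is a rough sketch of even, token-based splitting. It is not the implementation in intelligent_processor_law.py: the whitespace `count_tokens` helper, the string items, and the early-return guard are all assumptions. The guard only illustrates how a limit of 0 could leave a large section unsplit if the real code treats 0 as "no limit".

```python
import math
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (whitespace words); a real pipeline would use
    # the tokenizer of the embedding model.
    return len(text.split())


def split_items_evenly_by_tokens(
    items: List[str],
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[List[str]]:
    """Group items into buckets of roughly equal token totals."""
    if not items:
        return []
    if max_tokens <= 0:
        # Assumed guard: with no positive limit, nothing is split.
        return [list(items)]
    total = sum(counter(text) for text in items)
    if total <= max_tokens:
        return [list(items)]

    # Aim for the smallest bucket count that fits, then balance around it.
    n_buckets = math.ceil(total / max_tokens)
    target = total / n_buckets

    buckets: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for text in items:
        n = counter(text)
        # Close the bucket once adding this item would pass the target.
        if current and current_tokens + n > target:
            buckets.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        buckets.append(current)
    return buckets
```

A single item larger than the limit still ends up alone in an oversized bucket in this sketch, so a real chunker would need a further step to split inside such an item.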

Comment on lines +8 to +9
import sys
sys.path.insert(0, "../../../") # reference the docling folder of the current doc_parser

medium

Modifying sys.path directly is generally discouraged as it can make the project structure brittle and dependencies unclear. For local testing, it's better to use a proper packaging setup (e.g., using a setup.py or pyproject.toml file and installing the package in editable mode with pip install -e .). This makes imports work consistently without path manipulation.
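
If path manipulation has to stay for a quick local test until packaging is in place, one less brittle variant is to anchor the path to the test file's location rather than the current working directory. This is only a sketch of that fallback, not what the PR does and not a substitute for an editable install; `ancestor_dir` and `add_to_sys_path` are hypothetical helper names.

```python
import sys
from pathlib import Path


def ancestor_dir(file_path: str, levels_up: int) -> Path:
    """Return the directory `levels_up` levels above the one containing file_path."""
    return Path(file_path).resolve().parents[levels_up]


def add_to_sys_path(directory: Path) -> None:
    # Insert once, at the front, so the local checkout is found first.
    entry = str(directory)
    if entry not in sys.path:
        sys.path.insert(0, entry)
```

In test.py this could be called as `add_to_sys_path(ancestor_dir(__file__, 3))`, which points at the same directory as "../../../" only when the script is run from its own folder, but does so regardless of where it is launched.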

@inoray inoray requested a review from mestanam-mnc January 6, 2026 05:16
@mestanam-mnc mestanam-mnc merged commit 6f0bea8 into develop Jan 6, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

Refactor the existing HybridChunker into GenosBucketChunker

3 participants