Bug/134 genosbucketchunker #142

inoray · 2026-01-06T04:38:15Z

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

gemini-code-assist · 2026-01-06T04:38:38Z

Summary of Changes

Hello @inoray, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on improving the robustness and configurability of document processing within the genon preprocessor. It introduces a more flexible logging system, ensures graceful handling of optional dependencies, and significantly refines the document chunking strategy, particularly for legal documents, to better manage token limits and section segmentation.
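The graceful handling of optional dependencies mentioned above can be sketched as follows. This is a minimal illustration, not the code from the pull request: the `maybe_upload` helper is hypothetical, and the real signature of `upload_files` is not shown here, so a single positional argument is assumed.

```python
import logging

_log = logging.getLogger(__name__)

try:
    # genos_utils may not be installed in every environment.
    from genos_utils import upload_files
except ImportError:
    upload_files = None  # sentinel: uploads are skipped when the helper is missing


def maybe_upload(file_list, uploader=upload_files):
    """Hypothetical helper: call the uploader only when it is available.

    Returns True if the uploader ran and False if it was skipped.
    Assumes the uploader takes one positional argument.
    """
    if uploader is None:
        _log.warning(
            "genos_utils not installed; skipping upload of %d file(s)",
            len(file_list),
        )
        return False
    uploader(file_list)
    return True
```

Guarding the call site as well as the import keeps the processors usable in environments without `genos_utils`, at the cost of silently skipping uploads there, which is why the sketch logs a warning.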

Highlights

Enhanced Logging: Introduced a standardized setup_logging method across all main processor facades (attachment_processor, basic_processor, intelligent_processor, intelligent_processor_law, intelligent_processor_ocr). This allows dynamic configuration of logging levels via log_level keyword arguments and provides informative log messages at the start of processing.
Robust upload_files Integration: The upload_files utility from genos_utils is now conditionally imported using a try-except ImportError block. This ensures that the system remains functional even if genos_utils is not installed, preventing runtime errors. Calls to upload_files are also made conditional on its successful import.
Refined Document Chunking Logic: The intelligent_processor_law.py file has undergone significant changes to its chunking mechanism. The HybridChunker has been replaced by GenosBucketChunker, which includes new methods for token counting and evenly splitting document items into chunks based on token limits. This allows for more granular control over chunk sizes, especially for long sections.
Configurable Text Splitter Parameters: The RecursiveCharacterTextSplitter in attachment_processor.py and attachment_processor_origin.py now explicitly accepts chunk_size and chunk_overlap parameters from kwargs with default values, providing more flexibility in how documents are split into smaller text segments.
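To make the idea of evenly splitting items by token count more concrete, here is a rough sketch. It borrows the name `split_items_evenly_by_tokens` from the summary above, but the body is not taken from the pull request: the whitespace token counter, the string items, and the handling of a non-positive `max_tokens` are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split); a real tokenizer would go here."""
    return len(text.split())


def split_items_evenly_by_tokens(
    items: Sequence[str],
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[List[str]]:
    """Group consecutive items into chunks of roughly equal token size.

    A single item larger than max_tokens is kept whole in its own chunk.
    """
    if not items:
        return []
    if max_tokens <= 0:
        # Assumption for this sketch: a non-positive limit means "do not split".
        return [list(items)]

    sizes = [counter(item) for item in items]
    total = sum(sizes)
    if total <= max_tokens:
        return [list(items)]

    # Smallest number of chunks that could hold the total, then an even target.
    n_chunks = math.ceil(total / max_tokens)
    target = total / n_chunks

    chunks, current, current_size = [], [], 0
    for item, size in zip(items, sizes):
        # Close the chunk if the next item would break the hard limit,
        # or if the chunk has already reached the even target size.
        if current and (current_size + size > max_tokens or current_size >= target):
            chunks.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

With four three-token items and `max_tokens=10`, a greedy packer would produce chunks of three items and one item, while this sketch aims for two chunks of two items each.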

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

The pull request introduces logging functionality across multiple processor files (attachment_processor.py, basic_processor.py, intelligent_processor.py, intelligent_processor_law.py, intelligent_processor_ocr.py) by adding logging imports, _log instances, and a setup_logging method to configure logging levels dynamically. It also modifies the import of upload_files to use a try-except ImportError block, making it optional, and uncomments calls to upload_files within compose_vectors methods, conditionally executing them based on the upload_files import status. Whitespace issues were corrected in several files.

In intelligent_processor_law.py, the chunking logic was significantly refactored: HierarchicalChunker was renamed to GenosBucketChunker, the HybridChunker class was removed, and new methods for token-based splitting (get_text_from_item, split_items_evenly_by_tokens) were introduced to handle oversized chunks, with split_documents now using GenosBucketChunker with max_tokens=0.

For test.py, a sys.path modification was added for local testing and commented-out WMF conversion parameters were removed.

Review comments highlight that the setup_logging method, which calls logging.basicConfig(), is duplicated across multiple files and should be configured once at application startup, ideally moved to a common utility. Additionally, the refactoring of the chunking logic in intelligent_processor_law.py is noted as potentially incomplete, with concerns that the new oversized chunk splitting logic might not be correctly invoked or applied due to the max_tokens=0 setting, which could lead to a bug where large sections are not split. The direct modification of sys.path in test.py is also flagged as a brittle practice for dependency management.
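The recommendation above to configure logging once in a common utility could look roughly like the sketch below. The function mirrors the `setup_logging` name from the pull request, but its body, the log format, and the fallback to INFO for unknown level names are assumptions, not the project's actual code.

```python
import logging


def setup_logging(log_level="INFO"):
    """Configure the root logger once; later calls only adjust the level.

    Unknown level names fall back to INFO (an assumption of this sketch).
    """
    level = getattr(logging, str(log_level).upper(), None)
    if not isinstance(level, int):
        level = logging.INFO

    root = logging.getLogger()
    if not root.handlers:
        # basicConfig only has an effect while the root logger has no handlers,
        # so calling it from a single shared place avoids surprises.
        logging.basicConfig(
            level=level,
            format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        )
    root.setLevel(level)
    return level
```

Each facade could then import this one helper from a shared module and call it with something like `setup_logging(kwargs.get("log_level", "INFO"))`, rather than carrying its own copy of the method.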

genon/preprocessor/facade/intelligent_processor_law.py

genon/preprocessor/facade/attachment_processor.py

genon/preprocessor/facade/basic_processor.py

genon/preprocessor/facade/intelligent_processor.py

genon/preprocessor/facade/intelligent_processor_law.py

genon/preprocessor/facade/intelligent_processor_ocr.py

gemini-code-assist · 2026-01-06T04:41:04Z

genon/preprocessor/facade/test.py

+import sys
+sys.path.insert(0, "../../../") # reference the docling folder in the current doc_parser


Modifying sys.path directly is generally discouraged as it can make the project structure brittle and dependencies unclear. For local testing, it's better to use a proper packaging setup (e.g., using a setup.py or pyproject.toml file and installing the package in editable mode with pip install -e .). This makes imports work consistently without path manipulation.

inoray added 3 commits January 6, 2026 10:04

Attachment preprocessor: fix kwargs error (RecursiveCharacterTextSplitter function argument error)

7e9f141

Implement GenosBucketChunker (for regulations)

c540993

facade: add logging control. Add automatic handling of the genos_utils library reference

cd3f0ec

inoray linked an issue Jan 6, 2026 that may be closed by this pull request

Refactor the existing HybridChunker into GenosBucketChunker #134

Closed

gemini-code-assist bot reviewed Jan 6, 2026

genos_utils: improve error handling

18f07f4

inoray requested a review from mestanam-mnc January 6, 2026 05:16

mestanam-mnc approved these changes Jan 6, 2026

mestanam-mnc merged commit 6f0bea8 into develop Jan 6, 2026
2 checks passed

3 participants
