Skip to content

Conversation

@inoray
Copy link
Collaborator

@inoray inoray commented Nov 25, 2025

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@inoray inoray linked an issue Nov 25, 2025 that may be closed by this pull request
@gemini-code-assist
Copy link

Summary of Changes

Hello @inoray, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the document chunking mechanism by improving how sections are identified, split, and merged. The changes focus on maintaining logical document structure, especially concerning section headers and their hierarchical levels, to ensure that generated document chunks are more semantically coherent and adhere to token limits effectively.

Highlights

  • Refactored Document Splitting Logic: The _split_document_by_tokens function has been refactored into a clearer, multi-stage process, including dedicated steps for initial section splitting, text generation, single-line title merging, and final token-based merging.
  • Introduced Helper Functions: Two new helper functions, get_header_level and get_current_chunk, were added to encapsulate common logic, improving readability and maintainability of the main splitting function.
  • Improved Single-Line Header Merging: The logic for merging single-line section headers (e.g., 'Chapter X') with subsequent sections has been enhanced to prevent merging if the current header's level is higher than the next section's header level, ensuring logical document structure.
  • Enhanced Token-Based Merging with Header Hierarchy: The final token-based merging step now considers the hierarchy of section headers. A new chunk is created not only when token limits are exceeded but also when a higher-level header is encountered, preventing inappropriate merging of distinct document sections.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly refactors the _split_document_by_tokens method to improve document chunking logic. The introduction of helper functions and clearer, multi-step processing enhances readability and maintainability. The new logic for merging sections based on header levels is a great improvement for creating more semantically coherent chunks.

However, there is a critical maintainability issue: the exact same code changes are applied to both genon/preprocessor/facade/intelligent_processor_law.py and genon/preprocessor/facade/legacy/적재용(규정)v2.py. This code duplication will make future maintenance difficult and error-prone. The HybridChunker class and its helper methods should be defined in a single, shared module and imported where needed, rather than being copied. Additionally, using non-ASCII characters in filenames, like in 적재용(규정)v2.py, can lead to compatibility issues across different operating systems and should be avoided.

I've added a few specific comments to further improve the code quality.

current_section_items = []
current_section_header_infos = []
current_section_header_short_infos = []
cur_items, cur_h_infos, cur_h_short = [], [], []

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While the refactoring has improved the structure, abbreviating variable names like cur_items, cur_h_infos, and cur_h_short reduces readability compared to the previous, more descriptive names (current_section_items, etc.). Using clear, unabbreviated names improves code maintainability for other developers who may work on this code in the future.

Comment on lines +521 to +523
for i, item in enumerate(items):
h_info = header_info_list[i] if i < len(header_info_list) else {}
h_short = header_short_info_list[i] if i < len(header_short_info_list) else {}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This loop iterates over items and manually fetches corresponding elements from header_info_list and header_short_info_list with index-based checks. This can be made more Pythonic and readable by using itertools.zip_longest to handle lists of potentially different lengths gracefully. This would require adjusting the loop structure.

For example:

from itertools import zip_longest

for item, h_info, h_short in zip_longest(items, header_info_list, header_short_info_list, fillvalue={}):
    # ... loop body ...

continue
# 문단이 이미 구성된 것은 제외 (문자 수가 30자 이상이면 문단을 구성했다고 간주)
item_text = "".join(getattr(it, "text", "") for it in items)
if len(item_text) > 30:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The number 30 is used here as a threshold to determine if an item's text constitutes a paragraph. This is a 'magic number' and makes the code harder to understand and maintain. It would be better to define it as a named constant at the beginning of the _split_document_by_tokens method, for example: MIN_PARAGRAPH_LENGTH_FOR_MERGE = 30.

Suggested change
if len(item_text) > 30:
if len(item_text) > MIN_PARAGRAPH_LENGTH_FOR_MERGE:


for text, items, header_infos, header_short_infos in sections_with_text:

b_new_chunk = False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of the b_new_chunk boolean flag to control the flow for creating a new chunk makes the logic a bit indirect. You can make this more concise and readable by evaluating the conditions for creating a new chunk directly in the if statement on line 619. This would remove the need for this flag and the conditional block on lines 611-615.

@inoray inoray requested review from JaeseungYang and mestanam-mnc and removed request for mestanam-mnc November 25, 2025 04:24
@inoray inoray merged commit 6532d59 into develop Nov 25, 2025
2 checks passed
@inoray inoray deleted the bug/114-toc-enrichment-section-header branch November 25, 2025 04:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ToC enrichment -> Section Header 전환 오류 점검

3 participants