Bug/114 toc enrichment section header #123

inoray · 2025-11-25T04:15:32Z

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

gemini-code-assist · 2025-11-25T04:15:45Z

Summary of Changes

Hello @inoray, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the document chunking mechanism by improving how sections are identified, split, and merged. The changes focus on maintaining logical document structure, especially concerning section headers and their hierarchical levels, to ensure that generated document chunks are more semantically coherent and adhere to token limits effectively.

Highlights

Refactored Document Splitting Logic: The _split_document_by_tokens function has been refactored into a clearer, multi-stage process, including dedicated steps for initial section splitting, text generation, single-line title merging, and final token-based merging.
Introduced Helper Functions: Two new helper functions, get_header_level and get_current_chunk, were added to encapsulate common logic, improving readability and maintainability of the main splitting function.
Improved Single-Line Header Merging: The logic for merging single-line section headers (e.g., 'Chapter X') with subsequent sections has been enhanced to prevent merging if the current header's level is higher than the next section's header level, ensuring logical document structure.
Enhanced Token-Based Merging with Header Hierarchy: The final token-based merging step now considers the hierarchy of section headers. A new chunk is created not only when token limits are exceeded but also when a higher-level header is encountered, preventing inappropriate merging of distinct document sections.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request significantly refactors the _split_document_by_tokens method to improve document chunking logic. The introduction of helper functions and clearer, multi-step processing enhances readability and maintainability. The new logic for merging sections based on header levels is a great improvement for creating more semantically coherent chunks.

However, there is a critical maintainability issue: the exact same code changes are applied to both genon/preprocessor/facade/intelligent_processor_law.py and genon/preprocessor/facade/legacy/적재용(규정)v2.py. This code duplication will make future maintenance difficult and error-prone. The HybridChunker class and its helper methods should be defined in a single, shared module and imported where needed, rather than being copied. Additionally, using non-ASCII characters in filenames, like in 적재용(규정)v2.py, can lead to compatibility issues across different operating systems and should be avoided.

I've added a few specific comments to further improve the code quality.

gemini-code-assist · 2025-11-25T04:17:46Z

genon/preprocessor/facade/intelligent_processor_law.py

-        current_section_items = []
-        current_section_header_infos = []
-        current_section_header_short_infos = []
+        cur_items, cur_h_infos, cur_h_short = [], [], []


While the refactoring has improved the structure, abbreviating variable names like cur_items, cur_h_infos, and cur_h_short reduces readability compared to the previous, more descriptive names (current_section_items, etc.). Using clear, unabbreviated names improves code maintainability for other developers who may work on this code in the future.

gemini-code-assist · 2025-11-25T04:17:46Z

genon/preprocessor/facade/intelligent_processor_law.py

+        for i, item in enumerate(items):
+            h_info = header_info_list[i] if i < len(header_info_list) else {}
+            h_short = header_short_info_list[i] if i < len(header_short_info_list) else {}


This loop iterates over items and manually fetches corresponding elements from header_info_list and header_short_info_list with index-based checks. This can be made more Pythonic and readable by using itertools.zip_longest to handle lists of potentially different lengths gracefully. This would require adjusting the loop structure.

For example:

from itertools import zip_longest for item, h_info, h_short in zip_longest(items, header_info_list, header_short_info_list, fillvalue={}): # ... loop body ...

gemini-code-assist · 2025-11-25T04:17:46Z

genon/preprocessor/facade/intelligent_processor_law.py

-                    continue
+            # 문단이 이미 구성된 것은 제외 (문자 수가 30자 이상이면 문단을 구성했다고 간주)
+            item_text = "".join(getattr(it, "text", "") for it in items)
+            if len(item_text) > 30:


The number 30 is used here as a threshold to determine if an item's text constitutes a paragraph. This is a 'magic number' and makes the code harder to understand and maintain. It would be better to define it as a named constant at the beginning of the _split_document_by_tokens method, for example: MIN_PARAGRAPH_LENGTH_FOR_MERGE = 30.

Suggested change

if len(item_text) > 30:

if len(item_text) > MIN_PARAGRAPH_LENGTH_FOR_MERGE:

gemini-code-assist · 2025-11-25T04:17:46Z

genon/preprocessor/facade/intelligent_processor_law.py

+
+        for text, items, header_infos, header_short_infos in sections_with_text:
+
+            b_new_chunk = False


The use of the b_new_chunk boolean flag to control the flow for creating a new chunk makes the logic a bit indirect. You can make this more concise and readable by evaluating the conditions for creating a new chunk directly in the if statement on line 619. This would remove the need for this flag and the conditional block on lines 611-615.

inoray added 3 commits November 19, 2025 18:08

B의 Section Header가 A(이전 Chunk)의 끝에 붙는 현상 수정, 섹션헤더 레벨 정보 활용

2308e4f

섹션 헤더의 텍스트가 잘못생성되는 것 수정

0cc1a24

불필요한 로깅 주석처리

a1a3e45

inoray linked an issue Nov 25, 2025 that may be closed by this pull request

ToC enrichment -> Section Header 전환 오류 점검 #114

Closed

gemini-code-assist bot reviewed Nov 25, 2025

View reviewed changes

inoray requested review from JaeseungYang and mestanam-mnc and removed request for mestanam-mnc November 25, 2025 04:24

JaeseungYang approved these changes Nov 25, 2025

View reviewed changes

inoray merged commit 6532d59 into develop Nov 25, 2025
2 checks passed

inoray deleted the bug/114-toc-enrichment-section-header branch November 25, 2025 04:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bug/114 toc enrichment section header #123

Bug/114 toc enrichment section header #123

Uh oh!

inoray commented Nov 25, 2025

Uh oh!

gemini-code-assist bot commented Nov 25, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Nov 25, 2025

Uh oh!

gemini-code-assist bot Nov 25, 2025

Uh oh!

gemini-code-assist bot Nov 25, 2025

Uh oh!

gemini-code-assist bot Nov 25, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	if len(item_text) > 30:
	if len(item_text) > MIN_PARAGRAPH_LENGTH_FOR_MERGE:


		for text, items, header_infos, header_short_infos in sections_with_text:

		b_new_chunk = False

Bug/114 toc enrichment section header #123

Bug/114 toc enrichment section header #123

Uh oh!

Conversation

inoray commented Nov 25, 2025

Uh oh!

gemini-code-assist bot commented Nov 25, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants