-
Notifications
You must be signed in to change notification settings - Fork 2
Bug/114 toc enrichment section header #123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @inoray, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refines the document chunking mechanism by improving how sections are identified, split, and merged. The changes focus on maintaining logical document structure, especially concerning section headers and their hierarchical levels, to ensure that generated document chunks are more semantically coherent and adhere to token limits effectively. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request significantly refactors the _split_document_by_tokens method to improve document chunking logic. The introduction of helper functions and clearer, multi-step processing enhances readability and maintainability. The new logic for merging sections based on header levels is a great improvement for creating more semantically coherent chunks.
However, there is a critical maintainability issue: the exact same code changes are applied to both genon/preprocessor/facade/intelligent_processor_law.py and genon/preprocessor/facade/legacy/적재용(규정)v2.py. This code duplication will make future maintenance difficult and error-prone. The HybridChunker class and its helper methods should be defined in a single, shared module and imported where needed, rather than being copied. Additionally, using non-ASCII characters in filenames, like in 적재용(규정)v2.py, can lead to compatibility issues across different operating systems and should be avoided.
I've added a few specific comments to further improve the code quality.
| current_section_items = [] | ||
| current_section_header_infos = [] | ||
| current_section_header_short_infos = [] | ||
| cur_items, cur_h_infos, cur_h_short = [], [], [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the refactoring has improved the structure, abbreviating variable names like cur_items, cur_h_infos, and cur_h_short reduces readability compared to the previous, more descriptive names (current_section_items, etc.). Using clear, unabbreviated names improves code maintainability for other developers who may work on this code in the future.
| for i, item in enumerate(items): | ||
| h_info = header_info_list[i] if i < len(header_info_list) else {} | ||
| h_short = header_short_info_list[i] if i < len(header_short_info_list) else {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop iterates over items and manually fetches corresponding elements from header_info_list and header_short_info_list with index-based checks. This can be made more Pythonic and readable by using itertools.zip_longest to handle lists of potentially different lengths gracefully. This would require adjusting the loop structure.
For example:
from itertools import zip_longest
for item, h_info, h_short in zip_longest(items, header_info_list, header_short_info_list, fillvalue={}):
# ... loop body ...| continue | ||
| # 문단이 이미 구성된 것은 제외 (문자 수가 30자 이상이면 문단을 구성했다고 간주) | ||
| item_text = "".join(getattr(it, "text", "") for it in items) | ||
| if len(item_text) > 30: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number 30 is used here as a threshold to determine if an item's text constitutes a paragraph. This is a 'magic number' and makes the code harder to understand and maintain. It would be better to define it as a named constant at the beginning of the _split_document_by_tokens method, for example: MIN_PARAGRAPH_LENGTH_FOR_MERGE = 30.
| if len(item_text) > 30: | |
| if len(item_text) > MIN_PARAGRAPH_LENGTH_FOR_MERGE: |
|
|
||
| for text, items, header_infos, header_short_infos in sections_with_text: | ||
|
|
||
| b_new_chunk = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of the b_new_chunk boolean flag to control the flow for creating a new chunk makes the logic a bit indirect. You can make this more concise and readable by evaluating the conditions for creating a new chunk directly in the if statement on line 619. This would remove the need for this flag and the conditional block on lines 611-615.
Checklist: