facade/attachment_preprocessor.py #138

mestanam-mnc · 2025-12-14T11:48:30Z

Choose between langchain and docling based on file extension

docling

hwpx, docx

langchain

All other extensions

For extensions other than hwpx and docx, add 1 to page.
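
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DOCLING_EXTENSIONS`, `select_backend`, and `needs_page_offset` are hypothetical names, and lowercasing the extension before comparison is an assumption.

```python
from pathlib import Path

# Extensions routed to docling per the description; everything else goes to langchain.
DOCLING_EXTENSIONS = {".hwpx", ".docx"}


def select_backend(filename: str) -> str:
    """Return "docling" or "langchain" for a file name (hypothetical helper)."""
    # Assumption: extensions are compared case-insensitively.
    ext = Path(filename).suffix.lower()
    return "docling" if ext in DOCLING_EXTENSIONS else "langchain"


def needs_page_offset(filename: str) -> bool:
    """Only files handled by langchain get the +1 page adjustment."""
    return select_backend(filename) == "langchain"
```

For example, `select_backend("report.hwpx")` yields `"docling"`, while `needs_page_offset("scan.pdf")` is true.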

Needs confirmation

Need to confirm whether the code delivered to the Bank of Korea is the attachment code in facade.

Extensions using docling: hwpx, docx; all others use langchain

gemini-code-assist · 2025-12-14T11:48:41Z

Summary of Changes

Hello @mestanam-mnc, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

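This PR adjusts how page numbers are assigned in the attachment processing logic. The goal is to keep page indexing consistent between specific file formats such as .hwpx and .docx (presumably processed with docling) and other file formats (presumably processed with langchain). The changes focus on keeping page references accurate across the different document processing tools.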

Highlights

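Page number handling changed: when composing vectors from document chunks, the default value of the page metadata changed from 0 to 1.
Page number adjustment by file extension: conditional logic was added so that page is incremented by 1 for files other than .hwpx and .docx. This reflects the split in page number handling between docling and langchain.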


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

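This PR modifies the logic that computes page numbers for attachments. The intent appears to be to handle pages differently for hwpx/docx files than for other files. However, the implementation for non-hwpx/docx files appears to contain a logic error: documents without page information are incorrectly assigned to page 2. There is also an unnecessary conditional that could be removed to simplify the code. I have left specific review comments, including a suggested fix.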

gemini-code-assist · 2025-12-14T11:50:01Z

genon/preprocessor/facade/attachment_processor.py

+            page = chunk.metadata.get('page', 1)
+            if ext not in ['.hwpx', '.docx']:
+                page += 1


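There appear to be two problems with the page number calculation logic.

First, the condition if ext not in ['.hwpx', '.docx']: is always true. According to the __call__ method of DocumentProcessor, the compose_vectors method is only called for file formats other than .hwpx and .docx. The if statement is therefore unnecessary and can make the code confusing.

Second, the page number calculation for documents without page information appears to be wrong.

For documents with page numbers (e.g., PDF), chunk.metadata.get('page', 1) returns a 0-based page number (0, 1, 2...), and adding 1 correctly converts it to a 1-based page number (1, 2, 3...).

For documents without page numbers, however, chunk.metadata.get('page', 1) returns the default value 1, and page += 1 then makes the final value 2. As a result, all content of such a document is assigned to page 2, which is most likely not the intended behavior. It should normally be assigned to page 1.

A simpler and correct implementation is:

page = chunk.metadata.get('page', 0) + 1

This way, 0-based pages are converted to 1-based, and documents without page information are assigned to page 1 (0 + 1).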

Suggested change

-            page = chunk.metadata.get('page', 1)
-            if ext not in ['.hwpx', '.docx']:
-                page += 1
+            page = chunk.metadata.get('page', 0) + 1
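
To make the difference concrete, here is a small standalone comparison of the two variants. It is only a sketch: plain dicts stand in for `chunk.metadata`, the function names `page_from_diff` and `page_as_suggested` are hypothetical, and the always-true extension check is omitted from the first variant.

```python
def page_from_diff(metadata: dict) -> int:
    """Logic from the diff under review: default of 1, then +1."""
    page = metadata.get("page", 1)
    page += 1
    return page


def page_as_suggested(metadata: dict) -> int:
    """Suggested logic: treat a missing page as 0-based page 0, then +1."""
    return metadata.get("page", 0) + 1


# A chunk with a 0-based page number behaves the same in both variants.
with_page = {"page": 0}
# A chunk without page metadata is where the two variants diverge.
without_page = {}

print(page_from_diff(with_page), page_as_suggested(with_page))
print(page_from_diff(without_page), page_as_suggested(without_page))
```

Both variants agree whenever `page` is present; they differ only when the key is missing, which is the case the review comment is about.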

facade/attachment_preprocessor.py

fcea6a5

Extensions using docling: hwpx, docx; all others use langchain

mestanam-mnc linked an issue Dec 14, 2025 that may be closed by this pull request

[BOK] Attachment preprocessor start page 0 -> 1 #135

Closed

mestanam-mnc requested review from inoray and yspaik December 14, 2025 11:48

gemini-code-assist bot reviewed Dec 14, 2025


inoray approved these changes Dec 18, 2025


inoray merged commit 15873ec into develop Dec 18, 2025
2 checks passed
