
Scrape Improve #317

Merged
locnguyen1986 merged 2 commits into release from main on Dec 18, 2025

Conversation

@locnguyen1986 (Collaborator)

Scrape Improve

Copilot AI review requested due to automatic review settings on December 17, 2025 at 17:20
Copilot AI (Contributor) left a comment


Pull request overview

This PR enhances the web scraping functionality by introducing separate timeout configurations for scrape operations, implementing graceful degradation for failed scrapes, and improving retry logic with non-retryable error handling. The changes enable better handling of slow-loading web pages while avoiding unnecessary retries for errors that will never succeed (like 403 Forbidden or 401 Unauthorized).

Key Changes:

  • Added dedicated scrape client with configurable longer timeout (default 30s vs 15s for regular operations)
  • Implemented graceful degradation: scrape failures now return response with status/error fields instead of propagating errors
  • Enhanced retry logic to distinguish between retryable and non-retryable errors (e.g., 403, 401, 404, 410)
  • Updated fallback client with browser-like headers to improve bot detection avoidance
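The non-retryable error handling from the third bullet can be sketched as follows. The status list matches the codes named above (401, 403, 404, 410), but the type and function names here are illustrative assumptions, not the identifiers used in `retry.go`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// HTTP statuses that will never succeed on retry; the PR keeps an
// equivalent NonRetryableErrors list, though its exact shape is unknown.
var nonRetryableStatuses = map[int]bool{
	401: true, // Unauthorized
	403: true, // Forbidden
	404: true, // Not Found
	410: true, // Gone
}

// statusError is a stand-in for whatever error type carries the HTTP status.
type statusError struct{ code int }

func (e *statusError) Error() string { return fmt.Sprintf("http status %d", e.code) }

// withRetry runs fn up to attempts times, but gives up immediately
// when the error carries a non-retryable status.
func withRetry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		var se *statusError
		if errors.As(err, &se) && nonRetryableStatuses[se.code] {
			return err // retrying a 401/403/404/410 will never help
		}
		time.Sleep(delay)
	}
	return err
}

func main() {
	calls := 0
	err := withRetry(3, 0, func() error {
		calls++
		return &statusError{code: 403}
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A 403 aborts after the first attempt, while a transient status such as 503 would still consume the full retry budget.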

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • services/mcp-tools/main.go: Maps the new ScrapeTimeout config to SearchClient initialization
  • services/mcp-tools/internal/infrastructure/config/config.go: Adds the SERPER_SCRAPE_TIMEOUT environment variable with a 30s default
  • services/mcp-tools/internal/infrastructure/search/client.go: Implements the separate scrape client, graceful-degradation logic, and browser-like headers for the fallback
  • services/mcp-tools/internal/infrastructure/search/retry.go: Adds the NonRetryableErrors list and checks it before attempting retries
  • services/mcp-tools/internal/domain/search/types.go: Extends FetchWebpageResponse with Status and Error fields for graceful degradation


Comment on lines +251 to +257:

```go
errMsg := "scrape failed"
if serperErr != nil && fallbackErr != nil {
	errMsg = fmt.Sprintf("serper: %v; fallback: %v", serperErr, fallbackErr)
} else if fallbackErr != nil {
	errMsg = fallbackErr.Error()
} else if serperErr != nil {
	errMsg = serperErr.Error()
```

Copilot AI Dec 17, 2025


The error message construction logic is redundant and can be simplified. The else-if chain at lines 254-257 will never execute because if both errors are nil, the function would have already returned successfully. The logic can be simplified to just check if both errors exist and construct the message accordingly, without the unnecessary else-if branches.

Suggested change:

```go
// At this point, fallbackErr is guaranteed to be non-nil because a successful
// fallback would have returned earlier.
errMsg := fallbackErr.Error()
if serperErr != nil {
	errMsg = fmt.Sprintf("serper: %v; fallback: %v", serperErr, fallbackErr)
}
```

replacing the original chain:

```go
errMsg := "scrape failed"
if serperErr != nil && fallbackErr != nil {
	errMsg = fmt.Sprintf("serper: %v; fallback: %v", serperErr, fallbackErr)
} else if fallbackErr != nil {
	errMsg = fallbackErr.Error()
} else if serperErr != nil {
	errMsg = serperErr.Error()
}
```
locnguyen1986 merged commit ce09ad7 into release on Dec 18, 2025
13 of 14 checks passed
github-project-automation bot moved this to QA in the Jan project on Dec 18, 2025

Labels

None yet

Projects

Status: QA


3 participants