@charlieroth
Owner

🌐 Network Fetcher for Page Content Implementation

This PR implements a robust HTTP client system for fetching and processing web page content as part of the background job system.

🎯 Features Implemented

✅ Robust HTTP Client

  • 10s connect timeout, 30s total request timeout
  • Automatic redirect following (max 10 redirects)
  • Compression support (gzip, brotli, deflate)
  • 5MB content size limit protection
  • Proper User-Agent and Accept headers
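
One way the 5MB content size limit could be enforced while reading a response body is sketched below, using only the standard library. The names `read_limited`, `ReadError`, and `MAX_BODY_BYTES` are hypothetical and not taken from this PR; the actual client may enforce the limit differently (for example, while streaming chunks).

```rust
use std::io::Read;

/// Assumed limit, matching the 5MB figure in the PR description.
pub const MAX_BODY_BYTES: u64 = 5 * 1024 * 1024;

/// Hypothetical error type for this sketch (not the PR's error type).
#[derive(Debug, PartialEq)]
pub enum ReadError {
    /// The body contained more than `limit` bytes.
    TooLarge { limit: u64 },
    /// The underlying reader failed.
    Io(String),
}

/// Reads the whole body from `reader`, failing if it exceeds `limit` bytes.
pub fn read_limited<R: Read>(reader: R, limit: u64) -> Result<Vec<u8>, ReadError> {
    let mut buf = Vec::new();
    // Read one byte past the limit so an oversized body can be told apart
    // from a body that is exactly `limit` bytes long.
    reader
        .take(limit.saturating_add(1))
        .read_to_end(&mut buf)
        .map_err(|e| ReadError::Io(e.to_string()))?;
    if buf.len() as u64 > limit {
        return Err(ReadError::TooLarge { limit });
    }
    Ok(buf)
}
```

Because `take` caps the read, an oversized body never buffers more than `limit + 1` bytes before the error is returned.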

✅ Smart Content Processing

  • Character encoding detection from Content-Type headers
  • HTML meta tag charset parsing (<meta charset>, <meta http-equiv>)
  • Heuristic charset detection using chardetng
  • UTF-8 normalization with original bytes preserved
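
The charset sources listed above could be combined roughly as follows. This is a minimal, standard-library-only sketch: `charset_from_content_type` and `choose_charset` are hypothetical names, and the precedence (header, then meta tag, then heuristic guess, then UTF-8) is an assumption based on the order of the bullets, not a confirmed detail of `pipeline.rs`. The meta tag value and the heuristic guess are passed in as plain labels, so no `chardetng` or `encoding_rs` calls are shown.

```rust
/// Extracts the `charset` parameter from a Content-Type header value,
/// lowercased. Returns `None` if the parameter is missing or empty.
pub fn charset_from_content_type(header: &str) -> Option<String> {
    header
        .split(';')
        .skip(1) // the first segment is the media type, e.g. "text/html"
        .filter_map(|param| {
            let (name, value) = param.split_once('=')?;
            if name.trim().eq_ignore_ascii_case("charset") {
                // Values may be quoted: charset="utf-8"
                let value = value.trim().trim_matches('"');
                if !value.is_empty() {
                    return Some(value.to_ascii_lowercase());
                }
            }
            None
        })
        .next()
}

/// Picks a charset label using the assumed precedence:
/// Content-Type header, then HTML meta tag, then heuristic guess, then UTF-8.
pub fn choose_charset(
    content_type: Option<&str>,
    meta_charset: Option<&str>,
    detected: Option<&str>,
) -> String {
    content_type
        .and_then(charset_from_content_type)
        .or_else(|| meta_charset.map(|s| s.to_ascii_lowercase()))
        .or_else(|| detected.map(|s| s.to_ascii_lowercase()))
        .unwrap_or_else(|| "utf-8".to_string())
}
```

For example, `choose_charset(Some("text/html"), Some("Shift_JIS"), None)` falls through the header (no `charset` parameter) and returns the meta tag's label.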

✅ Error Classification System

  • Retryable: Network failures, DNS issues, 5xx server errors, timeouts
  • Permanent: Invalid URLs, 4xx client errors, unsupported content types
  • Structured error types with detailed retry logic
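
The retryable/permanent split described above might look like the sketch below. The `FetchError` enum and its variants here are illustrative assumptions; the real definitions live in `src/fetcher/errors.rs` and may carry more detail.

```rust
/// Hypothetical error type mirroring the categories in the PR description.
#[derive(Debug, Clone, PartialEq)]
pub enum FetchError {
    Network(String),
    Dns(String),
    Timeout,
    /// Any non-success HTTP status code.
    HttpStatus(u16),
    InvalidUrl(String),
    UnsupportedContentType(String),
}

impl FetchError {
    /// Returns true when the job runner could reasonably retry the fetch.
    pub fn is_retryable(&self) -> bool {
        match self {
            // Transient failures: retry.
            FetchError::Network(_) | FetchError::Dns(_) | FetchError::Timeout => true,
            // 5xx is retryable; other statuses (such as 4xx) are permanent.
            FetchError::HttpStatus(code) => (500..=599).contains(code),
            // Retrying cannot fix a bad URL or an unsupported content type.
            FetchError::InvalidUrl(_) | FetchError::UnsupportedContentType(_) => false,
        }
    }
}
```

Keeping the classification on the error type itself lets a job handler decide between rescheduling and failing permanently with a single `is_retryable()` call.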

✅ Job Runner Integration

  • FetchPageJobHandler for background URL processing
  • Database storage in contents table with metadata
  • Automatic item status updates (pending → fetched)
  • MD5 checksums for content deduplication

🏗️ Architecture

src/fetcher/
├── mod.rs # Public API exports
├── client.rs # HTTP client with singleton pattern
├── errors.rs # Error types and retry classification
├── pipeline.rs # Content decoding and charset detection
└── types.rs # PageResponse and Charset definitions

src/jobs/handlers/
└── fetch_page.rs # Background job handler

tests/
└── fetcher_client.rs # 13 comprehensive tests

🔧 Technical Details

Dependencies Added:

  • encoding_rs - Character encoding conversion
  • chardetng - Heuristic charset detection
  • url, bytes, md5 - Content processing utilities
  • once_cell - Singleton HTTP client
  • wiremock - Testing infrastructure

Database Integration:

  • Stores HTML content in existing contents table
  • Updates item status from pending to fetched
  • Handles concurrent access with FOR UPDATE locks
  • SQLx offline query preparation included

🧪 Testing Coverage

9 Integration Tests (using wiremock):

  • Success scenarios with proper HTML content
  • HTTP error codes (404 non-retryable, 5xx retryable)
  • Redirect handling and final URL resolution
  • Gzip compression support
  • Content-type validation and filtering
  • Body size limit enforcement
  • Invalid URL handling

4 Unit Tests:

  • Charset detection from various sources
  • Content decoding validation
  • Error retry classification logic

🚀 Usage

Direct API:

use capsule::fetcher::fetch;
let response = fetch("https://example.com").await?;

Background Jobs:

// Enqueue a fetch job
let payload = FetchPagePayload { item_id };
job_repository.enqueue("fetch_page", payload, now()).await?;

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-2126140c-a3e6-48a0-a6cc-2e924d3c6344

- Add robust HTTP client with timeouts (10s connect, 30s total) and redirect handling
- Implement comprehensive error classification (retryable vs permanent failures)
- Add content decoding pipeline with charset detection and UTF-8 normalization
- Support gzip/brotli/deflate compression and 5MB size limits
- Create FetchPageJobHandler for background URL fetching
- Add 13 comprehensive tests covering timeouts, redirects, compression, errors
- Integrate with job runner system and database storage
- Add dependencies: encoding_rs, chardetng, url, bytes, md5, once_cell

Components:
- src/fetcher/ - Core HTTP fetching module
- src/jobs/handlers/fetch_page.rs - Job handler for background processing
- tests/fetcher_client.rs - Comprehensive test suite
- Updated SQLx offline query cache

@charlieroth charlieroth linked an issue Aug 26, 2025 that may be closed by this pull request
7 tasks
@charlieroth charlieroth self-assigned this Aug 26, 2025
@charlieroth charlieroth changed the title from "feat: implement network fetcher for page content" to "Network Fetcher for Page Content" Aug 26, 2025
@charlieroth charlieroth merged commit 977924d into main Aug 27, 2025
1 check passed