Filter pale chapters to save only chapter content #14

Merged
Tomotz merged 1 commit into master from
devin/1771665871-pale-chapter-filter
Feb 21, 2026
@Tomotz Tomotz commented Feb 21, 2026

Filter pale chapters to save only chapter content

Summary

Adds extract_chapter_content() to save_ebook.py that uses regex to extract only the chapter title (<h1 class="entry-title">) and body (<div class="entry-content">) from each downloaded WordPress page. This strips out user comments, sharing/like buttons (jp-post-flair), related posts (jp-relatedposts), navigation, sidebar, footer metadata, and all other non-chapter HTML.

The function is applied in the pale download loop so pale_full.html contains only chapter text.
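The PR page does not show the diff, so the following is only a minimal sketch of what such a function could look like, reconstructed from the summary above. The exact patterns, the output wrapper (`<h1>` plus body), and the handling of a failed match are assumptions and may differ from the real `save_ebook.py`.

```python
import re

# Hypothetical patterns inferred from the PR description; the actual
# save_ebook.py implementation may use different ones.
TITLE_RE = re.compile(r'<h1 class="entry-title">(.*?)</h1>', re.DOTALL)
CONTENT_RE = re.compile(
    r'<div class="entry-content">(.*?)</div><!-- \.entry-content -->',
    re.DOTALL,
)
# Greedy on purpose: drop everything from the opening tag of these divs
# to the end of the extracted body.
FLAIR_RE = re.compile(r"<div[^>]*\bjp-post-flair\b.*", re.DOTALL)
RELATED_RE = re.compile(r"<div[^>]*\bjp-relatedposts\b.*", re.DOTALL)


def extract_chapter_content(page_html: str) -> str:
    """Return only the chapter title and body, or "" if neither is found."""
    title_match = TITLE_RE.search(page_html)
    content_match = CONTENT_RE.search(page_html)
    title = title_match.group(1).strip() if title_match else ""
    body = content_match.group(1) if content_match else ""
    for pattern in (FLAIR_RE, RELATED_RE):
        body = pattern.sub("", body)
    body = body.strip()
    if not title and not body:
        return ""
    return f"<h1>{title}</h1>\n{body}\n"
```

The non-greedy `(.*?)` in the content pattern stops at the first `</div><!-- .entry-content -->` marker, which is why nested divs inside the body do not end the match early.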

Review & Testing Checklist for Human

  • Regex vs nested divs: The content regex relies on the </div><!-- .entry-content --> HTML comment as its end marker. Verify this comment is consistently present across all 311 chapters — if any chapter's template differs, content could be truncated or missed entirely.
  • Silent chapter drops: If a page returns unexpected HTML (error page, changed template), neither regex matches, the extracted title and body are both empty, and the chapter is silently skipped. Consider whether a warning/print should be added.
  • jp-post-flair stripping: The cleanup regexes use greedy .* with re.DOTALL, removing everything from the start of each matched div (jp-post-flair, jp-relatedposts) to end-of-string. Confirm these divs always appear at the tail of entry-content and never inside actual chapter text.
  • Test run: Re-run the script on a handful of chapters (or the full set) and spot-check that output contains complete chapter text with no leftover comment/sharing HTML and no missing content.
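Some of the checks above could be automated before a full run. The sketch below is one possible approach, not part of the PR: `audit_page` is a hypothetical helper, and the 0.5 threshold for "too much removed" is an arbitrary assumption chosen for illustration.

```python
import re
from typing import List

START_TAG = '<div class="entry-content">'
END_MARKER = "</div><!-- .entry-content -->"
# Opening tags of the divs that the greedy cleanup cuts from.
TAIL_DIV_RE = re.compile(r"<div[^>]*\bjp-(?:post-flair|relatedposts)\b")
# Assumed threshold: flag a page if cleanup would drop over half the body.
MAX_REMOVED_FRACTION = 0.5


def audit_page(page_html: str) -> List[str]:
    """Return a list of human-readable problems found in one downloaded page."""
    start = page_html.find(START_TAG)
    if start == -1:
        return ["no entry-content div found"]
    body_start = start + len(START_TAG)
    end = page_html.find(END_MARKER, body_start)
    if end == -1:
        return ["end marker comment missing"]
    body = page_html[body_start:end]
    problems = []
    if not body.strip():
        problems.append("entry-content is empty")
    tail = TAIL_DIV_RE.search(body)
    # If the tail div starts early, greedy .* would delete real chapter text.
    if tail and tail.start() / len(body) < 1 - MAX_REMOVED_FRACTION:
        problems.append("greedy cleanup would remove more than half of the body")
    return problems
```

Calling this once per chapter in the download loop and printing any non-empty result would surface template differences and silent drops without changing the extraction itself.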

Notes

Add extract_chapter_content() that uses regex to pull out just the
entry-title and entry-content from each downloaded page, stripping
comments, sharing buttons, navigation, related posts, and other
non-chapter elements.

Co-Authored-By: tom mottes <tom.mottes@gmail.com>
@Tomotz Tomotz merged commit 433afb9 into master Feb 21, 2026
1 check passed