feat(parser): extract images, dividers, embedded tweets, and inline styles from Articles #70

Open
zh-xl-kang wants to merge 2 commits into public-clis:main from zh-xl-kang:feat/article-rich-content
Conversation

@zh-xl-kang

Summary

Twitter Article Draft.js content has several entity types that were being silently dropped during Markdown conversion. This PR adds support for all of them.

What was missing

| Content type | Before | After |
| --- | --- | --- |
| Images (with captions) | Skipped entirely | `![caption](https://pbs.twimg.com/media/xxx.jpg)` |
| Dividers | Skipped entirely | `---` |
| Embedded tweets | Skipped entirely | `> [Embedded Tweet](https://x.com/i/status/ID)` |
| Bold text | Plain text | `**text**` |
| Italic text | Plain text | `*text*` |
| Inline code | Plain text | `` `text` `` |
| Strikethrough | Plain text | `~~text~~` |
| Bold + Link on same span | Offset corruption | Correct rendering |
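
To make the "After" column concrete for the atomic entity types, here is a minimal sketch of a dispatch on entity type. It is not the code in this PR: `render_atomic_entity` is a hypothetical helper, and the dict keys (`type`, `data`, `tweetId`, `url`, `caption`, `markdown`) are assumed for illustration and may not match the real API payload.

```python
from typing import Optional


def render_atomic_entity(entity: dict) -> Optional[str]:
    """Render one atomic entity as Markdown, or None if it cannot be rendered.

    Hypothetical helper; all dict keys below are assumptions.
    """
    entity_type = str(entity.get("type", "")).upper()
    data = entity.get("data") or {}

    if entity_type == "DIVIDER":
        return "---"
    if entity_type == "TWEET":
        tweet_id = data.get("tweetId")  # assumed key name
        if not tweet_id:
            return None
        return f"> [Embedded Tweet](https://x.com/i/status/{tweet_id})"
    if entity_type == "MEDIA":
        url = data.get("url")  # assumed to be resolved already
        if not url:
            return None
        return f"![{data.get('caption', '')}]({url})"
    if entity_type == "MARKDOWN":
        return data.get("markdown")
    # Unknown entity types are skipped rather than raising.
    return None
```

Returning `None` for unknown or incomplete entities lets the caller skip the block without failing the whole article.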

Changes

twitter_cli/parser.py

  1. _extract_atomic_markdown → _extract_atomic_content: Renamed and extended to handle DIVIDER and TWEET entity types in addition to the existing MARKDOWN type.

  2. _render_article_text_block: Rewritten to handle both inlineStyleRanges (Bold/Italic/Code/Strikethrough) and entityRanges (links) in a unified right-to-left pass. This fixes a bug where applying styles before links would corrupt character offsets when both appeared on the same text span.

    Key design decision: all operations are collected as (start, end, replacement) tuples, sorted by offset descending, and applied right-to-left. This is correct because Draft.js offsets always reference the original text.

  3. Case-insensitive style matching: Twitter API returns style names in Title case ("Bold", "Italic") rather than uppercase. The code normalizes via .upper().
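
The right-to-left pass and the `.upper()` normalization described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `render_block` and `STYLE_MARKERS` are hypothetical names, link ranges are assumed to carry an already-resolved `url`, and spans are assumed to be either identical or disjoint (a partially overlapping span is simply skipped).

```python
from typing import Dict, List, Tuple

# Keys are uppercase; incoming style names are normalized with .upper().
STYLE_MARKERS = {"BOLD": "**", "ITALIC": "*", "CODE": "`", "STRIKETHROUGH": "~~"}


def render_block(text: str, style_ranges: List[dict], link_ranges: List[dict]) -> str:
    """Apply links and inline styles to one block of text (hypothetical helper)."""
    # One replacement per (start, end) span, always built from the ORIGINAL text.
    spans: Dict[Tuple[int, int], str] = {}

    def in_bounds(start: int, end: int) -> bool:
        return 0 <= start < end <= len(text)

    for r in link_ranges:  # links first, so styles wrap around the link
        start, end = r["offset"], r["offset"] + r["length"]
        if in_bounds(start, end):
            spans[(start, end)] = f"[{text[start:end]}]({r['url']})"

    for r in style_ranges:
        start, end = r["offset"], r["offset"] + r["length"]
        marker = STYLE_MARKERS.get(str(r.get("style", "")).upper())
        if marker is None or not in_bounds(start, end):
            continue
        inner = spans.get((start, end), text[start:end])
        spans[(start, end)] = f"{marker}{inner}{marker}"

    # Sort by start offset descending and apply right-to-left, so edits never
    # shift the offsets of spans that are still waiting to be applied.
    ops = sorted(((s, e, rep) for (s, e), rep in spans.items()),
                 key=lambda op: op[0], reverse=True)
    result = text
    boundary = len(text)
    for start, end, replacement in ops:
        if end > boundary:  # overlaps a span already rewritten; skipped here
            continue
        result = result[:start] + replacement + result[end:]
        boundary = start
    return result
```

For example, with the text `Read the docs now`, an `Italic` range over `Read`, and a `Bold` range plus a link over `docs`, this sketch yields `*Read* the **[docs](https://example.com)** now`.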

tests/test_article_parsing.py (new)

25 unit tests covering:

  • All inline styles (Bold, Italic, Code, Strikethrough)
  • Links (including URL with parentheses)
  • Mixed Bold + Link on the same span (regression test)
  • All atomic entity types (DIVIDER, TWEET, MARKDOWN)
  • End-to-end _parse_article with synthetic article data
  • Edge cases (empty text, out-of-bounds offsets, unknown entity types)

Real-world validation

Tested against 43 real Twitter Articles (310,000+ chars total). Examples of content recovered:

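| Article | v1 chars | v2 chars | New content |
| --- | --- | --- | --- |
| 15 LEVELS OF HERMES AGENT | 29,119 | 31,276 | +2 imgs, +24 dividers, +3 tweets, +130 bold, +17 links |
| Hermes Agent 完全指南 (Hermes Agent Complete Guide) | 6,028 | 8,665 | +4 imgs, +10 dividers, +4 tweets, +32 bold, +31 links |
| Claude Code 架构与工程实践 (Claude Code Architecture and Engineering Practice) | 11,120 | 19,608 | +22 imgs, +21 dividers, +64 bold, +27 links |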

Test results

126 passed in 0.26s
├── 25 new (article parsing)
└── 101 existing (zero regressions)

…tyles from Articles

Twitter Article Draft.js content has several entity types that were being
silently dropped during Markdown conversion:

- MEDIA entities → now rendered as `![caption](url)` with caption text
- DIVIDER entities → now rendered as `---`
- TWEET entities → now rendered as `> [Embedded Tweet](url)`
- inlineStyleRanges (Bold/Italic/Code/Strikethrough) → now converted to
  `**`, `*`, backticks, `~~` respectively

The previous implementation skipped all atomic blocks (images, dividers,
embedded tweets) and ignored inline style ranges entirely.

Key design decisions:
- Style and link operations are collected as (start, end, replacement)
  tuples and applied right-to-left, so overlapping Bold+Link on the same
  text block don't corrupt each other's character offsets.
- Twitter API returns style names in Title case ('Bold', 'Italic') rather
  than uppercase, so comparisons use .upper() for case-insensitive matching.
- Image URLs are resolved from article_results.media_entities using the
  mediaId → original_img_url mapping chain (entityMap only contains mediaId,
  not the URL itself).

Tested against 43 real Twitter Articles (310K+ chars total).
25 new unit tests added, 101 existing tests still pass.
…_content

The function was renamed from _extract_atomic_markdown to _extract_atomic_content.
Update all references in the existing test suite.
@zh-xl-kang

Author

Hi! Just a gentle bump on this PR.

Quick summary: The Draft.js article parser was silently dropping images, dividers, embedded tweets, and all inline formatting (Bold/Italic/Code/Strikethrough). This PR recovers all of them.

What changed (parser.py, +70/-29 lines):

  • _extract_atomic_markdown renamed to _extract_atomic_content: now handles DIVIDER and TWEET in addition to the existing MARKDOWN
  • _render_article_text_block: rewritten to handle inlineStyleRanges (Bold/Italic/Code/Strikethrough) and links in a unified right-to-left pass, fixing offset corruption when both appear on the same text span
  • Image URLs resolved via media_entities mapping chain (mediaId to original_img_url)
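
As a rough illustration of the mediaId to original_img_url lookup mentioned above (not the code in this PR), the sketch below assumes `media_entities` is a list of dicts with top-level `media_id` and `original_img_url` keys. The real payload may nest these fields differently, and `resolve_image_url` is a hypothetical name.

```python
from typing import List, Optional


def resolve_image_url(entity_data: dict, media_entities: List[dict]) -> Optional[str]:
    """Map an entity's mediaId to an image URL, or None if it cannot be resolved.

    Hypothetical helper; "media_id" and "original_img_url" as top-level keys
    are assumptions about the payload shape.
    """
    media_id = entity_data.get("mediaId")
    if media_id is None:
        return None
    # Compare IDs as strings so "42" and 42 resolve to the same entry.
    url_by_id = {
        str(m.get("media_id")): m.get("original_img_url")
        for m in media_entities
        if isinstance(m, dict)
    }
    return url_by_id.get(str(media_id))
```

An entity whose `mediaId` has no match resolves to `None`, so the caller can skip the image instead of emitting a broken link.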

Validation: tested against 43 real Twitter Articles (310K+ chars). Example: a 29K-char article now recovers +2 images, +24 dividers, +3 embedded tweets, +130 bold spans, +17 links.

Tests: 25 new + 101 existing = 126 passed, 0 failures. Also updated test_client.py imports for the renamed function.

I noticed the CI workflow has not run yet — it may need maintainer approval for fork PRs. Happy to address any feedback. Thanks for the great tool!
