feat(parser): extract images, dividers, embedded tweets, and inline styles from Articles by zh-xl-kang · Pull Request #70 · public-clis/twitter-cli

zh-xl-kang · 2026-06-29T04:35:25Z

Summary

Twitter Article Draft.js content has several entity types that were being silently dropped during Markdown conversion. This PR adds support for all of them.

What was missing

| Content type | Before | After |
| --- | --- | --- |
| Images (with captions) | Skipped entirely | `![caption](https://pbs.twimg.com/media/xxx.jpg)` |
| Dividers | Skipped entirely | `---` |
| Embedded tweets | Skipped entirely | `> [Embedded Tweet](https://x.com/i/status/ID)` |
| Bold text | Plain text | `**text**` |
| Italic text | Plain text | `*text*` |
| Inline code | Plain text | `` `text` `` |
| Strikethrough | Plain text | `~~text~~` |
| Bold + Link on same span | Offset corruption | Correct rendering |
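The atomic-block rows of the table can be sketched as a small dispatch on entity type. This is only an illustration, not the code in `twitter_cli/parser.py`: the function name `render_atomic_entity`, the `media_urls` lookup argument, and the `data` field names (`tweetId`, `mediaId`, `caption`, `markdown`) are assumptions made for the example.

```python
from typing import Any, Optional


def render_atomic_entity(
    entity: dict[str, Any], media_urls: dict[str, str]
) -> Optional[str]:
    """Return Markdown for one atomic entity, or None if it cannot be rendered.

    `entity` is assumed to look like {"type": ..., "data": {...}}.
    `media_urls` is a hypothetical mediaId -> image URL lookup.
    """
    entity_type = str(entity.get("type", "")).upper()
    data = entity.get("data") or {}

    if entity_type == "DIVIDER":
        return "---"

    if entity_type == "TWEET":
        tweet_id = data.get("tweetId")  # hypothetical field name
        if not tweet_id:
            return None
        return f"> [Embedded Tweet](https://x.com/i/status/{tweet_id})"

    if entity_type == "MEDIA":
        url = media_urls.get(str(data.get("mediaId")))  # hypothetical field name
        if not url:
            return None
        caption = data.get("caption", "")  # hypothetical field name
        return f"![{caption}]({url})"

    if entity_type == "MARKDOWN":
        return data.get("markdown")  # hypothetical field name

    # Unknown entity types are skipped rather than raising.
    return None
```

Returning `None` for unknown or incomplete entities keeps the caller free to drop the block quietly, which matches the "unknown entity types" edge case listed in the tests.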

Changes

`twitter_cli/parser.py`

_extract_atomic_markdown → _extract_atomic_content: Renamed and extended to handle DIVIDER and TWEET entity types in addition to the existing MARKDOWN type.
_render_article_text_block: Rewritten to handle both inlineStyleRanges (Bold/Italic/Code/Strikethrough) and entityRanges (links) in a unified right-to-left pass. This fixes a bug where applying styles before links would corrupt character offsets when both appeared on the same text span.

Key design decision: all operations are collected as (start, end, replacement) tuples, sorted by offset descending, and applied right-to-left. Draft.js offsets always reference the original text, so editing from the end of the string first leaves every offset further left still valid.
Case-insensitive style matching: Twitter API returns style names in Title case ("Bold", "Italic") rather than uppercase. The code normalizes via .upper().
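A minimal sketch of the right-to-left pass described above, under stated assumptions: `render_block` and `STYLE_MARKERS` are names invented for this example, link ranges are assumed to arrive with their URL already resolved (the real parser would look it up through `entityRanges` and the entity map), and ranges are assumed to either share the exact same span or not overlap at all. It is not the PR's implementation.

```python
from typing import Any

# Style name (upper-cased) -> Markdown marker.
STYLE_MARKERS = {
    "BOLD": "**",
    "ITALIC": "*",
    "CODE": "`",
    "STRIKETHROUGH": "~~",
}


def render_block(
    text: str,
    style_ranges: list[dict[str, Any]],
    link_ranges: list[dict[str, Any]],
) -> str:
    """Apply inline styles and links to `text` using original-text offsets."""
    # Group wrappers by (start, end) so Bold + Link on one span
    # become a single replacement instead of two competing ones.
    spans: dict[tuple[int, int], dict[str, Any]] = {}

    for r in style_ranges:
        start, end = r["offset"], r["offset"] + r["length"]
        # The API returns "Bold"/"Italic"; normalize before lookup.
        marker = STYLE_MARKERS.get(str(r.get("style", "")).upper())
        if marker is None or not (0 <= start < end <= len(text)):
            continue  # unknown style or out-of-bounds range
        spans.setdefault((start, end), {"markers": [], "url": None})["markers"].append(marker)

    for r in link_ranges:
        start, end = r["offset"], r["offset"] + r["length"]
        if not (0 <= start < end <= len(text)):
            continue
        spans.setdefault((start, end), {"markers": [], "url": None})["url"] = r["url"]

    # Build (start, end, replacement) tuples from the original text.
    ops = []
    for (start, end), wrap in spans.items():
        inner = text[start:end]
        for marker in wrap["markers"]:
            inner = f"{marker}{inner}{marker}"
        if wrap["url"]:
            inner = f"[{inner}]({wrap['url']})"
        ops.append((start, end, inner))

    # Apply right-to-left so earlier offsets are never shifted.
    ops.sort(key=lambda op: op[0], reverse=True)
    out = text
    for start, end, replacement in ops:
        out = out[:start] + replacement + out[end:]
    return out
```

For example, `render_block("Hello world, see docs", [{"offset": 0, "length": 5, "style": "Bold"}, {"offset": 17, "length": 4, "style": "Italic"}], [{"offset": 0, "length": 5, "url": "https://example.com"}])` yields `[**Hello**](https://example.com) world, see *docs*`. Partially overlapping ranges are outside what this sketch handles.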

`tests/test_article_parsing.py` (new)

25 unit tests covering:

All inline styles (Bold, Italic, Code, Strikethrough)
Links (including URL with parentheses)
Mixed Bold + Link on the same span (regression test)
All atomic entity types (DIVIDER, TWEET, MARKDOWN)
End-to-end _parse_article with synthetic article data
Edge cases (empty text, out-of-bounds offsets, unknown entity types)

Real-world validation

Tested against 43 real Twitter Articles (310,000+ chars total). Examples of content recovered:

| Article | v1 chars | v2 chars | New content |
| --- | --- | --- | --- |
| 15 LEVELS OF HERMES AGENT | 29,119 | 31,276 | +2 imgs, +24 dividers, +3 tweets, +130 bold, +17 links |
| Hermes Agent 完全指南 | 6,028 | 8,665 | +4 imgs, +10 dividers, +4 tweets, +32 bold, +31 links |
| Claude Code 架构与工程实践 | 11,120 | 19,608 | +22 imgs, +21 dividers, +64 bold, +27 links |

Test results

126 passed in 0.26s
├── 25 new (article parsing)
└── 101 existing (zero regressions)

…tyles from Articles

Twitter Article Draft.js content has several entity types that were being silently dropped during Markdown conversion:

- MEDIA entities → now rendered as `![caption](url)` with caption text
- DIVIDER entities → now rendered as `---`
- TWEET entities → now rendered as `> [Embedded Tweet](url)`
- inlineStyleRanges (Bold/Italic/Code/Strikethrough) → now converted to `**`, `*`, backticks, `~~` respectively

The previous implementation skipped all atomic blocks (images, dividers, embedded tweets) and ignored inline style ranges entirely.

Key design decisions:

- Style and link operations are collected as (start, end, replacement) tuples and applied right-to-left, so overlapping Bold+Link on the same text block don't corrupt each other's character offsets.
- Twitter API returns style names in Title case ('Bold', 'Italic') rather than uppercase, so comparisons use .upper() for case-insensitive matching.
- Image URLs are resolved from article_results.media_entities using the mediaId → original_img_url mapping chain (entityMap only contains mediaId, not the URL itself).

Tested against 43 real Twitter Articles (310K+ chars total). 25 new unit tests added, 101 existing tests still pass.

…_content The function was renamed from _extract_atomic_markdown to _extract_atomic_content. Update all references in the existing test suite.

zh-xl-kang · 2026-06-29T07:23:46Z

Hi! Just a gentle bump on this PR.

Quick summary: The Draft.js article parser was silently dropping images, dividers, embedded tweets, and all inline formatting (Bold/Italic/Code). This PR recovers all of them.

What changed (parser.py, +70/-29 lines):

_extract_atomic_markdown renamed to _extract_atomic_content: now handles DIVIDER and TWEET in addition to the existing MARKDOWN
_render_article_text_block: rewritten to handle inlineStyleRanges (Bold/Italic/Code/Strikethrough) and links in a unified right-to-left pass, fixing offset corruption when both appear on the same text span
Image URLs resolved via media_entities mapping chain (mediaId to original_img_url)
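The mediaId to original_img_url chain in the last item could be sketched roughly as below. The PR text only names `media_entities`, `mediaId`, and `original_img_url`; the helper name `build_media_url_map` and the nesting used here (`media_id` at the top level, the URL under `media_info`) are assumptions for illustration and may not match the real payload.

```python
from typing import Any, Optional


def build_media_url_map(
    media_entities: Optional[list[dict[str, Any]]],
) -> dict[str, str]:
    """Build a mediaId -> original image URL lookup.

    Assumed (hypothetical) item shape:
        {"media_id": "...", "media_info": {"original_img_url": "..."}}
    """
    urls: dict[str, str] = {}
    for item in media_entities or []:
        media_id = item.get("media_id")
        url = (item.get("media_info") or {}).get("original_img_url")
        if media_id and url:
            # Keys are stored as strings so lookups do not depend on
            # whether the entity map carries the ID as int or str.
            urls[str(media_id)] = url
    return urls
```

Building the lookup once per article and passing it to the block renderer keeps the entity-map handling free of any knowledge of the media payload layout.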

Validation: tested against 43 real Twitter Articles (310K+ chars). Example: a 29K-char article now recovers +2 images, +24 dividers, +3 embedded tweets, +130 bold spans, +17 links.

Tests: 25 new + 101 existing = 126 passed, 0 failures. Also updated test_client.py imports for the renamed function.

I noticed the CI workflow has not run yet — it may need maintainer approval for fork PRs. Happy to address any feedback. Thanks for the great tool!

zh-xl-kang added 2 commits June 29, 2026 12:34

fix(tests): update test_client.py imports for renamed _extract_atomic…

29497dc


zh-xl-kang wants to merge 2 commits into public-clis:main from zh-xl-kang:feat/article-rich-content
