feat(parser): extract images, dividers, embedded tweets, and inline styles from Articles#70
Open
zh-xl-kang wants to merge 2 commits into
…tyles from Articles
Twitter Article Draft.js content has several entity types that were being
silently dropped during Markdown conversion:
- MEDIA entities → now rendered as `![caption](url)` with caption text
- DIVIDER entities → now rendered as `---`
- TWEET entities → now rendered as `> [Embedded Tweet](url)`
- inlineStyleRanges (Bold/Italic/Code/Strikethrough) → now converted to
`**`, `*`, backticks, `~~` respectively
The previous implementation skipped all atomic blocks (images, dividers,
embedded tweets) and ignored inline style ranges entirely.
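As a rough illustration of the mapping listed above, a minimal sketch of rendering one atomic entity might look like the following. It is not the PR's code: the function name `render_atomic`, the `tweetId` and `caption` fields, and the prebuilt `url_by_media_id` lookup are all assumptions made for the example. Only the entity type names, the `mediaId` field, and the output shapes come from the description.

```python
from typing import Any, Dict


def render_atomic(entity: Dict[str, Any], url_by_media_id: Dict[str, str]) -> str:
    """Render one atomic-block entity as Markdown (illustrative sketch only)."""
    kind = entity.get("type", "")
    data = entity.get("data", {})

    if kind == "DIVIDER":
        return "---"

    if kind == "TWEET":
        # 'tweetId' is a hypothetical field name for the embedded tweet's ID.
        return f"> [Embedded Tweet](https://x.com/i/status/{data['tweetId']})"

    if kind == "MEDIA":
        # The entity carries only a mediaId; the URL comes from a separate lookup.
        url = url_by_media_id.get(data.get("mediaId", ""))
        if url is None:
            return ""  # unknown media: emit nothing rather than a broken link
        # 'caption' is a hypothetical field name for the caption text.
        return f"![{data.get('caption', '')}]({url})"

    return ""  # unrecognized entity types produce no output in this sketch


lookup = {"m1": "https://example.com/a.jpg"}
print(render_atomic({"type": "MEDIA", "data": {"mediaId": "m1", "caption": "Chart"}}, lookup))
```

Returning an empty string for unknown types keeps the sketch simple; a real parser might prefer to log such entities so that newly dropped content is noticed.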
Key design decisions:
- Style and link operations are collected as (start, end, replacement)
tuples and applied right-to-left, so overlapping Bold+Link on the same
text block don't corrupt each other's character offsets.
- Twitter API returns style names in Title case ('Bold', 'Italic') rather
than uppercase, so comparisons use .upper() for case-insensitive matching.
- Image URLs are resolved from article_results.media_entities using the
mediaId → original_img_url mapping chain (entityMap only contains mediaId,
not the URL itself).
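The first two decisions can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `render_block`, `STYLE_MARKERS`, and the `url` key on link ranges are hypothetical names, and the sketch assumes the style and link spans do not cover the same characters. The `offset`, `length`, and `style` keys follow the raw Draft.js range format.

```python
from typing import Dict, List, Tuple

# Hypothetical marker table; keys are upper-case so 'Bold' and 'BOLD' both match.
STYLE_MARKERS = {"BOLD": "**", "ITALIC": "*", "CODE": "`", "STRIKETHROUGH": "~~"}


def render_block(text: str, style_ranges: List[Dict], link_ranges: List[Dict]) -> str:
    """Apply style and link replacements right-to-left (illustrative sketch)."""
    ops: List[Tuple[int, int, str]] = []

    for r in style_ranges:
        marker = STYLE_MARKERS.get(r["style"].upper())  # case-insensitive match
        if marker is None:
            continue  # unknown style: leave the text untouched
        start, end = r["offset"], r["offset"] + r["length"]
        ops.append((start, end, f"{marker}{text[start:end]}{marker}"))

    for r in link_ranges:
        start, end = r["offset"], r["offset"] + r["length"]
        # 'url' is an assumed, already-resolved field for this example.
        ops.append((start, end, f"[{text[start:end]}]({r['url']})"))

    # Highest offset first: each edit only changes text to its right,
    # so the offsets of the remaining operations stay valid.
    ops.sort(key=lambda op: op[0], reverse=True)
    for start, end, replacement in ops:
        text = text[:start] + replacement + text[end:]
    return text


print(render_block(
    "Hello brave world",
    [{"offset": 0, "length": 5, "style": "Bold"}],
    [{"offset": 12, "length": 5, "url": "https://example.com"}],
))
```

Applying the same operations left-to-right would insert `**` around "Hello" first and shift "world" four characters to the right, so the link's original offset of 12 would no longer point at it.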
Tested against 43 real Twitter Articles (310K+ chars total).
25 new unit tests added, 101 existing tests still pass.
…_content
The function was renamed from _extract_atomic_markdown to _extract_atomic_content. Update all references in the existing test suite.
Hi! Just a gentle bump on this PR.
Quick summary: The Draft.js article parser was silently dropping images, dividers, embedded tweets, and all inline formatting (Bold/Italic/Code). This PR recovers all of them.
What changed (
Validation: tested against 43 real Twitter Articles (310K+ chars). Example: a 29K-char article now recovers +2 images, +24 dividers, +3 embedded tweets, +130 bold spans, +17 links.
Tests: 25 new + 101 existing = 126 passed, 0 failures.
Also updated I noticed the CI workflow has not run yet; it may need maintainer approval for fork PRs. Happy to address any feedback. Thanks for the great tool!
Summary
Twitter Article Draft.js content has several entity types that were being silently dropped during Markdown conversion. This PR adds support for all of them.
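What was missing
- Images (MEDIA entities) → `![caption](url)`
- Dividers (DIVIDER entities) → `---`
- Embedded tweets (TWEET entities) → `> [Embedded Tweet](https://x.com/i/status/ID)`
- Bold → `**text**`
- Italic → `*text*`
- Code → `` `text` ``
- Strikethrough → `~~text~~`
Changes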
twitter_cli/parser.py
- `_extract_atomic_markdown` → `_extract_atomic_content`: Renamed and extended to handle `DIVIDER` and `TWEET` entity types in addition to the existing `MARKDOWN` type.
- `_render_article_text_block`: Rewritten to handle both `inlineStyleRanges` (Bold/Italic/Code/Strikethrough) and `entityRanges` (links) in a unified right-to-left pass. This fixes a bug where applying styles before links would corrupt character offsets when both appeared on the same text span.
- Key design decision: all operations are collected as `(start, end, replacement)` tuples, sorted by offset descending, and applied right-to-left. This is correct because Draft.js offsets always reference the original text.
- Case-insensitive style matching: Twitter API returns style names in Title case (`"Bold"`, `"Italic"`) rather than uppercase. The code normalizes via `.upper()`.
tests/test_article_parsing.py (new)
25 unit tests covering:
- `_parse_article` with synthetic article data
Real-world validation
Tested against 43 real Twitter Articles (310,000+ chars total). Examples of content recovered:
Test results