Open
Labels
enhancement: New feature or request
Description
Summary
The CrossRef API returns abstracts that contain HTML/JATS XML tags, which need to be cleaned before display.
Problem
CrossRef abstracts often contain markup like:
<jats:p>Objective. Hippocampal ripples are high-frequency...</jats:p>
<jats:italic>in vitro</jats:italic>
When displayed on websites or in citations, these tags appear as raw text or cause rendering issues.
Proposed Solution
Add abstract cleaning functionality to scitex.scholar:
from scitex.scholar import utils
# Clean abstract from CrossRef response
clean_abstract = utils.clean_abstract(raw_abstract)
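# Or integrated into the Work object (import path assumed from the module listed under Related)
from scitex.scholar.local_dbs import crossref_scitex
work = crossref_scitex.get("10.1088/1741-2552/ac3266")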
work.abstract # Already cleaned
work.abstract_raw # Original with tags (if needed)
Tags to Handle
- JATS XML tags: <jats:p>, <jats:italic>, <jats:bold>, <jats:sup>, <jats:sub>
- HTML tags: <p>, <i>, <b>, <em>, <strong>, <sup>, <sub>
- Preserve meaningful whitespace and paragraph breaks
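A minimal sketch of how the tag list above could be handled with the standard library only. The function name `clean_abstract` follows the proposal, but the body, the regular expressions, and the choice to turn closing paragraph tags into blank lines are assumptions for illustration, not an existing scitex implementation.

```python
import html
import re

# Closing <p> / <jats:p> tags mark paragraph boundaries.
_PARA_END = re.compile(r"</(?:jats:)?p\s*>", re.IGNORECASE)
# Any other opening, closing, or self-closing tag, with or without a prefix.
_ANY_TAG = re.compile(r"</?[A-Za-z][\w:.-]*(?:\s[^<>]*)?/?>")


def clean_abstract(raw: str) -> str:
    """Strip JATS/HTML tags, keeping paragraph breaks as blank lines."""
    text = _PARA_END.sub("\n\n", raw)       # paragraph ends become blank lines
    text = _ANY_TAG.sub("", text)           # drop every remaining tag
    text = html.unescape(text)              # decode entities such as &amp;
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()
```

Two known limitations of this sketch: <sup> and <sub> content is kept inline without any marker, and literal text shaped like a tag (for example `a<b and c>d`) would be stripped by the regex. A real parser such as BeautifulSoup avoids the second problem.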
Implementation Options
- Strip all tags - Simple regex/BeautifulSoup approach
- Convert to plain text - Preserve formatting intent (italic → text)
- Convert to Markdown - <jats:italic> → _text_
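The Markdown option could look roughly like the sketch below. The helper name `abstract_to_markdown` and the tag-to-delimiter mapping are hypothetical; only the <jats:italic> → _text_ rule comes from this issue, and bold → `**text**` is an assumed extension. Paragraph, <sup>, and <sub> tags are left untouched here and would still need a separate stripping pass.

```python
import re

# Assumed mapping from tag name (jats: prefix optional) to Markdown delimiter.
_MD_DELIMS = {
    "italic": "_", "i": "_", "em": "_",
    "bold": "**", "b": "**", "strong": "**",
}

# An inline element whose closing tag name matches its opening tag name.
_INLINE = re.compile(
    r"<(?:jats:)?(italic|bold|strong|em|i|b)\s*>(.*?)</(?:jats:)?\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def abstract_to_markdown(raw: str) -> str:
    """Convert inline emphasis tags to Markdown; other tags are left as-is."""

    def _wrap(match: re.Match) -> str:
        delim = _MD_DELIMS[match.group(1).lower()]
        return f"{delim}{match.group(2)}{delim}"

    text, previous = raw, None
    while previous != text:  # repeat so differently named nested tags convert too
        previous = text
        text = _INLINE.sub(_wrap, text)
    return text
```

Usage: `abstract_to_markdown("grown <jats:italic>in vitro</jats:italic>")` returns `"grown _in vitro_"`. Nesting of the same tag inside itself is not handled by this regex approach.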
Use Cases
- Publications page display (scitex-cloud)
- Citation generation
- Paper metadata export
Related
- Issue #141: feat(scholar): Add comprehensive citation style management
- Module: scitex.scholar.local_dbs.crossref_scitex