feat(scholar): Strip HTML/JATS tags from CrossRef abstracts #142

@ywatanabe1989

Summary

The CrossRef API returns abstracts that contain HTML/JATS XML tags, which need to be stripped before display.

Problem

CrossRef abstracts often contain markup like:

<jats:p>Objective. Hippocampal ripples are high-frequency...</jats:p>
<jats:italic>in vitro</jats:italic>

When displayed on websites or in citations, these tags appear as raw text or cause rendering issues.

Proposed Solution

Add abstract cleaning functionality to scitex.scholar:

from scitex.scholar import utils

# Clean abstract from CrossRef response
clean_abstract = utils.clean_abstract(raw_abstract)

# Or integrated into Work object
work = crossref_scitex.get("10.1088/1741-2552/ac3266")
work.abstract  # Already cleaned
work.abstract_raw  # Original with tags (if needed)
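A minimal sketch of how a Work object could expose both forms, assuming the raw CrossRef string is stored and the cleaned text is derived on access. The `Work` dataclass and the `_TAG` pattern here are hypothetical stand-ins for illustration, not the existing scitex.scholar classes, and the placeholder cleaner only removes tags without handling paragraph breaks.

```python
import re
from dataclasses import dataclass

# Placeholder cleaner: matches any opening or closing tag such as
# <jats:p>, </jats:italic>, or <i>. Hypothetical, for illustration only.
_TAG = re.compile(r"</?[A-Za-z][^<>]*>")


@dataclass(frozen=True)
class Work:
    """Hypothetical stand-in for the proposed Work object."""

    doi: str
    abstract_raw: str = ""  # original CrossRef abstract, tags included

    @property
    def abstract(self) -> str:
        # Remove tags, then collapse runs of whitespace into single spaces.
        return " ".join(_TAG.sub("", self.abstract_raw).split())


work = Work(
    doi="10.1088/1741-2552/ac3266",
    abstract_raw="<jats:p>Studied <jats:italic>in vitro</jats:italic>.</jats:p>",
)
print(work.abstract)      # cleaned text
print(work.abstract_raw)  # original markup, unchanged
```

Keeping `abstract_raw` as the stored field means the cleaning rules can change later without losing the original markup.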

Tags to Handle

  • JATS XML tags: <jats:p>, <jats:italic>, <jats:bold>, <jats:sup>, <jats:sub>
  • HTML tags: <p>, <i>, <b>, <em>, <strong>, <sup>, <sub>
  • Preserve meaningful whitespace and paragraph breaks
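The tag list above could be handled by a single pass that turns paragraph closings into blank lines and drops every other tag. This is a rough sketch using only the standard library; the function name mirrors the proposed `utils.clean_abstract`, but the body is an assumption about how it might work, not an existing implementation.

```python
import html
import re

# Closing paragraph tags (</p> or </jats:p>) mark paragraph boundaries.
_PARAGRAPH_CLOSE = re.compile(r"</(?:jats:)?p\s*>", re.IGNORECASE)
# Any remaining opening/closing tag, with or without a namespace prefix.
# Requires a letter right after "<", so text like "p < 0.05" is left alone.
_ANY_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9:_-]*(?:\s[^<>]*)?/?>")


def clean_abstract(raw):
    """Strip HTML/JATS tags while keeping paragraph breaks (sketch)."""
    if not raw:
        return ""
    text = _PARAGRAPH_CLOSE.sub("\n\n", raw)  # keep paragraph structure
    text = _ANY_TAG.sub("", text)             # drop all other tags
    text = html.unescape(text)                # decode entities such as &amp;
    # Normalise whitespace inside each paragraph and drop empty paragraphs.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)


raw_abstract = (
    "<jats:p>Objective. Measured <jats:italic>in vitro</jats:italic>.</jats:p>"
    "<jats:p>Results. Signal &amp; noise.</jats:p>"
)
print(clean_abstract(raw_abstract))
```

Entities are decoded after tag removal so that an escaped `&lt;` in the abstract is not mistaken for the start of a tag.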

Implementation Options

  1. Strip all tags - Simple regex/BeautifulSoup approach
  2. Convert to plain text - Preserve formatting intent (italic → text)
  3. Convert to Markdown - <jats:italic>text</jats:italic> → _text_
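Option 3 could look roughly like the sketch below, which maps italic and bold tags to Markdown delimiters and leaves everything else untouched. The function name and the delimiter mapping are assumptions chosen for illustration; superscript and subscript are not converted here because plain Markdown has no standard syntax for them.

```python
import re

# Hypothetical mapping from tag names to Markdown delimiters.
_MARKDOWN_DELIMS = {
    "italic": "_", "i": "_", "em": "_",
    "bold": "**", "b": "**", "strong": "**",
}

# Opening or closing inline emphasis tags, with an optional "jats:" prefix.
_INLINE_TAG = re.compile(
    r"</?(?:jats:)?(italic|bold|strong|em|i|b)\s*>", re.IGNORECASE
)


def inline_tags_to_markdown(raw):
    """Replace italic/bold tags with Markdown delimiters (sketch)."""
    return _INLINE_TAG.sub(
        lambda m: _MARKDOWN_DELIMS[m.group(1).lower()], raw
    )


print(inline_tags_to_markdown("<jats:italic>in vitro</jats:italic>"))
```

A remaining paragraph or sup/sub tag would still need a separate stripping pass after this conversion.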

Use Cases

  • Publications page display (scitex-cloud)
  • Citation generation
  • Paper metadata export

Labels: enhancement (New feature or request)
