Europe PMC Development Tool

epmc-tools is a Python package that provides a command-line interface (epmc-cli) and a library for working with scientific literature from Europe PMC. It allows you to:

  • Process JATS XML from local files or URLs, converting them to JSON.
  • Extract accession numbers from text.
  • Split text into sentences.
  • Access the Europe PMC APIs for searching articles, grants, and annotations.
  • Harvest metadata via the OAI-PMH service.

Installation

To install the package, clone the repository and install it using pip:

git clone https://github.com/ML4LitS/epmc-tools.git
cd epmc-tools
pip install .

Dependencies

The required Python packages are installed automatically. The tool also relies on a scispaCy model (en_core_sci_sm), which must be downloaded separately:

pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
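Because the model is a separate download, a script may want to confirm it is present before calling spacy.load. The following is a minimal sketch that uses only the standard library; the helper name model_installed is hypothetical and not part of epmc-tools.

```python
import importlib.util


def model_installed(package_name: str = "en_core_sci_sm") -> bool:
    """Return True if the named top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if model_installed():
    print("en_core_sci_sm is available")
else:
    print("en_core_sci_sm is missing; install it with the pip command above")
```

Checking up front gives a clearer message than the error spaCy raises when a model cannot be loaded.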

Editable Mode

If you are developing the package, you may want to install it in editable mode:

pip install -e .

Usage

The tool can be used as a command-line interface (epmc-cli) or as a Python library.

Command-Line Interface

The epmc-cli tool provides a command-line interface for the Europe PMC API and for local file processing.

Local Commands

The local commands process files on your machine or from the web.

  • Convert JATS XML to JSON: The jats2json command accepts a local file path, a URL, or a PMCID as its input.

    From a local file:

    epmc-cli local jats2json test_data/PXD053361.xml output.json

    From a PMCID:

    epmc-cli local jats2json PMC11704132 output.json

    From a URL (with sentence splitting disabled):

    epmc-cli local jats2json https://some-url/article.xml output.json --no-sentenciser

  • Extract accession numbers: This command processes the JSON file created by jats2json.

    epmc-cli local extract-accessions-resources output.json accessions.json
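The two local commands chain naturally: the JSON written by jats2json is the input to extract-accessions-resources. Below is a sketch of driving both steps from Python with subprocess. It uses only the subcommands and the --no-sentenciser flag shown above; the helper names build_commands and run_pipeline are hypothetical, and it assumes epmc-cli exits with a non-zero status on failure.

```python
import subprocess


def build_commands(source, json_out, accessions_out, sentenciser=True):
    """Build the argv lists for the convert step and the extract step."""
    convert = ["epmc-cli", "local", "jats2json", source, json_out]
    if not sentenciser:
        convert.append("--no-sentenciser")
    extract = [
        "epmc-cli", "local", "extract-accessions-resources",
        json_out, accessions_out,
    ]
    return [convert, extract]


def run_pipeline(source, json_out="output.json", accessions_out="accessions.json"):
    """Run both steps in order; check=True raises if a step exits non-zero."""
    for argv in build_commands(source, json_out, accessions_out):
        subprocess.run(argv, check=True)


# Example call (requires epmc-cli on PATH):
# run_pipeline("PMC11704132")
```

Passing the arguments as a list avoids shell quoting problems when the source is a URL containing characters such as & or ?.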

API Commands

  • Search for articles:
    epmc-cli articles search "BRCA1" --page-size 1
  • Get article metadata:
    epmc-cli articles get PMC11704132
  • Get full-text XML:
    epmc-cli articles fulltext PMC11704132
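For readers who want to see what such requests look like on the wire, here is a sketch that builds the corresponding URLs for the public Europe PMC RESTful web service using the standard library. That epmc-cli sends exactly these parameters is an assumption, and the function names search_url and fulltext_url are hypothetical.

```python
from urllib.parse import quote, urlencode

# Base address of the public Europe PMC RESTful web service.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def search_url(query, page_size=25):
    """Build a search URL that asks for JSON results."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{BASE_URL}/search?{urlencode(params)}"


def fulltext_url(pmcid):
    """Build the URL of the full-text XML for an open-access article."""
    return f"{BASE_URL}/{quote(pmcid)}/fullTextXML"


print(search_url("BRCA1", page_size=1))
print(fulltext_url("PMC11704132"))
```

Either URL can be fetched with any HTTP client, which is handy for debugging when a CLI call returns unexpected results.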

For more detailed usage instructions, please refer to the documentation.

Library

The core components of epmc-tools (imported as the europmc_dev_tool package) can be used directly in your Python scripts, which makes it easier to integrate them into your own workflows.

A full example can be found in script_usage_example.py, which shows how to build a robust pipeline. Here is a simplified version:

import json
import os
import requests
import spacy
from europmc_dev_tool.api.articles import ArticlesClient
from europmc_dev_tool.jats_processor import XMLProcessor
from europmc_dev_tool.section_maps import ordered_labels
from europmc_dev_tool.spacy_extractor import extract_with_spacy

def get_xml_content(identifier):
    """
    Intelligently fetches JATS XML content from a PMCID, URL, or local file.
    """
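    if identifier.startswith(('http://', 'https://')):
        # Remote JATS XML: set a timeout and fail fast on HTTP errors.
        response = requests.get(identifier, timeout=30)
        response.raise_for_status()
        return response.text
    elif os.path.exists(identifier):
        # Local file: read as UTF-8, the usual encoding for JATS XML.
        with open(identifier, 'r', encoding='utf-8') as f:
            return f.read()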
    elif identifier.upper().startswith('PMC'):
        articles_client = ArticlesClient()
        return articles_client.get_fulltext_xml(identifier)
    else:
        raise ValueError("Input is not a valid file path, URL, or PMCID.")

# 1. Fetch content (example with a PMCID)
xml_content = get_xml_content("PMC11704132")

if xml_content:
    # 2. Process the JATS XML to JSON
    processor = XMLProcessor()
    processed_data = processor.process_full_text(xml_content)
    final_json = processor.process_json(processed_data, ordered_labels)

    # 3. Extract accession numbers and resources
    nlp = spacy.load("en_core_sci_sm")
    all_extractions = []
    for section, sentences in final_json.get('sections', {}).items():
        for sentence in sentences:
            text = sentence.get('text', '')
            sentence_id = sentence.get('sentence_id')
            extractions = extract_with_spacy(nlp, text, section, sentence_id)
            if extractions:
                all_extractions.extend(extractions)

    print(json.dumps(all_extractions, indent=2))
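The order of checks in get_xml_content matters: URLs are tested first, then existing paths, then PMCIDs. The same decision can be isolated into a pure function, which makes it easy to unit test without network access. This is a sketch only; classify_identifier is a hypothetical helper, and the stricter "PMC followed by digits" pattern is an assumption about what counts as a valid PMCID here.

```python
import os
import re

# Assumed PMCID shape: the letters PMC followed by one or more digits.
PMCID_PATTERN = re.compile(r"PMC\d+", re.IGNORECASE)


def classify_identifier(identifier):
    """Return 'url', 'file', or 'pmcid'; raise ValueError for anything else."""
    if identifier.startswith(("http://", "https://")):
        return "url"
    if os.path.exists(identifier):
        return "file"
    if PMCID_PATTERN.fullmatch(identifier):
        return "pmcid"
    raise ValueError(f"Not a valid file path, URL, or PMCID: {identifier!r}")


print(classify_identifier("https://some-url/article.xml"))
```

Compared with a bare startswith('PMC') check, the full-match pattern rejects inputs such as "PMC-notes.txt" when that file does not exist, rather than sending them to the API.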

For more detailed usage instructions, please refer to the documentation.
