API Documentation

The core functionality of Pubmed Parser can be divided into three main parts based on the source of the input data: MEDLINE XML, the PubMed Open Access subset (PubMed OA), or the web (using eutils). Below, we list the core APIs implemented in Pubmed Parser.

Parse MEDLINE XML

pubmed_parser.parse_medline_xml(path, year_info_only=True, nlm_category=False, author_list=False, reference_list=False, parse_downto_mesh_subterms=False)

Parse an XML file in MEDLINE XML format, available at https://ftp.ncbi.nlm.nih.gov/pubmed/.

Parameters

path: str: The path to the MEDLINE XML file to parse
year_info_only: bool: If True, only the year is extracted from PubDate. If False, an attempt is made to harvest all available PubDate information: if only year and month are available, the date has the form ‘YYYY-MM’; if year, month, and day are available, the date has the form ‘YYYY-MM-DD’. NOTE: the resolution of PubDate information in the MEDLINE(R) database varies between articles. default: True
nlm_category: bool: If True, parse a structured abstract with each section under its original Label. If False, parse a structured abstract with each section assigned to its NLM category. default: False
author_list: bool: If True, return the parsed authors as a list. If False, return them as a single string concatenated with ‘;’. default: False
reference_list: bool: If True, return the parsed reference list. If False, return a string of PMIDs concatenated with ‘;’. default: False
parse_downto_mesh_subterms: bool: If True, return MeSH terms concatenated with ‘; ’, with MeSH subterms concatenated with ‘ / ’ and a ‘*’ appended to each major subterm. If False, return MeSH terms concatenated with ‘; ’. default: False
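Because year_info_only=False can yield dates at three different resolutions, downstream code often needs to normalize them. The following is a minimal sketch, not part of Pubmed Parser: the helper name split_pubdate is hypothetical, and it assumes the date strings follow the ‘YYYY’, ‘YYYY-MM’, or ‘YYYY-MM-DD’ forms described above.

```python
from typing import Optional, Tuple


def split_pubdate(pubdate: str) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """Split a 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' string into (year, month, day).

    Components that are absent or non-numeric are returned as None.
    (Hypothetical helper; not part of pubmed_parser.)
    """
    parts = pubdate.strip().split("-") if pubdate else []
    # Pad to exactly three components so unpacking always succeeds.
    parts = (parts + [None, None, None])[:3]
    year, month, day = (int(p) if p and p.isdigit() else None for p in parts)
    return year, month, day


print(split_pubdate("2001"))        # (2001, None, None)
print(split_pubdate("2001-05"))     # (2001, 5, None)
print(split_pubdate("2001-05-11"))  # (2001, 5, 11)
```

Returning None for missing components keeps the varying resolution explicit instead of silently defaulting to the first month or day.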

Return

An iterator of dictionaries containing information about articles in NLM format (see parse_article_info). Articles that have been deleted are included with no information other than pmid and the field delete set to True.

Examples

>>> article_iterator = pubmed_parser.parse_medline_xml('data/pubmed20n0014.xml.gz')
>>> for article in article_iterator:
...     if article.get('delete'):
...         print(f"Deleted PMID: {article['pmid']}")
...     else:
...         print(article['title'])
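The string format produced with parse_downto_mesh_subterms=True, as described under Parameters, can be split back into structured records. This is a rough sketch under stated assumptions: the helper split_mesh_terms is hypothetical, the sample string uses placeholder names rather than real MeSH output, and the separators are taken from the parameter description (‘;’ between terms, ‘ / ’ between a term and its subterms, ‘*’ marking a major subterm).

```python
from typing import Dict, List


def split_mesh_terms(mesh_terms: str) -> List[Dict[str, object]]:
    """Split a MeSH string into one record per term.

    Hypothetical helper; the separators follow the parameter description,
    not a verified specification of the library output.
    """
    records = []
    for entry in mesh_terms.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # The first component is the term itself; the rest are its subterms.
        term, *subterms = [part.strip() for part in entry.split(" / ")]
        records.append({
            "term": term,
            "subterms": [s.rstrip("*") for s in subterms],
            "major_subterms": [s.rstrip("*") for s in subterms if s.endswith("*")],
        })
    return records


# Placeholder input, not real MeSH data.
for record in split_mesh_terms("Term A / subterm x* / subterm y; Term B"):
    print(record)
```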

pubmed_parser.parse_grant_id(pubmed_article)

Parse grant IDs and related information from a given MEDLINE tree.

Parameters

pubmed_article: Element: The lxml node pointing to a medline document

Returns

grant_list: list: List of grants acknowledged in the publication. Each entry is a dictionary containing the PubMed ID, grant ID, grant acronym, country, and agency.
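A list of grant records like this is often regrouped by funder. The sketch below is illustrative only: the key names ‘pmid’, ‘grant_id’, ‘grant_acronym’, ‘country’, and ‘agency’ are assumptions based on the field list above (the documentation does not spell them out), the helper name is hypothetical, and the sample values are placeholders.

```python
from collections import defaultdict
from typing import Dict, List


def group_grants_by_agency(grant_list: List[dict]) -> Dict[str, List[str]]:
    """Map each agency to the grant IDs attributed to it.

    Hypothetical helper; the 'agency' and 'grant_id' keys are assumed names.
    """
    by_agency = defaultdict(list)
    for grant in grant_list:
        # .get() keeps the sketch tolerant of missing fields.
        agency = grant.get("agency") or "unknown"
        by_agency[agency].append(grant.get("grant_id"))
    return dict(by_agency)


# Placeholder entries; real values would come from parse_grant_id.
sample_grants = [
    {"pmid": "1", "grant_id": "G-001", "grant_acronym": "AA", "country": "X", "agency": "Agency A"},
    {"pmid": "1", "grant_id": "G-002", "grant_acronym": "BB", "country": "X", "agency": "Agency B"},
    {"pmid": "2", "grant_id": "G-003", "grant_acronym": "AA", "country": "X", "agency": "Agency A"},
]
print(group_grants_by_agency(sample_grants))
```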

Parse PubMed OA XML

pubmed_parser.parse_pubmed_xml(path, include_path=False, nxml=False)

Given the path to a PubMed OA XML file, extract its information and metadata and return them as a dictionary. See ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ for the list of files available for download.

Parameters

path: str: A path to a given PubMed XML file
include_path: bool: If True, include the key ‘path_to_file’ in the output dictionary. default: False
nxml: bool: If True, strip the XML namespace after reading the file; see https://stackoverflow.com/questions/18159221/remove-namespace-and-prefix-from-xml-in-python-using-lxml. default: False

Return

dict_out: dict: A dictionary containing the following keys parsed from the XML file: ‘full_title’, ‘abstract’, ‘journal’, ‘pmid’, ‘pmc’, ‘doi’, ‘publisher_id’, ‘author_list’, ‘affiliation_list’, ‘publication_year’, ‘publication_date’, ‘epublication_date’, ‘subjects’
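When processing many files, it can help to check each parsed record against the documented key list before storing it. The following sketch assumes only the key names listed above; the helper summarize_record and the sample record are hypothetical.

```python
EXPECTED_KEYS = (
    "full_title", "abstract", "journal", "pmid", "pmc", "doi",
    "publisher_id", "author_list", "affiliation_list",
    "publication_year", "publication_date", "epublication_date", "subjects",
)


def summarize_record(dict_out: dict) -> dict:
    """Report which documented keys are missing or empty in a parsed record.

    Hypothetical helper; not part of pubmed_parser.
    """
    missing = [k for k in EXPECTED_KEYS if k not in dict_out]
    empty = [k for k in EXPECTED_KEYS if k in dict_out and not dict_out[k]]
    return {"missing": missing, "empty": empty}


# Placeholder record standing in for the output of parse_pubmed_xml.
sample_record = {"full_title": "A title", "abstract": "", "pmid": "123"}
print(summarize_record(sample_record))
```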

pubmed_parser.parse_pubmed_references(path)

Given the path to an XML file, parse its references into a list of dictionaries.

Parameters

path: str: A path to an XML file

Return

dict_refs: list: A list of dictionaries, one for each reference made in the given file.

pubmed_parser.parse_pubmed_paragraph(path)

Given the path to a PubMed OA file, parse and return all paragraphs, the section each paragraph belongs to, and the list of references made in each paragraph.

Parameters

path: str: A path to an XML file

Return

dict_pars: list: A list of dictionaries, each holding a paragraph’s text and its metadata. Metadata includes the ‘pmc’ and ‘pmid’ of the article, ‘reference_ids’ (a list of the reference rids made in the paragraph), the ‘section’ name, and the paragraph ‘text’.
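Since each paragraph carries its own section name and reference IDs, the list can be regrouped without re-reading the XML. The sketch below relies only on the ‘section’, ‘text’, and ‘reference_ids’ keys documented above; both helper names are hypothetical and the sample records are placeholders.

```python
from typing import Dict, List


def paragraphs_by_section(dict_pars: List[dict]) -> Dict[str, List[str]]:
    """Group paragraph texts under their section name, keeping document order."""
    sections: Dict[str, List[str]] = {}
    for par in dict_pars:
        sections.setdefault(par.get("section", ""), []).append(par.get("text", ""))
    return sections


def cited_reference_ids(dict_pars: List[dict]) -> List[str]:
    """Collect reference rids in order of first appearance, without duplicates."""
    seen: List[str] = []
    for par in dict_pars:
        for rid in par.get("reference_ids", []):
            if rid not in seen:
                seen.append(rid)
    return seen


# Placeholder records standing in for the output of parse_pubmed_paragraph.
sample_pars = [
    {"pmc": "1", "pmid": "2", "reference_ids": ["B1", "B2"], "section": "Introduction", "text": "First paragraph."},
    {"pmc": "1", "pmid": "2", "reference_ids": ["B2"], "section": "Introduction", "text": "Second paragraph."},
    {"pmc": "1", "pmid": "2", "reference_ids": [], "section": "Methods", "text": "Third paragraph."},
]
print(paragraphs_by_section(sample_pars))
print(cited_reference_ids(sample_pars))
```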

pubmed_parser.parse_pubmed_caption(path)

Given a single XML path, extract the figure captions and the reference IDs that point back to each figure.

Parameters

path: str: A path to a PubMed OA XML file

Return

dict_captions: list: A list of dictionaries, one per figure, each holding the figure ID (‘fig_id’) and its metadata. Metadata includes ‘pmid’, ‘pmc’, ‘fig_label’ (the figure’s label), ‘fig_caption’ (the figure’s caption), and ‘graphic_ref’ (a file name corresponding to a figure file in the OA bulk download).

Examples

>>> pubmed_parser.parse_pubmed_caption('data/pone.0000217.nxml')
[{
    'pmid': '17299597',
    'pmc': '1790863',
    'fig_caption': "Fisher's geometric model in two-dimensional phenotypic space. ...",
    'fig_id': 'pone-0000217-g001',
    'fig_label': 'Figure 1',
    'graphic_ref': 'pone.0000217.g001'
}, ...]
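A common next step is to index the captions by figure ID so they can be matched to image files. The sketch below reuses the single record shown in the example output (with its caption left truncated as above); the helper name index_captions is hypothetical.

```python
from typing import Dict, List


def index_captions(dict_captions: List[dict]) -> Dict[str, dict]:
    """Index caption records by 'fig_id', keeping label, caption and graphic name."""
    return {
        cap["fig_id"]: {
            "label": cap.get("fig_label"),
            "caption": cap.get("fig_caption"),
            "graphic_ref": cap.get("graphic_ref"),
        }
        for cap in dict_captions
    }


# The record from the example output above; the caption stays truncated.
captions = [{
    "pmid": "17299597",
    "pmc": "1790863",
    "fig_caption": "Fisher's geometric model in two-dimensional phenotypic space. ...",
    "fig_id": "pone-0000217-g001",
    "fig_label": "Figure 1",
    "graphic_ref": "pone.0000217.g001",
}]
index = index_captions(captions)
print(index["pone-0000217-g001"]["graphic_ref"])  # pone.0000217.g001
```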

pubmed_parser.parse_pubmed_table(path, return_xml=True)

Parse tables from a given PubMed Open Access XML file.

Parameters

path: str: A path to a PubMed OA XML file
return_xml: bool: If True, each dictionary in the output list has a key ‘table_xml’, which holds the XML string of the parsed table. default: True

Return

table_dicts: list: A list of dictionaries, one per table, each with its metadata. Metadata includes ‘pmid’, ‘pmc’, ‘label’ (as it appears in the full text), and ‘caption’.
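When return_xml=True, the ‘table_xml’ value can be re-parsed to recover cell text. The following is a sketch under assumptions: the sample XML is a hypothetical, namespace-free fragment using tr, th, and td elements, and real ‘table_xml’ strings may be structured differently. The helper table_rows is not part of Pubmed Parser.

```python
import xml.etree.ElementTree as ET
from typing import List


def table_rows(table_xml: str) -> List[List[str]]:
    """Return the text of every cell, row by row, from a table XML string.

    Hypothetical helper; assumes un-namespaced <tr>, <th> and <td> elements.
    """
    root = ET.fromstring(table_xml)
    rows = []
    for tr in root.iter("tr"):
        # itertext() also gathers text nested inside inline markup such as <italic>.
        cells = ["".join(cell.itertext()).strip() for cell in tr if cell.tag in ("th", "td")]
        rows.append(cells)
    return rows


# Hypothetical XML fragment, not taken from a real article.
sample_xml = (
    "<table-wrap><table>"
    "<thead><tr><th>Gene</th><th>Count</th></tr></thead>"
    "<tbody><tr><td><italic>abc</italic></td><td>3</td></tr></tbody>"
    "</table></table-wrap>"
)
print(table_rows(sample_xml))  # [['Gene', 'Count'], ['abc', '3']]
```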

Parse from Website

pubmed_parser.parse_xml_web(pmid, sleep=None, save_xml=False)

Given an input PMID, load and parse its XML using PubMed eutils.

Parameters

pmid: str: The PMID to parse from eutils
sleep: int: How long to wait (in seconds) after parsing one PMID from eutils. default: None
save_xml: bool: If True, save the XML output as a string under the key ‘xml’ in the output dictionary, which is useful for checking the parsed information. If False, the full XML is not saved to the output. default: False

Return

dict_out: dict: A dictionary containing the information parsed from the XML of the given PMID

Examples

>>> pubmed_parser.parse_xml_web(11360989, sleep=1, save_xml=False)
{
    'title': 'Molecular biology and evolution. Can genes explain biological complexity?',
    'abstract': '',
    'journal': 'Science (New York, N.Y.)',
    'affiliation': 'Collegium Budapest (Institute for Advanced Study), 2 Szentháromság u., H-1014 Budapest, Hungary. szathmary@colbud.hu',
    'authors': 'E Szathmáry; F Jordán; C Pál',
    'keywords': 'D000818:Animals;D005075:Biological Evolution;...',
    'doi': '10.1126/science.1060852',
    'year': '2001',
    'version_id': None,
    'version_date': None,
    'pmid': '11360989'
}
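The ‘authors’ and ‘keywords’ values in the output above are delimited strings. A minimal sketch for splitting them follows, assuming the ‘;’ separator and the ‘ID:name’ keyword layout visible in the example; both helper names are hypothetical, and entries without a colon (such as the elided ‘...’) are skipped.

```python
from typing import List, Tuple


def split_authors(authors: str) -> List[str]:
    """Split a ';'-separated author string into a list of names."""
    return [name.strip() for name in authors.split(";") if name.strip()]


def split_keywords(keywords: str) -> List[Tuple[str, str]]:
    """Split 'ID:name;ID:name' into (id, name) pairs, skipping malformed items."""
    pairs = []
    for item in keywords.split(";"):
        item = item.strip()
        if ":" not in item:
            continue  # e.g. an empty item or an elided '...'
        mesh_id, _, name = item.partition(":")
        pairs.append((mesh_id, name))
    return pairs


# Strings copied from the example output above.
print(split_authors("E Szathmáry; F Jordán; C Pál"))
print(split_keywords("D000818:Animals;D005075:Biological Evolution;..."))
```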

pubmed_parser.parse_citation_web(doc_id, id_type='PMC')

Parse citations for a given document ID.

Parameters

doc_id: (str, int): The document ID
id_type: str: The type of doc_id. One of ‘PMC’, ‘PMID’, ‘DOI’, or ‘OTHER’. default: ‘PMC’

Return

dict_out: dict: A dictionary containing the following keys: ‘pmc’ (PubMed Central ID), ‘pmid’ (PubMed ID), ‘doi’ (DOI of the article), ‘n_citations’ (number of citations for the given article), ‘pmc_cited’ (list of PMCs that cite the given PMC)

Examples

>>> pubmed_parser.parse_citation_web(6933944, id_type='PMC')
{
    'n_citations': 0,
    'pmid': '31624211',
    'pmc': '6933944',
    'doi': '10.1126/science.aax1562',
    'pmc_cited': []
}
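Before calling the web API in bulk, it can be useful to validate id_type on the client side and to sanity-check returned records. The sketch below is an assumption-laden illustration: both helpers are hypothetical, the upper-casing is a client-side convenience rather than documented library behavior, and the consistency rule (‘n_citations’ equals the length of ‘pmc_cited’) is inferred from the key descriptions.

```python
VALID_ID_TYPES = ("PMC", "PMID", "DOI", "OTHER")


def normalize_id_type(id_type: str) -> str:
    """Upper-case an id_type and reject values outside the documented choices."""
    normalized = id_type.strip().upper()
    if normalized not in VALID_ID_TYPES:
        raise ValueError(f"id_type must be one of {VALID_ID_TYPES}, got {id_type!r}")
    return normalized


def is_consistent(dict_out: dict) -> bool:
    """Check that 'n_citations' matches the number of entries in 'pmc_cited'."""
    return dict_out["n_citations"] == len(dict_out["pmc_cited"])


# The record from the example output above.
record = {
    "n_citations": 0,
    "pmid": "31624211",
    "pmc": "6933944",
    "doi": "10.1126/science.aax1562",
    "pmc_cited": [],
}
print(normalize_id_type("pmc"), is_consistent(record))  # PMC True
```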

pubmed_parser.parse_outgoing_citation_web(doc_id, id_type='PMC')

Load the outgoing citations of a given document from the NCBI eutils API.

Example URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&id=221212

Parameters

doc_id: str: The document ID
id_type: str: The type of the provided document ID, either ‘PMC’ or ‘PMID’. default: ‘PMC’

Return

dict_out: dict: A dictionary containing the following keys: ‘n_citations’ (number of citations for that article), ‘doc_id’ (the document ID number), ‘id_type’ (the type of document ID provided, PMCID or PMID), ‘pmid_cited’ (a list of the papers cited by the document, as PMIDs)

Examples

>>> pubmed_parser.parse_outgoing_citation_web(6933944, id_type='PMC')
{
    'n_citations': 11,
    'doc_id': '6933944',
    'id_type': 'PMC',
    'pmid_cited': ['30705152', ..., ]
}
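The example URL given for this function can be reproduced with the standard library, which makes the query parameters explicit. This sketch only covers the PMC case shown in that URL; the helper name outgoing_citation_url is hypothetical, and the parameters used for id_type='PMID' are not shown in this documentation, so they are left out.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def outgoing_citation_url(pmc_id) -> str:
    """Build the elink URL for the references of a PMC document.

    Hypothetical helper mirroring the example URL above; it does not
    perform any network request.
    """
    params = {"dbfrom": "pmc", "linkname": "pmc_refs_pubmed", "id": str(pmc_id)}
    return f"{ELINK_URL}?{urlencode(params)}"


print(outgoing_citation_url(221212))
```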