API Documentation

The core functionality of Pubmed Parser can be divided into three main parts based on the source of the input data: MEDLINE XML, the PubMed Open Access subset (PubMed OA), or the web (using eutils). Below, we list the core APIs implemented in Pubmed Parser.

Parse MEDLINE XML

pubmed_parser.parse_medline_xml(path, year_info_only=True, nlm_category=False, author_list=False, reference_list=False, parse_downto_mesh_subterms=False)

Parse an XML file in MEDLINE XML format, available at https://ftp.ncbi.nlm.nih.gov/pubmed/

Parameters

path: str

The path to the MEDLINE XML file to parse.

year_info_only: bool

If True, only the year is extracted from PubDate. If False, an attempt is made to harvest all available PubDate information: if only year and month are available, the date has the form ‘YYYY-MM’; if year, month, and day are available, it has the form ‘YYYY-MM-DD’. NOTE: the resolution of PubDate information in the MEDLINE(R) database varies between articles. Default: True

nlm_category: bool

If True, parse a structured abstract with each section under its original Label. If False, parse a structured abstract with each section assigned to its NLM category. Default: False

author_list: bool

If True, return the parsed authors as a list. If False, return the parsed authors as a single string concatenated with ‘;’. Default: False

reference_list: bool

If True, return the parsed reference list. If False, return a string of PMIDs concatenated with ‘;’. Default: False

parse_downto_mesh_subterms: bool

If True, return MeSH terms concatenated with “; ”, with MeSH subterms concatenated with “ / ” and marked with * if the subterm is major. If False, return MeSH terms concatenated with “; ”. Default: False
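The string layout described for parse_downto_mesh_subterms=True can be split back into structured data. The sketch below is only an illustration: the helper name split_mesh_terms and the sample string are made up, and it assumes each entry is a term followed by its subterms, all joined by " / ". Check real output before relying on it.

```python
def split_mesh_terms(mesh_string):
    """Split a MeSH string into (term, [(subterm, is_major), ...]) pairs.

    Assumes terms are separated by "; " and that each term is followed by
    its subterms separated by " / ", with "*" marking a major subterm.
    """
    parsed = []
    for entry in mesh_string.split("; "):
        entry = entry.strip()
        if not entry:
            continue
        term, *subterms = entry.split(" / ")
        parsed.append(
            (term, [(sub.rstrip("*"), sub.endswith("*")) for sub in subterms])
        )
    return parsed


# Hypothetical value written to match the description, not real library output.
sample = "Neoplasms / genetics* / therapy; Humans"
result = split_mesh_terms(sample)
```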

Return

An iterator of dictionaries containing information about articles in NLM format (see parse_article_info). Articles that have been deleted are included with no information other than the field delete set to True.

Examples

>>> article_iterator = pubmed_parser.parse_medline_xml('data/pubmed20n0014.xml.gz')
>>> for article in article_iterator:
...     print(article['title'])
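When consuming the iterator, deleted articles and varying PubDate resolution usually need handling. The following is a minimal sketch on stand-in dictionaries rather than real parser output; the key name 'pubdate' and the helper names are assumptions, while 'delete' and the date forms follow the descriptions above.

```python
from datetime import date


def parse_pubdate(pubdate):
    """Turn 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' into a date.

    Missing month or day values are padded with 1.
    """
    parts = [int(piece) for piece in pubdate.split("-")]
    year, month, day = (parts + [1, 1])[:3]
    return date(year, month, day)


def live_articles(articles):
    """Yield only the articles that are not flagged as deleted."""
    for article in articles:
        if article.get("delete"):
            continue
        yield article


# Stand-in records; 'pubdate' is an assumed key name.
sample = [
    {"title": "First article", "pubdate": "2001-05", "delete": False},
    {"delete": True},
    {"title": "Second article", "pubdate": "1999", "delete": False},
]
dates = [parse_pubdate(a["pubdate"]) for a in live_articles(sample)]
```

An empty date string would raise a ValueError here, so real code may want to guard against articles without PubDate information.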

pubmed_parser.parse_grant_id(pubmed_article)

Parse grant IDs and related information from a given MEDLINE tree

Parameters

pubmed_article: Element: The lxml node pointing to a MEDLINE document

Returns

grant_list: list: A list of grants acknowledged in the publication. Each entry is a dictionary containing the PubMed ID, grant ID, grant acronym, country, and agency.
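A grant list of this shape can be summarised with the standard library. The sketch below assumes the dictionary keys are named 'pmid', 'grant_id', 'grant_acronym', 'country', and 'agency'; the key names and all sample values are placeholders, not real output.

```python
from collections import Counter


def grants_per_agency(grants):
    """Count how many grants each agency contributed."""
    return Counter(grant["agency"] for grant in grants)


# Placeholder entries; key names are assumed from the field list above.
grant_list = [
    {"pmid": "1", "grant_id": "G-001", "grant_acronym": "AA",
     "country": "Country A", "agency": "Agency X"},
    {"pmid": "1", "grant_id": "G-002", "grant_acronym": "BB",
     "country": "Country A", "agency": "Agency X"},
    {"pmid": "1", "grant_id": "G-003", "grant_acronym": "CC",
     "country": "Country B", "agency": "Agency Y"},
]
counts = grants_per_agency(grant_list)
```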

Parse PubMed OA XML

pubmed_parser.parse_pubmed_xml(path, include_path=False, nxml=False)

Given the path to a PubMed XML file, extract information and metadata from it and return the parsed content as a dictionary. See ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ for the list of files available for download.

Parameters

path: str: The path to a PubMed XML file.
include_path: bool: If True, include the key ‘path_to_file’ in the output dictionary. Default: False
nxml: bool: If True, strip the XML namespace after reading the file; see https://stackoverflow.com/questions/18159221/remove-namespace-and-prefix-from-xml-in-python-using-lxml Default: False

Return

dict_out: dict: A dictionary containing the following keys from the parsed XML file: ‘full_title’, ‘abstract’, ‘journal’, ‘pmid’, ‘pmc’, ‘doi’, ‘publisher_id’, ‘author_list’, ‘affiliation_list’, ‘publication_year’, ‘publication_date’, ‘subjects’
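Because the function handles one file at a time, a bulk download is typically walked file by file. A minimal sketch of that loop follows; parse_many and fake_parser are hypothetical names, the parser is passed in as a callable so the example runs without the library, and the '.nxml' pattern is an assumption based on the file name used in the caption example.

```python
import tempfile
from pathlib import Path


def parse_many(root, parse_fn, pattern="*.nxml"):
    """Apply parse_fn to every file under root that matches pattern."""
    return [parse_fn(str(path)) for path in sorted(Path(root).rglob(pattern))]


def fake_parser(path):
    """Stand-in for a real parser; returns only the path key."""
    return {"path_to_file": path}


# Demonstration on a temporary directory with two matching files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.nxml", "a.nxml", "notes.txt"):
        (Path(tmp) / name).write_text("<article/>")
    results = parse_many(tmp, fake_parser)
    names = [Path(item["path_to_file"]).name for item in results]
```

With the library installed, parse_fn could be something like lambda p: pubmed_parser.parse_pubmed_xml(p, include_path=True), so that each result records which file it came from.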

pubmed_parser.parse_pubmed_references(path)

Given the path to an XML file, parse the referenced articles into a list of dictionaries

Parameters

path: str: The path to an XML file.

Return

dict_refs: list: A list of dictionaries, one for each reference made in the given file.

pubmed_parser.parse_pubmed_paragraph(path, all_paragraph=False)

Given the path to a PubMed OA file, parse it and return every paragraph together with the section it belongs to and the references made in it, given as a list of PMIDs

Parameters

path: str: The path to an XML file.
all_paragraph: bool: If True, include all paragraphs. If False, include only paragraphs that contain at least one reference, which avoids noisy parsed text. Default: False

Return

dict_pars: list: A list of dictionaries, each holding a paragraph's text and its metadata: ‘pmc’ and ‘pmid’ of the article, ‘reference_ids’ (a list of the reference rids made in the paragraph), ‘section’ (the name of the section), and ‘text’ (the paragraph text)
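A list of paragraph dictionaries like this is easy to regroup. The sketch below uses the documented keys 'section', 'text', and 'reference_ids' on made-up sample values; the helper names are hypothetical.

```python
from collections import defaultdict


def group_by_section(paragraphs):
    """Collect paragraph texts under the section they belong to."""
    sections = defaultdict(list)
    for paragraph in paragraphs:
        sections[paragraph["section"]].append(paragraph["text"])
    return dict(sections)


def referenced_ids(paragraphs):
    """Return the sorted set of reference ids used across all paragraphs."""
    return sorted({rid for p in paragraphs for rid in p["reference_ids"]})


# Made-up sample in the documented shape.
paragraphs = [
    {"pmc": "1", "pmid": "2", "section": "Introduction",
     "text": "First paragraph.", "reference_ids": ["ref-2", "ref-1"]},
    {"pmc": "1", "pmid": "2", "section": "Introduction",
     "text": "Second paragraph.", "reference_ids": ["ref-2"]},
    {"pmc": "1", "pmid": "2", "section": "Methods",
     "text": "Third paragraph.", "reference_ids": ["ref-3"]},
]
by_section = group_by_section(paragraphs)
all_refs = referenced_ids(paragraphs)
```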

pubmed_parser.parse_pubmed_caption(path)

Given the path to a single XML file, extract the figure captions and the reference IDs pointing back to each figure

Parameters

path: str: The path to a PubMed OA XML file

Return

dict_captions: list: A list of dictionaries, one per figure, each containing the figure ID (‘fig_id’) and its metadata: ‘pmid’, ‘pmc’, ‘fig_label’, ‘fig_caption’ (the figure’s caption), and ‘graphic_ref’ (the file name of the corresponding figure file in the OA bulk download)

Examples

>>> pubmed_parser.parse_pubmed_caption('data/pone.0000217.nxml')
[{
    'pmid': '17299597',
    'pmc': '1790863',
    'fig_caption': "Fisher's geometric model in two-dimensional phenotypic space. ...",
    'fig_id': 'pone-0000217-g001',
    'fig_label': 'Figure 1',
    'graphic_ref': 'pone.0000217.g001'
}, ...]
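To locate figure files from the bulk download, the caption list can be indexed by figure ID. A short sketch using the single entry shown in the example above (with its caption still elided); the variable names are illustrative only.

```python
# One entry copied from the example output; the caption stays elided.
captions = [
    {
        "pmid": "17299597",
        "pmc": "1790863",
        "fig_caption": "Fisher's geometric model in two-dimensional phenotypic space. ...",
        "fig_id": "pone-0000217-g001",
        "fig_label": "Figure 1",
        "graphic_ref": "pone.0000217.g001",
    }
]

# Map each figure ID to the graphic file name it refers to.
graphic_by_fig = {entry["fig_id"]: entry["graphic_ref"] for entry in captions}

# Map each human-readable label to its caption text.
caption_by_label = {entry["fig_label"]: entry["fig_caption"] for entry in captions}
```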

pubmed_parser.parse_pubmed_table(path, return_xml=True)

Parse tables from a given PubMed Open Access XML file

Parameters

path: str: The path to a PubMed OA XML file
return_xml: bool: If True, each dictionary in the output list has a key ‘table_xml’ holding the XML string of the parsed table. Default: True

Return

table_dicts: list: A list of dictionaries, one per table, each with its metadata: ‘pmid’, ‘pmc’, ‘label’ (as in the full text), and ‘caption’
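When return_xml=True, the ‘table_xml’ string can be parsed further with the standard library. This is a sketch under assumptions: the sample markup is invented, and it presumes plain tr/th/td tags without namespaces, which may not match every table in the OA subset.

```python
import xml.etree.ElementTree as ET


def table_rows(table_xml):
    """Return the table as a list of rows, each a list of cell strings."""
    root = ET.fromstring(table_xml)
    rows = []
    for tr in root.iter("tr"):
        cells = [
            "".join(cell.itertext()).strip()
            for cell in tr
            if cell.tag in ("th", "td")
        ]
        rows.append(cells)
    return rows


# Invented markup standing in for a 'table_xml' value.
sample_xml = (
    "<table><thead><tr><th>Gene</th><th>Count</th></tr></thead>"
    "<tbody><tr><td>A</td><td><bold>3</bold></td></tr></tbody></table>"
)
rows = table_rows(sample_xml)
```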

Parse from Website

pubmed_parser.parse_xml_web(pmid, sleep=None, save_xml=False)

Given an input PMID, load and parse its XML using PubMed eutils

Parameters

pmid: str: The PMID to parse from eutils
sleep: int: How long to wait after parsing one PMID from eutils. Default: None
save_xml: bool: If True, save the XML output as a string under the key xml in the output dictionary, which is useful for checking the parsed information. If False, the full XML is not saved to the output. Default: False

Return

dict_out: dict: A dictionary containing the information parsed from the XML of the given PMID

Examples

>>> pubmed_parser.parse_xml_web(11360989, sleep=1, save_xml=False)
{
    'title': 'Molecular biology and evolution. Can genes explain biological complexity?',
    'abstract': '',
    'journal': 'Science (New York, N.Y.)',
    'affiliation': 'Collegium Budapest (Institute for Advanced Study), 2 Szentháromság u., H-1014 Budapest, Hungary. szathmary@colbud.hu',
    'authors': 'E Szathmáry; F Jordán; C Pál',
    'keywords': 'D000818:Animals;D005075:Biological Evolution;...',
    'doi': '10.1126/science.1060852',
    'year': '2001',
    'pmid': '11360989'
}
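The ‘authors’ and ‘keywords’ values in this output are delimited strings. A minimal sketch of splitting them, using the values from the example above; the helper names are hypothetical, and the trailing '...' from the example is kept as-is and skipped because it has no ':' separator.

```python
def split_authors(authors):
    """Split a ';'-separated author string into a list of names."""
    return [name.strip() for name in authors.split(";") if name.strip()]


def split_keywords(keywords):
    """Split 'ID:Term;ID:Term' into a list of (id, term) pairs."""
    pairs = []
    for item in keywords.split(";"):
        item = item.strip()
        if ":" not in item:
            continue  # skips empty items and the elided '...'
        mesh_id, term = item.split(":", 1)
        pairs.append((mesh_id, term))
    return pairs


# Values copied from the example output, including its elision.
record = {
    "authors": "E Szathmáry; F Jordán; C Pál",
    "keywords": "D000818:Animals;D005075:Biological Evolution;...",
}
authors = split_authors(record["authors"])
keywords = split_keywords(record["keywords"])
```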

pubmed_parser.parse_citation_web(doc_id, id_type='PMC')

Parse the citations of a given document ID

Parameters

doc_id: (str, int): The document ID
id_type: str: The type of doc_id, one of [‘PMC’, ‘PMID’, ‘DOI’, ‘OTHER’]

Return

dict_out: dict: A dictionary containing the following keys: ‘pmc’ (PubMed Central ID), ‘pmid’ (PubMed ID), ‘doi’ (DOI of the article), ‘n_citations’ (number of citations of the given article), ‘pmc_cited’ (list of PMCs that cite the given PMC)

Examples

>>> pubmed_parser.parse_citation_web(6933944, id_type='PMC')
{
    'n_citations': 0,
    'pmid': '31624211',
    'pmc': '6933944',
    'doi': '10.1126/science.aax1562',
    'pmc_cited': []
}

pubmed_parser.parse_outgoing_citation_web(doc_id, id_type='PMC')

Load the outgoing citations (the papers a document cites) from the NCBI eutils API for a given document

Example URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&id=221212
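The example URL can be reproduced with the standard library, which makes its query parameters explicit. This sketch only mirrors the URL shown above for a PMC ID; it is not necessarily how the library builds its requests, and pmc_refs_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def pmc_refs_url(pmc_id):
    """Build an elink URL with the same parameters as the example URL."""
    query = urlencode(
        {"dbfrom": "pmc", "linkname": "pmc_refs_pubmed", "id": pmc_id}
    )
    return f"{ELINK_URL}?{query}"


url = pmc_refs_url(221212)
```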

Parameters

doc_id: str: The document ID
id_type: str: The type of the provided document ID, either ‘PMC’ or ‘PMID’

Return

dict_out: dict: A dictionary containing the following keys: ‘n_citations’ (number of citations for that article), ‘doc_id’ (the document ID number), ‘id_type’ (the type of document ID provided, PMCID or PMID), ‘pmid_cited’ (a list of papers cited by the document, as PMIDs)

Examples

>>> pubmed_parser.parse_outgoing_citation_web(6933944, id_type='PMC')
{
    'n_citations': 11,
    'doc_id': '6933944',
    'id_type': 'PMC',
    'pmid_cited': ['30705152', ..., ]
}
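Outgoing-citation results of this shape convert naturally into edges for a citation graph. A minimal sketch follows; the sample is trimmed to the one cited PMID visible in the example output, so its n_citations is set to 1 for illustration, and citation_edges is a hypothetical helper.

```python
def citation_edges(result):
    """Return (source document ID, cited PMID) pairs for one result."""
    return [(result["doc_id"], cited) for cited in result["pmid_cited"]]


def is_consistent(result):
    """Check that n_citations matches the length of pmid_cited."""
    return result["n_citations"] == len(result["pmid_cited"])


# Trimmed sample: only the PMID visible in the example output is kept.
sample = {
    "n_citations": 1,
    "doc_id": "6933944",
    "id_type": "PMC",
    "pmid_cited": ["30705152"],
}
edges = citation_edges(sample)
```

Note that with id_type 'PMC' the source of each edge is a PMC ID while the target is a PMID, so the two ends of an edge use different identifier schemes.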