API Documentation

The core functionality of Pubmed Parser can be divided into 3 main parts based on the source of the data used as input. The input source can be MEDLINE XML, the PubMed Open Access subset (PubMed OA), or the website (using eutils). Below, we list the core APIs implemented in Pubmed Parser.

Parse MEDLINE XML

pubmed_parser.parse_medline_xml(path, year_info_only=True, nlm_category=False, author_list=False, reference_list=False, parse_downto_mesh_subterms=False)

Parse an XML file in MEDLINE XML format, available at https://ftp.ncbi.nlm.nih.gov/pubmed/

Parameters

path: str

The path to the MEDLINE XML file

year_info_only: bool

If True, only the year is extracted from PubDate. If False, an attempt is made to harvest all available PubDate information: if only year and month are available, the date has the form ‘YYYY-MM’; if year, month, and day are available, it has the form ‘YYYY-MM-DD’. NOTE: the resolution of PubDate information in the Medline(R) database varies between articles. Default: True

nlm_category: bool

If True, parse a structured abstract with each section under its original Label. If False, parse a structured abstract with each section assigned to its NLM category. Default: False

author_list: bool

If True, return the parsed authors as a list. If False, return the parsed authors as a single string concatenated with ;. Default: False

reference_list: bool

If True, return the parsed reference list. If False, return a string of PMIDs concatenated with ;. Default: False

parse_downto_mesh_subterms: bool

If True, return MeSH terms concatenated with “; ” and MeSH subterms concatenated with “ / ”, with * appended if the subterm is major. If False, return mesh_terms concatenated with “; ”. Default: False
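The string layout described for parse_downto_mesh_subterms can be unpacked with a few lines of standard Python. The following is a minimal sketch, assuming only the separators stated above (“; ” between terms, “ / ” between subterms, a trailing * for major subterms). The helper name split_mesh_terms and the sample value are hypothetical and are not part of Pubmed Parser.

```python
def split_mesh_terms(mesh_terms):
    """Split a MeSH string into (term, [(subterm, is_major), ...]) pairs.

    Assumes terms are joined with "; ", subterms with " / ", and that a
    trailing "*" marks a major subterm, as described in the parameter text.
    """
    parsed = []
    for entry in mesh_terms.split("; "):
        if not entry:
            continue
        term, *subterms = entry.split(" / ")
        parsed.append(
            (term, [(sub.rstrip("*"), sub.endswith("*")) for sub in subterms])
        )
    return parsed


# Hypothetical value shaped after the description, not real parser output.
sample = "D001249:Asthma / drug therapy* / genetics; D006801:Humans"
print(split_mesh_terms(sample))
```

Terms without subterms come back with an empty list, so callers can treat both settings of the flag uniformly.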

Return

An iterator of dictionaries containing information about articles in NLM format (see parse_article_info). Articles that have been deleted are included with no information other than the field delete set to True.
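Because deleted articles appear in the iterator with only the delete field set, downstream code usually needs to skip them before reading other keys. A minimal sketch of such a filter follows. The helper name iter_live_articles and the sample records are hypothetical; only the delete field and the title key are taken from this documentation.

```python
def iter_live_articles(articles):
    """Yield only the records that are not flagged as deleted."""
    for article in articles:
        if article.get("delete"):
            continue  # deleted records carry no other information
        yield article


# Made-up records standing in for the parser's output.
records = [
    {"title": "First article", "delete": False},
    {"delete": True},
    {"title": "Second article", "delete": False},
]
titles = [article["title"] for article in iter_live_articles(records)]
print(titles)
```

The same generator can wrap the iterator returned by parse_medline_xml, since it only relies on each item being a dictionary.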

Examples

>>> article_iterator = pubmed_parser.parse_medline_xml('data/pubmed20n0014.xml.gz')
>>> for article in article_iterator:
...     print(article['title'])

pubmed_parser.parse_grant_id(pubmed_article)

Parse Grant ID and related information from a given MEDLINE tree

Parameters

pubmed_article: Element

The lxml node pointing to a medline document

Returns

grant_list: list

List of grants acknowledged in the publications. Each entry is a dictionary containing the PubMed ID, grant ID, grant acronym, country, and agency.
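A common follow-up step is to group the returned grants by funding agency. The sketch below illustrates this under an assumption: the dictionary keys pmid, grant_id, grant_acronym, country, and agency are guessed from the field descriptions above and may differ from the real output. The helper name and sample entries are hypothetical.

```python
from collections import defaultdict


def grants_by_agency(grant_list):
    """Group grant IDs by agency.

    The key names "agency" and "grant_id" are assumptions based on the
    documented fields, not confirmed key names.
    """
    grouped = defaultdict(list)
    for grant in grant_list:
        grouped[grant.get("agency", "")].append(grant.get("grant_id", ""))
    return dict(grouped)


# Made-up entries for illustration only.
sample_grants = [
    {"pmid": "1", "grant_id": "G-001", "grant_acronym": "AA",
     "country": "Country X", "agency": "Agency A"},
    {"pmid": "1", "grant_id": "G-002", "grant_acronym": "BB",
     "country": "Country Y", "agency": "Agency B"},
    {"pmid": "2", "grant_id": "G-003", "grant_acronym": "AA",
     "country": "Country X", "agency": "Agency A"},
]
print(grants_by_agency(sample_grants))
```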

Parse PubMed OA XML

pubmed_parser.parse_pubmed_xml(path, include_path=False, nxml=False)

Given the path to a PubMed XML file, extract information and metadata from the file and return them in dictionary format. See ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ for the list of files available to download.

Parameters

path: str

A path to a given PubMed XML file

include_path: bool

If True, include a key ‘path_to_file’ in the output dictionary. Default: False

nxml: bool

If True, strip the XML namespace after reading the file (see https://stackoverflow.com/questions/18159221/remove-namespace-and-prefix-from-xml-in-python-using-lxml). Default: False

Return

dict_out: dict

A dictionary containing the following keys from the parsed XML file: ‘full_title’, ‘abstract’, ‘journal’, ‘pmid’, ‘pmc’, ‘doi’, ‘publisher_id’, ‘author_list’, ‘affiliation_list’, ‘publication_year’, ‘publication_date’, ‘subjects’
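Since parse_pubmed_xml handles one file at a time, bulk downloads are typically processed by walking a directory. The sketch below shows one possible way to do that. The helper parse_directory and the "*.nxml" pattern are assumptions made for illustration, and the parser is passed in as an argument so the sketch does not depend on how the library is installed.

```python
from pathlib import Path


def parse_directory(root, parse_file, pattern="*.nxml"):
    """Apply parse_file to every matching file under root.

    Returns (results, failures); a failure records the path and the error
    so that one malformed file does not stop the whole run.
    """
    results, failures = [], []
    for path in sorted(Path(root).rglob(pattern)):
        try:
            results.append(parse_file(str(path)))
        except Exception as error:  # broad on purpose: keep going
            failures.append((str(path), repr(error)))
    return results, failures


# Possible usage, assuming pubmed_parser is installed and "data" exists:
# import pubmed_parser
# results, failures = parse_directory(
#     "data", lambda p: pubmed_parser.parse_pubmed_xml(p, include_path=True)
# )
```

Passing include_path=True in the call keeps each result traceable to its file through the ‘path_to_file’ key.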

pubmed_parser.parse_pubmed_references(path)

Given the path to an XML file, parse the referenced articles into a list of dictionaries

Parameters

path: str

The path to an XML file.

Return

dict_refs: list

A list of dictionaries, one for each reference made in the given file.

pubmed_parser.parse_pubmed_paragraph(path, all_paragraph=False)

Given the path to a PubMed OA file, parse and return all paragraphs, the section each belongs to, and the references made in each paragraph as a list of PMIDs

Parameters

path: str

The path to an XML file.

all_paragraph: bool

By default, this function only appends a paragraph if it contains at least one reference (to avoid noisy parsed text). If True, include all paragraphs. If False, include only paragraphs that have references. Default: False

Return

dict_pars: list

A list of dictionaries, each holding a paragraph’s text and its metadata. Metadata includes the ‘pmc’ and ‘pmid’ of the article, ‘reference_ids’ (a list of the reference rids cited in the paragraph), the ‘section’ name, and the ‘text’ itself
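The paragraph list is flat, so reconstructing the text of each section takes a small grouping step. A minimal sketch follows, using only the ‘section’ and ‘text’ keys documented above. The helper name text_by_section and the sample paragraphs are hypothetical.

```python
from collections import defaultdict


def text_by_section(dict_pars):
    """Join paragraph texts per section, keeping first-seen section order."""
    sections = defaultdict(list)
    for paragraph in dict_pars:
        sections[paragraph["section"]].append(paragraph["text"])
    return {name: "\n\n".join(texts) for name, texts in sections.items()}


# Made-up paragraphs shaped like the documented output.
sample_paragraphs = [
    {"pmc": "0", "pmid": "0", "reference_ids": ["B1"],
     "section": "Introduction", "text": "First."},
    {"pmc": "0", "pmid": "0", "reference_ids": ["B2", "B3"],
     "section": "Introduction", "text": "Second."},
    {"pmc": "0", "pmid": "0", "reference_ids": [],
     "section": "Methods", "text": "Third."},
]
print(text_by_section(sample_paragraphs))
```

A paragraph with an empty ‘reference_ids’ list, like the third one here, would only appear when all_paragraph=True.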

pubmed_parser.parse_pubmed_caption(path)

Given the path to a single XML file, extract each figure caption and the reference ID pointing back to that figure

Parameters

path: str

The path to a PubMed OA XML file

Return

dict_captions: list

A list of dictionaries, one per figure, each holding the figure ID (‘fig_id’) and its metadata. Metadata includes ‘pmid’, ‘pmc’, ‘fig_caption’ (the figure’s caption), ‘fig_label’ (the figure’s label), and ‘graphic_ref’ (the file name corresponding to the figure file in the OA bulk download)

Examples

>>> pubmed_parser.parse_pubmed_caption('data/pone.0000217.nxml')
[{
    'pmid': '17299597',
    'pmc': '1790863',
    'fig_caption': "Fisher's geometric model in two-dimensional phenotypic space. ...",
    'fig_id': 'pone-0000217-g001',
    'fig_label': 'Figure 1',
    'graphic_ref': 'pone.0000217.g001'
}, ...]

pubmed_parser.parse_pubmed_table(path, return_xml=True)

Parse tables from a given PubMed Open Access XML file

Parameters

path: str

The path to a PubMed OA XML file

return_xml: bool

If True, each dictionary in the output list has a key ‘table_xml’ holding the XML string of the parsed table. Default: True

Return

table_dicts: list

A list of dictionaries, one per table, each with its metadata. Metadata includes ‘pmid’, ‘pmc’, ‘label’ (as it appears in the full text), and ‘caption’
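When return_xml=True, the ‘table_xml’ value can be inspected with the standard library. The sketch below turns such a string into rows of cell text. It assumes the table markup uses plain, un-namespaced tr, th, and td elements, which may not hold for every file. The helper name table_rows and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def table_rows(table_xml):
    """Return the table as a list of rows, each a list of cell strings.

    Assumes un-namespaced <tr>, <th>, and <td> elements.
    """
    root = ET.fromstring(table_xml)
    rows = []
    for tr in root.iter("tr"):
        cells = [
            "".join(cell.itertext()).strip()  # flatten nested markup
            for cell in tr
            if cell.tag in ("th", "td")
        ]
        rows.append(cells)
    return rows


# Made-up table markup for illustration only.
sample_xml = (
    "<table>"
    "<thead><tr><th>Gene</th><th>Count</th></tr></thead>"
    "<tbody>"
    "<tr><td>A</td><td>1</td></tr>"
    "<tr><td>B</td><td><bold>2</bold></td></tr>"
    "</tbody>"
    "</table>"
)
print(table_rows(sample_xml))
```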

Parse from Website

pubmed_parser.parse_xml_web(pmid, sleep=None, save_xml=False)

Given an input PMID, load and parse its XML using PubMed eutils

Parameters

pmid: str

A string of PMID which you want to parse from eutils

sleep: int

How long to wait after parsing one PMID from eutils. Default: None

save_xml: bool

If True, save the XML output as a string under the key xml in the output dictionary, which is useful for checking the parsed information against the raw XML. If False, the full XML is not saved to the output. Default: False

Return

dict_out: dict

A dictionary containing the information parsed from the XML of the given PMID
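In the documented example output for this function, ‘authors’ and ‘keywords’ are delimiter-joined strings: authors are separated by “; ” and keywords look like ‘ID:Name’ entries separated by “;”. A minimal sketch for splitting them is given below, assuming that layout holds in general. The helper names are hypothetical, and the sample record reuses values from the documented example.

```python
def split_authors(authors):
    """Split a ';'-joined author string into a list of names."""
    return [name.strip() for name in authors.split(";") if name.strip()]


def split_keywords(keywords):
    """Split 'ID:Name;ID:Name' into (id, name) pairs."""
    pairs = []
    for item in keywords.split(";"):
        item = item.strip()
        if not item:
            continue
        mesh_id, _, name = item.partition(":")
        pairs.append((mesh_id, name))
    return pairs


# Values copied from the documented example output (keywords shortened).
record = {
    "authors": "E Szathmáry; F Jordán; C Pál",
    "keywords": "D000818:Animals;D005075:Biological Evolution",
}
print(split_authors(record["authors"]))
print(split_keywords(record["keywords"]))
```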

Examples

>>> pubmed_parser.parse_xml_web(11360989, sleep=1, save_xml=False)
{
    'title': 'Molecular biology and evolution. Can genes explain biological complexity?',
    'abstract': '',
    'journal': 'Science (New York, N.Y.)',
    'affiliation': 'Collegium Budapest (Institute for Advanced Study), 2 Szentháromság u., H-1014 Budapest, Hungary. szathmary@colbud.hu',
    'authors': 'E Szathmáry; F Jordán; C Pál',
    'keywords': 'D000818:Animals;D005075:Biological Evolution;...',
    'doi': '10.1126/science.1060852',
    'year': '2001',
    'pmid': '11360989'
}

pubmed_parser.parse_citation_web(doc_id, id_type='PMC')

Parse citations from given document id

Parameters

doc_id: (str, int)

document id

id_type: str

The type of doc_id, one of [‘PMC’, ‘PMID’, ‘DOI’, ‘OTHER’]

Return

dict_out: dict

A dictionary containing the following keys: ‘pmc’ (PubMed Central ID), ‘pmid’ (PubMed ID), ‘doi’ (DOI of the article), ‘n_citations’ (number of citations of the given article), and ‘pmc_cited’ (list of PMCs that cite the given PMC)

Examples

>>> pubmed_parser.parse_citation_web(6933944, id_type='PMC')
{
    'n_citations': 0,
    'pmid': '31624211',
    'pmc': '6933944',
    'doi': '10.1126/science.aax1562',
    'pmc_cited': []
}

pubmed_parser.parse_outgoing_citation_web(doc_id, id_type='PMC')

A function to load citations from NCBI eutils API for a given document

Example URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&id=221212
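The example URL is an ordinary eutils elink query, and it can be rebuilt with the standard library. The sketch below only reproduces the PMC form of the URL shown above. How Pubmed Parser builds the request internally is not shown in this documentation, so the helper name and its defaults are assumptions.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def outgoing_citation_url(doc_id, dbfrom="pmc", linkname="pmc_refs_pubmed"):
    """Build an elink URL shaped like the documented example."""
    query = urlencode({"dbfrom": dbfrom, "linkname": linkname, "id": doc_id})
    return f"{ELINK_URL}?{query}"


print(outgoing_citation_url(221212))
```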

Parameters

doc_id: str

The document ID

id_type: str

A type of provided document ID, can be either ‘PMC’ or ‘PMID’

Return

dict_out: dict

a dictionary containing the following keys ‘n_citations’ (number of citations for that article), ‘doc_id’ (the document ID number), ‘id_type’ (the type of document ID provided (PMCID or PMID)), ‘pmid_cited’ (a list of papers cited by the document as PMIDs)
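The returned dictionary maps naturally onto edges of a citation graph. A minimal sketch of that conversion follows, using only the documented keys ‘doc_id’, ‘id_type’, and ‘pmid_cited’. The helper name citation_edges is hypothetical, and the second PMID in the sample is a placeholder rather than a real identifier.

```python
def citation_edges(dict_out):
    """Return (citing, cited) pairs, each node tagged with its ID type."""
    source = (dict_out["id_type"], dict_out["doc_id"])
    return [(source, ("PMID", pmid)) for pmid in dict_out["pmid_cited"]]


# Sample shaped like the documented output; "00000000" is a placeholder.
sample = {
    "n_citations": 2,
    "doc_id": "6933944",
    "id_type": "PMC",
    "pmid_cited": ["30705152", "00000000"],
}
for edge in citation_edges(sample):
    print(edge)
```

Tagging each node with its ID type keeps PMC and PMID numbers from colliding when edges from several documents are merged.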

Examples

>>> pubmed_parser.parse_outgoing_citation_web(6933944, id_type='PMC')
{
    'n_citations': 11,
    'doc_id': '6933944',
    'id_type': 'PMC',
    'pmid_cited': ['30705152', ..., ]
}