API Documentation
The core functionality of Pubmed Parser can be divided into three main parts based on the source of the input data: MEDLINE XML, the PubMed Open Access subset (PubMed OA), or the web (using eutils). Below, we list the core APIs implemented in Pubmed Parser.
Parse MEDLINE XML
- pubmed_parser.parse_medline_xml(path, year_info_only=True, nlm_category=False, author_list=False, reference_list=False, parse_downto_mesh_subterms=False)[source]
Parse XML file from Medline XML format available at https://ftp.ncbi.nlm.nih.gov/pubmed/
Parameters
- path: str
Path to the MEDLINE XML file to parse
- year_info_only: bool
if True, only the year is extracted from PubDate. if False, an attempt is made to harvest all available PubDate information: if only year and month are available, the date has the form ‘YYYY-MM’; if year, month, and day are available, a date of the form ‘YYYY-MM-DD’ is returned. NOTE: the resolution of PubDate information in the MEDLINE(R) database varies between articles. default: True
- nlm_category: bool
if True, structured abstracts are parsed with each section kept under its original Label; if False, each section is assigned to its NLM category. default: False
- author_list: bool
if True, return the parsed authors as a list of authors; if False, return them as a single string of authors concatenated with “;”. default: False
- reference_list: bool
if True, return the parsed reference list as an output; if False, return a string of PMIDs concatenated with “;”. default: False
- parse_downto_mesh_subterms: bool
if True, return mesh terms concatenated with “; ” and mesh subterms concatenated with “ / ”, with * appended if the subterm is major; if False, return mesh_terms concatenated with “; ”. default: False
Return
- An iterator of dictionaries containing information about articles in NLM format (see parse_article_info). Articles that have been deleted are included with no information other than the field delete being True.
Examples
>>> article_iterator = pubmed_parser.parse_medline_xml('data/pubmed20n0014.xml.gz')
>>> for article in article_iterator:
...     print(article['title'])
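The deleted-record convention and the varying PubDate resolution described above can be handled downstream with a small filter. The following is a minimal sketch, not part of Pubmed Parser: it assumes each record is a plain dictionary and that the date is stored under a key named `pubdate` (an assumption; the text above only names the `delete` field). The sample records are made up for illustration.

```python
def date_resolution(pubdate):
    """Classify a date string as 'year', 'month', or 'day' resolution."""
    parts = pubdate.split("-") if pubdate else []
    return {1: "year", 2: "month", 3: "day"}.get(len(parts), "unknown")


def live_articles(records):
    """Yield only the records that are not flagged as deleted."""
    for record in records:
        if not record.get("delete", False):
            yield record


# Hypothetical records shaped like the iterator's output;
# the 'pubdate' key name is an assumption.
sample_records = [
    {"title": "Article A", "pubdate": "1975", "delete": False},
    {"title": "Article B", "pubdate": "1975-06", "delete": False},
    {"title": "Article C", "pubdate": "1975-06-12", "delete": False},
    {"delete": True},  # a deleted article carries no other information
]

kept = list(live_articles(sample_records))
resolutions = [date_resolution(record["pubdate"]) for record in kept]
print(resolutions)
```

In real use, `sample_records` would be replaced by the iterator returned from `parse_medline_xml` with `year_info_only=False`.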
- pubmed_parser.parse_grant_id(pubmed_article)[source]
Parse Grant ID and related information from a given MEDLINE tree
Parameters
- pubmed_article: Element
The lxml node pointing to a medline document
Returns
- grant_list: list
List of grants acknowledged in the publication. Each entry in the list is a dictionary containing the PubMed ID, grant ID, grant acronym, country, and agency.
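Since the grant list is a list of dictionaries, it can be summarized with standard tools. The sketch below counts grants per agency. The key names (`pmid`, `grant_id`, `grant_acronym`, `country`, `agency`) are assumptions inferred from the field descriptions above, and the sample entries are hypothetical.

```python
from collections import Counter


def grants_per_agency(grant_list):
    """Count grants by funding agency, grouping missing values as 'unknown'."""
    return Counter(grant.get("agency") or "unknown" for grant in grant_list)


# Hypothetical entries; key names are assumed, not confirmed by the text above.
sample_grants = [
    {"pmid": "1", "grant_id": "G-001", "grant_acronym": "AA",
     "country": "Country X", "agency": "Agency One"},
    {"pmid": "1", "grant_id": "G-002", "grant_acronym": "BB",
     "country": "Country X", "agency": "Agency One"},
    {"pmid": "2", "grant_id": "G-003", "grant_acronym": "CC",
     "country": "Country Y", "agency": None},
]

counts = grants_per_agency(sample_grants)
print(counts.most_common())
```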
Parse PubMed OA XML
- pubmed_parser.parse_pubmed_xml(path, include_path=False, nxml=False)[source]
Given a path to a PubMed XML file, extract information and metadata from the file and return them in dictionary format. See ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ for the list of files available to download.
Parameters
- path: str
A path to a given PubMed XML file
- include_path: bool
if True, include a key ‘path_to_file’ in an output dictionary default: False
- nxml: bool
if True, strip the XML namespace after reading the file (see https://stackoverflow.com/questions/18159221/remove-namespace-and-prefix-from-xml-in-python-using-lxml). default: False
Return
- dict_out: dict
A dictionary containing the following keys parsed from the XML file: ‘full_title’, ‘abstract’, ‘journal’, ‘pmid’, ‘pmc’, ‘doi’, ‘publisher_id’, ‘author_list’, ‘affiliation_list’, ‘publication_year’, ‘publication_date’, ‘epublication_date’, ‘subjects’
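As an illustration of working with the returned dictionary, the sketch below builds a one-line citation from a few of the keys listed above. It is not part of Pubmed Parser; the helper name `short_citation` and the sample values are hypothetical, and missing or empty fields are assumed to be possible.

```python
def short_citation(article):
    """Format a one-line citation from a parse_pubmed_xml-style dictionary."""
    title = article.get("full_title") or "[untitled]"
    journal = article.get("journal") or "[unknown journal]"
    year = article.get("publication_year") or "n.d."
    # Append only the identifiers that are actually present.
    ids = [
        f"{label}:{article[key]}"
        for key, label in (("pmid", "PMID"), ("pmc", "PMC"), ("doi", "DOI"))
        if article.get(key)
    ]
    citation = f"{title}. {journal} ({year})."
    return citation + " " + " ".join(ids) if ids else citation


# Hypothetical values; only the key names come from the list above.
sample_article = {
    "full_title": "An example title",
    "journal": "Example Journal",
    "publication_year": "2007",
    "pmid": "123",
    "pmc": "456",
    "doi": "",
}

citation = short_citation(sample_article)
print(citation)
```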
- pubmed_parser.parse_pubmed_references(path)[source]
Given a path to an XML file, parse the referenced articles into a list of dictionaries
Parameters
- path: str
A path to an XML file.
Return
- dict_refs: list
A list of dictionaries, one for each reference made in the given file.
- pubmed_parser.parse_pubmed_paragraph(path)[source]
Given a path to a PubMed OA file, parse and return all paragraphs, the section each paragraph belongs to, and the references made in each paragraph as a list of PMIDs
Parameters
- path: str
A path to an XML file.
Return
- dict_pars: list
A list of dictionaries holding paragraph text and its metadata. Metadata includes ‘pmc’ of an article, ‘pmid’ of an article, ‘reference_ids’ (a list of reference rid made in a paragraph), ‘section’ name of an article, and section ‘text’
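Because each paragraph carries its section name, the list can be regrouped into per-section text. The following is a minimal sketch using the keys named above (`section`, `text`, `reference_ids`); the helper name and sample paragraphs are hypothetical.

```python
from collections import defaultdict


def text_by_section(paragraphs):
    """Join paragraph texts per section, keeping first-seen section order."""
    sections = defaultdict(list)
    for paragraph in paragraphs:
        name = paragraph.get("section") or "(no section)"
        sections[name].append(paragraph["text"])
    return {name: "\n".join(texts) for name, texts in sections.items()}


# Hypothetical paragraphs shaped like the documented output.
sample_paragraphs = [
    {"pmc": "1", "pmid": "2", "reference_ids": ["B1"],
     "section": "Introduction", "text": "First paragraph."},
    {"pmc": "1", "pmid": "2", "reference_ids": [],
     "section": "Introduction", "text": "Second paragraph."},
    {"pmc": "1", "pmid": "2", "reference_ids": ["B2", "B3"],
     "section": "Methods", "text": "Third paragraph."},
]

grouped = text_by_section(sample_paragraphs)
print(list(grouped))
```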
- pubmed_parser.parse_pubmed_caption(path)
Given a single XML path, extract figure captions and the reference IDs pointing back to each figure
Parameters
- path: str
A path to a PubMed OA XML file
Return
- dict_captions: list
A list of dictionaries, one per figure ID (‘fig_id’), with its metadata. Metadata includes ‘pmid’, ‘pmc’, ‘fig_caption’ (figure’s caption), ‘graphic_ref’ (a file name corresponding to a figure file in OA bulk download)
Examples
>>> pubmed_parser.parse_pubmed_caption('data/pone.0000217.nxml')
[{
    'pmid': '17299597',
    'pmc': '1790863',
    'fig_caption': "Fisher's geometric model in two-dimensional phenotypic space. ...",
    'fig_id': 'pone-0000217-g001',
    'fig_label': 'Figure 1',
    'graphic_ref': 'pone.0000217.g001'
}, ...]
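To pair captions with figure files from the OA bulk download, the caption list can be indexed by figure ID. This is a small sketch under the assumption that `graphic_ref` may be missing or empty for some figures; the second sample entry is hypothetical, and no file extension is assumed.

```python
def graphic_index(captions):
    """Map each figure ID to its graphic reference, skipping figures without one."""
    return {
        caption["fig_id"]: caption["graphic_ref"]
        for caption in captions
        if caption.get("graphic_ref")
    }


sample_captions = [
    # First entry taken from the example above (caption truncated as shown there).
    {"pmid": "17299597", "pmc": "1790863",
     "fig_caption": "Fisher's geometric model in two-dimensional phenotypic space. ...",
     "fig_id": "pone-0000217-g001", "fig_label": "Figure 1",
     "graphic_ref": "pone.0000217.g001"},
    # Hypothetical entry without a graphic reference.
    {"pmid": "17299597", "pmc": "1790863", "fig_caption": "Hypothetical caption",
     "fig_id": "hypothetical-fig", "fig_label": "Figure 2", "graphic_ref": None},
]

index = graphic_index(sample_captions)
print(index)
```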
- pubmed_parser.parse_pubmed_table(path, return_xml=True)[source]
Parse table from given Pubmed Open-Access XML file
Parameters
- path: str
A path to a PubMed OA XML file
- return_xml: bool
if True, each dictionary in the output list has a key ‘table_xml’ holding the XML string of the parsed table. default: True
Return
- table_dicts: list
A list of dictionaries, one per table, with its metadata. Metadata includes ‘pmid’, ‘pmc’, ‘label’ (in the full text), ‘caption’
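When `return_xml=True`, the ‘table_xml’ string can be inspected with the standard library. The sketch below extracts rows of cell text from such a string. It assumes the XML uses plain `tr`, `th`, and `td` elements without namespaces; the sample markup is hypothetical and not taken from a real article.

```python
import xml.etree.ElementTree as ET


def table_rows(table_xml):
    """Return the table as a list of rows, each a list of cell texts."""
    root = ET.fromstring(table_xml)
    rows = []
    for tr in root.iter("tr"):
        # itertext() also collects text nested in inline markup such as <italic>.
        cells = ["".join(cell.itertext()).strip()
                 for cell in tr if cell.tag in ("th", "td")]
        rows.append(cells)
    return rows


# Hypothetical table dictionary; only the key names come from the text above.
sample_table = {
    "label": "Table 1",
    "caption": "Hypothetical caption",
    "table_xml": (
        "<table-wrap><table>"
        "<thead><tr><th>Gene</th><th>Count</th></tr></thead>"
        "<tbody><tr><td><italic>abc</italic></td><td>3</td></tr></tbody>"
        "</table></table-wrap>"
    ),
}

rows = table_rows(sample_table["table_xml"])
print(rows)
```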
Parse from Website
- pubmed_parser.parse_xml_web(pmid, sleep=None, save_xml=False)[source]
Given an input PMID, load and parse XML using PubMed eutils
Parameters
- pmid: str
A string of PMID which you want to parse from eutils
- sleep: int
Number of seconds to wait after parsing one PMID from eutils. default: None
- save_xml: bool
if True, save the XML output as a string under the key ‘xml’ in the output dictionary, which is useful for checking the parsed information against the raw XML; if False, the full XML is not saved to the output. default: False
Return
- dict_out: dict
A dictionary containing the information parsed from the XML for the given PMID
Examples
>>> pubmed_parser.parse_xml_web(11360989, sleep=1, save_xml=False)
{
    'title': 'Molecular biology and evolution. Can genes explain biological complexity?',
    'abstract': '',
    'journal': 'Science (New York, N.Y.)',
    'affiliation': 'Collegium Budapest (Institute for Advanced Study), 2 Szentháromság u., H-1014 Budapest, Hungary. szathmary@colbud.hu',
    'authors': 'E Szathmáry; F Jordán; C Pál',
    'keywords': 'D000818:Animals;D005075:Biological Evolution;...',
    'doi': '10.1126/science.1060852',
    'year': '2001',
    'version_id': None,
    'version_date': None,
    'pmid': '11360989'
}
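The ‘authors’ and ‘keywords’ values in the example output are delimited strings, so they usually need splitting before further use. The sketch below does that based only on the format visible in the example (authors separated by “;”, keywords as `ID:Term` pairs separated by “;”). The helper names are hypothetical, and the sample uses only the two keywords shown in full above.

```python
def split_authors(authors):
    """Split a ';'-separated author string into a list of names."""
    return [name.strip() for name in authors.split(";") if name.strip()]


def split_keywords(keywords):
    """Split 'ID:Term;ID:Term' into a list of (id, term) tuples."""
    pairs = []
    for item in keywords.split(";"):
        item = item.strip()
        if not item:
            continue
        mesh_id, _, term = item.partition(":")
        pairs.append((mesh_id, term))
    return pairs


# Values copied from the example output; the keyword list is cut to the
# two entries that appear in full there.
record = {
    "authors": "E Szathmáry; F Jordán; C Pál",
    "keywords": "D000818:Animals;D005075:Biological Evolution",
}

authors = split_authors(record["authors"])
keywords = split_keywords(record["keywords"])
print(authors)
print(keywords)
```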
- pubmed_parser.parse_citation_web(doc_id, id_type='PMC')[source]
Parse citations from given document id
Parameters
- doc_id: (str, int)
document id
- id_type: str
corresponding type of doc_id. This can be a choice from the following [‘PMC’, ‘PMID’, ‘DOI’, ‘OTHER’]
Return
- dict_out: dict
A dictionary containing the following keys: ‘pmc’ (Pubmed Central ID), ‘pmid’ (Pubmed ID), ‘doi’ (DOI of an article), ‘n_citations’ (number of citations for given articles), ‘pmc_cited’ (list of PMCs that cite the given PMC)
Examples
>>> pubmed_parser.parse_citation_web(6933944, id_type='PMC')
{
    'n_citations': 0,
    'pmid': '31624211',
    'pmc': '6933944',
    'doi': '10.1126/science.aax1562',
    'pmc_cited': []
}
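When looking up several documents through the web functions, it is reasonable to pause between requests and to keep going if one lookup fails. The following is a generic sketch, not part of Pubmed Parser: `fetch_many` is a hypothetical helper that takes any callable, and the stand-in `fake_fetch` replaces the real network call so the example runs offline.

```python
import time


def fetch_many(doc_ids, fetch, delay=1.0):
    """Call fetch(doc_id) for each ID, pausing between calls and recording errors."""
    results = {}
    for position, doc_id in enumerate(doc_ids):
        if position > 0 and delay > 0:
            time.sleep(delay)  # pause between consecutive requests
        try:
            results[doc_id] = fetch(doc_id)
        except Exception as exc:  # keep the batch going after a failed lookup
            results[doc_id] = {"error": str(exc)}
    return results


def fake_fetch(doc_id):
    """Offline stand-in returning a dictionary shaped like the output above."""
    return {"pmc": str(doc_id), "n_citations": 0, "pmc_cited": []}


results = fetch_many([6933944, 221212], fake_fetch, delay=0)
print(results)
```

In real use, `fake_fetch` could be replaced by `lambda doc_id: pubmed_parser.parse_citation_web(doc_id, id_type='PMC')`, with a non-zero `delay`.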
- pubmed_parser.parse_outgoing_citation_web(doc_id, id_type='PMC')[source]
A function to load citations from NCBI eutils API for a given document
Example URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&id=221212
Parameters
- doc_id: str
The document ID
- id_type: str
A type of provided document ID, can be either ‘PMC’ or ‘PMID’
Return
- dict_out: dict
a dictionary containing the following keys ‘n_citations’ (number of citations for that article), ‘doc_id’ (the document ID number), ‘id_type’ (the type of document ID provided (PMCID or PMID)), ‘pmid_cited’ (a list of papers cited by the document as PMIDs)
Examples
>>> pubmed_parser.parse_outgoing_citation_web(6933944, id_type='PMC')
{
    'n_citations': 11,
    'doc_id': '6933944',
    'id_type': 'PMC',
    'pmid_cited': ['30705152', ..., ]
}
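The example URL given for this function shows the query parameters sent to the eutils elink endpoint for a PMC ID. The sketch below rebuilds that URL with the standard library. It covers only the PMC case shown in the example URL; the helper name `pmc_refs_url` is hypothetical, and this is not necessarily how Pubmed Parser constructs the request internally.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def pmc_refs_url(pmc_id):
    """Build the elink URL listing PubMed references of a PMC document."""
    query = urlencode({
        "dbfrom": "pmc",
        "linkname": "pmc_refs_pubmed",
        "id": str(pmc_id),
    })
    return f"{ELINK_URL}?{query}"


url = pmc_refs_url(221212)
print(url)
```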