Analyze Scientific Publications with E-utilities and Python | by Jozsef Meszaros | May 2023
To query an NCBI database successfully, you'll need to learn about certain E-utilities, define your search fields, and choose your search parameters, which control how results are returned to your browser or, in our case, to the Python code we'll use to query the databases.
The 4 most useful E-utilities
There are 9 E-utilities available from NCBI, and all of them are implemented as server-side fast CGI programs. This means you access them by creating URLs that end in .fcgi and specify query parameters after a question mark, with parameters separated by ampersands. All of them, apart from EFetch, will give you either XML or JSON output.
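As an illustration of that URL anatomy (the search term aspirin is just a placeholder, not part of the examples that follow), an ESearch request against PubMed looks like this:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=aspirin&retmode=json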
ESearch
generates a list of ID numbers that meet your search query.
The following E-utilities can be used with one or more ID numbers:
ESummary
journal, author list, grants, dates, references, publication type
EFetch
**XML ONLY** everything ESummary provides, in addition to the abstract, a list of grants used in the research, the institutions of the authors, and MeSH keywords
ELink
provides a list of links to related citations using a computed similarity score, as well as providing a link to the published item [your gateway to the full text of the article]
NCBI hosts 38 databases across its servers, related to a wide variety of data that goes beyond literature citations. To get a complete list of the current databases, you can use EInfo without any search terms:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
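As a minimal sketch (assuming the default XML output, in which each database name appears inside a DbName tag), you can fetch and parse that list with a few lines of Python:
import urllib.request
from bs4 import BeautifulSoup

einfo_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi'
einfo_xml = urllib.request.urlopen(einfo_url).read().decode('utf-8')
einfo_bs = BeautifulSoup(einfo_xml, features="xml")

# Each available database is listed in a DbName tag
db_names = [db.text for db in einfo_bs.find_all('DbName')]
print(len(db_names), db_names[:5])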
Each database differs in how it can be accessed and the information it returns. For our purposes, we'll focus on the pubmed and pmc databases, because these are where the scientific literature is searched and retrieved.
The two most important things to learn about searching NCBI are search fields and outputs. The search fields are numerous and depend on the database. The outputs are more straightforward, and learning how to use them is essential, especially for doing large searches.
Search fields
You won't be able to truly harness the potential of the E-utilities without knowing about the available search fields. You can find a full list of these search fields on the NLM website, along with a description of each, but for the most accurate list of search terms specific to a database, you'll want to parse your own XML list using this link:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed
with the db flag set to the database (we will use pubmed for this article, but literature is also available through pmc).
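As a quick sketch (assuming the default XML output, where each searchable field appears in a Field tag with Name and FullName children), you could pull down and print that field list like so:
import urllib.request
from bs4 import BeautifulSoup

fields_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=pubmed'
fields_xml = urllib.request.urlopen(fields_url).read().decode('utf-8')
fields_bs = BeautifulSoup(fields_xml, features="xml")

# Print each searchable field's short name and full name
for field in fields_bs.find_all('Field'):
    print(field.find('Name').text, '-', field.find('FullName').text)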
One particularly useful search field is the Medical Subject Headings (MeSH).[3] Indexers, who are specialists in the field, maintain the PubMed database and use MeSH terms to reflect the subject matter of journal articles as they are published. Each indexed publication is typically described by 10 to 12 carefully chosen MeSH terms. If no search fields are specified, queries will be executed against every search term available in the queried database.[4]
Query parameters
Each of the E-utilities accepts multiple query parameters through the URL, which you can use to control the type and amount of output returned by a query. This is where you can set the number of search results retrieved or the dates searched. Here is a list of the more important parameters; a short sketch showing how to assemble them into a URL follows the list.
Database parameter:
db
should be set to the database you are interested in searching: pubmed or pmc for scientific literature
Date parameters: You can get more control over the date by using search fields, for example [pdat] for the publication date, but date parameters provide a more convenient way to constrain results.
reldate
the number of days to be searched relative to the current date; set reldate=1 for the most recent day
mindate and maxdate
specify dates according to the format YYYY/MM/DD, YYYY, or YYYY/MM (a query must contain both mindate and maxdate parameters)
datetype
sets the type of date when you query by date; options are 'mdat' (modification date), 'pdat' (publication date), and 'edat' (Entrez date)
Retrieval parameters:
rettype
the type of information to return (for literature searches, use the default setting)
retmode
format of the output (XML is the default, though all E-utilities except EFetch also support JSON)
retmax
the maximum number of records to return; the default is 20 and the maximum value is 10,000 (ten thousand)
retstart
given a list of hits for a query, retstart specifies the index of the first record to return (useful when your search exceeds the 10,000 maximum)
cmd
only relevant to ELink; used to specify whether to return IDs of similar articles or URLs to full texts
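To avoid typos when combining several of these parameters, you can build the query string from a dictionary. Here is a minimal sketch using Python's standard urllib.parse.urlencode; the parameter values are just placeholders:
from urllib.parse import urlencode

params = {
    'db': 'pubmed',        # database to search
    'term': 'myoglobin',   # placeholder search term
    'mindate': '2022',
    'maxdate': '2023',
    'retmode': 'json',
    'retmax': 50,
}
base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
search_url = base + '?' + urlencode(params)
print(search_url)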
Once we know about the E-utilities, have chosen our search fields, and decided upon query parameters, we're ready to execute queries and store the results, even across multiple pages.
While you don't strictly need Python to use the E-utilities, it does make it much easier to parse, store, and analyze the results of your queries. Here's how to get started on your data science project.
Let's say you want to search MeSH terms for the term "myoglobin" between 2022 and 2023. You'll set retmax to 50 for now, but keep in mind that the maximum is 10,000 and you can query at a rate of 3 requests per second.
import urllib.request

search_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/' + \
             f'?db=pubmed' + \
             f'&term=myoglobin[mesh]' + \
             f'&mindate=2022' + \
             f'&maxdate=2023' + \
             f'&retmode=json' + \
             f'&retmax=50'

link_list = urllib.request.urlopen(search_url).read().decode('utf-8')
link_list
The results are returned as a list of IDs, which can be used in a subsequent search within the database you queried. Note that "count" shows there are 154 results for this query, which you could use if you wanted a total count of publications for a certain set of search terms. If you wanted to return IDs for all of the publications, you would set the retmax parameter to the count, i.e. 154. In general, I set this to a very high number so I can retrieve all of the results and store them.
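As a small sketch building on the search_url and link_list from the snippet above (string-replacing retmax is just a quick illustration, not the only way to do this), you could read the count from the JSON and issue a second request that retrieves every matching ID:
import json
import urllib.request

first_page = json.loads(link_list)
total_count = int(first_page['esearchresult']['count'])

# Re-run the same search, this time asking for every matching ID
full_url = search_url.replace('retmax=50', f'retmax={total_count}')
all_ids = json.loads(urllib.request.urlopen(full_url).read().decode('utf-8'))['esearchresult']['idlist']
print(total_count, len(all_ids))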
Boolean searching is straightforward with PubMed: it only requires adding +OR+, +NOT+, or +AND+ to the URL between search terms, as in the illustrative query below.
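For example (the search terms here are placeholders, not the query used elsewhere in this article), a URL that returns records indexed under either of two MeSH terms could look like this:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=myoglobin[mesh]+OR+hemoglobin[mesh]&retmode=json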
These search strings can be constructed using Python. In the following steps, we'll parse the results using Python's json package to get the IDs for each of the publications returned. The IDs can then be joined into a string, and this string of IDs can be used by the other E-utilities to return information about the publications.
Use ESummary to return information about publications
The goal of ESummary is to return the data you might expect to see in a paper's citation (date of publication, page numbers, authors, and so on). Once you have a result in the form of a list of IDs from ESearch (from the step above), you can join this list into a long URL.
The limit for a URL is 2048 characters, and each publication's ID is 8 characters long, so to be safe, you should split your list of IDs into batches of 250 if you have a list larger than 250 IDs. See my notebook at the bottom of the article for an example.
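In the meantime, here is a minimal sketch of the batching idea; the batch_ids helper name and the example IDs are mine, purely for illustration:
def batch_ids(id_list, batch_size=250):
    # Yield comma-joined strings of at most batch_size IDs each
    for i in range(0, len(id_list), batch_size):
        yield ','.join(id_list[i:i + batch_size])

# all_ids stands in for whatever ID list ESearch returned
all_ids = ['37047528', '37055458', '36635336']
id_strings = list(batch_ids(all_ids))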
The results from ESummary are returned in JSON format and can include a link to the paper's full text:
import json

result = json.loads( link_list )
id_list = ','.join( result['esearchresult']['idlist'] )

summary_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esummary.fcgi?db=pubmed&id={id_list}&retmode=json'
summary_list = urllib.request.urlopen(summary_url).read().decode('utf-8')
We can again use json to parse summary_list. When using the json package, you can browse the fields of each individual article by using summary['result'][id as string], as in the example below:
summary = json.loads( summary_list )
summary['result']['37047528']
We can create a dataframe to capture the ID of each article along with the name of the journal, the publication date, the title of the article, a URL for retrieving the full text, as well as the first and last author.
import re
import pandas as pd

uid = [ x for x in summary['result'] if x != 'uids' ]
journals = [ summary['result'][x]['fulljournalname'] for x in summary['result'] if x != 'uids' ]
titles = [ summary['result'][x]['title'] for x in summary['result'] if x != 'uids' ]
first_authors = [ summary['result'][x]['sortfirstauthor'] for x in summary['result'] if x != 'uids' ]
last_authors = [ summary['result'][x]['lastauthor'] for x in summary['result'] if x != 'uids' ]
links = [ summary['result'][x]['elocationid'] for x in summary['result'] if x != 'uids' ]
pubdates = [ summary['result'][x]['pubdate'] for x in summary['result'] if x != 'uids' ]

links = [ re.sub(r'doi:\s*', 'http://dx.doi.org/', x) for x in links ]

results_df = pd.DataFrame( {'ID':uid, 'Journal':journals, 'PublicationDate':pubdates, 'Title':titles,
                            'URL':links, 'FirstAuthor':first_authors, 'LastAuthor':last_authors} )
Below is a list of all the different fields that ESummary returns, so you can make your own database:
'uid','pubdate','epubdate','source','authors','lastauthor','title',
'sorttitle','volume','issue','pages','lang','nlmuniqueid','issn',
'essn','pubtype','recordstatus','pubstatus','articleids','history',
'references','attributes','pmcrefcount','fulljournalname','elocationid',
'doctype','srccontriblist','booktitle','medium','edition',
'publisherlocation','publishername','srcdate','reportnumber',
'availablefromurl','locationlabel','doccontriblist','docdate',
'bookname','chapter','sortpubdate','sortfirstauthor','vernaculartitle'
Use EFetch when you want abstracts, keywords, and other details (XML output only)
We can use EFetch to return fields similar to ESummary, with the caveat that the result is returned in XML only. There are several interesting additional fields in EFetch, which include: the abstract, author-selected keywords, the Medical Subject Headings (MeSH terms), grants that sponsored the research, conflict of interest statements, a list of chemicals used in the research, and a complete list of all the references cited by the paper. Here's how you would use BeautifulSoup to obtain some of these items:
from bs4 import BeautifulSoup
import lxml
import pandas as pd

abstract_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/efetch.fcgi?db=pubmed&id={id_list}'
abstract_ = urllib.request.urlopen(abstract_url).read().decode('utf-8')
abstract_bs = BeautifulSoup(abstract_, features="xml")

articles_iterable = abstract_bs.find_all('PubmedArticle')

# Abstracts
abstract_texts = [ x.find('AbstractText').text for x in articles_iterable ]

# Conflict of interest statements
coi_texts = [ x.find('CoiStatement').text if x.find('CoiStatement') is not None else '' for x in articles_iterable ]

# MeSH terms
meshheadings_all = list()
for article in articles_iterable:
    result = article.find('MeshHeadingList').find_all('MeshHeading')
    meshheadings_all.append( [ x.text for x in result ] )

# Reference lists
references_all = list()
for article in articles_iterable:
    if article.find('ReferenceList') is not None:
        result = article.find('ReferenceList').find_all('Citation')
        references_all.append( [ x.text for x in result ] )
    else:
        references_all.append( [] )

results_table = pd.DataFrame( {'COI':coi_texts, 'Abstract':abstract_texts,
                               'MeSH_Terms':meshheadings_all, 'References':references_all} )
Now we can use this table to search abstracts and conflict of interest statements, or to make visuals that connect different fields of research using MeSH headings and reference lists. There are of course many other tags returned by EFetch that you could explore; here's how to see all of them using BeautifulSoup:
efetch_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/efetch.fcgi?db=pubmed&id={id_list}'
efetch_result = urllib.request.urlopen( efetch_url ).read().decode('utf-8')
efetch_bs = BeautifulSoup(efetch_result, features="xml")

tags = efetch_bs.find_all()
for tag in tags:
    print(tag)
Using ELink to retrieve similar publications and full-text links
You might want to find articles similar to the ones returned by your search query. These articles are grouped according to a similarity score computed with a probabilistic topic-based model.[5] To retrieve the similarity scores for a given ID, you must pass cmd=neighbor_score in your call to ELink. Here's an example for one article:
import urllib.request
import json

id_ = '37055458'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=neighbor_score'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads( elinks )

ids_ = []; score_ = []
all_links = elinks_json['linksets'][0]['linksetdbs'][0]['links']
for link in all_links:
    ids_.append( link['id'] )
    score_.append( link['score'] )

pd.DataFrame( {'id':ids_, 'score':score_} ).drop_duplicates(['id','score'])
The other function of ELink is to provide full-text links to an article based on its ID, which can be returned if you pass cmd=prlinks to ELink instead.
If you wish to access only those full-text links that are free to the public, you will want to use links that contain "pmc" (PubMed Central). Accessing articles behind a paywall may require a subscription through a university; before downloading a large corpus of full-text articles through a paywall, you should consult with your organization's librarians.
Here is a code snippet showing how you could retrieve the links for a publication:
id_ = '37055458'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/elink.fcgi?db=pubmed&id={id_}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads( elinks )

[ x['url']['value'] for x in elinks_json['linksets'][0]['idurllist'][0]['objurls'] ]
You can also retrieve links for multiple publications in a single call to ELink, as I show below:
id_list = '37055458,574140'
elink_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/elink.fcgi?db=pubmed&id={id_list}&retmode=json&cmd=prlinks'
elinks = urllib.request.urlopen(elink_url).read().decode('utf-8')
elinks_json = json.loads( elinks )
elinks_json

urls_ = elinks_json['linksets'][0]['idurllist']
for url_ in urls_:
    [ print( url_['id'], x['url']['value'] ) for x in url_['objurls'] ]
Often, a scientific publication will be authored by someone who is a CEO, CSO, or CTO of a company. With PubMed, we have the ability to analyze the latest life science industry trends. Conflict of interest statements, which were introduced as a search term in PubMed during 2017,[6] give a lens into which author-provided keywords appear in publications where a C-suite executive is disclosed as an author. In other words, they surface the keywords the authors chose to describe their findings. To carry out this analysis, simply include CEO[cois]+OR+CSO[cois]+OR+CTO[cois] as the search term in your URL, retrieve all of the results returned, and extract the keywords from the resulting XML output for each publication. Each publication contains between 4 and 8 keywords. Once the corpus is generated, you can quantify keyword frequency per year within the corpus as the number of publications in a year specifying a keyword, divided by the number of publications for that year.
For example, if 10 publications list the keyword "cancer" and there are 1,000 publications that year, the keyword frequency would be 0.01. Using the seaborn clustermap module with the keyword frequencies, you can generate a visualization in which darker bands indicate a larger keyword frequency per year (I dropped COVID-19 and SARS-COV-2 from the visualization, as they were both represented at frequencies far greater than 0.05, predictably). Each year, roughly 1,000 to 1,500 papers were returned.
From this visualization, several insights about the corpus of publications with C-suite authors become clear. First, one of the most distinct clusters (at the bottom) contains keywords that have been strongly represented in the corpus for the past five years: cancer, machine learning, biomarkers, artificial intelligence, just to name a few. Clearly, industry is heavily active and publishing in these areas. A second cluster, near the middle of the figure, shows keywords that disappeared from the corpus after 2018, including physical activity, public health, children, mass spectrometry, and mhealth (or mobile health). That's not to say that these areas are no longer being developed in industry, just that the publication activity has slowed. Looking at the bottom right of the figure, you can pick out terms that have appeared more recently in the corpus, including liquid biopsy and precision medicine, which are indeed two very "hot" areas of medicine at the moment. By inspecting the publications further, you could extract the names of the companies and other information of interest. Below is the code I wrote to generate this visual:
import pandas as pd
import time
from bs4 import BeautifulSoup
import seaborn as sns
from matplotlib import pyplot as plt
import itertools
from collections import Counter
from numpy import array_split
from urllib.request import urlopen

class Searcher:
    # Any instance of Searcher will search for the terms and return the number of results on a per-year basis #
    def __init__(self, start_, end_, term_, **kwargs):
        self.raw_ = input
        self.name_ = 'searcher'
        self.description_ = 'searcher'
        self.duration_ = end_ - start_
        self.start_ = start_
        self.end_ = end_
        self.term_ = term_
        self.search_results = list()
        self.count_by_year = list()
        self.options = list()

        # Parse keyword arguments
        if 'count' in kwargs and kwargs['count'] == 1:
            self.options = 'rettype=count'
        if 'retmax' in kwargs:
            self.options = f'retmax={kwargs["retmax"]}'
        if 'run' in kwargs and kwargs['run'] == 1:
            self.do_search()
            self.parse_results()

    def do_search(self):
        datestr_ = [self.start_ + x for x in range(self.duration_)]
        options = "".join(self.options)
        for year in datestr_:
            this_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/esearch.fcgi/' + \
                       f'?db=pubmed&term={self.term_}' + \
                       f'&mindate={year}&maxdate={year + 1}&{options}'
            print(this_url)
            self.search_results.append(
                urlopen(this_url).read().decode('utf-8'))
            time.sleep(.33)

    def parse_results(self):
        for result in self.search_results:
            xml_ = BeautifulSoup(result, features="xml")
            self.count_by_year.append(xml_.find('Count').text)
            self.ids = [id.text for id in xml_.find_all('Id')]

    def __repr__(self):
        return repr(f'Search PubMed from {self.start_} to {self.end_} with search terms {self.term_}')

    def __str__(self):
        return self.description_
# Create a list which will contain Searchers, which retrieve results for each of the search queries
searchers = list()
searchers.append(Searcher(2022, 2023, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2021, 2022, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2020, 2021, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2019, 2020, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))
searchers.append(Searcher(2018, 2019, 'CEO[cois]+OR+CTO[cois]+OR+CSO[cois]', run=1, retmax=10000))

# Create a dictionary to store keywords for all articles from a particular year
keywords_dict = dict()

# Each Searcher retrieved results for a particular start and end year
# Iterate over the searchers
for this_search in searchers:

    # Split the results from one search into batches for URL formatting
    chunk_size = 200
    batches = array_split(this_search.ids, len(this_search.ids) // chunk_size + 1)

    # Create a dict key for this Searcher object based on the years it covers
    this_dict_key = f'{this_search.start_}to{this_search.end_}'

    # Each value in the dictionary will be a list that gets appended with keywords for each article
    keywords_all = list()

    for this_batch in batches:
        ids_ = ','.join(this_batch)

        # Pull down the XML for all of the results in a batch
        abstract_url = f'http://eutils.ncbi.nlm.nih.gov/entrez//eutils/efetch.fcgi?db=pubmed&id={ids_}'
        abstract_ = urlopen(abstract_url).read().decode('utf-8')
        abstract_bs = BeautifulSoup(abstract_, features="xml")
        articles_iterable = abstract_bs.find_all('PubmedArticle')

        # Iterate over all of the articles in the batch
        for article in articles_iterable:
            result = article.find_all('Keyword')
            if result is not None:
                keywords_all.append([x.text for x in result])
            else:
                keywords_all.append([])

        # Take a break between batches!
        time.sleep(1)

    # Once all the keywords are assembled for a Searcher, add them to the dictionary
    keywords_dict[this_dict_key] = keywords_all

    # Print the key once its keywords have been added
    print(this_dict_key)
# Limit to terms that appeared approx 5 times or more in any given year
mapping_ = {'2018to2019':2018,'2019to2020':2019,'2020to2021':2020,'2021to2022':2021,'2022to2023':2022}
global_word_list = list()

for key_, value_ in keywords_dict.items():
    Ntitles = len( value_ )
    flattened_list = list( itertools.chain(*value_) )
    flattened_list = [ x.lower() for x in flattened_list ]
    counter_ = Counter( flattened_list )
    words_this_year = [ ( item, frequency/Ntitles, mapping_[key_] ) for item, frequency in counter_.items() if frequency/Ntitles >= .005 ]
    global_word_list.extend(words_this_year)

# Plot the results as a clustermap
global_word_df = pd.DataFrame(global_word_list)
global_word_df.columns = ['word', 'frequency', 'year']

pivot_df = global_word_df.loc[:, ['word', 'year', 'frequency']].pivot(index='word', columns='year',
                                                                      values='frequency').fillna(0)
pivot_df.drop('covid-19', axis=0, inplace=True)
pivot_df.drop('sars-cov-2', axis=0, inplace=True)

sns.set(font_scale=0.7)
plt.figure(figsize=(22, 2))
res = sns.clustermap(pivot_df, col_cluster=False, yticklabels=True, cbar=True)
After reading this article, you should be ready to go from crafting highly tailored search queries of the scientific literature all the way to producing data visualizations for closer scrutiny. While there are other, more complex ways to access and store articles using additional features of the various E-utilities, I've tried to present the most straightforward set of operations that should apply to most use cases for a data scientist interested in scientific publishing trends. By familiarizing yourself with the E-utilities as I've presented them here, you'll go far toward understanding the trends and connections within the scientific literature. As mentioned, there are many items beyond publications that can be unlocked by mastering the E-utilities and how they operate within the larger universe of NCBI databases.