Rentrez Tutorial
David winter
2024-10-28
Source:vignettes/rentrez_tutorial.Rmd
rentrez_tutorial.Rmd
Introduction: The NCBI, entrez and rentrez
.
The NCBI shares a lot of data. At the time this document was compiled, there were 31.7 million papers in PubMed, including 6.6 million full-text records available in PubMed Central. The NCBI Nucleotide Database (which includes GenBank) has data for 432 million different sequences, and dbSNP describes 702 million different genetic variants. All of these records can be cross-referenced with the 1.86 million species in the NCBI taxonomy or 27 thousand disease-associated records in OMIM.
The NCBI makes this data available through a web interface, an FTP server and through a REST API
called the Entrez
Utilities (Eutils
for short). This package provides
functions to use that API, allowing users to gather and combine data
from multiple NCBI databases in the comfort of an R session or
script.
Getting started with the rentrez
To make the most of all the data the NCBI shares you need to know a
little about their databases, the records they contain and the ways you
can find those records. The NCBI provides
extensive documentation for each of their databases and for the EUtils API that
rentrez
takes advantage of. There are also some helper
functions in rentrez
that help users learn their way around
the NCBI’s databases.
First, you can use entrez_dbs()
to find the list of
available databases:
There is a set of functions with names starting
entrez_db_
that can be used to gather more information
about each of these databases:
Functions that help you learn about NCBI databases
Function name | Return |
---|---|
entrez_db_summary() |
Brief description of what the database is |
entrez_db_searchable() |
Set of search terms that can used with this database |
entrez_db_links() |
Set of databases that might contain linked records |
For instance, we can get a description of the somewhat cryptically named database ‘cdd’…
entrez_db_summary("cdd")
… or find out which search terms can be used with the Sequence Read Archive (SRA) database (which contains raw data from sequencing projects):
entrez_db_searchable("sra")
Just how these ‘helper’ functions might be useful will become clearer
once you’ve started using rentrez
, so let’s get
started.
Searching databases: entrez_search()
Very often, the first thing you’ll want to do with
rentrez
is search a given NCBI database to find records
that match some keywords. You can do this using the function
entrez_search()
. In the simplest case you just need to
provide a database name (db
) and a search term
(term
) so let’s search PubMed for articles about the
R language
:
r_search <- entrez_search(db="pubmed", term="R Language")
The object returned by a search acts like a list, and you can get a summary of its contents by printing it.
r_search
There are a few things to note here. First, the NCBI’s server has
worked out that we meant R as a programming language, and so included
the ‘MeSH’ term term
associated with programming languages. We’ll worry about MeSH terms and
other special queries later, for now just note that you can use this
feature to check that your search term was interpreted in the way you
intended. Second, there are many more ‘hits’ for this search than there
are unique IDs contained in this object. That’s because the optional
argument retmax
, which controls the maximum number of
returned values has a default value of 20.
The IDs are the most important thing returned here. They allow us to
fetch records matching those IDs, gather summary data about them or find
cross-referenced records in other databases. We access the IDs as a
vector using the $
operator:
r_search$ids
If we want to get more than 20 IDs we can do so by increasing the
ret_max
argument.
another_r_search <- entrez_search(db="pubmed", term="R Language", retmax=40)
another_r_search
If we want to get IDs for all of the thousands of records that match this search, we can use the NCBI’s web history feature described below.
Building search terms
The EUtils API uses a special syntax to build search terms. You can
search a database against a specific term using the format
query[SEARCH FIELD]
, and combine multiple such searches
using the boolean operators AND
, OR
and
NOT
.
For instance, we can find next generation sequence datasets for the (amazing…) ciliate Tetrahymena thermophila by using the organism (‘ORGN’) search field:
entrez_search(db="sra",
term="Tetrahymena thermophila[ORGN]",
retmax=0)
We can narrow our focus to only those records that have been added recently (using the colon to specify a range of values):
entrez_search(db="sra",
term="Tetrahymena thermophila[ORGN] AND 2013:2015[PDAT]",
retmax=0)
Or include recent records for either T. thermophila or it’s close relative T. borealis (using parentheses to make ANDs and ORs explicit).
entrez_search(db="sra",
term="(Tetrahymena thermophila[ORGN] OR Tetrahymena borealis[ORGN]) AND 2013:2015[PDAT]",
retmax=0)
The set of search terms available varies between databases. You can
get a list of available terms or any given data base with
entrez_db_searchable()
entrez_db_searchable("sra")
Using the Filter field
“Filter” is a special field that, as the names suggests, allows you to limit records returned by a search to set of filtering criteria. There is no programmatic way to find the particular terms that can be used with the Filter field. However, the NCBI’s website provides an “advanced search” tool for some databases that can be used to discover these terms.
For example, to find the list of possible to find all of the terms that can be used to filter searches to the nucleotide database using the advanced search for that databse. On that page selecting “Filter” from the first drop-down box then clicking “Show index list” will allow the user to scroll through possible filtering terms.
###Precise queries using MeSH terms
In addition to the search terms described above, the NCBI allows searches using Medical Subject Heading (MeSH) terms. These terms create a ‘controlled vocabulary’, and allow users to make very finely controlled queries of databases.
For instance, if you were interested in reviewing studies on how a class of anti-malarial drugs called Folic Acid Antagonists work against Plasmodium vivax (a particular species of malarial parasite), you could use this search:
entrez_search(db = "pubmed",
term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
The complete set of MeSH terms is available as a database from the
NCBI. That means it is possible to download detailed information about
each term and find the ways in which terms relate to each other using
rentrez
. You can search for specific terms with
entrez_search(db="mesh", term =...)
and learn about the
results of your search using the tools described below.
Advanced counting
As you can see above, the object returned by
entrez_search()
includes the number of records matching a
given search. This means you can learn a little about the composition
of, or trends in, the records stored in the NCBI’s databases using only
the search utility. For instance, let’s track the rise of the scientific
buzzword “connectome” in PubMed, programmatically creating search terms
for the PDAT
field:
search_year <- function(year, term){
query <- paste(term, "AND (", year, "[PDAT])")
entrez_search(db="pubmed", term=query, retmax=0)$count
}
year <- 2008:2014
papers <- sapply(year, search_year, term="Connectome", USE.NAMES=FALSE)
plot(year, papers, type='b', main="The Rise of the Connectome")
Finding cross-references : entrez_link()
:
One of the strengths of the NCBI databases is the degree to which
records of one type are connected to other records within the NCBI or to
external data sources. The function entrez_link()
allows
users to discover these links between records.
My god, it’s full of links
To get an idea of the degree to which records in the NCBI are cross-linked we can find all NCBI data associated with a single gene (in this case the Amyloid Beta Precursor gene, the product of which is associated with the plaques that form in the brains of Alzheimer’s Disease patients).
The function entrez_link()
can be used to find
cross-referenced records. In the most basic case we need to provide an
ID (id
), the database from which this ID comes
(dbfrom
) and the name of a database in which to find linked
records (db
). If we set this last argument to ‘all’ we can
find links in multiple databases:
all_the_links <- entrez_link(dbfrom='gene', id=351, db='all')
all_the_links
Just as with entrez_search
the returned object behaves
like a list, and we can learn a little about its contents by printing
it. In the case, all of the information is in links
(and
there’s a lot of them!):
all_the_links$links
The names of the list elements are in the format
[source_database]_[linked_database]
and the elements
themselves contain a vector of linked-IDs. So, if we want to find open
access publications associated with this gene we could get linked
records in PubMed Central:
all_the_links$links$gene_pmc[1:10]
Or if were interested in this gene’s role in diseases we could find links to clinVar:
all_the_links$links$gene_clinvar
Narrowing our focus
If we know beforehand what sort of links we’d like to find , we can
to use the db
argument to narrow the focus of a call to
entrez_link
.
For instance, say we are interested in knowing about all of the RNA
transcripts associated with the Amyloid Beta Precursor gene in humans.
Transcript sequences are stored in the nucleotide database (referred to
as nuccore
in EUtils), so to find transcripts associated
with a given gene we need to set dbfrom=gene
and
db=nuccore
.
nuc_links <- entrez_link(dbfrom='gene', id=351, db='nuccore')
nuc_links
nuc_links$links
The object we get back contains links to the nucleotide database generally, but also to special subsets of that database like refseq. We can take advantage of this narrower set of links to find IDs that match unique transcripts from our gene of interest.
nuc_links$links$gene_nuccore_refseqrna
We can use these ids in calls to entrez_fetch()
or
entrez_summary()
to learn more about the transcripts they
represent.
External links
In addition to finding data within the NCBI, entrez_link
can turn up connections to external databases. Perhaps the most
interesting example is finding links to the full text of papers in
PubMed. For example, when I wrote this document the first paper linked
to Amyloid Beta Precursor had a unique ID of 25500142
. We
can find links to the full text of that paper with
entrez_link
by setting the cmd
argument to
‘llinks’:
paper_links <- entrez_link(dbfrom="pubmed", id=25500142, cmd="llinks")
paper_links
Each element of the linkouts
object contains information
about an external source of data on this paper:
paper_links$linkouts
Each of those linkout objects contains quite a lot of information,
but the URL is probably the most useful. For that reason,
rentrez
provides the function linkout_urls
to
make extracting just the URL simple:
linkout_urls(paper_links)
The full list of options for the cmd
argument are given
in in-line documentation (?entrez_link
). If you are
interested in finding full text records for a large number of articles
checkout the package fulltext which makes use
of multiple sources (including the NCBI) to discover the full text
articles.
Using more than one ID
It is possible to pass more than one ID to
entrez_link()
. By default, doing so will give you a single
elink object containing the complete set of links for all of
the IDs that you specified. So, if you were looking for protein IDs
related to specific genes you could do:
all_links_together <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"))
all_links_together
all_links_together$links$gene_protein
Although this behaviour might sometimes be useful, it means we’ve
lost track of which protein
ID is linked to which
gene
ID. To retain that information we can set
by_id
to TRUE
. This gives us a list of elink
objects, each once containing links from a single gene
ID:
all_links_sep <- entrez_link(db="protein", dbfrom="gene", id=c("93100", "223646"), by_id=TRUE)
all_links_sep
lapply(all_links_sep, function(x) x$links$gene_protein)
Getting summary data: entrez_summary()
Having found the unique IDs for some records via
entrez_search
or entrez_link()
, you are
probably going to want to learn something about them. The
Eutils
API has two ways to get information about a record.
entrez_fetch()
returns ‘full’ records in varying formats
and entrez_summary()
returns less information about each
record, but in relatively simple format. Very often the summary records
have the information you are after, so rentrez
provides
functions to parse and summarise summary records.
The summary record
entrez_summary()
takes a vector of unique IDs for the
samples you want to get summary information from. Let’s start by finding
out something about the paper describing Taxize, using its PubMed
ID:
taxize_summ <- entrez_summary(db="pubmed", id=24555091)
taxize_summ
Once again, the object returned by entrez_summary
behaves like a list, so you can extract elements using $
.
For instance, we could convert our PubMed ID to another article
identifier…
taxize_summ$articleids
…or see how many times the article has been cited in PubMed Central papers
taxize_summ$pmcrefcount
Dealing with many records
If you give entrez_summary()
a vector with more than one
ID you’ll get a list of summary records back. Let’s get those
Plasmodium vivax papers we found in the
entrez_search()
section back, and fetch some summary data
on each paper:
vivax_search <- entrez_search(db = "pubmed",
term = "(vivax malaria[MeSH]) AND (folic acid antagonists[MeSH])")
multi_summs <- entrez_summary(db="pubmed", id=vivax_search$ids)
rentrez
provides a helper function,
extract_from_esummary()
that takes one or more elements
from every summary record in one of these lists. Here it is working with
one…
extract_from_esummary(multi_summs, "fulljournalname")
… and several elements:
date_and_cite <- extract_from_esummary(multi_summs, c("pubdate", "pmcrefcount", "title"))
knitr::kable(head(t(date_and_cite)), row.names=FALSE)
Fetching full records: entrez_fetch()
As useful as the summary records are, sometimes they just don’t have
the information that you need. If you want a complete representation of
a record you can use entrez_fetch
, using the argument
rettype
to specify the format you’d like the record in.
Fetch DNA sequences in fasta format
Let’s extend the example given in the entrez_link()
section about finding transcript for a given gene. This time we will
fetch cDNA sequences of those transcripts.We can start by repeating the
steps in the earlier example to get nucleotide IDs for refseq
transcripts of two genes:
gene_ids <- c(351, 11647)
linked_seq_ids <- entrez_link(dbfrom="gene", id=gene_ids, db="nuccore")
linked_transripts <- linked_seq_ids$links$gene_nuccore_refseqrna
head(linked_transripts)
Now we can get our sequences with entrez_fetch
, setting
rettype
to “fasta” (the list of formats available for each
database is give in this table):
all_recs <- entrez_fetch(db="nuccore", id=linked_transripts, rettype="fasta")
class(all_recs)
nchar(all_recs)
Congratulations, now you have a really huge character vector! Rather than printing all those thousands of bases we can take a peak at the top of the file:
If we wanted to use these sequences in some other application we could write them to file:
write(all_recs, file="my_transcripts.fasta")
Alternatively, if you want to use them within an R session
we could write them to a temporary file then read that. In this case I’m
using read.dna()
from the pylogenetics package ape (but not
executing the code block in this vignette, so you don’t have to install
that package):
Fetch a parsed XML document
Most of the NCBI’s databases can return records in XML format. In
additional to downloading the text-representation of these files,
entrez_fetch()
can return objects parsed by the
XML
package. As an example, we can check out the Taxonomy
database’s record for (did I mention they are amazing….) Tetrahymena
thermophila, specifying we want the result to be parsed by setting
parsed=TRUE
:
Tt <- entrez_search(db="taxonomy", term="(Tetrahymena thermophila[ORGN]) AND Species[RANK]")
tax_rec <- entrez_fetch(db="taxonomy", id=Tt$ids, rettype="xml", parsed=TRUE)
class(tax_rec)
The package XML (which you have if you have installed
rentrez
) provides functions to get information from these
files. For relatively simple records like this one you can use
XML::xmlToList
:
tax_list <- XML::xmlToList(tax_rec)
tax_list$Taxon$GeneticCode
For more complex records, which generate deeply-nested lists, you can
use XPath expressions
along with the function XML::xpathSApply
or the extraction
operatord [
and [[
to extract specific parts
of the file. For instance, we can get the scientific name of each taxon
in T. thermophila’s lineage by specifying a path through the
XML
tt_lineage <- tax_rec["//LineageEx/Taxon/ScientificName"]
tt_lineage[1:4]
As the name suggests, XML::xpathSApply()
is a
counterpart of base R’s sapply
, and can be used to apply a
function to nodes in an XML object. A particularly useful function to
apply is XML::xmlValue
, which returns the content of the
node:
XML::xpathSApply(tax_rec, "//LineageEx/Taxon/ScientificName", XML::xmlValue)
There are a few more complex examples of using XPath
on the rentrez
wiki
Using NCBI’s Web History features
When you are dealing with very large queries it can be time consuming to pass long vectors of unique IDs to and from the NCBI. To avoid this problem, the NCBI provides a feature called “web history” which allows users to store IDs on the NCBI servers then refer to them in future calls.
###Post a set of IDs to the NCBI for later use:
entrez_post()
If you have a list of many NCBI IDs that you want to use later on,
you can post them to the NCBI’s severs. In order to provide a brief
example, I’m going to post just one ID, the omim
identifier
for asthma:
upload <- entrez_post(db="omim", id=600807)
upload
The NCBI sends you back some information you can use to refer to the
posted IDs. In rentrez
, that information is represented as
a web_history
object.
Note that if you have a very long list of IDs you may receive a 414
error when you try to upload them. If you have such a list (and they
come from an external sources rather than a search that can be save to a
web_history
object), you may have to ‘chunk’ the IDs into
smaller sets that can processed.
Get a web_history
object from
entrez_search
or entrez_link()
In addition to directly uploading IDs to the NCBI, you can use the
web history features with entrez_search
and
entrez_link
. For instance, imagine you wanted to find all
of the sequences of the widely-studied gene COI from all snails (which
are members of the taxonomic group Gastropoda):
entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]")
That’s a lot of sequences! If you really wanted to download all of
these it would be a good idea to save all those IDs to the server by
setting use_history
to TRUE
(note you now get
a web_history
object along with your normal search
result):
snail_coi <- entrez_search(db="nuccore", term="COI[Gene] AND Gastropoda[ORGN]", use_history=TRUE)
snail_coi
snail_coi$web_history
Similarity, entrez_link()
can return
web_history
objects by using the cmd
neighbor_history
. Let’s find genetic variants (from the
clinvar database) associated with asthma (using the same OMIM ID we
identified earlier):
asthma_clinvar <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", id=600807)
asthma_clinvar$web_histories
As you can see, instead of returning lists of IDs for each linked
database (as it would be default), entrez_link()
now
returns a list of web_histories.
Use a web_history
object
Once you have those IDs stored on the NCBI’s servers, you are going
to want to do something with them. The functions
entrez_fetch()
entrez_summary()
and
entrez_link()
can all use web_history
objects
in exactly the same way they use IDs.
So, we could repeat the last example (finding variants linked to asthma), but this time using the ID we uploaded earlier
asthma_variants <- entrez_link(dbfrom="omim", db="clinvar", cmd="neighbor_history", web_history=upload)
asthma_variants
… if we want to get some genetic information about these variants we need to map our clinvar IDs to SNP IDs:
snp_links <- entrez_link(dbfrom="clinvar", db="snp",
web_history=asthma_variants$web_histories$omim_clinvar,
cmd="neighbor_history")
snp_summ <- entrez_summary(db="snp", web_history=snp_links$web_histories$clinvar_snp)
knitr::kable(extract_from_esummary(snp_summ, c("chr", "fxn_class", "global_maf")))
If you really wanted to you could also use web_history
objects to download all those thousands of COI sequences. When
downloading large sets of data, it is a good idea to take advantage of
the arguments retmax
and restart
to split the
request up into smaller chunks. For instance, we could get the first 200
sequences in 50-sequence chunks:
(note: this code block is not executed as part of the vignette to save time and bandwidth):
for( seq_start in seq(1,200,50)){
recs <- entrez_fetch(db="nuccore", web_history=snail_coi$web_history,
rettype="fasta", retmax=50, retstart=seq_start)
cat(recs, file="snail_coi.fasta", append=TRUE)
cat(seq_start+49, "sequences downloaded\r")
}
Rate-limiting and API Keys
By default, the NCBI limits users to making only 3 requests per
second (and rentrez
enforces that limit). Users who
register for an “API key” are able to make up to ten requests per
second. Getting one of these keys is simple, you just need to register for “my ncbi”
account then click on a button in the account settings
page.
Once you have an API key, rentrez will allow you to take advantage of
it. For one-off cases, this is as simple as adding the
api_key
argument to given function call. (Note these
examples are not executed as part of this document, as the API key used
here not a real one).
entrez_link(db="protein", dbfrom="gene", id=93100, api_key ="ABCD123")
It most cases you will want to use your API for each of several calls
to the NCBI. rentrez
makes this easy by allowing you to set
an environment variable ,ENTREZ_KEY
. Once this value is set
to your key rentrez
will use it for all requests to the
NCBI. To set the value for a single R session you can use the function
set_entrez_key()
. Here we set the value and confirm it is
available.
set_entrez_key("ABCD123")
Sys.getenv("ENTREZ_KEY")
If you use rentrez
often you should edit your
.Renviron
file (see r help(Startup)
for
description of this file) to include your key. Doing so will mean all
requests you send will take advantage of your API key.
As long as an API key is set by one of these methods,
rentrez
will allow you to make up to ten requests per
second.
Slowing rentrez down when you hit the rate-limit
rentrez
won’t let you send requests to the NCBI
at a rate higher than the rate-limit, but it is sometimes possible that
they will arrive too close together an produce errors. If you
are using rentrez
functions in a for loop and find
rate-limiting errors are occuring, you may consider adding a call to
Sys.sleep(0.1)
before each message sent to the NCBI. This
will ensure you stay beloe the rate limit.
What next ?
This tutorial has introduced you to the core functions of
rentrez
, there are almost limitless ways that you could put
them together. Check
out the wiki for more specific examples, and be sure to read the
inline-documentation for each function. If you run into problem with
rentrez, or just need help with the package and Eutils
please contact us by opening an issue at the github
repository