2. How to search for and fetch sequences
2023-07-24
Source:vignettes/2_search_and_fetch.Rmd
2_search_and_fetch.Rmd
A downside of the restez
approach is we are unable to
then search the local database for sequences of interest as the database
cannot possibly contain sufficient metadata (most notably taxonomic
information) to perform as rich a search as online via NCBI. As a result
we must perform the sequence discovery process independently from
restez
. In this tutorial we will demonstrate how to perform
online searches using the rentrez
package and then use the
search results to look up sequences in a restez
database.
To run all the code in this tutorial you will need to have already set up the rodents database, see Build a database of all rodents.
Search NCBI GenBank accession numbers
Let’s pretend we’re interested in COI sequences for a group of
rodents called the Sciuromorpha. restez
provides the
ncbi_acc_get()
function to make this easy: just provide a
text string query like you would on
NCBI GenBank, and it returns a character vector of accession numbers
corresponding to that query. By default, it drops any version number
after each accession number. This is not strictly necessary, but it
makes it easier to check our results from restez
later.
library(restez)
# 33553 - Sciuromorpha - squirrel-like things
accessions <- ncbi_acc_get(
'txid33553[Organism:exp] AND COI [GENE] AND 100:1000[SLEN]')
print(length(accessions))
#> [1] 492
print(accessions[1:10])
#> [1] "MZ661159" "MZ364430" "HQ966965" "GU670702" "GU670701" "GU670700" "GU670699" "GU670462" "GU670451" "GU670450"
Retrieve sequences
To fetch the sequences from the rodents database, we can use the
gb_fasta_get
function.
restez_path_set(rodents_path)
coi_sequences <- gb_fasta_get(id = accessions)
str(coi_sequences[[1]])
#> chr ">LT630607.1 Sciurus sp. 1 AG-2016 mitochondrial partial COI gene for cytochrome oxidase subunit 1, specimen vou"| __truncated__
# Are all accessions in results?
all(accessions %in% names(coi_sequences))
#> [1] TRUE
Comparing to Entrez
Can we not just use rentrez
to do the fetching as well?
Yes, but restez
can be a lot faster. NCBI limits the number
of requests per user, often to as little as 100 items per request with
varying time delays. Additionally for any programmatic retrieval of
sequences using an online server can never be as reliable as a local
copy.
# time via restez
system.time(expr = {
coi_sequences <- gb_fasta_get(id = accessions)
})
#> user system elapsed
#> 0.259 0.077 0.272
# time via Entrez
system.time(expr = {
coi_sequences_p1 <- rentrez::entrez_fetch(db = 'nucleotide',
id = accessions[1:100],
rettype = 'fasta')
coi_sequences_p2 <- rentrez::entrez_fetch(db = 'nucleotide',
id = accessions[101:200],
rettype = 'fasta')
coi_sequences_p3 <- rentrez::entrez_fetch(db = 'nucleotide',
id = accessions[201:300],
rettype = 'fasta')
coi_sequences_p4 <- rentrez::entrez_fetch(db = 'nucleotide',
id = accessions[301:400],
rettype = 'fasta')
coi_sequences_p5 <- rentrez::entrez_fetch(db = 'nucleotide',
id = accessions[401:456],
rettype = 'fasta')
})
#> user system elapsed
#> 0.050 0.008 5.536