1. Download GenBank
In the first section of this introduction to restez
we
will explain how to download parts of GenBank on to your computer and
from these downloads, construct a local database. GenBank can be
downloaded from NCBI using the File Transfer Protocol (FTP), the online
portal for these downloads can be viewed from your web browser, FTP NCBI GenBank. GenBank
is hosted at this portal as 1000s of compressed sequence files called
flatfiles. New versions of these flatfiles are released once a month as
new sequence information is uploaded. For information on the latest
GenBank release, see the NCBI statistics
page.
restez
downloads a selection of these flatfiles and
unpacks them into an SQL-like database (specifically, DuckDB) that can be queried using
restez
’s functions. Because there are potentially huge
amounts of sequence information, all of which is unlikely to be of
interest to a user, restez
allows you to select the sets of
flatfiles to download based on GenBank’s domains. For the most part
these sets of flatfile types are determined by taxonomy, for example, a
user can download all primate or plant sequences.
Here, we will explain how to set up restez
, download
genbank and then create a database from the downloaded files.
1.1 Setting up
Before we can download anything, we need to tell restez
where we would like to store the downloaded files and the eventual
GenBank database. To do this we use the restez_path_set()
function. With this function a user can provide a system file path to a
folder within which the function will create a new folder called
‘restez’. In this example, we will create a restez
folder
in a temporary folder.
# set a random number generator seed for reproducibility
set.seed(12345)
library(restez)
# rstz_pth <- tempdir()
restez_path_set(filepath = rstz_pth)
#> ... Creating '/tmp/RtmpVAzsSm/restez'
#> ... Creating '/tmp/RtmpVAzsSm/restez/downloads'
A user must always set the restez
path in order to use
the restez
library. Additionally, by running
restez_path_set()
on a new folder messages are printed to
console telling us that the path was created and that a downloads
subfolder was created.
Note, whenever you intend to use
restez
you will need to specify therestez
path. If you are using it regularly, it might be a good idea to set it using your .Rprofile.restez_path_set()
only needs to be run once per R session, but we will use it repeatedly in the chunks below just as a reminder.
1.2 Download
Now we can download GenBank files using db_download()
.
This is an interactive function that looks up the latest GenBank
release, parses the release notes, and prints to console the available
sets of flatfiles that can be downloaded. For this example, we will
select the smallest available domain which is ‘Unannotated’ and can be
pre-specified with ‘20’. Note that the numbering scheme may change
between GenBank releases, so you shouldn’t assume a given number will
correspond to the same domain if you run it again later. Instead, you
can run this function without the preselection
argument and
wait for the function to prompt you for a domain selection, which will
show the current numbering scheme.
After launching, the download function will ask you whether it is OK
to download and set up a database for the selected sequence files. If
the selection is likely to be large you can always quit the process
using either Esc
or Ctrl+c
. Please note that
the diskspace estimates are preliminary. The package tries to be
conservative in its estimates.
db_download(preselection = '20')
#> ────────────────────────────────────────────────────────────────────────────────
#> Looking up latest GenBank release ...
#> ... release number 263
#> ... found 4699 sequence files
#> ────────────────────────────────────────────────────────────────────────────────
#> Which sequence file types would you like to download?
#> Choose from those listed below:
#> • 1 - 'Plant (including fungi and algae)'
#> 1660 files and 1950 GB
#> • 2 - 'Invertebrate'
#> 934 files and 1350 GB
#> • 3 - 'Bacterial'
#> 394 files and 587 GB
#> • 4 - 'Viral'
#> 331 files and 496 GB
#> • 5 - 'Primate'
#> 325 files and 475 GB
#> • 6 - 'Other vertebrate'
#> 288 files and 407 GB
#> • 7 - 'EST (expressed sequence tag)'
#> 163 files and 244 GB
#> • 8 - 'Other mammalian'
#> 156 files and 218 GB
#> • 9 - 'Rodent'
#> 113 files and 159 GB
#> • 10 - 'GSS (genome survey sequence)'
#> 79 files and 117 GB
#> • 11 - 'Patent'
#> 78 files and 116 GB
#> • 12 - 'Constructed'
#> 68 files and 101 GB
#> • 13 - 'TSA (transcriptome shotgun assembly)'
#> 37 files and 54.2 GB
#> • 14 - 'Environmental sampling'
#> 29 files and 42.7 GB
#> • 15 - 'HTGS (high throughput genomic sequencing)'
#> 25 files and 36.8 GB
#> • 16 - 'Synthetic and chimeric'
#> 9 files and 12.6 GB
#> • 17 - 'HTC (high throughput cDNA sequencing)'
#> 3 files and 3.49 GB
#> • 18 - 'Phage'
#> 3 files and 3.54 GB
#> • 19 - 'STS (sequence tagged site)'
#> 3 files and 4.45 GB
#> • 20 - 'Unannotated'
#> 1 files and 0.00736 GB
#> Provide one or more numbers separated by spaces.
#> e.g. to download all Mammalian sequences, type: "5 8 9" followed by Enter
#>
#> Which files would you like to download?
#> ────────────────────────────────────────────────────────────────────────────────
#> You've selected a total of 1 file(s) and 0.00736 GB of uncompressed data.
#> These represent:
#> • 'Unannotated'
#>
#> Based on stated GenBank files sizes, we estimate ...
#> ... 0.00147 GB for compressed, downloaded files
#> ... 0.00898 GB for the SQL database
#> Leading to a total of 0.0105 GB
#>
#> Please note, the real sizes of the database and its downloads cannot be
#> accurately predicted beforehand.
#> These are just estimates, actual sizes may differ by up to a factor of two.
#>
#> Is this OK?
#> ────────────────────────────────────────────────────────────────────────────────
#> Downloading ...
#> ... 'gbuna1.seq' (1/1)
#> Done. Enjoy your day.
Now the download has begun, we just need to wait for it to finish.
For the largest domains, this can take quite a while and may be prone to
server errors. In this case, you may want to set the
max_tries
argument of db_download()
to a
relatively high number to automatically re-try downloading until that
number of tries is exceeded (in this case, be sure to set
overwrite
to FALSE
or the download will start
over from scratch).
1.3 Create a database
After the download has completed, we need to use
db_create()
to create our local database. This looks in the
downloads folder of our restez
path, breaks these files up
into separate GenBank records and adds them to the SQL-like database.
Again, for very large collections of sequences this can take quite a
while.
db_create()
#> Inspecting 1 file(s) to add to the database ...
#> ... 'gbuna1.seq.gz' (1/1)
#> Done.
db_create()
allows a user to specify minimum and maximum
sequence lengths. It’s always a good idea to limit the number of
sequences in a database so that look-up times are faster. If you know
you are only interested in certain lengths of sequences it is a good
idea to limit the sequences in the database at this stage. This can also
be done with the acc_filter
argument, as described in the
“Tips
and Tricks” vignette. You can always run db_create()
again to change the limits. You will simply need to delete the original
database first with db_delete()
.
1.4 Checking the setup
After the download and the database creation steps are complete, we
can confirm our setup using restez_status()
.
restez_path_set(rstz_pth)
restez_status()
#> Checking setup status at ...
#> ────────────────────────────────────────────────────────────────────────────────
#> Restez path ...
#> ... Path '/tmp/RtmpVAzsSm/restez'
#> ... Does path exist? 'Yes'
#> ────────────────────────────────────────────────────────────────────────────────
#> Download ...
#> ... Path '/tmp/RtmpVAzsSm/restez/downloads'
#> ... Does path exist? 'Yes'
#> ... N. files 2
#> ... Total size 2.6M
#> ... GenBank division selections 'Unannotated'
#> ... GenBank Release 263
#> ... Last updated '2024-11-27 03:39:42.645875'
#> ────────────────────────────────────────────────────────────────────────────────
#> Database ...
#> ... Path '/tmp/RtmpVAzsSm/restez/sql_db'
#> ... Does path exist? 'Yes'
#> ... Total size 8.76M
#> ... Does the database have data? 'Yes'
#> ... Number of sequences 775
#> ... Min. sequence length 0
#> ... Max. sequence length Inf
#> ... Last_updated '2024-11-27 03:39:43.850198'
The status function allows a user to always touch base with their
restez
set-up. It can be run at any point, before and/or
after download or database creation. It is also designed to provide
useful messages to a user for what they need to do next in order for
them to make queries to their database. If you ever get confused, run
restez_status()
to see what is happening.
Additionally, if you are developing your own functions that depend on
restez
, you can use restez_ready()
which will
simply return TRUE or FALSE on whether the database can be queried.
restez_path_set(rstz_pth)
if (restez_ready()) {
print('Database is ready to be queried!')
} else {
print('Database cannot be queried :-(')
}
#> [1] "Database is ready to be queried!"
2. Query
Once a restez database has been set up we can query the database
using the gb_*_get()
functions. These functions allow us to
retrieve specific columns in the SQL-like database: ‘sequence’,
‘definition’, ‘accession’, ‘version’ and ‘organism’. Also, they allow us
to get the whole text formatted record and sequence data in fasta
format. In this example, we can use list_db_ids()
to
identify accession numbers in the database. We could also use
entrez_search()
provided the database contains sequences of
interest to us, see ‘search
and fetch’.
restez_path_set(rstz_pth)
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
#> [1] "AH000952"
# sequence
str(gb_sequence_get(id))
#> Named chr "GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATCTTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATC"| __truncated__
#> - attr(*, "names")= chr "AH000952"
# definition
(gb_definition_get(id))
#> AH000952
#> "Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds"
# version
(gb_version_get(id))
#> AH000952
#> "AH000952.2"
# organism
(gb_organism_get(id))
#> AH000952
#> "Transposon Tn3"
# fasta
cat(gb_fasta_get(id))
#> >AH000952.2 Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds
#> GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATC
#> TTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGT
#> CTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGT
#> TGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNGGCCATTATTCCTTCACGCTGGCAGAACTGGTGACCAAAGGTCATCTGAGACCA
#> TTAAAAGAGGCGTCAGAGGCAGAAAACGTTGCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCC
# Note, for large databases these requests can take a long time to complete.
# Always try and limit the size of the database by specifying min and max
# sequence lengths with db_create()
# Also note, if an id is not present in the database nothing is returned
cat(gb_fasta_get(id = c(id, 'notanid')))
#> >AH000952.2 Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds
#> GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATC
#> TTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGT
#> CTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGT
#> TGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNGGCCATTATTCCTTCACGCTGGCAGAACTGGTGACCAAAGGTCATCTGAGACCA
#> TTAAAAGAGGCGTCAGAGGCAGAAAACGTTGCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCC
Additionally, for more flexibility and options for extracting sequence record information see GenBank record parsing.
3. Entrez
Entrez wrappers are part of the restez
package. These
allow a user to make use of the local GenBank using functions that were
built for rentrez
.
This minimizes the amount of coding changes required for any Entrez
dependent code.
Currently, only
entrez_fetch()
is available with restez and only text formatted rettypes are allowed.
restez_path_set(rstz_pth)
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
#> [1] "AF298087"
# get fasta sequences with entrez
res <- restez::entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
#> >AF298087.1 Unidentified clone A5 DNA sequence from ocean beach sand
#> GATCCGGTCNTCGCGCTTTGCGCCTCGCGGATCTCGCGCACCTGCAGCTGGGTCACCAGCGGCGAGATGA
#> TGGTGCCTAGCAGCAGCCCGAAGACCACCATCACCACCAGGATCTCGATCAGGGTGAAGCCCTGGTCAGG
#> ACGCCGGCTCACGGCGCGGGCGGTGGCTTCGCGGCGCATCACAGGCACGGCCTTGCTGGCGTGGTTGGGG
#> GCGGCACTGGATTATTGCAGGCCACCATTTCCAGGTAATCGTTAAAGCCCATCTTACGTGCCGTGCGCTC
#> CATGAGCAGGTCGACACCATTCGCGTTGTCTTCCTCGAAGCCGGCCTGGCCGGGAGTCGCCAGGGTGAGG
#> TCCTGCGTTGGCGCAAGCTCACCGCTCACGATCACCACCGCGTGGGCATCGTCCACGACTGAAGCCCCAT
#> CGGCGACGTTGACGACGACGCANTCGCCTCCGACGGTGCAGGGGTCGGACGGGCCCCCGGCTCCNGTGGG
#> GCTGAAGCCGTTGGCGTANCTCGCATANACAAACTGNGCCCACCGTTCTCGGNGAACCANGCGGGAGATC
#> CCAGGGTTTCCGGGCCTG
# entrez_fetch will also search via Entrez for any ids not in the db
plant_sequence <- 'AY952423'
res <- restez::entrez_fetch(db = 'nucleotide', id = c(id, plant_sequence),
rettype = 'fasta')
#> [1] id(s) are unavailable locally, searching online.
cat(res)
#> >AF298087.1 Unidentified clone A5 DNA sequence from ocean beach sand
#> GATCCGGTCNTCGCGCTTTGCGCCTCGCGGATCTCGCGCACCTGCAGCTGGGTCACCAGCGGCGAGATGA
#> TGGTGCCTAGCAGCAGCCCGAAGACCACCATCACCACCAGGATCTCGATCAGGGTGAAGCCCTGGTCAGG
#> ACGCCGGCTCACGGCGCGGGCGGTGGCTTCGCGGCGCATCACAGGCACGGCCTTGCTGGCGTGGTTGGGG
#> GCGGCACTGGATTATTGCAGGCCACCATTTCCAGGTAATCGTTAAAGCCCATCTTACGTGCCGTGCGCTC
#> CATGAGCAGGTCGACACCATTCGCGTTGTCTTCCTCGAAGCCGGCCTGGCCGGGAGTCGCCAGGGTGAGG
#> TCCTGCGTTGGCGCAAGCTCACCGCTCACGATCACCACCGCGTGGGCATCGTCCACGACTGAAGCCCCAT
#> CGGCGACGTTGACGACGACGCANTCGCCTCCGACGGTGCAGGGGTCGGACGGGCCCCCGGCTCCNGTGGG
#> GCTGAAGCCGTTGGCGTANCTCGCATANACAAACTGNGCCCACCGTTCTCGGNGAACCANGCGGGAGATC
#> CCAGGGTTTCCGGGCCTG
#>
#> >AY952423.1 Livistona chinensis tRNA-Lys (trnK) gene, partial sequence; and matK gene, complete sequence; chloroplast
#> ATTGGGGTTGCTAACTCAACGGTAGAGTACTCGGCTTTTAAGTGCGACTATCATCTTTTACACATTTGGA
#> TGAAGTAAGGAATTCGTCCAGACTATTGGTAGAGTCTATAAGACCACGACTGATCCTGAAAGGTAATGAA
#> TGGAAAAAATAGCATGTCGTACGTAATACAATGAGAAACTTGTAATTTCTTATTGTAATTTTTTAAGTAG
#> AACTTTGAGTTTATCCTTACTGGATCATTACAAAAATATTGTATTTTATTTTTGGAAGGGGACGAAAAAA
#> AGGAAATTCCCAACATTTATTGTTTGGTCTAATGAATAAATGGATAGGGGCCTAGGGTAGGGCCCAATTT
#> TTGTAAAACAAAAAGCAACGAGCTTATGTTCTTAATTTGAATAATTACCCGATCTAATTAGATGTTAAAA
#> ATAAATTAGTGCCAGATGTGGTAAAGGGTTCTACTGTAAGTGGACCTTTTTTTTTTTTTTTTATGAATCC
#> TACCTATTATCTATTATGGATTAAAGATGGATGTGTATAAGAAGAAGTATACTGATAAAGAGAATTTTTC
#> CAAAGTCAAAAGAGCAATCGGGTTGCAAAAATAAAGGATTTTTACCTCCGAGAATTATAAATTAATTGGA
#> TCAAAAGGAGAGGAAAAAGTCTGTGATTGGACTCCTTCTATCCGCGGGTATGGGTATATAGTAGGTATAT
#> ATGTATATTTGTATACTATATAAATTACATGCCCTGTTCTGACCGTATTGCACTATGTATTATTTGATAA
#> TCCAAGAAATGCCTCCTACTTCTGGTTCAAGTAGAAATGAAAATAGAAGAATTACAAGAATATTTAGAAA
#> AAGATAGATCTCGGCAACAACACTTCCTATACCCACTTTTCTTTCAGGAGTATATTTATGCACTTGCTCA
#> TGATTATGGGTTTAAAGGGTTCGATTTTTTACGAACCTATGGAAATTGGGGGTTATGATAATAAATCTAG
#> TTCAGTACTTGTAAAACATTTAATTACTCGAATGTATCAACAGAATTATTTGATTTATTCTGTTAATGAA
#> TCTAACCAAAATCGATTGATTGAGCATAACAATTCTTTTTATTCTCAAATGATATCTGAAGTTTTTGCGA
#> TCATTGCAGAAATTCCATTCTCTCAGCAATTACTATTTTCTCTTCGAGGAAAAAAGAATACCAAAATCTC
#> AGACTTTACGATCTATTCATTCAATATTTCCCTTTTTAGAAGACAAATTATCACATTTAAACTATGTGTC
#> AGATATATTAATACCCTATCCCATCCATTTGGAAATCTTGGTGCAAATTCTTCAATGCTGGATCCAAGAT
#> GTTTCTTCTTTGCATTTATTGCGATTCTTTCTCCACGAACATCATAATGGGAATAGTTTTTTTTTTCCAA
#> AGAAATCCTTTTCAAAAGAAAATAAAAGACTCTTTCGATTCCTATATAATTCTTATGTATCTGAATGTGA
#> ATTTGTCTTAGTGTTTCTTCGTAAACAATCCTCTTATTTACAATCAAAATCCTATGGAATCTTTCTTGAG
#> CGAACACATTTCTATGGAAGAATGGAACATCTTATAGTAGTGTGTCATAATTATTGTCAGAAGGCCTTTT
#> GGGTCTTCAAGGATCCTTTTATGCATTATGTTCGATATCAAGGAAAAGCAATTCTGGCATCAAAAGGATC
#> TTATCTTTTGATGAAGAAATGGAGATGTCATCTTGTCAATTTCTGGCAATATTATTTTCATTTTTGGGCT
#> CAGCCTTACAGAATTTCAATAAACCAATTAGGAAATCATTCCTTCTATTTTCTCGGTTATCTTTCAAGTG
#> TATTAAAAAATACTTCGTCTGTAAGGAATCAAATGCTAGAGAATTCCTTTTTAATAGATACTATTACTAA
#> TAAATTGGATACCATAGTCCCAGTTCTTCCTCTTATTGGATCTTTGTCTAAAGCTAAATTTTGTACCGTA
#> TCCGGGCATCCTAGTAGTAAGCCAATCTGGACGGATTTATCGGATTCTGATATTATTGATAGATTTGGTC
#> GGATATGTAGAAATCTTTCTCATTATTATAGTGGATCCTCAAAAAAACAGAGCTTATATCGAATAAGGTA
#> TATACTTCGACTTTCTTGTGCTAGAACTTTAGCTCGTAAACATAAAAGTACAGTACGTGCTTTTTTGCAA
#> AGATTAGGTTCGGAATTATTAGAAGAATTCTTTACAGAAGAAGAAGGAGTTGTTTTTTTGATTTCCCAAA
#> AGAACAAAACCTCTTTTCCTCTCTATAGGTCACATAGAGAACGCATTTGGTATTTGGATATTATCCATAT
#> TAATGAATTGGTGAATTCATTTATGATGGGGCGATAAGCCCCTATAAAATAAGAAATATAAATTTTTTCT
#> AATGTCTAATAAATAGACGACAAATTCATTAATTTTCATTCTGAAATGCTCATCTAGTAGTGTAGTGATT
#> GAATCAACTGAGTATTCAAAATTTTTAGACAAACTTCTAGGGATAGAAGTTTGTTTTATCTGTATACATA
#> GGTAAAGTCGTGTGCAATGAAAAATGCAAGCACGATTTGGGGAGAGATAATTTTCTCTATTGTAACAAAT
#> AAAAATTATCTACTCCATCCGACTAGTTAATCG