1. Download GenBank
In the first section of this introduction to restez
we will explain how to download parts of GenBank on to your computer and from these downloads, construct a local database. GenBank can be downloaded from NCBI using the File Transfer Protocol (FTP), the online portal for these downloads can be viewed from your web browser, FTP NCBI GenBank. GenBank is hosted at this portal as 1000s of compressed sequence files called flatfiles. New versions of these flatfiles are released once a month as new sequence information is uploaded. For information on the latest GenBank release, see the NCBI statistics page.
restez
downloads a selection of these flatfiles and unpacks them into an SQL-like database (specifically, DuckDB) that can be queried using restez
’s functions. Because there are potentially huge amounts of sequence information, all of which is unlikely to be of interest to a user, restez
allows you to select the sets of flatfiles to download based on GenBank’s domains. For the most part these sets of flatfile types are determined by taxonomy, for example, a user can download all primate or plant sequences.
Here, we will explain how to set up restez
, download genbank and then create a database from the downloaded files.
1.1 Setting up
Before we can download anything, we need to tell restez
where we would like to store the downloaded files and the eventual GenBank database. To do this we use the restez_path_set()
function. With this function a user can provide a system file path to a folder within which the function will create a new folder called ‘restez’. In this example, we will create a restez
folder in a temporary folder.
# set a random number generator seed for reproducibility
set.seed(12345)
library(restez)
# rstz_pth <- tempdir()
restez_path_set(filepath = rstz_pth)
#> ... Creating '/tmp/RtmpUI7PB4/restez'
#> ... Creating '/tmp/RtmpUI7PB4/restez/downloads'
A user must always set the restez
path in order to use the restez
library. Additionally, by running restez_path_set()
on a new folder messages are printed to console telling us that the path was created and that a downloads subfolder was created.
Note, whenever you intend to use
restez
you will need to specify therestez
path. If you are using it regularly, it might be a good idea to set it using your .Rprofile.restez_path_set()
only needs to be run once per R session, but we will use it repeatedly in the chunks below just as a reminder.
1.2 Download
Now we can download GenBank files using db_download()
. This is an interactive function that looks up the latest GenBank release, parses the release notes, and prints to console the available sets of flatfiles that can be downloaded. For this example, we will select the smallest available domain which is ‘Unannotated’ and can be pre-specified with ‘20’. Note that the numbering scheme may change between GenBank releases, so you shouldn’t assume a given number will correspond to the same domain if you run it again later. Instead, you can run this function without the preselection
argument and wait for the function to prompt you for a domain selection, which will show the current numbering scheme.
After launching, the download function will ask you whether it is OK to download and set up a database for the selected sequence files. If the selection is likely to be large you can always quit the process using either Esc
or Ctrl+c
. Please note that the diskspace estimates are preliminary. The package tries to be conservative in its estimates.
db_download(preselection = '20')
#> ────────────────────────────────────────────────────────────────────────────────
#> Looking up latest GenBank release ...
#> ... release number 257
#> ... found 7745 sequence files
#> ────────────────────────────────────────────────────────────────────────────────
#> Which sequence file types would you like to download?
#> Choose from those listed below:
#> • 1 - 'Invertebrate'
#> 1739 files and 737 GB
#> • 2 - 'Plant (including fungi and algae)'
#> 1265 files and 797 GB
#> • 3 - 'Viral'
#> 1022 files and 453 GB
#> • 4 - 'Bacterial'
#> 1010 files and 449 GB
#> • 5 - 'EST (expressed sequence tag)'
#> 578 files and 244 GB
#> • 6 - 'Other vertebrate'
#> 423 files and 181 GB
#> • 7 - 'Rodent'
#> 306 files and 124 GB
#> • 8 - 'GSS (genome survey sequence)'
#> 270 files and 117 GB
#> • 9 - 'Patent'
#> 260 files and 110 GB
#> • 10 - 'Constructed'
#> 236 files and 99.2 GB
#> • 11 - 'Other mammalian'
#> 235 files and 92.3 GB
#> • 12 - 'TSA (transcriptome shotgun assembly)'
#> 127 files and 54.2 GB
#> • 13 - 'HTGS (high throughput genomic sequencing)'
#> 82 files and 36.8 GB
#> • 14 - 'Environmental sampling'
#> 79 files and 34 GB
#> • 15 - 'Primate'
#> 58 files and 24.1 GB
#> • 16 - 'Synthetic and chimeric'
#> 29 files and 11.6 GB
#> • 17 - 'STS (sequence tagged site)'
#> 11 files and 4.45 GB
#> • 18 - 'HTC (high throughput cDNA sequencing)'
#> 8 files and 3.49 GB
#> • 19 - 'Phage'
#> 6 files and 2.94 GB
#> • 20 - 'Unannotated'
#> 1 files and 0.00726 GB
#> Provide one or more numbers separated by spaces.
#> e.g. to download all Mammalian sequences, type: "7 11 15" followed by Enter
#>
#> Which files would you like to download?
#> ────────────────────────────────────────────────────────────────────────────────
#> You've selected a total of 1 file(s) and 0.00726 GB of uncompressed data.
#> These represent:
#> • 'Unannotated'
#>
#> Based on stated GenBank files sizes, we estimate ...
#> ... 0.00145 GB for compressed, downloaded files
#> ... 0.00886 GB for the SQL database
#> Leading to a total of 0.0103 GB
#>
#> Please note, the real sizes of the database and its downloads cannot be
#> accurately predicted beforehand.
#> These are just estimates, actual sizes may differ by up to a factor of two.
#>
#> Is this OK?
#> ────────────────────────────────────────────────────────────────────────────────
#> Downloading ...
#> ... 'gbuna1.seq' (1/1)
#> Done. Enjoy your day.
Now the download has begun, we just need to wait for it to finish. For the largest domains, this can take quite a while and may be prone to server errors. In this case, you may want to set the max_tries
argument of db_download()
to a relatively high number to automatically re-try downloading until that number of tries is exceeded (in this case, be sure to set overwrite
to FALSE
or the download will start over from scratch).
1.3 Create a database
After the download has completed, we need to use db_create()
to create our local database. This looks in the downloads folder of our restez
path, breaks these files up into separate GenBank records and adds them to the SQL-like database. Again, for very large collections of sequences this can take quite a while.
db_create()
#> Inspecting 1 file(s) to add to the database ...
#> ... 'gbuna1.seq.gz' (1/1)
#> Done.
db_create()
allows a user to specify minimum and maximum sequence lengths. It’s always a good idea to limit the number of sequences in a database so that look-up times are faster. If you know you are only interested in certain lengths of sequences it is a good idea to limit the sequences in the database at this stage. This can also be done with the acc_filter
argument, as described in the “Tips and Tricks” vignette. You can always run db_create()
again to change the limits. You will simply need to delete the original database first with db_delete()
.
1.4 Checking the setup
After the download and the database creation steps are complete, we can confirm our setup using restez_status()
.
restez_path_set(rstz_pth)
restez_status()
#> Checking setup status at ...
#> ────────────────────────────────────────────────────────────────────────────────
#> Restez path ...
#> ... Path '/tmp/RtmpUI7PB4/restez'
#> ... Does path exist? 'Yes'
#> ────────────────────────────────────────────────────────────────────────────────
#> Download ...
#> ... Path '/tmp/RtmpUI7PB4/restez/downloads'
#> ... Does path exist? 'Yes'
#> ... N. files 2
#> ... Total size 2.93M
#> ... GenBank division selections 'Unannotated'
#> ... GenBank Release 257
#> ... Last updated '2023-09-23 08:26:20.617314'
#> ────────────────────────────────────────────────────────────────────────────────
#> Database ...
#> ... Path '/tmp/RtmpUI7PB4/restez/sql_db'
#> ... Does path exist? 'Yes'
#> ... Total size 6.26M
#> ... Does the database have data? 'Yes'
#> ... Number of sequences 760
#> ... Min. sequence length 0
#> ... Max. sequence length Inf
#> ... Last_updated '2023-09-23 08:26:23.304994'
The status function allows a user to always touch base with their restez
set-up. It can be run at any point, before and/or after download or database creation. It is also designed to provide useful messages to a user for what they need to do next in order for them to make queries to their database. If you ever get confused, run restez_status()
to see what is happening.
Additionally, if you are developing your own functions that depend on restez
, you can use restez_ready()
which will simply return TRUE or FALSE on whether the database can be queried.
restez_path_set(rstz_pth)
if (restez_ready()) {
print('Database is ready to be queried!')
} else {
print('Database cannot be queried :-(')
}
#> [1] "Database is ready to be queried!"
2. Query
Once a restez database has been set up we can query the database using the gb_*_get()
functions. These functions allow us to retrieve specific columns in the SQL-like database: ‘sequence’, ‘definition’, ‘accession’, ‘version’ and ‘organism’. Also, they allow us to get the whole text formatted record and sequence data in fasta format. In this example, we can use list_db_ids()
to identify accession numbers in the database. We could also use entrez_search()
provided the database contains sequences of interest to us, see ‘search and fetch’.
restez_path_set(rstz_pth)
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
#> [1] "AH000952"
# sequence
str(gb_sequence_get(id))
#> Named chr "GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATCTTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATC"| __truncated__
#> - attr(*, "names")= chr "AH000952"
# definition
(gb_definition_get(id))
#> AH000952
#> "Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds"
# version
(gb_version_get(id))
#> AH000952
#> "AH000952.2"
# organism
(gb_organism_get(id))
#> AH000952
#> "Transposon Tn3"
# fasta
cat(gb_fasta_get(id))
#> >AH000952.2 Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds
#> GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATC
#> TTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGT
#> CTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGT
#> TGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNGGCCATTATTCCTTCACGCTGGCAGAACTGGTGACCAAAGGTCATCTGAGACCA
#> TTAAAAGAGGCGTCAGAGGCAGAAAACGTTGCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCC
# Note, for large databases these requests can take a long time to complete.
# Always try and limit the size of the database by specifying min and max
# sequence lengths with db_create()
# Also note, if an id is not present in the database nothing is returned
cat(gb_fasta_get(id = c(id, 'notanid')))
#> >AH000952.2 Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds
#> GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATC
#> TTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGT
#> CTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGT
#> TGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNGGCCATTATTCCTTCACGCTGGCAGAACTGGTGACCAAAGGTCATCTGAGACCA
#> TTAAAAGAGGCGTCAGAGGCAGAAAACGTTGCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCC
Additionally, for more flexibility and options for extracting sequence record information see GenBank record parsing.
3. Entrez
Entrez wrappers are part of the restez
package. These allow a user to make use of the local GenBank using functions that were built for rentrez
. This minimizes the amount of coding changes required for any Entrez dependent code.
Currently, only
entrez_fetch()
is available with restez and only text formatted rettypes are allowed.
restez_path_set(rstz_pth)
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
#> [1] "AF298087"
# get fasta sequences with entrez
res <- restez::entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
#> >AF298087.1 Unidentified clone A5 DNA sequence from ocean beach sand
#> GATCCGGTCNTCGCGCTTTGCGCCTCGCGGATCTCGCGCACCTGCAGCTGGGTCACCAGCGGCGAGATGA
#> TGGTGCCTAGCAGCAGCCCGAAGACCACCATCACCACCAGGATCTCGATCAGGGTGAAGCCCTGGTCAGG
#> ACGCCGGCTCACGGCGCGGGCGGTGGCTTCGCGGCGCATCACAGGCACGGCCTTGCTGGCGTGGTTGGGG
#> GCGGCACTGGATTATTGCAGGCCACCATTTCCAGGTAATCGTTAAAGCCCATCTTACGTGCCGTGCGCTC
#> CATGAGCAGGTCGACACCATTCGCGTTGTCTTCCTCGAAGCCGGCCTGGCCGGGAGTCGCCAGGGTGAGG
#> TCCTGCGTTGGCGCAAGCTCACCGCTCACGATCACCACCGCGTGGGCATCGTCCACGACTGAAGCCCCAT
#> CGGCGACGTTGACGACGACGCANTCGCCTCCGACGGTGCAGGGGTCGGACGGGCCCCCGGCTCCNGTGGG
#> GCTGAAGCCGTTGGCGTANCTCGCATANACAAACTGNGCCCACCGTTCTCGGNGAACCANGCGGGAGATC
#> CCAGGGTTTCCGGGCCTG
# entrez_fetch will also search via Entrez for any ids not in the db
plant_sequence <- 'AY952423'
res <- restez::entrez_fetch(db = 'nucleotide', id = c(id, plant_sequence),
rettype = 'fasta')
#> [1] id(s) are unavailable locally, searching online.
cat(res)
#> >AF298087.1 Unidentified clone A5 DNA sequence from ocean beach sand
#> GATCCGGTCNTCGCGCTTTGCGCCTCGCGGATCTCGCGCACCTGCAGCTGGGTCACCAGCGGCGAGATGA
#> TGGTGCCTAGCAGCAGCCCGAAGACCACCATCACCACCAGGATCTCGATCAGGGTGAAGCCCTGGTCAGG
#> ACGCCGGCTCACGGCGCGGGCGGTGGCTTCGCGGCGCATCACAGGCACGGCCTTGCTGGCGTGGTTGGGG
#> GCGGCACTGGATTATTGCAGGCCACCATTTCCAGGTAATCGTTAAAGCCCATCTTACGTGCCGTGCGCTC
#> CATGAGCAGGTCGACACCATTCGCGTTGTCTTCCTCGAAGCCGGCCTGGCCGGGAGTCGCCAGGGTGAGG
#> TCCTGCGTTGGCGCAAGCTCACCGCTCACGATCACCACCGCGTGGGCATCGTCCACGACTGAAGCCCCAT
#> CGGCGACGTTGACGACGACGCANTCGCCTCCGACGGTGCAGGGGTCGGACGGGCCCCCGGCTCCNGTGGG
#> GCTGAAGCCGTTGGCGTANCTCGCATANACAAACTGNGCCCACCGTTCTCGGNGAACCANGCGGGAGATC
#> CCAGGGTTTCCGGGCCTG
#>
#> >AY952423.1 Livistona chinensis tRNA-Lys (trnK) gene, partial sequence; and matK gene, complete sequence; chloroplast
#> ATTGGGGTTGCTAACTCAACGGTAGAGTACTCGGCTTTTAAGTGCGACTATCATCTTTTACACATTTGGA
#> TGAAGTAAGGAATTCGTCCAGACTATTGGTAGAGTCTATAAGACCACGACTGATCCTGAAAGGTAATGAA
#> TGGAAAAAATAGCATGTCGTACGTAATACAATGAGAAACTTGTAATTTCTTATTGTAATTTTTTAAGTAG
#> AACTTTGAGTTTATCCTTACTGGATCATTACAAAAATATTGTATTTTATTTTTGGAAGGGGACGAAAAAA
#> AGGAAATTCCCAACATTTATTGTTTGGTCTAATGAATAAATGGATAGGGGCCTAGGGTAGGGCCCAATTT
#> TTGTAAAACAAAAAGCAACGAGCTTATGTTCTTAATTTGAATAATTACCCGATCTAATTAGATGTTAAAA
#> ATAAATTAGTGCCAGATGTGGTAAAGGGTTCTACTGTAAGTGGACCTTTTTTTTTTTTTTTTATGAATCC
#> TACCTATTATCTATTATGGATTAAAGATGGATGTGTATAAGAAGAAGTATACTGATAAAGAGAATTTTTC
#> CAAAGTCAAAAGAGCAATCGGGTTGCAAAAATAAAGGATTTTTACCTCCGAGAATTATAAATTAATTGGA
#> TCAAAAGGAGAGGAAAAAGTCTGTGATTGGACTCCTTCTATCCGCGGGTATGGGTATATAGTAGGTATAT
#> ATGTATATTTGTATACTATATAAATTACATGCCCTGTTCTGACCGTATTGCACTATGTATTATTTGATAA
#> TCCAAGAAATGCCTCCTACTTCTGGTTCAAGTAGAAATGAAAATAGAAGAATTACAAGAATATTTAGAAA
#> AAGATAGATCTCGGCAACAACACTTCCTATACCCACTTTTCTTTCAGGAGTATATTTATGCACTTGCTCA
#> TGATTATGGGTTTAAAGGGTTCGATTTTTTACGAACCTATGGAAATTGGGGGTTATGATAATAAATCTAG
#> TTCAGTACTTGTAAAACATTTAATTACTCGAATGTATCAACAGAATTATTTGATTTATTCTGTTAATGAA
#> TCTAACCAAAATCGATTGATTGAGCATAACAATTCTTTTTATTCTCAAATGATATCTGAAGTTTTTGCGA
#> TCATTGCAGAAATTCCATTCTCTCAGCAATTACTATTTTCTCTTCGAGGAAAAAAGAATACCAAAATCTC
#> AGACTTTACGATCTATTCATTCAATATTTCCCTTTTTAGAAGACAAATTATCACATTTAAACTATGTGTC
#> AGATATATTAATACCCTATCCCATCCATTTGGAAATCTTGGTGCAAATTCTTCAATGCTGGATCCAAGAT
#> GTTTCTTCTTTGCATTTATTGCGATTCTTTCTCCACGAACATCATAATGGGAATAGTTTTTTTTTTCCAA
#> AGAAATCCTTTTCAAAAGAAAATAAAAGACTCTTTCGATTCCTATATAATTCTTATGTATCTGAATGTGA
#> ATTTGTCTTAGTGTTTCTTCGTAAACAATCCTCTTATTTACAATCAAAATCCTATGGAATCTTTCTTGAG
#> CGAACACATTTCTATGGAAGAATGGAACATCTTATAGTAGTGTGTCATAATTATTGTCAGAAGGCCTTTT
#> GGGTCTTCAAGGATCCTTTTATGCATTATGTTCGATATCAAGGAAAAGCAATTCTGGCATCAAAAGGATC
#> TTATCTTTTGATGAAGAAATGGAGATGTCATCTTGTCAATTTCTGGCAATATTATTTTCATTTTTGGGCT
#> CAGCCTTACAGAATTTCAATAAACCAATTAGGAAATCATTCCTTCTATTTTCTCGGTTATCTTTCAAGTG
#> TATTAAAAAATACTTCGTCTGTAAGGAATCAAATGCTAGAGAATTCCTTTTTAATAGATACTATTACTAA
#> TAAATTGGATACCATAGTCCCAGTTCTTCCTCTTATTGGATCTTTGTCTAAAGCTAAATTTTGTACCGTA
#> TCCGGGCATCCTAGTAGTAAGCCAATCTGGACGGATTTATCGGATTCTGATATTATTGATAGATTTGGTC
#> GGATATGTAGAAATCTTTCTCATTATTATAGTGGATCCTCAAAAAAACAGAGCTTATATCGAATAAGGTA
#> TATACTTCGACTTTCTTGTGCTAGAACTTTAGCTCGTAAACATAAAAGTACAGTACGTGCTTTTTTGCAA
#> AGATTAGGTTCGGAATTATTAGAAGAATTCTTTACAGAAGAAGAAGGAGTTGTTTTTTTGATTTCCCAAA
#> AGAACAAAACCTCTTTTCCTCTCTATAGGTCACATAGAGAACGCATTTGGTATTTGGATATTATCCATAT
#> TAATGAATTGGTGAATTCATTTATGATGGGGCGATAAGCCCCTATAAAATAAGAAATATAAATTTTTTCT
#> AATGTCTAATAAATAGACGACAAATTCATTAATTTTCATTCTGAAATGCTCATCTAGTAGTGTAGTGATT
#> GAATCAACTGAGTATTCAAAATTTTTAGACAAACTTCTAGGGATAGAAGTTTGTTTTATCTGTATACATA
#> GGTAAAGTCGTGTGCAATGAAAAATGCAAGCACGATTTGGGGAGAGATAATTTTCTCTATTGTAACAAAT
#> AAAAATTATCTACTCCATCCGACTAGTTAATCG