Skip to contents

1. Download GenBank

In the first section of this introduction to restez we will explain how to download parts of GenBank on to your computer and from these downloads, construct a local database. GenBank can be downloaded from NCBI using the File Transfer Protocol (FTP), the online portal for these downloads can be viewed from your web browser, FTP NCBI GenBank. GenBank is hosted at this portal as 1000s of compressed sequence files called flatfiles. New versions of these flatfiles are released once a month as new sequence information is uploaded. For information on the latest GenBank release, see the NCBI statistics page.

restez downloads a selection of these flatfiles and unpacks them into an SQL-like database (specifically, DuckDB) that can be queried using restez’s functions. Because there are potentially huge amounts of sequence information, all of which is unlikely to be of interest to a user, restez allows you to select the sets of flatfiles to download based on GenBank’s domains. For the most part these sets of flatfile types are determined by taxonomy, for example, a user can download all primate or plant sequences.

Here, we will explain how to set up restez, download genbank and then create a database from the downloaded files.

1.1 Setting up

Before we can download anything, we need to tell restez where we would like to store the downloaded files and the eventual GenBank database. To do this we use the restez_path_set() function. With this function a user can provide a system file path to a folder within which the function will create a new folder called ‘restez’. In this example, we will create a restez folder in a temporary folder.

# set a random number generator seed for reproducibility
set.seed(12345)

library(restez)
# rstz_pth <- tempdir()
restez_path_set(filepath = rstz_pth)
#> ... Creating '/tmp/Rtmplp8EGR/restez'
#> ... Creating '/tmp/Rtmplp8EGR/restez/downloads'

A user must always set the restez path in order to use the restez library. Additionally, by running restez_path_set() on a new folder messages are printed to console telling us that the path was created and that a downloads subfolder was created.

Note, whenever you intend to use restez you will need to specify the restez path. If you are using it regularly, it might be a good idea to set it using your .Rprofile. restez_path_set() only needs to be run once per R session, but we will use it repeatedly in the chunks below just as a reminder.

1.2 Download

Now we can download GenBank files using db_download(). This is an interactive function that looks up the latest GenBank release, parses the release notes, and prints to console the available sets of flatfiles that can be downloaded. For this example, we will select the smallest available domain which is ‘Unannotated’ and can be pre-specified with ‘20’. Note that the numbering scheme may change between GenBank releases, so you shouldn’t assume a given number will correspond to the same domain if you run it again later. Instead, you can run this function without the preselection argument and wait for the function to prompt you for a domain selection, which will show the current numbering scheme.

After launching, the download function will ask you whether it is OK to download and set up a database for the selected sequence files. If the selection is likely to be large you can always quit the process using either Esc or Ctrl+c. Please note that the diskspace estimates are preliminary. The package tries to be conservative in its estimates.

db_download(preselection = '20')
#> ────────────────────────────────────────────────────────────────────────────────
#> Looking up latest GenBank release ...
#> ... release number 263
#> ... found 4699 sequence files
#> ────────────────────────────────────────────────────────────────────────────────
#> Which sequence file types would you like to download?
#> Choose from those listed below:
#> • 1  - 'Plant (including fungi and algae)'
#>         1660 files and 1950 GB
#> • 2  - 'Invertebrate'
#>         934 files and 1350 GB
#> • 3  - 'Bacterial'
#>         394 files and 587 GB
#> • 4  - 'Viral'
#>         331 files and 496 GB
#> • 5  - 'Primate'
#>         325 files and 475 GB
#> • 6  - 'Other vertebrate'
#>         288 files and 407 GB
#> • 7  - 'EST (expressed sequence tag)'
#>         163 files and 244 GB
#> • 8  - 'Other mammalian'
#>         156 files and 218 GB
#> • 9  - 'Rodent'
#>         113 files and 159 GB
#> • 10 - 'GSS (genome survey sequence)'
#>         79 files and 117 GB
#> • 11 - 'Patent'
#>         78 files and 116 GB
#> • 12 - 'Constructed'
#>         68 files and 101 GB
#> • 13 - 'TSA (transcriptome shotgun assembly)'
#>         37 files and 54.2 GB
#> • 14 - 'Environmental sampling'
#>         29 files and 42.7 GB
#> • 15 - 'HTGS (high throughput genomic sequencing)'
#>         25 files and 36.8 GB
#> • 16 - 'Synthetic and chimeric'
#>         9 files and 12.6 GB
#> • 17 - 'HTC (high throughput cDNA sequencing)'
#>         3 files and 3.49 GB
#> • 18 - 'Phage'
#>         3 files and 3.54 GB
#> • 19 - 'STS (sequence tagged site)'
#>         3 files and 4.45 GB
#> • 20 - 'Unannotated'
#>         1 files and 0.00736 GB
#> Provide one or more numbers separated by spaces.
#> e.g. to download all Mammalian sequences, type: "5 8 9" followed by Enter
#> 
#> Which files would you like to download?
#> ────────────────────────────────────────────────────────────────────────────────
#> You've selected a total of 1 file(s) and 0.00736 GB of uncompressed data.
#> These represent: 
#> • 'Unannotated'
#> 
#> Based on stated GenBank files sizes, we estimate ... 
#> ... 0.00147 GB for  compressed, downloaded files
#> ... 0.00898 GB for the SQL database
#> Leading to a total of 0.0105 GB
#> 
#> Please note, the real sizes of the database and its downloads cannot be
#> accurately predicted beforehand.
#> These are just estimates, actual sizes may differ by up to a factor of two.
#> 
#> Is this OK?
#> ────────────────────────────────────────────────────────────────────────────────
#> Downloading ...
#> ... 'gbuna1.seq' (1/1)
#> Done. Enjoy your day.

Now the download has begun, we just need to wait for it to finish. For the largest domains, this can take quite a while and may be prone to server errors. In this case, you may want to set the max_tries argument of db_download() to a relatively high number to automatically re-try downloading until that number of tries is exceeded (in this case, be sure to set overwrite to FALSE or the download will start over from scratch).

1.3 Create a database

After the download has completed, we need to use db_create() to create our local database. This looks in the downloads folder of our restez path, breaks these files up into separate GenBank records and adds them to the SQL-like database. Again, for very large collections of sequences this can take quite a while.

db_create()
#> Inspecting 1 file(s) to add to the database ...
#> ... 'gbuna1.seq.gz' (1/1)
#> Done.

db_create() allows a user to specify minimum and maximum sequence lengths. It’s always a good idea to limit the number of sequences in a database so that look-up times are faster. If you know you are only interested in certain lengths of sequences it is a good idea to limit the sequences in the database at this stage. This can also be done with the acc_filter argument, as described in the “Tips and Tricks” vignette. You can always run db_create() again to change the limits. You will simply need to delete the original database first with db_delete().

1.4 Checking the setup

After the download and the database creation steps are complete, we can confirm our setup using restez_status().

restez_path_set(rstz_pth)
restez_status()
#> Checking setup status at  ...
#> ────────────────────────────────────────────────────────────────────────────────
#> Restez path ...
#> ... Path '/tmp/Rtmplp8EGR/restez'
#> ... Does path exist? 'Yes'
#> ────────────────────────────────────────────────────────────────────────────────
#> Download ...
#> ... Path '/tmp/Rtmplp8EGR/restez/downloads'
#> ... Does path exist? 'Yes'
#> ... N. files 2
#> ... Total size 2.6M
#> ... GenBank division selections 'Unannotated'
#> ... GenBank Release 263
#> ... Last updated '2024-10-28 06:07:08.51561'
#> ────────────────────────────────────────────────────────────────────────────────
#> Database ...
#> ... Path '/tmp/Rtmplp8EGR/restez/sql_db'
#> ... Does path exist? 'Yes'
#> ... Total size 8.76M
#> ... Does the database have data? 'Yes'
#> ... Number of sequences 775
#> ... Min. sequence length 0
#> ... Max. sequence length Inf
#> ... Last_updated '2024-10-28 06:07:09.584274'

The status function allows a user to always touch base with their restez set-up. It can be run at any point, before and/or after download or database creation. It is also designed to provide useful messages to a user for what they need to do next in order for them to make queries to their database. If you ever get confused, run restez_status() to see what is happening.

Additionally, if you are developing your own functions that depend on restez, you can use restez_ready() which will simply return TRUE or FALSE on whether the database can be queried.

restez_path_set(rstz_pth)
if (restez_ready()) {
  print('Database is ready to be queried!')
} else {
  print('Database cannot be queried :-(')
}
#> [1] "Database is ready to be queried!"

2. Query

Once a restez database has been set up we can query the database using the gb_*_get() functions. These functions allow us to retrieve specific columns in the SQL-like database: ‘sequence’, ‘definition’, ‘accession’, ‘version’ and ‘organism’. Also, they allow us to get the whole text formatted record and sequence data in fasta format. In this example, we can use list_db_ids() to identify accession numbers in the database. We could also use entrez_search() provided the database contains sequences of interest to us, see ‘search and fetch’.

restez_path_set(rstz_pth)
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
#> [1] "AH000952"
# sequence
str(gb_sequence_get(id))
#>  Named chr "GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATCTTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATC"| __truncated__
#>  - attr(*, "names")= chr "AH000952"
# definition
(gb_definition_get(id))
#>                                                                                AH000952 
#> "Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds"
# version
(gb_version_get(id))
#>     AH000952 
#> "AH000952.2"
# organism
(gb_organism_get(id))
#>         AH000952 
#> "Transposon Tn3"
# fasta
cat(gb_fasta_get(id))
#> >AH000952.2 Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds
#> GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATC
#> TTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGT
#> CTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGT
#> TGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNGGCCATTATTCCTTCACGCTGGCAGAACTGGTGACCAAAGGTCATCTGAGACCA
#> TTAAAAGAGGCGTCAGAGGCAGAAAACGTTGCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCC
# Note, for large databases these requests can take a long time to complete.
# Always try and limit the size of the database by specifying min and max
# sequence lengths with db_create()
# Also note, if an id is not present in the database nothing is returned
cat(gb_fasta_get(id = c(id, 'notanid')))
#> >AH000952.2 Transposon Tn3 transposon Tn3 ampicillin resistance (amp-r) product gene, partial cds
#> GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATC
#> TTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGT
#> CTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGT
#> TGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
#> NNNNNNNNNNNNNNNNGGCCATTATTCCTTCACGCTGGCAGAACTGGTGACCAAAGGTCATCTGAGACCA
#> TTAAAAGAGGCGTCAGAGGCAGAAAACGTTGCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCC

Additionally, for more flexibility and options for extracting sequence record information see GenBank record parsing.

3. Entrez

Entrez wrappers are part of the restez package. These allow a user to make use of the local GenBank using functions that were built for rentrez. This minimizes the amount of coding changes required for any Entrez dependent code.

Currently, only entrez_fetch() is available with restez and only text formatted rettypes are allowed.

restez_path_set(rstz_pth)
ids <- suppressWarnings(list_db_ids(db = 'nucleotide', n = 100))
(id <- sample(ids, 1))
#> [1] "AF298087"
# get fasta sequences with entrez
res <- restez::entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
#> >AF298087.1 Unidentified clone A5 DNA sequence from ocean beach sand
#> GATCCGGTCNTCGCGCTTTGCGCCTCGCGGATCTCGCGCACCTGCAGCTGGGTCACCAGCGGCGAGATGA
#> TGGTGCCTAGCAGCAGCCCGAAGACCACCATCACCACCAGGATCTCGATCAGGGTGAAGCCCTGGTCAGG
#> ACGCCGGCTCACGGCGCGGGCGGTGGCTTCGCGGCGCATCACAGGCACGGCCTTGCTGGCGTGGTTGGGG
#> GCGGCACTGGATTATTGCAGGCCACCATTTCCAGGTAATCGTTAAAGCCCATCTTACGTGCCGTGCGCTC
#> CATGAGCAGGTCGACACCATTCGCGTTGTCTTCCTCGAAGCCGGCCTGGCCGGGAGTCGCCAGGGTGAGG
#> TCCTGCGTTGGCGCAAGCTCACCGCTCACGATCACCACCGCGTGGGCATCGTCCACGACTGAAGCCCCAT
#> CGGCGACGTTGACGACGACGCANTCGCCTCCGACGGTGCAGGGGTCGGACGGGCCCCCGGCTCCNGTGGG
#> GCTGAAGCCGTTGGCGTANCTCGCATANACAAACTGNGCCCACCGTTCTCGGNGAACCANGCGGGAGATC
#> CCAGGGTTTCCGGGCCTG
# entrez_fetch will also search via Entrez for any ids not in the db
plant_sequence <- 'AY952423'
res <- restez::entrez_fetch(db = 'nucleotide', id = c(id, plant_sequence),
                            rettype = 'fasta')
#> [1] id(s) are unavailable locally, searching online.
cat(res)
#> >AF298087.1 Unidentified clone A5 DNA sequence from ocean beach sand
#> GATCCGGTCNTCGCGCTTTGCGCCTCGCGGATCTCGCGCACCTGCAGCTGGGTCACCAGCGGCGAGATGA
#> TGGTGCCTAGCAGCAGCCCGAAGACCACCATCACCACCAGGATCTCGATCAGGGTGAAGCCCTGGTCAGG
#> ACGCCGGCTCACGGCGCGGGCGGTGGCTTCGCGGCGCATCACAGGCACGGCCTTGCTGGCGTGGTTGGGG
#> GCGGCACTGGATTATTGCAGGCCACCATTTCCAGGTAATCGTTAAAGCCCATCTTACGTGCCGTGCGCTC
#> CATGAGCAGGTCGACACCATTCGCGTTGTCTTCCTCGAAGCCGGCCTGGCCGGGAGTCGCCAGGGTGAGG
#> TCCTGCGTTGGCGCAAGCTCACCGCTCACGATCACCACCGCGTGGGCATCGTCCACGACTGAAGCCCCAT
#> CGGCGACGTTGACGACGACGCANTCGCCTCCGACGGTGCAGGGGTCGGACGGGCCCCCGGCTCCNGTGGG
#> GCTGAAGCCGTTGGCGTANCTCGCATANACAAACTGNGCCCACCGTTCTCGGNGAACCANGCGGGAGATC
#> CCAGGGTTTCCGGGCCTG
#> 
#> >AY952423.1 Livistona chinensis tRNA-Lys (trnK) gene, partial sequence; and matK gene, complete sequence; chloroplast
#> ATTGGGGTTGCTAACTCAACGGTAGAGTACTCGGCTTTTAAGTGCGACTATCATCTTTTACACATTTGGA
#> TGAAGTAAGGAATTCGTCCAGACTATTGGTAGAGTCTATAAGACCACGACTGATCCTGAAAGGTAATGAA
#> TGGAAAAAATAGCATGTCGTACGTAATACAATGAGAAACTTGTAATTTCTTATTGTAATTTTTTAAGTAG
#> AACTTTGAGTTTATCCTTACTGGATCATTACAAAAATATTGTATTTTATTTTTGGAAGGGGACGAAAAAA
#> AGGAAATTCCCAACATTTATTGTTTGGTCTAATGAATAAATGGATAGGGGCCTAGGGTAGGGCCCAATTT
#> TTGTAAAACAAAAAGCAACGAGCTTATGTTCTTAATTTGAATAATTACCCGATCTAATTAGATGTTAAAA
#> ATAAATTAGTGCCAGATGTGGTAAAGGGTTCTACTGTAAGTGGACCTTTTTTTTTTTTTTTTATGAATCC
#> TACCTATTATCTATTATGGATTAAAGATGGATGTGTATAAGAAGAAGTATACTGATAAAGAGAATTTTTC
#> CAAAGTCAAAAGAGCAATCGGGTTGCAAAAATAAAGGATTTTTACCTCCGAGAATTATAAATTAATTGGA
#> TCAAAAGGAGAGGAAAAAGTCTGTGATTGGACTCCTTCTATCCGCGGGTATGGGTATATAGTAGGTATAT
#> ATGTATATTTGTATACTATATAAATTACATGCCCTGTTCTGACCGTATTGCACTATGTATTATTTGATAA
#> TCCAAGAAATGCCTCCTACTTCTGGTTCAAGTAGAAATGAAAATAGAAGAATTACAAGAATATTTAGAAA
#> AAGATAGATCTCGGCAACAACACTTCCTATACCCACTTTTCTTTCAGGAGTATATTTATGCACTTGCTCA
#> TGATTATGGGTTTAAAGGGTTCGATTTTTTACGAACCTATGGAAATTGGGGGTTATGATAATAAATCTAG
#> TTCAGTACTTGTAAAACATTTAATTACTCGAATGTATCAACAGAATTATTTGATTTATTCTGTTAATGAA
#> TCTAACCAAAATCGATTGATTGAGCATAACAATTCTTTTTATTCTCAAATGATATCTGAAGTTTTTGCGA
#> TCATTGCAGAAATTCCATTCTCTCAGCAATTACTATTTTCTCTTCGAGGAAAAAAGAATACCAAAATCTC
#> AGACTTTACGATCTATTCATTCAATATTTCCCTTTTTAGAAGACAAATTATCACATTTAAACTATGTGTC
#> AGATATATTAATACCCTATCCCATCCATTTGGAAATCTTGGTGCAAATTCTTCAATGCTGGATCCAAGAT
#> GTTTCTTCTTTGCATTTATTGCGATTCTTTCTCCACGAACATCATAATGGGAATAGTTTTTTTTTTCCAA
#> AGAAATCCTTTTCAAAAGAAAATAAAAGACTCTTTCGATTCCTATATAATTCTTATGTATCTGAATGTGA
#> ATTTGTCTTAGTGTTTCTTCGTAAACAATCCTCTTATTTACAATCAAAATCCTATGGAATCTTTCTTGAG
#> CGAACACATTTCTATGGAAGAATGGAACATCTTATAGTAGTGTGTCATAATTATTGTCAGAAGGCCTTTT
#> GGGTCTTCAAGGATCCTTTTATGCATTATGTTCGATATCAAGGAAAAGCAATTCTGGCATCAAAAGGATC
#> TTATCTTTTGATGAAGAAATGGAGATGTCATCTTGTCAATTTCTGGCAATATTATTTTCATTTTTGGGCT
#> CAGCCTTACAGAATTTCAATAAACCAATTAGGAAATCATTCCTTCTATTTTCTCGGTTATCTTTCAAGTG
#> TATTAAAAAATACTTCGTCTGTAAGGAATCAAATGCTAGAGAATTCCTTTTTAATAGATACTATTACTAA
#> TAAATTGGATACCATAGTCCCAGTTCTTCCTCTTATTGGATCTTTGTCTAAAGCTAAATTTTGTACCGTA
#> TCCGGGCATCCTAGTAGTAAGCCAATCTGGACGGATTTATCGGATTCTGATATTATTGATAGATTTGGTC
#> GGATATGTAGAAATCTTTCTCATTATTATAGTGGATCCTCAAAAAAACAGAGCTTATATCGAATAAGGTA
#> TATACTTCGACTTTCTTGTGCTAGAACTTTAGCTCGTAAACATAAAAGTACAGTACGTGCTTTTTTGCAA
#> AGATTAGGTTCGGAATTATTAGAAGAATTCTTTACAGAAGAAGAAGGAGTTGTTTTTTTGATTTCCCAAA
#> AGAACAAAACCTCTTTTCCTCTCTATAGGTCACATAGAGAACGCATTTGGTATTTGGATATTATCCATAT
#> TAATGAATTGGTGAATTCATTTATGATGGGGCGATAAGCCCCTATAAAATAAGAAATATAAATTTTTTCT
#> AATGTCTAATAAATAGACGACAAATTCATTAATTTTCATTCTGAAATGCTCATCTAGTAGTGTAGTGATT
#> GAATCAACTGAGTATTCAAAATTTTTAGACAAACTTCTAGGGATAGAAGTTTGTTTTATCTGTATACATA
#> GGTAAAGTCGTGTGCAATGAAAAATGCAAGCACGATTTGGGGAGAGATAATTTTCTCTATTGTAACAAAT
#> AAAAATTATCTACTCCATCCGACTAGTTAATCG