Biological Sequence Retrieval
The biomartr
package allows users to retrieve biological
sequences in a very simple and intuitive way.
Using biomartr
, users can retrieve either genomes,
proteomes, CDS, RNA, GFF, and genome assembly statistics data using the
specialized functions:
NOTE: To make sure that you have a sufficiently stable (internet) connection between R and the respective databases, please set the default
timeout
setting on your local machine from 60sec to at least 30000sec before running any retrieval functions via:
options(timeout = 30000)
Topics
- 0. Getting Started with Sequence Retrieval
-
1. Genome retrieval with
getGenome()
- 1.B. Multiple genome retrieval with
getGenomeSet()
-
2. Proteome retrieval with
getProteome()
- 2.B. Multiple proteome retrieval
with
getProteomeSet()
-
3. Coding sequence retrieval with
getCDS()
- 3.B. Multiple coding sequence retrieval
with
getCDSSet()
-
4. RNA retrieval with
getRNA()
- 4.B. Multiple RNA retrieval with
getRNASet()
-
5.
GFF retrieval with
getGFF()
- 5.B. Multiple GFF retrieval with
getGFFSet()
-
6.
GTF retrieval with
getGTF()
-
7. Repeat Masker Annotation file
retrieval with
getRepeatMasker()
-
8. Genome Assembly Stats
Retrieval with
getAssemblyStats()
-
9. Collection (Genome, Proteome,
CDS, RNA, GFF, Repeat Masker, AssemblyStats) retrieval with
getCollection()
Getting Started with Sequence Retrieval
First users can check whether or not the genome, proteome, CDS, RNA, GFF, GTF, or genome assembly statistics of their interest is available for download.
Using the scientific name of the organism of interest, users can
check whether the corresponding genome is available via the
is.genome.available()
function.
Please note that the first time you run this command it might take a while, because during the initial execution of this function all necessary information is retrieved from NCBI and then stored locally. All further runs are then much faster.
Example NCBI RefSeq
(?is.genome.available):
# checking whether or not the Homo sapiens
# genome is avaialable for download
is.genome.available(db = "refseq", organism = "Homo sapiens")
A reference or representative genome assembly is available for 'Homo sapiens'.
[1] TRUE
In the case of the human genome, there are more than one entry in the NCBI RefSeq database (see message). By specifying the argument ‘details = TRUE’ users can retrieve all information for ‘Homo sapiens’ that is stored in NCBI RefSeq.
# checking whether or not the Homo sapiens
# genome is avaialable for download
is.genome.available(db = "refseq", organism = "Homo sapiens", details = TRUE)
assembly_accessi bioproject biosample wgs_master refseq_category taxid
<chr> <chr> <chr> <chr> <chr> <int>
1 GCF_000001405.38 PRJNA168 NA NA reference geno 9606
# ... with 15 more variables: species_taxid <int>, organism_name <chr>,
# infraspecific_name <chr>, isolate <chr>, version_status <chr>,
# assembly_level <chr>, release_type <chr>, genome_rep <chr>,
# seq_rel_date <date>, asm_name <chr>, submitter <chr>,
# gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
# excluded_from_refseq <chr>
Here, we find one possible versions of the human genome having the
assembly_accession
ID GCF_000001405.38
.
When retrieving a genome with e.g. getGenome()
the
organism
argument can also be specified using the
assembly_accession
ID instead of the scientific name of the
organism. This is true for all get*()
functions. Hence,
instead of writing
getGenome(db = "refseq", organism = "Homo sapiens")
, users
can specify
getGenome(db = "refseq", organism = "GCF_000001405.38")
.
This is particularly useful when more than one entry is available for
one organism.
Please note that the assembly_accession
id might
change internally in external databases such as NCBI RefSeq. Thus when
writing automated scripts using the assembly_accession
id
they might stop working due to the internal change of ids at RefSeq.
Recently, I had the case where the assembly_accession
id
was changed at RefSeq from GCF_000001405.37
to
GCF_000001405.38
. Thus, scripts based on screening for
entries with is.genome.available()
stopped working, because
the id GCF_000001405.37
couldn’t be found
anymore.
In some cases users will find several entries for the same scientific name. This might be due to the fact that ecotypes, strains, or other types of sub-species are available in the respective databases.
For example. when we search for the bacterium
Mycobacterium tuberculosis
in NCBI RefSeq we get 5377
hits.
is.genome.available(organism = "Mycobacterium tuberculosis", db = "refseq", details = TRUE)
A tibble: 6,744 x 21
assembly_accessi bioproject biosample wgs_master refseq_category
<chr> <chr> <chr> <chr> <chr>
1 GCF_000729745.1 PRJNA224116 SAMN02899 JPFP000000 na
2 GCF_000729755.1 PRJNA224116 SAMN02899 JPFQ000000 na
3 GCF_000729765.1 PRJNA224116 SAMN02899 JPFR000000 na
4 GCF_000749605.1 PRJNA224116 SAMN02902 JQES000000 na
5 GCF_000749615.1 PRJNA224116 SAMN02902 JQER000000 na
6 GCF_000749625.1 PRJNA224116 SAMN02902 JQEQ000000 na
7 GCF_000749665.1 PRJNA224116 SAMN02902 JQEV000000 na
8 GCF_000749675.1 PRJNA224116 SAMN02902 JQET000000 na
9 GCF_000749685.1 PRJNA224116 SAMN02902 JQEN000000 na
10 GCF_000749725.1 PRJNA224116 SAMN02902 JQEM000000 na
with 6,734 more rows, and 16 more variables: taxid <int>,
species_taxid <int>, organism_name <chr>,
infraspecific_name <chr>, isolate <chr>, version_status <chr>,
assembly_level <chr>, release_type <chr>, genome_rep <chr>,
seq_rel_date <date>, asm_name <chr>, submitter <chr>,
gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
excluded_from_refseq <chr>
When looking at the names of these organisms we see that they consist
of different strains of Mycobacterium tuberculosis
.
tail(is.genome.available(organism = "Mycobacterium tuberculosis",
db = "refseq",
details = TRUE)$organism_name)
[1] "Mycobacterium tuberculosis CDC1551"
[2] "Mycobacterium tuberculosis H37Ra"
[3] "Mycobacterium tuberculosis F11"
[4] "Mycobacterium tuberculosis KZN 1435"
[5] "Mycobacterium tuberculosis SUMu001"
[6] "Mycobacterium tuberculosis str. Haarlem"
Now we can use the assembly_accession
id to retrieve the
Mycobacterium tuberculosis
strain we are interested in,
e.g. Mycobacterium tuberculosis CDC1551
.
MtbCDC1551 <- getGenome(db = "refseq",
organism = "GCF_000008585.1",
path = file.path("_ncbi_downloads","genomes"),
reference = FALSE)
MtbCDC1551_genome <- read_genome(MtbCDC1551)
MtbCDC1551_genome
A DNAStringSet instance of length 1
width seq names
[1] 4403837 TTGACCGATGACCC...AGGGAGATACGTCG NC_002755.2 Mycob...
Using the NCBI Taxonomy ID instead of the scientific name to screen for organism availability
Instead of specifying the scientific name
of the
organism of interest users can specify the NCBI Taxonomy
identifier (= taxid
) of the corresponding organism. For
example, the taxid
of Homo sapiens
is
9606
. Now users can specify organism = "9606"
to retrieve entries for Homo sapiens
:
# checking availability for Homo sapiens using its taxid
is.genome.available(db = "refseq", organism = "9606", details = TRUE)
assembly_accession bioproject biosample wgs_master refseq_category taxid
<chr> <chr> <chr> <chr> <chr> <int>
1 GCF_000001405.38 PRJNA168 NA NA reference genome 9606
# ... with 15 more variables: species_taxid <int>, organism_name <chr>,
# infraspecific_name <chr>, isolate <chr>, version_status <chr>,
# assembly_level <chr>, release_type <chr>, genome_rep <chr>,
# seq_rel_date <date>, asm_name <chr>, submitter <chr>,
# gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
# excluded_from_refseq <chr>
Using the accession ID instead of the scientific name or taxid to screen for organism availability
Finally, instead of specifying either the
scientific name
of the organism of interest nor the
taxid
, users can specify the accession ID of the organism
of interest. In the following example we use the accession ID of
Homo sapiens
(= GCF_000001405.38
):
# checking availability for Homo sapiens using its taxid
is.genome.available(db = "refseq", organism = "GCF_000001405.38", details = TRUE)
assembly_accession bioproject biosample wgs_master refseq_category taxid
<chr> <chr> <chr> <chr> <chr> <int>
1 GCF_000001405.38 PRJNA168 NA NA reference genome 9606
# ... with 15 more variables: species_taxid <int>, organism_name <chr>,
# infraspecific_name <chr>, isolate <chr>, version_status <chr>,
# assembly_level <chr>, release_type <chr>, genome_rep <chr>,
# seq_rel_date <date>, asm_name <chr>, submitter <chr>,
# gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
# excluded_from_refseq <chr>
A small negative example
In some cases your organism of interest will not be available in
NCBI RefSeq
. Here an example:
# check genome availability for Candida glabrata
is.genome.available(db = "refseq", organism = "Candida glabrata")
Unfortunatey, no entry for 'Candida glabrata' was found in the 'refseq' database.
Please consider specifying 'db = genbank' or 'db = ensembl'
to check whether 'Candida glabrata' is availabe in these databases.
[1] FALSE
When now checking the availability in NCBI Genbank
we
find that indeed ‘Candida glabrata’ is available:
# check genome availability for Candida glabrata
is.genome.available(db = "genbank", organism = "Candida glabrata")
Only a non-reference genome assembly is available for 'Candida glabrata'.
Please make sure to specify the argument 'reference = FALSE' when running any get*() function.
[1] TRUE
Although, an entry is available the
is.genome.available()
warns us that only a non-reference
genome is available for ‘Candida glabrata’. To then retrieve the
e.g. genome etc. files users need to specify the
reference = FALSE
argument in the get*()
functions.
For example:
# retrieve non-reference genome
getGenome(db = "genbank", organism = "Candida glabrata", reference = FALSE)
Unfortunately no genome file could be found for organism 'Candida glabrata'. Thus, the download of this organism has been omitted. Have you tried to specify 'reference = FALSE' ?
[1] "Not available"
Example NCBI Genbank
(?is.genome.available):
Test whether or not the genome of Homo sapiens
is
present at NCBI Genbank.
# checking whether or not the Homo sapiens
# genome is avaialable for download
is.genome.available(db = "genbank", organism = "Homo sapiens")
A reference or representative genome assembly is available for 'Homo sapiens'.
More than one entry was found for 'Homo sapiens'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Homo sapiens, db = genbank, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.
[1] TRUE
Now with details = TRUE
.
# checking whether or not the Homo sapiens
# genome is avaialable for download
is.genome.available(db = "genbank", organism = "Homo sapiens", details = TRUE)
A tibble: 1,041 x 23
assembly_accession bioproject biosample wgs_master refseq_category
<chr> <chr> <chr> <chr> <chr>
1 GCA_000001405.28 PRJNA31257 NA NA reference geno
2 GCA_000002115.2 PRJNA1431 SAMN02981 AADD00000 na
3 GCA_000002125.2 PRJNA19621 SAMN02981 ABBA00000 na
4 GCA_000002135.3 PRJNA10793 NA NA na
5 GCA_000004845.2 PRJNA42199 SAMN00003 ADDF00000 na
6 GCA_000005465.1 PRJNA42201 NA DAAB00000 na
7 GCA_000181135.1 PRJNA28335 SAMN00001 ABKV00000 na
8 GCA_000185165.1 PRJNA59877 SAMN02981 AEKP00000 na
9 GCA_000212995.1 PRJNA19621 SAMN02981 ABSL00000 na
10 GCA_000252825.1 PRJNA19621 SAMN02981 ABBA00000 na
# with 1,031 more rows, and 18 more variables: taxid <int>,
# species_taxid <int>, organism_name <chr>,
# infraspecific_name <chr>, isolate <chr>, version_status <chr>,
# assembly_level <chr>, release_type <chr>, genome_rep <chr>,
# seq_rel_date <date>, asm_name <chr>, submitter <chr>,
# gbrs_paired_asm <chr>, paired_asm_comp <chr>, ftp_path <chr>,
# excluded_from_refseq <chr>, X22 <chr>, X23 <chr>
As you can see there are several versions of the
Homo sapiens
genome available for download from NCBI
Genbank. Using the assembly_accession
id will now allow to
specify which version shall be retrieved.
Using is.genome.available()
with ENSEMBL
Users can also specify db = "ensembl"
to retrieve
available organisms provided by ENSEMBL. Again, users might experience a
delay in the execution of this function when running it for the first
time. This is due to the download of ENSEMBL information which is then
stored internally to enable a much faster execution of this function in
following runs. The corresponding information files are stored at
file.path(tempdir(), "ensembl_summary.txt")
.
Example ENSEMBL
(?is.genome.available):
# cheking whether Homo sapiens is available in the ENSEMBL database
is.genome.available(db = "ensembl", organism = "Homo sapiens")
A reference or representative genome assembly is available for 'Homo sapiens'.
[1] TRUE
# retrieve details for Homo sapiens
is.genome.available(db = "ensembl", organism = "Homo sapiens", details = TRUE)
division assembly accession release name taxon_id strain
<chr> <chr> <chr> <int> <chr> <int> <chr>
1 EnsemblVertebrates GRCh38 GCA_00000 104 homo 9606 NA
# with 3 more variables: display_name <chr>, common_name <chr>,
# strain_collection <chr>
Again, users can either specify the taxid
or
accession id
when searching for organism entries.
# retrieve details for Homo sapiens using taxid
is.genome.available(db = "ensembl", organism = "9606", details = TRUE)
division assembly accession release name taxon_id strain
<chr> <chr> <chr> <int> <chr> <int> <chr>
1 EnsemblVertebrates GRCh38 GCA_00000 104 homo 9606 NA
# with 3 more variables: display_name <chr>, common_name <chr>,
# strain_collection <chr>
# retrieve details for Homo sapiens using accession id
is.genome.available(organism = "GCA_000001405.28", db = "ensembl", details = TRUE)
division assembly accession release name taxon_id strain
<chr> <chr> <chr> <int> <chr> <int> <chr>
1 EnsemblVertebrates GRCh38 GCA_00000 104 homo 9606 NA
# with 3 more variables: display_name <chr>, common_name <chr>,
# strain_collection <chr>
Please note that the accession
id can change
internally at ENSEMBL. E.g. in a recent case the accession
id changed from GCA_000001405.25
to
GCA_000001405.27
. Hence, please be careful and take this
issue into account when you build automated retrieval scripts that are
based on accession
ids.
Example UniProt
(?is.genome.available):
Users can also check the availability of proteomes in the UniProt database by specifying:
# retrieve information from UniProt
is.genome.available(db = "uniprot", "Homo sapiens", details = FALSE)
A reference or representative genome assembly is available for 'Homo sapiens'.
More than one entry was found for 'Homo sapiens'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Homo sapiens, db = uniprot, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.
[1] TRUE
or with details:
# retrieve information from UniProt
is.genome.available(db = "uniprot", "Homo sapiens", details = TRUE)
A tibble: 29 x 16
name description isReferenceProte isRepresentativ
<chr> <chr> <lgl> <lgl>
1 Homo sapiens Homo sapiens (Hom TRUE TRUE
2 Human associ NA FALSE TRUE
3 Human respir NA FALSE FALSE
4 Human respir NA FALSE FALSE
5 Human respir NA FALSE FALSE
6 Human respir NA FALSE FALSE
7 Human respir NA FALSE FALSE
8 Human respir NA FALSE FALSE
9 Human respir NA FALSE FALSE
10 Human respir NA FALSE FALSE
# with 19 more rows, and 12 more variables:
# genomeAssembly <df[,4]>, dbReference <list>, component <list>,
# reference <list>, annotationScore <df[,1]>, scores <list>,
# upid <chr>, modified <dbl>, taxonomy <int>, source <chr>,
# superregnum <chr>, strain <chr>
Users can also search available species at UniProt via
taxid
or upid
id.
Here 9606
defines the taxonomy id for
Homo sapiens
.
# retrieve information from UniProt
is.genome.available(db = "uniprot", "9606", details = TRUE)
A tibble: 3 x 15
name description isReferenceProt isRepresentativ genomeAssembly$
<chr> <chr> <lgl> <lgl> <chr>
1 Homo Homo sapie TRUE TRUE Ensembl
2 Homo NA FALSE FALSE ENA/EMBL
3 Homo NA FALSE FALSE ENA/EMBL
# with 10 more variables: dbReference <list>, component <list>,
# reference <list>, annotationScore <df[,1]>, scores <list>,
# upid <chr>, modified <dbl>, taxonomy <int>, source <chr>,
# superregnum <chr>
Here UP000005640
defines the upid
for
Homo sapiens
.
# retrieve information from UniProt
is.genome.available(db = "uniprot", "UP000005640", details = TRUE)
name description isReferenceProt isRepresentativ genomeAssembly$
<chr> <chr> <lgl> <lgl> <chr>
1 Homo Homo sapie TRUE TRUE Ensembl
# with 10 more variables: dbReference <list>, component <list>,
# reference <list>, annotationScore <df[,1]>, scores <list>,
# upid <chr>, modified <dbl>, taxonomy <int>, source <chr>,
# superregnum <chr>
In general, the argument db
specifies from which
database (refseq
, genbank
,
ensembl
or uniprot
) organism information shall
be retrieved. Options are:
db = 'refseq'
db = 'genbank'
db = 'ensembl'
db = 'uniprot'
Listing the total number of available genomes
In some cases it might be useful to check how many genomes (in total) are available in the different databases.
Users can determine the total number of available genomes using the
listGenomes()
function.
Example refseq
:
length(listGenomes(db = "refseq"))
[1] 24910
Example genbank
:
length(listGenomes(db = "genbank"))
[1] 35298
Example ensembl
:
length(listGenomes(db = "ensembl"))
[1] 310
Hence, currently 24910 genomes (including all kingdoms of life) are
stored on NCBI RefSeq
(as of 16/11/2021).
Retrieving kingdom, group and subgroup information
Using this example users can retrieve the number of available species for each kingdom of life:
Example refseq
:
# the number of genomes available for each kingdom
listKingdoms(db = "refseq")
Archaea Bacteria Eukaryota Viruses
479 16255 1383 12916
Example genbank
:
# the number of genomes available for each kingdom
listKingdoms(db = "genbank")
Archaea Bacteria Eukaryota
1902 32532 12404
Example ENSEMBL
:
# the number of genomes available for each kingdom
listKingdoms(db = "ensembl")
Starting information retrieval for: EnsemblVertebrates
Starting information retrieval for: EnsemblPlants
Starting information retrieval for: EnsemblFungi
Starting information retrieval for: EnsemblMetazoa
Starting information retrieval for: EnsemblBacteria
EnsemblBacteria EnsemblFungi EnsemblMetazoa
31332 1504 251
EnsemblPlants EnsemblVertebrates
150 317
Analogous computations can be performed for groups
and
subgroups
Unfortunately, ENSEMBL
does not provide group or
subgroup information. Therefore, group and subgroup listings are limited
to refseq
and genbank
.
Example refseq
:
# the number of genomes available for each group
listGroups(db = "refseq")
Abditibacteriota
1
Acidithiobacillia
8
Acidobacteriia
22
Ackermannviridae
65
Actinobacteria
2640
Adenoviridae
64
Adomaviridae
2
Aliusviridae
2
Alloherpesviridae
13
Alphaflexiviridae
62
Alphaproteobacteria
1762
...
Example genbank
:
# the number of genomes available for each group
listGroups(db = "genbank")
Abditibacteriota Acidithiobacillia
1 8
Acidobacteriia Actinobacteria
24 1596
Alphaproteobacteria Amphibians
1604 6
Apicomplexans Aquificae
47 7
Archaeoglobi Armatimonadetes
5 38
Ascomycetes Bacteria candidate phyla
689 3532
Bacteroidetes/Chlorobi group Balneolia
2247 44
Basidiomycetes Betaproteobacteria
204 751
Birds Blastocatellia
80 2
Caldisericia candidate division WS1
1 1
candidate division Zixibacteria Candidatus Aegiribacteria
17 1
Candidatus Aenigmarchaeota Candidatus Bathyarchaeota
14 42
Candidatus Cloacimonetes Candidatus Diapherotrites
88 11
Candidatus Fermentibacteria Candidatus Geothermarchaeota
8 3
Candidatus Heimdallarchaeota Candidatus Hydrogenedentes
4 10
Candidatus Korarchaeota Candidatus Kryptonia
1 4
Candidatus Lambdaproteobacteria Candidatus Latescibacteria
6 13
Candidatus Lokiarchaeota Candidatus Marinimicrobia
2 92
Candidatus Marsarchaeota Candidatus Micrarchaeota
15 35
Candidatus Moduliflexus Candidatus Muproteobacteria
1 14
Candidatus Nanohaloarchaeota Candidatus Odinarchaeota
16 1
Candidatus Omnitrophica Candidatus Pacearchaeota
126 41
Candidatus Parvarchaeota Candidatus Tectomicrobia
11 6
Candidatus Thorarchaeota Candidatus Vecturithrix
7 1
Candidatus Verstraetearchaeota Candidatus Woesearchaeota
5 36
Chlamydiae Chloroflexi
43 403
Chrysiogenetes Coprothermobacterota
1 1
Crenarchaeota Cyanobacteria/Melainabacteria group
54 172
Deferribacteres Deinococcus-Thermus
8 27
delta/epsilon subdivisions Dictyoglomia
615 1
Elusimicrobia Endomicrobia
1 2
environmental samples Fibrobacteres
3 15
Firmicutes Fishes
2663 168
Flatworms Fusobacteriia
35 29
Gammaproteobacteria Gemmatimonadetes
2041 82
Green Algae Hadesarchaea
30 5
Halobacteria Holophagae
162 9
Hydrogenophilalia Insects
32 280
Kinetoplasts Kiritimatiellaeota
33 3
Land Plants Lentisphaerae
282 32
Mammals Methanobacteria
145 31
Methanococci Methanomicrobia
2 77
Methanonatronarchaeia Methanopyri
2 1
Nanoarchaeota nitrifying bacterium enrichment culture
11 1
Nitrospinae Nitrospira
19 31
Oligoflexia Other
87 15
Other Animals Other Fungi
125 58
Other Plants Other Protists
1 128
Planctomycetes Reptiles
171 21
Rhodothermia Roundworms
4 91
Solibacteres Spirochaetia
7 115
Synergistia Tenericutes
43 141
Thaumarchaeota Theionarchaea
92 2
Thermococci Thermodesulfobacteria
21 7
Thermoplasmata Thermotogae
135 23
unclassified Acidobacteria unclassified Archaea (miscellaneous)
94 52
unclassified Bacteria (miscellaneous) unclassified Caldiserica
197 11
unclassified Calditrichaeota unclassified Elusimicrobia
2 101
unclassified Euryarchaeota unclassified Nitrospirae
348 90
unclassified Proteobacteria unclassified Rhodothermaeota
130 4
unclassified Spirochaetes unclassified Synergistetes
37 3
unclassified Thermotogae uncultured archaeon
1 1
uncultured archaeon A07HB70 uncultured archaeon A07HN63
1 1
uncultured archaeon A07HR60 uncultured archaeon A07HR67
1 1
uncultured bacterium Verrucomicrobia
1 296
Zetaproteobacteria
41
Note that when running the
listGenomes()
function for the first time, it might take a while until the function returns any results, because necessary information need to be downloaded from NCBI and ENSEMBL databases. All subsequent executions oflistGenomes()
will then respond very fast, because they will access the corresponding files stored on your hard drive.
Downloading Biological Sequences and Annotations
After checking for the availability of sequence information for an
organism of interest, the next step is to download the corresponding
genome, proteome, CDS, or GFF file. The following functions allow users
to download proteomes, genomes, CDS and GFF files from several database
resources such as: NCBI RefSeq, NCBI Genbank, ENSEMBL. When a
corresponding proteome, genome, CDS or GFF file was loaded to your
hard-drive, a documentation *.txt
file is generated storing
File Name
, Organism
, Database
,
URL
, DATE
, assembly_accession
,
bioproject
, biosample
, taxid
,
version_status
, release_type
,
seq_rel_date
etc. information of the retrieved file. This
way a better reproducibility of proteome, genome, CDS and GFF versions
used for subsequent data analyses can be achieved.
Genome Retrieval
The easiest way to download a genome is to use the
getGenome()
function.
In this example we will download the genome of
Homo sapiens
.
The getGenome()
function is an interface function to the
NCBI RefSeq, NCBI Genbank, ENSEMBL databases from
which corresponding genomes can be retrieved.
The db
argument specifies from which database genome
assemblies in *.fasta
file format shall be retrieved.
Options are:
-
db = "refseq"
for retrieval from NCBI RefSeq -
db = "genbank"
for retrieval from NCBI Genbank -
db = "ensembl"
for retrieval from ENSEMBL
Furthermore, users need to specify the scientific name
,
the taxid
(= NCBI Taxnonomy
identifier), or the accession identifier
of the organism of
interest for which a genome assembly shall be downloaded,
e.g. organism = "Homo sapiens"
or
organism = "9606"
or
organism = "GCF_000001405.37"
. Finally, the
path
argument specifies the folder path in which the
corresponding assembly shall be locally stored. In case users would like
to store the genome file at a different location, they can specify the
path = file.path("put","your","path","here")
argument
(e.g. file.path("_ncbi_downloads","genomes")
).
Example NCBI RefSeq:
# download the genome of Homo sapiens from refseq
# and store the corresponding genome file in '_ncbi_downloads/genomes'
HS.genome.refseq <- getGenome( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","genomes") )
In this example, getGenome()
creates a directory named
'_ncbi_downloads/genomes'
into which the corresponding
genome named GCF_000001405.34_GRCh38.p8_genomic.fna.gz
is
downloaded. The return value of getGenome()
is the folder
path to the downloaded genome file that can then be used as input to the
read_genome()
function. The variable
HS.genome.refseq
stores the path to the downloaded
genome.
Users can also omit the path
argument if they wish to
store the genome in their current working directory. E.g.:
# download the genome of Homo sapiens from refseq
# and store the corresponding genome file in '_ncbi_downloads/genomes'
HS.genome.refseq <- getGenome( db = "refseq",
organism = "Homo sapiens")
Subsequently, users can use the read_genome()
function
to import the genome into the R session. Users can choose to work with
the genome sequence in R either as Biostrings
object (obj.type = "Biostrings"
) or data.table
object (obj.type = "data.table"
) by specifying the
obj.type
argument of the read_genome()
function.
# import downloaded genome as Biostrings object
Human_Genome <- read_genome(file = HS.genome.refseq)
# look at the Biostrings object
Human_Genome
A DNAStringSet instance of length 551
width seq names
[1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN NC_000001.11 Homo...
[2] 175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC NT_187361.1 Homo ...
[3] 32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC NT_187362.1 Homo ...
[4] 127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC NT_187363.1 Homo ...
[5] 66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC NT_187364.1 Homo ...
... ... ...
[547] 170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC NT_187685.1 Homo ...
[548] 215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC NT_187686.1 Homo ...
[549] 170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC NT_187687.1 Homo ...
[550] 177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC NT_113949.2 Homo ...
[551] 16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG NC_012920.1 Homo ...
Internally, a text file named
doc_Homo_sapiens_db_refseq.txt
is generated. The
information stored in this log file is structured as follows:
File Name: Homo_sapiens_genomic_refseq.fna.gz
Organism Name: Homo_sapiens
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_genomic.fna.gz
Download_Date: Sat Oct 22 12:41:07 2016
refseq_category: reference genome
assembly_accession: GCF_000001405.35
bioproject: PRJNA168
biosample: NA
taxid: 9606
infraspecific_name: NA
version_status: latest
release_type: Patch
genome_rep: Full
seq_rel_date: 2016-09-26
submitter: Genome Reference Consortium
In addition, the genome summary statistics for the retrieved species
is stored locally
(doc_Homo_sapiens_db_refseq_summary_statistics.tsv
) to
provide users with insights regarding the genome assembly quality (see
?summary_genome()
for details). This file can be used as
Supplementary Information
file in publications to
facilitate reproducible research. Most comparative genomics studies do
not consider differences in genome assembly qualities when comparing the
genomes of diverse species. This way, they expose themselves to
technical artifacts that might generate patterns mistaken to be of
biological relevance whereas in reality they just reflect the difference
in genome assembly quality. Considering the quality of genome assemblies
when downloading the genomic sequences will help researchers to avoid
these pitfalls.
The summary statistics include:
genome_size_mbp
: Genome size in mega base pairsn50_mbp
: The N50 contig size of the genome assembly in mega base pairsn_seqs
: The number of chromosomes/scaffolds/contigs of the genome assembly filen_nnn
: The absolute number of NNNs (over all chromosomes or scaffolds or contigs) in the genome assembly filerel_nnn
: The percentage (relative frequency) of NNNs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly filegenome_entropy
: The Shannon Entropy of the genome assembly file (median entropy over all individual chromosome entropies)n_gc
: The total number of GCs (over all chromosomes or scaffolds or contigs) in the genome assembly filerel_gc
: The (relative frequency) of GCs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file
In summary, the getGenome()
and
read_genome()
functions allow users to retrieve genome
assemblies by specifying the scientific name of the organism of interest
and allow them to import the retrieved genome assembly e.g. as
Biostrings
object. Thus, users can then perform the
Biostrings notation
to work with downloaded genomes and can
rely on the log file generated by getGenome()
to better
document the source and version of genome assemblies used for subsequent
studies.
Alternatively, users can perform the pipeline logic of the magrittr package:
# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Human_Genome <- getGenome( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","genomes")) %>%
read_genome()
Human_Genome
A DNAStringSet instance of length 551
width seq names
[1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN NC_000001.11 Homo...
[2] 175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC NT_187361.1 Homo ...
[3] 32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC NT_187362.1 Homo ...
[4] 127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC NT_187363.1 Homo ...
[5] 66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC NT_187364.1 Homo ...
... ... ...
[547] 170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC NT_187685.1 Homo ...
[548] 215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC NT_187686.1 Homo ...
[549] 170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC NT_187687.1 Homo ...
[550] 177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC NT_113949.2 Homo ...
[551] 16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG NC_012920.1 Homo ...
Use taxid
id for genome retrieval
Alternatively, instead of specifying the scientific name in the
argument organism
users can specify the
taxonomy
id of the corresponding organism. Here, we specify
the taxonomy id 559292
which encodes the species
Saccharomyces cerevisiae
.
# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
db = "refseq",
organism = "559292") %>%
read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 17
width seq names
[1] 230218 CCACACCACACCCACACAC...GGGTGTGGTGTGTGTGGG NC_001133.9 Sacch...
[2] 813184 AAATAGCCCTCATGTACGT...TGTGGTGTGTGGGTGTGT NC_001134.8 Sacch...
[3] 316620 CCCACACACCACACCCACA...GTGGGTGTGGTGTGTGTG NC_001135.5 Sacch...
[4] 1531933 ACACCACACCCACACCACA...GTAGTAAGTAGCTTTTGG NC_001136.10 Sacc...
[5] 576874 CGTCTCCTCCAAGCCCTGT...ATTTTCATTTTTTTTTTT NC_001137.3 Sacch...
... ... ...
[13] 924431 CCACACACACACCACACCC...TGTGGTGTGTGTGTGGGG NC_001145.3 Sacch...
[14] 784333 CCGGCTTTCTGACCGAAAT...TGTGGGTGTGGTGTGGGT NC_001146.8 Sacch...
[15] 1091291 ACACCACACCCACACCACA...TGTGTGGGTGTGGTGTGT NC_001147.6 Sacch...
[16] 948066 AAATAGCCCTCATGTACGT...TTTAATTTCGGTCAGAAA NC_001148.4 Sacch...
[17] 85779 TTCATAATTAATTTTTTAT...TATAATATAATATCCATA NC_001224.1 Sacch...
Use assembly_accession
id for genome retrieval
Alternatively, instead of specifying the scientific name or taxonomy
in the argument organism
users can specify the
assembly_accession
id of the corresponding organism. Here,
we specify the assembly_accession
id
GCF_000146045.2
which encodes the species
Saccharomyces cerevisiae
.
# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
db = "refseq",
organism = "GCF_000146045.2") %>%
read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 17
width seq names
[1] 230218 CCACACCACACCCACACACCCACACA...GTGGTGTGGGTGTGGTGTGTGTGGG NC_001133.9 Sacch...
[2] 813184 AAATAGCCCTCATGTACGTCTCCTCC...GTGTGGGTGTGGTGTGTGGGTGTGT NC_001134.8 Sacch...
[3] 316620 CCCACACACCACACCCACACCACACC...GGGTGTGGTGGGTGTGGTGTGTGTG NC_001135.5 Sacch...
[4] 1531933 ACACCACACCCACACCACACCCACAC...AATAAAGGTAGTAAGTAGCTTTTGG NC_001136.10 Sacc...
[5] 576874 CGTCTCCTCCAAGCCCTGTTGTCTCT...GGGTTTCATTTTCATTTTTTTTTTT NC_001137.3 Sacch...
... ... ...
[13] 924431 CCACACACACACCACACCCACACCAC...GTGTGGGTGTGGTGTGTGTGTGGGG NC_001145.3 Sacch...
[14] 784333 CCGGCTTTCTGACCGAAATTAAAAAA...GGGTGTGTGTGGGTGTGGTGTGGGT NC_001146.8 Sacch...
[15] 1091291 ACACCACACCCACACCACACCCACAC...GAGAGAGTGTGTGGGTGTGGTGTGT NC_001147.6 Sacch...
[16] 948066 AAATAGCCCTCATGTACGTCTCCTCC...TTTTTTTTTTAATTTCGGTCAGAAA NC_001148.4 Sacch...
[17] 85779 TTCATAATTAATTTTTTATATATATA...GCTTAATTATAATATAATATCCATA NC_001224.1 Sacch...
Example NCBI Genbank:
Genome retrieval from NCBI Genbank
.
# download the genome of Homo sapiens from Genbank
# and store the corresponding genome file in '_ncbi_downloads/genomes'
HS.genome.genbank <- getGenome( db = "genbank",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","genomes") )
# import downloaded genome as Biostrings object
Human_Genome <- read_genome(file = HS.genome.genbank)
# look at the Biostrings object
Human_Genome
A DNAStringSet instance of length 551
width seq names
[1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN CM000663.2 Homo s...
[2] 175055 GAATTCAGCTGAGAAGAACAGGCA...TGTTTGTCAGTCACATAGAATTC KI270706.1 Homo s...
[3] 32032 AGGGGTCTGCTTAGAGAGGGTCTC...TGACTTACGTTGACGTGGAATTC KI270707.1 Homo s...
[4] 127682 GATCGAGACTATCCTGGCTAACAC...ATTGTCAATTGGGACCTTTGATC KI270708.1 Homo s...
[5] 66860 GAATTCATTCGATGACGATTCCAT...AAAAAACTCTCAGCCACGAATTC KI270709.1 Homo s...
... ... ...
[547] 170148 TTTCTTTCTTTTTTTTTTTTTTGT...GTCACAGGACTCATGGGGAATTC KI270931.1 Homo s...
[548] 215732 TGTGGTGAGGACCCTTAAGATCTA...GTCACAGGACTCATGGGGAATTC KI270932.1 Homo s...
[549] 170537 TCTACTCTCCCATGCTTGCCTCGG...GTCACAGGACTCATGGGGAATTC KI270933.1 Homo s...
[550] 177381 GATCTATCTGTATCTCCACAGGTG...GTCACAGGACTCATGGGGAATTC GL000209.2 Homo s...
[551] 16569 GATCACAGGTCTATCACCCTATTA...CCCTTAAATAAGACATCACGATG J01415.2 Homo sap...
Use taxonomy
id for genome retrieval
Alternatively, instead of specifying the scientific name in the
argument organism
users can specify the
taxonomy
id of the corresponding organism. Here, we specify
the taxonomy id 559292
which encodes the species
Saccharomyces cerevisiae
.
# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
db = "genbank",
organism = "559292") %>%
read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 16
width seq names
[1] 230218 CCACACCACACCCACA...TGTGGTGTGTGTGGG BK006935.2 TPA_in...
[2] 813184 AAATAGCCCTCATGTA...GGTGTGTGGGTGTGT BK006936.2 TPA_in...
[3] 316620 CCCACACACCACACCC...GGTGTGGTGTGTGTG BK006937.2 TPA_in...
[4] 1531933 ACACCACACCCACACC...GTAAGTAGCTTTTGG BK006938.2 TPA_in...
[5] 576874 CGTCTCCTCCAAGCCC...TTCATTTTTTTTTTT BK006939.2 TPA_in...
... ... ...
[12] 1078177 CACACACACACACCAC...ACATGAGGGCTATTT BK006945.2 TPA_in...
[13] 924431 CCACACACACACCACA...GGTGTGTGTGTGGGG BK006946.2 TPA_in...
[14] 784333 CCGGCTTTCTGACCGA...GGGTGTGGTGTGGGT BK006947.3 TPA_in...
[15] 1091291 ACACCACACCCACACC...GTGGGTGTGGTGTGT BK006948.2 TPA_in...
[16] 948066 AAATAGCCCTCATGTA...AATTTCGGTCAGAAA BK006949.2 TPA_in...
Use assembly_accession
id for genome retrieval
Alternatively, instead of specifying the scientific name or taxonomy
in the argument organism
users can specify the
assembly_accession
id of the corresponding organism. Here,
we specify the assembly_accession
id
GCA_000146045.2
which encodes the species
Saccharomyces cerevisiae
.
# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
db = "genbank",
organism = "GCA_000146045.2") %>%
read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 16
width seq names
[1] 230218 CCACACCACACCCACAC...GTGTGGTGTGTGTGGG BK006935.2 TPA_in...
[2] 813184 AAATAGCCCTCATGTAC...TGGTGTGTGGGTGTGT BK006936.2 TPA_in...
[3] 316620 CCCACACACCACACCCA...GGGTGTGGTGTGTGTG BK006937.2 TPA_in...
[4] 1531933 ACACCACACCCACACCA...AGTAAGTAGCTTTTGG BK006938.2 TPA_in...
[5] 576874 CGTCTCCTCCAAGCCCT...TTTCATTTTTTTTTTT BK006939.2 TPA_in...
... ... ...
[12] 1078177 CACACACACACACCACC...TACATGAGGGCTATTT BK006945.2 TPA_in...
[13] 924431 CCACACACACACCACAC...TGGTGTGTGTGTGGGG BK006946.2 TPA_in...
[14] 784333 CCGGCTTTCTGACCGAA...TGGGTGTGGTGTGGGT BK006947.3 TPA_in...
[15] 1091291 ACACCACACCCACACCA...TGTGGGTGTGGTGTGT BK006948.2 TPA_in...
[16] 948066 AAATAGCCCTCATGTAC...TAATTTCGGTCAGAAA BK006949.2 TPA_in...
In addition, the genome summary statistics for the retrieved species
is stored locally
(doc_saccharomyces_cerevisiae_db_genbank_summary_statistics.tsv
)
to provide users with insights regarding the genome assembly quality
(see ?summary_genome()
for details). This file can be used
as Supplementary Information
file in publications to
facilitate reproducible research. Most comparative genomics studies do
not consider differences in genome assembly qualities when comparing the
genomes of diverse species. This way, they expose themselves to
technical artifacts that might generate patterns mistaken to be of
biological relevance whereas in reality they just reflect the difference
in genome assembly quality. Considering the quality of genome assemblies
when downloading the genomic sequences will help researchers to avoid
these pitfalls.
The summary statistics include:
genome_size_mbp
: Genome size in mega base pairsn50_mbp
: The N50 contig size of the genome assembly in mega base pairsn_seqs
: The number of chromosomes/scaffolds/contigs of the genome assembly filen_nnn
: The absolute number of NNNs (over all chromosomes or scaffolds or contigs) in the genome assembly filerel_nnn
: The percentage (relative frequency) of NNNs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly filegenome_entropy
: The Shannon Entropy of the genome assembly file (median entropy over all individual chromosome entropies)n_gc
: The total number of GCs (over all chromosomes or scaffolds or contigs) in the genome assembly filerel_gc
: The (relative frequency) of GCs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file
Example ENSEMBL:
# download the genome of Homo sapiens from ENSEMBL
# and store the corresponding genome file in '_ncbi_downloads/genomes'
HS.genome.ensembl <- getGenome( db = "ensembl",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","genomes") ,
assembly_type = "primary_assembly")
# import downloaded genome as Biostrings object
Human_Genome <- read_genome(file = HS.genome.ensembl)
# look at the Biostrings object
Human_Genome
A DNAStringSet instance of length 524
width seq names
[1] 248956422 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 1 dna:chromosome ...
[2] 242193529 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 2 dna:chromosome ...
[3] 198295559 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 3 dna:chromosome ...
[4] 190214555 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 4 dna:chromosome ...
[5] 181538259 NNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNN 5 dna:chromosome ...
... ... ...
[520] 993 GCCCCACGTCCGGGAGGGAGGTGG...GAGGGAGGTGGGGGGTCAGCCCT KI270539.1 dna:sc...
[521] 990 TTTCATAGAGCATGTTTGAAACAC...TCAGAAACTTGTTGTGATGTGTG KI270385.1 dna:sc...
[522] 981 AGATTTCGTTGGAACGGGATAAAC...TCAGCTTTCAAACACTCTTTTTG KI270423.1 dna:sc...
[523] 971 ATTTGCGATGTGTGTTCTCAACTA...TTGGATAGCTTTGAAGTTTTCGT KI270392.1 dna:sc...
[524] 970 AAGTGGATATTTGGATAGCTTTGA...TCCTCAATAACAGAGTTGAACCT KI270394.1 dna:sc...
If you are using db = "ensembl"
you can set
assembly_type = "primary_assembly"
. For the human genome
the toplevel genome assembly size is > 70 GB uncompressed, while
primary assembly is just a few GB. Users can also specify the
release
argument which denotes the database release version
of ENSEMBL (db = "ensembl"
) in case they wish to download
archived vgenome versions. Default is release = NULL meaning that the
most recent database version is used.
In addition, the genome summary statistics for the retrieved species
is stored locally
(doc_homo_sapiens_db_ensembl_summary_statistics.tsv
) to
provide users with insights regarding the genome assembly quality (see
?summary_genome()
for details). This file can be used as
Supplementary Information
file in publications to
facilitate reproducible research. Most comparative genomics studies do
not consider differences in genome assembly qualities when comparing the
genomes of diverse species. This way, they expose themselves to
technical artifacts that might generate patterns mistaken to be of
biological relevance whereas in reality they just reflect the difference
in genome assembly quality. Considering the quality of genome assemblies
when downloading the genomic sequences will help researchers to avoid
these pitfalls.
The summary statistics include:
genome_size_mbp
: Genome size in mega base pairsn50_mbp
: The N50 contig size of the genome assembly in mega base pairsn_seqs
: The number of chromosomes/scaffolds/contigs of the genome assembly filen_nnn
: The absolute number of NNNs (over all chromosomes or scaffolds or contigs) in the genome assembly filerel_nnn
: The percentage (relative frequency) of NNNs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly filegenome_entropy
: The Shannon Entropy of the genome assembly file (median entropy over all individual chromosome entropies)n_gc
: The total number of GCs (over all chromosomes or scaffolds or contigs) in the genome assembly filerel_gc
: The (relative frequency) of GCs (over all chromosomes or scaffolds or contigs) compared to the total number of nucleotides in the genome assembly file
Use taxonomy
id for genome retrieval
Alternatively, instead of specifying the scientific name in the
argument organism
users can specify the
taxonomy
id of the corresponding organism. Here, we specify
the taxonomy id 4932
which encodes the species
Saccharomyces cerevisiae
.
# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
db = "ensembl",
organism = "4932") %>%
read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 17
width seq names
[1] 230218 CCACACCACACCCA...TGGTGTGTGTGGG I dna:chromosome ...
[2] 813184 AAATAGCCCTCATG...TGTGTGGGTGTGT II dna:chromosome...
[3] 316620 CCCACACACCACAC...TGTGGTGTGTGTG III dna:chromosom...
[4] 1531933 ACACCACACCCACA...AAGTAGCTTTTGG IV dna:chromosome...
[5] 576874 CGTCTCCTCCAAGC...CATTTTTTTTTTT V dna:chromosome ...
... ... ...
[13] 924431 CCACACACACACCA...TGTGTGTGTGGGG XIII dna:chromoso...
[14] 784333 CCGGCTTTCTGACC...GTGTGGTGTGGGT XIV dna:chromosom...
[15] 1091291 ACACCACACCCACA...GGGTGTGGTGTGT XV dna:chromosome...
[16] 948066 AAATAGCCCTCATG...TTTCGGTCAGAAA XVI dna:chromosom...
[17] 85779 TTCATAATTAATTT...TATAATATCCATA Mito dna:chromoso...
Use assembly_accession
id for genome retrieval
Alternatively, instead of specifying the scientific name or taxonomy
in the argument organism
users can specify the
assembly_accession
id of the corresponding organism. Here,
we specify the assembly_accession
id
GCA_000146045.2
which encodes the species
Saccharomyces cerevisiae
.
# install.packages("magrittr")
library(magrittr)
# import genome as Biostrings object
Scerevisiae_Genome <- getGenome(
db = "ensembl",
organism = "GCA_000146045.2") %>%
read_genome()
Scerevisiae_Genome
A DNAStringSet instance of length 17
width seq names
[1] 230218 CCACACCACACCCACA...TGTGGTGTGTGTGGG I dna:chromosome ...
[2] 813184 AAATAGCCCTCATGTA...GGTGTGTGGGTGTGT II dna:chromosome...
[3] 316620 CCCACACACCACACCC...GGTGTGGTGTGTGTG III dna:chromosom...
[4] 1531933 ACACCACACCCACACC...GTAAGTAGCTTTTGG IV dna:chromosome...
[5] 439888 CACACACACCACACCC...GTGTGGTGTGTGTGT IX dna:chromosome...
... ... ...
[13] 1078177 CACACACACACACCAC...ACATGAGGGCTATTT XII dna:chromosom...
[14] 924431 CCACACACACACCACA...GGTGTGTGTGTGGGG XIII dna:chromoso...
[15] 784333 CCGGCTTTCTGACCGA...GGGTGTGGTGTGGGT XIV dna:chromosom...
[16] 1091291 ACACCACACCCACACC...GTGGGTGTGGTGTGT XV dna:chromosome...
[17] 948066 AAATAGCCCTCATGTA...AATTTCGGTCAGAAA XVI dna:chromosom...
GenomeSet Retrieval
The getGenomeSet()
function enables users to retrieve
genome files for multiple species. This is convenient when users wish to
bulk download a particular set of species. Internally, a folder named
set_genomes
is generated in which genomes and genome info
files are stored. In addition, users can specify the arguments:
clean_retrieval
and gunzip
(both are
TRUE
by default) to clean file names and automatically
unzip downloaded files.
Example:
# specify the species names
download_species <- c("Arabidopsis thaliana",
"Arabidopsis lyrata",
"Capsella rubella")
# retrieve these three species from NCBI RefSeq
biomartr::getGenomeSet("refseq", organisms = download_species, path = "set_genomes")
If the download process was interrupted, users can re-run the
function and it will only download missing genomes. In cases users wish
to download everything again and updating existing genomes, they may
specify the argument update = TRUE
.
Proteome Retrieval
The getProteome()
function is an interface function to
the NCBI
RefSeq, NCBI
Genbank, ENSEMBL,
and UniProt databases from which
corresponding proteomes can be retrieved. It works analogous to
getGenome()
.
The db
argument specifies from which database proteomes
in *.fasta
file format shall be retrieved.
Options are:
-
db = "refseq"
for retrieval from NCBI RefSeq -
db = "genbank"
for retrieval from NCBI Genbank -
db = "ensembl"
for retrieval from ENSEMBL -
db = "uniprot"
for retrieval from UniProt
Furthermore, again users need to specify the scientific name of the
organism of interest for which a proteomes shall be downloaded,
e.g. organism = "Homo sapiens"
. Finally, the
path
argument specifies the folder path in which the
corresponding proteome shall be locally stored. In case users would like
to store the proteome file at a different location, they can specify the
path = file.path("put","your","path","here")
argument.
Example NCBI RefSeq
:
# download the proteome of Homo sapiens from refseq
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
HS.proteome.refseq <- getProteome( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","proteomes"))
In this example, getProteome()
creates a directory named
'_ncbi_downloads/proteomes'
into which the corresponding
genome named GCF_000001405.34_GRCh38.p8_protein.faa.gz
is
downloaded. The return value of getProteome()
is the folder
path to the downloaded proteome file that can then be used as input to
the read_proteome()
function. The variable
HS.proteome.refseq
stores the path to the downloaded
proteome. Subsequently, users can use the read_proteome()
function to import the proteome into the R session. Users can choose to
work with the proteome sequence in R either as Biostrings
object (obj.type = "Biostrings"
) or data.table
object (obj.type = "data.table"
) by specifying the
obj.type
argument of the read_proteome()
function.
# import proteome as Biostrings object
Human_Proteome <- read_proteome(file = HS.proteome.refseq)
Human_Proteome
A AAStringSet instance of length 113620
width seq names
[1] 1474 MGKNKLLHPSLVL...YNAPCSKDLGNA NP_000005.2 alpha...
[2] 290 MDIEAYFERIGYK...LVPKPGDGSLTI NP_000006.2 aryla...
[3] 421 MAAGFGRCCRVLR...IVAREHIDKYKN NP_000007.1 mediu...
[4] 412 MAAALLARASGPA...VIAGHLLRSYRS NP_000008.1 short...
[5] 655 MQAARMAASLGRQ...RGGVVTSNPLGF NP_000009.1 very ...
... ... ...
[113616] 98 MPLIYMNIMLAFT...LDYVHNLNLLQC YP_003024034.1 NA...
[113617] 459 MLKLIVPTIMLLP...SLNPDIITGFSS YP_003024035.1 NA...
[113618] 603 MTMHTTMTTLTLT...FFPLILTLLLIT YP_003024036.1 NA...
[113619] 174 MMYALFLLSVGLV...GVYIVIEIARGN YP_003024037.1 NA...
[113620] 380 MTPMRKTNPLMKL...ISLIENKMLKWA YP_003024038.1 cy...
Alternatively, users can perform the pipeline logic of the magrittr package:
# install.packages("magrittr")
library(magrittr)
# import proteome as Biostrings object
Human_Proteome <- getProteome( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","proteomes")) %>%
read_proteome()
Human_Proteome
A AAStringSet instance of length 113620
width seq names
[1] 1474 MGKNKLLHPSLVL...YNAPCSKDLGNA NP_000005.2 alpha...
[2] 290 MDIEAYFERIGYK...LVPKPGDGSLTI NP_000006.2 aryla...
[3] 421 MAAGFGRCCRVLR...IVAREHIDKYKN NP_000007.1 mediu...
[4] 412 MAAALLARASGPA...VIAGHLLRSYRS NP_000008.1 short...
[5] 655 MQAARMAASLGRQ...RGGVVTSNPLGF NP_000009.1 very ...
... ... ...
[113616] 98 MPLIYMNIMLAFT...LDYVHNLNLLQC YP_003024034.1 NA...
[113617] 459 MLKLIVPTIMLLP...SLNPDIITGFSS YP_003024035.1 NA...
[113618] 603 MTMHTTMTTLTLT...FFPLILTLLLIT YP_003024036.1 NA...
[113619] 174 MMYALFLLSVGLV...GVYIVIEIARGN YP_003024037.1 NA...
[113620] 380 MTPMRKTNPLMKL...ISLIENKMLKWA YP_003024038.1 cy...
Example NCBI Genbank
:
# download the proteome of Homo sapiens from genbank
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
HS.proteome.genbank <- getProteome( db = "genbank",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","proteomes"))
# import proteome as Biostrings object
Human_Proteome <- read_proteome(file = HS.proteome.genbank)
Human_Proteome
A AAStringSet instance of length 13
width seq names
[1] 318 MPMANLLLLIVPILI...VSMPITISSIPPQT AAB58943.1 NADH d...
[2] 347 MNPLAQPVIYSTIFA...TLLLPISPFMLMIL AAB58944.1 NADH d...
[3] 513 MFADRWLFSTNHKDI...PPYHTFEEPVYMKS AAB58945.1 cytoch...
[4] 227 MAHAAQVGLQDATSP...IPLKIFEMGPVFTL AAB58946.1 cytoch...
[5] 68 MPQLNTTVWPTMITP...WTKICSLHSLPPQS AAB58947.1 ATPase...
... ... ...
[9] 98 MPLIYMNIMLAFTIS...YGLDYVHNLNLLQC AAB58951.1 NADH d...
[10] 459 MLKLIVPTIMLLPLT...LLSLNPDIITGFSS AAB58952.1 NADH d...
[11] 603 MTMHTTMTTLTLTSL...SFFFPLILTLLLIT AAB58953.1 NADH d...
[12] 174 MMYALFLLSVGLVMG...FVGVYIVIEIARGN AAB58954.1 NADH d...
[13] 380 MTPMRKTNPLMKLIN...PTISLIENKMLKWA AAB58955.3 cytoch...
Example ENSEMBL
:
# download the proteome of Homo sapiens from ENSEMBL
# and store the corresponding proteome file in '_ncbi_downloads/proteomes'
HS.proteome.ensembl <- getProteome( db = "ensembl",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","proteomes"))
# import proteome as Biostrings object
Human_Proteome <- read_proteome(file = HS.proteome.ensembl)
Human_Proteome
A AAStringSet instance of length 107844
width seq names
[1] 3 PSY ENSP00000451515.1...
[2] 4 TGGY ENSP00000452494.1...
[3] 2 EI ENSP00000451042.1...
[4] 4 GTGG ENSP00000487941.1...
[5] 4 GTGG ENSP00000488240.1...
... ... ...
[107840] 135 MLQKKIEEKDLKV...LNHICKVPLAIK ENSP00000495237.1...
[107841] 166 MEHAFTPLEPLLS...IIQEESLIYLLQ ENSP00000496198.1...
[107842] 42 MEVPTAYMISPKE...GLKSENTMLLRC ENSP00000495723.1...
[107843] 508 MPSMLERISKNLV...GTLSLLQQLAEA ENSP00000496548.1...
[107844] 508 MPSMLERISKNLV...GTLSLLQQLAEA ENSP00000494855.1...
Users can also specify the release
argument which
denotes the database release version of ENSEMBL
(db = "ensembl"
) in case they wish to download archived
vgenome versions. Default is release = NULL meaning that the most recent
database version is used.
Example Retrieval Uniprot
:
Another way of retrieving proteome sequences is from the UniProt database.
# download the proteome of Mus musculus from UniProt
# and store the corresponding proteome file in '_uniprot_downloads/proteomes'
Mm.proteome.uniprot<- getProteome( db = "uniprot",
organism = "Mus musculus",
path = file.path("_uniprot_downloads","proteomes"))
# import proteome as Biostrings object
Mouse_Proteome <- read_proteome(file = Mm.proteome.uniprot)
Mouse_Proteome
A AAStringSet instance of length 84522
width seq names
[1] 781 MATQADLMELDMAMEPD...LPPGDSNQLAWFDTDL sp|Q02248|CTNB1_M...
[2] 2531 MPRLLTPLLCLTLLPAL...PTTMPSQITHIPEAFK sp|Q01705|NOTC1_M...
[3] 437 MLLLLARCFLVILASSL...LDSETMHPLGMAVKSS sp|Q62226|SHH_MOU...
[4] 380 MKKPIGILSPGVALGTA...VKCKKCTEIVDQFVCK sp|P22725|WNT5A_M...
[5] 387 MEESQSDISLELPLSQE...SRHKKTMVKKVGPDSD sp|P02340|P53_MOU...
... ... ...
[84518] 459 MLKIILPSLMLLPLTWL...LILLTTSPKLITGLTM tr|A0A0F6PZ84|A0A...
[84519] 380 MTNMRKTHPLFKIINHS...LMPISGIIEDKMLKLY tr|U3TEV9|U3TEV9_...
[84520] 381 MTNMRKTHPLFKIINHS...MPISGIIEDKMLKLYP tr|A0A0F6PXN8|A0A...
[84521] 172 MNNYIFVLSSLFLVGCL...WSLFAGIFIIIEITRD tr|A0A0F6PXK9|A0A...
[84522] 381 MTNMRKTHPLFKIINHS...MPISGIIEDKMLKLYP tr|A0A0F6PYR5|A0A...
ProteomeSet Retrieval
The getProteomeSet()
function enables users to retrieve
proteome files for multiple species. This is convenient when users wish
to bulk download a particular set of species. Internally, a folder named
set_proteomes
is generated in which proteomes and proteome
info files are stored. In addition, users can specify the arguments:
clean_retrieval
and gunzip
(both are
TRUE
by default) to clean file names and automatically
unzip downloaded files.
Example:
# specify the species names
download_species <- c("Arabidopsis thaliana",
"Arabidopsis lyrata",
"Capsella rubella")
# retrieve these three species from NCBI RefSeq
biomartr::getProteomeSet("refseq", organisms = download_species, path = "set_proteomes")
If the download process was interrupted, users can re-run the
function and it will only download missing genomes. In cases users wish
to download everything again and updating existing genomes, they may
specify the argument update = TRUE
.
CDS Retrieval
The getCDS()
function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL databases from
which corresponding CDS files can be retrieved. It works analogous to
getGenome()
and getProteome()
.
The db
argument specifies from which database proteomes
in *.fasta
file format shall be retrieved.
Options are:
-
db = "refseq"
for retrieval from NCBI RefSeq -
db = "genbank"
for retrieval from NCBI Genbank -
db = "ensembl"
for retrieval from ENSEMBL
Furthermore, again users need to specify the scientific name of the
organism of interest for which a proteomes shall be downloaded,
e.g. organism = "Homo sapiens"
. Finally, the
path
argument specifies the folder path in which the
corresponding CDS file shall be locally stored. In case users would like
to store the CDS file at a different location, they can specify the
path = file.path("put","your","path","here")
argument.
Example NCBI RefSeq
:
# download the genome of Homo sapiens from refseq
# and store the corresponding genome CDS file in '_ncbi_downloads/CDS'
HS.cds.refseq <- getCDS( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","CDS"))
In this example, getCDS()
creates a directory named
'_ncbi_downloads/CDS'
into which the corresponding genome
named Homo_sapiens_cds_from_genomic_refseq.fna.gz
is
downloaded. The return value of getCDS()
is the folder path
to the downloaded genome file that can then be used as input to the
read_cds()
function. The variable
HS.cds.refseq
stores the path to the downloaded CDS file.
Subsequently, users can use the read_cds()
function to
import the genome into the R session. Users can choose to work with the
genome sequence in R either as Biostrings
object (obj.type = "Biostrings"
) or data.table
object (obj.type = "data.table"
) by specifying the
obj.type
argument of the read_cds()
function.
# import downloaded CDS as Biostrings object
Human_CDS <- read_cds(file = HS.cds.refseq,
obj.type = "Biostrings")
# look at the Biostrings object
Human_CDS
A BStringSet instance of length 114967
width seq names
[1] 918 ATGGTGACTGAATTCATTTTTCTG...CACATTCTAGTGTAAAGTTTTAG lcl|NC_000001.11_...
[2] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
[3] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
[4] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
[5] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
... ... ...
[114963] 297 ATGCCCCTCATTTACATAAATATT...ACCTAAACCTACTCCAATGCTAA lcl|NC_012920.1_c...
[114964] 1378 ATGCTAAAACTAATCGTCCCAACA...CATCATTACCGGGTTTTCCTCTT lcl|NC_012920.1_c...
[114965] 1812 ATAACCATGCACACTACTATAACC...TAACCCTACTCCTAATCACATAA lcl|NC_012920.1_c...
[114966] 525 ATGATGTATGCTTTGTTTCTGTTG...TTGAGATTGCTCGGGGGAATAGG lcl|NC_012920.1_c...
[114967] 1141 ATGACCCCAATACGCAAAACTAAC...AAACAAAATACTCAAATGGGCCT lcl|NC_012920.1_c...
Internally, a text file named
doc_Homo_sapiens_db_refseq.txt
is generated. The
information stored in this log file is structured as follows:
File Name: Homo_sapiens_cds_from_genomic_refseq.fna.gz
Organism Name: Homo_sapiens
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/
GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_cds_from_genomic.fna.gz
Download_Date: Sun Oct 23 17:19:05 2016
refseq_category: reference genome
assembly_accession: GCF_000001405.35
bioproject: PRJNA168
biosample: NA
taxid: 9606
infraspecific_name: NA
version_status: latest
release_type: Patch
genome_rep: Full
seq_rel_date: 2016-09-26
submitter: Genome Reference Consortium
In summary, the getCDS()
and read_cds()
functions allow users to retrieve CDS files by specifying the scientific
name of the organism of interest and allow them to import the retrieved
CDS file e.g. as Biostrings
object. Thus, users can then
perform the Biostrings notation
to work with downloaded CDS
and can rely on the log file generated by getCDS()
to
better document the source and version of CDS used for subsequent
studies.
Alternatively, users can perform the pipeline logic of the magrittr package:
# install.packages("magrittr")
library(magrittr)
# import CDS as Biostrings object
Human_CDS <- getCDS( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","CDS")) %>%
read_cds(obj.type = "Biostrings")
Human_CDS
A BStringSet instance of length 114967
width seq names
[1] 918 ATGGTGACTGAATTCATTTTTCTG...CACATTCTAGTGTAAAGTTTTAG lcl|NC_000001.11_...
[2] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
[3] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
[4] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
[5] 402 ATGAGTGACAGCATCAACTTCTCT...CAGGACCCAGGCACAGGCATTAG lcl|NC_000001.11_...
... ... ...
[114963] 297 ATGCCCCTCATTTACATAAATATT...ACCTAAACCTACTCCAATGCTAA lcl|NC_012920.1_c...
[114964] 1378 ATGCTAAAACTAATCGTCCCAACA...CATCATTACCGGGTTTTCCTCTT lcl|NC_012920.1_c...
[114965] 1812 ATAACCATGCACACTACTATAACC...TAACCCTACTCCTAATCACATAA lcl|NC_012920.1_c...
[114966] 525 ATGATGTATGCTTTGTTTCTGTTG...TTGAGATTGCTCGGGGGAATAGG lcl|NC_012920.1_c...
[114967] 1141 ATGACCCCAATACGCAAAACTAAC...AAACAAAATACTCAAATGGGCCT lcl|NC_012920.1_c...
Example NCBI Genbank
:
# download the genome of Homo sapiens from genbank
# and store the corresponding genome CDS file in '_ncbi_downloads/CDS'
HS.cds.genbank <- getCDS( db = "genbank",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","CDS"))
# import downloaded CDS as Biostrings object
Human_CDS <- read_cds(file = HS.cds.genbank,
obj.type = "Biostrings")
# look at the Biostrings object
Human_CDS
A BStringSet instance of length 13
width seq names
[1] 956 ATACCCATGGCCAACCTCCTACTCCT...ATCTCCAGCATTCCCCCTCAAACCTA lcl|J01415.2_cds_...
[2] 1042 ATTAATCCCCTGGCCCAACCCGTCAT...CTCCCCTTTTATACTAATAATCTTAT lcl|J01415.2_cds_...
[3] 1542 ATGTTCGCCGACCGTTGACTATTCTC...AAGAACCCGTATACATAAAATCTAGA lcl|J01415.2_cds_...
[4] 684 ATGGCACATGCAGCGCAAGTAGGTCT...AAATAGGGCCCGTATTTACCCTATAG lcl|J01415.2_cds_...
[5] 207 ATGCCCCAACTAAATACTACCGTATG...TTCATTCATTGCCCCCACAATCCTAG lcl|J01415.2_cds_...
... ... ...
[9] 297 ATGCCCCTCATTTACATAAATATTAT...ATAACCTAAACCTACTCCAATGCTAA lcl|J01415.2_cds_...
[10] 1378 ATGCTAAAACTAATCGTCCCAACAAT...CGACATCATTACCGGGTTTTCCTCTT lcl|J01415.2_cds_...
[11] 1812 ATAACCATGCACACTACTATAACCAC...TCCTAACCCTACTCCTAATCACATAA lcl|J01415.2_cds_...
[12] 525 ATGATGTATGCTTTGTTTCTGTTGAG...TAATTGAGATTGCTCGGGGGAATAGG lcl|J01415.2_cds_...
[13] 1141 ATGACCCCAATACGCAAAACTAACCC...TGAAAACAAAATACTCAAATGGGCCT lcl|J01415.2_cds_...
Example ENSEMBL
:
# download the genome of Homo sapiens from ensembl
# and store the corresponding genome CDS file in '_ncbi_downloads/CDS'
HS.cds.ensembl <- getCDS( db = "ensembl",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","CDS"))
# import downloaded CDS as Biostrings object
Human_CDS <- read_cds(file = HS.cds.ensembl,
obj.type = "Biostrings")
# look at the Biostrings object
Human_CDS
A BStringSet instance of length 102915
width seq names
[1] 13 ACTGGGGGATACG ENST00000448914.1...
[2] 12 GGGACAGGGGGC ENST00000631435.1...
[3] 12 GGGACAGGGGGC ENST00000632684.1...
[4] 9 CCTTCCTAC ENST00000434970.2...
[5] 8 GAAATAGT ENST00000415118.1...
... ... ...
[102911] 1665 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000436984.6...
[102912] 1920 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000439889.6...
[102913] 2094 ATGCTACTGCCACTGCTGCTGTCC...ATGCAGAAGTCAAGTTCCAATGA ENST00000339313.9...
[102914] 466 ATGCTACTGCCACTGCTGCTGTCC...AGCCCTGGACCTCTCTGTGCAGT ENST00000529627.1...
[102915] 559 ATGCGGAGATGCTACTGCCACTGC...CCTCACCTGCCATGTGGACTTCT ENST00000530476.1...
Users can also specify the release
argument which
denotes the database release version of ENSEMBL
(db = "ensembl"
) in case they wish to download archived
vgenome versions. Default is release = NULL meaning that the most recent
database version is used.
CDSSet Retrieval
The getCDSSet()
function enables users to retrieve CDS
files for multiple species. This is convenient when users wish to bulk
download a particular set of species. Internally, a folder named
set_cds
is generated in which CDS and CDS info files are
stored. In addition, users can specify the arguments:
clean_retrieval
and gunzip
(both are
TRUE
by default) to clean file names and automatically
unzip downloaded files.
Example:
# specify the species names
download_species <- c("Arabidopsis thaliana",
"Arabidopsis lyrata",
"Capsella rubella")
# retrieve these three species from NCBI RefSeq
biomartr::getCDSSet("refseq", organisms = download_species, path = "set_cds")
If the download process was interrupted, users can re-run the
function and it will only download missing genomes. In cases users wish
to download everything again and updating existing genomes, they may
specify the argument update = TRUE
.
RNA Retrieval
The getRNA()
function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL databases from
which corresponding RNA files can be retrieved. It works analogous to
getGenome()
, getProteome()
, and
getCDS()
.
The db
argument specifies from which database proteomes
in *.fasta
file format shall be retrieved.
Options are:
-
db = "refseq"
for retrieval from NCBI RefSeq -
db = "genbank"
for retrieval from NCBI Genbank -
db = "ensembl"
for retrieval from ENSEMBL
Furthermore, again users need to specify the scientific name of the
organism of interest for which a proteomes shall be downloaded,
e.g. organism = "Homo sapiens"
. Finally, the
path
argument specifies the folder path in which the
corresponding RNA file shall be locally stored. In case users would like
to store the RNA file at a different location, they can specify the
path = file.path("put","your","path","here")
argument.
Example NCBI RefSeq
:
# download the RNA of Homo sapiens from refseq
# and store the corresponding RNA file in '_ncbi_downloads/RNA'
HS.rna.refseq <- getRNA( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","RNA"))
In this example, getRNA()
creates a directory named
'_ncbi_downloads/RNA'
into which the corresponding RNA file
named Homo_sapiens_rna_from_genomic_refseq.fna.gz
is
downloaded. The return value of getRNA()
is the folder path
to the downloaded genome file that can then be used as input to the
read_rna()
function. The variable
HS.rna.refseq
stores the path to the downloaded RNA file.
Subsequently, users can use the read_cds()
function to
import the genome into the R session. Users can choose to work with the
genome sequence in R either as Biostrings
object (obj.type = "Biostrings"
) or data.table
object (obj.type = "data.table"
) by specifying the
obj.type
argument of the read_rna()
function.
# import downloaded RNA as Biostrings object
Human_rna <- read_rna(file = HS.rna.refseq,
obj.type = "Biostrings")
# look at the Biostrings object
Human_rna
A BStringSet instance of length 164136
width seq names
[1] 1652 CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGT...CACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG lcl|NC_000001.11_...
[2] 1769 TCCGGCAGAGCGGAAGCGGCGGCGGGAGCTTCCGGGAGGGCGG...ACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA lcl|NC_000001.11_...
[3] 68 TGTGGGAGAGGAACATGGGCTCAGGACAGCGGGTGTCAGCTTGCCTGACCCCCATGTCGCCTCTGTAG lcl|NC_000001.11_...
[4] 23 TGACCCCCATGTCGCCTCTGTAG lcl|NC_000001.11_...
[5] 23 GAGAGGAACATGGGCTCAGGACA lcl|NC_000001.11_...
... ... ...
[164132] 59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA lcl|NC_012920.1_t...
[164133] 71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA lcl|NC_012920.1_t...
[164134] 69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA lcl|NC_012920.1_t...
[164135] 66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA lcl|NC_012920.1_t...
[164136] 68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA lcl|NC_012920.1_t...
Internally, a text file named
doc_Homo_sapiens_db_refseq.txt
is generated. The
information stored in this log file is structured as follows:
File Name: Homo_sapiens_rna_from_genomic_refseq.fna.gz
Organism Name: Homo_sapiens
Database: NCBI refseq
URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_rna_from_genomic.fna.gz
Download_Date: Wed Mar 15 16:46:45 2017
refseq_category: reference genome
assembly_accession: GCF_000001405.36
bioproject: PRJNA168
biosample: NA
taxid: 9606
infraspecific_name: NA
version_status: latest
release_type: Patch
genome_rep: Full
seq_rel_date: 2017-01-06
submitter: Genome Reference Consortium
In summary, the getRNA()
and read_rna()
functions allow users to retrieve RNA files by specifying the scientific
name of the organism of interest and allow them to import the retrieved
RNA file e.g. as Biostrings
object. Thus, users can then
perform the Biostrings notation
to work with downloaded RNA
and can rely on the log file generated by getRNA()
to
better document the source and version of RNA used for subsequent
studies.
Alternatively, users can perform the pipeline logic of the magrittr package:
# install.packages("magrittr")
library(magrittr)
# import RNA as Biostrings object
Human_rna <- getRNA( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","RNA")) %>%
read_cds(obj.type = "Biostrings")
Human_rna
A BStringSet instance of length 164136
width seq names
[1] 1652 CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGT...CACAGCTAGAGATCCTTTATTAAAAGCACACTGTTGGTTTCTG lcl|NC_000001.11_...
[2] 1769 TCCGGCAGAGCGGAAGCGGCGGCGGGAGCTTCCGGGAGGGCGG...ACCAACAGTGTGCTTTTAATAAAGGATCTCTAGCTGTGCAGGA lcl|NC_000001.11_...
[3] 68 TGTGGGAGAGGAACATGGGCTCAGGACAGCGGGTGTCAGCTTGCCTGACCCCCATGTCGCCTCTGTAG lcl|NC_000001.11_...
[4] 23 TGACCCCCATGTCGCCTCTGTAG lcl|NC_000001.11_...
[5] 23 GAGAGGAACATGGGCTCAGGACA lcl|NC_000001.11_...
... ... ...
[164132] 59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA lcl|NC_012920.1_t...
[164133] 71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA lcl|NC_012920.1_t...
[164134] 69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA lcl|NC_012920.1_t...
[164135] 66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA lcl|NC_012920.1_t...
[164136] 68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA lcl|NC_012920.1_t...
Example NCBI Genbank
:
# download the RNA of Homo sapiens from genbank
# and store the corresponding genome RNA file in '_ncbi_downloads/RNA'
HS.rna.genbank <- getRNA( db = "genbank",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","RNA"))
# import downloaded RNA as Biostrings object
Human_rna <- read_cds(file = HS.rna.genbank,
obj.type = "Biostrings")
# look at the Biostrings object
Human_rna
A BStringSet instance of length 24
width seq names
[1] 71 GTTTATGTAGCTTACCTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACA lcl|J01415.2_trna...
[2] 954 AATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACA...AAGTCGTAACATGGTAAGTGTACTGGAAAGTGCACTTGGACGAAC lcl|J01415.2_rrna...
[3] 69 CAGAGTGTAGCTTAACACAAAGCACCCAACTTACACTTAGGAGATTTCAACTTAACTTGACCGCTCTGA lcl|J01415.2_trna...
[4] 1559 GCTAAACCTAGCCCCAAACCCACTCCACCTTACTACCAGACAACCT...ATCTCAACTTAGTATTATACCCACACCCACCCAAGAACAGGGTTT lcl|J01415.2_rrna...
[5] 75 GTTAAGATGGCAGAGCCCGGTAATCGCATAAAACTTAAAACTTTACAGTCAGAGGTTCAATTCCTCTTCTTAACA lcl|J01415.2_trna...
... ... ...
[20] 59 GAGAAAGCTCACAAGAACTGCTAACTCATGCCCCCATGTCTAACAACATGGCTTTCTCA lcl|J01415.2_trna...
[21] 71 ACTTTTAAAGGATAACAGCTATCCATTGGTCTTAGGCCCCAAAAATTTTGGTGCAACTCCAAATAAAAGTA lcl|J01415.2_trna...
[22] 69 GTTCTTGTAGTTGAAATACAACGATGGTTTTTCATATCATTGGTCGTGGTTGTAGTCCGTGCGAGAATA lcl|J01415.2_trna...
[23] 66 GTCCTTGTAGTATAAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACA lcl|J01415.2_trna...
[24] 68 CAGAGAATAGTTTAAATTAGAATCTTAGCTTTGGGTGCTAATGGTGGAGTTAAAGACTTTTTCTCTGA lcl|J01415.2_trna...
Example ENSEMBL
:
# download the RNA of Homo sapiens from ensembl
# and store the corresponding genome RNA file in '_ncbi_downloads/RNA'
HS.rna.ensembl <- getRNA( db = "ensembl",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","RNA"))
# import downloaded RNA as Biostrings object
Human_rna <- read_cds(file = HS.rna.ensembl,
obj.type = "Biostrings")
# look at the Biostrings object
Human_rna
A BStringSet instance of length 36701
width seq names
[1] 104 GTGCTCACTTTGGCAACATACATACTAAAATTGGACGGATACAG...GCACAAGGATGACATGCAAATTCATGAAGCATTCCATATTTTT ENST00000516494.2...
[2] 164 TTAACTACCTGACAGAGGAGATACTGTGATCATGAAAGTGGTTT...CATAATTTCTGGTGGTAGGGGACTGCGTTCATGTTCTCCCCTA ENST00000627793.1...
[3] 114 ACACTGGTTTCTCTTCAGATCGAATAAATCTTTCGCCTTTTACT...AGTTATAAGCTAATTTTTTGTAAGCCTTGCCCTGGGGAGGCAG ENST00000629478.1...
[4] 64 CAGTGTTACACCTCTTTTAGAATTTATCTATCAGGTTTTCCAGTGTTCACTGAAATTTGTCTCT ENST00000629187.1...
[5] 107 GTGTTGGCCTGGGCAGCACGTATACTAAAGTTGGAATGACACAG...GCGCAAGGATGATGTGCAAATTCGTGACAAGTTCCATATTTTT ENST00000631612.1...
... ... ...
[36697] 90 GGCTAGTCCAAATGTAGTGGTGTTCCAAACTAATTAATCACAACCAGTTACAGATTTCTTGTTTTCTTTTCCACTCACACTTAGCCTTAA ENST00000410951.1...
[36698] 109 GGCTGATCTGAAGATGATGAGTTATCTCAATTGATTGTTCAGCC...TCTATTCTTTCCTCTCTTCTCACTACTGCACTTGGCTAGGAAA ENST00000410462.2...
[36699] 109 GGTGGGTCCAAAGGTAGCAAGTTATCTCAATTGATCACAGTCAG...CATTCTATCACCCCTTCTCATTACTGTACTTGACTAGTCTTTT ENST00000364501.1...
[36700] 293 GGATATGAGGGCGATCTGGCAGGGACATCTGTCACCCCACTGAT...AAAATTAGCTGGGCATAGTGGCGTGCACCTGTCGTCCTAGCTA ENST00000365097.1...
[36701] 172 TTGAAGGCGTGGAGACTGAAGTCCTCTCTATATCCACAGAACAG...AAGAGGGCTGTTCAGTCTCCATGCCCTTCAATCCTTGGCTACT ENST00000615087.1...
Users can also specify the release
argument which
denotes the database release version of ENSEMBL
(db = "ensembl"
) in case they wish to download archived
vgenome versions. Default is release = NULL meaning that the most recent
database version is used.
RNASet Retrieval
The getRNASet()
function enables users to retrieve RNA
files for multiple species. This is convenient when users wish to bulk
download a particular set of species. Internally, a folder named
set_rna
is generated in which RNA and RNA info files are
stored. In addition, users can specify the arguments:
clean_retrieval
and gunzip
(both are
TRUE
by default) to clean file names and automatically
unzip downloaded files.
Example:
# specify the species names
download_species <- c("Arabidopsis thaliana",
"Arabidopsis lyrata",
"Capsella rubella")
# retrieve these three species from NCBI RefSeq
biomartr::getRNASet("refseq", organisms = download_species, path = "set_rna")
If the download process was interrupted, users can re-run the
function and it will only download missing genomes. In cases users wish
to download everything again and updating existing genomes, they may
specify the argument update = TRUE
.
Retrieve the annotation file of a particular genome
Finally, users can download the corresponding annotation
.gff
files for particular genomes of interest using the
getGFF()
or alternatively for ensembl
the
getGTF()
function.
Example NCBI RefSeq
:
# download the GFF file of Homo sapiens from refseq
# and store the corresponding file in '_ncbi_downloads/annotation'
HS.gff.refseq <- getGFF( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","annotation"))
After downloading the .gff
file, users can import the
.gff
file with read_gff()
.
# import downloaded GFF file
Human_GFF <- read_gff(file = HS.gff.refseq)
Human_GFF
seqid source type start end score strand phase
<chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl>
1 NC_000001.11 RefSeq region 1 248956422 0 + 0
2 NC_000001.11 BestRefSeq gene 11874 14409 0 + 0
3 NC_000001.11 BestRefSeq transcript 11874 14409 0 + 0
4 NC_000001.11 BestRefSeq exon 11874 12227 0 + 0
5 NC_000001.11 BestRefSeq exon 12613 12721 0 + 0
6 NC_000001.11 BestRefSeq exon 13221 14409 0 + 0
7 NC_000001.11 BestRefSeq gene 14362 29370 0 - 0
8 NC_000001.11 BestRefSeq transcript 14362 29370 0 - 0
9 NC_000001.11 BestRefSeq exon 29321 29370 0 - 0
10 NC_000001.11 BestRefSeq exon 24738 24891 0 - 0
Removing corrupt lines from downloaded GFF files
In some cases, GFF
files stored at NCBI databases
include corrupt lines that have more than 65000 characters. This leads
to problems when trying to import such annotation files into R. To
overcome this issue users can specify the
remove_annotation_outliers = TRUE
argument to remove such
outlier lines and overwrite the downloaded annotation file. This will
make any downstream analysis with this annotation file much more
reliable.
Example:
Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)
Starting GFF retrieval of 'Arabidopsis thaliana' from refseq ...
|============================================| 100% 52 MB
File _ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz exists already. Thus, download has been skipped.
Importing '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz' ...
|============================================| 100% 434 MB
Reading annotation file '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz' and removing all outlier lines with number of characters greater 65000 ...
Overwriting '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz' with removed outlier lines ...
Unzipping file _ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff.gz' ...
The new annotation file was created and has been stored at '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_refseq.gff'.
The outlier lines were stored in a separate file to explore at '/var/folders/3x/6bbw6ds1039gpwny1m0hn8r80000gp/T//RtmpuVjnsC/Arabidopsis_thaliana_genomic_refseq.gff.gz_outlier_lines.txt'.
Example NCBI Genbank
:
# download the GFF file of Homo sapiens from genbank
# and store the corresponding file in '_ncbi_downloads/annotation'
HS.gff.genbank <- getGFF( db = "genbank",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","annotation"))
After downloading the .gff
file, users can import the
.gff
file with read_gff()
.
# import downloaded GFF file
Human_GFF <- read_gff(file = HS.gff.genbank)
# show all elements of the data.frame
# options(tibble.print_max = Inf)
Human_GFF
seqid source type start end score strand phase
<chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl>
1 CM000663.2 Genbank region 1 248956422 0 + 0
2 CM000663.2 Genbank centromere 122026460 125184587 0 + 0
3 KI270706.1 Genbank region 1 175055 0 + 0
4 KI270707.1 Genbank region 1 32032 0 + 0
5 KI270708.1 Genbank region 1 127682 0 + 0
6 KI270709.1 Genbank region 1 66860 0 + 0
7 KI270710.1 Genbank region 1 40176 0 + 0
8 KI270711.1 Genbank region 1 42210 0 + 0
9 KI270712.1 Genbank region 1 176043 0 + 0
10 KI270713.1 Genbank region 1 40745 0 + 0
Removing corrupt lines from downloaded GFF files
In some cases, GFF
files stored at NCBI databases
include corrupt lines that have more than 65000 characters. This leads
to problems when trying to import such annotation files into R. To
overcome this issue users can specify the
remove_annotation_outliers = TRUE
argument to remove such
outlier lines and overwrite the downloaded annotation file. This will
make any downstream analysis with this annotation file much more
reliable.
Example:
Ath_path <- biomartr::getGFF(db = "genbank",
organism = "Arabidopsis thaliana",
remove_annotation_outliers = TRUE)
Starting GFF retrieval of 'Arabidopsis thaliana' from genbank ...
Completed!
Now continue with species download ...
GFF download of Arabidopsis_thaliana is completed!
Checking md5 hash of file: _ncbi_downloads/annotation/Arabidopsis_thaliana_md5checksums.txt ...
The md5 hash of file '_ncbi_downloads/annotation/Arabidopsis_thaliana_md5checksums.txt' matches!
Importing '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff.gz' ...
Reading annotation file '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff.gz' and removing all outlier lines with number of characters greater 65000 ...
Overwriting '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff.gz' with removed outlier lines ...
Unzipping file _ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff.gz' ...
The new annotation file was created and has been stored at '_ncbi_downloads/annotation/Arabidopsis_thaliana_genomic_genbank.gff'.
The outlier lines were stored in a separate file to explore at '/var/folders/3x/6bbw6ds1039gpwny1m0hn8r80000gp/T//RtmpuVjnsC/Arabidopsis_thaliana_genomic_genbank.gff.gz_outlier_lines.txt'.
Example ENSEMBL
:
# download the GFF file of Homo sapiens from ENSEMBL
# and store the corresponding file in 'ensembl/annotation'
HS.gff.ensembl <- getGFF( db = "ensembl",
organism = "Homo sapiens",
path = file.path("ensembl","annotation"))
After downloading the .gff
file, users can import the
.gff
file with read_gff()
.
# import downloaded GFF file
Human_GFF <- read_gff(file = HS.gff.ensembl)
# show all elements of the data.frame
# options(tibble.print_max = Inf)
Human_GFF
seqid source type start end score strand phase
<int> <chr> <chr> <int> <int> <chr> <chr> <dbl>
1 1 GRCh38 chromosome 1 248956422 . . 0
2 1 . biological_region 10469 11240 1.3e+03 . 0
3 1 . biological_region 10650 10657 0.999 + 0
4 1 . biological_region 10655 10657 0.999 - 0
5 1 . biological_region 10678 10687 0.999 + 0
6 1 . biological_region 10681 10688 0.999 - 0
7 1 . biological_region 10707 10716 0.999 + 0
8 1 . biological_region 10708 10718 0.999 - 0
9 1 . biological_region 10735 10747 0.999 - 0
10 1 . biological_region 10737 10744 0.999 + 0
Users can also specify the release
argument which
denotes the database release version of ENSEMBL
(db = "ensembl"
) in case they wish to download archived
vgenome versions. Default is release = NULL meaning that the most recent
database version is used.
Removing corrupt lines from downloaded GFF files
In some cases, GFF
files stored at NCBI databases
include corrupt lines that have more than 65000 characters. This leads
to problems when trying to import such annotation files into R. To
overcome this issue users can specify the
remove_annotation_outliers = TRUE
argument to remove such
outlier lines and overwrite the downloaded annotation file. This will
make any downstream analysis with this annotation file much more
reliable.
Alternatively for getGTF()
:
# download the GTF file of Homo sapiens from ENSEMBL
# and store the corresponding file in 'ensembl/annotation'
HS.gtf.ensembl <- getGTF( db = "ensembl",
organism = "Homo sapiens",
path = file.path("ensembl","annotation"),
assembly_type = "primary_assembly")
Taken together, getGFF()
or getGTF()
in
combination with getGenome()
, getProteome()
,
getRNA()
and getCDS()
allows users to retrieve
the genome information together with the corresponding .gff
or gtf
annotation file to make sure that they both have the
same version and origin.
GFFSet Retrieval
The getGFFSet()
function enables users to retrieve GFF
files for multiple species. This is convenient when users wish to bulk
download a particular set of species. Internally, a folder named
set_gff
is generated in which GFF and GFF info files are
stored. In addition, users can specify the arguments:
clean_retrieval
and gunzip
(both are
TRUE
by default) to clean file names and automatically
unzip downloaded files.
Example:
# specify the species names
download_species <- c("Arabidopsis thaliana",
"Arabidopsis lyrata",
"Capsella rubella")
# retrieve these three species from NCBI RefSeq
biomartr::getGFFSet("refseq", organisms = download_species, path = "set_gff")
If the download process was interrupted, users can re-run the
function and it will only download missing genomes. In cases users wish
to download everything again and updating existing genomes, they may
specify the argument update = TRUE
.
In some cases, GFF
files stored at NCBI databases
include corrupt lines that have more than 65000 characters. This leads
to problems when trying to import such annotation files into R. To
overcome this issue users can specify the
remove_annotation_outliers = TRUE
argument to remove such
outlier lines and overwrite the downloaded annotation file. This will
make any downstream analysis with this annotation file much more
reliable.
Repeat Masker Retrieval
Repeat Masker is a tool
for screening DNA sequences for interspersed repeats and low complexity
DNA sequences. NCBI stores the Repeat Masker
for sevel
species in their databases and can be retrieved using
getRepeatMasker()
and imported via
read_rm()
.
Example NCBI RefSeq
:
# download repeat masker annotation file for Homo sapiens
Hsapiens_rm <- getRepeatMasker( db = "refseq",
organism = "Homo sapiens",
path = file.path("refseq","TEannotation"))
Now users can import the TE annotation file using
read_rm()
.
# import TE annotation file
Hsapiens_rm_import <- read_rm("refseq/TEannotation/Homo_sapiens_rm_refseq.out.gz")
# look at data
Hsapiens_rm_import
Genome Assembly Stats Retrieval
By specifying the scientific name of an organism of interest the corresponding genome assembly stats file storing the assembly statistics of the organism of interest can be downloaded and stored locally.
Example NCBI RefSeq
:
# download genome assembly stats file for Homo sapiens
Hsapiens_stats <- getAssemblyStats( db = "refseq",
organism = "Homo sapiens",
path = file.path("refseq","AssemblyStats"))
Now users can import the TE annotation file using
read_rm()
.
# import TE annotation file
Hsapiens_stats_import <- read_assemblystats(Hsapiens_stats)
# look at data
Hsapiens_stats_import
A tibble: 1,196 x 6
unit_name molecule_name molecule_type sequence_type statistic value
<chr> <chr> <chr> <chr> <chr> <int>
1 all all all all total-length NA
2 all all all all spanned-gaps 661
3 all all all all unspanned-gaps 349
4 all all all all region-count 317
5 all all all all scaffold-count 875
6 all all all all scaffold-N50 59364414
7 all all all all scaffold-L50 17
8 all all all all scaffold-N75 31699399
9 all all all all scaffold-N90 7557127
10 all all all all contig-count 1536
Example NCBI Genbank
:
# download genome assembly stats file for Homo sapiens
Hsapiens_stats <- getAssemblyStats( db = "genbank",
organism = "Homo sapiens",
path = file.path("genbank","AssemblyStats"))
Now users can import the TE annotation file using
read_rm()
.
# import TE annotation file
Hsapiens_stats_import <- read_assemblystats(Hsapiens_stats)
# look at data
Hsapiens_stats_import
A tibble: 1,196 x 6
unit_name molecule_name molecule_type sequence_type statistic value
<chr> <chr> <chr> <chr> <chr> <int>
1 all all all all total-length NA
2 all all all all spanned-gaps 661
3 all all all all unspanned-gaps 349
4 all all all all region-count 317
5 all all all all scaffold-count 875
6 all all all all scaffold-N50 59364414
7 all all all all scaffold-L50 17
8 all all all all scaffold-N75 31699399
9 all all all all scaffold-N90 7557127
10 all all all all contig-count 1536
Collection Retrieval
The automated retrieval of collections (= Genome, Proteome, CDS, RNA,
GFF, Repeat Masker, AssemblyStats) will make sure that the genome file
of an organism will match the CDS, proteome, RNA, GFF, etc file and was
generated using the same genome assembly version. One aspect of why
genomics studies fail in computational and biological reproducibility is
that it is not clear whether CDS, proteome, RNA, GFF, etc files used in
a proposed analysis were generated using the same genome assembly file
denoting the same genome assembly version. To avoid this seemingly
trivial mistake we encourage users to retrieve genome file collections
using the biomartr
function getCollection()
and attach the corresponding output as Supplementary Data to the
respective genomics study to ensure computational and biological
reproducibility.
By specifying the scientific name of an organism of interest a collection consisting of the genome file, proteome file, CDS file, RNA file, GFF file, Repeat Masker file, AssemblyStats file of the organism of interest can be downloaded and stored locally.
Example NCBI RefSeq
:
# download collection for Saccharomyces cerevisiae
getCollection( db = "refseq",
organism = "Saccharomyces cerevisiae",
path = file.path("refseq","Collections"))
Internally, the getCollection()
function will now
generate a folder named
refseq/Collection/Saccharomyces_erevisiae
and will store
all genome and annotation files for
Saccharomyces cerevisiae
in the same folder. In addition,
the exact genoem and annotation version will be logged in the
doc
folder.
Genome download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The genome of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_genomic_refseq.fna.gz'.
Starting proteome retrieval of 'Saccharomyces cerevisiae' from refseq ...
Proteome download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The proteome of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_protein_refseq.faa.gz' .
Starting CDS retrieval of 'Saccharomyces cerevisiae' from refseq ...
CDS download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The genomic CDS of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_cds_from_genomic_refseq.fna.gz' .
Starting GFF retrieval of 'Saccharomyces cerevisiae' from refseq ...
GFF download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The *.gff annotation file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_genomic_refseq.gff.gz'.
Starting RNA retrieval of 'Saccharomyces cerevisiae' from refseq ...
RNA download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The genomic RNA of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_rna_from_genomic_refseq.fna.gz' .
Starting Repeat Masker retrieval of 'Saccharomyces cerevisiae' from refseq ...
RepeatMasker download is completed!
Checking md5 hash of file: refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt ...
The md5 hash of file 'refseq/Collections/Saccharomyces_cerevisiae_md5checksums.txt' matches!
The Repeat Masker output file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_rm_refseq.out.gz'.
Starting assembly quality stats retrieval of 'Saccharomyces cerevisiae' from refseq ...
Genome assembly quality stats file download completed!
The assembly statistics file of 'Saccharomyces_cerevisiae' has been downloaded to 'refseq/Collections' and has been named 'Saccharomyces_cerevisiae_assembly_stats_refseq.txt'.
Collection retrieval finished successfully!
We retrieved the genome assembly and checked the annotation for 'Saccharomyces cerevisiae' (taxid: 559292, strain=S288C) from 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt' (accession: GCF_000146045.2, bioproject: PRJNA128, biosample: NA) using the biomartr R package (Drost and Paszkowski, 2017).
Users can now simply attach the output folder as supplementary data in their study and state in the materials sections. This way, computational and biological reproducibility can be standardized and indeed will become trivial in the context of genome version and annotation version.