Meta-Genome Retrieval
2024-12-12
Source:vignettes/MetaGenome_Retrieval.Rmd
NOTE: To make sure that you have a sufficiently stable (internet) connection between R and the respective databases, please set the default timeout setting on your local machine from 60 sec to at least 30000 sec before running any retrieval functions via:
options(timeout = 30000)
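If the longer timeout should only apply while retrieval functions run, the previous setting can be saved and restored afterwards. The following is a minimal base R sketch; the variable name old_opts is only illustrative.

```r
# Raise the download timeout (in seconds) and remember the previous value.
# options() invisibly returns the old settings as a named list.
old_opts <- options(timeout = max(30000, getOption("timeout")))

# The new timeout is now active for all subsequent downloads.
current_timeout <- getOption("timeout")

# ... run retrieval functions here ...

# Restore the previous timeout once retrieval has finished.
options(old_opts)
```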
Topics
- 1. Perform Meta-Genome Retrieval for Specific Kingdoms of Life
- 2. Retrieve groups or subgroups of species
- 3. Meta retrieval of genome assembly quality information
- 4. Retrieve data from metagenome projects such as the human gut metagenome project from NCBI Genbank
- 5. Retrieve Individual Genomes for all Species in the Tree of Life
Perform Meta-Genome Retrieval
The number of genome assemblies generated and stored in sequence
databases is growing exponentially every year. With the availability of
this growing amount of genomic data, meta-genomics studies are becoming
more and more popular. Comparing this bulk of genomes with thousands of
other genomes can reveal new structural patterns and evolutionary
insights. However, the first step in any meta-genomics study is the
retrieval of the genomes, proteomes, coding sequences or annotation
files that shall be compared and investigated. For this purpose, the
meta.retrieval() and meta.retrieval.all() functions allow users to
perform straightforward meta-genome retrieval of hundreds of genomes,
proteomes, CDS, etc. in R. In addition to the retrieved sequence
information, the meta.retrieval() and meta.retrieval.all() functions
generate a summary file which contains information about the genome
version, genome status, submitter, etc. for each organism to promote
computational and scientific reproducibility of the meta-genomics study
at hand. This summary file can, for example, be attached as
Supplementary Data to the respective study.
Getting Started
The meta.retrieval() and meta.retrieval.all() functions aim to simplify
the genome retrieval and computational reproducibility process for
meta-genomics studies. They allow users to download genomes, proteomes,
CDS, etc. either for species within a specific kingdom or subgroup of
life (meta.retrieval()) or for all species of all kingdoms
(meta.retrieval.all()). Before biomartr, users had to write shell
scripts to download the respective genomic data. However, since many
meta-genomics packages exist for the R programming language, I
implemented this functionality for easy integration into existing R
workflows and for easier reproducibility.
For example, the pipeline logic of the magrittr package can be used
with meta.retrieval() and meta.retrieval.all() as follows.
# download all mammalian vertebrate genomes, then apply ...
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome") %>% ...
Here ... denotes any subsequent meta-genomics analysis. Hence,
meta.retrieval() enables the pipeline methodology for meta-genomics.
Retrieve Genomic Sequences
To retrieve a list of all available kingdoms stored in the NCBI RefSeq,
NCBI Genbank, and ENSEMBL databases, users can consult the
getKingdoms() function, which returns all available kingdoms of life
for the corresponding database.
Example NCBI RefSeq:
getKingdoms(db = "refseq")
[1] "archaea" "bacteria" "fungi" "invertebrate"
[5] "plant" "protozoa" "vertebrate_mammalian" "vertebrate_other"
[9] "viral"
Example NCBI Genbank:
getKingdoms(db = "genbank")
[1] "archaea" "bacteria" "fungi"
[4] "invertebrate" "plant" "protozoa"
[7] "vertebrate_mammalian" "vertebrate_other"
In these examples, the difference between db = "refseq" and
db = "genbank" is that db = "genbank" does not store viral information.
Example ENSEMBL:
getKingdoms(db = "ensembl")
[1] "EnsemblVertebrates"
The ENSEMBL database does not differentiate between different kingdoms,
but specializes in storing high-quality reference genomes of diverse
biological disciplines.
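Because the available kingdoms differ between databases, it can help to check a kingdom name before starting a long download. The sketch below hard-codes the kingdom vectors from the outputs shown above instead of querying the databases, and check_kingdom() is a hypothetical helper that is not part of biomartr.

```r
# Kingdom names copied from the getKingdoms(db = "refseq") output shown above.
refseq_kingdoms <- c("archaea", "bacteria", "fungi", "invertebrate", "plant",
                     "protozoa", "vertebrate_mammalian", "vertebrate_other",
                     "viral")
# According to the output above, Genbank offers the same kingdoms except "viral".
genbank_kingdoms <- setdiff(refseq_kingdoms, "viral")

# Hypothetical helper: stop early if a kingdom is not offered by a database.
check_kingdom <- function(kingdom, available) {
  if (!kingdom %in% available) {
    stop("Kingdom '", kingdom, "' is not available; choose one of: ",
         paste(available, collapse = ", "))
  }
  invisible(kingdom)
}

# "viral" is listed for RefSeq, so this call passes silently.
check_kingdom("viral", refseq_kingdoms)

# "viral" is not listed for Genbank, so the error is caught and FALSE is stored.
viral_in_genbank <- tryCatch(
  check_kingdom("viral", genbank_kingdoms),
  error = function(e) FALSE
)
```

In a live session, the hard-coded vectors could be replaced by the result of getKingdoms() for the chosen database.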
Retrieval from NCBI RefSeq
Download all mammalian vertebrate genomes from NCBI RefSeq.
# download all mammalian vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome", reference = FALSE)
The argument kingdom specifies the kingdom selected with getKingdoms()
from which genomes of organisms shall be retrieved. The db argument
specifies the database from which the respective genomes shall be
downloaded. The argument type specifies that genome assembly files
shall be retrieved. The argument reference indicates whether or not a
genome shall be downloaded if it isn’t marked in the database as either
a reference genome or a representative genome. Options are:
- reference = FALSE (Default): all organisms (reference, representative, and non-representative genomes) are downloaded.
- reference = TRUE: only organisms whose genome is marked as either a reference or a representative genome are downloaded. Since most genomes are non-reference genomes, the user will retrieve far fewer organisms than are stored in the databases.
When running this command, all genomes are stored in a folder named
according to the kingdom (in this case vertebrate_mammalian).
Alternatively, users can specify the out.folder argument to define a
custom output folder path.
Internally, in this example meta.retrieval() will generate a folder
named vertebrate_mammalian in which the respective genomes will be
stored. In addition, the vertebrate_mammalian folder contains a folder
named documentation, which stores an individual documentation file for
each organism and a summary file which stores documentation for all
retrieved organisms. This summary file can then be used as
Supplementary Data in studies to promote computational reproducibility.
An example documentation file of an individual organism looks like this:
File Name: Mus_musculus_genomic_genbank.gff.gz
Organism Name: Mus_musculus
Database: NCBI genbank
URL: ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.7_GRCm38.p5/GCA_000001635.7_GRCm38.p5_genomic.gff.gz
Download_Date: Mon Nov 14 12:43:45 2016
refseq_category: reference genome
assembly_accession: GCA_000001635.7
bioproject: PRJNA20689
biosample: NA
taxid: 10090
infraspecific_name: NA
version_status: latest
release_type: Patch
genome_rep: Full
seq_rel_date: 2016-06-29
submitter: Genome Reference Consortium
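Since each documentation file consists of "key: value" lines, it can be read back into R with a few lines of base R. The following is a minimal sketch on a handful of lines copied from the example above; parse_doc() is a hypothetical helper, and it assumes every entry uses the ": " separator shown here.

```r
# A few lines copied from the example documentation file above.
doc_lines <- c(
  "File Name: Mus_musculus_genomic_genbank.gff.gz",
  "Organism Name: Mus_musculus",
  "Database: NCBI genbank",
  "Download_Date: Mon Nov 14 12:43:45 2016",
  "taxid: 10090"
)

# Hypothetical helper: split each line at the first ": " into key and value.
parse_doc <- function(lines) {
  pos <- regexpr(": ", lines, fixed = TRUE)
  keep <- pos > 0                      # ignore lines without a separator
  keys <- substr(lines[keep], 1, pos[keep] - 1)
  values <- substring(lines[keep], pos[keep] + 2)
  stats::setNames(values, keys)
}

doc <- parse_doc(doc_lines)
doc[["taxid"]]
```

For a real file, doc_lines could be obtained with readLines() on the respective file in the documentation folder.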
An example summary file of all organisms looks like this
(here we use 105 plant species as an example):
# A tibble: 105 x 16
file_name organism url database path refseq_category
<chr> <chr> <chr> <chr> <chr> <chr>
1 Aegilops_tau Aegilops_ ftp.ncbi.nlm.nih refseq Prot representative
2 Amborella_tr Amborella ftp.ncbi.nlm.nih refseq Prot representative
3 Ananas_comos Ananas_co ftp.ncbi.nlm.nih refseq Prot representative
4 Arabidopsis_ Arabidops ftp.ncbi.nlm.nih refseq Prot representative
5 Arabidopsis_ Arabidops ftp.ncbi.nlm.nih refseq Prot reference geno
6 Arachis_dura Arachis_d ftp.ncbi.nlm.nih refseq Prot representative
7 Arachis_ipae Arachis_i ftp.ncbi.nlm.nih refseq Prot representative
8 Asparagus_of Asparagus ftp.ncbi.nlm.nih refseq Prot representative
9 Auxenochlore Auxenochl ftp.ncbi.nlm.nih refseq Prot representative
10 Bathycoccus_ Bathycocc ftp.ncbi.nlm.nih refseq Prot representative
# ... with 95 more rows, and 10 more variables: assembly_accession <chr>,
# bioproject <chr>, biosample <chr>, taxid <int>,
# infraspecific_name <chr>, version_status <chr>, release_type <chr>,
# genome_rep <chr>, seq_rel_date <date>, submitter <chr>
Restarting a corrupted download
Unfortunately, when downloading large numbers of genomes, the NCBI
RefSeq database limits the number of queries from an individual IP
address. As a result, the download process might stop or time out at a
particular step. To overcome this limitation, users can simply re-run
the meta.retrieval() command they previously executed and specify the
argument restart_at_last, which has the following two options:
- If restart_at_last = TRUE (Default), then meta.retrieval() will skip all organisms that are already present in the folder and will start downloading all remaining species (thus picking up from where the initial download process stopped). However, this way meta.retrieval() will not be able to check whether already downloaded organism files are corrupted, because the md5 checksum of the respective files is not verified. Thus, I recommend manually re-downloading the last organism that was being retrieved before meta.retrieval() stopped, using getGenome(), to make sure that the respective file is not corrupted.
- If restart_at_last = FALSE, then meta.retrieval() will start from the beginning, crawl through already downloaded organism files, and check whether they are corrupted by verifying their md5 checksum (this procedure takes longer than restart_at_last = TRUE). After checking existing files, the function will start downloading all remaining organisms.
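To illustrate what an md5 checksum comparison involves, the sketch below verifies a local file against a reference checksum using tools::md5sum() from base R. It is not the code biomartr runs internally; verify_md5() is a hypothetical helper, and a temporary file stands in for a downloaded genome.

```r
# Hypothetical helper: TRUE if the file exists and its md5 checksum matches.
verify_md5 <- function(path, expected_md5) {
  if (!file.exists(path)) return(FALSE)
  observed <- unname(tools::md5sum(path))
  identical(tolower(observed), tolower(expected_md5))
}

# A temporary file stands in for a downloaded sequence file.
tmp <- tempfile(fileext = ".txt")
writeLines("ACGT", tmp)

# In practice the reference checksum would come from the database;
# here it is computed from the intact file.
reference_md5 <- unname(tools::md5sum(tmp))
intact <- verify_md5(tmp, reference_md5)      # TRUE: file is unchanged

# Simulate a corrupted download by changing the file content.
writeLines("ACGA", tmp)
corrupted <- verify_md5(tmp, reference_md5)   # FALSE: checksum differs
```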
Un-zipping downloaded files
After downloading genomes, users can format the output of
meta.retrieval() by first un-zipping the downloaded files and then
renaming them for more convenient downstream data analysis (e.g. from
Saccharomyces_cerevisiae_cds_from_genomic_refseq.fna.gz to
Scerevisiae.fa).
The easiest way to use clean.retrieval() in combination with
meta.retrieval() is to use the pipe operator from the magrittr package:
library(magrittr)
meta.retrieval(kingdom = "vertebrate_mammalian",
               db = "refseq",
               type = "genome") %>%
  clean.retrieval()
In the first step, genome assembly files are downloaded with
meta.retrieval() and subsequently (%>%) un-zipped and renamed using
clean.retrieval().
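The renaming convention mentioned above (first letter of the genus followed by the species name) can be sketched in base R as follows. This only illustrates the naming scheme and is not the implementation of clean.retrieval(); short_name() is a hypothetical helper that assumes file names start with Genus_species_.

```r
# Hypothetical helper: derive a short name such as "Scerevisiae.fa"
# from a file name that starts with "Genus_species_".
short_name <- function(file_name) {
  parts <- strsplit(basename(file_name), "_", fixed = TRUE)[[1]]
  genus_initial <- toupper(substr(parts[1], 1, 1))
  paste0(genus_initial, parts[2], ".fa")
}

short <- short_name("Saccharomyces_cerevisiae_cds_from_genomic_refseq.fna.gz")
short
# [1] "Scerevisiae.fa"
```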
Example Bacteria
# download all bacteria genomes
meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome", reference = FALSE)
Example Viruses
# download all virus genomes
meta.retrieval(kingdom = "viral", db = "refseq", type = "genome", reference = FALSE)
Example Archaea
# download all archaea genomes
meta.retrieval(kingdom = "archaea", db = "refseq", type = "genome", reference = FALSE)
Example Fungi
# download all fungi genomes
meta.retrieval(kingdom = "fungi", db = "refseq", type = "genome", reference = FALSE)
Example Plants
# download all plant genomes
meta.retrieval(kingdom = "plant", db = "refseq", type = "genome", reference = FALSE)
Example Invertebrates
# download all invertebrate genomes
meta.retrieval(kingdom = "invertebrate", db = "refseq", type = "genome", reference = FALSE)
Example Protozoa
# download all protozoa genomes
meta.retrieval(kingdom = "protozoa", db = "refseq", type = "genome", reference = FALSE)
Retrieval from NCBI Genbank
Alternatively, download all mammalian vertebrate genomes from NCBI Genbank, e.g.
# download all mammalian vertebrate genomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "genome", reference = FALSE)
Example Bacteria
# download all bacteria genomes
meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome", reference = FALSE)
Example Archaea
# download all archaea genomes
meta.retrieval(kingdom = "archaea", db = "genbank", type = "genome", reference = FALSE)
Example Fungi
# download all fungi genomes
meta.retrieval(kingdom = "fungi", db = "genbank", type = "genome", reference = FALSE)
Example Plants
# download all plant genomes
meta.retrieval(kingdom = "plant", db = "genbank", type = "genome", reference = FALSE)
Example Invertebrates
# download all invertebrate genomes
meta.retrieval(kingdom = "invertebrate", db = "genbank", type = "genome", reference = FALSE)
Example Protozoa
# download all protozoa genomes
meta.retrieval(kingdom = "protozoa", db = "genbank", type = "genome", reference = FALSE)
Retrieval from ENSEMBL
# download all genomes from ENSEMBL
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "genome", reference = FALSE)
Retrieve groups or subgroups of species
In case users do not wish to retrieve genomes from an entire kingdom,
but rather from a group or subgroup (e.g. from species belonging to the
Gammaproteobacteria class, a subgroup of the bacteria kingdom), they
can use the following workflow.
Example retrieval of all Gammaproteobacteria genomes from NCBI RefSeq:
First, users can again consult the getKingdoms() function to retrieve
kingdom information.
getKingdoms(db = "refseq")
[1] "archaea" "bacteria" "fungi" "invertebrate"
[5] "plant" "protozoa" "vertebrate_mammalian" "vertebrate_other"
[9] "viral"
In this example, we will choose the bacteria kingdom.
Now, the getGroups() function allows users to obtain available
subgroups of the bacteria kingdom.
getGroups(db = "refseq", kingdom = "bacteria")
[1] "Acidithiobacillia" "Acidobacteriia"
[3] "Actinobacteria" "Alphaproteobacteria"
[5] "Aquificae" "Armatimonadetes"
[7] "Bacteroidetes/Chlorobi group" "Balneolia"
[9] "Betaproteobacteria" "Blastocatellia"
[11] "Candidatus Kryptonia" "Chlamydiae"
[13] "Chloroflexi" "Cyanobacteria/Melainabacteria group"
[15] "Deinococcus-Thermus" "delta/epsilon subdivisions"
[17] "Endomicrobia" "Fibrobacteres"
[19] "Firmicutes" "Fusobacteriia"
[21] "Gammaproteobacteria" "Gemmatimonadetes"
[23] "Kiritimatiellaeota" "Nitrospira"
[25] "Planctomycetes" "Spirochaetia"
[27] "Synergistia" "Tenericutes"
[29] "Thermodesulfobacteria" "Thermotogae"
[31] "unclassified Acidobacteria" "unclassified Bacteria (miscellaneous)"
[33] "unclassified Proteobacteria" "Verrucomicrobia"
[35] "Zetaproteobacteria"
Please note that the kingdom argument specified in getGroups() needs to
match an available kingdom retrieved with getKingdoms(). It is also
important that the same database is specified in both getKingdoms() and
getGroups().
Now we choose the group Gammaproteobacteria and specify the group
argument in the meta.retrieval() function.
meta.retrieval(kingdom = "bacteria", group = "Gammaproteobacteria", db = "refseq", type = "genome", reference = FALSE)
Using this command, all bacterial (kingdom = "bacteria") genomes
(type = "genome") that belong to the group Gammaproteobacteria
(group = "Gammaproteobacteria") will be retrieved from NCBI RefSeq
(db = "refseq").
Alternatively, Gammaproteobacteria genomes can be retrieved from NCBI
Genbank by changing db = "refseq" to db = "genbank". If users wish to
download proteome, CDS, or GFF files instead of genomes, they can
specify the argument type = "proteome", type = "cds", or type = "gff".
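Group names are plain character strings, so base R pattern matching can help to find the exact spelling expected by the group argument. The sketch below searches a few names copied from the getGroups() output above; in a live session the vector would come from getGroups() itself.

```r
# A few group names copied from the
# getGroups(db = "refseq", kingdom = "bacteria") output shown above.
bacteria_groups <- c("Actinobacteria", "Alphaproteobacteria",
                     "Betaproteobacteria", "Firmicutes",
                     "Gammaproteobacteria", "unclassified Proteobacteria",
                     "Zetaproteobacteria")

# Case-insensitive search for every group whose name mentions "proteobacteria".
proteo_groups <- grep("proteobacteria", bacteria_groups,
                      ignore.case = TRUE, value = TRUE)
proteo_groups
```

Each matching name could then be passed as the group argument of meta.retrieval().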
Example retrieval of all Adenoviridae genomes from NCBI RefSeq:
Retrieve groups for viruses.
getGroups(db = "refseq", kingdom = "viral")
[1] "Adenoviridae" "Alloherpesviridae"
[3] "Alphaflexiviridae" "Alphatetraviridae"
[5] "Alvernaviridae" "Amalgaviridae"
[7] "Ampullaviridae" "Anelloviridae"
[9] "Apple fruit crinkle viroid" "Apple hammerhead viroid-like circular RNA"
[11] "Apscaviroid" "Arenaviridae"
[13] "Arteriviridae" "Ascoviridae"
[15] "Asfarviridae" "Astroviridae"
[17] "Avsunviroid" "Baculoviridae"
[19] "Barnaviridae" "Benyviridae"
[21] "Betaflexiviridae" "Bicaudaviridae"
[23] "Birnaviridae" "Bornaviridae"
[25] "Bromoviridae" "Bunyaviridae"
[27] "Caliciviridae" "Carmotetraviridae"
[29] "Caulimoviridae" "Cherry leaf scorch small circular viroid-like RNA 1"
[31] "Cherry small circular viroid-like RNA" "Chrysoviridae"
[33] "Circoviridae" "Closteroviridae"
[35] "Cocadviroid" "Coleviroid"
[37] "Coronaviridae" "Corticoviridae"
[39] "Cystoviridae" "Dicistroviridae"
[41] "Elaviroid" "Endornaviridae"
[43] "Filoviridae" "Flaviviridae"
[45] "Fusarividae" "Fuselloviridae"
[47] "Gammaflexiviridae" "Geminiviridae"
[49] "Genomoviridae" "Globuloviridae"
[51] "Grapevine latent viroid" "Guttaviridae"
[53] "Hepadnaviridae" "Hepeviridae"
[55] "Herpesviridae" "Hostuviroid"
[57] "Hypoviridae" "Hytrosaviridae"
[59] "Iflaviridae" "Inoviridae"
[61] "Iridoviridae" "Lavidaviridae"
[63] "Leviviridae" "Lipothrixviridae"
[65] "Luteoviridae" "Malacoherpesviridae"
[67] "Marnaviridae" "Marseilleviridae"
[69] "Megabirnaviridae" "Mesoniviridae"
[71] "Microviridae" "Mimiviridae"
[73] "Mulberry small circular viroid-like RNA 1" "Mymonaviridae"
[75] "Myoviridae" "Nanoviridae"
[77] "Narnaviridae" "Nimaviridae"
[79] "Nodaviridae" "Nudiviridae"
[81] "Nyamiviridae" "Ophioviridae"
[83] "Orthomyxoviridae" "Other"
[85] "Papillomaviridae" "Paramyxoviridae"
[87] "Partitiviridae" "Parvoviridae"
[89] "Pelamoviroid" "Permutotetraviridae"
[91] "Persimmon viroid" "Phycodnaviridae"
[93] "Picobirnaviridae" "Picornaviridae"
[95] "Plasmaviridae" "Pneumoviridae"
[97] "Podoviridae" "Polydnaviridae"
[99] "Polyomaviridae" "Pospiviroid"
[101] "Potyviridae" "Poxviridae"
[103] "Quadriviridae" "Reoviridae"
[105] "Retroviridae" "Rhabdoviridae"
[107] "Roniviridae" "Rubber viroid India/2009"
[109] "Rudiviridae" "Secoviridae"
[111] "Siphoviridae" "Sphaerolipoviridae"
[113] "Sunviridae" "Tectiviridae"
[115] "Togaviridae" "Tombusviridae"
[117] "Totiviridae" "Turriviridae"
[119] "Tymoviridae" "unclassified"
[121] "unclassified Pospiviroidae" "Virgaviridae"
Now we can choose Adenoviridae as the group argument for the
meta.retrieval() function.
meta.retrieval(kingdom = "viral", group = "Adenoviridae", db = "refseq", type = "genome", reference = FALSE)
Again, by exchanging type = "genome" with either type = "proteome",
type = "cds", type = "rna", type = "assemblystats", or type = "gff",
users can retrieve proteome, CDS, RNA, genome assembly statistics or
GFF files instead of genomes.
Meta retrieval of genome assembly quality information
Although much effort is invested to increase the genome assembly quality when new genomes are published or new versions are released, the influence of genome assembly quality on downstream analyses cannot be neglected. As a rule of thumb, the larger the genome, the more prone it is to genome assembly errors and therefore to reduced assembly quality.
In Veeckman et al., 2016 the authors conclude:
As yet, no uniform metrics or standards are in place to estimate the completeness of a genome assembly or the annotated gene space, despite their importance for downstream analyses
In most metagenomics studies, however, the influence or bias of genome assembly quality on the outcome of the analysis (e.g. comparative genomics, annotation, etc.) is neglected. To better grasp the genome assembly quality, the NCBI databases store genome assembly statistics of some species for which genome assemblies are available. An example assembly statistics report can be found at: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_assembly_stats.txt.
The biomartr package allows users to retrieve these genome assembly
stats files in an automated way by specifying the arguments
type = "assemblystats" and combine = TRUE. Please make sure that
combine = TRUE when selecting type = "assemblystats".
# show all elements of the data.frame
options(tibble.print_max = Inf)
# retrieve genome assembly stats for all mammal genome assemblies
# and store these stats in a data.frame
mammals.gc <- meta.retrieval(kingdom = "vertebrate_mammalian",
                             db = "refseq",
                             type = "assemblystats",
                             combine = TRUE)
mammals.gc
species total_length spanned_gaps unspanned_gaps region_count scaffold_count
<chr> <int> <int> <int> <int> <int>
1 Ornithorhynchus anatinus 1995607322 243698 137 0 200283
2 Sarcophilus harrisii NA 201317 0 0 35974
3 Dasypus novemcinctus NA 268413 0 0 46559
4 Erinaceus europaeus NA 219764 0 0 5803
5 Echinops telfairi NA 269444 0 0 8402
6 Pteropus alecto 1985975446 104566 0 0 65598
7 Rousettus aegyptiacus 1910250568 559 0 0 NA
8 Callithrix jacchus NA 184972 2242 0 16399
9 Cebus capucinus imitator NA 133441 0 0 7156
10 Cercocebus atys NA 65319 0 0 11433
# ... with 89 more rows, and 9 more variables: scaffold_N50 <int>, scaffold_L50 <int>,
# scaffold_N75 <int>, scaffold_N90 <int>, contig_count <int>, contig_N50 <int>, total_gap_length <int>,
# molecule_count <int>, top_level_count <int>
Analogously, this information can be retrieved for each kingdom other
than kingdom = "vertebrate_mammalian". Please consult getKingdoms() for
available kingdoms.
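Once the statistics are available as a data.frame, they can be filtered and sorted like any other table. The sketch below works on a small toy excerpt with values copied from the printed output above (only three of the columns), keeps assemblies for which both values are reported, and sorts them by scaffold count.

```r
# Toy excerpt of the table above (values copied from the printed output).
assembly_stats <- data.frame(
  species = c("Ornithorhynchus anatinus", "Sarcophilus harrisii",
              "Pteropus alecto", "Rousettus aegyptiacus"),
  total_length = c(1995607322, NA, 1985975446, 1910250568),
  scaffold_count = c(200283, 35974, 65598, NA),
  stringsAsFactors = FALSE
)

# Keep only assemblies where both statistics are reported.
has_values <- complete.cases(
  assembly_stats[, c("total_length", "scaffold_count")]
)
complete_stats <- assembly_stats[has_values, ]

# Sort so that assemblies with the fewest scaffolds come first.
complete_stats <- complete_stats[order(complete_stats$scaffold_count), ]
complete_stats$species
```

The same steps should apply to the full table returned with combine = TRUE, assuming it contains the column names shown in the output above.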
Metagenome project retrieval from NCBI Genbank
NCBI Genbank stores metagenome projects in addition to species-specific genome, proteome or CDS sequences. To retrieve these metagenomes, users can perform the following combination of commands.
First, users can list the project names of available metagenomes by typing:
# list available metagenomes at NCBI Genbank
listMetaGenomes()
[1] "metagenome" "human gut metagenome" "epibiont metagenome"
[4] "marine metagenome" "soil metagenome" "mine drainage metagenome"
[7] "mouse gut metagenome" "marine sediment metagenome" "termite gut metagenome"
[10] "hot springs metagenome" "human lung metagenome" "fossil metagenome"
[13] "freshwater metagenome" "saltern metagenome" "stromatolite metagenome"
[16] "coral metagenome" "mosquito metagenome" "fish metagenome"
[19] "bovine gut metagenome" "chicken gut metagenome" "wastewater metagenome"
[22] "microbial mat metagenome" "freshwater sediment metagenome" "human metagenome"
[25] "hydrothermal vent metagenome" "compost metagenome" "wallaby gut metagenome"
[28] "groundwater metagenome" "gut metagenome" "sediment metagenome"
[31] "ant fungus garden metagenome" "food metagenome" "hypersaline lake metagenome"
[34] "hydrocarbon metagenome" "activated sludge metagenome" "viral metagenome"
[37] "bioreactor metagenome" "wasp metagenome" "permafrost metagenome"
[40] "sponge metagenome" "aquatic metagenome" "insect gut metagenome"
[43] "activated carbon metagenome" "anaerobic digester metagenome" "rock metagenome"
[46] "terrestrial metagenome" "rock porewater metagenome" "seawater metagenome"
[49] "scorpion gut metagenome" "soda lake metagenome" "glacier metagenome"
Internally, the listMetaGenomes() function downloads the
assembly_summary.txt file from
ftp.ncbi.nlm.nih.gov/genomes/genbank/metagenomes/ to retrieve available
metagenome information. This procedure might take a few seconds during
the first run of listMetaGenomes(). Subsequently, the
assembly_summary.txt file will be stored in the tempdir() directory to
allow much faster access to this information during subsequent calls of
listMetaGenomes().
In case users wish to retrieve detailed information about available
metagenome projects, they can specify the details = TRUE argument.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# detailed information on available metagenomes at NCBI Genbank
listMetaGenomes(details = TRUE)
# A tibble: 857 x 21
assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid
<chr> <chr> <chr> <chr> <chr> <int> <int>
1 GCA_000206185.1 PRJNA32359 SAMN02954317 AAGA00000000.1 na 256318 256318
2 GCA_000206205.1 PRJNA32355 SAMN02954315 AAFZ00000000.1 na 256318 256318
3 GCA_000206225.1 PRJNA32357 SAMN02954316 AAFY00000000.1 na 256318 256318
4 GCA_000208265.2 PRJNA17779 SAMN02954240 AASZ00000000.1 na 256318 256318
5 GCA_000208285.1 PRJNA17657 SAMN02954268 AATO00000000.1 na 256318 256318
6 GCA_000208305.1 PRJNA17659 SAMN02954269 AATN00000000.1 na 256318 256318
7 GCA_000208325.1 PRJNA16729 SAMN02954263 AAQL00000000.1 na 256318 256318
8 GCA_000208345.1 PRJNA16729 SAMN02954262 AAQK00000000.1 na 256318 256318
9 GCA_000208365.1 PRJNA13699 SAMN02954283 AAFX00000000.1 na 256318 256318
10 GCA_900010595.1 PRJEB11544 SAMEA3639840 CZPY00000000.1 na 256318 256318
# ... with 847 more rows, and 14 more variables: organism_name <chr>, infraspecific_name <chr>,
# isolate <chr>, version_status <chr>, assembly_level <chr>, release_type <chr>, genome_rep <chr>,
# seq_rel_date <date>, asm_name <chr>, submitter <chr>, gbrs_paired_asm <chr>, paired_asm_comp <chr>,
# ftp_path <chr>, excluded_from_refseq <chr>
Finally, users can retrieve available metagenomes using
getMetaGenomes(). The name argument receives the metagenome project
name retrieved with listMetaGenomes(). The path argument specifies the
folder path in which the corresponding genomes shall be stored.
# retrieve all genomes belonging to the human gut metagenome project
getMetaGenomes(name = "human gut metagenome", path = file.path("_ncbi_downloads","human_gut"))
[1] "The metagenome of 'human gut metagenome' has been downloaded to '_ncbi_downloads/human_gut'."
[1] "_ncbi_downloads/human_gut/GCA_000205525.2_ASM20552v2_genomic.fna.gz"
[2] "_ncbi_downloads/human_gut/GCA_000205765.1_ASM20576v1_genomic.fna.gz"
[3] "_ncbi_downloads/human_gut/GCA_000205785.1_ASM20578v1_genomic.fna.gz"
[4] "_ncbi_downloads/human_gut/GCA_000207925.1_ASM20792v1_genomic.fna.gz"
[5] "_ncbi_downloads/human_gut/GCA_000207945.1_ASM20794v1_genomic.fna.gz"
[6] "_ncbi_downloads/human_gut/GCA_000207965.1_ASM20796v1_genomic.fna.gz"
[7] "_ncbi_downloads/human_gut/GCA_000207985.1_ASM20798v1_genomic.fna.gz"
[8] "_ncbi_downloads/human_gut/GCA_000208005.1_ASM20800v1_genomic.fna.gz"
[9] "_ncbi_downloads/human_gut/GCA_000208025.1_ASM20802v1_genomic.fna.gz"
[10] "_ncbi_downloads/human_gut/GCA_000208045.1_ASM20804v1_genomic.fna.gz"
[11] "_ncbi_downloads/human_gut/GCA_000208065.1_ASM20806v1_genomic.fna.gz"
[12] "_ncbi_downloads/human_gut/GCA_000208085.1_ASM20808v1_genomic.fna.gz"
[13] "_ncbi_downloads/human_gut/GCA_000208105.1_ASM20810v1_genomic.fna.gz"
[14] "_ncbi_downloads/human_gut/GCA_000208125.1_ASM20812v1_genomic.fna.gz"
[15] "_ncbi_downloads/human_gut/GCA_000208145.1_ASM20814v1_genomic.fna.gz"
[16] "_ncbi_downloads/human_gut/GCA_000208165.1_ASM20816v1_genomic.fna.gz"
...
Internally, getMetaGenomes() creates the folder specified in the path
argument. Genomes associated with the metagenome project specified in
the name argument will then be downloaded and stored in this folder.
getMetaGenomes() returns the file paths to the genome files, which can
then be used as input to the read*() functions.
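Because the return value is an ordinary character vector of file paths, it can be processed with base R before the files are read. As a minimal sketch, the code below extracts the assembly accession from two of the paths shown above and uses it to name the vector; the regular expression assumes file names that start with a GCA_ or GCF_ accession.

```r
# Two file paths copied from the getMetaGenomes() output shown above.
genome_files <- c(
  "_ncbi_downloads/human_gut/GCA_000205525.2_ASM20552v2_genomic.fna.gz",
  "_ncbi_downloads/human_gut/GCA_000205765.1_ASM20576v1_genomic.fna.gz"
)

# Extract the assembly accession (e.g. "GCA_000205525.2") from each file name.
file_names <- basename(genome_files)
accessions <- regmatches(file_names,
                         regexpr("^GC[AF]_[0-9]+\\.[0-9]+", file_names))

# Name the paths by accession so that individual files can be looked up.
names(genome_files) <- accessions
genome_files[["GCA_000205765.1"]]
```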
Alternatively, or subsequent to the metagenome retrieval, users can
retrieve annotation files of genomes belonging to a metagenome project
selected with listMetaGenomes() by using the
getMetaGenomeAnnotations() function.
# retrieve all annotation files of genomes belonging to the human gut metagenome project
getMetaGenomeAnnotations(name = "human gut metagenome", path = file.path("_ncbi_downloads","human_gut","annotations"))
[1] "The annotations of metagenome 'human gut metagenome' have been downloaded and stored at '_ncbi_downloads/human_gut/annotations'."
[1] "_ncbi_downloads/human_gut/annotations/GCA_000205525.2_ASM20552v2_genomic.gff.gz"
[2] "_ncbi_downloads/human_gut/annotations/GCA_000205765.1_ASM20576v1_genomic.gff.gz"
[3] "_ncbi_downloads/human_gut/annotations/GCA_000205785.1_ASM20578v1_genomic.gff.gz"
[4] "_ncbi_downloads/human_gut/annotations/GCA_000207925.1_ASM20792v1_genomic.gff.gz"
[5] "_ncbi_downloads/human_gut/annotations/GCA_000207945.1_ASM20794v1_genomic.gff.gz"
[6] "_ncbi_downloads/human_gut/annotations/GCA_000207965.1_ASM20796v1_genomic.gff.gz"
[7] "_ncbi_downloads/human_gut/annotations/GCA_000207985.1_ASM20798v1_genomic.gff.gz"
[8] "_ncbi_downloads/human_gut/annotations/GCA_000208005.1_ASM20800v1_genomic.gff.gz"
[9] "_ncbi_downloads/human_gut/annotations/GCA_000208025.1_ASM20802v1_genomic.gff.gz"
[10] "_ncbi_downloads/human_gut/annotations/GCA_000208045.1_ASM20804v1_genomic.gff.gz"
[11] "_ncbi_downloads/human_gut/annotations/GCA_000208065.1_ASM20806v1_genomic.gff.gz"
[12] "_ncbi_downloads/human_gut/annotations/GCA_000208085.1_ASM20808v1_genomic.gff.gz"
[13] "_ncbi_downloads/human_gut/annotations/GCA_000208105.1_ASM20810v1_genomic.gff.gz"
[14] "_ncbi_downloads/human_gut/annotations/GCA_000208125.1_ASM20812v1_genomic.gff.gz"
[15] "_ncbi_downloads/human_gut/annotations/GCA_000208145.1_ASM20814v1_genomic.gff.gz"
[16] "_ncbi_downloads/human_gut/annotations/GCA_000208165.1_ASM20816v1_genomic.gff.gz"
...
The file paths of the downloaded *.gff files are returned by
getMetaGenomeAnnotations() and can be used as input for the read.gff()
function in the seqreadr package.
Retrieve Proteome Sequences
Retrieval from NCBI RefSeq:
# download all mammalian vertebrate proteomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "proteome", reference = FALSE)
Retrieval from NCBI Genbank:
# download all mammalian vertebrate proteomes
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "proteome", reference = FALSE)
Retrieval from ENSEMBL:
# download all Ensembl proteome sequences
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "proteome", reference = FALSE)
Retrieve CDS Sequences
Download all mammalian vertebrate CDS from RefSeq (Genbank does not store CDS data).
Retrieval from NCBI RefSeq:
# download all mammalian vertebrate CDS
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "cds", reference = FALSE)
Retrieval from NCBI Genbank:
# download all mammalian vertebrate CDS
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "cds", reference = FALSE)
Retrieval from ENSEMBL:
# download all Ensembl CDS sequences
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "cds", reference = FALSE)
Retrieve GFF files
Download all mammalian vertebrate gff files.
Example NCBI RefSeq:
# download all mammalian vertebrate gff files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "gff", reference = FALSE)
Example NCBI Genbank:
# download all mammalian vertebrate gff files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "gff", reference = FALSE)
Retrieve GTF files
Download all mammalian vertebrate gtf files.
Example ENSEMBL:
# download all Ensembl gtf files
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "gtf", reference = FALSE)
Retrieve RNA sequences
Download all mammalian vertebrate RNA sequences from NCBI RefSeq and NCBI Genbank.
Retrieval from NCBI RefSeq:
# download all mammalian vertebrate RNA sequences
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "rna", reference = FALSE)
Retrieval from NCBI Genbank:
# download all mammalian vertebrate RNA sequences
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "rna", reference = FALSE)
Retrieval from ENSEMBL:
# download all Ensembl RNA sequences
meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "rna", reference = FALSE)
Retrieve Repeat Masker Sequences
Download all mammalian vertebrate Repeat Masker annotation files from NCBI RefSeq and NCBI Genbank.
Retrieval from NCBI RefSeq:
# download all mammalian vertebrate Repeat Masker annotation files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "rm", reference = FALSE)
Retrieval from NCBI Genbank:
# download all mammalian vertebrate Repeat Masker annotation files
meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "rm", reference = FALSE)
Users can obtain alternative kingdoms using getKingdoms().
Retrieve Individual Genomes for all Species in the Tree of Life
If users wish to download all genome, proteome, CDS, or gff files for
all species available in RefSeq or Genbank, they can use the
meta.retrieval.all() function for this purpose.
Genome Retrieval
Example RefSeq:
# download all genomes stored in RefSeq
meta.retrieval.all(db = "refseq", type = "genome", reference = FALSE)
Example Genbank:
# download all genomes stored in Genbank
meta.retrieval.all(db = "genbank", type = "genome", reference = FALSE)
Proteome Retrieval
Example RefSeq:
# download all proteomes stored in RefSeq
meta.retrieval.all(db = "refseq", type = "proteome", reference = FALSE)
Example Genbank:
# download all proteomes stored in Genbank
meta.retrieval.all(db = "genbank", type = "proteome", reference = FALSE)
Again, by exchanging type = "proteome" with either type = "genome",
type = "cds", type = "rna", type = "assemblystats", or type = "gff",
users can retrieve genome, CDS, RNA, genome assembly statistics or GFF
files instead of proteomes.
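When several databases and file types are needed, the combinations can be written down as a small plan before any download starts. The sketch below builds such a plan with expand.grid(); the run_downloads switch is a hypothetical safeguard that keeps the example from triggering very large downloads, and the loop only uses the db, type, and reference arguments shown above.

```r
# All combinations of database and file type that should be retrieved.
plan <- expand.grid(db = c("refseq", "genbank"),
                    type = c("genome", "proteome", "gff"),
                    stringsAsFactors = FALSE)

# Hypothetical safeguard: set to TRUE only when the downloads are really wanted,
# since each call retrieves files for all species in the database.
run_downloads <- FALSE

if (run_downloads) {
  for (i in seq_len(nrow(plan))) {
    biomartr::meta.retrieval.all(db = plan$db[i],
                                 type = plan$type[i],
                                 reference = FALSE)
  }
}

# Number of planned retrieval runs.
nrow(plan)
```

Writing the plan down first makes it easy to review which retrieval runs will be started and to record them alongside the summary files.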