Main genome retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest, the corresponding FASTA files storing their genomes are downloaded and stored locally. Genome files can be retrieved from several databases.
Usage
getGenomeSet(
db = "refseq",
organisms,
reference = FALSE,
release = NULL,
clean_retrieval = TRUE,
gunzip = TRUE,
update = FALSE,
path = "set_genomes",
assembly_type = "toplevel"
)
Arguments
- db
a character string specifying the database from which the genome shall be retrieved:
db = "refseq"
db = "genbank"
db = "ensembl"
- organisms
a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism:
by scientific name: e.g. organisms = "Homo sapiens"
by database specific accession identifier: e.g. organisms = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)
by taxonomic identifier from NCBI Taxonomy: e.g. organisms = "9606" (= taxid of Homo sapiens)
- reference
a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.
- release
the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL, meaning that the most recent database version is used.
- clean_retrieval
a logical value indicating whether or not downloaded files shall be renamed for more convenient downstream data analysis.
- gunzip
a logical value indicating whether or not files should be unzipped.
- update
a logical value indicating whether files that were already downloaded and are still present in the output folder shall be updated and re-loaded (update = TRUE) or whether the existing files shall be retained (update = FALSE, the default).
- path
a character string specifying the location (a folder) in which the corresponding genomes shall be stored. Default is path = "set_genomes".
- assembly_type
a character string specifying the assembly type from which the genome shall be retrieved (Ensembl only; otherwise this argument is ignored). Default is assembly_type = "toplevel". This retrieves all multi-chromosomes (copies of the same chromosome with small variations). As an example, the toplevel FASTA genome for human is over 70 GB uncompressed. To get the primary assembly with one variant per chromosome, use assembly_type = "primary_assembly". As an example, the primary_assembly FASTA genome for human is only a few GB uncompressed.
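The three identifier styles accepted by the organisms argument can be told apart by their shape. The following is a minimal sketch only: classify_organism_id() is a hypothetical helper, not part of the documented function, and the regular expression assumes the usual NCBI assembly accession layout (GCF_ or GCA_ prefix, digits, a dot, and a version number).

```r
# Hypothetical helper (not part of the documented API): guess which of the
# three identifier styles a string uses, based purely on its shape.
classify_organism_id <- function(x) {
  ifelse(
    # Assumed NCBI assembly accession pattern, e.g. "GCF_000001405.37"
    grepl("^GC[AF]_[0-9]+\\.[0-9]+$", x), "accession",
    # Digits only are treated as an NCBI Taxonomy identifier, e.g. "9606"
    ifelse(grepl("^[0-9]+$", x), "taxid", "scientific name")
  )
}

# One example per identifier style, taken from the argument description
ids <- c("Homo sapiens", "GCF_000001405.37", "9606")
classify_organism_id(ids)
```

A check like this can be useful for validating user input before starting a long download, but it says nothing about whether the identifier actually exists in the chosen database.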
Details
Internally this function loads the overview.txt file from NCBI:
refseq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
genbank: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/
and creates a directory 'set_genomes' to store the genomes of interest as fasta files for future processing. In case the corresponding fasta file already exists within the 'set_genomes' folder and is accessible within the workspace, no download process will be performed.
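The skip-if-present behaviour described above, combined with the update argument, amounts to a simple decision rule. The sketch below illustrates that rule under stated assumptions: needs_download() and the file name example_genome.fna are made up for illustration and do not reflect the function's actual internals or naming scheme.

```r
# Hypothetical helper: a file is fetched when an update is forced
# or when no local copy exists yet.
needs_download <- function(file, update = FALSE) {
  update || !file.exists(file)
}

# Work in a temporary folder so nothing is written to the real workspace
genome_dir <- file.path(tempdir(), "set_genomes")
dir.create(genome_dir, showWarnings = FALSE)
fasta <- file.path(genome_dir, "example_genome.fna")  # made-up file name
unlink(fasta)                                         # start from a clean state

needs_download(fasta)                 # no local copy yet
writeLines(c(">seq1", "ACGT"), fasta) # simulate a finished download
needs_download(fasta)                 # local copy present, update = FALSE
needs_download(fasta, update = TRUE)  # update forces a re-download
```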
Examples
if (FALSE) {
getGenomeSet("refseq", organisms = c("Arabidopsis thaliana",
"Arabidopsis lyrata",
"Capsella rubella"))
}
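A further sketch, this time for Ensembl with the smaller primary assembly. It assumes that getGenomeSet() is exported by the biomartr package and that both listed species offer a primary_assembly file in Ensembl; the species choice is illustrative only. The call itself is guarded by if (FALSE), as in the example above, because it needs network access and downloads several GB.

```r
# Arguments for an Ensembl retrieval restricted to the primary assembly
ensembl_args <- list(
  db = "ensembl",
  organisms = c("Homo sapiens", "Mus musculus"),  # illustrative choice
  assembly_type = "primary_assembly",             # avoids the much larger toplevel file
  path = "set_genomes"
)

if (FALSE) {
  # Not run: requires network access and (assumed) the biomartr package
  do.call(biomartr::getGenomeSet, ensembl_args)
}
```

Keeping the arguments in a list makes it easy to inspect or log the exact settings before committing to a long download.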