Main genome retrieval function for a set of organisms of interest. By specifying the scientific names of the organisms of interest, the corresponding fasta files storing their genomes will be downloaded and stored locally. Genome files can be retrieved from several databases.

Usage

getGenomeSet(
  db = "refseq",
  organisms,
  reference = FALSE,
  release = NULL,
  clean_retrieval = TRUE,
  gunzip = TRUE,
  update = FALSE,
  path = "set_genomes",
  assembly_type = "toplevel"
)

Arguments

db

a character string specifying the database from which the genome shall be retrieved:

  • db = "refseq"

  • db = "genbank"

  • db = "ensembl"

organisms

a character vector storing the names of the organisms that shall be retrieved. There are three available options to characterize an organism:

  • by scientific name: e.g. organisms = "Homo sapiens"

  • by database specific accession identifier: e.g. organisms = "GCF_000001405.37" (= NCBI RefSeq identifier for Homo sapiens)

  • by taxonomic identifier from NCBI Taxonomy: e.g. organisms = "9606" (= taxid of Homo sapiens)
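The three identifier styles can be sketched as follows. The helper classify_organism_id() is hypothetical and only illustrates how the formats differ. It is not how biomartr resolves identifiers internally, and its regular expressions are assumptions based on the usual shape of NCBI assembly accessions (GCF_ for RefSeq, GCA_ for GenBank) and numeric taxids.

```r
# Three ways to name the same organism, taken from the list above.
by_name      <- "Homo sapiens"
by_accession <- "GCF_000001405.37"
by_taxid     <- "9606"

# Hypothetical helper (not part of biomartr): guess the identifier style
# from its textual shape.
classify_organism_id <- function(x) {
  ifelse(grepl("^GC[AF]_[0-9]+\\.[0-9]+$", x), "accession",
         ifelse(grepl("^[0-9]+$", x), "taxid", "scientific name"))
}

classify_organism_id(c(by_name, by_accession, by_taxid))

# The actual retrieval is guarded because it downloads data.
if (FALSE) {
  biomartr::getGenomeSet(db = "refseq", organisms = by_name)
}
```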

reference

a logical value indicating whether or not a genome shall be downloaded if it isn't marked in the database as either a reference genome or a representative genome.

release

the database release version of ENSEMBL (db = "ensembl"). Default is release = NULL meaning that the most recent database version is used.

clean_retrieval

logical value indicating whether or not downloaded files shall be renamed for more convenient downstream data analysis.

gunzip

a logical value indicating whether or not files should be unzipped.

update

a logical value indicating whether files that were already downloaded and are still present in the output folder shall be updated and re-loaded (update = TRUE) or whether the existing files shall be retained (update = FALSE, the default).

path

a character string specifying the location (a folder) in which the corresponding genomes shall be stored. Default is path = "set_genomes".

assembly_type

a character string specifying the assembly type from which the genome shall be retrieved (ensembl only, else this argument is ignored). Default is assembly_type = "toplevel". This will give you all multi-chromosomes (copies of the same chromosome with small variations). As an example, the toplevel fasta genome in human is over 70 GB uncompressed. To get the primary assembly with 1 chromosome variant per chromosome, use assembly_type = "primary_assembly". As an example, the primary_assembly fasta genome in human is only a few GB uncompressed.
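A minimal sketch of requesting the smaller primary assembly from Ensembl, using only the arguments documented above. The folder name "set_genomes_ensembl" is a hypothetical choice for illustration, and the call itself is guarded because it triggers a large download.

```r
# Collect the arguments first so they can be inspected before downloading.
ensembl_args <- list(
  db            = "ensembl",
  organisms     = c("Homo sapiens"),
  assembly_type = "primary_assembly",    # instead of the default "toplevel"
  path          = "set_genomes_ensembl"  # hypothetical folder name
)

# Guarded: remove the if (FALSE) wrapper to actually run the retrieval.
if (FALSE) {
  genome_paths <- do.call(biomartr::getGenomeSet, ensembl_args)
}
```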

Value

File path to downloaded genomes.

Details

Internally this function loads the overview.txt file from NCBI:

refseq: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

genbank: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/

and creates a directory 'set_genomes' to store the genomes of interest as fasta files for future processing. In case the corresponding fasta file already exists within the 'set_genomes' folder and is accessible within the workspace, no download process will be performed.
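Because existing files are not downloaded again, it can help to check what is already in the output folder before calling the function. The sketch below uses a hypothetical helper, list_local_genomes(). The file extensions it matches and the demo file name are assumptions for illustration, not names guaranteed by biomartr.

```r
# Hypothetical helper (not part of biomartr): list FASTA-like files that are
# already present in a genome folder. The matched extensions are an assumption.
list_local_genomes <- function(path = "set_genomes") {
  if (!dir.exists(path)) {
    return(character(0))
  }
  list.files(path, pattern = "\\.(fa|fna|fasta)(\\.gz)?$", full.names = TRUE)
}

# Demonstration on a temporary folder, so no download is needed.
demo_dir <- file.path(tempdir(), "set_genomes_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "Arabidopsis_thaliana_genomic_refseq.fna"))  # made-up name
file.create(file.path(demo_dir, "notes.txt"))  # ignored: not FASTA-like

basename(list_local_genomes(demo_dir))
```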

Author

Hajk-Georg Drost

Examples

if (FALSE) {
getGenomeSet("refseq", organisms = c("Arabidopsis thaliana", 
                                     "Arabidopsis lyrata", 
                                     "Capsella rubella"))
}