The 1000 Plants initiative (1KP) provides transcriptome sequences for over 1000 plants from diverse lineages.
onekp allows researchers in plant genomics and transcriptomics to access this dataset through a simple R interface. The metadata for each transcriptome project is scraped from the 1KP project website. This metadata includes the species, tissue, and research group for each sequence sample.
onekp leverages the taxonomy program taxizedb, a local database version of the taxize package, to allow filtering of the metadata by taxonomic group (entered as either a taxon name or an NCBI taxon ID). The raw nucleotide or translated peptide sequences can then be downloaded for the full, or filtered, table of transcriptome projects.
The data may also be accessed directly through CyVerse (previously iPlant). CyVerse efficiently distributes data using the iRODS data system. This approach is preferable for high-throughput cases or where iRODS is already in use. Further, accessing data straight from the source at CyVerse is more stable than scraping it from the project website. However, the onekp R package is generally easier to use (no dependency on iRODS or the CyVerse API) and offers powerful filtering options.
Gane Ka-Shu Wong - Principal investigator
Michael Deyholos - Alberta co-investigator
Yong Zhang - Shenzhen co-investigator
Eric Carpenter - Database manager
R package maintainer
onekp is on CRAN, but the CRAN release is currently a little out of date, so for now it is better to install from GitHub.
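A minimal sketch of a GitHub install. Both the repository path (`ropensci/onekp`) and the use of the remotes package are assumptions here, not something stated above; adjust them if the package lives elsewhere.

```r
# Assumption: the development version is hosted at ropensci/onekp on GitHub.
# install.packages("remotes")  # uncomment if remotes is not installed yet
remotes::install_github("ropensci/onekp")
```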
Retrieve the protein and gene transcript FASTA files for two 1KP transcriptomes:
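The call sequence might look like the sketch below. The sample codes and output paths are taken from the directory listing in this README; the function names `retrieve_onekp()`, `filter_by_code()`, `download_peptides()`, and `download_nucleotides()` are assumed and should be checked against the package reference. Network access is required.

```r
library(onekp)

# Scrape the 1KP metadata table (assumed entry point: retrieve_onekp()).
onekp <- retrieve_onekp()

# Keep only the two samples of interest, selected by their 1KP codes
# (filter_by_code() is an assumed helper name).
seqs <- filter_by_code(onekp, c("URDJ", "ROAP"))

# Download translated peptides and raw nucleotide sequences into
# separate folders (both download functions are assumed names).
download_peptides(seqs, "oneKP/pep")
download_nucleotides(seqs, "oneKP/nuc")
```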
This will create the following directory:
    oneKP
    ├── nuc
    │   ├── ROAP.fna
    │   └── URDJ.fna
    └── pep
        ├── ROAP.faa
        └── URDJ.faa
onekp can also filter by species name, NCBI taxon ID, or clade.
    # filter by species name
    filter_by_species(onekp, 'Pinus radiata')

    # filter by species NCBI taxon ID
    filter_by_species(onekp, 3347)

    # filter by clade scientific name (get all data for the Brassicaceae family)
    filter_by_clade(onekp, 'Brassicaceae')

    # filter by clade NCBI taxon ID
    filter_by_clade(onekp, 3700)
So to get the protein sequences for all species in Brassicaceae:
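A hedged sketch of that workflow follows. `filter_by_clade()` appears in this README; `retrieve_onekp()` and `download_peptides()` are assumed names, and the output folder `oneKP/pep` is simply reused from the earlier directory listing.

```r
library(onekp)

# Scrape the 1KP metadata table (assumed entry point: retrieve_onekp()).
onekp <- retrieve_onekp()

# Restrict the table to samples that fall within the Brassicaceae family.
seqs <- filter_by_clade(onekp, "Brassicaceae")

# Download the translated peptide FASTA files for the filtered samples
# (download_peptides() is an assumed function name).
download_peptides(seqs, "oneKP/pep")
```

Filtering by clade relies on the local taxonomy database, so the first call may take a while if that database has to be fetched.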
Development of this R package was supported by the National Science Foundation under Grant No. IOS 1546858.
We welcome any contributions!