Accessing the OneKP metadata
All projects that use the onekp R package start at the same place:
onekp <- retrieve_onekp()
#> Registered S3 method overwritten by 'hoardr':
#> method from
#> print.cache_info httr
class(onekp)
#> [1] "OneKP"
#> attr(,"package")
#> [1] "onekp"
The retrieve_onekp function scrapes the metadata associated with each transcriptome project from the 1KP public data page. It also links each species to its NCBI taxonomy ID, which is used later to filter by clade.
The only part of the OneKP object that you will need to interact with directly is the @table slot, a data.frame with the form:
species | code | family | tissue | peptides | nucleotides | tax_id |
---|---|---|---|---|---|---|
Amborella trichopoda | URDJ | Amborellaceae | leaves | URDJ.faa.tar.bz2 | URDJ.fna.tar.bz2 | 13333 |
Nuphar advena | WTKZ | Nymphaeaceae | young leaves | WTKZ.faa.tar.bz2 | WTKZ.fna.tar.bz2 | 77108 |
Nymphaea sp. | PZRT | Nymphaeaceae | young leaves | PZRT.faa.tar.bz2 | PZRT.fna.tar.bz2 | NA |
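
Because the table is an ordinary data.frame, it can be inspected with base R before any filtering. The sketch below is illustrative only: it builds a small stand-in for onekp@table from the three example rows above (nothing is downloaded, and the variable names tbl, missing_tax and family_counts are made up for this example), then lists entries that lack an NCBI taxonomy ID and counts transcriptomes per family.

```r
# Stand-in for onekp@table, copied from the three example rows above.
# With the real package you would use `tbl <- onekp@table` instead.
tbl <- data.frame(
  species     = c("Amborella trichopoda", "Nuphar advena", "Nymphaea sp."),
  code        = c("URDJ", "WTKZ", "PZRT"),
  family      = c("Amborellaceae", "Nymphaeaceae", "Nymphaeaceae"),
  tissue      = c("leaves", "young leaves", "young leaves"),
  peptides    = c("URDJ.faa.tar.bz2", "WTKZ.faa.tar.bz2", "PZRT.faa.tar.bz2"),
  nucleotides = c("URDJ.fna.tar.bz2", "WTKZ.fna.tar.bz2", "PZRT.fna.tar.bz2"),
  tax_id      = c(13333L, 77108L, NA),
  stringsAsFactors = FALSE
)

# Rows whose species has no NCBI taxonomy ID in the table
missing_tax <- tbl[is.na(tbl$tax_id), c("species", "code")]
print(missing_tax)

# Number of transcriptomes per family
family_counts <- table(tbl$family)
print(family_counts)
```

Checking for NA values in tax_id is worthwhile before relying on taxon-ID-based selection, since such rows carry no ID to match against.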
Retrieving sequence
To get sequences, first subset onekp@table until it contains only the species you want. There are several ways to do this.
You can use all the normal tools for subsetting the table directly, e.g.
onekp@table <- subset(onekp@table, family == 'Nymphaeaceae')
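
A few other base R idioms work equally well for this kind of subsetting. The following is a minimal sketch on a hand-built stand-in for onekp@table (the data frame tbl and the result names are hypothetical and exist only for this example); it shows logical indexing, pattern matching on the species name, and selection by 1KP code.

```r
# Minimal stand-in for onekp@table (only the columns used here).
tbl <- data.frame(
  species = c("Amborella trichopoda", "Nuphar advena", "Nymphaea sp."),
  code    = c("URDJ", "WTKZ", "PZRT"),
  family  = c("Amborellaceae", "Nymphaeaceae", "Nymphaeaceae"),
  tissue  = c("leaves", "young leaves", "young leaves"),
  stringsAsFactors = FALSE
)

# Logical indexing: keep one family
by_family <- tbl[tbl$family == "Nymphaeaceae", ]

# Pattern matching on the species name: keep one genus
by_genus <- tbl[grepl("^Nuphar ", tbl$species), ]

# Membership test: keep specific 1KP codes
by_code <- tbl[tbl$code %in% c("URDJ", "PZRT"), ]

print(by_family)
print(by_genus)
print(by_code)
```

With the real object, assign the result back to the slot (onekp@table <- ...) as in the subset call above, so that later steps see only the selected rows.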
onekp also has a few built-in tools for taxonomic selection:
# filter by species name ('species' column of onekp@table)
filter_by_species(onekp, 'Pinus radiata')
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs
# filter by species NCBI taxon ID ('tax_id' column of onekp@table)
filter_by_species(onekp, 3347)
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs
# filter by clade scientific name (get all data for the Brassicaceae family)
filter_by_clade(onekp, 'Brassicaceae')
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs
# filter by clade NCBI taxon ID
filter_by_clade(onekp, 3700)
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs
Once you have chosen the studies you want, you can retrieve the protein or transcript FASTA files:
download_peptides(filter_by_clade(onekp, 'Brassicaceae'))
download_nucleotides(filter_by_clade(onekp, 'Brassicaceae'))
This will download the files into a temporary directory. Alternatively, you may set your own directory with the dir argument. The downloaded protein FASTA files have the extension .faa and the DNA files the extension .fna. The basename for each file is the 1KP 4-letter code.
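
Given this naming scheme, the file names to expect can be derived from the code column. The sketch below is an illustration under stated assumptions, not the package's own logic: out_dir is a hypothetical directory standing in for whatever was passed as dir, the two codes are taken from the example table, and the files are assumed to sit directly inside that directory.

```r
# Hypothetical download directory; in practice, whatever was passed as `dir`.
out_dir <- file.path(tempdir(), "onekp-demo")

# 1KP 4-letter codes of the selected transcriptomes (from the example table)
codes <- c("WTKZ", "PZRT")

# Expected file names: <code>.faa for proteins, <code>.fna for DNA.
# Assumption: files are placed directly inside out_dir.
expected_faa <- file.path(out_dir, paste0(codes, ".faa"))
expected_fna <- file.path(out_dir, paste0(codes, ".fna"))

# Report which of the expected protein files are actually present
present <- file.exists(expected_faa)
print(data.frame(code = codes, file = basename(expected_faa), present = present))
```

A check like this is a cheap way to confirm that a download completed before passing the files to downstream tools.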