library(onekp)
library(knitr)
library(magrittr)

Accessing the OneKP metadata

All projects that use the onekp R package start in the same place:

onekp <- retrieve_onekp()
#> Registered S3 method overwritten by 'hoardr':
#>   method           from
#>   print.cache_info httr
class(onekp)
#> [1] "OneKP"
#> attr(,"package")
#> [1] "onekp"

The retrieve_onekp function scrapes the metadata associated with each transcriptome project from the 1KP public data page. It also links each species to its NCBI taxonomy ID (which is used later to filter by clade).

The only part of the OneKP object that you will need to interact with directly is the @table slot, a data.frame with the form:

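| species              | code | family        | tissue       | peptides         | nucleotides      | tax_id |
|----------------------|------|---------------|--------------|------------------|------------------|--------|
| Amborella trichopoda | URDJ | Amborellaceae | leaves       | URDJ.faa.tar.bz2 | URDJ.fna.tar.bz2 | 13333  |
| Nuphar advena        | WTKZ | Nymphaeaceae  | young leaves | WTKZ.faa.tar.bz2 | WTKZ.fna.tar.bz2 | 77108  |
| Nymphaea sp.         | PZRT | Nymphaeaceae  | young leaves | PZRT.faa.tar.bz2 | PZRT.fna.tar.bz2 | NA     |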

Retrieving sequence

To retrieve sequences, first subset onekp@table until it contains only the species you want. There are several ways to do this.

You can use all the usual tools for subsetting the table directly, for example:

onekp@table <- subset(onekp@table, family == 'Nymphaeaceae')
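
To illustrate what other "normal tools" might look like, here is a minimal sketch that runs without network access. It uses a small hand-built data.frame as a stand-in for onekp@table, based on the three example rows shown earlier. The names `tbl`, `nymph`, `with_taxid`, and `by_code` are hypothetical and not part of the package.

```r
# Stand-in for onekp@table, built by hand from the three example rows above.
# The peptides and nucleotides columns are omitted for brevity (assumption:
# they are not needed for subsetting).
tbl <- data.frame(
  species = c("Amborella trichopoda", "Nuphar advena", "Nymphaea sp."),
  code    = c("URDJ", "WTKZ", "PZRT"),
  family  = c("Amborellaceae", "Nymphaeaceae", "Nymphaeaceae"),
  tissue  = c("leaves", "young leaves", "young leaves"),
  tax_id  = c(13333, 77108, NA),
  stringsAsFactors = FALSE
)

# Keep one family, mirroring the subset() call above.
nymph <- subset(tbl, family == "Nymphaeaceae")

# Logical indexing also works; here, keep rows that have a taxonomy ID.
with_taxid <- tbl[!is.na(tbl$tax_id), ]

# Select rows by their 1KP four-letter code.
by_code <- tbl[tbl$code %in% c("URDJ", "PZRT"), ]
```

With a real OneKP object, the same expressions would presumably be applied to onekp@table and the result assigned back to that slot, as in the subset() example.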

onekp also has a few built-in tools for taxonomic selection:

# filter by species name ('species' column of onekp@table)
filter_by_species(onekp, 'Pinus radiata')
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs

# filter by species NCBI taxon ID ('tax_id' column of onekp@table)
filter_by_species(onekp, 3347)
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs

# filter by clade scientific name (get all data for the Brassicaceae family)
filter_by_clade(onekp, 'Brassicaceae')
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs

# filter by clade NCBI taxon ID
filter_by_clade(onekp, 3700)
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs

Once you have chosen the studies you want, you can retrieve the protein or transcript FASTA files:

download_peptides(filter_by_clade(onekp, 'Brassicaceae'))
download_nucleotides(filter_by_clade(onekp, 'Brassicaceae'))

This will download the files into a temporary directory. Alternatively, you may set your own directory with the dir argument. The downloaded protein FASTA files have the extension .faa and the DNA files the extension .fna. The basename for each file is the 1KP 4-letter code.
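
Because each file's basename is the 1KP code, downloaded files can be mapped back to rows of the metadata table. The sketch below shows one way to do that in base R. It creates empty placeholder files in a temporary directory instead of downloading anything, and the directory name `onekp_example` and the variables `out_dir`, `faa_files`, and `codes` are hypothetical.

```r
# Hypothetical output directory; in practice this would be the directory
# you passed as `dir`, or wherever the files were downloaded.
out_dir <- file.path(tempdir(), "onekp_example")
dir.create(out_dir, showWarnings = FALSE)

# Create empty placeholder files so the sketch runs without downloading.
file.create(file.path(out_dir, c("URDJ.faa", "WTKZ.faa", "WTKZ.fna")))

# List the protein FASTA files only (extension .faa).
faa_files <- list.files(out_dir, pattern = "\\.faa$", full.names = TRUE)

# Recover the 1KP four-letter code from each file's basename.
codes <- tools::file_path_sans_ext(basename(faa_files))
```

The resulting `codes` vector can then be matched against the code column of the metadata table, for example with `%in%`.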