Introduction

library(onekp)
library(knitr)
library(magrittr)

Accessing the OneKP metadata

All projects that use the onekp R package start at the same place:

onekp <- retrieve_onekp()
#> Registered S3 method overwritten by 'hoardr':
#>   method           from
#>   print.cache_info httr
class(onekp)
#> [1] "OneKP"
#> attr(,"package")
#> [1] "onekp"

The retrieve_onekp function scrapes the metadata associated with each transcriptome project from the 1KP public data page. It also links each species to its NCBI taxonomy ID (which is used later to filter by clade).

The only part of the OneKP object that you will need to interact with directly is the @table slot, a data.frame with the form:

species	code	family	tissue	peptides	nucleotides	tax_id
Amborella trichopoda	URDJ	Amborellaceae	leaves	URDJ.faa.tar.bz2	URDJ.fna.tar.bz2	13333
Nuphar advena	WTKZ	Nymphaeaceae	young leaves	WTKZ.faa.tar.bz2	WTKZ.fna.tar.bz2	77108
Nymphaea sp.	PZRT	Nymphaeaceae	young leaves	PZRT.faa.tar.bz2	PZRT.fna.tar.bz2	NA

Retrieving sequence

To get sequences, first subset onekp@table until it contains only the species you want. There are several ways to do this.

You can use all the normal tools for subsetting the table directly, e.g.

onekp@table <- subset(onekp@table, family == 'Nymphaeaceae')
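The same kind of subsetting can be tried offline on a small stand-in for onekp@table. The sketch below rebuilds the three example rows shown earlier as an ordinary data.frame and applies base R subsetting to it. The names tbl, nymph, nymph2 and no_taxid are illustrative only and are not part of the package.

```r
# Stand-in for onekp@table, built from the three example rows shown earlier.
# `tbl` is an illustrative name, not an object provided by the onekp package.
tbl <- data.frame(
  species     = c("Amborella trichopoda", "Nuphar advena", "Nymphaea sp."),
  code        = c("URDJ", "WTKZ", "PZRT"),
  family      = c("Amborellaceae", "Nymphaeaceae", "Nymphaeaceae"),
  tissue      = c("leaves", "young leaves", "young leaves"),
  peptides    = c("URDJ.faa.tar.bz2", "WTKZ.faa.tar.bz2", "PZRT.faa.tar.bz2"),
  nucleotides = c("URDJ.fna.tar.bz2", "WTKZ.fna.tar.bz2", "PZRT.fna.tar.bz2"),
  tax_id      = c(13333L, 77108L, NA),
  stringsAsFactors = FALSE
)

# Keep one family, mirroring the subset() call above
nymph <- subset(tbl, family == "Nymphaeaceae")

# Equivalent selection with bracket indexing
nymph2 <- tbl[tbl$family == "Nymphaeaceae", ]

# Rows that have no NCBI taxonomy ID recorded
no_taxid <- tbl[is.na(tbl$tax_id), ]

print(nymph$code)
print(no_taxid$species)
```

Rows with a missing tax_id, such as the Nymphaea sp. entry, are worth checking before relying on any taxonomy-ID-based selection.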

onekp also has a few built-in tools for taxonomic selection:

# filter by species name ('species' column of onekp@table)
filter_by_species(onekp, 'Pinus radiata')
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs

# filter by species NCBI taxon ID ('tax_id' column of onekp@table)
filter_by_species(onekp, 3347)
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs

# filter by clade scientific name (get all data for the Brassicaceae family)
filter_by_clade(onekp, 'Brassicaceae')
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs

# filter by clade NCBI taxon ID
filter_by_clade(onekp, 3700)
#> OneKP object
#> Slot "table": metadata for 0 transcriptomes from 0 species
#> Slot "links": map of file names from "table" to URLs

Once you have chosen the studies you want, you can retrieve the protein or transcript FASTA files:

download_peptides(filter_by_clade(onekp, 'Brassicaceae'))
download_nucleotides(filter_by_clade(onekp, 'Brassicaceae'))

This will download the files into a temporary directory. Alternatively, you may set your own directory with the dir argument. The downloaded protein FASTA files have the extension .faa and the DNA files the extension .fna. The basename for each file is the 1KP 4-letter code.
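Because the file names follow from the 4-letter codes, the paths you would expect after a download can be computed ahead of time. The following is a minimal sketch, not part of the package: the helper expected_fasta_paths and the directory names 'oneKP/pep' and 'oneKP/nuc' are hypothetical, and the two codes are taken from the example table above.

```r
# Hypothetical helper (not part of onekp): build the file paths one would
# expect after downloading, given 1KP codes, a directory, and an extension.
expected_fasta_paths <- function(codes, dir, ext = c("faa", "fna")) {
  ext <- match.arg(ext)                     # only "faa" or "fna" are accepted
  file.path(dir, paste0(codes, ".", ext))   # <dir>/<CODE>.<ext>
}

codes <- c("URDJ", "WTKZ")

# Assumed output directories; substitute whatever was passed as `dir`
pep_paths <- expected_fasta_paths(codes, dir = "oneKP/pep", ext = "faa")
nuc_paths <- expected_fasta_paths(codes, dir = "oneKP/nuc", ext = "fna")

# Report which of the expected protein files are present on disk
print(file.exists(pep_paths))
```

Comparing the expected paths against file.exists() is one simple way to check that every requested transcriptome was actually retrieved.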

Zebulun Arendsee

2023-08-25
