Abstract
This vignette describes the background of the package, the main functions it defines, and some of their applications and uses.
Getting started
This vignette explains how to work with sets using this package. The package provides a class to store the information efficiently and functions to work with it.
The TidySet class
To create a TidySet object, which stores associations between elements and sets, imagine we have several genes associated with a characteristic:
library("BaseSet")
gene_lists <- list(
geneset1 = c("A", "B"),
geneset2 = c("B", "C", "D")
)
tidy_set <- tidySet(gene_lists)
tidy_set
#> elements sets fuzzy
#> 1 A geneset1 1
#> 2 B geneset1 1
#> 3 B geneset2 1
#> 4 C geneset2 1
#> 5 D geneset2 1
This is then stored internally in three slots: relations(), elements(), and sets().
If you have more information for each element or set it can be added:
gene_data <- data.frame(
stat1 = c( 1, 2, 3, 4 ),
info1 = c("a", "b", "c", "d")
)
tidy_set <- add_column(tidy_set, "elements", gene_data)
set_data <- data.frame(
Group = c( 100 , 200 ),
Column = c("abc", "def")
)
tidy_set <- add_column(tidy_set, "sets", set_data)
tidy_set
#> elements sets fuzzy Group Column stat1 info1
#> 1 A geneset1 1 100 abc 1 a
#> 2 B geneset1 1 100 abc 2 b
#> 3 B geneset2 1 200 def 2 b
#> 4 C geneset2 1 200 def 3 c
#> 5 D geneset2 1 200 def 4 d
This data is stored in one of the three slots, which can be directly accessed using their getter methods:
relations(tidy_set)
#> elements sets fuzzy
#> 1 A geneset1 1
#> 2 B geneset1 1
#> 3 B geneset2 1
#> 4 C geneset2 1
#> 5 D geneset2 1
elements(tidy_set)
#> elements stat1 info1
#> 1 A 1 a
#> 2 B 2 b
#> 3 C 3 c
#> 4 D 4 d
sets(tidy_set)
#> sets Group Column
#> 1 geneset1 100 abc
#> 2 geneset2 200 def
You can add as much information as you want; the only restriction applies to the “fuzzy” column of relations(). See the Fuzzy sets vignette: vignette("Fuzzy sets", "BaseSet").
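As a minimal sketch of what a fuzzy relation looks like, the following builds a TidySet from a data.frame whose fuzzy column holds membership values other than 1. The gene names, set names and membership values are made up for illustration, and the assumption that memberships lie between 0 and 1 should be checked against the Fuzzy sets vignette.

```r
library("BaseSet")

# Hypothetical memberships: each value says how strongly a gene belongs to a set.
# Assumption: fuzzy values are numeric and within [0, 1].
fuzzy_relations <- data.frame(
  elements = c("A", "B", "B", "C"),
  sets = c("geneset1", "geneset1", "geneset2", "geneset2"),
  fuzzy = c(1, 0.5, 0.9, 0.2)
)
fuzzy_set <- tidySet(fuzzy_relations)

# The memberships are kept in the fuzzy column of the relations slot
relations(fuzzy_set)
```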
You can also use the standard R approach with $ and [:
gene_data <- data.frame(
stat2 = c( 4, 4, 3, 5 ),
info2 = c("a", "b", "c", "d")
)
tidy_set$info1 <- NULL
tidy_set[, "elements", c("stat2", "info2")] <- gene_data
tidy_set[, "sets", "Group"] <- c("low", "high")
tidy_set
#> elements sets fuzzy Group Column stat1 stat2 info2
#> 1 A geneset1 1 low abc 1 4 a
#> 2 B geneset1 1 low abc 2 4 b
#> 3 B geneset2 1 high def 2 4 b
#> 4 C geneset2 1 high def 3 3 c
#> 5 D geneset2 1 high def 4 5 d
Observe that one can add, replace or delete columns this way.
Creating a TidySet
As you can see it is possible to create a TidySet from a list. More commonly you can create it from a data.frame:
relations <- data.frame(elements = c("a", "b", "c", "d", "e", "f"),
sets = c("A", "A", "A", "A", "A", "B"),
fuzzy = c(1, 1, 1, 1, 1, 1))
TS <- tidySet(relations)
TS
#> elements sets fuzzy
#> 1 a A 1
#> 2 b A 1
#> 3 c A 1
#> 4 d A 1
#> 5 e A 1
#> 6 f B 1
It is also possible from a matrix:
m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
dimnames = list(letters[1:3], LETTERS[1:3]))
m
#> A B C
#> a 0 1 0
#> b 0 1 1
#> c 1 1 0
tidy_set <- tidySet(m)
tidy_set
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
A TidySet can also be created from GeneSet and GeneSetCollection objects. Additionally, the package has several functions to read files related to sets, such as OBO files (getOBO) and GAF files (getGAF).
Converting to other formats
It is possible to extract the gene sets as a list, for use with functions such as lapply.
as.list(tidy_set)
#> $A
#> c
#> 1
#>
#> $B
#> a b c
#> 1 1 1
#>
#> $C
#> b
#> 1
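The list form can be processed with the usual apply family. The sketch below rebuilds the same TidySet from the matrix used above and relies only on what the printed output shows: each list entry is a named vector whose names are the elements and whose values are the fuzzy memberships. The variable names set_list and members are illustrative.

```r
library("BaseSet")

m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
            dimnames = list(letters[1:3], LETTERS[1:3]))
tidy_set <- tidySet(m)

set_list <- as.list(tidy_set)

# Keep only the element names of each set, dropping the membership values
members <- lapply(set_list, names)
members

# Count the elements of each set
vapply(set_list, length, integer(1))
```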
Or, if you need to apply some network methods that require a matrix, you can create one with incidence:
incidence(tidy_set)
#> A B C
#> c 1 1 0
#> a 0 1 0
#> b 0 1 1
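As an illustration of why the matrix form is convenient, the sketch below derives two adjacency-like matrices from the incidence matrix with base R matrix products. It assumes incidence() returns a plain numeric matrix with elements in rows and sets in columns, as the printed output suggests; the names set_overlap and element_overlap are illustrative.

```r
library("BaseSet")

m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
            dimnames = list(letters[1:3], LETTERS[1:3]))
tidy_set <- tidySet(m)
inc <- incidence(tidy_set)

# Sets x sets: number of elements shared by each pair of sets
# (the diagonal holds the size of each set)
set_overlap <- crossprod(inc)
set_overlap

# Elements x elements: number of sets shared by each pair of elements
element_overlap <- tcrossprod(inc)
element_overlap
```

Either result can be passed to network tools that accept a weighted adjacency matrix.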
Operations with sets
Several methods are provided to work with sets. In general you can provide a name for the set resulting from the operation; if you don’t, one will be generated automatically using naming(). All methods work with fuzzy and non-fuzzy sets.
Intersection
intersection(tidy_set, sets = c("A", "B"), name = "D", keep = TRUE)
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c D 1
The keep argument controls whether the previous sets are kept in the result. With keep = FALSE only the new set is returned:
intersection(tidy_set, sets = c("A", "B"), name = "D", keep = FALSE)
#> elements sets fuzzy
#> 1 c D 1
Complement
We can look for the complement of one or several sets:
complement_set(tidy_set, sets = c("A", "B"))
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c ∁A∪B 0
#> 7 a ∁A∪B 0
#> 8 b ∁A∪B 0
Observe that we haven’t provided a name for the resulting set, but we can provide one if we prefer to:
complement_set(tidy_set, sets = c("A", "B"), name = "F")
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c F 0
#> 7 a F 0
#> 8 b F 0
Subtract
This is the equivalent of setdiff, but clearer:
out <- subtract(tidy_set, set_in = "A", not_in = "B", name = "A-B")
out
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
name_sets(out)
#> [1] "A" "B" "C" "A-B"
subtract(tidy_set, set_in = "B", not_in = "A", keep = FALSE)
#> elements sets fuzzy
#> 1 a B∖A 1
#> 2 b B∖A 1
See that in the first case there isn’t any element present in A that is not in B, so no relation is added, but the new set A-B is still stored. In the second case we focus just on the elements that are present in B but not in A.
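The same two cases can be checked with base R on plain character vectors. This sketch only mirrors the memberships of sets A and B shown above; it does not use the TidySet class.

```r
# Memberships of the two sets used above, written as plain character vectors
A <- c("c")
B <- c("a", "b", "c")

# Explicit namespace, in case another attached package masks setdiff
in_A_not_B <- base::setdiff(A, B)  # character(0): nothing is in A that is not also in B
in_B_not_A <- base::setdiff(B, A)  # "a" "b": the elements of B that are not in A

in_A_not_B
in_B_not_A
```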
Additional information
The number of unique elements, sets and relations can be obtained using the nElements(), nSets() and nRelations() methods.
nElements(tidy_set)
#> [1] 3
nSets(tidy_set)
#> [1] 3
nRelations(tidy_set)
#> [1] 5
If you wish to know all three in a single call you can use dim(tidy_set): 3, 5, 3 (3 elements, 5 relations and 3 sets). This summary doesn’t provide the number of relations of each set. You can quickly obtain that with lengths(tidy_set): 1, 3, 1.
The size of each set can be obtained using the set_size() method.
set_size(tidy_set)
#> sets size probability
#> 1 A 1 1
#> 2 B 3 1
#> 3 C 1 1
Conversely, the number of sets associated with each gene is returned by the element_size() function.
element_size(tidy_set)
#> elements size probability
#> 1 c 2 1
#> 2 a 1 1
#> 3 b 2 1
The identifiers of elements and sets can be inspected and renamed using name_elements and name_sets:
name_elements(tidy_set)
#> [1] "c" "a" "b"
name_elements(tidy_set) <- paste0("Gene", seq_len(nElements(tidy_set)))
name_elements(tidy_set)
#> [1] "Gene1" "Gene2" "Gene3"
name_sets(tidy_set)
#> [1] "A" "B" "C"
name_sets(tidy_set) <- paste0("Geneset", seq_len(nSets(tidy_set)))
name_sets(tidy_set)
#> [1] "Geneset1" "Geneset2" "Geneset3"
Using dplyr verbs
You can also use mutate(), filter(), select(), group_by() and other dplyr verbs with TidySets. You usually need to choose which of the three slots you want to affect with activate():
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:BaseSet':
#>
#> union
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
m_TS <- tidy_set %>%
activate("relations") %>%
mutate(Important = runif(nRelations(tidy_set)))
m_TS
#> elements sets fuzzy Important
#> 1 Gene1 Geneset1 1 0.91612839
#> 2 Gene2 Geneset2 1 0.11411068
#> 3 Gene3 Geneset2 1 0.08081424
#> 4 Gene1 Geneset2 1 0.15312576
#> 5 Gene3 Geneset3 1 0.11016696
You can use activate() to select which slot the verbs modify:
set_modified <- m_TS %>%
activate("elements") %>%
mutate(Pathway = if_else(elements %in% c("Gene1", "Gene2"),
"pathway1",
"pathway2"))
set_modified
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.91612839 pathway1
#> 2 Gene2 Geneset2 1 0.11411068 pathway1
#> 3 Gene3 Geneset2 1 0.08081424 pathway2
#> 4 Gene1 Geneset2 1 0.15312576 pathway1
#> 5 Gene3 Geneset3 1 0.11016696 pathway2
set_modified %>%
deactivate() %>% # To apply a filter independently of where it is
filter(Pathway == "pathway1")
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.9161284 pathway1
#> 2 Gene2 Geneset2 1 0.1141107 pathway1
#> 3 Gene1 Geneset2 1 0.1531258 pathway1
If you think you need group_by, this usually means that you need a new set. You can create one with group.
# A new set named "new" with the elements in pathway1
set_modified %>%
deactivate() %>%
group(name = "new", Pathway == "pathway1")
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.91612839 pathway1
#> 2 Gene2 Geneset2 1 0.11411068 pathway1
#> 3 Gene3 Geneset2 1 0.08081424 pathway2
#> 4 Gene1 Geneset2 1 0.15312576 pathway1
#> 5 Gene3 Geneset3 1 0.11016696 pathway2
#> 6 Gene1 new 1 NA pathway1
#> 7 Gene2 new 1 NA pathway1
set_modified %>%
group("pathway1", elements %in% c("Gene1", "Gene2"))
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.91612839 pathway1
#> 2 Gene2 Geneset2 1 0.11411068 pathway1
#> 3 Gene3 Geneset2 1 0.08081424 pathway2
#> 4 Gene1 Geneset2 1 0.15312576 pathway1
#> 5 Gene3 Geneset3 1 0.11016696 pathway2
#> 6 Gene1 pathway1 1 NA pathway1
#> 7 Gene2 pathway1 1 NA pathway1
You can use group_by(), but it won’t return a TidySet.
set_modified %>%
deactivate() %>%
group_by(Pathway, sets) %>%
count()
#> # A tibble: 4 × 3
#> # Groups: Pathway, sets [4]
#> Pathway sets n
#> <chr> <chr> <int>
#> 1 pathway1 Geneset1 1
#> 2 pathway1 Geneset2 2
#> 3 pathway2 Geneset2 1
#> 4 pathway2 Geneset3 1
After grouping or mutating, we might sometimes want to move a column to a different slot. We can do this with move_to():
elements(set_modified)
#> elements Pathway
#> 1 Gene1 pathway1
#> 2 Gene2 pathway1
#> 3 Gene3 pathway2
out <- move_to(set_modified, "elements", "relations", "Pathway")
relations(out)
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.91612839 pathway1
#> 2 Gene2 Geneset2 1 0.11411068 pathway1
#> 3 Gene3 Geneset2 1 0.08081424 pathway2
#> 4 Gene1 Geneset2 1 0.15312576 pathway1
#> 5 Gene3 Geneset3 1 0.11016696 pathway2
Session info
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 BaseSet_0.9.0.9002
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.5 cli_3.6.1 knitr_1.45 rlang_1.1.2
#> [5] xfun_0.41 stringi_1.8.2 purrr_1.0.2 generics_0.1.3
#> [9] textshaping_0.3.7 jsonlite_1.8.8 glue_1.6.2 rprojroot_2.0.4
#> [13] htmltools_0.5.7 ragg_1.2.6 sass_0.4.7 fansi_1.0.5
#> [17] rmarkdown_2.25 tibble_3.2.1 evaluate_0.23 jquerylib_0.1.4
#> [21] fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.4 memoise_2.0.1
#> [25] stringr_1.5.1 compiler_4.3.2 fs_1.6.3 pkgconfig_2.0.3
#> [29] systemfonts_1.0.5 digest_0.6.33 R6_2.5.1 tidyselect_1.2.0
#> [33] utf8_1.2.4 pillar_1.9.0 magrittr_2.0.3 bslib_0.6.1
#> [37] tools_4.3.2 pkgdown_2.0.7 cachem_1.0.8 desc_1.4.2