
About BaseSet
Lluís Revilla
August Pi i Sunyer Biomedical Research Institute (IDIBAPS); Liver Unit, Hospital Clinic
2023 Mar 06
Source: vignettes/basic.Rmd
Abstract
Describes the background of the package, the important functions it defines, and some of its applications and uses.
The TidySet class
This is a basic example which shows you how to create a TidySet
object, to store associations between genes and sets:
library("BaseSet")
gene_lists <- list(
geneset1 = c("A", "B"),
geneset2 = c("B", "C", "D")
)
tidy_set <- tidySet(gene_lists)
tidy_set
#> elements sets fuzzy
#> 1 A geneset1 1
#> 2 B geneset1 1
#> 3 B geneset2 1
#> 4 C geneset2 1
#> 5 D geneset2 1
This is then stored internally in three slots: relations, elements, and sets.
If you have more information for each element or set it can be added:
gene_data <- data.frame(
stat1 = c( 1, 2, 3, 4 ),
info1 = c("a", "b", "c", "d")
)
tidy_set <- add_column(tidy_set, "elements", gene_data)
set_data <- data.frame(
Group = c( 100, 200 ),
Colum = c( "abc", "def")
)
tidy_set <- add_column(tidy_set, "sets", set_data)
tidy_set
#> elements sets fuzzy Group Colum stat1 info1
#> 1 A geneset1 1 100 abc 1 a
#> 2 B geneset1 1 100 abc 2 b
#> 3 B geneset2 1 200 def 2 b
#> 4 C geneset2 1 200 def 3 c
#> 5 D geneset2 1 200 def 4 d
The added data is stored in the corresponding slot; each of the three slots can be accessed directly using its getter method:
relations(tidy_set)
#> elements sets fuzzy
#> 1 A geneset1 1
#> 2 B geneset1 1
#> 3 B geneset2 1
#> 4 C geneset2 1
#> 5 D geneset2 1
elements(tidy_set)
#> elements stat1 info1
#> 1 A 1 a
#> 2 B 2 b
#> 3 C 3 c
#> 4 D 4 d
sets(tidy_set)
#> sets Group Colum
#> 1 geneset1 100 abc
#> 2 geneset2 200 def
You can add as much information as you want; the only restriction concerns the “fuzzy” column of the relations. See the Fuzzy sets vignette.
Creating a TidySet
As shown above, a TidySet can be created from a list. It can also be created from a data.frame or from a matrix:
m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
            dimnames = list(letters[1:3], LETTERS[1:3]))
m
#> A B C
#> a 0 1 0
#> b 0 1 1
#> c 1 1 0
tidy_set <- tidySet(m)
They can also be created from GeneSet and GeneSetCollection objects. Additionally, the package has several functions to read set-related files, such as OBO files (getOBO) and GAF files (getGAF).
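Creation from a data.frame is mentioned above but not shown. A minimal sketch, assuming the data.frame holds one row per relation with columns named elements and sets, as in the printed TidySet output; the object names relations_df and df_set are only illustrative:

```r
library("BaseSet")

# One row per element-set relation. The column names "elements" and "sets"
# are an assumption based on the printed TidySet output above.
relations_df <- data.frame(
  elements = c("A", "B", "B", "C", "D"),
  sets = c("geneset1", "geneset1", "geneset2", "geneset2", "geneset2")
)
df_set <- tidySet(relations_df)

# Same relations as the list-based example
nRelations(df_set)
```

This should describe the same five relations as the list-based example at the start of the vignette.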
Converting to other formats
It is possible to extract the gene sets as a list, for use with functions such as lapply.
as.list(tidy_set)
#> $A
#> c
#> 1
#>
#> $B
#> a b c
#> 1 1 1
#>
#> $C
#> b
#> 1
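To illustrate the kind of processing this list allows, here is a small base R sketch. The list is written out by hand to mirror the output printed above (one named numeric vector per set), so it runs without the package; set_list, set_members and set_lengths are illustrative names:

```r
# Hand-built copy of the printed list: element names are the vector names,
# membership values are the vector values.
set_list <- list(
  A = c(c = 1),
  B = c(a = 1, b = 1, c = 1),
  C = c(b = 1)
)

# Element names of each set
set_members <- lapply(set_list, names)

# Number of elements in each set
set_lengths <- lengths(set_list)
```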
Or, if you need a matrix in order to apply network methods, you can create it with incidence:
incidence(tidy_set)
#> A B C
#> c 1 1 0
#> a 0 1 0
#> b 0 1 1
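As one possible example of a network-style computation on such a matrix, cross-products of the incidence matrix count shared members. This is a sketch in base R, with the matrix copied by hand from the output above; the names inc, shared_elements and shared_sets are illustrative:

```r
# Incidence matrix as printed above (rows: elements, columns: sets)
inc <- matrix(c(1, 0, 0, 1, 1, 1, 0, 0, 1), nrow = 3,
              dimnames = list(c("c", "a", "b"), c("A", "B", "C")))

# Set-by-set matrix: number of elements shared by each pair of sets
shared_elements <- crossprod(inc)

# Element-by-element matrix: number of sets shared by each pair of elements
shared_sets <- tcrossprod(inc)
```

The diagonal of shared_elements holds the size of each set, and the diagonal of shared_sets holds the number of sets each element belongs to.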
Operations with sets
Several methods are provided to work with sets. In general you can provide a name for the set resulting from the operation; if you don’t, one is generated automatically using naming. All methods work with fuzzy and non-fuzzy sets.
Intersection
intersection(tidy_set, sets = c("A", "B"), name = "D", keep = TRUE)
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c D 1
The keep argument controls whether the previous sets are kept in the result. With keep = FALSE only the new set is returned:
intersection(tidy_set, sets = c("A", "B"), name = "D", keep = FALSE)
#> elements sets fuzzy
#> 1 c D 1
Complement
We can look for the complement of one or several sets:
complement_set(tidy_set, sets = c("A", "B"))
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c ∁A∪B 0
#> 7 a ∁A∪B 0
#> 8 b ∁A∪B 0
Observe that we haven’t provided a name for the resulting set, but we can provide one if we prefer:
complement_set(tidy_set, sets = c("A", "B"), name = "F")
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c F 0
#> 7 a F 0
#> 8 b F 0
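The fuzzy value of 0 in the new set can be understood with the usual fuzzy-set definitions: union as the maximum membership and complement as one minus the membership. The sketch below applies them by hand in base R; that these are the definitions used by the package here is an assumption, and the object names are illustrative:

```r
# Membership of each element in sets A and B (0 means "not in the set")
membership <- rbind(
  A = c(a = 0, b = 0, c = 1),
  B = c(a = 1, b = 1, c = 1)
)

# Fuzzy union (assumed definition): element-wise maximum across the sets
union_ab <- apply(membership, 2, max)

# Fuzzy complement (assumed definition): one minus the membership
complement_ab <- 1 - union_ab
```

Every element belongs fully to the union of A and B, so its membership in the complement is 0, in line with the output above.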
Subtract
This is the equivalent of setdiff, but clearer:
out <- subtract(tidy_set, set_in = "A", not_in = "B", name = "A-B")
out
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
name_sets(out)
#> [1] "A" "B" "C" "A-B"
subtract(tidy_set, set_in = "B", not_in = "A", keep = FALSE)
#> elements sets fuzzy
#> 1 a B∖A 1
#> 2 b B∖A 1
Note that in the first case no element is present in A but not in B, so the new set A-B is empty, yet it is still stored. In the second case we keep only the new set, which holds the elements that are present in B but not in A.
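For comparison, the same two differences computed with base R's setdiff, with the sets written out by hand (set_a and set_b are illustrative names):

```r
# Contents of sets A and B in the example TidySet
set_a <- c("c")
set_b <- c("a", "b", "c")

# Elements in A but not in B: empty
a_minus_b <- setdiff(set_a, set_b)

# Elements in B but not in A
b_minus_a <- setdiff(set_b, set_a)
```

With setdiff the direction depends only on argument order, whereas subtract names both roles explicitly through set_in and not_in.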
Additional information
The number of unique elements, sets and relations can be obtained using the nElements, nSets and nRelations methods.
nElements(tidy_set)
#> [1] 3
nSets(tidy_set)
#> [1] 3
nRelations(tidy_set)
#> [1] 5
The size of each gene set can be obtained using the set_size method.
set_size(tidy_set, "A")
#> sets size probability
#> 1 A 1 1
Conversely, the number of sets associated with each gene is returned by the element_size function.
element_size(tidy_set)
#> elements size probability
#> 1 c 2 1
#> 2 a 1 1
#> 3 b 2 1
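For non-fuzzy sets such as these, the sizes can be cross-checked against the incidence matrix shown earlier: column sums give set sizes and row sums give the number of sets per element. A base R sketch with the matrix copied by hand (inc, sizes_of_sets and sizes_of_elements are illustrative names):

```r
# Incidence matrix as printed earlier (rows: elements, columns: sets)
inc <- matrix(c(1, 0, 0, 1, 1, 1, 0, 0, 1), nrow = 3,
              dimnames = list(c("c", "a", "b"), c("A", "B", "C")))

# Number of elements in each set
sizes_of_sets <- colSums(inc)

# Number of sets each element belongs to
sizes_of_elements <- rowSums(inc)
```

This shortcut only covers the non-fuzzy case, where every membership is 0 or 1.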
The identifiers of elements and sets can be inspected and renamed using name_elements and name_sets:
name_elements(tidy_set)
#> [1] "c" "a" "b"
name_elements(tidy_set) <- paste0("Gene", seq_len(nElements(tidy_set)))
name_elements(tidy_set)
#> [1] "Gene1" "Gene2" "Gene3"
name_sets(tidy_set)
#> [1] "A" "B" "C"
name_sets(tidy_set) <- paste0("Geneset", seq_len(nSets(tidy_set)))
name_sets(tidy_set)
#> [1] "Geneset1" "Geneset2" "Geneset3"
Using dplyr verbs
You can also use mutate, filter and other dplyr verbs with TidySets (the only exception being group_by), but you usually need to select which of the three slots you want to affect with activate:
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:BaseSet':
#>
#> union
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
m_TS <- tidy_set %>%
activate("relations") %>%
mutate(Important = runif(nRelations(tidy_set)))
m_TS
#> elements sets fuzzy Important
#> 1 Gene1 Geneset1 1 0.58615127
#> 2 Gene2 Geneset2 1 0.78079681
#> 3 Gene3 Geneset2 1 0.25116542
#> 4 Gene1 Geneset2 1 0.07821071
#> 5 Gene3 Geneset3 1 0.57316936
You can use activate to select which slot the verbs modify:
set_modified <- m_TS %>%
activate("elements") %>%
mutate(Pathway = if_else(elements %in% c("Gene1", "Gene2"),
"pathway1",
"pathway2"))
set_modified
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.58615127 pathway1
#> 2 Gene2 Geneset2 1 0.78079681 pathway1
#> 3 Gene3 Geneset2 1 0.25116542 pathway2
#> 4 Gene1 Geneset2 1 0.07821071 pathway1
#> 5 Gene3 Geneset3 1 0.57316936 pathway2
set_modified %>%
deactivate() %>% # To apply a filter independently of where it is
filter(Pathway == "pathway1")
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.58615127 pathway1
#> 2 Gene2 Geneset2 1 0.78079681 pathway1
#> 3 Gene1 Geneset2 1 0.07821071 pathway1
If you think you need group_by, this usually means that you need a new set. You can create one with group:
# A new group of those elements in pathway1
set_modified %>%
deactivate() %>%
group(name = "new", Pathway == "pathway1")
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.58615127 pathway1
#> 2 Gene2 Geneset2 1 0.78079681 pathway1
#> 3 Gene3 Geneset2 1 0.25116542 pathway2
#> 4 Gene1 Geneset2 1 0.07821071 pathway1
#> 5 Gene3 Geneset3 1 0.57316936 pathway2
#> 6 Gene1 new 1 NA pathway1
#> 7 Gene2 new 1 NA pathway1
set_modified %>%
group("pathway1", elements %in% c("Gene1", "Gene2"))
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.58615127 pathway1
#> 2 Gene2 Geneset2 1 0.78079681 pathway1
#> 3 Gene3 Geneset2 1 0.25116542 pathway2
#> 4 Gene1 Geneset2 1 0.07821071 pathway1
#> 5 Gene3 Geneset3 1 0.57316936 pathway2
#> 6 Gene1 pathway1 1 NA pathway1
#> 7 Gene2 pathway1 1 NA pathway1
After grouping or mutating, we might want to move a column from one slot to another. We can do this with move_to:
elements(set_modified)
#> elements Pathway
#> 1 Gene1 pathway1
#> 2 Gene2 pathway1
#> 3 Gene3 pathway2
out <- move_to(set_modified, "elements", "relations", "Pathway")
relations(out)
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.58615127 pathway1
#> 2 Gene2 Geneset2 1 0.78079681 pathway1
#> 3 Gene3 Geneset2 1 0.25116542 pathway2
#> 4 Gene1 Geneset2 1 0.07821071 pathway1
#> 5 Gene3 Geneset3 1 0.57316936 pathway2
Session info
#> R version 4.2.2 Patched (2022-11-10 r83330)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.0 BaseSet_0.0.17.9000
#>
#> loaded via a namespace (and not attached):
#> [1] knitr_1.42 magrittr_2.0.3 tidyselect_1.2.0 R6_2.5.1
#> [5] ragg_1.2.5 rlang_1.0.6 fastmap_1.1.1 fansi_1.0.4
#> [9] stringr_1.5.0 tools_4.2.2 xfun_0.37 utf8_1.2.3
#> [13] cli_3.6.0 withr_2.5.0 jquerylib_0.1.4 systemfonts_1.0.4
#> [17] htmltools_0.5.4 yaml_2.3.7 digest_0.6.31 rprojroot_2.0.3
#> [21] tibble_3.1.8 lifecycle_1.0.3 pkgdown_2.0.7 textshaping_0.3.6
#> [25] purrr_1.0.1 sass_0.4.5 vctrs_0.5.2 fs_1.6.1
#> [29] memoise_2.0.1 glue_1.6.2 cachem_1.0.7 evaluate_0.20
#> [33] rmarkdown_2.20 stringi_1.7.12 pillar_1.8.1 compiler_4.2.2
#> [37] bslib_0.4.2 generics_0.1.3 desc_1.4.2 jsonlite_1.8.4
#> [41] pkgconfig_2.0.3