
Advanced examples
Lluís Revilla
lluis.revilla@gmail.com2023 Dec 04
Source:vignettes/advanced.Rmd
advanced.Rmd
Abstract
This vignette assumes you are familiar with set operations from the basic vignette.
Initial setup
To show compatibility with tidy workflows we will use magrittr pipe operator and the dplyr verbs.
Human gene ontology
We will explore the genes with assigned gene ontology terms. These terms describe what is the process and role of the genes. The links are annotated with different evidence codes to indicate how such annotation is supported.
# We load some libraries
library("org.Hs.eg.db", quietly = TRUE)
library("GO.db", quietly = TRUE)
library("ggplot2", quietly = TRUE)
# Prepare the data
h2GO_TS <- tidySet(org.Hs.egGO)
h2GO <- as.data.frame(org.Hs.egGO)
We can now explore if there are differences in evidence usage for each ontology in gene ontology:
library("forcats", include.only = "fct_reorder2", quietly = TRUE)
h2GO %>%
group_by(Evidence, Ontology) %>%
count(name = "Freq") %>%
ungroup() %>%
mutate(Evidence = fct_reorder2(Evidence, Ontology, -Freq),
Ontology = case_match(Ontology,
"CC" ~ "Cellular Component",
"MF" ~ "Molecular Function",
"BP" ~ "Biological Process",
.default = NA)) %>%
ggplot() +
geom_col(aes(Evidence, Freq)) +
facet_grid(~Ontology) +
theme_minimal() +
coord_flip() +
labs(x = element_blank(), y = element_blank(),
title = "Evidence codes for each ontology")
We can see that biological process are more likely to be defined by IMP evidence code that means inferred from mutant phenotype. While inferred from physical interaction (IPI) is almost exclusively used to assign molecular functions.
This graph doesn’t consider that some relationships are better annotated than other:
h2GO_TS %>%
relations() %>%
group_by(elements, sets) %>%
count(sort = TRUE, name = "Annotations") %>%
ungroup() %>%
count(Annotations, sort = TRUE) %>%
ggplot() +
geom_col(aes(Annotations, n)) +
theme_minimal() +
labs(x = "Evidence codes", y = "Annotations",
title = "Evidence codes for each annotation",
subtitle = "in human") +
scale_x_continuous(breaks = 1:7)
We can see that mostly all the annotations are done with a single evidence code. So far we have explored the code that it is related to a gene but how many genes don’t have any annotation?
# Add all the genes and GO terms
h2GO_TS <- add_elements(h2GO_TS, keys(org.Hs.eg.db)) %>%
add_sets(grep("^GO:", keys(GO.db), value = TRUE))
sizes_element <- element_size(h2GO_TS) %>%
arrange(desc(size))
sum(sizes_element$size == 0)
#> [1] 170422
sum(sizes_element$size != 0)
#> [1] 20757
sizes_set <- set_size(h2GO_TS) %>%
arrange(desc(size))
sum(sizes_set$size == 0)
#> [1] 24078
sum(sizes_set$size != 0)
#> [1] 18809
So we can see that both there are more genes without annotation and more gene ontology terms without a (direct) gene annotated.
sizes_element %>%
filter(size != 0) %>%
ggplot() +
geom_histogram(aes(size), binwidth = 1) +
theme_minimal() +
labs(x = "# sets per element", y = "Count")
sizes_set %>%
filter(size != 0) %>%
ggplot() +
geom_histogram(aes(size), binwidth = 1) +
theme_minimal() +
labs(x = "# elements per set", y = "Count")
As you can see on the second plot we have very large values but that are on associated on many genes:
head(sizes_set, 10)
#> sets size probability Ontology
#> 1 GO:0005515 12657 1 MF
#> 2 GO:0005634 5629 1 CC
#> 3 GO:0005829 5428 1 CC
#> 4 GO:0005886 5169 1 CC
#> 5 GO:0005737 5062 1 CC
#> 6 GO:0005654 3896 1 CC
#> 7 GO:0016020 3237 1 CC
#> 8 GO:0046872 2376 1 MF
#> 9 GO:0070062 2211 1 CC
#> 10 GO:0005576 1969 1 CC
Using fuzzy values
This could radically change if we used fuzzy values. We could assign a fuzzy value to each evidence code given the lowest fuzzy value for the IEA (Inferred from Electronic Annotation) evidence. The highest values would be for evidence codes coming from experiments or alike.
nr <- h2GO_TS %>%
relations() %>%
dplyr::select(sets, Evidence) %>%
distinct() %>%
mutate(fuzzy = case_match(Evidence,
"EXP" ~ 0.9,
"IDA" ~ 0.8,
"IPI" ~ 0.8,
"IMP" ~ 0.75,
"IGI" ~ 0.7,
"IEP" ~ 0.65,
"HEP" ~ 0.6,
"HDA" ~ 0.6,
"HMP" ~ 0.5,
"IBA" ~ 0.45,
"ISS" ~ 0.4,
"ISO" ~ 0.32,
"ISA" ~ 0.32,
"ISM" ~ 0.3,
"RCA" ~ 0.2,
"TAS" ~ 0.15,
"NAS" ~ 0.1,
"IC" ~ 0.02,
"ND" ~ 0.02,
"IEA" ~ 0.01,
.default = 0.01)) %>%
dplyr::select(sets = "sets", elements = "Evidence", fuzzy = fuzzy)
We have several evidence codes for the same ontology, this would result on different fuzzy values for each relation. Instead, we extract this and add them as new sets and elements and add an extra column to classify what are those elements:
ts <- h2GO_TS %>%
relations() %>%
dplyr::select(-Evidence) %>%
rbind(nr) %>%
tidySet() %>%
mutate_element(Type = ifelse(grepl("^[0-9]+$", elements), "gene", "evidence"))
Now we can see which gene ontologies are more supported by the evidence:
ts %>%
dplyr::filter(Type != "Gene") %>%
cardinality() %>%
arrange(desc(cardinality)) %>%
head()
#> sets cardinality
#> 1 GO:0005515 12659.91
#> 2 GO:0005634 5635.00
#> 3 GO:0005829 5432.78
#> 4 GO:0005886 5175.50
#> 5 GO:0005737 5067.10
#> 6 GO:0005654 3899.26
Surprisingly the most supported terms are protein binding, nucleus and cytosol. I would expect them to be the top three terms for cellular component, biological function and molecular function.
Calculating set sizes would be interesting but it requires computing a big number of combinations that make it last long and require many memory available.
ts %>%
filter(sets %in% c("GO:0008152", "GO:0003674", "GO:0005575"),
Type != "gene") %>%
set_size()
#> sets size probability
#> 1 GO:0003674 0 0.98
#> 2 GO:0003674 1 0.02
#> 3 GO:0005575 0 0.98
#> 4 GO:0005575 1 0.02
#> 5 GO:0008152 0 0.99
#> 6 GO:0008152 1 0.01
Unexpectedly there is few evidence for the main terms:
go_terms <- c("GO:0008152", "GO:0003674", "GO:0005575")
ts %>%
filter(sets %in% go_terms & Type != "gene")
#> elements sets fuzzy Type
#> 1 IEA GO:0008152 0.01 evidence
#> 2 ND GO:0005575 0.02 evidence
#> 3 ND GO:0003674 0.02 evidence
In fact those terms are arbitrarily decided or inferred from electronic analysis.
Human pathways
Now we will repeat the same analysis with pathways:
# We load some libraries
library("reactome.db")
# Prepare the data (is easier, there isn't any ontoogy or evidence column)
h2p <- as.data.frame(reactomeEXTID2PATHID)
colnames(h2p) <- c("sets", "elements")
# Filter only for human pathways
h2p <- h2p[grepl("^R-HSA-", h2p$sets), ]
# There are duplicate relations with different evidence codes!!:
summary(duplicated(h2p[, c("elements", "sets")]))
#> Mode FALSE TRUE
#> logical 128328 14231
h2p <- unique(h2p)
# Create a TidySet and
h2p_TS <- tidySet(h2p) %>%
# Add all the genes
add_elements(keys(org.Hs.eg.db))
Now that we have everything ready we can start measuring some things…
sizes_element <- element_size(h2p_TS) %>%
arrange(desc(size))
sum(sizes_element$size == 0)
#> [1] 180170
sum(sizes_element$size != 0)
#> [1] 11280
sizes_set <- set_size(h2p_TS) %>%
arrange(desc(size))
We can see there are more genes without pathways than genes with pathways.
sizes_element %>%
filter(size != 0) %>%
ggplot() +
geom_histogram(aes(size), binwidth = 1) +
scale_y_log10() +
theme_minimal() +
labs(x = "# sets per element", y = "Count")
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 282 rows containing missing values (`geom_bar()`).
sizes_set %>%
ggplot() +
geom_histogram(aes(size), binwidth = 1) +
scale_y_log10() +
theme_minimal() +
labs(x = "# elements per set", y = "Count")
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 2342 rows containing missing values (`geom_bar()`).
As you can see on the second plot we have very large values but that are on associated on many genes:
head(sizes_set, 10)
#> sets size probability
#> 1 R-HSA-162582 2594 1
#> 2 R-HSA-1430728 2126 1
#> 3 R-HSA-168256 2068 1
#> 4 R-HSA-1643685 1998 1
#> 5 R-HSA-392499 1986 1
#> 6 R-HSA-74160 1547 1
#> 7 R-HSA-597592 1463 1
#> 8 R-HSA-73857 1377 1
#> 9 R-HSA-212436 1253 1
#> 10 R-HSA-1266738 1236 1
Session info
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] reactome.db_1.86.2 forcats_1.0.0 ggplot2_3.4.4
#> [4] GO.db_3.18.0 org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1
#> [7] IRanges_2.36.0 S4Vectors_0.40.2 Biobase_2.62.0
#> [10] BiocGenerics_0.48.1 dplyr_1.1.4 BaseSet_0.9.0.9002
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.42.0 gtable_0.3.4 xfun_0.41
#> [4] bslib_0.6.1 generics_0.1.3 vctrs_0.6.5
#> [7] tools_4.3.2 bitops_1.0-7 fansi_1.0.5
#> [10] tibble_3.2.1 RSQLite_2.3.3 highr_0.10
#> [13] blob_1.2.4 pkgconfig_2.0.3 desc_1.4.2
#> [16] graph_1.80.0 lifecycle_1.0.4 GenomeInfoDbData_1.2.11
#> [19] farver_2.1.1 compiler_4.3.2 stringr_1.5.1
#> [22] textshaping_0.3.7 Biostrings_2.70.1 munsell_0.5.0
#> [25] GenomeInfoDb_1.38.1 htmltools_0.5.7 sass_0.4.7
#> [28] RCurl_1.98-1.13 yaml_2.3.7 pillar_1.9.0
#> [31] pkgdown_2.0.7 crayon_1.5.2 jquerylib_0.1.4
#> [34] cachem_1.0.8 tidyselect_1.2.0 digest_0.6.33
#> [37] stringi_1.8.2 purrr_1.0.2 labeling_0.4.3
#> [40] rprojroot_2.0.4 fastmap_1.1.1 grid_4.3.2
#> [43] colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
#> [46] utf8_1.2.4 XML_3.99-0.16 GSEABase_1.64.0
#> [49] withr_2.5.2 scales_1.3.0 bit64_4.0.5
#> [52] rmarkdown_2.25 XVector_0.42.0 httr_1.4.7
#> [55] bit_4.0.5 ragg_1.2.6 png_0.1-8
#> [58] memoise_2.0.1 evaluate_0.23 knitr_1.45
#> [61] rlang_1.1.2 xtable_1.8-4 glue_1.6.2
#> [64] DBI_1.1.3 annotate_1.80.0 jsonlite_1.8.8
#> [67] R6_2.5.1 systemfonts_1.0.5 fs_1.6.3
#> [70] zlibbioc_1.48.0