---
title: "Surprisal Analysis Guidelines"
output: github_document
vignette: >
  %\VignetteIndexEntry{Surprisal analysis guidelines}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Surprisal Analysis, an R package for information-theoretic analysis of gene expression data
```{r}
library(SurprisalAnalysis)
library(ggplot2)
```
Read data and apply Surprisal analysis
```{r}
# Load the example dataset shipped with the package
data <- read.csv(
  system.file("extdata", "helper_T_cell_0_test.csv", package = "SurprisalAnalysis"),
  header = TRUE
)
results <- surprisal_analysis(data)
# The second element of the result holds the transcript weights
transcript_weights <- results[[2]]
percentile_GO <- 0.95 # quantile cutoff; change based on your preference
lambda_no <- 2 # pattern to analyze; lambda #1 is the baseline state
```
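Surprisal analysis is commonly carried out by taking a singular value decomposition (SVD) of the log-transformed expression matrix. The chunk below is a minimal sketch of that general idea on a small simulated matrix. It is not taken from the package source, and the names `expr`, `G`, and `lambda` are illustrative assumptions rather than the structure of `results`.

```{r}
set.seed(1)

# Toy expression matrix (simulated, strictly positive): 6 transcripts x 4 samples
expr <- matrix(rexp(24, rate = 0.1) + 1, nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("t", 1:4)))
log_expr <- log(expr)

# SVD of the log-expression matrix
sv <- svd(log_expr)
G <- sv$u                          # transcript weights, one column per pattern
rownames(G) <- rownames(expr)
lambda <- diag(sv$d) %*% t(sv$v)   # pattern amplitudes, one row per pattern

# The patterns together reproduce the log-expression values
reconstruction <- G %*% lambda
all.equal(reconstruction, log_expr, check.attributes = FALSE)

# The sign of each pattern is arbitrary: flipping a column of G together with
# the matching row of lambda leaves the reconstruction unchanged
G_flipped <- G
G_flipped[, 2] <- -G[, 2]
lambda_flipped <- lambda
lambda_flipped[2, ] <- -lambda[2, ]
all.equal(G_flipped %*% lambda_flipped, reconstruction)
```

Because an SVD fixes each pattern only up to a factor of -1, the sign of a column of weights carries no meaning on its own; it only has meaning relative to the sign of the corresponding amplitudes.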
Run GO analysis
```{r, eval = FALSE}
GO.results <- GO_analysis_surprisal_analysis(
  transcript_weights, percentile_GO, lambda_no, key_type = "SYMBOL",
  flip = FALSE, species.db.str = "org.Mm.eg.db", top_GO_terms = 15)
```
The function GO_analysis_surprisal_analysis() runs Gene Ontology (GO) enrichment on the most influential transcripts from a chosen Surprisal pattern. Below are the input arguments:

- `transcript_weights`: A matrix of transcript weights, typically the second element (`[[2]]`) returned by `surprisal_analysis()`.
- `percentile_GO`: A numeric value between 0 and 1 specifying the quantile cutoff for transcript selection. Example: `0.95` means only the top 5% of transcripts (by absolute weight) in the chosen $\lambda$ pattern are used.
- `lambda_no`: An integer specifying which $\lambda$ pattern to analyze. Note: $\lambda_1$ represents the balance (baseline) state, while higher-order $\lambda$s capture additional constraints or patterns.
- `key_type`: The type of transcript identifiers used in your data. This must match the ID format in your input dataset. Options include:
    - `"SYMBOL"` (gene symbols, e.g. TP53)
    - `"ENTREZID"` (Entrez gene IDs)
    - `"ENSEMBL"` (Ensembl IDs)
    - `"PROBEID"` (microarray probe IDs)
- `flip`: Logical (`TRUE`/`FALSE`). If `TRUE`, multiplies transcript weights for the selected $\lambda$ by $-1$ before selecting the top quantile. Useful for ensuring consistency with the direction of $\lambda$ plots.
- `species.db.str`: The organism database to use for gene mapping. Current options:
    - `"org.Hs.eg.db"` for Homo sapiens (human)
    - `"org.Mm.eg.db"` for Mus musculus (mouse)
- `ont`: The GO ontology branch for enrichment analysis. Options:
    - `"BP"`: Biological Process (default)
    - `"MF"`: Molecular Function
    - `"CC"`: Cellular Component
- `pAdjustMethod`: The multiple testing correction method. Options include: `"BH"` (default), `"bonferroni"`, `"holm"`, `"hochberg"`, `"hommel"`, `"BY"`, `"none"`.
- `top_GO_terms`: An integer specifying the number of top enriched GO terms to return (default: 15).

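
To make the roles of `percentile_GO`, `lambda_no`, `flip`, and `pAdjustMethod` more concrete, here is a minimal base R sketch on simulated weights. The helper `select_top_transcripts()` is hypothetical and not part of the package. It assumes that rows are transcripts, that columns are $\lambda$ patterns, and that selection uses the signed weight, which is the only case in which `flip` changes the result; with a cutoff on absolute weight, flipping would have no effect. The package's own selection rule may differ.

```{r}
# Hypothetical helper, for illustration only (not part of SurprisalAnalysis)
select_top_transcripts <- function(weights, percentile, lambda_no, flip = FALSE) {
  w <- weights[, lambda_no]          # weights of the chosen pattern
  if (flip) w <- -w                  # reverse the direction of the pattern
  cutoff <- quantile(w, probs = percentile, names = FALSE)
  names(w)[w >= cutoff]              # transcripts at or above the cutoff
}

set.seed(42)
# Simulated weights: 100 transcripts x 2 patterns (assumed layout)
toy_weights <- matrix(rnorm(200), nrow = 100,
                      dimnames = list(paste0("gene", 1:100),
                                      c("lambda1", "lambda2")))

top <- select_top_transcripts(toy_weights, 0.95, 2)
top_flipped <- select_top_transcripts(toy_weights, 0.95, 2, flip = TRUE)
intersect(top, top_flipped)          # the two selections do not overlap

# The pAdjustMethod options listed above are also method names accepted by
# stats::p.adjust(); compare two corrections on made-up p-values
p_raw <- c(0.001, 0.01, 0.02, 0.04, 0.2)
p_bh <- p.adjust(p_raw, method = "BH")
p_bonf <- p.adjust(p_raw, method = "bonferroni")
data.frame(p_raw, p_bh, p_bonf)
```

In this sketch `flip = TRUE` selects the transcripts at the opposite end of the pattern, so the sign convention decides which gene set is passed on to the enrichment step.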
```{r, eval = FALSE}
ggplot(GO.results, aes(x = Description, y = Count, fill = p.adjust)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "#790915", high = "#062c5c") +
  theme_minimal() +
  theme(
    # Remove panel border
    panel.border = element_blank(),
    # Remove panel background and grid lines
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    # Add axis line
    axis.line = element_line(colour = "black"),
    axis.title.y = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 20),
    text = element_text(size = 18)
  ) +
  coord_flip() +
  labs(tag = "A", title = "GO analysis")
```
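By default ggplot2 orders the bars alphabetically by `Description`. If sorted bars are preferred, one option is to reorder the factor levels by `Count` before plotting. The chunk below sketches this on a made-up data frame; `toy_go` and its values are illustrative and only mimic the three columns used in the plot above.

```{r}
library(ggplot2)

# Made-up data with the same column names as those used in the plot above
toy_go <- data.frame(
  Description = c("term A", "term B", "term C", "term D"),
  Count = c(12, 30, 7, 18),
  p.adjust = c(0.010, 0.001, 0.040, 0.020),
  stringsAsFactors = FALSE
)

# Order factor levels by Count; after coord_flip() the largest bar is on top
toy_go$Description <- factor(toy_go$Description,
                             levels = toy_go$Description[order(toy_go$Count)])

p <- ggplot(toy_go, aes(x = Description, y = Count, fill = p.adjust)) +
  geom_col() +   # equivalent to geom_bar(stat = "identity")
  scale_fill_gradient(low = "#790915", high = "#062c5c") +
  coord_flip()
p
```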