Surprisal Analysis Guidelines

Surprisal Analysis, an R package for information-theoretic analysis of gene expression data

library(SurprisalAnalysis)
library(ggplot2)

Read data and apply Surprisal analysis

data <- read.csv(system.file("extdata", "helper_T_cell_0_test.csv.gz", package = "SurprisalAnalysis"), header=TRUE)
results <- surprisal_analysis(data)
transcript_weights <- results[[2]] #second element holds the transcript weights
percentile_GO <- 0.95 #change based on your preference
lambda_no <- 2 #change based on your preference, lambda #1 is the baseline state
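
Before running the GO step, it can help to confirm that the chosen settings are compatible with the weight matrix. The following is a minimal sketch, assuming transcript_weights is a numeric matrix with one column per \(\lambda\) pattern and transcript identifiers as row names. The helper check_go_inputs() and the toy matrix are hypothetical and are not part of the package.

```r
# Hypothetical helper (not part of SurprisalAnalysis): validates the GO settings
check_go_inputs <- function(weights, percentile, lambda_no) {
  stopifnot(
    is.matrix(weights), is.numeric(weights),
    percentile > 0, percentile < 1,              # quantile cutoff between 0 and 1
    lambda_no == round(lambda_no),               # whole-number pattern index
    lambda_no >= 1, lambda_no <= ncol(weights)   # pattern must exist in the matrix
  )
  invisible(TRUE)
}

# Toy stand-in for results[[2]]: 6 transcripts x 3 lambda patterns (made-up values)
set.seed(1)
toy_weights <- matrix(
  rnorm(18), nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("lambda", 1:3))
)

check_go_inputs(toy_weights, percentile = 0.95, lambda_no = 2)
```

With the real data, the same call would take transcript_weights, percentile_GO, and lambda_no; if results[[2]] turns out to be a data frame rather than a matrix, the is.matrix() check would need to be relaxed.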

Run GO analysis

GO.results <- GO_analysis_surprisal_analysis(
  transcript_weights, percentile_GO, lambda_no,
  key_type = "SYMBOL", flip = FALSE,
  species.db.str = "org.Mm.eg.db", top_GO_terms = 15
)

The function GO_analysis_surprisal_analysis() runs Gene Ontology (GO) enrichment on the most influential transcripts from a chosen Surprisal pattern. Below are the input arguments:

transcript_weights
A matrix of transcript weights, typically the second element ([[2]]) returned from the Surprisal analysis function.
percentile_GO
A numeric value between 0 and 1 specifying the quantile cutoff for transcript selection. Example: 0.95 means only the top 5% of transcripts (by absolute weight) in the chosen \(\lambda\) pattern are used.
lambda_no
An integer specifying which \(\lambda\) pattern to analyze. Note: \(\lambda_1\) represents the baseline (balanced) state, while higher-order \(\lambda\) patterns capture additional constraints.
key_type
The type of transcript identifiers used in your data. This must match the ID format in your input dataset. Options include:
"SYMBOL" (gene symbols, e.g. TP53),
"ENTREZID" (Entrez gene IDs),
"ENSEMBL" (Ensembl IDs),
"PROBEID" (microarray probe IDs).
flip
Logical (TRUE/FALSE). If TRUE, multiplies the transcript weights for the selected \(\lambda\) by -1 before selecting the top quantile. Useful for keeping the sign of the weights consistent with the direction shown in the \(\lambda\) plots.
species.db.str
The organism database to use for gene mapping. Current options:
"org.Hs.eg.db" for Homo sapiens (human),
"org.Mm.eg.db" for Mus musculus (mouse).
ont
The GO ontology branch for enrichment analysis. Options:
"BP" – Biological Process (default),
"MF" – Molecular Function,
"CC" – Cellular Component.
pAdjustMethod
The multiple testing correction method. Options include: "BH" (default), "bonferroni", "holm", "hochberg", "hommel", "BY", "none".
top_GO_terms
An integer specifying the number of top enriched GO terms to return (default: 15).
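
The interaction between percentile_GO, lambda_no, and flip can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the helper select_top_transcripts() and the toy weights are hypothetical, and the cutoff here is applied to the signed weights after the optional flip. If the package ranks strictly by absolute weight, flip would not change which transcripts are selected.

```r
# Hypothetical helper: keep transcripts at or above a quantile of one lambda column
select_top_transcripts <- function(weights, percentile, lambda_no, flip = FALSE) {
  w <- weights[, lambda_no]          # weights for the chosen pattern, named by transcript
  if (flip) w <- -w                  # reverse the direction of the pattern
  cutoff <- quantile(w, probs = percentile)
  names(w)[w >= cutoff]
}

# Toy weights: 20 transcripts x 2 lambda patterns (made-up numbers)
toy_weights <- cbind(
  lambda1 = rep(1, 20),
  lambda2 = seq(-9.5, 9.5, by = 1)
)
rownames(toy_weights) <- paste0("gene", 1:20)

# Most positive weights in pattern 2
select_top_transcripts(toy_weights, 0.95, lambda_no = 2)
# Most negative weights in pattern 2 (sign flipped before the cutoff)
select_top_transcripts(toy_weights, 0.95, lambda_no = 2, flip = TRUE)
```

In this toy case the two calls pick transcripts from opposite ends of the weight distribution, which is the behavior the flip argument is described as providing.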

ggplot(GO.results, aes(x = Description, y = Count, fill = p.adjust)) +
  geom_col() + # same as geom_bar(stat = "identity")
  scale_fill_gradient(low = "#790915", high = "#062c5c") +
  theme_minimal() +
  theme(
    # Remove panel border, background, and grid lines
    panel.border = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    # Add axis lines
    axis.line = element_line(colour = "black"),
    axis.title.y = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 20),
    text = element_text(size = 18)
  ) +
  coord_flip() +
  labs(tag = "A", title = "GO analysis")
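
By default ggplot2 orders a discrete axis by factor level (alphabetically for character columns), so the bars are not sorted by gene count. The following is a minimal sketch of sorting them, assuming GO.results is a data frame with the Description, Count, and p.adjust columns used in the plot call. The toy data frame is made up for illustration.

```r
# Toy stand-in for GO.results (made-up values, same column names as in the plot)
toy_go <- data.frame(
  Description = c("term A", "term B", "term C"),
  Count       = c(12, 30, 7),
  p.adjust    = c(0.01, 0.001, 0.04)
)

# Reorder the factor levels of Description by Count (ascending);
# after coord_flip() the term with the largest count is drawn at the top
toy_go$Description <- reorder(toy_go$Description, toy_go$Count)
levels(toy_go$Description)
```

Applying the same reorder() step to GO.results before calling ggplot() should draw the bars in order of Count rather than alphabetically.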
