Authors: Enrico Glaab and Sophie Le Bars
This vignette demonstrates a standard workflow for analyzing bulk RNA-seq data to identify sex-specific effects using the XYomics package. The package provides a streamlined set of tools for differential expression, pathway analysis, and network-based interpretation.
This tutorial covers the following key steps:
We begin by simulating a bulk RNA-seq dataset to illustrate the
package’s capabilities. The simulate_omics_data function
creates a dataset with 500 genes and 100 samples, balanced across
phenotype (Control/Disease) and sex (male/female).
The simulation introduces three types of effects: - Male-specific: Genes up-regulated in the disease state only in males. - Female-specific: Genes up-regulated in the disease state only in females. - Sex-dimorphic: Genes regulated in opposite directions between males and females in the disease state.
library(XYomics)
library(dplyr)
library(clusterProfiler)
library(org.Hs.eg.db)
simulate_omics_data <- function(n_samples = 100, n_genes = 500) {
set.seed(123)
all_entrez <- keys(org.Hs.eg.db, keytype = "ENTREZID")
selected_entrez <- sample(all_entrez, n_genes)
sex <- rep(c("m", "f"), each = n_samples / 2)
phenotype <- rep(rep(c("Control", "Disease"), each = n_samples / 4), 2)
expression_data <- matrix(rnorm(n_samples * n_genes, mean = 10, sd = 0.1),
nrow = n_genes, ncol = n_samples)
male_disease <- which(sex == "m" & phenotype == "Disease")
female_disease <- which(sex == "f" & phenotype == "Disease")
expression_data[1:25, male_disease] <- expression_data[1:25, male_disease] + 8
expression_data[26:50, female_disease] <- expression_data[26:50, female_disease] + 8
expression_data[51:75, male_disease] <- expression_data[51:75, male_disease] + 6
expression_data[51:75, female_disease] <- expression_data[51:75, female_disease] - 6
colnames(expression_data) <- paste0("Sample", 1:n_samples)
rownames(expression_data) <- selected_entrez
return(list(expression_data = expression_data, phenotype = phenotype, sex = sex))
}
simulated_data <- simulate_omics_data()
omics_data <- simulated_data$expression_data
phenotype <- simulated_data$phenotype
sex <- simulated_data$sex
cat("Dimensions of expression data:", dim(omics_data), "\n")## Dimensions of expression data: 500 100
When analyzing sex differences, researchers must be aware of the pitfalls of associated statistical analyses, including the limitations of sex-stratified analyses and the challenges of analyzing interactions between sex and disease state.
Sex-stratified analyses use standard statistical tests for differential molecular abundance analysis to test for disease-associated changes in each sex separately. These may include classical parametric hypothesis tests, such as Welch’s test for normally distributed data, or non-parametric tests, such as the Mann–Whitney U test, as well as special moderated statistics for high-dimensional omics data analysis, such as the empirical Bayes moderated t-statistic. However, a pure sex-stratified analysis may misclassify a change as sex-specific if it uses a standard significance threshold to assess both the presence and absence of an effect. Stochastic variation in significance scores around a chosen threshold may lead to the erroneous detection of significance specific to only one sex, especially if the p-value in the other sex marginally exceeds the chosen threshold. In addition, such an analysis may miss sex-modulated changes, where significant changes in both sexes share the same direction but differ significantly in magnitude; these changes require cross-sex comparisons for accurate detection.
Interaction analysis (using a multiplicative term
like sex*disease) formally tests whether the relationship
between disease and molecular changes differs significantly between
males and females. Not only can such interaction terms reveal
complexities in disease mechanisms that might otherwise be obscured in
analyses that do not consider SABV, but they also have the potential to
detect changes that are limited to the magnitude of an effect, an aspect
that sex-stratified analyses do not capture. Nevertheless, robust
estimation of interaction effects requires large sample sizes, which are
often not available due to the costs associated with advanced molecular
profiling techniques such as single-cell RNA sequencing.
The XYomics package provides functions for all three types of analyses.
Here, we use the sex_stratified_analysis_bulk function
to perform a stratified analysis.
male_deg <- sex_stratified_analysis_bulk(
x = omics_data,
phenotype = phenotype,
gender = sex,
analysis_type = "male"
)
female_deg <- sex_stratified_analysis_bulk(
x = omics_data,
phenotype = phenotype,
gender = sex,
analysis_type = "female"
)
cat("Top differentially expressed genes in males (stratified analysis):\n")## Top differentially expressed genes in males (stratified analysis):
## logFC AveExpr t P.Value adj.P.Val B
## 124900609 8.078211 13.98250 285.3067 0 0 5530.402
## 127896748 8.031050 14.00357 284.4926 0 0 5520.917
## 127891734 8.003977 14.02736 283.4291 0 0 5508.488
## 127402385 8.022368 14.00286 283.3256 0 0 5507.276
## 127830600 8.018576 14.00111 283.3203 0 0 5507.214
## 129993743 7.997612 14.00713 283.0152 0 0 5503.639
We use the sex_interaction_analysis_bulk function to
perform a formal interaction analysis using a multiplicative term.
interaction_term_results_bulk <- sex_interaction_analysis_bulk(
x = omics_data,
phenotype = phenotype,
gender = sex,
phenotype_labels = c("Control", "Disease"),
sex_labels = c("f", "m")
)
cat("Top genes from Interaction Term analysis (bulk):\n")## Top genes from Interaction Term analysis (bulk):
## logFC AveExpr t P.Value adj.P.Val B
## 130060506 8.026858 11.99883 200.3543 0 0 20059.71
## 130066717 8.045758 11.99960 200.8261 0 0 20154.33
## 127891734 8.040629 12.00748 200.6980 0 0 20128.63
## 127830113 7.961523 12.00532 198.7235 0 0 19734.34
## 129993743 8.003568 12.01005 199.7730 0 0 19943.42
## 127408622 7.996061 12.00467 199.5856 0 0 19906.01
The identify_sex_specific_genes function categorizes
genes based on the results of the stratified analysis.
This provides a useful, albeit potentially less robust,
classification.
sex_specific_results <- identify_sex_specific_genes(
male_results = male_deg,
female_results = female_deg,
target_fdr = 0.05,
exclude_fdr = 0.5
)
cat("Number of genes in each category (from stratified analysis):\n")## Number of genes in each category (from stratified analysis):
##
## female-specific male-specific sex-dimorphic
## 26 28 25
We use generate_boxplot to visualize the expression of
top genes identified by the stratified analysis.
# Male-specific gene
top_male_gene <- sex_specific_results %>%
filter(gene_type == "male-specific") %>%
arrange(male_FDR) %>%
pull(gene_id) %>%
head(1)
generate_boxplot(
x = omics_data,
index = top_male_gene,
phenotype = phenotype,
gender = sex,
title = paste("Expression of Top Male-Specific Gene:", top_male_gene)
)# Female-specific gene
top_female_gene <- sex_specific_results %>%
filter(gene_type == "female-specific") %>%
arrange(female_FDR) %>%
pull(gene_id) %>%
head(1)
generate_boxplot(
x = omics_data,
index = top_female_gene,
phenotype = phenotype,
gender = sex,
title = paste("Expression of Top Female-Specific Gene:", top_female_gene)
)# Sex-dimorphic gene
top_dimorphic_gene <- sex_specific_results %>%
filter(gene_type == "sex-dimorphic") %>%
arrange(male_FDR) %>%
pull(gene_id) %>%
head(1)
generate_boxplot(
x = omics_data,
index = top_dimorphic_gene,
phenotype = phenotype,
gender = sex,
title = paste("Expression of Top Sex-Dimorphic Gene:", top_dimorphic_gene)
)We perform pathway enrichment on the gene lists derived from the stratified analysis.
# GO analysis for female-specific genes
female_specific_genes <- sex_specific_results %>%
filter(gene_type == "female-specific") %>%
pull(gene_id)
cat("\nPerforming GO enrichment analysis for female-specific genes...\n")##
## Performing GO enrichment analysis for female-specific genes...
go_results <- improved_pathway_enrichment(
gene_list = female_specific_genes,
enrichment_db = "GO"
)
if (!is.null(go_results) && nrow(go_results) > 0) {
cat("Top GO Terms:\n")
print(head(as.data.frame(go_results)))
} else {
cat("No significant GO terms found.\n")
}## Top GO Terms:
## ONTOLOGY ID Description
## GO:0051643 BP GO:0051643 endoplasmic reticulum localization
## GO:0120009 BP GO:0120009 intermembrane lipid transfer
## GO:0140056 BP GO:0140056 organelle localization by membrane tethering
## GO:0022406 BP GO:0022406 membrane docking
## GO:0007029 BP GO:0007029 endoplasmic reticulum organization
## GO:0015914 BP GO:0015914 phospholipid transport
## GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
## GO:0051643 1/1 10/18870 0.0005299417 0.004239534 NA 23344 1
## GO:0120009 1/1 54/18870 0.0028616852 0.007277866 NA 23344 1
## GO:0140056 1/1 82/18870 0.0043455220 0.007277866 NA 23344 1
## GO:0022406 1/1 92/18870 0.0048754637 0.007277866 NA 23344 1
## GO:0007029 1/1 100/18870 0.0052994171 0.007277866 NA 23344 1
## GO:0015914 1/1 103/18870 0.0054583996 0.007277866 NA 23344 1
# GO analysis for male-specific genes
male_specific_genes <- sex_specific_results %>%
filter(gene_type == "male-specific") %>%
pull(gene_id)
cat("\nPerforming GO pathway analysis for male-specific genes...\n")##
## Performing GO pathway analysis for male-specific genes...
go_results_male <- improved_pathway_enrichment(
gene_list = male_specific_genes,
enrichment_db = "GO"
)
if (!is.null(go_results_male) && nrow(go_results_male) > 0) {
cat("Top GO Pathways:\n")
print(head(as.data.frame(go_results_male)))
} else {
cat("No significant GO pathways found.\n")
}## No significant GO pathways found.
Finally, we construct a protein-protein interaction network using the results from the stratified analysis.
First, we download and process a human protein-protein interaction
network from the STRING database using get_string_network.
As this step can be time-consuming, for this vignette we will create a
small dummy network for demonstration purposes.
# In a real analysis, you would run:
# string_network <- get_string_network(organism = "9606", score_threshold = 700)
# For this vignette, we create a small dummy network
set.seed(123)
dummy_nodes <- unique(c(male_specific_genes, female_specific_genes))
if(length(dummy_nodes) > 100) dummy_nodes <- sample(dummy_nodes, 100)
dummy_edges <- data.frame(
from = sample(dummy_nodes, 50, replace = TRUE),
to = sample(dummy_nodes, 50, replace = TRUE)
)
string_network <- igraph::graph_from_data_frame(dummy_edges, directed = FALSE)
string_network <- igraph::simplify(string_network)
cat("Using a dummy network for demonstration purposes.\n")## Using a dummy network for demonstration purposes.
Next, we define “prizes” for our genes based on their statistical
significance (e.g., -log10 of the p-value) and use the
construct_ppi_pcsf function to build a context-specific
network.
# Use male DE results to define prizes
prizes <- -log10(male_deg$P.Value)
names(prizes) <- rownames(male_deg)
prizes <- prizes[is.finite(prizes)]
# Construct the network
ppi <- construct_ppi_pcsf(
g = string_network,
prizes = prizes,
w = 2,
b = 1,
mu = 5e-04
)## 100% of your terminal nodes are included in the interactome
## Solving the PCSF...
## Constructed PPI with 2 nodes and 1 edges.
Finally, we use the visualize_network function to plot
the resulting protein-protein interaction network. The node colors can
represent different properties, such as the relative expression changes
in males versus females.
visualize_network(
g = ppi,
female_res = female_deg,
male_res = male_deg,
vertex.size = 8,
vertex.label.cex = 0.7,
main = "Sex-Specific Protein-Protein Interaction Network"
)## IGRAPH eb64179 UNWB 2 1 --
## + attr: name (v/c), prize (v/n), type (v/c), weight (e/n)
## + edge from eb64179 (vertex names):
## [1] 130066070--144106
This vignette has demonstrated two key approaches for analyzing sex-specific effects in bulk RNA-seq data. While sex-stratified analysis is a common first step, interaction analysis provides a more statistically robust method for identifying genes whose regulation by disease is truly dependent on sex. We recommend using an interaction-based approach where possible, while being mindful of the need for adequate sample sizes.