UPDhmm 1.5.1
Uniparental disomy (UPD) is a rare genetic condition where an individual inherits both copies of a chromosome from one parent, as opposed to the typical inheritance of one copy from each parent. The extent of its contribution as a causative factor in rare genetic diseases remains largely unexplored.
UPDs can lead to disease either by inheriting a carrier pathogenic mutation as homozygous from a carrier parent or by causing errors in genetic imprinting. Nevertheless, there are currently no standardized methods available for the detection and characterization of these events.
We have developed UPDhmm, an R/Bioconductor package that provides a tool to detect, classify, and locate uniparental disomy events in trio genetic data.
UPDhmm relies on a Hidden Markov Model (HMM) to identify regions with UPD.
In our HMM model, the observations are the genotype combinations of the father, mother, and proband at each genomic variant. The hidden states represent five inheritance patterns:
Normal (Mendelian inheritance)
Maternal isodisomy
Paternal isodisomy
Maternal heterodisomy
Paternal heterodisomy
Emission probabilities were defined based on the expected inheritance patterns, and the Viterbi algorithm was used to infer the most likely sequence of hidden states along the genome.
The final output consists of genomic segments whose inferred state deviates from Mendelian inheritance — i.e., likely UPD regions.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("UPDhmm")
BiocManager::install("regioneR")
BiocManager::install("karyoploteR")
library(UPDhmm)
library(dplyr)
The input required by the package is a multisample VCF file containing genotypes for a trio (proband, mother, and father).
Requirements:
Only biallelic variants are supported.
Accepted genotype (GT) formats: 0/0, 0/1, 1/0, 1/1 (or their phased equivalents 0|0, 0|1, 1|0, 1|1).
For details on how to preprocess VCF files and prepare them in the appropriate format, please refer to the “Preprocessing and Input Preparation” vignette included with the package.
In this tutorial, we will perform a complete analysis using two trio VCF files included with the UPDhmm package.
By following the steps below, you will learn how to:
Load and preprocess the VCF files.
Detect UPD events using the HMM-based approach.
Apply quality filters to the detected events.
Collapse and summarize results per individual.
Perform postprocessing analyses to identify potential true UPDs by detecting recurrent or artifact-prone genomic regions across samples.
library(VariantAnnotation)
library(regioneR)
library(karyoploteR)
#Example files
files <- c(
system.file(package = "UPDhmm", "extdata", "sample1.vcf.gz"),
system.file(package = "UPDhmm", "extdata", "sample2.vcf.gz")
)
# Define the trio structure
trio_list <- list(
list(
proband = "NA19675",
mother = "NA19678",
father = "NA19679"
),
list(
proband = "NA19685",
mother = "NA19688",
father = "NA19689"
)
)
# Read and preprocess each VCF
vcf_list <- mapply(function(file, trio) {
vcf <- readVcf(file)
vcfCheck(vcf,
proband = trio$proband,
mother = trio$mother,
father = trio$father)
},
files, trio_list, SIMPLIFY = FALSE)
Each processed VCF is now ready for UPD detection.
The principal function of UPDhmm
package, calculateEvents()
, is the central
function for identifying Uniparental Disomy (UPD) events. It takes the output
from the previous vcfCheck()
function and systematically analyzes genomic
data, splitting the VCF into chromosomes and applying the Viterbi algorithm.
calculateEvents()
function encompasses a serie of subprocesses for
identifying Uniparental disomies (UPDs):
data.frame
In this example, we run the function in serial mode, but calculateEvents()
now includes an optional argument, BPPARAM, which controls parallel execution settings and is passed to bplapply. By default, it uses BiocParallel::SerialParam() (serial execution). To enable parallelization, provide a BiocParallel backend, for example:
events_list <- lapply(vcf_list, calculateEvents)
head(events_list[[1]])
ID seqnames start end group n_snps log_likelihood
1 NA19675 3 100374740 197574936 het_mat 47 75.21933
2 NA19675 6 32489853 32490000 iso_mat 6 38.63453
3 NA19675 6 32609207 32609341 iso_mat 11 95.23543
4 NA19675 11 55322539 55587117 iso_mat 7 40.02082
5 NA19675 14 105408955 105945891 het_fat 30 20.84986
6 NA19675 15 22382655 23000272 iso_mat 12 48.33859
p_value n_mendelian_error ratio_proband ratio_mother ratio_father
1 4.212224e-18 2 1.0104188 1.0250623 0.9731367
2 5.110683e-10 5 1.0771736 1.0880590 0.9802379
3 1.690378e-22 7 1.0007892 1.1000846 0.8351554
4 2.512703e-10 3 1.1680464 1.0441808 1.1935373
5 4.967274e-06 2 1.0373541 0.9793402 0.9287609
6 3.586255e-12 3 0.9448732 1.0861209 0.9461424
Note: when running under R CMD check or Bioconductor build systems, the number of workers may be automatically limited (usually less than 2).
The calculateEvents
function returns a data.frame
containing all
detected events in the provided trio. If no events are found, the function
will return an empty data.frame
.
Column name | Description |
---|---|
ID |
Identifier of the proband (child sample) |
seqnames |
Chromosome name |
start |
Start position of the block (genomic coordinate) |
end |
End position of the block |
group |
Predicted inheritance state (iso_mat , iso_fat , het_mat , het_fat ) |
n_snps |
Number of informative SNPs within the event |
log_likelihood |
Log-likelihood ratio comparing the probability of the observed genotype configuration under the best-fitting UPD model versus the null (biparental inheritance). Higher values indicate stronger evidence supporting a UPD-like pattern. |
p_value |
Empirical p-value derived from the likelihood ratio test |
n_mendelian_error |
Number of Mendelian inconsistencies detected within the region |
ratio_proband |
Ratio between the average sequencing depth inside the event and the depth outside it for the proband. A value close to 1 indicates balanced coverage, while significant deviations may suggest copy-number changes. |
ratio_mother |
Same ratio computed for the maternal sample |
ratio_father |
Same ratio computed for the paternal sample |
In this example, we run the function in serial mode, but calculateEvents()
now includes an optional argument BPPARAM that controls parallel execution.
Argument | Description |
---|---|
BPPARAM |
Parallelization settings, passed to BiocParallel::bplapply() . By default, the function uses BiocParallel::SerialParam() (serial execution). To enable parallel computation, a BiocParallel backend can be specified — for example: |
library(BiocParallel)
BPPARAM <- MulticoreParam(workers = min(2, parallel::detectCores()))
events_list <- lapply(vcf_list, calculateEvents, BPPARAM = BPPARAM)
Based on the evaluation conducted in the publication, where the UPDhmm package was presented and validated through simulations of events of different sizes and application to a real cohort, we established a series of filtering criteria to remove small events and exclude those likely caused by gains or losses of genetic material in one of the trio members.
Specifically, the following thresholds are applied:
n_mendelian_error > 2: at least three Mendelian errors are required to support a true UPD signal.
size > 500,000: events must span at least 500 kb to avoid small spurious segments.
ratio_proband, ratio_father, ratio_mother between 0.8 and 1.2: ensures balanced read depth across trio members, excluding events likely caused by copy-number variation.
filtered_events_list <- lapply(events_list, function(df) {
if (nrow(df) == 0) return(df) # skip empty results
df$size <- df$end - df$start
dplyr::filter(
df,
n_mendelian_error > 2,
size > 500000,
ratio_proband >= 0.8 & ratio_proband <= 1.2,
ratio_father >= 0.8 & ratio_father <= 1.2,
ratio_mother >= 0.8 & ratio_mother <= 1.2
)
})
The collapseEvents()
function provides a concise summary of detected UPD events per individual and chromosome.
It merges multiple contiguus or overlapping events that share the same predicted type (e.g., maternal isodisomy or paternal heterodisomy within the same chromosome.
During this process, the function:
Aggregates the total number of collapsed events.
Sums up the Mendelian errors across merged regions.
Calculates the minimum start and maximum end coordinates, defining the overall genomic span of the collapsed event.
collapsed_list <- lapply(filtered_events_list, collapseEvents)
collapsed_all <- do.call(rbind.data.frame, c(collapsed_list, make.row.names = FALSE))
head(collapsed_all)
ID seqnames group n_events total_mendelian_error total_size
1 NA19675 15 iso_mat 1 3 617617
2 NA19675 19 het_mat 1 3 1207630
3 NA19685 15 iso_mat 1 6 19741113
4 NA19685 6 het_mat 1 3 1010072
collapsed_events min_start max_end
1 15:22382655-23000272 22382655 23000272
2 19:42315281-43522911 42315281 43522911
3 15:22368862-42109975 22368862 42109975
4 6:32489853-33499925 32489853 33499925
The function returns a simplified table summarizing one row per individual, chromosome, and UPD type.
Column name | Description |
---|---|
ID |
Identifier of the proband |
seqnames |
Chromosome |
group |
Predicted state |
n_events |
Number of events collapsed |
total_mendelian_error |
Sum of Mendelian errors |
total_size |
Total genomic span covered |
collapsed_events |
Comma-separated list of merged events |
min_start / max_end |
Genomic span of collapsed block |
In this final step, we perform postprocessing analyses to refine the detected UPD events and highlight potentially true or recurrent regions across samples. This helps to distinguish genuine biological UPDs from artifact-prone genomic regions (e.g., repetitive or complex loci that may systematically produce false positives).
To explore whether similar events recur across individuals, UPDhmm provides two complementary functions:
identifyRecurrentRegions()
— identifies overlapping regions across multiple samples and counts how many individuals
share each region. It returns a GRanges object with the recurrent segments and the number of supporting samples. The function includes several arguments that control how recurrence is determined:
Arguments:
df — The input data.frame containing the genomic regions to evaluate. This table must include columns describing chromosome (seqnames), genomic coordinates (start, end or min_start, max_end), and a column with Mendelian error counts (n_mendelian_error or total_mendelian_error). Filtering arguments:
ID_col — The name of the column containing sample identifiers. It allows the function to determine how many unique individuals support each overlapping genomic region.
error_threshold (default = 100) — Maximum number of Mendelian errors allowed for a region to be considered in the recurrence analysis. Regions exceeding this threshold are excluded from recurrence counting. These regions, which often span broad genomic segments, are not considered in the recurrence calculation since they could overlap with many smaller events and bias the results. Instead, they are retained in the output as non-recurrent events, under the assumption that they may represent genuine large-scale alterations rather than noise.
min_support (default = 3) — Minimum number of unique samples that must overlap a region for it to be classified as recurrent. This parameter can be adjusted depending on the cohort size — smaller values detect more shared regions, while higher thresholds focus on the most consistently recurrent events.
markRecurrentRegions()
—
This function annotates each event in a results table (from calculateEvents() or collapseEvents()) as recurrent or
non-recurrent,depending on whether it overlaps any region defined as recurrent.
The function returns the same data.frame provided as input, but with two additional columns:
These annotations facilitate the downstream filtering of events, allowing users to retain only non-recurrent regions, which are more likely to represent true UPD candidates rather than technical artifacts or common genomic polymorphisms. In summary, these two functions can thus be combined to systematically identify and flag recurrent genomic regions, facilitating the filtering of potential technical artifacts or recurrent UPDs.
recurrentRegions <- identifyRecurrentRegions(
df = collapsed_all,
ID_col = "ID",
error_threshold = 100,
min_support = 2
)
head(recurrentRegions)
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | n_samples
<Rle> <IRanges> <Rle> | <integer>
[1] 15 22368862-42109975 * | 2
-------
seqinfo: 6 sequences from an unspecified genome; no seqlengths
annotatedEvents <- markRecurrentRegions(
df = collapsed_all,
recurrent_regions = recurrentRegions
)
head(annotatedEvents)
ID seqnames group n_events total_mendelian_error total_size
1 NA19675 15 iso_mat 1 3 617617
2 NA19675 19 het_mat 1 3 1207630
3 NA19685 15 iso_mat 1 6 19741113
4 NA19685 6 het_mat 1 3 1010072
collapsed_events min_start max_end Recurrent n_samples
1 15:22382655-23000272 22382655 23000272 Yes 2
2 19:42315281-43522911 42315281 43522911 No 1
3 15:22368862-42109975 22368862 42109975 Yes 2
4 6:32489853-33499925 32489853 33499925 No 1
Alternatively, recurrent regions can be provided directly as an external BED file (e.g., SFARI.bed
),
converted into a GRanges
object, and used in the same way.
The SFARI.bed
file was generated from our analysis of the SFARI cohort,
where recurrent genomic intervals were identified across trios using UPDhmm
.
These regions represent loci frequently showing artifactual or low-confidence UPD signals
in the SFARI dataset and can be used as a mask to improve specificity in other analyses.
library(regioneR)
bed <- system.file(package = "UPDhmm", "extdata", "SFARI.bed")
bed_df <- read.table(
bed,
header = TRUE
)
bed_gr <- toGRanges(bed_df)
annotatedEvents <- markRecurrentRegions(
df = collapsed_all,
recurrent_regions = bed_gr
)
head(annotatedEvents)
ID seqnames group n_events total_mendelian_error total_size
1 NA19675 15 iso_mat 1 3 617617
2 NA19675 19 het_mat 1 3 1207630
3 NA19685 15 iso_mat 1 6 19741113
4 NA19685 6 het_mat 1 3 1010072
collapsed_events min_start max_end Recurrent n_samples
1 15:22382655-23000272 22382655 23000272 No 1
2 19:42315281-43522911 42315281 43522911 No 1
3 15:22368862-42109975 22368862 42109975 No 1
4 6:32489853-33499925 32489853 33499925 No 1
Regions annotated as non-recurrent and passing all quality filters (low Mendelian error count and consistent read depth ratios) are the most likely true UPD candidates. Conversely, highly recurrent or error-prone regions should be interpreted with caution, as they may reflect systematic technical artifacts or structural genomic complexity.
After identifying and postprocessing UPD events, a final and highly informative step is to visualize their genomic distribution. Visualization helps to quickly assess the size, chromosomal location, and type (maternal or paternal; hetero- or isodisomy) of the detected events.
The karyoploteR
package provides elegant genome-wide visualization tools that can be integrated directly with the UPDhmm results.
The following custom plotting function, plotUPDKp()
, enables the visualization of UPD segments detected in a trio across all autosomes.
To visualize events across the genome, you can use the karyoploteR package:
library(karyoploteR)
library(regioneR)
# Function to visualize UPD events across the genome
plotUPDKp <- function(updEvents) {
#--------------------------------------------------------------
# 1. Detect coordinate columns
#--------------------------------------------------------------
if (all(c("start", "end") %in% colnames(updEvents))) {
start_col <- "start"
end_col <- "end"
} else if (all(c("min_start", "max_end") %in% colnames(updEvents))) {
start_col <- "min_start"
end_col <- "max_end"
} else {
stop("Input must contain either (start, end) or (min_start, max_end) columns.")
}
#--------------------------------------------------------------
# 2. Ensure chromosome naming format (e.g. "chr3")
#--------------------------------------------------------------
updEvents$seqnames <- ifelse(grepl("^chr", updEvents$seqnames),
updEvents$seqnames,
paste0("chr", updEvents$seqnames))
#--------------------------------------------------------------
# 3. Helper function to safely create GRanges only if events exist
#--------------------------------------------------------------
to_gr_safe <- function(df, grp) {
subset_df <- subset(df, group == grp)
if (nrow(subset_df) > 0) {
toGRanges(subset_df[, c("seqnames", start_col, end_col)])
} else {
NULL
}
}
#--------------------------------------------------------------
# 4. Separate UPD event types
#--------------------------------------------------------------
het_fat <- to_gr_safe(updEvents, "het_fat")
het_mat <- to_gr_safe(updEvents, "het_mat")
iso_fat <- to_gr_safe(updEvents, "iso_fat")
iso_mat <- to_gr_safe(updEvents, "iso_mat")
#--------------------------------------------------------------
# 5. Create the genome ideogram
#--------------------------------------------------------------
kp <- plotKaryotype(genome = "hg19")
#--------------------------------------------------------------
# 6. Plot regions by inheritance type (if any exist)
#--------------------------------------------------------------
if (!is.null(het_fat)) kpPlotRegions(kp, het_fat, col = "#AAF593")
if (!is.null(het_mat)) kpPlotRegions(kp, het_mat, col = "#FFB6C1")
if (!is.null(iso_fat)) kpPlotRegions(kp, iso_fat, col = "#A6E5FC")
if (!is.null(iso_mat)) kpPlotRegions(kp, iso_mat, col = "#E9B864")
#--------------------------------------------------------------
# 7. Add legend
#--------------------------------------------------------------
legend("topright",
legend = c("Het_Fat", "Het_Mat", "Iso_Fat", "Iso_Mat"),
fill = c("#AAF593", "#FFB6C1", "#A6E5FC", "#E9B864"))
}
# Example: visualize UPD events for the first trio
plotUPDKp(annotatedEvents)
In summary, the UPDhmm
package provides a complete workflow for the detection, characterization, and visualization of uniparental disomy (UPD) events from trio-based sequencing data.
Starting from raw VCF files, users can preprocess, model inheritance patterns through a Hidden Markov Model, and identify genomic regions consistent with UPD.
Postprocessing steps, including the collapsing of events and the identification of recurrent or artifact-prone regions, enable a refined interpretation of results and the prioritization of true UPD candidates.
Finally, genome-wide visualization with karyoploteR offers an intuitive way to inspect and report detected UPDs.
Altogether, `UPDhmm facilitates a standardized and reproducible approach for the study of UPD across large cohorts, integrating statistical detection, quality control, and interpretative visualization into a single coherent framework.
sessionInfo()
R version 4.5.1 Patched (2025-08-23 r88802)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] karyoploteR_1.35.3 regioneR_1.41.3
[3] VariantAnnotation_1.55.1 Rsamtools_2.25.3
[5] Biostrings_2.77.2 XVector_0.49.1
[7] SummarizedExperiment_1.39.2 Biobase_2.69.1
[9] GenomicRanges_1.61.5 IRanges_2.43.5
[11] S4Vectors_0.47.4 Seqinfo_0.99.2
[13] MatrixGenerics_1.21.0 matrixStats_1.5.0
[15] BiocGenerics_0.55.3 generics_0.1.4
[17] dplyr_1.1.4 UPDhmm_1.5.1
[19] BiocStyle_2.37.1
loaded via a namespace (and not attached):
[1] DBI_1.2.3 bitops_1.0-9 gridExtra_2.3
[4] rlang_1.1.6 magrittr_2.0.4 biovizBase_1.57.1
[7] compiler_4.5.1 RSQLite_2.4.3 GenomicFeatures_1.61.6
[10] png_0.1-8 vctrs_0.6.5 ProtGenerics_1.41.0
[13] stringr_1.5.2 pkgconfig_2.0.3 crayon_1.5.3
[16] fastmap_1.2.0 magick_2.9.0 backports_1.5.0
[19] rmarkdown_2.30 UCSC.utils_1.5.0 tinytex_0.57
[22] bit_4.6.0 xfun_0.53 cachem_1.1.0
[25] GenomeInfoDb_1.45.12 jsonlite_2.0.0 blob_1.2.4
[28] DelayedArray_0.35.3 BiocParallel_1.43.4 parallel_4.5.1
[31] cluster_2.1.8.1 R6_2.6.1 bslib_0.9.0
[34] stringi_1.8.7 RColorBrewer_1.1-3 bezier_1.1.2
[37] rtracklayer_1.69.1 rpart_4.1.24 jquerylib_0.1.4
[40] Rcpp_1.1.0 bookdown_0.45 knitr_1.50
[43] base64enc_0.1-3 Matrix_1.7-4 nnet_7.3-20
[46] tidyselect_1.2.1 rstudioapi_0.17.1 dichromat_2.0-0.1
[49] abind_1.4-8 yaml_2.3.10 codetools_0.2-20
[52] curl_7.0.0 lattice_0.22-7 tibble_3.3.0
[55] KEGGREST_1.49.2 S7_0.2.0 evaluate_1.0.5
[58] foreign_0.8-90 pillar_1.11.1 BiocManager_1.30.26
[61] checkmate_2.3.3 RCurl_1.98-1.17 ensembldb_2.33.2
[64] ggplot2_4.0.0 scales_1.4.0 glue_1.8.0
[67] lazyeval_0.2.2 Hmisc_5.2-4 tools_4.5.1
[70] BiocIO_1.19.0 data.table_1.17.8 BSgenome_1.77.2
[73] GenomicAlignments_1.45.4 XML_3.99-0.19 grid_4.5.1
[76] AnnotationDbi_1.71.2 colorspace_2.1-2 htmlTable_2.4.3
[79] restfulr_0.0.16 Formula_1.2-5 cli_3.6.5
[82] HMM_1.0.2 S4Arrays_1.9.1 AnnotationFilter_1.33.0
[85] gtable_0.3.6 sass_0.4.10 digest_0.6.37
[88] SparseArray_1.9.1 rjson_0.2.23 htmlwidgets_1.6.4
[91] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
[94] lifecycle_1.0.4 httr_1.4.7 bit64_4.6.0-1
[97] bamsignals_1.41.2