---
title: "Misha Basics (Short Guide)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Misha Basics (Short Guide)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

This page gives a compact mental model for misha. Use it as the first quick read before the full `Manual` vignette.

## The Core Idea

Most analyses follow the same pattern:

1. Choose **where** to look (intervals / scope).
2. Choose **how** to walk through it (iterator).
3. Evaluate a **track expression** over those iterator intervals.

In misha this is usually one call to `gextract`, `gscreen`, or `gsummary`.

You are not limited to raw track names. You can pass full expressions, for example `log(dense_track + 1)`, `dense_track / (chip.sum + 1e-6)`, or `pmin(dense_track, 2)`.

All examples below assume the bundled examples database:

```{r, eval = FALSE}
library(misha)
gdb.init_examples()
```

## Four Concepts You Need First

### 1) Track

A **track** is genomic signal organized over coordinates.

- Dense track: value for each fixed-size bin (for example `dense_track` in the examples DB).
- Sparse track: values on intervals (for example peaks).
- 2D track: values on genomic rectangles (for example contact matrices).

Useful starter commands:

```{r, eval = FALSE}
gtrack.ls() # list tracks in the examples DB
gtrack.info("dense_track") # inspect type/metadata
gtrack.info("sparse_track")
```

For intuition, you can think of `dense_track` as a ChIP-seq-like coverage signal.

### 2) Intervals

An **interval set** defines genomic regions (`chrom`, `start`, `end`) where you want to work.

- Intervals can come from files, annotations, peak calls, or be generated in code.
- Intervals often act as a **scope**: "analyze only here."

```{r, eval = FALSE}
regions <- gintervals(1, c(0, 250000), c(100000, 260000))
```

### 3) Iterator

The **iterator** is the stepping policy inside the scope.

- `iterator = 100` -> fixed 100 bp bins
- `iterator = "some_sparse_track"` -> iterate over that track's intervals
- `iterator = some_intervals_df` -> iterate over explicit regions
- `iterator = "my_intervals_set"` -> iterate directly over an intervals set

Think of it as: scope says *where*, iterator says *in what chunks*.

```{r, eval = FALSE}
out <- gextract("dense_track", regions, iterator = 100)
log_out <- gextract("log(dense_track + 1)", regions, iterator = 100)
```

Create and use an intervals set as an iterator:

```{r, eval = FALSE}
# gintervals.save() takes the name of the new set first, then the intervals
gintervals.save("my_intervals_set", regions)
out2 <- gextract("dense_track", gintervals.all(), iterator = "my_intervals_set")
```

### 4) Virtual Track

A **virtual track** is a named on-the-fly transformation, not stored as a physical track file.

Examples:

- Local sum of a source track
- Distance to nearest annotation interval
- Quantile-like or nearest-neighbor summaries

```{r, eval = FALSE}
gvtrack.create("chip.sum", "dense_track", "sum")
out <- gextract("chip.sum", regions, iterator = 200)
```

You can also shift the iterator window used by the virtual track:

```{r, eval = FALSE}
gvtrack.create("chip.shifted", "dense_track", "sum")
gvtrack.iterator("chip.shifted", sshift = -100, eshift = 100)
out <- gextract("chip.shifted", regions, iterator = 200)
```

Here, each iterator interval is expanded by 100 bp on both sides before evaluating `dense_track`.

Virtual tracks are session objects (easy to list with `gvtrack.ls` and delete with `gvtrack.rm`).

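
To make the bin and shift arithmetic concrete, the plain-R sketch below works out by hand which windows a 200 bp iterator combined with `sshift = -100`, `eshift = 100` would cover. It does not call misha. `make_bins` and `shift_windows` are hypothetical helpers written only for this illustration, and the sketch assumes the scope starts on a bin boundary and does no clipping at chromosome ends.

```{r, eval = FALSE}
# Hypothetical helper: split one scope interval [start, end) into fixed-size bins.
# Assumes `start` sits on a bin boundary; the last bin is truncated at `end`.
make_bins <- function(chrom, start, end, binsize) {
    starts <- seq(start, end - 1, by = binsize)
    data.frame(
        chrom = chrom,
        start = starts,
        end = pmin(starts + binsize, end)
    )
}

# Hypothetical helper: move every bin start by `sshift` and every bin end by
# `eshift`, mirroring the meaning of the two shifts described above.
shift_windows <- function(bins, sshift, eshift) {
    bins$start <- bins$start + sshift
    bins$end <- bins$end + eshift
    bins
}

bins <- make_bins("chr1", 1000, 2000, binsize = 200)
windows <- shift_windows(bins, sshift = -100, eshift = 100)

# One output row per iterator bin, next to the window that is actually evaluated.
print(cbind(bins, win_start = windows$start, win_end = windows$end))
```

Neighboring windows overlap by construction: each 200 bp bin is evaluated over a 400 bp window, so a `sum` over shifted windows counts some signal in more than one output row.
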
## Minimal Workflow

```{r, eval = FALSE}
library(misha)
gdb.init_examples()

# 1) pick scope
regions <- gintervals(1, 0, 50000)

# 2) inspect available tracks
print(gtrack.ls())

# 3) extract signal with a chosen iterator
chip <- gextract("dense_track", regions, iterator = 100)

# 4) screen high-signal bins (as a simple peak-like filter)
hi_chip <- gscreen("dense_track > 0.6", regions, iterator = 100)

# 5) summarize distribution/coverage
stats <- gsummary("dense_track", regions, iterator = 100)
```

## PWM in One Minute

A PWM/PSSM is a motif model over A/C/G/T. In misha, a common pattern is:

1. Extract sequence from intervals.
2. Score those sequences with a PWM.

```{r, eval = FALSE}
regions <- gintervals(1, c(1000, 2000), c(1020, 2020))
seqs <- gseq.extract(regions)

pssm <- matrix(c(
    0.80, 0.05, 0.10, 0.05,
    0.10, 0.10, 0.70, 0.10,
    0.05, 0.80, 0.05, 0.10,
    0.10, 0.10, 0.10, 0.70
), ncol = 4, byrow = TRUE)
colnames(pssm) <- c("A", "C", "G", "T")

scores <- gseq.pwm(seqs, pssm, mode = "lse")
```

If your database has motif files under `pssms/`, you can create a genome-wide PWM-energy track with `gtrack.create_pwm_energy(...)`.
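
For intuition about what the `mode = "lse"` score in the `gseq.pwm` call above summarizes, the sketch below scores a single sequence in plain R. It slides the motif along the sequence, takes the log-likelihood at each start position, and combines them with log-sum-exp. This is an illustration under stated assumptions, not misha's implementation: it assumes `lse` means log-sum-exp over start positions, scans the forward strand only, and applies no prior or pseudocount. `score_lse` is a hypothetical helper name.

```{r, eval = FALSE}
# Same 4-position motif as above: rows are motif positions, columns are bases.
pssm <- matrix(c(
    0.80, 0.05, 0.10, 0.05,
    0.10, 0.10, 0.70, 0.10,
    0.05, 0.80, 0.05, 0.10,
    0.10, 0.10, 0.10, 0.70
), ncol = 4, byrow = TRUE)
colnames(pssm) <- c("A", "C", "G", "T")

# Hypothetical helper: forward-strand log-sum-exp score of one sequence.
score_lse <- function(s, pssm) {
    bases <- strsplit(toupper(s), "")[[1]]
    w <- nrow(pssm)
    n_starts <- length(bases) - w + 1
    stopifnot(n_starts >= 1, all(bases %in% colnames(pssm)))

    # Log-likelihood of the motif at each possible start position.
    ll <- vapply(seq_len(n_starts), function(i) {
        window <- bases[i:(i + w - 1)]
        idx <- cbind(seq_len(w), match(window, colnames(pssm)))
        sum(log(pssm[idx]))
    }, numeric(1))

    # Numerically stable log(sum(exp(ll))).
    max(ll) + log(sum(exp(ll - max(ll))))
}

score_lse("AGCT", pssm) # a single window: the consensus match
score_lse("TTAGCTTT", pssm) # several windows contribute to the score
```

Because every window adds a positive term inside the sum, a longer sequence that contains the same best match scores at least as high as the match alone. Keep this in mind when comparing scores across intervals of different lengths.
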