--- title: "Base editing design with crisprDesign" author: - name: Jean-Philippe Fortin email: fortin946@gmail.com date: "`r Sys.Date()`" output: BiocStyle::html_document: toc_float: true #theme: paper number_sections: true vignette: > %\VignetteIndexEntry{Base editing design with crisprDesign} \usepackage[utf8]{inputenc} %\VignetteEngine{knitr::rmarkdown} bibliography: references.bib --- # Introduction ```{r, message=FALSE, warning=FALSE,results='hide' } library(crisprDesign) ``` # Defining the base editor object We illustrate the CRISPR base editing (CRISPRbe) functionalities of `crisprDesign` by designing and characterizing gRNAs targeting the gene IQSEC3 using the cytidine base editor BE4max [@koblan2018improving]. We first load the BE4max `BaseEditor` object available in `crisprBase`: ```{r} data(BE4max, package="crisprBase") BE4max ``` The editing probabilities of the base editor BE4max are stored in a matrix where rows correspond to the different nucleotide substitutions, and columns correspond to the genomic coordinate relative to the PAM site. The `editingWeights` function from `crisprBase` allows us to retrieve those probabilities. One can see that C to T editing is optimal around 15 nucleotides upstream of the PAM site for the BE4max base editor: ```{r} crisprBase::editingWeights(BE4max)["C2T",] ``` To learn how to build a \code{BaseEditor} object for your own base editor enzyme, see the \code{crisprBase} package vignette. # Designing spacer sequences Next, we load the `grListExample` object in `crisprDesign` that contains gene coordinates in hg38 for exons of all human IQSEC3 isoforms. The object was obtained by converting an Ensembl `TxDb` object into a `GRangesList` object using the `TxDb2GRangesList` convenience function in `crisprDesign`. ```{r} data(grListExample, package="crisprDesign") ``` The `queryTxObject` function allows us to query such objects for a specific gene and feature. Here, we obtain a `GRanges` object containing the first exon of the IQSEC3 gene: ```{r} gr <- queryTxObject(txObject=grListExample, featureType="cds", queryColumn="gene_symbol", queryValue="IQSEC3") ``` and retain the first exon only: ```{r} gr <- gr[1] ``` `findSpacers` is the main function to obtain a list of all possible spacer sequences targeting protospacers located in the target DNA sequence(s). If a `GRanges` object is provided as input, a `BSgenome` object (object containing sequences of a reference genome) will need to be provided as well: ```{r, warning=FALSE, message=FALSE} library(BSgenome.Hsapiens.UCSC.hg38) bsgenome <- BSgenome.Hsapiens.UCSC.hg38 guideSet <- findSpacers(gr, bsgenome=bsgenome, crisprNuclease=BE4max, strict_overlap=FALSE) guideSet ``` The argument \code{strict_overlap} set to FALSE enables spacer sequences to be not-strictly overlapping with the CDS region; this is useful for base editing design as the editing window can extend beyond the protospacer sequence region. The function returns a `GuideSet` object that stores genomic coordinates for all spacer sequences found in the regions provided by `gr`. The `GuideSet` object is an extension of a `GenomicRanges` object that stores additional information about gRNAs. For the sake of time, we will retain only two gRNAs: ```{r} guideSet <- guideSet[c(50,51)] print(guideSet) ``` # Allele prediction The function `addEditedAlleles` finds, characterizes, and scores predicted edited alleles for each gRNA, for a chosen transcript. It requires a transcript-specific annotation that can be obtained using the function `getTxInfoDataFrame`. Here, we will perform the analysis using the main isoform of IQSEC3 (transcript id ENST00000538872). We first get the transcript table for ENST00000538872, ```{r} txid <- "ENST00000538872" txTable <- getTxInfoDataFrame(tx_id=txid, txObject=grListExample, bsgenome=bsgenome, extend=30) head(txTable) ``` The argument `extend` specifies the number of nucleotides upstream and downstream of the exons to include. This is useful to characterize gRNAs overlapping splice junctions. The `region` column indicates where the location of the nucleotide: 3UTR, 5UTR, CDS, Intron, and Upstream and Downstream of the CDS. We are ready to add predicted alleles to the `GuideSet` object: ```{r} editingWindow <- c(-20,-8) guideSet <- addEditedAlleles(guideSet, baseEditor=BE4max, txTable=txTable, editingWindow=editingWindow, minEditingWeight = 0, minMutationScore = 0.3) ``` - The `editingWindow` argument specifies the window of editing that we are interested in. When not provided, it uses the default window provided in the `BaseEditor` object. Note that providing large windows can exponentially increase computing time as the number of possible alleles grows exponentially. - The `minEditingWeight` specifies the minimum editing weight required for an allele to be listed as a predicted allele. Default is 0. A higher threshold can be used to omit alleles with low probabilities. - The `mutationScore` specifies a minimum predicted probability for labeling an allele with a predicted variant. Alleles with scores lower than this threshold will be labeled as "not_editing". Default of 0.3. For each gRNA, a predicted alleles annotation is stored and can be retrieved using the `editedAlleles` accessor function. As an example, let's retrieve the predicted alleles for the first gRNA: ```{r} alleles <- editedAlleles(guideSet[1]) print(alleles) ``` The resulting `DataFrame` is ordered so that the top predicted alleles (based on the `score` column) are shown first. The `score` represents the likelihood of the edited allele to occur relative to all possible edited alleles, and is calculated using the editing weights stored in the `BE4max` object. The `seq` column represents the edited nucleotide sequences. They are always reported from the 5' to 3' direction on the strand corresponding to the gRNA strand. - The `n_mismatches` column indicates the number of amino acid that differs between the edited allele and the wildtype allele. - The `n_nonsense` and `n_missense` columns indicate the number of mismatches that are nonsense and missense mutation, respectively. Those two columns sum to the `n_mismatches` column. - The `variant` column indicates the functional consequence of the allele. There are 7 possible choices: - `silent`: single or multiple silent mutations - `missense`: single missense mutation - `nonsense`: single nonsense mutation - `missense_multi`: multiple missense mutations - `nonsense_multi`: multiple nonsense mutations - `splice_junction`: mutation in a splice junction - `not_targeting`: no mutations found in CDS or splice junctions In case an edited allele leads to multiple editing events that have different variant label, the most detrimental mutation (splice junction over nonsense, nonsense over missense, missense over silent) is reported. - The `positions` column lists the amino acid numbers where mutations occur. For alleles that are labeled as `splice_junction`, it lists the closest amino acid. - The `aa` column reports the result edited amino acid sequence. The alleles object also contains useful metadata information that can be accessed using the `metadata` accessor function: ```{r} metadata(alleles) ``` The `wildtypeAllele` reports the unedited nucleotide sequence of the region specified by the editing window (with respect to the gRNA PAM site). It is always reported from the 5' to 3' direction on the strand corresponding to the gRNA strand. The `start` and `end` specify the corresponding coordinates on the transcript. # gRNA-level aggregate variant scores For both analysis and visualization purposes, it is often useful to label gRNAs with a score and label that represents the most likely functional consequence of that gRNA, for a given base editor. The `addEditedAlleles` function described above also implements a gRNA-level aggregate scoring scheme that adds several gRNA-level aggregate scores to the `GuideSet` object: ```{r} head(guideSet) ``` - `score_missense_single`: sum of probability scores across all alleles with a single missense mutation - `score_nonsense_single`: sum of probability scores across all alleles with a single nonsense mutation - `score_missense_multi`: sum of probability scores across all alleles with multiple missense mutations - `score_nonsense_multi`: sum of probability scores across all alleles with multiple nonsense mutations - `score_silent`: sum of probability scores across all alleles with only silent mutations - `score_splice_junction`: sum of probability scores across all alleles with a splice junction mutation - `score_missense`: `score_missense_single` and `score_missense_multiple` added together - `score_nonsense`: `score_nonsense_single` and `score_nonsense_multiple` added together - `maxVariantScore`: maximum score across all score columns - `maxVariant`: variant label for the maximum score - `aapos`: amino acid position for the top predicted allele for the variant category that has the maximum score For both gRNAs, the highest scores are missense, and therefore the `maxVariant` is set to missense. # Session Info ```{r} sessionInfo() ``` # References