---
title: "Summarise database characteristics"
output: 
  html_document:
    pandoc_args: [
      "--number-offset=1,0"
      ]
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{database characteristics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In this vignette, we explore how the *OmopSketch* functions `databaseCharacteristics()` and `shinyCharacteristics()` can serve as valuable tools for characterising databases containing electronic health records mapped to the OMOP Common Data Model.

## Create a mock CDM

We begin by loading the necessary packages and creating a mock CDM using the R package [omock](https://ohdsi.github.io/omock/):

```{r, warning=FALSE}
library(dplyr)
library(OmopSketch)
library(omock)

cdm <- mockCdmFromDataset(datasetName = "GiBleed", source = "duckdb")

cdm
```

# Database characteristics

## Summarise characteristics

The `databaseCharacteristics()` function provides a comprehensive overview of the Common Data Model (CDM). It returns a [summarised result](https://darwin-eu-dev.github.io/omopgenerics/articles/summarised_result.html) combining several characterisation components:

- **General database snapshot:**  
  Generated using `summariseOmopSnapshot()`, this provides high-level metadata about the CDM, including the size of the person table, the time span covered, the source type, and the vocabulary version.

- **Population characterisation:**  
  Describes the demographics of the population under observation, built using the [CohortConstructor](https://ohdsi.github.io/CohortConstructor/) and [CohortCharacteristics](https://darwin-eu.github.io/CohortCharacteristics/) packages.

- **Person table characterisation:**  
  Produced using `summarisePerson()`, this component summarises the content and missingness of the `person` table.
  
- **Observation period characterisation:**  
  Produced using `summariseObservationPeriod()`, this component summarises the content and missingness of the observation period table.  
  Temporal trends (changes in the number of records and subjects, median age, sex distribution, and total person-days) are then derived using `summariseTrend()`.

- **Clinical tables characterisation:**  
  Produced using `summariseClinicalRecords()`, this component summarises the content and missingness across all clinical tables.  
  Temporal trends in the number of records and subjects, median age, and sex distribution are also computed using `summariseTrend()`.

- **Concept counts:**  
  Optionally, concept-level summaries can be included by computing concept counts with `summariseConceptIdCounts()`.

Together, these outputs provide a holistic view of the CDM’s structure, data completeness, and temporal behaviour, supporting both data quality assessment and study feasibility evaluation.

```{r, eval = FALSE}
result <- databaseCharacteristics(cdm = cdm)
```
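
Because the output is a summarised result, it can be inspected and exported with the usual *omopgenerics* tools. The chunk below is a minimal sketch, assuming *omopgenerics* is installed and reusing the mock `cdm` created above; the file name and output path are illustrative choices, not defaults of *OmopSketch*.

```{r, eval = FALSE}
library(dplyr)
library(omopgenerics)

# `cdm` is the mock CDM created at the start of this vignette.
result <- databaseCharacteristics(cdm = cdm)

# One row per result_id, showing the result type and the package behind it.
settings(result) |>
  select(result_id, result_type, package_name)

# Number of rows contributed by each component of the characterisation.
result |>
  count(result_id)

# Save the result to a CSV file; counts below minCellCount are suppressed.
# The file name and path here are illustrative.
exportSummarisedResult(
  result,
  minCellCount = 5,
  fileName = "database_characteristics.csv",
  path = tempdir()
)
```

Looking at `settings(result)` first is a quick way to confirm which of the components listed above actually made it into the output.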

## Selecting tables to characterise

By default, the following OMOP tables are included in the characterisation: *visit_occurrence*, *visit_detail*, *condition_occurrence*, *drug_exposure*, *procedure_occurrence*, *device_exposure*, *measurement*, *observation*, *death*.

You can customise which tables to include in the analysis by specifying them with the `omopTableName` argument.

```{r, eval=FALSE}
result <- databaseCharacteristics(
  cdm = cdm, 
  omopTableName = c("drug_exposure", "condition_occurrence")
)
```
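
If you are unsure whether a table is present in your CDM, you can intersect the tables you want with the names of the cdm reference before calling the function. This is a small sketch under the assumption that `names(cdm)` lists the tables available in the reference; the variable names `requested` and `available` are illustrative.

```{r, eval = FALSE}
# Tables we would like to characterise (illustrative selection).
requested <- c("drug_exposure", "condition_occurrence", "device_exposure")

# Keep only the tables that are present in the cdm reference.
available <- intersect(requested, names(cdm))

result <- databaseCharacteristics(
  cdm = cdm,
  omopTableName = available
)
```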

## Stratifying by sex

To stratify the characterisation results by sex, set the `sex` argument to `TRUE`:

```{r, eval=FALSE}
result <- databaseCharacteristics(
  cdm = cdm,
  omopTableName = c("drug_exposure", "condition_occurrence"),
  sex = TRUE
)
```

## Stratifying by age group

To stratify the characterisation by age group, pass a list defining the age groups you want to use to the `ageGroup` argument. Each element of the list is a vector with the lower and upper age limits of one group.

```{r, eval=FALSE}
result <- databaseCharacteristics(
  cdm = cdm,
  omopTableName = c("drug_exposure", "condition_occurrence"),
  ageGroup = list(c(0, 50), c(51, 100))
)
```
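
When defining several age groups, it is easy to introduce an overlap by accident (for example, `c(0, 50)` followed by `c(50, 100)`). The helper below is a hypothetical base R sketch, not part of *OmopSketch*, that checks a list of age groups before it is passed on. It assumes that both limits of each group are inclusive, which is why the second group in the example above starts at 51.

```{r}
# Hypothetical helper (not part of OmopSketch): returns TRUE when every
# element is a c(lower, upper) pair with lower <= upper and, once sorted by
# lower limit, no group overlaps the next one (limits treated as inclusive).
checkAgeGroups <- function(ageGroup) {
  lower <- vapply(ageGroup, function(x) x[1], numeric(1))
  upper <- vapply(ageGroup, function(x) x[2], numeric(1))
  ord <- order(lower)
  lower <- lower[ord]
  upper <- upper[ord]
  validPairs <- all(lengths(ageGroup) == 2) && all(lower <= upper)
  # Compare each upper limit with the lower limit of the following group.
  noOverlap <- all(utils::head(upper, -1) < utils::tail(lower, -1))
  validPairs && noOverlap
}

checkAgeGroups(list(c(0, 50), c(51, 100))) # TRUE
checkAgeGroups(list(c(0, 50), c(50, 100))) # FALSE: age 50 is in both groups
```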

## Filtering by date range and time interval

Use the `dateRange` argument to limit the analysis to a specific period. Combine it with the `interval` argument to stratify results by time. Valid values for `interval` include `"overall"` (default), `"years"`, `"quarters"`, and `"months"`:

```{r, eval=FALSE}
result <- databaseCharacteristics(
  cdm = cdm,
  interval = "years",
  dateRange = as.Date(c("2010-01-01", "2018-12-31"))
)
```

## Sample the CDM

You can use the `sample` argument to limit the characterisation to a subset of the CDM.  
This can be useful for quickly exploring large datasets or focusing on a specific cohort already included in the CDM.

The `sample` argument accepts either:

- An **integer**, to randomly sample a specified number of people from the person table in the CDM.
- A **string**, corresponding to the name of a cohort within the CDM to use for characterisation.


```{r, eval=FALSE}
result <- databaseCharacteristics(
  cdm = cdm,
  sample = 1000L
)

result <- databaseCharacteristics(
  cdm = cdm,
  sample = "my_cohort"
)
```
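
The cohort name passed to `sample` must refer to a cohort table that already exists in the CDM. As a sketch of how such a cohort might be created, the chunk below uses `demographicsCohort()` from *CohortConstructor*. It assumes that package is installed and that the function accepts the arguments shown; the name `my_cohort` and the restriction to female individuals are illustrative only.

```{r, eval = FALSE}
library(CohortConstructor)

# Assumption: demographicsCohort() builds a cohort from demographic
# restrictions and stores it in the CDM under the given `name`.
cdm$my_cohort <- demographicsCohort(
  cdm = cdm,
  name = "my_cohort",
  sex = "Female"
)

# Characterise only the individuals included in that cohort.
result <- databaseCharacteristics(
  cdm = cdm,
  sample = "my_cohort"
)
```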

## Including concept counts

To include concept counts in the characterisation, set `conceptIdCounts = TRUE`:

```{r, eval=FALSE}
result <- databaseCharacteristics(
  cdm = cdm,
  conceptIdCounts = TRUE
)
```

## Other arguments

You can also pass arguments of any of the underlying functions to `databaseCharacteristics()` to customise the output.
For example, to stratify trends and concept counts by whether records fall inside or outside an observation period, pass `inObservation = TRUE`:

```{r, eval = FALSE}
result <- databaseCharacteristics(
  cdm = cdm,
  conceptIdCounts = TRUE, 
  inObservation = TRUE
)
```


# Visualise the characterisation results

To explore the characterisation results interactively, you can use the `shinyCharacteristics()` function. This function generates a Shiny application in the specified `directory`, allowing you to browse, filter, and visualise the results through an intuitive user interface.

```{r, eval=FALSE}
shinyCharacteristics(result = result, directory = "path/to/your/shiny")
```

## Customise the Shiny app

You can customise the title, logo, theme, and background of the Shiny app by setting the appropriate arguments:

- `title`: The title displayed at the top of the app.

- `logo`: Path to a custom logo (must be in SVG format).

- `theme`: One of the available `OmopViewer` themes.

- `background`: A custom background panel for the Shiny app.

```{r, eval=FALSE}
shinyCharacteristics(
  result = result, 
  directory = "path/to/my/shiny",
  title = "Characterisation of my data",
  logo = "path/to/my/logo.svg",
  theme = "scarlet", 
  background = "path/to/my/background.md"
)
```

An example of the Shiny application generated by `shinyCharacteristics()` can be explored [here](https://dpa-pde-oxford.shinyapps.io/OmopSketchCharacterisation/), where the characterisation of several synthetic datasets is available.

# Disconnect from CDM

Finally, disconnect from the mock CDM.

```{r}
cdmDisconnect(cdm = cdm)
```