---
title: "Using Template Cohorts"
author: "James P. Gilbert"
date: "`r Sys.Date()`"
output:
  html_document:
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{Using Template Cohorts}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

# Introduction
This guide demonstrates the use of template cohorts within the CohortGenerator package.
Template cohorts provide a convenient approach to computing large sets of features.
While this is possible through the use of custom scripts, doing so will often require one-off approaches to integrating
references within studies or other OHDSI packages, greatly limiting their reproducibility.

The principle behind this implementation is that, for all intents and purposes, cohorts created via "bulk" operations
should be treated no differently to cohorts created through Circe definitions.

## Limitations of this approach
For the design of reliable, reusable phenotype algorithms, we strongly advise the use of Circe-based approaches.
Although a Circe definition may be less efficient than pure SQL, relying on custom SQL can greatly limit the
reproducibility and replicability of studies using these cohorts.

# Basic SQL templates

The most straightforward cohort template is an SQL only definition for an individual cohort.
A good use case for this example is a cohort for which a Circe definition is inefficient and an equivalent cohort
can be made with SQL alone or, in this example, where the cohort is speculative based on the standard vocabulary.
Here we create a cohort that searches for all drug exposures based on string patterns in the concept vocabulary
table.

The first step is to create a definition as follows:
```{r}
library(CohortGenerator)

cohortDefinitionSet <- createEmptyCohortDefinitionSet()

sql <- "INSERT INTO @cohort_database_schema.@cohort_table
              (cohort_definition_id, subject_id, cohort_start_date, cohort_end_date)
         SELECT 1 as cohort_definition_id,
                person_id as subject_id,
                drug_era_start_date as cohort_start_date,
                drug_era_end_date as cohort_end_date
         FROM @cdm_database_schema.drug_era de
         INNER JOIN @cdm_database_schema.concept c on de.drug_concept_id = c.concept_id
         -- Find any drugs whose concept name contains 'aspirin'
         WHERE lower(c.concept_name) like '%aspirin%'; "

cohortDefinitionSet <- cohortDefinitionSet |>
  addSqlCohortDefinition(
    sql = sql,
    cohortId = 1,
    cohortName = "my aspirin cohort"
  )

connection <- DatabaseConnector::connect(Eunomia::getEunomiaConnectionDetails())
createCohortTables(connection = connection, cohortDatabaseSchema = "main")
status <- generateCohortSet(
  connection = connection,
  cdmDatabaseSchema = "main",
  cohortTableNames = getCohortTableNames(),
  cohortDefinitionSet = cohortDefinitionSet,
  incremental = TRUE
)
```
Note that this definition never hard-codes values for the bound parameters `@cohort_table`, `@cohort_database_schema`,
`@cdm_database_schema` or `@vocabulary_database_schema`; they appear only as placeholders. This is for two reasons.
First, their values are already known at the time of cohort generation. Second, hard-coded values would make the
cohort definition not generalizable to other data sources (i.e. the phenotype algorithm would not be transportable to
other data sets).
Hard-coded values may also leak information about your CDM, so definitions should be checked before inclusion within
network studies.
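
As a minimal sketch of how such placeholders are resolved, the chunk below renders a small SQL string with `SqlRender::render`. The query itself and the values `main` and `cohort` are assumptions chosen to match the Eunomia example above; during cohort generation this substitution is performed for you.

```{r}
library(SqlRender)

# A small illustrative query that uses two of the bound parameters
exampleSql <- "SELECT COUNT(*) FROM @cohort_database_schema.@cohort_table;"

# Values are supplied at render time rather than written into the definition
renderedSql <- render(
  exampleSql,
  cohort_database_schema = "main",
  cohort_table = "cohort"
)
print(renderedSql)
```

Because the values are supplied only at render time, the same definition can be reused against a different schema or cohort table without modification.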


## Validating custom SQL cohorts

To be valid, a cohort definition must meet the following criteria:

* It must include a cohort id, subject id, start date and end date
* The start date must always be on or before the end date
* There must be no overlapping or duplicate eras for the same subject within a cohort
* In general, cohort eras should fall only inside observation periods within the CDM

With Circe-based cohorts, it is guaranteed that definitions meet these criteria.
However, this is not necessarily true when generating custom SQL definitions.
Consequently, we use the `getCohortValidationCounts` function within the `CohortGenerator` package.

```{r}
getCohortValidationCounts(
  connection = connection,
  cdmDatabaseSchema = "main",
  cohortDatabaseSchema = "main"
)
```
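
To make the validity criteria above concrete, here is a minimal base R sketch that checks two of them (date ordering and overlapping eras) on a small, made-up cohort `data.frame`. The data and variable names are hypothetical, and this is not how `getCohortValidationCounts` is implemented; it only illustrates what the criteria mean.

```{r}
# Hypothetical cohort rows, as they might appear in a cohort table
cohort <- data.frame(
  cohort_definition_id = c(1, 1, 1, 1),
  subject_id = c(10, 10, 11, 12),
  cohort_start_date = as.Date(c("2020-01-01", "2020-01-15", "2020-03-01", "2020-06-10")),
  cohort_end_date = as.Date(c("2020-01-31", "2020-02-15", "2020-03-20", "2020-06-01"))
)

# Criterion: the start date must be on or before the end date
invalidDates <- cohort[cohort$cohort_start_date > cohort$cohort_end_date, ]

# Criterion: no overlapping or duplicate eras for a subject within a cohort.
# Sort by cohort, subject and start date, then flag any era that starts on or
# before the end date of the previous era for the same cohort and subject.
cohort <- cohort[order(cohort$cohort_definition_id, cohort$subject_id, cohort$cohort_start_date), ]
n <- nrow(cohort)
groupKey <- paste(cohort$cohort_definition_id, cohort$subject_id)
sameGroup <- c(FALSE, groupKey[-1] == groupKey[-n])
previousEnd <- c(as.Date(NA), cohort$cohort_end_date[-n])
overlaps <- cohort[sameGroup & cohort$cohort_start_date <= previousEnd, ]

invalidDates
overlaps
```

In this made-up data, subject 12 has a start date after the end date and subject 10 has two overlapping eras, so each check returns one row. Comparing neighbouring eras after sorting is enough to detect whether any overlap exists, although it does not list every pair of overlapping eras.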

# Built in large scale definitions
There are currently three large-scale cohort generation methods included within the package that use vocabulary tables
to bulk generate thousands of cohorts in a time that would be infeasible if using Circe standard cohorts from ATLAS or
Capr.
These can be broadly defined as:

1. `createRxNormCohortTemplateDefinition` This is a definition of all RxNorm standard ingredient concepts that uses the
concept ancestor table to merge all definitions into a single cohort.
2. `createAtcCohortTemplateDefinition` This is a definition of all ATC standard ingredients using the
concept hierarchy.
This can either be generated from the first exposure or from merging all eras between individual ingredients.
3. `createSnomedCohortTemplateDefinition` This is a definition of all SNOMED (OHDSI standard) condition occurrences
(with some exclusions based on strings to exclude).

## Drug ingredient cohorts

```{r}
# Library imports
library(CohortGenerator)
library(DatabaseConnector)

# Create RxNorm ingredient cohort template
rxNormDefinition <- createRxNormCohortTemplateDefinition(
  connection = connection, # Replace with your DatabaseConnector connection
  cdmDatabaseSchema = "main",
  priorObservationPeriod = 365
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortTemplateDefintion(cohortTemplateDefintion = rxNormDefinition)


# View details of generated template references
rxNormReferences <- rxNormDefinition$getTemplateReferences()
head(rxNormReferences)
```

## ATC-based cohorts
```{r, eval=FALSE}
# Create ATC-based cohort template
atcDefinition <- createAtcCohortTemplateDefinition(
  connection = connection, # Replace with your DatabaseConnector connection
  cdmDatabaseSchema = "main",
  mergeIngredientEras = TRUE,
  priorObservationPeriod = 365
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortTemplateDefintion(cohortTemplateDefintion = atcDefinition)

# View ATC template references
atcReferences <- atcDefinition$getTemplateReferences()
head(atcReferences)
```

## SNOMED condition cohorts
```{r}
# Create SNOMED cohort template
snomedDefinition <- createSnomedCohortTemplateDefinition(
  connection = connection, # Replace with your DatabaseConnector connection
  cdmDatabaseSchema = "main",
  priorObservationPeriod = 180 # Require 180 days prior observation
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortTemplateDefintion(cohortTemplateDefintion = snomedDefinition)

# View the condition template references
snomedReferences <- snomedDefinition$getTemplateReferences()
head(snomedReferences)
```

# Creating custom cohort templates

Creating custom templates involves a number of decisions that the developer must make:

1. How will references be defined?

This must be a `data.frame` that contains `cohortId` and `cohortName` fields. Additionally, this may contain JSON (for
example, if you wish to relate all templates to a Circe definition that includes concept sets).
In all three built-in methods above, templates are generated entirely from the vocabulary within an OMOP CDM, but this
detail is left to the implementation.

2. Define and parameterize the SQL.

The SQL for generating template cohorts has a few restrictive properties:

* It must result in inserts into the cohort table with column names `cohort_definition_id`, `cohort_start_date`,
`cohort_end_date`, `subject_id`
* It should use the SqlRender standard bindings `@cohort_table`, `@cdm_database_schema`, `@cohort_database_schema` and
`@vocabulary_database_schema` (in general, vocabulary tables should be stored within the CDM).
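
The two decisions above can be sketched in base R as follows. Everything here is illustrative: the ingredient list, the `conceptId * 1000 + 1` cohort ID scheme and the variable names are assumptions made for this example, not the scheme used by the built-in templates. How these pieces are then registered as a template with the package is not shown.

```{r}
# Hypothetical ingredient concepts (illustrative values); in practice these
# would typically be queried from the vocabulary tables
ingredients <- data.frame(
  conceptId = c(1112807, 1125315),
  conceptName = c("aspirin", "acetaminophen")
)

# 1. References: a data.frame with the required cohortId and cohortName fields.
# The ID scheme is an assumption chosen so that the SQL can derive the same IDs.
references <- data.frame(
  cohortId = ingredients$conceptId * 1000 + 1,
  cohortName = paste("Drug era of", ingredients$conceptName)
)

# 2. Parameterized SQL: a single insert that populates every cohort at once,
# using only the standard bound parameters for schema and table names
templateSql <- "
INSERT INTO @cohort_database_schema.@cohort_table
  (cohort_definition_id, subject_id, cohort_start_date, cohort_end_date)
SELECT CAST(de.drug_concept_id AS BIGINT) * 1000 + 1 AS cohort_definition_id,
       de.person_id AS subject_id,
       de.drug_era_start_date AS cohort_start_date,
       de.drug_era_end_date AS cohort_end_date
FROM @cdm_database_schema.drug_era de
INNER JOIN @vocabulary_database_schema.concept c
  ON de.drug_concept_id = c.concept_id
WHERE c.concept_class_id = 'Ingredient';
"

references
```

The important design point is that the cohort IDs written by the SQL and the `cohortId` values in the references are derived by the same rule, so every generated cohort can be traced back to a named reference.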


## Generating the cohorts

Template cohorts are generated in the standard manner within CohortGenerator.
```{r}
status <- generateCohortSet(
  connection = connection,
  cdmDatabaseSchema = "main",
  cohortTableNames = getCohortTableNames(),
  cohortDefinitionSet = cohortDefinitionSet,
  incremental = TRUE
)
```

### On execution order
Cohorts are executed in the following order:

1. Circe cohorts
2. Non-standard SQL/template SQL cohorts
3. Subset cohort definitions

For this reason, it is possible both to create subset cohorts from templates and to reference Circe cohorts within
template definitions. However, it is not currently possible to reference subsetted cohorts from template definitions
under this standard execution framework.
When referencing Circe cohorts in template definitions, special care should be taken around incremental execution.

# Conclusion
The CohortGenerator package offers a robust, flexible approach for creating large-scale or custom cohorts, making it easier to prototype and analyze in the OHDSI ecosystem. By leveraging built-in templates or user-defined SQL, you can create cohorts that combine efficiency, reproducibility, and scalability.

Key Takeaways:

* Large-scale cohort templates (RxNorm, ATC, SNOMED) allow for rapid, efficient generation of thousands of cohorts.
* Custom SQL-based templates enable precise control and address unique research requirements.
* The approach integrates smoothly with OHDSI tools like ATLAS, Capr, and DatabaseConnector to ensure compatibility across the OHDSI ecosystem.