---
title: "Using Template Cohorts"
author: "James P. Gilbert"
date: "`r Sys.Date()`"
output:
  html_document:
    number_sections: yes
    toc: yes
vignette: >
  %\VignetteIndexEntry{Using Template Cohorts}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

# Introduction

This guide demonstrates the usage of template cohorts within the CohortGenerator package.
Template cohorts provide a convenient approach to computing large sets of features.
While this is possible through the use of custom scripts, doing so will often require one-off approaches to integrating references within studies or other OHDSI packages, greatly limiting their reproducibility.
The principle behind this implementation is that, for all intents and purposes, cohorts created via "bulk" operations should be treated no differently to cohorts created through Circe definitions.

## Limitations of this approach

For the design of reliable, reusable Phenotype Algorithms, we strongly advise the usage of Circe based approaches.
The trade-off is that a Circe based approach may be less efficient than pure SQL, but relying on custom SQL can greatly limit the reproducibility and replicability of studies using these cohorts.

# Basic SQL templates

The most straightforward cohort template is an SQL only definition for an individual cohort.
A good use case for this example is a cohort for which a Circe definition is inefficient and an equivalent cohort can be made with SQL alone or, in this example, where the cohort is speculative based on the standard vocabulary.
Here we create a cohort that searches for all drug exposures based on string patterns in the concept vocabulary table.
The first step is to create a definition as follows:

```{r}
library(CohortGenerator)
cohortDefinitionSet <- createEmptyCohortDefinitionSet()

sql <- "
INSERT INTO @cohort_database_schema.@cohort_table (cohort_definition_id, subject_id, cohort_start_date, cohort_end_date)
SELECT
  1 as cohort_definition_id,
  person_id as subject_id,
  drug_era_start_date as cohort_start_date,
  drug_era_end_date as cohort_end_date
FROM @cdm_database_schema.drug_era de
INNER JOIN @cdm_database_schema.concept c on de.drug_concept_id = c.concept_id
-- Find any matches of drugs named 'aspirin' in the drug concept table
WHERE lower(c.concept_name) like '%aspirin%';
"

cohortDefinitionSet <- cohortDefinitionSet |>
  addSqlCohortDefinition(
    sql = sql,
    cohortId = 1,
    cohortName = "my aspirin cohort"
  )

connection <- DatabaseConnector::connect(Eunomia::getEunomiaConnectionDetails())
createCohortTables(connection = connection, cohortDatabaseSchema = "main")

status <- generateCohortSet(
  connection = connection,
  cdmDatabaseSchema = "main",
  cohortTableNames = getCohortTableNames(),
  cohortDefinitionSet = cohortDefinitionSet,
  incremental = TRUE
)
```

Note: this definition never hard-codes values for the bound parameters `@cohort_table`, `@cohort_database_schema`, `@cdm_database_schema` or `@vocabulary_database_schema`; it references only the placeholders.
This is for two reasons. First, the values are supplied at the time of cohort generation. Second, hard-coded values would make the cohort definition not generalizable to other data sources (i.e. the phenotype algorithm would not be transportable to other data sets).
Hard-coded values may also leak information about your CDM, so definitions should be checked before inclusion within network studies.

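
One way to run such a check before sharing a definition is to list the `@parameter` placeholders that a SQL template references and compare them with the standard bindings named above. The sketch below uses base R only. `findSqlParameters`, `expectedParameters` and `templateSql` are hypothetical names introduced for illustration and are not part of CohortGenerator or SqlRender.

```r
# Hypothetical pre-flight check (not a CohortGenerator function):
# list the @parameters used in a SQL template and flag unexpected ones.

# The standard bindings named in this vignette
expectedParameters <- c(
  "cohort_database_schema",
  "cohort_table",
  "cdm_database_schema",
  "vocabulary_database_schema"
)

# Extract the unique placeholder names (without the leading @) from a SQL string
findSqlParameters <- function(sql) {
  matches <- regmatches(sql, gregexpr("@[A-Za-z_][A-Za-z0-9_]*", sql))[[1]]
  unique(sub("^@", "", matches))
}

# A shortened version of the template used in this section
templateSql <- "
INSERT INTO @cohort_database_schema.@cohort_table
  (cohort_definition_id, subject_id, cohort_start_date, cohort_end_date)
SELECT 1, person_id, drug_era_start_date, drug_era_end_date
FROM @cdm_database_schema.drug_era;
"

usedParameters <- findSqlParameters(templateSql)
unexpectedParameters <- setdiff(usedParameters, expectedParameters)

# Warn if the template relies on bindings that are not supplied at generation time
if (length(unexpectedParameters) > 0) {
  warning("Unexpected parameters: ", paste(unexpectedParameters, collapse = ", "))
}
```

This only inspects placeholders. It cannot detect a schema or table name that has been typed directly into the SQL, so a manual read of the definition is still needed before it is included in a network study.
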

### Validating custom SQL cohorts

To be valid, a cohort definition must meet the following criteria:

* Each row includes a cohort id, subject id, start date and end date
* The start date is always on or before the end date
* There are no overlapping or duplicate eras for the same subject within a cohort
* In general, cohort eras fall only inside observation periods within the CDM

With Circe based cohorts, it is guaranteed that definitions meet these criteria.
However, this is not necessarily true when generating custom SQL definitions.
Consequently, we use the `getCohortValidationCounts` function within the `CohortGenerator` package.

```{r}
getCohortValidationCounts(
  connection = connection,
  cdmDatabaseSchema = "main",
  cohortDatabaseSchema = "main"
)
```

# Built in large scale definitions

There are currently three large scale cohort generation methods included within the package that use vocabulary tables to bulk generate thousands of cohorts in a time that would be infeasible if using Circe standard cohorts from ATLAS or Capr.
These can be broadly defined as:

1. `createRxNormCohortTemplateDefinition` This is a definition of all RxNorm standard ingredient concepts that uses the concept ancestor table to merge all definitions into a single cohort.
2. `createAtcCohortTemplateDefinition` This is a definition of all ATC standard ingredients using the concept hierarchy. This can either be generated from the first exposure or from merging all eras between individual ingredients.
3. `createSnomedCohortTemplateDefinition` This is a definition of all SNOMED (OHDSI standard) condition occurrences (with some exclusions based on strings to exclude).

## Drug ingredient cohorts

```{r}
# Library imports
library(CohortGenerator)
library(DatabaseConnector)

# Create RxNorm ingredient cohort template
rxNormDefinition <- createRxNormCohortTemplateDefinition(
  connection = connection, # Replace with your DatabaseConnector connection
  cdmDatabaseSchema = "main",
  priorObservationPeriod = 365
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortTemplateDefintion(cohortTemplateDefintion = rxNormDefinition)

# View details of generated template references
rxNormReferences <- rxNormDefinition$getTemplateReferences()
head(rxNormReferences)
```

## ATC-based cohorts

```{r, eval=FALSE}
# Create ATC-based cohort template
atcDefinition <- createAtcCohortTemplateDefinition(
  connection = connection, # Replace with your DatabaseConnector connection
  cdmDatabaseSchema = "main",
  mergeIngredientEras = TRUE,
  priorObservationPeriod = 365
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortTemplateDefintion(cohortTemplateDefintion = atcDefinition)

# View ATC template references
atcReferences <- atcDefinition$getTemplateReferences()
head(atcReferences)
```

## SNOMED condition cohorts

```{r}
# Create SNOMED cohort template
snomedDefinition <- createSnomedCohortTemplateDefinition(
  connection = connection, # Replace with your DatabaseConnector connection
  cdmDatabaseSchema = "main",
  priorObservationPeriod = 180 # Require 180 days prior observation
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortTemplateDefintion(cohortTemplateDefintion = snomedDefinition)

# View the condition template references
snomedReferences <- snomedDefinition$getTemplateReferences()
head(snomedReferences)
```

# Creating custom cohort templates

Creating custom templates requires a number of developer decisions:

1. How will references be defined? This must be a `data.frame` that contains `cohortId` and `cohortName` fields.
   Additionally, this may contain json (for example, if you wish to relate all templates to a Circe definition that includes concept sets).
   In all three methods above, templates are generated entirely from the vocabulary within an OMOP CDM, but this detail is left to be implementation specific.
2. Define and parameterize the SQL. The SQL for generating template cohorts has a few restrictive properties:
    * It must result in inserts into the cohort table with column names `cohort_definition_id`, `cohort_start_date`, `cohort_end_date`, `subject_id`
    * It should use the SqlRender standard bindings `@cohort_table`, `@cdm_database_schema`, `@cohort_database_schema` and `@vocabulary_database_schema` (in general vocabulary tables should be stored within the cdm).

## Generating the cohorts

Template cohorts are generated in the standard manner within CohortGenerator.

```{r}
status <- generateCohortSet(
  connection = connection,
  cdmDatabaseSchema = "main",
  cohortTableNames = getCohortTableNames(),
  cohortDefinitionSet = cohortDefinitionSet,
  incremental = TRUE
)
```

### On execution order

One note on execution is that the order of execution is:

1. Circe cohorts
2. Non-standard SQL/Template SQL cohorts
3. Subset cohort definitions

For this reason, it is possible to create subset cohorts from templates.
However, it is not currently possible to reference subsetted cohorts from template definitions under this standard execution framework.
When referencing Circe cohorts in template definitions, special care should be taken around incremental execution.

# Conclusion

The CohortGenerator package offers a robust, flexible approach for creating large-scale or custom cohorts, making it easier to prototype and analyze in the OHDSI ecosystem.
By leveraging built-in templates or user-defined SQL, you can create cohorts that combine efficiency, reproducibility, and scalability.


Key Takeaways:

* Large-scale cohort templates (RxNorm, ATC, SNOMED) allow for rapid, efficient generation of thousands of cohorts.
* Custom SQL-based templates enable precise control and address unique research requirements.
* The approach integrates smoothly with OHDSI tools like ATLAS, Capr, and DatabaseConnector to ensure compatibility across the OHDSI ecosystem.