Practical

Database

You can download the simulated database from here. The data are stored inside a zipped DuckDB database. For this practical, we extract the database into a directory named data.
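Assuming the downloaded archive is named `database-1M_filtered.zip` (the exact file name may differ; adjust it to match your download), the database can be extracted from R like so:

```r
# Create the data directory and extract the zipped DuckDB database into it.
# The zip file name below is an assumption; change it to match the downloaded file.
dir.create("data", showWarnings = FALSE)
utils::unzip("database-1M_filtered.zip", exdir = "data")

# The extracted database should now be available at:
file.exists("data/database-1M_filtered.duckdb")
```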

You can connect to the database using the HADES package DatabaseConnector as follows:

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "duckdb", 
  server = "data/database-1M_filtered.duckdb"
)

connection <- DatabaseConnector::connect(
  connectionDetails = connectionDetails
)
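To verify that the connection works, we can run a simple query against the CDM, for example counting the rows of the person table (a sketch; the table and schema names follow the OMOP CDM layout used in this practical):

```r
# Count the number of persons in the CDM to confirm the connection works
DatabaseConnector::querySql(
  connection,
  "SELECT COUNT(*) AS person_count FROM main.person"
)
```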

Data extraction

For the specific problem we will use existing cohorts generated by members of the OHDSI community, available in the community atlas instance. More specifically, the target cohort (T) will be the cohort with id 1782815 (patients hospitalized with pneumonia), and the outcome cohort (O) will be patients who died, that is, the cohort with id 1782813.

cohortIds <- c(1782815, 1782813)
baseUrl <- "http://api.ohdsi.org:8080/WebAPI"

cohortDefinitionSet <- ROhdsiWebApi::exportCohortDefinitionSet(
  baseUrl = baseUrl,
  cohortIds = cohortIds
)

Next, we will generate the cohorts and store them in a table named summerschool inside the original database. For that, we will use the HADES package CohortGenerator.

cohortTableNames <- CohortGenerator::getCohortTableNames(cohortTable = "summerschool")

# Next create the tables on the database
CohortGenerator::createCohortTables(
  connectionDetails = connectionDetails,
  cohortTableNames = cohortTableNames,
  cohortDatabaseSchema = "main"
)

# Generate the cohort set
cohortsGenerated <- CohortGenerator::generateCohortSet(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  cohortDatabaseSchema = "main",
  cohortTableNames = cohortTableNames,
  cohortDefinitionSet = cohortDefinitionSet
)
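After generation, we can check how many subjects ended up in each cohort using CohortGenerator:

```r
# Retrieve record and subject counts for the generated cohorts
cohortCounts <- CohortGenerator::getCohortCounts(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = "main",
  cohortTable = cohortTableNames$cohortTable
)
print(cohortCounts)
```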

Single model

In this section we will demonstrate the steps required for building a LASSO logistic regression model for the prediction of death within 60 days in patients hospitalized with pneumonia.

Database settings

First, we need to define the database details: the connection details, the table where we stored our generated cohorts (summerschool), the cohort ids we are interested in, as they are stored in that table (1782815 and 1782813), and the schemas where the data are stored (in this case main).

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "duckdb", 
  server = "data/database-1M_filtered.duckdb"
)

databaseDetails <- PatientLevelPrediction::createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  cohortDatabaseSchema = "main",
  cohortTable = "summerschool",
  targetId = 1782815,
  outcomeDatabaseSchema = "main",
  outcomeTable = "summerschool",
  outcomeIds = 1782813
)

Covariate settings

Second, we need to define the covariates we will use for training the prediction model. We will do that with the HADES package FeatureExtraction. In this case, we will use patients’:

  • demographics (gender and age)
  • conditions any time prior and in the year prior to being hospitalized with pneumonia
  • drug prescriptions any time prior and in the year prior to being hospitalized with pneumonia
  • number of visits observed in the year before being hospitalized with pneumonia
covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAge = TRUE,
  useConditionGroupEraLongTerm = TRUE,
  useConditionGroupEraAnyTimePrior = TRUE,
  useDrugGroupEraLongTerm = TRUE,
  useDrugGroupEraAnyTimePrior = TRUE,
  useVisitConceptCountLongTerm = TRUE,
  longTermStartDays = -365,
  endDays = -1
)

Restriction settings

Third, we can define (mostly time-based) restriction settings for the data we will extract, such as study start and end dates (to only include patients in a certain period), washout periods (to exclude patients who are not followed long enough), etc. In this case, we do not define any restriction settings.

restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings()

Sample settings

Fourth, we can define sample settings, which determine how to select a sample of the original dataset for training our prediction models. This can be useful if the original dataset is very large. In this case, we will not define any sample settings.

sampleSettings <- PatientLevelPrediction::createSampleSettings()

Feature engineering settings

Fifth, we can define feature engineering settings to transform the extracted covariates. In this case, we will not use any feature engineering.

featureEngineeringSettings <- PatientLevelPrediction::createFeatureEngineeringSettings()

Preprocess settings

Sixth, we can define settings for preprocessing the training data. In this case, we require that a covariate be present in at least 1% of the included patients in order to be considered for selection, we normalize the covariates before model training, and we remove redundant features.

preprocessSettings <- PatientLevelPrediction::createPreprocessSettings(
  minFraction = .01,
  normalize = TRUE,
  removeRedundancy = TRUE
)

Split settings

We need to define the split settings, that is, how the original data will be separated into a training dataset (for model development) and a test dataset (for model evaluation). In this case, we will do a 75-25% train-test split, using 2-fold cross-validation on the training set for hyperparameter tuning and evaluation, making sure that event rates remain the same across folds.

splitSettings <- PatientLevelPrediction::createDefaultSplitSetting(
  trainFraction = 0.75,
  testFraction = 0.25,
  type = 'stratified',
  nfold = 2, 
  splitSeed = 1234
)

Population settings

We need to define further settings and restrictions to generate the population actually used for model development. This allows us to use the same extracted data to generate multiple prediction models, using slightly altered populations (e.g. different time-at-risk windows (TARs), patients with and without prior outcomes, etc.). In this case, we will require patients to have at least 364 days of continuous prior observation in the database before their hospitalization with pneumonia, we will remove subjects with prior outcomes, and we will define the TAR to be days 1 to 60 after cohort start, requiring at least 59 days of follow-up after hospitalization.

populationSettings <- PatientLevelPrediction::createStudyPopulationSettings(
  washoutPeriod = 364,
  firstExposureOnly = FALSE,
  removeSubjectsWithPriorOutcome = TRUE,
  priorOutcomeLookback = 9999,
  riskWindowStart = 1,
  riskWindowEnd = 60, 
  minTimeAtRisk = 59,
  startAnchor = 'cohort start',
  endAnchor = 'cohort start',
  requireTimeAtRisk = TRUE,
  includeAllOutcomes = TRUE
)

Lastly, we need to define the settings for the model we want to train. In this case, we are going to train a LASSO logistic regression model using the default settings (cyclic coordinate descent).

lrModel <- PatientLevelPrediction::setLassoLogisticRegression()

We can now extract the data from the database using the following command:

plpData <- PatientLevelPrediction::getPlpData(
  databaseDetails = databaseDetails,
  covariateSettings = covariateSettings,
  restrictPlpDataSettings = restrictPlpDataSettings
)

Finally, we can train the LASSO logistic regression model using the following command:

lrResults <- PatientLevelPrediction::runPlp(
  plpData = plpData,
  outcomeId = 1782813, 
  analysisId = "single_model",
  analysisName = "Demonstration of runPlp for training single PLP models",
  populationSettings = populationSettings, 
  splitSettings = splitSettings,
  sampleSettings = sampleSettings, 
  featureEngineeringSettings = featureEngineeringSettings, 
  preprocessSettings = preprocessSettings,
  modelSettings = lrModel,
  logSettings = PatientLevelPrediction::createLogSettings(), 
  executeSettings = PatientLevelPrediction::createExecuteSettings(
    runSplitData = TRUE, 
    runSampleData = TRUE, 
    runfeatureEngineering = TRUE, 
    runPreprocessData = TRUE, 
    runModelDevelopment = TRUE, 
    runCovariateSummary = TRUE
  ), 
  saveDirectory = file.path(getwd(), "results")
)

This will develop the model, evaluate its performance on the test set and in cross-validation, and store the results in the results directory.
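The saved results can be loaded back in a later session. Assuming the default folder layout (a subfolder named after the analysisId containing a `plpResult` object; adjust the path if your layout differs), a sketch:

```r
# Load a previously saved PLP result from disk.
# The path follows the analysisId ("single_model") used above.
lrResults <- PatientLevelPrediction::loadPlpResult(
  file.path(getwd(), "results", "single_model", "plpResult")
)
```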

View

We can now launch the Shiny app to have a look at the generated model.

PatientLevelPrediction::viewPlp(lrResults)

Multiple models

It is very straightforward to develop more than one model by making only a few additions to the previous settings. We will demonstrate how we can use PatientLevelPrediction to train a LASSO logistic regression, a random forest, and a gradient boosting machine model on the same data and compare their performance.

First, we need to define a model design for each of the considered models. In this case, we only consider the default options:

modelDesignLasso <- PatientLevelPrediction::createModelDesign(
  targetId = 1782815, 
  outcomeId = 1782813, 
  restrictPlpDataSettings = restrictPlpDataSettings, 
  populationSettings = populationSettings, 
  covariateSettings = covariateSettings, 
  featureEngineeringSettings = featureEngineeringSettings,
  sampleSettings = sampleSettings, 
  splitSettings = splitSettings, 
  preprocessSettings = preprocessSettings, 
  modelSettings = PatientLevelPrediction::setLassoLogisticRegression()
)

modelDesignRandomForest <- PatientLevelPrediction::createModelDesign(
  targetId = 1782815, 
  outcomeId = 1782813, 
  restrictPlpDataSettings = restrictPlpDataSettings, 
  populationSettings = populationSettings, 
  covariateSettings = covariateSettings, 
  featureEngineeringSettings = featureEngineeringSettings,
  sampleSettings = sampleSettings, 
  splitSettings = splitSettings, 
  preprocessSettings = preprocessSettings, 
  modelSettings = PatientLevelPrediction::setRandomForest()
)

modelDesignGradientBoosting <- PatientLevelPrediction::createModelDesign(
  targetId = 1782815, 
  outcomeId = 1782813, 
  restrictPlpDataSettings = restrictPlpDataSettings, 
  populationSettings = populationSettings, 
  covariateSettings = covariateSettings, 
  featureEngineeringSettings = featureEngineeringSettings,
  sampleSettings = sampleSettings, 
  splitSettings = splitSettings, 
  preprocessSettings = preprocessSettings, 
  modelSettings = PatientLevelPrediction::setGradientBoostingMachine()
)

We can train the models all at once with:

results <- PatientLevelPrediction::runMultiplePlp(
  databaseDetails = databaseDetails, 
  modelDesignList = list(
    modelDesignLasso, 
    modelDesignRandomForest, 
    modelDesignGradientBoosting
  ), 
  onlyFetchData = FALSE,
  logSettings = PatientLevelPrediction::createLogSettings(),
  saveDirectory = file.path(getwd(), "results/multiple_models")
)
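Each model design is saved in its own subfolder of the save directory (e.g. `Analysis_1`, `Analysis_2`, ...; this naming is an assumption based on the package defaults), so individual results can be loaded back like so:

```r
# Load the result of the first model design (here, the LASSO model)
lassoResult <- PatientLevelPrediction::loadPlpResult(
  file.path(getwd(), "results/multiple_models", "Analysis_1", "plpResult")
)
```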

Finally, we can view the results in a Shiny app using:

PatientLevelPrediction::viewMultiplePlp("results/multiple_models")