02 – Example of training on multiple phenotypes

In this short, technical tutorial, we will look at one approach to using EIR-auto-GP to train models for multiple phenotypes. The idea is simply to show one example workflow.

We will be using the same PennCATH data as in tutorial 01 – Genomic Prediction for Coronary Artery Disease, but now we will be predicting some of the other phenotypes in the dataset: age, tg (triglycerides), hdl (high-density lipoprotein) and ldl (low-density lipoprotein). Predicting age from genotype data is admittedly a bit silly, but we’re just playing around a bit. As before, start by downloading the processed PennCATH data if you have not already done so.

We will be reusing the data from the previous tutorial, so the structure should look like this:

eir_auto_gp_tutorials/01_basic_tutorial/data
├── penncath
│   ├── penncath.bed
│   ├── penncath.bim
│   ├── penncath.csv
│   └── penncath.fam
└── penncath.zip

Since we are modelling multiple phenotypes, we would like to avoid processing the data multiple times. We can start by preparing only the data, without any modelling, by using the --only_data flag:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--global_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial \
--only_data \
--freeze_validation_set

This will only generate a data folder, containing the processed data, inside the path passed to --global_output_folder.

Now, we can start training models for each phenotype. The main difference from the previous tutorial is that we will be reusing the processed data, while passing in separate output folders for the modelling, feature selection and analysis. For example, in the case of tg:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/analysis \
--output_con_columns tg \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns ldl hdl age

Note

It’s a bit counter-intuitive that we are passing in data_output_folder, when it’s an “input” folder in this scenario. This is because the framework checks if the data folder exists, and if it does, it will skip the data processing step.
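If you script these runs, it can be useful to confirm that the prepared data folder is actually there before launching a model, since otherwise the data processing would presumably be triggered again. A minimal sketch, where the function name require_data_folder is made up for this example and is not part of EIR-auto-GP:

```shell
#!/bin/sh
# Hypothetical guard: succeed only if the prepared data folder exists.
DATA_FOLDER=eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data

require_data_folder() {
    if [ -d "$1" ]; then
        return 0
    fi
    echo "Missing data folder: $1 (run the --only_data step first)" >&2
    return 1
}

# Report whether the per-phenotype runs can reuse the processed data.
if require_data_folder "$DATA_FOLDER"; then
    echo "Found prepared data, per-phenotype runs can reuse it."
fi
```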

For the other phenotypes, here are the commands:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/analysis \
--output_con_columns hdl \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg ldl age

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/analysis \
--output_con_columns ldl \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg hdl age

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/analysis \
--output_con_columns age \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg ldl hdl

While the setup in this tutorial is fairly manual, it can easily be extended and automated as you see fit.
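One way to automate the four runs above is a small shell loop. The sketch below is only an illustration: the helper names other_phenotypes, build_command and run_all, as well as the DRY_RUN variable, are made up for this example and are not part of EIR-auto-GP. It uses only the flags already shown in this tutorial, and by default it just prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: generate one eirautogp command per phenotype.
# Helper names and DRY_RUN are hypothetical, not part of EIR-auto-GP.

GENOTYPES=eir_auto_gp_tutorials/01_basic_tutorial/data/penncath
LABELS="$GENOTYPES/penncath.csv"
RUNS=eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial
PHENOTYPES="tg ldl hdl age"

# Print every phenotype except the target; these become the continuous inputs.
other_phenotypes() {
    others=""
    for p in $PHENOTYPES; do
        if [ "$p" != "$1" ]; then
            others="${others:+$others }$p"
        fi
    done
    echo "$others"
}

# Print the full command for one target phenotype on a single line.
build_command() {
    echo "eirautogp" \
        "--genotype_data_path $GENOTYPES" \
        "--label_file_path $LABELS" \
        "--data_output_folder $RUNS/data" \
        "--feature_selection_output_folder $RUNS/$1/feature_selection" \
        "--modelling_output_folder $RUNS/$1/modelling" \
        "--analysis_output_folder $RUNS/$1/analysis" \
        "--output_con_columns $1" \
        "--input_cat_columns sex" \
        "--folds 5" \
        "--feature_selection gwas" \
        "--do_test" \
        "--input_con_columns $(other_phenotypes "$1")"
}

# Print the commands (default) or run them when DRY_RUN=0.
run_all() {
    for target in $PHENOTYPES; do
        cmd=$(build_command "$target")
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        else
            sh -c "$cmd"
        fi
    done
}

run_all
```

Setting DRY_RUN=0 executes the commands instead of printing them. Because every command points at the same data folder, all runs reuse the data prepared in the --only_data step.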

In case you are interested, here are the results on the test set for each phenotype:

TG (triglycerides)

Fold      RMSE     R2       PCC
Ensemble  71.386   0.0788   0.2856
Fold 0    71.6439  0.0722   0.2983
Fold 1    71.316   0.0806   0.2893
Fold 2    71.7878  0.0684   0.285
Fold 3    74.4062  -0.0008  0.1314
Fold 4    72.018   0.0624   0.27

HDL (high-density lipoprotein)

Fold      RMSE     R2       PCC
Ensemble  10.6967  0.2674   0.5224
Fold 0    11.1895  0.1983   0.4824
Fold 1    10.7284  0.263    0.5358
Fold 2    10.9134  0.2374   0.5175
Fold 3    11.0505  0.2181   0.4708
Fold 4    10.792   0.2543   0.5098

LDL (low-density lipoprotein)

Fold      RMSE     R2       PCC
Ensemble  40.8394  0.028    0.1692
Fold 0    41.4317  -0.0004  0.125
Fold 1    41.0572  0.0176   0.1485
Fold 2    39.8171  0.0761   0.2896
Fold 3    42.154   -0.0356  0.0789
Fold 4    42.1068  -0.0332  0.0746

Age

Fold      RMSE     R2       PCC
Ensemble  8.2984   0.0141   0.2595
Fold 0    8.6089   -0.0611  0.2408
Fold 1    8.2793   0.0186   0.2486
Fold 2    8.6151   -0.0626  0.2239
Fold 3    8.3359   0.0052   0.2146
Fold 4    8.1859   0.0407   0.2901

We see that the models predict the lipid phenotypes to varying degrees, with HDL being the best predicted phenotype (ensemble R2 of 0.2674). Interestingly, the ensemble PCC for age is ~0.26, which may be due to population bias, sex, and the fact that we included the lipids as inputs.
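It may help to recall how the three metrics in the tables relate. The formulas below assume the conventional definitions (this tutorial does not state the exact formulas used by the framework), with y_i the observed values, ŷ_i the predictions, and bars denoting means over the n test samples:

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
\mathrm{PCC} &= \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}
\end{align}
```

Under these definitions, PCC is unchanged by shifting or rescaling the predictions, whereas R2 penalizes such miscalibration and becomes negative when the predictions do worse than always predicting the mean. This is consistent with, for example, Fold 0 for age, where R2 is -0.0611 while PCC is 0.2408.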

That will be it for this tutorial. Thank you for reading!