02 – Example of training on multiple phenotypes

In this tutorial, we will look at one approach to using EIR-auto-GP to train models for multiple phenotypes. It is relatively short and technical; the idea is just to show one example workflow.

We will be using the same data from the PennCATH study as in tutorial 01 – Genomic Prediction for Coronary Artery Disease, but now we will be predicting some of the other phenotypes in the dataset: age, tg (triglycerides), hdl (high-density lipoprotein) and ldl (low-density lipoprotein). Predicting age from genotype data is admittedly a bit silly, but we’re just playing around. As before, start by downloading the processed PennCATH data.

We will be reusing the data from the previous tutorial, so the structure should look like this:

eir_auto_gp_tutorials/01_basic_tutorial/data
├── penncath
│   ├── penncath.bed
│   ├── penncath.bim
│   ├── penncath.csv
│   └── penncath.fam
└── penncath.zip

Since we are modelling multiple phenotypes, we would like to avoid processing the data multiple times. We can start by preparing only the data, without any modelling, by using the --only_data flag:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--global_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial \
--only_data \
--freeze_validation_set

This only generates a data folder, containing the processed data, inside the path passed to --global_output_folder.

Now, we can start training models for each phenotype. The main difference from the previous tutorial is that we reuse the processed data and pass in separate output folders for feature selection, modelling and analysis. For example, in the case of tg:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/analysis \
--output_con_columns tg \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns ldl hdl age

Note

It’s a bit counter-intuitive that we are passing in --data_output_folder when it acts as an “input” folder in this scenario. The reason is that the framework checks whether the data folder exists and, if it does, skips the data processing step.
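Because the skip depends on the data folder already being there, it can be worth checking for it before launching a training run. The following is a minimal sketch, not part of EIR-auto-GP: the helper name data_is_prepared is hypothetical, and treating "exists and is non-empty" as "prepared" is an assumption, since the exact files the framework looks for are not described here.

```shell
#!/bin/sh
# Folder written by the earlier --only_data run.
DATA_DIR="eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data"

# Hypothetical helper: succeeds if the directory exists and is non-empty.
# Assumption: a non-empty folder means the data step has already been run.
data_is_prepared() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if data_is_prepared "$DATA_DIR"; then
  echo "Found processed data in $DATA_DIR."
else
  echo "No processed data in $DATA_DIR; run the --only_data step first." >&2
fi
```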

For the other phenotypes, here are the commands:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/analysis \
--output_con_columns hdl \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg ldl age

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/analysis \
--output_con_columns ldl \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg hdl age

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/analysis \
--output_con_columns age \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg ldl hdl

While the setup in this tutorial is fairly manual, it can easily be extended and automated as you see fit.
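As one possible way to automate this, the four commands above can be generated in a loop. This is a rough sketch rather than an official recipe: the variable names (DRY_RUN, DATA_ROOT, RUN_ROOT, PHENOTYPES) and the helper functions are made up for illustration, while the eirautogp flags are exactly the ones used in the commands above. By default the script only prints the commands; setting DRY_RUN=0 would actually run them.

```shell
#!/bin/sh
set -eu

# Dry run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

DATA_ROOT="eir_auto_gp_tutorials/01_basic_tutorial/data/penncath"
RUN_ROOT="eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial"
PHENOTYPES="tg ldl hdl age"

# Print every phenotype except the target ($1), space-separated.
other_phenotypes() {
  others=""
  for p in $PHENOTYPES; do
    if [ "$p" != "$1" ]; then
      others="${others:+$others }$p"
    fi
  done
  echo "$others"
}

# Build the eirautogp call for one target phenotype, then print or run it.
train_phenotype() {
  target="$1"
  out="$RUN_ROOT/$target"
  # The unquoted $(...) is intentional: each remaining phenotype
  # becomes a separate argument to --input_con_columns.
  set -- eirautogp \
    --genotype_data_path "$DATA_ROOT" \
    --label_file_path "$DATA_ROOT/penncath.csv" \
    --data_output_folder "$RUN_ROOT/data" \
    --feature_selection_output_folder "$out/feature_selection" \
    --modelling_output_folder "$out/modelling" \
    --analysis_output_folder "$out/analysis" \
    --output_con_columns "$target" \
    --input_cat_columns sex \
    --folds 5 \
    --feature_selection gwas \
    --do_test \
    --input_con_columns $(other_phenotypes "$target")
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

for phenotype in $PHENOTYPES; do
  train_phenotype "$phenotype"
done
```

Keeping the dry run as the default makes it easy to compare the generated commands against the hand-written ones before committing to four full training runs.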

In case you are interested, here are the results on the test set for each phenotype:

TG (triglycerides)
Fold	RMSE	R2	PCC
Ensemble	71.386	0.0788	0.2856
Fold 0	71.6439	0.0722	0.2983
Fold 1	71.316	0.0806	0.2893
Fold 2	71.7878	0.0684	0.285
Fold 3	74.4062	-0.0008	0.1314
Fold 4	72.018	0.0624	0.27

HDL (high density lipoprotein)
Fold	RMSE	R2	PCC
Ensemble	10.6967	0.2674	0.5224
Fold 0	11.1895	0.1983	0.4824
Fold 1	10.7284	0.263	0.5358
Fold 2	10.9134	0.2374	0.5175
Fold 3	11.0505	0.2181	0.4708
Fold 4	10.792	0.2543	0.5098

LDL (low density lipoprotein)
Fold	RMSE	R2	PCC
Ensemble	40.8394	0.028	0.1692
Fold 0	41.4317	-0.0004	0.125
Fold 1	41.0572	0.0176	0.1485
Fold 2	39.8171	0.0761	0.2896
Fold 3	42.154	-0.0356	0.0789
Fold 4	42.1068	-0.0332	0.0746

Age
Fold	RMSE	R2	PCC
Ensemble	8.2984	0.0141	0.2595
Fold 0	8.6089	-0.0611	0.2408
Fold 1	8.2793	0.0186	0.2486
Fold 2	8.6151	-0.0626	0.2239
Fold 3	8.3359	0.0052	0.2146
Fold 4	8.1859	0.0407	0.2901

Predictive performance varies across the lipid phenotypes: HDL is predicted best (ensemble R2 of 0.2674), followed by TG (0.0788) and LDL (0.028). Interestingly, the PCC for age is around 0.25 (0.2595 for the ensemble), which may be due to population bias, sex and the fact that we included the lipids as inputs.
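It may look odd that the age model reaches a PCC of roughly 0.25 while its R2 is close to zero, or even negative in some folds. The textbook definitions of the three metrics help explain this. Whether EIR-auto-GP computes them in exactly this form is an assumption here; $y_i$ denotes the true value, $\hat{y}_i$ the prediction, and bars denote means over the $n$ test samples.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\qquad
\mathrm{PCC} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}
```

Under these definitions, PCC is unchanged when the predictions are shifted or rescaled by a positive factor, whereas R2 penalises any offset or scale mismatch and drops below zero once the predictions are worse than always predicting the mean. A model can therefore rank samples in roughly the right order (positive PCC) and still have an R2 near or below zero.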

That will be it for this tutorial. Thank you for reading!