02 – Example of training on multiple phenotypes
In this short, technical tutorial,
we will look at one approach to using EIR-auto-GP
to train models for multiple phenotypes.
The idea is simply to show one example workflow.
We will be using the same data from the PennCATH study as in tutorial 01 – Genomic Prediction for Coronary Artery Disease, but now we will be predicting some of the other phenotypes in the dataset: age, tg (triglycerides), hdl (high-density lipoprotein) and ldl (low-density lipoprotein). Predicting age from genotype data is admittedly a bit silly, but we are just experimenting here.
As before, start by downloading the processed PennCATH data.
We will be reusing the data from the previous tutorial, so the structure should look like this:
eir_auto_gp_tutorials/01_basic_tutorial/data
├── penncath
│ ├── penncath.bed
│ ├── penncath.bim
│ ├── penncath.csv
│ └── penncath.fam
└── penncath.zip
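Before running anything, it can help to confirm that the files are where the commands in this tutorial expect them. The following is a minimal sketch of such a check; the helper name `missing_penncath_files` is hypothetical, and the file list is simply taken from the tree above.

```shell
# Hypothetical helper: print the name of every expected PennCATH file
# that is missing from the given directory (prints nothing if all exist).
missing_penncath_files() {
    data_dir="$1"
    for name in penncath.bed penncath.bim penncath.fam penncath.csv; do
        if [ ! -f "${data_dir}/${name}" ]; then
            echo "${name}"
        fi
    done
}

DATA_DIR="eir_auto_gp_tutorials/01_basic_tutorial/data/penncath"
missing="$(missing_penncath_files "${DATA_DIR}")"

if [ -n "${missing}" ]; then
    # Word splitting of ${missing} is intended: one name per word.
    echo "Missing files in ${DATA_DIR}:" ${missing}
else
    echo "All PennCATH input files found."
fi
```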
Since we are modelling multiple phenotypes,
we would like to avoid processing the data multiple times.
We can start by preparing only the data, without any modelling,
by using the --only_data flag:
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--global_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial \
--only_data \
--freeze_validation_set
This will only generate a data folder,
with the processed data,
inside the path passed to global_output_folder.
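Now we can train a model for each phenotype. The main difference from the previous tutorial is that we reuse the processed data by pointing --data_output_folder at it, while passing separate output folders for feature selection, modelling and analysis. For each target phenotype, the remaining three phenotypes are passed as continuous inputs and sex as a categorical input. For example, in the case of tg: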
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/tg/analysis \
--output_con_columns tg \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns ldl hdl age
Note
It’s a bit counter-intuitive that we are passing in data_output_folder,
when it’s an “input” folder in this scenario. This is because
the framework checks if the data folder exists,
and if it does, it will skip the data processing step.
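Because the skip depends on the data folder already being there, a quick guard before launching a phenotype run can catch a wrong path early. This is only a sketch: the helper `data_is_prepared` is hypothetical, and it checks nothing more than that the folder exists and is not empty, which is an assumption about what counts as "prepared".

```shell
# Path used for --data_output_folder in the commands of this tutorial.
DATA_FOLDER="eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data"

# Hypothetical helper: succeeds only if the folder exists and is not empty.
data_is_prepared() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if data_is_prepared "${DATA_FOLDER}"; then
    echo "Found processed data in ${DATA_FOLDER}; it should be reused."
else
    echo "No processed data in ${DATA_FOLDER}; run the --only_data step first."
fi
```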
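For the other phenotypes (hdl, ldl and age, in that order), the commands follow the same pattern. Only the output column, the continuous input columns and the output folders change: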
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/hdl/analysis \
--output_con_columns hdl \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg ldl age
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/ldl/analysis \
--output_con_columns ldl \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg hdl age
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial/age/analysis \
--output_con_columns age \
--input_cat_columns sex \
--folds 5 \
--feature_selection gwas \
--do_test \
--input_con_columns tg ldl hdl
While the setup in this tutorial is fairly manual, it can easily be extended and automated as you see fit.
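As one possible way to automate it, the sketch below builds the same four commands in a loop. The helper names (`other_phenotypes`, `build_command`) and variables are hypothetical; the flags and paths are copied from the commands above. It only prints the commands (a dry run), so nothing is trained until you choose to execute them.

```shell
DATA_ROOT="eir_auto_gp_tutorials/01_basic_tutorial/data/penncath"
RUN_ROOT="eir_auto_gp_tutorials/tutorial_runs/02_multi_tutorial"
PHENOTYPES="tg ldl hdl age"

# Print every phenotype except the target, separated by spaces.
other_phenotypes() {
    target="$1"
    others=""
    for p in ${PHENOTYPES}; do
        if [ "${p}" != "${target}" ]; then
            others="${others:+${others} }${p}"
        fi
    done
    echo "${others}"
}

# Print the eirautogp command for one target phenotype on a single line.
build_command() {
    target="$1"
    out="${RUN_ROOT}/${target}"
    # The unquoted $(...) is intended: each input column becomes its own word.
    echo eirautogp \
        --genotype_data_path "${DATA_ROOT}" \
        --label_file_path "${DATA_ROOT}/penncath.csv" \
        --data_output_folder "${RUN_ROOT}/data" \
        --feature_selection_output_folder "${out}/feature_selection" \
        --modelling_output_folder "${out}/modelling" \
        --analysis_output_folder "${out}/analysis" \
        --output_con_columns "${target}" \
        --input_cat_columns sex \
        --folds 5 \
        --feature_selection gwas \
        --do_test \
        --input_con_columns $(other_phenotypes "${target}")
}

# Dry run: print one command per phenotype.
for target in ${PHENOTYPES}; do
    build_command "${target}"
done
```

To actually launch a run, the printed line can be piped to a shell, for example `build_command tg | sh`. This works here because none of the paths contain spaces; paths with spaces would need more careful quoting.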
In case you are interested, here are the results on the test set for each phenotype:
tg (triglycerides):

| Fold | RMSE | R2 | PCC |
|---|---|---|---|
| Ensemble | 71.386 | 0.0788 | 0.2856 |
| Fold 0 | 71.6439 | 0.0722 | 0.2983 |
| Fold 1 | 71.316 | 0.0806 | 0.2893 |
| Fold 2 | 71.7878 | 0.0684 | 0.285 |
| Fold 3 | 74.4062 | -0.0008 | 0.1314 |
| Fold 4 | 72.018 | 0.0624 | 0.27 |

hdl (high-density lipoprotein):

| Fold | RMSE | R2 | PCC |
|---|---|---|---|
| Ensemble | 10.6967 | 0.2674 | 0.5224 |
| Fold 0 | 11.1895 | 0.1983 | 0.4824 |
| Fold 1 | 10.7284 | 0.263 | 0.5358 |
| Fold 2 | 10.9134 | 0.2374 | 0.5175 |
| Fold 3 | 11.0505 | 0.2181 | 0.4708 |
| Fold 4 | 10.792 | 0.2543 | 0.5098 |

ldl (low-density lipoprotein):

| Fold | RMSE | R2 | PCC |
|---|---|---|---|
| Ensemble | 40.8394 | 0.028 | 0.1692 |
| Fold 0 | 41.4317 | -0.0004 | 0.125 |
| Fold 1 | 41.0572 | 0.0176 | 0.1485 |
| Fold 2 | 39.8171 | 0.0761 | 0.2896 |
| Fold 3 | 42.154 | -0.0356 | 0.0789 |
| Fold 4 | 42.1068 | -0.0332 | 0.0746 |

age:

| Fold | RMSE | R2 | PCC |
|---|---|---|---|
| Ensemble | 8.2984 | 0.0141 | 0.2595 |
| Fold 0 | 8.6089 | -0.0611 | 0.2408 |
| Fold 1 | 8.2793 | 0.0186 | 0.2486 |
| Fold 2 | 8.6151 | -0.0626 | 0.2239 |
| Fold 3 | 8.3359 | 0.0052 | 0.2146 |
| Fold 4 | 8.1859 | 0.0407 | 0.2901 |

We see that how well the models predict the lipid phenotypes varies considerably, with HDL being the best predicted phenotype (ensemble R2 of 0.2674). Interestingly, the ensemble PCC for age is ~0.26, which may be due to population bias, sex, and the fact that we included the lipids as inputs.
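For reading the tables, it may help to recall how the three metrics are usually defined. The formulas below are the standard textbook definitions, with $y_i$ the observed value, $\hat{y}_i$ the prediction and $n$ the number of test samples. That EIR-auto-GP computes them in exactly this form is an assumption; this tutorial does not state it.

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
                {\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2} \\
\mathrm{PCC} &= \frac{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}
                     {\sqrt{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}\,
                      \sqrt{\sum_{i=1}^{n} \left(\hat{y}_i - \bar{\hat{y}}\right)^2}}
\end{align}
```

Under these definitions, PCC is unchanged if the predictions are shifted or rescaled, while R2 is not. This is one way a fold can show a clearly positive PCC together with an R2 at or below zero, such as the fold reporting a PCC of 0.2408 alongside an R2 of -0.0611.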
That will be it for this tutorial. Thank you for reading!