03 – Comparison of Feature Selection Methods
In this tutorial, we will quickly go over the different feature selection methods implemented in the software.
We will use the same PennCATH data as in tutorial 01 – Genomic Prediction for Coronary Artery Disease, again predicting CAD. As before, start by downloading the processed PennCATH data.
Since we are reusing the data from the previous tutorial, the folder structure should look like this:
eir_auto_gp_tutorials/01_basic_tutorial/data
├── penncath
│ ├── penncath.bed
│ ├── penncath.bim
│ ├── penncath.csv
│ └── penncath.fam
└── penncath.zip
Since we are doing multiple runs with different feature selection methods,
we would like to avoid processing the data multiple times.
We can start by preparing only the data, without any modelling,
by using the --only_data flag:
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--global_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data \
--only_data \
--freeze_validation_set
This will only generate a data folder,
with the processed data,
inside the path passed to global_output_folder.
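Before launching the modelling runs, it can be useful to confirm that the data step finished. The following is a minimal Python sketch that only checks that the data folder exists and is not empty. The helper name data_is_prepared is hypothetical, and the check makes no assumption about which files eirautogp writes inside the folder.

```python
from pathlib import Path

# Path passed to --global_output_folder in the command above.
global_output_folder = Path(
    "eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data"
)
# The processed data is expected in a "data" subfolder of that path.
data_folder = global_output_folder / "data"


def data_is_prepared(folder: Path) -> bool:
    """Return True if `folder` is an existing directory with at least one entry.

    Hypothetical helper: it does not validate the folder contents.
    """
    return folder.is_dir() and any(folder.iterdir())


print(f"{data_folder}: prepared = {data_is_prepared(data_folder)}")
```

If this prints prepared = False, re-run the data preparation command before starting any of the modelling runs.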
Now, we can start training models for each feature selection method.
We will specifically focus on the following methods:
gwas, gwas->dl and gwas+bo. Other methods are available,
such as dl. However, since we have around 500K SNPs and only around 2K samples,
we will skip the methods that do not include a GWAS pre-filtering step, both
(a) to save time and (b) because those models would likely be grossly overfit.
Note
For more information on the feature selection methods,
please refer to the output of eirautogp --help and the
documentation page Feature Selection Methods.
Here are the commands we can use to train the models:
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection gwas \
--n_dl_feature_selection_setup_folds 3 \
--do_test
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection 'gwas->dl' \
--n_dl_feature_selection_setup_folds 3 \
--do_test
eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection gwas+bo \
--n_dl_feature_selection_setup_folds 3 \
--do_test
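The three commands above differ only in the value passed to --feature_selection and in the name of the run folder. As a sketch of that pattern, the Python snippet below assembles the same three commands from one template and prints them. It uses only the flags and paths shown above. The names METHODS and build_command are illustrative and not part of eirautogp.

```python
import shlex

DATA_ROOT = "eir_auto_gp_tutorials/01_basic_tutorial/data/penncath"
RUNS_ROOT = "eir_auto_gp_tutorials/tutorial_runs"

# Feature selection method -> run folder suffix, matching the commands above.
METHODS = {
    "gwas": "gwas",
    "gwas->dl": "gwas_then_dl",
    "gwas+bo": "gwas_and_bo",
}


def build_command(method: str, suffix: str) -> list[str]:
    """Build the eirautogp argument list for one feature selection method."""
    run_root = f"{RUNS_ROOT}/03_feature_selection_{suffix}"
    return [
        "eirautogp",
        "--genotype_data_path", DATA_ROOT,
        "--label_file_path", f"{DATA_ROOT}/penncath.csv",
        "--data_output_folder", f"{RUNS_ROOT}/03_feature_selection_data/data",
        "--feature_selection_output_folder", f"{run_root}/feature_selection",
        "--modelling_output_folder", f"{run_root}/modelling",
        "--analysis_output_folder", f"{run_root}/analysis",
        "--output_cat_columns", "CAD",
        "--input_con_columns", "tg", "hdl", "ldl", "age",
        "--input_cat_columns", "sex",
        "--folds", "10",
        "--feature_selection", method,
        "--n_dl_feature_selection_setup_folds", "3",
        "--do_test",
    ]


commands = {method: build_command(method, suffix) for method, suffix in METHODS.items()}

for cmd in commands.values():
    # shlex.join quotes "gwas->dl" so a shell does not read ">" as a redirect.
    print(shlex.join(cmd))
    # To launch the run from Python instead of printing it, add
    # `import subprocess` and call: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues altogether, which matters for gwas->dl because an unquoted > is interpreted by the shell as an output redirect.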
In case you are interested, here are the results on the test set for each of the approaches:
| Fold | MCC | ACC | ROC-AUC-MACRO | AP-MACRO | LOSS |
|---|---|---|---|---|---|
| Ensemble | 0.34 | 0.6879 | 0.735 | 0.7991 | |
| Fold 0 | 0.3509 | 0.6879 | 0.7127 | 0.7633 | 0.7242 |
| Fold 1 | 0.2857 | 0.6596 | 0.7307 | 0.7875 | 0.6122 |
| Fold 2 | 0.3482 | 0.7021 | 0.7328 | 0.7972 | 0.6927 |
| Fold 3 | 0.3218 | 0.6667 | 0.7346 | 0.8018 | 0.6618 |
| Fold 4 | 0.3386 | 0.695 | 0.7067 | 0.7545 | 0.7327 |
| Fold 5 | 0.3094 | 0.6667 | 0.708 | 0.7979 | 0.6629 |
| Fold 6 | 0.2544 | 0.6241 | 0.7382 | 0.7761 | 0.6566 |
| Fold 7 | 0.3695 | 0.7092 | 0.7281 | 0.7851 | 0.6008 |
| Fold 8 | 0.263 | 0.6383 | 0.6998 | 0.7802 | 0.7171 |
| Fold 9 | 0.302 | 0.6454 | 0.7185 | 0.7781 | 0.7405 |

| Fold | MCC | ACC | ROC-AUC-MACRO | AP-MACRO | LOSS |
|---|---|---|---|---|---|
| Ensemble | 0.3224 | 0.6809 | 0.7305 | 0.793 | |
| Fold 0 | 0.34 | 0.6879 | 0.7088 | 0.776 | 0.7166 |
| Fold 1 | 0.321 | 0.6879 | 0.7363 | 0.8025 | 0.6442 |
| Fold 2 | 0.31 | 0.6738 | 0.7144 | 0.7959 | 0.7031 |
| Fold 3 | 0.3241 | 0.6596 | 0.6975 | 0.7611 | 0.7158 |
| Fold 4 | 0.3576 | 0.695 | 0.723 | 0.7659 | 0.7119 |
| Fold 5 | 0.2744 | 0.6596 | 0.7058 | 0.7771 | 0.6711 |
| Fold 6 | 0.363 | 0.695 | 0.7515 | 0.7895 | 0.6221 |
| Fold 7 | 0.2869 | 0.6667 | 0.7099 | 0.7736 | 0.6666 |
| Fold 8 | 0.3525 | 0.695 | 0.708 | 0.7913 | 0.6578 |
| Fold 9 | 0.3094 | 0.6667 | 0.7275 | 0.7909 | 0.7332 |

| Fold | MCC | ACC | ROC-AUC-MACRO | AP-MACRO | LOSS |
|---|---|---|---|---|---|
| Ensemble | 0.3875 | 0.7092 | 0.7354 | 0.7944 | |
| Fold 0 | 0.3807 | 0.695 | 0.7371 | 0.7962 | 0.634 |
| Fold 1 | 0.2857 | 0.6596 | 0.7307 | 0.7875 | 0.6122 |
| Fold 2 | 0.3869 | 0.7163 | 0.7303 | 0.7947 | 0.6755 |
| Fold 3 | 0.2691 | 0.6596 | 0.6923 | 0.784 | 0.6683 |
| Fold 4 | 0.281 | 0.6454 | 0.7093 | 0.781 | 0.6441 |
| Fold 5 | 0.3094 | 0.6667 | 0.708 | 0.7979 | 0.6629 |
| Fold 6 | 0.3513 | 0.6809 | 0.7423 | 0.781 | 0.6155 |
| Fold 7 | 0.3525 | 0.695 | 0.7513 | 0.7893 | 0.5987 |
| Fold 8 | 0.345 | 0.6809 | 0.7183 | 0.7796 | 0.7175 |
| Fold 9 | 0.3349 | 0.6879 | 0.6932 | 0.7401 | 0.6879 |

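To make the comparison easier to read, the ensemble rows of the three result tables can be put side by side. The Python sketch below copies those rows by hand and reports, for each metric, which table scores highest and how large the gap is between the best and worst table. The tables are keyed by their order of appearance, which is an assumption made here for illustration.

```python
# Ensemble rows copied from the three result tables, in order of appearance.
ensemble = {
    "table 1": {"MCC": 0.34, "ACC": 0.6879, "ROC-AUC-MACRO": 0.735, "AP-MACRO": 0.7991},
    "table 2": {"MCC": 0.3224, "ACC": 0.6809, "ROC-AUC-MACRO": 0.7305, "AP-MACRO": 0.793},
    "table 3": {"MCC": 0.3875, "ACC": 0.7092, "ROC-AUC-MACRO": 0.7354, "AP-MACRO": 0.7944},
}

metrics = ["MCC", "ACC", "ROC-AUC-MACRO", "AP-MACRO"]

# For every metric, higher is better, so the best table is the arg-max.
best = {m: max(ensemble, key=lambda table: ensemble[table][m]) for m in metrics}

# Gap between the best and the worst table for each metric.
spread = {
    m: max(row[m] for row in ensemble.values()) - min(row[m] for row in ensemble.values())
    for m in metrics
}

for m in metrics:
    print(f"{m}: best = {best[m]}, spread = {spread[m]:.4f}")
```

With these numbers, the third table leads on MCC, ACC and ROC-AUC-MACRO, while the first table is marginally ahead on AP-MACRO. The ROC-AUC-MACRO and AP-MACRO gaps are well below 0.01, so the methods are close on this test set.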
So here, we can see that the default GWAS+BO approach performs roughly the best, although the differences between the methods are small. This method has worked quite well on smaller datasets in internal tests, but it remains to be tested thoroughly on larger datasets such as the UKBB. For such larger datasets, GWAS->DL has mostly been used so far, but GWAS+BO may well work even better.
Thanks for reading this tutorial!