03 – Comparison of Feature Selection Methods

In this tutorial, we will briefly go over the different feature selection methods implemented in the software.

We will be using the same data from the PennCATH study, as in tutorial 01 – Genomic Prediction for Coronary Artery Disease, predicting CAD. As before, you start by downloading the processed PennCATH data.

Since we are reusing the data from the previous tutorial, the folder structure should look like this:

eir_auto_gp_tutorials/01_basic_tutorial/data
├── penncath
│   ├── penncath.bed
│   ├── penncath.bim
│   ├── penncath.csv
│   └── penncath.fam
└── penncath.zip

Since we are doing multiple runs with different feature selection methods, we would like to avoid processing the data multiple times. We can start by preparing only the data, without any modelling, by using the --only_data flag:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--global_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data \
--only_data \
--freeze_validation_set

This generates only a data folder, containing the processed data, inside the path passed to --global_output_folder.
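
Before starting the modelling runs, it can be worth confirming that the data-only step actually produced output. The following is a minimal sketch, assuming the processed data lands in a data subfolder of the --global_output_folder path (as the --data_output_folder arguments used later suggest). The check_data_dir helper is a hypothetical name, and nothing is assumed about which files the folder contains:

```shell
# Hypothetical helper: succeed only if the directory exists and is non-empty.
check_data_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir" >&2
        return 1
    fi
    # `ls -A` lists every entry except . and ..; empty output means an empty folder.
    if [ -z "$(ls -A "$dir")" ]; then
        echo "empty: $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}

DATA_DIR="eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data"
check_data_dir "$DATA_DIR" || echo "Re-run the --only_data step before training."
```

If the check fails, the later commands that point --data_output_folder at this path would have no prepared data to reuse.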

Now, we can start training models for each feature selection method. We will focus on the following methods: gwas, gwas->dl and gwas+bo. Other methods are available, such as dl. However, since we have around 500K SNPs and only around 2K samples, we will skip the methods that do not include a GWAS pre-filtering step. This (a) saves time and (b) avoids models that would likely be grossly overfit.

Note

For more information on the feature selection methods, please refer to the output of eirautogp --help and the documentation page Feature Selection Methods.

Here are the commands we can use to train the models:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection gwas \
--n_dl_feature_selection_setup_folds 3 \
--do_test

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection 'gwas->dl' \
--n_dl_feature_selection_setup_folds 3 \
--do_test

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection gwas+bo \
--n_dl_feature_selection_setup_folds 3 \
--do_test
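
The three commands above differ only in the --feature_selection value and in the name of the run folder, so they can also be driven from a small loop. The sketch below uses only the flags shown above. The run_method helper and the DRY_RUN variable are hypothetical names introduced for illustration. By default the commands are only printed, and setting DRY_RUN=0 would actually run them. Note that gwas->dl is quoted so that the shell does not treat > as an output redirection:

```shell
BASE="eir_auto_gp_tutorials"
DATA="$BASE/01_basic_tutorial/data/penncath"
RUNS="$BASE/tutorial_runs"

# Hypothetical helper: $1 is the --feature_selection value, $2 the run folder suffix.
run_method() {
    method="$1"
    out="$RUNS/03_feature_selection_$2"
    # DRY_RUN defaults to 1, so the command is printed rather than executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        runner="echo"
    else
        runner=""
    fi
    $runner eirautogp \
        --genotype_data_path "$DATA" \
        --label_file_path "$DATA/penncath.csv" \
        --data_output_folder "$RUNS/03_feature_selection_data/data" \
        --feature_selection_output_folder "$out/feature_selection" \
        --modelling_output_folder "$out/modelling" \
        --analysis_output_folder "$out/analysis" \
        --output_cat_columns CAD \
        --input_con_columns tg hdl ldl age \
        --input_cat_columns sex \
        --folds 10 \
        --feature_selection "$method" \
        --n_dl_feature_selection_setup_folds 3 \
        --do_test
}

run_method "gwas" "gwas"
run_method "gwas->dl" "gwas_then_dl"
run_method "gwas+bo" "gwas_and_bo"
```

Keeping the shared arguments in one place makes it harder for the three runs to drift apart, for example in the number of folds or the input columns.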

In case you are interested, here are the results on the test set for each of the approaches:

GWAS

Fold      MCC     ACC     ROC-AUC-MACRO  AP-MACRO  LOSS
Ensemble  0.34    0.6879  0.735          0.7991
Fold 0    0.3509  0.6879  0.7127         0.7633    0.7242
Fold 1    0.2857  0.6596  0.7307         0.7875    0.6122
Fold 2    0.3482  0.7021  0.7328         0.7972    0.6927
Fold 3    0.3218  0.6667  0.7346         0.8018    0.6618
Fold 4    0.3386  0.695   0.7067         0.7545    0.7327
Fold 5    0.3094  0.6667  0.708          0.7979    0.6629
Fold 6    0.2544  0.6241  0.7382         0.7761    0.6566
Fold 7    0.3695  0.7092  0.7281         0.7851    0.6008
Fold 8    0.263   0.6383  0.6998         0.7802    0.7171
Fold 9    0.302   0.6454  0.7185         0.7781    0.7405

GWAS->DL

Fold      MCC     ACC     ROC-AUC-MACRO  AP-MACRO  LOSS
Ensemble  0.3224  0.6809  0.7305         0.793
Fold 0    0.34    0.6879  0.7088         0.776     0.7166
Fold 1    0.321   0.6879  0.7363         0.8025    0.6442
Fold 2    0.31    0.6738  0.7144         0.7959    0.7031
Fold 3    0.3241  0.6596  0.6975         0.7611    0.7158
Fold 4    0.3576  0.695   0.723          0.7659    0.7119
Fold 5    0.2744  0.6596  0.7058         0.7771    0.6711
Fold 6    0.363   0.695   0.7515         0.7895    0.6221
Fold 7    0.2869  0.6667  0.7099         0.7736    0.6666
Fold 8    0.3525  0.695   0.708          0.7913    0.6578
Fold 9    0.3094  0.6667  0.7275         0.7909    0.7332

GWAS+BO

Fold      MCC     ACC     ROC-AUC-MACRO  AP-MACRO  LOSS
Ensemble  0.3875  0.7092  0.7354         0.7944
Fold 0    0.3807  0.695   0.7371         0.7962    0.634
Fold 1    0.2857  0.6596  0.7307         0.7875    0.6122
Fold 2    0.3869  0.7163  0.7303         0.7947    0.6755
Fold 3    0.2691  0.6596  0.6923         0.784     0.6683
Fold 4    0.281   0.6454  0.7093         0.781     0.6441
Fold 5    0.3094  0.6667  0.708          0.7979    0.6629
Fold 6    0.3513  0.6809  0.7423         0.781     0.6155
Fold 7    0.3525  0.695   0.7513         0.7893    0.5987
Fold 8    0.345   0.6809  0.7183         0.7796    0.7175
Fold 9    0.3349  0.6879  0.6932         0.7401    0.6879

Here, we can see that the default GWAS+BO approach performs roughly the best: its ensemble has the highest MCC (0.3875) and ACC (0.7092), while ROC-AUC-MACRO and AP-MACRO are close across all three methods. This method has worked quite well on smaller datasets in internal tests, but it remains to be tested thoroughly on larger datasets such as the UKBB. For those, GWAS->DL has mostly been used so far, although GWAS+BO may well turn out to work even better.
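
To make the comparison concrete, the ensemble rows can be compared metric by metric. This is a small sketch that hard-codes the ensemble values reported in the three result tables above. The ensemble_rows and best_per_metric helpers are hypothetical names, and the snippet does not read any eirautogp output files:

```shell
# Ensemble rows copied from the result tables:
# method MCC ACC ROC-AUC-MACRO AP-MACRO
ensemble_rows() {
    cat <<'EOF'
GWAS 0.34 0.6879 0.735 0.7991
GWAS->DL 0.3224 0.6809 0.7305 0.793
GWAS+BO 0.3875 0.7092 0.7354 0.7944
EOF
}

# For each metric column, report the method with the highest value.
best_per_metric() {
    awk '
        BEGIN { split("MCC ACC ROC-AUC-MACRO AP-MACRO", name, " ") }
        {
            for (i = 2; i <= NF; i++) {
                if (!(i in best) || $i + 0 > best[i]) {
                    best[i] = $i + 0   # numeric value used for comparison
                    val[i] = $i        # original text kept for printing
                    who[i] = $1
                }
            }
        }
        END {
            for (i = 2; i <= 5; i++) printf "%s: %s (%s)\n", name[i - 1], who[i], val[i]
        }
    '
}

ensemble_rows | best_per_metric
```

With the values above, GWAS+BO comes out on top for MCC, ACC and ROC-AUC-MACRO, while plain GWAS has the highest AP-MACRO. The margins on the last two metrics are small, so a single run like this is only a rough indication.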

Thanks for reading this tutorial!