03 – Comparison of Feature Selection Methods

In this tutorial, we will briefly go over the different feature selection methods that are implemented in the software.

We will use the same PennCATH data as in tutorial 01 – Genomic Prediction for Coronary Artery Disease, again predicting CAD. As before, start by downloading the processed PennCATH data.

Because we are reusing the data from the previous tutorial, the folder structure should look like this:

eir_auto_gp_tutorials/01_basic_tutorial/data
├── penncath
│   ├── penncath.bed
│   ├── penncath.bim
│   ├── penncath.csv
│   └── penncath.fam
└── penncath.zip
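Before starting the longer runs, it can help to confirm that the files shown above are actually in place. The following is a minimal sketch; check_penncath_data is a hypothetical helper written for this tutorial, not part of eirautogp, and it only tests that the four files exist.

```bash
# Hypothetical helper (not part of eirautogp): check that the PennCATH
# files from tutorial 01 are present under the given data directory.
check_penncath_data() {
    local data_dir="$1"
    local missing=0
    local ext
    for ext in bed bim fam csv; do
        if [ ! -f "${data_dir}/penncath/penncath.${ext}" ]; then
            echo "missing: ${data_dir}/penncath/penncath.${ext}" >&2
            missing=1
        fi
    done
    # Returns 0 when all four files exist, 1 otherwise.
    return "${missing}"
}

# Usage:
#   check_penncath_data eir_auto_gp_tutorials/01_basic_tutorial/data && echo "data OK"
```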

Since we are doing multiple runs with different feature selection methods, we would like to avoid processing the data multiple times. We can start by preparing only the data, without any modelling, by using the --only_data flag:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--global_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data \
--only_data \
--freeze_validation_set

This generates only a data folder, containing the processed data, inside the path passed to --global_output_folder.

Now we can start training models for each feature selection method. We will focus on the following methods: gwas, gwas->dl and gwas+bo. Other methods are available, such as dl. However, since we have around 500K SNPs and only around 2K samples, we will skip the methods that do not include a GWAS pre-filtering step. This is to (a) save time and (b) avoid models that would likely be grossly overfit.

Note

For more information on the feature selection methods, please refer to the output of eirautogp --help and the documentation page Feature Selection Methods.

Here are the commands we can use to train the models:

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection gwas \
--n_dl_feature_selection_setup_folds 3 \
--do_test

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_then_dl/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection "gwas->dl" \
--n_dl_feature_selection_setup_folds 3 \
--do_test

eirautogp \
--genotype_data_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath \
--label_file_path eir_auto_gp_tutorials/01_basic_tutorial/data/penncath/penncath.csv \
--data_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_data/data \
--feature_selection_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/feature_selection \
--modelling_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/modelling \
--analysis_output_folder eir_auto_gp_tutorials/tutorial_runs/03_feature_selection_gwas_and_bo/analysis \
--output_cat_columns CAD \
--input_con_columns tg hdl ldl age \
--input_cat_columns sex \
--folds 10 \
--feature_selection gwas+bo \
--n_dl_feature_selection_setup_folds 3 \
--do_test
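The three commands above differ only in the --feature_selection value and the run folder name, so they can also be generated in a loop. The sketch below is one way to do that, not the tool's own mechanism: run_suffix and run_feature_selection are hypothetical helpers, DRY_RUN is a variable made up for this example, and every flag and path is copied from the commands above. By default it only prints each command; note that gwas->dl has to be quoted, since an unquoted > is treated by the shell as an output redirection.

```bash
# Hypothetical helpers (not part of eirautogp) that rebuild the three
# commands above from the --feature_selection value alone.

# Map a method to the run folder suffix used in this tutorial.
run_suffix() {
    case "$1" in
        "gwas")     echo "gwas" ;;
        "gwas->dl") echo "gwas_then_dl" ;;
        "gwas+bo")  echo "gwas_and_bo" ;;
        *)          echo "unknown method: $1" >&2; return 1 ;;
    esac
}

# Print the command for one method, or run it when DRY_RUN=0.
run_feature_selection() {
    local method="$1"
    local suffix
    suffix="$(run_suffix "${method}")" || return 1
    local data="eir_auto_gp_tutorials/01_basic_tutorial/data/penncath"
    local runs="eir_auto_gp_tutorials/tutorial_runs"
    local out="${runs}/03_feature_selection_${suffix}"

    # Build the argument list; flags and values mirror the commands above.
    set -- eirautogp \
        --genotype_data_path "${data}" \
        --label_file_path "${data}/penncath.csv" \
        --data_output_folder "${runs}/03_feature_selection_data/data" \
        --feature_selection_output_folder "${out}/feature_selection" \
        --modelling_output_folder "${out}/modelling" \
        --analysis_output_folder "${out}/analysis" \
        --output_cat_columns CAD \
        --input_con_columns tg hdl ldl age \
        --input_cat_columns sex \
        --folds 10 \
        --feature_selection "${method}" \
        --n_dl_feature_selection_setup_folds 3 \
        --do_test

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Quote every argument so the printed command is safe to paste.
        printf "'%s' " "$@"
        printf '\n'
    else
        "$@"
    fi
}

# Usage (prints the three commands; set DRY_RUN=0 to execute them):
#   for method in gwas "gwas->dl" gwas+bo; do
#       run_feature_selection "${method}"
#   done
```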

In case you are interested, here are the results on the test set for each of the approaches:

GWAS
Fold	MCC	ACC	ROC-AUC-MACRO	AP-MACRO	LOSS
Ensemble	0.34	0.6879	0.735	0.7991
Fold 0	0.3509	0.6879	0.7127	0.7633	0.7242
Fold 1	0.2857	0.6596	0.7307	0.7875	0.6122
Fold 2	0.3482	0.7021	0.7328	0.7972	0.6927
Fold 3	0.3218	0.6667	0.7346	0.8018	0.6618
Fold 4	0.3386	0.695	0.7067	0.7545	0.7327
Fold 5	0.3094	0.6667	0.708	0.7979	0.6629
Fold 6	0.2544	0.6241	0.7382	0.7761	0.6566
Fold 7	0.3695	0.7092	0.7281	0.7851	0.6008
Fold 8	0.263	0.6383	0.6998	0.7802	0.7171
Fold 9	0.302	0.6454	0.7185	0.7781	0.7405

GWAS->DL
Fold	MCC	ACC	ROC-AUC-MACRO	AP-MACRO	LOSS
Ensemble	0.3224	0.6809	0.7305	0.793
Fold 0	0.34	0.6879	0.7088	0.776	0.7166
Fold 1	0.321	0.6879	0.7363	0.8025	0.6442
Fold 2	0.31	0.6738	0.7144	0.7959	0.7031
Fold 3	0.3241	0.6596	0.6975	0.7611	0.7158
Fold 4	0.3576	0.695	0.723	0.7659	0.7119
Fold 5	0.2744	0.6596	0.7058	0.7771	0.6711
Fold 6	0.363	0.695	0.7515	0.7895	0.6221
Fold 7	0.2869	0.6667	0.7099	0.7736	0.6666
Fold 8	0.3525	0.695	0.708	0.7913	0.6578
Fold 9	0.3094	0.6667	0.7275	0.7909	0.7332

GWAS+BO
Fold	MCC	ACC	ROC-AUC-MACRO	AP-MACRO	LOSS
Ensemble	0.3875	0.7092	0.7354	0.7944
Fold 0	0.3807	0.695	0.7371	0.7962	0.634
Fold 1	0.2857	0.6596	0.7307	0.7875	0.6122
Fold 2	0.3869	0.7163	0.7303	0.7947	0.6755
Fold 3	0.2691	0.6596	0.6923	0.784	0.6683
Fold 4	0.281	0.6454	0.7093	0.781	0.6441
Fold 5	0.3094	0.6667	0.708	0.7979	0.6629
Fold 6	0.3513	0.6809	0.7423	0.781	0.6155
Fold 7	0.3525	0.695	0.7513	0.7893	0.5987
Fold 8	0.345	0.6809	0.7183	0.7796	0.7175
Fold 9	0.3349	0.6879	0.6932	0.7401	0.6879

Here we can see that the default GWAS+BO approach performs roughly the best: its ensemble has the highest MCC (0.3875) and ACC (0.7092), while ROC-AUC and AP are close across all three methods. This method has worked quite well on smaller datasets in internal tests, but it remains to be tested thoroughly on larger datasets such as the UKBB. For those, GWAS->DL has mostly been used so far, although it may well be that GWAS+BO will work even better.
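To compare the methods on per-fold averages rather than only on the ensemble row, the tables above can be summarised with a short awk call. This is a sketch under assumptions: it expects one table saved as a tab-separated file with the header row shown above, and mean_fold_metric is a hypothetical helper, not something eirautogp provides. How the tool itself stores these results on disk is not covered here.

```bash
# Hypothetical helper: print the mean of one metric column over the
# per-fold rows of a tab-separated results table. The header row is used
# to locate the column; the "Ensemble" row is skipped.
mean_fold_metric() {
    local table="$1"
    local metric="$2"
    awk -F '\t' -v metric="${metric}" '
        NR == 1 {
            # Find the column whose header matches the requested metric.
            for (i = 1; i <= NF; i++) if ($i == metric) col = i
            next
        }
        col && $1 ~ /^Fold [0-9]+$/ { sum += $col; n++ }
        END {
            # Fail if the metric is unknown or no fold rows were found.
            if (!col || n == 0) exit 1
            printf "%.4f\n", sum / n
        }
    ' "${table}"
}

# Usage, assuming the GWAS table was saved as gwas_results.tsv:
#   mean_fold_metric gwas_results.tsv MCC
```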

Thanks for reading this tutorial!