Feature selection
=================

In this section we consider the Greedy RLS algorithm, which performs feature selection in order to learn sparse linear models. Conceptually, the method combines a greedy search heuristic for feature selection, regularized least-squares based model fitting, and a leave-one-out cross-validation based selection criterion. However, unlike an implementation based on writing wrapper code around an RLS solver, Greedy RLS uses several computational shortcuts to guarantee that the feature selection can be done in linear time. On multi-target data, the method selects common features that work on average best for all the target tasks.

Greedy RLS was first introduced in [1]_. Work on applying and adapting the method to analyzing genome-wide association studies can be found in [2]_, while the multi-target feature selection problem is considered in detail in [3]_.

In the literature one can usually find three main motivations for feature selection:

1. costs: features cost time and money, so use as few of them as possible
2. accuracy: use only the relevant features in order to make more accurate predictions
3. understandability: examine the model directly to gain insight into the studied phenomenon

In our opinion, Greedy RLS is especially beneficial for point 1: the method searches for a subset that allows predictions as accurate as possible, given a hard constraint on the number of features allowed. For point 2 we recommend using the regularization mechanisms of the regular RLS instead. For point 3 we note that analyzing the selected features can be quite challenging, due to the greedy search process applied. For example, an important feature may end up not being selected if it is highly correlated with already selected features.

Basics of feature selection
***************************

Again, we test the algorithm on a toy example, the `Boston Housing data set `_ from the UCI machine learning repository. The data consists of 506 instances, 13 features and 1 output to be predicted. The data can be loaded from disk and split into a training set of 250 instances and a test set of 256 instances using the following code.

.. literalinclude:: src/housing_data.py

.. literalinclude:: src/housing_data.out

Selecting k features
--------------------

First, we select 5 features.

.. literalinclude:: src/fselection1.py

.. literalinclude:: src/fselection1.out

The feature indices range from 0 to fcount-1, where fcount is the total number of features.

Callback function
-----------------

GreedyRLS takes a callback function as an optional argument. This is an object that implements two methods: callback(learner), which is called each time a feature is selected, and finished(learner), which is called after the selection process has ended. The callback can be used, for example, to follow how the test set prediction error evolves as a function of the number of selected features.

.. literalinclude:: src/fselection2.py

.. literalinclude:: src/fselection2.out

Running GreedyRLS on all the features produces a ranking of the features: the most important, the second most important given the already selected one, and so on. The final model is exactly the same as a regular RLS trained on all the features.

Feature selection with MNIST
****************************

In this tutorial, we select features for the `MNIST handwritten digit recognition data set `_. For reading in the data, we use the `python-mnist package `_. (Easiest way to install: 'pip install python-mnist'.) We assume that the unzipped MNIST data is stored in the folder 'data'.
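For reference, the following is a minimal sketch of how the data might be read with python-mnist and converted into NumPy arrays. The folder name 'data', the variable names, and the one-vs-all label encoding are assumptions made for illustration; the actual code used in this tutorial is included below as src/greedy_mnist.py.

.. code-block:: python

    import numpy as np
    from mnist import MNIST

    # Read the unzipped MNIST files from the folder 'data' (assumed location)
    mndata = MNIST('data')
    X_train, labels_train = mndata.load_training()
    X_test, labels_test = mndata.load_testing()
    X_train, X_test = np.array(X_train), np.array(X_test)

    def ova(labels):
        # Encode class labels 0-9 as a 10-column -1/+1 one-vs-all target matrix,
        # so that features are selected jointly for all ten binary tasks
        idx = np.asarray(labels)
        Y = -1. * np.ones((len(idx), 10))
        Y[np.arange(len(idx)), idx] = 1.
        return Y

    Y_train = ova(labels_train)
    Y_test = ova(labels_test)

With a one-vs-all target matrix like this, the multi-target learner treats the problem as ten regression tasks that share the same set of selected features.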
There are now 60000 training examples, 10000 test examples, 784 features and 10 classes. The goal is to select common features that allow jointly predicting all 10 classes as well as possible. The classes are not linearly separable; with no pre-processing, the best a linear classifier can achieve using all the features is around 0.88 accuracy.

.. literalinclude:: src/greedy_mnist.py

.. literalinclude:: src/greedy_mnist.out

On this data, using 50 selected features we achieve a classification accuracy of 0.82; better accuracy could still be gained by continuing the selection process further.

References
**********

.. [1] Tapio Pahikkala, Antti Airola, and Tapio Salakoski. Speeding up greedy forward selection for regularized least-squares. In Proceedings of the Ninth International Conference on Machine Learning and Applications, pages 325-330. IEEE Computer Society, 2010.

.. [2] Tapio Pahikkala, Sebastian Okser, Antti Airola, Tapio Salakoski, and Tero Aittokallio. Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations. Algorithms for Molecular Biology, 7(1):11, 2012.

.. [3] Pekka Naula, Antti Airola, Tapio Salakoski, and Tapio Pahikkala. Multi-label learning under feature extraction budgets. Pattern Recognition Letters, 40:56-65, April 2014.