Feature selection

In this section we consider the Greedy RLS algorithm, that performs feature selection in order to learn sparse linear models. Conceptually the method combines together a greedy search heuristic for feature selection, regularized least-squares based model fitting, and leave-one-out cross-validation based selection criterion. However, unlike an implementation based on writing wrapper code around a RLS solver, Greedy RLS uses several computational shortcuts to guarantee that the feature selection can be done in linear time. On multi-target data, the method selects such common features that work on average best for all the target tasks.

Greedy RLS was first introduced in [1]. Work on applying and adapting the method to analyzing genome wide association studies can be found in [2], while the multi-target feature selection problem has been considered in detail in [3].

In the literature one can usually find three main motivations for feature selection:

  1. costs: features cost time and money, so let’s use as few as possible
  2. accuracy: use only the relevant features in order to make more accurate predictions
  3. understandability: we can examine the model directly to gain insight into the studied phenomenon

In our opinion, Greedy RLS is especially beneficial for the point 1: the method will search for such a subset that allows as accurate predictions as possible given a hard constraint on the number of features allowed. For point 2 we recommend using the regularization mechanisms of the regular RLS instead. For point 3 we note that analyzing the selected features can be quite challenging, due to the greedy search process applied. For example, an important feature may end up not being selected, if it has a high correlation with already selected features.

Basics of feature selection

Again, we test the algorithm on a toy example, the Boston Housing data set from the UCI machine learning repository. The data consists of 506 instances, 13 features and 1 output to be predicted.

The data can be loaded from disk and split into a training set of 250, and test set of 256 instances using the following code.

import numpy as np

def load_housing():
    np.random.seed(1)
    D = np.loadtxt("housing.data")
    np.random.shuffle(D)
    X = D[:,:-1]
    Y = D[:,-1]
    X_train = X[:250]
    Y_train = Y[:250]
    X_test = X[250:]
    Y_test = Y[250:]
    return X_train, Y_train, X_test, Y_test

def print_stats():
    X_train, Y_train, X_test, Y_test = load_housing()
    print("Housing data set characteristics")
    print("Training set: %d instances, %d features" %X_train.shape)
    print("Test set: %d instances, %d features" %X_test.shape)

if __name__ == "__main__":
    print_stats()


Housing data set characteristics
Training set: 250 instances, 13 features
Test set: 256 instances, 13 features

Selecting k features

First, we test selecting 5 features.

from rlscore.learner import GreedyRLS
from rlscore.measure import sqerror

from housing_data import load_housing


def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    #we select 5 features
    learner = GreedyRLS(X_train, Y_train, 5)
    #Test set predictions
    P_test = learner.predict(X_test)
    print("test error %f" %sqerror(Y_test, P_test))
    print("Selected features " +str(learner.selected))

if __name__=="__main__":
    train_rls()
test error 25.962988
Selected features [12, 5, 10, 7, 11]

The feature indices range from 0 to fcount-1.

Callback function

GreedyRLS takes as an optional argument a callback function. This is an object that implements two methods: callback(learner) that is called each time a feature is selected, and finished(learner) that is called after the selection process has ended. The callback function can be used for example to follow how test set prediction error evolves as a function of selected features.

from rlscore.learner import GreedyRLS
from rlscore.measure import sqerror

from housing_data import load_housing

class Callback(object):

    def __init__(self, X_test, Y_test):
        self.X_test = X_test
        self.Y_test = Y_test
        self.iteration = 0

    def callback(self, learner):
        self.iteration += 1
        P = learner.predict(self.X_test)
        e = sqerror(self.Y_test, P)
        print("Features selected %d, test error %f" %(self.iteration, e))

    def finished(self, learner):
        pass

def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    cb = Callback(X_test, Y_test)
    learner = GreedyRLS(X_train, Y_train, 13, callbackfun = cb)
    #Test set predictions
    P_test = learner.predict(X_test)
    print("test error %f" %sqerror(Y_test, P_test))
    print("Selected features " +str(learner.selected))

if __name__=="__main__":
    train_rls()
Features selected 1, test error 35.107093
Features selected 2, test error 30.027482
Features selected 3, test error 27.443781
Features selected 4, test error 27.348488
Features selected 5, test error 25.962988
Features selected 6, test error 26.895165
Features selected 7, test error 26.287712
Features selected 8, test error 26.367339
Features selected 9, test error 25.823253
Features selected 10, test error 25.774935
Features selected 11, test error 25.321035
Features selected 12, test error 25.324392
Features selected 13, test error 25.497222
test error 25.497222
Selected features [12, 5, 10, 7, 11, 0, 1, 3, 4, 8, 9, 2, 6]

Running GreedyRLS on all the features produces a ranking of the features: the most important, the second most important given the already selected one, etc. The final model is exactly the same as regular RLS trained on all the features.

Feature selection with MNIST

In this tutorial, we select features for the MNIST handwritten digit recognition data set. For reading in the data, we use the python-mnist package. (Easiest way to install: ‘pip install python-mnist’). We assume, the unzipped MNIST data is stored in folder ‘data’.

There are now 60000 training examples, 10000 test examples, 784 features and 10 classes. The goal is to select such common features that allow jointly predicting all the 10 classes as well as possible. The classes are not linearly separable, with no pre-processing done the best a linear classifier can achieve with all the features is around 0.88 accuracy.

import numpy as np
from mnist import MNIST
from rlscore.learner import GreedyRLS
from rlscore.measure import ova_accuracy

def ova(Y):
    #maps range 0,...,classcount -1 to one-vs-all encoding
    Y_ova = -1. * np.ones((len(Y), np.max(Y)+1))
    for i in range(len(Y)):
        Y_ova[i, Y[i]] = 1
    return Y_ova

class Callback(object):

    def __init__(self, X_test, Y_test):
        self.X_test = X_test
        self.Y_test = Y_test
        self.iteration = 0

    def callback(self, learner):
        self.iteration += 1
        P = learner.predict(self.X_test)
        acc = ova_accuracy(self.Y_test, P)
        print("Features selected %d, accuracy %f" %(self.iteration, acc))

    def finished(self, learner):
        pass

def train_rls():
    mndata = MNIST("./data")
    X_train, Y_train = mndata.load_training()
    X_test, Y_test = mndata.load_testing()
    X_train, X_test = np.array(X_train), np.array(X_test)
    #One-vs-all mapping
    Y_train = ova(Y_train)
    Y_test = ova(Y_test)
    #Train greedy RLS, select 50 features
    cb = Callback(X_test, Y_test)
    learner = GreedyRLS(X_train, Y_train, 50, callbackfun=cb)
    print("Selected features " +str(learner.selected))

if __name__=="__main__":
    train_rls()
Features selected 1, accuracy 0.202400
Features selected 2, accuracy 0.323800
Features selected 3, accuracy 0.402900
Features selected 4, accuracy 0.442400
Features selected 5, accuracy 0.489400
Features selected 6, accuracy 0.509500
Features selected 7, accuracy 0.551600
Features selected 8, accuracy 0.574400
Features selected 9, accuracy 0.594600
Features selected 10, accuracy 0.638700
Features selected 11, accuracy 0.646300
Features selected 12, accuracy 0.668700
Features selected 13, accuracy 0.683300
Features selected 14, accuracy 0.682200
Features selected 15, accuracy 0.703100
Features selected 16, accuracy 0.711100
Features selected 17, accuracy 0.714800
Features selected 18, accuracy 0.722200
Features selected 19, accuracy 0.731700
Features selected 20, accuracy 0.733200
Features selected 21, accuracy 0.742900
Features selected 22, accuracy 0.742600
Features selected 23, accuracy 0.750100
Features selected 24, accuracy 0.755600
Features selected 25, accuracy 0.758000
Features selected 26, accuracy 0.760500
Features selected 27, accuracy 0.764300
Features selected 28, accuracy 0.769900
Features selected 29, accuracy 0.776100
Features selected 30, accuracy 0.780600
Features selected 31, accuracy 0.780500
Features selected 32, accuracy 0.779600
Features selected 33, accuracy 0.785800
Features selected 34, accuracy 0.788500
Features selected 35, accuracy 0.788000
Features selected 36, accuracy 0.788100
Features selected 37, accuracy 0.791500
Features selected 38, accuracy 0.794900
Features selected 39, accuracy 0.797400
Features selected 40, accuracy 0.799000
Features selected 41, accuracy 0.802000
Features selected 42, accuracy 0.805400
Features selected 43, accuracy 0.806600
Features selected 44, accuracy 0.810900
Features selected 45, accuracy 0.812800
Features selected 46, accuracy 0.812000
Features selected 47, accuracy 0.812100
Features selected 48, accuracy 0.815000
Features selected 49, accuracy 0.818600
Features selected 50, accuracy 0.822100
Selected features [350, 461, 542, 409, 401, 596, 376, 298, 151, 211, 386, 657, 267, 101, 290, 406, 220, 518, 427, 459, 301, 712, 573, 538, 583, 523, 320, 260, 352, 464, 70, 718, 154, 709, 715, 104, 277, 190, 343, 439, 214, 515, 512, 346, 296, 237, 651, 580, 549, 355]

On this data using 50 selected features we achieve classification accuracy of 0.82, better accuracy could still be gained by continuing the selection process further.

References

[1]Tapio Pahikkala, Antti Airola, and Tapio Salakoski. Speeding up Greedy Forward Selection for Regularized Least-Squares. Proceedings of The Ninth International Conference on Machine Learning and Applications, 325-330, IEEE Computer Society, 2010.
[2]Tapio Pahikkala, Sebastian Okser, Antti Airola, Tapio Salakoski, and Tero Aittokallio. Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations. Algorithms for Molecular Biology, 7(1):11, 2012.
[3]Pekka Naula, Antti Airola, Tapio Salakoski, and Tapio Pahikkala. Multi-label learning under feature extraction budgets. Pattern Recognition Letters, 40:56–65, April 2014.