Regression and classification
=============================
Tutorial 1: Basic regression
****************************
In this tutorial, we show how to train a regularized
least-squares (RLS) regressor. We use
the classical computational short-cuts [1]_ for fast
regularization and leave-one-out cross-validation.
Data set
--------
We consider the classical
`Boston Housing data set `_
from the UCI machine learning repository. The data consists of 506 instances,
13 features and 1 output to be predicted.
The data can be loaded from disk and split into a training set of 250, and test
set of 256 instances using the following code.
.. literalinclude:: src/housing_data.py
.. literalinclude:: src/housing_data.out
Linear regression with default parameters
-----------------------------------------
In this example we train RLS with default parameters (linear kernel, regparam=1.0)
and compute the mean squared error both for leave-one-out cross-validation,
and for the test set.
In order to check that the resulting model is better than a trivial baseline,
we compare the results to a predictor, that always predicts the mean
of training outputs.
.. literalinclude:: src/regression1.py
The resulting output is as follows.
.. literalinclude:: src/regression1.out
Clearly the model works much better than the trivial baseline. Still, since we used
default parameters there is no guarantee that the regularization parameter would
be close to optimal value.
Choosing regularization parameter with leave-one-out
----------------------------------------------------
Next, we choose the regularization parameter based on grid search in an exponential
grid. We select the parameter that leads to lowest leave-one-out cross-validation
error (measured with mean square error). Due to the computational short-cuts
implemented the whole procedure is almost as fast as training RLS once (though on this
small data set the running times would yet not be an issue even for brute force algorithms
that require re-training for each tested parameter and round of cross-validation).
.. literalinclude:: src/regression2.py
.. literalinclude:: src/regression2.out
Compared to previous case, we were able to slightly lower the error,
though it turns out in this case the default parameter was already close to optimal.
Usually one cannot expect to be so lucky.
For convenience, the procedure of training RLS and simultaneously selecting the
regularization parameter with leave-one-out is implemented in class LeaveOneOutRLS.
.. literalinclude:: src/regression3.py
.. literalinclude:: src/regression3.out
Learning nonlinear predictors using kernels
-------------------------------------------
Next we train RLS using a non-linear kernel function. The Gaussian kernel has a single
parameter, the kernel width gamma, that needs to be selected using cross-validation.
We implement a nested loop, where gamma is selected in the outer loop, and regparam
in the inner. The ordering is important, as it allows us to make use of the fast
algorithm for re-training RLS with different regularization parameter values.
.. literalinclude:: src/regression4.py
.. literalinclude:: src/regression4.out
Clearly there is a non-linear relationship between the features and the output,
that the Gaussian kernel allows us to model better than the linear kernel. For
simpler implementation, the inner loop could be replaced with training LeaveOneOutRLS
supplying the grid as a parameter, the results and running time are the same.
.. literalinclude:: src/regression5.py
.. literalinclude:: src/regression5.out
Using a custom kernel
---------------------
Custom kernel functions are supported via the kernel="PrecomputedKernel" -option, which
allows the user to directly give a precomputed kernel matrix for training (and later
for prediction). Revisiting the first example, we again train a regressor on the Housing
data:
.. literalinclude:: src/regression6.py
.. literalinclude:: src/regression6.out
Tutorial 2: K-fold cross-validation, non i.i.d. data
****************************************************
Next we consider how to implement K-fold cross-validation. RLS implements computational
shortcuts for repeated holdout, that allow implementing different cross-validation
schemes with almost the same computational cost as training RLS only once. This can
be used to implement regular 10-fold or 5-fold cross-validation if one prefers these
to leave-one-out, though in our opinion the main advantage is realized with non i.i.d.
data, where leave-one-out cannot be used reliably. The computational shortcuts are
based on further refinement of the results published in [2]_ and [3]_, the parse
ranking experiments presented next are similar to the experimental setup considered in [2]_.
We consider an application from the field of natural language processing known as
parse ranking. Syntactic parsing refers to the process of analyzing natural language
text according to some formal grammar. Due to ambiguity of natural language
("I shot an elephant in my pyjamas." - just who was in the pyjamas?), a sentence
can have multiple grammatically correct parses. In parse ranking, an automated parser
generates a set of candidate parsers, which need to be scored according to how
well they match the true (human made) parse of the sentence.
This can be modeled as a regression problem, where the data consists of inputs
representing sentence-parse pairs, and outputs that are scores describing the 'goodness'
of the parse for the sentence. The learned predictor should
generalize to making predictions for new sentences. In order to represent this
requirement in the experimental setup, the training and test data split is done
on the level of sentences, so that all parses related to same sentence are put
either in the training or test set.
Thus a main feature of the data is that it is not an i.i.d. sample, rather the data
is sampled in groups of parses, each group corresponding to one sentence. Similar
violation of the i.i.d. assumption is common in many other machine learning
applications. Typical situations where
this type of cross-validation may be required include setting where several instances correspond
to same real-world object (e.g. same experiment performed many times, same person at different
time points), pairwise data (customer-product, drug-target, query-document, parse-sentence),
data corresponding to points in some coordinate system (pixels in image, spatial data, time
series data) etc.
The data set
------------
The parse ranking data set can be downloaded from
`here `_.
First, we load the training set in and examine its properties
.. literalinclude:: src/parse_data.py
.. literalinclude:: src/parse_data.out
As is common in natural language applications the data is very high dimensional. In
addition to the data we load a list of sentence ids, denoting to which sentence each
instance is associated with. Finally, with the map\_ids function, based on the ids we map the data to fold indices, where
each fold contains the indices of all training instances associated with a given sentence.
Altogether there are 117 folds each corresponding to a sentence.
Incorrect analysis using leave-one-out
--------------------------------------
As before, we select the regularization parameter using leave-one-out.
.. literalinclude:: src/parse_regression1.py
.. literalinclude:: src/parse_regression1.out
It can be seen that something has gone wrong, the test error is almost 50 times higher than
the leave-one-out error! The problem is that in leave-one-out when a parse is left out of
the training set, all the other parses related to the same sentence are still left in.
This leads to high optimistic bias in leave-one-out.
K-fold cross-validation to the rescue
-------------------------------------
This time, we perform k-fold cross-validation, where each fold contains instances related
to a single sentence (k=117).
.. literalinclude:: src/parse_regression2.py
.. literalinclude:: src/parse_regression2.out
This time everything works out better than before, the cross-validation estimate and test
error are much closer. Interestingly, the regularization parameter chosen based on
the more reliable estimate does not work any better than the one chosen with leave-one-out.
K-fold vs. leave-one-out
------------------------
Finally, we plot the difference between the leave-sentence-out k-fold,
leave-one-out and test set errors.
.. literalinclude:: src/parse_regression_plot.py
.. image:: src/parse_plot.png
The moral of the story is, that if your data is not identically distributed, but rather sampled
in groups, this should be taken into account when designing training/test split and/or the
cross-validation folds. Cross-validation using such custom fold-partition can be implemented
efficiently using the hold-out method implemented in the RLS module.
Note: One issue with this data set is that even though there is signal present in the data, even the
best results are quite bad in terms of MSE. We cannot really accurately predict the
scores as such.
Tutorial 3: Binary classification and Area under ROC curve
**********************************************************
Adult data set
--------------
In this experiment, we build a binary classifier on the
`Adult data set `_.
We model this as a regression problem, where +1 encodes the positive class,
and -1 the negative one. This is the standard encoding assumed by the
performance measures within the RLScore package.
.. literalinclude:: src/adult_data.py
.. literalinclude:: src/adult_data.out
Binary classification
---------------------
We can train RLS and select the regularization parameter as before, by
simply using (binary) classification accuracy instead of squared error as the
performance measure.
.. literalinclude:: src/classification0.py
.. literalinclude:: src/classification0.out
Area under ROC curve (AUC) and cross-validation
-----------------------------------------------
A common approach in machine learning is to measure performance with area under the ROC curve (AUC)
rather than classification accuracy. Here, we combine the leave-one-out shortcuts, with using AUC
for parameter selection and performance estimation.
.. literalinclude:: src/classification1.py
.. literalinclude:: src/classification1.out
However, as shown for example in [4]_, especially for small data sets leave-one-out can have substantial
bias for AUC-estimation. In this experiment, we split the Adult data set to 1000 separate training sets
of 30 samples, and compare the leave-one-out AUC and test set AUC.
.. literalinclude:: src/classification2.py
.. literalinclude:: src/classification2.out
As can be seen, there is a systematic negative bias meaning that the leave-one-out AUCs tend to be
smaller than the AUC on the (quite large) test set. The results are similar to those obtained in
the simulated experiments in [4]_. The bias can be removed by performing leave-pair-out cross-validation,
where on each round of cross-validation one positive-negative example pair is left out of training set.
For this purpose, RLScore implements the fast and exact leave-pair-out algorithm first introduced in
[5]_. What follows is a low-level implementation of leave-pair-out, where all such (x_i, x_j) pairs,
where y_i = +1 and y_j = -1 are left out in turn. The leave-pair-out AUC is the relative fraction of
such pairs, where the f(x_i) > f(x_j), with ties assumed to be broken randomly (see [4]_ for further
discussion).
.. literalinclude:: src/classification3.py
.. literalinclude:: src/classification3.out
As can be seen, the negative bias is now almost completely eliminated.
The above code is very low-level and as such unlikely to be easy to understand or of practical use for most users.
However, leave-pair-out AUC can be automatically computed, and also used for model selection using the
equivalent higher level interface. In the following experiment, we train RLS on a sample of 1000 instances from Adult data
for regularization parameter values 2^-5, 1 and 2^5, and select the predictor corresponding
to highest leave-pair-out AUC.
.. literalinclude:: src/classification4.py
.. literalinclude:: src/classification4.out
Tutorial 4: Reduced set approximation
*************************************
Once the data set size exceeds a couple of thousands of instances, maintaining the
whole kernel matrix in memory is no longer feasible. Instead of using all the training
data to represent the learned model, one can instead restrict the model to a subset of
the data, resulting in the so-called reduced set approximation (aka 'subset of regressors',
also closely related to Nyström approximation). As a starting point one can randomly
select a couple of hundred of training instances as basis vectors. A lot of research
has been done on more advanced selection strategies.
.. literalinclude:: src/classification5.py
.. literalinclude:: src/classification5.out
Tutorial 5: Multi-target learning
*********************************
RLS supports efficient multi-target learning. Instead of a single vector of outputs Y, one
can supply [instances x targets] shape Y-matrix. For multi-target regression, one can simply
insert the values to be regressed as such. For multi-class or multi-label classification,
one should employ a transformation, where Y-matrix contains one column per class. The
class(es) to which an instance belongs to are encoded as +1, while the others are encoded
as -1 (so called one-vs-all transformation, in case of multi-class problems).
All the cross-validation and fast regularization algorithms are compatible with multi-target
learning.
We demonstrate multi-class learning with a simple toy example, utilizing the `Wine data
set `_ from the UCI repository
.. literalinclude:: src/wine_data.py
.. literalinclude:: src/wine_data.out
We implement the training and testing, using two additional helper functions, one which
transforms class labels to one-vs-all encoding, another that computes classification accuracy
for matrices using one-vs-all encoding.
.. literalinclude:: src/classification6.py
.. literalinclude:: src/classification6.out
The wine data turns out to be really easy to learn. Similarly, we could implement multi-target
regression, or multi-label classification. RLScore does not currently implement a wide variety
of multi-label classification measures. However, the implemented performance measures (e.g. accuracy,
auc, f-score), will compute micro-averaged performance estimates (i.e. compute the measure for each
column separately and then average), when applied to 2-dimensional Y and P matrices.
References
**********
.. [1] Ryan Rifkin, Ross Lippert. Notes on Regularized Least Squares. Technical Report, MIT, 2007.
.. [2] Tapio Pahikkala, Jorma Boberg, and Tapio Salakoski. Fast n-Fold Cross-Validation for Regularized Least-Squares. Proceedings of the Ninth Scandinavian Conference on Artificial Intelligence, 83-90, Otamedia Oy, 2006.
.. [3] Tapio Pahikkala, Hanna Suominen, and Jorma Boberg. Efficient cross-validation for kernelized least-squares regression with sparse basis expansions. Machine Learning, 87(3):381--407, June 2012.
.. [4] Antti Airola, Tapio Pahikkala, Willem Waegeman, Bernard De Baets, and Tapio Salakoski An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Computational Statistics & Data Analysis, 55(4):1828-1844, April 2011.
.. [5] Tapio Pahikkala, Antti Airola, Jorma Boberg, and Tapio Salakoski. Exact and efficient leave-pair-out cross-validation for ranking RLS. In Proceedings of the 2nd International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'08), pages 1-8, Espoo, Finland, 2008.