Learning to rank

In this tutorial, we show how to train the ranking regularized least-squares (RankRLS) method for learning to rank [1] [2]. We will use three variants of the method, depending on whether the data consists of (instance, utility score) pairs similar to regression, of query-structured data, or of pairwise preferences. In our experience, competitive results can often be achieved in the first case also by simply using RLS regression, whereas for the latter two use cases RankRLS should be used. All of these learners support nonlinear kernels.

RankRLS minimizes the magnitude preserving ranking error ((y_i - y_j) - (f(x_i) - f(x_j)))^2. We will also make use of the concordance index (a.k.a. pairwise ranking accuracy), which computes the fraction of correctly ordered pairs (pairs such that y_i > y_j and f(x_i) > f(x_j), with tied predictions broken randomly). For the concordance index, trivial baselines such as a random predictor or a mean or majority voter yield 0.5 performance. For bipartite ranking tasks, where there are only two possible output values, the concordance index is equivalent to the area under the ROC curve (AUC), a popular measure in binary classification.
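
As a quick sanity check, the concordance index can also be computed directly from this definition. The following is a small sketch of ours (naive_cindex is not part of RLScore); ties in the predictions are counted as half correct, which matches breaking them randomly in expectation.

import numpy as np
from rlscore.measure import cindex

def naive_cindex(Y, P):
    #Fraction of correctly ordered pairs among all pairs with Y[i] != Y[j]
    correct, total = 0., 0.
    for i in range(len(Y)):
        for j in range(i + 1, len(Y)):
            if Y[i] == Y[j]:
                continue
            total += 1
            if (Y[i] - Y[j]) * (P[i] - P[j]) > 0:
                correct += 1
            elif P[i] == P[j]:
                #tied predictions count as half correct
                correct += 0.5
    return correct / total

Y = np.array([1., 2., 3., 4.])
P = np.array([1.5, 1.0, 3.0, 4.0])
print(naive_cindex(Y, P))
print(cindex(Y, P))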

Tutorial 1: Ordinal regression

First, let us assume an ordinal regression type of setting, where, similar to regression, each instance is associated with a score. However, now the aim is to learn to predict the ordering of the instances correctly, rather than the scores themselves. We use the GlobalRankRLS implementation of RankRLS. Global in the name refers to the fact that there exists a single global ranking over all the data, rather than many separate rankings as with the query-structured data considered later.

The leave-pair-out cross-validation approach consists of leaving out, in turn, each pair of training instances as holdout data, and computing the fraction of cases where f(x_i) > f(x_j), given that y_i > y_j. This is implemented using the fast algorithm described in [3] [2].
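
Conceptually, the same estimate could be computed with the naive quadratic-time scheme sketched below. This is only meant to illustrate the definition (LeavePairOutRankRLS, used later in this tutorial, relies on the much faster algorithm of [3]); the sketch assumes, as in the k-fold examples below, that the holdout method returns predictions for the held-out instances as if they had not been part of the training set.

import numpy as np
from rlscore.learner import GlobalRankRLS

def naive_leave_pair_out(X, Y):
    #Train once; holdout([i, j]) then gives predictions for the pair
    #computed as if it had been left out of the training set
    learner = GlobalRankRLS(X, Y)
    correct, total = 0., 0.
    for i in range(len(Y)):
        for j in range(i + 1, len(Y)):
            if Y[i] == Y[j]:
                continue
            P = learner.holdout([i, j])
            total += 1
            if (Y[i] - Y[j]) * (P[0] - P[1]) > 0:
                correct += 1
            elif P[0] == P[1]:
                #tied predictions count as half correct
                correct += 0.5
    return correct / total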

Data set

Again, we consider the classical Boston Housing data set from the UCI machine learning repository. The data consists of 506 instances, 13 features and 1 output to be predicted.

The data can be loaded from disk and split into a training set of 250, and test set of 256 instances using the following code.

import numpy as np

def load_housing():
    np.random.seed(1)
    D = np.loadtxt("housing.data")
    np.random.shuffle(D)
    X = D[:,:-1]
    Y = D[:,-1]
    X_train = X[:250]
    Y_train = Y[:250]
    X_test = X[250:]
    Y_test = Y[250:]
    return X_train, Y_train, X_test, Y_test

def print_stats():
    X_train, Y_train, X_test, Y_test = load_housing()
    print("Housing data set characteristics")
    print("Training set: %d instances, %d features" %X_train.shape)
    print("Test set: %d instances, %d features" %X_test.shape)

if __name__ == "__main__":
    print_stats()


Housing data set characteristics
Training set: 250 instances, 13 features
Test set: 256 instances, 13 features

Linear ranking model with default parameters

First, we train RankRLS with default parameters (linear kernel, regparam=1.0) and compute the concordance index for the test set.

from rlscore.learner import GlobalRankRLS
from rlscore.measure import cindex

from housing_data import load_housing


def train_rls():
    #Trains RankRLS with default parameters (regparam=1.0, kernel='LinearKernel')
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = GlobalRankRLS(X_train, Y_train)
    #Test set predictions
    P_test = learner.predict(X_test)
    print("test cindex %f" %cindex(Y_test, P_test))

if __name__=="__main__":
    train_rls()

The resulting output is as follows.

test cindex 0.857257

Clearly the model works much better than the trivial baseline of 0.5.
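
To make the comparison concrete, one can score random predictions on the same test set. A small sketch of ours, reusing load_housing from above; the exact value will vary slightly, but should lie close to 0.5.

import numpy as np
from rlscore.measure import cindex

from housing_data import load_housing

def random_baseline():
    X_train, Y_train, X_test, Y_test = load_housing()
    np.random.seed(1)
    #random scores carry no information about the true ordering,
    #so the concordance index should be close to 0.5
    P_random = np.random.random(Y_test.shape[0])
    print("random baseline cindex %f" %cindex(Y_test, P_random))

if __name__=="__main__":
    random_baseline()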

Leave-pair-out cross-validation

Next, we use the fast leave-pair-out cross-validation algorithm for performance estimation and regularization parameter selection. The LeavePairOutRankRLS module is a high level interface to this functionality.

from rlscore.learner import LeavePairOutRankRLS
from rlscore.measure import cindex

from housing_data import load_housing


def train_rls():
    #Trains RankRLS with automatically selected regularization parameter
    X_train, Y_train, X_test, Y_test = load_housing()
    regparams = [2.**i for i in range(-10, 10)]
    learner = LeavePairOutRankRLS(X_train, Y_train, regparams = regparams)
    lpo_performances = learner.cv_performances
    P_test = learner.predict(X_test)
    print("leave-pair-out performances " +str(lpo_performances))
    print("chosen regparam %f" %learner.regparam)
    print("test set cindex %f" %cindex(Y_test, P_test))


if __name__=="__main__":
    train_rls()

The resulting output is as follows.

leave-pair-out performances [ 0.85697212  0.85697212  0.85697212  0.85697212  0.85697212  0.85697212
  0.85697212  0.85697212  0.85697212  0.85700443  0.85703673  0.85687522
  0.85690752  0.85687522  0.85732743  0.85765044  0.85781194  0.85739203
  0.85723053  0.85642301]
chosen regparam 64.000000
test set cindex 0.857134

We notice two things. First, the leave-pair-out estimates are very close to the cindex computed on the test set. Second, on this data set the ranking accuracy does not seem to be much affected by the choice of regularization parameter.

K-fold cross-validation

Even with the computational shortcuts, leave-pair-out cross-validation becomes impractical and unnecessary once the training set size grows beyond a couple of hundred instances. Instead, the fast holdout method may be used to compute k-fold cross-validation estimates for RankRLS as follows.

import numpy as np
from rlscore.learner import GlobalRankRLS
from rlscore.measure import cindex
from rlscore.utilities.cross_validation import random_folds

from housing_data import load_housing


def train_rls():
    #Trains RankRLS with default parameters (regparam=1.0, kernel='LinearKernel')
    X_train, Y_train, X_test, Y_test = load_housing()
    #generate fold partition, arguments: train_size, k, random_seed
    folds = random_folds(len(Y_train), 5, 10)
    learner = GlobalRankRLS(X_train, Y_train)
    perfs = []
    for fold in folds:
        P = learner.holdout(fold)
        c = cindex(Y_train[fold], P)
        perfs.append(c)
    perf = np.mean(perfs)
    print("5-fold cross-validation cindex %f" %perf)
    P_test = learner.predict(X_test)
    print("test cindex %f" %cindex(Y_test, P_test))

if __name__=="__main__":
    train_rls()

The resulting output is as follows.

5-fold cross-validation cindex 0.857076
test cindex 0.857257

Again, we may use the higher level wrapper code, which can also be used to select the regularization parameter.

from rlscore.learner import KfoldRankRLS
from rlscore.measure import cindex
from rlscore.utilities.cross_validation import random_folds

from housing_data import load_housing


def train_rls():
    #Trains RankRLS with automatically selected regularization parameter
    X_train, Y_train, X_test, Y_test = load_housing()
    #generate fold partition, arguments: train_size, k, random_seed
    folds = random_folds(len(Y_train), 5, 10)
    regparams = [2.**i for i in range(-10, 10)]
    learner = KfoldRankRLS(X_train, Y_train, folds = folds, regparams = regparams, measure=cindex)
    kfold_perfs = learner.cv_performances
    P_test = learner.predict(X_test)
    print("kfold performances " +str(kfold_perfs))
    print("chosen regparam %f" %learner.regparam)
    print("test set cindex %f" %cindex(Y_test, P_test))

if __name__=="__main__":
    train_rls()

The resulting output is as follows.

kfold performances [ 0.85756786  0.85756786  0.85756786  0.85756786  0.85756786  0.85756786
  0.85756786  0.85756786  0.85740406  0.85740406  0.85707578  0.85740392
  0.85789748  0.85756974  0.85707578  0.8580598   0.85789641  0.85724066
  0.85740393  0.85838699]
chosen regparam 512.000000
test set cindex 0.855995

Kernel parameter selection

Finally, we consider how to select the regularization parameter and the kernel parameters together using k-fold cross-validation (alternatively, leave-pair-out could also be used here).

import numpy as np
from rlscore.learner import KfoldRankRLS
from rlscore.measure import cindex
from rlscore.utilities.cross_validation import random_folds

from housing_data import load_housing


def train_rls():
    #Selects both the gamma parameter for Gaussian kernel, and regparam with kfoldcv
    X_train, Y_train, X_test, Y_test = load_housing()
    folds = random_folds(len(Y_train), 5, 10)
    regparams = [2.**i for i in range(-15, 16)]
    gammas = regparams
    best_regparam = None
    best_gamma = None
    best_perf = 0.
    best_learner = None
    for gamma in gammas:
        #A new learner is initialized for each kernel parameter
        learner = KfoldRankRLS(X_train, Y_train, kernel = "GaussianKernel", folds = folds, gamma = gamma, regparams = regparams, measure=cindex)
        perf = np.max(learner.cv_performances)
        if perf > best_perf:
            best_perf = perf
            best_regparam = learner.regparam
            best_gamma = gamma
            best_learner = learner
    P_test = best_learner.predict(X_test)
    print("best parameters gamma %f regparam %f" %(best_gamma, best_regparam))
    print("best kfoldcv cindex %f" %best_perf)
    print("test cindex %f" %cindex(Y_test, P_test))

if __name__=="__main__":
    train_rls()

The resulting output is as follows.

best parameters gamma 0.000031 regparam 0.062500
best kfoldcv cindex 0.854094
test cindex 0.869137

The results are quite similar to those obtained before; the Gaussian kernel does not allow us to outperform the linear one.

Tutorial 2: Query-structured data

Next we consider the setting where, instead of having a single global ranking, the data is partitioned into subsets (“queries”). Each instance corresponds to a query-object pair, and the pairs corresponding to the same query have a ranking defined between them. At test time the model needs to predict rankings for new queries. The terminology comes from the context of learning to rank for information retrieval, where the task is, given a user query, to rank documents according to how well they match the query. Terms such as listwise ranking, conditional ranking and dyadic ranking have also been used, and the setting is similar to that of label ranking.

We consider an application from the field of natural language processing known as parse ranking. Syntactic parsing refers to the process of analyzing natural language text according to some formal grammar. Due to the ambiguity of natural language (“I shot an elephant in my pyjamas.” - just who was in the pyjamas?), a sentence can have multiple grammatically correct parses. In parse ranking, an automated parser generates a set of candidate parses, which need to be scored according to how well they match the true (human-made) parse of the sentence.

This can be modeled as a ranking problem, where the data consists of inputs representing sentence-parse pairs, and outputs that are scores describing the ‘goodness’ of the parse for the sentence. Each sentence corresponds to a query, and the parses of that sentence are objects that should be ranked.

QueryRankRLS minimizes the magnitude preserving ranking error within each training query, but does not compare instances associated with different queries.
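
To make the objective concrete, the following sketch of ours (query_ranking_loss is not part of RLScore) computes this loss for a given prediction vector and a partition of the instances into queries; the regularization term is omitted.

import numpy as np

def query_ranking_loss(Y, P, queries):
    #Magnitude preserving ranking error summed over all pairs
    #within each query; pairs from different queries are ignored
    loss = 0.
    for query in queries:
        for i in query:
            for j in query:
                if i < j:
                    loss += ((Y[i] - Y[j]) - (P[i] - P[j]))**2
    return loss

#toy example: two queries, with three and two instances respectively
Y = np.array([3., 1., 2., 1., 2.])
P = np.array([2.5, 0.5, 2.0, 1.0, 1.5])
print(query_ranking_loss(Y, P, [[0, 1, 2], [3, 4]]))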

The data set

The parse ranking data set can be downloaded from here.

First, we load in the training set and examine its properties.

import numpy as np
from rlscore.utilities.reader import read_sparse
from rlscore.utilities.cross_validation import map_ids

def print_stats():
    X_train =  read_sparse("train_2000_x.txt")
    Y_train =  np.loadtxt("train_2000_y.txt")
    ids =  np.loadtxt("train_2000_qids.txt", dtype=int)
    folds = map_ids(ids)
    print("Parse data set characteristics")
    print("Training set: %d instances, %d features" %X_train.shape)
    print("Instances grouped into %d sentences" %len(folds))
    

if __name__=="__main__":
    print_stats()
Parse data set characteristics
Training set: 2000 instances, 195100 features
Instances grouped into 117 sentences

As is common in natural language applications, the data is very high dimensional. In addition to the data we load a list of sentence ids, denoting which sentence each instance is associated with. Finally, based on these ids we map the data to fold indices with the map_ids function, where each fold contains the indices of all training instances associated with a given sentence. Altogether there are 117 folds, each corresponding to one sentence.
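
A toy example of the grouping that map_ids performs (the exact ordering of the returned folds is an implementation detail; we only rely on the grouping behavior described above):

import numpy as np
from rlscore.utilities.cross_validation import map_ids

#each entry tells which sentence the corresponding instance belongs to
ids = np.array([1, 1, 2, 2, 2, 5, 5])
folds = map_ids(ids)
#indices grouped by sentence id, e.g. [[0, 1], [2, 3, 4], [5, 6]]
print(folds)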

Learning a ranking function

First, we learn a ranking function. The main difference to learning global rankings is how the queries are handled. When computing the cindex for the test set, we compute the cindex for each test query separately and finally take the mean.

import numpy as np
from rlscore.learner import QueryRankRLS
from rlscore.measure import cindex
from rlscore.utilities.reader import read_sparse
from rlscore.utilities.cross_validation import map_ids

def train_rls():
    #Train QueryRankRLS with default parameters,
    #where instances related to a single sentence
    #form together a query
    X_train =  read_sparse("train_2000_x.txt")
    Y_train =  np.loadtxt("train_2000_y.txt")
    X_test =  read_sparse("test_2000_x.txt", X_train.shape[1])
    Y_test =  np.loadtxt("test_2000_y.txt")
    #list of sentence ids
    qids_train =  np.loadtxt("train_2000_qids.txt")
    qids_test = np.loadtxt("test_2000_qids.txt")
    learner = QueryRankRLS(X_train, Y_train, qids_train)
    P_test = learner.predict(X_test)
    perfs = []
    partition = map_ids(qids_test)
    #compute the ranking accuracy separately for each test query
    for query in partition:
        #skip such queries, where all instances have the same
        #score, since in this case cindex is undefined
        if np.var(Y_test[query]) != 0:
            perf = cindex(Y_test[query], P_test[query])
            perfs.append(perf)
    test_perf = np.mean(perfs)
    print("test cindex %f" %test_perf)


if __name__=="__main__":
    train_rls()
test cindex 0.730632

Leave-query-out cross-validation

Next, we estimate the ranking accuracy on the training set using leave-query-out cross-validation, where each query is in turn left out of the training set as the holdout set. This is implemented using the efficient algorithm described in [2].

import numpy as np
from rlscore.learner import QueryRankRLS
from rlscore.measure import cindex
from rlscore.utilities.reader import read_sparse
from rlscore.utilities.cross_validation import map_ids

def train_rls():
    #Estimate performance with leave-query-out cross-validation,
    #where instances related to a single sentence form
    #together a fold
    X_train =  read_sparse("train_2000_x.txt")
    Y_train =  np.loadtxt("train_2000_y.txt")
    X_test =  read_sparse("test_2000_x.txt", X_train.shape[1])
    Y_test =  np.loadtxt("test_2000_y.txt")
    #list of sentence ids
    qids_train =  np.loadtxt("train_2000_qids.txt")
    qids_test = np.loadtxt("test_2000_qids.txt")
    learner = QueryRankRLS(X_train, Y_train, qids_train)
    P_test = learner.predict(X_test)
    folds = map_ids(qids_train)
    perfs = []
    for fold in folds:
        if np.var(Y_train[fold]) != 0:
            P = learner.holdout(fold)
            c = cindex(Y_train[fold], P)
            perfs.append(c)
    perf = np.mean(perfs)
    print("leave-query-out cross-validation cindex %f" %perf)
    partition = map_ids(qids_test)
    test_perfs = []
    #compute the ranking accuracy separately for each test query
    for query in partition:
        #skip such queries, where all instances have the same
        #score, since in this case cindex is undefined
        if np.var(Y_test[query]) != 0:
            perf = cindex(Y_test[query], P_test[query])
            test_perfs.append(perf)
    test_perf = np.mean(test_perfs)
    print("test cindex %f" %test_perf)


if __name__=="__main__":
    train_rls()
leave-query-out cross-validation cindex 0.758013
test cindex 0.730632

Leave-query-out and model selection

Finally, we use a higher level wrapper for leave-query-out cross-validation, and at the same time make use of automated model selection.

import numpy as np
from rlscore.learner import LeaveQueryOutRankRLS
from rlscore.measure import cindex
from rlscore.utilities.reader import read_sparse
from rlscore.utilities.cross_validation import map_ids

def train_rls():
    #Select regparam with leave-query-out cross-validation,
    #where instances related to a single sentence form
    #together a fold
    X_train =  read_sparse("train_2000_x.txt")
    Y_train =  np.loadtxt("train_2000_y.txt")
    X_test =  read_sparse("test_2000_x.txt", X_train.shape[1])
    Y_test =  np.loadtxt("test_2000_y.txt")
    #list of sentence ids
    qids_train =  np.loadtxt("train_2000_qids.txt")
    qids_test = np.loadtxt("test_2000_qids.txt")
    regparams = [2.**i for i in range(-10, 10)]
    learner = LeaveQueryOutRankRLS(X_train, Y_train, qids_train, regparams = regparams, measure = cindex)
    lqo_perfs = learner.cv_performances
    P_test = learner.predict(X_test)
    print("leave-query-out performances " +str(lqo_perfs))
    print("chosen regparam %f" %learner.regparam)
    partition = map_ids(qids_test)
    #compute the ranking accuracy separately for each test query
    test_perfs = []
    for query in partition:
        #skip such queries, where all instances have the same
        #score, since in this case cindex is undefined
        if np.var(Y_test[query]) != 0:
            perf = cindex(Y_test[query], P_test[query])
            test_perfs.append(perf)
    test_perf = np.mean(test_perfs)
    print("test cindex %f" %test_perf)


if __name__=="__main__":
    train_rls()
leave-query-out performances [ 0.74507212  0.74531795  0.74594602  0.74685541  0.74635137  0.74748445
  0.75077485  0.75197701  0.75505724  0.75813978  0.75801346  0.75832107
  0.75813583  0.76626898  0.76610342  0.76648227  0.7598727   0.7566041
  0.76145765  0.75382555]
chosen regparam 32.000000
test cindex 0.726531

Tutorial 3: Learning from pairwise preferences

RankRLS also supports training data given as pairwise preferences of the type A > B, meaning that instance A is preferred to instance B. In the next example, we generate a set of such pairwise preferences sampled from the Housing training set, train a model, and test how well the model predicts on independent test data.

from rlscore.learner import PPRankRLS
from rlscore.measure import cindex

from housing_data import load_housing
import random
random.seed(33)

def train_rls():
    #Trains PPRankRLS with default parameters (regparam=1.0, kernel='LinearKernel')
    X_train, Y_train, X_test, Y_test = load_housing()
    pairs_start = []
    pairs_end = []
    #Sample 1000 pairwise preferences from the data 
    trange = range(len(Y_train))
    while len(pairs_start) < 1000:
        ind0 = random.choice(trange)
        ind1 = random.choice(trange)
        if Y_train[ind0] > Y_train[ind1]:
            pairs_start.append(ind0)
            pairs_end.append(ind1)
        elif Y_train[ind0] < Y_train[ind1]:
            pairs_start.append(ind1)
            pairs_end.append(ind0)
    learner = PPRankRLS(X_train, pairs_start, pairs_end)
    #Test set predictions
    P_test = learner.predict(X_test)
    print("test cindex %f" %cindex(Y_test, P_test))

if __name__=="__main__":
    train_rls()
test cindex 0.861997

With a sample of 1000 training pairs, the model works as well as the one trained on the y-scores directly.

Precomputed kernels, reduced set approximation

See the regression tutorial for examples. RankRLS also supports precomputed kernel matrices supplied together with the kernel="PrecomputedKernel" argument, as well as reduced set approximation for large data sets.
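
As a brief sketch of the precomputed kernel option (assuming, as in the regression tutorial, that the training kernel matrix takes the place of the data matrix and that predict is given the kernel matrix between test and training instances):

import numpy as np
from rlscore.learner import GlobalRankRLS
from rlscore.measure import cindex

from housing_data import load_housing

def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    #linear kernel computed explicitly
    K_train = np.dot(X_train, X_train.T)
    K_test = np.dot(X_test, X_train.T)
    learner = GlobalRankRLS(K_train, Y_train, kernel="PrecomputedKernel")
    P_test = learner.predict(K_test)
    print("test cindex %f" %cindex(Y_test, P_test))

if __name__=="__main__":
    train_rls()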

References

[1] Tapio Pahikkala, Evgeni Tsivtsivadze, Antti Airola, Jorma Boberg, and Tapio Salakoski. Learning to rank with pairwise regularized least-squares. In Thorsten Joachims, Hang Li, Tie-Yan Liu, and ChengXiang Zhai, editors, SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, pages 27-33, 2007.
[2] Tapio Pahikkala, Evgeni Tsivtsivadze, Antti Airola, Jouni Jarvinen, and Jorma Boberg. An efficient algorithm for learning to rank from preference graphs. Machine Learning, 75(1):129-165, 2009.
[3] Tapio Pahikkala, Antti Airola, Jorma Boberg, and Tapio Salakoski. Exact and efficient leave-pair-out cross-validation for ranking RLS. In Proceedings of the 2nd International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'08), pages 1-8, Espoo, Finland, 2008.