Kernels

Most of the algorithms in RLScore are kernel methods. The default behavior of these methods is to use the linear kernel, if no additional arguments are provided. However, many of the learners allow supplying a kernel name and kernel parameters, and/or support pre-computed kernel matrices supplied by the user. If the number of features is small, training with the linear kernel can be much faster than with the non-linear ones. The non-linear kernels are also compatible with the fast cross-validation and regularization algorithms introduced in the tutorials.

Some further examples on using kernels can be found in the other tutorials.

RLScore currently implements the following kernels:

LinearKernel: k(xi, xj) = <xi, xj> + bias. Parameters: bias (default = 1.0)

GaussianKernel: k(xi, xj) = e^(-gamma * <xi - xj, xi - xj>). Parameters: gamma (default = 1.0)

PolynomialKernel: k(xi, xj) = (gamma * <xi, xj> + coef0)**degree. Parameters: gamma (default = 1.0), coef0 (default = 0.0), degree (default = 2)

PrecomputedKernel: the caller supplies their own kernel matrix for training and prediction, instead of the data matrix.
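The helper functions below are only an illustrative sketch (they are not part of the RLScore API) showing what these formulas compute for two individual instances xi and xj; within RLScore, kernels are evaluated over whole data matrices through the kernel classes demonstrated in Tutorial 2.

import numpy as np

def linear_kernel(xi, xj, bias=1.0):
    #k(xi, xj) = <xi, xj> + bias
    return np.dot(xi, xj) + bias

def gaussian_kernel(xi, xj, gamma=1.0):
    #k(xi, xj) = e^(-gamma * <xi - xj, xi - xj>)
    d = xi - xj
    return np.exp(-gamma * np.dot(d, d))

def polynomial_kernel(xi, xj, gamma=1.0, coef0=0.0, degree=2):
    #k(xi, xj) = (gamma * <xi, xj> + coef0)**degree
    return (gamma * np.dot(xi, xj) + coef0) ** degree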

Tutorial 1: Basic usage

In these examples we use the RLS learner (see the regression and classification tutorials). The RankRLS modules (see the learning to rank tutorials) behave in exactly the same way with respect to the use of kernel functions.

Data set

We consider the classical Boston Housing data set from the UCI machine learning repository. The data consists of 506 instances, 13 features and 1 output to be predicted.

The data can be loaded from disk and split into a training set of 250 instances and a test set of 256 instances using the following code.

import numpy as np

def load_housing():
    #fix the random seed so that the shuffle, and thus the split, is reproducible
    np.random.seed(1)
    D = np.loadtxt("housing.data")
    np.random.shuffle(D)
    #the last column is the output to be predicted, the rest are features
    X = D[:,:-1]
    Y = D[:,-1]
    #first 250 shuffled instances form the training set, the remaining 256 the test set
    X_train = X[:250]
    Y_train = Y[:250]
    X_test = X[250:]
    Y_test = Y[250:]
    return X_train, Y_train, X_test, Y_test

def print_stats():
    X_train, Y_train, X_test, Y_test = load_housing()
    print("Housing data set characteristics")
    print("Training set: %d instances, %d features" %X_train.shape)
    print("Test set: %d instances, %d features" %X_test.shape)

if __name__ == "__main__":
    print_stats()


Housing data set characteristics
Training set: 250 instances, 13 features
Test set: 256 instances, 13 features

Default behavior

By default, if no kernel-related parameters are supplied, the learners use the linear kernel with bias=1.0.

import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror

from housing_data import load_housing


def train_rls():
    #Trains RLS with default parameters (regparam=1.0, kernel='LinearKernel')
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train)
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    #Test set predictions
    P_test = learner.predict(X_test)
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()

The resulting output is as follows.

leave-one-out error 25.959399
test error 25.497222
mean predictor 81.458770

Different kernels

Alternatively, we can explicitly select one of the kernels implemented in RLScore.

First, let us try the Gaussian kernel with its default parameters.

import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror

from housing_data import load_housing


def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train, kernel="GaussianKernel", regparam=1, gamma=1)
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    #Test set predictions
    P_test = learner.predict(X_test)
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()

leave-one-out error 612.986310
test error 564.550970
mean predictor 81.458770

Oops, the results are horribly bad. Clearly, the default parameters are far from optimal.

Next, we cheat a bit by using prior knowledge about good kernel parameter values. The other tutorials contain examples of how the fast cross-validation algorithms can be used to automatically select the regularization and kernel parameters.
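The following sketch illustrates what such a search could look like if done by hand: it simply re-trains RLS over a grid of (gamma, regparam) values and keeps the combination with the lowest leave-one-out error. The grid bounds are arbitrary choices for illustration, and this brute-force loop is much slower than the fast cross-validation learners covered in the other tutorials; it only uses calls already introduced above.

import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror

from housing_data import load_housing


def grid_search():
    X_train, Y_train, X_test, Y_test = load_housing()
    best_error = float("inf")
    best_params = None
    #brute-force search over a log-scale grid of kernel and regularization parameters
    for log_gamma in range(-20, 0):
        for log_regparam in range(-15, 10):
            learner = RLS(X_train, Y_train, kernel="GaussianKernel",
                          gamma=2.**log_gamma, regparam=2.**log_regparam)
            error = sqerror(Y_train, learner.leave_one_out())
            if error < best_error:
                best_error = error
                best_params = (2.**log_gamma, 2.**log_regparam)
    print("best leave-one-out error %f" %best_error)
    print("gamma %g, regparam %g" %best_params)

if __name__=="__main__":
    grid_search()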

Linear kernel:

import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror

from housing_data import load_housing


def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train, kernel="LinearKernel", bias=1, regparam=1)
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    #Test set predictions
    P_test = learner.predict(X_test)
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()

leave-one-out error 25.959399
test error 25.497221
mean predictor 81.458770

Gaussian kernel:

import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror

from housing_data import load_housing


def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train, kernel="GaussianKernel", regparam=0.0003, gamma=0.00003)
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    #Test set predictions
    P_test = learner.predict(X_test)
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()

leave-one-out error 22.074760
test error 16.483097
mean predictor 81.458770

Polynomial kernel:

import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror

from housing_data import load_housing


def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train, kernel="PolynomialKernel", regparam = 100, gamma=1.0, coef0=1.0, degree=2)
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    #Test set predictions
    P_test = learner.predict(X_test)
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()

leave-one-out error 20.943394
test error 16.215182
mean predictor 81.458770

Reduced set approximation

Once the data set size exceeds a couple of thousand instances, maintaining the whole kernel matrix in memory is no longer feasible. Instead of using all the training data to represent the learned model, one can restrict the model to a subset of the data, resulting in the so-called reduced set approximation (also known as the ‘subset of regressors’ method, and closely related to the Nyström approximation). As a starting point, one can randomly select a couple of hundred training instances as basis vectors; a lot of research has been done on more advanced selection strategies.

import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror

from housing_data import load_housing
import random
random.seed(1)

def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    #select randomly 100 basis vectors
    indices = range(X_train.shape[0])
    indices = random.sample(indices, 100)
    basis_vectors = X_train[indices]    
    learner = RLS(X_train, Y_train, basis_vectors = basis_vectors, kernel="GaussianKernel", regparam=0.0003, gamma=0.00003)
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    #Test set predictions
    P_test = learner.predict(X_test)
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()

leave-one-out error 22.757333
test error 21.779148
mean predictor 81.458770

Compared to the previous example using the Gaussian kernel, the accuracy of the model degrades slightly due to the approximation. Training is now faster and uses less memory (though you will not notice the difference yet on this data), and restricting the model to 100 basis vectors instead of all 250 training examples makes prediction roughly 2.5 times faster.
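The speed difference can be checked with a simple timing sketch like the one below (not part of the original tutorial; the exact numbers depend on your machine). It trains the full Gaussian kernel model and the 100 basis vector model as above, and times repeated test set predictions with the standard time module.

import time
import random
from rlscore.learner import RLS

from housing_data import load_housing
random.seed(1)

def time_predictions():
    X_train, Y_train, X_test, Y_test = load_housing()
    full = RLS(X_train, Y_train, kernel="GaussianKernel",
               regparam=0.0003, gamma=0.00003)
    basis_vectors = X_train[random.sample(range(X_train.shape[0]), 100)]
    reduced = RLS(X_train, Y_train, basis_vectors = basis_vectors,
                  kernel="GaussianKernel", regparam=0.0003, gamma=0.00003)
    for name, learner in [("full", full), ("reduced", reduced)]:
        start = time.time()
        #repeat the predictions to get a measurable duration
        for i in range(100):
            learner.predict(X_test)
        print("%s model: %f seconds" %(name, time.time() - start))

if __name__=="__main__":
    time_predictions()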

Tutorial 2: Precomputed kernels

With the “PrecomputedKernel” option, you can supply precomputed kernel matrices instead of the data matrix as input to the learner.

Basic usage

In the basic case, one needs to supply an n_samples x n_samples sized (valid positive semi-definite) kernel matrix for training. For prediction, we compute an n_test_samples x n_samples sized matrix containing the kernel evaluations between the test and training data. In this example, we generate the kernel matrices using the Gaussian kernel implementation in RLScore. Note that one first initializes a GaussianKernel object with a set of training data. All subsequent calls of the getKM method return a kernel matrix consisting of the kernel evaluations between that training data and the set of data points provided as an argument. For example, the call getKM(X_test) returns the kernel matrix between the training and test data.

import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror
from rlscore.kernel import GaussianKernel
from housing_data import load_housing


def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    kernel = GaussianKernel(X_train, gamma = 0.00003)
    K_train = kernel.getKM(X_train)
    K_test = kernel.getKM(X_test)
    learner = RLS(K_train, Y_train, kernel="PrecomputedKernel", regparam=0.0003)
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    #Test set predictions
    P_test = learner.predict(K_test)
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()

leave-one-out error 22.074760
test error 16.483097
mean predictor 81.458770

Reduced set approximation

A precomputed kernel can also be combined with the reduced set approximation. In this case, again use the “PrecomputedKernel” option. For training, supply the n_samples x n_bvectors slice of the full kernel matrix instead of the data matrix, and the n_bvectors x n_bvectors slice of the kernel matrix instead of the basis vectors. For testing, supply an n_test_samples x n_bvectors sized matrix containing the kernel evaluations between the test examples and the basis vectors.

import numpy as np
from rlscore.learner.rls import RLS
from rlscore.measure import sqerror
from rlscore.kernel import GaussianKernel
from housing_data import load_housing
import random
random.seed(1)

def train_rls():
    X_train, Y_train, X_test, Y_test = load_housing()
    #select randomly 100 basis vectors
    indices = range(X_train.shape[0])
    indices = random.sample(indices, 100)
    basis_vectors = X_train[indices]
    kernel = GaussianKernel(basis_vectors, gamma=0.00003)
    K_train = kernel.getKM(X_train)
    K_rr = kernel.getKM(basis_vectors)
    K_test = kernel.getKM(X_test)
    learner = RLS(K_train, Y_train, basis_vectors = K_rr, kernel="PrecomputedKernel", regparam=0.0003)
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    #Test set predictions
    P_test = learner.predict(K_test)
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()

leave-one-out error 22.757333
test error 21.779148
mean predictor 81.458770

The results are exactly the same as with the previous reduced set example, since (with the same random seed selecting the same basis vectors) the precomputed matrices contain exactly the kernel evaluations that were computed internally in that example.

Tutorial 3: Kronecker learners

The Kronecker kernel learners take as input either two data matrices or two kernel matrices. The final (only implicitly formed) data / kernel matrix is the Kronecker product of these two matrices.
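To make the implicit Kronecker product concrete: with linear kernels, the kernel value between the pairs (d, t) and (d', t') factorizes as kD(d, d') * kT(t, t'), so the full pairwise kernel matrix equals (up to the ordering of the pairs) the Kronecker product of the two domain kernel matrices. The toy-sized NumPy check below is purely illustrative (the matrices and sizes are made up); the learners never form this matrix explicitly, which is what keeps them fast.

import numpy as np

np.random.seed(0)
X1 = np.random.randn(3, 5)   #e.g. 3 "drugs" with 5 features each
X2 = np.random.randn(4, 6)   #e.g. 4 "targets" with 6 features each
K1 = np.dot(X1, X1.T)        #linear kernel matrix for domain 1
K2 = np.dot(X2, X2.T)        #linear kernel matrix for domain 2
#pairwise kernel matrix over all 3*4 = 12 (drug, target) pairs
K = np.kron(K2, K1)
#under the column-major (order='F') ravel used in the data loading code below,
#pair (drug i, target j) maps to flat index j*3 + i, and its kernel value against
#pair (drug k, target l) is K1[i,k] * K2[j,l]
i, j, k, l = 0, 1, 2, 2
print(np.isclose(K[j*3 + i, l*3 + k], K1[i, k] * K2[j, l]))   #prints True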

For these experiments, we need to download, from the drug-target binding affinity data sets page, the Davis et al. data: the drug-target interaction affinities (Y), the drug-drug 2D similarities (X1), and the WS target-target similarities (X2). In the following we use the similarity scores directly as features for the linear kernel, since the similarity matrices themselves are not valid positive semi-definite kernel matrices.

We can load the data set as follows:

import numpy as np

def load_davis():
    Y = np.loadtxt("drug-target_interaction_affinities_Kd__Davis_et_al.2011.txt")
    XD = np.loadtxt("drug-drug_similarities_2D.txt")
    XT = np.loadtxt("target-target_similarities_WS.txt")    
    return XD, XT, Y

def settingB_split():
    np.random.seed(1)
    XD, XT, Y = load_davis()
    drug_ind = list(range(Y.shape[0]))
    np.random.shuffle(drug_ind)
    train_drug_ind = drug_ind[:40]
    test_drug_ind = drug_ind[40:]
    #Setting 2: split according to drugs
    Y_train = Y[train_drug_ind]
    Y_test = Y[test_drug_ind]
    Y_train = Y_train.ravel(order='F')
    Y_test = Y_test.ravel(order='F')
    XD_train = XD[train_drug_ind]
    XT_train = XT
    XD_test = XD[test_drug_ind]
    XT_test = XT
    return XD_train, XT_train, Y_train, XD_test, XT_test, Y_test   

def settingC_split():
    np.random.seed(1)
    XD, XT, Y = load_davis()
    drug_ind = list(range(Y.shape[0]))
    target_ind = list(range(Y.shape[1]))
    np.random.shuffle(target_ind)
    train_target_ind = target_ind[:300]
    test_target_ind = target_ind[300:]
    #Setting 3: split according to targets
    Y_train = Y[:, train_target_ind]
    Y_test = Y[:, test_target_ind]
    Y_train = Y_train.ravel(order='F')
    Y_test = Y_test.ravel(order='F')
    XD_train = XD
    XT_train = XT[train_target_ind]
    XD_test = XD
    XT_test = XT[test_target_ind]
    return XD_train, XT_train, Y_train, XD_test, XT_test, Y_test  

def settingD_split():
    np.random.seed(1)
    XD, XT, Y = load_davis()
    drug_ind = list(range(Y.shape[0]))
    target_ind = list(range(Y.shape[1]))
    np.random.shuffle(drug_ind)
    np.random.shuffle(target_ind)
    train_drug_ind = drug_ind[:40]
    test_drug_ind = drug_ind[40:]
    train_target_ind = target_ind[:300]
    test_target_ind = target_ind[300:]
    #Setting 4: ensure that d,t pairs do not overlap between
    #training and test set
    Y_train = Y[np.ix_(train_drug_ind, train_target_ind)]
    Y_test = Y[np.ix_(test_drug_ind, test_target_ind)]
    Y_train = Y_train.ravel(order='F')
    Y_test = Y_test.ravel(order='F')
    XD_train = XD[train_drug_ind]
    XT_train = XT[train_target_ind]
    XD_test = XD[test_drug_ind]
    XT_test = XT[test_target_ind]
    return XD_train, XT_train, Y_train, XD_test, XT_test, Y_test    

if __name__=="__main__":
    XD, XT, Y = load_davis()
    print("Y dimensions %d %d" %Y.shape)
    print("XD dimensions %d %d" %XD.shape)
    print("XT dimensions %d %d" %XT.shape)
    print("drug-target pairs: %d" %(Y.shape[0]*Y.shape[1]))

Y dimensions 68 442
XD dimensions 68 68
XT dimensions 442 442
drug-target pairs: 30056

Linear kernel

The default behavior when supplying two data matrices is to use the linear kernel for both domains:

from rlscore.learner import KronRLS
from rlscore.measure import cindex
import davis_data

def main():
    X1_train, X2_train, Y_train, X1_test, X2_test, Y_test = davis_data.settingD_split()
    learner = KronRLS(X1 = X1_train, X2 = X2_train, Y = Y_train)
    log_regparams = range(15, 35)
    for log_regparam in log_regparams:
        learner.solve(2.**log_regparam)
        P = learner.predict(X1_test, X2_test)
        perf = cindex(Y_test, P)
        print("regparam 2**%d, cindex %f" %(log_regparam, perf))

if __name__=="__main__":
    main()

regparam 2**15, cindex 0.565452
regparam 2**16, cindex 0.568766
regparam 2**17, cindex 0.573612
regparam 2**18, cindex 0.579559
regparam 2**19, cindex 0.585241
regparam 2**20, cindex 0.588507
regparam 2**21, cindex 0.590205
regparam 2**22, cindex 0.593595
regparam 2**23, cindex 0.600126
regparam 2**24, cindex 0.610286
regparam 2**25, cindex 0.622301
regparam 2**26, cindex 0.633359
regparam 2**27, cindex 0.643321
regparam 2**28, cindex 0.653964
regparam 2**29, cindex 0.665960
regparam 2**30, cindex 0.677539
regparam 2**31, cindex 0.685681
regparam 2**32, cindex 0.687180
regparam 2**33, cindex 0.681317
regparam 2**34, cindex 0.669595

Precomputed kernels

Alternatively, pre-computed kernel matrices may be supplied. The two kernels for the different domains need not be the same.

from rlscore.learner import KronRLS
from rlscore.measure import cindex
from rlscore.kernel.gaussian_kernel import GaussianKernel
import davis_data

def main():
    X1_train, X2_train, Y_train, X1_test, X2_test, Y_test = davis_data.settingD_split()
    kernel1 = GaussianKernel(X1_train, gamma=0.01)
    kernel2 = GaussianKernel(X2_train, gamma=10**-9)
    K1_train = kernel1.getKM(X1_train)
    K1_test = kernel1.getKM(X1_test)
    K2_train = kernel2.getKM(X2_train)
    K2_test = kernel2.getKM(X2_test)
    learner = KronRLS(K1 = K1_train, K2 = K2_train, Y = Y_train)
    log_regparams = range(-15, 15)
    for log_regparam in log_regparams:
        learner.solve(2.**log_regparam)
        P = learner.predict(K1_test, K2_test)
        perf = cindex(Y_test, P)
        print("regparam 2**%d, cindex %f" %(log_regparam, perf))

if __name__=="__main__":
    main()

regparam 2**-15, cindex 0.569067
regparam 2**-14, cindex 0.569091
regparam 2**-13, cindex 0.568828
regparam 2**-12, cindex 0.569258
regparam 2**-11, cindex 0.571447
regparam 2**-10, cindex 0.576672
regparam 2**-9, cindex 0.586088
regparam 2**-8, cindex 0.601126
regparam 2**-7, cindex 0.619626
regparam 2**-6, cindex 0.637500
regparam 2**-5, cindex 0.651186
regparam 2**-4, cindex 0.663625
regparam 2**-3, cindex 0.675121
regparam 2**-2, cindex 0.684109
regparam 2**-1, cindex 0.688166
regparam 2**0, cindex 0.686442
regparam 2**1, cindex 0.678351
regparam 2**2, cindex 0.666589
regparam 2**3, cindex 0.652443
regparam 2**4, cindex 0.635858
regparam 2**5, cindex 0.620647
regparam 2**6, cindex 0.610092
regparam 2**7, cindex 0.598681
regparam 2**8, cindex 0.588627
regparam 2**9, cindex 0.585657
regparam 2**10, cindex 0.585018
regparam 2**11, cindex 0.583830
regparam 2**12, cindex 0.583879
regparam 2**13, cindex 0.584028
regparam 2**14, cindex 0.584046