Sparse large data sets and linear models

In some application domains it is typical to have data sets with both large sample size and dimensionality, but small amount of non-zero feature for each individual instance. For example, in document classification where features correspond to words the number of possible words in a vocabulary can be very large, even though a single document will only contain a small subset of these words.

RLScore contains some tools for learning linear models with such data, combining sparse matrix data structures with conjugate gradient optimization. The idea was presented for RLS optimization in [1], [2] and [3] explore the idea further in context of RankRLS training.

While useful tools, these modules are not the main focus of RLScore development, as the conjugate gradient based training methods are not compatible with the computational short cuts implemented in the other modules, such as the fast cross-validation algorithms. For Big Data problems with millions of samples or more and huge dimensionality, we recommend using stochastic gradient descent solvers available in several other machine learning libraries.

Data set

We consider the classical 20 Newsgroups data set. We extract the preprocessed files in sparse matrix format contained in the archive file 20news-bydate-matlab.tgz. The data can be processed using the following code.

import numpy as np
from scipy import sparse as sp

from rlscore.utilities import multiclass

def load_newsgroups():
    T = np.loadtxt("train.data")
    #map indices from 1...n to 0...n-1
    rows = T[:,0] -1
    cols = T[:,1] -1
    vals = T[:,2]
    X_train = sp.coo_matrix((vals, (rows, cols)))
    X_train = X_train.tocsc()
    T = np.loadtxt("test.data")
    #map indices from 1...n to 0...n-1
    rows = T[:,0] -1
    cols = T[:,1] -1
    vals = T[:,2]
    X_test = sp.coo_matrix((vals, (rows, cols)))
    X_test = X_test.tocsc()
    #X_test has additional features not present in X_train
    X_test = X_test[:,:X_train.shape[1]]
    Y_train = np.loadtxt("train.label", dtype=int)
    Y_train = multiclass.to_one_vs_all(Y_train, False)
    Y_test = np.loadtxt("test.label", dtype=int)
    Y_test = multiclass.to_one_vs_all(Y_test, False)
    return X_train, Y_train, X_test, Y_test

def print_stats():
    X_train, Y_train, X_test, Y_test = load_newsgroups()
    print("Train X dimensions %d %d" %X_train.shape)
    print("Test X dimensions %d %d" %X_test.shape)
    print("Number of labels %d" %Y_train.shape[1])
    

if __name__=="__main__":
    print_stats()
Train X dimensions 11269 53975
Test X dimensions 7505 53975
Number of labels 20

Experiments

In the following, we train RLS using the conjugate gradient algorithm. Unlike the basic RLS module, CGRLS does not support multi-output data. In this example we train a classifier for a binary classification task of predicting whether a document belongs to Newsgroup number 1, or not. Multi-class classification could be implemented by training one one-vs-all predictor for each class, and assigning each test instance to the class with the highest prediction.

from rlscore.learner import CGRLS
from rlscore.measure import auc

from newsgroups_data import load_newsgroups

def train_rls():
    X_train, Y_train, X_test, Y_test = load_newsgroups()
    #CGRLS does not support multi-output learning, so we train
    #one classifier for the first column of Y. Multi-class learning
    #would be implemented by training one CGRLS for each column, and
    #taking the argmax of class predictions.
    predictions = []
    rls = CGRLS(X_train, Y_train[:,0], regparam= 100.0)
    P = rls.predict(X_test)
    perf = auc(Y_test[:,0], P)
    print("auc for task 1 %f" %perf) 


if __name__=="__main__":
    train_rls()
auc for task 1 0.950931

And the same for RankRLS, with similar results. The CGRankRLS learner supports also query-structured data, and the learner PCGRankRLS can be used for learning from pairwise preferences with sparse data.

from rlscore.learner import CGRankRLS
from rlscore.measure import auc

from newsgroups_data import load_newsgroups

def train_rls():
    X_train, Y_train, X_test, Y_test = load_newsgroups()
    #CGRLS does not support multi-output learning, so we train
    #one classifier for the first column of Y. Multi-class learning
    #would be implemented by training one CGRLS for each column, and
    #taking the argmax of class predictions.
    predictions = []
    rls = CGRankRLS(X_train, Y_train[:,0], regparam= 100.0)
    P = rls.predict(X_test)
    perf = auc(Y_test[:,0], P)
    print("auc for task 1 %f" %perf) 


if __name__=="__main__":
    train_rls()
auc for task 1 0.947574

References

[1]Rifkin, R., Yeo, G., & Poggio, T. (2003). Regularized least-squares classification. In J. Suykens, G. Horvath, S. Basu, C. Micchelli, & J. Vandewalle (Eds.), NATO science series III: computer and system sciences: Vol. 190. Advances in learning theory: methods, model and applications (pp. 131–154).
[2]Tapio Pahikkala, Evgeni Tsivtsivadze, Antti Airola, Jouni Jarvinen, and Jorma Boberg. An efficient algorithm for learning to rank from preference graphs. Machine Learning, 75(1):129-165, 2009.
[3]Antti Airola, Tapio Pahikkala, and Tapio Salakoski. Large Scale Training Methods for Linear RankRLS. ECML/PKDD-10 Workshop on Preference Learning, 2010.