Performance measures

RLScore implements a variety of performance measures for classification, regression and ranking.

Let Y and P contain the true outputs and predicted outputs for some problem. For single-target learning problems both are one-dimensional lists or arrays of size [n_samples]. For multi-target problems, both are two-dimensional lists or arrays of size [n_samples, n_targets].

A performance measure is a function measure(Y, P) that returns a floating point value denoting how well P matches Y. If Y and P have several columns, the performance measure is typically computed for each column separately and then averaged. A performance measure has a property iserror, which is used by the grid search codes to check whether large or small values are better. An UndefinedPerformance error may be raised if for some reason the performance measure is not well defined for the given input.
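
To make the convention concrete, here is a sketch of what a custom measure obeying it could look like. This is an illustration only, not part of RLScore; it assumes that UndefinedPerformance can be imported from rlscore.measure.measure_utilities, as the tracebacks in the tutorials below suggest.

import numpy as np

from rlscore.measure.measure_utilities import UndefinedPerformance

def mean_absolute_error(Y, P):
    #Hypothetical example measure following the measure(Y, P) convention
    Y, P = np.array(Y, dtype=float), np.array(P, dtype=float)
    if Y.shape != P.shape:
        raise UndefinedPerformance("Y and P must have the same shape")
    #Mean absolute difference, averaged over all samples and targets
    return np.mean(np.abs(Y - P))

#Smaller values are better, so grid search should treat this measure as an error
mean_absolute_error.iserror = True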

Tutorial 1: Basic usage

First, let us consider some basic binary classification measures. These performance measures assume that Y-values (true class labels) are from the set {-1, 1}. P-values (predicted class labels) can be any real values, but they are mapped with the rule P[i] > 0 -> 1 and P[i] <= 0 -> -1 before computing the performance.

This is how one can compute simple binary classification accuracy.

from rlscore.measure import accuracy

#My class labels, two examples in the positive class and three in the negative
Y = [-1, -1, -1, 1, 1]

#Some predictions
P = [-1, -1, 1, 1, 1]

print("My accuracy %f" %accuracy(Y,P))

#Accuracy accepts real-valued predictions, P2[i]>0 are mapped to +1, rest to -1
P2 = [-2.7, -1.3, 0.2, 1.3, 1]

print("My accuracy with real-valued predictions %f" %accuracy(Y,P2))

Y2 = [2, 1, 3, 4, 1]

#Labels must be in the set {-1,1}, this will not work

accuracy(Y2, P)
My accuracy 0.800000
My accuracy with real-valued predictions 0.800000

Traceback (most recent call last):
  File "measure1.py", line 20, in <module>
    accuracy(Y2, P)
...
rlscore.measure.measure_utilities.UndefinedPerformance: 'binary classification accuracy accepts as Y-values only 1 and -1'

Four out of five instances are correctly classified, so classification accuracy is 0.8. Giving as input Y-values outside {-1, 1} causes an exception to be raised.
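
For comparison, the same number can be reproduced with plain numpy by applying the thresholding rule by hand. This snippet is only an illustrative check, not part of the tutorial script.

import numpy as np

Y = np.array([-1, -1, -1, 1, 1])
P2 = np.array([-2.7, -1.3, 0.2, 1.3, 1])

#Map real-valued predictions to class labels: P2[i]>0 -> 1, otherwise -1
P2_labels = np.where(P2 > 0, 1, -1)

#Fraction of correctly classified examples, prints 0.8
print(np.mean(P2_labels == Y))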

Next, we compute the area under ROC curve.

from rlscore.measure import auc

#My class labels, two examples in the positive class and three in the negative
Y = [-1, -1, -1, 1, 1]

#Predict all ties
P = [1, 1, 1, 1, 1]

print("My auc with all ties %f" %auc(Y,P))

#Use Y for prediction
print("My auc with using Y as P is %f" %auc(Y,Y))

#Perfect predictions: AUC is a ranking measure, so all that matters
#is that positive instances get higher predictions than negatives
P2 = [-5, 2, -1, 4, 3.2]

print("My auc with correctly ranked predictions is %f" %auc(Y,P2))

#Let's make the predictions worse

P2 = [-5, 2, -1, 1, 3.2]

print("Now my auc dropped to %f" %auc(Y,P2))

#AUC is undefined if all instances belong to the same class, let's crash auc

Y2 = [1, 1, 1, 1, 1]
#this will not work
auc(Y2, P2)
My auc with all ties 0.500000
My auc with using Y as P is 1.000000
My auc with correctly ranked predictions is 1.000000
Now my auc dropped to 0.833333

Traceback (most recent call last):
  File "measure2.py", line 30, in <module>
    auc(Y2, P2)
...
rlscore.measure.measure_utilities.UndefinedPerformance: 'AUC undefined if both classes not present'

Everything works as one would expect until we pass a Y full of ones to auc. UndefinedPerformance is raised, because AUC is not defined for problems where only one class is present in the true class labels.

Finally, we test cindex, a pairwise ranking measure that computes the fraction of pairs with Y[i] > Y[j] for which also P[i] > P[j]. The measure is a generalization of the AUC.
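
As a sanity check, this pairwise definition can be written out directly. The following brute-force sketch is not RLScore code, and it assumes that tied predictions count as half a concordant pair; it reproduces the values printed by the tutorial script below.

def naive_cindex(Y, P):
    #Fraction of pairs with Y[i] > Y[j] whose order is reproduced by P
    num, den = 0.0, 0.0
    for i in range(len(Y)):
        for j in range(len(Y)):
            if Y[i] > Y[j]:
                den += 1.0
                if P[i] > P[j]:
                    num += 1.0
                elif P[i] == P[j]:
                    num += 0.5
    return num / den

#5 of the 6 positive-negative pairs are ordered correctly, so this prints 5/6
print(naive_cindex([-1, -1, -1, 1, 1], [-5, 2, -1, 1, 3.2]))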

import numpy as np

from rlscore.measure import cindex

#Concordance index is a pairwise ranking measure

#Equivalent to AUC for bi-partite ranking problems
Y = [-1, -1, -1, 1, 1]
P = [-5, 2, -1, 1, 3.2]

cind1 = cindex(Y, P)

print("My cindex is %f" %cind1)

#Can also handle real-valued Y-values

Y2 = [-2.2, -1.3, -0.2, 0.5, 1.1]
#Almost correct ranking, but last two inverted
P2 = [-2.7, -1.1, 0.3, 0.6, 0.5]

cind2 = cindex(Y2, P2)

print("My cindex is %f" %cind2)

#Most performance measures take the average over the columns for multi-target problems:

Y_big = np.vstack((Y, Y2)).T
P_big = np.vstack((P, P2)).T
print(Y_big)
print(P_big)
print("(cind1+cind2)/2 %f" %((cind1+cind2)/2.))
print("is the same as cindex(Y_big, P_big) %f" %cindex(Y_big, P_big))
My cindex is 0.833333
My cindex is 0.900000
[[-1.  -2.2]
 [-1.  -1.3]
 [-1.  -0.2]
 [ 1.   0.5]
 [ 1.   1.1]]
[[-5.  -2.7]
 [ 2.  -1.1]
 [-1.   0.3]
 [ 1.   0.6]
 [ 3.2  0.5]]
(cind1+cind2)/2 0.866667
is the same as cindex(Y_big, P_big) 0.866667

We also observe that when given Y and P with multiple columns, the performance measure is computed separately for each column and then averaged. This is what happens when a performance measure is used for parameter selection in cross-validation with multi-output prediction problems: the chosen parameter is the one that leads to the best mean performance over all the targets.
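
As a quick check of this behaviour, one can continue from the script above and compute the measure one column at a time (an illustrative snippet that assumes Y_big, P_big, cindex and numpy are still in scope):

#Compute cindex separately for each target column and average the results
col_performances = [cindex(Y_big[:, i], P_big[:, i]) for i in range(Y_big.shape[1])]

#Same mean performance as cindex(Y_big, P_big)
print(np.mean(col_performances))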

Tutorial 2: Multi-class accuracy

RLScore contains some tools for converting multi-class learning problems to several independent binary classification problems, and for converting vector valued multi-target predictions back to multi-class predictions.

import numpy as np

from rlscore.utilities import multiclass
from rlscore.measure import ova_accuracy

Y = [0,0,1,1,2,2]
Y_ova = multiclass.to_one_vs_all(Y)

P_ova = [[1, 0, 0], [1.2,0.5, 0], [0, 1, -1], [1, 1.2, 0.5], [0.2, -1, -1], [0.3, -1, -2]]
acc = ova_accuracy(Y_ova, P_ova)
print("ova-mapped Y")
print(Y_ova)
print("P, class prediction is chosen with argmax")
print(P_ova)
print("Accuracy computed with one-vs-all mapped labels and predictions: %f" %acc)
print("original Y")
print(Y)
print("P mapped to class predictions")
P = multiclass.from_one_vs_all(P_ova)
print(P)
acc = np.mean(Y==P)
print("Accuracy is the same:%f " %acc)
ova-mapped Y
[[ 1. -1. -1.]
 [ 1. -1. -1.]
 [-1.  1. -1.]
 [-1.  1. -1.]
 [-1. -1.  1.]
 [-1. -1.  1.]]
P, class prediction is chosen with argmax
[[1, 0, 0], [1.2, 0.5, 0], [0, 1, -1], [1, 1.2, 0.5], [0.2, -1, -1], [0.3, -1, -2]]
Accuracy computed with one-vs-all mapped labels and predictions: 0.666667
original Y
[0, 0, 1, 1, 2, 2]
P mapped to class predictions
[0 0 1 1 0 0]
Accuracy is the same:0.666667 

When doing multi-class learning, one should use the ova_accuracy function both for parameter selection and for computing the final performance.
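
For intuition, the two conversions roughly correspond to the following numpy operations. This is only an illustrative sketch, not the RLScore implementation, and it assumes that the class labels are the integers 0, ..., k-1.

import numpy as np

Y = np.array([0, 0, 1, 1, 2, 2])
P_ova = np.array([[1, 0, 0], [1.2, 0.5, 0], [0, 1, -1],
                  [1, 1.2, 0.5], [0.2, -1, -1], [0.3, -1, -2]])

#One-vs-all encoding: +1 in the column of the true class, -1 elsewhere
Y_ova = np.where(np.arange(3) == Y[:, None], 1., -1.)

#Decoding: the predicted class is the column with the largest prediction
P = np.argmax(P_ova, axis=1)

#Fraction of correct predictions, same 2/3 accuracy as ova_accuracy above
print(np.mean(Y == P))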