Agreement measures, such as Cohen's $\kappa$ (Cohen 1960) and the information agreement $\mathrm{IA}$ (Casagrande, Fabris, and Girometti 2020), gauge the agreement between two discrete classifiers by mapping their confusion matrices into real values.
# create a confusion matrix
M <- matrix(1:9, nrow = 3, ncol = 3)
print(M)
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
# rSignificativity implements some agreement measures
rSignificativity::cohen_kappa(M)
#> [1] -0.04166667
rSignificativity::scott_pi(M)
#> [1] -0.05633803
rSignificativity::IA(M)
#> [1] 0.005631984
When a gold-standard classifier is available, the best among a set of classifiers can be selected by picking the one whose confusion matrix with respect to the gold standard has the maximal agreement value.
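For instance, the following sketch selects, among two candidates, the classifier whose confusion matrix with respect to the gold standard maximises Cohen's $\kappa$; the matrices below are hypothetical and serve only as an illustration.
# hypothetical confusion matrices of two candidate classifiers
# with respect to the same gold standard
M1 <- matrix(c(40, 5, 10, 45), nrow = 2, ncol = 2)
M2 <- matrix(c(30, 20, 15, 35), nrow = 2, ncol = 2)
# pick the candidate with the maximal agreement value
kappas <- c(rSignificativity::cohen_kappa(M1),
            rSignificativity::cohen_kappa(M2))
which.max(kappas)  # M1 attains the larger kappa-value (0.7 vs 0.3)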
Agreement values alone, however, lack an indication of their meaningfulness: whether two classifiers significantly agree on the considered dataset cannot be established by simply looking at the corresponding agreement value.
Let $\mathcal{M}_{n,m}$ be the set of $n \times n$-confusion matrices built over a dataset having size $m$, i.e., the elements of any matrix in $\mathcal{M}_{n,m}$ sum up to $m$. For any agreement measure $\sigma$, the $\sigma$-significativity of $c$ in $\mathcal{M}_{n,m}$ is $$ \varrho_{\sigma, n, m}(c) \stackrel{\tiny\textrm{def}}{=} \frac{\left|\left\{M \in \mathcal{M}_{n,m} \, |\, \sigma(M) < c\right\}\right|}{\left|\mathcal{M}_{n,m}\right|}. $$ The $\sigma$-significativity is the $[0,1]$-normalised number of matrices in $\mathcal{M}_{n,m}$ having a $\sigma$-value smaller than $c$. From a different point of view, it corresponds to the probability of choosing by chance a matrix whose $\sigma$-value is smaller than $c$.
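For small $n$ and $m$, the definition can be checked by brute force. The following sketch (not the package's implementation) enumerates every matrix in $\mathcal{M}_{2,m}$ and counts those whose $\sigma$-value is smaller than $c$; how degenerate matrices (e.g., those with an empty row or column) are treated may differ from the package's conventions.
# a brute-force check of the definition for n = 2 (a sketch, not the
# package's implementation)
exact_significativity_2x2 <- function(sigma, c, m) {
  hits <- 0
  total <- 0
  for (a in 0:m) for (b in 0:(m - a)) for (d in 0:(m - a - b)) {
    M <- matrix(c(a, b, d, m - a - b - d), nrow = 2)
    total <- total + 1
    # isTRUE() guards against NaN sigma-values on degenerate matrices
    if (isTRUE(sigma(M) < c)) hits <- hits + 1
  }
  hits / total
}
exact_significativity_2x2(rSignificativity::cohen_kappa, c = 0.5, m = 5)
Up to the handling of degenerate matrices, this should agree with the exact value computed by significativity() below.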
Since $\left|\mathcal{M}_{n,m}\right| \in \Theta\left(m^{n^2-1}\right)$, exactly computing $\varrho_{\sigma, n, m}(c)$ takes time $\Omega\left(m^{n^2-1}\right)$. Nevertheless, the Monte Carlo method (Metropolis and Ulam 1949) can estimate $\varrho_{\sigma, n, m}(c)$ with $N$ samples in time $\Theta(N)$ with an error proportional to $1/\sqrt{N}$ (e.g., see (MacKay 1998)).
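A minimal sketch of such an estimator, assuming matrices are drawn uniformly at random from $\mathcal{M}_{n,m}$ via stars and bars, could look as follows; this is an illustration of the technique, not the package's actual implementation.
# a minimal Monte Carlo sketch: draw N matrices uniformly from M_{n,m}
# and count those with a sigma-value smaller than c
mc_significativity <- function(sigma, c, n, m, N = 10000) {
  k <- n^2
  hits <- 0
  for (i in seq_len(N)) {
    # a uniformly random weak composition of m into k parts (stars and bars)
    bars <- sort(sample.int(m + k - 1, k - 1))
    M <- matrix(diff(c(0, bars, m + k)) - 1, nrow = n)
    # isTRUE() guards against NaN sigma-values on degenerate matrices
    if (isTRUE(sigma(M) < c)) hits <- hits + 1
  }
  hits / N
}
mc_significativity(rSignificativity::cohen_kappa, c = 0.5, n = 2, m = 100)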
The function significativity() lets us both exactly compute $\varrho_{\sigma, n, m}(c)$ and estimate it by using the Monte Carlo method.
library("rSignificativity")
# evaluate kappa-significativity of 0.5 in M_{2,100} with 10000 samples
significativity(cohen_kappa, c = 0.5, n = 2, m = 100)
#> [1] 0.888
# evaluate kappa-significativity of 0.5 in M_{2,100} with 1000 samples
significativity(cohen_kappa, c = 0.5, n = 2, m = 100, number_of_samples = 1000)
#> [1] 0.897
# evaluate kappa-significativity of 0.5 in M_{2,5} with 1000 samples
significativity(cohen_kappa, c = 0.5, n = 2, m = 5, number_of_samples = 1000)
#> [1] 0.799
# exactly compute kappa-significativity of 0.5 in M_{2,5}
significativity(cohen_kappa, c = 0.5, n = 2, m = 5, number_of_samples = NULL)
#> [1] 0.7857143
The $\sigma$-significativity of $c$ in $\mathcal{P}_{n}$, where $\mathcal{P}_{n}$ is the set of all the $n \times n$-probability matrices, is $$ \rho_{\sigma, n}(c) \stackrel{\tiny\textrm{def}}{=} \frac{V\left(\left\{M \in \mathcal{P}_{n} \, |\, \sigma(M) < c\right\}\right)}{V(\Delta^{(n^2-1)})} $$ where $\Delta^{(n^2-1)}$ is the $(n^2-1)$-dimensional probability simplex and $V$ is the $(n^2-1)$-dimensional Lebesgue measure (Casagrande et al. 2025).
If $\sigma$ is definable in an o-minimal theory (van den Dries 1998), as in the case of Cohen's $\kappa$ and $\mathrm{IA}$, then the set $\left\{M \in \mathcal{P}_{n} \, |\, \sigma(M) < c\right\}$ is measurable and $\rho_{\sigma, n}(c)$ is well defined. We can estimate $\rho_{\sigma, n}(c)$ by using the Monte Carlo method with $N$ samples in time $\Theta(N)$.
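As an illustration (not necessarily the package's implementation, and assuming the agreement measure accepts probability matrices), a matrix uniformly distributed over $\Delta^{(n^2-1)}$ can be obtained by normalising i.i.d. Exp(1) draws, which yields a minimal Monte Carlo estimator for $\rho_{\sigma, n}(c)$.
# a minimal sketch of estimating rho_{sigma,n}(c); normalised i.i.d.
# Exp(1) draws are uniformly distributed over the probability simplex
mc_simplex_significativity <- function(sigma, c, n, N = 10000) {
  hits <- 0
  for (i in seq_len(N)) {
    x <- rexp(n^2)
    M <- matrix(x / sum(x), nrow = n)
    # isTRUE() guards against NaN sigma-values on degenerate matrices
    if (isTRUE(sigma(M) < c)) hits <- hits + 1
  }
  hits / N
}
mc_simplex_significativity(rSignificativity::cohen_kappa, c = 0.5, n = 2)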
The function significativity() also implements this algorithm.
# evaluate kappa-significativity of 0.5 in P_{2} with 1000 samples
significativity(cohen_kappa, 0.5, n = 2, number_of_samples = 1000)
#> [1] 0.895
# evaluate kappa-significativity of 0.5 in P_{2} with 10000 samples
significativity(cohen_kappa, 0.5, n = 2)
#> [1] 0.8979
# successive calls to the Monte Carlo method may produce different results
significativity(cohen_kappa, 0.5, n = 2)
#> [1] 0.8961
# setting the random seed before the call guarantees repeatability
set.seed(1)
significativity(cohen_kappa, 0.5, n = 2)
#> [1] 0.9012
set.seed(1)
significativity(cohen_kappa, 0.5, n = 2)
#> [1] 0.9012