In this section, several examples illustrate a typical workflow of kerntools.
A simple example
Let’s suppose we are working with the well-known iris dataset. This dataset contains sepal and petal measurements for N = 150 iris flowers. These flowers belong to three different species: Iris setosa, Iris versicolor, and Iris virginica. There are D = 4 numeric variables (also called “features”) that are measured: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.
We first standardize these four variables so they have mean 0 and standard deviation 1. This puts them on a common scale. Then, we compute our first kernel: the linear kernel.
iris_std <- scale(iris[,c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")])
KL <- Linear(iris_std)
dim(KL)
#> [1] 150 150
class(KL)
#> [1] "matrix" "array"
The linear kernel is simply the pairwise inner product between all N samples (in our case: flower measurements). The result is “stored” in a matrix (the kernel matrix) of dimensions N x N. Furthermore, it is always symmetric and positive semi-definite (PSD). To check the kernel between two samples (for instance, 32 and 106), we can type either KL[32,106] or KL[106,32]. It should be noted that kernel matrices generated by kerntools have class “matrix”, “array”. In fact, to keep things simple, no function in this package requires any special classes/objects not present in base R.
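As an optional sanity check, these properties can be verified directly in base R. The comparison with tcrossprod() below is a sketch that assumes Linear() computes the plain, unnormalized inner product, as described above:

## Optional checks: pairwise inner products, symmetry, and the 32-106 entry
all.equal(KL, tcrossprod(iris_std), check.attributes = FALSE)
isSymmetric(KL)
KL[32,106] == KL[106,32]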
Next, we examine this kernel matrix further. Although it is not excessively large, it is large enough to make simply typing KL in R impractical. Instead, we may summarize its values using a histogram:
histK(KL, vn = TRUE)
This “almost-Gaussian” shape is due to the features’ scaling. Further parameters, including color, can be passed to histK() (for more info, check the documentation of graphics::hist()). The von Neumann entropy shown here is optional (in fact, this value can be computed separately with vonNeumann(KL)). Entropy values close to 0 indicate that all kernel matrix elements are very similar, while values close to 1 indicate a high variability.
Another possibility is to visualize the whole kernel matrix with a heatmap:
heatK(KL,cos.norm = FALSE)
Here, yellow denotes a great similarity between the samples, while red denotes that the similarity is low (the colour palette is customizable via the parameter color). At a glance we see that the first 50 samples (I. setosa) have a higher intra-group similarity. Also, they are very different from samples 101-150 (which correspond to I. virginica). I. versicolor, instead, is somewhat intermediate between these two groups.
To confirm our intuitions about the (dis)similarity among the three species, we may proceed to a widespread ordination method: Principal Components Analysis (PCA). PCAs can be computed from kernel matrices very easily. In fact, using kernel matrices expands what a PCA can do, but this will be discussed in further sections. To display a beautiful PCA plot that colors the samples by species, we do:
iris_species <- factor(iris$Species)
kpca <- kPCA(KL, plot = 1:2, y = iris_species)
kpca$plot
Indeed, we can see that I. setosa and I. virginica are the most different groups, while I. versicolor and I. virginica are very close. The colors can be changed if desired with the kPCA(...,colors) parameter.
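For instance, a possible call could look like the sketch below. We assume here that colors accepts one colour per class level; the exact format expected should be double-checked in the kPCA() documentation:

kpca_recolored <- kPCA(KL, plot = 1:2, y = iris_species,
                       colors = c("purple", "orange", "forestgreen"))  # hypothetical palette
kpca_recolored$plot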
After seeing this plot, we can infer that a predictive model built from these data will work well. Although we have a ton of machine learning methods at our disposal, in this vignette we will stick with kernel methods. More specifically, we will use the most famous kernel method: the Support Vector Machine (SVM).
SVMs are not implemented in kerntools. However, they are in other R packages like kernlab or e1071. Here we will use the ksvm() function of the former package:
library(kernlab)
set.seed(777)

## Training data
test_idx <- sample(1:150)[1:30] # 20% of samples
train_idx <- (1:150)[-test_idx]
KL_train <- KL[train_idx,train_idx]

## Model (training data)
linear_model <- kernlab::ksvm(x=KL_train, y=iris_species[train_idx], kernel="matrix")
linear_model
#> Support Vector Machine object of class "ksvm" 
#> 
#> SV type: C-svc  (classification) 
#>  parameter : cost C = 1 
#> 
#> [1] " Kernel matrix used as input."
#> 
#> Number of Support Vectors : 27 
#> 
#> Objective Function Value : -0.9459 -0.3184 -14.3709 
#> Training error : 0.025
First and foremost: in prediction models, it is mandatory to have an additional test set so an honest estimation of the model’s performance can be computed (we will do this later). Also, please note that in a real-world machine learning setting, training data should be preprocessed first and then the same exact preprocessing should be applied to the test data. In our case, the only preprocessing was standardizing the dataset: thus, the mean and standard deviation should have been computed from the training data, and then these values should have been used to standardize both the training and test sets. That being said, in order to not interrupt the flow of this vignette, we will leave things as they are.
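For reference, the proper order of operations would look roughly like this (a minimal sketch re-using the train_idx/test_idx defined above; it is not actually run in the rest of this vignette):

train_raw <- as.matrix(iris[train_idx, 1:4])
test_raw  <- as.matrix(iris[test_idx, 1:4])
train_std <- scale(train_raw)  # center and scale estimated on training data only
test_std  <- scale(test_raw,
                   center = attr(train_std, "scaled:center"),
                   scale  = attr(train_std, "scaled:scale"))
## train_std and test_std would then replace iris_std when computing the kernel matrices.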
Now, returning to our (questionably obtained) model, we have a very low training error. The support vectors (which are the only samples that are relevant for us, as the rest are not used to define the SVM discriminating hyperplane) constitute only about 22% of the samples. Not bad.
Before jumping to the test set, we may be interested in another topic: feature importance. This means studying which variables are considered more important by the model when discriminating between classes. Feature importance matters for avoiding “black-box models”: prediction models that we know work well, but not why.
Obtaining the importances out of an SVM model can be somewhat convoluted (this is discussed later in more depth) and sometimes downright impossible. In our particular case, the only problem is that we wanted to classify 3 classes (species)… but SVM classifiers are binary. For discriminating 3 classes, kernlab in fact builds 3 classifiers: “setosa vs versicolor”, “setosa vs virginica”, and “versicolor vs virginica”. These 3 classifiers constitute the linear_model object, and the prediction of the class of a sample is done by a voting scheme. To simplify things, for the feature importance part, we will focus only on the third classifier: “versicolor vs virginica”, which, as we have seen previously, are the two most related species. The way to go here is to obtain the indices of the support vectors in our model, and their coefficients. All that is gracefully provided by kernlab. Then, we will return to kerntools and call the function svm_imp():
## Third model: Versicolor vs virginica
sv_index <- kernlab::alphaindex(linear_model)[[3]] # Vector with the SV indices
sv_coef <- kernlab::coef(linear_model)[[3]] # Vector with the SV coefficients
feat_imp3 <- svm_imp(X=iris_std[train_idx,], svindx=sv_index, coeff=sv_coef)
#> Do not use this function if the SVM model was created with the RBF,
#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels
Note that here we need the data that we used to compute KL: iris_std. It is very important to use this version of the dataset, and not any other version with more pre-processing (or the “original” without pre-processing). svm_imp() has parameters like center, scale and cos.norm to take these widespread normalization techniques into account, but it is better to play safe.
Conveniently, kerntools provides a function to visualize the features’ importance:
plotImp(feat_imp3, leftmargin = 7, main="3rd model: versicolor vs virginica", color="steelblue1")
#> $first_features
#> [1] "Petal.Length" "Petal.Width"  "Sepal.Width"  "Sepal.Length"
#> 
#> $cumsum
#> [1] 1
#> 
#> $barplot
#>      [,1]
#> [1,]  0.7
#> [2,]  1.9
#> [3,]  3.1
#> [4,]  4.3
As we can see, the model considers the petals more discriminating than the sepals and, among the petal measures, the petal length the most.
If we return to the PCA, we could also check the weight of each variable in the first and second PCs (that is, the ones we displayed). To do so, the kPCA_imp() function comes in handy. Again, this function requires the dataset that generated the KL matrix:
loadings <- kPCA_imp(iris_std)
#> Do not use this function if the PCA was created with the RBF,
#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels
pcs <- loadings$loadings[1:2,]
pcs
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width
#> PC1    0.5210659  -0.2693474   0.58041310  0.56485654
#> PC2    0.3774176   0.9232957   0.02449161  0.06694199
It seems that the first component is dominated by Petal Length and Petal Width, but also by Sepal Length. The second component, which plays a role in discriminating I. versicolor and I. virginica, is dominated by Sepal Width. The PCA disagrees a bit with the SVM feature importances, but remember that in the latter we focused only on the “versicolor vs virginica” problem, while in the former we are looking at the ordination of the three classes. We may represent the contributions of the four features to the 1st PC and, to make things easier, we will include the 2nd PC in the barplot:
plotImp(pcs[1,], y=pcs[2,], ylegend="PC2",absolute=FALSE, main="PC1", leftmargin = 7, color="rosybrown1")
#> $first_features
#> [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width" 
#> 
#> $cumsum
#> [1] 1
#> 
#> $barplot
#>      [,1]
#> [1,]  0.7
#> [2,]  1.9
#> [3,]  3.1
#> [4,]  4.3
We used absolute=FALSE because the contribution of each variable to the PC is relevant not only in magnitude, but also in sign. Pink bars represent PC1, while the black line represents PC2 (parameter y). As we only wanted to see the order and the relative magnitude, the X axis shows the relative contribution (in the plotImp() function, relative=TRUE by default).
With kerntools, we can draw the contributions on the PCA plot as arrows. We only need the PCA plot (given by kPCA()) and the contributions (given by kPCA_imp()):
kPCA_arrows(plot=kpca$plot,contributions=t(pcs),colour="grey15")
(Note that the arrows are scaled to match the original PCA plot. They are somewhat orientative: their directions are correct, and longer arrows represent a greater contribution to a PC than shorter arrows; however, the arrows’ magnitudes usually do not coincide with the actual magnitudes that can be computed from kPCA_imp().)
And now, finally, we are going to check the performance on the test set (considering the 3 classes). To do so, we subset KL to obtain a suitable test x training matrix, input this matrix into our linear_model, and compare the predicted species with the actual species:
KL_test <- as.kernelMatrix(KL[test_idx,train_idx])

## Prediction (test data)
pred_class <- predict(linear_model,KL_test)
actual_class <- iris_species[test_idx]
pred_class
#>  [1] versicolor versicolor virginica  virginica  versicolor virginica 
#>  [7] virginica  virginica  versicolor versicolor versicolor versicolor
#> [13] versicolor versicolor versicolor versicolor virginica  versicolor
#> [19] versicolor versicolor virginica  versicolor virginica  versicolor
#> [25] versicolor virginica  versicolor versicolor versicolor versicolor
#> Levels: setosa versicolor virginica
actual_class
#>  [1] setosa     setosa     versicolor versicolor setosa     setosa    
#>  [7] setosa     versicolor setosa     setosa     versicolor virginica 
#> [13] virginica  virginica  virginica  virginica  setosa     versicolor
#> [19] setosa     setosa     setosa     setosa     versicolor setosa    
#> [25] virginica  versicolor virginica  virginica  virginica  versicolor
#> Levels: setosa versicolor virginica
Mmmm… maybe we were a bit overconfident. It seems that the model has ignored I. setosa completely.
We can compute numerically how “good” (or wrong) our model is according to different performance measures implemented in kerntools. When dealing with classification, all of them need a contingency table that contrasts the actual and the predicted classes (also known as a “confusion matrix”). The simplest measure is the accuracy: the number of right predictions divided by the total number of predictions (which is the test set size: in our case, 30).
ct <- table(actual_class,pred_class) # Confusion matrix
ct
#>             pred_class
#> actual_class setosa versicolor virginica
#>   setosa          0          9         4
#>   versicolor      0          3         5
#>   virginica       0          9         0
Acc(ct) ## Accuracy
#> [1] 0.1
Acc_rnd(actual_class) ## Accuracy of the random model
#> [1] 0.3488889
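As a quick sanity check, the accuracy definition above can be reproduced by hand from the confusion matrix (correct predictions lie on the diagonal):

sum(diag(ct)) / sum(ct)   # correct predictions over total; should match Acc(ct), i.e. 0.1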
As expected, our accuracy is overwhelmingly low. We can compare the result with the accuracy of the random model (according to the class distribution on the test set): 0.35, which means that our model performs a lot worse than someone classifying at random.
We can explore other measures to infer what the problems with our prediction model are (for instance, whether a species is systematically misclassified, etc.). For this example, we can compute these measures by class:
Prec(ct,multi.class = "none") ## Precision or Positive Predictive Value
#>     setosa versicolor  virginica 
#>  0.0000000  0.1428571  0.0000000
Rec(ct,multi.class = "none") ## Recall or True Positive Rate
#>     setosa versicolor  virginica 
#>      0.000      0.375      0.000
Spe(ct,multi.class = "none") ## Specificity or True Negative Rate
#>     setosa versicolor  virginica 
#>  1.0000000  1.0000000  0.5714286
F1(ct,multi.class = "none") ## F1 (harmonic mean of Precision and Recall)
#>     setosa versicolor  virginica 
#>  0.0000000  0.2068966  0.0000000
(In case we want the overall performance measure, we can compute the mean of the three classes, or type multi.class="macro".)
The precision measure tells us that none of the samples predicted to be “virginica” or “setosa” are correct (in the case of “setosa”, because none was predicted), and only some (1/7) of those predicted to be “versicolor” were right. The recall shows that only 3/8 “versicolor” samples in the test set were correctly classified as “versicolor”, while none of the “setosa” or “virginica” samples were. F1 is useful because it gives a “mean” of Precision and Recall. Meanwhile, the low specificity of “versicolor” indicates that a lot of samples that were not “versicolor” were predicted as such.
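A small sketch: the per-class precision and recall can also be recovered by hand from ct (rows are the actual classes, columns the predicted ones):

diag(ct) / colSums(ct)   # precision; NaN for "setosa" (never predicted), which Prec() reports as 0
diag(ct) / rowSums(ct)   # recall: 0, 3/8 and 0, as shown above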
A (slightly) more complicated example
In the previous section we naively picked the first model we could train, with not-so-great results. Here, we are going to complicate things a bit, hoping to obtain a model that works.
(Note: of course, iris flower classification is a simple task. In fact, it can be achieved pretty decently with the linear kernel, as can be deduced from the previous PCA: a linear classifier is enough to discriminate the flowers using only the 1st and 2nd PCs. However, for the sake of the example, we will use a different kernel in the present section.)
The radial basis function (RBF) kernel is something like the “gold standard” among kernels. Unlike the linear kernel (which is the simplest or most “plain” kernel), it is nonlinear: in fact, the RBF kernel is a universal approximator. We can compute it with:
Krbf <- RBF(iris_std, g=0.25)
histK(Krbf, col="aquamarine", vn = TRUE)
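(For reference, a minimal sketch of what this kernel does, under the assumption that RBF() implements the usual Gaussian kernel exp(-g * squared Euclidean distance); if that holds, the value below should be close to zero:)

D2 <- as.matrix(dist(iris_std))^2   # squared Euclidean distances between samples
max(abs(exp(-0.25 * D2) - Krbf))    # expected to be ~0 if the assumption holds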
Kernel matrix values now lie between 0 and 1. The linear kernel required only our dataset, but RBF() has a (hyper)parameter called gamma (g for short). The value of this hyperparameter must be decided by us, which is an important decision, because it will affect the decision boundary of any model built on this kernel. Fortunately, some heuristics to estimate a good gamma exist. kerntools implements three of them, which are available in the function estimate_gamma():
estimate_gamma(iris_std)
#> $d_criterion
#> [1] 0.25
#> 
#> $scale_criterion
#> [1] 0.2512584
#> 
#> $quantiles_criterion
#>        90%        50%        10% 
#> 0.05570343 0.16670322 1.73533468
In the previous histogram we visualized the RBF kernel with the gamma given by the “d_criterion” (and almost the one given by the “scale criterion”). The third heuristic gives us a distribution of “good” gamma values. Now, for the sake of comparison, we will compute the RBF kernel using the median of the “quantiles_criterion”:
Krbf2 <- RBF(iris_std, g=0.1667)
histK(Krbf2, col="darkseagreen1", vn=TRUE)
Not only the histogram changes: the von Neumann entropy changes as well. It is important to remark that the RBF kernel is very sensitive to the gamma value. The higher entropy with respect to that of the linear kernel reflects that, here, we have a higher variability in the kernel matrix values. (That can also be deduced by comparing both histograms. Conversely, if we do heatK(Krbf), we will observe more extreme values/colors than before.) This paper recommends an entropy between 0.3 and 0.5, so maybe this will be reflected in the SVM model’s performance?
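To see this at a glance, the three entropies can be computed side by side with the vonNeumann() function mentioned earlier:

vonNeumann(KL)      # linear kernel
vonNeumann(Krbf)    # RBF kernel, gamma = 0.25
vonNeumann(Krbf2)   # RBF kernel, gamma = 0.1667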
Now, we can also do a kernel PCA. Our previous kernel PCA used a linear kernel so, in reality, it was identical to a “normal” PCA. This time, however, we are using a different kernel, so we can actually say that this is a kernel PCA. The main difference is that the projection of samples is not going to be linear. Sometimes, this creates strange patterns that are difficult to interpret.
As later we are going to train an SVM model, it may occur to us that it would be great to do a PCA only with the training samples, so we can compare the prediction model with the PCA side by side. To do so, we will use the same training indices as in the previous section. Even better: what if we compute a (kernel) PCA with the training samples, and then project the test samples over them?
Krbf_train <- Krbf2[train_idx,train_idx]
Krbf_test <- Krbf2[test_idx,train_idx]
rbf_kpca <- kPCA(K=Krbf_train, Ktest=Krbf_test, plot = 1:2, y = iris_species[train_idx], title = "RBF PCA")
rbf_kpca$plot
(Note: remember that, in a real-world problem, the standardization of the dataset should have been done with the center and standard deviation of the training set.)
Said and done! However, now the patterns on this kernel PCA are a bit… radial. Still, I. setosa is again on one side, and I. versicolor and I. virginica on the other. The red, green and blue samples are the training samples, while the grey samples correspond to the test samples we projected a posteriori (any other color can be specified via the kPCA() parameter na_col).
What now? As we have generated more than one kernel matrix from the same data (thanks to the linear and the RBF kernels), we may compare these matrices. To do so, we can use a kerntools function called simK():
simK(list(linear=KL, rbf_0.25=Krbf, rbf_0.166=Krbf2))
#> Remember that Klist should contain only kernel matrices (i.e. squared, symmetric and PSD).
#> This function does NOT verify the symmetry and PSD criteria.
#>              linear  rbf_0.25 rbf_0.166
#> linear    1.0000000 0.5208955 0.4803192
#> rbf_0.25  0.5208955 1.0000000 0.9898203
#> rbf_0.166 0.4803192 0.9898203 1.0000000
simK() will first remind us that a matrix should have several mathematical properties to be a kernel matrix. When we work with kernel matrices generated by kerntools (that is: Linear(), RBF(), etc.), everything will be alright. However, you can come to kerntools with your own precomputed kernel matrices (as long as they have class “matrix”, “array”). kerntools implicitly trusts that the user knows what they are doing, so remember to use proper kernel matrices.
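If in doubt about a precomputed matrix, two quick base-R checks (a sketch; the PSD check only holds up to numerical error) are:

K_user <- KL   # stand-in for a user-provided, precomputed matrix
isSymmetric(K_user)
min(eigen(K_user, symmetric = TRUE, only.values = TRUE)$values) >= -1e-8  # PSD check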
simK() returns a score between 0 and 1: 1 is complete similarity, and 0 is complete dissimilarity. We can see that the two RBF matrices are very similar, while the linear kernel matrix is around 50% similar to the RBF matrices.
We could also compare the two PCAs. One option for doing so is computing the RV coefficient (Co-Inertia Analysis). However, the RV coefficient between rbf_kpca and kpca should give the same result as simK(list(KL,Krbf)). It should be noted that this equivalence only holds if the dataset is centered beforehand, as PCAs are usually computed on centered data (for that reason we have kPCA(..., center=TRUE) by default). If a kernel matrix was obtained from non-centered data, it can be centered afterwards with centerK() (more on this in later sections).
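For instance, a minimal sketch of that last point (centering, after the fact, a kernel matrix computed from raw, non-centered data):

KL_raw      <- Linear(as.matrix(iris[, 1:4]))  # kernel from the non-centered, non-scaled data
KL_centered <- centerK(KL_raw)                 # for the linear kernel, equivalent to centering the data beforehand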
Another way to compare two PCAs is the Procrustes Analysis. This analysis computes the correlation between two projections after “removing” the translation, scale and rotation effects. Although it is not properly a kernel method, kerntools can do a basic Procrustes Analysis. In our data, we have a moderate Procrustes correlation: 0.68 (the correlation coefficient is bounded between 0 and 1).
rbf_kpca <- kPCA(K=Krbf)
proc <- Procrustes(kpca$projection, rbf_kpca)
proc$pro.cor # Procrustes correlation
#> [1] 0.6862007
(This is as good a moment as any to show that kPCA() can return the kernel PCA projection without displaying any plot. In that case, all graphical parameters like colors, labels, etc. are ignored.)
With all of this, we will train a brand new (and hopefully better) model. We will re-use the same training and test samples:
####### Model (training data)
model <- kernlab::ksvm(x=Krbf_train, y=iris_species[train_idx], kernel="matrix", C=10)
model
#> Support Vector Machine object of class "ksvm" 
#> 
#> SV type: C-svc  (classification) 
#>  parameter : cost C = 10 
#> 
#> [1] " Kernel matrix used as input."
#> 
#> Number of Support Vectors : 30 
#> 
#> Objective Function Value : -4.1707 -2.7089 -92.348 
#> Training error : 0.008333
A very low training error, but now we are wiser. What about the performance on the test set?
Krbf_test <- as.kernelMatrix(Krbf_test)

####### Prediction (test data)
pred_class <- predict(model, Krbf_test)
actual_class <- iris_species[test_idx]
ct <- table(actual_class, pred_class) # Confusion matrix
Acc(ct) ## Accuracy
#> [1] 0.5
Wow, that seems a lot better! However, before we get excited, we must remember that this is a point estimate of the accuracy, which comes from a specific test set (the 30 samples we chose randomly in the previous section). Another test set will surely deliver a different accuracy. What if we computed a confidence interval (CI) to get an idea of how other test sets would behave?
kerntools provides two ways to obtain a CI: the Normal Approximation, and bootstrapping. The normal approximation is quicker, while bootstrapping is usually considered to be safer (more details: here):
Normal_CI(value = 0.5, ntest = 30) ## Accuracy CI (95%)
#> [1] 0.3210806 0.6789194
Boots_CI(target = actual_class, pred = pred_class, nboots = 2000, index = "acc") ## Accuracy CI (95%)
#>                2.5%     97.5% 
#> 0.5016667 0.3333333 0.7000000
Both functions default to a 95% CI, but that can be changed via the confidence parameter. According to the normal approximation, the accuracy is 0.5 (0.32, 0.68), while according to the bootstrap strategy, it is 0.5 (0.33, 0.70). The CI is wide because the test set is very small (only 30 samples). However, with this test set and CI (at the 95% confidence level) we cannot really claim that this model is different from the random model, which had an accuracy of 0.35 (as we computed in the previous section). Useful reminder: next time, we should choose a larger test set.
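For example, a hedged sketch of changing the confidence level (we are assuming here that confidence is given as a percentage; please check both functions’ documentation for the exact format expected):

Normal_CI(value = 0.5, ntest = 30, confidence = 90)   # hypothetical 90% CI
Boots_CI(target = actual_class, pred = pred_class, nboots = 2000, index = "acc", confidence = 90)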
Before calling it a day, we are going to compute again the other performance measures. This time, we will not compute them class by class, but on average (the “macro” approach):
Prec(ct) ## Precision or Positive Predictive Value
#> It is identical to weighted Accuracy
#> [1] 0.4357143
Rec(ct) ## Recall or True Positive Rate
#> [1] 0.4807692
Spe(ct) ## Specificity or True Negative Rate
#> [1] 0.8085477
F1(ct) ## F1 (harmonic mean of Precision and Recall)
#> [1] 0.4484848
If we so desire, we can compute a CI for these values: for instance, to bootstrap the macro-F1 value, we could simply type index = "f1" in the Boots_CI() function. In any case, we should congratulate ourselves, because the performance is clearly higher than last time. Seriously training a machine learning model involves careful hyperparameter tuning (remember that C in ksvm()?), which we have almost completely skipped. That is: we should use a strategy like, say, grid search, and compare the performance measures of each hyperparameter combination via cross-validation, which is far beyond the purposes of this vignette (and kerntools).
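Just to give an idea of what that would involve, here is a minimal sketch (not part of kerntools) of a grid search over gamma and C with 5-fold cross-validation on the training samples, re-using RBF(), Acc() and kernlab::ksvm() exactly as they were used above; the gamma and C grids are arbitrary choices:

set.seed(777)
folds  <- sample(rep(1:5, length.out = length(train_idx)))   # 5-fold assignment
gammas <- c(0.05, 0.1667, 0.25)
costs  <- c(1, 10, 100)
cv_acc <- matrix(NA, length(gammas), length(costs),
                 dimnames = list(paste0("g=", gammas), paste0("C=", costs)))
for (i in seq_along(gammas)) {
  K <- RBF(iris_std[train_idx, ], g = gammas[i])             # kernel on training samples only
  for (j in seq_along(costs)) {
    accs <- sapply(1:5, function(f) {
      tr <- which(folds != f); te <- which(folds == f)
      m  <- kernlab::ksvm(x = K[tr, tr], y = iris_species[train_idx][tr],
                          kernel = "matrix", C = costs[j])
      pred <- predict(m, as.kernelMatrix(K[te, tr, drop = FALSE]))
      Acc(table(iris_species[train_idx][te], pred))          # accuracy on the held-out fold
    })
    cv_acc[i, j] <- mean(accs)
  }
}
cv_acc   # pick the gamma/C combination with the highest mean CV accuracy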
Finally, a warning about computing feature importances in SVM models and/or feature contributions to PCs in PCA. In a few words: don’t do that when using the RBF kernel. More precisely: of all the kernel functions implemented here, do not ever try to recover the feature contributions for RBF(), Laplace(), Jaccard(), Ruzicka(), BrayCurtis() and Kendall() unless you know very well what you are doing. If you type something like:
####### RBF versicolor vs virginica model:
sv_index <- kernlab::alphaindex(model)[[3]]
sv_coef <- kernlab::coef(model)[[3]]
svm_imp(X=iris_std[train_idx,], svindx=sv_index, coeff=sv_coef)
#> Do not use this function if the SVM model was created with the RBF,
#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>    0.8931422    0.7828191   12.3522663   10.0792776
something will be computed, but the result is wrong. Do not ignore the warning raised here. This is not right because, ultimately, all kernels behave like the linear kernel: they compute the inner product of some features. But what features? That is the question. Under the hood, all kernels “send” the original features (the features that live in the input space) to another space (usually higher-dimensional) called the feature space, and they compute the inner product there. The kernel conflates these two steps into one, which usually simplifies the calculations a lot and saves a lot of memory: this is called the “kernel trick”. But to compute the feature contributions analytically, we need to go to that feature space. To make things worse, the feature space that the kernel is implicitly using depends on things like the dimensionality of the input data, the kind of kernel, the specific value of its hyperparameters, etc. Going to this feature space is only trivial for the linear kernel: only then input space = feature space. Instead, for the RBF kernel, this feature space is infinite-dimensional. Some techniques to estimate it exist (see for instance: Explicit Approximations of the Gaussian Kernel), but they are not yet implemented in kerntools (and maybe they never will be).
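(Purely as an illustration of what such an explicit approximation looks like, here is a hedged sketch of random Fourier features. It assumes, as before, that RBF() computes exp(-g * squared Euclidean distance); nothing below is a kerntools feature:)

set.seed(123)
g  <- 0.25
Df <- 5000                                                  # number of random features
W  <- matrix(rnorm(ncol(iris_std) * Df, sd = sqrt(2 * g)), ncol = Df)
b  <- runif(Df, 0, 2 * pi)
Z  <- sqrt(2 / Df) * cos(sweep(iris_std %*% W, 2, b, "+"))  # explicit (approximate) feature space
K_approx <- tcrossprod(Z)                                   # approximates Krbf = RBF(iris_std, g = 0.25)
max(abs(K_approx - Krbf))                                   # approximation error; shrinks as Df grows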