In this section several examples will be used to illustrate a typicalworkflow of `kerntools`

.

### A simple example

Let’s suppose we are working with the well known irisdataset. This dataset contains sepal and petal measurements for*N = 150* iris flowers. These flowers belong to three differentspecies: *Iris setosa*, *Iris versicolor*, and *Irisvirginica*. There are *D = 4* numeric variables (also called“features”) that are measured: `Sepal.Length`

,`Sepal.Width`

, `Petal.Length`

and`Petal.Width`

.

We first standardize these four variables so they have mean 0 andstandard deviation 1. This places them on a common ground. Then, wecompute our first kernel: the linear kernel.

`iris_std <- scale(iris[,c( "Sepal.Length","Sepal.Width","Petal.Length", "Petal.Width")])KL <- Linear(iris_std)dim(KL)#> [1] 150 150class(KL)#> [1] "matrix" "array"`

The linear kernel is simply the pairwise inner product between all*N* samples (in our case: flower measurements). The result is“stored” on a matrix (the kernel matrix) that has dimensions*NxN*. Furthermore, it is always symmetric and positivesemi-definite (PSD). To check the kernel between two samples (forinstance, 32 and 106), we can type either `KL[32,106]`

or`KL[106,32]`

. It should be noted that kernel matricesgenerated by `kerntools`

have class “matrix”, “array”. Infact, to keep things simple, no function in this package requires anyspecial classes/objects not present in base R.

Next, we examine this kernel matrix further. Although it is notexcessively large, it is large enough to make simply typing`KL`

in R unpractical. Instead, we may summarize its valuesusing a histogram:

`histK(KL, vn = TRUE)`

This “almost-gaussian” shape is due to the features’ scaling. Furtherparameters, including color, can be passed to `histK()`

(formore info, check the documentation of `graphics::hist()`

).The von Neumann entropy shown here is optional (in fact, this value canbe computed separately doing `vonNeumann(KL)`

). Entropyvalues close to 0 indicate that all kernel matrix elements are verysimilar, while values close to 1 indicate a high variability.

Another possibility is to visualize the whole kernel matrix with aheatmap:

`heatK(KL,cos.norm = FALSE)`

Here, yellow denotes a great similarity between the samples, whilered denotes that the similarity is low (the colour palette iscustomizable via the parameter `color`

). At a glance we seethat the first 50 samples (*I. setosa*) have a higher intra-groupsimilarity. Also, they are very different from the samples 101-150(which correspond to *I. virginica*). Instead, *I.versicolor* is kind of intermediate between these two groups.

To confirm our intuitions about the (dis)similarity among the threespecies, we may proceed to a widespread ordination method: the PrincipalComponents Analysis (PCA). PCAs can be computed from kernel matricesvery easily. In fact, using kernel matrices *expands* what a PCAcan do, but this will be discussed in further sections. To display abeautiful PCA plot that colors the samples by species, we do:

`iris_species <- factor(iris$Species)kpca <- kPCA(KL,plot = 1:2,y = iris_species)kpca$plot`

Indeed, we can see that *I. setosa* and *I. virginica*are the most different groups, while *I. versicolor* and*I.virginica* are very close. The colors can be changed ifdesired with the `kPCA(...,colors)`

parameter.

After seeing this plot, we can infer that a predictive model fromthese data will work well. Although we have a ton of machine learningmethods at our disposal, in this vignette we will stick with the kernelmethods. More specifically, we will use the most famous kernel method:the Support Vector Machine (SVM).

SVMs are **not** implemented in `kerntools`

.However, they are in other R packages like `kernlab`

or`e1071`

. Here we will use the `ksvm()`

function ofthe former package:

`library(kernlab)set.seed(777)## Training datatest_idx <- sample(1:150)[1:30] # 20% of samplestrain_idx <- (1:150)[-test_idx]KL_train <- KL[train_idx,train_idx]## Model (training data)linear_model <- kernlab::ksvm(x=KL_train, y=iris_species[train_idx], kernel="matrix")linear_model#> Support Vector Machine object of class "ksvm" #> #> SV type: C-svc (classification) #> parameter : cost C = 1 #> #> [1] " Kernel matrix used as input."#> #> Number of Support Vectors : 27 #> #> Objective Function Value : -0.9459 -0.3184 -14.3709 #> Training error : 0.025`

First and foremost: in prediction models, it is mandatory to have anadditional test set so a honest estimation of the model’s performancecan be computed (we will do this latter). Also, please note that in areal-world machine learning setting, training data should have beenpreprocessed first and then the *same* exact preprocessing shouldhave been applied to test data. In our case, the only preprocessing wasstandardize the dataset: thus, the mean and standard deviation shouldhave been computed from training data, and then these values should havebeen used for standardize *both* the training and test sets. Thatbeing said, in order to not interrupt the flow of this vignette, we willuse leave things as they are.

Now returning to our (questionably obtained) model, we have a verylow training error. The support vectors (which are the only samples thatare relevant for us, as the rest are not used to define the SVMdiscriminating hyperplane) constitute only the 22% (approx) of samples.Not bad.

Before jumping to the test set, we may be interested in anothertopic: feature importance. This means studying which variables areconsidered more important by the model when discriminating betweenclasses. Feature importance is important for avoiding “black-boxmodels”: prediction models that we know that work well, but not*why*.

Obtaining the importances out of a SVM model can be somewhatconvoluted (this is discussed later in more depth) and sometimesdownright impossible. In our particular case, the only problem is thatwe wanted to classify 3 classes (species)… but SVM classifiers arebinary. For discriminating 3 classes, `kernlab`

in factbuilds 3 classifiers: “setosa vs versicolor”, “setosa vs virginica”, and“versicolor vs virginica”. These 3 classifiers constitute the`linear_model`

object and the prediction of the class of asample is done by a voting scheme. To simplify things, for the features’importance part, we will focus only on the third classifier: “versicolorvs virginica”, which we have seen previously that are the two mostrelated species. The way to go here is to obtain the index of theSupport Vectors in our model, and their coefficients. All that isgracefully provided by `kernlab`

. Then, we will return to`kerntools`

and call the function `svm_imp()`

:

`## Third model: Versicolor vs virginicasv_index <- kernlab::alphaindex(linear_model)[[3]] # Vector with the SV indicessv_coef <- kernlab::coef(linear_model)[[3]] # Vector with the SV coefficientsfeat_imp3 <- svm_imp(X=iris_std[train_idx,],svindx=sv_index,coeff=sv_coef)#> Do not use this function if the SVM model was created with the RBF,#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels`

Note that here we need the data that we used to compute`KL`

: `iris_std.`

It is *very* important touse this version of the dataset, and not any other version with morepre-processing (or the “original” without pre-processing).`svm_imp()`

has parameters like `center`

,`scale`

and `cos.norm`

to take this widespreadnormalization techniques into account, but it is better to playsafe.

Conveniently, `kerntools`

provides a function to visualizethe features’ importance:

`plotImp(feat_imp3, leftmargin = 7, main="3rd model: versicolor vs virginica", color="steelblue1")`

`#> $first_features#> [1] "Petal.Length" "Petal.Width" "Sepal.Width" "Sepal.Length"#> #> $c*msum#> [1] 1#> #> $barplot#> [,1]#> [1,] 0.7#> [2,] 1.9#> [3,] 3.1#> [4,] 4.3`

As we can see, the model considers the petals most discriminatingthat the sepals, and within the petals’ measures, the petal length.

If we return to the PCA, we could also check the weight of eachvariable in the first and second PCs (that is, the ones we displayed).To do so, it comes in handy the `kPCA_imp()`

function. Again,this function requires the dataset that generated the `KL`

matrix:

`loadings <- kPCA_imp(iris_std)#> Do not use this function if the PCA was created with the RBF,#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernelspcs <- loadings$loadings[1:2,]pcs#> Sepal.Length Sepal.Width Petal.Length Petal.Width#> PC1 0.5210659 -0.2693474 0.58041310 0.56485654#> PC2 0.3774176 0.9232957 0.02449161 0.06694199`

It seems that the first component is dominated by Petal Length andPetal Width but also by Sepal Length. The second component, that plays arole in discriminating *I. versicolor* and *I.virginica*,is dominated by Sepal Width. The PCA disagrees a bit with the SVMfeature importances, but remember that in the latter we focused only onthe “versicolor vs virginica” problem, while in the former we arelooking at the ordination of the three classes. We may represent thecontributions of the four features for the 1st PC, and to make thingseasier we will include the 2nd PC onto the barplot:

`plotImp(pcs[1,], y=pcs[2,], ylegend="PC2",absolute=FALSE, main="PC1", leftmargin = 7, color="rosybrown1")`

`#> $first_features#> [1] "Petal.Length" "Petal.Width" "Sepal.Length" "Sepal.Width" #> #> $c*msum#> [1] 1#> #> $barplot#> [,1]#> [1,] 0.7#> [2,] 1.9#> [3,] 3.1#> [4,] 4.3`

We used `absolute=FALSE`

because the contribution of eachvariable to the PC is relevant not only in magnitude, but also in sign.Pink bars represent PC1, while the black line represents PC2 (parameter`y`

). As we only wanted to see the order and the relativemagnitude, the X axis show the relative contribution (in the`plotImp()`

function, `relative=TRUE`

bydefault).

With `kerntools`

, we can draw the contributions on the PCAplot as arrows. We only need the PCA plot (given by `kPCA()`

)and the contributions (given by `kPCA_imp()`

):

`kPCA_arrows(plot=kpca$plot,contributions=t(pcs),colour="grey15")`

(Note that the arrows are scaled to match with the original PCA plot.They are somewhat orientative: their directions are correct, and longerarrows represent a greater contribution to a PC that shorter arrows;however, usually the arrows’ magnitudes do not coincide with the actualmagnitudes that can be computed from `kPCA_imp()`

).

And now, finally, we are going to check the performance in the testset (considering the 3 classes). To do so, we subset `KL`

tohave a suitable *test x training* matrix, input the matrix intoour linear_model, and compare with the predicted species with the actualspecies:

`KL_test <- as.kernelMatrix(KL[test_idx,train_idx])## Prediction (test data)pred_class <- predict(linear_model,KL_test)actual_class <- iris_species[test_idx]pred_class#> [1] versicolor versicolor virginica virginica versicolor virginica #> [7] virginica virginica versicolor versicolor versicolor versicolor#> [13] versicolor versicolor versicolor versicolor virginica versicolor#> [19] versicolor versicolor virginica versicolor virginica versicolor#> [25] versicolor virginica versicolor versicolor versicolor versicolor#> Levels: setosa versicolor virginicaactual_class#> [1] setosa setosa versicolor versicolor setosa setosa #> [7] setosa versicolor setosa setosa versicolor virginica #> [13] virginica virginica virginica virginica setosa versicolor#> [19] setosa setosa setosa setosa versicolor setosa #> [25] virginica versicolor virginica virginica virginica versicolor#> Levels: setosa versicolor virginica`

Mmmm… maybe we were a bit overconfident. It seems that the model hasignored *I. setosa* completely.

We can compute numerically how “good” (or wrong) our model isaccording to different performance measures implemented in`kerntools`

. When dealing with classification, all of themneed a contingency table that contrasts the actual and the predictedclasses (also known as “confusion matrix”). The most simple measure isthe accuracy: number of right predictions divided by the number ofpredictions (which is the test set size: in our case, 30).

`ct <- table(actual_class,pred_class) # Confusion matrixct#> pred_class#> actual_class setosa versicolor virginica#> setosa 0 9 4#> versicolor 0 3 5#> virginica 0 9 0Acc(ct) ## Accuracy#> [1] 0.1Acc_rnd(actual_class) ## Accuracy of the random model#> [1] 0.3488889`

As expected, our accuracy is overwhelmingly low. We can compare theresult with the accuracy of the random model (according to the classdistribution on the test): 0.35, which means that our model performs alot worse than someone classifying at random.

We can explore other measures to infer what are the problems with ourprediction model (for instance, if a species is systematicallymissclassified, etc.). For this example, we can compute these measuresby class:

`Prec(ct,multi.class = "none") ## Precision or Positive Predictive Value#> setosa versicolor virginica #> 0.0000000 0.1428571 0.0000000Rec(ct,multi.class = "none") ## Recall or True Positive Rate#> setosa versicolor virginica #> 0.000 0.375 0.000Spe(ct,multi.class = "none") ## Specificity or True Negative Rate#> setosa versicolor virginica #> 1.0000000 1.0000000 0.5714286F1(ct,multi.class = "none") ## F1 (harmonic mean of Precision and Recall)#> setosa versicolor virginica #> 0.0000000 0.2068966 0.0000000`

(In case we want the overall performance measure, we can compute themean of the three classes, or type`multi.class="macro"`

).

The precision measure tell us that none of the samples predicted tobe “virginica” or “setosa” are correct (in the case of “setosa”, becausenone was predicted), and only some (1/7) that were predicted to be“versicolor” were right. The recall shows that only 3/8 “versicolor”samples in the test were correctly classified as “versicolor”, whilethere is none of “setosa” or “virginica.” F1 is useful because it givesa “mean” of Precision and Recall. Meanwhile, the low specificity of“versicolor” points that a lot of samples that were not “versicolor”were predicted as such.

### A (slightly) more complicated example

In the previous section we picked naively the first model we couldtrain, with not-so-great results. Here, we are going to complicatethings a bit hoping that we will obtain a model that *works*.

(Note: of course, iris flower classification is a simple task. Infact, it can be achieved pretty decently with the linear kernel, as canbe deduced from the previous PCA: a linear classifier is enough todiscriminate the flowers using only the 1st and 2nd PCs. However, forthe sake of the example, we will use a different kernel in the presentsection).

The radial basis function (RBF) kernel is something like the “goldstandard” among kernels. Unlike the linear kernel (which is the mostsimple or “plain” kernel), it is nonlinear: in fact, the RBF kernel is auniversal approximator. We can compute it doing:

`Krbf <- RBF(iris_std,g=0.25)histK(Krbf,col="aquamarine",vn = TRUE)`

Kernel matrix values are typically between 0 and 1. The linear kernelrequired only our dataset, but `RBF()`

has a (hyper)parametercalled gamma (`g`

for short). The value of thishyperparameter should be decided by us, which is an important decision,because it will affect the decision boundary of the kernel. Fortunately,some heuristics to estimate a good gamma exist. `kerntools`

implement three of them, which are available in the function`estimate_gamma()`

:

`estimate_gamma(iris_std)#> $d_criterion#> [1] 0.25#> #> $scale_criterion#> [1] 0.2512584#> #> $quantiles_criterion#> 90% 50% 10% #> 0.05570343 0.16670322 1.73533468`

In the previous histogram we visualized the RBF with the gamma givenby the “d_criterion” (and almost the one given by the “scalecriterion”). The third heuristic gives us a distribution of “good” gammavalues. Now, for the sake of comparison, we will compute the RBF kernelusing the median of the “quantiles_criterion”:

`Krbf2 <- RBF(iris_std,g=0.1667)histK(Krbf2,col="darkseagreen1",vn=TRUE)`

Not only the histogram changes: the von Neumann entropy changes aswell. It important to remark that the RBF kernel is very sensitive tothe gamma values. The higher entropy with respect to that of the linearkernel reflects that, here, we have a higher variability in the kernelmatrix values. (That can be also be deduced comparing both histograms.Conversely, if we do `heatK(Krbf)`

, we will observe moreextreme values/colors than before). This paper recommendsan entropy between 0.3-0.5, so maybe this will be reflected on the SVMmodel’s performance?

Now, we can also do a kernel PCA. Our previous kernel PCA used alinear kernel so, in reality, it was identical to a “normal” PCA. Thistime however we are using a different kernel and now we can actually saythat this is a kernel PCA. The main difference is that the projection ofsamples is *not* going to be linear. Sometimes, this createsstrange patterns that are difficult to interpret.

As later we are going to train a SVM model, it may occur to us thatit would be great to do a PCA only with the training samples, so we cancompare the prediction model with the PCA side by side. To do so, wewill use the same training indices than in the previous section. Evenbetter: what if we compute a (kernel) PCA with the training samples, andthen project the test samples over them?

`Krbf_train <- Krbf2[train_idx,train_idx]Krbf_test <- Krbf2[test_idx,train_idx]rbf_kpca <- kPCA(K=Krbf_train, Ktest=Krbf_test, plot = 1:2, y = iris_species[train_idx], title = "RBF PCA")rbf_kpca$plot`

(Note: remember that, in a real-world problem, the standardization ofthe dataset should have been done with the center and std deviation ofthe training set.)

Said and done! However, now the patterns on this kernel PCA are abit… radial. Still, *I. setosa* is again on one side, and *I.versicolor* and *I. virginica* on the other. The red, greenand blue samples are the training samples, where the grey samplescorrespond to the test samples we projected *a posteriori* (anyother color can be specified in the `kPCA`

parameter`na_col`

).

What now? As we have generated more than one kernel matrix from thesame data (thanks to the linear and the RBF kernels), we may comparethese matrices. To do so, we can use a `kerntools`

functioncalled `simK`

:

`simK(list(linear=KL,rbf_0.166=Krbf, rbf_0.25=Krbf2))#> Remember that Klist should contain only kernel matrices (i.e. squared, symmetric and PSD).#> This function does NOT verify the symmetry and PSD criteria.#> linear rbf_0.166 rbf_0.25#> linear 1.0000000 0.5208955 0.4803192#> rbf_0.166 0.5208955 1.0000000 0.9898203#> rbf_0.25 0.4803192 0.9898203 1.0000000`

`simK`

will first remind us that a matrix should haveseveral mathematical properties to be a kernel matrix. When we work withkernel matrices generated by `kerntools`

(that is:`Linear()`

, `RBF()`

, etc.) everything will bealright. However, you can come to `kerntools`

with yourprecomputed kernel matrices (as long as they have class “matrix”,“array”). `kerntools`

implicitly trusts the user knows whathe/she is doing, so remember using proper kernel matrices.

`simK`

returns a score between 0 and 1: 1 is completesimilarity, and 0 is complete dissimilarity. We can see that the two RBFmatrices are very similar, while the linear kernel matrix is around a50% similar to the RBF matrices.

We could also compare the two PCAs. An option to do so is computingthe RV coefficient (Co-Inertia Analysis). However, the RV coefficientbetween `rbf_kpca`

and `kpca`

should give the sameresult than `simK(list(KL,Krbf))`

. It should be noted thatthis equivalence only holds if the dataset is centered beforehand, asPCAs usually are computed using centered data (for that reason we have`kPCA(..., center=TRUE)`

by default). If a kernel matrix wasobtained from a non-centered data, it can be centered afterwards with`centerK()`

(more of this in later sections).

Another way to compare two PCAs is called the Procrustes Analysis.This analysis compares the correlation between two projections after“removing” the translation, scale and rotation effects. Although is notproperly a kernel method, `kerntools`

can do a basicProcrustes Analysis. In our data, we have a moderate Procrustescorrelation: 0.68 (the correlation coefficient is bounded between 0 and1).

`rbf_kpca <- kPCA(K=Krbf)proc <- Procrustes(kpca$projection,rbf_kpca)proc$pro.cor # Procrustes correlation#> [1] 0.6862007`

(This is a moment as good as any other to show that`kPCA()`

can return the kernel PCA projection withoutdisplaying any plot. In that case, all graphical parameters like colors,labels, etc. are ignored.)

With all of these, we will train a brand new (and hopefully better)model. We will re-use the same training and test samples:

`####### Model (training data)model <- kernlab::ksvm(x=Krbf_train, y=iris_species[train_idx], kernel="matrix", C=10)model#> Support Vector Machine object of class "ksvm" #> #> SV type: C-svc (classification) #> parameter : cost C = 10 #> #> [1] " Kernel matrix used as input."#> #> Number of Support Vectors : 30 #> #> Objective Function Value : -4.1707 -2.7089 -92.348 #> Training error : 0.008333`

A very low training error, but now we are wiser. What about theperformance in test?

`Krbf_test <- as.kernelMatrix(Krbf_test)####### Prediction (test data)pred_class <- predict(model,Krbf_test)actual_class <- iris_species[test_idx]ct <- table(actual_class,pred_class) # Confusion matrixAcc(ct) ## Accuracy#> [1] 0.5`

Wow, that seems a lot better! However, before we get excited, we mustremember that this is a point estimation of accuracy, that comes from aspecific test set (the 30 samples we chose randomly in the previoussection). Another test set surely will deliver a different accuracy.What if we tried to compute a confidence interval (CI) to have an ideaof how other test sets will behave?

`kerntools`

provides two ways to obtain a CI: the NormalApproximation, and bootstrapping. Normal approximation is quicker, whilebootstrapping is usually considered to be safer (more details: here):

` ## Accuracy CI (95%)Normal_CI(value = 0.5,ntest = 30) ## Accuracy CI (95%)#> [1] 0.3210806 0.6789194Boots_CI(target = actual_class, pred = pred_class, nboots = 2000,index = "acc") #> 2.5% 97.5% #> 0.5016667 0.3333333 0.7000000`

Both functions default to a 95% CI, but that can be changed via the`confidence`

parameter. According to the normalapproximation, the accuracy is 0.5 (0.32, 0.68), while according tobootstrap strategy, it is 0.5 (0.33, 0.66). The CI is wide because thetest is very small (only 30 samples). However, with this test and CI(with the 95% confidence) we cannot *really* assure that thismodel is really different than the random model, which had an accuracyof 0.35 (as we computed in the previous section). Useful reminder: nexttime, we should choose a larger test set.

Before call it a day, we are going to compute again the otherperformance measures. This time, we will not compute them class byclass, but on average (the “macro” approach):

`Prec(ct) ## Precision or Positive Predictive Value#> It is identical to weighted Accuracy#> [1] 0.4357143Rec(ct) ## Recall or True Positive Rate#> [1] 0.4807692Spe(ct) ## Specificity or True Negative Rate#> [1] 0.8085477F1(ct) ## F1 (harmonic mean of Precision and Recall)#> [1] 0.4484848`

If we desire to do, we can compute a CI for these values: forinstance, to bootstrap the macro-F1 value, we could simply type`index = "f1"`

in the `Boots_CI()`

function. Inany case, we should congratulate ourselves because the performance isclearly higher than last time. Training *seriously* a machinelearning model involves fine hyperparameter tuning (remember that C in`ksvm()`

?) that we have almost completely skipped. That is:we should use a strategy like, say, grid search, and compare theperformance measures of each hyperparameter combination via crossvalidation, which is far beyond the purposes of this vignette (and`kerntools`

).

Finally, a warning about computing feature importances in SVM and/orfeature contribution to PCs in PCA. In a few words: don’t do that whenusing the RBF kernel. More precisely: of all kernel functionsimplemented here, do not ever try to recover the feature contributionsfor `RBF()`

, `Laplace()`

, `Jaccard()`

,`Ruzicka()`

, `BrayCurtis()`

and`Kendall()`

unless you know *very* well what you aredoing. If you type something like:

`####### RBF versicolor vs virginica model:sv_index <- kernlab::alphaindex(model)[[3]]sv_coef <- kernlab::coef(model)[[3]]svm_imp(X=iris_std[train_idx,],svindx=sv_index,coeff=sv_coef)#> Do not use this function if the SVM model was created with the RBF,#> Laplacian, Bray-Curtis, Jaccard/Ruzicka, or Kendall's tau kernels#> Sepal.Length Sepal.Width Petal.Length Petal.Width #> 0.8931422 0.7828191 12.3522663 10.0792776`

something will be computed, but the result **is wrong**.Do not ignore the warning raised here. This is not right because,ultimately, all kernels behave like the linear kernel: they compute theinner product of some features. But what features? That is the question.Under the hood, all kernels “send” the original features (the featurethat live in the *input space*) to other space (usuallyhigher-dimensional) called the *feature space*, and they computethe inner product there. The kernel conflates these two steps into one,which usually simplifies a lot the calculations and saves a lot ofmemory space: this is called the “kernel trick”. But to computeanalytically the feature contributions we need to go that feature space.To make things worse, the feature space that the kernel implicitly isusing depends on things like the dimensionality of input data, the kindof kernel, the specific value of its hyperparameters, etc. Going to thisfeature space is only trivial for the linear kernel: only then inputspace = feature space. Instead, for the RBF kernel, this feature spaceis infinite dimensional. Some techniques to estimate it exist (see forinstance: ExplicitApproximations of the Gaussian Kernel), but they are not yetimplemented in `kerntools`

(and maybe they never are).