Índex
lpda
is an R package that addresses the classification
problem through linear programming. The method looks for a hyperplane,
H, which separates the samples into two groups by minimizing
the sum of all the distances to the subspace assigned to the group each
individual belongs to. It results in a convex optimization problem for
which we find an equivalent linear programming problem. We demonstrated
that H exists when the centroids of the two groups are not
equal [1]. The method has been extended to more than two groups by
considering pairwise comparisons. Moreover, lpda
offers the
possibility of dealing with Principal Components (PCs) to reduce the
dimension of the data avoiding overfitting problems. This option can be
applied independently of the number of samples, \(n\), and variables, \(p\), that is \(n>p\) or \(n<p\). Compared to other similar
techniques it is very fast, mainly because it is based in a linear
programming problem [2].
Recently, the method has been adapted to 3-dimensional data [4].
Main function is lpda
that collect the input data,
standardizes the data or applies Principal Component Analysis (PCA)
through lpda.pca
if it is required. Then, it calls to
lpda.fit
as many times as pairwise comparisons there are.
The result is a lpda
type object that is the input to
compute predictions through ‘predict’ function and to visualize results
through ‘plot’ function.
The package has also a function named lpdaCV
to compute
by crossvalidation (CV) the classification error in different test data
sets. This is helpful to decide an appropriate number of PCs or a
specific strategy to compute the hyperplane.
lpda.3D
and lpdaCV.3D
are designed for
dealing with 3-dimensional data.
lpda
package includes two data sets concerning data
science: palmdates
and RNAseq
. The first one
is a real data set from a chemometrics study and the second one a
simulated RNA-seq experiment. In this document we show the performance
of the package with these data sets and with iris
data,
available in R
package.
Palmdates
is a data set with scores of 21 palm dates
including their respective Raman spectra and the concentration of five
compounds covering a wide range of concentrations: fibre, glucose,
fructose, sorbitol and myo-inositol [3]. The first 11 dates are Spanish
(from Elche, Alicante) with no well-defined variety and the last 10 are
from other countries and varieties, mainly Arabian. The data set has two
data.frames: conc
with 5 variables and spectra
with 2050.
data("palmdates")
names(palmdates)
#> [1] "conc" "spectra"
dim(palmdates$spectra)
#> [1] 21 2050
names(palmdates$conc)
#> [1] "fibre" "sorbitol" "fructose" "glucose" "myo-inositol"
As conc
and spectra
, are very correlated,
the application of the method with PCs reduces substantially the
dimension.
This data set has been simulated as Negative Binomial distributed and transformed to rpkm (Reads per kilo base per million mapped reads). It is a 3-dimensional array, that contains gene expression from 600 genes measured to 60 samples through 4 time-points. First 10 samples are from first group and the remaining samples from the second one. It has been simulated with few variables (genes) that discriminate between groups. There is few correlation and a lot of noise.
data("RNAseq")
dim(RNAseq)
#> [1] 20 600 4
head(RNAseq[,1:6,1:2])
#> , , 1
#>
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 70 129 81 86 169 179
#> [2,] 12 153 80 155 85 95
#> [3,] 217 68 136 112 179 96
#> [4,] 109 89 73 194 158 148
#> [5,] 58 92 95 40 59 55
#> [6,] 102 41 188 121 99 95
#>
#> , , 2
#>
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 65 94 95 195 164 95
#> [2,] 101 117 119 92 59 116
#> [3,] 55 67 64 24 54 104
#> [4,] 48 92 68 95 74 86
#> [5,] 221 65 85 81 52 45
#> [6,] 45 13 97 70 76 101
lpda
functionFirst we apply the method with the first two variables:
fibre
and sorbitol
to show the performance of
the package. The application of the method with two variables allows the
visualization of the hyperplane in two dimensions, in this case, a
straight line.
group = as.factor( c(rep("Spanish",11), rep("Other",10)) )
model1 = lpda(data = palmdates$conc[,1:2], group = group )
model1
#> Call:
#> lpda(data = palmdates$conc[, 1:2], group = group)
#> Coefficients:
#> [1] -1.116131 -3.002923 -6.563646
names(model1)
#> [1] "coef" "data" "group" "scale" "pca" "call"
One of the outputs of lpda
is a matrix with the
coefficients of the hyperplane: \(a'x=b\) for each pair-wise comparison.
In this example, -1.116 and -3.003 the coefficients of
fibre
and sorbitol
respectively and -6.564 the
constant \(b\).
In cases with 2 variables, as this simple example, we can plot the line on the points, that represents the samples, with the following code:
plot(palmdates$conc[,1:2], col = as.numeric(group)+1, pch = 20,
main = "Model 1. Palmdates: fibre & sorbitol")
abline(model1$coef[3]/model1$coef[2], -model1$coef[1]/model1$coef[2], cex = 2)
legend("bottomright", c("Other","Spanish"),col = c(2,3), pch = 20, cex=0.8)
Predicted group with the model is obtained with predict
function. Confusion matrix can be obtained with summary
. We
observe that all the samples are well classified.
predict(model1)
#>
#> elche1 elche2 elche3 elche4
#> "Spanish" "Spanish" "Spanish" "Spanish"
#> elche5 elche6 elche7 elche8
#> "Spanish" "Spanish" "Spanish" "Spanish"
#> elche9 elche10 elche11 salomon.israel
#> "Spanish" "Spanish" "Spanish" "Other"
#> hayani.israel medjool.israel barhi.israel deglet.noor.tunez
#> "Other" "Other" "Other" "Other"
#> tunez.alligh argelia.deglet.noor iran arabia.saudi.perny
#> "Other" "Other" "Other" "Other"
#> sud.africa.medjool
#> "Other"
summary(model1)
#> Call:
#> lpda(data = palmdates$conc[, 1:2], group = group)
#>
#> CONFUSION MATRIX
#> Real
#> Predicted Other Spanish
#> Other 10 0
#> Spanish 0 11
By considering the five concentration variables, all samples are well classified as well. As a 5-dimensional plot can not be showed, we offer the possibility of seeing a plot that shows the situation of samples with respect to \(H\). Y-axis represents order in which they appear in the data matrix and X-axis distances of each sample to \(H\).
model2 = lpda(data = palmdates$conc, group = group )
plot(model2, main="Model 2. Palmdates: all conc variables")
Another option is the application of lpda
to the first
PCs scores. When data is highly correlated, two components are usually
preferred. In such case choosing as plot
arguments
PCscores = TRUE
we will visualize the first two PCs,
indicating in the axis the proportion of explained variance by each PC.
Moreover if there are two groups, as in this example, the optimal
hyperplane is also showed.
model3 = lpda(data = palmdates$conc, group = group, pca = TRUE, Variability = 0.7)
plot(model3, PCscores = TRUE, main = "Model 3. Palmdates: PCA on conc variables")
When having data sets with more variables than individuals PCA is not
directly applicable. In lpda
we have implemented the
possibility of dealing with such problem, working with the equivalences
between the PCA of the data matrix and the transposed matrix.
In palmdates$spectra
there are 2050 very correlated
measurements as it is showed in the following figure:
Due to the high correlation the application of lpda
to
all the spectra variables and to the first 2 PCs (model4) give the same
solution: 0 prediction errors.
model4 = lpda(data = palmdates$spectra, group = group, pca = TRUE, Variability = 0.9)
plot(model4, PCscores = TRUE, main = "Model 4. Palmdates: PCA on Spectra variables")
Following example shows how predict the group of a test set that does not participate in the model. In this set we include two samples of each group.
test = c(10,11,12,13)
model5 = lpda(data = palmdates$spectra[-test,], group = group[-test], pca = TRUE,
Variability = 0.9)
predict(model5, datatest=palmdates$spectra[test,])
#>
#> elche10 elche11 salomon.israel hayani.israel
#> "Spanish" "Spanish" "Other" "Other"
summary(model5, datatest=palmdates$spectra[test,], grouptest = group[test])
#> Call:
#> lpda(data = palmdates$spectra[-test, ], group = group[-test],
#> pca = TRUE, Variability = 0.9)
#>
#> CONFUSION MATRIX
#> Real
#> Predicted Other Spanish
#> Other 2 0
#> Spanish 0 2
iris
dataTo show an example with more than two groups we use the famous
(Fisher’s or Anderson’s) iris
data set that is available in
base R
. The iris
data set gives the
measurements in centimeters of the variables sepal length and width and
petal length and width, respectively, for 50 flowers from each of 3
species of iris. The species are Iris setosa
,
versicolor
, and virginica
.
Results with the four variables give 2 classification errors. In this
case the model computes 3 hyperplanes for each one of the 3 pairwise
comparisons. Now plot
function gives 3 plots.
model.iris = lpda(iris[,-5], iris[,5])
summary(model.iris)
#> Call:
#> lpda(data = iris[, -5], group = iris[, 5])
#>
#> CONFUSION MATRIX
#> Real
#> Predicted setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 49 1
#> virginica 0 1 49
The application of lpda
with 3 PCs gives the same
classification error. Function plot
with
Pcscores = TRUE
gives the scores in the first two PCs, but
in this case without the hyperplanes.
model.iris2 = lpda(iris[,-5], iris[,5], pca=TRUE, PC=3)
summary(model.iris2)
#> Call:
#> lpda(data = iris[, -5], group = iris[, 5], pca = TRUE, PC = 3)
#>
#> CONFUSION MATRIX
#> Real
#> Predicted setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 49 1
#> virginica 0 1 49
plot(model.iris2, PCscores= TRUE)
lpdaCV
functionWe can use lpdaCV
function to compute the predicted
error in different test sets by crossvalidation with a specific model.
Two strategies are implemented: loo
(leave one out) and
ktest
that is specified in CV
argument. When
ktest
is selected, ntest
is the size of test
set (samples not used for computing the model, only to evaluate the
number of prediction errors) and R
is the times the model
is computed and evaluated with different training and test sets.
lpdaCV(palmdates$spectra, group, pca = TRUE, CV = "loo")
lpdaCV(palmdates$spectra, group, pca = TRUE, CV = "ktest", ntest = 5, R = 10)
CV results are so good due to the clear difference between the two groups in spectra data. We can explore ´lpdaCV` possibilities with one of the matrices of RNAseq simulated data, that is noisier than spectra data. Firstly we can see that the model with all the data gets a separating hyperplane.
data(RNAseq) # 3-dimensional array
data = RNAseq[,,3] # the third data matrix with dimensions 20 x 600
group = as.factor(rep(c("G1","G2"), each = 10))
model = lpda(data, group) # model with all the variables
summary(model)
Evaluating the error in several test sets with CV functions we can
see that, to avoid overfitting, it is preferable the application of
lpda
with PCA:
lpdaCV(data, group, pca = FALSE, CV = "ktest", ntest = 5)
#> Call:
#> lpdaCV(data = data, group = group, pca = FALSE, CV = "ktest",
#> ntest = 5)
#> Prediction Error Rate for analysed model/models:
#> Original Data
#> 0.2
lpdaCV(data, group, pca = TRUE, CV = "ktest", ntest = 5)
#> Call:
#> lpdaCV(data = data, group = group, pca = TRUE, CV = "ktest",
#> ntest = 5)
#> Prediction Error Rate for analysed model/models:
#> PC-2
#> 0
As the success of PCA solution depends on the chosen number of
components, there exist the possibility of applying lpdaCV
for several percentages of explained variability, for PCs from original
data, or several number of PCs, specified in a vector. Examples:
lpdaCV(data, group, pca = TRUE, CV = "ktest", Variability = c(0.1, 0.9))
#> Call:
#> lpdaCV(data = data, group = group, pca = TRUE, Variability = c(0.1,
#> 0.9), CV = "ktest")
#> Prediction Error Rate for analysed model/models:
#> Variability-0.1 Variability-0.9
#> 0.04 0.13
lpdaCV(data, group, pca = TRUE, CV = "ktest", PC= c(2, 10))
#> Call:
#> lpdaCV(data = data, group = group, pca = TRUE, PC = c(2, 10),
#> CV = "ktest")
#> Prediction Error Rate for analysed model/models:
#> PC-2 PC-10
#> 0.02 0.06
lpda.3D
functionThis function has been designed to apply the method lpda
to 3-dimensional data through 2 strategies:
lpda
through the third dimension
(pfac=FALSE).lpda
to the parafac scores (pfac=TRUE).For the second strategy function parafac
from
multiway
package is used. More details in [4].
data(RNAseq) # 3-dimensional array
dim(RNAseq)
#> [1] 20 600 4
group = as.factor(rep(c("G1","G2"), each = 10))
model3D = lpda.3D(RNAseq, group)
summary(model3D)
#> CONFUSION MATRIX
#> Real
#> Predicted G1 G2
#> G1 10 0
#> G2 0 10
predict(model3D)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G2" "G2" "G2" "G2" "G2" "G2"
#> 17 18 19 20
#> "G2" "G2" "G2" "G2"
plot(model3D, mfrow=c(2,2))
model3Ds2 = lpda.3D(RNAseq, group, pfac=TRUE, nfac=2)
model3Ds2$MOD$mod.pfac$Rsq
#> [1] 0.809859
predict(model3Ds2)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G2" "G2" "G2" "G2" "G2" "G2"
#> 17 18 19 20
#> "G2" "G2" "G2" "G2"
summary(model3Ds2)
#> CONFUSION MATRIX
#> Real
#> Predicted G1 G2
#> G1 10 0
#> G2 0 10
plot(model3Ds2, pfacscores=FALSE, main="Parafac Model")
lpdaCV.3D
functionIn the same way as lpdaCV
, this function is useful to
evaluate the model and to decide the number of factors in the parafac
model.
[1] Nueda, M.J.; Gandía, C. and Molina, M.D. (2022) LPDA: A new classification method based on linear programming. PLOS ONE 17(7): e0270403. https://doi.org/10.1371/journal.pone.0270403
[2] Nueda, M.J.; Gandía, C. and Molina, M.D. Classifying sequencing data using linear programming. Euro30 Conference, Dublin, June-2019.
[3] Abdrabo, S.S.; Gras, L. and Mora, J. (2013) Analytical methods applied to the chemical characterization and classification of palm dates (Phoenix dactylifera L.) from Elche’s Palm Grove. Ph.D. thesis, Departamento de Química Analítica, Nutrición y Bromatología. Universidad de Alicante.
[4] Gandía, C.; Molina, M.D. and Nueda, M.J. (2025) Adapting the lpda R package to Multiway data. In preparation.