\documentclass[nojss,shortnames]{jss} \usepackage{amsfonts} \usepackage{graphicx} \usepackage{amsmath} \usepackage{natbib} \usepackage{float} \usepackage{color} %\input colornam.sty %\VignetteIndexEntry{A Trimming Approach to Cluster Analysis} %\VignetteDepends{tclust} %\VignetteKeywords{Model-based clustering, trimming, heterogeneous clusters} %\VignettePackage{tclust} \newcommand{\rounding}[1]{\lceil#1\rceil} %% new rounding symbol %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% almost as usual \author{Heinrich Fritz\\Department of Statistics\\ and Probability Theory\\ Vienna University of Technology \And Luis A. Garc\'{\i}a-Escudero\\Departamento de Estad\'{\i}stica\\ e Investigaci\'{o}n Operativa\\ Universidad de Valladolid \And Agust\'{\i}n Mayo-Iscar\\Departamento de Estad\'{\i}stica\\ e Investigaci\'{o}n Operativa\\ Universidad de Valladolid } \title{\pkg{tclust}: An \proglang{R} Package for a Trimming Approach to Cluster Analysis } %% for pretty printing and a nice hypersummary also set: \Plainauthor{Heinrich Fritz, Luis A. Garc\'{\i}a-Escudero, Agust\'{\i}n Mayo-Iscar} \Plaintitle{tclust: An R Package for a Trimming Approach to Cluster Analysis} %\Shorttitle{\pkg{tclust}: An \proglang{R} package for a trimming approach to cluster analysis} %% a short title (if necessary) % this is optional. no use if not different from "real" title %% an abstract and keywords \Abstract{This introduction to the \proglang{R} package \pkg{tclust} is a (slightly) modified version of \cite{FriG12}, published in the \emph{Journal of Statistical Software}. Outlying data can heavily influence standard clustering methods. At the same time, clustering principles can be useful when robustifying statistical procedures. These two reasons motivate the development of feasible robust model-based clustering approaches. With this in mind, an R package for performing non-hierarchical robust clustering, called \pkg{tclust}, is presented here. Instead of trying to ``fit'' noisy data, a proportion $\alpha$ of the most outlying observations is trimmed. The \pkg{tclust} package efficiently handles different cluster scatter constraints. Graphical exploratory tools are also provided to help the user make sensible choices for the trimming proportion as well as the number of clusters to search for.} \Keywords{Model-based clustering, trimming, heterogeneous clusters} \Plainkeywords{Model-based clustering, trimming, heterogeneous clusters} %% without formatting %% at least one keyword must be supplied %% publication information %% NOTE: Typically, this can be left commented and will be filled out by the technical editor %% \Volume{13} %% \Issue{9} %% \Month{September} %% \Year{2004} %% \Submitdate{2004-09-29} %% \Acceptdate{2004-09-29} %% The address of (at least) one author should be given %% in the following format: \Address{ Luis A. Garc\'{\i}a-Escudero\\ Departamento de Estad\'{\i}stica e Investigaci\'{o}n Operativa. 
Universidad de Valladolid\\ Prado de la Magdalena s/n, 47005 Valladolid, SPAIN\\ E-mail: \email{lagarcia@eio.uva.es}\\ %URL: \url{http://www.eio.uva.es/infor/personas/langel.html} } %% It is also possible to add a telephone and fax number %% before the e-mail in the following format: %% Telephone: +43/1/31336-5053 %% Fax: +43/1/31336-734 %% for those who use Sweave please include the following line (with % symbols): %% need no \usepackage{Sweave.sty} %% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Note that you should use the \pkg{}, \proglang{} and \code{} commands. \begin{document} \section{Introduction to robust clustering and tclust}\label{sec4:int} Methods for cluster analysis attempt to detect homogeneous clusters with large heterogeneity among them. As happens with other (non-robust) statistical procedures, clustering methods may be heavily influenced by even a small fraction of outlying data. For instance, two or more clusters might be joined artificially, due to outlying observations, or ``spurious'' non-informative clusters may be composed of only a few outlying observations \citep[see, e.g.,][]{GarG99,GarG10}. Therefore, the application of robust methods in this context is very advisable, especially in fully automatic clustering (unsupervised learning) problems. Certain relations between cluster analysis and robust methods \citep{RocW02,HarR04,GarG03,WooR04} also motivate interest in robust clustering techniques. For example, robust clustering techniques can be used to handle ``clusters'' of highly concentrated outliers which are especially dangerous in (non-robust) estimation. \cite{GarG10} provides a recent survey of robust clustering methods. The \pkg{tclust} package for the \proglang{R} environment for statistical computing \citep{R} implements different robust non-hierarchical clustering algorithms where trimming plays a key role. This package is available at \texttt{http://CRAN.R-project.org/package=tclust}. When trimming allows the removal of a fraction $\alpha$ of the ``most outlying'' data, the strong influence of outlying observations can be avoided. This trimming approach to clustering has been introduced in \cite{CueG97}, \cite{GalM02}, \cite{GalR05} and \cite{GarG08}. Trimming also serves to identify potentially interesting anomalous observations. Trimming is not a new concept in statistics. For instance, the trimmed mean for one-dimensional data removes a proportion $\alpha/2$ each of the largest and smallest observations before computing the mean. However, it is not straightforward to extend this philosophy to cluster analysis, because most of these problems are of multivariate nature. Moreover, it is often the case that ``bridge points'' lying between clusters ought to be trimmed. Instead of forcing the statistician to define the regions to be trimmed in advance, the procedures implemented in \pkg{tclust} take the whole data structure into account in order to decide which parts of the sample should be discarded. By considering this type of trimming, these procedures are even able to trim outlying bridge points. The ``self-trimming'' philosophy behind these procedures is exactly the same as adopted by some well-known high breakdown-point methods \citep[see, e.g.,][]{RouL87}. As a first example of this trimming approach, let us consider the trimmed $k$-means method introduced in \cite{CueG97}. The function \code{tkmeans} from the \pkg{tclust} package implements this method. 
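To contrast this with classical trimming, the symmetric trimming behind the one-dimensional trimmed mean can be reproduced in a few lines of \proglang{R} code. The following lines are only a small illustrative sketch (the vector \code{x} is an arbitrary simulated sample with two artificial outliers appended):
\begin{Code}
R > set.seed(1)
R > x <- c(rnorm(98), 15, 20)  ## 98 "regular" observations plus 2 outliers
R > mean(x)                    ## heavily influenced by the two outliers
R > mean(x, trim = 0.05)       ## trims 5% of the data in each tail
R > mean(sort(x)[6:95])        ## the same trimming carried out by hand
\end{Code}
Here the trimmed regions (the two tails) are fixed in advance, whereas the \code{tkmeans} function decides itself, in a multivariate setting, which observations should be discarded.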
In the following example, this function is applied to a bivariate data set based on the Old Faithful geyser called \code{geyser2} that accompanies the \pkg{tclust} package. The code given below creates Figure~\ref{fig4:f1}.

<>=
options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE)
## standard margins
mmar <- c(5.1, 4.1, 4.1, 1.1)
og <- grey(0) ## color for outliers
@

<>=
library(tclust)
set.seed(100)
data(geyser2)
clus <- tkmeans(geyser2, k = 3, alpha = 0.03)
plot(clus, col = c(og, 2:4), tol.lwd = 1, tol.lty = 2)
@

In the data set \code{geyser2}, we are searching for $k=3$ clusters and a proportion $\alpha=0.03$ of the data is trimmed. The clustering results are shown in Figure~\ref{fig4:f1}. Among this $3\%$ of trimmed data, we can see 6 anomalous ``short followed by short'' eruption lengths. Notice that an observation situated between the clusters is also trimmed.

\begin{figure}[t!]
\centering
\includegraphics{tclust-fig-1}
\caption{Trimmed $k$-means results with $k=3$ and $\alpha=0.03$ for the bivariate Old Faithful Geyser data. Trimmed observations are denoted by the symbol ``$\circ$'' (a convention followed in all the figures in this work).}
\label{fig4:f1}
\end{figure}

The package presented here adopts a ``crisp'' clustering approach, meaning that each observation is either trimmed or fully assigned to a cluster. In comparison, mixture approaches estimate a cluster pertinence probability for each observation. Robust mixture alternatives have also been proposed in which noisy data are accommodated through additional mixture components. For instance, the package \pkg{mclust} \citep{FraR12, BanR93, FraR98} and the Fortran program \pkg{emmix} \citep{EMMIX99, LacP00} implement such robust mixture fitting approaches. Mixture fitting results can easily be converted into a ``crisp'' clustering result by converting the cluster pertinence probabilities into 0--1 probabilities. Contrary to these mixture fitting approaches, the procedures implemented in the \pkg{tclust} package simply remove outlying observations and do not intend to fit them at all. The package \pkg{tlemix} \citep[see][]{NeyF12, NeyF07} also implements a closely related trimming approach. As described in Section~\ref{sec4:cons}, the \pkg{tclust} package focuses on offering adequate cluster scatter matrix constraints, leading to a wide range of clustering procedures depending on the chosen constraint and avoiding the occurrence of spurious non-interesting clusters.

The outline of the paper is as follows: In Section~\ref{sec4:tri} we briefly review the so-called ``spurious outliers'' model and show how to derive two different clustering criteria from it. Different constraints on the cluster scatter matrices and their implementation in the \pkg{tclust} package are addressed in Section~\ref{sec4:cons}. Section~\ref{sec4:num} presents the numerical output returned by this package. Section~\ref{sec4:alg} provides some brief comments concerning the algorithms implemented, and a comparison of \pkg{tclust} with several other robust clustering approaches is given in Section~\ref{sec4:comp}. Section~\ref{sec4:kal} shows some graphical outputs that help guide the choice of the number of clusters and the trimming proportion. Other useful plots summarizing the robust clustering results are shown in Section~\ref{sec4:gra}. Finally, Section~\ref{sec4:exa} applies the \pkg{tclust} package to a well-known real data set.
\section{Trimming and the spurious outliers model}\label{sec4:tri} \cite{GalM02} and \cite{GalR05} propose the ``spurious outliers model'' as a probabilistic framework for robust crisp clustering. Let $f(\cdot;\mu,\Sigma)$ denote the probability density function of the $p$-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. The ``spurious-outlier model'' is defined through ``likelihoods'' like \begin{equation}\label{eqn4:spu} \bigg[\prod_{j=1}^{k}\prod_{i\in R_j}f(x_i;\mu_j,\Sigma_j)\bigg]\bigg[\prod_{i \in R_0}g_i(x_i)\bigg] \end{equation} with $\{R_0,...,R_k\}$ being a partition of the set of indices $\{1,2,...,n\}$ such that $\#R_0=\rounding{n\alpha}$. $R_0$ are the indices of the ``non-regular'' observations generated by other (not necessarily normal) probability density functions $g_i$. ``Non-regular'' observations can be clearly considered as ``outliers'' if we assume certain sensible assumptions for the $g_i$ \citep[see details in][]{GalM02,GalR05}. Under these assumptions, the search of a partition $\{R_0,...,R_k\}$ with $\#R_0=\rounding{n\alpha}$, vectors $\mu_j$ and positive definite matrices $\Sigma_j$ maximizing (\ref{eqn4:spu}) can be simplified to the same search (of a partition, vectors and positive definite matrices) by just maximizing \begin{equation}\label{eqn4:e3} \sum_{j=1}^{k}\sum_{i\in R_j}\log f(x_i;\mu_j,\Sigma_j). \end{equation} Notice that observations $x_i$ with $i\in R_0$ are not taken into account in (\ref{eqn4:e3}). Maximizing (\ref{eqn4:e3}) with $k=1$ yields the Minimum Covariance Determinant (MCD) estimator \citep{RouP85}. Unfortunately, the direct maximization of (\ref{eqn4:e3}) is not a well-defined problem when $k>1$. It is easy to see that (\ref{eqn4:e3}) is unbounded without any constraint on the cluster scatter matrices $\Sigma_j$. The \code{tclust} function from the \pkg{tclust} package approximately maximizes (\ref{eqn4:e3}) under different cluster scatter matrix constraints which will be shown in Section~\ref{sec4:cons}. The maximization of (\ref{eqn4:e3}) implicitly assumes equal cluster weights. In other words, we are ideally searching for clusters with equal sizes. The function \code{tclust} provides this option by setting the argument \code{equal.weights = TRUE}. The use of this option does not guarantee that all resulting clusters exactly contain the same number of observations, but the method hence prefers this type of solutions. Alternatively, different cluster sizes or cluster weights can be considered by searching for a partition $\{R_0,...,R_k\}$ (with $\#R_0=\rounding{n\alpha}$), vectors $\mu_j$, positive definite matrices $\Sigma_j$ and weights $\pi_j\in[0,1]$ maximizing \begin{equation}\label{eqn4:e4} \sum_{j=1}^{k}\sum_{i\in R_j}(\log \pi_j + \log f(x_i;\mu_j,\Sigma_j)). \end{equation} The (default) option \code{equal.weights = FALSE} is used in this case. Again, the scatter matrices also have to be constrained such that the maximization of (\ref{eqn4:e4}) becomes a well-defined problem. Note that (\ref{eqn4:e4}) simplifies to (\ref{eqn4:e3}) when assuming \code{equal.weights = TRUE} and all weights are equally set to $\pi_j=1/k$. \newpage \section{Constraints on the cluster scatter matrices}\label{sec4:cons} As already mentioned, the function \code{tclust} implements different algorithms aimed at approximately maximizing (\ref{eqn4:e3}) and (\ref{eqn4:e4}) under different types of constraints which can be applied on the scatter matrices $\Sigma_j$. 
The type of constraint is specified by the argument \code{restr} of the \code{tclust} function. Table~\ref{tab4:app.ovv} gives an overview of the different clustering approaches implemented by the \code{tclust} function depending on the chosen type of constraint.

\begin{table}[t!]
\centering
\begin{tabular}{||l|l|l||}
\hline\hline
 & \code{equal.weights = TRUE} & \code{equal.weights = FALSE}\\
\hline
\code{restr = "eigen"} & \hspace{-1.8mm}$\begin{array}{l} k\textit{-means} \\ \text{\cite{CueG97}} \end{array}$ & \cite{GarG08} \\
\hline
\code{restr = "deter"} & \text{\cite{GalM02}} & \text{This work}\\
\hline\hline
\end{tabular}
\caption{\label{tab4:app.ovv}Clustering methods handled by \pkg{tclust}. Names in italics denote untrimmed ($\alpha=0$) methods.}
\end{table}

Imposing constraints is compulsory because maximizing (\ref{eqn4:e3}) or (\ref{eqn4:e4}) without any restriction is not a well-defined problem. Notice that an almost degenerate scatter matrix $\Sigma_j$ would cause the trimmed log-likelihoods (\ref{eqn4:e3}) and (\ref{eqn4:e4}) to tend to infinity. This issue can cause a (robust) clustering algorithm of this type to end up finding ``spurious'' clusters almost lying in lower-dimensional subspaces. Moreover, the resulting clustering solutions might heavily depend on the chosen constraint. The strength of the constraint is controlled by the argument \code{restr.fact} $\geq 1$ in the \code{tclust} function. The larger \code{restr.fact} is chosen, the looser the restriction on the scatter matrices becomes, allowing for more heterogeneity among the clusters. Conversely, small values of \code{restr.fact} close to 1 imply very ``equally scattered'' clusters. This idea of constraining cluster scatters to avoid spurious solutions goes back to \cite{HatR85}, who proposed it in mixture fitting problems. Also arising from the spurious outlier model, other types of constraints have recently been introduced by \cite{GalR09,GalR10}. These (closely related) constraints also serve to avoid degeneracy of trimmed likelihoods, but they are not implemented in the current version of the \pkg{tclust} package.

\subsection{Constraints on the eigenvalues}\label{sec4:consteig}

Based on the eigenvalues of the cluster scatter matrices, a scatter similarity constraint may be defined. With $\lambda_l(\Sigma_j)$ as the eigenvalues of the cluster scatter matrices $\Sigma_j$ and
\begin{equation}\label{eqn4:e5}
M_n=\max_{j=1,...,k}\,\max_{l=1,...,p}\lambda_l(\Sigma_j)\text{ and } m_n=\min_{j=1,...,k}\,\min_{l=1,...,p}\lambda_l(\Sigma_j)
\end{equation}
as the maximum and minimum eigenvalues, the restriction \code{restr = "eigen"} constrains the ratio $M_n/m_n$ to be smaller than or equal to a fixed value \code{restr.fact} $\geq 1$. A theoretical study of the properties of this approach with \code{equal.weights = FALSE} can be found in \cite{GarG08}. This type of constraint limits the relative size of the axes of the equidensity ellipsoids defined through the obtained $\Sigma_j$ when assuming normality. This way we are simultaneously controlling the relative group sizes and also the deviation from sphericity in each cluster. Setting \code{equal.weights = TRUE}, \code{restr = "eigen"} and \code{restr.fact = 1} implies the most constrained case. In this case, the \code{tclust} function tries to solve the trimmed $k$-means problem as introduced by \cite{CueG97}. This problem simplifies to the well-known $k$-means clustering criterion when no trimming is done (i.e., \code{alpha = 0}).
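The eigenvalue ratio actually attained by a fitted solution can be checked directly from the array of returned cluster scatter matrices (the \code{\$cov} component described in Section~\ref{sec4:num}). The following lines are just a small illustrative sketch, with an arbitrarily chosen \code{restr.fact}, of how the ratio $M_n/m_n$ in (\ref{eqn4:e5}) can be recomputed:
\begin{Code}
R > clus <- tclust(geyser2, k = 3, alpha = 0.03, restr = "eigen", restr.fact = 5)
R > ev <- apply(clus$cov, 3, function (S) eigen (S, only.values = TRUE)$values)
R > max(ev) / min(ev)  ## never larger than the specified restr.fact
\end{Code}
With \code{restr.fact = 1} all these eigenvalues are forced to coincide, so that the obtained clusters are spherical and equally scattered.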
The \code{tkmeans} function directly implements this most constrained application of the \code{tclust} function.

\subsection{Constraints on the determinants}\label{sec4:constdet}

Another way of restricting cluster scatter matrices is constraining their determinants. Thus, if
$$ M_n=\max_{j=1,...,k}|\Sigma_j|\text{ and }m_n=\min_{j=1,...,k}|\Sigma_j| $$
are the maximum and minimum determinants, we attempt to maximize (\ref{eqn4:e3}) or (\ref{eqn4:e4}) by constraining the ratio $M_n/m_n$ to be smaller than or equal to a fixed value \code{restr.fact}. This is done in the function \code{tclust} by using the option \code{restr = "deter"}. Now, this type of constraint limits the relative volumes of the mentioned equidensity ellipsoids, but not the cluster shapes. The use of this type of constraint is particularly advisable when affine equivariance is required because this property is satisfied when \code{restr = "deter"}. The untrimmed case \code{alpha = 0}, \code{restr = "deter"} and \code{restr.fact = 1} was already outlined in \cite{MarJ74} as the only sensible way to prevent (Mahalanobis distance modified) $k$-means type algorithms from returning clusters made up of a few almost collinear observations. The possibility of trimming data was also considered in \cite{GalM02}, who implicitly assumed $|\Sigma_1|=...=|\Sigma_k|$ (and so \code{restr.fact = 1}). The package presented here extends her approach to more general cases (\code{restr.fact} $>1$).

\subsection{Example}

\begin{figure}[t!]
\centering
%\includegraphics[width= 0.65 \textwidth]{Plots_R-chunk2}
%\includegraphics[width=10cm]{Plots_R-chunk2a}
<>=
################
## Figure 2 ##
################
data(M5data)
cl <- M5data[, "cluster"]
plot(M5data[, 1:2], col = cl + 1, pch = cl + 1, main = "The M5data data set")
@
\caption{A scatter plot of the \code{M5data} data set. Different symbols are used for the data points generated by each of the three bivariate normal components and ``$\circ$'' for the added outliers.}\label{f2a}
\end{figure}

In this example, we examine the influence of different constraints by applying the function \code{tclust} to the so-called \code{M5data} data set. This data set, which accompanies the \pkg{tclust} package, has been generated following the simulation scheme M5 introduced in \cite{GarG08}. Thus it is a bivariate mixture of three simulated Gaussian components with very different scatters and a clear overlap between two of these components.

\begin{figure}[t!]
\centering
%\includegraphics[width= 0.77 \textwidth]{Plots_R-chunk3}
<>=
################
## Figure 3 ##
################
data(M5data)
x <- M5data[, 1:2]
set.seed(100)
## applying tclust with restr.fact = 1, restr = "eigen"
res.a <- tclust(x, k = 3, alpha = 0.1, restr.fact = 1, restr = "eigen",
    equal.weights = TRUE)
## applying tclust with restr.fact = 50, restr = "eigen"
res.b <- tclust(x, k = 3, alpha = 0.1, restr.fact = 50, restr = "eigen",
    equal.weights = FALSE)
## applying tclust with restr.fact = 1, restr = "deter"
res.c <- tclust(x, k = 3, alpha = 0.1, restr.fact = 1, restr = "deter",
    equal.weights = TRUE)
## applying tclust with restr.fact = 50, restr = "deter"
res.d <- tclust(x, k = 3, alpha = 0.1, restr.fact = 50, restr = "deter",
    equal.weights = FALSE)
op <- par(mfrow = c (2, 2), mar = c (2.1, 2.1, 3.6, 1.1))
col <- c(og, 4, 2, 3)
pch <- c(1, 4, 2, 3)
## plotting results..
plot(res.a, col = col, pch = pch, tol.lty = 2, xlab = "", ylab = "", main.pre ="(a)", main = "/r") plot(res.b, col = col, pch = pch, tol.lty = 2, xlab = "", ylab = "", main.pre ="(b)", main = "/r") plot(res.c, col = col, pch = pch, tol.lty = 2, xlab = "", ylab = "", main.pre ="(c)", main = "/r") plot(res.d, col = col, pch = pch, tol.lty = 2, xlab = "", ylab = "", main.pre ="(d)", main = "/r") par(op) @ \caption{Results of the clustering processes for the \code{M5data} data set for different constraints on the cluster scatter matrices and the parameters $\alpha = 0.1$ and $k= 3$. Different colors and symbols represent each observation's individual cluster assignment.} \label{fig4:f2} \end{figure} A $10\%$ proportion of outliers is also added in the outer region of the bounding rectangle enclosing the three gaussian components. See Figure~\ref{f2a} for a graphical representation and \cite{GarG08} for more details on the structure of this \code{M5data} data set. Executing the following code yields Figure~\ref{fig4:f2}. \begin{Code} R > data ("M5data") R > x <- M5data[, 1:2] R > res.a <- tclust(x, k = 3, alpha = 0.1, restr.fact = 1, restr = "eigen", + equal.weights = TRUE) R > res.b <- tclust(x, k = 3, alpha = 0.1, restr.fact = 50, restr = "eigen", + equal.weights = FALSE) R > res.c <- tclust(x, k = 3, alpha = 0.1, restr.fact = 1, restr = "deter", + equal.weights = TRUE) R > res.d <- tclust(x, k = 3, alpha = 0.1, restr.fact = 50, restr = "deter", + equal.weights = FALSE) R > plot(res.a, main = "/r") R > plot(res.b, main = "/r") R > plot(res.c, main = "/r") R > plot(res.d, main = "/r") \end{Code} Although different constraints are imposed, we are searching for $k=3$ clusters and the trimming proportion is set to $\alpha=0.1$ in all the cases. Note that only the clustering procedure introduced in \cite{GarG08}, shown in Figure~\ref{fig4:f2},(d), with a sufficiently large value of \code{restr.fact} approximately returns the three original clusters in spite of the very different cluster scatters and different cluster sizes. Moreover, this clustering procedure adequately handles the severe overlap of two clusters. The value \code{restr.fact = 50} has been chosen in this case because the eigenvectors of the covariance matrices of the three gaussian components satisfy restriction (\ref{eqn4:e5}) for this value. Due to their underlying assumptions, the other three clustering methods (trimmed $k$-means in Figure~\ref{fig4:f2},(a), \cite{GalR05} in (b), \cite{GalM02} in (c)) return rather similarly structured clusters. In fact, we found spherical clusters in (a), clusters with the same scatter matrix in (b) and clusters with the same cluster scatter matrix determinant in (c). The \code{M5data} is perhaps a very ``extreme'' situation and restriction settings in (a), (b) and (c) can be useful (and easier to be interpreted) with not so extreme data sets and where the assumptions implied by these restriction settings hold. \section{Numerical output}\label{sec4:num} The function \code{tclust} returns an S3 object containing the cluster centers $\mu_j$ by columns (\code{\$centers}), scatter matrices $\Sigma_j$ as an array (\code{\$cov}), the weights (\code{\$weights}), the number of observations in each group (\code{\$size}) and the maximum value found for the trimmed log-likelihood objective function (\ref{eqn4:e3}) or (\ref{eqn4:e4}) (\code{\$obj}). 
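For instance, the value stored in \code{\$obj} can be reconstructed (up to small numerical differences) from the remaining components together with the assignment vector \code{\$cluster} described next. The following lines are only an illustrative sketch for the bivariate \code{geyser2} data, so the constant of the normal log-density is hard-coded for $p=2$; this is not how the package itself evaluates the objective function:
\begin{Code}
R > data ("geyser2")
R > clus <- tclust(geyser2, k = 3, alpha = 0.03)
R > obj <- 0
R > for (j in 1:3) {
+     xj <- geyser2[clus$cluster == j, ]
+     md <- mahalanobis(xj, clus$centers[, j], clus$cov[, , j])
+     obj <- obj + sum(log(clus$weights[j]) - log(2 * pi) -
+         0.5 * log(det(clus$cov[, , j])) - 0.5 * md)
+ }
R > c(obj, clus$obj)  ## the two values should essentially agree
\end{Code}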
The vector \code{\$cluster} provides the cluster assignment of each observation, whereas an artificial cluster ``\code{0}'' (without location and scatter information) is introduced, which holds all trimmed data points.

\begin{figure}[t!]
\centering
%\includegraphics[width= 0.65 \textwidth]{Plots_R-chunk4}
<>=
################
## Figure 4 ##
################
## data generation
library(MASS)
set.seed (10)
x <- rbind(MASS::mvrnorm(200, c (0, 0), diag (2)),
    MASS::mvrnorm(200, c (5, 0), diag (2)))
## applying tclust
clus.4 <- tclust(x, k = 3, alpha = 0.0, restr.fact = 1)
## plotting results..
plot(clus.4, xlim = c (-3, 9), ylim = c (-6, 6))
@
\caption{Applying \code{tclust} with $k = 3$, $\alpha = 0$ and \code{equal.weights = FALSE} to a simulated data set which actually consists of 2 clusters.}
\label{fig4:f3a}
\end{figure}

Sometimes, the maximum of (\ref{eqn4:e3}) or (\ref{eqn4:e4}) is attained with some clusters remaining empty (see Figure~\ref{fig4:f3a}). In this case, only information on the non-empty groups is returned. Notice that, if we are searching for $k$ clusters, empty clusters can be found when a clustering solution for a number of clusters strictly smaller than $k$ attains a higher value for (\ref{eqn4:e3}) or (\ref{eqn4:e4}) than the best solution found with $k$ clusters. In this case, artificial empty clusters may be defined by considering sufficiently remote centers $\mu_j$ and scatter matrices $\Sigma_j$ satisfying the specific constraints that are assigned to these empty clusters. They are chosen such that $f(\cdot;\mu_j,\Sigma_j)$ gives almost null density to all the observations in the sample. These artificially added centers and scatter matrices are not returned as output by the \code{tclust} function and a warning is issued. For instance, let us consider the following code:
\begin{Code}
R > set.seed(10)
R > x <- rbind(MASS::mvrnorm(200, c (0, 0), diag (2)),
+     MASS::mvrnorm(200, c (5, 0), diag (2)))
R > clus <- tclust(x, k = 3, alpha = 0, restr.fact = 1)
R > plot(clus)
\end{Code}
Although we are searching for $k=3$ clusters, Figure~\ref{fig4:f3a} and the issued warning show that only 2 clusters are found. Notice that $k=2$ is surely a more sensible choice for the number of clusters than $k=3$ for this generated data set. Therefore, the detection of empty clusters, or clusters with few data points, can be helpful, providing valuable tools for making sensible choices for $k$, as we will see in Section~\ref{sec4:kal}. On the other hand, the detection of empty clusters is very unlikely to happen when the argument \code{equal.weights = TRUE} is provided in the call to \code{tclust}.

\section{Algorithms}\label{sec4:alg}

The maximization of (\ref{eqn4:e3}) or (\ref{eqn4:e4}) considering different cluster scatter matrix constraints is not straightforward because of the combinatorial nature of the associated maximization problems. The algorithm presented in \cite{GarG08} can be adapted to approximately solve all these problems. The methods implemented in \pkg{tclust} could be seen as Classification EM algorithms \citep{Sch76,CelG92} in which a certain type of ``concentration'' step \citep[see the fast-MCD algorithm in][]{RouD98} is also applied. In fact, the concentration steps applied by the package \pkg{tclust} can be considered as an extension of those applied by the batch-mode $k$-means algorithm \citep{Ste56,For65}.
It can be seen that the target function always increases throughout the application of concentration steps, but several random start configurations are needed in order to avoid getting trapped in local maxima. Therefore, \code{nstart} random initializations and \code{iter.max} concentration steps are considered. The probability that the algorithm converges close to the global optimum maximizing (\ref{eqn4:e3}) or (\ref{eqn4:e4}) increases with larger values of \code{nstart} and \code{iter.max}. The obvious drawback of high values of \code{nstart} and \code{iter.max} is the increased computational effort.

In the concentration step, the centers and scatter matrices are updated by considering the cluster sample means and cluster sample covariance matrices. New cluster assignments are obtained by gathering the observations ``closest'' to the new centers. Mahalanobis distances, based on the computed cluster sample covariance matrices, are used in order to decide which are the closest observations to each center. If needed, in the updating step, the cluster sample covariance matrices are modified as little as possible but in such a way that they satisfy the desired constraints \citep{GarG08}. The main idea behind these ``constrained'' concentration steps is to replace the eigenvalues of the sample covariance matrices by optimally truncated eigenvalues, which satisfy the desired constraint. A more detailed description of the algorithm applied by \code{tclust} and the way the restrictions are forced onto the cluster scatter matrices can be found in \cite{FriG11}.

\section{Comparison with other robust clustering proposals}\label{sec4:comp}

In this section, we briefly compare the performance of the clustering procedures implemented in the \pkg{tclust} package with that of other robust clustering proposals in the literature. The Partitioning Around Medoids (PAM) clustering method \citep{KauR90} has been proposed as a robust alternative to $k$-means clustering. It can be seen that the 6 anomalous ``short followed by short'' eruption lengths in the lower left corner of Figure~\ref{fig4:f1} do not affect the position of the $k$-medoid centers (see Figure~\ref{fig4:pam},(a)) too much, nor the resulting clusters. However, in Figure~\ref{fig4:pam},(b), we see that the clustering results with $k=3$ are strongly affected when moving these 6 anomalous points toward a more distant position. On the other hand, in Figure~\ref{fig4:pam},(c), we can see that these outlying data points do not affect the trimmed $k$-means based clustering at all once they are trimmed.

\begin{figure}[t!]
\centering %\includegraphics[width= 1 \textwidth]{plots_R-chunk5} <>= ################ ## Figure 5 ## ################ library(cluster) data(geyser2) set.seed(0) op <- par(mfrow = c (1, 3), mar = c (3.1, 3.1, 3.1, 2.1)) xlab <- "" #names (geyser2)[1] ylab <- "" #names (geyser2)[2] ## applying PAM on the geyser2 data set clus <- pam(geyser2, 3) plot(geyser2, xlab = xlab, ylab = ylab, col = clus$clustering + 1, main = "(a) PAM with Geyser") points(clus$medoids, pch = "X", cex = 1.4) ## modifying the geyser2 data set geyser2.mod <- geyser2 idx <- c(16, 21, 36, 171, 236, 265) geyser2.mod[idx,] <- geyser2[idx,] - 12 ## applying PAM on the modified geyser2 data set clus <- pam(geyser2.mod, 3) plot(geyser2.mod, xlab = xlab, ylab = ylab, col = clus$clustering + 1, main = "(b) PAM with modified Geyser") ## applying tkmeans on the modified geyser2 data set clus <- tkmeans(geyser2.mod, k = 3, alpha = 0.03) plot(clus, xlab = xlab, ylab = ylab, main = "(c) TCLUST with modified Geyser", sub = "") par(op) @ \caption{PAM's clustering results for the \code{geyser2} data with $3$-medoids denoted by the symbol ``$\times$'' in (a). PAM's clustering results for a modified \code{geyser2} data set in (b) and when applying \code{tkmeans} with $k=3$ and $\alpha=0.03$ in (c).} \label{fig4:pam} \end{figure} In fact, only one single outlier placed in a very remote position is able to completely break down the PAM method \citep{GarG99}. This also happens when applying \pkg{emmix}, which has a breakdown point of zero \citep{Hen04}. The \pkg{emmix} approach is able to obtain appropriate clustering results for the two data sets made of mixtures of symmetric and asymmetric $t$ components as those shown in Figure~\ref{fig4:elt}. These two data sets contain three main clusters with some distant observations in the heavy tails of these $t$ components, which would be considered as outliers when assuming normality. When applying the classification EM algorithm without trimming to these data sets, we are not able to find the three cluster structures and two main clusters are artificially joined together. However, when considering $\alpha =0.05$, \pkg{tclust} perfectly avoids the harmful effect of the observations in the tails and still discovers the three clusters. In fact, almost all non-trimmed observations are correctly clustered in both, symmetric and asymmetric, cases. Any $\alpha >0$ discarding the most outlying observations would give similar results. Moreover, it may be seen that the shape of the elliptical clusters are essentially discovered in the case of the symmetric-elliptical $t$ components. In this example, we see that applying \pkg{tclust} to data sets including non-normally distributed components as those in Figure~\ref{fig4:elt} may result in proper clustering solutions, this however cannot be guaranteed if the underlying distributions differ too much from normality. The closely related \pkg{tlemix} package allows to consider other non-normal models by taking advantage of the flexibility provided by the \pkg{FlexMix} package \citep{LeiF04}. On the other hand, \pkg{tclust} focuses on normally distributed components and on the implementation of appropriate cluster scatter matrix constraints while \pkg{tlemix} does not. The \pkg{tlemix} mainly controls the minimum number of observations in a cluster. \begin{figure}[t!] 
\centering %\includegraphics[width= 0.9 \textwidth]{plots_R-chunk6} <>= ################ ## Figure 6 ## ################ fig.elt <- list() library(sn) op <- par(mfrow = c (2, 3), mar = c (3.1, 2, 3.1, 2)) set.seed (3) n <- 200 xlab <- "" #"x1" ylab <- "" #"x2" ## data generation: Mixture of elliptical t's xi <- c (0.5, -1) Omega <- diag (2) / 2 + 0.5 alpha <- c (0, 0) rnd1 <- rmst (n, xi, Omega, alpha, nu = 2) xi <- c (6, -3) Omega <- matrix (c (4, 1, 1, 2), nrow = 2, ncol = 2) alpha <- c (0, 0) rnd2 <- rmst (n, xi, Omega, alpha, nu = 3) xi <- c (-3, 3) Omega <- matrix (c (2, 1, 1, 4), nrow = 2, ncol = 2) alpha <- c (0, 0) rnd3 <- rmst (n, xi, Omega, alpha, nu = 4) rnd.1 <- rbind (rnd1, rnd2, rnd3) real.clus1 <- c (rep (1, n), rep (2, n), rep (3, n)) ylim <- c (-25, 15) ## Plotting the real clusters plot(rnd.1, xlim = c (-15, 80), ylim = ylim, xlab = xlab, ylab = ylab, col = real.clus1 + 1, pch = real.clus1 + 1) title("(a) Elliptical t components", line = 1.6) ## applying tclust without trimming clus <- tclust(rnd.1, k = 3, alpha = 0.0) plot(clus, xlim = c (-15, 80), ylim = ylim, xlab = xlab, ylab = ylab, main.pre = "(b)") ## applying tclust with trimming (5%) clus <- tclust(rnd.1, k = 3, alpha = 0.05) plot(clus, xlim = c (-15, 80), ylim = ylim, xlab = xlab, ylab = ylab, main.pre = "(c)") ## data generation: Mixture of non-elliptical t's xi <- c (0.5, -1) Omega <- diag (2) / 2 + 0.5 alpha <- c (2, 80) rnd1 <- rmst (n, xi, Omega, alpha, nu = 2) xi <- c (6, -3) Omega <- matrix (c (4, 1, 1, 2), nrow = 2, ncol = 2) alpha <- c (2, 2) rnd2 <- rmst (n, xi, Omega, alpha, nu = 3) xi <- c (-3, 3) Omega <- matrix (c (2, 1, 1, 4), nrow = 2, ncol = 2) alpha <- c (4, 4) rnd3 <- rmst (n, xi, Omega, alpha, nu = 4) rnd.2 <- rbind (rnd1, rnd2, rnd3) real.clus2 <- c (rep (1, n), rep (2, n), rep(3, n)) ylim <- c (-10, 20) ## Plotting the real clusters plot (rnd.2, xlim = c (-10, 20), ylim = ylim, xlab = xlab, ylab = ylab, col = real.clus2 + 1, pch=real.clus2 + 1) title ("(d) Non-elliptical t components", line = 1.6) ## applying tclust without trimming clus <- tclust(rnd.2, k = 3, alpha = 0.0) plot(clus, xlim = c (-10, 20), ylim = ylim, xlab = xlab, ylab = ylab, main.pre = "(e)") ## applying tclust with trimming (5%) clus <- tclust(rnd.2, k = 3, alpha = 0.05) plot(clus, xlim = c (-10, 20), ylim = ylim, xlab = xlab, ylab = ylab, main.pre = "(f)") par(op) @ \caption{A data set made up of three elliptical $t$ components is shown in (a) and three asymmetrical $t$ components in (d). The associated clustering results when applying the \code{tclust} function with $k=3$ and $\alpha=0$ are shown in (b) and (e) and with $k=3$ and $\alpha=0.05$ in (c) and (f).} \label{fig4:elt} \end{figure} The widely used \pkg{mclust} package considers a uniformly distributed component for explaining outlying data points. As we can see, this uniform component successfully accommodates the $10\%$ ``background noise'' as seen in Figure~\ref{fig4:mclust},(a). However, it is not able to cope with a more structured noise pattern like the ``helix'' in Figure~\ref{fig4:mclust},(c) which also accounts for $10\%$ of the data, although the information of a $10\%$ contamination level was passed to \pkg{mclust}. Alternatively, the \pkg{tclust} package with $k = 2$ and $\alpha = 0.1$ properly discovers the outlying data points without trying to fit them. \begin{figure}[t!] 
\centering %\includegraphics[width= 0.7 \textwidth]{plots_R-chunk7} <>= ################ ## Figure 7 ## ################ library(mclust) ## function for plotting mclust-objects in tclust-style. mclustplottcluststyle2d <- function (x, clus, clus.perm, tol.lty = 3, size = sqrt (qchisq (0.95, 2)), ...) { if (missing (clus.perm)) cuse <- clus$classification else cuse <- clus.perm [clus$classification + 1] plot (x, col = cuse + 1, pch = cuse + 1, ...) k <- ncol (clus$parameters$mean) for (i in 1:k) tclust:::.doEllipses(cov = clus$parameters$variance$sigma[,, i], center = clus$parameters$mean[, i], size = size, lty = tol.lty) } op <- par (mfrow = c (2, 2), mar = c (3.1, 3.1, 3.1, 2.1)) ## example 1 - uniformly distributed noise ## data generation set.seed (0) x <- rbind ( MASS::mvrnorm(360, c (0.0, 0), matrix (c ( 1, 0, 0, 1), ncol = 2)), MASS::mvrnorm(540, c (5.0, 10), matrix (c ( 6, -2, -2, 6), ncol = 2)), cbind (runif (100, -5, 15), runif (100, -10, 20))) ## applying mclust noiseInit <- sample (c (TRUE, FALSE), size = nrow (x), replace = TRUE, prob = c (0.1, 0.9)) clus <- Mclust(x, initialization = list (noise = noiseInit), G = 2) mclustplottcluststyle2d (x, clus, c (0, 2, 1), xlab = "", ylab = "", main = "(a) mclust", tol.lty = 2) ## applying tclust clus <- tclust(x, k = 2, alpha = 0.1) plot (clus, xlab = "", ylab = "", main = "(b) tclust", sub = "", tol.lty = 2) ## example 2 - structured noise ## data generation set.seed (0) v <- runif (100, -2 * pi, 2 * pi) noise <- cbind (100 + 25 * sin (v), 10 + 5 * v) x <- rbind ( MASS::mvrnorm(360, c (0.0, 0), matrix (c (1, 0, 0, 1), ncol = 2)), MASS::mvrnorm(540, c (5.0, 10), matrix (c (6, -2, -2, 6), ncol = 2)), noise) ## applying mclust noiseInit <- sample (c (TRUE, FALSE), size = nrow (x), replace = TRUE, prob = c (0.1, 0.9)) clus <- Mclust (x, initialization = list (noise = noiseInit), G = 2) mclustplottcluststyle2d (x, clus, c (0, 2, 1), xlab = "", ylab = "", main = "(c) mclust", tol.lty = 2) ## applying tclust clus <- tclust(x, k = 2, alpha = 0.1) plot (clus, xlab = "", ylab = "", main = "(d) tclust", sub = "", tol.lty = 2) par (op) @ \caption{Clustering results for two simulated data sets when applying \pkg{mclust} with $k=2$ in (a) and (c) and \pkg{tclust} with $k=2$ and $\alpha=0.1$ in (b) and (d).} \label{fig4:mclust} \end{figure} Since groups of outliers may be considered as further clusters, it could be argued that robust clustering problems can always be solved by increasing the number of groups we are searching for. However, as explained in \cite{GarG10}, this is not necessarily the best strategy. Firstly, sometimes the researcher fixes the number of clusters in advance, not being aware of the presence of a small amount of outlying observations. Secondly, it could lead to a quite large number of clusters when very scattered outliers are present in the data set. A clear limitation of \pkg{tclust} is that it is not applicable on high-dimensional data sets, as the method in its current definition definitely needs a data set containing more observations than dimensions. \section{Selecting the number of groups and the trimming size}\label{sec4:kal} Perhaps one of the most complex problems when applying cluster analysis is the choice of the number of clusters, $k$. In some cases one might have an idea of the number of clusters in advance, but usually $k$ is completely unknown. Moreover, in the approach proposed here, the trimming proportion $\alpha$ has also to be chosen without knowing the true contamination level. 
As we will see through the following example, the choices for $k$ and $\alpha$ are related problems that should be addressed simultaneously. It is important to see that a particular trimming level implies a specific number of clusters and vice versa. This dependency can be explained by the fact that entire clusters tend to be trimmed completely when increasing $\alpha$. On the other hand, when $\alpha$ is chosen too low, groups of outliers might form new spurious clusters and thus it appears that the number of clusters found in the data set is higher. Moreover, the simultaneous choice of $k$ and $\alpha$ depends on the type of clusters we are searching for and on the allowed differences between cluster sizes. These two aspects can be controlled by the choice of the arguments \code{restr} and \code{restr.fact}.

To demonstrate the relation between $\alpha$, $k$ and \code{restr.fact}, let us consider \code{restr = "eigen"} and the data set in Figure~\ref{fig4:f3}, which could be interpreted either as a mixture of three components (a) or as a mixture of two components (b) with a $10\%$ outlier proportion. Both clustering solutions shown in Figure~\ref{fig4:f3} are perfectly sensible and the final choice of $\alpha$ and $k$ only depends on the value given to \code{restr.fact}. The code used to obtain Figure~\ref{fig4:f3} is the following:
\begin{Code}
R > sigma1 <- diag(2)          ## EigenValues: 1, 1
R > sigma2 <- diag(2) * 8 - 2  ## EigenValues: 8, 4
R > sigma3 <- diag(2) * 50     ## EigenValues: 50, 50
R > mixt <- rbind(
+     MASS::mvrnorm(360, mu = c (0.0, 0), Sigma = sigma1),
+     MASS::mvrnorm(540, mu = c (5.0, 10), Sigma = sigma2),
+     MASS::mvrnorm(100, mu = c (2.5, 5), Sigma = sigma3))
R > plot(tclust(mixt, k = 3, alpha = 0.00, restr.fact = 50))
R > plot(tclust(mixt, k = 2, alpha = 0.05, restr.fact = 8))
\end{Code}

\begin{figure}[t!]
\centering
%\includegraphics[width= 0.9 \textwidth]{Plots_R-chunk8}
<>=
################
## Figure 8 ##
################
## data generation
set.seed (100)
mixt <- rbind(MASS::mvrnorm(360, c ( 0, 0), matrix (c (1, 0, 0, 1), ncol = 2)),
    MASS::mvrnorm(540, c ( 5, 10), matrix (c (6, -2, -2, 6), ncol = 2)),
    MASS::mvrnorm(100, c (2.5, 5), matrix (c (50, 0, 0, 50), ncol = 2)))
## applying tclust
set.seed (100)
clus.1 <- tclust(mixt, k = 3, alpha = 0.0, restr.fact = 50)
clus.2 <- tclust(mixt, k = 2, alpha = 0.05, restr.fact = 8)
## plotting results
op <- par (mfrow = c (1, 2))
plot(clus.1, by.clust = TRUE, col = c (og, 2, 3, 4), pch = c (1, 2, 3, 4),
    tol.lty = 2, main.pre = "(a)")
plot(clus.2, by.clust = TRUE, col = c (og, 2, 3), pch = c (1, 2, 3),
    tol.lty = 2, main.pre = "(b)")
par (op)
@
\caption{Clustering results for the simulated data set \code{mixt} with $k=3$, $\alpha=0$ and \code{restr.fact} $= 50$ (a) and $k=2$, $\alpha=0.05$ and \code{restr.fact} $=8$ (b).}
\label{fig4:f3}
\end{figure}

Considering \code{sigma1} and \code{sigma2}, the quotient of the largest and smallest eigenvalue is 8, whereas the maximal quotient is 50 if we consider \code{sigma1}, \code{sigma2} and \code{sigma3}. Thus \code{restr.fact} $=8$ would allow us to consider two clusters, while \code{restr.fact} $=50$ would also allow us to assume three groups there. Although the proportion of ``contaminated'' data is equal to $10\%$, the trimming level must be reduced to $5\%$ when considering $k=2$, because the third (more scattered) Gaussian component partially overlaps with the other two components.
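These eigenvalue ratios can be double-checked directly from the covariance matrices defined in the code above (a short verification only; \code{sigma1}, \code{sigma2} and \code{sigma3} are the matrices generated there):
\begin{Code}
R > ev12 <- c(eigen(sigma1)$values, eigen(sigma2)$values)
R > max(ev12) / min(ev12)    ## 8, considering only the first two components
R > ev123 <- c(ev12, eigen(sigma3)$values)
R > max(ev123) / min(ev123)  ## 50, considering all three components
\end{Code}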
Let us assume first that \code{restr} and \code{restr.fact} have been fixed in advance by the researcher who applies the robust clustering method. Even with this information and assuming $\alpha=0$, choosing the appropriate number of clusters is not an easy task. The careful monitoring of the maximum value attained by log-likelihoods like those in (\ref{eqn4:e3}) and (\ref{eqn4:e4}) while changing $k$ has traditionally been applied as a method for choosing the number of clusters when $\alpha=0$. Moreover, \cite{BryP91} stated that the use of ``weighted'' log-likelihoods (\ref{eqn4:e4}) is preferable to the use of log-likelihoods assuming equal weights (\ref{eqn4:e3}). Notice that increasing $k$ always causes the maximized log-likelihood (\ref{eqn4:e3}) to increase too, and this could lead to ``overestimating'' the appropriate number of clusters \citep[see][]{GarG11}.

In this trimming framework, let us consider $\mathcal{L}_{\texttt{restr.fact}}^{\Pi}(\alpha,k)$ as the maximum value reached by (\ref{eqn4:e4}) for each combination of a given set of values for $k$ and $\alpha$. \cite{GarG11} propose to monitor the ``classification trimmed likelihood'' functionals
$$(\alpha,k)\mapsto \mathcal{L}_{\texttt{restr.fact}}^{\Pi}(\alpha,k)$$
while altering $\alpha$ and $k$, which yields an exploratory graphical tool for making sensible choices for the parameters $\alpha$ and $k$. In fact, it is proposed to choose the number of clusters as the smallest value of $k$ such that
\begin{equation}\label{eqn4:diff}
\mathcal{L}_{\texttt{restr.fact}}^{\Pi}(\alpha,k+1)-\mathcal{L}_{\texttt{restr.fact}}^{\Pi}(\alpha,k)
\end{equation}
is (close to) 0 except for small values of $\alpha$. Once the number of clusters is fixed, a good choice for the trimming level is the first $\alpha_0$ such that (\ref{eqn4:diff}) is (close to) 0 for every $\alpha\geq \alpha_0$. Although we are convinced that monitoring the classification trimmed likelihood functionals is very informative, no theoretical statistical procedures are available yet for determining when (\ref{eqn4:diff}) can be formally considered as ``close to 0''.

The function \code{ctlcurves} in package \pkg{tclust} approximates the classification trimmed likelihoods by successively applying the \code{tclust} function for a sequence of values of $k$ and $\alpha$. A default value \code{restr.fact = 50} is considered but, if desired, other values of \code{restr.fact} can be passed to \code{tclust} via \code{ctlcurves} too. For instance, the following code applied to the previously simulated \code{mixt} data set
\begin{Code}
R > plot (ctlcurves (mixt, k = 1:4, alpha = seq (0, 0.2, by = 0.05)))
\end{Code}
results in Figure~\ref{fig4:f4}. This figure shows that increasing $k$ from 2 to 3 is needed when $\alpha=0$, as the value of the objective function differs noticeably between $k = 2$ and $k = 3$. On the other hand, increasing $k$ from 2 to 3 is no longer needed once the third (more scattered) ``cluster'' vanishes when trimming 5\% of the most outlying observations. Thus, there is no discernible difference in the value of the objective function for $\alpha \geq \alpha_0=0.05$ and $k \geq 2$. Increasing $k$ from 3 to 4 is not needed in any case.

\begin{figure}[t!]
\centering
%\includegraphics[width= 0.65 \textwidth]{Plots_R-chunk9}
<>=
################
## Figure 9 ##
################
## computing ctlcurves
ctl <- ctlcurves(mixt, restr.fact = 50, alpha = seq (0, 0.2, by = 0.05))
op <- par(mfrow = c (1, 1))
plot (ctl)
par (op)
@
\caption{Classification trimmed likelihoods with $k=1,...,4$, $\alpha=0, 0.05, ..., 0.2$ and \code{restr.fact} $=50$ for the \code{mixt} data set in Figure~\ref{fig4:f3}.}
\label{fig4:f4}
\end{figure}

The previously described procedure for making sensible choices for the parameters $k$ and $\alpha$ requires an active role from the researcher. The type of restriction and the allowed restriction factor, which do not necessarily depend on the given data set, must be specified in advance. For instance, some specific clustering applications like ``location-facilities'' problems require almost spherical clusters, which can be obtained by setting \code{restr = "eigen"} and a \code{restr.fact} close to 1. The researcher's decision on the restriction consequently modifies the proper determination of the parameters $k$ and $\alpha$. Due to the important role of the choice of \code{restr} and \code{restr.fact}, some general guidelines for fixing them are given here. For instance, as already commented, fixing \code{restr = "deter"} is recommended when only the relative cluster sizes are to be constrained, or when affine equivariance is clearly needed. On the other hand, using \code{restr = "eigen"} is advised when we want to simultaneously constrain relative cluster sizes and shapes. With respect to the choice of \code{restr.fact}, we recommend initially using large values when applying \code{ctlcurves}, thus providing high flexibility to the clustering method. The default value \code{restr.fact = 50} is suggested for \code{ctlcurves}, as it has worked well with many data sets (especially if the variables have been properly standardized through the \code{scale} argument in the \code{tclust} function).

The ``sensible'' values for $k$ and $\alpha$ obtained in this way and their associated clustering solutions must be explored carefully. For instance, \code{tclust} issues a warning when the returned clustering solution has been ``artificially restricted'' by the algorithm, as shown in Section~\ref{sec4:exa}. This means that the values $M_n$ and $m_n$ (see Section~\ref{sec4:consteig} and Section~\ref{sec4:constdet}) derived from the returned scatter matrices satisfy $M_n/m_n =$ \code{restr.fact}, because the algorithm has forced the chosen constraint, since the (unconstrained) group sample covariance matrices do not satisfy $M_n/m_n \leq$ \code{restr.fact}. In this situation, if no specific constraints are required, \code{restr.fact} may be increased stepwise until this warning disappears (a rough sketch of such a check is given below). Moreover, printing the object returned by the \code{ctlcurves} function points out all ``artificially restricted'' solutions for each computed combination of the parameters $k$ and $\alpha$. In this way, if desired, we can easily search for clustering solutions which are not artificially restricted and do not contain spurious clusters. Finally, the exploratory tools in Section~\ref{sec4:gra} also help to evaluate whether all these parameters are reasonably chosen. Note that the arguments \code{nstart} and \code{iter.max} may be provided in the call to \code{ctlcurves} and are internally passed to the \code{tclust} function.

The curves presented in \cite{GarG03} can be considered as precedents of those we obtain by using the \code{ctlcurves} function.
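As mentioned above, when no specific constraint is required, one can simply refit with increasing values of \code{restr.fact} until no ``artificially restricted'' warning is raised. The following loop is only a rough sketch of such a check for the \code{mixt} data set (any warning raised by the call is counted here, so in practice the warning message itself should also be inspected, and the fitted objects are discarded in this quick check):
\begin{Code}
R > for (cc in c(10, 20, 50)) {
+     warned <- tryCatch({
+         tclust(mixt, k = 3, alpha = 0.0, restr.fact = cc)
+         FALSE
+     }, warning = function(w) TRUE)
+     cat("restr.fact =", cc, "-> warning issued:", warned, "\n")
+ }
\end{Code}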
Trimmed likelihoods have also been taken into account in \cite{NeyF07} for choosing $k$ and $\alpha$ by using a BIC criterion. \section{Graphical displays}\label{sec4:gra} As seen in previous examples, the package \pkg{tclust} provides functions for visualizing the computed cluster assignments. One-dimensional, two-dimensional and higher-dimensional cases are visualized differently: \begin{description} \item[$p = 1$:] The one-dimensional data set with the corresponding cluster assignments is displayed along the $x$-axis. Setting the argument \code{jitter = TRUE} jitters the data along the $y$-axis in order to increase the visibility of the actual data structure. Additionally, a (robust) scatter estimation of each cluster is also displayed. \item[$p = 2$:] Tolerance ellipsoids are plotted additionally in order to visualize the estimated cluster scatter matrices. \item[$p > 2$:] The first two Fisher's canonical coordinates are displayed in this case, which are computed based on the estimated cluster scatter matrices. Notice that trimmed observations are not taken into account when computing these coordinates, since they have been completely discarded. The implementation of these canonical coordinates is derived from the function \code{discrcoord} as implemented in the package \pkg{fpc} \citep{HenC10}. \end{description} A simple example demonstrates how the \code{plot} function works in different dimensions. The code: \begin{Code} R > geyser1 <- geyser2[, 1, drop = FALSE] R > geyser3 <- cbind (geyser2, rnorm (nrow (geyser2))) R > plot (tkmeans (geyser1, k = 2, alpha = 0.03), jitter = TRUE) R > plot (tkmeans (geyser3, k = 3, alpha = 0.03)) \end{Code} yields Figure~\ref{fig4:f5a}. For demonstrating the different plotting modes, we have selected one single variable from \code{geyser2} to obtain a one-dimensional data set (\code{geyser1}), and, added an additional normally distributed variable to \code{geyser2}, yielding a three-dimensional data set (\code{geyser3}). Figure~\ref{fig4:f5a} plots the results of the trimmed $k$-means robust clustering method for these two generated data sets. \begin{figure}[t!] \centering %\includegraphics[width= 0.9 \textwidth]{Plots_R-chunk10} <>= ################# ## Figure 10 ## ################# data (geyser2) op <- par (mfrow = c (1, 2)) set.seed (10) ## creating a one dimensional data "matrix" geyser1 <- geyser2[, 1, drop = FALSE] ## applying tkmeans plot(tkmeans(geyser1, k = 2, alpha = 0.03), jitter = TRUE, tol.lwd = 2, main.pre = "(a)") ## adding a random dimension to geyser2 geyser3 <- cbind(geyser2, rnorm (nrow (geyser2))) ## applying tkmeans plot(tkmeans(geyser3, k = 3, alpha = 0.03), main.pre = "(b)") par(op) @ \caption{Trimmed $k$-means clustering results for \code{geyser1} (one-dimensional) in (a) and for \code{geyser3} (three-dimensional) in (b). These two data sets are based on \code{geyser2}. $k=2$ is fixed in (a) and $k = 3$ in (b) while $\alpha=0.03$ is fixed in both cases.} \label{fig4:f5a} \end{figure} Given a \code{tclust} object, some additional exploratory graphical tools can be applied in order to evaluate the quality of the cluster assignments and the trimming decisions. This is done by applying the function \code{DiscrFact}. 
Let $\widehat{R}=\{\widehat{R_0},\widehat{R_1},...,\widehat{R_k}\}$, $\widehat{\theta}=(\widehat{\theta_1},...,\widehat{\theta_k})$ and $\widehat{\pi}=(\widehat{\pi_1},...,\widehat{\pi_k})$ be the values obtained by maximizing (\ref{eqn4:e3}) or (\ref{eqn4:e4}) (we set $\widehat{\pi_j}=1/k$ when maximizing (\ref{eqn4:e3})). $D_j(x_i;\widehat{\theta},\widehat{\pi})=\widehat{\pi_j}\phi(x_i,\widehat{\theta_j})$ is a measure of the degree of affiliation of observation $x_i$ with cluster $j$. These values can be ordered as $D_{(1)}(x_i;\widehat{\theta},\widehat{\pi}) \leq ... \leq D_{(k)}(x_i;\widehat{\theta},\widehat{\pi})$. Thus the quality of the assignment decision of a non trimmed observation $x_i$ to the cluster $j$ with $D_{(k)}(x_i;\widehat{\theta},\widehat{\pi})=D_{j}(x_i;\widehat{\theta},\widehat{\pi})$ can be evaluated by comparing its degree of affiliation with cluster $j$ to the best second possible assignment. That is $$\text{DF}(i)=\log\big(D_{(k-1)}(x_i;\widehat{\theta},\widehat{\pi})/D_{(k)}(x_i;\widehat{\theta},\widehat{\pi})\big) \text{ for }x_i\text{ not trimmed}.$$ Let $x_{(1)},... ,x_{(n)}$ be the observations in the sample after being sorted according to their $D_{(k)}(\cdot;\widehat{\theta},\widehat{\pi})$ values, i.e., $ D_{(k)}(x_{(1)};\widehat{\theta},\widehat{\pi}) \leq ... \leq D_{(k)}(x_{(n)};\widehat{\theta},\widehat{\pi}).$ It is not difficult to see that $x_{(1)},... , x_{(\rounding{n\alpha})}$ are the trimmed observations which are not assigned to any cluster. Nevertheless, it is possible to compute the degree of affiliation $D_{(k)}(x_i;\widehat{\theta},\widehat{\pi})$ of a trimmed observation $x_i$ to its nearest cluster. Thus, the quality of the trimming decision on this observation can be evaluated by comparing $D_{(k)}(x_i;\widehat{\theta},\widehat{\pi})$ to $D_{(k)}(x_{(\rounding{n\alpha}+1)};\widehat{\theta},\widehat{\pi})$, with $x_{(\rounding{n\alpha}+1)}$ being the non-trimmed observation with smallest value of $D_{(k)}(\cdot;\widehat{\theta},\widehat{\pi})$. That is $$\text{DF}(i)=\log\big( D_{(k)}(x_i;\widehat{\theta},\widehat{\pi})/D_{(k)}(x_{(\rounding{n\alpha}+1)};\widehat{\theta},\widehat{\pi})\big)\text{ for }x_i\text{ trimmed}. $$ Hence, discriminant factors $\text{DF}(i)$ $\leq 0$ are obtained for every observation in the data set, whether trimmed or not. \begin{figure}[t!] 
\centering %\includegraphics[width= 0.9 \textwidth]{Plots_R-chunk11} <>= ################# ## Figure 11 ## ################# ## data generation set.seed (100) mixt2 <- rbind ( MASS::mvrnorm(360, c (0, 0), matrix (c (1, 0, 0, 1), ncol = 2)), MASS::mvrnorm(540, c (5, 10), matrix (c (6, -2, -2, 6), ncol = 2)), MASS::mvrnorm(100, c (2.5, 5), matrix (c (50, 0, 0, 50), ncol = 2))) ## applying tclust set.seed (100) clus.w <- tclust(mixt2, k = 3, alpha = 0.1, restr.fact = 1, equal.weights = TRUE) ## applying DiscrFact discr.clus.w <- DiscrFact(clus.w) ## plotting results op <- par(mfrow = c(1, 3), mar = mmar) plot(clus.w, col = c(og, 2, 3, 4), tol.lty = 2, main.pre = "(a)") tclust:::plot_DiscrFact_p2(discr.clus.w, xlim = c (-70, 0), main.pre = "(b)") tclust:::plot_DiscrFact_p3(discr.clus.w, tol.lty = 2, main.pre = "(c)") par(op) @ \caption{Graphical displays based on the $\text{DF}(i)$ values for a \code{tclust} cluster solution obtained with $k=3$, $\alpha=0.1$, \code{restr.fact} $= 1$ and \code{equal.weights = TRUE} for the \code{mixt} data set.} \label{fig4:f5} \end{figure} Observations with large $\text{DF}(i)$ values (i.e., values close to zero) indicate doubtful assignments or trimming decisions. The use of this type of discriminant factors was already suggested in \cite{AelW06} in a clustering problem without trimming. ``Silhouette'' plots \citep{RouP87} can be used for summarizing the obtained ordered discriminant factors. Clusters in the silhouette plot with many large $\text{DF}(i)$ values indicate the existence of not very ``well-determined'' clusters. The most ``doubtful'' assignments with $\text{DF}(i)$ larger than a $\log(\texttt{threshold})$ value are highlighted by the function \code{DiscrFact}. Figure~\ref{fig4:f5} shows the result of applying the \code{DiscrFact} function to a clustering solution found for the \code{mixt} data set appearing in Figure~\ref{fig4:f3}. The following code is used to obtain this figure: \begin{Code} R > clus.w <- tclust (mixt, k = 3, alpha = 0.1, restr.fact = 1, + equal.weights = TRUE) R > discr.clus.w <- DiscrFact (clus.w, threshold = 0.1) R > plot (discr.clus.w) \end{Code} The choice \code{threshold} $= 0.1$ means that a decision on a particular observation $x_i$ is considered as doubtful, if the quality of the second best possible decision ($D_{(k-1)}(x_i;\widehat{\theta},\widehat{\pi})$ or $D_{(k)}(x_{(\rounding{n\alpha}+1)};\widehat{\theta},\widehat{\pi})$ for trimmed observations) is larger than one tenth of the quality of the actually made decision ($D_{(k)}(x_i;\widehat{\theta},\widehat{\pi})$). Although Figure~\ref{fig4:f4} suggests to choose $k = 2$, $k$ has been increased to $3$ in order to show how such a change leads to doubtful cluster assignment decisions which can be visualized by \code{DiscrFact}. Figure~\ref{fig4:f5},(a) simply illustrates the cluster assignments and trimming decisions. The mentioned silhouette plot is presented in (b), whereas the doubtful decisions are marked in (c). All observations with $\text{DF}(i) \geq \log(0.1)$ are highlighted as they are plotted darker/in color. Most of the doubtful decisions are located in the overlapping area of the two artificially found clusters (highlighted symbols ``$\times$'' and ``$+$''). Some doubtfully trimmed observations (highlighted symbol ``$\circ$'') are located in the boundaries of these two clusters. \section{Swiss Bank notes data}\label{sec4:exa} \begin{figure}[t!] 
\section{Swiss Bank notes data}\label{sec4:exa}
\begin{figure}[t!]
\centering
%\includegraphics[width= 0.65 \textwidth]{plots_R-chunk12}
<>=
#################
## Figure 12 ##
#################
data (swissbank)
set.seed (0)
fig6.ctl <- ctlcurves(swissbank, k = 1:4, alpha = seq (0, 0.3, by = 0.025),
                      restr.fact = 50, iter.max = 100, nstart = 100)
op <- par (mfrow = c (1, 1), mar = mmar)
plot (fig6.ctl)
par (op)
@
\caption{Classification trimmed likelihoods for $k=1,\ldots,4$ and $\alpha=0, 0.025, \ldots, 0.3$ when \code{restr.fact} $=50$ for the ``Swiss Bank notes'' data set.}
\label{fig4:f6}
\end{figure}
The well-known ``Swiss Bank notes'' data set includes 6 numerical measurements (i.e., it is a six-dimensional data set) made on 100 genuine and 100 counterfeit old Swiss 1000-franc bank notes \citep{FluR88}. The following code can be used to obtain the classification trimmed likelihoods shown in Figure~\ref{fig4:f6}.
\begin{Code}
R > data ("swissbank")
R > plot (ctlcurves (swissbank, k = 1:4, alpha = seq (0, 0.3, by = 0.025)))
\end{Code}
This figure indicates the clear existence of $k=2$ main clusters (``genuine'' and ``forged'' bills). Moreover, considering the clear difference between $\mathcal{L}_{50}^{\Pi}(0,3)$ and $\mathcal{L}_{50}^{\Pi}(0,2)$, we can see that a further cluster, i.e., $k=3$, is needed when no trimming is allowed. This extra cluster can be justified by the heterogeneity of the group of forgeries (perhaps due to the presence of different sources of forged bills).
\begin{figure}[t!]
\centering
%\includegraphics[width= 0.9 \textwidth]{plots_R-chunk13}
<>=
#################
## Figure 13 ##
#################
fig7.clus <- tclust(swissbank, k = 2, alpha = .1, restr.fact = 50)
fig7.discrfact <- DiscrFact(fig7.clus, threshold = .000125)
op <- par (mfrow = c (1, 3), mar = mmar)
plot (fig7.discrfact, enum = TRUE)
par (op)
@
\caption{Clustering results with $k=2$, $\alpha = 0.1$ and \code{restr.fact} $=50$ summarized by the use of the \code{DiscrFact} function for the ``Swiss Bank notes'' data set. The threshold value is chosen in order to highlight the 7 most doubtful cluster assignments.}
\label{fig4:f7}
\end{figure}
Considering Figure~\ref{fig4:f6}, the choice $k=2$ together with a value of $\alpha$ close to 0.1 also seems sensible. Notice that $\mathcal{L}_{50}^{\Pi}(\alpha,3)$ is clearly larger than $\mathcal{L}_{50}^{\Pi}(\alpha,2)$ for $\alpha < 0.1$, while these differences become much smaller when $\alpha \geq 0.1$. The differences in the classification trimmed likelihood curves when increasing $k$ from 3 to 4 are even smaller and less important than those previously discussed: further spurious clusters can surely be found, but they are of little relevance. Figure~\ref{fig4:f7} shows the clustering results with $k=2$, $\alpha = 0.1$ and \code{restr.fact = 50} obtained by executing the code:
\begin{Code}
R > clus <- tclust (swissbank, k = 2, alpha = 0.1, restr.fact = 50)
R > plot (DiscrFact (clus, threshold = 0.0001))
\end{Code}
Notice that, in this example, we did not want to impose a specific constraint on the solution. Thus, the default value \code{restr.fact = 50} has initially been used in \code{ctlcurves}. After choosing the combination $\alpha = 0.10$ and $k = 2$, we can try to reduce the restriction factor, which results in a warning:
\begin{Code}
R > tclust(swissbank, k = 2, alpha = 0.1, restr.fact = 40)
In .tclust.warn(warnings, ret):
  The result is artificially constrained due to restr.fact = 40.
\end{Code}
Thus the choice \code{restr.fact = 50} seems appropriate, as it does not artificially restrict the result, whereas a slightly smaller restriction factor (40) does.
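This check can also be automated. The following sketch runs \code{tclust} for several restriction factors and records whether the warning shown above is raised; the warning text matched here is taken from the output above and may change in future versions of the package.
\begin{Code}
R > ## sketch: does a given restriction factor artificially constrain the result?
R > check.rf <- function (rf) {
+     constrained <- FALSE
+     withCallingHandlers (
+       tclust (swissbank, k = 2, alpha = 0.1, restr.fact = rf),
+       warning = function (w) {
+         constrained <<- grepl ("artificially constrained",
+                                conditionMessage (w))
+         invokeRestart ("muffleWarning")
+       })
+     constrained
+   }
R > sapply (c (40, 45, 50), check.rf)
\end{Code}
A value of \code{FALSE} for a given restriction factor indicates that the corresponding solution is not artificially constrained.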
By examining the sizes of the obtained groups, we see that no spurious groups are found with \code{restr.fact = 50}:
\begin{Code}
R > clus$size
[1] 95 85
\end{Code}
We have used \code{restr = "eigen"} in this example, but \code{restr = "deter"} can also be successfully applied with smaller values of \code{restr.fact}. We also use the function \code{DiscrFact} to summarize the obtained clustering results. The first two Fisher canonical coordinates derived from the final cluster assignments are plotted. The threshold value 0.0001 is chosen in order to highlight the 7 most doubtful decisions. Finally, Figure~\ref{fig4:f8} shows a scatterplot of the fourth variable (``Distance of the inner frame to lower border'') against the sixth variable (``Length of the diagonal'') with the corresponding cluster assignments. We use the symbols ``\code{G}'' for the genuine bills and ``\code{F}'' for the forged ones. The 7 most doubtful decisions (i.e., the observations with the largest $\text{DF}(i)$ values, which were highlighted in Figure~\ref{fig4:f7}(c)) are surrounded by circles in this figure.
\begin{figure}[t!]
\centering
%\includegraphics[clip,width= 0.9 \textwidth]{plots_R-chunk14}
<>=
#################
## Figure 14 ##
#################
clus <- tclust(swissbank, k = 2, alpha = .1, restr.fact = 50)
discrfact <- DiscrFact(clus)
pch <- c (rep ("G", 100), rep ("F", 100))
condition <- discrfact$assignfact > log (.000125)
cl <- clus$cluster
data (swissbank)
op <- par (mfrow = c (1, 3), mar = mmar)
xlab <- "Distance of the inner frame to lower border"
ylab <- "Length of the diagonal"
plot(swissbank[, 4], swissbank[, 6], col = "darkgrey", pch = pch,
     main = "(a) Cluster1", xlab = xlab, ylab = ylab)
cl1 <- cl == 1
points(swissbank[cl1, 4], swissbank[cl1, 6], pch = pch[cl1], col = 2)
idx <- (cl1) & condition
points(swissbank[idx, 4], swissbank[idx, 6], pch = 1, cex = 4, col = "blue")
plot(swissbank[, 4], swissbank[, 6], col = "darkgrey", pch = pch,
     main = "(b) Cluster2", xlab = xlab, ylab = ylab)
cl2 <- cl == 2
points(swissbank[cl2, 4], swissbank[cl2, 6], pch = pch[cl2], col = 3)
idx <- (cl2) & condition
points(swissbank[idx, 4], swissbank[idx, 6], pch = 1, cex = 4, col = "blue")
cl0 <- cl == 0
plot(swissbank[, 4], swissbank[, 6], col = "darkgrey", pch = pch,
     main = "(c) Trimmed", xlab = xlab, ylab = ylab)
points(swissbank[cl0, 4], swissbank[cl0, 6], pch = pch[cl0])
idx <- (cl0) & condition
points(swissbank[idx, 4], swissbank[idx, 6], pch = 1, cex = 4, col = "blue")
par(op)
@
\caption{Clustering results with $k=2$, $\alpha=0.1$ and \code{restr.fact} $=50$ for the ``Swiss Bank notes'' data set. Only the fourth and sixth variables are plotted. The 7 most doubtful decisions are surrounded by a circle symbol.}
\label{fig4:f8}
\end{figure}
We can see that ``Cluster 1'' essentially includes most of the ``forged'' bills, while ``Cluster 2'' includes most of the ``genuine'' ones. Among the trimmed observations, we can find a subset of 15 forged bills following a clearly different forgery pattern, which has already been noted by other authors \citep[see, e.g.,][]{FluR88,CooP99}. The most doubtful decisions include 5 ``genuine'' bills that have perhaps been wrongly trimmed.
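Since the true status of each bill is known, the clustering results can also be cross-tabulated against it. The following sketch assumes, as in the plotting code above, that the first 100 rows of \code{swissbank} correspond to genuine bills and the last 100 to forged ones (cluster \code{0} denotes the trimmed observations).
\begin{Code}
R > ## sketch: cross-tabulate cluster assignments against the known bill type
R > type <- rep (c ("genuine", "forged"), each = 100)
R > table (cluster = clus$cluster, type)
R > ## trimmed genuine bills, i.e., possibly wrongly trimmed observations
R > which (clus$cluster == 0 & type == "genuine")
\end{Code}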
\section{Conclusion}\label{sec4:con}
This paper presents a package called \pkg{tclust} for robust (non-hierarchical) clustering. The implementation is flexible, so only the restrictions on the cluster scatters have to be changed in order to carry out different robust clustering algorithms. Robustness is achieved by trimming a specified proportion of observations which are identified as the ``most outlying'' ones. Although this \proglang{R} package implements robust clustering approaches which have already been described in the literature, some of these approaches have been extended to provide increased flexibility. The package also provides graphical tools which, on the one hand, help to choose appropriate parameters (\code{ctlcurves}) and, on the other hand, help to assess the adequacy of a particular clustering solution (\code{DiscrFact}). Future work on this package focuses on implementing additional types of scatter restrictions, making the algorithm even more flexible, and on providing numerical tools for automatically choosing the number of clusters and the trimming proportion.
\section*{Acknowledgements}
This research is partially supported by the Spanish Ministerio de Ciencia e Innovaci\'{o}n, grant MTM2011-28657-C02-01. We would like to thank the reviewers and the editor, whose comments and suggestions greatly helped in improving the content and the presentation of this paper.
\bibliography{tclust}
\end{document}