--- title: "Clustering Tracks with CelltrackR" author: "Inge Wortel" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Clustering} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} matrix_check <- ( packageVersion( "Matrix" ) > "1.6.1" ) irlba_check <- ( packageVersion( "irlba" ) <= "2.3.5.1" ) matrix_irlba_clash <- TRUE #( matrix_check & irlba_check ) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 4.5, fig.height = 3 ) ``` ```{r savepar, echo = FALSE} # Save current par() settings oldpar <- par( no.readonly =TRUE ) ``` # Introduction To group tracks with similar properties, one can in principle perform any clustering method of interest on a *feature matrix* of quantification metrics for each track in the dataset. The package comes with three convenience functions -- `getFeatureMatrix()`,`trackFeatureMap()`, and `clusterTracks()` -- to easily compute several metrics on all tracks at once, visualize them in 2 dimensions, and to cluster tracks accordingly. This tutorial shows how to use these functions to explore heterogeneity in a track dataset. # Datasets First load the package: ```{r pack, warning = FALSE, message = FALSE} library( celltrackR ) ``` The package contains a dataset of B and T cells in a mouse cervical lymph node, and neutrophils responding to an *S. aureus* infection in a mouse ear; all are imaged using two-photon microscopy. While the original data contained 3D coordinates, we'll use the 2D projection on the XY plane (see the vignettes on [quality control methods](./QC.html) and [preprocessing of the package datasets](./data-QC.html) for details). The T-cell dataset consists of 199 tracks of individual cells in a tracks object: ```{r Tdata} str( TCells, list.len = 2 ) ``` Each element in this list is a track from a single cell, consisting of a matrix with $(x,y)$ coordinates and the corresponding measurement timepoints: ```{r showTdata} head( TCells[[1]] ) ``` Similarly, we will also use the `BCells` and `Neutrophils` data: ```{r bdata} str( BCells, list.len = 2 ) str( Neutrophils, list.len = 2 ) ``` Since there are quite many cells, we'll sample just some of the tracks for the legibility of the plots in this tutorial -- but everything we will do could also be done on the complete datasets. ```{r sampleData} # Take a sample set.seed(1234) TCells <- TCells[ sample( names(TCells), 30 ) ] BCells <- BCells[ sample( names(BCells), 30 ) ] Neutrophils <- Neutrophils[ sample( names(Neutrophils), 30 ) ] ``` Combine them in a single dataset, where labels also indicate celltype: ```{r combineData} T2 <- TCells names(T2) <- paste0( "T", names(T2) ) tlab <- rep( "T", length(T2) ) B2 <- BCells names(B2) <- paste0( "B", names(B2) ) blab <- rep( "B", length(B2) ) N2 <- Neutrophils names(N2) <- paste0( "N", names(Neutrophils) ) nlab <- rep( "N", length( N2) ) all.tracks <- c( T2, B2, N2 ) real.celltype <- c( tlab, blab, nlab ) ``` # 1 Extracting a feature matrix Using the function `getFeatureMatrix()`, we can quickly apply several quantification measures to all data at once (see `?TrackMeasures` for an overview of measures we can compute): ```{r featurematrix} m <- getFeatureMatrix( all.tracks, c(speed, meanTurningAngle, outreachRatio, squareDisplacement) ) # We get a matrix with a row per track and one column for each metric: head(m) ``` We can use this matrix to explore relationships between different metrics. For example, we can check the relationship between speed and mean turning angle: ```{r plotFM, fig.width = 4, fig.height = 3 } plot( m, xlab = "speed", ylab = "mean turning angle" ) ``` # 2 Dimensionality reduction methods: PCA, MDS, and UMAP When using more than two metrics at once to quantify track properties, it becomes hard to visualize which tracks are similar to each other. Like with single-cell data, dimensionality reduction methods can help visualize high-dimensional track feature datasets. The function `trackFeatureMap()` is a wrapper method that helps to quickly visualize data using three popular methods: principal component analysis (PCA), multidimensional scaling (MDS), and uniform manifold approximate and projection (UMAP). The function `trackFeatureMap()` can be used for a quick visualization of data, or return the coordinates in the new axis system if the argument `return.mapping=TRUE`. ### 1.1 PCA Use `trackFeatureMap()` to perform a principal component analysis (PCA) based on the measures "speed", "meanTurningAngle", "squareDisplacement", "maxDisplacement", and "outreachRatio" using the optional `labels` argument to color points by their real celltype and `return.mapping=TRUE` to also return the data rather than just the plot: ```{r pca} pca <- trackFeatureMap( all.tracks, c(speed,meanTurningAngle,squareDisplacement, maxDisplacement,outreachRatio ), method = "PCA", labels = real.celltype, return.mapping = TRUE ) ``` Note that the B cells and Neutrophils are relatively well-separated from each other in this plot, but the T cells are hard to distinguish from neutrophils based on these features. We can then inspect the data stored in `pca`. This reveals, for example, that the first principal component is correlated with speed: ```{r inspectPC} pc1 <- pca[,1] pc2 <- pca[,2] track.speed <- sapply( all.tracks, speed ) cor.test( pc1, track.speed ) ``` See `?prcomp` for details on how principal components are computed, and for further arguments that can be passed on to `trackFeatureMap()`. ### 1.2 MDS Another popular method for visualization is multidimensional scaling (MDS), which is also supported by `trackFeatureMap()`: ```{r mds} trackFeatureMap( all.tracks, c(speed,meanTurningAngle,squareDisplacement,maxDisplacement, outreachRatio ), method = "MDS", labels = real.celltype ) ``` Internally, `trackFeatureMap()` computes a distance matrix using `dist()` and then applies MDS using `cmdscale()`. See the documentation of `cmdscale` for details and further arguments that can be passed on via `trackFeatureMap()`. Again, we find that the B cells and Neutrophils are separated in this plot, while T cells mix with neutrophils. ### 1.3 UMAP Uniform manifold approximate and projection (UMAP) is another powerful method to explore structure in high-dimensional datasets. `trackFeatureMap()` supports visualization of tracks in a UMAP. Note that this option requires the `uwot` package. Please install this package first using `install.packages("uwot")`. ```{r rspec, echo = FALSE, eval = matrix_irlba_clash} library( RSpectra ) ``` ```{r umap} trackFeatureMap( all.tracks, c(speed,meanTurningAngle,squareDisplacement, maxDisplacement,outreachRatio ), method = "UMAP", labels = real.celltype ) ``` # 3 Clustering: hierarchical clustering and k-means To go beyond visualizing similar and dissimilar tracks using multiple track features, `clusterTracks()` supports the explicit grouping of tracks into clusters using two common methods: hierarchical and k-means clustering. ### 3.1 Hierarchical clustering Hierarchical clustering is performed by calling `hclust()` on a distance matrix computed via `dist()` on the feature matrix: ```{r hclus, fig.width = 7.5, fig.height = 3} clusterTracks( all.tracks, c(speed,meanTurningAngle,squareDisplacement,maxDisplacement, outreachRatio ), method = "hclust", labels = real.celltype ) ``` See methods `dist()` and `hclust()` for details. ### 3.2 K-means clustering Secondly, `clusterTracks()` also supports k-means clustering of tracks. Note that this requires an extra argument `centers` that is passed on to the `kmeans()` function and specifies the number of clusters to make. In this case, let's use three clusters because we have three celltypes: ```{r kmean, fig.height = 8, fig.width = 7 } clusterTracks( all.tracks, c(speed,meanTurningAngle,squareDisplacement,maxDisplacement, outreachRatio ), method = "kmeans", labels = real.celltype, centers = 3 ) ``` In these plots, we see the value of each feature in the feature matrix plotted for the different clusters, whereas the color indicates the "real" celltype the track came from. ```{r resetpar, echo = FALSE} # Reset par() settings par(oldpar) ```