Introduction

In this vignette a PARAFAC model is created for the Fujita2023 data. This is done by first processing the count data. Subsequently, the appropriate number of components are determined. Then the PARAFAC model is created and visualized.

library(parafac4microbiome)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(ggpubr)

Processing the data cube

The data cube in Fujita2023$data contains unprocessed counts. The function processDataCube() performs the processing of these counts with the following steps:

It performs feature selection based on the sparsityThreshold setting. Sparsity is here defined as the fraction of samples where a microbial abundance (ASV/OTU or otherwise) is zero.
It performs a centered log-ratio transformation of each sample using the compositions::clr() function with a pseudo-count of one (on all features, prior to selection based on sparsity).
It centers and scales the three-way array. This is a complex subject, for which we refer to a paper by Rasmus Bro and Age Smilde. By centering across the subject mode, we make the subjects comparable to each other within each time point. Scaling within the feature mode avoids the PARAFAC model focusing on features with abnormally high variation.

The outcome of processing is a new version of the dataset called processedFujita. Please refer to the documentation of processDataCube() for more information.

processedFujita = processDataCube(Fujita2023, sparsityThreshold=0.99, CLR=TRUE, centerMode=1, scaleMode=2)

Determining the correct number of components

A critical aspect of PARAFAC modelling is to determine the correct number of components. We have developed the functions assessModelQuality() and assessModelStability() for this purpose. First, we will assess the model quality and specify the minimum and maximum number of components to investigate and the number of randomly initialized models to attempt for each number of components.

Note: this vignette reflects a minimum working example for analyzing this dataset due to computational limitations in automatic vignette rendering. Hence, we only look at 1-3 components with 5 random initializations each. These settings are not ideal for real datasets. Please refer to the documentation of assessModelQuality() for more information.

# Setup
minNumComponents = 1
maxNumComponents = 3
numRepetitions = 3 # number of randomly initialized models
numFolds = 4 # number of jack-knifed models
ctol = 1e-5
maxit = 200
numCores= 1

# Plot settings
colourCols = c("", "Genus", "")
legendTitles = c("", "Genus", "")
xLabels = c("Replicate", "Feature index", "Time point")
legendColNums = c(0,5,0)
arrangeModes = c(FALSE, TRUE, FALSE)
continuousModes = c(FALSE,FALSE,TRUE)

# Assess the metrics to determine the correct number of components
qualityAssessment = assessModelQuality(processedFujita$data, minNumComponents, maxNumComponents, numRepetitions, ctol=ctol, maxit=maxit, numCores=numCores)

The overview plot showcases the number of iterations, the sum-of-squared error, the CORCONDIA and the variance explained for 1-3 components.

qualityAssessment$plots$overview

The overview plots show that we can reach ~40% explained variation if we take 3 components. The CORCONDIA for those models are ~98, which is well above the minimum requirement of 60. Based on this overview, either 2 or 3 components seems fine.

Jack-knifed models

Next, we investigate the stability of the models when jack-knifing out samples using assessModelStability(). This will give us more information to choose between 2 or 3 components.

stabilityAssessment = assessModelStability(processedFujita, minNumComponents=1, maxNumComponents=3, numFolds=numFolds, considerGroups=FALSE,
                                           groupVariable="", colourCols, legendTitles, xLabels, legendColNums, arrangeModes,
                                           ctol=ctol, maxit=maxit, numCores=numCores)
stabilityAssessment$modelPlots[[1]]

stabilityAssessment$modelPlots[[2]]

stabilityAssessment$modelPlots[[3]]

The three-component model is stable and can be safely chosen as the final model.

Model selection

Since a three-component model is the most appropriate for the Fujita2023 dataset, we can now select one of the random initializations from the assessModelQuality() output as the final model. The selected model corresponds to the one that explained the largest amount of variation.

numComponents = 3
modelChoice = which(qualityAssessment$metrics$varExp[,numComponents] == max(qualityAssessment$metrics$varExp[,numComponents]))
finalModel = qualityAssessment$models[[numComponents]][[modelChoice]]

Finally, we visualize the model using plotPARAFACmodel().

plotPARAFACmodel(finalModel$Fac, processedFujita, 3, colourCols, legendTitles, xLabels, legendColNums, arrangeModes,
  continuousModes = c(FALSE,FALSE,TRUE),
  overallTitle = "Fujita PARAFAC model")

You will observe that the loadings for some modes in some components are all negative. This is due to sign flipping: two modes having negative loadings cancel out but describe the same thing as two positive loadings. The flipLoadings() function automatically performs this procedure and also sorts the components by how much variation they describe.

finalModel = flipLoadings(finalModel, processedFujita$data)

plotPARAFACmodel(finalModel$Fac, processedFujita, 3, colourCols, legendTitles, xLabels, legendColNums, arrangeModes,
  continuousModes = c(FALSE,FALSE,TRUE),
  overallTitle = "Fujita PARAFAC model")

Fujita2023_analysis

Introduction

Processing the data cube

Determining the correct number of components

Jack-knifed models

Model selection