%\VignetteIndexEntry{hhh4: An endemic-epidemic modelling framework for infectious disease counts} %\VignetteDepends{surveillance, Matrix} \documentclass[a4paper,11pt]{article} \usepackage[T1]{fontenc} \usepackage[english]{babel} \usepackage{graphicx} \usepackage{color} \usepackage{natbib} \usepackage{lmodern} \usepackage{bm} \usepackage{amsmath} \usepackage{amsfonts,amssymb} \setlength{\parindent}{0pt} \setcounter{secnumdepth}{1} \newcommand{\Po}{\operatorname{Po}} \newcommand{\NegBin}{\operatorname{NegBin}} \newcommand{\N}{\mathcal{N}} \newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}} \newcommand{\surveillance}{\pkg{surveillance}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\hhh}{\texttt{hhh4}} \newcommand{\R}{\textsf{R}} \newcommand{\sts}{\texttt{sts}} \newcommand{\example}[1]{\subsubsection*{Example: #1}} %%% Meta data \usepackage{hyperref} \hypersetup{ pdfauthor = {Michaela Paul and Sebastian Meyer}, pdftitle = {'hhh4': An endemic-epidemic modelling framework for infectious disease counts}, pdfsubject = {R package 'surveillance'} } \newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}} \title{\code{hhh4}: An endemic-epidemic modelling framework for infectious disease counts} \author{ Michaela Paul and Sebastian Meyer\thanks{Author of correspondence: \email{seb.meyer@fau.de} (new affiliation)}\\ Epidemiology, Biostatistics and Prevention Institute\\ University of Zurich, Zurich, Switzerland } \date{8 February 2016} %%% Sweave \usepackage{Sweave} \SweaveOpts{prefix.string=plots/hhh4, keep.source=T, strip.white=true} \definecolor{Sinput}{rgb}{0,0,0.56} \DefineVerbatimEnvironment{Sinput}{Verbatim}{formatcom={\color{Sinput}},fontshape=sl,fontsize=\footnotesize} \DefineVerbatimEnvironment{Soutput}{Verbatim}{fontshape=sl,fontsize=\footnotesize} %%% Initial R code <>= library("surveillance") options(width=75) ## create directory for plots dir.create("plots", showWarnings=FALSE) ###################################################### ## Do we need to compute or can we just fetch results? ###################################################### compute <- !file.exists("hhh4-cache.RData") message("Doing computations: ", compute) if(!compute) load("hhh4-cache.RData") @ \begin{document} \maketitle \begin{abstract} \noindent The \R\ package \surveillance\ provides tools for the visualization, modelling and monitoring of epidemic phenomena. This vignette is concerned with the \hhh\ modelling framework for univariate and multivariate time series of infectious disease counts proposed by \citet{held-etal-2005}, and further extended by \citet{paul-etal-2008}, \citet{paul-held-2011}, \citet{held.paul2012}, and \citet{meyer.held2013}. The implementation is illustrated using several built-in surveillance data sets. The special case of \emph{spatio-temporal} \hhh\ models is also covered in \citet[Section~5]{meyer.etal2014}, which is available as the extra \verb+vignette("hhh4_spacetime")+. \end{abstract} \section{Introduction}\label{sec:intro} To meet the threats of infectious diseases, many countries have established surveillance systems for the reporting of various infectious diseases. The systematic and standardized reporting at a national and regional level aims to recognize all outbreaks quickly, even when aberrant cases are dispersed in space. Traditionally, notification data, i.e.\ counts of cases confirmed according to a specific definition and reported daily, weekly or monthly on a regional or national level, are used for surveillance purposes. The \R-package \surveillance\ provides functionality for the retrospective modelling and prospective aberration detection in the resulting surveillance time series. Overviews of the outbreak detection functionality of \surveillance\ are given by \citet{hoehle-mazick-2010} and \citet{salmon.etal2014}. This document illustrates the functionality of the function \hhh\ for the modelling of univariate and multivariate time series of infectious disease counts. It is part of the \surveillance\ package as of version 1.3. The remainder of this vignette unfolds as follows: Section~\ref{sec:data} introduces the S4 class data structure used to store surveillance time series data within the package. Access and visualization methods are outlined by means of built-in data sets. In Section~\ref{sec:model}, the statistical modelling approach by \citet{held-etal-2005} and further model extensions are described. After the general function call and arguments are shown, the detailed usage of \hhh\ is demonstrated in Section~\ref{sec:hhh} using data introduced in Section~\ref{sec:data}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Surveillance data}\label{sec:data} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Denote by $\{y_{it}; i=1,\ldots,I,t=1,\ldots,T\}$ the multivariate time series of disease counts for a specific partition of gender, age and location. Here, $T$ denotes the length of the time series and $I$ denotes the number of units (e.g\ geographical regions or age groups) being monitored. Such data are represented using objects of the S4 class \sts\ (surveillance time series). \subsection[The sts data class]{The \sts\ data class} The \sts\ class contains the $T\times I$ matrix of counts $y_{it}$ in a slot \code{observed}. An integer slot \code{epoch} denotes the time index $1\leq t \leq T$ of each row in \code{observed}. The number of observations per year, e.g.\ 52 for weekly or 12 for monthly data, is denoted by \code{freq}. Furthermore, \code{start} denotes a vector of length two containing the start of the time series as \code{c(year, epoch)}. For spatially stratified time series, the slot \code{neighbourhood} denotes an $I \times I$ adjacency matrix with elements 1 if two regions are neighbors and 0 otherwise. For map visualizations, the slot \code{map} links the multivariate time series to geographical regions stored in a \code{"SpatialPolygons"} object (package \pkg{sp}). Additionally, the slot \code{populationFrac} contains a $T\times I$ matrix representing population fractions in unit $i$ at time $t$. The \sts\ data class is also described in \citet[Section~2.1]{hoehle-mazick-2010}, \citet[Section~1.1]{salmon.etal2014}, \citet[Section~5.2]{meyer.etal2014}, and on the associated help page \code{help("sts")}. \subsection{Some example data sets} The package \surveillance\ contains a number of time series in the \code{data} directory. Most data sets originate from the SurvStat@RKI database\footnote{\url{https://survstat.rki.de}}, maintained by the Robert Koch Institute (RKI) in Germany. Selected data sets will be analyzed in Section~\ref{sec:hhh} and are introduced in the following. Note that many of the built-in datasets are stored in the S3 class data structure \mbox{\code{disProg}} used in ancient versions of the \surveillance\ package (until 2006). They can be easily converted into the new S4 \sts\ data structure using the function \code{disProg2sts}. The resulting \sts\ object can be accessed similar as standard \code{matrix} objects and allows easy temporal and spatial aggregation as will be shown in the remainder of this section. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \example{Influenza and meningococcal disease, Germany, 2001--2006} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% As a first example, the weekly number of influenza and meningococcal disease cases in Germany is considered. <>= # load data data("influMen") # convert to sts class and print basic information about the time series print(fluMen <- disProg2sts(influMen)) @ The univariate time series of meningococcal disease counts can be obtained with <>= meningo <- fluMen[, "meningococcus"] dim(meningo) @ The \code{plot} function provides ways to visualize the multivariate time series in time, space and space-time, as controlled by the \code{type} argument: \setkeys{Gin}{width=1\textwidth} <>= plot(fluMen, type = observed ~ time | unit, # type of plot (default) same.scale = FALSE, # unit-specific ylim? col = "grey") # color of bars @ See \code{help("stsplot")} for a detailed description of the plot routines. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \example{Influenza, Southern Germany, 2001--2008} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The spatio-temporal spread of influenza in the 140 Kreise (districts) of Bavaria and Baden-W\"urttemberg is analyzed using the weekly number of cases reported to the RKI~\citep{survstat-fluByBw} in the years 2001--2008. An \sts\ object containing the data is created as follows: <>= # read in observed number of cases flu.counts <- as.matrix(read.table(system.file("extdata/counts_flu_BYBW.txt", package = "surveillance"), check.names = FALSE)) @ \begin{center} \setkeys{Gin}{width=.5\textwidth} <>= # read in 0/1 adjacency matrix (1 if regions share a common border) nhood <- as.matrix(read.table(system.file("extdata/neighbourhood_BYBW.txt", package = "surveillance"), check.names = FALSE)) library("Matrix") print(image(Matrix(nhood))) @ \end{center} <>= # read in population fractions popfracs <- read.table(system.file("extdata/population_2001-12-31_BYBW.txt", package = "surveillance"), header = TRUE)$popFrac # create sts object flu <- sts(flu.counts, start = c(2001, 1), frequency = 52, population = popfracs, neighbourhood = nhood) @ These data are already included as \code{data("fluBYBW")} in \surveillance. In addition to the \sts\ object created above, \code{fluBYBW} contains a map of the administrative districts of Bavaria and Baden-W\"urttemberg. This works by specifying a \code{"SpatialPolygons"} representation of the districts as an extra argument \code{map} in the above \sts\ call. Such a \code{"SpatialPolygons"} object can be obtained from, e.g, an external shapefile using the \pkg{sf} functions \code{st\_read} followed by \code{as\_Spatial}. A map enables plots and animations of the cumulative number of cases by region. For instance, a disease incidence map of the year 2001 can be obtained as follows: \setkeys{Gin}{width=.5\textwidth} \begin{center} <>= data("fluBYBW") plot(fluBYBW[year(fluBYBW) == 2001, ], # select year 2001 type = observed ~ unit, # total counts by region population = fluBYBW@map$X31_12_01 / 100000, # per 100000 inhabitants colorkey = list(title = "Incidence [per 100'000 inhabitants]")) @ \end{center} <>= # consistency check local({ fluBYBW@map <- flu@map stopifnot(all.equal(fluBYBW, flu)) }) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \example{Measles, Germany, 2005--2007} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The following data set contains the weekly number of measles cases in the 16 German federal states, in the years 2005--2007. These data have been analyzed by \citet{herzog-etal-2010} after aggregation into bi-weekly periods. <>= data("measlesDE") measles2w <- aggregate(measlesDE, nfreq = 26) @ \setkeys{Gin}{width=.75\textwidth} \begin{center} <>= plot(measles2w, type = observed ~ time, # aggregate counts over all units main = "Bi-weekly number of measles cases in Germany") @ \end{center} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Model formulation}\label{sec:model} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Retrospective surveillance aims to identify outbreaks and (spatio-)temporal patterns through statistical modelling. Motivated by a branching process with immigration, \citet{held-etal-2005} suggest the following model for the analysis of univariate time series of infectious disease counts $\{y_{t}; t=1,\ldots,T\}$. The counts are assumed to be Poisson distributed with conditional mean \begin{align*} \mu_{t} = \lambda y_{t-1}+ \nu_{t}, \quad(\lambda,\nu_{t}>0) \end{align*} where $\lambda$ and $\nu_t$ are unknown quantities. The mean incidence is decomposed additively into two components: an epidemic or \emph{autoregressive} component $\lambda y_{t-1}$, and an \emph{endemic} component $\nu_t$. The former should be able to capture occasional outbreaks whereas the latter explains a baseline rate of cases with stable temporal pattern. \citet{held-etal-2005} suggest the following parametric model for the endemic component: \begin{align}\label{eq:nu_t} \log(\nu_t) =\alpha + \beta t + \left\{\sum_{s=1}^S \gamma_s \sin(\omega_s t) + \delta_s \cos(\omega_s t)\right\}, \end{align} where $\alpha$ is an intercept, $\beta$ is a trend parameter, and the terms in curly brackets are used to model seasonal variation. Here, $\gamma_s$ and $\delta_s$ are unknown parameters, $S$ denotes the number of harmonics to include, and $\omega_s=2\pi s/$\code{freq} are Fourier frequencies (e.g.\ \code{freq = 52} for weekly data). For ease of interpretation, the seasonal terms in \eqref{eq:nu_t} can be written equivalently as \begin{align*} \gamma_s \sin(\omega_s t) + \delta_s \cos(\omega_s t)= A_s \sin(\omega_s t +\varphi_s) \end{align*} with amplitude $A_s=\sqrt{\gamma_s^2+\delta_s^2}$ describing the magnitude, and phase difference $\tan(\varphi_s)=\delta_s/\gamma_s$ describing the onset of the sine wave. To account for overdispersion, the Poisson model may be replaced by a negative binomial model. Then, the conditional mean $\mu_t$ remains the same but the conditional variance increases to $\mu_t (1+\mu_t \psi)$ with additional unknown overdispersion parameter $\psi>0$. The model is extended to multivariate time series $\{y_{it}\}$ in \citet{held-etal-2005} and \citet{paul-etal-2008} by including an additional \emph{neighbor-driven} component, where past cases in other (neighboring) units also enter as explanatory covariates. The conditional mean $\mu_{it}$ is then given by \begin{align} \label{eq:mu_it} \mu_{it} = \lambda y_{i,t-1} + \phi \sum_{j\neq i} w_{ji} y_{j,t-1} +e_{it} \nu_{t}, \end{align} where the unknown parameter $\phi$ quantifies the influence of other units $j$ on unit $i$, $w_{ji}$ are weights reflecting between-unit transmission and $e_{it}$ corresponds to an offset (such as population fractions at time $t$ in region $i$). A simple choice for the weights is $w_{ji}=1$ if units $j$ and $i$ are adjacent and 0 otherwise. See \citet{paul-etal-2008} for a discussion of alternative weights, and \citet{meyer.held2013} for how to estimate these weights in the spatial setting using a parametric power-law formulation based on the order of adjacency. When analyzing a specific disease observed in, say, multiple regions or several pathogens (such as influenza and meningococcal disease), the assumption of equal incidence levels or disease transmission across units is questionable. To address such heterogeneity, the unknown quantities $\lambda$, $\phi$, and $\nu_t$ in \eqref{eq:mu_it} may also depend on unit $i$. This can be done via \begin{itemize} \item unit-specific fixed parameters, e.g.\ $\log(\lambda_i)=\alpha_i$ \citep{paul-etal-2008}; \item unit-specific random effects, e.g\ $\log(\lambda_i)=\alpha_0 +a_i$, $a_i \stackrel{\text{iid}}{\sim} \N(0,\sigma^2_\lambda)$ \citep{paul-held-2011}; \item linking parameters with known (possibly time-varying) explanatory variables, e.g.\ $\log(\lambda_i)=\alpha_0 +x_i\alpha_1$ with region-specific vaccination coverage $x_i$ \citep{herzog-etal-2010}. \end{itemize} In general, the parameters of all three model components may depend on both time and unit. A call to \hhh\ fits a Poisson or negative binomial model with conditional mean \begin{align*} \mu_{it} = \lambda_{it} y_{i,t-1} + \phi_{it} \sum_{j\neq i} w_{ji} y_{j,t-1} +e_{it} \nu_{it} \end{align*} to a (multivariate) time series of counts. Here, the three unknown quantities are modelled as log-linear predictors \begin{align} \log(\lambda_{it}) &= \alpha_0 + a_i +\bm{u}_{it}^\top \bm{\alpha} \tag{\code{ar}}\\ \log(\phi_{it}) &= \beta_0 + b_i +\bm{x}_{it}^\top \bm{\beta} \tag{\code{ne}}\\ \log(\nu_{it}) &= \gamma_0 + c_i +\bm{z}_{it}^\top \bm{\gamma}\tag{\code{end}} \end{align} where $\alpha_0,\beta_0,\gamma_0$ are intercepts, $\bm{\alpha},\bm{\beta},\bm{\gamma}$ are vectors of unknown parameters corresponding to covariate vectors $\bm{u}_{it},\bm{x}_{it},\bm{z}_{it}$, and $a_i,b_i,c_i$ are random effects. For instance, model~\eqref{eq:nu_t} with $S=1$ seasonal terms may be represented as $\bm{z}_{it}=(t,\sin(2\pi/\code{freq}\;t),\cos(2\pi/\code{freq}\;t))^\top$. The stacked vector of all random effects is assumed to follow a normal distribution with mean $\bm{0}$ and covariance matrix $\bm{\Sigma}$. In applications, each of the components \code{ar}, \code{ne}, and \code{end} may be omitted in parts or as a whole. If the model does not contain random effects, standard likelihood inference can be performed. Otherwise, inference is based on penalized quasi-likelihood as described in detail in \citet{paul-held-2011}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Function call and control settings}\label{sec:hhh} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% The estimation procedure is called with <>= hhh4(sts, control) @ where \code{sts} denotes a (multivariate) surveillance time series and the model is specified in the argument \code{control} in consistency with other algorithms in \surveillance. The \code{control} setting is a list of the following arguments (here with default values): <>= control = list( ar = list(f = ~ -1, # formula for log(lambda_it) offset = 1), # optional multiplicative offset ne = list(f = ~ -1, # formula for log(phi_it) offset = 1, # optional multiplicative offset weights = neighbourhood(stsObj) == 1), # (w_ji) matrix end = list(f = ~ 1, # formula for log(nu_it) offset = 1), # optional multiplicative offset e_it family = "Poisson", # Poisson or NegBin model subset = 2:nrow(stsObj), # subset of observations to be used optimizer = list(stop = list(tol = 1e-5, niter = 100), # stop rules regression = list(method = "nlminb"), # for penLogLik variance = list(method = "nlminb")), # for marLogLik verbose = FALSE, # level of progress reporting start = list(fixed = NULL, # list with initial values for fixed, random = NULL, # random, and sd.corr = NULL), # variance parameters data = list(t = epoch(stsObj)-1),# named list of covariates keep.terms = FALSE # whether to keep the model terms ) @ The first three arguments \code{ar}, \code{ne}, and \code{end} specify the model components using \code{formula} objects. By default, the counts $y_{it}$ are assumed to be Poisson distributed, but a negative binomial model can be chosen by setting \mbox{\code{family = "NegBin1"}}. By default, both the penalized and marginal log-likelihoods are maximized using the quasi-Newton algorithm available via the \R\ function \code{nlminb}. The methods from \code{optim} may also be used, e.g., \mbox{\code{optimizer = list(variance = list(method="Nelder-Mead")}} is a useful alternative for maximization of the marginal log-likelihood with respect to the variance parameters. Initial values for the fixed, random, and variance parameters can be specified in the \code{start} argument. If the model contains covariates, these have to be provided in the \code{data} argument. If a covariate does not vary across units, it may be given as a vector of length $T$. Otherwise, covariate values must be given in a matrix of size $T \times I$. In the following, the functionality of \hhh\ is demonstrated using the data sets introduced in Section~\ref{sec:data} and previously analyzed in \citet{paul-etal-2008}, \citet{paul-held-2011} and \citet{herzog-etal-2010}. Selected results are reproduced. For a thorough discussion we refer to these papers. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Univariate modelling} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% As a first example, consider the univariate time series of meningococcal infections in Germany, 01/2001--52/2006 \citep[cf.][Table~1]{paul-etal-2008}. A Poisson model without autoregression and $S=1$ seasonal term is specified as follows: <>= # specify a formula object for the endemic component ( f_S1 <- addSeason2formula(f = ~ 1, S = 1, period = 52) ) # fit the Poisson model result0 <- hhh4(meningo, control = list(end = list(f = f_S1), family = "Poisson")) summary(result0) @ To fit the corresponding negative binomial model, we can use the convenient \code{update} method: <>= result1 <- update(result0, family = "NegBin1") @ Note that the \code{update} method by default uses the parameter estimates from the original model as start values when fitting the updated model; see \code{help("update.hhh4")} for details. We can calculate Akaike's Information Criterion for the two models to check whether accounting for overdispersion is useful for these data: <<>>= AIC(result0, result1) @ Due to the default control settings with \verb|ar = list(f = ~ -1)|, the autoregressive component has been omitted in the above models. It can be included by the following model update: <>= # fit an autoregressive model result2 <- update(result1, ar = list(f = ~ 1)) @ To extract only the ML estimates and standard errors instead of a full model \code{summary}, the \code{coef} method can be used: <<>>= coef(result2, se = TRUE, # also return standard errors amplitudeShift = TRUE, # transform sine/cosine coefficients # to amplitude/shift parameters idx2Exp = TRUE) # exponentiate remaining parameters @ Here, \code{exp(ar.1)} is the autoregressive coefficient $\lambda$ and can be interpreted as the epidemic proportion of disease incidence \citep{held.paul2012}. Note that the above transformation arguments \code{amplitudeShift} and \code{idx2Exp} can also be used in the \code{summary} method. Many other standard methods are implemented for \code{"hhh4"} fits, see, e.g., \code{help("confint.hhh4")}. A plot of the fitted model components can be easily obtained: \begin{center} <>= plot(result2) @ \end{center} See the comprehensive \code{help("plot.hhh4")} for further options. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Bivariate modelling} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Now, the weekly numbers of both meningococcal disease (\textsc{MEN}) and influenza (\textsc{FLU}) cases are analyzed to investigate whether influenza infections predispose meningococcal disease \citep[cf.][Table~2]{paul-etal-2008}. This requires disease-specific parameters which are specified in the formula object with \code{fe(\ldots)}. In the following, a negative binomial model with mean \begin{align*} \binom{\mu_{\text{men},t}} {\mu_{\text{flu},t}}= \begin{pmatrix} \lambda_\text{men} & \phi \\ 0 & \lambda_\text{flu} \\ \end{pmatrix} \binom{\text{\sc men}_{t-1}}{\text{\sc flu}_{t-1}} + \binom{\nu_{\text{men},t}}{\nu_{\text{flu},t}}\,, \end{align*} where the endemic component includes $S=3$ seasonal terms for the \textsc{FLU} data and $S=1$ seasonal terms for the \textsc{MEN} data is considered. Here, $\phi$ quantifies the influence of past influenza cases on the meningococcal disease incidence. This model corresponds to the second model of Table~2 in \citet{paul-etal-2008} and is fitted as follows: <>= # no "transmission" from meningococcus to influenza neighbourhood(fluMen)["meningococcus","influenza"] <- 0 neighbourhood(fluMen) @ <>= # create formula for endemic component f.end <- addSeason2formula(f = ~ -1 + fe(1, unitSpecific = TRUE), # disease-specific intercepts S = c(3, 1), # S = 3 for flu, S = 1 for men period = 52) # specify model m <- list(ar = list(f = ~ -1 + fe(1, unitSpecific = TRUE)), ne = list(f = ~ 1, # phi, only relevant for meningococcus due to weights = neighbourhood(fluMen)), # the weight matrix end = list(f = f.end), family = "NegBinM") # disease-specific overdispersion # fit model result <- hhh4(fluMen, control = m) summary(result, idx2Exp=1:3) @ A plot of the estimated mean components can be obtained as follows: \setkeys{Gin}{width=1\textwidth} \begin{center} <>= plot(result, units = NULL, pch = 20, legend = 2, legend.args = list( legend = c("influenza-driven", "autoregressive", "endemic"))) @ \end{center} Alternatively, use the \code{decompose} argument to show the unit-specific contributions to the fitted mean: \begin{center} <>= plot(result, units = NULL, pch = 20, legend = 2, decompose = TRUE, col = c(7, 4)) @ \end{center} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Multivariate modelling} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% For disease counts observed in a large number of regions, say, (i.e.\ highly multivariate time series of counts) the use of region-specific parameters to account for regional heterogeneity is no longer feasible as estimation and identifiability problems may occur. Here we illustrate two approaches: region-specific random effects and region-specific covariates. For a more detailed illustration of areal \code{hhh4} models, see \verb+vignette("hhh4_spacetime")+, which uses \verb+data("measlesWeserEms")+ as an example. \subsubsection*{Influenza, Southern Germany, 2001--2008} \citet{paul-held-2011} propose a random effects formulation to analyze the weekly number of influenza cases in \Sexpr{ncol(fluBYBW)} districts of Southern Germany. For example, consider a model with random intercepts in the endemic component: $c_i \stackrel{iid}{\sim} \N(0,\sigma^2_\nu), i=1,\ldots,I$. Such effects are specified as: <>= f.end <- ~ -1 + ri(type = "iid", corr = "all") @ The alternative \code{type = "car"} would assume spatially correlated random effects; see \citet{paul-held-2011} for details. The argument \code{corr = "all"} allows for correlation between region-specific random effects in different components, e.g., random incidence levels $c_i$ in the endemic component and random effects $b_i$ in the neighbor-driven component. The following call to \hhh\ fits such a random effects model with linear trend and $S=3$ seasonal terms in the endemic component, a fixed autoregressive parameter $\lambda$, and first-order transmission weights $w_{ji}=\mathbb{I}(j\sim i)$ -- normalized such that $\sum_i w_{ji} = 1$ for all rows $j$ -- to the influenza data \citep[cf.][Table~3, model~B2]{paul-held-2011}. <>= # endemic component: iid random effects, linear trend, S=3 seasonal terms f.end <- addSeason2formula(f = ~ -1 + ri(type="iid", corr="all") + I((t-208)/100), S = 3, period = 52) # model specification model.B2 <- list(ar = list(f = ~ 1), ne = list(f = ~ -1 + ri(type="iid", corr="all"), weights = neighbourhood(fluBYBW), normalize = TRUE), # all(rowSums(weights) == 1) end = list(f = f.end, offset = population(fluBYBW)), family = "NegBin1", verbose = TRUE, optimizer = list(variance = list(method = "Nelder-Mead"))) # default start values for random effects are sampled from a normal set.seed(42) @ <>= if(compute){ result.B2 <- hhh4(fluBYBW, model.B2) s.B2 <- summary(result.B2, maxEV = TRUE, idx2Exp = 1:3) #pred.B2 <- oneStepAhead(result.B2, tp = nrow(fluBYBW) - 2*52) predfinal.B2 <- oneStepAhead(result.B2, tp = nrow(fluBYBW) - 2*52, type = "final") meanSc.B2 <- colMeans(scores(predfinal.B2)) save(s.B2, meanSc.B2, file="hhh4-cache.RData") } @ <>= # fit the model (takes about 35 seconds) result.B2 <- hhh4(fluBYBW, model.B2) summary(result.B2, maxEV = TRUE, idx2Exp = 1:3) @ <>= s.B2 @ Model choice based on information criteria such as AIC or BIC is well explored and understood for models that correspond to fixed-effects likelihoods. However, in the presence of random effects their use can be problematic. For model selection in time series models, the comparison of successive one-step-ahead forecasts with the actually observed data provides a natural alternative. In this context, \citet{gneiting-raftery-2007} recommend the use of strictly proper scoring rules, such as the logarithmic score (logs) or the ranked probability score (rps). See \citet{czado-etal-2009} and \citet{paul-held-2011} for further details. One-step-ahead predictions for the last 2 years for model B2 could be obtained as follows: <>= pred.B2 <- oneStepAhead(result.B2, tp = nrow(fluBYBW) - 2*52) @ However, computing ``rolling'' one-step-ahead predictions from a random effects model is computationally expensive, since the model needs to be refitted at every time point. The above call would take approximately 45 minutes! So for the purpose of this vignette, we use the fitted model based on the whole time series to compute all (fake) predictions during the last two years: <>= predfinal.B2 <- oneStepAhead(result.B2, tp = nrow(fluBYBW) - 2*52, type = "final") @ The mean scores (logs and rps) corresponding to this set of predictions can then be computed as follows: <>= colMeans(scores(predfinal.B2, which = c("logs", "rps"))) @ <>= meanSc.B2[c("logs", "rps")] @ Using predictive model assessments, \citet{meyer.held2013} found that power-law transmission weights more appropriately reflect the spread of influenza than the previously used first-order weights (which actually allow the epidemic to spread only to directly adjacent districts within one week). These power-law weights can be constructed by the function \code{W\_powerlaw} and require the \code{neighbourhood} of the \sts\ object to contain adjacency orders. The latter can be easily obtained from the binary adjacency matrix using the function \code{nbOrder}. See the corresponding help pages or \citet[Section~5]{meyer.etal2014} for illustrations. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsubsection*{Measles, German federal states, 2005--2007} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <>= data("MMRcoverageDE") cardVac1 <- MMRcoverageDE[1:16,3:4] adjustVac <- function(cardVac, p=0.5, nrow=1){ card <- cardVac[,1] vac <- cardVac[,2] vacAdj <- vac*card + p*vac*(1-card) return(matrix(vacAdj,nrow=nrow, ncol=length(vacAdj), byrow=TRUE)) } vac0 <- 1 - adjustVac(cardVac1, p=0.5, nrow=frequency(measles2w)*3) colnames(vac0) <- colnames(measles2w) @ As a last example, consider the number of measles cases in the 16 federal states of Germany, in the years 2005--2007. There is considerable regional variation in the incidence pattern which is most likely due to differences in vaccination coverage. In the following, information about vaccination coverage in each state, namely the log proportion of unvaccinated school starters, is included as explanatory variable in a model for the bi-weekly aggregated measles data. See \citet{herzog-etal-2010} for further details. Vaccination coverage levels for the year 2006 are available in the dataset \code{MMRcoverageDE}. This dataset can be used to compute the $\Sexpr{nrow(vac0)}\times \Sexpr{ncol(vac0)}$ matrix \code{vac0} with adjusted proportions of unvaccinated school starters in each state $i$ used by \citet{herzog-etal-2010}. The first few entries of this matrix are shown below: <<>>= vac0[1:2, 1:6] @ We fit a Poisson model, which links the autoregressive parameter with this covariate and contains $S=1$ seasonal term in the endemic component \citep[cf.][Table~3, model~A0]{herzog-etal-2010}: <>= # endemic component: Intercept + sine/cosine terms f.end <- addSeason2formula(f = ~ 1, S = 1, period = 26) # autoregressive component: Intercept + vaccination coverage information model.A0 <- list(ar = list(f = ~ 1 + logVac0), end = list(f = f.end, offset = population(measles2w)), data = list(t = epoch(measles2w), logVac0 = log(vac0))) # fit the model result.A0 <- hhh4(measles2w, model.A0) summary(result.A0, amplitudeShift = TRUE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Conclusion} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% As part of the \R~package \surveillance, the function \hhh\ provides a flexible tool for the modelling of multivariate time series of infectious disease counts. The presented count data model is able to account for serial and spatio-temporal correlation, as well as heterogeneity in incidence levels and disease transmission. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \bibliographystyle{apalike} \renewcommand{\bibfont}{\small} \bibliography{references} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{document}