% $Id: speed.Rnw 1796 2015-11-12 13:10:20Z sgaure $
\documentclass{amsproc}
%\VignetteIndexEntry{Convergence rate with examples}
%\VignetteKeyword{Multiple Fixed Effects}
%\VignetteEngine{knitr::knitr}
\usepackage[utf8]{inputenc}
\newcommand{\strong}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\let\pkg=\strong
\makeatletter
\newcommand\code{\bgroup\@codex}
\def\@codex#1{{\normalfont\ttfamily\hyphenchar\font=-1 #1}\egroup}
\makeatother
\newcommand{\bma}{\begin{bmatrix}}
\newcommand{\ema}{\end{bmatrix}}
\newcommand{\RN}{\mathbb{R}}

\title{Convergence rate examples and theory}
\author{Simen Gaure}
\address{Ragnar Frisch Centre for Economic Research, Oslo, Norway}
\date{March 26, 2013}

\begin{document}
\begin{abstract}
If you use \pkg{lfe} for various tasks, you will notice that some
estimations converge quickly, whereas others converge slowly.
Analysing the convergence rate of the methods used by \pkg{lfe} is
not a walk in the park; here are some examples.
\end{abstract}
\maketitle

\section{Introduction}
The method employed by \pkg{lfe} is \emph{the method of alternating
projections} (\cite{Halp62}). The link to this method comes from
viewing the demeaning of a vector as a \emph{projection}
(\cite{GS13}), i.e.\ a linear operator \(P\) with the property
\(P^2 = P = P^t\). The Kaczmarz method used by \pkg{lfe} to solve the
resulting sparse system is also a variant of alternating projections,
though \(P\) is then an affine projection with \(P^2 = P\), not a
linear operator. That is, if \(P_i\) is the projection which demeans
factor \(i\), the projection onto the intersection of their ranges,
i.e.\ the operator which demeans all \(e\) factors at once, is the
limit \(\lim_{n\to\infty} (P_1 P_2 \cdots P_e)^n\). We can therefore
demean a vector \(x\) by iterating the operator
\(T = P_1 P_2 \cdots P_e\) on it, computing \(T^n x\). The
convergence rate has been analysed in \cite{DH97} and references
cited therein; there are also newer elaborations, e.g.\ in
\cite{BGM12}. The convergence rate theory is phrased in terms of
generalized angles between subspaces, i.e.\ between the ranges of the
projections. The angle concept may not be immediately intuitive to
practitioners of linear regression methods, so let's have a look at
some examples.
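To make the iteration concrete, here is first a minimal sketch of it
in plain R. The helper \code{center} and the toy factors \code{g1}
and \code{g2} are invented for this illustration only; \pkg{lfe}
performs the same iteration in compiled code, in \code{demeanlist},
with a proper convergence criterion.
<<>>=
# minimal sketch of T^n x: 'center' plays the role of P_i, i.e. it
# demeans v within the levels of the factor f
center <- function(v, f) v - ave(v, f)
g1 <- factor(rep(1:3, length.out = 12))
g2 <- factor(rep(1:4, each = 3))
v <- rnorm(12)
for (i in 1:100) v <- center(center(v, g1), g2)  # apply T = P_1 P_2
# v is now (approximately) centred on both factors
c(max(abs(tapply(v, g1, mean))), max(abs(tapply(v, g2, mean))))
@
In \pkg{lfe}, \code{felm} applies this kind of iteration to the
response and the covariates before estimating the reduced model.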
\section{Examples}
Our first example has two factors; they are independent of each
other and of quite high cardinality.
<<echo=FALSE>>=
cores <- as.integer(Sys.getenv("SG_RUN"))
if (is.na(cores)) options(lfe.threads = 1)
@
<<>>=
library(lfe)
set.seed(42)
x <- rnorm(100000)
f1 <- sample(10000, length(x), replace = TRUE)
f2 <- sample(10000, length(x), replace = TRUE)
y <- x + cos(f1) + log(f2 + 1) + rnorm(length(x), sd = 0.5)
@
We time the first step:
<<>>=
system.time(est <- felm(y ~ x | f1 + f2))
@
and the second step:
<<>>=
system.time(alpha <- getfe(est))
@
We see that there is nothing to complain about in terms of speed.
After all, there are 20,000 dummies in this model. Now we let
\code{f2} have fewer levels:
<<>>=
f2 <- sample(300, length(x), replace = TRUE)
y <- x + cos(f1) + log(f2 + 1) + rnorm(length(x), sd = 0.5)
system.time(est <- felm(y ~ x | f1 + f2))
system.time(alpha <- getfe(est))
@
Not much happens; the second step is apparently faster, whereas the
first takes about the same time. Note that much of the time in the
first step is spent on mundane bookkeeping, such as creating a model
matrix, not on the demeaning as such.

Now we make the fixed effects dependent. We do this by ensuring that
for each value of \code{f1} there are at most 10 different values of
\code{f2}. We keep the number of levels in \code{f2} at 300, as in
the previous example. The size of this problem is exactly the same
as in the previous example, but the factor structure is very
different.
<<>>=
f2 <- (f1 + sample(10, length(x), replace = TRUE)) %% 300
y <- x + cos(f1) + log(f2 + 1) + rnorm(length(x), sd = 0.5)
system.time(est <- felm(y ~ x | f1 + f2))
system.time(alpha <- getfe(est))
@
The estimation now takes a lot more time. Indeed, in this case it is
worthwhile to consider coding \code{f2} as 300 dummies, i.e.\ as an
ordinary factor, even though there are now \(\approx 300\) vectors to
centre, as opposed to two in the previous examples.
<<>>=
system.time(est <- felm(y ~ x + factor(f2) | f1))
system.time(alpha <- getfe(est))
@
We could be tempted to believe that this is the whole story, but it
is not. We may create another example in which each value of
\code{f1} still has at most 10 different values of \code{f2}, but
where the overlap is different. In the above example, an observation
with \code{f1=1} will have \code{f2} drawn from
\{2,3,4,\ldots,11\}, whereas an observation with \code{f1=2} will
have \code{f2} in \{3,4,5,\ldots,12\}, and so on; i.e.\ there is
considerable overlap. We reduce this overlap by creating \code{f2}
with a little twist: cubing the draws from \code{1:10} introduces
some unevenly sized ``holes''.
<<>>=
f2 <- (f1 + sample(10, length(x), replace = TRUE)^3) %% 300
y <- x + cos(f1) + log(f2 + 1) + rnorm(length(x), sd = 0.5)
system.time(est <- felm(y ~ x | f1 + f2))
system.time(alpha <- getfe(est))
nlevels(est[["cfactor"]])
@
We are now back at approximately the same fast convergence as before.
That is, the convergence rate is not a simple function of data size,
number of levels and so on, even when there are only two factors.

\subsection{Why?}
In the examples above we have played with the structure of the
bipartite graph mentioned in \cite{GS13}, indicating that the
convergence rate is not a function solely of the average degree and
the number of connected components of the graph. Intuitively (to the
author), the graph determines the convergence rate, at least in the
operator norm topology, but it is unclear whether the rate can be
described in simple graph-theoretic terms, let alone in terms of some
intuitive relations between the dummy variables. One could speculate
that convergence is related to how tightly connected the graph is;
one such measure is the diameter of the graph, i.e.\ the largest of
the shortest-path distances between pairs of vertices.
<<message=FALSE>>=
library(igraph)
mkgraph <- function(f1, f2) {
  graph.edgelist(cbind(paste("f1", f1), paste("f2", f2)), directed = FALSE)
}
appxdiam <- function(g) max(shortest.paths(g, sample(V(g), 10), sample(V(g), 10)))
f2 <- sample(10000, length(x), replace = TRUE)
appxdiam(mkgraph(f1, f2))
f2 <- (f1 + sample(5, length(x), replace = TRUE)^3) %% 300
appxdiam(mkgraph(f1, f2))
f2 <- (f1 + sample(5, length(x), replace = TRUE)) %% 300
appxdiam(mkgraph(f1, f2))
@
Loosely speaking, data with ``six degrees of separation'' seem to
converge faster than less well-connected networks. Informal trials,
using \code{demeanlist} instead of \code{felm} to avoid overhead,
show that the convergence time grows roughly as the square of the
diameter, also in more varied graphs than the ones above. However,
the author has not had the time to attempt to prove such an
assertion.
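For the record, an informal trial of this kind may be set up as
follows. This is only a sketch of the idea: we time
\code{demeanlist} directly on a well-connected and a less
well-connected structure from the examples above, reusing \code{x}
and \code{f1}; the names \code{f2fast} and \code{f2slow} are made up
for this illustration.
<<>>=
# time the demeaning alone, bypassing felm's model-matrix bookkeeping
f2fast <- factor(sample(10000, length(x), replace = TRUE))             # small diameter
f2slow <- factor((f1 + sample(10, length(x), replace = TRUE)) %% 300)  # large diameter
system.time(demeanlist(list(x), list(factor(f1), f2fast)))
system.time(demeanlist(list(x), list(factor(f1), f2slow)))
@
Since everything except the factor structure is identical here, the
difference in timing is attributable to the number of iterations
needed in the demeaning.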
A special case of this dependence on the graph occurs when the second
factor is an interaction of several factors, constructed in such a
way that there is multicollinearity by design. This does not
introduce any bias problems, but the speed can often be improved
dramatically by removing the known multicollinearities in advance, by
collapsing the appropriate levels into one; this diminishes the
diameter of the graph. Note that the convergence rate in practice is
also a function of the vector to be centred; this can be seen from
the trivial example in which the vector is already centred, so that
convergence is attained after a single iteration.

It is a sad fact of life, shown in \cite{DH97}, that when there are
more than two factors, the convergence rate cannot be determined
solely by considering the angles between \emph{pairs} of subspaces.
In \cite{BGM12}, a generalized angle for a finite set of subspaces is
introduced for the purpose of analysis. It is left to the reader to
imagine some simple examples with more than two factors. A real
example with three factors can be found in \cite{TPAG13}, and real
examples with up to five factors are treated in the appendix of
\cite{GS13}, as well as in \cite{MR12}.

\section{Acceleration schemes}
One point to note is that with more than two factors, even their
\emph{order} may be important for the convergence rate. This has been
utilized to speed up convergence in some cases, by picking a random
order of the projections for each iteration, or by sampling randomly
from the projections. One such scheme is used in \cite{SV09}, and
randomization is also treated in \cite{BGM12} and the papers cited
there. Random shuffling of the equations is now used in the Kaczmarz
step of \pkg{lfe}, i.e.\ in \code{getfe}, as suggested in
\cite{GS13}.

Another acceleration scheme is found in \cite{GK89}, where a line
search is used in each iteration. That is, when a single iteration
transforms our vector as \(\eta \mapsto \eta'\), it is fairly easy to
find the exact point on the line
\(\{\eta + t(\eta' - \eta)\, :\, t \in \RN\}\) which is closest to
the solution; see the sketch at the end of this section. \pkg{lfe}
uses this line search in \code{felm} (or rather, in
\code{demeanlist}). However, it is noted in \cite{BDHP03} that this
method does not always accelerate convergence.

After publishing \cite{GS13}, the author was made aware that exactly
the same method of alternating projections had been arrived at in
\cite[p.~637]{GP10}, though as a technical simplification of the
Gauss-Seidel method of iterated regressions. It is possible that
acceleration schemes for the Gauss-Seidel method are applicable as
well, though the author has not yet looked into it.
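To illustrate the line search idea of \cite{GK89}, here is a small
sketch in plain R, reusing the toy \code{center} function and the
factors \code{g1}, \code{g2} from the introduction; the function
\code{gkstep} is invented for this illustration, it is not
\pkg{lfe}'s actual implementation. The step length formula below
follows from the fact that \(\eta - \eta'\) is orthogonal to the
solution when the \(P_i\) are orthogonal projections whose ranges
contain it.
<<>>=
# one line-search iteration: move to the point on the line through
# eta and eta' = T eta which is closest to the solution
gkstep <- function(eta) {
  etap <- center(center(eta, g1), g2)  # eta' = T eta
  d <- eta - etap
  if (all(d == 0)) return(eta)         # already converged
  t <- sum(eta * d) / sum(d^2)         # optimal step length
  eta - t * d                          # i.e. eta + t*(eta' - eta)
}
eta <- rnorm(12)
for (i in 1:20) eta <- gkstep(eta)
max(abs(c(tapply(eta, g1, mean), tapply(eta, g2, mean))))
@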
\iffalse
\bibliographystyle{amsplain}
\bibliography{biblio}
\else
\providecommand{\bysame}{\leavevmode\hbox to3em{\hrulefill}\thinspace}
\providecommand{\MR}{\relax\ifhmode\unskip\space\fi MR }
% \MRhref is called by the amsart/book/proc definition of \MR.
\providecommand{\MRhref}[2]{%
  \href{http://www.ams.org/mathscinet-getitem?mr=#1}{#2}
}
\providecommand{\href}[2]{#2}
\begin{thebibliography}{10}

\bibitem{BGM12} C.~Badea, S.~Grivaux, and V.~M{\"u}ller, \emph{The
rate of convergence in the method of alternating projections},
St.~Petersburg Mathematical Journal \textbf{23} (2012), 413--434,
preprint in arXiv:1006.2047.

\bibitem{BDHP03} Heinz~H. Bauschke, Frank Deutsch, Hein Hundal, and
Sung-Ho Park, \emph{Accelerating the convergence of the method of
alternating projections}, Trans. Amer. Math. Soc. \textbf{355}
(2003), no.~9, 3433--3461.

\bibitem{DH97} F.~Deutsch and H.~Hundal, \emph{The rate of
convergence for the method of alternating projections, II}, J. Math.
Anal. Appl. \textbf{205} (1997), no.~2, 381--405.

\bibitem{GS13} S.~Gaure, \emph{OLS with multiple high dimensional
category variables}, Computational Statistics and Data Analysis
\textbf{66} (2013), 8--18.

\bibitem{GK89} W.B. Gearhart and M.~Koshy, \emph{Acceleration schemes
for the method of alternating projections}, Journal of Computational
and Applied Mathematics \textbf{26} (1989), no.~3, 235--249.

\bibitem{GP10} P.~Guimar{\~a}es and P.~Portugal, \emph{A simple
feasible procedure to fit models with high-dimensional fixed
effects}, Stata Journal \textbf{10} (2010), no.~4, 628--649.

\bibitem{Halp62} I.~Halperin, \emph{The product of projection
operators}, Acta Sci. Math. (Szeged) \textbf{23} (1962), 96--99.

\bibitem{MR12} S.~Markussen and K.~R{\o}ed, \emph{Social insurance
networks}, IZA Discussion Papers 6446, Institute for the Study of
Labor (IZA), March 2012.

\bibitem{SV09} T.~Strohmer and R.~Vershynin, \emph{A randomized
Kaczmarz algorithm with exponential convergence}, Journal of Fourier
Analysis and Applications \textbf{15} (2009), 262--278.

\bibitem{TPAG13} S.M. Torres, P.~Portugal, J.T. Addison, and
P.~Guimar{\~a}es, \emph{The sources of wage variation: A three-way
high-dimensional fixed effects regression model}, IZA Discussion
Paper 7276, Institute for the Study of Labor (IZA), March 2013.

\end{thebibliography}
\fi

\end{document}