Fast and accurate parallel numerical derivatives in R

Andreï V. Kostyrka, University of Luxembourg

Created: 2024-10-01, last modified: 2024-10-01, compiled: 2025-02-25

Abstract

We illustrate how to use the novel pnd package for R to obtain numerical derivatives, gradients, and Hessians and offer empirically backed advice for common situations where finite-difference approximations are unreliable.

1 Introduction

As of 2025, most popular R numerical optimisers and functions for computing numerical derivative approximations continue to rely on serial evaluation of functions, and even in scenarios where parallelisation is feasible, users often lack a convenient parallel solution. The pnd package addresses this lacuna by providing functions that:

  1. Use multiple CPU cores to reduce numerical derivative computation time;
  2. Choose the optimal finite-difference step size for desired derivative and accuracy orders;
  3. Reuse existing function values to minimise evaluations or increase approximation accuracy;
  4. Report an estimated measure of numerical inaccuracy.

The first feature alone can significantly expedite optimisation in applied research by harnessing the full power of multi-core machines.

Some R packages offer finite-difference-based derivatives with either hard-coded or suboptimal step sizes (e.g. optim() in base R by the R Core Team (2024) for gradient-based methods), potentially leading to misleading results with quasi-Newton optimisers and lacking error diagnostic tools. Technical research articles and book chapters on numerical methods provide optimality results and explicit step-size formulæ, but user guides and package vignettes often lack practical insights into the principles underlying the selection of tuning parameters.

This vignette showcases the core functionality of the pnd package.

Despite the proliferation of parallelism-supporting R packages (e.g. psqn, DEoptim, hydroPSO), few implementations address gradient-related numerical optimisation issues. The optimParallel package by Gerber and Furrer (2019) offers an internally parallelised L-BFGS-B method for the optim() optimiser.1 As of May 2023, it does not support algorithms other than L-BFGS-B, which is problematic given the abundance of quasi-Newton solvers (nlm, nlminb, nloptr::nloptr, Rsolnp::solnp, BB::spg, maxLik::maxBHHH), all of which would benefit from parallelised gradients or Hessians.

To the best of our knowledge, no R packages currently address the problem of step size in finite-difference derivative approximation comprehensively. The numDeriv package offers facilities for controlling the accuracy order, step size, and other tuning parameters, but does not output diagnostic information. Users must request the diagnostic print-out manually, which works for gradients but not Hessians.2 The pnd package tackles this problem by offering three explicit data-driven step-search routines with detailed diagnostic information.

numDeriv and pnd differ fundamentally in how they attain higher-order accuracy. numDeriv allows users to set the initial step size and request a certain number of Richardson improvement iterations to mitigate the truncation error; these iterations run sequentially. In contrast, pnd achieves the desired accuracy by creating the necessary argument-value grid and evaluating the function in parallel, benefitting both high-dimensional applications like Barzilai–Borwein optimisation and low-dimensional gradients and Hessians. The results are comparable in effective accuracy, with minor differences attributable to the initial step size choice and the subsequent extrapolation rules in numDeriv.

In summary, the pnd package provides fast and accurate numerical differentiation and diagnostic features for numerical gradients, Jacobians, Hessians, and higher-order derivatives. This vignette includes theory and practical examples for step size selection in numerical derivatives.

For the remainder of the paper, we shall use the following abbreviations and notation conventions:

This vignette demonstrates the capabilities of the pnd package; thus, users should attach it before executing the code in subsequent sections:

library(pnd)

2 Derivative approximation via finite differences

2.1 Derivatives from Taylor series

The true (analytical) derivative of the scalar-valued function \(f\) at the point \(x\) is defined as the limit \[ f'(x) := \lim_{h\to0} \frac{f(x + h) - f(x)}{h}. \]

The usual caveat applies: \(f\) must be differentiable at \(x\), i.e. the limit must exist.

In many applications where analytical derivatives of \(f\) are cumbersome or impossible to compute but the value of \(f'(x)\) is needed, one intuitive approach is to remove the limit in the definition of the derivative and consider the finite difference \[ f_{\mathrm{FD}_1}' (x, h) := \frac{f(x + h) - f(x)}{h}. \] This is the first-order-accurate forward-difference approximation of \(f'(x)\).

One could hypothetically choose a sequence of decreasing step sizes \(h_i\) (e.g. \(\{0.1, 0.01, 0.001, \ldots\}\)), and observe that the sequence \(f'_{\mathrm{FD}} (x, 0.1), f'_{\mathrm{FD}} (x, 0.01), f'_{\mathrm{FD}} (x, 0.001), \ldots\) approaches a certain number as \(h\to 0\). In practice, however, computing this sequence on a computer can lead to erratic behaviour for very small \(h\). This instability arises from the discrepancy between the true \(f'_{\mathrm{FD}}\) and its machine evaluation, \(\hat f'_{\mathrm{FD}}\). The nature of this discrepancy is investigated in what follows.
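A minimal sketch of this instability, assuming \(f = \sin\) and \(x = 1\) (an arbitrary choice made purely for illustration, not a package default): the forward-difference error first shrinks and then behaves erratically as \(h\) decreases.

f <- sin
x <- 1
h <- 10^(-(1:15))                    # decreasing step sizes
fd <- (f(x + h) - f(x)) / h          # forward-difference approximations
print(cbind(h = h, error = fd - cos(x)), digits = 3)  # cos(1) is the true f'(1)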

The first-order-accurate backward difference and the second-order-accurate central difference are defined similarly: \[ f'_{\mathrm{BD}_1} := \frac{f(x)-f(x-h)}{h}, \qquad f'_{\mathrm{CD}_2} := \frac{0.5 f(x+h) - 0.5f(x-h)}{h} \] Finite-difference-based approximations of derivatives receive the subscripts ‘FD’, ‘BD’, or ‘CD’, and are labelled ‘first-order-accurate’ and ‘second-order-accurate’, denoted by a digit in the subscript, based on the approximation error arising from truncating the Taylor expansion of \(f\) at \(x\): \[ f(x \pm h) = \sum_{i=0}^{\infty} \frac{1}{i!} \frac{\mathrm{d}^i f}{\mathrm{d} x^i} (x) (\pm h)^i = f(x) \pm \frac{h}{1!} f'(x) + \frac{h^2}{2!} f''(x) \pm \frac{h^3}{3!} f'''(x) + \ldots \]

Rearranging the terms and dividing by \(h\): \[ f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{h}{2} f''(x) - \frac{h^2}{6} f'''(x) - \ldots \] The accuracy order is defined as the smallest exponent of \(h\) in the truncated part of the Taylor series (since \(h\to0\), the lowest power dominates the remainder): \[ f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{h}{2} f''(x) - \frac{h^2}{6} f'''(x + \alpha h) = f'_{\mathrm{FD}_1} - \frac{f''(x)}{2} h + O(h^2), \] where \(\alpha \in [0, 1]\).

The truncation error is the difference between the true function and the first few terms of its Taylor expansion: \[ \varepsilon_{\mathrm{t}}(h) := f'(x) - f'_{\mathrm{FD}_1}(x, h) \] In forward differences, the dominant term in the truncation error is \(\frac{f''(x)}{2} h^1\) – the first power of~\(h\) multiplied by a constant, hence the term first-order accuracy; since \(h\to0\), we ignore the Lagrange remainder \(O(h^2)\).4 We reserve the letter \(a\) for the accuracy order – requested by the user and achievable by carrying out multiple function evaluations – and the letter \(m\) for the derivative order.

Central differences are second-order-accurate owing to the cancellation of even-power terms in the Taylor expansion: \[ f(x+h) - f(x-h) = (f(x) - f(x)) + (h - (-h)) f'(x)+ (h^2/2 - h^2/2) f''(x) \\ + \frac{h^3}{6} f'''(x+\alpha_1 h) + \frac{h^3}{6} f'''(x - \alpha_2 h) \] where \(\alpha_1, \alpha_2 \in [0, 1]\). To simplify this expression, we use the Generalised Intermediate Value Theorem (GIVT).

By the generalised intermediate value theorem, \[ \frac{h^3}{6} f'''(x+\alpha_1 h) + \frac{h^3}{6} f'''(x - \alpha_2 h) = (h^3/6 + h^3/6) f'''(x+\alpha h) = \frac{h^3}{3} f'''(x+\alpha h), \] for some \(\alpha \in [-1, 1]\).

It follows that \[ f(x+h) - f(x-h) = 2h f'(x) + \frac{h^3}{3} f'''(x + \alpha h) \] and \[ f'_{\mathrm{CD}_2}(x) = \frac{0.5(2hf'(x) + \frac{h^3}{3} \cdot f'''(x+\alpha h) )}{h} = f'(x) + \frac{h^2}{6} f'''(x + \alpha h). \] The respective truncation error is \(f'(x) - f'_{\mathrm{CD}_2}(x) = -\frac{f'''(x + \alpha h)}{6} h^2 = O(h^2)\).

2.2 Derivatives on arbitrary stencils

Function evaluations may be costly. In certain large-scale applications, it may be necessary to obtain a numerical derivative \(f'(x_0)\) when the prohibitively slow \(f\) has already been evaluated on a supercomputer at the arguments \(x_1\), \(x_2\), and \(x_3\). Such cases call for an approach that reuses existing cached function values to approximate derivatives at arbitrary points. At the same time, having more than two function evaluations at distinct points grants an opportunity to cancel out more terms in the Taylor expansion of the first derivative for a smaller remainder term.

We adopt the following terminology, with some adjustments, from Fornberg (1988). For a sequence \(\{b_i\}_{i=1}^{n}\) let \(x + b_1 h, x + b_2 h, \ldots, x + b_n h\) be the evaluation grid for a function \(f\). The column vector \((b_1, \ldots, b_n)'\) with all distinct elements is called the stencil. In the pnd package, users may rely on the default integer-valued stencil centred at 0 or choose a custom stencil. For this stencil, there exist such finite-difference coefficients \((w_1, \ldots, w_n)'\) that, when used as weights in the sum \(\sum_{i=1}^n w_i f(x + b_i h)\), approximate the \(\mathrm{d}^mf/\mathrm{d}x^m\) up to the order of accuracy \(a\): \[\begin{equation}\label{eq:wsum} \frac{\mathrm{d}^m f}{\mathrm{d} x^m} (x) = h^{-m} \sum_{i=1}^n w_i f(x + b_i h) + O(h^{a}) \end{equation}\] It is necessary that \(n > m\), yielding a formal accuracy order \(a \ge n - m\) (in general).5

The intuition behind this derivative approximation via weighted sums is twofold.

  1. More evaluations, higher accuracy. Derivatives indicate the instantaneous rate of change of \(f\) in a small neighbourhood where the function resembles a line. It is possible to choose two points, \((x_1, f(x_1)), (x_2, f(x_2))\), close to the point of interest to construct a linear approximation of \(f\). More generally, \(n\)~points can be used to construct an interpolating polynomial of degree~\((n-1)\) and to compute the derivative of the power series analytically at any point. The higher the polynomial degree, the better the approximation within the convergence radius, hence any departures of \(f\) from linearity are captured by higher-order polynomial terms available with more points per evaluation grid.
  2. More evaluations, higher derivative order. Given a stencil of length \(n\) and the values of \(f\) at the corresponding grid, one can construct Taylor expansions of \(f\) at those points and find such coefficients for their linear combination that all derivatives of order 0 through \((n-2)\) should cancel out and the sum of coefficients on the \((n-1)\)th derivative should become one.

This implies that stencils with more points should allow the researcher to compute linear combinations of functions whose Taylor expansion terms cancel out in the positions before and after the derivative of interest.

For a stencil \((b_1, \ldots, b_{n})\), and for derivative order \(m\), the weights that zero out the sums of derivatives of orders other than \(m\) and result in a unit coefficient on the \(m\)th derivative must satisfy \[ \begin{pmatrix} b_1^0 & \cdots & b_n^0 \\ \vdots & \ddots & \vdots \\ b_1^{n-1} & \cdots & b_n^{n-1} \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ m! \\ \vdots \\ 0 \end{pmatrix} \iff B \vec w = \vec m \] where the non-zero entry \(m!\) of the vector \(\vec m\) occurs in the \((m+1)\)th position. The solution is \[ \vec w = B^{-1} \vec m, \] provided that \(B\) is invertible – implying that the elements of the stencil must be distinct. Another condition follows from the definition of \(\vec m\): for the element \(m!\) to be placed in the position \(m+1\), the stencil length must be at least \(m+1\) (one more than the derivative order).
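As an illustration, the system \(B \vec w = \vec m\) can be written down and solved directly for a small stencil; here, the stencil \((-1, 0, 1)\) and \(m = 2\) reproduce the familiar second-difference weights. (Using base::solve() is acceptable in this sketch only because the system is tiny; see the remark on ill-conditioning below.)

b <- c(-1, 0, 1)                            # stencil
m <- 2                                      # derivative order
B <- t(outer(b, 0:(length(b) - 1), `^`))    # rows are b^0, b^1, ..., b^(n-1)
rhs <- c(0, 0, factorial(m))                # m! in the (m+1)th position
print(solve(B, rhs))                        # expected weights: 1, -2, 1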

The optimal weights \(\vec w\) do not depend on \(h\) because the weighted sum in Eq.~\(\eqref{eq:wsum}\) is already normalised by factor \(h^{-m}\). The weights for stencils \((-2, -1, 1, 2)\) and \((-10, -5, 5, 10)\) are therefore identical.

The idea is to find such coefficients \(\{w_i\}_{i=1}^n\) that the derivatives of orders other than \(m\) cancel out and the coefficients on the \(m\)th derivative add up to unity. Our implementation uses the Björck and Pereyra (1970) algorithm to avoid inverting the numerically ill-conditioned Vandermonde matrix \(B\). Using numerical routines relying on matrix inversion (such as LAPACK) invokable through base::solve() is not recommended due to the rapid deterioration of the solution accuracy. See Appendix A in Kostyrka (2025) for an investigation of this numerical instability.

By default, fdCoef() assumes central differences for second-order-accurate first derivatives, but every aspect of numerical differencing can be customised.

Central differences (or, more generally, stencils with both positive and negative points) yield higher accuracy for the same number of evaluations or require fewer evaluations to attain the desired accuracy order. Computing fourth-order-accurate second derivatives (e.g. for a very accurate Hessian diagonal) can be done on stencils \(\{0, \pm 1, \pm 2\}\) (5 evaluations) or \(\{0, 1, \ldots, 6\}\) (7 evaluations). The practical reason to prefer forward differences can be the extreme computation time of \(f\); if \(f(x)\) has been pre-computed, evaluating \((f(x+h) - f(x))/h\) is faster than \(0.5 (f(x+h) - f(x-h))/h\). If \(f(x)\) has not yet been computed, both \(f_{\mathrm{FD}_1}\) and \(f_{\mathrm{CD}_2}\) require 2 evaluations, so the latter is preferable for accuracy reasons.

The output of fdCoef() for 5-point stencils matches formulæ 5.1–5.4 from Li (2005).

2.3 Precision loss on computers

Numerical methods are often preferred over analytical ones. With highly non-linear models conquering applied research, providing analytical gradients becomes less feasible. Their derivation can be cumbersome and, due to the approximate nature of statistical methods, only marginally useful compared to their more affordable numerical approximations. Most numerical optimisation methods rely on gradients to determine the search direction, which is why the user is supposed to provide some form of gradient function. Without a gradient function, a gradient-based optimisation method might attempt to compute \(f'_{\mathrm{CD}_2}\) with default parameters that can be suboptimal or even perilous.

After the optimiser converges (hopefully with zero exit code) to a certain optimal value of the argument (usually called the estimate), many implementations of statistical methods calculate a measure of uncertainty in the estimate due to the data randomness through the objective function curvature. This measurement often involves the Hessian matrix, which also needs to be computed or at least approximated. For example, if the parameter vector \(\theta\) is estimated via maximum likelihood, assuming that the model specification is correct and the conditional density of the model error is fully known up to \(\theta\), the asymptotic variance-covariance of the estimator is often computed as the inverse of the observed Fisher information at the optimum – the negative Hessian of the log-likelihood function. The accuracy of Hessian calculation depends on the step size, and the step size that minimises the numerical gradient error is unlikely to minimise the numerical Hessian error. Arguably, accurate Hessians are more important for applied researchers than accurate gradients because it is the inverse of the Hessian matrix that is used for inferential calculations. Matrix inversion, being a numerically unstable operation, propagates and magnifies any inaccuracies in the computed Hessian, often by several orders of magnitude.

The problem of derivative approximation boils down to limited precision in the default numerical routines written in most popular programming languages. Every number in computer memory is represented by a finite number of bits, which is why arithmetic operations on computers are often lossy. Except for special lossless operations like multiplying by powers of 2 via bit shifting or subtracting two numbers separated by less than one binary order of magnitude (Sterbenz lemma), most operations with most real numbers lead to machine-rounding errors. Even converting decimal to binary is lossy: merely entering the number 0.3 into computer memory already introduces an error because the closest representable binary number is slightly less than 3/10. In our notation, square brackets denote the machine representation of a quantity; thus, \([3/10] \ne 3/10\).

To accurately compute numerical derivatives, researchers should track every potentially lossy operation in computer memory. This allows bounding the error, comparing the losses, and focusing on the preventable ones. The expression \(\hat f'_{\mathrm{FD}_1}\) implies a 3-step procedure, namely \(\left[\frac{[\hat f([x+h]) - \hat f(x)]}{h}\right]\). Assuming that \(x\) is a value already loaded in the memory:

  1. \(x+h\) is computed by the machine (to the best available precision, usually 64-bit FP);
  2. \(f\) is evaluated at \(x\) and \([x+h]\) (to the same precision);
  3. The outermost arithmetic operations are performed: subtraction and division by \(h\) (to the same precision).

Thus, 3 types of errors appear: \[ \left[\frac{\hat f([x+h]) - \hat f(x)}{h}\right] = \frac{\bigl( f(x + h + \varepsilon_+) + \varepsilon_{f,1} - f(x) - \varepsilon_{f, 2}\bigr) + \varepsilon_-}{h} + \varepsilon_{/}. \]

  1. \(\varepsilon_+\) is the argument-rounding error;
  2. \(\varepsilon_{f,1}\) and \(\varepsilon_{f,2}\) are the function-rounding errors;
  3. \(\varepsilon_-\) and \(\varepsilon_{/}\) are the arithmetic-rounding errors.

The arithmetic-rounding errors may lead to catastrophic cancellation: if two real numbers, \(x_1\) and \(x_2\), are separated by a very small difference, \(|x_1 - x_2| \approx 0\), the machine representation of their difference, \([\hat x_1 - \hat x_2]\), has a potentially unbounded relative error. Example: 1.000000000000001 - 0.999999999999999 evaluates to 2.109424e-15 instead of 2e-15, making the relative error higher than 5%.
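The example above can be reproduced directly in R (the printed value uses the default 7 significant digits):

d <- 1.000000000000001 - 0.999999999999999
print(d)
#> [1] 2.109424e-15
print((d - 2e-15) / 2e-15)      # relative error of roughly 5.5%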

It is common to ignore the argument-rounding errors because they are usually small compared to the function-rounding error. In subsequent analysis, argument-rounding errors are ignored as it is always possible to choose a step size \(h\) such that \([x+h] = x+h\) without any accuracy loss (usually achieved by considering \(h = 2^{-i}\), \(i \in \mathbb{Z}\), making \(x+h\) a lossless bit-flip operation). For a rigorous analysis of the rounding error due to the binary representation error, see Table 3.1 in Mathur (2012); the recommendation of choosing \(h\) as a power of 2 can be found ibidem. It is also common to ignore the arithmetic-rounding errors in the numerator. Therefore, it suffices to bound the numerator error, calculate its amplification due to division by \(h^m\), and minimise the result with respect to~\(h\).
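A small demonstration of this recommendation, assuming \(x = 1\) (for binary step sizes that are not too small relative to \(x\), the addition is exact):

x <- 1
h.dec <- 1e-4          # decimal step: representation error creeps in
h.bin <- 2^-13         # binary step of a similar magnitude (~1.22e-4)
print((x + h.dec) - x == h.dec)
#> [1] FALSE
print((x + h.bin) - x == h.bin)
#> [1] TRUE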

The terms \(f(x+h)\) and \(f(x-h)\) usually have the same binary exponent; thus, by Theorem 11 in Goldberg (1991), \([f(x+h) - f(x-h)]\) is exact. Even for higher-order derivatives, the same order of magnitude of \(f(x+h), f(x+2h),\ldots\) implies that in most cases, no radix shifting is required for IEEE~754-compliant addition, and the round-off error is due only to the normalisation of the sum. This normalisation error has a high probability of being zero, depending on the least significant bits. Similarly, multiplication by integer powers of 2 is achieved through incrementing or decrementing the (usually 11-bit) exponent without changing the significand, unless the number is too large or too small and some non-zero bits of the mantissa are lost due to shifting.

Apart from the Taylor-series-related truncation error, the next source of inaccuracy is the rounding error \(\varepsilon_{\mathrm{r}}\) – the discrepancy between the theoretical finite-difference value and its machine evaluation equal to the sum of the argument-, function-, and arithmetic-rounding errors. The total error, \(\varepsilon := \varepsilon_{\mathrm{t}} + \varepsilon_{\mathrm{r}}\), is the sum of the ‘theoretical’ truncation and ‘practical’ rounding error. Each error can be bounded in absolute terms: \(|\varepsilon_{\mathrm{t}}| < \overline{\varepsilon_{\mathrm{t}}}\), \(|\varepsilon_{\mathrm{r}}| < \overline{\varepsilon_{\mathrm{r}}}\). Then, define the total-error function \(\varepsilon(h) := \overline{\varepsilon_{\mathrm{t}}}(h) + \overline{\varepsilon_{\mathrm{r}}}(h)\) as the objective discrepancy measure to be minimised; the optimal step size, \(h^* := \arg\min_h \varepsilon(h)\), minimises the total error. For the sake of brevity, the dependence of the \(\varepsilon\) function on \(f\), \(x\), and machine precision is omitted from the notation.

Frustratingly, there is no universal ‘one-size-fits-all’ numerical difference step size \(h\).

Thus, the term ‘accuracy order’ might be misleading to statistical software users since it is impossible to request arbitrary numerical derivative accuracy from a computer. Instead, one can only bound the total loss. The goal is to find \(h^*\) for a given application and obtain the best approximation of \(\frac{\mathrm{d}^m f}{\mathrm{d} x^m}(x)\) on a given machine for a given \(x\). The word ‘best’ is used in the following sense: although the total rounding error might happen to be exactly zero by pure coincidence, \(h^*\) is ‘optimal’ if it ensures that the total-error function is minimised for most inputs in a small neighbourhood of \(x\).

3 Numerical derivatives of scalar functions

The derivations below follow the logic of Section 5.1 from Sauer (2017), without assuming that the order of magnitude of \(f(x)\) is approximately 1. Differences are written as weighted sums of \(f\) evaluated on a stencil sorted in ascending order in the spirit of Fornberg (1988), reflecting the flexibility to use arbitrary stencils, useful if the function evaluation is expensive and existing values must be reused. This also contains all multipliers affecting the rounding error in the numerator of the fraction, simplifying calculations.

3.1 Two-sided derivatives

Consider the central-difference approximation of the first derivative on the default uniformly spaced stencil: \[ f'(x) = \frac{-0.5f(x-h) + 0.5f(x+h)}{h} + O(h^2) \]

The truncation error for this expression is \[ f'(x) - f'_{\mathrm{CD}_2}(x) = - \frac{h^2}{6} f'''(x + \alpha h), \] where \(\alpha \in [-1, 1]\). This means that apart from the freely chosen \(h\), the approximation error depends on the 3rd derivative of \(f\), which is usually unknown.

In computer memory, \(f(x+h)\) is replaced with its FP version \(\hat f(x+h) = f(x+h) + \varepsilon_f\), where \(\varepsilon_f\) is a number on the order of machine epsilon in relative terms: \[ \frac{|\hat f(x) - f(x)|}{|f(x)|} \le \frac{\epsilon_{\mathrm{m}}}{2} \quad \Rightarrow \quad |\hat f(x) - f(x)| = |\varepsilon_f| \le 0.5|f(x)|\epsilon_{\mathrm{m}} \]

Each function-rounding error is thus bounded: \[ |\hat f(x\pm h) - f(x\pm h)| \le 0.5|f(x\pm h)|\epsilon_{\mathrm{m}} \approx 0.5|f(x)|\epsilon_{\mathrm{m}} \]

Expanding the numerator: \[ -0.5 \hat f(x-h) + 0.5\hat f(x+h) = -0.5 (f(x-h) + \varepsilon_{f,1}) + 0.5(f(x+h) + \varepsilon_{f,2}) \\ = 0.5(-f(x-h) + f(x+h)) + 0.5(\varepsilon_{f,2} - \varepsilon_{f,1}) \] The function-rounding error is bounded due to the triangle inequality: \[ 0.5(\varepsilon_{f,2} - \varepsilon_{f,1}) \le 0.5(|\varepsilon_{f,2}| + |\varepsilon_{f,1}|) \le 0.5(2\cdot 0.5|f(x)|\epsilon_{\mathrm{m}}) = 0.5|f(x)|\epsilon_{\mathrm{m}} \]

The total error decomposition into truncation and rounding errors follows: \[ f'(x) - \hat f'_{\mathrm{CD}_2}(x) = f'(x) - \frac{0.5([-\hat f(x-h)] + [\hat f(x+h)])}{h} \\ = \underbrace{f'(x) - \frac{0.5(-f(x-h) + f(x+h))}{h}}_{\varepsilon_{\mathrm{t}}} + \underbrace{\frac{0.5(\varepsilon_{f_2} - \varepsilon_{f_1})}{h}}_{\varepsilon_{\mathrm{r}}} \\ = \frac{h^2}{6}f'''(x + \alpha h) + \frac{0.5(\varepsilon_{f_2} - \varepsilon_{f_1})}{h} \le \frac{h^2}{6}f'''(x + \alpha h) + \frac{0.5|f(x)|\epsilon_{\mathrm{m}}}{h} \]

We use an extra approximation \(f'''(x + \alpha h) \approx f'''(x)\) because this term is not the largest source of error in this expression. The absolute error of the machine approximation of \(f'(x)\) becomes bounded by the conservative approximation of the total-error function

\[ \varepsilon(h) := \frac{|f'''(x)|}{6}h^2 + 0.5|f(x)|\epsilon_{\mathrm{m}} h^{-1}. \]

Minimising \(\varepsilon(h)\) with respect to \(h\) requires the first-order condition \[ \varepsilon'(h) = \frac{|f'''(x)|}{3}h - 0.5|f(x)|\epsilon_{\mathrm{m}} h^{-2} = 0, \] which can be solved for \(h > 0\): \[ \frac{|f'''(x)|}{3}h = 0.5|f(x)|\epsilon_{\mathrm{m}} h^{-2} \quad \Rightarrow \quad h^*_{\mathrm{CD}_2} = \sqrt[3]{\frac{1.5 |f(x)| \epsilon_{\mathrm{m}}}{|f'''(x)|}} \]
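Before turning to the practical issues with this formula, here is a quick numerical check, assuming \(f = \exp\) at \(x = 1\) (an arbitrary example where \(f''' = f\), so the ratio \(|f/f'''|\) is exactly 1):

f <- exp
x <- 1
h.opt <- (1.5 * .Machine$double.eps)^(1/3)     # about 6.9e-6
cd <- function(h) (f(x + h) - f(x - h)) / (2 * h)
print(abs(cd(h.opt) - exp(x)))   # error near the theoretical minimum, around 1e-10 or smaller
print(abs(cd(1e-9) - exp(x)))    # a step that is too small: rounding dominates (often around 1e-7)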

This expression has two problems:

  1. Recursivity: we approximate the unknown first derivative with a finite difference, and the optimal step size involves the unknown third derivative. One may assume any order of its magnitude, e.g. \(|f'''(x)| = 1\), or use a rough numerical plug-in ‘guesstimate’ of \(f'''(x)\) calculated on the smallest sufficient stencil (obtained via fdCoef(3)) and any reasonable ad hoc step size; a minimal sketch of this plug-in computation is given after this list. \(10^{-4}\) seems an acceptable plug-in value of \(h\) for \(f'''_{\mathrm{CD}_2}\); Dumontet and Vignes (1977) describe a better approach, which we summarise in a dedicated section below. If computing a rough version of \(f'''(x)\) is too costly, assume \(|f(x) / f'''(x)| \approx 2/3\), which yields the simplest rule of thumb \(h^* \approx \sqrt[3]{\epsilon_{\mathrm{m}}} \approx 6\cdot 10^{-6}\). If \(f\) is suspected to have a high curvature, decrease \(h\), and if \(f\) changes at a steady rate, increase it.
  2. If \(f'''(x) \approx 0\), then, the optimal step size calculation involves division by near-zero numbers. Quadratic objective functions would be one such example: \(f'''(x) = 0\ \forall x \in \mathbb{R}\). In this case, \(f'(x)\) is a linear function that is defined by two points; therefore, any larger step size can be used to determine the coefficients of this line equation, yielding a near-zero truncation error. If \(f(x)\) is not linear and has \(f'''(x) \approx 0\) but non-negligible higher-order derivatives, data-driven procedures may be needed to find the optimal step size, although the higher-order remainder might be sufficiently close to zero to ignore it completely.
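A minimal sketch of the plug-in approach mentioned in the first item, using a hypothetical example function (not a package default) and the ad hoc pilot step \(h = 10^{-4}\):

f <- function(x) sin(x) * exp(x / 2)     # hypothetical example function
x <- 1
w3 <- fdCoef(3)                          # stencil (-2, -1, 1, 2) for f'''
h.pilot <- 1e-4
d3 <- sum(w3$weights * f(x + w3$stencil * h.pilot)) / h.pilot^3   # rough f'''(x)
h.star <- (1.5 * abs(f(x)) * .Machine$double.eps / abs(d3))^(1/3)
print(h.star)                            # about 6e-6 for this particular function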

The decrease in accuracy due to incorrect assumptions about \(|f(x) / f'''(x)|\) depends on the ratio. If \(|f'''(x)| \approx |f(x)| \approx 1\) at some point \(x\), the total-error function is more sensitive to larger \(h\) in relative terms, but to smaller \(h\) in absolute terms in the neighbourhood of \(h^*\), which can be seen in the logarithmic and linear plots:

hseq <- 10^seq(-9, -3, length.out = 101)
hopt <- (1.5 * .Machine$double.eps)^(1/3)
e <- function(h) h^2/6 + 0.5*.Machine$double.eps / h
plot(hseq, e(hseq), log = "xy", xlab = "Step size", ylab = "Total error", asp = 1,
     bty = "n", type = "l", main = "Inaccuracy of CD-based derivatives (logarithmic)")
abline(v = hopt, lty = 2)

hseq2 <- seq(hopt - 5e-6, hopt + 5e-6, length.out = 101)
plot(hseq2, e(hseq2), xlab = "Step size", ylab = "Total error", bty = "n",
     type = "l", main = "Inaccuracy of CD-based derivatives (linear)")
abline(v = hopt, lty = 2)

With this optimal step size, the total error function attains its minimum value of order \(O(\epsilon_{\mathrm{m}}^{2/3})\), translating to 10–11 accurate decimal significant digits.

Other bounds for the numerator round-off error exist in the literature. Sauer (2017) assumes \(|f(x)| \approx 1\) and \(|\varepsilon_{f,1}| \approx |\varepsilon_{f,2}| \approx \epsilon_{\mathrm{m}}\). The latter assumption is too conservative compared to his earlier statement from Chapter 0 that the relative rounding error is at most \(\epsilon_{\mathrm{m}} / 2\). Even if the round-off error is accumulated through \([x+h] \ne x+h\) and \(|f'(x)| > 1\), this conservatism is stronger than the conservatism in our derivation, and the worst case is much less likely than the typical case. The absolute difference \(|\varepsilon_{f_2} - \varepsilon_{f_1}|\), bounded above by \(\epsilon_{\mathrm{m}} |f(x)|\), takes smaller values in most cases. If the relative rounding errors are independently uniformly distributed over \([-\epsilon_{\mathrm{m}}/2, \epsilon_{\mathrm{m}}/2]\), then \(\varepsilon_{f_2} - \varepsilon_{f_1}\) has a triangular (not uniform) distribution on \([-\epsilon_{\mathrm{m}}, \epsilon_{\mathrm{m}}]\) with density \(\frac{1}{\epsilon_{\mathrm{m}}}\bigl(1 - \frac{|t|}{\epsilon_{\mathrm{m}}}\bigr) \mathbb{I}(|t| \le \epsilon_{\mathrm{m}})\). If the total rounding error were uniformly distributed on \([-\epsilon_{\mathrm{m}}, \epsilon_{\mathrm{m}}]\), its variance and mean absolute deviation would be 2 and 1.5 times higher, respectively. Gill, Murray, and Wright (1981) remark in Section 8.5.1.1 that even if the function-rounding error is theoretically bounded by some number \(\varepsilon^*\), this number may not be unique: multiple upper bounds may exist. The bound should represent a good estimate of the rounding error at all points in the neighbourhood of the evaluation point, which is usually achievable through rounding-error analysis. For many applied researchers, the goal is minimising the total error comprising the average (not maximum) absolute rounding error, as in Dumontet and Vignes (1977). In this case, the rounding-error component must be divided by 1.5: \[ \varepsilon_{\mathrm{r}} \le \frac{|f(x)|\epsilon_{\mathrm{m}}}{3h} \quad \Rightarrow \quad h^*_{\mathrm{CD}_2} = \sqrt[3]{\frac{|f(x)| \epsilon_{\mathrm{m}}}{|f'''(x)|}}. \] This translates to the step size being 1.145 times shorter than the one relying on the conservative rounding-error bound. Intuitively, a smaller rounding-error component allows the user to gain accuracy by reducing the truncation-related part of the error, although in most applications, this factor may be safely ignored.

The opposite occurs if \(f(x+h)\) and \(f(x-h)\) come from a numerical routine with limited accuracy. In such cases, the relative error of each evaluation is much higher, and the bound on the difference \(|\varepsilon_{f,2} - \varepsilon_{f,1}|\) must be adjusted accordingly. See Section~\(\ref{sec:noisy}\) for examples of such adjustments.

3.2 One-sided derivatives

We apply a similar argument to compute one-sided numerical derivatives of costly functions when the value \(f(x)\) is already known and only \(f(x+h)\) remains to be computed.

Mathematicians noticed the superior accuracy of approximations of functions on grids symmetric around the value of interest more than a century ago. Symmetric stencils for evaluations of derivatives were explicitly proposed in Aitken (1938), following his 1932 paper on interpolation by iteration, which built on an earlier 1928 result by C. Jordan that recast the 1901 Everett interpolation formula into a format more parsimonious in terms of computations. The computational advantages and higher accuracy of symmetric stencils on finite-precision computers were restated by Oliver and Ruffhead (1975), who advocate the use of symmetric stencils ‘wherever possible’.

Although inferior in terms of accuracy, one-sided derivatives might be the only feasible solution where one evaluation of \(f(x)\) takes dozens of seconds or where the gradient-based optimisation routine takes too many steps to converge. Except for the situations where the curvature of a computationally expensive function must be inferred from what the researcher can glean from existing cached results, there are no other practical or theoretical reasons to use one-sided derivatives with more than two grid points instead of two-sided ones.

<…>

The fact that FD approximations have such a high total error was noted by early researchers (Gill, Murray, and Wright (1981), Section 8.2.2) and can be summarised as follows: at least half of all digits of a FD numerical derivative are wrong. The ‘at least’ part is due to the function scaling and step size; higher accuracy cannot simply be achieved. By contrast, at least 1/3 of all digits of a CD numerical derivative are wrong, which still leaves a larger margin of accuracy in applications with a gradient-based stopping rule. If the algorithm terminates when the Euclidean norm of the gradient is less than some tolerance (usually \(\approx \sqrt{\epsilon_{\mathrm{m}}}\)), then, the FD derivative with error magnitude comparable to the tolerance should be rejected in favour of the more accurate CD derivative. This does not apply if the maximisation is preliminary and a more accurate optimiser at a later stage refines the argument value for a more rigorous solution.

3.3 Second derivatives

The formula~\(\eqref{eq:wsum}\) yields the coefficients for derivatives of any order; the following invocation of fdCoef() yields the optimal coefficients for the standard symmetric stencil.

fdCoef(deriv.order = 2, acc.order = 2) # Same as fdCoef(2)
#> $stencil
#> [1] -1  0  1
#> 
#> $weights
#> x-1h    x x+1h 
#>    1   -2    1 
#> 
#> attr(,"remainder.coef")
#> [1] 0.08333333
#> attr(,"accuracy.order")
#> requested effective 
#>         2         2 
#> attr(,"expansion")
#> [1] "f'' + 8.3333e-02 f'''' + ..."

To find the best approximation on this stencil, we use a similar expansion to quantify the truncation error: \[ f''(x) = \frac{f(x-h) - 2f(x) + f(x+h)}{h^2} + O(h^2) \] \[ f(x\pm h) = f(x) \pm \frac{h}{1!}\cdot f'(x) + \frac{h^2}{2!} f''(x) \pm \frac{h^3}{3!} f'''(x) + \frac{h^4}{4!} f''''(x \pm \alpha_4^{\pm} h), \] where \(\alpha_4^+, \alpha_4^- \in [0, 1]\).

The approximations differ from the true derivatives by the following terms: \[ f''_{\mathrm{CD}}(x) = \frac{f(x-h) - 2 f(x) + f(x+h)}{h^2} \\ = \frac{h^2 f''(x) + \frac{h^4}{24} (f''''(x+\alpha_{4}^+ h) + f''''(x - \alpha_{4}^- h)) }{h^2} = f''(x) + \frac{h^2}{12}f''''(x + \alpha_4 h), \] where \(\alpha_4 \in [-1, 1]\).

Since \(h\to 0\), the same approximate bound also applies to steps of length \(2h\) (which appear in the higher-order formulæ below): \[ |[f(x\pm 2h)] - f(x\pm 2h)| \le 0.5|f(x \pm 2h)|\epsilon_{\mathrm{m}} \approx 0.5|f(x)|\epsilon_{\mathrm{m}} \] The rounding error in the numerator is bounded as follows: \[ [f(x-h)] - 2[f(x)] + [f(x+h)] = f(x-h) + \varepsilon_1 - 2(f(x) + \varepsilon_2) + f(x+h) + \varepsilon_3 \\ \le f(x-h) - 2f(x) + f(x+h) + 4\cdot 0.5|f(x)| \cdot \epsilon_{\mathrm{m}} \\ = f(x-h) - 2f(x) + f(x+h) + 2 |f(x)| \epsilon_{\mathrm{m}} \]

The overall error is \[ f''(x) - \hat f''_\mathrm{CD}(x) = f''(x) - \frac{[f(x-h)] - 2[f(x)] + [f(x+h)]}{h^2} \\ \le f''(x) - \frac{f(x-h) - 2f(x) + f(x+h)}{h^2} + \frac{2 |f(x)| \epsilon_{\mathrm{m}} }{h^2} \\ \approx \frac{h^2}{12}f''''(x + \alpha_4 h) + \frac{2 |f(x)| \epsilon_{\mathrm{m}} }{h^2} \]

Since \(f''''(x + \alpha_4 h) \approx f''''(x)\), the absolute error of \(f''_{\mathrm{CD}_2}(x)\) is bounded by \[ \varepsilon(h) := \frac{|f''''(x)|}{12}h^2 + 2|f(x)|\epsilon_{\mathrm{m}} h^{-2} \]

Minimising \(\varepsilon(h)\) with respect to \(h\) yields \[ \varepsilon'(h) = \frac{|f''''(x)|}{6}h - 4|f(x)|\epsilon_{\mathrm{m}} h^{-3} = 0, \] \[ \frac{|f''''(x)|}{6}h = 4|f(x)|\epsilon_{\mathrm{m}} \frac{1}{h^3} \quad \Rightarrow \quad h = \sqrt[4]{\frac{24 |f(x)| \epsilon_{\mathrm{m}}}{|f''''(x)|}} \]

Without additional information about the function, one can assume \(|f(x) / f''''(x)| \approx 1\) and take \(h^* = \sqrt[4]{24 \epsilon_{\mathrm{m}}} \approx 0.00027\). If higher-order derivatives of \(f\) are close to zero, then, there is no penalty in taking larger steps.
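A brief sanity check, again assuming \(f = \exp\) at \(x = 1\) (where \(f'''' = f\), so the guessed unit ratio is exact):

f <- exp
x <- 1
f2 <- function(h) (f(x - h) - 2 * f(x) + f(x + h)) / h^2
h.rot <- (24 * .Machine$double.eps)^(1/4)    # about 0.00027
print(abs(f2(h.rot) - exp(x)))   # total error on the order of 1e-8
print(abs(f2(1e-6) - exp(x)))    # too small a step: often wrong by 1e-4 or more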

3.4 Higher derivatives

The analysis for third and fourth differences contains a new obstacle because it involves steps in multiples of \(h\).

w3 <- fdCoef(3)
w4 <- fdCoef(4)
print(w3)
#> $stencil
#> [1] -2 -1  1  2
#> 
#> $weights
#> x-2h x-1h x+1h x+2h 
#> -0.5  1.0 -1.0  0.5 
#> 
#> attr(,"remainder.coef")
#> [1] 0.25
#> attr(,"accuracy.order")
#> requested effective 
#>         2         2 
#> attr(,"expansion")
#> [1] "f''' + 2.5000e-01 f^(5) + ..."
print(w4)
#> $stencil
#> [1] -2 -1  0  1  2
#> 
#> $weights
#> x-2h x-1h    x x+1h x+2h 
#>    1   -4    6   -4    1 
#> 
#> attr(,"remainder.coef")
#> [1] 0.1666667
#> attr(,"accuracy.order")
#> requested effective 
#>         2         2 
#> attr(,"expansion")
#> [1] "f'''' + 1.6667e-01 f^(6) + ..."

\[ f'''(x) = \frac{-0.5f(x-2h) + f(x-h) - f(x+h) + 0.5f(x+2h)}{h^3} + O(h^2) \] \[ f''''(x) = \frac{f(x-2h) - 4f(x-h) + 6f(x) - 4f(x+h) + f(x+2h)}{h^4} + O(h^2) \]

The Taylor expansions for step-size multipliers \(c \in \{1, 2\}\) are \[ f(x\pm c\cdot h) = f(x) \pm \frac{ch}{1!}\cdot f'(x) + \frac{(ch)^2}{2!} f''(x) \pm \frac{(ch)^3}{3!} f'''(x) + \frac{(ch)^4}{4!} f''''(x) \\ \pm\begin{cases} \frac{(ch)^5}{5!} f^{(V)}(x + \alpha ch), & m=3\\ \frac{(ch)^5}{5!} f^{(V)}(x) + \frac{(ch)^6}{6!} f^{(VI)}(x+\alpha ch), & m=4, \end{cases} \] where \(\alpha \in [0, 1]\).

The value of the multiplier \(\alpha\) in the Lagrange remainder is different for each step \(\pm h\), \(\pm 2h\). For \(m=3\), the expression becomes \[ f'''_{\mathrm{CD}}(x) = \frac{0.5(-f(x-2h) + 2f(x-h) - 2 f(x+h) + f(x+2h))}{h^3} \\ = \frac{h^3 f'''(x) + \frac{h^5}{240} \bigl(32 f^{(V)}(x-2\alpha_{51} h) - 2f^{(V)}(x-\alpha_{52} h) - 2f^{(V)}(x+\alpha_{53} h) + 32 f^{(V)}(x+2\alpha_{54} h)\bigr)}{h^3}, \] where \(\alpha_{51}, \alpha_{52}, \alpha_{53}, \alpha_{54} \in [0, 1]\). Here, the GIVT cannot be applied because not all coefficients on \(f^{(V)}\) are positive. They can still be grouped: \[ 32 f^{(V)}(x-2\alpha_{51} h) - 2f^{(V)}(x-\alpha_{52} h) - 2f^{(V)}(x+\alpha_{53} h) + 32 f^{(V)}(x+2\alpha_{54} h) \\ = 64 f^{(V)} (x + 2\alpha_{5p}h) - 4 f^{(V)} (x + \alpha_{5m}h), \] where \(\alpha_{5p}, \alpha_{5m} \in [-1, 1]\). To place an upper bound on this term, we use the triangle inequality for the terms’ absolute values: \[ 64 f^{(V)} (x + 2\alpha_{5p}h) - 4 f^{(V)} (x + \alpha_{5m}h) \le 64| f^{(V)} (x + 2\alpha_{5p}h)| + 4|f^{(V)} (x + \alpha_{5m}h)| \\ \le 68 \max\{| f^{(V)} (x + 2\alpha_{5p}h)|, |f^{(V)} (x + \alpha_{5m}h)|\} \] Again, we rely on \(h\to0\) to obtain \(f^{(V)} (x + 2\alpha_{5p}h) \approx f^{(V)} (x + \alpha_{5m}h) \approx f^{(V)}(x)\).

Instead of repeating the same chain of calculations with obfuscated notation, we show how to obtain the coefficients on the non-vanishing higher-order terms for \(m=4\) algorithmically:

denom  <- factorial(0:6)
names(denom) <- paste0("f'", 0:6)
num.0  <- c(1, rep(0, 6)) # f(x) = f(x) + 0*f'(x) + 0*f''(x) + ...
num.h  <- rep(1, 7)
num.2h <- 2^(0:6)
# f(x-ch) = f - ch f' + (ch)^2/2 f'' - (ch)^3/6 f''' ...
num.mh  <- suppressWarnings(num.h * c(1, -1)) # Relying on recycling
num.m2h <- suppressWarnings(num.2h * c(1, -1))
num <- colSums(rbind(num.m2h, num.mh, num.0, num.h, num.2h) * w4$weights)
print(round(num / denom, 5))
#>     f'0     f'1     f'2     f'3     f'4     f'5     f'6 
#> 0.00000 0.00000 0.00000 0.00000 1.00000 0.00000 0.16667

As expected, there is a unit coefficient on the 4th-order term. However, the last term shown above is not the correct one because the non-applicability of the GIVT requires a higher upper bound through the use of absolute values:

sum(abs(w4$weights[c(1, 2, 4, 5)] *
          c(num.m2h[7], num.mh[7], num.h[7], num.2h[7]))) / denom[7]
#>       f'6 
#> 0.1888889

This yields a coefficient on \(f^{(VI)}(x)\) equal to \(136/720 = 17/90\), not \(120/720 = 1/6\).

As for the rounding errors in the numerator, they are bounded by the weighted sum of the absolute individual rounding errors. For \(m=3\), \[ [-0.5f(x-2h)] + [f(x-h)] - [f(x+h)] + 0.5[f(x+2h)] = \\ -0.5f(x-2h) - 0.5\varepsilon_1 + f(x-h) + \varepsilon_2 - f(x+h) - \varepsilon_3 + 0.5f(x+2h) + 0.5\varepsilon_4 \\ = \ldots - 0.5\varepsilon_1 + \varepsilon_2 - \varepsilon_3 + 0.5\varepsilon_4 \\ \le \ldots + 0.5 |f(x)| \epsilon_{\mathrm{m}} (0.5 + 1 + 1 + 0.5) = \ldots + 1.5 |f(x)| \epsilon_{\mathrm{m}} \]

For \(m=4\), we compute the multiplier on \(0.5 |f(x)| \epsilon_{\mathrm{m}}\) in one line of code:

sum(abs(w4$weights))
#> [1] 16

The total errors are thus \[ f'''(x) - [f'''_\mathrm{CD}(x)] \le \varepsilon_3(h) := \frac{17 |f^{(V)}(x)|}{60} h^2 + \frac{1.5 |f(x)| \epsilon_{\mathrm{m}} }{h^3} \] \[ f''''(x) - [f''''_\mathrm{CD}(x)] \le \varepsilon_4(h) := \frac{17 |f^{(VI)}(x)|}{90} h^2 + \frac{8 |f(x)| \epsilon_{\mathrm{m}}}{h^4} \]

Finding the minimum of the total error functions: \[ \frac{17|f^{(V)}(x)|}{30}h = 4.5|f(x)|\epsilon_{\mathrm{m}}\frac{1}{h^4} \quad \Rightarrow \quad h^*_3 = \sqrt[5]{\frac{135 |f(x)| \epsilon_{\mathrm{m}}}{17|f^{(V)}(x)|}} \] \[ \frac{17|f^{(VI)}(x)|}{45}h = 32|f(x)|\epsilon_{\mathrm{m}}\frac{1}{h^5} \quad \Rightarrow \quad h^*_4 = \sqrt[6]{\frac{1440 |f(x)| \epsilon_{\mathrm{m}}}{17|f^{(VI)}(x)|}} \]

Without any extra information about the function, one can assume \(|f(x) / f^{(V)}(x)| \approx 1\), \(|f(x) / f^{(VI)}(x)| \approx 1\), and take \(h^*_3 = \sqrt[5]{\frac{135}{17} \epsilon_{\mathrm{m}}} \approx 0.0011\) and \(h^*_4 = \sqrt[6]{\frac{1440}{17} \epsilon_{\mathrm{m}}} \approx 0.005\). In applied work, functions that are not approximated well by 4th- and 5th-order polynomials are not common, which is why it is possible that \(f^{(V)}(x) \to 0\), \(f^{(VI)}(x) \to 0\), and the main source of error is precision loss due to the division by \(h^m\), not poor Taylor approximation.
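Reusing the weights w3 and w4 computed above, a rough verification with \(f = \exp\) at \(x = 1\) (an example where all derivatives equal \(f\), so the unit-ratio assumption holds exactly):

f <- exp
x <- 1
d3 <- function(h) sum(w3$weights * f(x + w3$stencil * h)) / h^3
d4 <- function(h) sum(w4$weights * f(x + w4$stencil * h)) / h^4
h3 <- (135/17 * .Machine$double.eps)^(1/5)    # about 0.0011
h4 <- (1440/17 * .Machine$double.eps)^(1/6)   # about 0.005
print(abs(c(d3 = d3(h3), d4 = d4(h4)) - exp(x)))  # errors of roughly 1e-6 and 1e-5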

3.5 Fourth-order-accurate derivatives

If one requests fdCoef(acc.order = 4), then, the previously obtained result for \(h^*_{\mathrm{CD}_2}\) is no longer valid because the truncation error depends on a different power of \(h\). We compute this coefficient exactly:

w <- fdCoef(acc.order = 4)
h2 <- 2^(0:5)
h  <- rep(1, 6)
hm <- h * c(1, -1)
h2m <- h2 * c(1, -1)
coef.tab <- rbind(h2m, hm, h, h2) # Here, using rbind is more convenient
rownames(coef.tab) <- names(w$weights)
colnames(coef.tab) <- paste0("f'", seq_len(ncol(coef.tab)) - 1)
print(coef.tab * w$weights)
#>              f'0        f'1        f'2        f'3        f'4        f'5
#> x-2h  0.08333333 -0.1666667  0.3333333 -0.6666667  1.3333333 -2.6666667
#> x-1h -0.66666667  0.6666667 -0.6666667  0.6666667 -0.6666667  0.6666667
#> x+1h  0.66666667  0.6666667  0.6666667  0.6666667  0.6666667  0.6666667
#> x+2h -0.08333333 -0.1666667 -0.3333333 -0.6666667 -1.3333333 -2.6666667
print(colSums(coef.tab * w$weights))
#>           f'0           f'1           f'2           f'3           f'4 
#>  4.163336e-17  1.000000e+00 -1.665335e-16  4.440892e-16 -9.992007e-16 
#>           f'5 
#> -4.000000e+00

This calculation confirms that there is no \(f'''(x)\) in the expression for the difference-based approximation. We apply the same technique due to the presence of different step sizes and the non-applicability of the GIVT: \[ \frac{1}{12} f(x-2h) - \frac23 f(x-h) + \frac23 f(x+h) - \frac{1}{12} f(x+2h) = \\ h f'(x) + \frac{h^5}{120}\left(-\frac83f^{(V)}(x-2\alpha_1 h) + \frac23 f^{(V)}(x-\alpha_2 h) + \frac23 f^{(V)}(x+\alpha_3 h) - \frac83 f^{(V)}(x+2\alpha_4 h) \right) \] Some terms in the brackets can still be grouped: \[ -\frac83 f^{(V)}(x-2\alpha_1 h) + \frac23 f^{(V)}(x-\alpha_2 h) + \frac23 f^{(V)}(x+\alpha_3 h) - \frac83 f^{(V)}(x+2\alpha_4 h) = -\frac{16}{3} f^{(V)}(c_1) + \frac43 f^{(V)}(c_2), \] where \(x-2h \le c_1 \le x+2h\) and \(x-h \le c_2 \le x+h\). An upper bound through approximation is available:

\[ |af^{(V)}(c_2) - bf^{(V)}(c_1)| \le a|f^{(V)}(c_2)| + b|f^{(V)}(c_1)|, \quad |f^{(V)}(c_1)| \approx |f^{(V)}(c_2)| \approx |f^{(V)}(x)|, \] therefore, the difference in the numerator is less than or equal to \(\frac{\frac{20}3 |f^{(V)}(x)|}{120} h^5\), and the truncation error is bounded by the same divided by \(h\): \[ |f'(x) - f'_{\mathrm{CD}_4}(x)| \approx \frac{|f^{(V)}(x)|}{18} h^4. \]

The rounding error is bounded by the weighted value of \(|f(x)|\epsilon_{\mathrm{m}}\): \[ \sum_{i} [w_i f(x_i)] - \sum_{i} w_i f(x_i) \le \sum_i |w_i f(x_i)| \cdot 0.5\epsilon_{\mathrm{m}}, \] and since the terms in the numerator are approximately equal to \(f(x)\), the absolute rounding error is bounded by \(0.75 |f(x)| \epsilon_{\mathrm{m}}\):

0.5*sum(abs(w$weights))
#> [1] 0.75

The total error minimisation problem is similar to the ones solved before: \[ \varepsilon(h) = \frac{|f^{(V)}(x)|}{18} h^4 + \frac{0.75 |f(x)| \epsilon_{\mathrm{m}}}{h} \] \[ \varepsilon'(h) = 0 \quad \Rightarrow \quad \frac{2|f^{(V)}(x)|}{9} h^3 = \frac{0.75 |f(x)| \epsilon_{\mathrm{m}}}{h^2} \quad \Rightarrow \quad h^{*}_{\mathrm{CD}_4} = \sqrt[5]{\frac{27 |f(x)| \epsilon_{\mathrm{m}}}{8 |f^{(V)}(x)|}} \propto \sqrt[5]{\epsilon_{\mathrm{m}}} \]

The reduction of the truncation error from the Taylor series allows for a larger step size for better rounding-error control. The total error is therefore of the order \(O(\epsilon_{\mathrm{m}}^{4/5})\), which translates into 1–2 more accurate decimal digits compared to the second-order-accurate derivative: \(\log_{10} (\epsilon_{\mathrm{m}}^{2/3} / \epsilon_{\mathrm{m}}^{4/5}) \approx 2\). Of course, this is true only if the guessed value of \(|f^{(V)}|\) is reasonable.
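Reusing the fourth-order weights w computed above, a comparison for \(f = \exp\) at \(x = 1\) (an example where \(|f^{(V)}/f| = 1\) holds exactly):

f <- exp
x <- 1
h2 <- (1.5 * .Machine$double.eps)^(1/3)       # optimal step for CD2, about 6.9e-6
h4 <- (27/8 * .Machine$double.eps)^(1/5)      # optimal step for CD4, about 9.4e-4
cd2 <- (f(x + h2) - f(x - h2)) / (2 * h2)
cd4 <- sum(w$weights * f(x + w$stencil * h4)) / h4
print(abs(c(CD2 = cd2, CD4 = cd4) - exp(x)))  # CD4 typically gains 1-2 digits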

3.6 General derivative and accuracy orders

Although calculation of derivatives of order greater than 4 is of purely theoretical interest, we provide a rule of thumb for bandwidth choice and the algorithm to compute the coefficients on the error components.

The preceding derivations yield a pattern: every time a higher-order derivative is required, the power of \(h\) in the denominator increases by 1, and the power of \(h\) in the numerator remains at~2 because the central differences in question are second-order-accurate. For the \(m\)th derivative of accuracy order \(a=2\), the optimal step size \(h^*_m\) is \(c_m \sqrt[m+2]{\epsilon_{\mathrm{m}}}\). The rough approximation \(c_m \approx 1\) yields \(h^*_m \approx \sqrt[m+2]{\epsilon_{\mathrm{m}}}\).

A slightly finer version relying on analytical expressions is as follows. When accuracy of order 2 is requested, the Taylor-series truncation error is \(O(h^2)\), and the multiplier on \(h^2\) depends on the \((m+2)\)th derivative of \(f\). For step sizes \(0, \pm h, \pm 2h, \ldots\) and weights \(w_{0}, w_{\pm 1}, w_{\pm 2}, \ldots\), the coefficients on the non-vanishing terms are computed as follows.

m <- 7  # Example order
w <- fdCoef(m)
k <- sum(w$stencil > 0) # ±h, ±2h, ..., ±kh
num.pos <- sapply(1:k, function(i) i^(0:(m+2)))
num.neg <- apply(num.pos, 2, function(x) x * c(1, -1))
num.neg <- num.neg[, rev(seq_len(ncol(num.neg)))]
nz <- abs(w$stencil) > 1e-12 # Stencil points away from the centre (their weights are the non-zero ones)
coef.tab <- sweep(cbind(num.neg, num.pos), 2, w$weights[nz], "*")
rownames(coef.tab) <- paste0("f'", seq_len(nrow(coef.tab))-1)
colnames(coef.tab) <- names(w$weights[nz])
print(coef.tab)
#>         x-4h   x-3h  x-2h x-1h x+1h x+2h   x+3h     x+4h
#> f'0     -0.5      3    -7    7   -7    7     -3      0.5
#> f'1      2.0     -9    14   -7   -7   14     -9      2.0
#> f'2     -8.0     27   -28    7   -7   28    -27      8.0
#> f'3     32.0    -81    56   -7   -7   56    -81     32.0
#> f'4   -128.0    243  -112    7   -7  112   -243    128.0
#> f'5    512.0   -729   224   -7   -7  224   -729    512.0
#> f'6  -2048.0   2187  -448    7   -7  448  -2187   2048.0
#> f'7   8192.0  -6561   896   -7   -7  896  -6561   8192.0
#> f'8 -32768.0  19683 -1792    7   -7 1792 -19683  32768.0
#> f'9 131072.0 -59049  3584   -7   -7 3584 -59049 131072.0

This table shows the contribution of each derivative to the numerator of the approximation. By construction, the row sums are zero for every order up to \(m+1\) except the \(m\)th one: all initial terms other than the \(m\)th derivative must be zeroed out, and the \(m\)th row sums to \(m! = 5040\):

rs <- rowSums(coef.tab)
print(round(rs, 4))
#>    f'0    f'1    f'2    f'3    f'4    f'5    f'6    f'7    f'8    f'9 
#>      0      0      0      0      0      0      0   5040      0 151200

The coefficient on the \((m+2)\)th derivative is also non-zero, but, as we established before, it underestimates the true upper bound. Due to the non-applicability of the GIVT with different step sizes, we add up the absolute values of the expansion coefficients. Finally, this value must be divided by the denominator – the factorial. This yields the coefficient on \(h^2\):

print(c1 <- sum(abs(coef.tab[nrow(coef.tab), ])) / factorial(m+2))
#> [1] 1.067637

The second part, the rounding error, can be bounded from above by the sum of absolute values of the coefficients in the numerator. The coefficient on \(|f(x)| \epsilon_{\mathrm{m}}\) is

print(c2 <- 0.5*sum(abs(w$weights)))
#> [1] 17.5

The expression below is minimised w.r.t. \(h\); solving the FOC is easy because both components are power functions: \[ \varepsilon(h) = c_1 |f^{(m+2)}(x)| h^2 + c_2 |f(x)| \epsilon_{\mathrm{m}} \frac{1}{h^m} \] \[ \varepsilon'(h) = 0 \quad \Rightarrow \quad 2c_1 |f^{(m+2)}(x)| h = m c_2 |f(x)| \epsilon_{\mathrm{m}} \frac{1}{h^{m+1}} \quad \Rightarrow \quad h = \sqrt[m+2]{\frac{mc_2 |f(x)| \epsilon_{\mathrm{m}}}{2c_1 |f^{(m+2)}(x)|}} \]

In general, the optimal step size for the desired accuracy order \(a\) (even) and derivative order \(m\) is proportional to \(\sqrt[a+m]{\epsilon_{\mathrm{m}}}\).
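Plugging the coefficients c1 and c2 computed above for \(m = 7\) into this formula (assuming \(|f(x)| = |f^{(m+2)}(x)| = 1\)) gives the corresponding rule-of-thumb step size:

print(round((m * c2 * .Machine$double.eps / (2 * c1))^(1/(m + 2)), 4))
#> [1] 0.0286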

The accuracy loss due to the guessed higher-order derivative value depends on the exact values of \(h\) and \(h^*\). As shown in Section 3.4 of Mathur (2013), the slope of the total-error function is not equal for over- and under-estimated optimal step sizes: in logarithmic axes, for \(h > h^*\), the slope is positive and is equal to \(a\) (accuracy order), and for \(h < h^*\), it is negative and equal to \(-m\) (differentiation order). For higher-order-accurate first derivatives, the optimal step size must be larger than the optimal step size for \(a=2\), but not by much, lest the truncation-related part of the error should explode, which poses a problem that can be solved via data-driven methods. A viable solution is proposed in the same thesis and is described in a dedicated section below.

3.7 Cross-derivatives

4 Gradients, Jacobians, and Hessians

4.1 Numerical gradients

4.2 Numerical Jacobians

4.3 Numerical Hessians and cross-derivatives

The Hessian matrix for a function \(f\) is denoted by \(H_f\) and is defined as the square matrix of second-order partial derivatives of \(f\): \[ H_f (x) = \nabla^2 f(x) = \left\{ \frac{\partial^2}{\partial x_i \partial x_j} f(x)\right\}_{i,j=1}^{\dim x} = \begin{pmatrix} \frac{\partial^2 }{\partial x_1^2} & \cdots & \frac{\partial^2 }{\partial x_1\,\partial x_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 }{\partial x_p\,\partial x_1} & \cdots & \frac{\partial^2 }{\partial x_p^2} \end{pmatrix} f(x) \]

The Hessian can be expressed more concisely as the transposed Jacobian of the gradient: \[ H_f (x) = J^T (\nabla f(x)) = \begin{pmatrix} \partial / \partial x_1\ (\nabla f(x))^T \\ \vdots \\ \partial / \partial x_p\ (\nabla f(x))^T \end{pmatrix} \]

The simplest approach to compute Hessian elements numerically is to calculate finite differences of finite differences w.r.t. two indices. For each \(i=1,\ldots,k\), define \(h_i := \begin{pmatrix}0 & \ldots & 0 & \underbrace{h}_{i^{\text{th}} \ \text{position}} & 0 & \ldots & 0 \end{pmatrix}'\) as the length-\(k\) vector where the only non-zero element is in the \(i\)th coordinate. Then, 4 evaluations of \(f\) are required to compute \(H_{ij}\) via central differences: \[ (\nabla f(x))^{(i)} := \nabla_i f(x) \approx \frac{f(x + h_i) - f(x - h_i)}{2 h} \] \[\begin{equation}\label{eq:nhess} H_{ij} \approx \tfrac{\nabla_i f(x + h_j) - \nabla_i f(x - h_j)}{2 h} \approx \tfrac{f(x + h_i + h_j) - f(x - h_i + h_j) - f(x + h_i - h_j) + f(x - h_i - h_j)}{4h^2} \end{equation}\] If \(f(x)\) is computationally costly, using forward differences \(\frac{f(x + h_i) - f(x)}{h}\) requires fewer evaluations and sacrifices accuracy for speed gains.
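A minimal illustration of the cross-difference \(\eqref{eq:nhess}\), using a hypothetical function \(f(x) = x_1^2 x_2^3\) whose cross-derivative \(\partial^2 f / \partial x_1 \partial x_2 = 6 x_1 x_2^2\) is known analytically:

f <- function(x) x[1]^2 * x[2]^3   # hypothetical test function
x <- c(1, 2)
h <- .Machine$double.eps^(1/4)     # a step of the order discussed below
hi <- c(h, 0); hj <- c(0, h)
H12 <- (f(x + hi + hj) - f(x - hi + hj) - f(x + hi - hj) + f(x - hi - hj)) / (4 * h^2)
print(H12)                         # analytical value: 6 * 1 * 2^2 = 24
#> [1] 24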

This weighted sum is similar to the expression for second derivatives (\(f''_{\mathrm{CD}}(x)\)), but derivation of the truncation error in this case becomes notationally more cumbersome. By analogy, we expect the truncation error to be of the second order in terms of the power of \(h\) and of the fourth order of the cross-derivative of \(f\). Therefore, the optimal step size for Hessian computation may be safely assumed to be of the order \(\sqrt[4]{\epsilon_{\mathrm{m}}}\). (Intuitively, a larger step size is required because the division in this case is by \(h^2\)).

The main diagonal of the Hessian, \(H_{ii}\), contains second derivatives of \(f\), and the result above for functions of scalar arguments applies: \(h^{**}_{\mathrm{CD}_2} \propto \sqrt[4]{\epsilon_{\mathrm{m}}}\).

The \(f''\) column contributes non-zero off-diagonal elements, and all the terms containing \(f'''\) cancel out, which can be verified numerically:

# For a given pair of step signs, enumerate all length-`ord` sign sequences --
# one per mixed partial derivative of order `ord` in the Taylor expansion
getCoefs <- function(x, ord) do.call(expand.grid, replicate(ord, x, simplify = FALSE))
rowProd <- function(x) apply(x, 1, prod)  # sign attached to each mixed partial
getMults <- function(ord) {
  # Signs of (h_i, h_j) at the points x+h_i+h_j, x-h_i+h_j, x+h_i-h_j, x-h_i-h_j
  steps  <- list(c(1, 1), c(-1, 1), c(1, -1), c(-1, -1))
  coefs <- lapply(steps, getCoefs, ord = ord)
  signs <- lapply(coefs, rowProd)
  # Combine with the weights +1, -1, -1, +1 of the four-term cross-difference numerator
  mults <- signs[[1]] - signs[[2]] - signs[[3]] + signs[[4]]
  ntab <- expand.grid(replicate(ord, c("i", "j"), simplify = FALSE))
  names(mults) <- apply(ntab, 1, paste0, collapse = "")
  return(mults)
}

print(getMults(2))
#> ii ji ij jj 
#>  0  4  4  0
print(getMults(3))
#> iii jii iji jji iij jij ijj jjj 
#>   0   0   0   0   0   0   0   0

The terms with \(f\) and \(f'\) disappear, and the 4-term sum becomes \[ \frac{f(x + h_i + h_j) - f(x - h_i + h_j) - f(x + h_i - h_j) + f(x - h_i - h_j)}{4h^2} = \frac{\frac{1}{2}h^2 (4H_{ij} + 4H_{ji}) + O(h^4)}{4h^2}, \] therefore, \[ H_{ij} - \hat H_{ij} = O(h^2). \] For large steps, \(h \gg 0\), the dominant term in the total error of \(\nabla_i \nabla_j f(x)\) has the same order as the dominant term of \(\nabla_i \nabla_i f(x) = f''_{ii}(x)\). The multipliers on \(h^2\) and \(1/h^2\) in the expression for the total error change, but the order of the optimal step size is still \(h^* \propto \sqrt[4]{\epsilon_{\mathrm{m}}}\).

The exact expression for the truncation term depends on the sum of fourth-order cross-derivatives. The following code calculates which fourth derivatives enter the expression for the truncation error using the same logic (we skip the tedious derivation):

mults4 <- getMults(4)
print(mults4[mults4 != 0])
#> jiii ijii iiji jjji iiij jjij jijj ijjj 
#>    4    4    4    4    4    4    4    4

This implies that the error term is determined solely by \(\frac{\partial^4 f}{\partial x_i^3 \partial x_j}\) and \(\frac{\partial^4 f}{\partial x_i \partial x_j^3}\) owing to the symmetry of cross-derivatives: \(f''''_{jiii} = f''''_{ijii} = f''''_{iiji} = f''''_{iiij}\). The exact value of the higher-order coefficient is thus \[ O(h^4) = \frac{h^4}{24} (4 f''''_{jiii}(x + c_1) + 4f''''_{ijjj}(x+c_2)) \approx \frac{h^4}{6} (f''''_{jiii}(x) + f''''_{ijjj}(x)), \] where \(c_1, c_2 \in (-h, h)\) are some constants. After dividing by \(4 h^2\), we get the truncation error equal to \(\frac{h^2}{24} (f''''_{jiii}(x) + f''''_{ijjj}(x))\) that can be plugged into the expression for the total error.

The rounding error \([f([x \pm h_i \pm h_j])] - f(x \pm h_i \pm h_j)\) from 4 function evaluations is approximately bounded by \(4 \times 0.5 |f(x)| \epsilon_{\mathrm{m}}\).

\[ \varepsilon(h) = \frac{|f''''_{ijjj}(x) + f''''_{iiij}(x)|}{24} h^2 + \frac{0.5 |f(x)| \epsilon_{\mathrm{m}}}{h^2} \] \[ \varepsilon'(h) = 0 \quad \Rightarrow \quad \frac{|f''''_{ijjj}(x) + f''''_{iiij}(x)|}{12} h = \frac{|f(x)| \epsilon_{\mathrm{m}}}{h^3} \quad \Rightarrow \quad h = \sqrt[4]{\frac{12 |f(x)| \epsilon_{\mathrm{m}}}{ |f''''_{ijjj}(x) + f''''_{iiij}(x)|}} \propto \sqrt[4]{\epsilon_{\mathrm{m}}} \]

We recommend that the main diagonal of the Hessian be computed separately using \(h^{**}_{\mathrm{CD}_2}\) (or, even better, \(h^{**}_{\mathrm{CD}_4}\)) estimated for each coordinate, and each off-diagonal element \(H_{ij}\) using data-driven procedures described in the section below (e.g. ‘AutoDX’).

4.4 Exploiting the Hessian symmetry

In theory, by the symmetry of second derivatives, if all partial derivatives are differentiable, then, the crossed partials are equal (see the proof of Theorem 3.3.8, ‘Equality of crossed partials’, in Hubbard and Hubbard (2015)). This is known as Schwarz’s theorem, and the usual requirement of continuity may be weakened (because it implies differentiability).

In practice, however, the symmetry of the numerical Hessian depends on how it was computed. If the formula~\(\eqref{eq:nhess}\) is used, then, symmetry is guaranteed because the terms of the finite sum are identical: \([H_{ij}] = [H_{ji}]\). This can be exploited to reduce computation time: if \(\dim x = k\), calculate only the upper triangular part and reflect it, carrying out only \(k(k+1)/2\) operations instead of \(k^2\). This is the default behaviour of the Hessian() function from this package. If instead, the implementation carries out repeated Jacobian calculation for the numerical gradient, then, there is no guarantee that \(\nabla_j [\nabla_i f(x)]\) be equal to \(\nabla_i [\nabla_j f(x)]\). This is, e.g., the case with the optimHess() function from base R: it calls the gradient in a loop over coordinates. To ensure the symmetry in this case, the numerical Hessian should be averaged with its transpose:
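A hedged sketch of this averaging, using stats::optimHess() on a hypothetical test function (any small asymmetry present in the raw result is removed by the transpose trick):

f <- function(x) exp(x[1]) * sin(x[2]) + x[1] * x[2]^2   # hypothetical function
H <- optimHess(c(0.5, 1), f)
max(abs(H - t(H)))        # asymmetry of the raw result, possibly non-zero
H.sym <- (H + t(H)) / 2   # symmetrised Hessian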

This approach is standard in applications where the matrix of interest has to be positive semi-definite, e.g. autocorrelation-consistent estimation of variances due to Newey and West (1987).

If the Hessian is computed via repeated differences of the numerical gradient, the assumption \(|[H_{ij}] - [H_{ji}]| \approx 0\) may be used to reduce the number of elements to compute from \(k^2\) to \(k(k+1)/2\) because only the upper triangular part of the matrix of partial derivatives needs to be computed: \[ H_{ij} \approx \bigl[ J_j([\nabla_i f(x)]) \bigr] \]

From the computational point of view, approximating the Hessian by applying first differences to the coordinates of the numerical gradient is suboptimal in terms of numerical accuracy: the pairs of values \(\{ f(x + h_i + h_j), f(x - h_i + h_j)\}\) and \(\{ f(x + h_i - h_j), f(x - h_i - h_j)\}\) are collapsed into the single values \([\nabla_i f(x + h_j)]\) and \([\nabla_i f(x - h_j)]\). If the Hessian is evaluated in a time-saving loop over the upper triangular part (as is done, e.g., in numDeriv:::genD.default() by Gilbert and Varadhan (2019)), the rounding error from \([\nabla_i f(x + h_j)]\) is copied over to \(H_{ji}\). Storing the evaluated values \(f(x + h_i + h_j)\) and \(f(x - h_i + h_j)\) separately instead of collapsing them into the numerical gradient eliminates the problem \([H_{ij}] = \bigl[ J_j([\nabla_i f(x)]) \bigr] \ne \bigl[ J_i([\nabla_j f(x)]) \bigr] = [H_{ji}]\) because the numerators of the non-simplified differences of the first differences are then identical.
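
The asymmetry and its simple remedy can be demonstrated with a minimal base-R sketch; numGrad() and numHess() below are hypothetical helpers written for this illustration only (not the package implementation), with rule-of-thumb steps hard-coded for brevity:

numGrad <- function(f, x, h = 6.1e-6)    # central-difference gradient, step ~ eps_m^(1/3)
  sapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, h)
    (f(x + e) - f(x - e)) / (2*h)
  })
numHess <- function(f, x, h = 1.2e-4)    # Jacobian of the numerical gradient, step ~ eps_m^(1/4)
  sapply(seq_along(x), function(j) {
    e <- replace(numeric(length(x)), j, h)
    (numGrad(f, x + e) - numGrad(f, x - e)) / (2*h)
  })
f <- function(x) exp(x[1]) * x[2]^3 + sin(x[1] * x[2])
H <- numHess(f, c(0.7, 1.2))
max(abs(H - t(H)))        # typically small but non-zero: differencing lossy gradients
Hsym <- 0.5 * (H + t(H))  # symmetrised estimate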

The current version of the package does not implement this implicit symmetrisation via direct calculation of all four-point differences. However, the user can choose between the quicker option (no symmetrisation) and the more accurate one (averaging the full matrix with its transpose).

5 Common issues with numerical derivatives

5.1 Stencil choice

Early literature (Jordan (1928), Aitken (1938)), being focused on interpolation techniques, distinguished two cases: equidistant stencils symmetric around zero (\(\pm1/2\), \(\pm3/2\), \(\pm5/2\), …) and equidistant stencils containing zero (0, \(\pm1\), \(\pm2\), …). The former was more economical because, as we previously showed, two-term central differences are more accurate than one-sided ones, and minimising the number of terms reduced the time spent by the computers of the day (humans performing the calculations). Nowadays, it is possible to obtain interpolation coefficients for arbitrary stencils.
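
For instance, the weights for an arbitrary stencil can be obtained by solving a small Vandermonde system (cf. Björck and Pereyra (1970); Fornberg (1988) gives a stable recursion). The base-R sketch below is for illustration only and is not the package interface:

# Weights w such that sum_j w_j f(x + s_j h) = h^m f^(m)(x) + higher-order terms,
# obtained from the conditions sum_j w_j s_j^k = k! * (k == m), k = 0, ..., length(s) - 1
fdWeights <- function(s, m = 1) {
  n <- length(s)
  A <- t(outer(s, 0:(n-1), "^"))               # Vandermonde-type matrix: A[k+1, j] = s_j^k
  b <- ifelse(0:(n-1) == m, factorial(m), 0)
  solve(A, b)
}
fdWeights(c(-3, -1, 1, 3) / 2)   # half-integer stencil:  1/24, -27/24, 27/24, -1/24
fdWeights(c(-2, -1, 1, 2))       # integer stencil:       1/12,  -8/12,  8/12, -1/12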

In terms of step size, the stencil (\(\pm1/2\), \(\pm3/2\)) implies tripling the distance from zero for the outermost points, whereas the step size on the stencil (\(\pm1\), \(\pm2\)) is only doubled. This distinction raises the question: since both approaches are fourth-order accurate, which stencil yields a smaller truncation error?

Comparing the stencils \((-2, -1, 1, 2)\) and \((-3, -1, 1, 3)\) directly is not fair because the Taylor remainder grows with the largest step size. The inner points eliminate the first-order terms and leave a second-order truncation error \(O(h^2)\); the outer points eliminate the second-order truncation error and leave the fourth-order remainder \(O(h^4)\), which is larger for larger \(h\). It is tempting to impose a normalisation for comparing stencils of the form \((-1-s, -1+s, 1-s, 1+s)\) because in this case, the average step size is one. Setting \(s=1/2\) yields \((\pm1/2\), \(\pm3/2)\) with step tripling, and \(s=1/3\) yields \((\pm2/3\), \(\pm4/3)\) with step doubling. However, this approach is also problematic because larger values of \(s\) result in near-zero inner step sizes and, therefore, near-zero higher-order Taylor remainders.

A proper analysis of the optimal stencil requires considering the total error, including the machine-rounding term, which increases for small step sizes.

5.2 Handling very small and very large arguments

Usually, the step size is proportional to the value of the argument, i.e. \(h \propto x \sqrt[c]{\epsilon_{\mathrm{m}}}\), where \(c\) depends on the derivative and accuracy order (see above). However, for \(x\approx 0\), this may result in \([[x]+[h]] = [x]\). To avoid this, the default behaviour in many software implementations, including numDeriv, is to use a fixed \(h\) for \(|x| < \delta\). E.g. numDeriv uses h = eps = 1e-4 in place of h = d*x if x < zero.tol; the default method arguments are eps = d = 1e-4 and zero.tol = sqrt(.Machine$double.eps/7e-7) ≈ 1.781e-5.
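
The degenerate case is easy to reproduce: with a purely relative step and a zero argument, both evaluation points coincide, which is why a fixed fall-back step is indispensable.

x <- 0
h <- x * .Machine$double.eps^(1/3)   # purely relative step: h is exactly zero
x + h == x                           # TRUE: the two evaluation points coincide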

For iterative step-size search procedures, Stepleman and Winarsky (1979) suggest a reasonable workaround to prevent this behaviour: use \(h = 0.04\sqrt[c]{\epsilon_{\mathrm{m}}}\) as the starting value of the step search. In practice, one may take any reasonable value suitable for the application and supply a priori information about the function and the meaningful distinction between zero and non-zero values. In certain applications, only positive parameter values are allowed, which makes zero the boundary of the parameter space – in such cases, prescriptions such as ‘take the relative step size \(10^{-4}\)’ are more reliable than ‘take the absolute step size \(10^{-5}\)’.

Example. In GARCH models, the constant \(\omega\) in the conditional-variance equation \(\sigma^2_t = \omega + \alpha \varepsilon^2_{t-1} + \beta \sigma^2_{t-1}\) is usually very small (often in the range \([10^{-7}, 10^{-3}]\)). If \(\omega = 10^{-6}\), using the absolute step size \(h = 10^{-5}\) (reasonable in most applications) would result in wildly inaccurate numerical derivatives or even invalid variance values. By definition, \(\sigma^2_t > 0\), but a negative value of \(\omega - h\) may create negative values in the generated series of \(\sigma^2_t\), which is statistically meaningless. In the best case, the estimated numerical differences will be NA; in the worst case, the function computing the log-likelihood of such a model will throw an error.
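
A minimal sketch with hypothetical parameter values shows how an absolute step larger than \(\omega\) immediately breaks the log-likelihood:

omega <- 1e-6                 # hypothetical GARCH constant
h <- 1e-5                     # absolute step exceeding omega itself
c(omega + h, omega - h)       # the left evaluation point is negative
log(omega - h)                # NaN with a warning; any log(sigma^2_t) computed from a
                              # negative variance fails in the same way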

It is not uncommon to use \((0, \ldots, 0)\) as the starting value in numerical optimisation; in this case, knowledge about the expected reaction of \(f(x)\) to changes in each coordinate of \(x\) is necessary to avoid disproportionately large truncation or rounding errors. Therefore, where possible, the researcher should either take into account the curvature of the function with respect to the arguments that are too small or use heuristic search procedures to determine the optimal step size.

5.3 Derivatives of noisy functions

The very definition of a derivative involves differentiability, which sounds tautological, but this assumption is often violated in applied research. In fact, the computer representations \([x+h]\) and \(\hat f([x+h])\) are discontinuous in \(h\) due to the finite machine precision (the number of bits in the mantissa). The machine epsilon is the number such that \([1 + \delta] = 1\) for \(|\delta| \le \epsilon_{\mathrm{m}}/2\); therefore, if \(x = [x]\), then \(f([x(1 + \delta)]) = f(x)\) for \(|\delta| \le \epsilon_{\mathrm{m}}/2\). Hence, the core problem of numerical differentiation is working with discontinuous functions for which \(\lim_{h\to 0} \frac{\hat f([x+h]) - \hat f([x])}{h} = 0\), whilst the derivative of the noiseless function \(f\) – as if the machine had infinite precision – has to be computed with the maximum achievable accuracy.
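
This discreteness is easy to observe directly:

eps.m <- .Machine$double.eps
c(1 + eps.m/2 == 1, 1 + eps.m == 1)   # TRUE FALSE: perturbations up to eps_m/2 are lost
sin(1 + eps.m/2) == sin(1)            # TRUE: the function value cannot change either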

So far, we have considered the ‘best’ case, assuming that the relative rounding error of \(\hat f([x+h])\) does not exceed \(\epsilon_{\mathrm{m}}/2\), i.e. \(|\hat f([x+h]) - f(x+h)| \le 0.5 |f(x+h)| \epsilon_{\mathrm{m}}\). However, it is quite common to have objective functions that depend on the values of inaccurate internal numerical routines. Usually, such routines involve optimisation, integration, root search, or computation of derivatives. Depending on the convergence tolerance (and the ensuing error magnitude) of the inner routines, small changes in the input parameter of the outermost objective function might lead to abrupt changes in the return values of the inner functions, which implies discontinuity and, therefore, non-differentiability.

More often than not, the total error of internal routines is outside the user’s control. Even such innocent actions as swapping two mathematical operations may change the result. Example: if the internal loop of empirical-likelihood maximisation replaces logarithms with their 4th-order Taylor expansions for small inputs, then the order of multiplication/division in computing the Taylor approximation affects the result! Consider computing \(\frac14 (x-t)^4 / t^4\) in two ways and getting two different results:

x   <- -0.2456605107847454  # 16 significant digits
t   <- 1/59
print(res1 <- (x-t)^4 * (t^-4 / 4), 17)  # multiplication by the reciprocal of 4*t^4
#> [1] 14407.574265883137
print(res2 <- (x-t)^4 / (t^4 * 4), 17)   # division by 4*t^4 directly
#> [1] 14407.574265883139
print(res1 - res2)
#> [1] -1.818989e-12

Two mathematically equivalent expressions, \((x-t)^4 \cdot (t^{-4} / 4)\) and \((x-t)^4 / (4t^4)\), differ in absolute terms by more than \(1000 \epsilon_{\mathrm{m}}\)! Evaluating \((t^{-4} / 4) - 1/(4t^4)\) yields exactly zero; however, the extra multiplication by \((x-t)^4\) loses some trailing accuracy bits. The magnitude of the objective function in this specific application is approximately 1, which is why the optimisation algorithm terminates after a different number of steps with the two different implementations of the Taylor series.

The consequence of this loss is an increase in the rounding-error variance and the need to take the absolute error magnitude of \(\hat f([x+h]) - f(x+h)\) into account; the relative error is still bounded by \(\epsilon_{\mathrm{m}}/2\), but when some intermediate terms of the computation blow up due to division by small quantities (in this example, \(4t^4 \approx 3.3\cdot 10^{-7}\)), the magnitude of the numerator changes. Usually, it is not feasible to pinpoint the most ill-conditioned operation in the chain of thousands or millions of operations taking place under the hood of present-day objective functions, but a rough measure of the ‘combined noisiness due to inaccurate internal routines’ can be calculated from certain characteristics of the input.

If computing \(f\) involves optimisation with a relative-tolerance-based stopping criterion rel.tol = 1e-8, the difference might be as bad as \(\epsilon_{\mathrm{m}} \cdot 10^8\)! The same argument applies to numerical integration, root solving, stochastic algorithms, and the like: if \(\hat f(x + h)\) is prone to deviating from \(f(x+h)\) due to extra losses occurring under the hood of \(f\), the bound on the rounding error must reflect this fact. It would be dangerous to assume \(|\varepsilon_{f,2} - \varepsilon_{f,1}| \le \epsilon_{\mathrm{m}}\) if changes of the inputs cause the internal routines of \(f\) to run differently due to the butterfly effect. If \(f\) relies on stochastic optimisation with relative tolerance \(2.22 \cdot 10^{-7}\) (not uncommon in popular algorithms) as the stopping criterion, then \(|\varepsilon_{f,2} - \varepsilon_{f,1}| \le 10^9 \epsilon_{\mathrm{m}}\), which is why the rule-of-thumb \(h^*\) needs to be blown up by a factor of 1000 (becoming 0.006) to obtain any meaningful information about the slope of the function! Incidentally, this is also why repeated differences for higher-order derivatives can be so unstable: central differences of well-behaved functions lose about 5 accurate digits, and one-sided differences about 8 (out of 16!).
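
The arithmetic behind this blow-up factor is straightforward; the sketch below assumes an inner-routine relative error bound of \(2.22\cdot 10^{-7}\):

eps.m   <- .Machine$double.eps       # 2.22e-16
noise   <- 2.22e-7                   # assumed relative error bound of the noisy objective
h.clean <- eps.m^(1/3)               # ~ 6.1e-6: rule-of-thumb CD_2 step for a noiseless f
h.noisy <- noise^(1/3)               # ~ 6.1e-3: the same rule with the inflated error bound
c(h.clean, h.noisy, h.noisy / h.clean)   # the ratio is approximately 1000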

The problem with function optimisation is that a relative tolerance used as the stopping criterion bounds only the improvement between successive iterations, not the distance between the returned objective value and the true optimum; the effective noise level of \(\hat f\) can therefore be many orders of magnitude larger than \(\epsilon_{\mathrm{m}}\).

5.4 Accuracy loss due to repeated differencing

So far, we have considered only the ‘ideal’ case where the diagonal elements of the Hessian are computed via the proper formula for second derivatives and the off-diagonal elements are computed in a single step. However, this approach, being the simplest to analyse, is cumbersome to implement. It is much more common to compute the Jacobian of the gradient (e.g. the popular implementation stats::optimHess follows this approach). This might incur extra accuracy loss because the numerator in this approach contains the difference of ‘lossy’ numerical derivatives, not of the original function values.

The effect of this error compounding can be illustrated by the following observation. Recall that two-sided differences are accurate only up to (approximately) 2/3 of the significant digits. Indeed, the relative error of \(\hat f([x+h])\) is at most \(\epsilon_{\mathrm{m}} / 2\), but \(|\hat f'_{\mathrm{CD}_2}([x+h]) - f'(x+h)| = O(h^2) = O(\epsilon_{\mathrm{m}}^{2/3})\) for the optimal \(h \propto \sqrt[3]{\epsilon_{\mathrm{m}}}\). The noisy \(\hat f'_{\mathrm{CD}_2} (x+h) = f'(x+h) + O(\epsilon_{\mathrm{m}}^{2/3})\) enters the numerator of the second differencing step and is multiplied by a \(1/h\) factor again. Therefore, rounding errors accumulate in the two-step differencing procedure to a greater extent than in the one-step one, and a dedicated analysis is required to gauge the order of magnitude of the accuracy loss.

6 Complex derivatives

If the function \(f(x)\) supports complex arguments and outputs, as is the case with many simple arithmetic and algebraic functions, then evaluating it at complex arguments allows one to obtain highly accurate numerical derivatives with fewer function evaluations.

The Taylor expansion for a function of a complex variable is \[ f(x + \mathrm{i}h) = f(x) + \mathrm{i}h f'(x) - \frac{h^2}{2} f''(x) - \mathrm{i} \frac{h^3}{3!} f'''(x) + \frac{h^4}{4!} f''''(x) + O(h^5). \] The imaginary part of this expression divided by \(h\) is \[ \frac{\Im f(x + \mathrm{i}h)}{h} = f'(x) - \frac{h^2}{3!} f'''(x) + O(h^4). \] One evaluation and one division yield the largest error term of the order \(O(h^2)\), and the overall accuracy does not suffer from catastrophic cancellation. The total approximation error of this method stays small even for small step sizes \(h\). See Squire and Trapp (1998) for numerical examples.
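
Since R supports complex arithmetic for many elementary functions, the complex-step derivative is easy to try; the following sketch (an illustrative function of our own choosing, not a package feature) compares it with a real central difference:

f  <- function(x) exp(x) / sqrt(x)                        # accepts complex arguments
df <- function(x) exp(x) * (1/sqrt(x) - 0.5 * x^(-1.5))   # analytical derivative
x0 <- 1.5
h  <- 1e-20                                               # tiny step, no subtractive cancellation
d.cs <- Im(f(complex(real = x0, imaginary = h))) / h      # complex-step estimate
d.cd <- (f(x0 + 1e-6) - f(x0 - 1e-6)) / 2e-6              # real central difference
abs(c(complex.step = d.cs, central.diff = d.cd) - df(x0)) # the former is typically accurate
                                                          # to near machine precision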

The choice of the optimal step size is not obvious in the general case because the user must choose not only \(h\) but also the direction (the power of \(\mathrm{i}\)). Martins, Sturdza, and Alonso (2003) give the following recommendation: choose \(h\) small enough that \(h^2 |f'''(x)/3!| < \varepsilon |f'(x)|\), where \(\varepsilon\) is the relative algorithm precision. If \(f(x) \approx 0\) or \(f'(x) \approx 0\), this condition might not hold; nevertheless, a tiny complex step such as \(h = 10^{-20}\) is a reasonable rule-of-thumb value.

Higher-order complex-step approximations also require fewer evaluations, but they are not free of subtractive cancellation, which is why \[ \frac{2[f(x) - \Re f(x+\mathrm{i}h)]}{h^2} = f''(x) - \frac{h^2}{12} f''''(x) + O(h^4) \] suffers from machine round-off to a certain degree.

The accuracy and convergence of the method can be improved by carefully choosing the complex-step angle: \[ f'(x) = \frac{\Im\{ f(x+\mathrm{i}^{2/3}h) - f(x+\mathrm{i}^{8/3}h)\} }{\sqrt{3} h} - \frac{h^4}{120} f^{(V)}(x) + \ldots \] \[ f''(x) = \frac{\Im\{ f(x+\mathrm{i}^{1/2}h) + f(x+\mathrm{i}^{5/2}h)\} }{h^2} - \frac{h^4}{360} f^{(VI)}(x) + \ldots \] Jacobians and Hessians can be evaluated similarly with fewer function calls. However, for small step sizes, they suffer from the round-off error just like their real-valued finite-difference counterparts. The benefit is that the complex-step method yields accurate matrix estimates over a much wider range of step sizes.
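
A sketch of the rotated-step second derivative for the same illustrative function (using \(\mathrm{i}^{5/2} = -\mathrm{i}^{1/2}\)) suggests that, at the same step size, it is typically far more accurate than the real central second difference:

f   <- function(x) exp(x) / sqrt(x)
d2f <- function(x) exp(x) * (1/sqrt(x) - x^(-1.5) + 0.75 * x^(-2.5))  # analytical f''
x0  <- 1.5
h   <- 1e-3
w   <- exp(1i * pi / 4)                                   # i^(1/2)
d2.complex <- Im(f(x0 + w*h) + f(x0 - w*h)) / h^2         # rotated complex step
d2.real    <- (f(x0 + h) - 2*f(x0) + f(x0 - h)) / h^2     # real central second difference
abs(c(complex = d2.complex, real = d2.real) - d2f(x0))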

Richardson extrapolation can be used to attain higher-order accuracy at the cost of more evaluations. Using 4 instead of 2 complex argument evaluations yields remarkable accuracy: \[ f''(x) = \frac{\Im \left\{ 64 [f(x+\frac{\sqrt{\mathrm{i}} h}{2}) + f(x-\frac{\sqrt{\mathrm{i}} h}{2})] - f(x+\sqrt{\mathrm{i}} h) - f(x-\sqrt{\mathrm{i}} h) \right\}}{15 h^2} + \frac{h^8 f^{(X)}(x)}{1\,814\,400}, \] i.e. roughly half a millionth of the tenth derivative times the eighth power of the tiny step!

See Lai and Crassidis (2008) for more details.

Note that this approach is not feasible in many applications where the function is not defined for complex inputs. In econometrics, densities are non-negative, probabilities are bounded, and the function in question is often the objective function of a non-linear model in which many more numerical procedures run under the hood (loops, numerical root searches, integration routines, data-driven hyper-parameter cross-validation, etc.), none of which allow complex inputs.

Currently, the complex-step method is not supported in the pnd package; the necessary derivations and an implementation may be added in the future to fully match the functionality of numDeriv.

7 TODO – Uncategorised

The results are comparable in terms of effective accuracy: e.g. the grid \(x-2h, x-h, x+h, x+2h\) from pnd corresponds to 4 Richardson iterations from numDeriv, with minor differences due to the initial step size choice.

Simulation set-up from Curtis and Reid:

# Test function from Curtis and Reid and its first three derivatives
f1 <- function(x) expm1(x)^2 + (1/sqrt(1+x^2) - 1)^2
df1 <- function(x) 2 * exp(x) * expm1(x) - 2*x * (1/sqrt(1+x^2) - 1) / (1 + x^2) / sqrt(1 + x^2)
ddf1 <- function(x) (6 * (1/sqrt(1 + x^2) - 1) * x^2)/(1 + x^2)^(5/2) + (2 * x^2)/(1 + x^2)^3 - (2 * (1/sqrt(1 + x^2) - 1))/(1 + x^2)^(3/2) + 2 * exp(2*x) + 2 * exp(x) * expm1(x)
dddf1 <- function(x) (18 * (1/sqrt(1 + x^2) - 1) * x)/(1 + x^2)^(5/2) + (6 * x)/(1 + x^2)^3 - (30 * (1/sqrt(1 + x^2) - 1) * x^3)/(1 + x^2)^(7/2) - (18 * x^3)/(1 + x^2)^4 + 6 * exp(2*x) + 2 * exp(x) * expm1(x)

# Signed power transformation to compress large values for plotting, and its inverse
squish <- function(x, pow = 1/2, shift = 1) ((abs(x) + shift)^pow - shift^pow) * sign(x)
unsquish <- function(y, pow = 1/2, shift = 1) ((abs(y) + shift^pow)^(1/pow) - shift) * sign(y)

# Plot the test function and its first three derivatives on the compressed scale
xgrid <- seq(-0.2, 1.5, 0.01)
par(mar = c(2, 2, 0, 0) + .1)
plot(xgrid, squish(f1(xgrid)), type = "l", ylim = squish(c(-1, 20)), bty = "n", lwd = 2, xlab = "", ylab = "", yaxt = "n")
lines(xgrid, squish(df1(xgrid)), col = 2)
lines(xgrid, squish(ddf1(xgrid)), col = 3)
lines(xgrid, squish(dddf1(xgrid)), col = 4)
axis(2, squish(ats <- c(-1, 0, 1, 2, 5, 10, 20)), labels = ats, las = 1)

8 References

Aitken, A. C. 1938. “The Application of Quadratic Extrapolation to the Evaluation of Derivatives, and to Inverse Interpolation.” Proceedings of the Royal Society of Edinburgh 58: 161–75. https://doi.org/10.1017/S0370164600011093.
Björck, Åke, and Victor Pereyra. 1970. “Solution of Vandermonde Systems of Equations.” Mathematics of Computation 24 (112): 893–903.
Dumontet, J., and J. Vignes. 1977. “Détermination Du Pas Optimal Dans Le Calcul Des Dérivées Sur Ordinateur.” RAIRO. Analyse Numérique 11 (1): 13–25. https://doi.org/10.1051/m2an/1977110100131.
Fornberg, Bengt. 1988. “Generation of Finite Difference Formulas on Arbitrarily Spaced Grids.” Mathematics of Computation 51 (184): 699–706. https://doi.org/10.1090/S0025-5718-1988-0935077-0.
Gerber, Florian, and Reinhard Furrer. 2019. “optimParallel: An R Package Providing a Parallel Version of the L-BFGS-B Optimization Method.” The R Journal 11 (1): 352–58. https://doi.org/10.32614/RJ-2019-030.
Gilbert, Paul, and Ravi Varadhan. 2019. numDeriv: Accurate Numerical Derivatives. https://CRAN.R-project.org/package=numDeriv.
Gill, Philip E., Walter Murray, and Margaret H. Wright. 1981. Practical Optimization. Academic Press. https://doi.org/10.1137/1.9781611975604.
Goldberg, David. 1991. “What Every Computer Scientist Should Know about Floating-Point Arithmetic.” ACM Computing Surveys 23 (1): 5–48. https://doi.org/10.1145/103162.103163.
Hubbard, John Hamal, and Barbara Burke Hubbard. 2015. Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach. 5th ed. Matrix Editions.
Jordan, Charles. 1928. “Sur Une Formule d’interpolation.” In Atti Del Congresso Internazionale Dei Matematici, edited by Nicola Zanichelli, 157–77.
Kostyrka, Andreï V. 2025. “What Are You Doing, Step Size: Fast Computation of Accurate Numerical Derivatives with Finite Precision.” Working paper.
Lai, K.-L., and J. L. Crassidis. 2008. “Extensions of the First and Second Complex-Step Derivative Approximations.” Journal of Computational and Applied Mathematics 219 (1): 276–93. https://doi.org/10.1016/j.cam.2007.07.026.
Li, Jianping. 2005. “General Explicit Difference Formulas for Numerical Differentiation.” Journal of Computational and Applied Mathematics 183 (1): 29–52. https://doi.org/10.1016/j.cam.2004.12.026.
Martins, Joaquim R. R. A., Peter Sturdza, and Juan J. Alonso. 2003. “The Complex-Step Derivative Approximation.” ACM Transactions on Mathematical Software 29 (3): 245–62. https://doi.org/10.1145/838250.838251.
Mathur, Ravishankar. 2012. “An Analytical Approach to Computing Step Sizes for Finite-Difference Derivatives.” PhD thesis, University of Texas at Austin. http://hdl.handle.net/2152/ETD-UT-2012-05-5275.
———. 2013. “Algorithm AAS 13-723: An Analytical Approach to Computing Step Sizes for Finite-Difference Derivatives.” Emergent Space Technologies, Inc. https://web.archive.org/web/20240527044243/https://www.emergentspace.com/wp-content/uploads/stepsize.pdf.
Newey, Whitney K., and Kenneth D. West. 1987. “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55 (3): 703. https://doi.org/10.2307/1913610.
Oliver, John, and Andrew Ruffhead. 1975. “The Selection of Interpolation Points in Numerical Differentiation.” BIT Numerical Mathematics 15 (3): 283–95. https://doi.org/10.1007/BF01933661.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sauer, Timothy. 2017. Numerical Analysis. 3rd ed. Pearson Education.
Squire, William, and George Trapp. 1998. “Using Complex Variables to Estimate Derivatives of Real Functions.” SIAM Review 40 (1): 110–12. https://doi.org/10.1137/s003614459631241x.
Stepleman, R. S., and N. D. Winarsky. 1979. “Adaptive Numerical Differentiation.” Mathematics of Computation 33 (148): 1257–64. https://doi.org/10.1090/s0025-5718-1979-0537969-8.

  1. The problem with the L-BFGS-B method in R is its reliance on finite function values; the BFGS method admits infinite function values during the line search, but its limited-memory counterpart with box constraints throws an error upon encountering NA, NaN, or \(\pm\infty\). This limits the scope of the optimParallel solution to a narrow family of well-behaved functions or functions modified by the user to replace non-finite values with an arbitrarily large number.↩︎

  2. numDeriv:::grad.default() prints out the Richardson extrapolation steps for first derivatives, but not cross-derivatives, which rely on numDeriv:::genD.default() instead.↩︎

  3. Other works (e.g. Goldberg (1991)) give a different definition of the machine epsilon – the relative error bound. In our notation, any real number is rounded to the closest FP number with relative error less than \(\epsilon_{\mathrm{m}}/2\).↩︎

  4. Some confusion may stem from the fact that the derivative order in the error term is 2, \(f''(x)\); the accuracy order relates to the step size power, not the derivative order in the Taylor expansion.↩︎

  5. We write ‘formal’ to distinguish between the desired accuracy and the finite machine precision discussed in the next section.↩︎