Category Archives: Statistics

Notes: Jonathan D. Cryer and Kung-Sik Chan, Time Series Analysis with Applications in R

This post is to record my notes on the first book on time series that I am reading seriously, namely the 2008 edition of Time Series Analysis with Applications in R by Jonathan D. Cryer and Kung-Sik Chan.

Chapter 2. Fundamental Concepts

  1. The model for a time series is a stochastic process, which is a sequence of random variables \((Y_t: t = 0, \pm 1, \pm 2, \ldots)\)
  2. Mean function of the time series \(Y_t\) is \(\mu_t := E(Y_t)\), where \(E\) denotes “expectation.”.
  3. Autocovariance \(\gamma_{t,s} := Cov(Y_t, Y_s)\), where \(Cov\) is the covariance defined as \(Cov(Y_t, Y_s) := E((Y_t – \mu_t)(Y_s – \mu_s)) = E(Y_tY_s) – \mu_t\mu_s.\)
  4. Autocorrelation \(\rho_{t,s} := Corr(Y_t, Y_s)\), where \(Corr\) is the correlation defined as \[Corr(Y_t, Y_s) := \frac{Cov(Y_t, Y_s)}{\sqrt{Var(Y_t)Var(Y_s)}} = \frac{\gamma_{t,s}}{\sqrt{\gamma_{t,t}\gamma_{s,s}}}\]
  5. A white noise is a time series \(e_t\) of iid random variables. The white noise processes in this book usually also has zero mean.
    Note: I think the above definition of a white noise is too restrictive even for the purpose of this book. According to Hamilton white noises do not have to be identically distributed, the requirements for \(e_t\) to be a white noise are: each \(e_t\) has the same (usually, zero) mean and the same (finite) variance, and that they have zero autocorrelation, i.e. \(E(e_te_s) = 0\) whenever \(t \neq s\). If in addition \(e_t, e_s\) are independent for all \(t \neq s,\) he calls it an independent white noise process.
  6. A random walk is a time series \((Y_t: t = 1, 2, \ldots ) \) defined as \[Y_t := \begin{cases} e_1 & \text{if}\ t = 1,\\ Y_{t-1} + e_t & \text{if}\ t \geq 2, \end{cases} \] where \(e_t\) is a white noise process with zero mean and finite variance. \(Y_1 = e_1\) is the “initial condition”. Specific realizations of random walks can show “trends” which are not really present.
  7. \(Y_t\) is said to be strictly stationary if for all \(n\) the joint distributions of \(Y_{t_1 – k}, Y_{t_2 – k}, \ldots, Y_{t_n – k}\) are the same for all choices of \(k, t_1, \ldots, t_n\). It is said to be weakly (or second-order) stationary if the mean \(\mu_t\) does not depend on \(t\), and \(\gamma_{t, t-k} = \gamma_{0, k}\) for all time \(t\) and lag \(k\); in that case we write \(\gamma_k\) for \(\gamma_{k, 0}\) and \(\rho_k\) for \(\rho_{k, 0}\). In this book stationary = weakly stationary.
    Note: if \(Y_t\) is weakly stationary, then \(\gamma_k = \gamma_{k,0} = \gamma_{0,k} = \gamma_{-k}\), and similarly \(\rho_k = \rho_{-k}\).

Chapter 3. Trends

  1. Discusses the deterministic trend of a time series, which is defined as the time series of its mean. In particular, in the notation of Chapter 2, the trend of a time series \(Y_t\) is its mean function \(\mu_t\).
    Note: May be sometimes it would be useful to consider to substitute “mean” by “median” or some other function? Well in that case, the difference between \(Y_t\) and “that function” of \(Y_t\) should have good analytic properties.
  2. If the model is \(Y_t = \mu + X_t\) with \(E(X_t) = 0\), then the most common estimate for \(\mu\) is the sample mean \[\bar Y := (\sum_{t=1}^n Y_t)/n.\] Assumptions on \(\rho_k\) can be used to estimate \[Var(\bar Y) = \frac{\gamma_0}{n}[1 + 2 \sum_{k=1}^{n-1}(1 – \frac{k}{n})\rho_k]\]
  3. Cyclical/seasonal trends with period \(M\) can be modeled by a seasonal means model \(Y_t = \mu_t + X_t\) with \(E(X_t) = 0\) and \(\mu_{t-M} = \mu_t\). The \(t\)-values and \(Pr(>|t|)\)-values (reported e.g. in the R-output) are not very interesting, since these correspond to the null hypothesis that \(\mu_t = 0\).
  4. Using a sine-cosine trend model e.g. \[Y_t = \beta_0 + \beta_1 \cos(\frac{2\pi}{M}t) + \beta_2 \sin(\frac{2\pi}{M}t)\] is more “economical” since uses fewer parameters than the seasonal means model, and this translates to smaller variance of estimates (under the hypothesis that the chosen model is correct). For better fitting also include terms of the form \(\cos(\frac{2k\pi}{M}t)\) and \(\sin(\frac{2k\pi}{M}t)\) for \(k > 1\).
  5. Residual analysis
    • Look at the plots of standardized residuals for pattern, e.g. whether too much/too few “runs” (runs test).
    • Standardized residuals vs fitted values, possibly using seasonal symbols for seasonal models.
    • Histogram.
    • Shapiro-Wilk test for normality (which measures the correlation of residuals vs the corresponding normal quantiles – the lower the number the more evidence against normality).
    • Normal quantile-quantile plot.
    • sample autocorrelation of standardized residuals.

Chapter 4. Models for Stationary Time Series

  1. A general linear process is a time series \(Y_t\) that can be represented as a weighted linear combination of present and past white noise terms \[Y_t = \sum_{j \geq 0} \psi_j e_{t – j}\] Without loss of generality one can assume \(\psi_0 = 1\). For the variance to be finite one requires \(\sum_{j \geq 0} \psi_j^2 < \infty\). A general linear process is stationary by construction.
    Note: Wold’s theorem says that any (weakly) stationary process \(Y_t\) admits a decomposition of the form \(Y_t = U_t + X_t\) with \(X_t\) a general linear process and \(U_t\) a purely deterministic stationary process (i.e. a stationary process which can be predicted with arbitrarily small mean squared error by some linear predictors of finitely many past lags of the process) such that \(U_t\) has zero correlation with each white noise component of \(X_t\).
  2. A moving average of order \(q\), in short, \(MA(q)\) process is a general linear process such that \(psi_j = 0\) for \(j > k.\) In this book one uses \(\theta\) as parameters for \(MA(q)\) processes: \[Y_t = e_t – \sum_{j=1}^q \theta_j e_{t-j}.\]
  3. Autocorrelations \(\rho_k\) can be examined by lag plots, i.e. plots of \(Y_t\) vs \(Y_{t-k}\).
  4. For an \(MA(q)\) process, \(\rho_k\) is zero for \(k > q.\)
  5. An autoregressive process of order \(p\), in short, \(AR(p)\) process is a time series of the form \[Y_t = \sum_{j = 1}^p \phi_j Y_{t-j} + e_t\] where \(e_t\), the innovation term, is independent of \(Y_{t-j}\) for each \(j \geq 1\). Also assume, at least in the beginning, that \(E(Y_t) = 0\) for each \(t.\)
    Note: The assumption \(E(Y_t) = 0\) does not impose any restriction (provided we are dealing with processes with finite mean): indeed, writing \(\mu_t\) for \(E(Y_t)\), we have \(\mu_t = \sum_{j=1}^p \phi_j \mu_{t-j} + E(e_t)\), so that replacing \(Y_t\) by \(Y’_t := Y_t – \mu_t\) gives \[Y’_t = \sum_{j=1}^p \phi_j Y’_{t-j} + e’_t\] where \(e’_t := e_t – E(e_t)\), so that \(Y’_t\) is an \(AR(p)\) process with \(E(Y’_t) = E(e’_t) = 0.\)
  6. The characteristic polynomial of the above \(AR(p)\) process \(Y_t\) is \[\phi(x) := 1 – \sum_{j = 1}^p \phi_j x^j\] and the corresponding characteristic equation is \[\phi(x) = 0.\]
  7. \(Y_t\) is stationary if and only if all the roots of the characteristic equation has modulus greater than 1.
    Note: to see this try to “solve” for \(Y_t\) by writing its defining equation as \[\phi(L)(Y_t) = e_t,\] where \(L\) is the “Lag” operator. If \[\phi(L) = \prod_{i=1}^p(1 – \alpha_iL)\], with \(\alpha_i \in \mathbb{C},\) the solution is \[Y_t = \prod_{i=1}^p(1 – \alpha_iL)^{-1}e_t\] which makes sense if each \(|\alpha_i| < 1,\) since then \((1 – \alpha_iL)^{-1}\) can be expanded as \(1 + \alpha_iL + \alpha_i^2L^2 + \cdots .\) Now note that the roots of \(\phi(x) = 0\) are precisely the \(\alpha_i^{-1}\).
  8. Multiplying the defining equation of the \(AR(p)\) process \(Y_t\) by \(Y_{t-k}\), taking expectation, dividing by \(\gamma_0\), and using \(\rho_0 = 1\) and \(\rho_{-k} = \rho_k\) yields the Yule-Walker equations \[\begin{align*} \rho_1 &= \phi_1 + \phi_2 \rho_1 + \phi_3 \rho_2 + \cdots + \phi_p \rho_{p-1} \\ \rho_2 &= \phi_1\rho_1 + \phi_2 + \phi_3 \rho_1 + \cdots + \phi_p \rho_{p-2} \\ &\vdots \\ \rho_p &= \phi_1\rho_{p-1} + \phi_2\rho_{p-2} + \phi_3 \rho_{p-3} + \cdots + \phi_p \end{align*}\] Given \(\phi_1, \ldots, \phi_p\), solving these linear equations gives \(\rho_1, \ldots, \rho_p\), and then iteratively all \(\rho_k\) can be found using the equation: \[\rho_k = \phi_1 \rho_{k-1} + \phi_2 \rho_{k-2} + \cdots + \phi_p \rho_{k-p}\]
  9. A stationary \(AR(p)\) process can be expressed as a general linear process (see the computation of \(\psi\)-coefficients of ARMA processes below).
  10. Autocorrelations of a stationary \(AR(p)\) process does not vanish, but rather “dies off” with increasing lags; the sign of autocorrelations might alternate with lags if correlation with one of the lags is negative and significant. One can try to separate the “damping factor”, “frequency” and “phase” of the autocorrelations (with respect to lags).
  11. A mixed autoregressive moving average process of orders \(p, q\), in short, \(ARMA(p, q)\) process is a time series of the form \[Y_t = \sum_{j = 1}^p \phi_j Y_{t-j} + e_t – \sum_{j=1}^q \theta_j e_{t-j}\] where \(e_t\) is independent of \(Y_{t-j}\) for each \(j \geq 1\). As in the case of \(AR\) processes, if \(E(Y_t)\) and \(E(e_t)\) are finite, then one can assume without loss of generality that \(E(Y_t) = E(e_t) = 0\) for each \(t\).
  12. Exactly as in the autoregressive case, \(Y_t\) is stationary if and only if each root of the characteristic equation \(1 – \sum_{j=1}^p \phi_j x^j = 0\) has modulus greater than \(1.\)
  13. If stationary, then \(Y_t\) can be represented as a general linear process as follows: write \[\psi_0e_t + \psi_1 e_{t-1} + \psi_2 e_{t-2} + \cdots = \phi_1 Y_{t-1} +\cdots + \phi_p Y_{t-p} +e_t – \theta_1 e_{t-1} – \cdots – \theta_q e_{t-q}\] Multiply both sides by \(e_t\), take expectation, and divide by \(\sigma^2_e\) to get \[\psi_0 = 1\] Then multiply both sides by \(e_{t-1}\), and do the same thing to get \[\psi_1 = \phi_1 – \theta_1\] Continuing with \(e_{t-2}\) gives \[\psi_2 = \phi_1(\phi_1 – \theta_1) + \phi_2 – \theta_2 = \phi_1\psi_1 + \phi_2 – \theta_2 \] and so on.

Chapter 5. Models for Nonstationary Time Series

  1. If \(Y_t = \phi Y_{t-1} + e_t\) and \(e_t\) is uncorrelated to \(Y_{t-1},\) then \(Var(Y_t) = \phi^2 Var(Y_{t-1}) + Var(e_t).\) Therefore if \(Y_t\) is stationary and \(\gamma_0 = Var(Y_t) = Var(Y_{t-1}),\) then \(|\phi| < 1.\)
  2. If \(|\phi| > 1,\) and \(e_t\) is an “innovation” (i.e. uncorrelated to \(Y_{t-1}, Y_{t-2}, \ldots\)), then \(Y_t\) cannot be stationary. However, in that case \(Var(Y_t)\) increases exponentially.
  3. If \(|\phi| = 1,\) then \(\nabla Y_t := Y_t – Y_{t-1}\) is stationary with \(\nabla Y_t = e_t.\)
  4. An integrated moving average process of order (\(d, q\)), in short an \(IMA(d,q)\) process, is a process \(Y_t\) such that \(\nabla^d(Y_t)\) is a moving average of order \(q.\)
    Note: the inverse to the “difference operator” \(\nabla\) is the operator \(I\) which maps a time series \((W_t)\) to the time series whose \(i\)-th term is \(\sum_{j = -\infty}^i W_j\), in other words \(I\) is a “sum” or “integration” operator. This is the origin of the term “integrated” in “integrated moving average”.
  5. If \(Y_t = M_t + e_t\) with \(M_t\) changing slowly overtime, then \(IMA\) models appear naturally in the following cases:
    • if \(M_t\) is approximately constant, say \(\beta_t\), over two consecutive time points, say \(t-1\) and \(t,\) then minimizing \((Y_t-\beta_t)^2 + (Y_{t-1} – \beta_t)^2\) yields \(\beta_t = (Y_t + Y_{t-1})/2.\) This implies \((1/2)\nabla Y_t = e_t,\) i.e. an \(IMA(1,0)\) process.
    • if \(M_t = M_{t-1} + \epsilon_t,\) then \(\nabla Y_t = \epsilon_t + e_t – e_{t-1},\) is an \(MA\) process, so that \(Y_t\) is \(IMA\).
      Question: what is the order of \(Y_t\)? Need to understand the order of sums of \(ARIMA\) processes.
    • if \(M_t\) is approximately linear over three consecutive time points or if \(M_t = M_{t-1} + W_t\) with \(\nabla W_t\) stationary, then \(\nabla^2 Y_t\) is \(MA\) with the same autocorrelation function as an \(MA(2)\) process (question: is it \(MA(2)?\)).
  6. \(Y_t \sim ARIMA(p,d,q)\) means \(\nabla^d(Y) \sim ARMA(p,q)\)
  7. \(ARIMA(p,d,q)\) with nonzero mean \(\mu\) is by definition a time series \(Y_t\) such that \(\nabla^d(Y) – \mu \sim ARMA(p,q),\) which implies that \(Y_t = \mu_t + Y’_t,\) where \(\mu_t\) is a deterministic polynomial of degree-\(d\) in \(t\) and \(Y’_t \sim ARIMA(p,d,q)\) (with zero mean).
  8. Special considerations for a time series of positive values:
    • Logarithm appears if \(\) is approximately proportional to \(E(Y_t).\) Then writing \(\mu_t := E(Y_t),\) one has \(\log(Y_t) = \log(\mu_t) + \log(Y_t/\mu_t) = \log(\mu_t) + \log (1 + (Y_t – \mu_t)/\mu_t) \approx \log(\mu_t) + (Y_t – \mu_t)/\mu_t\) so that \(\log(Y_t))\) is approximately proportional to \(\mu_t,\) which is constant.
    • Logartihm also appears in modelling the percentage change from one time period to the next.
    • Also, one can try a Power transformation (also probably called Box-Cox), which depends on a parameter \(\lambda\). The appropriate value for \(\lambda\) can come from theory/external information/data. To estimate \(\lambda\) from data one computes the log-likelihood assuming some (e.g. normal) distribution of the data, and then takes some \(\lambda\) which is close to the minimum (say within 95% confidence interval).

Chapter 6. Model Specification

  1. Autocorrelation function (ACF): for \(MA(q)\) processes vanishes for \(lag(k)\) for \(k > q.\)
  2. Partical autocorrelation function (PACF): for \(AR(p)\) processes vanishes for \(lag(k)\) for \(k > p.)
  3. PACF of MA processes behave similarly to ACF of AR processes.
  4. If an \(AR(p)\) model is correct, then the sample partial autocorrelations \(\hat \phi_{kk}\) at lags greater than \(p\) are approximately normally distributed with mean 0 and variance \(1/n\) (Quenoulle, 1949). Thus for \(k > p\), \(\pm 2/\sqrt{n}\) can be used as critical limits on \(\hat \phi_{kk}\) to test the null hypothesis that an \(AR(p)\) model is correct.
  5. Extended ACF (EACF): infinite two dimensional array whose \((k,j)\)-th entry is \(0\) or a “nonzero”-symbol (e.g. \(\times\)) depending on whether the \(lag(j + 1)\) sample correlation of the residual after fitting an \(ARMA(k,j)\) model is “close to zero” or not.
  6. Approximate linear decay of the sample ACF is often taken as a symptom that the underlying time series is nonstationary and requires differencing.
  7. Differencing should be done in succession, and principle of parsimony needs to be followed: models should be simple, but not too simple.
  8. Dickey-Fuller test and its variations can be used to quantify the evidence of nonstationarity.
  9. Orders of \(ARMA(p,q)\) seem to be estimated by AIC, BIC, or their variations, i.e. fit different \(ARMA(p,q)\) models and pick the one with the lowest AIC or BIC.
  10. Once the order is determined, subset models should be considered, e.g. via AIC or BIC.

Chapter 7. Parameter Estimation

  1. Method of moments: sometimes works well for \(AR(p)\) models (provided the parameters are not too close to the boundary of stationary regions), but in general not very well for \(MA(q)\) models.
  2. Least squares: also works better for \(AR(p)\) models. For \(MA(q)\) models the numerical estimate of the minimum involves some choices for the error of the initial time periods, which might have a nontrivial impact on the parameters/predictions for certain classes of models or if the order is not very small compared to the length of the time series.
  3. Maximum likelihood: involves knowledge/assumption about the distribution of the white noise terms; most common assumption is Gaussian. If the maximization of the maximum likelihood is too complicated, one compromise is to minimize the unconditional least square that appears in the maximum likelihood function (under the assumption of Gaussian white noise).
  4. The variance of a method-of-moments estimator tends to be higher than the variance of the corresponding ML estimate; the (conditional) least square estimates may have similar level of variance as ML/unconditional least square estimates.
  5. When the series is not very long, bootstrap gives a probably more reliable estimate of the variance, or more generally, the distribution of functions of parameters, and from which one can estimate any prescribed confidence interval for the parameters.

Chapter 8. Model Diagnostics

  1. Model diagnostics = testing goodness of fit, and if the fit is poor, suggesting appropriate modifications.
  2. Residual analysis: analysis of standardized residuals.
    1. Time series of standardized residuals: adequate model should lead to rectangular scatter of standardized residuals around a zero horizontal level with no trends.
    2. If outliers present (which can be tested, e.g. using the Bonferroni criterion), requires further analysis.
    3. Quantile-Quantile plot of the standardized residuals: compare with the normal quantiles
    4. Autocorrelation of residuals: even though actual residuals do not have autocorrelation (provided the model is correct), the sample residuals may be highly correlated. For larger lags however they are indeed approximately uncorrelated, and have variance approximately \(1/n\).
    5. Ljung-Box test can be applied to sum of squares of autocorrelations.
  3. Another diagnostic is to overfit a slightly general model, then look at the new parameter and the change of the original parameters
  4. While overfitting, one should not increase the order of \(AR\) and \(MA\) parts of an \(ARMA\) or \(ARIMA\) model simultaneously, since any \(ARMA(p,q)\) model can have many different representations as an \(ARMA(p+1, q+1)\) model.