Statistics Module

The Stats module provides a comprehensive toolkit for time series analysis, including decomposition methods, transformation functions, autocorrelation analysis, and unit root tests. These functions are essential for preprocessing, analyzing, and understanding time series data before fitting forecasting models.


Overview

The Stats module exports the following functions and types:

| Category | Functions/Types |
| --- | --- |
| Transformations | box_cox, box_cox!, box_cox_lambda, inv_box_cox |
| Decomposition | decompose, DecomposedTimeSeries, stl, STLResult, mstl, MSTLResult |
| Differencing | diff, ndiffs, nsdiffs |
| Autocorrelation | acf, pacf, ACFResult, PACFResult |
| Unit Root Tests | adf, ADF, kpss, KPSS, phillips_perron, PhillipsPerron, ocsb, OCSB |
| Missing Values | handle_missing, longest_contiguous, interpolate_missing, check_missing, MissingMethod, Contiguous, Interpolate, FailMissing |
| Utilities | fourier, ols, OlsFit, seasonal_strength |

Box-Cox Transformations

Box-Cox transformations stabilize variance and make data more normally distributed, which can improve forecast accuracy.

Mathematical Formulation

The Box-Cox transformation is defined as:

\[y^{(\lambda)} = \begin{cases} \frac{y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(y) & \text{if } \lambda = 0 \end{cases}\]
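The two branches can be sketched in a few lines of base Julia (boxcox and inv_boxcox are hypothetical helper names for illustration, not the package API, which also handles λ selection and bias adjustment):

```julia
# Piecewise Box-Cox transform and its inverse for a fixed lambda.
# Illustrative helpers only; see box_cox / inv_box_cox below for the real API.
boxcox(y, λ) = λ == 0 ? log.(y) : (y .^ λ .- 1) ./ λ
inv_boxcox(z, λ) = λ == 0 ? exp.(z) : (λ .* z .+ 1) .^ (1 / λ)

y = [1.0, 2.0, 4.0, 8.0]
z = boxcox(y, 0.5)    # variance-stabilized scale
inv_boxcox(z, 0.5)    # recovers y
```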

box_cox_lambda

Automatically select the optimal Box-Cox transformation parameter.

box_cox_lambda(x, m; method=:guerrero, lower=-1, upper=2)

Arguments:

  • x::AbstractVector: A numeric vector (must be positive for Guerrero method)
  • m::Int: Frequency of the data
  • method::Symbol: Selection method - :guerrero (default) or :loglik
  • lower::Float64: Lower bound for λ search (default: -1)
  • upper::Float64: Upper bound for λ search (default: 2)

Methods:

  • Guerrero: Minimizes the coefficient of variation for subseries (Guerrero, 1993)
  • Log-likelihood: Maximizes the profile log likelihood of a linear model

Example:

using Durbyn.Stats

ap = air_passengers()
lambda = box_cox_lambda(ap, 12, method=:guerrero)

box_cox

Apply Box-Cox transformation to a series.

box_cox(x, m; lambda=:auto)

Arguments:

  • x::AbstractVector: Input vector
  • m::Int: Frequency
  • lambda: Transformation parameter or :auto for automatic selection

Returns: Tuple (transformed_vector, lambda_used)

Example:

ap = air_passengers()
y_transformed, lambda = box_cox(ap, 12; lambda=:auto)

box_cox!

In-place Box-Cox transformation for memory efficiency.

box_cox!(output, x, m; lambda)

Note: Use this in tight loops where box_cox is called repeatedly to avoid allocations.

inv_box_cox

Reverse the Box-Cox transformation.

inv_box_cox(x; lambda, biasadj=false, fvar=nothing)

Arguments:

  • x::AbstractArray: Transformed data
  • lambda::Real: Transformation parameter used
  • biasadj::Bool: Apply bias adjustment for mean forecasts (default: false)
  • fvar: Forecast variance (required if biasadj=true)

Example:

y_original = inv_box_cox(y_transformed; lambda=0.5)

# With bias adjustment for forecasts
y_mean = inv_box_cox(y_transformed; lambda=0.5, biasadj=true, fvar=forecast_variance)

References

  • Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. JRSS B, 26, 211-246.
  • Guerrero, V.M. (1993). Time-series analysis supported by power transformations. Journal of Forecasting, 12, 37-48.
  • Bickel, P. J. and Doksum K. A. (1981). An Analysis of Transformations Revisited. JASA, 76, 296-311.

Time Series Decomposition

Classical Decomposition (decompose)

Decompose a time series into trend, seasonal, and residual components using moving averages. References: Brockwell & Davis (2016), Section 1.5.2 (Method S1).

decompose(; x, m, type=:additive, filter=nothing)

Model forms:

\[\text{Additive: } x_i = \hat{T}_i + S_i + R_i\]

\[\text{Multiplicative: } x_i = \hat{T}_i \cdot S_i \cdot R_i\]

Trend estimate (\hat{T}_i) via symmetric moving average:

\[\text{Even }m:\quad \hat{T}_i = \frac{0.5\,x_{i-m/2} + \sum_{j=-(m/2)+1}^{(m/2)-1} x_{i+j} + 0.5\,x_{i+m/2}}{m}\]

\[\text{Odd }m:\quad \hat{T}_i = \frac{1}{m}\sum_{j=-(m-1)/2}^{(m-1)/2} x_{i+j}\]

Trend values are defined on the interior range (\lfloor m/2 \rfloor < i \le n-\lfloor m/2 \rfloor).

Detrending:

\[\text{Additive: } D_i = x_i - \hat{T}_i,\qquad \text{Multiplicative: } D_i = x_i / \hat{T}_i\]

Seasonal figure normalization (k = 1,\ldots,m):

\[F_k = \frac{1}{N_k}\sum_j D_{k + m(j-1)},\qquad \bar{F} = \frac{1}{m}\sum_{k=1}^m F_k\]

\[\text{Additive: } F_k \leftarrow F_k - \bar{F},\qquad \text{Multiplicative: } F_k \leftarrow F_k / \bar{F}\]

Seasonal and residual components:

\[S_i = F_{((i-1)\bmod m)+1}\]

\[\text{Additive: } R_i = x_i - S_i - \hat{T}_i,\qquad \text{Multiplicative: } R_i = x_i/(S_i\hat{T}_i)\]

By construction, (S_i) is periodic with period (m): (S_{i+m} = S_i).
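The equations above condense into a short base-Julia sketch for odd (m) in the additive case (decompose_additive is an illustrative helper, not the package's decompose, which also supports even (m), the multiplicative form, and custom filters):

```julia
using Statistics

# Additive classical decomposition for odd period m:
# centered moving-average trend, per-phase seasonal figure centered to sum
# to zero, and the residual remainder.
function decompose_additive(x::Vector{Float64}, m::Int)
    n = length(x); h = (m - 1) ÷ 2
    trend = fill(NaN, n)
    for i in (h + 1):(n - h)
        trend[i] = mean(x[i-h:i+h])          # symmetric moving average
    end
    detrended = x .- trend
    figure = [mean(filter(!isnan, detrended[k:m:n])) for k in 1:m]
    figure .-= mean(figure)                  # center the seasonal figure
    seasonal = [figure[mod1(i, m)] for i in 1:n]
    (trend = trend, seasonal = seasonal,
     remainder = x .- trend .- seasonal, figure = figure)
end
```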

Arguments:

  • x::AbstractVector: Time series vector
  • m::Int: Frequency (observations per cycle)
  • type::Symbol: :additive or :multiplicative
  • filter: Custom filter coefficients (optional)

Returns: DecomposedTimeSeries struct with fields:

  • x: Original series
  • seasonal: Seasonal component
  • trend: Trend component
  • random: Residual/remainder
  • figure: Seasonal figure
  • type: Decomposition type
  • m: Frequency

Example:

using Durbyn.Stats

ap = air_passengers()
result = decompose(x=ap, m=12, type=:multiplicative)
result.trend
result.seasonal

References:

  • Brockwell, P. J., & Davis, R. A. (2016). Introduction to Time Series and Forecasting (3rd ed.), Section 1.5.2.

STL Decomposition (stl)

Seasonal-Trend decomposition using LOESS (STL) - a robust and flexible method for decomposing time series.

stl(x, m; seasonal_window=:periodic, seasonal_degree=0, trend_window=nothing,
    trend_degree=1, lowpass_window=nothing, lowpass_degree=trend_degree,
    seasonal_jump=nothing, trend_jump=nothing, lowpass_jump=nothing,
    robust=false, inner=nothing, outer=nothing)

Arguments:

  • x::AbstractVector: Time series to decompose
  • m::Int: Seasonal frequency (must be at least 2, x must contain at least two full cycles, and missing values are not allowed)
  • seasonal_window: Seasonal smoothing window (integer span, rounded to the nearest odd value, or :periodic)
  • seasonal_degree::Int: Seasonal smoothing polynomial degree (0 or 1)
  • trend_window: Trend smoothing window
  • trend_degree::Int: Trend smoothing polynomial degree (0 or 1)
  • lowpass_window: Low-pass filter window
  • lowpass_degree::Int: Low-pass filter polynomial degree
  • seasonal_jump, trend_jump, lowpass_jump: Subsampling steps
  • robust::Bool: Use robustness iterations (default: false)
  • inner, outer: Inner/outer iteration counts

Returns: STLResult struct with:

  • seasonal, trend, remainder: Decomposition components
  • weights: Robustness weights
  • seasonal_window, trend_window, lowpass_window: Window sizes actually used
  • seasonal_degree, trend_degree, lowpass_degree: Polynomial degrees
  • seasonal_jump, trend_jump, lowpass_jump: Jump parameters
  • inner_iterations, outer_iterations: Iteration counts

Example:

ap = air_passengers()
result = stl(ap, 12; seasonal_window=7, robust=true)

# Access components
result.trend
result.seasonal
result.remainder

# Summarize
summary(result)

Multiple Seasonal Decomposition (mstl)

Decompose time series with multiple seasonal periods using iterative STL.

mstl(x, m; lambda=nothing, iterate=2, seasonal_window=nothing, stl_kwargs...)

Arguments:

  • x::AbstractVector: Time series
  • m: Single period (Int) or vector of periods, passed positionally
  • lambda: Box-Cox parameter (nothing, :auto, or numeric)
  • iterate::Int: Number of outer iterations (default: 2)
  • seasonal_window: Seasonal window(s)
  • stl_kwargs...: Additional arguments passed to stl

Returns: MSTLResult struct with:

  • data: Original series
  • trend: Trend component
  • seasonals: Vector of seasonal components
  • m: Periods used
  • remainder: Residual component
  • lambda: Box-Cox λ used

If lambda is supplied, the decomposition is fit on the transformed series and the returned component vectors correspond to that transformed scale.

Example:

# Daily data with weekly and monthly patterns
y = rand(200) .+ 2sin.(2π*(1:200)/7) .+ 0.5sin.(2π*(1:200)/30)
result = mstl(y, [7, 30]; iterate=2, seasonal_window=[11, 23], robust=true)

# Access components
result.trend
result.seasonals[1]  # First seasonal component (period 7)
result.seasonals[2]  # Second seasonal component (period 30)
result.remainder

# Summarize
summary(result)

Seasonal Strength

Measure the strength of seasonality in an MSTL decomposition. Reference: Wang, Smith & Hyndman (2006), Section 3.1.

seasonal_strength(x, m; kwargs...)
seasonal_strength(res::MSTLResult)

Let (R_t) denote the remainder and (S_t) the seasonal component for a given period. The seasonal strength is:

\[F_{\text{seasonal}} = 1 - \frac{\operatorname{Var}(R_t)}{\operatorname{Var}(R_t + S_t)}\]

Equivalent form in Wang et al. notation ((X_t = Y_t^* - T_t), (Y_t' = Y_t^* - T_t - S_t)):

\[F_{\text{seasonal}} = 1 - \frac{\operatorname{Var}(Y_t')}{\operatorname{Var}(X_t)}\]

In Durbyn, the statistic is computed per seasonal component from MSTLResult.seasonals and clipped to ([0,1]), where 0 indicates weak seasonality and 1 indicates strong seasonality.
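The statistic is a one-liner given the two components (strength is a hypothetical helper, not the package function, which operates on MSTLResult.seasonals):

```julia
using Statistics

# F_seasonal = 1 - Var(R_t) / Var(R_t + S_t), clipped to [0, 1].
strength(R, S) = clamp(1 - var(R) / var(R .+ S), 0.0, 1.0)

S = sin.(2π .* (1:100) ./ 7)   # pronounced seasonal component
R = 0.1 .* randn(100)          # small remainder
strength(R, S)                 # near 1: strong seasonality
```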

Example:

result = mstl(y, [7, 30])
strengths = seasonal_strength(result)

# Convenience form
strengths2 = seasonal_strength(y, [7, 30]; iterate=2)

Autocorrelation Functions

ACF (acf)

Compute the sample autocorrelation function. References: Brockwell & Davis (2016), Definition 1.4.4 and Remark 3.

acf(y, m, n_lags=nothing; demean=true)

Arguments:

  • y::AbstractVector: Input time series
  • m::Int: Frequency/seasonal period
  • n_lags: Number of lags (default: min(floor(Int, 10*log10(n)), n-1))
  • demean::Bool: Subtract mean before computing (default: true)

Returns: ACFResult with:

  • values: ACF values at each lag (including lag 0)
  • lags: Lag indices
  • n: Series length
  • m: Frequency
  • ci: 95% confidence interval (±1.96/√n)

Implemented equations:

Default lag count (when n_lags=nothing):

\[n_{\text{lags}} = \min\!\Big(\lfloor 10\,\log_{10}(n) \rfloor,\; n-1\Big)\]

95% confidence interval:

\[\text{CI} = \pm\,\frac{1.96}{\sqrt{n}}\]

Under the iid null and large (n), (\hat\rho(k)) for (k>0) is approximately (N(0, 1/n)).

Sample variance (biased, divisor n):

\[\hat{\gamma}(0) = \frac{1}{n}\sum_{t=1}^{n}(y_t - \bar{y})^2\]

Sample autocovariance at lag k:

\[\hat{\gamma}(k) = \frac{1}{n}\sum_{t=1}^{n-k}(y_t - \bar{y})(y_{t+k} - \bar{y})\]

The divisor is (n), not (n-k), matching the standard sample autocovariance definition used in Box-Jenkins style time series analysis.

Sample autocorrelation:

\[\hat{\rho}(0) = 1, \qquad \hat{\rho}(k) = \frac{\hat{\gamma}(k)}{\hat{\gamma}(0)}\]
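The divisor-(n) convention above can be mirrored directly in base Julia (sample_acf is an illustrative helper; the package's acf additionally returns lags, n, m, and ci):

```julia
using Statistics

# Sample ACF with divisor n, following the equations above.
function sample_acf(y::AbstractVector{<:Real}, nlags::Int)
    n = length(y); ȳ = mean(y)
    γ(k) = sum((y[t] - ȳ) * (y[t+k] - ȳ) for t in 1:(n - k)) / n
    γ0 = γ(0)
    [γ(k) / γ0 for k in 0:nlags]   # element 1 (lag 0) is always 1
end

sample_acf(randn(200), 10)
```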

Example:

y = randn(100)
result = acf(y, 12)
result.values  # ACF values
result.ci      # Confidence interval

plot(result)   # Requires Plots.jl

PACF (pacf)

Compute the sample partial autocorrelation function using the Durbin-Levinson algorithm. References: Brockwell & Davis (2016), Section 2.5.3 and Section 3.2.3.

pacf(y, m, n_lags=nothing)

Arguments:

  • y::AbstractVector: Input time series
  • m::Int: Frequency/seasonal period
  • n_lags: Number of lags (default: min(floor(Int, 10*log10(n)), n-1))

Returns: PACFResult with:

  • values: PACF values (lags 1 to n_lags)
  • lags: Lag indices
  • n: Series length
  • m: Frequency
  • ci: 95% confidence interval

Implemented equations (Durbin-Levinson):

Initialization:

\[\phi_{1,1} = \hat{\rho}(1)\]

Recursion for k = 2,3,\dots:

\[\phi_{k,k} = \frac{\hat{\rho}(k) - \sum_{j=1}^{k-1}\phi_{k-1,j}\hat{\rho}(k-j)} {1 - \sum_{j=1}^{k-1}\phi_{k-1,j}\hat{\rho}(j)}\]

Coefficient update for j = 1,\dots,k-1:

\[\phi_{k,j} = \phi_{k-1,j} - \phi_{k,k}\phi_{k-1,k-j}\]

PACF value at lag k:

\[\text{PACF}(k) = \phi_{k,k}\]

For a causal AR((p)) process, (\phi_{k,k}=0) for all (k>p), which gives the PACF cutoff diagnostic for AR order selection.
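The recursion is compact to implement given sample autocorrelations (\hat\rho(1),\dots,\hat\rho(K)) (pacf_dl is an illustrative helper, not the package's pacf):

```julia
# Durbin-Levinson recursion: PACF values φ_{k,k} from autocorrelations ρ[1..K].
function pacf_dl(ρ::Vector{Float64})
    K = length(ρ)
    φ = zeros(K, K)
    pac = zeros(K)
    φ[1, 1] = pac[1] = ρ[1]                       # initialization
    for k in 2:K
        num = ρ[k] - sum(φ[k-1, j] * ρ[k-j] for j in 1:k-1)
        den = 1 - sum(φ[k-1, j] * ρ[j] for j in 1:k-1)
        φ[k, k] = num / den
        for j in 1:k-1                             # coefficient update
            φ[k, j] = φ[k-1, j] - φ[k, k] * φ[k-1, k-j]
        end
        pac[k] = φ[k, k]
    end
    pac
end
```

For AR(1)-style autocorrelations ρ(k) = 0.5^k, only the lag-1 PACF is nonzero, matching the cutoff diagnostic.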

Diagnostic Interpretation

ACF/PACF patterns can help identify appropriate model structures:

| ACF | PACF | Suggested model |
| --- | --- | --- |
| Cuts off after lag $q$ ($\hat{\rho}(k) \approx 0$ for $k > q$) | Tails off | MA($q$) |
| Tails off (geometric/oscillatory decay) | Cuts off after lag $p$ | AR($p$) |
| Tails off (geometric/oscillatory decay) | Tails off (geometric/oscillatory decay) | ARMA($p,q$) |
| Slow linear decay with large positive values | Single large spike at lag 1 | Non-stationary; needs differencing |

Example:

y = randn(100)
result = pacf(y, 12)
result.values
plot(result)

Differencing

diff

Compute lagged differences of a vector or matrix.

diff(x; lag_steps=1, difference_order=1)

Arguments:

  • x: Vector or matrix
  • lag_steps::Int: Lag interval (default: 1)
  • difference_order::Int: Number of times to apply differencing (default: 1)

Implemented equations:

Lag operator for k > 0:

\[L^k(x_t) = x_{t-k}, \qquad t = k+1,\dots,n\]

with t = 1,\dots,k undefined (NaN).

Lead form for k < 0:

\[L^k(x_t) = x_{t-k}, \qquad t = 1,\dots,n+k\]

Lag-d first difference:

\[\Delta_d x_t = x_t - x_{t-d}, \qquad t = d+1,\dots,n\]

with t = 1,\dots,d undefined.

Standard first difference (d=1):

\[\Delta x_t = x_t - x_{t-1}\]

Repeated differencing (D applications), with x_t^{(0)} = x_t:

\[x_t^{(i)} = x_t^{(i-1)} - x_{t-d}^{(i-1)}, \qquad i=1,\dots,D\]

Final output is x_t^{(D)} with first d \cdot D entries undefined.

Special cases:

\[\Delta^2 x_t = x_t - 2x_{t-1} + x_{t-2}\]

\[\Delta_m x_t = x_t - x_{t-m}\]

Matrix case: apply the same scalar equations independently to each column.

Example:

y = [1, 3, 6, 10, 15]
diff(y)                    # [2, 3, 4, 5]
diff(y; lag_steps=2)       # [5, 7, 9]
diff(y; difference_order=2) # [1, 1, 1]

ndiffs

Determine the number of non-seasonal differences needed for stationarity.

ndiffs(x; alpha=0.05, test=:kpss, deterministic=:level, max_d=2, kwargs...)

Arguments:

  • x::AbstractVector: Time series
  • alpha::Float64: Significance level (clamped to [0.01, 0.10])
  • test::Symbol: Unit root test - :kpss, :adf, or :pp
  • deterministic::Symbol: :level (intercept) or :trend (intercept + trend)
  • max_d::Int: Maximum differences to try (default: 2)

Test Behavior:

  • KPSS: Null = stationarity. Returns smallest d where KPSS does not reject.
  • ADF/PP: Null = unit root. Returns smallest d where test rejects unit root.

Example:

y = cumsum(randn(100))  # Random walk (non-stationary)
d = ndiffs(y; test=:kpss)
println("Differences needed: $d")

# Using ADF test
d_adf = ndiffs(y; test=:adf, deterministic=:trend)

nsdiffs

Determine the number of seasonal differences needed.

nsdiffs(x, m; alpha=0.05, test=:seas, max_D=1, kwargs...)

Arguments:

  • x::AbstractVector: Time series
  • m::Int: Seasonal period
  • alpha::Float64: Significance level
  • test::Symbol: :seas (default) or :ocsb
  • max_D::Int: Maximum seasonal differences (default: 1)

Example:

ap = air_passengers()
D = nsdiffs(ap, 12)
println("Seasonal differences needed: $D")

Unit Root Tests

ADF Test (adf)

Augmented Dickey-Fuller test for unit roots.

adf(y; model=:none, lags=1, k_max=nothing, criterion=:fixed)

Null Hypothesis: The series has a unit root (non-stationary)

Mathematical Formulation

First difference:

\[\Delta y_t = y_t - y_{t-1}\]

ADF auxiliary regressions:

\[\texttt{model=:none}\quad \Delta y_t = \beta_1 y_{t-1} + \sum_{i=1}^{k}\gamma_i \Delta y_{t-i} + \varepsilon_t\]

\[\texttt{model=:drift}\quad \Delta y_t = \mu + \beta_1 y_{t-1} + \sum_{i=1}^{k}\gamma_i \Delta y_{t-i} + \varepsilon_t\]

\[\texttt{model=:trend}\quad \Delta y_t = \mu + \beta_1 y_{t-1} + \delta t + \sum_{i=1}^{k}\gamma_i \Delta y_{t-i} + \varepsilon_t\]

where k is the augmentation lag order.

Tau statistic (ADF statistic):

\[\tau = \frac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)}\]

  • :none: (\tau_1)
  • :drift: (\tau_2)
  • :trend: (\tau_3)

F-statistics (joint restrictions):

\[\phi = \frac{(RSS_R - RSS_F)/(df_R - df_F)}{RSS_F/df_F}\]

  • \phi_1 (model=:drift): H_0: \beta_1 = 0,\ \mu = 0
  • \phi_2 (model=:trend): H_0: \beta_1 = 0,\ \mu = 0,\ \delta = 0
  • \phi_3 (model=:trend): H_0: \beta_1 = 0,\ \delta = 0 (intercept retained)

Automatic lag selection (criterion=:aic or :bic) minimizes:

\[IC(k) = n_k \log\!\left(\frac{RSS_k}{n_k}\right) + k_{\text{pen}} p_k,\quad k_{\text{pen}}=\begin{cases} 2 & \text{AIC}\\ \log(n_k) & \text{BIC} \end{cases}\]

\[k^{*} = \arg\min_{k \in \{0,\dots,k_{\max}\}} IC(k)\]

Underlying OLS quantities:

\[\hat{\beta} = (X^\top X)^{-1}X^\top y,\quad \hat{\varepsilon}=y-X\hat{\beta}\]

\[\widehat{SE}(\hat{\beta}_j)=\sqrt{\hat{\sigma}^2[(X^\top X)^{-1}]_{jj}},\quad \hat{\sigma}^2 = RSS / df_{\text{residual}}\]

Arguments:

  • y::AbstractVector: Time series
  • model::Symbol: :none, :drift (intercept), or :trend (intercept + trend)
  • lags::Int: Fixed augmentation order (criterion=:fixed)
  • k_max::Int: Maximum lag considered for automatic selection
  • criterion::Symbol: :fixed, :aic, or :bic

Compatibility aliases: type → model, selectlags → criterion.

Returns: ADF struct with:

  • model: Test type used
  • cval: Critical values matrix
  • clevels: Significance levels [0.01, 0.05, 0.10]
  • lags: Selected augmentation order (alias: lag)
  • tau: Tau statistic (\tau_1/\tau_2/\tau_3 depending on model)
  • phi1, phi2, phi3: F-statistics (when applicable)
  • beta, se: OLS coefficients and standard errors
  • testreg: Auxiliary regression results
  • res: Residuals

Example:

y = cumsum(randn(100))
result = adf(y; model=:drift, k_max=4, criterion=:aic)
println("Tau statistic: $(result.tau)")
println("Phi1 statistic: $(result.phi1)")
println("Tau critical values (1%, 5%, 10%): $(result.cval[1, :])")

KPSS Test (kpss)

Kwiatkowski-Phillips-Schmidt-Shin test for stationarity.

kpss(y; type=:mu, lags=:short, use_lag=nothing)

Null Hypothesis: The series is stationary

Mathematical Formulation

Bandwidth (L) selection:

\[L = \left\lfloor 4\left(\frac{n}{100}\right)^{1/4} \right\rfloor \quad (\texttt{lags=:short}),\qquad L = \left\lfloor 12\left(\frac{n}{100}\right)^{1/4} \right\rfloor \quad (\texttt{lags=:long}),\qquad L = 0 \quad (\texttt{lags=:nil})\]

Deterministic residuals:

\[\texttt{type=:mu}: \quad e_t = y_t - \bar{y}\]

\[\texttt{type=:tau}: \quad e_t \text{ are OLS residuals from } y_t = \alpha + \beta t + u_t,\ \ t=1,\dots,n\]

KPSS statistic:

\[\eta = \frac{\frac{1}{n^2}\sum_{t=1}^{n} S_t^2}{\hat{\sigma}^2},\qquad S_t = \sum_{i=1}^{t} e_i\]

Residual variance and Bartlett long-run variance correction:

\[\hat{\sigma}_e^2 = \frac{1}{n}\sum_{t=1}^{n} e_t^2,\qquad \hat{\sigma}^2 = \hat{\sigma}_e^2 + \frac{2}{n}\sum_{\ell=1}^{L}\left(1-\frac{\ell}{L+1}\right)\sum_{t=\ell+1}^{n} e_t e_{t-\ell}\]

For L=0, the denominator is \hat{\sigma}_e^2.
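The level-stationarity statistic (type=:mu) maps directly onto the equations above (kpss_stat is an illustrative helper, not the package's kpss):

```julia
using Statistics

# KPSS level statistic with Bartlett long-run variance correction.
function kpss_stat(y::Vector{Float64}, L::Int)
    n = length(y)
    e = y .- mean(y)                 # residuals from a constant
    S = cumsum(e)                    # partial sums S_t
    σ2 = sum(e .^ 2) / n             # residual variance
    for ℓ in 1:L                     # Bartlett weights (1 - ℓ/(L+1))
        w = 1 - ℓ / (L + 1)
        σ2 += 2 * w * sum(e[ℓ+1:n] .* e[1:n-ℓ]) / n
    end
    sum(S .^ 2) / (n^2 * σ2)
end

kpss_stat(randn(100), 4)
```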

Arguments:

  • y::AbstractVector: Time series
  • type::Symbol: :mu (constant) or :tau (constant + trend)
  • lags::Symbol: :short, :long, or :nil
  • use_lag: Manually specify bandwidth

Returns: KPSS struct with:

  • type: Test type
  • lag: Bandwidth used
  • teststat: Test statistic
  • cval, clevels: Critical values and levels
  • res: Regression residuals

Example:

y = randn(100)  # Stationary
result = kpss(y; type=:mu)
println("Test statistic: $(result.teststat)")
println("Critical values: $(result.cval)")

Phillips-Perron Test (phillips_perron)

Phillips-Perron unit root test with non-parametric correction.

phillips_perron(x; type=:Z_alpha, model=:constant, lags=:short, use_lag=nothing)

Null Hypothesis: The series has a unit root

Mathematical Formulation

Test regressions (t = 2,\dots,n_0):

\[\texttt{model=:constant}: \quad y_t = \alpha + \rho y_{t-1} + u_t\]

\[\texttt{model=:trend}: \quad y_t = \alpha + \rho y_{t-1} + \beta (t - n/2) + u_t\]

Bandwidth (L) uses the same short/long rules as KPSS.

Variance quantities:

\[s = \frac{1}{n}\sum_{t=1}^{n}\hat{u}_t^2,\qquad \hat{\sigma}^2 = s + \frac{2}{n}\sum_{\ell=1}^{L}\left(1-\frac{\ell}{L+1}\right)\sum_{t=\ell+1}^{n}\hat{u}_t\hat{u}_{t-\ell}\]

\[\lambda = \tfrac{1}{2}(\hat{\sigma}^2 - s),\qquad \lambda' = \frac{\lambda}{\hat{\sigma}^2}\]

Scaled sample moments:

\[\overline{Y}_{\mathrm{var}}=\frac{1}{n^2}\sum (y_t-\bar{y})^2,\quad S_{y^2}=\frac{1}{n^2}\sum y_t^2,\quad S_y=\frac{1}{n^{3/2}}\sum y_t,\quad S_{ty}=\frac{1}{n^{5/2}}\sum t\,y_t\]

Constant model PP statistics:

\[Z_\tau = \sqrt{\frac{s}{\hat{\sigma}^2}}\,t_\rho - \lambda'\frac{\sqrt{\hat{\sigma}^2}}{\sqrt{\overline{Y}_{\mathrm{var}}}},\qquad Z_\alpha = n(\hat{\rho}-1) - \frac{\lambda}{\overline{Y}_{\mathrm{var}}}\]

Constant-model auxiliary statistic:

\[Z_{\tau,\mu} = \sqrt{\frac{s}{\hat{\sigma}^2}}\,t_\alpha + \lambda'\frac{\sqrt{\hat{\sigma}^2}\,S_y}{\sqrt{S_{y^2}}\,\sqrt{\overline{Y}_{\mathrm{var}}}}\]

Trend model definitions and PP statistics:

\[M = (1-n^{-2})S_{y^2} - 12S_{ty}^2 + 12\left(1+\tfrac{1}{n}\right)S_{ty}S_y - \left(4+\tfrac{6}{n}+\tfrac{2}{n^2}\right)S_y^2\]

\[Z_\tau = \sqrt{\frac{s}{\hat{\sigma}^2}}\,t_\rho - \lambda'\frac{\sqrt{\hat{\sigma}^2}}{\sqrt{M}},\qquad Z_\alpha = n(\hat{\rho}-1) - \frac{\lambda}{M}\]

Trend-model auxiliary statistics:

\[Z_{\tau,\mu} = \sqrt{\frac{s}{\hat{\sigma}^2}}\,t_\alpha - \lambda'\frac{\sqrt{\hat{\sigma}^2}\,S_y}{\sqrt{M}\,\sqrt{M+S_y^2}}\]

\[Z_{\tau,\beta} = \sqrt{\frac{s}{\hat{\sigma}^2}}\,t_\beta - \lambda'\frac{\sqrt{\hat{\sigma}^2}\,(\tfrac{1}{2}S_y - S_{ty})}{\sqrt{M/12}\,\sqrt{\overline{Y}_{\mathrm{var}}}}\]

Finite-sample critical values are provided for type=:Z_tau via:

\[c = a + \frac{b}{n} + \frac{d}{n^2}\]

Coefficient tuples (a,b,d) used by model and significance level:

  • model=:constant: 1% (-3.4335,-5.999,-29.25), 5% (-2.8621,-2.738,-8.36), 10% (-2.5671,-1.438,-4.48)
  • model=:trend: 1% (-3.9638,-8.353,-47.44), 5% (-3.4126,-4.039,-17.83), 10% (-3.1279,-2.418,-7.58)
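Evaluating the response surface is straightforward; for instance, the 5% constant-model critical value at a given sample size (pp_critical is an illustrative helper using the coefficients listed above):

```julia
# Finite-sample critical value c = a + b/n + d/n^2 for type=:Z_tau,
# model=:constant, 5% level.
pp_critical(n; a = -2.8621, b = -2.738, d = -8.36) = a + b / n + d / n^2

pp_critical(100)
```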

Arguments:

  • x::AbstractVector: Time series
  • type::Symbol: :Z_alpha or :Z_tau
  • model::Symbol: :constant or :trend
  • lags::Symbol: :short or :long
  • use_lag: Bartlett truncation lag

Returns: PhillipsPerron struct with test results.

Example:

result = phillips_perron(y; type=:Z_tau, model=:trend)

OCSB Test (ocsb)

Osborn-Chui-Smith-Birchenhall test for seasonal unit roots.

ocsb(x, m; lag_method=:fixed, maxlag=0, clevels=[0.10, 0.05, 0.01])

Null Hypothesis: Seasonal unit root exists

Mathematical Formulation

Difference operators:

\[\Delta X_t = X_t - X_{t-1},\qquad \Delta_m X_t = X_t - X_{t-m},\qquad \Delta\Delta_m X_t = X_t - X_{t-1} - X_{t-m} + X_{t-m-1}\]

AR prewhitening (p lags):

\[\Delta\Delta_m X_t = \lambda_1\Delta\Delta_m X_{t-1} + \cdots + \lambda_p\Delta\Delta_m X_{t-p} + \varepsilon_t\]

with lag polynomial \hat{\lambda}(B)=1-\lambda_1B-\cdots-\lambda_pB^p.

Filtered regressors:

\[Z_{4,t} = \hat{\lambda}(B)\Delta_m X_t = \Delta_m X_t - \sum_{j=1}^{p}\lambda_j\Delta_m X_{t-j}\]

\[Z_{5,t} = \hat{\lambda}(B)\Delta X_t = \Delta X_t - \sum_{j=1}^{p}\lambda_j\Delta X_{t-j}\]

Final OCSB regression:

\[\Delta\Delta_m X_t = \beta_1 Z_{4,t-1} + \beta_2 Z_{5,t-m} + \sum_{j=1}^{p}\alpha_j \Delta\Delta_m X_{t-j} + u_t\]

Test statistic:

\[\text{OCSB stat} = \frac{\hat{\beta}_2}{\mathrm{se}(\hat{\beta}_2)}\]

Simulation-smoothed critical value for period m (\ell=\log m):

\[c(m) = -0.2937411\exp\!\Big[-0.2850853(\ell-0.7656451)-0.05983644(\ell-0.7656451)^2\Big]-1.652202\]

Lag selection for non-fixed mode minimizes AIC/BIC/AICc over p=1,\dots,p_{\max} with:

\[\ell = -\frac{n}{2}\left(\log(2\pi)+1+\log(\mathrm{RSS}/n)\right),\quad \mathrm{AIC}=-2\ell+2k,\quad \mathrm{BIC}=-2\ell+k\log n,\quad \mathrm{AICc}=\mathrm{AIC}+\frac{2k(k+1)}{n-k-1}\]

Arguments:

  • x::AbstractVector: Time series
  • m::Int: Seasonal period
  • lag_method::Symbol: :fixed, :AIC, :BIC, or :AICc (case-insensitive)
  • maxlag::Int: Maximum AR order to consider

Returns: OCSB struct with:

  • teststat: OCSB t-statistic
  • cval: Smoothed critical value for the requested seasonal period
  • clevels: Nominal critical-value levels metadata
  • lag: Selected AR order (alias: lags)

Example:

ap = air_passengers()
result = ocsb(ap, 12; lag_method=:AIC)

Utility Functions

Fourier Terms (fourier)

Generate Fourier terms for seasonal modeling in regression.

fourier(x; m, K, h=nothing)

Arguments:

  • x::AbstractVector: Time series
  • m: Seasonal period
  • K::Int: Number of Fourier terms
  • h: Forecast horizon (optional)

Returns: NamedTuple of sin/cos vectors keyed by symbols such as :S1, :C1, :S2, :C2, ...
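The returned vectors are the columns S_k = sin(2πkt/m), C_k = cos(2πkt/m). A matrix-valued sketch in base Julia (fourier_terms is a hypothetical helper, not the package's fourier, which returns a NamedTuple instead):

```julia
# Fourier design-matrix columns S_1, C_1, ..., S_K, C_K for period m.
function fourier_terms(t::AbstractVector, m::Real, K::Int)
    X = zeros(length(t), 2K)
    for k in 1:K
        X[:, 2k-1] = sin.(2π * k .* t ./ m)   # S_k
        X[:, 2k]   = cos.(2π * k .* t ./ m)   # C_k
    end
    X
end

X = fourier_terms(1:100, 12, 3)   # 100×6 matrix
```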

Example:

y = randn(100)
F = fourier(y; m=12, K=6)
propertynames(F)
F.S1

Ordinary Least Squares (ols)

Fit OLS linear regression.

ols(y, X)

Arguments:

  • y::AbstractVector: Response vector
  • X::AbstractMatrix: Design matrix (include intercept column if needed)

Returns: OlsFit struct with:

  • coef: Estimated coefficients
  • fitted: Fitted values
  • residuals: Residuals
  • sigma2: Residual variance
  • cov: Coefficient covariance matrix
  • se: Standard errors
  • df_residual: Residual degrees of freedom

Example:

n = 100
X = hcat(ones(n), randn(n))  # Intercept + predictor
y = X * [2.0, 3.0] + randn(n) * 0.5

fit = ols(y, X)
fit.coef       # Coefficients
fit.se         # Standard errors
fit.residuals  # Residuals

# Predictions
predict(fit, X_new)

Missing Value Handling

Durbyn provides a type-dispatched system for handling missing values (missing and NaN) in time series data.

Type Hierarchy

All missing value strategies are subtypes of the abstract type MissingMethod:

| Type | Description |
| --- | --- |
| Contiguous() | Extract the longest contiguous segment without missing values |
| Interpolate() | Interpolate missing values (seasonal-aware) |
| Interpolate(; linear=true) | Force linear interpolation |
| FailMissing() | Throw an error if any missing values are present |

handle_missing

Dispatch to the appropriate strategy based on the MissingMethod type.

handle_missing(x, Contiguous())           # longest contiguous segment
handle_missing(x, Interpolate(); m=12)    # seasonal interpolation
handle_missing(x, FailMissing())          # error if missing

Arguments:

  • x::AbstractArray: Input vector (may contain missing or NaN)
  • method::MissingMethod: Strategy to use
  • m::Union{Int,Nothing}: Seasonal period (used by Interpolate)

Example:

using Durbyn.Stats

x = [1.0, 2.0, missing, 4.0, 5.0]

# Type dispatch replaces string dispatch
result = handle_missing(x, Contiguous())
result = handle_missing(x, Interpolate())
result = handle_missing(x, FailMissing())  # throws ArgumentError

longest_contiguous

Extract the longest contiguous segment of non-missing values.

longest_contiguous(x)

Both missing and NaN are treated as missing values.

Example:

x = [missing, 1.0, 2.0, 3.0, missing, 4.0, missing]
longest_contiguous(x)  # Returns [1.0, 2.0, 3.0]

interpolate_missing

Interpolate missing values in a time series. For seasonal data, uses STL decomposition with seasonal-aware interpolation. For non-seasonal data, uses linear interpolation.

interpolate_missing(x; m=nothing, lambda=nothing, linear=nothing)

Arguments:

  • x::AbstractVector: Time series (may contain missing or NaN)
  • m::Union{Int,Nothing}: Seasonal period (nothing or 1 = non-seasonal)
  • lambda::Union{Nothing,Real}: Box-Cox transformation parameter
  • linear::Union{Nothing,Bool}: Force linear interpolation

Algorithm (seasonal):

  1. Fit preliminary model with Fourier terms and polynomial trend
  2. Apply robust MSTL decomposition
  3. Linearly interpolate the seasonally adjusted series
  4. Add back the seasonal component
  5. Fall back to linear if results are unstable

Example:

# Non-seasonal
y = [1.0, 2.0, missing, 4.0, 5.0]
interpolate_missing(y)

# Seasonal (monthly)
ap = air_passengers()
ap[50] = NaN
interpolate_missing(ap; m=12)

# With Box-Cox
interpolate_missing(y; lambda=0.5)

check_missing

Verify that a vector contains no missing values. Returns the input unchanged if clean; throws ArgumentError otherwise.

check_missing(x)

Example:

check_missing([1.0, 2.0, 3.0])         # Returns [1.0, 2.0, 3.0]
check_missing([1.0, missing, 3.0])      # Throws ArgumentError
check_missing([1.0, NaN, 3.0])          # Throws ArgumentError

Usage in ETS

The ets() function accepts a missing_method keyword argument:

using Durbyn.ExponentialSmoothing

ap = air_passengers()

# Default: extract longest contiguous segment
fit1 = ets(ap, 12, "ZZZ")

# Interpolate missing values
fit2 = ets(ap, 12, "ZZZ"; missing_method=Interpolate())

# Error on missing values
fit3 = ets(ap, 12, "ZZZ"; missing_method=FailMissing())

Complete Example: Time Series Preprocessing Pipeline

using Durbyn
using Durbyn.Stats

# Load data
ap = air_passengers()

# 1. Check for stationarity and determine differencing
d = ndiffs(ap; test=:kpss)
D = nsdiffs(ap, 12)
println("Non-seasonal differences: $d, Seasonal differences: $D")

# 2. Apply Box-Cox transformation
y_transformed, lambda = box_cox(ap, 12; lambda=:auto)
println("Box-Cox lambda: $lambda")

# 3. Decompose the series
result = stl(ap, 12; seasonal_window=7, robust=true)
summary(result)

# 4. Examine autocorrelation structure
acf_result = acf(ap, 12, 24)
pacf_result = pacf(ap, 12, 24)

# 5. Multiple seasonal decomposition (if applicable)
mstl_result = mstl(ap, 12; lambda=lambda)
strength = seasonal_strength(mstl_result)
println("Seasonal strength: $strength")

# 6. Perform unit root tests
adf_result = adf(ap; model=:drift, criterion=:aic, k_max=5)
kpss_result = kpss(ap; type=:mu)

println("ADF tau statistic: $(adf_result.tau)")
println("KPSS test statistic: $(kpss_result.teststat)")

References

  • Cleveland, R. B., Cleveland, W. S., McRae, J. E., & Terpenning, I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, 6(1), 3-73.
  • Dickey, D. A., & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA, 74, 427-431.
  • Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., & Shin, Y. (1992). Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root. Journal of Econometrics, 54, 159-178.
  • Phillips, P. C. B., & Perron, P. (1988). Testing for a Unit Root in Time Series Regression. Biometrika, 75(2), 335-346.
  • Osborn, D. R., Chui, A. P. L., Smith, J. P., & Birchenhall, C. R. (1988). Seasonality and the Order of Integration for Consumption. Oxford Bulletin of Economics and Statistics, 50, 361-377.
  • Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control. Wiley.
  • Brockwell, P. J., & Davis, R. A. (2016). Introduction to Time Series and Forecasting (3rd ed.). Springer.
  • Wang, X., Smith, K. A., & Hyndman, R. J. (2006). Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13(3), 335-364.