Utilities Module

The Utils module provides helper functions used throughout Durbyn.jl: example datasets for testing and learning, data-cleaning functions, and other supporting tools.


Example Datasets

Durbyn.jl provides several classic time series datasets commonly used in forecasting literature and examples. All datasets are returned as Vector{Float64}.

Available Datasets

| Dataset | Frequency | Length | Period | Characteristics |
| --- | --- | --- | --- | --- |
| air_passengers | Monthly | 144 | 1949-1960 | Trend, multiplicative seasonality |
| ausbeer | Quarterly | 218 | 1956-2010 | Seasonal pattern, varying trend |
| lynx | Annual | 114 | 1821-1934 | Cyclic (~10 year cycle) |
| sunspots | Monthly | 235* | 1749-1768 | Cyclic (~11 year solar cycle) |
| pedestrian_counts | Daily | 2922 | 2009-2016 | Weekly + annual seasonality |
| simulate_seasonal_data | Configurable | User-defined | - | Synthetic data generator |

*Truncated for demonstration; full dataset has 2820 observations.


Real-World Datasets

air_passengers

Monthly airline passenger numbers (in thousands) from January 1949 to December 1960. This is the classic Box & Jenkins dataset exhibiting both trend and multiplicative seasonal patterns.

using Durbyn

ap = air_passengers()
println("Length: ", length(ap))      # 144
println("Range: ", extrema(ap))      # (104.0, 622.0)

# Fit a model
fit = ets(ap, 12, "ZZZ")
fc = forecast(fit, h=12)
plot(fc)

Properties:

  • Frequency: 12 (monthly)
  • Characteristics: Strong upward trend, multiplicative seasonality with increasing amplitude

Source: Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (2015). Time Series Analysis: Forecasting and Control.
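
Multiplicative seasonality means the seasonal swing grows in proportion to the level, which is why a log transform is a common first step before fitting an additive model. The sketch below illustrates this on a small synthetic series, not on the air_passengers data; the names `level`, `season`, and `swing` are illustrative only.

```julia
# Synthetic multiplicative series: linear level times a seasonal factor of period 12.
n, m = 48, 12
level  = [100.0 + 5.0 * t for t in 1:n]
season = [1.0 + 0.2 * sin(2π * t / m) for t in 1:n]
y = level .* season

# Peak-to-trough size of a vector.
swing(v) = maximum(v) - minimum(v)

# Deviation from the level on the raw scale grows with the level ...
raw_dev = y .- level
raw_first, raw_last = swing(raw_dev[1:12]), swing(raw_dev[37:48])

# ... while on the log scale it is the same in every year.
log_dev = log.(y) .- log.(level)
log_first, log_last = swing(log_dev[1:12]), swing(log_dev[37:48])

println("raw swing: ", raw_first, " -> ", raw_last)
println("log swing: ", log_first, " -> ", log_last)
```

On the raw scale the swing in the last year is larger than in the first; after taking logs the two agree up to floating-point error.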


ausbeer

Quarterly Australian beer production in megalitres from Q1 1956 to Q2 2010.

using Durbyn
using Statistics  # provides mean

beer = ausbeer()
println("Length: ", length(beer))    # 218
println("Mean: ", round(mean(beer), digits=1))  # ~430 megalitres

# Fit Holt-Winters
fit = holt_winters(beer, 4)
fc = forecast(fit, h=8)
plot(fc)

Properties:

  • Frequency: 4 (quarterly)
  • Characteristics: Strong Q4 peak (summer in Australia), varying long-term trend

Source: Australian Bureau of Statistics, Cat. 8301.0.55.001.


lynx

Annual Canadian lynx trappings from 1821 to 1934 in the Mackenzie River district.

using Durbyn

lynx_data = lynx()
println("Length: ", length(lynx_data))  # 114
println("Range: ", extrema(lynx_data))  # (39.0, 6991.0)

# Good for demonstrating cyclic patterns
# Note: This is annual data, so m=1 (no within-year seasonality)

Properties:

  • Frequency: 1 (annual)
  • Characteristics: Famous ~10-year population cycle, predator-prey dynamics

Source: Campbell, M.J. & Walker, A.M. (1977). Journal of the Royal Statistical Society Series A, 140, 411-431.
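
Because the cycle is not tied to the calendar, it does not show up through the seasonal period m; one common way to see it is the sample autocorrelation. The following is a minimal sketch on a synthetic series with a 10-step cycle, not on the lynx data, and `acf_sketch` is a hypothetical helper rather than a Durbyn function.

```julia
# Sample autocorrelation at lag k, normalised by the lag-0 sum of squares.
function acf_sketch(x::AbstractVector{<:Real}, k::Integer)
    n = length(x)
    μ = sum(x) / n
    num = sum((x[t] - μ) * (x[t + k] - μ) for t in 1:(n - k))
    den = sum((v - μ)^2 for v in x)
    return num / den
end

# Synthetic annual series with a 10-step cycle (illustrative, not real trappings).
cyclic = [1000.0 + 800.0 * sin(2π * t / 10) for t in 1:100]

lags = 1:20
acfs = [acf_sketch(cyclic, k) for k in lags]
best_lag = lags[argmax(acfs)]   # lag with the highest autocorrelation

println("Strongest autocorrelation at lag ", best_lag)
```

For real data the peak is less sharp, so the lag found this way is an estimate of the cycle length rather than an exact period.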


sunspots

Monthly mean relative sunspot numbers showing the ~11-year solar cycle.

using Durbyn

spots = sunspots()
println("Length: ", length(spots))   # 235 (truncated)
println("Max: ", maximum(spots))     # ~158

# Demonstrates long cycles in time series

Properties:

  • Frequency: 12 (monthly)
  • Characteristics: ~11-year solar cycle, non-stationary

Note: This is a truncated version (1749-1768). The full dataset (2820 observations) is available from SILSO.

Source: World Data Center-SILSO, Royal Observatory of Belgium.


pedestrian_counts

Simulated daily pedestrian counts modelled on a city sensor location (2009-2016), exhibiting multiple seasonal patterns.

using Durbyn

pedestrians = pedestrian_counts()
println("Length: ", length(pedestrians))  # 2922 (~8 years)

# Ideal for testing multiple seasonality models
# Weekly pattern (period=7) + Annual pattern (period=365.25)
using Durbyn.Bats
fit = tbats(pedestrians, [7, 365.25])
fc = forecast(fit, h=30)

Properties:

  • Frequency: Daily (multiple seasonal periods)
  • Characteristics:
    • Weekly seasonality (period=7): Lower weekend traffic
    • Annual seasonality (period=365.25): Seasonal variations
    • Upward trend

Source: Simulated based on Melbourne Pedestrian Counting System patterns.
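
The length of 2922 and the annual period of 365.25 can be checked with calendar arithmetic. This sketch assumes the series covers every day from 2009-01-01 through 2016-12-31, which the documentation implies but does not state outright; it uses only the Dates standard library.

```julia
using Dates

# Assumption: one observation per day over eight complete calendar years.
days = Date(2009, 1, 1):Day(1):Date(2016, 12, 31)
n_days = length(days)
avg_year = n_days / 8            # average year length across the span (two leap years)

# Day-of-week index (1 = Monday ... 7 = Sunday), the natural key for the weekly pattern.
dow = dayofweek.(days)
is_weekend = dow .>= 6

println("Days: ", n_days, ", average year length: ", avg_year)
println("Weekend share: ", count(is_weekend) / n_days)
```

The average year length over this span is why a fractional annual period is used alongside the weekly period of 7.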


Synthetic Data Generator

simulate_seasonal_data

Generate synthetic time series with configurable components for testing and experimentation.

simulate_seasonal_data(n=365; m=12, trend=true, seasonal_strength=1.0,
                       noise_level=0.1, base_level=100.0, trend_coef=0.1)

Arguments:

| Parameter | Default | Description |
| --- | --- | --- |
| n | 365 | Number of observations |
| m | 12 | Seasonal period |
| trend | true | Include linear trend |
| seasonal_strength | 1.0 | Seasonal amplitude multiplier |
| noise_level | 0.1 | Noise as fraction of base level |
| base_level | 100.0 | Base level of the series |
| trend_coef | 0.1 | Trend coefficient per observation |

Common Frequency Values:

| m | Data Type |
| --- | --- |
| 4 | Quarterly |
| 7 | Daily with weekly pattern |
| 12 | Monthly |
| 24 | Hourly with daily pattern |
| 52 | Weekly with annual pattern |
| 365 | Daily with annual pattern |

Examples:

using Durbyn

# Monthly data similar to air_passengers
monthly = simulate_seasonal_data(144; m=12, base_level=100.0, trend_coef=0.5)

# Quarterly data similar to ausbeer
quarterly = simulate_seasonal_data(100; m=4, seasonal_strength=1.5)

# Daily data with weekly pattern
daily = simulate_seasonal_data(365; m=7, base_level=1000.0)

# No seasonality (for testing trend-only models)
trend_only = simulate_seasonal_data(100; m=1, seasonal_strength=0.0)

# Strong seasonality, no trend
seasonal_only = simulate_seasonal_data(200; m=12, trend=false, seasonal_strength=2.0)

Complex Seasonality (Multiple Periods):

# Daily data with both weekly and annual patterns
n = 365 * 2
weekly = simulate_seasonal_data(n; m=7, trend=false, base_level=0.0, seasonal_strength=0.5)
annual = simulate_seasonal_data(n; m=365, trend=false, base_level=0.0, seasonal_strength=1.0)
trend_only = simulate_seasonal_data(n; m=1, seasonal_strength=0.0, base_level=100.0)
complex_data = trend_only .+ weekly .+ annual

Generated Series Structure:

\[Y(t) = \text{Base} + \text{Trend}(t) + \text{Seasonal}(t) + \text{Noise}(t)\]

where:

  • Base = base_level
  • Trend(t) = trend_coef * t if trend=true, otherwise 0
  • Seasonal(t) = seasonal_strength * base_level * sin(2π * t / m)
  • Noise(t) ~ Normal(0, noise_level * base_level)
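
The structure above can be written out directly in base Julia. The following is a hypothetical re-implementation of the documented formula for illustration, not Durbyn's source code; the name `sketch_series` and the `rng` keyword are assumptions, and Normal(0, s) is read here as having standard deviation s.

```julia
using Random

# Hypothetical sketch of the documented formula; not the library's implementation.
# Assumption: the second argument of Normal(0, s) is a standard deviation.
function sketch_series(n::Integer; m=12, trend=true, seasonal_strength=1.0,
                       noise_level=0.1, base_level=100.0, trend_coef=0.1,
                       rng=Random.default_rng())
    t = 1:n
    trend_part    = trend ? trend_coef .* t : zeros(n)
    seasonal_part = seasonal_strength * base_level .* sin.(2π .* t ./ m)
    noise_part    = noise_level * base_level .* randn(rng, n)
    return base_level .+ trend_part .+ seasonal_part .+ noise_part
end

# Noise switched off so the values can be checked by hand.
y = sketch_series(24; noise_level=0.0)
println(y[3])    # base + trend at t=3 + seasonal peak

# With base_level=0.0 the seasonal and noise terms are both scaled to zero.
flat = sketch_series(10; base_level=0.0, trend=false)
println(all(iszero, flat))
```

Under this reading of the formula, both the seasonal amplitude and the noise scale with base_level, so a component generated with base_level=0.0 is identically zero. It is worth inspecting generated components before summing them.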

Quick Reference

using Durbyn

# Load all datasets
ap = air_passengers()      # Monthly, m=12, classic Box-Jenkins
beer = ausbeer()           # Quarterly, m=4, Australian beer
lynx_data = lynx()         # Annual, m=1, cyclic pattern
spots = sunspots()         # Monthly, m=12, solar cycle
peds = pedestrian_counts() # Daily, m=[7, 365.25], multi-seasonal

# Generate custom data
custom = simulate_seasonal_data(100; m=12, seasonal_strength=1.5)

Data Cleaning Utilities

The Utils module provides functions for handling missing values (missing and NaN) in vectors and matrices.

dropmissing

Remove missing values from a vector, or remove rows with missing values from a vector-matrix pair.

dropmissing(x::AbstractVector)                    # drop missing/NaN from vector
dropmissing(x::AbstractVector, X::AbstractMatrix)  # drop rows with missing/NaN from both

Example:

using Durbyn.Utils: dropmissing

x = [1.0, NaN, 3.0, missing, 5.0]
dropmissing(x)  # [1.0, 3.0, 5.0]

# Paired removal (vector + matrix)
x = [1.0, NaN, 3.0]
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
x_clean, X_clean = dropmissing(x, X)
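
To make the paired behaviour concrete, here is a minimal base-Julia sketch of row-wise removal. It is a hypothetical stand-in, not Durbyn's implementation: the names `is_bad` and `drop_rows_sketch` are invented for illustration, and it assumes a row is dropped when either the vector entry or any matrix entry in that row is missing or NaN.

```julia
# Hypothetical helpers for illustration; not part of Durbyn.
is_bad(v) = ismissing(v) || (v isa AbstractFloat && isnan(v))

function drop_rows_sketch(x::AbstractVector, X::AbstractMatrix)
    size(X, 1) == length(x) ||
        throw(DimensionMismatch("x and X need the same number of rows"))
    # Keep row i only if x[i] and every entry of X[i, :] are usable.
    keep = [!is_bad(x[i]) && !any(is_bad, view(X, i, :)) for i in eachindex(x)]
    return x[keep], X[keep, :]
end

x = [1.0, NaN, 3.0]
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
x_clean, X_clean = drop_rows_sketch(x, X)

println(x_clean)
println(X_clean)
```

Filtering both objects with the same mask keeps the response and the regressor rows aligned, which is the point of the paired method.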

ismissingish

Test whether a value is "missing-like" (missing or NaN). Follows Julia's is* predicate convention.

ismissingish(v)

Example:

using Durbyn.Utils: ismissingish

ismissingish(missing)  # true
ismissingish(NaN)      # true
ismissingish(1.0)      # false
ismissingish(Inf)      # false
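
A predicate with this behaviour can be expressed through multiple dispatch. The sketch below is an assumption about one possible design, not Durbyn's source; `missingish_sketch` is a hypothetical name. A dedicated check is needed because `NaN == NaN` is false in Julia, so equality cannot detect NaN.

```julia
# Hypothetical equivalent written with multiple dispatch; not the library's code.
missingish_sketch(::Missing) = true
missingish_sketch(v::AbstractFloat) = isnan(v)   # Inf is a float but not NaN
missingish_sketch(v) = false                     # fallback for integers, strings, etc.

for v in (missing, NaN, 1.0, Inf, 3, "text")
    println(repr(v), " => ", missingish_sketch(v))
end
```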

completecases

Return a boolean vector indicating which elements are complete (neither missing nor NaN). Matches the DataFrames.jl naming convention.

completecases(x::AbstractArray)

Example:

using Durbyn.Utils: completecases

completecases([1.0, missing, 3.0, NaN])  # [true, false, true, false]
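
A boolean mask of this kind is typically used for logical indexing or for counting how many entries would be dropped. The sketch below builds an equivalent mask in base Julia; `complete_sketch` is a hypothetical name and not the Durbyn function.

```julia
# Hypothetical mask builder mirroring the documented behaviour; not the library's code.
complete_sketch(x::AbstractArray) =
    map(v -> !(ismissing(v) || (v isa AbstractFloat && isnan(v))), x)

x = [1.0, missing, 3.0, NaN]
mask = complete_sketch(x)
kept = x[mask]                  # logical indexing keeps the complete entries
n_dropped = count(!, mask)      # number of incomplete entries

println(mask)
println(kept)
println(n_dropped)
```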

mean2

Compute the mean, with an option to skip missing values.

mean2(x; skipmissing=false)

Example:

using Durbyn.Utils: mean2

mean2([1.0, 2.0, missing, 4.0]; skipmissing=true)  # 2.333...
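
The option matters because `missing` propagates through arithmetic in Julia. The sketch below shows the two behaviours with the Statistics standard library only; whether mean2 wraps these exact calls is an assumption, not something the documentation states.

```julia
using Statistics

x = [1.0, 2.0, missing, 4.0]

m_default = mean(x)                 # missing propagates through the sum
m_skipped = mean(skipmissing(x))    # averages 1.0, 2.0 and 4.0 only

println(m_default)
println(m_skipped)
```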

References

  • Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley.
  • Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts.
  • Campbell, M.J. & Walker, A.M. (1977). A Survey of statistical work on the Mackenzie River series. Journal of the Royal Statistical Society Series A, 140, 411-431.