Optimization Module
The Optimize module provides numerical optimization algorithms for minimizing objective functions. These solvers are used internally throughout Durbyn.jl for model fitting (e.g., maximum likelihood estimation in ETS, ARIMA, and BATS models) and are also available for general-purpose optimization tasks.
Overview
The module implements four optimization algorithms accessible through a unified interface:
| Algorithm | Function | Type | Use Case |
|---|---|---|---|
| Nelder-Mead | nelder_mead | Derivative-free | General purpose, no gradient needed |
| BFGS | bfgs | Quasi-Newton | Fast convergence with gradients |
| L-BFGS-B | lbfgsb | Bounded quasi-Newton | Box-constrained optimization |
| Brent | brent | 1D derivative-free | Scalar optimization |
Exported Functions and Types
# Main optimization functions
export optimize, nelder_mead, bfgs, lbfgsb, brent
# Options types
export NelderMeadOptions, BFGSOptions, LBFGSBOptions, BrentOptions
# Supporting functions
export numgrad, numgrad!, numgrad_with_cache!, NumericalGradientCache
export numerical_hessian, bfgs_hessian_update!, BFGSWorkspace
export scaler, descalerUnified Interface (optimize)
The optimize function provides a unified interface for all solvers, allowing easy switching between methods.
optimize(x0, fn; grad=nothing, method=:nelder_mead, lower=-Inf, upper=Inf,
control=Dict(), hessian=false, kwargs...)Arguments
x0::Vector{Float64}: Initial parameter vectorfn::Function: Objective function to minimize, called asfn(x; kwargs...)
Keyword Arguments
grad::Union{Function,Nothing}: Gradient function, called asgrad(x; kwargs...). Ifnothing, numerical gradients are computed automatically.method::Symbol: Optimization method::nelder_mead(default) - Derivative-free simplex:bfgs- Quasi-Newton with line search:lbfgsb- Limited-memory BFGS with box constraints:brent- 1D optimization (scalarx0only)
lower,upper: Bounds for L-BFGS-B and Brent methodscontrol::Dict: Control parameters (see below)hessian::Bool: Iftrue, compute Hessian at solution
Control Parameters
| Parameter | Default | Description |
|---|---|---|
trace | 0 | Verbosity level (0=silent, >0=verbose) |
fnscale | 1.0 | Function scaling factor |
parscale | ones(n) | Parameter scaling vector |
ndeps | 1e-3 | Step sizes for numerical derivatives |
maxit | 500/100 | Maximum iterations (500 for NM, 100 for others) |
abstol | -Inf | Absolute convergence tolerance |
reltol | sqrt(eps) | Relative convergence tolerance |
gtol | 0.0 | Gradient norm tolerance (BFGS only) |
alpha | 1.0 | Nelder-Mead reflection coefficient |
beta | 0.5 | Nelder-Mead contraction coefficient |
gamma | 2.0 | Nelder-Mead expansion coefficient |
REPORT | 10 | Reporting frequency for BFGS |
lmm | 5 | L-BFGS-B memory parameter |
factr | 1e7 | L-BFGS-B tolerance factor |
pgtol | 0.0 | L-BFGS-B projected gradient tolerance |
Returns
Named tuple with fields:
par::Vector{Float64}: Optimal parametersvalue::Float64: Function value at optimumcounts::NamedTuple:(function_=n, gradient=m)evaluation countsconvergence::Int: Status code (0=success, 1=maxit reached)message: Convergence message (method-dependent)hessian: Hessian matrix at solution (if requested)
Examples
using Durbyn.Optimize
# Define Rosenbrock function
rosenbrock(x) = 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
# Analytical gradient
rosenbrock_grad(x) = [
-400*x[1]*(x[2]-x[1]^2) - 2*(1-x[1]),
200*(x[2]-x[1]^2)
]
# Nelder-Mead (no gradient needed)
result = optimize([-1.2, 1.0], rosenbrock)
println("Optimal: $(result.par), Value: $(result.value)")
# BFGS with analytical gradient
result = optimize([-1.2, 1.0], rosenbrock; grad=rosenbrock_grad, method=:bfgs)
# BFGS with numerical gradient (automatic)
result = optimize([-1.2, 1.0], rosenbrock; method=:bfgs)
# L-BFGS-B with box constraints
result = optimize([0.5, 0.5], rosenbrock; method=:lbfgsb,
lower=[0.0, 0.0], upper=[2.0, 2.0])
# With control parameters
result = optimize([-1.2, 1.0], rosenbrock; method=:bfgs,
control=Dict("trace" => 1, "maxit" => 500, "gtol" => 1e-6))
# Request Hessian at solution
result = optimize([-1.2, 1.0], rosenbrock; grad=rosenbrock_grad,
method=:bfgs, hessian=true)
println("Hessian:\n$(result.hessian)")
# 1D optimization with Brent's method
f1d(x) = (x[1] - 2)^2
result = optimize([0.0], f1d; method=:brent, lower=-5.0, upper=5.0)Nelder-Mead Simplex (nelder_mead)
A derivative-free optimization method that searches for a minimum using a simplex – a polytope with $n+1$ vertices in $n$ dimensions. The algorithm adaptively reshapes the simplex through reflection, expansion, contraction, and shrink operations.
Reference: Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7, 308–313. Implementation follows the compact formulation in Nash (1990).
Algorithm
The algorithm performs four operations on the simplex:
- Reflection: Reflect the worst point through the centroid
- Expansion: If reflection improves the best, expand further
- Contraction: If reflection fails, contract toward the best
- Shrink: If still no improvement, shrink simplex about the best point
Mathematical Details
Given simplex vertices $\{x_1, ..., x_{n+1}\}$ ordered by function values $f(x_1) \leq ... \leq f(x_{n+1})$:
- Centroid: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Reflection: $x_r = \bar{x} + \alpha(\bar{x} - x_{n+1})$
- Expansion: $x_e = \bar{x} + \gamma(x_r - \bar{x})$
- Contraction: $x_c = \bar{x} + \beta(x_{n+1} - \bar{x})$
Default coefficients: $\alpha = 1.0$, $\beta = 0.5$, $\gamma = 2.0$
Usage
nelder_mead(f, x0, options::NelderMeadOptions)Options:
NelderMeadOptions(;
abstol = -Inf, # Absolute tolerance on function value
reltol = sqrt(eps()), # Relative tolerance
alpha = 1.0, # Reflection coefficient
beta = 0.5, # Contraction coefficient
gamma = 2.0, # Expansion coefficient
trace = false, # Print diagnostics
maxit = 500, # Maximum function evaluations
invalid_penalty = 1e35, # Penalty for non-finite values
project_to_bounds = false,
lower = nothing,
upper = nothing,
init_step_cap = nothing
)Returns: Named tuple (x_opt, f_opt, fncount, fail)
fail=0: Converged successfullyfail=1: Exceeded maximum iterationsfail=10: Degenerate simplex (shrink failure)
Example
using Durbyn.Optimize
f(x) = (x[1] - 1)^2 + (x[2] - 2)^2
opts = NelderMeadOptions(trace=true, maxit=1000)
result = nelder_mead(f, [0.0, 0.0], opts)
println("Optimum: $(result.x_opt)")BFGS Quasi-Newton (bfgs)
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is a quasi-Newton method that iteratively builds an approximation to the inverse Hessian matrix using gradient information. It achieves superlinear convergence on smooth problems without requiring explicit second-derivative computation.
Reference: Nocedal, J. and Wright, S. J. (1999). Numerical Optimization. Springer. See also the original papers by Broyden (1970), Fletcher (1970), Goldfarb (1970), and Shanno (1970).
Algorithm
BFGS iteratively updates the inverse Hessian approximation $B_k$ using the formula:
\[B_{k+1} = B_k + \frac{(1 + c^T B_k c / D_1) t t^T}{D_1} - \frac{t (B_k c)^T + (B_k c) t^T}{D_1}\]
where:
- $t = x_{k+1} - x_k$ (parameter step)
- $c = \nabla f_{k+1} - \nabla f_k$ (gradient difference)
- $D_1 = t^T c$
Features
- Armijo line search with backtracking
- Periodic Hessian restarts every 2n gradient evaluations
- Parameter masking to freeze variables
- Automatic numerical gradients if analytical gradient not provided
- Gradient norm convergence criterion
Usage
bfgs(f, g, x0; mask=nothing, options=BFGSOptions(), step_sizes=1e-3*ones(n),
numgrad_cache=nothing, extra=nothing)Function Signatures:
f(n, x, extra)- Objective functiong(n, x, grad, extra)- Gradient function (modifiesgradin-place), ornothing
Options:
BFGSOptions(;
abstol = -Inf, # Absolute tolerance
reltol = sqrt(eps()), # Relative tolerance
gtol = 0.0, # Gradient norm tolerance
trace = false, # Print progress
maxit = 100, # Maximum iterations
report_interval = 10 # Reporting frequency
)Gradient Norm Convergence (gtol):
When gtol > 0, convergence is declared if:
\[\|\nabla f(x)\| < \text{gtol} \times \max(1, |f(x)|)\]
This provides a first-order optimality condition. Recommended values: 1e-5 to 1e-8.
Returns: Named tuple (x_opt, f_opt, n_iter, fail, fn_evals, gr_evals)
Example
using Durbyn.Optimize
# Internal function signature: f(n, x, extra)
rosenbrock_internal(n, x, extra) = 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
# Gradient modifies grad in-place
function rosenbrock_grad_internal(n, x, grad, extra)
grad[1] = -400*x[1]*(x[2]-x[1]^2) - 2*(1-x[1])
grad[2] = 200*(x[2]-x[1]^2)
return nothing
end
opts = BFGSOptions(trace=true, gtol=1e-6)
result = bfgs(rosenbrock_internal, rosenbrock_grad_internal, [-1.2, 1.0]; options=opts)
# With numerical gradients (g=nothing)
result = bfgs(rosenbrock_internal, nothing, [-1.2, 1.0]; options=opts)L-BFGS-B Bounded Optimization (lbfgsb)
A limited-memory variant of BFGS that supports box constraints on variables. Instead of storing the full inverse Hessian approximation, it retains only the last $memory_size$ iterations of gradient information, making it memory-efficient for large-scale problems.
Reference: Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16, 1190–1208. See also Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. (1997). Algorithm 778: L-BFGS-B.
Features
- Box constraints: Lower and upper bounds on variables
- Limited memory: Stores only
memory_sizegradient steps (default: 10) - Wolfe line search with zoom refinement
- Projected gradient for convergence checking
- Parameter masking by setting
lower[i] = upper[i] = x0[i]
Usage
lbfgsb(f, g, x0; mask=nothing, lower=-Inf, upper=Inf, options=LBFGSBOptions())Options:
LBFGSBOptions(;
memory_size = 10, # Memory size (number of stored iterations)
ftol_factor = 1e7, # Tolerance factor: f_tol = ftol_factor * eps()
pg_tol = 1e-5, # Projected gradient infinity-norm tolerance
maxit = 1000, # Maximum iterations
print_level = 0 # Print level (0=silent, >0=verbose)
)Convergence Criterion:
The algorithm converges when:
\[\|\text{proj}(\nabla f(x))\|_\infty < \text{pg\_tol}\]
where the projected gradient accounts for variables at their bounds.
Returns: Named tuple (x_opt, f_opt, n_iter, fail, fn_evals, gr_evals)
Example
using Durbyn.Optimize
rosenbrock(n, x, extra) = 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
opts = LBFGSBOptions(memory_size=5, pg_tol=1e-6, print_level=1)
result = lbfgsb(rosenbrock, nothing, [0.5, 0.5];
lower=[0.0, 0.0], upper=[2.0, 2.0], options=opts)
println("Bounded optimum: $(result.x_opt)")Brent's 1D Method (brent)
A derivative-free algorithm for one-dimensional optimization that combines golden section search with parabolic interpolation. Golden section steps provide guaranteed progress, while parabolic interpolation accelerates convergence near the minimum.
Reference: Brent, R. P. (1973). Algorithms for Minimization Without Derivatives. Prentice-Hall. Implementation follows the compact formulation in Nash (1990).
Convergence
For unimodal functions with positive second derivative at the minimum:
- Superlinear convergence with order approximately 1.324
- Error bound: $< 3\epsilon|x_{\min}| + \text{tol}$
Usage
brent(f, lower, upper; options=BrentOptions())Options:
BrentOptions(;
tol = 1.5e-8, # Interval tolerance
trace = false, # Print diagnostics
maxit = 1000 # Maximum iterations
)Returns: Named tuple (x_opt, f_opt, n_iter, fail, fn_evals)
Example
using Durbyn.Optimize
f(x) = (x - 3.5)^2 + 2*sin(x)
opts = BrentOptions(tol=1e-10, trace=true)
result = brent(f, 0.0, 10.0; options=opts)
println("Minimum at x = $(result.x_opt), f(x) = $(result.f_opt)")Numerical Gradient Computation
The module provides efficient numerical gradient computation using central finite differences.
numgrad
numgrad(f, n, x, extra, step_sizes; usebounds=false, lower=nothing, upper=nothing)Computes the gradient using central differences:
\[\frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon_i e_i) - f(x - \epsilon_i e_i)}{2\epsilon_i}\]
numgrad_with_cache!
For repeated gradient evaluations, use a pre-allocated cache:
cache = NumericalGradientCache(n)
numgrad_with_cache!(cache, f, n, x, extra, step_sizes)This eliminates memory allocations during iterative optimization.
Example
using Durbyn.Optimize
f(n, x, extra) = x[1]^2 + x[2]^2
x = [1.0, 2.0]
step_sizes = [1e-6, 1e-6]
# Single evaluation
g = numgrad(f, 2, x, nothing, step_sizes)
# With cache for repeated evaluations
cache = NumericalGradientCache(2)
for i in 1:1000
numgrad_with_cache!(cache, f, 2, x, nothing, step_sizes)
# cache.gradient contains the gradient
endHessian Computation
The numerical_hessian function computes the Hessian matrix at a given point using finite differences.
Reference: Nocedal, J. and Wright, S. J. (1999). Numerical Optimization. Springer. Chapter 8, finite difference formulas.
numerical_hessian(fn, x, grad=nothing; fnscale=1.0, parscale=ones(n), step_sizes=1e-3*ones(n))Method:
- If
grad=nothing: Uses second-order finite differences of the objective function - If
gradprovided: Computes Hessian from gradient via finite differences
Returns: Symmetric Hessian matrix (n x n)
Example
using Durbyn.Optimize
rosenbrock(x) = 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
# At the optimum
x_opt = [1.0, 1.0]
H = numerical_hessian(rosenbrock, x_opt)
println("Hessian at optimum:\n$H")
# Eigenvalues (should be positive for minimum)
using LinearAlgebra
eigvals(H)Parameter Scaling
Scaling parameters can improve numerical conditioning when parameters have different magnitudes.
scaler and descaler
x_scaled = scaler(x, scale) # x_scaled = x ./ scale
x_original = descaler(x_scaled, scale) # x_original = x_scaled .* scaleExample
# Parameters with different scales
par = [1e-6, 1e6]
scale = [1e-6, 1e6]
# Scale for optimization
par_scaled = scaler(par, scale) # [1.0, 1.0]
# After optimization, recover original scale
par_opt = descaler(par_scaled_opt, scale)Practical Guidelines
Choosing an Optimization Method
| Scenario | Recommended Method |
|---|---|
| No gradient available | Nelder-Mead |
| Smooth function, gradient available | BFGS |
| Box constraints needed | L-BFGS-B |
| Large-scale problem (many parameters) | L-BFGS-B |
| 1D optimization | Brent |
| Non-smooth or noisy function | Nelder-Mead |
Tips for Better Convergence
Parameter scaling: Scale parameters to similar magnitudes
control = Dict("parscale" => [1e-3, 1e3])Good starting point: Provide reasonable initial values
Check gradients: Verify analytical gradients match numerical
grad_analytical = my_gradient(x) grad_numerical = numgrad(f, n, x, nothing, 1e-6*ones(n)) @assert isapprox(grad_analytical, grad_numerical, rtol=1e-4)Adjust tolerances: Tighten tolerances for more precision
control = Dict("reltol" => 1e-10, "gtol" => 1e-8)Monitor convergence: Use
traceto diagnose issuescontrol = Dict("trace" => 1)
Complete Example: Maximum Likelihood Estimation
using Durbyn.Optimize
using Distributions
# Generate data from Normal(mu=5, sigma=2)
data = rand(Normal(5.0, 2.0), 100)
# Negative log-likelihood (to minimize)
function neg_loglik(params)
mu, log_sigma = params
sigma = exp(log_sigma) # Ensure sigma > 0
n = length(data)
ll = -n/2 * log(2*pi) - n*log_sigma - sum((data .- mu).^2) / (2*sigma^2)
return -ll # Return negative for minimization
end
# Analytical gradient
function neg_loglik_grad(params)
mu, log_sigma = params
sigma = exp(log_sigma)
n = length(data)
d_mu = sum(data .- mu) / sigma^2
d_log_sigma = n - sum((data .- mu).^2) / sigma^2
return [-d_mu, -d_log_sigma]
end
# Optimize with BFGS
result = optimize([0.0, 0.0], neg_loglik;
grad=neg_loglik_grad,
method=:bfgs,
hessian=true)
# Extract estimates
mu_hat = result.par[1]
sigma_hat = exp(result.par[2])
println("Estimated mu: $mu_hat (true: 5.0)")
println("Estimated sigma: $sigma_hat (true: 2.0)")
# Standard errors from Hessian
using LinearAlgebra
se = sqrt.(diag(inv(result.hessian)))
println("SE(mu): $(se[1])")
println("SE(log_sigma): $(se[2])")References
- Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7, 308–313.
- Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms. Journal of the Institute of Mathematics and Its Applications, 6, 76–90.
- Fletcher, R. (1970). A new approach to variable metric algorithms. Computer Journal, 13, 317–322.
- Goldfarb, D. (1970). A family of variable metric methods derived by variational means. Mathematics of Computation, 24, 23–26.
- Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24, 647–656.
- Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16, 1190–1208.
- Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. (1997). Algorithm 778: L-BFGS-B, Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23, 550–560.
- Brent, R. P. (1973). Algorithms for Minimization Without Derivatives. Prentice-Hall.
- Nash, J. C. (1990). Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation. 2nd ed. Adam Hilger.
- Nocedal, J. and Wright, S. J. (1999). Numerical Optimization. Springer.