Skip to content

ISTA (Iterative Shrinkage-Thresholding)

At a glance

Family: first-order-prox · Regime: high-dim / low-dim · Penalty: lasso (any prox-able) · Output: point · Links: identity, logit, log · Status: draft · Refs: daubechies2004iterative, beck2009fast

Setting & assumptions

  • Any GLM in the exponential family whose loss \(\mathcal L\) is convex and differentiable with Lipschitz-continuous gradient (constant \(L_\nabla\)); Gaussian/identity is the canonical case.
  • High- or low-dimensional. The penalty \(P\) need only admit a cheap proximal operator (lasso is the headline case; the method extends to any prox-able \(P\)).
  • Columns of \(X\) standardized; \(y\) centered (Gaussian). Intercept unpenalized.

Estimator / objective

ISTA targets the composite convex objective

\[ \hat\beta(\lambda) \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) \;+\; \lambda\lVert\beta\rVert_1 , \qquad \mathcal L(\beta)=\frac{1}{2n}\lVert y-X\beta\rVert_2^2 \;\text{(Gaussian)} , \]

a smooth loss \(\mathcal L\) plus a (possibly nonsmooth) penalty \(\lambda P\). For a general GLM \(\mathcal L\) is the mean negative log-likelihood.

Algorithm

Proximal gradient. Take a gradient step on the smooth part, then apply the proximal operator of the penalty. With step size \(t_k>0\),

\[ \beta^{(t+1)} \;=\; \operatorname{prox}_{t_k\lambda\lVert\cdot\rVert_1}\!\Big(\beta^{(t)} - t_k\,\nabla\mathcal L(\beta^{(t)})\Big) \;=\; \mathcal S_{t_k\lambda}\!\Big(\beta^{(t)} - t_k\,\nabla\mathcal L(\beta^{(t)})\Big), \]

where \(\mathcal S\) is the soft-threshold (the lasso prox) applied componentwise.

Input: X, y, λ, gradient ∇L, step rule, init β^(0)
for t = 0, 1, 2, ... until convergence:
    g       = ∇L(β^(t))                 # e.g. -(1/n) Xᵀ(y - Xβ^(t)) for Gaussian
    v       = β^(t) - t_k * g           # gradient step
    β^(t+1) = Soft(v, t_k * λ)          # prox: soft-threshold componentwise
return β̂
  • Constant step size. \(t_k = 1/L_\nabla\) guarantees descent and convergence. For Gaussian loss \(L_\nabla = \tfrac1n\lambda_{\max}(X^\top X)\) (largest eigenvalue).
  • Backtracking line search. When \(L_\nabla\) is unknown, shrink \(t_k\leftarrow\eta t_k\) (e.g. \(\eta=0.5\)) until the quadratic majorization \(\mathcal L(\beta^{(t+1)})\le \mathcal L(\beta^{(t)})+\nabla\mathcal L(\beta^{(t)})^\top(\beta^{(t+1)}-\beta^{(t)})+\tfrac{1}{2t_k}\lVert\beta^{(t+1)}-\beta^{(t)}\rVert_2^2\) holds.
  • Stopping. Relative change in objective or \(\lVert\beta^{(t+1)}-\beta^{(t)}\rVert\) below tolerance.

Hyperparameters & configuration

Knob Default Notes
\(\lambda\) fixed / path regularization strength; path via warm starts
step \(t_k\) \(1/L_\nabla\) constant (\(L_\nabla\) = Lipschitz const of \(\nabla\mathcal L\)) or backtracking
backtracking \(\eta\) \(0.5\) step-shrink factor when \(L_\nabla\) unknown
tol \(10^{-6}\) stopping on objective / iterate change
max iter \(10^3\)\(10^4\) ISTA needs many iterations (\(O(1/k)\) rate)
init \(\beta^{(0)}\) \(0\) warm start across a \(\lambda\)-path accelerates convergence

Mapping to framework

  • Input: \(X, y\), link; regularization \(\lambda\); step rule.
  • Output: \(\hat\beta(\lambda)\) — a single point.
  • Links: identity, logit, log (any convex GLM loss with Lipschitz gradient).
  • Preprocessing: standardize \(X\); center \(y\) (Gaussian) or fit an unpenalized intercept (GLM).

Complexity

  • Per iteration: one gradient (\(O(np)\) for dense \(X\)) + one prox (\(O(p)\)). Memory \(O(np)\) (or \(O(\text{nnz})\) for sparse \(X\)).
  • Iteration count: \(O(1/\epsilon)\) iterations for an \(\epsilon\)-accurate objective (sublinear); linear rate if \(\mathcal L\) is additionally strongly convex.

Statistical guarantees

  • Optimization (convex \(\mathcal L\)). With \(t_k=1/L_\nabla\), \(F(\beta^{(k)})-F(\hat\beta)\le \dfrac{L_\nabla\lVert\beta^{(0)}-\hat\beta\rVert_2^2}{2k}=O(1/k)\), where \(F=\mathcal L+\lambda P\) (beck2009fast). The accelerated FISTA improves this to \(O(1/k^2)\).
  • ISTA is an optimization method: statistical properties of the solution \(\hat\beta(\lambda)\) are those of the lasso/penalized M-estimator it computes (see Lasso-CD).
  • Daubechies, Defrise & De Mol (2004) established convergence of the iterative thresholding iteration for linear inverse problems with a sparsity penalty (daubechies2004iterative).
  • FISTA (Fast ISTA) — Nesterov-accelerated version, \(O(1/k^2)\).
  • Lasso via Coordinate Descent — alternative solver for the same objective.
  • Proximal gradient with general \(P\) — group lasso, nuclear norm, etc., by swapping the prox.

References

  • Daubechies, Defrise & De Mol (2004), An iterative thresholding algorithm for linear inverse problems with a sparsity constraint (daubechies2004iterative).
  • Beck & Teboulle (2009), A fast iterative shrinkage-thresholding algorithm for linear inverse problems (beck2009fast) — states ISTA, its \(O(1/k)\) rate, and backtracking.