ISTA (Iterative Shrinkage-Thresholding)¶

At a glance

Family: first-order-prox · Regime: high-dim / low-dim · Penalty: lasso (any prox-able) · Output: point · Links: identity, logit, log · Status: draft · Refs: Daubechies et al., 2004 · Beck and Teboulle, 2009

Setting & assumptions¶

Any GLM in the exponential family whose loss \(\mathcal L\) is convex and differentiable with Lipschitz-continuous gradient (constant \(L_\nabla\)); Gaussian/identity is the canonical case.
High- or low-dimensional. The penalty \(P\) need only admit a cheap proximal operator (lasso is the headline case; the method extends to any prox-able \(P\)).
Columns of \(X\) standardized; \(y\) centered (Gaussian). Intercept unpenalized.

Estimator / objective¶

ISTA targets the composite convex objective

\[ \widehat\beta(\lambda) \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) \;+\; \lambda\lVert\beta\rVert_1 , \qquad \mathcal L(\beta)=\frac{1}{2n}\lVert y-X\beta\rVert_2^2 \;\text{(Gaussian)} , \]

a smooth loss \(\mathcal L\) plus a (possibly nonsmooth) penalty \(\lambda P\). For a general GLM \(\mathcal L\) is the mean negative log-likelihood.

Algorithm¶

Proximal gradient. Take a gradient step on the smooth part, then apply the proximal operator of the penalty. With step size \(t_k>0\),

\[ \beta^{(t+1)} \;=\; \operatorname{prox}_{t_k\lambda\lVert\cdot\rVert_1}\!\Big(\beta^{(t)} - t_k\,\nabla\mathcal L(\beta^{(t)})\Big) \;=\; \mathcal S_{t_k\lambda}\!\Big(\beta^{(t)} - t_k\,\nabla\mathcal L(\beta^{(t)})\Big), \]

where \(\mathcal S\) is the soft-threshold (the lasso prox) applied componentwise.

Input: X, y, λ, gradient ∇L, step rule, init β^(0)
for t = 0, 1, 2, ... until convergence:
    g       = ∇L(β^(t))                 # e.g. -(1/n) Xᵀ(y - Xβ^(t)) for Gaussian
    v       = β^(t) - t_k * g           # gradient step
    β^(t+1) = Soft(v, t_k * λ)          # prox: soft-threshold componentwise
return β̂

Constant step size. \(t_k = 1/L_\nabla\) guarantees descent and convergence. For Gaussian loss \(L_\nabla = \tfrac1n\lambda_{\max}(X^\top X)\) (largest eigenvalue).
Backtracking line search. When \(L_\nabla\) is unknown, shrink \(t_k\leftarrow\eta t_k\) (e.g. \(\eta=0.5\)) until the quadratic majorization \(\mathcal L(\beta^{(t+1)})\le \mathcal L(\beta^{(t)})+\nabla\mathcal L(\beta^{(t)})^\top(\beta^{(t+1)}-\beta^{(t)})+\tfrac{1}{2t_k}\lVert\beta^{(t+1)}-\beta^{(t)}\rVert_2^2\) holds.
Stopping. Relative change in objective or \(\lVert\beta^{(t+1)}-\beta^{(t)}\rVert\) below tolerance.

Hyperparameters & configuration¶

Knob	Default	Notes
\(\lambda\)	fixed / path	regularization strength; path via warm starts
step \(t_k\)	\(1/L_\nabla\)	constant (\(L_\nabla\) = Lipschitz const of \(\nabla\mathcal L\)) or backtracking
backtracking \(\eta\)	\(0.5\)	step-shrink factor when \(L_\nabla\) unknown
tol	\(10^{-6}\)	stopping on objective / iterate change
max iter	\(10^3\)–\(10^4\)	ISTA needs many iterations (\(O(1/k)\) rate)
init \(\beta^{(0)}\)	\(0\)	warm start across a \(\lambda\)-path accelerates convergence

Mapping to framework¶

Input: \(X, y\), link; regularization \(\lambda\); step rule.
Output: \(\widehat\beta(\lambda)\) — a single point.
Links: identity, logit, log (any convex GLM loss with Lipschitz gradient).
Preprocessing: standardize \(X\); center \(y\) (Gaussian) or fit an unpenalized intercept (GLM).

Complexity¶

Per iteration: one gradient (\(O(np)\) for dense \(X\)) + one prox (\(O(p)\)). Memory \(O(np)\) (or \(O(\text{nnz})\) for sparse \(X\)).
Iteration count: \(O(1/\epsilon)\) iterations for an \(\epsilon\)-accurate objective (sublinear); linear rate if \(\mathcal L\) is additionally strongly convex.

Statistical guarantees¶

Optimization (convex \(\mathcal L\)). With \(t_k=1/L_\nabla\), \(F(\beta^{(k)})-F(\widehat\beta)\le \dfrac{L_\nabla\lVert\beta^{(0)}-\widehat\beta\rVert_2^2}{2k}=O(1/k)\), where \(F=\mathcal L+\lambda P\) Beck and Teboulle, 2009. The accelerated FISTA improves this to \(O(1/k^2)\).
ISTA is an optimization method: statistical properties of the solution \(\widehat\beta(\lambda)\) are those of the lasso/penalized M-estimator it computes (see Lasso-CD).
Daubechies, Defrise & De Mol (2004) established convergence of the iterative thresholding iteration for linear inverse problems with a sparsity penalty Daubechies et al., 2004.

FISTA (Fast ISTA) — Nesterov-accelerated version, \(O(1/k^2)\).
Lasso via Coordinate Descent — alternative solver for the same objective.
Proximal gradient with general \(P\) — group lasso, nuclear norm, etc., by swapping the prox.