Skip to content

FISTA (Fast ISTA)

At a glance

Family: first-order-prox · Regime: high-dim / low-dim · Penalty: lasso (any prox-able) · Output: point · Links: identity, logit, log · Status: draft · Refs: beck2009fast, daubechies2004iterative

Setting & assumptions

  • Any GLM in the exponential family whose loss \(\mathcal L\) is convex and differentiable with Lipschitz-continuous gradient (constant \(L_\nabla\)); Gaussian/identity is the canonical case.
  • High- or low-dimensional. The penalty \(P\) need only admit a cheap proximal operator (lasso is the headline case; any prox-able \(P\) works).
  • Columns of \(X\) standardized; \(y\) centered (Gaussian). Intercept unpenalized.

Estimator / objective

FISTA solves the same composite convex objective as ISTA:

\[ \hat\beta(\lambda) \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) \;+\; \lambda\lVert\beta\rVert_1 , \qquad \mathcal L(\beta)=\frac{1}{2n}\lVert y-X\beta\rVert_2^2 \;\text{(Gaussian)} , \]

but converges faster. For a general GLM \(\mathcal L\) is the mean negative log-likelihood.

Algorithm

Accelerated proximal gradient (Beck & Teboulle 2009). Apply the proximal-gradient step at an extrapolated point \(y^{(k)}\) that carries Nesterov momentum, using a fixed step \(s=1/L_\nabla\). Initialize \(\beta^{(0)}=\beta^{(-1)}\), \(t_1=1\). The exact recursion is

\[ \beta^{(k+1)} = \operatorname{prox}_{s\lambda\lVert\cdot\rVert_1}\!\Big(y^{(k+1)} - s\,\nabla\mathcal L(y^{(k+1)})\Big), $$ $$ t_{k+1} = \frac{1+\sqrt{1+4t_k^2}}{2}, \qquad y^{(k+1)} = \beta^{(k)} + \frac{t_k - 1}{t_{k+1}}\big(\beta^{(k)} - \beta^{(k-1)}\big). \]
Input: X, y, λ, gradient ∇L, step s = 1/L_∇, init β^(0)
β^(-1) = β^(0);  y^(1) = β^(0);  t_1 = 1
for k = 1, 2, ... until convergence:
    v        = y^(k) - s * ∇L(y^(k))            # gradient step at extrapolated point
    β^(k)    = Soft(v, s * λ)                    # prox: soft-threshold componentwise
    t_{k+1}  = (1 + sqrt(1 + 4 t_k^2)) / 2       # momentum weight
    y^(k+1)  = β^(k) + ((t_k - 1)/t_{k+1}) (β^(k) - β^(k-1))   # extrapolation
return β̂
  • The momentum term \(\tfrac{t_k-1}{t_{k+1}}\to 1\) as \(k\to\infty\), adding a vanishing fraction of the previous step's direction — this is what lifts the rate from \(O(1/k)\) to \(O(1/k^2)\).
  • Backtracking variant. When \(L_\nabla\) is unknown, search \(s_k\leftarrow\eta s_k\) until the majorization inequality holds at \(y^{(k)}\), and use \(s_k\) in both the prox level and the \(t\)-update; convergence guarantees are preserved.
  • Stopping. Relative change in objective or iterate below tolerance.

Hyperparameters & configuration

Knob Default Notes
\(\lambda\) fixed / path regularization strength; path via warm starts
step \(s\) \(1/L_\nabla\) constant (\(L_\nabla\) = Lipschitz const of \(\nabla\mathcal L\)) or backtracking
backtracking \(\eta\) \(0.5\) step-shrink factor when \(L_\nabla\) unknown
momentum Nesterov \(t_k\) fixed recursion above; no tuning needed
tol \(10^{-6}\) stopping on objective / iterate change
max iter \(10^2\)\(10^3\) far fewer than ISTA thanks to \(O(1/k^2)\) rate

Mapping to framework

  • Input: \(X, y\), link; regularization \(\lambda\); step rule.
  • Output: \(\hat\beta(\lambda)\) — a single point.
  • Links: identity, logit, log (any convex GLM loss with Lipschitz gradient).
  • Preprocessing: standardize \(X\); center \(y\) (Gaussian) or fit an unpenalized intercept (GLM).

Complexity

  • Per iteration: one gradient (\(O(np)\) dense) + one prox (\(O(p)\)) + \(O(p)\) extrapolation. Same per-step cost as ISTA. Memory \(O(np)\) (plus \(O(p)\) to store \(\beta^{(k-1)}\)).
  • Iteration count: \(O(1/\sqrt{\epsilon})\) for an \(\epsilon\)-accurate objective — quadratically fewer iterations than ISTA's \(O(1/\epsilon)\).

Statistical guarantees

  • Optimization (convex \(\mathcal L\)). With \(s=1/L_\nabla\), \(F(\beta^{(k)})-F(\hat\beta)\le \dfrac{2L_\nabla\lVert\beta^{(0)}-\hat\beta\rVert_2^2}{(k+1)^2}=O(1/k^2)\), where \(F=\mathcal L+\lambda P\) — the optimal rate for first-order methods on this class (beck2009fast).
  • FISTA is an optimization method: statistical properties of the solution \(\hat\beta(\lambda)\) are those of the lasso/penalized M-estimator it computes (see Lasso-CD).
  • Builds on the iterative thresholding iteration of Daubechies, Defrise & De Mol (2004) (daubechies2004iterative).

References

  • Beck & Teboulle (2009), A fast iterative shrinkage-thresholding algorithm for linear inverse problems (beck2009fast) — introduces FISTA and proves the \(O(1/k^2)\) rate.
  • Daubechies, Defrise & De Mol (2004), An iterative thresholding algorithm for linear inverse problems with a sparsity constraint (daubechies2004iterative) — the ISTA precursor.