FISTA (Fast ISTA)¶

At a glance

Family: first-order-prox · Regime: high-dim / low-dim · Penalty: lasso (any prox-able) · Output: point · Links: identity, logit, log · Status: draft · Refs: Beck and Teboulle, 2009 · Daubechies et al., 2004

Setting & assumptions¶

Any GLM in the exponential family whose loss $\mathcal L$ is convex and differentiable with Lipschitz-continuous gradient (constant $L_\nabla$); Gaussian/identity is the canonical case.
High- or low-dimensional. The penalty $P$ need only admit a cheap proximal operator (lasso is the headline case; any prox-able $P$ works).
Columns of $X$ standardized; $y$ centered (Gaussian). Intercept unpenalized.

Estimator / objective¶

FISTA solves the same composite convex objective as ISTA:

\[ \widehat\beta(\lambda) \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) \;+\; \lambda\lVert\beta\rVert_1 , \qquad \mathcal L(\beta)=\frac{1}{2n}\lVert y-X\beta\rVert_2^2 \;\text{(Gaussian)} , \]

but converges faster. For a general GLM $\mathcal L$ is the mean negative log-likelihood.

Algorithm¶

Accelerated proximal gradient (Beck & Teboulle 2009). Apply the proximal-gradient step at an extrapolated point $y^{(k)}$ that carries Nesterov momentum, using a fixed step $s=1/L_\nabla$. Initialize $\beta^{(0)}=\beta^{(-1)}$, $t_1=1$. The exact recursion is

\[ \beta^{(k+1)} = \operatorname{prox}_{s\lambda\lVert\cdot\rVert_1}\!\Big(y^{(k+1)} - s\,\nabla\mathcal L(y^{(k+1)})\Big), $$ $$ t_{k+1} = \frac{1+\sqrt{1+4t_k^2}}{2}, \qquad y^{(k+1)} = \beta^{(k)} + \frac{t_k - 1}{t_{k+1}}\big(\beta^{(k)} - \beta^{(k-1)}\big). \]

Input: X, y, λ, gradient ∇L, step s = 1/L_∇, init β^(0)
β^(-1) = β^(0);  y^(1) = β^(0);  t_1 = 1
for k = 1, 2, ... until convergence:
    v        = y^(k) - s * ∇L(y^(k))            # gradient step at extrapolated point
    β^(k)    = Soft(v, s * λ)                    # prox: soft-threshold componentwise
    t_{k+1}  = (1 + sqrt(1 + 4 t_k^2)) / 2       # momentum weight
    y^(k+1)  = β^(k) + ((t_k - 1)/t_{k+1}) (β^(k) - β^(k-1))   # extrapolation
return β̂

The momentum term $\tfrac{t_k-1}{t_{k+1}}\to 1$ as $k\to\infty$, adding a vanishing fraction of the previous step's direction — this is what lifts the rate from $O(1/k)$ to $O(1/k^2)$.
Backtracking variant. When $L_\nabla$ is unknown, search $s_k\leftarrow\eta s_k$ until the majorization inequality holds at $y^{(k)}$, and use $s_k$ in both the prox level and the $t$-update; convergence guarantees are preserved.
Stopping. Relative change in objective or iterate below tolerance.

Hyperparameters & configuration¶

Knob	Default	Notes
$\lambda$	fixed / path	regularization strength; path via warm starts
step $s$	$1/L_\nabla$	constant ($L_\nabla$ = Lipschitz const of $\nabla\mathcal L$) or backtracking
backtracking $\eta$	$0.5$	step-shrink factor when $L_\nabla$ unknown
momentum	Nesterov $t_k$	fixed recursion above; no tuning needed
tol	$10^{-6}$	stopping on objective / iterate change
max iter	$10^2$–$10^3$	far fewer than ISTA thanks to $O(1/k^2)$ rate

Mapping to framework¶

Input: $X, y$, link; regularization $\lambda$; step rule.
Output: $\widehat\beta(\lambda)$ — a single point.
Links: identity, logit, log (any convex GLM loss with Lipschitz gradient).
Preprocessing: standardize $X$; center $y$ (Gaussian) or fit an unpenalized intercept (GLM).

Complexity¶

Per iteration: one gradient ($O(np)$ dense) + one prox ($O(p)$) + $O(p)$ extrapolation. Same per-step cost as ISTA. Memory $O(np)$ (plus $O(p)$ to store $\beta^{(k-1)}$).
Iteration count: $O(1/\sqrt{\epsilon})$ for an $\epsilon$-accurate objective — quadratically fewer iterations than ISTA's $O(1/\epsilon)$.

Statistical guarantees¶

Optimization (convex $\mathcal L$). With $s=1/L_\nabla$, $F(\beta^{(k)})-F(\widehat\beta)\le \dfrac{2L_\nabla\lVert\beta^{(0)}-\widehat\beta\rVert_2^2}{(k+1)^2}=O(1/k^2)$, where $F=\mathcal L+\lambda P$ — the optimal rate for first-order methods on this class Beck and Teboulle, 2009.
FISTA is an optimization method: statistical properties of the solution $\widehat\beta(\lambda)$ are those of the lasso/penalized M-estimator it computes (see Lasso-CD).
Builds on the iterative thresholding iteration of Daubechies, Defrise & De Mol (2004) Daubechies et al., 2004.

ISTA (Iterative Shrinkage-Thresholding) — the unaccelerated $O(1/k)$ base method.
Lasso via Coordinate Descent — alternative solver for the same objective.
Accelerated proximal gradient with general $P$ — group lasso, nuclear norm, etc.

Knob	Default	Notes
\(\lambda\)	fixed / path	regularization strength; path via warm starts
step \(s\)	\(1/L_\nabla\)	constant (\(L_\nabla\) = Lipschitz const of \(\nabla\mathcal L\)) or backtracking
backtracking \(\eta\)	\(0.5\)	step-shrink factor when \(L_\nabla\) unknown
momentum	Nesterov \(t_k\)	fixed recursion above; no tuning needed
tol	\(10^{-6}\)	stopping on objective / iterate change
max iter	\(10^2\)–\(10^3\)	far fewer than ISTA thanks to \(O(1/k^2)\) rate