Lasso via Coordinate Descent¶
At a glance
Family: penalized-batch · Regime: high-dim / low-dim · Penalty: lasso · Output: path over \(\lambda\) · Links: identity, logit, log · Status: reviewed · Refs: tibshirani1996regression, fhht2007
Setting & assumptions¶
- Any GLM in the exponential family; Gaussian/identity is the canonical case, logistic/Poisson handled via the IRLS outer loop below.
- High- or low-dimensional; sparsity \(\lVert\beta^\star\rVert_0=s\ll p\) is the motivating regime.
- Columns of \(X\) standardized to mean \(0\), unit variance; \(y\) centered (Gaussian). Intercept unpenalized.
Estimator / objective¶
The \(\ell_1\)-penalized template:
For a general GLM, \(\mathcal L\) is the mean negative log-likelihood (logistic, Poisson, …).
Algorithm¶
Gaussian — cyclic coordinate descent. With standardized columns (\(\tfrac1n\lVert X_{\cdot j}\rVert_2^2=1\)), each coordinate has a closed-form soft-threshold update. Let the partial residual be \(r^{(j)} = y - \sum_{k\ne j} X_{\cdot k}\beta_k\). Then
Input: X (standardized), y (centered), λ-grid λ_max>...>λ_min
Warm starts along the grid (pathwise):
for λ in grid: # decreasing
repeat until convergence:
for j = 1..p:
r = y - X β + X[:,j] β_j # partial residual
β_j = Soft( (1/n) X[:,j]ᵀ r , λ ) # soft-threshold
record β(λ)
Return path {β(λ)}
- \(\lambda_{\max}=\tfrac1n\lVert X^\top y\rVert_\infty\) (smallest \(\lambda\) with \(\hat\beta=0\)); grid is log-spaced down to \(\lambda_{\min}=\epsilon\,\lambda_{\max}\).
- Active-set / strong rules restrict cycling to likely-nonzero coordinates for speed.
General GLM — penalized IRLS (outer) + coordinate descent (inner). Repeat: form the quadratic approximation of \(\mathcal L\) at the current \(\beta\) (working response \(z_i=\eta_i+(y_i-\mu_i)g'(\mu_i)\), weights \(w_i\)), then run weighted coordinate descent on the penalized weighted least squares problem.
Hyperparameters & configuration¶
| Knob | Default | Notes |
|---|---|---|
| \(\lambda\) | path | selected by CV (lambda.min / lambda.1se), AIC/BIC, or fixed |
| grid length | 100 | log-spaced \(\lambda_{\max}\to\epsilon\lambda_{\max}\), \(\epsilon=10^{-3}\) (or \(10^{-2}\) if \(p>n\)) |
| standardize | true | columns to unit variance; coefficients returned on original scale |
| intercept | true, unpenalized | |
| tol | \(10^{-7}\) | convergence on max coordinate change |
| family/link | gaussian/identity | also binomial/logit, poisson/log via IRLS |
Mapping to framework¶
- Input: \(X, y\), link; regularization \(\lambda\) (or request full path).
- Output: \(\hat\beta(\lambda)\) — a single point or the whole path.
- Links: identity (LS inner loop), logit, log (IRLS outer loop).
- Preprocessing: standardize \(X\); center \(y\) (Gaussian) or fit unpenalized intercept (GLM).
Complexity¶
- Per full cycle: \(O(np)\) (Gaussian, dense), or \(O(n\,\lvert\text{active set}\rvert)\) with active-set tricks.
- Whole path of \(L\) values with warm starts: typically near \(O(npL)\) in practice.
- Memory \(O(np)\) (or \(O(\text{nnz})\) for sparse \(X\)).
Statistical guarantees¶
- Estimation: under a restricted-eigenvalue / compatibility condition and \(\lambda\asymp\sigma\sqrt{\log p / n}\), \(\lVert\hat\beta-\beta^\star\rVert_1=O_P(s\sqrt{\log p/n})\).
- Selection: support recovery under the irrepresentable/beta-min conditions
(
zhao2006model). - Coordinate descent converges to a global optimum (convex objective, separable penalty).
Variants & related¶
- Elastic Net · Adaptive Lasso · Group Lasso · Fused Lasso
- LARS — exact path; FISTA/ISTA — proximal-gradient solvers.
- Debiased Lasso — inference built on this estimator.
References¶
- Tibshirani (1996), Regression shrinkage and selection via the lasso (
tibshirani1996regression). - Friedman, Hastie, Höfling & Tibshirani (2007), Pathwise coordinate optimization (
fhht2007). - Zhao & Yu (2006), On model selection consistency of Lasso (
zhao2006model).