Adaptive Lasso¶
At a glance
Family: penalized-batch · Regime: low-dim / high-dim · Penalty: adaptive-lasso · Output: path over \(\lambda\) · Links: identity, logit, log · Status: draft · Refs: buhlmann2011statistics, tibshirani1996regression
Setting & assumptions¶
- Any GLM in the exponential family; Gaussian/identity is canonical, logistic/Poisson via the IRLS outer loop.
- Originally proposed for the low-dimensional regime where a \(\sqrt n\)-consistent initial estimator (e.g. OLS) exists; extends to high dimensions when \(\hat\beta_{\text{init}}\) is a ridge or marginal-regression estimate.
- Columns of \(X\) standardized to unit variance; \(y\) centered (Gaussian). Intercept unpenalized.
- Sparsity \(\lVert\beta^\star\rVert_0=s\) assumed; relies on an initial estimate whose nonzero coefficients are bounded away from \(0\) (beta-min).
Estimator / objective¶
The adaptive lasso replaces the uniform \(\ell_1\) penalty by a weighted \(\ell_1\) penalty whose weights are computed from a preliminary estimate \(\hat\beta_{\text{init}}\):
For a general GLM, replace the Gaussian loss by the mean negative log-likelihood \(\mathcal L(\beta)\). Large initial coefficients get small weights (penalized lightly, kept), while near-zero ones get large weights (penalized heavily, driven to exactly \(0\)). This data-driven asymmetry is what buys the oracle property that the plain lasso lacks.
Algorithm¶
The key observation is that the weighted problem reduces to an ordinary lasso by rescaling the columns. Define rescaled predictors \(\tilde X_{\cdot j} = X_{\cdot j}/w_j\) and coefficients \(\tilde\beta_j = w_j\beta_j\). Then
an ordinary lasso in \((\tilde X,\tilde\beta)\). Solve it by coordinate descent, then map back \(\hat\beta_j = \hat{\tilde\beta}_j / w_j\).
Input: X (standardized), y (centered), λ-grid, exponent γ, initial estimator type
1. Compute β_init (OLS if n>p; ridge or marginal regression if p≥n)
2. Weights: w_j = 1 / |β_init,j|^γ # w_j = ∞ where β_init,j = 0 → β_j forced to 0
3. Rescale: X̃[:,j] = X[:,j] / w_j
4. Solve ordinary lasso on (X̃, y) by coordinate descent over the λ-grid:
for λ in grid (decreasing), warm-started:
repeat until convergence:
for j with w_j < ∞:
r = y - X̃ β̃ + X̃[:,j] β̃_j
β̃_j = Soft( (1/n) X̃[:,j]ᵀ r , λ )
5. Map back: β_j(λ) = β̃_j(λ) / w_j
Return path {β(λ)}
Coordinates with \(\hat\beta_{\text{init},j}=0\) have \(w_j=\infty\) and are dropped a priori (\(\beta_j\equiv 0\)), giving an automatic screening step. For GLMs the inner solve is the penalized IRLS + coordinate-descent loop applied to \(\tilde X\).
Hyperparameters & configuration¶
| Knob | Default | Notes |
|---|---|---|
| \(\lambda\) | path | selected by CV (lambda.min / lambda.1se), AIC/BIC, or fixed |
| \(\gamma\) (weight exponent) | 1 | \(\gamma\in\{0.5,1,2\}\) typical; larger \(\gamma\) ⇒ stronger asymmetry; tuned with \(\lambda\) by CV |
| \(\hat\beta_{\text{init}}\) | OLS (\(n>p\)) | ridge or univariate/marginal regression when \(p\ge n\) |
| standardize | true | columns to unit variance |
| intercept | true, unpenalized | |
| tol | \(10^{-7}\) | convergence on max coordinate change |
| family/link | gaussian/identity | also binomial/logit, poisson/log via IRLS |
The pair \((\gamma,\lambda)\) (and the choice of initial estimator) are usually selected jointly by cross-validation.
Mapping to framework¶
- Input: \(X, y\), link; weight exponent \(\gamma\), initial estimator, regularization \(\lambda\).
- Output: \(\hat\beta(\lambda)\) — a point or the whole path, on the original scale.
- Links: identity (LS inner loop), logit, log (IRLS outer loop).
- Preprocessing: standardize \(X\); center \(y\) (Gaussian). Requires computing \(\hat\beta_{\text{init}}\) before the main solve.
Complexity¶
- Initial estimator: \(O(np^2+p^3)\) for OLS/ridge, or \(O(np)\) for marginal regression.
- Main solve: identical to lasso coordinate descent — per cycle \(O(np)\), full path with warm starts near \(O(npL)\); the \(w_j=\infty\) screening shrinks the working set.
- Memory \(O(np)\) (or \(O(\text{nnz})\) for sparse \(X\)).
Statistical guarantees¶
- Oracle property (Zou, 2006). With a \(\sqrt n\)-consistent \(\hat\beta_{\text{init}}\) and \(\lambda_n\) chosen so that \(\lambda_n/\sqrt n\to 0\) and \(\lambda_n n^{(\gamma-1)/2}\to\infty\), the adaptive lasso (i) selects the true support with probability \(\to 1\) and (ii) estimates the nonzero coefficients with the same asymptotic distribution as if the true support were known. The ordinary lasso does not enjoy this property in general.
- High-dimensional consistency holds under restricted-eigenvalue/compatibility conditions and a
suitable initial estimator; see Bühlmann & van de Geer (
buhlmann2011statistics).
Variants & related¶
- Lasso — the uniform-weight special case (\(w_j\equiv 1\)).
- Elastic Net — adaptive elastic net combines weights with an \(\ell_2\) term.
- SCAD / MCP — nonconvex penalties achieving the oracle property directly.
- Iterating the reweighting step yields connections to nonconvex / reweighted-\(\ell_1\) schemes.
References¶
- Zou (2006), The adaptive lasso and its oracle properties — original proposal and oracle
theory (canonical paper; not in
reference.bib). - Bühlmann & van de Geer (2011), Statistics for High-Dimensional Data (
buhlmann2011statistics) — adaptive lasso, weighting, and high-dimensional analysis. - Tibshirani (1996), Regression shrinkage and selection via the lasso (
tibshirani1996regression) — base \(\ell_1\) estimator. - Friedman, Hastie, Höfling & Tibshirani (2007), Pathwise coordinate optimization (
fhht2007) — coordinate-descent solver reused after rescaling.