Elastic Net¶
At a glance
Family: penalized-batch · Regime: high-dim / low-dim · Penalty: elastic-net · Output: path over \(\lambda\) · Links: identity, logit, log · Status: draft · Refs: zou2005regularization, fhht2007
Setting & assumptions¶
- Any GLM in the exponential family; Gaussian/identity is canonical, logistic/Poisson handled via the IRLS outer loop below.
- High- or low-dimensional. The method targets the \(p \gg n\) regime with correlated predictors, where the pure lasso is unstable: it selects at most \(n\) variables and tends to pick one member of a correlated group arbitrarily.
- Columns of \(X\) standardized to mean \(0\), unit variance; \(y\) centered (Gaussian). Intercept unpenalized.
- Sparsity \(\lVert\beta^\star\rVert_0=s\) is assumed in the high-dimensional regime, but exact sparsity is not required: the \(\ell_2\) part tends to keep groups of correlated variables in the model together.
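The preprocessing assumptions above can be sketched as follows. This is a minimal illustration, not the framework's own code: the helper names `standardize_columns` and `to_original_scale` are hypothetical, and the scaling uses the population (\(1/n\)) variance so that \(\tfrac1n\lVert X_{\cdot j}\rVert_2^2=1\) holds exactly.

```python
import math

def standardize_columns(X):
    """Center each column and scale it so that (1/n) * sum(x_ij^2) = 1."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    # Population (1/n) standard deviation; a constant column would give 0 here
    # and must be dropped beforehand (assumption of this sketch).
    scales = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n)
              for j in range(p)]
    Z = [[(row[j] - means[j]) / scales[j] for j in range(p)] for row in X]
    return Z, means, scales

def to_original_scale(beta_std, means, scales, y_mean):
    """Map coefficients fitted on standardized X and centered y to the raw scale."""
    beta = [b / s for b, s in zip(beta_std, scales)]
    intercept = y_mean - sum(m * b for m, b in zip(means, beta))
    return intercept, beta

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
y = [1.0, 2.0, 6.0]
Z, means, scales = standardize_columns(X)
y_mean = sum(y) / len(y)
y_centered = [v - y_mean for v in y]
```

Because the intercept is recovered from the column means and the mean of \(y\), it never enters the penalty, which matches the "intercept unpenalized" convention.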
Estimator / objective¶
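The elastic net mixes the lasso (\(\ell_1\)) and ridge (\(\ell_2\)) penalties through a mixing parameter \(\alpha\in[0,1]\):
\[
\hat\beta(\lambda,\alpha) = \arg\min_{\beta}\; \frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\Bigl[\alpha\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2\Bigr].
\]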
For a general GLM, replace the Gaussian loss by the mean negative log-likelihood \(\mathcal L(\beta)\). The limiting cases are \(\alpha=1\) (pure lasso) and \(\alpha=0\) (pure ridge). The combined penalty is strictly convex for \(\alpha<1\), so the solution is unique even when columns of \(X\) are collinear.
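As a concrete reading of the Gaussian objective, the sketch below evaluates it for a given \(\beta\). It assumes the scaling that is consistent with the coordinate update in this section: loss \(\tfrac{1}{2n}\lVert y-X\beta\rVert_2^2\) and penalty \(\lambda[\alpha\lVert\beta\rVert_1+\tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2]\). The function name is hypothetical.

```python
def elastic_net_objective(X, y, beta, lam, alpha):
    """(1/2n)||y - X beta||^2 + lam * (alpha*||beta||_1 + (1-alpha)/2 * ||beta||_2^2)."""
    n = len(y)
    resid = [yi - sum(xij * bj for xij, bj in zip(row, beta))
             for row, yi in zip(X, y)]
    loss = sum(r * r for r in resid) / (2 * n)
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return loss + lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)

X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
beta = [1.0, 1.0]
lasso_value = elastic_net_objective(X, y, beta, lam=1.0, alpha=1.0)  # l1 penalty only
ridge_value = elastic_net_objective(X, y, beta, lam=1.0, alpha=0.0)  # l2 penalty only
mixed_value = elastic_net_objective(X, y, beta, lam=1.0, alpha=0.5)
```

With equal \(\lambda\), the mixed value sits between the two endpoints here because the penalty is linear in \(\alpha\).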
Algorithm¶
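Gaussian: cyclic coordinate descent. With standardized columns (\(\tfrac1n\lVert X_{\cdot j}\rVert_2^2=1\)), each coordinate has a closed form combining soft-thresholding (from the \(\ell_1\) term) with a ridge shrinkage denominator (from the \(\ell_2\) term). Let the partial residual be \(r^{(j)} = y - \sum_{k\ne j} X_{\cdot k}\beta_k\). Then
\[
\beta_j \leftarrow \frac{S\bigl(\tfrac1n X_{\cdot j}^\top r^{(j)},\ \lambda\alpha\bigr)}{1+\lambda(1-\alpha)}, \qquad S(z,t)=\operatorname{sign}(z)\,(\lvert z\rvert - t)_+ .
\]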
The numerator soft-thresholds at level \(\lambda\alpha\); the denominator \(1+\lambda(1-\alpha)\) is the proximal shrinkage induced by the ridge term.
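A single coordinate update can be written out directly. The sketch below takes the scalar \(z=\tfrac1n X_{\cdot j}^\top r^{(j)}\) as given and assumes standardized columns; the function names are illustrative only.

```python
def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def coordinate_update(z, lam, alpha):
    """Soft-threshold at lam*alpha, then divide by the ridge factor 1 + lam*(1-alpha)."""
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))

# z = 0.8, lam = 0.4, alpha = 0.5: threshold 0.2, ridge factor 1.2.
b_large = coordinate_update(0.8, lam=0.4, alpha=0.5)   # (0.8 - 0.2) / 1.2, about 0.5
b_small = coordinate_update(0.1, lam=0.4, alpha=0.5)   # |z| below the threshold: 0.0
```

Only the threshold decides whether a coefficient is zero; the ridge factor changes its size but never its sparsity pattern.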
Input: X (standardized), y (centered), λ-grid λ_max > ... > λ_min, mixing α
Warm starts along the grid (pathwise):
    for λ in grid:                                  # decreasing
        repeat until convergence:
            for j = 1..p:
                r   = y - X β + X[:,j] β_j          # partial residual r^(j)
                z   = (1/n) X[:,j]ᵀ r
                β_j = Soft(z, λ·α) / (1 + λ·(1-α))  # soft-threshold, then ridge shrink
        record β(λ)
Return path {β(λ)}
- \(\lambda_{\max}=\tfrac1{n\alpha}\lVert X^\top y\rVert_\infty\) (smallest \(\lambda\) giving \(\hat\beta=0\) for given \(\alpha>0\)); grid log-spaced down to \(\lambda_{\min}=\epsilon\,\lambda_{\max}\).
- Active-set / strong rules restrict cycling to likely-nonzero coordinates.
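The pathwise procedure and the \(\lambda\) grid can be sketched in plain Python as below. This is a minimal sketch under the stated assumptions (standardized columns, centered \(y\), \(\alpha>0\) for the grid), without active-set or strong-rule screening; `lambda_grid` and `elastic_net_path` are hypothetical names. The residual is kept up to date after each coordinate change, so a coordinate costs \(O(n)\).

```python
def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lambda_grid(X, y, alpha, n_lambda=100, eps=1e-3):
    """Log-spaced grid from lambda_max = max_j |X_j^T y| / (n*alpha) down to eps*lambda_max."""
    n, p = len(X), len(X[0])
    lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n)))
                  for j in range(p)) / (n * alpha)
    return [lam_max * eps ** (k / (n_lambda - 1)) for k in range(n_lambda)]

def elastic_net_path(X, y, alpha, lambdas, tol=1e-7, max_iter=1000):
    """Cyclic coordinate descent with warm starts; assumes (1/n)||X_j||^2 = 1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)                      # y - X beta for beta = 0
    path = []
    for lam in lambdas:                  # decreasing; beta carries over (warm start)
        for _ in range(max_iter):
            max_change = 0.0
            for j in range(p):
                # (1/n) X_j^T r^(j), using r^(j) = resid + X_j * beta_j
                z = sum(X[i][j] * resid[i] for i in range(n)) / n + beta[j]
                new = soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
                delta = new - beta[j]
                if delta != 0.0:
                    for i in range(n):
                        resid[i] -= X[i][j] * delta
                    beta[j] = new
                    max_change = max(max_change, abs(delta))
            if max_change < tol:         # convergence on max coordinate change
                break
        path.append(list(beta))
    return path

# Orthogonal, standardized toy design: only the first column carries signal.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
grid = lambda_grid(X, y, alpha=0.5, n_lambda=5)
path = elastic_net_path(X, y, alpha=0.5, lambdas=grid)
```

On this toy design the first grid point is \(\lambda_{\max}=4\), where all coefficients are zero, and the first coefficient grows as \(\lambda\) decreases.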
General GLM: penalized IRLS (outer) + coordinate descent (inner). Form the quadratic approximation of \(\mathcal L\) at the current \(\beta\), with working response \(z_i=\eta_i+(y_i-\mu_i)g'(\mu_i)\) and weights \(w_i = 1/\bigl(V(\mu_i)\,g'(\mu_i)^2\bigr)\), where \(V\) is the variance function. Then run the weighted version of the update above on the resulting penalized weighted least-squares problem, and repeat until \(\beta\) stabilizes.
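For the logit link the outer/inner structure might look like the sketch below. It is an illustration under assumptions, not the framework's implementation: for logistic regression \(w_i=\mu_i(1-\mu_i)\) and \(g'(\mu_i)=1/w_i\); the weighted coordinate update divides by \(\tfrac1n\sum_i w_i x_{ij}^2+\lambda(1-\alpha)\); the weight floor `1e-5` and all function names are choices made for this example.

```python
import math

def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def working_response(eta, y):
    """IRLS quantities for the logit link: z_i = eta_i + (y_i - mu_i)/w_i, w_i = mu_i(1 - mu_i)."""
    z, w = [], []
    for eta_i, y_i in zip(eta, y):
        mu = 1.0 / (1.0 + math.exp(-eta_i))
        w_i = max(mu * (1.0 - mu), 1e-5)      # floor guards against vanishing weights
        w.append(w_i)
        z.append(eta_i + (y_i - mu) / w_i)
    return z, w

def logistic_elastic_net(X, y, lam, alpha, n_outer=25, n_inner=200, tol=1e-7):
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_outer):                  # outer loop: re-linearize the likelihood
        fit = [sum(x * b for x, b in zip(row, beta)) for row in X]   # X beta, no intercept
        z, w = working_response([b0 + f for f in fit], y)
        previous = [b0] + beta
        for _ in range(n_inner):              # inner loop: weighted coordinate descent
            # Unpenalized intercept: weighted mean of z - X beta.
            new_b0 = sum(wi * (zi - fi) for wi, zi, fi in zip(w, z, fit)) / sum(w)
            max_change, b0 = abs(new_b0 - b0), new_b0
            for j in range(p):
                num = sum(w[i] * X[i][j] * (z[i] - b0 - fit[i] + X[i][j] * beta[j])
                          for i in range(n)) / n
                den = sum(w[i] * X[i][j] ** 2 for i in range(n)) / n + lam * (1.0 - alpha)
                new = soft_threshold(num, lam * alpha) / den
                delta = new - beta[j]
                if delta != 0.0:
                    fit = [f + row[j] * delta for f, row in zip(fit, X)]
                    beta[j] = new
                    max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        if max(abs(a - b) for a, b in zip(previous, [b0] + beta)) < tol:
            break
    return b0, beta

X = [[-1.0], [-1.0], [-1.0], [1.0], [1.0], [1.0]]
y = [0, 0, 1, 0, 1, 1]
b0_heavy, beta_heavy = logistic_elastic_net(X, y, lam=1.0, alpha=0.5)   # penalty dominates
b0_light, beta_light = logistic_elastic_net(X, y, lam=0.01, alpha=0.5)
```

In this toy data the unpenalized slope would be \(\log 2\); the lightly penalized fit lands slightly below it, and the heavily penalized fit keeps only the intercept.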
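Naive vs corrected (rescaled) elastic net. The raw coordinate solution is the naive elastic net, which shrinks twice (\(\ell_1\) soft-thresholding, then \(\ell_2\) contraction) and can over-shrink. Zou & Hastie (2005) propose the corrected elastic net, which rescales the naive estimate; in the parameterization used here this reads
\[
\hat\beta^{\text{corrected}} = \bigl(1+\lambda(1-\alpha)\bigr)\,\hat\beta^{\text{naive}},
\]
which undoes the ridge contraction while keeping the grouping/variable-selection behaviour.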
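Note that glmnet-style implementations drop the naive/corrected distinction and return the unrescaled coordinate solution; the rescaling, when wanted, is applied as a post-processing step.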
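As a small worked example of the rescaling, the sketch below multiplies a naive estimate by \(1+\lambda(1-\alpha)\). That factor is an assumption of this example: it is the counterpart of Zou & Hastie's \((1+\lambda_2)\) under the standardized, \(\tfrac{1}{2n}\)-scaled parameterization of this page, and it equals the ridge denominator of the coordinate update. The function name is hypothetical.

```python
def corrected_elastic_net(beta_naive, lam, alpha):
    """Rescale the naive estimate by 1 + lam*(1 - alpha); zeros stay zero."""
    scale = 1.0 + lam * (1.0 - alpha)
    return [scale * b for b in beta_naive]

# With lam = 0.4 and alpha = 0.5 the factor is 1.2.
beta_naive = [0.5, 0.0, -0.25]
beta_corrected = corrected_elastic_net(beta_naive, lam=0.4, alpha=0.5)
# beta_corrected is approximately [0.6, 0.0, -0.3]; the support is unchanged.
```

The set of selected variables is identical before and after the correction; only the magnitudes of the nonzero coefficients change.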
Hyperparameters & configuration¶
| Knob | Default | Notes |
|---|---|---|
| \(\lambda\) | path | selected by CV (lambda.min / lambda.1se), AIC/BIC, or fixed |
| \(\alpha\) (mixing) | 0.5 | \(1\) = lasso, \(0\) = ridge; often itself tuned on a small grid by CV |
| grid length | 100 | log-spaced \(\lambda_{\max}\to\epsilon\lambda_{\max}\), \(\epsilon=10^{-3}\) (\(10^{-2}\) if \(p>n\)) |
| standardize | true | columns to unit variance; coefficients returned on original scale |
| intercept | true, unpenalized | |
| correction | true | rescale naive → corrected estimate |
| tol | \(10^{-7}\) | convergence on max coordinate change |
| family/link | gaussian/identity | also binomial/logit, poisson/log via IRLS |
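Selecting \((\lambda,\alpha)\) is typically a 2-D cross-validation: a small \(\alpha\) grid such as \(\{0,0.1,\dots,1\}\), with a full \(\lambda\) path fitted for each \(\alpha\). At \(\alpha=0\) the \(\lambda_{\max}\) formula above is undefined, so that endpoint needs an explicitly supplied \(\lambda\) grid.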
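The 2-D search can be sketched as a K-fold loop over \(\alpha\) with a warm-started \(\lambda\) path per fold. The sketch makes several simplifying assumptions that are not from this page: one shared, user-supplied \(\lambda\) grid for all \(\alpha\); deterministic interleaved folds; no intercept and no re-standardization inside folds (the update divides by each column's own mean square instead); selection by minimum mean squared error only. All names are hypothetical.

```python
def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_path(X, y, alpha, lambdas, tol=1e-7, max_iter=500):
    """Warm-started coordinate descent; returns one coefficient list per lambda."""
    n, p = len(X), len(X[0])
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta, resid, path = [0.0] * p, list(y), []
    for lam in lambdas:
        for _ in range(max_iter):
            max_change = 0.0
            for j in range(p):
                z = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
                new = soft_threshold(z, lam * alpha) / (col_ss[j] + lam * (1.0 - alpha))
                delta = new - beta[j]
                if delta != 0.0:
                    for i in range(n):
                        resid[i] -= X[i][j] * delta
                    beta[j] = new
                    max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        path.append(list(beta))
    return path

def cv_elastic_net(X, y, alphas, lambdas, k=5):
    """Return (cv_mse, alpha, lambda) minimizing K-fold mean squared error."""
    n = len(y)
    folds = [list(range(f, n, k)) for f in range(k)]   # interleaved, deterministic
    best = None
    for alpha in alphas:
        err = [0.0] * len(lambdas)
        for held in folds:
            held_set = set(held)
            train = [i for i in range(n) if i not in held_set]
            path = fit_path([X[i] for i in train], [y[i] for i in train], alpha, lambdas)
            for l, beta in enumerate(path):
                for i in held:
                    pred = sum(x * b for x, b in zip(X[i], beta))
                    err[l] += (y[i] - pred) ** 2 / n
        for l, e in enumerate(err):
            if best is None or e < best[0]:
                best = (e, alpha, lambdas[l])
    return best
```

A one-standard-error rule (lambda.1se) would additionally need the per-fold errors to estimate the standard error of each grid point; that step is omitted here.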
Mapping to framework¶
- Input: \(X, y\), link; regularization \(\lambda\) and mixing \(\alpha\) (or request the full path).
- Output: \(\hat\beta(\lambda,\alpha)\) — a single point or the whole path over \(\lambda\).
- Links: identity (LS inner loop), logit, log (IRLS outer loop).
- Preprocessing: standardize \(X\); center \(y\) (Gaussian) or fit an unpenalized intercept (GLM).
Complexity¶
- Per full cycle: \(O(np)\) (Gaussian, dense), or \(O(n\,\lvert\text{active set}\rvert)\) with active-set tricks.
- Whole path of \(L\) values with warm starts: typically near \(O(npL)\) in practice; multiply by the number of \(\alpha\) values when tuning the mixing parameter.
- Memory \(O(np)\) (or \(O(\text{nnz})\) for sparse \(X\)).
Statistical guarantees¶
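- Grouping effect. For two standardized predictors with sample correlation \(\rho\) whose coefficients are nonzero with the same sign, Zou & Hastie (2005, Theorem 1) bound the coefficient difference: \(|\hat\beta_j-\hat\beta_k| \le C\sqrt{2(1-\rho)}\), where \(C\) is inversely proportional to the ridge weight (\(C=\lVert y\rVert_1/\lambda_2\) in their parameterization). Highly correlated predictors therefore receive nearly equal coefficients and tend to enter or leave the model together, unlike the lasso, which may select one arbitrarily.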
- Strict convexity for \(\alpha<1\) gives a unique solution even under collinearity; the model can select more than \(n\) variables (lasso cannot).
- Estimation/selection consistency follows lasso-type analyses under restricted-eigenvalue / compatibility conditions with \(\lambda\asymp\sigma\sqrt{\log p/n}\).
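The grouping effect and the uniqueness claim can be illustrated on the extreme case \(\rho=1\), two identical columns. The sketch below is a toy demonstration with a hypothetical `fit` helper, assuming standardized columns; the two penalties are chosen so that the \(\ell_1\) threshold \(\lambda\alpha=0.5\) is the same in both fits.

```python
def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit(X, y, lam, alpha, tol=1e-10, max_iter=10000):
    """Cyclic coordinate descent at a single lambda, started from zero."""
    n, p = len(X), len(X[0])
    beta, resid = [0.0] * p, list(y)
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            z = sum(X[i][j] * resid[i] for i in range(n)) / n + beta[j]
            new = soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Two identical (perfectly correlated) standardized columns.
X = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]

enet = fit(X, y, lam=1.0, alpha=0.5)    # converges to roughly [0.6, 0.6]
lasso = fit(X, y, lam=0.5, alpha=1.0)   # stops at [1.5, 0.0]
```

The elastic-net fit splits the weight equally because its minimizer is unique. The lasso objective is flat along every split with \(\beta_1+\beta_2=1.5\) and both coefficients nonnegative, so the result \([1.5, 0]\) reflects only the update order.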
Variants & related¶
- Lasso (\(\alpha=1\)) · Ridge (\(\alpha=0\)) — the two endpoints.
- Adaptive Lasso · Group Lasso · Fused Lasso — other structured penalties.
- Adaptive elastic net — combine data-driven weights with the \(\ell_1+\ell_2\) mix.
References¶
- Zou & Hastie (2005), Regularization and variable selection via the elastic net (zou2005regularization): defines the penalty, grouping effect, naive vs corrected estimate.
- Friedman, Hastie, Höfling & Tibshirani (2007), Pathwise coordinate optimization (fhht2007): coordinate-descent path solver used here.
- Tibshirani (1996), Regression shrinkage and selection via the lasso (tibshirani1996regression): the \(\ell_1\) endpoint.