Elastic Net

At a glance

Family: penalized-batch · Regime: high-dim / low-dim · Penalty: elastic-net · Output: path over \(\lambda\) · Links: identity, logit, log · Status: draft · Refs: Zou and Hastie, 2005 · Friedman et al., 2007

Setting & assumptions

Any GLM in the exponential family; Gaussian/identity is the reference case, while logistic and Poisson are handled via the IRLS outer loop below.
High- or low-dimensional. The method targets the regime \(p \gg n\) with correlated predictors, where the pure lasso is unstable: it selects at most \(n\) variables and tends to pick one member of a correlated group arbitrarily.
Columns of \(X\) standardized to mean \(0\), unit variance; \(y\) centered (Gaussian). Intercept unpenalized.
Sparsity \(\lVert\beta^\star\rVert_0=s\) is assumed in the high-dimensional regime, but exact sparsity is not required: the \(\ell_2\) part keeps correlated variables in the model together.

Estimator / objective

The elastic net mixes the lasso (\(\ell_1\)) and ridge (\(\ell_2\)) penalties through a mixing parameter \(\alpha\in[0,1]\):

\[ \widehat\beta(\lambda,\alpha) \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \frac{1}{2n}\lVert y-X\beta\rVert_2^2 \;+\; \lambda\Big( \alpha\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\lVert\beta\rVert_2^2 \Big). \]

For a general GLM, replace the Gaussian loss by the mean negative log-likelihood \(\mathcal L(\beta)\). The limiting cases are \(\alpha=1\) (pure lasso) and \(\alpha=0\) (pure ridge). For \(\lambda>0\) and \(\alpha<1\) the penalty, and hence the objective, is strictly convex, so the solution is unique even when columns of \(X\) are collinear.
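As a concrete reading of the Gaussian objective, the following NumPy sketch evaluates it for a given \(\beta\). The function name `enet_objective` and the toy data are illustrative assumptions, not part of any library.

```python
import numpy as np

def enet_objective(X, y, beta, lam, alpha):
    """Gaussian elastic-net objective: (1/2n)||y - X beta||^2 + lam * penalty."""
    n = X.shape[0]
    resid = y - X @ beta
    loss = resid @ resid / (2.0 * n)
    penalty = alpha * np.abs(beta).sum() + 0.5 * (1.0 - alpha) * (beta @ beta)
    return loss + lam * penalty

# Toy data (hypothetical), chosen so that X @ beta reproduces y exactly
# and only the penalty contributes to the objective.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-2.0, -2.0]])
y = np.array([1.0, 2.0, 3.0, -6.0])
beta = np.array([1.0, 2.0])

lasso_value = enet_objective(X, y, beta, lam=0.1, alpha=1.0)  # 0.1 * ||beta||_1
ridge_value = enet_objective(X, y, beta, lam=0.1, alpha=0.0)  # 0.1 * ||beta||_2^2 / 2
mixed_value = enet_objective(X, y, beta, lam=0.1, alpha=0.5)  # halfway mix of the two
```

With \(\lVert\beta\rVert_1=3\) and \(\lVert\beta\rVert_2^2=5\), the three values are the penalty terms alone, which makes the role of \(\alpha\) easy to check by hand.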

Algorithm

Gaussian — cyclic coordinate descent. With standardized columns (\(\lVert X_{\cdot j}\rVert_2^2/n=1\)), each coordinate has a closed form combining soft-thresholding (from the \(\ell_1\) term) with a ridge shrinkage denominator (from the \(\ell_2\) term). Let the partial residual be \(r^{(j)} = y - \sum_{k\ne j} X_{\cdot k}\beta_k\). Then

\[ \beta_j \;\leftarrow\; \frac{\mathcal S_{\lambda\alpha}\!\big(\tfrac1n X_{\cdot j}^\top r^{(j)}\big)} {1 + \lambda(1-\alpha)}, \qquad \mathcal S_{t}(z)=\operatorname{sign}(z)\,(|z|-t)_+ . \]

The numerator soft-thresholds at level \(\lambda\alpha\); the denominator \(1+\lambda(1-\alpha)\) is the proximal shrinkage induced by the ridge term.
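The single-coordinate update can be checked by hand with a small sketch. The helper names `soft_threshold` and `enet_coordinate_update` and the numeric inputs are hypothetical; the update assumes a standardized column, as above.

```python
def soft_threshold(z, t):
    """S_t(z) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def enet_coordinate_update(z, lam, alpha):
    """One coordinate update, given z = X_j^T r^(j) / n for a standardized column."""
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))

# Hypothetical inputs: z = 0.8, lam = 0.5, alpha = 0.5.
# The threshold is 0.25 and the ridge denominator is 1.25,
# so the update is (0.8 - 0.25) / 1.25 = 0.44.
beta_j = enet_coordinate_update(0.8, lam=0.5, alpha=0.5)

# A weak partial correlation is set exactly to zero, since |0.2| <= 0.25.
beta_zero = enet_coordinate_update(0.2, lam=0.5, alpha=0.5)
```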

Input: X (standardized), y (centered), λ-grid λ_max>...>λ_min, mixing α
Initialize β = 0, r = y                 # r is the full residual y - X β
Warm starts along the grid (pathwise):
for λ in grid:                          # decreasing; β and r carry over
    repeat until convergence:
        for j = 1..p:
            z = (1/n) X[:,j]ᵀ r + β_j              # = (1/n) X[:,j]ᵀ r^(j), as ‖X[:,j]‖²/n = 1
            b = Soft(z, λ·α) / (1 + λ·(1-α))       # soft-threshold then ridge shrink
            r = r - X[:,j] (b - β_j)               # O(n) residual update
            β_j = b
    record β(λ)
Return path {β(λ)}
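A minimal NumPy sketch of the pathwise loop follows. It assumes standardized columns and a centered response, keeps a running residual so each coordinate update costs \(O(n)\), and omits active-set screening. The names `enet_cd` and `enet_path` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def enet_cd(X, y, lam, alpha, beta0=None, tol=1e-7, max_iter=1000):
    """Naive elastic net by cyclic coordinate descent.

    Assumes columns of X satisfy ||X_j||^2 / n = 1 and y is centered.
    """
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.astype(float).copy()
    r = y - X @ beta                       # full residual, kept up to date
    denom = 1.0 + lam * (1.0 - alpha)
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            z = X[:, j] @ r / n + beta[j]  # equals X_j^T r^(j) / n under standardization
            new = soft_threshold(z, lam * alpha) / denom
            delta = new - beta[j]
            if delta != 0.0:
                r -= X[:, j] * delta       # O(n) residual update
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:               # converged on max coordinate change
            break
    return beta

def enet_path(X, y, lambdas, alpha):
    """Solve along a decreasing grid, warm-starting each fit from the previous one."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        beta = enet_cd(X, y, lam, alpha, beta0=beta)
        path.append(beta.copy())
    return np.array(path)

# Synthetic data (hypothetical) to exercise the sketch.
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0 and ||X_j||^2 / n = 1
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + 0.1 * rng.standard_normal(n)
y = y - y.mean()

alpha = 0.5
lam_max = np.abs(X.T @ y).max() / (n * alpha)
path = enet_path(X, y, lam_max * np.array([1.0, 0.5, 0.1, 0.01]), alpha)
```

Each row of `path` is the coefficient vector at one grid value; the first row, at \(\lambda_{\max}\), is zero up to rounding.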

\(\lambda_{\max}=\tfrac1{n\alpha}\lVert X^\top y\rVert_\infty\) (smallest \(\lambda\) giving \(\widehat\beta=0\) for given \(\alpha>0\)); grid log-spaced down to \(\lambda_{\min}=\epsilon\,\lambda_{\max}\).
Active-set / strong rules restrict cycling to likely-nonzero coordinates.
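The grid construction can be sketched as below, assuming \(\alpha>0\) and the default \(\epsilon=10^{-3}\). The function name `lambda_grid` and the random data are hypothetical.

```python
import numpy as np

def lambda_grid(X, y, alpha, n_lambda=100, eps=1e-3):
    """Log-spaced grid from lambda_max down to eps * lambda_max (requires alpha > 0)."""
    if alpha <= 0:
        raise ValueError("lambda_max is undefined for alpha = 0")
    n = X.shape[0]
    lam_max = np.abs(X.T @ y).max() / (n * alpha)
    return np.geomspace(lam_max, eps * lam_max, num=n_lambda)

# Random standardized data (hypothetical).
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = rng.standard_normal(40)
y = y - y.mean()

grid = lambda_grid(X, y, alpha=0.5)

# At lambda_max, every coordinate update started from beta = 0 is thresholded to zero.
z = X.T @ y / 40
first_pass = np.sign(z) * np.maximum(np.abs(z) - grid[0] * 0.5, 0.0)
```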

General GLM — penalized IRLS (outer) + coordinate descent (inner). Form the quadratic approximation of \(\mathcal L\) at the current \(\beta\) (working response \(z_i=\eta_i+(y_i-\mu_i)g'(\mu_i)\), weights \(w_i\)), then run the weighted version of the update above on the penalized weighted least squares problem.
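For the logit link, \(g'(\mu)=1/(\mu(1-\mu))\) and the IRLS weights are \(w_i=\mu_i(1-\mu_i)\). The sketch below is one way to combine the outer and inner loops under those formulas; it assumes standardized columns and \(y\in\{0,1\}\), floors the weights for numerical stability (an added safeguard, not something stated above), and uses the hypothetical name `logistic_enet_irls`. Step-size control and active-set screening are left out.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def logistic_enet_irls(X, y, lam, alpha, n_outer=25, n_inner=100, tol=1e-7):
    """Elastic-net logistic regression: IRLS outer loop, weighted CD inner loop.

    The intercept b0 is unpenalized.
    """
    n, p = X.shape
    b0, beta = 0.0, np.zeros(p)
    for _ in range(n_outer):
        beta_old, b0_old = beta.copy(), b0
        eta = b0 + X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(mu * (1.0 - mu), 1e-5, None)  # IRLS weights, floored
        z = eta + (y - mu) / w                    # working response
        r = z - eta                               # working residual
        for _ in range(n_inner):
            d0 = (w @ r) / w.sum()                # intercept: weighted mean of residual
            b0 += d0
            r -= d0
            max_change = abs(d0)
            for j in range(p):
                xj = X[:, j]
                wxx = (w @ xj**2) / n             # weighted column norm
                num = (w * xj) @ r / n + wxx * beta[j]
                new = soft_threshold(num, lam * alpha) / (wxx + lam * (1.0 - alpha))
                delta = new - beta[j]
                if delta != 0.0:
                    r -= xj * delta
                    beta[j] = new
                    max_change = max(max_change, abs(delta))
            if max_change < tol:
                break
        if max(np.abs(beta - beta_old).max(), abs(b0 - b0_old)) < tol:
            break
    return b0, beta

# Synthetic logistic data (hypothetical).
rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([1.5, 0.0, -1.0, 0.0]))))
y = (rng.random(n) < prob).astype(float)

b0, beta = logistic_enet_irls(X, y, lam=0.05, alpha=0.5)
```

Compared with the Gaussian update, the only changes are the weighted inner products and the weighted column norm that replaces the \(1\) in the denominator.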

Naive vs corrected (rescaled) elastic net. The raw coordinate-descent solution is the naive elastic net, which shrinks twice (first by the \(\ell_1\) soft-threshold, then by the \(\ell_2\) denominator) and can over-shrink. Zou and Hastie, 2005 propose the corrected elastic net, which rescales the naive estimate:

\[ \widehat\beta^{\text{enet}} = (1+\lambda(1-\alpha))\,\widehat\beta^{\text{naive}}, \]

which undoes the ridge contraction while keeping the grouping/variable-selection behaviour. glmnet-style implementations drop this distinction and return the unrescaled (naive) estimate.
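The effect of the rescaling is easiest to see when \(X^\top X/n=I\), where the coordinates decouple and a single update is exact. The values of \(z_j=\tfrac1n X_{\cdot j}^\top y\) below are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

lam, alpha = 0.5, 0.5
z = np.array([1.25, -0.75, 0.1])   # hypothetical X_j^T y / n under an orthonormal design

# Naive elastic net: soft-threshold at lam*alpha, then divide by 1 + lam*(1 - alpha).
beta_naive = soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))

# Corrected elastic net: multiply the ridge factor back in.
beta_enet = (1.0 + lam * (1.0 - alpha)) * beta_naive

# In this orthonormal case the corrected estimate is plain soft-thresholding.
beta_soft = soft_threshold(z, lam * alpha)
```

Here `beta_naive` is `[0.8, -0.4, 0.0]` and `beta_enet` is `[1.0, -0.5, 0.0]`: the sparsity pattern is unchanged and only the ridge contraction is removed.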

Hyperparameters & configuration

Knob	Default	Notes
\(\lambda\)	path	selected by CV (`lambda.min` / `lambda.1se`), AIC/BIC, or fixed
\(\alpha\) (mixing)	0.5	\(1\) = lasso, \(0\) = ridge; often itself tuned on a small grid by CV
grid length	100	log-spaced \(\lambda_{\max}\to\epsilon\lambda_{\max}\), \(\epsilon=10^{-3}\) (\(10^{-2}\) if \(p>n\))
standardize	true	columns to unit variance; coefficients returned on original scale
intercept	true	unpenalized
correction	true	rescale naive → corrected estimate
tol	\(10^{-7}\)	convergence on max coordinate change
family/link	gaussian/identity	also binomial/logit, poisson/log via IRLS

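Selecting \((\lambda,\alpha)\) is typically done by 2-D cross-validation: a small \(\alpha\) grid \(\{0,0.1,\dots,1\}\) with a full \(\lambda\) path for each \(\alpha\), using the same folds throughout so that errors are comparable across \(\alpha\). At \(\alpha=0\) the \(\lambda_{\max}\) formula diverges, so the ridge endpoint needs a separately chosen starting \(\lambda\).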
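A compact sketch of the 2-D search for the Gaussian case, using only NumPy. It restricts \(\alpha\) to positive values (so that \(\lambda_{\max}\) is finite), reuses one fold assignment for every \(\alpha\), and, as a simplification, does not re-standardize within folds. The names `enet_cd` and `cv_enet`, the grid sizes, and the data are assumptions made for illustration.

```python
import numpy as np

def enet_cd(X, y, lam, alpha, beta, n_sweeps=200, tol=1e-7):
    """Cyclic coordinate descent for columns that need not have unit norm."""
    n, p = X.shape
    col_sq = (X ** 2).sum(axis=0) / n
    beta = beta.copy()
    r = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            z = X[:, j] @ r / n + col_sq[j] * beta[j]
            new = np.sign(z) * max(abs(z) - lam * alpha, 0.0) / (col_sq[j] + lam * (1 - alpha))
            delta = new - beta[j]
            r -= X[:, j] * delta
            beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def cv_enet(X, y, alphas, n_lambda=20, eps=1e-3, k=5, seed=0):
    """Return (cv_error, lam, alpha) for the pair with the lowest k-fold squared error."""
    n, p = X.shape
    folds = np.random.default_rng(seed).permutation(n) % k   # shared across all alphas
    best = (np.inf, None, None)
    for alpha in alphas:
        lam_max = np.abs(X.T @ y).max() / (n * alpha)
        lambdas = np.geomspace(lam_max, eps * lam_max, n_lambda)
        err = np.zeros(n_lambda)
        for f in range(k):
            tr, te = folds != f, folds == f
            beta = np.zeros(p)
            for i, lam in enumerate(lambdas):                # warm starts along the path
                beta = enet_cd(X[tr], y[tr], lam, alpha, beta)
                err[i] += ((y[te] - X[te] @ beta) ** 2).sum() / n
        i_best = int(err.argmin())
        if err[i_best] < best[0]:
            best = (err[i_best], lambdas[i_best], alpha)
    return best

# Synthetic data (hypothetical).
rng = np.random.default_rng(0)
n, p = 60, 6
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + rng.standard_normal(n)
y = y - y.mean()

alphas = [0.1, 0.5, 1.0]
cv_err, lam, alpha = cv_enet(X, y, alphas)
```

This picks the error-minimizing pair, analogous to `lambda.min`; a one-standard-error rule would additionally need the per-fold errors.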

Mapping to framework

Input: \(X, y\), link; regularization \(\lambda\) and mixing \(\alpha\) (or request the full path).
Output: \(\widehat\beta(\lambda,\alpha)\) — a single point or the whole path over \(\lambda\).
Links: identity (LS inner loop), logit, log (IRLS outer loop).
Preprocessing: standardize \(X\); center \(y\) (Gaussian) or fit an unpenalized intercept (GLM).

Complexity

Per full cycle: \(O(np)\) (Gaussian, dense), or \(O(n\,\lvert\text{active set}\rvert)\) with active-set tricks.
Whole path of \(L\) values with warm starts: typically near \(O(npL)\) in practice; multiply by the number of \(\alpha\) values when tuning the mixing parameter.
Memory \(O(np)\) (or \(O(\text{nnz})\) for sparse \(X\)).

Statistical guarantees

Grouping effect. For two standardized predictors with sample correlation \(\rho\) whose coefficients share a sign, Zou and Hastie, 2005 bound the coefficient difference: \(|\widehat\beta_j-\widehat\beta_k| \le C\sqrt{2(1-\rho)}\), where \(C\) scales inversely with the ridge weight \(\lambda(1-\alpha)\). Highly correlated predictors therefore receive nearly equal coefficients and tend to enter or leave the model together, whereas the lasso tends to keep only one of them.
Strict convexity for \(\alpha<1\) gives a unique solution even under collinearity; the model can select more than \(n\) variables (lasso cannot).
Estimation/selection consistency follows lasso-type analyses under restricted-eigenvalue / compatibility conditions with \(\lambda\asymp\sigma\sqrt{\log p/n}\).
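The grouping effect and the uniqueness claim can be illustrated in the extreme case \(\rho=1\), where two columns are identical. The sketch below is a small experiment under assumed synthetic data; `enet_cd` is a hypothetical helper implementing the coordinate update from the Algorithm section.

```python
import numpy as np

def enet_cd(X, y, lam, alpha, n_sweeps=2000, tol=1e-10):
    """Cyclic coordinate descent, assuming ||X_j||^2 / n = 1 and centered y."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            z = X[:, j] @ r / n + beta[j]
            new = np.sign(z) * max(abs(z) - lam * alpha, 0.0) / (1.0 + lam * (1.0 - alpha))
            r -= X[:, j] * (new - beta[j])
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta

# Synthetic data (hypothetical): columns 0 and 1 are identical, column 2 is noise.
rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)
x = (x - x.mean()) / x.std()
w = rng.standard_normal(n)
w = (w - w.mean()) / w.std()
X = np.column_stack([x, x, w])
y = 2.0 * x + 0.1 * rng.standard_normal(n)
y = y - y.mean()

beta_enet = enet_cd(X, y, lam=0.5, alpha=0.5)    # alpha < 1: weight is split equally
beta_lasso = enet_cd(X, y, lam=0.5, alpha=1.0)   # alpha = 1: update order decides the winner
```

With \(\alpha<1\) the two duplicated columns end up with the same coefficient, as strict convexity requires. With \(\alpha=1\) the minimizer is not unique, and this cyclic ordering puts essentially all the weight on whichever duplicate is updated first.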

Lasso (\(\alpha=1\)) · Ridge (\(\alpha=0\)) — the two endpoints.
Adaptive Lasso · Group Lasso · Fused Lasso — other structured penalties.
Adaptive elastic net — combine data-driven weights with the \(\ell_1+\ell_2\) mix.