MCP via Coordinate Descent¶
At a glance
Family: penalized-batch (nonconvex) · Regime: high-dim / low-dim · Penalty: mcp ·
Output: path over \(\lambda\) · Links: identity, logit, log · Status: draft ·
Refs: Zhang, 2010 · Loh and Wainwright, 2015
Setting & assumptions¶
- Any GLM in the exponential family; Gaussian/identity is the canonical case, logistic/Poisson handled through the IRLS-weighted inner solver below.
- High- or low-dimensional; sparsity \(\lVert\beta^\star\rVert_0=s\ll p\) is the motivating regime.
- Columns of \(X\) standardized so that \(\tfrac1n\lVert X_{\cdot j}\rVert_2^2=1\); \(y\) centered (Gaussian). Intercept unpenalized.
- The penalty is nonconvex but, paired with a strongly convex loss, the overall objective can remain convex (see condition below).
Estimator / objective¶
The Minimax Concave Penalty (Zhang 2010) is applied coordinatewise:
with MCP defined through its derivative, for \(\gamma>1\):
Integrating (\(p_\lambda(0)=0\)) gives
Like SCAD, MCP starts with the lasso slope \(\lambda\) at the origin and relaxes the penalization rate continuously to \(0\), becoming flat for \(t>\gamma\lambda\) (no bias on large coefficients). For a GLM, replace the squared-error loss with the mean negative log-likelihood \(\mathcal L(\beta)\).
Algorithm¶
Cyclic coordinate descent with firm thresholding. With standardized columns, let the partial residual be \(r^{(j)} = y - \sum_{k\ne j} X_{\cdot k}\beta_k\) and \(z_j = \tfrac1n X_{\cdot j}^\top r^{(j)}\). The MCP coordinate update is the firm-thresholding operator \(F(\cdot;\lambda,\gamma)\):
Input: X (standardized), y (centered), λ-grid λ_max>...>λ_min, γ (>1)
Warm starts along the grid (pathwise):
for λ in grid: # decreasing
repeat until convergence:
for j = 1..p:
r = y - X β + X[:,j] β_j # partial residual
z_j = (1/n) X[:,j]ᵀ r
if |z_j| <= γλ:
β_j = Soft(z_j, λ) / (1 - 1/γ) # firm threshold
else:
β_j = z_j
record β(λ)
return path {β(λ)}
- \(\lambda_{\max}=\tfrac1n\lVert X^\top y\rVert_\infty\) (smallest \(\lambda\) with \(\widehat\beta=0\)); grid log-spaced down to \(\lambda_{\min}=\epsilon\lambda_{\max}\), with warm starts.
- General GLM. Penalized IRLS (outer quadratic approximation) + weighted coordinate descent (inner) applying the firm-thresholding update on the working response, as in Lasso-CD.
Hyperparameters & configuration¶
| Knob | Default | Notes |
|---|---|---|
| \(\lambda\) | path | selected by CV or BIC |
| \(\gamma\) | \(3\) | concavity; \(\gamma>1\) required. \(\gamma\to\infty\) → lasso (soft), \(\gamma\to1^+\) → hard threshold |
| grid length | 100 | log-spaced \(\lambda_{\max}\to\epsilon\lambda_{\max}\) |
| standardize | true | columns to unit variance; coefficients returned on original scale |
| intercept | true, unpenalized | |
| tol | \(10^{-7}\) | convergence on max coordinate change |
| family/link | gaussian/identity | also binomial/logit, poisson/log via IRLS |
Mapping to framework¶
- Input: \(X, y\), link; regularization \(\lambda\), concavity \(\gamma\) (or request full path).
- Output: \(\widehat\beta(\lambda)\) — a point or the whole path.
- Links: identity (LS inner loop), logit, log (IRLS outer loop).
- Preprocessing: standardize \(X\); center \(y\) (Gaussian) or fit an unpenalized intercept (GLM).
Complexity¶
- Per full cycle: \(O(np)\) (Gaussian, dense), or \(O(n\,\lvert\text{active set}\rvert)\) with active-set tricks.
- Whole path of \(L\) values with warm starts: typically near \(O(npL)\) in practice.
- Memory \(O(np)\) (or \(O(\text{nnz})\) for sparse \(X\)).
Statistical guarantees¶
- Convexity condition. For Gaussian loss with standardized columns, the objective is convex whenever \(\gamma > 1/c_{\min}\), where \(c_{\min}\) is the minimum eigenvalue of $ X^\top X/n$; more generally each coordinate subproblem is convex when \(\gamma>1\). This makes MCP-CD better behaved than generic nonconvex problems.
- Near-unbiased selection (Zhang 2010). Under a sparse Riesz / restricted-eigenvalue
condition, MCP attains the oracle sparsity pattern and near-minimax estimation rates while
minimizing the maximum concavity for a given threshold gap
Zhang, 2010. - Stationary-point control. As a regularized M-estimator with nonconvex penalty, all
stationary points are within statistical precision of \(\beta^\star\) under restricted strong
convexity
Loh and Wainwright, 2015.
Variants & related¶
- SCAD via Local Linear Approximation — sibling folded-concave penalty.
- Lasso via Coordinate Descent — convex limit (\(\gamma\to\infty\)) / inner solver.
- Hard thresholding — the \(\gamma\to1^+\) limit of the firm-thresholding operator.