MCP via Coordinate Descent¶
At a glance
Family: penalized-batch (nonconvex) · Regime: high-dim / low-dim · Penalty: mcp · Output: path over \(\lambda\) · Links: identity, logit, log · Status: draft · Refs: zhang2010nearly, loh2015regularized
Setting & assumptions¶
- Any GLM in the exponential family; Gaussian/identity is the canonical case, logistic/Poisson handled through the IRLS-weighted inner solver below.
- High- or low-dimensional; sparsity \(\lVert\beta^\star\rVert_0=s\ll p\) is the motivating regime.
- Columns of \(X\) standardized so that \(\tfrac1n\lVert X_{\cdot j}\rVert_2^2=1\); \(y\) centered (Gaussian). Intercept unpenalized.
- The penalty is nonconvex but, paired with a strongly convex loss, the overall objective can remain convex (see condition below).
Estimator / objective¶
The Minimax Concave Penalty (Zhang 2010) is applied coordinatewise:
with MCP defined through its derivative, for \(\gamma>1\):
Integrating (\(p_\lambda(0)=0\)) gives
Like SCAD, MCP starts with the lasso slope \(\lambda\) at the origin and relaxes the penalization rate continuously to \(0\), becoming flat for \(t>\gamma\lambda\) (no bias on large coefficients). For a GLM, replace the squared-error loss with the mean negative log-likelihood \(\mathcal L(\beta)\).
Algorithm¶
Cyclic coordinate descent with firm thresholding. With standardized columns, let the partial residual be \(r^{(j)} = y - \sum_{k\ne j} X_{\cdot k}\beta_k\) and \(z_j = \tfrac1n X_{\cdot j}^\top r^{(j)}\). The MCP coordinate update is the firm-thresholding operator \(F(\cdot;\lambda,\gamma)\):
Input: X (standardized), y (centered), λ-grid λ_max>...>λ_min, γ (>1)
Warm starts along the grid (pathwise):
for λ in grid: # decreasing
repeat until convergence:
for j = 1..p:
r = y - X β + X[:,j] β_j # partial residual
z_j = (1/n) X[:,j]ᵀ r
if |z_j| <= γλ:
β_j = Soft(z_j, λ) / (1 - 1/γ) # firm threshold
else:
β_j = z_j
record β(λ)
return path {β(λ)}
- \(\lambda_{\max}=\tfrac1n\lVert X^\top y\rVert_\infty\) (smallest \(\lambda\) with \(\hat\beta=0\)); grid log-spaced down to \(\lambda_{\min}=\epsilon\lambda_{\max}\), with warm starts.
- General GLM. Penalized IRLS (outer quadratic approximation) + weighted coordinate descent (inner) applying the firm-thresholding update on the working response, as in Lasso-CD.
Hyperparameters & configuration¶
| Knob | Default | Notes |
|---|---|---|
| \(\lambda\) | path | selected by CV or BIC |
| \(\gamma\) | \(3\) | concavity; \(\gamma>1\) required. \(\gamma\to\infty\) → lasso (soft), \(\gamma\to1^+\) → hard threshold |
| grid length | 100 | log-spaced \(\lambda_{\max}\to\epsilon\lambda_{\max}\) |
| standardize | true | columns to unit variance; coefficients returned on original scale |
| intercept | true, unpenalized | |
| tol | \(10^{-7}\) | convergence on max coordinate change |
| family/link | gaussian/identity | also binomial/logit, poisson/log via IRLS |
Mapping to framework¶
- Input: \(X, y\), link; regularization \(\lambda\), concavity \(\gamma\) (or request full path).
- Output: \(\hat\beta(\lambda)\) — a point or the whole path.
- Links: identity (LS inner loop), logit, log (IRLS outer loop).
- Preprocessing: standardize \(X\); center \(y\) (Gaussian) or fit an unpenalized intercept (GLM).
Complexity¶
- Per full cycle: \(O(np)\) (Gaussian, dense), or \(O(n\,\lvert\text{active set}\rvert)\) with active-set tricks.
- Whole path of \(L\) values with warm starts: typically near \(O(npL)\) in practice.
- Memory \(O(np)\) (or \(O(\text{nnz})\) for sparse \(X\)).
Statistical guarantees¶
- Convexity condition. For Gaussian loss with standardized columns, the objective is convex whenever \(\gamma > 1/c_{\min}\), where \(c_{\min}\) is the minimum eigenvalue of \(\tfrac1n X^\top X\); more generally each coordinate subproblem is convex when \(\gamma>1\). This makes MCP-CD better behaved than generic nonconvex problems.
- Near-unbiased selection (Zhang 2010). Under a sparse Riesz / restricted-eigenvalue
condition, MCP attains the oracle sparsity pattern and near-minimax estimation rates while
minimizing the maximum concavity for a given threshold gap (
zhang2010nearly). - Stationary-point control. As a regularized M-estimator with nonconvex penalty, all
stationary points are within statistical precision of \(\beta^\star\) under restricted strong
convexity (
loh2015regularized).
Variants & related¶
- SCAD via Local Linear Approximation — sibling folded-concave penalty.
- Lasso via Coordinate Descent — convex limit (\(\gamma\to\infty\)) / inner solver.
- Hard thresholding — the \(\gamma\to1^+\) limit of the firm-thresholding operator.
References¶
- Zhang (2010), Nearly unbiased variable selection under minimax concave penalty
(
zhang2010nearly) — defines MCP and the MC+ / coordinate-descent algorithm. - Loh & Wainwright (2015), Regularized M-estimators with nonconvexity (
loh2015regularized) — statistical control of stationary points under nonconvex penalties.