Skip to content

Truncated Gradient

At a glance

Family: online-streaming · Regime: streaming / high-dim / low-dim · Penalty: lasso · Output: point · Links: identity, logit, log · Status: draft · Refs: truncated_SGD_Langford_2009

Setting & assumptions

  • Any GLM in the exponential family with canonical link; identity, logit, log are the primary cases.
  • Streaming / online: points arrive sequentially; per-step cost and storage are bounded.
  • \(\ell_1\)-type online sparsity is desired; naïve rounding of small SGD coefficients is too aggressive, so truncation is applied gradually.

Estimator / objective

Truncated Gradient (Langford, Li & Zhang, 2009) approximately solves the \(\ell_1\)-penalized GLM

\[ \hat\beta \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) + \lambda\lVert\beta\rVert_1, \qquad \nabla\ell_i(\beta) = -(y_i-\mu_i)\,x_i = (\mu(x_i^\top\beta)-y_i)\,x_i, \]

by interleaving plain SGD steps with a soft truncation applied every \(K\) steps. The coordinatewise truncation operator, with gravity \(\alpha=K\gamma\lambda\) and window \(\theta\), is

\[ \boxed{\; T(v,\alpha,\theta) = \begin{cases} \max(0,\, v-\alpha) & 0 \le v \le \theta,\\[2pt] \min(0,\, v+\alpha) & -\theta \le v < 0,\\[2pt] v & |v| > \theta, \end{cases} \;} \]

i.e. shrink toward \(0\) by \(\alpha\) but not past \(0\), and only within the window \(|v|\le\theta\) (coefficients larger than \(\theta\) are left untouched).

Algorithm

Input: stream (x, y); step γ; λ; truncation period K; gravity g=γλ; window θ; init β⁽⁰⁾
1. β ← β⁽⁰⁾
2. for t = 1, 2, ... :
3.     receive (x, y)
4.     μ = g⁻¹(xᵀ β)                              # O(p)
5.     grad = (μ - y) · x                          # ∇ℓ_{i_t}(β), O(p)
6.     β ← β - γ · grad                            # plain SGD step, O(p)
7.     if t mod K == 0:                            # truncate periodically
8.         α = K · γ · λ
9.         for j = 1..p:  β_j ← T(β_j, α, θ)       # soft truncation, O(p)
Return β
  • \(K\) controls truncation frequency (amortizes the zeroing); the per-event gravity is \(\alpha=K\gamma\lambda\) so the average shrinkage rate matches \(\gamma\lambda\) per step.
  • Special case. As \(\theta\to\infty\) with \(K=1\), \(T(v,\gamma\lambda,\infty)=\mathcal S_{\gamma\lambda}(v)\) is exact soft-thresholding, recovering FOBOS-\(\ell_1\). Finite \(\theta\) protects large coefficients from shrinkage (a SCAD/MCP-like effect).

Hyperparameters & configuration

Knob Default Notes
\(\lambda\) tuned \(\ell_1\) strength (gravity per step \(=\gamma\lambda\))
\(K\) \(1\)\(10\) truncation period; larger \(K\) truncates less often, more strongly
\(\theta\) \(\infty\) truncation window; \(\theta=\infty\) ⇒ soft-threshold; finite ⇒ protect large coefs
\(\gamma\) \(\gamma_0 t^{-a}\) SGD step size, \(a\in(0.5,1]\)
init \(\beta^{(0)}\) \(0\) warm start permitted

Mapping to framework

  • Input: stream of \((x_i,y_i)\), link \(g\), \((\lambda, K, \theta)\), step \(\gamma\).
  • Output: sparse point estimate \(\hat\beta\).
  • Links: identity, logit, log.
  • Preprocessing: feature scaling recommended so \(\lambda\) acts uniformly across coordinates.

Complexity

  • Per step: one inner product and one SGD update, each \(O(p)\); the periodic truncation is an \(O(p)\) elementwise sweep (amortized \(O(p/K)\) per step, but \(O(p)\) worst case). Total \(O(p)\) per observation.
  • Memory: \(O(p)\) — only the iterate \(\beta\) — bounded and independent of the stream length; with sparse \(x_{i_t}\) the gradient step is \(O(\mathrm{nnz})\).

Statistical guarantees

  • Online regret / sparsity trade-off. Truncated Gradient enjoys an online regret bound and provably controls the induced sparsity through \((\lambda, K, \theta)\), with the truncation bias bounded by the gravity parameter (truncated_SGD_Langford_2009).
  • It recovers FOBOS-\(\ell_1\) (hence its regret guarantees) in the \(\theta\to\infty\), \(K=1\) limit.
  • FOBOS — special case (\(\theta\to\infty\), \(K=1\)).
  • RDA — gradient-averaging alternative with stronger sparsity.
  • SGD — the underlying unpenalized update.

References

  • Langford, Li & Zhang (2009), Sparse online learning via truncated gradient (truncated_SGD_Langford_2009).