Regularized Dual Averaging (RDA)¶
At a glance
Family: online-streaming · Regime: streaming / high-dim / low-dim · Penalty: lasso · Output: point · Links: identity, logit, log · Status: draft · Refs: xiao2009dual
Setting & assumptions¶
- Any GLM in the exponential family with canonical link; identity, logit, log are the primary cases.
- Streaming / online: points arrive sequentially; per-step cost and storage are bounded.
- A convex regularizer \(P\) (canonically \(\ell_1\)); sparse \(\beta^\star\) in high dimension is the motivating regime.
Estimator / objective¶
RDA minimizes the composite online objective
but, unlike FOBOS, it forms the running average of all past subgradients and minimizes against it plus the penalty and a vanishing proximal term:
where \(h\) is a strongly convex auxiliary (canonically \(h(\beta)=\tfrac12\lVert\beta\rVert_2^2\)) and \(\{\beta_t\}\) is a nondecreasing scale (e.g. \(\beta_t=\gamma\sqrt{t}\)).
\(\ell_1\) closed form. With \(h(\beta)=\tfrac12\lVert\beta\rVert_2^2\) and \(P=\lVert\cdot\rVert_1\), the update is a coordinatewise soft-threshold on the averaged gradient:
so \(\beta^{(t+1)}_j=0\) whenever \(|\bar g_{t,j}|\le\lambda\) — a threshold that does not shrink with \(t\), giving stronger and more stable sparsity than FOBOS.
Algorithm¶
Input: stream (x, y); scale schedule β_t (e.g. γ√t); λ; aux h; init β⁽⁰⁾=0
1. ḡ ← 0 (length-p running average of gradients)
2. for t = 1, 2, ... :
3. receive (x, y)
4. μ = g⁻¹(xᵀ β) # O(p)
5. g = (μ - y) · x # subgradient g_t, O(p)
6. ḡ ← ḡ + (g - ḡ)/t # update running average, O(p)
7. # closed form for h = ½‖·‖², P = ‖·‖₁:
8. for j = 1..p: β_j = -(t/β_t) · Soft(ḡ_j, λ) # O(p)
Return β
Hyperparameters & configuration¶
| Knob | Default | Notes |
|---|---|---|
| \(\lambda\) | tuned | \(\ell_1\) strength; threshold on \(\bar g_t\) |
| \(\beta_t\) | \(\gamma\sqrt{t}\) | nondecreasing proximal scale; \(\gamma>0\) tunes aggressiveness |
| \(h\) | \(\tfrac12\lVert\cdot\rVert_2^2\) | strongly convex auxiliary (enables closed form) |
| \(P\) | \(\ell_1\) | other convex regularizers admissible |
| init \(\beta^{(0)}\) | \(0\) | \(\bar g_0=0\) |
Mapping to framework¶
- Input: stream of \((x_i,y_i)\), link \(g\), penalty \((\lambda, P)\), schedule \(\beta_t\), auxiliary \(h\).
- Output: sparse point estimate \(\hat\beta\).
- Links: identity, logit, log.
- Preprocessing: feature scaling recommended so \(\lambda\) thresholds coordinates uniformly.
Complexity¶
- Per step: one inner product (\(O(p)\)), a running-average update (\(O(p)\)), and a closed-form coordinatewise threshold (\(O(p)\)). Total \(O(p)\) time per observation.
- Memory: \(O(p)\) — the iterate \(\beta\) and the averaged gradient \(\bar g_t\) — bounded and independent of the stream length.
Statistical guarantees¶
- Online regret. With \(\beta_t\propto\sqrt{t}\), RDA attains \(\mathcal O(\sqrt{T})\) regret for
convex losses (
xiao2009dual). - Sparsity. Because the threshold acts on the averaged gradient \(\bar g_t\) with a level that
does not vanish, RDA produces sparser, more stable solutions than the diminishing-threshold
FOBOS, as analyzed in
xiao2009dual.
Variants & related¶
- FOBOS — thresholds the current half-step (weaker sparsity).
- Truncated Gradient — alternative online-\(\ell_1\) truncation scheme.
- AdaGrad — adaptive (per-coordinate) dual-averaging form.
References¶
- Xiao (2009), Dual averaging method for regularized stochastic learning and online optimization (
xiao2009dual) — introduces RDA and contrasts it with FOBOS.