Skip to content

FOBOS (Forward-Backward Splitting)

At a glance

Family: online-streaming · Regime: streaming / high-dim / low-dim · Penalty: lasso (any prox-able) · Output: point · Links: identity, logit, log · Status: draft · Refs: xiao2009dual, truncated_SGD_Langford_2009

Setting & assumptions

  • Any GLM in the exponential family with canonical link; identity, logit, log are the primary cases.
  • Streaming / online: points arrive sequentially; per-step cost and storage are bounded.
  • A convex, prox-able penalty \(P\) (canonically \(\ell_1\) for online sparsity); high-dimensional sparse \(\beta^\star\) is the motivating regime.

Estimator / objective

FOBOS (Duchi & Singer, 2009) minimizes the composite online objective

\[ \hat\beta \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) + \lambda P(\beta), \qquad \nabla\ell_i(\beta) = -(y_i-\mu_i)\,x_i = (\mu(x_i^\top\beta)-y_i)\,x_i, \]

by splitting each iteration into an explicit (forward) subgradient step on the loss and an implicit (backward) proximal step on the penalty:

\[ \boxed{\; \begin{aligned} \text{(i) forward:}\quad & \beta^{(t+1/2)} = \beta^{(t)} - \gamma_t\,\nabla\ell_{i_t}(\beta^{(t)}) = \beta^{(t)} - \gamma_t\big(\mu(x_{i_t}^\top\beta^{(t)})-y_{i_t}\big)\,x_{i_t},\\[2pt] \text{(ii) backward:}\quad & \beta^{(t+1)} = \arg\min_{\beta}\;\tfrac12\lVert\beta-\beta^{(t+1/2)}\rVert_2^2 + \gamma_t\lambda\,P(\beta) \;=\; \operatorname{prox}_{\gamma_t\lambda P}\big(\beta^{(t+1/2)}\big). \end{aligned} \;} \]

\(\ell_1\) case. The backward step is coordinatewise soft-thresholding, producing exact zeros (sparse online iterates):

\[ \beta^{(t+1)}_j = \mathcal S_{\gamma_t\lambda}\big(\beta^{(t+1/2)}_j\big) = \operatorname{sign}(\beta^{(t+1/2)}_j)\,\big(|\beta^{(t+1/2)}_j|-\gamma_t\lambda\big)_+. \]

Algorithm

Input: stream (x, y); step schedule γ_t; penalty λP (prox known); init β⁽⁰⁾
1. β ← β⁽⁰⁾
2. for t = 0, 1, 2, ... :
3.     receive (x, y)
4.     μ = g⁻¹(xᵀ β)                          # O(p)
5.     g = (μ - y) · x                         # forward gradient, O(p)
6.     v = β - γ_t · g                         # forward (half) step, O(p)
7.     β ← prox_{γ_t·λ·P}(v)                   # backward step
       # for P = ‖·‖₁:  β_j = Soft(v_j, γ_t·λ)  (soft-threshold), O(p)
Return β
  • Step sizes follow \(\gamma_t=\gamma_0 t^{-a}\), \(a\in(0.5,1]\) (Robbins–Monro conditions).
  • Any prox-able penalty fits: \(\ell_1\) (soft-threshold), \(\ell_2^2\)/ridge (shrinkage), group lasso (block soft-threshold), elastic net.

Hyperparameters & configuration

Knob Default Notes
\(\lambda\) tuned regularization strength; controls online sparsity
\(P\) \(\ell_1\) any prox-able penalty
\(\gamma_0\) problem-dependent base learning rate
\(a\) (decay) \(0.5<a\le1\) \(\gamma_t=\gamma_0 t^{-a}\)
averaging optional tail-average iterates to stabilize
init \(\beta^{(0)}\) \(0\) warm start permitted

Mapping to framework

  • Input: stream of \((x_i,y_i)\), link \(g\), penalty \((\lambda, P)\), schedule \(\gamma_t\).
  • Output: sparse point estimate \(\hat\beta\).
  • Links: identity, logit, log.
  • Preprocessing: feature scaling recommended so a single \(\lambda\) acts uniformly.

Complexity

  • Per step: one inner product (\(O(p)\)), one axpy forward step (\(O(p)\)), and one elementwise prox/soft-threshold (\(O(p)\)). Total \(O(p)\) time per observation.
  • Memory: \(O(p)\) — only the iterate \(\beta\) — bounded and independent of the stream length. With sparse \(x_{i_t}\), the forward step touches only \(O(\mathrm{nnz})\) coordinates.

Statistical guarantees

  • Online regret. For convex losses with step sizes \(\gamma_t\propto 1/\sqrt{t}\), FOBOS attains \(\mathcal O(\sqrt{T})\) regret against the best fixed \(\beta\) for the composite objective (Duchi & Singer, 2009).
  • Sparsity behavior. Single-point soft-thresholding zeros only coordinates whose half-step magnitude falls below \(\gamma_t\lambda\); because \(\gamma_t\to0\), FOBOS yields weaker online sparsity than gradient-averaging schemes such as RDA (xiao2009dual).
  • RDA — averages gradients before thresholding, giving stronger/more stable sparsity.
  • Truncated Gradient — recovers FOBOS-\(\ell_1\) as a special case.
  • SGD — the forward step alone; AdaGrad — adaptive composite variant.

References

  • Duchi & Singer (2009), Efficient online and batch learning using forward-backward splitting — original FOBOS (named in prose; not in reference.bib).
  • Xiao (2009), Dual averaging method for regularized stochastic learning and online optimization (xiao2009dual) — analyzes FOBOS-type sparsity and the dual-averaging alternative.
  • Langford, Li & Zhang (2009), Sparse online learning via truncated gradient (truncated_SGD_Langford_2009) — closely related online-\(\ell_1\) family.