Skip to content

Stochastic Gradient Descent (SGD)

At a glance

Family: online-streaming · Regime: streaming / low-dim / high-dim · Penalty: none · Output: point · Links: identity, logit, log · Status: draft · Refs: Robbins1951, polyak1992acceleration, ruppert1988efficient, chen2020statistical

Setting & assumptions

  • Any GLM in the exponential family with canonical link; identity (Gaussian), logit (Bernoulli), log (Poisson) are the primary cases.
  • Streaming / online: observations \((x_{i_t}, y_{i_t})\) arrive one at a time (or are sampled i.i.d. from a fixed set); only the current point and the running iterate are held in memory.
  • The population risk \(\mathbb{E}[\ell(\beta)]\) is convex (strictly so under a full-rank design) with a unique minimizer \(\beta^\star\); gradients are unbiased estimates of \(\nabla\mathcal L\).

Estimator / objective

SGD targets the unpenalized GLM minimizer (\(P\equiv 0\)):

\[ \hat\beta \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \frac1n\sum_{i=1}^n \ell_i(\beta), \]

using the per-sample gradient of the GLM mean negative log-likelihood,

\[ \nabla \ell_i(\beta) \;=\; -\big(y_i - \mu_i\big)\, x_i \;=\; \big(\mu(x_i^\top\beta) - y_i\big)\, x_i, \qquad \mu_i = g^{-1}(x_i^\top\beta). \]

The defining recursion (Robbins–Monro stochastic approximation) is

\[ \boxed{\; \beta^{(t+1)} \;=\; \beta^{(t)} - \gamma_t\,\nabla \ell_{i_t}(\beta^{(t)}) \;=\; \beta^{(t)} - \gamma_t\big(\mu(x_{i_t}^\top\beta^{(t)}) - y_{i_t}\big)\,x_{i_t}. \;} \]

Algorithm

Input: stream (x_{i_t}, y_{i_t}); step schedule γ_t; init β⁽⁰⁾ (e.g. 0)
Optional: Polyak–Ruppert average β̄
1. β ← β⁽⁰⁾ ;  β̄ ← β⁽⁰⁾
2. for t = 0, 1, 2, ... :
3.     receive (x, y)
4.     η = xᵀ β                         # linear predictor, O(p)
5.     μ = g⁻¹(η)                        # inverse link
6.     g_t = (μ - y) · x                 # ∇ℓ_{i_t}(β), rank-1 in x
7.     β ← β - γ_t · g_t                 # SGD step, O(p)
8.     β̄ ← β̄ + (β - β̄)/(t+1)            # running average (optional)
Return β  (or the averaged β̄ for inference)
  • Step-size schedules. A decaying rate \(\gamma_t = \gamma_0\, t^{-a}\) with \(a\in(0.5, 1]\) satisfies the Robbins–Monro conditions \(\sum_t\gamma_t=\infty\), \(\sum_t\gamma_t^2<\infty\) and guarantees a.s. convergence. \(a=1\) is the classical choice; \(a\in(0.5,1)\) is preferred when it is combined with averaging.
  • Polyak–Ruppert averaging. Report the tail average \(\bar\beta_t = \frac1t\sum_{s=1}^{t}\beta^{(s)}\) rather than the last iterate. With a slowly decaying \(\gamma_t\) (\(a<1\)), \(\bar\beta_t\) is asymptotically efficient — it attains the same limiting variance as the MLE despite using only first-order updates.

Hyperparameters & configuration

Knob Default Notes
\(\gamma_0\) problem-dependent base learning rate; too large diverges, too small crawls
\(a\) (decay) \(0.5 < a \le 1\) \(\gamma_t=\gamma_0 t^{-a}\); use \(a\in(0.5,1)\) with averaging
averaging on (for inference) Polyak–Ruppert tail/running average \(\bar\beta_t\)
init \(\beta^{(0)}\) \(0\) warm start permitted
passes / epochs 1 (true streaming) multiple passes if data are revisitable
standardization recommended scale columns so a single \(\gamma_t\) suits all coordinates

Mapping to framework

  • Input: stream of \((x_i, y_i)\), link \(g\), schedule \(\gamma_t\).
  • Output: point estimate \(\hat\beta\) (last iterate), or \(\bar\beta_t\) with an asymptotic covariance estimate (point+inference via averaging).
  • Links: identity, logit, log (any canonical link supplying \(\mu=g^{-1}(\eta)\)).
  • Preprocessing: feature scaling strongly recommended; no full-data storage required.

Complexity

  • Per step: one inner product \(x_{i_t}^\top\beta\) and one rank-1 axpy update, both \(O(p)\); no matrix is formed. Total \(O(p)\) time per observation.
  • Memory: \(O(p)\) — the iterate \(\beta\) (plus \(O(p)\) for the running average). Bounded and independent of the number of observations seen, which is what makes SGD suitable for streams.

Statistical guarantees

  • Consistency. Under the Robbins–Monro step conditions, \(\beta^{(t)}\to\beta^\star\) a.s. (Robbins1951).
  • Asymptotic normality of the averaged iterate. With \(\gamma_t=\gamma_0 t^{-a}\), \(a\in(0.5,1)\), \(\sqrt{t}\,(\bar\beta_t - \beta^\star)\xrightarrow{d}\mathcal N(0,\, H^{-1} G H^{-1})\) where \(H=\nabla^2\mathcal L(\beta^\star)\) and \(G\) is the gradient-noise covariance; this matches the MLE's efficiency (polyak1992acceleration, ruppert1988efficient).
  • Inference. Plug-in / online estimators of the sandwich covariance permit confidence intervals from the averaged SGD path (chen2020statistical).

References

  • Robbins & Monro (1951), A stochastic approximation method (Robbins1951).
  • Polyak & Juditsky (1992), Acceleration of stochastic approximation by averaging (polyak1992acceleration).
  • Ruppert (1988), Efficient estimations from a slowly convergent Robbins–Monro process (ruppert1988efficient).
  • Chen, Lee, Tong & Zhang (2020), Statistical inference for model parameters in stochastic gradient descent (chen2020statistical).