Stochastic Gradient Descent (SGD)¶

At a glance

Family: online-streaming · Regime: streaming / low-dim / high-dim · Penalty: none · Output: point · Links: identity, logit, log · Status: draft · Refs: Robbins and Monro, 1951 · Polyak and Juditsky, 1992 · Ruppert, 1988 · Chen et al., 2020

Setting & assumptions¶

Any GLM in the exponential family with canonical link; identity (Gaussian), logit (Bernoulli), log (Poisson) are the primary cases.
Streaming / online: observations \((x_{i_t}, y_{i_t})\) arrive one at a time (or are sampled i.i.d. from a fixed set); only the current point and the running iterate are held in memory.
The population risk \(\mathbb{E}[\ell(\beta)]\) is convex (strictly so under a full-rank design) with a unique minimizer \(\beta^\star\); gradients are unbiased estimates of \(\nabla\mathcal L\).

Estimator / objective¶

SGD targets the unpenalized GLM minimizer (\(P\equiv 0\)):

\[ \widehat\beta \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \frac1n\sum_{i=1}^n \ell_i(\beta), \]

using the per-sample gradient of the GLM mean negative log-likelihood,

\[ \nabla \ell_i(\beta) \;=\; -\big(y_i - \mu_i\big)\, x_i \;=\; \big(\mu(x_i^\top\beta) - y_i\big)\, x_i, \qquad \mu_i = g^{-1}(x_i^\top\beta). \]

The defining recursion (Robbins–Monro stochastic approximation) is

\[ \boxed{\; \beta^{(t+1)} \;=\; \beta^{(t)} - \gamma_t\,\nabla \ell_{i_t}(\beta^{(t)}) \;=\; \beta^{(t)} - \gamma_t\big(\mu(x_{i_t}^\top\beta^{(t)}) - y_{i_t}\big)\,x_{i_t}. \;} \]

Algorithm¶

Input: stream (x_{i_t}, y_{i_t}); step schedule γ_t; init β⁽⁰⁾ (e.g. 0)
Optional: Polyak–Ruppert average β̄
1. β ← β⁽⁰⁾ ;  β̄ ← β⁽⁰⁾
2. for t = 0, 1, 2, ... :
3.     receive (x, y)
4.     η = xᵀ β                         # linear predictor, O(p)
5.     μ = g⁻¹(η)                        # inverse link
6.     g_t = (μ - y) · x                 # ∇ℓ_{i_t}(β), rank-1 in x
7.     β ← β - γ_t · g_t                 # SGD step, O(p)
8.     β̄ ← β̄ + (β - β̄)/(t+1)            # running average (optional)
Return β  (or the averaged β̄ for inference)

Step-size schedules. A decaying rate \(\gamma_t = \gamma_0\, t^{-a}\) with \(a\in(0.5, 1]\) satisfies the Robbins–Monro conditions \(\sum_t\gamma_t=\infty\), \(\sum_t\gamma_t^2<\infty\) and guarantees a.s. convergence. \(a=1\) is the classical choice; \(a\in(0.5,1)\) is preferred when it is combined with averaging.
Polyak–Ruppert averaging. Report the tail average \(\bar\beta_t = \frac1t\sum_{s=1}^{t}\beta^{(s)}\) rather than the last iterate. With a slowly decaying \(\gamma_t\) (\(a<1\)), \(\bar\beta_t\) is asymptotically efficient — it attains the same limiting variance as the MLE despite using only first-order updates.

Hyperparameters & configuration¶

Knob	Default	Notes
\(\gamma_0\)	problem-dependent	base learning rate; too large diverges, too small crawls
\(a\) (decay)	\(0.5 < a \le 1\)	\(\gamma_t=\gamma_0 t^{-a}\); use \(a\in(0.5,1)\) with averaging
averaging	on (for inference)	Polyak–Ruppert tail/running average \(\bar\beta_t\)
init \(\beta^{(0)}\)	\(0\)	warm start permitted
passes / epochs	1 (true streaming)	multiple passes if data are revisitable
standardization	recommended	scale columns so a single \(\gamma_t\) suits all coordinates

Mapping to framework¶

Input: stream of \((x_i, y_i)\), link \(g\), schedule \(\gamma_t\).
Output: point estimate \(\widehat\beta\) (last iterate), or \(\bar\beta_t\) with an asymptotic covariance estimate (point+inference via averaging).
Links: identity, logit, log (any canonical link supplying \(\mu=g^{-1}(\eta)\)).
Preprocessing: feature scaling strongly recommended; no full-data storage required.

Complexity¶

Per step: one inner product \(x_{i_t}^\top\beta\) and one rank-1 axpy update, both \(O(p)\); no matrix is formed. Total \(O(p)\) time per observation.
Memory: \(O(p)\) — the iterate \(\beta\) (plus \(O(p)\) for the running average). Bounded and independent of the number of observations seen, which is what makes SGD suitable for streams.

Statistical guarantees¶

Consistency. Under the Robbins–Monro step conditions, \(\beta^{(t)}\to\beta^\star\) a.s. Robbins and Monro, 1951.
Asymptotic normality of the averaged iterate. With \(\gamma_t=\gamma_0 t^{-a}\), \(a\in(0.5,1)\), \(\sqrt{t}\,(\bar\beta_t - \beta^\star)\xrightarrow{d}\mathcal N(0,\, H^{-1} G H^{-1})\) where \(H=\nabla^2\mathcal L(\beta^\star)\) and \(G\) is the gradient-noise covariance; this matches the MLE's efficiency (Polyak and Juditsky, 1992, Ruppert, 1988).
Inference. Plug-in / online estimators of the sandwich covariance permit confidence intervals from the averaged SGD path Chen et al., 2020.

Implicit SGD — proximal/implicit update for stability.
AdaGrad — per-coordinate adaptive step sizes.
FOBOS · RDA · Truncated Gradient — regularized/sparse online variants.
Renewable GLM — batch streaming with full-data efficiency.

References¶

Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist., 22(3):400--407.
Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838--855.
Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report, Cornell University Operations Research and Industrial Engineering.
Chen, X., Lee, J. D., Tong, X. T., and Zhang, Y. (2020). Statistical inference for model parameters in stochastic gradient descent. Ann. Statist., 48(1):251--273.