Skip to content

GLM via IRLS / Fisher Scoring

At a glance

Family: classical · Regime: low-dim (\(n>p\)) · Penalty: none · Output: point · Links: any differentiable \(g\) · Status: draft · Refs: IWLS1987, Nocedal2006

Setting & assumptions

  • Responses follow a one-parameter exponential dispersion family with mean \(\mu_i=g^{-1}(x_i^\top\beta)\) for a chosen monotone, differentiable link \(g\) (canonical or not), variance function \(V(\mu)\), and dispersion \(\phi\).
  • Low-dimensional: \(n>p\) and \(X\) full column rank, so the Fisher information \(X^\top W X\) is invertible and the MLE is well-defined.
  • Observations conditionally independent given \(X\); the linear predictor is correctly specified.
  • No penalty (\(P\equiv 0\)): this is the plain maximum-likelihood estimator.

Estimator / objective

The MLE maximizes the log-likelihood, equivalently minimizes the mean negative log-likelihood:

\[ \hat\beta \;=\; \arg\min_{\beta\in\mathbb{R}^p}\; \mathcal L(\beta) , \qquad \mathcal L(\beta) = \frac1n\sum_{i=1}^n \ell_i(\beta), \]

with \(\ell_i\) the per-sample negative log-likelihood. The optimum solves the score equation

\[ S(\beta) \;=\; X^\top\!\big(y - \mu(\beta)\big) \;=\; 0 , \]

(for the canonical link; for a general link a diagonal derivative factor appears, which the IRLS weights below absorb). There is no closed form in general, so \(\hat\beta\) is found iteratively.

Algorithm

Iteratively Reweighted Least Squares (IRLS) = Fisher scoring. Each iteration forms a local working response and weights, then solves a weighted least-squares problem:

\[ z_i = \eta_i + (y_i-\mu_i)\,g'(\mu_i), \qquad W_i = \frac{1}{g'(\mu_i)^2\,V(\mu_i)}, \qquad \beta \leftarrow (X^\top W X)^{-1} X^\top W z . \]
Input: X (n×p), y (n), link g, variance function V
Init:  μ_i = sensible start (e.g. y_i adjusted), η_i = g(μ_i),  β = 0
repeat (t = 1, 2, ...):
    η_i = x_iᵀ β
    μ_i = g⁻¹(η_i)
    z_i = η_i + (y_i - μ_i) · g'(μ_i)            # working / adjusted response
    W_i = 1 / ( g'(μ_i)² · V(μ_i) )              # IRLS weights (diagonal)
    β_new = (Xᵀ W X)⁻¹ Xᵀ W z                    # weighted least squares step
    if deviance increased:  β_new = β + ½(β_new - β)   # step-halving (repeat as needed)
    if ||β_new - β|| < tol:  break
    β = β_new
Return β̂   (and dispersion φ̂, deviance D)

Newton–Raphson equivalence under the canonical link. Newton's method updates \(\beta \leftarrow \beta - \big(\nabla^2\mathcal L\big)^{-1}\nabla\mathcal L\). Using the framework identities \(\nabla\mathcal L = -\tfrac1n X^\top(y-\mu)\) and \(\nabla^2\mathcal L = \tfrac1n X^\top W X\), the update becomes \(\beta + (X^\top W X)^{-1}X^\top(y-\mu)\), which is exactly the IRLS step \((X^\top W X)^{-1}X^\top W z\). For the canonical link the observed information equals the expected (Fisher) information, so Newton–Raphson and Fisher scoring coincide; for non-canonical links Fisher scoring replaces the observed Hessian by its expectation \(X^\top W X\), which keeps the weights positive and the step stable.

Hyperparameters & configuration

Knob Default Notes
link \(g\) canonical any monotone differentiable link; non-canonical → Fisher scoring
init \(\mu_i\approx y_i\) mean-start avoids boundary issues (e.g. \(\mu_i=(y_i+\bar y)/2\) for binomial)
tol \(10^{-8}\) convergence on \(\lVert\Delta\beta\rVert\) or relative deviance change
max-iter 25 IRLS converges quadratically near the optimum
step-halving on backtrack \(\beta+2^{-k}\Delta\) if deviance worsens / iterate diverges
dispersion \(\phi\) estimated fixed at \(1\) for binomial/Poisson; estimated for Gaussian/Gamma

Dispersion estimation. When \(\phi\) is unknown it is estimated after convergence, typically by the Pearson statistic \(\hat\phi = \tfrac{1}{n-p}\sum_i (y_i-\hat\mu_i)^2/V(\hat\mu_i)\).

Deviance. Convergence and goodness-of-fit are monitored by the deviance \(D = 2\big[\ell(\text{saturated}) - \ell(\hat\beta)\big]\), which IRLS decreases monotonically (with step-halving) and which generalizes the residual sum of squares.

Mapping to framework

  • Input: \(X\in\mathbb{R}^{n\times p}\), \(y\in\mathbb{R}^n\), link \(g\) and family (\(V\), \(\phi\)).
  • Output: point estimate \(\hat\beta\) (the MLE), plus \(\hat\phi\) and deviance \(D\).
  • Links: any differentiable \(g\); identity/Gaussian reduces to a single OLS solve, logit→logistic regression, log→Poisson regression.
  • Preprocessing: intercept as a column of ones; standardization optional (improves conditioning of \(X^\top W X\)).

Complexity

  • Per IRLS iteration: \(O(np^2)\) to form \(X^\top W X\) and \(O(p^3)\) to factor/solve; memory \(O(np)\).
  • Total: (#iterations) \(\times\,O(np^2)\); typically 5–25 iterations thanks to local quadratic convergence.

Statistical guarantees

  • Consistency & asymptotic normality. Under standard regularity (correct family/link, full-rank design, \(n\to\infty\) with \(p\) fixed), \(\hat\beta\) is consistent and \(\sqrt n(\hat\beta-\beta^\star)\xrightarrow{d}\mathcal N\!\big(0,\,\phi\,(\tfrac1n X^\top W X)^{-1}\big)\), attaining the Cramér–Rao bound (efficiency).
  • Inference. \(\widehat{\operatorname{Var}}(\hat\beta)=\hat\phi\,(X^\top W X)^{-1}\) at convergence gives Wald standard errors; deviance differences give likelihood-ratio tests.
  • Convergence is guaranteed locally; step-halving safeguards the global iteration.
  • OLS — the single-step IRLS solution for Gaussian/identity.
  • Ridge Regression and Lasso via Coordinate Descent — penalized GLMs add a term to each IRLS step (penalized IRLS).
  • Quasi-likelihood / GEE — replace the full likelihood by a mean–variance specification.

References

  • McCullagh & Nelder (1989), Generalized Linear Models, 2nd ed. — the canonical textbook treatment of GLMs and IRLS (not in reference.bib; named here in prose). The method originates with Nelder & Wedderburn (1972).
  • Green (1987), Penalized likelihood for general semi-parametric regression models (IWLS1987) — iteratively reweighted least squares for (penalized) likelihood.
  • Nocedal & Wright, Numerical Optimization (Nocedal2006) — Newton/Gauss–Newton view, line search and step-halving safeguards.