Generalized Linear Models (GLM)
1. Exponential Family
Generalized Linear Models (GLMs) allow us to model responses whose conditional distribution belongs to the exponential family. The exponential family consists of every distribution that can be expressed as below.
\[p(y;\eta) = b(y)\exp(\eta^TT(y)-a(\eta))\]Here, $\eta$ is the natural parameter, $T(y)$ the sufficient statistic, $a(\eta)$ the log partition function, and $b(y)$ the base measure. I'll introduce three examples: the Bernoulli, Gaussian, and categorical distributions.
Bernoulli Distribution
\[p(y;\phi) \\ = \phi^y(1-\phi)^{(1-y)} \\ =\exp(y\log(\phi)+(1-y)\log(1-\phi)) \\ =\exp(\log\frac{\phi}{1-\phi}y+\log(1-\phi))\]Here, $\eta = \log\frac{\phi}{1-\phi}$ (or equivalently $\phi = \frac{1}{1+\exp(-\eta)}$).
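To make this concrete, here is a quick numerical sanity check (my own sketch, not part of the original derivation) that the standard Bernoulli pmf and its exponential-family form agree, with $b(y)=1$, $T(y)=y$, and $a(\eta)=\log(1+e^\eta)=-\log(1-\phi)$:

```python
import numpy as np

# Sanity check (illustrative sketch): Bernoulli pmf vs. its
# exponential-family form, with b(y) = 1, T(y) = y,
# eta = log(phi / (1 - phi)), a(eta) = log(1 + e^eta).
phi = 0.3
eta = np.log(phi / (1 - phi))
a_eta = np.log(1 + np.exp(eta))

for y in (0, 1):
    pmf = phi**y * (1 - phi)**(1 - y)       # standard form
    expfam = np.exp(eta * y - a_eta)        # exponential-family form
    assert np.isclose(pmf, expfam)
print("Bernoulli forms agree")
```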
Gaussian Distribution ($\sigma^2=1$)
\[p(y;\mu) \\ = \frac{1}{\sqrt{2\pi}}\exp(-\frac{(y-\mu)^2}{2}) \\ = \frac{1}{\sqrt{2\pi}}\exp(-\frac{y^2}{2})\exp(\mu y - \frac{\mu^2}{2})\]Here, $\eta=\mu$
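The same kind of check works for the Gaussian case (again just an illustrative sketch), with $b(y)=\frac{1}{\sqrt{2\pi}}e^{-y^2/2}$, $\eta=\mu$, and $a(\eta)=\eta^2/2$:

```python
import numpy as np

# Sanity check (illustrative sketch): Gaussian pdf (sigma^2 = 1) vs. its
# exponential-family form, with b(y) = exp(-y^2/2)/sqrt(2*pi), a(eta) = eta^2/2.
mu = 1.7
for y in (-1.0, 0.0, 2.5):
    pdf = np.exp(-(y - mu)**2 / 2) / np.sqrt(2 * np.pi)
    b_y = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
    expfam = b_y * np.exp(mu * y - mu**2 / 2)
    assert np.isclose(pdf, expfam)
print("Gaussian forms agree")
```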
Categorical Distribution
$T(y) = [1\{y=1\}, 1\{y=2\}, \dots, 1\{y=k-1\}]^T$, where $1\{\cdot\}$ denotes the indicator function.
\[p(y;\phi) \\ = \phi_1^{T(y)_1}\phi_2^{T(y)_2} \cdots \phi_{k-1}^{T(y)_{k-1}}\phi_k^{1-\sum_{i=1}^{k-1}T(y)_i} \\ = \exp(T(y)_1\log(\phi_1/\phi_k) + T(y)_2\log(\phi_2/\phi_k) + \dots + T(y)_{k-1}\log(\phi_{k-1}/\phi_k) + \log\phi_k)\]
Here, $\eta = [\log(\phi_1/\phi_k), \log(\phi_2/\phi_k), \dots , \log(\phi_{k-1}/\phi_k)]^T$
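And once more for the categorical case. The script below (my own sketch, using 0-indexed classes rather than the 1-indexed classes in the text) confirms that the exponential-family expression reproduces $p(y)=\phi_y$:

```python
import numpy as np

# Sanity check (illustrative sketch): with eta_i = log(phi_i / phi_k), the
# exponential-family expression reproduces the original pmf p(y) = phi_y.
# Classes are 0-indexed here (0..k-1), unlike the 1-indexed text above.
phi = np.array([0.2, 0.5, 0.3])           # k = 3 class probabilities
k = len(phi)
eta = np.log(phi[:-1] / phi[-1])          # natural parameter, length k-1

for y in range(k):
    T = (np.arange(k - 1) == y).astype(float)   # indicator vector T(y)
    expfam = np.exp(eta @ T + np.log(phi[-1]))  # b(y) = 1
    assert np.isclose(expfam, phi[y])
print("categorical forms agree")
```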
2. Generalized Linear Models (GLM)
A GLM assumes $p(y|x;\theta)$ is an exponential family distribution. Our objective is to build a model $h_\theta$ that outputs $E[T(y)|x]$ (typically, $T(y)=y$). We assume $\eta = \theta^Tx$ and compute the expected value based on $\eta$. Let's look at three examples.
Bernoulli Distribution
$\phi = \frac{1}{1+\exp(-\eta)}$, and $T(y)=y$, as seen above.
So $E[T(y)|x;\theta]=E[y|x;\theta] = \phi = \frac{1}{1+\exp(-\eta)} = \frac{1}{1+\exp(-\theta^Tx)}$.
Now we know why we use the sigmoid function for the final output of binary classification.
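As a sketch (the helper names are mine, not a standard API), the Bernoulli GLM hypothesis is just a sigmoid applied to the linear score:

```python
import numpy as np

# Illustrative sketch of the Bernoulli GLM hypothesis (names are my own).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_bernoulli(theta, x):
    # E[y | x; theta] = phi = sigmoid(theta^T x)
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(h_bernoulli(theta, x))   # theta^T x = 0, so P(y=1|x) = 0.5
```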
Gaussian Distribution
$\mu$ happens to be the same as $\eta$, and $T(y)=y$, as seen above.
So $E[T(y)|x;\theta]=E[y|x;\theta]=\mu=\eta=\theta^Tx$.
Categorical Distribution
We've seen $\eta = [\log(\phi_1/\phi_k), \log(\phi_2/\phi_k), \dots, \log(\phi_{k-1}/\phi_k)]^T$ above. Now we need to express $\phi$ in terms of $\eta$.
Let's define $\eta_k = \log(\phi_k/\phi_k)=0$. Then
\[\phi_i = e^{\eta_i}\phi_k \\ \sum_{i=1}^ke^{\eta_i}\phi_k = \sum_{i=1}^k\phi_i = 1 \\ \phi_k = 1/\sum_{i=1}^ke^{\eta_i} \\ \phi_i = \frac{\exp(\eta_i)}{\sum_{j=1}^k\exp(\eta_j)}\]
So $h_\theta(x) = E[T(y)|x;\theta] = [\phi_1, \phi_2, \dots, \phi_{k-1}]^T = [\frac{\exp(\theta^T_1x)}{\sum_{j=1}^k\exp(\theta^T_jx)}, \frac{\exp(\theta^T_2x)}{\sum_{j=1}^k\exp(\theta^T_jx)}, \dots, \frac{\exp(\theta^T_{k-1}x)}{\sum_{j=1}^k\exp(\theta^T_jx)}]^T$, where each class $i$ has its own parameter vector $\theta_i$ so that $\eta_i = \theta_i^Tx$ (with $\theta_k = 0$).
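A sketch of this hypothesis in code (again with hypothetical names; `Theta` is assumed to be a $(k-1)\times d$ matrix with one row per non-reference class, and appending a zero logit implements $\eta_k=0$ from the derivation above):

```python
import numpy as np

# Illustrative sketch of the softmax hypothesis (names are my own).
def h_categorical(Theta, x):
    logits = np.append(Theta @ x, 0.0)    # eta_1..eta_{k-1}, then eta_k = 0
    logits -= logits.max()                # standard numerical-stability shift
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[:-1]                     # phi_1..phi_{k-1}, as in h_theta(x)

Theta = np.array([[1.0, 0.0], [0.0, 1.0]])   # k = 3 classes, d = 2 features
x = np.array([0.5, -0.5])
print(h_categorical(Theta, x))               # first two class probabilities
```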
Our objective is to find the $\theta$ that maximizes the likelihood $p(y|x;\theta)$. We can optimize $\theta$ with stochastic gradient descent (SGD).
\[-\frac{\partial\log p(y|x;\theta)}{\partial \theta_j} \\ = -\frac{\partial\log (b(y)\exp(\eta y-a(\eta)))}{\partial \theta_j} \\ = -(y-a'(\eta))x_j \\ = (h_\theta(x)-y)x_j\]using $\eta=\theta^Tx$ and the exponential-family identity $a'(\eta) = E[y|x;\theta] = h_\theta(x)$. Accordingly, the SGD update rule (descending the negative log-likelihood) will be
\[\theta_j := \theta_j - \alpha(h_\theta(x)-y)x_j\]Now we know why the gradients of the logistic regression loss and the linear regression loss with respect to the weights are identical in form. Both losses are defined so that minimizing them is the same as maximum likelihood estimation, and doing maximum likelihood estimation on GLMs with SGD yields the above update rule. The rule naturally applies to both logistic regression and linear regression since they are both GLMs: logistic regression models the conditional distribution as Bernoulli, and linear regression models it as Gaussian.
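A minimal sketch of this shared update rule (function names are my own): the same `sgd_step` trains both models, with only the hypothesis $h$ swapped.

```python
import numpy as np

# Illustrative sketch: one SGD step implementing
# theta_j := theta_j - alpha * (h_theta(x) - y) * x_j, shared by both GLMs.
def sgd_step(theta, x, y, h, alpha):
    return theta - alpha * (h(theta, x) - y) * x

h_linear = lambda theta, x: theta @ x                             # Gaussian GLM
h_logistic = lambda theta, x: 1.0 / (1.0 + np.exp(-(theta @ x)))  # Bernoulli GLM

# Fit a linear regression on synthetic data; swapping in h_logistic
# (with y in {0, 1}) would train logistic regression with the identical rule.
rng = np.random.default_rng(0)
theta_true, theta = np.array([2.0, -1.0]), np.zeros(2)
for _ in range(2000):
    x = rng.normal(size=2)
    y = theta_true @ x + 0.1 * rng.normal()
    theta = sgd_step(theta, x, y, h_linear, alpha=0.05)
print(theta)   # should be close to [2.0, -1.0]
```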