Fundamentals 2: Generalized Linear Models (technical)
In the previous post, we walked through the basic theory of maximum likelihood and coded up some examples. In those models, the relation between the predictors and the response was linear and the error around the expected value was normally distributed. We will now relax both restrictions. That is, we will allow for nonlinear relations between the response and the predictors, and we will allow different probability distributions for the error.
We will achieve both ends by using a Generalized Linear Model (GLM). This may sound a bit odd: how are we going to model nonlinear relations with a ‘linear’ model? We can, because GLMs are not really linear. A GLM takes a linear regression model and slaps a so-called link function around the expected value of the response, which makes the relation between predictor and response nonlinear. This may sound mysterious now, but will become clear soon.
In GLMs we are also able to use distributions beyond the Normal, though not all of them. The distributions that are allowed come from the so-called exponential family. This is not the same as the exponential distribution, which is just one member of this family. So are the famous distributions from an intro probability class, such as the Normal, the Binomial, the Poisson and the Gamma, but not, for example, the uniform.
The main motivation for introducing the GLM framework is computational. As nonlinear functions and distributions grow wilder, the maximum likelihood surface will typically grow wilder as well. With millions of hills and mountains, finding the absolute maximum of the surface can become extremely difficult, even with plenty of computing power. The restrictions introduced by the GLM framework help to solve this problem.
Understanding the logic behind GLMs also sheds light on famous regression models. Take the most famous nonlinear GLM, the logistic model: once we work through the GLM machinery below, we will see exactly where its logit link comes from.
At the end we briefly discuss a more principled line of reasoning in favor of the GLM framework, which is based on entropy.
The Exponential Family
Recall that a probability distribution assigns probabilities (or densities) to values of the random variable X that we are interested in. A parametric probability distribution is special in that this assignment is controlled by typically just one or two parameters.
A probability distribution belongs to the exponential family if its density or mass function can be written in the form

$$f(x \mid \theta) = h(x)\,\exp\{\eta(\theta)\, T(x) - A(\theta)\},$$

where $\theta$ is the parameter, $T(x)$ is a function of the data (the sufficient statistic), $\eta(\theta)$ is the so-called natural parameter, $A(\theta)$ makes sure everything sums or integrates to one, and $h(x)$ does not depend on $\theta$.
It is hard to multiply long products of densities, which is what a likelihood for many observations is, but in this form multiplication simply adds terms inside the exponent. That is one reason the family is so convenient to work with.
We also do some algebra that brings the formula into a form that will prove useful. We fold $h(x)$ into the exponent by writing it as $\exp\{\log h(x)\}$, so that the whole density lives inside a single exponential:

$$f(x \mid \theta) = \exp\{\eta(\theta)\, T(x) - A(\theta) + c(x)\},$$

with $c(x) = \log h(x)$.
Let’s get some more intuition by molding the Bernoulli into the exponential family shape. We know that the Bernoulli puts probability $p$ on $x = 1$ and $1 - p$ on $x = 0$, which we can write compactly as $p^x (1-p)^{1-x}$. Taking the exponential of the logarithm gives

$$p^x (1-p)^{1-x} = \exp\{x \log p + (1-x)\log(1-p)\} = \exp\left\{x \log\frac{p}{1-p} + \log(1-p)\right\}.$$

This is exactly the exponential family shape, with $T(x) = x$, natural parameter $\eta(p) = \log\frac{p}{1-p}$ and $A(p) = -\log(1-p)$.
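To make this concrete, here is a minimal numerical check that the two ways of writing the Bernoulli agree (the helper functions are just for illustration):

```python
import numpy as np

def bernoulli_pmf(x, p):
    # The usual form: p^x * (1 - p)^(1 - x)
    return p**x * (1 - p)**(1 - x)

def bernoulli_expfam(x, p):
    # The exponential family form: exp(x * log(p / (1 - p)) + log(1 - p))
    theta = np.log(p / (1 - p))    # the natural parameter
    return np.exp(x * theta + np.log(1 - p))

for p in (0.1, 0.5, 0.9):
    for x in (0, 1):
        assert np.isclose(bernoulli_pmf(x, p), bernoulli_expfam(x, p))
print("Both forms agree.")
```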
And now for the last bit of terminology before we see where all of this is leading: we say that the distribution is in the canonical form if $x$ and the parameter interact only through a simple product, that is, if $T(x) = x$ and the natural parameter is the parameter itself, $\eta(\theta) = \theta$. The Bernoulli above is in canonical form once we take $\theta = \log\frac{p}{1-p}$ as its parameter.
We also split off a dispersion parameter $\phi$, which controls the spread of the distribution. A canonical exponential family member can then be written as

$$f(x \mid \theta, \phi) = \exp\left\{\frac{x\theta - b(\theta)}{a(\phi)} + c(x, \phi)\right\},$$

where $b(\theta)$ plays the role of $A(\theta)$ above.
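As a quick check, the Normal with known variance $\sigma^2$ fits this shape. Expanding the square in the exponent gives

$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} = \exp\left\{\frac{x\mu - \mu^2/2}{\sigma^2} - \frac{x^2}{2\sigma^2} - \tfrac{1}{2}\log(2\pi\sigma^2)\right\},$$

so we can read off $\theta = \mu$, $b(\theta) = \theta^2/2$, $a(\phi) = \sigma^2$, and the last two terms together form $c(x, \phi)$.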
The process by which we arrive at the results below involves a sequence of substitutions: we use the fact that the expectation of the partial derivative of the log likelihood with respect to $\theta$ (the score) is zero, as well as its relation to the negative of the Fisher information. These steps do not build much intuition and are not discussed here, but see here for the details.
The result of the substitutions is that

$$E[X] = b'(\theta)$$

and

$$\mathrm{Var}(X) = b''(\theta)\, a(\phi).$$

So if a distribution can be written in this canonical form, we get its mean and variance almost for free: just differentiate $b(\theta)$ once or twice.
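As an illustration, take the Poisson with rate $\lambda$: there $\theta = \log\lambda$, $b(\theta) = e^{\theta}$ and $a(\phi) = 1$, so both identities should give $\lambda$. A minimal sketch that checks this against simulated draws (the code is only illustrative):

```python
import numpy as np

lam = 3.5
theta = np.log(lam)          # natural parameter of the Poisson
b = np.exp                   # for the Poisson, b(theta) = exp(theta)

# Numerical first and second derivatives of b at theta
eps = 1e-4
b_prime = (b(theta + eps) - b(theta - eps)) / (2 * eps)
b_double_prime = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps**2

# Compare against the sample mean and variance of simulated Poisson draws
x = np.random.default_rng(0).poisson(lam, size=1_000_000)
print(b_prime, x.mean())           # both close to 3.5
print(b_double_prime, x.var())     # both close to 3.5 as well
```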
Canonical link functions
Now that we have a grip on the type of distributions that are allowed in GLMs, we turn to the link functions that are allowed. The link function $g$ takes the expected value $\mu = E[Y]$ of the response and connects it to the linear predictor:

$$g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p.$$

Because we are free to choose a nonlinear $g$, the relation between the predictors and $\mu$ can be nonlinear, even though the right-hand side is a plain linear combination.
Just as we have a canonical form for exponential family distributions, we have a canonical link. This is the $g$ for which $g(\mu) = \theta$: the link that maps the mean of the response straight onto the natural parameter of the distribution.
If we apply this reasoning to the canonical form of the Bernoulli, we’ll discover where the logit comes from. We had $\theta = \log\frac{p}{1-p}$, and for the Bernoulli the mean is simply $\mu = p$. The canonical link is therefore

$$g(\mu) = \log\frac{\mu}{1-\mu},$$

which is the logit. Logistic regression is nothing more than a Bernoulli GLM with its canonical link.
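This is also what off-the-shelf GLM routines use by default. Below is a minimal sketch of fitting a Bernoulli GLM with its canonical logit link, assuming statsmodels is installed (the data are simulated and the variable names are just for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)                      # design matrix with an intercept
true_beta = np.array([-0.5, 1.2])
p = 1 / (1 + np.exp(-X @ true_beta))        # inverse logit of the linear predictor
y = rng.binomial(1, p)

# Binomial family with its canonical (logit) link
model = sm.GLM(y, X, family=sm.families.Binomial())
result = model.fit()
print(result.params)                        # roughly (-0.5, 1.2)
```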
Computation
Now all the building blocks are in place to talk about computation. The problem was that the absolute maximum is hard to find if the surface has millions of hills and mountains. A solution to this computational problem is to ensure that the surface is concave. An example of a concave function is $-x^2$: it has a single peak, and a hill-climbing algorithm that keeps going uphill is guaranteed to end up at the global maximum.
We now want to show that in GLMs we can ensure we are dealing with such a function as well.
The next step is to think about inference for several observations. If we are doing regression, we want to tie these observations together with a shared set of coefficients $\beta$: each observation $i$ gets its own natural parameter $\theta_i$, determined by its predictors through $\theta_i = h(x_i^\top \beta)$ for some function $h$ that depends on the link we chose.
In this general case, there is not much we can say about the shape of the likelihood surface. We don’t know what $h$ looks like in the general case, after all. However, for the canonical link, $h$ is the identity. Then $\theta_i = x_i^\top \beta$, and the log likelihood of the observations $y_1, \dots, y_n$ becomes

$$\ell(\beta) = \sum_i \frac{y_i\, x_i^\top \beta - b(x_i^\top \beta)}{a(\phi)} + \text{const},$$

where the constant collects the $c(y_i, \phi)$ terms that do not involve $\beta$.
We recall that for canonical distributions $\mathrm{Var}(X) = b''(\theta)\, a(\phi)$. Since variances are positive, $b''(\theta) > 0$, so $b$ is convex and $-b(x_i^\top \beta)$ is concave in $\beta$. The log likelihood above is then a sum of a linear function of $\beta$ and concave functions of $\beta$, and is therefore concave itself. There are no separate hills and mountains: any local maximum we find is the global maximum.
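This concavity is easy to check numerically. The sketch below (purely illustrative) evaluates the Bernoulli log likelihood under the canonical link, for which $b(\theta) = \log(1 + e^{\theta})$, along an arbitrary line in $\beta$-space and verifies that its second differences are never positive:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one predictor
true_beta = np.array([0.3, 0.8])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))   # Bernoulli responses

def log_likelihood(beta):
    # Canonical link: theta_i = x_i . beta, and b(theta) = log(1 + exp(theta))
    theta = X @ beta
    return np.sum(y * theta - np.log1p(np.exp(theta)))

# Walk along an arbitrary line beta(t) = beta0 + t * direction in coefficient space
beta0 = np.zeros(2)
direction = np.array([1.0, -2.0])
ts = np.linspace(-3, 3, 201)
values = np.array([log_likelihood(beta0 + t * direction) for t in ts])

# For a concave function, second differences along any line are non-positive
second_diffs = values[:-2] - 2 * values[1:-1] + values[2:]
print(np.all(second_diffs <= 1e-8))   # True
```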
Entropy
So far, we have seen that using canonical exponential families and canonical links is convenient when we compute maximum likelihood estimates.
There is a more principled argument for choosing distributions from the exponential family, which is related to entropy. This concept from information theory, also known as Shannon entropy, is a measure of surprise. That is, the entropy tells you how surprised you should be, on average, by a new observation drawn from the distribution. For a discrete random variable, one can calculate its entropy with the following formula:

$$H(X) = -\sum_i p_i \log p_i.$$
The formula takes probabilities from a probability distribution (in this case a PMF, because the formula deals with discrete random variables). One may wonder which distributions maximize entropy. These are the distributions that leave the most room for surprise when new information comes in, and so encode the fewest assumptions beyond what we already know. These are often the distributions one wants.
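With no constraints at all, the answer on a finite set of outcomes is the uniform distribution. A fair coin, for instance, has higher entropy than any biased coin, as a quick calculation shows (the helper function is just for illustration):

```python
import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]              # convention: 0 * log(0) counts as 0
    return -np.sum(probs * np.log(probs))

print(entropy([0.5, 0.5]))   # ~0.693 = log(2), the maximum for two outcomes
print(entropy([0.9, 0.1]))   # ~0.325, a biased coin is less surprising on average
print(entropy([1.0, 0.0]))   # 0.0, a certain outcome carries no surprise
```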
The answer turns out to depend on the constraints we put on the distribution. If we fix the expected value, however, the distribution that maximizes entropy under that constraint is a member of the exponential family. See here for a proof.
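To see this in a small example, consider distributions on the values $0, 1, \dots, 10$ whose mean is fixed at 3. The sketch below (an illustration only, assuming scipy is available) maximizes entropy under that constraint numerically; the result has the exponential family shape $p_i \propto e^{\lambda i}$, so $\log p_i$ is linear in $i$:

```python
import numpy as np
from scipy.optimize import minimize

support = np.arange(11)            # the outcomes 0, 1, ..., 10
target_mean = 3.0

def neg_entropy(p):
    # Minimizing the negative entropy maximizes the entropy
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},            # probabilities sum to one
    {"type": "eq", "fun": lambda p: support @ p - target_mean},  # fixed expected value
]
p0 = np.full(11, 1 / 11)           # start from the uniform distribution
result = minimize(neg_entropy, p0, method="SLSQP",
                  bounds=[(1e-9, 1.0)] * 11, constraints=constraints)
p_maxent = result.x

# If the maximizer has the shape p_i ~ exp(lambda * i), then log(p_i) is linear in i
slope, intercept = np.polyfit(support, np.log(p_maxent), 1)
residuals = np.log(p_maxent) - (intercept + slope * support)
print(np.abs(residuals).max())     # close to zero: the maximizer is exponential in i
```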
Summing up
In conclusion, it is worth noting that while we gave a rationale for using GLMs, we nowhere made the case that they accurately model some feature of reality. GLMs are what Richard McElreath calls ‘geocentric models’: models that generate outcomes that match what was seen, but do not necessarily model the process that generated the outcomes. The term ‘geocentric’ refers to ancient astronomers, who made models that gave the locations of the planets as seen from their vantage point on earth. What they modeled was what they saw against the night sky, however, and not the actual locations. A space mission that used their maps would definitely not arrive at Mars! So it is with GLMs: they are useful for getting a grip on what we see, but we shouldn’t trick ourselves into thinking we have modeled the data generating process.
That being said, entropy gives some guidelines as to which exponential family distributions should be picked to model which process. This is a more principled way to build a model than to stare at the spread of your data and fit distributions until one happens to match. See this lecture for some of these guidelines.
Even if the entropy argument is not that convincing for some, the computational argument in favor of GLMs is stronger than one may think. When we turn to Bayesian inference we will see that computing the estimates can become a nightmare, so much so that dedicated algorithms have been developed to chart the unwieldy hyperspaces. Those algorithms can go off the rails, however, so one needs to keep track of how well they are performing, and long run times can become a factor as well. In sum, ease of computation is nothing to sniff at, and GLMs deliver in that respect.