BayesFactor: Software for Bayesian inference: What is a Bayes factor?

Sunday, February 9, 2014

What is a Bayes factor?

The BayesFactor package

This blog is a companion to the BayesFactor package in R, which supports inference by Bayes factors in common research designs. Bayes factors have been proposed as more principled replacements for common classical statistical procedures such as \(p\) values; this blog will offer tutorials in using the package for data analysis.
In this first post, I describe the general logic of Bayes factors using a very simple research example. In the coming posts, I will show how to do a more complete Bayesian data analysis using the R package.

What is a Bayes factor?

Suppose that two researchers are interested in public opinion about public smoking bans. Paul believes that 70% of the public support such bans; Carole believes that the support is less, at 60%. Paul and Carole decide to ask 100 randomly selected people whether they support public smoking bans.
Because Paul and Carole have very specific beliefs about the true support for the public smoking bans, they also have predictions about how the sample will turn out. Of course, Paul's best guess is that 70 out of 100 will support smoking bans, and likewise Carole's best guess is 60. However, Paul and Carole can be more specific, because the binomial distribution is appropriate for modeling this kind of random sample. The figure below shows the predictions of Carole (blue) and Paul (red).
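The predictions described above can be sketched in R with `dbinom`. This is a minimal illustration rather than the code behind the original figure; the names `pred_carole`, `pred_paul`, `best_carole`, and `best_paul` are introduced here only for convenience.

```r
# Predicted probability of each possible number of supporters (0 to 100)
N <- 100
k <- 0:N
pred_carole <- dbinom(k, N, 0.6)  # Carole: 60% of the public support bans
pred_paul   <- dbinom(k, N, 0.7)  # Paul: 70% of the public support bans

# The most probable count under each hypothesis is that person's "best guess"
best_carole <- k[which.max(pred_carole)]
best_paul   <- k[which.max(pred_paul)]

# Each hypothesis spreads its probability over many outcomes, so neither
# rules out counts close to the other's best guess
data.frame(supporters = 58:72,
           carole = round(pred_carole[k %in% 58:72], 4),
           paul   = round(pred_paul[k %in% 58:72], 4))
```

Plotting each vector against `k` (for instance with `plot(k, pred_carole, type = "h")`) should give a rough version of the two predictive distributions.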

Having made predictions, Paul and Carole collect their random sample. Of the 100 people in the sample, 62 are supportive. This seems to support Carole because the observation is closer to Carole's average prediction; however, Paul points out that he also predicted that 62 was possible, and so his hypothesis is not ruled out. It does seem obvious that the observation supports Carole, but by how much?
One way of answering this question is to ask about the relative weight of the evidence: that is, how convincing is the observation with respect to the two hypotheses in question? In order to answer this question, we need a way of understanding what it means for evidence to change what we believe. We start with the idea of prior odds, which describe the degree to which we favor one hypothesis over another before we see the data. These can be written as \[ \frac{P({\cal H}_c)}{P({\cal H}_p)} \] where \(P\) here represents plausibility and \({\cal H}_c\) and \({\cal H}_p\) are the hypotheses of Carole and Paul. We can also describe the posterior odds, which describe the degree to which we favor one hypothesis over another after observing the data: \[ \frac{P({\cal H}_c\mid y)}{P({\cal H}_p\mid y)} \] where \(y\) is what we observed.
The question now is how we move from the prior odds to the posterior odds in the right way. We cannot simply do this in any way we like; some ways must be better than others. Bayes' rule answers this question: it says that the relative plausibility of two hypotheses must be changed by the data in a particular way. \[ \frac{P({\cal H}_c\mid y)}{P({\cal H}_p\mid y)} = \frac{P(y\mid {\cal H}_c)}{P(y\mid {\cal H}_p)}\times\frac{P({\cal H}_c)}{P({\cal H}_p)} \] The first term on the right-hand side is the factor by which we multiply the prior odds to obtain the posterior odds. It represents how our relative beliefs should change in light of the data, and is called the Bayes factor. Luckily, the logic is quite simple:

The Bayes factor is the relative evidence in the data. The evidence in the data favors one hypothesis, relative to another, exactly to the degree that the hypothesis predicts the observed data better than the other.

We can now easily compute the evidence in the data favoring Carole over Paul. Carole specified that the observation \(y=62\) had a probability of 0.0754. Paul specified that the observation \(y=62\) had a probability of 0.0191. Under Carole's hypothesis, the observed data is much more likely than under Paul's, as shown in the figure below:

The Bayes factor, the evidence in the data, is precisely the factor by which Carole's line is taller than Paul's, which is:

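y <- 62; N <- 100; carole <- 0.6; paul <- 0.7  # observed count, sample size, hypothesized proportions
dbinom(y, N, carole)/dbinom(y, N, paul)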

## [1] 3.953

or about 4. The observed data favors Carole by a factor of 4.
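To see how this factor acts on prior odds through Bayes' rule, here is a small sketch. The three sets of prior odds are hypothetical readers invented for illustration; only the Bayes factor itself comes from the example.

```r
# Bayes factor for Carole over Paul from the example (62 of 100 supporters)
bf <- dbinom(62, 100, 0.6) / dbinom(62, 100, 0.7)

# Hypothetical prior odds (Carole : Paul) held by three different readers
prior_odds <- c(even = 1, leans_paul = 1/4, leans_carole = 2)

# Bayes' rule: posterior odds = Bayes factor * prior odds
posterior_odds <- bf * prior_odds

# If these two hypotheses are the only ones under consideration, the odds
# convert to a posterior probability for Carole's hypothesis
posterior_prob_carole <- posterior_odds / (1 + posterior_odds)

round(rbind(prior_odds, posterior_odds, posterior_prob_carole), 3)
```

Every reader's odds are multiplied by the same factor, whatever their starting point. A reader who began at 4 to 1 in favor of Paul ends up close to even odds.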
The figure below shows how much the data favor Carole for many possible observations. The data favor Carole when the Bayes factor is greater than 1, which happens for all observations of 65 or less; the data favor Paul when the Bayes factor is less than 1, which happens when the observation is greater than 65. For very large or small observations, the evidence is strong; for observations around 65, the evidence is slight, making it difficult to distinguish between the two hypotheses.
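The crossover described above can be checked by computing the Bayes factor for every possible observation. This is a sketch, not the code behind the figure; working on the log scale is a choice made here to keep the ratio numerically stable at extreme counts.

```r
N <- 100
y <- 0:N

# Log Bayes factor (Carole over Paul) for each possible number of supporters
log_bf <- dbinom(y, N, 0.6, log = TRUE) - dbinom(y, N, 0.7, log = TRUE)
bf <- exp(log_bf)

# Largest observation for which the data still favor Carole
largest_for_carole <- max(y[bf > 1])

# Evidence is strong far from the crossover and slight near it
data.frame(supporters = c(50, 62, 65, 66, 80),
           bayes_factor = signif(bf[y %in% c(50, 62, 65, 66, 80)], 4))
```

Each additional supporter multiplies the Bayes factor by the same constant less than 1, so on a log scale the curve is a straight line crossing zero between 65 and 66.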

Bayes factors when the parameter is uncertain

The above example was adequate for a simple demonstration, but lacks an important feature common to research: uncertainty about the true parameter value. Carole and Paul both had very specific hypotheses. Typically, this is not the case. Unless a hypothesis is very specific about a particular parameter (for instance, that an experimental manipulation has no effect), hypotheses are more diffuse.
Carole and Paul might have hypotheses that look something like the figure below:

Note that on the \(x\)-axis is the true proportion of the population who support the smoking bans. Suppose we call the true proportion of supporters \(\theta\). Instead of believing that the true proportion is exactly 60%, Carole now believes that it is “around” 60%. The blue distribution shows how her belief drops off as the proportion is further away from 0.6. Likewise, Paul has adopted a distribution around 0.7. We refer to these curves as \(p_c(\theta)\) for Carole and \(p_p(\theta)\) for Paul.
Since the Bayes factor is determined by the probability that each hypothesis assigns to the observed data, we must determine what Paul and Carole would now predict. Carole's hypothesis \({\cal H}_c\) is a distribution of values of \(\theta\), each weighted differently. Some values, like 0.6, are quite plausible; other values, like 0.7, are implausible. In order to determine the plausibility of a particular observation, we can determine the probability that it would occur given a particular value for \(\theta\) (that is, \(P(y\mid\theta)\)). We then weight this probability by the plausibility given to that particular \(\theta\) value, \(p(\theta)\). For each hypothesis, this can be represented by an integral: \[ P(y\mid{\cal H}_c) = \int P(y\mid \theta)p_c(\theta)\,d\theta \] where \(p_c(\theta)\) is the function representing Carole's hypothesis. The probability of an observation is thus a weighted average over all the different possible values for \(\theta\).
The figure below shows how the more diffuse hypothesis affects the predictions for data. In gray are the predictions from the old, single point hypothesis. In color are the new predictions from the diffuse hypothesis. The diffuse hypotheses have made the predictions for the data more diffuse. The bottom row shows the new predictions made by Carole and Paul side-by-side.
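The integral above can be evaluated numerically with R's `integrate`. The post does not state which distributions lie behind the figures, so the Beta(60, 40) and Beta(70, 30) priors below are assumptions chosen only because they are centered near 0.6 and 0.7. The resulting Bayes factor should therefore not be expected to match the value reported for the figures.

```r
y <- 62; N <- 100

# P(y | H) = integral over theta of P(y | theta) * p(theta),
# with p(theta) taken to be a Beta(a, b) density (an assumed form)
marginal_lik <- function(a, b) {
  integrand <- function(theta) dbinom(y, N, theta) * dbeta(theta, a, b)
  integrate(integrand, lower = 0, upper = 1)$value
}

m_carole <- marginal_lik(60, 40)  # hypothetical prior "around" 0.6
m_paul   <- marginal_lik(70, 30)  # hypothetical prior "around" 0.7

# For a Beta prior the integral has a closed form (the beta-binomial
# probability), which serves as a check on the numerical result
closed_form <- function(a, b) {
  exp(lchoose(N, y) + lbeta(y + a, N - y + b) - lbeta(a, b))
}

bf_diffuse <- m_carole / m_paul
c(numeric = m_carole, exact = closed_form(60, 40), bayes_factor = bf_diffuse)
```

Any other density on \(\theta\) could be substituted for `dbeta` in the integrand; the Beta family is convenient here only because the closed form makes the numerical answer easy to verify.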

We can now compare the two hypotheses in light of the observation of 62 supporters out of 100, as the figure below shows. The Bayes factor has been reduced to 2.8267 in favor of Carole. This makes sense: both Carole and Paul have hedged their predictions for the data, making them more spread out and, in the process, less distinct. The attenuated Bayes factor reflects this.

Uncertain hypotheses are not a problem for the Bayes factor to handle, as long as they can be specified as probability distributions, but hedging one's bets with more uncertain predictions will reduce the amount of evidence that can be accumulated in favor of one's hypothesis. This means that Bayes factors penalize flexible hypotheses, yielding a natural Occam's razor.
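The attenuation can be illustrated by making both hypotheses progressively more diffuse. The sketch below assumes Beta priors with means 0.6 and 0.7 and a shared concentration `k` (a parameterization chosen here for illustration, not taken from the post), and uses the closed-form beta-binomial probability for the marginal likelihood.

```r
y <- 62; N <- 100

# Log marginal likelihood under a Beta prior with mean m and concentration k,
# i.e. Beta(k * m, k * (1 - m)); larger k means a more specific hypothesis
log_marginal <- function(m, k) {
  a <- k * m
  b <- k * (1 - m)
  lchoose(N, y) + lbeta(y + a, N - y + b) - lbeta(a, b)
}

# Bayes factor for Carole (mean 0.6) over Paul (mean 0.7) at concentration k
bf_for <- function(k) exp(log_marginal(0.6, k) - log_marginal(0.7, k))

concentration <- c(10, 100, 1000, 1e6)
bf_by_k <- sapply(concentration, bf_for)

# Point hypotheses from the first example, for comparison
bf_point <- dbinom(y, N, 0.6) / dbinom(y, N, 0.7)

data.frame(concentration, bayes_factor = round(bf_by_k, 3))
```

In this setup the Bayes factor shrinks toward 1 as the priors become more diffuse and approaches the point-hypothesis value as they sharpen, which is the hedging penalty described above.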

Strength of the evidence

One might ask how much evidence is provided by the Bayes factors of 3.9534 and 2.8267 found above. The simple answer is that the number itself is directly interpretable, since it arises as the shift of relative odds. In this, the Bayes factor is different from other proposed (but ultimately flawed) evidential statistics, such as the \(p\) value, because such statistics have no direct evidential interpretation.
Nevertheless, various labels for the “strength” of a Bayes factor have been proposed. Kass and Raftery (1995), for instance, propose that Bayes factors between 1/3.2 and 3.2 are “not worth more than a bare mention”. See the Wikipedia entry for Bayes factors for more details.

Summary

Bayes factors represent the weight of evidence in the data for competing hypotheses. Bayes factors are the degree to which the data shift the relative odds between two hypotheses. There are principled reasons why we should interpret the Bayes factor as a measure of the strength of the relative evidence.
The Bayes factor is intimately linked to the predictions of a hypothesis. Because the evidence for a hypothesis is given by the degree to which it has predicted the observed data, hypotheses that do not make predictions for data cannot accumulate evidence. Hypotheses with no predictions, such as \(\mu\neq0\), are not allowed.
The Bayes factor can be directly interpreted, without recourse to labels. The strength of the Bayes factor is reflected by the fact that it is a multiplicative change in odds. However, some authors provide labels to help interpret evidence.

In the next post, we will discuss Bayes factors for one-sample designs with the BayesFactor package.