# Bench philosophy: Bayesian statistics

#### Confidence Multiplied by Evidence by Steven D. Buckingham, Labtimes 04/2011

In a recent bench philosophy feature we explored right and wrong ways to conduct the most common statistical tasks. In this issue, we will look at another statistical method that takes a completely different approach: Bayesian statistics.

Thomas Bayes

Bayesian statistics has a noble and ancient origin. Its creator, Thomas Bayes, was a Presbyterian minister in England in the early 1700s. His interest in probability led to the publication of his “Essay Towards Solving a Problem in the Doctrine of Chances”, commonly credited as the origin of Bayesian statistics. Three centuries later, progress in computer technology opened up new possibilities for the method, and today Bayes’ name crops up all over the place. Bayesian methods probably underlie your email spam filter; they have been used to infer the functions of genes, predict the financial market, read (and understand) text, and have even helped in diagnosing diseases. Even biologists are catching on: practically all relevant papers in epidemiology now use some type of Bayesian-based technique. And if the proponents of the Bayesian Brain (http://en.wikipedia.org/wiki/Bayesian_brain) are right, your wetware is at this very moment using Bayes to read this sentence.

The Bayesian approach to statistics is quite different to the understanding of statistics with which most of us are familiar. Chances are you were brought up, like me, with what is called the “frequentist” approach to statistics. In this view, the term “probability” refers to how often something happens compared to something else. “The probability of rolling a double six with two dice is 1/36,” is almost identical to saying, “over many rolls of the dice, I get a double six one time in 36”. But most of the time we mean something quite different when we talk about probabilities. Think of the following statement: “I am 99% sure that President Obama really was born in Hawaii,” or “I am 75% confident the defendant is guilty”. Clearly, you can’t apply frequencies here – there is only one President Obama (as far as I am aware) and the defendant is (hopefully) guilty or otherwise, but only once of a particular crime.

When we talk like this, we don’t mean frequencies – what we are really talking about are strengths of belief. And this is the way most Bayesian approaches think of probability. Before we go any further, we need to be sure exactly what Bayes’ theorem is. This powerful approach boils down to a very simple formula: decide how confident you are that something is true and multiply it by the evidence.

OK, that is an outrageous oversimplification, but it is pretty much the heart of what Bayesian statistics is all about, and illustrates its overarching philosophy. For a more sensible and more useful definition, please see the accompanying box.
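The standard statement of Bayes’ rule, which is presumably what the box sets out, can be written as follows. The symbols are labels chosen for this sketch, not necessarily the article’s own notation: H is the hypothesis (or belief), E is the evidence, P(H) is the prior, P(E | H) is the probability of the evidence if the hypothesis is true, P(E) is the overall probability of the evidence, and P(H | E) is the posterior.

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```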

This expression describes how an existing belief (the “prior”), held before any evidence is considered, is updated by the evidence to produce a new level of belief (the “posterior”).

Now let’s think of an example, and you will see how it approximates to common sense. Let’s say I am reading a story and I am not sure if the events described are taking place in summer (March - August) or winter (September - February). If that is all I know, then it is reasonable to assign a 0.5 probability to either hypothesis. This is what Bayesians call the “prior”. Thus my prior belief that the story takes place in summer is 0.5. Now, as I continue reading I come across a piece of evidence: the story mentions that there is snow on the ground. This is more likely in winter than in summer, so the evidence that it is winter is greater than 0.5. So, this evidence strengthens my belief that it is winter and weakens my belief that it is summer.

Weakened belief

The more critical reader (or the well-instructed student of statistics) will have noticed a little problem. Because you cannot have evidence with a probability greater than 1, this will mean that belief is continually weakened by evidence, never strengthened by it. The even keener-eyed will, however, have already seen in the Bayes formula in the box that we also have to divide the whole product by a normalising factor. This normalising factor is what statisticians call the “marginal probability”. It is, roughly speaking, the overall probability of getting that evidence, over all the various possibilities, regardless of whether the hypothesis (or belief) in question is true or not. In our winter story example, it would be the probability of getting snow at any time of the year.

But have you seen what has happened here? Far from being a fudge factor (go on, admit it – that was what you were thinking), this normalising factor is really one of the powerful features of Bayesian thinking. What it means is: you don’t just multiply the prior belief by the evidence (which is what UFO hunters do) but by the surprisingness of the evidence (which is what scientists do).

Let me show you what I mean, and you will immediately see why this is important. In our winter story example, the snow on the ground was the evidence. I would estimate that in Oxford, England (where I’m writing this piece), it snows on average about 7 days in the year. In my experience, it has always been in winter, except for once when it snowed on the day of my wedding in April, 1985. So, the probability of snow on any winter’s day is about 7 in 182 or 0.038 and the probability of snow on a summer’s day is one day in 26 years! I make that about 0.0001. Since either season is equally likely to begin with, the probability of snow on any given day of the year overall (the marginal probability) is 0.5 x 0.038 + 0.5 x 0.0001, or about 0.019, which agrees with 7 snowy days in 365. So applying Bayes’ rule, my confidence that it is summer – which began at 0.5 – is now 0.5 x 0.0001 / 0.019, which comes out to about 0.0026.
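The snow calculation can be sketched in a few lines of Python. This is only an illustration using the rough estimates quoted in the text; the variable names are invented here. The marginal probability is computed as the prior-weighted sum of the two seasonal likelihoods (the law of total probability), which guarantees that the two posterior beliefs add up to 1.

```python
# Prior beliefs: with no other information, each season is equally likely.
prior = {"summer": 0.5, "winter": 0.5}

# Probability of snow on a given day in each season (the article's rough estimates).
likelihood_snow = {
    "winter": 7 / 182,  # about 7 snowy days in a 182-day winter
    "summer": 0.0001,   # one snowy summer day in 26 years, as estimated in the text
}

# Marginal probability of snow: prior-weighted sum over both seasons.
marginal = sum(prior[s] * likelihood_snow[s] for s in prior)

# Bayes' rule: posterior = prior * likelihood / marginal.
posterior = {s: prior[s] * likelihood_snow[s] / marginal for s in prior}

for season, belief in posterior.items():
    print(f"P({season} | snow) = {belief:.4f}")
```

With these inputs, the belief that it is summer falls from 0.5 to well under 1%, and the belief that it is winter rises correspondingly.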

The important point is that Bayes automatically weighs evidence by its surprisingness. Evidence that is no more likely given your prior hypothesis than otherwise would not change your belief. If, for example, I read in the story that the detective leaves the office with a gun, that is just as likely given the hypothesis (it is summer) as it is marginally (it is winter or summer), so the evidence is normalised out to about 1, having little effect on updating my prior belief. Competent scientists do this all the time but the Bayes formula makes it explicit and even allows you to assign numbers.

The same thing happens when you have strong priors: strong confidence in a belief would need very solid and surprising evidence to overturn it. So a P of 0.01 on a T-test of an experiment showing extra-sensory perception will not overturn my confidence that there is no such thing, no matter how properly the experiment was done or how correctly the stats test was performed. So you can see that good scientists, even those (if there are any) who have never heard the name of Bayes, do Bayesian stats all the time.
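The effect of a strong prior is easiest to see in the odds form of Bayes’ rule, where the posterior odds are the prior odds multiplied by a likelihood ratio. The sketch below is illustrative only: the function name, the one-in-a-million prior for extra-sensory perception and the likelihood ratio of 100 are all assumptions made for the example. Note in particular that a P value is not itself a likelihood ratio, so the figure of 100 is simply a generous stand-in for “fairly strong evidence”.

```python
def update_belief(prior_prob, likelihood_ratio):
    """Return the posterior probability after one piece of evidence.

    likelihood_ratio = P(evidence | hypothesis) / P(evidence | not hypothesis)
    """
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Hypothetical numbers, chosen only to illustrate the argument.
sceptic = update_belief(prior_prob=1e-6, likelihood_ratio=100)      # strong prior against
open_minded = update_belief(prior_prob=0.5, likelihood_ratio=100)   # no strong prior
unsurprising = update_belief(prior_prob=0.5, likelihood_ratio=1)    # uninformative evidence

print(f"sceptic: {sceptic:.6f}")
print(f"open-minded: {open_minded:.4f}")
print(f"unsurprising evidence: {unsurprising:.2f}")
```

The same evidence that nearly convinces someone starting at 0.5 leaves the sceptic’s belief far below 1%, while evidence with a likelihood ratio of 1 (like the detective’s gun) leaves the belief exactly where it started.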

Bayes or T-test?

So if Bayes is so good, will it ever replace the humble T-test? I put this question to Harvey Motulsky, CEO and Founder of GraphPad Software Inc. and author of Intuitive Biostatistics. Motulsky points out that the question gets to the very core of how we use statistics every day in the lab. “First, a reality check. For many experimental biologists much of the time, the question of whether to use a Bayesian approach is irrelevant because, often, statistical analysis has nothing to do with understanding data,” says Motulsky. “The goal is to decorate a graph or table with asterisks, sprinkle the word ‘significant’ in the papers, and impress editors and reviewers.” Motulsky points out the often neglected fact that the results of any statistical analysis can only be interpreted at face value if you decided on the exact statistical methods as part of the experimental design. Motulsky drags many of us to the courts of statistical justice when he claims, “Many scientists try one analysis first, then try again with a subset of data, then switch to nonparametric tests, etc., and finally stop when they can get the computer to give them an asterisk. If no asterisk is forthcoming, they might run the experiment a few more times to increase sample size and try again. This is not really data analysis but rather is asterisk generation.”

The flip of a coin

Formal Bayesian methods are foreign to many experimental biologists, who prefer to generate P values or statements of statistical significance. But Motulsky points out that P values can, and should, be interpreted in a Bayesian context, by accounting for the experimental situation (the prior probability). Motulsky again, “How you interpret a P value depends on the context. Consider two extreme situations. One extreme situation is when two treatment groups of subjects in a randomised clinical trial are compared after they are randomised to receive alternative treatments but before receiving those treatments. You can be 100% sure that both data sets are sampled from identical populations. The only thing that distinguishes them is a flip of a coin. No matter how small the P value is, you can be 100% sure that the two populations are identical. The other extreme is a positive control run as part of an experimental protocol. You know for sure that the treatment affects the outcome. It doesn’t matter how high the P value is, you can be 100% sure that the populations (or distributions) differ. Most situations are in between these two extremes but it is essential to consider the context of the experiment when interpreting P values. It makes sense to demand stronger evidence when testing unlikely hypotheses.”

So can I abandon my T-tests and impress all my friends and colleagues with the Bayesian alternative? I doubt it. First of all, while you might impress your friends and relations, you may not meet with unanimous approval from journal editors or referees, some of whom prefer to remain in their frequentist comfort zone. Secondly, you won’t have much in the way of friendly software of the copy-paste-analyse variety. For the present, it is assumed that if you know Bayes you will do the work yourself. After all, Bayes is not difficult to understand. Perhaps that is why googling “bayes software” gets you lots of programs for experts – some of them in the form of C++ source code, if you want – but nothing below a very high entry level.

There is one notable exception: the BUGS suite (www.mrc-bsu.cam.ac.uk/bugs/). BUGS (Bayesian inference Using Gibbs Sampling) will empower a user with a moderate understanding of Bayes, not much more than outlined in this article, to build “Bayesian networks”. Here, instead of having one prior modified by evidence, the very evidence itself is expressed in Bayesian terms. In the winter’s tale we talked about earlier, it would amount to asking “given that there is snow on the ground and that the story is set on a movie set-up but the set-up may not be operating if it is late winter, what are the chances it is winter?” Can you get your head around that? Nor can I, but BUGS can. And it is unlikely that a Bayesian version of your favourite asterisk-generator will come along in the near future. I asked Motulsky if the maker of GraphPad Prism, a well-known stats package popular (rightly, in my view) with biologists, had received many requests for Bayesian modules. “Hardly any,” he replied. He then added, “Perhaps zero.”

Model comparison

But this isn’t answering our question. Can we replace the standard hypothesis tests with a Bayes alternative? Some statisticians are trying to do just that, and there are signs that a model comparison metric, known as the Bayesian Information Criterion, is making its way into standard paste-and-go stats packages. But remember that Bayes was never meant to be a replacement for hypothesis testing – it is more about model comparison. In this approach, Bayes is used to calculate the relative likelihoods of two or more models under review, usually expressed as the log likelihood ratio.
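As a rough illustration of the model comparison idea, the Bayesian Information Criterion is commonly defined as k ln(n) − 2 ln(L), where k is the number of fitted parameters, n the number of observations and L the maximised likelihood of the model; the model with the lower value is preferred. The sketch below uses that standard definition, but the function name and the two sets of fit results are hypothetical and invented purely for the example.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits of two competing models to the same 50 observations.
bic_simple = bic(log_likelihood=-105.0, n_params=2, n_obs=50)
bic_complex = bic(log_likelihood=-103.0, n_params=5, n_obs=50)

# The complex model fits slightly better but pays a penalty for its extra parameters.
preferred = "simple" if bic_simple < bic_complex else "complex"
print(f"BIC simple = {bic_simple:.1f}, BIC complex = {bic_complex:.1f}, prefer {preferred}")
```

The penalty term k ln(n) is what stops the criterion from always rewarding the model with the most parameters.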

Until we see demand from scientists and journal editors driving the development of a Bayes equivalent of GraphPad or SPSS, most of us will probably go with the frequentist flow. But not, I hope, just content with consulting the asterisk oracle.

Last Changed: 10.11.2012