Why I am not a Bayesian*

JAC: Today Greg contributes his opinion on the use of Bayesian inference in statistics. I know that many—perhaps most—readers aren’t familiar with this, but it’s of interest to those who are. Further, lots of secular bloggers either write about or use Bayesian inference, as when inferring the probability that Jesus existed given the scanty data. (Theists use it too, sometimes to calculate the probability that God exists given some observations, like the supposed fine-tuning of the Universe’s physical constants.)

When I warned Greg about the difficulty some readers might have, he replied that, “I tried to keep it simple, but it is, as Paul Krugman says about some of his posts, ‘wonkish’.” So wonkish we shall have!

___________

by Greg Mayer

Last month, in a post by Jerry about Tanya Luhrmann’s alleged supernatural experiences, I used a Bayesian argument to critique her claims, remarking parenthetically that I am not a Bayesian. A couple of readers asked me why I wasn’t a Bayesian, and I promised to reply more fully later. So, here goes; it is, as Paul Krugman says, “wonkish”.

Approaches to inference

I studied statistics as an undergraduate and graduate student with some of the luminaries in the field, used statistics, and helped people with statistics; but it wasn’t until I began teaching the subject that I really thought about the logical basis of the subject. Trying to explain to students why we were doing what we were doing forced me to explain it to myself. And, I wasn’t happy with some of those explanations. So, I began looking more deeply into the logic of statistical inference. Influenced strongly by the writings of Ian Hacking, Richard Royall, and especially the geneticist A.W.F. Edwards, I’ve come to adopt a version of the likelihood approach. The likelihood approach takes it that the goal of statistical inference is the same as that of scientific inference, and that the operationalization of this goal is to treat our observations as data bearing upon the adequacy of our theories. Not all approaches to statistical inference share this goal. Some are more modest, and some are more ambitious.

The more modest approach to statistical inference is that of Jerzy Neyman and Egon Pearson. In the Neyman-Pearson approach, one is concerned to adopt rules of behavior that minimize one’s mistakes. For example, buying a mega-pack of paper towels at Sam’s Club, and then finding that they are of unacceptably low quality, would be a mistake. They define two sorts of errors that might occur in making decisions, and see statistics as a way of reducing one’s decision-making error rates. Although they, and especially Neyman, made some quite grandiose claims for their views, the whole approach seems rather despairing to me: having given up on any attempt to obtain knowledge about the world, they settle for a clean, well-lighted place, or at least one in which the light bulbs usually work. While their approach makes perfect sense in the context of industrial quality control, it is not a suitable basis for scientific inference (which, indeed, Neyman thought was not possible).

The approach of R.A. Fisher, founder of modern statistics and evolutionary theory, shares with the likelihood approach the goal of treating our observations as data bearing upon the adequacy of our theories, and the two approaches also share many statistical procedures, but differ most notably on the issue of significance testing (i.e., those “p” values you often see in scientific papers, or commentaries upon them). What is actually taught and practiced by most scientists today is a hodge-podge of the Neyman-Pearson and Fisherian approaches. Much of the language and theory of Neyman-Pearson is used (e.g., types of errors), but, since few or no scientists actually want to do what Neyman and Pearson wanted to do, current statistical practice is suffused with an evidential interpretation quite congenial to Fisher, but foreign to the Neyman-Pearson approach.

Bayesianism, like the Fisherian and likelihood approaches, also sees our observations as data bearing upon the adequacy of our theories, but is more ambitious in wanting to have a formal, quantitative method for integrating what we learn from observation with everything else we know or believe, in order to come up with a single numerical measure of rational belief in propositions.

So, what is Bayesianism?

The Rev. Thomas Bayes was an 18th century English Nonconformist minister. His “An Essay Towards Solving a Problem in the Doctrine of Chances” was published in 1763, two years after his death. In the Essay, Bayes proved the famous theorem that now bears his name. The theorem is a useful, important, and nonproblematic result in probability theory. In modern notation, it states

P(H∣D) = [P(D∣H)⋅P(H)]/P(D).

In words, the probability P of an hypothesis H in the light of data D is equal to the probability of the data if the hypothesis were true (called the hypothesis’s likelihood) times the probability of the hypothesis prior to obtaining data D, with the product divided by the unconditional probability of the data (for any given problem, this would be a constant). Ignoring the constant in the denominator, P(D), we can say that the posterior probability, P(H∣D), (the probability of the hypothesis after we see the data), is proportional to the likelihood of the hypothesis in light of the data, P(D∣H), (the probability of the data if the hypothesis were true), times the prior probability, P(H), (the probability we gave to the hypothesis before we saw the data).

The theorem has many uncontroversial applications in fields such as genetics and medical diagnosis. These applications may be thought of as two-stage experiments, in which an initial experiment (or background set of observations) establishes probabilities for each of a set of exhaustive and mutually exclusive hypotheses, while the results of a second experiment (or set of observations), providing data D, are used to reevaluate the probabilities of the hypotheses. Thus, knowing something about the grandparents of a set of offspring may influence my evaluation of genetic hypotheses concerning the offspring. Or, in making a diagnosis, I may include in my calculations the known prevalence of a disease in the population, as well as the test results on a particular patient. For example, suppose a 95% accurate test for disease X (one that gives the right answer 95% of the time, whether or not the patient has the disease) is positive (+) for a patient, and the disease X is known to occur in 1% of the population. Then, by Bayes’ Theorem

P(X∣+) = P(+∣X)⋅P(X)/P(+)

= (.95)(.01)/[(.95)(.01)+(.05)(.99)]

= .16.

The probability that the patient has the disease is thus 16%. Note that despite the positive result on a pretty accurate test, the odds are more than four to one against the patient actually having condition X. This is because, since the disease is quite rare, most of the positive tests are false positives. [JAC: This is a common and counterintuitive result that could be of practical use to those of you who get a positive test. Such tests almost always mandate re-testing!]
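The arithmetic in the diagnosis example can be checked with a few lines of Python. This is only a sketch: the function name `posterior` and the dictionary layout are illustrative choices, and it assumes that "95% accurate" means P(+∣X) = .95 and P(+∣not X) = .05, which is the reading the calculation above uses.

```python
def posterior(priors, likelihoods):
    """Bayes' Theorem for a set of exhaustive, mutually exclusive hypotheses.

    priors:      {hypothesis: P(H)}
    likelihoods: {hypothesis: P(D|H)} for the observed data D
    Returns      {hypothesis: P(H|D)}
    """
    # P(D) is the sum of P(D|H) * P(H) over all hypotheses.
    p_data = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / p_data for h in priors}


# Assumed reading of "95% accurate": P(+|X) = .95 and P(+|not X) = .05.
priors = {"X": 0.01, "not X": 0.99}
positive_test = {"X": 0.95, "not X": 0.05}

result = posterior(priors, positive_test)
print(round(result["X"], 2))  # 0.16
```

Changing the prevalence in `priors` shows how strongly the answer depends on the first-stage information: the same positive test means something quite different for a common disease.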

So what could be controversial? Well, what if there is no first stage experiment or background knowledge which gives a probability distribution to the hypotheses? Bayes proposed what is known as Bayes’ Postulate: in the absence of prior information, each of the specifiable hypotheses should be accorded equal probability, or, for a continuum of hypotheses, a uniform distribution of probabilities. Bayes’ Postulate is an attempt to specify a probability distribution for ignorance. Thus, if I am studying the relative frequency of some event (which must range from 0 to 1), Bayes’ Postulate says I should assign a probability of .5 to the hypothesis that the event has a frequency greater than .5, and that the hypothesis that the frequency of the event falls between .25 and .40 should be given a probability of .15, and so on. But is Bayes’ Postulate a good idea?

Problems with Bayes’ Postulate

Let’s look at a simple genetic example: a gene with two alleles (forms) at the locus (say alleles A and a). The two alleles have frequencies p and q, with p + q = 1, and, if there are no evolutionary forces acting on the population and mating is at random, then the three genotypes (AA, Aa, and aa) will have the frequencies p², 2pq and q², respectively. If I am addressing the frequency of allele a, and I am a Bayesian, then I assign equal prior probability to all possible values of q, so

P(q>.5) = .5

But this implies that the frequency of the aa genotype has a non-uniform prior probability distribution

P(q²>.25) = .5.

My ignorance concerning q has become rather definite knowledge concerning q² (which, if there is genetic dominance at the locus, would be the frequency of recessive homozygotes; as in Mendel’s short pea plants, this is a very common way in which we observe the data). This apparent conversion of ‘ignorance’ to ‘knowledge’ will be generally so: prior probabilities are not invariant to parameter transformation (in this case, the transformation is the squaring of q). And even more generally, there will be no unique, objective distribution for ignorance. Lacking a genuine prior distribution (which we do have in the diagnosis example above), reasonable men may disagree on how to represent their ignorance. As Royall (1997) put it, “pure ignorance cannot be represented by a probability distribution”.
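To make the non-invariance concrete, here is a small sketch (the function names are mine, not taken from any of the works cited). If q is uniform on [0, 1], then q² falls in an interval exactly when q falls between the square roots of its endpoints, so the prior implied for q² can be computed directly.

```python
import math


def prob_q_between(lo, hi):
    """P(lo < q < hi) when q has a uniform prior on [0, 1]."""
    return hi - lo


def prob_q_squared_between(lo, hi):
    """P(lo < q**2 < hi) implied by that same uniform prior on q.

    q**2 lies in (lo, hi) exactly when q lies in (sqrt(lo), sqrt(hi)).
    """
    return prob_q_between(math.sqrt(lo), math.sqrt(hi))


# Four intervals of equal width for q**2 (the frequency of aa).
for lo, hi in [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]:
    print(f"P({lo:.2f} < q^2 < {hi:.2f}) = {prob_q_squared_between(lo, hi):.3f}")
```

A flat prior on q² would give each of these four intervals a probability of .25. The uniform prior on q instead puts half of the probability on the lowest quarter of the range, which is the "rather definite knowledge" described above.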

Bayesian inference

Bayesians proceed by using Bayes’ Postulate as a starting point, and then update their beliefs by using Bayes’ Theorem:

Posterior probability ∝ Likelihood × Prior probability

which can also be given as

Posterior opinion ∝ Likelihood × Prior opinion.

The appeal of Bayesianism is that it provides an all-encompassing, quantitative method for assessing the rational degree of belief in hypotheses. But there is still the problem of prior probabilities: what should we pick as our prior probabilities if there is no first-stage set of data to give us such a probability? Bayes’ Postulate doesn’t solve the problem, because there is no unique measure of ignorance. We must choose some prior probability distribution in order to carry out the Bayesian calculation, but you may choose a different distribution from the one I do, and neither is ‘correct’: the choice is subjective.

There are three ways round the problem of prior distributions. First, try really hard to find an objective way of portraying ignorance. This hasn’t worked yet, but some people are still trying. Second, note that the prior probabilities make little difference to the posterior probability as more and more data accumulate (i.e. as more experiments/observations provide more likelihoods), viz.

P(posterior) ∝ P(prior) × Likelihood × Likelihood × Likelihood × . . .

In the end, only the likelihoods make a difference; but this is less a defense of Bayesianism than a surrender to likelihood. Third, boldly embrace subjectivity. But then, since everyone has their own prior, the only things we can agree upon are the likelihoods. So, why not just use the likelihoods?
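The second way round, that priors matter less and less as data accumulate, can be illustrated with a short sketch. It assumes several things the post does not specify: that the unknown is a relative frequency, that the data are counts of occurrences in independent trials, and that each prior is a Beta distribution (chosen only because the posterior then has a closed form). The two priors and the data sets are made up for illustration.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a frequency under a Beta(prior_a, prior_b) prior.

    With binomial data the posterior is
    Beta(prior_a + successes, prior_b + trials - successes).
    """
    return (prior_a + successes) / (prior_a + prior_b + trials)


flat_prior = (1, 1)          # Bayes' Postulate: uniform on [0, 1]
opinionated_prior = (20, 5)  # hypothetical prior favoring high frequencies

# Hypothetical data sets in which the event occurs 30% of the time.
for trials in (10, 100, 10_000):
    successes = (3 * trials) // 10
    m1 = posterior_mean(*flat_prior, successes, trials)
    m2 = posterior_mean(*opinionated_prior, successes, trials)
    print(f"n = {trials:>6}: {m1:.3f} vs {m2:.3f} (gap {abs(m1 - m2):.3f})")
```

With only ten observations the two analysts disagree sharply; with ten thousand, both posterior means sit close to .30, and the starting opinions have all but disappeared from the answer.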

The problem with Bayesianism is that it asks the wrong question. It asks, ‘How should I modify my current beliefs in the light of the data?’, rather than ‘Which hypotheses are best supported by the data?’. Bayesianism tells me (and me alone) what to believe, while likelihood tells us (all of us) what the data say.
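One way to see "what the data say" in practice is the likelihood ratio at the center of the likelihood approach: compare two hypotheses by the ratio of the probabilities they give to the observed data. The sketch below assumes binomial data and two made-up hypotheses about a frequency q; the numbers are hypothetical, and no prior appears anywhere in the calculation.

```python
from math import comb


def likelihood(q, successes, trials):
    """P(data | q): binomial probability of the observed count."""
    return comb(trials, successes) * q**successes * (1 - q) ** (trials - successes)


# Hypothetical data: the event was seen 3 times in 10 trials.
successes, trials = 3, 10

# Two hypothetical hypotheses about the frequency q.
ratio = likelihood(0.3, successes, trials) / likelihood(0.5, successes, trials)
print(f"The data support q = 0.3 over q = 0.5 by a factor of {ratio:.2f}")
```

Anyone who accepts the binomial model gets the same ratio from the same data, whatever they believed beforehand.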

*Apologies to Clark Glymour and Bertrand Russell.


Further Reading

The best and easiest place to start is with Sober and Royall.

Edwards, A.W.F. 1992. Likelihood. Expanded edition. Johns Hopkins University Press, Baltimore. An at times terse, but frequently witty, book that rewards careful study. In many ways, the founding document of likelihood inference; to paraphrase Darwin, it is ‘origin of all my views’.

Gigerenzer, G., et al. 1989. The Empire of Chance. Cambridge University Press, Cambridge. A history of probability and statistics, including how the incompatible approaches of Fisher and Neyman-Pearson became hybridized into textbook orthodoxy.

Hacking, I. 1965. The Logic of Statistical Inference. Cambridge University Press, Cambridge. Hacking’s argument for likelihood as the fundamental concept for inference; he later changed his mind.

Hacking, I. 2001. An Introduction to Probability and Inductive Logic. Cambridge University Press, Cambridge. A well-written introductory textbook reflecting Hacking’s now more eclectic, and specifically Bayesian, views.

Royall, R. 1997. Statistical Evidence: a Likelihood Paradigm. Chapman & Hall, London. A very clear exposition of the likelihood approach, requiring little mathematical expertise. Along with Edwards, the key work in likelihood inference.

Sober, E. 2002. Bayesianism: Its Scope and Limits. Pp. 21-38 in R. Swinburne, ed., Bayes’ Theorem. Proceedings of the British Academy Press, vol. 113. An examination of the limits of both Bayesian and likelihood approaches (read this first!).

130 Comments

  1. Aaron S.
    Posted April 16, 2015 at 8:57 am | Permalink

    “Bayesianism tells me (and me alone) what to believe, while likelihood tells us (all of us) what the data say.”

    But that’s not an argument against Bayesianism. The end goal (or one of them) of scientific exploration is to determine what you should believe about the universe. A list of data might be interesting, but unless that data shapes a belief, it’s not really useful. Likelihood alone will never tell you what to believe (or change your beliefs).

    • Aaron S.
      Posted April 16, 2015 at 9:12 am | Permalink

      I think it helps to understand that all Probability is in the Mind, since that makes it obvious that the proper use of data should, indeed, be to modify one’s beliefs in a Bayesian manner.

      • darrelle
        Posted April 16, 2015 at 10:48 am | Permalink

        But you want to modify your beliefs to be as congruent with reality as you can achieve, right? I think Greg’s point is that if you have no prior data and you arbitrarily assign P .5 to all priors, then you cannot reasonably have significant confidence in your assessment. Also, assigning the same P to all priors, for example, is much the same as not considering priors at all when comparing competing hypotheses.

        I think I agree with Greg. It seems to me to be pretty common that people give Bayes way too much weight on problems in which there is little or no data about priors. I am, in particular, thinking about people using Bayes to support their religious beliefs, though it certainly isn’t limited to that. There is an excellent example of this over at Sean Carroll’s website right now, for anyone who has the courage, or endurance, to look.

        • Aaron S.
          Posted April 16, 2015 at 11:02 am | Permalink

          I think the problem of arbitrarily assigning priors is really no problem at all. Even a single small scrap of evidence is enough to modify an arbitrary prior into something meaningful. And really, situations where you literally have no prior belief are very few and far between. Any gut feeling is a good enough prior.

        • Posted April 16, 2015 at 12:07 pm | Permalink

          If one is assigning probability to *hypotheses*, rather than *events* (in the ontological sense, not the “set of points” sense), one is *already* a Bayesian about *probability*. I have yet to figure out if that necessarily makes one a Bayesian about *statistics* and *methodological*/*inference* matters.

      • Torbjörn Larsson, OM
        Posted April 17, 2015 at 5:00 am | Permalink

        “Probability is in the Mind”.

        Except of course it isn’t, since you can define objective probabilities. That is vital (see significance testing).

        All else is speculation (betting, if informed speculation).

        • Torbjörn Larsson, OM
          Posted April 17, 2015 at 5:02 am | Permalink

          Let me add that choosing to include speculation or not is philosophical, not practical. (As long as you are careful.)

          So uninteresting. [Yudkowsky is such a bore, frankly.]

  2. Avis James
    Posted April 16, 2015 at 9:03 am | Permalink

    Thanks! That was a fun morning read!

  3. Shawn Beaulieu
    Posted April 16, 2015 at 9:08 am | Permalink

    Excited to read this, but I have to leave for class soon. However, I’m commenting first to let Jerry know I haven’t passed it over!

    Hopefully the lack of comments won’t discourage him, or Greg, from publishing posts like this in the future.

  4. Jack
    Posted April 16, 2015 at 9:09 am | Permalink

    When you write “then I assign equal prior probability to all possible values of q, so
    P(q)>.5 = .5”, you mean that the probability that q > 0.5 is 0.5, right? Shouldn’t that be stated as P(q>.5) = .5?

    • Posted April 16, 2015 at 10:00 am | Permalink

      Yes, and in the next line too. Parentheses now corrected– thanks!

      GCM

  5. bobkillian
    Posted April 16, 2015 at 9:23 am | Permalink

    I’m going to have to read this three times. I hope comments will clarify the wonkishness….

  6. Posted April 16, 2015 at 9:24 am | Permalink

    Reblogged this on CancerEvo and commented:
    Against Bayes… at least on some occasions.

  7. marksolock
    Posted April 16, 2015 at 9:36 am | Permalink

    Reblogged this on Mark Solock Blog.

  8. quiscalus
    Posted April 16, 2015 at 9:42 am | Permalink

    Reading this reminds me of a great quote from the unfortunate movie, Outlaw Josey Wales; “Endeavor to Persevere”. I did, but can’t say I am more enlightened for it. Two problems exist to inhibit my understanding, one, that I am the product of the American public football, er, I mean, school system, and two, my mathematical fluency is equal to that of a two-year-old child learning to speak. Basically, I can inform you I made a poopy, but beyond that, it’s either laughing, crying, or banging things on the table. Or to put it simply, I have trouble translating maths into English.

    For the sake of those readers most like me, I’d say that the use of mathematical and statistical language is a barrier to comprehension and I was not always sure where the numbers you used came from. Also, while this is a science web site, many may not have encountered Hardy-Weinberg or, like me, have not seen, much less used, it since Biology 101, so that may require a brief explanation, although it is only a quick wiki-search away. I’m not saying you need to dumb down the post, only that it may require more sidebar explanations for the mathematically illiterate and statistically stupefied readers like me.

    A critique, not a criticism. I don’t think anyone who visits this site is looking for softball fluff pieces and I enjoy a challenge. Having said that, for those of us who read it, I think we’ve earned some cat pics!

    • Posted April 16, 2015 at 10:04 am | Permalink

      Yes, cat pics!! And cockatoos!

      It seems to my poor skills that Bayesian methods work better the less we know? By which I mean that if we have little knowledge about something we would want to assume everything (more or less) equal but the more we know, you need to adjust.

      Forgive me but I’ll slip into a sports analogy. In baseball, having never seen a batter I would probably play the defense straight up but once I’ve seen the batter hit a few times I should probably just hypothesize that he’s going to hit to right field (a likelihood) and adjust to that.

  9. daveyc
    Posted April 16, 2015 at 9:44 am | Permalink

    Greg,
    I’m curious how you feel about the use of Bayesian methods in phylogenetic inference. While maximum likelihood approaches still get a lot of use (especially with very large datasets), it seems to me that Bayesian methods have become the “state of the art” for much of the Systematics community.

    • Posted April 16, 2015 at 9:57 am | Permalink

      But I don’t think very many systematists choose Bayesian methods because they’re convinced Bayesians. They just do it because there are Bayesian analysis programs that seem to produce results, possibly in less computer time than likelihood programs.

      And when you do use Bayesian analyses, assigning priors is always a major bone of contention.

      • gluonspring
        Posted April 16, 2015 at 10:28 am | Permalink

        I think this highlights an important point that people should be aware of when viewing this debate from the sidelines: if you have decent data you will probably get the same results with Bayesian statistics. In that sense the phrase “convinced Bayesians” is important here because in many cases the debate is about a philosophical position, not about practical utility. It would be silly, for example, to dismiss a published result based solely on the fact that they used “untrustworthy” Bayesian statistics, just as it’d be silly to dismiss a result because they used frequentist statistics.

      • daveyc
        Posted April 16, 2015 at 10:54 am | Permalink

        I definitely think you’re right in general that the majority of users of Bayesian analytic methods are not necessarily committed Bayesians (I’m not). That said, if you look at most recent issues of Systematic Biology, my guess is you’ll see that more often than not (although there are of course exceptions), the researchers who are spending their time developing new methods and theoretical tools for analyzing and inferring phylogenies seem to have a preference for Bayesian methods.

        Perhaps some of that is due to the computational efficiency of Bayesian vs ML, although modern ML methods (e.g., RaxML) are much, much more efficient than older methods (e.g., PAUP). Also, I think the efficiency advantage of Bayesian methods may depend on what type of analysis you’re doing. In my experience, a concatenated analysis in RaxML will run much faster than *Beast Bayesian species tree using coalescent models.

        However, I think that at least part of the reason that Bayesian methods are in heavy use is that they have some desirable properties that are, perhaps, not shared by ML methods. For example, I think many people appreciate the fact that Bayesian methods tend to do a better job of accounting for phylogenetic uncertainty through the output of a posterior distribution of trees and other parameter values. Also, many of the Bayesian programs that I am aware of are able to implement a much larger and more complex suite of evolutionary models in a variety of phylogenetic contexts compared to ML methods, but again, that may come down to the computational efficiency point you brought up already.

        All that said, I am just a lowly evolutionary biology grad student, and I have much to learn about likelihood and Bayesian statistics.

      • darrelle
        Posted April 16, 2015 at 10:55 am | Permalink

        I think it is definitely all about assigning priors. If there is plenty of solid data to do that, no problem. When there isn’t?

      • Posted April 16, 2015 at 5:18 pm | Permalink

        John Harshman,

        In my experience, Bayesian phylogenetics is the slowest of all approaches. I believe that many colleagues really use it because they think it is somehow fancier, state of the art, etc.

        And then there is the issue of what methods are available. Although I personally prefer simpler methods with fewer assumptions – I’d like to know where the computer has its hands, so to say – there are sometimes no alternatives. BEAST does certain things that you wouldn’t easily be able to do using other software.

        daveyc,

        I don’t really get the uncertainty argument. A ‘fundie’ Bayesian colleague once said the same to me, and I immediately wondered: had she never heard of bootstrapping, jackknifing, decay indices and consensus trees?

        • daveyc
          Posted April 16, 2015 at 6:55 pm | Permalink

          I think the methods of assessing topological confidence that you’re describing (bootstrapping, etc.) are in some ways fundamentally different than the posterior distribution of values you get from the MCMC chain of a Bayesian analysis. With a Bootstrap ML analysis you’re resampling your data to determine how frequently a particular clade is recovered. With the posterior distribution from an MCMC analysis, you’re resampling your entire set of parameters by treating them essentially as random variables and are able to get an idea of the uncertainty present for all of your parameters, not just the tree topology.

          Anyway, that’s the argument. Like I said above, I’m by no means an expert.

          • Posted April 16, 2015 at 11:41 pm | Permalink

            I see, that makes sense; then again, in our area people are usually predominantly interested in tree topology and perhaps branch lengths anyway. And if you are using a likelihood tool like r8s for divergence dating for example there are also ways of getting confidence intervals for the divergence times.

    • Posted April 17, 2015 at 10:37 am | Permalink

      The earlier comments in reply are well worth reading; here are my thoughts, not very different from some already expressed.

      My own view is that systematists, like most scientists, are pragmatic and results-oriented. The wide use of Bayesian methods reflects not a deep commitment to Bayesianism, but rather systematists using available and respected tools to answer the questions that are important to them. How a method becomes available and respected is a history/sociology of science doctoral thesis that needs to be written.

      I recently spoke about this with a colleague and a student of his who are engaged in phylogenetic work. They use Bayesian methods, but not through any commitment to Bayesianism, but because some phylogenetic programs are readily at hand and frequently used to solve the kind of problems they are working on (i.e. the methods are ‘industry standard’). I (and I think they) enjoyed our discussion of Bayesianism per se, but it was clear that these were not the considerations they had in mind when choosing methods of analysis.

      I do not strongly criticize Bayesian phylogenetic methods because 1) I haven’t used them myself, and thus don’t know how they ‘work under the hood’; and 2) some of the people developing such methods are really smart guys who have previously worked on likelihood methods, and I figure there’s a fair chance they know what they’re doing (which reasons are relevant to the questions of how methods become available and respected).

      In my own work on population structure, I’ve been attracted to the results one gets with Structure, a Bayesian population assignment program. Expressing my hesitations to a colleague who knows how everything works under the hood (and is not a committed Bayesian), he said not to be put off by the Bayesian aspect, and just go ahead and use the methods.

      GCM

      • daveyc
        Posted April 17, 2015 at 5:36 pm | Permalink

        Thanks for your response!

      • Posted April 17, 2015 at 6:33 pm | Permalink

        This pragmatism is precisely what seems most sensible to me.

        The people who are deeply committed to only one method worry me a bit, especially in their roles as referees and editors…

    • aspidoscelis
      Posted April 19, 2015 at 3:55 pm | Permalink

      My $.02:

      I don’t think anyone in the systematics community thinks the use of prior probabilities is desirable. I haven’t been reading the literature too much in the last couple of years, but the last time I checked everyone was trying to use uninformative priors. If you don’t want your results to be affected by prior probabilities, so far as I can tell there is no conceptual reason to use a Bayesian approach. It’s like buying a hybrid car and then doing your best to disconnect the batteries and electric motor.

      However, for those of us who are users rather than developers of phylogenetic methods there are two pragmatic reasons to use a Bayesian approach: 1) it either is, or is believed to be, more computationally efficient than maximum likelihood; 2) it universally assigns higher numbers to the certainty of clades.

      The first pragmatic reason is, IMO, basically justifiable. Phylogenetic analyses can still be prohibitively slow on the kind of computational infrastructure available to your average researcher. Projects like CIPRES have largely eliminated this problem by now, though, by making big processing power widely available.

      The second pragmatic reason is a cause for concern. Given that we’re under pressure to present phylogenetic results that look informative and relatively certain and that maximum likelihood bootstraps represent a coin flip with a value of “50” and Bayesian posterior probabilities represent a coin flip with a value of “90”, we’re going to use the Bayesian method. Even if we know the numbers aren’t comparable, we’re not very good at doing the mental corrections to evaluate clade support that is presented using different scales. Bigger numbers win. “90” looks like good support for a clade even though we know it isn’t.

  10. Delphin
    Posted April 16, 2015 at 9:45 am | Permalink

    THANK YOU
    I was raised properly and carefully, as a frequentist. Recently I have been wondering if I am turning into a Bayesian. So this, and the list, are particularly interesting topics for me.

    • Alec
      Posted April 17, 2015 at 7:34 pm | Permalink

      If you are worried about this you could always try to pray away the Bayes.

      • Posted April 17, 2015 at 7:38 pm | Permalink

        Yes, but the prior probability suggests that the chances of that succeeding are awfully slim….

        b&

  11. drakodoc
    Posted April 16, 2015 at 9:45 am | Permalink

    Excellent post. I deal with pretest probability on a daily basis (medical testing) but had never thought through the implications of where to start if you have, a priori, zero information on likelihood. I can’t fault Bayes for starting with 50/50 and going from there but can see where that may lead to either confusion or dead ends.

  12. Dominic
    Posted April 16, 2015 at 9:49 am | Permalink

    I appreciated these educative posts! This is one to sit down & digest however…!

  13. Posted April 16, 2015 at 9:51 am | Permalink

    In the end, only the likelihoods make a difference

    I don’t know any Bayesians who don’t already admit this in some manner, since Bayesian updates are meant to be recursive.

  14. Rhinanthus
    Posted April 16, 2015 at 9:54 am | Permalink

    As the post says, Bayes Theorem is uncontroversial and can be derived from the axioms of probability.
    Posterior probability ∝ Likelihood × Prior probability

    Therefore, one cannot simultaneously argue that the equation is correct but prior probabilities are meaningless! Knowing the probability of the data, given the hypothesis (likelihood) doesn’t tell us how likely the hypothesis is given the data until we know how probable the hypothesis was before we knew the data.

    The example given, in which squaring a prior probability implies added prior knowledge about its square, simply means that an objective prior must be invariant to transformations (i.e. changes in scale). This is the modern understanding in which maximally uninformative priors are those that include the information contained in the definition of the variable (is it discrete or continuous? Is it bounded or not? etc) but are otherwise invariant to changes in measurement scale. An example is Jaynes, E.T. Probability Theory. The logic of science.

    • darrelle
      Posted April 16, 2015 at 11:04 am | Permalink

      “Therefore, one cannot simultaneously argue that the equation is correct but prior probabilities are meaningless!”

      Not attempting to speak for Greg, but my interpretation of his criticism is that he is directing it specifically at cases where Bayes’ Postulate must be used due to ignorance, and not generally against Bayes’ Theorem.

      • Rhinanthus
        Posted April 16, 2015 at 11:30 am | Permalink

        But my point is that “Bayes Postulate” (in the absence of prior knowledge, assign equal probabilities) is an old (centuries old…) idea that has been superseded. In newer versions, a maximally uninformative prior is assigned based on (i) axiomatic properties of the variable and (ii) the requirement that these prior probabilities are invariant to changes in the measurement scale. The only time when Bayes Postulate is appropriate is when the variable is (1) discrete, (2) bounded and (3) unordered.

    • Posted April 16, 2015 at 12:13 pm | Permalink

      The Bayesian *about probability* thinks that one can assign probability to hypotheses.

      Other viewpoints about probability *deny* this: frequentists and propensity-ists. (About probability I am the latter, as it happens – like Bunge, Popper, Mellor, Ville, etc.)

      So in my view one should go the whole way if one is convinced one can do this – I for one am not. (Frequentism as an understanding of probability was refuted in 1939 by Jean Ville – but this is little known – in his book _Etude critique de la notion de collectif_ – but this does not necessarily refute frequentist *statistics*, which come in as sort of a pun.)

    • Mark Peterson
      Posted April 16, 2015 at 2:30 pm | Permalink

      This was exactly my thought as I read through this: knowing that the probability of a homozygous recessive is q^2 is additional information that *should* affect your prior beliefs. Setting priors makes prior assumptions explicit, rather than implicit in the design of an experiment.

      As others here have said, Bayesian methods are most useful when there is a strong prior and limited data. Collecting additional data reduces the influence of the prior, but the place where these methods really differ is when the data are explicitly limited. In Bayesian search theory (http://en.wikipedia.org/wiki/Bayesian_search_theory), for example any one piece of data is far too small to reject any frequentist test, and even low-information priors (perhaps even completely subjective ones) can help guide early searching to limit the search space.

      Priors should be subject to scrutiny, and even testing (e.g., run the analysis with two competing priors/models and test them explicitly). But, that doesn’t mean that the potentially subjective nature of priors should invalidate them.
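
      As a rough sketch of the search-theory idea mentioned above, the following Python updates a prior over search cells after one cell is searched and nothing is found. The three-cell layout, the prior and the detection probabilities are all hypothetical; the update itself is just Bayes' theorem.

```python
def update_after_failed_search(prior, detect_prob, searched):
    """Posterior over cells after searching one cell and finding nothing.

    prior[i]       -- prior probability the object is in cell i (sums to 1)
    detect_prob[i] -- chance of finding the object in cell i if it is there
    searched       -- index of the cell that was searched without success
    """
    miss = 1.0 - prior[searched] * detect_prob[searched]   # P(search fails)
    posterior = [p / miss for p in prior]                   # unsearched cells gain weight
    posterior[searched] = prior[searched] * (1.0 - detect_prob[searched]) / miss
    return posterior

# Hypothetical three-cell search with a rough, subjective prior.
prior = [0.6, 0.3, 0.1]
detect_prob = [0.5, 0.8, 0.8]
posterior = update_after_failed_search(prior, detect_prob, searched=0)
print([round(p, 3) for p in posterior])
```

      A single failed search cannot settle anything on its own, but it still shifts weight away from the searched cell, and that shift is what guides where to look next.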

      • gluonspring
        Posted April 19, 2015 at 2:36 pm | Permalink

        This reminds me of something worth pointing out to the casual observer also: Bayesianism crops up sometimes in AI kinds of applications. There you are not trying to prove something with data, as in scientific hypothesis testing, but merely trying to have a model of your (or the AI system’s) beliefs about the world, or a model of your uncertainty about the state of the world. Such a model can start with some assumption but modify itself in the face of data. Such systems often are required by the application to give some kind of output even when there is no data to justify the choice, and that something can come from some kind of prior. If more data is obtained, the output can be updated in a natural way to reflect this.

        Bayes nets are one example of this kind of pragmatic, AI-ish use of Bayesian inference.

  15. Posted April 16, 2015 at 10:01 am | Permalink

    I see that one or two readers seem to be asking for a lighter approach to this subject and in response I couldn’t resist re-posting a joke I put up a while ago and an even better riposte.

    Here’s a Bayesian joke.

    I’m a bit of a nervous flyer and I have been told that the probability of a bomb being on my plane is one in a million but the probability of there being two bombs on my plane is one in a million million, then to be extra safe should I take a bomb with me?

    Torbjörn Larsson, OM
    Posted July 2, 2014 at 9:33 am | Permalink
    So you are saying that you need a loaded prior to protect your posterior!?

    • gluonspring
      Posted April 16, 2015 at 10:31 am | Permalink

      If we’re explaining by way of humor, it’s hard to beat:

      https://xkcd.com/1132/

      • Posted April 16, 2015 at 11:08 am | Permalink

        Torbjörn actually linked to a picture of a stoned prior but I don’t know how to do this on my iPad.

    • Sergio Graziosi
      Posted April 17, 2015 at 9:52 am | Permalink

      Sorry, I can’t resist.
      Bayesian and plane joke:
      In flight:
      “Oh dear, I’m so scared of flying, it’s windy out there!”
      “Do not worry: flying is very safe. Are you scared of driving your own car? You know, you are much, much more likely to die in a car accident…”
      “What, right now?”

      I’m not sure how, but this may be telling something about the ubiquity of priors. Can we really truly say that the case where “prior knowledge about the phenomenon we are studying is no knowledge at all” is worth considering? Is it even possible? (how can you study something if you know nothing about it?)
      Yes, there are plenty of situations where assigning some of the priors looks irremediably bound to reflect subjective judgement. This is a plus, in my eyes: it reminds us that perfect objectivity is a never achievable aim. Something we aspire to, knowing we can only approach it. Furthermore, this tells us that because of our current ignorance, sometimes we should acknowledge that our conclusions are preliminary (not worthy of high confidence).

      Also: as noted in many other comments, the Bayesian approach neatly provides us a method to reduce subjectivity ( = maximise objectivity). You need to collect more evidence and update your beliefs accordingly. It’s about error minimisation, making whatever (wildly wrong) initial assumptions we had almost irrelevant. A good thing, if only we humans were a bit quicker in changing our own minds…

      Finally: via the Bayesian brain hypothesis, Bayesian epistemology unifies folk understanding of knowledge with scientific knowledge, thus it reinforces the argument that the best way to produce reliable knowledge is via the scientific approach (and questions the distinction between science and philosophy in a way that should please Sam Harris while irritating Ben).
      What’s not to like (apart from poking Ben)?

      Not sure I’m a committed Bayesian, but I do know that all the against-Bayes arguments I see are easily (and subjectively) explainable-away by thinking “This person wants facts to be facts. The idea that all knowledge is fallible and somewhat wrong/inaccurate doesn’t sit well with his/her views”. This post is no exception: a really good overview that outlines a valid worry, but not enough for me to conclude that “the Bayesian approach is overvalued”.

      PS “wonkish” posts are good! I may be too busy to contribute (or may not have anything to contribute), but they are the reason why I visit regularly.

    • Richard Bond
      Posted April 19, 2015 at 12:13 pm | Permalink

      Actually, the probability of two bombs is one in two million million: Poisson distribution. /pedantry.
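
      The arithmetic can be checked with a few lines of Python. This is a sketch under the assumption that the number of bombs per flight is Poisson with a mean of one in a million; the rate is taken from the joke, not from any real data.

```python
import math

def poisson_pmf(k, rate):
    """P(exactly k events) for a Poisson distribution with the given mean."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

rate = 1e-6                      # assumed mean number of bombs per flight
p_one = poisson_pmf(1, rate)
p_two = poisson_pmf(2, rate)
print(1 / p_one, 1 / p_two)      # "one in ..." odds for one bomb and for two
```

      The factor of two comes from the 2! in the Poisson formula: two bombs on one plane are unordered, so squaring the one-bomb probability counts each pair twice.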

  16. Posted April 16, 2015 at 10:10 am | Permalink

    I’ve always had a problem with associating probabilities with states of nature. It may sound like a nit-pick, but what’s really meant here, rather than probability, is confidence.

    Probability is a property of a random future event, such as the roll of a pair of dice, and is a property of the event itself.

    Confidence is a measure of a person’s state of mind and can be different for different people assessing the same aspect of nature, based upon their knowledge and ability to rationally interpret that knowledge. When people differ in this regard I consider it nonsensical for them to state that “the probability for you is different from the probability for me.” But if they instead speak of their differing confidence then it makes sense.

    I realize that virtually everyone speaks of probability in this way in casual discussions, such as in the Monty Hall puzzle where one might say that there’s a 1/3 probability that the big prize is behind door #3 even though Monty knows where the prize is and would assess that probability at either 0% or 100%, but if two people are assigning different numerical values to the same aspect of reality, how could they both be right? The only way would be if they instead spoke of their own levels of confidence. I really think this is an important distinction and shouldn’t be so casually abused.

    • Aaron S.
      Posted April 16, 2015 at 11:15 am | Permalink

      See also the link in my second comment up at the top, one of the best explanations of this principle I’ve seen.

    • Posted April 16, 2015 at 12:15 pm | Permalink

      See my earlier remark – there’s room for an interesting discussion which I’ve never seen done – about the difference between being a frequentist or a bayesian about *probability* vs. about statistical methodology.

    • Compuholic
      Posted April 17, 2015 at 3:59 am | Permalink

      I’ve always had a problem with associating probabilities with states of nature. It may sound like a nit-pick, but what’s really meant here, rather than probability, is confidence.

      That is essentially the fight between frequentist and bayesian approaches. The question always boils down to “What exactly do you mean by observations”. The frequentist will define events as fundamental building blocks and the bayesian will define probability distributions as fundamental building blocks.

      I generally don’t want to get into this fight (which I regard to be mostly philosophical – and therefore ultimately useless). But then I don’t really know much about the subject. I see it like any other tool: Most tools are useful but you need to be aware of their limitations.

      The frequentist approach makes sense when you can be absolutely certain about what you have observed. The bayesian approach makes sense when you cannot. Each observation in the bayesian approach is itself a probability distribution and quantifies the (ultimately subjective) uncertainty in the observation.

      • John Scanlon, FCD
        Posted April 22, 2015 at 12:50 am | Permalink

        ‘Observations’ are just provisionally accepted (but fallible, and falsifiable) hypotheses about what has occurred based on integration of sensory experience and/or intersubjective agreement between investigators.
        It’s hypotheses all the way down.

    • Henry Fitzgerald
      Posted April 17, 2015 at 4:48 am | Permalink

      I’m afraid the “confidence” sense is the only sense I can attach to the notion of probability. The probability that tomorrow’s roll of the dice will come up with a six is my level of confidence that it will – or more precisely, the level of confidence I ought to have that it will, or the level of confidence I would have if I were fully rational (while retaining my current level of ignorance).

      Me, today, looking at tomorrow’s dice roll will assign a probability of 1/6. You, the day after tomorrow, looking at exactly the same dice roll, will assign a probability of either 1 or 0. Exactly as in your Monty Hall example, different observers correctly assign different probabilities, with the only relevant difference between them being how much they happen to know.

  17. DTaylor
    Posted April 16, 2015 at 10:13 am | Permalink

    I have been loosely following related arguments in research psychology in recent years and really appreciated this clear and succinct post. Thanks, Greg! Maybe I’m not as Bayesian as I thought I was.

  18. Posted April 16, 2015 at 10:15 am | Permalink

    Nice post, thanks for this, Greg. I’m always interested to hear people’s opinions on the tug-of-war between Bayesianism and frequentism.

    Personally, I have always felt that no conflict should exist between Bayesians and frequentists. I will admit to a personal soft spot for frequentist methods; I find them far more mathematically appealing than Bayesian ones usually are. I use frequentist methods more often than Bayesian ones, partly because I am more comfortable with frequentist methods, but mostly because my clients are usually completely unversed in Bayesian methods, and my main goal is to communicate what the data say effectively. I wish more people were aware of Bayesian methods, but I leave the task of promotion to other, better equipped people.

    Your criticism of Bayes’ Postulate is well founded, but there are many schools of thought as to how to best go about assigning a prior distribution. What Bayes’ Postulate tries to get at is how to assign an objective prior when we have no previous knowledge. In my own work, such a situation is unlikely to occur, and I often wonder how much this actually pops up in real practice. From my own end, I would probably question the wisdom of applying *any* formal statistical techniques to a problem where we literally have no prior knowledge of the distributions in play. People need to be willing to be satisfied with exploratory analyses in such a situation, in my opinion.

    But there are many different ways to assign a prior. In the more usual situation where we have at least partial knowledge of the prior, we can use weakly informative or hierarchical priors, depending on the level of subjectivity we are willing to allow into our model. (Personally, I’m not against subjectivity, as long as we are transparent about it.) These types of priors perform quite well in practice.

    I like that you point out that, “In the end, only the likelihoods make a difference.” To me, this is the crux of the matter and the real reason why there shouldn’t be a conflict between Bayesian and frequentist practice, or at least not the kind of conflict we often see. To be sure of our conclusions, we need data *and replications*. This is as true for Bayesians as it is for frequentists. If we can reliably replicate our observation processes (not at all always trivial), then eventually both schools of thought will converge on similar conclusions, assuming they start from sound models. I find that too often people forget about the need to replicate, both experimentally and observationally.

    Finally, while I generally decry the conflict between Bayesian and frequentist methodists, I do think that certain problems are philosophically better tuned to Bayesian methods. These are problems where replication is not possible. Say you are studying the changing demographics and composition of the employees of some particular large company over the course of many years. Using this data, you would like to identify trends and make some meaningful predictions about what to expect in the near future. This requires modelling, but a frequentist model makes no fundamental sense since there is no way we can ever replicate the observed data. The factors influencing the composition of the employee pool depend on nebulous things like the economy, immigration, and other social issues. You can never hope to even approximate a replication of the data, especially if the company is quite large. Here, Bayesian methods are the only ones that make fundamental sense since they do not rely on an interpretation of probabilities as the frequency of an event in a large number of trials.

    Thankfully, this problem crops up far less in the physical sciences.

    • gluonspring
      Posted April 16, 2015 at 10:42 am | Permalink

      Good comment. I agree. Practically I see Bayesian statistics as one tool in a grab bag of tools. Sometimes it’s just easier to approach a problem a Bayesian way. Often times it’s easier to pull some frequentist statistics out of a bag. In both cases there are a lot of built-in assumptions about the problem, so you can never treat either as a magic answer box. You always have to understand your data and its limitations.

      I have the opposite feeling about the math from you, however. The Bayesian approach is more mathematically appealing to me because it’s a simple framework. I guess working out the actual posteriors is tedious, but writing them down is rather neat. Frequentist derivations always feel rather one-off to me, an elaborate proof that applies to one and only one very narrow situation. If I were stranded on a desert island without books or the internet, and needed to do statistics to survive 🙂 , I doubt I could reconstruct many frequentist statistical tests from scratch. I think I could reconstruct most of the Bayesian statistics I have learned.

      • gluonspring
        Posted April 16, 2015 at 10:52 am | Permalink

        writing them down = writing the priors+likelihoods down. After which you do some math, often a lot of boring algebra, to derive a posterior distribution.

      • gluonspring
        Posted April 16, 2015 at 10:56 am | Permalink

        Of course, I’d have to have a computer to make use of any of the Bayesian statistics I reconstructed…

      • Posted April 16, 2015 at 12:36 pm | Permalink

        I’ll certainly give you that the Bayesian approach provides a much simpler mathematical framework, from a strictly theoretical viewpoint. It’s implementing that framework in practice where the math gets “ugly” in my opinion. That’s not a criticism, just a statement of aesthetic preference. But I’m a mathematical analyst and so I tend to enjoy complex theoretical structures. Regardless, go with what works best for the problem, and for communicating the problem.

  19. First Approximation
    Posted April 16, 2015 at 10:22 am | Permalink

    First, try really hard to find an objective way of portraying ignorance. This hasn’t worked yet, but some people are still trying

    If anyone is interested, a notable method is the Principle of maximum entropy. It was championed by ET Jaynes, one of the strongest advocates for Objective Bayesianism.

    The principle of maximum entropy is a generalization of Laplace’s principle of indifference, which states that if you have n indistinguishable possibilities and no other information, you should assign each the same probability.
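
    A small Python sketch shows the link between the two principles: with no constraint beyond the number of outcomes, the equal-probability assignment has the largest Shannon entropy. The three distributions compared here are made up for illustration, and a spot check of this kind is not a proof.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n = 4
uniform = [1.0 / n] * n            # principle of indifference
skewed = [0.7, 0.1, 0.1, 0.1]      # hypothetical non-uniform assignment
certain = [1.0, 0.0, 0.0, 0.0]     # all weight on one outcome

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, round(entropy(dist), 4))
```

    The uniform case reaches log(n), the upper bound for n outcomes. Adding constraints, such as a known mean, changes which distribution maximizes entropy, and that is the sense in which the principle generalizes indifference.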

    The problem with Bayesianism is that it asks the wrong question. It asks, ‘How should I modify my current beliefs in the light of the data?’, rather than ‘Which hypotheses are best supported by the data?’.

    Really? It seems to me that Bayesianism is asking ‘What is the probability of the hypothesis, given the data?’. Meanwhile, looking at the likelihood is asking ‘Given the hypothesis, what is the probability of this data?’. The former question is the one we want answered.

    During a trial the goal isn’t to answer ‘If the person were innocent, what would be the probability of the evidence?’. What you want to know is ‘What is the probability of innocence, given the evidence?’.

  20. JM
    Posted April 16, 2015 at 10:55 am | Permalink

    This is a good post, but I think you’re ignoring some important properties. First, I think it should be clear that for certain well defined questions, there are good objective priors to specify your initial ignorance. We use them frequently in AI models when we get to define the question we want our algorithms to solve 🙂

    That said, your challenge is important in the context of scientific endeavors when we often don’t even know how to define the complete hypothesis space, let alone assign a specific prior to any hypothesis. Algorithmic Probability gives us some insight into directions to go, but it’s not computable and there are some reasonable questions to raise about it currently (e.g., which language?) that have not been resolved in an entirely satisfying way.

    Despite the importance of highlighting that for scientific questions we don’t have precise means to compute priors, I think it would be a mistake to give up and not consider priors at all. I might not be able to ascribe an exact number for a prior, but sometimes I can conclude that one hypothesis is less likely than another. For example, Mormonism’s prior is less strong than Christianity’s. Why? Because Mormonism adopts all the same premises as Christianity, and then *adds* additional ones. We know from probability theory that a conjunction of predicates is at best equally likely as a statement that is a subset of those predicates, and it only equals it when the additional predicates are either necessarily true, or necessarily entailed by the others. Therefore, it would be unwise to treat these hypotheses on equal ground, a priori.

    More generally, we have Occam’s razor and if you’re using Occam’s razor, then you’re thinking like a Bayesian. Questions of hypothesis complexity on which Occam’s razor operates have challenges when one hypothesis isn’t strictly additive to another as in the above case. That is, when there is some non-empty set difference between the hypotheses it may be challenging to ascribe complexity. But with work, in many cases I think we do a pretty good job of that (again, Algorithmic Probability provides some direction). Now you might say that we cannot use it at all if it’s not fully and perfectly formalized, but then allow me to appeal to your likelihood reasoning: Occam’s razor has served science well thus far in correctly (later proved by the data) suggesting that some hypotheses are not likely due to the complexity we assign them, so our ability to determine meaningful complexity differences (at least for this universe) must have some signal in it.

    So while we cannot use Bayes to get exact numbers, I think ignoring the role of priors entirely in science is a major mistake. Since science very often uses Occam’s razor, I’d in fact say that science doesn’t ignore it.

  21. TJR
    Posted April 16, 2015 at 11:01 am | Permalink

    I hear you sister! This is pretty much what I’ve been saying since 1988. To quote myself at the 1988 UK Statistics Research Students’ Conference in Guildford “Bayesians are in league with the devil”.

    Bayesian inference is fine if you have genuine prior information which can be quantified sensibly, and is the natural approach in sequential experimentation.

    For example, most models used for predicting the UK general election will be Bayesian, with the 2010 result providing the prior and recent opinion polls etc the data.

    However, if you don’t have prior information, Bayesian inference forces you to pretend that you do. Throughout the 90s I went to many Bayesian talks where they would talk about using an “uninformative prior” when what they actually meant was a “flat prior”, i.e. the postulate that Greg talks about, which is not “uninformative” at all. This used to drive me up the wall, although they tend to be more honest about it nowadays.

  22. Posted April 16, 2015 at 11:12 am | Permalink

    Great to see a meaty topic like this treated here!!

    For now I just want to add a reference to a contrary view, which I have not yet read. It may add something to this debate, though. Google this pdf:

    “Why I Am Not a Likelihoodist
    Greg Gandenberger”

  23. Posted April 16, 2015 at 11:14 am | Permalink

    Absolutely gobbling up this post (and especially the references).

    Thanks Greg!

  24. nickswearsky
    Posted April 16, 2015 at 11:16 am | Permalink

    “Are kosher switches that much more interesting than why humans have behaviors that seem hard to explain by evolution? Well, readers have spoken!”

    It is simply easier to comment on ridiculous stories, such as the kosher switch. I read most every entry, but I don’t think I am up on latest research to comment meaningfully on altruism or group selection.

    So, number of comments is not a useful gauge of interest in a topic.

  25. Posted April 16, 2015 at 12:14 pm | Permalink

    Sharon McGrayne’s book, The Theory That Would Not Die, made me happy that folks like Turing and Laplace were Bayesians at some point.

    The beef with assigning priors is valid, and constitutes a debate that’s been going on for centuries.

    This article makes great points, but in my opinion the world of statistics is more complex than the two-camp model proposed here.

  26. kieran
    Posted April 16, 2015 at 12:29 pm | Permalink

    It’s on my to do list of statistics to try and get a better grip on. I mostly still use what I learned as an undergrad which I only could understand with the help of an economist.

  27. Posted April 16, 2015 at 12:32 pm | Permalink

    Thanks Greg, this helps me. I will be following up with some of your further reading suggestions.

  28. YF
    Posted April 16, 2015 at 1:04 pm | Permalink

    Likelihoodism has been shown to suffer from various shortcomings, while Bayesianism appears to provide a more complete account of evidence and hypothesis evaluation that aligns better with actual scientific practice. For some discussion of these issues see:

    Bandyopadhyay, P. S. (2007). Why Bayesianism? A primer on a probabilistic philosophy of science. In S. K. Upadhyay, U. Singh, and D. K. Dey (Eds.), Bayesian statistics and its applications, New Delhi: Amaya Publishing Company.

    Bandyopadhyay, P. S. & Brittan, G. Jr. (2006). Acceptibility, evidence, and severity. Synthese, 148, 259–293.

    Howson, C. (1997). Error probabilities in error. Philosophy of Science (Supplement), 64, S185–S194.

  29. Nicholas J. Matzke
    Posted April 16, 2015 at 1:08 pm | Permalink

    Good post overall, but some specifics are problematic:

    1. This bit at the end is kind of mixed up:

    ==========
    The problem with Bayesianism is that it asks the wrong question. It asks, ‘How should I modify my current beliefs in the light of the data?’, rather than ‘Which hypotheses are best supported by the data?’. Bayesianism tells me (and me alone) what to believe, while likelihood tells us (all of us) what the data say.
    ==========

    Actually, the likelihood is the probability of the data given the model. As Elliott Sober notes in Evidence and Evolution, the problem with likelihood is that according to likelihood, “the probability of the data is 1, if the data is noise in my attic, and the model is that there are ghosts in my attic making noise” (paraphrasing here).

    “What the data say about models” is what *Bayes’ Theorem* gives you — literally, P(model|data).

    Likelihood gives you only P(data|model). Now, you might well decide that the model that confers the highest probability is the best model, and that’s fine, but in effect when you do that you are assuming some sort of flat-ish prior on models, and then letting the likelihood determine which is the best model.

    So, actually, you are a Bayesian and are implicitly endorsing Bayes’s Postulate to boot!

    2. You sort of acknowledge the variety of priors at some points in the post, but at other points you seem to say that Bayesians by definition adhere to Bayes’s Postulate. At least for modern Bayesians, this is the furthest thing from the truth. There are well-categorized types of priors and schools of thought about when to use them: subjective Bayes, objective Bayes, maximally uninformative priors, informative priors, conjugate priors, etc. What to use depends on research goals, computational resources, tractability, etc.

    (also, I have always heard the flat prior discussed as Laplace’s principle of insufficient reason, rather than Bayes’s Postulate, although googling does reveal some of the latter)

    Cheers, Nick
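
    The arithmetic behind point 1 can be sketched in Python: on a grid of parameter values, a flat prior leaves the posterior mode at the maximum-likelihood value, while a non-flat prior moves it. The binomial data and the skewed prior are invented for illustration, and the sketch shows only the numerical coincidence, not whether likelihood inference is thereby Bayesian.

```python
def binomial_likelihood(theta, successes, trials):
    """Likelihood of theta, up to a constant factor."""
    return theta ** successes * (1.0 - theta) ** (trials - successes)

grid = [i / 100 for i in range(1, 100)]     # candidate values of theta
successes, trials = 7, 10                   # made-up data

likelihood = [binomial_likelihood(t, successes, trials) for t in grid]

flat_prior = [1.0] * len(grid)
skewed_prior = [(1.0 - t) ** 4 for t in grid]   # hypothetical prior favouring small theta

def posterior_mode(prior):
    """Grid value with the largest prior-times-likelihood product."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    return grid[unnormalised.index(max(unnormalised))]

mle = grid[likelihood.index(max(likelihood))]
print(mle, posterior_mode(flat_prior), posterior_mode(skewed_prior))
```

    The flat prior here is flat in theta only; re-expressing the problem in terms of theta squared would give a different "flat" prior and, in general, a different posterior mode.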

  30. Mark
    Posted April 16, 2015 at 1:09 pm | Permalink

    Thanks for the post. For a real world example, I very much used a Bayesian methodology in deconverting from Mormonism. With a relatively more rigorous framework for weighing the evidences for and against, it became very clear very fast that Mormonism is not very likely.

    I asked questions like “How likely would Egyptologists be to agree with Joseph Smith’s translations of the Book of Abraham Facsimiles if Mormonism were true vs. if it were false?”

    I put something like 20 major for/against questions up, basically, anything I thought was relevant and used Bayes to combine them all. I found that there was no reasonable way to weigh the evidence and come to any other conclusion. Even if I gave Mormonism as much benefit of the doubt as possible, and weighted things as generously as possible, the posterior probability it was true was basically indistinguishable from zero.

    A big problem before doing this was the tendency to weigh one piece of ‘against’ evidence with all the ‘for’ evidence and if there was any wiggle room throw out the ‘against’ and move on to the next one. Thinking Bayesian, where I actually updated up/down (let’s be honest, mostly down) when looking at evidence quickly fixed this though.

    What I found though is that priors didn’t affect the results much. Didn’t matter if I started at 99% true or 1% true, there was enough evidence to bring it down. So, although I recognize the issue with agreeing on priors, I still find Bayesian thinking extremely helpful.
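
    The way a stack of evidence can swamp the prior is easy to see in the odds form of Bayes' theorem. The Python below is a sketch only: the 20 likelihood ratios of 1/3 are hypothetical stand-ins, not the numbers actually used, and the pieces of evidence are assumed to be independent given the hypothesis.

```python
def update(prior_prob, likelihood_ratios):
    """Posterior probability after multiplying prior odds by each likelihood ratio.

    Each ratio is P(evidence | hypothesis true) / P(evidence | hypothesis false).
    """
    odds = prior_prob / (1.0 - prior_prob)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Hypothetical: 20 questions, each only mildly unfavourable (ratio 1/3).
ratios = [1.0 / 3.0] * 20
for prior in (0.99, 0.50, 0.01):
    print(prior, update(prior, ratios))
```

    Even a 99% starting point ends up far below one in a million under these assumed ratios. The independence assumption carries a lot of the weight: correlated pieces of evidence would count for less than their product suggests.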

    • John Scanlon, FCD
      Posted April 22, 2015 at 12:58 am | Permalink

      That’s an unusually methodical deconversion process. Congratulations!

  31. Josh Schraiber
    Posted April 16, 2015 at 1:37 pm | Permalink

    See: Jeffreys prior for a way to define objective priors.

  32. Timothy Hughbanks
    Posted April 16, 2015 at 2:37 pm | Permalink

    …but it wasn’t until I began teaching the subject that I really thought about the logical basis of the subject. Trying to explain to students why we were doing what we were doing forced me to explain it to myself. And, I wasn’t happy with some of those explanations. So, I began looking more deeply…

    Another example of what must be the Universal Law of Learning, which stands as a rejoinder to the insulting truism “Those who can’t do, teach.” On the contrary, “if you want to truly learn something, teach it.” Examples of this are everywhere; I’ve sat in a lot of talks at scientific conferences and thought, “It’s clear that this speaker has never had to teach freshmen.”

    • Michael Waterhouse
      Posted April 18, 2015 at 11:51 pm | Permalink

      Agree. As a simple minded observer, on this topic at least, it has been my observation and experience, that there can be a significant difference between thinking one knows something and knowing it well enough to explain it clearly to someone else.

  33. DrDroid
    Posted April 16, 2015 at 2:48 pm | Permalink

    A nice little article on Bayesian inference, I liked it. My own experience with it is in the field of parameter estimation in communication signals. Many years ago I remember being troubled by the fact that I could assume a uniform a priori distribution on some unknown parameter x, but then I could also assume x^2 or x^3, etc was uniformly distributed. I was pleased to see that dilemma in Greg’s discussion.
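
    The dilemma can be seen in a short simulation. This Python sketch draws x uniformly on [0, 1] and checks how x squared is distributed; the sample size, seed and cutoff are arbitrary choices made for the illustration.

```python
import random

random.seed(0)                                        # fixed seed so the run is repeatable
samples = [random.random() for _ in range(100_000)]   # x uniform on [0, 1]
squares = [x * x for x in samples]

def fraction_below(values, cutoff):
    """Share of values that fall below the cutoff."""
    return sum(v < cutoff for v in values) / len(values)

print(fraction_below(samples, 0.25))   # near 0.25: x is flat
print(fraction_below(squares, 0.25))   # near 0.50: x**2 is not flat
```

    A quantity that was truly uniform would put about a quarter of its mass below 0.25, so assuming ignorance about x amounts to claiming knowledge about x squared, and vice versa.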

  34. Posted April 16, 2015 at 2:56 pm | Permalink

    This post is right in my wheelhouse (that’s rare). I’ll try and be brief…

    Bayesian priors do not have to be based on “pure ignorance”. A prior can be estimated from lines of independent evidence; for instance the prior may come from qualitative evidence to inform the interpretation of quantitative data. In artificial intelligence and information theory, iterative Bayesian algorithms have been used to extraordinary success. For example, we may have multiple independent methods for deducing the probability of some event. We use the posterior probability from method A as the prior for method B, and so on. Upon iterating, the final inference becomes insensitive to the initial prior, so it doesn’t necessarily matter that you started with “pure ignorance” or that there were some apparent paradoxes in the initial calculations. While these methods come from computing applications, I think their incredible success says something about the value of Bayesian epistemologies.
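    The chaining of posteriors into priors described above can be sketched for a single yes/no hypothesis. This is an illustration only: the likelihood values are made up, and the evidence is assumed to be independent.

```python
def update(prior, p_data_if_true, p_data_if_false):
    """One application of Bayes' theorem for a binary hypothesis H."""
    numerator = prior * p_data_if_true
    return numerator / (numerator + (1.0 - prior) * p_data_if_false)

def chain(prior, evidence):
    """Feed each posterior back in as the prior for the next piece of evidence."""
    belief = prior
    for p_true, p_false in evidence:
        belief = update(belief, p_true, p_false)
    return belief

# Hypothetical evidence: 20 independent observations, each of which is
# twice as probable if H is false as if H is true.
evidence = [(0.3, 0.6)] * 20

for start in (0.99, 0.50, 0.01):
    print(f"prior {start:.2f} -> posterior {chain(start, evidence):.2e}")
```

    With enough evidence pointing one way, very different starting priors end up close together. Two caveats show up in the same sketch: with no evidence the output is simply the prior, and a prior of exactly 0 or 1 never moves.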

    • Posted April 17, 2015 at 11:14 am | Permalink

      But note that this involves *two* (assuming they are both well defined, which is doubtful) notions of probability – probability of a hypothesis, the Bayesian idea, and probability of an event (frequentists and propensity-ists).

      • Posted April 17, 2015 at 11:47 am | Permalink

        Within the Bayesian framework, hypotheses often concern the prediction or inference of some event. There isn’t a muddling or clash of definitions. In fact the Bayesian approach allows us to reason about the probability of singular events or unobservable events. Radford Neal offered these examples: “What is the probability that Alexander the Great played the flute” and “What is the probability that grandmother will enjoy a new cribbage set as a gift” (in his 1993 tutorial “Probabilistic inference using Markov Chain Monte Carlo methods”). These “events” have no definable probability in frequentist terms.

        • Posted April 20, 2015 at 11:33 am | Permalink

          There isn’t a *muddle* (except in a LOT of introductory textbooks – I did a review of this as a graduate student, as it happens) if one keeps them separate, and some succeed in doing so. Once one *does that* one realizes one cannot necessarily use the same theory for both. One has to *argue* for that, and in my view nobody has ever done so. The “dutch book” argument sort of works sometimes, but that’s complicated. (The “over-all time” case, particularly.)

  35. Jaime
    Posted April 16, 2015 at 4:07 pm | Permalink

    To expand on Nicholas J. Matzke’s second point, a good addition to the reading list is section 1.5-1.7 of

    Gelman, A., et al. 2014. Bayesian Data Analysis. 3rd Ed http://books.google.com/books?id=ZXL6AQAAQBAJ&lpg=PP1&pg=PA11#v=onepage&q&f=false

  36. Posted April 16, 2015 at 5:25 pm | Permalink

    Thanks! This was most interesting.

    As I mentioned above, I prefer simpler methods where they are available, but most of all I would argue that all of these methods are tools with their own advantages and disadvantages. There are cases for parsimony, cases for likelihood, and cases for Bayesianism, and it seems unhelpful to exclude some of them just because one has pledged allegiance to a particular school.

    Bayesianism is currently all the rage in my area of science (and in parts of the atheist community). The problem is that there are too many Bayesians who think that all other approaches are outdated and everybody should convert exclusively to their way of doing it. It can be very frustrating to talk with those colleagues…

  37. Stephen
    Posted April 16, 2015 at 5:39 pm | Permalink

    So is Bayes really useful in helping to determine the probability that Jesus actually existed?

    I think he did but I base it on Occam’s Razor. The existence of Jesus is the simplest explanation for what we do know about the origins of Christianity.

    • MadScientist
      Posted April 16, 2015 at 7:19 pm | Permalink

      I can’t imagine how Bayes’ theorem could possibly support a Jesus claim – you’d have to create ad hoc probabilities to start with if you used the “suspended judgement” approach. Personally I’d put P(divinity) = 0 into the equation so naturally there never was a Jesus as depicted in the bible. If you attempt to come up with figures by looking at the claims in the different gospels then you get contradictory claims and must make up your mind who, if any, to trust – naturally that raises the question of how you can establish that one is more trustworthy than another. If I were to apply Occam’s razor I’d say the “No Jesus” hypothesis must be the correct one because it doesn’t defy what we know of reality.

      • Posted April 17, 2015 at 10:57 am | Permalink

        Richard Carrier performs exactly that Bayesian analysis in On the Historicity of Jesus.

        His hypothesis isn’t one of divinity, but a much reduced claim that Jesus was an Haile Selassie sort of figure — a very real flesh-and-blood human about whom legends grew over time. Since the divine Jesus theory is a superset of that — the divine Jesus was superficially an human born in a manger and there are stories that even diehard fundamentalists will agree were added after the fact — Richard considers it a reasonable starting point. If Jesus was real and divine, he’d still pass the “human with legends piled on afterwards” test; but if he doesn’t pass even that test, there’s no chance he could also have been divine.

        For Richard’s prior, he assembles a list of a couple dozen or so figures from history who fit the heroic archetype by having certain larger-than-life characteristics attributed to them — divine birth, overcame impossible obstacles in an epic struggle, ascension to the heavens after death, and so on. Some of the people about whom such tales have been told have been real people, but most have not. Richard simply took the ratio of real-to-unreal figures and used that as his prior. He also noted that Jesus was an exceptional example in that he had basically all the heroic properties whereas almost no others did, and the more heroic properties the less likely an individual was to be real…but he gave Jesus the benefit of the doubt and went with the straight percentage.

        Richard then performed an exhaustive census of all extant ancient mentions of Jesus and first attempted to figure out where the author’s materials came from. Most (but not all) of those writing after the Roman conquest of Judea in 70 CE were obviously basing their writings on the Gospel according to Mark, and so clearly constituted hearsay that he didn’t bother including in his calculation. But the writings of Mark, Paul, and quite a few others did have original materials in them.

        For those with new-to-them stuff, Richard then analyzed the likelihood that the author would have written what they wrote about a real historical figure. And, even then, he always bent over backwards in an effort to be generous to Jesus, often assigning absurdly high probabilities that there was a real person at the heart of it all.

        In the end, it didn’t matter. His most realistic calculation came up with some tiny fraction of a percent that Jesus had any basis in reality. He performed a parallel calculation using even more insanely generous odds of the type that he would anticipate an apologist scholar assigning to each piece of evidence…and he still came up with, I think, about a 20% chance Jesus was real.

        It’s the exact same sort of thing that Mark at #30 describes. In isolation, a single inconvenient fact here and there might not seem adequate to challenging the common wisdom of existence…but when you pile all of them together on top of one another, there’s just no way of sustaining credible belief. And, worse, the deeper you dig, the worse the picture becomes. In stark contrast…the more experiments you do to measure the time it takes for a ball to fall to the floor, the more and more the measurements cluster around a single answer.

        For me for the Jesus question, though, the smoking gun is Zechariah 6. There’s Jesus, in all his glory, half a millennium before the time of the Caesars. Not fully fleshed out, of course — but that’s hardly surprising. All the important stuff is there. So then you’re left wondering: is it more likely that Jesus was an ancient Jewish archangel given a recent biography when his cult started to get popular, or that the really real Jesus waited that half a millennium to come down from the heavens, or that some preacher took on the identity of a well-known celestial being and people believed him?

        Some historicists might go for that last one, I suppose. But it’s every bit as absurd as somebody at the same time claiming to be Mercury bringing the divine Message from Olympus. Somebody claiming Mercury delivered a Message to him? Sure — happened all the time. Somebody actually convincing others that he really was Mercury and not merely one of his soothsayers? Give me a break.

        Cheers,

        b&

      • gluonspring
        Posted April 19, 2015 at 2:48 pm | Permalink

        Well, if you have 0 data, then the posterior of a Bayesian inference is whatever prior you plug in. You are free to choose your prior. Choose it appropriately and Jesus pops out.

        This can be seen as a flaw of Bayesianism or a feature. It’s a flaw because the computations allow you to come to unjustified conclusions. It could be seen as a feature, however, because it makes your assumptions explicit in your prior. This is the argument that Bayesians always give to the subjectivity claim: All analyses contain subjective elements. We state ours up front so you can decide if the analysis that flows from that assumption is merited or not. Not everyone buys this defense of Bayesianism, but I think the underlying point is pretty sound: ALL analytical techniques have assumptions, some assumptions are more hidden (implicit) than others, and you can’t evaluate the output of any analysis without understanding and evaluating the assumptions.

        • Posted April 20, 2015 at 11:35 am | Permalink

          I am not a Bayesian (because I am not convinced credences are a suitable domain for a probability function) about probability, but the advantage to Carrier’s work as I see it is that it forces one to set out where one agrees or disagrees about the evidence (or lack of same). Unfortunately his opponents are not doing this much, near as I can tell.

  38. Dan Fromm
    Posted April 16, 2015 at 6:33 pm | Permalink

    This discussion brings me back to an aspect of systematists’ practice that I’ve never understood.

    Give a systematist a dataset of characters x taxa and it will somehow extract a phylogeny of the taxa from it. If the systematist has a powerful enough computer and the right software the phylogeny will have associated goodness of fit statistics. Tree length, likelihood ratio, standard errors of branch lengths, … In short, it will be a fully specified and testable model.

    Why do systematists claim that what looks to me like specification search is in fact model testing? And where are the tests of such models against datasets independent of the one used to select the model and its parameters?

    Systematists whom I respect as smarter and better trained than I am clearly believe that their practice is correct. I don’t. What have I missed?

    In my own fields — economics, market research — we do out-of-sample testing after specification search. In the ideal case we don’t iterate between search and test since that collapses to something like ML estimation using all of the data. We see this as illegitimate.

    • Posted April 16, 2015 at 6:40 pm | Permalink

      As a systematist, I have never seen anybody call the phylogenetic inference a ‘model test’. Likelihood phylogenetics works like this:

      1. get data

      2. do model testing to see which model of sequence evolution best fits your data

      3. infer the phylogeny using the model from step 2

      Sometimes step 2 is left out, as when using the software RAxML, because it only allows one and a half models anyway (basically GTR with or without gamma).

      • Dan Fromm
        Posted April 16, 2015 at 7:31 pm | Permalink

        Thanks for the reply. Your steps 2 and 3 look like specification search, not model testing. Why isn’t the inferred phylogeny a testable model?

        Also, and this has nothing to do with my question or your reply, I’m a little surprised that no one has brought up the power against alternatives of the tests used by systematists. It’s been a good ten years since I fitted a tree (fish, I used phylip and mega) but when I did I found that likelihood ratios had very low power against alternatives. Found the same thing for goodness of fit tests (r bar squared, F ratio) when using grid search to estimate parameters in highly non-linear equations using economic data. This in the days before good nonlinear estimation packages were available.

        • Nicholas J. Matzke
          Posted April 16, 2015 at 8:20 pm | Permalink

          What’s a specification search?

          I know what a model is: it is something that is proposed to stochastically generate data. It can be anything from a normal distribution to a large number of stochastic and deterministic processes linked together.

          In the case of phylogenetics, the most typical situation is that the data is a DNA alignment and the model is the tree topology+branch lengths+sequence evolution model.

          As for “power against alternatives”, again I’m not sure what you mean, the term “power” is most typically used in the context of frequentist contexts of false positives and false negatives. Inferring a phylogeny and all the associated model parameters is much more complex than an up-down test of a null hypothesis.

          Perhaps you mean “alternative tree topologies” or “alternative sequence evolution models” etc. Here, the answer would be, yes, there is a whole large literature on measuring the support for particular topologies (e.g. posterior probabilities of clade bipartitions), sequence evolution models (statistical model choice, with e.g. AIC in a likelihood framework or marginal likelihoods estimated with e.g. stepping-stone sampling in a Bayesian framework). There is even work on testing whether or not some dataset supports a tree of common ancestry at all (versus e.g. independent origins), see work by Doug Theobald for that.

          As for a general rule: there is no general rule, your statistical support for a particular phylogenetic hypothesis will depend on various things, especially the amount of data. A conclusion that is very uncertain with a 1-gene alignment might be rock-solid with a 100-gene alignment.

          • Dan Fromm
            Posted April 16, 2015 at 9:45 pm | Permalink

            Nicholas, thanks for the reply. We’re from different traditions that may not be very compatible.

            One can see a model as a way of characterizing a data set. A tree, including assumptions about sequence evolution; an equation or system of equations with stochastic error terms attached; … In this context specification search is search for the model that best fits the data. We have to search because we don’t know the one true model a priori.

            More than one model can be fitted to a data set. And some fitting procedures generate more than one model from the same data set. I gather that MP isn’t used much these days, but it has often found numerous most-parsimonious trees. The question of which is the one true tree has been solved by a convention, not, as far as I can see, by out-of-sample testing.

            A model that’s been instantiated on a data set is a testable hypothesis. What puzzles me about systematists’ practice — I’m sorry that I’ve failed to communicate this well enough — is that they seem to use the same data to instantiate and to test a model and are happy with the results they get. In other fields the dataset used to instantiate a model isn’t used to test it. Yet that’s what systematists seem to do. Am I mistaken about this?

            You’ve reminded me that “there is a whole large literature on measuring the support for particular topologies.” Indeed there is. The procedures are used to choose a tree. Choice again, not testing against data independent of the data used to instantiate the model. You split the model-building process into stages and call the later stages testing. What have I misunderstood?

            I agree with you that the more degrees of freedom the better.

            • Nicholas J. Matzke
              Posted April 16, 2015 at 11:28 pm | Permalink

              I guess we have different terminologies. It sounds like the translation is:

              specification search = search for the best fit model

              instantiate a model = infer a model

              “What puzzles me about systematists’ practice…is that they seem to use the same data to instantiate and to test a model and are happy with the results they get”

              It depends what you mean by “test a model”. To me, that means comparing two or more models on the same dataset to see which confers higher likelihood on the data (or which model has a higher probability given the dataset, in a Bayesian analysis). E.g., is tree A or tree B a better explanation. Or is the Jukes-Cantor DNA model or the GTR DNA model a better explanation. These sorts of things are all routine.

              It is of course perfectly possible to take two different datasets (e.g. two genes) and see if they support the same or different trees, the same or different DNA substitution models, etc. At least for sequence models, this is kind of less interesting, since there is no particular reason that 2 different genes should support identical models.

              For inferring a tree, the typical assumption is that 2 different genes will support roughly the same tree (there are lots of detailed caveats about the small size of the data in single-gene analyses, the possibility of incomplete lineage sorting, etc.). I guess this may not be a typical feature of most studies, mostly because, most of the time, we think the best tree will be estimated by using all of the available data (again there are caveats, particularly once you throw in tons of noncoding, poorly conserved and poorly alignable junk DNA from genomic studies). The testing of a phylogenetic hypothesis against different datasets is typically done in new studies in new papers.

              The typical sequence in the literature is something like: a mtDNA study published in the 1990s, revised by a multilocus nuclear DNA study published in the 2000s, revised by a transcriptomic or genomic study published in the last few years. (And/or total-evidence datasets adding morphology and fossils, dating analyses, etc.)

              Here’s a study where we estimated hundreds of protein trees in cyanobacteria and then looked at whether or not they showed strong evidence of horizontal transfer and/or vertical inheritance. Mostly we got strong signal that the gene trees are distributed around a central species tree, i.e. the vertical inheritance tree, even though cyanos are thought to have lots of HGT and single-protein phylogenies are bound to be noisy for various reasons:

              Bayesian Analysis of Congruence of Core Genes in Prochlorococcus and Synechococcus and Implications on Horizontal Gene Transfer
              http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085103

              Re: reconstructing the “one true tree” — this was a popular idea amongst parsimony people, in part motivated by the search for (ideally) the single most parsimonious (MP) tree. In a Bayesian framework we recognize that trees are statistically inferred rather than reconstructed, that all statistical inference has uncertainty, and that it’s perfectly possible for some parts of a tree to be resolved with high confidence while other parts of the tree remain highly uncertain (due to missing data, closeness of speciation events in time, or whatever). And, it’s perfectly possible for gene trees to imperfectly match the species tree for various reasons.

              • Posted April 16, 2015 at 11:57 pm | Permalink

                Nick,

                As something of a parsimony guy I would say that your last paragraph is a bit of an oversimplification.

                Yes, there are those cladists who are irrationally hostile to modelling and Bayesian analyses, but that isn’t all there is; and conversely, a few hundred metres down the hill I have several colleagues who are irrationally hostile to everything that isn’t Bayesian (which is why I appreciate this post so much) or who are shocked when somebody uses parsimony for ancestral character reconstruction in a little Java applet for data exploration because they think that method should have died out around 1990 or so.

                These are all just tools; some of them are simpler than others, but sometimes simple is just what you want.

                There are, of course, parsimony techniques for dealing with gene tree incongruence, introgression and incomplete lineage sorting, and honestly I often find them more flexible about the input data and more tolerant of missing data than most of the likelihood and Bayesian alternatives. (StarBEAST is pretty awesome though.)

              • TJR
                Posted April 17, 2015 at 4:22 am | Permalink

                I think I can see both your points. Out-of-sample prediction should be the gold standard in any area of science, whenever possible.

                Just to point out that AIC (Akaike Information Criterion) estimates the model which will have best predictive performance, by penalising models with more parameters which may therefore be overparameterised and hence have poor predictive ability. Hence if you choose your model using AIC then at least to some extent you are already doing this.
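                The penalty described above is easy to write down: AIC = 2k - 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood, and the model with the lowest AIC is preferred. A minimal sketch follows; the model names, log-likelihoods, and parameter counts are hypothetical and not taken from any real analysis.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: (maximized log-likelihood, number of free parameters).
candidates = {
    "simple model": (-1250.0, 1),
    "complex model": (-1245.0, 9),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")
print("preferred:", best)
```

                In this made-up case the complex model fits better in raw likelihood, but not by enough to pay for its eight extra parameters.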

              • Dan Fromm
                Posted April 17, 2015 at 8:36 am | Permalink

                Nick, thanks for your patience and the further reply. I don’t think we’ll ever agree on what “test” means. As I see it you confuse model selection with testing. As you see it, I just don’t get it. As I’ve already said, I’m not smart enough or well-trained enough.

                I understand the process you’ve explained but don’t see it as you do. In particular, I don’t see why “paper A using dataset a got model alpher, paper B using dataset b (disjunct from dataset a) got model beter, …” is equivalent to “dataset b rejects model alpher.”

                In my neighborhood — regression — out of sample testing is relatively straightforward. We fit a regression, end up with parameter estimates and their associated standard errors of estimate and with an estimate of the distribution of errors about the equation as fitted. Out of sample testing comes down to using the equation as fitted to calculate errors around observations not used in parameter estimation. We can then test whether those errors’ distribution is, with adjustments for the test observations’ distances from the regression’s point of means, the same as the distribution of errors in the estimation data set.
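                For readers outside regression, the procedure in the previous paragraph might look like the sketch below for a one-variable regression. It assumes simulated data and ordinary least squares, and the function names are hypothetical. Each out-of-sample error is divided by its predicted standard deviation, which includes the adjustment for distance from the mean of the estimation sample.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return {"intercept": intercept, "slope": slope, "s2": sse / (n - 2),
            "n": n, "x_bar": x_bar, "sxx": sxx}

def standardized_prediction_errors(fit, xs_new, ys_new):
    """Errors on unseen points, scaled by their predicted standard deviation."""
    scaled = []
    for x, y in zip(xs_new, ys_new):
        predicted = fit["intercept"] + fit["slope"] * x
        # Prediction variance grows with distance from the estimation-sample mean.
        variance = fit["s2"] * (1 + 1 / fit["n"]
                                + (x - fit["x_bar"]) ** 2 / fit["sxx"])
        scaled.append((y - predicted) / math.sqrt(variance))
    return scaled

# Simulated data (hypothetical): a true line plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

fit = fit_line(xs[:40], ys[:40])                              # estimation sample
z = standardized_prediction_errors(fit, xs[40:], ys[40:])     # held-out sample
print("mean squared standardized error:", sum(v * v for v in z) / len(z))
```

                If the fitted equation generalizes, the standardized errors should look roughly like draws from a unit-scale distribution, so their mean square should be near 1. A value far above 1 suggests the held-out data do not behave like the estimation data.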

                In the mid-90s I tried to find parallel test procedures for trees fitted to mtDNA data and to morphological data. I failed, eventually proved — but not cleanly enough to publish and besides I don’t like the result — a limitative theorem to the effect that parallel procedures for trees don’t exist.

              • Nicholas J. Matzke
                Posted April 17, 2015 at 11:20 am | Permalink

                “In my neighborhood — regression — out of sample testing is relatively straightforward. We fit a regression, end up with parameter estimates and their associated standard errors of estimate and with an estimate of the distribution of errors about the equation as fitted.”

                Ah — well at least I think I know what you are getting at now. If one wishes to test the predictive accuracy of a model on new data, then the approach you suggest makes sense. Such a procedure can be imagined for phylogenetics — you take your inferred tree+site model+whatever, and then simulate DNA under that model, then see how well you “predict” the DNA sequence. If you repeated this many times you would get multinomial probabilities for each DNA base at each site of the alignment at each tip, and then you could see how well this matches some new dataset.

                Hopefully you could see why the cost-to-benefit ratio for this procedure wouldn’t be that great — the exact DNA sequence you observe is a highly stochastic realization of the process, and so to say the model predicts the exact DNA sequence poorly isn’t saying anything very interesting. A phylogenetic model doesn’t strongly predict the exact sequence of the DNA data, instead it strongly predicts e.g. high correlation in sequence between close relatives, and this is true even if the exact sequence changes wildly between simulation realizations.

                So, one could test the predictive accuracy of a phylogenetic model in terms of its ability to predict certain summary statistics of the data. This is in fact sometimes done, e.g. “posterior predictive” analyses or parametric bootstrapping.

                In response to others Re: parsimony — yes, I was oversimplifying on some points!

            • Dan Fromm
              Posted April 17, 2015 at 12:14 pm | Permalink

              Nick, thanks for the reply. We’re slowly converging.

              You suggested “Ah — well at least I think I know what you are getting at now. If one wishes to test the predictive accuracy of a model on new data, then the approach you suggest makes sense. Such a procedure can be imagined for phylogenetics — you take your inferred tree+site model+whatever, and then simulate DNA under that model, then see how well you “predict” the DNA sequence. If you repeated this many times you would get multinomial probabilities for each DNA base at each site of the alignment at each tip, and then you could see how well this matches some new dataset.”

              This isn’t parallel to the approach I described. Moving from my context to yours with phylogenetic trees in mind, the parallel would be to find the best (in whatever sense of best you like) tree for a set of taxa given sequences of some of their genes or some of their morphological characters. Then ask a disjunct set of genes or characters from those taxa whether that best tree fits them as well as would be expected given how well it fits the data used to infer it. Not a trivial problem.

              The test isn’t to reproduce the data from which the tree was inferred, it is whether the tree is a plausibly good fit to data not used in estimating it.

              For what it’s worth, most equations (or sets of equations) fitted to economic data fail out-of-sample testing. If out-of-sample testing for inferred phylogenies is possible — I doubt it is, could well be mistaken — I expect that most will fail too.

              • Nicholas J. Matzke
                Posted April 18, 2015 at 11:44 am | Permalink

                You seem to have an absolute idea of goodness of fit, whereas usually goodness of fit is a relative statement where different models confer different likelihoods on the data. It would be trivial to compare the relative fit of various trees etc. to a new dataset. And, that would be useful.

                If you want an absolute measure of goodness of fit, it’s got to be some kind of posterior predictive approach. This could be applied to a new test dataset just as well as a training dataset, with the same issues as described above. Some people are very interested in posterior predictive stuff, the one nice thing is that it tells you when your model is *way* off, e.g. very misspecified, which the relative approach won’t tell you unless you have a better model ready to compare to. Apart from that, I’m not sure how interesting it is. The ubiquitous wisdom in the field is that all models are wrong, but some models are useful, and some models are better than others, and we advance science and test old models by proposing new models and measuring how much better or worse they are at explaining the data. Thus science is advanced in steps. Which seems pretty good to me…

              • gluonspring
                Posted April 20, 2015 at 12:18 pm | Permalink

                It sounds to me that what you are getting at is often called “overfitting” of models.

                One can often find models that fit some corpus of data exceedingly well but which don’t generalize to new unseen data. In this sense you are modeling the peculiarities of the data instead of the underlying phenomena that created the data. For those readers listening in who haven’t thought about this before, consider fitting a curve to points… you could find a 5th order polynomial that perfectly goes through a handful of points collected from a simple linear process, and might be impressed with the perfect fit, but this complex precisely fitting curve wouldn’t fit new data obtained from the real underlying line… the complex curve has overfit the data. Sometimes one applies constraints on model complexity to reduce this overfitting tendency… since any data can be perfectly fit by a sufficiently complex model with enough tunable parameters, reducing the complexity and number of tunable parameters is a crude counter to this tendency. In the curve-fitting example restricting your models to linear and quadratic equations would be an example of such a constraint on model complexity. Still, at the end of this process you’d like to know if you have overfit your data even with the simpler model, or you might like to be able to consider more complex models but don’t want to be fooled by overfitting. The ideal thing would be to generate some more data points and see if this new, never seen by the model building process, data fits your trained model. Sometimes this is expensive, or even impossible, so sometimes people use cross-validation to repeatedly fit a model on part of the data and validate it on some hold-out set. Only when the model can be shown to fit multiple hold-out sets when trained on multiple versions of the training data is it accepted.

                So I think the question being asked of systematists is what they do to ensure they aren’t overfitting the data.
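                The curve-fitting example above can be run directly. The sketch below uses six made-up points from the line y = 2x + 1 with a fixed alternating error of 0.5, fits both a straight line and the 5th order polynomial that passes through every point, and then scores both against new points from the true line. The data and function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def lagrange(xs, ys, x):
    """Value at x of the unique polynomial through all (xs, ys) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def rmse(predict, xs, ys):
    """Root-mean-square error of a prediction function on a data set."""
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Training data: true line y = 2x + 1 plus a fixed, made-up error pattern.
train_x = [0, 1, 2, 3, 4, 5]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
train_y = [2 * x + 1 + e for x, e in zip(train_x, noise)]

# New data the fitting never saw, taken from the true line.
new_x = [0.5, 1.5, 2.5, 3.5, 4.5]
new_y = [2 * x + 1 for x in new_x]

intercept, slope = fit_line(train_x, train_y)
line = lambda x: intercept + slope * x
poly = lambda x: lagrange(train_x, train_y, x)

print("training RMSE  line:", rmse(line, train_x, train_y), " poly:", rmse(poly, train_x, train_y))
print("new-data RMSE  line:", rmse(line, new_x, new_y), " poly:", rmse(poly, new_x, new_y))
```

                The polynomial is perfect on the points it was fitted to and worse than the plain line on the new ones, which is overfitting in miniature.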

  39. Ken Kukec
    Posted April 16, 2015 at 6:50 pm | Permalink

    To lightly paraphrase the blind man of Bethsaida, cured by Jesus’ spit and the laying on of hands in Mark 8:22-26: “Now I see.”

    Thanks.

  40. MadScientist
    Posted April 16, 2015 at 7:07 pm | Permalink

    Oh good, I’m not the only scientist who goes raving mad when people apply Bayes’ theorem when they have absolutely no information on the likelihood of events. You’d think it would be pretty damned obvious that you’d probably get the wrong results but unfortunately there are a lot of believers out there. I remember having an argument about it about 10 years ago and this guy was essentially saying it’s better to have the wrong answer than to admit ignorance and go searching for the right answer. The situation is a bit different in some classes of ‘learning machines’ – they start out ignorant and they behave exactly as the programmers expect the machines to behave: they make the wrong decisions. However, as actual data is acquired over time the machines make better decisions.

  41. Diane G.
    Posted April 16, 2015 at 8:01 pm | Permalink

    sub

  42. Posted April 16, 2015 at 8:25 pm | Permalink

    I can recommend Nate Silver’s book The Signal and the Noise as a decent non-technical introduction to this topic.

    As far as this post: it was enjoyable. I echo what others have said: there is a time and place to use Bayesian techniques, and that is when you DO have some actual prior knowledge.

    I often give this example: suppose you want to get an idea of how good a free throw shooter someone is when you’ve seen them miss 2 of 4 free throws. The frequentist approach is to just do the calculation of the p-value given some null hypothesis.

    The Bayesian approach might be to see if you know something about the shooter; for example if the shooter is on the basketball team, one might take that into account (say the team shoots 70 percent) and be more likely to conclude that the person is really better than a 50 percent shooter.
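
A rough numerical version of the free throw example, as a sketch only. The specifics are assumptions not given in the comment: the null hypothesis is taken to be a 70 percent shooter, and the "knowledge about the team" is encoded as a Beta(7, 3) prior (mean 0.7, roughly as informative as ten previously observed shots). The function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent tries."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_tail_p_value(made, shots, null_rate):
    """P(at most `made` successes) if the true rate were `null_rate`."""
    return sum(binom_pmf(k, shots, null_rate) for k in range(made + 1))


def beta_prob_above(threshold, a, b):
    """P(rate > threshold) under Beta(a, b), for whole-number a and b.

    Uses the identity P(Beta(a, b) > x) = P(Binomial(a + b - 1, x) <= a - 1).
    """
    n = a + b - 1
    return sum(binom_pmf(k, n, threshold) for k in range(a))


made, missed = 2, 2

# Frequentist side: one-sided p-value against the assumed null of 0.7.
p_value = lower_tail_p_value(made, made + missed, 0.7)

# Bayesian side: assumed Beta(7, 3) prior updated by the four observed shots.
prior_a, prior_b = 7, 3
post_a, post_b = prior_a + made, prior_b + missed
posterior_mean = post_a / (post_a + post_b)
prob_better_than_half = beta_prob_above(0.5, post_a, post_b)

# For comparison: a flat Beta(1, 1) prior, i.e. no knowledge about the shooter.
flat_prob_better_than_half = beta_prob_above(0.5, 1 + made, 1 + missed)

print("p-value vs. 70% null:", p_value)
print("posterior mean with team prior:", posterior_mean)
print("P(rate > 0.5), team prior:", prob_better_than_half)
print("P(rate > 0.5), flat prior:", flat_prob_better_than_half)
```

The contrast between the last two lines is the point of the example: with four shots, the conclusion is driven almost entirely by whatever prior is supplied.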

  43. Jonathan Livengood
    Posted April 16, 2015 at 9:37 pm | Permalink

    The link to Glymour’s essay appears to be broken. I assume that the intention was to point to Fitelson’s repository here: http://fitelson.org/probability/glymour.pdf

    As a note, what Bayes actually proves in his famous paper is much more interesting than what is usually called Bayes’ Theorem today. What he actually does is to answer the following challenge. Given the number of times in which some event has happened and the number of times in which it has failed to happen, determine “the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.” In other words, determine the credence that one ought to assign to the claim that the probability of (or credence one should have in) success on the next observation lies between p1 and p2 for arbitrary values of p1 and p2.

    It is true that the proof requires a controversial symmetry assumption (made also by Laplace in his proof of the same result a few years later). Some have speculated that Bayes declined to publish in his lifetime because he didn’t have an airtight argument for the symmetry assumption, and hence, he didn’t have an airtight solution to the problem.
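
The problem Bayes posed can be worked numerically. The sketch below assumes that the symmetry assumption amounts to a uniform prior on the unknown probability, so that after s successes and f failures the posterior is a Beta(s + 1, f + 1) distribution; the function names are hypothetical.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x, for whole-number a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(a, n + 1))


def chance_between(successes, failures, p1, p2):
    """Posterior chance that the single-trial probability lies in (p1, p2),
    assuming a uniform prior on that probability."""
    a, b = successes + 1, failures + 1
    return beta_cdf_int(p2, a, b) - beta_cdf_int(p1, a, b)


# Example: 7 successes and 3 failures; how much credence goes to 0.5 < p < 0.9?
print(chance_between(7, 3, 0.5, 0.9))
```

With no observations at all, `chance_between(0, 0, p1, p2)` reduces to `p2 - p1`, which is just the uniform prior restated.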

    • Posted April 17, 2015 at 11:26 am | Permalink

      >Some have speculated that Bayes declined to publish in his lifetime because he didn’t have an airtight argument for the symmetry assumption.

      Yes, I was going to mention the idea that Bayes’s doubts may have kept him from publishing in his lifetime, but deleted that, as the post was already long.

      Glymour link– Thanks, now fixed!

      GCM

  44. Ken Kukec
    Posted April 17, 2015 at 1:00 am | Permalink

    ‘Well, what if there is no first stage experiment or background knowledge which gives a probability distribution to the hypotheses?’

    When I first became casually acquainted with Bayes’ Theorem, and with the problem of determining “prior probabilities” in the absence of adequate information, it occurred to me that a possible starting point for inquiry into naïve priors was suggested by the 20th Century American writer Damon Runyon, whose milieu was the gamblers, wiseguys, and Broadway hangers-on of Depression-era New York — the guy, that is, who gave the stage “Guys and Dolls.”

    Now, having read this treatment of the issue, and of Bayes’ efforts to address it with his Postulate, it occurs to me that Bayes and Runyon, each in his own way, came out on the matter perhaps not that far apart. Said Runyon, in his gambler’s parlance: “I long ago came to the conclusion that all life is 6 to 5 against.”

  45. Ken Kukec
    Posted April 17, 2015 at 1:52 am | Permalink

    Also appreciated is the allusion to what is my favorite Hemingway short story of the 1930s, “A Clean, Well-Lighted Place.” It is certainly the locus classicus for his so-called “iceberg method” and his contention that the measure of a writer’s talent is how much he can leave out of a story — since, there, he omits plot, character names, characterization itself, geographical location, even the routine attribution of quotations.

    Of special interest for present company is that it is where Hemingway (notwithstanding his religious affiliations from prior marriages) came out as an atheist, even something of an existentialist. The story contains what must be the most jarring parody of The Lord’s Prayer and the Hail Mary in the Western literary canon:

    Our nada who art in nada, nada be thy name thy kingdom nada thy will be nada in nada as it is in nada. Give us this nada our daily nada and nada us our nada as we nada our nadas and nada us not into nada but deliver us from nada; pues nada. Hail nothing full of nothing, nothing is with thee.

  46. Marilee Lovit
    Posted April 17, 2015 at 3:37 am | Permalink

    Thanks for this post. An example of what a great resource this website is. Not only the post, but all the discussion too.

    • Diane G.
      Posted April 17, 2015 at 6:07 am | Permalink

      Indeed–it’s like a graduate seminar!

  47. Torbjörn Larsson, OM
    Posted April 17, 2015 at 4:49 am | Permalink

    Thank you, very informative! I have just recently grasped the ‘doing Fisher, using Neyman-Pearson language’ use. I’m sure this article will influence my further understanding.

    As an example where bayesian methods are valid I would add hidden Markov models (HMM), which do what you describe in the end, suppress the prior until the result is reasonable. “A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be presented as the simplest dynamic Bayesian network. … Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition,[7] part-of-speech tagging, musical score following,[8] partial discharges[9] and bioinformatics.” [ http://en.wikipedia.org/wiki/Hidden_Markov_model ]

    But that leads me up to my idiosyncratic take. The implicit discussion of what constitutes probabilities is philosophic. These are all useful, and used, methods.

    I would use likelihoods for choosing between valid data and theories, bayesian models for improving valid data and theories (see above – Planck data processing and model choosing is a premier example), and significance testing for weeding out invalid data and theories.

    In the latter I join the empiricist par excellence, Mr Holmes:

    “How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?”

    [ http://en.wikiquote.org/wiki/Sherlock_Holmes ]

    Not the only possible ‘truth’, which is why likelihoods are useful. But I would want a quality bar to hurdle for useful science, and as far as I can see significance testing is the only game in town.

    Perhaps we can dispense with that quality stamp of approval and rely on just likelihood competition. But why do we then have peer review in order to weed out erroneous (including pseudo) science? If quality testing is useful in the end, why not throughout?

  48. Posted April 17, 2015 at 10:30 am | Permalink

    I’m still trying to wrap my head around the heart of the debate…and I think something that would really help me would be to understand how “the other side” would approach the supportive examples.

    So, for example, for the likelihood approach, how would you calculate the odds the patient actually has the disease?

    b&

    • Posted April 17, 2015 at 10:46 am | Permalink

      Everyone (likelihoodist, Fisherian, Bayesian, Neyman-Pearsonian, and textbook hodge-podgeist) would calculate it just the way I did above. The debate is not over Bayes’ Theorem– the theorem is a simple mathematical truth, accepted by all. The debate (at least in part) is over what do you do when you don’t have a well-founded, objective, prior probability to plug into Bayes’ equation. In the disease diagnosis example, we did have a well-founded, objective, prior probability, and hence there’s no dispute about the odds the patient has the disease.

      GCM

      • Posted April 17, 2015 at 11:00 am | Permalink

        Okay…then how do the two compare in the don’t-have-a-prior example?

        Bayesians, as I understand it, either go with 50-50 or pick a number out of their netherbits and plug it into that part of the equation. What’s the math for the likelihood approach?

        b&

        • Posted April 18, 2015 at 8:06 am | Permalink

          To use the medical diagnosis example above, and now supposing we have absolutely no clue as to the general incidence of the disease (i.e. no prior probability), the likelihood analysis would go like this. First, we need a slightly more complete description of the properties of the test. It takes two numbers, the true positive rate (or sensitivity), which I gave above as .95, but also the false positive rate (which is equal to 1- the true negative rate [the latter called the specificity]). Let’s say the false positive rate is .1 (in the calculation above I had used a false positive rate of .05, since it required less explanation, but it’s important to realize that the false positive rate plus the true positive rate does not, in general, equal one). We then calculate the likelihoods of the two hypotheses (Hd, he has the disease; Hh, he’s healthy), given the data, +, that there was a positive result:

          L(Hd∣+) = P(+∣Hd) = .95

          L(Hh∣+) = P(+∣Hh) = .10

          We then take the ratio of the likelihoods, which gives a likelihood ratio of 9.5 in support of the hypothesis that he has the disease, over the hypothesis that he is healthy; most people would interpret such a likelihood ratio as at least fairly strong support in favor of Hd. The evidence provided by the test is thus supportive of Hd over Hh.

          The likelihood question, “What do the data say?”, is thus answered. Note that this does not answer the question, “What is the probability the fellow has the disease?”; if we had a prior (which we don’t), we could go on to answer that question. This is why I said in the OP that Bayesianism is more ambitious than likelihoodism. For a physician, perhaps the most important question goes even further: “Should I treat the fellow for the disease?”. This is the realm of decision theory, in which we must go beyond statistics, and add moral or monetary values to the hypotheses, in order to arrive at a decision.

          GCM
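
The likelihood calculation in the comment above takes only a few lines of Python. The second half of the sketch goes beyond the likelihood analysis: it shows what posterior probability the same likelihood ratio would give for a few priors, and those prior values are purely hypothetical, chosen only to show how much the answer depends on them. The function name is also an illustration.

```python
sensitivity = 0.95          # P(+ | Hd), the true positive rate
false_positive_rate = 0.10  # P(+ | Hh)

# Likelihoods of the two hypotheses given a positive test result.
lik_disease = sensitivity
lik_healthy = false_positive_rate
likelihood_ratio = lik_disease / lik_healthy
print("likelihood ratio (Hd over Hh):", likelihood_ratio)


def posterior_probability(prior, lr):
    """Bayes' theorem in odds form: posterior odds = LR * prior odds."""
    prior_odds = prior / (1 - prior)
    post_odds = lr * prior_odds
    return post_odds / (1 + post_odds)


# Hypothetical priors only; the likelihood analysis itself stops at the ratio.
for prior in (0.01, 0.10, 0.50):
    print("prior", prior, "-> posterior", posterior_probability(prior, likelihood_ratio))
```

The likelihood ratio is the same in every row; only the posterior changes with the prior, which is why the two questions are kept separate.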

          • Posted April 18, 2015 at 11:21 am | Permalink

            That helps — thanks!

            Would it be fair to summarize the two approaches as the likelihood tuned to attempting to isolate the significance of the data under investigation with the Bayesian attempting to put the data in perspective with everything else that’s known?

            b&

            • Posted April 19, 2015 at 11:10 am | Permalink

              Yes, that’s right. Lacking a valid objective prior, non-Bayesians would also try to put the results in the context of what is already known, but the result of that reasoning would not be an algorithmically derived posterior probability.

              GCM

              • Posted April 19, 2015 at 11:59 am | Permalink

                That confirmation helps a great deal — knowing that they’re two different (but related) tools designed for two different (but overlapping) jobs.

                Thanks!

                b&

          • Nicholas J. Matzke
            Posted April 18, 2015 at 11:46 am | Permalink

            “To use the medical diagnosis example above, and now supposing we have absolutely no clue as to the general incidence of the disease (i.e. no prior probability), the likelihood analysis would go like this.”

            You could build a hierarchical model and put a flat hyperprior on PrP, the prior probability of having the disease…

      • TJR
        Posted April 17, 2015 at 11:08 am | Permalink

        Adding to that, another way of seeing the fundamental difference, as noted in some comments above, is that in *Bayesian inference* we explicitly model our uncertainty using probability distributions.

        At the risk of oversimplifying, consider the case of trying to estimate the mean of some measured quantity.

        non-Bayesian: the mean is a fixed value which we don’t know.

        Bayesian: based on previous work I reckon the mean is somewhere between 200 and 400, so I will quantify my uncertainty using a normal distribution with mean 300 and “large” standard deviation (say 100). This is the prior distribution for the unknown mean, and as we collect data we can update it to a posterior distribution which will have smaller standard deviation, and mean closer (we hope) to the unknown true mean.
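
The update described above has a closed form when the measurements are also treated as normally distributed with a known standard deviation. The sketch below uses the prior from the comment (mean 300, standard deviation 100); the four measurements and the measurement standard deviation of 50 are invented for illustration, as is the function name.

```python
from math import sqrt


def update_normal_mean(prior_mean, prior_sd, data, data_sd):
    """Posterior for an unknown mean: normal prior, normal data, known data_sd.

    Precisions (1 / variance) add, and the posterior mean is the
    precision-weighted average of the prior mean and the sample mean.
    """
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1 / prior_sd**2
    data_precision = n / data_sd**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * sample_mean) / post_precision
    return post_mean, sqrt(1 / post_precision)


measurements = [352.0, 361.0, 340.0, 347.0]  # hypothetical data
post_mean, post_sd = update_normal_mean(300.0, 100.0, measurements, 50.0)
print("posterior mean:", post_mean, "posterior sd:", post_sd)
```

Even with four observations the posterior standard deviation is far smaller than the prior's 100, and the posterior mean sits close to the sample mean, because a vague prior carries little weight.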

      • Posted April 17, 2015 at 11:22 am | Permalink

        IMO, not quite: this is where the *statisticians* debate. There’s actually a related, perhaps prior (:)) debate on what the interpretation of the probability calculus (in factual matters) should be.

        In some contexts, people would *deny* that there’s a probability (or if you prefer, that the probability is 0 or 1) that the patient has the disease: what is the case is that a given frequency in the population exists and another for those with and without the test, and a *random sample* produces p() thus and so.

        (This is encapsulated in the slogan: “no randomness, no probability”.)

        Another way to ask this is: what is the domain of the function?

        • Posted April 18, 2015 at 7:36 am | Permalink

          Yes, you’re quite right. The fellow has the disease or he doesn’t. What Bayes’ Theorem says is that an individual drawn from a similarly situated population (i.e. drawn from the same general population and with a positive result on the test) would have (in this case) a 16% probability of having the disease. Philosophical Bayesians are more committed to the notion that hypotheses about the nature of the world can have probability values, while others are less committed or opposed to that view.

          GCM

  49. Posted April 18, 2015 at 8:24 am | Permalink

    Yippee, there is now a criticism of this post from Richard Carrier. It appears to be an attempt to expand the definition of Bayesianism so far that everything is Bayesian.

    The problem with attempts like that – such as when some scientismists claim all reasoning as science (hi Coel!), or some apologists strategically claim that all beliefs are equivalent to religious faith – is that important distinctions get lost.

    And of course I could easily make the same argument for, say, parsimony analysis; Carrier also instinctively uses Occam’s Razor all the time in daily life, but that does not mean that he is a cladist. No, he is a Bayesian, because that is his preferred formal method.

  50. Ankur Chakravarthy
    Posted April 18, 2015 at 9:18 am | Permalink

    There is one more way round prior distributions; empirical bayes procedures estimate the prior distribution from the data themselves.

  51. JacksonA
    Posted April 18, 2015 at 4:36 pm | Permalink

    Great column and fun comments to read. My thoughts.. In practice it might not seem like Bayesian analysis comes up much but it can be used heuristically to assess whether data is being reported objectively or only because it supports a view or a bias toward one hypothesis. This is another way it complements the scientific method.

  52. Michael Waterhouse
    Posted April 19, 2015 at 12:19 am | Permalink

    I had no idea that Bayesianism was so widely used and so, seemingly, important.
    I remember having a class in it as part of studying human rationality in 3rd year philosophy and can’t remember anything about it. Nor can I readily understand much about this article.
    A question, as someone who pursued education and philosophy in order to learn how to think and reason well (among other things): is a good understanding of this Bayesian thing necessary for good critical thinking?
    Is it necessary for a reliable evaluation of some ‘truths’?
    Or are there other methods? I did well at formal logic, but haven’t got math.

    My lecturer was a charismatic guy, a good teacher. The most salient thing he left me was brevity on one’s answering machine. They were fairly new back then and everybody was coming up with lengthy tedious or amusing messages.
    He just had “Leave a message for Neil”. That’s the style I have used ever since.

    So, can I have trust in my beliefs without Bayesianism?

    I do intend to look into it further.

  53. Jonathan Dore
    Posted April 19, 2015 at 1:53 pm | Permalink

    In the last 15 years or so a Bayesian approach has revolutionized the analysis of carbon-14 dating in archaeology, greatly reducing the error bars and enabling sequences to be ordered with much greater precision and confidence, e.g. in clarifying the order in which a large number of long barrows (burial mounds) were built across southern Britain in the 4th millennium BC, showing the direction in which the style spread across the country. See https://journals.uair.arizona.edu/index.php/radiocarbon/article/download/3836/3261 for a general introduction to this kind of application.

  54. tomcampbellricketts
    Posted April 20, 2015 at 12:28 pm | Permalink

    The example of allele frequencies given doesn’t seem to me to have been well chosen. At least, it was not well explained.

    The article complains that a uniform distribution of the frequency for a implies a non-uniform distribution for the frequency for aa. What did the author expect? If I deal 4 cards from a randomized standard deck, should I expect 4 aces as frequently as 4 non-matching cards?

    There are 3 times as many ways of not getting aa as there are ways of getting aa, so a median of 0.25 seems like a good property for the distribution over frequencies for aa to possess. It is not at all obvious from the post why this distribution should be problematic.
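
For readers who want to see the transformation itself, here is a small sketch. It assumes the frequency of aa is the square of the frequency of a (as under random mating), and it stands in for a uniform distribution with an evenly spaced grid of allele frequencies rather than random draws; the variable and function names are illustrative.

```python
N = 1000

# Evenly spaced allele frequencies for a: a stand-in for a uniform distribution.
allele_freqs = [(i + 0.5) / N for i in range(N)]

# Implied genotype frequencies for aa, assuming freq(aa) = freq(a) squared.
genotype_freqs = [p * p for p in allele_freqs]


def fraction_at_most(values, threshold):
    """Share of the values that are less than or equal to the threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)


half_below_quarter = fraction_at_most(genotype_freqs, 0.25)
share_below_tenth = fraction_at_most(genotype_freqs, 0.10)
mean_aa = sum(genotype_freqs) / N

print("share of aa frequencies <= 0.25:", half_below_quarter)
print("share of aa frequencies <= 0.10:", share_below_tenth)
print("mean aa frequency:", mean_aa)
```

Half of the implied aa frequencies fall at or below 0.25, and nearly a third fall at or below 0.10, so a flat distribution over a is not flat over aa. Whether that is a defect or a sensible property is the point in dispute.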

    Now, very often a distribution over a continuous hypothesis space will lack transformation invariance, as suggested. But methods have existed for about 50 years to find the desired unique uninformative prior for such cases (see http://bayes.wustl.edu/etj/articles/well.pdf for example).

    The claim made here that pure ignorance cannot be represented by a probability distribution is thus unsupported.

    As for the complaint that prior distributions exhibit subjectivity, I agree. But whatever subjectivity exists in the prior will also enter the likelihood, via the same mechanism (the subjectively assumed probability model, which specifies both the ignorance prior and the sampling distribution over the possible data sets).

    • Posted April 20, 2015 at 12:31 pm | Permalink

      Since you’re new here, apparently, please try to be a little more polite when tendering your criticisms. If you don’t understand how you could have made this post in a more civil manner, I can’t help you.

  55. Posted May 6, 2015 at 11:18 am | Permalink

    Here’s my comment on this:

    http://web.bryant.edu/~bblais/priors-vs-likelihoods.html

    Bottom line – likelihoods (and frequentist methods in general) have no way of ruling out well-fitting, but totally implausible, hypotheses. Bayesian reasoning does – and Greg Mayer uses it all the time.

