Dec 20, 2020

Bayesian priors and posteriors in 3D

Once a year you travel on business to New York, always looking forward to lunch with your “year abroad” friend who moved there. Your tradition is to “flip for lunch” using the same tarnished coin you found on a street in Florence that wonderful day the two of you first met.

The opening image above visualizes the probability distributions for the fairness of that old coin that could arise after four years of lunches. It presumes significant initial doubt about the coin's fairness.

This post visualizes those distributions in three dimensions and asks what insights might have been gained.

Background

In the first Bayesian statistics Coursera course From Concept to Data Analysis, instructor Dr. Herbert Lee, UCSC School of Engineering, writes his notes seemingly backwards (!) on a picture window, frequently looking at us through his handiwork to make sure we are getting his point. The effect is attention-grabbing; his windowboards screen-shot-worthy.

In the shot below Dr. Lee summarizes five important aspects of the relationship between a Binomial likelihood (a.k.a., a “Binomial process,” as when flipping a coin and counting the number of heads) and a Beta prior:

  1. When the prior distribution for the probability \(\theta\) of the Binomial process is a Beta, the posterior for \(\theta\) is also a Beta (the conjugate property)
  2. The mean of a Beta is \(\frac \alpha {\alpha + \beta}\)
  3. The parameters of the posterior are \(\alpha + \#(successes)\) and \(\beta + n - \#(successes)\)
  4. The effective sample size of the prior is \(\alpha + \beta\)
  5. The posterior mean is the weighted average of the prior mean and the data mean

Also noteworthy but unwritten:

  • “Points 4 and 5 can help a practitioner select prior values for \(\alpha\) and \(\beta\).” (Demonstrated below.)
  • “\(\alpha + \beta\) is the effective sample size of the prior because the sum \(\alpha + \beta + n\) forms the denominator in the weighted average formula for the posterior mean.” (Where his pen is pointing. Enlightening!)
  • The posterior distribution is a distribution for \(\theta\) because \(\theta\) is the determining factor of the Binomial process of interest.
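Points 1 through 5 can be checked numerically. A minimal sketch in Python (the post's own code is in R and isn't shown; the flip count of 3 heads here is an illustrative choice):

```python
# Beta prior + Binomial data -> Beta posterior (point 1, conjugacy).
from fractions import Fraction  # exact arithmetic, so the check is exact

alpha, beta = 2, 2              # prior Beta(2, 2)
n, heads = 4, 3                 # illustrative data: 3 heads in 4 flips

# Point 3: posterior parameters are alpha + #(successes), beta + n - #(successes)
post_alpha = alpha + heads
post_beta = beta + n - heads

# Point 2: the mean of a Beta(a, b) is a / (a + b)
prior_mean = Fraction(alpha, alpha + beta)
post_mean = Fraction(post_alpha, post_alpha + post_beta)

# Points 4 and 5: the posterior mean is a weighted average of the prior mean
# and the data mean, with weights (alpha + beta) and n over (alpha + beta + n)
weighted = (Fraction(alpha + beta, alpha + beta + n) * prior_mean
            + Fraction(n, alpha + beta + n) * Fraction(heads, n))
assert post_mean == weighted    # both equal 5/8
```

With the prior's effective sample size equal to \(n\), the prior mean and the data mean each get exactly half the weight.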

However, the most illuminating point I heard him make was one I’d actually misheard!

When Dr. Lee discussed the mathematically natural conjugate relationship between the Binomial and the Beta, I thought I’d heard him say, “The Beta distribution is conjugate to the Binomial likelihood.” From “conjugate to” I imagined “perpendicular to,” picturing the data observations rising at right angles from the prior in the \(x-y\) plane, like an old bumper jack whose base forms the prior, whose shaft houses the data, and whose lever, ratcheted by the practitioner, hoists the posterior to sequentially higher planes with each new success.

Let’s graph that.

The process

Flip a coin \(n = 4\) times and count the number of heads (“successes”).

The prior

Model “coin fairness uncertainty” with Beta parameters \(\alpha = 2\) and \(\beta = 2\). A Beta prior with those parameters satisfies two conditions:

  • The prior mean is 1/2, reflecting our a priori expectation of a fair coin.
  • The effective sample size of the prior and the sample size of the data are the same, thereby giving equal weight to the prior mean and the data mean when reassessing our fairness expectation after seeing the data.

Using algebra and the formulas in points 2 and 4 above, we can solve for \(\alpha\) and \(\beta\) – or just guess.
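The algebra is two lines. Point 2 gives \(\frac{\alpha}{\alpha + \beta} = \frac{1}{2}\) and point 4 gives \(\alpha + \beta = 4\), so (sketched in Python for illustration):

```python
# Back out the Beta prior parameters from the two conditions above.
prior_mean = 1 / 2        # condition 1: expect a fair coin a priori
ess = 4                   # condition 2: effective sample size matches n = 4

alpha = prior_mean * ess  # from mean = alpha / (alpha + beta) and alpha + beta = ess
beta = ess - alpha
print(alpha, beta)        # -> 2.0 2.0
```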

The graph

Using the base \(R\) persp function,1 the figure below visualizes the prior distribution and the five possible posteriors in three dimensions. It is the opening 2D image, now in 3D.2
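The post's R code is not shown, but a rough analogue can be sketched with matplotlib's 3D axes: the Beta(2, 2) prior on the base plane, and the posterior for \(k\) heads lifted to height \(k\), matching the bumper-jack image (the exact plane heights in the original figure are a guess):

```python
# Hypothetical Python/matplotlib analogue of the post's R persp() figure.
import math
import matplotlib
matplotlib.use("Agg")               # render off-screen
import matplotlib.pyplot as plt
import numpy as np

def beta_pdf(x, a, b):
    """Beta(a, b) density at x (vectorized over a numpy array)."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

theta = np.linspace(0, 1, 201)
fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# base plane: the Beta(2, 2) prior
ax.plot(theta, beta_pdf(theta, 2, 2), zs=0, zdir="y")

# one plane per possible count of heads k in n = 4 flips, at height k,
# so the all-heads posterior Beta(6, 2) sits on the top plane at height 4
for k in range(5):
    ax.plot(theta, beta_pdf(theta, 2 + k, 2 + 4 - k), zs=k, zdir="y")

ax.set_xlabel(r"$\theta$")
ax.set_ylabel("heads observed (plane height)")
ax.set_zlabel("density")
fig.savefig("beta_planes.png")
```

Note that the prior and the zero-heads posterior Beta(2, 6) share the base plane in this layout.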

Takeaways

A pair of quick observations

  • Four heads in four flips would give us reason to believe the coin is biased towards heads – illustrated by the shifting of the mode from the center of the prior to the right in the red posterior at the top.
  • Two heads in four flips would increase our belief in the fairness of the coin – illustrated by the movement of mass away from the tails of the prior toward the center of the green posterior in the middle.

Arguably those points could have been made from the opening 2D graph. However, the separateness of the planes seems to lend greater emphasis to the individuality of the posteriors.

New insight

  • Each posterior can be envisioned as a contour on a three-dimensional surface.
  • The space between the planes suggests the existence of latent factors driving the potential real-life outcomes.

Some closing questions

Instead of the top plane being at a height of 4, what if we had 40 or 400 or 4000 data points? How would the surface change?

If we were more confident in the a priori fairness of the Firenze coin, how would the surface change?

Look for answers in a future post.


The end of business today, 12/20/2020, marks 251.645 months since the end of the last millennium.
Generated with Rmarkdown in RStudio.


R Environment

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mondate_0.10.01.02

loaded via a namespace (and not attached):
 [1] compiler_4.0.3  magrittr_1.5    tools_4.0.3     htmltools_0.5.0
 [5] yaml_2.2.1      stringi_1.4.6   rmarkdown_2.3   knitr_1.28     
 [9] stringr_1.4.0   xfun_0.14       digest_0.6.25   rlang_0.4.6    
[13] evaluate_0.14  

  1. Color scheme by Paul Tol.↩︎

  2. The image at the top of this post was created from the same \(R\) code with two differences – persp arguments phi = 90 and r = sqrt(10000) – taking the viewer to a 10000-foot perspective, effectively producing the 2D prior/posterior combination graphs more commonly seen (as here) – although most of the combination graphs you find will be of multiple priors and only one of the many possible Binomial outcomes rather than, in this post, the other way around.↩︎