# Illustrate Bayesian Inference

GOAL: Illustrate Bayesian inference.

Probability is the codification of logical thought in the presence of uncertainty. It is the ONLY consistent system for logical inference under uncertainty (TODO: Cox's Theorem, Fisher information, the Cramer-Rao bound, coordinate-free invariant information geometry via KL).

When interpreting the world you get observations $O$ and you want to decide what model of truth $T$ best explains those observations. This inference of model from data sometimes allows you to understand and predict things about the world (The Scientific Method). To do this inference, you run experiments to determine what you observe given that the truth is fixed and known, $P(O|T)$, the likelihood / density (interpreting it as a function of $T$ / of $O$). Bayes rule then turns that around to compute $P(T|O)$, the posterior (after-data) probability of the true hypothesis given what you observe, using your prior $P(T)$, your belief about what might be true before seeing any data:

$$P(T|O) = P(O|T) P(T) / P(O)$$

This is inspired by MacKay's book, in which he shows the simple intuitive beauty of Bayes rule: <a href="http://www.inference.org.uk/mackay/itprnn/ps/47.59.pdf">http://www.inference.org.uk/mackay/itprnn/ps/47.59.pdf</a>

For example, say I want to predict the species of a given tree based on its height.

<img src="redmaple2__58290.1538684018__32799.1539281421.jpg">
<img width="300px" src="Kousa_Dogwood_1_540x.jpg">

First I construct some probability distributions. I go out and look at only maple trees and get their heights. Then I look at only dogwood trees and get their heights. I capture my observations in the functions $P(height|maple), P(height|dogwood)$. Maples are generally taller (70 feet) than dogwoods (30 feet). Now I have a tree that is 69 feet; what species is it? I want to compute the posterior probability of it being each species of tree: $P(maple | 69 feet), P(dogwood | 69 feet)$. I might also have a prior that I will see more maples because I'm farther north. I look at which posterior has greater probability and call that species of tree. Skipping all the details, this procedure would likely call Maple because 69 feet is much closer to the Maple average of 70 feet than the Dogwood average of 30 feet. The details make this all very precise and quantitative.

Probability distributions force you to place rational bets across all possibilities. This lets you precisely weigh how often different events happen. Bayesian probability means you write down the probability of everything. After that it is _mechanical_ to infer different quantities.

================================
# Simple Binomial Example

Let's work through a simple example: <a href="https://en.wikipedia.org/wiki/Binomial_distribution">https://en.wikipedia.org/wiki/Binomial_distribution</a>

Flip a coin $N$ times where the probability of heads is $p$. How often do you see $k$ heads, where $k$ is in $\{0, 1, ..., N\}$? The binomial distribution tells you this:

$$ P(k|N,p) = choose(N, k) \; p^k (1-p)^{(N-k)} $$

The three terms are 1) the probability of heads raised to the number of observed heads $(p^k)$, 2) the probability of tails raised to the number of observed tails $((1-p)^{(N-k)})$, and 3) the number of ways to choose the observed heads out of the total flips, the binomial coefficient $(choose(N,k) = N!/(k! \, (N-k)!))$. These distributions are carefully constructed to sum to 1.0. The powers of probability arise simply because probabilities multiply for independent events. This is mathematical truth. If you flip coins independently, the binomial describes the outcomes exactly.
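As a quick sanity check (a minimal sketch; the values $N=10$ and $p=0.3$ are just illustrative and match the data generated in the next section), the formula written out term by term agrees with R's built-in `dbinom` and places bets that sum to 1.0 over all possible counts of heads:

```
N = 10
p = 0.3
k = 0:N

# the pmf written out term by term: choose(N,k) * p^k * (1-p)^(N-k)
byhand = choose(N, k) * p^k * (1-p)^(N-k)

# agrees with R's built-in binomial density
all.equal(byhand, dbinom(k, size=N, prob=p))   # TRUE

# rational bets over every possible outcome sum to 1
sum(byhand)                                    # 1
```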
================================
# Generate Some Data

Let's generate some data from the model using R. Binomial data is generated using "rbinom". Set the model parameters to $N=10$ flips and $p=0.3$ (30% chance of heads). Let's run 100 experimental trials.

```
N=10
p=0.3
trials=100
observed = rbinom(trials, size=N, prob=p)

png("tableobs.png")
plot(table(observed))
title("counts of observed values")
dev.off()
```

<img src="tableobs.png">

The count of about 30 at observed=3 says that out of the 100 experiments you ran, ~30 of those experiments showed 3 heads and 7 tails when you flipped the coin ten times.

================================
# Inference

OK, we have some data. Let's forget that we know the probability of heads that generated the data (0.3) and infer its value from the data itself, knowing only that we flipped a coin 10 times in each of 100 trials. To make things simple, let's assume the true probability is quantized so that it can only take on multiples of 1/10. There are nine different hypotheses to test:

```
hyp = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
```

We want to know the posterior probability of each hypothesis given the data: after you are given the data, what is the probability of each of the hypotheses? You can use this and do things like take the maximum as the best bet for which hypothesis is correct.

How do we do that? Write down Bayes rule:

$$P(T|O) = P(O|T) P(T) / P(O)$$

Here $T$ or truth ranges over the possible hypotheses and $O$ is the observed data. Let's fill out the terms.

We assume a model. Here it is binomial, but it could be any model that assigns a probability to a data point given some parameters of the model. (Note: here the model "reflects" reality. What happens when the model or class of models is limited and doesn't exactly reflect reality? In machine learning, is your hypothesis concept class capable of representing the problem?) The R binomial probability function is "dbinom".

$$P(O|T) = P(k|N,p) = choose(N,k) \; p^k (1-p)^{(N-k)} = dbinom(k, size=N, prob=p)$$

The observed data is $k$, the number of heads, and the truth is the setting of the model parameters $N, p$. When you sum over all possibilities of the observed data (0:N) you get 1.0, meaning you've placed rational bets on all possibilities.

$P(T)$ is the prior. This is your a-priori belief that each of the hypotheses is true. In our case we don't have any real information and therefore assign uniform probability over the nine hypotheses.

$P(O)$ is a normalizing factor that makes sure the posterior sums to 1.0 over the different hypotheses. It is $\sum_i P(O|T_i) P(T_i)$.

We have all the ingredients!
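Before building out the full likelihood table in the next section, here is a minimal sketch of the whole recipe for a single hypothetical observation (say $k=3$ heads out of the 10 flips) with a uniform prior over the nine hypotheses:

```
k = 3                                    # one hypothetical observation: 3 heads out of N flips
prior = rep(1/9, 9)                      # uniform prior P(T) over the nine hypotheses
like = dbinom(k, size=N, prob=hyp)       # likelihood P(O|T) under each hypothesis
evidence = sum(like * prior)             # P(O), the normalizing factor
posterior = like * prior / evidence      # Bayes rule: P(T|O)
sum(posterior)                           # 1, a proper distribution over hypotheses
```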
================================
# Inference Calculation: Likelihoods

Compute some numbers! Both the likelihood and the posterior are 2d arrays indexed by $[T, O]$. In this case, $O$ can take on 11 possible values (0:10, the number of heads out of 10 flips) and $T$ can take on 9 values (0.1:0.9, the 9 hypothesis values that we decided to consider). Let's use R to compute these things.

```
# storage [hyp, obs]
likelihood = matrix(nrow=9, ncol=11)

# step through all hypotheses
for (hypii in 1:length(hyp)){

  # step through all possibilities for running one trial,
  # the number of observed heads
  obspossible = seq(0,N)
  for (obsii in 1:length(obspossible)){
    # compute the likelihood for this observed, hypothesis combination
    likelihood[hypii,obsii] = dbinom(obspossible[obsii], size=N, prob=hyp[hypii])
  }
}

write.table( likelihood, file="likelihood.tsv", sep="\t", col.names=F, row.names=F)
write.table( hyp, file="likelihood_rowval.tsv", sep="\t", col.names=F, row.names=F)
write.table( obspossible, file="likelihood_colval.tsv", sep="\t", col.names=F, row.names=F)
```

OK, now we have the likelihood for each possible hypothesis over every possible observed outcome using a simple mechanical computation. Let's make sure that our distributions properly sum to 1.0. The sum across the columns should be 1.0 for each of the 9 hypotheses:

```
apply(likelihood, 1, sum)
[1] 1 1 1 1 1 1 1 1 1
```

GOOD! Each hypothesis places a distribution over all observed possibilities that sums to 1.0.

Let's look at all the hypotheses:

```
for (hh in 1:length(hyp)){
  fn = paste("hypoth.",hh,".png",sep="")
  png(fn)
  plot(obspossible, likelihood[hh,],type="b")
  title(paste("hypothesis p=",hyp[hh]))
  dev.off()
}

mycol=rainbow(9)
png("hypoth.all.png")
plot(obspossible, likelihood[1,],col=mycol[1],type="b",ylim=c(0,0.4))
for (hh in 1:length(hyp)){
  points(obspossible, likelihood[hh,],col=mycol[hh],type="b",ylim=c(0,1))
}
dev.off()
```

These are plots of the likelihood over the observations 0:10 given the hypothesis probability.

<img src="hypoth.all.png">

SlideIt: <input id="val1" type="range" min="1" max="9" value="1" step="1" oninput="showVal1(this.value)" onchange="showVal1(this.value)" /> <span id="range">1</span> <img id="img1" src="hypoth.1.png">

Each plot shows, for a particular hypothesis probability of heads (say 0.1) and 10 coin flips, the probability distribution over the outcomes of 0 heads, 1 head, ..., 10 heads under the probabilistic model. For the binomial with p=0.1 it says there is about a 40% probability of seeing 1 head out of the 10 flips.

================================
# Likelihood Dual / Posterior

Let's look at the dual, along the other axis of the matrix we constructed. This is the value of the likelihood at each hypothesis (0.1, ..., 0.9) given a single observation from 0:10. First note that they do not sum to 1.0:

```
apply(likelihood, 2, sum)
[1] 0.4914342 0.8287079 0.9061899 0.9099927 0.9091388 0.9090730 0.9091388
[8] 0.9099927 0.9061899 0.8287079 0.4914342
```

```
for (oo in 1:length(obspossible)){
  fn = paste("obs.",oo,".png",sep="")
  png(fn)
  plot(hyp, likelihood[,oo],type="b")
  title(paste("obs=",obspossible[oo]))
  dev.off()
}

mycol=rainbow(11)
png("obs.all.png")
plot(hyp, likelihood[,1],col=mycol[1],type="b",ylim=c(0,0.4))
for (oo in 1:length(obspossible)){
  points(hyp, likelihood[,oo],col=mycol[oo],type="b",ylim=c(0,0.4))
}
dev.off()
```

These are the un-normalized distributions over the different hypotheses given a single observation obs:

<img src="obs.all.png">

SlideIt: <input id="val2" type="range" min="1" max="11" value="1" step="1" oninput="showVal2(this.value)" onchange="showVal2(this.value)" /> <span id="range2">1</span> <img id="img2" src="obs.1.png">

Note that these are very similar to the likelihood plots from before. There is a dual symmetry. The dual conjugate to the <a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a> is the <a href="https://en.wikipedia.org/wiki/Beta_distribution">beta distribution</a>. The distributions are EXACTLY the same except for the normalizing factor and interpretation! (Which might be hard to tell from the equations at first glance.)
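To see that the shapes really do match, here is a minimal check (assuming the same $N=10$ and an observation of $k=3$ heads): as a function of $p$, the binomial likelihood and the beta density with shape parameters $k+1$ and $N-k+1$ differ only by the constant factor $N+1$.

```
k = 3
pgrid = seq(0.01, 0.99, by=0.01)

# the binomial likelihood as a function of p (a "dual" slice) ...
likslice = dbinom(k, size=N, prob=pgrid)

# ... versus the beta density with shape parameters k+1 and N-k+1
betadens = dbeta(pgrid, shape1=k+1, shape2=N-k+1)

# the ratio is the constant N+1 everywhere, i.e. the shapes are identical
all.equal(betadens, (N+1)*likslice)   # TRUE
```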
Each plot shows, for a particular observation from my experiment (say 0 heads out of 10 flips), the likelihood over the different hypotheses (0.1, 0.2, ..., 0.9). This is the dual of the likelihood plots shown first. We are taking slices through the probability distributions:

<img style="box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);" width=400px src="bayesSlice.svg.png">

Note these "dual" distributions don't sum to 1.0! They are still values between 0 and 1 and can be compared against each other (viz: examine the maximum value) because we've taken care to make sure our models return proper probability distributions. However, we can't interpret them as probability _distributions_ yet.

***This is very important!*** How can we interpret these numbers as probability distributions? Bayes rule says to simply multiply by the prior and normalize.

- Going one way, we can propose multiple hypothesis models to make predictions about how the outcome of an experiment will turn out (the likelihood).
- We can also quantify how much we believe each model is likely to be true (the prior).
- We can then run experiments and let Nature tell us the results (the observations).
- We can then mechanically plug these resulting observations into our models and make _inferences_ about how likely each of the different models is to be ***true*** (the posterior).
- This ***is*** the scientific method and codifies logical thought under uncertainty.

Note that for a uniform prior, multiplying by the prior and normalizing is the same as dividing by the sum: the prior cancels out and does not change the shape. For a non-uniform prior, more probability will be put on hypotheses with higher a-priori probability. We do this in the Informative Prior section below.

Rather than cuts parallel to the two axes, you can also plot the likelihood as a 2d image:

```
png("like.2d.png")
image(likelihood)
dev.off()
```

<img src="like.2d.png">

Or in 3d: <a href="three.html">three.html</a>

<iframe src="three.html" width=800 height=800></iframe>

================================
# Uniform Prior

Say we see a single observation of 3 out of 10. First let's pull out the probability of observing a single 3 (out of 10) under the different hypotheses (obspossible[4] = 3 because the observations start at 0):

```
Pobs3 = likelihood[,4]
[1] 0.057395628 0.201326592 0.266827932 0.214990848 0.117187500 0.042467328
[7] 0.009001692 0.000786432 0.000008748

sum(Pobs3)
[1] 0.9099927

which.max(Pobs3)
3

png("Pobs3.png")
plot(hyp, Pobs3,type="b")
dev.off()
```

<img src="Pobs3.png">

Good! If you observe a single observation of 3 (out of 10), then the hypothesis with the highest likelihood is p=0.3 with probability 26.7%. This is the maximum likelihood choice. The next most likely is p=0.4 at 21.5%.
Here we can compare probabilities through things like the maximum even though they do not sum to one and are not a distribution.

Let's look at the true posterior with a uniform prior, properly normalized:

```
sum(Pobs3)
[1] 0.9099927
# doesn't sum to 1 because this is a sum over hypotheses

# uniform prior and normalize
prior = rep(1/9,9)
posterior = prior*Pobs3
posterior = posterior/sum(posterior)
posterior
[1] 6.307262e-02 2.212398e-01 2.932199e-01 2.362556e-01 1.287785e-01
[6] 4.666777e-02 9.892049e-03 8.642179e-04 9.613264e-06

png("Pobs3post.png")
plot(hyp, posterior,type="b")
dev.off()
```

<img src="Pobs3post.png">

As you can see, with a uniform prior the posterior has changed scale but the shape remains unchanged. Now summing over all hypotheses gives 1.0, so it is a proper distribution.

Note that getting the "partition function" normalization correct, so everything sums to 1.0, is key if you want probabilities. With more complicated models this can sometimes be tricky. So much so that fields like Bayes decision theory construct functions that look at ratios so that these normalizations cancel out.

With a real probability distribution, you can also talk about the maximum a posteriori (MAP) choice, which for a uniform prior is the maximum likelihood choice. There is also the mean posterior estimate as an alternative to the maximum likelihood estimate:

```
sum(hyp*posterior)
[1] 0.3330378
```

The mean posterior estimate is 0.333, which is close to the maximum likelihood estimate. Note it doesn't lie right on a hypothesis (0.3 or 0.4). This is the Bayesian way. You keep around the distributions of everything, carrying full information, and then possibly take reductions like the max or mean if you want a single estimate.

================================
# Informative Prior

Let's say we believe a-priori that the coin is biased and will come up heads 30% of the time. Let's put a prior probability of 76% on the 0.3 hypothesis and spread the rest uniformly over the others.

```
hypPrior = c( 0.03, 0.03, 0.76, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03 )

posterior = likelihood # set up storage
for (hypii in 1:length(hyp)){
  for (obsii in 1:length(obspossible)){
    posterior[hypii,obsii] = likelihood[hypii,obsii]*hypPrior[hypii]
  }
}
for (obsii in 1:length(obspossible)){
  # normalize
  posterior[,obsii] = posterior[,obsii]/sum(posterior[,obsii])
}

for (oo in 1:length(obspossible)){
  fn = paste("post3.",oo,".png",sep="")
  png(fn)
  plot(hyp, posterior[,oo],type="b")
  title(paste("prior=3 76% obs=",obspossible[oo]))
  dev.off()
}

apply(posterior,2,sum)
[1] 1 1 1 1 1 1 1 1 1 1 1
```

SlideIt: <input id="val3" type="range" min="1" max="11" value="1" step="1" oninput="showVal3(this.value)" onchange="showVal3(this.value)" /> <span id="range3">1</span> <img id="img3" src="post3.1.png">

The sums over hypotheses equal 1. You can see how the prior really favors the 0.3 hypothesis. Only when you see an observation of 7 or greater does the likelihood overcome the prior and change the most likely call.

================================
# Multiple Observations Of Same Outcome

Before, we had the probability given a single observation: $P(obs | hyp)$, or likelihood[hyp,obs]. For multiple observations, if you assume independence then you can multiply them together:

$$P(\{obs\} | hyp ) = \prod_n P(obs_n | hyp)$$
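As a minimal sketch of this product (the particular set of observations c(3, 2, 4, 3, 3) is made up purely for illustration), the joint likelihood of several independent observations under each hypothesis is just the product of the single-observation likelihoods:

```
myfive = c(3, 2, 4, 3, 3)   # a hypothetical set of 5 independent observations (heads out of 10)

# product over the observations of P(obs_n | hyp), for each of the 9 hypotheses
jointlike = sapply(hyp, function(hypp){ prod(dbinom(myfive, size=N, prob=hypp)) })
signif(jointlike, 3)

hyp[which.max(jointlike)]   # 0.3, the maximum likelihood hypothesis
```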
(Multiplying makes probabilities smaller because they are less than or equal to 1.0. With a large number of observations, these probabilities can get very small (think 10^-100 for 100 observations each at 10% probability). Are they still a distribution when they are so small? Yes, because the outcome space is much larger for multiple observations: you have to sum over all possibilities of _all_ observations. See the appendix.)

Bayes rule applies in the same way to multiple observations. Let's say you see 5 observations of 3. What is the likelihood under the 9 different hypotheses for these multiple observations? It is just the probability of 3 under each hypothesis (Pobs3 = $P(3|H_i)$) raised to the 5th power, because the observations are independent. For example, for just one hypothesis:

$$ P( (3,3,3,3,3) |H) = P(3|H) P(3|H) P(3|H) P(3|H) P(3|H) = (P(3|H))^5 $$

```
Pobs3^5
[1] 6.228652e-07 3.307545e-04 1.352560e-03 4.593036e-04 2.210072e-05
[6] 1.381258e-07 5.910453e-11 3.008194e-16 5.123230e-26

# With uniform prior, shape is UNCHANGED
prior = rep(1/9,9)
posterior = prior*Pobs3^5
posterior = posterior/sum(posterior)
[1] 2.876338e-04 1.527396e-01 6.246005e-01 2.121025e-01 1.020592e-02
[6] 6.378530e-05 2.729397e-08 1.389158e-13 2.365864e-23

png("Pobs3pow5.png")
plot(hyp, posterior,type="b")
points(hyp, Pobs3/sum(Pobs3) ,type="b",col="red")
title("Black: posterior 5 observations of 3/10.\nRed: posterior of single 3/10 observation")
dev.off()
```

<img src="Pobs3pow5.png">

The maximum likelihood choice is again p=0.3, with likelihood 1.35e-03. But you can see that the likelihood is peaking up on 0.3 and falling off rapidly on the other hypotheses because of the multiplication. The posterior is 62.5% at the 0.3 hypothesis.

With a large number of 3-observations, say 100, almost all the probability falls on 0.3. Note that with such small probabilities you must be careful of floating point underflow. In this case computing the posterior puts machine 100% probability on 0.3 with "rounding error" at the other hypotheses. As sketched below (and used again later via -log2), you usually solve this problem by storing log(probability), where multiplication becomes addition, and addition requires moving in and out of log space with exp and log functions (slower than multiplying probabilities).

```
Pobs3^100
[1] 7.724693e-125  2.455354e-70  4.198837e-58  1.745811e-67  7.729100e-94
[6] 6.389894e-138 2.706543e-205 3.682287e-311  0.000000e+00

prior = rep(1/9,9)
posterior100 = prior*Pobs3^100
posterior100 = posterior100/sum(posterior100)
[1]  1.839722e-67  5.847701e-13  1.000000e+00  4.157844e-10  1.840772e-36
[6]  1.521825e-80 6.445934e-148 8.769779e-254  0.000000e+00

sum(posterior100)
[1] 1

png("Pobs3pow5c100.png")
plot(hyp, posterior100,type="b",col="blue")
points(hyp, Pobs3/sum(Pobs3) ,type="b",col="red")
points(hyp, posterior ,type="b")
title("Black: posterior 5 observations of 3/10.\nRed: posterior of single 3/10 observation\nBlue: 100 observations")
dev.off()
```

<img src="Pobs3pow5c100.png">
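Here is a minimal sketch of that log-space trick (assuming the same Pobs3 and a uniform prior): store log-probabilities, add instead of multiply, and subtract the maximum before exponentiating (the log-sum-exp step) so the normalization never underflows:

```
logPobs3 = log(Pobs3)              # log P(3 | hyp) for the 9 hypotheses
loglike100 = 100 * logPobs3        # log-likelihood of 100 independent observations of 3
logprior = log(rep(1/9, 9))

logpost = loglike100 + logprior    # unnormalized log posterior
logpost = logpost - max(logpost)   # shift so the largest term is 0 (avoids underflow)
post = exp(logpost)/sum(exp(logpost))
round(post, 12)                    # essentially all of the mass sits on the 0.3 hypothesis
```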
In general, the likelihood under these multinomial-style counts is each outcome's probability raised to the number of times that outcome was observed:

$$P( obs | hyp) = \prod_i (hypP_i) ^ {obs_i}$$

================================
# Multiple Observations Of Different Outcomes

We've seen that if you observe the same value several times, the likelihood peaks up on the hypothesis that was most probable for the single observation. But what if your observations land at different points? Where does the likelihood peak up then?

Say you observed the following observations after running 102 different experiments:

```
myobs = c(3,11,24,27,20,10,4,2,1,0,0)
sum(myobs)
[1] 102

cbind(obspossible,myobs)
      obspossible myobs
 [1,]           0     3
 [2,]           1    11
 [3,]           2    24
 [4,]           3    27
 [5,]           4    20
 [6,]           5    10
 [7,]           6     4
 [8,]           7     2
 [9,]           8     1
[10,]           9     0
[11,]          10     0
```

What is the likelihood of this ensemble of data under the different hypotheses? For each hypothesis, take the product of the probability that hypothesis assigns to each outcome, raised to the number of times that outcome is seen in the data.

```
mylike = sapply(1:9, function(hh){ prod(likelihood[hh,]^myobs)})
[1] 1.447318e-156  1.882205e-97  1.239293e-82  2.775077e-90 2.104950e-115
[6] 1.317852e-158 2.080566e-225  0.000000e+00  0.000000e+00

mylike/sum(mylike)
[1]  1.167858e-74  1.518773e-15  1.000000e+00  2.239241e-08  1.698509e-33
[6]  1.063389e-76 1.678832e-143  0.000000e+00  0.000000e+00

-log2(mylike)
[1] 517.6874 321.3146 272.0886 297.5010 380.9479 524.4665 746.3768      Inf
[9]      Inf
```

The maximum likelihood hypothesis is p=0.3, which is about 8 orders of magnitude more likely than p=0.4 given the data (the posterior gets machine 100% due to underflow). Because probabilities get so small with a larger number of observations, taking -log2 not only makes the numerics easier but also can be interpreted as bits. Even here, you run into underflow issues that make normalization tricky.

Here p=0.3 has (297.5010 - 272.0886) = 25.4 bits in support of it over the p=0.4 hypothesis. Roughly reading this as a log odds ratio, the probability of making a classification error is about $2^{-25.4} = 2.3E-08$. Thinking in bits is useful! When there are two hypotheses, examining $LR = \log_2 (P_1/P_2)$, the log probability ratio, has nice properties in Bayes decision theory. You decide $P_1$ is the winner if $LR$ is positive, otherwise $P_2$. $LR$ gives the bits in support of one hypothesis over the alternative.

================================
# Maximum Likelihood and KL

From before, we saw the likelihood from multiple independent observations:

$$P( obs | hyp) = \prod_i (hypP_i) ^ {obs_i}$$

Say $hypP$ is one of a set of hypotheses $\{hypA, hypB, ...\}$. Maximum likelihood is:

$$\max_X \prod_i (hypX_i) ^ {obs_i}$$

Take the negative log; we are now minimizing, since log is monotonic and the negative sign flips max to min:

$$\min_X - \sum_i obs_i \log (hypX_i)$$

Divide by the total number of observations $N$, a constant:

$$\min_X - \sum_i (obs_i/N) \log (hypX_i)$$

Introduce the Kullback-Leibler divergence between distributions $P$ and $Q$:

$$KL(P|Q) = \sum_i P_i \log( P_i/Q_i )$$
$$KL(P|Q) = \sum_i P_i \log( P_i) - \sum_i P_i \log( Q_i )$$

Fix $P$ to be the truth and find the hypothesis $hypX$ with minimum KL to the truth:

$$\min_X KL(P|hypX) = \sum_i P_i \log( P_i) - \sum_i P_i \log( hypX_i )$$

Note the first term, the entropy of $P$, is constant:

$$\min_X - \sum_i P_i \log( hypX_i )$$

This is exactly maximum likelihood except that $obs_i/N$ is replaced by $P_i$. As $N$ gets big, $(obs_i/N) \rightarrow P_i$ by the strong law of large numbers. Finding the maximum likelihood hypothesis given the observed set of data is the same as minimizing the KL divergence between each hypothesis' distribution and the empirical distribution of the observed data.
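A minimal numerical check of that last statement, using the `myobs` counts from the previous section: the cross-entropy $-\sum_i (obs_i/N) \log_2 (hypX_i)$ is just the negative log-likelihood divided by the constant $N=102$, so it picks the same winning hypothesis.

```
obsfreq = myobs/sum(myobs)     # the empirical distribution obs_i / N

# cross-entropy of each hypothesis against the empirical distribution, in bits
crossent = sapply(1:length(hyp), function(hh){ -sum(obsfreq*log2(likelihood[hh,])) })

# matches -log2(mylike)/N, except where mylike underflowed to exactly 0 (giving Inf)
cbind(hyp, crossent, bitsPerObs = -log2(mylike)/sum(myobs))

which.min(crossent)   # 3, the same winner as maximum likelihood: p = 0.3
```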
The nice thing about thinking in terms of KL minimization is that you can apply a bound from Hoeffding related to Bayes decision theory:

$$ P(\mbox{Err}) = P\left( \log \frac {p_1(O)}{p_2(O)} > 0 \;\middle|\; O \sim p_2 \right) < \exp(-2N \, KL(p_2, p_1)^2) $$

This says that the probability of making a misclassification error, by choosing the maximum posterior through the log posterior ratio, goes down exponentially fast with the number of data points $N$ and the squared KL divergence between the competing hypotheses $p_1, p_2$. So with large data, and hypotheses that don't have the same shape, you will make very few errors.

So taking your observations, estimating a distribution from them, and comparing that distribution through KL to a set of competing distributions is the same as maximum likelihood! KL is looking for the same shape of distribution. Let's compute KL for the data in the previous section:

```
kl = function(pp, qq){ sum(pp*log2(pp/qq)) }

# estimate a distribution from the observed data with a 0.01 Laplace pseudocount
myobsP = (myobs+0.01)/sum(myobs+0.01)

# compute KL from the observed distribution to each hypothesis distribution
sapply(1:length(hyp), function(hh){ kl(myobsP, likelihood[hh,])})
[1]  2.43565192  0.50803132  0.02383006  0.27166534  1.08857409  2.49441972
[7]  4.66870578  8.10767875 14.48080813

png("kl.png")
plot(obspossible,likelihood[2,],type="b",col="red")
points(obspossible,likelihood[3,],type="b",col="green")
points(obspossible,likelihood[4,],type="b",col="blue")
points(obspossible,myobsP,type="b",lwd=2)
title("data = black. hypothesis=0.3=green 0.4=blue 0.2=red\nKL(black,green)=0.0238 ,blue)=0.27 ,red)=0.5")
dev.off()
```

<img src="kl.png">

Yes, the minimum KL is the maximum likelihood answer! You can see that green and black have almost exactly the same shape. So with multiple observations over several values, the likelihood peaks up on the hypothesis that has the most similar shape as measured by KL!

TODO: what if the data has nothing to do with your models and the "fit" is horrible?

KL is the number of bits required to communicate the differences between two distributions. This example has 102 data points. Let's compute the number of bits between the top two hypotheses:

```
102*(0.27166534-0.02383006)
[1] 25.2792
```

The top hypothesis has 25.2792 bits in support of it over the 2nd alternative as measured by KL. This is very close to the maximum likelihood difference of 25.4 bits! Min KL is max likelihood. TODO: why not exactly the same.

================================
# Appendix: Multiplying Independent Distributions is Still a Distribution

Let's consider multiple Bernoulli trials of flipping a biased coin: $P(H)=90\%$, $P(T)=10\%$.

For a single flip of a single coin it is obviously a distribution: $\sum_{c_1} P(c_1) = 0.9 + 0.1 = 1.0$.

For two independent flips we want the joint, a 2d matrix covering the possibilities [[HH, HT],[TH,TT]], where we multiply two Bernoullis together:

$$P(c_1,c_2) = P(c_1) P(c_2) = [[0.9 \times 0.9, \; 0.9 \times 0.1], [0.9 \times 0.1, \; 0.1 \times 0.1]]$$

The probabilities get smaller because of the multiplying, but there are more terms in the sum to account for the different coins. It is a distribution: $\sum_{c_1,c_2} P(c_1,c_2) = 0.81 + 0.09 + 0.09 + 0.01 = 1.0$.

For 3 coins the joint is a 3d matrix. For $N$ coins the joint is an N-d matrix, which can be very large. For 100 coins there would be $2^{100}$ terms in the sum to evaluate!
However, note that for Bernoulli flips the probabilities depend only on the number of heads and tails:

$$\sum_{c_1,c_2} P(c_1,c_2) = 0.9^2\, 0.1^0 + 0.9^1\, 0.1^1 + 0.9^1\, 0.1^1 + 0.9^0\, 0.1^2$$
$$= choose(2,2)\, 0.9^2\, 0.1^0 + choose(2,1)\, 0.9^1\, 0.1^1 + choose(2,0)\, 0.9^0\, 0.1^2$$

This is exactly the binomial distribution! It sums to 1.0 by the binomial theorem:

$$ 1 = 1^N = (p + (1-p))^N = \sum_{k=0}^N choose(N, k)\, p^k (1-p)^{(N-k)} $$

This is the motivation for why multiplying independent distributions is still a distribution: it sums to 1 over all possibilities.
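As a small numerical sketch of the appendix argument (the values $N=4$ coins and $P(H)=0.9$ are just illustrative): enumerate every one of the $2^N$ head/tail patterns, multiply the independent per-flip probabilities, and confirm that the joint sums to 1.0 and that grouping the patterns by their number of heads recovers the binomial pmf.

```
p = 0.9
N = 4

# every one of the 2^N head/tail patterns (1 = heads, 0 = tails)
outcomes = expand.grid(rep(list(c(0,1)), N))

# joint probability of each pattern: product of independent per-flip probabilities
joint = apply(outcomes, 1, function(flips){ prod(ifelse(flips==1, p, 1-p)) })
sum(joint)                        # 1, still a distribution

# grouping the patterns by their number of heads recovers the binomial pmf
tapply(joint, rowSums(outcomes), sum)
dbinom(0:N, size=N, prob=p)       # the same values
```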