Chapter 5 Probability Distributions
This chapter briefly summarizes several probability distributions that are widely used in practice, together with their properties. The theoretical exposition follows Appendix B of (Bishop 2006), and the sampling method for each distribution follows (Lemieux 2009). For each distribution, we will generally proceed as follows.
- Definition of the probability distribution
- Computation of representative values
- Maximum Likelihood Estimation (MLE)
- Bayesian Update
- Random number generator (Algorithm & Programming with Rust)
For convenience, most proofs will be written in English.
5.1 Binary Variables
5.1.1 Bernoulli Distribution
Definition 5.1.1 - Bernoulli distribution |
Consider a single binary random variable \(\small x\in\{0,1\}\). Suppose the probability of \(\small x=1\) is given as \(\small p(x=1|\mu) = \mu\) where \(\small 0\leq \mu \leq 1\). Then \[\small\text{Bern}(x|\mu) = \mu^x(1 - \mu)^{1-x}\] is called the Bernoulli distribution. |
Property 5.1.2 - Representative values of Bernoulli distribution |
Let \(\small\text{Bern}(x|\mu)\) be given Bernoulli distribution. Then \[\small\begin{aligned} \mathbb{E}[x] &= \mu \\ \text{var}[x] &= \mu(1-\mu) \end{aligned}\] |
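Both values follow directly from Definition 5.1.1. Since \(\small x\) takes only the values 0 and 1, \[\small\begin{aligned} \mathbb{E}[x] &= 0\cdot(1-\mu) + 1\cdot\mu = \mu \\ \mathbb{E}[x^2] &= 0^2\cdot(1-\mu) + 1^2\cdot\mu = \mu \\ \text{var}[x] &= \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \mu - \mu^2 = \mu(1-\mu) \end{aligned}\]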
Theorem 5.1.3 - MLE for Bernoulli distribution |
Let \(\small\mathcal{D} = \{x_1, \cdots, x_N\}\) be i.i.d. data drawn from \(\small\text{Bern}(x|\mu)\). Applying maximum likelihood estimation, we obtain \[\small\mu_{ML} = \frac{1}{N}\sum_{n=1}^N x_n \] |
Proof for Thm 5.1.3
Under the i.i.d. assumption, the likelihood is given as \[\small p(\mathcal{D}|\mu) = \prod_{n=1}^N p(x_n | \mu) = \prod_{n=1}^N \mu^{x_n} (1 - \mu)^{1-x_n}\] For convenience, let's work with the log likelihood. \[\small\begin{aligned} \ln p(\mathcal{D}| \mu) = \sum_{n=1}^N \ln p(x_n | \mu) &= \sum_{n=1}^N \left\{x_n \ln \mu + (1-x_n)\ln(1-\mu) \right\}\\ &= \ln \mu \sum_{n=1}^N x_n + \ln(1-\mu) \sum_{n=1}^N (1-x_n) \\ &= (\ln\mu - \ln(1-\mu))\sum_{n=1}^N x_n + N \ln(1-\mu) \end{aligned}\] For MLE, we differentiate the log likelihood with respect to \(\mu\) and set it to zero. \[\small\frac{\partial}{\partial \mu}\ln p(\mathcal{D}|\mu) = \frac{1}{\mu}\sum_{n=1}^N x_n - \frac{1}{1-\mu}\sum_{n=1}^N (1 - x_n) = 0\] Solving for \(\small \mu_{ML}\) gives \[\small\therefore\,\mu_{ML} = \frac{1}{N}\sum_{n=1}^N x_n\]
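Since the MLE is just the sample mean, it is straightforward to compute in code. Below is a minimal sketch in plain Rust; the sample data is made up for illustration.

// MLE for Bernoulli: mu_ML is the sample mean of binary observations
fn bernoulli_mle(data: &[usize]) -> f64 {
    data.iter().sum::<usize>() as f64 / data.len() as f64
}

fn main() {
    // Hypothetical i.i.d. observations from Bern(x|mu)
    let data = vec![1, 0, 0, 1, 0, 0, 0, 1, 0, 0];
    println!("mu_ML = {}", bernoulli_mle(&data)); // prints 0.3
}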
Now let's implement Bernoulli sampling from the uniform distribution.
Algorithm 5.1.4 - Uniform to Bernoulli |
We want to generate \(\small X \sim \text{Bern}(x | \mu)\) for \(\small\mu \in [0, 1]\).
1. Generate \(\small U \sim \text{Unif}(0,1)\).
2. If \(\small U \leq \mu\), then \(\small X = 1\); else \(\small X = 0\). |
This simple algorithm can be written in Rust as follows.
// Generate 100 samples of Bern(x|0.1)
extern crate rand;
use rand::prelude::*;

fn main() {
    let sample_size = 100;
    let mut rng = thread_rng();
    let mut v = vec![0usize; sample_size];
    let mu: f64 = 0.1;
    for i in 0 .. sample_size {
        // 1. U ~ Unif(0, 1)
        let u = rng.gen_range(0f64, 1f64);
        // 2. X = 1 if U <= mu, else X = 0
        if u <= mu {
            v[i] = 1;
        } else {
            v[i] = 0;
        }
    }
    println!("Samples:\n{:?}", v);
}
The same result, using the existing library Peroxide, looks like this:
extern crate peroxide;
use peroxide::*;

fn main() {
    // Bern(x|0.1) via Peroxide
    let b = Bernoulli(0.1);
    let b_samples = b.sample(100);
    b_samples.print();
}
5.1.2 Beta Distribution
Definition 5.1.5 - Beta Distribution |
The Beta distribution is a family of continuous probability distributions defined on the interval \(\small[0, 1]\), parametrized by two positive shape parameters, denoted \(\small\alpha\) and \(\small\beta\): \[\small \text{Beta}(\mu | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta - 1}\] |
Property 5.1.6 - Normalization of Beta distribution |
\[\small \int_0^1 \text{Beta}(\mu | \alpha, \beta) d\mu = 1\] |
Proof for Prop 5.1.6
To prove this, it suffices to show the following relation. \[\small \int_0^1 \mu^{\alpha - 1}(1 - \mu)^{\beta - 1}d\mu = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}\] We start from the product of Gamma functions. \[\small\begin{aligned} \Gamma(\alpha)\Gamma(\beta) &= \int_0^\infty \mu^{\alpha - 1}e^{-\mu}d\mu \int_0^\infty \nu^{\beta - 1}e^{-\nu}d\nu \\ &= \int_0^\infty \int_0^\infty \mu^{\alpha - 1}\nu^{\beta-1}e^{-(\mu + \nu)} d\mu d\nu \\ &= \int_0^\infty \left(\int_\mu^\infty \mu^{\alpha-1}(t-\mu)^{\beta - 1} e^{-t} dt\right)d\mu \\ \end{aligned}\]
For the last line above, we used the substitution \(t = \mu + \nu,~ dt = d\nu\). Next, we change the order of integration and then substitute \(\mu = t\mu',~d\mu = t\,d\mu'\). \[\small\begin{aligned} \Gamma(\alpha)\Gamma(\beta) &= \int_0^\infty \left(\int_0^t \mu^{\alpha-1}(t-\mu)^{\beta-1}d\mu\right)e^{-t}dt \\ &= \int_0^\infty \left(\int_0^1 (t\mu')^{\alpha-1}t^{\beta-1}(1-\mu')^{\beta-1}t d\mu'\right)e^{-t}dt \\ &= \int_0^\infty \left(\int_0^1 t^{\alpha + \beta - 1} \mu'^{\alpha-1} (1- \mu')^{\beta-1} d\mu'\right)e^{-t}dt \\ &= \int_0^\infty t^{\alpha + \beta - 1} e^{-t}dt \cdot \int_0^1 \mu'^{\alpha-1}(1-\mu')^{\beta-1}d\mu' \end{aligned}\]
\[\small\therefore \int_{0}^1 \mu^{\alpha - 1}(1-\mu)^{\beta-1}d\mu = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}\]
This completes the proof.
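As a quick numerical sanity check of Prop 5.1.6 (not part of the original derivation), the sketch below integrates the Beta density with a simple trapezoidal rule; for integer shape parameters, \(\small B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)\) reduces to factorials.

// Numerically verify that Beta(mu | a, b) integrates to 1
// for integer shape parameters (here a = 3, b = 2).
fn factorial(n: u64) -> f64 {
    (1..=n).map(|k| k as f64).product()
}

fn main() {
    let (a, b) = (3u64, 2u64);
    // B(a, b) = (a-1)! (b-1)! / (a+b-1)!
    let beta_fn = factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1);
    let pdf = |x: f64| x.powi(a as i32 - 1) * (1.0 - x).powi(b as i32 - 1) / beta_fn;

    // Trapezoidal rule on [0, 1]
    let n = 10_000;
    let h = 1.0 / n as f64;
    let mut integral = 0.5 * (pdf(0.0) + pdf(1.0)) * h;
    for i in 1..n {
        integral += pdf(i as f64 * h) * h;
    }
    println!("Integral over [0,1] = {}", integral); // ~ 1.0
}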
Property 5.1.7 - Representative values of Beta distribution |
Let \(\small\text{Beta}(\mu | \alpha, \beta)\) be a given Beta distribution. Then \[\small\begin{aligned} \mathbb{E}[\mu] &= \frac{\alpha}{\alpha + \beta} \\ \text{var}[\mu] &= \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} \\ \text{mode}[\mu] &= \frac{\alpha - 1}{\alpha + \beta - 2} \end{aligned}\] where the mode requires \(\small \alpha, \beta > 1\). |
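For example, for \(\small\text{Beta}(\mu|3,2)\) (the same parameters used in the sampling code later in this section): \[\small \mathbb{E}[\mu] = \frac{3}{5} = 0.6,\quad \text{var}[\mu] = \frac{3\cdot 2}{5^2 \cdot 6} = 0.04,\quad \text{mode}[\mu] = \frac{2}{3}\]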
Proof for Prop 5.1.7
First, let's compute the expectation, using the normalization relation from Prop 5.1.6. \[\small\begin{aligned} \mathbb{E}[\mu] &= \int_0^1 \mu \cdot \text{Beta}(\mu|\alpha, \beta)d\mu \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 \mu \cdot \mu^{\alpha - 1}(1 - \mu)^{\beta - 1} d\mu \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 \mu^{\alpha}(1-\mu)^{\beta-1}d\mu \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha + 1)\Gamma(\beta)}{\Gamma(\alpha + \beta + 1)} \\ &= \frac{\alpha}{\alpha + \beta} \end{aligned}\]
Next, the variance, via \(\small\text{var}[\mu] = \mathbb{E}[\mu^2] - \mathbb{E}[\mu]^2\). \[\small\begin{aligned} \text{var}[\mu] &= \int_0^1 \mu^2 \cdot \text{Beta}(\mu | \alpha, \beta)d\mu - \left(\frac{\alpha}{\alpha + \beta}\right)^2 \\ &= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha + 2)\Gamma(\beta)}{\Gamma(\alpha + \beta + 2)} - \left(\frac{\alpha}{\alpha + \beta}\right)^2 \\ &= \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)} - \left(\frac{\alpha}{\alpha + \beta}\right)^2 \\ &= \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} \end{aligned}\]
To find the mode, we differentiate the Beta distribution under the condition \(\small \alpha, \beta > 1\). \[\small \frac{\partial}{\partial \mu}\text{Beta}(\mu | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \left((\alpha- 1) \mu^{\alpha - 2}(1 - \mu)^{\beta - 1} - (\beta - 1)\mu^{\alpha - 1}(1 - \mu)^{\beta - 2} \right)\]
Setting the derivative to zero and dividing by \(\small\mu^{\alpha-2}(1-\mu)^{\beta-2}\), we can find the value of \(\mu\) directly. \[\small\begin{aligned} &(\alpha - 1)(1 - \mu) - (\beta - 1)\mu = 0 \\ \Rightarrow ~ &\mu = \frac{\alpha - 1}{\alpha + \beta - 2} \end{aligned}\]
Algorithm 5.1.8 - Acceptance Rejection Method |
For a given pdf \(\small \varphi(x)\) defined on \(\small \mathcal{D}\), let \(\small t(x)\) be a function
such that \(\small t(x) \geq \varphi(x), ~\forall x \in \mathcal{D}\), with the following property:
\[\small T = \int_{\mathcal{D}} t(x) dx \geq \int_{\mathcal{D}} \varphi(x) dx = 1\]
Since \(\displaystyle\small r(x) \equiv \frac{t(x)}{T}\) is a density function, we sample from \(\small r(x)\) with the following procedure:
1. Generate \(\small Y\) having density \(\small r(x)\).
2. Generate \(\small U \sim \text{Unif}(0,1)\), independent of \(\small Y\).
3. If \(\displaystyle \small U \leq \frac{\varphi(Y)}{t(Y)}\), then return \(\small X=Y\); otherwise go back to step 1. |
Proof for Algorithm 5.1.8
Through the above three steps, we generate pairs \(\small (Y,U)\). To be accepted, a pair must satisfy \(\displaystyle\small U \leq \frac{\varphi(Y)}{t(Y)}\). Thus, the pdf of \(\small X\) is the pdf of \(\small(Y\,|\,U \leq \varphi(Y) / t(Y))\). Therefore,
\[\small P(X\leq x) = P\left(Y\leq x \left\lvert U\leq \frac{\varphi(Y)}{t(Y)}\right.\right) = \frac{P(Y\leq x, U\leq\varphi(Y)/t(Y))}{P(U\leq \varphi(Y)/t(Y))}\]
We can evaluate each factor on the RHS separately.
\[\small \begin{aligned} P\left(Y\leq x, U\leq \frac{\varphi(Y)}{t(Y)}\right) &= \int_{-\infty}^{x} P\left( U \leq \frac{\varphi(y)}{t(y)}\right) r(y) dy = \int_{-\infty}^{x} \frac{\varphi(y)}{t(y)}r(y) dy \\ &= \frac{1}{T}\int_{-\infty}^x \varphi(y) dy = \frac{F(x)}{T} \end{aligned}\]
\[\small P\left(U \leq \frac{\varphi(Y)}{t(Y)} \right) = \int_{-\infty}^{\infty} \frac{\varphi(y)}{t(y)}r(y) dy = \frac{1}{T}\]
Therefore the CDF of \(\small X\) becomes \(\small F(x)\), the CDF corresponding to \(\small \varphi(x)\).
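Note what the proof tells us about efficiency: since \(\small P(U \leq \varphi(Y)/t(Y)) = 1/T\), the number of trials per accepted sample is geometric with mean \(\small T\), so a tight envelope \(\small t(x)\) is desirable. For the constant envelope used on \(\small\text{Beta}(\mu|3,2)\) below, \(\small T = \text{Beta}(2/3\,|\,3,2) = 16/9 \approx 1.78\) trials per sample on average.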
Now that the proof is done, let's implement this algorithm.
extern crate rand;
extern crate peroxide;
use rand::prelude::*; // For RNG
use peroxide::special::function::beta; // For Beta function

fn main() {
    // For Beta(x | 3, 2)
    let a = 3f64;
    let b = 2f64;
    let n = 100usize; // For 100 samples

    // Two independent RNGs
    let mut rng1 = thread_rng();
    let mut rng2 = thread_rng();

    let mut result = vec![0f64; n];
    let mut iter_num = 0usize;

    // Constant envelope: t(x) = T, where T = max φ(x) = φ(mode) ≥ φ(x), ∀x ∈ D
    // Then r(x) = t(x)/T = 1 on [0,1], i.e. r is the Unif(0,1) density
    let mode_x = (a - 1f64) / (a + b - 2f64);
    let t_max = beta_pdf(mode_x, a, b);

    while iter_num < n {
        // 1. Y ~ Unif(0, 1) (density r)
        let u1 = rng1.gen_range(0f64, 1f64);
        // 2. U ~ Unif(0, 1), independent of Y
        let u2 = rng2.gen_range(0f64, 1f64);
        // 3. Accept if U <= φ(Y) / t(Y)
        if u2 <= beta_pdf(u1, a, b) / t_max {
            result[iter_num] = u1;
            iter_num += 1;
        }
    }
    println!("{:?}", result);
}

fn beta_pdf(x: f64, a: f64, b: f64) -> f64 {
    1f64 / beta(a, b)
        * x.powf(a - 1f64)
        * (1f64 - x).powf(b - 1f64)
}
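As a sanity check on the generated samples, their empirical mean should be close to \(\small \mathbb{E}[\mu] = 0.6\) and their variance close to \(\small 0.04\), by Property 5.1.7.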
The following shows the same usage with the Peroxide library (the internal implementation is the same).
extern crate peroxide;
use peroxide::*;

fn main() {
    // Beta(x | 3, 2) via Peroxide
    let beta = Beta(3f64, 2f64);
    beta.sample(100).print();
}
References
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer. https://www.springer.com/kr/book/9780387310732.
Lemieux, Christiane. 2009. Monte Carlo and Quasi-Monte Carlo Sampling. 1st ed. Springer Series in Statistics. Springer-Verlag New York. https://www.springer.com/us/book/9780387781648.