Vamshi Kumar Kurva

Probability Theory and Random variables

2020-05-20T00:00:00-07:00

This blog post briefly introduces some concepts in set theory and measure theory that are needed to define probability. It also talks about measurable transformations and random variables.

Set

A set is a collection of elements, and the space is the collection of all elements under consideration. For example, $A_1=\{1\}, A_2=\{1,5,10,12\}$ are sets contained in the space of natural numbers, $\mathbb{N}=\{0,1,…\}$ . A point set or atomic set is a set containing a single element, such as $\{1\}$ above. The entire space itself is always a valid set, as is the empty set or null set, $\emptyset$ , which contains no elements at all. Sets are often defined implicitly via an inclusion criterion, written in set-builder notation. For example,

$\mathbb{R}^{+} = \{x \in \mathbb{R}| x > 0\}$

Algebra of sets

Let $X$ be a set. A collection $F$ of subsets of $X$ (i.e. a subset of the power set of $X$ ) is an algebra over $X$ if it contains the empty set and is closed under complements and under unions (hence also intersections) of pairs of its elements, i.e.

$\emptyset \in F$ (Includes null set)
if $A \in F$ , then $A^{c} \in F$ (closed under complement)
if $A \in F, B \in F$ , then $A \cup B \in F$ (closed under finite unions)

$\sigma$ - algebra

A sigma algebra $F$ is a set of subsets of $X$ , such that

$\emptyset \in F$ (Includes null set)
if $A \in F$ , then $A^{c} \in F$ (closed under complement)
if $A_1, A_2, ...A_n,... \in F$ , then $\cup_{i=1}^{\infty}A_i \in F$ (closed under countable unions)

$\sigma$ -algebras are a subset of algebras in the sense that all $\sigma$ -algebras are algebras, but not vice versa. Algebras only require that they be closed under pairwise unions while $\sigma$ -algebras must be closed under countably infinite unions. $\sigma$ -algebras can be defined over the real line as well as over abstract sets.
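The axioms above can be checked mechanically on a small finite set. The following is a minimal sketch, not a general-purpose tool: the helper name `is_sigma_algebra` and the example families are made up for illustration, and it relies on the fact that on a finite set closure under pairwise unions already gives closure under countable unions.

```python
from itertools import combinations

def is_sigma_algebra(X, F):
    """Check the sigma-algebra axioms for a family F of subsets of a finite set X.

    On a finite set, closure under pairwise unions implies closure under
    countable unions, so the algebra and sigma-algebra checks coincide.
    """
    X = frozenset(X)
    F = {frozenset(A) for A in F}
    if frozenset() not in F:                 # must contain the empty set
        return False
    if any(X - A not in F for A in F):       # closed under complement
        return False
    # closed under (pairwise) unions
    return all(A | B in F for A, B in combinations(F, 2))

X = {1, 2, 3, 4}
F_good = [set(), {1, 2}, {3, 4}, X]
# F_bad is closed under complements, but {1} | {2} = {1, 2} is missing.
F_bad = [set(), {1}, {2, 3, 4}, {2}, {1, 3, 4}, X]

print(is_sigma_algebra(X, F_good))
print(is_sigma_algebra(X, F_bad))
```

The second family shows that closure under complements alone is not enough.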

Topological space

There are many definitions of topological space based on open sets, closed sets, neighborhood etc. Here we give the definition w.r.to open sets.

Let $X$ be a non-empty set, then a set $\tau$ of subsets of $X$ is said to be a topology if

$\emptyset, X \in \tau$ (Includes null set and the set itself)
$\tau$ is closed under arbitrary unions (finite or infinite)
$\tau$ is closed under finite intersections.

The ordered pair $(X, \tau)$ is called topological space. The members of $\tau$ are called open-sets. For a given set $X$ , there can be many topologies.

Example: Let $X = \{1,2,3\}$ , then

$\tau = \{\emptyset, \{1,2,3\}\}$ is the trivial topology on $X$ .
$\tau = \{\emptyset, \{1\}, \{1,2,3\}\}$ is another topology.
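For a finite set, the three topology axioms can be verified by brute force. This is a small sketch under that finiteness assumption (pairwise checks are then sufficient); the function name `is_topology` is hypothetical.

```python
from itertools import combinations

def is_topology(X, tau):
    """Check the open-set axioms for a finite family tau of subsets of a finite set X."""
    X = frozenset(X)
    tau = {frozenset(U) for U in tau}
    if frozenset() not in tau or X not in tau:   # contains the empty set and X
        return False
    # For a finite family, pairwise closure implies closure under
    # arbitrary unions and finite intersections.
    return all(U | V in tau and U & V in tau for U, V in combinations(tau, 2))

X = {1, 2, 3}
print(is_topology(X, [set(), X]))             # the trivial topology
print(is_topology(X, [set(), {1}, X]))        # the second example
print(is_topology(X, [set(), {1}, {2}, X]))   # fails: {1} | {2} = {1, 2} is missing
```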

Borel $\sigma$ - algebra

The Borel $\sigma$ -algebra or Borel field on a topological space $(X, \tau)$ is the smallest $\sigma$ -algebra containing all open sets (or, equivalently, all closed sets). It is a special case of a $\sigma$ -algebra. On the real numbers, it is the $\sigma$ -algebra generated by the collection of all finite open intervals. The elements of the Borel field are called Borel sets.

How to construct the Borel field $B_R$ on the real line? Informally: take all possible open intervals. Take their complements. Take countable unions. Include $\emptyset$ and $\mathbb{R}$ , and keep applying these operations to the resulting sets. $B_R$ contains a wide range of intervals including open, closed, and half-open intervals. It also contains unions of disjoint intervals such as ${(2, 7] \cup (19, 32)}$ . It contains (nearly) every possible collection of intervals that one can imagine.

Measurable Space

A pair $(X, \Sigma)$ is a measurable space if $X$ is a set and $\Sigma$ is a $\sigma$ -algebra of subsets of $X$ . A measurable space allows us to define a function that assigns real values to the abstract elements of $\Sigma$ .

Measure ( $\mu$ )

Let $(X, \Sigma)$ be a measurable space. A set function $\mu$ defined on $\Sigma$ is called a measure iff it has the following properties.

$0 \leq \mu(A) \leq \infty; \forall A \in \Sigma$ (measure is a non-negative extended real number)
$\mu(\emptyset) = 0$ (measure of empty set is zero)
For disjoint sets $A_1, A_2...\in \Sigma$ , $\mu(\cup_{n=1}^{\infty}A_n) = \sum_{n=1}^{\infty} \mu(A_n)$ (countable additivity)

A measure on a set, $S$ , is a systematic way to assign a non-negative number to each suitable subset of that set, intuitively interpreted as its size.

Examples of measures

Counting measure: $\mu(S) =$ number of elements in $S$ , i.e. the cardinality for finite sets and $\infty$ for infinite sets such as intervals.
Lebesgue measure: the conventional length of $S$ , i.e. if $S = [a, b]$ , then $\mu(S) = b-a$ . For countable sets (e.g. finite sets or the integers), the Lebesgue measure is zero.

A triplet $(X, \Sigma, \mu)$ is called a measure space if $(X, \Sigma)$ is a measurable space and $\mu: \Sigma \rightarrow [0, \infty]$ is a measure.

Properties of Measure

Monotonicity: if $A \subset B$ , then $\mu(A) \leq \mu(B)$
Subadditivity: if $A_1, A_2...\in \Sigma$ , then $\mu(\cup_{i} A_i) \leq \sum_{i} \mu(A_i)$
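These properties are easy to see with the counting measure on small finite sets. The snippet below is only an illustration with made-up sets `A`, `B` and `C`; it does not prove the properties in general.

```python
def counting_measure(A):
    """Counting measure of a finite set: its number of elements."""
    return len(A)

A = {1, 2}
B = {1, 2, 3, 4}
C = {3, 5}

# Monotonicity: A is a subset of B, so mu(A) <= mu(B).
monotone = A <= B and counting_measure(A) <= counting_measure(B)

# Additivity on disjoint sets: A and C share no element.
additive = counting_measure(A | C) == counting_measure(A) + counting_measure(C)

# Subadditivity: B and C overlap in {3}, so the union is smaller than the sum.
union_size = counting_measure(B | C)                       # size of {1, 2, 3, 4, 5}
sum_of_sizes = counting_measure(B) + counting_measure(C)   # 4 + 2

print(monotone, additive, union_size, sum_of_sizes)
```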

Probability Space

Let

$\Omega$ - be the set of all possible outcomes of a random experiment. This is called the ‘sample space’.
$\Sigma$ - be a $\sigma$ -algebra over $\Omega$ . Elements of $\Sigma$ are called events.
$P$ - be the probability measure defined on $(\Omega, \Sigma)$ with the following properties
- $P(A) \geq 0;\forall A \in \Sigma$ (non-negative quantity)
- $P(\Omega) = 1$ (probability assigned to the sample space is 1)
- $P(A_1 \cup A_2 \cup...) = P(A_1)+P(A_2)+...$ for countably many disjoint sets in $\Sigma$ , i.e. $A_i \cap A_j = \emptyset$ for $i \neq j$

These are called Kolmogorov’s axioms of probability.

Then the triplet $(\Omega, \Sigma, P)$ is called a probability space. The construction of $\Sigma$ avoids some pathological subsets, called non-measurable sets. Non-measurable sets are those for which a measure cannot be properly defined, i.e. loosely speaking, the elements of the set can be rearranged in such a way that the measure of the set would change. By restricting ourselves to the $\sigma$ -algebra, we make sure that events are assigned a specific, well-defined measure (probability in this case). Also, by making the $\sigma$ -algebra closed under countable unions and complements, we make sure that the resulting events also have a definite measure. An important benefit of these closure properties is that they ensure that any non-constructible sets that may be lurking within the power set don’t persist into a $\sigma$ -algebra. Consequently, working with a suitable $\sigma$ -algebra removes many of the pathological behaviors that arise in uncountably large spaces.

Measurable functions/transformations

Once we have defined a probability distribution on a space $\Omega$ , and a well-behaved collection of subsets $\Sigma$ , we can then consider how the probability distribution transforms when $\Omega$ transforms. In particular, let $f: \Omega \rightarrow Y$ be a transformation from $\Omega$ to another space $Y$ . Can this transformation also transform our probability distribution on $\Omega$ onto a probability distribution on $Y$ , and if so under what conditions?

Let $T$ be the $\sigma$ -algebra on $Y$ . In order for $f$ to induce a probability distribution on $Y$ we need the two $\sigma$ -algebras to be compatible in some sense. In particular, we need the preimage $f^{-1}(B)$ of every set $B \in T$ to belong to $\Sigma$ , so that $P$ assigns it a probability. If this holds for all sets in $T$ then we say that the transformation $f$ is measurable and we can define a measure on $Y$ , which is induced by $f$ as

$Q(B) = P(f^{-1}(B)) = P(\{\omega \in \Omega; f(\omega) \in B\})$

According to Wikipedia, the definition of a measurable function is as follows: Let $(X, \Sigma), (Y, T)$ be measurable spaces, meaning that $X, Y$ are equipped with respective $\sigma$ -algebras $\Sigma, T$ . A function $f:X \rightarrow Y$ is said to be measurable if for every $E \in T$ the pre-image of $E$ under $f$ is in $\Sigma$ , i.e. $\forall E \in T$

$f^{-1}(E) := \{x \in X| f(x) \in E\} \in \Sigma$

If $f$ is measurable from $(\Omega, \Sigma)$ to $(Y, T)$ then $f^{-1}(T)$ is a sub $\sigma$ -field of $\Sigma$ . It is called $\sigma$ -field generated by $f$ and denoted as $\sigma(f)$ .
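The induced measure $Q(B) = P(f^{-1}(B))$ can be computed directly on a finite sample space. The sketch below assumes two fair coin flips with the power set as $\sigma$ -algebra (so every function is measurable); the names `pushforward` and `f` are chosen for illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips, each outcome with probability 1/4.
omega = list(product("HT", repeat=2))
P = {w: Fraction(1, 4) for w in omega}

def f(w):
    """Transformation from Omega to Y = {0, 1, 2}: the number of heads."""
    return w.count("H")

def pushforward(P, f, B):
    """Q(B) = P(f^{-1}(B)): total probability of the outcomes that f maps into B."""
    preimage = [w for w in P if f(w) in B]
    return sum((P[w] for w in preimage), Fraction(0))

Q = {y: pushforward(P, f, {y}) for y in (0, 1, 2)}
print(Q)
print(pushforward(P, f, {1, 2}))   # probability of at least one head
```

Because `Q` is built only from preimages, it inherits non-negativity, total mass 1 and additivity from `P`.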

Random Variables

A probability measure is a set function, i.e. it defines the probability of the events in the $\sigma$ -algebra. It would be easier to work with if we mapped everything onto the real line. A random variable is a convenient way to express the elements of $\Omega$ as numbers rather than abstract elements of sets.
A random variable $X$ is a measurable function from the probability space $(\Omega, \Sigma, P)$ to another probability space on the real line, $(\mathcal{X}, B_{\mathcal{X}}, P_{\mathcal{X}})$ , where $\mathcal{X}$ is the range of $X$ in $\mathbb{R}$ , $B_{\mathcal{X}}$ is the Borel field of $\mathcal{X}$ and $P_{\mathcal{X}}$ is the probability measure on $\mathcal{X}$ induced by $X$ . The induced measure on $\mathcal{X}$ , which is $P_{\mathcal{X}} = P \circ X^{-1}$ , is called the distribution of $X$ . Specifically, the cumulative distribution function is defined as follows

\[ \begin{aligned} F(x) &= P(X \leq x) \newline &= P_{\mathcal{X}}((-\infty, x]) \newline &= P(\{\omega; X(\omega) \in (-\infty, x]\}) \newline &= P(\{\omega; -\infty < X(\omega) \leq x\}) \end{aligned} \]

$(\Omega, \Sigma, P)$ and $(\mathcal{X}, B_{\mathcal{X}}, P_{\mathcal{X}})$ are really just two different manifestations, or parameterizations, of the same abstract probability system. The two parameterizations, for example, might correspond to different choices of coordinate system, different choices of units etc.
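As a concrete instance of the cumulative distribution function defined above, the sketch below takes two fair dice as the sample space and their sum as the random variable. The setup is an assumption made for illustration; it evaluates $F(x) = P(\{\omega; X(\omega) \leq x\})$ by enumerating outcomes.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair dice, each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]

def cdf(x):
    """F(x) = P({w : X(w) <= x}), computed on the original probability space."""
    return sum((P[w] for w in omega if X(w) <= x), Fraction(0))

print(cdf(1), cdf(4), cdf(12))
```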

We are not always interested only in calculating probabilities; sometimes the random variables defined on the original probability space turn out to be quite useful in themselves. Consider the following statistical experiment. Go to the road outside the college building and consider the first car that goes left to right after your arrival. As we do not know and cannot predict which car in the city will be there, it is a statistical experiment. The sample space is the set of all cars in your city (or in your country). Now consider the following questions

How many people are in that car?
What is the amount of petrol in the fuel tank at that time?
How many kilometers has the car travelled that day before you noticed it?

All of these are random variables on the same sample space. The answer to question 1 might be useful to a person who sells eatables on the roadside (more passengers means more business). The answer to question 2 might help decide if it would be profitable to open a petrol-selling shop.

Various functions defined on random variables like expectation, variance etc shed further light on the understanding of these variables.

References

Even though the blog post is short, it took me time to understand certain concepts and put these things together. The following sources have been very helpful in understanding a few concepts.

Geometric view of matrices, Diagonalization and SVD

2020-05-05T00:00:00-07:00

I’ve found very few notes on the internet that talk about the geometric intuition behind matrices and linear algebra, and wanted to write about it if I ever found any good material. This blog post talks about linear transformations, geometric intuitions of matrices, change of basis, eigen decomposition and singular value decomposition.

Linear mapping

Vectors are objects that can be added together and multiplied by a scalar, with the result still being a vector. Consider two vector spaces $V, W$ . A mapping $\phi : V \rightarrow W$ preserves the structure of the vector space if for all $x,y \in V$ and $\alpha \in \mathbb{R}$ ,

\[ \begin{aligned} \phi(x+y) &= \phi(x) + \phi(y) \newline \phi(\alpha x) &= \alpha \phi(x) \end{aligned} \]

Such a mapping is called a linear mapping. The two conditions can be summarized as

$\forall x, y \in V, \forall \alpha, \beta \in \mathbb{R}: \phi(\alpha x + \beta y) = \alpha \phi(x) + \beta \phi(y)$

We can represent linear mappings/transformations as matrices. Consider a mapping $\phi : V \rightarrow W$ , where $V, W$ are arbitrary sets. Then the mapping $\phi$ is called

Injective if $\forall x, y \in V: \phi(x)=\phi(y) \implies x=y$
Surjective if $\phi(V) = W$
Bijective if both injective and surjective

The following are special cases of linear mappings between vector spaces $V$ and $W$ :

Isomorphism if $\phi: V \rightarrow W$ is linear and bijective
Endomorphism if $\phi: V \rightarrow V$ is linear.
Automorphism if $\phi: V \rightarrow V$ is linear and bijective.

Consider the vector spaces $V, W$ with corresponding ordered bases $B = (b_1, b_2...b_n)$ and $C = (c_1, c_2...c_m)$ . Consider that $\phi : V \rightarrow W$ is a linear transformation such that for $j \in \{1,2,..n\}$ ,

$\phi (b_j) = \alpha_{1j} c_1 + \alpha_{2j} c_2 +...+ \alpha_{mj} c_m$

is the unique representation of $\phi (b_j)$ w.r.to $C$ . We call the $m \times n$ matrix $A_{\phi}$ given by

$A_{\phi}(i,j) = \alpha_{ij}$

as the transformation matrix of $\phi$ (w.r.to the ordered basis $B$ of $V$ and $C$ of $W$ ). If $\hat{x}$ is the representation of vector $x \in V$ w.r.to $B$ and $\hat{y}$ is the representation of $y = \phi(x) \in W$ w.r.to $C$ , then

$\hat{y} = A_{\phi} \hat{x}$ i.e. transformation matrix can be used to map coordinate vectors w.r.to an ordered basis in $V$ to coordinates w.r.to an ordered basis in $W$ .

Let’s see how the transformation matrix changes if we change the ordered basis of $V$ from $B$ to $B_1$ and the ordered basis of $W$ from $C$ to $C_1$ . Let $\hat{A_{\phi}}$ be the transformation matrix w.r.to the new bases, i.e.

$y_1 = \hat{A_{\phi}} x_1$

where $x_1, y_1$ are the coordinates of $x, y$ w.r.to the new bases $B_1, C_1$ of $V, W$ . Writing the basis vectors as the columns of the matrices $B, B_1, C, C_1$ , the same vector has two representations. Therefore

$C_1 y_1 = C\hat{y} \implies \hat{y} = \underbrace{C^{-1}C_1}_{T}y_1$

$B_1 x_1 = B\hat{x} \implies \hat{x} = \underbrace{B^{-1}B_1}_{S}x_1$

Substituting these two relations in $\hat{y} = A_{\phi} \hat{x}$ gives

\[ \begin{aligned} T y_1 &= A_{\phi} S x_1 \newline y_1 &= T^{-1}A_{\phi}S x_1 \end{aligned} \]

Comparing the last equation with $y_1 = \hat{A_{\phi}} x_1$ gives

$\hat{A_{\phi}} = T^{-1}A_{\phi}S$

Here, $S \in \mathbb{R}^{n \times n}$ is the transformation matrix of $id_V$ that maps coordinates with respect to $B_1$ onto coordinates with respect to B and $T \in \mathbb{R}^{m \times m}$ is the transformation matrix of $id_W$ that maps coordinates with respect to $C_1$ onto coordinates with respect to C.

The transformation matrix $\hat{A_{\phi}}$ can be interpreted as follows: First, transform coordinates w.r.to $B_1$ onto $B$ . Then, use the transformation matrix $A_{\phi}$ to map the coordinates onto vector space $W$ w.r.to $C$ . Finally, map the coordinates onto $C_1$ from $C$ . If we write down all the linear transformations, then $A_{\phi}:B \rightarrow C, \hat{A_{\phi}}: B_1 \rightarrow C_1, S:B_1 \rightarrow B, T: C_1 \rightarrow C$ and $T^{-1}: C \rightarrow C_1$ ,

\[ \begin{aligned} B_1 \rightarrow C_1 &= B_1 \rightarrow B \rightarrow C \rightarrow C_1 \newline \hat{A_{\phi}} &= T^{-1} A_{\phi} S \end{aligned} \]

The matrices/linear mappings $A_{\phi}$ and $\hat{A_{\phi}}$ are called equivalent matrices, since they correspond to the same linear mapping, but w.r.to different basis.
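The relation $\hat{A_{\phi}} = T^{-1}A_{\phi}S$ can be checked numerically. The sketch below uses NumPy with small made-up basis matrices (columns are basis vectors; all values are illustrative assumptions) and compares applying $\hat{A_{\phi}}$ directly with going through the chain $B_1 \rightarrow B \rightarrow C \rightarrow C_1$ .

```python
import numpy as np

# Columns of each matrix are basis vectors (illustrative values, all invertible).
B = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])   # old basis of V
B1 = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])  # new basis of V
C = np.array([[1.0, 1.0], [0.0, 1.0]])                               # old basis of W
C1 = np.array([[1.0, 0.0], [1.0, 1.0]])                              # new basis of W

A = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])  # transformation matrix w.r.t. B and C

S = np.linalg.solve(B, B1)          # S = B^{-1} B1
T = np.linalg.solve(C, C1)          # T = C^{-1} C1
A_hat = np.linalg.solve(T, A @ S)   # A_hat = T^{-1} A S

# Route 1: apply A_hat directly to coordinates w.r.t. B1.
x1 = np.array([1.0, 2.0, 3.0])
y1_direct = A_hat @ x1

# Route 2: go through the actual vectors, one step at a time.
x = B1 @ x1                              # the vector itself
x_hat = np.linalg.solve(B, x)            # its coordinates w.r.t. B
y_hat = A @ x_hat                        # image coordinates w.r.t. C
y = C @ y_hat                            # the image vector itself
y1_stepwise = np.linalg.solve(C1, y)     # its coordinates w.r.t. C1

print(np.allclose(y1_direct, y1_stepwise))
```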

Similar matrices

Two matrices $A, \hat{A} \in \mathbb{R}^{n \times n}$ are similar if there exists an invertible matrix $S$ such that $\hat{A} = S^{-1}AS$ . Similar matrices correspond to the same endomorphism w.r.to different bases. Similar matrices are always equivalent; however, equivalent matrices are not necessarily similar. Similar matrices have the same eigen values, since they have the same characteristic polynomial:

$det(\hat{A}-\lambda I) = det(S^{-1}(A-\lambda I)S) = det(S^{-1}) det(A-\lambda I) det(S) = det(A-\lambda I)$

Remark: Eigen value analysis can be used to characterize a matrix and its associated linear mappings. Similar matrices have the same eigen values. Therefore, a linear mapping $\phi$ has eigenvalues that are independent of the choice of basis of its transformation matrix. This makes eigenvalues, together with the determinant and the trace, key characteristic parameters of a linear mapping as they are all invariant under basis change.
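The invariance of eigen values, determinant and trace under a similarity transform can be confirmed numerically. This is a small NumPy sketch with arbitrarily chosen matrices (`A` and the change-of-basis matrix `S` are illustrative assumptions).

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
S = np.array([[1.0, 2.0], [0.0, 1.0]])     # any invertible matrix works here
A_hat = np.linalg.inv(S) @ A @ S           # similar to A

eig_A = np.sort(np.linalg.eigvals(A))
eig_A_hat = np.sort(np.linalg.eigvals(A_hat))

print(eig_A, eig_A_hat)
print(np.linalg.det(A), np.linalg.det(A_hat))
print(np.trace(A), np.trace(A_hat))
```

Note that `A_hat` is no longer symmetric even though `A` is: similarity preserves these invariants, not the entries themselves.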

Determinant

The determinant can be thought of as the signed volume of an n-dimensional parallelepiped formed by the columns of a matrix. Similar matrices have the same determinant, thus the determinant is invariant to the change of basis of a linear mapping, and so is the trace of a matrix. In particular, for a diagonalizable matrix $A = PDP^{-1}$ with the eigen values $\lambda_i$ on the diagonal of $D$ ,

$det(A) = det(PDP^{-1}) = det(P) det(D) det(P^{-1}) = det(D) = \prod_{i=1}^n \lambda_i$

Let’s gain some intuition about eigen values, eigen vectors and determinants using some linear mappings.

Overview of five linear mappings using 400 color coded points in $\mathbb{R}^2$ (left column) mapped onto target points in $\mathbb{R}^2$ (right column) using mapping $A \in \mathbb{R}^{2 \times 2}$ . The central column depicts the eigen vectors stretched by their corresponding eigen values. All transformations here are w.r.to the standard basis.

Two eigen vectors corresponding to the standard basis in $\mathbb{R}^2$ . The vertical one is stretched to twice its size ( $\lambda_1 = 2.0$ ) and the horizontal one is shrunk to half its size ( $\lambda_2 = 0.5$ ). So $det(A) = 1$ and the mapping is area preserving.
Both eigen values are the same and the eigen vectors are collinear (along the horizontal axis). So, the mapping acts only horizontally (a shear). Also area preserving.
Eigen values are complex, indicating that the transformation is a rotation (there is no real direction along which vectors simply get stretched/shrunk). Also area preserving.
One of the eigen values is zero, so the points along the eigen vector corresponding to eigen value 0 collapse, and hence the mapping maps 2d points onto a single direction (the other eigen vector). The area of the mapped region is zero.
Two eigen values $\lambda_1 = 0.5, \lambda_2 = 1.5$ . Scales the space along one eigen vector by a factor of 0.5 and along the other by a factor of 1.5. Since $det(A) = 0.75$ , the mapping scales areas to 75% of their original size.
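The five cases above can be reproduced numerically. The matrices below are assumptions chosen to match the described behaviour (axis scaling, shear, rotation, collapse, mixed scaling); they are not taken from the post itself, which only shows the figure.

```python
import numpy as np

theta = np.pi / 6   # rotation angle, an arbitrary illustrative choice
mappings = {
    "axis scaling": np.array([[0.5, 0.0], [0.0, 2.0]]),
    "shear": np.array([[1.0, 0.5], [0.0, 1.0]]),
    "rotation": np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta), np.cos(theta)]]),
    "collapse": np.array([[1.0, -1.0], [-1.0, 1.0]]),
    "mixed scaling": np.array([[1.0, 0.5], [0.5, 1.0]]),
}

dets = {}
for name, A in mappings.items():
    eigenvalues = np.linalg.eigvals(A)
    dets[name] = np.linalg.det(A)   # equals the product of the eigen values
    print(f"{name:14s} eigenvalues={np.round(eigenvalues, 3)} det={dets[name]:.2f}")
```

The rotation is the only case where NumPy returns complex eigen values, matching the observation that no real direction is merely stretched.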

Diagonalization

A matrix $A \in \mathbb{R}^{n \times n}$ is diagonalizable if it is similar to a diagonal matrix $D$ , i.e. if there exists an invertible matrix $P$ such that $D = P^{-1}AP$

It can be shown that the matrix $P$ is a matrix formed by eigen vectors of $A$ as columns.
Since the matrix $P$ has to be invertible, only matrices with $n$ linearly independent eigen vectors can be diagonalized.
If a square matrix has fewer than $n$ linearly independent eigen vectors, the matrix is said to be defective. Defective matrices cannot be diagonalized.
A real symmetric matrix is always diagonalizable, because its eigen vectors can be chosen to form an orthonormal basis of $\mathbb{R}^n$ (spectral theorem). Furthermore, its eigen values are real.

Diagonalization/Eigen decomposition can be geometrically interpreted as follows.

$P^{-1}$ performs basis change from standard basis to eigen basis. This identifies the eigen vectors $p_i$ onto the standard basis vectors $e_i$
$D$ scales these vectors along the axes by $\lambda_i$
Finally, $P$ transforms these scaled vectors back to standard coordinates yielding $\lambda_i p_i$ .

Geometric intuition of eigen decomposition as sequential linear transformations. Top-left to bottom-left: Basis change, mapping the eigenvectors into the standard basis. Bottom-left to bottom-right: Scaling along the remapped orthogonal eigenvectors, depicted here by a circle being stretched to an ellipse. Bottom-right to top-right: Undoing the basis change (depicted as a reverse rotation), which restores the original coordinate frame.

The whole idea of diagonalising a matrix is to perform a basis change onto eigen space so that it’s easy to work with and then revert back to standard basis after performing the mapping.

Let $A$ be a diagonalizable matrix with eigen values $\lambda_1, \lambda_2..\lambda_n$ and corresponding eigen vectors $p_1, p_2...p_n$ . Now for any vector $x \in \mathbb{R}^n$ ,

Let $P^{-1}x = (\alpha_1, \alpha_2.....\alpha_n)^T$ , i.e. the representation of $x$ w.r.to the eigen basis, i.e. $x = \alpha_1p_1+\alpha_2p_2+..+\alpha_np_n$
Then, $DP^{-1}x = (\alpha_1\lambda_1, \alpha_2\lambda_2.....\alpha_n\lambda_n)^T$
Finally, $PDP^{-1}x = \alpha_1\lambda_1p_1+\alpha_2\lambda_2p_2+....+\alpha_n\lambda_np_n = Ax$

Diagonalizing makes it easier to raise a matrix to an integer power, i.e

$A^k = (PDP^{-1})^k = PD^kP^{-1}$
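The identity $A^k = PD^kP^{-1}$ is easy to try out with NumPy. A minimal sketch, assuming a small diagonalizable matrix chosen for illustration:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # symmetric, hence diagonalizable
eigenvalues, P = np.linalg.eig(A)        # columns of P are eigen vectors of A

k = 5
# Raising D to the k-th power only requires powers of the diagonal entries.
A_k = P @ np.diag(eigenvalues ** k) @ np.linalg.inv(P)

print(A_k)
print(np.linalg.matrix_power(A, k))      # reference result by repeated multiplication
```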

Diagonalization can only be performed on square matrices. Let’s look at the general matrix decomposition technique called singular value decomposition.

Singular Value Decomposition

SVD can be applied to all matrices and it always exists. The SVD of a matrix $A$ which represents a linear mapping $\phi : V \rightarrow W$ quantifies the change between the underlying geometry of these two vector spaces. Let $A \in \mathbb{R}^{m \times n}$ be a rectangular matrix with rank $r \in [0, \min(m, n)]$ . Then, $A$ can be decomposed as

$A = \underbrace{U}_{m \times m} \underbrace{S}_{m \times n} \underbrace{V^T}_{n \times n}$

$U$ is an orthogonal matrix with column vectors $u_i, i=1,2,..m$ and $V$ is another orthogonal matrix with column vectors $v_j, j=1,2,..n$ and $S_{ii} = \sigma_i \geq 0$ and $S_{ij} = 0, i \neq j$ . The non-zero diagonal entries $\sigma_i; i=1,2,...r$ of $S$ are called the singular values, $u_i$ are called the left-singular vectors, and $v_j$ are called the right-singular vectors. By convention, the singular values are ordered, i.e. $\sigma_1 \geq \sigma_2 \geq ..\sigma_r > 0$ .

Geometric view of SVD

The SVD of a matrix can be interpreted as the decomposition of a linear mapping into a sequence of linear operations. Assume a linear transformation $\phi: \mathbb{R}^n \rightarrow \mathbb{R}^m$ w.r.to the standard bases B of $\mathbb{R}^n$ and C of $\mathbb{R}^m$ . Then

$V^T$ performs the basis change from standard basis onto right singular basis $V$ . This maps the vectors $v_j$ onto standard basis as $e_j$ .
$S$ is the transformation matrix of $\phi$ w.r.to the right singular basis(dimension n) and left singular basis(dimension m). $S$ scales the new coordinates by the corresponding singular value $\sigma_i$ and adds/deletes dimensions based on the value of $m$ . If $m > n$ , it introduces new dimensions by embedding data into a higher dimensional space and if $m < n$ , it drops certain dimensions.
$U$ maps the coordinates back onto the standard basis of $\mathbb{R}^m$ from left singular basis.

Geometric intuition of SVD of a matrix $A\in \mathbb{R}^{3 \times 2}$ as sequential linear transformations. Top-left to bottom-left: $V^T$ performs a basis change in $\mathbb{R}^2$. Bottom-left to bottom-right: $S$ scales and maps $\mathbb{R}^2$ to $\mathbb{R}^3$, the ellipse in the bottom right lies in $\mathbb{R}^3$. Bottom-right to top-right: $U$ performs basis change in $\mathbb{R}^3$

SVD performs the basis change in both domain and co-domain, and the transformation w.r.to these new basis is represented by the singular value matrix $S$ .

Transformations one and three involve multiplication by an orthogonal matrix, hence these steps are only rotations (possibly combined with a reflection). Orthogonal matrices preserve the lengths of vectors as well as the angles between them.
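The three steps can be followed numerically with `numpy.linalg.svd`. The sketch below uses an arbitrary $3 \times 2$ matrix and test vector (both assumptions for illustration) and applies $V^T$ , $S$ and $U$ one after another.

```python
import numpy as np

A = np.array([[1.0, -0.8], [0.0, 1.0], [1.0, 0.0]])   # illustrative 3x2 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 3x3, s: (2,), Vt: 2x2
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)                  # singular values embedded in a 3x2 matrix

x = np.array([3.0, 4.0])
step1 = Vt @ x       # basis change in R^2 (length preserved)
step2 = S @ step1    # scaling and embedding into R^3
step3 = U @ step2    # basis change in R^3 (length preserved)

print(np.allclose(step3, A @ x))
print(np.linalg.norm(x), np.linalg.norm(step1))
print(np.linalg.norm(step2), np.linalg.norm(step3))
```

Only the middle step changes the length of the vector; the outer two steps leave it untouched.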

Construction of SVD

Let $rank(A) = r \leq \min(m,n)$ , then $rank(A^TA) = rank(AA^T) = r$ . Both $A^TA$ and $AA^T$ are symmetric positive semidefinite matrices, so they can be eigen decomposed with real, non-negative eigen values. Since both matrices have rank $r$ , they have $r$ non-zero eigen values, and these non-zero eigen values are the same for both. $A^TA$ has $(n-r)$ zero eigen values and $AA^T$ has $(m-r)$ zero eigen values.

Substituting $A = USV^T$ and using $U^TU = \mathbf{I}$ gives $A^TA = VS^TU^TUSV^T = V(S^TS)V^T$ , which is an eigen decomposition of $A^TA$ . So, the diagonal elements of $S^TS$ are the eigen values of $A^TA$ and the right singular vectors are the eigen vectors of $A^TA$ . Let the columns of $V = [v_1, v_2...v_n]$ be the eigen vectors corresponding to the eigen values $\sigma^2_1, \sigma^2_2....\sigma^2_n$ such that $\sigma^2_1 \geq \sigma^2_2 \geq ...\geq \sigma^2_r > 0$ and $\sigma^2_{r+1} = ....= \sigma^2_{n} = 0$ . Let $V = [V_1 V_2]; V_1 = [v_1, v_2...v_r], V_2 = [v_{r+1},...v_n]$ . Then

$V_2^T (A^TA) V_2 = 0 \implies (AV_2)^T (AV_2) = 0 \implies AV_2 = 0$

i.e. the second set of eigen vectors $V_2$ spans the null space of $A$ and since $rank(A) = r$ , $dim(span(V_2)) = n-r = dim(nullspace(A))$ .

For the first set of eigen vectors, $V_1^T A^TA V_1 = Z^2$ where $Z = diag(\sigma_1, \sigma_2...\sigma_r)$ . So, we have \begin{equation} Z^{-1}V^T_1 A^TAV_1Z^{-1} = \mathbf{I} \end{equation}

Let $U_1 = AV_1Z^{-1}$ , which is an $m \times r$ matrix. Then the equation above reads $U_1^TU_1 = \mathbf{I}$ , showing that $u_1, u_2..u_r$ are orthonormal vectors in $\mathbb{R}^m$ where $u_i = \frac{1}{\sigma_i} Av_i; i=1,2,..r$

Similarly we can show that the diagonal elements of $SS^T$ are the eigen values of $AA^T$ and the left singular vectors are the eigen vectors of $AA^T$ . We can factor $U = [U_1 U_2]$ where $U_1$ is given by the above equation and the columns of $U_2$ form an orthonormal basis of the null space of $A^T$ , which is of dimension $(m-r)$ .
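The construction can be followed step by step in NumPy. This is a sketch under stated assumptions, not a numerically robust SVD routine: the matrix `A` is an arbitrary rank-2 example, and `sigma` holds the positive singular values (the diagonal of the $r \times r$ scaling matrix used above).

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0], [-2.0, 1.0, 0.0]])   # illustrative 2x3 matrix of rank 2

# Eigen decomposition of the symmetric matrix A^T A (eigh returns ascending order).
eigenvalues, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigenvalues)[::-1]                # reorder to descending
eigenvalues, V = eigenvalues[order], V[:, order]

r = np.linalg.matrix_rank(A)
sigma = np.sqrt(eigenvalues[:r])                     # non-zero singular values
V1, V2 = V[:, :r], V[:, r:]

# u_i = A v_i / sigma_i; broadcasting divides column i by sigma[i].
U1 = (A @ V1) / sigma

print(np.allclose(A @ V2, 0.0, atol=1e-10))          # V2 spans the null space of A
print(np.allclose(U1.T @ U1, np.eye(r)))             # columns of U1 are orthonormal
print(np.allclose(U1 @ np.diag(sigma) @ V1.T, A))    # reduced SVD reproduces A
```

In practice one would call `numpy.linalg.svd` directly; forming $A^TA$ explicitly squares the condition number and is shown here only to mirror the derivation.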

Eigen decomposition Vs SVD

Let us consider the eigendecomposition $A=PDP^{-1}$ and the SVD $A = USV^T$ .

The SVD always exists for any matrix in $\mathbb{R}^{m \times n}$ . The eigen decomposition is only defined for square matrices in $\mathbb{R}^{n \times n}$ and only exists if we can find a basis of eigen vectors of $\mathbb{R}^n$ .
The vectors in the eigendecomposition matrix $P$ are not necessarily orthogonal, i.e., the change of basis is not a simple rotation and scaling. On the other hand, the vectors in the matrices $U$ and $V$ in the SVD are orthonormal, so they do represent rotations.
Both the eigendecomposition and the SVD are compositions of three linear mappings:
- Change of basis in the domain
- Independent scaling of each new basis vector and mapping from domain to codomain
- Change of basis in codomain
A key difference between the eigendecomposition and the SVD is that in the SVD, domain and codomain can be vector spaces of different dimensions.
In the SVD, the left and right-singular vector matrices $U$ and $V$ are generally not inverse of each other (they perform basis changes in different vector spaces). In the eigen decomposition, the basis change matrices $P$ and $P^{-1}$ are inverses of each other.
In the SVD, the entries in the diagonal matrix $S$ are all real and non-negative, which is not generally true for the diagonal matrix in the eigen decomposition.
For symmetric positive semidefinite matrices, the eigen decomposition and the SVD are one and the same. For a general symmetric matrix, the singular values are the absolute values of the eigen values.

SVD has many applications in machine learning ranging from dimensionality reduction to data compression and clustering.

References

Most of the content and all the images are borrowed from

Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong, Mathematics for Machine Learning

Variational Auto Encoder

2020-05-02T00:00:00-07:00

This blogpost talks about modelling the distribution of images, the challenges in modelling and training etc. Let’s try to fit a model $p(x)$ to a set of images, just like we tried to fit GMM to a set of points. If you are not familiar with latent variable models and GMM you can refer to my previous post here

Why model $p(x)$ for images?

Modelling the distribution of images has the following advantages.

We can represent the data in a compact form using model parameters. Typically the latent dimension is much smaller than the input dimension.
We can use the model to generate new samples (images in this case).
Model can be used to detect anomalies/outliers by predicting the probability of the data point according to the model.

How to model distribution of images?

Use the CNN to model the images,i.e.

\[ \begin{aligned} \log \hat{p}(x) &= CNN(x) \newline p(x) &= \frac{\exp(CNN(x))}{Z} \newline \end{aligned} \]
- The problem is the normalization constant: the probabilities should sum to 1 over all possible images, i.e. to calculate $Z$ we would have to evaluate the model on every possible image. So, this approach is not feasible.
Use chain rule and model conditional distribution
- The joint distribution over the pixels can be modelled using the conditional distributions. For a $100 \times 100$ gray scale image, the model factorizes over 10k conditional distributions. So, each conditional distribution needs to be normalized using only 256 values. We can model any probability distribution in this way. Each conditional distribution can be modelled using an RNN as $p(x_n/x_1,x_2...x_{n-1}) = RNN(x_1,x_2...x_{n-1})$
- The model reads the image pixel by pixel and generates the distribution over the next pixel. This model takes too long to generate even a low resolution image.
We can model the distribution of pixels in an image as independent, but since that’s not the case, the images generated by the model don’t resemble the images in the data set.
Mixture of several Gaussians (GMM)
- Theoretically a GMM can model any distribution, but in practice it can be inefficient for complicated data sets like images. A GMM in this case may fail to capture the structure in the data, since it’s too difficult to train.
Mixture of infinitely many Gaussians
- Here $t$ is a latent variable, and $t$ is assumed to cause $x$ . $p(x/t)$ can be modelled as Gaussian. Even if the Gaussians are factorized, i.e. have independent components for each dimension, the mixture is not. Hence, this model is a bit more powerful than the GMM model.

Let’s explore the idea of representing an image as an infinite mixture of Gaussians.

$p(x) = \int p(x/t) p(t) dt$

Let’s model the likelihood $p(x/t)$ to be Gaussian and prior to be standard normal, then

$p(t) = \mathcal{N}(0, I)$

$p(x/t) = \mathcal{N}(\mu(t), \sigma(t))$

If we use CNN with parameters $w$ to model the likelihood, then,

$p(x/w) = \int p(x/t, w) p(t) dt$

$p(t) = \mathcal{N}(0, I)$

$p(x/t, w) = \mathcal{N}(\mu(t, w), \sigma(t, w))$

Decoder

For a $100 \times 100$ image, the covariance matrix to be generated by our CNN would be of dimension $10k \times 10k$ , which is huge. To avoid this, let’s model the covariance matrix as diagonal.

$p(x/t, w) = \mathcal{N}(\mu(t, w), diag(\sigma^2(t, w)))$

This is the same as modelling our images as a mixture of factorised Gaussian distributions. This is better than modelling them as just a single factorised distribution. Since our model is fully defined, let’s look into the training part. The ML training objective is to

$\max_w \; p(X/w) = \int p(X/T,w) p(T) dT$
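To see why a mixture of factorised Gaussians is not itself factorised, one can sample from a toy version of the model above. The sketch below replaces the CNN decoder with a hypothetical linear map `W` and a constant noise level `sigma` (both assumptions made purely for illustration; in this linear case the marginal happens to be a single Gaussian, which a real decoder would not give).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "decoder" standing in for the CNN: mu(t) = W t.
W = np.array([[1.0], [2.0]])   # maps a 1-d latent t to a 2-d "image" x
sigma = 0.5                    # same diagonal noise for every t

n_samples = 200_000
t = rng.standard_normal((n_samples, 1))                     # t ~ N(0, I)
x = t @ W.T + sigma * rng.standard_normal((n_samples, 2))   # x given t ~ N(W t, sigma^2 I)

empirical_cov = np.cov(x, rowvar=False)
expected_cov = W @ W.T + sigma**2 * np.eye(2)   # marginal covariance in this linear case

print(np.round(empirical_cov, 2))
print(expected_cov)
```

Each conditional $p(x/t)$ has a diagonal covariance, yet the off-diagonal entries of the marginal covariance are clearly non-zero: the shared latent variable couples the dimensions.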

Since our model is a latent variable model, let’s use EM. EM builds a variational lower bound and maximises it w.r.to both variational and model parameters.

$\log p(X/w) \geq \mathcal{L}(w,q)$

$\max_{w,q} \mathcal{L}(w,q)$

but the E-step involves finding the posterior over the latent variables, i.e. $p(T/X,w)$ , which is intractable. We can use variational EM instead. Here, the idea is to maximize the same variational lower bound, but with a factorized distribution over the latent variables. Each variational distribution is independent of the others, i.e. the objective becomes

\[ \begin{aligned} \max_{w, q_1,q_2…q_N} \quad & \mathcal{L}(w, q_1, q_2..q_N) \newline \textrm{s.t.} \quad & q_i(t_i) = q_{i1}(t_{i1}) q_{i2}(t_{i2})…q_{im}(t_{im}) \end{aligned} \]

Unfortunately, this introduces a lot of variables for each data point and it’s unclear how to find the variational distribution params for test data points. So, let’s introduce a new network that gives a variational distribution over latent variables given the data point. This way the distributions are all independent and each depends only on the given data point $x$ . Also, let the distribution be Gaussian, then the objective becomes

\[ \begin{aligned} \max_{w, \phi} \quad & \mathcal{L}(w, q_1, q_2..q_N)\newline \textrm{s.t.} \quad & q_i(t_i) = \mathcal{N} (m(x_i, \phi), s^2(x_i, \phi)) \end{aligned} \]

Encoder

Here each $q_i$ is different from one another, but they all share the same parametrization. We can use a CNN to find the mean and variance of the variational distribution. Since the variational distribution is modelled as Gaussian, it is easy to sample from this distribution, and this is very useful since the variational lower bound contains the expectation w.r.to the variational distribution.

\[ \begin{aligned} \max_{w, \phi} \quad & \sum_{i} \mathbf{E}_{q_i} \log \frac{p(x_i/t_i,w) p(t_i)}{q_i(t_i)}\newline \textrm{s.t.} \quad & q_i(t_i) = \mathcal{N} (m(x_i, \phi), s^2(x_i, \phi))
\end{aligned} \]

Now this complete workflow is called a Variational Auto Encoder. It contains an encoder and a decoder. Encoder takes the image and gives the distribution over latent dimensions. Sample from this distribution and feed it to decoder. Decoder tries to reconstruct the image that is as close to the input image as possible. If we set the variance of latent distribution to zero, the resultant architecture is called an Auto Encoder.

Variational Auto Encoder

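Now, let’s look at the maximisation objective closely. Splitting the logarithm gives

\[ \begin{aligned} \sum_{i} \mathbf{E}_{q_i} \log \frac{p(x_i/t_i,w) p(t_i)}{q_i(t_i)} &= \sum_{i} \mathbf{E}_{q_i} \log p(x_i/t_i,w) + \sum_{i} \mathbf{E}_{q_i} \log \frac{p(t_i)}{q_i(t_i)} \newline &= \sum_{i} \mathbf{E}_{q_i} \log p(x_i/t_i,w) - \sum_{i} \mathcal{KL}(q_i(t_i)||p(t_i)) \end{aligned} \]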

The first term can be interpreted as trying to push $\mu(t_i)$ as close to the input image $x_i$ as possible. This is called the reconstruction loss. The second term is the KL divergence between the variational distribution and the prior, i.e. we are trying to push $q(t_i)$ to be as close to $p(t_i)$ as possible. The KL term also pushes $q(t_i)$ to be non-deterministic, since if the variance of $q(t_i)$ is $0$ , the KL loss is infinite. This helps in keeping some noise in the structure. After training, the VAE can generate new images by feeding a sample from the standard normal distribution (the prior) to the decoder.

Generating samples using VAE-decoder

VAE can also be used to detect images that are very different from the training distribution, by using the KL divergence between prior and variational distribution as a measure.

Gradient of Decoder

Our objective is

$\max_{w, \phi} \sum_{i} \mathbf{E}_{q_i} \log p(x_i/t_i,w) - \mathcal{KL}(q(t_i)||p(t_i))$

The KL term can be computed analytically, so gradients w.r.to this term can be calculated easily.

$\mathcal{KL}(q_i(t_i)||p(t_i)) = \sum_j \left( -\log\sigma_j(t_i) + \frac{\sigma^2_j(t_i) + \mu_j^2(t_i)-1}{2}\right)$
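The closed-form expression above can be checked numerically. The following is a minimal sketch, assuming a diagonal Gaussian $q$ and a standard normal prior; the function name and the example values of the mean and standard deviation are illustrative choices, not taken from the original implementation. It compares the analytical KL with a Monte Carlo estimate of $\mathbf{E}_q[\log q - \log p]$.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return float(np.sum(-np.log(sigma) + (sigma**2 + mu**2 - 1.0) / 2.0))

# Illustrative variational parameters for a 2-dimensional latent variable.
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.5])
kl_exact = kl_diag_gaussian_to_standard_normal(mu, sigma)

# Monte Carlo estimate of E_q[log q(t) - log p(t)] for comparison.
rng = np.random.default_rng(0)
t = mu + sigma * rng.standard_normal((200_000, 2))
log_q = np.sum(-np.log(sigma) - 0.5 * ((t - mu) / sigma) ** 2, axis=1)
log_p = np.sum(-0.5 * t**2, axis=1)
kl_mc = float(np.mean(log_q - log_p))  # the -0.5*log(2*pi) terms cancel

print(kl_exact, kl_mc)
```

The two numbers should agree up to sampling noise, and the KL is exactly zero when $q$ equals the prior.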

Now, the first term

$f(w, \phi) = \sum_i \mathbf{E}_{q(t_i/x_i, \phi)} \log p(x_i/t_i,w)$

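Gradient w.r.to the first term: since $q$ does not depend on the decoder parameters $w$ , the gradient moves inside the expectation and can be approximated with samples,

$\nabla_w f(w, \phi) = \sum_i \mathbf{E}_{q(t_i/x_i, \phi)} \nabla_w \log p(x_i/t_i,w) \approx \sum_i \nabla_w \log p(x_i/\hat{t_i},w), \quad \hat{t_i} \sim q(t_i/x_i, \phi)$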

i.e. Feed the image to encoder, get the variational parameters. Sample from variational distribution, find the gradient and average the gradients to approximate the expectation.

Gradient of Encoder

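The gradient w.r.to the encoder parameters $\phi$ is harder, because the distribution we take the expectation over depends on $\phi$ . Using the log-derivative trick,

$\nabla_{\phi} f(w, \phi) = \sum_i \mathbf{E}_{q(t_i/x_i, \phi)} \left[ \log p(x_i/t_i,w) \nabla_{\phi} \log q(t_i/x_i, \phi) \right]$

One problem here is that the variance of this gradient estimate can be very high because of the term $\log p(x_i/t_i,w)$ , whose magnitude can be very large, since any image is highly unlikely under our model in the initial phase of training. To avoid this problem we use the reparametrization trick.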

Instead of directly sampling $\hat{t_i}$ from the variational posterior with mean $m$ and variance $s^2$, we can first sample from a standard normal and then transform it into a sample from the posterior, i.e.

Reparametrization view of VAE

$\hat{t_i} \sim q(t_i/x_i, \phi) = \mathcal{N} (m_i, diag(s_i^2))$

$\hat{t_i} = \varepsilon_i \odot s_i + m_i = g(\varepsilon_i, x_i, \phi)$ , where $\varepsilon_i \sim p(\varepsilon_i) = \mathcal{N}(0, I)$
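To make the variance argument concrete, here is a small sketch on a toy problem. It assumes a one-dimensional $q = \mathcal{N}(m, s^2)$ and uses the hypothetical function $f(t) = t^2$ as a stand-in for $\log p(x/t, w)$, so the true gradient of $\mathbf{E}_q f(t)$ w.r.to $m$ is known to be $2m$. It compares the log-derivative (score-function) estimator with the reparametrized one.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s, n = 1.0, 1.0, 100_000   # illustrative variational parameters and sample count

def f(t):
    # Toy stand-in for log p(x/t, w); chosen so the exact gradient is known.
    return t ** 2

eps = rng.standard_normal(n)  # eps ~ N(0, 1)
t = eps * s + m               # reparametrized sample, t ~ N(m, s^2)

# Log-derivative estimator: f(t) * d/dm log q(t), with d/dm log q(t) = (t - m) / s^2
score_grads = f(t) * (t - m) / s**2

# Reparametrized estimator: d/dm f(eps * s + m) = f'(t) = 2 t
reparam_grads = 2.0 * t

true_grad = 2.0 * m           # d/dm E[t^2] = d/dm (m^2 + s^2)

print("score-function: mean", score_grads.mean(), "variance", score_grads.var())
print("reparametrized: mean", reparam_grads.mean(), "variance", reparam_grads.var())
```

Both estimators are unbiased for the same gradient, but in this toy setting the reparametrized samples have a much smaller variance.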

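Now, the gradient can be written as

$\nabla_{\phi} f(w, \phi) = \sum_i \mathbf{E}_{p(\varepsilon_i)} \nabla_{\phi} \log p(x_i/g(\varepsilon_i, x_i, \phi),w)$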

This is simply the expectation of gradient of VAE w.r.to encoder parameters $\phi$, where expectation is w.r.to the standard normal. The reparametrization trick helps us in avoiding the high variance of gradients.

You can refer here for the implementation of VAE applied to MNIST dataset and conditional VAE that can generate an image from the given class label.

References

Bayesian Methods for Machine Learning by National Research University Higher School of Economics

Bayesian view and Variational Inference

2020-05-01T00:00:00-07:00

This blog post talks about the Bayesian view of statistics, the need for variational inference, and a simple mean field approximation method.

Bayesian view of statistics

Traditional ML methods treat model parameters as constants and try to find them using the maximum likelihood principle. Here the model parameters $\theta$ are just unknown, not random. This view of $\theta$ as a constant but unknown value is the one taken in frequentist statistics. An alternative approach to parameter estimation is the Bayesian view, which treats model parameters as random variables with unknown values. The following can be considered the main ideas of the Bayesian view.

Use prior knowledge
Choose the answer that best explains the observations
Avoid making extra assumptions (Occam’s razor)

Bayes’ Theorem

Let $\theta$ denote the parameters and $X$ the observations.

$\underbrace{p(\theta/X)}_{Posterior} = \frac{\overbrace{p(X/\theta)}^{Likelihood}\overbrace{p(\theta)}^{Prior}}{\underbrace{p(X)}_{Evidence}}$

In the above formula,

$p(\theta)$ is the prior: the information we have about the parameters before seeing the data
$p(X/\theta)$ is the likelihood: how well the parameters explain the observations
$p(\theta/X)$ is the posterior: how plausible the parameters are after observing the data
$p(X)$ is the evidence: how likely the observed data is, which can only be calculated if we have a model that can generate the data

Bayesian methods try to find the distribution over the model parameters after observing the data (the posterior). In case of

Training

$p(\theta/ X_{tr}, y_{tr}) = \frac{p(y_{tr}/X_{tr}, \theta) p(\theta)}{p(y_{tr}/X_{tr})}$

Inference

$p(y_{ts}/X_{ts}, X_{tr}, y_{tr}) = \int_{\theta} p(y_{ts}/X_{ts}, \theta) p(\theta/ X_{tr}, y_{tr}) d\theta$

By choosing a proper prior, we can embed our prior knowledge into the model and hence prior can be used as a regularizer. Bayes theorem can also be used for online training.

$\underbrace{p_k(\theta)}_{New prior} = \underbrace{p(\theta/x_k)}_{Posterior} = \frac{\overbrace{p(x_k/\theta)}^{Likelihood}\overbrace{p_{k-1}(\theta)}^{Prior}}{\underbrace{p(x_k)}_{Evidence}}$

On every iteration, we get new data and we use the posterior from the previous iteration as prior for the current iteration. Our posterior becomes more and more accurate with the incoming evidence over iterations.

Analytical Inference

The denominator in Bayes’ theorem, the evidence term $p(X)$ , is usually intractable to compute, so in general the posterior distribution cannot be calculated in closed form. Therefore, we approximate the posterior. One common approach is to replace the posterior with a single point estimate. The MAP (Maximum A Posteriori) estimate of $\theta$ is given by

\[ \begin{aligned} \theta_{MAP} &= argmax_{\theta} p(\theta/X) \newline &= argmax_{\theta} p(X/\theta) p(\theta) \end{aligned} \]

Conjugate distributions

A prior $p(\theta)$ is conjugate to the likelihood $p(X/\theta)$ if the posterior and the prior lie in the same family of distributions. For example, let both the prior and the likelihood be normal distributions with $p(\theta) = \mathcal{N}(\theta/m,s^2)$ and $p(X/\theta) = \mathcal{N}(X/\theta, \sigma^2)$

$\underbrace{p(\theta/X)}_{\mathcal{N}(a, b^2)} = \frac{\overbrace{p(X/\theta)}^{\mathcal{N}(X/\theta, \sigma^2)}\overbrace{p(\theta)}^{\mathcal{N}(\theta/m, s^2)}}{p(X)}$

i.e. if we choose prior that’s conjugate to the likelihood, we can avoid computing the evidence, since the posterior belongs to the prior family of distributions.

Let the likelihood be Bernoulli, with $N_1$ observed ones and $N_0$ observed zeros, and let the prior be a beta distribution, then

\[ \begin{aligned} p(X/\theta) &= \theta^{N_1}(1-\theta)^{N_0} \newline p(\theta) &= B(\theta/a,b) \propto \theta^{a-1} (1-\theta)^{b-1} \newline p(\theta/X) &\propto p(X/\theta) p(\theta) \newline &\propto \theta^{N_1+a-1} (1-\theta)^{N_0+b-1} \newline &= B(\theta/N_1+a, N_0+b) \end{aligned} \]

i.e. we have computed the exact posterior without calculating the evidence.
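The Beta-Bernoulli update is simple enough to sketch in a few lines. The snippet below assumes a hypothetical prior $B(\theta/2, 2)$ and a made-up sequence of coin flips; it also shows that feeding the observations one at a time, using each posterior as the next prior as in the online training formula earlier, gives the same result as a single batch update.

```python
import numpy as np

def beta_bernoulli_posterior(a, b, observations):
    """Return the parameters of the Beta posterior after Bernoulli observations."""
    obs = np.asarray(observations)
    n1 = int(np.sum(obs == 1))  # number of ones, N_1
    n0 = int(np.sum(obs == 0))  # number of zeros, N_0
    return a + n1, b + n0

data = [1, 0, 1, 1, 0, 1, 1, 1]   # illustrative observations
a0, b0 = 2, 2                     # illustrative prior parameters

# Batch update: all observations at once.
a_batch, b_batch = beta_bernoulli_posterior(a0, b0, data)

# Online update: the posterior after each point becomes the next prior.
a_online, b_online = a0, b0
for x in data:
    a_online, b_online = beta_bernoulli_posterior(a_online, b_online, [x])

posterior_mean = a_batch / (a_batch + b_batch)
theta_map = (a_batch - 1) / (a_batch + b_batch - 2)  # mode of Beta(a, b) for a, b > 1

print((a_batch, b_batch), (a_online, b_online), posterior_mean, theta_map)
```

No integral is evaluated anywhere: the posterior is obtained by adding counts to the prior parameters.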

Variational Inference

$p^*(z) = p(z/X) = \frac{p(X/z) p(z)}{p(X)} = \frac{p(X/z) p(z)}{\int p(X/z) p(z) dz} = \frac{\hat{p}(z)}{Z}$

Here $\hat{p}(z)$ is the un-normalized posterior and $Z$ is the normalization constant.

Computing the posterior using Bayes’ formula in closed form (as an analytical expression) is not possible in many cases because of the intractable integrals involved in calculating the evidence. It is possible when the prior is conjugate to the likelihood, but that is a restrictive requirement. So, there is a need to approximate the posterior distribution. Variational inference approximates the posterior using simpler, known distributions. The main idea behind variational inference is as follows

Pick a family of distributions $Q$ over the latent variables, indexed by variational parameters. Let’s call this the variational family.
Find the variational parameters such that $q(z)$ is the best approximation to the posterior $p^*(z)$ , i.e.

$\min_{q \in Q} KL(q(z) || p^{*}(z))$

Use the $q(z)$ with fitted parameters as an approximation to the posterior, e.g. to form predictions over future data. Typically, the true posterior does not lie in the variational family.

\[ \begin{aligned} KL(q(z) || p^{*}(z)) &= KL(q(z) || \frac{\hat{p}(z)}{Z}) \newline &= \int q(z) \log\frac{q(z)}{\hat{p}(z)/Z} dz \newline &= \int q(z) \log\frac{q(z)}{\hat{p}(z)} dz + \int q(z) \log Z dz \newline &= KL(q(z) || \hat{p}(z)) + \log Z \end{aligned} \]

Since $\log Z$ does not depend on $q$ , minimizing the KL divergence only requires the unnormalized posterior $\hat{p}(z)$ .
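This identity is easy to verify on a discrete toy example. The sketch below assumes a hypothetical unnormalized posterior over four states and a few random choices of $q$; for every $q$ the difference between the two KL terms should equal the same constant $\log Z$.

```python
import numpy as np

def kl(q, p):
    """sum_z q(z) log(q(z) / p(z)); p is allowed to be unnormalized."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

p_hat = np.array([2.0, 6.0, 1.0, 3.0])  # illustrative unnormalized posterior
Z = p_hat.sum()                          # normalization constant
p_star = p_hat / Z                       # true posterior

rng = np.random.default_rng(1)
gaps = []
for _ in range(5):
    q = rng.dirichlet(np.ones(4))        # a random distribution over the 4 states
    gaps.append(kl(q, p_star) - kl(q, p_hat))

print(gaps, np.log(Z))
```

Because the offset is the same for every $q$, the minimizer of $KL(q || \hat{p})$ is also the minimizer of $KL(q || p^*)$.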

Mean Field Approximation

In mean field variational inference, we assume that the variational family factorizes over the dimensions of latent variable.
$Q = \{ q; q(z) = q(z_1, z_2,...z_d) = q_1(z_1) q_2(z_2)...q_d(z_d)\}$
For example,
$p^*(z_1, z_2) \approx q_1(z_1) q_2(z_2)$
The goal is to find the best approximation $q(z)$ of $p^*(z)$ within this family. We use a coordinate ascent algorithm, iteratively optimizing each variational factor while keeping the others fixed.

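Minimizing the KL divergence w.r.to $q_k$ , with all other factors fixed, gives

$\log q_k(z_k) = \mathbf{E}_{q_{-k}} \log p^*(z) + const$

where $q_{-k}$ denotes the product of all factors except $q_k$ and the constant is fixed by normalizing $q_k$ .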
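The coordinate update can be tried on a small discrete posterior. This is a minimal sketch, assuming a hypothetical joint $p^*(z_1, z_2)$ stored as a random 3 x 4 table; the helper names are illustrative. Each step sets $\log q_k$ to the expectation of $\log p^*$ under the other factor and then normalizes.

```python
import numpy as np

def normalize_log(log_unnorm):
    """Exponentiate and normalize log-weights in a numerically stable way."""
    w = np.exp(log_unnorm - log_unnorm.max())
    return w / w.sum()

def kl_factorized(q1, q2, p):
    """KL( q1(z1) q2(z2) || p(z1, z2) ) for discrete distributions."""
    q = np.outer(q1, q2)
    return float(np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
p_star = rng.random((3, 4)) + 0.05   # illustrative joint over z1 in {0,1,2}, z2 in {0,..,3}
p_star /= p_star.sum()
log_p = np.log(p_star)

q1 = np.full(3, 1 / 3)               # start from uniform factors
q2 = np.full(4, 1 / 4)
kl_history = [kl_factorized(q1, q2, p_star)]

for _ in range(20):
    q1 = normalize_log(log_p @ q2)   # log q1(z1) = E_{q2} log p*(z1, z2) + const
    q2 = normalize_log(q1 @ log_p)   # log q2(z2) = E_{q1} log p*(z1, z2) + const
    kl_history.append(kl_factorized(q1, q2, p_star))

print(kl_history[0], kl_history[-1])
```

Each sweep can only lower (or keep) the KL divergence, so `kl_history` is non-increasing. The final value is usually not zero, because the true joint does not factorize.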

References

Bayesian Methods for Machine Learning by National Research University Higher School of Economics

Latent Variables and EM

2020-04-30T00:00:00-07:00

This blog post talks about latent variables, why we need them and how to train latent variable models.

Latent Variables

A latent variable is a variable that is never observed (hidden). Why do we need them?

Our data might contain missing values
We want to know about the uncertainty in our predictions.

Assume you work in HR at a company and you want to call eligible candidates for an interview. Also assume that the variables under consideration are things like GPA, IQ score, aptitude test score, high school grade etc. Not all candidates might have taken all the tests, and hence the data might contain missing values. Also, we want to quantify the uncertainty in our predictions, so that we can make better decisions about whom to invite for an interview. For example, if a candidate is predicted not to be a good fit, but with high uncertainty, we may as well invite them for the interview, since we are not sure they are a bad fit. To handle these kinds of data we need probabilistic models.

Probabilistic model

Suppose $X_1, X_2, X_3$ are random variables that depend on one another. Now let’s assume they take one of 100, 200 and 300 values respectively. So, the total number of combinations of all variables is 6 million. We can model the joint probability distribution as

$p(x_1, x_2, x_3) = \frac{\exp(-w^Tx)}{Z}$

Where $Z$ is the normalization constant, which is a sum over all 6 million combinations. This makes the training and inference impractical.

Now let’s introduce a new variable $t$ , which we call a latent variable, and assume that $t$ causes each of the RVs $X_1,X_2,X_3$ . Then,

\[ \begin{aligned} p(x_1,x_2,x_3) &= \int p(x_1, x_2, x_3/t) p(t) dt \newline &= \int p(x_1/t) p(x_2/t) p(x_3/t)p(t) dt \hspace{2ex}\text{(because the } x_j \text{ are conditionally independent given } t \text{)} \end{aligned} \]

Each conditional distribution is easy to model now, as the distribution is over 100, 200, 300 values respectively. So, the model complexity is reduced without compromising the flexibility. Probabilistic models can be used as generating models and can also be used to project data onto a low dimensional latent space.
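The saving can be made concrete with a quick count. The sketch below assumes, purely for illustration, a discrete latent variable with a hypothetical $K = 10$ states (the integral above then becomes a sum). It compares the number of free parameters of a full joint table with the latent variable model, and checks on a small instance that the factorized model is a valid joint distribution.

```python
import numpy as np

sizes = (100, 200, 300)   # number of values taken by x1, x2, x3 (from the text)
K = 10                    # hypothetical number of latent states

# A full joint table needs one probability per combination, minus one for normalization.
full_table_params = int(np.prod(sizes)) - 1

# Latent model: p(t) has K - 1 free parameters, each p(x_j / t) has K * (n_j - 1).
latent_model_params = (K - 1) + K * sum(n - 1 for n in sizes)

# Small instance: p(x1, x2, x3) = sum_t p(t) p(x1/t) p(x2/t) p(x3/t)
rng = np.random.default_rng(0)
small_sizes, small_K = (4, 5, 6), 3
p_t = rng.dirichlet(np.ones(small_K))
p_x_given_t = [rng.dirichlet(np.ones(n), size=small_K) for n in small_sizes]  # each (K, n_j)
joint = np.einsum("t,ta,tb,tc->abc", p_t, *p_x_given_t)

print(full_table_params, latent_model_params, joint.sum())
```

With these assumed numbers the latent variable model needs a few thousand parameters instead of several million, and the marginal over $t$ still sums to one.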

Probabilistic clustering

This is a soft clustering mechanism, which assigns each data point a probability of belonging to each cluster. This helps in quantifying the uncertainty in our cluster assignments.

Gaussian Mixture Model (GMMs)

When the data contains several clusters, fitting a single Gaussian to the data is not the best approach. Rather, we can use a mixture of Gaussians, which can model most types of clusters well.

As you can see in the above image, there are clearly 3 clusters, and a single gaussian fit to the data won’t be able to capture the structure in the data. We can use a mixture of 3 gaussians

$p(x/\theta) = \sum_{i=1}^{3} \pi_i \mathcal{N}(\mu_i, \sigma_i)$

where $\theta = \{\pi_1, \pi_2, \pi_3, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3\}$ , are parameters of the model.

If we find the parameters $\theta$ , we can tell which cluster each data point came from, and we can also generate new data using these parameters. All we need to do is sample from the distribution. How do we find the parameters? We use maximum likelihood.
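Generating new data from a fitted mixture takes two steps: pick a component according to the weights $\pi$, then sample from that Gaussian. Here is a minimal one-dimensional sketch; the parameter values are made up for illustration and `sample_gmm` is a hypothetical helper name.

```python
import numpy as np

def sample_gmm(n, pi, mu, sigma, rng):
    """Draw n samples from a 1-D Gaussian mixture; also return the component labels."""
    t = rng.choice(len(pi), size=n, p=pi)  # which Gaussian each point comes from
    x = rng.normal(mu[t], sigma[t])        # sample from the chosen Gaussian
    return x, t

# Illustrative parameters theta = {pi, mu, sigma} for 3 clusters.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-4.0, 0.0, 5.0])
sigma = np.array([1.0, 0.5, 1.5])

rng = np.random.default_rng(0)
x, t = sample_gmm(50_000, pi, mu, sigma, rng)

print(x.mean(), np.sum(pi * mu))           # sample mean vs. mixture mean
print(np.bincount(t) / len(t))             # empirical component frequencies vs. pi
```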

Training GMMs

Using maximum likelihood principle, the optimization problem is

\[ \begin{aligned} \max_{\theta} \quad & p(X/\theta) = \prod_{i=1}^N p(x_i/\theta) = \prod_{i=1}^{N} \sum_{k=1}^{C} \pi_k \mathcal{N}(\mu_k, \sigma_k)\newline \textrm{s.t.} \quad & \sum_{k=1}^C \pi_k = 1; \pi_k \geq 0\newline &\sigma_k > 0 \end{aligned} \]

We can use SGD to maximize the objective, but it is difficult to enforce the constraints, such as positive definiteness of the covariance matrices in the multivariate case. Let’s introduce a latent variable $t$ here and assume that each data point $x$ is generated using some information from $t$ , i.e. $t$ causes $x$ . Here $t \in \{1,2,..C\}$ , that is, it tells us which Gaussian the given data point comes from. $t$ is a latent variable; we don’t observe it.

Let the prior distribution on $t$ be $\pi$ , i.e. $p(t=c/\theta)= \pi_c$

The likelihood of the point $x$ belonging to the cluster $c$ is $p(x/t=c,\theta)= \mathcal{N}(\mu_c,\sigma_c)$

Now the likelihood under the mixture of Gaussians is

$p(x/\theta) = \sum_{c=1}^{C}p(x/t=c,\theta) p(t=c/\theta)= \sum_{c=1}^{C} \pi_c \mathcal{N}(\mu_c, \sigma_c)$

i.e. introducing latent variable has not changed the model. How to estimate parameters $\theta$ ?

If we know the sources $t$ , i.e. $p(t_i/x_i, \theta)$ , we can find the parameters $\theta$ . i.e. in one dimensional case,

$\mu_c = \frac{\sum_{i} p(t_i=c/x_i, \theta) x_i}{\sum_{i} p(t_i=c/x_i, \theta)}$

$\sigma_c^2 = \frac{\sum_{i} p(t_i=c/x_i, \theta) (x_i-\mu_c)^2}{\sum_{i} p(t_i=c/x_i, \theta)}$

If we know the parameters $\theta$ , we can find the sources $t$ using Bayes’ rule

$p(t=c/x, \theta) = \frac{p(x/t=c, \theta)p(t=c/\theta)}{\sum_{c'=1}^C p(x/t=c', \theta)p(t=c'/\theta)}$
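The posterior over the sources is a direct application of Bayes’ rule. A small sketch for the one-dimensional case follows, with illustrative parameter values and a hypothetical helper name `responsibilities`; each row of the result is the distribution $p(t=c/x, \theta)$ for one data point.

```python
import numpy as np

def responsibilities(x, pi, mu, sigma):
    """Return p(t=c / x, theta) for every point in x, shape (N, C)."""
    x = np.asarray(x, dtype=float)[:, None]
    # Gaussian likelihood p(x / t=c, theta) for each point and cluster.
    lik = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    joint = lik * pi                                 # numerator of Bayes' rule
    return joint / joint.sum(axis=1, keepdims=True)  # divide by the sum over clusters

# Illustrative parameters for 3 clusters.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-4.0, 0.0, 5.0])
sigma = np.array([1.0, 0.5, 1.5])

r = responsibilities([-4.2, 0.1, 2.0], pi, mu, sigma)
hard_assignment = r.argmax(axis=1)  # most probable cluster per point

print(np.round(r, 4))
print(hard_assignment)
```

The point at 2.0 lies between the second and third clusters, so its row shows how the soft assignment spreads probability instead of committing to a single cluster.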

This is basically a chicken and egg problem. We know neither parameters nor sources.

Expectation-Maximization algorithm

1. Start with random Gaussian parameters $\theta$

2. Until convergence, repeat

a) E-step: for each point, compute $p(t_i = c/x_i, \theta)$ , the probability that the point comes from cluster $c$ .

b) M-step: update the Gaussian parameters $\theta$ to fit the points assigned to them.

EM algorithm can be faster than SGD and handles the complicated constraints. EM algorithm suffers from local maxima, i.e. different initialization can lead to different solutions.

Concave-function

A function $f(x)$ is concave if, $\forall a, b$ and $0 \leq \alpha \leq 1$ , $f(\alpha a + (1-\alpha) b) \geq \alpha f(a) + (1-\alpha) f(b)$

In general, $\forall a_1, a_2,..a_n$ and weights $\alpha_i \geq 0$ with $\sum_{i} \alpha_i = 1$ ,

$f(\alpha_1 a_1+\alpha_2 a_2+...+\alpha_n a_n) \geq \alpha_1 f(a_1) + \alpha_2 f(a_2)+...+ \alpha_n f(a_n) \label{eq1}$

If we assume a RV $t$ , such that $p(t=a_i)= \alpha_i$ , then the inequality (\ref{eq1}) can be written as

$f(\mathbf{E}_{p(t)} t ) \geq \mathbf{E}_{p(t)} (f(t)) \label{eq2}$

(\ref{eq2}) is called Jensen’s inequality.

$f(x) = \log(x)$ is a concave function. For a concave function, every local maximum is also a global maximum.
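Jensen’s inequality for the logarithm can be checked numerically. The sketch below uses randomly chosen points $a_i > 0$ and weights $\alpha_i$ (both illustrative) and compares $\log \mathbf{E}[t]$ with $\mathbf{E}[\log t]$.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 10.0, size=6)   # values a_1..a_n taken by the RV t
alpha = rng.dirichlet(np.ones(6))    # probabilities p(t = a_i), non-negative and summing to 1

lhs = np.log(np.sum(alpha * a))      # f(E[t]) with f = log
rhs = np.sum(alpha * np.log(a))      # E[f(t)]

print(lhs, rhs, lhs >= rhs)
```

The same inequality, applied with weights $q(t_i=c)$, is what produces the variational lower bound in the general form of EM.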

Kullback-Leibler Divergence

In the above image, even though the parametric difference between the distributions is same, the second set of distributions are closer to each other when compared to that of first set. KL divergence would reflect this in its value.

KL divergence is a measure of how a probability distribution is different from a reference probability distribution. For $p$ and $q$ , discrete probability distributions on the same space $\mathcal{X}$ , it is defined as

\[ \begin{aligned} KL(q || p) &= \sum_{x \in \mathcal{X}} q(x) \log \left(\frac{q(x)}{p(x)}\right) \newline &= \mathbf{E}_{q} \log\left(\frac{q}{p}\right) \end{aligned} \]

In other words, it is the expectation of the logarithmic difference between the probabilities $q$ and $p$ , where the expectation is taken using the probabilities $q$ . For continuous distributions, sum is replaced by integration. Some properties are

KL divergence is not symmetric in general, and hence not a metric: $KL(q || p) \neq KL(p || q)$ .
For identical distributions $q$ and $p$ , $KL(q || p) = 0$ .
It is non-negative: $KL(q || p) \geq 0$ .
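These properties can be seen on a small discrete example. The following sketch assumes two made-up distributions over three outcomes with strictly positive probabilities (the helper does not handle zeros).

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions with strictly positive entries."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.5, 0.4, 0.1])   # illustrative distributions on the same space
p = np.array([0.3, 0.3, 0.4])

kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
kl_qq = kl_divergence(q, q)

print(kl_qp, kl_pq, kl_qq)
```

Both directions are positive but differ in value, and the divergence of a distribution from itself is zero.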

The above figure illustrates the mode-seeking vs mode-covering behaviour of KL optimization. The blue distribution is the reference distribution and the red one is the parametric distribution that we fit by minimizing the KL divergence.

General form of EM

Maximum Likelihood objective of EM algorithm is

\[ \begin{aligned} \max_{\theta} \quad \log p(X/\theta) &= \log \prod_{i=1}^N p(x_i/\theta) \newline &= \sum_{i=1}^N \log(p(x_i/\theta)) \newline &= \sum_{i=1}^{N} \log \left(\sum_{c=1}^C p(x_i, t_i=c/\theta)\right) \end{aligned} \]

Now let’s introduce a new distribution $q$ on latent variables $t$ , such that we can form a family of variational lower bounds on the objective function.

\[ \begin{aligned} \log p(X/\theta) &= \sum_{i=1}^N \log p(x_i/\theta) \newline &= \sum_{i=1}^{N} \log \left(\sum_{c=1}^C p(x_i, t_i=c/\theta)\right) \newline &= \sum_{i=1}^{N} \log \left(\sum_{c=1}^C \frac{q(t_i=c)}{q(t_i=c)}p(x_i, t_i=c/\theta)\right) \newline &\geq \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log\left(\frac{p(x_i, t_i=c/\theta)}{q(t_i=c)}\right) \hspace{4ex}\text{ since} \hspace{1ex} \log\left(\sum_c \alpha_c\nu_c\right) \geq \sum_c \alpha_c \log(\nu_c) \newline &= L(\theta, q) \end{aligned} \]

Here $L(\theta, q)$ is a family of variational lower bounds; varying $q$ gives different lower bounds. For a given $\theta_k$ , we choose $q_{k+1}$ such that $L(\theta_k, q)$ is the best lower bound, i.e. $q_{k+1} = argmax_q L(\theta_k, q)$ . Once we have found $q_{k+1}$ , we find the best parameters w.r.to this lower bound, i.e. $\theta_{k+1} = argmax_{\theta} L(\theta, q_{k+1})$ . The EM algorithm performs these two steps repeatedly: first, finding the best lower bound for the given parameters, and second, optimizing the parameters with respect to that lower bound.

$k^{th}$ iteration of EM	$(k+1)^{th}$ iteration of EM

E-step

$\log p(X/\theta) \geq L(\theta, q)$

We want to find the best lower bound w.r.to the given params $\theta_k$ . This is same as minimising the gap (see the figure below) between the original log likelihood and the lower bound. i.e.

\[ \begin{aligned} GAP &= \log p(X/\theta) - L(\theta, q) \newline &= \sum_{i=1}^N \log p(x_i/\theta) - \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log\left(\frac{p(x_i, t_i=c/\theta)}{q(t_i=c)}\right) \newline &= \sum_{i=1}^N \left( \log p(x_i/\theta) \sum_{c=1}^C q(t_i=c) - \sum_{c=1}^C q(t_i=c) \log\left(\frac{p(x_i, t_i=c/\theta)}{q(t_i=c)}\right) \right) \newline &= \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \left( \log p(x_i/\theta) - \log\left(\frac{p(x_i, t_i=c/\theta)}{q(t_i=c)}\right) \right) \newline &= \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log\left(\frac{p(x_i/\theta) q(t_i=c)}{ p(x_i, t_i=c/\theta)}\right) \newline &= \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log \frac{q(t_i=c)}{p(t_i=c/x_i, \theta)} \newline &= \sum_{i=1}^N KL\left(q(t_i) || p(t_i/ x_i, \theta)\right) \geq 0 \end{aligned} \]

Now,

\[ \begin{aligned} argmax_q L(\theta_k, q) &= argmin_q \sum_{i=1}^N KL\left(q(t_i) || p(t_i/x_i, \theta_k)\right) \end{aligned} \]

Since the gap is always non-negative, minimising the gap w.r.to variational parameter $q$ implies $q(t_i) = p(t_i/x_i, \theta)$ , which is a posterior distribution on latent variable $t$ . Therefore, E-step reduces to

$q_{k+1} = argmax_{q(t_i)} L(\theta_k, q(t_i)) = p(t_i/x_i, \theta_k)$
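The relation between the lower bound, the gap and the posterior can be checked numerically. This is a minimal sketch for a one-dimensional mixture of two Gaussians with made-up data and parameters; the helper names are illustrative. It evaluates $L(\theta, q)$ for a random $q$ and for $q$ equal to the posterior.

```python
import numpy as np

def log_joint(x, pi, mu, sigma):
    """log p(x_i, t_i=c / theta) for a 1-D Gaussian mixture, shape (N, C)."""
    x = x[:, None]
    return np.log(pi) - 0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def lower_bound(q, lj):
    """L(theta, q) = sum_i sum_c q(t_i=c) log( p(x_i, t_i=c / theta) / q(t_i=c) )."""
    return float(np.sum(q * (lj - np.log(q))))

# Illustrative data and parameters.
x = np.array([-2.1, -1.9, 0.2, 3.0, 3.3])
pi = np.array([0.4, 0.6])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 1.0])

lj = log_joint(x, pi, mu, sigma)
log_lik = float(np.sum(np.log(np.sum(np.exp(lj), axis=1))))    # log p(X / theta)
posterior = np.exp(lj) / np.exp(lj).sum(axis=1, keepdims=True)  # p(t_i / x_i, theta)

rng = np.random.default_rng(0)
q_random = rng.dirichlet(np.ones(2), size=len(x))               # an arbitrary q

gap_random = log_lik - lower_bound(q_random, lj)
gap_posterior = log_lik - lower_bound(posterior, lj)
sum_kl = float(np.sum(q_random * np.log(q_random / posterior)))  # sum_i KL(q || posterior)

print(gap_random, sum_kl, gap_posterior)
```

For the random $q$ the gap is positive and matches the sum of KL divergences; for the posterior the gap vanishes up to floating point error, which is why the E-step picks it.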

M-step

\[ \begin{aligned} \max_{\theta} L(\theta, q) &= \max_{\theta} \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log \frac{p(x_i, t_i=c/\theta)}{q(t_i=c)} \newline &= \max_{\theta} \left( \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log p(x_i, t_i=c/\theta) - \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log q(t_i=c) \right) \newline &= \max_{\theta} \mathbf{E}_{q} \log p(X,T/\theta) + const \end{aligned} \]

The second sum does not depend on $\theta$ , so it is a constant in the M-step.

The function $\log p(X,T/\theta)$ is usually concave in $\theta$ , so it is easier to maximize than the original maximum likelihood objective.

Convergence Guarantees

$\log p(X/\theta_{k+1}) \geq L(\theta_{k+1}, q_{k+1}) \geq L(\theta_{k}, q_{k+1}) = \log p(X/\theta_{k})$

i.e. the marginal log likelihood never decreases during an iteration of the EM algorithm, which means EM never makes things worse. This property is very useful in debugging an implementation. The algorithm is guaranteed to converge, although possibly only to a local maximum.

EM applied to GMM

E-step

$q(t_i=c) = p(t_i=c/x_i, \theta)$

where $x_i$ is the $i^{th}$ data point and $\theta = \{\pi_c, \mu_c, \sigma_c\}_{c=1,2,..C}$

M-step

$\theta = argmax_{\theta} \mathbf{E}_{q} \log p(X, T/\theta)$

\[ \begin{aligned} \mathbf{E}_{q} \log p(X, T/\theta) &= \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log \left( p(x_i/t_i=c, \theta) p(t_i=c/\theta) \right) \newline &= \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \log \left(\frac{1}{Z_c} \exp\left(-\frac{(x_i-\mu_c)^2}{2\sigma_c^2}\right) \pi_c \right) \newline &= \sum_{i=1}^N \sum_{c=1}^C q(t_i=c) \left(\log \frac{\pi_c}{\sqrt{2\pi\sigma_c^2}} - \frac{(x_i-\mu_c)^2}{2\sigma_c^2} \right) \end{aligned} \]

Equating the partial derivatives w.r.to $\mu_k, \sigma_k, \pi_k; 1\leq k \leq C$ to zero (with the constraint $\sum_k \pi_k = 1$ ) gives

\[ \begin{aligned} \mu_k &= \frac{\sum_{i=1}^N q(t_i=k) x_i}{\sum_{i=1}^N q(t_i=k)} \newline \sigma_k^2 &= \frac{\sum_{i=1}^N q(t_i=k) (x_i-\mu_k)^2}{\sum_{i=1}^N q(t_i=k)} \newline \pi_k &= \frac{\sum_{i=1}^N q(t_i=k)}{N} \end{aligned} \]
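Putting the E-step and M-step together gives a compact EM loop. The sketch below is for one-dimensional data and reads the second update as the weighted variance of each cluster; the synthetic data, the initial values and the function names are assumptions made for illustration, not the original implementation. It records the marginal log likelihood at every iteration so the convergence guarantee can be observed.

```python
import numpy as np

def e_step(x, pi, mu, var):
    """Return q(t_i=c) = p(t_i=c / x_i, theta) and the marginal log likelihood."""
    log_joint = np.log(pi) - 0.5 * np.log(2 * np.pi * var) - (x[:, None] - mu) ** 2 / (2 * var)
    m = log_joint.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    log_marginal = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
    return np.exp(log_joint - log_marginal), float(log_marginal.sum())

def m_step(x, q):
    """Closed-form updates for the weights, means and variances."""
    nk = q.sum(axis=0)                                   # sum_i q(t_i=k)
    mu = (q * x[:, None]).sum(axis=0) / nk
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)
    return pi, mu, var

# Synthetic data from three clusters (illustrative).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(0, 0.5, 150), rng.normal(5, 1.5, 150)])

# Arbitrary initial parameters.
pi = np.full(3, 1 / 3)
mu = np.array([-1.0, 0.5, 1.0])
var = np.ones(3)

log_liks = []
for _ in range(100):
    q, ll = e_step(x, pi, mu, var)   # E-step: posterior over sources
    log_liks.append(ll)
    pi, mu, var = m_step(x, q)       # M-step: refit the parameters

print(pi, mu, np.sqrt(var))
print(log_liks[0], log_liks[-1])
```

Plotting or printing `log_liks` is a handy debugging aid: if the sequence ever decreases by more than rounding error, the implementation has a bug. Different initial values may end in different local maxima.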

Summary of EM

Method for training latent variable models.
Handles missing data.
Solve a sequence of simple tasks instead of one hard one.
Guaranteed to converge (possibly to a local maximum).
Helps with complicated parameter constraints like positive definiteness.
Several extensions (variational EM, sampling in M-step using MCMC etc)

References

Bayesian Methods for Machine Learning by National Research University Higher School of Economics
