0. Notation sets
\(X_{1:D}\), where we introduce the Matlab-like notation \(1:D\) to denote the set \(\{1, 2, \ldots, D\}\).
1. Prob & Stats Terminologies
 A diagonal covariance matrix has D parameters, and has 0s in the off-diagonal terms.
 A spherical or isotropic covariance, \(\Sigma = \sigma^2 \bI_D \), has one free parameter.
 Disjoint: Two events do not overlap.
 Disjoint vs. independent: Two disjoint events with positive probability are never independent, since Pr(A, B) = 0 while Pr(A) Pr(B) > 0. If one of the events is impossible, the two events are trivially independent.

\(\E_\btheta[\btheta] = \E_\data[\E_\btheta[\btheta \mid \data]]\) (Bishop P74): the posterior mean of \(\btheta\), averaged over the distribution generating the data, is equal to the prior mean of \(\btheta\).
 Biased but consistent estimator  Wikipedia
 Expectation for nonnegative random variable X
1.1 Normal
 Normal: $p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right\}$
 Normal: $\Sigma = Cov(X) = \E[(X-\mu)(X-\mu)^\top]$
 Squared Mahalanobis distance: $(x-\mu)^\top\Sigma^{-1}(x-\mu)$
 Log of the normal density: $\log p = -\frac{p}{2} \log 2\pi - \frac{1}{2}\log |\Sigma| - \frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)$
 trace trick: $x^\top A x=tr(x^\top A x)=tr(xx^\top A)=tr(Axx^\top)$
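As a sanity check on the log-density formula, here is a minimal sketch in plain Python. The helper names `cholesky` and `gaussian_logpdf` are made up for this note. The sketch assumes $\Sigma$ is symmetric positive definite and uses its Cholesky factor $L$ (with $LL^\top = \Sigma$), so that $\log|\Sigma| = 2\sum_i \log L_{ii}$ and the squared Mahalanobis distance is $\|z\|^2$ with $Lz = x - \mu$.

```python
import math

def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive-definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def gaussian_logpdf(x, mu, Sigma):
    """log N(x | mu, Sigma), computed through the Cholesky factor of Sigma."""
    p = len(x)
    L = cholesky(Sigma)
    d = [xi - mi for xi, mi in zip(x, mu)]
    # Forward substitution: solve L z = (x - mu).
    z = []
    for i in range(p):
        z.append((d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    maha = sum(zi * zi for zi in z)                            # (x-mu)^T Sigma^{-1} (x-mu)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(p))   # log |Sigma|
    return -0.5 * (p * math.log(2.0 * math.pi) + log_det + maha)
```

For example, `gaussian_logpdf([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` evaluates the standard bivariate normal at its mean, where the formula reduces to $-\log 2\pi$.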
2. Matrix Terminologies
 Linearly dependent: In the theory of vector spaces, a set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others; if no vector in the set can be written in this way, then the vectors are said to be linearly independent. Wikipedia
 Covariance matrix and correlation matrix: Sec 2.5.1 of Kevin Murphy’s book
 Autocorrelation matrix and cross-correlation matrix: P29 of Zhang matrix
 Unitary Matrix
 Equivalent statements for \(\bA \in \mathbb{R}^{n\times n} \) (P66 of Zhang matrix):
 $\bA$ is nonsingular;
 $\bA^{-1}$ exists;
 rank($\bA$) = n;
 The rows of $\bA$ are linearly independent;
 The columns of $\bA$ are linearly independent;
 $det(\bA) \neq 0$;
 The dimension of the null space of $\bA$ is 0;
 $\bA \bx = \bb$ has exactly one solution for every $\bb$;
 $\bA \bx = \bzero$ has only the trivial solution $\bx=\bzero$;
 algebraic multiplicity and geometric multiplicity:
 assume $A \in \mathbb{R}^{n\times n}$, $det(xI - A) = (x - \lambda_1)^{k_1} \cdots (x - \lambda_m)^{k_m}$,
 then $k_i$ is called the algebraic multiplicity of eigenvalue $\lambda_i$, denoted $alg(\lambda_i)$;
 the dimension of the eigenspace $Ker(\lambda_i I - A)$ is called the geometric multiplicity of $\lambda_i$, denoted $geo(\lambda_i)$.
 By definition, the sum of the algebraic multiplicities is equal to n, but the sum of the geometric multiplicities can be strictly smaller: $geo(\lambda_i) \leq alg(\lambda_i)$.
3. Opt Terms.
 $L$-Lipschitz:
 $\| \nabla f(x) \| \leq L$; Sebastien Bubeck’s survey says this should also be the dual norm, $\|\cdot\|_*$;
 $|f(x) - f(y)| \leq L \| x-y \|$.
 $\beta$-smooth convex: [Note] that a smooth function is differentiable.
 $\| \nabla f(x) - \nabla f(y) \|_* \leq \beta \| x-y \|$; notice that the Euclidean norm is self-dual.
 $f(y) - f(x) - \langle \nabla f(x), y-x \rangle \leq \frac{\beta}{2} \| x-y \|^2$;
 $f(y) - f(x) - \langle \nabla f(x), y-x \rangle \geq \frac{1}{2\beta} \| \nabla f(x) - \nabla f(y) \|_*^2$;
 If we only have smoothness (not convexity), then we have $|f(y) - f(x) - \langle \nabla f(x), y-x \rangle| \leq \frac{\beta}{2} \| x-y \|^2$;
 $f(\lambda x + (1-\lambda)y) \geq \lambda f(x) + (1-\lambda) f(y) - \frac{\beta}{2} \lambda (1-\lambda) \|x-y\|^2$;
 $\nabla^2 f(x) \preceq \beta I $;
 the eigenvalues of the Hessian are at most $\beta$;
 $\alpha$-strongly convex:
 $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y) - \frac{\alpha}{2} \lambda (1-\lambda) \|x-y\|^2$;
 $f(y) - f(x) - \langle \nabla f(x), y-x \rangle \geq \frac{\alpha}{2} \| x-y \|^2 $, replace $\nabla f(x)$ with $g \in \partial f(x)$ if $f$ is nondifferentiable;
 $\alpha I \preceq \nabla^2 f(x) $;
 $f$ is $\alpha$-strongly convex (with respect to the Euclidean norm) iff $f(x) - \frac{\alpha}{2} \|x\|^2 $ is convex;
 Proper function: A function $f: \mathbb{E} \rightarrow [-\infty, \infty]$ is called proper if it does not attain the value $-\infty$ and there exists at least one $x \in \mathbb{E}$ such that $f(x) < \infty$;
 Closed function: A function $f: \mathbb{E} \rightarrow [-\infty, \infty]$ is closed if its epigraph is closed. In this case, $f$ is lower semicontinuous: $\liminf_{y\rightarrow x} f(y) \geq f(x)$;
 Conjugate function: Let $f: \mathbb{E} \rightarrow [-\infty,\infty]$ be an extended real-valued function. The function $f^* : \mathbb{E}^* \rightarrow [-\infty,\infty]$, defined by $f^*(y)= \max_{x\in \mathbb{E}} \{\langle y,x \rangle - f(x)\}$, $y \in \mathbb{E}^*$, is called the conjugate function of $f$;
 properness: if $f$ is proper convex, then $f^*$ is proper;
 Fenchel’s inequality: $f(x) + f^*(y) \geq \langle y,x \rangle$;
 Conjugate correspondence theorem: a proper closed convex $f$ is $\sigma$-strongly convex if and only if $f^*$ is $\frac{1}{\sigma}$-smooth.
 Conjugate subgradient theorem: See Theorem 4.20 of [First Order Book].
 interior and relative interior: the relative interior of a nonempty convex set is always nonempty. See Wainwright & Jordan, 2008, Appendix 2.3;
 Full-dimensional: A convex set $\mathcal{C} \subseteq \mathbb{R}^d$ is full-dimensional if its affine hull is equal to $\mathbb{R}^d$;
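The $\beta$-smooth and $\alpha$-strongly convex first-order bounds from this section, $\frac{\alpha}{2}\|x-y\|^2 \leq f(y) - f(x) - \langle \nabla f(x), y-x \rangle \leq \frac{\beta}{2}\|x-y\|^2$, can be sanity-checked on a toy problem. The sketch below is only an illustration. It assumes the quadratic $f(x) = \frac{1}{2}\sum_i a_i x_i^2$ with the Euclidean norm, whose Hessian eigenvalues are the $a_i$, so $\alpha = \min_i a_i$ and $\beta = \max_i a_i$. The names `A_DIAG`, `linearization_gap`, and `check_bounds` are made up for this note.

```python
import random

A_DIAG = [1.0, 4.0]  # Hessian eigenvalues of the toy quadratic: alpha = 1, beta = 4

def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))

def grad_f(x):
    return [a * xi for a, xi in zip(A_DIAG, x)]

def linearization_gap(x, y):
    """f(y) - f(x) - <grad f(x), y - x>."""
    g = grad_f(x)
    return f(y) - f(x) - sum(gi * (yi - xi) for gi, xi, yi in zip(g, x, y))

def sq_dist(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def check_bounds(n_trials=1000, seed=0, tol=1e-9):
    """True if the gap lies in [alpha/2 |x-y|^2, beta/2 |x-y|^2] on random pairs."""
    rng = random.Random(seed)
    alpha, beta = min(A_DIAG), max(A_DIAG)
    for _ in range(n_trials):
        x = [rng.uniform(-5, 5) for _ in A_DIAG]
        y = [rng.uniform(-5, 5) for _ in A_DIAG]
        gap, d2 = linearization_gap(x, y), sq_dist(x, y)
        if not (0.5 * alpha * d2 - tol <= gap <= 0.5 * beta * d2 + tol):
            return False
    return True
```

Moving from the origin along the first coordinate attains the lower bound and moving along the second attains the upper bound, which shows both constants are tight for this $f$.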
4. VI terms
 score function: $\nabla_x \log p(x)$. [See the Stein variational inference paper]
5. Math Facts
 \(\lim_{n\rightarrow \infty} (1+\frac{x}{n})^n = e^x\)
 The above limit uses the fact that \(\log(1+x)=x+O(x^2)\) as \(x \rightarrow 0\). A proof of this can be found at [link].
 log-sum-exp trick
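The log-sum-exp trick computes $\log \sum_i e^{x_i}$ stably by factoring out the maximum: $\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i - m}$ with $m = \max_i x_i$, so every exponent is at most 0 and nothing overflows. A minimal sketch in plain Python follows; the function name `logsumexp` is chosen here for illustration.

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x) for x in xs)) for a non-empty list of floats."""
    m = max(xs)
    if math.isinf(m):
        # All entries are -inf (result -inf), or some entry is +inf (result +inf).
        return m
    # After subtracting m, the largest exponent is exactly 0.
    return m + math.log(sum(math.exp(x - m) for x in xs))
```

A naive `math.log(math.exp(1000.0) + math.exp(1000.0))` raises `OverflowError`, while `logsumexp([1000.0, 1000.0])` returns a finite value close to $1000 + \log 2$.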
Equalities
Here are some equalities that you need to keep in mind:
Anonymous
\(|a-b| = a(1-2b)+b\) for \(a,b \in \{0,1\}\)
Anonymous
 For any scalar $x$ with $|1-x|<1$, $\frac{1}{x} = \sum_{k=0}^\infty (1-x)^k$ (from link). Extend to matrix:
 matrix inverse: $\bA^{-1} = \sum_{k=0}^\infty (\bI - \bA)^k$ if the eigenvalues of $\bA$ lie within (0,1).
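The series $\sum_{k=0}^\infty (I - A)^k$ for the matrix inverse can be checked numerically by truncating the sum. The sketch below is a plain-Python illustration, not an efficient inversion routine. The helper names `matmul` and `neumann_inverse` and the truncation length `n_terms` are assumptions made for this note.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def neumann_inverse(A, n_terms=200):
    """Approximate A^{-1} by the truncated series sum_{k < n_terms} (I - A)^k."""
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    M = [[I[i][j] - A[i][j] for j in range(n)] for i in range(n)]  # I - A
    total = [row[:] for row in I]  # k = 0 term
    power = [row[:] for row in I]  # running power (I - A)^k
    for _ in range(1, n_terms):
        power = matmul(power, M)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total
```

For the symmetric matrix `A = [[0.5, 0.1], [0.1, 0.4]]`, whose eigenvalues (about 0.34 and 0.56) lie in (0,1), `matmul(A, neumann_inverse(A))` is numerically the identity.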
6. Inequalities
Here are some inequalities that are convenient to keep in mind for research:
Anonymous
\(\sum_{t=1}^T\frac{1}{t} \leq \log(T)+1\)
Anonymous 2
\(1-x \leq e^{-x}, \forall x \geq 0\)
Cauchy-Schwarz inequality
[here].
Jensen’s inequality
If \(f\) is a real continuous function that is convex, and \(x\) is a random variable, then $f(\mathbb{E} x) \leq \mathbb{E}f(x)$. A more detailed explanation can be found [here].
Hoeffding’s lemma
For a zero-mean random variable \(U\) bounded almost surely as \(a \leq U \leq b\), and any \(\lambda \in \mathbb{R}\), $\mathbb{E} \exp(\lambda \, U) \leq \exp\left\{\frac{\lambda^2(b-a)^2}{8}\right\}$
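A quick check of Hoeffding’s lemma on one concrete case: take \(U\) uniform on \(\{-1, +1\}\) (a Rademacher variable, chosen here purely as an example), so \(a=-1\), \(b=1\), \(\mathbb{E}\exp(\lambda U) = \cosh(\lambda)\), and the bound reads \(\cosh(\lambda) \leq \exp(\lambda^2/2)\). The function names below are made up for this sketch.

```python
import math

def rademacher_mgf(lam):
    """E[exp(lam * U)] for U uniform on {-1, +1}; this equals cosh(lam)."""
    return 0.5 * (math.exp(lam) + math.exp(-lam))

def hoeffding_bound(lam, a, b):
    """Right-hand side of Hoeffding's lemma: exp(lam^2 (b - a)^2 / 8)."""
    return math.exp(lam ** 2 * (b - a) ** 2 / 8.0)

def lemma_holds_on_grid(lams, a=-1.0, b=1.0):
    """True if the MGF is below the bound at every lam in the grid."""
    return all(rademacher_mgf(lam) <= hoeffding_bound(lam, a, b) for lam in lams)
```

Calling `lemma_holds_on_grid([k / 2 for k in range(-6, 7)])` checks the inequality for \(\lambda \in [-3, 3]\) in steps of 0.5; the two sides coincide at \(\lambda = 0\).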
Markov’s inequality
If \(U\) is a nonnegative random variable on \(\mathbb{R}\), then for all \(t>0\)
$Pr(U>t) \leq \frac{1}{t} \mathbb{E}[U]$
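Proof.
$\mathbb{E}[U] \geq \mathbb{E}[U \, \mathbb{1}\{U > t\}] \geq t \, Pr(U>t)$
where the first inequality uses the fact that \(U\) is nonnegative and the second uses \(U > t\) on the event \(\{U > t\}\).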
Chebyshev’s inequality
If \(Z\) is a random variable on \(\mathbb{R}\) with mean \(\mu\) and variance \(\sigma^2\), then
$Pr(|Z - \mu| \geq \sigma t)\leq \frac{1}{t^2}$
Proof.
Hint: apply Markov’s inequality to \((Z-\mu)^2\).
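Both Markov’s and Chebyshev’s inequalities can be verified exactly on a small discrete example. The sketch below assumes a fair six-sided die (an example chosen for this note, not taken from the text) and uses exact rational arithmetic, so no floating-point tolerance is needed. Chebyshev is tested in the squared form \(Pr((Z-\mu)^2 \geq \sigma^2 t^2) \leq 1/t^2\), which is the same event.

```python
from fractions import Fraction

VALUES = range(1, 7)   # faces of a fair die
P = Fraction(1, 6)     # probability of each face

def prob(event):
    """Exact probability that a face satisfies the predicate `event`."""
    return sum(P for v in VALUES if event(v))

MEAN = sum(P * v for v in VALUES)               # 7/2
VAR = sum(P * (v - MEAN) ** 2 for v in VALUES)  # 35/12

def markov_holds(t):
    """Pr(U > t) <= E[U] / t for t > 0."""
    return prob(lambda v: v > t) <= MEAN / t

def chebyshev_holds(t):
    """Pr(|Z - mu| >= sigma t) <= 1 / t^2, checked via squared deviations."""
    return prob(lambda v: (v - MEAN) ** 2 >= VAR * t * t) <= Fraction(1) / (t * t)
```

For instance, `markov_holds(3)` compares \(Pr(U>3) = 1/2\) with \(3.5/3\), and `chebyshev_holds(Fraction(3, 2))` compares the probability of a deviation of at least \(1.5\sigma\) with \(4/9\).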
Chernoff’s Bounding method
Let \(Z\) be a random variable on \(\mathbb{R}\). Then for all \(t>0\)
$Pr(Z\geq t) \leq \inf_{s>0} e^{-st}M_z(s)$
where \(M_z\) is the moment-generating function of \(Z\).
Proof.
For any \(s>0\) we can use Markov’s inequality to obtain:
$Pr(Z \geq t) = Pr(sZ \geq st) = Pr(e^{sZ} \geq e^{st}) \leq e^{-st}\mathbb{E}[e^{sZ}] = e^{-st}M_z(s)$
Since \(s>0\) was arbitrary, the bound follows by taking the infimum over \(s\).
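To see the Chernoff method in action, consider a standard normal \(Z\) (an example assumed here for illustration). Its moment-generating function is \(M_z(s) = e^{s^2/2}\), so \(e^{-st}M_z(s)\) is minimized at \(s = t\), giving \(Pr(Z \geq t) \leq e^{-t^2/2}\). The sketch below approximates the infimum by a grid search over \(s\) and compares it with the exact tail from `math.erfc`. The function names and the grid are choices made for this note.

```python
import math

def normal_tail(t):
    """Exact Pr(Z >= t) for a standard normal Z."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def chernoff_bound(t, s_grid):
    """min over the grid of exp(-s t) * M_Z(s), with M_Z(s) = exp(s^2 / 2)."""
    return min(math.exp(-s * t + 0.5 * s * s) for s in s_grid)

S_GRID = [0.01 * k for k in range(1, 1001)]  # s in (0, 10]
```

At \(t = 2\), `chernoff_bound(2.0, S_GRID)` is close to \(e^{-2} \approx 0.135\), while `normal_tail(2.0)` is about 0.023: the bound is valid but not tight.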
Probabilities
Conditional independence
Two events A and B are conditionally independent given C if Pr(A,B|C) = Pr(A|C) Pr(B|C). Then:
 Pr(A,B|C) = Pr(B|C) Pr(A|C,B) always holds, so conditional independence gives Pr(A|C,B) = Pr(A|C).
The proof can be found in (Hoff 2009) Section 2.3.
Dirac measure
(Murphy, 2012) Equation 2.41.
Covariance and correlation
If random variables A and B satisfy Cov(A, B)=0, then A and B are uncorrelated.
Independence implies uncorrelatedness, but not vice versa: uncorrelated does not mean independent.
See (Murphy, 2012) Section 2.5.1.
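A standard counterexample shows that zero covariance does not imply independence: let X be uniform on {-1, 0, 1} and Y = X^2. The sketch below (variable names are chosen for this note) computes the covariance exactly and then compares a joint probability with the product of marginals.

```python
from fractions import Fraction

SUPPORT = [-1, 0, 1]  # X is uniform on this set; Y = X^2
P = Fraction(1, 3)

def expect(g):
    """Exact expectation E[g(X)]."""
    return sum(P * g(x) for x in SUPPORT)

E_X = expect(lambda x: x)
E_Y = expect(lambda x: x * x)
COV = expect(lambda x: x * x * x) - E_X * E_Y  # E[XY] - E[X] E[Y], with XY = X^3

P_X1_AND_Y1 = P   # {X = 1} already implies {Y = 1}
P_X1 = P
P_Y1 = 2 * P      # Y = 1 when X = -1 or X = 1
```

Here `COV` is exactly 0, yet `P_X1_AND_Y1` is 1/3 while `P_X1 * P_Y1` is 2/9, so X and Y are uncorrelated but dependent.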