Random Variable
measurable function $X: \Omega \rightarrow E$ from a set of possible outcomes $\Omega$ to a measurable space $E$
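As a toy illustration (my own example, not from the notes): for a fair six-sided die, a random variable is just a function from outcomes to values, and it inherits a distribution from the distribution on $\Omega$.

```python
# A minimal sketch (illustrative only): a random variable as a function from
# outcomes to values. Omega is the sample space of a fair six-sided die and
# X is the indicator "the roll is even".
from collections import Counter

omega = {1, 2, 3, 4, 5, 6}                   # possible outcomes

def X(outcome):
    return int(outcome % 2 == 0)             # map Omega -> {0, 1}

# Distribution of X induced by the uniform distribution on Omega.
p_X = {value: count / len(omega)
       for value, count in Counter(X(w) for w in omega).items()}
print(p_X)                                   # both values get probability 0.5
```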
Joint distribution
$P(x,y)$: the probability that two (or more) random variables take particular values simultaneously.
Event
A subset of the sample space $\Omega$ to which a probability is assigned.
Marginal distribution
The distribution of a subset of the variables, obtained by summing (or integrating) the joint over the others: $P(x)=\sum_y P(x,y)$.
Conditional probability
The probability of one event given that another has occurred: $P(x|y)=\frac{P(x,y)}{P(y)}$, defined when $P(y)>0$.
Probabilistic inference
Computing a desired probability (e.g. a posterior or a marginal) from known probabilities, typically a joint distribution plus observed evidence.
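A tiny worked example ties these terms together. The numbers and variable names (Rain, Traffic) below are made up for illustration; the point is that marginals come from summing the joint and conditionals from dividing by a marginal.

```python
# A minimal sketch with made-up numbers: joint, marginal, and conditional
# probabilities for two binary variables, Rain and Traffic.
joint = {                       # P(rain, traffic)
    (True, True): 0.20, (True, False): 0.05,
    (False, True): 0.15, (False, False): 0.60,
}

# Marginal: sum the joint over the variable we don't care about.
p_rain = sum(p for (rain, _), p in joint.items() if rain)        # P(rain) = 0.25

# Conditional: P(traffic | rain) = P(rain, traffic) / P(rain).
p_traffic_given_rain = joint[(True, True)] / p_rain              # = 0.8

print(p_rain, p_traffic_given_rain)
```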
Product rule
$P(x|y)=\frac{P(x,y)}{P(y)} \iff P(x,y)=P(x|y)P(y)$
Chain rule
$P(x_1,x_2,...,x_n)=P(x_1)P(x_2|x_1)...P(x_n|x_{n-1},...,x_1)=\prod_iP(x_i|x_{<i})$
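As a sketch of the product and chain rules in action (the joint table below is randomly generated, purely illustrative): each conditional is computed from the joint by summing and dividing, and the product of the conditionals recovers the joint probability.

```python
# A minimal sketch: the chain rule on a randomly generated joint distribution
# over three binary variables (x1, x2, x3).
import itertools
import random

random.seed(0)
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

def prob(**fixed):
    """Marginal probability P(x1=..., x2=..., ...) by summing the joint."""
    idx = {"x1": 0, "x2": 1, "x3": 2}
    return sum(p for o, p in joint.items()
               if all(o[idx[name]] == v for name, v in fixed.items()))

x1, x2, x3 = 1, 0, 1
chain = (prob(x1=x1)                                          # P(x1)
         * prob(x1=x1, x2=x2) / prob(x1=x1)                   # P(x2 | x1)
         * prob(x1=x1, x2=x2, x3=x3) / prob(x1=x1, x2=x2))    # P(x3 | x1, x2)
print(abs(chain - joint[(x1, x2, x3)]) < 1e-12)               # True
```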
Bayes Theorem
$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$
Bayes’ Terminology
$P(e|D)=\frac{P(D|e)P(e)}{P(D)}$
$P(e)$ is called the prior probability of $e$. It's what we know about $e$ before seeing any evidence.
$P(D|e)$ is the conditional probability of $D$ given that $e$ happened, usually just called the likelihood.
$P(e|D)$ is the posterior probability of $e$ given $D$. It's the answer we want, and the quantity we use to choose the best answer.
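A standard illustration (with made-up numbers) is a diagnostic test: the prior is the disease's base rate, the likelihood describes the test's behavior, and the posterior is what we actually want after seeing a positive result.

```python
# A minimal sketch with made-up numbers: Bayes' theorem for a diagnostic test.
# e = "patient has the disease", D = "test came back positive".
p_e = 0.01              # prior P(e)
p_D_given_e = 0.95      # likelihood P(D | e), the test's sensitivity
p_D_given_not_e = 0.05  # false-positive rate P(D | not e)

# Evidence P(D) via the law of total probability.
p_D = p_D_given_e * p_e + p_D_given_not_e * (1 - p_e)

# Posterior P(e | D) = P(D | e) P(e) / P(D).
p_e_given_D = p_D_given_e * p_e / p_D
print(round(p_e_given_D, 3))   # ~0.161: still unlikely despite the positive test
```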
Maximum Likelihood Estimation (MLE)
$\theta_{MLE}=\argmax_\theta P(X|\theta)$
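For a concrete (made-up) case, the MLE of a Bernoulli parameter from coin flips is just the empirical frequency of heads; the sketch below also checks this against a grid search over the log-likelihood.

```python
# A minimal sketch with made-up flips: the MLE of a Bernoulli parameter theta
# is the empirical frequency of heads, which maximizes log P(X | theta).
import math

flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]        # 1 = heads
theta_mle = sum(flips) / len(flips)
print(theta_mle)                              # 0.7

def log_likelihood(theta):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in flips)

# Sanity check: a grid search over theta agrees with the closed-form MLE.
grid = [i / 100 for i in range(1, 100)]
print(max(grid, key=log_likelihood))          # 0.7
```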
Maximum A Posteriori (MAP)
$\theta_{MAP}=\argmax_\theta P(\theta|X)=\argmax_\theta P(X|\theta)P(\theta)$
$=\argmax_\theta[\log P(X|\theta) + \log P(\theta)]$
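Continuing the coin-flip sketch, and assuming a Beta(2, 2) prior on $\theta$ (my choice of prior, not something the notes specify), the MAP estimate has a closed form and is pulled from the MLE toward the prior's mode.

```python
# A minimal sketch: MAP for the same Bernoulli parameter under a Beta(a, b)
# prior on theta. argmax_theta [log P(X | theta) + log P(theta)] has the
# closed form (heads + a - 1) / (n + a + b - 2).
flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
heads, n = sum(flips), len(flips)
a, b = 2.0, 2.0                               # Beta(2, 2): gently favors theta = 0.5
theta_map = (heads + a - 1) / (n + a + b - 2)
print(round(theta_map, 3))                    # 0.667, pulled toward 0.5 from the MLE of 0.7
```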
Conditionally independent
For $X⫫Y|Z$,
$\forall x,y,z : P(x,y|z)=P(x|z)P(y|z)$
$\forall x,y,z:P(x|z,y)=P(x|z)$
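A quick numeric sanity check (with made-up conditionals): if the joint is built as $P(z)P(x|z)P(y|z)$, then the definition $P(x,y|z)=P(x|z)P(y|z)$ holds for every assignment.

```python
# A minimal sketch: build the joint as P(z) * P(x|z) * P(y|z), then verify
# P(x, y | z) == P(x | z) * P(y | z) for all x, y, z.
import itertools

p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.8, 1: 0.2}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}   # p_y_given_z[z][y]

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in itertools.product([0, 1], repeat=3)}

for x, y, z in itertools.product([0, 1], repeat=3):
    p_xy_given_z = joint[(x, y, z)] / p_z[z]
    assert abs(p_xy_given_z - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
print("P(x,y|z) = P(x|z)P(y|z) holds for every x, y, z")
```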
Naive Bayes Classifier
Key Assumption: the features $X_1,...,X_k$ are conditionally independent given the class $c_j$.
$P(X,c_j)=P(X_1,...,X_k|c_j)P(c_j)=P(c_j)\prod^k_{i=1}P(X_i|c_j)$
We want to figure out the most likely class:
$c=\argmax_{c_j}P(c_j|X_1,...,X_k)=\argmax_{c_j}P(c_j)\prod^k_{i=1}P(X_i|c_j)$, since the evidence $P(X_1,...,X_k)$ does not depend on $c_j$.
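A minimal sketch of that decision rule, with invented class priors and per-word probabilities for a toy spam/ham example; the sum of logs is used instead of the product of probabilities to avoid numerical underflow.

```python
# A minimal sketch of the Naive Bayes decision rule with made-up probabilities:
# c = argmax_c P(c) * prod_i P(X_i | c), computed in log space for stability.
import math

p_class = {"spam": 0.4, "ham": 0.6}                       # P(c_j)
p_word_given_class = {                                    # P(X_i = 1 | c_j)
    "spam": {"free": 0.8, "meeting": 0.1, "winner": 0.6},
    "ham":  {"free": 0.1, "meeting": 0.7, "winner": 0.05},
}

def classify(features):
    """features: dict word -> 0/1 indicating whether the word appears."""
    scores = {}
    for c, prior in p_class.items():
        log_score = math.log(prior)
        for word, present in features.items():
            p = p_word_given_class[c][word]
            log_score += math.log(p if present else 1 - p)
        scores[c] = log_score
    return max(scores, key=scores.get)

print(classify({"free": 1, "meeting": 0, "winner": 1}))   # spam
print(classify({"free": 0, "meeting": 1, "winner": 0}))   # ham
```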