Regularized Linear Regression and Logistic Regression Cheatsheet

Original Source: https://www.coursera.org/learn/machine-learning

Notations

I will represent a scalar as $x$, a vector as $\vec{x}$, and a matrix as $X$.

Basic

$m$: the number of training examples

$n$: the number of features

$k$: the number of output classes

$\alpha$: learning rate

$\lambda$: regularization parameter

$X$: the input matrix, where each row is a training example and each column is a feature. Note that the first column $X_1$ is a vector of all 1s (the bias/intercept column).

$X^{(i)}$: the row vector of feature values for the $i$th training example

$X_j$: the column vector of the $j$th feature across all training examples

$X_j^{(i)}$: the value of feature $j$ in the $i$th training example

For Linear Regression and Binary Classification

$\vec{y}$: the output column vector, where each row corresponds to a training example.

$\vec{\theta}$: column vector of weights

For Binary Classification

$\vec{z}$: the output vector of the regression step, which is also the input to the sigmoid function.

For Multiclass Classification

$Y$: the output matrix, where each row corresponds to a training example and each column to a class.

$Z$: the output matrix of the regression step, which is also the input to the softmax function.

$\Theta$: an $n \times k$ matrix of weights
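
The short NumPy sketches in the sections below follow this notation. As a minimal, hypothetical setup (all names and sizes are illustrative, not from the source), the data might be laid out like this:

```python
import numpy as np

# Hypothetical toy data following the notation above: m = 4 examples, n = 2 features.
rng = np.random.default_rng(0)
m, n = 4, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # first column X_1 is all 1s
y = rng.integers(0, 2, size=(m, 1)).astype(float)          # output column vector
theta = np.zeros((n + 1, 1))                                # weight column vector
```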

Regularized Linear Regression

Hypothesis

\[h_\vec{\theta}(X)=X\vec{\theta}\]

Cost

\[J(\vec{\theta}) = \frac{1}{2m}\Big((h_\vec{\theta}(X)-\vec{y})^T(h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}^T\vec{\theta}\Big)\]
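
A minimal NumPy sketch of this cost, assuming $X$ already carries the leading column of 1s. Note that, exactly as in the formula above, it regularizes every component of $\vec{\theta}$, including the bias weight:

```python
import numpy as np

def linreg_cost(X, y, theta, lam):
    """Vectorized regularized linear regression cost J(theta)."""
    m = X.shape[0]
    residual = X @ theta - y                      # h_theta(X) - y
    data_term = (residual.T @ residual).item()    # (h - y)^T (h - y)
    reg_term = lam * (theta.T @ theta).item()     # lambda * theta^T theta
    return (data_term + reg_term) / (2 * m)
```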

Gradient Descent

\[\vec{\theta} := \vec{\theta} - \frac{\alpha}{m}\Big(X^T (h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}\Big)\]
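
A corresponding update loop might look like the sketch below; the fixed iteration count is an illustrative choice, not part of the formula:

```python
import numpy as np

def linreg_gradient_descent(X, y, theta, alpha, lam, num_iters):
    """Repeat theta := theta - (alpha/m) * (X^T (X theta - y) + lam * theta)."""
    m = X.shape[0]
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) + lam * theta
        theta = theta - (alpha / m) * grad
    return theta
```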

Regularized Logistic Regression

Binary Classification

Hypothesis

\[h_{\vec{\theta}}(X) = g(X\vec{\theta})\] \[g(\vec{z}) = \frac{1}{1 + e^{-\vec{z}}}\]
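
A sketch of the hypothesis, with the sigmoid applied elementwise to $X\vec{\theta}$:

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logreg_hypothesis(X, theta):
    """h_theta(X) = g(X theta): a column vector of predicted probabilities."""
    return sigmoid(X @ theta)
```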

Cost

\[J(\vec{\theta})=-\frac{1}{m}\Big(\vec{y}^T\log\big(h_\vec{\theta}(X)\big)+(1-\vec{y})^T\log\big(1-h_\vec{\theta}(X)\big)\Big)+\frac{\lambda}{2m}\vec{\theta}^T\vec{\theta}\]
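
A sketch of this cost under the same assumptions as before; the small clipping constant only guards against $\log(0)$ and is not part of the formula:

```python
import numpy as np

def logreg_cost(X, y, theta, lam, eps=1e-12):
    """Regularized binary cross-entropy cost J(theta)."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))       # h_theta(X)
    h = np.clip(h, eps, 1.0 - eps)               # avoid log(0); numerical guard only
    data_term = (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)).item()
    reg_term = (lam / (2 * m)) * (theta.T @ theta).item()
    return -data_term / m + reg_term
```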

Gradient Descent

\[\vec{\theta} := \vec{\theta} - \frac{\alpha}{m}\Big(X^T (h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}\Big)\]
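
The update loop has the same shape as in the linear case, with the sigmoid hypothesis swapped in:

```python
import numpy as np

def logreg_gradient_descent(X, y, theta, alpha, lam, num_iters):
    """Repeat theta := theta - (alpha/m) * (X^T (g(X theta) - y) + lam * theta)."""
    m = X.shape[0]
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(X)
        grad = X.T @ (h - y) + lam * theta
        theta = theta - (alpha / m) * grad
    return theta
```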

Multiclass Classification

Hypothesis

\[h_\Theta(X) = g(X\Theta)\] \[g(Z^{(i)}) = \frac{e^{Z^{(i)}}}{\sum e^{Z^{(i)}}}\]

where $g$ is applied to each row $Z^{(i)}$ and $\sum \vec{a}$ denotes the sum of all elements of the vector $\vec{a}$.
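
A row-wise softmax sketch; subtracting the row maximum is a standard numerical-stability step that the formula itself does not require:

```python
import numpy as np

def softmax_rows(Z):
    """Apply g to each row Z^(i): exponentiate, then divide by the row sum."""
    Z = Z - Z.max(axis=1, keepdims=True)        # shift-invariant, improves stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def multiclass_hypothesis(X, Theta):
    """h_Theta(X) = g(X Theta): an m x k matrix of class probabilities."""
    return softmax_rows(X @ Theta)
```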

Cost

\[J(\Theta)=-\frac{1}{m}\sum_{i=1}^m Y^{(i)}\log\big(h_\Theta(X^{(i)})\big)^T+\frac{\lambda}{2m}\sum_{j=1}^k\Theta_j^T\Theta_j\]

where $\Theta_j$ is the $j$th column of $\Theta$, i.e. the weight vector for class $j$.
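
A sketch of this cost; `eps` only guards against $\log(0)$, and the regularization term sums over every entry of $\Theta$, which equals $\sum_j \Theta_j^T\Theta_j$:

```python
import numpy as np

def multiclass_cost(X, Y, Theta, lam, eps=1e-12):
    """Regularized softmax cross-entropy cost J(Theta)."""
    m = X.shape[0]
    Z = X @ Theta
    H = np.exp(Z - Z.max(axis=1, keepdims=True))
    H = H / H.sum(axis=1, keepdims=True)                   # h_Theta(X), m x k
    data_term = np.sum(Y * np.log(np.clip(H, eps, 1.0)))   # sum_i Y^(i) log(h_Theta(X^(i)))^T
    reg_term = (lam / (2 * m)) * np.sum(Theta ** 2)        # sum_j Theta_j^T Theta_j
    return -data_term / m + reg_term
```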

Gradient Descent

\[\Theta := \Theta - \frac{\alpha}{m} \Big(X^T(h_\Theta(X)-Y)+\lambda\Theta\Big)\]
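
And the corresponding update loop, again with an illustrative fixed iteration count:

```python
import numpy as np

def multiclass_gradient_descent(X, Y, Theta, alpha, lam, num_iters):
    """Repeat Theta := Theta - (alpha/m) * (X^T (h_Theta(X) - Y) + lam * Theta)."""
    m = X.shape[0]
    for _ in range(num_iters):
        Z = X @ Theta
        H = np.exp(Z - Z.max(axis=1, keepdims=True))
        H = H / H.sum(axis=1, keepdims=True)       # h_Theta(X)
        grad = X.T @ (H - Y) + lam * Theta
        Theta = Theta - (alpha / m) * grad
    return Theta
```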
