Regularized Linear Regression and Logistic Regression Cheatsheet

Original Source: https://www.coursera.org/learn/machine-learning

Notations

I will represent a scalar as $x$, a vector as $\vec{x}$, and a matrix as $X$.

Basic

$m$: the number of training examples

$n$: the number of features

$k$: the number of output classes

$\alpha$: learning rate

$\lambda$: regularization parameter

$X$: the input matrix, where each row is a training example and each column is a feature. Note that the first column $X_1$ is a vector of all 1s (the bias/intercept column).

$X^{(i)}$: the row vector of feature values for the $i$th training example

$X_j$: the column vector of the $j$th feature across all training examples

$X_j^{(i)}$: the value of feature $j$ in the $i$th training example

For Linear Regression and Binary Classification

$\vec{y}$: the output column vector, where each row corresponds to a training example.

$\vec{\theta}$: column vector of weights

For Binary Classification

$\vec{z}$: the output vector of the regression step, which is also the input to the sigmoid function.

For Multiclass Classification

$Y$: the output matrix, where each row corresponds to a training example and each column to a class.

$Z$: the output matrix of the regression step, which is also the input to the softmax function.

$\Theta$: an $n \times k$ matrix of weights
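
The short NumPy sketches in the sections below follow this notation. As a minimal, hypothetical setup (all names and sizes are illustrative, not from the source), the data might be laid out like this:

```python
import numpy as np

# Hypothetical toy data following the notation above: m = 4 examples, n = 2 features.
rng = np.random.default_rng(0)
m, n = 4, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # first column X_1 is all 1s
y = rng.integers(0, 2, size=(m, 1)).astype(float)          # output column vector
theta = np.zeros((n + 1, 1))                                # weight column vector
```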

Regularized Linear Regression

Hypothesis

\[h_\vec{\theta}(X)=X\vec{\theta}\]

Cost

\[J(\vec{\theta}) = \frac{1}{2m}\Big((h_\vec{\theta}(X)-\vec{y})^T(h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}^T\vec{\theta}\Big)\]
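
A minimal NumPy sketch of this cost, assuming $X$ already carries the leading column of 1s. Note that, exactly as in the formula above, it regularizes every component of $\vec{\theta}$, including the bias weight:

```python
import numpy as np

def linreg_cost(X, y, theta, lam):
    """Vectorized regularized linear regression cost J(theta)."""
    m = X.shape[0]
    residual = X @ theta - y                      # h_theta(X) - y
    data_term = (residual.T @ residual).item()    # (h - y)^T (h - y)
    reg_term = lam * (theta.T @ theta).item()     # lambda * theta^T theta
    return (data_term + reg_term) / (2 * m)
```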

Gradient Descent

\[\vec{\theta} := \vec{\theta} - \frac{\alpha}{m}\Big(X^T (h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}\Big)\]
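
A corresponding update loop might look like the sketch below; the fixed iteration count is an illustrative choice, not part of the formula:

```python
import numpy as np

def linreg_gradient_descent(X, y, theta, alpha, lam, num_iters):
    """Repeat theta := theta - (alpha/m) * (X^T (X theta - y) + lam * theta)."""
    m = X.shape[0]
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) + lam * theta
        theta = theta - (alpha / m) * grad
    return theta
```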

Regularized Logistic Regression

Binary Classification

Hypothesis

\[h_{\vec{\theta}}(X) = g(X\vec{\theta})\] \[g(\vec{z}) = \frac{1}{1 + e^{-\vec{z}}}\]
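
A sketch of the hypothesis, with the sigmoid applied elementwise to $X\vec{\theta}$:

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logreg_hypothesis(X, theta):
    """h_theta(X) = g(X theta): a column vector of predicted probabilities."""
    return sigmoid(X @ theta)
```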

Cost

\[J(\vec{\theta})=-\frac{1}{m}\Big(\vec{y}^T\log\big(h_\vec{\theta}(X)\big)+(1-\vec{y})^T\log\big(1-h_\vec{\theta}(X)\big)\Big)+\frac{\lambda}{2m}\vec{\theta}^T\vec{\theta}\]
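
A sketch of this cost under the same assumptions as before; the small clipping constant only guards against $\log(0)$ and is not part of the formula:

```python
import numpy as np

def logreg_cost(X, y, theta, lam, eps=1e-12):
    """Regularized binary cross-entropy cost J(theta)."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))       # h_theta(X)
    h = np.clip(h, eps, 1.0 - eps)               # avoid log(0); numerical guard only
    data_term = (y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)).item()
    reg_term = (lam / (2 * m)) * (theta.T @ theta).item()
    return -data_term / m + reg_term
```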

Gradient Descent

\[\vec{\theta} := \vec{\theta} - \frac{\alpha}{m}\Big(X^T (h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}\Big)\]
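
The update loop has the same shape as in the linear case, with the sigmoid hypothesis swapped in:

```python
import numpy as np

def logreg_gradient_descent(X, y, theta, alpha, lam, num_iters):
    """Repeat theta := theta - (alpha/m) * (X^T (g(X theta) - y) + lam * theta)."""
    m = X.shape[0]
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(X)
        grad = X.T @ (h - y) + lam * theta
        theta = theta - (alpha / m) * grad
    return theta
```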

Multiclass Classification

Hypothesis

\[h_\Theta(X) = g(X\Theta)\] \[g(Z^{(i)}) = \frac{e^{Z^{(i)}}}{\sum e^{Z^{(i)}}}\]

where $g$ is applied to each row $Z^{(i)}$ and $\sum \vec{a}$ denotes the sum of all elements of the vector $\vec{a}$.
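
A row-wise softmax sketch; subtracting the row maximum is a standard numerical-stability step that the formula itself does not require:

```python
import numpy as np

def softmax_rows(Z):
    """Apply g to each row Z^(i): exponentiate, then divide by the row sum."""
    Z = Z - Z.max(axis=1, keepdims=True)        # shift-invariant, improves stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def multiclass_hypothesis(X, Theta):
    """h_Theta(X) = g(X Theta): an m x k matrix of class probabilities."""
    return softmax_rows(X @ Theta)
```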

Cost

\[J(\Theta)=-\frac{1}{m}\sum_{i=1}^m Y^{(i)}\log\big(h_\Theta(X^{(i)})\big)^T+\frac{\lambda}{2m}\sum_{j=1}^k\Theta_j^T\Theta_j\]

where $\Theta_j$ is the $j$th column of $\Theta$, i.e. the weight vector for class $j$.
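
A sketch of this cost; `eps` only guards against $\log(0)$, and the regularization term sums over every entry of $\Theta$, which equals $\sum_j \Theta_j^T\Theta_j$:

```python
import numpy as np

def multiclass_cost(X, Y, Theta, lam, eps=1e-12):
    """Regularized softmax cross-entropy cost J(Theta)."""
    m = X.shape[0]
    Z = X @ Theta
    H = np.exp(Z - Z.max(axis=1, keepdims=True))
    H = H / H.sum(axis=1, keepdims=True)                   # h_Theta(X), m x k
    data_term = np.sum(Y * np.log(np.clip(H, eps, 1.0)))   # sum_i Y^(i) log(h_Theta(X^(i)))^T
    reg_term = (lam / (2 * m)) * np.sum(Theta ** 2)        # sum_j Theta_j^T Theta_j
    return -data_term / m + reg_term
```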

Gradient Descent

\[\Theta := \Theta - \frac{\alpha}{m} \Big(X^T(h_\Theta(X)-Y)+\lambda\Theta\Big)\]
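
And the corresponding update loop, again with an illustrative fixed iteration count:

```python
import numpy as np

def multiclass_gradient_descent(X, Y, Theta, alpha, lam, num_iters):
    """Repeat Theta := Theta - (alpha/m) * (X^T (h_Theta(X) - Y) + lam * Theta)."""
    m = X.shape[0]
    for _ in range(num_iters):
        Z = X @ Theta
        H = np.exp(Z - Z.max(axis=1, keepdims=True))
        H = H / H.sum(axis=1, keepdims=True)       # h_Theta(X)
        grad = X.T @ (H - Y) + lam * Theta
        Theta = Theta - (alpha / m) * grad
    return Theta
```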
