Multivariate Linear Regression

Original Source: https://www.coursera.org/learn/machine-learning

Multivariate linear regression is linear regression with multiple input variables (features).

Notations

I will represent a scalar as $x$, a vector as $\vec{x}$, and a matrix as $X$.

Basics

$m$: the number of training examples

$n$: the number of features

$k$: the number of output classes

$\alpha$: learning rate

$X$: the input matrix, where each row is a training example and each column is a feature. Note that the first column $X_0$ is a vector composed only of 1s.

$X^{(i)}$: the row vector of all the feature inputs of the $i$th training example

$X_j$: the column vector of the $j$th feature across all training examples

$X_j^{(i)}$: the value of feature $j$ in the $i$th training example

$\vec{y}$: the output column vector, where each row is the output for the corresponding training example

$\vec{\theta}$: the column vector of weights

Hypothesis

First we will assume there is only one training example $x$.

\[h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n\]

The equation below is equivalent to the one above, with the convention that $x_0 = 1$.

\[\begin{align*}h_\theta(x) =\begin{bmatrix}x_0 \hspace{2em} x_1 \hspace{2em} ... \hspace{2em} x_n\end{bmatrix}\begin{bmatrix}\theta_0 \newline \theta_1 \newline \vdots \newline \theta_n\end{bmatrix}= x\theta\end{align*}\]
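As a quick illustration (not part of the original notes), here is a minimal NumPy sketch of the single-example hypothesis. The values of `x` and `theta` are made up, and `x[0]` plays the role of $x_0 = 1$.

```python
import numpy as np

# Hypothetical single example with x_0 = 1 (bias term) and two features.
x = np.array([1.0, 2.0, 3.0])        # row vector x, shape (n+1,)
theta = np.array([0.5, -1.0, 2.0])   # parameter vector, shape (n+1,)

# h_theta(x) = x . theta
h = x @ theta
print(h)  # 4.5
```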

Vectorized Version

So far we have written the hypothesis for a single training example. By 'vectorizing' this computation we can handle all training examples at the same time, which saves a tremendous amount of time and resources in computation.

Think of vectorizing as stacking the training examples from top to bottom, so they are stored row-wise in the matrix $X$. The thetas are likewise stacked vertically into a vector.

\[\begin{align*}X = \begin{bmatrix}X^{(1)}_0 & X^{(1)}_1 \newline X^{(2)}_0 & X^{(2)}_1 \newline X^{(3)}_0 & X^{(3)}_1 \end{bmatrix}&,\vec{\theta} = \begin{bmatrix}\vec{\theta}_0 \newline \vec{\theta}_1 \newline\end{bmatrix}\end{align*}\]

Then the generalized hypothesis is as follows.

\[h_\vec{\theta}(X)=X\vec{\theta}\]
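As a rough sketch (assuming NumPy, made-up data, and that the bias column of 1s still has to be prepended), the vectorized hypothesis is a single matrix-vector product:

```python
import numpy as np

# Hypothetical data: m = 3 examples, one raw feature each.
X_raw = np.array([[2.0], [4.0], [6.0]])               # shape (m, n)
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # prepend X_0 = 1 -> shape (m, n+1)
theta = np.array([1.0, 0.5])                          # shape (n+1,)

# Vectorized hypothesis: one prediction per training example.
h = X @ theta
print(h)  # [2. 3. 4.]
```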

Cost Function

\[J(\vec{\theta}) = \dfrac {1}{2m} \displaystyle \sum_{i=1}^m \left (h_\vec{\theta} (X^{(i)}) - \vec{y}^{(i)} \right)^2\]
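A minimal sketch of this summation in NumPy, looping over the training examples (the function name `cost_loop` and its arguments are illustrative, and `X` is assumed to already contain the column of 1s):

```python
import numpy as np

def cost_loop(X, theta, y):
    """J(theta) computed example by example, following the summation above."""
    m = X.shape[0]
    total = 0.0
    for i in range(m):
        error = X[i] @ theta - y[i]   # h_theta(X^(i)) - y^(i)
        total += error ** 2
    return total / (2 * m)
```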

Vectorized Version

\[J(\vec{\theta}) = \frac{1}{2m}\Big((h_\vec{\theta}(X)-\vec{y})^T(h_\vec{\theta}(X)-\vec{y})\Big)\]

where $\vec{y}$ denotes the vector of all y values.
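The same cost as a sketch without the explicit loop, assuming the hypothetical `X`, `theta`, and `y` arrays from above:

```python
import numpy as np

def cost_vectorized(X, theta, y):
    """Vectorized cost: (1/2m) * (X@theta - y)^T (X@theta - y)."""
    m = X.shape[0]
    error = X @ theta - y
    return (error @ error) / (2 * m)
```

For the same inputs this returns the same value as the loop version, just without the Python-level loop.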

Gradient Descent

\[\vec{\theta}_j := \vec{\theta}_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} \left(h_\vec{\theta}(X^{(i)}) - \vec{y}^{(i)}\right) \cdot X_j^{(i)} \qquad \text{for } j := 0 \ldots n\]

(All $\vec{\theta}_j$ are updated simultaneously.)
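As a sketch of one such step (assuming NumPy; the function name is hypothetical), the errors are computed from the current $\vec{\theta}$ before any component changes, which is what keeps the update simultaneous:

```python
import numpy as np

def gradient_descent_step(X, theta, y, alpha):
    """One gradient descent step; every theta_j is updated from the same errors."""
    m = X.shape[0]
    errors = X @ theta - y               # h_theta(X^(i)) - y^(i) for all i
    new_theta = theta.copy()
    for j in range(theta.shape[0]):
        new_theta[j] = theta[j] - (alpha / m) * np.sum(errors * X[:, j])
    return new_theta
```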

Vectorized Version

The vectorized version of gradient descent is:

\[\vec{\theta} := \vec{\theta} - \alpha \nabla J(\vec{\theta})\]

where

\[\nabla J(\vec{\theta}) = \begin{bmatrix}\frac{\partial J(\vec{\theta})}{\partial \vec{\theta}_0} \newline \frac{\partial J(\vec{\theta})}{\partial \vec{\theta}_1} \newline \vdots \newline \frac{\partial J(\vec{\theta})}{\partial \vec{\theta}_n} \end{bmatrix}\]

As we saw in univariate linear regression, we calculate the partial derivative for each $\vec{\theta}_j$ as follows.

\[\begin{align*} \; &\frac{\partial J(\vec{\theta})}{\partial \vec{\theta}_j} = \frac{1}{m} \sum\limits_{i=1}^{m} X_j^{(i)} \cdot \left(h_\vec{\theta}(X^{(i)}) - \vec{y}^{(i)} \right) \end{align*}\]

We can vectorize the above equation.

\[\begin{align*}\; &\frac{\partial J(\vec{\theta})}{\partial \vec{\theta}_j} = \frac1m {X_j}^T (X\vec{\theta} - \vec{y})\end{align*}\]

Then

\[\begin{align*}\; &\nabla J(\vec{\theta}) = \frac 1m X^{T} (X\vec{\theta} - \vec{y}) \newline\end{align*}\]

In conclusion, the vectorized gradient descent rule is:

\[\vec{\theta} := \vec{\theta} - \frac{\alpha}{m}\Big(X^T (h_\vec{\theta}(X)-\vec{y})\Big)\]
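Putting it all together, here is a minimal sketch of the full loop under the usual assumptions (NumPy, `X` already contains the bias column of 1s, and `alpha`/`num_iters` are made-up defaults):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Repeatedly apply theta := theta - (alpha/m) * X^T (X@theta - y)."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta
```

If $\alpha$ is too large this update can diverge, so in practice it helps to monitor $J(\vec{\theta})$ across iterations while tuning the learning rate.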
