Overfitting, Underfitting, and Regularization

Original Source: https://www.coursera.org/learn/machine-learning

Underfitting and Overfitting

When the form of our hypothesis function h maps poorly to the trend of the data, we say that our hypothesis is underfitting or has high bias. It is usually caused by a function that is too simple or uses too few features.

Overfitting or high variance is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

[Figure: underfitting and overfitting]

Why Regularization?

We can prevent overfitting by reducing the number of features or by applying regularization.

Regularization keeps all the features but makes the parameters $\theta_j$ smaller. It reduces the impact of each feature on the predicted output, which smooths the regression curve or decision boundary.

Regularization works well when we have a lot of slightly useful features.

Regularized Linear Regression

Cost Function

We penalize large values of theta by adding a regularization term to the cost function.

\[J(\vec{\theta}) = \frac{1}{2m}\Big((h_\vec{\theta}(X)-\vec{y})^T(h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}^T\vec{\theta}\Big)\]

If the entries of theta grow, the regularization term $\lambda\vec{\theta}^T\vec{\theta}$ grows, so the cost $J$ increases. Since we minimize the cost, the resulting theta values are smaller than in non-regularized regression.

$\lambda$ is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
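As a concrete illustration, here is a minimal NumPy sketch of this cost (the function and variable names are illustrative; note that, exactly as the vectorized formula above is written, the penalty $\lambda\vec{\theta}^T\vec{\theta}$ includes the bias term $\theta_0$):

```python
import numpy as np

def regularized_linear_cost(theta, X, y, lam):
    """Vectorized regularized cost J(theta) for linear regression.

    X     : (m, n+1) design matrix whose first column is all ones
    y     : (m,) vector of targets
    theta : (n+1,) parameter vector
    lam   : regularization parameter lambda
    """
    m = len(y)
    residual = X @ theta - y                          # h_theta(X) - y
    return (residual @ residual + lam * theta @ theta) / (2 * m)
```

With `lam = 0` this reduces to the ordinary least-squares cost; larger values of `lam` make large parameters more expensive.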

Gradient Descent

\[\vec{\theta} := \vec{\theta} - \frac{\alpha}{m}\Big(X^T (h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}\Big)\]
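A sketch of this update as a loop; the learning rate `alpha` and the iteration count are assumptions you would tune:

```python
def gradient_descent_reg(X, y, theta, alpha, lam, num_iters):
    """Repeatedly apply the regularized gradient-descent update."""
    m = len(y)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) + lam * theta
        theta = theta - (alpha / m) * gradient
    return theta
```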

Normal Equation

\[\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}\]
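A sketch of this closed-form solution; `np.linalg.solve` is used instead of an explicit inverse, and the zero in the top-left of $L$ means the intercept $\theta_0$ is not penalized:

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """Closed-form solution theta = (X^T X + lam * L)^(-1) X^T y."""
    L = np.eye(X.shape[1])            # (n+1) x (n+1) identity...
    L[0, 0] = 0                       # ...with 0 for the bias term theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```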

Regularized Logistic Regression

Cost Function

\[J(\vec{\theta})=-\frac{1}{m}\Big(\vec{y}^T\log\big(h_\vec{\theta}(X)\big)+(1-\vec{y})^T\log\big(1-h_\vec{\theta}(X)\big)\Big)+\frac{\lambda}{2m}\vec{\theta}^T\vec{\theta}\]
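Under the same conventions, a sketch of this cost in NumPy (the `sigmoid` helper and the function names are assumptions):

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """Vectorized regularized cross-entropy cost for logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)                            # h_theta(X), values in (0, 1)
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return cross_entropy + lam * (theta @ theta) / (2 * m)
```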

Gradient Descent

\[\vec{\theta} := \vec{\theta} - \frac{\alpha}{m} \Big(X^T(h_\vec{\theta}(X)-\vec{y})+\lambda\vec{\theta}\Big)\]
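The update loop is the same as for regularized linear regression, with the sigmoid hypothesis substituted in (this reuses the `sigmoid` helper sketched above):

```python
def logistic_gradient_descent_reg(X, y, theta, alpha, lam, num_iters):
    """Regularized gradient descent for logistic regression."""
    m = len(y)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                        # h_theta(X)
        theta = theta - (alpha / m) * (X.T @ (h - y) + lam * theta)
    return theta
```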
