Cost Function and Gradient Descent

chum · Oct 25, 2020

The best fit line through a scatter plot can be described by the following equation:

y = mx + b

We quantify the accuracy of a regression line by squaring all of the errors (to eliminate negative values) and adding these squares together to get our residual sum of squares (RSS). With a number that describes the line’s accuracy (or goodness of fit), we iteratively try new regression lines by adjusting our slope, m, and y-intercept, b, and then comparing the resulting RSS values. By finding the values of m and b that minimize RSS, we can find the best fit line.
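As a concrete sketch, here is one way to compute the RSS for a candidate line in Python. The x and y arrays and the candidate values m = 2, b = 1 are made up purely for illustration:

```python
import numpy as np

# Toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.2, 4.9, 7.1, 9.2, 10.8])

def rss(m, b, x, y):
    """Residual sum of squares for the candidate line y_hat = m*x + b."""
    residuals = y - (m * x + b)    # error at each point
    return np.sum(residuals ** 2)  # square to drop the signs, then add up

print(rss(2.0, 1.0, x, y))  # goodness of fit for one candidate line
```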

We can use either the RSS or the root mean squared error (RMSE) to calculate the accuracy of a line. Once the accuracy of a line is calculated, we improve the line by minimizing the RSS. This is the task of the gradient descent technique.
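If you prefer RMSE, it is just the square root of the average squared error, so it can be built directly on the rss helper sketched above:

```python
def rmse(m, b, x, y):
    """Root mean squared error: square root of the mean squared residual."""
    return np.sqrt(rss(m, b, x, y) / len(x))
```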

RSS curve and how it’s related to Machine Learning Models

A more generalized name for the RSS curve is the cost or loss function curve. It tells us the error between the actual and predicted values of our y = mx + b best fit line and is thereby used for optimizing machine learning models. Essentially, the cost function tells us how good our machine learning model is at making predictions for a given value of m (slope) and b (y-intercept).
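One way to see this cost curve is to hold b fixed and evaluate the cost for a sweep of m values. This sketch reuses the toy data and rss helper from above; the fixed b = 1.0 and the range of slopes are arbitrary choices for illustration:

```python
# Trace the cost curve: RSS as a function of the slope m, with b held fixed
m_values = np.linspace(0.0, 4.0, 41)
cost_curve = [rss(m, 1.0, x, y) for m in m_values]

best_m = m_values[np.argmin(cost_curve)]
print(best_m)  # the slope on this grid with the lowest RSS
```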

Between two candidate slope values, you would choose the one whose point on the cost curve sits closer to the minimum RSS error. With the goal being to find the slope that yields the minimum RSS/cost/loss value, the slope nearer the minimum gives us a lower error between our actual and predicted values.

Understanding Step Size

You can incrementally shrink your step size as you approach the minimum by using the partial derivatives of the cost function with respect to m and b to find your next iteration of m and b values. Taking the partial derivative with respect to m gives us the slope of the line tangent to our point on the curve. The closer we are to the minimum, the less steep that tangent line becomes (it has a lower slope). So, with our m partial derivative getting smaller and smaller with each iteration, we are moving our model’s m value less and less with each iteration.
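For RSS = sum((y - (m*x + b))^2), the partial derivatives with respect to m and b work out to the expressions below. This sketch reuses the toy arrays from earlier:

```python
def gradients(m, b, x, y):
    """Partial derivatives of RSS with respect to m and b."""
    residuals = y - (m * x + b)
    d_m = -2 * np.sum(x * residuals)  # dRSS/dm: tangent slope along the m axis
    d_b = -2 * np.sum(residuals)      # dRSS/db: tangent slope along the b axis
    return d_m, d_b
```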

Purpose of a Learning Rate

The size of the steps taken to reach the minimum in our gradient descent is controlled with the learning rate. A very large learning rate will cover more ground on each step toward the minimum; however, you risk overshooting the minimum and increasing your error value. A very small learning rate would make the descent to the minimum too slow, especially if you are far from the minimum, and you may never reach the optimized value.
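In code, the learning rate is simply the multiplier applied to each partial derivative when taking a step. The value 0.01 and the starting line m = 0, b = 0 are assumptions for illustration, reusing the gradients helper above:

```python
learning_rate = 0.01  # assumed value: too large overshoots, too small crawls

m, b = 0.0, 0.0                  # an arbitrary starting regression line
d_m, d_b = gradients(m, b, x, y)
m = m - learning_rate * d_m      # step against the gradient
b = b - learning_rate * d_b
```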

Remember that gradient descent works by starting at a regression line with values m and b, which corresponds to a point on our cost curve. Then we alter our m or b value (say, the b value) by looking at the slope of the cost curve at that point. Then we look at the slope of the cost curve at the new b value to indicate the size and direction of the next step.

Gradient descent allows our function to improve to a regression line that better matches our data. We see how to change our regression line by looking at the Residual Sum of Squares related to the current regression line. We update our regression line by looking at the rate of change of our RSS as we adjust our regression line in the right direction, that is, the slope of our cost curve. The larger the magnitude of our rate of change (or slope of our cost curve), the larger our step size. This way, we take larger steps the further away we are from minimizing our RSS, and take smaller steps as we converge toward our minimum RSS.
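Putting the pieces together, a minimal gradient descent loop, under the same assumptions and helpers as the sketches above, might look like this:

```python
def gradient_descent(x, y, learning_rate=0.01, n_steps=1000):
    """Iteratively adjust m and b to shrink the RSS of the line y_hat = m*x + b."""
    m, b = 0.0, 0.0
    for _ in range(n_steps):
        d_m, d_b = gradients(m, b, x, y)
        m -= learning_rate * d_m   # larger gradient -> larger step
        b -= learning_rate * d_b   # steps shrink as we approach the minimum
    return m, b

m, b = gradient_descent(x, y)
print(m, b, rss(m, b, x, y))  # a line close to the best fit, and its RSS
```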
