Deepak’s blog

Gradient Descent 

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

Batch gradient descent: 

Parameters are updated after computing the gradient of the error with respect to the entire training set.

In code, batch gradient descent looks like this:

for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad

 Stochastic Gradient Descent: 

Parameters are updated after computing the gradient of the error with respect to a single training example.
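In the same pseudocode style as the batch version above (evaluate_gradient, data, params and learning_rate are the same placeholder names, not a real library), a stochastic update loop could look like this, shuffling the data each epoch and updating on one example at a time:

for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad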

Mini-Batch Gradient Descent: 

Parameters are updated after computing the gradient of the error with respect to a subset of the training set.
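And again in the same pseudocode style, a mini-batch loop; get_batches is a hypothetical helper that yields mini-batches (of, say, 50 examples) from the shuffled data:

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):   # get_batches: hypothetical helper, not a real library call
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad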

Now let's look at gradient descent in more detail.

When you venture into machine learning, one of the fundamental things to understand is gradient descent. Gradient descent is the backbone of many machine learning algorithms. In this article I am going to attempt to explain the fundamentals of gradient descent using Python code. Once you get hold of gradient descent, things start to become clearer and it is easier to understand different algorithms. Much has already been written on this topic, so this is not going to be a ground-breaking one. You will need some basic Python packages, viz. numpy and matplotlib, to visualize.

Let us start with some data; even better, let us create some data. We will create linear data with some random Gaussian noise.

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

Next, let's visualize the data.
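A minimal way to do that with matplotlib (assuming the X and y arrays created above) is a simple scatter plot:

import matplotlib.pyplot as plt

plt.scatter(X, y)     # raw data points
plt.xlabel("X")
plt.ylabel("y")
plt.show()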

You may recall that a straight line can be expressed as

y = m*x + b

where m is the slope and b is the intercept. You can then solve for b and m directly with the least-squares formulas:

m = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
b = mean(y) - m * mean(x)
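As a quick sanity check, here is a minimal NumPy sketch of those formulas, using the X and y arrays created above (flattened to 1-D first):

x_flat, y_flat = X.ravel(), y.ravel()

m_hat = ((x_flat - x_flat.mean()) * (y_flat - y_flat.mean())).sum() / ((x_flat - x_flat.mean()) ** 2).sum()
b_hat = y_flat.mean() - m_hat * x_flat.mean()
print(m_hat, b_hat)   # should come out close to 3 and 4, the values used to generate the data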

To explain gradient descent in brief, imagine that you are on a mountain, blindfolded, and your task is to come down from the mountain to the flat land without assistance. The only assistance you have is a gadget which tells you your height above sea level. What would your approach be? You would start to descend in some random direction and then ask the gadget what the height is now. If the new height is more than the initial height, you know you started in the wrong direction. You change direction and repeat the process. This way, after many iterations, you finally descend successfully.

Well, here is the analogy in machine learning terms:

Size of the steps taken in any direction = Learning rate
The gadget telling you the height = Cost function
The direction of your steps = Gradients

Looks simple, but how can we represent this mathematically? Here is the math, using the Mean Squared Error cost function for linear regression:

J(theta) = (1 / (2*m)) * sum over i of (h_theta(x_i) - y_i)^2

where m = number of observations, h_theta(x_i) is the prediction for the i-th example and y_i is its actual value.

I am taking linear regression as the example. You start with a random Theta vector and predict h(Theta), then compute the cost using the above equation, which is the Mean Squared Error. The partial derivative of the cost with respect to Theta is what tells you the Theta for the next iteration.

But what if we have multiple features? Then we have multiple Thetas. Don't worry, here is the generalized form for updating each Theta:

Theta_j := Theta_j - alpha * dJ(Theta)/dTheta_j

where alpha = learning rate. If alpha is too small, gradient descent can be slow. If alpha is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.

For the code, follow this link:
https://github.com/DeepakDeepu123/GradientDescent/blob/master/GradientDescent.ipynb


Linear Regression — Detailed View

Linear regression is used for finding a linear relationship between a target and one or more predictors. There are two types of linear regression: simple and multiple.

Simple Linear Regression: 

Regression analysis is commonly used for modeling the relationship between a single dependent variable Y and one or more predictors. When we have one predictor, we call this "simple" linear regression. It looks for a statistical relationship, not a deterministic one. The relationship between two variables is said to be deterministic if one variable can be accurately expressed by the other; for example, from a temperature in degrees Celsius it is possible to accurately compute the temperature in Kelvin. A statistical relationship is not exact in determining the relationship between two variables; for example, the relationship between height and weight.

Real-time example

Let's say we have a data set with information about years of experience and salary. We can assume a linear relationship between these two continuous variables; the goal is to build a model that can predict the salary given the years of experience (univariate linear regression). This model can then be used to predict on new data. The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total prediction error (over all data points) is as small as possible, where the error is the distance from a point to the regression line.

Y(pred) = b0 + b1*x

The values of b0 and b1 should be chosen so that they minimize the error. If we take the sum of squared errors as the metric, the goal is to obtain the line that makes this error as small as possible, so we can define the cost function for linear regression as:

J(b0, b1) = sum((y_i - (b0 + b1*x_i))^2)

 

For a model with one predictor, the coefficient and the intercept can be calculated as:

b1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)   (co-efficient)
b0 = mean(y) - b1 * mean(x)   (intercept)

Generally,

  • If b1 > 0, the relationship between the two continuous variables is positive, which means that as one increases, the other also increases.
  • If b1 < 0, the relationship is negative, which means that as one increases, the other decreases.
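As a small illustrative sketch (the experience and salary numbers below are made up, not a real data set), scikit-learn's LinearRegression exposes b0 and b1 directly as intercept_ and coef_:

import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: years of experience vs. salary
years = np.array([[1], [2], [3], [5], [7]])
salary = np.array([40000, 46000, 53000, 66000, 80000])

model = LinearRegression().fit(years, salary)
b0, b1 = model.intercept_, model.coef_[0]
print(b0, b1)   # b1 > 0 here, i.e. salary increases with experience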

 the Normal Equation 

To find the value of θ that minimizes the cost function, there is a closed-form solution, in other words a mathematical equation that gives the result directly. This is called the Normal Equation:

theta_best = (X^T . X)^(-1) . X^T . y

  • theta_best is the value of θ that minimizes the cost function
  • X is the matrix of feature values (with a column of 1s added for the bias term)
  • y is the vector of target values

Below is a Python implementation for theta_best using the Normal Equation.
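A minimal sketch of that implementation (assuming X and y are NumPy arrays from a linear data set such as y = 4 + 3*X + noise, and X_b is X with a column of 1s prepended for the bias term):

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]                              # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)   # (X^T X)^-1 X^T y
print(theta_best)                                              # close to [[4], [3]] for this data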

Optimizing using gradient descent 

The Normal Equation computes the inverse of X.T.dot(X), which is an n*n matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3) (depending on the implementation). This complexity makes the Normal Equation difficult to use when there are many features, and this is where the gradient descent method comes into the picture. Taking the partial derivative of the cost function with respect to each parameter, and stepping in the opposite direction, leads to the optimal coefficient values.

(Complete details of gradient descent are in https://gradientdescentvariants.blogspot.com/2019/07/gradient-descent.html)

Gradient Descent Visualization

 

 Python code for gradient descent 
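A minimal batch gradient descent sketch for the same linear model (reusing the X_b and y arrays from the Normal Equation example above; the learning rate eta and the iteration count are illustrative choices):

eta = 0.1                        # learning rate
n_iterations = 1000
m = 100                          # number of observations

theta = np.random.randn(2, 1)    # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)   # partial derivatives of the MSE cost
    theta = theta - eta * gradients

print(theta)   # should end up close to theta_best from the Normal Equation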

 Polynomial Regression 

What if your data is actually more complex than a simple straight line? Surprisingly, you can use a linear model to fit nonlinear data. A simple way is to add powers of each feature as new features, then train a linear model on this extended set of features. This is called Polynomial Regression.

Let's generate some nonlinear data, based on a simple quadratic equation, and see how it works:

>>> m = 100
>>> X = 6 * np.random.rand(m, 1) - 3
>>> y = 0.5 * X**2 + X + 2 + np.random.rand(m, 1)

Generated nonlinear and noisy data set

It's clear that a straight line will never fit this data properly, so let's use PolynomialFeatures from sklearn to transform our data, adding the square of each feature in the training set as a new feature:

 >>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929, 0.56664654]) 

X_poly now contains the original feature plus the square of that feature. Now you can fit a LinearRegression model to this extended training data to get the Polynomial Regression model's predictions.
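A minimal continuation (assuming the X_poly and y arrays from above; the learned intercept and coefficients should come out close to those of the quadratic used to generate the data):

>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_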


Metrics for model evaluation 

R-squared (the coefficient of determination) ranges from 0 to 1. A value of 1 indicates that the predictor (X) accounts for all of the variation in Y; a value of 0 indicates that the predictor (X) accounts for none of the variation in Y.

1. Regression sum of squares (SSR)

This gives information about how far the estimated regression line is from the horizontal "no relationship" line (the average of the actual output):

SSR = sum((y_pred_i - mean(y))^2)

2. Sum of squared errors (SSE)

This tells how much the target value varies around the regression line (the predicted value):

SSE = sum((y_i - y_pred_i)^2)
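A minimal sketch of computing these metrics with NumPy (assuming y holds the actual target values and y_pred the model's predictions, both as 1-D arrays):

import numpy as np

ssr = np.sum((y_pred - y.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - y_pred) ** 2)          # sum of squared errors
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
r_squared = 1 - sse / sst                # equals ssr / sst for a least-squares linear fit
print(ssr, sse, r_squared)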

 

For a detailed view of the code for Linear Regression, follow my GitHub link: https://github.com/DeepakDeepu123/LinearRegression/blob/master/Untitled7.ipynb


Thank you!