Introduction
This article is a summary of Week 2 of Andrew Ng's neural network course, available on Coursera. Week 2 mainly covers how to deal with logistic regression problems such as binary classification.
Regression problems have wide applications throughout our daily life: predicting house prices, for instance, or estimating how many goods you could sell this month. In a regression problem, you are given a set of input data together with the real outputs, and you are asked to fit a function that makes the predictions and the real results as close as possible. Binary classification has the same setup, the only difference being that the output takes only two values, conventionally 0 and 1, and logistic regression is the standard tool for it. For a linear model, one way to do the fitting mathematically is to apply least squares, and the resulting formula gives us the coefficients directly. But what if there are 100 training examples, each containing 5 features? A closed-form formula still exists (the normal equations), but it grows expensive to compute as the data grows, and for more complicated models there is no such formula at all. So here comes gradient descent.
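As an aside, here is a minimal sketch of that closed-form route using NumPy's least-squares solver; the data below is random and purely illustrative, not from the course:

import numpy as np

# Hypothetical data: 100 training examples, 5 features each
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))    # inputs
b = rng.normal(size=100)         # real outputs

# Solve min ||A w - b||^2 in closed form
w, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(w)    # the fitted coefficients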
Gradient Descent
As we know, the gradient of a function points in its steepest ascending direction, so the negative gradient points the steepest way down. It is just like coming down a mountain: you would always choose the steepest path downward, for it is the fastest way. That is the whole philosophy of gradient descent. In each step, you move in the steepest descending direction, then look around, find the direction that is steepest at your current position, and repeat until you get the wanted result. In this case, the result is the minimum value we can reach for the error between the estimated output and the real output.
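In code, that whole loop fits in a few lines. Here is a minimal sketch in Python (the function name, alpha, and the step count are my own illustrative choices):

def gradient_descent(grad, x, alpha=0.01, steps=1000):
    # Repeatedly step against the gradient, i.e. in the steepest descending direction
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x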
How does gradient descent really work? Here is an example, and I am sure that having seen it, you will understand gradient descent and be able to write a piece of code using it.
Problem: Find a value x such that f(x)=3+14x-5x^2=0, starting from the initial value x=0.
You may be scoffing at this for being too simple to use as an illustration. But I have to say that beginners are better off starting with simple problems, especially ones you could also solve with another method, so you can check the answer.
OK, let's get started. First, let's define what the error is. Obviously, the error here is the difference between f(x) and 0, which is just f(x) itself. But f(x) can be negative, which is not good, for we want the error to always be positive. So we define an error function, which is also called a cost function:
L(x)=f(x)^2=(3+14x-5x^2)^2
According to gradient descent, we need to take the steepest direction at each step, so we have to compute the derivative, which is:
L'(x)=2(3+14x-5x^2)(14-10x)
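If you want to convince yourself the derivative is right, a quick numerical check against a centred finite difference (my own addition, not part of the course) works well:

f = lambda x: 3 + 14*x - 5*x**2
L = lambda x: f(x)**2
dL = lambda x: 2 * f(x) * (14 - 10*x)

h = 1e-6
x = 1.0
# Both prints should show roughly the same number (96 here)
print(dL(x))
print((L(x + h) - L(x - h)) / (2 * h))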
So, applying gradient descent, x should be updated by the following rule:
x = x - alpha * L'(x) = x - alpha * 2(3+14x-5x^2)(14-10x),
where alpha, which decides how big a step you are going to take, is called the learning rate.
The whole process goes like this: you examine the difference between the real output and the predicted output. Usually you will get a very bad result at first, since it is only an initial guess; in this case x=0, which means error=9 (3 squared). If you are satisfied with this result, quit the programme (but I am sure no one is satisfied with this); otherwise, apply the updating rule, update x, and look at the result again. In this case, if I make alpha=0.01, then the new value of x should be 0-0.01*2*3*14=-0.84. Use this x again to compute the error, and if you are not satisfied with it, update x and do it again! (One caution: for this particular function, alpha=0.01 is actually too large and the iteration can overshoot and diverge, which is why the code below uses alpha=0.001.)
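Here is that first update written out in Python, just to check the arithmetic:

x = 0.0
alpha = 0.01
grad = 2 * (3 + 14*x - 5*x**2) * (14 - 10*x)    # L'(0) = 2 * 3 * 14 = 84
x = x - alpha * grad                            # 0 - 0.01 * 84 = -0.84
print(x)    # -0.84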
Some careful readers may notice that this function must have two different roots (here they are x=-0.2 and x=3). Yes, you are right. But by this means, we can only get one of them per run. If you want the other one, what should you do? You could think about it for a moment.
The answer lies in the starting point: gradient descent slides into whichever root is downhill from where it begins, so a different initial x gives a different root. One caution: do not start too far away, because the cost L(x) is a quartic whose gradient grows very quickly, and with a fixed small alpha the iteration will then overshoot wildly (Python may even complain about overflow). Starting at x=0 leads to the negative root, while starting at, say, x=2 leads to the positive one.
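Here is a small sketch that wraps the loop in a helper function (my own, not from the course) so you can try both starting points:

def find_root(x, alpha=0.001, tol=1e-11, max_steps=100000):
    # Minimise L(x) = f(x)^2 by gradient descent from the given start
    for _ in range(max_steps):
        f = 3 + 14*x - 5*x**2
        if f * f <= tol:    # error small enough: we have reached a root
            break
        x -= alpha * 2 * f * (14 - 10*x)
    return x

print(find_root(0))    # about -0.2
print(find_root(2))    # about 3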
Background
We all want our machines to be more intelligent, meaning that they can adjust their actions according to the environment, that they can be cleverer than you at telling whether something will work or not, and that they can even diagnose a disease!
I am enthusiastic about this: making machines do things humans can do, or even things we cannot. For example, given some statistical data, a machine can tell you the probability that a tumour is benign or malignant, which, to be honest, I am unable to do.
Using the Code
This piece of code implements the example illustrated above. You could write it yourself; it is actually quite simple.
x = 2                                # initial guess
alpha = 0.001                        # learning rate
error = (3 + 14*x - 5*x**2)**2       # cost L(x) = f(x)^2
count = 0                            # iteration counter
accuracy = 1e-11                     # stop once the error is this small

while error > accuracy:              # error is a square, so it can never be negative
    x -= alpha * 2 * (3 + 14*x - 5*x**2) * (14 - 10*x)    # x = x - alpha * L'(x)
    error = (3 + 14*x - 5*x**2)**2
    count += 1

print(x, error, count)
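Starting from x=2, the loop slides downhill into the root at x=3 within a few dozen iterations; change the starting value to x=0 and the same loop finds the other root at about x=-0.2. The exact iteration count depends on alpha and accuracy, so treat it as illustrative.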
Forecast
Next time, I will write an article about how to use gradient descent to predict whether a tumour is malignant or benign. See you!
History
- 14th March, 2018: Initial version