Week 3 - Logistic Regression & Regularization
Classification and Representation
- Classification: determine which class a new input should fall into
- Classification problems
- Email -> spam/not spam?
- Online transactions -> fraudulent?
- Tumor -> malignant/benign?
- Sometimes linear regression may appear to work, e.g. fitting a line to a well-separated dataset and predicting $y = 1$ whenever $h_\theta(x) \ge 0.5$
- But most of the time the training set won't be so well behaved: a single outlying example can shift the fitted line and break the 0.5 threshold
- Here we use logistic regression, which generates a value $h_\theta(x)$ that is always between 0 and 1
- To get this we change the form of our hypothesis to satisfy $0 \le h_\theta(x) \le 1$
- Our new form uses the "Logistic Function", also called the "Sigmoid Function": $h_\theta(x) = g(\theta^T x)$, where $z = \theta^T x$ and $g(z) = \frac{1}{1 + e^{-z}}$
- The logistic function is an S-shaped curve: $g(z) \to 0$ as $z \to -\infty$, $g(z) \to 1$ as $z \to +\infty$, and $g(0) = 0.5$ (a quick sketch follows)
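As a minimal sketch (the course itself uses Octave; this is my Python rendering with NumPy), the sigmoid is a one-liner:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z)).
    Works elementwise on scalars or NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                        # 0.5, the decision threshold
print(sigmoid(np.array([-10.0, 10.0])))  # ~[0.0000454, 0.9999546]
```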
- Interpreting the hypothesis output: $h_\theta(x) = P(y = 1 \mid x; \theta)$, i.e. it gives us the probability that the output is 1
- read: the probability that $y = 1$, given $x$, parameterized by $\theta$
- since the two outcomes are complementary, $P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)$
- In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows: $h_\theta(x) \ge 0.5 \rightarrow y = 1$ and $h_\theta(x) < 0.5 \rightarrow y = 0$
- so if our input to $g$ is $\theta^T x$, then that means: $h_\theta(x) = g(\theta^T x) \ge 0.5$ exactly when $\theta^T x \ge 0$
- Example: take $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$, so our parameter vector is a column vector with the values $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$
- Then $h_\theta(x)$ becomes $g(-3 + x_1 + x_2)$
- We predict "$y = 1$" if $-3 + x_1 + x_2 \ge 0$, i.e. $x_1 + x_2 \ge 3$
- So we graphically plot our decision boundary:
- Blue = false, Magenta = true
- Line = decision boundary
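A small sketch of this example in Python, using the $\theta = (-3, 1, 1)$ parameters above (the `predict` helper is my own naming, not from the course):

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])  # [theta_0, theta_1, theta_2]

def predict(x1, x2):
    """Predict y = 1 iff theta^T x >= 0, i.e. h_theta(x) = g(theta^T x) >= 0.5."""
    x = np.array([1.0, x1, x2])  # x_0 = 1 is the intercept term
    return int(theta @ x >= 0)

print(predict(1, 1))  # 0: x1 + x2 = 2 < 3, below the decision boundary
print(predict(2, 2))  # 1: x1 + x2 = 4 >= 3, on the "true" side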
Non-linear decision boundaries
- Get logistic regression to fit a complex non-linear data set
- Say $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$ with $\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$:
- Predict that "$y = 1$" if $-1 + x_1^2 + x_2^2 \ge 0$, i.e. $x_1^2 + x_2^2 \ge 1$
- If we plot $x_1^2 + x_2^2 = 1$, then this gives us a circle with a radius of 1 centered at the origin
- Meaning we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis, as in the sketch below
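The same idea as a sketch for the circular boundary with $\theta = (-1, 0, 0, 1, 1)$; only the feature vector changes:

```python
import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def predict_circle(x1, x2):
    """y = 1 outside the unit circle, since -1 + x1^2 + x2^2 >= 0 there."""
    x = np.array([1.0, x1, x2, x1**2, x2**2])  # polynomial features
    return int(theta @ x >= 0)

print(predict_circle(0.5, 0.5))  # 0: 0.25 + 0.25 < 1, inside the circle
print(predict_circle(1.0, 1.0))  # 1: 1 + 1 >= 1, outside the circle
```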
Logistic Regression Model
- Define the optimization objective for the cost function we use to fit the parameters
- Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
- $m$ examples, each with $x_0 = 1$ and $y \in \{0, 1\}$
- Each example is a feature vector $x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, which is $(n+1)$-dimensional
- Linear regression uses the following function to determine $\theta$: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Define a per-example cost to simplify the function: $\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
- then we get: $J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$
- to further simplify it, we can get rid of the superscripts: $\mathrm{Cost}(h_\theta(x), y) = \frac{1}{2}\left(h_\theta(x) - y\right)^2$
- If we use this function for logistic regression, $J(\theta)$ will be non-convex with many local optima, because the sigmoid makes $h_\theta(x)$ non-linear; gradient descent is then not guaranteed to find the global minimum
- So we come up with a new, convex logistic regression cost function:
  - $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$ if $y = 1$
  - $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$ if $y = 0$
- We only care about $y \in \{0, 1\}$, so: the cost is 0 when $h_\theta(x) = y$, and it grows to $\infty$ as $h_\theta(x)$ approaches the wrong label
Simplified cost function and gradient descent
- Compress the cost function's two conditional cases into one case: $\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$
- Because $y$ can only be either 0 or 1:
  - if $y = 1$, the second term vanishes and $\mathrm{Cost} = -\log(h_\theta(x))$
  - if $y = 0$, the first term vanishes and $\mathrm{Cost} = -\log(1 - h_\theta(x))$
- this cost function can be derived from statistics using the principle of maximum likelihood estimation.
- i.e. an idea from statistics about how to efficiently find the parameters of different models
- Then we can fully write out our entire cost function as follows: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$
- and a vectorized implementation is: $h = g(X\theta)$, $J(\theta) = \frac{1}{m}\left(-y^T\log(h) - (1 - y)^T\log(1 - h)\right)$ (see the sketch below)
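A sketch of the vectorized cost in NumPy, assuming $X$ is an $m \times (n+1)$ design matrix with a leading column of ones and $y$ holds 0/1 labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)), h = g(X theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

# With theta = 0, h = 0.5 for every example, so J = -log(0.5) ~ 0.693
X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0])
print(cost(np.zeros(2), X, y))  # ~0.6931
```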
- Gradient Descent
- the general form is: Repeat { $\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$ }
- We can work out the derivative part using calculus to get: $\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$; notice this rule is identical in form to the one for linear regression, only the definition of $h_\theta$ has changed
- A vectorized implementation is: $\theta := \theta - \frac{\alpha}{m}X^T\left(g(X\theta) - \vec{y}\right)$ (sketch below)
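And a sketch of the full loop; `alpha` and `num_iters` here are illustrative defaults of mine, not values from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent: theta := theta - (alpha/m) X^T (g(X theta) - y)."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta
```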
Alternatively, instead of gradient descent, we could minimize the cost function with:
- Conjugate gradient
- BFGS (Broyden-Fletcher-Goldfarb-Shanno)
- L-BFGS (Limited memory - BFGS)
- Advantages:
  - No need to manually pick alpha (the learning rate)
  - They have a clever inner loop (a line-search algorithm) which tries a bunch of alpha values and picks a good one
  - Often faster than gradient descent
  - Do more than just pick a good learning rate
  - Can be used successfully without understanding their complexity
- Disadvantages:
  - Can make debugging more difficult
  - Should not be implemented yourself; use well-tested library implementations instead
  - Different libraries may use different implementations, which may affect performance
Using advanced cost minimization algorithms
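The course demonstrates this with Octave's `fminunc`; a rough Python analogue (a sketch, not the course's code) is `scipy.optimize.minimize` with `method='BFGS'`, supplying the cost and its gradient together:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return (J(theta), gradient) so the optimizer can use both."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Toy data: one feature plus the intercept column of ones
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 1.0, 0.0, 1.0])

# jac=True tells SciPy the function returns (cost, gradient);
# note there is no learning rate to pick.
res = minimize(cost_and_grad, np.zeros(X.shape[1]),
               args=(X, y), method='BFGS', jac=True)
print(res.x)  # fitted theta
```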
Multiclass Classification: One-vs-all
- Divide our problem into $n + 1$ binary classification problems ($+1$ because the class index starts at 0); in each one, we predict the probability that $y$ is a member of one of our classes
- For example, with 3 classes we train 3 classifiers, each separating one class from the other two
- Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$: $h_\theta^{(i)}(x) = P(y = i \mid x; \theta)$
- On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$ (a sketch follows)
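A sketch of one-vs-all built on the pieces above (the function names are mine; in practice a library routine would do this):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.1, num_iters=2000):
    """Row i of the result holds theta for the 'class i vs. rest' classifier."""
    m, n_plus_1 = X.shape
    all_theta = np.zeros((num_classes, n_plus_1))
    for i in range(num_classes):
        yi = (y == i).astype(float)  # relabel: 1 for class i, 0 for the rest
        theta = np.zeros(n_plus_1)
        for _ in range(num_iters):
            theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - yi))
        all_theta[i] = theta
    return all_theta

def predict_one_vs_all(all_theta, X):
    """Pick, per row of X, the class whose classifier is most confident."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)
```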
The Problem of Overfitting
- Three figures show underfitting, a good fit, and overfitting (taking housing prices as the example):
  - under-fitting or high bias: leftmost, $h_\theta(x) = \theta_0 + \theta_1 x$; the data doesn't really lie on a straight line
  - overfitting or high variance: rightmost, a 4th-order polynomial $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$; it passes through the training data but is not a good predictor
  - good fit: middle, $\theta_0 + \theta_1 x + \theta_2 x^2$; adding the quadratic term obtains a slightly better fit to the data
- Reduce the number of features:
- Manually select which features to keep.
- Use a model selection algorithm.
- Regularization: keep all the features, but reduce the magnitude of the parameters $\theta_j$.
- Regularization works well when we have a lot of slightly useful features.
- if we have overfitting from our hypothesis function $h_\theta(x)$, we can reduce the weight that some of the terms in our function carry by increasing their cost.
- Say we wanted to make the following function more quadratic: $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
- We'll want to eliminate the influence of $\theta_3 x^3$ and $\theta_4 x^4$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function: $\min_\theta \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$
- We add two extra terms at the end to inflate the cost of $\theta_3$ and $\theta_4$. This will in turn greatly reduce the values of $\theta_3$ and $\theta_4$ in our hypothesis function.
- We could also regularize all of our theta parameters in a single summation as: $\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$
- $\lambda$ (lambda) is the regularization parameter. It determines how much the costs of our theta parameters are inflated.
- But if lambda is chosen to be too large, it may smooth out the function too much and cause under-fitting.
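A sketch of this regularized cost in NumPy; note the penalty skips `theta[0]`:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J = 1/(2m) * [ sum((h - y)^2) + lam * sum(theta_j^2 for j >= 1) ]."""
    m = len(y)
    err = X @ theta - y                     # h_theta(x) = X theta (linear case)
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return (err @ err + penalty) / (2 * m)
```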
Regularized Linear Regression
- Gradient descent: we separate out $\theta_0$ from the other parameters because we do not penalize it:
  - $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
  - $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$, for $j \in \{1, 2, \dots, n\}$
- The term $\frac{\lambda}{m}\theta_j$ performs our regularization. With some manipulation our update rule can also be represented as: $\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
- The first term in the above equation, $1 - \alpha\frac{\lambda}{m}$, will always be less than 1. Intuitively you can see it as shrinking the value of $\theta_j$ by some amount on every update. Notice that the second term is now exactly the same as it was before.
- Normal equation: to add in regularization, the equation is the same as our original, except that we add another term inside the parentheses: $\theta = \left(X^T X + \lambda \cdot L\right)^{-1} X^T y$
- $L$ is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It has dimension $(n+1)\times(n+1)$. Intuitively, it is the identity matrix (though we are not including $x_0$), multiplied by the single real number $\lambda$.
- Recall that if $m < n$, then $X^T X$ is non-invertible (and it may be non-invertible if $m = n$). However, when we add the term $\lambda \cdot L$, $X^T X + \lambda \cdot L$ becomes invertible. A sketch follows.
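A sketch of the regularized normal equation; `np.linalg.solve` is used rather than forming the inverse explicitly:

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """theta = (X^T X + lam * L)^(-1) X^T y, where L is the identity
    with its top-left entry zeroed so theta_0 is not penalized."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```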
Regularized Logistic Regression
- Cost function: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$
- Regularize this equation by adding a term to the end: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
- The second sum, $\sum_{j=1}^{n}\theta_j^2$, explicitly excludes the bias term $\theta_0$. I.e. the $\theta$ vector is indexed from 0 to $n$ (holding $n+1$ values, $\theta_0$ through $\theta_n$), and this sum skips $\theta_0$ by running from 1 to $n$ (for regularization we don't penalize $\theta_0$, so we treat it slightly differently). Thus, when computing the equation, we should continuously update the two following equations:
  - $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
  - $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]$, for $j \in \{1, 2, \dots, n\}$, now with $h_\theta(x) = g(\theta^T x)$
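A sketch of one step of these updates, vectorized; zeroing the first entry of the penalty vector implements the "treat $\theta_0$ differently" rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_logistic_step(theta, X, y, alpha, lam):
    """One regularized gradient descent step for logistic regression."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    penalty = (lam / m) * theta
    penalty[0] = 0.0  # do not penalize the bias term theta_0
    return theta - alpha * (grad + penalty)
```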
logistic [ləu'dʒistik] adj. (math) of or relating to symbolic logic
fraudulent ['frɔ:djulənt] adj. deceitful; dishonest
malignant [mə'liɡnənt] adj. (med.) malignant, cancerous
benign [bi'nain] adj. (med.) benign, non-cancerous
polynomial [,pɔli'nəumiəl] n. (math) polynomial
quadratic [kwɔ'drætik] adj. (math) of the second degree; n. quadratic equation
penalize ['pi:nəlaiz] vt. to punish; to put at a disadvantage
sigmoid function ['sigmɔid] the S-shaped (sigmoid) function