- Multivariate Linear Regression
- Computing Parameters Analytically
- In the original (single-variable) version we had
- $x$ = house size, used to predict
- $y$ = house price
- In a new scheme we have more variables (such as number of bedrooms, number of floors, age of the house)
- $x_1, x_2, x_3, x_4$ are the four features
  - $x_1$: size (feet squared)
  - $x_2$: number of bedrooms
  - $x_3$: number of floors
  - $x_4$: age of house (years)
- $y$ is the output variable (price)
- $n$: number of features ($n = 4$)
- $m$: number of examples (i.e. number of rows in a table)
- $x^{(i)}$: input (features) of the $i^{th}$ training example.
- $x_j^{(i)}$: value of feature $j$ in the $i^{th}$ training example.
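The notation maps directly onto array indexing. The sketch below assumes NumPy and uses a small illustrative table (the numbers and variable names are not from the notes); note that the notes count from 1 while NumPy counts from 0.

```python
import numpy as np

# Illustrative table: one row per training example; columns are
# size (feet squared), bedrooms, floors, age (years).
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [852, 2, 1, 36],
])
y = np.array([460, 232, 315, 178])  # illustrative prices

m, n = X.shape      # m = number of examples, n = number of features
x_2 = X[1]          # x^(2): all features of the 2nd example (0-based row 1)
x_2_3 = X[1, 2]     # x_3^(2): 3rd feature of the 2nd example

print(m, n)           # 4 4
print(x_2.tolist())   # [1416, 3, 2, 40]
print(int(x_2_3))     # 2
```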
- Cost function with multiple features
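- $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ (where $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$)
- For convenience of notation, define $x_0 = 1$, so that $h_\theta(x) = \theta^T x$
- $X$ is an $m \times (n+1)$ matrix
- The training examples are stored in $X$ row-wise. The following example shows the reason behind setting $x_0^{(i)} = 1$:
- $h_\theta(X) = X\theta$ (an $m \times 1$ column vector with one prediction per training example)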
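A minimal sketch of why the column $x_0 = 1$ is convenient, assuming NumPy: once every row of $X$ starts with a 1, all predictions come from a single matrix product $X\theta$, and the squared-error cost $J(\theta) = \frac{1}{2m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})^2$ follows in one line. The helper names and the data are illustrative, not taken from the notes.

```python
import numpy as np

def add_intercept(features):
    """Prepend the x0 = 1 column so that X has shape (m, n + 1)."""
    m = features.shape[0]
    return np.hstack([np.ones((m, 1)), features])

def hypothesis(X, theta):
    """Vectorized h_theta(X) = X @ theta: one prediction per row of X."""
    return X @ theta

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    residual = hypothesis(X, theta) - y
    return float(residual @ residual) / (2 * m)

# Illustrative data: columns are size, bedrooms, floors, age.
features = np.array([
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [1534.0, 3.0, 2.0, 30.0],
])
y = np.array([460.0, 232.0, 315.0])

X = add_intercept(features)
theta = np.zeros(X.shape[1])  # n + 1 parameters, theta_0 .. theta_n
print(X.shape)                # (3, 5)
print(cost(X, y, theta))      # cost of predicting 0 for every house
```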
- If you have a problem with multiple features, you should make sure those features have a similar scale
- $x_1$ = size (0 - 2000 square feet)
- $x_2$ = number of bedrooms (1 - 5)
- This means the contours generated if we plot $\theta_1$ vs. $\theta_2$ have a very tall and thin shape because of the huge difference in range
- Running gradient descent on this kind of cost function can take a long time to find the global minimum
- The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally: $-1 \le x_i \le 1$ or $-0.5 \le x_i \le 0.5$
- The goal is to get all input variables into roughly one of these ranges, give or take a few.
- Two techniques to help with this are feature scaling and mean normalization.
- Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.
- Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
- To implement both of these techniques, adjust your input values as shown in this formula: $x_i := \dfrac{x_i - \mu_i}{s_i}$
- Where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is either the range of values (max - min) or the standard deviation.
- For example, if $x_i$ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then $x_i := \dfrac{\text{price} - 1000}{1900}$
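The two techniques combine into one column-wise operation, $(x_i - \mu_i) / s_i$. The sketch below assumes NumPy; `mean_normalize` is a hypothetical helper name and the data is made up. It returns $\mu$ and $s$ as well, because any new example has to be scaled with the same values before it is fed to the hypothesis.

```python
import numpy as np

def mean_normalize(features, use_std=False):
    """Return (scaled, mu, s) with scaled = (features - mu) / s, per column."""
    mu = features.mean(axis=0)
    if use_std:
        s = features.std(axis=0)                          # standard deviation
    else:
        s = features.max(axis=0) - features.min(axis=0)   # range (max - min)
    # Assumes no feature is constant; a constant column would make s zero.
    return (features - mu) / s, mu, s

# Made-up data: column 0 is size (square feet), column 1 is number of bedrooms.
features = np.array([
    [2000.0, 5.0],
    [1200.0, 3.0],
    [800.0, 2.0],
    [400.0, 1.0],
])

scaled, mu, s = mean_normalize(features)
print(mu)      # per-feature means
print(s)       # per-feature ranges
print(scaled)  # every column now has mean 0 and range 1

# A new example is scaled with the SAME mu and s as the training data.
new_example = (np.array([1500.0, 4.0]) - mu) / s
```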
- Debugging: how to make sure gradient descent is working correctly.
- How to choose learning rate $\alpha$.
- Make a plot with number of iterations on the x-axis. Now plot the cost function, $J(\theta)$, over the number of iterations of gradient descent. If $J(\theta)$ ever increases, then you probably need to decrease $\alpha$.
- Declare convergence if $J(\theta)$ decreases by less than E in one iteration, where E is some small value such as $10^{-3}$. However, in practice it's difficult to choose this threshold value.
- It has been proven that if the learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
- If $\alpha$ is too small: slow convergence.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.
- Try a range of alpha values
- Plot $J(\theta)$ vs number of iterations for each version of $\alpha$
- Go for roughly threefold increases
- 0.001, 0.003, 0.01, 0.03, 0.1, 0.3
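The debugging procedure above can be sketched as follows, assuming NumPy and the standard batch update for linear regression, $\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$. The function names and the data are illustrative. Instead of drawing the plot, the sketch records $J(\theta)$ after every iteration and checks whether it ever goes up; the recorded history is what you would plot against the iteration number. The rate 2.5 is not from the notes and is added only to show what a too-large $\alpha$ looks like on this particular data.

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Batch gradient descent; returns theta and J(theta) after each iteration."""
    m, n_params = X.shape
    theta = np.zeros(n_params)
    history = []
    for _ in range(num_iters):
        error = X @ theta - y                        # h_theta(x) - y per example
        theta = theta - (alpha / m) * (X.T @ error)  # simultaneous update
        residual = X @ theta - y
        history.append(float(residual @ residual) / (2 * m))
    return theta, history

def ever_increases(history):
    """True if J(theta) went up between any two consecutive iterations."""
    return any(later > earlier for earlier, later in zip(history, history[1:]))

# Made-up data: one feature already scaled to [-1, 1], with y = 2 + 3x exactly.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = 2.0 + 3.0 * x
X = np.column_stack([np.ones_like(x), x])  # x0 = 1 column first

# Threefold steps from the notes, plus 2.5 as a deliberately too-large rate.
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 2.5]
results = {}
for alpha in alphas:
    theta, history = gradient_descent(X, y, alpha, num_iters=100)
    results[alpha] = (theta, history)
    print(f"alpha={alpha}: final J={history[-1]:.3g}, "
          f"increased={ever_increases(history)}")
```

On this data the small rates decrease $J(\theta)$ steadily but slowly, 0.3 gets close to the true parameters within 100 iterations, and 2.5 makes the cost grow.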
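- We can improve our features and the form of our hypothesis function in a couple of ways.
- We can combine $x_1$ and $x_2$ into a new feature $x_3$ by taking $x_1 \cdot x_2$.
- Take housing prices as an example: $x_3 = x_1 \cdot x_2$ is the area of the lot, where $x_1$ is the frontage and $x_2$ is the depth of the house.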
- Our hypothesis function need not be linear (a straight line) if that does not fit the data well. Choosing new features to get a better-fitting model in this way is called polynomial regression.
- We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
- For housing data we could use a quadratic function: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$
- But this may not fit the data so well, because a quadratic eventually turns back down, which would mean housing prices decrease when size gets really big.
- So instead we could use a cubic function: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$
- In the cubic version, we have created new features $x_2$ and $x_3$, where $x_2 = x_1^2$ and $x_3 = x_1^3$.
- One important thing to keep in mind is that if you choose your features this way, feature scaling becomes very important.
- E.g. if $x_1$ has range 1 - 1000, then the range of $x_1^2$ becomes 1 - 1000000 and that of $x_1^3$ becomes 1 - 1000000000
- Or we can make it a square root function: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$.
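A short sketch of building the cubic features and seeing why scaling matters, assuming NumPy. The sizes are hypothetical values spanning the 1 - 1000 range; the scaling step is the same mean normalization by range described earlier in these notes.

```python
import numpy as np

# Hypothetical house sizes spanning the 1 - 1000 range.
size = np.array([1.0, 10.0, 250.0, 500.0, 1000.0])

# New features for the cubic hypothesis: x1 = size, x2 = size^2, x3 = size^3.
poly = np.column_stack([size, size**2, size**3])
print(poly.max(axis=0))  # maxima differ by factors of 1000 between columns

# Mean normalization by range brings all three columns to a comparable scale.
mu = poly.mean(axis=0)
s = poly.max(axis=0) - poly.min(axis=0)
scaled = (poly - mu) / s

# Design matrix for the cubic hypothesis, with the x0 = 1 column first.
X = np.column_stack([np.ones(len(size)), scaled])
print(X.shape)  # (5, 4)
```

A square root feature would be added the same way, with `np.sqrt(size)` as an extra column.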
Normal equation: a method to solve for $\theta$ analytically. Minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s and setting them to zero. This allows us to find the optimum $\theta$ without iteration. The normal equation formula is given below: $\theta = (X^T X)^{-1} X^T y$
There is no need to do feature scaling with the normal equation.
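A minimal sketch of the normal equation $\theta = (X^T X)^{-1} X^T y$, assuming NumPy. Rather than forming the inverse explicitly, it solves the equivalent linear system $(X^T X)\,\theta = X^T y$ with `np.linalg.solve`, which gives the same $\theta$ and is generally the more numerically careful choice. The function name and data are illustrative; the prices are generated from known parameters so the result can be checked, and the features are left unscaled on purpose.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up, unscaled features: size (square feet) and number of bedrooms.
features = np.array([
    [2104.0, 5.0],
    [1416.0, 3.0],
    [1534.0, 3.0],
    [852.0, 2.0],
])
# Prices generated from known parameters: theta = [50, 0.1, 20].
y = 50.0 + 0.1 * features[:, 0] + 20.0 * features[:, 1]

X = np.column_stack([np.ones(len(y)), features])  # x0 = 1 column first
theta = normal_equation(X, y)
print(np.round(theta, 3))  # close to [50, 0.1, 20], with no iteration or alpha
```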
The following is a comparison of gradient descent and the normal equation:

| Gradient Descent | Normal Equation |
|---|---|
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^T X$ |
| Works well when $n$ is large | Slow if $n$ is very large |
- With the normal equation, computing the inverse has complexity $O(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when $n$ exceeds 10,000 it might be a good time to go from the normal equation to an iterative process.
- If $X^T X$ is noninvertible (singular/degenerate), the common causes are:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".
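The redundant-feature case can be reproduced in a few lines, assuming NumPy. Here the second feature is an exact multiple of the first, so $X$ (and therefore $X^T X$) is rank deficient. Using the pseudo-inverse (`np.linalg.pinv`) to still obtain a solution is an addition that goes beyond these notes; the notes' own remedies are deleting features or regularization. The data and the explicit tolerances are assumptions chosen for illustration.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1                      # redundant: an exact multiple of x1
y = 1.0 + 2.0 * x1
X = np.column_stack([np.ones_like(x1), x1, x2])

# rank(X^T X) equals rank(X); anything below 3 means X^T X is singular.
rank = np.linalg.matrix_rank(X, tol=1e-10)
print(rank, "of", X.shape[1])      # 2 of 3

# The pseudo-inverse picks the minimum-norm theta among the many that fit.
theta = np.linalg.pinv(X, rcond=1e-10) @ y
print(np.allclose(X @ theta, y))   # True

# Remedy from the notes: delete the redundant feature, then solve as usual.
X_reduced = X[:, :2]
theta_reduced = np.linalg.solve(X_reduced.T @ X_reduced, X_reduced.T @ y)
```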
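semicolon [,semi'kəulən, 'semikəulən] n. the punctuation mark ";"
decimal ['desiməl] adj. relating to decimals; base-ten; n. a decimal fraction
diagonal [dai'æɡənəl] adj. slanted; of a diagonal; twilled; n. a diagonal line; a slanted line
vectorization [,vektəri'zeiʃən] n. [math] conversion into vector form
numerical [nju:'merikəl] adj. of numeric values; of numbers; expressed in numbers
pathological [,pæθə'lɔdʒikəl] adj. of pathology; morbid; caused by disease (same as pathologic)
convergence [kən'və:dʒəns] n. [math] approach to a limit; a coming together, a gathering
polynomial [,pɔli'nəumiəl] n. [math] an expression with several terms
quadratic [kwɔ'drætik] adj. [math] of the second degree; n. a second-degree equation
cubic ['kju:bik] adj. cube-shaped; of the third power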