Week 1 - Introduction

1. Several types of learning algorithms

1.1. Supervised Learning

  • Regression, to predict results within a continuous output.
    • For example, predict housing prices:
      • Given this data, how do we predict the price for a house of a given size?
      • How to solve this?
        • Straight line through data
          • Maybe $150,000
        • Polynomial
          • Maybe $200,000
        • We'll learn how to choose function later.
      • After we have fitted a line to the data, we can predict prices for new houses
    • In this example, the plotted data is called the training set, and fitting a straight line to it is called linear regression
  • Classification, to predict results in a discrete output.
    • For example, determining whether a tumor is malignant or benign.

1.2. Unsupervised Learning

  • We derive structure by clustering the data based on relationships among the variables in the data.
    • For example, grouping news stories into cohesive groups

2. Linear Regression (Model and Cost Function)

  • Model Representation

    • To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h : X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

      week-1-1

    • $h_{\theta}(x)=\theta_{0}+\theta_{1}x$, and this means:

      • y is a linear function of x
        • $\theta_{i}$ are the parameters
        • $\theta_{0}$ is the intercept (the value of y when x = 0)
        • $\theta_{1}$ is the slope (gradient)
    • This kind of function is a Linear Regression with one variable,
      • also called Univariate Linear Regression
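The univariate hypothesis above can be sketched in a few lines of Python. The parameter values in the usage line are illustrative, not fitted to any data:

```python
# A minimal sketch of the univariate hypothesis h_theta(x) = theta0 + theta1 * x.
def hypothesis(theta0, theta1, x):
    """Predict y for input x with a straight-line model."""
    return theta0 + theta1 * x

# With illustrative parameters theta0 = 1.0, theta1 = 2.0, an input of 3 predicts 7.0.
print(hypothesis(1.0, 2.0, 3))  # -> 7.0
```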

2.1. Cost Function

  • A cost function lets us figure out how to fit the best straight line to our data

  • Choosing values for the parameters $\theta_{0}$ and $\theta_{1}$

    • Different values give you different functions
    • If $\theta_{0}$ is 1.5 and $\theta_{1}$ is 0, we get a horizontal line at y = 1.5
    • If $\theta_{1} > 0$, we get a positive slope
    • We want to find a straight line that is as close to the actual y values as possible
  • To formalize this:

    • We want to solve a minimization problem
    • Minimize $(h_{\theta}(x)-y)^{2}$
      • minimize the difference between h(x) and y for every training example
    • Sum this over the training set: $\dfrac{1}{2m} \displaystyle\sum_{i=1}^{m} \left(h_\theta(x_{i}) - y_{i}\right)^2$
      • $\frac{1}{2m}$
        • the $\frac{1}{m}$ factor means we take the average
        • the extra 2 makes the calculus a bit easier and doesn't change the minimizing parameters at all (halving a function doesn't move its minimum)
    • This is a cost function:
      • $J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^{m} \left(\hat{y}_{i} - y_{i}\right)^2 = \dfrac{1}{2m} \displaystyle\sum_{i=1}^{m} \left(h_\theta(x_{i}) - y_{i}\right)^2$
      • And we want to minimize this cost function
        • Our cost function is (because of the summation term) inherently looking at ALL the data in the training set at any time
  • The following image summarizes what the cost function does:

    • week-1-2

    • Hypothesis - is like your prediction machine, throw in an x value, get a putative y value

    • Cost - is a way to use your training data to determine values for the $\theta$ parameters which make the hypothesis as accurate as possible
      • This cost function is also called the squared error cost function
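The squared error cost function above translates directly into code. A minimal sketch, with hypothetical example data:

```python
# Squared error cost J(theta0, theta1) over a training set of (x, y) pairs.
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    # Sum the squared prediction errors over ALL m training examples.
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

# A perfect fit has zero cost: the line y = x on data lying exactly on y = x.
print(cost(0.0, 1.0, [1, 2, 3], [1, 2, 3]))  # -> 0.0
```

Note how the sum runs over the whole training set, matching the point that the cost function looks at all the data at once.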

2.2. Cost Function - Intuition

  • Simplified hypothesis (assume $\theta_{0} = 0$)
    • So we get $h_{\theta}(x)=\theta_{1}x$, and the cost function becomes:
    • $J(\theta_1) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^{m} \left(h_\theta(x_{i}) - y_{i}\right)^2$
    • Goal: $\displaystyle\min_{\theta_{1}} J(\theta_{1})$
    • For example:
      • week-1-3-1
    • If $\theta_{1}=1$, then $h_{\theta}(x) = 0 + 1 \times x$, so $h_\theta(x_{i}) - y_{i} = 0$ for every example, and $J(\theta_{1}) = 0$
    • Plots: $(\theta_{1}=1,\ J(\theta_{1})=0)$, $(\theta_{1}=0.5,\ J(\theta_{1})\approx0.58)$, $(\theta_{1}=0,\ J(\theta_{1})\approx2.3)$
    • Then we got:
    • The optimization objective for the learning algorithm is to find the value of $\theta_{1}$ which minimizes $J(\theta_{1})$
      • So here $\theta_{1} = 1$ is the best value for $\theta_{1}$
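The plotted values above can be checked numerically. This sketch assumes the usual three-point training set (1,1), (2,2), (3,3) behind this example:

```python
# Cost of the simplified hypothesis h(x) = theta1 * x on (1,1), (2,2), (3,3).
def J(theta1, xs=(1, 2, 3), ys=(1, 2, 3)):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1.0))            # -> 0.0
print(round(J(0.5), 2))  # -> 0.58
print(round(J(0.0), 2))  # -> 2.33
```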
  • Using our original hypothesis with two parameters, the cost function is $J(\theta_{0}, \theta_{1})$
    • This generates a 3D surface plot where the axes are
      • $x = \theta_{0}$
      • $z = \theta_{1}$
      • $y = J(\theta_{0}, \theta_{1})$
    • Instead of a surface plot we can use a contour figures/plots
      • Set of ellipses in different colors
      • Each ellipse marks a constant value of $J(\theta_{0}, \theta_{1})$, but points on it plot to different locations because $\theta_{0}$ and $\theta_{1}$ vary
      • Imagine a bowl-shaped function coming out of the screen, so the center of the concentric ellipses is the minimum

3. Parameter Learning

3.1. Gradient Descent

  • Minimize the cost function $J$
  • Gradient descent
    • Used all over machine learning for minimization
  • Start by looking at a general $J()$ function
  • Problem
    • We have $J(\theta_{0}, \theta_{1})$
    • We want to get $\min J(\theta_{0}, \theta_{1})$
  • Gradient descent applies to more general functions
    • $J(\theta_{0}, \theta_{1}, \theta_{2}, \ldots, \theta_{n})$
    • $\min J(\theta_{0}, \theta_{1}, \theta_{2}, \ldots, \theta_{n})$
  • Outline:
    • Start with some $\theta_{0}, \theta_{1}$
    • Keep changing $\theta_{0}, \theta_{1}$ to reduce $J(\theta_{0}, \theta_{1})$ until we hopefully end up at a minimum
    • Here we can see one initialization point led to one local minimum
    • The other led to a different one

3.2. Gradient Descent Intuition

  • $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ for $j=0$ and $j=1$
  • What does this all mean?
    • Update $\theta_{j}$ by subtracting $\alpha$ times the partial derivative of the cost function with respect to $\theta_{j}$
  • Notation
    • $:=$
      • Denotes assignment
    • $\alpha$ (alpha) is a number called the learning rate
      • Controls how big a step you take
        • If $\alpha$ is big, we have an aggressive gradient descent
        • If $\alpha$ is small, we take tiny steps
  • Derivative term: $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$; we will derive it later
  • How the gradient descent algorithm is implemented
    • Do this for $\theta_{0}$ and $\theta_{1}$
    • For $j=0$ and $j=1$, we simultaneously update both, like:
      • week-1-12
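The simultaneous update can be sketched as follows; `grad0` and `grad1` stand in for the partial derivatives of J, which are derived later:

```python
# One simultaneous gradient descent update for (theta0, theta1).
def step(theta0, theta1, alpha, grad0, grad1):
    temp0 = theta0 - alpha * grad0  # compute both new values first ...
    temp1 = theta1 - alpha * grad1
    return temp0, temp1             # ... then assign them together

# Updating theta0 in place before computing theta1's gradient would be wrong:
# the second update would see a partially-updated parameter vector.
print(step(1.0, 1.0, 0.1, 2.0, 4.0))
```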
  • Understanding the algorithm

    • back to simpler function where we minimize one parameter

      • $\min_{\theta_{1}} J(\theta_{1})$ where $\theta_{1}$ is a real number
    • Two key terms in the algorithm

      • $\alpha$, the learning rate
      • $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, the derivative term
    • Notation nuances

      • Partial derivative vs. derivative
        • Use a partial derivative when the function has multiple variables but we differentiate with respect to only one
        • Use an ordinary derivative when the function has a single variable
    • Derivative term

      • Take the tangent at the current point and look at the slope of that line
      • If $\theta_{1}$ is to the right of the minimum, the slope (derivative) is positive; $\alpha$ is always positive, so the update makes $\theta_{1}$ smaller, moving it toward the minimum
      • Similarly, if $\theta_{1}$ is to the left of the minimum, the slope is negative, so the update makes $\theta_{1}$ bigger
      • week-1-13
    • Alpha term (α)

      • week-1-14
    • How does gradient descent converge with a fixed step size α?

      • The intuition behind the convergence is that the derivative term $\frac{\partial}{\partial \theta_1} J(\theta_1)$ approaches 0 as $\theta_{1}$ approaches the bottom of our convex function. At the minimum, the derivative is exactly 0 and thus we get:

        • $\theta_{1} := \theta_{1} - \alpha \times 0$

      • week-1-15

3.3. Gradient Descent For Linear Regression

  • Apply gradient descent to minimize the squared error cost function $J(\theta_{0},\theta_{1})$

  • Now we have a partial derivative:

    • week-1-16
  • So we need to determine the derivative for each parameter:

  • $\begin{aligned} \text{repeat until convergence: } \lbrace \\ \theta_0 := \; & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(h_\theta(x_{i}) - y_{i}\right) \\ \theta_1 := \; & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left(\left(h_\theta(x_{i}) - y_{i}\right) x_{i}\right) \\ \rbrace \end{aligned}$

    • To verify these derivatives you need multivariate calculus, so come back to this step after you have studied it.
      • We have separated the two cases for $\theta_{j}$ into separate equations for $\theta_{0}$ and $\theta_{1}$; for $\theta_1$ we multiply by $x_i$ at the end because of the derivative. The following is a derivation of $\frac{\partial}{\partial \theta_j} J(\theta)$ for a single example:
      • week-1-18
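The two update rules above can be sketched as a small batch gradient descent loop. The data, learning rate, and iteration count below are illustrative:

```python
# Batch gradient descent for univariate linear regression,
# implementing the theta0 and theta1 update rules simultaneously.
def fit(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        # Each step looks at ALL m training examples ("batch").
        errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / m                          # dJ/d theta0
        g1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/d theta1
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1     # simultaneous update
    return t0, t1

t0, t1 = fit([1, 2, 3], [1, 2, 3])
print(t0, t1)  # theta0 near 0, theta1 near 1 for data lying on y = x
```

Because the cost is convex, the loop converges to the global minimum regardless of the (zero) initialization here.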
  • How does it work

    • In general there is a risk of ending up in different local optima
    • The linear regression cost function, however, is always a convex function
    • It always has a single minimum
      • Bowl shaped
        • One global optimum
          • So gradient descent will always converge to the global optimum
    • In action
      • Initialize values to $\theta_{0} = 900, \theta_{1} = -0.1$
      • week-1-17
    • End up at a global minimum
    • This is actually "Batch" Gradient Descent
      • Refers to the fact that over each step you look at all the training data
        • Each step compute over m training examples
      • Sometimes non-batch versions exist, which look at small data subsets
        • We'll look at other forms of gradient descent (to use when m is too large) later in the course
    • There also exists a closed-form solution for finding the minimum
      • The normal equation method; gradient descent scales better to large data sets, though
      • Gradient descent is used in lots of contexts across machine learning

4. Linear Algebra

4.1. Matrices and Vectors

  • Matrix: Rectangular array of numbers:

    • $X = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 \\ 3 & 4 & 5 & 6 & 7 \\ 9 & 10 & 11 & 12 & 13 \end{bmatrix}$
  • Dimension of matrix: number of rows x number of columns
  • $X_{ij}$ = "i,j entry", the entry in the $i^{th}$ row, $j^{th}$ column.

  • Vector: An n x 1 matrix.

    • $y = \begin{bmatrix} 1 \\ 3 \\ 13 \end{bmatrix}$
    • $y_{i}$ = the $i^{th}$ element
  • Convention: uppercase letters for matrices, lowercase letters for vectors.

4.2. Addition and Scalar Multiplication

  • Matrix Addition:
    • Two matrices must have an equal number of rows and columns to be added.
    • Sample:
      • $\begin{bmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 3 & 1 & 6 \\ 6 & 7 & 2 \end{bmatrix} = \begin{bmatrix} 4 & 3 & 9 \\ 9 & 11 & 7 \end{bmatrix}$
      • Error! $\begin{bmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 3 & 1 \\ 6 & 7 \end{bmatrix}$ is undefined (the dimensions differ)
  • Scalar Multiplication
    • Sample:
      • $3 \times \begin{bmatrix} 3 & 1 \\ 6 & 7 \end{bmatrix} = \begin{bmatrix} 9 & 3 \\ 18 & 21 \end{bmatrix}$
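The addition and scalar multiplication examples above can be checked with NumPy (assumed available here):

```python
import numpy as np

A = np.array([[1, 2, 3], [3, 4, 5]])
B = np.array([[3, 1, 6], [6, 7, 2]])
print(A + B)  # elementwise sum: [[4 3 9], [9 11 7]]

print(3 * np.array([[3, 1], [6, 7]]))  # scalar multiple: [[9 3], [18 21]]

# Adding matrices of different shapes is an error, just as in the notes:
# A + np.array([[3, 1], [6, 7]]) raises a ValueError.
```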

4.3. Matrix Vector Multiplication

  • week-1-m-11
  • Example:
    • $\begin{bmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \end{bmatrix} \times \begin{bmatrix} 4 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 16 \\ 34 \end{bmatrix}$
    • Row 1: $1 \times 4 + 2 \times 3 + 3 \times 2 = 16$
  • week-1-m-13
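The same matrix-vector product, computed with NumPy's `@` operator:

```python
import numpy as np

A = np.array([[1, 2, 3], [3, 4, 5]])  # 2 x 3 matrix
v = np.array([4, 3, 2])               # 3 x 1 vector
print(A @ v)  # -> [16 34]: row 2 is 3*4 + 4*3 + 5*2 = 34
```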

4.4. Matrix Matrix Multiplication

  • week-1-m-16
  • week-1-m-18

4.5. Matrix Multiplication Properties

  • Let A and B be matrices. Then in general, $AB \ne BA$ (matrix multiplication is not commutative)
  • $ABC$ (matrix multiplication is associative):
    • Let $D = BC$, then compute $AD$
    • Let $E = AB$, then compute $EC$; the result is the same
  • Identity Matrix:
    • week-1-m-22
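These properties can be demonstrated on small hypothetical matrices with NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])

print(np.array_equal(A @ B, B @ A))              # False: not commutative
print(np.array_equal((A @ B) @ C, A @ (B @ C)))  # True: associative
print(np.array_equal(A @ np.eye(2), A))          # True: A * I = A (identity)
```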

4.6. Inverse and Transpose

  • Inverse: $AA^{-1} = A^{-1}A = I$, like $3 \times 3^{-1} = 1$
    • week-1-m-24
    • Not all numbers have an inverse; 0, for example, has none.
    • Matrices that don't have an inverse are called "singular" or "degenerate".
    • How do we find the inverse?
      • It can sometimes be done by hand, although this is very hard
      • In practice we use numerical software to compute a matrix's inverse
        • Lots of open source libraries
  • Transpose:
    • week-1-m-25
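Inverse and transpose with NumPy, on an illustrative 2×2 matrix; note that `np.linalg.inv` raises an error for a singular matrix:

```python
import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # -> True: A times its inverse is I

print(A.T)  # transpose: rows become columns

singular = np.array([[1.0, 2.0], [2.0, 4.0]])  # determinant 0: "degenerate"
# np.linalg.inv(singular) would raise numpy.linalg.LinAlgError.
```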

5. Words

  • algebra ['ældʒibrə] n. algebra
  • gradient ['ɡreidiənt] n. gradient; slope
  • descent [di'sent] n. descent; lineage; raid
  • gradient descent: the gradient descent method
  • intuition [,intju:'iʃən] n. intuition
  • linear ['liniə] adj. linear
  • regression [ri'ɡreʃən] n. regression; degeneration
  • plot [plɔt] vt. to plot; to chart; to divide up; to scheme
  • contour ['kɔntuə] n. contour; contour line; outline; circuit
  • contour plot: an isoline (contour) plot
  • contour figure: a contour figure
  • denote [di'nəut] vt. to denote; to indicate
  • derivative [də'rɪvətɪv] n. derivative (math); something derived
  • derivative term: the derivative term
  • subtly ['sʌtli] adv. subtly; skillfully; keenly
  • nuance ['nju:ɑ:ns, nju:'ɑ:ns] n. nuance; subtle difference
  • tangent ['tændʒənt] n. tangent line; tangent (trigonometry)
  • slope [sləup] n. slope; incline; gradient
  • converge [kən'və:dʒ] vt. to bring together vi. to gather; to converge
  • convex [kɔn'veks] adj. convex
  • scalar ['skeilə, -lɑ:] n. scalar (math); quantity
