# Week 1 - Introduction

## 1. Several types of learning algorithms

### 1.1. Supervised Learning

• Regression, to predict results within a continuous output.
• For example, predict housing prices:
• Given this data, how to predict the value of a specific size?
• How to solve this?
• Straight line through data
• Maybe $150,000 • Polynomial • Maybe$200,000
• We'll learn how to choose function later.
• After we fitted a line to the data, we can predict the actual values for houses
• From the sample, all of the data we plotted called training data, and the action we fit lines to the data called linear regression
• Classification, to predict results in a discrete output.
• For example, checking a tumor is malignant or not.

### 1.2. Unsupervised Learning

• derive this structure by clustering the data based on relationships among the variables in the data.
• For example, grouping news stories into cohesive groups

## 2. Linear Regression (Model and Cost Function)

• Model Representation

• To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that $h^{(x)}$ is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

• $h_{\theta}(x)=\theta_{0}+\theta_{1}x$, and this means:

• Y is a linear function of x
• $\theta_{i}$ are parameters
• $\theta_{0}$ is zero condition
• $\theta_{1}$ is gradient
• This kind of function is a Linear Regression with one variable,
• also called Univariate Linear Regression

### 2.1. Cost Function

• A cost function lets us figure out how to fit the best straight line to our data

• Choosing values for $\theta_{1}$ (parameters)

• Different values give you different functions
• If $\theta_{0}$ is 1.5 and $\theta_{1}$ is 0 then we get straight line parallel with X along 1.5 @ y
• If $\theta_{1}$ is > 0 then we get a positive slope
• We want to find a straight line which colse to the actual ys as possible
• To formalize this:

• We want to solve a minimization problem
• Minimize $(h_{\theta}(x)-y)^{2}$
• minimize the difference between h(x) and y for each/any/every example
• Sum this over the training set:$\dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$
• $\frac{1}{2m}$
• $\frac{1}{m}$ - means we determine the average
• $\frac{1}{2m}$ - the 2 makes the math a bit easier, and doesn't change the constants we determine at all (i.e. half the smallest value is still the smallest value)
• This is a cost function:
• $J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$
• And we want to minimize this cost function
• Our cost function is (because of the summation term) inherently looking at ALL the data in the training set at any time
• The following image summarizes what the cost function does:

• Hypothesis - is like your prediction machine, throw in an x value, get a putative y value

• Cost - is a way to, using your training data, determine values for your $\theta$ values which make the hypothesis as accurate as possible
• This cost function is also called the squared error cost function

### 2.2. Cost Function - Intuition

• Simplified hypothesis (Assumes $θ_{0}= 0$)
• So we got: $h_{\theta}(x)=\theta_{1}x$, then the cost function will be:
• $J(\theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$
• $\displaystyle {\text{minimize} \atop \theta_{1}}$, $J(\theta_{1})$
• For example:
• $\theta=1$, then $h_{\theta}(x) = 0 + 1\times x$ , then $h_\theta (x_{i}) - y_{i} =0$ , then $J(\theta_{1})=0$
• Plots: ( $\theta_{1}=1, J(\theta_{1})=0$ ), ($\theta_{1}=0.5, J(\theta_{1})\approx0.58$ ), ($\theta_{1}=0, J(\theta_{1})\approx2.3$ )
• Then we got:
• The optimization objective for the learning algorithm is find the value of $\theta_{1}$ which $\text{minimizes } J(\theta_{1})$
• So, here $\theta_{1} = 1$ is the best value for $\theta_{1}$
• Using our original complex hypothesis with two paribles,So cost function is $J(\theta_{0}, \theta_{1})$
• Generates a 3D surface plot where axis are
• $X = \theta_{0}$
• $Z = \theta_{0}$
• $Y=J(\theta_{0}, \theta_{1})$
• Instead of a surface plot we can use a contour figures/plots
• Set of ellipses in different colors
• Each color is the same value of $J(\theta_{0}, \theta_{1})$ , but obviously plot to different locations because $\theta_{1}$ and $\theta_{0}$ will vary
• Imagine a bowl shape function coming out of the screen so the middle is the concentric circles

## 3. Parameter Learning

• Minimize cost function $J$
• Used all over machine learning for minimization
• Start by looking at a general $J()$ function
• Problem
• We have $J(\theta_{0}, \theta_{1})$
• We want to get $\min J(\theta_{0}, \theta_{1})$
• Gradient descent applies to more general functions
• $J(\theta_{0}, \theta_{1}, \theta_{2}... \theta_{n})$
• $\min J(\theta_{0}, \theta_{1}, \theta_{2}... \theta_{n})$
• Outline:
• Start with some $\theta_{0}, \theta_{1}$,
• Keep changing $\theta_{0}, \theta_{1}$ to reduce $J(\theta_{0}, \theta_{1})$ until we hopefully end up at a minimum
• Here we can see one initialization point led to one local minimum
• The other led to a different one

• $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ for $( j=0, j=1)$
• What does this all mean?
• Update $\theta_{j}$ by setting it to $(\theta_{j} - \alpha)$ times the partial derivative of the cost function with respect to $\theta_{j}$
• Notation
• $:=$
• Denotes assignment
• $\alpha$ (alpha) Is a number called the learning rate
• Controls how big a step you take
• If α is big have an aggressive gradient descent
• If α is small take tiny steps
• Derivative term: $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, will derive it later
• How this gradient descent algrorithm is impliemented
• Do this for $\theta_{0}$ and $\theta_{1}$
• For $j=0$ and $j=1$ means we simultaneously update both, like:
• Understanding the algorithm

• back to simpler function where we minimize one parameter

• $\min_{\theta_{1}}J(\theta_{1})$ where $\theta_{1}$ is a real number
• Two key terms in the algorithm

• $\alpha$ Alpha
• $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ Derivative term
• Notation nuances

• Partial derivative vs. Derivative
• Use partial derivative when we have multiple variables but only derive with respect to one
• Use derivative when we are deriving with respect to all the variables
• Derivative term

• Lets take the tangent at the point and look at the slope of the line
• So moving towards the mimum (down) will create a negative derivative, alpha is always positive, so will update $j(\theta_{1})$ to a smaller value
• Similarly, if we're moving up a slope we make $j(\theta_{1})$ a bigger numbers
• Alpha term (α)

• How does gradient descent converge with a fixed step size α?

• The intuition behind the convergence is that $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, $\theta_{1}$ approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

• $\theta_{1}:=\theta_{1}-\alpha \times 0$

### 3.3. Gradient Descent For Linear Regression

• Apply gradient descent to minimize the squared error cost function $J(\theta_{0},\theta_{1})$

• Now we have a partial derivative:

• So we need to determine the derivative for each parameter:

• \begin{aligned} \text{repeat until convergence: } \lbrace \\ \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \\ \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \\ \rbrace \end{aligned}

• To check this you need to know multivariate calculus. So go back here when you finished multivariable calculus.
• We have separated out the two cases for $\theta_{j}$ into separate equations for $\theta_{0}$ and $\theta_{1}$; and that for $\theta_1$ we are multiplying $x_i$ at the end due to the derivative. The following is a derivation of $\frac{\partial}{\partial \theta_j} J(\theta)$ for a single example :
• How does it work

• Risk of meeting different local optimum
• The linear regression cost function is always a convex function
• always has a single minimum
• Bowl shaped
• One global optima
• So gradient descent will always converge to global optima
• In action
• Initialize values to $\theta_{0} = 900, \theta_{1} = -0.1$
• End up at a global minimum
• This is actually "Batch" Gradient Descent
• Refers to the fact that over each step you look at all the training data
• Each step compute over m training examples
• Sometimes non-batch versions exist, which look at small data subsets
• We'll look at other forms of gradient descent (to use when m is too large) later in the course
• There exists a numerical solution for finding a solution for a minimum function
• Normal equations method Gradient descent scales better to large data sets though
• Used in lots of contexts and machine learning

## 4. Linear Algebra

### 4.1. Matrices and Vectors

• Matrix: Rectangular array of numbers:

• $X = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 \\ 3 & 4 & 5 & 6 & 7 \\ 9 & 10 & 11 & 12 & 13 \end{bmatrix}$
• Dimension of matrix: number of rows x number of columns
• $X_{ij}$ = "i,j entry" in the $i^{th}$ row, $j^{th}$ column.

• Vector: An n x 1 matrix.

• $y = \begin{bmatrix} 1 \\ 3 \\ 13 \end{bmatrix}$
• $y_{i}$ = $i^{th}$ element
• Uppercase letter for Matrix, and lowwercase for Vector.

### 4.2. Addition and Scalar Multiplication

• Two matrices must have an equal number of rows and columns to be added.
• Sample:
• $\begin{bmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 3 & 1 & 6 \\ 6 & 7 & 2 \end{bmatrix} = \begin{bmatrix} 4 & 3 & 9 \\ 9 & 11 & 7 \end{bmatrix}$
• Error!! $\begin{bmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \end{bmatrix} + \begin{bmatrix} 3 & 1 \\ 6 & 7 \end{bmatrix}$
• Scalar Multiplication
• Sample:
• $3 \times \begin{bmatrix} 3 & 1 \\ 6 & 7 \end{bmatrix} = \begin{bmatrix} 9 & 3 \\ 18 & 21 \end{bmatrix}$

### 4.3. Matrix Vector Multiplication

• Example:
• $\begin{bmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \end{bmatrix} \times \begin{bmatrix} 4 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 16 \\ 34 \end{bmatrix}$
• Row 1: $1 \times 4 + 2 \times 3 + 3 \times 2 = 16$

### 4.5. Matrix Multiplication Properties

• Let A and B be matirces. Then in general, $A * B \ne B * A$ . (not commutative)
• $A * B * C$ :
• Let $D = B * C$. Compute $A * D$
• Let $E = A * B$. Compute $E * C$
• Identity Matrix:

### 4.6. Inverse and Transpose

• Inverse: $AA^{-1} = A^{-1}A = I$ . Like $3 * 3^{-1} = 1$
• Not all numbers have an inverse. Like 0 doesn't have.
• Matrices that don't have an inverse are "singular" or "degenerate".
• How did you find the inverse
• Turns out that you can sometimes do it by hand, although this is very hard
• Numerical software for computing a matrices inverse
• Lots of open source libraries
• Transpose:

## 5. Words

• algebra ['ældʒibrə] n. 代数学
• descent [di'sent] n. 下降；血统；袭击
• intuition [,intju:'iʃən] n. 直觉；直觉力；
• regression [ri'ɡreʃən] n. 回归；退化；
• plot [plɔt] vt. 密谋；绘图；划分；标绘
• contour ['kɔntuə] n. 轮廓；等高线；周线；电路；概要
• contour plot 等值线图
• contour figure 轮廓图
• denote [di'nəut] vt. 表示，指示
• derivative [də'rɪvətɪv] n. 派生物；导数
• derivative term 导数项