Introduction to Probability

Multivariable Calculus

Algorithms: Part II

Algorithms: Part I

Introduction to Software Design and Architecture

Calculus Two: Sequences and Series

LAFF Linear Algebra

Stanford Machine Learning

Calculus One

Computational Thinking

Effective Thinking Through Mathematics

CS50 Introduction to Computer Science


Week 7 - Support Vector Machines(SVM)

  • Compared to both logistic regression and neural networks, SVM sometimes gives a cleaner way of learning non-linear functions

Optimization Objective

  • Start with logistic regression, and modify it a bit to get the SVM.
  • The logistic regression hypothesis is: $$h_{\theta}(x) = \frac{1}{1+e{-\theta{T}x} }$$
  • And the sigmoid activation function looks like:
  • If y=1y = 1, we want hθ(x)1h_{\theta}(x) \approx 1, θTx0\theta^{T}x \gg 0
  • If y=0y = 0, we want hθ(x)0h_{\theta}(x) \approx 0, θTx0\theta^{T}x \ll 0
  • Cost of single example: $$\begin{aligned} & -(y\log{h_{\theta}(x)} + (1 - y)\log{(1 - h_{\theta}(x)}))z \ = &-(y\log{\frac{1}{1+e{-\theta{T}x} }} - (1 - y)\log{(1 - \frac{1}{1+e{-\theta{T}x} }}))\end{aligned}$$
  • If y=1y = 1 (want θTx0\theta^Tx \gg 0), then the cost will be: log11+ez-\log\dfrac{1}{1+e^{-z} }
  • If y=0y = 0 (want θTx0\theta^Tx \ll 0), then the cost will be: log(111+ez)-\log{(1 - \dfrac{1}{1+e^{-z} }})
  • To build SVM, we redefine the cost functions:
    • If y=1y = 1:
      • Instead of a curve line, we create two straight lines which acts as an approximation to the logistic regression.
      • We call this function cost1(z)cost_1(z).
    • If y=0y = 0:
      • We call this function cost0(z)cost_0(z).
    • How to use formula to represent those two lines? At least to calculate the cost?
      • Everything Prof Ng said about SVM training was an intuition. The actual SVM training method provided in the svmTrain() function is the SMO method. That method is too complex to be included as part of the course. – from Discuss Forms
  • The complete SVM cost function
    • As a comparison we have logistic regression: $${\underset{\theta}{\text{min} } } \ - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$
    • Replace the cost function with cost0(z)cost_0(z) and cost1(z)cost_1(z), we get: $$\underset{\theta}{\text{min} }\ \frac{1}{m} \sum_{i=1}^m \large[ y{(i)} cost_1(\thetaTx^{(i)}) + (1 - y{(i)}) cost_0(\thetaTx^{(i)})\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$
    • In convention with SVM notation, we adjust a little bit:
      • Get rid of 1m\dfrac{1}{m} term.
        • Because 1m\frac{1}{m} is a constant, so we should still end up with the same optimal value for θ\theta.
      • For logistic regression we have:
        • Training data set term A: $$- \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large]$$
        • and Regularization term B: $$\frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$
        • To conclude it, we get A+λBA + \lambda B
        • So λ\lambda is the trade-off between training data set and regularization terms.
        • Instead of using A+λBA + \lambda B, In SVM, we rewrite it as CA+BCA + B, which C is a constant.
        • We can think of the parameter C playing a role similar to 1λ\frac{1}{\lambda}.
    • Overall optimization objective function for the SVM is: $$\underset{\theta}{\text{min} }\ C \sum_{i=1}^m \large[ y{(i)} \text{cost}_1(\thetaTx^{(i)}) + (1 - y{(i)}) \text{cost}_0(\thetaTx^{(i)})\large] + \frac{1}{2}\sum_{j=1}^n \theta_j^2$$
  • SVM Hypothesis:
    • Unlike logistic, hθ(x)h_{\theta}(x) doesn’t give us a probability, instead we get a direct prediction of 1 or 0
    • hθ(x)={1, if θTx00, otherwise.h_{\theta}(x) = \left\{ \begin{array}{rl} 1 & \text{, if } \theta^Tx \ge 0 \\ 0 & \text{, } \text{otherwise}. \end{array} \right.

Large Margin Intuition

  • Sometimes SVM called Large Margin Classifier.
  • Here are two plots for the cost function:
  • If we want the cost to be small, then we will need zz to be more that 1 not just 0. then:
    • If y=1y = 1, we want θTx1\theta^Tx \ge 1 (not just 0\ge 0).
    • If y=0y = 0, we want θTx1\theta^Tx \le -1 (not just 0\le 0).

SVM Decision Boundary

  • Use the simplified cost function minθ=CA+B\text{min}_{\theta} = CA+B
  • If C is a huge number, like 100,000, then we will need to make A to be very small, best to be 0, and in the same time minimize B.
    • Whenever y(i)=1y^{(i)} = 1: θTx(i)1\theta^Tx^{(i)} \ge 1.
    • Whenever y(i)=0y^{(i)} = 0: θTx(i)1\theta^Tx^{(i)} \le -1.
    • minθ 12i=1nθj2\underset{\theta}{\text{min} }\ \dfrac{1}{2}\sum_{i=1}^n\theta_j^2
  • Let’s check the result in a linearly separable case
    • The green and magenta lines are functional decision boundaries which could be chosen by logistic regression
    • The black line, by contrast is chosen by the SVM because of this safety net imposed by the optimization graph
    • We can see that, there is a large margin between the black line and the training sets which is called the margin of the support vector machine.
    • In another situation:
      • We can’t get the result like the black line, since we set A to 0, so the SVM is very sensitive to outliers. And we probably will get the magenta line below. Which lead to another way to fix this: set C to a small number, which means ignoring some outliers.

Mathematics Behind Large Margin Classification

Vector Inter Productions

  • u=[u1u2]u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}, v=[v1v2]v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}
  • length of vector: u=u12+u22\lVert u \rVert = \sqrt{u_1^2+u_2^2}
  • p = length of projection of v onto u.
    • p is signed, which means it can be negative number.
  • utv=pu=u1v1+u2v2\begin{aligned}u^tv &= p \cdot \lVert u \rVert \\ &= u_1v_1 + u_2v_2 \end{aligned}

SVM Decision Boundary

  • minθ 12i=1nθj2\underset{\theta}{\text{min} }\ \dfrac{1}{2}\sum_{i=1}^n\theta_j^2
  • s.t. θTx(i)1  if y(i)=1θTx(i)1  if y(i)=0\begin{aligned}\text{s.t. } \theta^Tx^{(i)} &\ge 1\ \text{ if }y^{(i)} = 1 \\ \theta^Tx^{(i)} &\le -1\ \text{ if } y^{(i)} = 0\end{aligned}
  • Simplification: set θ0=0n=2\theta_0 = 0\text{, }n = 2(only 2 features). Then:
    • minθ 12i=1nθj2=12(θ12+θ22)=12(θ12+θ22)2=12θ2\underset{\theta}{\text{min} }\ \dfrac{1}{2}\sum_{i=1}^n\theta_j^2 = \frac{1}{2}(\theta_1^2+\theta_2^2) = \frac{1}{2}(\sqrt{\theta_1^2+\theta_2^2})^2 = \frac{1}{2}{\lVert \theta \rVert}^2
    • θTx(i)=θ1x1(i)+θ2x2(i)=p(i)θ\theta^Tx^{(i)} = \theta_1x_1^{(i)} + \theta_2x_2^{(i)} = p^{(i)} \cdot {\lVert \theta \rVert}
    • Redefine these functions, we get:
      • s.t. p(i)θ1  if y(i)=1p(i)θ1  if y(i)=0\begin{aligned}\text{s.t. } p^{(i)} \cdot {\lVert \theta \rVert} &\ge 1\ \text{ if }y^{(i)} = 1 \\ p^{(i)} \cdot {\lVert \theta \rVert} &\le -1\ \text{ if } y^{(i)} = 0\end{aligned}
      • where p(i)p^{(i)} is the projection of x(i)x^{(i)} onto the vector θ\theta.
    • So we want to maximize p(i)p^{(i)}, so θ\lVert \theta \rVert can be small.
    • Note that, θ0=0\theta_0 = 0, so the boundary has to pass through the origin (0,0).
    • Let’s consider the training examples like this:
    • And draw an option, to see if it’s possible that SVM would choose.
      • Note that θ is always at 90 degrees to the decision boundary (check linear algebra to find out why).
      • So vector θ\theta should be:
      • Then we can find the projection p(i)p^{(i)} of x(i)x^{(i)} onto θ\theta:
        • We know that we need to make p(i)θ1p^{(i)} \cdot {\lVert \theta \rVert} \ge 1, so:
          • if p is small, then θ{\lVert \theta \rVert} should be very large
        • Similarly, for the negative examples.
      • But the optimization objective is trying to find a set of parameters where the norm of theta is small. So this doesn’t seem like a good choice.
    • Let’s another option:
      • Now if you look at the projection p(i)p^{(i)} of x(i)x^{(i)} onto θ\theta, we find that p becomes large and θ{\lVert \theta \rVert} can be small.
      • This is why the SVM choses this hypothesis as the better one, and how we generate the large margin.
  • Finally, we did this derivation assuming θ0=0\theta_0 = 0.
    • It just means we are entertaining decision boundaries that pass through the origins of decision boundaries pass through the origin (0,0).
    • If you allow θ0\theta_0 to be other values then this simply means you can have decision boundaries which cross through the x and y values at points other than (0,0).

Kernels I

  • Kernels is used to adapt support vector machines in order to develop complex nonlinear classifiers.
  • Let’s see a example(find a non-linear boundary):
    • One way to distinguish the positive and negative examples is to come up with a set of complex polynomial features: $$h_{\theta}(x) = \left{ \begin{array}{rl}
      1 & \text{, if }\ \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \ge 0 \
      0 & \text{, otherwise}.
      \end{array} \right.$$
    • Another way is using a new notation to denote x1,x2,x1x2,x12,x22x_1, x_2, x_1 x_2, x_1^2, x_2^2 as f1,f2,f3,f4,f5,f_1, f_2, f_3, f_4, f_5, \cdots, so the hypothesis will be: $$h_{\theta}(x) = \left{ \begin{array}{rl}
      1 & \text{, if }\ \theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \theta_4 f_4 + \theta_5 f_5 + \cdots \ge 0 \
      0 & \text{, otherwise}.
      \end{array} \right.$$
    • Is there a different/better choice of the features f1,f2,f3,f_1, f_2, f_3, \cdots ?

Create New Features

  • First, manually pick a few points. In this case, we picked three points, and call them landmarks (l(1),l(2),l(3)l^{(1)}, l^{(2)}, l^{(3)}).
    • Later, will explain how to choose l(i)l^{(i)}
  • Second, define f1,f2,f3f_1, f_2, f_3 as the similarity between xx and l(i)l^{(i)}(ignore x0x_0). Then: $$\begin{aligned}
    f_1 &= \text{similarity}(x, l^{(1)}) = \text{exp}(-\frac{ {\lVert x - l^{(1)} \rVert}^2 }{2\sigma^2}) \
    f_2 &= \text{similarity}(x, l^{(2)}) = \text{exp}(-\frac{ {\lVert x - l^{(2)} \rVert}^2 }{2\sigma^2}) \
    • This similarity function is called a Kernel. And this exp function is a Gaussian Kernel.
    • So, instead of writing similarity between x and l we might write $$f_1 = k(x, l^{(1)})$$
    • My Note: σ\sigma: the value of standard deviation. Gaussian Kernel is calculating the value correspond with mean(l(1)l^{(1)}) and σ\sigma.

Kernels and Similarity

  • If xl(1)x \approx l^{(1)}:
    • f1exp(022σ2)1f_1 \approx \text{exp}(-\frac{0^2}{2\sigma^2}) \approx 1
  • If xx is far from l(1)l^{(1)}:
    • f1exp((large number)22σ2)0f_1 \approx \text{exp}(-\frac{(\text{large number})^2}{2\sigma^2}) \approx 0
  • Example:
    • l(1)=[35]l^{(1)} = \begin{bmatrix}3 \\ 5\end{bmatrix}, f1=exp(xl(1)22σ2)f_1 = \text{exp}(-\dfrac{ {\lVert x - l^{(1)} \rVert}^2 }{2\sigma^2}).
    • Plot f1f_1 vs the kernel function, we get plots like:
      • Notice that when x = [3, 5], then f1=1f_1 = 1.
      • As x moves away from [3,5], then the feature takes on values close to zero. So this measures how close x is to this landmark.
      • σ2\sigma^2 is parameter of the Gaussian Kernel, which defines the steepness of the rise around the landmark. We can see that the slop are getting smoother as σ2\sigma^2 are bigger.
  • Given this definition, what kind of hypothesis we can learn:
    • For example, let’s say we already run the algorithm and got θ0=0.5,θ1=1,θ2=1,θ3=0\theta_0 = -0.5, \theta_1 = 1, \theta_2 = 1, \theta_3 = 0. What happens if we evaluate the magenta dot below?
      • We know that xx is close to l(1)l^{(1)}, so f1f_1 will be close to 1, and f2f_2 and f3f_3 and will be close to 0.
      • So θ0+θ1f1+θ2f2+θ3f3=0.5+1+0+0=0.50\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 = -0.5 + 1 + 0 + 0 = 0.5 \ge 0. Then we predict y=1y = 1.
      • After we tried different points, we will eventually get this non-linear boundary:
  • Next segment will talk about:
    • How we choose the landmarks;
    • What other kernels we can use (other than the Gaussian Kernel).

Kernel II

Choosing the Landmarks

  • Put landmarks as exactly the same locations as the training examples.

  • Then we will get m landmarks, which has the same number with the training examples.

  • SVM with Kernels

    • Given (x(1),y(1)),(x(2),y(2)),,(x(m),y(m))(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)}),

    • Choose l(1)=x(1),l(2)=x(2),,l(m)=x(m)l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \ldots, l^{(m)} = x^{(m)}.

    • For training example (x(i),y(i)x^{(i)}, y^{(i)}): $$\begin{aligned}
      f_1^{(i)} &= \text{similarity}(x^{(i)}, l^{(1)}) \
      f_2^{(i)} &= \text{similarity}(x^{(i)}, l^{(2)}) \
      &\vdots \
      f_i^{(i)} &= \text{similarity}(x^{(i)}, l^{(i)}) = \text{similarity}(x^{(i)}, x^{(i)}) = \text{exp}(-\frac{0}{2\sigma^2}) = 1\
      &\vdots \
      f_m^{(i)} &= \text{similarity}(x^{(i)}, l^{(m)})

    • We got f(i)=[f0(i)f1(i)fm(i)]f^{(i)} = \begin{bmatrix}f_0^{(i)} \\ f_1^{(i)} \\ \vdots \\ f_m^{(i)} \\ \end{bmatrix}(f0(i)=1f_0^{(i)} = 1), which are our new training examples.

      • Our training set become a m×mm \times m matrix, because every sample have m features.
      • f(i)Rm+1f^{(i)} \in \mathbb{R}^{m+1} (included f0f_0).
    • Then, our hypothesis will be: Given xx, compute features f(i)Rm+1f^{(i)} \in \mathbb{R}^{m+1}. Predict y=1y = 1 if θTf0\theta^Tf \ge 0.

    • And the training: $$\underset{\theta}{\text{min} }\ C \sum_{i=1}^m \large[ y{(i)} cost_1(\thetaTf^{(i)}) + (1 - y{(i)}) cost_0(\thetaTf^{(i)})\large] + \frac{1}{2}\sum_{j=1}^m \theta_j^2$$

      • Note that m = n, the number of features equals the number of training data examples.
      • Another mathematics detail about the training formula:
        • The regulation part: j=1mθj2=θTθ\displaystyle\sum_{j=1}^m \theta_j^2 = \theta^T\theta
        • What many implementations do is: θTMθ\theta^TM\theta
          • Where the matrix M depends on the kernel you are using.
          • This is like a rescale version of the parameter vector theta.
          • Allows the SVM to run much more efficiently.
  • You can apply kernels to other algorithms

    • But they tend to be very computationally expensive
    • But the SVM is far more efficient - so more practical

SVM Parameters

  • C(1λ\frac{1}{\lambda})
    • Large C: Lower bias, high variance.
    • Small C: Higher bias, low variance.
  • σ2\sigma^2:
    • Large σ2\sigma^2: Features fif_i vary more smoothly. Higher bias, lower variance.
    • Small σ2\sigma^2: Features fif_i vary less smoothly. Lower bias, higher variance.

Using an SVM

  • Use SVM software package(e.g. liblinear, libsvm,…) to solve for parameters θ\theta.
  • What do you need to do, is to specify:
    • Choice of parameter C
    • Choice of kernel

Choice of kernel

  • No kernel(“linear kernel”)
    • Predict “y = 1” if θTx0\theta^Tx \ge 0
  • Gaussian kernel:
    • fi=exp(xl(2)22σ2)f_i = \text{exp}(-\frac{ {\lVert x - l^{(2)} \rVert}^2 }{2\sigma^2}), where l(i)=x(i)l^{(i)} = x^{(i)}
    • Need to choose σ2\sigma^2
    • Note: Do perform feature scaling before using a Gaussian kernel.
  • Other choice of kernel
    • Not all similarity functions make valid kernels
    • Need to satisfy technical condition called “Mercer’s Theorem” to make sure SVM packages’ optimizations run correctly, an do not diverge.
    • Many off-the-shelf kernels available:
      • Polynomial kernel: k(x,l)=(xTl+constant)degreek(x,l) = (x^Tl+\text{constant})^{\text{degree} }
        • For example: k(x,l)=(xTl+1)2k(x,l) = (x^Tl+1)^2
      • More esoteric: String kernel, chi-square kernel, histogram intersection kernel,…

Multi-class Classification

  • Many SVM packages already have built-in multi-class classification functionality.
  • Otherwise, use one-vs-all method.

Logistic Regression vs. SVMs

  • When should we use one algorithm versus the other?
  • n = number of features (xRn+1x \in \mathbb{R}^{n+1}), m = number of training examples
  • If n is large(relative to m) (e.g. n = 10,000, m = 10-1000):
    • Use logistic regression, or SVM without a kernel(“linear kernel”)
  • If n is small, m is intermediate (e.g. n = 1 - 1000, m = 10 - 10,000):
    • Use SVM with Gaussian kernel
  • If n is small, m is large (e.g. n = 1 - 1000, m = 10 - 50,000+):
    • Create/add more features, then use logistic regression or SVM without a kernel
  • Neural network likely to work well for most of these settings, but may be slower to train.