Lecture 12

Machine Learning

  • Definition
  • Basic Paradigm
    • Observe set of examples: training data
    • Infer something about process that generated that data
    • Use inference to make predictions about previously unseen data: test data
  • Procedures
    • Representation of the features
      • e.g., separate people by features (man/woman, educated/not, etc.)
    • Distance metric for feature vectors
      • so that feature vectors can be compared on the same scale
    • Objective function and constraints
    • Optimization method for learning the model
    • Evaluation method
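
As a rough illustration of the first two steps (a feature representation and a distance metric), here is a minimal sketch. The people and features are invented, and hammingDist is just one possible metric for binary features, not anything from the course code.

    # Hypothetical binary feature representation (illustrative only).
    features = ['male', 'educated', 'employed']
    people = {
        'p1': [1, 1, 0],
        'p2': [0, 1, 1],
        'p3': [1, 0, 0],
    }

    def hammingDist(v1, v2):
        """A simple distance metric for binary feature vectors:
           the number of positions at which they differ."""
        return sum(1 for a, b in zip(v1, v2) if a != b)

    print(hammingDist(people['p1'], people['p2']))  # 2
    print(hammingDist(people['p1'], people['p3']))  # 1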

Supervised Learning

  • Start with set of feature vector/value pairs
  • Goal: find a model that predicts a value for a previously unseen feature vector
  • Regression models predict a real number
    • As with linear regression
  • Classification models predict a label (chosen from a finite set of labels)
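
A minimal sketch of the two flavors, with invented data; pylab is used here only because the later code in these notes uses it, and training/predictLabel are hypothetical names.

    import pylab

    # Regression: predict a real number for an unseen input (invented data).
    xVals = [1.0, 2.0, 3.0, 4.0]
    yVals = [1.1, 1.9, 3.2, 3.9]
    a, b = pylab.polyfit(xVals, yVals, 1)      # fit y = a*x + b
    print(a*5.0 + b)                           # prediction for unseen x = 5.0

    # Classification: predict a label from a finite set, here by the
    # nearest training example (hypothetical feature vector/label pairs).
    training = [([170.0, 60.0], 'A'), ([180.0, 80.0], 'B')]

    def predictLabel(fv):
        closest = min(training,
                      key=lambda ex: sum((x - y)**2 for x, y in zip(ex[0], fv)))
        return closest[1]

    print(predictLabel([172.0, 65.0]))         # 'A'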

Unsupervised Learning

  • Start with a set of feature vectors
  • Goal: uncover some latent structure in the set of feature vectors
  • Clustering is the most common technique
    • Define some metric that captures how similar one feature vector is to another
    • Group examples based on this metric

Difference between Supervised and Unsupervised

  • With labels, we can classify the data into two clusters by weight or height, or four clusters by weight and height; this is supervised learning.
  • Without labels, we have to figure out how to cluster the data ourselves; this is unsupervised learning.

Choose Feature Vectors

  • Why be careful?
    • Irrelevant features can lead to a bad model.
    • Irrelevant features can greatly slow the learning process.
  • How?
    • signal-to-noise ratio (SNR)
      • Think of it as the ratio of useful input to irrelevant input.
    • The purpose of feature extraction is to separate those features in the available data that contribute to the signal from those that are merely noise.

Distance Between Vectors

Minkowski Metric

  • $\text{dist}(X1, X2, p)=(\displaystyle\sum_{k=1}^{\text{len}}\text{abs}(X1_{k}-X2_{k})^p)^{1/p}$

  • p = 1: Manhattan Distance

  • p = 2: Euclidean Distance

    def minkowskiDist(v1, v2, p):
        """Assumes v1 and v2 are equal-length arrays of numbers
           Returns Minkowski distance of order p between v1 and v2"""
        dist = 0.0
        for i in range(len(v1)):
            dist += abs(v1[i] - v2[i])**p
        return dist**(1.0/p)
  • For example:

    • To compare the distance between the star and the circle with the distance between the cross and the circle:
    • Using Manhattan Distance, the distances are 3 and 4
    • Using Euclidean Distance, the distances are 3 and $\sqrt{2^2+2^2} \approx 2.8$
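
A quick check of those numbers with the minkowskiDist function above. The coordinates are not given in the notes, so circle = (0, 0), star = (0, 3), cross = (2, 2) are assumed values chosen to match the distances quoted.

    # Assumed coordinates (not in the original notes), chosen to match the
    # distances quoted above.
    circle, star, cross = [0, 0], [0, 3], [2, 2]

    print(minkowskiDist(star, circle, 1))   # Manhattan: 3.0
    print(minkowskiDist(cross, circle, 1))  # Manhattan: 4.0
    print(minkowskiDist(star, circle, 2))   # Euclidean: 3.0
    print(minkowskiDist(cross, circle, 2))  # Euclidean: ~2.83
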
Using Distance Matrix for Classification

  • Procedures

    • Simplest approach is probably nearest neighbor
    • Remember training data
    • When predicting the label of a new example
      • Find the nearest example in the training data
      • Predict the label associated with that example
  • To predict the color of X

    • The closest one is pink, so X should be pink
  • K-nearest Neighbors

    • Find the K nearest neighbors and choose the label associated with the majority of those neighbors (a minimal sketch follows at the end of this section)
    • Usually we use an odd number of neighbors to avoid ties; in this example, k = 3
  • Advantages and Disadvantages of KNN

    • Advantages
      • Learning fast, no explicit training
      • No theory required
      • Easy to explain method and results
    • Disadvantages
      • Memory intensive and predictions can take a long time
      • There are better algorithms than brute force
      • No model to shed light on process that generated data
  • For Example

    • To predict whether the zebra, python, and alligator are reptiles or not.
    • Calculating the distances, we get:
      • The three closest animals to the alligator are the boa constrictor, the chicken, and the dart frog, and two of them are not reptiles, so the alligator is classified as not a reptile.
      • But we know an alligator is a reptile. So what went wrong?
      • Notice that all of the features are 0 or 1 except the number of legs, which therefore gets disproportionate weight.
        • So, instead of “number of legs,” we use the binary feature “has legs,” which is also 0 or 1.
    • Now the three closest animals to the alligator are the boa constrictor, the chicken, and the cobra, and two of them are reptiles, so the alligator is classified as a reptile.
  • A More General Approach: Scaling

    • Z-scaling
      • Each feature has a mean of 0 & a standard deviation of 1
    • Interpolation
      • Map minimum value to 0, maximum value to 1, and linearly interpolate
    import pylab

    def stdDev(X):
        """Standard deviation of X (minimal helper; the course code defines its own)"""
        mean = float(sum(X))/len(X)
        tot = 0.0
        for x in X:
            tot += (x - mean)**2
        return (tot/len(X))**0.5

    def zScaleFeatures(vals):
        """Assumes vals is a sequence of floats"""
        result = pylab.array(vals)
        mean = float(sum(result))/len(result)
        result = result - mean
        return result/stdDev(result)

    def iScaleFeatures(vals):
        """Assumes vals is a sequence of floats"""
        minVal, maxVal = min(vals), max(vals)
        fit = pylab.polyfit([minVal, maxVal], [0, 1], 1)
        return pylab.polyval(fit, vals)
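
As promised above, a minimal K-nearest-neighbors sketch built on minkowskiDist. The animal vectors, their feature values, and the helper name kNearestLabel are illustrative assumptions, not the course's actual data or code.

    # Illustrative KNN: features are (egg-laying, has scales, poisonous,
    # cold-blooded, has legs); the values below are assumptions, not the
    # lecture's exact table.
    training = [([1, 1, 0, 1, 0], 'reptile'),      # boa constrictor
                ([1, 1, 0, 1, 1], 'reptile'),      # cobra
                ([1, 0, 0, 0, 1], 'not reptile'),  # chicken
                ([1, 0, 1, 0, 1], 'not reptile')]  # dart frog

    def kNearestLabel(fv, k=3):
        """Predict a label for feature vector fv by majority vote among
           the k nearest training examples (Euclidean distance)."""
        byDist = sorted(training, key=lambda ex: minkowskiDist(ex[0], fv, 2))
        labels = [label for (_, label) in byDist[:k]]
        return max(set(labels), key=labels.count)

    print(kNearestLabel([1, 1, 0, 1, 1]))  # alligator-like vector -> 'reptile'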

Clustering

  • Partition examples into groups (clusters) such that examples in a group are more similar to each other than to examples in other groups
  • Unlike classification, there is not typically a “right answer”
  • The answer is dictated by the feature vectors and the distance metric, not by a ground-truth label

Optimization Problem

  • Clustering is an optimization problem. The goal is to find a set of clusters that optimizes an objective function, subject to some set of constraints.
  • Given a distance metric that can be used to decide how close two examples are to each other, we need to define an objective function that
    • Minimizes the distance between examples in the same cluster, i.e., minimizes the dissimilarity of the examples within a cluster
  • To compute the variability of the examples within a cluster, `c`:
    • First compute the mean of the feature vectors of all the examples in the cluster (`sum(V)/float(len(V))`, where `V` is the list of feature vectors; more precisely, the Euclidean mean)
    • Then compute the distance between each feature vector and that mean:
    • $\text{variability}(c)=\displaystyle\sum_{e \in c}\text{distance}(\text{mean}(c), e)^2$
  • The definition of variability within a single cluster, `c`, can be extended to define a dissimilarity metric for a set of clusters, `C`:
    • $\text{dissimilarity}(C)=\displaystyle\sum_{c \in C}\text{variability}(c)$
  • The optimization problem is not simply to find a set of clusters, `C`, such that `dissimilarity(C)` is minimized, because that can be minimized trivially by putting each example in its own cluster.
  • Instead, we add a constraint, e.g., a minimum distance between clusters or a maximum number of clusters `k`, and minimize `dissimilarity(C)` subject to that constraint.
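
A small sketch of these two definitions, reusing minkowskiDist from above; the function names variability and dissimilarity and the toy clusters are illustrative, not the course code.

    def variability(cluster):
        """cluster: a list of equal-length feature vectors (lists of floats).
           Sum of squared distances from each example to the cluster's mean."""
        mean = [sum(xs)/float(len(cluster)) for xs in zip(*cluster)]
        return sum(minkowskiDist(mean, e, 2)**2 for e in cluster)

    def dissimilarity(clusters):
        """clusters: a list of clusters, each a list of feature vectors."""
        return sum(variability(c) for c in clusters)

    c1 = [[0.0, 0.0], [1.0, 1.0]]
    c2 = [[10.0, 10.0], [11.0, 11.0]]
    print(dissimilarity([c1, c2]))  # ~2.0 (each cluster contributes 0.5 + 0.5)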

K-means Clustering

  • Constraint: exactly `k` non-empty clusters
  • Use a greedy algorithm to find an approximation to minimizing the objective function
  • Algorithm

        randomly choose k examples as initial centroids
        while true:
            create k clusters by assigning each
              example to closest centroid
            compute k new centroids by averaging
              examples in each cluster
            if centroids don’t change:
                break

  • Sample: [lecture12-4.py](./unit-4/lecture12-3.py)
  • `k=4`, Initial Centroids:
    • <img src="https://i.imgur.com/V8dCSjw.jpg" style="width:200px" />
  • Result:
    • <img src="https://i.imgur.com/AlE5EKX.jpg" style="width:200px" />
  • Unlucky Initial Centroids
    • `k=4`, Initial Centroids:
      • <img src="https://i.imgur.com/wp4iegG.jpg" style="width:200px" />
    • Result:
      • <img src="https://i.imgur.com/AH4D3uZ.jpg" style="width:200px" />
  • Mitigating Dependence on Initial Centroids

        best = kMeans(points)
        for t in range(numTrials):
            C = kMeans(points)
            if dissimilarity(C) < dissimilarity(best):
                best = C
        return best
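
A compact k-means sketch matching the pseudocode above, reusing minkowskiDist and the dissimilarity helper sketched earlier. The names kMeans and bestKMeans, the explicit k parameter, and the toy points are illustrative assumptions, not the contents of lecture12-4.py.

    import random

    def kMeans(points, k, p=2):
        """points: a list of equal-length feature vectors (lists of floats).
           Returns k clusters found by the greedy k-means algorithm."""
        centroids = random.sample(points, k)   # randomly choose k initial centroids
        while True:
            # create k clusters by assigning each example to the closest centroid
            clusters = [[] for _ in range(k)]
            for pt in points:
                i = min(range(k), key=lambda j: minkowskiDist(centroids[j], pt, p))
                clusters[i].append(pt)
            # compute k new centroids by averaging the examples in each cluster
            newCentroids = []
            for i, c in enumerate(clusters):
                if c:
                    newCentroids.append([sum(xs)/float(len(c)) for xs in zip(*c)])
                else:   # keep the old centroid if a cluster ends up empty
                    newCentroids.append(centroids[i])
            if newCentroids == centroids:      # centroids don't change: done
                return clusters
            centroids = newCentroids

    def bestKMeans(points, k, numTrials=5):
        """Mitigate dependence on the initial centroids by keeping the best
           (lowest-dissimilarity) result over several random restarts."""
        best = kMeans(points, k)
        for t in range(numTrials):
            C = kMeans(points, k)
            if dissimilarity(C) < dissimilarity(best):
                best = C
        return best

    pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [11.0, 10.0]]
    print(bestKMeans(pts, 2))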
    

Wrapping Up Machine Learning

  • Use data to build statistical models that can be used to
    • Shed light on system that produced data
    • Make predictions about unseen data
  • Supervised learning
  • Unsupervised learning
  • Feature engineering
  • Goal was to expose you to some important ideas
    • Not to get you to the point where you could apply them
    • Much more detail, including implementations, in text