# Lecture 12

## 1. Machine Learning

• Definition
• Observe set of examples: training data
• Infer something about process that generated that data
• Use inference to make predictions about previously unseen data: test data
• Procedures
• Representation of the features
• e.g., describe people by features (man/woman, educated/not, etc.)
• Distance metric for feature vectors
• ensure feature values are comparable, i.e., on the same scale
• Objective function and constraints
• Optimization method for learning the model
• Evaluation method

### 1.1. Supervised Learning

• Goal: find a model that predicts a value for a previously unseen feature vector
• Regression models predict a real number, as with linear regression
• Classification models predict a label (chosen from a finite set of labels)

### 1.2. Unsupervised Learning

• Goal: uncover some latent structure in the set of feature vectors
• Clustering is the most common technique
• Define some metric that captures how similar one feature vector is to another
• Group examples based on this metric

### 1.3. Difference between Supervised and Unsupervised

• With labels, we can classify the data into two clusters by weight or height, or into four clusters by weight and height; this is supervised learning.
• Without labels, figuring out how to cluster the data is unsupervised learning.

### 1.4. Choose Feature Vectors

• Why be careful?
• Irrelevant features can greatly slow the learning process.
• How?
• signal-to-noise ratio (SNR)
• Think of it as the ratio of useful input to irrelevant input.
• The purpose of feature extraction is to separate those features in the available data that contribute to the signal from those that are merely noise.

### 1.5. Distance Between Vectors

#### Minkowski Metric

• $dist(X1, X2, p)=(\displaystyle\sum_{k=1}^{len}|X1_{k}-X2_{k}|^p)^{1/p}$
• p = 1: Manhattan Distance
• p = 2: Euclidean Distance

```python
def minkowskiDist(v1, v2, p):
    """Assumes v1 and v2 are equal-length arrays of numbers
       Returns Minkowski distance of order p between v1 and v2"""
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1.0/p)
```

• For example:
• Compare the distance between star and circle with the distance between cross and circle:
• Using Manhattan distance, they are 3 and 4
• Using Euclidean distance, they are 3 and $2.8 \approx \sqrt{2^2+2^2}$
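A rough check of the two metrics using the `minkowskiDist` function above; the coordinates below are assumed for illustration, chosen so the distances match the example:

```python
def minkowskiDist(v1, v2, p):
    """Minkowski distance of order p between equal-length vectors."""
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1.0/p)

# Hypothetical coordinates: star directly above circle, cross offset diagonally
circle, star, cross = (0, 0), (0, 3), (2, 2)

print(minkowskiDist(circle, star, 1))   # Manhattan: 3.0
print(minkowskiDist(circle, cross, 1))  # Manhattan: 4.0
print(minkowskiDist(circle, star, 2))   # Euclidean: 3.0
print(minkowskiDist(circle, cross, 2))  # Euclidean: ~2.83
```

Note how the cross is farther than the star under Manhattan distance but closer under Euclidean distance, so the choice of metric can change which neighbor is "nearest."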
##### Using Distance Matrix for Classification
• Procedures
• Simplest approach is probably nearest neighbor
• Remember training data
• When predicting the label of a new example
• Find the nearest example in the training data
• Predict the label associated with that example
• To predict the color of X

• The closest one is pink, so X should be pink
• K-nearest Neighbors

• Find K nearest neighbors, and choose the label associated with the majority of those neighbors.
• Usually, we use an odd k to avoid ties. In this example, we use k = 3

• Learning fast, no explicit training
• No theory required
• Easy to explain method and results
• Memory intensive and predictions can take a long time
• There are better algorithms than brute force
• No model to shed light on process that generated data
• For Example

• To predict whether zebra, python, and alligator are reptiles or not.
• Calculating the distances, we get:
• The three animals closest to alligator are boa constrictor, chicken, and dart frog, and two of them are not reptiles, so alligator is classified as not a reptile.
• But we know the alligator is a reptile. So what went wrong?
• We notice that all of the features are 0 or 1 except the number of legs, which therefore gets disproportionate weight.
• So instead of "number of legs" we use the binary feature "has legs," which is 0 or 1 like the others.
• Now the three animals closest to alligator are boa constrictor, chicken, and cobra, and two of them are reptiles, so alligator is classified as a reptile.
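The example above can be sketched as a tiny k-nearest-neighbors classifier. The binary feature vectors (egg-laying, cold-blooded, has scales, has legs) and their values are assumptions for illustration, not the lecture's actual data:

```python
from collections import Counter

def euclidean(v1, v2):
    return sum((a - b)**2 for a, b in zip(v1, v2))**0.5

def knnPredict(training, query, k=3):
    """Predict a label by majority vote among the k nearest training examples.
       training is a list of (featureVector, label) pairs."""
    neighbors = sorted(training, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Assumed features: (egg-laying, cold-blooded, has scales, has legs)
animals = [
    ((1, 1, 1, 0), 'reptile'),      # boa constrictor
    ((1, 1, 1, 0), 'reptile'),      # cobra
    ((1, 0, 1, 1), 'not reptile'),  # chicken
    ((1, 1, 0, 0), 'not reptile'),  # dart frog
    ((0, 0, 0, 1), 'not reptile'),  # zebra
]
alligator = (1, 1, 1, 1)
print(knnPredict(animals, alligator, k=3))  # → reptile
```

With all features binary, the three nearest neighbors are boa constrictor, cobra, and chicken, and the majority vote classifies the alligator as a reptile.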
• A More General Approach: Scaling

• Z-scaling
• Interpolation

• Map minimum value to 0, maximum value to 1, and linearly interpolate
```python
import pylab

def zScaleFeatures(vals):
    """Assumes vals is a sequence of floats"""
    result = pylab.array(vals)
    mean = float(sum(result))/len(result)
    result = result - mean
    return result/result.std()  # divide by the standard deviation

def iScaleFeatures(vals):
    """Assumes vals is a sequence of floats"""
    minVal, maxVal = min(vals), max(vals)
    fit = pylab.polyfit([minVal, maxVal], [0, 1], 1)
    return pylab.polyval(fit, vals)
```
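As a quick sanity check of the two approaches, here is an equivalent sketch using numpy directly (the sample values are made up): z-scaling yields mean 0 and standard deviation 1, while interpolation maps the values onto [0, 1].

```python
import numpy as np

vals = [1.0, 2.0, 3.0, 4.0, 10.0]  # made-up sample values

# Z-scaling: subtract the mean, then divide by the standard deviation
z = (np.array(vals) - np.mean(vals)) / np.std(vals)
print(np.mean(z), np.std(z))  # ~0.0 and 1.0

# Interpolation: map the minimum to 0 and the maximum to 1, linearly
lo, hi = min(vals), max(vals)
i = (np.array(vals) - lo) / (hi - lo)
print(i.min(), i.max())  # 0.0 and 1.0
```

Z-scaling preserves the shape of the distribution (outliers stay far out), whereas interpolation squeezes everything into a fixed range.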

### 1.6. Clustering

• Partition examples into groups (clusters) such that examples in a group are more similar to each other than to examples in other groups
• Unlike classification, there is not typically a “right answer”
• Answer dictated by feature vector and distance metric, not by a ground truth label

#### Optimization Problem

• Clustering is an optimization problem. The goal is to find a set of clusters that optimizes an objective function, subject to some set of constraints.
• Given a distance metric that can be used to decide how close two examples are to each other, we need to define an objective function that
• Minimizes the distance between examples in the same clusters, i.e., minimizes the dissimilarity of the examples within a cluster.
• To compute the variability of the examples within a cluster
• First compute the mean of the feature vectors of all the examples in the cluster: sum(V)/float(len(V)), where V is the list of feature vectors (more precisely, the Euclidean mean)
• Then sum the squared distances between each feature vector and that mean:
• $\text{variability}(c)=\displaystyle\sum_{e \in c}\text{distance}(\text{mean}(c), e)^2$
• The definition of variability within a single cluster, c, can be extended to define a dissimilarity metric for a set of clusters, C:
• $\text{dissimilarity}(C)=\displaystyle\sum_{c \in C}\text{variability}(c)$
• The optimization problem is NOT simply to find a set of clusters, C, such that dissimilarity(C) is minimized, because that can be trivially minimized by putting each example in its own cluster.
• Instead, we could put a constraint on the distance between clusters, or require that there be at most k clusters, and then minimize dissimilarity subject to that constraint.
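The two definitions above translate directly into Python. Representing a cluster as a list of feature-vector tuples is an assumption for this sketch:

```python
def mean(cluster):
    """Componentwise (Euclidean) mean of a cluster's feature vectors."""
    n = len(cluster)
    return tuple(sum(v[k] for v in cluster) / n for k in range(len(cluster[0])))

def variability(cluster):
    """Sum of squared Euclidean distances from each example to the cluster mean."""
    m = mean(cluster)
    return sum(sum((e[k] - m[k])**2 for k in range(len(m))) for e in cluster)

def dissimilarity(clusters):
    """Sum of the variabilities of a set of clusters."""
    return sum(variability(c) for c in clusters)

# Putting each example in its own cluster drives dissimilarity to 0,
# which is why the optimization needs a constraint such as "exactly k clusters"
print(dissimilarity([[(0.0,), (2.0,)]]))    # one cluster: → 2.0
print(dissimilarity([[(0.0,)], [(2.0,)]]))  # singleton clusters: → 0.0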
##### K-means Clustering
• Constraint: exactly k non-empty clusters
• Use a greedy algorithm to find an approximation to minimizing objective function
• Algorithm

```
randomly choose k examples as initial centroids
while true:
    create k clusters by assigning each
        example to the closest centroid
    compute k new centroids by averaging
        the examples in each cluster
    if centroids don't change:
        break
```
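The loop above can be sketched as a minimal runnable implementation for 1-D points; the function name, data, and seed are illustrative:

```python
import random

def kMeans(points, k, seed=None):
    """Greedy k-means on a list of numbers; returns k clusters (lists)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    while True:
        # Assign each example to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Recompute centroids as cluster averages (keep old if a cluster is empty)
        new = [sum(c)/len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # centroids didn't change: done
            return clusters
        centroids = new

clusters = kMeans([0.0, 0.5, 1.0, 9.0, 9.5, 10.0], k=2, seed=0)
print(sorted(sorted(c) for c in clusters))  # two well-separated groups
```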

• Unlucky Initial Centroids

• (figures: k = 4 initial centroids, and the resulting poor clustering)
• Mitigating Dependence on Initial Centroids

```python
def tryKMeans(points, numTrials):
    """Run kMeans multiple times and return the clustering
       with the lowest dissimilarity."""
    best = kMeans(points)
    for t in range(numTrials):
        C = kMeans(points)
        if dissimilarity(C) < dissimilarity(best):
            best = C
    return best
```

### 1.7. Wrapping Up Machine Learning

• Use data to build statistical models that can be used to
• Shed light on system that produced data
• Make predictions about unseen data
• Supervised learning
• Unsupervised learning
• Feature engineering
• Goal was to expose you to some important ideas
• Not to get you to the point where you could apply them
• Much more detail, including implementations, in text