# Lecture 12

## 1. Machine Learning

- Definition
- Basic Paradigm
  - Observe a set of examples: **training data**
  - Infer something about the process that generated that data
  - Use the inference to make predictions about previously unseen data: **test data**

- Procedures
  - Representation of the features
    - e.g., separate people by features (man/woman, educated/not, etc.)
  - Distance metric for feature vectors
    - scale the feature vectors so they can be compared on the same range
  - Objective function and constraints
  - Optimization method for learning the model
  - Evaluation method


### 1.1. Supervised Learning

- Start with a set of feature vector/value pairs
- Goal: find a model that predicts a value for a previously unseen feature vector
  - **Regression models** predict a real number, as with linear regression
  - **Classification models** predict a label (chosen from a finite set of labels)

### 1.2. Unsupervised Learning

- Start with a set of feature vectors
- Goal: uncover some latent structure in the set of feature vectors
- **Clustering** is the most common technique
  - Define some metric that captures how similar one feature vector is to another
  - Group examples based on this metric

### 1.3. Difference between Supervised and Unsupervised

- With labels, we can classify the data into two clusters by weight or height, or into four clusters by weight and height: this is supervised learning.
- Without labels, we have to figure out how to cluster the data ourselves: this is unsupervised learning.

### 1.4. Choose Feature Vectors

- Why should we be careful?
- Irrelevant features can lead to a bad model.
- Irrelevant features can greatly slow the learning process.

- How?
  - **Signal-to-noise ratio (SNR)**: think of it as the ratio of useful input to irrelevant input.

- The purpose of feature extraction is to separate those features in the available data that contribute to the signal from those that are merely noise.

### 1.5. Distance Between Vectors

#### Minkowski Metric

- $\text{dist}(X1, X2, p)=\left(\displaystyle\sum_{k=1}^{\text{len}}\left|X1_{k}-X2_{k}\right|^p\right)^{1/p}$
- p = 1: Manhattan Distance
- p = 2: Euclidean Distance

```python
def minkowskiDist(v1, v2, p):
    """Assumes v1 and v2 are equal-length arrays of numbers
       Returns Minkowski distance of order p between v1 and v2"""
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1.0/p)
```

- For example, compare the distance between the star and the circle with the distance between the cross and the circle (points from the lecture figure):
  - Using Manhattan Distance, they are 3 and 4
  - Using Euclidean Distance, they are 3 and 2.8 ($\sqrt{2^2+2^2}$), as the snippet below verifies
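
The coordinates below are assumed (the original figure is not reproduced here), but they give the same numbers: the circle at (0, 0), the star at (3, 0), and the cross at (2, 2).

```python
circle, star, cross = [0, 0], [3, 0], [2, 2]

print(minkowskiDist(star, circle, 1))   # Manhattan distance: 3.0
print(minkowskiDist(cross, circle, 1))  # Manhattan distance: 4.0
print(minkowskiDist(star, circle, 2))   # Euclidean distance: 3.0
print(minkowskiDist(cross, circle, 2))  # Euclidean distance: ~2.83
```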

##### Using Distance Matrix for Classification

- Procedures
  - Simplest approach is probably nearest neighbor
  - Remember the training data
  - When predicting the label of a new example
    - Find the nearest example in the training data
    - Predict the label associated with that example

- For example, to predict the color of `X` in the lecture figure: the closest example is pink, so `X` should be predicted to be pink (a code sketch follows below)
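
A minimal nearest-neighbor sketch, assuming the training data is a list of `(featureVector, label)` pairs and reusing `minkowskiDist` from above; the function name and the choice of Euclidean distance (`p = 2`) are illustrative, not taken from the lecture code.

```python
def nearestNeighborLabel(training, example, p=2):
    """training: a list of (featureVector, label) pairs.
       Returns the label of the training example closest to example."""
    bestVec, bestLabel = training[0]
    bestDist = minkowskiDist(bestVec, example, p)
    for vec, label in training[1:]:
        d = minkowskiDist(vec, example, p)
        if d < bestDist:  # keep the closest example seen so far
            bestDist, bestLabel = d, label
    return bestLabel
```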

**K-nearest Neighbors**

- Find the `k` nearest neighbors, and choose the label associated with the majority of those neighbors.
- Usually, we use an odd number for `k`. In this example, we use `k = 3` (a sketch follows below).
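
A corresponding k-nearest-neighbors sketch, again assuming `(featureVector, label)` training pairs; the use of `collections.Counter` for the majority vote is an implementation choice, not necessarily the lecture's.

```python
from collections import Counter

def kNearestLabel(training, example, k=3, p=2):
    """training: a list of (featureVector, label) pairs.
       Returns the majority label among the k training examples
       closest to example."""
    # Sort the training examples by distance to the new example
    byDist = sorted(training, key=lambda pair: minkowskiDist(pair[0], example, p))
    # Majority vote over the labels of the k closest examples
    votes = Counter(label for _, label in byDist[:k])
    return votes.most_common(1)[0][0]
```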

**Advantages and Disadvantages of KNN**

- Advantages
  - Learning is fast, no explicit training
  - No theory required
  - Easy to explain the method and results
- Disadvantages
  - Memory intensive and predictions can take a long time
  - There are better algorithms than brute force
  - No model to shed light on the process that generated the data

**For Example**

- Predict whether a zebra, a python, and an alligator are reptiles or not.
- Calculating the distances between the animals' feature vectors, we get:
  - The three closest animals to the alligator are the boa constrictor, the chicken, and the dart frog; two of them are not reptiles, so the alligator is classified as not a reptile.
- But we know the alligator is a reptile. So what went wrong?
  - Notice that all of the features are 0 or 1 except the number of legs, which therefore gets disproportionate weight (illustrated in the snippet below).
  - So instead of the number of legs, we use a binary "has legs" feature, which for the alligator becomes a 1.
- Now the three closest animals to the alligator are the boa constrictor, the chicken, and the cobra; two of them are reptiles, so the alligator is classified as a reptile.
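
The feature vectors below are made up for illustration (the lecture's actual table is not reproduced here), but they show why a raw leg count dominates a distance computed over otherwise binary features.

```python
# Hypothetical features: [egg-laying, has scales, poisonous, cold-blooded, number of legs]
alligator = [1, 1, 0, 1, 4]
boa       = [0, 1, 0, 1, 0]
print(minkowskiDist(alligator, boa, 2))    # ~4.12; the 4-vs-0 leg count contributes 16 of the 17 squared units

# Replacing the count with a binary "has legs" feature puts it on the same 0/1 scale
alligatorB = [1, 1, 0, 1, 1]
boaB       = [0, 1, 0, 1, 0]
print(minkowskiDist(alligatorB, boaB, 2))  # ~1.41; no single feature dominates
```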

**A More General Approach: Scaling**

- Z-scaling
  - Each feature is scaled to have a mean of 0 and a standard deviation of 1

- Interpolation
  - Map the minimum value to 0, the maximum value to 1, and linearly interpolate in between

```python
import pylab

def stdDev(X):
    """Population standard deviation; helper assumed from earlier lectures"""
    mean = sum(X)/len(X)
    return (sum((x - mean)**2 for x in X)/len(X))**0.5

def zScaleFeatures(vals):
    """Assumes vals is a sequence of floats"""
    result = pylab.array(vals)
    mean = float(sum(result))/len(result)
    result = result - mean
    return result/stdDev(result)

def iScaleFeatures(vals):
    """Assumes vals is a sequence of floats"""
    minVal, maxVal = min(vals), max(vals)
    fit = pylab.polyfit([minVal, maxVal], [0, 1], 1)
    return pylab.polyval(fit, vals)
```
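
For example, with an assumed list of heights (the values are illustrative):

```python
heights = [60.0, 65.0, 70.0, 75.0]

print(zScaleFeatures(heights))  # mean 0, std dev 1: approx [-1.34, -0.45, 0.45, 1.34]
print(iScaleFeatures(heights))  # mapped onto [0, 1]: approx [0.0, 0.33, 0.67, 1.0]
```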


### 1.6. Clustering

- Partition examples into groups (clusters) such that examples in a group are more similar to each other than to examples in other groups
- Unlike classification, there is not typically a “right answer”
- Answer dictated by feature vector and distance metric, not by a ground truth label

#### Optimization Problem

- Clustering is an optimization problem. The goal is to find a set of clusters that optimizes an objective function, subject to some set of constraints.
- Given a distance metric that can be used to decide how close two examples are to each other, we need to define an **objective function** that
  - minimizes the distance between examples in the same cluster, i.e., minimizes the dissimilarity of the examples within a cluster.

- To compute the variability of the examples within a cluster, `c`:
  - First compute the mean (`sum(V)/float(len(V))`, where `V` is the list of feature vectors of all the examples in the cluster; more precisely, this is the Euclidean mean).
  - Then sum the squared distances between each example's feature vector and that mean:
  - $\text{variability}(c)=\displaystyle\sum_{e \in c}\text{distance}(\text{mean}(c), e)^2$
- The definition of variability within a single cluster, `c`, can be extended to define a dissimilarity metric for a set of clusters, `C`:
  - $\text{dissimilarity}(C)=\displaystyle\sum_{c \in C}\text{variability}(c)$
- The optimization problem is NOT simply to find a set of clusters, `C`, such that `dissimilarity(C)` is minimized, because that is trivially minimized by putting each example in its own cluster.
- Instead, we could put a constraint on the distance between clusters, or require that the maximum number of clusters is `k`, and then minimize the dissimilarity subject to that constraint (a code sketch of these definitions follows).
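
A small sketch of these definitions, assuming each cluster is simply a list of equal-length feature vectors and reusing `minkowskiDist` from above; the function names are assumptions, not the lecture's code.

```python
def clusterMean(V):
    """V: a non-empty list of equal-length feature vectors.
       Returns the componentwise (Euclidean) mean as a list of floats."""
    return [sum(col)/float(len(V)) for col in zip(*V)]

def variability(c):
    """Sum of squared distances of each example in cluster c from the cluster mean."""
    m = clusterMean(c)
    return sum(minkowskiDist(m, e, 2)**2 for e in c)

def dissimilarity(C):
    """C: a list of clusters, each a list of feature vectors."""
    return sum(variability(c) for c in C)
```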

##### K-means Clustering

- Constraint: exactly `k` non-empty clusters
- Use a greedy algorithm to find an approximation to minimizing the objective function

**Algorithm**

```
randomly choose k examples as initial centroids
while true:
    create k clusters by assigning each example to closest centroid
    compute k new centroids by averaging examples in each cluster
    if centroids don't change:
        break
```
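
A runnable, simplified version of this algorithm, assuming `examples` is a list of equal-length numeric feature vectors; the function name `kMeansSketch` and the helper choices are assumptions, not the lecture12-4.py code.

```python
import random

def euclidean(v1, v2):
    # minkowskiDist with p = 2
    return sum((a - b)**2 for a, b in zip(v1, v2))**0.5

def kMeansSketch(examples, k, maxIters=100):
    # Randomly choose k examples as the initial centroids
    centroids = random.sample(examples, k)
    for _ in range(maxIters):
        # Assignment step: each example joins the cluster of its closest centroid
        clusters = [[] for _ in range(k)]
        for e in examples:
            dists = [euclidean(e, c) for c in centroids]
            clusters[dists.index(min(dists))].append(e)
        # Update step: new centroid = componentwise mean of each cluster's examples
        newCentroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                newCentroids.append([sum(col)/float(len(cluster)) for col in zip(*cluster)])
            else:
                newCentroids.append(centroids[i])  # keep the old centroid if a cluster is empty
        if newCentroids == centroids:  # stop when the centroids no longer change
            break
        centroids = newCentroids
    return clusters, centroids
```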

- Sample: lecture12-4.py with `k=4` (the lecture shows the initial centroids and the resulting clusters as plots)

**Unlucky Initial Centroids**

- The same sample with `k=4` but an unlucky choice of initial centroids gives a different result (again shown as plots of the initial centroids and the resulting clusters)

**Mitigating Dependence on Initial Centroids**

```
best = kMeans(points)
for t in range(numTrials):
    C = kMeans(points)
    if dissimilarity(C) < dissimilarity(best):
        best = C
return best
```
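
Made runnable by reusing the `kMeansSketch` and `dissimilarity` sketches from above; the name and signature are assumptions.

```python
def bestKMeans(examples, k, numTrials):
    """Run k-means numTrials times and keep the clustering with the
       lowest dissimilarity."""
    best, _ = kMeansSketch(examples, k)
    for _ in range(numTrials - 1):
        C, _ = kMeansSketch(examples, k)
        if dissimilarity(C) < dissimilarity(best):
            best = C
    return best
```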

### 1.7. Wrapping Up Machine Learning

- Use data to build statistical models that can be used to
  - Shed light on the system that produced the data
  - Make predictions about unseen data
- Supervised learning
- Unsupervised learning
- Feature engineering
- The goal was to expose you to some important ideas
  - Not to get you to the point where you could apply them
  - Much more detail, including implementations, is in the text