```python
import pylab

def stdDev(X):
    """Assumes X is a list of numbers; returns the standard deviation of X.
    (Defined earlier in the notes; pylab.std is an equivalent stand-in.)"""
    return pylab.std(X)

def zScaleFeatures(vals):
    """Assumes vals is a sequence of floats"""
    result = pylab.array(vals)
    mean = float(sum(result))/len(result)
    result = result - mean
    return result/stdDev(result)

def iScaleFeatures(vals):
    """Assumes vals is a sequence of floats"""
    minVal, maxVal = min(vals), max(vals)
    fit = pylab.polyfit([minVal, maxVal], [0, 1], 1)
    return pylab.polyval(fit, vals)
```
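As a quick check of what the two scalers produce (a self-contained sketch using NumPy directly, since `pylab` just wraps it): z-scaling yields mean 0 and standard deviation 1, while interpolation scaling maps the minimum to 0 and the maximum to 1.

```python
import numpy as np

vals = [60.0, 70.0, 80.0, 90.0]

# z-scaling: subtract the mean, divide by the standard deviation
z = (np.array(vals) - np.mean(vals)) / np.std(vals)
print(z.mean(), z.std())  # ~0.0 and ~1.0

# interpolation scaling: fit a line mapping min -> 0 and max -> 1,
# then evaluate it at every value
fit = np.polyfit([min(vals), max(vals)], [0, 1], 1)
i = np.polyval(fit, vals)
print(i)  # ~[0, 1/3, 2/3, 1]
```

Both methods put features measured on very different scales (e.g., weight in grams vs. a boolean) on a comparable footing before computing distances.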
* Partition examples into groups (clusters) such that examples in a group are more similar to each other than to examples in other groups
* Unlike classification, there is typically no “right answer”
* The answer is dictated by the feature vector and the distance metric, not by a ground-truth label
* Clustering is an optimization problem: find a set of clusters that optimizes an objective function, subject to some set of constraints.
* Given a distance metric that can be used to decide how close two examples are to each other, we need to define an **objective function** that minimizes the distance between examples in the same cluster, i.e., minimizes the dissimilarity of the examples within a cluster.
* To compute the variability of the examples within a cluster:
  * First compute the mean (`sum(V)/float(len(V))`, where `V` is the list of feature vectors; more precisely, the Euclidean mean) of the feature vectors of all the examples in the cluster.
  * Then sum the squared distances between each example's feature vector and that mean:
  * $\text{variability}(c)=\displaystyle\sum_{e \in c}\text{distance}(\text{mean}(c), e)^2$
* The definition of variability within a single cluster, `c`, can be extended to define a dissimilarity metric for a set of clusters, `C`:
  * $\text{dissimilarity}(C)=\displaystyle\sum_{c \in C}\text{variability}(c)$
* The optimization problem is NOT simply to find a set of clusters, `C`, such that `dissimilarity(C)` is minimized, because that can be trivially minimized by putting each example in its own cluster.
* Instead, we impose a constraint, e.g., a minimum distance between clusters or a maximum number of clusters `k`, and minimize dissimilarity subject to that constraint.
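The two definitions above translate directly to code. A minimal sketch, assuming each example is represented as a NumPy feature vector and distance is Euclidean (the function names are illustrative, not from the notes):

```python
import numpy as np

def variability(cluster):
    """Sum of squared Euclidean distances from each example to the cluster mean."""
    mean = np.mean(cluster, axis=0)  # Euclidean mean of the feature vectors
    return sum(np.linalg.norm(e - mean)**2 for e in cluster)

def dissimilarity(clusters):
    """Sum of the variabilities of a set of clusters."""
    return sum(variability(c) for c in clusters)

# Two tight clusters have low dissimilarity; lumping everything
# into one cluster drives it up.
c1 = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]
c2 = [np.array([10.0, 10.0]), np.array([10.0, 11.0])]
print(dissimilarity([c1, c2]))   # small
print(dissimilarity([c1 + c2]))  # much larger
```

Note that putting each example in its own cluster gives variability 0 everywhere, hence dissimilarity 0, which is why the constraint is needed.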
##### K-means Clustering
* Constraint: exactly `k` nonempty clusters
* Use a greedy algorithm to find an approximation to minimizing the objective function
* Algorithm:
  1. Randomly choose `k` examples as initial centroids
  2. Create `k` clusters by assigning each example to the closest centroid
  3. Compute `k` new centroids by averaging the examples in each cluster
  4. If the centroids are unchanged, stop; otherwise repeat from step 2
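The greedy algorithm above can be sketched as follows (a minimal NumPy implementation; the function name and the example points are illustrative, and no guard against a cluster emptying out mid-run is included):

```python
import numpy as np

def kmeans(examples, k, maxIters=100, seed=0):
    """Greedy k-means: returns a list of k clusters (arrays of examples)."""
    rng = np.random.default_rng(seed)
    examples = np.asarray(examples, dtype=float)
    # 1. Randomly choose k distinct examples as the initial centroids.
    centroids = examples[rng.choice(len(examples), k, replace=False)]
    for _ in range(maxIters):
        # 2. Assign each example to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(examples[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its cluster.
        newCentroids = np.array([examples[labels == j].mean(axis=0)
                                 for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(newCentroids, centroids):
            break
        centroids = newCentroids
    return [examples[labels == j] for j in range(k)]

# Two well-separated blobs of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = kmeans(points, k=2)
for c in clusters:
    print(c.tolist())
```

Because the initial centroids are random, the result is only a local optimum; in practice the algorithm is run several times from different random starts and the clustering with the lowest dissimilarity is kept.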
