Week 10 - Large Scale Machine Learning

1. Gradient Descent with Large Datasets

  • Before we try to learn with large datasets, we should first make sure our algorithm suffers from high variance, i.e. there is a sizable gap between $J_{CV}(\theta)$ and $J_{\text{train}}(\theta)$, since only in that case is more data likely to help reduce the error (a minimal sketch of this check is shown below).
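  • A minimal sketch of this check in Python/NumPy for a linear-regression hypothesis; the arrays `X_train`, `y_train`, `X_cv`, `y_cv`, the fitted `theta`, and the `gap_ratio` threshold are hypothetical placeholders, not anything from the course:

```python
import numpy as np

def squared_error_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h_theta(x^(i)) - y^(i))^2) for linear regression."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

def looks_high_variance(theta, X_train, y_train, X_cv, y_cv, gap_ratio=2.0):
    """Rough heuristic: a cross-validation error much larger than the
    training error suggests high variance, so more data is likely to help."""
    j_train = squared_error_cost(theta, X_train, y_train)
    j_cv = squared_error_cost(theta, X_cv, y_cv)
    return j_cv > gap_ratio * j_train
```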

1.1. Stochastic Gradient Descent

  • In Week 1#Gradient Descent we learned batch gradient descent, which uses all of the training examples in every iteration. When $m$ is large, computing the derivative term $\frac{\partial}{\partial\theta_j}J(\theta)$ is expensive, because every update sums the differences over all $m$ examples.
  • Stochastic gradient descent defines the cost of a single example as $\text{cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2$. The overall cost function is $J_{\text{train}}(\theta) = \frac{1}{m} \sum_{i=1}^m \text{cost}(\theta, (x^{(i)}, y^{(i)}))$, which is equivalent to the batch gradient descent cost function.
  • The steps are:
    1. Randomly shuffle the training examples
    2. Repeat $$\begin{aligned} & \{ \\ & \ \ \ \ \text{for } i := 1, \ldots, m \ \{ \\ & \ \ \ \ \ \ \ \ \theta_j := \theta_j - \alpha(h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)} \ (\text{for } j = 0, \ldots, n) \\ & \ \ \ \ \} \\ & \} \end{aligned}$$
    3. Normally we repeat the outer loop 1 - 10 times, i.e. 1 - 10 passes over the training set.
  • In batch gradient descent the derivative term is $\frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$, i.e. we sum the differences over all the examples. In stochastic gradient descent we instead use a single example per update, $(h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$, inside the loop over the $m$ examples (see the sketch after this list).

  • Comparison with Batch Gradient Descent

    • As we saw, batch gradient descent takes a relatively direct path toward the global minimum.
    • With stochastic gradient descent every iteration is much faster, but each iteration fits only a single example. So stochastic gradient descent never converges the way batch gradient descent does; instead it ends up wandering around some region close to the global minimum.
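  • A minimal sketch of these steps in Python/NumPy for linear regression; the function name, hyper-parameter defaults, and the synthetic-data demo are illustrative assumptions, not code from the course:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=10, seed=0):
    """Stochastic gradient descent for linear regression.

    X is an (m, n+1) design matrix whose first column is all ones,
    y is a length-m target vector.
    """
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(num_passes):                      # step 3: typically 1-10 passes
        for i in rng.permutation(m):                 # step 1: randomly shuffle the examples
            error = X[i] @ theta - y[i]              # h_theta(x^(i)) - y^(i)
            theta = theta - alpha * error * X[i]     # step 2: update every theta_j at once
    return theta

# Illustrative usage on synthetic data for y ≈ 1 + 2x:
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(1000), rng.uniform(0, 1, 1000)])
y = 1 + 2 * X[:, 1] + 0.1 * rng.standard_normal(1000)
print(stochastic_gradient_descent(X, y, alpha=0.1))  # approximately [1.0, 2.0]
```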

1.2. Mini-batch Gradient Descent

  • Batch gradient descent: Use all m examples in each iteration
  • Stochastic gradient descent: Use 1 example in each iteration
  • Mini-batch gradient descent: Use b examples in each iteration
  • The steps:
    1. Say b = 10, m = 1000.
    2. Repeat $$\begin{aligned} & \{ \\ & \ \ \ \ \text{for } i := 1, 11, 21, 31, \ldots, 991 \ \{ \\ & \ \ \ \ \ \ \ \ \theta_j := \theta_j - \alpha\frac{1}{10}\sum_{k=i}^{i+9}(h_{\theta}(x^{(k)}) - y^{(k)})x_j^{(k)} \ (\text{for } j = 0, \ldots, n) \\ & \ \ \ \ \} \\ & \} \end{aligned}$$
  • Compared to batch gradient descent, this lets us start making progress after only $b$ examples instead of waiting for a full pass over all $m$, so we get through the data much more efficiently.
  • Compared to stochastic gradient descent, we can vectorize the computation over the $b$ examples and thereby partially parallelize it (i.e. process 10 examples at once).
  • The relation to the other two variants: if $b = 1$ it reduces to stochastic gradient descent, and if $b = m$ it becomes batch gradient descent. A minimal sketch follows below.
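  • A minimal sketch of mini-batch gradient descent for linear regression in Python/NumPy; the function name and defaults are illustrative assumptions:

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, num_passes=10, seed=0):
    """Mini-batch gradient descent for linear regression with batch size b."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(num_passes):
        order = rng.permutation(m)              # shuffle before each pass
        for start in range(0, m, b):            # i = 1, b+1, 2b+1, ... in the notes
            batch = order[start:start + b]
            Xb, yb = X[batch], y[batch]
            # vectorized gradient over the b examples in this mini-batch
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta = theta - alpha * grad
    return theta
```

  Setting `b=1` recovers the stochastic version above, and `b=m` recovers batch gradient descent.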

1.3. Stochastic Gradient Descent Convergence

  • Batch gradient descent:
    • Plot $J_{\text{train}}(\theta) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ as a function of the number of iterations of gradient descent.
  • Stochastic gradient descent:
    • $\text{cost}(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2$
    • During learning, compute $\text{cost}(\theta, (x^{(i)}, y^{(i)}))$ before updating $\theta$ using $(x^{(i)}, y^{(i)})$.
    • Every 1000 iterations (say), plot $\text{cost}(\theta, (x^{(i)}, y^{(i)}))$ averaged over the last 1000 examples processed by the algorithm. Depending on how the algorithm is doing, the resulting plots can look quite different:
      • The top two figures show that if we average over 5000 examples instead of 1000, the curve becomes smoother.
      • The bottom left shows that a curve which looks flat and noisy when averaged over 1000 examples can reveal a clearer (slowly decreasing) trend when averaged over more examples.
      • The bottom right shows that if the curve is increasing, the algorithm is diverging and you should use a smaller learning rate ($\alpha$).
    • About the learning rate(α\alpha):
      • In most implementations the learning rate is held constant.
      • But if we want $\theta$ to converge to a minimum, we can slowly decrease the learning rate over time (e.g. $\alpha = \frac{\text{const1}}{\text{iterationNumber} + \text{const2}}$). A minimal sketch combining the monitoring and the decay follows this list.
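  • A minimal sketch of both ideas for linear regression in Python/NumPy: the per-example cost is recorded before each update and averaged every `report_every` examples, and the learning rate decays as const1 / (iterationNumber + const2). The function name and constants are illustrative assumptions:

```python
import numpy as np

def sgd_with_monitoring(X, y, alpha0=0.1, const2=50.0, num_passes=5,
                        report_every=1000, seed=0):
    """SGD for linear regression that (a) averages cost(theta, (x, y)) over the
    last `report_every` examples and (b) decays the learning rate over time."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    const1 = alpha0 * const2                      # so that alpha starts near alpha0
    recent, averaged = [], []
    t = 0
    for _ in range(num_passes):
        for i in rng.permutation(m):
            t += 1
            alpha = const1 / (t + const2)         # alpha = const1 / (iterationNumber + const2)
            error = X[i] @ theta - y[i]
            recent.append(0.5 * error ** 2)       # cost computed BEFORE updating theta
            theta = theta - alpha * error * X[i]
            if t % report_every == 0:
                averaged.append(np.mean(recent))  # one point on the convergence plot
                recent = []
    return theta, averaged
```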

2. Online Learning

  • The online learning setting lets us model problems where a continuous stream of data is coming in, and we would like an algorithm to learn from that stream.
  • Example: a shipping service. We want to build an algorithm that learns what price we should offer to each user (a minimal sketch of the update appears after the steps below).
    1. Model the probability $p(y=1|x;\theta)$ that the user chooses to use our service.
    2. Gather the feature vector, including the price we offered, origin, destination, etc.
    3. Repeat forever $$\begin{aligned} & \{ \\ & \ \ \ \ \text{Get}\ (x, y)\ \text{corresponding to user}\ \{ \\ & \ \ \ \ \ \ \ \ \theta_j := \theta_j - \alpha(h_{\theta}(x) - y)x_j \ (\text{for } j = 0, \ldots, n) \\ & \ \ \ \ \} \\ & \} \end{aligned}$$
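  • A minimal sketch of the update in Python/NumPy, using a logistic-regression hypothesis since we are modelling $p(y=1|x;\theta)$; the `event_stream()` generator is a hypothetical stand-in for the real stream of users:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.05):
    """One online logistic-regression update from a single (x, y) pair;
    the example is then discarded rather than stored and revisited."""
    h = sigmoid(x @ theta)                  # h_theta(x) = p(y = 1 | x; theta)
    return theta - alpha * (h - y) * x

# Hypothetical usage: adapt theta as (features, did-the-user-buy) pairs stream in.
# theta = np.zeros(n)
# for x, y in event_stream():
#     theta = online_update(theta, x, y)
```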

2.1. Other Online Learning Examples

  • Product search (learning to search)
    • The user searches for "Android phone 1080p camera". We have 100 phones in the store and will return 10 results.
    • $x$ = features of the phone: how many words in the user query match the name of the phone, how many words in the query match the description of the phone, etc.
    • $y = 1$ if the user clicks on the link, $y = 0$ otherwise.
    • Learn $p(y = 1 | x; \theta)$.
    • Use this to show the user the 10 phones they’re most likely to click on (see the sketch after this list).
  • Other examples: Choosing special offers to show user; customized selection of news articles; product recommendation; …
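  • A minimal sketch of that last step in Python/NumPy: given a learned $\theta$ and one feature vector per candidate phone, rank by the predicted click probability and keep the top 10. The matrix name and the logistic hypothesis are assumptions consistent with the example above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_k_results(theta, phone_features, k=10):
    """Return the indices of the k phones the user is most likely to click on.

    phone_features is a (num_phones, n) matrix with one feature vector per
    phone, e.g. how many query words match its name / description, etc."""
    scores = sigmoid(phone_features @ theta)    # predicted p(y = 1 | x; theta) per phone
    return np.argsort(scores)[::-1][:k]         # indices sorted by descending score
```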

3. Map Reduce and Data Parallelism
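  • The key idea: when computing the batch gradient, the sum over the $m$ examples can be split across several machines (or CPU cores). Each machine computes the partial sum over its own subset of the data, and a central server adds the partial sums together and performs the update.
  • A rough sketch in Python/NumPy that simulates the "map" and "reduce" steps on a single machine; the function names and `num_machines` value are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def partial_gradient(theta, X_chunk, y_chunk):
    """'Map' step: one machine's un-normalized gradient sum over its own chunk."""
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

def mapreduce_gradient(theta, X, y, num_machines=4):
    """'Reduce' step (simulated): split the examples, sum the partial results
    from every machine on a central server, then divide by m."""
    chunks = zip(np.array_split(X, num_machines), np.array_split(y, num_machines))
    total = sum(partial_gradient(theta, Xc, yc) for Xc, yc in chunks)
    return total / len(y)       # equals (1/m) * the sum over all m examples
```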

4. Words

  • stochastic [stɔ'kæstik, stəu-] adj. (mathematics) random; conjectural
  • parallelize ['pærəlelaiz] vt. to place in parallel; to make … parallel to …

results matching ""

    No results matching ""