For training example $i = 1$ to $m$:
$L$ is the total number of layers.
$g'$ is the derivative of the activation function $g$, evaluated with the input values given by $z^{(l)}$.
$i$ indexes the error of the affected node of layer $l$.
$j$ indexes the node of layer $l$.
$m$ is the number of nodes in layer $l$, and $n$ is the number of errors of the affected nodes in layer $l+1$.
$D^{(l)}$ is an $m \times n$ matrix, the same size as $\Theta^{(l)}$.
$D$ is used as an “accumulator” to add up our values as we go along and eventually compute our partial derivative. Thus we get $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D^{(l)}_{i,j}$.
$\delta_j^{(l)}$ is the error of node $j$ in layer $l$. More formally, the delta values are the derivative of the cost function: $$\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)$$
With neural networks, we are working with sets of matrices: $$\begin{aligned} \Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \ldots \\ D^{(1)}, D^{(2)}, D^{(3)}, \ldots \end{aligned}$$
In order to use optimizing functions such as “fminunc()”, we will want to “unroll” all the elements and put them into one long vector:
`thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];`
To summarize:
Have initial parameters $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$.
Unroll to get initialTheta to pass to fminunc(@costFunction, initialTheta, options).
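The unrolling step can be sketched as a round trip: stack the matrices into one vector, then rebuild them with `reshape`, as a cost function would do with the vector it receives. This is a minimal sketch, assuming the dimensions used elsewhere in these notes (Theta1 and Theta2 are 10x11, Theta3 is 1x11); the random contents and the names `T1`, `T2`, `T3` are placeholders for illustration.

```octave
% Placeholder matrices with the assumed sizes: 10x11, 10x11 and 1x11.
Theta1 = rand(10, 11);
Theta2 = rand(10, 11);
Theta3 = rand(1, 11);

% Unroll: (:) stacks each matrix column by column into one long vector.
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];   % 110 + 110 + 11 = 231 elements

% Rebuild the matrices from the vector, e.g. at the top of a cost function.
T1 = reshape(thetaVector(1:110),   10, 11);
T2 = reshape(thetaVector(111:220), 10, 11);
T3 = reshape(thetaVector(221:231), 1,  11);

% The round trip should reproduce the original matrices exactly.
roundTripOk = isequal(T1, Theta1) && isequal(T2, Theta2) && isequal(T3, Theta3);
disp(roundTripOk);
```

The index ranges follow directly from the element counts, so they have to change whenever the layer sizes change.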
Gradient checking helps confirm that our backpropagation works as intended. We can approximate the derivative of our cost function with: $$\dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$
With multiple theta matrices, we can approximate the derivative with respect to $\Theta_j$ as follows: $$\dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \ldots, \Theta_j + \epsilon, \ldots, \Theta_n) - J(\Theta_1, \ldots, \Theta_j - \epsilon, \ldots, \Theta_n)}{2\epsilon}$$
A small value for ϵ (epsilon), such as $\epsilon = 10^{-4}$, usually makes the approximation accurate enough. If the value for ϵ is too small, we can end up with numerical problems such as round-off error.
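To see how the choice of ϵ affects the two-sided difference, here is a small sketch on a toy cost $J(\theta) = \theta^3$, whose exact derivative at $\theta = 1$ is 3. The toy function and the three ϵ values are assumptions chosen for illustration, not part of the original notes.

```octave
% Toy cost (an assumption): J(theta) = theta^3, exact derivative 3 at theta = 1.
J = @(theta) theta .^ 3;
theta = 1;

% Try a large, a moderate and a very small epsilon.
for epsilon = [1e-1, 1e-4, 1e-13]
  approx = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon);
  fprintf('epsilon = %g, approx = %.10f, error = %g\n', ...
          epsilon, approx, abs(approx - 3));
end
```

For this function the two-sided formula works out to $3 + \epsilon^2$ in exact arithmetic, so the error shrinks as ϵ shrinks until floating-point round-off starts to dominate.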
Hence, we are only adding epsilon to, or subtracting it from, the $\Theta_j$ matrix. In Octave we can do it as follows:
`epsilon = 1e-4;`
We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.
Once you have verified that your backpropagation algorithm is correct, you don’t need to compute gradApprox again. The code to compute gradApprox can be very slow, because it evaluates the cost function twice for every parameter.
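The gradient check described above can be sketched as a loop over the unrolled parameter vector. This is a minimal sketch under stated assumptions: the cost `J` is a toy function, `sum(theta.^2)`, and `deltaVector` is set to its known gradient `2*theta` as a stand-in for the unrolled backpropagation result. The names `thetaPlus`, `thetaMinus` and `relDiff` are illustrative.

```octave
% Toy cost (an assumption): J(theta) = sum(theta.^2), with known gradient 2*theta.
J = @(theta) sum(theta .^ 2);
theta = [0.5; -1.2; 3.0];
deltaVector = 2 * theta;          % stand-in for the unrolled backprop gradient

epsilon = 1e-4;
n = numel(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  % Perturb only the i-th parameter, up and down by epsilon.
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end

% Relative difference; a very small value suggests the two gradients agree.
relDiff = norm(gradApprox - deltaVector) / norm(gradApprox + deltaVector);
disp(relDiff);
```

Comparing with a relative difference rather than element by element is one common design choice; any threshold for "small enough" is a judgment call.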
Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our $\Theta$ matrices using the following method:
Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$ (i.e. $-\epsilon \le \Theta_{ij}^{(l)} \le \epsilon$):
`Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;`
Hence, we initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$. Because rand returns values between 0 and 1, scaling by $2\epsilon$ and then subtracting $\epsilon$ gives the desired bound. The same procedure applies to all the Θ’s. Below is some working code you could use to experiment.
`% Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.`
rand(x,y) is an Octave function that returns an x-by-y matrix of random real numbers between 0 and 1.
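Applied to all three matrices, the initialization could look like the sketch below. The value `INIT_EPSILON = 0.12` is a hypothetical choice for illustration, and `allTheta` is only used to check the bounds.

```octave
INIT_EPSILON = 0.12;   % hypothetical value chosen for illustration

% rand gives values in (0, 1); scale to width 2*eps, then shift down by eps.
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1, 11)  * (2 * INIT_EPSILON) - INIT_EPSILON;

% Every weight should lie inside [-INIT_EPSILON, INIT_EPSILON].
allTheta = [Theta1(:); Theta2(:); Theta3(:)];
disp([min(allTheta), max(allTheta)]);
```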
Choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.
When we perform forward and back propagation, we loop on every training example:
`for i = 1:m,`
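The per-example loop can be sketched for a small network as follows. This is only an illustration under stated assumptions: one hidden layer with three units, sigmoid activations, no regularization, and a tiny made-up data set. The names `Delta1` and `Delta2` (running sums) and `D1` and `D2` (averaged partial derivatives) are chosen for this sketch.

```octave
% Assumptions: one hidden layer, sigmoid activations, no regularization.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [0 0; 0 1; 1 0; 1 1];          % m x 2 inputs (made-up data)
y = [0; 1; 1; 0];                  % m x 1 targets
m = size(X, 1);

Theta1 = rand(3, 3) * 0.24 - 0.12; % 3 hidden units x (2 inputs + bias)
Theta2 = rand(1, 4) * 0.24 - 0.12; % 1 output x (3 hidden units + bias)

Delta1 = zeros(size(Theta1));      % running sums of gradient contributions
Delta2 = zeros(size(Theta2));

for i = 1:m
  % Forward propagation for example i
  a1 = [1; X(i, :)'];              % input with bias unit
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];           % hidden activations with bias unit
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);                % network output

  % Backpropagation: output error first, then hidden-layer error
  d3 = a3 - y(i);
  d2 = (Theta2' * d3) .* (a2 .* (1 - a2));
  d2 = d2(2:end);                  % drop the entry for the bias unit

  % Accumulate this example's contribution
  Delta2 = Delta2 + d3 * a2';
  Delta1 = Delta1 + d2 * a1';
end

D1 = Delta1 / m;                   % unregularized partial derivatives
D2 = Delta2 / m;
```

The output error `a3 - y(i)` corresponds to the logistic (cross-entropy) cost with a sigmoid output unit; a different cost function would change that line.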
The following image gives us an intuition of what is happening as we are implementing our neural network:
Ideally, you want $h_{\Theta}(x^{(i)}) \approx y^{(i)}$. This will minimize our cost function. However, keep in mind that $J(\Theta)$ is not convex and thus we can end up in a local minimum instead.