Training Recurrent Neural Networks without Backpropagation Through Time
Normally, for a recurrent net to learn dependencies between two states that are \$n\$ steps apart, it has to be unrolled through time for at least \$n\$ steps during backpropagation. That is not plausible to happen in the brain, it is computationally expensive compared to feedforward neural network training, and it puts a hard limit on the length of the dependencies that can be learned.
Recalling the basics of neural nets
A neural network is a function with parameters \$\theta\$ that maps an input vector \$\vec{v_i}\$ nonlinearly to an output vector \$\vec{v_o}\$, so \$f(\theta,\vec{v_i}) = \vec{v_o}\$. The parameters are initialized randomly; then the error \$E\$ between the actual and the desired output for a given input is partially differentiated with respect to each parameter, and the parameters are altered in the direction of less error. This is called backpropagation and is the state of the art for training a neural net.
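Written as a formula, each parameter is nudged against its error gradient, \$\theta_i \leftarrow \theta_i - \eta \frac{\partial E}{\partial \theta_i}\$, where \$\eta\$ is a small learning rate.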
Its basic unit is a neuron. It takes a vector of inputs \$\vec{v}\$, a weight vector \$\vec{w}\$ of the same dimension as the input, and produces an output scalar \$o\$:
\$o = nl(\sum_{n = 1}^{dim(\vec{w})} v_n * w_n + bias)\$
where \$nl\$ is a differentiable nonlinear function, often \$tanh(x)\$ or \$max(0,x)\$.
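As a minimal sketch (NumPy, just for illustration; the names are mine, not from any particular library), a single neuron's forward pass looks like this:

```python
import numpy as np

def neuron_forward(v, w, bias, nl=np.tanh):
    """One neuron: scalar product of inputs and weights, plus bias, through a nonlinearity."""
    s = np.dot(v, w) + bias   # weighted sum
    return nl(s)              # nonlinear activation, e.g. tanh or max(0, x)

# tiny usage example with a 3-dimensional input
v = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron_forward(v, w, bias=0.2))
```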
Using a neural network as activation function
For my approach, I used a feedforward neural network as the activation function in the layers of another feedforward neural net.
Let's call this activation-function neural net \$a\$, with parameters \$\theta_a\$, and the "bigger" net that uses this kind of activation function \$b\$, with parameters \$\theta_b\$.
\$\theta_a\$ is shared between all \$a\$ in \$b\$. Each \$a\$ receives a scalar product \$s = \sum_{n = 1}^{dim(\vec{w})} v_n * w_n + bias\$ as input, as well as a vector \$\vec{m}\$ of \$n_m\$ memory values. It outputs a scalar that serves as the output of the neuron, another value that is used during backpropagation through \$b\$ in place of the actual derivative of the activation function, and a vector \$\vec{g}\$ of \$n_m\$ values that act as gates for the memory values. The new value of memory \$i\$ at timestep \$t\$ is computed as \$m_{i,t} = (1 - g_{i,t}) * m_{i,t-1} + g_{i,t} * s_t\$. So a forward pass of a neuron with \$a\$ as activation function at timestep \$t\$ produces this: \$(o_t, (\frac{\partial o}{\partial s})_t, \vec{g_t}) = a(\theta_a, s_t, \vec{m}_{t-1})\$. This activation function makes \$b\$ recurrent.
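To make this concrete, here is a rough NumPy sketch of one such neuron at a single timestep. The internal structure of \$a\$ (one hidden layer, a sigmoid for the gates) and all shapes are my assumptions for illustration, not the exact architecture used in the experiments:

```python
import numpy as np

def init_theta_a(n_m, n_hidden=8, rng=np.random.default_rng(0)):
    """Randomly initialize the (low-dimensional) parameters of a.
    Inputs to a: the scalar product s plus n_m memory values.
    Outputs of a: neuron output, surrogate derivative, n_m gate values."""
    return {
        "W1": rng.normal(0.0, 0.5, (n_hidden, 1 + n_m)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, (2 + n_m, n_hidden)), "b2": np.zeros(2 + n_m),
    }

def a_forward(theta_a, s, m):
    """The activation-function net a: maps (s, memory) to (output o, surrogate derivative d, gates g)."""
    x = np.concatenate(([s], m))
    h = np.tanh(theta_a["W1"] @ x + theta_a["b1"])
    y = theta_a["W2"] @ h + theta_a["b2"]
    o, d = np.tanh(y[0]), y[1]
    g = 1.0 / (1.0 + np.exp(-y[2:]))          # gates squashed into (0, 1)
    return o, d, g

def neuron_step(theta_a, v, w, bias, m_prev):
    """One timestep of a neuron in b that uses a as its activation function."""
    s = float(np.dot(v, w) + bias)            # ordinary scalar product
    o, d, g = a_forward(theta_a, s, m_prev)
    m_new = (1.0 - g) * m_prev + g * s        # gated memory update m_t
    return o, d, m_new
```

Here `d` is the value that replaces the activation function's derivative when backpropagating through \$b\$, and `m_new` is carried over as the memory for the next timestep.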
But for \$a\$ to actually be useful to \$b\$, \$\theta_a\$ needs to be chosen so that the error of \$b\$ decreases during training. This can be achieved by training \$\theta_a\$ with hill climbing, a genetic algorithm, or reinforcement learning, which is feasible because the dimension of \$\theta_a\$ is quite low.
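As a sketch of the simplest of these options, hill climbing over \$\theta_a\$ could look roughly like this. The fitness function `train_b_and_return_error` is hypothetical: it stands for training \$b\$ from scratch with the candidate \$\theta_a\$ and returning the resulting test error.

```python
import numpy as np

def hill_climb_theta_a(theta_a_init, train_b_and_return_error,
                       n_iters=200, step=0.05, seed=0):
    """Keep the best theta_a found so far; try a random perturbation and
    accept it whenever b reaches a lower error when trained with it."""
    rng = np.random.default_rng(seed)
    best = {k: v.copy() for k, v in theta_a_init.items()}
    best_error = train_b_and_return_error(best)
    for _ in range(n_iters):
        candidate = {k: v + step * rng.normal(size=v.shape) for k, v in best.items()}
        error = train_b_and_return_error(candidate)   # expensive: a full training run of b
        if error < best_error:
            best, best_error = candidate, error
    return best, best_error
```

A genetic algorithm or a reinforcement learning method would replace this outer loop but evaluate candidates the same way, by how well \$b\$ trains with them.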
If we view it as a reinforcement learning problem, the state \$S\$ consists of the input values to \$a\$, the actions \$A\$ are the output values of \$a\$, and the reward is the test error of \$b\$. The problem here is that a reward depending on \$(S,A)\$ is only observable after a weight update.
I experimented with a bit-repetition task, training \$\theta_a\$ with a genetic algorithm. It actually worked: after a night of training on 8 CPU cores, the net \$b\$ learned to repeat all bits while only backpropagating the error at the current output, without using backpropagation through time. However, I removed the memory and gates and made \$a\$ itself recurrent, since the restriction that \$a\$ must not be recurrent is only needed to make reinforcement learning easy. \$b\$ was a network of two 10-dimensional layers with \$a\$ as activation function, followed by a tanh output layer with dim = 1. It was tasked with recalling 9 bits after the sequence had been read completely.
After that, I tried using the trained \$\theta_a\$ in a bigger network tasked with predicting the next letter of a text, but it didn't work.
Future work includes setting up a reinforcement learning environment for training \$\theta_a\$.