Artificial neural networks have achieved great success in the field of computer vision. However, deep neural networks themselves remain a black box to us, and there is no universal approach for judging the efficiency of a network structure. Besides that, state-of-the-art computer vision algorithms struggle on some tasks that are easy for humans. Our work applies a method that is promising for opening the black box, information bottleneck theory, to analyze a newly developed RNN cell model, hGRU, which manages to solve pattern recognition problems with long-range dependencies. Three points are made in our work: First, the hGRU is superior on long-range dependent problems. Second, the learning of the hGRU layer is concentrated. Third, the connection between the hGRU layer and the output layer restricts the model's performance.
The following figure reproduces the results of information bottleneck theory. Each trajectory shows one layer's dynamics over the whole training process. The leftmost trajectory belongs to the output layer and the others to hidden layers; as for layer order, earlier hidden layers appear farther to the right.
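The trajectories above are built by estimating each layer's mutual information with the labels at many checkpoints during training. A minimal sketch of the common binning estimator used in information-plane analyses is given below; the bin count and the identity `I(T; Y) = H(T) - H(T | Y)` are standard, but the specific function names and defaults here are our own assumptions, not the exact pipeline used in the experiments.

```python
import numpy as np

def mutual_information(t, y, n_bins=30):
    """Estimate I(T; Y) in bits by discretizing activations T.

    t: (n_samples, n_units) layer activations
    y: (n_samples,) integer labels
    n_bins is a hyperparameter of the binning estimator (assumed value).
    """
    # Bin every unit's activation, then treat each binned activation
    # pattern as one discrete symbol per sample.
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    digitized = np.digitize(t, edges[1:-1])          # (n_samples, n_units)
    _, t_ids = np.unique(digitized, axis=0, return_inverse=True)

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    # I(T; Y) = H(T) - H(T | Y), with H(T | Y) averaged over classes.
    h_t = entropy(t_ids)
    h_t_given_y = 0.0
    for c in np.unique(y):
        mask = y == c
        h_t_given_y += mask.mean() * entropy(t_ids[mask])
    return h_t - h_t_given_y
```

Evaluating this estimator on stored activations at successive checkpoints yields one point per checkpoint, and connecting them produces the trajectories shown in the information plane.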
This figure shows the information planes containing the output and hGRU layers for models with 4, 6, and 8 timesteps. The output-layer trajectory, on the left, has an irregular shape, while the hGRU trajectory appears as a concave curve. During training, both layers' mutual information with the labels increases from zero to around one bit. Only the output layer in the four-timestep information plane reaches a much lower mutual information with the labels than the others, which agrees with the fact that the four-timestep model fails on the pathfinder problem.
Three conclusions can be drawn from this result. First, the hGRU layer is superior at capturing information in long-range dependent problems. Note that the hGRU layer makes steady progress instead of moving in roundabout ways; its trajectory is relatively smooth. At the beginning of training the trajectory's slope is small, as the layer absorbs more task-unrelated information. After that, it takes in more and more task-related information as the slope grows, which suggests that it finds the correct way to capture target-relevant information after this first exploration stage. It then stays almost still for the following 35,000 steps. Second, the learning of the hGRU layer is concentrated: most of the learning happens around step 14,000. This agrees with our claim about the hGRU layer's superiority, because it learns within fewer training steps. Even in the four-timestep model, the hGRU layer finishes its learning before training step 21,000. Third, the information transmission between the hGRU layer and the output layer is the restriction on model performance. From step 21,000 to 56,000, the hGRU layer stays almost unchanged, waiting for the output layer to learn. By analogy, the output layer becomes the bottleneck for the whole network's performance. Enhancing the representational capacity between these two layers, for instance by adding some hidden layers, might be a good choice.
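The suggestion above, adding hidden layers between the hGRU layer and the output, can be sketched as a deeper readout head. The sketch below is a hypothetical PyTorch module under assumed dimensions (channel count, hidden width, class count); the original model's actual readout may differ, and only the idea of inserting extra layers comes from the text.

```python
import torch
import torch.nn as nn

class DeeperReadout(nn.Module):
    """Hypothetical readout: final hGRU state -> extra hidden layers -> logits.

    All dimensions here are assumptions for illustration, not the
    configuration used in the original experiments.
    """

    def __init__(self, hgru_channels=25, hidden_dim=64, n_classes=2):
        super().__init__()
        # Global average pooling collapses the spatial hGRU state map
        # into a feature vector (an assumed design choice).
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(hgru_channels, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),  # class logits
        )

    def forward(self, h):
        # h: (batch, channels, height, width) final hGRU hidden state
        z = self.pool(h).flatten(1)
        return self.mlp(z)
```

Whether the extra capacity actually shortens the output layer's slow learning phase observed between steps 21,000 and 56,000 would need to be verified empirically on the information plane.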