
Weight-Tied Linear Autoencoders and the Implicit Bias of Dropout

by: HENGXIANG    2020-11-23
Modern deep neural networks typically have a huge number of parameters, often more than the size of the training data. As a result, deep networks have a strong tendency to overfit. Many techniques alleviate this tendency, including L1 and L2 regularization, early stopping, ensembling, and dropout. During training, dropout randomly discards hidden neurons together with their connections in order to break up co-adaptation between neurons. Although dropout has been hugely successful in training deep neural networks, theoretical understanding of how dropout provides regularization in deep learning is still limited. Recently, Poorya Mianjy, Raman Arora, and Rene Vidal of Johns Hopkins University studied the implicit bias introduced by dropout in the paper On the Implicit Bias of Dropout, submitted to ICML 2018.

Weight-tied linear autoencoder

To make the mechanism of dropout easier to understand, the researchers analyze its behavior in a simple model: a linear network with a single hidden layer. The goal of the network is to find weight matrices U and V that minimize the expected (squared) loss

    min_{U,V} E_{x∼D} ‖y − U V^⊤ x‖²,

where x is the input, y is the label, D is the distribution of the input x, and h = V^⊤ x is the hidden layer. The learning algorithm is stochastic gradient descent with dropout, whose objective is

    min_{U,V} E_{x∼D, B} ‖y − (1/θ) U B V^⊤ x‖²,

where B = diag(b_1, …, b_r) with the b_i drawn independently from Bernoulli(θ), so the dropout rate is 1 − θ. This objective is equivalent to (see Appendix A.1 of the paper for the derivation)

    min_{U,V} E_{x∼D} ‖y − U V^⊤ x‖² + λ Σ_{i=1}^r ‖u_i‖² ‖v_i‖²,

where λ = (1 − θ)/θ. Setting U = V, the researchers further simplify the model to a weight-tied single-hidden-layer linear autoencoder. Accordingly, the network's objective becomes

    min_U E_{x∼D} ‖x − U U^⊤ x‖² + λ Σ_{i=1}^r ‖u_i‖⁴.

The researchers prove that if U is a global optimum of this objective, then all columns of U have equal norm. In other words, dropout tends to assign equal weight to all hidden nodes: it adds an implicit bias to the whole network that pushes the hidden nodes toward having similar influence, rather than letting a small number of hidden nodes dominate.

The figure above visualizes the effect of different values of the parameter λ. The network is a single-hidden-layer linear autoencoder with one-dimensional input, one-dimensional output, and a hidden layer of width 2. When λ = 0, the problem reduces to plain squared-loss minimization. When λ > 0, the global optima shrink toward the origin, and all local minima are global minima (see Section 4 of the paper for the proof). As λ increases, the global optima shrink further toward the origin.

Linear single-hidden-layer network

The researchers then generalize this result to linear single-hidden-layer networks. Recall that the objective of such a network is

    min_{U,V} E_{x∼D} ‖y − U V^⊤ x‖² + λ Σ_{i=1}^r ‖u_i‖² ‖v_i‖².

As in the weight-tied case, the researchers prove that if (U, V) is a global optimum of this objective, then

    ‖u_i‖ ‖v_i‖ = ‖u_1‖ ‖v_1‖  for all i = 1, …, r,

where r is the width of the hidden layer.
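To make the equivalence above concrete, here is a small numerical check in Python/NumPy. It is only an illustration: the dimensions, retain probability θ, and random weights below are arbitrary choices, not values from the paper. For a fixed input x and label y, it estimates the dropout objective E_B ‖y − (1/θ) U B V^⊤ x‖² by Monte Carlo over the dropout mask and compares it with the squared loss plus the explicit per-sample regularizer λ Σ_i ‖u_i‖² (v_i^⊤ x)², which averages to λ Σ_i ‖u_i‖² ‖v_i‖² when x has identity covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative sizes: input dim, output dim, hidden width.
d_in, d_out, r = 8, 5, 4
theta = 0.7                    # retain probability; the dropout rate is 1 - theta
lam = (1 - theta) / theta      # lambda in the equivalent regularized objective

U = rng.normal(size=(d_out, r))
V = rng.normal(size=(d_in, r))
x = rng.normal(size=d_in)
y = rng.normal(size=d_out)

# Monte Carlo estimate of the dropout objective for this fixed (x, y):
#   E_B ||y - (1/theta) U B V^T x||^2,  B = diag(b), b_i ~ Bernoulli(theta) i.i.d.
z = V.T @ x                                           # hidden activations V^T x
n_samples = 200_000
masks = rng.binomial(1, theta, size=(n_samples, r))   # one dropout mask per sample
preds = (masks * z) @ U.T / theta                     # each row is (1/theta) U B V^T x
mc_estimate = np.mean(np.sum((preds - y) ** 2, axis=1))

# Closed form for a fixed x: squared loss + lam * sum_i ||u_i||^2 (v_i^T x)^2.
# Taking the expectation over x ~ N(0, I) turns the second term into
# lam * sum_i ||u_i||^2 ||v_i||^2, the regularizer discussed above.
closed_form = np.sum((y - U @ z) ** 2) + lam * np.sum(np.sum(U ** 2, axis=0) * z ** 2)

print(f"Monte Carlo dropout objective: {mc_estimate:.4f}")
print(f"squared loss + regularizer   : {closed_form:.4f}")
```

Up to Monte Carlo noise, the two printed numbers agree, which is exactly the statement that dropout training on this network optimizes the squared loss plus the column-wise product regularizer.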
To prove this, the researchers bring in a mathematical tool: the objective of the linear single-hidden-layer network is equivalent to a regularized matrix factorization problem. Using this equivalence, they show that the global optimum can be found in polynomial time.

Experiments

The researchers tested several models to verify the theoretical results above.

The figure above visualizes the convergence of dropout. As in the earlier visualization, the model is a single-hidden-layer linear autoencoder with one-dimensional input, one-dimensional output, and a hidden layer of width 2, with inputs sampled from the standard normal distribution. The green dot marks the initial iterate and the red dots mark the global optima. The figure shows that, for different values of λ, dropout converges quickly to a global optimum.

The researchers also tested a shallow linear network. The input x ∈ ℝ^80 is sampled from the standard normal distribution, and the output y ∈ ℝ^120 is generated as y = Mx, where M ∈ ℝ^{120×80} has left and right singular subspaces sampled uniformly at random and a spectrum with exponential decay (a minimal sketch of this setup is given at the end of this article). The figure below shows combinations of different parameter values (λ ∈ {0.1, 0.5, 1}) and different hidden-layer widths (r ∈ {20, 80}). The blue curves show the dropout objective at different numbers of iterations, and the red line marks the optimal value of the objective; results are averaged over roughly 50 runs. Top: r = 20; bottom: r = 80.

The final figure above shows the variance of the "importance" scores, computed as ‖u_i^t‖ ‖v_i^t‖, where t denotes the iteration and i the hidden node. As dropout converges, the variance of the importance scores decreases monotonically and eventually falls to zero; the larger λ is, the faster it falls.

Conclusion

This theoretical study confirms that dropout is a process that distributes weight evenly across hidden nodes, preventing co-adaptation. It also explains theoretically why dropout can converge efficiently to a global optimum. Since the researchers used single-hidden-layer linear neural networks, the natural next directions are: deeper linear neural networks, and shallow neural networks with nonlinear activations such as ReLU (ReLU can accelerate training).
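For readers who want to play with the shallow-network experiment described above, here is a minimal, unofficial NumPy sketch. It is not the authors' code: the spectrum decay rate, retain probability, step size, batch size, initialization scale, and iteration count are guesses made for illustration. It trains a single-hidden-layer linear network with (inverted) dropout on data y = Mx, where M has random singular subspaces and an exponentially decaying spectrum, and tracks the variance of the importance scores ‖u_i‖ ‖v_i‖ across hidden nodes, which the theory predicts should shrink as training converges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data model: x ~ N(0, I) in R^80, y = M x in R^120, where M has uniformly random
# left/right singular subspaces and an exponentially decaying spectrum.
d_in, d_out, r = 80, 120, 20
left = np.linalg.qr(rng.normal(size=(d_out, d_out)))[0][:, :d_in]
right = np.linalg.qr(rng.normal(size=(d_in, d_in)))[0]
spectrum = np.exp(-0.1 * np.arange(d_in))              # decay rate is a guess
M = left @ np.diag(spectrum) @ right.T

theta = 2.0 / 3.0                                      # retain prob -> lambda = (1-theta)/theta = 0.5
lr, batch, iters = 0.01, 32, 5000                      # training hyperparameters (guesses)

U = 0.1 * rng.normal(size=(d_out, r))
V = 0.1 * rng.normal(size=(d_in, r))

for t in range(1, iters + 1):
    X = rng.normal(size=(batch, d_in))                 # inputs x ~ N(0, I)
    Y = X @ M.T                                        # targets y = M x
    mask = rng.binomial(1, theta, size=(batch, r)) / theta   # inverted dropout mask B/theta
    H = (X @ V) * mask                                 # dropped-out hidden layer (1/theta) B V^T x
    E = H @ U.T - Y                                    # residuals of the network output
    grad_U = E.T @ H / batch                           # gradient of 0.5 * mean squared error w.r.t. U
    grad_V = X.T @ ((E @ U) * mask) / batch            # gradient w.r.t. V (mask re-applied by chain rule)
    U -= lr * grad_U
    V -= lr * grad_V

    if t % 1000 == 0:
        # "Importance" score of hidden node i: ||u_i|| * ||v_i||. Its variance across
        # nodes should decrease as dropout equalizes the hidden nodes.
        importance = np.linalg.norm(U, axis=0) * np.linalg.norm(V, axis=0)
        print(f"iter {t:5d}  variance of importance scores: {importance.var():.6f}")
```

The sketch only monitors the equalization effect; reproducing the paper's objective-versus-optimum curves would additionally require the polynomial-time computation of the global optimum mentioned above.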