package org.deeplearning4j.nn.conf;
/**Gradient normalization strategies. These are applied on raw gradients, before the gradients are passed to the
* updater (SGD, RMSProp, Momentum, etc.).
* None = no gradient normalization (default)
*
* RenormalizeL2PerLayer = rescale gradients by dividing by the L2 norm of all gradients for the layer.
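*
* A minimal sketch of this rescaling on a flattened gradient array (plain Java, for illustration only;
* not DL4J's internal implementation):
* <pre>{@code
* static void renormalizeL2PerLayer(double[] g) {
*     double sumSq = 0.0;
*     for (double v : g) sumSq += v * v;
*     double l2 = Math.sqrt(sumSq);
*     if (l2 > 0.0) {                              // avoid division by zero for an all-zero gradient
*         for (int i = 0; i < g.length; i++) g[i] /= l2;
*     }
* }
* }</pre>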
*
* RenormalizeL2PerParamType = rescale gradients by dividing by the L2 norm of the gradients, separately for
* each type of parameter within the layer.
* This differs from RenormalizeL2PerLayer in that here, each parameter type (weight, bias etc) is normalized separately.
* For example, in an MLP/feed-forward network (where G is the gradient vector), the output is as follows (see the sketch after this list):
*
* - GOut_weight = G_weight / l2(G_weight)
* - GOut_bias = G_bias / l2(G_bias)
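*
* A minimal illustrative sketch, assuming the layer's gradients are held in a map keyed by parameter
* type (e.g. "W", "b"); this is not DL4J's internal representation:
* <pre>{@code
* static void renormalizeL2PerParamType(java.util.Map<String, double[]> gradientsByParamType) {
*     for (double[] g : gradientsByParamType.values()) {
*         double sumSq = 0.0;
*         for (double v : g) sumSq += v * v;
*         double l2 = Math.sqrt(sumSq);
*         if (l2 > 0.0) {
*             for (int i = 0; i < g.length; i++) g[i] /= l2;   // each parameter type is normalized on its own
*         }
*     }
* }
* }</pre>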
*
*
*
* ClipElementWiseAbsoluteValue = clip the gradients on a per-element basis.
* For each gradient g, set g <- sign(g) * min(maxAllowedValue, |g|).
* That is, if a parameter gradient has absolute value greater than the threshold, clip it to the threshold (keeping its sign).
* For example, if threshold = 5, then values in range -5<g<5 are unmodified; values <-5 are set
* to -5; values >5 are set to 5.
* This was proposed by Mikolov (2012), Statistical Language Models Based on Neural Networks (thesis),
* http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf
* in the context of learning recurrent neural networks.
* Threshold for clipping can be set in Layer configuration, using gradientNormalizationThreshold(double threshold)
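*
* A minimal sketch of the element-wise clipping (plain Java, for illustration only):
* <pre>{@code
* static void clipElementWiseAbsoluteValue(double[] g, double threshold) {
*     for (int i = 0; i < g.length; i++) {
*         // clip each element to the range [-threshold, threshold], keeping its sign
*         g[i] = Math.signum(g[i]) * Math.min(threshold, Math.abs(g[i]));
*     }
* }
* }</pre>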
*
*
* ClipL2PerLayer = conditional renormalization. Somewhat similar to RenormalizeL2PerLayer, this strategy
* scales the gradients if and only if the L2 norm of the gradients (for entire layer) exceeds a specified
* threshold. Specifically, if G is the gradient vector for the layer, then:
*
* - GOut = G if l2Norm(G) < threshold (i.e., no change)
* - GOut = threshold * G / l2Norm(G) otherwise
*
* Thus, the l2 norm of the scaled gradients will not exceed the specified threshold, though it may be smaller than the threshold.
* See: Pascanu, Mikolov, Bengio (2012), On the difficulty of training Recurrent Neural Networks,
* http://arxiv.org/abs/1211.5063
* Threshold for clipping can be set in Layer configuration, using gradientNormalizationThreshold(double threshold)
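*
* A minimal sketch of the conditional rescaling (plain Java, for illustration only):
* <pre>{@code
* static void clipL2PerLayer(double[] g, double threshold) {
*     double sumSq = 0.0;
*     for (double v : g) sumSq += v * v;
*     double l2 = Math.sqrt(sumSq);
*     if (l2 > threshold) {
*         double scale = threshold / l2;           // rescale so that l2Norm(g) == threshold
*         for (int i = 0; i < g.length; i++) g[i] *= scale;
*     }
*     // otherwise the gradients are left unchanged
* }
* }</pre>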
*
*
* ClipL2PerParamType = conditional renormalization. Very similar to ClipL2PerLayer; however, instead of clipping
* per layer, clipping is applied to each parameter type separately.
* For example, in a recurrent neural network, input weight gradients, recurrent weight gradients and bias gradients are all
* clipped separately. Thus, if one set of gradients is very large, it may be clipped while the other gradients are left
* unmodified.
* Threshold for clipping can be set in Layer configuration, using gradientNormalizationThreshold(double threshold)
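*
* A configuration sketch showing how a strategy and its threshold might be set on a layer builder
* (layer type and hyperparameter values here are illustrative only):
* <pre>{@code
* DenseLayer layer = new DenseLayer.Builder()
*         .nIn(100).nOut(50)
*         .gradientNormalization(GradientNormalization.ClipL2PerParamType)
*         .gradientNormalizationThreshold(1.0)
*         .build();
* }</pre>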
*
* @author Alex Black
*/
public enum GradientNormalization {
None, RenormalizeL2PerLayer, RenormalizeL2PerParamType, ClipElementWiseAbsoluteValue, ClipL2PerLayer, ClipL2PerParamType
}