{"title": "Understanding and Improving Layer Normalization", "book": "Advances in Neural Information Processing Systems", "page_first": 4381, "page_last": 4391, "abstract": "Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many of previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization by re-centering and re-scaling backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains the state-of-the-art performance on En-Vi machine translation. \nTo address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), by replacing the bias and gain with a new transformation function. Experiments show that AdaNorm demonstrates better results than LayerNorm on seven out of eight datasets.", "full_text": "Understanding and Improving Layer Normalization\n\nJingjing Xu1, Xu Sun1,2\u2217, Zhiyuan Zhang1, Guangxiang Zhao2, Junyang Lin1\n1 MOE Key Lab of Computational Linguistics, School of EECS, Peking University\n\n2 Center for Data Science, Peking University\n\n{jingjingxu,xusun,zzy1210,zhaoguangxiang,linjunyang}@pku.edu.cn\n\nAbstract\n\nLayer normalization (LayerNorm) is a technique to normalize the distributions\nof intermediate layers. It enables smoother gradients, faster training, and better\ngeneralization accuracy. 
However, it is still unclear where this effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization because they re-center and re-scale backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains state-of-the-art performance on En-Vi machine translation.\nTo address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), which replaces the bias and gain with a new transformation function. Experiments show that AdaNorm achieves better results than LayerNorm on seven out of eight datasets.\n\n1 Introduction\n\nNeural network training has long been a focus of deep learning research. One prominent advance is the application of normalization methods. Ioffe and Szegedy [2015] introduce the concept of normalizing layer inputs with Batch Normalization (BatchNorm). It is widely believed that by controlling the mean and variance of layer inputs across mini-batches, BatchNorm stabilizes the distribution and improves training efficiency. Following this work, Lei Ba et al. [2016] point out its limitation in Recurrent Neural Networks (RNNs) and propose Layer Normalization (LayerNorm), which is performed across the neurons in a layer. LayerNorm adapts well to RNNs and self-attention-based models. A typical example is its application in the state-of-the-art framework, Transformer [Vaswani et al., 2017]. 
LayerNorm enables faster training of Transformer and is irreplaceable in this framework.\nDespite its great success, it is still unclear why LayerNorm is so effective. The widely accepted explanation is that forward normalization brings distribution stability [Ioffe and Szegedy, 2015, Lei Ba et al., 2016]. Recent studies show that the effects of BatchNorm are not related to the stability of input distributions [Zhang et al., 2017, Santurkar et al., 2018]; they propose instead that BatchNorm is effective because normalization smooths the optimization landscape. However, it is still unclear whether these theories can explain the success of LayerNorm.\nThe main contribution of this paper is to explore how LayerNorm works. Through a series of analyses, we find that the derivatives of the mean and variance are important because they re-center and re-scale backward gradients. Furthermore, contrary to our expectation, the bias and gain do not work in most cases. The details of our findings are illustrated below.\n\n*Corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe derivatives of the mean and variance are more important to LayerNorm than forward normalization. Many previous studies believe that forward normalization is the only decisive factor in LayerNorm: it makes the input distribution more stable and thus brings better convergence. Unlike them, our experimental results show that forward normalization has little to do with the effectiveness, and that the derivatives of the mean and variance play a significant role in LayerNorm. To illustrate how these derivatives work, we propose DetachNorm, which adds a detaching operation to LayerNorm to change the mean and variance from variables to constants. It preserves the forward re-centering and re-scaling but cuts off the derivatives of the mean and variance with respect to the input. 
DetachNorm performs worse than LayerNorm on six out of eight datasets, which proves that the derivatives of the mean and variance are useful to LayerNorm. Furthermore, to investigate the reason for this observation, we analyze the gradients in LayerNorm and DetachNorm, and find that the derivatives of the means re-center gradients and the derivatives of the variances re-scale gradients.\nThe parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. The bias and gain apply an affine transformation to normalized vectors. They are expected to enhance the expressive power by re-shaping the distribution. To evaluate their effect on results, we build a simple version of LayerNorm (LayerNorm-simple) by removing the bias and gain. Our experimental results show that LayerNorm-simple achieves better results than LayerNorm on four datasets. It even achieves state-of-the-art performance on En-Vi machine translation. By comparing the loss curves of LayerNorm with and without the bias and gain, we find that the bias and gain cause over-fitting. We speculate that the over-fitting arises mainly because the bias and gain are learned from the training set and cannot adjust themselves to different input distributions at test time.\nMotivated by this assumption, we propose a novel normalization method, Adaptive Normalization (AdaNorm). AdaNorm replaces the bias and gain with a new transformation function that adaptively adjusts scaling weights based on input values. We evaluate AdaNorm and LayerNorm on eight datasets, covering the tasks of machine translation, language modeling, text classification, image classification, and dependency parsing. 
Results show that AdaNorm achieves better results on seven datasets.\n\n2 Preliminaries\n\nIn this section, we first review the algorithm of LayerNorm and then introduce the datasets and models used in the following analysis sections.\n\n2.1 LayerNorm Algorithm\n\nLet x = (x_1, x_2, ..., x_H) be the vector representation of an input of size H to a normalization layer. LayerNorm re-centers and re-scales the input x as\n\nh = g \\odot N(x) + b, \\quad N(x) = \\frac{x - \\mu}{\\sigma}, \\quad \\mu = \\frac{1}{H}\\sum_{i=1}^{H} x_i, \\quad \\sigma = \\sqrt{\\frac{1}{H}\\sum_{i=1}^{H}(x_i - \\mu)^2} \\quad (1)\n\nwhere h is the output of a LayerNorm layer, \\odot is the element-wise product, and \\mu and \\sigma are the mean and standard deviation of the input. The bias b and gain g are parameters with the same dimension H.\n\n2.2 Experimental Setup\n\nTo investigate how LayerNorm works, we conduct a series of experiments in this paper. Since LayerNorm is a default setting in Transformer [Vaswani et al., 2017] and Transformer-XL [Dai et al., 2019], which have shown state-of-the-art results on a variety of tasks (e.g., machine translation), we primarily consider normalization on Transformer and Transformer-XL networks. Also, to avoid the impact of model architecture, we evaluate the effects of normalization on feed-forward neural networks and convolutional neural networks. The datasets and models are listed below. More details can be found in the arXiv version.\n\nMachine translation includes three widely-used datasets: WMT English-German (En-De), IWSLT 14 German-English (De-En) [Cettolo et al., 2014], and IWSLT 15 English-Vietnamese (En-Vi) [Cettolo et al., 2015]. For all datasets, we use the PreNorm setting, where normalization is applied before each layer. We re-implement Transformer with the released code of Fairseq [Ott et al., 2019].2 
The evaluation metric is BLEU [Papineni et al., 2002].\nFor the En-De dataset, we use the same dataset splits and the same compound splitting as previous work [Vaswani et al., 2017]. Byte-pair encoding (BPE) is used to build the vocabularies. We use the shared embedding setting, and the vocabulary size is 32,765. We use "transformer_wmt_en_de_big_t2t" as our basic model. The dropout rate is 0.3. The learning rate is 0.001. The training batch size is 4,096 tokens. We use the Adam optimizer with β1 = 0.9 and β2 = 0.98. The number of warmup steps is 4K.\nThe De-En dataset is provided by the IWSLT 2014 Evaluation Campaign. We use the same dataset splits as previous work [Ott et al., 2019, Ranzato et al., 2016, Wiseman and Rush, 2016]. It contains 153K sentences for training, 7K sentences for validation, and 7K sentences for testing. BPE is used to build the vocabularies. We use the shared embedding setting, and the vocabulary size is 10,149. We use "transformer_iwslt_de_en" as our basic model. The dropout rate is 0.3, the attention dropout rate is 0.1, and the activation dropout is 0.1. The initial learning rate is 1e-07 and the learning rate is 0.0015. The training batch size is 4,096 tokens. We update parameters every 2 steps. The number of warmup steps is 8K.\nThe En-Vi dataset contains 133K training sentence pairs provided by the IWSLT 2015 Evaluation Campaign. We use TED tst2012 (1,553 sentences) as the validation set and TED tst2013 (1,268 sentences) as the test set. BPE is used to build the input and output vocabularies. The English and Vietnamese vocabulary sizes are 7,669 and 6,669 respectively. The dropout rate is 0.1. The learning rate is 0.001. The training batch size is 4,096 tokens. The number of warmup steps is 8K. We use "transformer_wmt_en_de" as our basic model. 
We use the Adam optimizer with β1 = 0.9 and β2 = 0.98.\nLanguage modeling includes a large dataset, Enwiki8,3 which contains 100M bytes of unprocessed Wikipedia text. We implement a 12-layer Transformer-XL model. The dimension of each layer is 512. Multi-head attention contains 8 heads, and the dimension of each head is 64. The dropout rate is 0.1. The batch size is 22. We use the Adam optimizer with a learning rate of 0.00025. We use the average number of Bits-Per-Character (BPC) as the evaluation metric [Al-Rfou et al., 2018, Dai et al., 2019].\nText classification includes two sentence classification datasets: RT [Pang and Lee, 2005] and SST5 [Socher et al., 2013]. RT is a binary sentiment classification dataset of online movie reviews. We randomly divide all examples into 8,608 for training, 964 for validation, and 1,089 for testing. SST5 is a single-sentence classification dataset built on movie reviews. We run experiments on a five-label set. We build a Transformer model with a 4-layer encoder. The batch size is 4,096 tokens. The word embedding dimension is 128 and the hidden dimension is 128. The dropout rate is 0.2. We use the Adam optimizer with β1 = 0.9 and β2 = 0.998. Normalization is applied before each layer. Accuracy is the evaluation metric.\nImage classification includes a widely-used dataset, MNIST [LeCun et al., 1998]. It consists of 55,000 training images, 5,000 validation images, and an additional 10,000 testing images. We implement a 3-layer convolutional neural network for classification. The first 2D-convolution layer has 1 input channel and 20 output channels; the second has 20 input channels and 50 output channels. We flatten the output of the second 2D-convolution layer and send it to a linear layer. The batch size is 32. We use the Adam optimizer with a learning rate of 0.001. We apply LayerNorm before the activation in every linear layer. We train the model for 20 epochs. 
Normalization is applied before each layer. Accuracy is the evaluation metric.\nDependency parsing includes a dataset, the English Penn TreeBank (PTB) [Marcus et al., 1993]. We follow the standard split of the corpus, with sections 2-21 as the training set (39,832 sentences, 1,900,056 transition examples), section 22 as the validation set (1,700 sentences, 80,234 transition examples), and section 23 as the testing set (2,416 sentences, 113,368 transition examples). We implement an MLP-based parser following the work of Chen and Manning [2014]. The dimension of the hidden state is 512, the batch size is 1,024, and the dropout rate is 0.2. We use the Adam optimizer and initialize the learning rate to 0.001. We apply normalization before the activation in every linear layer. Following Chen and Manning [2014], we use the Unlabeled Attachment Score (UAS) as the evaluation metric.\n\n2https://github.com/pytorch/fairseq\n3http://mattmahoney.net/dc/text.html\n\n3 Understanding LayerNorm\n\nTo investigate how LayerNorm facilitates training, we conduct ablation studies to observe each part's contribution to the performance. In this section, we analyze the effects of the bias and gain, forward normalization, and backward normalization.\n\nTable 1: The bias and gain do not work on six out of eight datasets. "w/o Norm" is a naive model without LayerNorm. "LayerNorm-simple" is a variant of LayerNorm that drops the bias and gain. "(+)" means higher is better. 
\u201c(-)\u201d means lower is better.\n\nModels\n\nMachine Translation\n\nLanguage Modeling\n\nEn-De(+) De-En(+) En-Vi(+)\n\nEnwiki8(-)\n\nModel Layers\n\n12\n\nw/o Norm\nLayerNorm\n\nLayerNorm-simple\n\nDiverge\n\n28.3\n28.4\n\n12\n34.0\n35.5\n35.5\n\n12\n28.4\n31.2\n31.6\n\n12\n1.04\n1.07\n1.07\n\nClassi\ufb01cation\n\nParsing\nRT(+) SST5(+) MNIST(+) PTB(+)\n\n4\n\n76.85\n77.21\n76.66\n\n4\n\n38.55\n39.23\n40.54\n\n3\n\n99.14\n99.13\n99.09\n\n3\n\n88.31\n89.12\n89.19\n\n3.1 The Effect of the Bias and Gain in LayerNorm\n\nThe bias and gain do not work in most cases. From Table 1, it can be found that LayerNorm\nis an effective approach. It brings large performance improvements on six out of eight datasets\ncompared with the naive baseline without LayerNorm (\u201cw/o Norm\u201d). By comparing LayerNorm\nand LayerNorm-simple, we \ufb01nd that dropping the bias and gain (\u201cLayerNorm-simple\u201d) does not\ndecrease the performance on six datasets. Surprisingly, LayerNorm-simple outperforms LayerNorm\non four datasets, even with a 0.4 BLEU improvement on En-Vi and a 1.31 ACC improvement on\nSST-5. Also, it needs to notice that 31.6 achieved by LayerNorm-simple is the state-of-the-art result\non En-Vi machine translation.\nFurthermore, we \ufb01nd that the bias and gain increase the risk of over-\ufb01tting. Initially, considering that\ninput information may be lost when normalizing input distributions, the bias and gain are designed\nfor af\ufb01ne transformation on normalized vectors to enhance the expressive power. However, since\nthe bias and gain are learned from the training set and they ignore the input distributions of the\ntesting data, the risk of over-\ufb01tting may increase in LayerNorm. It is veri\ufb01ed by convergence curves\nin Figure 1. LayerNorm achieves lower training loss (or BPC) but higher validation loss (or BPC)\nthan LayerNorm-simple on En-Vi, Enwiki8. 
These results indicate that the current affine transformation mechanism has a potential risk of over-fitting and needs to be further improved.\n\nFigure 1: Convergence curves of LayerNorm and LayerNorm-simple on En-Vi and Enwiki8. Lower is better. The bias and gain increase the risk of over-fitting.\n\nTable 2: The derivatives of the mean and variance matter. "w/o Norm" is the naive model without normalization. "DetachNorm" is a variant of "LayerNorm-simple" that detaches the derivatives of the mean and variance. "(+)" means higher is better. "(-)" means lower is better. The top table shows the effect of forward normalization. The bottom table shows the effect of the derivatives of the mean and variance.\n\nModels | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST5(+) | MNIST(+) | PTB(+)\nModel Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3\nw/o Norm | Diverge | 34.0 | 28.4 | 1.04 | 76.85 | 38.55 | 99.14 | 88.31\nDetachNorm | Diverge | 33.9 | 27.7 | 1.12 | 76.40 | 40.04 | 99.10 | 89.79\nImprovement | -- | -0.1 | -0.7 | -0.08 | -0.45 | 1.49 | -0.04 | 1.48\n\nModels | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST5(+) | MNIST(+) | PTB(+)\nModel Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3\nDetachNorm | Diverge | 33.9 | 27.7 | 1.12 | 76.40 | 40.04 | 99.10 | 89.79\nLayerNorm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19\nImprovement | -- | 1.6 | 3.9 | 0.05 | 0.26 | 0.50 | -0.01 | -0.60\n\n3.2 The Effect of Forward Normalization\n\nFor easier analysis, we only consider LayerNorm without the bias and gain here. Let y = (y_1, y_2, ..., y_H) be the normalized vector; the calculation process of LayerNorm without the bias and gain can be written as\n\ny = \\frac{x - \\mu}{\\sigma}, \\quad \\mu = \\frac{1}{H}\\sum_{i=1}^{H} x_i, \\quad \\sigma = \\sqrt{\\frac{1}{H}\\sum_{i=1}^{H}(x_i - \\mu)^2} \\quad (2)\n\nwhere x = (x_1, x_2, ..., x_H) is the input vector and H is the dimension of x. \\mu and \\sigma are the mean and standard deviation of x_1, x_2, ..., x_H. Then, suppose \\bar{y} and D_y are the mean and variance of y_1, y_2, ..., y_H. It is easy to verify\n\n\\bar{y} = \\frac{1}{H}\\sum_{i=1}^{H} y_i = \\frac{1}{H}\\sum_{i=1}^{H} \\frac{x_i - \\mu}{\\sigma} = 0, \\quad D_y = \\frac{1}{H}\\sum_{i=1}^{H} \\frac{(x_i - \\mu)^2}{\\sigma^2} = 1. \\quad (3)\n\nEq. (3) shows that normalization re-centers and re-scales the input vector x. By now, a widely accepted belief is that the effectiveness of LayerNorm comes from the steady layer distributions brought by forward normalization [Lei Ba et al., 2016]. 
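The re-centering and re-scaling property in Eq. (3) is easy to check numerically. Below is a minimal NumPy sketch of Eq. (2); the function name `layernorm_simple` is ours for illustration, not from the paper's released code:

```python
import numpy as np

def layernorm_simple(x):
    # Eq. (2): re-center by the mean, re-scale by the standard deviation.
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    return (x - mu) / sigma

x = np.array([2.0, -1.0, 5.0, 0.5])
y = layernorm_simple(x)
# Eq. (3): the normalized vector has mean 0 and variance 1.
print(np.isclose(y.mean(), 0.0), np.isclose(y.var(), 1.0))  # True True
```

Note that the output is invariant to shifting and positive scaling of the input (`layernorm_simple(3 * x + 7)` returns the same vector), which is exactly the "steady distribution" that forward normalization provides. In practice a small epsilon is usually added to sigma for numerical stability; it is omitted here to match Eq. (2) exactly.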
To evaluate whether forward normalization explains the effectiveness of LayerNorm, we need to separate the effect on forward layer inputs from that on backward gradients. In this paper, we design a new method, called DetachNorm. The difference between LayerNorm and DetachNorm is that DetachNorm detaches the derivatives of the mean and variance.4 Detaching derivatives means treating the mean and variance as changeable constants rather than variables, so that they do not require gradients in backward propagation. The calculation of DetachNorm can be written as\n\ny = \\frac{x - \\hat{\\mu}}{\\hat{\\sigma}}, \\quad \\hat{\\mu} = \\theta(\\mu), \\quad \\hat{\\sigma} = \\theta(\\sigma) \\quad (4)\n\nwhere \\mu and \\sigma are the mean and standard deviation of the input x, as calculated in Eq. (2). The function \\theta(\\cdot) can be seen as a special copy function, which copies the values of \\mu and \\sigma into the constants \\hat{\\mu} and \\hat{\\sigma}. In all, DetachNorm performs the same forward normalization as LayerNorm does, but cuts off the derivatives of the mean and variance.\nSince DetachNorm re-centers and re-scales in forward propagation exactly as LayerNorm-simple does, the gap between DetachNorm and "w/o Norm" shows the effect of forward normalization. As we can see in Table 2, DetachNorm performs worse than "w/o Norm", showing that forward normalization has little to do with the success of LayerNorm.\nFurthermore, the only difference between DetachNorm and LayerNorm-simple is that DetachNorm detaches the derivatives of the mean and variance. As shown in Table 2, DetachNorm performs worse than LayerNorm-simple on six datasets. This is mainly because DetachNorm converges to much worse local optima than LayerNorm-simple, as shown in Figure 2. The gap between DetachNorm and LayerNorm-simple shows the effectiveness of the derivatives of the mean and variance. By comparing the achieved improvements, we find that the derivatives of the mean and variance bring larger improvements than forward normalization does.\n\nFigure 2: Convergence curves of LayerNorm-simple and DetachNorm on two translation datasets.\n\nThese results demonstrate that the derivatives of the mean and variance play a significant role. In addition, the much worse results of DetachNorm on En-De, De-En, and En-Vi indicate that the derivatives of the mean and variance may be more important for deeper models. In the following section, we give a detailed analysis of why and how the derivatives of the mean and variance contribute to the performance.\n\n4In our implementation, we detach the derivative of the standard deviation, the square root of the variance.\n\n3.3 The Effect of the Derivatives of the Mean and Variance\n\nTo understand how the derivatives of the mean and variance work, we analyze the gradients of LayerNorm-simple and DetachNorm. According to the chain rule, the gradient of x is5\n\n\\frac{\\partial \\ell}{\\partial x} \\leftarrow \\frac{dy}{dx} \\frac{\\partial \\ell}{\\partial y} \\quad (5)\n\nwhere \\ell is the loss function, x is the input vector, and y is the normalized vector. We here analyze the effect of detaching the derivatives of the mean and variance on backward gradients. Our results are summarized in the following theorem, whose proof is listed in the Appendix of the arXiv version.\n\nTheorem 1. Given \\frac{\\partial \\ell}{\\partial y} = (g_1, g_2, ..., g_H)^T, let \\bar{g} and D_g be the mean and variance of g_1, g_2, ..., g_H. For the case of detaching the derivatives of both \\mu and \\sigma, suppose \\frac{\\partial \\ell}{\\partial x} = (a_1, a_2, ..., a_H)^T is the gradient of x with mean \\bar{a} and variance D_a. We have \\bar{a} = \\bar{g}/\\sigma and D_a = D_g/\\sigma^2.\n(1) For the case of standard LayerNorm-simple, suppose \\frac{\\partial \\ell}{\\partial x} = (b_1, b_2, ..., b_H)^T is the gradient of x with mean \\bar{b} and variance D_b. We have \\bar{b} = 0 and D_b \\le D_g/\\sigma^2.\n(2) For the case of detaching only the derivative of \\mu, suppose \\frac{\\partial \\ell}{\\partial x} = (c_1, c_2, ..., c_H)^T is the gradient of x with mean \\bar{c} and variance D_c. We have \\bar{c} = \\bar{g}/\\sigma and D_c \\le D_g/\\sigma^2.\n(3) For the case of detaching only the derivative of \\sigma, suppose \\frac{\\partial \\ell}{\\partial x} = (d_1, d_2, ..., d_H)^T is the gradient of x with mean \\bar{d} and variance D_d. We have \\bar{d} = 0 and D_d = D_g/\\sigma^2.\n\n5When calculating the gradient, we adopt the denominator layout.\n\nBy comparing the case of detaching the derivative of \\mu with that of LayerNorm-simple in Theorem 1, we find that the derivative of \\mu re-centers \\frac{\\partial \\ell}{\\partial x} to zero. By comparing the case of detaching the derivative of \\sigma with that of LayerNorm-simple, we find that the derivative of \\sigma reduces the variance of \\frac{\\partial \\ell}{\\partial x}, which can be seen as a kind of re-scaling. We refer to gradient re-centering and re-scaling as gradient normalization.\nTo further evaluate the effect of gradient normalization on model performance, we test the derivatives of the mean and variance separately. Table 3 shows that detaching the derivative of the variance decreases the performance significantly on deeper networks. 
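The claims of Theorem 1 can be checked numerically with finite differences. The sketch below (NumPy; the helper names are ours, and a linear probe `w` stands in for the upstream gradient) compares the input gradient of LayerNorm-simple, where the mean and standard deviation are functions of x, with that of DetachNorm, where they are frozen constants:

```python
import numpy as np

def grad_fd(f, x, eps=1e-6):
    # Central finite-difference gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
H = 8
x = rng.normal(size=H)
w = rng.normal(size=H)          # plays the role of the upstream gradient g
mu, sigma = x.mean(), x.std()   # np.std uses the population form, as in Eq. (2)

# LayerNorm-simple: mean and std depend on the input.
g_ln = grad_fd(lambda z: float(w @ ((z - z.mean()) / z.std())), x)
# DetachNorm: mean and std are detached, i.e. held constant.
g_dn = grad_fd(lambda z: float(w @ ((z - mu) / sigma)), x)

print(abs(g_ln.mean()) < 1e-6)                    # b_bar = 0 (re-centered)
print(np.isclose(g_dn.mean(), w.mean() / sigma))  # a_bar = g_bar / sigma
print(g_ln.var() <= w.var() / sigma**2 + 1e-9)    # D_b <= D_g / sigma^2
```

All three checks hold: the first and third correspond to case (1) of Theorem 1, and the second to the fully detached (DetachNorm) case. This is only a sanity check on one random input, not a substitute for the proof in the appendix.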
Therefore, it is necessary to control the variance of gradients for deeper networks.\nIn conclusion, LayerNorm normalizes forward layer inputs and backward gradients. The derivatives of the mean and variance play more important roles than forward normalization in LayerNorm. Furthermore, unlike previous work [Santurkar et al., 2018], which only notices that normalization smooths gradients, this paper provides a deeper insight into how normalization impacts backward gradients.\n\nTable 3: The derivative of the variance is more important than that of the mean for deeper networks. "(+)" means higher is better. "(-)" means lower is better.\n\nModels | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST5(+) | MNIST(+) | PTB(+)\nModel Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3\nLayerNorm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19\nDetach Mean | 28.3 | 35.6 | 31.3 | 1.07 | 75.02 | 40.99 | 99.25 | 89.45\nDetach Variance | Diverge | 34.2 | 29.8 | 1.10 | 77.04 | 41.74 | 99.10 | 89.80\n\n4 AdaNorm\n\nAdaNorm adopts a new transformation function that can adaptively control scaling weights for different inputs.6\n\n4.1 AdaNorm Algorithm\n\nFormally, let y = N(x) = (x - \\mu)/\\sigma be the normalized vector, where \\mu and \\sigma are the mean and standard deviation of the input x = (x_1, x_2, ..., x_H). We use \\phi(y), a function of the input, to replace the bias and gain, with the following equation:\n\nz = \\phi(y) \\odot y = \\phi(N(x)) \\odot N(x) \\quad (6)\n\nwhere z = (z_1, z_2, ..., z_H) is the output of AdaNorm and \\odot is the element-wise product. Unlike the bias and gain, which are fixed in LayerNorm, \\phi(y) can adaptively adjust scaling weights based on inputs. To keep training stable, we expect \\phi(\\cdot) to have some properties. First, \\phi(\\cdot) must be differentiable. 
Second, we expect the average scaling weight to be fixed, namely the average of \\phi(y) is a constant C with C > 0. Third, we expect the average of z to be bounded, which avoids the problem of exploding loss; namely, we require that there exists a constant M such that |\\frac{1}{H}\\sum_{i=1}^{H} z_i| < M. Theorem 2 proves that there exists a unique solution satisfying these requirements. The proof is listed in the Appendix of the arXiv version.\n\nTheorem 2. Suppose \\phi(y_i) is differentiable, \\forall y, \\frac{1}{H}\\sum_{i=1}^{H} \\phi(y_i) = C > 0, and \\exists M > 0 s.t. |\\frac{1}{H}\\sum_{i=1}^{H} z_i| < M, where H is the hidden size. Then there exists only one solution satisfying these requirements:\n\n\\phi(y_i) = C(1 - k y_i).\n\nSince 1 - k y_i < 0 would undesirably change the direction of the vector, we expect \\phi(y_i) > 0 to hold, which means y_i < 1/k must hold. Due to the symmetry of y_i, |y_i| < 1/k is required to hold too.\n\n6Our code is released at https://github.com/lancopku/AdaNorm\n\nBased on Chebyshev's inequality, we have\n\nP(|y_i| < 1/k) = P(|y_i - E(y_i)| < 1/k) \\ge 1 - \\frac{D_y}{(1/k)^2} = 1 - k^2 D_y \\quad (7)\n\nwhere D_y is the variance of y = (y_1, y_2, ..., y_H) and H is the dimension of y. Based on Eq. (3), we can verify D_y = 1. If we expect |y_i| < 1/k to hold with probability higher than 99%, k = 1/10 should be chosen based on Eq. (7). Namely, we choose\n\n\\phi(y_i) = C(1 - \\frac{y_i}{10}). \\quad (8)\n\nGiven an input vector x, the complete calculation process of AdaNorm is\n\nz = C(1 - ky) \\odot y, \\quad y = \\frac{x - \\mu}{\\sigma}, \\quad \\mu = \\frac{1}{H}\\sum_{i=1}^{H} x_i, \\quad \\sigma = \\sqrt{\\frac{1}{H}\\sum_{i=1}^{H}(x_i - \\mu)^2} \\quad (9)\n\nwhere C is a hyper-parameter, \\odot is the element-wise product, and k is recommended to be set to 1/10. 
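The complete calculation in Eq. (9) can be sketched in NumPy as follows. This is a forward-only sketch: the function name and the choice C = 1 are our illustrative assumptions, and the gradient detaching of the scaling term used in the paper's implementation is a backward-pass detail with no effect on the forward values computed here:

```python
import numpy as np

def adanorm(x, C=1.0, k=0.1):
    # Eq. (9): normalize as in LayerNorm-simple, then scale
    # element-wise by phi(y) = C * (1 - k * y).
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    y = (x - mu) / sigma
    # In the paper's implementation, C * (1 - k*y) is detached from the
    # gradient graph (treated as a changeable constant); NumPy has no
    # autograd, so this sketch covers only the forward computation.
    phi = C * (1.0 - k * y)
    return phi * y

x = np.array([0.3, -1.2, 2.0, 0.7])
z = adanorm(x)
# Since y has mean 0 by Eq. (3), the average scaling weight equals C.
```

With k = 1/10, Eq. (7) guarantees |y_i| < 10 with probability higher than 99%, so the scaling weight phi stays positive and the sign of each normalized component is preserved.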
To prevent the introduced term C(1 - ky) from dismissing the gradient re-centering and re-scaling property, we detach the gradient of C(1 - ky) and only treat it as a changeable constant in the implementation.\n\nTable 4: Results of LayerNorm and AdaNorm. "(+)" means higher is better. "(-)" means lower is better. AdaNorm outperforms LayerNorm on seven datasets.\n\nModels | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST5(+) | MNIST(+) | PTB(+)\nw/o Norm | Diverge | 34.0 | 28.4 | 1.04 | 76.85 | 38.55 | 99.14 | 88.31\nLayerNorm | 28.3 | 35.5 | 31.2 | 1.07 | 77.21 | 39.23 | 99.13 | 89.12\nLayerNorm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19\nAdaNorm | 28.5 | 35.6 | 31.4 | 1.07 | 77.50 | 40.54 | 99.35 | 89.23\n\n4.2 Comparison between AdaNorm and LayerNorm\n\nThe comparison between LayerNorm and AdaNorm is shown in Table 4.7 AdaNorm outperforms LayerNorm on seven datasets, by 0.2 BLEU on En-De, 0.1 BLEU on De-En, 0.2 BLEU on En-Vi, 0.29 ACC on RT, 1.31 ACC on SST5, 0.22 ACC on MNIST, and 0.11 UAS on PTB. Unlike LayerNorm-simple, which only performs well on bigger models, AdaNorm achieves more balanced results.\nFigure 3 shows the loss curves of LayerNorm and AdaNorm on the validation sets of En-Vi, PTB, and De-En. Compared to AdaNorm, LayerNorm has lower training loss but higher validation loss. The lower validation loss indicates that AdaNorm generalizes better.\n\nFigure 3: Loss curves of LayerNorm and AdaNorm on En-Vi, PTB, and De-En.\n\n5 Related Work\n\nDeep neural networks have outperformed shallow models in a variety of fields, such as natural language processing [Sutskever et al., 2014, Bahdanau et al., 2015, Devlin et al., 2018] and computer vision [He et al., 2016, Huang et al., 2017]. 
The improvement mainly comes from the stronger expressive power of deep layers. However, as depth increases, the network training process becomes complicated and requires advanced architectural techniques. One of the important techniques behind such advances is normalization.\n\n7For the AdaNorm implementation, Kaiming initialization and the PreNorm setting are recommended.\n\nCurrently, it is widely accepted that normalization layers assist training by smoothing gradients, enabling large learning rates, accelerating convergence, and improving generalization results [Zhang et al., 2019]. First introduced by Ioffe and Szegedy [2015], BatchNorm fixes layer distributions to reduce Internal Covariate Shift (ICS), a phenomenon in which upper layers need to continuously adapt to the new distributions of lower layers. Following this work, several normalization methods have been proposed, such as instance normalization [Ulyanov et al., 2016] and group normalization [Wu and He, 2018]. In addition, there are several studies exploring better activation functions [Klambauer et al., 2017] or initialization methods [Zhang et al., 2019] to avoid the dependency on normalization layers.\nLayerNorm is proposed to extend BatchNorm to RNNs. LayerNorm normalizes the mean and variance of all summed inputs to the neurons in one layer. Unlike BatchNorm, which depends on the mini-batch size, LayerNorm has fewer limitations. LayerNorm is adaptive to RNNs and self-attention-based models. 
It has been applied to state-of-the-art frameworks such as Transformer [Vaswani et al., 2017], BERT [Devlin et al., 2018], and Transformer-XL [Dai et al., 2019]. LayerNorm brings better performance and is irreplaceable in these frameworks.

Despite the good performance, it is still unclear how layer normalization works. Ioffe and Szegedy [2015] claim that the effectiveness of BatchNorm comes from reducing ICS, which has been a popular belief about BatchNorm [Santurkar et al., 2018]. However, some recent studies point out that the success of BatchNorm relates to smoother gradients and has little to do with reducing ICS [Santurkar et al., 2018, Bjorck et al., 2018]. Although these studies provide a pioneering perspective for understanding BatchNorm, some questions remain unanswered, such as how BatchNorm helps smooth gradients. Also, there is little work studying whether these theories can explain the success of LayerNorm. In this paper, we take a further step toward a better understanding of LayerNorm.

6 Conclusion

In this paper, we investigate how layer normalization works. Based on a series of experiments and theoretical analysis, we summarize some interesting conclusions. We find that the derivatives of the mean and variance are important to the success of LayerNorm because they re-center and re-scale backward gradients. Furthermore, experiments show that the bias and gain increase the risk of over-fitting and do not work in most cases. To address this problem, we propose a normalization method, AdaNorm. It replaces the bias and gain in LayerNorm with a new adaptive transformation function that updates scaling weights based on input values. Experiments show that AdaNorm outperforms LayerNorm on seven datasets.
In future work, we would like to explore more alternatives to LayerNorm from the perspective of gradient normalization.

Acknowledgments

We thank all reviewers for their thoughtful and constructive suggestions. This work was supported in part by the National Natural Science Foundation of China (No. 61673028).

References

R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones. Character-level language modeling with deeper self-attention. CoRR, abs/1808.04444, 2018.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

N. Bjorck, C. P. Gomes, B. Selman, and K. Q. Weinberger. Understanding batch normalization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7705–7716, 2018.

M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico. The IWSLT 2015 evaluation campaign. In IWSLT 2014, International Workshop on Spoken Language Translation, 2014.

M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico. The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation, 2015.

D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 740–750, 2014.

Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

J. Devlin, M.-W.
Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.

G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In Advances in neural information processing systems, pages 971–980, 2017.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

J. Lei Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019.

B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics (ACL), pages 115–124, 2005.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311–318, 2002.

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pages 2483–2493, 2018.

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.

I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.

D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010, 2017.

S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1296–1306, 2016.

Y. Wu and K. He.
Group normalization. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pages 3–19, 2018.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

H. Zhang, Y. N. Dauphin, and T. Ma. Fixup initialization: Residual learning without normalization. CoRR, abs/1901.09321, 2019.