

    2.1 Padding mask
    2.2 Sequence mask: the Transformer decoder
    2.3 BERT: masked LM
    2.4 RoBERTa: dynamic masked LM
    2.5 ERNIE: Knowledge masking strategies
    2.6 BERT-wwm
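    1.1 Padding:
    A sentence: [1, 2, 3, 4, 5]
    Input size: 1 × 8
    After padding: [1, 2, 3, 4, 5, 0, 0, 0]
    1.2 The problem introduced by padding:
    Original mean: (1 + 2 + 3 + 4 + 5) / 5 = 3
    Mean after padding: (1 + 2 + 3 + 4 + 5) / 8 = 1.875
    1.3 Introducing a mask to fix the padding problem:
    Assume m = [1, 1, 1, 1, 1, 0, 0, 0]
    Masked mean = sum(x * m) / sum(m) = 15 / 5 = 3 (same as the original result)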
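The arithmetic in 1.2 and 1.3 can be checked with a few lines of Python. This is a minimal sketch; the helper name `masked_mean` is an assumption made for illustration and does not come from any library.

```python
def masked_mean(values, mask):
    """Average only the positions where mask == 1."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count

x = [1, 2, 3, 4, 5, 0, 0, 0]   # padded input from the example above
m = [1, 1, 1, 1, 1, 0, 0, 0]   # 1 = real token, 0 = padding

naive_mean = sum(x) / len(x)    # divides by the padded length: 1.875
fixed_mean = masked_mean(x, m)  # divides by the number of real tokens: 3.0
print(naive_mean, fixed_mean)
```

Dividing by `sum(mask)` instead of the padded length is what restores the original mean.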
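    1.4 Besides the padding scenario above, a mask can also be used to hide information so that the model learns a particular word or attends to a particular region.
    Example: the self-attention mask in the Transformer
    During training, the Masked Multi-Head Attention layer must mask out future information so that the current time step cannot see it.
    Time steps t-1, t, and t+1 are computed in parallel in the Masked Multi-Head Attention layer.
    Follow-up question: the Transformer decoder also uses a mask at prediction time.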
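The look-ahead masking described in 1.4 can be sketched in plain Python. The function names `causal_mask` and `masked_softmax` and the all-zero scores are assumptions for illustration; a real model would compute the scores from queries and keys.

```python
import math

def causal_mask(n):
    """mask[i][j] = 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; masked positions end up with weight 0."""
    masked = [s if m else -math.inf for s, m in zip(scores, mask_row)]
    peak = max(masked)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 3
mask = causal_mask(n)
scores = [[0.0] * n for _ in range(n)]      # hypothetical raw attention scores
weights = [masked_softmax(scores[i], mask[i]) for i in range(n)]
for row in weights:
    print(row)
```

Because the mask is applied to the whole score matrix at once, every time step can be processed in parallel while each row still only sees its own past.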
    2.3 BERT: masked LM
    The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, we replace the i-th token with (1) the [MASK] token 80% of the time (2) a random token 10% of the time (3) the unchanged i-th token 10% of the time. Then, Ti will be used to predict the original token with cross entropy loss.
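    In short (BERT paper): each token in the training data is selected for masking with probability 15%. A selected token is replaced with [MASK] 80% of the time, left unchanged 10% of the time, and replaced with a random token 10% of the time.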
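The 15% / 80% / 10% / 10% rule can be sketched as follows. This is a simplified illustration, not BERT's actual data generator: it selects each position independently with probability 0.15 rather than picking a fixed number of positions, and the function name `bert_mask` and the tiny vocabulary are assumptions.

```python
import random

MASK = "[MASK]"

def bert_mask(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels). labels[i] holds the original token
    at selected positions and None everywhere else."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # position not selected for prediction
        labels[i] = tok                       # the model must recover this token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels

rng = random.Random(0)
vocab = ["there", "is", "an", "apple", "tree", "nearby", "."]  # toy vocabulary
tokens = list(vocab)
corrupted, labels = bert_mask(tokens, vocab, rng)
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels[i]` is not `None`, including the positions whose token was left unchanged.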
    2.4 RoBERTa: dynamic masked LM
    The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.
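The difference between the static scheme quoted above (10 precomputed copies reused over 40 epochs) and dynamic masking (a new pattern drawn each time a sequence is fed to the model) can be illustrated with a small simulation. The function names and the sequence length of 64 are assumptions for the sketch, and masks are represented as 0/1 tuples only.

```python
import random

def random_mask(length, rng, select_prob=0.15):
    """One masking pattern: a tuple of 0/1 flags, 1 = position selected."""
    return tuple(1 if rng.random() < select_prob else 0 for _ in range(length))

def static_schedule(length, rng, num_copies=10, num_epochs=40):
    """Masks are fixed at preprocessing time; epoch e reuses copy e % num_copies."""
    copies = [random_mask(length, rng) for _ in range(num_copies)]
    return [copies[epoch % num_copies] for epoch in range(num_epochs)]

def dynamic_schedule(length, rng, num_epochs=40):
    """A fresh mask is drawn every time the sequence is fed to the model."""
    return [random_mask(length, rng) for _ in range(num_epochs)]

rng = random.Random(0)
static_masks = static_schedule(length=64, rng=rng)
dynamic_masks = dynamic_schedule(length=64, rng=rng)

print(len(set(static_masks)))   # at most 10 distinct patterns over 40 epochs
print(len(set(dynamic_masks)))  # distinct patterns seen with dynamic masking
```

With the static schedule each pattern recurs every 10 epochs, so it is seen 4 times in 40 epochs, which matches the quoted description.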
    2.5 ERNIE: Knowledge masking strategies
    ERNIE is designed to learn language representation enhanced by knowledge masking strategies, which includes entity-level masking and phrase-level masking.
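    ERNIE brings knowledge into BERT and strengthens local learning. BERT's original approach only fills in blanks chosen by masking probability. Filling in blanks at the knowledge level, i.e. blanking out whole knowledge units, ensures that the model learns the key knowledge.
    Basic-Level Masking:
    Phrase-Level Masking:
    In this stage, a syntactic analysis tool first extracts the phrases in a sentence, e.g. "a series of" in the figure. Some of these phrases are then randomly masked and predicted from the remaining context. In this stage, phrase information is incorporated into the word embeddings.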
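Phrase-level masking amounts to masking whole spans rather than single tokens. The sketch below assumes the phrase spans are already given as `(start, end)` index pairs (in practice they would come from an external chunker); the function name `mask_spans`, the example sentence, the span indices, and the 15% default rate are all illustrative assumptions, not ERNIE's actual implementation.

```python
import random

MASK = "[MASK]"

def mask_spans(tokens, spans, rng, select_prob=0.15):
    """Mask whole spans. `spans` is a list of (start, end) pairs, end exclusive,
    assumed to come from an external phrase chunker."""
    masked = list(tokens)
    targets = []
    for start, end in spans:
        if rng.random() < select_prob:
            targets.append((start, end, tokens[start:end]))  # what must be predicted
            for i in range(start, end):
                masked[i] = MASK                             # hide every token in the span
    return masked, targets

tokens = "harry potter is a series of fantasy novels written by j. k. rowling".split()
phrase_spans = [(3, 6)]  # hypothetical chunker output covering "a series of"

# select_prob=1.0 forces the span to be masked so the effect is visible
masked, targets = mask_spans(tokens, phrase_spans, random.Random(0), select_prob=1.0)
print(" ".join(masked))
print(targets)
```

The same function would cover entity-level masking if the spans came from an entity recognizer instead of a phrase chunker.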
    Entity-Level Masking:
    2.6 BERT-wwm
    The whole word masking mainly mitigates the drawbacks in original BERT that, if the masked WordPiece token (Wu et al., 2016) belongs to a whole word, then all the WordPiece tokens (which forms a complete word) will be masked altogether.
    Example sentence: there is an apple tree nearby.
    tok_list = ["there", "is", "an", "ap", "##p", "##le", "tr", "##ee", "nearby", "."]
    Original masking:
    there [MASK] an ap [MASK] ##le tr [RANDOM] nearby .
    [MASK] [MASK] an ap ##p [MASK] tr ##ee nearby .
    there is [MASK] ap [MASK] ##le tr ##ee nearby [MASK] .
    Whole word masking:
    there is an [MASK] [MASK] [RANDOM] tr ##ee nearby .
    there is [MASK] ap ##p ##le [MASK] [MASK] nearby .
    there is [MASK] ap ##p ##le tr ##ee nearby [MASK] .
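Whole word masking can be sketched by first grouping WordPiece tokens into words (a piece starting with `##` continues the previous word) and then selecting words instead of individual pieces. This is a simplified illustration: the function names are assumptions, and selected pieces are always replaced with [MASK], without the random-token or keep-unchanged cases.

```python
import random

MASK = "[MASK]"

def group_wordpieces(tokens):
    """Group token indices into words; a '##' piece joins the previous word."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    return words

def whole_word_mask(tokens, rng, select_prob=0.15):
    """Select whole words at random and mask every piece of a selected word."""
    masked = list(tokens)
    for word in group_wordpieces(tokens):
        if rng.random() < select_prob:
            for i in word:
                masked[i] = MASK
    return masked

tok_list = ["there", "is", "an", "ap", "##p", "##le", "tr", "##ee", "nearby", "."]
print(group_wordpieces(tok_list))
# [[0], [1], [2], [3, 4, 5], [6, 7], [8], [9]]
print(" ".join(whole_word_mask(tok_list, random.Random(0), select_prob=0.3)))
```

Because selection happens per word group, "ap ##p ##le" is either fully masked or left fully intact, never split.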