1. Theory

Input

Word embeddings

Turn each input word into a vector using an embedding algorithm.
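The lookup can be sketched as below. This is only an illustration: the three-word vocabulary, the 4-dimensional size, and the randomly initialised table are all assumptions; a trained model learns the table and uses a much larger dimension.

```python
import random

# Hypothetical toy vocabulary; real vocabularies are far larger.
vocab = {"thinking": 0, "machines": 1, "student": 2}
d_model = 4  # assumed embedding size, chosen only for readability

rng = random.Random(0)
# One row per word. In a trained model these values are learned.
embedding_table = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

def embed(sentence):
    """Map each word to its row in the embedding table."""
    return [embedding_table[vocab[word]] for word in sentence]

x = embed(["thinking", "machines"])
print(len(x), len(x[0]))  # number of words, vector size
```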

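Question: The size of this list is a hyperparameter we can set; basically it would be the length of the longest sentence in our training dataset.

The input to the bottom encoder is the word embeddings; every encoder above it takes the output of the encoder directly below as its input.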

In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below

This also comes up in BERT practice; worth checking there.

Open questions

There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing.

Self-Attention

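Purpose:

Attention mechanism: works out the relationship between the word currently being processed and every word in the sequence.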

Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

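Process:

Q, K, and V vectors are produced from each word's embedding (their dimension does not have to match the embedding dimension). Take the dot product of the current word's Q with the K of every word (including itself), divide each score by $\sqrt{d_k}$ (the square root of the key vector dimension, not the dimension itself), and apply softmax so the scores become weights between 0 and 1 that sum to 1. Multiply each word's V by its softmax weight and sum the results to get a new context vector z. This vector mixes the content of the current word with that of all words (including itself), and words that are more relevant to the current word contribute more to it. z is then sent to the feed-forward neural network.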
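The steps for a single word can be sketched in plain Python as follows. The q/k/v numbers are made up for illustration, and the scores are scaled by $\sqrt{d_k}$, the square root of the key dimension, as in the standard formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention output for one query vector against all keys and values."""
    d_k = len(q)
    # Dot product of the query with every key, scaled by sqrt(d_k).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)  # non-negative, sums to 1
    # Weighted sum of the value vectors gives the context vector z.
    d_v = len(values[0])
    z = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return weights, z

# Made-up vectors for a two-word sentence; the query belongs to word 0.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, z = attend([1.0, 0.0], keys, values)
print(weights, z)
```

Because the query matches the first key more closely, the first value vector receives the larger weight in z.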

They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.

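Computation:

A single matrix computation yields the new output vector for every row of X.

Stack all input word vectors into one matrix X, then multiply X by the weight matrices $W^Q, W^K, W^V$ to get Q, K, and V.

So one pass of matrix operations produces all the outputs at once.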
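Written out as equations, with $d_k$ denoting the dimension of the key vectors and $Z$ the matrix of output vectors (one row per input word):

$$
Q = XW^Q,\qquad K = XW^K,\qquad V = XW^V
$$

$$
Z = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

The softmax is applied to each row of the score matrix independently.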

The Beast With Many Heads

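Multiply X by several different sets of $W^Q, W^K, W^V$ to get several sets of Q, K, V. These sets are processed in parallel and each one (each head) produces its own output matrix.

It gives the attention layer multiple “representation subspaces”. With multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices.

Next, concatenate the per-head outputs column-wise into one wider matrix, then multiply that matrix by a new weight matrix $W^O$ to get the final matrix Z, which is sent to the next layer. Note that Z has the same size as X.

The full matrix computation of multi-head attention (with 8 heads)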
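A rough, dependency-free sketch of the multi-head computation is shown here. The sizes (2 words, model width 8, 4 heads) and the random weights are assumptions chosen to keep the example small; the notes above use 8 heads.

```python
import math
import random

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head(x, heads, w_o):
    # heads: one (W_Q, W_K, W_V) triple per head.
    zs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv)) for wq, wk, wv in heads]
    # Concatenate the per-head outputs column-wise, row by row.
    concat = [sum((z[i] for z in zs), []) for i in range(len(x))]
    # Project the wide matrix back to the model width with W_O.
    return matmul(concat, w_o)

rng = random.Random(0)
n_words, d_model, n_heads = 2, 8, 4  # toy sizes (assumed)
d_k = d_model // n_heads
x = rand_matrix(n_words, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(n_heads)]
w_o = rand_matrix(n_heads * d_k, d_model, rng)
z = multi_head(x, heads, w_o)
print(len(z), len(z[0]))  # same shape as x
```

Choosing `d_k = d_model // n_heads` keeps the concatenated width equal to the model width, which is why the per-head vectors are smaller than the embedding.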

Positional Encoding

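Position information is added to the input so that the later Q, K, V computation can sense where each word sits, for example absolute position (“Is this word at the start or the end of the sentence?”) and relative distance (“Are word A and word B adjacent or far apart?”).

Generating the positional encoding vectors

For example, the positional encoding in Tensor2Tensor:

Positional encodings for 20 positions, each of dimension 512: the left half is computed with sine, the right half with cosine, and the two halves are concatenated side by side to form the full encoding.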
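The sine-left, cosine-right layout can be sketched like this. The frequency schedule used here (a geometric progression from 1 down to 1/10000) is an assumption for illustration and may differ in detail from the Tensor2Tensor implementation; `positional_encoding` is a hypothetical helper name.

```python
import math

def positional_encoding(n_positions, d_model, max_timescale=10000.0):
    """Return an n_positions x d_model table: sine half, then cosine half."""
    half = d_model // 2
    # Assumed schedule: frequencies decay geometrically from 1 to 1/max_timescale.
    inv_timescales = [max_timescale ** (-i / (half - 1)) for i in range(half)]
    table = []
    for pos in range(n_positions):
        angles = [pos * f for f in inv_timescales]
        # Left half uses sin, right half uses cos; then concatenate.
        table.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return table

pe = positional_encoding(20, 512)
print(len(pe), len(pe[0]))  # 20 512
```

Each row of `pe` is added element-wise to the embedding of the word at that position.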

The Residuals

In the encoder

In the decoder

Decoder

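The output of the last encoder is turned into a set of K and V, which is passed to the “encoder-decoder attention” layer (the middle sub-layer of each decoder). Specifically, the encoder output is broadcast to the middle sub-layer of all 6 decoders (because 6 decoders are stacked). Q, however, comes from the masked multi-head attention layer below it. K and V hold all the information extracted from the input sentence: one acts like a summary, the other like the detailed context. The Q coming from the masked multi-head attention layer is then used to pull the information the decoder cares about out of this rich context (this is done through the attention mechanism).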

The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence
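A minimal sketch of encoder-decoder attention, assuming made-up 2-dimensional vectors and omitting the learned projection matrices for brevity (in a real model each decoder layer applies its own $W^Q, W^K, W^V$). The point is only that the queries come from the decoder while the keys and values come from the encoder output.

```python
import math

def cross_attention(dec_q, enc_k, enc_v):
    """Each decoder query attends over all encoder positions."""
    d_k = len(dec_q[0])
    out = []
    for q in dec_q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in enc_k]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of encoder values: the context for this target position.
        out.append([sum(w * v[j] for w, v in zip(weights, enc_v))
                    for j in range(len(enc_v[0]))])
    return out

enc_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions (made up)
dec_state = [[1.0, 0.0], [0.0, 1.0]]            # 2 target positions (made up)
context = cross_attention(dec_state, enc_out, enc_out)
print(len(context), len(context[0]))  # 2 2
```

The output has one row per decoder position, regardless of how long the source sentence is.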


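Mask operation (applied to the score matrix obtained from $QK^T$)

Before the softmax, add an upper-triangular mask matrix. This matrix enforces one rule: **the score of every word after the current position is set to $-\infty$.**

After the softmax, the entries above the diagonal become 0 (since $e^{-\infty} = 0$).

Concretely: $\text{Softmax}(\text{Scores} + \text{Mask})$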
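A small sketch of the masked softmax, using an all-zero score matrix (an assumption, chosen so the result is easy to check by hand):

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """0 on and below the diagonal, -inf strictly above it."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores):
    """Row-wise Softmax(Scores + Mask)."""
    n = len(scores)
    mask = causal_mask(n)
    out = []
    for i in range(n):
        row = [scores[i][j] + mask[i][j] for j in range(n)]
        mx = max(row)  # always finite: the diagonal is never masked
        exps = [math.exp(v - mx) for v in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores)
for row in weights:
    print(row)
# Row 1 is [0.5, 0.5, 0.0]: position 1 sees positions 0 and 1 only.
```

Each row still sums to 1, but position i distributes its attention only over positions 0 through i.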

Final Linear and Softmax Layer

Turns the vector of floats into an actual word.

The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.
The Linear layer is a simple fully connected neural network that projects the vector into a much, much larger vector called a logits vector.

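The fully connected layer projects the decoder output to a $1 \times 10000$ vector (the goal is to give every word in the vocabulary a score).

The softmax layer converts the scores into probabilities.

We pick the entry with the highest probability; its position is the index of the predicted word.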
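The three steps (project, softmax, pick the maximum) can be sketched as follows. The six-word vocabulary, the 4-dimensional decoder output, and the random weights are all assumptions for illustration; the notes above use a vocabulary of 10000 words.

```python
import math
import random

def linear(x, w, b):
    """Project x (length d_model) with w (d_model x vocab_size) plus bias b."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]

def softmax(logits):
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]  # toy vocabulary (assumed)
d_model = 4                                             # toy width (assumed)
w = [[rng.uniform(-1, 1) for _ in vocab] for _ in range(d_model)]
b = [0.0] * len(vocab)
decoder_output = [rng.uniform(-1, 1) for _ in range(d_model)]

logits = linear(decoder_output, w, b)  # one score per vocabulary word
probs = softmax(logits)                # scores turned into probabilities
best = max(range(len(probs)), key=probs.__getitem__)
print(best, vocab[best])
```

Taking the single most likely word at each step is the greedy choice described in the text.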

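One-hot encoding & output

Vocabulary

Each word is represented by a one-hot vector: all zeros except for a 1 at the word's index in the vocabulary.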
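As a quick illustration, assuming a toy six-word vocabulary (the word list and the `one_hot` helper are hypothetical):

```python
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]  # toy vocabulary (assumed)

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("am"))  # [0, 1, 0, 0, 0, 0]
```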

The desired final output
