Transformer

1. Theory: input. Word embeddings: each input word is turned into a vector using an embedding algorithm. Question: the size of this list (the number of vectors per input) is a hyperparameter we can set; basically it would be the length of the longest sentence in our training dataset. Only the bottom encoder takes the word embeddings as input; every other encoder takes the output of the encoder directly below it. The BERT practice notes also mention this and are worth checking ...
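
The input flow in item 1 can be sketched roughly as follows. This is a toy illustration, not a real Transformer: `EMBED_DIM`, `MAX_LEN`, the random embedding table, and `toy_encoder` (a stand-in for self-attention plus the feed-forward layer) are all assumptions made up for this example. It only shows that each sentence becomes a fixed-length list of vectors, and that the embeddings enter at the bottom encoder while every later encoder consumes the output of the one below it.

```python
import random

EMBED_DIM = 4  # toy vector size (assumption for illustration)
MAX_LEN = 5    # hyperparameter: list length, e.g. the longest training sentence


def build_embedding_table(vocab, dim, seed=0):
    """Assign each word a random vector; a real model learns these values."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1, 1) for _ in range(dim)] for word in vocab}


def embed(sentence, table, max_len, dim):
    """Look up one vector per word, then truncate or zero-pad to max_len."""
    tokens = sentence.split()[:max_len]
    vectors = [table[token] for token in tokens]  # assumes every word is in vocab
    vectors += [[0.0] * dim for _ in range(max_len - len(vectors))]
    return vectors


def toy_encoder(vectors):
    """Hypothetical placeholder for an encoder layer; keeps the list shape."""
    return [[0.5 * x for x in vec] for vec in vectors]


def encode(sentence, table, num_layers=3):
    # The bottom encoder receives the word embeddings ...
    x = embed(sentence, table, MAX_LEN, EMBED_DIM)
    # ... and each later encoder receives the output of the one directly below.
    for _ in range(num_layers):
        x = toy_encoder(x)
    return x


vocab = ["je", "suis", "etudiant"]
table = build_embedding_table(vocab, EMBED_DIM)
out = encode("je suis etudiant", table)
print(len(out), len(out[0]))  # prints "5 4"
```

The list length and vector size stay the same through every layer, which is what lets identical encoders be stacked.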