NLP Basics

The core tasks of NLP: understanding and synthesizing language.

NLP input preprocessing: tokenization.

Case folding: normalize the input to a single case to reduce memory use and improve efficiency. It can, however, introduce ambiguity, so whether to apply it depends on the task. For example, "Green" (a name) has a different meaning from "green" (the colour), but both would get the same token if case folding is applied.

Stop word removal: remove words that carry little meaning, which likewise improves efficiency but can make the text semantically incomplete, so again decide per task. Examples include "a", "the", "of", "an", "this", "that". For a task like topic modelling (identifying topics in text), this contextual information matters less than for a task like sentiment analysis, where the stop word "not" can flip the sentiment completely. ...
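A minimal sketch of the preprocessing steps above (tokenization by whitespace, case folding, stop word removal). The function name, the stop word set, and the example sentences are illustrative assumptions, not taken from the post:

```python
# Illustrative stop word set (a real pipeline would use a larger list,
# e.g. from NLTK or spaCy).
STOP_WORDS = {"a", "the", "of", "an", "this", "that"}

def preprocess(text, case_fold=True, remove_stop_words=True):
    """Whitespace-tokenize, then optionally case-fold and drop stop words."""
    tokens = text.split()
    if case_fold:
        tokens = [t.lower() for t in tokens]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess("This movie is not the best"))
# -> ['movie', 'is', 'not', 'best']
# "This" and "the" are folded and removed as stop words; "not" survives
# here only because it is deliberately absent from STOP_WORDS.

# Case folding collapses the name/colour distinction described above:
print(preprocess("Green") == preprocess("green"))  # -> True
```

Note that keeping "not" out of the stop word set is exactly the kind of per-task decision the post describes: for sentiment analysis it must be kept, while for topic modelling it could safely be dropped.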

November 21, 2025 · 1 min · 207 words · Bob

Transformer

1. Theory. Input embeddings: turn each input word into a vector using an embedding algorithm. Note that the size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset. The bottom encoder takes the word embeddings as input; every encoder above it takes the output of the encoder directly below: "In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that's directly below." This also comes up in the BERT practice post, which is worth a look. ...
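The data flow described above can be sketched in a few lines. This is my own toy illustration, not code from the post: the vocabulary, layer count, and the stand-in encoder (a single linear map plus ReLU, nothing like a real attention block) are all assumptions. What it shows is only the wiring: the bottom layer consumes the embedding lookup, and each higher layer consumes the output of the layer directly below it.

```python
import numpy as np

np.random.seed(0)
VOCAB = {"the": 0, "cat": 1, "sat": 2}   # toy vocabulary (assumption)
D_MODEL = 4                              # embedding size, a hyperparameter

# Embedding table: one D_MODEL-dimensional vector per vocabulary entry.
embedding_table = np.random.randn(len(VOCAB), D_MODEL)

def embed(words):
    """Turn each input word into a vector via an embedding lookup."""
    return embedding_table[[VOCAB[w] for w in words]]

def encoder_layer(x, w):
    """Stand-in for a real encoder block: linear map + ReLU (illustrative)."""
    return np.maximum(x @ w, 0.0)

# Two stacked "encoders" with independent weights.
weights = [np.random.randn(D_MODEL, D_MODEL) for _ in range(2)]

x = embed(["the", "cat", "sat"])   # bottom encoder input: word embeddings
for w in weights:                  # each layer feeds the one above it
    x = encoder_layer(x, w)

print(x.shape)  # -> (3, 4): one vector per token, dimensionality unchanged
```

The point of keeping the output shape `(sequence_length, D_MODEL)` at every layer is that it lets encoders stack uniformly: any layer's output is a valid input for the next.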

November 20, 2025 · 2 min · 414 words · Bob