NLP Basics

The core tasks of NLP: understanding and synthesizing language. NLP input preprocessing: Tokenization. Case folding: convert the input to a single case to reduce memory use and improve efficiency, but it can introduce ambiguity, so whether to apply it depends on the task. For example, "Green" (name) has a different meaning from "green" (colour), but both would get the same token if case folding is applied. Stop word removal: drop words that carry little meaning, again to improve efficiency, but this can lose part of the meaning, so it is also task-dependent. Examples include "a", "the", "of", "an", "this", "that". For some tasks like topic modelling (identifying topics in text), contextual information is not as important, compared to a task like sentiment analysis where the stop word "not" can change the sentiment completely. ...
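A minimal sketch of this preprocessing in Python, assuming NLTK with the punkt tokenizer and English stop word list available (newer NLTK versions may also ask for punkt_tab); the sample sentence is made up for illustration:

```python
# Minimal preprocessing sketch: tokenization, case folding, stop word removal.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # stop word lists

text = "The Green party is not a green colour."

# Tokenization: split the raw string into tokens.
tokens = word_tokenize(text)

# Case folding: lowercase everything (note "Green" and "green" now collide).
folded = [t.lower() for t in tokens]

# Stop word removal: drop high-frequency, low-content words.
# Caution: this also drops "not", which matters for sentiment analysis.
stops = set(stopwords.words("english"))
filtered = [t for t in folded if t not in stops]

print(tokens)    # e.g. ['The', 'Green', 'party', 'is', 'not', 'a', 'green', 'colour', '.']
print(filtered)  # e.g. ['green', 'party', 'green', 'colour', '.']
```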

November 21, 2025 · 2 min · 221 words · Bob

Transformer

1. Theory. Input embeddings: turn each input word into a vector using an embedding algorithm. Note: the size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset. Only the bottom encoder takes the word embeddings as input; each encoder after that takes the output of the encoder directly below it ("In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that's directly below"). This also comes up in the BERT practice notes, which are worth a look. ...
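As a rough illustration of this input step (not the post's own code), a PyTorch sketch that embeds a toy sentence and pads it to a fixed maximum length; the vocabulary, max_len and dimensions are assumptions for the example:

```python
# Sketch of the Transformer input step: token ids -> embedding vectors,
# padded to a fixed maximum sequence length (a hyperparameter, e.g. the
# length of the longest training sentence).
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}   # toy vocabulary (assumption)
max_len = 6        # hyperparameter: maximum sequence length
d_model = 512      # embedding size used in the original Transformer

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model,
                     padding_idx=vocab["<pad>"])

sentence = ["the", "cat", "sat"]
ids = [vocab[w] for w in sentence]
ids = ids + [vocab["<pad>"]] * (max_len - len(ids))   # pad to max_len

x = embed(torch.tensor([ids]))   # shape: (batch=1, max_len, d_model)
print(x.shape)                   # torch.Size([1, 6, 512])

# Only the bottom encoder sees these embeddings; each higher encoder layer
# consumes the output of the layer directly below it.
```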

November 20, 2025 · 8 min · 1690 words · Bob

Knowledge Graph & NLP Tutorial (BERT, spaCy, NLTK)

NLP processing stages. Lexical analysis: split the text into tokens; "uneasy" can be broken into two sub-word tokens as "un-easy". Syntactic analysis: 1. check whether the sentence structure is valid; 2. produce a result that captures the syntactic relations between words, e.g. a sentence like "The school goes to the boy" would be rejected. Semantic analysis: check whether the meaning is valid; a semantic analyzer would reject a sentence like "Hot ice-cream". Pragmatic analysis: resolve ambiguity by choosing one of the possible meanings. Knowledge graph: one way to store extracted information; the storage structure generally consists of a subject, a predicate and an object. The techniques used to build a knowledge graph are sentence segmentation, dependency parsing, parts of speech tagging, and entity recognition. Entity extraction: pull the subject and object out of each sentence; compound nouns and modifiers need special handling. Relation extraction: extract the "main" verb of the sentence. Once both are done, the knowledge graph can be built; it is best to build a separate graph for each relation, for better visualization. BERT: well suited to small datasets and to tasks such as question answering and sentiment analysis. ...
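A rough sketch of the entity and relation extraction step with spaCy (assuming the en_core_web_sm pipeline is installed); the ROOT-verb/subject/object heuristic and the example sentence are simplifications for illustration, not the tutorial's exact method:

```python
# Triple (subject, predicate, object) extraction via dependency parsing:
# take the sentence ROOT verb as the relation and its nominal subject and
# object as the entities. Compound nouns and modifiers would need extra
# handling, as noted above.
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline (installed separately)

def extract_triple(sentence):
    doc = nlp(sentence)
    subj, pred, obj = None, None, None
    for token in doc:
        if token.dep_ == "ROOT":
            pred = token.lemma_              # "main" verb as the relation
        elif token.dep_ in ("nsubj", "nsubjpass"):
            subj = token.text
        elif token.dep_ in ("dobj", "attr"):
            obj = token.text
    return subj, pred, obj

print(extract_triple("The boy kicked the ball."))
# e.g. ('boy', 'kick', 'ball') -- exact output depends on the parser
```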

November 10, 2025 · 2 min · 409 words · Bob