Traditional Techniques

With the emergence of newer techniques, some of these methods are now rarely used.

N-Grams

In natural language, the precise meaning of a word can only be determined in context. For example, the meanings of neural network and fishing network are completely different. One way to take this into account is to build our model on pairs of words, treating each word pair as a separate vocabulary token. In this way, the sentence I like to go fishing will be represented by the following sequence of tokens: I like, like to, to go, go fishing. The problem with this approach is that the dictionary size grows significantly, and combinations like go fishing and go shopping are represented by different tokens, which do not share any semantic similarity despite containing the same verb.

In some cases, we may consider using tri-grams -- combinations of three words -- as well. Such an approach is therefore often called n-grams. It also makes sense to use n-grams with a character-level representation, in which case the n-grams will roughly correspond to syllables.
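As a minimal illustrative sketch (the example sentence and helper function are assumptions for demonstration, not part of any particular library), word-level and character-level n-grams can be extracted like this:

```python
def ngrams(tokens, n):
    """Return all n-grams of a token sequence as tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Word-level bigrams of the example sentence
tokens = "I like to go fishing".split()
print(ngrams(tokens, 2))
# [('I', 'like'), ('like', 'to'), ('to', 'go'), ('go', 'fishing')]

# Character-level tri-grams of a single word
print(ngrams(list("fishing"), 3))
# [('f', 'i', 's'), ('i', 's', 'h'), ('s', 'h', 'i'), ('h', 'i', 'n'), ('i', 'n', 'g')]
```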

Static Word Embeddings

A static word embedding represents each word as a fixed vector (usually a vector of real numbers) that does not change with context.

The same word has the **same vector representation** in every sentence.

It cannot handle polysemy:

  • "bank" (river bank vs. financial bank) is still mapped to the same vector in different contexts.

It is **insensitive** to contextual information.

In modern NLP, static embeddings have largely been replaced by contextual embeddings (such as BERT).

🔹 Common Static Word Embedding Models

| Model | Description | Training Method |
|-------|-------------|-----------------|
| Word2Vec | Proposed by Google; predicts a target word from its surrounding context, or vice versa | Skip-gram / CBOW |
| GloVe | Global Vectors, proposed at Stanford; builds on a global word co-occurrence matrix | Matrix factorization of co-occurrence statistics |
| FastText | Proposed by Facebook; can handle out-of-vocabulary (OOV) words | Skip-gram / CBOW over subword (character n-gram) units |
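As a minimal sketch of what "static" means in practice, pretrained vectors can be loaded with gensim's downloader API (assuming gensim is installed and the pretrained set "glove-wiki-gigaword-50", one of the models gensim distributes, can be downloaded); every word maps to a single fixed vector regardless of the sentence it appears in:

```python
import gensim.downloader as api

# Download 50-dimensional GloVe vectors (a static embedding model)
word_vectors = api.load("glove-wiki-gigaword-50")

# One fixed vector per word, independent of context
print(word_vectors["bank"].shape)                     # (50,)
print(word_vectors.most_similar("fishing", topn=3))   # nearest neighbours in vector space
```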

Bag-of-Words and TF/IDF

When solving tasks like text classification, we need to be able to represent a text by one fixed-size vector, which we will use as an input to the final dense classifier. One of the simplest ways to do that is to combine all individual word representations, e.g. by adding them up. If we add up the one-hot encodings of each word, we end up with a vector of frequencies showing how many times each word appears in the text. Such a representation of text is called bag of words (BoW).
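Below is a minimal sketch of this idea in plain Python; the toy vocabulary and example sentences are assumptions for illustration (in practice the vocabulary is built from the training corpus):

```python
from collections import Counter

# Toy vocabulary; in practice it is built from the training corpus
vocabulary = ["i", "like", "to", "go", "fishing", "shopping"]

def bag_of_words(text):
    """Sum the one-hot encodings of the words, i.e. count word occurrences."""
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

print(bag_of_words("I like to go fishing"))    # [1, 1, 1, 1, 1, 0]
print(bag_of_words("I like to go shopping"))   # [1, 1, 1, 1, 0, 1]
```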

A BoW essentially represents which words appear in a text and in which quantities, which can indeed be a good indication of what the text is about. For example, a news article on politics is likely to contain words such as president and country, while a scientific publication would contain words like collider, discovered, etc. Thus, word frequencies can in many cases be a good indicator of text content.

The problem with BoW is that certain common words, such as and, is, etc., appear in most texts and have the highest frequencies, masking out the words that are really important. We can lower the importance of those words by taking into account the frequency at which each word occurs in the whole document collection. This is the main idea behind the TF/IDF (term frequency / inverse document frequency) approach: a word's count within a document is multiplied by a weight that decreases with the number of documents containing that word.
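A minimal sketch of this weighting, using scikit-learn's TfidfVectorizer on three toy documents (the documents themselves are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the president visited the country",
    "the collider discovered a new particle",
    "the president of the country spoke",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)   # one TF/IDF vector per document

# "the" occurs in every document, so it receives the lowest possible idf weight,
# while rarer content words such as "collider" are weighted higher
for word, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{word}: {idf:.2f}")
```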

However, none of these approaches can fully take into account the semantics of a text. We need more powerful neural network models to do this.