
Traditional Techniques

With the emergence of newer methods, some of the techniques below are now rarely used in practice.

N-Grams

In natural language, the precise meaning of a word can often be determined only in context. For example, the meanings of neural network and fishing network are completely different. One way to take this into account is to build our model on pairs of words, treating each word pair as a separate vocabulary token. In this way, the sentence I like to go fishing would be represented by the following sequence of tokens: I like, like to, to go, go fishing. The problem with this approach is that the dictionary size grows significantly, and combinations like go fishing and go shopping are represented by different tokens that share no semantic similarity, despite containing the same verb.

In some cases, we may consider using tri-grams -- combinations of three words -- as well, so the approach in general is called n-grams. It also makes sense to use n-grams with a character-level representation, in which case the n-grams roughly correspond to syllables.
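
As a minimal sketch, here is how word-level n-grams can be extracted from a tokenized sentence (the ngrams helper below is ours, not a library function):

```python
from typing import List, Tuple

def ngrams(tokens: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Return all consecutive n-token windows of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I like to go fishing".split(), n=2))
# [('I', 'like'), ('like', 'to'), ('to', 'go'), ('go', 'fishing')]
```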

Static Word Embeddings

A static word embedding represents each word as a single fixed vector (typically real-valued) that does not change with context.

The same word has **the same vector representation** in every sentence.

Static embeddings cannot handle polysemy:

  • "bank" (river bank vs. financial institution) is mapped to the same vector in every context.

They are **insensitive** to contextual information.

In modern NLP, static embeddings have been largely superseded by contextual embeddings (e.g., BERT).

🔹 Common static word embedding models

| Model | Description | Training method |
| --- | --- | --- |
| Word2Vec | Proposed by Google; predicts a target word from its context, or the context from the word | Skip-gram / CBOW |
| GloVe | Global Vectors, proposed by Stanford; leverages a global word co-occurrence matrix | Matrix factorization over co-occurrence statistics |
| FastText | Proposed by Facebook; can handle out-of-vocabulary (OOV) words | Subword (character n-gram) embeddings, trained like Word2Vec |
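
To see one of these models in action, here is a minimal sketch using the gensim library's Word2Vec implementation (gensim is assumed to be available; the toy corpus and hyperparameters are purely illustrative, not tuned):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only)
sentences = [
    ["i", "like", "to", "go", "fishing"],
    ["i", "like", "to", "go", "shopping"],
    ["the", "president", "visited", "the", "country"],
]

# sg=1 selects skip-gram training; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1)

print(model.wv["fishing"])                         # the same vector in every context
print(model.wv.similarity("fishing", "shopping"))  # cosine similarity of two words
```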

Bag-of-Words and TF/IDF

When solving tasks like text classification, we need to represent a whole text by a single fixed-size vector, which we can then use as input to a final dense classifier. One of the simplest ways to do that is to combine the individual word representations, e.g. by adding them. If we add up the one-hot encodings of each word, we end up with a vector of frequencies showing how many times each word appears in the text. Such a representation of text is called bag of words (BoW).

A BoW essentially captures which words appear in a text and in what quantities, which can indeed be a good indication of what the text is about. For example, a news article on politics is likely to contain words such as president and country, while a scientific publication would contain words like collider and discovered. Thus, word frequencies can in many cases be a good indicator of text content.

The problem with BoW is that certain common words, such as and and is, appear in most texts and have the highest frequencies, masking out the words that are really important. We can lower the importance of those words by taking into account how often each word occurs across the whole document collection. This is the main idea behind the TF/IDF (term frequency / inverse document frequency) approach.
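
To make this concrete, here is a minimal from-scratch sketch of both representations (the bow and tf_idf helpers are ours; a real pipeline would typically use a library such as scikit-learn):

```python
import math
from collections import Counter
from typing import List, Tuple

def bow(docs: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Bag of words: one count vector per document over a shared vocabulary."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    return vocab, [[c[w] for w in vocab] for c in counts]

def tf_idf(docs: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Weight each term frequency by log(N / document frequency)."""
    vocab, vectors = bow(docs)
    n_docs = len(docs)
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[tf * w for tf, w in zip(v, idf)] for v in vectors]

docs = [s.split() for s in [
    "i like to go fishing",
    "i like to go shopping",
    "the president visited the country",
]]
print(tf_idf(docs))
```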

However, none of those approaches can fully take into account the semantics of a text. We need more powerful neural network models to do this.

RNN
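
A recurrent neural network (RNN) processes a sequence one token at a time while carrying a hidden state from step to step, which lets it capture the word order that BoW-style representations discard. Below is a minimal from-scratch NumPy implementation with a forward pass and backpropagation through time (BPTT).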

```python
import numpy as np
from typing import List, Optional
from numpy.typing import NDArray

Array2D = NDArray[np.float64]

class SimpleRNN:
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        self.hidden_size = hidden_size

        # Weight initialization (small random values)
        self.W_xh: Array2D = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hh: Array2D = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_hy: Array2D = np.random.randn(output_size, hidden_size) * 0.01
        self.b_h: Array2D = np.zeros((hidden_size, 1))
        self.b_y: Array2D = np.zeros((output_size, 1))

        # Buffers for backprop
        self.last_inputs: List[Array2D] = []
        self.last_hs: List[Array2D] = []

    def forward(self, inputs: List[Array2D]) -> List[Array2D]:
        """
        Forward pass through time
        Each input x_t must be of shape (input_size, 1)
        Returns a list of outputs (output_size, 1)
        """
        h: Array2D = np.zeros((self.hidden_size, 1), dtype=np.float64)
        self.last_inputs = inputs
        self.last_hs = [h]

        outputs: List[Array2D] = []

        for x in inputs:
            h = np.tanh(self.W_xh @ x + self.W_hh @ h + self.b_h)
            y = self.W_hy @ h + self.b_y
            outputs.append(y)
            self.last_hs.append(h)  # Save for BPTT

        return outputs

    def backward(
        self,
        d_y: List[Array2D],
        learning_rate: float = 0.01,
        clip_grad: Optional[float] = 1.0
    ) -> None:
        """
        Backpropagation Through Time (BPTT)
        d_y: list of output gradients, each of shape (output_size, 1)
        learning_rate: learning rate for parameter update
        clip_grad: value to clip gradients (None to disable)
        """

        # Gradient accumulators
        dW_xh = np.zeros_like(self.W_xh)
        dW_hh = np.zeros_like(self.W_hh)
        dW_hy = np.zeros_like(self.W_hy)
        db_h = np.zeros_like(self.b_h)
        db_y = np.zeros_like(self.b_y)

        dh_next = np.zeros((self.hidden_size, 1), dtype=np.float64)

        for t in reversed(range(len(self.last_inputs))):
            x_t = self.last_inputs[t]
            h_t = self.last_hs[t + 1]
            h_prev = self.last_hs[t]

            # Output gradient
            dy = d_y[t]
            dW_hy += dy @ h_t.T
            db_y += dy

            # Backprop into hidden layer
            dh = self.W_hy.T @ dy + dh_next
            dtanh = (1 - h_t ** 2) * dh

            dW_xh += dtanh @ x_t.T
            dW_hh += dtanh @ h_prev.T
            db_h += dtanh
            dh_next = self.W_hh.T @ dtanh

        # Optional: gradient clipping to avoid exploding gradients
        if clip_grad is not None:
            for dparam in [dW_xh, dW_hh, dW_hy, db_h, db_y]:
                np.clip(dparam, -clip_grad, clip_grad, out=dparam)

        # Update weights
        self.W_xh -= learning_rate * dW_xh
        self.W_hh -= learning_rate * dW_hh
        self.W_hy -= learning_rate * dW_hy
        self.b_h  -= learning_rate * db_h
        self.b_y  -= learning_rate * db_y


if __name__ == "__main__":
    inputs: List[Array2D] = [np.random.randn(2, 1).astype(np.float64) for _ in range(4)]

    rnn = SimpleRNN(input_size=2, hidden_size=5, output_size=1)
    outputs: List[Array2D] = rnn.forward(inputs)

    # Fake gradient
    d_y: List[Array2D] = [np.random.randn(1, 1).astype(np.float64) for _ in outputs]
    rnn.backward(d_y, learning_rate=0.01)
```
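
Note that the gradients d_y here (and in the LSTM demo below) are random placeholders; in a real training loop they would come from differentiating a loss function, such as mean squared error or cross-entropy, with respect to the outputs.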

LSTM
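
A long short-term memory (LSTM) network extends the plain RNN with a cell state and forget, input, and output gates, which helps gradients survive over long sequences. The from-scratch implementation below follows the same structure as the RNN above.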

```python
import numpy as np
from typing import List, Optional
from numpy.typing import NDArray

Array2D = NDArray[np.float64]

class SimpleLSTM:
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        def init_gate() -> Array2D:
            return np.random.randn(hidden_size, input_size + hidden_size).astype(np.float64) * 0.01

        # Weight matrices for gates
        self.W_f = init_gate()  # forget gate
        self.W_i = init_gate()  # input gate
        self.W_c = init_gate()  # candidate gate
        self.W_o = init_gate()  # output gate

        # Bias vectors
        self.b_f = np.zeros((hidden_size, 1), dtype=np.float64)
        self.b_i = np.zeros((hidden_size, 1), dtype=np.float64)
        self.b_c = np.zeros((hidden_size, 1), dtype=np.float64)
        self.b_o = np.zeros((hidden_size, 1), dtype=np.float64)

        # Output layer
        self.W_hy = np.random.randn(output_size, hidden_size).astype(np.float64) * 0.01
        self.b_y = np.zeros((output_size, 1), dtype=np.float64)

        # For BPTT
        self.last_inputs: List[Array2D] = []
        self.last_cells: List[Array2D] = []
        self.last_hiddens: List[Array2D] = []
        self.last_gates: List[dict] = []

    def sigmoid(self, x: Array2D) -> Array2D:
        return 1 / (1 + np.exp(-x))

    def forward(self, inputs: List[Array2D]) -> List[Array2D]:
        """
        Forward pass through the LSTM.
        Inputs: list of input vectors (input_size, 1)
        Returns: list of output vectors (output_size, 1)
        """
        h = np.zeros((self.hidden_size, 1), dtype=np.float64)
        c = np.zeros((self.hidden_size, 1), dtype=np.float64)

        self.last_inputs = inputs
        self.last_hiddens = [h]
        self.last_cells = [c]
        self.last_gates = []

        outputs: List[Array2D] = []

        for x in inputs:
            combined = np.vstack((h, x))  # (hidden + input)

            f = self.sigmoid(self.W_f @ combined + self.b_f)       # forget gate
            i = self.sigmoid(self.W_i @ combined + self.b_i)       # input gate
            c_hat = np.tanh(self.W_c @ combined + self.b_c)        # candidate memory
            o = self.sigmoid(self.W_o @ combined + self.b_o)       # output gate

            c = f * c + i * c_hat
            h = o * np.tanh(c)

            y = self.W_hy @ h + self.b_y

            self.last_gates.append({'f': f, 'i': i, 'c_hat': c_hat, 'o': o, 'combined': combined})
            self.last_hiddens.append(h)
            self.last_cells.append(c)
            outputs.append(y)

        return outputs

    def backward(
        self,
        d_y: List[Array2D],
        learning_rate: float = 0.01,
        clip_grad: Optional[float] = 1.0
    ) -> None:
        """
        Backward pass through the LSTM (BPTT).
        d_y: list of output gradients, each of shape (output_size, 1)
        """

        # Gradient accumulators
        dW_f = np.zeros_like(self.W_f)
        dW_i = np.zeros_like(self.W_i)
        dW_c = np.zeros_like(self.W_c)
        dW_o = np.zeros_like(self.W_o)
        db_f = np.zeros_like(self.b_f)
        db_i = np.zeros_like(self.b_i)
        db_c = np.zeros_like(self.b_c)
        db_o = np.zeros_like(self.b_o)

        dW_hy = np.zeros_like(self.W_hy)
        db_y = np.zeros_like(self.b_y)

        dh_next = np.zeros((self.hidden_size, 1), dtype=np.float64)
        dc_next = np.zeros((self.hidden_size, 1), dtype=np.float64)

        for t in reversed(range(len(self.last_inputs))):
            gates = self.last_gates[t]
            f, i, c_hat, o, combined = gates['f'], gates['i'], gates['c_hat'], gates['o'], gates['combined']
            h = self.last_hiddens[t + 1]
            c = self.last_cells[t + 1]
            c_prev = self.last_cells[t]
            x = self.last_inputs[t]
            h_prev = self.last_hiddens[t]

            # Output layer gradients
            dy = d_y[t]
            dW_hy += dy @ h.T
            db_y += dy

            dh = self.W_hy.T @ dy + dh_next
            do = dh * np.tanh(c)
            do_raw = o * (1 - o) * do

            dc = dh * o * (1 - np.tanh(c) ** 2) + dc_next
            dc_hat = dc * i
            dc_hat_raw = (1 - c_hat ** 2) * dc_hat

            di = dc * c_hat
            di_raw = i * (1 - i) * di

            df = dc * c_prev
            df_raw = f * (1 - f) * df

            dW_f += df_raw @ combined.T
            dW_i += di_raw @ combined.T
            dW_c += dc_hat_raw @ combined.T
            dW_o += do_raw @ combined.T
            db_f += df_raw
            db_i += di_raw
            db_c += dc_hat_raw
            db_o += do_raw

            d_combined = (
                self.W_f.T @ df_raw +
                self.W_i.T @ di_raw +
                self.W_c.T @ dc_hat_raw +
                self.W_o.T @ do_raw
            )

            dh_next = d_combined[:self.hidden_size, :]
            dc_next = dc * f

        # Optional: gradient clipping
        if clip_grad is not None:
            for grad in [dW_f, dW_i, dW_c, dW_o, db_f, db_i, db_c, db_o, dW_hy, db_y]:
                np.clip(grad, -clip_grad, clip_grad, out=grad)

        # Apply gradients
        self.W_f -= learning_rate * dW_f
        self.W_i -= learning_rate * dW_i
        self.W_c -= learning_rate * dW_c
        self.W_o -= learning_rate * dW_o
        self.b_f -= learning_rate * db_f
        self.b_i -= learning_rate * db_i
        self.b_c -= learning_rate * db_c
        self.b_o -= learning_rate * db_o
        self.W_hy -= learning_rate * dW_hy
        self.b_y -= learning_rate * db_y


if __name__ == "__main__":
    inputs: List[Array2D] = [np.random.randn(3, 1).astype(np.float64) for _ in range(5)]

    lstm = SimpleLSTM(input_size=3, hidden_size=4, output_size=2)
    outputs = lstm.forward(inputs)

    # Fake gradient from loss
    d_y = [np.random.randn(2, 1).astype(np.float64) for _ in outputs]
    lstm.backward(d_y, learning_rate=0.01)
```