Word2Vec: A Technique for Converting Words into Vectors

news2025/4/18 15:17:34

Contents

Word2Vec
- Background
  - Hierarchical Softmax
  - Negative Sampling

Word2Vec

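The notes below are purely for my own reference, so you may not get much out of them. If you want to understand these two optimization techniques properly, I recommend this blog post, which explains them very well:
详解-----分层Softmax与负采样 (A Detailed Explanation of Hierarchical Softmax and Negative Sampling)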

Background

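word2vec, as the name suggests, converts words into vectors. In machine learning and natural language processing we often need to translate sentences or generate some words from others, and most of these tasks can now be handled by neural networks. In sentence translation, for example, we feed a sentence into the network and expect its translation as the output. But a computer only accepts numerical input, so the words first have to be converted into numbers.
word2vec is a technique for converting words into numeric vectors. After training, every word has a fixed vector representation, such that words with similar meanings lie close together in the vector space and unrelated words lie far apart.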
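To make "close together in the vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; they are not trained word2vec embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, chosen by hand so that "cat" and "dog" point
# in a similar direction while "car" points elsewhere.
embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print("cat vs dog:", sim_cat_dog)
print("cat vs car:", sim_cat_car)
```

With these made-up vectors, the related pair scores higher than the unrelated pair, which is the property that training is meant to produce.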
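word2vec has two classic training methods for obtaining these word vectors: CBOW (Continuous Bag of Words) and skip-gram.
The core idea of CBOW is to predict a word from the words around it, much like a fill-in-the-blank exercise.
The core idea of skip-gram is the reverse: use a word to predict the words around it, which is like building a sentence from a single word.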
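The difference between the two methods shows up in how training examples are built from a sentence. The sketch below is one possible way to generate them; the function names `cbow_pairs` and `skipgram_pairs` and the `window` parameter are illustrative assumptions, not part of the original notes.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, center_word): the context predicts the center."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, center

def skipgram_pairs(tokens, window=2):
    """Yield (center_word, context_word): the center predicts each neighbor."""
    for context, center in cbow_pairs(tokens, window):
        for word in context:
            yield center, word

tokens = "the cat sat on the mat".split()
cbow = list(cbow_pairs(tokens, window=1))
skip = list(skipgram_pairs(tokens, window=1))

print(cbow[1])   # context on both sides of "cat", then "cat" itself
print(skip[:3])  # one (center, neighbor) pair per neighbor
```

CBOW produces one example per position, with all neighbors as the input, while skip-gram produces one example per (center, neighbor) pair.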
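When either model is trained with a neural network, the output layer has one unit per word in the vocabulary, and a softmax over all of them is needed to obtain the predicted probability of each word. That means a very large number of exponentiations for every training example, which makes training computationally expensive.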
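A rough sketch of where that cost comes from: with a full softmax, every training example needs one dot product and one exponentiation per vocabulary word. The sizes and the names `W_out` and `h` below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100          # illustrative sizes, not from the notes

h = rng.normal(size=dim)               # hidden vector for one training example
W_out = rng.normal(size=(vocab_size, dim)) * 0.01   # one output row per word

logits = W_out @ h                     # vocab_size dot products
logits = logits - logits.max()         # shift for numerical stability
exp_logits = np.exp(logits)            # vocab_size exponentiations
probs = exp_logits / exp_logits.sum()  # normalize over the whole vocabulary

print(probs.shape, probs.sum())
```

Both the number of exponentiations and the size of the gradient update grow linearly with the vocabulary, which is the cost the two optimizations below try to avoid.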
Two optimizations were proposed to address this problem: hierarchical softmax and negative sampling. This section focuses on those two techniques.

Hierarchical Softmax

[Figure: core workflow of hierarchical softmax]

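The figure shows the core workflow of hierarchical softmax. Taking CBOW as the example, we first look up the embedding of each surrounding word and average them to obtain the vector h in the figure. We then build a Huffman tree from the word frequencies, so that each leaf node represents one word in the vocabulary. Each non-leaf node is given a trainable parameter vector. At a non-leaf node, we take the dot product of h with that node's parameters and pass it through a sigmoid, giving a value between 0 and 1 that is read as the probability of taking one branch; the other branch gets one minus that value. The probability of a word is the product of these branch probabilities over all non-leaf nodes on the path from the root to the word's leaf. Because the two branch probabilities at every node sum to 1, the probabilities of all words in the vocabulary also sum to 1.
The loss is binary cross-entropy, applied to each branch decision along the path.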
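The path-probability computation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Huffman codes are written by hand for a hypothetical four-word vocabulary, the sigmoid is taken to be the probability of the '0' branch (the opposite convention works equally well), and the node vectors are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical Huffman codes ('0' and '1' are the two branches at a node).
codes = {"the": "0", "cat": "10", "dog": "110", "mat": "111"}

# Each non-leaf node is identified by the code prefix that leads to it:
# "" is the root, then "1", then "11" (3 non-leaf nodes for 4 leaves).
rng = np.random.default_rng(0)
dim = 8
node_vectors = {prefix: rng.normal(size=dim) for prefix in ["", "1", "11"]}
h = rng.normal(size=dim)  # stands in for the averaged context vector

def word_probability(word):
    """Multiply the branch probabilities along the root-to-leaf path."""
    prob = 1.0
    code = codes[word]
    for depth, bit in enumerate(code):
        p_zero = sigmoid(node_vectors[code[:depth]] @ h)  # P(branch '0')
        prob *= p_zero if bit == "0" else 1.0 - p_zero
    return prob

probs = {word: word_probability(word) for word in codes}
total = sum(probs.values())
loss_cat = -np.log(probs["cat"])  # sum of binary cross-entropies on the path

print(probs)
print("sum of probabilities:", total)
```

Scoring "cat" touches only two node vectors here instead of all four output words; in a balanced or Huffman tree the path length grows roughly with the logarithm of the vocabulary size.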
Building the Huffman tree

import heapq
import numpy as np

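# Build a Huffman tree
class HuffmanTree:
    def __init__(self, vocab, freq):
        self.vocab = vocab  # list of words
        self.freq = freq    # frequency of each word, in the same order as vocab
        self.tree = self.build_huffman_tree()

    def build_huffman_tree(self):
        # Each heap entry is [total_weight, [symbol, code], [symbol, code], ...]
        heap = [[weight, [symbol, ""]] for symbol, weight in zip(self.vocab, self.freq)]
        heapq.heapify(heap)

        while len(heap) > 1:
            # Merge the two lowest-weight subtrees
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            # Prefix '0' to every code in the lighter subtree, '1' in the heavier one
            for pair in lo[1:]:
                pair[1] = '0' + pair[1]
            for pair in hi[1:]:
                pair[1] = '1' + pair[1]
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])

        # Drop the total weight and keep only the [symbol, code] pairs
        return heap[0][1:]

    def get_code(self):
        # Map each word to its Huffman code (its root-to-leaf path)
        return {symbol: code for symbol, code in self.tree}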

Hierarchical softmax code (corpus preprocessing)

def preprocess(text):
    # Lowercase the text and split punctuation off as separate tokens
    text = text.lower()
    text = text.replace('.', ' .')
    text = text.replace(',', ' ,')
    text = text.replace('!', ' !')
    words = text.split()  # split on any whitespace, so no empty tokens

    word_to_id = {}
    id_to_word = {}
    word_count = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word
            word_count[new_id] = 1
        else:
            word_count[word_to_id[word]] += 1

    # Build the id sequence once, after the vocabulary is complete
    corpus = np.array([word_to_id[w] for w in words])

    return corpus, word_to_id, id_to_word, word_count

Negative Sampling

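The first part of negative sampling is the same as before: h is obtained exactly as in the previous section. The basic idea is to draw a small number of negative samples from a probability distribution and let them take part in each training step. (Ordinarily, wouldn't we train on the positive sample alone?) We score h against the output parameters to obtain logits, but we do not apply a softmax over the whole vocabulary, which avoids the large number of exponentiations. Only a few logits are actually needed: the logit of the positive sample, selected by its label, and the logits of a few words drawn from the vocabulary as negative samples. We then turn just these few logits (the positive one and the negative ones) into probabilities, and train the network by maximizing the probability of the positive sample while minimizing the probabilities of the negative samples. These notes describe that last step as a softmax over the sampled words; in the original word2vec formulation, each logit instead goes through its own sigmoid and is treated as an independent binary classification.
How many negative samples should we use? According to the original word2vec paper, 5 to 20 negative samples work well for small training sets, while for large datasets as few as 2 to 5 are enough.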
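As a sketch of the objective, the block below computes the negative sampling loss in the sigmoid form used by the original word2vec formulation: the positive logit is pushed up and each negative logit is pushed down. The function name `negative_sampling_loss` and the toy vectors are assumptions for illustration, not code from the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, pos_vec, neg_vecs):
    """-log sigmoid(pos . h) - sum over negatives of log sigmoid(-neg . h)."""
    pos_score = pos_vec @ h    # a single logit for the positive word
    neg_scores = neg_vecs @ h  # one logit per sampled negative word
    return float(-np.log(sigmoid(pos_score))
                 - np.sum(np.log(sigmoid(-neg_scores))))

h = np.ones(4)  # toy hidden vector

# Well-separated case: the positive word aligns with h, the negatives oppose it.
loss_good = negative_sampling_loss(h, np.ones(4), -np.ones((2, 4)))
# Reversed case: the positive word opposes h, the negatives align with it.
loss_bad = negative_sampling_loss(h, -np.ones(4), np.ones((2, 4)))

print(loss_good, loss_bad)
```

Only one positive and a handful of negative logits are evaluated per example, so the cost no longer depends on the vocabulary size.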

Negative sampling code:

import random
import numpy as np
from collections import Counter

# Example corpus
corpus = [
    'cat is on the mat',
    'dog is in the house',
    'cat and dog are friends',
    'dog is playing with the ball'
]

# 1. Build the vocabulary and count word frequencies
def build_vocab(corpus):
    words = []
    for sentence in corpus:
        words.extend(sentence.split())
    word_counts = Counter(words)
    vocab = {word: count for word, count in word_counts.items()}
    return vocab

# Count word frequencies and build the vocabulary
vocab = build_vocab(corpus)
vocab_size = len(vocab)
print("Vocabulary:", vocab)

# 2. Negative sampling
def get_negative_sample(vocab, num_samples=5):
    # Raise the word counts to the power 0.75, then normalize
    word_freq = np.array([count for count in vocab.values()], dtype=float)
    word_freq = word_freq ** 0.75  # smoothed unigram distribution (exponent 0.75)
    word_freq /= word_freq.sum()  # normalize so the probabilities sum to 1

    # Draw negative samples according to these weights (with replacement)
    negative_samples = np.random.choice(list(vocab.keys()), size=num_samples, p=word_freq)
    return negative_samples

# Test negative sampling
negative_samples = get_negative_sample(vocab, num_samples=5)
print("Negative samples:", negative_samples)

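# 3. A simplified training step (Skip-Gram model)
def train_step(context, target, negative_samples, learning_rate=0.1):
    # This only prints what each training step would do; a real implementation would update the model parameters here
    print(f"\nContext words: {context}")
    print(f"Target word: {target}")
    print(f"Negative samples: {negative_samples}")

    # The loss and gradient computation depends on the actual model
    # Here we only simulate the training process

    # Illustrative loss handling (the target word is the positive sample, the sampled words are negatives)
    # Note: the sampler may draw the target itself, which is then reported as the positive sample
    for word in [target] + list(negative_samples):
        if word == target:
            print(f"Target word {target}: positive sample, maximize its probability")
        else:
            print(f"Negative sample {word}: minimize its probability")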

# 4. Simulate the training process
def train_model(corpus, vocab, num_epochs=10, num_negative_samples=5, learning_rate=0.1):
    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch + 1}/{num_epochs}")

        # Iterate over the corpus
        for sentence in corpus:
            words = sentence.split()

            # Simulate Skip-Gram training
            for i, target in enumerate(words):
                # The context is every other word in the sentence (no window limit)
                context = [words[j] for j in range(len(words)) if j != i]

                # Draw negative samples from the vocabulary
                negative_samples = get_negative_sample(vocab, num_samples=num_negative_samples)

                # Run one training step
                train_step(context, target, negative_samples, learning_rate)

# Train the model
train_model(corpus, vocab)
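The `train_step` above only prints what would happen. For comparison, here is a minimal sketch of what an actual parameter update could look like for skip-gram with negative sampling, using the per-word sigmoid objective. Everything here is an assumption for illustration: the function `sgns_step`, the matrices `W_in` and `W_out`, the integer word ids, and the fixed negative ids are hypothetical and not part of the code above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.1):
    """Apply one SGD update for a (center, context) pair and its negatives.

    Returns the loss measured before the update.
    """
    ids = np.array([context] + list(negatives))  # positive word first
    labels = np.zeros(len(ids))
    labels[0] = 1.0                               # 1 = positive, 0 = negative

    h = W_in[center].copy()                       # skip-gram: h is the center vector
    probs = sigmoid(W_out[ids] @ h)               # one sigmoid per sampled word
    loss = -np.log(probs[0]) - np.sum(np.log(1.0 - probs[1:]))

    grad_logits = probs - labels                  # d(loss) / d(logit)
    grad_h = grad_logits @ W_out[ids]             # computed before W_out changes
    # ufunc.at accumulates correctly even if an id appears more than once
    np.subtract.at(W_out, ids, lr * np.outer(grad_logits, h))
    W_in[center] -= lr * grad_h
    return float(loss)

rng = np.random.default_rng(0)
vocab_size, dim = 6, 10
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))  # input (word) vectors
W_out = np.zeros((vocab_size, dim))                    # output vectors start at zero

# Repeat the same update to watch the loss fall for this one pair.
losses = [sgns_step(W_in, W_out, center=0, context=1, negatives=[3, 4])
          for _ in range(50)]
print("first loss:", losses[0])
print("last loss:", losses[-1])
```

Each step touches only the rows of `W_out` for the positive word and the sampled negatives, plus one row of `W_in`, which is why the update is cheap regardless of vocabulary size.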