C1W3.Assignment: Hello Vectors

Lecture notes: C1W3.Vector Space Models

Contents

Importing the data
Generating word_embeddings_subset (optional)
Predict relationships among words
- Cosine Similarity
- Euclidean distance
Finding the country of each capital
Model Accuracy
Plotting the vectors using PCA

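The goals of this assignment:

Predict analogies between words, for example: man is to woman as king is to ??
Use PCA to reduce the dimensionality of word embeddings and plot them in two dimensions.
Compare word embeddings using a similarity measure (cosine similarity).
Understand how these vector space models work.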

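Predict the Countries from Capitals task: to illustrate word analogies, the task is to find the country that a given capital belongs to. Write a program that, given a capital city, returns the corresponding country.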

Importing the data

Import the packages. Install any that are missing with pip install.

# Run this cell to import packages.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import w3_unittest

from utils import get_vectors

The dataset is loaded as a Pandas DataFrame. If the dataset is large, loading may take a few minutes.

data = pd.read_csv('./data/capitals.txt', delimiter=' ')
data.columns = ['city1', 'country1', 'city2', 'country2']

# print first five elements in the DataFrame
data.head(5)
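
One detail of the loading code is worth knowing: by default `pd.read_csv` treats the first line of the file as a header, so assigning `data.columns` afterwards replaces that line instead of keeping it as data. The sketch below shows the difference on a small inline sample. The three sample rows are made up for illustration and are not claimed to match the real file, and the results reported in this post use the original loading code.

```python
import io
import pandas as pd

# Hypothetical three-line sample in the same space-separated layout, with no header row.
sample = (
    "Athens Greece Baghdad Iraq\n"
    "Athens Greece Bangkok Thailand\n"
    "Athens Greece Beijing China\n"
)
cols = ['city1', 'country1', 'city2', 'country2']

# Default behaviour: the first line is consumed as the header, then overwritten.
with_default = pd.read_csv(io.StringIO(sample), delimiter=' ')
with_default.columns = cols

# header=None keeps the first line as data; names= supplies the column labels.
with_names = pd.read_csv(io.StringIO(sample), delimiter=' ', header=None, names=cols)

print(len(with_default), len(with_names))  # prints: 2 3
```

If the file has no header row, passing `header=None` together with `names=` keeps every line as a data row.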


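Generating word_embeddings_subset (optional)

The original Google News word embedding dataset is about 3.64 GB. If you have the resources, you can download it yourself, extract the sample of words analyzed in this assignment, and save it in a pickle file named word_embeddings_subset.p.
To extract the word vectors yourself:

Download the pretrained Google News word embedding dataset from here.
Search the page for "GoogleNews-vectors-negative300.bin.gz" and click the link to download it.
Unpack the file.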

import nltk
from gensim.models import KeyedVectors


embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True)
# read the whole file and close the handle afterwards
with open('capitals.txt', 'r') as capitals_file:
    f = capitals_file.read()
set_words = set(nltk.word_tokenize(f))
select_words = words = ['king', 'queen', 'oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']
for w in select_words:
    set_words.add(w)

def get_word_embeddings(embeddings):

    word_embeddings = {}
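    # gensim < 4.0 exposes the vocabulary as embeddings.vocab;
    # in gensim >= 4.0 use embeddings.key_to_index instead.
    for word in embeddings.vocab: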
        if word in set_words:
            word_embeddings[word] = embeddings[word]
    return word_embeddings


# Testing your function
word_embeddings = get_word_embeddings(embeddings)
print(len(word_embeddings))
pickle.dump( word_embeddings, open( "word_embeddings_subset.p", "wb" ) )

Alternatively, download word_embeddings_subset.p directly from the attached resources, save it in the data directory, and load it as a dictionary:

word_embeddings = pickle.load(open("./data/word_embeddings_subset.p", "rb"))
len(word_embeddings)  # there should be 243 words that will be used in this assignment

Output:
243
Each word vector has 300 dimensions:

print("dimension: {}".format(word_embeddings['Spain'].shape[0]))

Output:
dimension: 300

Predict relationships among words

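Next, write a function that uses word embeddings to predict relationships between words.

The function takes three words.
The first two words are related to each other.
It predicts a fourth word whose relationship to the third word is similar to the relationship between the first two.
For example, "Athens is to Greece as Bangkok is to ______"?
Write a program that can find the fourth word.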

Cosine Similarity

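The cosine similarity is defined as:
$\cos (\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}\tag{1}$
$A$ and $B$ are word vectors, and $A_i$ and $B_i$ are their $i$-th components.

If $A$ and $B$ are identical (or, more generally, point in the same direction), then $\cos(\theta)=1$.
If they are exactly opposite, i.e. $A = -B$, then $\cos(\theta)=-1$.
If $\cos(\theta)=0$, they are orthogonal (perpendicular).
A value between 0 and 1 is a similarity score.
A value between -1 and 0 is a dissimilarity score.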
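
The three special cases can be checked numerically. The snippet below is a minimal sketch on small made-up vectors. The helper name `cos_sim` is only for illustration and uses `np.linalg.norm` for the vector lengths.

```python
import numpy as np

def cos_sim(A, B):
    # dot product divided by the product of the two vector lengths
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

a = np.array([1.0, 2.0, 3.0])

same_direction = cos_sim(a, 2 * a)   # same direction, different length
opposite = cos_sim(a, -a)            # exactly opposite
orthogonal = cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # perpendicular

# Compare with a tolerance because of floating-point rounding.
print(np.isclose(same_direction, 1.0),
      np.isclose(opposite, -1.0),
      np.isclose(orthogonal, 0.0))
```

Note that scaling a vector does not change the score: cosine similarity depends only on direction, not on length.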

# UNQ_C1 GRADED FUNCTION: cosine_similarity

def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        cos: numerical number representing the cosine similarity between A and B.
    '''

    ### START CODE HERE ###
    dot = np.dot(A,B)
    norma = np.sqrt(np.dot(A,A))
    normb = np.sqrt(np.dot(B,B))
    cos = dot/(norma*normb)

    ### END CODE HERE ###
    return cos

Test:

# feel free to try different words
king = word_embeddings['king']
queen = word_embeddings['queen']

cosine_similarity(king, queen)

Output:
0.6510957

Euclidean distance

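The Euclidean distance is defined as:
$\begin{aligned} d(\mathbf{A}, \mathbf{B})=d(\mathbf{B}, \mathbf{A}) &=\sqrt{\left(A_{1}-B_{1}\right)^{2}+\left(A_{2}-B_{2}\right)^{2}+\cdots+\left(A_{n}-B_{n}\right)^{2}} \\ &=\sqrt{\sum_{i=1}^{n}\left(A_{i}-B_{i}\right)^{2}} \end{aligned}$
where:

$n$ is the number of elements in each vector.
$A$ and $B$ are the corresponding word vectors.
The more similar two words are, the closer their Euclidean distance tends to be to 0.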

# UNQ_C2 GRADED FUNCTION: euclidean

def euclidean(A, B):
    """
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        d: numerical number representing the Euclidean distance between A and B.
    """

    ### START CODE HERE ###

    # euclidean distance    
    d = np.sqrt(np.sum((A-B)**2))

    ### END CODE HERE ###

    return d

Test:

# Test your function
euclidean(king, queen)

Output:
2.4796925
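
As a cross-check, the same distance can be computed with `np.linalg.norm` applied to the difference vector. The sketch below uses small made-up vectors rather than the word embeddings, and also illustrates how the two measures are linked: for unit-length vectors, the squared Euclidean distance equals $2 - 2\cos(\theta)$.

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 6.0, 3.0])

# Formula from the text versus the built-in vector norm.
d_formula = np.sqrt(np.sum((A - B) ** 2))
d_norm = np.linalg.norm(A - B)
print(d_formula, d_norm)  # prints: 5.0 5.0

# For unit vectors u and v: ||u - v||^2 = 2 - 2 * cos(theta).
u = A / np.linalg.norm(A)
v = B / np.linalg.norm(B)
cos_uv = np.dot(u, v)
print(np.isclose(np.linalg.norm(u - v) ** 2, 2 - 2 * cos_uv))
```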

Finding the country of each capital

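Use the functions implemented above to compute similarities between vectors, and use those similarities to find the country of each capital. Write a function that takes three words and a dictionary of word vectors. For example, given the words:
1: Athens 2: Greece 3: Baghdad,
the prediction should be: 4: Iraq
When writing the function:
1. You may want to refer to the King - Man + Woman = Queen example and use word embeddings and a similarity function to turn that scheme into a mathematical function.
2. Iterate over the embeddings dictionary and compute the cosine similarity between your vector and the current word embedding.
3. Make sure the returned word is not one of the words passed to the function.
4. Return the word with the highest score.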

# UNQ_C3 GRADED FUNCTION: get_country

def get_country(city1, country1, city2, embeddings, cosine_similarity=cosine_similarity):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of city1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and the values are their embeddings
    Output:
        country: a tuple with the most likely country and its similarity score
    """
    ### START CODE HERE ###

    # store the city1, country 1, and city 2 in a set called group
    group = set((city1, country1, city2))

    # get embeddings of city 1
    city1_emb = embeddings[city1]

    # get embedding of country 1
    country1_emb = embeddings[country1]

    # get embedding of city 2
    city2_emb = embeddings[city2]

    # get embedding of country 2 (it's a combination of the embeddings of country 1, city 1 and city 2)
    # Remember: King - Man + Woman = Queen
    vec = country1_emb-city1_emb+city2_emb

    # Initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    similarity = -1

    # initialize country to an empty string
    country = ''

    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():

        # first check that the word is not already in the 'group'
        if word not in group:

            # get the word embedding
            word_emb = embeddings[word]

            # calculate cosine similarity between embedding of country 2 and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(vec,word_emb)

            # if the cosine similarity is more similar than the previously best similarity...
            if cur_similarity > similarity:

                # update the similarity to the new, better similarity
                similarity = cur_similarity

                # store the country as a tuple, which contains the word and the similarity
                country = (word,similarity)

    ### END CODE HERE ###

    return country

Test:

# Testing your function, note to make it more robust you can return the 5 most similar words.
get_country('Athens', 'Greece', 'Cairo', word_embeddings)

Output:
('Egypt', 0.7626821)
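
The comment in the test cell suggests returning the 5 most similar words to make the function more robust. The sketch below is one possible way to do that, not part of the graded solution. `get_top_countries` is a hypothetical helper, and the 2-dimensional `toy` embeddings are invented so the example runs without the real data.

```python
import numpy as np

def cosine_similarity(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

def get_top_countries(city1, country1, city2, embeddings, k=5):
    """Hypothetical helper: return up to k (word, score) pairs, best first."""
    group = {city1, country1, city2}
    # same analogy vector as in get_country
    vec = embeddings[country1] - embeddings[city1] + embeddings[city2]
    scored = [(word, cosine_similarity(vec, emb))
              for word, emb in embeddings.items()
              if word not in group]
    # sort by score, highest first, and keep the first k entries
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-D embeddings, made up for illustration only.
toy = {
    'Athens': np.array([1.0, 0.0]),
    'Greece': np.array([1.0, 1.0]),
    'Cairo':  np.array([2.0, 0.0]),
    'Egypt':  np.array([2.0, 1.0]),
    'oil':    np.array([-1.0, 0.5]),
}

top = get_top_countries('Athens', 'Greece', 'Cairo', toy, k=2)
print([word for word, _ in top])  # prints: ['Egypt', 'oil']
```

With the real data, the call would presumably look like `get_top_countries('Athens', 'Greece', 'Cairo', word_embeddings)`, and the first entry should match what `get_country` returns.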

Model Accuracy

The accuracy is computed as:
$\text{Accuracy}=\frac{\text{Correct \# of predictions}}{\text{Total \# of predictions}}$
Using the get_country function above, iterate over every row of the dataset and compute the accuracy.

# UNQ_C4 GRADED FUNCTION: get_accuracy

def get_accuracy(word_embeddings, data, get_country=get_country):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
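        data: a pandas DataFrame with the columns city1, country1, city2 and country2
    Output:
        accuracy: the fraction of rows for which country2 is predicted correctly
    '''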

    ### START CODE HERE ###
    # initialize num correct to zero
    num_correct = 0

    # loop through the rows of the dataframe
    for i, row in data.iterrows():

        # get city1
        city1 = row['city1']

        # get country1
        country1 = row['country1']

        # get city2
        city2 = row['city2']
        
        # get country2
        country2 = row['country2']

        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1,country1,city2,word_embeddings)

        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1

    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)

    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct /m

    ### END CODE HERE ###
    return accuracy

Test:

accuracy = get_accuracy(word_embeddings, data)
print(f"Accuracy is {accuracy:.2f}")

Output:
Accuracy is 0.92

Plotting the vectors using PCA

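Next, we use principal component analysis (PCA) to explore the distances between word vectors after reducing their dimensionality. The word vectors have 300 dimensions, which makes them hard to visualize, so we use PCA to project them into a lower-dimensional space while preserving as much of the original information as possible. After visualization, similar words should cluster together. For example, "sad", "happy" and "joyful" all describe emotions and should be plotted close to each other. "oil", "gas" and "petroleum" all describe natural resources. "city", "village" and "town" can be regarded as synonyms that describe similar things.
The rough steps are:

Mean-normalize the data.
Compute the covariance matrix ($\Sigma$) of the data.
Compute the eigenvectors and eigenvalues of the covariance matrix.
Multiply the normalized data by the first K eigenvectors (those with the largest eigenvalues).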

# UNQ_C5 GRADED FUNCTION: compute_pca


def compute_pca(X, n_components=2):
    """
    Input:
        X: 2D NumPy array of dimension (m,n) where each row corresponds to a word vector
        n_components: Number of components you want to keep.
    Output:
        X_reduced: data transformed into n_components dims/columns
    """

    ### START CODE HERE ###
    # mean center the data
    X_demeaned = X - np.mean(X,axis=0)

    # calculate the covariance matrix
    covariance_matrix = np.cov(X_demeaned, rowvar=False)

    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix, UPLO='L')

    # sort eigenvalue in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)
    
    # reverse the order so that it's from highest to lowest.
    idx_sorted_decreasing = idx_sorted[::-1]

    # sort the eigen values by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # sort eigenvectors using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:,idx_sorted_decreasing]

    # select the first n eigenvectors (n is desired dimension
    # of rescaled data array, or dims_rescaled_data)
    eigen_vecs_subset = eigen_vecs_sorted[:,0:n_components]

    # transform the data by multiplying the transpose of the eigenvectors with the transpose of the de-meaned data
    # Then take the transpose of that product.
    X_reduced = np.dot(eigen_vecs_subset.transpose(),X_demeaned.transpose()).transpose()

    ### END CODE HERE ###

    return X_reduced

Test:

# Testing your function
np.random.seed(1)
X = np.random.rand(3, 10)
X_reduced = compute_pca(X, n_components=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(X_reduced)

Output:
Your original matrix was (3, 10) and it became:
[[ 0.43437323  0.49820384]
 [ 0.42077249 -0.50351448]
 [-0.85514571  0.00531064]]
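
One way to sanity-check an eigendecomposition-based PCA is to compare it with a projection obtained from the singular value decomposition (SVD) of the mean-centered data. The two should agree up to the sign of each component, because an eigenvector is only defined up to sign. The sketch below assumes this setup. `pca_eigh` is a compact restatement of the steps above and `pca_svd` is a hypothetical alternative, neither is the graded function itself.

```python
import numpy as np

def pca_eigh(X, n_components=2):
    # mean-center, eigendecompose the covariance matrix, keep the top components
    X_demeaned = X - np.mean(X, axis=0)
    cov = np.cov(X_demeaned, rowvar=False)
    eigen_vals, eigen_vecs = np.linalg.eigh(cov)
    order = np.argsort(eigen_vals)[::-1][:n_components]
    return X_demeaned @ eigen_vecs[:, order]

def pca_svd(X, n_components=2):
    # rows of Vt are the principal directions, ordered by decreasing singular value
    X_demeaned = X - np.mean(X, axis=0)
    U, S, Vt = np.linalg.svd(X_demeaned, full_matrices=False)
    return X_demeaned @ Vt[:n_components].T

np.random.seed(1)
X = np.random.rand(3, 10)

reduced_eigh = pca_eigh(X)
reduced_svd = pca_svd(X)

# Compare absolute values, since each column may differ by a factor of -1.
same_up_to_sign = np.allclose(np.abs(reduced_eigh), np.abs(reduced_svd))
print(reduced_eigh.shape, same_up_to_sign)
```

The SVD route avoids building the full covariance matrix, which can matter when the number of dimensions is much larger than the number of samples.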

Next, pick 11 words and look at their PCA-reduced representations:

words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
         'village', 'country', 'continent', 'petroleum', 'joyful']

# given a list of words and the embeddings, it returns a matrix with all the embeddings
X = get_vectors(word_embeddings, words)

print('You have 11 words each of 300 dimensions thus X.shape is:', X.shape)

Output:
You have 11 words each of 300 dimensions thus X.shape is: (11, 300)
Reduce the dimensionality and plot:

# We have done the plotting for you. Just run this cell.
result = compute_pca(X, 2)
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0] - 0.05, result[i, 1] + 0.1))

plt.show()