Theory lecture: C1W3. Vector Space Models
Table of contents
- Importing the data
- Generating word_embeddings_subset (optional)
- Predict relationships among words
- Cosine Similarity
- Euclidean distance
- Finding the country of each capital
- Model Accuracy
- Plotting the vectors using PCA
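The goals of this assignment:
- Predict analogies between words, for example: man is to woman as king is to ??
- Use PCA to reduce the dimensionality of word embeddings and plot them in two dimensions.
- Compare word embeddings using a similarity measure (cosine similarity).
- Understand how these vector space models work.
Predict the Countries from Capitals, task description: illustrate word analogies by finding the country that a given capital belongs to. You will write a program that, given a capital city, returns the corresponding country.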
Importing the data
Import the packages; install any that are missing with pip install.
# Run this cell to import packages.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import w3_unittest
from utils import get_vectors
The dataset is loaded as a Pandas DataFrame. With a large dataset this may take a few minutes.
data = pd.read_csv('./data/capitals.txt', delimiter=' ')
data.columns = ['city1', 'country1', 'city2', 'country2']
# print first five elements in the DataFrame
data.head(5)
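One detail worth checking when loading: `pd.read_csv` treats the first line of a file as the header by default, so assigning `data.columns` afterwards renames the columns but does not bring that first line back as a data row. The source does not say whether capitals.txt starts with a header line, so the following is only a sketch on a made-up three-line sample (the `sample` string is hypothetical) comparing the two ways of loading:

```python
import io
import pandas as pd

# Hypothetical sample in the same space-separated layout (not the real file)
sample = (
    "Athens Greece Baghdad Iraq\n"
    "Athens Greece Bangkok Thailand\n"
    "Athens Greece Beijing China\n"
)
columns = ['city1', 'country1', 'city2', 'country2']

# Default behaviour: the first line is consumed as the header
df_default = pd.read_csv(io.StringIO(sample), delimiter=' ')
df_default.columns = columns

# header=None with names=... keeps the first line as a data row
df_named = pd.read_csv(io.StringIO(sample), delimiter=' ',
                       header=None, names=columns)

print(len(df_default), len(df_named))
```

If the real file has no header line, the second form would keep one extra analogy in the dataset; if it does have one, the form used above is the right choice.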
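Generating word_embeddings_subset (optional)
The original Google News word-embedding dataset is about 3.64 GB. If you have the resources, you can download it yourself, extract the sample of words analyzed in this assignment, and save them to a pickle file named word_embeddings_subset.p.
To extract the word vectors yourself:
- Download the pretrained Google News word-embedding dataset.
- On the download page, search for "GoogleNews-vectors-negative300.bin.gz" and click the link to download it.
- Decompress the file.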
import nltk
from gensim.models import KeyedVectors
embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary = True)
# Read the analogy file and collect every token that appears in it
with open('capitals.txt', 'r') as f:
    set_words = set(nltk.word_tokenize(f.read()))
# Extra words that are used later for the PCA plot
select_words = ['king', 'queen', 'oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']
for w in select_words:
    set_words.add(w)

def get_word_embeddings(embeddings):
    word_embeddings = {}
    # gensim >= 4.0 exposes the vocabulary as key_to_index (older releases used .vocab)
    for word in embeddings.key_to_index:
        if word in set_words:
            word_embeddings[word] = embeddings[word]
    return word_embeddings

# Testing your function
word_embeddings = get_word_embeddings(embeddings)
print(len(word_embeddings))
with open("word_embeddings_subset.p", "wb") as out_file:
    pickle.dump(word_embeddings, out_file)
Alternatively, download word_embeddings_subset.p directly from the attached resources, save it in the data directory, and load it as a dictionary:
word_embeddings = pickle.load(open("./data/word_embeddings_subset.p", "rb"))
len(word_embeddings) # there should be 243 words that will be used in this assignment
Output:
243
Each word vector has 300 dimensions:
print("dimension: {}".format(word_embeddings['Spain'].shape[0]))
Output:
dimension: 300
Predict relationships among words
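Next, write a function that uses word embeddings to predict relationships between words.
- The function takes three words as input.
- The first two words are related to each other.
- It predicts a fourth word whose relationship to the third word mirrors the relationship between the first two.
- For example, "Athens is to Greece as Bangkok is to ______"?
- Write a program that can find the fourth word.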
Cosine Similarity
The cosine similarity is defined as:
$$\cos (\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}\tag{1}$$
$\mathbf{A}$ and $\mathbf{B}$ are word vectors, and $A_i$ or $B_i$ is the $i$-th component of the corresponding vector.
- If $\mathbf{A}$ and $\mathbf{B}$ are identical, then $\cos(\theta)=1$.
- If they are exactly opposite, i.e. $\mathbf{A}=-\mathbf{B}$, then $\cos(\theta)=-1$.
- If $\cos(\theta)=0$, they are orthogonal (perpendicular).
- A value between 0 and 1 indicates a similarity score.
- A value between -1 and 0 indicates a dissimilarity score.
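The boundary cases listed above can be checked numerically. The snippet below is a minimal sketch on made-up vectors; `cos_sim` is a hypothetical helper written only for this check, not the graded function:

```python
import numpy as np

def cos_sim(a, b):
    # Dot product divided by the product of the two Euclidean norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])

same_direction = cos_sim(a, 2 * a)   # same direction, different length
opposite = cos_sim(a, -a)            # A = -B
orthogonal = cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite, orthogonal)
```

Note that scaling a vector does not change the score: cosine similarity depends only on direction, so the similarity is 1 (up to floating-point rounding) for any positive multiple of the same vector, not just for identical vectors.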
# UNQ_C1 GRADED FUNCTION: cosine_similarity
def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        cos: numerical number representing the cosine similarity between A and B.
    '''
    ### START CODE HERE ###
    dot = np.dot(A, B)
    norma = np.sqrt(np.dot(A, A))
    normb = np.sqrt(np.dot(B, B))
    cos = dot / (norma * normb)
    ### END CODE HERE ###
    return cos
Test:
# feel free to try different words
king = word_embeddings['king']
queen = word_embeddings['queen']
cosine_similarity(king, queen)
Output:
0.6510957
Euclidean distance
The Euclidean distance is defined as:
$$\begin{aligned} d(\mathbf{A}, \mathbf{B})=d(\mathbf{B}, \mathbf{A}) &=\sqrt{\left(A_{1}-B_{1}\right)^{2}+\left(A_{2}-B_{2}\right)^{2}+\cdots+\left(A_{n}-B_{n}\right)^{2}} \\ &=\sqrt{\sum_{i=1}^{n}\left(A_{i}-B_{i}\right)^{2}} \end{aligned}$$
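where:
- $n$ is the number of elements in each vector.
- $\mathbf{A}$ and $\mathbf{B}$ are the word vectors being compared.
- The more similar two words are, the more likely their Euclidean distance is close to 0.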
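Unlike cosine similarity, the Euclidean distance is sensitive to vector length. The following sketch uses made-up vectors to show that the summed-squares formula agrees with `np.linalg.norm`, and that two vectors pointing in the same direction can still be far apart:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction as a, but ten times longer

# Formula from the text: square root of the summed squared differences
d_manual = np.sqrt(np.sum((a - b) ** 2))
# Equivalent NumPy shortcut: the norm of the difference vector
d_norm = np.linalg.norm(a - b)

# Cosine similarity of the same pair, for comparison
cos_ab = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(d_manual, d_norm, cos_ab)
```

Here the cosine similarity is 1 while the distance is large, which is one reason the analogy search later in this post ranks candidates by cosine similarity.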
# UNQ_C2 GRADED FUNCTION: euclidean
def euclidean(A, B):
    """
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        d: numerical number representing the Euclidean distance between A and B.
    """
    ### START CODE HERE ###
    # euclidean distance
    d = np.sqrt(np.sum((A - B) ** 2))
    ### END CODE HERE ###
    return d
Test:
# Test your function
euclidean(king, queen)
Output:
2.4796925
Finding the country of each capital
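Use the functions implemented above to compute similarities between vectors, and use those similarities to find the country of each capital. Write a function that takes three words and a dictionary of word vectors and returns the country whose capital is the third word. For example, given the words:
1: Athens 2: Greece 3: Baghdad,
the prediction should be: 4: Iraq
When writing the function:
1. Refer to the King - Man + Woman = Queen example, and use the word embeddings and a similarity function to turn that scheme into arithmetic on vectors.
2. Iterate over the embeddings dictionary and compute the cosine similarity between the target vector and each word's embedding.
3. Make sure the returned word is not one of the input words.
4. Return the word with the highest score.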
# UNQ_C3 GRADED FUNCTION: get_country
def get_country(city1, country1, city2, embeddings, cosine_similarity=cosine_similarity):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of capital1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and the values are their embeddings
    Output:
        country: a tuple with the most likely country and its similarity score
    """
    ### START CODE HERE ###
    # store the city1, country 1, and city 2 in a set called group
    group = set((city1, country1, city2))
    # get embeddings of city 1
    city1_emb = embeddings[city1]
    # get embedding of country 1
    country1_emb = embeddings[country1]
    # get embedding of city 2
    city2_emb = embeddings[city2]
    # get embedding of country 2 (it's a combination of the embeddings of country 1, city 1 and city 2)
    # Remember: King - Man + Woman = Queen
    vec = country1_emb - city1_emb + city2_emb
    # Initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    similarity = -1
    # initialize country to an empty string
    country = ''
    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():
        # first check that the word is not already in the 'group'
        if word not in group:
            # get the word embedding
            word_emb = embeddings[word]
            # calculate cosine similarity between embedding of country 2 and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(vec, word_emb)
            # if the cosine similarity is more similar than the previously best similarity...
            if cur_similarity > similarity:
                # update the similarity to the new, better similarity
                similarity = cur_similarity
                # store the country as a tuple, which contains the word and the similarity
                country = (word, similarity)
    ### END CODE HERE ###
    return country
Test:
# Testing your function, note to make it more robust you can return the 5 most similar words.
get_country('Athens', 'Greece', 'Cairo', word_embeddings)
Output:
('Egypt', 0.7626821)
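The comment in the test cell suggests returning the 5 most similar words for robustness. One way this could look is sketched below; `top_k_analogy` is a hypothetical name and the 2-D `toy` embeddings are made up so the block runs on its own. It is a vectorized variant, not the assignment's reference solution:

```python
import numpy as np

def top_k_analogy(a, b, c, embeddings, k=5):
    """Return the k words closest (by cosine) to emb[b] - emb[a] + emb[c]."""
    exclude = {a, b, c}
    # Candidate words: everything except the three inputs
    words = [w for w in embeddings if w not in exclude]
    M = np.stack([embeddings[w] for w in words])      # shape (n_words, dim)
    vec = embeddings[b] - embeddings[a] + embeddings[c]
    # Cosine similarity of every candidate row against the target vector
    sims = M @ vec / (np.linalg.norm(M, axis=1) * np.linalg.norm(vec))
    # Indices of the k largest similarities, best first
    order = np.argsort(sims)[::-1][:k]
    return [(words[i], float(sims[i])) for i in order]

# Made-up 2-D embeddings, for illustration only
toy = {
    'Athens': np.array([1.0, 0.0]),
    'Greece': np.array([1.0, 1.0]),
    'Cairo':  np.array([3.0, 0.0]),
    'Egypt':  np.array([3.0, 1.0]),
    'oil':    np.array([-1.0, -2.0]),
}
result = top_k_analogy('Athens', 'Greece', 'Cairo', toy, k=2)
print(result)
```

With the real data, the call would presumably be `top_k_analogy('Athens', 'Greece', 'Cairo', word_embeddings)`; looking at the runners-up makes it easier to see whether a wrong top prediction was a near miss.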
Model Accuracy
Accuracy is computed as:
$$\text{Accuracy}=\frac{\text{Correct \# of predictions}}{\text{Total \# of predictions}}$$
Using the get_country function above, iterate over every row of the dataset and compute the accuracy.
# UNQ_C4 GRADED FUNCTION: get_accuracy
def get_accuracy(word_embeddings, data, get_country=get_country):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas DataFrame with the columns city1, country1, city2, country2
    Output:
        accuracy: the fraction of rows whose country2 is predicted correctly
    '''
    ### START CODE HERE ###
    # initialize num correct to zero
    num_correct = 0
    # loop through the rows of the dataframe
    for i, row in data.iterrows():
        # get city1
        city1 = row['city1']
        # get country1
        country1 = row['country1']
        # get city2
        city2 = row['city2']
        # get country2
        country2 = row['country2']
        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1, country1, city2, word_embeddings)
        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1
    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)
    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct / m
    ### END CODE HERE ###
    return accuracy
Test:
accuracy = get_accuracy(word_embeddings, data)
print(f"Accuracy is {accuracy:.2f}")
Output:
Accuracy is 0.92
Plotting the vectors using PCA
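Next, use principal component analysis (PCA) to explore the distances between word vectors after their dimensionality has been reduced. The word vectors have 300 dimensions, which makes them hard to visualize, so we use PCA to project them into a lower-dimensional space while preserving as much of the original information as possible. After projection, similar words should cluster together. For example, 'sad', 'happy', and 'joyful' all describe emotions and should appear close to each other in the plot. 'oil', 'gas', and 'petroleum' all describe natural resources. 'city', 'village', and 'town' can be seen as near-synonyms that describe similar things.
The rough steps are:
- Mean-normalize (mean-center) the data.
- Compute the covariance matrix of the data ($\Sigma$).
- Compute the eigenvectors and eigenvalues of the covariance matrix.
- Multiply the mean-centered data by the first K eigenvectors.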
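The steps above can be written compactly as follows. The notation is introduced here for illustration and is not from the original post: $X$ is the $m \times n$ data matrix with one word vector per row, $\bar{x}$ is the column-wise mean, and the $1/(m-1)$ factor is assumed to match the default normalization of `np.cov`.

```latex
\begin{aligned}
\tilde{X} &= X - \mathbf{1}\,\bar{x}^{\top} \\
\Sigma &= \frac{1}{m-1}\,\tilde{X}^{\top}\tilde{X} \\
\Sigma\, u_k &= \lambda_k u_k, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \\
X_{\text{reduced}} &= \tilde{X}\,\left[\,u_1, \dots, u_K\,\right]
\end{aligned}
```

Each row of $X_{\text{reduced}}$ holds the coordinates of one word along the $K$ directions of largest variance.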
# UNQ_C5 GRADED FUNCTION: compute_pca
def compute_pca(X, n_components=2):
    """
    Input:
        X: of dimension (m,n) where each row corresponds to a word vector
        n_components: Number of components you want to keep.
    Output:
        X_reduced: data transformed into n_components dims/columns
    pass in: data as 2D NumPy array
    """
    ### START CODE HERE ###
    # mean center the data
    X_demeaned = X - np.mean(X, axis=0)
    # calculate the covariance matrix
    covariance_matrix = np.cov(X_demeaned, rowvar=False)
    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix, UPLO='L')
    # sort eigenvalue in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)
    # reverse the order so that it's from highest to lowest.
    idx_sorted_decreasing = idx_sorted[::-1]
    # sort the eigen values by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]
    # sort eigenvectors using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:, idx_sorted_decreasing]
    # select the first n eigenvectors (n is desired dimension
    # of rescaled data array, or dims_rescaled_data)
    eigen_vecs_subset = eigen_vecs_sorted[:, 0:n_components]
    # transform the data by multiplying the transpose of the eigenvectors with the transpose of the de-meaned data
    # Then take the transpose of that product.
    X_reduced = np.dot(eigen_vecs_subset.transpose(), X_demeaned.transpose()).transpose()
    ### END CODE HERE ###
    return X_reduced
Test:
# Testing your function
np.random.seed(1)
X = np.random.rand(3, 10)
X_reduced = compute_pca(X, n_components=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(X_reduced)
Output:
Your original matrix was (3, 10) and it became:
[[ 0.43437323 0.49820384]
[ 0.42077249 -0.50351448]
[-0.85514571 0.00531064]]
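As a cross-check, the same projection can be obtained from a singular value decomposition of the mean-centered data instead of an eigendecomposition of the covariance matrix. The sketch below is an alternative route, not part of the assignment; `pca_svd` is a hypothetical name. Eigenvectors are only defined up to sign, so individual columns may come out negated relative to the eigendecomposition result.

```python
import numpy as np

def pca_svd(X, n_components=2):
    # Mean-center each column (feature)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Project the centered data onto the leading directions
    return Xc @ Vt[:n_components].T

np.random.seed(1)
X = np.random.rand(3, 10)
Z = pca_svd(X, n_components=2)
print(Z.shape)
```

Comparing `np.abs(Z)` with the absolute values of the `compute_pca` output is a quick way to confirm that both routes agree.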
Next, pick 11 words and look at their representation after PCA:
words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
'village', 'country', 'continent', 'petroleum', 'joyful']
# given a list of words and the embeddings, it returns a matrix with all the embeddings
X = get_vectors(word_embeddings, words)
print('You have 11 words each of 300 dimensions thus X.shape is:', X.shape)
Output:
You have 11 words each of 300 dimensions thus X.shape is: (11, 300)
Reduce the dimensionality and plot:
# We have done the plotting for you. Just run this cell.
result = compute_pca(X, 2)
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0] - 0.05, result[i, 1] + 0.1))
plt.show()
Output: a scatter plot of the 11 words in the two-dimensional PCA space.