Embeddings and Distance Metrics in NLP

news2026/2/15 4:28:58

This article takes a close look at embeddings, vector databases, and various distance metrics, with examples and demo code.

Embeddings in NLP

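An embedding is a numerical representation of an object, word, or entity in a continuous vector space. In NLP, word embeddings capture the semantic relationships between words, which lets algorithms make better use of a text's context and meaning.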
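
To make "capturing semantic relationships" a little more concrete, here is a minimal sketch that compares vectors using cosine similarity. The three-dimensional vectors and the words attached to them are invented for illustration and are not the output of any real model; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (made-up values, not from a real model)
toy_vectors = {
    "meadow": [0.9, 0.8, 0.1],
    "mountain": [0.8, 0.9, 0.2],
    "pizza": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["meadow"], toy_vectors["mountain"])
sim_far = cosine_similarity(toy_vectors["meadow"], toy_vectors["pizza"])

print(f"meadow vs mountain: {sim_close:.3f}")
print(f"meadow vs pizza:    {sim_far:.3f}")
```

In this toy setup the two landscape words point in nearly the same direction, so their similarity is higher than that of the unrelated pair. That is the kind of structure a trained embedding model is expected to learn from data.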

Let's make this concrete with an example and some visualization. Suppose we have six sentences and want to create embeddings for them:

 from sentence_transformers import SentenceTransformer
 
 # Sample text embedding model
 model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
 
 #Sentences we want to encode. Example:
 sentence = ['The team enjoyed the hike through the meadow',
             'The team enjoyed the hike through the mountains',
             'The team has not enjoyed the hike through the meadows',
             'The national park had great views',
             'There were lot of rare animals in national park',
             'Olive oil drizzled over pizza tastes delicious']
             
 #Sentences are encoded by calling model.encode()
 embedding = model.encode(sentence)
 
 #Preview the embeddings
 
 print(embedding)
 
 # As we can see embeddings are nothing but 
 # numerical representation of sentences in a vector form
 
 [[ 0.37061948  0.26414198  0.21265635 ...  0.14994518 -0.25794953
   -0.23970771]
  [-0.07645706  0.27122658 -0.04530133 ... -0.27319074 -0.60025024
   -0.302555  ]
  [ 0.35693657 -0.2331443   0.418002   ... -0.37369558 -0.10241977
   -0.03282997]
  [ 0.66933334  0.40094966 -0.48208416 ...  0.10645878 -1.5067163
   -0.01547382]
  [ 0.4339616   0.2323563   0.21751338 ... -0.5746389  -0.26438454
    0.492655  ]
  [-0.2655593   0.11172348 -0.1473308  ...  0.42197517  0.88394576
    0.10763898]]

We can see that each of the six sentences is represented by a 384-dimensional embedding:

 embedding.shape
 (6, 384)

Next, we reduce the dimensionality with PCA and visualize the result with matplotlib:

 from sklearn.decomposition import PCA
 
 # Perform PCA for 2D visualization
 PCA_model = PCA(n_components = 2)
 PCA_model.fit(embedding)
 new_embeddings = PCA_model.transform(embedding)
 
 # The shape has changed from (6, 384) to (6, 2)
 print("Shape:", new_embeddings.shape)
 print(new_embeddings)
 
 Shape: (6, 2)
 [[-2.7454064  -1.628386  ]
  [-2.7024133  -2.0113547 ]
  [-2.6084075  -2.5289955 ]
  [ 0.62488586  3.9073005 ]
  [ 0.09110744  4.9031897 ]
  [ 7.3402357  -2.6417546 ]]

PCA reduces the embeddings to 2 dimensions so that they can be shown on a 2D scatter plot.
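
For readers curious about what the PCA step does, the following is a rough NumPy sketch of the same idea: center the data, take a singular value decomposition, and project onto the top two directions. The function name `pca_project` and the random stand-in data are assumptions made for this illustration. It is not scikit-learn's implementation, and the signs of individual components may differ from what `sklearn.decomposition.PCA` returns.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Random placeholder with the same shape as the (6, 384) embedding matrix
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(6, 384))

projected = pca_project(fake_embeddings, n_components=2)
print(projected.shape)  # (6, 2)
```

The first output column carries at least as much variance as the second, which is why the first principal component is conventionally plotted on the horizontal axis.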

 import matplotlib.pyplot as plt
 import mplcursors
 
 def plot2d(x_values, y_values, text_labels):
     """
     Create a 2D plot with annotations.
 
     Parameters:
     - x_values (array): X-axis values.
     - y_values (array): Y-axis values.
     - text_labels (list): List of text labels for each point.
     """
     fig, ax = plt.subplots()
     scatter = ax.scatter(x_values, y_values, label='Data Points')
 
     # Annotate points with text labels
     for i, label in enumerate(text_labels):
         ax.annotate(label, (x_values[i], y_values[i]))
 
     mplcursors.cursor(hover=True)
 
     ax.set_xlabel('X-axis')
     ax.set_ylabel('Y-axis')
     ax.set_title('2D Plot with Annotations')
 
     plt.show()

Calling this function produces the plot:

 import matplotlib.pyplot as plt
 import mplcursors
 
 # pass the embeddings and original sentence to create labels
 plot2d(new_embeddings[:,0], new_embeddings[:,1], sentence)