Topic: OpenAI Embeddings API: how do embeddings work?
Question background:
There are quite a few tutorials on embeddings in OpenAI, but I can't understand how they work.
Referring to https://platform.openai.com/docs/guides/embeddings/what-are-embeddings, an embedding is a vector (a list of numbers). A string is passed to an embedding model and the model returns a number (in the simplest terms). I can then use these numbers.
If I use a simple string to get its embeddings, I get a massive list:
```python
# get_embedding ships with the pre-1.0 OpenAI Python SDK
from openai.embeddings_utils import get_embedding

result = get_embedding("I live in space", engine="text-search-curie-doc-001")
result
```
When printed:
```
[5.4967957112239674e-05,
 -0.01301578339189291,
 -0.002223075833171606,
 0.013594076968729496,
 -0.027540158480405807,
 0.008867159485816956,
 0.009403547272086143,
 -0.010987567715346813,
 0.01919262297451496,
 0.022209804505109787,
 -0.01397960539907217,
 -0.012806257233023643,
 -0.027908924967050552,
 0.013074451126158237,
 0.024942029267549515,
 0.0200139675289392, ...]  <- truncated; the actual list is much, much longer
```
Question 1 - How is this massive list correlated with my 4-word text?
Question 2 - I create embeddings of the text I want to use in the query. Note that it is exactly the same as the text of the original content, "I live in space":
```python
queryembedding = get_embedding(
    'I live in space',
    engine="text-search-curie-query-001"
)
queryembedding
```
When I run cosine similarity, the value is 0.42056650555103214:
```python
# cosine_similarity also ships with the pre-1.0 OpenAI Python SDK;
# `result` is the document embedding computed above
from openai.embeddings_utils import cosine_similarity

similarity = cosine_similarity(result, queryembedding)
similarity
```
I get the value 0.42056650555103214. Shouldn't the value be 1 to indicate identical content?
Solution:
Q1: How is this massive list correlated with my 4-word text?
A1: Let's say you want to use the OpenAI `text-embedding-ada-002` model. No matter what your input is, you will always get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside it). You are probably familiar with 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space, which is very hard to imagine. Why are there exactly 1536 numbers inside the embedding vector? Because the `text-embedding-ada-002` model has an output dimension of 1536. It's pre-defined.
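To see this, here is a minimal sketch (assuming the v1.x `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the sample inputs are made up for illustration): inputs of very different lengths all come back as vectors of the same fixed length.

```python
# Minimal sketch, assuming the openai Python SDK v1.x and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

for text in ["Hi", "I live in space", "A much longer sentence about living in space."]:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    vector = response.data[0].embedding
    # The vector length is fixed by the model, not by the input length
    print(len(vector))  # 1536 for every input
```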
Q2: I create embeddings of the text I want to use in the query. Note that it is exactly the same as the text of the original content: "I live in space". When I run cosine similarity, the value is 0.42056650555103214. Shouldn't the value be 1 to indicate an identical value?
A2: Yes, the value should be 1 if you calculate cosine similarity between two identical texts embedded with the same model. Note that your code embeds the document with `text-search-curie-doc-001` but the query with `text-search-curie-query-001`; two different models generally return different vectors for the same text, which is why your similarity is not 1. See an example [here](https://platform.openai.com/docs/guides/embeddings/semantic-search).
For an example of semantic search based on embeddings, see [this answer](https://platform.openai.com/docs/guides/embeddings/semantic-search).
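As a sanity check, here is a minimal sketch (again assuming the v1.x `openai` SDK plus `numpy`; the `embed` helper and the explicit `cosine_similarity` function are illustrative, not from the original post) showing that the same text embedded twice with the same model yields a cosine similarity of approximately 1:

```python
# Minimal sketch, assuming the openai Python SDK v1.x and numpy; helper names are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = embed("I live in space")
b = embed("I live in space")
print(cosine_similarity(a, b))  # ~1.0: same text, same model
```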