Topic: OpenAI Embeddings API: how do embeddings work?
Question background:
There are quite a few tutorials on embeddings in OpenAI, but I can't understand how they work.
Referring to https://platform.openai.com/docs/guides/embeddings/what-are-embeddings, an embedding is a vector (a list of numbers). A string is passed to an embedding model and the model returns a number (in the simplest terms). I can then use these numbers.
If I use a simple string to get its embeddings, I get a massive list:
```python
# get_embedding ships with the pre-1.0 OpenAI Python SDK
from openai.embeddings_utils import get_embedding

result = get_embedding("I live in space", engine="text-search-curie-doc-001")
result
```
When printed:
```
[5.4967957112239674e-05,
 -0.01301578339189291,
 -0.002223075833171606,
 0.013594076968729496,
 -0.027540158480405807,
 0.008867159485816956,
 0.009403547272086143,
 -0.010987567715346813,
 0.01919262297451496,
 0.022209804505109787,
 -0.01397960539907217,
 -0.012806257233023643,
 -0.027908924967050552,
 0.013074451126158237,
 0.024942029267549515,
 0.0200139675289392, ...]  <- truncated; the actual list is much, much longer
```
Question 1 - How is this massive list correlated with my 4-word text?
Question 2 - I create embeddings of the text I want to use in the query. Note that it is exactly the same as the text of the original content, "I live in space":
```python
queryembedding = get_embedding(
    'I live in space',
    engine="text-search-curie-query-001"
)
queryembedding
```
When I run cosine similarity, the value is 0.42056650555103214:
```python
# cosine_similarity also ships with the pre-1.0 OpenAI Python SDK;
# `result` is the document embedding computed above
from openai.embeddings_utils import cosine_similarity

similarity = cosine_similarity(result, queryembedding)
similarity
```
I get the value 0.42056650555103214. Shouldn't the value be 1 to indicate identical content?
Solution:
Q1: How is this massive list correlated with my 4-word text?
A1: Let's say you want to use the OpenAI `text-embedding-ada-002` model. No matter what your input is, you will always get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside it). You are probably familiar with 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space, which is very hard to imagine. Why are there exactly 1536 numbers inside the embedding vector? Because the `text-embedding-ada-002` model has an output dimension of 1536. It's pre-defined.
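To see this, here is a minimal sketch (assuming the v1.x `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the sample inputs are made up for illustration): inputs of very different lengths all come back as vectors of the same fixed length.

```python
# Minimal sketch, assuming the openai Python SDK v1.x and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

for text in ["Hi", "I live in space", "A much longer sentence about living in space."]:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    vector = response.data[0].embedding
    # The vector length is fixed by the model, not by the input length
    print(len(vector))  # 1536 for every input
```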
Q2: I create embeddings of the text I want to use in the query. Note that it is exactly the same as the text of the original content: "I live in space". When I run cosine similarity, the value is 0.42056650555103214. Shouldn't the value be 1 to indicate an identical value?
A2: Yes, the value should be 1 if you calculate cosine similarity between two identical texts embedded with the same model. Note that your code embeds the document with `text-search-curie-doc-001` but the query with `text-search-curie-query-001`; two different models generally return different vectors for the same text, which is why your similarity is not 1. See an example [here](https://platform.openai.com/docs/guides/embeddings/semantic-search).
For an example of semantic search based on embeddings, see [this answer](https://platform.openai.com/docs/guides/embeddings/semantic-search).
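As a sanity check, here is a minimal sketch (again assuming the v1.x `openai` SDK plus `numpy`; the `embed` helper and the explicit `cosine_similarity` function are illustrative, not from the original post) showing that the same text embedded twice with the same model yields a cosine similarity of approximately 1:

```python
# Minimal sketch, assuming the openai Python SDK v1.x and numpy; helper names are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = embed("I live in space")
b = embed("I live in space")
print(cosine_similarity(a, b))  # ~1.0: same text, same model
```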