The question: How to handle OpenAI's maximum context length of 2049 tokens?
Background:
I'd like to send the text from various PDFs to OpenAI's API. Specifically the Summarize for a 2nd grader or the TL;DR summarization APIs.
I can extract the text from PDFs using PyMuPDF and prepare the OpenAI prompt.
Question: How best to prepare the prompt when the token count is longer than the allowed 2049?
"问题:当token数量超过允许的2049时,如何最好地准备提示?"
- Do I just truncate the text then send multiple requests?
"我是否只需截断文本然后发送多个请求?"
- Or is there a way to sample the text to "compress" it without losing key points?
Answer:
I faced the same problem. Here is the strategy I used to send text that is much, much longer than OpenAI's GPT-3 token limit.
Depending on the model (Davinci, Curie, etc.) used, requests can use up to 4097 tokens shared between prompt and completion.
"根据使用的模型(如Davinci、Curie等),请求可以使用最多4097个tokens,这些tokens在提示和完成之间共享。"
- Prompt being the input you send to OpenAI, i.e. your "command", e.g. "Summarize the following text" plus the text itself
"提示是你发送给OpenAI的输入,即你的‘命令’,例如‘总结以下文本’以及文本本身。"
- Completion being the response, i.e. the entire summary of your text
"完成部分是指响应,即你的文本的完整摘要。"
If your prompt is 4000 tokens, your completion can be 97 tokens at most. For more information on OpenAI tokens and how to count them, see the OpenAI Cookbook reference below.
To ensure that we don’t exceed the maximum length limit for prompt plus completion, we need to ensure that prompt (i.e. your text) and completion (i.e. the summary) put together always fits into the 4097 token boundary.
For that reason, we split the entire text into multiple text chunks, summarize each chunk independently, and finally merge all summarized chunks using a simple `" ".join()` function.
Maximum Number of Words - Token-to-Word Conversion
OpenAI has a fixed limit on the number of tokens. However, a token is not the same as a word. Hence, we first need to calculate the maximum number of words we can send to OpenAI. The documentation's rule of thumb is that one token corresponds to roughly 0.75 words of English text, i.e. about 1.33 tokens per word.
Given the token-to-word ratio, we can send approximately 2900 words to OpenAI's GPT-3, assuming a 5-sentence summary per text chunk.
- Max tokens per request: 4000 tokens (leaving 97 tokens as a safety buffer) = 3000 words
- Max prompt tokens: “Summarize the following text in five sentences” has 7 words = 10 tokens
- Max tokens of returned summary (5 sentences): 20 words per sentence. 5 * 20 = 100 words = 133 tokens
- Max tokens of text chunk: 4000 - 10 - 133 = 3857 tokens = 2900 words
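Putting these numbers together, here is a quick sanity check of the arithmetic (a minimal sketch; the 0.75 words-per-token ratio is the rough rule of thumb used above, not an exact conversion):

```python
# Rough token budget for one request, following the numbers above.
WORDS_PER_TOKEN = 0.75        # rule of thumb: 1 word ≈ 1.33 tokens

max_request_tokens = 4000     # 4097 minus a small safety buffer
instruction_tokens = 10       # "Summarize the following text in five sentences"
summary_tokens = 133          # 5 sentences * 20 words * 1.33 tokens per word

chunk_tokens = max_request_tokens - instruction_tokens - summary_tokens
chunk_words = int(chunk_tokens * WORDS_PER_TOKEN)

print(chunk_tokens, chunk_words)  # 3857 tokens, about 2892 words (rounded to ~2900 above)
```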
Text Chunking
We can choose from a plethora of strategies to split up the entire text into smaller chunks.
The simplest approach is creating a single list of all words by splitting the entire text on whitespaces, and then creating buckets of words with words evenly distributed across all buckets. The downside is that we are likely to split a sentence half-way through and lose the meaning of the sentence because GPT ends up summarizing the first half of the sentence independently from the second half — ignoring any relations between the two chunks.
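For illustration, here is a minimal sketch of that naive word-bucket approach (the helper name `chunk_by_words` and the 2700-word default are illustrative, not part of the final solution below):

```python
import math

def chunk_by_words(text, max_words=2700):
    # Split the text on whitespace and distribute the words evenly across
    # buckets of at most max_words words each. Sentences may be cut in half
    # at bucket boundaries, which is the downside described above.
    words = text.split()
    n_buckets = max(1, math.ceil(len(words) / max_words))
    bucket_size = max(1, math.ceil(len(words) / n_buckets))
    return [" ".join(words[i:i + bucket_size]) for i in range(0, len(words), bucket_size)]
```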
Other options include tokenizers such as SentencePiece and spaCy's sentence splitter. Choosing the latter generates the most stable results.
Implementation of Text Chunking with spaCy
The following example splits the text “My first birthday was great. My 2. was even better.” into a list of two sentences.
First, download spaCy's English model:

```bash
python -m spacy download en_core_web_sm
```

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "My first birthday was great. My 2. was even better."

for sentence in nlp(text).sents:
    print(sentence.text)
```
Output
```
My first birthday was great.
My 2. was even better.
```
spaCy correctly detected the second sentence instead of splitting it after the “2.”.
Now, let's write a `text_to_chunks` helper function to generate chunks of sentences, where each chunk holds at most 2700 words. 2900 words was the initially calculated word limit, but we want to leave enough of a buffer for words that are longer than 1.33 tokens.
```python
def text_to_chunks(text):
    chunks = [[]]
    chunk_total_words = 0

    sentences = nlp(text)

    for sentence in sentences.sents:
        chunk_total_words += len(sentence.text.split(" "))

        # Start a new chunk once the current one would exceed 2700 words.
        if chunk_total_words > 2700:
            chunks.append([])
            chunk_total_words = len(sentence.text.split(" "))

        chunks[len(chunks) - 1].append(sentence.text)

    return chunks
```
An alternative approach to determine the number of tokens of a text was recently introduced by OpenAI. The approach uses `tiktoken` and is tailored towards OpenAI's models.
```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
number_of_tokens = len(encoding.encode("tiktoken is great!"))
print(number_of_tokens)
```
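Since the word-based limit is only an approximation, the chunks could also be sized by their exact token count. Below is a sketch that combines the spaCy sentence splitter with `tiktoken`; the helper name `text_to_token_chunks` and the 3857-token budget come from the earlier calculation, and this is an alternative to the word-based `text_to_chunks` above, not the answer's original code:

```python
def text_to_token_chunks(text, max_tokens=3857):
    # Group whole sentences into chunks whose exact token count, as measured
    # by the tiktoken encoding created above, stays within max_tokens.
    chunks = [[]]
    chunk_tokens = 0

    for sentence in nlp(text).sents:
        n = len(encoding.encode(sentence.text))

        # Start a new chunk if adding this sentence would exceed the budget
        # (unless the current chunk is still empty).
        if chunk_tokens + n > max_tokens and chunks[-1]:
            chunks.append([])
            chunk_tokens = 0

        chunks[-1].append(sentence.text)
        chunk_tokens += n

    return chunks
```

Note that the encoding above was created for gpt-3.5-turbo; when summarizing with text-davinci-003, passing that model name to `tiktoken.encoding_for_model` should give the matching encoding.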
Next, we wrap the text summarization logic into a `summarize_text` function.
```python
import openai

def summarize_text(text):
    prompt = f"Summarize the following text in 5 sentences:\n{text}"

    # Legacy Completions API (openai < 1.0) with text-davinci-003.
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        temperature=0.3,
        max_tokens=150,  # = 112 words
        top_p=1,
        frequency_penalty=0,
        presence_penalty=1
    )

    return response["choices"][0]["text"]
```
Our final piece of code looks like this:
```python
chunks = text_to_chunks(one_large_text)

chunk_summaries = []

for chunk in chunks:
    chunk_summary = summarize_text(" ".join(chunk))
    chunk_summaries.append(chunk_summary)

summary = " ".join(chunk_summaries)
```
References
- How to count tokens with tiktoken, OpenAI Cookbook