Topic: Loading a jsonl file containing OpenAI API request results into a pandas DataFrame
Background:
I have a large data set containing around 500k observations. It has a string variable that I want to create an embedding for. I used the OpenAI API to create the embeddings and, because of the large number of observations, I used their script for parallel requests:
openai-cookbook/examples/api_request_parallel_processor.py at main · openai/openai-cookbook · GitHub
Everything worked fine, but I'm struggling to load the results into a pandas DataFrame. The jsonl file with the results has the following structure, with each row corresponding to one of the 500k observations:
[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 1"}, {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}]
[{"model": "text-embedding-ada-002", "input": "INPUT STRING NR 2"}, {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}]
Now I want to read these results into a pandas DataFrame with the following structure: one variable that contains the "INPUT STRING" and 1536 additional variables that contain the embedding.
I'm new to Python and JSON files. I usually work with CSV files and R.
I tried to use the read_json function from pandas, but that did not work:
import pandas as pd
openai_results = pd.read_json("results.jsonl", lines=True)
But this gives me a data set with only 2 variables. For example, for the first observation, the first variable contains {"model": "text-embedding-ada-002", "input": "INPUT STRING NR 1"} and the second variable contains {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [1,2,3,4...,1536]}], "model": "text-embedding-ada-002-v2", "usage": {"prompt_tokens": 2, "total_tokens": 2}}
Solution:
You can use something like this:
df = pd.read_json('your_file.json', lines=True)
df
'''
0 1
0 {'model': 'text-embedding-ada-002', 'input': '... {'object': 'list', 'data': [{'object': 'embedd...
1 {'model': 'text-embedding-ada-002', 'input': '... {'object': 'list', 'data': [{'object': 'embedd...
'''
Access values:
df["input"] = df[0].str["input"]
df["embedding"] = df[1].str["data"].str[0].str["embedding"] # or df["embedding"]=df[1].apply(lambda x: x["data"][0]["embedding"])
df = df[["input","embedding"]]
Out:
input embedding
0 INPUT STRING NR 1 [1, 2, 3, 4, 1536]
1 INPUT STRING NR 2 [1, 2, 3, 4, 1536]
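As an alternative to `read_json`, the file can also be parsed one line at a time with the standard `json` module. With 500k rows this may use less memory, because the request and response dictionaries are never stored in a DataFrame. The following is only a sketch: the helper name `load_embeddings` is hypothetical, and it assumes each line is a JSON array whose first two elements are the request and the response, as in the question. Lines whose second element has no `data` key (which might happen if a request failed) are skipped; that is an assumption about the file, not something shown in the question.

```python
import json

import pandas as pd


def load_embeddings(path):
    """Parse a results file line by line into an input/embedding DataFrame.

    Hypothetical helper: assumes every line is a JSON array of the form
    [request, response, ...] as shown in the question.
    """
    inputs, embeddings = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            request, response = json.loads(line)[:2]
            # Assumption: a usable response is a dict with a "data" list.
            if not isinstance(response, dict) or "data" not in response:
                continue
            inputs.append(request["input"])
            embeddings.append(response["data"][0]["embedding"])
    return pd.DataFrame({"input": inputs, "embedding": embeddings})
```

Calling `load_embeddings("results.jsonl")` should return the same two columns, `input` and `embedding`, as the `.str` approach above.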
If you want to expand the embedding column into one row per value, use explode():
df = df.explode("embedding")
df
'''
input embedding
0 INPUT STRING NR 1 1
0 INPUT STRING NR 1 2
0 INPUT STRING NR 1 3
0 INPUT STRING NR 1 4
0 INPUT STRING NR 1 1536
1 INPUT STRING NR 2 1
1 INPUT STRING NR 2 2
1 INPUT STRING NR 2 3
1 INPUT STRING NR 2 4
1 INPUT STRING NR 2 1536
'''
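Note that `explode()` produces a long table with one row per embedding value, whereas the question asks for one column per embedding dimension. A possible way to get that wide layout is sketched below. It has to be applied to a frame that still holds the list-valued `embedding` column, that is, before any `explode()` call. The helper name `embeddings_to_wide` and the `emb_` column prefix are made up for illustration, and the 3-element toy vectors stand in for the real 1536-element embeddings.

```python
import pandas as pd


def embeddings_to_wide(df, prefix="emb_"):
    """Turn a column of equal-length lists into one numeric column per element.

    Hypothetical helper: expects columns "input" and "embedding", where
    "embedding" holds Python lists of the same length in every row.
    """
    wide = pd.DataFrame(df["embedding"].tolist(), index=df.index)
    wide.columns = [f"{prefix}{i}" for i in range(wide.shape[1])]
    return pd.concat([df[["input"]], wide], axis=1)


# Toy data: 3-dimensional vectors instead of 1536-dimensional ones.
example = pd.DataFrame({
    "input": ["INPUT STRING NR 1", "INPUT STRING NR 2"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})
wide = embeddings_to_wide(example)
print(wide.shape)  # (2, 4): the input column plus three embedding columns
```

With real data the result would have 1 + 1536 columns, which matches the structure described in the question.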