ChatGPT SQL
ChatGPT SQL出自2023年3月的论文《A comprehensive evaluation of ChatGPT’s zero-shot Text-to-SQL capability》(github),这篇论文分析了ChatGPT的text2sql能力,实验结果表明ChatGPT的text2sql能力令人印象深刻,虽然没有达到当时的SOTA,但是无需训练性能也比SOTA低14%,且ChatGPT的鲁棒性与SOTA相比只有7.8%的差距。
使用zero-shot让ChatGPT生成SQL,prompt来自OpenAI展示的demo prompt,论文没有做特意的调整,如论文图1所示,图中上半部分是单轮场景下的text2sql,下半部分是多轮场景下的text2sql。
C3
text2sql方法C3出自2023年7月的论文《C3: Zero-shot Text-to-SQL with ChatGPT》(github), 它通过zero-shot的方式来prompt ChatGPT生成SQL。
C3的prompt包括3个组成部分:Clear Prompting (CP), Calibration with Hints (CH), and Consistent Output (CO)
Clear Prompting:包括两个部分clear layout 和clear context,如论文图2©所示。
- clear layout:用#将prompt里的指令、上下文(数据库schema)、问题分开。因为实验结果表明直接用论文图2(b)的指令可能会使生成的SQL有冗余列,所以在指令后添加了"and do not select extra columns that are not explicitly requested in the query"。
-
clear context:通过schema linking选择与问题相关的表和列。schema linking是通过zero-shot prompt ChatGPT来实现的,包括Table Recall 和 column Recall:
-
Table recall,zero-shop prompt ChatGPT分为3步来选择表。并用self-consistency来保证稳定性,即让ChatGPT生成10个结果集,每个结果中包括了top 4的表格,最后的结果为这10个结果集中出现最频繁的结果集。
""" Given the database schema and question, perform the following actions: 1 - Rank all the tables based on the possibility of being used in the SQL according to the question from the most relevant to the least relevant, Table or its column that matches more with the question words is highly relevant and must be placed ahead. 2 - Check whether you consider all the tables. 3 - Output a list object in the order of step 2, Your output should contain all the tables. The format should be like: [ "table_1", "table_2", ... ] Schema: # continents ( contid, continent ) # countries ( countryid, countryname, continent ) # car_makers ( id, maker, fullname, country ) # model_list ( moddeli, maker, model ) # car_names ( makeid, model, make ) # cars_data ( id, mpg, cylinders, edispl, horsepower, weight, accelerate, year ) Question: ### What is the name of the different car makers who produced a car in 1970? """
-
Column Recall,也通过zero-prompt来让ChatGPT分成两步来召回列。同样用self-consistency来保证稳定性,先让ChatGPT对每个表生成10个结果集,最后的结果为这10个结果集中出现最多频繁的5个列。
""" Given the database tables and question, perform the following actions: 1 - Rank the columns in each table based on the possibility of being used in the SQL, Column that matches more with the question words or the foreign key is highly relevant and must be placed ahead. You should output them in the order of the most relevant to the least relevant. Explain why you choose each column. 2 - Output a JSON object that contains all the columns in each table according to your explanation. The format should be like: { "table_1": ["column_1", "column_2", ......], "table_2": ["column_1", "column_2", ......], "table_3": ["column_1", "column_2", ......], ...... } Schema: # car_makers ( id, maker, fullname, country ) # model_list ( modelid, maker, model ) # car_names ( makeid, model, make ) # cars_data ( id, mpg, cylinders, edispl, horsepower, weight, accelerate, year ) Foreign keys: # model_list.maker = car_makers.id # car_names.model = model_list.model # cars_data.id = car_names.makeid Question: ### What is the name of the different car makers who produced a car in 1970? """
Calibration with Hints : 通过对ChatGPT生成的SQL进行分析,发现它容易因为bias出现如论文图3所示的错误,所以在prompt里添加了如论文图1右上部分所示的两个提示。
-
Consistent Output:使用execution-based Self-consistency。先让LLM采样输出多个SQL结果,然后将这些生成的SQL查询在数据库上执行并记录执行结果,去掉错误记录后,通过对执行结果采取投票机制来选择最后SQL。
github issue 里有一个问题是关于执行时间的,作者回复如下:
Time taken for recalling table: approximately 7s per sample.
Time taken for recalling column: approximately 25s per sample.
Time taken for generating SQL: approximately 2s per sample.
The time spent also depends on the internet status and the rate limits of APl calls
在用self-consistency时,如issue作者所回复,通过ChatGPT api里的参数n来一次生成多个结果,对于n个输入是共享同一个输入token的。