前言
关于NL2SQL的技术,如果大家最近有关注AI圈的话,或多或少都有所了解。其实很多业务场景下,于用户而言更多的是想要获取到最终数据的呈现效果,关于数据是如何获取得到的学习成本,是尽可能越少越好。众所周知当学习成本越低,那么产品的获客率也越高,当然对于我们技术人员来说,更多的还是研发思维。最终我们开发的服务主要还是为了业务服务,NL2SQL必然是以后数据开发的趋势所在,因此我们数据开发人员来说,暂且不谈掌握这门技术,清楚理念还是十分必要的。
NL2SQL技术概念
NL2SQL(Natural Language to SQL)是自然语言处理和数据库查询相结合的一项技术,旨在将用户以自然语言输入的查询转换为SQL查询语句,从而实现自然语言问答与数据库之间的自动交互。就按照企业日常报表业务,按照研发思路,我们首先可以通过UI或者页面前端获取到客户的文本信息,传输到后端进行落库,然后工单展示或者是直接进行数据库查询,由数据开发人员编写SQL语言,最终查询得到结果再进行数据可视化展示。
那么按照NL2SQL的理念来说,整个数据查询流程可以被大大简化。首先用户不需要再通过复杂的页面选择或表单提交,甚至不需要对数据库结构和SQL语言有任何了解。用户可以直接在前端输入自然语言的问题,例如“查询上个月的销售报表”或"显示2023年所有产品的库存情况",系统会自动将这些自然语言问题传递给后端。
在后端,NL2SQL技术会解析用户输入的自然语言,识别其中的意图、关键词,并结合具体的数据库模式,自动生成相应的SQL查询语句。这个过程涉及自然语言处理(NLP)、语义解析、数据库模式映射等多项核心技术。通过这些技术,系统能够理解用户的需求,并生成准确的SQL语句,从而直接查询数据库,获取数据。
查询结果同样可以直接返回给前端进行展示,或通过数据可视化工具进行图表化呈现。这种技术避免了传统方法中数据开发人员手动编写SQL查询的繁琐步骤,提升了响应速度。更重要的是,用户不再需要依赖技术人员进行数据查询,而是可以实时获得自己想要的结果。这不仅降低了学习成本,还能显著的提高客户的满意度和体验感。
可能光说概念大家还是有所不解,我们以一个实际场景来看:假设一位销售经理需要在月度会议前准备一份关于公司销售情况的报告,通常,整个过程可能会涉及多次向数据分析师或IT团队请求帮助,以编写SQL查询,导出数据并生成报告。那么传统方法一般为:
传统方法
- 自然语言需求:销售经理提出问题:“我想知道上个月每个产品的销售额是多少,并按销售额排序。”
- 沟通与转化:销售经理将这一需求发给数据分析师或IT人员。
- SQL查询编写:数据分析师根据需求,手动编写SQL查询:
SELECT product_name, SUM(sales_amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2023-07-01' AND '2023-07-31'
GROUP BY product_name
ORDER BY total_sales DESC;
-
数据提取与导出:查询结果被导出并整理为Excel或其他格式,交给销售经理。
-
报告制作:销售经理将导出的数据整合进报告中,进一步处理可视化图表。
这种传统方法需要多个步骤,涉及不同角色的协作,尤其是在多个查询需求下,可能会导致反复沟通和修改SQL查询的过程。
NL2SQL技术
- 自然语言查询:销售经理直接在系统中输入问题:“显示上个月每个产品的销售额,并按销售额排序。”
- NL2SQL技术处理:系统通过自然语言处理,将该需求解析为SQL查询,自动生成如下SQL语句:
SELECT product_name, SUM(sales_amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2023-07-01' AND '2023-07-31'
GROUP BY product_name
ORDER BY total_sales DESC;
-
自动执行查询并显示结果:系统执行查询,并将结果以表格或图表形式直接呈现给销售经理。
-
即时反馈与可视化:销售经理立即获取数据,并可以根据需要进一步调整查询,如“按产品类别分类显示销售额”或“查看去年同期的销售数据”。
效果对比就十分明显了,使用NL2SQL后查询过程自动化,大幅缩短了数据获取的时间。销售经理无需学习SQL语言,仅凭自然语言就可以完成复杂查询,降低了技术门槛。而且可以基于效果实时验证,查询可以随时进行,不再依赖技术团队的实时对接,查询结果直接以可视化形式呈现,用户能够更直观理解数据,从而快速做出业务决策。
NL2SQL技术支持
NL2SQL工作流程
用户输入自然语言问题,例如“查找2020年销售额最高的产品。系统首先需要理解用户输入的查询意图。自然语言本质上是模棱两可的,使其容易受到多种解释的影响。解决这种歧义是自然语言处理中的一项关键任务,以确保人与机器之间的准确通信。在实践中,情况可能并非总是如此。最终用户可能不知道(全部或部分)列或表的语义。因此最大的问题在于如何让AI能够精确定位到数据的具体表字段信息。所以说做NL2SQL不可避免的需要业务强绑定,必然要对所在领域的数据集有清楚的认知,才能训练好语义大模型。
目前比较火的英文数据集有WikiSQL、Spider、WikiTableQuestions、ATIS等,中文数据集有刚刚结束的中文首届NL2SQL挑战赛公开的数据,各个数据集都有各自的特点,这里不详细开展论述,感兴趣的可以自行探索。
根据数据集中SQL涉及到的数据库表的个数不同,分为单标和多表;根据所生成的SQL结构中是否包含嵌套查询,将数据集分为有嵌套和无嵌套。有个十分有意思的比赛
大家感兴趣的可以去看看这个比赛的获奖作品和数据集,会对NL2SQL工作和研究有较为清晰的了解。
我们以一句实际语言来看整个SQL生成逻辑。
比如自然语言输入:
“查找2023年所有销售额超过1000万的产品,并按销售额降序排列。”
目标SQL查询:
SELECT product_name, sales_amount
FROM sales
WHERE year = 2023 AND sales_amount > 10000000
ORDER BY sales_amount DESC;
1. 文本预处理
首先,系统对输入的自然语言进行预处理,准备后续的语义解析步骤。
- 分词(Tokenization):将句子拆分为单独的词或词组。
- 结果:[“查找”, “2023年”, “所有”, “销售额”, “超过”, “1000万”, “的”, “产品”, “并”, “按”, “销售额”, “降序”, “排列”]
- 算法:基于规则或深度学习模型的分词工具,如SpaCy、NLTK。
- 去除停用词:去掉“的”、“并”等不影响语义的词汇。
- 结果:[“查找”, “2023年”, “销售额”, “超过”, “1000万”, “产品”, “按”, “销售额”, “降序”, “排列”]
- 算法:通过停用词词典或手工定义的停用词表进行过滤。
2. 命名实体识别(NER, Named Entity Recognition)
系统需要识别出句子中的关键实体,如时间、数字和产品等。
- 识别内容:
- 时间实体:2023年
- 数值实体:1000万
- 产品实体:产品
- 操作实体:销售额,降序排列
- 算法:NER模型,如基于BiLSTM-CRF或预训练模型(如BERT)的NER工具。
- 输出:
- 时间:2023年
- 数值:10000000
- 操作:查找、超过、降序排列
- 实体:产品、销售额
3. 意图识别
系统识别用户的查询意图,即“查询”操作。这里的意图识别是关键步骤,它决定了SQL语句的类型(SELECT查询)。
-
识别出查询意图
:查询数据,并按条件过滤。
- 算法:基于分类模型(如SVM、LSTM)或预训练模型(如BERT)进行意图分类。
-
结果:查询操作,按条件筛选并排序。
4. 句法解析(Syntax Parsing)
通过句法解析分析自然语言输入的结构,确定主语、谓语、宾语等元素之间的关系。这里的目标是识别出“查找”的对象是“产品”,“销售额超过1000万”是一个过滤条件。
- 解析句法结构:
- 查找(谓语) -> 产品(宾语)
- 销售额超过1000万(条件) -> 降序排列(排序方式)
- 算法:依存句法分析(Dependency Parsing),使用工具如Stanford NLP或SpaCy。
- 结果:
- 主谓宾关系:查找产品
- 条件关系:销售额 > 1000万
- 排序:按销售额降序排列
5. 数据库模式映射
系统需要将自然语言中的实体(如“产品”、“销售额”)映射到数据库中的具体表和字段。这个步骤要求系统理解数据库的模式(Schema)。
- 映射自然语言到数据库字段:
- “产品” ->
product_name
- “销售额” ->
sales_amount
- “2023年” ->
year
- 算法:基于规则的模式映射或通过训练模型学习自然语言与数据库字段之间的对应关系。
- “产品” ->
- 结果:
product_name
sales_amount
year
6. SQL模板生成
系统将识别出的信息填充到SQL模板中。这里,模板生成的任务是将自然语言解析出的结构信息转换为SQL语句的各个子句(SELECT、FROM、WHERE、ORDER BY)。
SQL模板:
SELECT {字段1, 字段2, ...}
FROM {表}
WHERE {条件1} AND {条件2} ...
ORDER BY {排序字段} {排序方式};
填充模板:
- SELECT:
product_name, sales_amount
- FROM:
sales
- WHERE:
year = 2023 AND sales_amount > 10000000
- ORDER BY:
sales_amount DESC
- 算法:基于模板匹配或生成模型(如Seq2Seq、Transformer)生成SQL语句。
结果:
SELECT product_name, sales_amount
FROM sales
WHERE year = 2023 AND sales_amount > 10000000
ORDER BY sales_amount DESC;
这里的prompt可以提供一个现在行业普遍认可的:
system = """
Given the database schema below, generate a MySQL query based on the user's question. Ensure to consider totals from line items, inclusive date ranges, and correct data aggregation for summarization. Remember to handle joins, groupings, and orderings effectively.
Database schema:
- Customer (CustomerID, FirstName, LastName, Email, Phone, BillingAddress, ShippingAddress, CustomerSince, IsActive)
- Employee (EmployeeID, FirstName, LastName, Email, Phone, HireDate, Position, Salary)
- InventoryLog (LogID, ProductID, ChangeDate, QuantityChange, Notes)
- LineItem (LineItemID, SalesOrderID, ProductID, Quantity, UnitPrice, Discount, TotalPrice)
- Product (ProductID, ProductName, Description, UnitPrice, StockQuantity, ReorderLevel, Discontinued)
- SalesOrder (SalesOrderID, CustomerID, OrderDate, RequiredDate, ShippedDate, Status, Comments, PaymentMethod, IsPaid)
- Supplier (SupplierID, CompanyName, ContactName, ContactTitle, Address, Phone, Email)
Guidelines for SQL query generation:
1. **Ensure Efficiency and Performance**: Opt for JOINs over subqueries where possible, use indexes effectively, and mention any specific performance considerations to keep in mind.
2. **Adapt to Specific Analytical Needs**: Tailor WHERE clauses, JOIN operations, and aggregate functions to precisely meet the analytical question being asked.
3. **Complexity and Variations**: Include a range from simple to complex queries, illustrating different SQL functionalities such as aggregate functions, string manipulation, and conditional logic.
4. **Handling Specific Cases**: Provide clear instructions on managing NULL values, ensuring date ranges are inclusive, and handling special data integrity issues or edge cases.
5. **Explanation and Rationale**: After each generated query, briefly explain why this query structure was chosen and how it addresses the analytical need, enhancing understanding and ensuring alignment with requirements.
-- 1. Average Order Total for Customers without a Registered Phone Number Within a Specific Period
SELECT AVG(TotalPrice) FROM LineItem
JOIN SalesOrder ON LineItem.SalesOrderID = SalesOrder.SalesOrderID
JOIN Customer ON SalesOrder.CustomerID = Customer.CustomerID
WHERE Customer.Phone IS NULL AND SalesOrder.OrderDate BETWEEN '2003-01-01' AND '2009-12-31';
-- Rationale: Analyzes spending behavior of uncontactable customers within a set timeframe, aiding targeted marketing strategies.
-- 2. List Top 10 Employees by Salary
SELECT * FROM Employee ORDER BY Salary DESC LIMIT 10;
-- Rationale: Identifies highest-earning employees for payroll analysis and salary budgeting.
-- 3. Find the Total Quantity of Each Product Sold Last Month
SELECT Product.ProductName, SUM(LineItem.Quantity) AS TotalQuantitySold
FROM Product
JOIN LineItem ON Product.ProductID = LineItem.ProductID
JOIN SalesOrder ON LineItem.SalesOrderID = SalesOrder.SalesOrderID
WHERE SalesOrder.OrderDate BETWEEN DATE_SUB(NOW(), INTERVAL 1 MONTH) AND NOW()
GROUP BY Product.ProductID;
-- Rationale: Helps in inventory management by highlighting sales performance of products, informing restocking decisions.
-- 4. Show Sales by Customer for the Current Year, Including Customer Details
SELECT Customer.FirstName, Customer.LastName, SUM(LineItem.TotalPrice) AS TotalSales
FROM Customer
JOIN SalesOrder ON Customer.CustomerID = SalesOrder.CustomerID
JOIN LineItem ON SalesOrder.SalesOrderID = LineItem.SalesOrderID
WHERE YEAR(SalesOrder.OrderDate) = YEAR(CURDATE())
GROUP BY Customer.CustomerID;
-- Rationale: Identifies top customers based on yearly sales, supporting personalized customer service and loyalty programs.
-- 5. Identify Products That Need Reordering (Stock Quantity Below Reorder Level)
SELECT ProductName, StockQuantity, ReorderLevel FROM Product WHERE StockQuantity <= ReorderLevel;
-- Rationale: Essential for inventory control, prompting restocking of products to meet demand efficiently.
-- 6. Display All Suppliers That Have Not Supplied Any Products That Are Currently Discontinued
SELECT Supplier.CompanyName FROM Supplier
LEFT JOIN Product ON Supplier.SupplierID = Product.SupplierID
WHERE Product.Discontinued = 0
GROUP BY Supplier.SupplierID;
-- Rationale: Evaluates supplier contributions to the supply chain by focusing on those with active product lines.
Remember to adapt queries based on the actual question context, utilizing the appropriate WHERE clauses, JOIN operations, and aggregate functions to meet the specific analytical needs.
Sample records for the Supplier table:
- SupplierID: 29, CompanyName: Hogan-Anderson, ContactName: Sierra Carey, ContactTitle: Mining engineer, Address: 246 Johnny Fords Apt. 858, Williamsport, AK 96920, Phone: 232.945.6443, Email: rodney04@example.com
- SupplierID: 30, CompanyName: Nixon, Woods and Pearson, ContactName: Lawrence Phillips, ContactTitle: Aid worker, Address: USS Osborn, FPO AE 24294, Phone: 001-462-571-0185x478, Email: jessica29@example.org
Sample records for the Product table:
- ProductID: 1, ProductName: Reflect Sea, Description: Factor country center price pretty foreign theory paper fact machine two., UnitPrice: 191.19, StockQuantity: 665, ReorderLevel: 46, Discontinued: 1
- ProductID: 2, ProductName: Avoid American, Description: Skill environmental start set bring must job early per weight difficult someone., UnitPrice: 402.14, StockQuantity: 970, ReorderLevel: 15, Discontinued: 1
- ProductID: 3, ProductName: Evening By, Description: Whether high bill though each president another its., UnitPrice: 12.81, StockQuantity: 842, ReorderLevel: 32, Discontinued: 1
- ProductID: 4, ProductName: Certain Identify, Description: Spring identify bring debate wrong style hit., UnitPrice: 155.22, StockQuantity: 600, ReorderLevel: 27, Discontinued: 1
- ProductID: 5, ProductName: Impact Agreement, Description: Whom ready entire meeting consumer safe pressure truth., UnitPrice: 368.72, StockQuantity: 155, ReorderLevel: 35, Discontinued: 0
- ProductID: 6, ProductName: Million Agreement, Description: Glass why team yes reduce issue nothing., UnitPrice: 297.03, StockQuantity: 988, ReorderLevel: 36, Discontinued: 1
- ProductID: 7, ProductName: Foot Vote, Description: Anyone floor movie maintain TV new age prove certain really dog., UnitPrice: 28.75, StockQuantity: 828, ReorderLevel: 24, Discontinued: 0
- ProductID: 8, ProductName: Somebody Current, Description: Politics since exactly film idea Republican., UnitPrice: 202.9, StockQuantity: 317, ReorderLevel: 18, Discontinued: 0
- ProductID: 9, ProductName: Somebody Character, Description: Long agreement history administration purpose conference including., UnitPrice: 300.38, StockQuantity: 242, ReorderLevel: 30, Discontinued: 1
- ProductID: 10, ProductName: Low Idea, Description: Spend guess somebody spend fight director technology find between college skill., UnitPrice: 34.68, StockQuantity: 65, ReorderLevel: 27, Discontinued: 0
Use the above schema and sample records to generate syntactically correct SQL queries. For example, to query the list of discontinued products, or to find products below a certain stock quantity.
Sample records for the Employee table:
- EmployeeID: 1, FirstName: Danny, LastName: Morales, Email: catherine08@example.com, Phone: 001-240-574-6687x625, HireDate: 2021-06-16, Position: Medical technical officer, Salary: 36293
- EmployeeID: 2, FirstName: William, LastName: Spencer, Email: sthompson@example.com, Phone: (845)940-2095x693, HireDate: 2023-08-22, Position: English as a foreign language teacher, Salary: 51775
- EmployeeID: 3, FirstName: Brian, LastName: Stark, Email: hughesmelissa@example.com, Phone: 780.299.1965x06374, HireDate: 2023-02-24, Position: Pharmacologist, Salary: 11963
- EmployeeID: 4, FirstName: Sarah, LastName: Cannon, Email: brittney20@example.com, Phone: 512.717.8995x05793, HireDate: 2019-05-23, Position: Physiological scientist, Salary: 69878
- EmployeeID: 5, FirstName: Lance, LastName: Bell, Email: patrick57@example.net, Phone: +1-397-320-2600x803, HireDate: 2019-06-22, Position: Scientific laboratory technician, Salary: 56499
- EmployeeID: 6, FirstName: Jason, LastName: Larsen, Email: teresaharris@example.org, Phone: +1-541-955-5657x7357, HireDate: 2022-11-02, Position: Proofreader, Salary: 89756
- EmployeeID: 7, FirstName: Kyle, LastName: Baker, Email: nathanielmiller@example.net, Phone: +1-863-658-3715x6525, HireDate: 2019-10-30, Position: Firefighter, Salary: 96795
- EmployeeID: 8, FirstName: Jennifer, LastName: Hernandez, Email: sarah43@example.org, Phone: 267-588-3195, HireDate: 2021-01-10, Position: Designer, interior/spatial, Salary: 37584
- EmployeeID: 9, FirstName: Shane, LastName: Meyer, Email: perrystanley@example.org, Phone: 001-686-918-6486, HireDate: 2021-04-14, Position: Retail manager, Salary: 69688
- EmployeeID: 10, FirstName: Christine, LastName: Powell, Email: tanderson@example.org, Phone: 427.468.2131, HireDate: 2019-05-11, Position: Sports administrator, Salary: 39962
Use the above schema and sample records to generate syntactically correct SQL queries. For example, to query the top 10 employees by salary, or to find employees hired within a specific period.
Sample records for the Customer table:
- CustomerID: 1, FirstName: Sandra, LastName: Cruz, Email: rhonda24@example.net, Phone: 511-949-6987x21174, BillingAddress: "18018 Kyle Streets Apt. 606, Shaneville, AZ 85788", ShippingAddress: "18018 Kyle Streets Apt. 606, Shaneville, AZ 85788", CustomerSince: 2023-05-02, IsActive: 0
- CustomerID: 2, FirstName: Robert, LastName: Williams, Email: traciewall@example.net, Phone: 944-649-2491x60774, BillingAddress: "926 Mitchell Pass Apt. 342, Brianside, SC 83374", ShippingAddress: "926 Mitchell Pass Apt. 342, Brianside, SC 83374", CustomerSince: 2020-09-01, IsActive: 0
- CustomerID: 3, FirstName: John, LastName: Greene, Email: travis92@example.org, Phone: 279.334.1551, BillingAddress: "36019 Bill Manors Apt. 219, Dominiquefort, AK 55904", ShippingAddress: "36019 Bill Manors Apt. 219, Dominiquefort, AK 55904", CustomerSince: 2021-03-15, IsActive: 0
- CustomerID: 4, FirstName: Steven, LastName: Riley, Email: greennathaniel@example.org, Phone: +1-700-682-7696x189, BillingAddress: "76545 Hebert Crossing Suite 235, Forbesbury, MH 14227", ShippingAddress: "76545 Hebert Crossing Suite 235, Forbesbury, MH 14227", CustomerSince: 2022-12-05, IsActive: 0
- CustomerID: 5, FirstName: Christina, LastName: Blake, Email: christopher87@example.net, Phone: 584.263.4429, BillingAddress: "8342 Shelly Fork, West Chasemouth, CT 81799", ShippingAddress: "8342 Shelly Fork, West Chasemouth, CT 81799", CustomerSince: 2019-11-12, IsActive: 0
- CustomerID: 6, FirstName: Michael, LastName: Stevenson, Email: lynnwilliams@example.org, Phone: 328-637-4320x7025, BillingAddress: "7503 Mallory Mountains Apt. 199, Meganport, MI 81064", ShippingAddress: "7503 Mallory Mountains Apt. 199, Meganport, MI 81064", CustomerSince: 2024-01-01, IsActive: 1
- CustomerID: 7, FirstName: Anna, LastName: Kramer, Email: steven23@example.org, Phone: +1-202-719-6886x844, BillingAddress: "295 Mcgee Fort, Manningberg, PR 93309", ShippingAddress: "295 Mcgee Fort, Manningberg, PR 93309", CustomerSince: 2022-03-06, IsActive: 1
- CustomerID: 8, FirstName: Michael, LastName: Sullivan, Email: bbailey@example.com, Phone: 988.368.5033, BillingAddress: "772 Bruce Motorway Suite 583, Powellbury, MH 42611", ShippingAddress: "772 Bruce Motorway Suite 583, Powellbury, MH 42611", CustomerSince: 2019-03-23, IsActive: 1
- CustomerID: 9, FirstName: Kevin, LastName: Moody, Email: yoderjennifer@example.org, Phone: 3425196543, BillingAddress: "371 Lee Lake, New Michaelport, CT 99382", ShippingAddress: "371 Lee Lake, New Michaelport, CT 99382", CustomerSince: 2023-12-03, IsActive: 1
- CustomerID: 10, FirstName: Jeremy, LastName: Mejia, Email: spencersteven@example.org, Phone: 449.324.7097, BillingAddress: "90137 Harris Garden, Matthewville, IA 39321", ShippingAddress: "90137 Harris Garden, Matthewville, IA 39321", CustomerSince: 2019-05-20, IsActive: 1
These sample records provide a clear representation of the data structure for customers within the database schema. Use these details to assist in generating queries that involve customer information, such as filtering active customers, summarizing sales by customer, or identifying long-term customers.
Sample records for the InventoryLog table:
- LogID: 1, ProductID: 301, ChangeDate: 2023-09-08, QuantityChange: 84, Notes: Inventory increased
- LogID: 2, ProductID: 524, ChangeDate: 2023-08-09, QuantityChange: -84, Notes: Inventory decreased
- LogID: 3, ProductID: 183, ChangeDate: 2023-04-17, QuantityChange: -51, Notes: Inventory decreased
- LogID: 4, ProductID: 390, ChangeDate: 2023-02-27, QuantityChange: 80, Notes: Inventory increased
- LogID: 5, ProductID: 737, ChangeDate: 2023-11-15, QuantityChange: 24, Notes: Inventory increased
- LogID: 6, ProductID: 848, ChangeDate: 2023-11-22, QuantityChange: 69, Notes: Inventory increased
- LogID: 7, ProductID: 534, ChangeDate: 2023-06-06, QuantityChange: -61, Notes: Inventory decreased
- LogID: 8, ProductID: 662, ChangeDate: 2024-01-16, QuantityChange: 70, Notes: Inventory increased
- LogID: 9, ProductID: 969, ChangeDate: 2024-01-07, QuantityChange: -25, Notes: Inventory decreased
- LogID: 10, ProductID: 640, ChangeDate: 2023-08-08, QuantityChange: -13, Notes: Inventory decreased
These sample records provide insights into the inventory adjustments for different products within the database schema. Utilize these details to assist in generating queries that track inventory changes, analyze stock levels, or evaluate inventory management efficiency.
Sample records for the LineItem table:
- LineItemID: 1, SalesOrderID: 280, ProductID: 290, Quantity: 3, UnitPrice: 84.59, Discount: NULL, TotalPrice: 253.77
- LineItemID: 2, SalesOrderID: 94, ProductID: 249, Quantity: 6, UnitPrice: 88.7, Discount: NULL, TotalPrice: 532.2
- LineItemID: 3, SalesOrderID: 965, ProductID: 247, Quantity: 1, UnitPrice: 43.44, Discount: NULL, TotalPrice: 43.44
- LineItemID: 4, SalesOrderID: 173, ProductID: 16, Quantity: 10, UnitPrice: 26.3, Discount: NULL, TotalPrice: 263
- LineItemID: 5, SalesOrderID: 596, ProductID: 191, Quantity: 9, UnitPrice: 59.44, Discount: NULL, TotalPrice: 534.96
- LineItemID: 6, SalesOrderID: 596, ProductID: 308, Quantity: 8, UnitPrice: 33.11, Discount: NULL, TotalPrice: 264.88
- LineItemID: 7, SalesOrderID: 960, ProductID: 758, Quantity: 5, UnitPrice: 64.47, Discount: NULL, TotalPrice: 322.35
- LineItemID: 8, SalesOrderID: 148, ProductID: 288, Quantity: 5, UnitPrice: 65.21, Discount: NULL, TotalPrice: 326.05
- LineItemID: 9, SalesOrderID: 974, ProductID: 706, Quantity: 3, UnitPrice: 59.86, Discount: NULL, TotalPrice: 179.58
- LineItemID: 10, SalesOrderID: 298, ProductID: 998, Quantity: 2, UnitPrice: 75.79, Discount: NULL, TotalPrice: 151.58
These sample records illustrate various line items associated with sales orders in the database. These details help in constructing queries to analyze sales performance, product popularity, pricing strategies, and overall sales revenue.
Sample records for the SalesOrder table:
- SalesOrderID: 1, CustomerID: 12, OrderDate: 2022-11-05, RequiredDate: 2022-12-02, ShippedDate: 2022-11-25, Status: Pending, Comments: NULL, PaymentMethod: NULL, IsPaid: 0
- SalesOrderID: 2, CustomerID: 56, OrderDate: 2022-02-22, RequiredDate: 2022-03-08, ShippedDate: 2022-03-17, Status: Completed, Comments: NULL, PaymentMethod: NULL, IsPaid: 1
- SalesOrderID: 3, CustomerID: 63, OrderDate: 2023-03-20, RequiredDate: 2023-03-27, ShippedDate: NULL, Status: Shipped, Comments: NULL, PaymentMethod: NULL, IsPaid: 0
- SalesOrderID: 4, CustomerID: 21, OrderDate: 2023-04-29, RequiredDate: 2023-05-26, ShippedDate: 2023-05-14, Status: Pending, Comments: NULL, PaymentMethod: NULL, IsPaid: 1
- SalesOrderID: 5, CustomerID: 16, OrderDate: 2022-11-05, RequiredDate: 2022-11-30, ShippedDate: NULL, Status: Shipped, Comments: NULL, PaymentMethod: NULL, IsPaid: 1
- SalesOrderID: 6, CustomerID: 46, OrderDate: 2023-10-06, RequiredDate: 2023-10-27, ShippedDate: NULL, Status: Shipped, Comments: NULL, PaymentMethod: NULL, IsPaid: 1
- SalesOrderID: 7, CustomerID: 47, OrderDate: 2023-02-08, RequiredDate: 2023-02-25, ShippedDate: 2023-03-03, Status: Shipped, Comments: NULL, PaymentMethod: NULL, IsPaid: 1
- SalesOrderID: 8, CustomerID: 70, OrderDate: 2022-07-29, RequiredDate: 2022-08-18, ShippedDate: 2022-08-10, Status: Pending, Comments: NULL, PaymentMethod: NULL, IsPaid: 0
- SalesOrderID: 9, CustomerID: 14, OrderDate: 2022-03-29, RequiredDate: 2022-04-15, ShippedDate: 2022-04-17, Status: Completed, Comments: NULL, PaymentMethod: NULL, IsPaid: 0
- SalesOrderID: 10, CustomerID: 31, OrderDate: 2024-01-12, RequiredDate: 2024-01-31, ShippedDate: 2024-02-07, Status: Pending, Comments: NULL, PaymentMethod: NULL, IsPaid: 0
These sample records provide insights into sales order management within the database, including order status, shipping details, payment methods, and customer IDs. This information is crucial for analyzing sales processes, order fulfillment rates, customer engagement, and payment transactions.
"""
7. SQL查询优化
在生成SQL语句之后,系统可以对SQL查询进行优化,以提高查询效率。例如,添加索引、优化查询计划等。
- 优化步骤:
- 确定索引:在
year
和sales_amount
字段上检查是否有适当的索引。 - 重写查询:如果有需要,系统可以通过查询优化器重写查询以提高性能。
- 算法:查询优化算法,如基于成本的查询优化器(Cost-Based Optimizer, CBO)。
- 确定索引:在
- 结果:优化后的SQL语句(此例中原SQL语句已经较为简单,无需进一步优化)。
8. SQL执行与结果展示
优化后的SQL语句被发送到数据库执行,数据库返回结果集。系统会将查询结果转换为用户友好的形式进行展示,如表格、图表或其他可视化形式。
-
SQL执行:通过数据库连接,执行生成的SQL查询。
-
结果展示:将数据库返回的结果集以表格或图表的形式呈现给用户。
- 算法:基于数据库的查询引擎进行执行,结果展示使用可视化工具(如Matplotlib、Plotly)。
-
示例输出:
-
product_name sales_amount 产品A 15000000 产品B 12000000 产品C 11000000
-
评价该模型的两个指标:精确匹配率、执行正确率
- Execution Accuracy
- 定义:计算SQL执行结果正确的数量在数据集中的比例,结果存在高估的可能。
- Exact Match
- 定义:计算模型生成的SQL和标注SQL的匹配程度,结果存在低估的可能。
其中, N表示数据量,
S
Q
L
′
SQL'
SQL′和
S
Q
L
SQL
SQL分别代表预测的SQL语句和真实的SQL语句,
S
c
o
r
e
l
f
Score_{lf}
Scorelf表示
L
o
g
i
c
F
o
r
m
LogicForm
LogicForm准确率;
Y
′
Y'
Y′和
Y
Y
Y分别代表预测的SQL和真实的SQL的执行结果,
S
c
o
r
e
e
x
Score_{ex}
Scoreex表示
E
x
e
c
u
t
i
o
n
Execution
Execution准确率。
总结
NL2SQL的出现,彻底改变了人与数据交互的方式。它通过将复杂的SQL查询隐藏在自然语言输入背后,极大地降低了数据获取的门槛,让业务人员无需依赖技术背景就能直接获取所需的信息。随着自然语言处理技术的不断进步,NL2SQL的应用场景将愈加广泛,覆盖从企业报表到智能客服等各个领域。未来,随着模型的泛化能力增强和实时性能优化,我们可以期待NL2SQL技术在数据驱动的决策中扮演更加重要的角色,让“零成本与数据沟通”真正成为可能。