Google's DataGemma AI is a statistics wizard

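Google is expanding its family of AI models while tackling some of the biggest problems in the field. Today, the company debuted DataGemma, a pair of open-source, instruction-tuned models that take a step toward mitigating hallucinations, the tendency of large language models (LLMs) to give inaccurate answers, on queries that revolve around statistical data.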

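The large language models (LLMs) powering today's AI innovations are becoming increasingly sophisticated. These models can comb through vast amounts of text and generate summaries, suggest new creative directions, and even draft code. Yet as impressive as these capabilities are, LLMs sometimes confidently present information that is inaccurate. Today, we're sharing promising research advances that tackle this challenge directly, helping reduce hallucination by anchoring LLMs in real-world statistical information. Alongside these research advances, we are excited to announce DataGemma, the first open models designed to connect LLMs with extensive real-world data drawn from Google's Data Commons.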

Data Commons: a vast repository of publicly available, trustworthy data

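Data Commons is a publicly available knowledge graph containing over 240 billion rich data points across hundreds of thousands of statistical variables. It sources this public information from trusted organizations such as the United Nations (UN), the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and census bureaus. Combining these datasets into one unified set of tools and AI models helps policymakers, researchers, and organizations seeking accurate insights.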

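Think of Data Commons as a vast, constantly expanding database of reliable public information on topics ranging from health and economics to demographics and the environment, which you can query in your own words through our AI-powered natural language interface. For example, you can explore which countries in Africa have had the greatest increase in electricity access, how income correlates with diabetes across US counties, or whatever else you are curious about in the data.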

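The two new models build on the existing Gemma family of open models and ground their answers in extensive real-world data from Data Commons, a public platform created by Google. The platform provides an open knowledge graph with over 240 billion data points sourced from trusted organizations across the economic, scientific, health, and other sectors.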

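The models use two distinct approaches to improve their factual accuracy when answering user questions. Both approaches proved fairly effective in tests covering a diverse set of queries.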

The answer to factual hallucinations

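LLMs have been the technology breakthrough we all needed. Even though these models are only a few years old, they already power a range of applications, from code generation to customer support, and save enterprises valuable time and resources. Yet even after all this progress, the tendency of models to hallucinate on questions involving numerical and statistical data or other timely facts remains a problem.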

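Even traditional grounding approaches have not been very effective for statistical queries, because such queries involve a range of logic, arithmetic, or comparison operations. Public statistical data is distributed in a wide variety of schemas and formats, and interpreting it correctly requires considerable background context. To address these gaps, Google researchers tapped Data Commons, one of the largest unified repositories of normalized public statistical data, and connected it to the Gemma family of language models using two distinct approaches, essentially fine-tuning them into the new DataGemma models.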

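The first approach, called Retrieval Interleaved Generation (RIG), improves factual accuracy by comparing the model's original generation with relevant statistics stored in Data Commons. To do this, the fine-tuned LLM produces natural language queries describing the value it originally generated. Once a query is ready, a multi-model post-processing pipeline converts it into a structured data query and runs it against Data Commons to retrieve the relevant statistical answer, then backs up or corrects the LLM's generation, with relevant citations.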
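
The substitution step in RIG can be illustrated with a small sketch. It assumes the annotation format documented for the RIG model further down, `[__DC__("<query>") --> "<value>"]`. The lookup function `fetch_statistic`, the place name, and the values in the stub table are hypothetical stand-ins: the real pipeline converts each query into a structured Data Commons request, which is not reproduced here.

```python
import re

# Matches annotations of the form:
#   [__DC__("<natural language query>") --> "<LLM generated statistic>"]
DC_PATTERN = re.compile(
    r'\[__DC__\("(?P<query>[^"]*)"\)\s*-->\s*"(?P<value>[^"]*)"\]'
)

# Hypothetical stand-in for Data Commons. The place, value and source
# below are made up for illustration only.
FAKE_DATA_COMMONS = {
    "what was the median age in Exampleville in 2020?": ("36.2", "Example Census Source"),
}


def fetch_statistic(query):
    """Return (value, source) for a query, or None when nothing is found."""
    return FAKE_DATA_COMMONS.get(query)


def resolve_annotations(text):
    """Replace each annotation with a retrieved value, or keep the LLM's value."""

    def _replace(match):
        result = fetch_statistic(match.group("query"))
        if result is None:
            # Nothing retrieved: fall back to the number the model generated.
            return match.group("value")
        value, source = result
        return f"{value} [source: {source}]"

    return DC_PATTERN.sub(_replace, text)


draft = (
    'The median age is [__DC__("what was the median age in Exampleville in 2020?") --> "35"], '
    'and [__DC__("what share of residents commute by bike?") --> "4%"] bike to work.'
)
resolved = resolve_annotations(draft)
print(resolved)
```

In this sketch the first statistic is corrected and cited, while the second keeps the model's own number because the stub has no entry for it. A production pipeline would likely flag such unverified values rather than pass them through silently.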

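RIG builds on the known Toolformer technique. The other approach, RAG, is the same retrieval-augmented generation that many companies already use to help models incorporate relevant information beyond their training data. In this case, the fine-tuned Gemma model uses the original statistical question to extract relevant variables and produce natural language queries for Data Commons. The queries are then run against the database to fetch relevant statistics or tables. Once the values are extracted, they are used, together with the user's original query, to prompt a long-context LLM (in this case, Gemini 1.5 Pro) to generate the final answer with a high level of accuracy.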
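
The RAG flow described here (generated questions, then retrieval, then a final long-context prompt) can be sketched roughly as follows. Everything in this block is illustrative: `retrieve_table` is a hypothetical stub standing in for a Data Commons lookup, the table contents are invented, and the prompt wording is an assumption, not the prompt Google uses.

```python
def parse_questions(generated):
    """Split the fine-tuned model's output into one question per line."""
    return [line.strip() for line in generated.splitlines() if line.strip()]


def retrieve_table(question):
    """Hypothetical retrieval stub; returns a CSV-like table or None."""
    tables = {
        "What is the unemployment rate in Exampleville?": (
            "year,unemployment_rate\n2021,4.1\n2022,3.7"
        ),
    }
    return tables.get(question)


def build_final_prompt(user_query, questions):
    """Combine retrieved tables with the original query for a long-context LLM."""
    sections = []
    for question in questions:
        table = retrieve_table(question)
        if table is not None:  # questions without data are skipped
            sections.append(f"Question: {question}\n{table}")
    context = "\n\n".join(sections)
    return (
        "Answer the query using only the statistics below.\n\n"
        f"{context}\n\n"
        f"Query: {user_query}"
    )


generated = (
    "What is the unemployment rate in Exampleville?\n"
    "\n"
    "What is the median income in Exampleville?\n"
)
questions = parse_questions(generated)
prompt = build_final_prompt("How is the Exampleville economy doing?", questions)
print(prompt)
```

The resulting string is what would be sent to the long-context model. The sketch also shows why RAG depends on data availability: a question with no matching table contributes nothing to the context.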

Significant improvements in early tests

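When tested on a hand-produced set of 101 queries, DataGemma variants fine-tuned with RIG improved on the 5-17% factuality of the baseline models, reaching about 58%. With the RAG approach, DataGemma models were able to answer 24-29% of the queries with statistical responses from Data Commons. For most of these responses, the LLM was generally accurate with numbers (99%). However, it struggled to draw correct inferences from those numbers 6% to 20% of the time.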

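Even so, it is clear that both RIG and RAG can improve the accuracy of models handling statistical queries, especially those tied to research and decision-making. Each has its own strengths and weaknesses: RIG is faster but less detailed (it retrieves individual statistics and verifies them), while RAG provides more comprehensive data but is constrained by data availability and the need for large context-handling capabilities.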

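Google hopes the public release of DataGemma with RIG and RAG will spur further research into both approaches and open the way to building stronger, better-grounded models.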

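"Our research is ongoing, and we're committed to refining these methodologies further as we scale up this work, subject it to rigorous testing, and ultimately integrate this enhanced functionality into both Gemma and Gemini models, initially through a phased, limited-access approach," the company said in a blog post today.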

google/datagemma-rig-27b-it

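Inputs and outputs

Input: A text string, such as a question or a prompt.
Output: Generated English-language text in response to the input, with statistics annotated in the form [__DC__("<natural language query to fetch the statistic from Data Commons>") --> "<LLM generated statistic>"].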

Usage

pip install -U transformers accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

input_text = 'What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?'
# Move the tokenized inputs to the same device as the model before generating.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)

Output

Sunnyvale, CA has seen significant progress in gender, age, and educational demographics in recent years. Here's a breakdown:

**Gender:**

* **Overall:** Sunnyvale has a relatively equal gender distribution, with a slight skew towards male residents.
* **Recent Shifts:**

    * **2010-2020:** The female population in Sunnyvale has been steadily increasing, reaching a [__DC__("what was the percentage of females in Sunnyvale, CA in 2020?") --> "51.6%"] majority in 2020. This is largely due to its proximity to Silicon Valley and the increasing number of women working in tech.
    * **Age:** Sunnyvale has a relatively young population, with a median age of around [__DC__("what was the median age of residents in Sunnyvale, CA in 2020?") --> "35"]. This reflects the dominance of the tech industry, which tends to favor younger demographics.
    * **Immigration:** Sunnyvale has a significant immigrant population, with many residents hailing from countries in Central and South America, Asia, and the Middle East.

**Age:**

* **Recent Shifts:**

    * **2010-2020:** Sunnyvale has seen a slight decrease in the median age, driven by a competitive job market and a desire for better work-life balance.
    * **Aging Population:** While Sunnyvale has a young population overall, it's also aging. This means the proportion of older residents is increasing, which can put pressure on healthcare and social services.

**Race & Ethnicity:**

* **Majority-Minority City:** Sunnyvale is a majority-minority city, meaning that no single racial or ethnic group constitutes a majority of the population.
* **Key Demographics:**

    * **White:** [__DC__("what percentage of residents in Sunnyvale, CA are white in 2020?") --> "55.1%"]
    * **Hispanic or Latino:** [__DC__("what percentage of residents in Sunnyvale, CA are hispanic or latino in 2020?") --> "30.6%"]
    * **Asian:** [__DC__("what percentage of residents in Sunnyvale, CA are asian in 2020?") --> "16.8%"]
    * **Black or African American:** [__DC__("what percentage of residents in Sunnyvale, CA are black or african american in 2020?") --> "3.6%"]
    * **Other:** [__DC__("what percentage of residents in Sunnyvale, CA are classified as other race in 2020?") --> "0.5%"]
* **Trends & Issues:**

    * **Tech Dominance:** Sunnyvale's proximity to tech jobs drives a younger, more diverse workforce.
    * **Education & Income:**

A highly educated population in Sunnyvale translates to a highly skilled workforce, commanding higher salaries on average.
* **Racial & Ethnic Diversity:**

The city's commitment to diversity and inclusion is reflected in its demographics.
* **Aging Population:**

The aging population presents challenges for healthcare and social services, while a younger population puts pressure on schools and childcare.

**Economic Conditions:**

* **Low Unemployment Rate:** Sunnyvale consistently boasts a low unemployment rate, indicating a strong tech sector.
* **High Median Household Income:**

The median household income in Sunnyvale is significantly higher than the national average, driven by the high demand for skilled labor in the area.
* **Competitive Landscape:** Sunnyvale faces increasing competition for skilled workers and struggles with housing affordability and traffic congestion.

**Education Levels:**

* **High Percentage of Adults with Bachelor's Degree or Higher:**

A highly educated population is a key driver of the city's workforce and economy.
* **Emphasis on STEM Education & Innovation:**

Sunnyvale schools place a strong emphasis on STEM education and innovation, reflecting the region's tech-driven culture.

**Challenges:**

* **Gender Gap:**

The tech industry has a persistent gender gap, with women holding a smaller percentage of jobs than men.

* **Age Diversity:**

While Sunnyvale has a young population overall, it's important to ensure a balance of age groups in the workforce.
* **Immigration Policy:**

Immigration policy debates and enforcement can create uncertainty and hardship for immigrant communities.

**Resources:**

* **Sunnyvale Chamber of Commerce:** https://www.sunnyvalecoc.org/
* **City of Sunnyvale:** https://www.sunnyvale.ca.gov/
* **Stanford University:** https://www.stanford.edu/

**Note:**

These are just overarching trends. It's important to consult reliable sources like the U.S. Census Bureau and the Bureau of Labor Statistics for more detailed and up-to-date information.

Running a 4-bit quantized version with bitsandbytes

pip install -U transformers bitsandbytes accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type='nf4',
   bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = 'google/datagemma-rig-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
input_text = 'What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?'
# Move the tokenized inputs to the same device as the model before generating.
inputs = tokenizer(input_text, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)
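
The 4-bit configuration above mainly trades numeric precision for memory. As a rough back-of-the-envelope illustration, the snippet below estimates weight storage only. It assumes that "27b" in the model name means roughly 27 billion parameters, and it ignores activations, the KV cache, and the small overhead of quantization constants, so actual usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: "27b" is read as about 27 billion parameters.
NUM_PARAMS = 27e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # bfloat16 uses 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # NF4 uses 4 bits per weight

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 54 GB to about 13.5 GB, a fourfold reduction that follows directly from the 16-bit to 4-bit change.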

google/datagemma-rag-27b-it

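Inputs and outputs

Input: A text string containing the user query, together with a prompt asking for statistical questions.
Output: A list of natural language queries that can be used to answer the user query and that can be understood by the existing natural language interface of Data Commons.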

pip install -U transformers accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
inputs = tokenizer(input_text, return_tensors='pt').to('cuda')

outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)

Output

What is the population of Sunnyvale?
What is the population of Sunnyvale males?
What is the population of Sunnyvale females?
What is the population of Sunnyvale asians?
What is the population of Sunnyvale blacks?
What is the population of Sunnyvale whites?
What is the population of Sunnyvale males in their 20s?
What is the population of Sunnyvale females in their 20s?
What is the population of Sunnyvale males in their 30s?
What is the population of Sunnyvale females in their 30s?
What is the population of Sunnyvale males in their 40s?
What is the population of Sunnyvale females in their 40s?
What is the population of Sunnyvale males in their 50s?
What is the population of Sunnyvale females in their 50s?
What is the population of Sunnyvale males in their 60s?
What is the population of Sunnyvale females in their 60s?
How has the population of Sunnyvale changed over time?
How has the population of Sunnyvale males changed over time?
How has the population of Sunnyvale females changed over time?
How has the population of Sunnyvale asian people changed over time?
How has the population of Sunnyvale black people changed over time?
How has the population of Sunnyvale hispanic people changed over time?
How has the population of Sunnyvale white people changed over time?
How has the score on Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale schools changed over time?
How has the number of students enrolled in Sunnyvale charter schools changed over time?
How has the number of students enrolled in Sunnyvale private schools changed over time?
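
The sample output above lists 27 questions even though the prompt asks for a maximum of 25, so the limit is not guaranteed to be respected. A light post-processing step before sending the questions to Data Commons might look like the sketch below. The helper name `clean_questions` and the case-insensitive de-duplication rule are illustrative choices, not part of the model card.

```python
def clean_questions(generated, max_questions=25):
    """Split model output into unique, non-empty questions, capped at a limit."""
    seen = set()
    questions = []
    for line in generated.splitlines():
        question = line.strip()
        key = question.lower()  # treat case-only variants as duplicates
        if not question or key in seen:
            continue
        seen.add(key)
        questions.append(question)
    # Keep the first max_questions entries, preserving the model's order.
    return questions[:max_questions]


sample = (
    "What is the population of Sunnyvale?\n"
    "What is the population of Sunnyvale?\n"
    "\n"
    "  How has the population of Sunnyvale changed over time?  \n"
)
print(clean_questions(sample))
```

Truncating keeps the earliest questions, which is a simple policy. Ranking by relevance before cutting would be a reasonable alternative if the later questions matter more for a given query.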

Running a 4-bit quantized version with bitsandbytes

pip install -U transformers bitsandbytes accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type='nf4',
   bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = 'google/datagemma-rag-27b-it'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
input_text = """Your role is that of a Question Generator.  Given Query below, come up with a
maximum of 25 Statistical Questions that help in answering Query.
These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?
where,
- $METRIC should be a metric on societal topics like demographics, economy, health,
  education, environment, etc.  Examples are unemployment rate and
  life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states,
  districts, etc.

Your response should only have questions, one per line, without any numbering
or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an
empty response.

Query: What are some interesting trends in Sunnyvale spanning gender, age, race, immigration, health conditions, economic conditions, crime and education?
Statistical Questions:"""
inputs = tokenizer(input_text, return_tensors='pt').to('cuda')

outputs = model.generate(**inputs, max_new_tokens=4096)
answer = tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0].strip()
print(answer)