[EVI] A First Look at Hume AI

A Few Words Up Front

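Hume AI has announced that it raised $50 million in a Series B round. The company was founded by Alan Cowen, a former Google DeepMind researcher, who also serves as CEO. Its AI models focus on understanding human emotion, and it has released a demo of its "Empathic Voice Interface" (EVI), which interacts through spoken conversation. According to the Hume AI website, EVI can recognize and respond to 53 distinct emotions. This ability to discern emotion from the voice draws on extensive research, including controlled experiments with hundreds of thousands of people worldwide. The analysis of vocal and facial expressions across cultures forms the basis of its emotion recognition.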

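After hearing about it, I skimmed the Hume AI documentation. Integration looks much like integrating GPT before it: you talk to the service over network requests.
...Anyway, it is hard to sum up in a few words.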

Introduction to Hume AI

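Hume AI can be integrated into any application or research that involves human data: audio, video, images, or text. The models are accessed through an API and measure more than 50 dimensions of emotional expression in subtle facial and vocal behavior. They capture nuanced facial expressions such as boredom and desire, vocal expressions such as sighs and laughs, the continuous emotional tone of speech, the emotion conveyed in text, and moment-to-moment multimodal estimates of emotional experience.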

EVI

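EVI (Empathic Voice Interface): Hume's EVI understands and emulates tone of voice, word emphasis, and more, in order to optimize interaction between humans and AI.
A voice AI assistant.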

Demo

A voice AI with the capacity for empathy.
Official online demo: https://demo.hume.ai


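Quick Start

This section is translated from the quickstart guide on the official website.

Getting an API Key

Hume AI uses a pay-as-you-go pricing model.

To establish an authenticated connection, we first need to instantiate the Hume client with our API key and client secret. These keys can be obtained by logging in to the portal and visiting the API keys page.

In the sample code below, the API key and client secret have been saved to environment variables. Avoid hard-coding these values in your project so that they are not leaked.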

import { Hume, HumeClient } from 'hume';

// instantiate the Hume client and authenticate
const client = new HumeClient({
  apiKey: import.meta.env.HUME_API_KEY,
  clientSecret: import.meta.env.HUME_CLIENT_SECRET,
});
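
Since the snippet above reads both credentials from environment variables, it can help to fail fast when one of them is missing rather than discover it at connection time. The following is only a minimal sketch: `requireEnv` and `loadHumeCredentials` are hypothetical helper names of my own, not part of the Hume SDK, and the environment object is passed in explicitly so the code does not depend on any particular bundler.

```typescript
// Hypothetical helpers (not part of the Hume SDK) for validating credentials.
type Env = Record<string, string | undefined>;

// Return the named variable, or throw if it is missing or blank.
function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Collect the two values the quickstart passes to the Hume client.
function loadHumeCredentials(env: Env): { apiKey: string; clientSecret: string } {
  return {
    apiKey: requireEnv(env, "HUME_API_KEY"),
    clientSecret: requireEnv(env, "HUME_CLIENT_SECRET"),
  };
}
```

Under these assumptions, the result of `loadHumeCredentials(import.meta.env)` could be passed to the `HumeClient` constructor in place of the two direct lookups.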

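When using the TypeScript SDK, once the Hume client has been instantiated with your credentials, the access token needed to establish an authenticated connection with EVI is fetched and applied in the background.

Connecting

With the Hume client instantiated with our credentials, we can now establish an authenticated WebSocket connection with EVI and define our WebSocket event handlers. For now we include placeholder event handlers, to be updated in later steps.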

import { Hume, HumeClient } from 'hume';

// instantiate the Hume client and authenticate
const client = new HumeClient({
  apiKey: import.meta.env.HUME_API_KEY,
  clientSecret: import.meta.env.HUME_CLIENT_SECRET,
});

// instantiates WebSocket and establishes an authenticated connection
const socket = await client.empathicVoice.chat.connect({
  onOpen: () => {
    console.log('WebSocket connection opened');
  },
  onMessage: (message) => {
    console.log(message);
  },
  onError: (error) => {
    console.error(error);
  },
  onClose: () => {
    console.log('WebSocket connection closed');
  }
});

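Uploading Audio

Capturing audio and sending it through the socket as audio input takes several steps.

Handle the user's permission to access the microphone.
Capture audio with the Media Stream API, and record the captured audio with the MediaRecorder API.
Base64-encode the recorded audio Blob.
Send the encoded audio over the WebSocket with the sendAudioInput method.

Accepted audio formats include: mp3, wav, aac, ogg, flac, webm, avr, cdda, cvs/vms, mp2, mp4, ac3, avi, wmv, mpeg, ircam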

import {
  convertBlobToBase64,
  ensureSingleValidAudioTrack,
  getAudioStream,
} from 'hume';

// the recorder responsible for recording the audio stream to be prepared as the audio input
let recorder: MediaRecorder | null = null;
// the stream of audio captured from the user's microphone
let audioStream: MediaStream | null = null;

// define function for capturing audio
async function captureAudio(): Promise<void> {
  // prompts user for permission to capture audio, obtains media stream upon approval
  audioStream = await getAudioStream();
  // ensure there is only one audio track in the stream
  ensureSingleValidAudioTrack(audioStream);
  // instantiate the media recorder; mimeType is the browser-supported type
  // declared in the audio playback snippet further down
  recorder = new MediaRecorder(audioStream, { mimeType });
  // callback for when recorded chunk is available to be processed
  recorder.ondataavailable = async ({ data }) => {
    // IF size of data is smaller than 1 byte then do nothing
    if (data.size < 1) return;
    // base64 encode audio data
    const encodedAudioData = await convertBlobToBase64(data);
    // define the audio_input message JSON
    const audioInput: Omit<Hume.empathicVoice.AudioInput, 'type'> = {
      data: encodedAudioData,
    };
    // send audio_input message
    socket?.sendAudioInput(audioInput);
  };
  // capture audio input at a rate of 100ms (recommended)
  const timeSlice = 100;
  recorder.start(timeSlice);
}

// define a WebSocket open event handler to capture audio
async function handleWebSocketOpenEvent(): Promise<void> {
  // place logic here which you would like invoked when the socket opens
  console.log('Web socket connection opened');
  await captureAudio();
}
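
To make the encoding step less of a black box, here is a small sketch of what "base64-encode the audio and wrap it in an audio_input message" amounts to. This is not how the SDK's `convertBlobToBase64` is implemented; `bytesToBase64` and `buildAudioInputMessage` are hypothetical names, and the message shape (`type` plus `data`) is assumed from the `AudioInput` type used in the snippet above.

```typescript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode raw bytes as standard base64 with "=" padding.
function bytesToBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const hasSecond = i + 1 < bytes.length;
    const hasThird = i + 2 < bytes.length;
    const b0 = bytes[i] ?? 0;
    const b1 = hasSecond ? bytes[i + 1] ?? 0 : 0;
    const b2 = hasThird ? bytes[i + 2] ?? 0 : 0;
    // pack three bytes into 24 bits, then split into four 6-bit groups
    const triple = (b0 << 16) | (b1 << 8) | b2;
    out += BASE64_ALPHABET.charAt((triple >> 18) & 63);
    out += BASE64_ALPHABET.charAt((triple >> 12) & 63);
    out += hasSecond ? BASE64_ALPHABET.charAt((triple >> 6) & 63) : "=";
    out += hasThird ? BASE64_ALPHABET.charAt(triple & 63) : "=";
  }
  return out;
}

// Build the JSON text of an audio_input message (assumed shape: type + data).
function buildAudioInputMessage(chunk: Uint8Array): string {
  return JSON.stringify({ type: "audio_input", data: bytesToBase64(chunk) });
}
```

In the real flow the SDK's `sendAudioInput` adds the `type` field and serializes the message for you, so this sketch is only useful for understanding or for testing without a browser.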

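Responses

A response consists of several messages, detailed below:

user_message: This message contains the transcription of the audio input. It also includes expression measurement predictions related to the speaker's vocal prosody.
assistant_message: One AssistantMessage is sent for each sentence in the response. This message carries the content of the response along with predictions about the expressive qualities of the generated audio response.
audio_output: An AudioOutput message accompanies each AssistantMessage. It contains the actual audio (binary) response that corresponds to that AssistantMessage.
assistant_end: Marks the end of the response to the audio input. The AssistantEnd message is delivered as the final part of the exchange.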
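
The four message kinds above map naturally onto a discriminated union. The sketch below uses simplified, hypothetical shapes (`EviEvent`, `transcript`, `text`, `prosodyScores`) purely for illustration; the real SDK types under `Hume.empathicVoice` are structured differently, so treat the field names as assumptions.

```typescript
// Hypothetical, simplified event shapes; not the SDK's actual types.
type ExpressionScores = Record<string, number>;

type EviEvent =
  | { type: "user_message"; transcript: string; prosodyScores: ExpressionScores }
  | { type: "assistant_message"; text: string; prosodyScores: ExpressionScores }
  | { type: "audio_output"; data: string }
  | { type: "assistant_end" };

// Names of the n highest-scoring expressions, highest first.
function topExpressions(scores: ExpressionScores, n: number): string[] {
  return Object.entries(scores)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([name]) => name);
}

// Produce a one-line summary for each kind of event.
function describeEvent(event: EviEvent): string {
  switch (event.type) {
    case "user_message":
      return `user said "${event.transcript}" (${topExpressions(event.prosodyScores, 2).join(", ")})`;
    case "assistant_message":
      return `assistant says "${event.text}"`;
    case "audio_output":
      return `audio chunk with ${event.data.length} base64 characters`;
    case "assistant_end":
      return "assistant finished responding";
  }
}
```

Switching on `type` lets the compiler narrow each branch, which is the same pattern the handlers later in this post rely on.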

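Here we focus on playing the received audio output. To play the audio output from the response, we need logic that converts the received audio data (base64-encoded in the message) into a Blob and creates an HTMLAudioElement to play it. We then update the client's WebSocket message event handler to invoke the playback logic whenever audio output arrives. To manage playback of the incoming audio, we implement a queue and play the audio in order.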

import {
  Hume,
  MimeType,
  convertBase64ToBlob,
  getBrowserSupportedMimeType
} from 'hume';

// audio playback queue
const audioQueue: Blob[] = [];
// flag which denotes whether audio is currently playing or not
let isPlaying = false;
// the current audio element to be played
let currentAudio: HTMLAudioElement | null = null;
// mime type supported by the browser the application is running in
const mimeType: MimeType = (() => {
  const result = getBrowserSupportedMimeType();
  return result.success ? result.mimeType : MimeType.WEBM;
})();

// play the audio within the playback queue, converting each Blob into playable HTMLAudioElements
function playAudio(): void {
  // IF there is nothing in the audioQueue OR audio is currently playing then do nothing
  if (!audioQueue.length || isPlaying) return;
  // pull next audio output from the queue
  const audioBlob = audioQueue.shift();
  // IF audioBlob is unexpectedly undefined then do nothing
  if (!audioBlob) return;
  // update isPlaying state only once there is audio to play,
  // so an early return cannot leave the flag stuck at true
  isPlaying = true;
  // converts Blob to AudioElement for playback
  const audioUrl = URL.createObjectURL(audioBlob);
  currentAudio = new Audio(audioUrl);
  // play audio
  currentAudio.play();
  // callback for when audio finishes playing
  currentAudio.onended = () => {
    // update isPlaying state
    isPlaying = false;
    // attempt to pull next audio output from queue
    if (audioQueue.length) playAudio();
  };
}

// define a WebSocket message event handler to play audio output
function handleWebSocketMessageEvent(
  message: Hume.empathicVoice.SubscribeEvent
): void {
  // place logic here which you would like to invoke when receiving a message through the socket
  switch (message.type) {
    // add received audio to the playback queue, and play next audio output
    case 'audio_output':
      // convert base64 encoded audio to a Blob
      const audioOutput = message.data;
      const blob = convertBase64ToBlob(audioOutput, mimeType);
      // add audio Blob to audioQueue
      audioQueue.push(blob);
      // play the next audio output
      if (audioQueue.length === 1) playAudio();
      break;
  }
}

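Interruption

Interruptibility is a key feature of the Empathic Voice Interface. If audio input is sent over the WebSocket while response messages for a previous audio input are still being received, the response to the previous audio input stops being sent. In addition, the interface sends back a user_interruption message and begins responding to the new audio input.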

// function for stopping the audio and clearing the queue
function stopAudio(): void {
  // stop the audio playback
  currentAudio?.pause();
  currentAudio = null;
  // update audio playback state
  isPlaying = false;
  // clear the audioQueue
  audioQueue.length = 0;
}

// update WebSocket message event handler to handle interruption
function handleWebSocketMessageEvent(
  message: Hume.empathicVoice.SubscribeEvent
): void {
  // place logic here which you would like to invoke when receiving a message through the socket
  switch (message.type) {
    // add received audio to the playback queue, and play next audio output
    case 'audio_output':
      // convert base64 encoded audio to a Blob
      const audioOutput = message.data;
      const blob = convertBase64ToBlob(audioOutput, mimeType);
      // add audio Blob to audioQueue
      audioQueue.push(blob);
      // play the next audio output
      if (audioQueue.length === 1) playAudio();
      break;
    // stop audio playback, clear audio playback queue, and update audio playback state on interrupt
    case 'user_interruption':
      stopAudio();
      break;
  }
}
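
The queue-and-interrupt logic above is tied to the browser's `Audio` element, which makes it awkward to test. As a rough sketch, the same idea can be expressed with the player injected as a callback. `PlaybackQueue` and `PlayFn` are hypothetical names of my own, and the generation counter is an addition not present in the original code: it ignores a late "ended" callback from a clip that was already interrupted.

```typescript
// Hypothetical DOM-free version of the playback queue (not from the Hume SDK).
type PlayFn<T> = (item: T, onEnded: () => void) => void;

class PlaybackQueue<T> {
  private items: T[] = [];
  private playing = false;
  private generation = 0;
  private play: PlayFn<T>;

  constructor(play: PlayFn<T>) {
    this.play = play;
  }

  get pending(): number {
    return this.items.length;
  }

  get isPlaying(): boolean {
    return this.playing;
  }

  // Add an item; playback starts immediately if nothing is playing.
  enqueue(item: T): void {
    this.items.push(item);
    this.playNext();
  }

  // Drop everything queued and invalidate callbacks from the current item.
  interrupt(): void {
    this.generation += 1;
    this.items.length = 0;
    this.playing = false;
  }

  private playNext(): void {
    if (this.playing) return;
    const next = this.items.shift();
    if (next === undefined) return;
    this.playing = true;
    const startedIn = this.generation;
    this.play(next, () => {
      // ignore "ended" signals from items that were interrupted
      if (startedIn !== this.generation) return;
      this.playing = false;
      this.playNext();
    });
  }
}
```

In a browser, the injected function would create an `Audio` element, assign `onended`, and call `play()`. Pausing the current element on interruption still has to be done by the caller, as `stopAudio` does above.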

API Reference

Official link: API Reference

Request URL:
https://api.hume.ai/v0/evi/tools?page_number=0&page_size=2

cURL example:

curl -G https://api.hume.ai/v0/evi/tools \
     -H "X-Hume-Api-Key: <YOUR_API_KEY>" \
     -d page_number=0 \
     -d page_size=2

TypeScript example:

// List tools (GET /tools)
const response = await fetch("https://api.hume.ai/v0/evi/tools?page_number=0&page_size=2", {
  method: "GET",
  headers: {
    "X-Hume-Api-Key": "<YOUR_API_KEY>"
  },
});
const body = await response.json();
console.log(body);
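
Because the endpoint is paginated through `page_number` and `page_size`, a tiny helper for building the URL keeps the query string in one place. This is only a sketch: `buildToolsListUrl` is a hypothetical name, and I am assuming `page_number` is zero-based because the example request uses 0.

```typescript
const HUME_TOOLS_ENDPOINT = "https://api.hume.ai/v0/evi/tools";

// Build the list-tools URL for a given page (assumed zero-based) and page size.
function buildToolsListUrl(pageNumber: number, pageSize: number): string {
  if (!Number.isInteger(pageNumber) || pageNumber < 0) {
    throw new RangeError("pageNumber must be a non-negative integer");
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be a positive integer");
  }
  return `${HUME_TOOLS_ENDPOINT}?page_number=${pageNumber}&page_size=${pageSize}`;
}
```

The returned string can be passed straight to `fetch` together with the `X-Hume-Api-Key` header shown above.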

Python example:

import requests
# List tools (GET /tools)
response = requests.get(
  "https://api.hume.ai/v0/evi/tools?page_number=0&page_size=2",
  headers={
    "X-Hume-Api-Key": "<YOUR_API_KEY>"
  },
)
print(response.json())