faiss multi-GPU quantization: compressing down to the 4-bit limit

news2024/11/6 18:32:55

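As things stand, the compression limit for multi-GPU faiss is as follows: with IVFPQ the codes can only go down to 8 bits, whereas scalar quantization (SQ) can go down to 4 bits, so SQ is used here.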
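
To make "4 bits" concrete, the following is a minimal sketch of uniform 4-bit scalar quantization: each component is mapped to one of 16 levels and two codes are packed into one byte. It is an illustration only and not faiss's actual codec. The function names `encode_4bit` and `decode_4bit` are hypothetical, and a single fixed range `[vmin, vmax]` is assumed for all components, whereas faiss learns its ranges from the training data.

```python
def encode_4bit(vec, vmin, vmax):
    """Quantize each component to one of 16 levels; pack two codes per byte."""
    scale = 15.0 / (vmax - vmin)
    codes = [min(15, max(0, round((x - vmin) * scale))) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad to an even number of 4-bit codes
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def decode_4bit(packed, n, vmin, vmax):
    """Unpack n components and map each code back to its level value."""
    step = (vmax - vmin) / 15.0
    out = []
    for byte in packed:
        out.append(vmin + (byte & 0x0F) * step)
        out.append(vmin + (byte >> 4) * step)
    return out[:n]


vec = [0.0, 0.1, 0.5, 0.9, 1.0]
packed = encode_4bit(vec, 0.0, 1.0)
restored = decode_4bit(packed, len(vec), 0.0, 1.0)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

Under these assumptions a 768-dimensional vector takes 384 bytes instead of the 3072 bytes of its float32 form, and the reconstruction error per component is at most half a quantization step.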

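See the earlier posts: "Fast recall over billion-scale vector data with faiss-gpu" (CSDN blog)

The index after quantization: "How to use the new version of faiss" (CSDN blog)

"Searching 1 billion vectors (dimension 768) with faiss" (CSDN blog)

"How to use faiss.IndexScalarQuantizer" (CSDN blog)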

import faiss
import numpy as np

d = 768     # vector dimension
ngpu = 4    # number of GPUs used as shards

# Number of OpenMP threads for CPU-side work (this does not select GPUs)
faiss.omp_set_num_threads(4)

# Initialize one GPU resource object per device
gpu_resources = []
for i in range(ngpu):
    res = faiss.StandardGpuResources()
    gpu_resources.append(res)

# Create an IVF index with 4-bit scalar quantization
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFScalarQuantizer(quantizer, d, 1024, faiss.ScalarQuantizer.QT_4bit)  # 1024 cluster centroids, 4-bit codes

# Cloner options. NOTE: co is never passed to index_cpu_to_gpu below,
# so these settings have no effect in this script as written.
co = faiss.GpuMultipleClonerOptions()
co.useFloat16 = True
co.usePrecomputed = False
co.indicesOptions = faiss.INDICES_64_BIT
co.verbose = True
co.shard = True  # shard mode

# Shard across GPUs with IndexShards: one (still empty) GPU copy per device
shards = faiss.IndexShards(d)
for i in range(ngpu):
    gpu_index = faiss.index_cpu_to_gpu(gpu_resources[i], i, index)
    shards.add_shard(gpu_index)

# Generate random data
nb = 1000000
np.random.seed(1234)
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.

# Train the index (every shard is trained)
shards.train(xb)

# Add the data to the index
shards.add(xb)

# Search example
nq = 10  # number of queries
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

k = 5  # return the 5 nearest results
D, I = shards.search(xq, k)

print("Search results:")
for i in range(nq):
    print(f"Query {i}: indices of the {k} nearest vectors: {I[i]}")

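Note that with the code above the data cannot be loaded in batches, so a large amount of memory is needed at once. Calling add() a second time produces the following error: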

0, in add return _swigfaiss.IndexShards_add(self, n, x) RuntimeError: Error in void faiss::IndexShardsTemplate<IndexT>::add_with_ids(faiss::IndexShardsTemplate<IndexT>::idx_t, const component_t*, const idx_t*) [with IndexT = faiss::Index; faiss::IndexShardsTemplate<IndexT>::idx_t = long int; faiss::IndexShardsTemplate<IndexT>::component_t = float] at /project/faiss/faiss/IndexShards.cpp:240: Error: 'this->ntotal == 0' failed: when adding to IndexShards with sucessive_ids, only add() in a single pass is supported

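This error means that add() cannot be called more than once on an IndexShards that assigns successive ids itself: all of the data has to be added in a single call rather than over several calls.

To get around this, make sure that add() is called exactly once with all of the data. The modified code is below (the listing is the same as the one above; the point is that add() appears only once):

Modified code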

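import faiss
import numpy as np

d = 768     # vector dimension
ngpu = 4    # number of GPUs used as shards

# Number of OpenMP threads for CPU-side work (this does not select GPUs)
faiss.omp_set_num_threads(4)

# Initialize one GPU resource object per device
gpu_resources = []
for i in range(ngpu):
    res = faiss.StandardGpuResources()
    gpu_resources.append(res)

# Create an IVF index with 4-bit scalar quantization
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFScalarQuantizer(quantizer, d, 1024, faiss.ScalarQuantizer.QT_4bit)  # 1024 cluster centroids, 4-bit codes

# Cloner options. NOTE: co is never passed to index_cpu_to_gpu below,
# so these settings have no effect in this script as written.
co = faiss.GpuMultipleClonerOptions()
co.useFloat16 = True
co.usePrecomputed = False
co.indicesOptions = faiss.INDICES_64_BIT
co.verbose = True
co.shard = True  # shard mode

# Shard across GPUs with IndexShards: one (still empty) GPU copy per device
shards = faiss.IndexShards(d)
for i in range(ngpu):
    gpu_index = faiss.index_cpu_to_gpu(gpu_resources[i], i, index)
    shards.add_shard(gpu_index)

# Generate random data
nb = 1000000
np.random.seed(1234)
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.

# Train the index (every shard is trained)
shards.train(xb)

# Add all of the data in a single add() call
shards.add(xb)

# Search example
nq = 10  # number of queries
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

k = 5  # return the 5 nearest results
D, I = shards.search(xq, k)

print("Search results:")
for i in range(nq):
    print(f"Query {i}: indices of the {k} nearest vectors: {I[i]}")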

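Code explanation

GPU resource initialization: four StandardGpuResources objects are created, one per GPU.
4-bit scalar-quantized index: an IndexIVFScalarQuantizer is created, where 1024 is the number of cluster centroids and faiss.ScalarQuantizer.QT_4bit selects 4-bit scalar quantization.
Moving the index to the GPUs: faiss.index_cpu_to_gpu copies the index to each GPU. A faiss.GpuMultipleClonerOptions() object is also configured (Float16, shard mode and so on), but it is never passed to index_cpu_to_gpu, so those options do not take effect here.
Sharding with IndexShards: IndexShards manages the indexes on the individual GPUs. Every shard is cloned from the same index definition, so all shards have the same code size.
Training and adding data: 1 million random 768-dimensional vectors are generated, used for training, and added to the index. Note that add() is called once, not several times.
Search: 10 query vectors are generated and searched against the index, returning the indices of the 5 nearest vectors for each query.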
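
The search step over several shards can be pictured as follows: each shard returns its own local top-k, the local ids are shifted by the number of vectors held in the preceding shards (the "successive ids" convention named in the error message), and the candidates are merged into one global top-k. This is a conceptual sketch in plain Python, not faiss code; `merge_shard_results` is a hypothetical name and the distances are made up.

```python
import heapq


def merge_shard_results(shard_results, shard_sizes, k):
    """Merge per-shard (distance, local_id) lists into a global top-k.

    Local ids are shifted by the total size of the preceding shards.
    """
    merged = []
    offset = 0
    for results, size in zip(shard_results, shard_sizes):
        merged.extend((dist, local_id + offset) for dist, local_id in results)
        offset += size
    return heapq.nsmallest(k, merged)


shard_results = [
    [(0.10, 3), (0.40, 7)],  # shard 0 holds global ids 0..9
    [(0.05, 2), (0.30, 0)],  # shard 1 holds global ids 10..19
]
top = merge_shard_results(shard_results, shard_sizes=[10, 10], k=3)
```

Here the best hit comes from shard 1, and its local id 2 becomes global id 12 once the 10 vectors of shard 0 are accounted for.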

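Notes

Host and GPU memory: the data set is fairly large (1 million 768-dimensional vectors), so make sure there is enough host memory for the float32 array and enough GPU memory for the shards.
Performance tuning: you can adjust parameters such as the number of cluster centroids. The number of sub-vectors is a product-quantization parameter and only applies if you switch to IVFPQ.
Multi-GPU load balancing: sharding spreads the data, and with it the memory use and search work, across the GPUs. In this script the sharding is done by IndexShards; co.shard = True is only consulted by the multi-GPU cloner functions such as faiss.index_cpu_to_all_gpus.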
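
For the memory note, a back-of-the-envelope estimate can help when sizing the GPUs. The sketch below assumes that each stored vector costs `dim * bits / 8` bytes of codes plus 8 bytes for a 64-bit id, and it ignores the coarse quantizer, temporary buffers and allocator overhead, so real usage will be higher. `index_bytes` is a hypothetical helper, not a faiss function.

```python
def index_bytes(n_vectors, dim, bits_per_component, id_bytes=8):
    """Approximate storage: packed codes plus one id per vector."""
    code_bytes = dim * bits_per_component // 8
    return n_vectors * (code_bytes + id_bytes)


GIB = 1024 ** 3
n, dim, ngpu = 1_000_000, 768, 4

raw = n * dim * 4                 # float32 array on the host
sq4 = index_bytes(n, dim, 4)      # 4-bit scalar-quantized codes plus ids

print(f"float32 vectors: {raw / GIB:.2f} GiB")
print(f"4-bit SQ index:  {sq4 / GIB:.2f} GiB in total")
print(f"per GPU:         {sq4 / ngpu / GIB:.2f} GiB over {ngpu} shards")
```

The same function can be called with `n_vectors=10**9` to see why the billion-vector case discussed in the earlier posts needs both quantization and several GPUs.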

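With this modified code, add() is called on IndexShards only once, so you should be able to avoid the error above and run the faiss 4-bit scalar-quantized index in a multi-GPU environment.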

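But what if I do want to load the data in batches?

Below is an example that does not add data through IndexShards. Instead, it manages the per-GPU indexes by hand, loads the data batch by batch, and only uses IndexShards at the end to search across them:

Python code example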

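import faiss
import numpy as np

d = 768     # vector dimension
ngpu = 4    # number of GPUs

# Number of OpenMP threads for CPU-side work (this does not select GPUs)
faiss.omp_set_num_threads(4)

# Initialize one GPU resource object per device
gpu_resources = []
for i in range(ngpu):
    res = faiss.StandardGpuResources()
    gpu_resources.append(res)

# Create an IVF index with 4-bit scalar quantization
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFScalarQuantizer(quantizer, d, 1024, faiss.ScalarQuantizer.QT_4bit)  # 1024 cluster centroids, 4-bit codes

# Copy the (still empty) index to every GPU
gpu_indices = []
for i in range(ngpu):
    gpu_index = faiss.index_cpu_to_gpu(gpu_resources[i], i, index)
    gpu_indices.append(gpu_index)

# Generate random data
nb = 1000000
np.random.seed(1234)
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.

# Train and add data batch by batch, one batch per GPU
batch_size = nb // ngpu  # number of vectors handled by each GPU
for i in range(ngpu):
    start = i * batch_size
    end = (i + 1) * batch_size
    batch_data = xb[start:end]

    # Train this GPU's index on its own batch
    gpu_indices[i].train(batch_data)

    # Add the batch to this GPU's index
    gpu_indices[i].add(batch_data)

# Combine the already populated per-GPU indexes for searching.
# With the default successive ids, the ids of shard i are shifted by the
# sizes of the preceding shards, so results refer to positions in xb.
index = faiss.IndexShards(d)
for gpu_index in gpu_indices:
    index.add_shard(gpu_index)

# Search example
nq = 10  # number of queries
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

k = 5  # return the 5 nearest results
D, I = index.search(xq, k)

print("Search results:")
for i in range(nq):
    print(f"Query {i}: indices of the {k} nearest vectors: {I[i]}")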

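Code explanation

GPU resource initialization: four StandardGpuResources objects are created, one per GPU.
4-bit scalar-quantized index: an IndexIVFScalarQuantizer is created, where 1024 is the number of cluster centroids and faiss.ScalarQuantizer.QT_4bit selects 4-bit scalar quantization.
Moving the index to the GPUs: faiss.index_cpu_to_gpu copies the index to each GPU, and the copies are stored in the gpu_indices list.
Batched training and adding: the data is split into 4 batches, one per GPU. The index on each GPU is trained on, and filled with, its own batch only.
Combining the indexes: IndexShards wraps the indexes on all GPUs into a single searchable index.
Search: 10 query vectors are generated and searched against the combined index, returning the indices of the 5 nearest vectors for each query.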