Faiss::IndexIVF

IndexIVF 结构体：

struct IndexIVFPQ: IndexIVF {
    /// 是否基于向量与聚类中心的距离（残差）
    bool by_residual;              ///< Encode residual or plain vector?
    /// 是否预先计算残差表
    int use_precomputed_table;     ///< if by_residual, build precompute tables
    /// 点积量化器，用于分段训练聚类中心，并量化向量
    ProductQuantizer pq;           ///< produces the codes

    bool do_polysemous_training;   ///< reorder PQ centroids after training?
    PolysemousTraining *polysemous_training; ///< if NULL, use default

    // search-time parameters
    size_t scan_table_threshold;   ///< use table computation or on-the-fly?
    int polysemous_ht;             ///< Hamming thresh for polysemous filtering


    /// if use_precompute_table
    /// size nlist * pq.M * pq.ksub
    /// 残差表
    std::vector <float> precomputed_table;
	/// nlist为粗量化器聚类数量，也称倒排链个数
    IndexIVFPQ (
            Index * quantizer, size_t d, size_t nlist,
            size_t M, size_t nbits_per_idx);           
}

IndexIVF 初始化：

  IndexIVFPQ (
            Index * quantizer, size_t d, size_t nlist,
            size_t M, size_t nbits_per_idx);

quantizer 粗量化器，存放所有聚类中心点，与之对应的是nlist，即粗聚类中心个数
d为向量维度、M分段个数、nbits_per_idx为分段量化后占用的字节数，对应分段的聚类点为2^nbits_per_idx

训练

所以综上而言，IndexIVF 含有两个量化器：

Index * quantizer 粗量化器
ProductQuantizer pq 点积量化，即分段的量化

那么理论上就应该存在对两个量化器分别进行训练的逻辑，事实也是如此：

void IndexIVF::train (idx_t n, const float *x)
{
    train_q1 (n, x, verbose, metric_type);

    train_residual (n, x);
    is_trained = true;
}

在train_q1中对quantizer （即粗量化器）进行了训练，train_residual对pq进行了训练。

PQ相对于粗量化器而言，采用分段的聚类。

即PQ的训练：

（1）采样，并不是说给多少向量，就要全部用于训练，而需要进行采样

//输入训练向量
train_residual (n, x);
// 采样
x = fvecs_maybe_subsample (
         d, (size_t*)&n, pq.cp.max_points_per_centroid * pq.ksub,
         x, verbose, pq.cp.seed);

即每个聚类中心点保留max_points_per_centroid个样本进行训练

（2）计算采样向量对应的粗量化器聚类中心，assign即存储了粗量化聚类中心ID

idx_t * assign = new idx_t [n]; // assignement to coarse centroids
ScopeDeleter<idx_t> del (assign);
quantizer->assign (n, x, assign);

（3）计算每个样本与最近聚类中心点的向量差

float *residuals = new float [n * d];
        del_residuals.set (residuals);
        for (idx_t i = 0; i < n; i++)
           quantizer->compute_residual (x + i * d, residuals+i*d, assign[i]);
trainset = residuals;

void Index::compute_residual (const float * x,
                              float * residual, idx_t key) const {
  reconstruct (key, residual);
  for (size_t i = 0; i < d; i++)
    residual[i] = x[i] - residual[i];
}

（4）PQ训练残差

pq.train (n, trainset);
///

float * xslice = new float[n * dsub];
ScopeDeleter<float> del (xslice);
for (int m = 0; m < M; m++) {
    for (int j = 0; j < n; j++)
        memcpy (xslice + j * dsub,
                x + j * d + m * dsub,
                dsub * sizeof(float));

    Clustering clus (dsub, ksub, cp);

数据分段，逐个段进行聚类，其实不难发现PQ训练使用的是训练样本与粗量化器聚类中心的残差向量作为训练集。

索引

（1）找出输入向量样本对应的最近的粗聚类中心ID

long * idx0 = new long [n];
del_idx.set (idx0);
quantizer->assign (n, x, idx0);

（2）计算样本与粗聚类中心点的残差，放入residuals

（3）计算每个残差向量的量化编码（PQ）

pq.compute_codes (to_encode, xcodes, n);

xcodes：输出量化编码结果

to_encode：待量化编码的残差数组

（4）构建倒排索引

id倒排链，即记录距离聚类中心点最近的向量ID链
code倒排链，code就是向量的残差量化编码

搜索

搜索的目标就是计算查询向量与索引向量的最短距离：

$d = || x - y_c - y_r ||^2$

$x$ 为查询向量
$y_c$ 为离 $x$ 最近的粗聚类中心点
$y_r$ 是 $x$ 的残差量化的聚类中心点， $y_r$ 实际上应该是多段的，因为PQ训练是分段训练的

将公式展开为：

$d=||x-y_c||^2 + ||y_r||^2+2*(y_c.y_r)-2*(x.y_r)$

其中 $||y_r||^2+2*(y_c|y_r)$ 是可以提前计算好的，即可放入提前计算的表中。

搜索的流程：

粗量化器查询，即返回最近的几个粗聚类中心点
残差查询
- 计算当前向量与PQ所有子聚类中心的点积，即 $(x.y_r)$
- 计算正在的距离

即通过查询到粗聚类中心ID，可以计算 $||x-y_c||^2$ ，获取粗聚类中心点对应的ID链，遍历ID对应的code，可以得到分段的子聚类中心，此时便可计算 $||y_r||^2+2*(y_c.y_r)-2*(x.y_r)$ ，综合这两部分，就可以求得真正距离。

殷亚云的博客

Min-Hash & LSH