STL关联式容器之hashtable

hashtable的桶子与节点

下图为开链法(separate chaining)完成hashtable的图形表述。为了剖析SGI STL源码，我们遵循SGI的命名，称hash table表格内的元素为桶(bucket),此名称的大约意义是，表格内的每个单元，涵盖的不只是个节点(元素)，可能是一桶节点。(nil标识null节点)

下面是hash table的节点定义

template <class Value>
struct __hashtable_node {
    __hashtable_node *next;
    Value value;
};

注意，bucket所维护的linked list，并不采用STL的list或slist，而是自行维护上述的hashtable node。至于buckets聚合体，则以vector完成，以便有动态扩展能力。稍后在hash table的定义中我们可以清楚看到这一点。

hashtable迭代器

以下是hashtable迭代器的定义：

template <class Value, class Key, class HashFcn, class ExtractKey, class EqualKey, class Alloc>
struct __hashtable_iterator {
    typedef __hashtable_iterator<Value, Key, HashFcn, ExtractKey, EqualKey, Alloc> iterator;
    typedef __hashtable_const_iterator<Value, Key, HashFcn, ExtractKey, EqualKey, Alloc> const_iterator;
    typedef __hashtable_node<Value> node;
    
    typedef forward_iterator_tag iterator_category;
    typedef Value value_type;
    typedef ptrdiff_t difference_type;
    typedef size_t size_type;
    typedef Value& reference;
    typedef Value* pointer;

    node* cur;
    hashtable *ht;
    
    __hashtable_iterator(node *n, hashtable* tab) : cur(n), ht(tab) {}
    __hashtable_iterator() {}
    reference operator*() const { return cur->val; }
    pointer operator->() const { return &(operator*()); }
    iterator& operator++();
    iterator& operator++(int);
    
    bool operator == (const iterator& it) const { return cur == it.cur; }
    bool operator != (const iterator& it) const { return cur != it.cur; }
};

注意，hashtable迭代器必须永远维系着与整个bucket vector的关系，并记录目前所指的节点。其前进操作是首先尝试从目前所指节点出发，前进一个位置节点，由于节点被安置在list内，所以利用节点的next指针可以轻易的达成前进的操作。如果目前节点正巧是list的尾端，就跳至下一个bucket身上，那正式指向下一个list的头部节点。

template<class V, class K, class HF, class ExK, class EqK, class A>
__hashtable_iterator<V, K, HF, ExK, EqK, A>&
__hashtable_iterator<V, K, HF, ExK, EqK, A>::operator++() {
    const node* old = cur;
    cur = cur->next;
    if (!cur) {
        size_type bucket = ht->bkt_num(old_val); // 对old_value值重新进行了一次hash映射
        while(!cur && ++bucket < ht->buckets.size()) {
            cur = ht->buckets[bucket];
        }
    }
    return *this;
}


template<class V, class K, class HF, class ExK, class EqK, class A>
__hashtable_iterator<V, K, HF, ExK, EqK, A>&
__hashtable_iterator<V, K, HF, ExK, EqK, A>::operator++(int) {

    iterator tmp = *this;
    ++*this;
    return tmp;
}

注意，hash table的迭代器没有后退操作（operator--()）,hashtabl也没有定义逆向迭代器(reverse iterator)。

hashtable的数据结构

下面是hashtable的定义摘要，其中可见buckets聚合体以vector完成，以利于动态扩充：

template <class Value, class Key, class HashFcn, class ExtractKey, class EqualKey, class Alloc = alloc>
class hashtable {
public:
    typedef HashFcn hasher;
    typedef EqualKey key_equal;
    typedef size_t size_type;

private:
    hasher hash;
    key_equal equals;
    ExtractKey get_key;

    typedef __hashtable_node<Value> node;
    typedef simple_alloc<node, Alloc> node_allocator;

    vector<node*, Alloc> buckets;
    size_type num_elements;

public:
    size_type bucket_count() const { return buckets.size(); }
...
};

hashtable的模板参数相当多，包括：

Value：节点的实值型别
Key：节点的键值型别
HashFcn：hash function的函数型别
ExtractKey: 从节点中取出键值的方法(函数或仿函数)
EqualKey：判断键值相同与否的方法（函数或仿函数）
Alloc：空间配置器，缺省为std::alloc

后续会给出使用这些型别的样例。注意，先前谈及概念时，我们说hash function是计算元素位置的函数，SGI将这项任务赋予bkt_num(),由它调用hash function取得一个可执行modulus(取模)运算的值。稍后执行这项"计算元素位置"的任务时，我们会再次看到。

虽然开链法(seperate chaining)并不要求表格大小必须为质数，但SGI STL仍然以质数来设计表格大小，并且现将28个质数（逐渐呈现大约两倍的关系）计算好，以备随时访问，同时提供一个函数，用来查询在这28个质数中，“最接近某数并大于某数”的质数：

static const int __stl_num_primes = 28;
static const unsigned long __stl_prime_list[__stl_num_primes] = {
    53,    97,    193,    389,    769,
  1543,    3079,    6151,    12289,    24593,
 49157,    98317,    196613,    393241,    784633,
1572869,    3145739,    6291469,    12582917,    25165843,
50331653,    100663319,    201326611,    402653189,    805306457,
1610612741,    3221225473ul,    4294967291ul
};

inline unsigned long __stl_next_prime(unsigned long n) {
    const unsigned long* first = __stl_prime_list;
    const unsigned long* last = __stl_prime_list + __stl_num_primes;
    const unsigned long* pos = lower_bound(first, last, n);
    // 以上lower_bound是泛型算法，后续章节会介绍
    // lower_bound,序列需先排序。
    return pos == last ? *(last -1) : *pos;
}


size_type max_bucket_count() const {
    return __stl_prime_list[__stl_num_primes - 1];
}

hashtable的构造与内存管理

上一节hashtable定义中展现了一个专属的节点配置器：

typedef simple_alloc<node, Alloc>node_allocator;

下面是节点配置函数与节点释放函数：

node *new_node(const value_type& obj) {
    node* n = node_allocator::allocate();
    n->next = 0;
    __STL_TRY {
        construct(&n->val, obj);
        return n;
    }
    __STL_UNWIND(node_allocator::deallocate(n));
}

void delete_node(node* n) {
    destroy(&n->val);
    node_allocator::deallocate(n);
}

当我们初始构造一个拥有50个节点的hash table如下：

hashtable<int, int, hash<int>, identity<int>, equal_to<int>, alloc>
iht(50, hash<int>(), euqal_to<int>());

cout << iht.size() << endl;                // 0
cout << iht.bucket_count() << endl;        // 53. STL 提供的第一个质数

上述定义调用一下构造函数

hashtable(size_type n, const HashFcn& hlf, const EqualKey& eql)
: hash(half), equals(eql), get_key(Extrack_key()), num_elements(0) {
    initial_buckets(n); }

void initial_buckets(size_type n) {
    const size_type n_buckets = next_size(n);
    buckets.reserve(n_buckets);
    buckets.insert(buckets.end(), n_buckets, (node*)0);
    num_elements = 0;
}

其中next_size()返回最近并大于n的质数；

size_type next_size(size_type n) const { return __stl_next_prime(n); }

然后buckets vector保留空间，设定所有buckets的初值为0(NULL指针)。

插入操作(insert)与表格重整

当客户端开始插入元素(节点)时：

iht.insert_unique(59);

iht.insert_unique(63);

iht.insert_unique(108);

hash table内将进行一下操作：

pair<iterator, bool> insert_unique(const value_type& obj) {
    resize(num_elements + 1);
    return insert_unique_noresize(obj);
}

以下函数判断是否需要重建表格。如果不需要，立即回返，如果需要，就开始resize

template <class V, class K, class HF, class Ex, class Eq, class A> 
void hashtable<V, K, HF, Ex, Eq, A>::resize(size_type num_elements_hint) {
    const size_type old_n = buckets.size();
    if (num_elements_hint > old_n ) {
        const size_type n = next_size(num_elements_hint);
        if (n > old_n) {
            vector<node*, A> tmp(n, (node*)0);
            for (size_type bucket = 0; bucket < old_n; ++bucket) {
                node *first = buckets[bucket];
                while (first) {
                    size_type new_bucket = bkt_num(firs->val, n);

                    // 以下四个操作颇为微妙
                    // (1)令旧bucket指向其所对应之串行节点的下一个节点(以便于迭代处理)
                    buckets[bucket] = first->next;
                    // (2) (3) 将当前节点插入到新的bucket内，成为其对应串行的第一个节点
                    first->next = tmp[new_bucket];
                    tmp[new_bucket] = first;
                    // (4) 回到旧的所指的待处理串行，准备处理下一个节点
                    first = buckets[bucket]
                } // while (first)
            } // for
            buckets.swap(tmp);
        } // if (n > old_n)
    }
}

//在不需要重建表格的情况下插入新节点。键值不允许重复

template<class v, class K, class HF, class Ex, class Eq, class A>
pair<typename hashtable<V, K, HF, Ex, Eq, A>::iterator, bool>
hashtable<V, K, HF, Ex, Eq, A>::insert_unique_noresize(const value_type& obj) {
    const size_type n = bit_num(obj);
    node* first = buckets[n];
    for (node * cur = first; cur; cur = cur->next) {
        if (equals(get_key(cur->val), get_key(obj))) {
            return pair<iterator, bool>(iterator(cur, this), false);
        }
    }

    node *tmp = new_node(obj);
    tmp->next = first;
    buckets[n] = tmp;
    ++num_elements;
    return pair<iterator, bool>(iterator(tmp, this), true);
}

hashtable同时允许键值重复的的函数，其函数定义如下

iterator insert_equal(const value_type& obj) {
    resize(num_elements + 1);
    return insert_euqal_noresize(obj);
}

template <class V, class K, class HF, class Ex, class Eq, class A>
typename hashtable<V, K, HF, Ex, Eq, A>::iterator
hashtable<V, K, HF, Ex, Eq, A>::insert_equal_noresize(const value_type& obj) {
    const size_type n = bkt_num(obj);
    node* first = buckets[n];
    
    // 此操作为了后续遍历时，相同key的元素总是连续出现，如果不做这个要求，该循环可以不处理，直接将元素插入list的头部，效率会更高
    for(node* cur = first; cur; cur = cur->next) {
        if (equals(get_key(cur->val), get_key(obj))) {
            node *tmp = new_node(obj);
            tmp->next = cur->next;
            cur->next = tmp;
            ++num_elements;
            return iterator(tmp, this); 
        }
    }
    node *tmp = new_node(obj);
    tmp->next = first;
    buckets[n] = tmp;
    ++num_elements;
    return iterator(tmp, this);
    
}

判知元素的落脚处(bkt_num)

本节程序代码在许多地方都需要知道某个元素值落脚于哪一个bucket之内。这本来是hash function的责任，SGI把这个任务包装了一层，先交给bkt_num()函数，再由此函数调用hash function，取得一个可执行modulus（取模）运算的数值。为什么要这么做？因为有些型别无法直接拿来对hashtable的大小进行模运算，例如字符字符串const char*，这时候我们需要做一些转换。下面是bkt_num函数，

// 版本1：接受实值(value)，和buckets个数
size_type bkt_num(const value_type& obj, size_t n) {
    return bkt_num_key(get_key(obj), n); //调用版本4
}

// 版本2：只接受实值(value)
size_type bkt_num(const value_type& obj) const {
    return bkt_num_key(get_key(obj));
}

// 版本3：只接受键值
size_type bkt_num_key(const key_type& key) const {
    return bkt_num_key(key, buckets.size());    // 调用版本4
}

// 版本4：接受键值和buckets个数
size_type bkt_num_key(const key_type&key, size_t n) const {
    return hash(key) % n;                         // SGI所有内建hash函数后续会进行说明
}

复制(copy_from)和整体删除(clear)

由于整个hash table由vector和linked-list组合而成，因此，复制和整体删除，都需要特别注意内存的释放问题。下面是hashtable提供的两个相关函数：

template <class V, class K, class HF, class Ex, class Eq, class A>
void hashtable<V, K, HF, Ex, Eq, A>::clear() {
    for (size_type i = 0; i < buckets.size(); ++i) {
        node *cur = buckets[i];
        while(cur != 0) {
            node *next = cur->next;
            delete_node(cur);
            cur = next;
        } // while
        buckets[i]=0;
    } // for
    num_elements=0;
    // 注意，buckets vector并未释放掉空间，仍保留原来大小
}

template <class V, class K, class HF, class Ex, class Eq, class A> 
void hashtable<V, K, HF, Ex, Eq, A>::copy_from(const hashtable& ht) {
    buckets.clear();

    buckets.reserve(ht.buckets.size());

    buckets.insert(buckets.end(), ht.buckets.size(), (node*)0);
    
    __STL_TRY {
        for (size_type i=0; i<bt.buckets.size(); ++i) {
            if (const node* cur = ht.buckets[i]) {
                node *copy = new_node(cur->val);
                buckets[i] = copy;
                for (node* next = cur->next; next; cur=next, next = cur->next) {
                    copy->next = new_node(next->val);
                    copy = copy->next;                

                } // for

            } // if
        } // for
        num_elements = ht.num_elements;
    }
    __STL_UNWIND(clear());
}

hash functions

<stl_hash_fun.h>定义有数个现成的hash functions，全部是仿函数。先前谈及概念时，我们说hash function是计算元素位置的函数，SGI将这项任务赋予了先前提过的bkt_num,再由它来调用这里提供的hash function，取得一个可以对hashtable进行模运算的值。针对char，int，long等整型型别，这里大部分的hash function什么也没做，只是忠实的返回原值。但对于字符串(const char*),就设计了一个转换函数如下：

template <class Key> struct hash {};

inline size_t __stl_hash_string(const char*s) {
    unsigned long h = 0;
    for (; *s; ++s) {
        h = 5*h + *s;
    }
    return size_t(h);
}

__STL_TEMPLATE_NULL struct hash<char*> {
    size_t operator()(const char*s) const { return __stl_hash_string(s); }
};

__STL_TEMPLATE_NULL struct hash<const char*> {
    size_t operator()(const char*s) const { return __stl_hash_string(s); }
};

__STL_TEMPLATE_NULL struct hash<char> {
    size_t operator()(char x) const { return x; }
};
__STL_TEMPLATE_NULL struct hash<unsigned char> {
    size_t operator()(unsigned char x) const { return x; }
};
__STL_TEMPLATE_NULL struct hash<signed char> {
    size_t operator()(unsigned char x) const { return x; }
};

__STL_TEMPLATE_NULL struct hash<short> {
    size_t operator()(short x) const { return x; }
};
__STL_TEMPLATE_NULL struct hash<unsigned short> {
    size_t operator()(unsigned short x) const { return x; }
};


__STL_TEMPLATE_NULL struct hash<int> {
    size_t operator()(int x) const { return x; }
};
__STL_TEMPLATE_NULL struct hash<unsigned int> {
    size_t operator()(unsigned int x) const { return x; }
};

__STL_TEMPLATE_NULL struct hash<long> {
    size_t operator()(long x) const { return x; }
};
__STL_TEMPLATE_NULL struct hash<unsigned long> {
    size_t operator()(unsigned long x) const { return x; }
};

由此可见，SGI hashtable无法处理上述所列各型别以外的元素，例如string，double， float，预处理这些型别，用户必须自定义他们的hash function。下面直接以SGI hashtable处理string所获得的错误现象：

#include <hash_set>
#include <iostream>
#include <string>
using namespace std;
using namespace __gnu_cxx;

int main() {
    hash_set<string> shs;
    hash_set<double> dhs;
    
    shs.insert(string("chnming"));
    dhs.insert(15.0);

    return 0;
}

以上代码编译会报错（以上代码在gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0进行编译，可能实现上会有些差异）