Mastering Recommendation Algorithms 27: Behavior Sequence Modeling with BST (Code Implementation)

news2025/4/15 19:57:47

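1 Introduction

The previous article, "Mastering Recommendation Algorithms 26: Behavior Sequence Modeling with BST (Modeling User Behavior Sequences with a Transformer)", covered the background and model structure of BST. This article gives a code implementation for reference.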

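2 BST Core Code

Transformer has become an essential skill for algorithm engineers, so this section gives the code for the Transformer layer of BST. The code is built on the Keras deep learning library, version 2.10.

First, the Multi-Head Self Attention implementation.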

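import tensorflow as tf
from tensorflow import keras


class MultiHeadSelfAttention(keras.layers.Layer):
    def __init__(self, emb_dim, num_heads=4, **kwargs):
        """
        Build a multi-head self-attention layer.
        @param emb_dim: input embedding dimension; must be an integer multiple of num_heads
        @param num_heads: number of heads, typically 8, 4, 2, etc.
        """
        # Call the base constructor first so that Keras can track the sub-layers assigned below
        super(MultiHeadSelfAttention, self).__init__(**kwargs)
        if emb_dim % num_heads != 0:
            raise ValueError("emb_dim must be an integer multiple of num_heads")
        self.emb_dim = emb_dim
        self.num_heads = num_heads

        # Weight matrices for q, k and v; projecting to emb_dim lets the result split evenly across heads
        self.w_query = keras.layers.Dense(emb_dim)
        self.w_key = keras.layers.Dense(emb_dim)
        self.w_value = keras.layers.Dense(emb_dim)

        # Vector dimension inside each head: total dimension divided by the number of heads
        self.dim_one_head = emb_dim // num_heads

        # Dense layer that combines the heads. Its output has the same dimension as the
        # original input, so a residual connection can be applied.
        self.w_combine_heads = keras.layers.Dense(emb_dim)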

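    def attention(self, query, key, value):
        """
        Attention computation: softmax((q * k^T) / sqrt(dk)) * v
        @param query: q vectors
        @param key: k vectors
        @param value: v vectors
        @return: attention output and attention weights
        """
        # 1 Inner product: relevance score between q and k, one scalar per (q, k) pair
        score = tf.matmul(query, key, transpose_b=True)

        # 2 Dimension of the key vectors, dk (size of the last axis of key)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)

        # 3 Scale by sqrt(dk), which normalizes for the vector dimension
        score = score / tf.math.sqrt(dim_key)

        # 4 Softmax normalization, so that the weights lie between 0 and 1
        att_scores = tf.nn.softmax(score, axis=-1)

        # 5 Multiply the weights by each v vector and sum; the result is one vector per
        #   position, with the same dimension as q, k and v
        att_output = tf.matmul(att_scores, value)
        return att_output, att_scores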

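    def build_multi_head(self, x_input, batch_size):
        """
        Split the hidden vector across the heads, so multiple heads do not multiply the dimension
        @param x_input: input tensor; can be the q, k or v vectors
        @param batch_size: batch size
        @return: multi-head tensor, [batch_size, num_heads, seq_len, dim_one_head]
        """
        x_input = tf.reshape(x_input, shape=(batch_size, -1, self.num_heads, self.dim_one_head))
        return tf.transpose(x_input, perm=[0, 2, 1, 3])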

    def call(self, inputs, **kwargs):
        """
        Multi-head self-attention computation
        @param inputs: list [x_query, x_key]; pass the same tensor twice for self-attention
        @param kwargs:
        @return: tensor of shape [batch_size, seq_len, emb_dim]
        """
        x_query = inputs[0]
        x_key = inputs[1]
        batch_size = tf.shape(x_query)[0]

        # q vectors: linear projection of the raw input, then split into heads
        query = self.w_query(x_query)
        query = self.build_multi_head(query, batch_size)

        # k vectors
        key = self.w_key(x_key)
        key = self.build_multi_head(key, batch_size)

        # v vectors; v and k share the same input
        value = self.w_value(x_key)
        value = self.build_multi_head(value, batch_size)

        # attention computation
        att_output, att_scores = self.attention(query, key, value)
        att_output = tf.transpose(att_output, perm=[0, 2, 1, 3])  # [batch_size,seq_len,num_heads,dim_one_head]
        att_output = tf.reshape(att_output, shape=(batch_size, -1, self.dim_one_head * self.num_heads))

        # Merge the heads and apply the output linear projection
        output = self.w_combine_heads(att_output) # [batch_size,seq_len,emb_dim]
        return output
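
The reshape and transpose in build_multi_head, and the inverse steps at the end of call, can be checked in isolation. The following is a minimal NumPy sketch rather than the Keras code itself; the helper names split_heads and merge_heads are hypothetical and chosen only for this illustration.

```python
import numpy as np


def split_heads(x, num_heads):
    """[batch, seq_len, dim] -> [batch, num_heads, seq_len, dim // num_heads]."""
    batch, seq_len, dim = x.shape
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    x = x.reshape(batch, seq_len, num_heads, dim // num_heads)
    return x.transpose(0, 2, 1, 3)


def merge_heads(x):
    """[batch, num_heads, seq_len, dim_one_head] -> [batch, seq_len, dim]."""
    batch, num_heads, seq_len, dim_one_head = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * dim_one_head)


# batch=2, seq_len=3, dim=8, split into 4 heads of size 2
x = np.arange(2 * 3 * 8, dtype=np.float32).reshape(2, 3, 8)
heads = split_heads(x, num_heads=4)

print(heads.shape)                             # (2, 4, 3, 2)
print(heads[0, 1, 0])                          # [2. 3.]
print(np.array_equal(merge_heads(heads), x))   # True
```

Each head only sees a contiguous slice of the projected vector (head 1 of the first position holds elements 2 and 3 here), which is why splitting into heads does not increase the total dimension.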

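The code follows the standard Keras pattern for a custom Layer. Focus on the call function and the attention function. The call function first passes the raw input through linear projections to obtain the query, key and value vectors. It then calls the attention function to compute the output of each head. Finally, one more linear layer fuses the heads and produces an output vector with the same dimension as the input. The attention function proceeds as follows:

1. Compute the inner product of query and key to obtain their relevance weights.
2. Get the dimension of the key vectors (query and key have the same dimension), which the scaling step needs.
3. Scale the weights from step 1 by the square root of that dimension, which normalizes for the vector dimension.
4. Apply Softmax normalization, so that the weights lie between 0 and 1.
5. Multiply the weights by each v vector and sum the results, yielding one vector with the same dimension as q, k and v.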
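
The five steps above correspond to softmax(Q K^T / sqrt(d_k)) V. As a hedged illustration, here is a small NumPy sketch of the same computation; it mirrors the TensorFlow ops but is not the Keras layer, and the function names and tensor sizes are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(query, key, value):
    score = query @ np.swapaxes(key, -1, -2)   # step 1: q . k for every pair of positions
    dim_key = key.shape[-1]                    # step 2: dimension d_k of the key vectors
    score = score / np.sqrt(dim_key)           # step 3: scale by sqrt(d_k)
    weights = softmax(score, axis=-1)          # step 4: weights in [0, 1], summing to 1
    return weights @ value, weights            # step 5: weighted sum of the v vectors


rng = np.random.default_rng(0)
# [batch, num_heads, seq_len, dim_one_head]
q = rng.normal(size=(2, 4, 5, 16))
k = rng.normal(size=(2, 4, 5, 16))
v = rng.normal(size=(2, 4, 5, 16))

out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)                  # (2, 4, 5, 16) (2, 4, 5, 5)
print(np.allclose(w.sum(axis=-1), 1.0))    # True
```

The weight tensor has one seq_len by seq_len matrix per head, and every row sums to 1, so each output vector is a convex combination of the value vectors.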

Next, the overall Transformer implementation.

class Transformer(keras.layers.Layer):
    def __init__(self, seq_len, emb_dim, num_heads=4, ff_dim=128, **kwargs):
        # Call the base constructor first so that Keras can track the sub-layers assigned below
        super(Transformer, self).__init__(**kwargs)

        # position emb: the position encoding vectors
        self.seq_len = seq_len
        self.emb_dim = emb_dim
        self.positions_embedding = keras.layers.Embedding(seq_len, self.emb_dim)

        # multi-head self attention layer
        self.att = MultiHeadSelfAttention(self.emb_dim, num_heads)

        # two-layer feed-forward network
        self.ffn = keras.Sequential([keras.layers.Dense(ff_dim, activation="relu"),
                                     keras.layers.Dense(self.emb_dim)])

        # Two LayerNorms: one after the residual connection between the input and the attention
        # output, the other after the residual between the attention output and the feed-forward output
        self.ln1 = keras.layers.LayerNormalization()
        self.ln2 = keras.layers.LayerNormalization()
        self.dropout1 = keras.layers.Dropout(0.3)
        self.dropout2 = keras.layers.Dropout(0.3)

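    def call(self, inputs, **kwargs):
        """
        Structure: one multi-head self attention layer followed by a two-layer feed-forward network, with LayerNorm and residual connections in between
        @param inputs: raw input vectors; may include the Embeddings of features such as item id, category id and brand id
        @param kwargs:
        @return: output of a single Transformer layer
        """
        # 1 Build the position features and embed them
        positions = tf.range(start=0, limit=self.seq_len, delta=1)
        positions_embedding = self.positions_embedding(positions)

        # 2 Add the raw input and the position vectors; the Concat method from the BST paper also works
        x_key = inputs + positions_embedding

        # 3 multi-head self attention on the position-aware input (query and key/value share it),
        #   then a residual connection with the input followed by LayerNorm
        att_output = self.att([x_key, x_key])
        att_output = self.dropout1(att_output)
        output1 = self.ln1(x_key + att_output)

        # 4 two-layer feed-forward network, then a residual connection with its input followed by LayerNorm
        ffn_output = self.ffn(output1)
        ffn_output = self.dropout2(ffn_output)
        output2 = self.ln2(output1 + ffn_output)   # [B, seq_len, emb_dim]

        # 5 Average pooling to compress the sequence
        result = tf.reduce_mean(output2, axis=1)   # [B, emb_dim]
        return result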

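Focus on the call function. Its steps are:

1. Build the position features and embed them. For convenience, the position index is used here to represent temporal order, which differs slightly from BST.
2. Add the raw input vectors and the position vectors to obtain the input of the Transformer layer. The Concat method from the BST paper can also be used.
3. Feed the Transformer-layer input into the Multi-Head Self Attention module, apply dropout, add the result to the input through a residual connection, and then apply LayerNorm. This gives the output of the Attention module.
4. Feed the Attention output into the Feed-Forward module, a two-layer fully connected network. Apply dropout, add the result to the Attention output through a residual connection, and then apply LayerNorm. This gives the output of the Feed-Forward module.
5. Average-pool the vectors output by the Feed-Forward module, compressing the sequence into a single fixed-length vector.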
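
To make the shape flow through these steps concrete, here is a rough NumPy sketch under stated assumptions: the attention output is replaced by a random stand-in tensor, the weights are random rather than learned, dropout is omitted, and layer_norm leaves out the learnable scale and offset that keras.layers.LayerNormalization has. It illustrates the structure only, not the trained layer.

```python
import numpy as np


def layer_norm(x, eps=1e-3):
    # Normalize over the last axis (no learnable gamma/beta in this sketch)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
batch, seq_len, emb_dim, ff_dim = 2, 5, 8, 16

inputs = rng.normal(size=(batch, seq_len, emb_dim))
pos_emb = rng.normal(size=(seq_len, emb_dim))  # stand-in for the position embedding table

# Steps 1-2: add position vectors (broadcast over the batch axis) ...
x = inputs + pos_emb
# ... or concatenate them instead, which doubles the last dimension
x_concat = np.concatenate([inputs, np.broadcast_to(pos_emb, inputs.shape)], axis=-1)

# Step 3: residual connection around attention, then LayerNorm
att_output = rng.normal(size=x.shape)          # stand-in for the attention output
output1 = layer_norm(x + att_output)

# Step 4: two-layer feed-forward (ReLU in between), residual connection, then LayerNorm
w1 = rng.normal(size=(emb_dim, ff_dim))
w2 = rng.normal(size=(ff_dim, emb_dim))
ffn_output = np.maximum(output1 @ w1, 0.0) @ w2
output2 = layer_norm(output1 + ffn_output)

# Step 5: average pooling over the sequence axis
result = output2.mean(axis=1)

print(x_concat.shape, output2.shape, result.shape)  # (2, 5, 16) (2, 5, 8) (2, 8)
```

With addition the embedding dimension stays at emb_dim, so the residual connections line up without extra projection. With concatenation the layer would have to be built for the doubled dimension.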

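3 BST Summary and Reflections

BST uses a Transformer to model user behavior sequences. This effectively addresses a range of problems that traditional RNN models are prone to, such as vanishing gradients and slow serial computation, and it delivered solid business results. Transformer is now widely used in search, recommendation and advertising, especially for modeling user behavior sequences. As a complete deployment of Transformer in a large-scale industrial setting, BST is well worth studying for how it handles the details and configures its parameters.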