【深度学习】注意力机制(三)

news2025/1/12 0:45:04

本文介绍一些注意力机制的实现,包括EMHSA/SA/SGE/AFT/Outlook Attention。

【深度学习】注意力机制(一)

【深度学习】注意力机制(二)

【深度学习】注意力机制(四)

【深度学习】注意力机制(五)​​​​​​​

目录

一、EMHSA(Efficient Multi-Head Self-Attention)

二、SA(SHUFFLE ATTENTION)

三、SGE(Spatial Group-wise Enhance)

四、AFT(Attention Free Transformer)

五、Outlook Attention


一、EMHSA(Efficient Multi-Head Self-Attention)

论文:论文地址

如下图:

代码(代码连接):

import numpy as np
import torch
from torch import nn
from torch.nn import init



class EMSA(nn.Module):

    def __init__(self, d_model, d_k, d_v, h,dropout=.1,H=7,W=7,ratio=3,apply_transform=True):

        super(EMSA, self).__init__()
        self.H=H
        self.W=W
        self.fc_q = nn.Linear(d_model, h * d_k)
        self.fc_k = nn.Linear(d_model, h * d_k)
        self.fc_v = nn.Linear(d_model, h * d_v)
        self.fc_o = nn.Linear(h * d_v, d_model)
        self.dropout=nn.Dropout(dropout)

        self.ratio=ratio
        if(self.ratio>1):
            self.sr=nn.Sequential()
            self.sr_conv=nn.Conv2d(d_model,d_model,kernel_size=ratio+1,stride=ratio,padding=ratio//2,groups=d_model)
            self.sr_ln=nn.LayerNorm(d_model)

        self.apply_transform=apply_transform and h>1
        if(self.apply_transform):
            self.transform=nn.Sequential()
            self.transform.add_module('conv',nn.Conv2d(h,h,kernel_size=1,stride=1))
            self.transform.add_module('softmax',nn.Softmax(-1))
            self.transform.add_module('in',nn.InstanceNorm2d(h))

        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        self.h = h

        self.init_weights()


    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant_(m.weight, 1)
                init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant_(m.bias, 0)

    def forward(self, queries, keys, values, attention_mask=None, attention_weights=None):

        b_s, nq ,c = queries.shape
        nk = keys.shape[1]

        q = self.fc_q(queries).view(b_s, nq, self.h, self.d_k).permute(0, 2, 1, 3)  # (b_s, h, nq, d_k)

        if(self.ratio>1):
            x=queries.permute(0,2,1).view(b_s,c,self.H,self.W) #bs,c,H,W
            x=self.sr_conv(x) #bs,c,h,w
            x=x.contiguous().view(b_s,c,-1).permute(0,2,1) #bs,n',c
            x=self.sr_ln(x)
            k = self.fc_k(x).view(b_s, -1, self.h, self.d_k).permute(0, 2, 3, 1)  # (b_s, h, d_k, n')
            v = self.fc_v(x).view(b_s, -1, self.h, self.d_v).permute(0, 2, 1, 3)  # (b_s, h, n', d_v)
        else:
            k = self.fc_k(keys).view(b_s, nk, self.h, self.d_k).permute(0, 2, 3, 1)  # (b_s, h, d_k, nk)
            v = self.fc_v(values).view(b_s, nk, self.h, self.d_v).permute(0, 2, 1, 3)  # (b_s, h, nk, d_v)

        if(self.apply_transform):
            att = torch.matmul(q, k) / np.sqrt(self.d_k)  # (b_s, h, nq, n')
            att = self.transform(att) # (b_s, h, nq, n')
        else:
            att = torch.matmul(q, k) / np.sqrt(self.d_k)  # (b_s, h, nq, n')
            att = torch.softmax(att, -1) # (b_s, h, nq, n')


        if attention_weights is not None:
            att = att * attention_weights
        if attention_mask is not None:
            att = att.masked_fill(attention_mask, -np.inf)
        
        att=self.dropout(att)

        out = torch.matmul(att, v).permute(0, 2, 1, 3).contiguous().view(b_s, nq, self.h * self.d_v)  # (b_s, nq, h*d_v)
        out = self.fc_o(out)  # (b_s, nq, d_model)
        return out

二、SA(SHUFFLE ATTENTION)

论文:论文地址

如下图:

代码如下(代码连接):

import numpy as np
import torch
from torch import nn
from torch.nn import init
from torch.nn.parameter import Parameter


class ShuffleAttention(nn.Module):

    def __init__(self, channel=512,reduction=16,G=8):
        super().__init__()
        self.G=G
        self.channel=channel
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.gn = nn.GroupNorm(channel // (2 * G), channel // (2 * G))
        self.cweight = Parameter(torch.zeros(1, channel // (2 * G), 1, 1))
        self.cbias = Parameter(torch.ones(1, channel // (2 * G), 1, 1))
        self.sweight = Parameter(torch.zeros(1, channel // (2 * G), 1, 1))
        self.sbias = Parameter(torch.ones(1, channel // (2 * G), 1, 1))
        self.sigmoid=nn.Sigmoid()


    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                init.constant_(m.weight, 1)
                init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                init.normal_(m.weight, std=0.001)
                if m.bias is not None:
                    init.constant_(m.bias, 0)


    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.reshape(b, groups, -1, h, w)
        x = x.permute(0, 2, 1, 3, 4)

        # flatten
        x = x.reshape(b, -1, h, w)

        return x

    def forward(self, x):
        b, c, h, w = x.size()
        #group into subfeatures
        x=x.view(b*self.G,-1,h,w) #bs*G,c//G,h,w

        #channel_split
        x_0,x_1=x.chunk(2,dim=1) #bs*G,c//(2*G),h,w

        #channel attention
        x_channel=self.avg_pool(x_0) #bs*G,c//(2*G),1,1
        x_channel=self.cweight*x_channel+self.cbias #bs*G,c//(2*G),1,1
        x_channel=x_0*self.sigmoid(x_channel)

        #spatial attention
        x_spatial=self.gn(x_1) #bs*G,c//(2*G),h,w
        x_spatial=self.sweight*x_spatial+self.sbias #bs*G,c//(2*G),h,w
        x_spatial=x_1*self.sigmoid(x_spatial) #bs*G,c//(2*G),h,w

        # concatenate along channel axis
        out=torch.cat([x_channel,x_spatial],dim=1)  #bs*G,c//G,h,w
        out=out.contiguous().view(b,-1,h,w)

        # channel shuffle
        out = self.channel_shuffle(out, 2)
        return out

三、SGE(Spatial Group-wise Enhance)

论文:Spatial Group-wise Enhance: Improving Semanti

如下图:

代码如下(代码连接):

import torch
import torch.nn as nn


class SpatialGroupEnhance(nn.Module):
    def __init__(self, groups = 64):
        super(SpatialGroupEnhance, self).__init__()
        self.groups   = groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.weight   = Parameter(torch.zeros(1, groups, 1, 1))
        self.bias     = Parameter(torch.ones(1, groups, 1, 1))
        self.sig      = nn.Sigmoid()

    def forward(self, x): # (b, c, h, w)
        b, c, h, w = x.size()
        x = x.view(b * self.groups, -1, h, w) 
        xn = x * self.avg_pool(x)
        xn = xn.sum(dim=1, keepdim=True)
        t = xn.view(b * self.groups, -1)
        t = t - t.mean(dim=1, keepdim=True)
        std = t.std(dim=1, keepdim=True) + 1e-5
        t = t / std
        t = t.view(b, self.groups, h, w)
        t = t * self.weight + self.bias
        t = t.view(b * self.groups, 1, h, w)
        x = x * self.sig(t)
        x = x.view(b, c, h, w)
        return x

四、AFT(Attention Free Transformer)

论文:An Attention Free Transformer

如下图:

代码如下(代码连接):

import torch, math
from torch import nn, einsum
import torch.nn.functional as F    

class AFTFull(nn.Module):
    def __init__(self, max_seqlen, dim, hidden_dim=64):
        super().__init__()
        '''
        max_seqlen: the maximum number of timesteps (sequence length) to be fed in
        dim: the embedding dimension of the tokens
        hidden_dim: the hidden dimension used inside AFT Full

        Number of heads is 1 as done in the paper
        '''
        self.dim = dim
        self.hidden_dim = hidden_dim
        self.to_q = nn.Linear(dim, hidden_dim)
        self.to_k = nn.Linear(dim, hidden_dim)
        self.to_v = nn.Linear(dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, dim)
        self.wbias = nn.Parameter(torch.Tensor(max_seqlen, max_seqlen))
        nn.init.xavier_uniform_(self.wbias)

    def forward(self, x):
        B, T, _ = x.shape
        Q = self.to_q(x).view(B, T, self.hidden_dim)
        K = self.to_k(x).view(B, T, self.hidden_dim)
        V = self.to_v(x).view(B, T, self.hidden_dim)
        temp_wbias = self.wbias[:T, :T].unsqueeze(0) # sequences can still be variable length

        '''
        From the paper
        '''
        Q_sig = torch.sigmoid(Q)
        temp = torch.exp(temp_wbias) @ torch.mul(torch.exp(K), V)
        weighted = temp / (torch.exp(temp_wbias) @ torch.exp(K))
        Yt = torch.mul(Q_sig, weighted)

        Yt = Yt.view(B, T, self.hidden_dim)
        Yt = self.project(Yt)

        return Yt

class AFTSimple(nn.Module):
    def __init__(self, max_seqlen, dim, hidden_dim=64):
        super().__init__()
        '''
        max_seqlen: the maximum number of timesteps (sequence length) to be fed in
        dim: the embedding dimension of the tokens
        hidden_dim: the hidden dimension used inside AFT Full
        
        Number of Heads is 1 as done in the paper.
        '''
        self.dim = dim
        self.hidden_dim = hidden_dim
        self.to_q = nn.Linear(dim, hidden_dim)
        self.to_k = nn.Linear(dim, hidden_dim)
        self.to_v = nn.Linear(dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        B, T, _ = x.shape
        Q = self.to_q(x).view(B, T, self.hidden_dim)
        K = self.to_k(x).view(B, T, self.hidden_dim)
        V = self.to_v(x).view(B, T, self.hidden_dim)

        '''
        From the paper
        '''
        weights = torch.mul(torch.softmax(K, 1), V).sum(dim=1, keepdim=True)
        Q_sig = torch.sigmoid(Q)
        Yt = torch.mul(Q_sig, weights)

        Yt = Yt.view(B, T, self.hidden_dim)
        Yt = self.project(Yt)

        return Yt

class AFTLocal(nn.Module):
    def __init__(self, max_seqlen, dim, hidden_dim=64, s=256):
        super().__init__()
        '''
        max_seqlen: the maximum number of timesteps (sequence length) to be fed in
        dim: the embedding dimension of the tokens
        hidden_dim: the hidden dimension used inside AFT Full
        s: the window size used for AFT-Local in the paper

        Number of heads is 1 as done in the paper
        '''
        self.dim = dim
        self.hidden_dim = hidden_dim
        self.to_q = nn.Linear(dim, hidden_dim)
        self.to_k = nn.Linear(dim, hidden_dim)
        self.to_v = nn.Linear(dim, hidden_dim)
        self.project = nn.Linear(hidden_dim, dim)
        self.wbias = nn.Parameter(torch.Tensor(max_seqlen, max_seqlen))
        self.max_seqlen = max_seqlen
        self.s = s
        nn.init.xavier_uniform_(self.wbias)


    def forward(self, x):
        B, T, _ = x.shape
        Q = self.to_q(x).view(B, T, self.hidden_dim)
        K = self.to_k(x).view(B, T, self.hidden_dim)
        V = self.to_v(x).view(B, T, self.hidden_dim)
        self.wbias = nn.Parameter(torch.Tensor([
            [self.wbias[i][j] if math.fabs(i-j) < self.s else 0 for j in range(self.max_seqlen)] 
            for i in range(self.max_seqlen)
            ]))
        temp_wbias = self.wbias[:T, :T].unsqueeze(0) # sequences can still be variable length

        '''
        From the paper
        '''
        Q_sig = torch.sigmoid(Q)
        temp = torch.exp(temp_wbias) @ torch.mul(torch.exp(K), V)
        weighted = temp / (torch.exp(temp_wbias) @ torch.exp(K))
        Yt = torch.mul(Q_sig, weighted)

        Yt = Yt.view(B, T, self.hidden_dim)
        Yt = self.project(Yt)

        return Yt

五、Outlook Attention

论文:VOLO: Vision Outlooker for Visual Recognition

如下图:

代码如下(代码连接):

import torch
import torch.nn as nn


class OutlookAttention(nn.Module):
    """
    Implementation of outlook attention
    --dim: hidden dim
    --num_heads: number of heads
    --kernel_size: kernel size in each window for outlook attention
    return: token features after outlook attention
    """

    def __init__(self, dim, num_heads, kernel_size=3, padding=1, stride=1,
                 qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        head_dim = dim // num_heads
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        self.padding = padding
        self.stride = stride
        self.scale = qk_scale or head_dim**-0.5

        self.v = nn.Linear(dim, dim, bias=qkv_bias)
        self.attn = nn.Linear(dim, kernel_size**4 * num_heads)

        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

        self.unfold = nn.Unfold(kernel_size=kernel_size, padding=padding, stride=stride)
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride, ceil_mode=True)

    def forward(self, x):
        B, H, W, C = x.shape

        v = self.v(x).permute(0, 3, 1, 2)  # B, C, H, W

        h, w = math.ceil(H / self.stride), math.ceil(W / self.stride)
        v = self.unfold(v).reshape(B, self.num_heads, C // self.num_heads,
                                   self.kernel_size * self.kernel_size,
                                   h * w).permute(0, 1, 4, 3, 2)  # B,H,N,kxk,C/H

        attn = self.pool(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        attn = self.attn(attn).reshape(
            B, h * w, self.num_heads, self.kernel_size * self.kernel_size,
            self.kernel_size * self.kernel_size).permute(0, 2, 1, 3, 4)  # B,H,N,kxk,kxk
        attn = attn * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).permute(0, 1, 4, 3, 2).reshape(
            B, C * self.kernel_size * self.kernel_size, h * w)
        x = F.fold(x, output_size=(H, W), kernel_size=self.kernel_size,
                   padding=self.padding, stride=self.stride)

        x = self.proj(x.permute(0, 2, 3, 1))
        x = self.proj_drop(x)

        return x

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1319314.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

PCB设计规则中的经验公式_笔记

PCB设计规则中的经验公式 规则1 - 临界长度规则2 - 信号带宽与上升时间规则3- 时钟信号带宽规则4-信号传输速度规则5- 集肤 (效应) 深度规则6 - 50Ω传输线电容规则7 - 50Ω传输线电感规则8 - 回流路径电感规则9 - 地弹噪声规则10- 串行传输比特率与信号带宽规则11- PCB走线直流…

HR人才测评,招聘企业中高层管理的岗位胜任力测评方案

不管是哪一个企业&#xff0c;中高层管理都是企业的核心层&#xff0c;在对这部分人才进行测评方案制定的时候&#xff0c;则要更加细致谨慎一些&#xff0c;避免出现人才录用失误的情况。 中高层管理人员是公司的支柱&#xff0c;需要具备的素质主要偏向于管理能力、综合素质…

CyclicBarrier学习一

一、定义 CyclicBarrier 字面意思回环栅栏&#xff08;循环屏障&#xff09;&#xff0c;通过它可以实现让一组线程等待至某个状态&#xff08;屏障点&#xff09;之后再全部同时执行。叫做回环是因为当所有等待线程都被释放以后&#xff0c;CyclicBarrier可以被重用。 CyclicB…

vue写了这么久了您是否知道:为什么data属性是一个函数而不是一个对象?

一、实例和组件定义data的区别 vue实例的时候定义data属性既可以是一个对象&#xff0c;也可以是一个函数 const app new Vue({el:"#app",// 对象格式data:{foo:"foo"},// 函数格式data(){return {foo:"foo"}} })组件中定义data属性&#xff…

信号与线性系统翻转课堂笔记4——连续LTI系统的微分方程模型与求解

信号与线性系统翻转课堂笔记4——连续LTI系统的微分方程模型与求解 The Flipped Classroom4 of Signals and Linear Systems 对应教材&#xff1a;《信号与线性系统分析&#xff08;第五版&#xff09;》高等教育出版社&#xff0c;吴大正著 一、要点 &#xff08;1&#x…

Zotero攻略

给大家分享一下我对于Zotero的使用。 1、下载链接 Zotero | Your personal research assistant 进入后直接下载即可 2、一些好用的插件 &#xff08;1&#xff09;Zotero Connector 下载地址&#xff1a;Zotero | Connectors 超级好用&#xff01;不用一篇一篇下PDF了&am…

geemap学习笔记028:Landsat8计算时间序列NDVI并导出

前言 本节则是以Landsat8影像数据为例&#xff0c;进行NDVI时间序列计算&#xff0c;并将得到的时间序列NDVI进行展示并导出。 1 导入库并显示地图 import ee import geemap import datetime import pandas as pd import os ee.Initialize()2 定义时间范围 # 定义日期范围 …

记录 | Visual Studio报错:const char*类型的值不能用于初始化char*类型

Visual Studio 报错&#xff1a; const char *”类型的值不能用于初始化“char *”类型的实体错误 解决办法&#xff1a; 1&#xff0c;强制类型转换&#xff0c;例如&#xff1a; char * Singer::pv[] {(char*)"other", (char*)"alto", (char*)"c…

【网络协议】网络运维管理神经-SNMP协议

文章目录 什么是SNMP&#xff1f;SNMP的组件SNMP的历史版本SNMP端口SNMP配置案例SNMP工作原理SNMP的基本工作原理SNMP的操作类型SNMP TrapsSNMP Inform SNMP的应用场景推荐阅读 什么是SNMP&#xff1f; SNMP&#xff08;Simple Network Management Protocol&#xff0c;简单网…

学习黑马vue

项目分析 项目下载地址&#xff1a;vue-admin-template-master: 学习黑马vue 项目下载后没有环境可参考我的篇文章&#xff0c;算是比较详细&#xff1a;vue安装与配置-CSDN博客 安装这两个插件可格式化代码&#xff0c;vscode这个软件是免费的&#xff0c;官网&#xff1a;…

2023年OceanBase开发者大会-核心PPT资料下载

一、峰会简介 2023年OceanBase开发者大会主要涵盖了OceanBase的最新技术进展、产品更新以及开发者工具的发布。大会发布了OceanBase 4.1版本&#xff0c;公布了两大友好工具&#xff0c;升级了文档的易用性&#xff0c;并统一了企业版和社区版的代码分支。这些举措全面呈现了O…

2017年AMC8数学竞赛中英文真题典型考题、考点分析和答案解析

从战争中学习战争最有效。前几天&#xff0c;六分成长分析了2023年、2022年、2020、2019、2018年的AMC8的典型考题、考点和详细答案解析。 今天继续为大家分享2017年的AMC8的五道典型考题&#xff0c;所有的这些试题六分成长独家制作了在线版本&#xff0c;适合各种终端和设备…

Android Compose Transition 动画

Transition 是一种动画效果&#xff0c;用于在组件的状态之间进行平滑的过渡。它可以帮助我们在状态变化时&#xff0c;以一种流畅的方式更新 UI。通过使用 Compose 的 Transition API&#xff0c;您可以在应用中创建各种各样的动画效果&#xff0c;从而增强用户体验并提高应用…

智能五子棋1

*一、项目需求* 五子棋是一种简单的黑白棋&#xff0c;历史悠久&#xff0c;起源于中国&#xff0c;后传入日本&#xff0c;在日本被称为“连珠”&#xff0c;是一种老少皆宜的益智游戏。 人工智能五子棋系统的目标用户是一切想致力于研究人机对弈算法理论的相关研究者和一切…

I2C总线(一)核心

基于linux-3.14.16 一、简介 硬件上&#xff0c;i2c总线由&#xff0c;i2c控制器、i2c总线、i2c设备组成。 驱动代码将通过设置i2c寄存器&#xff0c;从而在总线上产生数据信息&#xff0c;来和i2c设备通信&#xff08;读/写&#xff09;。 i2c核心&#xff0c;主要的功能包…

spring boot 配置多数据源 踩坑 BindingException: Invalid bound statement (not found)

在上一篇&#xff1a;《【已解决】Spring Boot多数据源的时候&#xff0c;mybatis报错提示&#xff1a;Invalid bound statement (not found)》 凯哥(凯哥Java) 已经接受了&#xff0c;在Spring Boot配置多数据源时候&#xff0c;因为自己马虎&#xff0c;导致的一个坑。下面&a…

【操作系统】实验一 熟悉LINUX环境和命令

实验名称&#xff1a; 实验一 熟悉LINUX环境和命令 实验目的&#xff1a; 1. 了解UNIX/LINUX的命令及使用格式。 2.熟悉UNIX/LINUX的常用基本命令。 3. 练习并掌握LINUX提供的vi编辑器来编译C程序 4. 学会利用gcc、gdb编译、调试C程序 实验内容&#xff1a; 熟悉UNIX/LINUX的…

YOLOv5改进 | 卷积篇 | SPD-Conv空间深度转换卷积(高效空间编码技术)

一、本文介绍 本文给大家带来的改进内容是SPD-Conv&#xff08;空间深度转换卷积&#xff09;技术。SPD-Conv是一种创新的空间编码技术&#xff0c;它通过更有效地处理图像数据来改善深度学习模型的表现。SPD-Conv的基本概念&#xff1a;它是一种将图像空间信息转换为深度信息…

Java_Lambda表达式JDK8新特性(方法引用)

一、Lambda表达式 接下来&#xff0c;我们学习一个JDK8新增的一种语法形式&#xff0c;叫做Lambda表达式。作用&#xff1a;用于简化匿名内部类代码的书写。 1.1 Lambda表达式基本使用 怎么去简化呢&#xff1f;Lamdba是有特有的格式的&#xff0c;按照下面的格式来编写Lamd…