Tesla T4 P2P Test

  • I. Test Environment
  • II. Test Steps
    • 1. Get device information
    • 2. Inspect the PCIe topology
    • 3. Select the two cards at bus IDs 9B and 9E
    • 4. Look up the logical device IDs
    • 5. Set the environment variable (use the logical device IDs; the UUID links them to the physical IDs shown by nvidia-smi)
    • 6. Atomic operations on different addresses (peer vs. local)
    • 7. Performance difference between P2P and non-P2P
    • 8. GPU bandwidth test

Key points covered:

  • Finding the logical ID from the physical ID
  • NCU metrics related to P2P
  • NCU metrics related to PCIe and DRAM

I. Test Environment


II. Test Steps

1. Get device information

nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-c91a8013-0877-df61-53b9-016fabcd5f82)
GPU 1: NVIDIA A30 (UUID: GPU-f67790b5-bb58-614d-4190-3598a99f925e)
GPU 2: Tesla T4 (UUID: GPU-e95bfaa3-bf41-7aeb-d1e7-4f4f98ac3a63)
GPU 3: Tesla T4 (UUID: GPU-7b470c8f-cfe3-81d2-1dd8-2e36c2552d0e)
GPU 4: Tesla T4 (UUID: GPU-d59282d2-060d-270e-1c0e-f50e936ffede)
GPU 5: NVIDIA GeForce RTX 3080 Ti (UUID: GPU-9a131b18-28de-d6a1-01e9-76a133e21680)
GPU 6: NVIDIA A30 (UUID: GPU-49daa3a5-490b-569c-f1d2-79d98c6d3a02)
GPU 7: Tesla T4 (UUID: GPU-1f45f1e1-1e10-7d2d-f25a-e79ac17ddfa1)

nvidia-smi -q | grep "Bus Id"
Bus Id                            : 00000000:34:00.0
Bus Id                            : 00000000:35:00.0
Bus Id                            : 00000000:36:00.0   # Tesla T4
Bus Id                            : 00000000:37:00.0   # Tesla T4
Bus Id                            : 00000000:9B:00.0   # Tesla T4, UUID d59282d2 (selected)
Bus Id                            : 00000000:9C:00.0
Bus Id                            : 00000000:9D:00.0
Bus Id                            : 00000000:9E:00.0   # Tesla T4, UUID 1f45f1e1 (selected)
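
The bus-ID/UUID/name association can also be pulled in one shot instead of being matched by hand (the query fields below are standard nvidia-smi ones):

nvidia-smi --query-gpu=index,pci.bus_id,uuid,name --format=csv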

2. Inspect the PCIe topology

lstopo --ignore Core --ignore Misc --ignore PU

(figure: PCIe topology as reported by lstopo)
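
If lstopo is unavailable, nvidia-smi can print a GPU-to-GPU connectivity matrix instead (PIX/PXB/PHB/SYS, with a legend in its output):

nvidia-smi topo -m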

3. Select the two cards at bus IDs 9B and 9E

4. Look up the logical device IDs

tee devinfo.cu<<-'EOF'
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA error in file '%s' in line %i: %s.\n", __FILE__, __LINE__, cudaGetErrorString(error)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

int main() {
    int device_count=0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));
    for(int deviceid=0; deviceid<device_count;deviceid++)
    {
        CUDA_CHECK(cudaSetDevice(deviceid));  
        cudaDeviceProp deviceProp;
        CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, deviceid));
        printf("Index,%d UUID:GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x\n",
                 deviceid,
                 (unsigned char)deviceProp.uuid.bytes[0], (unsigned char)deviceProp.uuid.bytes[1],
                 (unsigned char)deviceProp.uuid.bytes[2], (unsigned char)deviceProp.uuid.bytes[3],
                 (unsigned char)deviceProp.uuid.bytes[4], (unsigned char)deviceProp.uuid.bytes[5],
                 (unsigned char)deviceProp.uuid.bytes[6], (unsigned char)deviceProp.uuid.bytes[7],
                 (unsigned char)deviceProp.uuid.bytes[8], (unsigned char)deviceProp.uuid.bytes[9],
                 (unsigned char)deviceProp.uuid.bytes[10],(unsigned char)deviceProp.uuid.bytes[11],
                 (unsigned char)deviceProp.uuid.bytes[12],(unsigned char)deviceProp.uuid.bytes[13],
                 (unsigned char)deviceProp.uuid.bytes[14],(unsigned char)deviceProp.uuid.bytes[15]);
    }
    return 0;
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -o devinfo devinfo.cu -I /usr/local/cuda/include -L /usr/local/cuda/lib64  -lcuda
unset CUDA_VISIBLE_DEVICES && ./devinfo

  • Output
Index,0 UUID:GPU-c91a8013-0877-df61-53b9-016fabcd5f82
Index,1 UUID:GPU-9a131b18-28de-d6a1-01e9-76a133e21680
Index,2 UUID:GPU-f67790b5-bb58-614d-4190-3598a99f925e
Index,3 UUID:GPU-49daa3a5-490b-569c-f1d2-79d98c6d3a02
Index,4 UUID:GPU-e95bfaa3-bf41-7aeb-d1e7-4f4f98ac3a63
Index,5 UUID:GPU-7b470c8f-cfe3-81d2-1dd8-2e36c2552d0e
Index,6 UUID:GPU-d59282d2-060d-270e-1c0e-f50e936ffede  # selected
Index,7 UUID:GPU-1f45f1e1-1e10-7d2d-f25a-e79ac17ddfa1  # selected
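
Note that the CUDA logical order above differs from nvidia-smi's ordering: CUDA defaults to CUDA_DEVICE_ORDER=FASTEST_FIRST. If you want the two orders to match, set this before the process starts:

export CUDA_DEVICE_ORDER=PCI_BUS_ID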

5. Set the environment variable (use the logical device IDs; the UUID links them to the physical IDs shown by nvidia-smi)

export CUDA_VISIBLE_DEVICES=6,7
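
Alternatively (assuming a CUDA version recent enough to accept UUIDs in this variable), select the cards by UUID so the choice is independent of enumeration order:

export CUDA_VISIBLE_DEVICES=GPU-d59282d2-060d-270e-1c0e-f50e936ffede,GPU-1f45f1e1-1e10-7d2d-f25a-e79ac17ddfa1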

6. Atomic operations on different addresses (peer vs. local)

tee p2p.cu<<-'EOF'
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA error in file '%s' in line %i: %s.\n", __FILE__, __LINE__, cudaGetErrorString(error)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

template<int mode>
__global__ void dummyKernel(float *data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    for(int i=0;i<102400;i++)
    {
        atomicAdd(&data[idx], idx*i);
    }
}

template <typename F>
void TIMEIT(F const &f,cudaStream_t &stream,cudaEvent_t &start_ev,cudaEvent_t&stop_ev)
{ 
    CUDA_CHECK(cudaDeviceSynchronize());
    auto start = std::chrono::high_resolution_clock::now();
    cudaEventRecord(start_ev, stream); 
    f(stream); 
    cudaEventRecord(stop_ev, stream); 
    CUDA_CHECK(cudaEventSynchronize(stop_ev)); 
    auto end = std::chrono::high_resolution_clock::now(); 
    std::chrono::duration<double> diff = end - start; 
    float milliseconds = 0; 
    cudaEventElapsedTime(&milliseconds, start_ev, stop_ev); 
    printf("E2E:%8.4fms Kernel:%8.4fms\n",diff.count()*1000,milliseconds);
}

int main() {
    int devID0 = 0, devID1 = 1;
    int device_count=0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));
    for(int deviceid=0; deviceid<2;deviceid++)
    {
        CUDA_CHECK(cudaSetDevice(deviceid));  
        cudaDeviceProp deviceProp;
        CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, deviceid));
        std::cout << "-----------------------------------" << std::endl;
        std::cout << "Device Index: " << deviceid << std::endl;
        std::cout << "Compute Capability:"<<deviceProp.major<<"."<<deviceProp.minor<<std::endl;
        std::cout << "Device name: " << deviceProp.name << std::endl;
        std::cout << "Max threads per block: " << deviceProp.maxThreadsPerBlock << std::endl;
        std::cout << "Shared memory per block: " << deviceProp.sharedMemPerBlock << " bytes" << std::endl;
        std::cout << "Max blocks per SM: " << deviceProp.maxBlocksPerMultiProcessor << std::endl;
        std::cout << "asyncEngineCount: " << deviceProp.asyncEngineCount << std::endl;
        std::cout << "directManagedMemAccessFromHost: " << deviceProp.directManagedMemAccessFromHost << std::endl;
        std::cout << "unifiedAddressing: " << deviceProp.unifiedAddressing << std::endl;
        std::cout << "canMapHostMemory: " << deviceProp.canMapHostMemory << std::endl;
        std::cout << "Number of SMs: " << deviceProp.multiProcessorCount << std::endl;
        
        printf("UUID:GPU-%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x\n",
                 (unsigned char)deviceProp.uuid.bytes[0], (unsigned char)deviceProp.uuid.bytes[1],
                 (unsigned char)deviceProp.uuid.bytes[2], (unsigned char)deviceProp.uuid.bytes[3],
                 (unsigned char)deviceProp.uuid.bytes[4], (unsigned char)deviceProp.uuid.bytes[5],
                 (unsigned char)deviceProp.uuid.bytes[6], (unsigned char)deviceProp.uuid.bytes[7],
                 (unsigned char)deviceProp.uuid.bytes[8], (unsigned char)deviceProp.uuid.bytes[9],
                 (unsigned char)deviceProp.uuid.bytes[10],(unsigned char)deviceProp.uuid.bytes[11],
                 (unsigned char)deviceProp.uuid.bytes[12],(unsigned char)deviceProp.uuid.bytes[13],
                 (unsigned char)deviceProp.uuid.bytes[14],(unsigned char)deviceProp.uuid.bytes[15]);
    }
    
    std::cout << "-----------------------------------" << std::endl;
    int p2p_value=0;
    CUDA_CHECK(cudaDeviceGetP2PAttribute(&p2p_value,cudaDevP2PAttrAccessSupported,devID0,devID1));
    std::cout << "cudaDevP2PAttrAccessSupported: " << p2p_value << std::endl;
    
    #define block_size (32*4)
    
    size_t dataSize = block_size * sizeof(float);
    float *data0_dev, *data1_dev;

    CUDA_CHECK(cudaSetDevice(devID0));
    CUDA_CHECK(cudaMalloc(&data0_dev, dataSize));

    CUDA_CHECK(cudaSetDevice(devID1));
    CUDA_CHECK(cudaMalloc(&data1_dev, dataSize));
    CUDA_CHECK(cudaMemset(data1_dev, 0, dataSize));
    // Enable P2P
    int canAccessPeer=0;
    CUDA_CHECK(cudaDeviceCanAccessPeer(&canAccessPeer, devID0, devID1));
    if (canAccessPeer) {
        CUDA_CHECK(cudaSetDevice(devID1));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaEvent_t start_ev, stop_ev;
        cudaEventCreate(&start_ev);
        cudaEventCreate(&stop_ev);
        
        CUDA_CHECK(cudaDeviceEnablePeerAccess(devID0, 0)); // allow devID1 to access devID0's device memory
        
        TIMEIT([&data0_dev](cudaStream_t &stream)-> void {dummyKernel<1><<<1, block_size,0,stream>>>(data0_dev);},stream,start_ev,stop_ev);
        TIMEIT([&data1_dev](cudaStream_t &stream)-> void {dummyKernel<3><<<1, block_size,0,stream>>>(data1_dev);},stream,start_ev,stop_ev);

        CUDA_CHECK(cudaDeviceDisablePeerAccess(devID0));
    }
    else
    {
        printf("%s %d canAccessPeer=0\n",__FILE__,__LINE__);
    }

    CUDA_CHECK(cudaFree(data0_dev));
    CUDA_CHECK(cudaFree(data1_dev));
    return 0;
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -o p2p p2p.cu -I /usr/local/cuda/include -L /usr/local/cuda/lib64  -lcuda
./p2p

/usr/local/cuda/bin/ncu --metrics \
lts__t_requests_aperture_device.sum,\
lts__t_sectors_aperture_device.sum,\
lts__t_requests_aperture_peer.sum,\
lts__t_requests_srcnode_gpc_aperture_peer.sum,\
lts__t_requests_srcunit_l1_aperture_peer.sum,\
lts__t_requests_srcunit_tex_aperture_peer.sum,\
lts__t_sectors_aperture_peer.sum,\
lts__t_sectors_srcnode_gpc_aperture_peer.sum,\
lts__t_sectors_srcunit_l1_aperture_peer.sum,\
lts__t_sectors_srcunit_tex_aperture_peer.sum,\
dram__bytes_read.sum,\
dram__bytes_write.sum,\
dram__bytes_read.sum.per_second,\
pcie__read_bytes.sum,\
pcie__write_bytes.sum,\
pcie__read_bytes.sum.per_second,\
pcie__write_bytes.sum.per_second,\
dram__bytes_write.sum.per_second ./p2p
  • Output
-----------------------------------
Device Index: 0
Compute Capability:7.5
Device name: Tesla T4
Max threads per block: 1024
Shared memory per block: 49152 bytes
Max blocks per SM: 16
asyncEngineCount: 3
directManagedMemAccessFromHost: 0
unifiedAddressing: 1
canMapHostMemory: 1
Number of SMs: 40
UUID:GPU-d59282d2-060d-270e-1c0e-f50e936ffede
-----------------------------------
Device Index: 1
Compute Capability:7.5
Device name: Tesla T4
Max threads per block: 1024
Shared memory per block: 49152 bytes
Max blocks per SM: 16
asyncEngineCount: 3
directManagedMemAccessFromHost: 0
unifiedAddressing: 1
canMapHostMemory: 1
Number of SMs: 40
UUID:GPU-1f45f1e1-1e10-7d2d-f25a-e79ac17ddfa1
-----------------------------------
cudaDevP2PAttrAccessSupported: 1
E2E:132.1300ms Kernel:132.1245ms  # GPU1 atomics on GPU0's device memory via P2P
E2E:  3.5552ms Kernel:  3.5444ms  # GPU1 atomics on its own local DRAM

void dummyKernel<(int)1>(float *), 2024-Sep-25 17:06:47, Context 2, Stream 34
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
dram__bytes_read.sum                                                             Kbyte                          52.86
dram__bytes_read.sum.per_second                                           Kbyte/second                         405.11
dram__bytes_write.sum                                                            Kbyte                           1.25
dram__bytes_write.sum.per_second                                          Kbyte/second                           9.56
lts__t_requests_aperture_device.sum                                            request                          17775
lts__t_requests_aperture_peer.sum                                              request                         409600 # 4 warps x 102400 coalesced accesses each; footprint is 4 warps x 32 lanes x 4 B
lts__t_requests_srcnode_gpc_aperture_peer.sum                                  request                         409600
lts__t_requests_srcunit_l1_aperture_peer.sum                                   request                              0
lts__t_requests_srcunit_tex_aperture_peer.sum                                  request                         409600
lts__t_sectors_aperture_device.sum                                              sector                          93043
lts__t_sectors_aperture_peer.sum                                                sector                        1638400 
lts__t_sectors_srcnode_gpc_aperture_peer.sum                                    sector                        1638400 # each coalesced request maps to 4 sectors; 1638400 x 32 B = 50 MB
lts__t_sectors_srcunit_l1_aperture_peer.sum                                     sector                              0
lts__t_sectors_srcunit_tex_aperture_peer.sum                                    sector                        1638400
pcie__read_bytes.sum                                                             Mbyte                          59.10
pcie__read_bytes.sum.per_second                                           Mbyte/second                         452.93
pcie__write_bytes.sum                                                            Mbyte                          72.18 # PCIe reads + writes together exceed the 50 MB payload
pcie__write_bytes.sum.per_second                                          Mbyte/second                         553.17
---------------------------------------------------------------------- --------------- ------------------------------

void dummyKernel<(int)3>(float *), 2024-Sep-25 17:06:48, Context 2, Stream 34
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
dram__bytes_read.sum                                                             Kbyte                           6.21
dram__bytes_read.sum.per_second                                           Mbyte/second                           1.77
dram__bytes_write.sum                                                             byte                            224
dram__bytes_write.sum.per_second                                          Kbyte/second                          63.93
lts__t_requests_aperture_device.sum                                            request                         414530
lts__t_requests_aperture_peer.sum                                              request                              0
lts__t_requests_srcnode_gpc_aperture_peer.sum                                  request                              0
lts__t_requests_srcunit_l1_aperture_peer.sum                                   request                              0
lts__t_requests_srcunit_tex_aperture_peer.sum                                  request                              0
lts__t_sectors_aperture_device.sum                                              sector                        1643605
lts__t_sectors_aperture_peer.sum                                                sector                              0
lts__t_sectors_srcnode_gpc_aperture_peer.sum                                    sector                              0
lts__t_sectors_srcunit_l1_aperture_peer.sum                                     sector                              0
lts__t_sectors_srcunit_tex_aperture_peer.sum                                    sector                              0
pcie__read_bytes.sum                                                             Kbyte                           3.58
pcie__read_bytes.sum.per_second                                           Mbyte/second                           1.02
pcie__write_bytes.sum                                                            Kbyte                           3.07
pcie__write_bytes.sum.per_second                                          Kbyte/second                         876.74
---------------------------------------------------------------------- --------------- ------------------------------
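
The test above drives P2P with kernel-side atomics over the peer mapping. For bulk data movement the runtime also offers cudaMemcpyPeer/cudaMemcpyPeerAsync; below is a minimal sketch under the same two-device setup (hypothetical file name memcpy_peer.cu; error checking via the CUDA_CHECK macro above is omitted for brevity, and this is not part of the original test):

tee memcpy_peer.cu<<-'EOF'
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    float *src = nullptr, *dst = nullptr;
    size_t bytes = 256ULL << 20; // 256 MiB
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);
    // With peer access enabled the copy goes directly over PCIe P2P;
    // without it, the runtime stages the transfer through host memory.
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMemcpyPeer(dst, 1, src, 0, bytes); // dst on device 1, src on device 0
    cudaDeviceSynchronize();
    printf("peer copy done\n");
    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -o memcpy_peer memcpy_peer.cu
./memcpy_peer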

7. Performance difference between P2P and non-P2P

tee p2p.cu<<-'EOF'
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA error in file '%s' in line %i: %s.\n", __FILE__, __LINE__, cudaGetErrorString(error)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

template<int mode>
__global__ void dummyKernel(float *input_data,float *output_data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    output_data[idx]=input_data[idx];
}

template <typename F>
void TIMEIT(F const &f,cudaStream_t &stream,cudaEvent_t &start_ev,cudaEvent_t&stop_ev)
{ 
    CUDA_CHECK(cudaDeviceSynchronize());
    auto start = std::chrono::high_resolution_clock::now();
    cudaEventRecord(start_ev, stream); 
    f(stream); 
    cudaEventRecord(stop_ev, stream); 
    CUDA_CHECK(cudaEventSynchronize(stop_ev)); 
    auto end = std::chrono::high_resolution_clock::now(); 
    std::chrono::duration<double> diff = end - start; 
    float milliseconds = 0; 
    cudaEventElapsedTime(&milliseconds, start_ev, stop_ev); 
    printf("E2E:%7.2fms Kernel:%7.2fms\n",diff.count()*1000,milliseconds);
}

int main() {
    int devID0 = 0, devID1 = 1;
    int device_count=0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));
    for(int deviceid=0; deviceid<2;deviceid++)
    {
        CUDA_CHECK(cudaSetDevice(deviceid));  
        cudaDeviceProp deviceProp;
        CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, deviceid));
        std::cout << "-----------------------------------" << std::endl;
        std::cout << "Device Index: " << deviceid << std::endl;
        std::cout << "Compute Capability:"<<deviceProp.major<<"."<<deviceProp.minor<<std::endl;
        std::cout << "Device name: " << deviceProp.name << std::endl;
        std::cout << "Max threads per block: " << deviceProp.maxThreadsPerBlock << std::endl;
        std::cout << "Shared memory per block: " << deviceProp.sharedMemPerBlock << " bytes" << std::endl;
        std::cout << "Max blocks per SM: " << deviceProp.maxBlocksPerMultiProcessor << std::endl;
        std::cout << "asyncEngineCount: " << deviceProp.asyncEngineCount << std::endl;
        std::cout << "directManagedMemAccessFromHost: " << deviceProp.directManagedMemAccessFromHost << std::endl;
        std::cout << "unifiedAddressing: " << deviceProp.unifiedAddressing << std::endl;
        std::cout << "Number of SMs: " << deviceProp.multiProcessorCount << std::endl;
    }
    
    std::cout << "-----------------------------------" << std::endl;
    int p2p_value=0;
    CUDA_CHECK(cudaDeviceGetP2PAttribute(&p2p_value,cudaDevP2PAttrAccessSupported,devID0,devID1));
    std::cout << "cudaDevP2PAttrAccessSupported: " << p2p_value << std::endl;
    
    #define block_size 1024
    #define block_count 1000000L
    
    size_t dataSize = block_count*block_size * sizeof(float);
    float *data0_dev, *data1_dev,*data1_dev_ex;

    CUDA_CHECK(cudaSetDevice(devID0));
    CUDA_CHECK(cudaMalloc(&data0_dev, dataSize));

    CUDA_CHECK(cudaSetDevice(devID1));
    CUDA_CHECK(cudaMalloc(&data1_dev, dataSize));
    CUDA_CHECK(cudaMalloc(&data1_dev_ex, dataSize));
    char *host;
    CUDA_CHECK(cudaMallocHost(&host,dataSize));

    printf("Init Done(%.2f)GB..\n",dataSize/1024.0/1024.0/1024.0);
    // Enable P2P
    int canAccessPeer=0;
    CUDA_CHECK(cudaDeviceCanAccessPeer(&canAccessPeer, devID0, devID1));
    if (canAccessPeer) {
        CUDA_CHECK(cudaSetDevice(devID1));
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cudaEvent_t start_ev, stop_ev;
        cudaEventCreate(&start_ev);
        cudaEventCreate(&stop_ev);
        
        CUDA_CHECK(cudaDeviceEnablePeerAccess(devID0, 0)); // allow devID1 to access devID0's device memory
        
        TIMEIT([&](cudaStream_t &stream)-> void {cudaMemcpyAsync(host,data1_dev,dataSize,cudaMemcpyDeviceToHost,stream);},stream,start_ev,stop_ev); // dst is host, src is device, so the direction is D2H
        TIMEIT([&](cudaStream_t &stream)-> void {dummyKernel<1><<<block_count, block_size,0,stream>>>(data0_dev,data1_dev);},stream,start_ev,stop_ev);
        TIMEIT([&](cudaStream_t &stream)-> void {dummyKernel<2><<<block_count, block_size,0,stream>>>(data1_dev_ex,data1_dev);},stream,start_ev,stop_ev);

        CUDA_CHECK(cudaDeviceDisablePeerAccess(devID0));
    }
    else
    {
        printf("%s %d canAccessPeer=0\n",__FILE__,__LINE__);
    }

    CUDA_CHECK(cudaFreeHost(host));
    CUDA_CHECK(cudaFree(data0_dev));
    CUDA_CHECK(cudaFree(data1_dev));
    CUDA_CHECK(cudaFree(data1_dev_ex));
    return 0;
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -o p2p p2p.cu -I /usr/local/cuda/include -L /usr/local/cuda/lib64  -lcuda
./p2p

/usr/local/NVIDIA-Nsight-Compute/ncu --metrics \
dram__bytes_read.sum.pct_of_peak_sustained_elapsed,\
dram__bytes_write.sum.pct_of_peak_sustained_elapsed,\
dram__bytes_read.sum.per_second,\
pcie__read_bytes.sum.per_second,\
pcie__write_bytes.sum.per_second,\
dram__bytes_write.sum.per_second ./p2p
  • Output
-----------------------------------
Device Index: 0
Compute Capability:7.5
Device name: Tesla T4
Max threads per block: 1024
Shared memory per block: 49152 bytes
Max blocks per SM: 16
asyncEngineCount: 3
directManagedMemAccessFromHost: 0
unifiedAddressing: 1
Number of SMs: 40
-----------------------------------
Device Index: 1
Compute Capability:7.5
Device name: Tesla T4
Max threads per block: 1024
Shared memory per block: 49152 bytes
Max blocks per SM: 16
asyncEngineCount: 3
directManagedMemAccessFromHost: 0
unifiedAddressing: 1
Number of SMs: 40
-----------------------------------
cudaDevP2PAttrAccessSupported: 1
Init Done(3.81)GB..
E2E: 325.70ms Kernel: 325.71ms  # GPU1 device-to-host copy to pinned memory: 11.697 GB/s
E2E: 307.29ms Kernel: 307.31ms  # GPU1 reading GPU0's device memory via P2P (faster than the host copy): 12.39 GB/s
E2E:  37.90ms Kernel:  37.89ms  # GPU1 DRAM-internal D2D copy: 2 x 100.55 GB/s


void dummyKernel<1>(float *, float *) (1000000, 1, 1)x(1024, 1, 1), Context 2, Stream 34, Device 7, CC 7.5
Section: Command line profiler metrics
--------------------------------------------------- ----------- ------------
Metric Name                                         Metric Unit Metric Value
--------------------------------------------------- ----------- ------------
dram__bytes_read.sum.pct_of_peak_sustained_elapsed            %         0.01
dram__bytes_read.sum.per_second                         Mbyte/s        37.35
dram__bytes_write.sum.pct_of_peak_sustained_elapsed           %         4.72
dram__bytes_write.sum.per_second                        Gbyte/s        15.10
pcie__read_bytes.sum.per_second                         Gbyte/s        15.01
pcie__write_bytes.sum.per_second                        Gbyte/s         3.34
--------------------------------------------------- ----------- ------------

void dummyKernel<2>(float *, float *) (1000000, 1, 1)x(1024, 1, 1), Context 2, Stream 34, Device 7, CC 7.5
Section: Command line profiler metrics
--------------------------------------------------- ----------- ------------
Metric Name                                         Metric Unit Metric Value
--------------------------------------------------- ----------- ------------
dram__bytes_read.sum.pct_of_peak_sustained_elapsed            %        37.34  # simultaneous read+write: the two utilizations sum to ~75%
dram__bytes_read.sum.per_second                         Gbyte/s       119.41
dram__bytes_write.sum.pct_of_peak_sustained_elapsed           %        37.81
dram__bytes_write.sum.per_second                        Gbyte/s       120.89  # ~239 GB/s combined, matching the bandwidth test below
pcie__read_bytes.sum.per_second                         Mbyte/s        22.03
pcie__write_bytes.sum.per_second                        Mbyte/s         7.42
--------------------------------------------------- ----------- ------------

8. GPU bandwidth test

git clone https://www.github.com/nvidia/cuda-samples
cd cuda-samples/Samples/1_Utilities/deviceQuery
make clean && make
./deviceQuery
cd ../bandwidthTest/
make clean && make
./bandwidthTest --device=0
  • Output
Running on...

 Device 0: Tesla T4
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     12.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     13.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     239.4
Result = PASS
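
cuda-samples also ships a dedicated P2P benchmark that reports the pairwise GPU-to-GPU bandwidth/latency matrix with P2P enabled and disabled (the path below assumes the post-11.6 repository layout):

cd ../../5_Domain_Specific/p2pBandwidthLatencyTest
make clean && make
./p2pBandwidthLatencyTest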

换脸黑科技FaceFusion 3.0&#xff08;Windows & Mac整合包&#xff09;震撼来袭&#xff01; 各位魔法师们&#xff0c;准备好迎接 FaceFusion 3.0 的强势登场了吗&#xff1f;这款 AI 换脸神器经历了全面升级&#xff0c;功能更强大&#xff0c;效果更惊艳&#xff0c;操…