Performance comparison of CPU and GPU compute and OpenCV resize on the RK3588

news2025/1/11 20:37:33

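  • I. Background
  • II. Summary
  • III. Related links
  • IV. Procedure
    • 1. Environment setup
      • A. Install dependencies
      • B. Set the GPU to performance mode
      • C. Get GPU information
      • D. Get CPU information
    • 2. Query GPU information through the OpenCL SDK
    • 3. Matrix multiplication with the OpenCL API
    • 4. Measure GPU performance with clpeak
    • 5. Measure CPU compute with OpenBLAS
    • 6. Measure OpenCV resize performance on the CPU and with OpenCL
      • A. Build OpenCV with OpenCL support
      • B. Run the OpenCV test program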

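I. Background

  • Goal: compare the performance of the CPU and the Mali GPU on the RK3588
  • Mali GPU compute is measured with clpeak
  • CPU FP32 performance is measured with OpenBLAS (NEON optimizations enabled)
  • OpenCV resize is timed on the CPU and with OpenCL for each interpolation method: upscale from 32x32 to 8192x8192, then downscale back to 32x32, repeated 100 times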

II. Summary

  • GPU model: Mali-LODX r0p0 Mali-G610 4 cores r0p0 0xA867
  • GPU FP32 (clpeak): 441.95 GFLOPS
  • CPU FP32 (OpenBLAS + NEON): 53.68 GFLOPS
  • Interpolation: INTER_NEAREST CPU time (s): 3.01526 GPU time (s): 0.0672681
  • Interpolation: INTER_LINEAR CPU time (s): 5.3227 GPU time (s): 0.0189366
  • Interpolation: INTER_CUBIC CPU time (s): 8.22734 GPU time (s): 11.6337
  • Interpolation: INTER_AREA CPU time (s): 20.4999 GPU time (s): 27.3197
  • Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484

III. Related links

  • Building OpenCV

IV. Procedure

1. Environment setup

A. Install dependencies

mv /lib/aarch64-linux-gnu/libOpenCL.so.1 /lib/aarch64-linux-gnu/libOpenCL.so.1.bk
ln -s /usr/lib/aarch64-linux-gnu/libmali.so /lib/aarch64-linux-gnu/libOpenCL.so.1

sudo apt install opencl-headers
sudo apt install ocl-icd-libopencl1
sudo apt install ocl-icd-opencl-dev
sudo apt install clinfo

B. Set the GPU to performance mode

echo performance> /sys/class/devfreq/fb000000.gpu/governor
echo performance> /sys/class/devfreq/fdab0000.npu/governor
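
To confirm that the setting took effect, the governor file can be read back before each benchmark run. The following is a minimal C sketch, not part of the original test; the helper names `read_sysfs_line` and `governor_is` are hypothetical, and the path in the usage note is the one written to above.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a text file (for example a sysfs attribute) into
 * buf and strip the trailing newline. Returns 0 on success, -1 on failure. */
int read_sysfs_line(const char *path, char *buf, size_t len) {
    if (len == 0) return -1;
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the governor stored at path equals expected, 0 if it does
 * not, and -1 if the file cannot be read. */
int governor_is(const char *path, const char *expected) {
    char buf[64];
    if (read_sysfs_line(path, buf, sizeof(buf)) != 0) return -1;
    return strcmp(buf, expected) == 0;
}
```

A call such as `governor_is("/sys/class/devfreq/fb000000.gpu/governor", "performance")` would return 1 once the echo above has succeeded.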

C. Get GPU information

cat /sys/class/misc/mali0/device/gpuinfo
clinfo

Output

Mali-G610 4 cores r0p0 0xA867

Number of platforms                               1
  Platform Name                                   ARM Platform
  Platform Vendor                                 ARM
  Platform Version                                OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             ARM

  Platform Name                                   ARM Platform
Number of devices                                 1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device Name                                     Mali-LODX r0p0
  Device Vendor                                   ARM
  Device Vendor ID                                0xa8670000
  Device Version                                  OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Driver Version                                  2.1
  Device OpenCL C Version                         OpenCL C 2.0 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               4
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             1024
  Preferred work group size multiple              16
  Max sub-groups per work group                   64
  Preferred / native vector sizes
    char                                                16 / 4
    short                                                8 / 2
    int                                                  4 / 1
    long                                                 2 / 1
    half                                                 8 / 2        (cl_khr_fp16)
    float                                                4 / 1
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Global memory size                              16643870720 (15.5GiB)
  Error Correction support                        No
  Max memory allocation                           16643870720 (15.5GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             0
  Global Memory cache type                        Read/Write
  Global Memory cache size                        1048576 (1024KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   32 bytes
    Pitch alignment for 2D image buffers          64 pixels
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                64
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    1
  Max pipe packet size                            1024
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     128
  Max constant buffer size                        16643870720 (15.5GiB)
  Max size of kernel argument                     1024
  Queue properties (on host)
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Queue properties (on device)
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                2097152 (2MiB)
    Max size                                      16777216 (16MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Sub-group independent forward progress        Yes
    IL version                                    SPIR-V_1.0
    SPIR versions                                 <printDeviceInfo:161: get CL_DEVICE_SPIR_VERSIONS size : error -30>
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [ARM]
  clCreateContext(NULL, ...) [default]            Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0

D. Get CPU information

lscpu

Output

Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       3
Vendor ID:                       ARM
Model:                           0
Model name:                      Cortex-A55
Stepping:                        r2p0
CPU max MHz:                     2208.0000
CPU min MHz:                     408.0000
BogoMIPS:                        48.00
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        1 MiB
L3 cache:                        3 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
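
The RK3588 is a big.LITTLE part that combines Cortex-A76 and Cortex-A55 cores, so the single model name printed above does not describe every core. One way to see the clusters is to read each core's maximum frequency from the standard Linux cpufreq sysfs files. The sketch below assumes that interface is present; the function names are hypothetical.

```c
#include <stdio.h>

/* Build the cpufreq path for one logical CPU.
 * Returns 0 on success, -1 if buf is too small. */
int cpu_max_freq_path(int cpu, char *buf, size_t len) {
    int n = snprintf(buf, len,
                     "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Read a frequency in kHz from a file; returns -1 on failure. */
long read_khz(const char *path) {
    FILE *f = fopen(path, "r");
    long khz = -1;
    if (!f) return -1;
    if (fscanf(f, "%ld", &khz) != 1) khz = -1;
    fclose(f);
    return khz;
}

/* Print the maximum frequency of CPUs 0..count-1. Cores that report the
 * same value usually belong to the same cluster (an assumption, not a rule). */
void print_cpu_max_freqs(int count) {
    char path[128];
    for (int cpu = 0; cpu < count; ++cpu) {
        if (cpu_max_freq_path(cpu, path, sizeof(path)) != 0) continue;
        long khz = read_khz(path);
        if (khz < 0)
            printf("cpu%d: unavailable\n", cpu);
        else
            printf("cpu%d: %ld kHz\n", cpu, khz);
    }
}
```

On this board `print_cpu_max_freqs(8)` would cover the eight CPUs listed by lscpu.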

2. Query GPU information through the OpenCL SDK

cat > cl_query.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main() {
    cl_platform_id *platforms = NULL;
    cl_uint num_platforms = 0;

    // Get the number of available platforms
    cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
    if (clStatus != CL_SUCCESS || num_platforms == 0) {
        printf("No OpenCL platform found (status %d)\n", clStatus);
        return 1;
    }
    platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * num_platforms);

    // Get all platform IDs
    clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);

    printf("Number of OpenCL platforms: %u\n", num_platforms);

    // Iterate over the platforms
    for (cl_uint i = 0; i < num_platforms; ++i) {
        char buffer[10240];
        printf("\nPlatform %u:\n", i+1);

        // Platform name
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(buffer), buffer, NULL);
        printf("  Name: %s\n", buffer);

        // Platform vendor
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(buffer), buffer, NULL);
        printf("  Vendor: %s\n", buffer);

        // Platform version
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(buffer), buffer, NULL);
        printf("  Version: %s\n", buffer);

        // Number of devices
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);

        cl_device_id *devices = (cl_device_id*) malloc(sizeof(cl_device_id) * num_devices);
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);

        // Iterate over the devices
        for (cl_uint j = 0; j < num_devices; ++j) {
            printf("  Device %u:\n", j+1);

            // Device name
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL);
            printf("    Name: %s\n", buffer);

            // Device type
            cl_device_type device_type;
            clGetDeviceInfo(devices[j], CL_DEVICE_TYPE, sizeof(device_type), &device_type, NULL);
            if (device_type & CL_DEVICE_TYPE_CPU)
                printf("    Type: CPU\n");
            if (device_type & CL_DEVICE_TYPE_GPU)
                printf("    Type: GPU\n");
            if (device_type & CL_DEVICE_TYPE_ACCELERATOR)
                printf("    Type: Accelerator\n");

            // Number of compute units
            cl_uint compute_units;
            clGetDeviceInfo(devices[j], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
            printf("    Compute units: %u\n", compute_units);

            // Global memory size
            cl_ulong global_mem;
            clGetDeviceInfo(devices[j], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
            printf("    Global memory size: %llu MB\n", (unsigned long long)(global_mem / (1024 * 1024)));
        }
        free(devices);
    }
    free(platforms);
    return 0;
}
EOF

gcc -o cl_query cl_query.c -lOpenCL
./cl_query

Output

Number of OpenCL platforms: 1

Platform 1:
  Name: ARM Platform
  Vendor: ARM
  Version: OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device 1:
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
    Name: Mali-LODX r0p0
    Type: GPU
    Compute units: 4
    Global memory size: 15872 MB
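
Most OpenCL calls in these listings discard their status code. A small lookup helper makes failures readable; for instance, the `error -30` in the clinfo output earlier corresponds to CL_INVALID_VALUE. The sketch below is an illustration only: the function names are hypothetical, only a handful of codes are listed, and the numeric values are copied from the Khronos `CL/cl.h` header so that the block builds without the OpenCL SDK.

```c
#include <stdio.h>

/* Map a few common OpenCL status codes to their names. The numeric values
 * are the ones defined in CL/cl.h; codes not listed fall through. */
const char *cl_status_name(int status) {
    switch (status) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -3:  return "CL_COMPILER_NOT_AVAILABLE";
        case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -54: return "CL_INVALID_WORK_GROUP_SIZE";
        default:  return "unlisted status";
    }
}

/* Report a failed call on stderr. Returns 1 when status is CL_SUCCESS,
 * 0 otherwise, so it can be used directly in an if condition. */
int cl_ok(int status, const char *what) {
    if (status == 0) return 1;
    fprintf(stderr, "%s failed: %s (%d)\n", what, cl_status_name(status), status);
    return 0;
}
```

In the query program this could wrap a call as `if (!cl_ok(clGetPlatformIDs(num_platforms, platforms, NULL), "clGetPlatformIDs")) return 1;`.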

3. Matrix multiplication with the OpenCL API

cat > matmul.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <time.h>
#include <sys/time.h>

#define MATRIX_SIZE 8192
#define TILE_SIZE 32

// Return the current wall-clock time in seconds, used for timing
double get_current_time() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)(tp.tv_sec) + (double)(tp.tv_usec) / 1e6;
}

#define xstr(s) str(s)
#define str(s) #s

const char *kernelSource = "                                  \n" \
"__kernel void mat_mul_optimized(const int N,                 \n" \
"                                __global float* A,           \n" \
"                                __global float* B,           \n" \
"                                __global float* C) {         \n" \
"    const int TILE_SIZE = " xstr(TILE_SIZE) ";               \n" \
"    __local float Asub[TILE_SIZE][TILE_SIZE];                \n" \
"    __local float Bsub[TILE_SIZE][TILE_SIZE];                \n" \
"    int global_row = get_global_id(1);                       \n" \
"    int global_col = get_global_id(0);                       \n" \
"    int local_row = get_local_id(1);                         \n" \
"    int local_col = get_local_id(0);                         \n" \
"    float sum = 0.0f;                                        \n" \
"    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;          \n" \
"    for (int t = 0; t < numTiles; ++t) {                     \n" \
"        int tiled_row = global_row;                          \n" \
"        int tiled_col = t * TILE_SIZE + local_col;           \n" \
"        if (tiled_row < N && tiled_col < N)                  \n" \
"            Asub[local_row][local_col] = A[tiled_row * N + tiled_col];\n" \
"        else                                                 \n" \
"            Asub[local_row][local_col] = 0.0f;               \n" \
"        tiled_row = t * TILE_SIZE + local_row;               \n" \
"        tiled_col = global_col;                              \n" \
"        if (tiled_row < N && tiled_col < N)                  \n" \
"            Bsub[local_row][local_col] = B[tiled_row * N + tiled_col];\n" \
"        else                                                 \n" \
"            Bsub[local_row][local_col] = 0.0f;               \n" \
"        barrier(CLK_LOCAL_MEM_FENCE);                        \n" \
"        for (int k = 0; k < TILE_SIZE; ++k) {                \n" \
"            sum += Asub[local_row][k] * Bsub[k][local_col];  \n" \
"        }                                                    \n" \
"        barrier(CLK_LOCAL_MEM_FENCE);                        \n" \
"    }                                                        \n" \
"    if (global_row < N && global_col < N)                    \n" \
"        C[global_row * N + global_col] = sum;                \n" \
"}                                                            \n";

int main() {
    int N = MATRIX_SIZE;
    size_t bytes = (size_t)N * N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    if (!h_A || !h_B || !h_C) {
        printf("Host allocation failed\n");
        return 1;
    }

    // Initialize the matrices
    for(int i = 0; i < N*N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    // Get the platform and device
    cl_platform_id platformId = NULL;
    cl_device_id deviceID = NULL;
    cl_uint retNumDevices;
    cl_uint retNumPlatforms;

    cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
    ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);
    if (ret != CL_SUCCESS) {
        printf("clGetDeviceIDs failed: %d\n", ret);
        return 1;
    }

    // Create the OpenCL context
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &ret);

    // Create the command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);

    // Create the device buffers
    cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &ret);

    // Copy the input data into the buffers
    ret = clEnqueueWriteBuffer(commandQueue, d_A, CL_TRUE, 0, bytes, h_A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(commandQueue, d_B, CL_TRUE, 0, bytes, h_B, 0, NULL, NULL);

    // Record the build start time
    double compile_start = get_current_time();

    // Create the program object
    cl_program program = clCreateProgramWithSource(context, 1, (const char**)&kernelSource, NULL, &ret);

    // Build the kernel program
    ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);

    // Check for build errors
    if (ret != CL_SUCCESS) {
        size_t log_size;
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        printf("CL Compilation failed:\n%s\n", log);
        free(log);
        return 1;
    }

    // Record the build end time
    double compile_end = get_current_time();

    double compile_time = compile_end - compile_start;

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "mat_mul_optimized", &ret);

    // Set the kernel arguments
    ret = clSetKernelArg(kernel, 0, sizeof(int), (void*)&N);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_A);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_B);
    ret = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&d_C);

    // Define the global and local work sizes
    size_t local[2] = {TILE_SIZE, TILE_SIZE};
    size_t global[2] = {(size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE,
                        (size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE};

    // Record the start time of the first kernel launch
    double launch_start = get_current_time();

    // Launch the kernel
    ret = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global, local, 0, NULL, NULL);
    printf("clEnqueueNDRangeKernel:%d\n",ret);
    // Wait for the command queue to drain
    clFinish(commandQueue);

    // Record the end time of the first kernel launch
    double launch_end = get_current_time();

    double launch_time = launch_end - launch_start;

    // Read back the result
    ret = clEnqueueReadBuffer(commandQueue, d_C, CL_TRUE, 0, bytes, h_C, 0, NULL, NULL);

    // Compute GFLOPS: one multiply and one add per inner-loop step
    double total_ops = 2.0 * N * N * N;
    double gflops = (total_ops / 1e9) / launch_time;

    // Print the results
    printf("Compile time: %f s\n", compile_time);
    printf("First kernel execution time: %f s\n", launch_time);
    printf("Performance: %f GFLOPS\n", gflops);

    // Release resources
    ret = clFlush(commandQueue);
    ret = clFinish(commandQueue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(d_A);
    ret = clReleaseMemObject(d_B);
    ret = clReleaseMemObject(d_C);
    ret = clReleaseCommandQueue(commandQueue);
    ret = clReleaseContext(context);

    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

EOF
gcc -o matmul matmul.c -lOpenCL
./matmul

Output

Compile time: 0.031085 s
First kernel execution time: 62.258528 s
Performance: 17.660418 GFLOPS
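
The program reads the result back into `h_C` but never inspects it. Because both inputs are filled with 1.0f, every element of the product should equal N (8192, which a float represents exactly), so a correctness check is cheap. The helper below is a sketch with a hypothetical name, not part of the original program.

```c
#include <stddef.h>

/* Return the index of the first element that differs from expected by more
 * than tol, or -1 if all count elements match. The comparison is written
 * as !(d <= tol) so that NaN values are also reported as mismatches. */
long first_mismatch(const float *data, size_t count, float expected, float tol) {
    for (size_t i = 0; i < count; ++i) {
        float d = data[i] - expected;
        if (d < 0.0f) d = -d;
        if (!(d <= tol)) return (long)i;
    }
    return -1;
}
```

After the `clEnqueueReadBuffer` call, `first_mismatch(h_C, (size_t)N * N, (float)N, 0.0f)` returning -1 would indicate that the kernel produced the expected matrix.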

4. Measure GPU performance with clpeak

git clone https://gitcode.com/gh_mirrors/cl/clpeak.git
cd clpeak
git submodule update --init --recursive --remote
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./clpeak

Output

Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device: Mali-LODX r0p0
    Driver version  : 2.1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 25.71
      float2  : 24.45
      float4  : 23.70
      float8  : 12.05
      float16 : 12.01

    Single-precision compute (GFLOPS)
      float   : 441.77
      float2  : 470.27
      float4  : 466.52
      float8  : 435.65
      float16 : 411.38

    Half-precision compute (GFLOPS)
      half   : 441.96
      half2  : 878.25
      half4  : 911.51
      half8  : 886.19
      half16 : 846.44

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 124.96
      int2  : 125.71
      int4  : 125.16
      int8  : 123.82
      int16 : 124.24

    Integer compute Fast 24bit (GIOPS)
      int   : 125.16
      int2  : 125.63
      int4  : 125.20
      int8  : 123.73
      int16 : 124.33

    Integer char (8bit) compute (GIOPS)
      char   : 126.47
      char2  : 251.55
      char4  : 498.03
      char8  : 497.37
      char16 : 491.94

    Integer short (16bit) compute (GIOPS)
      short   : 126.31
      short2  : 250.90
      short4  : 249.47
      short8  : 248.51
      short16 : 245.30

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 8.54
      enqueueReadBuffer               : 9.97
      enqueueWriteBuffer non-blocking : 8.55
      enqueueReadBuffer non-blocking  : 9.99
      enqueueMapBuffer(for read)      : 61.66
        memcpy from mapped ptr        : 11.95
      enqueueUnmap(after write)       : 62.02
        memcpy to mapped ptr          : 11.89

    Kernel launch latency : 26.81 us
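
The clpeak figures can be related to the hardware description it prints (4 compute units at 1000 MHz) and to the hand-written matrix multiplication in step 3. The arithmetic is sketched below; the function names are hypothetical, and the calculation assumes the GPU actually ran at the reported clock for the whole measurement.

```c
/* Average floating-point operations per clock per compute unit implied by
 * a measured throughput. gflops: measured GFLOPS; mhz: clock in MHz;
 * units: number of compute units. Returns 0 for invalid input. */
double flops_per_cycle_per_unit(double gflops, double mhz, int units) {
    if (mhz <= 0.0 || units <= 0) return 0.0;
    return (gflops * 1e9) / (mhz * 1e6 * (double)units);
}

/* Fraction of a reference throughput reached by another measurement. */
double fraction_of_peak(double measured_gflops, double peak_gflops) {
    if (peak_gflops <= 0.0) return 0.0;
    return measured_gflops / peak_gflops;
}
```

With the numbers above, `flops_per_cycle_per_unit(441.77, 1000.0, 4)` comes to roughly 110 operations per cycle per compute unit, and `fraction_of_peak(17.660418, 441.77)` shows that the tiled kernel from step 3 reaches about 4% of the scalar float figure.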

5. Measure CPU compute with OpenBLAS

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make TARGET=ARMV8
make install
cd benchmark
make TARGET=ARMV8 sgemm
cc sgemm.o -o sgemm /opt/OpenBLAS/lib/libopenblas.so -Wl,-rpath=/opt/OpenBLAS/lib/
export OPENBLAS_NUM_THREADS=8
export OPENBLAS_LOOPS=10
export OPENBLAS_PARAM_M=8192
export OPENBLAS_PARAM_N=8192
export OPENBLAS_PARAM_K=8192
./sgemm

Output

From :   1  To : 200 Step=1 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=8192, N=8192, K=8192 :    53485.68 MFlops 205.571220 sec
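
The two columns of the benchmark line can be cross-checked with the usual GEMM operation count of 2·M·N·K. The sketch below uses hypothetical function names. The reported MFlops figure matches the Time column only when that time is read as the total for all OPENBLAS_LOOPS=10 calls, which is an inference from the numbers rather than something the benchmark states.

```c
/* Floating-point operations in one GEMM call: one multiply and one add
 * for every (m, n, k) triple. */
double gemm_flop_count(double m, double n, double k) {
    return 2.0 * m * n * k;
}

/* GFLOPS for `loops` GEMM calls that took `seconds` in total.
 * Returns 0 for invalid input. */
double gemm_gflops(double m, double n, double k, int loops, double seconds) {
    if (seconds <= 0.0 || loops <= 0) return 0.0;
    return gemm_flop_count(m, n, k) * (double)loops / seconds / 1e9;
}
```

For example, `gemm_gflops(8192, 8192, 8192, 10, 205.571220)` gives about 53.49 GFLOPS, in line with the 53485.68 MFlops printed above, and the same function applied to the single OpenCL launch in step 3 (`loops = 1`, 62.258528 s) reproduces its 17.66 GFLOPS.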

6. Measure OpenCV resize performance on the CPU and with OpenCL

A. Build OpenCV with OpenCL support

  • OpenCV change (link against libmali.so)
diff --git a/cmake/OpenCVDetectOpenCL.cmake b/cmake/OpenCVDetectOpenCL.cmake
index 6ab2cae070..c3cf235e45 100644
--- a/cmake/OpenCVDetectOpenCL.cmake
+++ b/cmake/OpenCVDetectOpenCL.cmake
@@ -3,9 +3,8 @@ if(APPLE)
   set(OPENCL_LIBRARY "-framework OpenCL" CACHE STRING "OpenCL library")
   set(OPENCL_INCLUDE_DIR "" CACHE PATH "OpenCL include directory")
 else()
-  set(OPENCL_LIBRARY "" CACHE STRING "OpenCL library")
-  set(OPENCL_INCLUDE_DIR "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/1.2" CACHE PATH "OpenCL include directory")
-  ocv_install_3rdparty_licenses(opencl-headers "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/LICENSE.txt")
+  set(OPENCL_LIBRARY "/usr/lib/aarch64-linux-gnu/libmali.so")
+  set(OPENCL_INCLUDE_DIR "/usr/include")
 endif()
 mark_as_advanced(OPENCL_INCLUDE_DIR OPENCL_LIBRARY)
  • Build OpenCV
git clone https://github.com/opencv/opencv.git
cd opencv
git checkout bdb6a968ce69a2bf7c34724f9052c20e941ab47b
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_INSTALL_PREFIX=`pwd`/_install \
      -DWITH_OPENCL=ON -DWITH_NEON=ON \
      -DBUILD_SHARED_LIBS=ON \
      -D BUILD_opencv_world=ON -DBUILD_TESTS=OFF -DBUILD_EXAMPLES=OFF -DBUILD_opencv_apps=OFF \
      -DBUILD_opencv_dnn=OFF -DBUILD_opencv_calib3d=OFF \
      -DBUILD_opencv_imgproc=ON -DBUILD_opencv_imgcodecs=ON ..
make -j4
make install

B. Run the OpenCV test program

cat > opencv_resize.cpp <<-'EOF'
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>
#include <string>
#include <vector>

void run(int resize_mode)
{
    // Create a random 32x32 image
    cv::Mat src = cv::Mat::zeros(32, 32, CV_8UC3);
    cv::randu(src, cv::Scalar::all(0), cv::Scalar::all(255));

    // ------------------------------------
    // Run on the CPU
    // ------------------------------------
    cv::ocl::setUseOpenCL(false);

    cv::Mat enlarged_cpu, resized_back_cpu;

    // Record the start time of the CPU loop
    int64 start_time_cpu = cv::getTickCount();
    for(int i=0;i<100;i++)
    {
        // Upscale to 8192x8192
        cv::resize(src, enlarged_cpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Downscale back to 32x32
        cv::resize(enlarged_cpu, resized_back_cpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // Record the end time of the CPU loop
    int64 end_time_cpu = cv::getTickCount();

    // Elapsed time of the whole CPU loop (upscale plus downscale, 100 iterations)
    double time_resize_cpu = (end_time_cpu - start_time_cpu) / cv::getTickFrequency();

    // ------------------------------------
    // Run on the GPU (OpenCL)
    // ------------------------------------
    cv::ocl::setUseOpenCL(true);

    cv::UMat src_umat;
    src.copyTo(src_umat);

    cv::UMat enlarged_gpu, resized_back_gpu;

    // Record the start time of the GPU loop
    int64 start_time_gpu = cv::getTickCount();
    for(int i=0;i<100;i++)
    {
        // Upscale to 8192x8192
        cv::resize(src_umat, enlarged_gpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Downscale back to 32x32
        cv::resize(enlarged_gpu, resized_back_gpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // Record the end time of the GPU loop.
    // Note: no cv::ocl::finish() is called before this timestamp, so work
    // still queued on the device may not be counted in the GPU figure.
    int64 end_time_gpu = cv::getTickCount();

    // Elapsed time of the whole GPU loop (upscale plus downscale, 100 iterations)
    double time_resize_gpu = (end_time_gpu - start_time_gpu) / cv::getTickFrequency();
    std::cout << "CPU time (s): " << time_resize_cpu << " " << "GPU time (s): " << time_resize_gpu << std::endl;
}

int main() {
    // Check whether the system supports OpenCL
    if (!cv::ocl::haveOpenCL()) {
        std::cout << "OpenCL is not supported on this system." << std::endl;
        return -1;
    }

    // Print the OpenCL device information
    cv::ocl::Context context;
    if (!context.create(cv::ocl::Device::TYPE_GPU)) {
        std::cout << "No usable GPU device found; running on the CPU." << std::endl;
    } else {
        cv::ocl::Device device = cv::ocl::Device::getDefault();
        std::cout << "OpenCL device in use: " << device.name() << std::endl;
    }
    // Interpolation methods to test
    std::vector<int> interpolation_methods = {
        cv::INTER_NEAREST,
        cv::INTER_LINEAR,
        cv::INTER_CUBIC,
        cv::INTER_AREA,
        cv::INTER_LANCZOS4
    };

    // Names of the interpolation methods, used in the output
    std::vector<std::string> interpolation_names = {
        "INTER_NEAREST",
        "INTER_LINEAR",
        "INTER_CUBIC",
        "INTER_AREA",
        "INTER_LANCZOS4"
    };
    for (size_t i = 0; i < interpolation_methods.size(); ++i) {
        int interpolation = interpolation_methods[i];
        std::string method_name = interpolation_names[i];
        std::cout << "Interpolation: " << method_name << " ";
        run(interpolation);
    }
    return 0;
}
EOF
g++ -o opencv_resize opencv_resize.cpp -I _install/include/opencv4 \
    _install/lib/libopencv_world.so -Wl,-rpath=_install/lib
export OPENBLAS_NUM_THREADS=8
./opencv_resize

Output

arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
OpenCL device in use: Mali-LODX r0p0
Interpolation: INTER_NEAREST CPU time (s): 3.01526 GPU time (s): 0.0672681
Interpolation: INTER_LINEAR CPU time (s): 5.3227 GPU time (s): 0.0189366
Interpolation: INTER_CUBIC CPU time (s): 8.22734 GPU time (s): 11.6337
Interpolation: INTER_AREA CPU time (s): 20.4999 GPU time (s): 27.3197
Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484
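
The CPU and GPU columns are easier to compare as ratios. The sketch below stores the timings printed above and divides CPU time by GPU time for each method. The struct and function names are hypothetical, and the ratios only restate the measured figures.

```c
#include <stdio.h>

/* One row of the resize timing table (seconds for 100 iterations). */
struct resize_timing {
    const char *method;
    double cpu_s;
    double gpu_s;
};

/* Timings copied from the program output above. */
static const struct resize_timing measured[] = {
    {"INTER_NEAREST",  3.01526,  0.0672681},
    {"INTER_LINEAR",   5.3227,   0.0189366},
    {"INTER_CUBIC",    8.22734, 11.6337},
    {"INTER_AREA",    20.4999,  27.3197},
    {"INTER_LANCZOS4", 29.3602, 43.9484},
};

/* CPU time divided by GPU time; values above 1 mean the OpenCL path
 * reported a shorter time. Returns 0 if gpu_s is not positive. */
double speedup(double cpu_s, double gpu_s) {
    return gpu_s > 0.0 ? cpu_s / gpu_s : 0.0;
}

/* Print one ratio per row. */
void print_speedups(const struct resize_timing *rows, int count) {
    for (int i = 0; i < count; ++i)
        printf("%-15s CPU/GPU = %.2f\n", rows[i].method,
               speedup(rows[i].cpu_s, rows[i].gpu_s));
}
```

Calling `print_speedups(measured, 5)` shows ratios above 1 for INTER_NEAREST and INTER_LINEAR and below 1 for INTER_CUBIC, INTER_AREA and INTER_LANCZOS4.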
