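Performance comparison of CPU and GPU compute, and of OpenCV resize, on the RK3588
- I. Background
- II. Summary
- III. Related links
- IV. Steps
- 1. Environment setup
- A. Install dependencies
- B. Set the GPU to performance mode
- C. Get GPU information
- D. Get CPU information
- 2. Query GPU information through the OpenCL API
- 3. Matrix multiplication with the OpenCL API
- 4. Measure GPU performance with clpeak
- 5. Measure CPU compute with OpenBLAS
- 6. Measure OpenCV resize performance on the CPU and with OpenCL
- A. Build OpenCV with OpenCL support
- B. Run the OpenCV test program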
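I. Background
- The goal is to compare the performance of the CPU and the Mali GPU on the RK3588
- Mali GPU compute throughput is measured with clpeak
- CPU FP32 throughput is measured with OpenBLAS (NEON optimizations enabled)
- OpenCV resize is timed on the CPU and through OpenCL for several interpolation methods: a 32x32 image is upscaled to 8192x8192 and then scaled back down to 32x32, repeated 100 times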
II. Summary
- GPU model: Mali-G610 (4 cores, r0p0, 0xA867), reported by OpenCL as Mali-LODX r0p0
- GPU FP32 (clpeak): 441.95 GFLOPS
- CPU FP32 (OpenBLAS + NEON): 53.68 GFLOPS
- Interpolation: INTER_NEAREST CPU time (s): 3.01526 GPU time (s): 0.0672681
- Interpolation: INTER_LINEAR CPU time (s): 5.3227 GPU time (s): 0.0189366
- Interpolation: INTER_CUBIC CPU time (s): 8.22734 GPU time (s): 11.6337
- Interpolation: INTER_AREA CPU time (s): 20.4999 GPU time (s): 27.3197
- Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484
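The figures in this summary are easier to compare as ratios. The following is a minimal sketch of that arithmetic; the helper names `ratio` and `print_summary_ratios` are illustrative only, and the inputs are the numbers listed in the summary.

```c
#include <stdio.h>

/* Hypothetical helper: how many times larger a is than b. */
double ratio(double a, double b)
{
    return a / b;
}

/* Print the peak FP32 ratio and the per-method CPU/GPU time ratios,
 * using the numbers reported in the summary. */
void print_summary_ratios(void)
{
    static const char *names[] = {
        "INTER_NEAREST", "INTER_LINEAR", "INTER_CUBIC",
        "INTER_AREA", "INTER_LANCZOS4"
    };
    static const double cpu_s[] = {3.01526, 5.3227, 8.22734, 20.4999, 29.3602};
    static const double gpu_s[] = {0.0672681, 0.0189366, 11.6337, 27.3197, 43.9484};

    printf("GPU / CPU peak FP32: %.2fx\n", ratio(441.95, 53.68));
    for (int i = 0; i < 5; ++i)
        printf("%-15s CPU time / GPU time: %.2f\n",
               names[i], ratio(cpu_s[i], gpu_s[i]));
}
```

A ratio above 1 means the OpenCL path finished the timed loop sooner than the CPU path. For INTER_CUBIC, INTER_AREA and INTER_LANCZOS4 the ratio is below 1, so the OpenCL path was slower than the CPU in this test.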
III. Related links
- Building OpenCV
IV. Steps
1. Environment setup
A. Install dependencies
mv /lib/aarch64-linux-gnu/libOpenCL.so.1 /lib/aarch64-linux-gnu/libOpenCL.so.1.bk
ln -s /usr/lib/aarch64-linux-gnu/libmali.so /lib/aarch64-linux-gnu/libOpenCL.so.1
sudo apt install opencl-headers
sudo apt install ocl-icd-libopencl1
sudo apt install ocl-icd-opencl-dev
sudo apt install clinfo
B. Set the GPU to performance mode
echo performance> /sys/class/devfreq/fb000000.gpu/governor
echo performance> /sys/class/devfreq/fdab0000.npu/governor
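To confirm that the governor change took effect, the sysfs attribute can be read back. The sketch below is a minimal example of doing that from C. The helper names `read_first_line` and `governor_is_performance` are hypothetical, and the path passed in is assumed to be a devfreq governor file such as the one written above.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a text file (for example a sysfs attribute)
 * into buf, stripping the trailing newline.
 * Returns 0 on success, -1 on failure. */
int read_first_line(const char *path, char *buf, size_t len)
{
    if (len == 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the governor file reads "performance",
 * 0 if it reads something else, -1 if it cannot be read. */
int governor_is_performance(const char *governor_path)
{
    char value[64];
    if (read_first_line(governor_path, value, sizeof value) != 0)
        return -1;
    return strcmp(value, "performance") == 0 ? 1 : 0;
}
```

On the board used here, the call would be `governor_is_performance("/sys/class/devfreq/fb000000.gpu/governor")`. The same reader can be pointed at the `cur_freq` file in that directory to see the current clock.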
C. Get GPU information
cat /sys/class/misc/mali0/device/gpuinfo
clinfo
Output
Mali-G610 4 cores r0p0 0xA867
Number of platforms 1
Platform Name ARM Platform
Platform Vendor ARM
Platform Version OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
Platform Host timer resolution 1ns
Platform Extensions function suffix ARM
Platform Name ARM Platform
Number of devices 1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
Device Name Mali-LODX r0p0
Device Vendor ARM
Device Vendor ID 0xa8670000
Device Version OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
Driver Version 2.1
Device OpenCL C Version OpenCL C 2.0 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
Device Type GPU
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 4
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 0
Supported partition types None
Supported affinity domains (n/a)
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 1024
Preferred work group size multiple 16
Max sub-groups per work group 64
Preferred / native vector sizes
char 16 / 4
short 8 / 2
int 4 / 1
long 2 / 1
half 8 / 2 (cl_khr_fp16)
float 4 / 1
double 0 / 0 (n/a)
Half-precision Floating-point support (cl_khr_fp16)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 64, Little-Endian
Global memory size 16643870720 (15.5GiB)
Error Correction support No
Max memory allocation 16643870720 (15.5GiB)
Unified memory for Host and Device Yes
Shared Virtual Memory (SVM) capabilities (core)
Coarse-grained buffer sharing Yes
Fine-grained buffer sharing No
Fine-grained system sharing No
Atomics No
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Preferred alignment for atomics
SVM 0 bytes
Global 0 bytes
Local 0 bytes
Max size for global variable 65536 (64KiB)
Preferred total size of global vars 0
Global Memory cache type Read/Write
Global Memory cache size 1048576 (1024KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 32 bytes
Pitch alignment for 2D image buffers 64 pixels
Max 2D image size 65536x65536 pixels
Max 3D image size 65536x65536x65536 pixels
Max number of read image args 128
Max number of write image args 64
Max number of read/write image args 64
Max number of pipe args 16
Max active pipe reservations 1
Max pipe packet size 1024
Local memory type Global
Local memory size 32768 (32KiB)
Max number of constant args 128
Max constant buffer size 16643870720 (15.5GiB)
Max size of kernel argument 1024
Queue properties (on host)
Out-of-order execution Yes
Profiling Yes
Queue properties (on device)
Out-of-order execution Yes
Profiling Yes
Preferred size 2097152 (2MiB)
Max size 16777216 (16MiB)
Max queues on device 1
Max events on device 1024
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Sub-group independent forward progress Yes
IL version SPIR-V_1.0
SPIR versions <printDeviceInfo:161: get CL_DEVICE_SPIR_VERSIONS size : error -30>
printf() buffer size 1048576 (1024KiB)
Built-in kernels (n/a)
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) ARM Platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [ARM]
clCreateContext(NULL, ...) [default] Success [ARM]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name ARM Platform
Device Name Mali-LODX r0p0
D. Get CPU information
lscpu
Output
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 3
Vendor ID: ARM
Model: 0
Model name: Cortex-A55
Stepping: r2p0
CPU max MHz: 2208.0000
CPU min MHz: 408.0000
BogoMIPS: 48.00
L1d cache: 256 KiB
L1i cache: 256 KiB
L2 cache: 1 MiB
L3 cache: 3 MiB
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
2. Query GPU information through the OpenCL API
cat > cl_query.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main() {
    cl_platform_id *platforms = NULL;
    cl_uint num_platforms = 0;
    // Get the number of available platforms
    cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
    if (clStatus != CL_SUCCESS || num_platforms == 0) {
        printf("No OpenCL platform found (status %d)\n", clStatus);
        return 1;
    }
    platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * num_platforms);
    // Get all platform IDs
    clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);
    printf("Number of OpenCL platforms: %u\n", num_platforms);
    // Iterate over the platforms
    for (cl_uint i = 0; i < num_platforms; ++i) {
        char buffer[10240];
        printf("\nPlatform %u:\n", i+1);
        // Platform name
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(buffer), buffer, NULL);
        printf("  Name: %s\n", buffer);
        // Platform vendor
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(buffer), buffer, NULL);
        printf("  Vendor: %s\n", buffer);
        // Platform version
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(buffer), buffer, NULL);
        printf("  Version: %s\n", buffer);
        // Number of devices on this platform
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
        cl_device_id *devices = (cl_device_id*) malloc(sizeof(cl_device_id) * num_devices);
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);
        // Iterate over the devices
        for (cl_uint j = 0; j < num_devices; ++j) {
            printf("  Device %u:\n", j+1);
            // Device name
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL);
            printf("    Name: %s\n", buffer);
            // Device type (a bit field, so more than one flag may be set)
            cl_device_type device_type;
            clGetDeviceInfo(devices[j], CL_DEVICE_TYPE, sizeof(device_type), &device_type, NULL);
            if (device_type & CL_DEVICE_TYPE_CPU)
                printf("    Type: CPU\n");
            if (device_type & CL_DEVICE_TYPE_GPU)
                printf("    Type: GPU\n");
            if (device_type & CL_DEVICE_TYPE_ACCELERATOR)
                printf("    Type: Accelerator\n");
            // Number of compute units
            cl_uint compute_units;
            clGetDeviceInfo(devices[j], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
            printf("    Compute units: %u\n", compute_units);
            // Global memory size
            cl_ulong global_mem;
            clGetDeviceInfo(devices[j], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
            printf("    Global memory size: %llu MB\n", (unsigned long long)(global_mem / (1024 * 1024)));
        }
        free(devices);
    }
    free(platforms);
    return 0;
}
EOF
gcc -o cl_query cl_query.c -lOpenCL
./cl_query
Output
Number of OpenCL platforms: 1
Platform 1:
  Name: ARM Platform
  Vendor: ARM
  Version: OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device 1:
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
    Name: Mali-LODX r0p0
    Type: GPU
    Compute units: 4
    Global memory size: 15872 MB
3. Matrix multiplication with the OpenCL API
cat > matmul.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <time.h>
#include <sys/time.h>

#define MATRIX_SIZE 8192
#define TILE_SIZE 32

// Return the current wall-clock time in seconds, used to measure elapsed time
double get_current_time() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)(tp.tv_sec) + (double)(tp.tv_usec) / 1e6;
}

// Stringify TILE_SIZE so it can be pasted into the kernel source
#define xstr(s) str(s)
#define str(s) #s

// Tiled matrix multiplication: each work-group loads one TILE_SIZE x TILE_SIZE
// tile of A and B into local memory, then accumulates the partial products
const char *kernelSource = "\n" \
"__kernel void mat_mul_optimized(const int N,\n" \
"                                __global float* A,\n" \
"                                __global float* B,\n" \
"                                __global float* C) {\n" \
"    const int TILE_SIZE = " xstr(TILE_SIZE) ";\n" \
"    __local float Asub[TILE_SIZE][TILE_SIZE];\n" \
"    __local float Bsub[TILE_SIZE][TILE_SIZE];\n" \
"    int global_row = get_global_id(1);\n" \
"    int global_col = get_global_id(0);\n" \
"    int local_row = get_local_id(1);\n" \
"    int local_col = get_local_id(0);\n" \
"    float sum = 0.0f;\n" \
"    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;\n" \
"    for (int t = 0; t < numTiles; ++t) {\n" \
"        int tiled_row = global_row;\n" \
"        int tiled_col = t * TILE_SIZE + local_col;\n" \
"        if (tiled_row < N && tiled_col < N)\n" \
"            Asub[local_row][local_col] = A[tiled_row * N + tiled_col];\n" \
"        else\n" \
"            Asub[local_row][local_col] = 0.0f;\n" \
"        tiled_row = t * TILE_SIZE + local_row;\n" \
"        tiled_col = global_col;\n" \
"        if (tiled_row < N && tiled_col < N)\n" \
"            Bsub[local_row][local_col] = B[tiled_row * N + tiled_col];\n" \
"        else\n" \
"            Bsub[local_row][local_col] = 0.0f;\n" \
"        barrier(CLK_LOCAL_MEM_FENCE);\n" \
"        for (int k = 0; k < TILE_SIZE; ++k) {\n" \
"            sum += Asub[local_row][k] * Bsub[k][local_col];\n" \
"        }\n" \
"        barrier(CLK_LOCAL_MEM_FENCE);\n" \
"    }\n" \
"    if (global_row < N && global_col < N)\n" \
"        C[global_row * N + global_col] = sum;\n" \
"}\n";

int main() {
    int N = MATRIX_SIZE;
    size_t bytes = (size_t)N * N * sizeof(float);
    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    // Initialize the matrices
    for(int i = 0; i < N*N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }
    // Get the platform and device
    cl_platform_id platformId = NULL;
    cl_device_id deviceID = NULL;
    cl_uint retNumDevices;
    cl_uint retNumPlatforms;
    cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
    ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);
    // Create the OpenCL context
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &ret);
    // Create the command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);
    // Create the device buffers
    cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &ret);
    // Copy the input data into the buffers
    ret = clEnqueueWriteBuffer(commandQueue, d_A, CL_TRUE, 0, bytes, h_A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(commandQueue, d_B, CL_TRUE, 0, bytes, h_B, 0, NULL, NULL);
    // Start of the compile timing
    double compile_start = get_current_time();
    // Create the program object
    cl_program program = clCreateProgramWithSource(context, 1, (const char**)&kernelSource, NULL, &ret);
    // Build the kernel program
    ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);
    // Report build errors
    if (ret != CL_SUCCESS) {
        size_t log_size;
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        printf("CL Compilation failed:\n%s\n", log);
        free(log);
        return 1;
    }
    // End of the compile timing
    double compile_end = get_current_time();
    double compile_time = compile_end - compile_start;
    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "mat_mul_optimized", &ret);
    // Set the kernel arguments
    ret = clSetKernelArg(kernel, 0, sizeof(int), (void*)&N);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_A);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_B);
    ret = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&d_C);
    // Global and local work sizes (global size rounded up to a multiple of TILE_SIZE)
    size_t local[2] = {TILE_SIZE, TILE_SIZE};
    size_t global[2] = {(size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE,
                        (size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE};
    // Start time of the first kernel launch
    double launch_start = get_current_time();
    // Launch the kernel
    ret = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global, local, 0, NULL, NULL);
    printf("clEnqueueNDRangeKernel:%d\n",ret);
    // Wait for the command queue to finish
    clFinish(commandQueue);
    // End time of the first kernel launch
    double launch_end = get_current_time();
    double launch_time = launch_end - launch_start;
    // Read back the result
    ret = clEnqueueReadBuffer(commandQueue, d_C, CL_TRUE, 0, bytes, h_C, 0, NULL, NULL);
    // Compute GFLOPS: an N x N x N matrix product performs 2*N^3 operations
    double total_ops = 2.0 * N * N * N;
    double gflops = (total_ops / 1e9) / launch_time;
    // Print the results
    printf("Compile time: %f s\n", compile_time);
    printf("First kernel execution time: %f s\n", launch_time);
    printf("Compute performance: %f GFLOPS\n", gflops);
    // Release resources
    ret = clFlush(commandQueue);
    ret = clFinish(commandQueue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(d_A);
    ret = clReleaseMemObject(d_B);
    ret = clReleaseMemObject(d_C);
    ret = clReleaseCommandQueue(commandQueue);
    ret = clReleaseContext(context);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
EOF
gcc -o matmul matmul.c -lOpenCL
./matmul
Output
Compile time: 0.031085 s
First kernel execution time: 62.258528 s
Compute performance: 17.660418 GFLOPS
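The matmul program reads `h_C` back but never checks it. Because both input matrices are filled with 1.0f, every element of the product should equal N (8192, which float represents exactly). A minimal sketch of such a check follows; `count_mismatches` is a hypothetical helper and is not part of the original program.

```c
#include <stddef.h>

/* Count the elements of c[0..count) that differ from `expected`
 * by more than `tol`. NaN values are counted as mismatches. */
size_t count_mismatches(const float *c, size_t count, float expected, float tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < count; ++i) {
        float diff = c[i] - expected;
        if (diff < 0.0f)
            diff = -diff;
        /* Written as !(diff <= tol) so that NaN is treated as a mismatch. */
        if (!(diff <= tol))
            ++bad;
    }
    return bad;
}
```

In `matmul.c` this could be called after `clEnqueueReadBuffer` as `count_mismatches(h_C, (size_t)N * N, (float)N, 0.0f)`; a correct run would be expected to return 0.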
4. Measure GPU performance with clpeak
git clone https://gitcode.com/gh_mirrors/cl/clpeak.git
cd clpeak
git submodule update --init --recursive --remote
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./clpeak
Output
Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
Device: Mali-LODX r0p0
Driver version : 2.1 (Linux ARM64)
Compute units : 4
Clock frequency : 1000 MHz
Global memory bandwidth (GBPS)
float : 25.71
float2 : 24.45
float4 : 23.70
float8 : 12.05
float16 : 12.01
Single-precision compute (GFLOPS)
float : 441.77
float2 : 470.27
float4 : 466.52
float8 : 435.65
float16 : 411.38
Half-precision compute (GFLOPS)
half : 441.96
half2 : 878.25
half4 : 911.51
half8 : 886.19
half16 : 846.44
No double precision support! Skipped
Integer compute (GIOPS)
int : 124.96
int2 : 125.71
int4 : 125.16
int8 : 123.82
int16 : 124.24
Integer compute Fast 24bit (GIOPS)
int : 125.16
int2 : 125.63
int4 : 125.20
int8 : 123.73
int16 : 124.33
Integer char (8bit) compute (GIOPS)
char : 126.47
char2 : 251.55
char4 : 498.03
char8 : 497.37
char16 : 491.94
Integer short (16bit) compute (GIOPS)
short : 126.31
short2 : 250.90
short4 : 249.47
short8 : 248.51
short16 : 245.30
Transfer bandwidth (GBPS)
enqueueWriteBuffer : 8.54
enqueueReadBuffer : 9.97
enqueueWriteBuffer non-blocking : 8.55
enqueueReadBuffer non-blocking : 9.99
enqueueMapBuffer(for read) : 61.66
memcpy from mapped ptr : 11.95
enqueueUnmap(after write) : 62.02
memcpy to mapped ptr : 11.89
Kernel launch latency : 26.81 us
5. Measure CPU compute with OpenBLAS
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make TARGET=ARMV8
make install
cd benchmark
make TARGET=ARMV8 sgemm
cc sgemm.o -o sgemm /opt/OpenBLAS/lib/libopenblas.so -Wl,-rpath=/opt/OpenBLAS/lib/
export OPENBLAS_NUM_THREADS=8
export OPENBLAS_LOOPS=10
export OPENBLAS_PARAM_M=8192
export OPENBLAS_PARAM_N=8192
export OPENBLAS_PARAM_K=8192
./sgemm
Output
From : 1 To : 200 Step=1 : Transa=N : Transb=N
SIZE Flops Time
M=8192, N=8192, K=8192 : 53485.68 MFlops 205.571220 sec
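The reported throughput can be cross-checked from the matrix size and the elapsed time. The sketch below assumes that the printed time covers all `OPENBLAS_LOOPS=10` iterations. That is an inference from the numbers rather than something the benchmark output states, but under that assumption the arithmetic reproduces the reported 53485.68 MFlops. `gemm_mflops` is an illustrative helper name.

```c
/* MFlops achieved by `loops` GEMM calls of size m x n x k that took
 * total_seconds in total. One GEMM performs 2*m*n*k floating-point
 * operations (one multiply and one add per inner-product term). */
double gemm_mflops(double m, double n, double k, int loops, double total_seconds)
{
    double flops = 2.0 * m * n * k * (double)loops;
    return flops / total_seconds * 1e-6;
}
```

`gemm_mflops(8192, 8192, 8192, 10, 205.571220)` evaluates to about 53485.68. The same formula with one loop and 62.258528 s gives about 17660 MFlops, which matches the 17.66 GFLOPS printed by the OpenCL matrix multiplication in step 3.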
6. Measure OpenCV resize performance on the CPU and with OpenCL
A. Build OpenCV with OpenCL support
- OpenCV change (link against libmali.so)
diff --git a/cmake/OpenCVDetectOpenCL.cmake b/cmake/OpenCVDetectOpenCL.cmake
index 6ab2cae070..c3cf235e45 100644
--- a/cmake/OpenCVDetectOpenCL.cmake
+++ b/cmake/OpenCVDetectOpenCL.cmake
@@ -3,9 +3,8 @@ if(APPLE)
set(OPENCL_LIBRARY "-framework OpenCL" CACHE STRING "OpenCL library")
set(OPENCL_INCLUDE_DIR "" CACHE PATH "OpenCL include directory")
else()
- set(OPENCL_LIBRARY "" CACHE STRING "OpenCL library")
- set(OPENCL_INCLUDE_DIR "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/1.2" CACHE PATH "OpenCL include directory")
- ocv_install_3rdparty_licenses(opencl-headers "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/LICENSE.txt")
+ set(OPENCL_LIBRARY "/usr/lib/aarch64-linux-gnu/libmali.so")
+ set(OPENCL_INCLUDE_DIR "/usr/include")
endif()
mark_as_advanced(OPENCL_INCLUDE_DIR OPENCL_LIBRARY)
- Build OpenCV
git clone https://github.com/opencv/opencv.git
cd opencv
git checkout bdb6a968ce69a2bf7c34724f9052c20e941ab47b
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=`pwd`/_install \
-DWITH_OPENCL=ON -DWITH_NEON=ON \
-DBUILD_SHARED_LIBS=ON \
-D BUILD_opencv_world=ON -DBUILD_TESTS=OFF -DBUILD_EXAMPLES=OFF -DBUILD_opencv_apps=OFF \
-DBUILD_opencv_dnn=OFF -DBUILD_opencv_calib3d=OFF \
-DBUILD_opencv_imgproc=ON -DBUILD_opencv_imgcodecs=ON ..
make -j4
make install
B. Run the OpenCV test program
cat > opencv_resize.cpp <<-'EOF'
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>
#include <string>
#include <vector>

void run(int resize_mode)
{
    // Create a 32x32 random image
    cv::Mat src = cv::Mat::zeros(32, 32, CV_8UC3);
    cv::randu(src, cv::Scalar::all(0), cv::Scalar::all(255));
    // ------------------------------------
    // Run on the CPU
    // ------------------------------------
    cv::ocl::setUseOpenCL(false);
    cv::Mat enlarged_cpu, resized_back_cpu;
    // Start time of the CPU loop
    int64 start_time_cpu = cv::getTickCount();
    for (int i = 0; i < 100; i++)
    {
        // Upscale to 8192x8192
        cv::resize(src, enlarged_cpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Scale back down to 32x32
        cv::resize(enlarged_cpu, resized_back_cpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // End time of the CPU loop
    int64 end_time_cpu = cv::getTickCount();
    // Elapsed time for all 100 upscale + downscale iterations
    double time_resize_cpu = (end_time_cpu - start_time_cpu) / cv::getTickFrequency();
    // ------------------------------------
    // Run on the GPU (OpenCL)
    // ------------------------------------
    cv::ocl::setUseOpenCL(true);
    cv::UMat src_umat;
    src.copyTo(src_umat);
    cv::UMat enlarged_gpu, resized_back_gpu;
    // Start time of the GPU loop
    int64 start_time_gpu = cv::getTickCount();
    for (int i = 0; i < 100; i++)
    {
        // Upscale to 8192x8192
        cv::resize(src_umat, enlarged_gpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Scale back down to 32x32
        cv::resize(enlarged_gpu, resized_back_gpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // End time of the GPU loop
    // NOTE: nothing here waits for queued OpenCL work (for example via
    // cv::ocl::finish()), so this figure may not include kernel execution time
    int64 end_time_gpu = cv::getTickCount();
    // Elapsed time for all 100 upscale + downscale iterations
    double time_resize_gpu = (end_time_gpu - start_time_gpu) / cv::getTickFrequency();
    std::cout << "CPU time (s): " << time_resize_cpu << " " << "GPU time (s): " << time_resize_gpu << std::endl;
}

int main() {
    // Check whether the system supports OpenCL
    if (!cv::ocl::haveOpenCL()) {
        std::cout << "OpenCL is not supported on this system." << std::endl;
        return -1;
    }
    // Print the OpenCL device in use
    cv::ocl::Context context;
    if (!context.create(cv::ocl::Device::TYPE_GPU)) {
        std::cout << "No usable GPU device found; running on the CPU." << std::endl;
    } else {
        cv::ocl::Device device = cv::ocl::Device::getDefault();
        std::cout << "OpenCL device in use: " << device.name() << std::endl;
    }
    // Interpolation methods to test
    std::vector<int> interpolation_methods = {
        cv::INTER_NEAREST,
        cv::INTER_LINEAR,
        cv::INTER_CUBIC,
        cv::INTER_AREA,
        cv::INTER_LANCZOS4
    };
    // Names of the interpolation methods, used when printing results
    std::vector<std::string> interpolation_names = {
        "INTER_NEAREST",
        "INTER_LINEAR",
        "INTER_CUBIC",
        "INTER_AREA",
        "INTER_LANCZOS4"
    };
    for (size_t i = 0; i < interpolation_methods.size(); ++i) {
        int interpolation = interpolation_methods[i];
        std::string method_name = interpolation_names[i];
        std::cout << "Interpolation: " << method_name << " ";
        run(interpolation);
    }
    return 0;
}
EOF
g++ -o opencv_resize opencv_resize.cpp -I _install/include/opencv4 \
_install/lib/libopencv_world.so -Wl,-rpath=_install/lib
export OPENBLAS_NUM_THREADS=8
./opencv_resize
Output
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
OpenCL device in use: Mali-LODX r0p0
Interpolation: INTER_NEAREST CPU time (s): 3.01526 GPU time (s): 0.0672681
Interpolation: INTER_LINEAR CPU time (s): 5.3227 GPU time (s): 0.0189366
Interpolation: INTER_CUBIC CPU time (s): 8.22734 GPU time (s): 11.6337
Interpolation: INTER_AREA CPU time (s): 20.4999 GPU time (s): 27.3197
Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484
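The GPU timings for INTER_NEAREST and INTER_LINEAR deserve a sanity check. Each iteration writes an 8192x8192, 3-channel, 8-bit image, so 100 iterations write about 20.1 GB. At the 25.71 GB/s global memory bandwidth that clpeak measured, the writes alone would take roughly 0.78 s, far more than the 0.067 s and 0.019 s reported. One plausible explanation (an assumption, not verified here) is that OpenCV enqueues the OpenCL work asynchronously and the timed loop ends before the kernels finish, since the test program does not synchronize before reading the end time. The sketch below only reproduces the arithmetic; the helper names are illustrative, and treating 1 GB as 1e9 bytes is an assumption about clpeak's units.

```c
/* Bytes written when producing `iterations` 8-bit images of
 * width x height pixels with `channels` channels each. */
double bytes_written(double width, double height, double channels, double iterations)
{
    return width * height * channels * iterations;
}

/* Minimum time in seconds needed to move `bytes` at `gb_per_s`
 * (1 GB is taken as 1e9 bytes, an assumption). */
double min_seconds(double bytes, double gb_per_s)
{
    return bytes / (gb_per_s * 1e9);
}
```

`min_seconds(bytes_written(8192, 8192, 3, 100), 25.71)` evaluates to about 0.78. If the asynchronous-execution explanation holds, calling `cv::ocl::finish()` before taking the end timestamp would make the GPU figures comparable with the CPU ones.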