【项目记录】大模型基于llama.cpp在Qemu-riscv64向量扩展指令下的部署

news2025/1/11 9:50:18

概述

本文在qemu-riscv64平台上,利用向量扩展指令加速运行基于llama.cpp构建的大模型。
参考博客链接:
Accelerating llama.cpp with RISC-V Vector Extension
基于RVV的llama.cpp在Banana Pi F3 RISCV开发板上的演示

llama.cpp工程

Llama.cpp是一个基于C++编写的高性能大模型推理框架,旨在提供快速、稳定且易于使用的计算工具,Llama.cpp支持多种计算模式,包括向量计算、矩阵运算、图算法等,可广泛应用于机器学习、图像处理、数据分析等领域。

目录结构

src是构建模型架构的基础库文件夹
examples是部分案例模型的源文件
ggml是计算操作的库文件
在这里插入图片描述

源码分析

可如下参考链接(注意:函数名有所变动):
CodeLeaner@微信公众号:llama.cpp源码解析

llama模型
// llama  @examples/main/main.cpp
main()  
	gpt_params_parse(argc, argv, params, LLAMA_EXAMPLE_MAIN, print_usage)	// 解析传递进来的模型参数
	llama_init_from_gpt_params()
		llama_load_model_from_file(params.model.c_str(), mparams);		// 加载model参数
		llama_new_context_with_model(model, cparams);			// 
			ggml_backend_cpu_init();
				*cpu_backend				// 定义指针指向 cpu_backend_i
		llama_tokenize(ctx, prompt, true, true)				// 将prompt tokenize
	while   // 循环产生token
		   	llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))	// 生成token函数
   			llama_token_to_piece(ctx, id, params.special)
 	gpt_perf_print(ctx, smpl);						// 打印性能结果
llama.cpp模型库文件

// decode 函数体 @src/llama.cpp
int32_t llama_decode(        struct llama_context * ctx,          struct llama_batch   batch)
	llama_decode_internal(*ctx, batch)
		while (lctx.sbatch.n_tokens > 0) 
			llama_build_graph(lctx, ubatch, false);
			llama_graph_compute(lctx, gf, n_threads, threadpool);
				ggml_backend_cpu_set_n_threads(lctx.backend_cpu, n_threads);
        			ggml_backend_cpu_set_threadpool(lctx.backend_cpu, threadpool);
        			ggml_backend_cpu_set_abort_callback(lctx.backend_cpu, lctx.abort_callback, lctx.abort_callback_data);	
				ggml_backend_sched_graph_compute_async(lctx.sched, gf);




ggml计算库函数
// backend计算图执行函数   ggml/src/ggml-backend.c
enum ggml_status ggml_backend_sched_graph_compute_async(ggml_backend_sched_t sched, struct ggml_cgraph * graph)
	ggml_backend_sched_alloc_graph(sched, graph)
		ggml_backend_sched_split_graph(sched, graph);			// 该函数划分graph,并调用cpu_backend进行计算,继而调用ggml_graph_compute计算



// ggml计算图中各类计算选通的主体函数  ggml/src/ggml.c
enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan)
	ggml_graph_compute_thread(&threadpool->workers[0])
		ggml_compute_forward(&params, node);
			ggml_compute_forward_dup(params, tensor);
			ggml_compute_forward_add1(params, tensor);
			ggml_compute_forward_repeat(params, tensor);
			ggml_compute_forward_mul_mat(params, tensor);
			ggml_compute_forward_soft_max(params, tensor);
			ggml_compute_forward_rms_norm(params, tensor);

结构体

// cpu执行数据流结构体,将函数作为结构体成员。
cpu_backend_i	// @ggml/src/ggml-backend.c   
	ggml_backend_cpu_graph_plan_create()
		ggml_graph_plan()
	ggml_backend_cpu_graph_plan_compute()
		ggml_graph_compute()
	ggml_backend_cpu_graph_compute()
		ggml_graph_plan()
		ggml_graph_compute()

llama.cpp中量化方式(Qn_0、Qn_1等)含义

sgsprog@hackmd.io: Linux 核心專題: llama.cpp 效能分析

rvv移植代码分析

Github相关链接:
Llama.cpp中利用GGML中对RVV的支持1
Llama.cpp中利用GGML中对RVV的支持2
Tameem-10xE@llama.cpp Github:Added RISC-V Vector Intrinsics Support

起初移植代码在ggml.c中后续迁移至ggml-quants.c文件中。

修改函数包括12个:
quantize_row_q8_0
quantize_row_q8_1
ggml_vec_dot_q4_0_q8_0
ggml_vec_dot_q4_1_q8_1
ggml_vec_dot_q5_0_q8_0
ggml_vec_dot_q5_1_q8_1
ggml_vec_dot_q8_0_q8_0
ggml_vec_dot_q2_K_q8_K
ggml_vec_dot_q3_K_q8_K
ggml_vec_dot_q4_K_q8_K
ggml_vec_dot_q5_K_q8_K
ggml_vec_dot_q6_K_q8_K

这些函数作为不同量化模式的成员函数:

static const ggml_type_traits_t type_traits[GGML_TYPE_COUNT] = {
    [GGML_TYPE_Q8_0] = {
        .type_name                = "q8_0",
        .blck_size                = QK8_0,
        .type_size                = sizeof(block_q8_0),
        .is_quantized             = true,
        .to_float                 = (ggml_to_float_t) dequantize_row_q8_0,
        .from_float               = quantize_row_q8_0,
        .from_float_ref           = (ggml_from_float_t) quantize_row_q8_0_ref,
        .from_float_to_mat        = quantize_mat_q8_0,
        .vec_dot                  = ggml_vec_dot_q8_0_q8_0,
        .vec_dot_type             = GGML_TYPE_Q8_0,
    ......
}

在ggml不同算子计算函数中被调用:

static void ggml_compute_forward_mul_mat_one_chunk(...)
	ggml_vec_dot_t const vec_dot      = type_traits[type].vec_dot;
	for (int64_t iir1 = ir1_start; iir1 < ir1_end; iir1 += blck_1) {
    		for (int64_t iir0 = ir0_start; iir0 < ir0_end; iir0 += blck_0) {
        		for (int64_t ir1 = iir1; ir1 < iir1 + blck_1 && ir1 < ir1_end; ir1 += num_rows_per_vec_dot) {
						...
						for (int64_t ir0 = iir0; ir0 < iir0 + blck_0 && ir0 < ir0_end; ir0 += num_rows_per_vec_dot) {
               		vec_dot(ne00, &tmp[ir0 - iir0], (num_rows_per_vec_dot > 1 ? 16 : 0), src0_row + ir0 * nb01, (num_rows_per_vec_dot > 1 ? nb01 : 0), src1_col, (num_rows_per_vec_dot > 1 ? src1_col_stride : 0), num_rows_per_vec_dot);
           		}
           		...
					}
			}
	}

2024/10/02: 工具准备OK,但qemu运行时被killed

工具版本

Qemu:
在这里插入图片描述
Gcc版本:
Github Release

llama.cpp:
llama.cpp Github 10月2号pull

llama-7b模型版本:
Huggingface gguf文件

编译

llama.cpp编译

cd llama.cpp
make   RISCV_CROSS_COMPILE=1 

运行命令

qemu-riscv64 -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-server -m /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf -p “Anything” -n 9

问题

命令运行现象

在这里插入图片描述在这里插入图片描述
在这里插入图片描述

可能原因

运行内存可能太小

2024/10/03: 使用10xE团队的最新版,解决tokenizer的问题,但还是被killed

最新版Github链接

Tameem-10xE/llama.cpp Github

问题:运行7B模型被killed

运行现象

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

可能原因

有可能跟qemu运行的swapfile有关
可以团队成员提的一个issue:
Github Issue: qemu-riscv64 unexpectedly reached EOF error

解决办法

先尝试换一个更小的模型试试,不行就解决swapfile的问题

运行3B规模的model的现象:failed to allocate buffer of size

kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf -p "Anything" -n 9
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 27
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:   59 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  137 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_S mix - 3.66 bpw
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.48 GiB (3.96 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =  1518.09 MiB
.....................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 15032385568
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model '/home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf'
main: error: unable to load model

可能原因

llama.cpp Github issue: Bug: ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 137438953504

解决办法

尝试调整模型的参数:
llama.cpp Github参数说明

调整ctx参数后成功运行

kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64,v=true,vlen=256,elen=64,vext_spec=v1.0 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf -p "Anything" -n 9 -c 50
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/Llama-3.2-3B-Instruct-IQ3_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 27
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_K:   59 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  137 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_S mix - 3.66 bpw
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.48 GiB (3.96 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =  1518.09 MiB
.....................................................................
llama_new_context_with_model: n_ctx      = 64
llama_new_context_with_model: n_batch    = 64
llama_new_context_with_model: n_ubatch   = 64
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =     7.00 MiB
llama_new_context_with_model: KV self size  =    7.00 MiB, K (f16):    3.50 MiB, V (f16):    3.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =    32.06 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling seed: 2622092847
sampling params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler constr:
        logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 64, n_batch = 2048, n_predict = 9, n_keep = 1


Anything else?
Yes. You can also consider the
llama_perf_print:    sampling time =      10.96 ms /    11 runs   (    1.00 ms per token,  1003.37 tokens per second)
llama_perf_print:        load time =   73770.30 ms
llama_perf_print: prompt eval time =   15435.63 ms /     2 tokens ( 7717.81 ms per token,     0.13 tokens per second)
llama_perf_print:        eval time =   74267.24 ms /     8 runs   ( 9283.41 ms per token,     0.11 tokens per second)
llama_perf_print:       total time =   89764.85 ms /    10 tokens
Log end

2024/10/05:生成非向量支持的riscv版本llama.cpp,进行对比实验

llama.cpp编译

make CC="riscv64-unknown-linux-gnu-gcc -march=rv64gc -mabi=lp64d" CXX="riscv64-unknown-linux-gnu-g++ -march=rv64gc -mabi=lp64d"
make CC="riscv64-unknown-linux-gnu-gcc -march=rv64gc -mabi=lp64d" CXX="riscv64-unknown-linux-gnu-g++ -march=rv64gc -mabi=lp64d"

报错:riscv64-unknown-linux-gnu-g++: error: ‘-march=native’: ISA string must begin with rv32 or rv64

解决方案:
GCC GitHub:riscv64-unknown-linux-gnu-g++: error: ‘-march=native’: ISA string must begin with rv32 or rv64

即将llama.cpp工程中的Makefile中523和524行修改为rv64gc编译,该部分原先是交叉编译时使能RVV编译,如图所示:

在这里插入图片描述

修正后编译命令

make RISCV_CROSS_COMPILE=1

运行结果

kevin@BRICKHOUSE01:~/data/projects/kg_proj/rvv_transformer/llama.cpp.rv64gc$ qemu-riscv64  -L /home/kevin/data/projects/tools/riscv64_linux_gcc/sysroot  -cpu rv64 ./llama-cli -m /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf -p "Anything" -n 100 -c 1024
Log start
main: build = 3733 (e5701063)
main: built with riscv64-unknown-linux-gnu-gcc () 13.2.0 for riscv64-unknown-linux-gnu
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /home/kevin/data/projects/kg_proj/rvv_transformer/codellama-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = codellama_codellama-7b-hf
llama_model_loader: - kv   2:                       llama.context_length u32              = 16384
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32016]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32016]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32016]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1686 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32016
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name     = codellama_codellama-7b-hf
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: PRE token        = 32007 '▁<PRE>'
llm_load_print_meta: SUF token        = 32008 '▁<SUF>'
llm_load_print_meta: MID token        = 32009 '▁<MID>'
llm_load_print_meta: EOT token        = 32010 '▁<EOT>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3891.33 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =    98.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling seed: 3630221793
sampling params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler constr:
        logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 1024, n_batch = 2048, n_predict = 100, n_keep = 1


 Anything is possible if you believe it. The only thing you are not capable of doing is believing in anything.
- John L. Sullivan



- To be honest, I've never had a problem with the fact that I'm a bad person and I'm a horrible human. I've always been okay with that.
- I think it's a mistake to think of yourself as being in the middle of the world, instead of at the
llama_perf_print:    sampling time =      94.53 ms /   104 runs   (    0.91 ms per token,  1100.12 tokens per second)
llama_perf_print:        load time =   44296.57 ms
llama_perf_print: prompt eval time =   79786.77 ms /     4 tokens (19946.69 ms per token,     0.05 tokens per second)
llama_perf_print:        eval time = 2360379.06 ms /    99 runs   (23842.21 ms per token,     0.04 tokens per second)
llama_perf_print:       total time = 2440911.31 ms /   103 tokens
Log end

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/2195149.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

C语言-指针变量,常量与数组名的细微区别辨析

本节根据两个选择题进行展开辨析 一、例1 本题答案&#xff1a;C 解析&#xff1a;强干扰选项是B&#xff0c;我相信大多数同学都会在B&#xff0c;C之间犹豫好久&#xff0c;那么为什么答案会最终选择C呢&#xff1f;因为本题在定义函数&#xff0c;所以a首先是一个数组名&a…

【学习笔记】一种使用多项式快速计算 sin 和 cos 近似值的方法

一种使用多项式快速计算 sin 和 cos 近似值的方法 在嵌入式开发、游戏开发或其他需要快速数学计算的领域&#xff0c;sin 和 cos 函数的计算时间可能会影响程序的整体性能。特别是在对时间敏感、精度要求不高的场景中&#xff0c;传统的 sin 和 cos 函数由于依赖复杂的数值方法…

SOMEIP_ETS_168: SD_TestFieldUINT8Reliable

测试目的&#xff1a; 验证DUT能够通过Getter和Setter方法正确地发送和接收TestFieldUINT8Reliable字段的值&#xff0c;并且这些操作是可靠的。 描述 本测试用例旨在确保DUT的ETS能够响应Tester的请求&#xff0c;正确地使用Getter方法获取TestFieldUINT8Reliable的值&…

【MySQL必知会】事务

目录 &#x1f308;前言&#x1f308; &#x1f4c1; 事务概念 &#x1f4c1; 事务操作 &#x1f4c1; 事务提交方式 &#x1f4c1; 隔离级别 &#x1f4c1; MVCC &#x1f4c2; 3个隐藏列字段 &#x1f4c2; undo日志 &#x1f4c2; Read View视图 &#x1f4c1; RR和R…

分治算法(5)_归并排序_排序数组

个人主页&#xff1a;C忠实粉丝 欢迎 点赞&#x1f44d; 收藏✨ 留言✉ 加关注&#x1f493;本文由 C忠实粉丝 原创 分治算法(5)_归并排序_排序数组 收录于专栏【经典算法练习】 本专栏旨在分享学习算法的一点学习笔记&#xff0c;欢迎大家在评论区交流讨论&#x1f48c; 目录 …

JavaSE——面向对象11:内部类(局部内部类、匿名内部类、成员内部类、静态内部类)

目录 一、内部类基本介绍 (一)内部类定义 (二)内部类基本语法 (三)内部类代码示例 (四)内部类的分类 二、局部内部类 三、匿名内部类(重要) (一)基本介绍 (二)基于接口的匿名内部类 (三)基于类的匿名内部类 (四)注意事项与使用细节 (五)匿名内部类的最佳实践——当…

leetcode-42. 接雨水 单调栈

给定 n 个非负整数表示每个宽度为 1 的柱子的高度图&#xff0c;计算按此排列的柱子&#xff0c;下雨之后能接多少雨水。 示例 1&#xff1a; 输入&#xff1a;height [0,1,0,2,1,0,1,3,2,1,2,1] 输出&#xff1a;6 解释&#xff1a;上面是由数组 [0,1,0,2,1,0,1,3,2,1,2,1] 表…

Chrome浏览器调用ActiveX控件--allWebOffice控件

背景 allWebOffice控件能够实现在浏览器窗口中在线操作文档的应用&#xff08;阅读、编辑、保存等&#xff09;&#xff0c;支持编辑文档时保留修改痕迹&#xff0c;支持书签位置内容动态填充&#xff0c;支持公文套红&#xff0c;支持文档保护控制等诸多办公功能&#xff0c;本…

vim编辑器安装,并修改配置使其默认显示行数

centOS默认是未安装vim编辑器的&#xff0c;而vim编辑器相比vi编辑器更易用一些&#xff0c;如需使用vim编辑器&#xff0c;需要进行安装。 1.需要先配置本地yum源&#xff0c;参见如下链接&#xff1a; 点击查看如何配置本地yum源 2.安装vim编辑器&#xff0c;并修改配置。…

滑动窗口_找出字符串中所有字母异位词、串联所有单词的子串_C++

滑动窗口_找出字符串中所有字母异位词、串联所有单词的子串_C 1. 题目解析2. 算法分析3. 代码实现4. 举一反三&#xff1a;串联所有单词的子串 1. 题目解析 leetcode链接&#xff1a;https://leetcode.cn/problems/VabMRr/ 给定两个字符串 s 和 p&#xff0c;找到 s 中所有 p …

helm 测试安装redis

helm search repo redis # 搜索redis的chart helm show readme bitnami/redis # 展示安装相关文档&#xff08;readme文件&#xff09; 拉取指定版本的安装包&#xff08;chart&#xff09; helm pull bitnami/redis --version 17.4.3 解压安装包 tar -xf redis-17.4.3.tgz …

Vue3 动态路由实现的一种方法

动态路由 目的&#xff1a; 根据服务器传回来的数据动态的注册路由信息&#xff0c;登录用户的角色不同生成的菜单不同 使用插件做动态路由的好处&#xff1a; 路由页面增加或者减少时&#xff0c;只需要增加或减少相关的路由文件&#xff0c;不需要再修改代码 服务器返回的信…

POI数据的处理与分析

POI概念 POI&#xff08;Point of Interest&#xff0c;兴趣点&#xff09;数据指的是地理空间数据中的一类&#xff0c;表示某一具体地点或位置的信息。通常&#xff0c;这些数据包含位置坐标&#xff08;经纬度&#xff09;、名称、地址、类别和其他相关信息。POI 数据广泛应…

毕业设计 深度学习水果识别

文章目录 1 前言2 开发简介3 识别原理3.1 传统图像识别原理3.2 深度学习水果识别 4 数据集5 部分关键代码5.1 处理训练集的数据结构5.2 模型网络结构5.3 训练模型 6 识别效果 1 前言 Hi&#xff0c;大家好&#xff0c;这里是丹成学长&#xff0c;今天做一个 基于深度学习的水果…

算法剖析:双指针

文章目录 双指针算法一、 移动零1. 题目2. 算法思想3. 代码实现 二、 复写零1. 题目2. 算法思想3. 代码实现 三、 快乐数1. 题目2. 算法思想3. 代码实现 四、 盛水最多的容器1. 题目2. 算法思想3. 代码实现 五、有效三角形的个数1. 题目2. 算法思想3. 代码实现 六、 和为 s 的两…

【EXCEL数据处理】000020 案例 保姆级教程,附多个操作案例。EXCEL使用表格。

前言&#xff1a;哈喽&#xff0c;大家好&#xff0c;今天给大家分享一篇文章&#xff01;创作不易&#xff0c;如果能帮助到大家或者给大家一些灵感和启发&#xff0c;欢迎收藏关注哦 &#x1f495; 目录 【EXCEL数据处理】000020 案例 保姆级教程&#xff0c;附多个操作案例。…

【强训笔记】day27

NO.1 代码实现&#xff1a; #include<iostream>using namespace std;int n,m; int main() {cin>>n>>m;long long retn;for(int i0;i<m-1;i)retret*(n-1)%109;cout<<ret<<endl;return 0; }NO.2 思路&#xff1a;bfs遍历实现&#xff0c;dis…

Android架构--MVVM

一、开发架构 是什么&#xff1f; 二、Android开发中的架构 具体到Android开发中&#xff0c;开发架构就是描述 视图层、逻辑层、数据层 三者之间的关系和实施&#xff1a; 视图层&#xff1a;用户界面&#xff0c;即界面的展示、以及交互事件的响应。 逻辑层&#xff1a;为…

IL2CPP和Mono的区别

Mono 是一种开源的跨平台 .NET 框架实现&#xff0c;能够执行 C# 代码。Unity 使用 Mono 来处理 C# 脚本&#xff0c;并通过 JIT&#xff08;Just-In-Time&#xff09;即时编译器将托管代码转换为本地机器代码&#xff0c;随后在目标平台上执行 IL2CPP 代表 Intermediate Lang…

《业务三板斧:定目标、抓过程、拿结果》读书笔记3

关于目标&#xff0c;关键是共识目标&#xff1a; 为什么不是共识目标&#xff0c;而是共信目标&#xff1f; 共识目标是指管理者通过沟通&#xff0c;让所有团队成员就目标以及实现目 标的方法达成一致。当个人与组织、个人与个人之间出现“路径选择 差异”的时候&#xff0c;…