quantlab_ai版本v0.1代码发布：从研报中提取因子并建模（附代码与研报集下载）

news2025/7/3 4:24:00

原创内容第676篇，专注量化投资、个人成长与财富自由。

今天我们继续开发AI大模型自动读研报。

从研报到模型，大致分成几步：

["propose_hypo_exp", "propose", "exp_gen", "coding", "running", "feedback"]

首先，我们需要接入大模型API，我们使用openai的SDK来封装。

from openai import OpenAI
import os


class LLM:
    def __init__(self, api_key=os.getenv('KIMI_KEY'), base_url="https://api.moonshot.cn/v1", model='moonshot-v1-128k',system_prompt=None):
        self.client = OpenAI(
            api_key=api_key,
            base_url=base_url,
        )
        if not system_prompt:
            system_prompt = "你是quantlab金融投资专家，由AI量化实验室 " \
                            "提供的人工智能助手，你更擅长中文和英文的对话。你会为用户提供安全，有帮助，准确的回答。同时，你会拒绝一切涉及恐怖主义，种族歧视，黄色暴力等问题的回答。Moonshot AI " \
                            "为专有名词，不可翻译成其他语言。 "

        self.system_prompt = system_prompt
        self.model = model

    def chat(self, user_prompt):
        completion = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system",
                 "content": self.system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.3,
        )
        return completion.choices[0].message.content

这样一个基础功能就完成了。

我们可以向它提问：

from core.llm import LLM
resp = LLM().chat('你是谁？A股会涨吗？')
print(resp)

输出结果如下：

准备好了16份研报：

比如研报一，

我们的目标是AI自动读研报，然后自主建模复现研报。

研报基本都是pdf格式，因此我们需要一个pdf阅读器，langchain里实现了pdf解析，我们直接使用就好了。——langchain生态封装有点多，但工具也挺全的，我们可以使用它的周边。

from pathlib import Path
from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader

from langchain_core.documents import Document


def process_documents_by_langchain(docs: list[Document]) -> dict[str, str]:
    content_dict = {}

    for doc in docs:
        if Path(doc.metadata["source"]).exists():
            doc_name = str(Path(doc.metadata["source"]).resolve())
        else:
            doc_name = doc.metadata["source"]
        doc_content = doc.page_content

        if doc_name not in content_dict:
            content_dict[str(doc_name)] = doc_content
        else:
            content_dict[str(doc_name)] += doc_content

    return content_dict


def load_pdfs(path):
    if Path(path).is_dir():
        loader = PyPDFDirectoryLoader(path, silent_errors=True)
    else:
        loader = PyPDFLoader(path)
    docs = process_documents_by_langchain(loader.load())
    return docs

把研报一读取进来：

from loader.pdf_loader import load_pdfs
docs = load_pdfs(DATA_DIR.joinpath('all_reports').joinpath('report1.pdf'))
print(docs)

读进来是文本：

接下来是重点，system prompt，要求LLM从研报中抽取出因子，以及对因子的模型。

extract_factors_system: |-
    用户会提供一篇金融工程研报，其中包括了量化因子和模型研究，请按照要求抽取以下信息:
    1. 概述这篇研报的主要研究思路;
    2. 抽取出所有的因子，并概述因子的计算过程，请注意有些因子可能存在于表格中，请不要遗漏，因子的名称请使用英文，不能包含空格，可用下划线连接，研报中可能不含有因子，若没有请返回空字典;
    3. 抽取研报里面的所有模型，并概述模型的计算过程，可以分步骤描述模型搭建或计算的过程，研报中可能不含有模型，若没有请返回空字典;

    user will treat your factor name as key to store the factor, don't put any interaction message in the content. Just response the output without any interaction and explanation.
    All names should be in English.
    Respond with your analysis in JSON format. The JSON schema should include:
    {
        "summary": "The summary of this report",
        "factors": {
            "Name of factor 1": "Description to factor 1",
            "Name of factor 2": "Description to factor 2"
        },
        "models": {
            "Name of model 1": "Description to model 1",
            "Name of model 2": "Description to model 2"
        }
    }

看下代码运行的结果：

系统按要示的字典格式，把因子和模型都抽取出来了。

{

"summary": "This report discusses the role of high-frequency factors in quantitative stock selection strategies, with a case study on the enhancement of the CSI 1000 index. The report categorizes high-frequency factors into reversal, momentum, and deep learning types, and tests their stock selection capabilities on a monthly and weekly basis. It also examines the impact of incorporating these factors into the CSI 1000 index enhancement strategy under various constraints.",

"factors": {

"Reversal_HighFrequency": "Includes factors like Improved Reversal, EndOfDay_Trading_Volume_Ratio, HighFrequency_Skewness, Downside_Volume_Ratio, Average_Single_Outflow_Amount_Ratio, and LargeOrder_Push_Increase. These factors characterize investor overreaction and tend to select stocks with large previous drops or low turnover rates.",

"Momentum_HighFrequency": "Includes factors like Opening_Buy_Intention_Ratio, Opening_Buy_Intention_Strength, Opening_LargeOrder_NetBuy_Ratio, and Opening_LargeOrder_NetBuy_Strength. These factors characterize investor buying intentions, capital flow in the order book, or trading behavior of informed investors.",

"DeepLearning_HighFrequency": "Includes factors like Improved_DeepLearning_Factor and Residual_Attention_DeepLearning_Factor. These factors use past high-frequency features to dynamically fit recent trading patterns and are suitable for short-term windows."

"models": {

"CSI_1000_Index_Enhancement": "The model predicts stock returns using a linear weighting approach with base factors including market value, mid-cap (cube of market value), valuation, turnover, reversal, and volatility. High-frequency factors are incrementally introduced to examine changes in portfolio performance. The optimization objective maximizes expected returns under various constraints such as stock deviation, factor exposure, industry deviation, turnover frequency, and turnover rate limits."

}

代码结果如下，已经打包发布至星球：