Automated Route Extraction Script for SpringBoot Projects

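Contents

Preface
Tool Development
- 1.1 First Attempt with ChatGPT
- 1.2 First Version and Results
Adapting to WebGoat
- 2.1 Resolving Constant Routes
- 2.2 Handling Multi-line Definitions
Advanced Improvements
- 3.1 Identifying the Request Type
- 3.2 Identifying the Context Path
Summary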

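Preface

I recently ran into a requirement at work: extract every route from a SpringBoot project. It looked like a thoroughly ordinary task, so, true to my freeloader habit of taking whatever already exists, I went straight to GitHub to look for a tool. To my surprise, no open-source project actually met my needs... so I decided to build one myself.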

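Tool Development

The goal of this article is a script that identifies and extracts all routes of a Java SpringBoot project in one run, which makes it easier to scope the workload of a code audit and to track audit progress and coverage.

In Java Web code auditing, finding the routes is a key step, because the routes directly reflect the attack surface a system exposes. The Controller, one component of the MVC architecture, is the entry point of every user interaction, so the registered routes can be located through the Controllers.

A code audit usually analyzes the external API behind each Controller one by one. Listing the route interfaces and checking the business logic behind each of them helps us find vulnerabilities in the code quickly and uncover potential business risks.

The common annotations involved in registering routes in the SpringMVC framework are: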

@Controller
@RestController
@RequestMapping
@GetMapping
@PostMapping
@PutMapping
@DeleteMapping
@PatchMapping
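
To make the target concrete, here is a minimal sketch that embeds a small controller as a Python string and lists the mapping annotations found in it. The controller (`DemoController`), the regex `ANNOTATION_RE` and the helper `list_mappings` are made up for illustration and are not part of the script developed below. The full routes the scanner should eventually report for this class are /user/list and /user/add.

```python
import re

# Hypothetical controller source, for illustration only.
SAMPLE_CONTROLLER = '''
@RestController
@RequestMapping("/user")
public class DemoController {

    @GetMapping("/list")
    public String list() { return "ok"; }

    @PostMapping(value = "/add")
    public String add(@RequestBody String body) { return body; }
}
'''

# Matches the mapping annotations listed above and captures their argument text.
ANNOTATION_RE = re.compile(
    r'@(RequestMapping|GetMapping|PostMapping|PutMapping|DeleteMapping|PatchMapping)\((.*?)\)'
)


def list_mappings(source):
    """Return (annotation name, raw argument text) pairs in source order."""
    return ANNOTATION_RE.findall(source)


for name, args in list_mappings(SAMPLE_CONTROLLER):
    print(name, args)
```

The class-level @RequestMapping comes first in the result and has to be treated as a prefix rather than as a route of its own, which is exactly the part the next section's first attempt gets wrong.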

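1.1 First Attempt with ChatGPT

At first I still wanted to take a shortcut and see whether ChatGPT could do the job. It produced very concise code, but with defects: it cannot recognize the class-level parent route on a @Controller class and join it into the full route, and it also mixes up part of the extracted method information.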

import os
import re
import pandas as pd

# Regular expressions matching Spring route annotations and a method's return type, name and parameters
mapping_pattern = re.compile(r'@(?:Path|RequestMapping|GetMapping|PostMapping|PutMapping|DeleteMapping|PatchMapping)\((.*?)\)')
method_pattern = re.compile(r'(public|private|protected)\s+(\w[\w\.\<\>]*)\s+(\w+)\((.*?)\)\s*{')
value_pattern = re.compile(r'value\s*=\s*"(.*?)"')  # Extract only the path in the value field; the arguments may look like: value = "/xmlReader/sec", method = RequestMethod.POST


def extract_routes_from_file_01(file_path):
    """
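    Known defects: cannot recognize the class-level parent route on a @Controller class and join it into the full route, and mixes up part of the extracted method information, e.g. XXE (methods out of order) and the xlsxStreamerXXE class (wrong route)
    One of the few open-source reference projects has the same problem: https://github.com/charlpcronje/Java-Class-Component-Endpoint-Extractor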
    """
    routes = []
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        # Find all route annotations
        mappings = mapping_pattern.findall(content)
        methods = method_pattern.findall(content)
        # Pair routes with methods by position
        for mapping, method in zip(mappings, methods):
            # Use the regular expression to extract the value field
            value_match = value_pattern.search(mapping)
            route = value_match.group(1).strip() if value_match else mapping.strip()
            route = route.strip('"')  # Strip the quotes around the path
            route_info = {
                'route': route,
                'return_type': method[1].strip(),
                'method_name': method[2].strip(),
                'parameters': method[3].strip(),
                'file_path': file_path,
            }
            routes.append(route_info)
    return routes


def scan_project_directory(directory):
    all_routes = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.java'):
                file_path = os.path.join(root, file)
                routes = extract_routes_from_file_01(file_path)
                if routes:
                    all_routes.extend(routes)
    return all_routes


def write_routes_to_xlsx(all_data_list):
    data = {
        "Route": [item['route'] for item in all_data_list],
        "Return Type": [item['return_type'] for item in all_data_list],
        "Method Name": [item['method_name'] for item in all_data_list],
        "Parameters": [item['parameters'] for item in all_data_list],
        "File Path": [item['file_path'] for item in all_data_list],
    }
    writer = pd.ExcelWriter('Data.xlsx')
    dataFrame = pd.DataFrame(data)
    dataFrame.to_excel(writer, sheet_name="password")
    writer.close()
    print(f"[*] Successfully saved data to xlsx")


if __name__ == '__main__':
    # project_directory = input("Enter the path to your Spring Boot project: ")
    project_directory = r'D:\Code\Java\Github\java-sec-code-master'
    routes_info = scan_project_directory(project_directory)
    write_routes_to_xlsx(routes_info)
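
The root cause of the mix-up is the positional zip: the class-level @RequestMapping counts as a mapping but has no method of its own, so every pair after it shifts by one. A minimal reproduction, reusing the two regular expressions from the listing above on a made-up controller (`SAMPLE` and `pair_by_zip` are hypothetical names):

```python
import re

mapping_pattern = re.compile(
    r'@(?:Path|RequestMapping|GetMapping|PostMapping|PutMapping|DeleteMapping|PatchMapping)\((.*?)\)'
)
method_pattern = re.compile(r'(public|private|protected)\s+(\w[\w\.\<\>]*)\s+(\w+)\((.*?)\)\s*{')

# Hypothetical controller used only to reproduce the misalignment.
SAMPLE = '''
@RestController
@RequestMapping("/xxe")
public class Demo {
    @PostMapping("/vuln")
    public String vuln(String body) {
        return body;
    }

    @PostMapping("/sec")
    public String sec(String body) {
        return body;
    }
}
'''


def pair_by_zip(source):
    """Pair annotations and methods purely by position, as the draft above does."""
    mappings = mapping_pattern.findall(source)
    methods = method_pattern.findall(source)
    return [(m.strip('"'), method[2]) for m, method in zip(mappings, methods)]


# The parent route is attributed to vuln(), and "/sec" is dropped entirely.
print(pair_by_zip(SAMPLE))  # -> [('/xxe', 'vuln'), ('/vuln', 'sec')]
```

Any fix therefore has to separate the class-level annotation from the method-level ones and tie each method-level annotation to the method that follows it in the file.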

1.2 First Version and Results

No shortcut this time, so I wrote it myself.

Test project: https://github.com/JoyChou93/java-sec-code.

Script implementation:

import os
import re
import pandas as pd
from colorama import Fore, init

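# Let colorama reset colors automatically; otherwise Style.RESET_ALL must be applied by hand
init(autoreset=True)

# Global counter used to number the extracted routes
route_num = 1
# Regular expression matching Spring route annotations
mapping_pattern = re.compile(r'@(Path|(Request|Get|Post|Put|Delete|Patch)Mapping)\(')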


def write_routes_to_xlsx(all_data_list):
    """
    Write the route information to an Excel file
    """
    data = {
        "Parent Route": [item['parent_route'] for item in all_data_list],
        "Route": [item['route'] for item in all_data_list],
        "Return Type": [item['return_type'] for item in all_data_list],
        "Method Name": [item['method_name'] for item in all_data_list],
        "Parameters": [item['parameters'] for item in all_data_list],
        "File Path": [item['file_path'] for item in all_data_list],
    }
    writer = pd.ExcelWriter('Data.xlsx')
    dataFrame = pd.DataFrame(data)
    dataFrame.to_excel(writer, sheet_name="password")
    writer.close()
    print(Fore.BLUE + "[*] Successfully saved data to xlsx")


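def extract_request_mapping_value(s):
    """
    Extract the class-level parent route from @RequestMapping. The parentheses may carry fields other than value, e.g. method = RequestMethod.POST
    """
    # Prefer an explicit value/path field so that other fields are ignored
    named = re.search(r'@RequestMapping\(.*?(?:value|path)\s*=\s*"(.*?)"', s)
    if named:
        return named.group(1)
    # Otherwise take a leading positional string, e.g. @RequestMapping("/xxe")
    positional = re.search(r'@RequestMapping\(\s*"(.*?)"', s)
    return positional.group(1) if positional else None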


def get_class_parent_route(content):
    """
    Extract the class-level parent route
    Note that the route may be None, e.g. for CommandInject.java in java-sec-code-master
    """
    parent_route = None
    content_lines = content.split('\n')
    public_class_line = None
    # Walk the lines to find the one containing "public class"
    for line_number, line in enumerate(content_lines, start=1):
        if re.search(r'public class', line):
            public_class_line = line_number
            break
    if public_class_line is not None:
        # Take the lines up to the "public class" line
        content_before_public_class = content_lines[:public_class_line]
        for line in content_before_public_class:
            if re.search(r'@RequestMapping\(', line):
                parent_route = extract_request_mapping_value(line)
    return parent_route, public_class_line


def extract_value_between_quotes(line):
    """
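    Extract the value between the first pair of double quotes in the string, i.e. the route in annotations of the form @GetMapping("/upload") (still unsolved: some projects define their routes centrally in a constants class)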
    """
    pattern = r'"(.*?)"'
    match = re.search(pattern, line)
    if match:
        value = match.group(1)
        return value
    else:
        return None


def extract_function_details(function_def):
    """
    Parse a method definition line and return its details: return type, method name and parameters
    """
    # The return type may contain generics, arrays or package names, e.g. Map<String, Object>
    pattern = re.compile(
        r'public\s+(?:static\s+)?([\w.<>\[\],?\s]+?)\s+(\w+)\s*\((.*)\)'
    )
    # Match the method signature
    match = pattern.search(function_def)
    if match:
        return_type = match.group(1)  # return type
        function_name = match.group(2)  # method name
        parameters = match.group(3)  # parameters
        return return_type, function_name, parameters
    else:
        return None, None, None


def extract_routes_from_file(file_path):
    routes = []
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        # Only process classes annotated with @Controller or @RestController
        if re.search('@(?!(ControllerAdvice))(|Rest)Controller', content):
            parent_route, public_class_line = get_class_parent_route(content)
            content_lines = content.split('\n')
            # Keep all code after the line of the class definition
            content_after_public_class = content_lines[public_class_line:]
            global route_num
            for i, line in enumerate(content_after_public_class):
                if re.search(mapping_pattern, line):
                    # Build the full route
                    route = extract_value_between_quotes(line)
                    if parent_route is not None and route is not None:
                        route = parent_route + route
                    # Walk down to the first line that does not start with @, because a method may carry several annotations, e.g. @GetMapping("/upload") @ResponseBody
                    j = i + 1
                    while j < len(content_after_public_class) and content_after_public_class[j].strip().startswith('@'):
                        j += 1
                    method_line = content_after_public_class[j].strip()
                    # print(route)
                    # print(method_line)
                    return_type, function_name, parameters = extract_function_details(method_line)
                    # print(parameters)
                    route_info = {
                        'parent_route': parent_route,
                        'route': route,
                        'return_type': return_type,
                        'method_name': function_name,
                        'parameters': parameters,
                        'file_path': file_path,
                    }
                    routes.append(route_info)
                    print(Fore.GREEN + '[%s]' % str(route_num) + str(route_info))
                    route_num += 1
    return routes


def scan_project_directory(directory):
    all_routes = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.java'):
                file_path = os.path.join(root, file)
                routes = extract_routes_from_file(file_path)
                if routes:
                    all_routes.extend(routes)
    return all_routes


if __name__ == '__main__':
    # project_directory = input("Enter the path to your Spring Boot project: ")
    project_directory1 = r'D:\Code\Java\Github\java-sec-code-master'
    project_directory2 = r'D:\Code\Java\Github\java-sec-code-master\src\main\java\org\joychou\controller\othervulns'
    routes_info = scan_project_directory(project_directory1)
    write_routes_to_xlsx(routes_info)

The generated xlsx table:
[Screenshot: the generated xlsx table]

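Comparing the output with the source code, taking the last file, java-sec-code-master\src\main\java\org\joychou\controller\othervulns\xlsxStreamerXXE.java, as an example:

Everything is parsed correctly, which completes the test on this project. The script was also verified against another open-source project without problems: https://github.com/yangzongzhuan/RuoYi.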
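
One detail worth keeping in mind: the script builds the full route with plain string concatenation (parent_route + route), which relies on both fragments being written with a leading slash and without a trailing one. That held for the projects tested here. For code bases that are less consistent, a small normalizing helper could be swapped in. The sketch below is an assumption of mine, not part of the script; `join_route` is a hypothetical name.

```python
def join_route(*parts):
    """Join route fragments with exactly one slash between them.

    None and empty fragments are skipped; the result always starts with '/'.
    """
    segments = [p.strip('/') for p in parts if p]
    segments = [s for s in segments if s]
    return '/' + '/'.join(segments)


print(join_route('/user', '/list'))   # -> /user/list
print(join_route('user/', 'list'))    # -> /user/list
print(join_route(None, '/upload'))    # -> /upload
```

Because None is skipped, the same helper also covers controllers without a class-level @RequestMapping.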

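Adapting to WebGoat

Open-source project: https://github.com/WebGoat/WebGoat. Scanning it raises two problems that need solving.

2.1 Resolving Constant Routes

The routes are defined through static constants rather than given directly as string literals.
[Screenshot: a route defined through a static constant]
Scanning the project directly with the script above fails.

The fix centers on the extract_value_between_quotes function of the script above, i.e. the logic that extracts the route value of a route entry method.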

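2.2 Handling Multi-line Definitions

When the code of a route annotation contains line breaks, the annotation's parameters are parsed only partially, for example:
[Screenshot: a route annotation spanning several lines]
Likewise, the lookup of the entry method's definition does not yet consider a signature spread over several lines, which can drop method parameters. And when the annotation itself spans several lines the code has a bug: it cannot simply take the first line that does not start with @.

Scanning directly with the script above leaves every extracted field empty.
[Screenshot: scan output with all fields empty]
The fix centers on the logic that extracts the route annotation and the entry method definition.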
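
One possible way to collect a multi-line annotation, shown here only as a sketch and not as the code used in this article, is to count parentheses until they balance instead of looking at how a line ends. `collect_annotation` and the sample lines are hypothetical. The counting is naive: a parenthesis inside a string literal would throw it off.

```python
def collect_annotation(lines, start):
    """Join lines from lines[start] until the parentheses balance.

    Returns (joined text, index of the last line consumed).
    """
    depth = 0
    parts = []
    for idx in range(start, len(lines)):
        stripped = lines[idx].strip()
        parts.append(stripped)
        depth += stripped.count('(') - stripped.count(')')
        if depth <= 0:
            return ' '.join(parts), idx
    # Unbalanced input: return everything that was collected.
    return ' '.join(parts), len(lines) - 1


# Hypothetical source lines, for illustration only.
SAMPLE_LINES = [
    '    @PostMapping(',
    '        value = "/demo/upload",',
    '        produces = "application/json")',
    '    @ResponseBody',
    '    public String upload(',
]

text, last = collect_annotation(SAMPLE_LINES, 0)
print(text)
print(last)
```

The returned index tells the caller where to resume the search for the method definition, so the remaining lines of a multi-line annotation are not mistaken for the first line that does not start with @.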

[Code added for these requirements]

……

def find_constant_value(folder_path, constant_name):
    """
    Look up the string value of a route constant
    """
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.java'):
                file_path = os.path.join(root, file)
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                    pattern = fr'static\s+final\s+String\s+{constant_name}\s*=\s*"(.*?)";'
                    match = re.search(pattern, content)
                    if match:
                        return match.group(1)
    return None


def get_path_value(line, directory):
    """
    Extract the route value, whether it is given directly as a string literal or through a constant, e.g.:
    @GetMapping(path = "/server-directory"), @GetMapping(path = URL_HINTS_MVC, produces = "application/json"), @GetMapping(value = "/server-directory")
    """
    pattern = r'\((?:path|value)\s*=\s*(?:"([^"]*)"|([A-Z_]+))'
    matches = re.findall(pattern, line)
    route = ''
    for match in matches:
        if match[0]:  # the path is a string literal
            route = match[0]
            # print(Fore.GREEN + route)
        elif match[1]:  # the path is a constant; look up its value
            route = find_constant_value(directory, match[1])
            # print(Fore.BLUE + route)
    return route
    

def extract_routes_from_file(file_path, directory):
    routes = []
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        # Only process classes annotated with @Controller or @RestController
        if re.search('@(?!(ControllerAdvice))(|Rest)Controller', content):
            parent_route, public_class_line = get_class_parent_route(content)
            content_lines = content.split('\n')
            # Keep all code after the line of the class definition
            content_after_public_class = content_lines[public_class_line:]
            global route_num
            for i, line in enumerate(content_after_public_class):
                try:
                    if re.search(mapping_pattern, line):
                        route_define = line.strip()
                        # If the annotation does not end on this line, append the following lines until one ends with ')'
                        if not route_define.endswith(')'):
                            q = i + 1
                            while q < len(content_after_public_class):
                                route_define += content_after_public_class[q].strip()
                                if route_define.endswith(')'):
                                    break
                                q += 1
                        # print(Fore.RED + route_define)
                        # Check whether the route is given directly as a string literal or through a constant, then extract the string value either way
                        if re.search(r'\("', route_define):
                            route = extract_value_between_quotes(route_define)
                        else:
                            route = get_path_value(route_define, directory)
                        # Build the full route
                        if parent_route is not None and route is not None:
                            route = parent_route + route
                        # Walk down to the method definition; other annotations may sit between the route annotation and the method
                        j = i + 1
                        while j < len(content_after_public_class) and not content_after_public_class[j].strip().startswith('public'):
                            j += 1
                        method_define = content_after_public_class[j].strip()
                        # The method signature may span several lines; merge them into one complete definition, otherwise parameters may be cut off
                        q = j
                        while q + 1 < len(content_after_public_class) and not content_after_public_class[q].strip().endswith('{'):
                            q += 1
                            method_define += content_after_public_class[q].strip()
                        # print(route)
                        # print(method_define)
                        return_type, function_name, parameters = extract_function_details(method_define)
                        route_info = {
                            'parent_route': parent_route,
                            'route': route,
                            'return_type': return_type,
                            'method_name': function_name,
                            'parameters': parameters,
                            'file_path': file_path,
                        }
                        routes.append(route_info)
                        print(Fore.GREEN + '[%s]' % str(route_num) + str(route_info))
                        route_num += 1
                except Exception as e:
                    print(Fore.RED + '[-]' + str(file_path) + ' ' + str(e))
                    continue
    return routes
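
To see the constant lookup in isolation, the following sketch writes a made-up constants class into a temporary directory and resolves one name from it. The lookup mirrors find_constant_value above, with re.escape added as a precaution of mine; the file name Urls.java, the constant URL_DEMO and the helper `demo` are hypothetical.

```python
import os
import re
import tempfile


def find_constant_value(folder_path, constant_name):
    """Search .java files under folder_path for: static final String NAME = "...";"""
    pattern = re.compile(
        r'static\s+final\s+String\s+' + re.escape(constant_name) + r'\s*=\s*"(.*?)";'
    )
    for root, _, files in os.walk(folder_path):
        for name in files:
            if name.endswith('.java'):
                with open(os.path.join(root, name), 'r', encoding='utf-8') as f:
                    match = pattern.search(f.read())
                if match:
                    return match.group(1)
    return None


def demo():
    """Write a hypothetical constants class to a temp dir and resolve two names."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, 'Urls.java'), 'w', encoding='utf-8') as f:
            f.write('public class Urls {\n'
                    '    public static final String URL_DEMO = "/demo/hints";\n'
                    '}\n')
        return find_constant_value(tmp, 'URL_DEMO'), find_constant_value(tmp, 'URL_MISSING')


print(demo())  # -> ('/demo/hints', None)
```

Two limitations follow from this approach: the whole project tree is walked again for every constant, and if two classes declare a constant with the same name, whichever file is visited first wins.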

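Scan results and verification:
[Screenshot: WebGoat scan results]

The scan results for the earlier java-sec-code project were re-checked and are unaffected.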

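Advanced Improvements

3.1 Identifying the Request Type

This step adds logic that recognizes the type of the route annotation and appends a column to the table holding the HTTP request type of each route, e.g. GET or POST.

For this, a get_request_type(route_define) function is added: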

def get_request_type(route_define):
    """
    Extract the API request type (GET, POST, ...) from a route annotation
    """
    # print(route_define)
    if route_define.startswith('@RequestMapping'):
        # Read the value of the method field of @RequestMapping
        if route_define.find('method =') > -1:
            request_type = (str(route_define.split('method =')[1]).split('}')[0].strip().replace('{', '').replace(')', '')).replace('RequestMethod.', '')
        # A @RequestMapping without an explicit method accepts every request type
        else:
            request_type = 'All'
    else:
        # @GetMapping -> GET, @PostMapping -> POST, ...
        request_type = route_define.split('Mapping')[0][1:].upper()
    return request_type
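
The string splitting above depends on the exact spelling `method =` and keeps whatever follows the method field, for example a trailing produces attribute. A regex-based variant is sketched below as one possible alternative; `get_request_type_re` and `METHOD_RE` are hypothetical names, and the variant is not what the published script uses. It still misses statically imported constants such as method = POST.

```python
import re

# Every RequestMethod.XXX constant mentioned in the annotation.
METHOD_RE = re.compile(r'RequestMethod\.([A-Z]+)')


def get_request_type_re(route_define):
    """Return the HTTP method(s) of a mapping annotation as an upper-case string."""
    shortcut = re.match(r'@(Get|Post|Put|Delete|Patch)Mapping', route_define)
    if shortcut:
        return shortcut.group(1).upper()
    methods = METHOD_RE.findall(route_define)
    # No method field: the mapping accepts every request type.
    return ', '.join(methods) if methods else 'All'


print(get_request_type_re('@GetMapping("/a")'))
print(get_request_type_re('@RequestMapping(value = "/a", method = RequestMethod.POST, produces = "application/json")'))
print(get_request_type_re('@RequestMapping(value="/a", method={RequestMethod.GET, RequestMethod.POST})'))
```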

Scan results:
[Screenshot: scan results with the request type column]

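3.2 Identifying the Context Path

In a Spring Boot project, context-related configuration covers the application's context path, component-scan paths, internationalization, resource paths, environment variables and so on. These settings can usually be made in the following places:

1. The application.properties or application.yml file

These are the most commonly used configuration files in a Spring Boot project, located under src/main/resources. Setting the context path: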

# application.properties
server.servlet.context-path=/myapp

or:

# application.yml
server:
  servlet:
    context-path: /myapp

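2. Environment variables or command-line arguments

Spring Boot allows environment variables or command-line arguments to override the values in the configuration files, so the context configuration can be adjusted dynamically.

For now only the first case is handled, i.e. the context path set in a configuration file.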
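
For completeness, here is a rough sketch of what handling the second case could look like when the launch command or the deployment environment is known to the auditor. It relies on two Spring Boot conventions: the property can be passed as `--server.servlet.context-path=...` on the command line, and relaxed binding maps it to the environment variable `SERVER_SERVLET_CONTEXT_PATH`. The function name `context_path_from_overrides` and its parameters are hypothetical, and nothing like it exists in the script.

```python
def context_path_from_overrides(argv, environ):
    """Return a context path given on the command line or in the environment, else None.

    argv    -- list of launch arguments, e.g. taken from a start script
    environ -- mapping of environment variables, e.g. os.environ
    """
    prefix = '--server.servlet.context-path='
    # Command-line arguments take precedence over environment variables.
    for arg in argv:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return environ.get('SERVER_SERVLET_CONTEXT_PATH')


print(context_path_from_overrides(['--server.port=8080', '--server.servlet.context-path=/app'], {}))
print(context_path_from_overrides([], {'SERVER_SERVLET_CONTEXT_PATH': '/env'}))
```

A static scan of the source tree cannot see either of these, which is why a value found in the configuration files should be read as the default rather than as the guaranteed runtime value.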

The function that identifies the context path is added as follows:

def extract_context_path(directory):
    """
    Extract the context path from application.properties, xxx.yml or similar configuration files of the Java project
    """
    for dirPath, dirNames, fileNames in os.walk(directory):
        for filename in fileNames:
            if filename.endswith(".properties") or filename.endswith('.yml') or filename.endswith('.yaml'):
                file_path = os.path.join(dirPath, filename)
                with open(file_path, 'r', encoding='utf-8') as f:
                    for line in f:
                        # Match the properties format
                        if line.startswith('server.servlet.context-path'):
                            context = line.split('=')[1].strip()
                            print(Fore.BLUE + "[*]Found context-path:" + context)
                            return context
                        # Match the yml format
                        elif line.find('context-path') > -1:
                            context = line.strip().split(':')[1].strip()
                            print(Fore.BLUE + "[*]Found context-path:" + context)
                            return context
                        else:
                            continue
    return None
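
The line matching above is deliberately simple: a commented-out setting or a quoted YAML value would be returned as-is. A slightly stricter per-line parser, together with a helper that prefixes a route with the context path, might look like the sketch below. `parse_context_path`, `with_context` and `CONTEXT_RE` are hypothetical names and not part of the script.

```python
import re

# Accepts both "server.servlet.context-path=/x" and an indented "context-path: /x".
CONTEXT_RE = re.compile(r'^\s*(?:server\.servlet\.)?context-path\s*[:=]\s*(\S+)')


def parse_context_path(line):
    """Return the context path configured on this line, or None."""
    if line.lstrip().startswith('#'):
        return None  # commented-out setting
    match = CONTEXT_RE.match(line)
    if not match:
        return None
    return match.group(1).strip('\'"')  # drop optional YAML quotes


def with_context(context, route):
    """Prefix a route with the context path; a missing or '/' context adds nothing."""
    return (context or '').rstrip('/') + route


print(parse_context_path('server.servlet.context-path=/myapp'))
print(parse_context_path('    context-path: "/myapp"'))
print(with_context('/myapp', '/user/list'))
```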

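The final scan output:
[Screenshot: final scan output]

As expected:
[Screenshot: verification of the detected context path]
The context path of the RuoYi project is identified correctly as well.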

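Summary

Finally, the code is open-sourced at https://github.com/Tr0e/RouteScanner.

This article implemented one-click automated identification and tallying of the routes in a Java SpringBoot project, with the results written to a spreadsheet. Open-source reference code for this kind of tool is currently hard to find on GitHub, so consider this a small contribution to the community. That said, this first version has been tested on only a few projects and may well fail to fit other Java code bases. I will keep improving it as time permits, and issues are welcome.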