项目实战--MySQL实现分词模糊匹配

news2024/10/5 14:15:57
一、需求描述

推广人员添加公司到系统时,直接填写公司简称,而公司全称可能之前已经被添加过,为防止添加重复的公司,所以管理员在针对公司信息审批之前,需要查看以往添加的公司信息里是否有相同公司。

二、方案

技术层面需要考虑实现的功能点:
• 分词
• 与库里已有数据进行匹配
• 按照匹配度对结果进行排序

分词功能有现成的分词器,所以整个需求的核心重点是如何与数据库中的数据匹配并按照匹配度排序。
方案可以选择:
• 方案一:引入ES
• 方案二:利用MySQL实现

若系统规模较小,单纯为实现这个功能引入ES成本较大,还要涉及到数据同步等问题,系统复杂性会提高,所以尽量使用MySQL已有的功能进行实现。
MySQL提供了以下三种模糊搜索的方式:

• like匹配: 要求模式串与整个目标字段完全匹配;
• RegExp正则匹配: 要求目标字段包含模式串即可;
• Fulltext全文索引: 在字段类型为CHARVARCHARTEXT的列上创建全文索引,执行SQL进行查询。

针对于上述业务场景,对相关技术进行优劣分析:

• like匹配: 无法满足需求,所以pass;
• 全文索引: 可定制性差,不支持任意匹配查询,pass;
• 正则匹配: 可实现任意模式匹配,缺点在于执行效率不如全文索引。

针对于这个场景,记录数目相对来说没有那么多,所以对于效率稍低的结果可以接受,因此技术选型方面采用RegExp正则匹配来实现模糊匹配的需求。

三、代码实现

1.提取公司关键信息
对输入的公司名称去除无用信息,保留关键信息。这里的无用信息指的是地名,圆括号,以及集团,股份,有限等。

/**
 * 匹配前去除公司名称的无意义信息
 * @param targetCompanyName
 * @return
 */
private String formatCompanyName(String targetCompanyName){

Stringregex="(?<province>[^省]+自治区|.*?省|.*?行政区|.*?市)"+
"?(?<city>[^市]+自治州|.*?地区|.*?行政单位|.+盟|市辖区|.*?市|.*?县)"+
"?(?<county>[^(区|市|县|旗|岛)]+区|.*?市|.*?县|.*?旗|.*?岛)"+
"?(?<village>.*)";
Matchermatcher=Pattern.compile(regex).matcher(targetCompanyName);
while(matcher.find()){
Stringprovince= matcher.group("province");
        log.info("province:{}",province);
if(StringUtils.isNotBlank(province)&& targetCompanyName.contains(province)){
            targetCompanyName = targetCompanyName.replace(province,"");
}
        log.info("处理完省份的公司名称:{}",targetCompanyName);
Stringcity= matcher.group("city");
        log.info("city:{}",city);
if(StringUtils.isNotBlank(city)&& targetCompanyName.contains(city)){
            targetCompanyName = targetCompanyName.replace(city,"");
}
        log.info("处理完城市的公司名称:{}",targetCompanyName);
Stringcounty= matcher.group("county");
        log.info("county:{}",county);
if(StringUtils.isNotBlank(county)&& targetCompanyName.contains(county)){
            targetCompanyName = targetCompanyName.replace(county,"");
}
        log.info("处理完区县级的公司名称:{}",targetCompanyName);
}
String[][] address =AddressUtil.ADDRESS;
for(String[] city: address){
for(String b : city ){
if(targetCompanyName.contains(b)){
                targetCompanyName = targetCompanyName.replace(b,"");
}
}
}
    log.info("处理后的公司名称:{}",targetCompanyName);
return targetCompanyName;
}

地名工具类

public class AddressUtil{
public static final String[][] ADDRESS ={
{"北京"},
{"天津"},
{"安徽","安庆","蚌埠","亳州","巢湖","池州","滁州","阜阳","合肥","淮北","淮南","黄山","六安","马鞍山","宿州","铜陵","芜湖","宣城"},
{"澳门"},
{"香港"},
{"福建","福州","龙岩","南平","宁德","莆田","泉州","厦门","漳州"},
{"甘肃","白银","定西","甘南藏族自治州","嘉峪关","金昌","酒泉","兰州","临夏回族自治州","陇南","平凉","庆阳","天水","武威","张掖"},
{"广东","潮州","东莞","佛山","广州","河源","惠州","江门","揭阳","茂名","梅州","清远","汕头","汕尾","韶关","深圳","阳江","云浮","湛江","肇庆","中山","珠海"},
{"广西","百色","北海","崇左","防城港","贵港","桂林","河池","贺州","来宾","柳州","南宁","钦州","梧州","玉林"},
{"贵州","安顺","毕节地区","贵阳","六盘水","黔东南苗族侗族自治州","黔南布依族苗族自治州","黔西南布依族苗族自治州","铜仁地区","遵义"},
{"海南","海口","三亚","直辖县级行政区划"},
{"河北","保定","沧州","承德","邯郸","衡水","廊坊","秦皇岛","石家庄","唐山","邢台","张家口"},
{"河南","安阳","鹤壁","焦作","开封","洛阳","漯河","南阳","平顶山","濮阳","三门峡","商丘","新乡","信阳","许昌","郑州","周口","驻马店"},
{"黑龙江","大庆","大兴安岭地区","哈尔滨","鹤岗","黑河","鸡西","佳木斯","牡丹江","七台河","齐齐哈尔","双鸭山","绥化","伊春"},
{"湖北","鄂州","恩施土家族苗族自治州","黄冈","黄石","荆门","荆州","十堰","随州","武汉","咸宁","襄樊","孝感","宜昌"},
{"湖南","长沙","常德","郴州","衡阳","怀化","娄底","邵阳","湘潭","湘西土家族苗族自治州","益阳","永州","岳阳","张家界","株洲"},
{"吉林","白城","白山","长春","吉林","辽源","四平","松原","通化","延边朝鲜族自治州"},
{"江苏","常州","淮安","连云港","南京","南通","苏州","宿迁","泰州","无锡","徐州","盐城","扬州","镇江"},
{"江西","抚州","赣州","吉安","景德镇","九江","南昌","萍乡","上饶","新余","宜春","鹰潭"},
{"辽宁","鞍山","本溪","朝阳","大连","丹东","抚顺","阜新","葫芦岛","锦州","辽阳","盘锦","沈阳","铁岭","营口"},
{"内蒙古","阿拉善盟","巴彦淖尔","包头","赤峰","鄂尔多斯","呼和浩特","呼伦贝尔","通辽","乌海","乌兰察布","锡林郭勒盟","兴安盟"},
{"宁夏回族","固原","石嘴山","吴忠","银川","中卫"},
{"青海","果洛藏族自治州","海北藏族自治州","海东地区","海南藏族自治州","海西蒙古族藏族自治州","黄南藏族自治州","西宁","玉树藏族自治州"},
{"山东","滨州","德州","东营","菏泽","济南","济宁","莱芜","聊城","临沂","青岛","日照","泰安","威海","潍坊","烟台","枣庄","淄博"},
{"山西","长治","大同","晋城","晋中","临汾","吕梁","朔州","太原","忻州","阳泉","运城"},
{"陕西","安康","宝鸡","汉中","商洛","铜川","渭南","西安","咸阳","延安","榆林"},
{"上海"},
{"四川","阿坝藏族羌族自治州","巴中","成都","达州","德阳","甘孜藏族自治州","广安","广元","乐山","凉山彝族自治州","泸州","眉山","绵阳","内江","南充","攀枝花","遂宁","雅安","宜宾","资阳","自贡"},
{"西藏","阿里地区","昌都地区","拉萨","林芝地区","那曲地区","日喀则地区","山南地区"},
{"新疆维吾尔","阿克苏地区","阿勒泰地区","巴音郭楞蒙古自治州","博尔塔拉蒙古自治州","昌吉回族自治州","哈密地区","和田地区","喀什地区","克拉玛依","克孜勒苏柯尔克孜自治州","塔城地区","吐鲁番地区","乌鲁木齐","伊犁哈萨克自治州","直辖县级行政区划"},
{"云南","保山","楚雄彝族自治州","大理白族自治州","德宏傣族景颇族自治州","迪庆藏族自治州","红河哈尼族彝族自治州","昆明","丽江","临沧","怒江僳僳族自治州","普洱","曲靖","文山壮族苗族自治州","西双版纳傣族自治州","玉溪","昭通"},
{"浙江","杭州","湖州","嘉兴","金华","丽水","宁波","衢州","绍兴","台州","温州","舟山"},
{"重庆"},
{"台湾","台北","高雄","基隆","台中","台南","新竹","嘉义"},
};
}

2 分词相关代码
pom文件:引入IK分词器相关依赖

 <!-- ikAnalyzer 中文分词器  -->
<dependency>
<groupId>com.janeluo</groupId>
<artifactId>ikanalyzer</artifactId>
<version>2012_u6</version>
<exclusions>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-common</artifactId>
</exclusion>
</exclusions>
</dependency>

<!--  lucene-queryParser 查询分析器模块 -->
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>7.3.0</version>
</dependency>

IKAnalyzerSupport类:用于配置分词器

@Slf4j
public class IKAnalyzerSupport{
/**
     * IK分词
     * @param target
     * @return
     */
public static List<String> iKSegmenterToList(String target) throwsException{
if(StringUtils.isEmpty(target)){
returnnewArrayList();
}
List<String> result =newArrayList<>();
StringReadersr=newStringReader(target);
// false:关闭智能分词 (对分词的精度影响较大)
IKSegmenterik=newIKSegmenter(sr,true);
Lexeme lex;
while((lex=ik.next())!=null){
StringlexemeText= lex.getLexemeText();
            result.add(lexemeText);
}
return result;
}
}

ServiceImpl类:进行分词处理

 /**
 * 对目标公司名称进行分词
 * @param targetCompanyName
 * @return
 */
private String splitWord(String targetCompanyName){
    log.info("对处理后端公司名称进行分词");

List<String> splitWord =newArrayList<>();
Stringresult= targetCompanyName;
try{
        splitWord = iKSegmenterToList(targetCompanyName);
        result =  splitWord.stream().map(String::valueOf).distinct().collect(Collectors.joining("|"));
        log.info("分词结果:{}",result);
}catch(Exception e){
        log.error("分词报错:{}",e.getMessage());
}
return result;
}

3 匹配
ServiceImpl类:匹配核心代码

public JsonResult matchCompanyName(CompanyDTO companyDTO, String accessToken, String localIp){
// 对公司名称进行处理
StringsourceCompanyName= companyDTO.getCompanyName();
StringtargetCompanyName= sourceCompanyName;
    log.info("处理前公司名称:{}",targetCompanyName);
// 处理圆括号
    targetCompanyName = targetCompanyName.replaceAll("[(]|[)]|[(]|[)]","");
// 处理公司相关关键词
    targetCompanyName = targetCompanyName.replaceAll("[(集团|股份|有限|责任|分公司)]","");

if(!targetCompanyName.contains("银行")){
// 去除行政区域
        targetCompanyName = formatCompanyName(targetCompanyName);
}
// 分词
StringsplitCompanyName= splitWord(targetCompanyName);
//  匹配
List<Company> matchedCompany = companyRepository.queryMatchCompanyName(splitCompanyName,targetCompanyName);

List<String> result =newArrayList();
for(Company companyInfo : matchedCompany){
        result.add(companyInfo.getCompanyName());
if(companyDTO.getCompanyId().equals(companyInfo.getCompanyId())){
            result.remove(companyInfo.getCompanyName());
}
}
returnJsonResult.successResult(result);
}

Repository类:编写SQL语句

/**  
* 模糊匹配公司名称  
* @param companyNameRegex 分词后的公司名称
* @param companyName 分词前的公司名称  
* @return  
*/
@Query(value = 
"SELECT * FROM company WHERE isDeleted = '0' and companyName REGEXP ?1 
ORDER BY length(REPLACE(companyName,?2,''))/length(companyName) ",
nativeQuery = true)  
List<Company> queryMatchCompanyName(String companyNameRegex,String companyName);

按照匹配度排序这个功能点,LENGTH(companyName)返回companyName的长度,LENGTH(REPLACE(companyName, ?2, ‘’))计算出companyName中关键词出现的次数。通过这种方式,可以根据匹配程度进行排序,匹配次数越多的公司名称排序越靠前。

四、效果展示

在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1890978.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

盒子模型(笔记)

盒子模型 盒子模型的属性 padding属性 内边距&#xff1a;盒子的边框到内容的距离 /*每个方向内边距*/padding-top: 20px;padding-left:20px;padding-bottom:20px;padding-right: 20px; /*每个方向内边距的第二种方法*/ /* 顺序依次是上左右下*/padding: 10px 20px 30px 4…

WIN32核心编程 - 数据类型 错误处理 字符处理

公开视频 -> 链接点击跳转公开课程博客首页 -> 链接点击跳转博客主页 目录 数据类型 基本数据类型 Win32基本数据类型 错误处理 C语言中的错误处理 C中的错误处理 Win32中的错误处理 字符处理 C/C WIN32 字符处理 数据类型 基本数据类型 C/C语言定义了一系列…

Linux的Socket开发概述

套接字&#xff08;socket&#xff09;是 Linux 下的一种进程间通信机制&#xff08;socket IPC&#xff09;&#xff0c;在前面的内容中已经给大家提到过&#xff0c;使用 socket IPC 可以使得在不同主机上的应用程序之间进行通信&#xff08;网络通信&#xff09;&#xff0c…

cv2.cvtColor的示例用法

-------------OpenCV教程集合------------- Python教程99&#xff1a;一起来初识OpenCV&#xff08;一个跨平台的计算机视觉库&#xff09; OpenCV教程01&#xff1a;图像的操作&#xff08;读取显示保存属性获取和修改像素值&#xff09; OpenCV教程02&#xff1a;图像处理…

德国威步的技术演进之路(下):从云端许可管理到硬件加密狗的创新

从单机用户许可证到WkNET网络浮点授权的推出&#xff0c;再到引入使用次数和丰富的时间许可证管理&#xff0c;德国威步产品不断满足市场对灵活性和可扩展性的需求。TCP/IP浮动网络许可证进一步展示了威步技术在网络时代的创新应用。借助于2009年推出的借用许可证以及2015年推出…

CV- 人工智能-深度学习基础知识

一, 深度学习基础知识 1,什么是深度学习?机器学习是实现人工智能的一种途径,深度学习是机器学习的一个子集,也就是说深度学习是实现机器学习的一种方法。2, 传统机器学习算术依赖人工设计特征,并进行特征提取,而深度学习方法不需要人工,而是依赖算法自动提取特征。深度…

llm学习-4(llm和langchain)

langchain说明文档&#xff1a;langchain 0.2.6 — &#x1f99c;&#x1f517; langChain 0.2.6https://api.python.langchain.com/en/latest/langchain_api_reference.html#module-langchain.chat_models 1&#xff1a;模型 &#xff08;1&#xff09;自定义模型导入&#x…

代码随想录-Day46

121. 买卖股票的最佳时机 给定一个数组 prices &#xff0c;它的第 i 个元素 prices[i] 表示一支给定股票第 i 天的价格。 你只能选择 某一天 买入这只股票&#xff0c;并选择在 未来的某一个不同的日子 卖出该股票。设计一个算法来计算你所能获取的最大利润。 返回你可以从…

pmp顺利通关总结

目录 一、背景二、总结三、过程 一、背景 人活着总是想去做一些事情&#xff0c;通过这些事情来证明自己还活着。 而我证明自己还会活着并且活得很好的方式和途径&#xff0c;是通过这些东西去让自己有一个明确的边界节点&#xff1b;借此知识来验证自己的学习能力。 我坚定认…

掌握Go语言邮件发送:net/smtp实用教程与最佳实践

掌握Go语言邮件发送&#xff1a;net/smtp实用教程与最佳实践 概述基本配置与初始化导入net/smtp包设置SMTP服务器基本信息创建SMTP客户端实例身份验证 发送简单文本邮件配置发件人信息构建邮件头部信息编写邮件正文使用SendMail方法发送邮件示例代码 发送带附件的邮件邮件多部分…

硅纪元视角 | 1 分钟搞定 3D 创作,Meta 推出革命性 3D Gen AI 模型

在数字化浪潮的推动下&#xff0c;人工智能&#xff08;AI&#xff09;正成为塑造未来的关键力量。硅纪元视角栏目紧跟AI科技的最新发展&#xff0c;捕捉行业动态&#xff1b;提供深入的新闻解读&#xff0c;助您洞悉技术背后的逻辑&#xff1b;汇聚行业专家的见解&#xff0c;…

服务器之BIOS基础知识总结

1.BIOS是什么&#xff1f; BIOS全称Basic Input Output System&#xff0c;即基本输入输出系统&#xff0c;是固化在服务器主板的专用ROM上&#xff0c;加载在服务器硬件系统上最基本的运行程序&#xff0c;它位于服务器硬件和OS之间&#xff0c;在服务器启动过程中首先运行&am…

《亚马逊搬运亚马逊产品》配合跟卖采集爬取跟卖店铺高质量

亚马逊高质量产品如何搬运&#xff1f;亚马逊采集亚马逊。 哈喽大家好&#xff0c;大家讲一下做亚马逊是发货、铺货这块的功能。目前这款软件做跟卖大家都知道&#xff0c;同时也支持做铺货。铺货可以采集国内的1688、淘宝、京东都可以采&#xff0c;采完之后也可以采速卖通&a…

flutter开发实战-Webview及dispose关闭背景音

flutter开发实战-Webview及dispose关闭背景音 当在使用webview的时候&#xff0c;dispose需要关闭网页的背景音或者音效。 一、webview的使用 在工程的pubspec.yaml中引入插件 webview_flutter: ^4.4.2webview_cookie_manager: ^2.0.6Webview的使用代码如下 初始化WebView…

UiPath+Appium实现app自动化测试

一、环境准备工作 1.1 完成appium环境的搭建 参考&#xff1a;pythonappiumpytestallure模拟器(MuMu)自动化测试环境搭建_appium mumu模拟器-CSDN博客 1.2 完成uipath的安装 登录官网&#xff0c;完成注册与软件下载安装。 UiPath业务自动化平台&#xff1a;先进的RPA及自动…

Linux操作系统学习:day08

内容来自&#xff1a;Linux介绍 视频推荐&#xff1a;Linux基础入门教程-linux命令-vim-gcc/g -动态库/静态库 -makefile-gdb调试 目录 day0853、命令和编辑模式之间的切换54、命令模式到末行模式的切换与末行模式下的保存退出命令模式到末行模式的切换保存退出 55、末行模式…

大模型训练优化方法

写在前面 在训练模型尤其是大模型的时候&#xff0c;如何加快训练速度以及优化显存利用率是一个很关键的问题。本文主要参考HF上的一篇文章&#xff1a;https://huggingface.co/docs/transformers/perf_train_gpu_one&#xff0c;以及笔者在实际训练中的一些经验&#xff0c;给…

SpringBoot 整合 Minio 实现文件切片极速上传技术

Centos7安装Minio 创建目标文件夹 mkdir minio使用docker查看目标镜像状况 大家需要注意&#xff0c;此处我们首先需要安装docker&#xff0c;对于相关安装教程&#xff0c;大家可以查看我之前的文章&#xff0c;按部就班就可以&#xff0c;此处不再赘述&#xff01;&#x…

【电商指标详解】

前言&#xff1a; &#x1f49e;&#x1f49e;大家好&#xff0c;我是书生♡&#xff0c;本篇文章主要和大家分享一下电商行业中常见指标的详解&#xff01;存在的原因和作用&#xff01;&#xff01;&#xff01;希望对大家有所帮助。 &#x1f49e;&#x1f49e;代码是你的画…

论文学习笔记1:Federated Graph Neural Networks: Overview, Techniques, and Challenges

文章目录 一、introduction二、FedGNN术语与分类2.1主要分类法2.2辅助分类法 三、GNN-ASSISTED FL3.1Centralized FedGNNs3.2Decentralized FedGNNs 四、FL-ASSISTED GNNS4.1horizontal FedGNNs4.1.1Clients Without Missing Edges4.1.1.1Non-i.i.d. problem4.1.1.2Graph embed…