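Preface
In many search scenarios we want a query to return not only the items that match the search term but also the items that match its synonyms. In product search, for example, a query for “瓠瓜” (bottle gourd) should also return “西葫芦” (zucchini), yet products named “西葫芦” are not found because their names do not contain “瓠瓜”.
The term “瓠瓜” therefore has to be expanded into “瓠瓜” and “西葫芦”. Elasticsearch's synonym and synonym_graph token filters provide exactly this: they expand a term into its synonyms as part of analysis.
The following request declares an analyzer that treats “瓠瓜” and “西葫芦” as synonyms: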
// Define a custom analyzer
PUT info_goods_v1/_settings
{
  "analysis": {
    "filter": {
      "my_synonyms": {
        "type": "synonym_graph",
        "synonyms": [
          "瓠瓜,西葫芦"
        ]
      }
    },
    "analyzer": {
      "my_analyzer": {
        "type": "custom",
        "tokenizer": "ik_max_word",
        "filter": [
          "lowercase",
          "my_synonyms"
        ]
      }
    }
  }
}
// Analyze “瓠瓜” with the new analyzer
GET info_goods_v1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "瓠瓜"
}
// Result:
{
  "tokens" : [
    {
      "token" : "西葫芦",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "瓠",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "葫芦",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "瓜",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 2
    }
  ]
}
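As you can see, “瓠瓜” was analyzed into the tokens “西葫芦”, “葫芦”, “瓠” and “瓜”. This is because the custom analyzer declares “瓠瓜” and “西葫芦” as synonyms (“瓠瓜 => 瓠瓜, 西葫芦”): in effect, “瓠瓜” is first expanded to “瓠瓜” and “西葫芦”, and each member of that synonym set is then tokenized to produce the result.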
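To make the "expand first, then tokenize" idea concrete, here is a minimal sketch in plain Java. It is not how Lucene implements the synonym_graph filter; `SynonymExpansionSketch`, `parseRules` and `expand` are hypothetical names, and the sketch only handles comma-separated equivalence rules such as “瓠瓜,西葫芦”.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: expands a term into its synonym group. */
final class SynonymExpansionSketch {

    /** Parses rule lines such as "a,b,c" into a map from each term to its whole group. */
    static Map<String, Set<String>> parseRules(List<String> rules) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String rule : rules) {
            Set<String> group = new LinkedHashSet<>();
            for (String term : rule.split(",")) {
                String trimmed = term.trim();
                if (!trimmed.isEmpty()) {
                    group.add(trimmed);
                }
            }
            // Every member of the rule points at the same group
            for (String term : group) {
                groups.put(term, group);
            }
        }
        return groups;
    }

    /** Returns the synonym group of the term; a term without a rule maps to itself. */
    static List<String> expand(Map<String, Set<String>> groups, String term) {
        Set<String> group = groups.get(term);
        if (group == null) {
            return List.of(term);
        }
        return new ArrayList<>(group);
    }
}
```

With the single rule “瓠瓜,西葫芦”, `expand` maps either term to the pair “瓠瓜”, “西葫芦”; the tokenizer then runs on each member, which is where “葫芦”, “瓠” and “瓜” in the result above come from.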
Dizzy from all the “瓠瓜” and “西葫芦”? No rush, take a breath and let's continue...
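What if the synonyms change? One option is to close the index, update its analyzer settings and reopen it. Another is the elasticsearch-analysis-dynamic-synonym plugin, which reloads synonyms dynamically from an HTTP endpoint or from a file, but not from a database. That is not a problem: a small modification gets us there, and that modification is the main subject of this article.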
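Dynamic updating boils down to polling: at a fixed interval, ask the source whether anything changed and reload the rules if so. The sketch below shows that loop in isolation. `SynonymSource` and `SynonymPoller` are hypothetical names invented for this illustration; the plugin's own classes are different.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a synonym source (file, HTTP endpoint or database). */
interface SynonymSource {
    boolean hasChanged();   // true when the rules changed since the last check
    String loadRules();     // full rule text, one rule per line
}

/** Illustrative poller: keeps the latest rules and refreshes them when the source changes. */
final class SynonymPoller {
    private final SynonymSource source;
    private final AtomicReference<String> currentRules = new AtomicReference<>("");

    SynonymPoller(SynonymSource source) {
        this.source = source;
    }

    /** One polling step; returns true if the rules were reloaded. */
    boolean pollOnce() {
        if (!source.hasChanged()) {
            return false;
        }
        currentRules.set(source.loadRules());
        return true;
    }

    String currentRules() {
        return currentRules.get();
    }

    /** Runs pollOnce() every intervalSeconds; the caller shuts the executor down. */
    ScheduledExecutorService start(long intervalSeconds) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(() -> {
            try {
                pollOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel the schedule; keep the previous rules instead.
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
        return executor;
    }
}
```

The database variant built in the rest of this article fits the same shape: "has it changed" becomes a query for the latest modification time, and "load the rules" becomes a query for the synonym rows.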
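The process is as follows
Modify the source code to load synonyms from a database
Download elasticsearch-analysis-dynamic-synonym and open the project
1. Modify pom.xml
Add the dependency
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.21</version>
</dependency>
Change the version to match your ES version; mine, for example, is 7.17.7
<version>7.17.7</version>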
2. Modify main/assemblies/plugin.xml
Add the following under the <dependencySets> tag
<dependencySet>
    <outputDirectory/>
    <useProjectArtifact>true</useProjectArtifact>
    <useTransitiveFiltering>true</useTransitiveFiltering>
    <includes>
        <include>mysql:mysql-connector-java</include>
    </includes>
</dependencySet>
Add the following under the <assembly> tag
<fileSets>
    <fileSet>
        <directory>${project.basedir}/config</directory>
        <outputDirectory>config</outputDirectory>
    </fileSet>
</fileSets>
3. JDBC configuration file
Create the file config/jdbc.properties in the project root with the following content
jdbc.driver=com.mysql.cj.jdbc.Driver
jdbc.url=jdbc:mysql://cckg.liulingjie.cn:3306/test?useUnicode=true&characterEncoding=utf8&autoReconnect=true&useSSL=false&serverTimezone=Asia/Shanghai
jdbc.username=账号
jdbc.password=密码
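# SQL that selects the synonym rules (the result column must be aliased as words)
synonym.word.sql=SELECT `keys` AS words FROM es_synonym WHERE ifdel = '0'
# SQL that returns the latest modification time, used to detect updates (the result column must be aliased as maxModitime)
synonym.lastModitime.sql=SELECT MAX(moditime) AS maxModitime FROM es_synonym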
interval=10
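A missing or misspelled key in this file only shows up later, as a NullPointerException or NumberFormatException while the plugin starts. As a hedged sketch, the keys could be checked up front; `JdbcPropertiesCheck` and its methods are hypothetical helpers, not part of the plugin, and the key names are the ones used in the file above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative validation of the keys that jdbc.properties is expected to contain. */
final class JdbcPropertiesCheck {
    static final String[] REQUIRED_KEYS = {
        "jdbc.driver", "jdbc.url", "jdbc.username", "jdbc.password",
        "synonym.word.sql", "synonym.lastModitime.sql", "interval"
    };

    /** Parses properties text; in the plugin the text would come from the file instead. */
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Returns the required keys that are absent or blank, in declaration order. */
    static List<String> missingKeys(Properties properties) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = properties.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    /** Parses "interval", falling back to the given default when it is absent or not a number. */
    static int intervalSeconds(Properties properties, int fallback) {
        String raw = properties.getProperty("interval");
        if (raw == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

Calling `missingKeys` right after loading the file and failing with a message that lists the result makes configuration mistakes much easier to spot than a stack trace from deep inside the loader.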
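4. Write the synonym loader class
The com.bellszhu.elasticsearch.plugin.synonym.analysis package contains several classes that load synonyms; RemoteSynonymFile, for example, loads them from an HTTP endpoint.
In that package, create the class DynamicSynonymFromDb, which implements the SynonymFile interface and reads the synonyms from the database. The code is as follows: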
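package com.bellszhu.elasticsearch.plugin.synonym.analysis;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Properties;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.elasticsearch.env.Environment;
// Also import PathUtils (its package depends on the ES version) and the plugin's DynamicSynonymPlugin class.

/**
 * @author liulingjie
 * @date 2023/4/12 19:43
 */
public class DynamicSynonymFromDb implements SynonymFile {
    /**
     * Name of the configuration file
     */
    private static final String DB_PROPERTIES = "jdbc.properties";
    private static final Logger logger = LogManager.getLogger("dynamic-synonym");
    private String format;
    private boolean expand;
    private boolean lenient;
    private Analyzer analyzer;
    private Environment env;
    /**
     * Dynamic configuration type (the synonyms_path value)
     */
    private String location;
    /**
     * Group that the synonyms apply to
     */
    private String group;
    private long lastModified;
    private Path conf_dir;
    private JdbcConfig jdbcConfig;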
    DynamicSynonymFromDb(Environment env, Analyzer analyzer,
                         boolean expand, boolean lenient, String format, String location, String group) {
        this.analyzer = analyzer;
        this.expand = expand;
        this.lenient = lenient;
        this.format = format;
        this.env = env;
        this.location = location;
        this.group = group;
        // Read the configuration file
        setJdbcConfig();
        // Load the JDBC driver
        try {
            Class.forName(jdbcConfig.getDriver());
        } catch (ClassNotFoundException e) {
            logger.error("JDBC driver {} not found", jdbcConfig.getDriver(), e);
        }
        // Record the current modification time so that later checks only report newer changes
        isNeedReloadSynonymMap();
    }
    /**
     * Reads the configuration file
     */
    private void setJdbcConfig() {
        // Locate the config directory next to the directory that holds the plugin jar
        Path filePath = PathUtils.get(new File(DynamicSynonymPlugin.class.getProtectionDomain().getCodeSource()
                        .getLocation().getPath())
                        .getParent(), "config")
                .toAbsolutePath();
        this.conf_dir = filePath.resolve(DB_PROPERTIES);
        File file = conf_dir.toFile();
        Properties properties = new Properties();
        try (FileInputStream in = new FileInputStream(file)) {
            properties.load(in);
        } catch (IOException e) {
            // Fail fast: without the file every getProperty call below would return null
            logger.error("load jdbc.properties failed", e);
            throw new IllegalStateException("could not load " + conf_dir, e);
        }
        jdbcConfig = new JdbcConfig(
                properties.getProperty("jdbc.driver"),
                properties.getProperty("jdbc.url"),
                properties.getProperty("jdbc.username"),
                properties.getProperty("jdbc.password"),
                properties.getProperty("synonym.word.sql"),
                properties.getProperty("synonym.lastModitime.sql"),
                Integer.valueOf(properties.getProperty("interval"))
        );
    }
    /**
     * Loads the synonym rules into a SynonymMap
     * @return SynonymMap
     */
    @Override
    public SynonymMap reloadSynonymMap() {
        try {
            logger.info("start reload synonym from {}.", location);
            Reader rulesReader = getReader();
            SynonymMap.Builder parser = RemoteSynonymFile.getSynonymParser(rulesReader, format, expand, lenient, analyzer);
            return parser.build();
        } catch (Exception e) {
            // The placeholder argument comes first; the exception goes last so that its stack trace is logged
            logger.error("reload synonym from {} error!", location, e);
            throw new IllegalArgumentException(
                    "could not reload synonyms from db to build synonyms", e);
        }
    }
    /**
     * Checks whether the synonyms need to be reloaded
     * @return true or false
     */
    @Override
    public boolean isNeedReloadSynonymMap() {
        try {
            Long lastModify = getLastModify();
            // getLastModify() returns null when the query fails or the table is empty
            if (lastModify != null && lastModified < lastModify) {
                lastModified = lastModify;
                return true;
            }
        } catch (Exception e) {
            logger.error(e);
        }
        return false;
    }
    /**
     * Returns the time of the last modification of the synonym table,
     * used to decide whether the synonyms need to be reloaded
     *
     * @return the last modification time in epoch milliseconds, or null if it is unknown
     */
    public Long getLastModify() {
        Long lastModifyMillis = null;
        // try-with-resources closes the result set, statement and connection in reverse order
        try (Connection connection = DriverManager.getConnection(
                     jdbcConfig.getUrl(), jdbcConfig.getUsername(), jdbcConfig.getPassword());
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery(jdbcConfig.getSynonymLastModitimeSql())) {
            while (resultSet.next()) {
                Timestamp lastModifyTime = resultSet.getTimestamp("maxModitime");
                // MAX(moditime) is NULL when the table has no rows
                if (lastModifyTime != null) {
                    lastModifyMillis = lastModifyTime.getTime();
                }
            }
        } catch (SQLException e) {
            logger.error("failed to get the last modification time of the synonyms", e);
        }
        return lastModifyMillis;
    }
    /**
     * Queries the synonym rules from the database
     * @return the rules, one entry per row
     */
    public ArrayList<String> getDBData() {
        ArrayList<String> arrayList = new ArrayList<>();
        String sql = jdbcConfig.getSynonymWordSql();
        boolean filterByGroup = group != null && !group.trim().isEmpty();
        if (filterByGroup) {
            // Bind the group as a parameter instead of formatting it into the SQL string
            sql = sql + " AND `key_group` = ?";
        }
        try (Connection connection = DriverManager.getConnection(
                     jdbcConfig.getUrl(), jdbcConfig.getUsername(), jdbcConfig.getPassword());
             PreparedStatement statement = connection.prepareStatement(sql)) {
            if (filterByGroup) {
                statement.setString(1, group);
            }
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    arrayList.add(resultSet.getString("words"));
                }
            }
        } catch (SQLException e) {
            logger.error("failed to query the synonyms from the database", e);
        }
        return arrayList;
    }
    /**
     * Builds the synonym rules text, one rule per line
     * @return Reader
     */
    @Override
    public Reader getReader() {
        StringBuilder sb = new StringBuilder();
        try {
            for (String rule : getDBData()) {
                sb.append(rule).append(System.lineSeparator());
            }
            logger.info("load the synonym from db");
        } catch (Exception e) {
            logger.error("reload synonym from db failed:", e);
        }
        return new StringReader(sb.toString());
    }
}
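/**
 * Configuration class created for this change; it holds the values read from jdbc.properties.
 *
 * @author liulingjie
 * @date 2022/11/30 16:03
 */
public class JdbcConfig {
    /** JDBC driver class name */
    private String driver;
    /** Database URL */
    private String url;
    /** Database user name */
    private String username;
    /** Database password */
    private String password;
    /** SQL that selects the synonym rules; the result column must be named words */
    private String synonymWordSql;
    /** SQL that returns the latest modification time of the synonyms */
    private String synonymLastModitimeSql;
    /** Interval; currently unused */
    private Integer interval;

    public JdbcConfig() {
    }

    public JdbcConfig(String driver, String url, String username, String password, String synonymWordSql, String synonymLastModitimeSql, Integer interval) {
        this.driver = driver;
        this.url = url;
        this.username = username;
        this.password = password;
        this.synonymWordSql = synonymWordSql;
        this.synonymLastModitimeSql = synonymLastModitimeSql;
        this.interval = interval;
    }

    // Getters called by DynamicSynonymFromDb
    public String getDriver() { return driver; }
    public String getUrl() { return url; }
    public String getUsername() { return username; }
    public String getPassword() { return password; }
    public String getSynonymWordSql() { return synonymWordSql; }
    public String getSynonymLastModitimeSql() { return synonymLastModitimeSql; }
    public Integer getInterval() { return interval; }
}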
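The change detection in isNeedReloadSynonymMap can be isolated into a few lines. The sketch below is a simplified illustration, with `ChangeTracker` a hypothetical name: it remembers the newest modification time seen so far and reports a reload only when a strictly newer one arrives.

```java
/** Illustrative change tracker mirroring the lastModified comparison in DynamicSynonymFromDb. */
final class ChangeTracker {
    private long lastModified; // epoch millis of the newest modification seen so far

    /**
     * @param latest MAX(moditime) in epoch millis, or null when the table is empty or the query failed
     * @return true if the synonyms should be reloaded
     */
    synchronized boolean shouldReload(Long latest) {
        if (latest == null || latest <= lastModified) {
            return false;
        }
        lastModified = latest;
        return true;
    }
}
```

One consequence of comparing against MAX(moditime): a row that is physically deleted does not advance the maximum, so the deletion goes unnoticed. Marking rows as deleted through the ifdel column and updating moditime at the same time keeps such changes visible.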
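Then add a branch for the new loader to the getSynonymFile method of the DynamicSynonymTokenFilterFactory class.
Note that the group field is my own addition; you can remove it or pass an empty value!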
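As a rough sketch of such a branch, and only under stated assumptions: the loader is chosen from the synonyms_path value, with `fromDB` (the value used in the settings later in this article) selecting the database loader. The types below are simplified stand-ins invented for this illustration, not the plugin's classes, and the remote and local branches merely paraphrase the HTTP and file loading described earlier.

```java
/** Simplified stand-in for the plugin's SynonymFile abstraction. */
interface LoaderSketch {
    String describe();
}

/** Illustrative dispatch on the synonyms_path value. */
final class LoaderFactorySketch {
    /** Assumed marker value of synonyms_path that selects the database loader. */
    static final String FROM_DB = "fromDB";

    static LoaderSketch choose(String location, String group) {
        if (FROM_DB.equals(location)) {
            // In the plugin this is where a DynamicSynonymFromDb would be constructed,
            // passing env, analyzer, expand, lenient, format, location and group.
            return () -> "db(group=" + (group == null ? "" : group) + ")";
        }
        if (location.startsWith("http://") || location.startsWith("https://")) {
            return () -> "remote(" + location + ")";
        }
        return () -> "local(" + location + ")";
    }
}
```

Checking for the database marker first keeps the existing branches untouched, so indices that already use a URL or a file path behave as before.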
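5. Package
Finally, run package (for example from the IDE's Maven panel) to build the plugin
The zip archive appears under ~\target\releases
6. Install the plugin into ES
Create a dynamic-synonym folder under plugins in the ES installation directory and extract the archive above into it
Finally, restart ES and check its startup output to confirm that the plugin was loaded
7. Try it out
Now use the new filter type. A reference request: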
POST info_goods/_close
# "interval" below is the refresh interval in seconds
PUT info_goods/_settings
{
  "analysis": {
    "filter": {
      "my_synonyms": {
        "type": "dynamic_synonym",
        "synonyms_path": "fromDB",
        "interval": 30
      }
    },
    "analyzer": {
      "my_analyzer": {
        "type": "custom",
        "tokenizer": "ik_max_word",
        "filter": [
          "lowercase",
          "my_synonyms"
        ]
      }
    }
  }
}
POST info_goods/_open
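The order of the three requests matters: analysis settings can only be changed while the index is closed. If you would rather script the sequence than type it into a console, the sketch below builds the same requests with the JDK's java.net.http client (Java 11 or later). `AnalyzerUpdateRequests` is a hypothetical helper, and the base URL passed in (for example http://localhost:9200) is an assumption about your setup.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;
import java.util.List;

/** Illustrative builder for the close, update settings, open sequence. */
final class AnalyzerUpdateRequests {

    /** Returns the three requests in the order in which they have to be sent. */
    static List<HttpRequest> build(String baseUrl, String index, String settingsJson) {
        String indexUrl = baseUrl + "/" + index;
        HttpRequest close = HttpRequest.newBuilder(URI.create(indexUrl + "/_close"))
                .POST(BodyPublishers.noBody())
                .build();
        HttpRequest update = HttpRequest.newBuilder(URI.create(indexUrl + "/_settings"))
                .header("Content-Type", "application/json")
                .PUT(BodyPublishers.ofString(settingsJson))
                .build();
        HttpRequest open = HttpRequest.newBuilder(URI.create(indexUrl + "/_open"))
                .POST(BodyPublishers.noBody())
                .build();
        return List.of(close, update, open);
    }
}
```

Each request can then be sent in turn with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, stopping at the first response whose status code is not 200.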
A quick test
# Analyze “瓠瓜”
GET info_goods/_analyze
{
  "analyzer": "my_analyzer",
  "text": "瓠瓜"
}
# Result
{
  "tokens" : [
    {
      "token" : "西葫芦",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "瓠",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "葫芦",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "SYNONYM",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "瓜",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 2
    }
  ]
}
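It works! Mission accomplished! Hehe ^_^
I know you are lazy, so the source code and the final plugin package have been uploaded; download them if you need them ^_^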
Troubleshooting
If the following error occurs:
java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:3306 connect,resolve)
create a policy file socketPolicy.policy that grants the permission for the database host and port used in jdbc.url:
grant {
    permission java.net.SocketPermission "cckg.liulingjie.cn:3306","connect,resolve";
    permission java.net.SocketPermission "localhost:3306","connect,resolve";
};
Then edit the elasticsearch-7.17.7\config\jvm.options configuration file and point it at the socketPolicy.policy file
-Djava.security.policy=D:\ProgramFiles\elasticsearch-7.17.7\plugins\ik\config\socketPolicy.policy
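Restart ES and the error is gone
If ES is installed as a Windows service, remember to run the following commands to register the service again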
elasticsearch-service.bat remove
elasticsearch-service.bat install
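A JVM-wide policy file is one way out. Another route that Elasticsearch documents for plugin authors, not tested in this article and mentioned here only as an assumption about what would fit this plugin, is to ship a plugin-security.policy file with the plugin and run the sensitive call inside AccessController.doPrivileged. A minimal sketch of the wrapping part, with `PrivilegedCalls` a hypothetical helper name:

```java
import java.security.AccessController;
import java.security.PrivilegedAction;
import java.util.function.Supplier;

/** Illustrative helper that runs an action with the permissions granted to the plugin's own code. */
final class PrivilegedCalls {

    @SuppressWarnings("removal") // AccessController is deprecated for removal in recent JDKs
    static <T> T call(Supplier<T> action) {
        // The cast selects the PrivilegedAction overload of doPrivileged
        return AccessController.doPrivileged((PrivilegedAction<T>) action::get);
    }
}
```

In DynamicSynonymFromDb the body of getLastModify and getDBData would be the natural candidates to wrap, for example `PrivilegedCalls.call(() -> getDBData())`, together with a matching SocketPermission grant in the plugin's policy file.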