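No discussion of Chinese analyzers for ES is complete without the IK analyzer. IK currently offers two ways to load main-dictionary words and stop words:
First, from main.dic and stopword.dic under the ik\config directory. Every change requires a restart before it takes effect.
Second, from a remote endpoint that returns the word list, with the endpoint path set in the configuration. Every call has to return the complete word list, which is inefficient.
The goal here is to update the dictionary incrementally by connecting to the database directly over JDBC. That means finding the methods that add main words and stop words, then calling them with the words fetched from the database over JDBC.
Download the IK source. I downloaded version 7.17.6; because my ES is 7.17.7, I changed the plugin version to 7.17.7 after downloading to avoid an error at startup.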
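How the dictionary is loaded
(1) Find the Dictionary.initial method
Word loading happens in Dictionary.initial. This method loads the words from each dictionary file and also starts a scheduled task that fetches words from the remote endpoint to update the dictionary.
(2) Next, follow singleton.loadMainDict -> loadExtDict -> loadDictFile
Here dict.fillSegment is the call that adds a main word.
(3) Likewise, _StopWords.fillSegment is the call that loads a stop word
So all we need to do is fetch the words and call the corresponding fillSegment to load them.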
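To make the role of fillSegment (and the disableSegment used later) more concrete, here is a minimal sketch of a character trie in which the last node of each word carries an enabled flag. This is not IK's DictSegment: the class MiniDict and its methods fill, disable and contains are hypothetical names, and the sketch only illustrates the general idea that adding a word marks a path as a word, while disabling it clears the mark without removing any nodes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a dictionary trie; not IK's DictSegment. */
class MiniDict {
    private final Map<Character, MiniDict> children = new HashMap<>();
    // true when the path from the root to this node spells an enabled word
    private boolean isWord;

    /** Adds a word or re-enables it (similar in spirit to fillSegment). */
    public void fill(char[] word) {
        mark(word, true);
    }

    /** Switches a word off without deleting its nodes (similar in spirit to disableSegment). */
    public void disable(char[] word) {
        mark(word, false);
    }

    private void mark(char[] word, boolean enabled) {
        MiniDict node = this;
        for (char c : word) {
            // walk down the trie, creating missing nodes on the way
            node = node.children.computeIfAbsent(c, k -> new MiniDict());
        }
        node.isWord = enabled;
    }

    /** Returns true only when the whole string is an enabled word. */
    public boolean contains(String word) {
        MiniDict node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        MiniDict dict = new MiniDict();
        dict.fill("分词器".toCharArray());
        System.out.println(dict.contains("分词器")); // true
        System.out.println(dict.contains("分词"));   // false: only a prefix
        dict.disable("分词器".toCharArray());
        System.out.println(dict.contains("分词器")); // false: disabled
    }
}
```

Because both operations work on a single word, they can be applied one row at a time, which is what makes an incremental update possible.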
Preparation
(1) Table design
Main-word table:
CREATE TABLE `es_dic_main` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(100) NOT NULL COMMENT 'word',
  `moditime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `ifdel` char(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='main words';
Stop-word table:
CREATE TABLE `es_dic_stop` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(100) NOT NULL COMMENT 'stop word',
  `moditime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `ifdel` char(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='stop words';
(2) Create the JDBC configuration file jdbc.properties under the /config directory:
jdbc.url=jdbc:mysql://cckg.liulingjie.cn:3306/test?useUnicode=true&characterEncoding=utf8&autoReconnect=true&useSSL=false&serverTimezone=Asia/Shanghai
jdbc.username=your_username
jdbc.password=your_password
# SQL for the incremental main-word query
main.word.sql=SELECT * FROM es_dic_main WHERE moditime >= ?
# SQL for the incremental stop-word query
stop.word.sql=SELECT * FROM es_dic_stop WHERE moditime >= ?
# polling interval (seconds)
interval=10
(3) Add the JDBC driver dependency to pom.xml:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.21</version>
</dependency>
(4) Add the following to src/main/assemblies/plugin.xml so that the MySQL driver jar is included in the package:
<dependencySet>
    <outputDirectory/>
    <useProjectArtifact>true</useProjectArtifact>
    <useTransitiveFiltering>true</useTransitiveFiltering>
    <includes>
        <include>mysql:mysql-connector-java</include>
    </includes>
</dependencySet>
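Implementation
Overall flow:
Two classes are involved: Dictionary, and a new class, JdbcMonitor.
Dictionary: reads the configuration, loads words, and starts the word-update task.
JdbcMonitor: a class that implements the Runnable interface; it reads words from the database over JDBC and calls the Dictionary methods to load them.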
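Before going into the real classes, the polling idea can be sketched without a database. In the sketch below an in-memory list stands in for the es_dic_main table and a HashSet stands in for the dictionary; IncrementalPoller, Row, table and dictionary are all hypothetical names chosen for illustration. Each run only looks at rows whose moditime is at or after the last timestamp seen (the watermark), adds or removes the word according to the soft-delete flag, and then moves the watermark forward.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of watermark-based incremental polling; no real database involved. */
class IncrementalPoller implements Runnable {

    /** One table row: the word, its last-modified time in millis, and the soft-delete flag. */
    static final class Row {
        final String word;
        final long moditime;
        final boolean deleted;

        Row(String word, long moditime, boolean deleted) {
            this.word = word;
            this.moditime = moditime;
            this.deleted = deleted;
        }
    }

    final List<Row> table = new ArrayList<>();      // stand-in for the word table
    final Set<String> dictionary = new HashSet<>(); // stand-in for the in-memory dictionary
    private long lastModitime = 0L;                 // watermark: newest moditime seen so far

    @Override
    public void run() {
        long newest = lastModitime;
        for (Row row : table) {
            if (row.moditime < lastModitime) {
                continue; // same filter as "WHERE moditime >= ?"
            }
            if (row.deleted) {
                dictionary.remove(row.word);
            } else {
                dictionary.add(row.word);
            }
            newest = Math.max(newest, row.moditime);
        }
        lastModitime = newest;
    }

    public static void main(String[] args) {
        IncrementalPoller poller = new IncrementalPoller();
        poller.table.add(new Row("elasticsearch", 100L, false));
        poller.run();
        System.out.println(poller.dictionary); // [elasticsearch]

        // the row is soft-deleted later, which also bumps its moditime
        poller.table.set(0, new Row("elasticsearch", 200L, true));
        poller.run();
        System.out.println(poller.dictionary); // []
    }
}
```

Note that with >= the rows sitting exactly on the watermark are read again on every round. That is harmless here because adding or removing the same word twice changes nothing, and it avoids skipping a row that shares its timestamp with the last row of the previous round.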
(1) Add the following methods to the Dictionary class to expose word-level operations
Code:
protected void fillSegmentMain(String word) {
    _MainDict.fillSegment(word.trim().toCharArray());
}

protected void disableSegmentMain(String word) {
    _MainDict.disableSegment(word.trim().toCharArray());
}

protected void fillSegmentStop(String word) {
    _StopWords.fillSegment(word.trim().toCharArray());
}

protected void disableSegmentStop(String word) {
    _StopWords.disableSegment(word.trim().toCharArray());
}
(2) Read jdbc.properties in the Dictionary constructor
Code:
public class JdbcConfig {
    private String url;
    private String username;
    private String password;
    private String mainWordSql;
    private String stopWordSql;
    private Integer interval;
    // all-args constructor, getters and setters omitted
}

private Dictionary(Configuration cfg) {
    // ...... omitted
    // read the JDBC configuration
    setJdbcConfig();
}
private void setJdbcConfig() {
    // PATH_JDBC_CONFIG is a constant added to Dictionary that holds the file name of the JDBC configuration
    Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_JDBC_CONFIG);
    Properties properties = new Properties();
    // try-with-resources closes the stream even when load() throws
    try (InputStream in = new FileInputStream(file.toFile())) {
        properties.load(in);
    } catch (IOException e) {
        logger.error("load jdbc.properties failed", e);
    }
    jdbcConfig = new JdbcConfig(
        properties.getProperty("jdbc.url"),
        properties.getProperty("jdbc.username"),
        properties.getProperty("jdbc.password"),
        properties.getProperty("main.word.sql"),
        properties.getProperty("stop.word.sql"),
        // fall back to 10 seconds so a missing file or key does not cause a NumberFormatException
        Integer.valueOf(properties.getProperty("interval", "10"))
    );
}
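The loading step is easy to get subtly wrong when the file or a key is missing, so here is a small standalone sketch of defensive parsing with java.util.Properties. JdbcSettings and its load method are hypothetical names, not part of IK, and the fallback of 10 seconds is my own assumption that mirrors the interval used in this post. It reads from a Reader so it can be tried without a file on disk.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper showing defensive parsing of jdbc.properties-style settings. */
class JdbcSettings {
    static final int DEFAULT_INTERVAL_SECONDS = 10; // assumed default, not an IK setting

    final String url;
    final int intervalSeconds;

    JdbcSettings(String url, int intervalSeconds) {
        this.url = url;
        this.intervalSeconds = intervalSeconds;
    }

    static JdbcSettings load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException("cannot read JDBC settings", e);
        }
        String url = props.getProperty("jdbc.url");
        if (url == null || url.trim().isEmpty()) {
            // fail early with a clear message instead of a NullPointerException later
            throw new IllegalArgumentException("jdbc.url is required");
        }
        int interval;
        try {
            interval = Integer.parseInt(
                props.getProperty("interval", String.valueOf(DEFAULT_INTERVAL_SECONDS)).trim());
        } catch (NumberFormatException e) {
            interval = DEFAULT_INTERVAL_SECONDS;
        }
        // the scheduler rejects a period of zero or less, so clamp to at least one second
        return new JdbcSettings(url.trim(), Math.max(1, interval));
    }

    public static void main(String[] args) {
        String text = "jdbc.url=jdbc:mysql://localhost:3306/test\ninterval=abc\n";
        JdbcSettings settings = load(new StringReader(text));
        System.out.println(settings.url);             // jdbc:mysql://localhost:3306/test
        System.out.println(settings.intervalSeconds); // 10
    }
}
```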
(3) Create the JdbcMonitor class, which connects to the database on a schedule to read and update words
package org.wltea.analyzer.dic;

import org.apache.logging.log4j.Logger;
import org.elasticsearch.SpecialPermission;
import org.wltea.analyzer.cfg.JdbcConfig;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

import java.security.AccessController;
import java.security.PrivilegedAction;
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

/**
 * @author liulingjie
 * @date 2022/11/29 20:36
 */
public class JdbcMonitor implements Runnable {

    static {
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
        } catch (Exception e) {
            // the static logger is declared further down and is not initialised yet, so print directly
            e.printStackTrace();
        }
    }
    /**
     * JDBC configuration
     */
    private JdbcConfig jdbcConfig;

    /**
     * Last update time of the main words
     */
    private Timestamp mainLastModitime = Timestamp.valueOf("2022-01-01 00:00:00");

    /**
     * Last update time of the stop words
     */
    private Timestamp stopLastModitime = Timestamp.valueOf("2022-01-01 00:00:00");

    private static final Logger logger = ESPluginLoggerFactory.getLogger(JdbcMonitor.class.getName());

    public JdbcMonitor(JdbcConfig jdbcConfig) {
        this.jdbcConfig = jdbcConfig;
    }
    @Override
    public void run() {
        SpecialPermission.check();
        AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
            this.runUnprivileged();
            return null;
        });
    }

    /**
     * Load main words and stop words
     */
    public void runUnprivileged() {
        //Dictionary.getSingleton().reLoadMainDict();
        loadWords();
    }

    private void loadWords() {
        List<String> mainWords = new ArrayList<>();
        List<String> delMainWords = new ArrayList<>();
        List<String> stopWords = new ArrayList<>();
        List<String> delStopWords = new ArrayList<>();
        setAllWordList(mainWords, delMainWords, stopWords, delStopWords);
        mainWords.forEach(w -> Dictionary.getSingleton().fillSegmentMain(w));
        delMainWords.forEach(w -> Dictionary.getSingleton().disableSegmentMain(w));
        stopWords.forEach(w -> Dictionary.getSingleton().fillSegmentStop(w));
        delStopWords.forEach(w -> Dictionary.getSingleton().disableSegmentStop(w));
        logger.info("ik dic refresh from db. mainLastModitime: {} stopLastModitime: {}", mainLastModitime, stopLastModitime);
    }
    /**
     * Fetch main words and stop words
     *
     * @param mainWords    main words to add
     * @param delMainWords main words to disable
     * @param stopWords    stop words to add
     * @param delStopWords stop words to disable
     */
    private void setAllWordList(List<String> mainWords, List<String> delMainWords, List<String> stopWords, List<String> delStopWords) {
        // try-with-resources closes the connection even when a query fails
        try (Connection connection = DriverManager.getConnection(jdbcConfig.getUrl(), jdbcConfig.getUsername(), jdbcConfig.getPassword())) {
            setWordList(connection, jdbcConfig.getMainWordSql(), mainLastModitime, mainWords, delMainWords);
            setWordList(connection, jdbcConfig.getStopWordSql(), stopLastModitime, stopWords, delStopWords);
        } catch (SQLException e) {
            // passing the exception as the last argument logs the full stack trace
            logger.error("jdbc load words failed: mainLastModitime-{} stopLastModitime-{}", mainLastModitime, stopLastModitime, e);
        }
    }
    /**
     * Query the database for words changed since the last run
     *
     * @param connection   open database connection
     * @param sql          incremental query with one timestamp parameter
     * @param lastModitime last update time; advanced in place to the newest moditime read
     * @param words        receives the words to add
     * @param delWords     receives the words to disable
     */
    private void setWordList(Connection connection, String sql, Timestamp lastModitime, List<String> words, List<String> delWords) {
        // try-with-resources closes the statement and the result set automatically
        try (PreparedStatement prepareStatement = connection.prepareStatement(sql)) {
            prepareStatement.setTimestamp(1, lastModitime);
            try (ResultSet result = prepareStatement.executeQuery()) {
                while (result.next()) {
                    String word = result.getString("word");
                    Timestamp moditime = result.getTimestamp("moditime");
                    String ifdel = result.getString("ifdel");
                    if ("1".equals(ifdel)) {
                        delWords.add(word);
                    } else {
                        words.add(word);
                    }
                    // keep the latest timestamp
                    if (moditime.after(lastModitime)) {
                        lastModitime.setTime(moditime.getTime());
                    }
                }
            }
        } catch (SQLException e) {
            logger.error("jdbc load words failed: {}", lastModitime, e);
        }
    }
}
(4) Finally, start the scheduled task in Dictionary.initial
Code:
public static synchronized void initial(Configuration cfg) {
    if (singleton == null) {
        synchronized (Dictionary.class) {
            if (singleton == null) {
                singleton = new Dictionary(cfg);
                ......
                // start incremental updates from the database
                pool.scheduleAtFixedRate(new JdbcMonitor(singleton.jdbcConfig), 10, singleton.jdbcConfig.getInterval(), TimeUnit.SECONDS);
            }
        }
    }
}
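One thing worth knowing about scheduleAtFixedRate: if any run of the task throws, the executor suppresses all later runs. JdbcMonitor catches SQLException, but an unexpected RuntimeException would silently stop the updates. A possible safeguard, sketched below, is to wrap the task so that runtime exceptions are caught and counted. GuardedTask is a hypothetical class of my own, not something IK provides, and the sketch uses a plain single-thread scheduler rather than the plugin's pool.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper that keeps a periodic task alive when one run fails. */
class GuardedTask implements Runnable {
    private final Runnable delegate;
    private final AtomicInteger failures = new AtomicInteger();

    GuardedTask(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (RuntimeException e) {
            // swallow the failure so the scheduler keeps calling us on the next tick
            failures.incrementAndGet();
            System.err.println("scheduled task failed: " + e.getMessage());
        }
    }

    int failureCount() {
        return failures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        // a task that always fails, standing in for an unreachable database
        GuardedTask task = new GuardedTask(() -> {
            throw new IllegalStateException("database unreachable");
        });
        pool.scheduleAtFixedRate(task, 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(300);
        pool.shutdownNow();
        // more than one failure means the schedule survived the first exception
        System.out.println("failures so far: " + task.failureCount());
    }
}
```

With this kind of wrapper the call in initial would become something like pool.scheduleAtFixedRate(new GuardedTask(new JdbcMonitor(...)), ...), assuming GuardedTask were added to the plugin source.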
(5) Finally, build with mvn clean package; the plugin package is generated under ~\target\releases
(6) Unzip it into <ES install path>/plugins/ik
Then restart ES and you are done.