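Series index
Elasticsearch in Practice (1): Unified search with Elasticsearch in Spring Boot
Elasticsearch in Practice (2): Chinese and pinyin autocompletion and automatic spelling correction with Elasticsearch in Spring Boot
Elasticsearch in Practice (3): Search recommendations with Elasticsearch in Spring Boot
Elasticsearch in Practice (4): Metric aggregations and drill-down analysis with Elasticsearch in Spring Boot
Elasticsearch in Practice (5): E-commerce log tracking and hot search terms with Elasticsearch in Spring Boot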
I. Installing the pinyin analysis plugin
1. Download links
Source code: https://github.com/medcl/elasticsearch-analysis-pinyin
Releases: https://github.com/medcl/elasticsearch-analysis-pinyin/releases
This article uses version 7.4.0 (the plugin version must match your Elasticsearch version): https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.4.0/elasticsearch-analysis-pinyin-7.4.0.zip
2. Download and install
mkdir /mydata/elasticsearch/plugins/elasticsearch-analysis-pinyin-7.4.0
cd /mydata/elasticsearch/plugins/elasticsearch-analysis-pinyin-7.4.0
# Download
wget https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.4.0/elasticsearch-analysis-pinyin-7.4.0.zip
# Unzip, then delete the archive
unzip elasticsearch-analysis-pinyin-7.4.0.zip
rm -f elasticsearch-analysis-pinyin-7.4.0.zip
# Restart Elasticsearch (replace the container ID with your own)
docker restart 558eded797f9
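3. Analyzer properties
When creating an index you can define a custom analyzer in the settings and then reference it from a field mapping. The analyzer below combines the ik_smart tokenizer (which requires the IK analysis plugin to be installed as well) with a pinyin token filter: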
{
"indexName": "product_completion_index",
"map": {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 2,
"analysis": {
"analyzer": {
"ik_pinyin_analyzer": {
"type": "custom",
"tokenizer": "ik_smart",
"filter": "pinyin_filter"
}
},
"filter": {
"pinyin_filter": {
"type": "pinyin",
"keep_first_letter": true,
"keep_separate_first_letter": false,
"keep_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"lowercase": true,
"remove_duplicated_term": true
}
}
}
},
"mapping": {
"properties": {
"name": {
"type": "text"
},
"searchkey": {
"type": "completion",
"analyzer": "ik_pinyin_analyzer"
}
}
}
}
}
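When the settings are assembled in code rather than sent as JSON, the same structure can be built as nested maps. The following is a minimal sketch, assuming plain java.util maps are acceptable to whatever consumes them; the class name PinyinSettingsSketch and its methods are made up for illustration and are not part of the original project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the "settings" object shown above as nested maps.
public class PinyinSettingsSketch {

    /** The "pinyin_filter" token filter definition, option for option. */
    public static Map<String, Object> pinyinFilter() {
        Map<String, Object> filter = new LinkedHashMap<>();
        filter.put("type", "pinyin");
        filter.put("keep_first_letter", true);
        filter.put("keep_separate_first_letter", false);
        filter.put("keep_full_pinyin", true);
        filter.put("keep_original", true);
        filter.put("limit_first_letter_length", 16);
        filter.put("lowercase", true);
        filter.put("remove_duplicated_term", true);
        return filter;
    }

    /** Shard settings plus the analysis section (analyzer + filter). */
    public static Map<String, Object> settings() {
        Map<String, Object> analyzer = new LinkedHashMap<>();
        analyzer.put("type", "custom");
        analyzer.put("tokenizer", "ik_smart");
        analyzer.put("filter", "pinyin_filter");

        Map<String, Object> analyzers = new LinkedHashMap<>();
        analyzers.put("ik_pinyin_analyzer", analyzer);

        Map<String, Object> filters = new LinkedHashMap<>();
        filters.put("pinyin_filter", pinyinFilter());

        Map<String, Object> analysis = new LinkedHashMap<>();
        analysis.put("analyzer", analyzers);
        analysis.put("filter", filters);

        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put("number_of_shards", 1);
        settings.put("number_of_replicas", 2);
        settings.put("analysis", analysis);
        return settings;
    }

    public static void main(String[] args) {
        // Prints the nested map; it mirrors the JSON "settings" object above.
        System.out.println(settings());
    }
}
```

LinkedHashMap is used only so that the printed order matches the JSON; the key order has no meaning to Elasticsearch.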
II. Building a custom corpus
1. Create the index and mapping
import java.util.Map;
import org.elasticsearch.client.IndicesClient;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.CreateIndexResponse;

/*
 * @Description: Create an index with settings, mapping, and the custom pinyin analyzer.
 *   settings may be empty (the custom pinyin analyzer is defined in settings)
 *   mapping may be empty
 * @Method: addIndexAndMapping
 * @Param: [commonEntity]
 * @Return: boolean
 *
 */
@SuppressWarnings("unchecked")
public boolean addIndexAndMapping(CommonEntity commonEntity) throws Exception {
    // Build the create-index request
    CreateIndexRequest request = new CreateIndexRequest(commonEntity.getIndexName());
    // Parameters sent by the front end
    Map<String, Object> map = commonEntity.getMap();
    // Walk the top-level "settings" and "mapping" entries
    for (Map.Entry<String, Object> entry : map.entrySet()) {
        if ("settings".equals(entry.getKey())) {
            if (entry.getValue() instanceof Map && ((Map) entry.getValue()).size() > 0) {
                request.settings((Map<String, Object>) entry.getValue());
            }
        }
        if ("mapping".equals(entry.getKey())) {
            if (entry.getValue() instanceof Map && ((Map) entry.getValue()).size() > 0) {
                request.mapping((Map<String, Object>) entry.getValue());
            }
        }
    }
    // Get the index-management client
    IndicesClient indices = client.indices();
    // Execute the request
    CreateIndexResponse response = indices.create(request, RequestOptions.DEFAULT);
    // true if the cluster acknowledged the index creation
    return response.isAcknowledged();
}
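Contents of CommonEntity:
The entries under settings are the index settings; they are passed through as-is and follow the Elasticsearch DSL.
The entries under mapping are the field mappings; they are passed through as-is and follow the Elasticsearch DSL.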
{
"indexName": "product_completion_index",
"map": {
"settings": {
"number_of_shards": 1,
"number_of_replicas": 2,
"analysis": {
"analyzer": {
"ik_pinyin_analyzer": {
"type": "custom",
"tokenizer": "ik_smart",
"filter": "pinyin_filter"
}
},
"filter": {
"pinyin_filter": {
"type": "pinyin",
"keep_first_letter": true,
"keep_separate_first_letter": false,
"keep_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"lowercase": true,
"remove_duplicated_term": true
}
}
}
},
"mapping": {
"properties": {
"name": {
"type": "keyword"
},
"searchkey": {
"type": "completion",
"analyzer": "ik_pinyin_analyzer"
}
}
}
}
}
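The article never shows the CommonEntity class itself. Judging only from the getters and setters called in the code (getIndexName, getMap, getList, getSuggestFileld, getSuggestValue, getSuggestCount), it might look roughly like the sketch below. The field types are assumptions, and the spelling suggestFileld is kept so the sketch matches the method names used in the article.

```java
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction of the request DTO; field types are assumptions.
public class CommonEntity {
    private String indexName;                 // target index
    private Map<String, Object> map;          // "settings" / "mapping" payload
    private List<Map<String, Object>> list;   // documents for bulk indexing
    private String suggestFileld;             // field to run suggestions against
    private String suggestValue;              // text typed by the user
    private Integer suggestCount;             // number of suggestions to return

    public String getIndexName() { return indexName; }
    public void setIndexName(String indexName) { this.indexName = indexName; }

    public Map<String, Object> getMap() { return map; }
    public void setMap(Map<String, Object> map) { this.map = map; }

    public List<Map<String, Object>> getList() { return list; }
    public void setList(List<Map<String, Object>> list) { this.list = list; }

    public String getSuggestFileld() { return suggestFileld; }
    public void setSuggestFileld(String suggestFileld) { this.suggestFileld = suggestFileld; }

    public String getSuggestValue() { return suggestValue; }
    public void setSuggestValue(String suggestValue) { this.suggestValue = suggestValue; }

    public Integer getSuggestCount() { return suggestCount; }
    public void setSuggestCount(Integer suggestCount) { this.suggestCount = suggestCount; }

    public static void main(String[] args) {
        CommonEntity entity = new CommonEntity();
        entity.setIndexName("product_completion_index");
        System.out.println(entity.getIndexName());
    }
}
```

With a JSON mapper such as the one Spring Boot configures by default, the indexName and map keys of the request body above would bind to the first two fields.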
Or run the equivalent request directly in Kibana (note that the native API key is mappings, not mapping):
PUT product_completion_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 2,
"analysis": {
"analyzer": {
"ik_pinyin_analyzer": {
"type": "custom",
"tokenizer": "ik_smart",
"filter": "pinyin_filter"
}
},
"filter": {
"pinyin_filter": {
"type": "pinyin",
"keep_first_letter": true,
"keep_separate_first_letter": false,
"keep_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"lowercase": true,
"remove_duplicated_term": true
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"searchkey": {
"type": "completion",
"analyzer": "ik_pinyin_analyzer"
}
}
}
}
2. Bulk-index the documents
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.rest.RestStatus;

/*
 * @Description: Bulk-index documents; the index and mapping are created automatically if missing
 * @Method: bulkAddDoc
 * @Param: [commonEntity]
 *
 */
public static RestStatus bulkAddDoc(CommonEntity commonEntity) throws Exception {
    // Build the bulk request against the target index
    BulkRequest bulkRequest = new BulkRequest(commonEntity.getIndexName());
    // Add one index request per document in the list sent by the front end
    for (int i = 0; i < commonEntity.getList().size(); i++) {
        bulkRequest.add(new IndexRequest().source(XContentType.JSON, SearchTools.mapToObjectGroup(commonEntity.getList().get(i))));
    }
    // Execute the bulk request
    BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
    return bulkResponse.status();
}
public static void main(String[] args) throws Exception {
    // Bulk insert
    CommonEntity commonEntity = new CommonEntity();
    commonEntity.setIndexName("product_completion_index"); // index name
    List<Map<String, Object>> list = new ArrayList<>();
    commonEntity.setList(list);
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米手机").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米11").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米电视").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米9").putData("name", "小米(MI)"));
    // Deliberate duplicate of the first entry, used later to show de-duplication
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米手机").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米手环").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米笔记本").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "小米摄像头").putData("name", "小米(MI)"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "adidas男鞋").putData("name", "adidas男鞋"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "adidas女鞋").putData("name", "adidas女鞋"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "adidas外套").putData("name", "adidas外套"));
    list.add(new CommonMap<String, Object>().putData("searchkey", "adidas裤子").putData("name", "adidas裤子"));
    bulkAddDoc(commonEntity);
}
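The bulk example relies on two project helpers that the article does not show: CommonMap with its chainable putData, and SearchTools.mapToObjectGroup. A plausible reading is that the first is a map with a fluent put, and the second flattens a map into the alternating key/value array accepted by IndexRequest.source(XContentType, Object...). The sketch below is written under those assumptions; both implementations are guesses, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the project's CommonMap and SearchTools helpers.
public class BulkHelpersSketch {

    /** Assumed shape of CommonMap: a HashMap whose put can be chained. */
    public static class CommonMap<K, V> extends HashMap<K, V> {
        public CommonMap<K, V> putData(K key, V value) {
            put(key, value);
            return this;
        }
    }

    /** Assumed behavior of mapToObjectGroup: {k1: v1, k2: v2} -> [k1, v1, k2, v2]. */
    public static Object[] mapToObjectGroup(Map<String, Object> map) {
        List<Object> flat = new ArrayList<>(map.size() * 2);
        for (Map.Entry<String, Object> entry : map.entrySet()) {
            flat.add(entry.getKey());
            flat.add(entry.getValue());
        }
        return flat.toArray();
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new CommonMap<String, Object>()
                .putData("searchkey", "小米手机")
                .putData("name", "小米(MI)");
        // Prints the two key/value pairs as one flat array (HashMap order is not guaranteed).
        System.out.println(Arrays.toString(mapToObjectGroup(doc)));
    }
}
```

Because CommonMap extends HashMap, an instance can be added directly to the List<Map<String, Object>> used in the main method above.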
3. Query the results
GET product_completion_index/_search
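III. Product search with Chinese and pinyin autocompletion
1. Concepts
Term suggester: tokenizes the input text and offers suggestions for each token.
Phrase suggester: builds on the term suggester and also takes the relationship between multiple terms into account.
Completion suggester: designed mainly for autocompletion ("auto completion") scenarios.
Context suggester: a completion suggester that can filter or boost suggestions by context.
For example, the following request asks the completion suggester for entries in searchkey that start with 小米: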
GET product_completion_index/_search
{
"from": 0,
"size": 100,
"suggest": {
"czbk-suggest": {
"prefix": "小米",
"completion": {
"field": "searchkey",
"size": 20,
"skip_duplicates": true
}
}
}
}
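To make the semantics of prefix, size, and skip_duplicates concrete, here is a small in-memory imitation over the corpus indexed earlier. It is only a conceptual sketch: Elasticsearch serves completion suggestions from a specialized in-memory structure and applies its own ordering, so neither the implementation nor the result order below should be read as what the server does. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Conceptual imitation of a prefix completion with de-duplication and a size cap.
public class PrefixCompletionSketch {

    public static List<String> complete(List<String> corpus, String prefix, int size) {
        // A LinkedHashSet drops duplicates (skip_duplicates) and keeps insertion order.
        Set<String> matches = new LinkedHashSet<>();
        for (String entry : corpus) {
            if (matches.size() >= size) {
                break; // honor the requested size
            }
            if (entry.startsWith(prefix)) {
                matches.add(entry);
            }
        }
        return new ArrayList<>(matches);
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "小米手机", "小米11", "小米电视", "小米9", "小米手机", "小米手环",
                "小米笔记本", "小米摄像头", "adidas男鞋", "adidas女鞋", "adidas外套", "adidas裤子");
        // prints [小米手机, 小米11, 小米电视, 小米9, 小米手环, 小米笔记本, 小米摄像头]
        System.out.println(complete(corpus, "小米", 20));
    }
}
```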
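2. Chinese autocompletion in Java
import java.util.ArrayList;
import java.util.List;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.ScoreSortBuilder;
import org.elasticsearch.search.sort.SortOrder;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
import org.elasticsearch.search.suggest.completion.CompletionSuggestionBuilder;
// CollectionUtils is assumed to be org.springframework.util.CollectionUtils

/*
 * @Description: Autocompletion: suggest likely words or phrases from the user's input
 * @Method: cSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: java.util.List<java.lang.String>
 * >>>>>>>>>>>> Outline of the approach >>>>>>>>>>>>>
 * 1. Define the search request
 * 2. Define the search source (sorted by score)
 * 3. Define the completion suggestion builder (using the front-end parameters)
 * 4. Add the completion suggestion builder to the search source
 * 5. Add the search source to the search request
 * 6. Extract the suggested values from the response
 */
public static List<String> cSuggest(CommonEntity commonEntity) throws Exception {
    // Return value
    List<String> suggestList = new ArrayList<>();
    // Build the search request
    SearchRequest searchRequest = new SearchRequest(commonEntity.getIndexName());
    // Sort by score, descending
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.sort(new ScoreSortBuilder().order(SortOrder.DESC));
    // Build the completion suggestion on the target field
    CompletionSuggestionBuilder completionSuggestionBuilder = new CompletionSuggestionBuilder(commonEntity.getSuggestFileld());
    // Text typed by the user
    completionSuggestionBuilder.prefix(commonEntity.getSuggestValue());
    // Drop duplicate suggestions
    completionSuggestionBuilder.skipDuplicates(true);
    // Number of suggestions to return
    completionSuggestionBuilder.size(commonEntity.getSuggestCount());
    // "common-suggest" is the name under which the suggestions are returned; it can be hard-coded
    searchSourceBuilder.suggest(new SuggestBuilder().addSuggestion("common-suggest", completionSuggestionBuilder));
    searchRequest.source(searchSourceBuilder);
    // Execute the search
    SearchResponse suggestResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the completion suggestion from the response
    CompletionSuggestion completionSuggestion = suggestResponse.getSuggest().getSuggestion("common-suggest");
    List<CompletionSuggestion.Entry.Option> optionsList = completionSuggestion.getEntries().get(0).getOptions();
    // Collect the suggested texts
    if (!CollectionUtils.isEmpty(optionsList)) {
        optionsList.forEach(item -> suggestList.add(item.getText().toString()));
    }
    return suggestList;
}
public static void main(String[] args) throws Exception {
    // Autocompletion
    CommonEntity suggestEntity = new CommonEntity();
    suggestEntity.setIndexName("product_completion_index"); // index name
    suggestEntity.setSuggestFileld("searchkey"); // field to complete against
    suggestEntity.setSuggestValue("小米"); // text typed by the user
    suggestEntity.setSuggestCount(5); // number of suggestions to return
    System.out.println(cSuggest(suggestEntity));
    // Result: [小米11, 小米9, 小米手机, 小米手环, 小米摄像头]
    // The duplicate 小米手机 entry is returned only once
}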
3. Pinyin autocompletion in Java
CommonEntity suggestEntity = new CommonEntity();
suggestEntity.setIndexName("product_completion_index"); // index name
suggestEntity.setSuggestFileld("searchkey"); // field to complete against
suggestEntity.setSuggestCount(5); // number of suggestions to return
// (1) Full pinyin
suggestEntity.setSuggestValue("xiaomi"); // text typed by the user
System.out.println(cSuggest(suggestEntity));
// Result: [小米11, 小米9, 小米摄像头, 小米电视, 小米笔记本]
// (2) Full pinyin, separated by a space
suggestEntity.setSuggestValue("xiao mi");
System.out.println(cSuggest(suggestEntity));
// Result: [小米11, 小米9, 小米摄像头, 小米电视, 小米笔记本]
// (3) Initial letters only
suggestEntity.setSuggestValue("xm");
System.out.println(cSuggest(suggestEntity));
// Result: [小米11, 小米9, 小米摄像头, 小米电视, 小米笔记本]
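Pinyin input can match because the pinyin filter configured earlier indexes several forms of each token: the full pinyin of every character (keep_full_pinyin), the string of initials (keep_first_letter), and the original text (keep_original). The toy sketch below illustrates the kinds of terms involved. It uses a four-character hand-written lookup table instead of the plugin, and the order and exact set of terms the real plugin emits may differ.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of the term variants a pinyin filter can produce; not the plugin's code.
public class PinyinTermsSketch {

    // Hand-written table covering only the characters used in this example.
    private static final Map<Character, String> PINYIN = new LinkedHashMap<>();
    static {
        PINYIN.put('小', "xiao");
        PINYIN.put('米', "mi");
        PINYIN.put('手', "shou");
        PINYIN.put('机', "ji");
    }

    public static Set<String> terms(String token) {
        Set<String> out = new LinkedHashSet<>(); // a set, in the spirit of remove_duplicated_term
        StringBuilder initials = new StringBuilder();
        for (char c : token.toCharArray()) {
            String pinyin = PINYIN.get(c);
            if (pinyin == null) {
                continue; // characters outside the toy table are skipped
            }
            out.add(pinyin);                    // keep_full_pinyin
            initials.append(pinyin.charAt(0));
        }
        if (initials.length() > 0) {
            out.add(initials.toString());       // keep_first_letter
        }
        out.add(token);                         // keep_original
        return out;
    }

    public static void main(String[] args) {
        // prints [xiao, mi, xm, 小米]
        System.out.println(terms("小米"));
    }
}
```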
IV. Language processing (spelling correction)
1. Example
GET product_completion_index/_search
{
"suggest": {
"common-suggestion": {
"text": "adidaas男鞋",
"phrase": {
"field": "name",
"size": 13
}
}
}
}
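2. Spelling correction in Java
import java.util.List;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.ScoreSortBuilder;
import org.elasticsearch.search.sort.SortOrder;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.phrase.PhraseSuggestion;
import org.elasticsearch.search.suggest.phrase.PhraseSuggestionBuilder;
// CollectionUtils is assumed to be org.springframework.util.CollectionUtils

/*
 * @Description: Spelling correction
 * @Method: pSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: java.lang.String
 * >>>>>>>>>>>> Outline of the approach >>>>>>>>>>>>>
 * 1. Define the search request
 * 2. Define the search source (sorted by score)
 * 3. Define the phrase suggestion builder (using the front-end parameters)
 * 4. Add the phrase suggestion builder to the search source
 * 5. Add the search source to the search request
 * 6. Extract the corrected text from the response
 */
public static String pSuggest(CommonEntity commonEntity) throws Exception {
    // Return value; stays empty when there is no suggestion
    String pSuggestString = "";
    // Build the search request
    SearchRequest searchRequest = new SearchRequest(commonEntity.getIndexName());
    // Build the search source
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Sort by score, descending
    searchSourceBuilder.sort(new ScoreSortBuilder().order(SortOrder.DESC));
    // Build the phrase suggestion on the target field
    PhraseSuggestionBuilder pSuggestionBuilder = new PhraseSuggestionBuilder(commonEntity.getSuggestFileld());
    // Text to be corrected
    pSuggestionBuilder.text(commonEntity.getSuggestValue());
    // Only the best suggestion is needed
    pSuggestionBuilder.size(1);
    searchSourceBuilder.suggest(new SuggestBuilder().addSuggestion("common-suggest", pSuggestionBuilder));
    searchRequest.source(searchSourceBuilder);
    // Execute the search
    SearchResponse suggestResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the phrase suggestion from the response
    PhraseSuggestion phraseSuggestion = suggestResponse.getSuggest().getSuggestion("common-suggest");
    // Options for the first (and only) entry
    List<PhraseSuggestion.Entry.Option> optionsList = phraseSuggestion.getEntries().get(0).getOptions();
    // Take the top option and strip the spaces inserted between tokens
    if (!CollectionUtils.isEmpty(optionsList) && optionsList.get(0).getText() != null) {
        pSuggestString = optionsList.get(0).getText().string().replaceAll(" ", "");
    }
    return pSuggestString;
}
public static void main(String[] args) throws Exception {
    CommonEntity suggestEntity = new CommonEntity();
    suggestEntity.setIndexName("product_completion_index"); // index name
    suggestEntity.setSuggestFileld("name"); // field to check the spelling against
    suggestEntity.setSuggestValue("adidaas男鞋"); // misspelled input
    System.out.println(pSuggest(suggestEntity)); // Result: adidas男鞋
}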
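The intuition behind the correction is edit distance: "adidaas" is one deletion away from the indexed term "adidas", so it is a cheap candidate. The sketch below computes Levenshtein distance and picks the closest word from a small vocabulary. It is only an illustration of that idea. The phrase suggester itself generates candidates per term and then scores whole phrases, which this example does not attempt; the class and method names are invented.

```java
import java.util.Arrays;
import java.util.List;

// Illustration of edit-distance-based correction; not the phrase suggester's algorithm.
public class EditDistanceSketch {

    /** Classic Levenshtein distance using two rolling rows. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the vocabulary word closest to input, or null if none is within maxEdits. */
    public static String closest(List<String> vocabulary, String input, int maxEdits) {
        String best = null;
        int bestDistance = maxEdits + 1;
        for (String word : vocabulary) {
            int d = distance(input, word);
            if (d < bestDistance) {
                best = word;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(distance("adidaas", "adidas"));                         // prints 1
        System.out.println(closest(Arrays.asList("adidas", "nike"), "adidaas", 2)); // prints adidas
    }
}
```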
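V. Summary
- Keep a dedicated search-term corpus in its own index, separate from the business index, so that the corpus is easy to maintain and upgrade.
- Use the analyzed input and any other search conditions to query the corpus and return a handful of entries (JD.com returns 13, Taobao/Tmall 10, Baidu 4).
- To improve accuracy, suggestions are usually matched by prefix.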