一 分词
1.1 分词
1.1.1 查看分词
standard标准分析器是将每个字都分出来;
而ik_max_word是最细粒度的分词,将所有可能的词都分出来;
ik_smart 是最粗粒度的分词;
ik_smart
优点:特征是粗略快速的将文字进行分词,占用空间小,查询速度快
缺点:分词的颗粒度大,可能跳过一些重要分词,导致查询结果不全面,查全率低。
ik_max_word
优点:特征是详细的文字片段进行分词,查询时查全率高,不容易遗漏数据
缺点:因为分词太过详细,导致有一些无用分词,占用空间较大,查询速度慢standard是ES默认的分词器,"analyzer": "standard"是可以省略的
1.1.2 几种分词比较
1.使用 ik_max_word 分词
### 请求 GET http://localhost:9200/_analyze { "text":"中华人民共和国人民大会堂", "analyzer":"ik_max_word" } 响应结果: { "tokens": [ { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 }, { "token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1 }, { "token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2 }, { "token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 }, { "token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4 }, { "token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5 }, { "token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6 }, { "token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 }, { "token": "国人", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 8 }, { "token": "人民大会堂", "start_offset": 7, "end_offset": 12, "type": "CN_WORD", "position": 9 }, { "token": "人民大会", "start_offset": 7, "end_offset": 11, "type": "CN_WORD", "position": 10 }, { "token": "人民", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 11 }, { "token": "大会堂", "start_offset": 9, "end_offset": 12, "type": "CN_WORD", "position": 12 }, { "token": "大会", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 13 }, { "token": "会堂", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 14 } ] } |
2.使用standard分词器
### 请求 GET http://localhost:9200/_analyze { "text":"中华人民共和国人民大会堂", "analyzer":"standard" } 响应结果: { "tokens": [ { "token": "中", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 }, { "token": "华", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 }, { "token": "人", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 }, { "token": "民", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 }, { "token": "共", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 }, { "token": "和", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 }, { "token": "国", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }, { "token": "人", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 }, { "token": "民", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 }, { "token": "大", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 }, { "token": "会", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 }, { "token": "堂", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 } ] } |
3.使用 ik_smart分词
### 请求 GET http://localhost:9200/_analyze { "text":"中华人民共和国人民大会堂", "analyzer":"ik_smart" } 响应结果: { "tokens": [ { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 }, { "token": "人民大会堂", "start_offset": 7, "end_offset": 12, "type": "CN_WORD", "position": 1 } ] } |
https://www.jianshu.com/p/e8e6874799f6
https://www.bilibili.com/read/cv17912145/
1.1.3 入库和查询指定分词器
1.创建或者更新文档时,会对文档进行分词,可以指定分词
创建index mapping时指定search_analyzer
不指定分词时,会使用默认的standard
明确字段是否需要分词,不需要分词的字段将type设置为keyword,可以节省空间和提高写性能。
2.搜索:查询时,对查询语句分词
查询时通过analyzer指定分词器
es的分词器analyzer_51CTO博客_es分词器
1.2 text与keyword类型
1.2.1 两种类型说明
ES5.0及以后的版本取消了string类型,将原先的string类型拆分为text和keyword两种类型。
1.如果字段是text类型,存入的数据会先进行分词,然后将分完词的词组存入索引,但是text类型的数据不能用来过滤、排序和聚合等操作。
2.keyword则不会进行分词,直接存储。常常被用来过滤、排序和聚合。
3.es自动生成的该字段的mapping是text + keyword(es版本7.9.0)。
1.2.2 字段同时具有keyword和text属性
当直接保存一个字符串字段时,es自动生成的该字段的mapping是text + keyword(es版本7.9.0)。
{ "city_info": { "mappings": { "properties": { "address": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "cityName": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } } } |
1.创建索引
2.直接添加文档
3.查看mapping
1.2.3 给text类型添加keyword属性
如果在创建index的时候给某个字段指定了类型text,但是之后又想给它追加上keyword,方便以后按完整字符串搜索。可以通过PUT命令实现。
例子如下: