一分词

1.1 分词

1.1.1 查看分词

standard标准分析器是将每个字都分出来；

而ik_max_word是最细粒度的分词，将所有可能的词都分出来；

ik_smart 是最粗粒度的分词；

ik_smart

优点:特征是粗略快速的将文字进行分词,占用空间小,查询速度快

缺点:分词的颗粒度大,可能跳过一些重要分词,导致查询结果不全面,查全率低。

ik_max_word

优点:特征是详细的文字片段进行分词,查询时查全率高,不容易遗漏数据

缺点:因为分词太过详细,导致有一些无用分词,占用空间较大,查询速度慢standard是ES默认的分词器,"analyzer": "standard"是可以省略的

1.1.2 几种分词比较

1.使用 ik_max_word 分词

### 请求

GET http://localhost:9200/_analyze

{

"text":"中华人民共和国人民大会堂",

"analyzer":"ik_max_word"

}

响应结果：

{

"tokens": [

{

"token": "中华人民共和国",

"start_offset": 0,

"end_offset": 7,

"type": "CN_WORD",

"position": 0

{

"token": "中华人民",

"start_offset": 0,

"end_offset": 4,

"type": "CN_WORD",

"position": 1

{

"token": "中华",

"start_offset": 0,

"end_offset": 2,

"type": "CN_WORD",

"position": 2

{

"token": "华人",

"start_offset": 1,

"end_offset": 3,

"type": "CN_WORD",

"position": 3

{

"token": "人民共和国",

"start_offset": 2,

"end_offset": 7,

"type": "CN_WORD",

"position": 4

{

"token": "人民",

"start_offset": 2,

"end_offset": 4,

"type": "CN_WORD",

"position": 5

{

"token": "共和国",

"start_offset": 4,

"end_offset": 7,

"type": "CN_WORD",

"position": 6

{

"token": "共和",

"start_offset": 4,

"end_offset": 6,

"type": "CN_WORD",

"position": 7

{

"token": "国人",

"start_offset": 6,

"end_offset": 8,

"type": "CN_WORD",

"position": 8

{

"token": "人民大会堂",

"start_offset": 7,

"end_offset": 12,

"type": "CN_WORD",

"position": 9

{

"token": "人民大会",

"start_offset": 7,

"end_offset": 11,

"type": "CN_WORD",

"position": 10

{

"token": "人民",

"start_offset": 7,

"end_offset": 9,

"type": "CN_WORD",

"position": 11

{

"token": "大会堂",

"start_offset": 9,

"end_offset": 12,

"type": "CN_WORD",

"position": 12

{

"token": "大会",

"start_offset": 9,

"end_offset": 11,

"type": "CN_WORD",

"position": 13

{

"token": "会堂",

"start_offset": 10,

"end_offset": 12,

"type": "CN_WORD",

"position": 14

}

]

}

2.使用standard分词器

### 请求

GET http://localhost:9200/_analyze

{

"text":"中华人民共和国人民大会堂",

"analyzer":"standard"

}

响应结果：

{

"tokens": [

{

"token": "中",

"start_offset": 0,

"end_offset": 1,

"type": "<IDEOGRAPHIC>",

"position": 0

{

"token": "华",

"start_offset": 1,

"end_offset": 2,

"type": "<IDEOGRAPHIC>",

"position": 1

{

"token": "人",

"start_offset": 2,

"end_offset": 3,

"type": "<IDEOGRAPHIC>",

"position": 2

{

"token": "民",

"start_offset": 3,

"end_offset": 4,

"type": "<IDEOGRAPHIC>",

"position": 3

{

"token": "共",

"start_offset": 4,

"end_offset": 5,

"type": "<IDEOGRAPHIC>",

"position": 4

{

"token": "和",

"start_offset": 5,

"end_offset": 6,

"type": "<IDEOGRAPHIC>",

"position": 5

{

"token": "国",

"start_offset": 6,

"end_offset": 7,

"type": "<IDEOGRAPHIC>",

"position": 6

{

"token": "人",

"start_offset": 7,

"end_offset": 8,

"type": "<IDEOGRAPHIC>",

"position": 7

{

"token": "民",

"start_offset": 8,

"end_offset": 9,

"type": "<IDEOGRAPHIC>",

"position": 8

{

"token": "大",

"start_offset": 9,

"end_offset": 10,

"type": "<IDEOGRAPHIC>",

"position": 9

{

"token": "会",

"start_offset": 10,

"end_offset": 11,

"type": "<IDEOGRAPHIC>",

"position": 10

{

"token": "堂",

"start_offset": 11,

"end_offset": 12,

"type": "<IDEOGRAPHIC>",

"position": 11

}

]

}

3.使用 ik_smart分词

### 请求

GET http://localhost:9200/_analyze

{

"text":"中华人民共和国人民大会堂",

"analyzer":"ik_smart"

}

响应结果：

{

"tokens": [

{

"token": "中华人民共和国",

"start_offset": 0,

"end_offset": 7,

"type": "CN_WORD",

"position": 0

{

"token": "人民大会堂",

"start_offset": 7,

"end_offset": 12,

"type": "CN_WORD",

"position": 1

}

]

}

https://www.jianshu.com/p/e8e6874799f6

https://www.bilibili.com/read/cv17912145/

1.1.3 入库和查询指定分词器

1.创建或者更新文档时，会对文档进行分词，可以指定分词

创建index mapping时指定search_analyzer

不指定分词时，会使用默认的standard

明确字段是否需要分词，不需要分词的字段将type设置为keyword，可以节省空间和提高写性能。

2.搜索：查询时，对查询语句分词

查询时通过analyzer指定分词器

es的分词器analyzer_51CTO博客_es分词器

1.2 text与keyword类型

1.2.1 两种类型说明

ES5.0及以后的版本取消了string类型，将原先的string类型拆分为text和keyword两种类型。

1.如果字段是text类型，存入的数据会先进行分词，然后将分完词的词组存入索引，但是text类型的数据不能用来过滤、排序和聚合等操作。

2.keyword则不会进行分词，直接存储。常常被用来过滤、排序和聚合。

3.es自动生成的该字段的mapping是text + keyword(es版本7.9.0)。

1.2.2 字段同时具有keyword和text属性

当直接保存一个字符串字段时，es自动生成的该字段的mapping是text + keyword(es版本7.9.0)。

{

"city_info": {

"mappings": {

"properties": {

"address": {

"type": "text",

"fields": {

"keyword": {

"type": "keyword",

"ignore_above": 256

}

"cityName": {

"type": "text",

"fields": {

"keyword": {

"type": "keyword",

"ignore_above": 256

}

1.创建索引

2.直接添加文档

3.查看mapping

1.2.3 给text类型添加keyword属性

如果在创建index的时候给某个字段指定了类型text，但是之后又想给它追加上keyword，方便以后按完整字符串搜索。可以通过PUT命令实现。

例子如下：

es7.x Es常用核心知识快捷版1（分词和text和keyword）

一分词

1.1 分词

1.1.1 查看分词

1.1.2 几种分词比较

1.1.3 入库和查询指定分词器

1.2 text与keyword类型

1.2.1 两种类型说明

1.2.2 字段同时具有keyword和text属性

1.2.3 给text类型添加keyword属性

相关文章

程序员，你被打标签了没？

这个选择排序详解过程我能吹一辈子！！！

chatgpt赋能python：Python操作表格的全面指南

文献阅读-A Survey on Transfer Learning 和 A Survey on Deep Transfer Learning

LRU 该用什么数据结构

chatgpt赋能Python-python_怎么赋值

【第八期】Apache DolphinScheduler 每周 FAQ 集锦

初识Monorepo

21 VueComponent 事件的处理

理解 Linux 文件权限

chatgpt赋能python：Python如何捕捉窗口？——一位有10年Python编程经验的工程师谈Windows操作系统编程

新手须知的pr入门知识，小红书媒介话术分享

惠普HP4294A（110M）安捷伦agilent 4294a精密阻抗分析仪

恭喜，拿到华为OD offer了，并分享刷题经验

Window MinGW 编译 OpenCV 人快疯了看这里！

一篇文章带你了解Netty

DNSPod十问崔久强：证书有效期缩短，CA机构要凉透？

【P35】JMeter 包含控制器（Include Controller）

Goby 漏洞更新｜中保無限Modem Configuration Interface 默认口令漏洞

chatgpt赋能python：Python操作网页的SEO

es7.x Es常用核心知识快捷版1（分词和text和keyword）

一 分词

1.1 分词

1.1.1 查看分词

1.1.2 几种分词比较

1.1.3 入库和查询指定分词器

1.2 text与keyword类型

1.2.1 两种类型说明

1.2.2 字段同时具有keyword和text属性

1.2.3 给text类型添加keyword属性

相关文章

一分词