Storage and Querying in Elasticsearch

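In search systems, storing data and querying it are the two most fundamental and critical steps. Elasticsearch (ES) is heavily optimized for both, which sets it apart from relational and other non-relational databases and makes it a common first choice for search systems.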

Mapping

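- A mapping is index metadata that describes the structure and data types of the fields in a document. It can be inferred automatically when documents are indexed (dynamic mapping) or defined explicitly. A mapping records each field's name and type, whether the field is indexed (searchable), how it is analyzed, and similar attributes. Mappings are defined at the index level.
- ES implements dynamic mapping. Suppose the following document is written to an index: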

{
    "name":"jack",
    "age":18,
    "birthDate": "1991-10-05"
}

With dynamic mapping, name is mapped to text (with a keyword sub-field), age to long, and birthDate to date. The inference rules are as follows:

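JSON value	Field type inferred by dynamic mapping
true, false	boolean
123, 456, 876	long
123.43, 234.534	float
String that passes date detection, e.g. "2022-05-15"	date
Any other string, e.g. "Hello Elasticsearch"	text (with a keyword sub-field)
Array	the type of the first non-null element; ES has no dedicated array type
JSON object	object; each inner property is mapped according to its own content

The following types are not inferred dynamically and must be declared in an explicit mapping:

Field content	Field type
IPv4 or IPv6 address	ip
Base64-encoded string	binary
Array of objects whose inner objects must be queryable independently	nested
Latitude/longitude pair	geo_point
Coordinates of arbitrary geometries other than single points	geo_shape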
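
The inference rules above can be sketched in a few lines of Python. This is only an approximation for illustration: `infer_field_type` is a hypothetical helper, `datetime.fromisoformat` stands in for ES date detection (which uses its own list of date formats), and the sketch assumes the default behaviour in which JSON floating-point numbers become float and plain strings become text.

```python
from datetime import datetime


def infer_field_type(value):
    """Approximate the default dynamic-mapping rule for one JSON value."""
    if value is None:
        return None                        # null values add no field
    if isinstance(value, bool):            # check bool first: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        first = next((v for v in value if v is not None), None)
        return infer_field_type(first)
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # stand-in for ES date detection
            return "date"
        except ValueError:
            return "text"                  # real ES also adds a keyword sub-field
    raise TypeError(f"unsupported JSON value: {value!r}")


doc = {"name": "jack", "age": 18, "birthDate": "1991-10-05"}
inferred = {field: infer_field_type(value) for field, value in doc.items()}
print(inferred)
```

Types such as ip, binary, nested, geo_point, and geo_shape never come out of this function, which mirrors the fact that they have to be mapped explicitly.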

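Mapping field types

The field type is a mapping attribute that describes the data type of a field in a document. ES supports many field types, including text, numbers, dates, and booleans. Each type has its own characteristics and limits, so choosing the right one matters for both query performance and storage footprint.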

Category	Subcategory	Types
Core types	String	~~string~~, text, keyword
	Integer	integer, long, short, byte
	Floating point	double, float, half_float, scaled_float
	Boolean	boolean
	Date	date
	Range	range (integer_range, long_range, date_range, …)
	Binary	binary (Base64-encoded binary)
Composite types	Array	array
	Object	object
	Nested	nested
Geo types	Geo point	geo_point
	Geo shape	geo_shape
Special types	IP	ip
…	…	…

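String types
- text: the string type is no longer supported after ES 5.x; text and keyword replace it
  - Subtypes
    - text: for fields that need full-text search
    - match_only_text: a space-optimized variant of text that disables scoring; well suited to indexing log messages
  - Common parameters of text
    - analyzer: the analyzer applied to the field's strings at index time and at search time (search_analyzer overrides it for searches). Defaults to the index's default analyzer, or the standard analyzer
    - search_analyzer: the analyzer applied to this field at search time; defaults to the value of analyzer
    - boost: the weight applied when a query matches this field; accepts a float and defaults to 1.0. As a mapping parameter it is deprecated in recent versions, where a query-time boost is preferred
    - fields: indexes the same string value in several ways for different purposes (multi-fields)
    - index: whether the field is searchable; defaults to true
    - eager_global_ordinals: loads global ordinals eagerly at refresh time. Worth enabling on fields that are used often in terms aggregations, at some cost to indexing speed
    - fielddata: whether the field may use in-memory fielddata for sorting, aggregations, or scripting. Fielddata can consume a lot of heap, so when these features are needed it is better to use a keyword field instead
    - similarity: the relevance scoring algorithm; defaults to BM25
- keyword
  - Subtypes
    - keyword: for structured content such as IDs and email addresses; values are not analyzed, so the field only supports exact matching
    - constant_keyword: every document in the index must have the same value for this field; often used for things like version numbers
    - wildcard: mainly for unstructured, machine-generated content; optimized for large values and wildcard-style pattern matching, and often used for logs
  - Common parameters of keyword
    - ignore_above: strings longer than this value are not indexed, but they remain in _source
    - Other parameters: largely the same as text, except that keyword has no analyzer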


## Example of the fields parameter
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart",
      "similarity": "custom_similarity",
      "doc_values": false,
      "fields": {
        "title_smart": {
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart",
          "similarity": "custom_similarity",
          "doc_values": false
        }
      }
    }
  }
}

# Custom similarity (scoring) definition, placed in the index settings
"similarity": {
  "custom_similarity": {
    "type": "BM25",
    "b": 0.9,
    "k1": 1.2
  }
}

Numeric types

Type	Description
long	A signed 64-bit integer with a minimum value of -2^63 and a maximum value of 2^63-1.
integer	A signed 32-bit integer with a minimum value of -2^31 and a maximum value of 2^31-1.
short	A signed 16-bit integer with a minimum value of -32,768 and a maximum value of 32,767.
byte	A signed 8-bit integer with a minimum value of -128 and a maximum value of 127.
double	A double-precision 64-bit IEEE 754 floating point number, restricted to finite values.
float	A single-precision 32-bit IEEE 754 floating point number, restricted to finite values.
half_float	A half-precision 16-bit IEEE 754 floating point number, restricted to finite values.
scaled_float	A floating point number that is backed by a long, scaled by a fixed double scaling factor.
unsigned_long	An unsigned 64-bit integer with a minimum value of 0 and a maximum value of 2^64-1.

Range types

The difference between integer_range and integer: if the field holds a single integer value, use integer; if it holds a range of integers, use integer_range.

Range Type	Description
integer_range	A range of signed 32-bit integers with a minimum value of -2^31 and maximum of 2^31-1.
float_range	A range of single-precision 32-bit IEEE 754 floating point values.
long_range	A range of signed 64-bit integers with a minimum value of -2^63 and maximum of 2^63-1.
double_range	A range of double-precision 64-bit IEEE 754 floating point values.
date_range	A range of date values. Date ranges support various date formats through the format mapping parameter. Regardless of the format used, date values are parsed into an unsigned 64-bit integer representing milliseconds since the Unix epoch in UTC. Values containing the now date math expression are not supported.
ip_range	A range of ip values supporting either IPv4 or IPv6 (or mixed) addresses.

# Example document for an integer field
{
  "name": "John",
  "age": 25
}

# Example document for an integer_range field
{
  "name": "John",
  "age": {
    "gte": 20,
    "lte": 30
  }
}

# If my_field is mapped as keyword, a range query compares strings rather than numbers, so the value "3" falls between "21" and "32"
GET /keyword_test/_search
{
  "query": {
    "range": {
      "my_field": {
        "gte": 21,
        "lt": 32
      }
    }
  }
}

"hits" : [
  {
    "_index" : "keyword_test",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 1.0,
    "_source" : {
      "my_field" : "3"
    }
  }
]
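
The surprising hit above follows directly from lexicographic ordering. The sketch below reproduces it with plain Python comparisons; the two helper functions and the sample values are hypothetical and only illustrate the idea (ES compares keyword terms by their bytes, which agrees with Python's string ordering for ASCII digits).

```python
def in_range_numeric(value, gte, lt):
    # Numeric field: compare the parsed numbers.
    return gte <= float(value) < lt


def in_range_keyword(value, gte, lt):
    # keyword field: bounds and values are compared as strings.
    return str(gte) <= str(value) < str(lt)


values = ["3", "25", "100"]
numeric_hits = [v for v in values if in_range_numeric(v, 21, 32)]
keyword_hits = [v for v in values if in_range_keyword(v, 21, 32)]
print(numeric_hits, keyword_hits)
```

Mapping the field as a numeric type is the straightforward way to get numeric range semantics.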

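Object type
- An object field is simply a JSON object. JSON allows objects to be nested, so a document can contain several objects nested several levels deep. Lucene, however, has no concept of inner objects, so the JSON document is flattened

# Index a document like this one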
PUT my-index-000001/_doc/1
{
  "region": "US",
  "manager": {
    "age":     30,
    "name": {
      "first": "John",
      "last":  "Smith"
    }
  }
}

# Internally the document is stored as flat key-value pairs
{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}
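
A minimal sketch of that flattening step, assuming a hypothetical recursive helper named `flatten` (this illustrates the idea and is not how ES implements it internally):

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))   # recurse into inner objects
        else:
            flat[path] = value
    return flat


doc = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
flat_doc = flatten(doc)
print(flat_doc)
```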

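# An explicit mapping for this document could look like the following (dynamic mapping would instead infer text with a keyword sub-field for the strings and long for age)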
{
  "mappings": {
    "properties": {
      "region": {
        "type": "keyword"
      },
      "manager": {
        "properties": {
          "age":  { "type": "integer" },
          "name": {
            "properties": {
              "first": { "type": "text" },
              "last":  { "type": "text" }
            }
          }
        }
      }
    }
  }
}

# At search time, inner fields are addressed by their full dotted path
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "manager.name.first": "John"
          }
        }
      ]
    }
  }
}

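Nested type
- nested is a specialized form of the object type. It allows an array of objects to be indexed so that each object in the array can be queried independently

# Lucene has no concept of inner objects, so with the default object mapping the following document is flattened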
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

# It is stored roughly like this
{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

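# The following query matches the document even though no user is named Alice Smith, because flattening lost the link between each first and last name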
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

# Map the field as nested instead
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested"
      }
    }
  }
}

# Index the same document
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

# Searching with a nested query returns no hits, because no single user object has first=Alice and last=Smith
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }}
          ]
        }
      }
    }
  }
}

# In ES, each nested object is indexed as a separate hidden document that shares the id of its root document
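
The difference between the two mappings can be simulated without a cluster. In the sketch below, `matches_flattened` and `matches_nested` are hypothetical helpers: the first checks each field against the pooled values of all user objects (the default object behaviour), the second requires a single user object to satisfy both conditions (the nested behaviour).

```python
doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}


def matches_flattened(doc, first, last):
    # object mapping: values of all inner objects are pooled per field
    firsts = {u["first"].lower() for u in doc["user"]}
    lasts = {u["last"].lower() for u in doc["user"]}
    return first.lower() in firsts and last.lower() in lasts


def matches_nested(doc, first, last):
    # nested mapping: both conditions must hold within the same inner object
    return any(
        u["first"].lower() == first.lower() and u["last"].lower() == last.lower()
        for u in doc["user"]
    )


print(matches_flattened(doc, "Alice", "Smith"))
print(matches_nested(doc, "Alice", "Smith"))
```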

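Vector type
- Vectors of up to 2048 dimensions are supported (the exact limit depends on the ES version). In the versions these notes target, a dense_vector field cannot be queried, sorted, or aggregated directly and can only be used through script functions; newer 8.x releases also offer kNN search on indexed vectors

# Define the vector field and its number of dimensions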
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}

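# Score documents by cosine similarity; the field name must match the dense_vector field in the mapping (my_vector here)
# query_embedding is a placeholder for the query's 768-dimensional vector
{
  "query": {
    "bool": {
      "must": {
        "function_score": {
          "functions": [
            {
              "script_score": {
                "script": {
                  "source": "cosineSimilarity(params.queryVector, 'my_vector') + 2.0",
                  "params": {
                    "queryVector": query_embedding
                  }
                }
              }
            }
          ]
        }
      }
    }
  }
}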
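
For intuition about what the script computes, here is a plain-Python sketch of the same score. The function names are hypothetical and the two-dimensional vectors are toy inputs; the `+ 2.0` offset comes from the script above and keeps the score positive, since cosine similarity lies in [-1, 1].

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score(query_vector, doc_vector):
    # Mirrors "cosineSimilarity(params.queryVector, <field>) + 2.0"
    return cosine_similarity(query_vector, doc_vector) + 2.0


print(script_score([1.0, 0.0], [1.0, 0.0]))   # identical direction
print(script_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(script_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```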

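Queries

Analyzers
- A string to be analyzed is passed through an analyzer, which turns it into a sequence of terms used for indexing and searching
- An analyzer is not the same thing as a tokenizer; the tokenizer is only one part of the analyzer
- analyzer = [char_filter] + tokenizer + [token filter]
  - char filter: preprocesses the input characters, for example by stripping HTML tags; ES ships built-in character filters
  - tokenizer: splits the text into tokens, for example on whitespace (whitespace) or with the standard tokenizer (standard); ES ships built-in tokenizers. The built-in analyzers of the same names behave as follows:
    - standard: the default analyzer in Elasticsearch; splits text into terms on word boundaries (whitespace and punctuation), drops punctuation, and lowercases
    - simple: splits text into terms at any non-letter character, drops digits and punctuation, and lowercases
    - whitespace: splits text into terms at whitespace, with no filtering and no lowercasing
    - keyword: does not split the text at all; often used for keyword-like or exact-match fields
  - filter (token filter): filters and transforms the tokens, for example by lowercasing or removing stop words. A token that has passed through the filters is called a term; ES ships built-in token filters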

POST _analyze
{
  "analyzer": "simple",
  "text": ["HI 111 , 哈哈"]
}

# Analysis result
{
  "tokens" : [
    {
      "token" : "hi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "哈哈",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    }
  ]
}
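
The result above can be approximated in Python. The sketch assumes that "letter" means what the regex class `[^\W\d_]` matches (word characters that are neither digits nor underscores); `simple_analyze` is a hypothetical helper, not an ES API, and it omits the type attribute from the real response.

```python
import re

# Runs of letters: word characters excluding digits and the underscore.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text):
    """Split at non-letters and lowercase, roughly like the simple analyzer."""
    return [
        {
            "token": m.group().lower(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": position,
        }
        for position, m in enumerate(LETTER_RUN.finditer(text))
    ]


tokens = simple_analyze("HI 111 , 哈哈")
for token in tokens:
    print(token)
```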

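match query
- Supports both full-text search and exact matching, depending on whether the field is analyzed for full-text search
- operator: defaults to or; setting it to and requires every query term to match
- minimum_should_match: the minimum number (or percentage) of query terms that must match for a document to be considered relevant

# Full-text search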
GET job_item_profile/_search
{
  "query": {
    "match": {
      "job_name": "java 工程师"
    }
  }
}

# Exact match
# For exact-value lookups a filter clause can replace the query clause, because filter results can be cached
GET job_item_profile/_search
{
  "query": {
    "match": {
      "edu_level": "5000"
    }
  }
}

# operator
GET job_item_profile/_search
{
  "query": {
    "match": {
      "job_name": {
        "query": "java 工程师",
        "operator": "and"
      }
    }
  }
}

# minimum_should_match
GET job_item_profile/_search
{
  "query": {
    "match": {
      "job_name": {
        "query": "java 工程师",
        "minimum_should_match": "2"
      }
    }
  }
}
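
How operator and minimum_should_match decide whether a document matches can be sketched over plain term sets. Everything here is an assumption for illustration: `match` is a hypothetical helper, scoring is ignored, and the query "java 工程师" is assumed to analyze into the two terms java and 工程师, which depends on the analyzer actually configured.

```python
def match(doc_terms, query_terms, operator="or", minimum_should_match=1):
    """Return True if the document's terms satisfy the match conditions."""
    unique_terms = set(query_terms)
    hits = sum(1 for term in unique_terms if term in doc_terms)
    if operator == "and":
        return hits == len(unique_terms)     # every query term is required
    return hits >= minimum_should_match      # "or": enough terms must match


query_terms = ["java", "工程师"]             # assumed analysis output
partial_doc = {"java", "开发"}
full_doc = {"java", "工程师"}

print(match(partial_doc, query_terms))                          # default or
print(match(partial_doc, query_terms, operator="and"))
print(match(partial_doc, query_terms, minimum_should_match=2))
print(match(full_doc, query_terms, minimum_should_match=2))
```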

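multi_match query
- Searches several fields at once; each field can be given its own weight, so documents that match the more heavily weighted fields rank higher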

GET job_item_profile/_search
{
  "query": {
    "multi_match": {
      "query": "red",
      "fields": ["job_name^2.0", "company_name^1.0"]
    }
  }
}

range query
- Range operators: gt (greater than), gte (greater than or equal to), lt (less than), lte (less than or equal to)

GET job_item_profile/_search
{
  "query": {
    "range": {
      "salary_max": {
        "gt": 4,
        "lt": 8
      }
    }
  }
}

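term query
- A term query looks up the exact term in the inverted index. The query text is not analyzed; it is matched against the inverted index as-is. If the field's type is text, a term query may therefore find nothing
- A term query does not normalize case either, whereas the values of a text field were run through the analyzer at index time, which typically lowercases them
terms query
- A terms query works like a term query but accepts several values; a document matches if the field contains any one of them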

GET job_item_profile/_search
{
  "query": {
    "terms": {
      "edu_level": [5000, 6000]
    }
  }
}
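
Why a term query can miss on a text field is easier to see with a toy inverted index. The sketch below is purely illustrative: `analyze` is a stand-in analyzer that splits on whitespace and lowercases, and `index_doc`, `term_query`, and `match_query` are hypothetical helpers.

```python
inverted_index = {}   # term -> set of document ids


def analyze(text):
    # Stand-in analyzer: whitespace split plus lowercasing.
    return text.lower().split()


def index_doc(doc_id, text):
    for term in analyze(text):
        inverted_index.setdefault(term, set()).add(doc_id)


def term_query(term):
    # No analysis: the term is looked up exactly as given.
    return inverted_index.get(term, set())


def match_query(text):
    # The query text is analyzed first, then each term is looked up.
    hits = set()
    for term in analyze(text):
        hits |= inverted_index.get(term, set())
    return hits


index_doc(1, "Java Engineer")
print(term_query("Java"))    # misses: the index holds "java"
print(term_query("java"))
print(match_query("Java"))
```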

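match_phrase
- Phrase query: the query text is analyzed, but with the default slop of 0 the resulting terms must appear in the document in the same order and next to each other. Searching for "java 专家" matches documents whose job_name contains the phrase "java 专家"; unlike match, it does not match on individual terms, so it will not return values such as "java 技术专家"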

GET job_item_profile/_search
{
  "query": {
    "match_phrase": {
      "job_name": "java 专家"
    }
  }
}
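
The adjacency requirement can be sketched as a search for a contiguous run of tokens. The token lists below are assumptions, since the real tokens depend on the analyzer, and `phrase_match` is a hypothetical helper that models only the default slop of 0.

```python
def phrase_match(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur in doc_tokens contiguously and in order."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


phrase = ["java", "专家"]
print(phrase_match(["高级", "java", "专家"], phrase))
print(phrase_match(["java", "技术", "专家"], phrase))   # a token sits in between
```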

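Compound queries
- Compound queries are built with the bool query, which has four clause types: must, must_not, should, and filter
- must: the document must match the clause
- must_not: the document must not match the clause
- should: matching the clause raises the document's relevance score
- query DSL: structured queries check how well the content matches the condition; the bool and match clauses used for content queries compute a relevance score for every document
- filter DSL: structured filters only decide whether a document matches; the term and range clauses used for filtering drop non-matching documents without affecting the score, and ES can cache frequently used filters automatically to speed things up
- As a rule of thumb, use query clauses for full-text search and anything else that needs relevance scoring, and filter clauses for everything else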
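
As a sketch of that rule of thumb, the request body below puts the full-text condition in must (scored) and the exact-value conditions in filter (not scored). `build_job_query` is a hypothetical helper; the field names come from the earlier examples, and the particular combination of conditions is an assumption made for illustration.

```python
import json


def build_job_query(keywords, edu_levels, min_salary):
    """Build a bool query: full text in must, exact conditions in filter."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"job_name": keywords}}
                ],
                "filter": [
                    {"terms": {"edu_level": edu_levels}},
                    {"range": {"salary_max": {"gte": min_salary}}},
                ],
            }
        }
    }


body = build_job_query("java 工程师", [5000, 6000], 4)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of a `_search` request.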

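Practice

Handling list-valued fields
- text
  - Chinese only, comma-separated: simple
  - English only, possibly containing underscores, space-separated: standard
  - Digits only, space-separated: standard
- array: the general-purpose approach, with type=keyword or double. Note that values must be lowercased manually.
Using the IK analyzer
- Many online articles recommend ik_max_word at index time and ik_smart at search time. In practice, though, the output of ik_smart is not always a subset of the output of ik_max_word, so some searches come back empty. Reference: IK分词器实现原理剖析 —— 一个小问题引发的思考
- One fix is shown below: index the title field with ik_max_word, add a title_smart sub-field indexed with ik_smart, and always analyze user queries with ik_smart. A query term then always has a matching term in the index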

{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart",
      "similarity": "custom_similarity",
      "doc_values": false,
      "fields": {
        "title_smart": {
          "type": "text",
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart",
          "similarity": "custom_similarity",
          "doc_values": false
        }
      }
    }
  }
}
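
With this mapping, a search should target both the main field and the sub-field so that either index can produce the hit. A minimal sketch, assuming a hypothetical `build_title_query` helper and a multi_match query; the sub-field is addressed as title.title_smart, and both fields already declare ik_smart as their search_analyzer in the mapping above.

```python
def build_title_query(user_query):
    """Search title (ik_max_word index) and title.title_smart (ik_smart index)."""
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                # The query text is analyzed with each field's search_analyzer,
                # which the mapping sets to ik_smart for both fields.
                "fields": ["title", "title.title_smart"],
            }
        }
    }


body = build_title_query("java 工程师")
print(body)
```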