Advanced Usage of Elasticsearch Analyzers, Part 2 (Stop Words, Pinyin Search)

news2025/10/30 10:04:05


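Stop words
- Introduction
- Stop token filter
  - Custom stop token filter
  - Stop word filtering in built-in analyzers
  - Note: one detail
Pinyin search
- Installation
- Usage
  - Related configuration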

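Stop words

Introduction

Stop words are words that, after tokenization, carry no value for search.

For example: 这里的风景真美 ("The scenery here is really beautiful").

After tokenization, 这里 ("here") and 的 (a possessive particle) contribute little to document search, yet such words occur very frequently. To make search more precise, these words are usually dropped when the index is built.

Common stop word lists can be found on this site:

English: https://www.ranks.nl/stopwords
Chinese: https://www.ranks.nl/stopwords/chinese-stopwords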
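
The idea can be sketched in a few lines of Python. This is only an illustration, not how ES implements it: the token list is an assumed segmentation of the example sentence, and `remove_stop_words` is a hypothetical helper.

```python
# Illustrative only: a stop filter drops tokens that appear in a stop word set.
STOP_WORDS = {"我", "的", "这里", "哪里"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t not in stop_words]


# Assumed segmentation of "这里的风景真美"; a real tokenizer may split differently.
tokens = ["这里", "的", "风景", "真", "美"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Only the tokens that survive the filter would end up in the index, so a search for a stop word alone finds nothing in that field.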

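Stop token filter

ES supports two ways of filtering stop words: a custom stop token filter, or the stop word settings that an analyzer already provides.

Custom stop token filter

Define a custom token filter of type stop and attach it to an analyzer: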

DELETE /my-index-000001
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stop_analyer": {
          "tokenizer": "ik_smart",
          "filter": [
            "stop"
          ]
        }
      },
      "filter": {
        "stop": {
          "type": "stop",
          "stopwords": [
            "我",
            "的",
            "这里",
            "哪里"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "stop_analyer"
      }
    }
  }
}

POST /my-index-000001/_analyze
{
  "field": "content",
  "text": "这里的风景真美"
}


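Stop word filtering in built-in analyzers

The analyzers we commonly use generally include a stop word setting of their own. The standard analyzer and the IK analyzer serve as examples here.

standard analyzer
Stop words are configured through the stopwords parameter of the standard analyzer. Because the standard tokenizer emits each Chinese character as its own token, the list below contains 这 and 里 as separate entries.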

DELETE /my-index-000002

PUT /my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stop_standard": {
          "type":"standard",
          "stopwords": [
            "我",
            "的",
            "这",
            "里"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "stop_standard"
      }
    }
  }
}


POST /my-index-000002/_analyze
{
  "field": "content",
  "text": "这里的风景真美"
}


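ik analyzer

By default the IK analyzer ships with English stop words only; Chinese stop words have to be added manually.

The process is similar to adding a custom dictionary:
go to the config directory of the IK analyzer,
then edit IKAnalyzer.cfg.xml to register the custom stop word dictionary.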

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Configure your own extension dictionary here -->
        <entry key="ext_dict">custom-dict.dic</entry>
        <!-- Configure your own extension stop word dictionary here -->
        <entry key="ext_stopwords">custom-stop.dic</entry>
        <!-- Configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Configure a remote extension stop word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

Write the stop words you need into custom-stop.dic, then restart ES.
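
An IK dictionary file is plain text with one word per line, saved as UTF-8. A minimal sketch of generating such a file in Python might look like the following. The helper name `write_stop_dict` and the temporary output path are assumptions for illustration; the real file belongs in the IK config directory next to IKAnalyzer.cfg.xml.

```python
import os
import tempfile


def write_stop_dict(path, words):
    """Write one stop word per line as UTF-8 text."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for word in words:
            f.write(word + "\n")


# Hypothetical location; in practice, write into the IK config directory.
dict_path = os.path.join(tempfile.mkdtemp(), "custom-stop.dic")
write_stop_dict(dict_path, ["我", "的", "这里", "哪里"])

# Read the file back to confirm its contents.
with open(dict_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```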
Verify the result:

POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "这里的风景真美"
}


Note: one detail

The IK analyzer filters stop words inside the IK tokenizer itself:
```
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "this is boy"
}


POST /_analyze
{
  "tokenizer": "ik_smart",
  "text": "this is boy"
}
```
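Both requests above return results with the stop words removed, which shows that IK filters stop words at the tokenizer level.

By contrast, the stopwords parameter of the standard analyzer is applied by a token filter: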

POST /_analyze
{
  "tokenizer": {
    "type":"standard",
    "stopwords": [
            "this",
            "is"
          ]
  },
  "text": "this is boy"
}

In the result of the request above, the stopwords setting has no effect. This shows that stopwords is not applied at the tokenizer stage.
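
The difference between the two stages can be pictured with a toy pipeline in Python. It is a conceptual model only, not ES code: `toy_tokenizer`, `stop_filter`, and `analyze` are hypothetical names, and whitespace splitting stands in for the standard tokenizer.

```python
def toy_tokenizer(text):
    """Stand-in for a tokenizer: split on whitespace and do nothing else."""
    return text.split()


def stop_filter(stop_words):
    """Build a token filter that drops the given stop words."""
    stop_set = set(stop_words)
    return lambda tokens: [t for t in tokens if t not in stop_set]


def analyze(text, tokenizer, filters=()):
    """Run the tokenizer first, then each token filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


# Tokenizer alone: the stop words survive.
tokenizer_only = analyze("this is boy", toy_tokenizer)
# Tokenizer followed by a stop filter: the stop words are removed.
with_filter = analyze("this is boy", toy_tokenizer, [stop_filter(["this", "is"])])
print(tokenizer_only, with_filter)
```

In ES, the corresponding step would be a stop token filter listed under `filter`, rather than a parameter on the tokenizer.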

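Pinyin search

Pinyin search requires installing the pinyin analysis plugin.

Project page: https://github.com/infinilabs/analysis-pinyin

Plugin downloads: https://release.infinilabs.com/analysis-pinyin/stable/

Installation

Download the archive that matches your ES version exactly.
This article uses elasticsearch-analysis-pinyin-7.10.2.zip as an example.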

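# Go to the ES plugins directory
cd es/plugins
# Create a directory for the pinyin plugin and enter it
mkdir pinyin
cd pinyin
# Unzip the pinyin analyzer here (assumes the zip file was placed in this directory)
unzip elasticsearch-analysis-pinyin-7.10.2.zip
# Go to es/bin and restart ES (stop the running node first)
cd ../../bin
./elasticsearch -d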

Usage

POST /_analyze
{
  "analyzer": "pinyin",
  "text":"北京大学"
}

As shown above, 北京大学 is split into bei, jing, da, xue, and bjdx.
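
How those tokens relate to the input can be sketched in Python. This does not use the plugin: the four-entry `PINYIN` table is a hand-written assumption covering only this example, and `toy_pinyin_tokens` is a hypothetical helper that mimics the output listed above (full pinyin per character plus a first-letter abbreviation). Token order and positions in the real plugin are not modeled.

```python
# Hand-written mapping for this example only; the real plugin uses its own dictionary.
PINYIN = {"北": "bei", "京": "jing", "大": "da", "学": "xue"}


def toy_pinyin_tokens(text):
    """Return the pinyin of each character, followed by the first-letter abbreviation."""
    syllables = [PINYIN[ch] for ch in text]
    abbreviation = "".join(s[0] for s in syllables)
    return syllables + [abbreviation]


tokens = toy_pinyin_tokens("北京大学")
print(tokens)
```

Indexing both forms is what lets a query such as bei or bjdx match a document that contains 北京大学.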