ElasticSearch7.3 学习之定制分词器（Analyzer） - |旧市拾荒| - JOYK Joy of Geek, Geek News, Link all geek

1、默认的分词器

关于分词器，前面的博客已经有介绍了，链接：ElasticSearch7.3 学习之倒排索引揭秘及初识分词器(Analyzer)。这里就只介绍默认的分词器standard analyzer

2、修改分词器的设置

首先自定义一个分词器es_std。启用english停用词token filter

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

接下来开始测试两种不同的分词器，首先是默认的分词器

GET /my_index/_analyze
{
  "analyzer": "standard", 
  "text": "a dog is in the house"
}

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "the",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

可以看到就是简单的按单词进行拆分，在接下来测试上面自定义的一个分词器es_std

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text":"a dog is in the house"
}

{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

可以看到结果只有两个单词了，把停用词都给去掉了。

3、定制化自己的分词器

首先删除掉上面建立的索引

DELETE my_index

然后运行下面的语句。简单说下下面的规则吧，首先去除html标签，把&转换成and，然后采用standard进行分词，最后转换成小写字母及去掉停用词a the，建议读者好好看看，下面我也会对这个分词器进行测试。

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [
            "&=> and"
          ]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [
            "the",
            "a"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "&_to_and"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stopwords"
          ]
        }
      }
    }
  }
}

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my_index"
}

老规矩，测试这个分词器

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
}

结果如下：

{
  "tokens" : [
    {
      "token" : "tomandjerry",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "friend",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "in",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "haha",
      "start_offset" : 42,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

最后我们可以在实际使用时设置某个字段使用自定义分词器，语法如下：

PUT /my_index/_mapping/
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

如果您觉得阅读本文对您有帮助，请点一下“推荐”按钮，您的“推荐”将是我最大的写作动力！欢迎各位转载，但是未经作者本人同意，转载文章之后必须在文章页面明显位置给出作者和原文连接，否则保留追究法律责任的权利。

ElasticSearch7.3 学习之定制分词器（Analyzer） - |旧市拾荒|

1、默认的分词器

2、修改分词器的设置

3、定制化自己的分词器

Recommend

谷歌将于2023年关闭统一版谷歌分析 - 牛透社

Brushwork VR Bringing Native Studio App To Quest

Google Allowing Android Users to Delete the Last 15 Mins of Search History

早报 | 微信、淘宝等推出关闭算法推荐功能 / 315 被点名公司称打击不大 / 华为更新 P5...

机器学习整理（逻辑回归） - 北极乌布

Teardown Reveals iPhone SE 2022 Battery Size

Mojito莫吉托

关于OAuth2.0 Authorization Code + PKCE flow在原生客户端(Native App)下集成的一点...

详解Art Blocks：生成艺术的自动售货机，2021年累计交易额超10亿美元

物联网安全市场到2029年将达到590亿美元

About Joyk

ElasticSearch7.3 学习之定制分词器（Analyzer） - |旧市拾荒|

1、默认的分词器

2、 修改分词器的设置

3、定制化自己的分词器

Recommend

About Joyk

2、修改分词器的设置