How arrays are indexed in ElasticSearch

 3 years ago
source link: http://www.cnblogs.com/gudujian/p/13697598.html

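1. Introduction

ElasticSearch has no dedicated array type. Any field can hold zero or more values, and a field that holds more than one value is simply treated as an array. All values in the array must share the same data type.

The rest of this article uses video data as an example to show how ElasticSearch indexes array data and how to search the values inside an array.

The test video data looks like this: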

{
    "media_id": 88992211,
    "tags": ["电影","科技","恐怖","电竞"]
}

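media_id is the video ID, and tags holds the video's tags. A single video can carry several tags, and the business requirement is to retrieve every video under a given tag.

The ElasticSearch cluster used for this demonstration runs version 7.6.2.

2. Demonstration

2.1 Create the index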

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "type": "text"
      }
    }
  }
}

2.2 Write test data into the test_arrays index

POST test_arrays/_doc
{
  "media_id": 887722,
  "tags": [
      "电影",
      "科技",
      "恐怖",
      "电竞"
    ]
}

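2.3 Inspect how test_arrays indexes the tags field

The output below has the shape of an _analyze API response for the four tag values analyzed against the tags field: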

{
  "tokens" : [
    {
      "token" : "电",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "影",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "科",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 102
    },
    {
      "token" : "技",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 103
    },
    {
      "token" : "恐",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 204
    },
    {
      "token" : "怖",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 205
    },
    {
      "token" : "电",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 306
    },
    {
      "token" : "竞",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 307
    }
  ]
}

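The response shows that every value in the tags array is split into several tokens: the default standard analyzer emits one token per Chinese character. The jump in position between values (1 to 102, 103 to 204, and so on) comes from position_increment_gap, which defaults to 100 for text fields.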
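The positions and offsets in that response can be reproduced with a small simulation. This is only a sketch, not ElasticSearch code: it assumes one token per character (as the standard analyzer does for this Chinese text), a position gap of 100 between array values, and an offset gap of 1. The function name analyze_text_array is made up for illustration.

```python
# Simulation only: mimics one-token-per-character analysis of short CJK
# strings and how positions and offsets advance between array values.
POSITION_INCREMENT_GAP = 100  # Elasticsearch default for text fields
OFFSET_GAP = 1                # assumed offset gap between consecutive values


def analyze_text_array(values):
    """Return token dicts shaped like the entries of an _analyze response."""
    tokens = []
    position = -1    # position of the last emitted token
    offset_base = 0  # character offset where the current value starts
    for index, value in enumerate(values):
        if index > 0:
            # Moving to the next array value: apply both gaps.
            position += POSITION_INCREMENT_GAP
            offset_base += OFFSET_GAP
        for i, char in enumerate(value):
            position += 1
            tokens.append({
                "token": char,
                "start_offset": offset_base + i,
                "end_offset": offset_base + i + 1,
                "position": position,
            })
        offset_base += len(value)
    return tokens


if __name__ == "__main__":
    for token in analyze_text_array(["电影", "科技", "恐怖", "电竞"]):
        print(token)
```

Because the last token of one value and the first token of the next are never at adjacent positions, a phrase query with the default slop cannot match across two array values.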

2.4 Search the values in the tags array

POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "电影"
    }
  }
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.68324494,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "MyhnpXQBGXOapfjvSpOW",
        "_score" : 0.68324494,
        "_source" : {
          "media_id" : 887722,
          "tags" : [
            "电影",
            "科技",
            "恐怖",
            "电竞"
          ]
        }
      }
    ]
  }
}

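Partial match on a single character (this hits because each character was indexed as its own token, not because a fuzzy query is used):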
POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "影"
    }
  }
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "MyhnpXQBGXOapfjvSpOW",
        "_score" : 0.2876821,
        "_source" : {
          "media_id" : 887722,
          "tags" : [
            "电影",
            "科技",
            "恐怖",
            "电竞"
          ]
        }
      }
    ]
  }
}

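The business needs exact matching on tags in order to list all videos under a tag. To get that behavior, change the type of the tags field to keyword. The type of an existing field cannot be changed in place, so delete the old index first (DELETE test_arrays), recreate it, and write the test data again. The test_arrays index is then created with these mappings: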

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "type": "keyword"
      }
    }
  }
}

Now every value in the tags array maps to exactly one token, so a query for a tag returns precisely the videos that carry it.

{
  "tokens" : [
    {
      "token" : "电影",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "科技",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "恐怖",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "电竞",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    }
  ]
}
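The difference between the text and keyword mappings can be illustrated with a toy inverted index. This is a sketch under stated assumptions, not how ElasticSearch is implemented: text_tokens stands in for the standard analyzer on Chinese text, keyword_tokens for a keyword field, and the second document (tagged 电视) is hypothetical data added to make the contrast visible.

```python
from collections import defaultdict


def build_index(docs, tokenize):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            for token in tokenize(tag):
                index[token].add(doc_id)
    return index


def text_tokens(value):
    # Stand-in for the standard analyzer on CJK text: one token per character.
    return list(value)


def keyword_tokens(value):
    # A keyword field indexes the whole value as a single token.
    return [value]


def match_any(index, tokens):
    """Ids of documents containing at least one token (OR semantics)."""
    hits = set()
    for token in tokens:
        hits |= index.get(token, set())
    return hits


# Hypothetical documents: id -> tags
DOCS = {
    1: ["电影", "科技", "恐怖", "电竞"],
    2: ["电视", "科技"],
}

text_index = build_index(DOCS, text_tokens)
keyword_index = build_index(DOCS, keyword_tokens)

if __name__ == "__main__":
    # With per-character tokens, a search for 电影 also reaches document 2 via 电.
    print(sorted(match_any(text_index, text_tokens("电影"))))
    # With whole-value tokens, only the document actually tagged 电影 is found.
    print(sorted(match_any(keyword_index, keyword_tokens("电影"))))
```

In this model the per-character index cannot tell 电影 apart from any other tag that shares a character with it, which is why the keyword mapping fits the exact-match requirement better.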

In real workloads the tags may not be stored as an array. All tags may instead sit in a single string, separated by commas.

{
    "media_id": 88992211,
    "tags": "电影,科技,恐怖,电竞"
}

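This storage format can also support exact retrieval of the videos under a single tag by adjusting how the field is analyzed. After deleting the previous index, create test_arrays with the following configuration: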

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1,
    "analysis" : {
        "analyzer" : {
          "comma_analyzer": {
            "tokenizer": "comma_tokenizer"
          }
        },
        "tokenizer" : {
          "comma_tokenizer": {
            "type": "simple_pattern_split",
            "pattern": ","
          }
        }
      }
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "search_analyzer" : "simple",
        "analyzer" : "comma_analyzer",
        "type" : "text"
      }
    }
  }
}

Write one test document into the test_arrays index:

POST test_arrays/_doc
{
  "media_id": 887722,
  "tags": "电影,科技,恐怖,电竞"
}

The tags field is indexed as follows. Again, each tag corresponds to exactly one token.

{
  "tokens" : [
    {
      "token" : "电影",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "科技",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "恐怖",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "电竞",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    }
  ]
}
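The token list above can be reproduced with a short simulation of splitting on a comma. This is a rough sketch: comma_split_tokens is a made-up helper that uses Python's re module rather than the Lucene regular expressions behind simple_pattern_split, and skipping empty segments is an assumption of the sketch, not a documented guarantee.

```python
import re


def comma_split_tokens(text, pattern=","):
    """Split text at each pattern match and report offsets and positions."""
    tokens = []
    start = 0
    position = 0
    # A zero-width sentinel at the end makes the loop emit the final segment.
    boundaries = [m.span() for m in re.finditer(pattern, text)]
    boundaries.append((len(text), len(text)))
    for match_start, match_end in boundaries:
        if match_start > start:  # skip empty segments (assumption of this sketch)
            tokens.append({
                "token": text[start:match_start],
                "start_offset": start,
                "end_offset": match_start,
                "type": "word",
                "position": position,
            })
            position += 1
        start = match_end
    return tokens


if __name__ == "__main__":
    for token in comma_split_tokens("电影,科技,恐怖,电竞"):
        print(token)
```

Unlike the array case, the whole string is one field value, so positions are consecutive and no position gap appears between tags.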

Exact-match query by tag.

Request:
POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "电影"
    }
  }
}
Response:
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "3i2ipXQBGXOapfjv3THH",
        "_score" : 0.2876821,
        "_source" : {
          "media_id" : 887722,
          "tags" : "电影,科技,恐怖,电竞"
        }
      }
    ]
  }
}
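Why this query hits, while a single character would not, can be sketched by modeling the two analyzers of the mapping. The helpers below are assumptions for illustration only: index_tokens stands in for comma_analyzer, and simple_tokens approximates the simple analyzer (runs of letters, lowercased) with a Python regular expression.

```python
import re


def index_tokens(text):
    # Index-time stand-in for comma_analyzer: one token per comma-separated tag.
    return [part for part in text.split(",") if part]


def simple_tokens(text):
    # Search-time stand-in for the simple analyzer: runs of letters, lowercased.
    # [^\W\d_] approximates "any Unicode letter", which includes CJK characters.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def match(doc_text, query_text):
    """True if any query token equals an indexed token (OR semantics)."""
    indexed = set(index_tokens(doc_text))
    return any(token in indexed for token in simple_tokens(query_text))


DOC = "电影,科技,恐怖,电竞"

if __name__ == "__main__":
    print(match(DOC, "电影"))       # whole tag
    print(match(DOC, "影"))         # single character of a tag
    print(match(DOC, "科技,电视"))  # two tags, only one of them present
```

One thing this model suggests checking on a real cluster: comma_analyzer as configured has no lowercase filter, while the simple analyzer lowercases the query, so a tag containing uppercase Latin letters might not be found unless a lowercase filter is added at index time.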

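3. Summary

ElasticSearch lets a single data type hold both single and multiple values. This reduces the total number of data types and lowers the complexity of index configuration, which makes it a very good design.

Tag data can also be organized either as an array or as a delimiter-separated string, which shows how flexible ElasticSearch is.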

