Quickly Implementing Content-Similarity Recommendation with ES

2020.04.11 01:12:21

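Question answering: given a descriptive passage from the user, use similarity scoring to find existing questions close to the input.
Similar-article recommendation: while a user reads an article, recommend other articles with similar content.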

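As the name suggests, more_like_this finds more documents that resemble a given document. To make the walkthrough concrete, first create an index with two fields, title and desc: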

PUT /search_data
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "term_vector": "yes"
            },
            "desc": {
                "type": "text"
            }
        }
    }
}

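term_vector
Setting term_vector to yes stores term vectors at index time, which speeds up the similarity computation. The desc field has no term_vector setting here, and more_like_this still works on it, but at a performance cost because the text has to be re-analyzed at query time. It is left unset here only to illustrate the point; in production, set it to yes on every field used for more_like_this.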
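For the production setup described above, a mapping with term vectors on both fields could be generated as follows. This is a minimal Python sketch: build_mapping is a hypothetical helper, not part of any Elasticsearch client, and it only produces the JSON body for a request shaped like PUT /search_data.

```python
import json


def build_mapping(fields):
    """Return an index mapping in which every given text field stores term vectors."""
    return {
        "mappings": {
            "properties": {
                name: {"type": "text", "term_vector": "yes"} for name in fields
            }
        }
    }


# Body for: PUT /search_data
mapping = build_mapping(["title", "desc"])
print(json.dumps(mapping, indent=4))
```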

Recommending from a short text or a question description

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "desc"],
            "like" : "清明节春游踏青春季旅游学校春游亲子游企业郊游活动",
            "min_term_freq" : 1,
            "max_query_terms" : 12
        }
    }
}

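fields
The fields to run the query against. They must be indexed and of type text or keyword.
like
What to find similar content for: free-form text, one or more document references (by _index and _id), or a mix of both.
min_term_freq
Minimum term frequency. Terms that occur fewer times than this in the input are ignored. The default is 2.
max_query_terms
Maximum number of terms used to build the query. The terms from the like input with the highest tf-idf scores are kept and the rest are ignored. The default is 25.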
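To make the interplay of min_term_freq and max_query_terms more concrete, here is a simplified Python sketch of the term selection step. It is not the actual Lucene implementation: select_query_terms is a hypothetical function, the input is assumed to be already tokenized, the document frequencies are made-up numbers, and the idf formula is a classic-style one chosen for illustration.

```python
import math
from collections import Counter


def select_query_terms(tokens, doc_freq, num_docs,
                       min_term_freq=2, min_doc_freq=5, max_query_terms=25):
    """Pick the highest tf-idf terms from an already tokenized input text."""
    scored = []
    for term, tf in Counter(tokens).items():
        if tf < min_term_freq:
            continue  # occurs too rarely in the input text
        df = doc_freq.get(term, 0)
        if df < min_doc_freq:
            continue  # occurs in too few indexed documents
        # Classic-style idf, assumed here for illustration only.
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0
        scored.append((tf * idf, term))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


# Hypothetical tokens and document frequencies for a 1000-document index.
tokens = ["spring", "spring", "trip", "trip", "trip",
          "the", "the", "the", "the", "kids"]
doc_freq = {"spring": 10, "trip": 50, "the": 1000, "kids": 3}
print(select_query_terms(tokens, doc_freq, num_docs=1000, max_query_terms=2))
```

In this toy input, "kids" is dropped by min_term_freq, and the very common "the" scores too low to survive the max_query_terms cut.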

If the text is too long, recommend by article ID instead

GET /_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "desc"],
            "like" : [
            {
                "_index" : "search_data",
                "_id" : "1"
            } 
            ],
            "min_term_freq" : 1,
            "max_query_terms" : 12
        }
    }
}

like takes an array, so several articles can be listed. The _index of a referenced document also does not have to be the index being searched.
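When the article IDs come from application code, the like array can be assembled programmatically. The following Python sketch assumes a hypothetical helper mlt_by_documents and a hypothetical second index named archive_data; it only builds the request body and leaves sending it to whichever client is in use.

```python
import json


def mlt_by_documents(doc_refs, fields, min_term_freq=1, max_query_terms=12):
    """Build a more_like_this body from (index, id) pairs."""
    like = [{"_index": index, "_id": str(doc_id)} for index, doc_id in doc_refs]
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": like,
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


# Two source articles, one of them from another (hypothetical) index.
body = mlt_by_documents(
    [("search_data", 1), ("archive_data", 42)],
    fields=["title", "desc"],
)
print(json.dumps(body, indent=4))
```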

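unlike
If the recommendations are not satisfactory, they can be tuned with the unlike parameter. Its syntax is the same as like, but it takes content you do not want: terms found in the unlike input are not selected when the similarity query is built, which pushes matching results down. Note that the effect is not very noticeable when the content being suppressed is among the top recommendations.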

GET search_data/_search
{
    "size": 112,
    "_source": ["desc", "title"],
    "query": {
        "more_like_this" : {
            "fields" : ["title", "desc"],
            "unlike" : [
                {
                    "_index" : "search_data",
                    "_id" : "1270715"
                },
                {
                    "_index" : "search_data",
                    "_id" : "1238991"
                },
                {
                    "_index" : "search_data",
                    "_id" : "506680"
                },
                "我要把不喜欢的内容屏蔽掉"
            ],
            "like" : [
                {
                    "_index" : "search_data",
                    "_id" : "986604"
                }
            ],
            "min_term_freq" : 1
        }
    }
}
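Since both like and unlike accept a mix of document references and free text, a small normalizing helper can keep calling code tidy. The Python sketch below is an assumption about how one might structure this: to_like_items and mlt_with_unlike are hypothetical names, and the free-text string is a made-up example.

```python
import json


def to_like_items(items):
    """Strings stay as free text; (index, id) pairs become document references."""
    result = []
    for item in items:
        if isinstance(item, str):
            result.append(item)
        else:
            index, doc_id = item
            result.append({"_index": index, "_id": str(doc_id)})
    return result


def mlt_with_unlike(fields, like, unlike, min_term_freq=1):
    """Build a more_like_this body with both liked and disliked inputs."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": to_like_items(like),
                "unlike": to_like_items(unlike),
                "min_term_freq": min_term_freq,
            }
        }
    }


request_body = mlt_with_unlike(
    fields=["title", "desc"],
    like=[("search_data", 986604)],
    unlike=[("search_data", 1270715), "content I do not want to see"],
)
print(json.dumps(request_body, indent=4))
```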

Other optional parameters

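min_doc_freq: minimum document frequency; input terms that appear in fewer documents are ignored. Defaults to 5.
max_doc_freq: maximum document frequency; input terms that appear in more documents are ignored. Unbounded by default.
min_word_length: minimum word length; shorter terms are ignored.
max_word_length: maximum word length; longer terms are ignored.
stop_words: list of stop words to exclude.
analyzer: the analyzer used for free-form like text. Defaults to the analyzer of the first field in fields.
minimum_should_match: how many of the selected terms a document must match. Defaults to 30% of the terms in the generated query.
boost_terms: boost factor applied to each term according to its tf-idf score. Disabled by default.
include: whether the input documents are also returned in the results. Defaults to false.
boost: boost for the whole query. Defaults to 1.0.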
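As a rough illustration of how these optional parameters slot into a request, the Python sketch below builds a more_like_this body and rejects misspelled option names. mlt_query is a hypothetical helper, and the option values (stop words, thresholds) are arbitrary examples rather than recommended settings.

```python
import json

# Parameter names taken from the lists in this article.
ALLOWED_OPTIONS = {
    "unlike", "min_term_freq", "max_query_terms", "min_doc_freq",
    "max_doc_freq", "min_word_length", "max_word_length", "stop_words",
    "analyzer", "minimum_should_match", "boost_terms", "include", "boost",
}


def mlt_query(fields, like, **options):
    """Build a more_like_this body, keeping only options that are set."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown more_like_this options: {sorted(unknown)}")
    clause = {"fields": list(fields), "like": like}
    clause.update({k: v for k, v in options.items() if v is not None})
    return {"query": {"more_like_this": clause}}


tuned = mlt_query(
    ["title", "desc"],
    [{"_index": "search_data", "_id": "1"}],
    min_term_freq=1,
    min_doc_freq=2,
    stop_words=["的", "了"],
    minimum_should_match="50%",
    include=False,
    analyzer=None,  # left unset, so it is omitted from the body
)
print(json.dumps(tuned, indent=4, ensure_ascii=False))
```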
