Custom scoring and collectors in whoosh

5 years ago

source link: https://blog.xubiaosunny.online/%E6%8A%80%E6%9C%AF/2019/11/07/whoosh_scoring_and_collectors.html

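Custom scoring

The scoring module is what whoosh uses to control how search results are scored.

The scoring that ships with whoosh already produces very good results, but business requirements get in the way. In our case, items that are on sale must rank near the top and more recent years must rank ahead of older ones. Yet we cannot simply sort by sale status and date: relevance to the search keywords still has to be weighed in. That makes things painful, because the influence of each dimension, that is, its weight, has to be balanced carefully.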
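As a rough illustration of what balancing these dimensions means, the following pure-Python sketch adds bounded bonuses for sale status and recency on top of a relevance score. The function name, the weights, and the sample documents are all hypothetical and are not taken from the original post; the point is only that the bonuses should be large enough to reorder similarly relevant items, yet small enough that a weakly relevant item cannot jump to the top.

```python
# Hypothetical weights; tune them against real relevance scores.
ON_SALE_BONUS = 2.0         # added when the item is on sale
RECENCY_BONUS_MAX = 1.5     # maximum bonus for an item from the current year
RECENCY_WINDOW_YEARS = 5    # items older than this get no recency bonus


def combined_score(relevance, on_sale, year, current_year):
    """Combine keyword relevance with sale-status and recency bonuses."""
    score = relevance
    if on_sale:
        score += ON_SALE_BONUS
    age = max(0, current_year - year)
    # Freshness decays linearly from 1.0 (this year) to 0.0 (window exceeded)
    freshness = max(0.0, 1.0 - age / RECENCY_WINDOW_YEARS)
    score += RECENCY_BONUS_MAX * freshness
    return score


# Made-up documents with made-up relevance scores
docs = [
    {"name": "A", "relevance": 9.0, "on_sale": False, "year": 2012},
    {"name": "B", "relevance": 7.5, "on_sale": True, "year": 2019},
    {"name": "C", "relevance": 3.0, "on_sale": True, "year": 2019},
]
ranked = sorted(
    docs,
    key=lambda d: combined_score(d["relevance"], d["on_sale"], d["year"], 2019),
    reverse=True,
)
names = [d["name"] for d in ranked]
print(names)
```

With these numbers, B (on sale, recent, fairly relevant) overtakes A, while C stays below A because its relevance is too low for the bonuses to compensate.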

Overriding BM25FScorer

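from whoosh import scoring

# Placeholder bonus; the original post does not state the actual amount
ON_SALE_BONUS = 1.0


class MyBM25FScorer(scoring.BM25FScorer):
    def __init__(self, searcher, fieldname, text, B, K1, qf=1):
        super().__init__(searcher, fieldname, text, B, K1, qf=qf)
        # Keep the searcher so score() can read stored fields
        self.searcher = searcher

    def score(self, matcher):
        # Plain BM25F score for the current document
        s = self._score(matcher.weight(), self.dfl(matcher.id()))
        # customize
        d = self.searcher.stored_fields(matcher.id())
        return self.customize_add_score(d, s)

    @staticmethod
    def customize_add_score(d, s):
        if d['saleStatus'] == "在售":  # "在售" means "on sale"
            # If the item is on sale, add a bonus to the score
            s += ON_SALE_BONUS
        # Other conditions (elided in the original post)...
        return s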

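Custom collectors

With the custom scoring in place the results did come out in the expected order, but a new problem appeared: keywords that match a very large number of documents became very slow. The cause is that the custom logic reads the stored fields and runs the bonus checks once for every matching document, so the time grows with the number of matches. In actual testing, a query matching about 25,000 documents took nearly 8 s.

In practice the results ranked near the bottom are mostly irrelevant, and no user is going to look through more than 20,000 records.

So we can limit the number of matches to speed up the search.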

from whoosh import collectors

# Documents scoring below this threshold are treated as irrelevant
MIN_SCORE = 6

...
class MyBM25FScorer(scoring.BM25FScorer):
    ...

    def score(self, matcher):
        s = self._score(matcher.weight(), self.dfl(matcher.id()))
        # customize
        if s < MIN_SCORE:
            # Skip the stored-field lookup for low-scoring documents
            return s
        d = self.searcher.stored_fields(matcher.id())
        return self.customize_add_score(d, s)


class MyUnlimitedCollector(collectors.UnlimitedCollector):
    def _collect(self, global_docnum, score):
        if score < MIN_SCORE:
            # Leave low-scoring documents out of the results entirely
            return 0
        self.items.append((score, global_docnum))
        self.docset.add(global_docnum)
        # Negate score to act as sort key so higher scores appear first
        return 0 - score

Searching with the custom scoring and collectors

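...
# MyBM25F is not defined in the post; presumably it is a scoring.BM25F
# subclass whose scorer() returns MyBM25FScorer
with ix.searcher(weighting=MyBM25F()) as s:
    c = MyUnlimitedCollector()
    s.search_with_collector(query, c)
    results = c.results()
...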
