README.md

LMDB Embeddings

Query word vectors (embeddings) quickly, with little lookup overhead and far less memory usage than gensim or other equivalent solutions. This is made possible by the Lightning Memory-Mapped Database (LMDB).

Inspired by Delft. As explained in their readme, this approach lets us have the pre-trained embeddings immediately "warm" (no load time), free up memory, and use any number of embeddings simultaneously with negligible impact on runtime when using an SSD.

For instance, in a traditional approach glove-840B takes around 2 minutes to load and 4GB of memory. Managed with LMDB, glove-840B can be accessed immediately and takes only a couple of MB in memory, for a negligible impact on runtime (around 1% slower).
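
To make the "no load time" claim concrete, here is a minimal sketch of the general idea behind memory-mapped lookups, using only the Python standard library. It is not how LMDB or this package is implemented (LMDB keeps a B+tree inside a memory-mapped file); the fixed-width file layout and the names `write_vectors` and `read_vector` are assumptions made purely for illustration.

```python
import mmap
import os
import struct
import tempfile

DIMENSIONS = 4                              # toy vector size, for illustration only
RECORD = struct.Struct(f'<{DIMENSIONS}f')   # one little-endian float32 vector


def write_vectors(path, vectors):
    """Write fixed-width float32 vectors back to back into a binary file."""
    with open(path, 'wb') as handle:
        for vector in vectors:
            handle.write(RECORD.pack(*vector))


def read_vector(path, index):
    """Map the file and decode only the requested record.

    Nothing is parsed up front: the operating system pages in just the
    bytes that are touched, which is why there is no load step.
    """
    with open(path, 'rb') as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return RECORD.unpack_from(mapped, index * RECORD.size)


path = os.path.join(tempfile.mkdtemp(), 'vectors.bin')
write_vectors(path, [(0.0, 1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0)])
second = read_vector(path, 1)
```

Only the second record is decoded here; the rest of the file is never read into Python objects.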

Reading vectors

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.exceptions import MissingWordError

embeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')

try:
    vector = embeddings.get_word_vector('google')
except MissingWordError:
    # 'google' is not in the database.
    pass
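
When looking up many words, the same try/except pattern can be wrapped in a small helper. The sketch below relies only on the `get_word_vector` method and the `MissingWordError` exception shown above; the helper name `lookup_words` and its signature are hypothetical and not part of the library.

```python
def lookup_words(reader, words, missing_error):
    """Return a {word: vector} dict for the words present in the database.

    `reader` is any object with a get_word_vector(word) method and
    `missing_error` is the exception type it raises for unknown words.
    Both are passed in so the helper makes no assumptions beyond that.
    """
    found = {}
    for word in words:
        try:
            found[word] = reader.get_word_vector(word)
        except missing_error:
            # Word absent from the database: leave it out of the result.
            continue
    return found


# Usage with the reader from the example above:
# vectors = lookup_words(embeddings, ['google', 'qwertyuiop'], MissingWordError)
```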

Writing vectors

Below is an example that writes an LMDB vector file from a gensim model. Any iterator that yields word and vector pairs is supported, so if your vectors are in a different format it is just a matter of altering the iter_embeddings function below appropriately.

I will be writing a CLI to convert standard formats soon.

import tqdm
from gensim.models.keyedvectors import KeyedVectors
from lmdb_embeddings.writer import LmdbEmbeddingsWriter

OUTPUT_DATABASE_FOLDER = 'GoogleNews-vectors-negative300'

gensim_model = KeyedVectors.load('GoogleNews-vectors-negative300.w2v', mmap = 'r')

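# Define an iterator to yield (word, vector) pairs.
def iter_embeddings():
    # gensim < 4.0 exposes the vocabulary as `vocab`; gensim 4.x renamed it to `key_to_index`.
    for word in tqdm.tqdm(gensim_model.vocab.keys()):
        yield word, gensim_model[word]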

# Write the vectors to disk.
writer = LmdbEmbeddingsWriter(
    iter_embeddings()
).write(OUTPUT_DATABASE_FOLDER)

# These vectors can now be loaded with the LmdbEmbeddingsReader.
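
As an illustration of adapting the iterator to another format, here is a sketch for plain-text vector files in which each line holds a token followed by its space-separated values (the layout used by GloVe text files). The function name `iter_text_embeddings` and its `dimensions` parameter are hypothetical. It yields plain Python lists; the gensim example above yields numpy arrays, so wrapping each vector in `numpy.asarray` may be needed, since whether the writer accepts plain lists is not confirmed here.

```python
def iter_text_embeddings(path, dimensions):
    """Yield (word, vector) pairs from a space-delimited text vector file.

    The last `dimensions` fields of each line are taken as the vector and
    everything before them as the token, so a token that itself contains
    spaces stays intact.
    """
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            parts = line.rstrip().split(' ')
            if len(parts) <= dimensions:
                # Blank line, malformed line, or a "count dimensions"
                # header such as the one in word2vec text files.
                continue
            word = ' '.join(parts[:-dimensions])
            vector = [float(value) for value in parts[-dimensions:]]
            yield word, vector


# Usage, mirroring the gensim example above (the path is a placeholder):
# LmdbEmbeddingsWriter(
#     iter_text_embeddings('/path/to/vectors.txt', 300)
# ).write(OUTPUT_DATABASE_FOLDER)
```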

Running tests

pytest

Customisation

By default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). However, it is very easy to use an alternative approach such as msgpack. Simply inject the serializer and unserializer as callables into the LmdbEmbeddingsWriter and LmdbEmbeddingsReader.
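
A serializer is simply a callable that turns a vector into bytes, and an unserializer is the inverse. The sketch below uses the standard library's `array` module rather than msgpack, to stay dependency-free. The function names are hypothetical, and the keyword argument names in the commented wiring at the end are assumptions; check the `LmdbEmbeddingsWriter` and `LmdbEmbeddingsReader` signatures before relying on them.

```python
from array import array


def serialize(vector):
    """Pack a sequence of floats into raw float32 bytes (native byte order)."""
    return array('f', vector).tobytes()


def unserialize(payload):
    """Rebuild the float32 values from bytes produced by serialize()."""
    values = array('f')
    values.frombytes(payload)
    return values.tolist()


# Hypothetical wiring; the keyword names `serializer` and `unserializer`
# are assumptions, not confirmed against the library:
# LmdbEmbeddingsWriter(iter_embeddings(), serializer = serialize).write(OUTPUT_DATABASE_FOLDER)
# embeddings = LmdbEmbeddingsReader(OUTPUT_DATABASE_FOLDER, unserializer = unserialize)
```

Whatever pair is chosen, the reader must use the inverse of the serializer the database was written with. Values are stored as float32 here, so higher-precision input is rounded, and the native byte order means the files are not portable between machines of different endianness.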

GitHub - ThoughtRiver/lmdb-embeddings: Fast word vectors with little memory usag...
