
Chinese Word Vectors 中文词向量

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

Format

The pre-trained vector files are in text format. Each line contains a word and its vector, with values separated by spaces. The first line records the meta information: the first number is the number of words in the file and the second is the vector dimension.
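Since each file follows this plain-text layout, the dense vectors can be loaded with a few lines of Python. Below is a minimal, illustrative sketch (the file name in the usage comment is hypothetical); because this is the standard word2vec text format, gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` should also be able to read these files.

```python
import numpy as np

def load_dense_vectors(path):
    """Load dense vectors from the text format described above:
    the first line is '<vocab_size> <dim>', every following line is
    '<word> <v1> <v2> ... <v_dim>' with space-separated values."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:1 + dim], dtype=np.float32)
    return vectors, dim

# Illustrative usage (file name is hypothetical):
# vectors, dim = load_dense_vectors("sgns.wiki.word.txt")
# print(dim, vectors["中国"][:5])
```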

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They are in the same format as liblinear, where the number before " : " denotes the dimension index and the number after " : " denotes the value.
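A sparse (PPMI) line can be parsed in the same spirit. A minimal sketch, returning a {dimension index: value} mapping so that no assumption is made about whether the indices are 0- or 1-based:

```python
def parse_sparse_line(line):
    """Parse one liblinear-style line '<word> <index>:<value> ...'
    into a word and a {dimension_index: value} dict."""
    parts = line.rstrip().split(" ")
    word = parts[0]
    vector = {int(idx): float(val)
              for idx, val in (item.split(":") for item in parts[1:])}
    return word, vector

# parse_sparse_line("北京 3:1.25 17:0.8")
# -> ("北京", {3: 1.25, 17: 0.8})
```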

Pre-trained Chinese Word Vectors

Basic Settings

| Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word | Iteration |
|:-:|:-:|:-:|:-:|:-:|
| 5 | Yes | 1e-5 | 10 | 5 |

Various Domains

Chinese Word Vectors trained with different representations, context features, and corpora.

Word2vec / Skip-Gram with Negative Sampling (SGNS)

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d | 300d | 300d |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 | 300d | 300d | 300d | 300d |

Positive Pointwise Mutual Information (PPMI)

| Corpus | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d | 300d | 300d |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 | 300d | 300d | 300d | 300d |

*Character embeddings are provided, since most Hanzi are themselves words in archaic Chinese.

Various Co-occurrence Information

We release word vectors built upon different co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.

In this part, one can obtain vectors of linguistic units beyond words. For example, character vectors are found in the context vectors of the word-character setting.

All vectors are trained by SGNS on Baidu Encyclopedia.

| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
|---|---|---|---|
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| Ngram | Word → Ngram (1-3) | 300d | 300d |
| Ngram | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| Character | Word → Character (1-2) | 300d | 300d |
| Character | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| Position | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| Syntactic Feature | Word → Dependency | 300d | 300d |

Representations

Existing word representation methods fall into one of two classes: dense and sparse representations. The SGNS model (a model in the word2vec toolkit) and the PPMI model are typical methods of these two classes, respectively. The SGNS model trains low-dimensional real (dense) vectors through a shallow neural network; it is also called a neural embedding method. The PPMI model is a sparse bag-of-features representation weighted by the positive pointwise mutual information (PPMI) weighting scheme.
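To make the PPMI weighting scheme concrete, here is a small numpy sketch (illustrative only, not the project's training code) that turns a raw word-context co-occurrence count matrix into a PPMI matrix:

```python
import numpy as np

def ppmi(counts):
    """PPMI(w, c) = max(0, log(P(w, c) / (P(w) * P(c)))) computed from a
    (words x contexts) co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=np.float64)
    p_wc = counts / counts.sum()                # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)       # marginal over contexts
    p_c = p_wc.sum(axis=0, keepdims=True)       # marginal over words
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                # zero counts contribute 0
    return np.maximum(pmi, 0.0)

# Toy example:
# print(ppmi([[10, 0, 2], [3, 5, 0]]))
```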

Context Features

Three context features are commonly used in the word embedding literature: word, ngram, and character. Most word representation methods essentially exploit word-word co-occurrence statistics, i.e., they use words as the context feature (word feature). Inspired by the language modeling problem, we introduce ngram features into the context, so both word-word and word-ngram co-occurrence statistics are used for training (ngram feature). For Chinese, characters (Hanzi) often convey strong semantics; to this end, we also use word-word and word-character co-occurrence statistics for learning word vectors. The length of character-level ngrams ranges from 1 to 4 (character feature).
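As a concrete illustration of the character feature, a minimal sketch (illustrative, not the project's code) of extracting character-level ngrams of length 1 to 4 from a segmented word:

```python
def char_ngrams(word, min_n=1, max_n=4):
    """Return all character-level ngrams of `word` with lengths in [min_n, max_n]."""
    return [word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]

# char_ngrams("图书馆") -> ['图', '书', '馆', '图书', '书馆', '图书馆']
```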

Besides word, ngram, and character, there are other features that have a substantial influence on the properties of word vectors. For example, using the entire text as the context feature could introduce more topic information into word vectors, and using dependency parses as the context feature could add syntactic constraints to word vectors. 17 co-occurrence types are considered in this project.

Corpus

We made great efforts to collect corpora across various domains. All text data are preprocessed by removing HTML and XML tags; only the plain text is kept, and HanLP (v1.5.3) is used for word segmentation. The detailed corpus information is listed as follows:

| Corpus | Size | Tokens | Vocabulary Size | Description |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese Encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017), http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou labs, http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by NLPIR Lab, http://www.nlpir.org/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literature works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | Large corpus built by merging the above corpora |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China |

The statistics above take all words into account, including low-frequency words.

Toolkits

All word vectors are trained with the ngram2vec toolkit. The ngram2vec toolkit is a superset of the word2vec and fastText toolkits, supporting arbitrary context features and models.

Chinese Word Analogy Benchmarks

The quality of word vectors is often evaluated by analogy question tasks. In this project, two benchmarks are used for evaluation. The first is CA-translated, in which most analogy questions are directly translated from an English benchmark. Although CA-translated has been widely used in many Chinese word embedding papers, it contains only three semantic question types and covers 134 Chinese words. In contrast, CA8 is specifically designed for the Chinese language. It contains 17,813 analogy questions and covers comprehensive morphological and semantic relations. CA-translated, CA8, and their detailed descriptions are provided in the testsets folder.
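For intuition about how such analogy questions are scored (the project's own scripts are in the evaluation folder; see below), a common method is 3CosAdd: answer "a is to b as c is to ?" with the word whose vector is closest to vec(b) - vec(a) + vec(c). A minimal sketch, assuming a dict of unit-normalized dense vectors:

```python
import numpy as np

def answer_analogy(a, b, c, vectors):
    """Return the word whose vector has the highest cosine similarity to
    vec(b) - vec(a) + vec(c), excluding the three question words.
    `vectors` maps word -> unit-norm numpy array."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = float(np.dot(vec, target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```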

Evaluation Toolkit

We provide an evaluation toolkit in the evaluation folder.

Run the following commands to evaluate dense vectors.

$ python ana_eval_dense.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_dense.py -v <vector.txt> -a CA8/semantic.txt

Run the following commands to evaluate sparse vectors.

$ python ana_eval_sparse.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/semantic.txt

Reference

Please cite the paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.

