
This article shows how to download a Wikipedia corpus and train word embeddings on it. All the code is on Github. Downloading and training take a very long time, so I have also uploaded my pretrained embeddings. You can download them here: Chinese Word2Vec, Chinese FastText, English Word2Vec, English FastText.

Word embedding projects each word into a vector space so that words with similar meanings end up close to each other. How is this done? We assume that a word can be explained by its context: words that appear near each other in a sentence should also be near each other in the space, while words that never share a sentence should be far apart. FastText additionally considers so-called “subword” information that word2vec ignores: “apple” is also represented by subwords such as “app”, “ppl”, and “ple”. Rare words can therefore be learned from their subwords better than from the full word alone, which makes the embedding perform better. No math details here; a toy sketch of the subword idea follows, and then let’s take a look at the code.
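As a quick illustration only (FastText actually uses character n-grams of several lengths, typically 3 to 6, with word-boundary markers), here is a tiny sketch of extracting character trigrams:

# Toy illustration of the subword idea: character trigrams of a word.
def char_trigrams(word):
    return [word[i:i + 3] for i in range(len(word) - 2)]

print(char_trigrams('apple'))  # ['app', 'ppl', 'ple']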

python train.py --model word2vec --lang en --output data/en_wiki_word2vec_300.txt

Running the command above downloads the latest English wiki dump, trains a word2vec model, and saves the result to data/en_wiki_word2vec_300.txt.

Where to Download Wikipedia Corpus?

You can check the dump status of Wikipedia in each language here, as well as the dump versions available for download for the English Wikipedia. Choose “latest” because we want the most recent version, and download enwiki-latest-pages-articles-multistream.xml.bz2.

I used Python requests to download the file, and tqdm to show the download progress on screen. The English wiki dump as of 2019/02 is about 16 GB.
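The download code in the repo is not reproduced here; a minimal sketch with requests and tqdm could look like this (the dump URL and output path are assumptions based on the file name above):

import requests
from tqdm import tqdm

url = ('https://dumps.wikimedia.org/enwiki/latest/'
       'enwiki-latest-pages-articles-multistream.xml.bz2')
output_path = 'data/enwiki-latest-pages-articles-multistream.xml.bz2'

# Stream the ~16 GB file to disk in chunks, showing progress with tqdm.
response = requests.get(url, stream=True)
total = int(response.headers.get('content-length', 0))
with open(output_path, 'wb') as f, tqdm(total=total, unit='B', unit_scale=True) as bar:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)
        bar.update(len(chunk))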

Corpus Processing

import logging
import jieba

from gensim.corpora.wikicorpus import WikiCorpus

class WikiSentences:
    def __init__(self, wiki_dump_path, lang):
        logging.info('Parsing wiki corpus')
        self.wiki = WikiCorpus(wiki_dump_path)
        self.lang = lang

    def __iter__(self):
        for sentence in self.wiki.get_texts():
            if self.lang == 'zh':
                yield list(jieba.cut(''.join(sentence), cut_all=False))
            else:
                yield list(sentence)

I used the WikiCorpus class provided by gensim to parse the corpus. __iter__ is needed for model training, which requires an iterator that goes through each sentence in our data. The if self.lang == 'zh' branch is only used when training on the Chinese wiki; ignore it if you are training on the English wiki.

I ran the program on an Intel i5-8400 CPU with 32 GB of RAM, and it took about 100 minutes to finish the constructor.
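If you want to peek at the parsed output before starting a long training run, a small sketch like this works (the dump path is an example, and wiki is the module that contains the WikiSentences class above):

import itertools

import wiki as w

wiki_sentences = w.WikiSentences(
    'data/enwiki-latest-pages-articles-multistream.xml.bz2', 'en')

# Print the first 20 tokens of the first parsed article.
for tokens in itertools.islice(wiki_sentences, 1):
    print(tokens[:20])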

Model Training

import wiki as w

from gensim.models.fasttext import FastText
from gensim.models.word2vec import Word2Vec

# WIKIXML (a path template for the downloaded dump) and args (the parsed
# command-line arguments) are defined earlier in train.py.
wiki_sentences = w.WikiSentences(WIKIXML.format(lang=args.lang), args.lang)

# sg=1 selects skip-gram training; hs=1 enables hierarchical softmax.
if args.model == 'word2vec':
    model = Word2Vec(wiki_sentences, sg=1, hs=1, size=args.size, workers=12, iter=5, min_count=10)
elif args.model == 'fasttext':
    model = FastText(wiki_sentences, sg=1, hs=1, size=args.size, workers=12, iter=5, min_count=10)

I trained the models using the Word2Vec and FastText classes provided by gensim. size is the number of dimensions you want for your word embedding, iter is the number of training iterations, and min_count drops all words with a total frequency lower than this number.

Training the Word2Vec model takes about 22 hours, and the FastText model about 33 hours. If that is too long for you, you can use a smaller iter, but the performance might be worse.
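train.py then writes the vectors to the path given by --output. The exact saving code is not shown in this post; a minimal sketch using gensim's plain-text word2vec format would be:

# Save the word vectors in the text word2vec format used by demo.py.
model.wv.save_word2vec_format(args.output, binary=False)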

Experiment Results

python demo.py --lang en --output data/en_wiki_word2vec_300.txt

Results of Word2Vec.

python demo.py --lang en --output data/en_wiki_fasttext_300.txt

Results of FastText.

I use cosine similarity as the score. Because FastText embeddings consider subword information, words with similar subwords are ranked higher. So which is better, Word2vec or FastText? It depends on your application.
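demo.py itself is not reproduced in this post; a minimal sketch of this kind of nearest-neighbor query with gensim's KeyedVectors (loading the output file from the training command above) could be:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('data/en_wiki_word2vec_300.txt', binary=False)

# Top-5 nearest neighbors by cosine similarity.
print(kv.most_similar('apple', topn=5))

# Cosine similarity between a specific pair of words.
print(kv.similarity('king', 'queen'))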

The result of Word2vec projected to 2 dimensions by PCA.

This is the result of projecting the Word2vec embeddings to 2 dimensions with PCA. You can see the relationship between countries and capitals in the graph: Asian countries gather together, and European countries gather into another group. The FastText result is similar, so I only show Word2vec's.
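The plotting code is not included in the post; a minimal sketch of such a projection with scikit-learn's PCA and matplotlib (the word list is just an example) could be:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('data/en_wiki_word2vec_300.txt', binary=False)

# Example country/capital words; any words in the vocabulary work.
words = ['japan', 'tokyo', 'china', 'beijing', 'france', 'paris', 'germany', 'berlin']
vectors = [kv[w] for w in words]

# Project the 300-dimensional vectors down to 2 dimensions with PCA.
points = PCA(n_components=2).fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for (x, y), word in zip(points, words):
    plt.annotate(word, (x, y))
plt.show()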

References

  1. https://github.com/LasseRegin/gensim-word2vec-model/blob/master/train.py
  2. http://zake7749.github.io/2016/08/28/word2vec-with-gensim/