Visualizing Word2Vec Embeddings using t-SNE

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Laurens van der Maaten and Geoffrey Hinton. It is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points.
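
To get a feel for the algorithm before we apply it to word vectors, here is a minimal sketch using scikit-learn's TSNE (an aside for illustration only; scikit-learn is not otherwise used in this tutorial, which relies on the reference tsne.py instead):

# Minimal t-SNE sketch: project random 200-dimensional points down to 2-D.
# scikit-learn is used only for this illustration; the tutorial uses tsne.py.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(100, 200)            # 100 points in 200 dimensions
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)               # Y has shape (100, 2)
print(Y.shape)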

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. It was created by a team of researchers led by Tomas Mikolov at Google.
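
As an aside, if you would rather experiment in pure Python, a word2vec model can be trained in a few lines with the gensim library. This is only a sketch under the assumption that gensim 4.x is available; the rest of this tutorial uses the original C implementation:

# Minimal word2vec sketch using gensim 4.x (an assumption; not used below).
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]
model = Word2Vec(sentences, vector_size=200, window=8, min_count=1)
print(model.wv["fox"])                  # the 200-dimensional vector for "fox"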

This tutorial shows how to use t-SNE to visualize the word embeddings from Word2Vec; applying it to other kinds of embeddings should work similarly. By the end of this tutorial we’ll get a plot like the diagram below, in which similar words appear close to each other. The similarity can be in meaning, location, and so on.

[Figure: visualization of word2vec embeddings using t-SNE]

Prerequisites

The Python version of t-SNE is used; it requires NumPy and matplotlib.

Training

First, download word2vec and build it from source.

wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
unzip source-archive.zip
cd word2vec/trunk
make

If there is an error complaining that “‘malloc.h’ file not found” (common on macOS), open the file named in the error and replace malloc.h with stdlib.h.

Then we train on the text8 corpus from Matt Mahoney.

wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
./word2vec -train text8 -output vectors.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15

Now we have the embeddings file named vectors.txt.
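
The file is in word2vec’s plain-text format: a header line with the vocabulary size and the dimensionality, followed by one word and its vector per line. If you want to inspect it yourself, here is a minimal loader sketch:

# Minimal loader for word2vec's text format (vectors.txt):
# first line is "<vocab_size> <dimensions>", then "<word> <v1> ... <vN>" per line.
import numpy as np

def load_vectors(path):
    embeddings = {}
    with open(path) as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.array(parts[1:1 + dim], dtype=float)
    return embeddings

vectors = load_vectors("vectors.txt")
print(len(vectors), len(vectors["king"]))   # vocabulary size, 200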

Visualization

Let’s download tsne.py along with two helper scripts I have written.

wget https://raw.githubusercontent.com/kingwind/tsneplot/master/tsnePlot.py
wget https://raw.githubusercontent.com/kingwind/tsneplot/master/prepareForTsNE.py
wget https://raw.githubusercontent.com/kingwind/tsneplot/master/tsne.py

Usually we don’t want to plot all the embeddings at the same time, as the result is too noisy. For this Lab Note, suppose we are only interested in the words from the analogy test set used in word2vec.

wget https://raw.githubusercontent.com/dav/word2vec/master/data/questions-words.txt

The next step is to prepare the input for visualization using prepareForTsNE.py. The output file label_w2v.txt contains the words to plot, and embed_w2v.txt contains the corresponding vectors.

python prepareForTsNE.py -e vectors.txt -s questions-words.txt -l label_w2v.txt -x embed_w2v.txt
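
In case you are curious what the helper does, it essentially intersects the analogy-test words with the trained vocabulary, roughly along the lines of this sketch (a rough reimplementation for illustration, not the actual script):

# Rough sketch of the preparation step (not the actual prepareForTsNE.py):
# keep only analogy-test words found in the trained vocabulary, then write
# the labels and the vectors to separate files.
words = set()
with open("questions-words.txt") as f:
    for line in f:
        if not line.startswith(":"):        # ":" lines are section headers
            words.update(line.lower().split())

with open("label_w2v.txt", "w") as labels, open("embed_w2v.txt", "w") as embeds:
    for word in sorted(words):
        if word in vectors:                 # vectors as loaded above
            labels.write(word + "\n")
            embeds.write(" ".join(str(v) for v in vectors[word]) + "\n")
        else:
            print("warning: %s not found" % word)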

There may be warnings about some words not being found, which is OK.

Now we are ready to visualize the results using t-SNE. Here -d 200 is the dimensionality of the vectors, and -p sets the perplexity, which usually ranges from 20 to 50.

python tsnePlot.py -l label_w2v.txt -v embed_w2v.txt -d 200 -p 50
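
Under the hood the plotting script does little more than run t-SNE on the vectors and scatter the labeled 2-D points with matplotlib, roughly like the sketch below (an approximation, assuming tsne.py exposes the reference tsne(X, no_dims, initial_dims, perplexity) function):

# Rough sketch of the plotting step (not the actual tsnePlot.py).
import numpy as np
import matplotlib.pyplot as plt
from tsne import tsne                       # the tsne.py downloaded above

X = np.loadtxt("embed_w2v.txt")
labels = open("label_w2v.txt").read().split()
Y = tsne(X, 2, 50, 50.0)                    # 2-D output, 50 initial PCA dims, perplexity 50
plt.scatter(Y[:, 0], Y[:, 1], s=5)
for label, x, y in zip(labels, Y[:, 0], Y[:, 1]):
    plt.annotate(label, xy=(x, y), fontsize=8)
plt.show()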

Please note that every time we run t-SNE we may get a slightly different plot. That is expected, and an explanation can be found in the t-SNE FAQ.
