t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Laurens van der Maaten and Geoffrey Hinton. It is a nonlinear dimensionality reduction technique that is particularly well suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot. Specifically, it models each high-dimensional object as a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects by distant points.
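To make this concrete, here is a minimal sketch of running t-SNE on random data. It uses scikit-learn's implementation purely as an illustration; the tutorial below uses a standalone tsne.py script instead.

import numpy as np
from sklearn.manifold import TSNE

# 100 points in 50 dimensions, standing in for real embeddings
X = np.random.rand(100, 50)
# Map them to 2-D; perplexity must be smaller than the number of points
Y = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(Y.shape)  # (100, 2): one low-dimensional point per input object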
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. It was created by a team of researchers led by Tomas Mikolov at Google.
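For illustration, here is a hedged sketch of training such embeddings in Python with gensim's Word2Vec; this tutorial itself uses Google's original C implementation below. Note that vector_size is the gensim 4.x argument name (older releases call it size).

from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
# vector_size sets the embedding dimension
model = Word2Vec(sentences, vector_size=200, window=8, min_count=1)
print(model.wv["cat"][:5])  # first few components of the vector for "cat"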
This tutorial shows how to use t-SNE to visualize the word embeddings from Word2Vec; applying it to other kinds of embeddings should be similar. By the end of this tutorial we’ll have a plot like the diagram below, in which similar words appear close to each other. The similarity can be in meaning, location, and so on.
Prerequisites
The Python version of t-SNE is used, and it requires matplotlib.
Training
First, download word2vec and build it from source.
wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
unzip source-archive.zip
cd word2vec/trunk
make
If there is an error complaining about “‘malloc.h’ file not found”, open the affected source files and replace malloc.h with stdlib.h.
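One possible way to apply that fix in bulk is the small Python sketch below; it simply patches whichever .c sources in the build directory include malloc.h.

from pathlib import Path

# Run inside word2vec/trunk: swap malloc.h for stdlib.h in every .c file
# that includes it (malloc.h does not exist on e.g. macOS)
for src in Path(".").glob("*.c"):
    text = src.read_text()
    if "<malloc.h>" in text:
        src.write_text(text.replace("<malloc.h>", "<stdlib.h>"))
        print("patched", src)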
Then we are going to train on the text corpus from Matt Mahoney.
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
./word2vec -train text8 -output vectors.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15
Now we have the embeddings file named vectors.txt.
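If you are curious about its layout: the text format produced by -binary 0 starts with a vocabulary-size/dimension header line, followed by one word and its vector components per line. A quick Python peek, assuming that layout:

# Assumed layout for -binary 0 output: a "vocab_size dimension" header,
# then one word followed by its vector components per line
with open("vectors.txt") as f:
    vocab_size, dim = f.readline().split()
    print("vocabulary:", vocab_size, "dimensions:", dim)
    parts = f.readline().split()
    print(parts[0], "->", len(parts) - 1, "components")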
Visualization
Let’s download tsne.py and two helper scripts I have written.
wget https://raw.githubusercontent.com/kingwind/tsneplot/master/tsnePlot.py
wget https://raw.githubusercontent.com/kingwind/tsneplot/master/prepareForTsNE.py
wget https://raw.githubusercontent.com/kingwind/tsneplot/master/tsne.py
Usually we don’t want to plot all the embeddings at the same time, as the result is too noisy. For this Lab Notes entry, suppose we are only interested in the words from the analogy test used in word2vec.
wget https://raw.githubusercontent.com/dav/word2vec/master/data/questions-words.txt
The next step is to prepare the input for visualization using prepareForTsNE.py. The output label_w2v.txt contains the words to plot, and embed_w2v.txt contains the corresponding vectors.
python prepareForTsNE.py -e vectors.txt -s questions-words.txt -l label_w2v.txt -x embed_w2v.txt
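I won’t reproduce prepareForTsNE.py here, but conceptually the preparation step amounts to filtering vectors.txt down to the words that occur in the analogy set. A hedged Python sketch of that idea follows; the real script may differ in details such as case handling.

# Collect every word used in the analogy questions (lines starting with
# ":" are section headers); lowercasing is an assumption, since text8 is
# all lowercase while questions-words.txt capitalizes proper nouns
wanted = set()
with open("questions-words.txt") as f:
    for line in f:
        if not line.startswith(":"):
            wanted.update(w.lower() for w in line.split())

# Keep only the embeddings whose word is in the analogy set, writing
# labels and vectors to separate, line-aligned files
with open("vectors.txt") as f, \
     open("label_w2v.txt", "w") as labels, \
     open("embed_w2v.txt", "w") as embeds:
    next(f)  # skip the "vocab_size dimension" header line
    for line in f:
        word, *vec = line.split()
        if word in wanted:
            labels.write(word + "\n")
            embeds.write(" ".join(vec) + "\n")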
There may be some warnings about words not being found, which is OK.
Now we are ready to visualize the results using t-SNE. Here 200 is the dimension of the vectors, and p is the perplexity, which usually ranges from 20 to 50.
python tsnePlot.py -l label_w2v.txt -v embed_w2v.txt -d 200 -p 50
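For reference, here is a rough Python sketch of what a plotting script like tsnePlot.py might do, assuming the classic tsne.py interface tsne(X, no_dims, initial_dims, perplexity); the actual tsnePlot.py may differ.

import numpy as np
import matplotlib.pyplot as plt
from tsne import tsne  # the tsne.py downloaded above

X = np.loadtxt("embed_w2v.txt")
labels = open("label_w2v.txt").read().split()
# 2-D output, 50 PCA dimensions as preprocessing, perplexity 50
Y = tsne(X, 2, 50, 50.0)
plt.scatter(Y[:, 0], Y[:, 1], s=5)
for (x, y), w in zip(Y, labels):
    plt.annotate(w, (x, y), fontsize=6)
plt.show()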
Please note that each time we run t-SNE we may get a slightly different plot. That’s expected, and an explanation can be found in the t-SNE FAQ.