# Computing word2vec embeddings in pytorch using the Continuous Bag of Words (CBOW) method

In this notebook we demonstrate how to compute word2vec embeddings in pytorch. Word2vec embeddings have two main variants, SkipGram and CBOW (continuous bag of words). Both can be implemented straightforwardly in pytorch, but CBOW is a simpler architecture, so we demonstrate that here. This notebook is based on https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html and https://gist.github.com/GavinXing/9954ea846072e115bb07d9758892382c

In CBOW, the task is: Given a collection of context words, guess the target word with which they appeared. Say we have the sentence

    "I do not like green eggs and ham"
    
Say we remove stop words, and we are left with

   "I like green eggs ham"
   
If we use a context window of 1 word on either side, then here's how the task goes:

* My context words are "I", "green". Guess the target. (And the correct answer would be "like".)
* My context words are "like", "eggs". Guess the target. (And the correct answer would be "green".)

and so on.

In CBOW, the model predicts the target as follows:

Assume that every context vector has some current embedding. We first sum up the current vectors/embeddings of the context words to make an average vector -- so if "I" has the vector <1,2,3> and "green" has the vector <4,5,6>, their sum is <5,7,9>. Then we divide this vector (that is, we divide every dimension of this vector) by the number of context items. 

Call this vector $\mathbf{x}$. We need to map $\mathbf{x}$ to a vector $\mathbf{\hat y}$ of values, one for each word in the vocabulary. We want to train the model so that $\mathbf{\hat y}$ will have the highest value for the actual target word. To get there, we first run $\mathbf x$ through a linear layer with learned weigths $W$: 

$\mathbf{h} = W\mathbf{x}$. 

The output is a vector $\mathbf{h}$, with as many dimensions as there are words in the vocabulary. 


We then apply a nonlinear function to $\mathbf{h}$. The one we use is called the *softmax*: It maps each entry $h_i$ in $h$ to a number $g(h_i)$ that exaggerates the difference between values. That is, bigger $h_i$'s will get much bigger values, and smaller $h_i$'s will get much smaller values. Then it normalizes all the values to sum to one. So we get: 

$\hat{y}_i = \frac{g(h_i)}{g(h_1) + g(h_2) + \ldots} = \frac{g(h_i)}{\sum_j g(h_j)}$

The function $g$ that we use is actually `exp()`, that is, e-to-the-power-of $h_i$. Putting everything together, softmax is

$\hat{y}_i = \frac{exp(h_i)}{\sum_j exp(h_j)}$

# pytorch embeddings

We first get to know yet another data type: pytorch's embeddings. This is a data type where you can keep vectors (embeddings) for many targets. You initialize it with a fixed size, and it randomly sets all the initial embedding values: 

In [1]:
import torch

# this makes 10 embeddings of 2 dimensions each.
a = torch.nn.Embedding(10, 2)

In [2]:
# Let's look at the embeddings.
# As you can see, they are a tensor in which
# gradient-tracking has been switched on.
a.weight

Parameter containing:
tensor([[-0.4478, -1.0901],
        [ 0.3967, -1.1000],
        [-0.7594,  0.6428],
        [-0.1264, -0.3514],
        [ 1.1677,  1.3003],
        [ 0.2831, -1.2989],
        [ 1.0864,  0.0595],
        [ 0.5292,  0.1035],
        [-1.4991, -0.6353],
        [ 2.4363,  0.6902]], requires_grad=True)

In [3]:
# Here is how to address individual embeddings in the collection: 
# You call the embedding object like a function, and you give it as its argument
# the index of the line you want, as a tensor
a(torch.tensor([0]))


tensor([[-0.4478, -1.0901]], grad_fn=<EmbeddingBackward>)

In [4]:
# you can also ask for multiple embeddings at once
b = a(torch.tensor([0, 2, 3]))
b

tensor([[-0.4478, -1.0901],
        [-0.7594,  0.6428],
        [-0.1264, -0.3514]], grad_fn=<EmbeddingBackward>)

In [5]:
# Here is an operation that sums up a given collection of 
# embeddings. 
# This sums up all the first dimensions in b to make one new value,
# and all the second dimensions to make one new value
b.sum(dim = 0)

tensor([-1.3337, -0.7986], grad_fn=<SumBackward1>)

Here is how we will use pytorch embeddings objects below: We will first make a dictionary that maps each word in the vocabulary to an index, called `word_to_ix`. To look up a word's embedding:
* we look up its index,
* turn that index into a tensor,
* and retrieve the embedding for that word index.

In [6]:
import torch
import torch.nn as nn

# mini dictionary mapping words to indices
word_to_ix = {"hello": 0, "world": 1}

# random embeddindgs:
# 2 words in the vocbulary, 5 dimensional embeddings
embeds = nn.Embedding(2, 5)  

# turn the index of the word "hello" into a tensor
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)

# look up the embedding for that index
hello_embed = embeds(lookup_tensor)

# and check out its current embedding
print(hello_embed)

tensor([[-1.8003, -0.2694,  0.4876, -0.5699, -0.7899]],
       grad_fn=<EmbeddingBackward>)


# CBOW

Remember from the Logistic Regression notebook that to make a machine learning model, we need to create a subclass of torch.nn.Module. We again do this. 

Here, we need:

* Inputs: a series of context words
* We sum up the embeddings (vectors) for the context words learned so far. We don't divide by the number of context words because they always stay the same in our implementation. 
* We then send this average context word embedding through a linear layer (sum of weighted values, with a learned weight matrix). number of outputs: size of the vocabulary
* We then compute the softmax over these outputs


In [7]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

# Machine learning model class: continuous bag of words, CBOW
class CBOW(nn.Module):

   # parameters:
    # vocabulary_size: how many words do we have
    # embedding_dim: how many dimensions will our vectors have
    # noword_index: All words will be represented to the model
    #       simply through their index -- it doesn't need to know what the
    #       words are -- with one exception.
    #       When our target word is near the beginning of the sentence,
    #       or near the end of the sentence, 
    #       we won't have as many context words as we would like.
    #       In that case, we send the model a special word NONE,
    #       or rather its index. We tell the model what the index of NONE will be.
    def __init__(self, vocab_size, embedding_dim, noword_index = 0):
        # we call the __init__() function of the super-class
        # that is, of torch.nn.Module,
        # and announce to it the type of our class, CBOW
        super(CBOW, self).__init__()

        # keep embeddings for all words in the vocabulary.
        # (see above for experiments with torch.nn.Embedding)
        # number of embeddings = vocabulary size.
        # dimensionality of embeddings = vector_dimensions
        # We give nn.Embedding the noword index so it knows
        # not to update the embedding for NONE during the backwards pass.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx = noword_index)
        
        # here's the linear layer that we will use.
        # inputs: number of vector dimensions.
        # outputs: number of words in the vocabulary
        self.linear = nn.Linear(embedding_dim, vocab_size)

    # forward step through the network.
    # input: a torch tensor of context word indices,
    # using noword_index for unfilled positions    
    def forward(self, contextword_indices):
        # look up the current embeddings for the context words
        embeds = self.embeddings(contextword_indices)
        
        # sum up the vectors,
        # and reformat the result to be a matrix rather than a vector:
        # view() takes as input a shape, 
        # a tuple of rows and columns.
        # Number of rows is one: one row.
        # Number of columns is set to -1: use as many columns as the input did.
        # We don't divide by number of context items because that is always the same here.
        embeds = (embeds.sum(dim=0)).view((1, -1))
        
        
        # run through the linear layer
        out = self.linear(embeds)
        
        # nonlinearity: softmax.
        # This gives more weight to the high-weight outputs,
        # and downweights the low-weight outputs        
        log_probs = F.log_softmax(out, dim=1)

        return log_probs


# Preprocessing the data

We again work with the Brown corpus, in particular the "fiction" subset. 

Here is how to remove stopwords from a sentence in the Brown corpus:

In [None]:
import nltk
import string

# accessing the list of NLTK English stopwords
stopwords = nltk.corpus.stopwords.words('english')
# and adding punctuation symbols
stopwords += list(string.punctuation)

In [19]:
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [20]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [21]:
list(string.punctuation)

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [8]:
# This is the second sentence in the Fiction part of the Brown corpus
print(list(nltk.corpus.brown.sents(categories="fiction"))[1])

# This is the words from that sentence that remain when we remove stopwords
print([w for w in list(nltk.corpus.brown.sents(categories = "fiction"))[1] if w not in stopwords])

['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.']
['Scotty', 'go', 'back', 'school']


Counting words in the Brown fiction subcorpus using an NLTK frequency distribution object. As you can see, some punctuation symbols remain because they don't match anything on the `string.punctuation` list -- this is the case in particular for multi-character punctuation. We're not fixing this issue here. 

In [9]:
brown_wordcounts = nltk.FreqDist(w.lower() for w in nltk.corpus.brown.words(categories = "fiction") 
                                 if w.lower() not in stopwords)
brown_wordcounts

FreqDist({'``': 703, "''": 698, 'would': 291, 'said': 194, 'one': 184, '--': 176, 'could': 168, 'like': 151, 'man': 112, 'back': 104, ...})

We next again make a function that iterates through text and yields each targte with its context words in turn. Design decisions:

* We use a one word context window on either side of the target. For a target that's the first or last word in the sentence, we add a dummy context word "NONE". 
* We only count words that appear at least 5 times in our data. This is a low frequency threshold. We use it because our whole dataset is rather small. If you have a decent-sized dataset, use a higher frequency threshold. 
* We lowercase all words.
* The function is specific to the Brown corpus at this point. The only choice you get is in the Brown category to use. 

In [10]:
import nltk
import string

# function for iterating through each target with its context in the Brown corpus.
# parameters:
# wordcounts: this is an NLTK FreqDist object that maps each word to its counts
# min_wordcount: minimum word count for words to be included,
# stopwords: a stopword list
# brown_category: Brown corpus category to use
def each_target_contexts(wordcounts, min_wordcount, stopwords, brown_category):
    # iterate through sentence in that section of Brown
    for sent in nltk.corpus.brown.sents(categories = brown_category):
        # lowercase everything.
        # keep only non-stopwords that appeared at least min_wordcount times in the corpus
        cleansent = [w.lower() for w in sent if w not in stopwords and wordcounts[w] >= min_wordcount]
        
        if len(cleansent) < 2:
            # only a single word remaining, or no words at all: skip this sentence
            continue
            
        # iterate over all targets in the sentence
        for targetindex, target in enumerate(cleansent):
            # determine previous word, or NONE
            if targetindex > 0:
                prev_word = cleansent[targetindex -1]
            else: 
                prev_word = "NONE"
                
            # determine next word, or NONE
            if targetindex < len(cleansent) - 1:
                next_word = cleansent[targetindex + 1]
            else:
                next_word = "NONE"
                
            # yield: target, and the pair of previous word and next word
            yield(target, (prev_word, next_word))
                
            

We now make a `word_to_ix` dictionary mapping each target word to an index, as discussed above. 

In [11]:
stopwords = nltk.corpus.stopwords.words('english')
stopwords += list(string.punctuation)

min_wordcount = 5

# making a lookup dictionary that maps each word to its index.
# we give the index 0 to NONE 
frequent_words = sorted([w for w in brown_wordcounts if brown_wordcounts[w] >= min_wordcount])
frequent_words = ["NONE"] + frequent_words

word_to_ix = { }
for wordindex, word in enumerate(frequent_words):
    word_to_ix[word] = wordindex
    

# Training the model

In [12]:
# Hyperparameters:
# number of dimensions for the vectors
embedding_size = 100
# learning rate for the model: how far to step in the direction of "lowest error"
learning_rate = 0.01
# number of words in our vocabulary
vocabulary_size = len(word_to_ix)
# how many times to iterate through the training data. 
num_epochs = 50

# loss function: cross entropy loss, to go with the nonlinearity we are using,
# which is softmax
lossfunction = nn.CrossEntropyLoss()
# we use the CBOW model we defined above
model = CBOW(vocabulary_size, embedding_size)
# optimizer: stochastic gradient descent
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

In [13]:
from torch.autograd import Variable

# This is the actual training loop
for epoch in range(num_epochs):
    # we keep track of the loss, as determined by the loss function,
    # to see if it goes down over epochs
    total_loss = 0
    # iterating through all the targetwords, with their sets of context words,
    # in the Brown corpus section we chose
    for targetword, contextwords in each_target_contexts(brown_wordcounts, min_wordcount, 
                                                         stopwords, "fiction"):
        
        # look up the target word in word_to_ix to determine its index,
        # turn the index into a pytorch tensor
        targetindex = torch.tensor([word_to_ix[targetword]], dtype=torch.long)
        # same for the context words
        contextindices = torch.tensor([word_to_ix[c] for c in contextwords], dtype=torch.long)
        
        # forward run through the model to make a prediction
        y_pred = model(contextindices)

        # how wrong was our model? we ask the loss function
        # to compare the prediction (which gives a weight to each possible word index)
        # and the actual target index
        loss = lossfunction(y_pred, targetindex)
        
        # remember how wrong our model was, for reporting
        total_loss += loss.item()
        
        # set all the gradients to zero 
        # before we do the backwards step:
        # pytorch accumulates gradients, and we don't want to 
        # muddle up our new gradients with the gradients from the
        # previous step
        optimizer.zero_grad()
        
        # backwards step: determine how to change all weights
        # based on the loss
        loss.backward()

        # update all parameters (weights)
        optimizer.step()
        
    # report how we did in this epoch
    print("Epoch", epoch + 1, "Loss", round(total_loss, 3))


Epoch 1 Loss 22689.144
Epoch 2 Loss 19093.579
Epoch 3 Loss 16462.106
Epoch 4 Loss 14349.956
Epoch 5 Loss 12671.192
Epoch 6 Loss 11352.195
Epoch 7 Loss 10335.1
Epoch 8 Loss 9564.016
Epoch 9 Loss 8985.135
Epoch 10 Loss 8551.103
Epoch 11 Loss 8223.397
Epoch 12 Loss 7972.599
Epoch 13 Loss 7777.334
Epoch 14 Loss 7622.531
Epoch 15 Loss 7497.648
Epoch 16 Loss 7395.256
Epoch 17 Loss 7310.055
Epoch 18 Loss 7238.22
Epoch 19 Loss 7176.951
Epoch 20 Loss 7124.172
Epoch 21 Loss 7078.301
Epoch 22 Loss 7038.11
Epoch 23 Loss 7002.627
Epoch 24 Loss 6971.078
Epoch 25 Loss 6942.842
Epoch 26 Loss 6917.421
Epoch 27 Loss 6894.409
Epoch 28 Loss 6873.476
Epoch 29 Loss 6854.347
Epoch 30 Loss 6836.792
Epoch 31 Loss 6820.621
Epoch 32 Loss 6805.672
Epoch 33 Loss 6791.807
Epoch 34 Loss 6778.909
Epoch 35 Loss 6766.874
Epoch 36 Loss 6755.617
Epoch 37 Loss 6745.06
Epoch 38 Loss 6735.136
Epoch 39 Loss 6725.789
Epoch 40 Loss 6716.964
Epoch 41 Loss 6708.617
Epoch 42 Loss 6700.708
Epoch 43 Loss 6693.199
Epoch 44 Loss 6686

# Inspecting the trained model

Let's look at the embeddings we got from our model

In [14]:
# what words do we have in our system?
word_to_ix.keys()

dict_keys(['NONE', "''", '--', '``', 'abel', 'able', 'accepted', 'across', 'act', 'ada', 'adam', 'added', 'addressed', 'afraid', 'afternoon', 'age', 'ago', 'agreed', 'ahead', "ain't", 'air', 'alert', 'alex', "alex's", 'almost', 'alone', 'along', 'aloud', 'already', 'also', 'although', 'always', 'amen', 'among', 'amy', 'andrei', 'andrus', 'another', 'answer', 'answered', 'antelope', 'anyone', 'anything', 'apartment', 'appear', 'appeared', 'approached', 'argiento', 'arm', 'arms', 'army', 'around', 'arranging', 'arrived', 'ask', 'asked', 'asking', 'asleep', 'ate', 'attention', 'audience', 'awake', 'aware', 'away', 'awful', 'baby', 'back', 'bad', 'bag', 'ball', 'bank', 'barely', 'bastard', 'bastards', 'battle', 'bay', 'beautiful', 'became', 'become', 'becoming', 'bed', 'bedroom', 'began', 'beginning', 'begun', 'behind', 'believe', 'beneath', 'bern', 'beside', 'best', 'better', 'beyond', 'bible', 'big', 'bird', 'birds', 'bit', 'black', 'blind', 'blood', 'blue', 'board', 'boat', 'bobby', 'bo

In [15]:
#how about reasonably frequent words?
brown_wordcounts.most_common(200)

[('``', 703),
 ("''", 698),
 ('would', 291),
 ('said', 194),
 ('one', 184),
 ('--', 176),
 ('could', 168),
 ('like', 151),
 ('man', 112),
 ('back', 104),
 ('time', 103),
 ('came', 91),
 ('get', 84),
 ('little', 82),
 ('old', 82),
 ('went', 79),
 ('know', 78),
 ('thought', 76),
 ('two', 76),
 ('go', 74),
 ('men', 73),
 ('looked', 73),
 ('never', 72),
 ('around', 71),
 ('house', 69),
 ('room', 63),
 ('even', 63),
 ('still', 63),
 ('way', 63),
 ('eyes', 61),
 ('good', 60),
 ('made', 60),
 ('see', 59),
 ('knew', 59),
 ('face', 58),
 ('felt', 58),
 ('saw', 58),
 ('come', 58),
 ('long', 56),
 ('church', 56),
 ('seemed', 55),
 ('away', 55),
 ('must', 55),
 ('first', 55),
 ('head', 54),
 ('well', 54),
 ('day', 53),
 ('night', 53),
 ('home', 52),
 ('big', 52),
 ('take', 51),
 ('make', 51),
 ('hand', 51),
 ('got', 51),
 ('much', 50),
 ('asked', 50),
 ('always', 49),
 ('another', 49),
 ('new', 49),
 ('told', 48),
 ('door', 47),
 ('going', 46),
 ('something', 45),
 ('life', 45),
 ('right', 44),
 (

Getting the trained embeddings out of the model: We access the embeddings, remove the gradient-computing information, and turn them into a numpy matrix.

In [16]:
# model.embeddings.weights holds all the embeddings.
# they still have the gradient information in them. 
# the function detach() removes that.
# After that, we can turn the embeddings into a numpy matrix
# with the pytorch tensor method numpy()
embeddings = model.embeddings.weight.detach().numpy()

Now we can look up the vector for  a word. Because we now have a numpy matrix and no longer an Embedding object, we use straight brackets to access a row, and we can just use a number, instead of a tensor, to do the indexing. 

In [17]:
wordvec = embeddings[word_to_ix["house"]]
wordvec

array([-1.4872195 ,  0.06641939,  1.2982209 , -0.61101246,  0.00883464,
       -0.23270789,  1.5696975 , -0.7494112 , -0.22670805, -1.0835124 ,
        0.4791768 , -1.2125345 , -1.6647613 , -0.17058022, -1.406579  ,
       -0.42746887, -0.17631586, -0.58749884, -0.4205233 ,  0.6934005 ,
        1.677947  , -1.7019265 ,  1.9020253 ,  0.32155886, -0.13052379,
        1.5877124 , -1.9082692 , -1.6059185 , -0.15644395,  0.5727711 ,
       -0.20309737, -2.5810702 , -0.9190053 ,  0.88198626,  0.37856722,
       -0.16648917,  0.7719214 ,  2.2452211 ,  0.38636082, -0.51983285,
        0.5675817 , -1.061423  ,  0.6014003 ,  1.0890468 ,  0.8471677 ,
       -0.86041206,  0.3903918 ,  0.03362339,  0.43421182, -0.15803993,
       -0.36676618, -0.8566278 , -0.12619567, -1.6321291 ,  0.7332814 ,
       -0.71833163,  0.4015565 , -0.75988644, -1.9280918 ,  1.0582693 ,
        0.7857137 , -0.83869463,  1.5704052 , -0.29472378,  0.9063143 ,
        0.43883145,  0.8523466 , -1.388142  , -0.46068442, -0.55

Now that we have normal vectors, we can again use all the same methods as before to inspect them: Pairwise cosine, nearest neighbors, and so on. Here, for example, is the similarity of "boy" and "girl" in our model:

In [18]:
import scipy

def cosine_sim(vec1, vec2):
    return 1 - scipy.spatial.distance.cosine(vec1, vec2)

vec1 = embeddings[word_to_ix["house"]]
vec2 = embeddings[word_to_ix["room"]]
cosine_sim(vec1, vec2)


-0.010986201465129852