# Computing a Word2vec space with gensim

Computing a prediction-based meaning space with gensim takes only a few lines of code.

Prediction-based spaces have two main advantages. First, they are typically faster to compute than count-based spaces. Second, they give you dense, low-dimensional vectors (a few hundred dimensions per word) that are easier to store than huge, sparse spaces.
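
To make the storage point concrete, here is a back-of-the-envelope sketch. The vocabulary size of 10,000 is a made-up figure, and the calculation only counts nominal matrix cells; a sparse storage format keeps only the nonzero counts, so real memory use depends on the data.

```python
# Hypothetical numbers, chosen only for illustration
vocab_size = 10_000   # assumed number of target words (and context words)
dense_dims = 300      # number of dimensions of a dense word2vec-style space

# A word-by-word count space has one cell per (target, context) pair
count_cells = vocab_size * vocab_size

# A dense space has one cell per (target, dimension) pair
dense_cells = vocab_size * dense_dims

print("count-based cells:", count_cells)
print("dense cells:      ", dense_cells)
print("ratio:            ", count_cells / dense_cells)
```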

As before, we first demonstrate the method with a tiny corpus:

In [1]:
sam_corpus = """I am Sam. Sam I am. I do not like green eggs and ham."""

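# We split the corpus up into sentences.
# sent_tokenize needs NLTK's Punkt tokenizer data; if it is missing,
# nltk.download("punkt") fetches it (newer NLTK versions may ask
# for "punkt_tab" instead).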
import nltk
sam_sents = nltk.sent_tokenize(sam_corpus)
sam_sents

['I am Sam.', 'Sam I am.', 'I do not like green eggs and ham.']

In [2]:
# Then we split each sentence up into words. 
# We now have a list of sentences, each of which is a list of words.
sam_sent_words = [ nltk.word_tokenize(s) for s in sam_sents]
sam_sent_words

[['I', 'am', 'Sam', '.'],
 ['Sam', 'I', 'am', '.'],
 ['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham', '.']]

In [3]:
# This is the format that gensim takes:
# a corpus that is a list of sentences,
# where each sentence is a list of words

from gensim.models import Word2Vec

# parameters: 
# corpus as a list of sentences,
# epochs: number of epochs, that is, number of times the
#   training goes through the whole corpus.
# min_count: minimum count for words to include. This
#   should be larger than 1, but the Sam corpus is so tiny 
#   that we keep all words
# sg: 1 selects the skip-gram method (the default, 0, is CBOW)
space_sam = Word2Vec(sam_sent_words, epochs=10, min_count=1, sg=1)
space_sam

<gensim.models.word2vec.Word2Vec at 0x1084efdc0>
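
With `sg=1`, the model is trained to predict the context words around each target word. The sketch below shows what such (target, context) pairs could look like for the Sam corpus. It only illustrates the idea and is not gensim's internal implementation; the helper name `skipgram_pairs` and the window size of 1 are assumptions made for this example (gensim's default window is 5).

```python
def skipgram_pairs(sentences, window=1):
    """Collect (target, context) pairs within `window` words of each target."""
    pairs = []
    for sent in sentences:
        for i, target in enumerate(sent):
            # context positions: up to `window` words to the left and right
            start = max(0, i - window)
            end = min(len(sent), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((target, sent[j]))
    return pairs

# First two sentences of the Sam corpus, already tokenized
sam_sent_words = [['I', 'am', 'Sam', '.'],
                  ['Sam', 'I', 'am', '.']]

pairs = skipgram_pairs(sam_sent_words, window=1)
for pair in pairs[:6]:
    print(pair)
```

Pairs never cross sentence boundaries, which is one reason the corpus is passed in as a list of sentences rather than as one long list of words.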

We now move to a corpus that is not quite so tiny, so we can demonstrate cosine similarity.

The Brown corpus, a corpus of 1 million words, is sampled from different genres. You can select a genre with the `categories` parameter, like this:

In [4]:
# The Brown corpus data must be installed; if it is missing,
# nltk.download("brown") fetches it.
from nltk.corpus import brown

# The first three sentences of the Brown fiction section
list(brown.sents(categories="fiction"))[:3]

[['Thirty-three'],
 ['Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.'],
 ['His', 'parents', 'talked', 'seriously', 'and', 'lengthily', 'to', 'their',
  'own', 'doctor', 'and', 'to', 'a', 'specialist', 'at', 'the', 'University',
  'Hospital', '--', 'Mr.', 'McKinley', 'was', 'entitled', 'to', 'a', 'discount',
  'for', 'members', 'of', 'his', 'family', '--', 'and', 'it', 'was', 'decided',
  'it', 'would', 'be', 'best', 'for', 'him', 'to', 'take', 'the', 'remainder',
  'of', 'the', 'term', 'off', ',', 'spend', 'a', 'lot', 'of', 'time', 'in',
  'bed', 'and', ',', 'for', 'the', 'rest', ',', 'do', 'pretty', 'much', 'as',
  'he', 'chose', '--', 'provided', ',', 'of', 'course', ',', 'he', 'chose',
  'to', 'do', 'nothing', 'too', 'exciting', 'or', 'too', 'debilitating', '.']]

The fiction section is too small to give us good word vectors, but at least it is not tiny. It has about 68,500 words:

In [5]:
len(brown.words(categories="fiction"))

68488

We can use gensim's Word2Vec in exactly the same way here as with the Sam corpus: 

In [6]:
%%time
# %%time is not Python; it is a Jupyter cell magic.
# It times how long the cell takes to run, here how long it takes
# to compute the meaning space.

# parameters: 
# corpus as a list of sentences,
# epochs: number of epochs, that is, number of times the
#   training goes through the whole corpus.
# min_count: minimum count for words to include. We set it to 10. 
#   10, 20, or maybe 50 for large corpora are reasonable numbers here.
# sg: 1 selects the skip-gram method
# vector_size: number of dimensions to use
w2vec_fiction = Word2Vec(brown.sents(categories="fiction"), epochs=10, min_count=10, vector_size=300, sg=1)

CPU times: user 2.07 s, sys: 49 ms, total: 2.12 s
Wall time: 987 ms


What target words do we have in our space?

Note that we need the model's `wv` attribute (short for word vectors) to access the gensim space functions you already know.

In [7]:
space = w2vec_fiction.wv

space.key_to_index.keys()

dict_keys([',', '.', 'the', 'and', 'to', 'of', 'a', 'was', 'in', 'he', 'his', 'had', '``', "''", '?', 'that', 'I', 'He', 'with', 'it', 'on', 'her', 'for', 'him', 'The', 'at', ';', 'as', 'not', 'would', '!', 'she', 'be', 'were', 'you', 'they', 'from', 'out', 'but', 'said', 'up', 'all', '--', 'them', 'about', 'one', 'or', 'could', 'have', 'by', 'their', 'been', 'an', 'there', 'It', 'like', 'into', 'this', 'She', 'is', 'me', 'when', 'no', 'down', 'what', 'which', 'my', 'did', 'so', 'man', 'who', 'back', 'now', 'time', 'over', 'if', 'came', 'But', 'some', 'do', 'we', ':', 'They', 'more', 'little', 'went', 'get', 'where', 'then', 'thought', 'know', 'old', 'only', 'And', 'before', 'looked', 'men', 'go', 'never', 'around', 'himself', 'There', 'two', 'again', 'room', 'way', 'off', 'made', 'His', 'eyes', 'here', 'through', 'knew', 'When', 'face', 'saw', 'A', 'too', 'felt', 'What', 'see', 'even', 'own', 'must', 'seemed', "don't", 'good', 'come', 'away', 'In', 'still', 'head', 'just', 'how', ...

We can now compute cosine similarity as before:

In [8]:
space.similarity("doctor", "family")

0.97231007
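
As a reminder of what `similarity` computes, here is cosine similarity written out in plain Python. The three vectors are toy values made up for this sketch, not vectors from the trained space.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical values)
u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]   # points in the same direction as u
w = [0.0, 0.0, 3.0]   # orthogonal to u

print(cosine(u, v))   # close to 1: same direction
print(cosine(u, w))   # 0.0: no shared direction
```

Cosine depends only on the direction of the vectors, not on their length, so `v` counts as maximally similar to `u` even though it is twice as long.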

In [9]:
space.most_similar("woman")

[('young', 0.9823521971702576),
 ('whose', 0.975893497467041),
 ('new', 0.9735254645347595),
 ('died', 0.9667412042617798),
 ('flesh', 0.9665197134017944),
 ('play', 0.966352641582489),
 ('once', 0.9643639922142029),
 ('death', 0.9642722606658936),
 ('ten', 0.9630625247955322),
 ('sort', 0.9629089832305908)]
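
Conceptually, `most_similar` compares the query word with every other word in the space and returns the highest-scoring ones. The sketch below does this for a four-word toy space. The vectors and the helper name `most_similar_toy` are made up for illustration, and gensim's own implementation is vectorized rather than a Python loop.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar_toy(space, query, topn=2):
    """Rank all other words in `space` by cosine similarity to `query`."""
    scored = [(word, cosine(space[query], vec))
              for word, vec in space.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy space with hypothetical 3-dimensional vectors
toy_space = {
    "cat":   [1.0, 0.9, 0.0],
    "dog":   [0.9, 1.0, 0.0],
    "car":   [0.0, 0.1, 1.0],
    "truck": [0.1, 0.0, 1.0],
}

neighbors = most_similar_toy(toy_space, "cat", topn=2)
print(neighbors)
```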

Let's compare this space to one trained on the Brown romance section:

In [10]:
%%time
w2v_romance = Word2Vec(brown.sents(categories="romance"), epochs=10, min_count=10, vector_size=300, sg=1)
space_romance = w2v_romance.wv

CPU times: user 2.95 s, sys: 854 ms, total: 3.8 s
Wall time: 1.23 s


In [11]:
space_romance.most_similar("woman")

[('life', 0.9812353253364563),
 ('wife', 0.9738490581512451),
 ('A', 0.9734092354774475),
 ('Hanford', 0.9696579575538635),
 ('letter', 0.9694443941116333),
 ('couple', 0.9689934253692627),
 ('lost', 0.9676991701126099),
 ('forever', 0.9664785861968994),
 ('job', 0.9653080701828003),
 ('strange', 0.9647602438926697)]
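
One simple way to quantify how much two neighbor lists differ is to count the words they share. The sketch below does this for the two lists of neighbors of "woman" shown above, copied in by hand. The helper name `neighbor_overlap` is hypothetical, and the Jaccard coefficient is just one possible measure.

```python
# Neighbors of "woman", copied from the two outputs above
fiction_neighbors = ['young', 'whose', 'new', 'died', 'flesh',
                     'play', 'once', 'death', 'ten', 'sort']
romance_neighbors = ['life', 'wife', 'A', 'Hanford', 'letter',
                     'couple', 'lost', 'forever', 'job', 'strange']

def neighbor_overlap(a, b):
    """Return the shared words and the Jaccard coefficient of two word lists."""
    set_a, set_b = set(a), set(b)
    shared = set_a & set_b
    jaccard = len(shared) / len(set_a | set_b)
    return shared, jaccard

shared, jaccard = neighbor_overlap(fiction_neighbors, romance_neighbors)
print(shared, jaccard)
```

The two lists have no word in common. With a trained space, the word lists could be obtained with `[w for w, _ in space.most_similar("woman")]`. Because word2vec training involves randomness, rerunning the notebook may give different neighbors.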