# Python dictionaries: Mapping keys to values

In our initial example of "how do I count how often each word appears in a newspaper article", the central data structure was a "sheet of counts": For each word, we kept a sequence of dashes indicating how often we had seen that word.

Our end goal in this notebook is to produce such "sheets of counts" to get a sense of the overall topic of a text, or a collection of texts. This will not be perfect, as the raw counts will contain plenty of words that do not give us a good sense of overall topic, but it will be a good first stab at the problem. We will want to use "sheets of counts" to get a sense of topic for (1) all state of the union addresses, and (2) words in wine reviews, for both cheap and less cheap wines. 

Another way to look at the "sheets of counts" is that we *connected* each word to a count (a sequence of dashes), or that we *mapped* each word to a count. In Python, you can do things like this with a *dictionary*, a data structure that maps *keys* to *values*. 

One example of how you could use a dictionary is to map words (the keys) to counts (the values):

```
{
"the" : 3,
"recent" : 1,
"development" : 1,
"that" : 2
}
```

Here is another example: an actual translation dictionary, mapping English words (the keys) to their German translations (the values):

```
{
"dog" : "Hund",
"cat" : "Katze",
"rhino" : "Nashorn"
}
```

In the first case, the keys were strings, and the values were numbers. In the second case, the keys were again strings, and the values were also strings. Dictionaries are flexible about this. (And the keys could be numbers, or other data types.)

Here is how to construct a Python dictionary from scratch:

In [1]:
latindict = { "dog": "canis", "cat":"felis", "rhino": "rhinoceros", 
            "mouse": "mus"}

print("result of dictionary making:", latindict)

result of dictionary making: {'dog': 'canis', 'cat': 'felis', 'rhino': 'rhinoceros', 'mouse': 'mus'}


Or you can start with an empty dictionary and fill it one step at a time:

In [2]:
mydict = {}
print("Step 1 mydict is", mydict)

mydict["dog"] = "Hund"
print("Step 2 mydict is", mydict)

mydict["cat"] = "Katze"
print("Step 3 mydict is", mydict)

mydict["rhino"] = "Nashorn"
print("Step 4 mydict is", mydict)

Step 1 mydict is {}
Step 2 mydict is {'dog': 'Hund'}
Step 3 mydict is {'dog': 'Hund', 'cat': 'Katze'}
Step 4 mydict is {'dog': 'Hund', 'cat': 'Katze', 'rhino': 'Nashorn'}


Here is how you can access a dictionary. Give the key to get the corresponding value:

In [3]:
mydict["dog"]

'Hund'

In [4]:
latindict["mouse"]

'mus'

Note that when you make a new dictionary, you need to use curly brackets, as in:

```mydict = { }```

But when you access the dictionary, or when you add a single entry to a dictionary, you use straight brackets, as in:

```latindict["mouse"]```

## Comparing Python dictionaries and Python lists

Let's compare notation across dictionaries and lists.

Initializing to an empty data structure:
```
mylist = []       # empty list: straight brackets
mydict = {}    # empty dictionary: curly brackets
```

Initializing to a nonempty data structure:
```
# initializing a list: straight brackets
mylist = ["dog", "cat", "rhinoceros"]
# initializing a dictionary: curly brackets, key-colon-value
mydict = {"dog":"Hund", "cat":"Katze",  "rhinoceros":"Nashorn"}
```

Accessing items on a list: index in straight brackets. A list “maps” indices to items.

```
mylist[1]  ### will yield 'cat'
```

Accessing items on a dictionary: key in straight brackets. A dictionary maps keys to values.

```
mydict['cat'] ### will yield 'Katze'
```

The standard way to modify a list is via ```append()```:

In [5]:
mylist = ["dog", "cat", "rhinoceros"]
mylist.append("armadillo")
print("changed list", mylist)

changed list ['dog', 'cat', 'rhinoceros', 'armadillo']


The standard way to modify a dictionary is to store a value under a key. If you had a key/value pair before and change the value, the previous value is gone.

In [6]:
mydict = {"dog":"Hund", "cat":"Katze"}
print("original dictionary", mydict)

# adding an item 
# (the German word for 'armadillo' means literally belt-animal)
mydict["armadillo"] = "Guerteltier"
print("dictionary after adding an item", mydict)

# changing a value. 
# the previous value is gone.
mydict["cat"] = "felis"
print("dictionary after changing the value", mydict)

original dictionary {'dog': 'Hund', 'cat': 'Katze'}
dictionary after adding an item {'dog': 'Hund', 'cat': 'Katze', 'armadillo': 'Guerteltier'}
dictionary after changing the value {'dog': 'Hund', 'cat': 'felis', 'armadillo': 'Guerteltier'}


In a list, you get an error when you try to access a list index that isn't there. In the same way, in a dictionary, you get an error when you try to access a key that isn't there.

In [7]:
mylist = ["dog", "cat", "rhinoceros"]
# remove the hash in the beginning of the 'print' line
# that is, "uncomment" the 'print' line,
# to get an IndexError.
# print("this will get you an error", mylist[10])

In [8]:
mydict = {'dog':'Hund', 'rhinoceros':'Nashorn'}
# remove the hash in the beginning of the 'print' line
# that is, "uncomment" the 'print' line,
# to get a KeyError.
# print('this will get you an error', mydict['cat'])

## Dictionary keys and dictionary values

**What can be a dictionary key?**

Strings can be dictionary keys:
```mydict = {"dog":"Hund", "rhinoceros":"Nashorn"}```

Integers can be dictionary keys, for example: 

```prime_nums = {2:1, 3:2, 5:3, 7:4, 11:5}```

Floating point numbers can be dictionary keys as well:
```mydict = {3.1415 : "pi", 2.71828 : "e" }```

Not everything can be a dictionary key; lists, for example, cannot. (If you're interested in the details: It's because lists are "mutable", that is, you can change individual items on a list, and that would mess up the way dictionaries are represented internally in Python.)

**What can be a dictionary value?**

Any data type can be a dictionary value. Even a dictionary can be a dictionary value.
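Here is a small sketch of both points. The dictionaries `animal_info` and `bigram_counts` are made up for illustration. The second one uses tuples as keys; tuples are introduced further down in this notebook, and they work as keys because they are immutable.

```python
# A dictionary whose values are themselves dictionaries.
animal_info = {
    "dog": {"german": "Hund", "latin": "canis"},
    "cat": {"german": "Katze", "latin": "felis"},
}
# Two lookups in a row: first the outer key, then the inner key.
print(animal_info["dog"]["latin"])

# A tuple is immutable, so it can serve as a key.
bigram_counts = {("the", "dog"): 2, ("a", "cat"): 1}
print(bigram_counts[("the", "dog")])

# A list is mutable, so Python refuses to use it as a key.
try:
    bad_dict = {["the", "dog"]: 2}
except TypeError as error:
    print("TypeError:", error)
```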

## Checking whether a key is present

You can use ```in``` to check whether a key is in a dictionary:

In [9]:
mydict = {"dog":"Hund", "cat":"Katze", "armadillo":"Guerteltier"}
print("is 'mouse' present?", 'mouse' in mydict)
print("is 'armadillo' present?", "armadillo" in mydict)

is 'mouse' present? False
is 'armadillo' present? True


The Boolean expression with ```in``` checks keys; it does not check values:

In [10]:
mydict = {"dog":"Hund", "cat":"Katze", "armadillo":"Guerteltier"}
'Katze' in mydict

False
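If you do want to check values, one option is to apply `in` to `mydict.values()` (the `values()` method is discussed further down in this notebook). A minimal sketch, reusing the dictionary from above:

```python
mydict = {"dog": "Hund", "cat": "Katze", "armadillo": "Guerteltier"}

# 'in' applied to the dictionary itself looks only at the keys
key_present = "Katze" in mydict
# 'in' applied to mydict.values() looks only at the values
value_present = "Katze" in mydict.values()

print("is 'Katze' a key?", key_present)
print("is 'Katze' a value?", value_present)
```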

**Try it for yourself**

* Say we have the following dictionary:

In [11]:
dict1 = { "dog" : "Hund", "armadillo" : "Guerteltier"}

Please add to this dictionary the translation of platypus, which is Schnabeltier (literally, beak animal).

* Say we have the following dictionary: 

In [12]:
dict2 = { "the" : 1, "a" : 2}

Please change the entry for "the" to be two instead of one.

Here is a piece of code that gives you an example of how to do the next "Try it for yourself" below. It iterates over a list of function words, using each of them as a key to retrieve the matching value from the dictionary. 

In [13]:
mylist = [ "the", "and", "a"]
dictionary_of_counts = { "house" : 2, "armadillo" : 1, 
                         "the" : 21, "recent" : 2, 
                         "said": 3, "a" : 15,
                         "went": 2, "and": 3,
                         "yellow": 1}
print("function word counts")
for word in mylist:
    print(word, dictionary_of_counts[word])

function word counts
the 21
and 3
a 15


**Try it for yourself** 

Here is a mini German/English dictionary: 

In [14]:
mydict = {"befreit":"liberated", "baeche":"brooks", 
          "eise":"ice", "sind":"are", "strom":"river", 
          "und":"and", "vom":"from"}

Can you use this dictionary to do a bad translation of the following German sentence? (Hint: this should look a lot like the use case above where we iterated over a list and printed out dictionary values for items on the list.)

In [15]:
mysent = "vom eise befreit sind strom und baeche"




*Warning:* Solution below, don't read on if you want to solve the "bad translation" problem for yourself.

...

...

...

...

...

...

...

...









Here is a solution. (Note that this is not how you want your machine translation to work! The translations that you get this way are terrible.)

In [16]:
mydict = {"befreit":"liberated", "baeche":"brooks", "eise":"ice", "sind":"are", "strom":"river", "und":"and", "vom":"from"}
mysent = "vom eise befreit sind strom und baeche"
for german_word in mysent.split():
    print( mydict[ german_word], end = " ")
print()

from ice liberated are river and brooks 


Adding the parameter ```end = " "``` puts a space instead of a linebreak at the end of what is printed. That way, multiple "print" outputs land on the same line.
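An alternative sketch, in case you would rather have the translation as a single string than print it piece by piece: collect the translated words in a list (the accumulation pattern), then glue them together with the string method `join()`. The variable names `translated_words` and `translation` are just for illustration.

```python
mydict = {"befreit": "liberated", "baeche": "brooks", "eise": "ice",
          "sind": "are", "strom": "river", "und": "and", "vom": "from"}
mysent = "vom eise befreit sind strom und baeche"

# aggregator variable: a list of translated words
translated_words = []
for german_word in mysent.split():
    translated_words.append(mydict[german_word])

# join the list items into one string, separated by single spaces
translation = " ".join(translated_words)
print(translation)
```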

# A dictionary as a collection of variables/containers

In a way, you can view a dictionary as a collection of containers, each of which you address by the key. 

First, let's see what happens when we have individual variables. Here's a variable storing the translation of "platypus". 

In [17]:
platypus_translation = "Schnabeltier"
platypus_translation

'Schnabeltier'

And of "cat":

In [18]:
cat_translation = "Katze"
cat_translation

'Katze'

We can make more of these, but we need to know, ahead of time, how many translations we want to store. In a dictionary, we can always store more, as needed:

In [19]:
mydict = {"cat":"Katze", "platypus":"Schnabeltier"}
mydict["platypus"]

'Schnabeltier'

Adding an entry:

In [20]:
mydict["armadillo"] = "Guerteltier"
mydict

{'cat': 'Katze', 'platypus': 'Schnabeltier', 'armadillo': 'Guerteltier'}

Like with individual variables, you can update the value that goes with a key. Here is an example. Say you want to count occurrences of words, and you've seen one more "the". Then you record it like this: 

In [21]:
mydict = {"the":1, "and": 1, "of": 1}
mydict["the"] = mydict["the"] + 1
mydict

{'the': 2, 'and': 1, 'of': 1}

Compare this to how you would change the contents of an individual variable/container:

In [22]:
counter = 0
mylist = ["a", "b", 'a']
for item in mylist:
    if item == "a":
        counter = counter + 1
      
counter

2

**Try it for yourself**:

* Say you have a dictionary of counts, as shown below. Now you see the following additional words: "watch", "the",  "bird". Add these counts. (Watch out: You need to do something different for the words that are already in the dictionary, in this case "the", than for the words that are not there yet, in this case "watch" and "bird".)

## Counting words in a text

We can use this idea of a dictionary as a collection of variables/containers to count occurrences of words in a text.

First, here is how you can count occurrences of just one word (here: "to") in a text:

In [23]:
# paragraph from the Onion, March 04
paragraph = """While dieters are accustomed to exercises of will, 
a new English translation of Germany's most popular diet book 
takes the concept to a new philosophical level. 
The Nietzschean diet, which commands its adherents to eat 
superhuman amounts of whatever they most fear, 
is developing a strong following in America."""

count_to = 0
for word in paragraph.split():
    if word == "to":
        count_to = count_to + 1

print( count_to )

3


Now suppose we want to count occurrences of all words at the same time. 
Then we can use a Python dictionary as a collection of containers, one for each word. The words are the keys, and their counts are the values. Every time we encounter a word, we add one to its value in the dictionary.

In [24]:
# paragraph from the Onion, March 04
paragraph = """While dieters are accustomed to exercises of 
will, a new English translation of Germany's most popular 
diet book takes the concept to a new philosophical level. 
The Nietzschean diet, which commands its adherents to eat 
superhuman amounts of whatever they most fear, 
is developing a strong following in America."""

counts = { }

for word in paragraph.split():
    if  word not in counts:
        counts[word] = 1
    else:
        counts[ word ] = counts[ word ] + 1

print( counts )

{'While': 1, 'dieters': 1, 'are': 1, 'accustomed': 1, 'to': 3, 'exercises': 1, 'of': 3, 'will,': 1, 'a': 3, 'new': 2, 'English': 1, 'translation': 1, "Germany's": 1, 'most': 2, 'popular': 1, 'diet': 1, 'book': 1, 'takes': 1, 'the': 1, 'concept': 1, 'philosophical': 1, 'level.': 1, 'The': 1, 'Nietzschean': 1, 'diet,': 1, 'which': 1, 'commands': 1, 'its': 1, 'adherents': 1, 'eat': 1, 'superhuman': 1, 'amounts': 1, 'whatever': 1, 'they': 1, 'fear,': 1, 'is': 1, 'developing': 1, 'strong': 1, 'following': 1, 'in': 1, 'America.': 1}


The condition ```if word not in counts``` is true if the content of the variable word is not a key in the dictionary counts.

Note that this is a variant of the "accumulation" code pattern that you have seen before. We initialize counts to an empty dictionary. Then we iterate over the words in the paragraph, adding numbers to the dictionary as we go along. The first time we encounter a word, we initialize its count to one. We know we encounter it for the first time because there is no dictionary key for it yet.
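The same pattern can be packaged as a function, so that it can be reused on any list of words. This is only a sketch: the function name `count_words` is made up here. As a sanity check, the result is compared against `collections.Counter` from Python's standard library, which does this kind of counting for you.

```python
from collections import Counter

def count_words(words):
    """Count how often each item occurs in a list of words."""
    counts = {}
    for word in words:
        if word not in counts:
            # first time we see this word: start its count at one
            counts[word] = 1
        else:
            # we have seen this word before: add one to its count
            counts[word] = counts[word] + 1
    return counts

words = "to be or not to be".split()
counts = count_words(words)
print(counts)

# the standard library's Counter should arrive at the same counts
print("same as Counter?", counts == dict(Counter(words)))
```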

**Try it for yourself**:
* In the code above, each word is counted "as is", which means that "the" and "The" are counted separately. Modify the code such that it lowercases each word before counting.

# Counting words using NLTK

Word counting is a task that we often need to do when we analyze texts. It is surprising how many different analyses start with this step! And because this is such a frequent task, the Natural Language Toolkit has a specialized type of dictionary just for counting (of words, or of other items). 

This is a trick you will see often with Python packages: They define specialized data types that come with their own methods. 

In [25]:
import nltk

# a poem from Alice in Wonderland
data = """"You are old, Father William," the young man said,
    "And your hair has become very white;
And yet you incessantly stand on your head—
    Do you think, at your age, it is right?"

"In my youth," Father William replied to his son,
    "I feared it might injure the brain;
But now that I'm perfectly sure I have none,
    Why, I do it again and again."

"You are old," said the youth, "as I mentioned before,
    And have grown most uncommonly fat;
Yet you turned a back-somersault in at the door—
    Pray, what is the reason of that?"

"In my youth," said the sage, as he shook his grey locks,
    "I kept all my limbs very supple
By the use of this ointment—one shilling the box—
    Allow me to sell you a couple."

"You are old," said the youth, "and your jaws are too weak
    For anything tougher than suet;
Yet you finished the goose, with the bones and the beak—
    Pray, how did you manage to do it?"

"In my youth," said his father, "I took to the law,
    And argued each case with my wife;
And the muscular strength, which it gave to my jaw,
    Has lasted the rest of my life."

"You are old," said the youth, "one would hardly suppose
    That your eye was as steady as ever;
Yet you balanced an eel on the end of your nose—
    What made you so awfully clever?"

"I have answered three questions, and that is enough,"
    Said his father; "don't give yourself airs!
Do you think I can listen all day to such stuff?
    Be off, or I'll kick you down stairs!"""

# we use the Natural Language Toolkit to split this poem into words
# in a way that also splits off punctuation
words = nltk.word_tokenize(data)
print("The words are", words)

The words are ['``', 'You', 'are', 'old', ',', 'Father', 'William', ',', "''", 'the', 'young', 'man', 'said', ',', '``', 'And', 'your', 'hair', 'has', 'become', 'very', 'white', ';', 'And', 'yet', 'you', 'incessantly', 'stand', 'on', 'your', 'head—', 'Do', 'you', 'think', ',', 'at', 'your', 'age', ',', 'it', 'is', 'right', '?', "''", '``', 'In', 'my', 'youth', ',', "''", 'Father', 'William', 'replied', 'to', 'his', 'son', ',', '``', 'I', 'feared', 'it', 'might', 'injure', 'the', 'brain', ';', 'But', 'now', 'that', 'I', "'m", 'perfectly', 'sure', 'I', 'have', 'none', ',', 'Why', ',', 'I', 'do', 'it', 'again', 'and', 'again', '.', "''", '``', 'You', 'are', 'old', ',', "''", 'said', 'the', 'youth', ',', '``', 'as', 'I', 'mentioned', 'before', ',', 'And', 'have', 'grown', 'most', 'uncommonly', 'fat', ';', 'Yet', 'you', 'turned', 'a', 'back-somersault', 'in', 'at', 'the', 'door—', 'Pray', ',', 'what', 'is', 'the', 'reason', 'of', 'that', '?', "''", '``', 'In', 'my', 'youth', ',', "''", 'sai

In [26]:
# Here is the item counting dictionary
# You initialize it with the list of items to be counted
fd = nltk.FreqDist(words)
# When you inspect this, you see
# the words with the highest counts first
fd

FreqDist({',': 30, 'the': 17, '``': 16, "''": 15, 'you': 10, 'I': 10, ';': 7, 'my': 7, 'said': 6, 'your': 6, ...})

In [27]:
# the 10 most common words and their counts
fd.most_common(10)

[(',', 30),
 ('the', 17),
 ('``', 16),
 ("''", 15),
 ('you', 10),
 ('I', 10),
 (';', 7),
 ('my', 7),
 ('said', 6),
 ('your', 6)]

We can also get the most frequent words in a tabular format:

In [28]:
fd.tabulate(10)

   ,  the   ``   ''  you    I    ;   my said your 
  30   17   16   15   10   10    7    7    6    6 


In [29]:
# the counts for a particular word: 
# ask this like you would a dictionary
fd["youth"]

6

**Try it for yourself**:

Get a short text passage from some webpage, and store it as a Python string. Split it into words using either ```split()``` or ```nltk.word_tokenize()```. Then make a new ```nltk.FreqDist``` and use it to count words in the passage.

* What are the 5 most frequent words in the passage, and what are their counts?

* Is the word "and" in the passage? If so, what is its count?

# All keys, all values, all pairs


You can retrieve all the keys of a dictionary:

In [30]:
mydict = {"dog":"Hund", "cat":"Katze", "armadillo":"Guerteltier"}
mydict.keys()

dict_keys(['dog', 'cat', 'armadillo'])

These keys are almost the same as a list, and you can iterate through them like through a list, using a for-loop:

In [31]:
# printing counts for all words that 
# are actual words,
# not punctuation
for word in fd.keys():
    if word.isalpha():
        print(word, fd[word], end= ", ")

You 4, are 5, old 4, Father 2, William 2, the 17, young 1, man 1, said 6, And 5, your 6, hair 1, has 1, become 1, very 2, white 1, yet 1, you 10, incessantly 1, stand 1, on 2, Do 2, think 2, at 2, age 1, it 5, is 3, right 1, In 3, my 7, youth 6, replied 1, to 6, his 4, son 1, I 10, feared 1, might 1, injure 1, brain 1, But 1, now 1, that 3, perfectly 1, sure 1, have 3, none 1, Why 1, do 3, again 2, and 4, as 4, mentioned 1, before 1, grown 1, most 1, uncommonly 1, fat 1, Yet 3, turned 1, a 2, in 1, Pray 2, what 1, reason 1, of 4, sage 1, he 1, shook 1, grey 1, locks 1, kept 1, all 2, limbs 1, supple 1, By 1, use 1, this 1, shilling 1, Allow 1, me 1, sell 1, couple 1, jaws 1, too 1, weak 1, For 1, anything 1, tougher 1, than 1, suet 1, finished 1, goose 1, with 2, bones 1, how 1, did 1, manage 1, father 2, took 1, law 1, argued 1, each 1, case 1, wife 1, muscular 1, strength 1, which 1, gave 1, jaw 1, Has 1, lasted 1, rest 1, life 1, one 1, would 1, hardly 1, suppose 1, That 1, eye 1, w

You can also get all the values in a dictionary:

In [32]:
mydict.values()

dict_values(['Hund', 'Katze', 'Guerteltier'])

In [33]:
for v in mydict.values():
    print(v)

Hund
Katze
Guerteltier


In [34]:
# summing up all the values in the 
# FreqDist object is
# the same as the length of the original poem
print("summed counts in the FreqDist:", sum(fd.values()))
print("number of words in the poem:", len(nltk.word_tokenize(data)))
print("summed counts in the FreqDist, version 2:", fd.N())

summed counts in the FreqDist: 354
number of words in the poem: 354
summed counts in the FreqDist, version 2: 354


**Try it for yourself.**

* Above, you made a ```nltk.FreqDist``` object counting words in a passage of your choosing. Now iterate through the keys in that dictionary in order to print counts *only for the uppercase words* in that passage.

* Here is a small English/German dictionary as a Python dictionary. Iterate through the values in that dictionary and print only the German words with a length greater than or equal to 7 characters.

In [35]:
translationdict = {"dog":"Hund", "cat": "Katze", 
                   "dormouse":"Siebenschlaefer", 
                   "praying mantis":"Gottesanbeterin",
                  "gopher" : "Taschenratte"}
# put code here...




You can also get access to all key/value pairs in a dictionary, using the method ```items()```:

In [36]:
mydict.items()

dict_items([('dog', 'Hund'), ('cat', 'Katze'), ('armadillo', 'Guerteltier')])

The items (key/value pairs) have a shape like this:

```('dog', 'Hund')```

This looks almost like a list, but with round brackets rather than straight, and you can in fact treat it like a list. In particular, you can access the first part of this pair (the key) with index 0, and the second part of the pair (the value) with index 1. 

(If you want to know more: This data structure is called a *tuple*. It behaves like a list, except that it is immutable, like a string: you cannot ```append()``` to it, and you cannot exchange individual items on a tuple.)

In [37]:
firstpair = ("dog", "Hund")
firstkey = firstpair[0]
firstvalue = firstpair[1]
print("the first key is", firstkey, "and the first value is", firstvalue)

the first key is dog and the first value is Hund


Tuples don't have to be length 2. Here is a longer one:

In [38]:
longtuple = ("a", "b", "c", "d")
longtuple[2]

'c'
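The immutability of tuples mentioned above can be checked directly. In this small sketch, reading from a tuple works, while trying to exchange an item raises a TypeError. If you need a changeable version, you can convert the tuple to a list with `list()`.

```python
pair = ("dog", "Hund")

# reading by index works just like with a list
print(pair[0])

# exchanging an item does not work: tuples are immutable
try:
    pair[1] = "Katze"
except TypeError as error:
    print("TypeError:", error)

# a list made from the tuple can be changed; the tuple stays as it was
aslist = list(pair)
aslist[1] = "Katze"
print(aslist, pair)
```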

You can iterate over the keys of a dictionary, the values of a dictionary, and the key/value pairs (items). Here is how to do the latter:

In [39]:
for keyvalue in mydict.items():
    english = keyvalue[0]
    german = keyvalue[1]
    print('English', english, "translates to German", german)

English dog translates to German Hund
English cat translates to German Katze
English armadillo translates to German Guerteltier


You can take a tuple or a list apart by assigning multiple variables to it at once:

In [40]:
firstpair = ("dog", "Hund")
englishword, germanword = firstpair
print("We have assigned", englishword, "to 'englishword' and",
     germanword, "to 'germanword'")

We have assigned dog to 'englishword' and Hund to 'germanword'


So you can fill two containers (variables) at the same time by putting them on the left-hand side of the assignment =. That only works if on the right-hand side you have a list or tuple of length exactly two.

(You can also assign three/four/... variables at the same time if on the right-hand side you have a list or tuple of length exactly three/four/...)

In [41]:
var1, var2, var3 = (1,2,3)
print(var2)

2


Don't miscount, or you get an error message, in particular a ValueError. 

In [42]:
# Uncomment (remove the hash from) the 'var1, var2' line
# to get a ValueError with the message
# "too many values to unpack (expected 2)"
# var1, var2= (1,2,3)

Usually when doing assignments, assigning the right-hand side of the "=" to the left-hand side, there was only a single variable on the left-hand side. But if we know that the right-hand side of the "=" has exactly two components, we can put two variables on the left-hand side. In the earlier example, the assignment `englishword, germanword = firstpair` takes the tuple ('dog', 'Hund') apart into two items and assigns the first to the variable englishword and the second to the variable germanword.

We can combine this with a for-loop when we iterate over key/value pairs:

In [43]:
for english, german in mydict.items():
    print(german, "is German for", english)

Hund is German for dog
Katze is German for cat
Guerteltier is German for armadillo


The central line here is:
```for english, german in mydict.items():```

This is the same idea as above -- we know that any member of mydict.items() consists of two parts (a key and a value), so we can assign it to two variables at once.
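Because each item is a (key, value) tuple, you can also sort the items by their second part. The sketch below does this for a small, made-up dictionary of counts, which is roughly the idea behind `most_common()`. The helper function `get_count` is hypothetical. It tells `sorted()` which part of each pair to sort by.

```python
counts = {"the": 21, "house": 2, "a": 15, "and": 3, "yellow": 1}

def get_count(pair):
    # pair is a (word, count) tuple; we sort by the count
    return pair[1]

# sort the key/value pairs by count, highest count first
sorted_pairs = sorted(counts.items(), key=get_count, reverse=True)

# print the three most frequent words, unpacking each pair
for word, count in sorted_pairs[:3]:
    print(word, count)
```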



**Try it for yourself**:

* Iterate through the key/value pairs in the ```nltk.FreqDist``` dictionary you made above from a passage you chose. For all words that consist solely of punctuation symbols, print out the words and counts.  

* You can also iterate through the key/value pairs that you get from ``fd.most_common(20)``. For all words that don't consist solely of punctuation symbols, print the word and its count.


A simple way to check for punctuation is to say  `not word.isalpha()` to check if `word` contains non-letter characters. But this will also get you words like "say..." since that contains non-letter characters. Here is a trick to check whether a word consists entirely of punctuation symbols: 

In [44]:
import string
print("Here is a string of all punctuation symbols that Python is aware of:", string.punctuation)

mystring = "??!??"
if mystring.strip(string.punctuation) == "":
    print("this string consisted entirely of punctuation symbols.")


Here is a string of all punctuation symbols that Python is aware of: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
this string consisted entirely of punctuation symbols.
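If you need this check in several places, it may be convenient to wrap it in a function. This is a sketch, and the function name `is_all_punctuation` is made up. One assumption made here: an empty string is treated as *not* being punctuation, since it contains no symbols at all.

```python
import string

def is_all_punctuation(word):
    # strip() removes punctuation from both ends of the word;
    # if nothing is left, the word was punctuation throughout.
    # The empty string is excluded explicitly.
    return word != "" and word.strip(string.punctuation) == ""

for word in ["??!??", "say...", ",", "youth"]:
    print(repr(word), is_all_punctuation(word))
```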


* Using again the translation dictionary from above with animal names in English and German, iterate through key/value pairs, and for pairs where the German word is at least 7 letters long, print both the German word and its English translation. 

In [45]:
translationdict = {"dog":"Hund", "cat": "Katze", 
                   "dormouse":"Siebenschlaefer", 
                   "praying mantis":"Gottesanbeterin",
                  "gopher" : "Taschenratte"}
# put your code here.


# Counting words to get a sense of topic

Now we're ready to tackle the main task of this notebook: Counting words in text to get a sense of the overall topic. For now, we'll assume that the most frequent words are the ones that give us a sense of the topic. 

**Try it for yourself:** All the problems below are for you to solve. 

## Words across all State of the Union addresses

We'll first do word counts across *all* state of the union addresses. Like in the previous notebook, we'll use NLTK's interface to the speeches. 

We'll need an aggregator variable to collect word counts across all state of the union addresses. This time, our aggregator variable will be a sheet of counts, specifically an NLTK FreqDist. Since we count words across all speeches, we use one single FreqDist to collect counts.

Here is a neat fact about NLTK's FreqDist objects: You can use the method `update()` to add a whole new list of words to the counts, like this: 

In [46]:
fd = nltk.FreqDist(["here", "are", "some", "words"])
print(fd.items(), "\n")
# here comes the trick
fd.update(["here", "are", "more", "words"])
print(fd.items())

dict_items([('here', 1), ('are', 1), ('some', 1), ('words', 1)]) 

dict_items([('here', 2), ('are', 2), ('some', 1), ('words', 2), ('more', 1)])
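The same `update()` behavior can be tried without NLTK, using `collections.Counter` from Python's standard library (NLTK's FreqDist is built on top of it). The sketch below also shows a pitfall: `update()` expects a list of items, and if you hand it a bare string, it counts the individual characters instead.

```python
from collections import Counter

counts = Counter(["here", "are", "some", "words"])
# add a second list of words to the existing counts
counts.update(["here", "are", "more", "words"])
print(counts["here"], counts["more"])

# pitfall: a bare string is treated as a sequence of characters
letters = Counter()
letters.update("words")
print(letters)
```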


Here is an example of a FreqDist as an aggregator variable. It counts words in an Edward Lear poem, one line at a time. 

In [47]:
# first stanza of a poem by Edward Lear,
# one string per line
data = [
"They went to sea in a Sieve, they did",
"In a Sieve they went to sea:",
"In spite of all their friends could say,",
"On a winter's morn, on a stormy day,",
"In a Sieve they went to sea!",
"And when the Sieve turned round and round,",
"And every one cried, `You'll all be drowned!'",
"They called aloud, `Our Sieve ain't big,",
"But we don't care a button! we don't care a fig!",
"In a Sieve we'll go to sea!'",
"Far and few, far and few,",
"Are the lands where the Jumblies live;",
"Their heads are green, and their hands are blue,",
"And they went to sea in a Sieve."]

print("I got this many lines:", len(data))

# aggregator variable
fd = nltk.FreqDist()
# loop
for line in data:
    words = line.split()
    # adding to the aggregator variable
    fd.update(words)
    
print(fd.most_common(10))


I got this many lines: 14
[('a', 9), ('to', 5), ('Sieve', 5), ('went', 4), ('they', 4), ('In', 4), ('and', 4), ('And', 3), ('the', 3), ('They', 2)]


Now do the following:
* Make a FreqDist object as your aggregator variable
* iterate over fileIDs of state of the union addresses, as in the previous notebook
* for each fileID:
  * pull up the speech, as a list of words
  * add it to the aggregator variable
  
What are the most frequent words? 

You'll see a lot of "uninteresting" words come out on top. To make them disappear, let's remove "stopwords" from each speech. You can get NLTK's English stopwords like this -- I'm only showing the first 10:

In [48]:
nltk.corpus.stopwords.words("english")[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In a single state of the union address, you can remove stopwords like this:
* make an aggregator variable that is a list
* for each word of the speech:
  * if it is not in the list of stopwords:
    * add it to the aggregator variable
    
Let's do this for the first state of the union address:

In [49]:
sotu_fileids = nltk.corpus.state_union.fileids()
first_fileid = sotu_fileids[0]
first_sotu = nltk.corpus.state_union.words(fileids = first_fileid)

mystopwords= nltk.corpus.stopwords.words("english")

mylist = [ ]
for word in first_sotu:
    if word not in mystopwords:
        mylist.append(word)
        
fd = nltk.FreqDist(mylist)
fd.most_common(20)

[('.', 105),
 (',', 92),
 ('peace', 23),
 ('world', 20),
 ('must', 20),
 ('I', 17),
 ('We', 17),
 ('-', 16),
 ('!', 12),
 ('America', 11),
 ('The', 10),
 ('people', 10),
 ('nations', 10),
 ('In', 8),
 ('hope', 8),
 ('freedom', 7),
 ('never', 7),
 ('great', 6),
 ('upon', 6),
 ('shall', 6)]

To do this for all state of the union addresses, you will need to again use two aggregator variables, an inner one and an outer one:
* make an aggregator variable across all speeches that is a FreqDist 
* for each state of the union address, accessed as above:
  * make an aggregator variable for this speech, one that is a list
  * for each word of the speech:
    * if the word is not a stopword:
      * save it in the this-speech aggregator variable
  * update the across-speeches aggregator variable with the words in the this-speech aggregator variable
  
Let's do this, and then see if we get better words coming out on top:
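Before running this on the actual speeches, you can try out the inner/outer aggregator pattern on toy data. The sketch below is not the solution for the state of the union addresses: the two "speeches" and the stopword list are made up, and `collections.Counter` stands in for the FreqDist so that the code runs without NLTK.

```python
from collections import Counter

# hypothetical toy data: each "speech" is already a list of words
toy_speeches = [
    ["we", "must", "keep", "the", "peace"],
    ["the", "world", "wants", "peace"],
]
# hypothetical mini stopword list
toy_stopwords = ["we", "the", "must"]

# outer aggregator: counts across all speeches
all_counts = Counter()
for speech in toy_speeches:
    # inner aggregator: the non-stopwords of this one speech
    kept_words = []
    for word in speech:
        if word not in toy_stopwords:
            kept_words.append(word)
    # add this speech's words to the overall counts
    all_counts.update(kept_words)

print(all_counts.most_common(1))
```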

## Words in Wine Reviews

On Canvas, there is a file with wine reviews, "winereviews.txt".

The notebook on accessing data from files tells you how to open "winereviews.txt" and read the file, one line at a time.

Each line starts with either the word "CHEAP" or "EXPENSIVE", followed by the actual wine review.

We would like to separately count words in reviews for cheap wines and reviews for expensive wines, to see if we can spot differences among the most common descriptors for cheap versus expensive wines. 

In a first pass, we again ignore stopwords. Do the following:
* make *two* aggregator variables that are FreqDist objects, one for cheap wines and one for expensive wines.
* open "winereviews.txt" and read it, one line at a time.
* For each line:
  * split it into words. 
  * If the first word is "CHEAP": 
    * update the cheap-wine aggregator variable with the words (but omit the first word, "CHEAP")
  * Else:
    * update the expensive-wine aggregator variable with the words (but omit the first word, "EXPENSIVE")
    
Then let's look at the most common words in both FreqDist objects. Anything to see?

Let's improve our analysis by again omitting stop words in each wine review before counting the words in it. 

Then let's look again at the most common words in both FreqDist objects. Do you see a difference?