The inaugural addresses data frame we used earlier had one row per document, that is, one row for each inaugural address, with information about the president's name, the year, the length of the address, and so on.

Let's use this data again to explore how to get data into pandas. Here is the plan: We will
1. make a data frame containing one row with a president's name, year, and speech length,
2. make a data frame containing multiple rows with presidents' names, years, and speech lengths, and
3. make a data frame with counts for words in inaugural addresses.

The inaugural address corpus in NLTK is organized into "files", each of which is one inaugural address. You can access their names using the method ```fileids()```:

In [1]:
import nltk

inaugural_labels = nltk.corpus.inaugural.fileids()
print(inaugural_labels[:10])

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt']


Here is how we can get the words of one inaugural address:

In [2]:
washington_words = nltk.corpus.inaugural.words('1789-Washington.txt')
# looking at the first 20 words only
washington_words[:20]

['Fellow',
 '-',
 'Citizens',
 'of',
 'the',
 'Senate',
 'and',
 'of',
 'the',
 'House',
 'of',
 'Representatives',
 ':',
 'Among',
 'the',
 'vicissitudes',
 'incident',
 'to',
 'life',
 'no']

# Making a data frame row

The NLTK 'file IDs' have the form
```year-President_name.txt```

Let's exploit this format to get the president's name and year for the first inaugural address, and also determine the length of the address.

In [20]:
fileid0 = nltk.corpus.inaugural.fileids()[0]
print("first file ID is", fileid0)
# removing the '.txt' from '1789-Washington.txt'
yearpresident = fileid0[:-4]
print("removing the extension I have", yearpresident)

# splitting into year and president's name at the hyphen
year, president = yearpresident.split("-")

addresslength = len(nltk.corpus.inaugural.words(fileid0))
print("I got the name", president, "and the year", int(year), 
      "and address length", addresslength)

first file ID is 1789-Washington.txt
removing the extension I have 1789-Washington
I got the name Washington and the year 1789 and address length 1538
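
The parsing steps above can be wrapped in a small helper. This is only a sketch: `parse_fileid` is a hypothetical name, not part of NLTK, and it assumes that every file ID consists of a year, a hyphen, a name, and an extension. Using `os.path.splitext` avoids hard-coding the length of '.txt', and `split("-", 1)` splits only at the first hyphen, so a name that itself contained a hyphen would stay in one piece.

```python
import os

def parse_fileid(fileid):
    """Split a file ID such as '1789-Washington.txt' into (year, name).

    Hypothetical helper, assuming the form year-name.extension.
    """
    # separate the extension from the rest, whatever its length
    stem, _extension = os.path.splitext(fileid)
    # split only at the first hyphen
    year, president = stem.split("-", 1)
    # the year comes out of split() as a string, so convert it
    return int(year), president

print(parse_fileid("1789-Washington.txt"))   # (1789, 'Washington')
```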


Here is how we can store this in a data frame: We first make a dictionary with keys that are the column labels we want to have in the data frame, and values that are the values in this data frame row. Then we hand this dictionary to a pandas DataFrame object. (Make sure to hand the DataFrame object a *list* whose only member is this dictionary.)

In [25]:
import pandas as pd

# The year we got out of split() above was a string.
# Convert this to an integer and store it as a number.

firstaddress_dict = { "president" : president,
                       "year" : int(year),
                       "length" : addresslength }

# Now we make the pandas dataframe.
# make sure to pass the dictionary as a list of one
firstaddress_df = pd.DataFrame([ firstaddress_dict ])
# and inspect it.
firstaddress_df

Unnamed: 0,president,year,length
0,Washington,1789,1538
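
Why the list matters: when pandas is given a bare dictionary, it treats each value as a whole column, so a dictionary that holds only single values is rejected. The following sketch, with the Washington values typed in by hand, shows both cases.

```python
import pandas as pd

row = {"president": "Washington", "year": 1789, "length": 1538}

# A list with one dictionary gives a data frame with one row.
df = pd.DataFrame([row])
print(df.shape)   # (1, 3)

# A bare dictionary of single values raises a ValueError,
# because pandas cannot tell how many rows the columns should have.
try:
    pd.DataFrame(row)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```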


**Try it for yourself**: Make a pandas data frame that contains *only* the president's name, year, and speech length for the 3rd inaugural speech, 1797 by Adams.

# Making multiple data frame rows

If we want to make a pandas data frame with multiple rows, we simply make one dictionary per row, and pass the list of dictionaries to a pandas DataFrame object. 

Here is how to get a data frame with presidents' names, years, and speech lengths for all the inaugural speeches in the corpus. Note that we again use the "Python loop idiom": Make an empty container (here, a list), then iterate over a list (here, a list of "file IDs" of inaugural speeches) and update the container for each item on the list (here, make a dictionary for each inaugural speech).

In [26]:
# aggregator variable: list of rows
dataframe_rows = [ ]

for fileid in nltk.corpus.inaugural.fileids():
    # remove the .txt from the file ID
    yearpresident = fileid[:-4]
    # split into year and president's name at the hyphen
    year, president = yearpresident.split("-")
    # and get the length of the speech
    words = nltk.corpus.inaugural.words(fileid)
    addresslength = len(words)
    
    row_dict = { "president" : president, "year" : int(year), "length" : addresslength }
    
    # store the row dictionary in the list
    dataframe_rows.append(row_dict)
    
# make the data frame
small_inaugural_df = pd.DataFrame(dataframe_rows)

# and display the first 10 rows
small_inaugural_df.head(10)

Unnamed: 0,president,year,length
0,Washington,1789,1538
1,Washington,1793,147
2,Adams,1797,2585
3,Jefferson,1801,1935
4,Jefferson,1805,2384
5,Madison,1809,1265
6,Madison,1813,1304
7,Monroe,1817,3693
8,Monroe,1821,4909
9,Adams,1825,3150


## An exercise in data frame building: measuring lexical diversity

As we discussed in the notebook on dictionaries, you can use Python dictionaries, including NLTK FreqDist objects, to count things, for example words in a text. As a review, here is how you would count how often each word appears in the opening stanza of Lewis Carroll's *The Hunting of the Snark*.


In [6]:
import nltk

stanza = """ "Just the place for a Snark!" the Bellman cried,
   As he landed his crew with care;
Supporting each man on the top of the tide
   By a finger entwined in his hair."""
freq_stanza = nltk.FreqDist(nltk.word_tokenize(stanza))
freq_stanza

FreqDist({'the': 4, 'a': 2, 'his': 2, '``': 1, 'Just': 1, 'place': 1, 'for': 1, 'Snark': 1, '!': 1, "''": 1, ...})

We will estimate lexical diversity: the extent to which someone uses *different* words in a speech, rather than reusing the same small vocabulary.

Lexical diversity can be useful for estimating reading level. For example, to write a text for, say, third-graders, you might want to opt for a smaller vocabulary.

To estimate lexical diversity, we want to compare two quantities:
* the overall number of words that were spoken
* how many *different* words there were in the speech.

We can get at the overall number of words in a text in several ways:
* if we have the text as a list of words: the length of that list
* if we have the text as a FreqDist object: the method `N()`, which returns how many words were counted to make the FreqDist

We can get at the number of *different* words in a text by counting, in a FreqDist object of the text, how many words there are for which we have counts. 

Here is an example for the Hunting of the Snark: 

In [27]:
print("number of words (word tokens) in the stanza", freq_stanza.N())
print("number of different words (word types) in the stanza", len(freq_stanza))

number of words (word tokens) in the stanza 38
number of different words (word types) in the stanza 33
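
The same two quantities can be computed without NLTK, using Python's built-in `collections.Counter`, which FreqDist extends in NLTK 3. The sketch below uses a made-up sentence and plain `str.split()` instead of `nltk.word_tokenize`, so punctuation handling differs from the cells above.

```python
from collections import Counter

# made-up example text, split at whitespace
tokens = "the cat saw the dog and the bird".split()
counts = Counter(tokens)

# tokens: sum of all counts (this is what FreqDist's N() reports)
n_tokens = sum(counts.values())
# types: number of distinct words, i.e. number of keys
n_types = len(counts)

print("tokens:", n_tokens)   # tokens: 8
print("types:", n_types)     # types: 6
```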


Here is the same for the first inaugural address.

We estimate lexical diversity as the number of types divided by the number of tokens. A diversity of 0.4 means that, averaged over the whole text, 4 out of every 10 words are words that haven't been used in the same text before.

In [23]:
fileid0 = nltk.corpus.inaugural.fileids()[0]
words = nltk.corpus.inaugural.words(fileid0)
wordcounts = nltk.FreqDist(words)
print("first inaugural address: number of words", wordcounts.N())
print("first inaugural address: number of word *types*:", len(wordcounts))
print("lexical diversity: types per token, dividing word types by overall count of words", len(wordcounts) / wordcounts.N())

first inaugural address: number of words 1538
first inaugural address: number of word *types*: 628
lexical diversity: types per token, dividing word types by overall count of words 0.4083224967490247


Here is a more theoretical example of two texts, one that repeats the same word over and over again, and one that has no repetition at all. The second text has the maximal possible lexical diversity: 1.0. This means that every single word you encounter in the text is one that you haven't encountered before. 

In [24]:
text1 = "the the the the"
text2 = "softly and suddenly vanish"

fd1 = nltk.FreqDist(nltk.word_tokenize(text1))
fd2 = nltk.FreqDist(nltk.word_tokenize(text2))

# lexical diversity: number of types (= number of keys in the FreqDist) 
# divided by number of tokens (what you get by FreqDist's method N())
print("lexical diversity text 1:", len(fd1) / fd1.N())
print("lexical diversity text 2:", len(fd2) / fd2.N())

lexical diversity text 1: 0.25
lexical diversity text 2: 1.0


Or, more compactly:
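
One compact way to write this is to wrap the ratio in a small helper function. The sketch below is an assumption about what such a shortcut could look like, not a cell from the original notebook: `lexical_diversity` is a hypothetical name, a `set` stands in for the FreqDist, and `str.split()` stands in for `nltk.word_tokenize`.

```python
def lexical_diversity(tokens):
    """Type/token ratio: number of distinct words divided by number of words."""
    tokens = list(tokens)
    if not tokens:
        # assumption: we define the diversity of an empty text as 0
        return 0.0
    return len(set(tokens)) / len(tokens)

print(lexical_diversity("the the the the".split()))             # 0.25
print(lexical_diversity("softly and suddenly vanish".split()))  # 1.0
```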

**Try it for yourself**: 

* Redo the data frame with presidents, years, and speech lengths from above, using the same technique, that is, making a dictionary for each inaugural address. But this time, add another column that has the number of different word types for each inaugural address.

* Once you have made the pandas dataframe, add a column that has the *lexical diversity* of each inaugural address. This is the type/token ratio, which you get by dividing the number of different word types for an address by the length of the address. (See https://en.wikipedia.org/wiki/Lexical_diversity )

* Which speech has the greatest lexical diversity?

* Plot lexical diversity over time. How much does it vary?


# From word counts to data frame

When you want to explore and visualize your data and run statistical tests on it, it is often convenient to have the data in a pandas data frame. Here is how to make a pandas data frame from an NLTK FreqDist object, shown for ```freq_stanza```. We make one column for the words and another column for the counts.

In [10]:
import pandas as pd

stanza = """ "Just the place for a Snark!" the Bellman cried,
   As he landed his crew with care;
Supporting each man on the top of the tide
   By a finger entwined in his hair."""
freq_stanza = nltk.FreqDist(nltk.word_tokenize(stanza))

df_stanza = pd.DataFrame(freq_stanza.items(), columns = ["word", "count"])
df_stanza.head()

Unnamed: 0,word,count
0,``,1
1,Just,1
2,the,4
3,place,1
4,for,1


Then we can use the data frame as usual. For example, here is how we can get the row(s) where the word is "Snark":

In [11]:
df_stanza[df_stanza.word == "Snark"]

Unnamed: 0,word,count
6,Snark,1
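
A note on how this selection works: the comparison produces one True/False value per row, and indexing with it keeps the rows marked True. The sketch below uses a tiny hand-made data frame. It also shows a pitfall with dot notation: `df.count` refers to the built-in DataFrame method `count()`, not to the column named "count", so that column has to be accessed with square brackets.

```python
import pandas as pd

# tiny hand-made word/count table
df = pd.DataFrame(
    [("the", 4), ("a", 2), ("Snark", 1)],
    columns=["word", "count"],
)

# one boolean per row
mask = df["word"] == "Snark"
print(mask.tolist())          # [False, False, True]

# keep only the rows where the mask is True
print(df[mask])

# dot notation finds the DataFrame method, not the column
print(callable(df.count))     # True
# square brackets always find the column
print(df["count"].tolist())   # [4, 2, 1]
```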


You can also make a data frame in which the columns are words and there is a single row of counts:

In [12]:
df_stanza_2 = pd.DataFrame([freq_stanza])
df_stanza_2

Unnamed: 0,``,Just,the,place,for,a,Snark,!,'',Bellman,...,on,top,of,tide,By,finger,entwined,in,hair,.
0,1,1,4,1,1,2,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


**Try it for yourself:**

* Find a passage of your choice, and use an NLTK FreqDist to count how often each word appears in the passage.
* Then turn the counts into a data frame with two columns, one for the words and one for the counts.
* In this data frame, what is the highest count?
* In this data frame, what is the count for "the"? What is the count for "Boojum"?
* In this data frame, which words have a count greater than one?



The next question is: If we want to get counts for each word, in multiple documents, for example in each inaugural address, how do we do that? The answer is: Like before, we want one dictionary per row, just this time we use NLTK FreqDist objects. Then when we combine them into a data frame, we just have to tell pandas to use zeros for all words that are unobserved in a particular document.

We first show this here for two stanzas from the Hunting of the Snark:

In [13]:
import pandas as pd

stanza1 = """ "Just the place for a Snark!" the Bellman cried,
   As he landed his crew with care;
Supporting each man on the top of the tide
   By a finger entwined in his hair."""

freq_stanza1 = nltk.FreqDist(nltk.word_tokenize(stanza1))

stanza2 = """In the midst of the word he was trying to say 
in the midst of his laughter and glee
he had softly and suddenly vanished away
for the Snark was a Boojum, you see"""
freq_stanza2 = nltk.FreqDist(nltk.word_tokenize(stanza2))

df_snark = pd.DataFrame([ freq_stanza1, freq_stanza2]).fillna(0)
df_snark

Unnamed: 0,``,Just,the,place,for,a,Snark,!,'',Bellman,...,and,glee,had,softly,suddenly,vanished,away,Boojum,you,see
0,1.0,1.0,4,1.0,1,2,1,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,4,0.0,1,1,1,0.0,0.0,0.0,...,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


So here we have one FreqDist object per row, and we hand the DataFrame object a list of rows, that is, a list of FreqDist objects. 
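
Note that the filled-in columns show up as floating point numbers (1.0, 0.0): a column with missing cells is stored as floats, and `fillna(0)` does not change that. If whole-number counts are preferred, one option is to convert afterwards with `astype(int)`. The sketch below uses `collections.Counter` and two made-up mini-texts in place of FreqDist objects and stanzas.

```python
from collections import Counter
import pandas as pd

# made-up mini-texts standing in for the two stanzas
counts1 = Counter("the snark the bellman".split())
counts2 = Counter("the boojum".split())

# columns with missing cells become float columns
df_float = pd.DataFrame([counts1, counts2]).fillna(0)
print(df_float.dtypes)

# convert every column back to integers
df_int = df_float.astype(int)
print(df_int)
```

This conversion only works when every column holds counts. A data frame that also has text columns would need the conversion applied to the count columns only.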

We can also do this in a for-loop over rows to include, as we now show for the first five inaugural addresses only:

In [14]:
first_five_fileids = nltk.corpus.inaugural.fileids()[:5]

dataframe_rows = [ ]
for fileid in first_five_fileids:
    # get the words
    words = nltk.corpus.inaugural.words(fileid)
    # make the counts dictionary
    fd = nltk.FreqDist(words)
    # and store it
    dataframe_rows.append(fd)
    
# now we make a data frame.
# Important: tell pandas to pad missing cells with zeros
# fillna means "fill N/A"
df_wordcounts = pd.DataFrame(dataframe_rows).fillna(0)
df_wordcounts

Unnamed: 0,Fellow,-,Citizens,of,the,Senate,and,House,Representatives,:,...,fathers,Israel,old,planted,flowing,necessaries,infancy,riper,goodness,prosper
0,1.0,1.0,1.0,71,115,1.0,48,2.0,2.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,11,13,0.0,2,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,3.0,0.0,140,158,1.0,128,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,4.0,1.0,104,128,0.0,79,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,101,138,0.0,93,0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


**Try it for yourself:**

Make a dataframe with counts for the last five inaugural addresses in the dataset. Make two observations about the counts that you see.

# Counting only some of the words

What if you don't want to count all the words, just some? Here is an incantation that will do that. It uses a generator expression, a close relative of the list comprehension, which will be the subject of a later notebook; for now, just take it at face value.

In [15]:
words_of_interest = ["country", "duty", "freedom"]

fileid0 = nltk.corpus.inaugural.fileids()[0]
words = nltk.corpus.inaugural.words(fileid0)

wordcounts = nltk.FreqDist( w for w in words if w in words_of_interest)
pd.DataFrame([ wordcounts])


Unnamed: 0,country,duty
0,4,4


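We get a dataframe with counts of 4 for both 'country' and 'duty'. There is no column for 'freedom': the word does not occur in this address, so the FreqDist has no entry for it (asking the FreqDist for its count would give zero).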
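
If every word of interest should get a column, even when its count is zero, one option is pandas' `reindex` with `fill_value=0`. This is a sketch with a made-up mini-text and a `collections.Counter` in place of the FreqDist, not the way the notebook handles it.

```python
from collections import Counter
import pandas as pd

words_of_interest = ["country", "duty", "freedom"]
# made-up mini-text standing in for an inaugural address
words = "my country and my duty to my country".split()

counts = Counter(w for w in words if w in words_of_interest)
# a word that was never counted has count zero, but no entry
print(counts["freedom"])     # 0

df = pd.DataFrame([counts])
print(list(df.columns))      # ['country', 'duty']

# add the missing columns, filled with zeros
df_full = df.reindex(columns=words_of_interest, fill_value=0)
print(list(df_full.columns)) # ['country', 'duty', 'freedom']
```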

Here is code that does this count across all inaugural speeches:

In [16]:
words_of_interest = ["country", "duty", "freedom"]

dataframe_rows = [ ]
for fileid in nltk.corpus.inaugural.fileids():
    # get the words
    words = nltk.corpus.inaugural.words(fileid)
    # make the counts dictionary
    fd = nltk.FreqDist(w for w in words if w in words_of_interest)
    # and store it
    dataframe_rows.append(fd)
    
# now we make a data frame.
# and tell pandas to pad missing cells with zeros
df_wordcounts = pd.DataFrame(dataframe_rows).fillna(0)
df_wordcounts.head()

Unnamed: 0,country,duty,freedom
0,4.0,4.0,0.0
1,1.0,0.0,0.0
2,10.0,2.0,0.0
3,4.0,0.0,4.0
4,4.0,4.0,2.0


**Try it for yourself:**
Make a dataframe that shows the counts, across all inaugural addresses, for the words "citizen", "America", and "future". 

# Combining counts and metadata

We have made dataframes with metadata (data about the addresses), in our case: years, presidents' names, and speech lengths. And we have made dataframes with word counts. How can we combine the two?

This is quite easy because we can merge dictionaries using Python's ```update``` method. Here is an example where the contents of ```dict2``` are integrated into ```dict1```:

In [17]:
dict1 = { "cat":"Katze", "armadillo":"Guerteltier"}
dict2 = { "platypus":"Schnabeltier", "dormouse":"Siebenschlaefer"}
dict1.update(dict2)

dict1

{'cat': 'Katze',
 'armadillo': 'Guerteltier',
 'platypus': 'Schnabeltier',
 'dormouse': 'Siebenschlaefer'}
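
Two properties of ```update``` are worth keeping in mind. It changes the dictionary in place and returns `None`, and when both dictionaries share a key, the value from the second dictionary silently replaces the first. The sketch below uses made-up values to show both.

```python
row = {"year": 1789, "length": 1538}
# made-up counts in which "year" also occurs as a word
word_counts = {"duty": 4, "year": 2}

result = row.update(word_counts)
print(result)   # None: update() changes `row` in place
print(row)      # {'year': 2, 'length': 1538, 'duty': 4}

# with distinct key names, nothing is overwritten
safe_row = {"meta-year": 1789, "meta-length": 1538}
safe_row.update(word_counts)
print(safe_row)
```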

When we combine word counts and metadata, we need to watch our column names: We don't want to name our metadata columns "president" and "year" anymore, because they may also occur as words with counts. We'll use "meta-president", "meta-year", and "meta-length" instead. To keep the data frame from becoming too large, we again only count "freedom", "duty", and "country".

The code below is getting a bit lengthy, but mainly it just combines the code from above for making the metadata data frame with the code from above for counting particular words of interest.

In [18]:
words_of_interest = ["country", "duty", "freedom"]


dataframe_rows = [ ]
for fileid in nltk.corpus.inaugural.fileids():
    # make a dictionary for this row
    row_dict = { }
    # remove the .txt from the file ID
    yearpresident = fileid[:-4]
    # split into year and president's name at the hyphen
    year, president = yearpresident.split("-")
    # and get the length of the speech
    words = nltk.corpus.inaugural.words(fileid)
    # store what we have so far. this is the metadata
    row_dict["meta-president"] = president
    row_dict["meta-year"] = int(year)
    row_dict["meta-length"] = len(words)
    
    # now we count words
    fd = nltk.FreqDist(w for w in words if w in words_of_interest)
    
    # and merge the two dictionaries
    row_dict.update(fd)
    
    # store the row dictionary in the list
    dataframe_rows.append(row_dict)
    
# make the data frame
small_inaugural_df = pd.DataFrame(dataframe_rows).fillna(0)
small_inaugural_df.head()

Unnamed: 0,meta-president,meta-year,meta-length,country,duty,freedom
0,Washington,1789,1538,4.0,4.0,0.0
1,Washington,1793,147,1.0,0.0,0.0
2,Adams,1797,2585,10.0,2.0,0.0
3,Jefferson,1801,1935,4.0,0.0,4.0
4,Jefferson,1805,2384,4.0,4.0,2.0
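
With metadata and counts in one data frame, the "meta-" prefix makes it easy to tell the two kinds of columns apart. As one possible use, the sketch below retypes the first three rows of the table above by hand and converts the raw counts into occurrences per 1000 words, so that long and short addresses can be compared. The names `count_columns` and `per_thousand` are made up for this illustration.

```python
import pandas as pd

# first three rows of the table above, typed in by hand
df = pd.DataFrame([
    {"meta-president": "Washington", "meta-year": 1789, "meta-length": 1538,
     "country": 4, "duty": 4, "freedom": 0},
    {"meta-president": "Washington", "meta-year": 1793, "meta-length": 147,
     "country": 1, "duty": 0, "freedom": 0},
    {"meta-president": "Adams", "meta-year": 1797, "meta-length": 2585,
     "country": 10, "duty": 2, "freedom": 0},
])

# word-count columns are the ones without the "meta-" prefix
count_columns = [c for c in df.columns if not c.startswith("meta-")]

# divide each row's counts by that row's speech length
per_thousand = df[count_columns].div(df["meta-length"], axis=0) * 1000
print(per_thousand.round(2))
```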


**Try it for yourself:** 

Make a data frame that shows, for each inaugural address, the president's name and the year, the speech length, and the counts of 10 words of your choice.
