Python list comprehensions are weird-looking things. They look like lists with a wrong-way-round for-loop inside them.

Here is an example, applied to the very first sentence of the Penn Treebank corpus. The list comprehension is in the third line:

In [1]:
text = """Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."""
words = text.split()
[w.lower() for w in words]

['pierre',
 'vinken,',
 '61',
 'years',
 'old,',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'nov.',
 '29.']

As you can see, what it gives you is, in fact, a list: a list of all the items in the list ```words```, lowercased.

A list comprehension is a Python expression that transforms a list into another list. The input list is the one mentioned in the ```for``` expression:

```for w in words```
    
So here the input list is ```words```, a list of strings that we got from splitting the Pierre Vinken sentence at whitespace.

The output is also a list. This is signaled by the list brackets around the list comprehension: 

```[ ... ]```
    
What this particular list comprehension does is give you ```w.lower()``` for every ```w``` in the list ```words```.
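To see why the for-loop looks "the wrong way round", it can help to compare the comprehension with an ordinary loop that builds the same list. The following is a minimal sketch; the shortened sentence and the variable names ```lowered``` and ```lowered_loop``` are made up for this illustration:

```python
# A shortened version of the example sentence, split at whitespace.
words = "Pierre Vinken, 61 years old".split()

# The list comprehension: the expression comes first, the loop second.
lowered = [w.lower() for w in words]

# The same computation as an ordinary for-loop: the loop comes first,
# and the expression is appended to a result list inside the loop body.
lowered_loop = []
for w in words:
    lowered_loop.append(w.lower())

print(lowered)
print(lowered == lowered_loop)
```

Both versions produce the same list. The comprehension simply moves the expression ```w.lower()``` in front of the ```for``` part and drops the explicit ```append```.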


Here is another example that follows the same shape. The input list is ```numberlist``` and the output list will contain ```num``` squared for each entry ```num``` in ```numberlist```:

In [2]:
numberlist = [1, 2, 3, 4, 5]
[num ** 2 for num in numberlist]

[1, 4, 9, 16, 25]

***Try it for yourself:***

1. Here is a passage from the Wikipedia page about Python: ```"""Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural,) object-oriented, and functional programming."""```
Split this passage on whitespace to obtain a list. Then use a list comprehension to strip all occurrences of periods, commas, and opening and closing parentheses from the beginnings and ends of the words in the list. You can use the method ```strip()``` for this.

2. Here is a list of words (the beginning of a poem from Alice in Wonderland): ```["how", "doth", "the", "little", "crocodile"]``` Use a list comprehension to put one whitespace before and one whitespace after each word on the list, creating a new list that should look like this: ```[" how ", " doth ", " the ", " little ", " crocodile "]``` You can use ```+``` to concatenate strings, for example ```"a" + "b"``` gives ```"ab"```.

3. Here is another sentence from the Wikipedia page about Python: ```"""Python is an interpreted, high-level, general-purpose programming language."""``` Split this string on whitespace to obtain a list of words. Then use a list comprehension to invert every word in the list, obtaining a list that starts ```["nyhtyP", "si", ...]``` You can use slices to invert a string, like this:

In [3]:
"palindrome"[::-1]

'emordnilap'

In [4]:
# your code here

4. Here is a list of pairs of numbers: ```[(1, 10), (2, 20), (3, 30)]``` Use a list comprehension to multiply the two numbers in each pair to obtain a list of products. That is, you should obtain the list ```[10, 40, 90]```.

In [5]:
# your code here

By the way, if you have two lists of equal length, you can use ```zip()``` to make this list of pairs:

In [6]:
mylist1 = [1,2,3]
mylist2 = [10, 20, 30]
list(zip(mylist1, mylist2))

[(1, 10), (2, 20), (3, 30)]
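The "equal length" condition matters because ```zip()``` stops as soon as the shorter of its inputs runs out. A small sketch of this behavior, where the names ```pairs``` and ```shorter``` are chosen just for this illustration:

```python
mylist1 = [1, 2, 3]
mylist2 = [10, 20, 30]

# Equal lengths: every item finds a partner.
pairs = list(zip(mylist1, mylist2))
print(pairs)

# Unequal lengths: zip() silently stops at the end of the shorter list,
# so the 3 from the first list is dropped.
shorter = list(zip([1, 2, 3], [10, 20]))
print(shorter)
```

No error is raised in the second case, so it is worth checking the lengths yourself if a missing pair would be a problem.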

List comprehensions are for transforming one list into another. So which of the following tasks can be done using a list comprehension?

1. Take a list of words, and make a list of the same words, just uppercased
2. Determine the number that is the sum of all the numbers in a number list
3. Determine the number of items in a list
4. Take a list of numbers, and make a list that has the square root of each number in the first list
5. Take a sentence given as a string, and word-tokenize it to obtain a list

# List comprehensions with "if"

Here is another example of a list comprehension. Again, the list comprehension is the third statement in the cell. (The first statement holds a sentence from The War of the Worlds, already transformed into a word list.)

In [7]:
mylist = ['for', 'a', 'minute', 'he', 'scarcely', 'realised', 'what', 'this', 'meant', 'and', 'although', 'the', 'heat', 'was', 'excessive', 'he', 'clambered', 'down', 'into', 'the', 'pit', 'close', 'to', 'the', 'bulk', 'to', 'see', 'the', 'thing', 'more', 'clearly']

mystopwords = ["the", "a", "to", "for", "he", "she", "it", "what", "and"]

[w for w in mylist if w not in mystopwords]

['minute',
 'scarcely',
 'realised',
 'this',
 'meant',
 'although',
 'heat',
 'was',
 'excessive',
 'clambered',
 'down',
 'into',
 'pit',
 'close',
 'bulk',
 'see',
 'thing',
 'more',
 'clearly']

Again, the list comprehension transformed a list (```mylist```) into another list. This time, the resulting list was shorter than the original: The list comprehension has *filtered* the list.

The input list is ```mylist```, as you can see in the for-loop part of the list comprehension:

```for w in mylist```

Now there is an extra component after the for-loop part: a condition. Here it reads ```if w not in mystopwords```.

Putting things together, this list comprehension only retains those members ```w``` in ```mylist``` for which it is true that ```w not in mystopwords```.
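Written as an ordinary loop, the condition becomes an ```if``` statement inside the loop body. The sketch below uses only the first few words of the sentence, and the names ```kept``` and ```kept_loop``` are made up for this illustration:

```python
mylist = ["for", "a", "minute", "he", "scarcely", "realised"]
mystopwords = ["the", "a", "to", "for", "he", "she", "it", "what", "and"]

# The filtering list comprehension from the text.
kept = [w for w in mylist if w not in mystopwords]

# The same filter as a for-loop: only append w when the condition holds.
kept_loop = []
for w in mylist:
    if w not in mystopwords:
        kept_loop.append(w)

print(kept)
print(kept == kept_loop)
```

Items for which the condition is false are simply skipped, which is why the result can be shorter than the input list.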

***Try it for yourself:***

1. Here is a list of numbers: ```[4,2,7,6,9]``` Use a list comprehension to only retain the even numbers from the list. You can test whether a number is even by using ```%```, modulo:

In [8]:
print("5 modulo three is", 5%3)
print("5 modulo two is", 5%2)
print("For all even numbers, the modulo two is zero. Here is 4 modulo 2:", 4%2,
     "and 3 modulo 2:", 3%2)
print("Is 42 even?", 42 % 2 == 0)

5 modulo three is 2
5 modulo two is 1
For all even numbers, the modulo two is zero. Here is 4 modulo 2: 0 and 3 modulo 2: 1
Is 42 even? True


In [9]:
# your code here

2. Using the list ```mylist``` again (the words of a sentence from The War of the Worlds), use a list comprehension to retain only the words that are longer than 3 letters.

In [10]:
mylist = ['for', 'a', 'minute', 'he', 'scarcely', 'realised', 'what', 'this', 'meant', 'and', 'although', 'the', 'heat', 'was', 'excessive', 'he', 'clambered', 'down', 'into', 'the', 'pit', 'close', 'to', 'the', 'bulk', 'to', 'see', 'the', 'thing', 'more', 'clearly']
# your code here

3. Here is a list of numbers encoded as strings: ```["123", "50.3", "1970"]``` Use a list comprehension to turn this into a list of numbers encoded as numbers. You can use the function ```float()``` to turn a string into a floating point number, provided the string describes a floating point number:

In [11]:
float("3.1415")

3.1415

In [12]:
# your code here

# Combining transformation and filtering

Our first group of list comprehensions transformed each member of a list. The second group filtered list members. You can do both at the same time. Here is a list comprehension that, given a list of words, discards the stopwords and lowercases the rest. We apply it to a passage from the novel Dracula by Bram Stoker:

In [13]:
import nltk
stopwords = nltk.corpus.stopwords.words("english")

inputtext = """'The night is chill, mein Herr, and my master the Count bade me take all
care of you. There is a flask of slivovitz (the plum brandy of the
country) underneath the seat, if you should require it.' I did not take
any, but it was a comfort to know it was there all the same. I felt a
little strangely, and not a little frightened. I think had there been
any alternative I should have taken it, instead of prosecuting that
unknown night journey. The carriage went at a hard pace straight along,
then we made a complete turn and went along another straight road. It
seemed to me that we were simply going over and over the same ground
again; and so I took note of some salient point, and found that this was
so. I would have liked to have asked the driver what this all meant, but
I really feared to do so, for I thought that, placed as I was, any
protest would have had no effect in case there had been an intention to
delay. By-and-by, however, as I was curious to know how time was
passing, I struck a match, and by its flame looked at my watch; it was
within a few minutes of midnight. This gave me a sort of shock, for I
suppose the general superstition about midnight was increased by my
recent experiences. I waited with a sick feeling of suspense.
 """

words = nltk.word_tokenize(inputtext)

[w.lower() for w in words if w not in stopwords]

["'the",
 'night',
 'chill',
 ',',
 'mein',
 'herr',
 ',',
 'master',
 'count',
 'bade',
 'take',
 'care',
 '.',
 'there',
 'flask',
 'slivovitz',
 '(',
 'plum',
 'brandy',
 'country',
 ')',
 'underneath',
 'seat',
 ',',
 'require',
 '.',
 "'",
 'i',
 'take',
 ',',
 'comfort',
 'know',
 '.',
 'i',
 'felt',
 'little',
 'strangely',
 ',',
 'little',
 'frightened',
 '.',
 'i',
 'think',
 'alternative',
 'i',
 'taken',
 ',',
 'instead',
 'prosecuting',
 'unknown',
 'night',
 'journey',
 '.',
 'the',
 'carriage',
 'went',
 'hard',
 'pace',
 'straight',
 'along',
 ',',
 'made',
 'complete',
 'turn',
 'went',
 'along',
 'another',
 'straight',
 'road',
 '.',
 'it',
 'seemed',
 'simply',
 'going',
 'ground',
 ';',
 'i',
 'took',
 'note',
 'salient',
 'point',
 ',',
 'found',
 '.',
 'i',
 'would',
 'liked',
 'asked',
 'driver',
 'meant',
 ',',
 'i',
 'really',
 'feared',
 ',',
 'i',
 'thought',
 ',',
 'placed',
 'i',
 ',',
 'protest',
 'would',
 'effect',
 'case',
 'intention',
 'delay',
 '.',


# List comprehensions and Pandas

When you have your data in a Pandas DataFrame, you often need to transform one column to produce another. For example, you may have individual snippets of text in a Pandas column and want to transform them into a column that holds lists of words.

Then Pandas columns and list comprehensions are a winning combination! Here is an example. This code takes a Pandas column containing sentences, and transforms it into a new column that has lists of words:

In [14]:
import pandas as pd

mydf = pd.DataFrame( {"line" : ["How doth the little crocodile",
                                "Improve his shining tail", 
                                "And pour the waters of the Nile",
                                "On every golden scale!",
                                "How cheerfully he seems to grin",
                                "How neatly spreads his claws,",
                                "And welcomes little fishes in",
                                "With gently smiling jaws"]})
mydf

|   | line |
|---|------|
| 0 | How doth the little crocodile |
| 1 | Improve his shining tail |
| 2 | And pour the waters of the Nile |
| 3 | On every golden scale! |
| 4 | How cheerfully he seems to grin |
| 5 | How neatly spreads his claws, |
| 6 | And welcomes little fishes in |
| 7 | With gently smiling jaws |


In [15]:
# remember to use the dictionary-key-like notation, mydf["..."],
# when making a new column
mydf["tokenized"] = [nltk.word_tokenize(s) for s in mydf.line]
mydf

|   | line | tokenized |
|---|------|-----------|
| 0 | How doth the little crocodile | [How, doth, the, little, crocodile] |
| 1 | Improve his shining tail | [Improve, his, shining, tail] |
| 2 | And pour the waters of the Nile | [And, pour, the, waters, of, the, Nile] |
| 3 | On every golden scale! | [On, every, golden, scale, !] |
| 4 | How cheerfully he seems to grin | [How, cheerfully, he, seems, to, grin] |
| 5 | How neatly spreads his claws, | [How, neatly, spreads, his, claws, ,] |
| 6 | And welcomes little fishes in | [And, welcomes, little, fishes, in] |
| 7 | With gently smiling jaws | [With, gently, smiling, jaws] |
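A list comprehension is not the only way to derive one column from another. Pandas also offers the ```apply``` method on a column, which calls a function on every entry. The sketch below shows both side by side on a shortened version of the poem, using the number of characters per line as the transformation; the column names ```n_chars``` and ```n_chars_apply``` are made up for this illustration:

```python
import pandas as pd

mydf = pd.DataFrame({"line": ["How doth the little crocodile",
                              "Improve his shining tail"]})

# List comprehension over the column, in the same style as above.
mydf["n_chars"] = [len(s) for s in mydf["line"]]

# The same transformation with Series.apply, which calls len()
# on each entry of the column.
mydf["n_chars_apply"] = mydf["line"].apply(len)

print(mydf)
```

Both columns hold the same values. The list comprehension has the advantage that it looks exactly like the comprehensions used on plain lists, so nothing new has to be learned.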


**Try it for yourself:**

1. Use a list comprehension to add a column that contains the number of words for each line.

In [16]:
# your code here

2. Change the list comprehension from above that we used to make the column ```tokenized``` such that it lowercases the string before tokenizing.