{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Defining your own functions\n",
"\n",
"You have seen builtin functions, like type(), int(), str(), and functions from packages, like math.sqrt(). In Python, you can also define functions yourself. This is useful when you need to do the same thing (or something very similar) again and again. Suppose that you are in a country where temperature is given in Celsius, but you are more familiar with Fahrenheit. Here is how you can figure out that 20 degrees Celsius are 68 degrees Fahrenheit, and 30 degrees Celsius are 86 degrees Fahrenheit:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"68.0\n"
]
}
],
"source": [
"print( (20 * 9/5) + 32)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"86.0\n"
]
}
],
"source": [
"print( (30 * 9/5) + 32)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need to convert temperatures very often, this gets tedious. Instead, you can define your own function by giving a name to a piece of code:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def celsius_2_fahrenheit(temp):\n",
" fahrenheit = temp * 9/5 + 32\n",
" return fahrenheit"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# here is a variant without the variable \"fahrenheit\"\n",
"def celsius_2_fahrenheit_variant(temp):\n",
" return (temp * 9/5) + 32"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After that, you can use ```celsius_2_fahrenheit``` like any built-in function:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100.4"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"celsius_2_fahrenheit(38)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"32.0"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"celsius_2_fahrenheit(0)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"212.0"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"celsius_2_fahrenheit(100)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"float"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fahrenheit_val = celsius_2_fahrenheit(25)\n",
"type(fahrenheit_val)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"77.0\n"
]
}
],
"source": [
"print(fahrenheit_val)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# if you call your function with a parameter value\n",
"# that doesn't match what you do with the parameter\n",
"# in the function, you get an error, as usual\n",
"\n",
"# celsius_2_fahrenheit(\"ten\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# \"ten\" * 9"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# 'tententententententententen' / 5"
]
},
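{
"cell_type": "markdown",
"metadata": {},
"source": [
"The commented-out steps above trace why the call fails. If you would like to see the error message without stopping the notebook, one possible sketch is to wrap the call in ```try``` / ```except```. The function is repeated here so that the example is self-contained, and the variable name ```err``` is only illustrative:\n",
"\n",
"```python\n",
"def celsius_2_fahrenheit(temp):\n",
"    fahrenheit = temp * 9/5 + 32\n",
"    return fahrenheit\n",
"\n",
"try:\n",
"    celsius_2_fahrenheit(\"ten\")\n",
"except TypeError as err:\n",
"    # \"ten\" * 9 succeeds, but dividing the resulting string by 5 raises TypeError\n",
"    print(\"TypeError:\", err)\n",
"```"
]
},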
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Input and output\n",
"\n",
"A function has an input and an output. You know that from a builtin function like ```len()```: It takes as input an object, for example a list. You put the input between the parentheses. And it returns an output, the length of the object. You can, for example, store that output in a variable:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"17"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(\"how are you doing\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the output of len that I stored is 3\n"
]
}
],
"source": [
"mylist = [\"a\", \"b\", \"c\"]\n",
"somevariable = len(mylist)\n",
"print(\"the output of len that I stored is\", somevariable)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same is true for functions that you define yourself. You define the inputs in parentheses after the function name. The function name is ```celsius_2_fahrenheit```, and in parentheses after that you see ```(temp)```. What is ```temp```? It is just a variable name. When you call a pre-defined function, you give it an actual input value. When you define your own function, you just prepare a container, a variable name, that will store the input. In this case, that container is ```temp```.\n",
"\n",
"The output of the function is what you see after the keyword ```return```: It is the value currently stored in ```fahrenheit```.\n",
"\n",
"So when we call our self-defined function with a different input, for example 100 (that is, 100 degrees celsius), the output that we get is 212:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"212.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"celsius_2_fahrenheit(100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The variable ```temp``` in the function definition is used in much the same way as in function definitions in mathematics:\n",
"\n",
"f(x) = x+1\n",
"\n",
"There, you don't have to specify beforehand what x is. Rather, for each x, the function value is f(x) = x+1. In the same way, when you call your function saying celsius_2_fahrenheit(20), then at this time, Python stores the value 20 in the variable ```temp``` and executes the code of the function with ```temp``` set to 20.\n"
]
},
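{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of this analogy, the mathematical function f(x) = x+1 could be written in Python roughly as follows. The name ```f``` is chosen here only to mirror the formula:\n",
"\n",
"```python\n",
"def f(x):\n",
"    # x receives a new value at every call, just like temp does\n",
"    return x + 1\n",
"\n",
"print(f(1))   # prints 2\n",
"print(f(41))  # prints 42\n",
"```"
]
},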
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The components of a function definition\n",
"\n",
"Let's take a look at all the bits and pieces of this function definition:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def celsius_2_fahrenheit(temp):\n",
" fahrenheit = temp * 9/5 + 32\n",
" return fahrenheit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* ```def``` is a reserved word. It tells Python that you are defining a function.\n",
"\n",
"* The name of the function is ```celsius_2_fahrenheit```. This is a variable that you define, just this time it is a container that contains Python code, not a number or a string. The names of functions are subject to the same restrictions as other Python variable names.\n",
"\n",
"* After the name of the function, you see the input to the function, also called its argument, in round brackets. This is a variable name: the container that we have prepared to take the actual input when the function is called.\n",
"\n",
"* After the argument there is a colon. \n",
"* Then come some indented lines, which define what the function does."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a\n",
"bc\n"
]
}
],
"source": [
"# compare to:\n",
"for i in [\"a\", \"b\" \"c\"]:\n",
" print(i)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we have an overall shape of \n",
"\n",
"```keyword something:\n",
" indented code```\n",
" \n",
"which is the same as, for example, in for-loops and in if-conditions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Try it for yourself:**\n",
"\n",
"* Define a function that converts degrees Fahrenheit to degrees Celsius (for the benefit of Europeans like me). \n",
"\n",
"* Define a function that takes a string as input, then strips all punctuation symbols from the beginning and end and lowercases. For example, when the input is \"Hello!\" the output should be \"hello\", and when the input is \"???WHAT???\", the output should be \"what\". "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# more about \"return\"\n",
"\n",
"\"return\" means: hand back this value, and you're done. Any line after this one in the function will not be executed anymore:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"about to return value\n"
]
},
{
"data": {
"text/plain": [
"68.0"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def c2f(temp):\n",
" fahrenheit = (temp * 9/5) + 32\n",
" print(\"about to return value\")\n",
" return fahrenheit\n",
" print(\"after returning value\")\n",
"\n",
"c2f(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But you can, of course, use \"return\" in an \"if\" statement:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"aardvark\n",
"ZEBRA\n"
]
}
],
"source": [
"def maybe_lowercase(word):\n",
" if word.startswith(\"A\"):\n",
" return word.lower()\n",
" else:\n",
" return word\n",
"\n",
"print(maybe_lowercase(\"AARDVARK\"))\n",
"print(maybe_lowercase(\"ZEBRA\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Returning lists\n",
"\n",
"The output, what you write after the keyword ```return```, can be a single value: a number, as in the Celsius-to-Fahrenheit example, or a string, as in the stripping-punctuation-and-lowercasing example. The output of a function can also be a list. Here is a function that takes a string as input, tokenizes it, lowercases, and removes punctuation:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import string\n",
"string.punctuation"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'hello'"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"hello!!!\".strip(string.punctuation)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = \"!!?!!\"\n",
"w.strip(string.punctuation) == \"\""
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['what', 'a', 'lovely', 'day']"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import nltk\n",
"import string\n",
"\n",
"def my_preprocess(inputstring):\n",
" words = nltk.word_tokenize(inputstring)\n",
" newwords = [w.lower() for w in words if w.strip(string.punctuation) != \"\"]\n",
" return newwords\n",
"\n",
"my_preprocess(\"What a lovely day!!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you want to return multiple pieces of data, you can also return them in a list, or in a tuple.\n",
"\n",
"You can then catch the multiple return values in multiple separate variables:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['what', 'a', 'lovely', 'day']\n",
"['!', '!']\n"
]
}
],
"source": [
"\n",
"def my_preprocess(inputstring):\n",
" words = nltk.word_tokenize(inputstring)\n",
" newwords = [w.lower() for w in words if w.strip(string.punctuation) != \"\"]\n",
" punctuations = [w for w in words if w.strip(string.punctuation) == \"\"]\n",
" # this makes a tuple of words without punctuation, punctuation symbols\n",
" return (newwords, punctuations)\n",
"\n",
"words, punctuationsymbols = my_preprocess(\"What a lovely day!!\")\n",
"print(words)\n",
"print(punctuationsymbols)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(['what', 'a', 'lovely', 'day'], ['!', '!'])"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_preprocess(\"What a lovely day!!\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Try it for yourself:**\n",
"\n",
"* Define a function that takes as input a string, splits it on whitespace, removes every word that starts with \"http\", and returns the remaining list of words. So if the input is ```\"Go to https:\\\\abc.def.gh\"```, the output should be ```[\"Go\", \"to\"]```\n",
"\n",
"* Define a function that takes as input a string of numbers separated by whitespace, for example ```\"123 3.4 67.9\"```. It should return a list with the numbers that were in the string, in our case ```[123, 3.4, 67.9]```.`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Functions with multiple arguments\n",
"\n",
"A function can have more than one piece of input. Here is an example. This function repeats a string a given number of times:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'xxxxxxxxxx'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"x\" * 10"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'xy'"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"x\" + \"y\""
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'hellohellohellohellohello'"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def repeat_string(somestring, numtimes):\n",
" return somestring * numtimes\n",
"\n",
"repeat_string(\"hello\", 5)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'hellohello'"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"repeat_string(\"hello\", 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optional arguments\n",
"\n",
"You can define a function with optional arguments. For an argument that is optional, you give a *default* value in the definition. When calling the function, you can either keep or overwrite the default."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"def repeat_string_again(somestring, numtimes = 5):\n",
" return somestring * numtimes"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'hellohellohellohellohellohellohellohellohellohello'"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"repeat_string_again(\"hello\", 10)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'hellohellohellohellohello'"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"repeat_string_again(\"hello\")"
]
},
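{
"cell_type": "markdown",
"metadata": {},
"source": [
"The calls above pass the optional value by position. As a small sketch of the alternatives, the default can also be overridden by naming the parameter in the call, which Python calls a keyword argument. The function is repeated so that the example is self-contained, and the input ```\"ab\"``` is only illustrative:\n",
"\n",
"```python\n",
"def repeat_string_again(somestring, numtimes=5):\n",
"    return somestring * numtimes\n",
"\n",
"# override the default by position\n",
"print(repeat_string_again(\"ab\", 2))           # prints abab\n",
"# override the default by name (keyword argument)\n",
"print(repeat_string_again(\"ab\", numtimes=3))  # prints ababab\n",
"# keep the default of 5\n",
"print(repeat_string_again(\"ab\"))              # prints ababababab\n",
"```"
]
},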
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Important: When you use optional arguments, you have to put all the non-optional arguments before the optional ones in your function definition.**`"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# this will not work\n",
"# error: SyntaxError: non-default argument follows default argument\n",
"\n",
"# def repeat_string_withbug(numtimes = 5, somestring):\n",
"# return somestring * numtimes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Put all values that the function uses in the list of input parameters\n",
"\n",
"Make your function so that its list of input parameters is its only \"service window\", the only way to pass in values. "
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# in principle you can do this\n",
"\n",
"numtimes_baddesign = 5\n",
"\n",
"def repeatstring(s):\n",
" return s * numtimes_baddesign\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# but please do this\n",
"\n",
"def repeatstring_better(s, numtimes):\n",
" return s * numtimes"
]
},
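{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why the first version is fragile, here is a small sketch that repeats both definitions and then changes the outside variable. The reassignment to 2 is an assumption made only for this demonstration; in a real notebook it could happen in any other cell:\n",
"\n",
"```python\n",
"numtimes_baddesign = 5\n",
"\n",
"def repeatstring(s):\n",
"    # reads a variable that is not in the parameter list\n",
"    return s * numtimes_baddesign\n",
"\n",
"def repeatstring_better(s, numtimes):\n",
"    # everything the result depends on is passed in explicitly\n",
"    return s * numtimes\n",
"\n",
"print(repeatstring(\"ab\"))            # prints ababababab\n",
"numtimes_baddesign = 2               # changed somewhere else\n",
"print(repeatstring(\"ab\"))            # prints abab: same call, different result\n",
"print(repeatstring_better(\"ab\", 5))  # prints ababababab, whatever the outside variable is\n",
"```"
]
},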
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using functions when transforming Pandas columns\n",
"\n",
"In the Python list comprehensions notebook, I said that list comprehensions are great for transforming a Pandas dataframe column. Here is the example from that notebook again: "
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" line | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" How doth the little crocodile | \n",
"
\n",
" \n",
" 1 | \n",
" Improve his shining tail | \n",
"
\n",
" \n",
" 2 | \n",
" And pour the waters of the Nile | \n",
"
\n",
" \n",
" 3 | \n",
" On every golden scale! | \n",
"
\n",
" \n",
" 4 | \n",
" How cheerfully he seems to grin | \n",
"
\n",
" \n",
" 5 | \n",
" How neatly spreads his claws, | \n",
"
\n",
" \n",
" 6 | \n",
" And welcomes little fishes in | \n",
"
\n",
" \n",
" 7 | \n",
" With gently smiling jaws | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" line\n",
"0 How doth the little crocodile\n",
"1 Improve his shining tail\n",
"2 And pour the waters of the Nile\n",
"3 On every golden scale!\n",
"4 How cheerfully he seems to grin\n",
"5 How neatly spreads his claws,\n",
"6 And welcomes little fishes in\n",
"7 With gently smiling jaws"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"mydf = pd.DataFrame( {\"line\" : [\"How doth the little crocodile\",\n",
" \"Improve his shining tail\", \n",
" \"And pour the waters of the Nile\",\n",
" \"On every golden scale!\",\n",
" \"How cheerfully he seems to grin\",\n",
" \"How neatly spreads his claws,\",\n",
" \"And welcomes little fishes in\",\n",
" \"With gently smiling jaws\"]})\n",
"\n",
"mydf"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" line | \n",
" tokenized | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" How doth the little crocodile | \n",
" [How, doth, the, little, crocodile] | \n",
"
\n",
" \n",
" 1 | \n",
" Improve his shining tail | \n",
" [Improve, his, shining, tail] | \n",
"
\n",
" \n",
" 2 | \n",
" And pour the waters of the Nile | \n",
" [And, pour, the, waters, of, the, Nile] | \n",
"
\n",
" \n",
" 3 | \n",
" On every golden scale! | \n",
" [On, every, golden, scale, !] | \n",
"
\n",
" \n",
" 4 | \n",
" How cheerfully he seems to grin | \n",
" [How, cheerfully, he, seems, to, grin] | \n",
"
\n",
" \n",
" 5 | \n",
" How neatly spreads his claws, | \n",
" [How, neatly, spreads, his, claws, ,] | \n",
"
\n",
" \n",
" 6 | \n",
" And welcomes little fishes in | \n",
" [And, welcomes, little, fishes, in] | \n",
"
\n",
" \n",
" 7 | \n",
" With gently smiling jaws | \n",
" [With, gently, smiling, jaws] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" line tokenized\n",
"0 How doth the little crocodile [How, doth, the, little, crocodile]\n",
"1 Improve his shining tail [Improve, his, shining, tail]\n",
"2 And pour the waters of the Nile [And, pour, the, waters, of, the, Nile]\n",
"3 On every golden scale! [On, every, golden, scale, !]\n",
"4 How cheerfully he seems to grin [How, cheerfully, he, seems, to, grin]\n",
"5 How neatly spreads his claws, [How, neatly, spreads, his, claws, ,]\n",
"6 And welcomes little fishes in [And, welcomes, little, fishes, in]\n",
"7 With gently smiling jaws [With, gently, smiling, jaws]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Then we added a new column with \n",
"# tokenized lines like this, using\n",
"# a list comprehension:\n",
"mydf[\"tokenized\"] = [nltk.word_tokenize(s) for s in mydf.line]\n",
"mydf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But not all column transformations are this simple. Say we want to do more preprocessing: Tokenize the line, lowercase, and remove stopwords.\n",
"\n",
"This is clunky when expressed in a single list comprehension. But we can define a function that does such a transformation, then use that in a list comprehension:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" line | \n",
" tokenized | \n",
" preprocessed | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" How doth the little crocodile | \n",
" [How, doth, the, little, crocodile] | \n",
" [doth, little, crocodile] | \n",
"
\n",
" \n",
" 1 | \n",
" Improve his shining tail | \n",
" [Improve, his, shining, tail] | \n",
" [improve, shining, tail] | \n",
"
\n",
" \n",
" 2 | \n",
" And pour the waters of the Nile | \n",
" [And, pour, the, waters, of, the, Nile] | \n",
" [pour, waters, nile] | \n",
"
\n",
" \n",
" 3 | \n",
" On every golden scale! | \n",
" [On, every, golden, scale, !] | \n",
" [every, golden, scale, !] | \n",
"
\n",
" \n",
" 4 | \n",
" How cheerfully he seems to grin | \n",
" [How, cheerfully, he, seems, to, grin] | \n",
" [cheerfully, seems, grin] | \n",
"
\n",
" \n",
" 5 | \n",
" How neatly spreads his claws, | \n",
" [How, neatly, spreads, his, claws, ,] | \n",
" [neatly, spreads, claws, ,] | \n",
"
\n",
" \n",
" 6 | \n",
" And welcomes little fishes in | \n",
" [And, welcomes, little, fishes, in] | \n",
" [welcomes, little, fishes] | \n",
"
\n",
" \n",
" 7 | \n",
" With gently smiling jaws | \n",
" [With, gently, smiling, jaws] | \n",
" [gently, smiling, jaws] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" line tokenized \\\n",
"0 How doth the little crocodile [How, doth, the, little, crocodile] \n",
"1 Improve his shining tail [Improve, his, shining, tail] \n",
"2 And pour the waters of the Nile [And, pour, the, waters, of, the, Nile] \n",
"3 On every golden scale! [On, every, golden, scale, !] \n",
"4 How cheerfully he seems to grin [How, cheerfully, he, seems, to, grin] \n",
"5 How neatly spreads his claws, [How, neatly, spreads, his, claws, ,] \n",
"6 And welcomes little fishes in [And, welcomes, little, fishes, in] \n",
"7 With gently smiling jaws [With, gently, smiling, jaws] \n",
"\n",
" preprocessed \n",
"0 [doth, little, crocodile] \n",
"1 [improve, shining, tail] \n",
"2 [pour, waters, nile] \n",
"3 [every, golden, scale, !] \n",
"4 [cheerfully, seems, grin] \n",
"5 [neatly, spreads, claws, ,] \n",
"6 [welcomes, little, fishes] \n",
"7 [gently, smiling, jaws] "
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import nltk\n",
"stopwords = nltk.corpus.stopwords.words(\"english\")\n",
"\n",
"def my_preprocess(line):\n",
" words = nltk.word_tokenize(line)\n",
" words1 = [w.lower() for w in words]\n",
" words2 = [w for w in words1 if w not in stopwords]\n",
" return words2\n",
"\n",
"\n",
"mydf[\"preprocessed\"] = [my_preprocess(line) for line in mydf.line]\n",
"mydf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Try it for yourself**: \n",
"\n",
"* Can you change the function ```my_preprocess()``` such that it also filters away words that consist solely of punctuation?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}