Content from Introduction
Last updated on 2024-07-11
Overview
Questions
- What is programming?
- How do I document code?
- How do I find reliable and safe resources or code online?
Objectives
- identify basic concepts in programming
Programming in Python
In most general terms, programming is the process of writing instructions for a computer. In this course we will be using Python as the language to communicate with the computer.
Strictly speaking, Python is an interpreted language, rather than a compiled language, meaning we are not communicating directly with the computer when we use Python. When we run Python code, our Python source code is first translated into byte code, which is then executed by the Python virtual machine.
Programming is a wide topic including a variety of techniques and tools. In this course we’ll be focusing on programming for statistical analysis.
IDEs
IDE stands for Integrated Development Environment. IDEs are where you will write, edit, and debug Python scripts, so you want to choose one that makes you feel comfortable and includes the functionality that you need. Some open-source IDEs for Python include JupyterLab and Visual Studio Code.
Packages
Packages, or libraries, are extensions to the base programming language. They contain code, data, and documentation in a standardised collection format that can be installed by users, typically via a centralised software repository. A typical Python workflow will use base Python (the core operations and functions provided by your Python installation) as well as specialised data analysis and scientific packages like NumPy, SciPy and pandas.
Best Practices
Let’s review some basic concepts that any programmer should keep in mind.
Documentation
Have you ever returned to a task and tried to read a note that you quickly scrawled for yourself the last time you were working on it? Have you ever inherited a project from a colleague and found you have no idea what remains to be done?
It can be very challenging to return to your own work or a colleague’s and this goes doubly for programming. Documentation is one way we can reduce the burden on future selves and our colleagues.
Inline Documentation
As a new programmer, inline documentation can be the most helpful. Inline documentation refers to writing comments on the same line as your code. For example, if we wrote a line of code to sum 1+1, we might document it as follows:
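A minimal sketch of such a comment, assuming the result is stored in a variable called total (a name chosen here purely for illustration):

```python
total = 1 + 1  # add 1 and 1 together and store the result in `total`
```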
Although this is a very simple line of code and it might seem like overkill to document it in this way, these types of comments can be very helpful in jogging your memory when returning to a project. Inline comments can also help you to break multi-step programs into digestible and readable pieces.
External Documentation
Sometimes you require more detail than you can comfortably fit in your inline documentation. In this case it can be helpful to create separate files to document your project. This type of documentation will typically focus on the goals, scope, and any special instructions relating to your project rather than the details of your code. The most common type of external documentation is a README file. It is best practice to create a basic README file for any project. A basic README should include:
- a brief description of the project,
- any special instructions for installation or use,
- the authors and any references.
README files are just text files, and it is best practice to save your README file as a README.md Markdown document. This file format is automatically recognised by code repositories like GitHub, so your README contents are displayed alongside your code repository.
DocStrings
In chapter 7: functions we’ll learn about documentation specific to functions known as DocStrings.
Getting Help
Later on, in chapter 10: Errors and Exceptions we will cover errors in more detail. However, before we get there it’s very likely you’ll need some assistance writing Python code.
Built-in Help
There is a help function built into base Python. You can use it to investigate built-in functions, data types, and more. For example, say we want to know more about the print() function in Python:
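A call along these lines would display that documentation; the exact wording of the help text can differ between Python versions:

```python
# Ask Python for the documentation of the built-in print() function
help(print)
```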
OUTPUT
Help on built-in function print in module builtins:
print(...)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.
-- More --
Finding Resources online
Stack Overflow is a valuable resource for programmers of all levels. It can be daunting to post your own question! Fortunately, chances are someone else has already asked a similar question!
The Official Python Documentation is another great resource.
It can also be helpful to do a general search for a particular topic or error message. It’s very likely the first few results will be from Stack Overflow, followed by a few from official documentation, and then you may start seeing results from personal blogs or third parties. These third-party results can sometimes be valuable, but we should be cautious! Here are a few things to keep in mind when you are looking for online resources:
- Don’t download or install anything unless you are certain of what it is and why you need it.
- Don’t copy or run code unless you fully understand what it does.
- Python is an open-source language; official documentation and resources will not be behind a paywall.
- You may not find a resource or solution to fit your exact needs. Try to be flexible and adapt online solutions to fit your needs.
Key Points
- Python is an interpreted language.
- Code is commonly developed inside an integrated development environment.
- A typical Python workflow uses base Python and additional Python packages developed for statistical programming purposes.
- In-line and external documentation helps ensure that your code is readable.
- You can find help through the built-in help function and external resources.
Content from Python Fundamentals
Last updated on 2024-07-11
Overview
Questions
- What basic data types can I work with in Python?
- How can I create a new variable in Python?
- How do I use a function?
- Can I change the value associated with a variable after I create it?
Objectives
- Assign values to variables.
Variables
Any Python interpreter can be used as a calculator:
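As a small sketch, the expression below is one calculation that evaluates to 23; the particular expression is an assumption chosen to illustrate the point:

```python
# Multiplication is evaluated before addition: 5 * 4 = 20, then 3 + 20
result = 3 + 5 * 4
print(result)
```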
OUTPUT
23
This is great but not very interesting. To do anything useful with data, we need to assign its value to a variable. In Python, we can assign a value to a variable using the equals sign =. For example, we can track the weight of a patient who weighs 60 kilograms by assigning the value 60 to a variable weight_kg:
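A minimal sketch of that assignment:

```python
# Attach the name weight_kg to the integer value 60
weight_kg = 60
```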
From now on, whenever we use weight_kg, Python will substitute the value we assigned to it. In layperson’s terms, a variable is a name for a value.
In Python, variable names:
- can include letters, digits, and underscores
- cannot start with a digit
- are case sensitive.
This means that, for example:
- weight0 is a valid variable name, whereas 0weight is not
- weight and Weight are different variables
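One way to check these naming rules yourself is the string method isidentifier(), sketched below. The values 60 and 75 are arbitrary, and note that isidentifier() also accepts reserved words such as for, which cannot be used as variable names.

```python
# str.isidentifier() reports whether a string follows Python's naming rules
print('weight0'.isidentifier())   # starts with a letter
print('0weight'.isidentifier())   # starts with a digit

# Names are case sensitive, so these are two separate variables
weight = 60
Weight = 75
print(weight, Weight)
```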
Types of data
Python knows various types of data. Three common ones are:
- integer numbers
- floating point numbers, and
- strings.
In the example above, variable weight_kg has an integer value of 60. If we want to more precisely track the weight of our patient, we can use a floating point value by executing:
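A sketch of that statement, using the value 60.3 that appears in the outputs later in this episode:

```python
# Re-assign weight_kg to a floating point value
weight_kg = 60.3
```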
To create a string, we add single or double quotes around some text. To identify and track a patient throughout our study, we can assign each person a unique identifier by storing it in a string:
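For instance, assuming the identifier is stored in a variable named patient_id (the name is an assumption for illustration):

```python
# Quotes make this a string, so the leading zeros are preserved
patient_id = '001'
```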
Using Variables in Python
Once we have data stored with variable names, we can make use of it in calculations. We may want to store our patient’s weight in pounds as well as kilograms:
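A short sketch of that calculation, using the conversion factor of 2.2 pounds per kilogram quoted later in this episode:

```python
weight_kg = 60.3

# Multiply the weight in kilograms by 2.2 to convert it to pounds
weight_lb = 2.2 * weight_kg
```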
We might decide to add a prefix to our patient identifier:
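A sketch of adding such a prefix with string concatenation, assuming the identifier is held in a variable named patient_id:

```python
patient_id = '001'

# The + operator joins two strings end to end
patient_id = 'inflam_' + patient_id
```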
Built-in Python functions
To carry out common tasks with data and variables in Python, the language provides us with several built-in functions. To display information to the screen, we use the print function:
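A sketch of two such calls, with the variables re-created here so the example runs on its own:

```python
weight_lb = 2.2 * 60.3
patient_id = 'inflam_001'

# Each call to print() displays one value on its own line
print(weight_lb)
print(patient_id)
```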
OUTPUT
132.66
inflam_001
When we want to make use of a function, referred to as calling the function, we follow its name by parentheses. The parentheses are important: if you leave them off, the function doesn’t actually run! Sometimes you will include values or variables inside the parentheses for the function to use. In the case of print, we use the parentheses to tell the function what value we want to display. We will learn more about how functions work and how to create our own in later episodes.
We can display multiple things at once using only one print call:
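A sketch of a single call that displays several values, separated by commas inside the parentheses:

```python
patient_id = 'inflam_001'
weight_kg = 60.3

# print() separates its arguments with a space by default
print(patient_id, 'weight in kilograms:', weight_kg)
```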
OUTPUT
inflam_001 weight in kilograms: 60.3
We can also call a function inside of another function call. For example, Python has a built-in function called type that tells you a value’s data type:
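A sketch of nesting type inside print, with the variables re-created so the example is self-contained:

```python
weight_kg = 60.3
patient_id = 'inflam_001'

# type() runs first; its result is then passed to print()
print(type(weight_kg))
print(type(patient_id))
```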
OUTPUT
<class 'float'>
<class 'str'>
Moreover, we can do arithmetic with variables right inside the print function:
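A sketch of such a call, where the multiplication is evaluated before print displays the result:

```python
weight_kg = 60.3

# The expression 2.2 * weight_kg is calculated, then displayed
print('weight in pounds:', 2.2 * weight_kg)
```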
OUTPUT
weight in pounds: 132.66
The above command, however, did not change the value of weight_kg:
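A sketch that checks this, repeating the arithmetic first so the example runs on its own:

```python
weight_kg = 60.3
print('weight in pounds:', 2.2 * weight_kg)

# The calculation above used weight_kg but did not re-assign it
print(weight_kg)
```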
OUTPUT
60.3
To change the value of the weight_kg variable, we have to assign weight_kg a new value using the equals = sign:
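A sketch of that re-assignment:

```python
weight_kg = 60.3

# Assigning again replaces the old value with the new one
weight_kg = 65.0
print('weight in kilograms is now:', weight_kg)
```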
OUTPUT
weight in kilograms is now: 65.0
Variables as Sticky Notes
A variable in Python is analogous to a sticky note with a name written on it: assigning a value to a variable is like putting that sticky note on a particular value.
Using this analogy, we can investigate how assigning a value to one variable does not change values of other, seemingly related, variables. For example, let’s store the subject’s weight in pounds in its own variable:
PYTHON
# There are 2.2 pounds per kilogram
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)
OUTPUT
weight in kilograms: 65.0 and in pounds: 143.0
Everything in a line of code following the ‘#’ symbol is a comment that is ignored by Python. Comments allow programmers to leave explanatory notes for other programmers or their future selves.
Similar to above, the expression 2.2 * weight_kg is evaluated to 143.0, and then this value is assigned to the variable weight_lb (i.e. the sticky note weight_lb is placed on 143.0). At this point, each variable is “stuck” to completely distinct and unrelated values.
Let’s now change weight_kg:
PYTHON
weight_kg = 100.0
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)
OUTPUT
weight in kilograms is now: 100.0 and weight in pounds is still: 143.0
Since weight_lb doesn’t “remember” where its value comes from, it is not updated when we change weight_kg.
OUTPUT
`mass` holds a value of 47.5, `age` does not exist
`mass` still holds a value of 47.5, `age` holds a value of 122
`mass` now has a value of 95.0, `age`'s value is still 122
`mass` still has a value of 95.0, `age` now holds 102
OUTPUT
Hopper Grace
Key Points
- Basic data types in Python include integers, strings, and floating-point numbers.
- Use variable = value to assign a value to a variable in order to record it in memory.
- Variables are created on demand whenever a value is assigned to them.
- Use print(something) to display the value of something.
- Use # some kind of explanation to add comments to programs.
- Built-in functions are always available to use.
Content from Data Transformation
Last updated on 2024-07-11
Overview
Questions
- How can I process tabular data files in Python?
Objectives
- Explain what a library is and what libraries are used for.
- Import a Python library and use the functions it contains.
- Read tabular data from a file into a program.
- Select individual values and subsections from data.
- Perform operations on arrays of data.
Words are useful, but what’s more useful are the sentences and stories we build with them. Similarly, while a lot of powerful, general tools are built into Python, specialized tools built up from these basic units live in libraries that can be called upon when needed.
Loading data into Python
To begin processing the clinical trial inflammation data, we need to load it into Python. Python can work with many different file types. Text files can be loaded into Python by using the base Python function open(), where the mode “r” means read only; if you want to write to the file, you can use “w” instead.
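The two modes can be sketched as follows. The file name notes.txt and its contents are hypothetical, and a temporary folder is used so the example can run anywhere:

```python
import os
import tempfile

# Create a small throwaway text file (hypothetical name and contents)
path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w") as f:   # "w" opens the file for writing
    f.write("first line\nsecond line\n")

with open(path, "r") as f:   # "r" opens the file read-only
    lines = f.readlines()

print(lines)
```

The with statement closes the file automatically once the indented block finishes.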
However, our patient data is in a .csv file, which is more commonly loaded by using a library. Python has hundreds of thousands of libraries to choose from to help carry out your work. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs, so we only import what we need for each program. There are a couple of common Python libraries used to load and work with data.
pandas
The first library we will present is called pandas. pandas is a Python library containing a set of functions and specialised data structures that have been designed to help Python programmers perform data analysis tasks in a structured way.
Most of the things that pandas can do can be done with basic Python, but the collected set of pandas functions and data structures makes data analysis tasks more consistent in terms of syntax and therefore aids readability.
Remember to write the library name with a lower case ‘p’, because Python is case sensitive and the package is named pandas.
Importing the pandas library
Importing the pandas library is done in exactly the same way as for any other library. In almost all examples of Python code using the pandas library, it will have been imported and given an alias of pd. We will follow the same convention.
Pandas data structures
There are two main data structures used by pandas: the Series and the DataFrame. The Series equates in general to a vector or a list. The DataFrame is equivalent to a table. Each column in a pandas DataFrame is a pandas Series data structure.
We will mainly be looking at the DataFrame.
We can easily create a pandas DataFrame by reading a .csv file.
Reading a csv file
When we read a csv dataset in base Python we did so by opening the dataset, reading and processing a record at a time and then closing the dataset after we had read the last record. Reading datasets in this way is slow and places all of the responsibility for extracting individual data items of information from the records on the programmer.
The main advantage of this approach, however, is that you only have to store one dataset record in memory at a time. This means that if you have the time, you can process datasets of any size.
In pandas, csv files are read as complete datasets. You do not have to explicitly open and close the dataset. All of the dataset records are assembled into a DataFrame. If your dataset has column headers in the first record then these can be used as the DataFrame column names. You can explicitly state this in the parameters to the call, but pandas is usually able to infer that there is a header row and use it automatically.
To tell Python that we’d like to start using pandas, we need to import it:
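The conventional import statement looks like this (it assumes pandas is already installed in your Python environment):

```python
# Import pandas and give it the short alias pd
import pandas as pd
```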
Libraries are often given an alias, or short-form name; in this case pandas is given the alias “pd”. Other conventional aliases for data analysis libraries include np for NumPy and plt for matplotlib.pyplot.
Once we’ve imported the library, we can ask the library to read our data file for us:
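A minimal sketch of reading comma-separated data with pd.read_csv. The column names and values below are a hypothetical stand-in held in memory; with a real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of a .csv file on disk
csv_text = "patient_id,weight_kg\n001,60.3\n002,65.0\n"

# The first record is inferred to be the header row
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(list(df.columns))
```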
pandas is a commonly used library for working with and analysing data. However, we will be working with a different package for the remainder of this course. If you would like to learn more about data manipulation and analysis using pandas, we recommend checking out Data Analysis and Visualization with Python for Social Scientists.
numpy
The second package that we will present is called NumPy, which stands for Numerical Python. In general, you should use this library when you want to do fancy things with lots of numbers, especially if you have matrices or arrays. NumPy arrays are typically lighter weight with better performance, particularly when working with large datasets.
We will be using this package to work with our clinical trial inflammation data.
To tell Python that we’d like to start using NumPy, we need to import it:
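The conventional import statement looks like this (it assumes NumPy is already installed in your Python environment):

```python
# Import NumPy and give it the short alias np
import numpy as np
```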
Now that we have imported the library, we can ask the library (by using the alias np) to read our data file for us:
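A sketch of such a call. The three rows of text below are a hypothetical stand-in for the data file; with the real file you would pass its name as a string to fname.

```python
import io
import numpy as np

# Hypothetical stand-in for a comma-separated data file
csv_text = "0,0,1,3\n0,1,2,1\n0,1,1,3\n"

# fname is what to read; delimiter is the character separating the values
np.loadtxt(fname=io.StringIO(csv_text), delimiter=",")
```

In a notebook, the value of the final expression in a cell is displayed automatically.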
OUTPUT
array([[ 0., 0., 1., ..., 3., 0., 0.],
[ 0., 1., 2., ..., 1., 0., 1.],
[ 0., 1., 1., ..., 2., 1., 1.],
...,
[ 0., 1., 1., ..., 1., 1., 1.],
[ 0., 0., 0., ..., 0., 2., 0.],
[ 0., 0., 1., ..., 1., 1., 0.]])
The expression np.loadtxt(...) is a function call that asks Python to run the function loadtxt, which belongs to the NumPy library (imported under the alias np). Dot notation in Python is used to access an object’s attributes and to invoke its methods: object.property gives you the value of that property, and object_name.method() invokes the method on object_name.
As an example, John Smith is the John that belongs to the Smith family. We could use the dot notation to write his name smith.john, just as loadtxt is a function that belongs to the np library.
We give np.loadtxt two arguments here: the name of the file we want to read and the delimiter that separates values on a line. These both need to be character strings (or strings for short), so we put them in quotes.
Since we haven’t told it to do anything else with the function’s output, the notebook displays it. In this case, that output is the data we just loaded. By default, only a few rows and columns are shown (with ... to omit elements when displaying big arrays). Note that, to save space when displaying NumPy arrays, Python does not show us trailing zeros, so 1.0 is displayed as 1. (with only the decimal point).
Our call to np.loadtxt read our file but didn’t save the data in memory. To do that, we need to assign the array to a variable. In a similar manner to how we assign a single value to a variable, we can also assign an array of values to a variable using the same syntax. Let’s re-run np.loadtxt and save the returned data:
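A sketch of saving the result in a variable named data. The text being read is a hypothetical stand-in for the data file; with the real file you would pass its name to fname.

```python
import io
import numpy as np

# Hypothetical stand-in for a comma-separated data file
csv_text = "0,0,1,3\n0,1,2,1\n0,1,1,3\n"

# Assign the returned array to a variable so it can be used again later
data = np.loadtxt(fname=io.StringIO(csv_text), delimiter=",")
```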
This statement doesn’t produce any output because we’ve assigned the output to the variable data. If we want to check that the data have been loaded, we can print the variable’s value:
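A sketch of that check. The array below is a hypothetical stand-in with the same shape as the inflammation data, so the numbers displayed will not match the real measurements.

```python
import numpy as np

# Hypothetical stand-in: 60 rows by 40 columns of floating-point numbers
data = np.arange(60 * 40, dtype=float).reshape(60, 40)

# Large arrays are abbreviated with ... when displayed
print(data)
```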
OUTPUT
[[ 0. 0. 1. ..., 3. 0. 0.]
[ 0. 1. 2. ..., 1. 0. 1.]
[ 0. 1. 1. ..., 2. 1. 1.]
...,
[ 0. 1. 1. ..., 1. 1. 1.]
[ 0. 0. 0. ..., 0. 2. 0.]
[ 0. 0. 1. ..., 1. 1. 0.]]
Now that the data are in memory, we can manipulate them. First, let’s ask what type of thing data refers to:
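A sketch using the built-in type function, with a hypothetical stand-in array in place of the loaded data:

```python
import numpy as np

# Hypothetical stand-in with the same shape as the inflammation data
data = np.arange(60 * 40, dtype=float).reshape(60, 40)

print(type(data))
```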
OUTPUT
<class 'numpy.ndarray'>
The output tells us that data currently refers to an N-dimensional array, the functionality for which is provided by the NumPy library. These data correspond to arthritis patients’ inflammation. The rows are the individual patients, and the columns are their daily inflammation measurements.
Data Type
A NumPy array contains one or more elements of the same type. The type function will only tell you that a variable is a NumPy array but won’t tell you the type of thing inside the array. We can find out the type of the data contained in the NumPy array.
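The element type is stored in the array’s dtype attribute. A sketch, using a hypothetical stand-in array in place of the loaded data:

```python
import numpy as np

# Hypothetical stand-in with the same shape as the inflammation data
data = np.arange(60 * 40, dtype=float).reshape(60, 40)

# dtype describes the type of the elements inside the array
print(data.dtype)
```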
OUTPUT
float64
This tells us that the NumPy array’s elements are floating-point numbers.
With the following command, we can see the array’s shape:
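A sketch of reading the shape attribute, using a hypothetical stand-in array built with the same dimensions as the inflammation data:

```python
import numpy as np

# Hypothetical stand-in: 60 rows by 40 columns
data = np.arange(60 * 40, dtype=float).reshape(60, 40)

# shape is a tuple of (number of rows, number of columns)
print(data.shape)
```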
OUTPUT
(60, 40)
The output tells us that the data array variable contains 60 rows and 40 columns. When we created the variable data to store our arthritis data, we did not only create the array; we also created information about the array, called members or attributes. This extra information describes data in the same way an adjective describes a noun. data.shape is an attribute of data which describes the dimensions of data. We use the same dotted notation for the attributes of variables that we use for the functions in libraries because they have the same part-and-whole relationship.
If we want to get a single number from the array, we must provide an index in square brackets after the variable name, just as we do in math when referring to an element of a matrix. Our inflammation data has two dimensions, so we will need to use two indices to refer to one specific value:
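A sketch of indexing with two indices. The array is a hypothetical stand-in with the same shape as the inflammation data, so the values printed will differ from the real measurements.

```python
import numpy as np

# Hypothetical stand-in: 60 rows by 40 columns
data = np.arange(60 * 40, dtype=float).reshape(60, 40)

# The first index selects the row, the second selects the column
print('first value in data:', data[0, 0])
print('middle value in data:', data[29, 19])
```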
OUTPUT
first value in data: 0.0
OUTPUT
middle value in data: 16.0
The expression data[29, 19] accesses the element at row 30, column 20. While this expression may not surprise you, data[0, 0] might. Programming languages like Fortran, MATLAB and R start counting at 1 because that’s what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because it represents an offset from the first value in the array (the second value is offset by one index from the first value). This is closer to the way that computers represent arrays (if you are interested in the historical reasons behind counting indices from zero, you can read Mike Hoye’s blog post). As a result, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second. It takes a bit of getting used to, but one way to remember the rule is that the index is how many steps we have to take from the start to get the item we want.
In the Corner
What may also surprise you is that when Python displays an array, it shows the element with index [0, 0] in the upper left corner rather than the lower left. This is consistent with the way mathematicians draw matrices but different from the Cartesian coordinates. The indices are (row, column) instead of (column, row) for the same reason, which can be confusing when plotting data.
Slicing data
An index like [30, 20] selects a single element of an array, but we can select whole sections as well. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:
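A sketch of that slice on a hypothetical stand-in array (same shape as the inflammation data, different values):

```python
import numpy as np

# Hypothetical stand-in: 60 rows by 40 columns
data = np.arange(60 * 40, dtype=float).reshape(60, 40)

# Rows 0 to 3 (four patients) and columns 0 to 9 (ten days)
first_block = data[0:4, 0:10]
print(first_block)
```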
OUTPUT
[[ 0. 0. 1. 3. 1. 2. 4. 7. 8. 3.]
[ 0. 1. 2. 1. 2. 1. 3. 2. 2. 6.]
[ 0. 1. 1. 3. 3. 2. 6. 2. 5. 9.]
[ 0. 0. 2. 0. 4. 2. 2. 1. 6. 7.]]
The slice 0:4 means, “Start at index 0 and go up to, but not including, index 4”. Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
We don’t have to start slices at 0:
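A sketch of a slice that starts partway through the rows, using a hypothetical stand-in array; the bounds 5:10 are chosen for illustration and give five rows:

```python
import numpy as np

# Hypothetical stand-in: 60 rows by 40 columns
data = np.arange(60 * 40, dtype=float).reshape(60, 40)

# Rows 5 to 9 and columns 0 to 9
later_block = data[5:10, 0:10]
print(later_block)
```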
OUTPUT
[[ 0. 0. 1. 2. 2. 4. 2. 1. 6. 4.]
[ 0. 0. 2. 2. 4. 2. 2. 5. 5. 8.]
[ 0. 0. 1. 2. 3. 1. 2. 3. 5. 3.]
[ 0. 0. 0. 3. 1. 5. 6. 5. 5. 8.]
[ 0. 1. 1. 2. 1. 3. 5. 3. 5. 8.]]
We also don’t have to include the upper and lower bound on the slice. If we don’t include the lower bound, Python uses 0 by default; if we don’t include the upper, the slice runs to the end of the axis, and if we don’t include either (i.e., if we use ‘:’ on its own), the slice includes everything:
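A sketch of omitting bounds, again on a hypothetical stand-in array with 60 rows and 40 columns:

```python
import numpy as np

# Hypothetical stand-in: 60 rows by 40 columns
data = np.arange(60 * 40, dtype=float).reshape(60, 40)

# No lower bound on the rows, no upper bound on the columns
small = data[:3, 36:]
print('small is:')
print(small)
```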
The above example selects rows 0 through 2 and columns 36 through to the end of the array.
OUTPUT
small is:
[[ 2. 3. 0. 0.]
[ 1. 1. 0. 1.]
[ 2. 2. 1. 1.]]
Content from List and Dictionary Methods
Last updated on 2024-07-11
Overview
Questions
- How can I store many values together?
- How can I create a list succinctly?
- How can I efficiently access nested data?
Objectives
- Identify and create lists and dictionaries
- Understand the properties and behaviours of lists and dictionaries
- Access values in lists and dictionaries
- Create and access values from nested lists and dictionaries
Values can also be stored in other Python data types such as lists, dictionaries, sets and tuples. Storing objects in a list is a fast and versatile way to apply transformations across a sequence of values. Storing objects in dictionary as key-value pairs is useful for extracting specific values i.e. performing lookup operations.
Create and access lists
Lists have the following properties and behaviours:
- A single list can store different primitive object types and even other lists
- Lists are ordered and have a 0-based index
- Lists can be appended to using the methods append() or insert()
- Values inside a list can be removed using the methods remove() or pop()
- Two lists can be concatenated with the operator +
- Values inside a list can be conditionally iterated through
- A list is mutable i.e. the values inside a list can be modified in place
To create a list, values are contained within square brackets i.e. [] and individually separated by commas. The function list() can also be used to create a list of values from an iterable object like a string, set or tuple.
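A minimal sketch of creating a list with square brackets; the variable name list_1 is an assumption chosen to match the numbering of the later examples:

```python
# Create a list of integers using square brackets
list_1 = [1, 3, 5, 7]
print(list_1)
```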
OUTPUT
[1, 3, 5, 7]
PYTHON
# Unlike atomic vectors in R, a list can contain multiple primitive object types
list_2 = [1, "one", 1.0, True]
print(list_2)
OUTPUT
[1, 'one', 1.0, True]
PYTHON
# You can also use list() on an iterable object to convert it into a list
string = 'abcdefg'
list_3 = list(string)
print(list_3)
OUTPUT
['a', 'b', 'c', 'd', 'e', 'f', 'g']
Because lists have a 0-based index, we can access individual values by their list index position. For 0-based indexes, the first value always starts at position 0 i.e. the first element has an index of 0. Accessing multiple values by their index positions is also referred to as slicing or subsetting a list.
Note that we can use negative numbers as indices in Python. When we do so, the index -1 gives us the last element in the list, -2 gives us the second to last element in the list, and so on.
PYTHON
# Extract individual values from list_3
print('first value:', list_3[0])
print('second value:', list_3[1])
print('last value:', list_3[-1])
OUTPUT
first value: a
second value: b
last value: g
PYTHON
# A syntax quirk for slicing values is to +1 to the last value's index
# To extract from index 0 to 2, we need to slice from [0:2+1] or [0:3]
# Extract the first three values from list_3
print('first 3 values:', list_3[0:3])
# Start from index 0 and extract values from each subsequent second position
print('every second value:', list_3[0::2])
# Start from index 1, end at index 3 and extract from each subsequent second position
print('every second value from index 1 to 3:', list_3[1:4:2])
OUTPUT
first 3 values: ['a', 'b', 'c']
every second value: ['a', 'c', 'e', 'g']
every second value from index 1 to 3: ['b', 'd']
Change list values
Data which can be modified in place is called mutable, while data which cannot be modified is called immutable. Strings and numbers are immutable in that when we want to change the value of a string or number variable, we can only replace the old value with a completely new value.
PYTHON
string = 'abcde'
string[0] = 'b' # Produces a type error as strings are immutable
# TypeError: 'str' object does not support item assignment
In contrast, lists are mutable and we can modify them after they have been created. We can change individual values, append new values, or reorder the whole list through sorting.
PYTHON
list_4 = ['apple', 'pear', 'plum']
print('original list_4:', list_4)
# Change the first value i.e. modify the list in place
list_4[0] = 'banana'
print('modified list_4:', list_4)
# Add new value to list using the method .insert(index number, value)
list_4.insert(1, 'apple') # Index 1 refers to the second position
print('appended list_4:', list_4)
OUTPUT
original list_4: ['apple', 'pear', 'plum']
modified list_4: ['banana', 'pear', 'plum']
appended list_4: ['banana', 'apple', 'pear', 'plum']
PYTHON
# Sorting a list also modifies it in place
list_5 = [2, 1, 3, 7]
list_5.sort()
print('list_5:', list_5)
OUTPUT
list_5: [1, 2, 3, 7]
However, be careful when modifying data in-place. If two variables refer to the same list, and you modify the list value, it will change for both variables!
PYTHON
# When we assign list_6 to list_5, it means both list_6 and list_5 point to the
# same list object, not that list_6 is a copy of list_5.
list_6 = list_5
print('list_5:', list_5)
print('list_6:', list_6)
# Change the first value in list_6 from 1 to 2
list_6[0] = 2
print('modified list_6:', list_6)
print('unmodified list_5:', list_5)
# Warning: list_5 and list_6 have both been modified in place!
OUTPUT
list_5: [1, 2, 3, 7]
list_6: [1, 2, 3, 7]
modified list_6: [2, 2, 3, 7]
unmodified list_5: [2, 2, 3, 7]
Because of this behaviour, code which modifies data in place should be handled with care. You can avoid this behaviour by explicitly creating a copy of the original list and modifying only the copy.
PYTHON
list_5 = [1, 2, 3, 7]
list_7 = list_5.copy()
print('list_5:', list_5)
print('list_7:', list_7)
# As list_7 is a completely new object copied from list_5, modifying list_7 does
# not affect list_5.
list_7[0] = 2
print('modified list_7:', list_7)
print('unmodified list_5:', list_5)
OUTPUT
list_5: [1, 2, 3, 7]
list_7: [1, 2, 3, 7]
modified list_7: [2, 2, 3, 7]
unmodified list_5: [1, 2, 3, 7]
Useful list functions
There are a lot of functions and methods which can be applied to lists, such as len(), max(), index() and so forth. Arithmetic operators do not perform element-wise mathematics on lists; the only ones that work with lists are + and * (which repeats a list a whole number of times).
Note that + concatenates two lists into a single longer list, rather than outputting the sum of two lists of numbers.
PYTHON
list_8 = [1, 2, 3]
list_9 = [4, 5, 6]
list_8 + list_9 # This concatenates the lists and does not sum the two lists together
OUTPUT
[1, 2, 3, 4, 5, 6]
In your spare time after this workshop, you can search for different list functions and methods and test them out yourselves.
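As a starting point, here is a short sketch of the functions and methods named above, applied to a small example list:

```python
list_8 = [1, 2, 3]

print(len(list_8))        # number of values in the list
print(max(list_8))        # largest value in the list
print(list_8.index(3))    # index position of the value 3
print(list_8 * 2)         # repeats the list; does not double each value
```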
Nested lists
We have previously mentioned that lists can be used to store other Python object types, including lists. This means that we can create nested lists in Python i.e. lists containing lists containing values. This property is useful when we have a collection of values that we want to access or transform as a subgroup.
To create a nested list, we also use [] or list() to contain one or more lists of values of interest.
PYTHON
veg_stock = [
['lettuce', 'lettuce', 'tomato', 'zucchini'],
['lettuce', 'lettuce', 'carrot', 'zucchini'],
['lettuce', 'basil', 'tomato', 'zucchini']
]
# Check that veg_stock is a list object
print(type(veg_stock))
# Check that the first value in veg_stock is itself a list
print(veg_stock[0], 'has type', type(veg_stock[0]))
OUTPUT
<class 'list'>
['lettuce', 'lettuce', 'tomato', 'zucchini'] has type <class 'list'>
To extract a sub-list within the veg_stock list object, we refer to its index like we would with any other value inside a list i.e. veg_stock[0] points to the first sub-list and veg_stock[1] points to the second sub-list within the veg_stock list.
To access an individual string value inside a sub-list, we make use of a second index, which points to an individual value inside the sub-list.
PYTHON
print(veg_stock[0]) # Access the first sub-list
print(veg_stock[0][0]) # Access the first value in the first sub-list
print(type(veg_stock[0])) # The first value in veg_stock is a list
print(type(veg_stock[0][0])) # The first value in the first list in veg_stock is a string
OUTPUT
['lettuce', 'lettuce', 'tomato', 'zucchini']
lettuce
<class 'list'>
<class 'str'>
In general, however, when we are analysing a large collection of values, the best practice is to structure those values in columns and rows as a tabular Pandas data frame object. This is covered in another Carpentries Course called Python for Social Sciences.
Lists are still incredibly versatile and useful when you have a collection of values that need to be efficiently accessed or transformed. For example, data frame column names are commonly extracted and stored inside a list, so that the same transformation can then be mapped across multiple columns.
Create and access dictionaries
A dictionary is a Python data type that is particularly suited for enabling quick lookup operations on unstructured data sets.
A dictionary can therefore be thought of as a collection where every item or value is associated with a unique key (i.e. a self-defined index of unique strings or numbers) rather than with a position. The index values are called keys and a dictionary contains key-value pairs with the format {key: value(s)}.
Dictionaries can be created by listing individual key-value pairs inside {} or by using dict().
PYTHON
# A key-value pair can contain single or multiple values
# Keys are treated as case sensitive and unique
# Multiple values are first stored inside a list
teams = {
'data science': ['Mei Ling', 'Paul', 'Gwen', 'Suresh'],
'user design': ['Amy', 'Linh', 'Sasha'],
'software dev': ['David', 'Prya'],
'comms': 'Taylor'
}
When using dict(), we need to indicate which key is associated with which value. This can be done using a list of tuples, using direct association (i.e. keyword arguments written with =), or using zip(), which pairs up the items of two or more iterables into tuples.
PYTHON
# To use dict(), key-value pairs can be stored inside tuples
ds_emp_status = dict([
('Mei Ling', 'full time'),
('Paul', 'full time'),
('Gwen', 'part time'),
('Suresh', 'part time')
])
# Key-value pairs can also be assigned by direct association
# Keys are written without '' here and must be valid Python names; they are stored as strings
ud_emp_status = dict(
Amy = 'full time',
Linh = 'full time',
Sasha = 'casual'
)
# zip() can also be used if each key has only one value
sd_emp_status = dict(zip(
['David', 'Prya'],
['full time', 'full time']
))
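As a quick check of the three approaches above, the sketch below builds the same small dictionary three ways and compares the results. The names and values are hypothetical and chosen only for illustration.

```python
# Hypothetical data used only for illustration
from_tuples = dict([('Amy', 'full time'), ('Linh', 'casual')])
from_keywords = dict(Amy='full time', Linh='casual')
from_zip = dict(zip(['Amy', 'Linh'], ['full time', 'casual']))

# All three approaches produce the same dictionary
print(from_tuples == from_keywords == from_zip)

# Keys given as keyword arguments are stored as ordinary strings
print(type(list(from_keywords.keys())[0]))
```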
To access a specific value inside a dictionary, we need to specify its key using []. This is similar to subsetting a list by specifying an index using [].
PYTHON
# Access the values associated with the key 'data science'
print(teams['data science'])
print('The object teams is of type', type(teams))
print('The dict value', teams['data science'], 'is of type', type(teams['data science']))
OUTPUT
['Mei Ling', 'Paul', 'Gwen', 'Suresh']
The object teams is of type <class 'dict'>
The dict value ['Mei Ling', 'Paul', 'Gwen', 'Suresh'] is of type <class 'list'>
We can also access a value from a dictionary using the get() method.
PYTHON
print(teams.get('user design'))
# get() also enables us to return an alternate string when the key is not found
# This prevents our code from returning an error message that halts the analysis
print(teams.get('data engineering', 'WARNING: key does not exist'))
OUTPUT
['Amy', 'Linh', 'Sasha']
WARNING: key does not exist
We can also inspect the contents of a dictionary in the following ways:
- Check whether a key exists in a dictionary using the keyword in
- Retrieve dictionary keys using dict.keys()
- Retrieve dictionary values using dict.values()
- Retrieve dictionary items using dict.items()
PYTHON
# Check whether a key exists in a dictionary
print('data science' in teams)
print('Data Science' in teams) # Keys are case sensitive
# Retrieve all dictionary keys
print(teams.keys())
print(sd_emp_status.keys())
# Retrieve all dictionary values
print(sd_emp_status.values())
# Retrieve all dictionary key-value pairs
print(sd_emp_status.items())
OUTPUT
True
False
dict_keys(['data science', 'user design', 'software dev', 'comms'])
dict_keys(['David', 'Prya'])
dict_values(['full time', 'full time'])
dict_items([('David', 'full time'), ('Prya', 'full time')])
To add a new key-value pair to an existing dictionary, we can create a new key and directly attach a new value to it using =, or alternatively use the method update().
PYTHON
print('original dict items:', sd_emp_status.items())
# Add new key-value pair using direct assignment
sd_emp_status['Mohammad'] = 'full time'
# Add new key-value pair using update({'key': 'value'})
sd_emp_status.update({'Carrie': 'part time'})
print('updated dict items:', sd_emp_status.items())
OUTPUT
original dict items: dict_items([('David', 'full time'), ('Prya', 'full time')])
updated dict items: dict_items([('David', 'full time'), ('Prya', 'full time'),
('Mohammad', 'full time'), ('Carrie', 'part time')])
Because keys are unique, a dictionary cannot contain two keys with the same name. This means that adding an item using a key that is already present in the dictionary will cause the previous value to be overwritten.
PYTHON
print('original dict items:', sd_emp_status.items())
# As the key 'Carrie' already exists, its value will be overwritten
sd_emp_status['Carrie'] = 'full time'
print('updated dict items:', sd_emp_status.items())
OUTPUT
original dict items: dict_items([('David', 'full time'), ('Prya', 'full time'),
('Mohammad', 'full time'), ('Carrie', 'part time')])
updated dict items: dict_items([('David', 'full time'), ('Prya', 'full time'),
('Mohammad', 'full time'), ('Carrie', 'full time')])
To remove a key-value pair from an existing dictionary, we can use the del keyword or the method pop(). Using pop() also enables us to return an alternate value if we try to remove a non-existing key, which prevents our code from raising an error that halts the analysis.
PYTHON
print('original dict items:', sd_emp_status.items())
# Delete dictionary keys using del and pop()
del sd_emp_status['Mohammad']
sd_emp_status.pop('Carrie')
sd_emp_status.pop('Anuradha', 'WARNING: key does not exist') # Does not generate an error
print('modified dict items:', sd_emp_status.items())
OUTPUT
original dict items: dict_items([('David', 'full time'), ('Prya', 'full time'),
('Mohammad', 'full time'), ('Carrie', 'full time')])
modified dict items: dict_items([('David', 'full time'), ('Prya', 'full time')])
Nested dictionaries
Similar to lists, dictionaries can be nested, as we can store dictionaries as values inside a key-value pair using {}.
Nested dictionaries are useful when we need to store unstructured data in a complex structure. For example, JSON data is commonly used for transmitting data in web applications and often exists in a nested structure that can be stored using nested dictionaries in Python.
PYTHON
# Individual dictionaries are enclosed in {} and separated by a comma
nested_dict = {
    'dict_1': {  # The first value is itself a dictionary of key-value pairs
        'key_1a': 'value_1a',
        'key_1b': 'value_1b'
    },
    'dict_2': {  # The second value is another dictionary of key-value pairs
        'key_2a': 'value_2a',
        'key_2b': 'value_2b'
    }
}
print(nested_dict)
OUTPUT
{'dict_1': {'key_1a': 'value_1a', 'key_1b': 'value_1b'},
'dict_2': {'key_2a': 'value_2a', 'key_2b': 'value_2b'}}
Similar to working with nested lists, to extract a value from a sub-dictionary, we specify both the main dictionary key and the sub-dictionary key using [].
PYTHON
# Extract the value for key 2a in dict_2
print('original value:', nested_dict['dict_2']['key_2a'])
# Adding or updating a value can be done through the same approach
nested_dict['dict_2']['key_2a'] = "modified_value_2a"
print('modified value:', nested_dict['dict_2']['key_2a'])
OUTPUT
original value: value_2a
modified value: modified_value_2a
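Since JSON is mentioned above as a common source of nested data, here is a minimal sketch that uses Python's built-in json module to turn a JSON string into a nested dictionary. The JSON text and its keys are made up for illustration.

```python
import json

# A made-up JSON string with one level of nesting
json_text = '{"team": {"name": "data science", "members": ["Mei Ling", "Paul"]}}'

# json.loads() converts JSON objects into Python dictionaries
parsed = json.loads(json_text)

print(type(parsed))                  # the outer object is a dict
print(parsed['team']['name'])        # access a nested value with two keys
print(parsed['team']['members'][0])  # a JSON array becomes a Python list
```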
Optional: converting lists and dictionaries to Pandas data frames
Lists and dictionaries can be easily converted into a tabular Pandas data frame format. This can be useful when you need to create a small data set for unit testing purposes.
PYTHON
# Import pandas library
import pandas as pd
# Create a dictionary with each key-value pair representing a data frame column
data = {
'col_1': [3, 2, 1, 0],
'col_2': ['a', 'b', 'c', 'd']
}
df = pd.DataFrame.from_dict(data)
print(df) # Outputs data as a tabular Pandas data frame
print(type(df))
OUTPUT
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
<class 'pandas.core.frame.DataFrame'>
Key Points
- Lists can contain any Python object including other lists
- Lists are ordered i.e. indexed and can therefore be sliced by index number
- Unlike strings and integers, the values inside a list can be modified in place
- A list which contains other lists is referred to as a nested list
- Dictionaries are defined using key-value pairs, and their values are looked up by key rather than by position
- Dictionary keys are unique
- A dictionary which contains other dictionaries is referred to as a nested dictionary
- Values inside nested lists and dictionaries can be accessed by an additional index
Content from Loops and Conditional Logic
Last updated on 2024-07-11
Overview
Questions
- How can I do the same operations on many different values?
- How can my programs do different things based on data values?
Objectives
- identify and create loops
- use logical statements to allow for decision-based operations in code
This episode contains two lessons: Repeating Actions with Loops and Making Choices with Conditional Logic.
Repeating Actions with Loops
In the episode about visualizing data, we will see Python code that plots values of interest from our first inflammation dataset (inflammation-01.csv), which reveals some suspicious features.
We have a dozen data sets right now and potentially more on the way if Dr. Maverick can keep up their surprisingly fast clinical trial rate. We want to create plots for all of our data sets with a single statement. To do that, we’ll have to teach the computer how to repeat things.
An example task that we might want to repeat is accessing numbers in a list, which we will do by printing each number on a line of its own.
In Python, a list is basically an ordered collection of elements, and every element has a unique number associated with it: its index. This means that we can access elements in a list using their indices. For example, we can get the first number in the list odds by using odds[0]. One way to print each number is to use four print statements:
OUTPUT
1
3
5
7
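A sketch of the four print statements, assuming odds holds the four values shown in the output:

```python
odds = [1, 3, 5, 7]  # assumed contents, matching the output above
print(odds[0])
print(odds[1])
print(odds[2])
print(odds[3])
```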
This is a bad approach for three reasons:
- Not scalable. Imagine you need to print a list that has hundreds of elements. It might be easier to type them in manually.
- Difficult to maintain. If we want to decorate each printed element with an asterisk or any other character, we would have to change four lines of code. While this might not be a problem for small lists, it would definitely be a problem for longer ones.
- Fragile. If we use it with a list that has more elements than what we initially envisioned, it will only display part of the list’s elements. A shorter list, on the other hand, will cause an error because it will be trying to display elements of the list that do not exist.
ERROR
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-7974b6cdaf14> in <module>()
3 print(odds[1])
4 print(odds[2])
----> 5 print(odds[3])
IndexError: list index out of range
Here’s a better approach: a for loop
OUTPUT
1
3
5
7
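A minimal sketch of such a loop, again assuming odds holds the four values shown in the output:

```python
odds = [1, 3, 5, 7]  # assumed contents
for num in odds:
    # the indented body runs once for each element of odds
    print(num)
```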
This is shorter — certainly shorter than something that prints every number in a hundred-number list — and more robust as well:
OUTPUT
1
3
5
7
9
11
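The same loop, unchanged, handles a longer list. In this sketch odds is assumed to hold the six values shown in the output:

```python
odds = [1, 3, 5, 7, 9, 11]  # assumed contents
for num in odds:
    print(num)
```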
The improved version uses a for loop to repeat an operation — in this case, printing — once for each thing in a sequence. The general form of a loop is:
Using the odds example above, the loop can be pictured as a diagram in which each number (num) in the variable odds is looped through and printed one number after another. The other numbers in the diagram denote which loop cycle the number was printed in (1 being the first loop cycle, and 6 being the final loop cycle).
We can call the loop variable anything we like, but there must be a colon at the end of the line starting the loop, and we must indent anything we want to run inside the loop. Unlike many other languages, there is no command to signify the end of the loop body (e.g., end for); everything indented after the for statement belongs to the loop.
What’s in a name?
In the example above, the loop variable was given the name num as a mnemonic; it is short for ‘number’. We can choose any name we want for variables. We might just as easily have chosen the name banana for the loop variable, as long as we use the same name when we invoke the variable inside the loop:
OUTPUT
1
3
5
7
9
11
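For instance, the following sketch (with the contents of odds assumed as before) behaves identically to a loop that uses num:

```python
odds = [1, 3, 5, 7, 9, 11]  # assumed contents
for banana in odds:
    print(banana)
```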
It is a good idea to choose variable names that are meaningful, otherwise it would be more difficult to understand what the loop is doing.
Here’s another loop that repeatedly updates a variable:
PYTHON
length = 0
names = ['Curie', 'Darwin', 'Turing']
for value in names:
    length = length + 1
print('There are', length, 'names in the list.')
OUTPUT
There are 3 names in the list.
It’s worth tracing the execution of this little program step by step.
Since there are three names in names, the statement on line 4 will be executed three times. The first time around, length is zero (the value assigned to it on line 1) and value is Curie. The statement adds 1 to the old value of length, producing 1, and updates length to refer to that new value. The next time around, value is Darwin and length is 1, so length is updated to be 2. After one more update, length is 3; since there is nothing left in names for Python to process, the loop finishes and the print function on line 5 tells us our final answer.
Note that a loop variable is a variable that is being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:
PYTHON
name = 'Rosalind'
for name in ['Curie', 'Darwin', 'Turing']:
    print(name)
print('after the loop, name is', name)
OUTPUT
Curie
Darwin
Turing
after the loop, name is Turing
Note also that finding the length of an object is such a common operation that Python actually has a built-in function to do it called len:
OUTPUT
4
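A one-line sketch that would print this result, using a made-up four-element list:

```python
# len() returns the number of items in the list
print(len([0, 1, 2, 3]))
```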
len is much faster than any function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other data types we haven’t seen yet, so we should always use it when we can.
From 1 to N
Python has a built-in function called range that generates a sequence of numbers. range can accept 1, 2, or 3 parameters.
- If one parameter is given, range generates a sequence of that length, starting at zero and incrementing by 1. For example, range(3) produces the numbers 0, 1, 2.
- If two parameters are given, range starts at the first and ends just before the second, incrementing by one. For example, range(2, 5) produces 2, 3, 4.
- If range is given 3 parameters, it starts at the first one, ends just before the second one, and increments by the third one. For example, range(3, 10, 2) produces 3, 5, 7, 9.
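The three forms can be checked with a short sketch. Wrapping each call in list() is only a convenience for displaying the generated numbers:

```python
one_param = list(range(3))           # start at 0, stop before 3
two_params = list(range(2, 5))       # start at 2, stop before 5
three_params = list(range(3, 10, 2)) # start at 3, stop before 10, step 2

print(one_param)
print(two_params)
print(three_params)
```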
Using range, write a loop that prints the first 3 natural numbers:
OUTPUT
1
2
3
The body of the loop is executed 6 times.
Summing a List
Write a loop that calculates the sum of elements in a list by adding each element and printing the final value, so [124, 402, 36] prints 562
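One possible solution, sketched with the example list from the exercise (the variable names are illustrative):

```python
numbers = [124, 402, 36]
total = 0
for num in numbers:
    total = total + num  # add each element to the running total
print(total)
```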
Computing the Value of a Polynomial
The built-in function enumerate takes a sequence (e.g., a list) and generates a new sequence of the same length. Each element of the new sequence is a pair composed of the index (0, 1, 2,…) and the value from the original sequence:
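A minimal sketch of such a loop, where the contents of a_list are made up for illustration:

```python
a_list = ['a', 'b', 'c']  # hypothetical example values
for idx, val in enumerate(a_list):
    # each pass receives one (index, value) pair
    print(idx, val)
```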
The code above loops through a_list, assigning the index to idx and the value to val.
Suppose you have encoded a polynomial as a list of coefficients in the following way: the first element is the constant term, the second element is the coefficient of the linear term, the third is the coefficient of the quadratic term, etc.
OUTPUT
97
Write a loop using enumerate(coefs) which computes the value y of any polynomial, given x and coefs.
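One possible solution is sketched below. The values x = 5 and coefs = [2, 4, 3] are assumptions, chosen because they reproduce the output of 97 shown above:

```python
x = 5              # assumed input value
coefs = [2, 4, 3]  # assumed coefficients: 2 + 4x + 3x^2
y = 0
for idx, coef in enumerate(coefs):
    y = y + coef * x**idx  # add the term coef * x^idx
print(y)
```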
Making Choices with Conditional Logic
How can we use Python to automatically recognize different situations we encounter with our data and take a different action for each? In this lesson, we’ll learn how to write code that runs only when certain conditions are true.
Conditionals
We can ask Python to take different actions, depending on a condition, with an if statement:
OUTPUT
not greater
done
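A sketch of a conditional that would produce this output. The value 37 is an assumption; any number that is not greater than 100 behaves the same way:

```python
num = 37  # assumed value
if num > 100:
    print('greater')
else:
    print('not greater')
print('done')
```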
The second line of this code uses the keyword if to tell Python that we want to make a choice. If the test that follows the if statement is true, the body of the if (i.e., the set of lines indented underneath it) is executed, and “greater” is printed. If the test is false, the body of the else is executed instead, and “not greater” is printed. Only one or the other is ever executed before continuing on with program execution to print “done”.
Conditional statements don’t have to include an else. If there isn’t one, Python simply does nothing if the test is false:
PYTHON
num = 53
print('before conditional...')
if num > 100:
    print(num, 'is greater than 100')
print('...after conditional')
OUTPUT
before conditional...
...after conditional
We can also chain several tests together using elif, which is short for “else if”. The following Python code uses elif to print the sign of a number.
PYTHON
num = -3
if num > 0:
    print(num, 'is positive')
elif num == 0:
    print(num, 'is zero')
else:
    print(num, 'is negative')
OUTPUT
-3 is negative
Note that to test for equality we use a double equals sign == rather than a single equals sign =, which is used to assign values.
Comparing in Python
Along with the > and == operators we have already used for comparing values in our conditionals, there are a few more options to know about:
- >: greater than
- <: less than
- ==: equal to
- !=: does not equal
- >=: greater than or equal to
- <=: less than or equal to
We can also combine tests using and and or. and is only true if both parts are true:
PYTHON
if (1 > 0) and (-1 >= 0):
    print('both parts are true')
else:
    print('at least one part is false')
OUTPUT
at least one part is false
while or is true if at least one part is true:
OUTPUT
at least one test is true
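A sketch of an or test that would print this message, mirroring the and example above (the specific comparisons are assumptions):

```python
either_true = (1 < 0) or (1 >= 0)  # the first part is false, the second is true
if either_true:
    print('at least one test is true')
```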
True and False
True and False are special words in Python called booleans, which represent truth values. A statement such as 1 < 0 returns the value False, while -1 < 0 returns the value True.
Checking Our Data
Now that we’ve seen how conditionals work, we can use them to check for the suspicious features we saw in our inflammation data. We are about to use functions provided by the numpy module again. Therefore, if you’re working in a new Python session, make sure to load the module with import numpy.
From the first couple of plots, we saw that maximum daily inflammation exhibits a strange behavior and rises by one unit a day. Wouldn’t it be a good idea to detect such behavior and report it as suspicious? Let’s do that! However, instead of checking every single day of the study, let’s merely check if maximum inflammation in the beginning (day 0) and in the middle (day 20) of the study are equal to the corresponding day numbers.
PYTHON
max_inflammation_0 = numpy.amax(data, axis=0)[0]
max_inflammation_20 = numpy.amax(data, axis=0)[20]
if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
We also saw a different problem in the third dataset: the minima per day were all zero (looks like a healthy person snuck into our study). We can also check for this with an elif condition, and if neither of these conditions is true, we can use else to give the all-clear.
Let’s test that out:
PYTHON
data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
max_inflammation_0 = numpy.amax(data, axis=0)[0]
max_inflammation_20 = numpy.amax(data, axis=0)[20]
if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
elif numpy.sum(numpy.amin(data, axis=0)) == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')
OUTPUT
Suspicious looking maxima!
PYTHON
data = numpy.loadtxt(fname='inflammation-03.csv', delimiter=',')
max_inflammation_0 = numpy.amax(data, axis=0)[0]
max_inflammation_20 = numpy.amax(data, axis=0)[20]
if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
elif numpy.sum(numpy.amin(data, axis=0)) == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')
OUTPUT
Minima add up to zero!
In this way, we have asked Python to do something different depending on the condition of our data. Here we printed messages in all cases, but we could also imagine not using the else catch-all so that messages are only printed when something is wrong, freeing us from having to manually examine every plot for features we’ve seen before.
C gets printed because the first two conditions, 4 > 5 and 4 == 5, are not true, but 4 < 5 is true. In this case, only one of these conditions can be true at a time, but in other scenarios multiple elif conditions could be met. In these scenarios, only the action associated with the first true elif condition will occur, starting from the top of the conditional section.
This contrasts with the case of multiple if statements, where every action can occur as long as its condition is met.
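The contrast can be sketched as follows. The letters A, B and C follow the explanation above, while the second half, with separate if statements and a made-up value, is an illustration only:

```python
# Chained elif: only the first true branch runs
if 4 > 5:
    print('A')
elif 4 == 5:
    print('B')
elif 4 < 5:
    print('C')

# Separate if statements: every true condition runs its own body
matches = []
value = 4  # hypothetical value
if value < 5:
    matches.append('less than 5')
if value < 10:
    matches.append('less than 10')
print(matches)
```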
What Is Truth?
True and False booleans are not the only values in Python that are true and false. In fact, any value can be used in an if or elif. After reading and running the code below, explain what the rule is for which values are considered true and which are considered false.
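A sketch of the kind of code this challenge refers to. The specific test values are assumptions; each one is used directly as a condition, without a comparison operator:

```python
if '':
    print('empty string is true')
if 'word':
    print('word is true')
if []:
    print('empty list is true')
if [1, 2, 3]:
    print('non-empty list is true')
if 0:
    print('zero is true')
if 1:
    print('one is true')
```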
That’s Not Not What I Meant
Sometimes it is useful to check whether some condition is not true. The Boolean operator not can do this explicitly. After reading and running the code below, write some if statements that use not to test the rule that you formulated in the previous challenge.
Close Enough
Write some conditions that print True if the variable a is within 10% of the variable b and False otherwise. Compare your implementation with your partner’s. Do you get the same answer for all possible pairs of numbers?
There is a built-in function abs that returns the absolute value of a number:
OUTPUT
12
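A sketch of how abs could be used here. The first line reproduces the output above; the within-10% test is only one possible interpretation (the gap is measured relative to b), and the values of a and b are made up:

```python
print(abs(-12))

a = 5    # hypothetical values
b = 5.1
# True when the gap between a and b is at most 10% of the size of b
within_ten_percent = abs(a - b) <= 0.1 * abs(b)
print(within_ten_percent)
```

Measuring the gap relative to a instead of b can give a different answer for some pairs, which is the point of comparing implementations.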
In-Place Operators
Python (and most other languages in the C family) provides in-place operators that work like this:
PYTHON
x = 1 # original value
x += 1 # add one to x, assigning result back to x
x *= 3 # multiply x by 3
print(x)
OUTPUT
6
Write some code that sums the positive and negative numbers in a list separately, using in-place operators. Do you think the result is more or less readable than writing the same without in-place operators?
PYTHON
positive_sum = 0
negative_sum = 0
test_list = [3, 4, 6, 1, -1, -5, 0, 7, -8]
for num in test_list:
    if num > 0:
        positive_sum += num
    elif num == 0:
        pass
    else:
        negative_sum += num
print(positive_sum, negative_sum)
Here pass means “don’t do anything”. In this particular case, it’s not actually needed, since if num == 0 neither sum needs to change, but it illustrates the use of elif and pass.
Sorting a List Into Buckets
In our data folder, large data sets are stored in files whose names start with “inflammation-” and small data sets in files whose names start with “small-”. We also have some other files that we do not care about at this point. We’d like to break all these files into three lists called large_files, small_files, and other_files, respectively.
Add code to the template below to do this. Note that the string method startswith returns True if and only if the string it is called on starts with the string passed as an argument, that is:
OUTPUT
True
But
OUTPUT
False
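Two calls that would give these results are sketched below. The strings are made up for illustration:

```python
matches_prefix = 'String'.startswith('Str')  # the prefix matches
wrong_case = 'String'.startswith('str')      # the comparison is case sensitive

print(matches_prefix)
print(wrong_case)
```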
Use the following Python code as your starting point:
PYTHON
filenames = ['inflammation-01.csv',
'myscript.py',
'inflammation-02.csv',
'small-01.csv',
'small-02.csv']
large_files = []
small_files = []
other_files = []
Your solution should:
- loop over the names of the files
- figure out which group each filename belongs in
- append the filename to that list
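In the end the three lists should be:
large_files: ['inflammation-01.csv', 'inflammation-02.csv']
small_files: ['small-01.csv', 'small-02.csv']
other_files: ['myscript.py']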
PYTHON
for filename in filenames:
    if filename.startswith('inflammation-'):
        large_files.append(filename)
    elif filename.startswith('small-'):
        small_files.append(filename)
    else:
        other_files.append(filename)
print('large_files:', large_files)
print('small_files:', small_files)
print('other_files:', other_files)
- Write a loop that counts the number of vowels in a character string.
- Test it on a few individual words and full sentences.
- Once you are done, compare your solution to your neighbor’s. Did you make the same decisions about how to handle the letter ‘y’ (which some people think is a vowel, and some do not)?
Key Points
- Use for variable in sequence to process the elements of a sequence one at a time.
- The body of a for loop must be indented.
- Use len(thing) to determine the length of something that contains other values.
- Use if condition to start a conditional statement, elif condition to provide additional tests, and else to provide a default.
- The bodies of the branches of conditional statements must be indented.
- Use == to test for equality.
- X and Y is only true if both X and Y are true.
- X or Y is true if either X or Y, or both, are true.
- Zero, the empty string, and the empty list are considered false; all other numbers, strings, and lists are considered true.
- True and False represent truth values.
Content from Alternatives to Loops
Last updated on 2024-07-11
Overview
Questions
- How can I vectorize my loops?
Objectives
- identify what vectorized operations are
- perform basic vectorized operations
FIXME
Key Points
- NULL
Content from Creating Functions
Last updated on 2024-07-11
Overview
Questions
- What are functions, and how can I use them in Python?
- How can I define new functions?
- What’s the difference between defining and calling a function?
- What happens when I call a function?
Objectives
- identify what a function is
- create new functions
- Set default values for function parameters.
- Explain why we should divide programs into small, single-purpose functions.
At this point, we’ve seen that code can have Python make decisions about what it sees in our data. What if we want to convert some of our data, like taking a temperature in Fahrenheit and converting it to Celsius? We could write something like this for converting a single number, and for a second number we could just copy the line and rename the variables:
PYTHON
fahrenheit_val = 99
celsius_val = ((fahrenheit_val - 32) * (5/9))
fahrenheit_val2 = 43
celsius_val2 = ((fahrenheit_val2 - 32) * (5/9))
But we would be in trouble as soon as we had to do this more than a couple of times. Cutting and pasting it is going to make our code get very long and very repetitive, very quickly. We’d like a way to package our code so that it is easier to reuse, a shorthand way of re-executing longer pieces of code. In Python we can use ‘functions’. Let’s start by defining a function fahr_to_celsius that converts temperatures from Fahrenheit to Celsius:
PYTHON
def explicit_fahr_to_celsius(temp):
    # Assign the converted value to a variable
    converted = ((temp - 32) * (5/9))
    # Return the value of the new variable
    return converted

def fahr_to_celsius(temp):
    # Return the converted value directly, without creating a new
    # variable. This function does the same thing as the previous one,
    # which is more explicit in showing how the return statement works.
    return ((temp - 32) * (5/9))
The function definition opens with the keyword def followed by the name of the function (fahr_to_celsius) and a parenthesized list of parameter names (temp). The body of the function (the statements that are executed when it runs) is indented below the definition line. The body concludes with a return keyword followed by the return value.
When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.
Let’s try running our function.
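A minimal sketch of such a call. The function definition is repeated so that the example runs on its own:

```python
def fahr_to_celsius(temp):
    return ((temp - 32) * (5/9))

# Call the function with 32 as the input
fahr_to_celsius(32)
```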
This command should call our function, using “32” as the input and return the function value.
In fact, calling our own function is no different from calling any other function:
PYTHON
print('freezing point of water:', fahr_to_celsius(32), 'C')
print('boiling point of water:', fahr_to_celsius(212), 'C')
OUTPUT
freezing point of water: 0.0 C
boiling point of water: 100.0 C
We’ve successfully called the function that we defined, and we have access to the value that we returned.
Composing Functions
Now that we’ve seen how to turn Fahrenheit into Celsius, we can also write the function to turn Celsius into Kelvin:
PYTHON
def celsius_to_kelvin(temp_c):
    return temp_c + 273.15

print('freezing point of water in Kelvin:', celsius_to_kelvin(0.))
OUTPUT
freezing point of water in Kelvin: 273.15
What about converting Fahrenheit to Kelvin? We could write out the formula, but we don’t need to. Instead, we can compose the two functions we have already created:
PYTHON
def fahr_to_kelvin(temp_f):
    temp_c = fahr_to_celsius(temp_f)
    temp_k = celsius_to_kelvin(temp_c)
    return temp_k

print('boiling point of water in Kelvin:', fahr_to_kelvin(212.0))
OUTPUT
boiling point of water in Kelvin: 373.15
This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here — typically half a dozen to a few dozen lines — but they shouldn’t ever be much longer than that, or the next person who reads it won’t be able to understand what’s going on.
Variable Scope
In composing our temperature conversion functions, we created variables inside of those functions: temp, temp_c, temp_f, and temp_k. We refer to these variables as local variables because they no longer exist once the function is done executing. If we try to access their values outside of the function, we will encounter an error:
ERROR
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-1-eed2471d229b> in <module>
----> 1 print('Again, temperature in Kelvin was:', temp_k)
NameError: name 'temp_k' is not defined
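The behaviour can be reproduced with the sketch below. It repeats a compact version of the conversion so that it runs on its own, and it catches the error instead of halting; the variable error_message is introduced only for this illustration:

```python
def fahr_to_kelvin(temp_f):
    temp_c = (temp_f - 32) * (5/9)  # local variable
    temp_k = temp_c + 273.15        # local variable
    return temp_k

fahr_to_kelvin(212.0)

error_message = None
try:
    print('Again, temperature in Kelvin was:', temp_k)
except NameError as error:
    # temp_k only existed while fahr_to_kelvin was running
    error_message = str(error)
    print('NameError:', error_message)
```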
If you want to reuse the temperature in Kelvin after you have calculated it with fahr_to_kelvin, you can store the result of the function call in a variable:
OUTPUT
temperature in Kelvin was: 373.15
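A sketch of storing the result, with a compact fahr_to_kelvin repeated so that the example is self-contained:

```python
def fahr_to_kelvin(temp_f):
    return (temp_f - 32) * (5/9) + 273.15

# The returned value is kept in a variable defined outside the function
temp_kelvin = fahr_to_kelvin(212.0)
print('temperature in Kelvin was:', temp_kelvin)
```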
The variable temp_kelvin, being defined outside any function, is said to be global.
Inside a function, one can read the value of such global variables:
PYTHON
def print_temperatures():
    print('temperature in Fahrenheit was:', temp_fahr)
    print('temperature in Kelvin was:', temp_kelvin)

temp_fahr = 212.0
temp_kelvin = fahr_to_kelvin(temp_fahr)
print_temperatures()
OUTPUT
temperature in Fahrenheit was: 212.0
temperature in Kelvin was: 373.15
By giving our functions human-readable names, we can more easily read and understand what is happening in our code. Even better, if at some later date we want to use either of those pieces of code again, we can do so in a single line.
Testing and Documenting
Once we start putting things in functions so that we can re-use them, we need to start testing that those functions are working correctly. To see how to do this, let’s write a function to offset a dataset so that its mean value shifts to a user-defined value:
PYTHON
def offset_mean(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value
We could test this on our actual data, but since we don’t know what the values ought to be, it will be hard to tell if the result was correct. Instead, let’s use NumPy to create a matrix of 0’s and then offset its values to have a mean value of 3:
OUTPUT
[[ 3. 3.]
[ 3. 3.]]
That looks right, so let’s try offset_mean on our real data:
OUTPUT
[[-6.14875 -6.14875 -5.14875 ... -3.14875 -6.14875 -6.14875]
[-6.14875 -5.14875 -4.14875 ... -5.14875 -6.14875 -5.14875]
[-6.14875 -5.14875 -5.14875 ... -4.14875 -5.14875 -5.14875]
...
[-6.14875 -5.14875 -5.14875 ... -5.14875 -5.14875 -5.14875]
[-6.14875 -6.14875 -6.14875 ... -6.14875 -4.14875 -6.14875]
[-6.14875 -6.14875 -5.14875 ... -5.14875 -5.14875 -6.14875]]
It’s hard to tell from the default output whether the result is correct, but there are a few tests that we can run to reassure us:
PYTHON
print('original min, mean, and max are:', numpy.amin(data), numpy.mean(data), numpy.amax(data))
offset_data = offset_mean(data, 0)
print('min, mean, and max of offset data are:',
numpy.amin(offset_data),
numpy.mean(offset_data),
numpy.amax(offset_data))
OUTPUT
original min, mean, and max are: 0.0 6.14875 20.0
min, mean, and max of offset data are: -6.14875 2.84217094304e-16 13.85125
That seems almost right: the original mean was about 6.1, so the lower bound from zero is now about -6.1. The mean of the offset data isn’t quite zero — we’ll explore why not in the challenges — but it’s pretty close. We can even go further and check that the standard deviation hasn’t changed:
OUTPUT
std dev before and after: 4.61383319712 4.61383319712
Those values look the same, but we probably wouldn’t notice if they were different in the sixth decimal place. Let’s do this instead:
PYTHON
print('difference in standard deviations before and after:',
      numpy.std(data) - numpy.std(offset_data))
OUTPUT
difference in standard deviations before and after: -3.5527136788e-15
Again, the difference is very small. It’s still possible that our function is wrong, but it seems unlikely enough that we should probably get back to doing our analysis.
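The manual checks above can also be written as assertions, so that Python tells us when something is off instead of relying on us to read the numbers. The sketch below is one possible way to do this, not the lesson's own code: the randomly generated `data` array is a hypothetical stand-in for the real dataset, and the tolerance passed to `numpy.isclose` is an assumption.

```python
import numpy


def offset_mean(data, target_mean_value):
    """Return a new array with its mean shifted to target_mean_value."""
    return (data - numpy.mean(data)) + target_mean_value


# Hypothetical stand-in for the real dataset: 60 rows by 40 columns of small counts.
rng = numpy.random.default_rng(seed=0)
data = rng.integers(0, 21, size=(60, 40)).astype(float)

offset_data = offset_mean(data, 0)

# Floating-point rounding means the new mean is only approximately zero,
# so we compare with a tolerance instead of using ==.
assert numpy.isclose(numpy.mean(offset_data), 0.0, atol=1e-9)
# Shifting every value by the same amount should leave the spread unchanged.
assert numpy.isclose(numpy.std(data), numpy.std(offset_data))
# The shape of the data should not change either.
assert offset_data.shape == data.shape
print('all checks passed')
```

If any of these assumptions about the function were wrong, the corresponding `assert` would stop the program with an `AssertionError` rather than letting a small discrepancy slip by unnoticed.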
Documentation
We have one more task first, though: we should write some documentation for our function to remind ourselves later what it’s for and how to use it.
The usual way to put documentation in software is to add comments like this:
PYTHON
# offset_mean(data, target_mean_value):
# return a new array containing the original data with its mean offset to match the desired value.
def offset_mean(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value
There’s a better way, though. If the first thing in a function is a string that isn’t assigned to a variable, that string is attached to the function as its documentation:
PYTHON
def offset_mean(data, target_mean_value):
    """Return a new array containing the original data
       with its mean offset to match the desired value."""
    return (data - numpy.mean(data)) + target_mean_value
This is better because we can now ask Python’s built-in help system to show us the documentation for the function:
OUTPUT
Help on function offset_mean in module __main__:
offset_mean(data, target_mean_value)
Return a new array containing the original data with its mean offset to match the desired value.
A string like this is called a docstring. We don’t need to use triple quotes when we write one, but if we do, we can break the string across multiple lines:
PYTHON
def offset_mean(data, target_mean_value):
    """Return a new array containing the original data
       with its mean offset to match the desired value.

    Examples
    --------
    >>> offset_mean([1, 2, 3], 0)
    array([-1., 0., 1.])
    """
    return (data - numpy.mean(data)) + target_mean_value

help(offset_mean)
OUTPUT
Help on function offset_mean in module __main__:
offset_mean(data, target_mean_value)
Return a new array containing the original data
with its mean offset to match the desired value.
Examples
--------
>>> offset_mean([1, 2, 3], 0)
array([-1., 0., 1.])
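To see why a docstring is more useful than a comment, it can help to look at where Python keeps it. The sketch below is a small illustration, assuming the two-line docstring version of offset_mean from above; `offset_mean_commented` is a hypothetical comparison function that is documented only with a comment.

```python
import numpy


def offset_mean(data, target_mean_value):
    """Return a new array containing the original data
       with its mean offset to match the desired value."""
    return (data - numpy.mean(data)) + target_mean_value


# offset_mean_commented(data, target_mean_value):
# return a new array containing the original data with its mean offset.
def offset_mean_commented(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value


# The docstring is stored on the function object, which is where help() finds it.
print(offset_mean.__doc__)

# Comments are discarded when the code is read, so there is nothing to show here.
print(offset_mean_commented.__doc__)
```

The second `print` shows `None`, because the comment above the function is never attached to it.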
Defining Defaults
We have passed parameters to functions in two ways: directly, as in type(data), and by name, as in numpy.loadtxt(fname='something.csv', delimiter=','). In fact, we can pass the filename to loadtxt without the fname=:
OUTPUT
array([[ 0., 0., 1., ..., 3., 0., 0.],
[ 0., 1., 2., ..., 1., 0., 1.],
[ 0., 1., 1., ..., 2., 1., 1.],
...,
[ 0., 1., 1., ..., 1., 1., 1.],
[ 0., 0., 0., ..., 0., 2., 0.],
[ 0., 0., 1., ..., 1., 1., 0.]])
but we still need to say delimiter=:
ERROR
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/username/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 1041, in loadtxt
dtype = np.dtype(dtype)
File "/Users/username/anaconda3/lib/python3.6/site-packages/numpy/core/_internal.py", line 199, in _commastring
newitem = (dtype, eval(repeats))
File "<string>", line 1
,
^
SyntaxError: unexpected EOF while parsing
To understand what’s going on, and make our own functions easier to use, let’s re-define our offset_mean function like this:
PYTHON
def offset_mean(data, target_mean_value=0.0):
    """Return a new array containing the original data
       with its mean offset to match the desired value (0 by default).

    Examples
    --------
    >>> offset_mean([1, 2, 3])
    array([-1., 0., 1.])
    """
    return (data - numpy.mean(data)) + target_mean_value
The key change is that the second parameter is now written target_mean_value=0.0 instead of just target_mean_value. If we call the function with two arguments, it works as it did before:
OUTPUT
[[ 3. 3.]
[ 3. 3.]]
But we can also now call it with just one parameter, in which case target_mean_value is automatically assigned the default value of 0.0:
PYTHON
more_data = 5 + numpy.zeros((2, 2))
print('data before mean offset:')
print(more_data)
print('offset data:')
print(offset_mean(more_data))
OUTPUT
data before mean offset:
[[ 5. 5.]
[ 5. 5.]]
offset data:
[[ 0. 0.]
[ 0. 0.]]
This is handy: if we usually want a function to work one way, but occasionally need it to do something else, we can allow people to pass a parameter when they need to but provide a default to make the normal case easier. The example below shows how Python matches values to parameters:
PYTHON
def display(a=1, b=2, c=3):
    print('a:', a, 'b:', b, 'c:', c)

print('no parameters:')
display()
print('one parameter:')
display(55)
print('two parameters:')
display(55, 66)
OUTPUT
no parameters:
a: 1 b: 2 c: 3
one parameter:
a: 55 b: 2 c: 3
two parameters:
a: 55 b: 66 c: 3
As this example shows, parameters are matched up from left to right, and any that haven’t been given a value explicitly get their default value. We can override this behavior by naming the value as we pass it in:
OUTPUT
only setting the value of c
a: 1 b: 2 c: 77
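The call that produced the output above is not shown. A minimal sketch that is consistent with it, reusing the display function from the previous example and assuming the label was printed by a separate print call:

```python
def display(a=1, b=2, c=3):
    print('a:', a, 'b:', b, 'c:', c)


print('only setting the value of c')
# Naming the parameter skips over a and b, which keep their default values.
display(c=77)
```

Because c is named explicitly, Python does not need to match it by position.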
With that in hand, let’s look at the help for numpy.loadtxt:
OUTPUT
Help on function loadtxt in module numpy.lib.npyio:
loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes')
Load data from a text file.
Each row in the text file must have the same number of values.
Parameters
----------
...
There’s a lot of information here, but the most important part is the first couple of lines:
OUTPUT
loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes')
This tells us that loadtxt has one parameter called fname that doesn’t have a default value, and nine others that do. If we call the function with two positional arguments, as in numpy.loadtxt('something.csv', ','), then the filename is assigned to fname (which is what we want), but the delimiter string ',' is assigned to dtype rather than delimiter, because dtype is the second parameter in the list. However, ',' isn’t a known dtype, so our code produced an error message when we tried to run it. When we call loadtxt we don’t have to provide fname= for the filename because it’s the first item in the list, but if we want the ',' to be assigned to the parameter delimiter, we do have to provide delimiter= for the second argument, since delimiter is not the second parameter in the list.
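As a quick illustration of the working form of the call, the sketch below passes the first argument by position and the delimiter by name. It assumes a small, made-up table held in memory (`csv_text` is hypothetical) so that it does not depend on any file on disk.

```python
import io

import numpy

# Hypothetical two-row table held in memory instead of a CSV file on disk.
csv_text = io.StringIO("0,1,2\n3,4,5\n")

# fname is the first parameter, so it can be passed by position;
# delimiter is not the second parameter, so it has to be passed by name.
data = numpy.loadtxt(csv_text, delimiter=',')
print(data)
```

The positional form with ',' as the second argument is left out on purpose: it fails, and the exact error raised may differ between NumPy versions.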
Readable functions
Consider these two functions:
PYTHON
def s(p):
    a = 0
    for v in p:
        a += v
    m = a / len(p)
    d = 0
    for v in p:
        d += (v - m) * (v - m)
    return numpy.sqrt(d / (len(p) - 1))

def std_dev(sample):
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)

    sum_squared_devs = 0
    for value in sample:
        sum_squared_devs += (value - sample_mean) * (value - sample_mean)

    return numpy.sqrt(sum_squared_devs / (len(sample) - 1))
The functions s and std_dev are computationally equivalent (they both calculate the sample standard deviation), but to a human reader, they look very different. You probably found std_dev much easier to read and understand than s.
As this example illustrates, both documentation and a programmer’s coding style combine to determine how easy it is for others to read and understand the programmer’s code. Choosing meaningful variable names and using blank spaces to break the code into logical “chunks” are helpful techniques for producing readable code. This is useful not only for sharing code with others, but also for the original programmer. If you need to revisit code that you wrote months ago and haven’t thought about since then, you will appreciate the value of readable code!
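Readable code is also easier to test. As a sketch, we can compare std_dev with NumPy's own calculation; passing ddof=1 to numpy.std asks for the sample (rather than population) standard deviation. The values in `sample` are made up for illustration.

```python
import numpy


def std_dev(sample):
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)

    sum_squared_devs = 0
    for value in sample:
        sum_squared_devs += (value - sample_mean) * (value - sample_mean)

    return numpy.sqrt(sum_squared_devs / (len(sample) - 1))


# Hypothetical sample values, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

ours = std_dev(sample)
reference = numpy.std(sample, ddof=1)  # ddof=1 divides by (n - 1)
print(ours, reference)
assert numpy.isclose(ours, reference)
```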
Combining Strings
“Adding” two strings produces their concatenation: 'a' + 'b' is 'ab'. Write a function called fence that takes two parameters called original and wrapper and returns a new string that has the wrapper character at the beginning and end of the original. A call to your function should look like this:
OUTPUT
*name*
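One possible solution is sketched below. The call shown is an assumption chosen to reproduce the output above; other implementations are equally valid.

```python
def fence(original, wrapper):
    # Concatenate the wrapper on both sides of the original string.
    return wrapper + original + wrapper


print(fence('name', '*'))
```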
Return versus print
Note that return and print are not interchangeable. print is a Python function that prints data to the screen. It enables us, the users, to see the data. The return statement, on the other hand, makes data visible to the program. Let’s have a look at the following function:
Question: What will we see if we execute the following commands?
Python will first execute the function add with a = 7 and b = 3, and, therefore, print 10. However, because the function add does not have a line that starts with return (no return “statement”), it will, by default, return nothing which, in the Python world, is called None. Therefore, A will be assigned None and the last line (print(A)) will print None. As a result, we will see:
OUTPUT
10
None
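The function and the commands discussed above are not reproduced here. A minimal reconstruction that matches the explanation, assuming add simply prints the sum of its two arguments and has no return statement:

```python
def add(a, b):
    # Shows the sum on the screen but does not hand it back to the caller.
    print(a + b)


A = add(7, 3)   # prints 10; A receives the default return value, None
print(A)        # prints None
```

Replacing the print call inside add with `return a + b` would make A equal to 10, and only the final print would produce output.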
Selecting Characters From Strings
If the variable s refers to a string, then s[0] is the string’s first character and s[-1] is its last. Write a function called outer that returns a string made up of just the first and last characters of its input. A call to your function should look like this:
OUTPUT
hm
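One possible solution is sketched below. The input 'helium' is an assumption; any string that starts with h and ends with m would give the output above.

```python
def outer(input_string):
    # Index 0 is the first character and index -1 is the last.
    return input_string[0] + input_string[-1]


print(outer('helium'))
```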
Rescaling an Array
Write a function rescale that takes an array as input and returns a corresponding array of values scaled to lie in the range 0.0 to 1.0. (Hint: If L and H are the lowest and highest values in the original array, then the replacement for a value v should be (v-L) / (H-L).)
Testing and Documenting Your Function
Run the commands help(numpy.arange) and help(numpy.linspace) to see how to use these functions to generate regularly-spaced values, then use those values to test your rescale function. Once you’ve successfully tested your function, add a docstring that explains what it does.
PYTHON
def rescale(input_array):
    """Takes an array as input, and returns a corresponding array scaled so
    that 0 corresponds to the minimum and 1 to the maximum value of the input array.

    Examples:
    >>> rescale(numpy.arange(10.0))
    array([ 0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
    0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ])
    >>> rescale(numpy.linspace(0, 100, 5))
    array([ 0. , 0.25, 0.5 , 0.75, 1. ])
    """
    L = numpy.amin(input_array)
    H = numpy.amax(input_array)
    output_array = (input_array - L) / (H - L)
    return output_array
Defining Defaults
Rewrite the rescale function so that it scales data to lie between 0.0 and 1.0 by default, but will allow the caller to specify lower and upper bounds if they want. Compare your implementation to your neighbor’s: do the two functions always behave the same way?
PYTHON
def rescale(input_array, low_val=0.0, high_val=1.0):
    """rescales input array values to lie between low_val and high_val"""
    L = numpy.amin(input_array)
    H = numpy.amax(input_array)
    intermed_array = (input_array - L) / (H - L)
    output_array = intermed_array * (high_val - low_val) + low_val
    return output_array
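To check this solution, we can call it once with the defaults and once with explicit bounds. The sketch below repeats the function so that it runs on its own; the test values and the bounds -1.0 and 1.0 are arbitrary choices for illustration.

```python
import numpy


def rescale(input_array, low_val=0.0, high_val=1.0):
    """rescales input array values to lie between low_val and high_val"""
    L = numpy.amin(input_array)
    H = numpy.amax(input_array)
    intermed_array = (input_array - L) / (H - L)
    output_array = intermed_array * (high_val - low_val) + low_val
    return output_array


values = numpy.linspace(0, 100, 5)

# No bounds given: the defaults 0.0 and 1.0 are used.
default_scaled = rescale(values)
# Bounds passed by name override the defaults.
custom_scaled = rescale(values, low_val=-1.0, high_val=1.0)

print(default_scaled)
print(custom_scaled)
```

One case where two implementations may behave differently is an array whose values are all equal: here H - L is zero, so this version divides by zero and produces nan values.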
OUTPUT
259.81666666666666
278.15
273.15
0
k is 0 because the k inside the function f2k doesn’t know about the k defined outside the function. When the f2k function is called, it creates a local variable k. The function does not return any values and does not alter k outside of its local copy. Therefore the original value of k remains unchanged. Beware that a local k is created because the internal statements of f2k assign a new value to it. If k was only read, it would simply retrieve the global k value.
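The exercise code is not reproduced here, so the sketch below is only an illustration of the rule just described. It assumes a Fahrenheit-to-Kelvin function named f2k that assigns to k and prints it; `read_k` is a hypothetical helper added to show what happens when k is only read.

```python
k = 0  # variable defined outside any function


def f2k(f):
    # Assigning to k creates a new local variable; the outer k is untouched.
    k = ((f - 32) * (5.0 / 9.0)) + 273.15
    print(k)


def read_k():
    # k is only read here, so Python looks it up outside the function.
    return k


f2k(32)
print(k)         # still 0
print(read_k())  # also 0
```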
Mixing Default and Non-Default Parameters
Given the following code:
PYTHON
def numbers(one, two=2, three, four=4):
    n = str(one) + str(two) + str(three) + str(four)
    return n

print(numbers(1, three=3))
what do you expect will be printed? What is actually printed? What rule do you think Python is following?
1. 1234
2. one2three4
3. 1239
4. SyntaxError
Given that, what does the following piece of code display when run?
1. a: b: 3 c: 6
2. a: -1 b: 3 c: 6
3. a: -1 b: 2 c: 6
4. a: b: -1 c: 2
Attempting to define the numbers function results in 4. SyntaxError. The defined parameters two and four are given default values. Because one and three are not given default values, they are required to be included as arguments when the function is called and must be placed before any parameters that have default values in the function definition.
The given call to func displays a: -1 b: 2 c: 6. -1 is assigned to the first parameter a, 2 is assigned to the next parameter b, and c is not passed a value, so it uses its default value 6.
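Both answers can be checked with a short sketch. The definition of func is not shown above, so the version below is a reconstruction: the defaults b=3 and c=6 and the call func(-1, 2) are inferred from the answer options and the explanation. The invalid numbers definition is kept in a string and passed to the built-in compile, so that the SyntaxError can be caught instead of stopping the whole script.

```python
# Reconstructed from the answer options; not the original exercise code.
def func(a, b=3, c=6):
    print('a:', a, 'b:', b, 'c:', c)


func(-1, 2)

# A parameter without a default (three) follows one with a default (two),
# which Python rejects before the function can ever be called.
bad_definition = "def numbers(one, two=2, three, four=4):\n    pass\n"
try:
    compile(bad_definition, '<example>', 'exec')
    definition_is_valid = True
except SyntaxError:
    definition_is_valid = False

print('numbers definition is valid:', definition_is_valid)
```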
Readable Code
Revise a function you wrote for one of the previous exercises to try to make the code more readable. Then, collaborate with one of your neighbors to critique each other’s functions and discuss how your function implementations could be further improved to make them more readable.
Key Points
- Define a function using def function_name(parameter).
- The body of a function must be indented.
- Call a function using function_name(value).
- Numbers are stored as integers or floating-point numbers.
- Variables defined within a function can only be seen and used within the body of the function.
- Variables created outside of any function are called global variables.
- Within a function, we can access global variables.
- Variables created within a function override global variables if their names match.
- Use help(thing) to view help for something.
- Put docstrings in functions to provide help for that function.
- Specify default values for parameters when defining a function using name=value in the parameter list.
- Parameters can be passed by matching based on name, by position, or by omitting them (in which case the default value is used).
- Put code whose parameters change frequently in a function, then call it with different parameter values to customize its behavior.
Content from Data Analysis
Last updated on 2024-07-11
Overview
Questions
- How can I process tabular data files in Python?
- How can I do the same operations on many different files?
Objectives
- read in data files to Python
- perform common operations on tabular data
- write code to perform the same operation on multiple files
FIXME
Key Points
- NULL
Content from Visualizations
Last updated on 2024-07-11
Overview
Questions
- How can I visualize tabular data in Python?
- How can I group several plots together?
Objectives
- create graphs and other visualizations using tabular data
- group plots together to make comparative visualizations
FIXME
Key Points
- NULL
Content from Errors and Exceptions
Last updated on 2024-07-11
Overview
Questions
- How does Python report errors?
- How can I handle errors in Python programs?
Objectives
- identify different errors and correct bugs associated with them
Every programmer encounters errors, both those who are just beginning, and those who have been programming for years. Encountering errors and exceptions can be very frustrating at times, and can make coding feel like a hopeless endeavour. However, understanding what the different types of errors are and when you are likely to encounter them can help a lot. Once you know why you get certain types of errors, they become much easier to fix.
Errors in Python have a very specific form, called a traceback. Let’s examine one:
PYTHON
# This code has an intentional error. You can type it directly or
# use it for reference to understand the error message below.
def favorite_ice_cream():
    ice_creams = [
        'chocolate',
        'vanilla',
        'strawberry'
    ]
    print(ice_creams[3])

favorite_ice_cream()
ERROR
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-1-70bd89baa4df> in <module>()
9 print(ice_creams[3])
10
----> 11 favorite_ice_cream()
<ipython-input-1-70bd89baa4df> in favorite_ice_cream()
7 'strawberry'
8 ]
----> 9 print(ice_creams[3])
10
11 favorite_ice_cream()
IndexError: list index out of range
This particular traceback has two levels. You can determine the number of levels by looking for the number of arrows on the left hand side. In this case:
- The first shows code from the cell above, with an arrow pointing to Line 11 (which is favorite_ice_cream()).
- The second shows some code in the function favorite_ice_cream, with an arrow pointing to Line 9 (which is print(ice_creams[3])).
The last level is the actual place where the error occurred. The other level(s) show what function the program executed to get to the next level down. So, in this case, the program first performed a function call to the function favorite_ice_cream. Inside this function, the program encountered an error on Line 9, when it tried to run the code print(ice_creams[3]).
Long Tracebacks
Sometimes, you might see a traceback that is very long -- sometimes they might even be 20 levels deep! This can make it seem like something horrible happened, but the length of the error message does not reflect severity; rather, it indicates that your program called many functions before it encountered the error. Most of the time, the actual place where the error occurred is at the bottom-most level, so you can skip down the traceback to the bottom.
So what error did the program actually encounter? In the last line of the traceback, Python helpfully tells us the category or type of error (in this case, it is an IndexError) and a more detailed error message (in this case, it says “list index out of range”).
If you encounter an error and don’t know what it means, it is still important to read the traceback closely. That way, if you fix the error, but encounter a new one, you can tell that the error changed. Additionally, sometimes knowing where the error occurred is enough to fix it, even if you don’t entirely understand the message.
If you do encounter an error you don’t recognize, try looking at the official documentation on errors. However, note that you may not always be able to find the error there, as it is possible to create custom errors. In that case, hopefully the custom error message is informative enough to help you figure out what went wrong. Libraries like pandas and NumPy have these custom errors, but the procedure to figure them out is the same: go to the last line of the traceback, and read the error type and message given there. The documentation for these libraries will often provide the information you need about any functions you are using. There are also large communities of users for data libraries that can help as well!
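The error type and message can also be inspected from inside a program. The sketch below is one way to do this, using a try/except block (which this lesson has not introduced in detail) and the standard traceback module; it reuses the ice cream example with the list written on a single line.

```python
import traceback


def favorite_ice_cream():
    ice_creams = ['chocolate', 'vanilla', 'strawberry']
    print(ice_creams[3])


try:
    favorite_ice_cream()
except IndexError as error:
    # The category of the error and its detailed message.
    error_type = type(error).__name__
    error_message = str(error)
    # format_exc() returns the same traceback text Python would display.
    last_line = traceback.format_exc().strip().splitlines()[-1]

print(error_type)
print(error_message)
print(last_line)
```

The last line of the formatted traceback combines the two pieces of information, which is why reading the bottom of a traceback first is usually the fastest way to understand an error.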
Reading Error Messages
Read the Python code and the resulting traceback below, and answer the following questions:
- How many levels does the traceback have?
- What is the function name where the error occurred?
- On which line number in this function did the error occur?
- What is the type of error?
- What is the error message?
PYTHON
# This code has an intentional error. Do not type it directly;
# use it for reference to understand the error message below.
def print_message(day):
    messages = [
        'Hello, world!',
        'Today is Tuesday!',
        'It is the middle of the week.',
        'Today is Donnerstag in German!',
        'Last day of the week!',
        'Hooray for the weekend!',
        'Aw, the weekend is almost over.'
    ]
    print(messages[day])

def print_sunday_message():
    print_message(7)

print_sunday_message()
ERROR
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-7-3ad455d81842> in <module>
16 print_message(7)
17
---> 18 print_sunday_message()
19
<ipython-input-7-3ad455d81842> in print_sunday_message()
14
15 def print_sunday_message():
---> 16 print_message(7)
17
18 print_sunday_message()
<ipython-input-7-3ad455d81842> in print_message(day)
11 'Aw, the weekend is almost over.'
12 ]
---> 13 print(messages[day])
14
15 def print_sunday_message():
IndexError: list index out of range
- 3 levels
- print_message
- 13
- IndexError
- list index out of range. You can then infer that 7 is not the right index to use with messages.
Better errors on newer Pythons
Newer versions of Python have improved error printouts. If you are debugging errors, it is often helpful to use the latest Python version, even if you support older versions of Python.
Type Errors
One of the most common types of error in Python is the type error. These errors occur when you try to perform an operation on an object in Python that cannot support it. This happens easily when working with large datasets where there are expected value types like either strings or integers. When we write a function expecting integers, we will not get an error until we encounter an operation that cannot handle strings. For example:
ERROR
File "<ipython-input-3-6bb841ea1423>", line 3
letter=my_string["e"]
^
TypeError: string indices must be integers
We get this error because we are trying to use an index to access part of our string, which requires an integer. Instead, we entered a character and received a type error. This is fixed by replacing “e” with 2.
In the case of datasets, we often see type errors when a mathematical operation, such as taking a mean, is performed on a column that contains characters, either as a result of formatting or introduced through error. As a result, correcting the error can involve simply removing the characters from the strings using regular expressions, or if the characters have resulted in incorrect data, removing those observations from the dataset.
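The string indexing error described above can be reproduced with a short sketch. The original string is not shown, so "hello" is a hypothetical value; in this string the letter e sits at index 1, so the integer used in the fix differs from the one mentioned above.

```python
my_string = "hello"  # hypothetical value; the original string is not shown

try:
    # Indexing a string needs an integer position, not a character.
    letter = my_string["e"]
except TypeError as error:
    print("TypeError:", error)

# Using the integer position of the character works.
letter = my_string[1]
print(letter)
```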
Syntax Errors
When you forget a colon at the end of a line, accidentally add one
space too many when indenting under an if
statement, or
forget a parenthesis, you will encounter a syntax error. This means that
Python couldn’t figure out how to read your program. This is similar to
forgetting punctuation in English: for example, this text is difficult
to read there is no punctuation there is also no capitalization why is
this hard because you have to figure out where each sentence ends you
also have to figure out where each sentence begins to some extent it
might be ambiguous if there should be a sentence break or not
People can typically figure out what is meant by text with no punctuation, but people are much smarter than computers. If Python doesn’t know how to read the program, it will give up and inform you with an error. For example:
ERROR
File "<ipython-input-3-6bb841ea1423>", line 1
def some_function()
^
SyntaxError: invalid syntax
Here, Python tells us that there is a SyntaxError on line 1, and even puts a little arrow in the place where there is an issue. In this case the problem is that the function definition is missing a colon at the end.
Actually, the function above has two issues with syntax. If we fix the problem with the colon, we see that there is also an IndentationError, which means that the lines in the function definition do not all have the same indentation:
ERROR
File "<ipython-input-4-ae290e7659cb>", line 4
return msg
^
IndentationError: unexpected indent
Both SyntaxError and IndentationError indicate a problem with the syntax of your program, but an IndentationError is more specific: it always means that there is a problem with how your code is indented.
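The relationship between the two error types can be checked directly. The function that produced the errors above is not shown in full, so the two source strings below are hypothetical stand-ins: one is missing the colon, the other has one space too many before return. They are passed to the built-in compile so that the errors can be caught rather than stopping the script.

```python
# Hypothetical stand-ins for the function discussed above.
missing_colon = (
    "def some_function()\n"
    "    msg = 'hello, world!'\n"
    "    return msg\n"
)
unexpected_indent = (
    "def some_function():\n"
    "    msg = 'hello, world!'\n"
    "     return msg\n"
)


def error_name(source):
    """Return the name of the error raised when compiling source, or None."""
    try:
        compile(source, '<example>', 'exec')
    except SyntaxError as error:
        return type(error).__name__
    return None


print(error_name(missing_colon))
print(error_name(unexpected_indent))
# IndentationError is a more specific kind of SyntaxError.
print(issubclass(IndentationError, SyntaxError))
```

Catching SyntaxError is enough to catch both cases, because every IndentationError is also a SyntaxError.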
Tabs and Spaces
Some indentation errors are harder to spot than others. In particular, mixing spaces and tabs can be difficult to spot because they are both whitespace. In the example below, the first two lines in the body of the function some_function are indented with tabs, while the third line is indented with spaces. If you’re working in a Jupyter notebook, be sure to copy and paste this example rather than trying to type it in manually because Jupyter automatically replaces tabs with spaces.
Visually it is impossible to spot the error. Fortunately, Python does not allow you to mix tabs and spaces.
ERROR
File "<ipython-input-5-653b36fbcd41>", line 4
return msg
^
TabError: inconsistent use of tabs and spaces in indentation
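Because tabs are hard to type reliably in some editors, one way to experiment with this error is to write the tab characters explicitly as \t inside a string. The source below is a hypothetical stand-in for the example described above: two lines indented with a tab, followed by one indented with eight spaces.

```python
# "\t" is a tab character; the last line uses eight spaces instead.
mixed_indentation = (
    "def some_function():\n"
    "\tmsg = 'hello, world!'\n"
    "\tprint(msg)\n"
    "        return msg\n"
)

try:
    compile(mixed_indentation, '<example>', 'exec')
    error_type = None
except SyntaxError as error:
    # TabError is a subclass of IndentationError, and so also of SyntaxError.
    error_type = type(error).__name__

print(error_type)
```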
Variable Name Errors
Another very common type of error is called a NameError, and occurs when you try to use a variable that does not exist. For example:
ERROR
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-9d7b17ad5387> in <module>()
----> 1 print(a)
NameError: name 'a' is not defined
Variable name errors come with some of the most informative error messages, which are usually of the form “name ‘the_variable_name’ is not defined”.
Why does this error message occur? That’s a harder question to answer, because it depends on what your code is supposed to do. However, there are a few very common reasons why you might have an undefined variable. The first is that you meant to use a string, but forgot to put quotes around it:
ERROR
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-8-9553ee03b645> in <module>()
----> 1 print(hello)
NameError: name 'hello' is not defined
The second reason is that you might be trying to use a variable that does not yet exist. In the following example, count should have been defined (e.g., with count = 0) before the for loop:
ERROR
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-9-dd6a12d7ca5c> in <module>()
1 for number in range(10):
----> 2 count = count + number
3 print('The count is:', count)
NameError: name 'count' is not defined
Finally, the third possibility is that you made a typo when you were writing your code. Let’s say we fixed the error above by adding the line Count = 0 before the for loop. Frustratingly, this actually does not fix the error. Remember that variables are case-sensitive, so the variable count is different from Count. We still get the same error, because we still have not defined count:
ERROR
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-d77d40059aea> in <module>()
1 Count = 0
2 for number in range(10):
----> 3 count = count + number
4 print('The count is:', count)
NameError: name 'count' is not defined
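The sketch below runs two of these situations inside try/except blocks, so that the script can continue after each error, and then shows the corrected loop. The list `caught` is a hypothetical helper used only to collect the error messages.

```python
caught = []  # hypothetical helper list collecting the error messages

# Cause 1: a string written without quotes is treated as a variable name.
try:
    print(hello)
except NameError as error:
    caught.append(str(error))

# Cause 3: Count and count are different names, so count is still undefined.
Count = 0
try:
    for number in range(10):
        count = count + number
except NameError as error:
    caught.append(str(error))

# Fix: define count (lower case) before the loop uses it.
count = 0
for number in range(10):
    count = count + number

print('The count is:', count)
print(caught)
```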
Index Errors
Next up are errors having to do with containers (like lists and strings) and the items within them. If you try to access an item in a list or a string that does not exist, then you will get an error. This makes sense: if you asked someone what day they would like to get coffee, and they answered “caturday”, you might be a bit annoyed. Python gets similarly annoyed if you try to ask it for an item that doesn’t exist:
PYTHON
letters = ['a', 'b', 'c']
print('Letter #1 is', letters[0])
print('Letter #2 is', letters[1])
print('Letter #3 is', letters[2])
print('Letter #4 is', letters[3])
OUTPUT
Letter #1 is a
Letter #2 is b
Letter #3 is c
ERROR
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-11-d817f55b7d6c> in <module>()
3 print('Letter #2 is', letters[1])
4 print('Letter #3 is', letters[2])
----> 5 print('Letter #4 is', letters[3])
IndexError: list index out of range
Here, Python is telling us that there is an IndexError in our code, meaning we tried to access a list index that did not exist.
File Errors
The last type of error we’ll cover today is the most common type of error when using Python with data: errors associated with reading and writing files, such as FileNotFoundError. If you try to read a file that does not exist, you will receive a FileNotFoundError telling you so. If you attempt to write to a file that was opened read-only, Python 3 raises an UnsupportedOperation error. More generally, problems with input and output manifest as OSErrors, which may show up as a more specific subclass; you can see the list in the Python docs. They all have a unique UNIX errno, which you can see in the error message.
ERROR
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-14-f6e1ac4aee96> in <module>()
----> 1 file_handle = open('myfile.txt', 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'myfile.txt'
One reason for receiving this error is that you specified an incorrect path to the file. For example, if I am currently in a folder called myproject, and I have a file in myproject/writing/myfile.txt, but I try to open myfile.txt, this will fail. The correct path would be writing/myfile.txt. It is also possible that the file name or its path contains a typo. There may also be specific settings based on your organization if you are using shared, networked, or cloud-based drives. It is best to check with your IT administrators if you are still encountering issues reading in a file after troubleshooting.
A related issue can occur if you mix up the “read” and “write” flags. Python will not give you an error if you try to open a file for writing when the file does not exist. However, if you meant to open a file for reading, but accidentally opened it for writing, and then try to read from it, you will get an UnsupportedOperation error telling you that the file was not opened for reading:
ERROR
---------------------------------------------------------------------------
UnsupportedOperation Traceback (most recent call last)
<ipython-input-15-b846479bc61f> in <module>()
1 file_handle = open('myfile.txt', 'w')
----> 2 file_handle.read()
UnsupportedOperation: not readable
If you are getting a read or write error on a file or folder that you are able to open and/or edit with other programs, you may need to contact an IT administrator to check the permissions granted to you and any programs you are using.
These are the most common errors with files, though many others exist. If you get an error that you’ve never seen before, searching the Internet for that error type often reveals common reasons why you might get that error.
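Both file errors can be reproduced safely with a short sketch. It works inside a temporary folder created with the standard tempfile module, so that it does not touch any real files; the file name myfile.txt is taken from the examples above.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'myfile.txt')

    # Reading a file that does not exist raises FileNotFoundError.
    try:
        open(path, 'r')
    except FileNotFoundError as error:
        missing_errno = error.errno
        print('FileNotFoundError, errno', missing_errno)

    # Opening for writing creates the file, but it cannot then be read from.
    with open(path, 'w') as file_handle:
        try:
            file_handle.read()
        except io.UnsupportedOperation as error:
            unsupported_message = str(error)
            print('UnsupportedOperation:', unsupported_message)

# Both error types are kinds of OSError.
print(issubclass(FileNotFoundError, OSError))
print(issubclass(io.UnsupportedOperation, OSError))
```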
Identifying Syntax Errors
- Read the code below, and (without running it) try to identify what the errors are.
- Run the code, and read the error message. Is it a SyntaxError or an IndentationError?
- Fix the error.
- Repeat steps 2 and 3, until you have fixed all the errors.
Identifying Variable Name Errors
- Read the code below, and (without running it) try to identify what the errors are.
- Run the code, and read the error message. What type of NameError do you think this is? In other words, is it a string with no quotes, a misspelled variable, or a variable that should have been defined but was not?
- Fix the error.
- Repeat steps 2 and 3, until you have fixed all the errors.
3 NameErrors: for number being misspelled, for message not defined, and for a not being in quotes.
Fixed version:
A Final Note About Correcting Errors
There are a lot of very helpful answers online for many error messages; however, when working with official statistics, we also need to exercise some caution. Be aware and be wary of any answers that ask you to download a package from someone’s personal GitHub repository or other file sharing service. Try to find the type of error first and understand what the issue is before downloading anything claiming to fix the error. If the error is the result of an issue with a version of a package, check if there are any security vulnerabilities with that version, and use a package manager to move between package versions.
Key Points
- NULL