Python for Official Statistics: All in One View

Last updated on 2024-07-11

Overview

Questions

What is programming?
How do I document code?
How do I find reliable and safe resources or code online?

Objectives

identify basic concepts in programming

Programming in Python

In most general terms, programming is the process of writing instructions for a computer. In this course we will be using Python as the language to communicate with the computer.

Strictly speaking, Python is an interpreted language, rather than a compiled language, meaning we are not communicating directly with the computer when we use Python. When we run Python code, our Python source code is first translated into byte code, which is then executed by the Python virtual machine.
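The byte code step can be observed from within Python itself. The following is a minimal sketch using the built-in compile() function and the standard-library dis module; the expression "1 + 1" and the variable names are chosen purely for illustration.

```python
import dis

# Translate a piece of source code into a code object (byte code).
code = compile("1 + 1", "<example>", "eval")

# The raw byte code is stored as a bytes object on the code object.
print(type(code.co_code))

# dis prints a human-readable listing of the byte code instructions;
# the exact instructions vary between Python versions.
dis.dis(code)

# The Python virtual machine executes the byte code to produce the result.
result = eval(code)
print(result)
```

You will rarely need to look at byte code in day-to-day work; it is shown here only to make the two-step process visible.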

Programming is a wide topic including a variety of techniques and tools. In this course we’ll be focusing on programming for statistical analysis.

IDEs

IDE stands for Integrated Development Environment. IDEs are where you will write, edit, and debug Python scripts, so you want to choose one that makes you feel comfortable and includes the functionality that you need. Some open-source IDEs for Python include JupyterLab and Visual Studio Code.

Packages

Packages, or libraries, are extensions to a programming language. They contain code, data, and documentation in a standardised collection format that can be installed by users, typically via a centralised software repository. A typical Python workflow will use base Python (the core operations and functions provided by your Python installation) as well as specialised data analysis and scientific packages like NumPy, SciPy and pandas.
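As a rough illustration of the difference between base Python and imported functionality, the sketch below calculates a mean twice. It uses the statistics module, which ships with Python as part of the standard library, as a stand-in; installed packages such as NumPy are imported in the same way. The values are made up.

```python
import statistics  # part of the standard library, shipped with Python

values = [1, 2, 3, 4, 5]  # illustrative data

# Base Python: built-in functions only.
mean_base = sum(values) / len(values)

# Using an imported module: the same calculation via a ready-made function.
mean_module = statistics.mean(values)

print(mean_base, mean_module)
```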

Best Practices

Let’s review some basic concepts that any programmer should keep in mind.

Documentation

Have you ever returned to a task and tried to read a note that you quickly scrawled for yourself the last time you were working on it? Have you ever inherited a project from a colleague and found you have no idea what remains to be done?

It can be very challenging to return to your own work or a colleague’s and this goes doubly for programming. Documentation is one way we can reduce the burden on future selves and our colleagues.

Inline Documentation

For a new programmer, inline documentation can be the most helpful kind. Inline documentation refers to writing comments on the same line as your code. For example, if we wrote a line of code to sum 1+1, we might document it as follows:

PYTHON

1+1         # adding the numbers 1 and 1 together.

Although this is a very simple line of code and it might seem like overkill to document it in this way, these types of comments can be very helpful in jogging your memory when returning to a project. Inline comments can also help you to break multi-step programs into digestible and readable pieces.
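To illustrate how inline comments can break a multi-step program into readable pieces, here is a small sketch. The variable names and values are invented for this example.

```python
# Illustrative values only.
weights_kg = [60.3, 65.0, 100.0]       # patient weights in kilograms

total_kg = sum(weights_kg)             # step 1: add up all the weights
count = len(weights_kg)                # step 2: count the patients
mean_kg = total_kg / count             # step 3: divide to get the mean

print(round(mean_kg, 1))               # display the mean, rounded to 1 decimal place
```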

External Documentation

Sometimes you require more detail than you can comfortably fit in your inline documentation. In this case it can be helpful to create separate files to document your project. This type of documentation will typically focus on the goals, scope, and any special instructions relating to your project rather than the details of your code. The most common type of external documentation is a README file. It is best practice to create a basic README file for any project. A basic README should include:

a brief description of the project,
any special instructions for installation or use,
the authors and any references.

README files are just text files, and it is best practice to save your README file as a README.md markdown document. This file format is automatically recognised by code repositories like GitHub, so your README contents are displayed alongside your code repository.

DocStrings

In chapter 7: functions we’ll learn about documentation specific to functions known as DocStrings.

Getting Help

Later on, in chapter 10: Errors and Exceptions we will cover errors in more detail. However, before we get there it’s very likely you’ll need some assistance writing Python code.

Built-in Help

There is a help function built into base Python. You can use it to investigate built-in functions, data types, and more. For example, say we want to know more about the print() function in Python:

PYTHON

help(print)

OUTPUT

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
-- More  --
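The text that help() displays is built from the object’s documentation string, which can also be read directly from its __doc__ attribute. A minimal sketch, using the built-in len function as an arbitrary example:

```python
# help() builds its output from an object's documentation string,
# which is also available directly through the __doc__ attribute.
doc_text = len.__doc__
print(doc_text)
```

This can be handy when you only want the short description rather than the full paged help screen.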

Finding Resources online

Stack Overflow is a valuable resource for programmers of all levels. It can be daunting to post your own question! Fortunately, chances are someone else has already asked a similar question!

The Official Python Documentation is another great resource.

It can also be helpful to do a general search for a particular topic or error message. It’s very likely the first few results will be from Stack Overflow, followed by a few from official documentation, and then you may start seeing results from personal blogs or third parties. These third-party results can sometimes be valuable, but we should be cautious! Here are a few things to keep in mind when you are looking for online resources:

Don’t download or install anything unless you are certain of what it is and why you need it.
Don’t copy or run code unless you fully understand what it does.
Python is an open-source language; official documentation and resources will not be behind a paywall.
You may not find a resource or solution to fit your exact needs. Try to be flexible and adapt online solutions to fit your needs.

Key Points

Python is an interpreted language.
Code is commonly developed inside an integrated development environment.
A typical Python workflow uses base Python and additional Python packages developed for statistical programming purposes.
In-line and external documentation helps ensure that your code is readable.
You can find help through the built-in help function and external resources.

Content from Python Fundamentals

Last updated on 2024-07-11

Overview

Questions

What basic data types can I work with in Python?
How can I create a new variable in Python?
How do I use a function?
Can I change the value associated with a variable after I create it?

Objectives

Assign values to variables.

Variables

Any Python interpreter can be used as a calculator:

PYTHON

3 + 5 * 4

OUTPUT

23

This is great but not very interesting. To do anything useful with data, we need to assign its value to a variable. In Python, we can assign a value to a variable, using the equals sign =. For example, we can track the weight of a patient who weighs 60 kilograms by assigning the value 60 to a variable weight_kg:

PYTHON

weight_kg = 60

From now on, whenever we use weight_kg, Python will substitute the value we assigned to it. In layperson’s terms, a variable is a name for a value.

In Python, variable names:

can include letters, digits, and underscores
cannot start with a digit
are case sensitive.

This means that, for example:

weight0 is a valid variable name, whereas 0weight is not
weight and Weight are different variables
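These naming rules can be checked from Python itself. The sketch below uses the string method isidentifier(), which reports whether a piece of text follows the rules above; the candidate names are taken from the examples in this section. Note that reserved words such as for pass this check but still cannot be used as variable names.

```python
# str.isidentifier() reports whether a string follows Python's naming rules.
candidates = ["weight0", "0weight", "weight_kg", "Weight"]

for name in candidates:
    print(name, "->", name.isidentifier())

# Names are case sensitive, so these two strings are not equal.
same_name = "weight" == "Weight"
print(same_name)
```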

Types of data

Python knows various types of data. Three common ones are:

integer numbers
floating point numbers, and
strings.

In the example above, variable weight_kg has an integer value of 60. If we want to more precisely track the weight of our patient, we can use a floating point value by executing:

PYTHON

weight_kg = 60.3

To create a string, we add single or double quotes around some text. To identify and track a patient throughout our study, we can assign each person a unique identifier by storing it in a string:

PYTHON

patient_id = '001'

Using Variables in Python

Once we have data stored with variable names, we can make use of it in calculations. We may want to store our patient’s weight in pounds as well as kilograms:

PYTHON

weight_lb = 2.2 * weight_kg

We might decide to add a prefix to our patient identifier:

PYTHON

patient_id = 'inflam_' + patient_id

Built-in Python functions

To carry out common tasks with data and variables in Python, the language provides us with several built-in functions. To display information to the screen, we use the print function:

PYTHON

print(weight_lb)
print(patient_id)

OUTPUT

132.66
inflam_001

When we want to make use of a function, referred to as calling the function, we follow its name by parentheses. The parentheses are important: if you leave them off, the function doesn’t actually run! Sometimes you will include values or variables inside the parentheses for the function to use. In the case of print, we use the parentheses to tell the function what value we want to display. We will learn more about how functions work and how to create our own in later episodes.
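The difference between naming a function and calling it can be seen in a short sketch. It uses the built-in round function as an arbitrary example; the variable names are invented for illustration.

```python
weight_kg = 60.3  # illustrative value

# Without parentheses we only refer to the function; nothing is run.
not_called = round
print(not_called)

# With parentheses the function is called and returns a result.
called = round(weight_kg)
print(called)
```

The first print displays a description of the function object itself, while the second displays the rounded number.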

We can display multiple things at once using only one print call:

PYTHON

print(patient_id, 'weight in kilograms:', weight_kg)

OUTPUT

inflam_001 weight in kilograms: 60.3

We can also call a function inside of another function call. For example, Python has a built-in function called type that tells you a value’s data type:

PYTHON

print(type(60.3))
print(type(patient_id))

OUTPUT

<class 'float'>
<class 'str'>

Moreover, we can do arithmetic with variables right inside the print function:

PYTHON

print('weight in pounds:', 2.2 * weight_kg)

OUTPUT

weight in pounds: 132.66

The above command, however, did not change the value of weight_kg:

PYTHON

print(weight_kg)

OUTPUT

60.3

To change the value of the weight_kg variable, we have to assign weight_kg a new value using the equals = sign:

PYTHON

weight_kg = 65.0
print('weight in kilograms is now:', weight_kg)

OUTPUT

weight in kilograms is now: 65.0

Variables as Sticky Notes

A variable in Python is analogous to a sticky note with a name written on it: assigning a value to a variable is like putting that sticky note on a particular value.

Value of 65.0 with weight_kg label stuck on it

Using this analogy, we can investigate how assigning a value to one variable does not change values of other, seemingly related, variables. For example, let’s store the subject’s weight in pounds in its own variable:

PYTHON

# There are 2.2 pounds per kilogram
weight_lb = 2.2 * weight_kg
print('weight in kilograms:', weight_kg, 'and in pounds:', weight_lb)

OUTPUT

weight in kilograms: 65.0 and in pounds: 143.0

Everything in a line of code following the ‘#’ symbol is a comment that is ignored by Python. Comments allow programmers to leave explanatory notes for other programmers or their future selves.

Value of 65.0 with weight_kg label stuck on it, and value of 143.0 with weight_lb label stuck on it

Similar to above, the expression 2.2 * weight_kg is evaluated to 143.0, and then this value is assigned to the variable weight_lb (i.e. the sticky note weight_lb is placed on 143.0). At this point, each variable is “stuck” to completely distinct and unrelated values.

Let’s now change weight_kg:

PYTHON

weight_kg = 100.0
print('weight in kilograms is now:', weight_kg, 'and weight in pounds is still:', weight_lb)

OUTPUT

weight in kilograms is now: 100.0 and weight in pounds is still: 143.0

Value of 100.0 with label weight_kg stuck on it, and value of 143.0 with label weight_lb stuck on it

Since weight_lb doesn’t “remember” where its value comes from, it is not updated when we change weight_kg.
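If we do want weight_lb to reflect the new weight, we have to run the calculation again. A minimal sketch, repeating the values used above (stale_lb is a name introduced here only to keep the old value for comparison):

```python
weight_kg = 65.0
weight_lb = 2.2 * weight_kg      # calculated once from the current weight_kg

weight_kg = 100.0                # weight_lb is not affected by this change
stale_lb = weight_lb

weight_lb = 2.2 * weight_kg      # re-run the calculation to update weight_lb
print('weight in pounds is now:', weight_lb)
```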

Check Your Understanding

What values do the variables mass and age have after each of the following statements? Test your answer by executing the lines.

PYTHON

mass = 47.5
age = 122
mass = mass * 2.0
age = age - 20

Show me the solution

OUTPUT

`mass` holds a value of 47.5, `age` does not exist
`mass` still holds a value of 47.5, `age` holds a value of 122
`mass` now has a value of 95.0, `age`'s value is still 122
`mass` still has a value of 95.0, `age` now holds 102

Sorting Out References

Python allows you to assign multiple values to multiple variables in one line by separating the variables and values with commas. What does the following program print out?

PYTHON

first, second = 'Grace', 'Hopper'
third, fourth = second, first
print(third, fourth)

Show me the solution

OUTPUT

Hopper Grace

Seeing Data Types

What are the data types of the following variables?

PYTHON

planet = 'Earth'
apples = 5
distance = 10.5

Show me the solution

PYTHON

print(type(planet))
print(type(apples))
print(type(distance))

OUTPUT

<class 'str'>
<class 'int'>
<class 'float'>

Key Points

Basic data types in Python include integers, strings, and floating-point numbers.
Use variable = value to assign a value to a variable in order to record it in memory.
Variables are created on demand whenever a value is assigned to them.
Use print(something) to display the value of something.
Use # some kind of explanation to add comments to programs.
Built-in functions are always available to use.

Content from Data Transformation

Last updated on 2024-07-11

Overview

Questions

How can I process tabular data files in Python?

Objectives

Explain what a library is and what libraries are used for.
Import a Python library and use the functions it contains.
Read tabular data from a file into a program.
Select individual values and subsections from data.
Perform operations on arrays of data.

Words are useful, but what’s more useful are the sentences and stories we build with them. Similarly, while a lot of powerful, general tools are built into Python, specialized tools built up from these basic units live in libraries that can be called upon when needed.

Loading data into Python

To begin processing the clinical trial inflammation data, we need to load it into Python. Python can work with many different file types. Text files can be loaded into Python by using the base Python function

PYTHON

open("filename.txt", "r")

where “r” opens the file for reading only; if you want to write to the file, you can use “w” instead.
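A minimal sketch of both modes follows. To stay self-contained it first writes a small file into a temporary folder and then reads it back; the file contents and variable names are made up for illustration. Be aware that “w” replaces any existing contents of the file.

```python
import os
import tempfile

# Create a small text file in a temporary folder so the example is self-contained.
# The contents are made up for illustration.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "filename.txt")

# "w" opens the file for writing; the with block closes it automatically.
with open(path, "w") as file:
    file.write("first line\nsecond line\n")

# "r" opens the file for reading only.
with open(path, "r") as file:
    lines = file.read().splitlines()

print(lines)
```

Using with is a common choice because the file is closed for us even if an error occurs part-way through.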

However, our patient data is in a .csv file, which is more commonly loaded by using a library. Python has hundreds of thousands of libraries to choose from to help carry out your work. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs, so we only import what we need for each program. There are a couple of common Python libraries for loading and working with data.

pandas

The first library we will present is called pandas. pandas is a Python library containing a set of functions and specialised data structures that have been designed to help Python programmers to perform data analysis tasks in a structured way.

Most of the things that pandas can do can be done with basic Python, but the collected set of pandas functions and data structures makes data analysis tasks more consistent in terms of syntax and therefore aids readability.

Remember to write the library name with a lower case ‘p’: Python is case sensitive, and so is the name of the package.

Importing the pandas library

Importing the pandas library is done in exactly the same way as for any other library. In almost all examples of Python code using the pandas library, it will have been imported and given an alias of pd. We will follow the same convention.

PYTHON

import pandas as pd

Pandas data structures

There are two main data structures used by pandas: the Series and the DataFrame. The Series equates in general to a vector or a list. The DataFrame is equivalent to a table. Each column in a pandas DataFrame is a pandas Series data structure.

We will mainly be looking at the DataFrame.

We can easily create a pandas DataFrame by reading a .csv file.

Reading a csv file

To read a csv dataset in base Python, we open the dataset, read and process one record at a time, and then close the dataset after we have read the last record. Reading datasets in this way is slow and places all of the responsibility for extracting individual data items from the records on the programmer.

The main advantage of this approach, however, is that you only have to store one dataset record in memory at a time. This means that if you have the time, you can process datasets of any size.
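The record-at-a-time approach can be sketched with the csv module from the standard library. This is only an illustration: the in-memory text below stands in for a real file, and the values and variable names are made up.

```python
import csv
import io

# A tiny in-memory stand-in for a csv file (made-up values).
csv_text = "0,0,1,3\n0,1,2,1\n0,1,1,3\n"

row_totals = []
with io.StringIO(csv_text) as dataset:
    for record in csv.reader(dataset):             # one record at a time
        values = [float(item) for item in record]  # convert text to numbers ourselves
        row_totals.append(sum(values))

print(row_totals)
```

Only the current record and the running list of totals are held in memory, which is what allows this style to cope with very large files.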

In pandas, csv files are read as complete datasets. You do not have to explicitly open and close the dataset. All of the dataset records are assembled into a DataFrame. If your dataset has column headers in the first record then these can be used as the DataFrame column names. You can explicitly state this in the parameters to the call, but pandas is usually able to infer that there is a header row and use it automatically.

To tell Python that we’d like to start using pandas, we need to import it:

PYTHON

import pandas as pd

Often, libraries are given an alias or a short form name; in this case pandas is given the alias “pd”. Aliases for common data analysis libraries include:

PYTHON

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Once we’ve imported the library, we can ask the library to read our data file for us:

PYTHON

pd.read_csv("filename.csv")

pandas is a commonly used library for working with and analysing data. However, we will be working with a different package for the remainder of this course. If you would like to learn more about data manipulation and analysis using pandas, we recommend checking out Data Analysis and Visualization with Python for Social Scientists.

numpy

The second package that we will present is called NumPy, which stands for Numerical Python. In general, you should use this library when you want to do fancy things with lots of numbers, especially if you have matrices or arrays. NumPy arrays are typically lighter weight with better performance, particularly when working with large datasets.

We will be using this package to work with our clinical trial inflammation data.

To tell Python that we’d like to start using NumPy, we need to import it:

PYTHON

import numpy as np

Now that we have imported the library, we can ask the library (by using the alias np) to read our data file for us:

PYTHON

np.loadtxt(fname='inflammation-01.csv', delimiter=',')

OUTPUT

array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ...,
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])

The expression np.loadtxt(...) is a function call that asks Python to run the function loadtxt, which belongs to the np library. Dot notation in Python is mostly used to access an attribute (property) of an object or to call one of its methods: object_name.property gives the value of that property, and object_name.method() calls that method on object_name.

As an example, John Smith is the John that belongs to the Smith family. We could use the dot notation to write his name smith.john, just as loadtxt is a function that belongs to the np library.
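The same dot notation appears throughout Python. As a small sketch, the standard-library math module and a string object are used here as arbitrary examples:

```python
import math

# library.function(): sqrt is a function that belongs to the math library.
root = math.sqrt(16)

# library.attribute: pi is a value stored in the math library.
circle_constant = math.pi

# object.method(): upper is a method that belongs to the string object.
shout = "smith".upper()

print(root, circle_constant, shout)
```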

We pass two arguments to np.loadtxt: the name of the file we want to read and the delimiter that separates values on a line. These both need to be character strings (or strings for short), so we put them in quotes.

Since we haven’t told it to do anything else with the function’s output, the notebook displays it. In this case, that output is the data we just loaded. By default, only a few rows and columns are shown (with ... to omit elements when displaying big arrays). Note that, to save space when displaying NumPy arrays, Python does not show us trailing zeros, so 1.0 becomes 1..

Our call to np.loadtxt read our file but didn’t save the data in memory. To do that, we need to assign the array to a variable. In a similar manner to how we assign a single value to a variable, we can also assign an array of values to a variable using the same syntax. Let’s re-run np.loadtxt and save the returned data:

PYTHON

data = np.loadtxt(fname='inflammation-01.csv', delimiter=',')

This statement doesn’t produce any output because we’ve assigned the output to the variable data. If we want to check that the data have been loaded, we can print the variable’s value:

PYTHON

print(data)

OUTPUT

[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ...,
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]

Now that the data are in memory, we can manipulate them. First, let’s ask what type of thing data refers to:

PYTHON

print(type(data))

OUTPUT

<class 'numpy.ndarray'>

The output tells us that data currently refers to an N-dimensional array, the functionality for which is provided by the NumPy library. These data correspond to arthritis patients’ inflammation. The rows are the individual patients, and the columns are their daily inflammation measurements.

Data Type

A NumPy array contains one or more elements of the same type. The type function will only tell you that a variable is a NumPy array but won’t tell you the type of thing inside the array. We can find out the type of the data contained in the NumPy array.

PYTHON

print(data.dtype)

OUTPUT

float64

This tells us that the NumPy array’s elements are floating-point numbers.

With the following command, we can see the array’s shape:

PYTHON

print(data.shape)

OUTPUT

(60, 40)

The output tells us that the data array variable contains 60 rows and 40 columns. When we created the variable data to store our arthritis data, we did not only create the array; we also created information about the array, called members or attributes. This extra information describes data in the same way an adjective describes a noun. data.shape is an attribute of data which describes the dimensions of data. We use the same dotted notation for the attributes of variables that we use for the functions in libraries because they have the same part-and-whole relationship.
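If you do not have the inflammation file to hand, the same attributes can be explored on a small stand-in. The sketch below assumes a tiny in-memory csv (two rows, three columns, made-up values) passed to np.loadtxt in place of a file name; the name sample is chosen for illustration.

```python
import io
import numpy as np

# A small in-memory stand-in for a csv file: 2 patients, 3 days (made-up values).
csv_text = "0,1,2\n3,4,5\n"
sample = np.loadtxt(io.StringIO(csv_text), delimiter=',')

print(type(sample))    # the kind of object
print(sample.dtype)    # the type of the elements inside it
print(sample.shape)    # (number of rows, number of columns)
```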

If we want to get a single number from the array, we must provide an index in square brackets after the variable name, just as we do in math when referring to an element of a matrix. Our inflammation data has two dimensions, so we will need to use two indices to refer to one specific value:

PYTHON

print('first value in data:', data[0, 0])

OUTPUT

first value in data: 0.0

PYTHON

print('middle value in data:', data[29, 19])

OUTPUT

middle value in data: 16.0

The expression data[29, 19] accesses the element at row 30, column 20. While this expression may not surprise you, data[0, 0] might. Programming languages like Fortran, MATLAB and R start counting at 1 because that’s what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because it represents an offset from the first value in the array (the second value is offset by one index from the first value). This is closer to the way that computers represent arrays (if you are interested in the historical reasons behind counting indices from zero, you can read Mike Hoye’s blog post). As a result, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second. It takes a bit of getting used to, but one way to remember the rule is that the index is how many steps we have to take from the start to get the item we want.

'data' is a 3 by 3 numpy array containing row 0: ['A', 'B', 'C'], row 1: ['D', 'E', 'F'], and row 2: ['G', 'H', 'I']. Starting in the upper left hand corner, data[0, 0] = 'A', data[0, 1] = 'B', data[0, 2] = 'C', data[1, 0] = 'D', data[1, 1] = 'E', data[1, 2] = 'F', data[2, 0] = 'G', data[2, 1] = 'H', and data[2, 2] = 'I', in the bottom right hand corner.
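The 3 by 3 grid of letters described above can be rebuilt in code to try out zero-based indexing. In this sketch the array is called grid, a name chosen here so that it does not overwrite the inflammation data.

```python
import numpy as np

# The 3 by 3 array of letters described above.
grid = np.array([['A', 'B', 'C'],
                 ['D', 'E', 'F'],
                 ['G', 'H', 'I']])

rows, columns = grid.shape
print(grid[0, 0])                    # upper left corner
print(grid[1, 2])                    # row index 1, column index 2
print(grid[rows - 1, columns - 1])   # bottom right corner: indices M-1 and N-1
```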

In the Corner

What may also surprise you is that when Python displays an array, it shows the element with index [0, 0] in the upper left corner rather than the lower left. This is consistent with the way mathematicians draw matrices but different from the Cartesian coordinates. The indices are (row, column) instead of (column, row) for the same reason, which can be confusing when plotting data.

Slicing data

An index like [30, 20] selects a single element of an array, but we can select whole sections as well. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:

PYTHON

print(data[0:4, 0:10])

OUTPUT

[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.]
 [ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6.]
 [ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.]
 [ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7.]]

The slice 0:4 means, “Start at index 0 and go up to, but not including, index 4”. Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.

We don’t have to start slices at 0:

PYTHON

print(data[5:10, 0:10])

OUTPUT

[[ 0.  0.  1.  2.  2.  4.  2.  1.  6.  4.]
 [ 0.  0.  2.  2.  4.  2.  2.  5.  5.  8.]
 [ 0.  0.  1.  2.  3.  1.  2.  3.  5.  3.]
 [ 0.  0.  0.  3.  1.  5.  6.  5.  5.  8.]
 [ 0.  1.  1.  2.  1.  3.  5.  3.  5.  8.]]

We also don’t have to include the upper and lower bound on the slice. If we don’t include the lower bound, Python uses 0 by default; if we don’t include the upper, the slice runs to the end of the axis, and if we don’t include either (i.e., if we use ‘:’ on its own), the slice includes everything:

PYTHON

small = data[:3, 36:]
print('small is:')
print(small)

The above example selects rows 0 through 2 and columns 36 through to the end of the array.

OUTPUT

small is:
[[ 2.  3.  0.  0.]
 [ 1.  1.  0.  1.]
 [ 2.  2.  1.  1.]]
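The slicing rules can also be checked on a small array where the expected answer is easy to see. This is a sketch on a made-up 3 by 4 array; table, middle and last_column are names chosen for illustration.

```python
import numpy as np

# A made-up 3 by 4 array holding the numbers 0 to 11.
table = np.arange(12).reshape(3, 4)

middle = table[0:2, 1:3]    # rows 0-1 and columns 1-2
print(middle)
print(middle.shape)         # (2 - 0, 3 - 1)

last_column = table[:, 3:]  # every row, column 3 to the end
print(last_column.shape)
```

The shape of each slice matches the difference between its upper and lower bounds on each axis.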

Content from List and Dictionary Methods

Last updated on 2024-07-11

Overview

Questions

How can I store many values together?
How can I create a list succinctly?
How can I efficiently access nested data?

Objectives

Identify and create lists and dictionaries
Understand the properties and behaviours of lists and dictionaries
Access values in lists and dictionaries
Create and access values from nested lists and dictionaries

Values can also be stored in other Python data types such as lists, dictionaries, sets and tuples. Storing objects in a list is a fast and versatile way to apply transformations across a sequence of values. Storing objects in a dictionary as key-value pairs is useful for extracting specific values i.e. performing lookup operations.

Create and access lists

Lists have the following properties and behaviours:

A single list can store different primitive object types and even other lists
Lists are ordered and have a 0-based index
Lists can be appended to using the methods append() or insert()
Values inside a list can be removed using the methods remove() or pop()
Two lists can be concatenated with the operator +
Values inside a list can be conditionally iterated through
A list is mutable i.e. the values inside a list can be modified in place
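Several of the behaviours listed above can be sketched in a few lines. The list name and values are made up for illustration.

```python
basket = ['apple', 'pear']     # made-up example values

basket.append('plum')          # add to the end
basket.insert(0, 'fig')        # add at index 0, shifting the rest along
print(basket)

basket.remove('pear')          # remove the first value equal to 'pear'
last = basket.pop()            # remove and return the last value
print(basket, last)

combined = basket + [1, 2.5]   # + builds a new list; types can be mixed
print(combined)
```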

To create a list, values are contained within square brackets i.e. [] and individually separated by commas. The function list() can also be used to create a list of values from an iterable object like a string, set or tuple.

PYTHON

# Create a list of integers using []
list_1 = [1, 3, 5, 7]
print(list_1)

OUTPUT

[1, 3, 5, 7]

PYTHON

# Unlike atomic vectors in R, a list can contain multiple primitive object types
list_2 = [1, "one", 1.0, True]
print(list_2)

OUTPUT

[1, 'one', 1.0, True]

PYTHON

# You can also use list() on an iterable object to convert it into a list
string = 'abcdefg'  
list_3 = list(string)  
print(list_3)

OUTPUT

['a', 'b', 'c', 'd', 'e', 'f', 'g']

Because lists have a 0-based index, we can access individual values by their list index position. For 0-based indexes, the first value always starts at position 0 i.e. the first element has an index of 0. Accessing multiple values by their index positions is also referred to as slicing or subsetting a list.

Note that we can use negative numbers as indices in Python. When we do so, the index -1 gives us the last element in the list, -2 gives us the second to last element in the list, and so on.

PYTHON

# Extract individual values from list_3
print('first value:', list_3[0])
print('second value:', list_3[1])
print('last value:', list_3[-1])

OUTPUT

first value: a
second value: b
last value: g
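Negative indices are a shorthand for counting back from the length of the list. A minimal sketch (letters and last_position are names introduced here for illustration):

```python
letters = list('abcdefg')

# A negative index counts back from the end of the list,
# so index -1 is the same position as index len(letters) - 1.
last_position = len(letters) - 1
print(letters[-1], letters[last_position])
print(letters[-2], letters[last_position - 1])
```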

PYTHON

# A slice stops just before its end index, so add 1 to the last index you want
# To extract from index 0 to 2, we need to slice from [0:2+1] or [0:3]

# Extract the first three values from list_3
print('first 3 values:', list_3[0:3])

# Start from index 0 and extract values from each subsequent second position
print('every second value:', list_3[0::2])

# Start from index 1, end at index 3 and extract from each subsequent second position
print('every second value from index 1 to 3:', list_3[1:4:2])

OUTPUT

first 3 values: ['a', 'b', 'c']
every second value: ['a', 'c', 'e', 'g']
every second value from index 1 to 3: ['b', 'd']

Change list values

Data which can be modified in place is called mutable, while data which cannot be modified is called immutable. Strings and numbers are immutable in that when we want to change the value of a string or number variable, we can only replace the old value with a completely new value.

PYTHON

string = 'abcde'
string[0] = 'b' # Produces a type error as strings are immutable

# TypeError: 'str' object does not support item assignment
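To "change" a string we therefore build a new one and assign it to the same variable name. A minimal sketch of one way to do this, using slicing and +:

```python
string = 'abcde'

# Build a completely new string and assign it to the same variable name.
string = 'b' + string[1:]
print(string)
```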

In contrast, lists are mutable and we can modify them after they have been created. We can change individual values, append new values, or reorder the whole list through sorting.

PYTHON

list_4 = ['apple', 'pear', 'plum']
print('original list_4:', list_4)

# Change the first value i.e. modify the list in place
list_4[0] = 'banana'
print('modified list_4:', list_4)

# Add new value to list using the method .insert(index number, value)
list_4.insert(1, 'apple') # Index 1 refers to the second position
print('appended list_4:', list_4)

OUTPUT

original list_4: ['apple', 'pear', 'plum']
modified list_4: ['banana', 'pear', 'plum']
appended list_4: ['banana', 'apple', 'pear', 'plum']

PYTHON

# Sorting a list also modifies it in place
list_5 = [2, 1, 3, 7]
list_5.sort()
print('list_5:', list_5)

OUTPUT

list_5: [1, 2, 3, 7]

However, be careful when modifying data in-place. If two variables refer to the same list, and you modify the list value, it will change for both variables!

PYTHON

# When we assign list_6 to list_5, it means both list_6 and list_5 point to the
# same list object, not that list_6 is a copy of list_5.  

list_6 = list_5  
print('list_5:', list_5)
print('list_6:', list_6)

# Change the first value in list_6 from 1 to 2 
list_6[0] = 2 

print('modified list_6:', list_6)
print('unmodified list_5:', list_5)

# Warning: list_5 and list_6 have both been modified in place!

OUTPUT

list_5: [1, 2, 3, 7]
list_6: [1, 2, 3, 7]
modified list_6: [2, 2, 3, 7]
unmodified list_5: [2, 2, 3, 7]

Because of this behaviour, code which modifies data in place should be handled with care. You can avoid this behaviour by explicitly creating a copy of the original list and modifying only the copy.

PYTHON

list_5 = [1, 2, 3, 7]
list_7 = list_5.copy()  
print('list_5:', list_5)
print('list_7:', list_7)

# As list_7 is a completely new object copied from list_5, modifying list_7 does
# not affect list_5.  

list_7[0] = 2 
print('modified list_7:', list_7)
print('unmodified list_5:', list_5)

OUTPUT

list_5: [1, 2, 3, 7]
list_7: [1, 2, 3, 7]
modified list_7: [2, 2, 3, 7]
unmodified list_5: [1, 2, 3, 7]
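
When it is unclear whether two variables refer to the same list or to separate copies, one way to check is the is operator, which tests whether two names point to the same object. The sketch below uses hypothetical variable names rather than the lists from the lesson.

```python
original = [1, 2, 3, 7]
alias = original              # a second name for the same list object
duplicate = original.copy()   # a new list object holding the same values

print(alias is original)      # True: both names point to one object
print(duplicate is original)  # False: the copy is a separate object
print(duplicate == original)  # True: the values are still equal
```

Note that == compares the values inside the lists, whereas is compares the objects themselves.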

Useful list functions

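There are a lot of functions and methods which can be applied to lists, such as len(), max(), index() and so forth. Most arithmetic operators (such as - and /) do not work on lists. The operators + and * are exceptions, but they join or repeat lists rather than performing arithmetic on the values.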

Note that + concatenates two lists into a single longer list, rather than outputting the sum of two lists of numbers.

PYTHON

list_8 = [1, 2, 3]
list_9 = [4, 5, 6]

list_8 + list_9 # This concatenates the lists and does not sum the two lists together

OUTPUT

[1, 2, 3, 4, 5, 6]
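
If an element-wise sum is what you actually need, one option in base Python is to pair up the values with zip() inside a list comprehension. This is only a sketch with hypothetical variable names, and it uses a list comprehension, which has not been introduced in this lesson. Packages such as NumPy provide element-wise arithmetic directly.

```python
first = [1, 2, 3]
second = [4, 5, 6]

# Pair up the values by position and add each pair
elementwise_sum = [a + b for a, b in zip(first, second)]

print(elementwise_sum)
```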

In your spare time after this workshop, you can search for different list functions and methods and test them out yourselves.

Nested lists

We have previously mentioned that lists can be used to store other Python object types, including lists. This means that we can create nested lists in Python i.e. lists containing lists containing values. This property is useful when we have a collection of values that we want to access or transform as a subgroup.

To create a nested list, we also use [] or list() to contain one or more lists of values of interest.

PYTHON

veg_stock = [
    ['lettuce', 'lettuce', 'tomato', 'zucchini'],
    ['lettuce', 'lettuce', 'carrot', 'zucchini'],
    ['lettuce', 'basil', 'tomato', 'zucchini']
    ]

# Check that veg_stock is a list object
print(type(veg_stock))

# Check that the first value in veg_stock is itself a list
print(veg_stock[0], 'has type', type(veg_stock[0]))

OUTPUT

<class 'list'>
['lettuce', 'lettuce', 'tomato', 'zucchini'] has type <class 'list'>

To extract a sub-list from the veg_stock list object, we refer to its index like we would with any other value inside a list, e.g. veg_stock[0] points to the first sub-list and veg_stock[1] points to the second.

To access an individual string value inside a sub-list, we make use of a second index, which points to an individual value inside the sub-list.

PYTHON

print(veg_stock[0]) # Access the first sub-list 
print(veg_stock[0][0]) # Access the first value in the first sub-list 

print(type(veg_stock[0])) # The first value in veg_stock is a list
print(type(veg_stock[0][0])) # The first value in the first list in veg_stock is a string

OUTPUT

['lettuce', 'lettuce', 'tomato', 'zucchini']
lettuce
<class 'list'>
<class 'str'>

In general, however, when we are analysing a large collection of values, the best practice is to structure those values in columns and rows as a tabular Pandas data frame object. This is covered in another Carpentries Course called Python for Social Sciences.

Lists are still incredibly versatile and useful when you have a collection of values that need to be efficiently accessed or transformed. For example, data frame column names are commonly extracted and stored inside a list, so that the same transformation can then be mapped across multiple columns.

Create and access dictionaries

A dictionary is a Python data type that is particularly suited for enabling quick lookup operations on unstructured data sets.

A dictionary can therefore be thought of as a collection where every item or value is associated with a unique key (i.e. a self-defined index of unique strings or numbers). Dictionaries keep their items in insertion order (Python 3.7 and later), but values are retrieved by key rather than by position. The index values are called keys and a dictionary contains key-value pairs with the format {key: value(s)}.

Dictionaries can be created by listing individual key-values pairs inside {} or using dict().

PYTHON

# A key-value pair can contain single or multiple values  
# Keys are treated as case sensitive and unique
# Multiple values are first stored inside a list  

teams = {
    'data science': ['Mei Ling', 'Paul', 'Gwen', 'Suresh'],
    'user design': ['Amy', 'Linh', 'Sasha'],
    'software dev': ['David', 'Prya'],
    'comms': 'Taylor' 
    }

When using dict(), we need to indicate which key is associated with which value. This can be done using a list of (key, value) tuples, by direct association using =, or with zip(), which pairs up the values of two lists by position to produce tuples.

PYTHON

# To use dict(), key-value pairs can be stored inside tuples  
ds_emp_status = dict([
        ('Mei Ling', 'full time'),
        ('Paul', 'full time'),
        ('Gwen', 'part time'),
        ('Suresh', 'part time')
    ])  

# Key-value pairs can also be assigned by direct association  
# Keys are written without '' in this approach and are stored as strings
ud_emp_status = dict(
    Amy = 'full time',
    Linh = 'full time',
    Sasha = 'casual' 
    ) 

# zip() pairs each key with the value in the same position of the second list  
sd_emp_status = dict(zip(
    ['David', 'Prya'],
    ['full time', 'full time']
    ))
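
As a quick check, the sketch below builds the same small dictionary in four different ways and compares the results. The names and values are hypothetical and chosen only for illustration.

```python
# Hypothetical example data, built four different ways
from_braces = {'Amy': 'full time', 'Linh': 'casual'}
from_tuples = dict([('Amy', 'full time'), ('Linh', 'casual')])
from_keywords = dict(Amy='full time', Linh='casual')
from_zip = dict(zip(['Amy', 'Linh'], ['full time', 'casual']))

# All approaches produce dictionaries with the same key-value pairs
print(from_braces == from_tuples == from_keywords == from_zip)
```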

To access a specific value inside a dictionary, we need to specify its key using []. This is similar to slicing or subsetting a list by specifying its index using [].

PYTHON

# Access the values associated with the key 'data science'
print(teams['data science'])

print('The object teams is of type', type(teams))
print('The dict value', teams['data science'], 'is of type', type(teams['data science']))

OUTPUT

['Mei Ling', 'Paul', 'Gwen', 'Suresh']
The object teams is of type <class 'dict'>
The dict value ['Mei Ling', 'Paul', 'Gwen', 'Suresh'] is of type <class 'list'>

We can also access a value from a dictionary using the get() method.

PYTHON

print(teams.get('user design'))

# get() also enables us to return an alternate string when the key is not found   
# This prevents our code from returning an error message that halts the analysis

print(teams.get('data engineering', 'WARNING: key does not exist'))

OUTPUT

['Amy', 'Linh', 'Sasha']
WARNING: key does not exist
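
For comparison, the sketch below shows what happens when a missing key is requested with [] rather than get(). It uses a shortened, illustrative copy of the teams dictionary and a try/except block, which is not covered in this lesson, so that the error can be displayed without stopping the code.

```python
# Shortened copy of the teams dictionary, for illustration only
teams = {'user design': ['Amy', 'Linh', 'Sasha']}

try:
    print(teams['data engineering'])  # this key is not in the dictionary
except KeyError as error:
    print('KeyError raised for key:', error)

# Without a second argument, get() returns None when the key is missing
print(teams.get('data engineering'))
```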

To access data inside a dictionary, we can also perform the following other actions:

Check whether a key exists in a dictionary using the keyword in
Retrieve unique dictionary keys using dict.keys()
Retrieve dictionary values using dict.values()
Retrieve dictionary items using dict.items()

PYTHON

# Check whether a key exists in a dictionary 
print('data science' in teams) 
print('Data Science' in teams) # Keys are case sensitive  

# Retrieve all dictionary keys  
print(teams.keys())
print(sd_emp_status.keys())

# Retrieve all dictionary values  
print(sd_emp_status.values())  

# Retrieve all dictionary key-value pairs
print(sd_emp_status.items())

OUTPUT

True
False
dict_keys(['data science', 'user design', 'software dev', 'comms'])
dict_keys(['David', 'Prya'])
dict_values(['full time', 'full time'])
dict_items([('David', 'full time'), ('Prya', 'full time')])

To add a new key-value pair to an existing dictionary, we can create a new key and directly attach a new value to it using = or alternatively use the method update().

PYTHON

print('original dict items:', sd_emp_status.items())  

# Add new key-value pair using direct assignment  
sd_emp_status['Mohammad'] = 'full time'

# Add new key-value pair using update({'key': 'value'})   
sd_emp_status.update({'Carrie': 'part time'})

print('updated dict items:', sd_emp_status.items())

OUTPUT

original dict items: dict_items([('David', 'full time'), ('Prya', 'full time')])
updated dict items: dict_items([('David', 'full time'), ('Prya', 'full time'),
('Mohammad', 'full time'), ('Carrie', 'part time')])

Because keys are unique, a dictionary cannot contain two keys with the same name. This means that adding an item using a key that is already present in the dictionary will cause the previous value to be overwritten.

PYTHON

print('original dict items:', sd_emp_status.items())  

# As the key 'Carrie' already exists, its value will be overwritten
sd_emp_status['Carrie'] = 'full time'
print('updated dict items:', sd_emp_status.items())

OUTPUT

original dict items: dict_items([('David', 'full time'), ('Prya', 'full time'),
('Mohammad', 'full time'), ('Carrie', 'part time')])
updated dict items: dict_items([('David', 'full time'), ('Prya', 'full time'),
('Mohammad', 'full time'), ('Carrie', 'full time')])

To remove a key-value pair from an existing dictionary, we can use the del keyword or the method pop(). Using pop() also enables us to return an alternate string if we try to remove a non-existing key, which prevents our code from returning an error message that halts the analysis.

PYTHON

print('original dict items:', sd_emp_status.items())

# Delete dictionary keys using del and pop()
del sd_emp_status['Mohammad']
sd_emp_status.pop('Carrie')
sd_emp_status.pop('Anuradha', 'WARNING: key does not exist') # Does not generate an error

print('modified dict items:', sd_emp_status.items())

OUTPUT

original dict items: dict_items([('David', 'full time'), ('Prya', 'full time'),
('Mohammad', 'full time'), ('Carrie', 'full time')])
modified dict items: dict_items([('David', 'full time'), ('Prya', 'full time')])
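
Unlike del, pop() also returns something: either the value that was removed or the alternate value supplied for a missing key. The sketch below stores both results in variables so that they can be inspected. The dictionary and variable names are hypothetical.

```python
# Illustrative data; not the dictionary used in the lesson
emp_status = {'David': 'full time', 'Carrie': 'part time'}

removed = emp_status.pop('Carrie')  # returns the value that was removed
missing = emp_status.pop('Anuradha', 'WARNING: key does not exist')

print('removed:', removed)
print('missing:', missing)
print('remaining:', emp_status)
```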

Nested dictionaries

Similar to lists, dictionaries can be nested as we can also store dictionaries as values inside a key-value pair using {}. Nested dictionaries are useful when we need to store unstructured data in a complex structure. For example, JSON data is commonly used for transmitting data in web applications and often exists in a nested structure that can be stored using nested dictionaries in Python.

PYTHON

# Individual dictionaries are enclosed in {} and separated by a comma
nested_dict = {
    'dict_1': { # Value of the first key is a dictionary of key-value pairs 
        'key_1a': 'value_1a',
        'key_1b': 'value_1b'
                },
    'dict_2': { # Value of the second key is another dictionary of key-value pairs
        'key_2a': 'value_2a',
        'key_2b': 'value_2b'
                }
            }

print(nested_dict)

OUTPUT

{'dict_1': {'key_1a': 'value_1a', 'key_1b': 'value_1b'},
 'dict_2': {'key_2a': 'value_2a', 'key_2b': 'value_2b'}}

Similar to working with nested lists, to extract a value from a sub-dictionary, we specify both the main dictionary and sub-dictionary keys using [].

PYTHON

# Extract the value for key 2a in dict_2
print('original value:', nested_dict['dict_2']['key_2a'])

# Adding or updating a value can be done through the same approach
nested_dict['dict_2']['key_2a'] = "modified_value_2a"  

print('modified value:', nested_dict['dict_2']['key_2a'])

OUTPUT

original value: value_2a
modified value: modified_value_2a
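
When it is uncertain whether a key exists at either level, one possible approach is to chain two get() calls, using an empty dictionary as the fallback for the outer lookup. This is a minimal sketch built on a shortened, illustrative version of nested_dict; the key 'dict_3' is deliberately absent.

```python
nested_dict = {
    'dict_1': {'key_1a': 'value_1a'},
    'dict_2': {'key_2a': 'value_2a'}
    }

# Fall back to an empty dictionary if the outer key is missing,
# then fall back to None if the inner key is missing
found = nested_dict.get('dict_2', {}).get('key_2a')
not_found = nested_dict.get('dict_3', {}).get('key_3a')

print(found)
print(not_found)
```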

Optional: converting lists and dictionaries to Pandas data frames

Lists and dictionaries can be easily converted into a tabular Pandas data frame format. This can be useful when you need to create a small data set for unit testing purposes.

PYTHON

# Import pandas library
import pandas as pd

# Create a dictionary with each key-value pair representing a data frame column
data = {
    'col_1': [3, 2, 1, 0],
    'col_2': ['a', 'b', 'c', 'd']
    }

df = pd.DataFrame.from_dict(data) 

print(df) # Outputs data as a tabular Pandas data frame   
print(type(df))

OUTPUT

   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
<class 'pandas.core.frame.DataFrame'>

Key Points

Lists can contain any Python object including other lists
Lists are ordered i.e. indexed and can therefore be sliced by index number
Unlike strings and integers, the values inside a list can be modified in place
A list which contains other lists is referred to as a nested list
Dictionaries store key-value pairs, and their values are accessed by key rather than by position
Dictionary keys are unique
A dictionary which contains other dictionaries is referred to as a nested dictionary
Values inside nested lists and dictionaries can be accessed by an additional index

Content from Loops and Conditional Logic

Overview

Questions

How can I do the same operations on many different values?
How can my programs do different things based on data values?

Objectives

identify and create loops
use logical statements to allow for decision-based operations in code

This episode contains two lessons:

Repeating Actions with Loops
Making Choices with Conditional Logic

Repeating Actions with Loops

In the episode about visualizing data, we will see Python code that plots values of interest from our first inflammation dataset (inflammation-01.csv), which revealed some suspicious features.

Line graphs showing average, maximum, and minimum inflammation across all patients over a 40-day period.

We have a dozen data sets right now and potentially more on the way if Dr. Maverick can keep up their surprisingly fast clinical trial rate. We want to create plots for all of our data sets with a single statement. To do that, we’ll have to teach the computer how to repeat things.

An example task that we might want to repeat is accessing numbers in a list, which we will do by printing each number on a line of its own.

PYTHON

odds = [1, 3, 5, 7]

In Python, a list is basically an ordered collection of elements, and every element has a unique number associated with it — its index. This means that we can access elements in a list using their indices. For example, we can get the first number in the list odds, by using odds[0]. One way to print each number is to use four print statements:

PYTHON

print(odds[0])
print(odds[1])
print(odds[2])
print(odds[3])

OUTPUT

1
3
5
7

This is a bad approach for three reasons:

Not scalable. Imagine you need to print a list that has hundreds of elements. It might be easier to type them in manually.
Difficult to maintain. If we want to decorate each printed element with an asterisk or any other character, we would have to change four lines of code. While this might not be a problem for small lists, it would definitely be a problem for longer ones.
Fragile. If we use it with a list that has more elements than what we initially envisioned, it will only display part of the list’s elements. A shorter list, on the other hand, will cause an error because it will be trying to display elements of the list that do not exist.

PYTHON

odds = [1, 3, 5]
print(odds[0])
print(odds[1])
print(odds[2])
print(odds[3])

OUTPUT

1
3
5

ERROR

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-7974b6cdaf14> in <module>()
      3 print(odds[1])
      4 print(odds[2])
----> 5 print(odds[3])

IndexError: list index out of range

Here’s a better approach: a for loop

PYTHON

odds = [1, 3, 5, 7]
for num in odds:
    print(num)

OUTPUT

1
3
5
7

This is shorter — certainly shorter than something that prints every number in a hundred-number list — and more robust as well:

PYTHON

odds = [1, 3, 5, 7, 9, 11]
for num in odds:
    print(num)

OUTPUT

1
3
5
7
9
11

The improved version uses a for loop to repeat an operation — in this case, printing — once for each thing in a sequence. The general form of a loop is:

PYTHON

for variable in collection:
    # do things using variable, such as print

Using the odds example above, the loop might look like this:

Loop variable 'num' being assigned the value of each element in the list odds in turn and then being printed

where each number (num) in the variable odds is looped through and printed one number after another. The other numbers in the diagram denote which loop cycle the number was printed in (1 being the first loop cycle, and 6 being the final loop cycle).

We can call the loop variable anything we like, but there must be a colon at the end of the line starting the loop, and we must indent anything we want to run inside the loop. Unlike many other languages, there is no command to signify the end of the loop body (e.g., end for); everything indented after the for statement belongs to the loop.

What’s in a name?

In the example above, the loop variable was given the name num as a mnemonic; it is short for ‘number’. We can choose any name we want for variables. We might just as easily have chosen the name banana for the loop variable, as long as we use the same name when we invoke the variable inside the loop:

PYTHON

odds = [1, 3, 5, 7, 9, 11]
for banana in odds:
    print(banana)

OUTPUT

1
3
5
7
9
11

It is a good idea to choose variable names that are meaningful, otherwise it would be more difficult to understand what the loop is doing.

Here’s another loop that repeatedly updates a variable:

PYTHON

length = 0
names = ['Curie', 'Darwin', 'Turing']
for value in names:
    length = length + 1
print('There are', length, 'names in the list.')

OUTPUT

There are 3 names in the list.

It’s worth tracing the execution of this little program step by step. Since there are three names in names, the statement on line 4 will be executed three times. The first time around, length is zero (the value assigned to it on line 1) and value is Curie. The statement adds 1 to the old value of length, producing 1, and updates length to refer to that new value. The next time around, value is Darwin and length is 1, so length is updated to be 2. After one more update, length is 3; since there is nothing left in names for Python to process, the loop finishes and the print function on line 5 tells us our final answer.

Note that a loop variable is a variable that is being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:

PYTHON

name = 'Rosalind'
for name in ['Curie', 'Darwin', 'Turing']:
    print(name)
print('after the loop, name is', name)

OUTPUT

Curie
Darwin
Turing
after the loop, name is Turing

Note also that finding the length of an object is such a common operation that Python actually has a built-in function to do it called len:

PYTHON

print(len([0, 1, 2, 3]))

OUTPUT

4

len is much faster than any function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other data types we haven’t seen yet, so we should always use it when we can.
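
As a brief illustration of len working on other data types, the sketch below applies it to a string, a tuple and a dictionary. The values are arbitrary examples.

```python
word_length = len('oxygen')           # number of characters in a string
tuple_length = len(('a', 'b'))        # number of values in a tuple
dict_length = len({'x': 1, 'y': 2})   # number of key-value pairs in a dictionary

print(word_length, tuple_length, dict_length)
```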

From 1 to N

Python has a built-in function called range that generates a sequence of numbers. range can accept 1, 2, or 3 parameters.

If one parameter is given, range generates a sequence of that length, starting at zero and incrementing by 1. For example, range(3) produces the numbers 0, 1, 2.
If two parameters are given, range starts at the first and ends just before the second, incrementing by one. For example, range(2, 5) produces 2, 3, 4.
If range is given 3 parameters, it starts at the first one, ends just before the second one, and increments by the third one. For example, range(3, 10, 2) produces 3, 5, 7, 9.
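
The three cases above can be checked with a short sketch. Because range does not display its numbers when printed directly, each call is wrapped in list() here; the variable names are illustrative.

```python
# Wrap each range in list() so that the numbers can be displayed
one_arg = list(range(3))
two_args = list(range(2, 5))
three_args = list(range(3, 10, 2))

print(one_arg)
print(two_args)
print(three_args)
```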

Write a loop that uses range to print the first 3 natural numbers:

OUTPUT

1
2
3

Show me the solution

PYTHON

for number in range(1, 4):
    print(number)

Understanding the loops

Given the following loop:

PYTHON

word = 'oxygen'
for letter in word:
    print(letter)

How many times is the body of the loop executed?

3 times
4 times
5 times
6 times

Show me the solution

The body of the loop is executed 6 times.

Computing Powers With Loops

Exponentiation is built into Python:

PYTHON

print(5 ** 3)

OUTPUT

125

Write a loop that calculates the same result as 5 ** 3 using multiplication (and without exponentiation).

Show me the solution

PYTHON

result = 1
for number in range(0, 3):
    result = result * 5
print(result)

Summing a List

Write a loop that calculates the sum of elements in a list by adding each element and printing the final value, so [124, 402, 36] prints 562.

Show me the solution

PYTHON

numbers = [124, 402, 36]
summed = 0
for num in numbers:
    summed = summed + num
print(summed)

Computing the Value of a Polynomial

The built-in function enumerate takes a sequence (e.g., a list) and generates a new sequence of the same length. Each element of the new sequence is a pair composed of the index (0, 1, 2,…) and the value from the original sequence:

PYTHON

for idx, val in enumerate(a_list):
    # Do something using idx and val

The code above loops through a_list, assigning the index to idx and the value to val.
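
To make the general form concrete, here is a small runnable sketch that uses enumerate on a list of names. The list and the variable labelled are hypothetical and are not needed for the challenge that follows.

```python
names = ['Curie', 'Darwin', 'Turing']

labelled = []
for idx, val in enumerate(names):
    # idx counts up from 0; val is the matching value from the list
    labelled.append(str(idx) + ': ' + val)

print(labelled)
```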

Suppose you have encoded a polynomial as a list of coefficients in the following way: the first element is the constant term, the second element is the coefficient of the linear term, the third is the coefficient of the quadratic term, etc.

PYTHON

x = 5
coefs = [2, 4, 3]
y = coefs[0] * x**0 + coefs[1] * x**1 + coefs[2] * x**2
print(y)

OUTPUT

97

Write a loop using enumerate(coefs) which computes the value y of any polynomial, given x and coefs.

Show me the solution

PYTHON

# x and coefs are defined as in the example above
y = 0
for idx, coef in enumerate(coefs):
    y = y + coef * x**idx
print(y)

Making Choices with Conditional Logic

How can we use Python to automatically recognize different situations we encounter with our data and take a different action for each? In this lesson, we’ll learn how to write code that runs only when certain conditions are true.

Conditionals

We can ask Python to take different actions, depending on a condition, with an if statement:

PYTHON

num = 37
if num > 100:
    print('greater')
else:
    print('not greater')
print('done')

OUTPUT

not greater
done

The second line of this code uses the keyword if to tell Python that we want to make a choice. If the test that follows the if statement is true, the body of the if (i.e., the set of lines indented underneath it) is executed, and “greater” is printed. If the test is false, the body of the else is executed instead, and “not greater” is printed. Only one or the other is ever executed before continuing on with program execution to print “done”:

A flowchart diagram of the if-else construct that tests if variable num is greater than 100

Conditional statements don’t have to include an else. If there isn’t one, Python simply does nothing if the test is false:

PYTHON

num = 53
print('before conditional...')
if num > 100:
    print(num, 'is greater than 100')
print('...after conditional')

OUTPUT

before conditional...
...after conditional

We can also chain several tests together using elif, which is short for “else if”. The following Python code uses elif to print the sign of a number.

PYTHON

num = -3

if num > 0:
    print(num, 'is positive')
elif num == 0:
    print(num, 'is zero')
else:
    print(num, 'is negative')

OUTPUT

-3 is negative

Note that to test for equality we use a double equals sign == rather than a single equals sign = which is used to assign values.

Comparing in Python

Along with the > and == operators we have already used for comparing values in our conditionals, there are a few more options to know about:

>: greater than
<: less than
==: equal to
!=: does not equal
>=: greater than or equal to
<=: less than or equal to

We can also combine tests using and and or. and is only true if both parts are true:

PYTHON

if (1 > 0) and (-1 >= 0):
    print('both parts are true')
else:
    print('at least one part is false')

OUTPUT

at least one part is false

while or is true if at least one part is true:

PYTHON

if (1 < 0) or (1 >= 0):
    print('at least one test is true')

OUTPUT

at least one test is true

`True` and `False`

True and False are special words in Python called booleans, which represent truth values. A statement such as 1 < 0 returns the value False, while -1 < 0 returns the value True.
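
Because booleans are ordinary values, the result of a comparison can be stored in a variable and used later in a conditional. The sketch below uses hypothetical variable names.

```python
is_negative = -1 < 0   # the comparison produces the boolean True
is_positive = -1 > 0   # the comparison produces the boolean False

print(is_negative, type(is_negative))

if is_negative:
    print('the stored test was true')
```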

Checking Our Data

Now that we’ve seen how conditionals work, we can use them to check for the suspicious features we saw in our inflammation data. We are about to use functions provided by the numpy module again. Therefore, if you’re working in a new Python session, make sure to load the module with:

PYTHON

import numpy

From the first couple of plots, we saw that maximum daily inflammation exhibits a strange behavior and rises by one unit a day. Wouldn’t it be a good idea to detect such behavior and report it as suspicious? Let’s do that! However, instead of checking every single day of the study, let’s merely check if maximum inflammation in the beginning (day 0) and in the middle (day 20) of the study are equal to the corresponding day numbers.

PYTHON

max_inflammation_0 = numpy.amax(data, axis=0)[0]
max_inflammation_20 = numpy.amax(data, axis=0)[20]

if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')

We also saw a different problem in the third dataset; the minima per day were all zero (looks like a healthy person snuck into our study). We can also check for this with an elif condition:

PYTHON

elif numpy.sum(numpy.amin(data, axis=0)) == 0:
    print('Minima add up to zero!')

And if neither of these conditions is true, we can use else to give the all-clear:

PYTHON

else:
    print('Seems OK!')

Let’s test that out:

PYTHON

data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')

max_inflammation_0 = numpy.amax(data, axis=0)[0]
max_inflammation_20 = numpy.amax(data, axis=0)[20]

if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
elif numpy.sum(numpy.amin(data, axis=0)) == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')

OUTPUT

Suspicious looking maxima!

PYTHON

data = numpy.loadtxt(fname='inflammation-03.csv', delimiter=',')

max_inflammation_0 = numpy.amax(data, axis=0)[0]
max_inflammation_20 = numpy.amax(data, axis=0)[20]

if max_inflammation_0 == 0 and max_inflammation_20 == 20:
    print('Suspicious looking maxima!')
elif numpy.sum(numpy.amin(data, axis=0)) == 0:
    print('Minima add up to zero!')
else:
    print('Seems OK!')

OUTPUT

Minima add up to zero!

In this way, we have asked Python to do something different depending on the condition of our data. Here we printed messages in all cases, but we could also imagine not using the else catch-all so that messages are only printed when something is wrong, freeing us from having to manually examine every plot for features we’ve seen before.

How Many Paths?

Consider this code:

PYTHON

if 4 > 5:
    print('A')
elif 4 == 5:
    print('B')
elif 4 < 5:
    print('C')

Which of the following would be printed if you were to run this code? Why did you pick this answer?

A
B
C
B and C

Show me the solution

C gets printed because the first two conditions, 4 > 5 and 4 == 5, are not true, but 4 < 5 is true. In this case, only one of these conditions can be true at a time, but in other scenarios multiple elif conditions could be met. In these scenarios, only the action associated with the first true elif condition will occur, starting from the top of the conditional section.

A flowchart diagram of a conditional section with multiple elif conditions and some possible outcomes.

This contrasts with the case of multiple if statements, where every action can occur as long as their condition is met.

A flowchart diagram of a conditional section with multiple if statements and some possible outcomes.
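
The difference between an elif chain and separate if statements can be sketched in code as well. The example below uses a hypothetical value of num that satisfies both tests and records which branches run in two illustrative lists.

```python
num = 150  # hypothetical value that satisfies both tests

chained = []
if num > 100:
    chained.append('greater than 100')
elif num > 50:
    chained.append('greater than 50')   # skipped: an earlier test was true

separate = []
if num > 100:
    separate.append('greater than 100')
if num > 50:
    separate.append('greater than 50')  # runs: each if is tested on its own

print('elif chain:', chained)
print('separate ifs:', separate)
```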

What Is Truth?

True and False booleans are not the only values in Python that are true and false. In fact, any value can be used in an if or elif. After reading and running the code below, explain what the rule is for which values are considered true and which are considered false.

PYTHON

if '':
    print('empty string is true')
if 'word':
    print('word is true')
if []:
    print('empty list is true')
if [1, 2, 3]:
    print('non-empty list is true')
if 0:
    print('zero is true')
if 1:
    print('one is true')

That’s Not Not What I Meant

Sometimes it is useful to check whether some condition is not true. The Boolean operator not can do this explicitly. After reading and running the code below, write some if statements that use not to test the rule that you formulated in the previous challenge.

PYTHON

if not '':
    print('empty string is not true')
if not 'word':
    print('word is not true')
if not not True:
    print('not not True is true')

Close Enough

Write some conditions that print True if the variable a is within 10% of the variable b and False otherwise. Compare your implementation with your partner’s. Do you get the same answer for all possible pairs of numbers?

Hint

There is a built-in function abs that returns the absolute value of a number:

PYTHON

print(abs(-12))

OUTPUT

12

Solution 1

PYTHON

a = 5
b = 5.1

if abs(a - b) <= 0.1 * abs(b):
    print('True')
else:
    print('False')

Solution 2

PYTHON

print(abs(a - b) <= 0.1 * abs(b))

This works because the Booleans True and False have string representations which can be printed.
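
The answer can depend on which value the 10% is measured against. As one point of comparison, the standard library function math.isclose applies a relative tolerance against the larger of the two absolute values, whereas the solutions above measure it against b only. The sketch below uses hypothetical values chosen so that the two approaches disagree.

```python
import math

a = 100
b = 90.5

# Tolerance measured relative to b only, as in the solutions above
relative_to_b = abs(a - b) <= 0.1 * abs(b)

# math.isclose measures the tolerance relative to the larger of |a| and |b|
symmetric = math.isclose(a, b, rel_tol=0.1)

print(relative_to_b)
print(symmetric)
```

Neither definition is more correct than the other; what matters is stating which one your code uses.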

In-Place Operators

Python (and most other languages in the C family) provides in-place operators that work like this:

PYTHON

x = 1  # original value
x += 1 # add one to x, assigning result back to x
x *= 3 # multiply x by 3
print(x)

OUTPUT

6

Write some code that sums the positive and negative numbers in a list separately, using in-place operators. Do you think the result is more or less readable than writing the same without in-place operators?

Show me the solution

PYTHON

positive_sum = 0
negative_sum = 0
test_list = [3, 4, 6, 1, -1, -5, 0, 7, -8]
for num in test_list:
    if num > 0:
        positive_sum += num
    elif num == 0:
        pass
    else:
        negative_sum += num
print(positive_sum, negative_sum)

Here pass means “don’t do anything”. In this particular case, it’s not actually needed, since if num == 0 neither sum needs to change, but it illustrates the use of elif and pass.

Sorting a List Into Buckets

In our data folder, large data sets are stored in files whose names start with “inflammation-” and small data sets are stored in files whose names start with “small-”. We also have some other files that we do not care about at this point. We’d like to break all these files into three lists called large_files, small_files, and other_files, respectively.

Add code to the template below to do this. Note that the string method startswith returns True if and only if the string it is called on starts with the string passed as an argument, that is:

PYTHON

'String'.startswith('Str')

OUTPUT

True

But

PYTHON

'String'.startswith('str')

OUTPUT

False

Use the following Python code as your starting point:

PYTHON

filenames = ['inflammation-01.csv',
         'myscript.py',
         'inflammation-02.csv',
         'small-01.csv',
         'small-02.csv']
large_files = []
small_files = []
other_files = []

Your solution should:

loop over the names of the files
figure out which group each filename belongs in
append the filename to that list

In the end the three lists should be:

PYTHON

large_files = ['inflammation-01.csv', 'inflammation-02.csv']
small_files = ['small-01.csv', 'small-02.csv']
other_files = ['myscript.py']

Show me the solution

PYTHON

for filename in filenames:
    if filename.startswith('inflammation-'):
        large_files.append(filename)
    elif filename.startswith('small-'):
        small_files.append(filename)
    else:
        other_files.append(filename)

print('large_files:', large_files)
print('small_files:', small_files)
print('other_files:', other_files)

Counting Vowels

Write a loop that counts the number of vowels in a character string.
Test it on a few individual words and full sentences.
Once you are done, compare your solution to your neighbor’s. Did you make the same decisions about how to handle the letter ‘y’ (which some people think is a vowel, and some do not)?

Solution

PYTHON

vowels = 'aeiouAEIOU'
sentence = 'Mary had a little lamb.'
count = 0
for char in sentence:
    if char in vowels:
        count += 1

print('The number of vowels in this string is ' + str(count))

Key Points

Use for variable in sequence to process the elements of a sequence one at a time.
The body of a for loop must be indented.
Use len(thing) to determine the length of something that contains other values.
Use if condition to start a conditional statement, elif condition to provide additional tests, and else to provide a default.
The bodies of the branches of conditional statements must be indented.
Use == to test for equality.
X and Y is only true if both X and Y are true.
X or Y is true if either X or Y, or both, are true.
Zero, the empty string, and the empty list are considered false; all other numbers, strings, and lists are considered true.
True and False represent truth values.

Content from Alternatives to Loops

Last updated on 2024-07-11

Overview

Questions

How can I vectorize my loops?

Objectives

identify what vectorized operations are
perform basic vectorized operations

FIXME

Key Points

NULL

Content from Creating Functions

Last updated on 2024-07-11

Overview

Questions

What are functions, and how can I use them in Python?
How can I define new functions?
What’s the difference between defining and calling a function?
What happens when I call a function?

Objectives

identify what a function is
create new functions
Set default values for function parameters.
Explain why we should divide programs into small, single-purpose functions.

At this point, we’ve seen that code can have Python make decisions about what it sees in our data. What if we want to convert some of our data, such as taking a temperature in Fahrenheit and converting it to Celsius? We could write something like this to convert a single number:

PYTHON

fahrenheit_val = 99
celsius_val = ((fahrenheit_val - 32) * (5/9))

and for a second number we could just copy the lines and rename the variables:

PYTHON

fahrenheit_val = 99
celsius_val = ((fahrenheit_val - 32) * (5/9))

fahrenheit_val2 = 43
celsius_val2 = ((fahrenheit_val2 - 32) * (5/9))

But we would be in trouble as soon as we had to do this more than a couple of times. Cutting and pasting makes our code very long and very repetitive, very quickly. We’d like a way to package our code so that it is easier to reuse: a shorthand way of re-executing longer pieces of code. In Python we can use functions. Let’s start by defining a function fahr_to_celsius that converts temperatures from Fahrenheit to Celsius, alongside a more verbose variant, explicit_fahr_to_celsius, that does the same thing step by step:

PYTHON

def explicit_fahr_to_celsius(temp):
    # Assign the converted value to a variable
    converted = ((temp - 32) * (5/9))
    # Return the value of the new variable
    return converted

def fahr_to_celsius(temp):
    # Return the converted value directly, without creating an
    # intermediate variable. This does the same thing as the
    # previous function, which is more explicit in showing how
    # the return statement works.
    return ((temp - 32) * (5/9))

The function definition opens with the keyword def followed by the name of the function (fahr_to_celsius) and a parenthesized list of parameter names (temp). The body of the function — the statements that are executed when it runs — is indented below the definition line. The body concludes with a return keyword followed by the return value.

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

Let’s try running our function.

PYTHON

fahr_to_celsius(32)

This command calls our function with 32 as the input and returns the converted value.

In fact, calling our own function is no different from calling any other function:

PYTHON

print('freezing point of water:', fahr_to_celsius(32), 'C')
print('boiling point of water:', fahr_to_celsius(212), 'C')

OUTPUT

freezing point of water: 0.0 C
boiling point of water: 100.0 C

We’ve successfully called the function that we defined, and we have access to the value that we returned.

Composing Functions

Now that we’ve seen how to turn Fahrenheit into Celsius, we can also write the function to turn Celsius into Kelvin:

PYTHON

def celsius_to_kelvin(temp_c):
    return temp_c + 273.15

print('freezing point of water in Kelvin:', celsius_to_kelvin(0.))

OUTPUT

freezing point of water in Kelvin: 273.15

What about converting Fahrenheit to Kelvin? We could write out the formula, but we don’t need to. Instead, we can compose the two functions we have already created:

PYTHON

def fahr_to_kelvin(temp_f):
    temp_c = fahr_to_celsius(temp_f)
    temp_k = celsius_to_kelvin(temp_c)
    return temp_k

print('boiling point of water in Kelvin:', fahr_to_kelvin(212.0))

OUTPUT

boiling point of water in Kelvin: 373.15

This is our first taste of how larger programs are built: we define basic operations, then combine them in ever-larger chunks to get the effect we want. Real-life functions will usually be larger than the ones shown here (typically half a dozen to a few dozen lines), but they shouldn’t ever be much longer than that, or the next person who reads them won’t be able to understand what’s going on.
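
The same pattern works in the opposite direction. As a minimal sketch, assuming two helper functions that are not part of the lesson (kelvin_to_celsius and celsius_to_fahr are hypothetical names), a Kelvin-to-Fahrenheit conversion can be composed from them:

```python
def kelvin_to_celsius(temp_k):
    # Hypothetical helper: the inverse of celsius_to_kelvin
    return temp_k - 273.15

def celsius_to_fahr(temp_c):
    # Hypothetical helper: the inverse of fahr_to_celsius
    return temp_c * (9/5) + 32

def kelvin_to_fahr(temp_k):
    # Compose the two helpers instead of writing out one long formula
    temp_c = kelvin_to_celsius(temp_k)
    return celsius_to_fahr(temp_c)

print('freezing point of water in Fahrenheit:', kelvin_to_fahr(273.15))
```

Each helper stays short enough to check by eye, and kelvin_to_fahr reads as a description of the steps rather than as a formula.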

Variable Scope

In composing our temperature conversion functions, we created variables inside of those functions, temp, temp_c, temp_f, and temp_k. We refer to these variables as local variables because they no longer exist once the function is done executing. If we try to access their values outside of the function, we will encounter an error:

PYTHON

print('Again, temperature in Kelvin was:', temp_k)

ERROR

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-eed2471d229b> in <module>
----> 1 print('Again, temperature in Kelvin was:', temp_k)

NameError: name 'temp_k' is not defined

If you want to reuse the temperature in Kelvin after you have calculated it with fahr_to_kelvin, you can store the result of the function call in a variable:

PYTHON

temp_kelvin = fahr_to_kelvin(212.0)
print('temperature in Kelvin was:', temp_kelvin)

OUTPUT

temperature in Kelvin was: 373.15

The variable temp_kelvin, being defined outside any function, is said to be global.

Inside a function, one can read the value of such global variables:

PYTHON

def print_temperatures():
  print('temperature in Fahrenheit was:', temp_fahr)
  print('temperature in Kelvin was:', temp_kelvin)

temp_fahr = 212.0
temp_kelvin = fahr_to_kelvin(temp_fahr)

print_temperatures()

OUTPUT

temperature in Fahrenheit was: 212.0
temperature in Kelvin was: 373.15

By giving our functions human-readable names, we can more easily read and understand what is happening in our code. Even better, if at some later date we want to use either of those pieces of code again, we can do so in a single line.

Testing and Documenting

Once we start putting things in functions so that we can re-use them, we need to start testing that those functions are working correctly. To see how to do this, let’s write a function to offset a dataset so that its mean value shifts to a user-defined value:

PYTHON

import numpy

def offset_mean(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value

We could test this on our actual data, but since we don’t know what the values ought to be, it will be hard to tell if the result was correct. Instead, let’s use NumPy to create a matrix of 0’s and then offset its values to have a mean value of 3:

PYTHON

z = numpy.zeros((2,2))
print(offset_mean(z, 3))

OUTPUT

[[ 3.  3.]
 [ 3.  3.]]

That looks right, so let’s try offset_mean on our real data:

PYTHON

data = numpy.loadtxt(fname='inflammation-01.csv', delimiter=',')
print(offset_mean(data, 0))

OUTPUT

[[-6.14875 -6.14875 -5.14875 ... -3.14875 -6.14875 -6.14875]
 [-6.14875 -5.14875 -4.14875 ... -5.14875 -6.14875 -5.14875]
 [-6.14875 -5.14875 -5.14875 ... -4.14875 -5.14875 -5.14875]
 ...
 [-6.14875 -5.14875 -5.14875 ... -5.14875 -5.14875 -5.14875]
 [-6.14875 -6.14875 -6.14875 ... -6.14875 -4.14875 -6.14875]
 [-6.14875 -6.14875 -5.14875 ... -5.14875 -5.14875 -6.14875]]

It’s hard to tell from the default output whether the result is correct, but there are a few tests that we can run to reassure us:

PYTHON

print('original min, mean, and max are:', numpy.amin(data), numpy.mean(data), numpy.amax(data))
offset_data = offset_mean(data, 0)
print('min, mean, and max of offset data are:',
      numpy.amin(offset_data),
      numpy.mean(offset_data),
      numpy.amax(offset_data))

OUTPUT

original min, mean, and max are: 0.0 6.14875 20.0
min, mean, and max of offset data are: -6.14875 2.84217094304e-16 13.85125

That seems almost right: the original mean was about 6.1, so the lower bound from zero is now about -6.1. The mean of the offset data isn’t quite zero — we’ll explore why not in the challenges — but it’s pretty close. We can even go further and check that the standard deviation hasn’t changed:

PYTHON

print('std dev before and after:', numpy.std(data), numpy.std(offset_data))

OUTPUT

std dev before and after: 4.61383319712 4.61383319712

Those values look the same, but we probably wouldn’t notice if they were different in the sixth decimal place. Let’s do this instead:

PYTHON

print('difference in standard deviations before and after:',
      numpy.std(data) - numpy.std(offset_data))

OUTPUT

difference in standard deviations before and after: -3.5527136788e-15

Again, the difference is very small. It’s still possible that our function is wrong, but it seems unlikely enough that we should probably get back to doing our analysis.
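
Checks like the ones above can also be written so that Python does the comparison for us. The following is a minimal sketch, assuming NumPy is installed and using a small made-up array rather than the inflammation data. numpy.isclose compares floating-point values within a tolerance, which avoids reading long decimal expansions by eye:

```python
import numpy

def offset_mean(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value

# Small made-up test array (not the lesson's inflammation data)
test_data = numpy.array([[1.0, 2.0], [3.0, 6.0]])
shifted = offset_mean(test_data, 0)

# The new mean should be (very nearly) the target value ...
mean_ok = numpy.isclose(numpy.mean(shifted), 0.0)
# ... and shifting should leave the spread of the data unchanged
std_ok = numpy.isclose(numpy.std(shifted), numpy.std(test_data))

print('mean matches target:', mean_ok)
print('std dev unchanged:', std_ok)
```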

Documentation

We have one more task first, though: we should write some documentation for our function to remind ourselves later what it’s for and how to use it.

The usual way to put documentation in software is to add comments like this:

PYTHON

# offset_mean(data, target_mean_value):
# return a new array containing the original data with its mean offset to match the desired value.
def offset_mean(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value

There’s a better way, though. If the first thing in a function is a string that isn’t assigned to a variable, that string is attached to the function as its documentation:

PYTHON

def offset_mean(data, target_mean_value):
    """Return a new array containing the original data
       with its mean offset to match the desired value."""
    return (data - numpy.mean(data)) + target_mean_value

This is better because we can now ask Python’s built-in help system to show us the documentation for the function:

PYTHON

help(offset_mean)

OUTPUT

Help on function offset_mean in module __main__:

offset_mean(data, target_mean_value)
    Return a new array containing the original data
    with its mean offset to match the desired value.

A string like this is called a docstring. We don’t need to use triple quotes when we write one, but if we do, we can break the string across multiple lines:

PYTHON

def offset_mean(data, target_mean_value):
    """Return a new array containing the original data
       with its mean offset to match the desired value.

    Examples
    --------
    >>> offset_mean([1, 2, 3], 0)
    array([-1.,  0.,  1.])
    """
    return (data - numpy.mean(data)) + target_mean_value

help(offset_mean)

OUTPUT

Help on function offset_mean in module __main__:

offset_mean(data, target_mean_value)
    Return a new array containing the original data
       with its mean offset to match the desired value.

    Examples
    --------
    >>> offset_mean([1, 2, 3], 0)
    array([-1.,  0.,  1.])
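
The docstring is stored on the function object itself, in its __doc__ attribute, and this is the text that help displays. A small sketch, assuming NumPy is installed and using a shortened one-line docstring for illustration:

```python
import numpy

def offset_mean(data, target_mean_value):
    """Return a new array with its mean offset to match the desired value."""
    return (data - numpy.mean(data)) + target_mean_value

# The docstring is an ordinary string attached to the function
print(offset_mean.__doc__)
```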

Defining Defaults

We have passed parameters to functions in two ways: directly, as in type(data), and by name, as in numpy.loadtxt(fname='something.csv', delimiter=','). In fact, we can pass the filename to loadtxt without the fname=:

PYTHON

numpy.loadtxt('inflammation-01.csv', delimiter=',')

OUTPUT

array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ...,
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])

but we still need to say delimiter=:

PYTHON

numpy.loadtxt('inflammation-01.csv', ',')

ERROR

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/username/anaconda3/lib/python3.6/site-packages/numpy/lib/npyio.py", line 1041, in loadtxt
    dtype = np.dtype(dtype)
  File "/Users/username/anaconda3/lib/python3.6/site-packages/numpy/core/_internal.py", line 199, in _commastring
    newitem = (dtype, eval(repeats))
  File "<string>", line 1
    ,
    ^
SyntaxError: unexpected EOF while parsing

To understand what’s going on, and make our own functions easier to use, let’s re-define our offset_mean function like this:

PYTHON

def offset_mean(data, target_mean_value=0.0):
    """Return a new array containing the original data
       with its mean offset to match the desired value (0 by default).

    Examples
    --------
    >>> offset_mean([1, 2, 3])
    array([-1.,  0.,  1.])
    """
    return (data - numpy.mean(data)) + target_mean_value

The key change is that the second parameter is now written target_mean_value=0.0 instead of just target_mean_value. If we call the function with two arguments, it works as it did before:

PYTHON

test_data = numpy.zeros((2, 2))
print(offset_mean(test_data, 3))

OUTPUT

[[ 3.  3.]
 [ 3.  3.]]

But we can also now call it with just one parameter, in which case target_mean_value is automatically assigned the default value of 0.0:

PYTHON

more_data = 5 + numpy.zeros((2, 2))
print('data before mean offset:')
print(more_data)
print('offset data:')
print(offset_mean(more_data))

OUTPUT

data before mean offset:
[[ 5.  5.]
 [ 5.  5.]]
offset data:
[[ 0.  0.]
 [ 0.  0.]]

This is handy: if we usually want a function to work one way, but occasionally need it to do something else, we can allow people to pass a parameter when they need to but provide a default to make the normal case easier. The example below shows how Python matches values to parameters:

PYTHON

def display(a=1, b=2, c=3):
    print('a:', a, 'b:', b, 'c:', c)

print('no parameters:')
display()
print('one parameter:')
display(55)
print('two parameters:')
display(55, 66)

OUTPUT

no parameters:
a: 1 b: 2 c: 3
one parameter:
a: 55 b: 2 c: 3
two parameters:
a: 55 b: 66 c: 3

As this example shows, parameters are matched up from left to right, and any that haven’t been given a value explicitly get their default value. We can override this behavior by naming the value as we pass it in:

PYTHON

print('only setting the value of c')
display(c=77)

OUTPUT

only setting the value of c
a: 1 b: 2 c: 77

With that in hand, let’s look at the help for numpy.loadtxt:

PYTHON

help(numpy.loadtxt)

OUTPUT

Help on function loadtxt in module numpy.lib.npyio:

loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes')
    Load data from a text file.

    Each row in the text file must have the same number of values.

    Parameters
    ----------
...

There’s a lot of information here, but the most important part is the first couple of lines:

OUTPUT

loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes')

This tells us that loadtxt has one parameter called fname that doesn’t have a default value, and nine others that do. If we call the function like this:

PYTHON

numpy.loadtxt('inflammation-01.csv', ',')

then the filename is assigned to fname (which is what we want), but the delimiter string ',' is assigned to dtype rather than delimiter, because dtype is the second parameter in the list. However, ',' isn’t a known dtype, so our code produced an error message when we tried to run it. When we call loadtxt we don’t have to provide fname= for the filename because it’s the first item in the list. But if we want the ',' to be assigned to the parameter delimiter, we do have to provide delimiter= for the second argument, since delimiter is not the second parameter in the list.
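
The same mix-up can be reproduced with a small function of our own. This is a sketch only: load_table is a hypothetical stand-in whose first few parameters mimic the start of the loadtxt signature. It reads no file and simply reports which value ended up in which parameter:

```python
def load_table(fname, dtype=float, comments='#', delimiter=None):
    # Hypothetical stand-in: report which value went to which parameter
    return {'fname': fname, 'dtype': dtype,
            'comments': comments, 'delimiter': delimiter}

# By position: ',' lands in the second parameter, dtype
by_position = load_table('inflammation-01.csv', ',')
print('dtype:', by_position['dtype'], 'delimiter:', by_position['delimiter'])

# By name: ',' goes to delimiter and dtype keeps its default
by_name = load_table('inflammation-01.csv', delimiter=',')
print('dtype:', by_name['dtype'], 'delimiter:', by_name['delimiter'])
```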

Readable functions

Consider these two functions:

PYTHON

def s(p):
    a = 0
    for v in p:
        a += v
    m = a / len(p)
    d = 0
    for v in p:
        d += (v - m) * (v - m)
    return numpy.sqrt(d / (len(p) - 1))

def std_dev(sample):
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)

    sum_squared_devs = 0
    for value in sample:
        sum_squared_devs += (value - sample_mean) * (value - sample_mean)

    return numpy.sqrt(sum_squared_devs / (len(sample) - 1))

The functions s and std_dev are computationally equivalent (they both calculate the sample standard deviation), but to a human reader, they look very different. You probably found std_dev much easier to read and understand than s.

As this example illustrates, both documentation and a programmer’s coding style combine to determine how easy it is for others to read and understand the programmer’s code. Choosing meaningful variable names and using blank spaces to break the code into logical “chunks” are helpful techniques for producing readable code. This is useful not only for sharing code with others, but also for the original programmer. If you need to revisit code that you wrote months ago and haven’t thought about since then, you will appreciate the value of readable code!
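
Readable code is also easier to check. One way to gain confidence that a hand-written function such as std_dev computes what its name says is to compare it with a library routine. A sketch, assuming NumPy is installed and using made-up sample values; numpy.std with ddof=1 divides by n - 1, which gives the sample standard deviation:

```python
import numpy

def std_dev(sample):
    sample_sum = 0
    for value in sample:
        sample_sum += value

    sample_mean = sample_sum / len(sample)

    sum_squared_devs = 0
    for value in sample:
        sum_squared_devs += (value - sample_mean) * (value - sample_mean)

    return numpy.sqrt(sum_squared_devs / (len(sample) - 1))

# Made-up sample values for illustration
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(std_dev(sample), numpy.std(sample, ddof=1))
```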

Combining Strings

“Adding” two strings produces their concatenation: 'a' + 'b' is 'ab'. Write a function called fence that takes two parameters called original and wrapper and returns a new string that has the wrapper character at the beginning and end of the original. A call to your function should look like this:

PYTHON

print(fence('name', '*'))

OUTPUT

*name*

Show me the solution

PYTHON

def fence(original, wrapper):
    return wrapper + original + wrapper

Return versus print

Note that return and print are not interchangeable. print is a Python function that prints data to the screen. It enables us, the users, to see the data. The return statement, on the other hand, makes data visible to the program. Let’s have a look at the following function:

PYTHON

def add(a, b):
    print(a + b)

Question: What will we see if we execute the following commands?

PYTHON

A = add(7, 3)
print(A)

Show me the solution

Python will first execute the function add with a = 7 and b = 3, and, therefore, print 10. However, because the function add has no return statement, it returns nothing, which in Python is represented by None. Therefore, None will be assigned to A, and the last line (print(A)) will print None. As a result, we will see:

OUTPUT

10
None
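
For comparison, here is a sketch of a variant that uses return instead of print. The name add_and_return is made up for this illustration and does not appear elsewhere in the lesson:

```python
def add_and_return(a, b):
    # Hand the result back to the caller instead of printing it
    return a + b

A = add_and_return(7, 3)
print(A)
# The returned value can be reused in further calculations
print(A * 2)
```

Here the sum is printed only because we ask for it explicitly, and A holds a number that the rest of the program can use.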

Selecting Characters From Strings

If the variable s refers to a string, then s[0] is the string’s first character and s[-1] is its last. Write a function called outer that returns a string made up of just the first and last characters of its input. A call to your function should look like this:

PYTHON

print(outer('helium'))

OUTPUT

hm

Show me the solution

PYTHON

def outer(input_string):
    return input_string[0] + input_string[-1]

Rescaling an Array

Write a function rescale that takes an array as input and returns a corresponding array of values scaled to lie in the range 0.0 to 1.0. (Hint: If L and H are the lowest and highest values in the original array, then the replacement for a value v should be (v-L) / (H-L).)

Show me the solution

PYTHON

def rescale(input_array):
    L = numpy.amin(input_array)
    H = numpy.amax(input_array)
    output_array = (input_array - L) / (H - L)
    return output_array

Testing and Documenting Your Function

Run the commands help(numpy.arange) and help(numpy.linspace) to see how to use these functions to generate regularly-spaced values, then use those values to test your rescale function. Once you’ve successfully tested your function, add a docstring that explains what it does.

Show me the solution

PYTHON

def rescale(input_array):
    """Takes an array as input, and returns a corresponding array scaled so
    that 0 corresponds to the minimum and 1 to the maximum value of the input array.

    Examples:
    >>> rescale(numpy.arange(10.0))
    array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
           0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])
    >>> rescale(numpy.linspace(0, 100, 5))
    array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ])
    """
    L = numpy.amin(input_array)
    H = numpy.amax(input_array)
    output_array = (input_array - L) / (H - L)
    return output_array

Defining Defaults

Rewrite the rescale function so that it scales data to lie between 0.0 and 1.0 by default, but will allow the caller to specify lower and upper bounds if they want. Compare your implementation to your neighbor’s: do the two functions always behave the same way?

Show me the solution

PYTHON

def rescale(input_array, low_val=0.0, high_val=1.0):
    """rescales input array values to lie between low_val and high_val"""
    L = numpy.amin(input_array)
    H = numpy.amax(input_array)
    intermed_array = (input_array - L) / (H - L)
    output_array = intermed_array * (high_val - low_val) + low_val
    return output_array

Variables Inside and Outside Functions

What does the following piece of code display when run — and why?

PYTHON

f = 0
k = 0

def f2k(f):
    k = ((f - 32) * (5.0 / 9.0)) + 273.15
    return k

print(f2k(8))
print(f2k(41))
print(f2k(32))

print(k)

Show me the solution

OUTPUT

259.81666666666666
278.15
273.15
0

k is 0 because the k inside the function f2k doesn’t know about the k defined outside the function. When the f2k function is called, it creates a local variable k. The function returns the value of its local k, but it does not alter the k defined outside the function. Therefore the original value of k remains unchanged. Beware that a local k is created because the statements inside f2k assign a new value to it. If k were only read, the function would simply retrieve the global k value.
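
The read-only case can be seen in a small sketch. The function name below is made up for illustration. Because the function never assigns to k, Python looks the name up in the global scope:

```python
k = 273.15

def celsius_to_kelvin_using_global(temp_c):
    # k is never assigned inside the function, so the global k is read
    return temp_c + k

print(celsius_to_kelvin_using_global(0.0))
```

Relying on globals in this way makes a function harder to reuse, so passing the value in as a parameter is usually the clearer choice.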

Mixing Default and Non-Default Parameters

Given the following code:

PYTHON

def numbers(one, two=2, three, four=4):
    n = str(one) + str(two) + str(three) + str(four)
    return n

print(numbers(1, three=3))

what do you expect will be printed? What is actually printed? What rule do you think Python is following?

1. 1234
2. one2three4
3. 1239
4. SyntaxError

Given that, what does the following piece of code display when run?

PYTHON

def func(a, b=3, c=6):
    print('a:', a, 'b:', b, 'c:', c)

func(-1, 2)

1. a: b: 3 c: 6
2. a: -1 b: 3 c: 6
3. a: -1 b: 2 c: 6
4. a: b: -1 c: 2

Show me the solution

Attempting to define the numbers function results in option 4, a SyntaxError. The parameters two and four are given default values. Because one and three are not given default values, they are required arguments when the function is called, and they must be placed before any parameters that have default values in the function definition.

The given call to func displays a: -1 b: 2 c: 6. -1 is assigned to the first parameter a, 2 is assigned to the next parameter b, and c is not passed a value, so it uses its default value 6.
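
One way to repair the numbers definition, sketched here as an illustration rather than as the only possible fix, is to move the parameters without defaults to the front of the list:

```python
def numbers(one, three, two=2, four=4):
    # Parameters without defaults come first, then those with defaults
    n = str(one) + str(two) + str(three) + str(four)
    return n

print(numbers(1, three=3))
```

With this ordering the definition is accepted, and the original call works unchanged because three is passed by name.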

Readable Code

Revise a function you wrote for one of the previous exercises to try to make the code more readable. Then, collaborate with one of your neighbors to critique each other’s functions and discuss how your function implementations could be further improved to make them more readable.

Key Points

Define a function using def function_name(parameter).
The body of a function must be indented.
Call a function using function_name(value).
Numbers are stored as integers or floating-point numbers.
Variables defined within a function can only be seen and used within the body of the function.
Variables created outside of any function are called global variables.
Within a function, we can access global variables.
Variables created within a function override global variables if their names match.
Use help(thing) to view help for something.
Put docstrings in functions to provide help for that function.
Specify default values for parameters when defining a function using name=value in the parameter list.
Parameters can be passed by matching based on name, by position, or by omitting them (in which case the default value is used).
Put code whose parameters change frequently in a function, then call it with different parameter values to customize its behavior.

Content from Data Analysis

Last updated on 2024-07-11

Overview

Questions

How can I process tabular data files in Python?
How can I do the same operations on many different files?

Objectives

read in data files to Python
perform common operations on tabular data
write code to perform the same operation on multiple files

FIXME

Key Points

NULL

Content from Visualizations

Last updated on 2024-07-11

Overview

Questions

How can I visualize tabular data in Python?
How can I group several plots together?

Objectives

create graphs and other visualizations using tabular data
group plots together to make comparative visualizations

FIXME

Key Points

NULL

Content from Errors and Exceptions

Last updated on 2024-07-11

Overview

Questions

How does Python report errors?
How can I handle errors in Python programs?

Objectives

identify different errors and correct bugs associated with them

Every programmer encounters errors, both those who are just beginning, and those who have been programming for years. Encountering errors and exceptions can be very frustrating at times, and can make coding feel like a hopeless endeavour. However, understanding what the different types of errors are and when you are likely to encounter them can help a lot. Once you know why you get certain types of errors, they become much easier to fix.

Errors in Python have a very specific form, called a traceback. Let’s examine one:

PYTHON

# This code has an intentional error. You can type it directly or
# use it for reference to understand the error message below.
def favorite_ice_cream():
    ice_creams = [
        'chocolate',
        'vanilla',
        'strawberry'
    ]
    print(ice_creams[3])

favorite_ice_cream()

ERROR

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-70bd89baa4df> in <module>()
      9     print(ice_creams[3])
      10
----> 11 favorite_ice_cream()

<ipython-input-1-70bd89baa4df> in favorite_ice_cream()
      7         'strawberry'
      8     ]
----> 9     print(ice_creams[3])
      10
      11 favorite_ice_cream()

IndexError: list index out of range

This particular traceback has two levels. You can determine the number of levels by looking for the number of arrows on the left hand side. In this case:

The first shows code from the cell above, with an arrow pointing to Line 11 (which is favorite_ice_cream()).
The second shows some code in the function favorite_ice_cream, with an arrow pointing to Line 9 (which is print(ice_creams[3])).

The last level is the actual place where the error occurred. The other level(s) show what function the program executed to get to the next level down. So, in this case, the program first performed a function call to the function favorite_ice_cream. Inside this function, the program encountered an error on Line 9, when it tried to run the code print(ice_creams[3]).

Long Tracebacks

Sometimes, you might see a traceback that is very long, perhaps even 20 levels deep! This can make it seem like something horrible happened, but the length of the error message does not reflect severity. Rather, it indicates that your program called many functions before it encountered the error. Most of the time, the actual place where the error occurred is at the bottom-most level, so you can skip down the traceback to the bottom.

So what error did the program actually encounter? In the last line of the traceback, Python helpfully tells us the category or type of error (in this case, it is an IndexError) and a more detailed error message (in this case, it says “list index out of range”).
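
The same two pieces of information are available inside a program. The following sketch is not part of the original example: it uses Python’s try and except statements to catch the error raised by a function like favorite_ice_cream, then prints the error’s type and message instead of letting the traceback appear:

```python
def favorite_ice_cream():
    ice_creams = ['chocolate', 'vanilla', 'strawberry']
    print(ice_creams[3])

try:
    favorite_ice_cream()
except IndexError as error:
    # The exception object carries both the type and the message
    error_type = type(error).__name__
    error_message = str(error)
    print('type:', error_type)
    print('message:', error_message)
```

Catching an error this way is useful when the program can do something sensible about it. While debugging, the full traceback is usually more informative.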

If you encounter an error and don’t know what it means, it is still important to read the traceback closely. That way, if you fix the error, but encounter a new one, you can tell that the error changed. Additionally, sometimes knowing where the error occurred is enough to fix it, even if you don’t entirely understand the message.

If you do encounter an error you don’t recognize, try looking at the official documentation on errors. However, note that you may not always be able to find the error there, as it is possible to create custom errors. In that case, hopefully the custom error message is informative enough to help you figure out what went wrong. Libraries like pandas and numpy have these custom errors, but the procedure to figure them out is the same: go to the bottom-most level of the traceback, and read the error message there. The documentation for these libraries will often provide the information you need about any functions you are using. There are also large communities of users for data libraries that can help as well!

Reading Error Messages

Read the Python code and the resulting traceback below, and answer the following questions:

How many levels does the traceback have?
What is the function name where the error occurred?
On which line number in this function did the error occur?
What is the type of error?
What is the error message?

PYTHON

# This code has an intentional error. Do not type it directly;
# use it for reference to understand the error message below.
def print_message(day):
    messages = [
        'Hello, world!',
        'Today is Tuesday!',
        'It is the middle of the week.',
        'Today is Donnerstag in German!',
        'Last day of the week!',
        'Hooray for the weekend!',
        'Aw, the weekend is almost over.'
    ]
    print(messages[day])

def print_sunday_message():
    print_message(7)

print_sunday_message()

ERROR

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-7-3ad455d81842> in <module>
     16     print_message(7)
     17
---> 18 print_sunday_message()
     19

<ipython-input-7-3ad455d81842> in print_sunday_message()
     14
     15 def print_sunday_message():
---> 16     print_message(7)
     17
     18 print_sunday_message()

<ipython-input-7-3ad455d81842> in print_message(day)
     11         'Aw, the weekend is almost over.'
     12     ]
---> 13     print(messages[day])
     14
     15 def print_sunday_message():

IndexError: list index out of range

Show me the solution

3 levels
print_message
13
IndexError
list index out of range

You can then infer that 7 is not the right index to use with messages.

Better errors on newer Pythons

Newer versions of Python have improved error printouts. If you are debugging errors, it is often helpful to use the latest Python version, even if you support older versions of Python.

Type Errors

One of the most common types of errors in Python is the type error. These errors occur when you try to perform an operation on an object in Python that cannot support it. This happens easily when working with large datasets, where a column expected to hold one type of value, such as integers, may also contain another, such as strings. When we write a function expecting integers, we will not get an error until it reaches an operation that cannot handle strings. For example:

PYTHON


def our_function():
  my_string="Hello World"
  letter=my_string["e"]

our_function()

ERROR

  File "<ipython-input-3-6bb841ea1423>", line 3
    letter=my_string["e"]
                       ^
TypeError: string indices must be integers

We get this error because we are trying to use an index to access part of our string, which requires an integer. Instead, we entered a character and received a type error. This is fixed by replacing “e” with 1, the position of that letter in the string.

In the case of datasets, we often see type errors when a mathematical operation, such as taking a mean, is performed on a column that contains characters, either as a result of formatting or introduced through error. As a result, correcting the error can involve simply removing the characters from the strings using regular expressions, or if the characters have resulted in incorrect data, removing those observations from the dataset.
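
A small sketch of this situation, using a made-up list in place of a dataset column. One value was stored as text, so adding the values raises a TypeError until that value is converted:

```python
# Made-up column in which one value was stored as a string
values = [3, 5, '7', 2]

try:
    total = sum(values)
except TypeError as error:
    print('TypeError:', error)
    # Convert every entry to an integer before summing
    cleaned = [int(value) for value in values]
    total = sum(cleaned)

print('total:', total)
```

This conversion works only because '7' is a valid number written as text. Entries that cannot be converted would need to be corrected or removed, as described above.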

Syntax Errors

When you forget a colon at the end of a line, accidentally add one space too many when indenting under an if statement, or forget a parenthesis, you will encounter a syntax error. This means that Python couldn’t figure out how to read your program. This is similar to forgetting punctuation in English: for example, this text is difficult to read there is no punctuation there is also no capitalization why is this hard because you have to figure out where each sentence ends you also have to figure out where each sentence begins to some extent it might be ambiguous if there should be a sentence break or not

People can typically figure out what is meant by text with no punctuation, but people are much smarter than computers. If Python doesn’t know how to read the program, it will give up and inform you with an error. For example:

PYTHON

def some_function()
    msg = 'hello, world!'
    print(msg)
     return msg

ERROR

  File "<ipython-input-3-6bb841ea1423>", line 1
    def some_function()
                       ^
SyntaxError: invalid syntax

Here, Python tells us that there is a SyntaxError on line 1, and even puts a little arrow in the place where there is an issue. In this case the problem is that the function definition is missing a colon at the end.

Actually, the function above has two issues with syntax. If we fix the problem with the colon, we see that there is also an IndentationError, which means that the lines in the function definition do not all have the same indentation:

PYTHON

def some_function():
    msg = 'hello, world!'
    print(msg)
     return msg

ERROR

  File "<ipython-input-4-ae290e7659cb>", line 4
    return msg
    ^
IndentationError: unexpected indent

Both SyntaxError and IndentationError indicate a problem with the syntax of your program, but an IndentationError is more specific: it always means that there is a problem with how your code is indented.
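One way to see this relationship, sketched here under the assumption that the code is held in a string (`broken_source` is a made-up name), is to pass it to the built-in `compile` function and catch the general SyntaxError. Python still reports the more specific type:

```python
# A function body with one line indented too far, stored as text.
broken_source = (
    "def some_function():\n"
    "    msg = 'hello, world!'\n"
    "     return msg\n"
)

try:
    compile(broken_source, "<example>", "exec")
except SyntaxError as error:
    # IndentationError is a subclass of SyntaxError, so it is caught here.
    caught = type(error).__name__
    print(caught, "on line", error.lineno)

print(issubclass(IndentationError, SyntaxError))  # True
```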

Tabs and Spaces

Some indentation errors are harder to spot than others. In particular, mixing spaces and tabs can be difficult to spot because they are both whitespace. In the example below, the first two lines in the body of the function some_function are indented with tabs, while the third line is indented with spaces. If you’re working in a Jupyter notebook, be sure to copy and paste this example rather than trying to type it in manually because Jupyter automatically replaces tabs with spaces.

PYTHON

def some_function():
	msg = 'hello, world!'
	print(msg)
        return msg

Visually it is impossible to spot the error. Fortunately, Python does not allow you to mix tabs and spaces.

ERROR

  File "<ipython-input-5-653b36fbcd41>", line 4
    return msg
              ^
TabError: inconsistent use of tabs and spaces in indentation
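Because the two kinds of whitespace look the same on screen, it can help to ask Python to report them. The sketch below is one possible approach, not a standard tool: `indentation_kinds` is a hypothetical helper that labels the leading whitespace of each indented line.

```python
# The same function as above, written as a string so the tabs are explicit.
source = (
    "def some_function():\n"
    "\tmsg = 'hello, world!'\n"
    "\tprint(msg)\n"
    "        return msg\n"
)


def indentation_kinds(text):
    """List (line number, kind of indentation) for every indented line."""
    kinds = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if not indent:
            continue  # line is not indented
        if "\t" in indent and " " in indent:
            kind = "mixed"
        elif "\t" in indent:
            kind = "tabs"
        else:
            kind = "spaces"
        kinds.append((line_number, kind))
    return kinds


report = indentation_kinds(source)
print(report)  # [(2, 'tabs'), (3, 'tabs'), (4, 'spaces')]
```

Most editors can also display whitespace characters or convert tabs to spaces, which avoids the problem in the first place.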

Variable Name Errors

Another very common type of error is called a NameError, and occurs when you try to use a variable that does not exist. For example:

PYTHON

print(a)

ERROR

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-9d7b17ad5387> in <module>()
----> 1 print(a)

NameError: name 'a' is not defined

Variable name errors come with some of the most informative error messages, which are usually of the form “name ‘the_variable_name’ is not defined”.

Why does this error message occur? That’s a harder question to answer, because it depends on what your code is supposed to do. However, there are a few very common reasons why you might have an undefined variable. The first is that you meant to use a string, but forgot to put quotes around it:

PYTHON

print(hello)

ERROR

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-9553ee03b645> in <module>()
----> 1 print(hello)

NameError: name 'hello' is not defined

The second reason is that you might be trying to use a variable that does not yet exist. In the following example, count should have been defined (e.g., with count = 0) before the for loop:

PYTHON

for number in range(10):
    count = count + number
print('The count is:', count)

ERROR

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-9-dd6a12d7ca5c> in <module>()
      1 for number in range(10):
----> 2     count = count + number
      3 print('The count is:', count)

NameError: name 'count' is not defined

Finally, the third possibility is that you made a typo when you were writing your code. Let’s say we fixed the error above by adding the line Count = 0 before the for loop. Frustratingly, this actually does not fix the error. Remember that variables are case-sensitive, so the variable count is different from Count. We still get the same error, because we still have not defined count:

PYTHON

Count = 0
for number in range(10):
    count = count + number
print('The count is:', count)

ERROR

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-10-d77d40059aea> in <module>()
      1 Count = 0
      2 for number in range(10):
----> 3     count = count + number
      4 print('The count is:', count)

NameError: name 'count' is not defined
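For comparison, here is a version that runs, assuming the intent was to add up the numbers 0 through 9. The only change is that the variable is defined before the loop with the same lowercase spelling used inside it:

```python
count = 0  # defined before the loop, spelled exactly as it is used below
for number in range(10):
    count = count + number
print('The count is:', count)  # The count is: 45
```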

Index Errors

Next up are errors having to do with containers (like lists and strings) and the items within them. If you try to access an item in a list or a string that does not exist, then you will get an error. This makes sense: if you asked someone what day they would like to get coffee, and they answered “caturday”, you might be a bit annoyed. Python gets similarly annoyed if you try to ask it for an item that doesn’t exist:

PYTHON

letters = ['a', 'b', 'c']
print('Letter #1 is', letters[0])
print('Letter #2 is', letters[1])
print('Letter #3 is', letters[2])
print('Letter #4 is', letters[3])

OUTPUT

Letter #1 is a
Letter #2 is b
Letter #3 is c

ERROR

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-11-d817f55b7d6c> in <module>()
      3 print('Letter #2 is', letters[1])
      4 print('Letter #3 is', letters[2])
----> 5 print('Letter #4 is', letters[3])

IndexError: list index out of range

Here, Python is telling us that there is an IndexError in our code, meaning we tried to access a list index that did not exist.
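Two common ways to avoid this are sketched below: looping over only the positions that exist (0 up to `len(letters) - 1`), or catching the IndexError when a position might be missing. The variable name `message` is just illustrative.

```python
letters = ['a', 'b', 'c']

# range(len(letters)) yields only valid positions: 0, 1, and 2.
for position in range(len(letters)):
    print('Letter #' + str(position + 1) + ' is', letters[position])

# Asking for a fourth item raises IndexError, which can be handled.
try:
    letters[3]
except IndexError as error:
    message = str(error)
    print('Could not read item 4:', message)
```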

File Errors

The last type of error we’ll cover today is among the most common when using Python with data: errors associated with reading and writing files. If you try to read a file that does not exist, you will receive a FileNotFoundError telling you so. If you attempt to write to a file that was opened read-only, Python 3 raises an UnsupportedOperation error. More generally, problems with input and output manifest as OSErrors, which may show up as a more specific subclass; you can see the list in the Python docs. These errors usually carry a UNIX errno, which you can see in the error message.

PYTHON

file_handle = open('myfile.txt', 'r')

ERROR

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-14-f6e1ac4aee96> in <module>()
----> 1 file_handle = open('myfile.txt', 'r')

FileNotFoundError: [Errno 2] No such file or directory: 'myfile.txt'
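The errno shown in square brackets can also be read in code. This sketch catches the general OSError and inspects what was actually raised; the file name is hypothetical and assumed not to exist in the current folder.

```python
import errno

missing_name = 'this_file_should_not_exist_12345.txt'  # hypothetical name

try:
    open(missing_name, 'r')
except OSError as error:
    # FileNotFoundError is a subclass of OSError, so it is caught here.
    error_type = type(error).__name__
    error_code = error.errno
    print(error_type, error_code, errno.errorcode[error_code])
```

Catching the specific subclass (here, FileNotFoundError) is usually preferable when you know which problem you want to handle.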

One reason for receiving this error is that you specified an incorrect path to the file. For example, if I am currently in a folder called myproject, and I have a file in myproject/writing/myfile.txt, but I try to open myfile.txt, this will fail. The correct path would be writing/myfile.txt. It is also possible that the file name or its path contains a typo. There may also be specific settings based on your organization if you are using shared, networked, or cloud-based drives. It is best to check with your IT administrators if you are still encountering issues reading in a file after troubleshooting.
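When a path is in doubt, it can help to check it before opening the file. The sketch below recreates the folder layout from the paragraph above inside a temporary directory (so nothing on your own drive is touched) and uses `pathlib` to test both paths; the variable names are illustrative.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as project_dir:
    # Stand-in for the "myproject" folder, containing writing/myfile.txt.
    project = Path(project_dir)
    (project / 'writing').mkdir()
    (project / 'writing' / 'myfile.txt').write_text('some text')

    wrong_path = project / 'myfile.txt'
    right_path = project / 'writing' / 'myfile.txt'

    wrong_exists = wrong_path.exists()
    right_exists = right_path.exists()
    print(wrong_exists, right_exists)  # False True

# Relative paths are resolved from the current working directory.
print(Path.cwd())
```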

A related issue can occur if you use the “write” flag when you meant to use the “read” flag. Python will not give you an error if you try to open a file for writing when the file does not exist; it creates the file instead, and if the file already exists, its contents are erased. However, if you meant to open a file for reading, but accidentally opened it for writing, and then try to read from it, you will get an UnsupportedOperation error telling you that the file was not opened for reading:

PYTHON

file_handle = open('myfile.txt', 'w')
file_handle.read()

ERROR

---------------------------------------------------------------------------
UnsupportedOperation                      Traceback (most recent call last)
<ipython-input-15-b846479bc61f> in <module>()
      1 file_handle = open('myfile.txt', 'w')
----> 2 file_handle.read()

UnsupportedOperation: not readable
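The sketch below reproduces the problem and the fix in a temporary folder, so no real file is overwritten. It assumes nothing beyond the standard library; `read_error` and `contents` are illustrative names.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'myfile.txt')

    # Opened for writing: reading from this handle is not supported.
    with open(path, 'w') as file_handle:
        file_handle.write('first line\n')
        try:
            file_handle.read()
        except io.UnsupportedOperation as error:
            read_error = str(error)

    # Opened for reading: the same call now works.
    with open(path, 'r') as file_handle:
        contents = file_handle.read()

print(read_error)
print(contents)
```

Using `with` also closes the file automatically, which makes it harder to leave a handle open in the wrong mode.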

If you are getting a read or write error on a file or folder that you are able to open and/or edit with other programs, you may need to contact an IT administrator to check the permissions granted to you and any programs you are using.

These are the most common errors with files, though many others exist. If you get an error that you’ve never seen before, searching the Internet for that error type often reveals common reasons why you might get that error.

Identifying Syntax Errors

1. Read the code below, and (without running it) try to identify what the errors are.
2. Run the code, and read the error message. Is it a SyntaxError or an IndentationError?
3. Fix the error.
4. Repeat steps 2 and 3, until you have fixed all the errors.

PYTHON

def another_function
  print('Syntax errors are annoying.')
   print('But at least Python tells us about them!')
  print('So they are usually not too hard to fix.')

Show me the solution

There is a SyntaxError for the missing (): at the end of the first line, and an IndentationError for the mismatch between the second and third lines. A fixed version is:

PYTHON

def another_function():
    print('Syntax errors are annoying.')
    print('But at least Python tells us about them!')
    print('So they are usually not too hard to fix.')

Identifying Variable Name Errors

1. Read the code below, and (without running it) try to identify what the errors are.
2. Run the code, and read the error message. What type of NameError do you think this is? In other words, is it a string with no quotes, a misspelled variable, or a variable that should have been defined but was not?
3. Fix the error.
4. Repeat steps 2 and 3, until you have fixed all the errors.

PYTHON

for number in range(10):
    # use a if the number is a multiple of 3, otherwise use b
    if (Number % 3) == 0:
        message = message + a
    else:
        message = message + 'b'
print(message)

Show me the solution

There are three NameErrors: number is misspelled as Number, message is not defined before the loop, and a is missing its quotes.

Fixed version:

PYTHON

message = ''
for number in range(10):
    # use a if the number is a multiple of 3, otherwise use b
    if (number % 3) == 0:
        message = message + 'a'
    else:
        message = message + 'b'
print(message)

Identifying Index Errors

1. Read the code below, and (without running it) try to identify what the errors are.
2. Run the code, and read the error message. What type of error is it?
3. Fix the error.

PYTHON

seasons = ['Spring', 'Summer', 'Fall', 'Winter']
print('My favorite season is ', seasons[4])

Show me the solution

IndexError; the last entry is seasons[3], so seasons[4] doesn’t make sense. A fixed version is:

PYTHON

seasons = ['Spring', 'Summer', 'Fall', 'Winter']
print('My favorite season is ', seasons[-1])

A Final Note About Correcting Errors

Many helpful answers to common error messages are available online. However, when working with official statistics, we also need to exercise some caution. Be aware and be wary of any answers that ask you to download a package from someone’s personal GitHub repository or other file sharing service. Try to find the type of error first and understand what the issue is before downloading anything claiming to fix the error. If the error is the result of an issue with a version of a package, check if there are any security vulnerabilities with that version, and use a package manager to move between package versions.
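Before changing versions, it helps to know which version is installed. The sketch below uses `importlib.metadata` from the standard library (Python 3.8 and later); `installed_version` is a hypothetical helper, and "pip" is only an example package name that may or may not be present in your environment.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("pip"))
```

The version reported here can then be compared against the package's release notes or security advisories before deciding whether to upgrade or downgrade with your package manager.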
