Content from Introduction to Programming
Last updated on 2024-10-28
Overview
Questions
- What is programming?
- What is object oriented programming?
- How do I document code?
- What is a directory?
Objectives
- Learn basic concepts of programming
What is programming?
Programmers use programming languages to give instructions to their computers. In this course, we will learn how to use the open source language R to complete common tasks required in the field of official statistics. This includes the basics of R, data manipulation, and best practices.
There are a few reasons why programming with R is useful for official statistics. Data manipulation and analysis with R is:
- Time-saving: R can complete many computations on a large amount of data that would take a person a long time to do manually.
- Reproducible: The code can be re-run on other data with small modifications, and shared with others to be applied to new purposes.
- Transparent: When you’ve completed a script using best practices, you should be left with a clear list of instructions to complete the data analysis in the form of code. This avoids “black boxes” where an analyst is unsure what they’ve done to the data to get it to its final form.
R is an object oriented programming language
Object oriented programming languages use objects as their main tools. These objects have classes, which describe their general properties. For example, in R you might work with numeric objects, which contain numbers. You could also work with character objects, which are composed of text. We’ll explore classes and data types thoroughly in Episode 3 (Data Types and Structures). We can assign “labels” to these objects, creating variables, and then use a variable anywhere we would use the object itself. We assign objects with an assignment operator. In R, the most commonly used assignment operator is `<-`. Try reproducing the example below on your machine by entering the code into the RStudio “Console” and hitting “Enter”.
R
# Assign a number to a variable
number_flowers <- 8
# Print the variable's contents
print(number_flowers)
We can get the value stored within the variable by printing it.
OUTPUT
[1] 8
Assigning a new value to a variable breaks the connection with the old value; R forgets that number and applies the variable name to the new value.
When you assign a value to a variable, R only stores the value, not the calculation you used to create it. This is an important point if you’re used to the way a spreadsheet program automatically updates linked cells. Let’s look at an example.
R
# Reassign the variable
number_flowers <- 7
# Print the variable's contents
print(number_flowers)
OUTPUT
[1] 7
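To make the contrast with a spreadsheet concrete, here is a minimal sketch. The variable names `flowers_per_bed` and `total_flowers` are made up for this illustration and are not used elsewhere in the lesson.

```r
# Hypothetical variables, used only for this illustration
flowers_per_bed <- 8
total_flowers <- flowers_per_bed * 3   # R stores the result, not the formula

# Reassign the first variable
flowers_per_bed <- 10

# total_flowers is unchanged: it does not update like a linked spreadsheet cell
print(total_flowers)

# To update it, re-run the calculation
total_flowers <- flowers_per_bed * 3
print(total_flowers)
```

If you want a value to follow a change in another variable, you have to run the calculation again yourself.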
Variable Naming Conventions
Historically, R programmers have used a variety of conventions for naming variables. The `.` character in R can be a valid part of a variable name; thus an assignment such as `weight.kg <- 57.5` is perfectly legal. This is often confusing to R newcomers who have programmed in languages where `.` has a more significant meaning. Today, most R programmers 1) start variable names with lower case letters, 2) separate words in variable names with underscores, and 3) use only lowercase letters, underscores, and numbers in variable names. This is known as “snake case”. The Tidyverse Style Guide includes a section on this and other style considerations.
Documenting Code
Notice that in the above examples, hashtags (`#`) are used before giving instructions that are intended for you rather than R. Hashtags produce comments, which are handy for leaving information about the code that will follow. Commenting as much code as possible is part of best practices. Always comment your code! You owe it to your colleagues who may see your code (not to mention your future coding self).
R
# Hashtags go before commented code, which is not run
# print("This code will not be run")
print("Always comment your code!")
OUTPUT
[1] "Always comment your code!"
Directories
A directory is a location on your machine. Say you’d like to open a file that’s located in a folder on your computer. We need to tell R where to look for the file if we expect to find it. Directories are usually listed by referencing nested folders separated by slashes. There are small differences due to operating system (OS), so refer to documentation specific to your OS when learning to work with folder structures.
For example, `/Users/Documents/Learning-R` points to a folder called “Learning-R” in a user’s documents folder. Depending on your IDE (Integrated Development Environment) and setup, you can print your current directory, known as the working directory. By default, R reads and writes files relative to your current working directory.
R
# Print current working directory
getwd()
OUTPUT
[1] "/Users/Documents/
Before beginning our lessons, please set your working directory to the folder that we created in the setup section with `setwd()`. For example, if your folder is named `Learning-R`:
R
# Change current working directory
setwd("~/Documents/Learning-R")
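The sketch below shows one cautious way to change the working directory and then return to where you started. It uses `tempdir()`, a scratch folder that R creates for each session, as a stand-in for your own course folder; the file name `example.csv` is hypothetical.

```r
# Remember where we started so we can go back afterwards
old_wd <- getwd()

# tempdir() stands in here for your own course folder
setwd(tempdir())
print(getwd())

# file.path() joins folder and file names into a single path
# ("example.csv" is a hypothetical file name)
example_path <- file.path("data", "example.csv")
print(example_path)

# Return to the original working directory
setwd(old_wd)
```

Saving the old directory first means the example leaves your session as it found it.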
Key Points
- Programming makes our work faster, more reproducible, and more transparent
- R is an object oriented programming language
- Document your code with comments
- A working directory is the active location on your computer where R can read and write files
Content from R Fundamentals
Last updated on 2024-10-17
Overview
Questions
- How do I use the RStudio IDE?
- How do I read data into R?
- What is a data frame?
- How do I access subsets of a data frame?
- How do I calculate simple statistics like mean and median?
- What is plotting?
- How do I install and use packages?
Objectives
- read data into R
- perform basic data operations
The RStudio IDE
RStudio (developed by Posit, the company formerly known as RStudio) is an IDE (Integrated Development Environment). Just as word processing software provides a handy squiggly line under a misspelled word, your IDE provides tools for helping you write good code. RStudio has 4 panels by default. Use them and you’re on your way to becoming the programming data scientist you always dreamed you’d be. Check out this link from r-bloggers.com for an in-depth tour.
The Four Corners of RStudio
Top Left: Your script editor. From the top left you can select the type of file you wish to create within RStudio. For this tutorial, you can use the plus sign to create an R script. This section contains files that you can edit and save for later.
Top Right: This is your environment. It lists all of the objects and data sets that are active in your current R session. It also gives useful information such as the type of each object, for example numeric or character.
Bottom Left: This is your console. Think of this as the interactive pane, where you can write practice lines without saving them.
Bottom Right: This is the management section. From here you can browse the files within your computer and manually select a working directory. Here is where the help finder will pop up when we use it later in this tutorial. Any plots that are produced will be created here.
Reading Data into R
The files you were asked to download in the setup section are in comma-separated values (CSV) format. Each row holds the observations for one patient, and each column holds the inflammation measured on one day. This is what the first few rows look like.
R
tmp <- read.csv("data/inflammation-01.csv", header = FALSE, nrows = 5)
write.table(tmp, quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE)
rm(tmp)
We want to:
- Load data into memory,
- Calculate the average value of inflammation per day across all patients, and
- Plot the results.
To do all that, we’ll have to learn a little bit about programming.
Getting started
Since we want to import the file called `inflammation-01.csv` into our R environment, we need to be able to tell our computer where the file is. To do this, we will create a “Project” with RStudio that contains the data we want to work with. The “Projects” interface in RStudio not only creates a working directory for you, but also remembers its location (allowing you to quickly navigate to it). The interface also (optionally) preserves custom settings and open files to make it easier to resume work after a break.
Create a new project
- Under the `File` menu in RStudio, click on `New project`, choose `New directory`, then `New project`
- Enter a name for this new folder (or “directory”) and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., `~/Desktop/r-novice-inflammation`).
- Click on `Create project`
- Create a new file where we will type our scripts. Go to File > New File > R script. Click the save icon on your toolbar and save your script as “`script.R`”.
- Make sure you copy the data for the lesson into this folder, if they’re not there already.
Loading Data
Now that we are set up with an RStudio project, we are sure that the data and scripts we are using are all in our working directory. The data files should be located in the directory `data`, inside the working directory. Now we can load the data into R using `read.csv`:
R
read.csv(file = "data/inflammation-01.csv", header = FALSE)
The expression `read.csv(...)` is a function call that asks R to run the function `read.csv`.
Here we give `read.csv` two arguments: the name of the file we want to read, and whether the first line of the file contains names for the columns of data. The filename needs to be a character string (or string for short), so we put it in quotes. Setting the second argument, `header`, to `FALSE` indicates that the data file does not have column headers. We’ll talk more about the value `FALSE`, and its converse `TRUE`, in lesson 04. In the case of our `inflammation-01.csv` example, R auto-generates column names in the sequence `V1` (for “variable 1”), `V2`, and so on, until `V40`.
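If you do not have the lesson data at hand, the behaviour described above can be reproduced with a tiny file. This is a sketch only: the three lines of values are made up and are not the lesson data, and the file is written to a temporary location.

```r
# Write a tiny headerless CSV to a temporary file (made-up values)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("0,0,1", "0,1,2", "0,1,1"), tmp_file)

# header = FALSE tells R that the first line is data, not column names
small <- read.csv(file = tmp_file, header = FALSE)
print(small)

# R generated the column names itself
print(names(small))

# Clean up the temporary file
unlink(tmp_file)
```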
Other Options for Reading CSV Files
`read.csv` actually has many more arguments that you may find useful when importing your own data in the future. You can learn more about these options in this supplementary lesson.
Loading Data with Headers
What happens if you forget to put `header = FALSE`? The default value is `header = TRUE`, which you can check with `?read.csv` or `help(read.csv)`. What do you expect will happen if you leave the default value of `header`? Before you run any code, think about what will happen to the first few rows of your data frame, and its overall size. Then run the following code and see if your expectations agree:
R
read.csv(file = "data/inflammation-01.csv")
R will construct column headers from values in your first row of data, resulting in `X0 X0.1 X1 X3 X1.1 X2 ...`.
Note that the character `X` is prepended: a standalone number would not be a valid variable name. Because column headers are variables, the same naming rules apply. Appending `.1`, `.2` etc. is necessary to avoid duplicate column headers.
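The effect on both the column names and the number of rows can be seen in a small sketch. It passes two made-up lines of data through the `text` argument instead of reading a file, which is an assumption made here to keep the example self-contained.

```r
# Two made-up lines of headerless data, passed as text instead of a file
csv_text <- "0,0,1\n0,1,2"

# With the default header = TRUE, the first line is used up as column names
with_default <- read.csv(text = csv_text)
print(names(with_default))
print(nrow(with_default))

# With header = FALSE, both lines are kept as data
no_header <- read.csv(text = csv_text, header = FALSE)
print(nrow(no_header))
```

The data frame read with the default setting has one row fewer, because the first line of data was turned into column names.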
Reading Different Decimal Point Formats
Depending on the country you live in, the standard decimal mark may be the “dot” or the “comma”. Also, different devices or software can generate data with different kinds of decimal marks. Take a look at `?read.csv` and write the code to load a file called `commadec.txt` that has numeric values with commas as decimal marks, separated by semicolons.
R
read.csv(file = "data/commadec.txt", sep = ";", dec = ",")
or the built-in shortcut:
R
read.csv2(file = "data/commadec.txt")
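To check that both approaches read the numbers correctly, the sketch below writes a small semicolon-separated file to a temporary location and reads it back both ways. The column names `length` and `width` and the values are invented for this illustration.

```r
# Write a small semicolon-separated file with comma decimals (made-up values)
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("length;width", "1,5;2,25", "3,0;4,75"), tmp_file)

# Spell out the separator and decimal mark ...
explicit <- read.csv(file = tmp_file, sep = ";", dec = ",")

# ... or use read.csv2, whose defaults are sep = ";" and dec = ","
shortcut <- read.csv2(file = tmp_file)

print(explicit)

# Both columns should be numeric, not text
print(sapply(explicit, class))

# Clean up the temporary file
unlink(tmp_file)
```

If the decimal mark is not specified, values such as `1,5` are likely to be read as text rather than numbers, so checking the column classes after import is a useful habit.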
Data Frames
Now that we know how to assign things to variables, let’s re-run read.csv and save its result into a variable called ‘dat’:
R
dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
This statement doesn’t produce any output because the assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value by typing the name of the variable, `dat`. However, for large data sets it is convenient to use the function `head` to display only the first few rows of data.
R
head(dat)
First, let’s ask what type of object `dat` is:
R
class(dat)
The output tells us that it’s a data frame. Think of this structure as a spreadsheet in MS Excel that many of us are familiar with. Data frames are very useful for storing data and you will use them frequently when programming in R. A typical data frame of experimental data contains individual observations in rows and variables in columns.
We can see the shape, or dimensions, of the data frame with the function `dim`:
R
dim(dat)
This tells us that our data frame, `dat`, has 60 rows and 40 columns.
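The same functions can be tried on a small data frame built by hand. In this sketch the data frame `small` and its values are made up; it has 3 rows and 4 columns.

```r
# A small made-up data frame: 3 rows and 4 columns
small <- data.frame(
  V1 = c(0, 0, 0),
  V2 = c(1, 2, 1),
  V3 = c(3, 1, 2),
  V4 = c(1, 0, 1)
)

print(dim(small))    # rows first, then columns
print(nrow(small))   # number of rows only
print(ncol(small))   # number of columns only
```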
If we want to get a single value from the data frame, we can provide an index in square brackets. The first number specifies the row and the second the column:
R
# first value in dat, row 1, column 1
dat[1, 1]
# middle value in dat, row 30, column 20
dat[30, 20]
The first value in a data frame index is the row, the second value is the column. If we want to select more than one row or column, we can use the function `c`, which combines the values you give it into one vector or list. For example, to pick columns 10 and 20 from rows 1, 3, and 5, we can do this:
R
dat[c(1, 3, 5), c(10, 20)]
We frequently want to select contiguous rows or columns, such as the first ten rows, or columns 3 through 7. You can use `c` for this, but it’s more convenient to use the `:` operator. This special function generates sequences of numbers:
R
1:5
3:12
For example, we can select the first ten columns of values for the first four rows like this:
R
dat[1:4, 1:10]
or the first ten columns of rows 5 to 10 like this:
R
dat[5:10, 1:10]
If you want to select all rows or all columns, leave that index value empty.
R
# All columns from row 5
dat[5, ]
# All rows from columns 16-18
dat[, 16:18]
If you leave both index values empty (i.e., `dat[,]`), you get the entire data frame.
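The indexing patterns above are summarised in the following sketch, which uses a small made-up data frame called `small` so that the results are easy to check by eye.

```r
# A small made-up data frame: 3 rows and 4 columns
small <- data.frame(
  V1 = c(0, 0, 0),
  V2 = c(1, 2, 1),
  V3 = c(3, 1, 2),
  V4 = c(1, 0, 1)
)

print(small[2, 3])              # single value: row 2, column 3
print(small[c(1, 3), c(2, 4)])  # rows 1 and 3, columns 2 and 4
print(small[1:2, ])             # rows 1 to 2, all columns
print(small[, 2:3])             # all rows, columns 2 to 3
```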
Addressing Columns by Name
Columns can also be addressed by name, with either the `$` operator (i.e., `dat$V16`) or square brackets (i.e., `dat[, 'V16']`). You can learn more about subsetting by column name in this supplementary lesson.
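As a quick check that the two ways of addressing a column give the same result, here is a sketch on a small made-up data frame (`small` is not part of the lesson data).

```r
# A small made-up data frame
small <- data.frame(V1 = c(0, 0, 0), V2 = c(1, 2, 1), V3 = c(3, 1, 2))

by_dollar  <- small$V2        # $ operator
by_bracket <- small[, "V2"]   # column name in square brackets

print(by_dollar)

# Both approaches return the same vector
print(identical(by_dollar, by_bracket))
```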
Now let’s perform some common mathematical operations to learn more about our inflammation data. When analyzing data we often want to look at partial statistics, such as the maximum value per patient or the average value per day. One way to do this is to select the data we want as a new temporary data frame, and then perform the calculation on this subset:
R
# first row, all of the columns
patient_1 <- dat[1, ]
# max inflammation for patient 1
max(patient_1)
We don’t actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:
R
# max inflammation for patient 2
max(dat[2, ])
R also has functions for other common calculations, e.g. finding the minimum, mean, median, and standard deviation of the data:
R
# minimum inflammation on day 7
min(dat[, 7])
# mean inflammation on day 7
mean(dat[, 7])
# median inflammation on day 7
median(dat[, 7])
# standard deviation of inflammation on day 7
sd(dat[, 7])
Forcing Conversion
The code above may give you an error in some R installations, since R does not automatically convert a row from a `data.frame` to a vector. (Confusingly, subsetted columns are automatically converted.) If this happens, you can use the `as.numeric` command to convert the row of data to a numeric vector:
R
patient_1 <- as.numeric(dat[1, ])
max(patient_1)
You can also check the `class` of each object:
R
class(dat[1, ])
class(as.numeric(dat[1, ]))
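The difference between a subsetted row and a subsetted column can be seen on a small made-up data frame. This is a sketch only; `small` and its values are invented for the illustration.

```r
# A small made-up data frame: 2 rows and 3 columns
small <- data.frame(V1 = c(0, 0), V2 = c(1, 2), V3 = c(3, 1))

# A single row is still a data frame
row_1 <- small[1, ]
print(class(row_1))

# as.numeric() turns it into a plain numeric vector
row_1_numeric <- as.numeric(row_1)
print(class(row_1_numeric))
print(mean(row_1_numeric))

# A single column is already a vector
print(class(small[, 1]))
```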
R also has a function that summarizes the previous common calculations:
R
# Summarize function
summary(dat[, 1:4])
For every column in the data frame, the function `summary` calculates the minimum value, the first quartile, the median, the mean, the third quartile, and the maximum value, giving helpful details about the sample distribution.
Plotting
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few of R’s plotting features.
Let’s take a look at the average inflammation over time. We can calculate the average for each day with `apply(dat, 2, mean)`, where the `2` tells `apply` to work column by column, and save the result in the variable `avg_day_inflammation`. Plotting the values is done with the function `plot`.
R
avg_day_inflammation <- apply(dat, 2, mean)
plot(avg_day_inflammation)
Above, we gave the function `plot` a vector of numbers corresponding to the average inflammation per day across all patients. `plot` created a scatter plot where the y-axis is the average inflammation level and the x-axis is the order, or index, of the values in the vector, which in this case correspond to the 40 days of treatment. The result is roughly a linear rise and fall, which is suspicious: based on other studies, we expect a sharper rise and slower fall. Let’s have a look at two other statistics: the maximum and minimum inflammation per day.
R
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation)
R
min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation)
The maximum value rises and falls perfectly smoothly, while the minimum seems to be a step function. Neither result seems particularly likely, so either there’s a mistake in our calculations or something is wrong with our data.
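The `apply` calls used above take a data frame, a margin, and a function. The sketch below shows both margins on a small made-up data frame, where rows stand for patients and columns for days, as in the lesson data.

```r
# A small made-up data frame: 3 patients (rows) and 3 days (columns)
small <- data.frame(V1 = c(0, 0, 0), V2 = c(1, 2, 1), V3 = c(3, 1, 2))

# Margin 2: apply the function to each column (one value per day)
avg_per_day <- apply(small, 2, mean)
print(avg_per_day)

# Margin 1: apply the function to each row (one value per patient)
max_per_patient <- apply(small, 1, max)
print(max_per_patient)
```

The margin decides the direction of the calculation: 1 works across each row and 2 works down each column.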
Plotting Data
Create a plot showing the standard deviation of the inflammation data for each day across all patients.
R
sd_day_inflammation <- apply(dat, 2, sd)
plot(sd_day_inflammation)
Installing Packages
Although R has many built in tools for calculations, most programmers use packages (also known as libraries) to add to their toolbox. Packages are bundles of functions that can be installed and loaded from CRAN or other library repositories.
How you download packages might be specific to your statistical organization, so consult with your instructor. Let’s try installing `dplyr`, a widely used package for data manipulation that is part of the tidyverse.
R
install.packages("dplyr")
Then, we’ll need to load the package with `library()`:
R
library(dplyr)
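Because installing a package can take a while, it can be handy to check whether it is already installed first. The sketch below is one possible approach; the helper name `is_installed` is made up for this example, and the installation lines are left commented out so that running the sketch does not download anything.

```r
# Hypothetical helper: TRUE if a package is installed, without attaching it
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this should be TRUE on any installation
print(is_installed("stats"))

# Install dplyr only when it is missing, then load it
# (commented out so this sketch does not download anything by itself)
# if (!is_installed("dplyr")) install.packages("dplyr")
# library(dplyr)
```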
Key Points
- The RStudio IDE gives you tools for programming
- Read in data with `read.csv()`
- Data frames are the most common data type used in R
- Index with square bracket notation to access specific parts of a data frame
- R has built in functions for many common calculations and operations
- Use plots to visualize data
- Install packages from CRAN with `install.packages()`
Content from Data Types and Structures
Last updated on 2024-07-11
Overview
Questions
- What are the different data types in R?
- What are the different data structures in R?
- How do I access data within the various data structures?
Objectives
- be able to identify and create different data types and structures in R
- manipulate and perform operations on different data types and structures
- access and perform operations on data within different data structures
FIXME
Key Points
- while tibbles are important, users may come across data that needs to be stored in different data types
Content from Functions
Last updated on 2024-07-11
Overview
Questions
- What is a function?
- How do I make a function?
- How can I test my functions?
- How should I document my code?
Objectives
- create functions to reuse code
- learn best practices for documentation
FIXME
Key Points
- functions allow us to reuse code and make it more readable
- documenting functions using best practices helps us and others in the future
Content from Making Choices
Last updated on 2024-07-11
Overview
Questions
- How do I make choices using if and else statements?
- How do I compare values?
Objectives
- use conditional logic
- compare values to make choices
- document code
FIXME
Key Points
- we can compare stored values
- we can automate data production by setting up our choices in our R programs
Content from Loops in R
Last updated on 2024-07-11
Overview
Questions
- How can I do the same thing multiple times more efficiently in R?
- What is vectorization?
- Should I use a loop or an apply statement?
Objectives
- understand loop types and create them
- identify and perform vectorized operations
- identify and understand when to use loops or vectorized operations
FIXME
Key Points
- loops consume computer resources and should be used sparingly
- R has vectorized operations to make classical looping operations faster
Content from Plotting
Last updated on 2024-07-11
Overview
Questions
- How can I use ggplot2?
- What is a geom?
- How can I use ggplot2 to create a scatterplot?
- What other types of plots can I make with ggplot2?
Objectives
- Learn to use ggplot to create visualizations
FIXME
Content from Dealing with Messy Data
Last updated on 2024-07-11
Overview
Questions
- What do I do if my data is the wrong class?
- What is an NA?
- What if my data has missing values?
- How do I change decimal points and rounding in numeric data?
- What if text strings have different cases?
Objectives
- learn to address common issues when cleaning data
- learn to address incorrect variable types, NA values, missing values, decimal points, and lowercase vs uppercase in strings
FIXME
Key Points
- there are common issues a user can look for when working with new data
- making sure your data is clean before you start analysing it will make the analysis easier
Content from Best Practices for Writing R Code
Last updated on 2024-10-21
Overview
Questions
- How can I write R code that other people can understand and use?
Objectives
- learn best practices to make your code usable for others
FIXME
Key Points
- making your code readable allows for others to collaborate
Content from Understanding Factors
Last updated on 2024-07-11
Overview
Questions
- How is categorical data represented in R?
- How do I work with factors?
Objectives
- learn how to perform operations on factors
FIXME
Key Points
- factors are what allow us to work with categorical data
Content from Outputting Data
Last updated on 2024-07-11
Overview
Questions
- How can I put my text, code, and results all in one document?
- How do I use knitr?
- How do I write in Markdown?
Objectives
- Create basic dynamic reports in R
FIXME
Key Points
- Markdown as a method to publish results
- Basic dynamic visualizations with R can be combined with data notes in RMarkdown