SECTION 2a - Loading Data

Preamble: I will use a few stylistic conventions in this code:

  1. variables will have names in all lower case with words separated by underscores, e.g. test_var
  2. I will use = for assignment like a reasonable person
  3. I will generally use single quotes for strings unless I need to include literal single quote characters in a string

First, load the StateIncomeData.csv file into memory

NOTE: Pay attention to which directory R is currently treating as its working directory and where the data file is stored

## Look at the current working directory
curr_dir = getwd()

Assuming your data is in your working directory, load it.

If it isn’t, set your working directory to where your data is located.

The “./” directory label tells R to look in the current working directory
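Before loading anything, it can help to confirm that R can actually see the file. This is only a sketch: it assumes the file name used in this section, and file.exists will simply return FALSE if the file isn't in your working directory.

```r
# '.' (or './') and the working directory resolve to the same folder
same_dir = normalizePath('.') == normalizePath(getwd())
print(same_dir)

# Check whether the data file is visible from the working directory
# (FALSE means you need to call setwd() or use a full path)
data_path = './StateIncomeData.csv'
file_found = file.exists(data_path)
print(file_found)

# List the CSV files R can see in the working directory
print(list.files(path = '.', pattern = '\\.csv$'))
```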

By default, the read.csv function tries to infer the type of each column in a table. This can be dangerous. You can use the “colClasses” argument to force R to use specific types for each column.
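To see why type inference can bite, here is a small sketch using a made-up two-row table (the zip_code and income columns are hypothetical and are not part of StateIncomeData.csv). The text argument lets read.csv parse a string instead of a file, so this runs without any data on disk.

```r
# A tiny CSV held in a string; zip_code has leading zeros that matter
csv_text = 'zip_code,income
01002,45000
02139,52000'

# Default behaviour: zip_code is inferred as an integer, dropping the zeros
df_inferred = read.csv(text = csv_text)
print(class(df_inferred$zip_code))

# Forcing the types keeps zip_code as text
df_forced = read.csv(text = csv_text,
                     colClasses = c('character', 'numeric'))
print(df_forced$zip_code)
```

The colClasses vector is matched to the columns in order, so it needs one entry per column.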

setwd("C:/Users/juliegil/Documents/GitHub/epimath/mi-umbrella-R-workshop/datasets")
# change this to where your data is located.

df_from_csv = read.csv(file='./StateIncomeData.csv')

## Examine the dataset. Make sure every column looks the way it's supposed to
# Note the column names. 
print(head(df_from_csv))
##   Rank                 State Per.capita.income Median.household.income
## 1    0 District of Columbia              45877                   71648
## 2    1          Connecticut              39373                   70048
## 3    2           New Jersey              37288                   69160
## 4    3        Massachusetts              36593                   71919
## 5    4             Maryland              36338                   73971
## 6    5        New Hampshire              34691                   66532
##   Median.family.income Population Number.of.households Number.of.families
## 1                84094     658893               277378             117864
## 2                88819    3596677              1355817             887263
## 3                87951    8938175              2549336            1610581
## 4                88419    6938608              3194844            2203675
## 5                89678    5976407              2165438            1445972
## 6                80581    1326813               519756             345901
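head shows you values, but not the type R assigned to each column. As a complement, str and class report the types. The sketch below rebuilds the first three rows of two columns by hand (toy_df is a stand-in name, not the loaded dataset) so it runs on its own.

```r
# Stand-in dataframe built from the first rows printed above
toy_df = data.frame(State = c('District of Columbia', 'Connecticut', 'New Jersey'),
                    Per.capita.income = c(45877, 39373, 37288),
                    stringsAsFactors = FALSE)

# Compact summary: one line per column with its type and first values
str(toy_df)

# The class of every column as a named character vector
col_types = sapply(toy_df, class)
print(col_types)
```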

What if your data is an Excel spreadsheet? You may need additional packages for different file formats. The code below checks whether a package is installed, installs it if it isn’t, and loads it if it is. We’ll talk about how this sort of logic works later.

if (!require('readxl')) {
    install.packages('readxl')
}
## Loading required package: readxl

The read_excel function from the readxl package will parse .xls and .xlsx files. You can specify which sheet you want to load from a multi-sheet document using the “sheet” argument.

The :: indicates that you’re using a function from a specific package:

package::function

I like to use it so I don’t lose track of which function came from which package
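As a quick illustration that doesn't need any extra packages, the median function lives in the stats package, which ships with R. The values vector here is just made-up example data.

```r
values = c(3, 1, 2)

# Both calls run the same function; the second names its package explicitly
median_plain = median(values)
median_qualified = stats::median(values)

print(median_plain == median_qualified)
```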

setwd("C:/Users/juliegil/Documents/GitHub/epimath/mi-umbrella-R-workshop/datasets")
# change this to where your data is located.

df_from_xls = readxl::read_excel(path='./StateIncomeData.xlsx')

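# Note that the two dataframes should hold the same values provided the input
# data in each was the same. read_excel returns a tibble and keeps column
# names as written, so the class and column names may differ from read.csv.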

SECTION 2b - Accessing data in a dataframe

Accessing columns:

You can access a column in a dataframe either using its name, or its index (its number starting from 1 for the leftmost column)

The $ indicates that you’re accessing a column in a dataframe:

df$colname

income_per_cap_by_name = df_from_csv$Per.capita.income

Per capita income is the 3rd column in the dataframe

Use square brackets to access elements of a dataframe by index.

The first argument in brackets is the row index, the second is the column index

If you leave either position blank, you’ll get every element in that axis

e.g. df[,1] gets the first column, df[3,] gets the third row

income_per_cap_by_idx = df_from_csv[,3]
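Here is a small sketch of the bracket rules on a hand-built dataframe (toy_df is a stand-in so the example runs without the CSV; the values are copied from the output printed earlier).

```r
toy_df = data.frame(State = c('Connecticut', 'New Jersey', 'Maryland'),
                    Per.capita.income = c(39373, 37288, 36338),
                    stringsAsFactors = FALSE)

first_col = toy_df[, 1]     # every row of column 1 (comes back as a vector)
second_row = toy_df[2, ]    # every column of row 2 (a one-row dataframe)
one_cell = toy_df[3, 2]     # row 3, column 2

# Column names also work inside the brackets
by_name = toy_df[, 'Per.capita.income']

print(one_cell)
```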

Creating columns

You can make new columns in a dataframe pretty easily

Let’s add a column that multiplies income per capita by 2

# First I'll do this using the vector I assigned from the per capita income col
df_from_csv$multiplied_income_from_vec = income_per_cap_by_name*2

# Next I'll do it by accessing the income per capita column directly
df_from_csv$multiplied_income_from_col = df_from_csv$Per.capita.income*2

Making new columns is also extremely useful as an intermediate step in data analysis

For example, here I’ll make a column that has the value TRUE for a state with a median household income over $60,000 and FALSE otherwise

df_from_csv$over_60k = df_from_csv$Median.household.income > 60000

You can also access elements in a column by using vectors containing TRUE and FALSE. The result keeps only the elements that line up with a TRUE

In this example I’ll get all the states with median household incomes over $60k

states_over_60k = df_from_csv$State[df_from_csv$over_60k]
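The same TRUE/FALSE vector can go in the row position of the square brackets to keep whole rows. A minimal sketch on a hand-built dataframe (toy_df and the 70,000 cutoff are chosen just for illustration; the incomes are copied from the output printed earlier):

```r
toy_df = data.frame(State = c('Connecticut', 'Maryland', 'New Hampshire'),
                    Median.household.income = c(70048, 73971, 66532),
                    stringsAsFactors = FALSE)

# One TRUE or FALSE per row
over_70k = toy_df$Median.household.income > 70000

# Subset a single column ...
states_over_70k = toy_df$State[over_70k]

# ... or keep every column for the matching rows
rows_over_70k = toy_df[over_70k, ]

print(states_over_70k)
```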

SECTION 3a - Loops

Often we need to do the same task many times. Loops are programming structures that let us accomplish this.

A for loop uses some special syntax:

The first variable in the for loop parentheses is called the control variable - it tells the loop when to start and when to stop. We can also use it within the loop itself.

In this case we’ll use the control variable to keep track of which iteration of the loop we’re on.

Let’s write a for loop that squares a sequence of numbers

seq = c(5,6,7,8,9,10,11,12,13)

# In this example 1:length(seq) represents the range of values that i can
# take during the loop. length(seq) gets the length of the variable seq
# so this loop will go from 1 to 9, as there are 9 elements in seq

for (i in 1:length(seq)) {
    print(seq[i]^2)
}
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
## [1] 121
## [1] 144
## [1] 169
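Printing is fine for a demo, but usually you'll want to keep the results. The sketch below stores each square in a vector. It uses seq_along instead of 1:length(...), which is a bit safer: if the vector is empty, 1:length(...) counts 1, 0 and the loop still runs, while seq_along gives nothing to loop over. The names nums and squares are just for this example.

```r
nums = c(5, 6, 7)

# Make an empty numeric vector of the right length to hold the results
squares = numeric(length(nums))

for (i in seq_along(nums)) {
    squares[i] = nums[i]^2
}
print(squares)

# Arithmetic in R works on whole vectors, so this gives the same result
squares_vectorized = nums^2
```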

SECTION 3b - Conditional Statements

Let’s use a conditional statement to check whether a variable is greater than 5

The syntax for a conditional statement looks a bit like a for loop

The check itself is in parentheses, then the action that the code should take is within curly braces

An else statement tells R what to do if the check evaluates to FALSE

x = 7
check_var = NA

if (x > 5) {
    check_var = TRUE
} else {
    check_var = FALSE
}
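A check can also combine two conditions with && (and), and else if lets you add more branches. Here is a sketch that tests whether x falls between two bounds; the bounds and the range_label variable are made up for this example.

```r
x = 7

if (x > 5 && x < 10) {
    range_label = 'between 5 and 10'
} else if (x >= 10) {
    range_label = '10 or more'
} else {
    range_label = '5 or less'
}

print(range_label)
```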

SECTION 3c - Functions

Let’s write a function to convert knots to mph

We define a function like we’re defining a variable

The name of the function goes on the left, then the function() statement tells R that we’re defining a function within the following curly braces.

The variable name(s) within parentheses are the arguments of the function: the inputs that the function uses to generate its output.

Functions can have as many arguments as you want!

The return statement tells the function what should come out the other end

kn_to_mph = function(kn) {
    # 1 knot is about 1.15078 miles per hour
    mph = 1.15078*kn
    return(mph)
}

Make sure to define your function before you use it in your code.

Otherwise, functions you write behave just like functions that are built in to R or that come with packages.

speed_in_knots = 32
speed_in_mph = kn_to_mph(speed_in_knots)
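Here is a sketch of a function with more than one argument, where the second argument has a default value. convert_knots and its unit argument are hypothetical names I made up for this example; the km/h factor comes from the definition of a knot as 1.852 km/h.

```r
# Hypothetical helper: converts knots to mph or km/h depending on 'unit'
convert_knots = function(speed_kn, unit = 'mph') {
    if (unit == 'mph') {
        factor = 1.15078
    } else if (unit == 'kmh') {
        factor = 1.852
    } else {
        stop('unit must be "mph" or "kmh"')
    }
    return(speed_kn * factor)
}

# The default is used when 'unit' is left out
speed_mph = convert_knots(32)

# Naming the argument makes the call easier to read
speed_kmh = convert_knots(32, unit = 'kmh')
```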