Data Preparation

Overview

Teaching: 45 min
Exercises: 30 min

Questions

What data will we be using?

How do I load the data?

Objectives

Introduction to RStudio

Loading the data

Introduction to RStudio

We will be using RStudio throughout this workshop, and so the first prerequisite is installing R and RStudio. Upon installing and opening RStudio, you will be greeted by three panels:

The interactive R console/Terminal (entire left)
Environment/History/Connections (tabbed in upper right)
Files/Plots/Packages/Help/Viewer (tabbed in lower right)

RStudio Layout

Opening a text file (such as an R or Rmd file) in RStudio will open the file in a new panel in the top left.

RStudio Layout

There are two main ways that you can run R commands or scripts within RStudio:

The interactive R console
- This works well when running individual lines to test code and when starting your analysis.
- It can become laborious and inefficient and is not suitable when running many commands at once.
Writing in a .R / .Rmd file
- All of your code is saved for editing and later use.
- You can run as many lines as you wish at once.

Text that follows a hash symbol (#) in a script file is not run. Such text is called a comment. An example of using comments to describe code is shown below.

# Chunk of code that will add two numbers together

1 + 2 # Adds one and two together

Tip: Running segments of code

There are a few ways you can run lines of code from a .R file. If you want to run a single line, place your cursor at the end of the line. If you want to run multiple lines, select the lines you would like to run. We have a few options for running the code:

click on the Run button above the editor panel, or

press Ctrl+Return (⌘+Return on macOS)

If you are using a .R file and edit a segment of code after running it and want to quickly re-run the segment, you can press the button to the right of the Run button above the editor panel to re-run the previous code region.

Tip: Getting help in R

For help with any function in R, put a question mark before the function name. The help page describes the function’s arguments and gives examples and other background information. For example, running ?hist will show the help page for base R’s function for generating a histogram.

If you don’t know the name of the function you want, you can use two question marks (??) to search for functions relating to a keyword (e.g. ??histogram).
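As a minimal sketch of the help tools described in this tip, the lines below can be pasted into the console. Everything here is base R; the variable name csv_args is only an illustration.

```r
# Open the help page for a function whose name you know
?hist
help("hist")                 # equivalent to ?hist

# Search the installed documentation by keyword
??histogram
help.search("histogram")     # equivalent to ??histogram

# Show a function's arguments without opening its help page
args(hist)

# Store the argument names of read.csv for a quick look
csv_args <- names(formals(read.csv))
csv_args
```

The args and formals calls are handy when you only need a reminder of argument names rather than the full help page.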

First Example - Data Dictionary

Today, we will work with two data sets to understand correlation and linear regression in R.

In our first example we will use a data set of 100 individuals, each with 13 measurements. The data come from the medical records, vitals and clinical examinations of participants with heart disease. Descriptions of the 13 variables are given in the data dictionary below.

Variable	Description
Age	Age in years
Sex	Sex; 0=Female, 1=Male
Cp	Chest Pain; 1=Typical Angina, 2=Atypical Angina, 3=Non-Anginal pain, 4=Asymptomatic
Trestbps	Resting Systolic BP in mmHg
Chol	Blood Cholesterol Level in mg/dl
Fbs	Fasting Blood Sugar; 0=less than 120 mg/dl, 1=greater than 120 mg/dl
Exang	Exercise Induced Angina; 0=No, 1=Yes
Thalach	Maximum Heart Rate Achieved
Old Peak ST	ST wave depression induced by exercise
Slope	The slope of peak exercise segment; 1=Up-sloping, 2=Flat, 3=Down Sloping
Ca	Number of major vessels coloured by fluoroscopy
Class	Diagnosis Class; 0=No disease, 1-4=Various stages of disease in ascending order
restecg	Resting ECG abnormalities; 0=Normal, 1=ST Abnormality, 2=LVH

Working Directory

The working directory is a file path on your computer that is the default location of any files you read or save in R. You can set this directory using the Files pane in RStudio, as shown below.

RStudio Layout

You can also set the working directory in the menu bar, under Session -> Set Working Directory. Alternatively, you can do this in the RStudio console with the setwd command, passing the absolute file path as a string. Use getwd to get the current working directory.

For example, to set the working directory to the downloads folder on Mac or Windows, use

setwd("~/Downloads") # for Mac
setwd("C:/Users/YourUserName/Downloads") # for Windows; use / or \\ because a single \ starts an escape sequence in R strings
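The following is a small sketch of how getwd and setwd interact with relative paths. It assumes nothing about your own folders: tempdir() is used only because that folder is guaranteed to exist, and example.csv is a hypothetical file name.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# Switch to a folder that is guaranteed to exist in any R session
setwd(tempdir())

# A relative path is now resolved against the new working directory
writeLines("a,b\n1,2", "example.csv")

# The same file is reachable through its full path
found <- file.exists(file.path(tempdir(), "example.csv"))
found

# Clean up and go back to where we started
unlink("example.csv")
setwd(old_wd)
```

Saving the old directory before calling setwd makes it easy to undo the change at the end of a script.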

Importing and Preparing the Data

First, import the data into your R environment as a data frame and display its dimensions.

heart <- read.csv("data/heart_disease.csv")

dim(heart)

## [1] 100  14

From this we know that we have 100 rows (observations) and 14 columns (variables): 1 identification variable and 13 measurement variables.
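If you want to try the import step without the workshop file, here is a self-contained sketch. It writes a toy CSV containing only the first three rows and four columns shown in this lesson's str output, then reads it back. The names toy_path and toy are illustrative.

```r
# Write a tiny CSV to a temporary file (a stand-in for data/heart_disease.csv)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("ID,age,sex,chol",
             "1,63,1,233",
             "2,67,1,286",
             "3,67,1,229"),
           toy_path)

# Import it as a data frame
toy <- read.csv(toy_path)

dim(toy)      # number of rows, then number of columns
nrow(toy)     # rows only
ncol(toy)     # columns only
names(toy)    # column names taken from the header line
head(toy, 2)  # first two observations
```

With the real file, a row or column count that differs from what you expect is often the first sign of a wrong path or separator.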

Tip: stringsAsFactors

When importing data with columns containing character strings to be used as categories (e.g. male/female or low/medium/high), we can set the stringsAsFactors argument to TRUE to convert these columns to factors automatically. Since R 4.0.0 the default value of this argument is FALSE.
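A minimal sketch of the effect of stringsAsFactors, using made-up toy values passed through the text argument of read.csv rather than a file:

```r
# Toy CSV content with a text column (hypothetical values, for illustration only)
csv_text <- "ID,sex\n1,Male\n2,Male\n3,Female"

# Keep strings as character vectors
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert strings to factors on import
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$sex)    # "character"
class(as_fct$sex)    # "factor"
levels(as_fct$sex)   # "Female" "Male" (levels are sorted alphabetically)
```

Note that the heart data stores its categories as integer codes rather than strings, so this argument does not change how that file is imported.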

We can use the str function to look at the type and the first few observations of each variable.

str(heart)

## 'data.frame':	100 obs. of  14 variables:
##  $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : int  1 1 1 1 0 1 0 0 1 1 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 0 1 ...
##  $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ restecg : int  2 2 2 0 2 0 2 0 2 2 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : int  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ class   : int  0 2 1 0 0 0 3 0 2 1 ...
##  $ cp      : int  1 4 4 3 2 2 4 4 4 4 ...

Using the summary function, we can view some information about each variable.

summary(heart)

##        ID              age             sex            chol            fbs
##  Min.   :  1.00   Min.   :37.00   Min.   :0.00   Min.   :141.0   Min.   :0.00
##  1st Qu.: 25.75   1st Qu.:48.75   1st Qu.:0.00   1st Qu.:215.2   1st Qu.:0.00
##  Median : 50.50   Median :55.50   Median :1.00   Median :239.0   Median :0.00
##  Mean   : 50.50   Mean   :54.73   Mean   :0.71   Mean   :246.5   Mean   :0.13
##  3rd Qu.: 75.25   3rd Qu.:60.25   3rd Qu.:1.00   3rd Qu.:270.8   3rd Qu.:0.00
##  Max.   :100.00   Max.   :71.00   Max.   :1.00   Max.   :417.0   Max.   :1.00
##     thalach         trestbps        restecg         exang        oldpeak
##  Min.   : 99.0   Min.   :104.0   Min.   :0.00   Min.   :0.0   Min.   :0.000
##  1st Qu.:142.0   1st Qu.:120.0   1st Qu.:0.00   1st Qu.:0.0   1st Qu.:0.400
##  Median :155.5   Median :130.0   Median :2.00   Median :0.0   Median :1.000
##  Mean   :152.2   Mean   :133.2   Mean   :1.14   Mean   :0.3   Mean   :1.235
##  3rd Qu.:165.8   3rd Qu.:140.0   3rd Qu.:2.00   3rd Qu.:1.0   3rd Qu.:1.800
##  Max.   :188.0   Max.   :180.0   Max.   :2.00   Max.   :1.0   Max.   :6.200
##      slope            ca           class            cp
##  Min.   :1.00   Min.   :0.00   Min.   :0.00   Min.   :1.00
##  1st Qu.:1.00   1st Qu.:0.00   1st Qu.:0.00   1st Qu.:3.00
##  Median :1.50   Median :0.00   Median :0.00   Median :3.00
##  Mean   :1.61   Mean   :0.59   Mean   :0.85   Mean   :3.18
##  3rd Qu.:2.00   3rd Qu.:1.00   3rd Qu.:1.00   3rd Qu.:4.00
##  Max.   :3.00   Max.   :3.00   Max.   :4.00   Max.   :4.00
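Means and quartiles are reported above even for coded columns such as sex and slope. One rough way to spot such columns is to count distinct values per column, sketched below on the first ten values of three columns copied from the str output. The names first_ten and n_distinct are illustrative, and the count is only a heuristic: the data dictionary remains the authority on which variables are categorical.

```r
# First ten observations of three columns, as printed by str(heart)
first_ten <- data.frame(
  sex   = c(1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L),
  chol  = c(233L, 286L, 229L, 250L, 204L, 236L, 268L, 354L, 254L, 203L),
  slope = c(3L, 2L, 2L, 3L, 1L, 1L, 3L, 1L, 2L, 3L)
)

# Every column is stored as an integer
sapply(first_ten, class)

# Columns with only a handful of distinct values are candidates for factors
n_distinct <- sapply(first_ten, function(x) length(unique(x)))
n_distinct
```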

Recoding Variables

Looking at the summary output, we can see that the categorical variables such as sex, slope and class are being treated as numerical data. We can fix this by setting these categorical variables as factors.

To do this, we can use factor on each of the categorical columns in our data frame, specifying the levels and labels of each variable as arguments. For ID, which needs no labels, as.factor is enough.

heart$ID <- as.factor(heart$ID)
heart$sex <- factor(heart$sex, levels = c(0, 1), labels = c("Female", "Male"))
heart$fbs <- factor(heart$fbs, levels = c(0, 1), labels = c("<120", ">120"))
heart$restecg <- factor(heart$restecg, levels = c(0, 1, 2), labels = c("Normal", "ST Abnormality", "LVH"))
heart$exang <- factor(heart$exang, levels = c(0, 1), labels = c("No", "Yes"))
heart$slope <- factor(heart$slope, levels = c(1, 2, 3), labels = c("Up-sloping", "Flat", "Down-sloping"))
heart$cp <- factor(heart$cp, levels = c(1, 2, 3, 4), labels = c("Typical angina", "Atypical angina", "Non-Anginal pain", "Asymptomatic"))
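To see what the levels and labels arguments do, here is a small sketch on the first ten restecg codes from the str output. The variable names are illustrative, and the last example uses made-up codes to show what happens to a value that is not listed in levels.

```r
# First ten restecg codes, as printed by str(heart)
restecg_codes <- c(2, 2, 2, 0, 2, 0, 2, 0, 2, 2)

# levels says which codes to expect; labels names them in the same order
restecg <- factor(restecg_codes,
                  levels = c(0, 1, 2),
                  labels = c("Normal", "ST Abnormality", "LVH"))

# A level with no observations is kept and counted as zero
table(restecg)

# droplevels removes levels that never occur
dropped <- droplevels(restecg)
levels(dropped)

# A code that is missing from levels becomes NA (hypothetical codes)
unlisted <- factor(c(0, 1, 5), levels = c(0, 1), labels = c("No", "Yes"))
unlisted
```

Listing every level explicitly is why a category with no observations can still appear in a summary with a count of 0.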

For the class variable, we will merge the four levels of disease into a single “Disease” level, leaving us with a binary variable.

heart$class <- as.factor(heart$class)
levels(heart$class)[which(levels(heart$class) == "0")] <- "No Disease"
levels(heart$class)[which(levels(heart$class) %in% c("1", "2", "3", "4"))] <- "Disease"
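The level-merging step can be tried on the first ten class codes from the str output. The sketch below repeats the approach used in this lesson and, as an assumed alternative rather than the lesson's method, builds the same binary factor with ifelse. The names class_codes, by_levels and by_ifelse are illustrative.

```r
# First ten class codes, as printed by str(heart)
class_codes <- c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1)

# Approach from the lesson: rename levels, letting duplicate names merge
by_levels <- as.factor(class_codes)
levels(by_levels)[levels(by_levels) == "0"] <- "No Disease"
levels(by_levels)[levels(by_levels) %in% c("1", "2", "3", "4")] <- "Disease"

# Alternative sketch: decide the label first, then build the factor
by_ifelse <- factor(ifelse(class_codes == 0, "No Disease", "Disease"),
                    levels = c("No Disease", "Disease"))

# Both give the same counts per level
table(by_levels)
table(by_ifelse)
```

Assigning the same name to several levels merges them, which is why the first approach ends with two levels rather than four.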

Running summary on the data again, now with the correct types, will give us the correct description of the data (counts for categorical variables, and a five-number summary and mean for the numerical variables).

summary(heart)

##        ID          age            sex          chol         fbs
##  1      : 1   Min.   :37.00   Female:29   Min.   :141.0   <120:87
##  2      : 1   1st Qu.:48.75   Male  :71   1st Qu.:215.2   >120:13
##  3      : 1   Median :55.50               Median :239.0
##  4      : 1   Mean   :54.73               Mean   :246.5
##  5      : 1   3rd Qu.:60.25               3rd Qu.:270.8
##  6      : 1   Max.   :71.00               Max.   :417.0
##  (Other):94
##     thalach         trestbps               restecg   exang       oldpeak
##  Min.   : 99.0   Min.   :104.0   Normal        :43   No :70   Min.   :0.000
##  1st Qu.:142.0   1st Qu.:120.0   ST Abnormality: 0   Yes:30   1st Qu.:0.400
##  Median :155.5   Median :130.0   LVH           :57            Median :1.000
##  Mean   :152.2   Mean   :133.2                                Mean   :1.235
##  3rd Qu.:165.8   3rd Qu.:140.0                                3rd Qu.:1.800
##  Max.   :188.0   Max.   :180.0                                Max.   :6.200
##
##           slope          ca              class                   cp
##  Up-sloping  :50   Min.   :0.00   No Disease:57   Typical angina  : 7
##  Flat        :39   1st Qu.:0.00   Disease   :43   Atypical angina :13
##  Down-sloping:11   Median :0.00                   Non-Anginal pain:35
##                    Mean   :0.59                   Asymptomatic    :45
##                    3rd Qu.:1.00
##                    Max.   :3.00
##

We can now use this data in our analyses!

Key Points

Make sure you set your working directory.

View a summary of your data to ensure your variables are of the right type.

Exploring and Predicting using Linear Regression in R