Preparation of Data
Overview
Teaching: 45 min
Exercises: 10 minQuestions
How can I import a dataset into R?
Objectives
Review the structure of a CSV datafile
Read a CSV file into a dataframe
Manage the assignment of data as factors or character strings
Previewing data
Prior to uploading any dataset into R for analysis, it is useful to review the contents of that file to understand the structure of the data. This will allow you to choose the best data structure for your dataset, and will identify the relevant command and settings you need when uploading the data to R.
For small files, like the one we are using today, this review can be done with a
simple text editor, either an app on your computer or from within RStudio. For
very large datasets this is not advisable - files may take a very long time to
open and may even crash your computer. Such large files may have documentation
explaining their format, or you may need to use the head
function to look at
just the first part of the file.
Today we will be working with a datafile based on a study into the effects of treatment with the drug ursodeoxycholic acid on gallstones. This is based on a real study (Hood et al, 1993). In this study, detailed follow-up records were kept for 93 patients receiving various combinations of dietary intervention and ursodeoxycholic acid treatment to investigate ways of reducing the risk of gallstone recurrence. For our exercises today we will just use a subset of the data, with dummy demographic data substituted for real patient information.
Challenge 1
Use RStudio “Files” window, your text editor of choice, or a spreadsheet program to open and review the data file gallstones.csv (do not overwrite the file by saving afterwards). Consider the meaning of the headers and what the nature of the data in each column might be; perhaps discuss with your neighbour if you are in an in-person session. How many patient records are there in the dataset?
Solution to challenge 1
The field contents are as follows:
- Patient_ID – a unique numerical identifier for each patient
- Gender – female or male: one-letter character string
- Age – patient’s age: numerical value in years
- Height – patient’s height: numerical value in (probably) cm
- Weight – patient’s weight: numerical value in (probably) kg
- BMI – patient’s body mass index (weight/height in m2)
- Obese – is the patient obese (BMI>30): 0=no, 1=yes;
- Smoking.Status – does the patient smoke: 1=non-smoker, 2=smoker;
- Alcohol.Consumption – does the patient drink alcohol: 1=no, 2=previously, 3=currently;
- Treatment – did the patient receive the ursodeoxycholic acid treatment: 0=untreated, 1=treated;
- Rec – did gallstones recur: 0=no recurrence, 1=recurrence;
- Mult – did the patient have multiple gallstones: 0=single stone, 1=multiple stones;
- Diam – original gallstone diameter in mm;
- Dis – time in months for gallstone dissolution
There are a total of 37 patient records in this file
Importing data
Once you’ve reviewed the contents of the data file, you can decide how best to
upload it into R. In this case, we have a tabular layout with fields separated
by commas - a CSV file - so we will use the read.csv
function to import this
dataset.
Tip
When reading in tabular data in R, you can use either
read.table
or a range of pre-configured aliases:read.csv
,read.csv2
,read.delim
andread.delim2
. These have function arguments already set to correspond to a range of different formats of CSV (comma separated value) and TSV (tab separated value) files.
# The default settings of `read.csv` differ between R versions, so when running
# the command, we need to specify the stringsAsFactors parameter
gallstones <- read.csv("data/gallstones.csv", stringsAsFactors = TRUE)
Now the file is uploaded, you can find out more about this dataset using the
head
, str
, and summary
functions.
head(gallstones)
## Patient_ID Gender Age Height Weight BMI Obese Smoking.Status Alcohol.Consumption Treatment Rec Mult Diam Dis
## 1 P25 F 64 147 65 30.08006 1 2 1 1 1 1 6 8
## 2 P28 F 81 151 69 30.26183 1 2 2 0 1 1 7 6
## 3 P17 M 77 156 59 24.24392 0 2 1 0 0 0 20 20
## 4 P27 F 80 156 47 19.31295 0 2 3 1 0 0 15 2
## 5 P5 F 86 156 53 21.77844 0 2 2 0 1 0 18 14
## 6 P6 F 69 157 48 19.47341 0 1 3 1 0 0 19 8
str(gallstones)
## 'data.frame': 37 obs. of 14 variables:
## $ Patient_ID : Factor w/ 37 levels "P1","P10","P11",..: 18 21 9 20 33 34 35 19 28 30 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 1 2 1 1 1 1 1 1 1 ...
## $ Age : int 64 81 77 80 86 69 75 77 73 88 ...
## $ Height : int 147 151 156 156 156 157 157 160 160 160 ...
## $ Weight : int 65 69 59 47 53 48 46 55 51 54 ...
## $ BMI : num 30.1 30.3 24.2 19.3 21.8 ...
## $ Obese : int 1 1 0 0 0 0 0 0 0 0 ...
## $ Smoking.Status : int 2 2 2 2 2 1 2 2 2 1 ...
## $ Alcohol.Consumption: int 1 2 1 3 2 3 2 3 3 3 ...
## $ Treatment : int 1 0 0 1 0 1 0 1 0 1 ...
## $ Rec : int 1 1 0 0 1 0 1 0 1 0 ...
## $ Mult : int 1 1 0 0 0 0 1 0 0 1 ...
## $ Diam : int 6 7 20 15 18 19 14 18 15 5 ...
## $ Dis : int 8 6 20 2 14 8 8 4 15 3 ...
summary(gallstones)
## Patient_ID Gender Age Height Weight
## P1 : 1 F:21 Min. :31.00 Min. :147.0 Min. : 46.0
## P10 : 1 M:16 1st Qu.:67.00 1st Qu.:160.0 1st Qu.: 58.0
## P11 : 1 Median :77.00 Median :163.0 Median : 65.0
## P12 : 1 Mean :72.57 Mean :164.6 Mean : 69.7
## P13 : 1 3rd Qu.:85.00 3rd Qu.:168.0 3rd Qu.: 79.0
## P14 : 1 Max. :90.00 Max. :189.0 Max. :109.0
## (Other):31
## BMI Obese Smoking.Status Alcohol.Consumption
## Min. :18.66 Min. :0.0000 Min. :1.000 Min. :1.000
## 1st Qu.:21.78 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:2.000
## Median :24.46 Median :0.0000 Median :2.000 Median :2.000
## Mean :25.51 Mean :0.2973 Mean :1.649 Mean :2.243
## 3rd Qu.:30.09 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :34.72 Max. :1.0000 Max. :2.000 Max. :3.000
##
## Treatment Rec Mult Diam
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 3.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :1.0000 Median :0.0000 Median :1.0000 Median :10.00
## Mean :0.5405 Mean :0.4324 Mean :0.5135 Mean :11.41
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :27.00
##
## Dis
## Min. : 2.00
## 1st Qu.: 6.00
## Median : 8.00
## Mean :10.68
## 3rd Qu.:12.00
## Max. :48.00
##
Columns as factors
This shows, among other details, that the dataset consists of records from 37 patients, aged 31 to 90, of whom 21 are female and 16 male. Looking further, you may notice that patient identifier, which should be unique, has been imported as a factor - this is probably not the best option for a unique identifier so it would be best to convert this to character strings.
gallstones$Patient_ID <- as.character(gallstones$Patient_ID)
str(gallstones$Patient_ID)
## chr [1:37] "P25" "P28" "P17" "P27" "P5" "P6" "P7" "P26" "P34" "P36" "P11" ...
Tip
Sometimes patient ID is numeric and ordered according to recruitment. In this situation, it can be used as a surrogate for time to check for any biases over the course of the study.
Challenge 2
How could you have modified your
read.csv
command so that patient identifiers were imported as character strings in the first place? What other issues might this have caused?Solution to challenge 2
You could use the
stringsAsFactors = FALSE
argument toread.csv
. But this would mean that Gender was also read in as character strings, and that is best represented as a factor.
Key Points
Open a data file in a text editor or RStudio file viewer
Use
read.table
orread.csv
to import dataReview a dataframe using
str
andsummary
Convert columns from factors to string using
as.character