Data Analysis with R Workshop

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this (cars is a data set that is automatically available in R):

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Load in the data

The babynames data comes from the Social Security Administration and gives information about the Top 1000 names given to children in the USA since 1880 (http://www.ssa.gov/oact/babynames/limits.html). The flights data comes from the Bureau of Transportation Statistics at the US Department of Transportation and gives information about all outgoing flights from SEA and PDX in 2014 (http://www.transtats.bts.gov/).

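library(dplyr)  # provides %>%, filter(), group_by(), summarize()
library(broom)  # provides tidy() for the test results below
data("flights", package = "pnwflights14")
data("names", package = "babynames")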

Now that the data is loaded try the following problems. We won’t focus on plotting anything today. You can come to the Data Visualization using R workshop next week for that! Remember that you will need to include all of your R code in Chunks. I’ve added a few blank chunks below to get you started. Note that I’ve also labeled the chunks corresponding to their problem number. It’s always a good habit to label your chunks!

Notice that the data is called names above when we load it from the babynames package, but you will receive an error if you try to use names as your data frame before the %>%. Use babynames as the data frame name instead, as in the example below. This is just a quirk of how the package author, Hadley Wickham, labeled things.

One more example

Identify how many people were born with your first name in the US in 2010. (If you have a rare first name, try a more common name instead.)

entered_name <- "Jack"
entered_year <- 2010
result <- babynames %>% filter(name == entered_name) %>%
  filter(year == entered_year) %>%
  summarize(count = sum(n))
result

## Source: local data frame [1 x 1]
## 
##   count
##   (int)
## 1  8519

Another cool thing is that you can reference values stored in R directly in your text. Something like this:

We have determined that in 2010 there were 8519 babies born in the US with the first name Jack.
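In the R Markdown source, a sentence like this is written with inline code of the form `r expression`, which is replaced by the value of the expression when the document is knit. A small sketch of the same substitution in plain R, using stand-in values (the count is copied from the output above rather than recomputed):

```r
# Stand-in values; in the workshop these come from the chunk above
entered_name <- "Jack"
entered_year <- 2010
count <- 8519

# The inline form in the .Rmd text would look like:
#   ... in `r entered_year` there were `r count` babies ... `r entered_name`.
# Building the same sentence with paste0() shows what gets substituted
sentence <- paste0(
  "We have determined that in ", entered_year, " there were ", count,
  " babies born in the US with the first name ", entered_name, "."
)
sentence
```

Because the values are pulled from R objects, changing entered_name or entered_year and knitting again updates the sentence automatically.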

Hint

Sometimes you will get errors that don’t seem to make sense or resulting data frames that have no values in them. Oftentimes, this is due to missing data appearing in the datasets. Lots of summary functions in R include the argument na.rm. It’s usually a good strategy to set it to TRUE. For example, mean(arr_time, na.rm = TRUE).
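A minimal sketch of what na.rm changes, using a small made-up vector (the values are illustrative and not taken from the flights data):

```r
# Hypothetical arrival delays in minutes; NA marks a missing record
arr_delay <- c(5, NA, 12, -3)

# Without na.rm, a single NA makes the whole summary NA
mean_with_na <- mean(arr_delay)
mean_with_na

# With na.rm = TRUE, missing values are dropped before averaging
mean_delay <- mean(arr_delay, na.rm = TRUE)
mean_delay
```

The same argument is available in other summary functions such as sum(), max(), and median().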

Problems

How many people have been named “Robin” over all the years in the data set? How many males? How many females?

babynames %>% filter(name == "Robin") %>% 
  summarize(count = n())

## Source: local data frame [1 x 1]
## 
##   count
##   (int)
## 1   224

babynames %>% filter(name == "Robin") %>% 
  group_by(sex) %>% summarize(count = n())

## Source: local data frame [2 x 2]
## 
##     sex count
##   (chr) (int)
## 1     F   105
## 2     M   119
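Note that n() inside summarize() counts rows, and babynames has one row per year, sex, and name, so the counts above are numbers of rows rather than numbers of babies. Summing the n column, as in the "Jack" example, gives the number of babies. A hedged sketch of the difference, using a small made-up data frame (toy_names and its values are hypothetical) with the same column names as babynames:

```r
library(dplyr)

# Hypothetical stand-in for babynames: one row per year, sex, and name
toy_names <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  sex  = c("F", "M", "F", "M"),
  name = "Robin",
  n    = c(10, 20, 30, 40)
)

robin_counts <- toy_names %>%
  filter(name == "Robin") %>%
  group_by(sex) %>%
  # rows: how many records there are; babies: total of the n column
  summarize(rows = n(), babies = sum(n))
robin_counts
```

Here each sex has 2 rows, but the number of babies is 40 for F and 60 for M.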

What name has been the most popular over time for males? For females? (There are various ways to measure this. Let’s look at the highest median proportion over time. Try a different measure if you like!)

babynames %>% group_by(sex, name) %>%
  summarize(median_prop = median(prop)) %>%
  top_n(1)

## Selecting by median_prop

## Source: local data frame [2 x 3]
## Groups: sex [2]
## 
##     sex  name median_prop
##   (chr) (chr)       (dbl)
## 1     F  Mary  0.04063204
## 2     M  John  0.04451599

Select the following three variables from the flights dataset: carrier, origin, and arr_delay. Assign this subset to a new variable called flights2. Using flights2, what was the largest arrival delay for each of the carriers at each of the airports? (If you’d like a reference to what the carrier abbreviations stand for, you can find them in the airlines data set in pnwflights14.)

flights2 <- flights %>% select(carrier, origin, arr_delay)
flights2 %>% group_by(carrier, origin) %>%
  summarize(max_arr_delay = max(arr_delay, na.rm = TRUE))

## Source: local data frame [22 x 3]
## Groups: carrier [?]
## 
##    carrier origin max_arr_delay
##      (chr)  (chr)         (dbl)
## 1       AA    PDX          1539
## 2       AA    SEA          1454
## 3       AS    PDX           338
## 4       AS    SEA           844
## 5       B6    PDX           273
## 6       B6    SEA           357
## 7       DL    PDX           651
## 8       DL    SEA           900
## 9       F9    PDX           575
## 10      F9    SEA           804
## 11      HA    PDX           407
## 12      HA    SEA           866
## 13      OO    PDX           421
## 14      OO    SEA           671
## 15      UA    PDX           472
## 16      UA    SEA           557
## 17      US    PDX           347
## 18      US    SEA           690
## 19      VX    PDX           366
## 20      VX    SEA           351
## 21      WN    PDX           694
## 22      WN    SEA           565

Conduct a \(t\)-test (at 5% significance) to determine whether a significant difference exists between the mean distance traveled for flights leaving Portland compared to flights leaving Seattle. Do we have evidence that a difference exists?

t_test4 <- flights %>% t.test(distance ~ origin, data = .)
tidy(t_test4) %>% rename("PDX" = estimate1, "SEA" = estimate2)

##    estimate      PDX     SEA statistic p.value parameter  conf.low
## 1 -206.8359 1065.754 1272.59 -60.35406       0  104976.4 -213.5528
##   conf.high
## 1 -200.1189

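Yes. The \(p\)-value is reported as 0 (it is far too small to display), which is below the 5% significance level, so we reject the null hypothesis that the mean distances are equal.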
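A p-value can also be pulled out of the test object and compared with the significance level directly, rather than read off the printed table. A minimal sketch with simulated data (the data frame sim and its values are made up for illustration):

```r
set.seed(1)

# Simulated distances for two hypothetical origins with clearly different means
sim <- data.frame(
  origin   = rep(c("PDX", "SEA"), each = 200),
  distance = c(rnorm(200, mean = 1000, sd = 100),
               rnorm(200, mean = 1300, sd = 100))
)

# Welch two-sample t-test, the default for t.test() with a formula
sim_test <- t.test(distance ~ origin, data = sim)

# The stored p-value is very small; format.pval() prints it readably
sim_test$p.value
format.pval(sim_test$p.value)

# Decision at the 5% level
reject_null <- sim_test$p.value < 0.05
reject_null
```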

Conduct a \(t\)-test to determine whether a significant difference exists between the mean air time traveled for flights leaving Seattle for the first half of the year compared to the second half of the year.

flights_half <- flights %>% filter(origin == "SEA") %>%
  mutate(half = ifelse(month <= 6, "First", "Second"))
t_test5 <- flights_half %>% t.test(air_time ~ half, data = .)
tidy(t_test5) %>% rename("First" = estimate1, "Second" = estimate2)

##    estimate    First   Second statistic     p.value parameter  conf.low
## 1 -1.560721 159.3716 160.9323 -3.580919 0.000342541  105811.2 -2.414968
##    conf.high
## 1 -0.7064731

Run an ANOVA (at 10% significance) to compare the mean departure delay times for each of the twelve months of the year in Seattle. Do we have evidence to reject the null hypothesis of \(\mu_{1} = \mu_{2} = \mu_{3} = \mu_{4} = \mu_{5} = \mu_{6} = \mu_{7} = \mu_{8} = \mu_{9} = \mu_{10} = \mu_{11} = \mu_{12}\)?

anova_test <- flights %>% aov(dep_time ~ month, data = .)
tidy(anova_test)$p.value[1]

## [1] 0.2619995

tidy(anova_test)

##        term     df        sumsq   meansq statistic   p.value
## 1     month      1 3.435978e+05 343597.8   1.25817 0.2619995
## 2 Residuals 161190 4.401992e+10 273093.4        NA        NA

No, the \(p\)-value of 0.2619995 is greater than \(\alpha = 0.10\).
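One thing worth checking in an analysis like this is the df column. When month is stored as a number, aov() fits a single linear trend (df = 1, as in the table above) instead of comparing one mean per month. A minimal sketch on simulated data (sim_delays and its values are made up, and only three months are used to keep it short) shows how wrapping the variable in factor() changes the test:

```r
set.seed(42)

# Simulated departure delays for three hypothetical months
sim_delays <- data.frame(
  month     = rep(1:3, each = 50),
  dep_delay = rnorm(150, mean = 10, sd = 5)
)

# Numeric month: one slope is estimated, so the term has 1 degree of freedom
fit_numeric <- aov(dep_delay ~ month, data = sim_delays)
df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]

# Factor month: one mean per month, so the term has (3 - 1) = 2 degrees of freedom
fit_factor <- aov(dep_delay ~ factor(month), data = sim_delays)
df_factor <- summary(fit_factor)[[1]][["Df"]][1]

c(numeric = df_numeric, factor = df_factor)
```

With twelve months, the factor version would have 11 degrees of freedom for the month term. Since the question asks about departure delays in Seattle, a fuller version would presumably also filter to origin == "SEA" and use the dep_delay column as the response.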

Data Analysis with R Workshop - SOLUTIONS

Chester Ismay

September 23, 2015
