This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this (`cars` is a data set automatically loaded into R):
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
The `babynames` data comes from the Social Security Administration and gives information about the Top 1000 names given to children in the USA since 1880 (http://www.ssa.gov/oact/babynames/limits.html). The `flights` data comes from the Bureau of Transportation Statistics at the US Department of Transportation and gives information about all outgoing flights from SEA and PDX in 2014 (http://www.transtats.bts.gov/).
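library(dplyr)  # provides %>%, filter(), group_by(), summarize(), and friends
library(broom)  # provides tidy() for the test results further down
data("flights", package = "pnwflights14")
data("names", package = "babynames")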
Now that the data is loaded, try the following problems. We won’t focus on plotting anything today. You can come to the Data Visualization using R workshop next week for that! Remember that you will need to include all of your R code in chunks. I’ve added a few blank chunks below to get you started. Note that I’ve also labeled the chunks corresponding to their problem number. It’s always a good habit to label your chunks!
Notice that the data is called `names` above when we load it from the `babynames` package, but you will receive an error if you try to use `names` as your data frame before the `%>%`. Use `babynames` as the data frame name instead. This is just a quirk with how Hadley labeled things.
Identify how many people were born with your first name in the US in 2010. (If you have a rare first name, try a more common name instead.)
entered_name <- "Jack"
entered_year <- 2010
result <- babynames %>%
  filter(name == entered_name) %>%
  filter(year == entered_year) %>%
  summarize(count = sum(n))
result
## Source: local data frame [1 x 1]
##
## count
## (int)
## 1 8519
Another cool thing is that you can reference values stored in R directly in your text. Something like this:
We have determined that in 2010 there were 8519 babies born in the US with the first name Jack.
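In the R Markdown source, a sentence like the one above is produced with inline code: a backtick, the letter r, an R expression, and a closing backtick. As a rough sketch of what happens when the document is knit, the snippet below assembles the same sentence in plain R. The variable `count` is a stand-in introduced here for illustration; its value is copied from the output above rather than recomputed from the data.

```r
# Values entered for the problem above
entered_name <- "Jack"
entered_year <- 2010

# Stand-in for result$count; copied from the printed output, not recomputed
count <- 8519L

# Build the sentence the same way inline code would splice values into text
sentence <- paste0(
  "We have determined that in ", entered_year,
  " there were ", count,
  " babies born in the US with the first name ", entered_name, "."
)
print(sentence)
```

If you change `entered_name` or `entered_year` and re-knit, the inline values in the text update along with the chunk output, so the prose never drifts out of sync with the code.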
Sometimes you will get errors that don’t seem to make sense, or resulting data frames that have no values in them. Often this is due to missing data in the datasets. Many summary functions in R include the argument `na.rm`, and it’s usually a good strategy to set it to `TRUE`. For example, `mean(arr_time, na.rm = TRUE)`.
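Here is a small sketch of why `na.rm` matters. The vector below is made up for illustration (it is not taken from the `flights` data) and contains a single missing value.

```r
# Made-up arrival times with one missing entry
arr_time <- c(905, 1130, NA, 1745)

# Without na.rm, a single NA makes the whole summary NA
mean_with_na <- mean(arr_time)
print(mean_with_na)

# With na.rm = TRUE, the NA is dropped before averaging
mean_without_na <- mean(arr_time, na.rm = TRUE)
print(mean_without_na)
```

The same pattern applies to `sum()`, `max()`, `min()`, `median()`, and `sd()`, which all accept `na.rm`.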
babynames %>%
  filter(name == "Robin") %>%
  summarize(count = n())
## Source: local data frame [1 x 1]
##
## count
## (int)
## 1 224
babynames %>%
  filter(name == "Robin") %>%
  group_by(sex) %>%
  summarize(count = n())
## Source: local data frame [2 x 2]
##
## sex count
## (chr) (int)
## 1 F 105
## 2 M 119
babynames %>%
  group_by(sex, name) %>%
  summarize(median_prop = median(prop)) %>%
  top_n(1)
## Selecting by median_prop
## Source: local data frame [2 x 3]
## Groups: sex [2]
##
## sex name median_prop
## (chr) (chr) (dbl)
## 1 F Mary 0.04063204
## 2 M John 0.04451599
Select the following columns from the `flights` dataset: `carrier`, `origin`, and `arr_delay`. Assign this subset to a new variable called `flights2`. Using `flights2`, what was the largest arrival delay for each of the carriers at each of the airports? (If you’d like a reference to what the carrier abbreviations stand for, you can find them in the `airlines` data set in `pnwflights14`.)
flights2 <- flights %>% select(carrier, origin, arr_delay)
flights2 %>%
  group_by(carrier, origin) %>%
  summarize(max_arr_delay = max(arr_delay, na.rm = TRUE))
## Source: local data frame [22 x 3]
## Groups: carrier [?]
##
## carrier origin max_arr_delay
## (chr) (chr) (dbl)
## 1 AA PDX 1539
## 2 AA SEA 1454
## 3 AS PDX 338
## 4 AS SEA 844
## 5 B6 PDX 273
## 6 B6 SEA 357
## 7 DL PDX 651
## 8 DL SEA 900
## 9 F9 PDX 575
## 10 F9 SEA 804
## 11 HA PDX 407
## 12 HA SEA 866
## 13 OO PDX 421
## 14 OO SEA 671
## 15 UA PDX 472
## 16 UA SEA 557
## 17 US PDX 347
## 18 US SEA 690
## 19 VX PDX 366
## 20 VX SEA 351
## 21 WN PDX 694
## 22 WN SEA 565
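If you want to sanity-check a grouped summary like this one, base R can compute the same per-group maximum with `tapply()`. The sketch below uses a tiny made-up data frame (the values are not from `flights`), so only the shape of the result carries over.

```r
# Made-up delays for two carriers at two airports, with one missing value
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "AS", "AS", "AS"),
  origin    = c("PDX", "PDX", "SEA", "PDX", "SEA", "SEA"),
  arr_delay = c(12, NA, 45, -3, 20, 7)
)

# Largest arrival delay for each carrier/origin pair, ignoring missing values;
# the result is a matrix with carriers as rows and origins as columns
max_delay <- tapply(
  toy$arr_delay,
  list(toy$carrier, toy$origin),
  max,
  na.rm = TRUE
)
print(max_delay)
```

The `group_by()` plus `summarize()` version returns one row per carrier/origin pair instead of a matrix, which is usually easier to keep working with in a pipeline.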
t_test4 <- flights %>% t.test(distance ~ origin, data = .)
tidy(t_test4) %>% rename("PDX" = estimate1, "SEA" = estimate2)
## estimate PDX SEA statistic p.value parameter conf.low
## 1 -206.8359 1065.754 1272.59 -60.35406 0 104976.4 -213.5528
## conf.high
## 1 -200.1189
Yes, the \(p\)-value is reported as 0 (it is too small to be represented numerically), which is smaller than 5%.
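To see how the pieces of the test output fit together, here is a sketch of the same kind of two-sample comparison on a made-up data frame. The distances below are invented for illustration, and `toy` and `toy_test` are names introduced here, not part of the workshop data.

```r
# Made-up distances for five flights out of each airport
toy <- data.frame(
  origin   = rep(c("PDX", "SEA"), each = 5),
  distance = c(900, 1000, 1100, 1050, 950, 1200, 1300, 1250, 1350, 1150)
)

# Welch two-sample t-test via the formula interface (the default for t.test)
toy_test <- t.test(distance ~ origin, data = toy)

# Group means, test statistic, and p-value
print(toy_test$estimate)
print(toy_test$statistic)
print(toy_test$p.value)

# Decision at the 5% level
reject_null <- toy_test$p.value < 0.05
print(reject_null)
```

With the real `flights` data the samples are so large and the difference so pronounced that the \(p\)-value is far smaller than anything in this toy example.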
flights_half <- flights %>%
  filter(origin == "SEA") %>%
  mutate(half = ifelse(month <= 6, "First", "Second"))
t_test5 <- flights_half %>% t.test(air_time ~ half, data = .)
tidy(t_test5) %>% rename("First" = estimate1, "Second" = estimate2)
## estimate First Second statistic p.value parameter conf.low
## 1 -1.560721 159.3716 160.9323 -3.580919 0.000342541 105811.2 -2.414968
## conf.high
## 1 -0.7064731
anova_test <- flights %>% aov(dep_time ~ month, data = .)
tidy(anova_test)$p.value[1]
## [1] 0.2619995
tidy(anova_test)
## term df sumsq meansq statistic p.value
## 1 month 1 3.435978e+05 343597.8 1.25817 0.2619995
## 2 Residuals 161190 4.401992e+10 273093.4 NA NA
No, the \(p\)-value of 0.2619995 is greater than \(\alpha = 0.10\).
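One thing worth noticing in the table above is that `month` has only 1 degree of freedom. Since `month` is stored as a number, `aov()` fits a single linear trend across months rather than comparing each month's mean. If the goal is to compare the monthly means as groups, one option is to wrap the variable in `factor()`. The sketch below shows the difference on a small made-up data frame; the values and the names `fit_numeric` and `fit_factor` are assumptions for illustration, and this has not been rerun on `flights`.

```r
# Made-up departure times for three months, four observations each
toy <- data.frame(
  month    = rep(c(1, 2, 3), each = 4),
  dep_time = c(10, 11, 12, 13, 30, 31, 32, 33, 20, 21, 22, 23)
)

# month as a number: one slope term, so 1 degree of freedom
fit_numeric <- aov(dep_time ~ month, data = toy)
df_numeric <- summary(fit_numeric)[[1]]$Df
print(df_numeric)

# month as a factor: one mean per month, so (3 - 1) = 2 degrees of freedom
fit_factor <- aov(dep_time ~ factor(month), data = toy)
df_factor <- summary(fit_factor)[[1]]$Df
print(df_factor)
```

With 12 months treated as a factor, the `month` row would show 11 degrees of freedom instead of 1, and the \(p\)-value could differ from the one reported above.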