Web data
There’s a lot of data on the internet that you can’t download as a typical file type with the click of a button, anything from Twitter data to a static HTML table. This page is a quick intro to some solutions for getting this sort of data into R.
JSON and XML
Lots of data on the internet is formatted as JSON (JavaScript Object Notation) or XML (Extensible Markup Language). Even if you can download one of these files (or copy and paste its contents), it won’t be stored as a flat file that reads directly into a table. You can use the jsonlite and XML packages to process this data, however.
For JSON files:
library(jsonlite)
quiz <- fromJSON("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
This quiz file is a good example of something that doesn’t translate intuitively into a data frame; instead, it is read in as nested lists. For a more data-frame-like object, try this colors data set.
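To see why this object doesn’t map neatly onto a data frame, take a quick look at its structure. This is only a sketch; the exact output depends on the contents of the file.
# Inspect the nested lists returned by fromJSON()
str(quiz, max.level = 2)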
For XML files:
library(XML)
note <- xmlToDataFrame(doc = "cds.xml")
You’ll have to download this file to try the example. This CD catalog file can easily be converted to a data frame; for files that cannot, you can use xmlTreeParse or xmlParse.
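As a rough sketch of that lower-level approach (assuming cds.xml has been downloaded to your working directory), you can parse the document and walk its nodes yourself:
doc <- xmlParse("cds.xml")           # parse into an internal XML document
root <- xmlRoot(doc)                 # the top-level node
xmlName(root)                        # name of the root element
xmlSApply(root[[1]], xmlValue)       # values inside the first child node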
Using web APIs
Lots of data on the internet is stored in databases rather than on a static webpage you can link to directly. Typically you access data from these by sending a request to some server and receiving a response that matches your request. The software that communicates your request and brings back the response is called an application programming interface, or API. A good example is an airline website: you put in a request to the API (show me all the flights from PDX to JFK on May 16), and the API brings back a response from the database matching your request.
When you’re using an API to do only a handful of requests, having a point-and-click interface like an airline website makes sense. If you’re using the data from one for programming, however, putting in hundreds of requests manually isn’t feasible. R has some good resources for accessing an API that you can take advantage of in this situation.
Existing API clients
Depending on the task you’re trying to accomplish, somebody may have already written an R package that can pull the data you need. For example, the rtweet package is a good tool for downloading tweets, and googleLanguageR can be used to access the functionality of Google Translate. You can find a list of some packages like this here under “Web Services”. Any package like this will have its own functions and syntax, so read through the package website or documentation for each to figure out how to use it.
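As a rough sketch of what using a client package looks like (rtweet requires Twitter authentication, and its interface may have changed since this was written):
library(rtweet)
# search recent tweets containing #rstats; returns a data frame of tweets
rstats <- search_tweets("#rstats", n = 100)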
Writing API requests
If somebody hasn’t already written a package to simplify API requests for the task you want to accomplish, you can write your own requests using the httr package. The details will differ each time based on the website, parameters, and needs you have, but here’s a very basic example.
r <- httr::GET("https://api.github.com/users/RStudio")  # send the request
httr::content(r)                                         # parse the response body
In addition to defining the URL, sometimes you might need to restrict the search (if a search is what you’re doing) with a list of parameters. These can be passed to the query argument of GET.
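For instance, here’s a sketch using GitHub’s repository search endpoint; the endpoint and its parameters are just an illustration, so check the documentation of whatever API you’re actually querying.
r <- httr::GET(
  "https://api.github.com/search/repositories",
  query = list(q = "web scraping", per_page = 5)  # search term and page size
)
httr::status_code(r)  # 200 means the request succeeded
httr::content(r)      # parsed response, a nested list for JSON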
For more information on how to write GET requests in R, try reading this vignette. For help finding the parameters and links you need to write a correct request, check out this blog post.
Downloading HTML tables
The simplest (to the viewer) form of data on the web is a plain table on a web page. The process of getting a table like this is called “web scraping”, and can be accomplished in R with the rvest package. For an example, we’ll scrape this table on the geographic distribution of first-year students at Reed.
library(rvest)
data2 <- read_html("http://www.reed.edu/ir/geographic_states.html") %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
data <- read_html("http://www.reed.edu/ir/geographic_states.html") %>%
  html_nodes(xpath = '//*[@id="mainContent"]/table') %>%
  .[[1]] %>%
  html_table()
The steps of these commands are:
- Read in the entire webpage
- Pull out every element tagged as a table (in the second set, pull out a specific element by its xpath)
- Extract the first of these elements
- Change this raw HTML into a data frame
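As a quick sanity check (a sketch; the page layout may have changed since this was written), both pipelines target the same table and should return the same data frame:
head(data)               # first rows of the scraped table
identical(data, data2)   # TRUE if both approaches returned the same table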