Web data
There’s a lot of data on the internet that you can’t download as a typical file type with the click of a button, anything from Twitter data to a static HTML table. This page is a quick intro to some solutions for getting this sort of data into R.
JSON and XML
Lots of data on the internet is formatted as JSON (JavaScript Object Notation) or XML (Extensible Markup Language). Even if you can download one of these files (or copy and paste its contents), it won’t be stored as a flat file that reads directly into a table. You can use the jsonlite and XML packages to process this data, however.
For JSON files:
library(jsonlite)
quiz <- fromJSON("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
This quiz file is a good example of something that doesn’t translate intuitively into a data frame; instead, it is read in as nested lists. For a more data-frame-like object, try this colors data set.
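To see why this object doesn’t map neatly onto a data frame, take a quick look at its structure. This is only a sketch; the exact output depends on the contents of the file.
# Inspect the nested lists returned by fromJSON()
str(quiz, max.level = 2)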
For XML files:
library(XML)
note <- xmlToDataFrame(doc = "cds.xml")
You’ll have to download this file to try the example. This CD catalog file can easily be converted to a data frame; for files that cannot, you can use xmlTreeParse or xmlParse.
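As a rough sketch of that lower-level approach (assuming cds.xml has been downloaded to your working directory), you can parse the document and walk its nodes yourself:
doc <- xmlParse("cds.xml")           # parse into an internal XML document
root <- xmlRoot(doc)                 # the top-level node
xmlName(root)                        # name of the root element
xmlSApply(root[[1]], xmlValue)       # values inside the first child node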
Using web APIs
Lots of data on the internet is stored in databases rather than on a static webpage you can link to directly. Typically you access data from these by sending a request to some server and receiving a response that matches your request. The software that communicates your request and brings back the response is called an application programming interface, or API. A good example is an airline website: you put in a request to the API (show me all the flights from PDX to JFK on May 16), and the API brings back a response from the database matching your request.
When you’re using an API to do only a handful of requests, having a point-and-click interface like an airline website makes sense. If you’re using the data from one for programming, however, putting in hundreds of requests manually isn’t feasible. R has some good resources for accessing an API that you can take advantage of in this situation.
Existing API clients
Depending on the task you’re trying to accomplish, somebody may have already written an R package that can pull the data you need. For example, the rtweet package is a good tool for downloading tweets, and googleLanguageR can be used to access the functionality of Google Translate. You can find a list of some packages like this here under “Web Services”. Any package like this will have its own functions and syntax, so read through the package website or documentation for each to figure out how to use it.
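As a rough sketch of what using a client package looks like (rtweet requires Twitter authentication, and its interface may have changed since this was written):
library(rtweet)
# search recent tweets containing #rstats; returns a data frame of tweets
rstats <- search_tweets("#rstats", n = 100)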
Writing API requests
If somebody hasn’t already written a package to simplify API requests for the task you want to accomplish, you can write your own requests using the httr package. The details will differ each time based on the website, parameters, and needs you have, but here’s a very basic example.
r <- httr::GET("https://api.github.com/users/RStudio")  # send the request
httr::content(r)                                         # parse the response body
In addition to defining the URL, sometimes you might need to restrict the search (if a search is what you’re doing) with a list of parameters. These can be passed to the query argument of GET.
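For instance, here’s a sketch using GitHub’s repository search endpoint; the endpoint and its parameters are just an illustration, so check the documentation of whatever API you’re actually querying.
r <- httr::GET(
  "https://api.github.com/search/repositories",
  query = list(q = "web scraping", per_page = 5)  # search term and page size
)
httr::status_code(r)  # 200 means the request succeeded
httr::content(r)      # parsed response, a nested list for JSON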
For more information on how to write GET requests in R, try reading this vignette. For help finding the parameters and links you need to write a correct request, check out this blog post.
Downloading HTML tables
The simplest (to the viewer) form of data on the web is a plain table on a web page. The process of getting a table like this is called “web scraping”, and can be accomplished in R with the rvest package. For an example, we’ll scrape this table on the geographic distribution of first-year students at Reed.
library(rvest)
data2 <- read_html("http://www.reed.edu/ir/geographic_states.html") %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
data <- read_html("http://www.reed.edu/ir/geographic_states.html") %>%
  html_nodes(xpath = '//*[@id="mainContent"]/table') %>%
  .[[1]] %>%
  html_table()
The steps of these commands are:
- Read in the entire webpage
- Pull out every element tagged as a table (in the second set, pull out a specific element by its xpath)
- Extract the first of these elements
- Change this raw HTML into a data frame
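As a quick sanity check (a sketch; the page layout may have changed since this was written), both pipelines target the same table and should return the same data frame:
head(data)               # first rows of the scraped table
identical(data, data2)   # TRUE if both approaches returned the same table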