rvest
If you find a data table on a website that you would like to use, the rvest package can bring that table into R. Scraping data saves you the frustration of copying and pasting from a webpage, and it also makes your work less error-prone and more reproducible.
Before you take data from a website, always make sure you are allowed to scrape and analyze the data. You can sometimes find this information by digging into the webpage, but luckily there’s an R package that will check the site's robots.txt file for you. It is called robotstxt, and while it has a great number of functions, the one that we’ll be using is called paths_allowed().
First, install the robotstxt package:
install.packages("robotstxt")
Then call paths_allowed() with the URL of the website to check whether scraping is allowed:
robotstxt::paths_allowed("url_to_website")
This will return TRUE if the site's robots.txt file permits access to that path and FALSE if it does not.
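As a minimal sketch of how that TRUE/FALSE result might be used, the check can be wrapped in a small helper that stops before any scraping happens. The helper name check_scraping_allowed() and its checker argument are hypothetical and only for illustration; the only robotstxt function used is paths_allowed().

```r
# Hypothetical helper: stop early unless robots.txt permits the path.
# `checker` defaults to robotstxt::paths_allowed; it is an argument only so
# the logic can be tried out without an internet connection.
check_scraping_allowed <- function(page_url, checker = robotstxt::paths_allowed) {
  allowed <- checker(page_url)
  if (!isTRUE(allowed)) {
    stop("robots.txt does not allow scraping: ", page_url)
  }
  invisible(TRUE)
}
```

Calling check_scraping_allowed("url_to_website") at the top of a script would then halt with an error whenever paths_allowed() returns FALSE.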
You can read more about scraping data from the web here. If you still have questions about the legality of a web-scraping workflow after reading this documentation, you can contact Reed’s data librarian.
Once we know that we are allowed to scrape data from our webpage, we can use a package called rvest to actually do the scraping.
First, install and load the rvest package:
install.packages("rvest")
library(rvest)
This example uses a page reporting on the different occupations of Reed alumni.
First, save the URL of this site under a variable name so that it is easy to use later:
url <- "https://www.reed.edu/ir/success.html"
In order to access the data, R needs to know not only the URL of the website that contains the table, but also where on that page the table is located. To point R to the data:
Right-click anywhere on the webpage and select Inspect (Inspect Element in Safari). A panel showing the page's code will open.
Note: if you do not see this option (and you are using Safari), go to the menu bar at the top of your screen and click Safari -> Preferences. A window will open with a range of preferences. Go to Advanced and make sure the box next to “Show Develop menu in menu bar” is selected.
Hover your mouse over the lines of code in this bar on the right. You will notice that as you move your mouse, different sections of the webpage will be highlighted. Hover until you find a line of code that highlights the whole table that you want.
Right-click that line and go to Copy > Copy selector (CSS Selector on Firefox). You will then paste that text into the code below, between the quotation marks after css =.
Once you have the specific location of the table, pass the url saved earlier through the scraping functions and save the result (in this example, as reed_table) using the <- operator.
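reed_table <- url %>%
  read_html() %>%  # download and parse the page
  html_node(css = "#mainContent > table:nth-child(3)") %>%  # pick out the table by its CSS selector
  html_table()  # convert the HTML table to a data frame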
Run the code and check your environment for the dataset reed_table.
While the rvest code may look complicated, it needs only two pieces of information:
The URL of the page with the data table.
The “selector” code that specifies where on the page the data table is located. This tutorial is a great resource for learning more about selectors.
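To see how the selector does its job without depending on a live website, here is a small sketch that runs the same steps on a made-up page built with rvest's minimal_html(). The page contents, the column names group and count, and the object names toy_page, by_position, and by_tag are all invented for illustration; they are not taken from the Reed page.

```r
library(rvest)

# Made-up page for illustration only; not the Reed page
toy_page <- minimal_html('
  <div id="mainContent">
    <h1>Toy page</h1>
    <p>Some text before the table.</p>
    <table>
      <tr><th>group</th><th>count</th></tr>
      <tr><td>a</td><td>1</td></tr>
      <tr><td>b</td><td>2</td></tr>
    </table>
  </div>
')

# Same style of selector as in the code above:
# the table that is the third child of the element with id "mainContent"
by_position <- toy_page %>%
  html_node(css = "#mainContent > table:nth-child(3)") %>%
  html_table()

# A shorter selector that matches the first <table> element on the page
by_tag <- toy_page %>%
  html_node(css = "table") %>%
  html_table()
```

Both selectors point at the same table here because the toy page contains only one. On a real page with several tables, the longer selector copied from the browser is usually the safer choice.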
If you want to learn more about rvest, the Tidyverse documentation for rvest contains a more extensive overview of the package and its functions.