Using rvest and dplyr to look at aviation incidents

For a recent project I needed a database of all aviation incidents. As I really wanted to try Hadley’s new rvest package, I thought I would give it a go and share the code with you.

The data of aviation incidents starting in 1919 from the Aviation Safety Network can be found here: http://aviation-safety.net/database/.

First, we need to install and load the rvest package as well as dplyr, which I love for removing lots of messy code (if you are unfamiliar with the pipe operator %>%, have a look at this description: http://www.r-statistics.com/2014/08/simpler-r-coding-with-pipes-the-present-and-future-of-the-magrittr-package/).

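# install the packages (only needed once)
install.packages("rvest")
install.packages("dplyr")

# load the packages; library() stops with an error if a package is missing
library(rvest)
library(dplyr)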
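If the pipe is new to you, here is a minimal sketch of what it does. The vector x and the chosen functions are made up purely for illustration; the point is only that x %>% f() is another way of writing f(x).

```r
library(dplyr)  # dplyr re-exports the pipe operator %>%

# a made-up example vector
x <- c(4, 9, 16)

# nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# piped calls: read from left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

# both versions compute the same value
identical(nested, piped)
```

The piped version lists the steps in the order in which they happen, which is why the scraping code in this post is written that way.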

Let’s try out some functions of rvest.

Say we want to read all incidents that happened in the year 1920: http://aviation-safety.net/database/dblist.php?Year=1920. We need to find the right html table to download and a way to address it, more precisely its XPath. This can be done by using “inspect element” (right-click on the table, inspect element, right-click on the element in the code and “copy XPath”). In our case the XPath is
//*[@id="contentcolumnfull"]/div/table
To load the html data into R we can use:

url <- "http://aviation-safety.net/database/dblist.php?Year=1920"

# load the html code to R
incidents1920 <- url %>% read_html() 

# filter for the right xpath node
incidents1920 <- incidents1920 %>% 
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') 

# convert to a data.frame
incidents1920 <- incidents1920 %>% html_table() %>% data.frame()

# or in one go
incidents1920 <- url %>% read_html() %>% 
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>% 
  html_table() %>% data.frame()

This gives us a small data.frame of 4 accidents.
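To see what the individual steps return without hitting the website, we can run the same pipeline on a small inline page. This is only a sketch: the html below is a made-up stand-in that copies the id used in the XPath above, not the real page, and its rows are invented. It also shows that a CSS selector can pick the same node, which is an alternative if a copied XPath does not work.

```r
library(rvest)

# made-up page that mimics only the structure addressed by the XPath
page <- read_html('<html><body>
  <div id="contentcolumnfull"><div>
    <table>
      <tr><th>date</th><th>type</th><th>fatalities</th></tr>
      <tr><td>01-JAN-1920</td><td>Example Aircraft A</td><td>0</td></tr>
      <tr><td>02-FEB-1920</td><td>Example Aircraft B</td><td>2</td></tr>
    </table>
  </div></div>
</body></html>')

# select the table node by XPath, as in the post
by_xpath <- page %>%
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table')

# select the same node with a CSS selector instead
by_css <- page %>% html_nodes(css = "#contentcolumnfull table")

# html_table() returns a list with one element per matched table
tables <- by_xpath %>% html_table()
length(tables)

# take the first (and only) table as a data.frame
incidents <- as.data.frame(tables[[1]])
dim(incidents)
```

Because html_table() returns a list, wrapping it in data.frame() as above works best when exactly one table matches; picking the element explicitly with [[1]] makes that assumption visible.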

But what happens if we have more than one page of data per year? We certainly don’t want to paste everything by hand. Take 1962, for example (http://aviation-safety.net/database/dblist.php?Year=1962), which has 3 pages. Luckily we can get the number of pages by using rvest as well.

We follow the steps above to get the number of pages per year, this time with the XPath //*[@id="contentcolumnfull"]/div/div[2]. With some cleaning we get the maximum page number as:


url <- "http://aviation-safety.net/database/dblist.php?Year=1962"

pages <- url %>% read_html() %>%
  html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>%
  html_text() %>% strsplit(" ") %>% unlist() %>%
  as.numeric() %>% max()

pages
# [1] 3
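One thing to watch out for: as.numeric() turns every token that is not a number into NA, and max() returns NA as soon as one NA is present. The following is a slightly more defensive sketch of the cleaning step. The helper name max_page and the example strings are my own assumptions for illustration; I have not checked them against the exact text the site returns.

```r
# hypothetical helper: largest number found in a space-separated string,
# falling back to 1 when the string contains no number at all
max_page <- function(pager_text) {
  tokens <- unlist(strsplit(pager_text, " "))
  # non-numeric tokens become NA; the warning about that is expected
  nums <- suppressWarnings(as.numeric(tokens))
  nums <- nums[!is.na(nums)]
  if (length(nums) == 0) 1 else max(nums)
}

# made-up pager strings
max_page("pages: 1 2 3")
max_page("1 2 3")
max_page("")
```

The fallback to 1 is meant for years where everything fits on a single page and there may be no page numbers to parse.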

Now we can write a small loop to get all incidents of 1962, as the link changes with the page number, i.e. from:
http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=1
to
http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=2
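Since only the year and the page number change, the link can be assembled with paste0(). A small sketch, where build_url is a hypothetical helper name of my own and the URL pattern is the one shown above:

```r
# hypothetical helper that assembles the link for a given year and page
build_url <- function(year, page) {
  paste0("http://aviation-safety.net/database/dblist.php?Year=", year,
         "&lang=&page=", page)
}

build_url(1962, 2)
# [1] "http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=2"
```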

The code for the loop looks like this:


# initiate an empty data.frame, in which we will store the data;
# rbind() ignores it and takes the columns from the first page
dat <- data.frame()

# loop through all page numbers
for (page in 1:pages){
  # create the new URL for the current page
  url <- paste0("http://aviation-safety.net/database/dblist.php?Year=1962&lang=&page=", page)

  # get the html data and convert it to a data.frame
  incidents <- url %>% read_html() %>%
    html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
    html_table() %>% data.frame()

  # combine the data
  dat <- rbind(dat, incidents)
}

# quick look at the dimensions of the data
dim(dat)
# [1] 211 9

This gives us a data.frame of 211 incidents for the year 1962.
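Growing a data.frame with rbind() inside a loop is fine for a handful of pages. An alternative is to collect the pages in a list and combine them once with dplyr's bind_rows(). The sketch below uses a hypothetical fetch_page() that returns made-up rows, so that it runs without a network connection; in practice its body would be the read_html() pipeline from above.

```r
library(dplyr)

# hypothetical stand-in for the scraping step; returns made-up rows
fetch_page <- function(page) {
  data.frame(page = page,
             note = paste("placeholder row from page", page),
             stringsAsFactors = FALSE)
}

pages <- 3

# one data.frame per page, combined in a single step
dat <- lapply(seq_len(pages), fetch_page) %>% bind_rows()

dim(dat)
```

bind_rows() matches columns by name, so the pages need consistent column names for the result to line up.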

Lastly, we can write a loop to gather the data over multiple years:


# set-up of initial values
startyear <- 1960
endyear <- 1965
url_init <- "http://aviation-safety.net/database/dblist.php?Year="

# initiate an empty data.frame, in which we will store the data;
# rbind() ignores it and takes the columns from the first page
dat <- data.frame()

for (year in startyear:endyear){
  # get url for this year
  url_year <- paste0(url_init, year)

  # get pages
  pages <- url_year %>% read_html() %>%
    html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/div[2]') %>%
    html_text() %>% strsplit(" ") %>% unlist() %>%
    as.numeric() %>% max()

  # loop through the pages
  for (page in 1:pages){
    url <- paste0(url_year,"&lang=&page=", page)

    # get the html data and convert it to a data.frame
    incidents <- url %>% read_html() %>%
      html_nodes(xpath = '//*[@id="contentcolumnfull"]/div/table') %>%
      html_table() %>% data.frame()

    # combine the data
    dat <- rbind(dat, incidents)
  }
}

dim(dat)
# [1] 1268 9

For the years 1960-1965 there are 1,268 recorded aviation incidents, which we can now use in R.
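When looping over many years, a single failed request would stop the whole loop. One way to guard against that is to wrap the download in tryCatch() and to pause between requests. This is a sketch under assumptions: fetch_year() is a hypothetical stand-in that fails on purpose for one year, and the length of the pause is an arbitrary choice of mine, not something the site asks for.

```r
library(dplyr)

# hypothetical stand-in for the download; fails for 1962 to imitate an error
fetch_year <- function(year) {
  if (year == 1962) stop("simulated connection error")
  data.frame(year = year, placeholder = TRUE)
}

# returns NULL instead of stopping when the download fails
safe_fetch <- function(year, pause = 0) {
  Sys.sleep(pause)  # wait between requests; 0 here so the example runs quickly
  tryCatch(fetch_year(year), error = function(e) {
    message("skipping ", year, ": ", conditionMessage(e))
    NULL
  })
}

results <- lapply(1960:1965, safe_fetch)

# drop the failed years and combine the rest
results <- Filter(Negate(is.null), results)
dat <- bind_rows(results)

nrow(dat)
```

The skipped years are reported via message(), so they can be fetched again later.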

2 thoughts on “Using rvest and dplyr to look at aviation incidents”

  1. Thank you for the post. It works fine for me with the provided code. But when I tried getting the XPath myself, I couldn’t get it to work. Specifically, I used Firefox and had to install the add-on named “Firebug” to copy the XPath (the default “Inspector” does not have this feature). I then chose the element corresponding to the whole table, right-clicked, selected “copy XPath” and pasted it. I received the line: “/html/body/div[1]/div[1]/div[6]/div/div/table”. Apparently this XPath does not work. I’m not quite sure what causes the difference. Please let me know if you have any comment on this!
    Thanks,
    Long

  2. Hi Long,
    glad you like the post.
    Finding the right XPath can be tricky; I usually play around with it a bit (sometimes going up one level seems to help). Alternatively, you can look into using a CSS selector.
    Cheers,
    David
