14.10.2021

Overview of presentation

  1. Difference between APIs & web scraping
  2. How to extract data via an API
  3. How to determine if a website is using an API
  4. How to scrape data

Note: examples will use R

What is web scraping?

  • A range of techniques for using a computing platform (e.g. R or Python) to extract data embedded in websites & store it in easy-to-analyse formats
  • It is often time-consuming, fiddly & computationally intensive
  • Web scraping can also breach a website’s usage rights & could lead to you being blocked by the site you are scraping

Ethics of web scraping

  • Web scraping can be illegal if:
    • Terms & conditions specifically prohibit downloading/copying content
    • You pass the data off as your own or republish it in its original form (i.e. breach “fair use”)
  • In practice, scraping is tolerated if you do not disrupt websites’ regular use
    • This can occur if you query a website repeatedly & access a large number of pages (i.e. send many requests to the site’s server), causing the server to run out of resources or crash & blocking normal users’ access
    • This is essentially a Denial of Service (DoS) attack & can lead to you being blocked by the website
    • To avoid this, you can add a random delay between requests, giving the server enough time to handle requests from all users (a minimal sketch follows this list)
  • Further reading here & here
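
For illustration, one simple way to add such a delay (the URLs below are placeholders, not a real target site):

urls  <- c("https://example.com/page1", "https://example.com/page2") # hypothetical pages
pages <- list()
for (u in urls) {
  pages[[u]] <- rvest::read_html(u)     # fetch one page
  Sys.sleep(runif(1, min = 2, max = 5)) # random 2-5 second pause so the server isn't flooded
}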

What are Application Programming Interfaces (APIs)?

  • APIs are lightweight, structured interfaces that allow a program/computer to access the features or data of another program/computer directly & should come with documentation explaining how to use them
  • REST APIs are the most common & are used by large organisations (e.g. Twitter, UNDP, IBGE) to help you access their data
  • APIs are also used by many websites to fetch the data they display (a minimal request sketch follows this list)
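
In practice, calling a REST API from R usually just means sending an HTTP GET request & parsing the response; a minimal sketch with httr (the endpoint & parameters below are placeholders, not a real API):

library(httr)

resp <- GET("https://api.example.com/v1/data",           # hypothetical endpoint
            query = list(year = 2021, format = "json"))  # hypothetical parameters
stop_for_status(resp)                                     # error out if the request failed
dat  <- jsonlite::fromJSON(content(resp, as = "text", encoding = "UTF-8")) # parse JSON body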

API example - IBGE

Opening the API URL (shown below) in your browser will return the data directly

API example - IBGE

But it can also easily be read by a program like R or Python

url <- "https://apisidra.ibge.gov.br/values/t/6579/n3/all/v/9324?formato=json"
pop <- jsonlite::fromJSON(url) # read json

API example - IBGE

  • For larger sites, there are often existing packages in R/Python etc. that access these data without you having to learn the API syntax
  • In R, there is a package to access the IBGE API called sidrar
library(sidrar)

API example - IBGE

Which will return the exact same data with less hassle

pop_sidrar <- get_sidra(6579, variable = 9324, geo = "State") # extract IBGE data

Websites that use APIs & how to access

  • Websites often use APIs to fetch the data they are presenting, although this may not be obvious

  • To find out, you need to use the inspector panel & other browser developer tools

Example - Barry Callebaut’s suppliers

The Elements tab will help you find which HTML nodes contain the desired data

Example - Barry Callebaut’s suppliers

The Network tab will help you find out whether the website is drawing its data from external sources (e.g. an API) and where to find those sources. In this case, Barry Callebaut is accessing data from https://services1.arcgis.com

Example - Barry Callebaut’s suppliers

url2 <- "https://services1.arcgis.com/gASdGGCiDRjdrOYB/arcgis/rest/services/Cooperatives_and_Districts_February_2021/FeatureServer/0/query?f=geojson&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&maxRecordCountFactor=4&outSR=102100&resultOffset=0&resultRecordCount=8000&cacheHint=true&quantizationParameters=%7B%22mode%22%3A%22view%22%2C%22originPosition%22%3A%22upperLeft%22%2C%22tolerance%22%3A1.0583354500042335%2C%22extent%22%3A%7B%22xmin%22%3A-8.34314999999998%2C%22ymin%22%3A2.908540000000073%2C%22xmax%22%3A12.520860000000027%2C%22ymax%22%3A7.605490000000032%2C%22spatialReference%22%3A%7B%22wkid%22%3A4326%2C%22latestWkid%22%3A4326%7D%7D%7D"
data_sf <- geojsonsf::geojson_sf(url2) # read GeoJSON as a simple features (sf) object
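
The result is an ordinary sf object, so it can be checked with a quick plot (assuming the sf package is installed):

library(sf)
plot(st_geometry(data_sf)) # quick map of the cooperative/district geometries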

Websites that need to be scraped

Only scrape a website if:

  • The site/site owner does not provide an API
  • The site does not access its data from an accessible external source (e.g. via an API)
    • Note: some sites may use a password-protected external source, in which case you will still need to scrape

Basic procedure of a web scrape

Scraping is done through three key steps:

  1. Get the HTML for the web page that you want to scrape
  2. Determine what part(s) of the page contain the data you want & which HTML tags/CSS selectors/XPATH expressions refer to those part(s) of the page
  3. Select the desired HTML elements & parse them into the appropriate data type (shapefile, data table etc.) (see the sketch below)
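
A minimal sketch of these three steps with rvest (the URL & CSS selector are placeholders for whatever page & nodes you are after):

library(rvest)

html  <- read_html("https://example.com/prices")   # 1. get the HTML (hypothetical page)
nodes <- html_elements(html, "table.price-table")  # 2. select the node(s) holding the data (hypothetical selector)
dat   <- html_table(nodes[[1]])                    # 3. parse into a data frame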

Working with HTML

  • HTML uses a tree structure of nodes (also called elements), styled using CSS
  • A node has an HTML tag & can have a CSS ID or classes, giving the selector syntax <tag#id.class>
  • Specific nodes can be identified through their tag/ID/class or through XPATH, a query language for node selection (see the sketch below)
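
For example, the same (hypothetical) node could be selected either way, reusing the html object from the sketch above:

html_elements(html, "div#results.data-table")         # by tag, CSS ID & class
html_elements(html, xpath = "//div[@id = 'results']") # by XPATH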

Example - AgroLink

Scraping less structured data

  • Much of the online data you may want to scrape will not already be structured into tables & may lack unique nodes for each data type
  • For instance, Cargill’s grievance data

Example - Cargill grievances

url4 <- "https://www.cargill.com/sustainability/palm-oil/managing-grievances"
html2 <- read_html(url4) #stage 1

# scrape company names (dropdown headers)
headers <- html_elements(html2,".showhide-header") # stage 2 
comps   <- html_text2(headers, preserve_nbsp=TRUE) # stage 3

#scrape grievance issue per company & per grievance entity  
grievance <- html2 %>%
  html_elements(".mod-content") %>%                  #extract node for each company
  html_text(trim=T)  %>%                             #extract text
  as.list() %>%                                      #convert to list
  map(str_split,pattern="Issue Under Review: ",simplify=T) %>%     #split at issue
  map(function(x){x[x!=""]}) %>%                                   #drop empty strings
  map(function(x){map(x,str_split,pattern="\n\n",n=3,simplify=T)}) #split at \n\n 3x

#attach company names to grievance data
names(grievance) <- comps[comps!=""]
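
# resulting nested structure (the tree below appears to be printed with the data.tree package):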
##                    levelName
## 1  Root                     
## 2   ¦--Aceydesa             
## 3   ¦   °--1                
## 4   ¦--Cargill Tropical Palm
## 5   ¦   ¦--1                
## 6   ¦   ¦--2                
## 7   ¦   °--3                
## 8   ¦--Felda Global Ventures
## 9   ¦   ¦--1                
## 10  ¦   ¦--2                
## 11  ¦   ¦--3                
## 12  ¦   °--4                
## 13  °--Golden Agri-Resources
## 14      °--1

Example - Cargill grievances

# convert from nested list to longform dataframe with reshape2::melt
dat_scrape2 <- grievance %>%                         
  reshape2::melt(level=2) %>%         
  pivot_wider(id_cols=c(L2,L3),       #reshape to wide
              names_from = Var2,
              values_from = value) %>%
  separate(`3`,into = c("Other_info","Action_taken"),
           sep = "Actions Cargill Has Taken to Date") %>%
  separate(`2`,into=c("Entity","Date","Status"),
           sep = "\n|-|–") %>%
  rename(c("Supplier"=L2,"Subsupplier"=L3,"Grievance"=`1`))

Key takeaways

  1. Scraping is messy & very fiddly
  2. Where data has clear structure reflected in nodes with unique identifiers (via HTML tag/CSS/XPATH), scraping can be completed with a very small amount of code
  3. Where data has a messy structure, you should:
    • Use recursive procedures to work node by node (e.g. purrr::map in R)
    • Use nested data formats that maintain HTML tree structure (e.g. lists in R)
    • Use string splitting tools to split data where appropriate
    • If/else clauses can also achieve the same goal (not covered in this presentation)

Things not covered

  • How to deal with scrapes that don’t work & thus break your code (can use purrr::possibly)
  • Dynamic websites & how to scrape them (e.g. using RSelenium or POST requests)
  • How to add random delay into queries to avoid overloading websites’ servers & getting blocked (e.g. via recursive loop that includes Sys.sleep)
  • Polite scraping with the polite package (makes it easier to identify yourself & your intentions while scraping)
  • Scraping using other platforms (e.g. Python)

Conclusion

  • A lot of data online is available through APIs, even when it doesn’t appear to be
  • Where possible, use an API, as it will almost always be easier & more reliable than scraping (it will also avoid overloading the website)
  • In either case, the inspector panel & developer tools are your friends
    • Use the network tab to search for an API or other external data source
    • Use the elements tab to search for the node(s) you want to scrape


Note: the code used to create this presentation can be found on my GitHub

Useful packages in R

  • rvest - main package for scraping websites in R (uses xml2 & httr)
  • xml2 - parses XML & HTML (languages used for encoding data on the web); the XML package is an alternative
  • httr - makes http requests (e.g. to an API) easier
  • RSelenium - advanced scraping tool that allows you to create a virtual browser (e.g. to get around logins)
  • data.tree - package for working with/visualising tree structured data (like XML/HTML)
  • polite - scraping sites politely
  • jsonlite - read json data into R easily
  • geojsonsf - read GeoJSON data into R easily (used above; geojsonio is an alternative)
  • tidyverse - a group of packages that make coding easier; includes purrr::map() as well as %>% & tidyr::pivot_wider() used here

Thanks
 And…