14.10.2021

Overview of presentation

  1. Difference between APIs & web scraping
  2. How to extract data via an API
  3. How to determine if a website is using an API
  4. How to scrape data

Note: examples will use R

What is web scraping?

  • A range of techniques for using a computing platform (e.g. R or Python) to extract data embedded in websites & store it in easy-to-analyse formats
  • It is often time-consuming, fiddly & computationally intensive
  • Web scraping can also breach a website’s usage rights & could lead to you being blocked by the site you are scraping

Ethics of web scraping

  • Web scraping can be illegal if:
    • Terms & conditions specifically prohibit downloading/copying content
    • You pass the data off as your own or republish it in its original form (i.e. breach “fair use”)
  • In practice, scraping is tolerated if you do not disrupt websites’ regular use
    • This can occur if you query a website repeatedly & access a large number of pages (i.e. send many requests to the site’s server), causing the server to run out of resources or crash & blocking normal users’ access
    • This is essentially a Denial of Service (DoS) attack & can lead to you being blocked by the website
    • To avoid this, you can add a random delay between requests, giving the server enough time to handle requests from all users (a minimal sketch follows this list)
  • Further reading here & here
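
For illustration, one simple way to add such a delay (the URLs below are placeholders, not a real target site):

urls  <- c("https://example.com/page1", "https://example.com/page2") # hypothetical pages
pages <- list()
for (u in urls) {
  pages[[u]] <- rvest::read_html(u)     # fetch one page
  Sys.sleep(runif(1, min = 2, max = 5)) # random 2-5 second pause so the server isn't flooded
}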

What are Application Programming Interfaces (APIs)?

  • APIs are lightweight, structured interfaces that allow a program/computer to access the features or data of another program/computer directly & should come with documentation explaining how to use them
  • REST APIs are the most common & are used by large organisations (e.g. Twitter, UNDP, IBGE) to help you access their data
  • APIs are also used by many websites to fetch the data they display (a minimal request sketch follows this list)
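
In practice, calling a REST API from R usually just means sending an HTTP GET request & parsing the response; a minimal sketch with httr (the endpoint & parameters below are placeholders, not a real API):

library(httr)

resp <- GET("https://api.example.com/v1/data",           # hypothetical endpoint
            query = list(year = 2021, format = "json"))  # hypothetical parameters
stop_for_status(resp)                                     # error out if the request failed
dat  <- jsonlite::fromJSON(content(resp, as = "text", encoding = "UTF-8")) # parse JSON body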

API example - IBGE

Opening the API URL (shown below) in your browser will return the data directly

API example - IBGE

But it can also easily be read by a program like R or Python

url <- "https://apisidra.ibge.gov.br/values/t/6579/n3/all/v/9324?formato=json"
pop <- jsonlite::fromJSON(url) # read json

API example - IBGE

  • For larger sites, there are often existing packages in R/Python etc. that access these data without you having to learn the API syntax
  • In R, there is a package to access the IBGE API called sidrar
library(sidrar)

API example - IBGE

Which will return the exact same data with less hassle

pop_sidrar <- get_sidra(6579, variable = 9324, geo = "State") # extract IBGE data

Websites that use APIs & how to access

  • Websites often use APIs to fetch the data they are presenting, although this may not be obvious

  • To find out, you need to use the inspector panel & other browser developer tools

Example - Barry Callebaut’s suppliers

The Elements tab will help you find which HTML nodes contain the desired data

Example - Barry Callebaut’s suppliers

The Network tab will help you find out whether the website is drawing its data from external sources (e.g. an API) and where to find those sources. In this case, Barry Callebaut is accessing data from https://services1.arcgis.com

Example - Barry Callebaut’s suppliers

url2 <- "https://services1.arcgis.com/gASdGGCiDRjdrOYB/arcgis/rest/services/Cooperatives_and_Districts_February_2021/FeatureServer/0/query?f=geojson&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&maxRecordCountFactor=4&outSR=102100&resultOffset=0&resultRecordCount=8000&cacheHint=true&quantizationParameters=%7B%22mode%22%3A%22view%22%2C%22originPosition%22%3A%22upperLeft%22%2C%22tolerance%22%3A1.0583354500042335%2C%22extent%22%3A%7B%22xmin%22%3A-8.34314999999998%2C%22ymin%22%3A2.908540000000073%2C%22xmax%22%3A12.520860000000027%2C%22ymax%22%3A7.605490000000032%2C%22spatialReference%22%3A%7B%22wkid%22%3A4326%2C%22latestWkid%22%3A4326%7D%7D%7D"
data_sf <- geojsonsf::geojson_sf(url2) # read GeoJSON as a simple features (sf) object
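
The result is an ordinary sf object, so it can be checked with a quick plot (assuming the sf package is installed):

library(sf)
plot(st_geometry(data_sf)) # quick map of the cooperative/district geometries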

Websites that need to be scraped

Only scrape a website if:

  • The site/site owner does not provide an API
  • The site does not access its data from an accessible external source (e.g. via an API)
    • Note: some sites may use a password-protected external source, in which case you will still need to scrape

Basic procedure of a web scrape

Scraping is done through three key steps:

  1. Get the HTML for the web page that you want to scrape
  2. Determine what part(s) of the page contain the data you want & which HTML tags/CSS selectors/XPATH expressions refer to those part(s) of the page
  3. Select the desired HTML elements & parse them into the appropriate data type (shapefile, data table etc.) (see the sketch below)
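
A minimal sketch of these three steps with rvest (the URL & CSS selector are placeholders for whatever page & nodes you are after):

library(rvest)

html  <- read_html("https://example.com/prices")   # 1. get the HTML (hypothetical page)
nodes <- html_elements(html, "table.price-table")  # 2. select the node(s) holding the data (hypothetical selector)
dat   <- html_table(nodes[[1]])                    # 3. parse into a data frame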

Working with HTML

  • HTML uses a tree structure of nodes (also called elements), styled using CSS
  • A node has an HTML tag & can have a CSS ID or classes, giving the selector syntax <tag#id.class>
  • Specific nodes can be identified through their tag/ID/class or through XPATH, a query language for node selection (see the sketch below)
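
For example, the same (hypothetical) node could be selected either way, reusing the html object from the sketch above:

html_elements(html, "div#results.data-table")         # by tag, CSS ID & class
html_elements(html, xpath = "//div[@id = 'results']") # by XPATH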

Example - AgroLink

Scraping less structured data

  • Much of the online data you may want to scrape will not already be structured into tables & may lack unique nodes for each data type
  • For instance, Cargill’s grievance data

Example - Cargill grievances

url4 <- "https://www.cargill.com/sustainability/palm-oil/managing-grievances"
html2 <- read_html(url4) #stage 1

# scrape company names (dropdown headers)
headers <- html_elements(html2,".showhide-header") # stage 2 
comps   <- html_text2(headers, preserve_nbsp=TRUE) # stage 3

#scrape grievance issue per company & per grievance entity  
grievance <- html2 %>%
  html_elements(".mod-content") %>%                  #extract node for each company
  html_text(trim=T)  %>%                             #extract text
  as.list() %>%                                      #convert to list
  map(str_split,pattern="Issue Under Review: ",simplify=T) %>%     #split at issue
  map(function(x){x[x!=""]}) %>%                                   #drop empty strings
  map(function(x){map(x,str_split,pattern="\n\n",n=3,simplify=T)}) #split at \n\n 3x

#attach company names to grievance data
names(grievance) <- comps[comps!=""]
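
# resulting nested structure (the tree below appears to be printed with the data.tree package):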
##                    levelName
## 1  Root                     
## 2   ¦--Aceydesa             
## 3   ¦   °--1                
## 4   ¦--Cargill Tropical Palm
## 5   ¦   ¦--1                
## 6   ¦   ¦--2                
## 7   ¦   °--3                
## 8   ¦--Felda Global Ventures
## 9   ¦   ¦--1                
## 10  ¦   ¦--2                
## 11  ¦   ¦--3                
## 12  ¦   °--4                
## 13  °--Golden Agri-Resources
## 14      °--1

Example - Cargill grievances

# convert from nested list to longform dataframe with reshape2::melt
dat_scrape2 <- grievance %>%                         
  reshape2::melt(level=2) %>%         
  pivot_wider(id_cols=c(L2,L3),       #reshape to wide
              names_from = Var2,
              values_from = value) %>%
  separate(`3`,into = c("Other_info","Action_taken"),
           sep = "Actions Cargill Has Taken to Date") %>%
  separate(`2`,into=c("Entity","Date","Status"),
           sep = "\n|-|–") %>%
  rename(c("Supplier"=L2,"Subsupplier"=L3,"Grievance"=`1`))

Key takeaways

  1. Scraping is messy & very fiddly
  2. Where data has clear structure reflected in nodes with unique identifiers (via HTML tag/CSS/XPATH), scraping can be completed with a very small amount of code
  3. Where data has a messy structure, you should:
    • Use recursive procedures to work node by node (e.g. purrr::map in R)
    • Use nested data formats that maintain HTML tree structure (e.g. lists in R)
    • Use string splitting tools to split data where appropriate
    • If/else clauses can also achieve the same goal (not covered in this presentation)

Things not covered

  • How to deal with scrapes that don’t work & thus break your code (can use purrr::possibly)
  • Dynamic websites & how to scrape them (e.g. using RSelenium or POST requests)
  • How to add random delay into queries to avoid overloading websites’ servers & getting blocked (e.g. via recursive loop that includes Sys.sleep)
  • Polite scraping with the polite package (makes it easier to identify yourself & your intentions while scraping)
  • Scraping using other platforms (e.g. Python)

Conclusion

  • A lot of data online is available through APIs, even when it doesn’t appear to be
  • Where possible, use an API, as it will almost always be easier & more reliable than scraping (it will also avoid overloading the website)
  • In either case, the inspector panel & developer tools are your friends
    • Use the network tab to search for an API or other external data source
    • Use the elements tab to search for the node(s) you want to scrape


Note: the code used to create this presentation can be found on my GitHub

Useful packages in R

  • rvest - main package for scraping websites in R (uses xml2 & httr)
  • xml2 - parses XML & HTML (languages used for encoding data on the web); the XML package is an alternative
  • httr - makes http requests (e.g. to an API) easier
  • RSelenium - advanced scraping tool that allows you to create a virtual browser (e.g. to get around logins)
  • data.tree - package for working with/visualising tree structured data (like XML/HTML)
  • polite - scraping sites politely
  • jsonlite - read json data into R easily
  • geojsonsf - read GeoJSON data into R easily (used above; geojsonio is an alternative)
  • tidyverse - a group of packages that make coding easier; includes purrr::map() as well as %>% & tidyr::pivot_wider() used here

Thanks
 And…