- Difference between APIs & web scraping
- How to extract data via an API
- How to determine if a website is using an API
- How to scrape data
Note: examples will use R
14.10.2021
Note: examples will use R
This will be provide data simply by searching the URL in your browser
But can be also easily read by a program like R or Python
url <- "https://apisidra.ibge.gov.br/values/t/6579/n3/all/v/9324?formato=json" pop <- jsonlite::fromJSON(url) # read json
sidrar
library(sidrar)
Which will return the exact same data with less hassle
pop_sidrar <- get_sidra(6579, variable = 9324,geo="State") #extract IBGE data
Often websites use APIs to access the data they are presenting, although this may not be obvious
To determine this you need to use the inspector panel & other developer tools
Element tab will help you find which HTML nodes containing the desired data
Network tab will help you find out whether the website is drawing its data from external sources (e.g. an API) and where to find these sources. In this case, Barry Hallebaut is accessing data from https://services1.arcgis.com
url2 <- "https://services1.arcgis.com/gASdGGCiDRjdrOYB/arcgis/rest/services/Cooperatives_and_Districts_February_2021/FeatureServer/0/query?f=geojson&where=1%3D1&returnGeometry=true&spatialRel=esriSpatialRelIntersects&outFields=*&maxRecordCountFactor=4&outSR=102100&resultOffset=0&resultRecordCount=8000&cacheHint=true&quantizationParameters=%7B%22mode%22%3A%22view%22%2C%22originPosition%22%3A%22upperLeft%22%2C%22tolerance%22%3A1.0583354500042335%2C%22extent%22%3A%7B%22xmin%22%3A-8.34314999999998%2C%22ymin%22%3A2.908540000000073%2C%22xmax%22%3A12.520860000000027%2C%22ymax%22%3A7.605490000000032%2C%22spatialReference%22%3A%7B%22wkid%22%3A4326%2C%22latestWkid%22%3A4326%7D%7D%7D" data_sf <- geojsonsf::geojson_sf(url2) #read geojson as simple feature shapefile
Only scrape a website if:
Scraping is done through three key steps:
<tag#id.class>
Agrolink is a Brazilian agro-business company that tracks prices of various commodities over time, incl. soy
library(rvest) # read necessary R packages library(xml2) url3 <- "https://www.agrolink.com.br/cotacoes/historico/mt/soja-em-grao-sc-60kg" html <- read_html(url3) # stage 1: get HTML node <- html_element(html,".table-striped.agk-cont-tb1") # stage 2: extract elements dat_scrape <- html_table(node,dec=",") # stage 3: read as data table
url4 <- "https://www.cargill.com/sustainability/palm-oil/managing-grievances" html2 <- read_html(url4) #stage 1 # scrape company names (dropdown headers) headers <- html_elements(html2,".showhide-header") # stage 2 comps <- html_text2(headers, preserve_nbsp=TRUE) # stage 3 #scrape grievance issue per company & per grievance entity grievance <- html2 %>% html_elements(".mod-content") %>% #extract node for each company html_text(trim=T) %>% #extract text as.list() %>% #convert to list map(str_split,pattern="Issue Under Review: ",simplify=T) %>% #split at issue map(function(x){x[x!=""]}) %>% #drop empty strings map(function(x){map(x,str_split,pattern="\n\n",n=3,simplify=T)}) #split at \n\n 3x #attach company names to grievance data names(grievance) <- comps[comps!=""]
## levelName ## 1 Root ## 2 ¦--Aceydesa ## 3 ¦ °--1 ## 4 ¦--Cargill Tropical Palm ## 5 ¦ ¦--1 ## 6 ¦ ¦--2 ## 7 ¦ °--3 ## 8 ¦--Felda Global Ventures ## 9 ¦ ¦--1 ## 10 ¦ ¦--2 ## 11 ¦ ¦--3 ## 12 ¦ °--4 ## 13 °--Golden Agri-Resources ## 14 °--1
# convert from nested list to longform dataframe with reshape2::melt dat_scrape2 <- grievance %>% reshape2::melt(level=2) %>% pivot_wider(id_cols=c(L2,L3), #reshape to wide names_from = Var2, values_from = value) %>% separate(`3`,into = c("Other_info","Action_taken"), sep = "Actions Cargill Has Taken to Date") %>% separate(`2`,into=c("Entity","Date","Status"), sep = "\n|-|–") %>% rename(c("Supplier"=L2,"Subsupplier"=L3,"Grievance"=`1`))
purrr::map
in R)lists
in R)purrr::possibly
)Sys.sleep
)polite
package (makes it easier to identify yourself & your intentions while scraping)
note: code used to create this presentation can be found on my GitHub
rvest
- main package for scraping websites in R (uses xml2
& httr
)xml2
- parses XML & HTML(Languages used for encoding data on the web). XML
is an alternativehttr
- makes http requests (e.g. to an API) easierRSelenium
- advanced scraping tool that allows you to create virtual browser (e.g. to get around logins)data.tree
- package for working with/visualising tree structured data (like XML/HTML)polite
- scraping sites politelyjsonlite
- read json data into R easilygeojsonio
- read geojson data into R easilytidyverse
- group of packages to make coding easier, includes purrr::map()
as well as %>%
and tidyr::pivot_wider
used here