Web Scraping with Processing

After our workshop with Tom on web scraping, we were asked to go and try scraping some data ourselves. We looked at writing a scraper in Python, which I found a little hard to get my head around. As I have worked in Processing before, it seemed logical to try to replicate a scraper using P5.

The following code can be input into Processing and used to scrape the HTML data from a given URL, printing each line along with a count of how many lines were scraped.

void draw() {
  String[] lines = loadStrings("http://clive-wright.co.uk/"); // Input chosen URL here; re-scraped each frame
  println("there are " + lines.length + " lines"); // This states "there are X lines"
  for (int i = 0; i < lines.length; i++) { // Following counts 'i' for each line scraped
    println(lines[i]); // Prints line 'i'
  }
  delay(1000); // Time delay before the scraper runs again
}

My idea for this web scraper would be to use it on a live-updating webpage, hence the addition of the time delay at the bottom instead of a stop command: each pass through draw() scrapes the page again a second after the last.

Web Scraping

We looked at different kinds of data scraping today.
Here is a short Python sketch we looked at that pulls information on recent earthquakes:

#!/usr/bin/env python

import scraperwiki
import requests
import lxml.html

html = requests.get("http://www.earthquakes.bgs.ac.uk/earthquakes/recent_world_events.html").content

dom = lxml.html.fromstring(html)

count = 0

for row in dom.cssselect('tr'):
    this_row = ""
    for cell in row:
        this_row += cell.text_content() + " "  # collect the text of each cell in the row
    print this_row
    unique_keys = ['id']
    data = {'id': count, 'data': this_row}
    scraperwiki.sql.save(unique_keys, data)  # save each row into the ScraperWiki datastore
    count += 1


We also looked at the Wikipedia API.
Here is my example, which looks at the most recent changes to the 'Cat' article:
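A minimal sketch of such a query, using the MediaWiki API's standard query/revisions parameters (this is written for Python 3, unlike the Python 2 sketch above, and the helper function names here are my own):

import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

def parse_revisions(data):
    # The API nests revisions under query -> pages -> <pageid>;
    # flatten that into a single list
    revisions = []
    for page in data['query']['pages'].values():
        revisions.extend(page.get('revisions', []))
    return revisions

def recent_revisions(title, limit=5):
    # Ask the MediaWiki API for the most recent edits to a page
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvlimit': limit,
        'rvprop': 'timestamp|user|comment',  # who changed what, and when
        'format': 'json',
    }
    return parse_revisions(requests.get(API_URL, params=params).json())

Calling recent_revisions('Cat') then gives a list of the five latest edits to the article, each with a timestamp, the editor's username, and their edit summary.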