HSS8120 : Scraping
This session introduces scraping as a cultural paradigm and as an artistic (and research) activity. We will look at some examples of scraping in art, music, visualisation and performance, and learn about some of the tools and approaches used. From there we will think about what it means to use scraped data in our own creative work. The key point is that scraping (for me) signifies an approach to gathering data that emphasises activity, effort, non-linearity and contingency. It is, in some senses, the opposite of ‘fake news’, whose main feature is that it is provided to us.
SECTION 1: A completely selective and loosely structured overview of scraping:
Neurotic Armageddon Indicator: a wall clock for the end of the world.
- web data and conventions of use
- figurative statistics
- qualitative and quantitative research
Daily Paywall, Paolo Cirio
- Hacking newspaper paywalls and scraping content
- aggregating culture
- vagaries of scale
My own work on the Bloodaxe archive.
- audiences for new hybrid objects
Audio scraping as mapping a layer. The Quiet Walk (Alessandro Altavilla, Tom Schofield)
Bloodaxe Archive scrapings
- As a contemporary form of scraping to make a mark. See William Blake.
As an industry
Data ‘sifting’ is now a substantial industry.
What does scraping get you? What does the term do?
- It suggests an active mode of data gathering
- It carries with it associated activities: filtering, ordering, saving – all of which can structure your work in culturally-situated ways. Thinking about these activities as a site of work can inform your practice.
- It provides a series of productive metaphors which can, in turn, become practices – digging, uncovering, sifting.
- It can provide kinds of gesture – think burins, trowels, fingernails.
When can you scrape?
Only where the site's terms of use allow it. For instance, on Wired.com you can’t:
copy, harvest, crawl, index, scrape, spider, mine, gather, extract, compile, obtain, aggregate, capture, or store any Content, including without limitation photos, images, text, music, audio, videos, podcasts, data, software, source or object code, algorithms, statistics, analysis, formulas, indexes, registries, repositories, or any other information available on or through the Service, including by an automated or manual process or otherwise, if we have taken steps to forbid, prohibit, or prevent you from doing so;
SECTION 2: To Work!
First we’ll need Python 2.7
And also pip, to install Python packages
And it helps to have a code editor such as Sublime Text 2
- look for features of interest
- find secrets
- look for things you can use
- think about the way you move through a building
For instance, SSID one-line ASCII art.
Using an API. Many APIs require you to register and receive a key.
Using the MediaWiki API, what can we find on our subject? How could this be used computationally to tell us something that we couldn’t simply read? What does this mean for the humanities – for humanism?
For instance, we can programmatically generate a list of images for a given subject, such as earthquakes.
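As a rough illustration of that image-list idea, here is a minimal sketch using only the standard library. It assumes the English Wikipedia endpoint and the MediaWiki `action=query` / `prop=images` request; the function names, the User-Agent string and the sample reply are made up for this handout, and the imports fall back so that it should run on Python 2.7 as well as Python 3.

```python
import json

try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen
except ImportError:  # Python 2.7, as used in this session
    from urllib import urlencode
    from urllib2 import Request, urlopen

# Assumption: English Wikipedia; every MediaWiki site has its own api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_image_query(title, limit=50):
    """Build the URL asking which image files a page uses."""
    params = [
        ("action", "query"),
        ("prop", "images"),
        ("titles", title),
        ("imlimit", limit),
        ("format", "json"),
    ]
    return API_URL + "?" + urlencode(params)


def extract_image_titles(response):
    """Pull the file titles out of a decoded JSON reply."""
    titles = []
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for image in page.get("images", []):
            titles.append(image["title"])
    return titles


def fetch_image_titles(title):
    """Ask the live API (needs a network connection)."""
    # Hypothetical User-Agent: identify your own scraper here.
    request = Request(build_image_query(title),
                      headers={"User-Agent": "hss8120-example/0.1"})
    reply = urlopen(request).read().decode("utf-8")
    return extract_image_titles(json.loads(reply))


# A hand-made reply in the shape the API returns, so the parsing step
# can be tried offline. The page id and file names are invented.
SAMPLE_RESPONSE = {
    "query": {
        "pages": {
            "1": {
                "title": "Earthquake",
                "images": [
                    {"ns": 6, "title": "File:Example fault map.png"},
                    {"ns": 6, "title": "File:Example seismogram.jpg"},
                ],
            }
        }
    }
}

print(build_image_query("Earthquake"))
print(extract_image_titles(SAMPLE_RESPONSE))
```

Calling `fetch_image_titles("Earthquake")` would run the same steps against the live site. Keeping the URL building and the parsing in separate functions means each can be changed, or swapped for another `prop`, without touching the network code.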
Look at the tutorial here. What else can you find of use?
SECTION 3: To Play!
Write a scraper (in Python) that turns unusable public data into something useful.
Writing a scraper.
- identify a changing data source on the web (there’s a cool one here)
- check for a robots.txt file to see if what you want to do is allowed. The site above has one here.
- check any licensing information that will tell you what you can and can’t do with the data.
- look at the HTML and find something (an id, a class, a pattern of tags) that uniquely identifies the thing that we want
- adapt the earthquake scraper to get it
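Two of the steps above, checking robots.txt and picking out the part of the HTML we want, can be sketched with the standard library alone. This is not the earthquake scraper from the tutorial, just one possible starting point: the robots.txt rules, the HTML, the `example.org` address, the agent name and the class name `reading` are all invented for illustration. The collector is deliberately simple and will miscount if the matched element contains unclosed tags such as `<br>`.

```python
try:  # Python 3
    from html.parser import HTMLParser
    from urllib.robotparser import RobotFileParser
except ImportError:  # Python 2.7, as used in this session
    from HTMLParser import HTMLParser
    from robotparser import RobotFileParser


def is_allowed(robots_txt, url, agent="*"):
    """Check a URL against the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(agent, url)


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element carrying a given class."""

    def __init__(self, wanted_class):
        HTMLParser.__init__(self)
        self.wanted_class = wanted_class
        self.depth = 0   # greater than 0 while inside a matching element
        self.found = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1           # a new match starts here
            self.found.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def extract_by_class(html, wanted_class):
    """Return the text of each element whose class list includes wanted_class."""
    collector = ClassTextCollector(wanted_class)
    collector.feed(html)
    collector.close()
    return collector.found


# Invented examples, standing in for a real site's robots.txt and page.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

SAMPLE_HTML = """
<ul>
  <li class="reading">Station A: <b>4.2</b></li>
  <li class="note">ignore me</li>
  <li class="reading latest">Station B: 3.9</li>
</ul>
"""

# "hss8120-scraper" is a hypothetical agent name.
print(is_allowed(SAMPLE_ROBOTS, "https://example.org/data/latest.html", "hss8120-scraper"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.org/private/log.html", "hss8120-scraper"))
print(extract_by_class(SAMPLE_HTML, "reading"))
```

On a real site you would download the robots.txt and the page first and pass their text to these functions. Passing the robots.txt check only means crawlers are not asked to stay away; the licensing and terms of use still decide what you may do with the data.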