Extract and use open web data

A colleague of mine, who helps a development aid organisation with data management, asked me the following. Some people are currently collecting data from several sources on the web (the reason is not important). How can we save time and effort by automatically extracting data from websites?

So I searched for a couple of hours, and this is what I found. Maybe it helps others, and maybe others can add information in a reply to this blog :-).

I found some tools for web data extraction (screen scraping, APIs etc.), as well as sites about such tools and collections of them:
  • simply using Google to get data from a table on a website:
  • a blog by someone who seems to do a lot in the field of web data extraction: . Also see his comparison of a couple of tools at
  • an informative page from IBM about using Ruby for web data extraction:
  • a nice looking solution:
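
To give an idea of what the simplest form of screen scraping looks like, here is a minimal sketch in Python that pulls the rows out of an HTML table using only the standard library. It is not taken from any of the tools above. The names TableExtractor and extract_table_rows, and the sample HTML with its made-up countries and numbers, are my own assumptions for illustration. A real page would first have to be downloaded, and messy markup usually calls for a more forgiving parser.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row.

    Hypothetical helper for illustration; it ignores nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def _finish_cell(self):
        # Store the current cell (if any) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _finish_row(self):
        # Store the current row (if any) in the result list.
        self._finish_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._finish_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._finish_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._finish_cell()
        elif tag in ("tr", "table"):
            self._finish_row()


def extract_table_rows(html):
    """Return all table rows found in an HTML string as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page; the countries and numbers are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Exampleland</td><td>1000</td></tr>
  <tr><td>Samplestan</td><td>2500</td></tr>
</table>
"""

table_rows = extract_table_rows(SAMPLE_HTML)
for row in table_rows:
    print(", ".join(row))
```

The rows come back as plain lists of strings, so they can be written to a CSV file or loaded into a spreadsheet from there.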

Apart from tools, it can also matter where you get your data from. Some sources may have better quality data, or their data may be easier to extract:
  • a nice table showing, per country, whether (and where) it supplies some common country data:
  • some public open data sources, including examples of visualizing trends (for example the average age at which people die and the average number of children per couple):
  • the CIA publishes a nice set of data for all countries in the world in its World Factbook:
  • the World Bank supplies data via data APIs: ... also, the US has a nice overview of open government data:
  • also have a look at this Wikipedia page about open data:
  • even more data sources:
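
When a source offers a data API, as the World Bank does, no scraping is needed: you request a URL and get structured data back. Below is a rough sketch in Python of how that could look. It assumes the API returns JSON shaped as a two-element list, with paging metadata first and a list of records second, where each record has a "country" object, a "date" and a "value". That shape is my assumption and should be checked against the API documentation of the source you use. The function names and the sample payload (with made-up countries and numbers) are hypothetical, and the URL is left for you to fill in.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=30):
    """Download a URL and decode its body as JSON.

    The caller supplies the URL; no endpoint is assumed here.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def records_to_rows(payload):
    """Flatten an assumed [metadata, records] payload into tuples.

    Each tuple is (country name, date string, value). Records without
    a value are skipped.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    rows = []
    for record in payload[1]:
        value = record.get("value")
        if value is None:
            continue
        rows.append((record["country"]["value"], record["date"], value))
    return rows


# Made-up payload in the assumed shape; not real data.
SAMPLE_PAYLOAD = json.loads("""
[
  {"page": 1, "pages": 1, "per_page": 50, "total": 3},
  [
    {"country": {"id": "XA", "value": "Exampleland"}, "date": "2012", "value": 1000},
    {"country": {"id": "XA", "value": "Exampleland"}, "date": "2011", "value": null},
    {"country": {"id": "XB", "value": "Samplestan"}, "date": "2012", "value": 2500}
  ]
]
""")

indicator_rows = records_to_rows(SAMPLE_PAYLOAD)
for country, date, value in indicator_rows:
    print(country, date, value)
```

With a real source you would call records_to_rows(fetch_json(url)) instead of using the sample payload.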

Other info:
  • a nice backgrounder for web data extraction:
  • another nice and interesting website when you want to start supplying and/or using open data: ("The White House developed Project Open Data – this collection of code, tools, and case studies – to help agencies adopt the Open Data Policy and unlock the potential of government data")
  • another site you should check out: ("Founded by Sir Tim Berners-Lee and Professor Nigel Shadbolt, the ODI is an independent, non-profit, non-partisan, limited by guarantee company")
  • an initiative to standardize the data formats for humanitarian data: "Humanitarian Exchange Language (HXL)" at

