Saturday, May 17, 2014

Extract and use open web data

A colleague of mine, who helps a development aid organisation with data management, asked me the following. Some people there are currently collecting data from several sources on the web (the reason is not important). How can they save time and effort by extracting data from websites automatically?

So I searched for a couple of hours, and this is what I found. Maybe it helps others, and maybe others can add information in a reply to this blog post :-).

I found some tools for web data extraction (screen scraping, APIs, etc.), as well as sites about such tools and collections of them:
  • simply using Google Docs to get data from a table on a website: http://eagereyes.org/data/scrape-tables-using-google-docs
  • a blog from a guy who seems to be doing a lot in the field of web data extraction: http://scraping.pro/ . Also see his comparison of a couple of tools at http://scraping.pro/web-scraper-test-drive/
  • an informative page from IBM about using Ruby for web data extraction: http://www.ibm.com/developerworks/library/os-extractruby/
  • a nice looking solution: http://www.grepsr.com/
  • http://www.wired.com/2014/03/kimono/
  • http://maraboustork.co.uk/
  • http://www.capterra.com/web-data-extraction-software
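
To give an idea of what screen scraping a table comes down to, here is a minimal Python sketch that uses only the standard library. It is not taken from any of the tools above. The names TableExtractor and extract_table and the sample page are made up for illustration, and a real page would be downloaded first and would probably need more care (nested tables, merged cells).

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text chunks of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table(html):
    """Return all table rows found in the given HTML as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page; the values are placeholders, not real statistics.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Country A</td><td>100</td></tr>
  <tr><td>Country B</td><td>200</td></tr>
</table>
"""

rows = extract_table(SAMPLE_HTML)
for row in rows:
    print(", ".join(row))
```

The rows come back as plain lists of strings, so they can be written to a CSV file or loaded into a spreadsheet from there.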


Apart from tools, it could also matter where you get the data from. Maybe some sources have better quality data, or their data can be extracted more easily:
  • a nice table showing, per country, whether (and where) it publishes some common country datasets: http://national.census.okfn.org/
  • some public open data sources, including examples of visualizing trends (for example the average age at which people die and the average number of children per couple): http://www.google.com/publicdata/directory
  • the CIA publishes a nice set of data for all countries in the world in its World Factbook: https://www.cia.gov/library/publications/the-world-factbook/wfbExt/region_afr.html
  • the World Bank supplies data via data APIs: http://data.worldbank.org/developers
  • the US also has a nice overview of open government data: https://www.data.gov/developers/apis
  • also have a look at this Wikipedia page about open data: https://en.wikipedia.org/wiki/Open_data
  • even more data sources: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
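
When a source offers an API, as the World Bank does, no scraping is needed at all. Below is a rough Python sketch of how that could look. It assumes the World Bank's v2 JSON endpoint at api.worldbank.org and its usual response shape (a two-element list: paging metadata, then the records). The function names and the sample payload are my own illustration with placeholder values, so check the developer pages listed above before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the v2 endpoint of the World Bank indicators API.
API_ROOT = "https://api.worldbank.org/v2"


def indicator_url(country, indicator, per_page=100):
    """Build the request URL for one country and one indicator."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{API_ROOT}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator(payload):
    """Turn a decoded response into {year: value}, skipping missing values."""
    if not isinstance(payload, list) or len(payload) < 2 or not payload[1]:
        return {}
    return {
        row["date"]: row["value"]
        for row in payload[1]
        if row["value"] is not None
    }


def fetch_indicator(country, indicator):
    """Download and parse one indicator series (needs network access)."""
    with urlopen(indicator_url(country, indicator), timeout=30) as response:
        return parse_indicator(json.load(response))


# Illustrative payload in the assumed shape; the values are placeholders.
SAMPLE_PAYLOAD = [
    {"page": 1, "pages": 1, "per_page": 100, "total": 3},
    [
        {"date": "2013", "value": 300},
        {"date": "2012", "value": None},
        {"date": "2011", "value": 100},
    ],
]

series = parse_indicator(SAMPLE_PAYLOAD)
url = indicator_url("KEN", "SP.POP.TOTL")
```

With a network connection, fetch_indicator("KEN", "SP.POP.TOTL") would request the total population series for Kenya. The download is kept separate from the parsing so the parsing can be tried out on a saved response first.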


Other info:
  • a nice backgrounder for web data extraction: https://en.wikipedia.org/wiki/Web_scraping
  • another nice and interesting website when you want to start supplying and/or using open data: https://project-open-data.github.io/ ("The White House developed Project Open Data – this collection of code, tools, and case studies – to help agencies adopt the Open Data Policy and unlock the potential of government data")
  • another site you should check out: http://theodi.org/ ("Founded by Sir Tim Berners-Lee and Professor Nigel Shadbolt, the ODI is an independent, non-profit, non-partisan, limited by guarantee company")
  • an initiative to standardize the data formats for humanitarian data: "Humanitarian Exchange Language (HXL)" at https://sites.google.com/site/hxlproject/home/project-overview
Image source: http://www.data-processing.hk/web-data-capture .
