A colleague of mine, who helps a development aid organisation with data management, asked me the following. Some people are currently collecting data from several sources on the web (the reason is not important). How can we save time and effort by automatically extracting data from websites?
So I searched for a couple of hours, and this is what I found. Maybe it helps others, and maybe others can add information in a reply to this blog post :-).
I found some tools for web data extraction (screen scraping, APIs, etc.), as well as sites about such tools and collections of them:
- simply using Google Docs to get data from a table on a website: http://eagereyes.org/data/scrape-tables-using-google-docs
- a blog by someone who seems to be doing a lot in the field of web data extraction: http://scraping.pro/ . Also see his comparison of a couple of tools at http://scraping.pro/web-scraper-test-drive/
- an informative page from IBM about using Ruby for web data extraction: http://www.ibm.com/developerworks/library/os-extractruby/
- a nice looking solution: http://www.grepsr.com/
- http://www.wired.com/2014/03/kimono/
- http://maraboustork.co.uk/
- http://www.capterra.com/web-data-extraction-software
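To give an idea of what these tools do under the hood, here is a minimal sketch of pulling a table out of a web page with nothing but the Python standard library. The names `TableExtractor`, `extract_tables` and `SAMPLE_HTML` are my own, made up for this illustration, and the sketch assumes simple, non-nested tables. Real pages are usually messier, which is exactly why the tools above exist.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the cell text of every <table> on a page, row by row.

    Illustrative helper (not from any of the tools listed above).
    Assumes tables are not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # row currently being filled, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables in the HTML as lists of rows of cell strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# A tiny hand-written page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Kenya</td><td>Nairobi</td></tr>
  <tr><td>Ghana</td><td>Accra</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_tables(SAMPLE_HTML)[0]:
        print(row)
```

For a real page you would first download the HTML (for example with `urllib.request.urlopen`) and pass the decoded text to `extract_tables`.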
Apart from the tools, it can also matter where you get your data from. Some sources may have better quality data, or data that can be extracted more easily:
- a nice table showing, per country, whether (and where) it publishes some common country data: http://national.census.okfn.org/
- some public open data sources, including examples of visualizing trends (for example the average age at which people die and the average number of children per couple): http://www.google.com/publicdata/directory
- the CIA publishes a nice set of data for all countries in the world in its World Factbook: https://www.cia.gov/library/publications/the-world-factbook/wfbExt/region_afr.html
- the World Bank supplies data via data APIs: http://data.worldbank.org/developers ... the US also has a nice overview of open government data: https://www.data.gov/developers/apis
- also have a look at this Wikipedia page about open data: https://en.wikipedia.org/wiki/Open_data
- even more data sources: https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
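When a source offers an API, as the World Bank does, you do not need to scrape at all. Below is a rough sketch of how a script could ask for one indicator for one country. The URL layout and the two-element JSON response (paging metadata first, then the records) reflect my understanding of version 2 of the World Bank API, so please check the developer pages linked above before relying on it. The function names are hypothetical, and the numbers in `SAMPLE_PAYLOAD` are placeholders, not real statistics.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the World Bank API (version 2).
BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, per_page=50):
    """Build the request URL for one indicator of one country."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator_response(payload):
    """Turn the decoded JSON into (year, value) pairs, skipping empty values.

    Assumes the response is a list: [paging metadata, list of records].
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    return [
        (row["date"], row["value"])
        for row in payload[1]
        if row["value"] is not None
    ]


def fetch_indicator(country, indicator):
    """Download and parse one indicator (needs network access)."""
    url = build_indicator_url(country, indicator)
    with urlopen(url, timeout=30) as response:
        return parse_indicator_response(json.load(response))


# Placeholder payload in the assumed response shape; values are NOT real data.
SAMPLE_PAYLOAD = [
    {"page": 1, "pages": 1, "per_page": 50, "total": 3},
    [
        {"date": "2002", "value": 300},
        {"date": "2001", "value": None},
        {"date": "2000", "value": 100},
    ],
]

if __name__ == "__main__":
    print(build_indicator_url("KE", "SP.POP.TOTL"))
    print(parse_indicator_response(SAMPLE_PAYLOAD))
```

Calling `fetch_indicator("KE", "SP.POP.TOTL")` should then return the total population of Kenya per year, assuming the API still behaves as described.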
Other info:
- a nice backgrounder for web data extraction: https://en.wikipedia.org/wiki/Web_scraping
- another nice and interesting website when you want to start supplying and/or using open data: https://project-open-data.github.io/ ("The White House developed Project Open Data – this collection of code, tools, and case studies – to help agencies adopt the Open Data Policy and unlock the potential of government data")
- another site you should check out: http://theodi.org/ ("Founded by Sir Tim Berners-Lee and Professor Nigel Shadbolt, the ODI is an independent, non-profit, non-partisan, limited by guarantee company")
- an initiative to standardize the data formats for humanitarian data: "Humanitarian Exchange Language (HXL)" at https://sites.google.com/site/hxlproject/home/project-overview