Without relevant sources of information to interrogate and analyse, the Data Scientist’s role is redundant. Fortunately, the internet provides a rich and varied source of intelligence that covers most subjects. Some key factors that make internet sources of interest are:
The OpenGovernment initiative, which aims to make public data available online, has made vast quantities of additional data accessible on government websites for users to mine. It has also triggered the publication of detailed analyses and reports, and the provision of applications to aid information scrutiny.

In the table below, I have summarised some rich and interesting data sources that might be worth exploring and analysing with your Data Science tools. To help tailor your data searches, I have included brief details of the subject areas covered, the file formats available and whether the website provides an API (Application Programming Interface) that can be used to directly access or query the data source.
Website Source | Subject Areas | Data formats | API |
---|---|---|---|
Academic Torrents | Miscellaneous (User uploads) | Diverse (csv, jpeg, tar.gz, rar, …) | N/A |
Amazon Web Services Public Datasets | Astronomy, Biology, Chemistry, Climate, Economics, Encyclopedic, Geographic, Mathematics | Diverse (XML, xls, csv, ARC GRID, mat-file, HDF5, …) | N/A |
Crowdflower | Miscellaneous | csv | N/A |
DataCite Re3 data | Humanities & Social Sciences, Life Sciences, Natural Sciences, Engineering Sciences | Diverse (txt, csv, vcf, pdf, jpeg, …) | N/A |
Datahub.io | Miscellaneous | Diverse (csv, xls, rdf, XML, SPARQL, pdf, html, …) | N/A |
Enigma | Oil & Gas, Reference, Company, Healthcare | N/A | |
GeoLite Legacy Downloadable Databases | IP Geolocation, Autonomous System Numbers | Binary, csv | N/A |
Global Open Data Index | Miscellaneous (National Statistics, Government Budget, Legislation, Procurement Tenders…) | csv | JSON |
Google Public Data | Miscellaneous | Diverse (xls, csv, …) | N/A |
Government of Canada Open Data | Miscellaneous (government services) | Diverse (csv, JSON, XML, GeoTIFF, SHP, HTML, …) | cURL, JSON |
Greater London Authority Datastore | Miscellaneous (government services) | Diverse (xls, csv, pdf, HTML, XML, SHP, TSV, tif…) | CKAN |
Hadoop Illuminated | Collection of links | | |
IGSR: The International Genome Sample Resource | Genomes | CRAM | N/A |
Kaggle Data Sources | Collection of links | | |
Kaggle datasets | Miscellaneous | csv | N/A |
KDnuggets | Collection of links | | |
Knoema | Miscellaneous (categorised) | Diverse (ppt, xls, pdf, png, csv, atomsvc) | JSON, OData |
Linked Data | Collection of resources | RDF | N/A |
Medical Data for Machine Learning | Collection of resources | Diverse | N/A |
NASA's Data Portal | Aerospace, Applied Science, Earth Science, Management/Operations, Space Science | Diverse (csv, xls, png, JSON, RDF, RSS, XML, TSV, KML, KMZ, SHP, GeoJSON, …) | JSON, GeoJSON, SODA |
National Renewable Energy Laboratory | Energy forms | XML, KMZ, SHP, jpg, dbf | N/A |
Open Data Inception (multiple data portals) | Collection of worldwide Open Data portals | | |
Ordnance Survey OpenData | Digital Map Data | GeoTIFF, Raster, Vector, csv, txt | N/A |
Quandl | Finance | csv, XML | JSON |
Quora | Collection of links | | |
Reddit Datasets | Collection of links | | |
Stanford Large Network Dataset Collection | Miscellaneous (Social network topics) | tar.gz, txt.gz | N/A |
UCI Machine Learning Repository | Miscellaneous (Life Sciences, Physical Sciences, CS/Engineering, Social Sciences, Business, Game, et al.) | csv | N/A |
UK Government | Miscellaneous (government services) | Diverse (csv, HTML, WMS, xls, pdf, XML, GeoJSON, WCS, WFS) | Proprietary (Basic, SQL) |
UK Land Registry Open Data | House prices | csv | Proprietary (PPB Builder), SPARQL |
UK Police | Crime and Policing | csv | JSON |
UN Comtrade Database | Global Trade Data | csv | JSON |
US Government | Miscellaneous (government services) | Diverse (csv, HTML, WMS, xls, XML, GeoJSON, tif, JSON, RDF, jpg, txt, gml, …) | CKAN |
US Government Census | Census data | Diverse (xls, csv, SHP, dbf, gdb, kml, HTML, WMS, pdf, …) | JSON |
US Government Web Services and XML Data Sources | Miscellaneous (government services) | xls, csv, pdf, XML | N/A |
Weather Underground | Weather and air quality | JSON, XML | JSON |
World Bank Open Data | Global Development Data | xls, XML | JSON |
World Values Survey | Social Values research | SPSS, Stata, pdf, xls | N/A |
World Wildlife Fund Open Data | Conservation Science | Diverse (OSM, GeoJSON, mdb, shp, pdf, XML, …) | Proprietary (InVEST) |
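
Several portals in the table list CKAN as their API (for example the Greater London Authority Datastore and the US Government rows). As a rough illustration of what querying such a source can look like, the sketch below builds a search request for CKAN's Action API and condenses the JSON reply into dataset titles and file formats. The base URL, the helper names (`build_search_url`, `summarise_packages`, `search_portal`) and the idea of summarising by format are my own assumptions for illustration, so check the portal's documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of a CKAN-backed portal; substitute the portal you want to query.
BASE_URL = "https://catalog.data.gov"


def build_search_url(base_url, query, rows=5):
    """Build a CKAN Action API `package_search` URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarise_packages(response):
    """Return a (title, sorted file formats) pair for each dataset in the reply."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    summary = []
    for package in response["result"]["results"]:
        formats = set()
        for resource in package.get("resources", []):
            fmt = resource.get("format")
            if fmt:  # skip resources with a missing or empty format
                formats.add(fmt.upper())
        summary.append((package.get("title", ""), sorted(formats)))
    return summary


def search_portal(query, base_url=BASE_URL, rows=5):
    """Fetch and summarise search results (performs a network request)."""
    with urlopen(build_search_url(base_url, query, rows), timeout=30) as reply:
        return summarise_packages(json.load(reply))
```

Calling `search_portal("air quality")` should then return one title and format list per matching dataset, provided the portal is reachable and exposes the standard CKAN endpoint. Keeping the URL building and the parsing in separate functions means both can be checked without touching the network.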
However, if prepared datasets don't meet your needs, another approach is to seek out and prepare your own data through web-scraping. I have documented just such a case study (using Python).
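
To give a flavour of what web-scraping involves, here is a minimal sketch that pulls the rows of an HTML table into Python lists using only the standard library. It is not the code from the case study; the class and function names (`TableExtractor`, `extract_rows`, `scrape_table`) are hypothetical, and a real project might well prefer a dedicated scraping library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being filled, if any
        self._cell = None  # text fragments of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left unclosed


def extract_rows(html):
    """Parse an HTML string and return its table rows as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def scrape_table(url):
    """Download a page and extract its table rows (performs a network request)."""
    with urlopen(url, timeout=30) as reply:
        charset = reply.headers.get_content_charset() or "utf-8"
        return extract_rows(reply.read().decode(charset, errors="replace"))
```

Before scraping any site, it is worth checking its terms of use and whether it already offers a download or API, since that is usually the more reliable route.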