
Without relevant sources of information to interrogate and analyse, the Data Scientist’s role is redundant. Fortunately, the internet provides a rich and varied source of intelligence that covers most subjects. Some key factors that make internet sources of interest are:

  • the range of subject matter – relevant and representative data can be found for your research;
  • the multiplicity of data providers – information can be combined, triangulated and compared;
  • the digital form – data is readily accessible, searchable and downloadable;
  • the volume of information – a vast amount of raw data can be obtained for processing.

    The OpenGovernment initiative, which aims to make public data available online, has made vast quantities of additional data accessible on government websites for users to mine. It has also triggered the publication of detailed analyses and reports, and the provision of applications to aid scrutiny of that information. In the table below, I have summarised some rich and interesting data sources that might be worth exploring and analysing with your Data Science tools. To help tailor your data searches, I have included brief details of the subject areas covered, the file formats available and whether the website provides an API (Application Programming Interface) that can be used to access or query the data source directly.

    | Website Source | Subject Areas | Data Formats | API |
    |---|---|---|---|
    | Academic Torrents | Miscellaneous (user uploads) | Diverse (csv, jpeg, tar.gz, rar, …) | N/A |
    | Amazon Web Services Public Datasets | Astronomy, Biology, Chemistry, Climate, Economics, Encyclopedic, Geographic, Mathematics | Diverse (XML, xls, csv, ARC GRID, mat-file, HDF5, …) | N/A |
    | Crowdflower | Miscellaneous | csv | N/A |
    | DataCite Re3 data | Humanities & Social Sciences, Life Sciences, Natural Sciences, Engineering Sciences | Diverse (txt, csv, vcf, pdf, jpeg, …) | N/A |
    | Datahub.io | Miscellaneous | Diverse (csv, xls, rdf, XML, SPARQL, pdf, html, …) | N/A |
    | Enigma | Oil & Gas, Reference, Company, Healthcare | | N/A |
    | GeoLite Legacy Downloadable Databases | IP Geolocation, Autonomous System Numbers | Binary, csv | N/A |
    | Global Open Data Index | Miscellaneous (National Statistics, Government Budget, Legislation, Procurement Tenders…) | csv | JSON |
    | Google Public Data | Miscellaneous | Diverse (xls, csv, …) | N/A |
    | Government of Canada Open Data | Miscellaneous (government services) | Diverse (csv, JSON, XML, GeoTIFF, SHP, HTML, …) | cURL, JSON |
    | Greater London Authority Datastore | Miscellaneous (government services) | Diverse (xls, csv, pdf, HTML, XML, SHP, TSV, tif…) | CKAN |
    | Hadoop Illuminated | Collection of links | | |
    | IGSR: The International Genome Sample Resource | Genomes | CRAM | N/A |
    | Kaggle Data Sources | Collection of links | | |
    | Kaggle datasets | Miscellaneous | csv | N/A |
    | KDnuggets | Collection of links | | |
    | Knoema | Miscellaneous (categorised) | Diverse (ppt, xls, pdf, png, csv, atomsvc) | JSON, ODATA |
    | Linked Data | Collection of resources | RDF | N/A |
    | Medical Data for Machine Learning | Collection of resources | Diverse | N/A |
    | NASA's Data Portal | Aerospace, Applied Science, Earth Science, Management/Operations, Space Science | Diverse (csv, xls, png, JSON, RDF, RSS, XML, TSV, KML, KMZ, SHP, GeoJSON, …) | JSON, GeoJSON, SODA |
    | National Renewable Energy Laboratory | Energy forms | XML, KMZ, SHP, jpg, dbf | N/A |
    | Open Data Inception (multiple data portals) | Collection of worldwide Open Data portals | | |
    | Ordnance Survey OpenData | Digital Map Data | GeoTIFF, Raster, Vector, csv, txt | N/A |
    | Quandl | Finance | csv, XML | JSON |
    | Quora | Collection of links | | |
    | Reddit Datasets | Collection of links | | |
    | Stanford Large Network Dataset Collection | Miscellaneous (social network topics) | tar.gz, txt.gz | N/A |
    | UCI Machine Learning Repository | Miscellaneous (Life Sciences, Physical Sciences, CS/Engineering, Social Sciences, Business, Game, et al.) | csv | N/A |
    | UK Government | Miscellaneous (government services) | Diverse (csv, HTML, WMS, xls, pdf, XML, GeoJSON, WCS, WFS) | Proprietary (Basic, SQL) |
    | UK Land Registry Open Data | House prices | csv | Proprietary (PPB Builder), SPARQL |
    | UK Police | Crime and Policing | csv | JSON |
    | UN Comtrade Database | Global Trade Data | csv | JSON |
    | US Government | Miscellaneous (government services) | Diverse (csv, HTML, WMS, xls, XML, GeoJSON, tif, JSON, RDF, jpg, txt, gml, …) | CKAN |
    | US Government Census | Census data | Diverse (xls, csv, SHP, dbf, gdb, kml, HTML, WMS, pdf, …) | JSON |
    | US Government Web Services and XML Data Sources | Miscellaneous (government services) | xls, csv, pdf, XML | N/A |
    | Weather Underground | Weather and air quality | JSON, XML | JSON |
    | World Bank Open Data | Global Development Data | xls, XML | JSON |
    | World Values Survey | Social Values research | SPSS, Stata, pdf, xls | N/A |
    | World Wildlife Fund Open Data | Conservation Science | Diverse (OSM, GeoJSON, mdb, shp, pdf, XML, …) | Proprietary (InVEST) |
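
    Several of the portals above list CKAN as their API. As a rough illustration of what querying such a portal can look like, here is a minimal sketch built on CKAN's `package_search` action. The portal address `https://example.org`, the helper names and the sample payload are all assumptions made for illustration; the real base URL, any rate limits and any access requirements should be taken from the documentation of the portal you choose.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(portal, query, rows=5):
    """Build a CKAN package_search URL for a portal base address."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def resources_by_format(payload, wanted="CSV"):
    """Return (dataset title, resource URL) pairs whose format matches `wanted`."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    matches = []
    for dataset in payload["result"]["results"]:
        for resource in dataset.get("resources", []):
            # Portals are inconsistent about case, so compare upper-cased.
            if (resource.get("format") or "").upper() == wanted.upper():
                matches.append((dataset.get("title"), resource.get("url")))
    return matches


def search_portal(portal, query, wanted="CSV", rows=5):
    """Query a live portal (needs network access) and filter by format."""
    with urlopen(build_search_url(portal, query, rows), timeout=30) as response:
        payload = json.load(response)
    return resources_by_format(payload, wanted)


# Made-up payload in the shape CKAN returns, so the parsing step runs offline.
SAMPLE_PAYLOAD = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "title": "Example house price index",
                "resources": [
                    {"format": "CSV", "url": "https://example.org/data/hpi.csv"},
                    {"format": "PDF", "url": "https://example.org/data/hpi.pdf"},
                ],
            }
        ],
    },
}

if __name__ == "__main__":
    print(build_search_url("https://example.org", "house prices"))
    print(resources_by_format(SAMPLE_PAYLOAD))
```

    Keeping the URL building and the response filtering apart from the network call means both can be checked against a saved response before any live portal is contacted.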

    However, if prepared datasets don't meet your needs, another approach is to seek out and prepare your own data through web scraping. I have documented just such a case study (using Python).
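
    To give a flavour of what web scraping involves, the sketch below pulls the rows out of an HTML table using only Python's standard library. It is not the case study mentioned above; the `TableRows` class, the `scrape_table` helper and the sample HTML are made up for illustration, and a real page would first have to be downloaded, subject to the site's terms of use.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (such as indentation between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Source</th><th>Format</th></tr>
  <tr><td>UK Police</td><td>csv</td></tr>
  <tr><td>World Bank <b>Open Data</b></td><td>xls, XML</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_HTML):
        print(row)
```

    The standard-library parser keeps the example dependency-free. For messy real-world pages, a dedicated HTML parsing library is usually the more robust choice.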