1 minute read

50 Years of Data Science

  • Definitely worth going through again, to identify which project to work on to hone meaningful skills.

Skimmed some interesting parts of Donoho (2017), 50 Years of Data Science. I think it's a follow-up to Tukey's article from 50 years earlier. It gives a pretty good overview of what data science is and lays out a landscape of what the research is and should be, along with a reasonable classification of a Greater Data Science (GDS) that isn't solely focused on teaching predictive modelling. Also read about the role played by the Common Task Framework (CTF), a Kaggle-like competitive setting for producing predictive models on shared common datasets, and how this CTF culture has shaped much of machine learning / deep learning today, with models seeking to outperform benchmarks. GDS was divided into six main classifications:

  1. Data Exploration and Preparation
  2. Data Representation and Transformation
  3. Computing with Data
  4. Data Modelling
  5. Data Visualization and Presentation
  6. Science about Data Science

Today academia and master's programs in data science put most of their focus on 4. Data Modelling and mathematical proofs, while most practitioners estimate that 80% of their work falls under 1. and 2., which is lumped under the umbrella of data cleaning. Fancy methods that show marginal, incremental gains on theoretically clean datasets do not perform any better than the most basic algorithms (HC-Clip and Mas-o-Menos) in the examples given.

There is also a discussion of Cross-Study Methods and Workflows. This is a meta-analysis of how data analysis is carried out on the same problem by different groups, and how, by using different workflows and methods, they reach different conclusions even though they are working from the same dataset.