Data Science vs Statistics
Despite reading four articles that disagree with this opinion, I still think a data scientist is a statistician who can code. In other words, I think all data scientists are statisticians, but not all statisticians are data scientists. You can’t divorce data science from statistics. Statistics is the foundation of data science, and data science would not exist without it. I wouldn’t say that epidemiologists, bookmakers, or statistics teachers aren’t statisticians just because that’s not the title they enter into their job searches. It could be that my definition of statistician is more generous than what the authors of these articles have in mind.
Two of these articles, ‘Machine Learning Engineer vs. Data Scientist’ by Andrew Zola and ‘Data Science vs. Data Analytics vs. Machine Learning: Expert Talk’ by Srihari Sasikumar, use a Venn diagram as a visual aid. The three large circles of the Venn diagram are “Hacking Skills”, “Statistic/Math Know-How”, and “Substantive Expertise”, and data science is the intersection of all three. I think it’s a poor visual. Data science should be a circle completely within “Statistic/Math Know-How” because I don’t know of any data science job that won’t require you to know statistics. I also think nearly all statisticians are “substantive experts”. Even if someone only works in theory, they are still a substantive expert on that theory. It just seems like a weird distinction to make. Finally, we have “hacking skills”, which, since we clearly care about proper definitions, is not the same thing as programming, unless these authors mean hacking in a ‘life hack’ sort of way. In 2020, data scientists not only get to make beautiful visuals relatively easily, but they also get to perform the statistics with super powerful computers. The programming doesn’t replace the statistics know-how; it just makes the conclusions easier to reach.
At the end of the day, I think this is an argument about what to enter into a job search, but, substantively, data scientists are statisticians. I had never actually looked for statistician jobs before, so I did a quick search on Indeed in the DC area, and it seems like statistician is just as nebulous a title as data scientist. I chose the DC area because there are a lot of jobs of all types there; there are government jobs, and the University of Maryland is right there, so I thought I would even catch some academic jobs with this net. Nearly every single statistician and data scientist posting lists both programming experience and statistics as requirements. It does seem that the statistician jobs involve a leadership or senior role over data scientists. The statistician seems to be in control of data collection and experiment design, whereas a data scientist may come in at a later stage to do the cleaning and manipulation of the data. Both roles involve reaching and reporting conclusions from the data that was collected or found.
I personally will still be entering data science into my job searches. I really enjoy the coding and communicative aspects of statistics, which puts me firmly within the data scientist’s job duties. To me, data scientist is a more precise job title than statistician. I find data collection and experimental design, the duties that tend to belong solely to the statistician, tedious, and I’m probably not as good at math as the standard statistician. I would much rather skip to the step where I can clean, explore, and draw conclusions from the data provided. In short, I don’t think data science and statistics are separate things like most of these articles suggest. ODSC’s closing points in their article, ‘Data Scientists Versus Statisticians’, come closest to summing up how I feel about this distinction. They say, “Given time, the fields of data science and statistics likely will converge to a common end-point”, and it could be that, in the two-plus years since this article was published, we have reached this common end-point, but I think statistics was already there and has been for a while. SAS was released almost 50 years ago. Python celebrated its 30th birthday this year. R is going to turn 27 with me next year. I’m struggling to find a field of statistics that doesn’t use any of these tools (or other ones not mentioned). We can no longer divorce programming from statistics.
To leave you with a simple metaphor: these articles are trying to convey that statistics and data science are similar, but not the same. According to their authors, statistics is vanilla ice cream and data science is chocolate. I think statistics is vanilla, and data science is just vanilla with sprinkles.