The current euphoria regarding Big Data is focused on technologies such as Hadoop, visualisation tools and other, often open source, software. Whilst this is all great stuff, the people involved are just as important.
Accompanying the big data bandwagon we have seen the rise of a new role, that of the data scientist. Whilst not exclusively a big data technology job, this new role complements the agenda behind many big data initiatives: uncovering new insights from vast amounts of formerly incomprehensible data.
Many data scientist jobs are advertised these days with huge salaries, so what is a data scientist, what do they actually do, and do they wear a white lab coat? Well, the lab coat is optional, but let's see if we can clear up what one is.
The concept of the data scientist has been around in literature for a while now. In Isaac Asimov’s ‘Foundation’ series we have one of the first examples of data science and big data, under the name of psychohistory.
“It is remarkable, Hari, how the religion of science has grabbed hold.”
― Isaac Asimov, Foundation
The Hari mentioned in the quote is the mathematician Hari Seldon, the inventor of psychohistory in Asimov’s books. Psychohistory is an algorithmic science that can predict the future in probabilistic terms. Using psychohistory he predicts the eventual fall of the great Galactic Empire and the 10 to 30 thousand years of chaos and barbarism that would ensue. To prevent this he develops a plan (called the Seldon Plan) to shorten the dark period to a mere thousand years. This plan forms the basis of the Foundation series of books.
Hari Seldon’s psychohistory involves a huge amount of data on subjects such as economics, history, sociology and psychology. This is all processed using statistics and mathematics to understand the likely behaviour of large groups, in this case the Galactic Empire. Sounds just like Big Data to me, and Seldon sounds like a data scientist.
For me the role is the amalgamation (or possibly the evolution) of three different roles: data analyst, business analyst and scientist/mathematician. Let me see if I can explain:
- Data Analyst: because they typically have a computer science background, understand query languages such as SQL, and are familiar with software such as SAS or open source languages such as R. In short, they know their way around a database and datasets regardless of the underlying technology.
- Business Analyst: because they understand the business and how to translate complex numbers and data sets into terms that the business can understand. They also know what is relevant to explore further, and therefore focus their attention on problems that matter to the business.
- Scientist/Mathematician: because they have a deep understanding of mathematical modelling and statistical techniques. They add to this the traditional skills of a scientist, and a scientist's general need to understand everything (i.e. they will be inquisitive and continually asking questions).
The backgrounds of data scientists are varied, with some coming from quite technology-based backgrounds and knowing languages such as R or Python. Other data scientists are mathematicians, statisticians, business analysts or visualisation experts.
With the rise of the data scientist and the professionalisation of the role has also come the acceptance that data scientists fall into two broad categories, imaginatively called type A and type B. The distinction between these two roles is critical to an organisation's success in leveraging the potential of these individuals.
Type A (for analysis) Data Scientists are the most common type and use many techniques to manipulate data sets (some of which can be vast in size). They understand how to clean and manipulate the data, and they are also knowledgeable in a specific domain. They have much in common with statisticians but explicitly target those skills at manipulating and understanding the data sets.
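To make "clean and manipulate" a little more concrete, here is a minimal sketch of the sort of tidying a type A data scientist might do before any analysis. Everything in it is an assumption for illustration only: the records, the field names and the helper functions `clean` and `average_spend_by_region` are all made up, and real work would more likely use a library such as pandas on far larger data.

```python
from statistics import mean

# Hypothetical raw records: stray whitespace, inconsistent casing,
# a missing spend value and a duplicate entry.
raw_rows = [
    {"customer": " Alice ", "region": "north", "spend": "120.50"},
    {"customer": "bob", "region": "North", "spend": ""},
    {"customer": "Alice", "region": "NORTH", "spend": "120.50"},
    {"customer": "Carol", "region": "south", "spend": "80"},
]


def clean(rows):
    """Normalise text fields, drop rows with no spend, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        spend_text = row["spend"].strip()
        if not spend_text:
            continue  # no spend recorded, so nothing to analyse
        record = (
            row["customer"].strip().title(),
            row["region"].strip().lower(),
            float(spend_text),
        )
        if record in seen:
            continue  # identical to a row we have already kept
        seen.add(record)
        cleaned.append(record)
    return cleaned


def average_spend_by_region(records):
    """Group cleaned records by region and average the spend."""
    by_region = {}
    for _, region, spend in records:
        by_region.setdefault(region, []).append(spend)
    return {region: mean(values) for region, values in by_region.items()}


cleaned_rows = clean(raw_rows)
summary = average_spend_by_region(cleaned_rows)
print(summary)
```

The point is less the code than the order of work: tidy first, then summarise, because the duplicate and the blank row would otherwise skew the averages.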
Type B (for build) Data Scientists use coding and programming skills with big data technology to build models that interact directly with users. These models might recommend your next holiday or some music based upon your purchasing history.
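As a rough illustration of a recommendation built from purchasing history, the sketch below scores items by how often they were bought by people whose purchases overlap with yours. This is a deliberately simple co-occurrence approach chosen for readability, not a description of how any real service works; the shoppers, the items and the `recommend` function are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories, one set of items per shopper.
purchases = {
    "ann": {"jazz album", "blues album", "headphones"},
    "ben": {"jazz album", "blues album"},
    "cat": {"jazz album", "headphones", "turntable"},
    "dan": {"rock album"},
}


def recommend(user, history, top_n=2):
    """Suggest items bought by shoppers with overlapping purchases."""
    owned = history[user]
    scores = Counter()
    for other, items in history.items():
        if other == user:
            continue
        overlap = len(owned & items)
        if overlap == 0:
            continue  # nothing in common, so this shopper tells us nothing
        for item in items - owned:
            # Weight each candidate by how similar the other shopper is.
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("ben", purchases))
```

A production model would need to cope with millions of users, new items with no history and much else besides, which is where the "build" skills of a type B data scientist come in.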
Hopefully this article has helped a bit.