“Data science and big data – they are so much more interesting than the subject of storage. You have to be honest about it,” said Josh Klahr, vice president of products at Greenplum.
We were sitting over cups of steaming coffee, right after finishing an in-depth interview on big data, on the sidelines of EMC World 2012. The twelfth annual global conference had almost finished its week-long spin by the time I got to have a chat with Klahr. It was, however, the day of the second annual Data Science Summit. The Summit brings together the geeks of the data world – the ones that EMC calls the very new-fangled data scientists – to discuss the evolution and relevance of information, as well as the ways in which more information can be extracted from raw data.
I sat in on the Data Science Summit 2012 for a short while. And I know what Klahr was referring to.
There is something fascinating about the field of data science. For one, it did not exist in as strong a form even a year ago. The first time I heard of a data scientist was at EMC World 2011, last year.
For another, data science is not the practice of a single skill; it brings multiple disciplines together within a single fold. It applies, in varying degrees, statistics, data visualisation, computer programming, data mining, machine learning and database engineering to solve complex data problems. That is according to Wikipedia. (Which, incidentally, also states that the term has existed for over a decade. The International Council for Science: Committee on Data for Science and Technology has been publishing the CODATA Data Science Journal since April 2002.)
Though still in its infancy, the world of data science is set to take the world by storm as big data gains traction among organisations. EMC and Greenplum are working to create ‘an army of geeks’ to fill more of the data scientist roles, as demand in markets like North America starts to increase.
Creating data scientists, though, is easier said than done. The first step of the process involves finding vertically trained aspirants who have a baseline understanding of data creation and an interest in understanding patterns. The trainee also needs to have a penchant for statistics, and be attuned to picking up repeat patterns in raw data.
For example, a biomedical student who has been working on models, and has an interest in understanding more of the data layers, is a prime candidate to become a data scientist. Greenplum is currently taking in some of these vertical specialists and grooming them in the areas where they fall short, to turn them into fully fledged data scientists.
On the day of the Data Science Summit, I was witness to a heated debate on whether these data scientists should have vertical knowledge at all. A representative from a German data-mining company felt strongly that any kind of training in one particular field would leave people ‘missing the forest for the trees.’ However, Klahr states that a grounding in a particular stream can be a helpful base for a future data scientist, as long as that person is adaptive enough to learn new streams and integrate existing knowledge seamlessly.
In many ways, companies like EMC and Greenplum are making the route as they walk the path. In other words, they are creating a market for data scientists, even as they work out how to train personnel to fill what will be a rising need. Programmes like the one that Greenplum currently operates will be imitated and repeated a hundred times over by private firms and vendors as they try to meet that need.
The Age of the Data Scientist is just beginning – and it is likely to get crowded by a lot of “me toos” – much like the world of cloud computing right now. What is likely to come out of it, though, is a brave new world of information and context, and that is what makes data science way more interesting than plain old storage.