(This post was first published on Google+.)
Two areas that I work in, “big data” and “data science”, both have somewhat nebulous definitions. As with many technology trends, they are neither wholly novel nor well-defined. Many business analysts or statistical programmers will lay claim to doing what is characterized as data science, and managers of large BI (business intelligence) deployments will assert they’re handling big data.
What is the point of such labels?
The utility of the terms big data and data science does not lie in their watertight definition. Though many, including me, seek to both educate others and profit by defining labels, a clear ontology of a developing phenomenon is incorrect as soon as you publish it.
Beware the vaunted four Vs of big data (volume, velocity, variety and variability) for they will have little practical use outside of a consultant’s white paper. You do not buy a cow by specifying that you wish for a bovine ungulate that consumes monocotyledonous herbaceous plants. You want something that will produce milk and that will live on your farm.
Instead, the value of the terms is in highlighting a change in the who and how, a migration of technologies and practice into broader applicability.
In the case of big data, the web and social software mean that even modest businesses have large and expanding datasets on their hands. Previously, only larger retailers or search engines found this sort of problem profitable to face. Now anybody can use technology to process these large and messy datasets: that is the change in “who”. This is thanks to commodity hardware, the open sourcing of projects like Hadoop, and on-demand cloud computing resources: the change in “how”.
For data science, the change in “who” is in the spread of exploratory and entrepreneurial analysis, historically performed for example by physicists or investment industry quants, into a business and product context. Data scientists (or data science teams) embody domain knowledge, programming ability and mathematical aptitude. Rather than being sectioned off behind the IT firewall, they are brought into the conversation of the business as a whole: a change in “how”.
Feeding on themselves
Another interesting aspect of both big data and data science is that, once started, they exhibit positive self-reinforcement. Organizations find that once Hadoop, for example, is in place, it can be used in an increasing number of contexts. Such scalable computation is enabling many mobile and sensor applications that would previously have been impractical.
Likewise, data science as a phenomenon is self-reinforcing. Its existence and value within companies are causing its recognition not just as a compound skillset and job title, but also as a leading edge of the transformation of corporate culture. To be driven by data necessitates much cultural change: successful data science teams provide an example and learning ground for effecting such changes.
Inevitably, the terms big data and data science will be used by anybody wanting to jump on the bandwagon, for there is considerable financial value in these trends. Cloud, social software and web 2.0 all suffered a similar fate.
But behind the labels lie important changes in the way that we think about both computing and its place within organizations. Just don’t let anybody tell you they know exactly what big data or data science means!