When data gets big
This post is about big data. Looking around on the internet, you can find amazing examples of big data. For example, the Large Hadron Collider at CERN generates about 15 petabytes of data every year (see http://www.lhc-facts.ch/index.php?page=datenverarbeitung for details). This is about 41 terabytes each day. Impressive. However, you might argue that you don’t have such a collider in your company. In fact, most companies will only have to deal with a very small fraction of this amount of data. So where does big data start for common business applications? And what does it mean for the IT strategy? Does it have an influence, or is it just a matter of scaling and improving systems, a process that we always have to go through in IT to keep up with business requirements?
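As a quick sanity check of the daily figure above, here is a minimal sketch of the conversion. It assumes decimal units (1 petabyte = 1,000 terabytes) and a 365-day year; the variable names are only illustrative.

```python
# Convert the LHC's yearly data volume into a daily figure.
# Assumptions: decimal units (1 PB = 1000 TB) and a 365-day year.
PETABYTES_PER_YEAR = 15
TB_PER_PB = 1000
DAYS_PER_YEAR = 365

tb_per_day = PETABYTES_PER_YEAR * TB_PER_PB / DAYS_PER_YEAR
print(f"{tb_per_day:.1f} TB per day")
```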
Wikipedia defines big data as “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” (http://en.wikipedia.org/wiki/Big_data). I think this definition is a good starting point because it focuses not only on the amount of data but also puts that amount in relation to what we do with the data (processing it). Let me paraphrase this definition:
I speak of big data if my analytical capabilities are no longer sufficient because of the amount or complexity of the data, and if I’m not able to scale these capabilities using traditional approaches (more RAM, more servers in my clusters, more disks etc.). It’s not difficult (and not expensive) to store petabytes of data. What is difficult is processing that data, doing analytics on it and gaining insights.
So, to be honest, even a few hundred million rows of data may be big if I’m not able to perform, in a timely manner, the important analytics that my core business processes depend on. And there are two things to keep in mind:
- Modern approaches to modeling markets, along with complex statistical models and methods, are available.
- While it may be difficult to apply these methods, maybe our competitors already do.
Another aspect of analytical capabilities is the question “Do I have the right data or do I need other data sources?”. Limits in analytical capabilities may also exist because I don’t have the information I would need. In today’s world, with lots of data markets (like Microsoft Azure Datamarket, http://datamarket.azure.com/), you can get information and data that you might not even have dared to dream about a few years ago. Now you can get data about consumer trends, your competitors or global trends, and you get this data in a reliable, accurate and up-to-date way. Again, this increases the amount of data you need to process and, by that, may worsen the analytical restrictions that result from the sheer amount of data.
But then, this is still nothing new. As I mentioned before, IT has had to follow these requirements in each of the past years. We added more cores and bought newer machines. The database algorithms improved, and we used OLAP and other technologies to speed up analysis. But let me get back to the second half of the definition from above: “it becomes difficult to process using on-hand database management tools or traditional data processing applications”. If you like, you may replace “difficult” with “more expensive”, “more effort” or, in some cases, “impossible, at least within the required period of time”. For example, if you want to calculate a complex price elasticity model in retail and it takes you a month to do so, the result will no longer be useful, as the situation in your market might have already changed significantly (for example because of your competitors’ campaigns).
Again, this is not really new. In the past you may have added other components to your IT strategy, for example OLAP. Or you have replaced a slow database solution with a faster one. And you focused on scalability in order to cope with these challenges. So, you might look at some typical components of a big data environment in just the same way. Here are some of them (be careful, buzzword mode is switched on now):
- Hadoop, Hive, Pig etc.
- In-Memory Computing
- Massively parallel processing (MPP)
- Complex event processing (CEP)
[buzzword mode off] However, if you think of these components in the traditional way of enhancing the IT infrastructure, you might be thinking in the wrong direction. The main thing about big data is that when you get to the limits of your analytical capabilities (as in the definition from above), there are almost always tools and methods to get beyond those limits. However, these tools may require some fundamental changes in the IT ecosystem. With MPP databases, for example, it is not enough to get one and put all the data on it; it is about reshaping the BI architecture to match the new paradigm of those systems.
In the next posts, I’ll go a little deeper into this topic, the fundamental changes in the big data architecture and especially MPP databases.