What’s the buzz about MPP Data Warehouses (part 1)?
With ever more data and a growing need to analyze it, you may have stumbled over the MPP approaches for large data warehouses. MPP stands for massively parallel processing, in contrast to SMP, which means symmetric multiprocessing. A common way to define the two worlds: in an SMP machine, multiple CPUs share the same resources (RAM, disks), which makes it well suited to boost performance on CPU-critical tasks. In an MPP machine you also have multiple CPUs, but this time each CPU has its own memory. MPP systems are therefore better suited for workloads that need a very high data throughput. And this is what we typically see in data warehouses: we need to load large amounts of data and we need to efficiently query large portions of this data.
But wait, comparing DWH and OLTP, I’m thinking of the following situation:
The main difference between an OLTP and a DWH solution is the data model and not the underlying hardware or database server software. Or, in other words, a modern database server should be suitable for both workloads, OLTP and DWH. The data model (database layout), however, differs a lot: In an OLTP database, we want to reduce redundancy and therefore normalize the data into transactional and reference data, which can lead to complex relationships between tables. In a DWH model, on the other hand, we want to be able to read large amounts of data with simple queries and therefore prefer a de-normalized model (star schema).
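To make the star schema idea a bit more concrete, here is a minimal sketch in Python using an in-memory SQLite database as a stand-in for a real data warehouse. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are invented for illustration. The point is only that a typical DWH query joins one fact table to a few small dimension tables and aggregates.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: small, descriptive, de-normalized.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    -- Fact table: one row per sale, pointing at the dimensions.
    CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO dim_date VALUES (20230101, 2023), (20240101, 2024);
    INSERT INTO fact_sales VALUES
        (1, 20230101, 100.0), (1, 20240101, 150.0),
        (2, 20230101, 20.0), (2, 20240101, 30.0);
""")


def sales_by_category_and_year(conn):
    """Scan the fact table, join the dimensions, and aggregate."""
    return conn.execute("""
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """).fetchall()


rows = sales_by_category_and_year(conn)
for category, year, total in rows:
    print(category, year, total)
```

In a normalized OLTP model, the same question would typically require walking through more tables (orders, order lines, products, product groups and so on), which is fine for single transactions but less convenient for scans over the full history.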
And as long as we do not have too much data, this point of view works fine. It’s surprising that it does, as the two use cases place different requirements on the underlying infrastructure:
OLTP:
- Usually small batches of data (transactions), usually structured in a complicated way (covering many tables)
- Needs to be able to roll back changes spanning multiple tables (complex transactions)
- Ensure data integrity (foreign keys, other constraints)
- Ensure simultaneous read/write access of users to the same data (isolation level)
- Support rich programming features (triggers, user defined functions etc.)
DWH:
- Load large amounts of data at specific loading times
- Query large amounts of data (often in “full scan”), create aggregates
But still, for small amounts of data, you don’t have to consider the infrastructure too much. However, as the amount of data and its complexity grow, you have to think about ways of optimizing your data warehouse architecture:
The first step is to apply best practices for your data warehouse model. For example, loading large amounts of data is not a good idea if you have active foreign key constraints or, even worse, triggers on your tables. Just because a database product offers a feature does not mean that you have to use it. So, here are some of these best practices:
- Avoid active foreign key constraints when loading a large amount of data
- Use table partitioning and partition switching for updates rather than individual row insert/update processes (for example: late arriving facts)
- Avoid granular transaction logging (for example, by using the simple recovery model)
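The first of these practices can be sketched as follows. This is only an illustration using Python and SQLite, not SQL Server: it switches foreign key enforcement off for the duration of a bulk load and validates the loaded rows once afterwards. The tables (`dim_product`, `fact_sales`) and the helper `bulk_load` are hypothetical names made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked to
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1), (2);
""")


def bulk_load(conn, rows):
    """Load rows without per-row FK checks, then validate once at the end."""
    # Must run outside a transaction; SQLite ignores this pragma inside one.
    conn.execute("PRAGMA foreign_keys = OFF")
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    # A single set-based check instead of one check per inserted row.
    violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    conn.execute("PRAGMA foreign_keys = ON")
    return violations


# The last row references a product that does not exist in dim_product.
violations = bulk_load(conn, [(1, 10.0), (2, 5.0), (99, 1.0)])
loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(loaded, "rows loaded,", len(violations), "violation(s) found")
```

The design choice here is to trade per-row integrity checks for one check after the load. What to do with the offending rows (reject, repair, map to an "unknown" dimension member) is then a decision for the loading process rather than for the database engine.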
At this step, you didn’t really touch the data warehouse system infrastructure at all, so your database server is still “universal”. At the next level of complexity, we usually start tuning the machine itself, for example:
- Choose a specific layout for your IO (SAN, RAID)
- Choose a specific distribution of database files and file groups (log, temp etc.)
- Use specifically tuned machines for the different tasks, like staging, ODS, data warehouse, data marts
- Use server clusters to balance workload and provide high availability
At this step, the SQL Server becomes more and more optimized for data warehouse workloads. It will still be possible to run OLTP workloads on it, but maybe less efficiently, as we started to optimize for DWH workloads.
However, as the amount of data grows, one question comes to mind: Wouldn’t it be better to really optimize the database server for DWH workloads and, consequently, to ignore OLTP requirements while doing so? This opens up different ways of storing and handling the data. If we follow this path, we get an infrastructure that might not be suited for OLTP traffic at all, but that perfectly supports large loads and fast reads of very large data.
MPP data warehouse solutions such as Teradata, Oracle Exadata, IBM Netezza, Microsoft Parallel Data Warehouse or Greenplum are examples of these approaches. Usually, the approach is a shared nothing MPP architecture of nodes, each of which has its own segment of the data on its own disks (no shared memory or disk). Taken to its logical conclusion, all components (including the hardware) are tuned and aligned for this purpose. To achieve this, pre-installed and pre-configured appliances are commonly used: instead of buying hardware and software individually and trying to make it run well and fast, you get a “black box” (i.e. one or more racks) of components and software that are selected and configured to work together in the best possible way.
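As a rough preview of the shared nothing idea, here is a toy sketch in plain Python. It is not how any of the products above are implemented. The "nodes" are just lists in one process, and the names `distribute`, `local_aggregate` and `total_per_customer` as well as the modulo distribution are assumptions chosen for illustration. Each node holds only its own segment of the rows, aggregates it locally, and only the small partial results are merged.

```python
from collections import Counter

NUM_NODES = 4  # hypothetical number of compute nodes


def distribute(rows, num_nodes):
    """Assign each row to exactly one node based on its distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for customer_id, amount in rows:
        nodes[customer_id % num_nodes].append((customer_id, amount))
    return nodes


def local_aggregate(segment):
    """Work done on a single node: it only ever sees its own segment."""
    totals = Counter()
    for customer_id, amount in segment:
        totals[customer_id] += amount
    return totals


def total_per_customer(nodes):
    """Merge the partial results of all nodes into the final answer."""
    result = Counter()
    for segment in nodes:  # a real system would run these steps in parallel
        result.update(local_aggregate(segment))
    return dict(result)


rows = [(1, 10), (2, 20), (5, 5), (1, 7), (6, 3)]
nodes = distribute(rows, NUM_NODES)
totals = total_per_customer(nodes)
print(totals)
```

Because all rows for one customer land on the same node, each node can compute its part of the answer without talking to the others, and the merge step only has to combine a handful of partial totals.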
In part 2 of this post, I’ll show the basic ideas of this shared nothing architecture and how query performance can benefit from the distribution of data on several compute nodes.