PDW loading parameters explained
The best and fastest way to load data into PDW is the dwloader. Depending on the type of table you are going to load (replicated or distributed) you have different load mechanism.
The 2 important parameters for loading are the „-M“ Parameter which tells dwloader what kind of load is going to take place and „-m“ to either load into multi-transactional mode or not. But what exactly happens while loading data?
The most common load methods append or fast append.
The append mode will create a temporary table before loading the destination table. During the load an exact copy of the table is created, to make sure that the data inserted is correct. What we know now is, that the data matches the criteria of our table structure and that the data can be loaded towards the „real“ destination table. Replicated tables e.g. have to be loaded in append mode and cannot be loaded in fast append mode.
The fast append mode writes directly into the destination table instead of creating a temporary table first. In order to do so, the transactional parameter has to be set and is mandatory. The transactional parameter and the ups and downs are explained below.
Now we can decided what transactional mode we are going to use. With „-m“ gives us the option to either load multi-transactional or not. In order to understand the loading methods we have to know how the PDW and the nodes work. To make it easier we image that our PDW has only one compute node. When we create a distributed table, we don’t only create one table but 8 tables. The so called distribution is nothing else but 8 tables, each one on an own file group attached to its own storage.
Loading without “-m”
Without “-m” parameter the data from the temporary table will be written towards the distributions within a transaction. It starts off with the “Begin Transaction” statements followed from in insert into each distribution. Because writing into 8 tables can not be done in parallel the whole process is done sequentially. Not until the last table is written the “Commit Transaction” takes place to make sure that the transaction is successfully completed and the data is in the destination table. As you can image this process of sequential writing is taking way longer than parallel writing.
Loading with “-m”
Same as before we know that we have the data in a temporary table so far. But instead of skipping the parameter, we include it in our command line. Instead of writing the data sequential in each distribution, with this mode the data gets written in all distributions at the same time. Therefore the speed improvement is between 6-8x times faster and without the parameter. The downside on the other hand is the change of problems during the write process. In case of any failure the data is corrupted and has to be manually cleaned up. This can be easily handled by a Load_ID. Replicated tables can not be loaded with –m.