Performance optimizations when loading many small files in SSIS
SQL Server Integration Services (SSIS)
In general, reading text files from SQL Server Integration Services (SSIS) is not a complicated task. The flat file source offers a user friendly interface to deal with separators, header lines and code pages/unicode. It can even determine the best data type for each column by scanning sample rows from the text file. And if your flat file is in XML-format you can use the XML source component in SSIS to read the file’s contents.
In many cases however you will not just have a single file but a directory containing many files instead. SSIS offers the for-each loop container in this case, to create a loop over all those files. The for-each loop container also has a friendly user friendly interface, so you can also easily solve this task in SSIS.
In this post I’d like to discuss a scenario with many small files to import and I will compare the for-each loop approach with a single data flow approach. Therefore, the task was to read about 2,100 files in JSON format into a single SQL Server database table. Each file has up to 80 KB in size containing from 1 to about 350 rows of data with an average of about 310 rows per file. So, in total I had to import about 650,000 rows of data from about 2,100 files. Doesn’t sound like a big deal so far.
First I created a for-each container with a single data flow:
The for-each container’s type is set to a “Foreach File Enumerator” scanning all files from a given directory:
The data flow simply reads the current file (I’m not going into details about the JSON-file here but some library like JSON.Net will do), does some minor changes (derived column) and writes the results into a SQL Server database table:
Running the package imported all of the files into my data table BUT… it took much longer than expected. In fact, it took 8 minutes. So whatever my expectation was before running the package, this was way too slow. First I checked if I made some mistakes. The OLE DB destination was set to use a fast table load with a table lock not checking any constraints. So, this was ok. The destination table was a heap with no primary key, so there wasn’t a problem with index reorganization.
Checking the progress log revealed that validation, pre- and post execute events are executed for each file. And since each file contained only a few rows, very small batches were committed in each loop causing the bad performance. Also, sending small batches to a table may be a bad idea depending on your table geometry. For example, when using clustered columnstore index tables, sending small batches results in asynchronous compression cycles as explained here .
If the flat files are actually CSV files, the best approach is to use the MULTIFLATFILE Connection Manager. Actually I must admit that I wasn’t aware it existed, until a colleague showed it to me. So here are the instructions to find this connection manager: When you right-click in the connection section of your package, a dialog appears to choose the connection type. Click on “New Connection…” here.
In the following dialog you can choose the MULTIFLATFILE Connection Manager. It is configured in exactly the same way as the standard flat file connection manager, but now you can specify multiple files or directories to scan.
But since my source files we’ not in CSV file format, I had to go for a different approach here. I replaced the for-each loop container with C# code inside the script component from above. Here is the corresponding code I used:
- public override void CreateNewOutputRows()
- String path = "c:\\temp\\JSON_Import";
- foreach (String filename in Directory.EnumerateFiles(path))
- // process single file
Again, I’m not going into details about the actual code for importing the JSON file here, but the code above shows how simple a for each loop can be implemented within a script (of course you will want to add some error-handling and use a package variable for the import folder instead of the constant string here). The remaining parts of the data flow were left unchanged.
This time, importing all of the 2,100 files took 10 seconds, so this approach was about 48 times faster, than the for-each loop container.
In SSIS, writing data to a database table using a data flow gives the best performance if you have a large number of rows. However, importing many small files from a directory using the for-each loop container results in the opposite: many inserts with just a few rows each. If you encounter performance degradations in such a scenario, using the MULTIFLATFILE connection manager or, if not possible, converting the for-each loop container and the file read operation itself into a script task, may result in a much better performance. To improve performance even more, you could also try to parallelize the script tasks (for example the first script importing files 1, 3, 5 … and the second one importing files 2, 4, 6, …).