19.06.2015

Backups on the APS

Technical Value

APS runs a database backup as a distributed operation. The control node builds a distributed query plan to perform the backup in parallel, and each node involved in the backup writes one backup file to the backup server over the InfiniBand network. [Figure from the APS help file] Two limitations are worth noting:

  • The master is not backed up in parallel.
  • If there are several databases to be backed up, the backups are performed sequentially.

  During a backup, its status is shown in the Admin Console under "Backups/Restores". [Screenshot: Backups/Restores view] The view shows the progress of the running backup, but only the overall status.

  We ran into performance problems with our backups on the APS/PDW: they overran their time window, and we could not run any loads in the meantime because the backup needs an exclusive lock. Our first suspicion was that the network was the bottleneck, but we tested different infrastructure settings, and even over an InfiniBand network the backups took just as long. During a backup, the first 30-50% is reached very quickly, while the remaining percentage points come very slowly. So we investigated further, and the following query gave us a big surprise:

  -- Elapsed time per full backup run and per node, formatted as HH:MM:SS.
  -- total_elapsed_time is divided by 1000, i.e. treated as milliseconds.
  select run.run_id, run.database_name,
         RIGHT('0' + CAST(run.total_elapsed_time / 1000 / 3600 AS VARCHAR), 2)
         + ':' + RIGHT('0' + CAST((run.total_elapsed_time / 1000 / 60) % 60 AS VARCHAR), 2)
         + ':' + RIGHT('0' + CAST(run.total_elapsed_time / 1000 % 60 AS VARCHAR), 2)
           as TotalElapsedTime,
         run.status,
         det.pdw_node_id as Node,
         RIGHT('0' + CAST(det.total_elapsed_time / 1000 / 3600 AS VARCHAR), 2)
         + ':' + RIGHT('0' + CAST((det.total_elapsed_time / 1000 / 60) % 60 AS VARCHAR), 2)
         + ':' + RIGHT('0' + CAST(det.total_elapsed_time / 1000 % 60 AS VARCHAR), 2)
           as TotalElapsedTimePerNode
  from sys.pdw_loader_backup_runs run
  -- one detail row per node that took part in the run
  left join sys.pdw_loader_backup_run_details det on run.run_id = det.run_id
  where run.operation_type = 'BACKUP'
    and run.mode = 'FULL'
  order by run.submit_time desc
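  The time formatting in the query can be checked outside the appliance. The following is a minimal Python sketch, assuming (as the division by 1000 in the query implies) that total_elapsed_time is a millisecond count. The function name format_elapsed is hypothetical and only mirrors the integer arithmetic of the RIGHT('0' + CAST(...), 2) expressions.

```python
def format_elapsed(ms: int) -> str:
    """Format a millisecond count as HH:MM:SS, mirroring the T-SQL expression."""
    seconds = ms // 1000             # integer division, like "/ 1000" on an int column
    hours = seconds // 3600
    minutes = (seconds // 60) % 60
    secs = seconds % 60

    def two(n: int) -> str:
        # Equivalent of RIGHT('0' + CAST(n AS VARCHAR), 2): keep the last two characters
        return ("0" + str(n))[-2:]

    return f"{two(hours)}:{two(minutes)}:{two(secs)}"


if __name__ == "__main__":
    print(format_elapsed(3_600_000))   # one hour
    print(format_elapsed(18_305_000))  # five hours, five minutes, five seconds
```

  Because only the last two characters are kept, a run of 100 hours or more would lose its leading hour digits in this format. That is not a concern for the run times discussed here.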

  One of the nodes finished after just one hour, while the other nodes (in this case only three more) took five times as long: [Screenshot: elapsed time per node] A skewed data distribution could explain this, but a quick look at the storage properties of the database showed that the database files on each node were nearly the same size: [Screenshot: storage properties] We tested this on two different half-rack APS appliances with AU2 and also on a half-rack APS with AU3. The differences between the nodes appear there as well. With AU3 the differences were smaller and the backup itself performed better, but the issue remains.

  Conclusion:

  • The backup is finished only when the last node has completed its part.
  • The slow network (1 Gbit/s) is not the weakest link in the backup chain; something in the APS must be slowing down the backup on "some" nodes.
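
The first conclusion can be made concrete with a small Python sketch. Given the per-node elapsed times returned by the query, the overall backup time is that of the slowest node, and the ratio of slowest to fastest shows the imbalance. The function name summarize_backup and the sample values are hypothetical. The numbers are illustrative only, chosen to resemble the "one hour versus five times longer" pattern described above, and are not measured data.

```python
from typing import Dict


def summarize_backup(node_ms: Dict[int, int]) -> dict:
    """Summarize per-node backup times (node id -> elapsed milliseconds)."""
    if not node_ms:
        raise ValueError("no node timings given")
    slowest = max(node_ms, key=node_ms.get)
    fastest = min(node_ms, key=node_ms.get)
    fastest_ms = node_ms[fastest]
    return {
        # The backup as a whole is done only when the slowest node is done.
        "backup_ms": node_ms[slowest],
        "slowest_node": slowest,
        "fastest_node": fastest,
        # Ratio of slowest to fastest node; 1.0 means perfectly balanced.
        "skew": node_ms[slowest] / fastest_ms if fastest_ms else float("inf"),
    }


if __name__ == "__main__":
    hour = 3_600_000
    # Illustrative values only: one fast node, three slow ones.
    sample = {1: 1 * hour, 2: 5 * hour, 3: 5 * hour, 4: 5 * hour}
    print(summarize_backup(sample))
```

With evenly sized database files per node, a skew close to 1.0 would be expected. A value near 5, as in the illustrative input, points to a problem on individual nodes rather than in the data distribution.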

A support ticket has been opened with Microsoft to resolve this issue.
