Source code – https://github.com/ehelin/StorageExperiments
While there were many outcomes I was not expecting, there was one I did expect: SQL Azure was the clear winner this round. Slow at first, it moved much faster once I bumped it from S0 to S3 (and probably faster still on the higher premium price tiers). It took roughly 13 hours to insert the full data set of 23-million-plus records. This may not be a real shock, since the source data is also in a local SQL Server database and I am just passing the table columns over as-is with no transformation. Guess you really do get what you pay for. Or there is a little throttle troll making the lower price tiers suffer accordingly. You decide 🙂 However, this was the only thing that went as expected.
I am going to chalk the outcomes listed below up to being new to hard-core cloud processing; the cloud is a much different environment than the on-premises shops we have all come to know and love, and there are no doubt other factors at play. Still, I was surprised by how many unexpected outcomes there were. I was expecting each storage load to complete without issue in roughly the same amount of time (ish) 😉
Here they are, starting with the run times (hh:mm:ss):
- SQL Azure – 13:14:34 (approximation since a timeout occurred after the inserts, but before the total count could complete)
- Event Hub – 26:45:25
- Blob Storage – 68:28:16
- Table Storage – Didn’t complete run (I killed it while still running when I decided to run tests again)
- Document Db – Didn’t complete run (I killed it while still running when I decided to run tests again)
And the final record counts:

- SQL Azure – 23,310,170 (26 extra records)
- Event Hub – 23,310,175 (lost 31 records)
- Blob – 23,310,175 (lost 31 records)
- Table Storage – 5,463,288 (lost a bunch)
- Document Db – 104,097 (lost a bunch)
Some notes on each store:

SQL Azure:
- The price tier was way too low. I purposely started with the out-of-the-box tier settings to see what worked and what didn't.
- The data load job threw an exception on the final total count. I modified the code to write errors to the console rather than throw them, since these jobs are long-running.
- One of the setup steps was to drop and recreate the table. This stalled when I had to re-launch, because the table already had so many records.
- It inserted 26 additional records. I am assuming these are duplicates, but I have no easy way to prove it, so I am going to include the original database id in the inserts so I can run comparisons on the next test and see what was duplicated.
Event Hub and Blob Storage:
- Both lost 31 records. I am going to include the original database id in the inserts so I can run comparisons on the next test and see whether the same records go missing.
Table Storage:
- This was just slow, and I am not sure whether I had it configured wrong or something else was going on. My partition strategy is one partition per satellite. The row key is the [type] data column, which serves as the identifier from the original data source; it contains the satellite name, the update type, and a date-time stamp.
Document Db:
- This was almost all configuration. Each document database setup comes with a base configuration, and you can increase it from there; since increasing the quota looked like it required a request, I opted to stay within the base allowance. This meant 5 different document collections and 5 threads. While working through the document database specifics, I had to add a System.Threading.Thread.Sleep(X) statement because I kept getting a 'Request rate is too large' error. I started at 100 ms, then went to 1000, and ended up at 2000 before killing the process.
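Once the original database id travels with each insert, the duplicate-and-loss comparison I have in mind for the next run is straightforward. The project itself is C#, but here is a minimal sketch of the idea in Python; the `compare_loads` name and the sample ids are hypothetical stand-ins:

```python
from collections import Counter

def compare_loads(source_ids, loaded_ids):
    """Compare the original database ids against the ids that landed
    in the target store, reporting duplicates and lost records."""
    loaded_counts = Counter(loaded_ids)
    # Ids that appear more than once in the target are duplicates.
    duplicates = {i: c for i, c in loaded_counts.items() if c > 1}
    # Ids present in the source but absent from the target were lost.
    lost = set(source_ids) - set(loaded_counts)
    return duplicates, lost

# Demo: id 2 was inserted twice, id 4 never arrived.
dupes, lost = compare_loads([1, 2, 3, 4], [1, 2, 2, 3])
```

The same comparison covers both failure modes seen above: the 26 extras in SQL Azure and the 31 losses in Event Hub and Blob.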
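The Table Storage key strategy described above (one partition per satellite, the [type] column as row key) can be sketched as follows — in Python for brevity, with the `to_entity` helper and the sample values being hypothetical stand-ins:

```python
def to_entity(satellite, type_column):
    """Map a source row to a Table Storage entity using the strategy
    above: one partition per satellite, the [type] column as row key."""
    return {
        "PartitionKey": satellite,
        # The [type] column already encodes the satellite name, update
        # type, and timestamp, so it serves as a unique row identifier.
        "RowKey": type_column,
    }

# Hypothetical sample row from the source database.
entity = to_entity("ISS", "ISS|orbit-update|2016-01-05T10:00:00")
```

One partition per satellite keeps each satellite's updates together, but it also means a hot partition if a few satellites dominate the insert volume — one possible reason for the slowness.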
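A fixed Thread.Sleep(X) before every insert is a blunt instrument for throttling: it slows the happy path down even when no throttling is happening. An alternative worth trying is to retry with exponential backoff only when a rate-limit error actually occurs. The project is C#, but a minimal Python sketch of the pattern looks like this (`RateLimitError`, `insert_with_backoff`, and `flaky_insert` are hypothetical stand-ins for a real 'Request rate is too large' response):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for a 'Request rate is too large' (throttling) response."""

def insert_with_backoff(insert_fn, doc, max_retries=5, base_delay=0.1):
    """Call insert_fn(doc), retrying with exponential backoff on throttling."""
    for attempt in range(max_retries + 1):
        try:
            return insert_fn(doc)
        except RateLimitError:
            if attempt == max_retries:
                raise
            # Sleep 0.1s, 0.2s, 0.4s, ... plus a little jitter so
            # parallel workers do not all retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))

# Demo: an insert that is throttled twice, then succeeds.
calls = {"n": 0}
def flaky_insert(doc):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

result = insert_with_backoff(flaky_insert, {"id": 1})
```

With this shape, unthrottled inserts run at full speed and only the throttled ones pay a delay.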
I am going to run the exact same tests again once the credits on my MSDN membership renew.
I am also going to try another version of this using the newer .NET async/await approach to multi-threading. I now suspect the System.ComponentModel.BackgroundWorker thread strategy I used here is dated.
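For illustration, the shape of an await-based rewrite is a bounded fan-out of awaitable inserts rather than one long-lived background worker per store. Here is a sketch of that pattern using Python's asyncio (the `insert_record` and `load_all` names are hypothetical; the C# version would use Task, await, and SemaphoreSlim analogously):

```python
import asyncio

async def insert_record(record):
    """Hypothetical stand-in for an awaitable storage insert."""
    await asyncio.sleep(0)  # simulate non-blocking I/O
    return record

async def load_all(records, max_in_flight=5):
    """Fan out inserts with await, capping concurrency with a semaphore
    (the rough analog of a fixed pool of worker threads)."""
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded_insert(r):
        async with sem:
            return await insert_record(r)

    return await asyncio.gather(*(bounded_insert(r) for r in records))

results = asyncio.run(load_all(range(10)))
```

The semaphore keeps the number of in-flight requests fixed, which also makes it a natural place to tune against the throttling seen in the Document Db runs.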
Additionally, these are hand-coded approaches, which are typically slower than the larger Extract, Transform, Load (ETL) tools like SQL Server Integration Services (SSIS). There are some new components that let you drag and drop cloud controls within SSIS, so I plan to try each load container, as well as an actual load, using SSIS.