Cloud Adventures Part 2 – Blob, Event Hub, Table Storage, and SQL Azure – Round 2

Source code – https://github.com/ehelin/StorageExperiments (get the commit closest to the date of this blog)

Things turned out very differently this time, and for the better 🙂 Event Hub, Azure SQL Server, Blob Storage and Table Storage all loaded the correct number of records…almost! Event Hub had an extra 164 records. Upon further investigation, I found that this upload made some double inserts. I didn’t see any errors, and since any staging location will periodically have to deal with duplicates, I am going to move on. Additionally, I may (or may not) come back to DocumentDB. Out of the box, DocumentDB doesn’t seem as robust as the others for these types of loads.

I will organize the results by storage type, but the biggest surprise was the threading model I used. Whenever a developer needs to speed something up, a multi-threaded version is generally the easiest option. Here, I replaced the BackgroundWorker (see Reference #5) with the new .NET Async/Await (see Reference #6) threading model. I must say that I was skeptical, but this really seems like the best way to multi-thread a .NET application.

I also noticed that it was much easier on the test machine than the BackgroundWorker in terms of CPU time and memory. It is also much easier to program, as most of the difficult items like thread management and synchronizing callbacks are abstracted away. Once you get used to the Task<> wrapper, it is pretty straightforward! One caveat is that launching each thread in a loop caused weird issues (see Reference #4). To run 32 separate threads like before, I made 32 separate calls. There is probably a way around this.
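One possible way around the loop problem is sketched here. It assumes the weird issues came from the lambda capturing the loop counter itself rather than its value at launch time, which is a common pitfall when Task.Run is called in a for loop; the post does not confirm that this was the actual cause. LoadPartitionAsync and RunAllAsync are hypothetical names, not code from the repository.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class LoopLaunchSketch
{
    // Hypothetical stand-in for one worker's upload loop; it only reports
    // which partition it was asked to handle.
    private static async Task<int> LoadPartitionAsync(int partition)
    {
        await Task.Delay(10); // placeholder for real I/O
        return partition;
    }

    // Starts `partitionCount` workers from a loop and waits for all of them.
    public static async Task<int[]> RunAllAsync(int partitionCount)
    {
        var tasks = new List<Task<int>>();
        for (int i = 0; i < partitionCount; i++)
        {
            // Copy the counter: a lambda that captures `i` directly sees
            // whatever value `i` holds when the task finally runs.
            int partition = i;
            tasks.Add(Task.Run(() => LoadPartitionAsync(partition)));
        }

        // Results come back in the same order the tasks were added.
        return await Task.WhenAll(tasks);
    }
}
```

With the local copy in place, RunAllAsync(32) hands each worker its own partition number, so the 32 hand-written calls could in principle collapse into one loop.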

Another item shared by all four was that I had to run the code three times to tweak for performance and error handling. Both Azure SQL Server and the Event Hub (via the Stream Analytics job) had timeout issues. Even though I create a new connection for each insert, there were still hiccups that I wasn’t anticipating. To overcome this, I implemented a retry, which seems to have solved the issue. No issues like this occurred with Blob or Table Storage.
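The retry code itself is not shown in this post, so the following is only a minimal sketch of what such a wrapper might look like. WithRetryAsync, maxAttempts and initialDelay are hypothetical names, and catching every exception is a simplification; real code would probably retry only on timeout-style failures.

```csharp
using System;
using System.Threading.Tasks;

public static class RetrySketch
{
    // Runs `operation`, retrying on any exception up to `maxAttempts` times
    // in total and doubling the pause between attempts.
    public static async Task<T> WithRetryAsync<T>(
        Func<Task<T>> operation, int maxAttempts, TimeSpan initialDelay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Assumption: every failure is treated as transient. On the
                // last attempt the filter is false and the exception escapes.
            }

            await Task.Delay(delay);
            delay = TimeSpan.FromTicks(delay.Ticks * 2);
        }
    }
}
```

A single insert could then be wrapped as WithRetryAsync(() => InsertRecordAsync(record), 3, TimeSpan.FromSeconds(1)), where InsertRecordAsync is again a placeholder for the real call.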

One item shared by both Blob and Table Storage, since they are tied to the same storage account, was an increase in speed. I think that is because I upgraded the account type from Locally Redundant to Geo-Redundant.

Another thing I think I can do better is find the best way to keep the console application open while the threads run. While I am sure there are better options, I opted to use Console.Read().
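One alternative to Console.Read() is to have the main thread block on the tasks themselves, so the process ends when the work ends rather than on a key press. This is a rough sketch, not the code from the repository; WorkerAsync and RunAndWait are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class KeepAliveSketch
{
    // Hypothetical stand-in for one upload worker.
    private static Task WorkerAsync(int id) => Task.Delay(50);

    // Starts the workers and returns only after every one has completed.
    // The return value is the number of workers that ran to completion.
    public static int RunAndWait(int workerCount)
    {
        Task[] workers = Enumerable.Range(0, workerCount)
            .Select(id => WorkerAsync(id))
            .ToArray();

        // Block the calling thread until all workers are done, so a console
        // Main that calls this method cannot exit early.
        Task.WhenAll(workers).GetAwaiter().GetResult();

        return workers.Count(t => t.Status == TaskStatus.RanToCompletion);
    }
}
```

Calling KeepAliveSketch.RunAndWait(32) as the last statement of Main would hold the console open for exactly as long as the workers need. Newer compilers (C# 7.1 and later) also allow an async Main that awaits Task.WhenAll directly.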

I also added some test queries to see how responsive each medium was. These tests included a total count, a search for a specific record, and a search for a type of record.

Ok, on to the details!

Azure SQL Server

Load
  Count: 23,310,144
  Time: Roughly a day (run window closed before I could record the time)
  Start: Not recorded
  End: Not recorded

Query – Total Count
  Count: 23,310,144
  Time: 15 minutes (ish)
  Start: 4/5/2016 5:37:14 PM
  End: 4/5/2016 5:52:08 PM

Query – Specific Item
  Found: True
  Time: 14 minutes (ish)
  Start: 4/5/2016 5:52:08 PM
  End: 4/5/2016 6:06:57 PM

Query – Specific Type
  Count: 2,940,199
  Time: 15 minutes (ish)
  Start: 4/5/2016 6:06:57 PM
  End: 4/5/2016 6:22:02 PM

Issues/Caveats/Misc
  Azure SQL Server price tier change: it is a good idea not to use the database while the price tier is updating (this takes a while).

Event Hub

Load
  Count: 23,310,308
  Time: Roughly a day (run window closed before I could record the time)
  Start: Not recorded
  End: Not recorded

Query – Total Count
  Count: 23,310,308
  Time: Almost 2 hours. I am going to re-run this because it shouldn’t be so different, since the query runs against the same database as the one above.
  Start: 4/5/2016 1:29:54 PM
  End: 4/5/2016 3:20:57 PM

Query – Specific Item
  Found: True
  Time: Almost 2 hours. I am going to re-run this because it shouldn’t be so different, since the query runs against the same database as the one above.
  Start: 4/5/2016 3:20:57 PM
  End: 4/5/2016 5:15:24 PM

Query – Specific Type
  Count: 2,940,199
  Time: 22 minutes (ish)…still long
  Start: 4/5/2016 5:15:24 PM
  End: 4/5/2016 5:37:14 PM

Issues/Caveats/Misc
  The records are stored in Azure SQL Server via a Stream Analytics job.
  If I go back, I am going to look at the SendAsync() option instead of the multi-threading approach. Event Hub is supposed to be able to handle an insane number of records, and I honestly thought it would be done first. I think it is because I am using it wrong…anyway, possible future blog post!!

Blob Storage

Load
  Count: 23,310,144
  Time: 21 hours 56 minutes (ish)
  Start: 3/20/2016 7:47:27 PM
  End: 3/21/2016 5:43:48 PM

Query – Total Count
  Count: 23,310,144
  Time: 2 hours 11 minutes (ish)
  Start: 4/6/2016 8:44:32 AM
  End: 4/6/2016 10:55:58 AM

Query – Specific Item
  Found: True
  Time: 1 hour 22 minutes (ish)
  Start: 4/6/2016 10:55:58 AM
  End: 4/6/2016 12:17:47 PM

Query – Specific Type
  Count: 2,940,199
  Time: 1 hour 54 minutes (ish)
  Start: 4/6/2016 12:17:47 PM
  End: 4/6/2016 2:12:01 PM

Issues/Caveats/Misc
  I added a connection entry in the app.config, which is supposed to help with multiple threads (see Reference #2).
  I was getting empty blob files until I used stream.Seek(0, SeekOrigin.Begin); (see Reference #8).
  I think the blob load was also faster because I added <add address="*" maxconnection="1000" /> to the <connectionManagement> list of options in the app.config (see Reference #3).

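The empty-blob symptom can be reproduced without Azure at all. The sketch below uses only MemoryStream and CopyTo and assumes that the blob upload, like CopyTo, reads from the stream's current position; that assumption matches the fix described above but is not verified against the storage SDK here. BytesVisibleToReader is a hypothetical helper.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRewindSketch
{
    // Writes `text` into a MemoryStream, optionally rewinds it, and returns
    // how many bytes a consumer reading from the current position receives.
    public static long BytesVisibleToReader(string text, bool rewind)
    {
        using (var stream = new MemoryStream())
        {
            byte[] payload = Encoding.UTF8.GetBytes(text);
            stream.Write(payload, 0, payload.Length);
            // After the write, Position sits at the end of the data.

            if (rewind)
            {
                stream.Seek(0, SeekOrigin.Begin);
            }

            using (var copy = new MemoryStream())
            {
                stream.CopyTo(copy); // copies from Position to the end
                return copy.Length;
            }
        }
    }
}
```

Without the rewind the consumer sees zero bytes, which would explain an empty file; with it, the consumer sees the whole payload.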
Table Storage

Load
  Count: 23,310,144
  Time: 19 hours (ish)
  Start: 3/20/2016 7:56:01 PM
  End: 3/21/2016 2:56:27 PM

Query – Total Count
  Count: 23,310,144
  Time: 1 hour 59 minutes (ish)
  Start: 4/6/2016 3:03:36 PM
  End: 4/6/2016 5:02:05 PM

Query – Specific Item
  Found: True
  Time: 37 minutes (ish)
  Start: 4/6/2016 5:02:05 PM
  End: 4/6/2016 5:39:20 PM

Query – Specific Type
  Count: Not available (the query failed, see Issues/Caveats/Misc)
  Time: Not available
  Start: Not available
  End: Not available

Issues/Caveats/Misc
  On the query for a specific type of record, Table Storage threw an out-of-memory error. A more specific .IndexOf() search returned a ‘Not Implemented’ error. It seems like LINQ support on Azure is still fairly limited 😦

 

Moving forward, I have another blog entry planned for SQL Server Integration Services (SSIS). I think SSIS is the workhorse of the Microsoft data world. It is a very flexible and easy tool to use, with drag-and-drop connectors for most items and script tasks for items you have to write yourself. The one drawback I have noticed so far is that for Script Task based cloud loads, I have had to load the required class libraries through the GAC to run in the SSIS package, which has been somewhat painful. However, there is a new SDK with cloud connectors that I haven’t played with yet. I will be using this new SDK for that blog entry.

Eventually, I hope to do one for Cassandra. Cassandra is a sophisticated tool for large data stores that is available in the Azure cloud, and I want to run tests on it. I may explore MongoDB if it is available as well. Plus, I plan to repeat these cloud load tests, as well as the ingestion and reporting, in the Amazon cloud and possibly the Google cloud.

However, these will be intermingled with the next group of blog posts on parsing and breaking the records up into smaller objects that are useful!

Stay tuned!

References

1. https://msdn.microsoft.com/en-us/library/jj155756.aspx

2. http://tk.azurewebsites.net/2012/12/10/greatly-increase-the-performance-of-azure-storage-cloudblobclient/ (did this help speed up the blob storage?)

3. https://alexandrebrisebois.wordpress.com/2013/03/24/why-are-webrequests-throttled-i-want-more-throughput/

4. http://stackoverflow.com/questions/30225476/task-run-with-parameters

5. https://msdn.microsoft.com/en-us/library/system.componentmodel.backgroundworker%28v=vs.110%29.aspx

6. http://blog.stephencleary.com/2012/02/async-and-await.html

7. http://stackoverflow.com/questions/30225476/task-run-with-parameters

8. https://social.msdn.microsoft.com/Forums/azure/en-US/a9f8dae4-5636-43d0-b177-e631d9c8d92c/blob-uploadfromstream-saves-empty-files?forum=windowsazuredata

 
