Cloud Adventures Part 3/4 – Azure HDInsight/Blob Storage

Source code:

OK, I have been approaching this from a traditional data-warehouse perspective.  First, we created a data set to work with, as documented here:

Second, we loaded this data set into a staging environment as documented here:

Third, we merged the data into something useful and then served it up to users in formats they can use.  There are still many ways to do this in the Azure cloud, but I chose two routes – SQL Server and Blob/Hadoop.  We completed the SQL Server series with these blog entries:

NOTE:  The Logic App had not performed as I had hoped, so part 3 was a bust.  However, I was able to get a good result with part 4.

This blog entry will center on the Hadoop/Blob Storage option – more specifically, parts 3 and 4 for Hadoop/Blob.  Hadoop is complicated and has a lot of features.  It is defined as “a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation”.

One of the core algorithms Hadoop uses is called MapReduce.  An excellent definition/example of this can be found here.  For more information, please see any number of internet resources.
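To make the idea concrete, here is a toy word-count – the classic MapReduce example – sketched in plain Python.  The function and variable names are mine, not from any Hadoop API; a real job would distribute the map and reduce phases across the cluster:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["hello world", "hello hadoop"]
print(reduce_phase(map_phase(lines)))
```

In a real cluster, the framework shuffles the mapped pairs so that all pairs for the same word land on the same reducer – that detail is elided here.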

In Azure, HDInsight seems to work best with blob storage.   To map to blob storage, the high-level process involves:

  • Spinning up a cluster
    • This can be quite expensive if the cluster is large and runs for a long time.  For this experiment, it took about 15 minutes to spin up a small cluster.
  • Writing a query to load/parse the data
    • Hadoop uses a query language similar to T-SQL called HiveQL (Hive Query Language) to run queries and perform data manipulations just like you would in a SQL environment.  Starting out, the simplest command is to get the top 10 records from whatever table you are looking at (e.g. select * from satellite_data_parsed limit 10;).
  • Pointing Power BI Desktop at the processed blob storage location and designing your report(s).
  • Deploying to the Power BI website.
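The load/parse step above can be sketched in HiveQL.  Note that the storage account, container, and column names below are placeholders, not the ones from my actual project:

```sql
-- Hypothetical external table over the raw satellite-update CSVs in blob storage.
-- HDInsight addresses blob storage with the wasb:// scheme.
CREATE EXTERNAL TABLE satellite_data_raw (
    satellite_id STRING,
    update_time  STRING,
    payload      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://mycontainer@myaccount.blob.core.windows.net/satellite/raw/';

-- Sanity check, as in the example above:
SELECT * FROM satellite_data_raw LIMIT 10;
```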

However, it was a lot harder than I thought it would be.  First, when I ran my Hadoop query against the blob location holding the 23-million-plus updates, it would just spin.  Subsequent research showed that Hadoop works best with a few really, really large files (see reference #9).  I knew I had to redo things, but first I did a small Proof of Concept (POC) to verify this would work.  I ran a Hadoop query on a small number of satellite updates and was able to create a Power BI Desktop project:


My first thought was to download them all and create the single large file in one pass.  But since my first version was single-threaded, the download took too long.  So, I created a class with a series of methods that download the files and merge them into a large file:

  • Blob.CreateLargerFiles clf = new CreateLargerFiles(<multiple arguments>)
  • clf.WriteBlobFileNames() – This method downloads all of the satellite update names and places them in a single file.
  • clf.SeperateFileNamesIntoDirectories() – This method splits the single file of satellite update names into one for each satellite and one for each type (16 total).
  • clf.DownloadFilesIntoDirectories() – This method downloads the 16 file types in separate threads.
  • clf.CreateLargeFiles() – This method takes all of the downloaded files and creates one file for that type.
  • clf.CreateSingleLargeFile() – This method takes the 16 large files and creates one large file for all of the updates.
  • clf.UploadFileToBlob(<single file path argument>) – This method uploads the file to the blob location that the Hadoop query reads from.
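The download-and-merge idea behind those methods can be sketched as follows.  This is a minimal Python sketch, not the original class (which is C#): `fetch_blob` is a hypothetical stand-in for the real Azure SDK download call, and the names are mine:

```python
# Sketch of the multi-threaded download-and-merge approach described above.
from concurrent.futures import ThreadPoolExecutor

def merge_blobs(blob_names, fetch_blob, max_workers=16):
    """Download each blob on its own thread, then concatenate the pieces
    into one large payload (one thread per file type in the original design)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(fetch_blob, blob_names))  # order is preserved
    return "\n".join(parts)

# Usage with a fake in-memory "store" standing in for blob storage:
fake_store = {"sat1.csv": "a,1", "sat2.csv": "b,2"}
print(merge_blobs(sorted(fake_store), fake_store.get))
```

For millions of small blobs you would stream parts to disk rather than hold them in memory, but the threading shape is the same.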

The result was not as good as I had hoped.  Even downloading the files with multiple threads, it took a while, and because I have other projects I am interested in, I was not keen on repeating it.  Only around 10-11 million records loaded into the single 4 GB file.  I thought this would be enough to create a source for the Power BI Desktop project and let me move on.  However, while running the Hadoop query, 50 (ish) .csv files were created instead of one.  The query (source above) did return totals as expected, but I am unsure why 50 separate files were created.
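One likely explanation – hedged, since I have not verified it against this run – is that Hive writes one output file per reducer task, so a job that used ~50 reducers would leave ~50 CSVs behind.  Settings like these (untested for this workload, paths hypothetical) are the usual way to coalesce the output:

```sql
-- Merge the small result files after the job completes:
SET hive.merge.mapredfiles = true;
-- Or force a single reducer for the final stage (slower, but one file):
SET mapred.reduce.tasks = 1;
```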


I think this setup would work, but I am going to move on to more topics.

Stay tuned!


