Cloud Adventures Part 3/4 – Azure HDInsight/Blob Storage

Source code:

Ok, I have been approaching this from a traditional data warehouse perspective.  First, we created a data set to work with as documented here:

Second, we loaded this data set into a staging environment as documented here:

Third, we merged the data into something useful and then served it up to users in format(s) they can use.  There are many ways to do this in the Azure Cloud, but I chose two routes – SQL Server and Blob/Hadoop.  We completed the SQL Server series with these blog entries:

NOTE:  The Logic App did not perform as I had hoped, so part 3 was a bust.  However, I was able to get a good result with part 4.

This blog entry will center on the Hadoop/Blob Storage option – more specifically, parts 3 and 4 for Hadoop/Blob.  Hadoop is complicated and has a lot of features.  It is defined as “a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation” (http://searchcloudcomputing.techtarget.com/definition/Hadoop).

One of the algorithms used is called MapReduce.  An excellent definition/example of this can be found here – http://www.techspot.co.in/2011/07/mapreduce-for-dummies.html.  For more information, please see any number of internet resources starting with http://www.sas.com/en_us/insights/big-data/hadoop.html.

In Azure, HDInsight seems to work best with blob storage.  To map HDInsight to blob storage, the high-level process involves:

  • Spinning up a cluster
    • This can be quite expensive if the cluster is large and runs for a long time.  For this experiment, it took about 15 minutes to spin up a small cluster.
  • Writing a query to load/parse the data
    • Hadoop uses a query language similar to T-SQL called HiveQL (Hive Query Language) to run queries and perform data manipulations just like you would in a SQL environment.  Starting out, the simplest command is to get the top 10 records from whatever table you are looking at (e.g. select * from satellite_data_parsed limit 10;).  A rough HiveQL sketch of this step follows the list.
  • Pointing Power BI Desktop at the processed blob storage location and designing your report(s).
  • Deploying to the Power BI website.
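
To make the query step more concrete, below is a rough HiveQL sketch of mounting a blob folder as an external table and sanity-checking it.  The storage account, container, folder, and column layout are made-up placeholders for illustration, not the actual schema used in this project:

-- Sketch only: the account, container, folder, and columns below are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS satellite_data_parsed (
  satellite_name STRING,
  update_time    STRING,
  update_type    STRING,
  update_value   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://satellitedata@mystorageaccount.blob.core.windows.net/updates/';

-- Sanity check: pull back the first 10 rows.
SELECT * FROM satellite_data_parsed LIMIT 10;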

However, it was a lot harder than I thought it would be.  First, when I ran my Hadoop query against the blob location holding the 23 million plus satellite updates, it would just spin.  Subsequent research showed that Hadoop works best with a few really, really large files (see reference #9).  I knew I had to redo things, but first I did a small Proof of Concept (POC) to verify this approach would work.  I ran a Hadoop query on a small number of satellite updates and was able to create a Power BI Desktop project:

SmallSampleBlobMount

My first thought was to download the updates and create the single large file in one effort.  But since my first version was single-threaded, the download took too long.  So, I created a class with a series of methods that downloaded the files and merged them into a large file (a rough sketch of the class follows this list):

  • Blob.CreateLargerFiles clf = new CreateLargerFiles(<multiple arguments>)
  • clf.WriteBlobFileNames() – This method downloads all of the satellite update names and places them in a single file.
  • clf.SeperateFileNamesIntoDirectories() – This method splits the single file of satellite update names into one for each satellite and one for each type (16 total).
  • clf.DownloadFilesIntoDirectories() – This method downloads the 16 file types in separate threads.
  • clf.CreateLargeFiles() – This method takes all of the downloaded files and creates one file for each type.
  • clf.CreateSingleLargeFile() – This method takes the 16 large files and creates one large file for all of the updates.
  • clf.UploadFileToBlob(<single file path argument>) – This method uploads the merged file to the blob location that the Hadoop query reads from.
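
For reference, here is a rough sketch of how a few of these methods might look using the classic Microsoft.WindowsAzure.Storage SDK.  The connection details, folder names, method parameters, and grouping logic are illustrative assumptions only – the real implementation is in the source code referenced at the top of this post.

// Sketch only: connection details, folder names, and parameters are placeholders.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public class CreateLargerFiles
{
    private readonly CloudBlobContainer _container;
    private readonly string _workingDirectory;

    public CreateLargerFiles(string connectionString, string containerName, string workingDirectory)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        _container = account.CreateCloudBlobClient().GetContainerReference(containerName);
        _workingDirectory = workingDirectory;
    }

    // Write every satellite update blob name in the container to a single text file.
    public void WriteBlobFileNames()
    {
        var names = _container.ListBlobs(useFlatBlobListing: true)
                              .OfType<CloudBlockBlob>()
                              .Select(b => b.Name);
        File.WriteAllLines(Path.Combine(_workingDirectory, "AllBlobNames.txt"), names);
    }

    // Download each group of blobs (one group per satellite/type) into its own
    // directory, one thread per group.
    public void DownloadFilesIntoDirectories(IDictionary<string, List<string>> blobNamesByGroup)
    {
        var tasks = blobNamesByGroup.Select(group => Task.Run(() =>
        {
            var localDir = Path.Combine(_workingDirectory, group.Key);
            Directory.CreateDirectory(localDir);
            foreach (var blobName in group.Value)
            {
                var blob = _container.GetBlockBlobReference(blobName);
                blob.DownloadToFile(Path.Combine(localDir, Path.GetFileName(blobName)), FileMode.Create);
            }
        })).ToArray();
        Task.WaitAll(tasks);
    }

    // Concatenate every file in the downloaded group directories into one large file.
    public void CreateSingleLargeFile(string outputPath)
    {
        using (var output = new StreamWriter(outputPath))
        {
            foreach (var dir in Directory.EnumerateDirectories(_workingDirectory))
            {
                foreach (var file in Directory.EnumerateFiles(dir))
                {
                    foreach (var line in File.ReadLines(file))
                    {
                        output.WriteLine(line);
                    }
                }
            }
        }
    }

    // Upload the merged file to the blob location the Hadoop query reads from.
    public void UploadFileToBlob(string filePath)
    {
        var blob = _container.GetBlockBlobReference("merged/" + Path.GetFileName(filePath));
        using (var stream = File.OpenRead(filePath))
        {
            blob.UploadFromStream(stream);
        }
    }
}

Splitting the work by satellite/type means each thread writes to its own directory, so the 16 downloads can run independently of each other.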

The result was not as good as I had hoped.  Even with downloading the files in multiple threads, it took a while.  And because I have other projects I am interested in, I was not keen on repeating it.  Only around 10-11 million records loaded into the single 4 GB file.  I thought this would be enough to create a source for the Power BI Desktop project and I could move on.  However, while running the Hadoop query, 50 or so .csv files were created instead of one.  The query (source above) did return totals as expected, but I am unsure why 50 separate files were created.

MultipleResultFiles
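
My best guess is that Hive writes one output file per map or reduce task, so a query that fans out across the cluster leaves one part file per task rather than a single .csv.  If I revisit this, Hive has merge settings that are supposed to combine small output files when the job finishes (untested on my part):

-- Untested sketch: ask Hive to merge small output files at the end of a job.
SET hive.merge.mapfiles = true;                 -- merge outputs of map-only jobs
SET hive.merge.mapredfiles = true;              -- merge outputs of map-reduce jobs
SET hive.merge.size.per.task = 256000000;       -- target size of the merged files
SET hive.merge.smallfiles.avgsize = 16000000;   -- merge when average output file is smaller than this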

I think this setup would work, but I am going to move on to other topics.

Stay tuned!

References
1) http://dennyglee.com/2013/03/18/why-use-blob-storage-with-hdinsight-on-azure/
2) https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/
3) https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-move-hive-tables/
4) https://blogs.msdn.microsoft.com/cindygross/2013/02/05/hdinsight-hive-internal-and-external-tables-intro/
5) http://www.bimonkey.com/2012/07/getting-data-into-hive-using-hadoop-on-azure/
6) http://hortonworks.com/blog/100x-faster-hive/
7) http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
8) https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-data-sources/
9) http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
