- C# Application – https://github.com/ehelin/StorageExperiments (get the commit closest to the date of this blog)
- Hadoop Query – https://github.com/ehelin/StorageExperiments/blob/master/Shared/dto/blob/hql/Satellite.hql
Ok, I have been approaching this from a traditional data warehouse perspective. First, we created a data set to work with as documented here:
Second, we loaded this data set into a staging environment as documented here:
Third, we merged the data into something useful and then served it up to users in format(s) they can use. There are still many ways to do this in the Azure cloud, but I chose two routes – SQL Server and Blob/Hadoop. We completed the SQL Server series with these blog entries:
NOTE: The Logic Application did not perform as I had hoped, so part 3 was a bust. However, I was able to get a good result with part 4.
This blog entry will center on the Hadoop/Blob Storage option – more specifically, parts 3 and 4 for Hadoop/Blob. Hadoop is complicated and has a lot of features. It is defined as “a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation” (http://searchcloudcomputing.techtarget.com/definition/Hadoop).
One of the core programming models Hadoop uses is called MapReduce. An excellent definition/example can be found here – http://www.techspot.co.in/2011/07/mapreduce-for-dummies.html. For more information, please see any number of internet resources, starting with http://www.sas.com/en_us/insights/big-data/hadoop.html.
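To make the idea a little more concrete, here is a toy, single-machine analogy of the classic MapReduce word-count example, written with C# LINQ. It only illustrates the map/shuffle/reduce shape of the computation – real Hadoop spreads these phases across many nodes, and none of this code is part of the project linked above.

```csharp
using System;
using System.Linq;

class MapReduceToy
{
    static void Main()
    {
        string[] lines =
        {
            "the quick brown fox",
            "the lazy dog",
            "the quick dog"
        };

        var wordCounts = lines
            .SelectMany(line => line.Split(' '))                   // "map": split each line into words
            .GroupBy(word => word)                                  // "shuffle": group identical words together
            .Select(g => new { Word = g.Key, Count = g.Count() });  // "reduce": count each group

        foreach (var wc in wordCounts)
            Console.WriteLine($"{wc.Word}: {wc.Count}");
    }
}
```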
In Azure, HDInsight seems to work best with blob storage. To map to blob storage, the high-level process involves:
- Spinning up a cluster
- This can be quite expensive if the cluster is large and runs for a long time. For this experiment, it took about 15 minutes to spin up a small cluster.
- Writing a query to load/parse the data
- Hadoop uses a query language similar to T-SQL called HiveQL (Hive Query Language) to run queries and perform data manipulations just like you would in a SQL environment. Starting out, the simplest command is to get the top 10 records from whatever table you are looking at (e.g. select * from satellite_data_parsed limit 10;) – see the sketch after this list for one way to run such a query from C#.
- Pointing Power BI Desktop at the processed blob storage location and designing your report(s).
- Deploying the report(s) to the Power BI website.
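As a side note, HiveQL like the statement above can also be run programmatically. The sketch below shows one way to do that from C# through the Microsoft Hive ODBC driver – it is not the route used in this experiment, and the "HiveDSN" data source name is hypothetical; you would configure one against your own HDInsight cluster first.

```csharp
using System;
using System.Data.Odbc;

class HiveQuerySketch
{
    static void Main()
    {
        // "HiveDSN" is a hypothetical ODBC data source configured for the
        // Microsoft Hive ODBC driver and pointed at the HDInsight cluster.
        using (var connection = new OdbcConnection("DSN=HiveDSN;"))
        {
            connection.Open();

            // The simplest sanity check from the list above: pull the first
            // ten rows of the parsed satellite table.
            using (var command = new OdbcCommand(
                "select * from satellite_data_parsed limit 10", connection))
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Dump each row as a comma-separated line.
                    var values = new object[reader.FieldCount];
                    reader.GetValues(values);
                    Console.WriteLine(string.Join(",", values));
                }
            }
        }
    }
}
```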
However, it was a lot harder than I thought it would be. First, when I ran my Hadoop query against the blob location holding more than 23 million files, it would just spin. Subsequent research showed that Hadoop works best with a few really, really large files (see reference #9). I knew I had to redo things, but first I did a small Proof Of Concept (POC) to verify this would work. I ran a Hadoop query on a small number of satellite updates and was able to create a Power BI Desktop project:
My first thought was to download the files and create the single large file in one effort, but since my first version was single-threaded, the download took too long. So, I created a class with a series of methods that downloaded the files and merged them into a large file (a simplified sketch of the idea follows the list):
- Blob.CreateLargerFiles clf = new CreateLargerFiles(<multiple arguments>)
- clf.WriteBlobFileNames() – This method downloads all of the satellite update names and places them in a single file.
- clf.SeperateFileNamesIntoDirectories() – This method splits the single file of satellite update names into one for each satellite and one for each type (16 total).
- clf.DownloadFilesIntoDirectories() – This method downloads the 16 file types in separate threads.
- clf.CreateLargeFiles() – This method takes all of the downloaded files and creates one large file for each type.
- clf.CreateSingleLargeFile() – This method takes the 16 large files and creates one large file for all of the updates.
- clf.UploadFileToBlob(<single file path argument>) – This method uploads the file to the blob location that the Hadoop query reads from.
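For reference, here is the simplified sketch mentioned above of the download-and-merge idea, assuming the classic Microsoft.WindowsAzure.Storage SDK. The class, container, and path names are hypothetical, and this is not the exact code in the repository linked at the top – see the C# application there for the real CreateLargerFiles implementation.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class LargeFileBuilder
{
    // Download every blob under a prefix to a local directory, a few at a time.
    public static async Task DownloadPrefixAsync(
        CloudBlobContainer container, string prefix, string localDir)
    {
        Directory.CreateDirectory(localDir);
        var blobs = container.ListBlobs(prefix, useFlatBlobListing: true)
                             .OfType<CloudBlockBlob>()
                             .ToList();

        // Throttle the parallel downloads so we do not open thousands of
        // connections at once.
        using (var throttle = new SemaphoreSlim(8))
        {
            var tasks = blobs.Select(async blob =>
            {
                await throttle.WaitAsync();
                try
                {
                    var localPath = Path.Combine(localDir, Path.GetFileName(blob.Name));
                    await blob.DownloadToFileAsync(localPath, FileMode.Create);
                }
                finally { throttle.Release(); }
            });
            await Task.WhenAll(tasks);
        }
    }

    // Concatenate all downloaded files into one large file for Hadoop to consume.
    public static void MergeIntoSingleFile(string localDir, string mergedPath)
    {
        using (var output = File.Create(mergedPath))
        {
            foreach (var file in Directory.EnumerateFiles(localDir))
            {
                using (var input = File.OpenRead(file))
                {
                    input.CopyTo(output);
                }
            }
        }
    }
}
```

Wiring it up is roughly: get the container with CloudStorageAccount.Parse(<connection string>).CreateCloudBlobClient().GetContainerReference("satellite-data") (container name hypothetical), call DownloadPrefixAsync once per file type, call MergeIntoSingleFile to build the single large file, and then upload it back – via something like CloudBlockBlob.UploadFromFileAsync – to the location the HiveQL query reads from.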
The result was not as good as I had hoped. Even with multiple threads downloading the files, it took a while, and because I have other projects I am interested in, I was not keen on repeating it. Only around 10-11 million records loaded into the single 4 GB file. I thought this would be enough to create a source for the Power BI Desktop project so I could move on. However, while running the Hadoop query, around 50 .csv files were created instead of one. The query (source above) did return totals as expected, but I am unsure why 50 separate files were created.
I think this setup would work, but I am going to move on to other topics.