Getting Started With Microsoft HDInsight
We've installed a set of samples that provide sample data and some basic jobs. If you'd like to run jobs on your Hadoop installation, select one of the tiles above.
What These Samples Are
These samples are intended to help you start learning quickly and to give you an extensible test bed for working through concepts. They provide an easy way to create data sets of various sizes and to run the samples over each size, so you can observe the effect of data size on jobs as you get to know your cluster.
What These Samples Are Not
These samples are not an exhaustive study of each of the technologies they use. Extensive documentation exists on the web for Java Map Reduce, Pig and Hive.
What's In These Samples
This set of samples is intended to get you up and running quickly with your new cluster. It contains the following artifacts:
- C# Map Reduce samples
- Java Map Reduce samples
- Hive scripts
- Pig scripts
- A data generation tool
- PowerShell scripts to build the projects, import data into HDFS and run the samples
The samples are arranged around the concept of scenarios. Currently, there is a single sample scenario, based on W3C log data:
- Generation and import of IIS W3C log data into HDFS in three sizes: 1 MB, 500 MB and 2 GB
- Three job types:
  - totalhits – Calculates the total requests for a given page
  - avgtime – Calculates the average time taken in seconds per page
  - errors – Calculates errors per page, per hour, per error for requests whose status was 404 or 500
- An implementation of each job in C#, Java, Pig and Hive
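To make the three job types concrete, here is a minimal sketch in plain Python of what each job computes. It is not one of the shipped C#, Java, Pig or Hive implementations. The record layout (date, time, page, status, time taken in milliseconds), the sample lines and the function names are all assumptions made for illustration; the generated log files may contain different fields.

```python
from collections import defaultdict

# Hypothetical, simplified W3C-style records (assumed layout, not the real generated data):
# date time cs-uri-stem sc-status time-taken(ms)
SAMPLE_LOG = """\
#Fields: date time cs-uri-stem sc-status time-taken
2013-01-01 10:05:12 /index.html 200 1500
2013-01-01 10:17:40 /index.html 404 500
2013-01-01 11:02:03 /about.html 200 2000
2013-01-01 11:30:59 /about.html 500 4000
2013-01-01 11:45:00 /about.html 500 3000
"""

def parse(log_text):
    """Yield (date, hour, page, status, time_taken_ms) for each data line."""
    for line in log_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and W3C directive lines
        date, time, page, status, taken = line.split()
        yield date, time[:2], page, int(status), int(taken)

def total_hits(records):
    """totalhits: number of requests per page."""
    hits = defaultdict(int)
    for _, _, page, _, _ in records:
        hits[page] += 1
    return dict(hits)

def avg_time(records):
    """avgtime: mean time taken per page, converted from (assumed) ms to seconds."""
    total_ms = defaultdict(int)
    count = defaultdict(int)
    for _, _, page, _, taken in records:
        total_ms[page] += taken
        count[page] += 1
    return {page: total_ms[page] / count[page] / 1000.0 for page in total_ms}

def errors(records):
    """errors: count of 404/500 responses per (page, date, hour, status)."""
    counts = defaultdict(int)
    for date, hour, page, status, _ in records:
        if status in (404, 500):
            counts[(page, date, hour, status)] += 1
    return dict(counts)

records = list(parse(SAMPLE_LOG))
print(total_hits(records))
print(avg_time(records))
print(errors(records))
```

Each function follows the same shape as a Map Reduce job: pick a key per record (page, or page plus hour and status), then aggregate the values that share that key.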
Loading the Data
There is a simple driver program for loading the data:
powershell -ExecutionPolicy unrestricted -F importdata.ps1
Running the Samples
There is a simple driver program for running the samples that takes four parameters: scenario, size, method and job. You can run the samples by calling:
powershell -ExecutionPolicy unrestricted -F runSamples.ps1 <scenario> <size> <method> <job>
size: small | medium | large
method: java | c# | pig | hive
job: totalhits | avgtime | errors
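If you want to script several runs, for example to compare sizes, a small helper can validate the arguments against the values listed above before building the command line. The following is a sketch only: the helper name build_command is invented here, and "w3c" is a hypothetical scenario name, since the accepted scenario values are not listed in this document.

```python
# Allowed values, copied from the parameter list above.
SIZES = ("small", "medium", "large")
METHODS = ("java", "c#", "pig", "hive")
JOBS = ("totalhits", "avgtime", "errors")

def build_command(scenario, size, method, job):
    """Return the argument list for one runSamples.ps1 invocation."""
    checks = (("size", size, SIZES), ("method", method, METHODS), ("job", job, JOBS))
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{name} must be one of {', '.join(allowed)}; got {value!r}")
    return ["powershell", "-ExecutionPolicy", "unrestricted",
            "-F", "runSamples.ps1", scenario, size, method, job]

# "w3c" is an assumed scenario name, used only for illustration.
cmd = build_command("w3c", "small", "hive", "totalhits")
print(" ".join(cmd))
# To launch the run from the samples directory, pass the list to
# subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to subprocess.run.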
Building the Samples
The samples currently ship with all of the required binaries, so building is not necessary. If you'd like to make changes, we've included PowerShell scripts to build the C# and Java sample implementations:
powershell -F buildsamples.ps1
To build the .NET samples, you'll need to have msbuild in your path. If not, you’ll see the following error:
Error Building .NET: The term 'msbuild' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
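Before running the build script, you can check whether msbuild is discoverable on your path. The snippet below is a minimal sketch using Python's standard shutil.which; the helper name find_on_path is invented for illustration and is not part of the samples.

```python
import shutil

def find_on_path(tool):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(tool)

location = find_on_path("msbuild")
if location is None:
    print("msbuild not found on PATH; the .NET build step will fail")
else:
    print("msbuild found at", location)
```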
If you are running Windows 8, you will need to add the following to your path (if you are not running Windows 8, you will have a different version number):