In the post Intro to Big Data Analytics – Part I, covered an introduction to big data, the evolution of big data, the characteristics of big data, and some factors that influence the adoption of big data.
This post will cover an introduction to Windows Azure HDInsight, an overview of the HDInsight Hadoop ecosystem, the Hadoop technologies on HDInsight, and a link to a webinar that walks through the steps of provisioning a HDInsight cluster, running HiveQL, and ultimately querying the data bringing it into Excel for a big data analytics solution.
Different Ways of Deploying Hadoop on Microsoft’s Azure Cloud
There are couple of different ways to deploy HDInsight in Microsoft Azure Cloud.
We can manually deploy and configure Hadoop on Microsoft’s Azure Cloud. Here are few highlights of this approach:
- Provision Virtual Machines on Azure (IaaS). VMs with any kind of operating system, depending upon the Hadoop distribution, can be provisioned.
- Install Hadoop on those virtual machines. Core Hadoop or other Hadoop distributions can be installed.
- Configure those VMs as a Hadoop cluster.
- This manual process of deploying Hadoop involves a lot of time and effort and is prone to errors unless worked on by experienced Hadoop Administrators.
Deployment using Azure HDInsight
We can deploy Hadoop as Azure HDInsight in a few simple, proven, and tested steps defined by Microsoft, which takes care of all the necessary configurations, Hadoop version management, etc.
Here are a few highlights of this approach:
- Easy to Deploy – We can deploy the cluster in less time with tested and proven methods / steps defined by Microsoft (via the Azure portal or through Windows PowerShell).
- Provision a cluster with the required number of nodes, depending upon processing needs, and destroy the cluster when you are done. This way, you pay only for what you use.
- By using Azure Storage to store the data, one can retain the data even after the cluster is destroyed unlike the other distributions where the data resides on the data nodes and cluster should be available to access that data.
- The data stored in Azure storage can be accessed / used by other processes / applications as appropriate.
- With Azure’s flexibility, one can easily scale up / down the resources depending upon the usage / need.
Now that we have built the foundation by understanding the basics of Microsoft HDInsight offering, we will start exploring other aspects of this distributions.
Introduction to Windows Azure HDInsight
Windows Azure HDInsight Service makes the Apache Hadoop available as a service in the cloud. You can quickly create an HDInsight cluster for processing Big Data. HDInsight offers several ways to provision an HDInsight cluster. In the webinar video the HDInsight cluster is provisioned using the Windows Azure Management portal. Some other options to mention are provisioning the cluster using Windows Azure powershell, HDInsight .NET SDK, and cross-platform command line interfaces.
Creating an Azure Storage Account
- HDInsight uses Azure Blob storage for storing data. It is called WASB or Azure Storage – Blob as Microsoft’s implementation of HDFS.
- When You specify an Azure storage account while provisioning a cluster. A specific Blob container from the account is used as the default file system, just like in HDFS.
- The HDIsight cluster is by default provisioned in the same data center as the storage account you specify.
- In the webinar video to simplfy things, only the default blob container and the default storage account are used.
Provisioning HDInsight Cluster
HDInsight makes Apache Hadoop available as a service in the cloud. Hadoop offers a distributed platform to store and manage large volumes of unstructured data, commonly called Big Data.
The creation of the HDInsight Cluster goes through 5 status/phases:
- Cluster Storage Provisioned
- Windows Azure VM Configuration
On average, the HDCluster provisioning process takes anywhere from 8 – 12 minutes to complete.
Pay As You Go
By using the Windows Azure storage you only pay for the storage and compute capacity used. When you are done you can simply delete the cluster when you no longer need it without paying any additional charges.
Provisioning HDInsight Requirements
Provisioning an HDInsight cluster involves 4 quick tasks. Please note it is a prerequesite that you must already have a Windows Azure Account set up. You can do so for free initially to try things out.
- Creating a Windows Azure Account Subscription
- Enabling the HDInsight Service on my Windows Azure account
- Provisioning and configuring a location in the cloud where I can store my data
- Creating and Provisioning the HDInsight cluster.
This may sound like a lot, but it is relatively straight forward. Definitely quicker than implementing and doing this on-premise. HDInsight supports multiple Hadoop cluster versions that can be deployed at any time. Each version choice provisions a specific version of the Hortonworks Data Platform (HDP) distribution and a set of components that are contained within that distribution. The component version associated with each HDInsight cluster version are itemized.
Overview Hadoop Ecosystem on HDInsight
Hadoop is a rapidly expanding technology stack that is the go-to solution for big data analysis. HDInsight is the framework for the Microsoft Azure cloud implementation of Hadoop. Hadoop clusters in HDInsight use versions of the Hortonworks Data Platform (HDP) distribution and the set of Hadoop components within that distribution. Apache Hadoop is a software framework for big data management and analysis. Apache Hadoop core provides reliable data storage with the Hadoop Distributed File System (HDFS), and a simple MapReduce programming model to process and analyze, in parallel, the data stored in this distributed system. HDFS uses data replication to address hardware failure issues that arise when deploying such highly distributed systems. For big data to provide actionable intelligence or insight, not only must the right questions be asked and data relevant to the issues be collected, the data must be accessible, cleaned, analyzed, and then presented in a useful way. That’s where Hadoop in HDInsight can help.
Hadoop Technologies on HDInsight
Ambari – Apache Ambari is for provisioning, managing and monitoring Apache Hadoop clusters. It includes an intuitive collection of operator tools and a robust set of APIs that hide the complexity of Hadoop, simplifying the operation of clusters.
Avro – The Microsoft .NET Library for Avro implements the Apache Avro compact binary data interchange format for serialization for the Microsoft .NET environment.
Hbase – is a non-relational database built on Hadoop and designed for large amounts of unstructured and semi-structured data – potentially billions of rows times millions of columns. HBase clusters on HDInsight are configured to store data directly in Azure Blob storage, with low latency and increased elasticity.
HDFS – Hadoop Distributed File System (HDFS) is a distributed file system that, with MapReduce and YARN, is the core of the Hadoop ecosystem. HDFS is the standard file system for Hadoop clusters on HDInsight.
Hive – Lightweight DW Layer. Is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage using a SQL-like language call HiveQL. Hive, like Pig, is an abstraction on top of MapReduce and when run, Hive translates queries into a series of MapReduce jobs. Hive is conceptually closer to a relational database management system than Pig, and is therefore appropriate for use with more structured data.
Mahout – Is a scalable library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior.
MapReduce and YARN: Distributed processing and resource management
Oozie: Workflow management
Pig: Simpler scripting for MapReduce transformations
Sqoop: Data import and export
Storm: Real-time processing of fast, large data streams
Zookeeper: Coordinates processes in distributed systems
Typical Architecture of HDInsight
Here are the highlights of Microsoft HDInsight Architecture:
- HDInsight is built on top of Hortonworks Data Platform (HDP).
- HDInsight is 100% compliant with Apache Hadoop.
- HDInsight is offered in the cloud as Azure HDInsight on Microsoft’s Azure Cloud.
- HDInsight is tightly integrated with Azure Cloud and various other Microsoft Technologies.
- HDInsight can be installed on the Windows operating system unlike the majority of the distributions which are based on the Linux operating system.
- HDInsight also works in a master-slave fashion. Master / Control Node (Head node) controls the overall operation of the Cluster. A Secondary Head node is also included as part of Azure HDInsight deployments.
- HDInsight can be configured to store the data either on HDFS within HDInsight cluster nodes or on Azure Blob Storage. The most common approach is to use Azure Storage to store the data, intermediate results, and the output and not store data on individual nodes.
- User data (Data to be processed) and job metadata resides in Windows Azure Storage – Blob (WASB). WASB is an implementation of HDFS on Azure Blob Storage.
- Individual nodes of the cluster offer MapReduce functionality.
- The Master node reads the job metadata and the user data from Blob Storage and uses it to do the processing. The intermediate and the final results are stored in the Blob Storage
Business Intelligence Tools and HDInsight
Familiar business intelligence (BI) tools – such as Excel, PowerPivot, SQL Server Analysis Services and Reporting Services – retrieve, analyze, and report data integrated with HDInsight using either the Power Query add-in or the Microsoft Hive ODBC Driver. Microsoft Power Query for Excel is an Excel add-in that enhances the self-service Business Intelligence experience in Excel by simplifying data discovery, access and collaboration. Power Query for Excel is a downloadable add-in from Microsoft. Typically, Microsoft provides regular updates to this add-in. Roughly once-a-month. There is a 32- and 64-bit option for the add-in.
Power Query for Excel
Microsoft Power Query for Excel enhances self-service business intelligence (BI) for Excel with an intuitive and consistent experience for discovering, combining, and refining data across a wide variety of sources including relational, structured and semi-structured, OData, Web, Hadoop, Azure Marketplace, and more. Power Query also provides you with the ability to search for public data from sources such as Wikipedia.
With Power Query, you can share and manage queries as well as search data within your organization. Users in the enterprise can find and use these shared queries (if it is shared with them) to use the underlying data in the queries for their data analysis and reporting.
With Power Query, you can
- Find and connect data across a wide variety of sources.
- Merge and shape data sources to match your data analysis requirements or prepare it for further analysis and modeling by tools such as Power Pivot and Power View.
- Create custom views over data.
- Use the JSON parser to create data visualizations over Big Data and Azure HDInsight.
- Perform data cleansing operations.
- Import data from multiple log files.
- Perform Online Search for data from a large collection of public data sources including Wikipedia tables, a subset of Microsoft Azure Marketplace, and a subset of Data.gov.
- Create a query from your Facebook likes that render an Excel chart.
- Pull data into Power Pivot from new data sources, such as XML, Facebook, and File Folders as refreshable connections.
- With Power Query, you can share and manage queries as well as search data within your organization.
Power Query Data Sources
- Web page
- Excel or CSV file
- XML file
- Text file
- SQL Server database
- Microsoft Azure SQL Database
- Access database
- Oracle database
- IBM DB2 database
- MySQL database
- PostgreSQL Database
- Sybase Database
- Teradata Database
- SharePoint List
- OData feed
- Microsoft Azure Marketplace
- Hadoop File (HDFS)
- Microsoft Azure HDInsight
- Microsoft Azure Table Storage
- Active Directory
- Microsoft Exchange
PowerView for Excel
Power View is an interactive data exploration, visualization, and presentation experience that encourages intuitive ad-hoc reporting. Power View is a feature of Microsoft Excel 2013, and of Microsoft SharePoint Server 2010 and 2013 as part of the SQL Server 2012 Service Pack 1 Reporting Services Add-in for Microsoft SharePoint Server Enterprise Edition.
Power View provides visualization of PowerPivot Data Models and SSAS tabular mode databases. Power View doesn’t work directly on top of relational data or local Excel data, but Excel 2013 can create a data model on the fly when needed.
In Excel 2013, every workbook can contain an internal Data Model that you can modify in Excel, in Power Pivot, and even in a Power View sheet in Excel. A workbook can contain only one internal Data Model, and you can base a Power View in sheet on the Data Model in the same workbook or on an external data source – another workbook, or a SQL Server 2012 Analysis Services (SSAS) tabular model. A single Excel workbook can contain multiple Power View sheets, and each of the sheets can be based on a different data model.
Analyzing Sensor Data using Hive with HDInsight
Hive ODBC Driver Download
HDInsight Get Started
Microsoft Azure Portal – Preview
Blog: Big Data for Everyone
If you are visiting this post, I hope you found the information useful.