In this post, we will discuss what Big Data is, give an overview of HDInsight components, common terminology, and scenarios, and point to resources for using Hadoop in HDInsight. Additionally, there will be links to a series of webinars (recorded by yours truly) that use data created in HDInsight for a Big Data analysis solution.
This article is intended as an introduction to the concept of Big Data analysis. It discusses the topics in general terms and shares information that can serve as a guide and reference, along with some of the findings I have made in relation to HDInsight.
Introduction to Big Data
What is Big Data? The definition is as varied as the results returned when you type this phrase into your favorite search engine. When I did this, more than 60 million results were returned. So, we will work with the most common definition currently in use. The definition has evolved over time, and as technologies in the Big Data space evolve, so will the definition. However, I expect it to remain relative to a given set of circumstances, much as it is today.
Wikipedia defines Big Data as a “collection of data sets so large and complex it becomes difficult to process using on-hand database management tools or traditional data processing applications”. In simple terms, “Big Data” consists of very large volumes of heterogeneous data that is generated, often at high speed. These data sets cannot be managed and processed using the traditional data management tools and applications at hand. Big Data requires a new set of tools, applications, and frameworks to process and manage the data.
There are some things that are so big that they have implications for everyone, whether we want them to or not. Big Data is one of those things. It is completely transforming the way we do business and is affecting most other parts of our lives. The basic idea behind the phrase ‘Big Data’ is that everything we do is increasingly leaving a digital trace (data), which we (and others) want to use and analyze. Therefore, Big Data refers to our ability to make use of the ever-increasing volumes of data.
Evolution of Big Data
Data has always been around, and there has been a need to store, process, and manage it since the beginning of human civilization and societies. However, the amount and type of data captured, stored, processed, and managed depended then, and still depends now, on various factors, including the necessity felt by humans, the available tools/technologies for storage, processing, and management, the effort/cost involved, and the ability to gain insights from the data and make decisions.
Going back to ancient times, humans used very primitive ways of capturing/storing data, like carving on stone, metal sheets, wood, etc. Then, with new inventions and advancements over the centuries, humans started capturing data on paper, cloth, etc. As time progressed, the medium of capture/storage/management became punched cards, followed by magnetic drums, laser disks, floppy disks, and magnetic tapes, and today we store data on various devices such as USB drives, CDs, hard drives, etc. In fact, the curiosity to capture, store, and process data has enabled human beings to pass on knowledge and research from one generation to the next, so that the next generation does not have to “re-invent the wheel”.
As we can clearly see from this trend, the capacity of data storage has been increasing exponentially, and today, with the availability of cloud infrastructure, one can potentially store unlimited amounts of data. Today, terabytes and petabytes of data are being generated, captured, processed, stored, and managed.
Characteristics of Big Data
The Three V’s of Big Data
When do we say we are dealing with Big Data? For some people 100GB might seem big, for others 1TB, and for others 10TB or more. The term is qualitative and cannot really be quantified. Hence, we identify Big Data by a few characteristics that are specific to it. These characteristics are popularly known as the Three V’s of Big Data.
The three V’s of Big Data are Volume, Velocity, and Variety.
Volume refers to the size of the data that we are working with. With the advancement of technology and the rise of social media, the amount of data is growing very rapidly. This data is spread across different places, in different formats, in large volumes ranging from gigabytes to terabytes, petabytes, and even more. Today, data is not only generated by humans; large amounts of data are generated by machines, and this machine-generated data surpasses human-generated data. This size aspect of data is referred to as Volume in the Big Data world.
Velocity refers to the speed at which data is generated. Different applications have different latency requirements, and in today’s competitive world, decision makers want the necessary data/information in the least amount of time possible, generally in near real time, or in real time in certain scenarios. In different fields and different areas of technology, we see data being generated at different speeds. A few examples include trading/stock exchange data, tweets on Twitter, status updates/likes/shares on Facebook, and many others. This speed aspect of data generation is referred to as Velocity in the Big Data world.
Variety refers to the different formats in which data is generated/stored. Different applications generate/store data in different formats. In today’s world, large volumes of unstructured data are being generated in addition to the structured data generated in enterprises. Until the advancements in Big Data technologies, the industry didn’t have any powerful and reliable tools/technologies that could work with the volume of unstructured data we see today. To stay competitive, organizations can no longer rely only on the structured data from enterprise databases/warehouses; they are also forced to consume lots of data generated both inside and outside of the enterprise, like click stream data, social media, etc. Apart from the traditional flat files, spreadsheets, relational databases, etc., we have a lot of unstructured data stored in the form of images, audio files, video files, web logs, sensor data, and many others. This aspect of varied data formats is referred to as Variety in the Big Data world.
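To make the idea of Variety a little more concrete, here is a minimal Python sketch that takes three differently shaped inputs (a CSV row, a JSON document, and a web log line) and maps them onto one common record shape. Everything in it is an assumption made for illustration: the sample data, the field names `user` and `action`, and the simplified log layout are invented and do not come from HDInsight or any particular product.

```python
import csv
import io
import json
import re

# Hypothetical sample inputs, invented for illustration only.
CSV_TEXT = "user,action\nalice,purchase\n"
JSON_TEXT = '{"user": "bob", "action": "like"}'
LOG_LINE = '203.0.113.7 - carol [10/Oct/2013:13:55:36 +0000] "GET /cart HTTP/1.1" 200'

# Simplified web log layout (an assumption): host, ident, user, [time], "METHOD path ..."
LOG_PATTERN = re.compile(
    r'^\S+ \S+ (?P<user>\S+) \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)'
)


def from_csv(text):
    """Structured source: every row already has named columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"source": "csv", "user": row["user"], "action": row["action"]}
        for row in reader
    ]


def from_json(text):
    """Semi-structured source: a single JSON document."""
    doc = json.loads(text)
    return [{"source": "json", "user": doc["user"], "action": doc["action"]}]


def from_log(line):
    """Loosely structured source: fields must be extracted with a pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # skip lines that do not fit the assumed layout
    action = match.group("method") + " " + match.group("path")
    return [{"source": "log", "user": match.group("user"), "action": action}]


def normalize_all():
    """Combine all three sources into one list of uniform records."""
    return from_csv(CSV_TEXT) + from_json(JSON_TEXT) + from_log(LOG_LINE)


if __name__ == "__main__":
    for record in normalize_all():
        print(record)
```

Each source needs its own parsing logic before the records can be analyzed together, which is the practical cost of Variety. Images, audio, and video would need far more than a regular expression.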
Big data refers to data being collected in ever-escalating volumes, at increasingly high velocities, and in a widening variety of unstructured formats and variable semantic contexts.
Big data describes any large body of digital information from the text in a Twitter feed, to the sensor information from industrial equipment, to information about customer browsing and purchases on an online catalog. Big data can be historical (meaning stored data) or real-time (meaning streamed directly from the source).
You may be less than impressed with this overly simplistic definition, but there is more to it than meets the eye. For Big Data to provide actionable intelligence or insight, not only must the right questions be asked and relevant data be collected, but the data must also be accessible, cleaned, analyzed, and then presented in a useful way.
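The clean, analyze, and present steps can be sketched on a toy scale in Python. This is only an illustration under stated assumptions: the events, the `user` and `page` fields, and the cleaning rules (drop incomplete records, trim and lowercase page names) are all hypothetical, and a real Big Data solution would run equivalent steps on a distributed platform rather than in a single script.

```python
from collections import Counter

# Hypothetical raw events, invented for illustration; some are incomplete or messy.
RAW_EVENTS = [
    {"user": "alice", "page": "/Home "},
    {"user": "bob", "page": "/cart"},
    {"user": "", "page": "/cart"},      # missing user: dropped during cleaning
    {"user": "carol", "page": None},    # missing page: dropped during cleaning
    {"user": "alice", "page": "/CART"},
]


def clean(events):
    """Drop incomplete records and normalize page names (trim, lowercase)."""
    cleaned = []
    for event in events:
        user = event.get("user")
        page = event.get("page")
        if not user or not page:
            continue
        cleaned.append({"user": user, "page": page.strip().lower()})
    return cleaned


def analyze(events):
    """Answer one question: how many views did each page receive?"""
    return Counter(event["page"] for event in events)


def present(counts):
    """Render the result as a short, human-readable report."""
    lines = [f"{page}: {views} view(s)" for page, views in counts.most_common()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(present(analyze(clean(RAW_EVENTS))))
```

Note that the question being asked ("which pages are viewed most?") determines what counts as clean data here. A different question would likely call for different cleaning rules.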
Big Data Adoption
Data has always been there and is growing at a rapid pace. One question asked quite often is “Why are organizations taking an interest in silos of data that were not utilized effectively in the past, and embracing Big Data technologies today?”. The adoption of Big Data technologies is driven by various factors, including the following:
- Availability of Commodity Hardware
- Availability of Open Source Operating Systems
- Availability of Cheaper Storage
- Availability of Open Source Tools/Software
There is a lot of data being generated outside the enterprise, and organizations are compelled to consume that data to stay ahead of the competition. Often, organizations are interested in only a subset of this large volume of data. The volume of structured and unstructured data being generated within the enterprise is also very large and cannot be handled effectively using traditional data management and processing tools.
In Part 2 of this series, I will introduce Windows Azure HDInsight. There will be a webinar video available discussing different ways of deploying Hadoop on Microsoft’s Azure Cloud and introducing the Hadoop ecosystem on HDInsight. The post will conclude with a link to a webinar that walks through the steps of setting up an Azure storage account, provisioning an HDInsight cluster, and building a big data analytics solution using Excel and HiveQL.