
  • admin 9:57 am on December 8, 2014
    Tags: Hive, Lingo

    Breaking down Hadoop Lingo Part 2: PIG & Hive 

    In my previous blog I introduced Part 1 of my blog series on Hadoop and covered the HDFS component. As the building block to the rest of Hadoop, HDFS plays an important role in the storage of data within Hadoop. In this blog I’m now going to cover the terms Pig and Hive.

    Whenever I mention Hadoop in conversations I often say the words pig and hive and always look at the person’s face to see their reaction. Most often the look on their face is one of bewilderment as the terms conjure up thoughts of something else.

    However, Pig and Hive are two very important components of the Hadoop environment: they work on top of HDFS via the MapReduce framework and provide us with the interface to mine the data contained within HDFS.

    When I speak to customers on the topic of Hadoop I always make the comment that it is very easy to get data into Hadoop, but hard to extract value from the data once it’s in there. By hard I mean that if you want users to interact with the data in HDFS, they will need to learn to script in either Pig or Hive. It’s not particularly hard to learn, but it’s yet another skill that your users will need to have. Skills in the market are still thin on the ground, so you’ll need to look at re-skilling your existing users instead.

    This blog is intended to provide a high-level overview of the two capabilities within Hadoop. If you are looking for a more in-depth comparison of the two, I highly recommend the article by Alan Gates from Yahoo.

    What is Pig?

    Pig is a scripting platform that allows users to write MapReduce operations using a scripting language called Pig Latin. Pig Latin is a data flow language, whereas SQL is a declarative language. SQL is great for asking a question of your data, while Pig Latin allows you to write a data flow that describes how your data will be transformed. Therefore the types of operations it is used for are filtering, transforming, joining and writing data. These operations are exactly what MapReduce was intended for.

    The Pig platform itself takes the Pig Latin script and transforms it into a MapReduce job that is then executed against a dataset. It is designed for running batch operations against large data sets. The types of use cases it is ideal for are listed below, followed by a small sketch of what a Pig data flow looks like in practice:

    • ETL of data within Hadoop
    • Iterative data processing
    • Initial research on raw data sets
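
    To make the data flow idea concrete, here is a minimal sketch of a Pig Latin flow submitted from Java through Pig's PigServer embedding API. The input path, the tab-delimited schema and the output path are purely hypothetical, and it assumes a client machine with Hadoop and Pig already configured:

      import org.apache.pig.PigServer;

      public class PigFlowExample {
          public static void main(String[] args) throws Exception {
              // Run Pig in MapReduce mode so the statements below are
              // compiled into MapReduce jobs against the cluster.
              PigServer pig = new PigServer("mapreduce");

              // A small Pig Latin data flow: load raw web logs, keep only
              // server errors, group them by page and count the hits.
              pig.registerQuery("logs = LOAD '/data/weblogs.txt' USING PigStorage('\\t') "
                      + "AS (page:chararray, status:int);");
              pig.registerQuery("errors = FILTER logs BY status >= 500;");
              pig.registerQuery("by_page = GROUP errors BY page;");
              pig.registerQuery("counts = FOREACH by_page GENERATE group, COUNT(errors);");

              // Writing the result out is what triggers the MapReduce execution.
              pig.store("counts", "/data/error_counts");
          }
      }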

     What is Hive?

    Although Pig is very powerful and useful, it still requires you to master a new language. To overcome this barrier, the smart cookies at Facebook developed Hive, which allows people familiar with SQL (Structured Query Language) to write HQL (Hive Query Language) statements. An HQL statement is read by the Hive service and then transformed into a MapReduce job. This approach makes it very quick for people who already know the syntax of SQL to pick up and write Hive queries; a short sketch of querying Hive appears after the caveats below. There are a few caveats, however, and these include:

    • HQL is not a full replica of SQL. Therefore you need to be aware of what HQL cannot do that you would typically do in SQL.
    • Hive is not suited to the simple, quick transactional statements that SQL can perform. Keep in mind that HQL is transformed into a MapReduce job which is then executed against a large dataset, so don’t expect blazingly fast response times; MapReduce is not intended for this purpose.
    • Hive only does read-based queries, not write operations. Forget about updates and deletes in Hive, although such operations may become possible in the future.
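
    As a rough illustration of how familiar this feels to a SQL user, here is a minimal sketch of running an HQL statement over JDBC against a HiveServer2 instance. The host name, database, table and credentials are hypothetical, and it assumes the Hive JDBC driver is on the classpath:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class HiveQueryExample {
          public static void main(String[] args) throws Exception {
              // Register the HiveServer2 JDBC driver and connect to a
              // (hypothetical) HiveServer2 instance on the edge node.
              Class.forName("org.apache.hive.jdbc.HiveDriver");
              Connection conn = DriverManager.getConnection(
                      "jdbc:hive2://hadoop-edge:10000/default", "analyst", "");
              try (Statement stmt = conn.createStatement()) {
                  // The HQL below reads like ordinary SQL, but Hive compiles it
                  // into MapReduce jobs, so expect batch rather than OLTP latency.
                  ResultSet rs = stmt.executeQuery(
                          "SELECT page, COUNT(*) AS hits "
                        + "FROM weblogs WHERE status >= 500 "
                        + "GROUP BY page ORDER BY hits DESC LIMIT 10");
                  while (rs.next()) {
                      System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                  }
              } finally {
                  conn.close();
              }
          }
      }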

    Therefore, in summary, both Pig and Hive get converted to MapReduce jobs at the end of the day; however, each lends itself to particular purposes. The following table lists some particular functions and comments on both Pig and Hive.

    If we look at the high-level architecture of Pig and Hive and their position in the overall Hadoop environment, you can see how the two components interact with MapReduce to eventually get access to the data.

    Source: http://www.venturesity.com

    So the choice is up to you and what you are most comfortable with. The openness of Hadoop really gives you choice and flexibility when it comes to deciding what tool to use. If you are from the SQL world then you’ll find Hive the easiest to get used to. However, if you are competent in the Python language you’ll probably find Pig the most applicable. Keep in mind the limitations of both and you’ll be on your way to developing applications that extract value from the data in Hadoop in no time at all!

    Ben Davis is a Senior Architect for Teradata Australia based in Canberra. With 18 years of experience in consulting, sales and technical data management roles, he has worked with some of the largest Australian organisations in developing comprehensive data management strategies. He holds a degree in Law, a postgraduate Masters in Business and Technology, and is currently finishing his PhD in Information Technology with a thesis on executing large-scale algorithms within cloud environments.

  • admin 9:51 am on October 30, 2014
    Tags: HDFS, Lingo

    Breaking down Hadoop Lingo Part 1: HDFS 

    I have just come off another Hadoop training course last week, this time centered around Hive and Pig. Keeping up to date on what’s happening in the Hadoop space is time consuming. Just recently Teradata announced a partnership with the other big Hadoop player, Cloudera.

    Therefore keeping track of the bugs, releases, what other people are building, how it is being used and where the platform is heading is a never-ending cycle of reading and research. In my previous blogs I’ve covered the value of Hadoop and how important it is to have a metadata strategy for Hadoop.

    Many people have a vague understanding of what Hadoop does and the business benefits it provides, but others need to delve into the detail. Over the next few blogs, I’m going to cover some of the basic individual components of Hadoop in detail: what they do, some use cases and why they are important. The best approach, I think, is to start from the ground and move up. Therefore blog #1 will focus on HDFS (Hadoop Distributed File System).

    The purpose of HDFS is to distribute a large data set across a cluster of commodity Linux machines in order to later use the computing resources on those machines to perform batch data analytics. One of the key attractions of Hadoop is its ability to run on cheap hardware, and HDFS is the component that provides this capability. HDFS provides very high throughput access to the data and is the perfect environment for storing large data sets. The throughput rates make it great for quickly landing data from multiple data sources such as sensors, RFID and web log data.

    Each HDFS cluster contains the following:

    • NameNode: Runs on a “master node” that tracks and directs the storage of the cluster.
    • DataNode: Runs on “slave nodes,” which make up the majority of the machines within a cluster. The NameNode instructs data files to be split into blocks, each of which is replicated three times and stored on machines across the cluster. These replicas ensure the entire system won’t go down if one server fails or is taken offline, a property known as “fault tolerance.”
    • Client machine: Neither a NameNode nor a DataNode, client machines have Hadoop installed on them. They’re responsible for loading data into the cluster, submitting MapReduce jobs and viewing the results of the job once complete. A minimal sketch of a client loading a file into the cluster follows this list.
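
    As that minimal sketch of the client role, the Java snippet below uses the HDFS API to copy a local file into the cluster. The file paths are hypothetical and it assumes the client machine has the cluster's Hadoop configuration (core-site.xml and hdfs-site.xml) on its classpath:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsLoadExample {
          public static void main(String[] args) throws Exception {
              // Picks up fs.defaultFS (the NameNode address) from the
              // Hadoop configuration files on the client's classpath.
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              // Copy a local file into the cluster. The NameNode decides how the
              // file is split into blocks and which DataNodes hold each replica.
              fs.copyFromLocalFile(new Path("/tmp/sensor_feed.csv"),
                                   new Path("/data/raw/sensor_feed.csv"));
              fs.close();
          }
      }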

    WORM– Write Once Read Many. HDFS uses a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. There is a plan to support appending-writes to files in the future.

    The following diagram downloaded from the Apache site outlines the basics of the HDFS architecture.


    Diagram 1: The HDFS architecture

    Data Replication within HDFS

    HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. The NameNode makes all decisions regarding replication of blocks.
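
    As a rough sketch of how an application can control these settings through the Java API, the snippet below creates a file with an explicit replication factor and block size and then raises the replication factor afterwards. The path, payload and chosen values are hypothetical:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsReplicationExample {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Create a file with a replication factor of 2 and a 128 MB
              // block size, overriding the cluster-wide defaults.
              Path p = new Path("/data/raw/archive.dat");
              FSDataOutputStream out = fs.create(p, true, 4096, (short) 2, 128L * 1024 * 1024);
              out.writeUTF("example payload");
              out.close();

              // The replication factor can also be changed later; the NameNode
              // schedules the additional copies in the background.
              fs.setReplication(p, (short) 3);
              fs.close();
          }
      }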

    Accessibility of HDFS

    I’ve been asked on several occasions whether the storage of files in Hadoop is proprietary, accessible only to applications that are part of Hadoop. In fact the opposite is true: HDFS can be accessed in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance.
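
    For example, reading a file back through the native Java API is just a matter of opening a stream. The sketch below reads the same hypothetical file loaded earlier:

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsReadExample {
          public static void main(String[] args) throws Exception {
              FileSystem fs = FileSystem.get(new Configuration());

              // Open an HDFS file as an ordinary input stream; the client reads
              // the blocks directly from whichever DataNodes hold them.
              try (BufferedReader reader = new BufferedReader(
                      new InputStreamReader(fs.open(new Path("/data/raw/sensor_feed.csv"))))) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      System.out.println(line);
                  }
              }
              fs.close();
          }
      }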


    Storage volumes

    Another interesting question that is often asked is: what volume of data can Hadoop hold? Well, how long is a piece of string? At a minimum, a three-datanode cluster based on the Teradata Hadoop appliance can hold 12.5TB per data node. That’s 37.5TB of storage at a minimum. Then add in an average compression factor of 3x and all of a sudden we are looking at roughly 112TB of data storage for a minimum Hadoop configuration. That’s some serious storage!

    Therefore, in summary, HDFS is often called the “secret sauce” of Hadoop. It is the layer where the data is stored and managed. Think of it like a standard file storage system with the ability to provide data replication across commodity hardware devices. A minimal install of Hadoop has a NameNode that manages the environment (metadata, file locations, etc.) and then multiple DataNodes where the data is stored in chunks across as many DataNodes as are available.

    Ben Davis is a Senior Architect for Teradata Australia based in Canberra. With 18 years of experience in consulting, sales and technical data management roles, he has worked with some of the largest Australian organisations in developing comprehensive data management strategies. He holds a degree in Law, a postgraduate Masters in Business and Technology, and is currently finishing his PhD in Information Technology with a thesis on executing large-scale algorithms within cloud environments.

     
