Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language used is Pig Latin, an abstraction layer on top of MapReduce. Developers who do not have in-depth knowledge of MapReduce programming, as well as data analysts, can use this platform to analyze big data.

Now let’s see how to use it with some examples.

The first step is to install Pig. I am using Ubuntu. Follow the installation steps below.

Installation

  1. Download Pig from Cloudera. I have used the one below. https://archive.cloudera.com/cdh5/cdh/5/pig-0.12.0-cdh5.2.0.tar.gz
  2. Extract the file pig-0.12.0-cdh5.2.0.tar.gz
  3. Move the extracted folder to /usr/local/pig
    sudo mv pig-0.12.0-cdh5.2.0 /usr/local/pig
  4. Then edit the profile file to update the environment variables.
    gedit ~/.bashrc

    Add the variables below,

    
    export PIG_HOME="/usr/local/pig"
    export PIG_CONF_DIR="$PIG_HOME/conf"
    export PIG_CLASSPATH="$PIG_CONF_DIR"
    export PATH="$PIG_HOME/bin:$PATH"
    
  5. Finally, source the profile to apply the changes.
    source ~/.bashrc

To verify the Pig installation, type the command below and check the version and other details.

pig -h

We can run a Pig program in either local mode or MapReduce mode. If you have a small data set and want to test your script, run it in local mode. Typing the command below opens a Grunt shell, where we can enter Pig Latin scripts and run them.

Local Mode:

pig -x local

Map Reduce Mode:

pig 

For this exercise, I have used the datasets below. Please download these files and move them to the /usr/local/pig folder. I am running all my scripts in local mode.

asthma_adults_stats.csv:

http://raw.githubusercontent.com/dkbalachandar/health-stats-application/master/app/resources/asthma_adults_stats.csv

NationalNames.csv:

https://raw.githubusercontent.com/dkbalachandar/spark-scala-examples/master/src/main/resources/NationalNames.csv

comments.json:


{
    "comments": [
        {
            "text": "test1",
            "time": "1486051170277",
            "userName": "test1"
        },
        {
            "text": "test1",
            "time": "1486051170277",
            "userName": "test1"
        }
    ]
}

We can load CSV data with or without a schema. But when loading JSON, you have to specify the schema, otherwise it will throw an error.

Every statement should end with a semicolon.

Pig has lots of operators and functions, so I am not going to show all of them in this example.

To exit the Grunt shell, you can use the ‘quit’ command.

To load the CSV data


data = load '/usr/local/pig/asthma_adults_stats.csv' using PigStorage(','); 
b = foreach data generate $0;
dump b;

To load the data with an actual schema.


data = load '/usr/local/pig/asthma_adults_stats.csv' using PigStorage(',') as (state:chararray, percentage:double); 
b = foreach data generate state, percentage;
dump b;
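Dumping a large relation prints every tuple. To preview only a few rows first, you can use the limit operator (a short sketch; limit is a standard Pig operator):


data = load '/usr/local/pig/asthma_adults_stats.csv' using PigStorage(',') as (state:chararray, percentage:double);
-- keep only the first 5 tuples
top5 = limit data 5;
dump top5;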

If you change a field name in the schema but still project the old name, you will end up with an error like this:


data = load '/usr/local/pig/asthma_adults_stats.csv' using PigStorage(',') as (state1:chararray, percentage:double); 
b = foreach data generate state, percentage;
dump b;

 Invalid field projection. Projected field [state] does not exist in schema: state1:chararray,percentage:double.
2017-02-01 15:36:23,942 [main] WARN  org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2017-02-01 15:36:23,942 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.plan.PlanValidationException: ERROR 1025: 
 Invalid field projection. Projected field [state] does not exist in schema: state1:chararray,percentage:double.
	at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:191)


To load the JSON data


data = load '/usr/local/pig/comments.json' using JsonLoader('comments:{(userName:chararray,text:chararray, time:chararray)}');
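The load above produces one record per line, each holding a single comments bag. To work with the individual comment tuples, you can un-nest the bag with Pig's built-in FLATTEN (a sketch):


data = load '/usr/local/pig/comments.json' using JsonLoader('comments:{(userName:chararray,text:chararray,time:chararray)}');
-- FLATTEN turns the bag into one output tuple per comment
flatComments = foreach data generate flatten(comments);
dump flatComments;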

Load the CSV data and perform Filter operation


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int,
Name:chararray, Year:int, Gender: chararray, Count:int);
filteredData = filter data by Year > 2010;
dump filteredData;

To order the data


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int,
Name:chararray, Year:int, Gender: chararray, Count:int);
filteredData = filter data by Year > 2010;
orderByData =  order filteredData by Year;
dump orderByData;
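By default, order sorts ascending. You can also sort descending and add a secondary sort key (a sketch using standard order options):


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int,
Name:chararray, Year:int, Gender: chararray, Count:int);
filteredData = filter data by Year > 2010;
-- desc reverses the sort order; Name breaks ties within a year
orderByDesc = order filteredData by Year desc, Name;
dump orderByDesc;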

To group the data


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int, Name:chararray, Year:int, Gender: chararray, Count:int);
groupByData =  group data by Name;
dump groupByData;
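Each grouped record is a pair: the group key and a bag holding every tuple with that key. You can aggregate over the bag, for example counting how many records each name has (a sketch using the built-in COUNT):


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int, Name:chararray, Year:int, Gender: chararray, Count:int);
groupByData = group data by Name;
-- 'group' is the key (the Name); 'data' is the bag of matching rows
nameCounts = foreach groupByData generate group, COUNT(data);
dump nameCounts;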

To extract the Name and Gender fields from the loaded data


b = foreach data generate Name, Gender;
dump b;

Filter all the female names, group the data by Year, and then count them


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int, Name:chararray, Year:int, Gender: chararray, Count:int);
filterData = filter data by Gender =='F';
groupData = group filterData by Year;
countData = foreach groupData generate group, COUNT($1);
dump countData;
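Note that COUNT($1) counts the number of name records in each year, not the number of births. To total the births instead, sum the Count column (a sketch using the built-in SUM):


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int, Name:chararray, Year:int, Gender: chararray, Count:int);
filterData = filter data by Gender == 'F';
groupData = group filterData by Year;
-- SUM over the bag's Count field gives total female births per year
totalByYear = foreach groupData generate group, SUM(filterData.Count);
dump totalByYear;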

To store the data


store filterData INTO '/tmp/output' USING PigStorage(',');
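The store command writes part files under the output directory, and it fails if the directory already exists, so remove /tmp/output before re-running. From the Grunt shell you can inspect the result with the built-in cat command:


cat /tmp/output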

To describe the relation


 describe filterData;

 Output:
 filterData: {Id: int,Name: chararray,Year: int,Gender: chararray,Count: int}
