Apache HBase – Java Client API with Docker HBase

HBase is the Hadoop database: a distributed, scalable, big data store. We can use HBase when we need random, real-time read/write access to our Big Data.

I have used standalone HBase running inside Docker for this exercise.

The first step is to install Docker if you don't have it already, and then follow the steps below to set up Docker HBase.

  1. Refer to the repository https://github.com/sel-fish/hbase.docker and follow the instructions there to install Docker HBase.
  2. I have an Ubuntu VM, so I used my machine's hostname instead of ‘myhbase’. If you use your own hostname, you don't need to update the /etc/hosts file, but do check /etc/hosts and verify that it contains an entry like the one below.

    
    <<MACHINE_IP_ADDRESS>> <<HOSTNAME>>
    
    
  3. My docker run command looks like the one below.
    
    docker run -d -h $(hostname) -p 2181:2181 -p 60000:60000 -p 60010:60010 -p 60020:60020 -p 60030:60030 --name hbase debian-hbase
    
    
  4. Once you are done, check http://localhost:60010 (Master) and http://localhost:60030 (Region Server).

pom.xml


<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>1.3.0</version>
</dependency>

To access the HBase shell, follow the steps below:


1. Run 'docker exec -it hbase bash' to enter the container
2. Go to the '/opt/hbase/bin/' folder
3. Run './hbase shell' and it will open up the HBase shell.

You can use the HBase shell available inside the Docker container and run commands to perform all the operations (create table, list, put, scan, get).


root@HOST-NAME:/opt/hbase/bin# ./hbase shell
2017-02-15 14:55:26,117 INFO  [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
2017-02-15 14:55:27,095 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 1.2.0-cdh5.7.0, r49168a0b3987d5d8b1f1b359417666f477a0618e, Wed Jul 20 23:13:03 EDT 2016

hbase(main):001:0> status
1 active master, 0 backup masters, 1 servers, 0 dead, 3.0000 average load

hbase(main):002:0> list
TABLE                                                                                                                                                                                         
customer                                                                                                                                                                                      
1 row(s) in 0.0330 seconds

=> ["customer"]
hbase(main):003:0> create 'user','personal'
0 row(s) in 1.2540 seconds

=> Hbase::Table - user
hbase(main):004:0> list
TABLE                                                                                                                                                                                         
customer                                                                                                                                                                                      
user                                                                                                                                                                                          
2 row(s) in 0.0080 seconds

=> ["customer", "user"]
hbase(main):005:0> list 'user'
TABLE                                                                                                                                                                                         
user                                                                                                                                                                                          
1 row(s) in 0.0090 seconds

=> ["user"]
hbase(main):006:0> put 'user','row1','personal:name','bala'
0 row(s) in 0.1500 seconds

hbase(main):007:0> put 'user','row2','personal:name','chandar'
0 row(s) in 0.0110 seconds

hbase(main):008:0> scan 'user'
ROW                                              COLUMN+CELL                                                                                                                                  
 row1                                            column=personal:name, timestamp=1487170597246, value=bala                                                                                    
 row2                                            column=personal:name, timestamp=1487170608622, value=chandar                                                                                 
2 row(s) in 0.0700 seconds

hbase(main):009:0> get 'user' , 'row2'
COLUMN                                           CELL                                                                                                                                         
 personal:name                                   timestamp=1487170608622, value=chandar                                                                                                       
1 row(s) in 0.0110 seconds



The hbase-site.xml looks like this. It is available in the Docker container under /opt/hbase/conf.

hbase-site.xml


<configuration>
  <property>
    <name>hbase.master.port</name>
    <value>60000</value>
  </property>
  <property>
    <name>hbase.master.info.port</name>
    <value>60010</value>
  </property>
  <property>
    <name>hbase.regionserver.port</name>
    <value>60020</value>
  </property>
  <property>
    <name>hbase.regionserver.info.port</name>
    <value>60030</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.localcluster.port.ephemeral</name>
    <value>false</value>
  </property>
</configuration>
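
If you run the Java client from your host machine rather than inside the container, the client configuration needs to know where ZooKeeper is. Below is a minimal sketch of that wiring; the class name ConnectionCheck is mine, and HOSTNAME is a placeholder for the hostname you passed to 'docker run -h' (it mirrors the commented-out lines in the CreateTable example further down).


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ConnectionCheck {

    public static void main(String... args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        // Point the client at the Docker HBase instance. HOSTNAME is a placeholder for the
        // hostname passed to 'docker run -h'; it must resolve via /etc/hosts on the host machine.
        config.set("hbase.zookeeper.quorum", "HOSTNAME");
        config.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection connection = ConnectionFactory.createConnection(config);
             Admin admin = connection.getAdmin()) {
            // Printing the cluster status confirms the client can reach the master
            System.out.println("Cluster status: " + admin.getClusterStatus());
        }
    }
}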

Create Table



import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTable {

    public static void main(String... args) throws Exception {
        System.out.println("Creating Htable starts");
        Configuration config = HBaseConfiguration.create();
        //config.set("hbase.zookeeper.quorum", "HOSTNAME");
        //config.set("hbase.zookeeper.property.clientPort","2181");
        Connection connection = ConnectionFactory.createConnection(config);
        Admin admin = connection.getAdmin();
        TableName tableName = TableName.valueOf("customer");
        if (!admin.tableExists(tableName)) {
            HTableDescriptor htable = new HTableDescriptor(tableName);
            htable.addFamily(new HColumnDescriptor("personal"));
            htable.addFamily(new HColumnDescriptor("address"));
            admin.createTable(htable);
        } else {
            System.out.println("customer Htable is exists");
        }
        admin.close();
        connection.close();
        System.out.println("Creating Htable Done");
    }
}
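
HTableDescriptor and HColumnDescriptor work fine with the hbase-client 1.3.0 dependency used here, but they are deprecated in newer client versions. Below is a minimal sketch of the same table creation with the builder API, assuming you switch to an hbase-client 2.x dependency; the class name CreateTableWithBuilder is mine.


import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableWithBuilder {

    public static void main(String... args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            TableName tableName = TableName.valueOf("customer");
            if (!admin.tableExists(tableName)) {
                // Build the table descriptor with the 'personal' and 'address' column families
                admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("address"))
                        .build());
            }
        }
    }
}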

List Tables



import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ListTable {

    public static void main(String... args) throws Exception {
        Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        Admin admin = connection.getAdmin();
        HTableDescriptor[] tableDescriptors = admin.listTables();
        for (HTableDescriptor tableDescriptor : tableDescriptors) {
            System.out.println("Table Name:"+ tableDescriptor.getNameAsString());
        }
        admin.close();
        connection.close();
    }
}


Delete Table



import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

import java.io.IOException;

public class DeleteTable {

    public static void main(String... args) {

        System.out.println("DeleteTable Starts");
        Connection connection = null;
        Admin admin = null;

        try {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            TableName tableName = TableName.valueOf("customer");
            admin = connection.getAdmin();
            admin.disableTable(tableName);
            admin.deleteTable(tableName);
            if(!admin.tableExists(tableName)){
                System.out.println("Table is deleted");
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (admin != null) admin.close();
                if (connection != null) connection.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        System.out.println("DeleteTable Done");
    }
}
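
If you only want to wipe the rows but keep the table definition, the Admin API also has truncateTable. Below is a minimal sketch under the same connection setup; the class name TruncateTable is mine.


import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TruncateTable {

    public static void main(String... args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            TableName tableName = TableName.valueOf("customer");
            // Like deleteTable, truncateTable requires the table to be disabled first
            admin.disableTable(tableName);
            // 'false' means do not preserve the existing region split points
            admin.truncateTable(tableName, false);
        }
    }
}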

Delete Data



import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteData {

    public static void main(String... args) throws Exception {
        System.out.println("DeleteData starts");
        Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        TableName tableName = TableName.valueOf("customer");
        Table table = connection.getTable(tableName);
        Delete delete = new Delete(Bytes.toBytes("row1"));
        table.delete(delete);
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        System.out.println("result:"+result);
        if (result.value() == null) {
            System.out.println("Delete Data is successful");
        }
        table.close();
        connection.close();
    }

}
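
The Delete above removes the whole row. You can also delete just a single column or a whole column family for a row; here is a minimal sketch (the class name DeleteColumnData is mine).


import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteColumnData {

    public static void main(String... args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("customer"))) {
            Delete delete = new Delete(Bytes.toBytes("row1"));
            // Delete only the latest version of personal:name
            delete.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            // Delete the entire 'address' column family for this row
            delete.addFamily(Bytes.toBytes("address"));
            table.delete(delete);
        }
    }
}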

To populate the HBase table:


import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PopulateData {

    public static void main(String... args) throws Exception {

        Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());

        TableName tableName = TableName.valueOf("customer");
        Table table = connection.getTable(tableName);

        Put p = new Put(Bytes.toBytes("row1"));
        //Customer table has personal and address column families. So insert data for 'name' column in 'personal' cf
        // and 'city' for 'address' cf
        p.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("bala"));
        p.addColumn(Bytes.toBytes("address"), Bytes.toBytes("city"), Bytes.toBytes("new york"));
        table.put(p);
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
        byte[] city = result.getValue(Bytes.toBytes("address"), Bytes.toBytes("city"));
        System.out.println("Name: " + Bytes.toString(name) + " City: " + Bytes.toString(city));
        table.close();
        connection.close();
    }
}
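
When you have more than a handful of rows to insert, it is usually better to send the Puts as one batch instead of one RPC per row. Below is a minimal sketch using Table.put(List<Put>); the class name PopulateDataBatch and the generated row keys are mine.


import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PopulateDataBatch {

    public static void main(String... args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("customer"))) {
            List<Put> puts = new ArrayList<>();
            for (int i = 1; i <= 5; i++) {
                Put put = new Put(Bytes.toBytes("row" + i));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("user" + i));
                puts.add(put);
            }
            // Send all the Puts in one batch instead of one RPC per row
            table.put(puts);
        }
    }
}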

To scan the table


import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class ScanTable {

    public static void main(String... args) {
        Connection connection = null;
        ResultScanner scanner = null;
        try {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            TableName tableName = TableName.valueOf("customer");
            Table table = connection.getTable(tableName);
            Scan scan = new Scan();
            // Scanning the required columns
            scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));

            scanner = table.getScanner(scan);

            // Reading values from scan result
            for (Result result = scanner.next(); result != null; result = scanner.next())
                System.out.println("Found row : " + result);
            //closing the scanner
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (scanner != null) scanner.close();
            if (connection != null) try {
                connection.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}


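A full-table scan can be expensive on a large table. If you know the row key range you care about, you can bound the scan. Below is a minimal sketch using setStartRow/setStopRow (the stop row is exclusive); the class name ScanRange and the row keys are mine.


import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanRange {

    public static void main(String... args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("customer"))) {
            Scan scan = new Scan();
            // Scan rows from 'row1' (inclusive) up to 'row3' (exclusive)
            scan.setStartRow(Bytes.toBytes("row1"));
            scan.setStopRow(Bytes.toBytes("row3"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println("Found row : " + result);
                }
            }
        }
    }
}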

Acceptance testing with Cucumber and Capybara

Cucumber is a software testing tool used for defining and running acceptance test cases.

Cucumber itself is written in the Ruby programming language. It uses the ‘Gherkin’ language to define the test cases, and acceptance tests are written in a behavior-driven development style.

Please refer to the links below to know more about Cucumber (features, step definitions):

https://cucumber.io/

https://en.m.wikipedia.org/wiki/Cucumber_(software)

Capybara is a library used for simulating user actions. Please refer to https://github.com/teamcapybara/capybara to know more about Capybara.

In this article, I am going to show how we can test the “Yelp” website with Cucumber and Capybara.

My objective is to write an acceptance test which opens a browser, goes to the ‘Yelp’ site, searches for a restaurant, and validates the results.

Here is the feature file. It's written in the Gherkin language.

Refer to the code at https://github.com/dkbalachandar/ruby-cucumber-test

ruby-cucumber-test/features/yelp.feature

@run
Feature: Search Restaurants
  Scenario: Go to yelp and search for valid restaurant
    Given a user goes to Yelp
    When Search for taco bell
    Then See the List of taco bell Restaurants

   Scenario: Go to yelp and search for restaurant
    Given a user goes to Yelp
    When Search for Qboba
    Then See the List of Qboba Restaurants

   Scenario: Go to yelp and search for restaurant
    Given a user goes to Yelp
    When Search for Chipotle
    Then See the List of Chipotle Restaurants
  
  Scenario: Go to yelp and search for invalid restaurant
    Given a user goes to Yelp
    When Search for hhahsdahsdhasd
    Then See No Results found error message

  Scenario Outline: Go to yelp and search for <searchText>
    Given a user goes to Yelp
    When Search for <searchText> 	 
    Then See the List of  Restaurants
    Examples:
	    |searchText|
	    |Scardello|
	    |Roti Grill|
	    |Mughlai Restaurant|
	    |Spice In The City Dallas|			


Here is the step definitions file. It's a Ruby file and uses Capybara to simulate the user actions.

ruby-cucumber-test/features/step_definitions/yelp-step.rb


Given(/^a user goes to Yelp$/) do    
  visit "https://www.yelp.com"   
end

When(/^Search for (.*?)$/) do |searchTerm|
  fill_in 'dropperText_Mast', :with => 'Dallas, TX'    
  fill_in 'find_desc', :with => searchTerm
  click_button('submit')
end

Then(/^See the List of (.*?) Restaurants$/) do |searchTerm|  
 expect(page).to have_content(searchTerm)
 expect(page).to have_no_content('No Results')
end

Then(/^See No Results found error message$/) do
 expect(page).to have_content('No Results')
end

This file has all the environment-related configuration. I have used ‘Chrome’ as my default browser instead of Firefox; such configuration can be defined here.

ruby-cucumber-test/features/support/env.rb


require 'capybara/cucumber'
require 'colorize'
require 'rspec'
Capybara.default_driver = :chrome 
Capybara.register_driver :chrome do |app|
   Capybara::Selenium::Driver.new(app, :browser => :chrome)
end

Below is the Gemfile for my program. This file describes the gem dependencies for a Ruby program.

ruby-cucumber-test/Gemfile


source "https://rubygems.org"

gem "cucumber"
gem "capybara"
gem "selenium-webdriver"
gem "rspec"
gem "chromedriver-helper"

Follow the steps below to run this:


1. Install Bundler (http://bundler.io/): gem install bundler
2. Run bundler: bundle install
3. Start test: cucumber --tag @run

After running the test cases, the output will look like the below.

[cucumber.jpg: screenshot of the Cucumber test run output]

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language used is Pig Latin, and it is an abstraction layer on top of MapReduce. So developers who do not have in-depth knowledge of MapReduce programming, as well as data analysts, can use this platform to analyze big data.

Now let’s see how we are going to use this with some examples.

The first step is to install Pig. I am using Ubuntu. Follow the installation steps below.

Installation

  1. Download Pig from Cloudera. I have used the one below: https://archive.cloudera.com/cdh5/cdh/5/pig-0.12.0-cdh5.2.0.tar.gz
  2. Extract the file pig-0.12.0-cdh5.2.0.tar.gz
  3. Move the extracted folder to /usr/local/pig
    sudo mv pig-0.12.0-cdh5.2.0 /usr/local/pig
  4. Then edit the profile file to update the ENV variables.
    gedit ~/.bashrc

    Add the below variables,

    
    export PIG_HOME="/usr/local/pig"
    export PIG_CONF_DIR="$PIG_HOME/conf"
    export PIG_CLASSPATH="$PIG_CONF_DIR"
    export PATH="$PIG_HOME/bin:$PATH"
    
  5. Finally, run the profile to reflect the changes.
    source ~/.bashrc

To verify the Pig installation, type the command below and check the version and other details.

pig -h

We can run Pig either in local mode or MapReduce mode. If you have a small data set and want to test your script, you can run it in local mode. Typing the command below will open up a Grunt shell where we can enter Pig Latin scripts and run them.

Local Mode:

pig -x local

Map Reduce Mode:

pig 

For this exercise, I have used the datasets below. Please download these files and move them to the /usr/local/pig folder. I am running all my scripts in local mode.

asthma_adults_stats.csv:

http://raw.githubusercontent.com/dkbalachandar/health-stats-application/master/app/resources/asthma_adults_stats.csv

NationalNames.csv:

https://raw.githubusercontent.com/dkbalachandar/spark-scala-examples/master/src/main/resources/NationalNames.csv

comments.json:


{
    "comments": [
        {
            "text": "test1",
            "time": "1486051170277",
            "userName": "test1"
        },
        {
            "text": "test1",
            "time": "1486051170277",
            "userName": "test1"
        }
    ]
}

We can load CSV data with or without a schema, but when loading JSON you have to specify the schema, otherwise it will throw an error.

All statements should end with a semicolon.

Pig has lots of operators and functions, so I am not going to show all of them in this example.

To exit the Grunt shell, use the ‘quit’ command.

To load the CSV data


data = load '/usr/local/pig/asthma_adults_stats.csv' using PigStorage(','); 
b = foreach data generate $0;
dump b;

To load the data with an explicit schema


data = load '/usr/local/pig/asthma_adults_stats.csv' using PigStorage(',') as (state:chararray, percentage:double); 
b = foreach data generate state, percentage;
dump b;

If you change a field name in the schema but still reference the old name, you will end up with an error like the one below.


data = load '/usr/local/pig/asthma_adults_stats.csv' using PigStorage(',') as (state1:chararray, percentage:double); 
b = foreach data generate state, percentage;
dump b;

 Invalid field projection. Projected field [state] does not exist in schema: state1:chararray,percentage:double.
2017-02-01 15:36:23,942 [main] WARN  org.apache.pig.tools.grunt.Grunt - There is no log file to write to.
2017-02-01 15:36:23,942 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.plan.PlanValidationException: ERROR 1025: 
 Invalid field projection. Projected field [state] does not exist in schema: state1:chararray,percentage:double.
	at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:191)


To load the JSON data


data = load '/usr/local/pig/comments.json' using JsonLoader('comments:{(userName:chararray,text:chararray, time:chararray)}');

Load the CSV data and perform Filter operation


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int,
Name:chararray, Year:int, Gender: chararray, Count:int);
filteredData = filter data by Year > 2010;
dump filteredData;

To order the data


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int,
Name:chararray, Year:int, Gender: chararray, Count:int);
filteredData = filter data by Year > 2010;
orderByData =  order filteredData by Year;
dump orderByData;

To group the data


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int, Name:chararray, Year:int, Gender: chararray, Count:int);
groupByData =  group data by Name;
dump groupByData;

To extract the Name and Gender (using the ‘data’ relation loaded above)


b = foreach data generate Name, Gender;
dump b;

To filter all the female names, group the data by Year and then count them


data = load '/usr/local/pig/NationalNames.csv' using PigStorage(',') as (Id: int, Name:chararray, Year:int, Gender: chararray, Count:int);
filterData = filter data by Gender =='F';
groupData = group filterData by Year;
countData = foreach groupData generate group, COUNT($1);
dump countData;

To store the data


store filterData INTO '/tmp/output' USING PigStorage(',');

To describe the relation


 describe filterData;

 Output:
 filterData: {Id: int,Name: chararray,Year: int,Gender: chararray,Count: int}