For Developers: Pig

What Is Pig?

Pig provides an engine for executing data flows in parallel on Hadoop. It includes a
language, Pig Latin, for expressing these data flows. Pig Latin includes operators for
many of the traditional data operations (join, sort, filter, etc.), as well as the ability for
users to develop their own functions for reading, processing, and writing data.
Pig is an Apache open source project.

Pig on Hadoop

Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System,
HDFS, and Hadoop’s processing system, MapReduce.

Pig Setup

Download Pig from:
http://www.apache.org/dyn/closer.cgi/pig

Extract the package using:

tar -zxvf filename

Set the Pig classpath in ~/.bashrc

Start Pig in local mode:
./bin/pig -x local

Start Pig in mapreduce mode:

./bin/pig

or

./bin/pig -x mapreduce

Here are some use cases for Pig, with solutions:

Dataset Name: tag-genome

Download tag-genome data set

Dataset Description: Want to know how quirky a particular movie is? Or how to find the most visually appealing movies of all time? Or how to find a movie that is similar to another movie you’ve seen but less big budget and more cerebral?

The tag genome is a data structure that enables you to answer queries like these. It encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.).

This data set contains the tag relevance values that make up the tag genome. Tag relevance represents the relevance of a tag to a movie on a continuous scale from 0 to 1. Tag relevance values are provided for 9,734 movies and 1,128 tags.

File Structure

movies.dat

<MovieID><Title><MoviePopularity>

tags.dat

<TagID><Tag><TagPopularity>

tag_relevance.dat

<MovieID><TagID><Relevance>

Relevance values are on a continuous 0-1 scale. A value of 1 indicates that a tag is strongly relevant to a movie and a value of 0 indicates that a tag has no relevance to a movie.

Load and dump the data using Pig grunt shell commands

movies = LOAD '/user/hduser/pig/tag-genome/movies.dat' USING PigStorage('\t') as (MovieID:int,Title:chararray,MoviePopularity:int);
dump movies

tag_relevance = LOAD '/user/hduser/pig/tag-genome/tag_relevance.dat' USING PigStorage('\t') as (MovieID:int,TagID:int,Relevance:float);
dump tag_relevance

tags = LOAD '/user/hduser/pig/tag-genome/tags.dat' USING PigStorage('\t') as (TagID:int,Tag:chararray,TagPopularity:int);
dump tags

Find the MovieID of the movie titled 'Kids of the Round Table (1995)'

grunt> Y= FILTER movies BY Title=='Kids of the Round Table (1995)';
grunt> dump Y

List the ten most popular tags

grunt> Y= ORDER tags BY TagPopularity DESC;
grunt> Z= LIMIT Y 10;
grunt> dump Z
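
To check the ORDER followed by LIMIT logic outside of Pig, here is a minimal Python sketch. The sample rows and the helper name top_tags are assumptions made for illustration; they are not taken from tags.dat or from Pig itself.

```python
# Hypothetical sample rows in the shape (TagID, Tag, TagPopularity).
# The values are made up for illustration, not read from tags.dat.
tags = [
    (1, "atmospheric", 120),
    (2, "realistic", 45),
    (3, "thought-provoking", 300),
    (4, "quirky", 80),
]


def top_tags(rows, n):
    """Mirror ORDER rows BY TagPopularity DESC followed by LIMIT n."""
    ordered = sorted(rows, key=lambda row: row[2], reverse=True)
    return ordered[:n]


top_two = top_tags(tags, 2)
print(top_two)
```

Asking for more rows than exist simply returns everything, which is also how LIMIT behaves when the relation is smaller than the limit.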

Find tags with a popularity score greater than 50

grunt> Y= FILTER tags BY TagPopularity>50;
grunt> dump Y

Find the most relevant tag for a movie (here, MovieID 1)

grunt> Y= FILTER tag_relevance BY MovieID==1;
grunt> X= ORDER Y BY Relevance DESC;
grunt> Z= LIMIT X 1;
grunt> dump Z
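
The same FILTER, ORDER, LIMIT 1 pipeline can be sketched in plain Python as a filter followed by a maximum. The rows below and the function name most_relevant are hypothetical and only show the shape of the computation.

```python
# Hypothetical rows in the shape (MovieID, TagID, Relevance).
# The values are made up for illustration, not read from tag_relevance.dat.
tag_relevance = [
    (1, 10, 0.25),
    (1, 11, 0.90),
    (1, 12, 0.40),
    (2, 10, 0.75),
]


def most_relevant(rows, movie_id):
    """Keep rows for one movie, then return the row with the highest relevance."""
    matching = [row for row in rows if row[0] == movie_id]
    if not matching:
        return None  # no rows for this movie, like an empty relation in Pig
    return max(matching, key=lambda row: row[2])


print(most_relevant(tag_relevance, 1))
```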

Group by MovieID and count the tag relevance entries for each movie

grunt> Y = GROUP tag_relevance by MovieID;
grunt> count_movies = FOREACH Y GENERATE group, COUNT(tag_relevance);
grunt> dump count_movies
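
As a rough cross-check of what GROUP plus COUNT produces, the sketch below counts rows per MovieID in Python. The sample rows and the name count_per_movie are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical rows in the shape (MovieID, TagID, Relevance), made up for illustration.
tag_relevance = [
    (1, 10, 0.25),
    (1, 11, 0.90),
    (2, 10, 0.75),
    (1, 12, 0.40),
]


def count_per_movie(rows):
    """Mirror GROUP rows BY MovieID, then GENERATE group, COUNT(rows)."""
    return Counter(row[0] for row in rows)


count_movies = count_per_movie(tag_relevance)
print(sorted(count_movies.items()))
```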

Movies whose title starts with 'J'

grunt> Y= FILTER movies BY Title matches 'J.+';
grunt> dump Y
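
Pig's matches operator tests the regular expression against the whole string, so 'J.+' means "a J followed by at least one more character". A minimal Python sketch of the same test uses re.fullmatch; the titles and the helper name titles_matching are made up for illustration.

```python
import re

# Hypothetical titles, made up for illustration.
titles = ["Jungle Story", "A Journey", "J", "jazz nights"]


def titles_matching(rows, pattern):
    """Keep titles where the pattern matches the entire string, as Pig's matches does."""
    return [title for title in rows if re.fullmatch(pattern, title)]


print(titles_matching(titles, r"J.+"))
```

Here "A Journey" is dropped because the pattern must match from the first character, and the single-character title "J" is dropped because '.+' needs at least one character after the J.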

Dataset Name: songs-data

Download songs data set

Load data

track = LOAD '/user/hduser/pig/songs-data/subset_unique_tracks.txt' USING PigStorage('\t') as (trackID:chararray,songID:chararray,artistName:chararray,songTitle:chararray);

artist = LOAD '/user/hduser/pig/songs-data/subset_artist_location.txt' USING PigStorage('\t') as (artistID:chararray,artistmbID:chararray,trackID:chararray,artistName:chararray);

track_per_year = LOAD '/user/hduser/pig/songs-data/subset_tracks_per_year.txt' USING PigStorage('\t') as (year:chararray,trakID:chararray,artistName:chararray,songTitle:chararray);

Group by year

grunt> year_group= GROUP track_per_year BY year;
grunt> dump year_group
grunt> YEAR_COUNT = FOREACH year_group GENERATE group, COUNT(track_per_year);

Check the schema of the relation

grunt> describe year_group
year_group: {group: chararray,track_per_year: {(year: chararray,trakID: chararray,artistName: chararray,songTitle: chararray)}}

Flatten the schema of the relation

grunt> a = foreach year_group generate $0, FLATTEN($1);

grunt> describe a;
a: {group: chararray,track_per_year::year: chararray,track_per_year::trakID: chararray,track_per_year::artistName: chararray,track_per_year::songTitle: chararray}
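
To see why the flattened schema has five fields, with the year appearing twice, here is a small Python sketch of GROUP followed by FLATTEN. The sample rows and the helper names group_by_year and flatten are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows in the shape (year, trakID, artistName, songTitle), made up for illustration.
tracks = [
    ("1999", "TR001", "Artist A", "Song One"),
    ("1999", "TR002", "Artist B", "Song Two"),
    ("2005", "TR003", "Artist A", "Song Three"),
]


def group_by_year(rows):
    """Mirror GROUP rows BY year: each key maps to a bag (list) of whole rows."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[0]].append(row)
    return dict(bags)


def flatten(grouped):
    """Mirror GENERATE $0, FLATTEN($1): one output row per tuple in each bag."""
    return [(key,) + row for key, bag in grouped.items() for row in bag]


year_group = group_by_year(tracks)
a = flatten(year_group)
for row in a:
    print(row)
```

Each output row carries the group key followed by the original four fields, which is why the year shows up both as group and as track_per_year::year.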

Group by artistName

grunt> x= Group a BY artistName;

grunt> describe x
x: {group: chararray,a: {(group: chararray,track_per_year::year: chararray,track_per_year::trakID: chararray,track_per_year::artistName: chararray,track_per_year::songTitle: chararray)}}

grunt> y = foreach x generate $0, FLATTEN($1);

grunt> year_group= GROUP track_per_year BY year;
grunt> x = foreach year_group generate group as year , track_per_year as artistName;

grunt> describe x
x: {year: chararray,artistName: {(year: chararray,trakID: chararray,artistName: chararray,songTitle: chararray)}}

grunt> x = foreach year_group generate group as year , track_per_year as artistName, track_per_year as songTitle;

grunt> describe x
x: {year: chararray,artistName: {(year: chararray,trakID: chararray,artistName: chararray,songTitle: chararray)},songTitle: {(year: chararray,trakID: chararray,artistName: chararray,songTitle: chararray)}}

grunt> y = foreach x generate FLATTEN($1);
grunt> describe y
y: {artistName::year: chararray,artistName::trakID: chararray,artistName::artistName: chararray,artistName::songTitle: chararray}

grunt> y = foreach x generate FLATTEN($0);
grunt> describe y
y: {year: chararray}

grunt> y = foreach x generate FLATTEN($2);
grunt> describe y
y: {songTitle::year: chararray,songTitle::trakID: chararray,songTitle::artistName: chararray,songTitle::songTitle: chararray}

grunt> y = foreach x generate FLATTEN(BagToTuple($2));
grunt> describe y
y: {org.apache.pig.builtin.bagtotuple_songTitle_17::year: chararray,org.apache.pig.builtin.bagtotuple_songTitle_17::trakID: chararray,org.apache.pig.builtin.bagtotuple_songTitle_17::artistName: chararray,org.apache.pig.builtin.bagtotuple_songTitle_17::songTitle: chararray}

grunt> artist_group= GROUP track_per_year BY artistName;
grunt> dump artist_group;

Use a custom UDF

Make the UDF jar available when starting Pig:

./pig -Dpig.additional.jars=/usr/local/pig/contrib/piggybank/java/myudf.jar

or register it from the grunt shell:

grunt> register '/usr/local/pig/contrib/piggybank/java/myudf.jar';
grunt> define myudf SepUDF;

track = LOAD '/user/hduser/pig/subset_unique_tracks.txt';

A = FOREACH track GENERATE myudf($0);

Monday, 11 January 2016
