What Is Pig?
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a
language, Pig Latin, for expressing these data flows. Pig Latin includes operators for
many of the traditional data operations (join, sort, filter, etc.), as well as the ability for
users to develop their own functions for reading, processing, and writing data.
Pig is an Apache open source project.
Pig on Hadoop
Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System,
HDFS, and Hadoop’s processing system, MapReduce.
Pig Setup
Download Pig from:
http://www.apache.org/dyn/closer.cgi/pig
Extract the package using:
tar -zxvf filename
Set the Pig classpath in ~/.bashrc
Start Pig in local mode:
./bin/pig -x local
Start Pig in MapReduce mode:
./bin/pig
or
./bin/pig -x mapreduce
Here are some use cases for Pig, with solutions:
Dataset Name: tag-genome
Download the tag-genome data set
Dataset Description: Want to know how quirky a particular movie is? Or how to find the most visually appealing movies of all time? Or how to find a movie that is similar to another movie you’ve seen but less big budget and more cerebral?
The tag genome is a data structure that enables you to answer such queries. The tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.).
This data set contains the tag relevance values that make up the tag genome. Tag relevance represents the relevance of a tag to a movie on a continuous scale from 0 to 1. Tag relevance values are provided for 9,734 movies and 1,128 tags.
File Structure
movies.dat
<MovieID><Title><MoviePopularity>
tags.dat
<TagID><Tag><TagPopularity>
tag_relevance.dat
<MovieID><TagID><Relevance>
Relevance values are on a continuous 0-1 scale. A value of 1 indicates that a tag is strongly relevant to a movie and a value of 0 indicates that a tag has no relevance to a movie.
Load and dump the data using the Pig Grunt shell
movies = LOAD '/user/hduser/pig/tag-genome/movies.dat' USING PigStorage('\t') AS (MovieID:int,Title:chararray,MoviePopularity:int);
dump movies;
tag_relevance = LOAD '/user/hduser/pig/tag-genome/tag_relevance.dat' USING PigStorage('\t') AS (MovieID:int,TagID:int,Relevance:float);
dump tag_relevance;
tags = LOAD '/user/hduser/pig/tag-genome/tags.dat' USING PigStorage('\t') AS (TagID:int,Tag:chararray,TagPopularity:int);
dump tags;
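The three relations can be combined to answer the kind of question the dataset description poses, for example which movies exhibit a given tag most strongly. The following is a minimal sketch, not a definitive solution: the tag text 'atmospheric' is taken from the description above, but whether tags.dat stores it in exactly that form is an assumption, and the relation names one_tag, scores, and top10 are made up for illustration.

```pig
-- Sketch: the ten movies most relevant to a single tag.
movies = LOAD '/user/hduser/pig/tag-genome/movies.dat' USING PigStorage('\t')
         AS (MovieID:int, Title:chararray, MoviePopularity:int);
tags = LOAD '/user/hduser/pig/tag-genome/tags.dat' USING PigStorage('\t')
       AS (TagID:int, Tag:chararray, TagPopularity:int);
tag_relevance = LOAD '/user/hduser/pig/tag-genome/tag_relevance.dat' USING PigStorage('\t')
                AS (MovieID:int, TagID:int, Relevance:float);

-- Assumption: the tag is stored as the lowercase string 'atmospheric'.
one_tag = FILTER tags BY Tag == 'atmospheric';

-- Attach relevance rows to the chosen tag, keeping only the fields needed later.
tag_scores = JOIN one_tag BY TagID, tag_relevance BY TagID;
scores = FOREACH tag_scores GENERATE tag_relevance::MovieID AS MovieID,
                                     tag_relevance::Relevance AS Relevance;

-- Attach movie titles, then rank by relevance.
with_titles = JOIN scores BY MovieID, movies BY MovieID;
scored = FOREACH with_titles GENERATE movies::Title AS Title, scores::Relevance AS Relevance;
ranked = ORDER scored BY Relevance DESC;
top10 = LIMIT ranked 10;
DUMP top10;
```

Projecting down to MovieID and Relevance after the first join keeps the field names short; without that step the second join would produce doubly prefixed names such as tag_scores::tag_relevance::Relevance.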
Find the MovieID of the movie titled 'Kids of the Round Table (1995)'
grunt> Y = FILTER movies BY Title == 'Kids of the Round Table (1995)';
grunt> dump Y;
List the ten most popular tags
grunt> Y = ORDER tags BY TagPopularity DESC;
grunt> Z = LIMIT Y 10;
grunt> dump Z;
Find tags with a popularity score greater than 50
grunt> Y = FILTER tags BY TagPopularity > 50;
grunt> dump Y;
Find the most relevant tag for a movie (MovieID 1 here)
grunt> Y = FILTER tag_relevance BY MovieID == 1;
grunt> X = ORDER Y BY Relevance DESC;
grunt> Z = LIMIT X 1;
grunt> dump Z;
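The query above returns only a TagID, which is hard to read on its own. A possible extension, sketched here under the assumption that the same file paths and schemas are used, joins against tags.dat to show the tag text; the cutoff of five rows and the relation names are arbitrary choices for illustration.

```pig
-- Sketch: the five most relevant tags for MovieID 1, shown by name.
tags = LOAD '/user/hduser/pig/tag-genome/tags.dat' USING PigStorage('\t')
       AS (TagID:int, Tag:chararray, TagPopularity:int);
tag_relevance = LOAD '/user/hduser/pig/tag-genome/tag_relevance.dat' USING PigStorage('\t')
                AS (MovieID:int, TagID:int, Relevance:float);

movie_tags = FILTER tag_relevance BY MovieID == 1;

-- Filter before joining so the join only sees one movie's rows.
named = JOIN movie_tags BY TagID, tags BY TagID;
labelled = FOREACH named GENERATE tags::Tag AS Tag, movie_tags::Relevance AS Relevance;

ranked = ORDER labelled BY Relevance DESC;
top5 = LIMIT ranked 5;   -- 5 is an arbitrary cutoff
DUMP top5;
```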
Group by MovieID and count the tag relevance rows for each movie
grunt> Y = GROUP tag_relevance BY MovieID;
grunt> count_movies = FOREACH Y GENERATE group, COUNT(tag_relevance);
grunt> dump count_movies;
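A plain count of all rows per movie says little when every movie is scored against every tag, so a filter before the GROUP can be more informative. The sketch below counts only strongly relevant tags per movie; the 0.8 threshold is an arbitrary assumption, not something the dataset prescribes.

```pig
-- Sketch: per movie, how many tags score at or above a chosen threshold.
tag_relevance = LOAD '/user/hduser/pig/tag-genome/tag_relevance.dat' USING PigStorage('\t')
                AS (MovieID:int, TagID:int, Relevance:float);

-- Assumption: 0.8 is treated as "strongly relevant"; adjust as needed.
strong = FILTER tag_relevance BY Relevance >= 0.8f;

by_movie = GROUP strong BY MovieID;
strong_counts = FOREACH by_movie GENERATE group AS MovieID,
                                          COUNT(strong) AS StrongTags,
                                          AVG(strong.Relevance) AS MeanRelevance;
DUMP strong_counts;
```

Movies with no tag above the threshold do not appear in the result, because they have no rows left to group.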
Find movies whose title starts with 'J'
grunt> Y = FILTER movies BY Title matches 'J.+';
grunt> dump Y;
Dataset Name: songs-data
Download the songs data set
track = LOAD '/user/hduser/pig/songs-data/subset_unique_tracks.txt' USING PigStorage('\t') as (trackID:chararray,songID:chararray,artistName:chararray,songTitle:chararray);
artist = LOAD '/user/hduser/pig/songs-data/subset_artist_location.txt' USING PigStorage('\t') as (artistID:chararray,artistmbID:chararray,trackID:chararray,artistName:chararray);
track_per_year = LOAD '/user/hduser/pig/songs-data/subset_tracks_per_year.txt' USING PigStorage('\t') as (year:chararray,trakID:chararray,artistName:chararray,songTitle:chararray);
Group by year and count the tracks in each year
grunt> year_group = GROUP track_per_year BY year;
grunt> dump year_group;
grunt> YEAR_COUNT = FOREACH year_group GENERATE group, COUNT(track_per_year);
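The per-year count becomes easier to read once it is sorted. A minimal sketch, assuming the same file path and schema as the LOAD statement above (including its trakID field name); the names year_count, tracks, and busiest are illustrative only.

```pig
-- Sketch: number of tracks per year, busiest years first.
track_per_year = LOAD '/user/hduser/pig/songs-data/subset_tracks_per_year.txt' USING PigStorage('\t')
                 AS (year:chararray, trakID:chararray, artistName:chararray, songTitle:chararray);

year_group = GROUP track_per_year BY year;

-- Keep the group key next to the count so each row is self-describing.
year_count = FOREACH year_group GENERATE group AS year, COUNT(track_per_year) AS tracks;

busiest = ORDER year_count BY tracks DESC;
DUMP busiest;
```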
Check the schema of the relation
grunt> describe year_group
year_group: {group: chararray,track_per_year: {(year: chararray,trakID: chararray,artistName: chararray,songTitle: chararray)}}
Flatten the schema of the relation
grunt> a = foreach year_group generate $0, FLATTEN($1);
grunt> describe a;
a: {group: chararray,track_per_year::year: chararray,track_per_year::trakID: chararray,track_per_year::artistName: chararray,track_per_year::songTitle: chararray}
Group by artistName
grunt> x = GROUP a BY artistName;
grunt> describe x
x: {group: chararray,a: {(group: chararray,track_per_year::year: chararray,track_per_year::trakID: chararray,track_per_year::artistName: chararray,track_per_year::songTitle: chararray)}}
grunt> y = foreach x generate $0, FLATTEN($1);
grunt> year_group= GROUP track_per_year BY year;
grunt> x = foreach year_group generate group as year , track_per_year as artistName;
grunt> describe x
x: {year: chararray,artistName: {(year: chararray,trakID: chararray,artistName: chararray,songTitle: chararray)}}
grunt> x = foreach year_group generate group as year , track_per_year as artistName, track_per_year as songTitle;
grunt> describe x
x: {year: chararray,artistName: {(year: chararray,trakID: chararray,artistName: chararray,songTitle: chararray)},songTitle: {(year: chararray,trakID: chararray,artistName: chararray,songTitle: chararray)}}
grunt> y = foreach x generate FLATTEN($1);
grunt> describe y
y: {artistName::year: chararray,artistName::trakID: chararray,artistName::artistName: chararray,artistName::songTitle: chararray}
grunt> y = foreach x generate FLATTEN($0);
grunt> describe y
y: {year: chararray}
grunt> y = foreach x generate FLATTEN($2);
grunt> describe y
y: {songTitle::year: chararray,songTitle::trakID: chararray,songTitle::artistName: chararray,songTitle::songTitle: chararray}
grunt> y = foreach x generate FLATTEN(BagToTuple($2));
grunt> describe y
y: {org.apache.pig.builtin.bagtotuple_songTitle_17::year: chararray,org.apache.pig.builtin.bagtotuple_songTitle_17::trakID: chararray,org.apache.pig.builtin.bagtotuple_songTitle_17::artistName: chararray,org.apache.pig.builtin.bagtotuple_songTitle_17::songTitle: chararray}
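In the statements above, aliasing the whole bag as artistName or songTitle only renames it; every tuple in the bag still carries all four fields. If the goal is a bag holding a single field, Pig's bag projection (bag.field) may be closer to what is wanted. The following is a sketch under the same path and schema assumptions; the relation names are illustrative.

```pig
-- Sketch: project one field out of the grouped bag instead of renaming the bag.
track_per_year = LOAD '/user/hduser/pig/songs-data/subset_tracks_per_year.txt' USING PigStorage('\t')
                 AS (year:chararray, trakID:chararray, artistName:chararray, songTitle:chararray);

year_group = GROUP track_per_year BY year;

-- track_per_year.artistName yields a bag whose tuples contain only artistName.
artists_by_year = FOREACH year_group GENERATE group AS year,
                                              track_per_year.artistName AS artists;
DESCRIBE artists_by_year;

-- A nested FOREACH can post-process the bag, here counting distinct artists per year.
distinct_artists = FOREACH year_group {
    names = track_per_year.artistName;
    uniq_names = DISTINCT names;
    GENERATE group AS year, COUNT(uniq_names) AS artist_count;
};
DUMP distinct_artists;
```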
grunt> artist_group= GROUP track_per_year BY artistName;
grunt> dump artist_group;
Register and call a user-defined function (UDF)
Either pass the jar when starting Pig:
./pig -Dpig.additional.jars=/usr/local/pig/contrib/piggybank/java/myudf.jar
or register it from the Grunt shell:
grunt> register '/usr/local/pig/contrib/piggybank/java/myudf.jar';
grunt> define myudf SepUDF;
grunt> track = LOAD '/user/hduser/pig/subset_unique_tracks.txt';
grunt> A = FOREACH track GENERATE myudf($0);