Published Jun 20, 2013 in Business & Management
Presentation Slides & Transcript
Alan F. Gates
Yahoo!
Pig, Making Hadoop Easy
Who Am I?
Pig committer
Hadoop PMC Member
An architect in Yahoo! grid team
Or, as one coworker put it, “the lipstick on the Pig”
Who are you?
Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 to 25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In Map Reduce
In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by
age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group,
COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
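For readers without a Pig installation, the same dataflow can be sketched in plain Python on in-memory data. This is only an illustration of the steps (filter, join, group, count, order, limit), not how Pig executes them. The sample rows and the function name `top_pages` are hypothetical, and the sketch assumes user names are unique.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' files.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 17)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("carol", "a.com"),
    ("bob", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
    ("alice", "a.com"),
]

def top_pages(users, pages, lo=18, hi=25, n=5):
    # Filter by age (assumes each user name appears once in users).
    names = {name for name, age in users if lo <= age <= hi}
    # Join on name, group on url, count clicks.
    clicks = Counter(url for user, url in pages if user in names)
    # Order by clicks descending (ties broken by url), take the top n.
    return sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

top5 = top_pages(users, pages)
print(top5)
```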
Performance
(Chart: performance by Pig release, 0.1 through 0.7)
Why not SQL?
Data Collection
Data Factory (Pig): Pipelines, Iterative Processing, Research
Data Warehouse (Hive): BI Tools, Analysis
Pig Highlights
User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
UDFs can be written to take advantage of the combiner
Four join implementations built in: hash, fragment-replicate, merge, skewed
Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
Order by provides total ordering across reducers in a balanced way
Writing load and store functions is easy once an InputFormat and OutputFormat exist
Piggybank, a collection of user-contributed UDFs
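As an illustration of one of the join strategies listed above, here is a rough Python sketch of the idea behind a fragment-replicate join: the small input is held in memory as a hash table, and each fragment of the large input is joined against it without a shuffle. This is a conceptual model only, not Pig's implementation. The function name, the tuple layout, and the sample data are all assumptions.

```python
from collections import defaultdict

def fragment_replicate_join(large_fragments, small, large_key, small_key):
    """Join each fragment of a large input against an in-memory small input.

    Rows are tuples; large_key and small_key are column positions.
    """
    # Build a hash table over the small input once. Conceptually, a copy
    # of this table is replicated to every task that reads a fragment.
    index = defaultdict(list)
    for row in small:
        index[row[small_key]].append(row)

    joined = []
    for fragment in large_fragments:  # each fragment stands in for one map task
        for row in fragment:
            for match in index.get(row[large_key], []):
                joined.append(row + match)
    return joined

# Hypothetical data: page views split into two fragments, plus a small user table.
page_fragments = [
    [("alice", "a.com"), ("bob", "b.com")],
    [("carol", "a.com")],
]
small_users = [("alice", 22), ("carol", 19)]
joined_rows = fragment_replicate_join(page_fragments, small_users, 0, 0)
print(joined_rows)
```

The trade-off in this model is that the small input must fit in memory on every task, in exchange for skipping the reduce phase.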
Who uses Pig for What?
70% of production jobs at Yahoo (tens of thousands per day)
Also used by Twitter, LinkedIn, eBay, AOL, …
Used to
Process web logs
Build user behavior models
Process images
Build maps of the web
Do research on raw data sets
Accessing Pig
Submit a script directly
Grunt, the pig shell
PigServer Java class, a JDBC-like interface
Components
User machine
Hadoop Cluster
Pig resides on user machine
Job executes on cluster
No need to install anything extra on your Hadoop cluster.
How It Works
A = LOAD 'myfile'
AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE
group, COUNT(B);
STORE D INTO 'output';
Pig Latin
Execution Plan
Map: Filter, Count
Combine/Reduce: Sum
pig.jar: parses, checks, optimizes, plans execution, submits jar to Hadoop, monitors job progress
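The execution plan on this slide (filter and count on the map side, sum on the combine/reduce side) can be mimicked in a few lines of Python. The script filters on x > 0, groups by x, and counts rows per group, so each map task can emit partial counts that are later summed. This is a toy simulation under that reading of the slide. The function names and input splits are hypothetical, and nothing here reflects Pig's actual generated code.

```python
from collections import Counter

def map_task(rows):
    """Map side: keep rows with x > 0 and emit a partial count per x."""
    partial = Counter()
    for x, y, z in rows:
        if x > 0:
            partial[x] += 1
    return partial

def reduce_counts(partials):
    """Combine/reduce side: sum the partial counts that share a key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

# Hypothetical input splits, one per map task.
splits = [
    [(1, "a", "b"), (2, "c", "d"), (-1, "e", "f")],
    [(1, "g", "h"), (0, "i", "j"), (2, "k", "l"), (1, "m", "n")],
]
result = reduce_counts(map_task(split) for split in splits)
print(result)
```

Because counting is done as partial counts followed by a sum, the same summing step can run in the combiner and again in the reducer without changing the result.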
Demo
s3://hadoopday/pig_tutorial
Upcoming Features
In 0.8 (plan to branch end of August, release this fall):
Runtime statistics collection
UDFs in scripting languages (e.g. Python)
Ability to specify a custom partitioner
Adding many string and math functions as Pig supported UDFs
Post 0.8
Adding branches, loops, functions, and modules
Usability
Better error messages
Fix ILLUSTRATE
Improved integration with workflow systems
Learn More
Read the online documentation: http://hadoop.apache.org/pig/
Online tutorials:
From Yahoo, http://developer.yahoo.com/hadoop/tutorial/
From Cloudera, http://www.cloudera.com/hadoop-training
Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
A couple of the available Hadoop books include chapters on Pig; search at your favorite bookstore
Join the mailing lists:
pig-user@hadoop.apache.org for user questions
pig-dev@hadoop.apache.org for developer issues
howldev@yahoogroups.com for Howl