by wyyhzc

Slides
16 slides

Hadoopdayaug2010 100817144744 Phpapp01.pptx

Published Jun 20, 2013 in Business & Management

Presentation Slides & Transcript

Alan F. Gates
Yahoo!
Pig, Making Hadoop Easy

Who Am I?
Pig committer
Hadoop PMC Member
An architect in Yahoo! grid team
Or, as one coworker put it, “the lipstick on the Pig”

Who are you?

Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 - 25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5

In Map Reduce

In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by
    age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group,
    COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
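
To make the dataflow concrete, here is a minimal plain-Python sketch of the same eight steps run on small in-memory lists. It is not Pig and not how Pig executes the script; the sample rows, the function name `top_pages`, and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' files.
users = [("alice", 22), ("bob", 31), ("carol", 19), ("dave", 17)]
pages = [
    ("alice", "a.com"), ("alice", "b.com"), ("carol", "a.com"),
    ("bob", "a.com"), ("carol", "c.com"), ("dave", "b.com"),
    ("alice", "a.com"),
]

def top_pages(users, pages, lo=18, hi=25, n=5):
    """Return the n most visited URLs among users aged lo..hi (inclusive)."""
    # Filter by age. Assumes each name appears once in users, so a set
    # behaves like the join key side of "join Fltrd by name".
    fltrd = {name for name, age in users if lo <= age <= hi}
    # Join on name, group on url, count clicks: one click per matching row.
    clicks = Counter(url for user, url in pages if user in fltrd)
    # Order by clicks descending (ties broken by url), then take the top n.
    srtd = sorted(clicks.items(), key=lambda kv: (-kv[1], kv[0]))
    return srtd[:n]

top5 = top_pages(users, pages)
```

With this sample data only alice and carol pass the age filter, so only their page views are counted.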

Performance
[Chart not reproduced; surviving axis labels: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

Why not SQL?
Data Collection
Data Factory (Pig): pipelines, iterative processing, research
Data Warehouse (Hive): BI tools, analysis

Pig Highlights
User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
UDFs can be written to take advantage of the combiner
Four join implementations built in: hash, fragment-replicate, merge, skewed
Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
Order by provides total ordering across reducers in a balanced way
Writing load and store functions is easy once an InputFormat and OutputFormat exist
Piggybank, a collection of user contributed UDFs
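
As an illustration of one of the join strategies named above, the following is a minimal plain-Python sketch of the general idea behind a fragment-replicate join: the small input is held in memory as a hash table and every fragment of the large input is joined against it without a reduce step. This reflects the general technique only, not Pig's implementation; the function name, parameters, and sample rows are all hypothetical.

```python
def fragment_replicate_join(large_fragments, small, large_key, small_key):
    """Join each fragment of the large input against an in-memory small input."""
    # "Replicate": build a hash table of the small input once. On a cluster,
    # each map task would hold its own copy of this table.
    lookup = {}
    for row in small:
        lookup.setdefault(row[small_key], []).append(row)

    joined = []
    # "Fragment": each fragment is processed independently, map side only.
    for fragment in large_fragments:
        for row in fragment:
            for match in lookup.get(row[large_key], []):
                joined.append(row + match)
    return joined

# Hypothetical data: page views split into two fragments, plus a small users table.
page_fragments = [
    [("alice", "a.com"), ("bob", "b.com")],
    [("alice", "c.com"), ("erin", "a.com")],
]
users = [("alice", 22), ("bob", 31)]

result = fragment_replicate_join(page_fragments, users, large_key=0, small_key=0)
```

The approach only pays off when the small input fits in memory; rows with no match (erin here) are dropped, as in an inner join.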

Who uses Pig for What?
70% of production jobs at Yahoo (tens of thousands per day)
Also used by Twitter, LinkedIn, eBay, AOL, …
Used to
Process web logs
Build user behavior models
Process images
Build maps of the web
Do research on raw data sets

Accessing Pig
Submit a script directly
Grunt, the pig shell
PigServer Java class, a JDBC-like interface

Components
User machine
Hadoop Cluster
Pig resides on user machine
Job executes on cluster
No need to install anything extra on your Hadoop cluster.

How It Works
A = LOAD 'myfile'
    AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE
    group, COUNT(B);
STORE D INTO 'output';
Pig Latin
Execution Plan
Map:
Filter
Count

Combine/Reduce:
Sum
pig.jar:
parses
checks
optimizes
plans execution
submits jar to Hadoop
monitors job progress
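
The execution plan above (filter and count in the map, sum in the combine/reduce) can be sketched in plain Python. This is a simulation under assumed inputs, not Pig or Hadoop code: the pre-split sample rows and the function names `map_phase` and `combine_or_reduce` are hypothetical.

```python
from collections import defaultdict

def map_phase(split):
    """Map: apply the filter (x > 0) and emit a partial count per x for one split."""
    partial = defaultdict(int)
    for x, y, z in split:
        if x > 0:
            partial[x] += 1
    return dict(partial)

def combine_or_reduce(partials):
    """Combine/Reduce: sum the partial counts that share the same key."""
    totals = defaultdict(int)
    for partial in partials:
        for key, count in partial.items():
            totals[key] += count
    return dict(totals)

# Hypothetical input, pre-split as if two map tasks each read part of 'myfile'.
splits = [
    [(1, "a", "b"), (2, "c", "d"), (-1, "e", "f")],
    [(1, "g", "h"), (0, "i", "j"), (2, "k", "l"), (1, "m", "n")],
]

output = combine_or_reduce(map_phase(s) for s in splits)
```

A count can be computed as a sum of partial counts, which is why the same summing step can serve as both combiner and reducer in this sketch.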

Demo
s3://hadoopday/pig_tutorial

Upcoming Features
In 0.8 (plan to branch end of August, release this fall):
Runtime statistics collection
UDFs in scripting languages (e.g. Python)
Ability to specify a custom partitioner
Adding many string and math functions as Pig supported UDFs
Post 0.8
Adding branches, loops, functions, and modules
Usability
Better error messages
Fix ILLUSTRATE
Improved integration with workflow systems
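
As a rough idea of what a scripting-language UDF from the 0.8 list might look like, here is a minimal Python sketch. It assumes the Jython-based mechanism in which Pig supplies an `outputSchema` decorator when the script is registered (for example with `register 'udfs.py' using jython as myfuncs;`); the fallback decorator, the file name, and the `domain_of` function are hypothetical and exist only so the file also runs as ordinary Python.

```python
# When the script is registered with Pig, outputSchema is expected to be
# supplied by Pig itself. This fallback is only a stand-in (an assumption
# for illustration) so the file can be run and tested as plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap

@outputSchema("domain:chararray")
def domain_of(url):
    """Hypothetical UDF: return the lowercased host part of a URL, or None for null input."""
    if url is None:
        return None
    without_scheme = url.split("://", 1)[-1]
    return without_scheme.split("/", 1)[0].lower()
```

In a Pig script such a function would then be called per row, along the lines of `foreach Pages generate myfuncs.domain_of(url);`, with the alias `myfuncs` being whatever name was chosen at registration.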

Learn More
Read the online documentation: http://hadoop.apache.org/pig/
Online tutorials
From Yahoo, http://developer.yahoo.com/hadoop/tutorial/
From Cloudera, http://www.cloudera.com/hadoop-training
Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
A couple of Hadoop books available that include chapters on Pig, search at your favorite bookstore
Join the mailing lists:
pig-user@hadoop.apache.org for user questions
pig-dev@hadoop.apache.org for developer issues
howldev@yahoogroups.com for Howl