This document discusses using Hadoop and HBase to build content relevance and personalization systems for big data applications. It provides an overview of Hadoop and HBase, and how they can be used together. As a case study, it describes how Groupon uses Hadoop and HBase for their deal relevance and personalization systems, including storing user data in HBase and running recommendation algorithms using MapReduce.
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email Experience for Millions of Users
1. USING HADOOP & HBASE
TO BUILD CONTENT
RELEVANCE &
PERSONALIZATION
Tools to build your big data application
Ameya Kanitkar
2. Ameya Kanitkar – That's me!
• Big Data Infrastructure Engineer @ Groupon, Palo Alto
USA (Working on Deal Relevance & Personalization
Systems)
ameya.kanitkar@gmail.com
http://www.linkedin.com/in/ameyakanitkar
@aktwits
3. Agenda
Basics of Hadoop & HBase
How you can use Hadoop & HBase for big data
applications
Case Study: Deal Relevance and Personalization
Systems at Groupon with Hadoop & HBase
4. Big Data Application Examples
Recommendation Systems
Ad targeting
Personalization Systems
BI/ DW
Log Analysis
Natural Language Processing
5. So what is Hadoop?
General purpose framework for processing huge
amounts of data.
Open Source
Batch / Offline Oriented
6. Hadoop - HDFS
Open Source Distributed File System.
Stores large files. Can easily be accessed via applications
built on top of HDFS.
Data is distributed and replicated over multiple machines
Linux-style commands, e.g. ls, cp, mv, touchz etc.
7. Hadoop – HDFS
Example:
hadoop fs -dus /data/
185453399927478 bytes =~ 168 TB
(One of the folders from one of our Hadoop clusters)
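The conversion on this slide can be checked with a few lines of arithmetic. This is only a sketch; it assumes the slide's "TB" means binary terabytes (2^40 bytes), which is the unit that matches the quoted figure.

```python
# Convert the byte count printed by `hadoop fs -dus` into binary terabytes.
size_bytes = 185453399927478   # value quoted on the slide
BYTES_PER_TIB = 1024 ** 4      # assumption: 1 TB here means 2^40 bytes

size_tib = size_bytes / BYTES_PER_TIB
print(f"{size_tib:.1f} TiB")   # a little over 168, consistent with the slide
```

In decimal units (10^12 bytes per TB) the same folder would be roughly 185 TB, so the unit matters when comparing against disk capacity.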
8. Hadoop – Map Reduce
Application Framework built on top of HDFS to process
your big data
Operates on key-value pairs
Mappers filter and transform input data
Reducers aggregate mapper output
9. Example
• Given web logs, calculate the landing page conversion rate
for each product
• So basically we need to see how many impressions each
product received and then calculate the conversion rate for
each product
10. Map Reduce Example
Map Phase
Reduce Phase
Map 1: Process Log File:
Output: Key (Product ID), Value
(Impression Count)
Map 2: Process Log File:
Output: Key (Product ID), Value
(Impression Count)
Map N: Process Log File:
Output: Key (Product ID), Value
(Impression Count)
Reducer: Here we receive all
data for a given product. Just run
simple for loop to calculate
conversion rate.
(Output: Product ID, Conversion
Rate)
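The map and reduce phases above can be sketched in plain Python, without a Hadoop cluster. This is a minimal illustration, not the job from the talk: the tab-separated log format, the event names `impression` and `purchase`, and the function names are all assumptions. The mapper also emits a purchase count next to the impression count, since a conversion rate needs both.

```python
from collections import defaultdict

def map_log_line(line):
    """Mapper: parse one log line, emit (product_id, (impressions, purchases))."""
    product_id, event = line.strip().split("\t")  # assumed log format
    if event == "impression":
        yield product_id, (1, 0)
    elif event == "purchase":
        yield product_id, (0, 1)

def reduce_product(product_id, values):
    """Reducer: sees all values for one product; a simple for loop does the rest."""
    impressions = purchases = 0
    for imp, pur in values:
        impressions += imp
        purchases += pur
    rate = purchases / impressions if impressions else 0.0
    return product_id, rate

def run_job(log_lines):
    """Stand-in for the framework: map, group by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in log_lines:
        for key, value in map_log_line(line):
            grouped[key].append(value)
    return dict(reduce_product(k, v) for k, v in grouped.items())

# Tiny made-up input, purely for illustration.
logs = [
    "p1\timpression", "p1\timpression", "p1\tpurchase",
    "p2\timpression", "p2\timpression", "p2\timpression",
    "p2\timpression", "p2\tpurchase",
]
rates = run_job(logs)
print(rates)
```

On a real cluster the grouping step is done by the framework's shuffle, and each mapper would read a different slice of the log files.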
11. Recap
We just processed terabytes of data, and calculated
conversion rate across millions of products.
Note: This is a batch process only. It takes time. You cannot
start this process after someone visits your website.
How about we generate recommendations in batch process
and serve them in real time?
12. HBase
Provides real time random read/ write access over HDFS
Built on Google's Bigtable design
Open Sourced
This is not RDBMS, so no joins. Access patterns are
generally simple like get(key), put(key, value) etc.
15. Putting it all together
Web / Mobile
Store data in HDFS
Analyze Data (Map Reduce)
Generate Recommendations (Map Reduce)
Serve Real Time Requests (HBase)
Do offline analysis in Hadoop, and serve real time requests with HBase
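The split between offline analysis and real time serving can be sketched as follows. Everything here is a hypothetical stand-in: a Python dict plays the role of the HBase table, the ranking is a toy click-count heuristic rather than the relevance algorithm from the talk, and the function names are invented for illustration.

```python
import json

# Hypothetical stand-in for an HBase table: row key -> serialized value.
serving_store = {}

def batch_generate_recommendations(click_history):
    """Offline step: rank deals per user by click count (toy heuristic)."""
    for user_id, clicks in click_history.items():
        counts = {}
        for deal_id in clicks:
            counts[deal_id] = counts.get(deal_id, 0) + 1
        # Most-clicked first; ties broken by deal ID for a stable order.
        ranked = sorted(counts, key=lambda d: (-counts[d], d))
        serving_store[user_id] = json.dumps(ranked)   # put(key, value)

def serve_request(user_id, limit=2):
    """Online step: one key lookup, no heavy computation at request time."""
    raw = serving_store.get(user_id)                  # get(key)
    return json.loads(raw)[:limit] if raw else []

# Made-up click data, purely for illustration.
batch_generate_recommendations({"u1": ["d3", "d1", "d3", "d2"], "u2": ["d2"]})
print(serve_request("u1"))
```

The design point is that the expensive work happens before the request arrives, so the online path is only a lookup by key.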
19. Our Relevance Scenario
How do we surface relevant
deals ?
Users
Deals are perishable (Deals
expire or are sold out)
No direct user intent (As in
traditional search
advertising)
Relatively Limited User
Information
Deals are highly local
20. Two Sides to the Relevance Problem
Algorithmic Issues: How to find relevant deals for
individual users given a set of optimization criteria
Scaling Issues: How to handle relevance for all users
across multiple delivery platforms
21. Developing Deal Ranking Algorithms
• Exploring Data
• Understanding signals, finding
patterns
• Building Models/Heuristics
• Employ both classical machine
learning techniques and heuristic
adjustments to estimate user
purchasing behavior
• Conduct Experiments
• Try out ideas on real users and
evaluate their effect
22. Data Infrastructure
Growing Deals (chart, 2011 to 2013; labels: 20+, 400+, 2000+)
Growing Users: 100 Million+ subscribers
We need to store data like user click history, email
records, service logs etc. This amounts to billions of
data points and TBs of data
23. Deal Personalization Infrastructure Use
Cases
• Deliver Personalized Emails
Email (Offline System): Personalize billions of emails for
hundreds of millions of users
• Deliver Personalized Website & Mobile Experience
Online System: Personalize one of the most popular
e-commerce mobile & web apps for hundreds of millions of
users & page views
24. Architecture
• We can now maintain different SLAs on online and
offline systems
• We can tune HBase cluster differently for online and
offline systems
[Diagram labels: Data Pipeline; HBase Offline System;
Relevance Map/Reduce; Email; Replication; HBase for
Online System; Real Time Relevance]
25. HBase Schema Design
User ID: Unique Identifier for Users
Column Family 1: User History and Profile Information.
Overwrite user history and profile info
Column Family 2: Email History For Users. Append email
history for each day as a separate column. (On avg each
row has over 200 columns)
• Most of our data access patterns are via “User Key”
• This makes it easy to design HBase schema
• The actual data is kept in JSON
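The schema above can be illustrated with a small in-memory model. This is only a sketch: nested dicts stand in for an HBase table, and the family names `profile` and `email`, the qualifier `data`, and the helper functions are hypothetical, not the names used at Groupon.

```python
import json
from collections import defaultdict

# Hypothetical in-memory stand-in for one HBase table:
# row key (user ID) -> column family -> column qualifier -> JSON string.
table = defaultdict(lambda: {"profile": {}, "email": {}})

def put_profile(user_id, profile):
    """Overwrite history/profile: always written to the same column."""
    table[user_id]["profile"]["data"] = json.dumps(profile)

def append_email_history(user_id, day, record):
    """Add one new column per day; earlier days are left untouched."""
    table[user_id]["email"][day] = json.dumps(record)

def get_user(user_id):
    """Single-row read by user key, decoding the JSON values."""
    row = table[user_id]
    return {
        family: {col: json.loads(val) for col, val in cols.items()}
        for family, cols in row.items()
    }

# Made-up values, purely for illustration.
put_profile("u1", {"clicks": 3})
put_profile("u1", {"clicks": 4})                      # replaces the first write
append_email_history("u1", "day-1", {"deals": ["d1"]})
append_email_history("u1", "day-2", {"deals": ["d2"]})
user = get_user("u1")
print(user)
```

Because every access goes through the user key, a request needs only one row read, and the email family simply grows by one column per day.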
26. Cluster Sizing
Hadoop + HBase Cluster: 100+ machine Hadoop cluster, this
runs heavy map reduce jobs. The same cluster also hosts a
15 node HBase cluster
HBase Replication
Online HBase Cluster: 10 machine dedicated HBase cluster
to serve real time SLA
• Machine Profile
  • 96 GB RAM (HBase 25 GB)
  • 24 Virtual Cores CPU
  • 8 × 2 TB Disks
• Data Profile
  • 100 Million+ Records
  • 2TB+ Data
  • Over 4.2 Billion Data Points
The relevance problem can be coarsely divided into two conceptual parts: algorithmic aspects and scale-related issues. We’ll start on the algorithmic side of things.