Ingest and Indexing of
RSS News Feeds in the
Hadoop Environment
Stephanie F. Guadagno
January 2014

SFG- 1/9/2014

Introduction
• Work is being done on a Virtual Machine, loaded
with Cloudera’s CDH 4.3.
• Used Flume 1.3, Cloudera’s Morphlines, Cloudera
Search with Solr 4.3, Hadoop 2.0.

• Used Flume to pull over RSS News Feeds from
BBC World News into HDFS.
• The news data in HDFS was indexed and loaded
into Solr using the MapReduceIndexerTool and
Cloudera’s Morphlines framework.
Overview of Components Used
• Flume is used to reliably ingest large amounts of data from various
sources (e.g. log files, Web Sites, Social Media Sites) into a
centralized or distributed data store, such as HBase or HDFS.
• MapReduceIndexerTool is a MapReduce batch job driver used with
Cloudera Search. The tool is used to index a set of input files and
then write the indexes into HDFS. The GoLive feature will merge
the output shards into a set of live Solr servers (e.g. a SolrCloud).
• Cloudera Morphlines is a new open source framework that
facilitates simple ETL of ingested data into Apache Solr. The
framework consists of the new Morphlines library and
specifications for creating a “morphline”, which encompasses a
chain of transformation commands.
• Cloudera Search facilitates Big Data search by bringing search and
scalable indexing from Solr 4.X into the Hadoop ecosystem.

Flume’s Data Flow
• A Flume Agent is a Java process that hosts the Flume Source,
Channel, and Sink components through which events flow from an
external source to the next destination.
• An event is a unit of data that flows through the components.
• A Flume Source listens for events and writes each event to the
Channel.
• The Channel queues the events as transactions.
• The Flume Sink writes each event to the external destination
(e.g. HDFS, HBase, Solr, or a file) and removes the event from
the queue.

[Diagram: External Source (e.g. Social Media, Log files, Web Pages, RSS News Feed), in a format recognized by the Flume Source → Agent [Source (e.g. Avro, Exec, HTTP, JMS, Syslog, etc.) → Channel (e.g. Memory, File, JDBC) → Sink (e.g. File, HDFS, HBase, Morphline Solr Sink)] → File, HBase, HDFS, or Solr, in a format specified by the Flume Sink]
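The hand-off described on this slide can be pictured with a small toy model. The sketch below is not the Flume API: the class FlowSketch and its methods are hypothetical names, a bounded in-memory queue stands in for the Channel, and a list stands in for the destination. It also skips Flume's transaction handling, so it only shows the order in which an event moves from Source to Channel to Sink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model of the Source -> Channel -> Sink hand-off (hypothetical, not Flume classes).
public class FlowSketch {
    // The "channel": a bounded queue sitting between source and sink.
    private final BlockingQueue<String> channel;
    // The "destination": stands in for HDFS, HBase, Solr, or a file.
    private final List<String> destination = new ArrayList<>();

    public FlowSketch(int capacity) {
        this.channel = new ArrayBlockingQueue<>(capacity);
    }

    // Source side: put an event on the channel; returns false when the channel is full.
    public boolean source(String event) {
        return channel.offer(event);
    }

    // Sink side: take the next event off the channel and write it to the destination.
    // Returns false when the channel is empty.
    public boolean sink() {
        String event = channel.poll();
        if (event == null) {
            return false;
        }
        destination.add(event);
        return true;
    }

    public int queued() {
        return channel.size();
    }

    public List<String> delivered() {
        return destination;
    }

    public static void main(String[] args) {
        FlowSketch flow = new FlowSketch(2);
        flow.source("item-1");
        flow.source("item-2");
        System.out.println("third event accepted: " + flow.source("item-3"));
        while (flow.sink()) {
            // drain the channel into the destination
        }
        System.out.println("delivered: " + flow.delivered());
    }
}
```

The bounded queue is used here because the Memory Channel later in the deck is configured with a fixed capacity; when the queue is full, the source side has to wait or drop.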
Morphline Data Flow
• Cloudera’s Morphlines is a Java library that was developed as part
of Cloudera Search.
• The library contains a suite of frequently used transformation and
I/O “command” classes for use in simple ETL on data flows into Solr.
• The library can be integrated into Flume for near-real-time (NRT)
ETL or into MapReduce for batch ETL.
• For batch ETL, Cloudera provides the MapReduceIndexerTool for data
in HDFS. For data in HBase, the tool is the
HBaseMapReduceIndexerTool.
• A morphline consumes input records and turns them into a stream of
records. The stream of records is piped through a chain of
transformation commands.

[Diagram: Batch: Source (HDFS, HBase), or NRT: Flume Source → record → Morphline (cmd → cmd → … → cmd) → record → Solr]
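The idea of piping a record through a chain of commands can be sketched in plain Java. This is a toy model only and does not use the Morphlines library: CommandChainSketch, rename, and keepOnly are hypothetical names, and a record is simplified to a Map of field names to values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy model of a command chain (hypothetical, not the Morphlines API).
public class CommandChainSketch {
    private final List<UnaryOperator<Map<String, Object>>> commands = new ArrayList<>();

    // Append a command to the end of the chain.
    public CommandChainSketch add(UnaryOperator<Map<String, Object>> command) {
        commands.add(command);
        return this;
    }

    // Pipe one record through every command in order.
    // In this sketch a command may return null to drop the record.
    public Map<String, Object> process(Map<String, Object> record) {
        Map<String, Object> current = record;
        for (UnaryOperator<Map<String, Object>> command : commands) {
            current = command.apply(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    // Hypothetical command: move the value of field "from" to field "to".
    public static UnaryOperator<Map<String, Object>> rename(String from, String to) {
        return record -> {
            Map<String, Object> out = new LinkedHashMap<>(record);
            if (out.containsKey(from)) {
                out.put(to, out.remove(from));
            }
            return out;
        };
    }

    // Hypothetical command: discard every field that is not in the allowed set.
    public static UnaryOperator<Map<String, Object>> keepOnly(Set<String> allowed) {
        return record -> {
            Map<String, Object> out = new LinkedHashMap<>(record);
            out.keySet().retainAll(allowed);
            return out;
        };
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("Title", "Example headline");
        record.put("unknown_field", "x");

        CommandChainSketch chain = new CommandChainSketch()
                .add(rename("Title", "title"))
                .add(keepOnly(Set.of("title")));
        System.out.println(chain.process(record));
    }
}
```

Each command receives the output of the previous one, which is the same shape as the cmd → cmd → … → cmd chain in the diagram.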
News Feed ETL Data Flow
1) Ingest using Flume

[Diagram: External Source (BBC RSS News Feeds – us, uk, asia, etc.) → Agent ("agent") [Custom Source → Memory Channel → HDFS Sink] → Avro JSON record(s) → HDFS ("newsfeeds/")]

• Implemented a custom Flume event-driven Source to pull RSS
News Feeds from BBC World News. Details:
– Must implement Flume’s EventDrivenSource interface
– Parsed the News Feed items
– Wrote each item to the Channel in Avro JSON format
• Ensured the Agent was defined. CDH4 came with an agent
called “tier1”. I created an agent called “agent”.
• Configured the Data Flow in a Flume agent configuration file.
• Wrote a script that runs the Flume agent with the agent
configuration file.

2) Index using MapReduce and Morphline

[Diagram: HDFS ("newsfeeds/") → Avro JSON record(s) → MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool), a MapReduce job driven by the Morphline Configuration File → Solr Cloud ("news_feeds")]

• Created a Solr instance for the “news_feeds” collection
with a modified schema for the fields in the news feed data.
• Created the “news_feeds” collection with 1 shard.
• Wrote the Morphline file.
• Wrote a script that runs the MapReduceIndexerTool with
the Morphline specification file.
News Feed Ingest Details-1 of 2
(Configuration)
Chose:
1. Custom Flume Event Driven Source
2. Memory Channel
3. HDFS Sink

# Flume Data Flow Configuration
# -----------------------------------------------

# Definitions
agent.sources=news-source
agent.channels=memory-ch
agent.sinks=hdfs-sink

# Channel (memory channel with queue capacity of 5000)
agent.channels.memory-ch.type=memory
agent.channels.memory-ch.capacity=5000

# Sources (ingest using RSSFlumeSourceReader class)
agent.sources.news-source.type=dataingest.rssfeeds.RSSFlumeSourceReader
agent.sources.news-source.channels=memory-ch

# Sink (output to HDFS in Text format)
agent.sinks.hdfs-sink.type=hdfs
agent.sinks.hdfs-sink.channel=memory-ch
agent.sinks.hdfs-sink.hdfs.path=hdfs://localhost:8020/user/cloudera/flume/newsfeeds
agent.sinks.hdfs-sink.hdfs.filePrefix=input
agent.sinks.hdfs-sink.hdfs.fileType=DataStream
agent.sinks.hdfs-sink.hdfs.writeFormat=Text
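A common mistake in a configuration like this is a source or sink that points at a channel the agent never declares. The following is a minimal sketch of a wiring check, assuming the configuration is available as a string; AgentWiringCheck and isWired are hypothetical names, and the check reads only the property keys shown on the slide. It uses java.util.Properties rather than any Flume class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper: checks that a source and a sink both refer to a declared channel.
public class AgentWiringCheck {

    public static boolean isWired(String config, String agent, String source, String sink) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(config));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        Set<String> declared = split(props.getProperty(agent + ".channels"));
        Set<String> sourceChannels =
                split(props.getProperty(agent + ".sources." + source + ".channels"));
        String sinkChannel =
                props.getProperty(agent + ".sinks." + sink + ".channel", "").trim();
        return !sourceChannels.isEmpty()
                && declared.containsAll(sourceChannels)
                && declared.contains(sinkChannel);
    }

    // Flume component lists are whitespace-separated names.
    private static Set<String> split(String value) {
        Set<String> names = new HashSet<>();
        if (value != null) {
            for (String name : value.trim().split("\\s+")) {
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String config = String.join("\n",
                "agent.sources=news-source",
                "agent.channels=memory-ch",
                "agent.sinks=hdfs-sink",
                "agent.sources.news-source.channels=memory-ch",
                "agent.sinks.hdfs-sink.channel=memory-ch");
        System.out.println("wired: " + isWired(config, "agent", "news-source", "hdfs-sink"));
    }
}
```

Note the asymmetry the check relies on: a source lists its channels under the plural key "channels", while a sink names exactly one channel under the singular key "channel".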
News Feed Ingest Details-2 of 2
(Custom Source – RSSFlumeSourceReader)
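import java.io.ByteArrayOutputStream;

import com.google.common.base.Charsets;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class RSSFlumeSourceReader extends AbstractSource
    implements EventDrivenSource, Configurable
{
  @Override
  public void configure(Context context)
  {
    // read source settings from the agent configuration (not shown on the slide)
  }

  @Override
  public synchronized void start()
  {
    super.start();
    // Flume assigns the ChannelProcessor after the source is constructed,
    // so fetch it here rather than in a field initializer.
    ChannelProcessor cp = getChannelProcessor();
    // for each URL
    {
      // Buffer for the Avro JSON output. The slide does not show this
      // declaration; ByteArrayOutputStream is an assumed type.
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      // read RSS News Feeds using java.net.URL
      // obtain a Document by parsing the feed using DocumentBuilder from
      //   javax.xml.parsers
      // get the NodeList object for the "item" tags contained in the Document object
      // for each node in the NodeList object
      {
        // write data in Avro JSON format using the Apache Avro library
      }
      // create a Flume Event and send the Event to the Channel
      Event event = EventBuilder.withBody(out.toString(), Charsets.UTF_8);
      cp.processEvent(event);
    }
  }

  @Override
  public synchronized void stop()
  {
    super.stop();
  }
}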
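The parsing steps that the source leaves as comments (DocumentBuilder, then the NodeList of "item" tags) can be sketched with the JDK alone. The sketch below is an assumption about how such parsing might look, not the author's implementation: RssItemSketch and itemTitles are hypothetical names, the feed is read from an in-memory string instead of a URL, and only the title of each item is extracted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls the <title> of every <item> out of an RSS document.
public class RssItemSketch {

    public static List<String> itemTitles(String rssXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

            // One node per news item in the feed.
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList titleNodes = item.getElementsByTagName("title");
                // Use an empty string when an item has no <title> element.
                titles.add(titleNodes.getLength() > 0
                        ? titleNodes.item(0).getTextContent()
                        : "");
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse RSS document", e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss><channel>"
                + "<item><title>First story</title></item>"
                + "<item><title>Second story</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(rss));
    }
}
```

In the real source, each extracted item would then be serialized as an Avro JSON record before being handed to the Channel; that step depends on the Avro schema, which the slides do not show.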
News Feed Data in HDFS

News Feed Data Indexing Details-1 of 3
(MapReduce)
[Diagram: HDFS ("newsfeeds/") → Avro JSON record(s) → MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool), a MapReduce job driven by the Morphline Configuration File → Solr Cloud ("news_feeds")]

Two tools are used:
1. HdfsFindTool: used to find the files changed within the past
day.
2. MapReduceIndexerTool: runs a MapReduce job to index the HDFS
input files and push the index to Solr.

# Go-live merges the output shards of the previous phase into a
# set of on-line Solr servers.
#
echo "Running go-live mode"
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.HdfsFindTool \
  -find hdfs:///${HDFS_INDIR} \
  -type f \
  -name 'in*' \
  -mtime -1 | \
sudo -u hdfs hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --libjars /usr/lib/solr/contrib/mr/search-mr-0.9.1-cdh4.3.0-SNAPSHOT.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j ${LOG_FILE} ${DRYRUN} \
  --morphline-file ${MORPHLINE_FILE} \
  --update-conflict-resolver \
    org.apache.solr.hadoop.dedup.RetainMostRecentUpdateConflictResolver \
  ${REDUCERS_ARG} \
  --verbose \
  --output-dir hdfs://localhost:8020/${HDFS_SOLR_IDXDIR} \
  --go-live \
  --zk-host localhost:2181/solr \
  --collection ${COLLECTION} \
  --input-list -

# Remove the temporary Solr zip files left behind by the job
echo "Clean-up tmp directory"
sudo -u hdfs rm /tmp/solr*.zip
echo "Done."
News Feed Data Indexing Details-2 of 3
(Morphline)
Tid-bits
• HOCON format: Human-Optimized Config Object Notation, a JSON-like
format.
• A morphline is defined with a tree of commands.
• The output of one command is sent to the next command.
• The morphline is compiled at run-time.

[Diagram: Record → "readAvro" → Record → "extractAvroPaths" → Record → "convertTimestamp" → Record → "sanitizeUnknownSolrFields" → Record → "loadSolr" → Document → Solr Cloud ("news_feeds")]

SOLR_LOCATOR : {
  # specify collection and zkHost
}

morphlines : [
  {
    id : morphlineNewsFeed
    importCommands : ["com.cloudera.**"]
    commands : [
      { readAvro {
          isJson : true
          writerSchemaFile : /home/dataingest/schema/NewsRecord.avsc
      } }

      { extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            title : /Title
            url : /Link
            published_date : /Publish_Date
            author : /Author
            comments : /Comments
            description : /Description
          }
      } }

      # Convert published_date to native Solr timestamp format
      { convertTimestamp {
          field : published_date
          inputFormats : ["EEE, d MMM yyyy HH:mm:ss z",
                          "EEE, dd MMM yyyy HH:mm:ss z"]
          inputTimezone : GMT
          outputTimezone : US/Eastern
      } }

      # Solr will throw an exception on any attempt to load
      # a document containing a field not specified in schema.xml.
      { sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
      } }

      { loadSolr {
          solrLocator : ${SOLR_LOCATOR}
      } }
    ]
  }
]
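To see what the convertTimestamp step does to a typical RSS date, the conversion can be approximated with java.text.SimpleDateFormat, which uses the same pattern syntax as the inputFormats above. This is a sketch, not the Morphlines implementation: TimestampSketch and convert are hypothetical names, and the output pattern is an assumption, since the morphline on the slide does not set an output format.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-in for the convertTimestamp command.
public class TimestampSketch {
    // Input patterns copied from the morphline on the slide.
    private static final String[] INPUT_FORMATS = {
        "EEE, d MMM yyyy HH:mm:ss z",
        "EEE, dd MMM yyyy HH:mm:ss z"
    };
    // Assumed output pattern; the slide does not specify one.
    private static final String OUTPUT_FORMAT = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";

    public static String convert(String value) {
        for (String pattern : INPUT_FORMATS) {
            SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.US);
            in.setTimeZone(TimeZone.getTimeZone("GMT"));      // inputTimezone
            try {
                Date parsed = in.parse(value);
                SimpleDateFormat out = new SimpleDateFormat(OUTPUT_FORMAT, Locale.US);
                out.setTimeZone(TimeZone.getTimeZone("US/Eastern")); // outputTimezone
                return out.format(parsed);
            } catch (ParseException e) {
                // This pattern did not match; try the next one.
            }
        }
        throw new IllegalArgumentException("Unparseable timestamp: " + value);
    }

    public static void main(String[] args) {
        System.out.println(convert("Thu, 9 Jan 2014 15:30:00 GMT"));
    }
}
```

Under these assumptions, the trailing Z in the output is a literal character rather than a zone marker, so a value written in US/Eastern time would still read as if it were UTC. It may be worth comparing the indexed published_date values against the original feed times.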
News Feed Data Indexing Details-3 of 3
(Solr Collection)
• The “news_feeds” Solr collection presently contains 3800 documents
in the index.

News Feed Document in Solr

Summary
• Presented ingest of RSS News Feeds using
– Flume with a Custom Source, Memory Channel,
and HDFS Sink

• Presented indexing of News Feed data in HDFS
into Solr using
– Cloudera’s Morphlines library and “morphline”
configuration
– Cloudera’s MapReduceIndexerTool
– Cloudera Search with Solr 4.X
Thank You

Stephanie F. Guadagno
January 2014

Ingest and Indexing in CDH4 Hadoop Environment

  • 1. Ingest and Indexing of RSS News Feeds in the Hadoop Environment Stephanie F. Guadagno January 2014 SFG- 1/9/2014 1
  • 2. Introduction • Work is being done on a Virtual Machine, loaded with Cloudera’s CDH 4.3. • Used Flume 1.3, Cloudera’s Morphlines, Cloudera Search with Solr 4.3, Hadoop 2.0. • Used Flume to pull over RSS News Feeds from BBC World News into HDFS. • The news data, in HDFS, was indexed and loaded into Solr using the MapReduceIndexerTool and the Cloudera’s Morphlines framework. SFG- 1/9/2014 2
  • 3. Overview of Components Used • Flume is used to reliably ingest large amounts of data from various sources (e.g. log files, Web Sites, Social Media Sites) into a centralized or distributed data store, such HBase or HDFS. • MapReduceIndexerTool is a MapReduce batch job driver used with Cloudera Search. The tool is used to index a set of input files and then write the indexes into HDFS. The GoLive feature will merge the output shards into a set of live Solr servers (e.g. a SolrCloud). • Cloudera Morphlines is a new open source framework that facilitates simple ETL of ingested data into Apache Solr. The framework consists of the new Morphlines library and specifications for creating a “morphline”, which encompasses a chain of transformation commands. • Cloudera Search facilitates Big Data search by bringing search and scalable indexing from Solr 4.X into the Hadoop ecosystem. SFG- 1/9/2014 3
  • 4. Flume’s Data Flow • • • • • A Flume Agent is a Java process that hosts the Flume Source, Channel, and Sink components through which events flow from an external source to the next destination. An event is a unit of data that flows through the components. A Flume Source listens for events and writes the event to the Channel. The Channel queues the events as transactions. The Flume Sink writes the event to the external source (e.g. HDFS, HBase, Solr, or a file) and removes the event from the queue. SFG- 1/9/2014 External Source (e.g. Social Media, Log files, Web Pages, RSS News Feed) in a format recognized by the Flume Source Channel Source (e.g. Memory, File, JDBC) (e.g. Avro, Exec, HTTP, JMS, Syslog, etc.) Agent Sink (e.g. File, HDFS, HBase, Morphline Solr Sink) in a format specified by the Flume Sink File HBase, HDFS, Solr 4
  • 5. Morphline Data Flow • • • • • Cloudera’s Morphlines is a Java library that was developed as part of Cloudera Search. The library contains a suite of frequently used transformation and I/O “command” classes for use in simple ETL on data flows into Solr. The library can be integrated into Flume for near-real-time ETL or into MapReduce for batch ETL. For batch ETL, Cloudera provides the MapReduceIndexerTool for data in HDFS. For data in HBase, the tool is the HBaseMapReduceIndexerTool. A morphline will consume input records, which are then turned into a stream of records. The stream of records are piped through a chain of transformation commands. SFG- 1/9/2014 Source B a t c h HDFS, HBase cmd N R T … cmd record Flume Source cmd record Morphline Solr 5
  • 6. News Feed ETL Data Flow 1) Ingest using Flume 2) Index using MapReduce and Morphline External Source Custom Source Morphline (BBC RSS News Feeds – us, uk, asia, etc.) Configuration File Memory Channel Avro JSON record(s) HDFS Sink MapReduceIndexerTool MapReduce Agent (org.apache.solr.hadoop.MapReduceIndexerTool) (“agent”) Avro JSON record(s) HDFS ("newsfeeds/”) • • • • Implemented a Custom Flume Event Driven Source to pull RSS News Feeds from BBC World News. Details: – Must implement Flume’s EventDrivenSource interface – Parsed the News Feeds items – Wrote each item to the Channel in Avro JSON format Ensured the Agent was defined. CDH4 came with an agent called “tier1”. I created an agent called “agent”. Configured the Data Flow in a Flume agent configuration file. Wrote a script that runs the flume agent with the agent configuration file. SFG- 1/9/2014 Solr Cloud (“news_feeds”) • • • • Created Solr Instance for the “news_feeds” collection with modified Schema for fields in news feed data. Created the “news_feeds” collection with 1 shard. Wrote Morphine File Wrote a script that runs the MapReduceIndexerTool with the Morphline specification file. 6
  • 7. News Feed Ingest Details-1 of 2 (Configuration) # Flume Data Flow Configuration # ----------------------------------------------# Definitions agent.sources=news-source agent.channels=memory-ch agent.sinks=hdfs-sink External Source (BBC RSS News Feeds – us, uk, asia, etc.) Custom Source Memory Channel Avro JSON record(s) HDFS Sink Agent # Channel (memory channel with queue capacity of 5000) agent.channels.memory-ch.type=memory agent.channels.memory-ch.capacity=5000 (“agent”) HDFS ("newsfeeds/”) Chose: 1. Custom Flume Event Driven Source 2. Memory Channel 3. HDFS Sink SFG- 1/9/2014 # Sources (ingest using RSSFlumeSourceReader class) agent.sources.news-source.type=dataingest.rssfeeds.RSSFlumeSourceReader agent.sources.news-source.channels=memory-ch # Sink (output to HDFS in Text format) agent.sinks.hdfs-sink.type=hdfs agent.sinks.hdfs-sink.channel=memory-ch agent.sinks.hdfs-sink.hdfs.path= hdfs://localhost:8020/user/cloudera/flume/newsfeeds agent.sinks.hdfs-sink.hdfs.filePrefix=input agent.sinks.hdfs-sink.hdfs.fileType = DataStream agent.sinks.hdfs-sink.hdfs.writeFormat = Text 7
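  • 8. News Feed Ingest Details-2 of 2 (Custom Source – RSSFlumeSourceReader)
    Diagram: External Source (BBC RSS News Feeds – us, uk, asia, etc.) → Custom Source → Memory Channel → HDFS Sink, all inside the Agent (“agent”) → Avro JSON record(s) in HDFS (“newsfeeds/”).

    import com.google.common.base.Charsets;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.EventDrivenSource;
    import org.apache.flume.channel.ChannelProcessor;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class RSSFlumeSourceReader extends AbstractSource
        implements EventDrivenSource, Configurable {

      @Override
      public void configure(Context context) {
        // required by Configurable; read any source settings from the agent configuration here
      }

      @Override
      public synchronized void start() {
        super.start();
        // Fetch the ChannelProcessor here rather than in a field initializer:
        // Flume sets it only after the source object has been constructed.
        ChannelProcessor cp = getChannelProcessor();
        // for each URL {
        //   read RSS News Feeds, using java.net.URL
        //   obtain Document by parsing news using DocumentBuilder from javax.xml.parsers
        //   get NodeList object for the "item" tags contained in the Document object
        //   for each node in the NodeList object {
        //     write data in Avro JSON format (into "out") using the Apache Avro library
        //   }
        //   create Flume Event and send Event to Channel
             Event event = EventBuilder.withBody(out.toString(), Charsets.UTF_8);
             cp.processEvent(event);
        // }
      }

      @Override
      public synchronized void stop() {
        super.stop();
      }
    }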
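    The parsing steps that the slide leaves as comments (DocumentBuilder, NodeList of “item” tags) can be sketched with the JDK alone. This is a minimal sketch, not the author's implementation: the class RssItemParser is a hypothetical name, it parses a string instead of fetching a URL so that it runs offline, it assumes the standard RSS 2.0 child tags (title, link, pubDate, description), and it collects fields into a Map rather than writing Avro JSON. The record keys mirror the field names used later in the morphline (Title, Link, Publish_Date, Description).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: extracts a few fields from each RSS <item> element. */
final class RssItemParser {

    /** Returns the trimmed text of the first descendant named tag, or "" if absent. */
    static String childText(Element item, String tag) {
        NodeList nodes = item.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Parses RSS XML and returns one field map per <item> element, in document order. */
    static List<Map<String, String>> parse(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        NodeList items = doc.getElementsByTagName("item");
        List<Map<String, String>> records = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            Map<String, String> record = new LinkedHashMap<>();
            record.put("Title", childText(item, "title"));
            record.put("Link", childText(item, "link"));
            record.put("Publish_Date", childText(item, "pubDate"));
            record.put("Description", childText(item, "description"));
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed; example.com URLs are placeholders.
        String sample = "<rss version=\"2.0\"><channel><title>Sample feed</title>"
                + "<item><title>First story</title><link>http://example.com/1</link>"
                + "<pubDate>Thu, 09 Jan 2014 15:30:00 GMT</pubDate>"
                + "<description>Summary one</description></item>"
                + "<item><title>Second story</title><link>http://example.com/2</link>"
                + "<pubDate>Thu, 09 Jan 2014 16:00:00 GMT</pubDate>"
                + "<description>Summary two</description></item>"
                + "</channel></rss>";
        for (Map<String, String> record : parse(sample)) {
            System.out.println(record);
        }
    }
}
```

    In the actual source, each record would then be serialized as Avro JSON and handed to the ChannelProcessor as the body of a Flume Event.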
  • 9. News Feed Data in HDFS
  • 10. News Feed Data Indexing Details-1 of 3 (MapReduce)
    Diagram: Avro JSON record(s) in HDFS (“newsfeeds/”) → MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool), driven by the Morphline Configuration File → Solr Cloud (“news_feeds”).
    Two tools are used:
      1. HdfsFindTool: finds the files changed within the past day.
      2. MapReduceIndexerTool: runs a MapReduce job to index the HDFS input files and push the index to Solr. It reads the file list produced by HdfsFindTool from standard input (--input-list -).

    # Go-live merges the output shards of the previous phase into a
    # set of on-line Solr servers.
    # echo "Running go-live mode"
    hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.HdfsFindTool \
      -find hdfs:///${HDFS_INDIR} -type f -name 'in*' -mtime -1 |
    sudo -u hdfs hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 \
      jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
      --libjars /usr/lib/solr/contrib/mr/search-mr-0.9.1-cdh4.3.0-SNAPSHOT.jar \
      -D 'mapred.child.java.opts=-Xmx500m' \
      --log4j ${LOG_FILE} \
      ${DRYRUN} \
      --morphline-file ${MORPHLINE_FILE} \
      --update-conflict-resolver org.apache.solr.hadoop.dedup.RetainMostRecentUpdateConflictResolver \
      ${REDUCERS_ARG} \
      --verbose \
      --output-dir hdfs://localhost:8020/${HDFS_SOLR_IDXDIR} \
      --go-live \
      --zk-host localhost:2181/solr \
      --collection ${COLLECTION} \
      --input-list -
    echo "Clean-up tmp directory"
    sudo -u hdfs rm /tmp/solr*.zip
    echo "Done."
  • 11. News Feed Data Indexing Details-2 of 3 (Morphline)
    Diagram: “readAvro” → Record → “extractAvroPaths” → Record → “convertTimestamp” → Record → “sanitizeUnknownSolrFields” → Record → “loadSolr” → Document → Solr Cloud (“news_feeds”).
    Tidbits:
     HOCON format: Human-Optimized Config Object Notation, a JSON-like format.
     A morphline is defined as a tree of commands.
     The output of one command is sent to the next command.
     The morphline is compiled at run-time.

    SOLR_LOCATOR : {
      # specify collection and zkHost
    }
    morphlines : [
      {
        id : morphlineNewsFeed
        importCommands : ["com.cloudera.**"]
        commands : [
          {
            readAvro {
              isJson : true
              writerSchemaFile : /home/dataingest/schema/NewsRecord.avsc
            }
          }
          {
            extractAvroPaths {
              flatten : false
              paths : {
                id : /id
                title : /Title
                url : /Link
                published_date : /Publish_Date
                author : /Author
                comments : /Comments
                description : /Description
              }
            }
          }
          # Convert published_date to native Solr timestamp format
          {
            convertTimestamp {
              field : published_date
              inputFormats : ["EEE, d MMM yyyy HH:mm:ss z", "EEE, dd MMM yyyy HH:mm:ss z"]
              inputTimezone : GMT
              outputTimezone : US/Eastern
            }
          }
          # Solr will throw an exception on any attempt to load
          # a document containing a field not specified in schema.xml.
          {
            sanitizeUnknownSolrFields {
              # Location from which to fetch Solr schema
              solrLocator : ${SOLR_LOCATOR}
            }
          }
          {
            loadSolr {
              solrLocator : ${SOLR_LOCATOR}
            }
          }
        ]
      }
    ]
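    To make the convertTimestamp step concrete, the sketch below reproduces the same kind of conversion with java.text.SimpleDateFormat, using the two input patterns from the morphline. It is an illustration under stated assumptions, not the Morphlines implementation: the class PubDateConverter is a hypothetical name, the output pattern yyyy-MM-dd'T'HH:mm:ss.SSS'Z' is assumed to match the command's default output format, and the output zone is set to UTC (rather than the slide's US/Eastern) so that the literal "Z" suffix is accurate.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper: converts an RSS pubDate string to a Solr-style UTC timestamp. */
final class PubDateConverter {

    // Input patterns copied from the morphline's convertTimestamp command.
    static final String[] INPUT_FORMATS = {
        "EEE, d MMM yyyy HH:mm:ss z",
        "EEE, dd MMM yyyy HH:mm:ss z"
    };

    // Assumed output pattern (ISO 8601 with milliseconds and a literal Z).
    static final String OUTPUT_FORMAT = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";

    /** Tries each input pattern in order and formats the first successful parse in UTC. */
    static String toSolrTimestamp(String pubDate) throws ParseException {
        ParseException last = null;
        for (String pattern : INPUT_FORMATS) {
            SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ENGLISH);
            // Default zone for inputs; a zone named in the string itself takes precedence.
            in.setTimeZone(TimeZone.getTimeZone("GMT"));
            try {
                Date parsed = in.parse(pubDate);
                SimpleDateFormat out = new SimpleDateFormat(OUTPUT_FORMAT, Locale.ENGLISH);
                out.setTimeZone(TimeZone.getTimeZone("UTC"));
                return out.format(parsed);
            } catch (ParseException e) {
                last = e; // remember the failure and try the next pattern
            }
        }
        throw last;
    }

    public static void main(String[] args) throws ParseException {
        // prints 2014-01-09T15:30:00.000Z
        System.out.println(toSolrTimestamp("Thu, 09 Jan 2014 15:30:00 GMT"));
    }
}
```

    Trying several patterns in order mirrors the inputFormats list in the morphline, where the first format that parses the field value is the one applied.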
  • 12. News Feed Data Indexing Details-3 of 3 (Solr Collection)
    • The “news_feeds” Solr collection presently contains 3800 documents in the index.
  • 13. News Feed Document in Solr
  • 14. Summary
    • Presented ingest of RSS News Feeds using
      – Flume with a Custom Source, Memory Channel, and HDFS Sink
    • Presented indexing of News Feed data from HDFS into Solr using
      – Cloudera’s Morphlines library and “morphline” configuration
      – Cloudera’s MapReduceIndexerTool
      – Cloudera Search with Solr 4.X
  • 15. Thank You
    Stephanie F. Guadagno, January 2014
  • 16. References
    • Flume Developer’s Guide; http://flume.apache.org/FlumeDeveloperGuide.html; The Apache Software Foundation; 2009-2012
    • Flume User Guide; http://flume.apache.org/FlumeUserGuide.html; The Apache Software Foundation; 2009-2012
    • GoLive; http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_batch_index_to_solr_servers_using_golive.html; Cloudera, Inc.; 2014
    • MapReduceIndexerTool; http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_mapreduceindexertool.html; Cloudera, Inc.; 2014
    • Morphlines; http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/; Wolfgang Hoschek; Cloudera, Inc.; July 11, 2013
    • Morphlines ETL; http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_etl_morphlines.html; Cloudera, Inc.; 2014