Friends of Solr - “Nutch and HDFS”
Saumitra Srivastav
saumitra.srivastav@glassbeam.com
Bangalore Apache Solr Group
September-2014 Meetup
Friends
Friend #1 
Nutch
What is Nutch? 
-Distributed framework for large scale web crawling 
-but does not have to be large scale at all 
-Based on Apache Hadoop 
-Direct integration with Solr
Overview
[Diagram: Seed (URLs), Nutch (Crawl, Fetch, Parse), Solr]
Components 
-CrawlDB 
-Info about URLs 
-LinkDB 
-Info about links to each URL 
-Segments 
-set of URLs that are fetched as a unit
Segments 
1.crawl_generate 
-set of URLs to be fetched 
2.crawl_fetch 
-status of fetching each URL 
3.content 
-raw content retrieved from each URL 
4.parse_text 
-parsed text of each URL 
5.parse_data 
-outlinks and metadata parsed from each URL 
6.crawl_parse 
-outlink URLs, used to update the crawldb
Scale 
-Scalable storage 
-HDFS 
-Scalable crawling 
-Map-Reduce 
-Scalable search 
-SolrCloud 
-Scalable backend 
-Gora
Features 
-Fetcher 
-Multi-threaded fetcher 
-Queues URLs per hostname / domain / IP 
-Limit the number of URLs per round of fetching 
-Default values are polite but can be made more aggressive
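The per-host queueing idea can be sketched roughly as below. This is a toy model, not Nutch's fetcher: the class name `HostQueues`, the 5-second delay, and the example URLs are assumptions chosen for illustration. Each host gets its own queue, and a host is only eligible again once its delay has passed.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class HostQueues:
    """Toy per-host fetch queues with a minimum delay between hits to one host."""

    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.next_allowed = {}            # host -> earliest time of next fetch

    def add(self, url):
        self.queues[urlparse(url).hostname].append(url)

    def next_url(self, now):
        """Return a URL whose host is ready at time `now`, or None."""
        for host, queue in self.queues.items():
            if queue and self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None


queues = HostQueues(delay_seconds=5.0)
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    queues.add(u)

first = queues.next_url(now=0.0)   # a.example is ready
second = queues.next_url(now=0.0)  # a.example must wait, so b.example is served
third = queues.next_url(now=0.0)   # nothing is ready yet
fourth = queues.next_url(now=5.0)  # a.example is ready again
```

Lowering `delay_seconds` corresponds to the "more aggressive" settings mentioned above; a multi-threaded fetcher would have several workers calling `next_url` concurrently.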
Features 
-Crawl Strategy 
-Breadth-first but can be depth-first 
-Configurable via custom ScoringFilters
Features 
-Scoring 
-OPIC (On-line Page Importance Calculation) by default 
-LinkRank 
-Protocols 
-Http, file, ftp, https 
-Respects robots.txt directives
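Returning to the OPIC scoring mentioned above, its core idea can be sketched as follows: every page holds some "cash"; when a page is processed, that cash is recorded in the page's history and split evenly among its outlinks. This is a simplified illustration, not Nutch's scoring plugin; the three-page graph and the function name `opic_round` are hypothetical.

```python
def opic_round(graph, cash, history, page):
    """Process one page: bank its cash in history, then share it among outlinks."""
    amount = cash[page]
    history[page] += amount
    cash[page] = 0.0
    outlinks = graph[page]
    if outlinks:
        share = amount / len(outlinks)
        for target in outlinks:
            cash[target] += share


# Hypothetical link graph: page -> pages it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
cash = {page: 1.0 / len(graph) for page in graph}
history = {page: 0.0 for page in graph}

for page in ["a", "b", "c"]:
    opic_round(graph, cash, history, page)
```

Pages that accumulate more history are treated as more important. In this sketch a page without outlinks simply drops its cash, which a real implementation would have to handle.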
Features 
-Scheduling 
-Fixed or adaptive 
-URL filters 
-Regex, FSA, TLD, prefix, suffix 
-URL normalisers 
-Default, regex
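A regex URL filter of the kind listed above can be sketched like this. The rule format mirrors Nutch's `regex-urlfilter.txt` convention (a leading `+` accepts, a leading `-` rejects, the first matching rule wins, and a URL that matches nothing is rejected). The rules themselves and the helper names `load_rules` and `accept` are made up for illustration.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accept(url, rules):
    """First matching rule decides; URLs matching no rule are rejected."""
    for allowed, pattern in rules:
        if pattern.search(url):
            return allowed
    return False


# Hypothetical rule set.
RULES = load_rules(r"""
# skip common binary suffixes
-\.(gif|jpg|png|zip)$
# stay inside one site
+^https?://([a-z0-9-]+\.)*example\.org/
""")
```

With these rules, `accept("http://www.example.org/docs/index.html", RULES)` is accepted, while an image on the same site or any URL on another host is rejected.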
Features 
-Parsing with Apache Tika 
-Hundreds of formats supported 
-But some legacy parsers as well 
-Plugins 
-Feeds, Language Identification etc. 
-Pluggable indexing 
-Solr, ES etc.
Common crawled fields 
-url 
-content 
-title 
-anchor 
-site 
-boost 
-digest 
-segment 
-host 
-type 
-arbitrary metadata
Setup 
-Download binary and unzip 
-http://nutch.apache.org/downloads.html 
-Conf Directory
Solr Schema
Solr-Nutch Mapping
Indexing crawled data to Solr 
-Set http.agent.name in conf/nutch-site.xml (overrides the empty default in nutch-default.xml) 
-Copy the fields from Nutch's schema.xml into the schema of the target Solr core/collection 
-create seed directory 
-bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
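The last two steps can be sketched as a small helper that writes a seed list and assembles the `bin/crawl` arguments in the order shown above. Nothing is executed here; the directory names, the Solr URL, and the function name `prepare_crawl` are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def prepare_crawl(seed_urls, work_dir, solr_url, rounds):
    """Write a seed list and return the bin/crawl argument vector (not executed)."""
    seed_dir = Path(work_dir) / "urls"
    seed_dir.mkdir(parents=True, exist_ok=True)
    # One URL per line in a plain-text seed file.
    (seed_dir / "seed.txt").write_text("\n".join(seed_urls) + "\n")
    crawl_dir = Path(work_dir) / "crawl"
    return ["bin/crawl", str(seed_dir), str(crawl_dir), solr_url, str(rounds)]


work_dir = tempfile.mkdtemp()
cmd = prepare_crawl(
    ["http://nutch.apache.org/"],
    work_dir,
    "http://localhost:8983/solr/collection1",  # hypothetical Solr URL
    2,
)
```

The resulting list could be passed to `subprocess.run(cmd, cwd=...)` from the unpacked Nutch directory once the configuration steps above are done.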
Friend #2 
HDFS
Why integrate with Hadoop? 
-Hadoop is NOT AT ALL needed to scale your Solr installation 
-Hadoop is NOT AT ALL needed for Solr distributed capabilities
Why integrate with Hadoop? 
-Integrate Solr with HDFS when your whole pipeline is Hadoop-based 
-Avoid moving data and indexes in and out 
-Avoid multiple sinks 
-Avoid redundant provisioning for Solr 
-Individual nodes' disks, etc.
Solr + Hadoop 
-Read and write directly to HDFS 
-build indexes for Solr with Hadoop's map-reduce
Lucene Directory Abstraction 
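// Lucene 4.x signatures, abridged
abstract class Directory { 
  String[] listAll(); 
  IndexOutput createOutput(String name, IOContext context); 
  IndexInput openInput(String name, IOContext context); 
  void deleteFile(String name); 
  Lock makeLock(String name); 
  void clearLock(String name); 
  ... 
  ... 
}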
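The point of the abstraction is that index code talks only to a `Directory`, so the storage behind it can be swapped. A minimal Python analogue, assuming a simplified interface (the method names follow the listing above, but the in-memory class and `write_segment` helper are hypothetical):

```python
from abc import ABC, abstractmethod


class Directory(ABC):
    """Simplified stand-in for the storage interface sketched above."""

    @abstractmethod
    def list_all(self): ...

    @abstractmethod
    def create_output(self, name, data): ...

    @abstractmethod
    def open_input(self, name): ...

    @abstractmethod
    def delete_file(self, name): ...


class InMemoryDirectory(Directory):
    """One possible backend; an HDFS-backed class would implement the same methods."""

    def __init__(self):
        self._files = {}

    def list_all(self):
        return sorted(self._files)

    def create_output(self, name, data):
        self._files[name] = bytes(data)

    def open_input(self, name):
        return self._files[name]

    def delete_file(self, name):
        del self._files[name]


def write_segment(directory, name, payload):
    """Index-side code depends only on the Directory interface."""
    directory.create_output(name, payload)
    return directory.open_input(name)


d = InMemoryDirectory()
echo = write_segment(d, "_0.cfs", b"postings")
```

`write_segment` never learns where the bytes live, which is the property that lets an HDFS-backed directory slot in without changes to the indexing code.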
HdfsDirectory
Index in HDFS 
-Solr writes and reads its index and transaction log files directly in HDFS 
-does not use Hadoop Map-Reduce to process Solr data 
-Filesystem cache needed for Solr performance 
-HDFS is not well suited to random access
Block Cache 
-enables Solr to cache HDFS index files on read and write 
-LRU semantics 
-Hot blocks are cached
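The LRU behaviour can be illustrated with a small read-through cache. This is a sketch of the general technique only, not Solr's block cache: it covers reads but not writes, and the class name, capacity, and `fake_hdfs_read` function are assumptions.

```python
from collections import OrderedDict


class BlockCache:
    """Read-through cache of (file, block) entries with LRU eviction."""

    def __init__(self, read_block, capacity):
        self.read_block = read_block  # called on a miss to fetch the block
        self.capacity = capacity
        self.blocks = OrderedDict()   # least recently used entry comes first
        self.misses = 0

    def get(self, file_name, block_id):
        key = (file_name, block_id)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        self.misses += 1
        data = self.read_block(file_name, block_id)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        return data


def fake_hdfs_read(file_name, block_id):
    """Hypothetical stand-in for a slow remote read."""
    return f"{file_name}#{block_id}".encode()


cache = BlockCache(fake_hdfs_read, capacity=2)
cache.get("_0.tim", 0)  # miss
cache.get("_0.tim", 1)  # miss
cache.get("_0.tim", 0)  # hit; block 0 becomes most recent
cache.get("_0.tim", 2)  # miss; block 1 is evicted
```

Block 0 stays cached because it was touched again before the cache filled, which is what "hot blocks are cached" amounts to.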
Transaction Log 
-HdfsUpdateLog 
-Extends UpdateLog 
-Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ 
-no additional configuration
Running Solr on HDFS 
Cloud mode
java -Dsolr.directoryFactory=HdfsDirectoryFactory \
  -Dsolr.lock.type=hdfs \
  -Dsolr.hdfs.home=hdfs://localhost:5432/solr/ \
  -DzkHost=localhost:2181 \
  -jar start.jar
Map-Reduce index building 
-Scalable index creation via map-reduce 
-https://github.com/markrmiller/solr-map-reduce-example
Map-Reduce index building 
-Initial implementations sent documents from the reducers to SolrCloud over HTTP 
-Not scalable 
-Instead, reducers create indexes in HDFS
Map-Reduce index building 
-Reducers create indexes in HDFS 
-merge the indexes down to the correct number of ‘shards’ 
-zookeeper aware 
-Go-Live
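The "build many, merge down" flow can be pictured with a toy model. Integer document IDs, modulo partitioning, and plain lists standing in for indexes are all simplifying assumptions; the real tool builds Lucene indexes and routes documents differently.

```python
def build_partial_indexes(docs, num_reducers):
    """Each reducer builds its own small index (here: a list of doc IDs)."""
    partials = [[] for _ in range(num_reducers)]
    for doc_id in docs:
        partials[doc_id % num_reducers].append(doc_id)
    return partials


def merge_to_shards(partials, num_shards):
    """Merge groups of reducer outputs down to the requested number of shards."""
    if len(partials) % num_shards != 0:
        raise ValueError("this sketch needs reducers to be a multiple of shards")
    per_shard = len(partials) // num_shards
    shards = []
    for s in range(num_shards):
        group = partials[s * per_shard:(s + 1) * per_shard]
        shards.append(sorted(d for part in group for d in part))
    return shards


partials = build_partial_indexes(range(12), num_reducers=4)
shards = merge_to_shards(partials, num_shards=2)
```

Four reducer outputs become two shards, each holding the documents of two reducers. The go-live step would then hand the merged shards to the running cluster.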
Map-Reduce index building
MorphLines 
-A morphline is a configuration file that allows you to define ETL transformation pipelines 
-replaces Java programming with simple configuration steps 
-Extract content from input files, transform content, load content 
-Uses Tika to extract content from a large variety of input documents
MorphLines 
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:9983"
  batchSize : 100
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { readAvroContainer { ... } }
      { extractAvroPaths {...} }
      { convertTimestamp {...} }
      { sanitizeUnknownSolrFields {...} }
      { loadSolr {...} }
      ....
    ]
  }
]
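Conceptually, a morphline is a chain of commands, each taking a record and passing a transformed record to the next. The Python sketch below mimics that shape only; it is not the Kite Morphlines API, and the function names and sample record are hypothetical.

```python
def run_pipeline(record, commands):
    """Pass a record through each command in order; None drops the record."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


def extract_paths(record):
    """Extract step: pull fields out of the nested input."""
    raw = record["raw"]
    return {**record, "id": raw["id"], "text": raw["body"]}


def sanitize_unknown_fields(known):
    """Transform step: keep only fields the target schema knows about."""
    def command(record):
        return {k: v for k, v in record.items() if k in known}
    return command


loaded = []


def load(record):
    """Load step: a list stands in for the search index."""
    loaded.append(record)
    return record


pipeline = [extract_paths, sanitize_unknown_fields({"id", "text"}), load]
result = run_pipeline({"raw": {"id": "1", "body": "hello"}, "junk": 1}, pipeline)
```

Reordering or swapping entries in `pipeline` changes the ETL flow without touching the individual steps, which is the role the `commands` list plays in the configuration above.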
Map-Reduce index building 
bin/hadoop --config /tmp/hadoop/sample1 \
  jar ~/softwares/solr/solr-4.10.0/dist/solr-map-reduce-*.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  -libjars "$HADOOP_LIBJAR" \
  --morphline-file /tmp/readAvroContainer.conf \
  --zk-host localhost:2181 \
  --output-dir hdfs://localhost/outdir \
  --collection twitter \
  --log4j log4j.properties \
  --go-live \
  --verbose \
  "hdfs://localhost/indir"
Thanks 
-Attributions 
•Julien Nioche’s slides on “Large scale crawling with Apache Nutch” 
•Mark Miller’s slides on “First Class Integration of Solr with Hadoop” 
-Connect 
•saumitra.srivastav@glassbeam.com 
•saumitra.srivastav7@gmail.com 
•https://www.linkedin.com/in/saumitras 
•@_saumitra_ 
-Join: 
•http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/

Real-Time AI Streaming - AI Max Princeton
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 

Friends of Solr - Nutch & HDFS

  • 1. Friends of Solr - “Nutch and HDFS” Saumitra Srivastav saumitra.srivastav@glassbeam.com Bangalore Apache Solr Group September-2014 Meetup
  • 4. What is Nutch? -Distributed framework for large scale web crawling -but does not have to be large scale at all -Based on Apache Hadoop -Direct integration with Solr
  • 5. Overview Seed (URLs) Solr Nutch Crawl Fetch Parse
  • 7. Components -CrawlDB -Info about URLs -LinkDB -Info about links to each URL -Segments -set of URLs that are fetched as a unit
  • 8. Segments 1.crawl_generate -set of URLs to be fetched 2.crawl_fetch -status of fetching each URL 3.content -raw content retrieved from each URL 4.parse_text -parsed text of each URL 5.parse_data -outlinks and metadata parsed from each URL 6.crawl_parse -outlink URLs, used to update the crawldb
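As a rough illustration of the segment layout on slide 8, the sketch below reports which of the six parts a segment is still missing. Only the six subdirectory names come from the slide; the function name `missing_segment_parts` is hypothetical and is not part of Nutch.

```python
# The six parts of a segment, in the order listed on the slide.
SEGMENT_PARTS = [
    "crawl_generate",  # set of URLs to be fetched
    "crawl_fetch",     # status of fetching each URL
    "content",         # raw content retrieved from each URL
    "parse_text",      # parsed text of each URL
    "parse_data",      # outlinks and metadata parsed from each URL
    "crawl_parse",     # outlink URLs, used to update the crawldb
]


def missing_segment_parts(present_names):
    """Return the segment parts not found in present_names, in slide order."""
    present = set(present_names)
    return [part for part in SEGMENT_PARTS if part not in present]
```

Passing the result of `os.listdir(segment_dir)` for a local segment directory would show how far that segment has progressed through generate, fetch and parse.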
  • 9. Scale -Scalable storage -HDFS -Scalable crawling -Map-Reduce -Scalable search -SolrCloud -Scalable backend -Gora
  • 10. Features -Fetcher -Multi-threaded fetcher -Queues URLs per hostname / domain / IP -Limits the number of URLs per round of fetching -Default values are polite but can be made more aggressive
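The per-host queueing and per-round limit on slide 10 can be sketched roughly as follows. This is a simplified illustration of the idea, not Nutch's fetcher: the function names `build_host_queues` and `next_round` are hypothetical, queues are keyed by hostname only, and politeness delays are left out.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def build_host_queues(urls):
    """Group URLs into one FIFO queue per hostname."""
    queues = defaultdict(deque)
    for url in urls:
        host = urlsplit(url).hostname or ""
        queues[host].append(url)
    return queues


def next_round(queues, max_urls):
    """Take URLs round-robin across hosts, up to max_urls for this round."""
    selected = []
    while len(selected) < max_urls and any(queues.values()):
        for host in list(queues):
            if queues[host] and len(selected) < max_urls:
                selected.append(queues[host].popleft())
    return selected
```

Taking one URL per host in turn keeps any single site from dominating a round, which is the point of queueing per host in the first place.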
  • 11. Features -Crawl Strategy -Breadth-first but can be depth-first -Configurable via custom ScoringFilters
  • 12. Features -Scoring -OPIC (On-line Page Importance Calculation) by default -LinkRank -Protocols -Http, file, ftp, https -Respects robots.txt directives
  • 13. Features -Scheduling -Fixed or adaptive -URL filters -Regex, FSA, TLD, prefix, suffix -URL normalisers -Default, regex
  • 14. Features -Parsing with Apache Tika -Hundreds of formats supported -But some legacy parsers as well -Plugins -Feeds, Language Identification etc. -Pluggable indexing -Solr, ES etc.
  • 15. Common crawled fields -url -content -title -anchor -site -boost -digest -segment -host -type -arbitrary metadata
  • 16. Setup -Download binary and unzip -http://nutch.apache.org/downloads.html -Conf Directory
  • 19. Indexing crawled data to Solr -Set the http.agent.name property (in nutch-default.xml, or preferably as an override in nutch-site.xml) -Copy the fields from Nutch's schema.xml to a core/collection in Solr -Create a seed directory -bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
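The last two steps on slide 19 (create a seed directory, then run the crawl script) might be scripted along these lines. The argument order is the one given on the slide; the helper names, the `seed.txt` file name and the example paths and URLs are assumptions made for illustration.

```python
import os
import tempfile


def write_seed_file(seed_dir, urls, filename="seed.txt"):
    """Create seed_dir if needed and write one seed URL per line."""
    os.makedirs(seed_dir, exist_ok=True)
    path = os.path.join(seed_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
    return path


def crawl_command(seed_dir, crawl_dir, solr_url, rounds):
    """Build the argument list: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>."""
    return ["bin/crawl", seed_dir, crawl_dir, solr_url, str(rounds)]


if __name__ == "__main__":
    # Hypothetical values; adjust to the local Nutch and Solr setup.
    with tempfile.TemporaryDirectory() as tmp:
        seed_dir = os.path.join(tmp, "urls")
        write_seed_file(seed_dir, ["http://nutch.apache.org/"])
        print(" ".join(crawl_command(seed_dir, "crawl", "http://localhost:8983/solr/", 2)))
```

The returned list can be handed to `subprocess.run` from the Nutch installation directory.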
  • 21. Why integrate with Hadoop? -Hadoop is NOT AT ALL needed to scale your Solr installation -Hadoop is NOT AT ALL needed for Solr distributed capabilities
  • 22. Why integrate with Hadoop? -Integrate Solr with HDFS when your whole pipeline is Hadoop-based -Avoid moving data and indexes in and out -Avoid multiple sinks -Avoid redundant provisioning for Solr -Individual nodes' disks, etc.
  • 23. Solr + Hadoop -Read and write directly to HDFS -build indexes for Solr with Hadoop's map-reduce
  • 24. Lucene Directory Abstraction
        class Directory {
          listAll();
          createOutput(file, context);
          openInput(file, context);
          deleteFile(file);
          makeLock(file);
          clearLock(file);
          ...
        }
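To show why this abstraction lets an index live on HDFS as easily as on local disk, here is a small Python analogue. It is not Lucene's Java API: the method names mirror the slide, but `create_output` takes bytes instead of returning a stream, locking is omitted, and `InMemoryDirectory` and `copy_index` are hypothetical names.

```python
from abc import ABC, abstractmethod


class Directory(ABC):
    """Storage-agnostic view of an index: a flat set of named files."""

    @abstractmethod
    def list_all(self):
        """Return the names of all files in the directory."""

    @abstractmethod
    def create_output(self, name, data):
        """Create (or overwrite) a file with the given bytes."""

    @abstractmethod
    def open_input(self, name):
        """Return the bytes stored under name."""

    @abstractmethod
    def delete_file(self, name):
        """Remove the named file."""


class InMemoryDirectory(Directory):
    """One possible backend; an HDFS-backed class would expose the same methods."""

    def __init__(self):
        self._files = {}

    def list_all(self):
        return sorted(self._files)

    def create_output(self, name, data):
        self._files[name] = bytes(data)

    def open_input(self, name):
        return self._files[name]

    def delete_file(self, name):
        del self._files[name]


def copy_index(src, dst):
    """Copy every file; works for any pair of Directory implementations."""
    for name in src.list_all():
        dst.create_output(name, src.open_input(name))
```

Code such as `copy_index` never needs to know where the bytes live, which is what allows a different storage backend to be swapped in through configuration alone.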
  • 26. Index in HDFS -Writes and reads index and transaction log files directly in HDFS -Does not use Hadoop Map-Reduce to process Solr data -Filesystem cache needed for Solr performance -HDFS is not suited to random access
  • 27. Block Cache -enables Solr to cache HDFS index files on read and write -LRU semantics -Hot blocks are cached
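The LRU behaviour described on slide 27 can be sketched as below. This illustrates the eviction policy only and is not Solr's block cache implementation; the class name `BlockCache`, the `load_block` callback and the capacity parameter are hypothetical.

```python
from collections import OrderedDict


class BlockCache:
    """Keep the most recently used blocks in memory, evicting the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # ordered from least to most recently used

    def get(self, key, load_block):
        """Return the block for key, calling load_block(key) only on a miss."""
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return self.blocks[key]
        data = load_block(key)  # stands in for a slow read from HDFS
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data
```

Blocks that are read repeatedly stay near the "recent" end and survive eviction, which is how hot blocks end up cached.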
  • 28. Transaction Log -HdfsUpdateLog -Extends UpdateLog -Triggered by setting the UpdateLog dataDir to a path that starts with hdfs:/ -No additional configuration
  • 29. Running Solr on HDFS: Cloud mode
        java -Dsolr.directoryFactory=HdfsDirectoryFactory \
             -Dsolr.lock.type=hdfs \
             -Dsolr.hdfs.home=hdfs://localhost:5432/solr/ \
             -DzkHost=localhost:2181 \
             -jar start.jar
  • 30. Map-Reduce index building -Scalable index creation via map-reduce -https://github.com/markrmiller/solr-map-reduce-example
  • 31. Map-Reduce index building -Initial implementations sent documents from the reducers to SolrCloud over HTTP -Not scalable -Instead, reducers create indexes in HDFS
  • 32. Map-Reduce index building -Reducers create indexes in HDFS -merge the indexes down to the correct number of ‘shards’ -zookeeper aware -Go-Live
  • 34. MorphLines -A morphline is a configuration file that allows you to define ETL transformation pipelines -replaces Java programming with simple configuration steps -Extract content from input files, transform content, load content -Uses Tika to extract content from a large variety of input documents
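The extract, transform, load chain that a morphline describes can be pictured with a plain Python sketch. It does not use the Kite Morphlines API or its configuration syntax: every name here (`run_pipeline`, `extract`, `transform`, `make_loader`) is hypothetical, and a list stands in for Solr.

```python
def run_pipeline(commands, record):
    """Pass a record through each command in order; None drops the record."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


def extract(record):
    """Extract: pull text out of the raw input (here, just trim whitespace)."""
    return {**record, "text": record["raw"].strip()}


def transform(record):
    """Transform: normalise the extracted text."""
    return {**record, "text": record["text"].lower()}


def make_loader(sink):
    """Load: build a command that appends the finished document to sink."""
    def load(record):
        sink.append({"id": record["id"], "text": record["text"]})
        return record
    return load
```

Changing the list passed to `run_pipeline` changes the pipeline without touching any command, which is the role the configuration file plays in a morphline.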
  • 35. MorphLines
        SOLR_LOCATOR : {
          collection : collection1
          zkHost : "127.0.0.1:9983"
          batchSize : 100
        }
        morphlines : [
          {
            id : morphline1
            importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
            commands : [
              { readAvroContainer { ... } }
              { extractAvroPaths {...} }
              { convertTimestamp {...} }
              { sanitizeUnknownSolrFields {...} }
              { loadSolr {...} }
              ....
            ]
          }
        ]
  • 36. Map-Reduce index building
        bin/hadoop --config /tmp/hadoop/sample1 \
          jar ~/softwares/solr/solr-4.10.0/dist/solr-map-reduce-*.jar \
          -D 'mapred.child.java.opts=-Xmx500m' \
          -libjars "$HADOOP_LIBJAR" \
          --morphline-file /tmp/readAvroContainer.conf \
          --zk-host localhost:2181 \
          --output-dir hdfs://localhost/outdir \
          --collection twitter \
          --log4j log4j.properties \
          --go-live \
          --verbose \
          "hdfs://localhost/indir"
  • 37. Thanks -Attributions •Julien Nioche’s slides on “Large scale crawling with Apache Nutch” •Mark Miller’s slides on “First Class Integration of Solr with Hadoop” -Connect •saumitra.srivastav@glassbeam.com •saumitra.srivastav7@gmail.com •https://www.linkedin.com/in/saumitras •@_saumitra_ -Join: •http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/