4. What is Nutch?
-Distributed framework for large scale web crawling
-but does not have to be large scale at all
-Based on Apache Hadoop
-Direct integration with Solr
7. Components
-CrawlDB
-Info about URLs
-LinkDB
-Info about links to each URL
-Segments
-set of URLs that are fetched as a unit
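On disk this looks roughly as follows (the crawl root name is whatever you pass on the command line; the timestamped segment names are assumptions):
  crawl/
    crawldb/            # CrawlDB: status and metadata for every known URL
    linkdb/             # LinkDB: inlinks pointing to each URL
    segments/
      20240101123000/   # one segment per generate/fetch/parse round
      20240102093000/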
8. Segments
1. crawl_generate
-set of URLs to be fetched
2. crawl_fetch
-status of fetching each URL
3. content
-raw content retrieved from each URL
4. parse_text
-parsed text of each URL
5. parse_data
-outlinks and metadata parsed from each URL
6. crawl_parse
-outlink URLs, used to update the CrawlDB
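Inside a single segment (the timestamped directory name below is an assumption) these parts are simply subdirectories:
  segments/20240101123000/
    crawl_generate/   # URLs selected for this fetch round
    crawl_fetch/      # fetch status per URL
    content/          # raw fetched content
    parse_text/       # extracted plain text
    parse_data/       # outlinks and metadata
    crawl_parse/      # outlink entries used to update the CrawlDB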
10. Features
-Fetcher
-Multi-threaded fetcher
-Queues URLs per hostname / domain / IP
-Limits the number of URLs per round of fetching
-Default values are polite but can be made more aggressive
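A minimal sketch of the corresponding overrides in conf/nutch-site.xml (Nutch 1.x property names; the values are illustrative, not recommendations):
  <configuration>
    <property>
      <name>fetcher.threads.fetch</name>
      <value>10</value>      <!-- total fetcher threads -->
    </property>
    <property>
      <name>fetcher.queue.mode</name>
      <value>byHost</value>  <!-- one queue per host; byDomain / byIP also possible -->
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>     <!-- seconds between requests to the same queue -->
    </property>
    <property>
      <name>generate.max.count</name>
      <value>1000</value>    <!-- cap on URLs per host/domain in one fetch round -->
    </property>
  </configuration>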
11. Features
-Crawl Strategy
-Breadth-first but can be depth-first
-Configurable via custom ScoringFilters
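As a sketch of what a custom filter looks like (assuming Nutch 1.x and its AbstractScoringFilter convenience base class; plugin.xml registration and plugin.includes wiring not shown; the class itself is a hypothetical example):
  package org.example.scoring;

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.scoring.AbstractScoringFilter;
  import org.apache.nutch.scoring.ScoringFilterException;

  public class UnfetchedFirstScoringFilter extends AbstractScoringFilter {
    // The Generator sorts candidate URLs by this value, so overriding it
    // changes the order in which the crawl frontier is visited.
    @Override
    public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
        throws ScoringFilterException {
      // Illustrative heuristic: boost URLs that have never been fetched.
      boolean unfetched = datum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED;
      return unfetched ? initSort * 2.0f : initSort;
    }
  }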
12. Features
-Scoring
-OPIC (On-line Page Importance Calculation) by default
-LinkRank
-Protocols
-HTTP, file, FTP, HTTPS
-Respects robots.txt directives
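Which protocol and scoring plugins are active is controlled by the plugin.includes property in nutch-site.xml; an illustrative value close to the shipped default:
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>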
13. Features
-Scheduling
-Fixed or adaptive
-URL filters
-Regex, FSA, TLD, prefix, suffix
-URL normalisers
-Default, regex
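The regex filter reads conf/regex-urlfilter.txt, where each line accepts (+) or rejects (-) matching URLs and the first match wins. An illustrative fragment (the site is a placeholder):
  # skip common binary extensions
  -\.(gif|jpg|png|ico|css|zip|gz|pdf)$
  # skip URLs containing session-like characters
  -[?*!@=]
  # stay inside one assumed site
  +^https?://www\.example\.com/
  # reject everything else
  -.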
14. Features
-Parsing with Apache Tika
-Hundreds of formats supported
-But some legacy parsers as well
-Plugins
-Feeds, Language Identification etc.
-Pluggable indexing
-Solr, ES etc.
19. Indexing crawled data to Solr
-Set the http.agent.name property in nutch-site.xml (it overrides nutch-default.xml)
-Copy the fields from Nutch's schema.xml into the schema of a Solr core/collection
-Create a seed directory containing a text file with the start URLs
-bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
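Put together, a small local run could look like this (the Solr URL, core name and seed URL are placeholders; http.agent.name must be set first):
  mkdir -p urls
  echo "https://nutch.apache.org/" > urls/seed.txt
  bin/crawl urls crawl http://localhost:8983/solr/nutch 2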
21. Why integrate with Hadoop?
-Hadoop is NOT AT ALL needed to scale your Solr installation
-Hadoop is NOT AT ALL needed for Solr distributed capabilities
22. Why integrate with Hadoop?
-Integrate Solr with HDFS when your whole pipeline is Hadoop-based
-Avoid moving data and indexes in and out
-Avoid multiple sinks
-Avoid redundant provisioning for Solr
-Individual nodes' disks, etc.
23. Solr + Hadoop
-Read and write directly to HDFS
-Build indexes for Solr with Hadoop MapReduce
26. Index in HDFS
-Writes and reads index and transaction log files directly in HDFS
-Does not use Hadoop MapReduce to process Solr data
-Filesystem cache needed for Solr performance
-HDFS not fit for random access
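Index storage in HDFS is switched on through the directory factory in solrconfig.xml; a minimal sketch (the NameNode address and paths are assumptions):
  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
  </directoryFactory>
  <indexConfig>
    <lockType>hdfs</lockType>
  </indexConfig>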
27. Block Cache
-enables Solr to cache HDFS index files on read and write
-LRU semantics
-Hot blocks are cached
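The cache is tuned with further HdfsDirectoryFactory parameters inside the same directoryFactory element (illustrative values):
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">false</bool>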
28. Transaction Log
-HdfsUpdateLog
-Extends UpdateLog
-Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/
-no additional configuration
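In solrconfig.xml that amounts to pointing the update log directory at HDFS (the path is an assumption):
  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">hdfs://namenode:8020/solr/tlog</str>
    </updateLog>
  </updateHandler>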
34. MorphLines
-A morphline is a configuration file that allows you to define ETL transformation pipelines
-replaces Java programming with simple configuration steps
-Extract content from input files, transform content, load content
-Uses Tika to extract content from a large variety of input documents
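A minimal morphline (Kite Morphlines HOCON syntax; the ZooKeeper address, collection name and file name are assumptions) that runs Tika via solrCell and loads the result into Solr could look like:
  # morphline.conf (illustrative)
  SOLR_LOCATOR : {
    collection : collection1
    zkHost : "zk1:2181/solr"
  }
  morphlines : [
    {
      id : morphline1
      importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
      commands : [
        { detectMimeType { includeDefaultMimeTypes : true } }
        { solrCell {
            solrLocator : ${SOLR_LOCATOR}
            parsers : [ { parser : org.apache.tika.parser.AutoDetectParser } ]
          }
        }
        { generateUUID { field : id } }
        { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
        { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
      ]
    }
  ]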