Frontera: open source, large scale web crawling framework
1. Frontera: open source, large scale web crawling framework
Alexander Sibiryakov, October 15, 2015
sibiryakov@scrapinghub.com
2. • Software Engineer @ Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search quality department: social and QA search, snippets.
• 2 years at Avast! antivirus, research team: automatic false-positive resolution, large-scale prediction of malicious download attempts.
Hola a todos! (Hello everyone!)
3. • Over 2 billion requests per month (~800 per second)
• Focused crawls & Broad crawls
We help turn web content into useful data
4. {
  "content": [
    {
      "title": {
        "text": "'Extreme poverty' to fall below 10% of world population for first time",
        "href": "http://www.theguardian.com/society/2015/oct/05/world-bank-extreme-poverty-to-fall-below-10-of-world-population-for-first-time"
      },
      "points": "9 points",
5. Task
• Crawl the Spanish web to gather statistics about hosts and their sizes.
• Limit the crawl to the .es zone.
• Breadth-first strategy: first crawl documents one click away, then two clicks away, and so on.
• Finishing condition: no hosts remain with fewer than 100 crawled documents.
• Low cost.
6. Spanish internet (.es) in 2012
• Domain names registered - 1.56M (39% growth per year)
• Web servers in the zone - 283.4K (33.1%)
• Hosts - 4.2M (21%)
• Spanish web sites in the DMOZ catalog - 22,043
* - OECD Communications Outlook 2013 report
7. Solution
• Scrapy* - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning, scalability).
• Twisted.Internet - asynchronous primitives used in the workers.
• Snappy - efficient compression algorithm for IO-bound applications.
* - network operations in Scrapy are implemented asynchronously, on top of the same Twisted.Internet
9. 1. Big and small hosts problem
• When the crawler gets a huge number of links from a single host and a simple prioritization model is used, the queue ends up flooded with URLs from that host.
• That causes underutilization of spider resources.
• We adopted an additional per-host (optionally per-IP) queue and a metering algorithm: URLs from big hosts are cached in memory (see the sketch below).
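A minimal sketch of the metering idea, with illustrative names rather than the actual Frontera classes: keep an in-memory buffer per host and cap how many URLs a single host may contribute to one batch.

from collections import defaultdict, deque

class PerHostQueue(object):
    # Hypothetical helper: buffers URLs per host and limits each host's
    # share of a batch, so one big host cannot flood the spider queue.
    def __init__(self, max_per_host=64):
        self.buffers = defaultdict(deque)   # hostname -> pending URLs
        self.max_per_host = max_per_host

    def push(self, hostname, url):
        self.buffers[hostname].append(url)

    def next_batch(self, max_urls=512):
        batch = []
        for host, urls in self.buffers.items():
            taken = 0
            while urls and taken < self.max_per_host and len(batch) < max_urls:
                batch.append(urls.popleft())
                taken += 1
            if len(batch) >= max_urls:
                break
        return batch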
10. 2. DDoSing the Amazon AWS DNS service
• The breadth-first strategy assumes previously unknown hosts are visited first, therefore generating a huge amount of DNS requests.
• Fix: a recursive DNS server on each downloading node, with upstreams set to Verizon and OpenDNS.
• We used dnsmasq.
11. 3. Tuning the Scrapy thread pool for efficient DNS resolution
• Scrapy uses a thread pool to resolve DNS names to IPs.
• When an IP is absent from the cache, the request is sent to the DNS server in its own thread, which blocks.
• Scrapy reported numerous errors related to DNS name resolution and timeouts.
• We added options to Scrapy for adjusting the thread pool size and timeout (see the settings sketch below).
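In current Scrapy these knobs are exposed as the REACTOR_THREADPOOL_MAXSIZE and DNS_TIMEOUT settings; the values below are illustrative only, not the ones used in the .es crawl.

# settings.py - illustrative values
REACTOR_THREADPOOL_MAXSIZE = 30   # more threads for blocking DNS lookups
DNS_TIMEOUT = 20                  # seconds to wait before giving up on a lookup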
12. 4. Overloaded HBase region servers during state check
• The crawler extracts hundreds of links from a document on average.
• Before adding these links to the queue, they need to be checked against the set of already crawled pages (to avoid repeated visits).
• On small volumes SSDs were just fine. After the table grew, we had to move to HDDs, and response times grew dramatically.
• Fix: a host-local fingerprint function for keys in HBase (see the sketch below).
• Fix: tuning the HBase block cache so that the average host's states fit into one block.
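A sketch of what a host-local key can look like (hypothetical helper, not Frontera's actual fingerprint function): prefixing the row key with a hash of the hostname keeps all URLs of one host adjacent in HBase, so the state check for the links of one page touches few blocks.

from hashlib import sha1
from urllib.parse import urlparse

def host_local_fingerprint(url):
    # Hypothetical sketch: 4-byte host prefix + 20-byte URL digest.
    host = urlparse(url).hostname or ''
    return sha1(host.encode('utf-8')).digest()[:4] + sha1(url.encode('utf-8')).digest()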
13. 5. Intensive network traffic from workers to services
• We noticed throughput between the workers, Kafka and HBase of up to 1 Gbit/s.
• Switched to the Thrift compact protocol for HBase communication.
• Enabled message compression in Kafka using Snappy (both changes are sketched below).
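Both switches are single client options; a minimal sketch assuming the happybase and kafka-python client libraries (the slides do not name the exact clients used):

import happybase
from kafka import KafkaProducer

# Thrift compact protocol for HBase; it is normally paired with the
# framed transport on the Thrift server side.
hbase = happybase.Connection('hbase-thrift-host',
                             protocol='compact', transport='framed')

# Snappy-compressed messages on the Kafka producer side
# (requires the python-snappy package).
producer = KafkaProducer(bootstrap_servers='kafka:9092',
                         compression_type='snappy')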
14. 6. Further query and traffic optimizations to HBase
• The state check accounted for the lion's share of requests and network throughput.
• Consistency was another requirement.
• We created a local state cache in the strategy worker.
• For consistency, the spider log was partitioned by host, to avoid cache overlap between workers (see the partitioner sketch below).
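Partitioning the spider log by host guarantees that all events for a given host reach the same strategy worker, so no two state caches hold entries for the same host. An illustrative partitioner (not the actual Frontera code):

from zlib import crc32
from urllib.parse import urlparse

def spider_log_partition(url, num_partitions):
    # Same host -> same partition -> same strategy worker and state cache.
    host = urlparse(url).hostname or ''
    return crc32(host.encode('utf-8')) % num_partitions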
15. State cache
• All operations are batched:
• if a key is absent from the cache, it is requested from HBase,
• every ~4K documents the cache is flushed to HBase.
• On reaching 3M elements (~1 GB), a flush and cleanup happen.
• A Least-Recently-Used (LRU) eviction policy seems to be a good fit here (a sketch of the batching logic follows).
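A compressed sketch of that batching logic, assuming a happybase-style table object; class, method and column names are illustrative, not the real implementation:

class StateCache(object):
    # Read-through from HBase, write-back every FLUSH_EVERY documents,
    # full flush and cleanup when MAX_ELEMENTS is reached.
    FLUSH_EVERY = 4000
    MAX_ELEMENTS = 3000000

    def __init__(self, table):
        self.table = table           # happybase-style HBase table
        self.cache = {}              # fingerprint -> crawl state
        self.since_flush = 0

    def get(self, fp):
        if fp not in self.cache:
            row = self.table.row(fp)
            self.cache[fp] = row.get(b's:state')
        return self.cache[fp]

    def set(self, fp, state):
        self.cache[fp] = state
        self.since_flush += 1
        if self.since_flush >= self.FLUSH_EVERY:
            self.flush()
        if len(self.cache) >= self.MAX_ELEMENTS:
            self.flush()
            self.cache.clear()       # an LRU eviction would fit here instead

    def flush(self):
        with self.table.batch() as b:
            for fp, state in self.cache.items():
                if state is not None:
                    b.put(fp, {b's:state': state})
        self.since_flush = 0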
16. Spider priority queue (slot)
• A cell has an array of:
- fingerprint,
- CRC32(hostname),
- URL,
- score
• Dequeueing takes the top N.
• Such a design is prone to domination by huge hosts.
• Partially this problem can be solved with a scoring model that takes into account the known document count per host (see the sketch below).
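An illustrative model of such a cell and a top-N dequeue; it also shows why the design favours huge hosts, since nothing stops one host from supplying all N results:

from collections import namedtuple

QueueCell = namedtuple('QueueCell', ['fingerprint', 'host_crc32', 'url', 'score'])

def dequeue_top_n(cells, n):
    # Highest-scored first; a single huge host can occupy the whole batch.
    return sorted(cells, key=lambda c: c.score, reverse=True)[:n]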
17. 7. Problem of big and small hosts (strikes back!)
• During the crawl we found a few very huge hosts (>20M docs).
• All queue partitions were flooded with pages from these few huge hosts, because of the queue design and the scoring model used.
• We made two MapReduce jobs:
• queue shuffling,
• limiting all hosts to no more than 100 documents (the per-host cap logic is sketched below).
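The second job is essentially a group-by with a counter. The reduce-side logic, shown here as a plain local loop over the QueueCell records sketched above (the real jobs ran as MapReduce over the HBase queue table):

from collections import defaultdict

def cap_per_host(queue_cells, limit=100):
    # Keep at most `limit` queued documents per host.
    kept, counts = [], defaultdict(int)
    for cell in queue_cells:
        if counts[cell.host_crc32] < limit:
            counts[cell.host_crc32] += 1
            kept.append(cell)
    return kept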
18. Hardware requirements
• A single-threaded Scrapy spider gives 1200 pages/min from about 100 websites crawled in parallel.
• The spiders-to-workers ratio is 4:1 (without content).
• 1 GB of RAM for every SW (state cache, tunable).
• Example (the arithmetic is checked below):
• 12 spiders ~ 14.4K pages/min,
• 3 SW and 3 DB workers,
• 18 cores in total.
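The example numbers follow directly from the first two bullets; a quick check of the arithmetic:

spiders = 12
pages_per_min = spiders * 1200            # 14,400 pages/min
sw_workers = db_workers = spiders // 4    # 4:1 ratio -> 3 SW and 3 DB workers
total_cores = spiders + sw_workers + db_workers   # 18 cores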
20. Maintaining Cloudera Hadoop on Amazon EC2
• CDH is very sensitive to free space on the root partition: parcels and Cloudera Manager storage.
• We moved them, using symbolic links, to a separate EBS partition.
• The EBS volume should be at least 30 GB; base IOPS are enough.
• Initial hardware was 3 x m3.xlarge (4 CPU, 15 GB RAM, 2x40 GB SSD).
• After one week of crawling we ran out of space and started to move DataNodes to d2.xlarge (4 CPU, 30.5 GB RAM, 3x2 TB HDD).
21. Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es are the biggest websites,
• 68.7K domains found (~600K expected),
• 46.5M pages crawled overall,
• 1.5 months,
• 22 websites with more than 50M pages.
26. Main features
• Online operation: scheduling of new batches, updating of DB state.
• Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included) - see the skeleton below.
• Canonical URL resolution abstraction: each document has many URLs; which one should be used?
• Scrapy ecosystem: good documentation, big community, ease of customization.
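The storage abstraction boils down to implementing a backend class with a handful of callbacks. A toy in-memory sketch; method names follow the Frontera backend interface of that period (frontier_start, add_seeds, page_crawled, get_next_requests), but signatures differ between versions, so treat this as a sketch and check the docs of your Frontera release. In real code the class would subclass frontera.core.components.Backend.

class MemoryFIFOBackend(object):
    # Toy backend: keeps the frontier in memory, FIFO order.
    def __init__(self, manager):
        self.queue = []

    @classmethod
    def from_manager(cls, manager):
        return cls(manager)

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def add_seeds(self, seeds):
        self.queue.extend(seeds)

    def page_crawled(self, response, links):
        self.queue.extend(links)

    def request_error(self, request, error):
        pass

    def get_next_requests(self, max_next_requests, **kwargs):
        batch = self.queue[:max_next_requests]
        self.queue = self.queue[max_next_requests:]
        return batch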
27. Distributed Frontera features
• Communication layer is Apache Kafka: topic partitioning, offsets mechanism.
• Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module.
• Polite by design: each website is downloaded by at most one spider.
• Python: workers, spiders.
29. Future plans
• A lighter version, without HBase and Kafka, communicating over sockets.
• A revisiting strategy out of the box.
• A watchdog solution: tracking website content changes.
• PageRank or HITS strategy.
• Own HTML and URL parsers.
• Integration into Scrapinghub services.
• Testing on larger volumes.
30. Contribute!
• Distributed Frontera is historically the first attempt to implement a web-scale web crawler in Python.
• It is a truly resource-intensive task: CPU, network, disks.
• Made in Scrapinghub, the company where Scrapy was created.
• There are plans to become an Apache Software Foundation project.