SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
Frontera: open source, large scale web
crawling framework
Alexander Sibiryakov, October 15, 2015
sibiryakov@scrapinghub.com
• Software Engineer @
Scrapinghub
• Born in Yekaterinburg, RU
• 5 years at Yandex, search
quality department: social and
QA search, snippets.
• 2 years at Avast! antivirus,
research team: automatic false
positive solving, large scale
prediction of malicious
download attempts.
Hola a todos!
2
• Over 2 billion requests per month (~800 per second)
• Focused crawls & Broad crawls
We help turn web content
into useful data
3
{
"content": [
{
"title": {
"text": "'Extreme poverty'
to fall below 10% of world
population for first time",
"href": "http://
www.theguardian.com/society/2015/
oct/05/world-bank-extreme-poverty-
to-fall-below-10-of-world-
population-for-first-time"
},
"points": "9 points",
• News analysis
• Topical crawling
• Price change monitoring
• Sentiment analysis (popularity, likability)
• Due diligence (profile/business data)
• Lead generation (extracting contact information)
• Track criminal activity & find lost persons (DARPA)
Broad crawl usages
4
Task
• Crawl Spanish web to gather
statistics about hosts and their
sizes.
• Limit crawl to .es zone.
• Breadth-first strategy: first crawl
1-click distance documents,
next 2-clicks, and so on,
• Finishing condition: absence of
hosts with less than 100
crawled documents.
• Low costs.
5
Spanish internet (.es) in 2012
• Domain names registered - 1,56М (39% growth per
year)
• Web server in zone - 283,4K (33,1%)
• Hosts - 4,2M (21%)
• Spanish web sites in DMOZ catalog - 22043

* - отчет OECD Communications Outlook 2013
6
Solution
• Scrapy* - network operations.
• Apache Kafka - data bus (offsets, partitioning).
• Apache HBase - storage (random access, linear scanning,
scalability).
• Twisted.Internet - library for async primitives for use in workers.
• Snappy - efficient compression algorithm for IO-bounded
applications.
* - network operations in Scrapy are implemented asynchronously,
based on the same Twisted.Internet
7
Architecture
Kafka topic
SW
Crawling strategy
workers
Storage workers
8
DB
1. Big and small hosts
problem
• When crawler comes to huge
number of links from some
host, along with usage of
simple prioritization models, it
turns out queue is flooded with
URLs from the same host.
• That causes underuse of
spider resources.
• We adopted additional per-
host (optionally per-IP)
queue and metering
algorithm: URLs from big
hosts are cached in memory.
9
2. DDoS DNS service
Amazon AWS
• Breadth-first strategy assumes
first visiting of previously
unknown hosts, therefore
generating huge amount of
DNS request.
• Recursive DNS server on each
downloading node, with
upstream set to Verizon and
OpenDNS.
• We used dnsmasq.
10
3. Tuning Scrapy thread pool’а
for efficient DNS resolution
• Scrapy uses a thread pool to
resolve DNS name to IP.
• When ip is absent in cache,
request is sent to DNS server
in it’s own thread, which is
blocking.
• Scrapy reported numerous
errors related to DNS name
resolution and timeouts.
• We added option to Scrapy
for thread pool size and
timeout adjustment.
11
4. Overloaded HBase region
servers during state check
• Crawler extracts from document
hundreds of links in average.
• Before adding this links to queue, they
needs to be checked if they weren’t
already crawled (to avoid repetitive
visiting).
• On small volumes SSDs were just fine.
After increase of table size, we had to
move to HDDs, and response times
dramatically grew up.
• Host-local fingerprint function for
keys in HBase.
• Tuning HBase block cache to fit
average host states into one block.
12
5. Intensive network traffic
from workers to services
• We noticed throughput
between workers Kafka and
HBase up to 1Gbit/s.
• Switched to Thrift compact
protocol for HBase
communication.
• Message compression in
Kafka using Snappy.
13
6. Further query and traffic
optimizations to HBase
• State check required lion’s
share of requests and
network throughput.
• Consistency was another
requirement.
• We created local state cache
in strategy worker.
• For consistency, spider log
was partitioned by host, to
avoid cache overlap
between workers.
14
State cache
• All operations are batched:
• If key is absent in cache, it’s
requested from HBase,
• every ~4K documents
cache is flushed to HBase.
• When achieving 3M (~1Гб)
elements, flush and cleanup
happens.
• It seems Least-Recently-Used
(LRU) algorithm is a good fit
there.
15
Spider priority queue (slot)
• Cell has an array of:

- fingerprint, 

- Crc32(hostname), 

- URL, 

- score
• Dequeueing top N.
• Such design is prone to huge
hosts.
• Partially this problem can be
solved using scoring model
taking into account known
document count per host.
16
7. Problem of big and small
hosts (strikes back!)
• During crawling we’ve found few
very huge hosts (>20M docs)
• All queue partitions were
flooded with pages from few
huge hosts, because of queue
design and scoring model used.
• We made two MapReduce
jobs:
• queue shuffling,
• limiting all hosts to no more
than 100 documents.
17
• Single-thread Scrapy spider gives 1200 pages/min.
from about 100 websites in parallel.
• Spiders to workers ratio is 4:1 (without content)
• 1 Gb of RAM for every SW (state cache, tunable).
• Example:
• 12 spiders ~ 14.4K pages/min.,
• 3 SW and 3 DB workers,
• Total 18 cores.
Hardware requirements
18
• Apache HBase,
• Apache Kafka,
• Python 2.7+,
• Scrapy 0.24+,
• DNS Service.
Software requirements
CDH (100% Open source
Hadoop package)
19
Maintaining Cloudera Hadoop on
Amazon EC2
• CDH is very sensitive to free space on root partition, parcels, and
storage of Cloudera Manager.
• We’ve moved it using symbolic links to separate EBS partition.
• EBS should be at least 30Gb, base IOPS should be enough.
• Initial hardware was 3 x m3.xlarge (4 CPU, 15Gb, 2x40 SSD).
• After one week of crawling, we ran out of space, and started to
move DataNodes to d2.xlarge (4 CPU, 30.5Gb, 3x2Tb HDD).
20
Spanish (.es) internet crawl results
• fnac.es, rakuten.es, adidas.es,
equiposdefutbol2014.es,
druni.es,
docentesconeducacion.es -
are the biggest websites
• 68.7K domains found (~600K
expected),
• 46.5M crawled pages overall,
• 1.5 months,
• 22 websites with more than
50M pages
21
where are the rest of
web servers?!
Bow-tie model
A. Broder et al. / Computer Networks 33 (2000) 309-320
23
Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005
24
Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014
25
• Online operation: scheduling of new batch,
updating of DB state.
• Storage abstraction: write your own backend
(sqlalchemy, HBase is included).
• Canonical URLs resolution abstraction: each
document has many URLs, which to use?
• Scrapy ecosystem: good documentation, big
community, ease of customization.
Main features
26
• Communication layer is Apache Kafka: topic
partitioning, offsets mechanism.
• Crawling strategy abstraction: crawling goal, url
ordering, scoring model is coded in separate
module.
• Polite by design: each website is downloaded by
at most one spider.
• Python: workers, spiders.
Distributed Frontera features
27
References
• Distributed Frontera. https://github.com/
scrapinghub/distributed-frontera
• Frontera. https://github.com/scrapinghub/frontera
• Documentation:
• http://distributed-frontera.readthedocs.org/
• http://frontera.readthedocs.org/
28
Future plans
• Lighter version, without HBase
and Kafka. Communicating
using sockets.
• Revisiting strategy out-of-box.
• Watchdog solution: tracking
website content changes.
• PageRank or HITS strategy.
• Own HTML and URL parsers.
• Integration into Scrapinghub
services.
• Testing on larger volumes.
29
Contribute!
• Distributed Frontera is a
historically first attempt to
implement web scale web
crawler using Python.
• Truly resource-intensive task:
CPU, network, disks.
• Made in Scrapinghub, a
company where Scrapy was
created.
• A plans to become an Apache
Software Foundation project.
30
We’re hiring!
http://scrapinghub.com/jobs/
31
32
Mandatory sales slide
Crawl the web, at scale
• cloud-based platform
• smart proxy rotator
Get data, hassle-free
• off-the-shelf datasets
• turn-key web scraping
try.scrapinghub.com/BDS15
Gracias!
Thank you!
Alexander Sibiryakov,
sibiryakov@scrapinghub.com

Más contenido relacionado

La actualidad más candente

Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachLiving with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachJeremy Zawodny
 
Securing Your MongoDB Deployment
Securing Your MongoDB DeploymentSecuring Your MongoDB Deployment
Securing Your MongoDB DeploymentMongoDB
 
Building event streaming pipelines using Apache Pulsar
Building event streaming pipelines using Apache PulsarBuilding event streaming pipelines using Apache Pulsar
Building event streaming pipelines using Apache PulsarStreamNative
 
Perfect Norikra 2nd Season
Perfect Norikra 2nd SeasonPerfect Norikra 2nd Season
Perfect Norikra 2nd SeasonSATOSHI TAGOMORI
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...MongoDB
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerTreasure Data, Inc.
 
MongoDB Pros and Cons
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Consjohnrjenson
 
To Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToTo Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToSATOSHI TAGOMORI
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark PresentationStephen Borg
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDBMongoDB
 
Technologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise BusinessTechnologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise BusinessSATOSHI TAGOMORI
 
Security monitoring log management-describe logstash,kibana,elastic slidshare
Security monitoring log management-describe logstash,kibana,elastic slidshareSecurity monitoring log management-describe logstash,kibana,elastic slidshare
Security monitoring log management-describe logstash,kibana,elastic slidshareReZa AdineH
 
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭台灣資料科學年會
 
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)Shivji Kumar Jha
 
Webinar: Enabling Microservices with Containers, Orchestration, and MongoDB
Webinar: Enabling Microservices with Containers, Orchestration, and MongoDBWebinar: Enabling Microservices with Containers, Orchestration, and MongoDB
Webinar: Enabling Microservices with Containers, Orchestration, and MongoDBMongoDB
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-CamusDeep Shah
 
Elasitcsearch + Logstash + Kibana 日誌監控
Elasitcsearch + Logstash + Kibana 日誌監控Elasitcsearch + Logstash + Kibana 日誌監控
Elasitcsearch + Logstash + Kibana 日誌監控Jui An Huang (黃瑞安)
 
Minerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSMinerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSBowenDing4
 

La actualidad más candente (20)

Living with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic ApproachLiving with SQL and NoSQL at craigslist, a Pragmatic Approach
Living with SQL and NoSQL at craigslist, a Pragmatic Approach
 
Securing Your MongoDB Deployment
Securing Your MongoDB DeploymentSecuring Your MongoDB Deployment
Securing Your MongoDB Deployment
 
Building event streaming pipelines using Apache Pulsar
Building event streaming pipelines using Apache PulsarBuilding event streaming pipelines using Apache Pulsar
Building event streaming pipelines using Apache Pulsar
 
Perfect Norikra 2nd Season
Perfect Norikra 2nd SeasonPerfect Norikra 2nd Season
Perfect Norikra 2nd Season
 
When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...When to Use MongoDB...and When You Should Not...
When to Use MongoDB...and When You Should Not...
 
Fluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker containerFluentd and Docker - running fluentd within a docker container
Fluentd and Docker - running fluentd within a docker container
 
MongoDB Pros and Cons
MongoDB Pros and ConsMongoDB Pros and Cons
MongoDB Pros and Cons
 
To Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT ToTo Have Own Data Analytics Platform, Or NOT To
To Have Own Data Analytics Platform, Or NOT To
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDB
 
Technologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise BusinessTechnologies, Data Analytics Service and Enterprise Business
Technologies, Data Analytics Service and Enterprise Business
 
Security monitoring log management-describe logstash,kibana,elastic slidshare
Security monitoring log management-describe logstash,kibana,elastic slidshareSecurity monitoring log management-describe logstash,kibana,elastic slidshare
Security monitoring log management-describe logstash,kibana,elastic slidshare
 
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
使用 Elasticsearch 及 Kibana 進行巨量資料搜尋及視覺化-曾書庭
 
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
How pulsar stores data at Pulsar-na-summit-2021.pptx (1)
 
Webinar: Enabling Microservices with Containers, Orchestration, and MongoDB
Webinar: Enabling Microservices with Containers, Orchestration, and MongoDBWebinar: Enabling Microservices with Containers, Orchestration, and MongoDB
Webinar: Enabling Microservices with Containers, Orchestration, and MongoDB
 
No sq lv1_0
No sq lv1_0No sq lv1_0
No sq lv1_0
 
Copy of Kafka-Camus
Copy of Kafka-CamusCopy of Kafka-Camus
Copy of Kafka-Camus
 
Elasitcsearch + Logstash + Kibana 日誌監控
Elasitcsearch + Logstash + Kibana 日誌監控Elasitcsearch + Logstash + Kibana 日誌監控
Elasitcsearch + Logstash + Kibana 日誌監控
 
Minerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSMinerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFS
 
Log Files
Log FilesLog Files
Log Files
 

Destacado

Kubernetes and bluemix
Kubernetes  and  bluemixKubernetes  and  bluemix
Kubernetes and bluemixDuckDuckGo
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...JAXLondon2014
 
Business use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise ArchitectureBusiness use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise ArchitectureNUS-ISS
 
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...Brian Grant
 
Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes Weaveworks
 
Deep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer ArchitectureDeep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer ArchitectureWSO2
 
A brief study on Kubernetes and its components
A brief study on Kubernetes and its componentsA brief study on Kubernetes and its components
A brief study on Kubernetes and its componentsRamit Surana
 
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...Ambassador Labs
 
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017Bob Cotton
 

Destacado (9)

Kubernetes and bluemix
Kubernetes  and  bluemixKubernetes  and  bluemix
Kubernetes and bluemix
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
 
Business use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise ArchitectureBusiness use of Social Media and Impact on Enterprise Architecture
Business use of Social Media and Impact on Enterprise Architecture
 
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
WSO2Con US 2015 Kubernetes: a platform for automating deployment, scaling, an...
 
Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes Orchestrating Microservices with Kubernetes
Orchestrating Microservices with Kubernetes
 
Deep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer ArchitectureDeep-dive into Microservice Outer Architecture
Deep-dive into Microservice Outer Architecture
 
A brief study on Kubernetes and its components
A brief study on Kubernetes and its componentsA brief study on Kubernetes and its components
A brief study on Kubernetes and its components
 
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
Velocity NYC 2017: Building Resilient Microservices with Kubernetes, Docker, ...
 
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
Kubernetes Colorado - Kubernetes metrics deep dive 10/25/2017
 

Similar a Frontera: open source, large scale web crawling framework

Alexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraAlexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraPyData
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Ontico
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopHadoop User Group
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDBMongoDB
 
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghubDana Brophy
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Meetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMeetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMinsk MongoDB User Group
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchHakka Labs
 
High cardinality time series search: A new level of scale - Data Day Texas 2016
High cardinality time series search: A new level of scale - Data Day Texas 2016High cardinality time series search: A new level of scale - Data Day Texas 2016
High cardinality time series search: A new level of scale - Data Day Texas 2016Eric Sammer
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Big data talk barcelona - jsr - jc
Big data talk   barcelona - jsr - jcBig data talk   barcelona - jsr - jc
Big data talk barcelona - jsr - jcJames Saint-Rossy
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
MongoDB: What, why, when
MongoDB: What, why, whenMongoDB: What, why, when
MongoDB: What, why, whenEugenio Minardi
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Emprovise
 

Similar a Frontera: open source, large scale web crawling framework (20)

Alexander Sibiryakov- Frontera
Alexander Sibiryakov- FronteraAlexander Sibiryakov- Frontera
Alexander Sibiryakov- Frontera
 
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...Frontera распределенный робот для обхода веба в больших объемах / Александр С...
Frontera распределенный робот для обхода веба в больших объемах / Александр С...
 
Building a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with HadoopBuilding a Scalable Web Crawler with Hadoop
Building a Scalable Web Crawler with Hadoop
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Agility and Scalability with MongoDB
Agility and Scalability with MongoDBAgility and Scalability with MongoDB
Agility and Scalability with MongoDB
 
Wmware NoSQL
Wmware NoSQLWmware NoSQL
Wmware NoSQL
 
Big data at scrapinghub
Big data at scrapinghubBig data at scrapinghub
Big data at scrapinghub
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Meetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebServiceMeetup#2: Building responsive Symbology & Suggest WebService
Meetup#2: Building responsive Symbology & Suggest WebService
 
DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
 
High cardinality time series search: A new level of scale - Data Day Texas 2016
High cardinality time series search: A new level of scale - Data Day Texas 2016High cardinality time series search: A new level of scale - Data Day Texas 2016
High cardinality time series search: A new level of scale - Data Day Texas 2016
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Big data talk barcelona - jsr - jc
Big data talk   barcelona - jsr - jcBig data talk   barcelona - jsr - jc
Big data talk barcelona - jsr - jc
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
MongoDB: What, why, when
MongoDB: What, why, whenMongoDB: What, why, when
MongoDB: What, why, when
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
 

Último

A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxAndreas Kunz
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencessuser9e7c64
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolsosttopstonverter
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 

Último (20)

A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptxUI5ers live - Custom Controls wrapping 3rd-party libs.pptx
UI5ers live - Custom Controls wrapping 3rd-party libs.pptx
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024VictoriaMetrics Anomaly Detection Updates: Q1 2024
VictoriaMetrics Anomaly Detection Updates: Q1 2024
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
eSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration toolseSoftTools IMAP Backup Software and migration tools
eSoftTools IMAP Backup Software and migration tools
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 

Frontera: open source, large scale web crawling framework

  • 1. Frontera: open source, large scale web crawling framework Alexander Sibiryakov, October 15, 2015 sibiryakov@scrapinghub.com
  • 2. • Software Engineer @ Scrapinghub • Born in Yekaterinburg, RU • 5 years at Yandex, search quality department: social and QA search, snippets. • 2 years at Avast! antivirus, research team: automatic false positive solving, large scale prediction of malicious download attempts. Hola a todos! 2
  • 3. • Over 2 billion requests per month (~800 per second) • Focused crawls & Broad crawls We help turn web content into useful data 3 { "content": [ { "title": { "text": "'Extreme poverty' to fall below 10% of world population for first time", "href": "http:// www.theguardian.com/society/2015/ oct/05/world-bank-extreme-poverty- to-fall-below-10-of-world- population-for-first-time" }, "points": "9 points",
  • 4. • News analysis • Topical crawling • Price change monitoring • Sentiment analysis (popularity, likability) • Due diligence (profile/business data) • Lead generation (extracting contact information) • Track criminal activity & find lost persons (DARPA) Broad crawl usages 4
  • 5. Task • Crawl Spanish web to gather statistics about hosts and their sizes. • Limit crawl to .es zone. • Breadth-first strategy: first crawl 1-click distance documents, next 2-clicks, and so on, • Finishing condition: absence of hosts with less than 100 crawled documents. • Low costs. 5
  • 6. Spanish internet (.es) in 2012 • Domain names registered - 1,56М (39% growth per year) • Web server in zone - 283,4K (33,1%) • Hosts - 4,2M (21%) • Spanish web sites in DMOZ catalog - 22043
 * - отчет OECD Communications Outlook 2013 6
  • 7. Solution • Scrapy* - network operations. • Apache Kafka - data bus (offsets, partitioning). • Apache HBase - storage (random access, linear scanning, scalability). • Twisted.Internet - library for async primitives for use in workers. • Snappy - efficient compression algorithm for IO-bounded applications. * - network operations in Scrapy are implemented asynchronously, based on the same Twisted.Internet 7
  • 9. 1. Big and small hosts problem • When crawler comes to huge number of links from some host, along with usage of simple prioritization models, it turns out queue is flooded with URLs from the same host. • That causes underuse of spider resources. • We adopted additional per- host (optionally per-IP) queue and metering algorithm: URLs from big hosts are cached in memory. 9
  • 10. 2. DDoS DNS service Amazon AWS • Breadth-first strategy assumes first visiting of previously unknown hosts, therefore generating huge amount of DNS request. • Recursive DNS server on each downloading node, with upstream set to Verizon and OpenDNS. • We used dnsmasq. 10
  • 11. 3. Tuning Scrapy thread pool’а for efficient DNS resolution • Scrapy uses a thread pool to resolve DNS name to IP. • When ip is absent in cache, request is sent to DNS server in it’s own thread, which is blocking. • Scrapy reported numerous errors related to DNS name resolution and timeouts. • We added option to Scrapy for thread pool size and timeout adjustment. 11
  • 12. 4. Overloaded HBase region servers during state check • Crawler extracts from document hundreds of links in average. • Before adding this links to queue, they needs to be checked if they weren’t already crawled (to avoid repetitive visiting). • On small volumes SSDs were just fine. After increase of table size, we had to move to HDDs, and response times dramatically grew up. • Host-local fingerprint function for keys in HBase. • Tuning HBase block cache to fit average host states into one block. 12
  • 13. 5. Intensive network traffic from workers to services • We noticed throughput between workers Kafka and HBase up to 1Gbit/s. • Switched to Thrift compact protocol for HBase communication. • Message compression in Kafka using Snappy. 13
  • 14. 6. Further query and traffic optimizations to HBase • State check required lion’s share of requests and network throughput. • Consistency was another requirement. • We created local state cache in strategy worker. • For consistency, spider log was partitioned by host, to avoid cache overlap between workers. 14
  • 15. State cache • All operations are batched: • If key is absent in cache, it’s requested from HBase, • every ~4K documents cache is flushed to HBase. • When achieving 3M (~1Гб) elements, flush and cleanup happens. • It seems Least-Recently-Used (LRU) algorithm is a good fit there. 15
  • 16. Spider priority queue (slot) • Cell has an array of:
 - fingerprint, 
 - Crc32(hostname), 
 - URL, 
 - score • Dequeueing top N. • Such design is prone to huge hosts. • Partially this problem can be solved using scoring model taking into account known document count per host. 16
  • 17. 7. Problem of big and small hosts (strikes back!) • During crawling we’ve found few very huge hosts (>20M docs) • All queue partitions were flooded with pages from few huge hosts, because of queue design and scoring model used. • We made two MapReduce jobs: • queue shuffling, • limiting all hosts to no more than 100 documents. 17
  • 18. • Single-thread Scrapy spider gives 1200 pages/min. from about 100 websites in parallel. • Spiders to workers ratio is 4:1 (without content) • 1 Gb of RAM for every SW (state cache, tunable). • Example: • 12 spiders ~ 14.4K pages/min., • 3 SW and 3 DB workers, • Total 18 cores. Hardware requirements 18
  • 19. • Apache HBase, • Apache Kafka, • Python 2.7+, • Scrapy 0.24+, • DNS Service. Software requirements CDH (100% Open source Hadoop package) 19
  • 20. Maintaining Cloudera Hadoop on Amazon EC2 • CDH is very sensitive to free space on root partition, parcels, and storage of Cloudera Manager. • We’ve moved it using symbolic links to separate EBS partition. • EBS should be at least 30Gb, base IOPS should be enough. • Initial hardware was 3 x m3.xlarge (4 CPU, 15Gb, 2x40 SSD). • After one week of crawling, we ran out of space, and started to move DataNodes to d2.xlarge (4 CPU, 30.5Gb, 3x2Tb HDD). 20
  • 21. Spanish (.es) internet crawl results • fnac.es, rakuten.es, adidas.es, equiposdefutbol2014.es, druni.es, docentesconeducacion.es - are the biggest websites • 68.7K domains found (~600K expected), • 46.5M crawled pages overall, • 1.5 months, • 22 websites with more than 50M pages 21
  • 22. where are the rest of web servers?!
  • 23. Bow-tie model A. Broder et al. / Computer Networks 33 (2000) 309-320 23
  • 24. Y. Hirate, S. Kato, and H. Yamana, Web Structure in 2005 24
  • 25. Graph Structure in the Web — Revisited, Meusel, Vigna, WWW 2014 25
  • 26. • Online operation: scheduling of new batch, updating of DB state. • Storage abstraction: write your own backend (sqlalchemy, HBase is included). • Canonical URLs resolution abstraction: each document has many URLs, which to use? • Scrapy ecosystem: good documentation, big community, ease of customization. Main features 26
  • 27. • Communication layer is Apache Kafka: topic partitioning, offsets mechanism. • Crawling strategy abstraction: crawling goal, url ordering, scoring model is coded in separate module. • Polite by design: each website is downloaded by at most one spider. • Python: workers, spiders. Distributed Frontera features 27
  • 28. References • Distributed Frontera. https://github.com/ scrapinghub/distributed-frontera • Frontera. https://github.com/scrapinghub/frontera • Documentation: • http://distributed-frontera.readthedocs.org/ • http://frontera.readthedocs.org/ 28
  • 29. Future plans • Lighter version, without HBase and Kafka. Communicating using sockets. • Revisiting strategy out-of-box. • Watchdog solution: tracking website content changes. • PageRank or HITS strategy. • Own HTML and URL parsers. • Integration into Scrapinghub services. • Testing on larger volumes. 29
  • 30. Contribute! • Distributed Frontera is a historically first attempt to implement web scale web crawler using Python. • Truly resource-intensive task: CPU, network, disks. • Made in Scrapinghub, a company where Scrapy was created. • A plans to become an Apache Software Foundation project. 30
  • 32. 32 Mandatory sales slide Crawl the web, at scale • cloud-based platform • smart proxy rotator Get data, hassle-free • off-the-shelf datasets • turn-key web scraping try.scrapinghub.com/BDS15