Dimitri Brunel
Search Data Strategist @botify.com
- SEO manager in Marketing Agencies & Pure Players (previously)
- Currently part of the Botify Search Data Strategist team
How does Google crawl the web?
Shedding light on common misconceptions and theories about Googlebot and SEO
Alpha Keita
Search Data Strategist @botify.com
- Onboarding & API training Manager @Botify
- Currently part of the Botify Search Data Strategist team
Load Times impact
Crawl Ratio
Do load times
impact all websites
the same way? How much do load
times impact Google’s
crawl?
SMALL WEBSITES BIG WEBSITES
Add a scientific approach to the
empirical one for detailed and
data-backed insights.
Scale-up the dataset from a single
website to a full set of websites from
different industries.
Confirm or invalidate Google’s
behavior and discover new ones.
Be more efficient in our SEO in order to
continually improve Googlebot’s
efficiency and user experience.
Improved SEO Methodology
Real Insights
More Precise Analysis
Share a Belief
Crawl Ratio
Percentage of compliant pages (indexable pages) crawled by Google
in 30 days.
Crawl Frequency
Average number of times a website’s URL was crawled by Google
in 30 days.
Compliant URL
URLs that meet the following requirements:
● Canonical tag to self or not set
● HTTP 200 Status Code
● Text/HTML Content
● Index Status (no noindex meta tag)
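The three definitions above can be sketched in a few lines of Python. This is only an illustration: the `Page` record, the `google_hits` mapping (URL to number of Googlebot hits found in 30 days of logs) and the function names are assumptions, not Botify's data model. The slide does not say which URLs the crawl frequency is averaged over; the sketch assumes compliant URLs that Google crawled at least once.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Page:
    """One URL found in the site structure (hypothetical record layout)."""
    url: str
    status: int                # HTTP status code
    content_type: str          # e.g. "text/html"
    canonical: Optional[str]   # canonical target, or None when not set
    noindex: bool              # True when a noindex meta tag is present


def is_compliant(page: Page) -> bool:
    """Apply the four requirements listed on the slide."""
    return (
        page.status == 200
        and page.content_type.startswith("text/html")
        and page.canonical in (None, page.url)
        and not page.noindex
    )


def crawl_ratio(pages: Iterable[Page], google_hits: Dict[str, int]) -> float:
    """Share of compliant pages that Google crawled at least once."""
    compliant = [p for p in pages if is_compliant(p)]
    if not compliant:
        return 0.0
    crawled = sum(1 for p in compliant if google_hits.get(p.url, 0) > 0)
    return crawled / len(compliant)


def crawl_frequency(pages: Iterable[Page], google_hits: Dict[str, int]) -> float:
    """Average hits per compliant URL that was crawled (assumed denominator)."""
    hits = [google_hits.get(p.url, 0) for p in pages if is_compliant(p)]
    crawled = [h for h in hits if h > 0]
    return sum(crawled) / len(crawled) if crawled else 0.0
```

Both functions take the same two inputs, so they can be run once per website and then compared across size or industry segments.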
CRAWL RATIO: 49%
Percentage of compliant pages in the website’s structure crawled by Google in 30 days.
ACTIVE PAGES RATIO: 23%
Percentage of pages that have generated at least one organic visit in 30 days.
CRAWL FREQUENCY: 2.3
Average number of times a website’s URL was crawled by Google in 30 days.
A website’s size is one
of the most important
factors impacting
Google’s crawl.
The larger the website, the greater the impact:
Orphan Pages, Load Time, % of Words vs. Template
Some KPIs like the number of orphan pages, the load time, or the
percentage of words vs. template, have almost no impact on small
websites but have a huge impact on big websites.
Huge impact regardless of size:
PageRank, Depth, Content Size
Other KPIs like the PageRank dilution, depth, or surprisingly content
size, have a big impact on Google’s crawl, regardless of website size.
The data showed that
bad HTTP Codes only
had a small impact on
Google’s crawl.
* Anonymized data
Data from Botify Analytics.
Data from log files from the same websites.
Data calculated with 30 days of logs.
Websites that fall in one of the following industries:
Retail Publisher Classified
270 websites.
413 million pages crawled and analyzed by Botify.
6.2 billion pages crawled by Google and analyzed by Botify.
*15% of the data crawled and analyzed each month by Botify.
Industries Analyzed Dataset by Website Size (in pages)
WEBSITE
1. Industry
2. Size
STRUCTURAL KPIS
1. PageRank
2. Load Times
3. Depth
4. Outlinks
5. Content Size
TYPE OF PAGES
1. Not Compliant Pages
2. Bad HTTP Codes
3. 304 HTTP Codes
4. Orphan Pages
WEBSITE
1. Industry ✘
2. Size ✓
STRUCTURAL KPIS
1. PageRank ✓
2. Load Times ✓
3. Depth ✓
4. Outlinks ✘
5. Content Size ✓
TYPE OF PAGES
1. Not Compliant Pages ✓
2. Bad HTTP Codes ✘
3. Orphan Pages ✓
Website related
elements
Expected Results
Similar crawl rate
Different crawl frequency
depending on industry
CLASSIFIED PUBLISHER RETAILER
CRAWL RATIO AND ACTIVE PAGES RATIO BY INDUSTRY
● Googlebot crawls the web impartially.
● Googlebot crawls impartially regardless of
industry.
● Publishers tend to have more active pages (in %).
From our past Experience
From the analysis of the Dataset
Confirmation
CRAWL FREQUENCY BY INDUSTRY From the analysis of the Dataset
New learnings!
Expected Results
Decreasing crawl ratio Adaptive crawl frequency
< 10K PAGES
> 10K PAGES
> 100K PAGES
> 1 MILLION PAGES
CRAWL RATIO AND ACTIVE PAGES RATIO BY WEBSITE SIZE
From our past Experience
From the analysis of the Dataset
Confirmation
● More pages means more difficulties for
Googlebot.
● More pages means fewer active pages in the
SERPs (in %).
● Small websites are better crawled by Google
but still not crawled entirely.
● Big websites have a harder time effectively
using Crawl Budget.
CRAWL FREQUENCY BY WEBSITE SIZE
From the analysis of the Dataset
Confirmation
Big websites tend to have more long tail pages
that will be less frequently crawled by Google.
Good news: this can be influenced with crawl
budget optimization.
Type of Pages
related
elements
#3 - Not Compliant Pages
At last, a composite indicator. A page is not compliant when it has one or more of:
● Noindex status
● HTTP status code other than 200
● Canonical tag not set to self
● Content that is neither text nor HTML
Risk
● Pages that are poorly indexable from a technical POV.
● Sends a bad crawl signal to web spiders.
Expected Results
Weakest Crawl Ratio Weak Indexation Lower Crawl Frequency
COMPLIANT PAGES CRAWLED BY BOTIFY VS. NOT COMPLIANT PAGES CRAWLED BY
BOTIFY
From our past Experience
From the analysis of the Dataset
Confirmation
● The proportion of not compliant pages (37%)
is still too high compared with the goal of total
indexability (100% of compliant pages).
● The overall average shows that SEOs still have
room for optimization.
● From our past experience we see that many websites
still face this problem, usually because of:
○ Extensive use of noindex
○ Server errors
○ Incorrect canonical annotations
413M pages crawled
CRAWLED COMPLIANT PAGES VS. CRAWLED NOT COMPLIANT PAGES
From our past Experience
From the analysis of the Dataset
Confirmation
● As most websites have a huge proportion of
not compliant pages, Google is on average
wasting 16% of its time crawling these useless
pages, when it could focus instead on more
interesting pages for searchers.
● Google is wasting time crawling not
compliant pages.
CRAWL RATIO VS. % OF NOT COMPLIANT PAGES CRAWLED BY GOOGLE
From our past Experience
From the analysis of the Dataset
Confirmation
● When the proportion of not compliant pages
crawled by Google increases, the crawl ratio
decreases.
● We expect that having more not compliant
pages crawled by Google will have a negative
impact on the compliant page’s crawl ratios.
LESS THAN 100K PAGES MORE THAN 100K PAGES
From the analysis of the Dataset
Confirmation
● Low impact on small websites but huge on medium sites.
Expected Results
Slow down / stop crawl Impact crawl efficiency
#4 - Bad HTTP Codes
404 302 500
200 304
HTTP CODES DISTRIBUTION From our past Experience
From the analysis of the Dataset
New Learnings!
● The overall situation is quite good (code 200).
● The code 304 is truly underused.
304 is not
commonly
used by
SEOs
● From our experience, we see many problems
related to bad HTTP codes:
○ Temporary redirect, redirect chains, redirect
loops
○ Client errors, server errors...
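Since the slide notes that 304 is rarely used, here is a minimal sketch of how a server could decide between 200 and 304 (Not Modified) for a conditional request carrying an `If-Modified-Since` header. The function name `conditional_status` is hypothetical, and a real setup would normally rely on the web server or framework for this rather than hand-written code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Return 304 when the client's copy is still current, else 200.

    last_modified must be a timezone-aware datetime.
    """
    if not if_modified_since:
        return 200  # unconditional request: send the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparsable header: ignore it and send the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # nothing changed: the body does not need to be resent
    return 200
```

A 304 response carries no body, so a crawler that revalidates an unchanged page spends far less time on it than on a full 200 response.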
CRAWL RATIO VS. CRAWL SHARE IN % ON BAD HTTP CODES From our past Experience
From the analysis of the Dataset
New Learnings!
● We don’t see a huge impact on crawl ratio.
● Potential reason: most bad HTTP codes in the
dataset are 3xx. These don’t consume much
crawl budget.
● We could expect bad HTTP status codes to
have a big impact on Google’s crawl ratio.
#5 - Orphan Pages
● that are outside of the website structure,
● that we did not discover,
● that Google crawled,
● that received crawl budget.
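The four criteria above amount to a set difference between the URLs Google requested and the URLs found in the site structure. The sketch below assumes two hypothetical inputs: `structure_urls` (URLs discovered by the site crawl) and `google_hits` (URL to Googlebot hit count taken from the logs). It is an illustration, not Botify's implementation.

```python
from typing import Dict, Set, Tuple


def orphan_report(structure_urls: Set[str],
                  google_hits: Dict[str, int]) -> Tuple[Set[str], float]:
    """Return the orphan URLs and the share of Google hits they received."""
    # Orphans: crawled by Google but absent from the site structure.
    orphans = {
        url for url, hits in google_hits.items()
        if hits > 0 and url not in structure_urls
    }
    total_hits = sum(google_hits.values())
    orphan_hits = sum(google_hits[url] for url in orphans)
    share = orphan_hits / total_hits if total_hits else 0.0
    return orphans, share
```

With invented sample data such as `orphan_report({"/", "/a", "/b"}, {"/": 6, "/a": 3, "/old": 3})`, the only orphan is `/old`, and it receives a quarter of the hits.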
Expected Results
Cannibalization of crawl
budget
Lowering the crawl ratio
of the site structure
PAGES
Crawled by
BOTIFY
Crawled by
GOOGLE
Crawled by
Google AND Botify
CRAWL VOLUME ON STRUCTURE PAGES VS. CRAWL VOLUME ON ORPHAN PAGES From our past Experience
From the analysis of the Dataset
Confirmation
● On avg. orphan pages steal ¼ of the crawl.
● We see a lot of orphan URLs.
● Common reasons:
○ Old implementations or technical regressions
○ No DNS cleaning
CRAWL RATIO VS. % OF ORPHAN PAGES CRAWLED BY GOOGLE From our past Experience
From the analysis of the Dataset
Confirmation
● Orphan pages tend to cannibalize crawl
budget and impact the crawl ratio of the
structural pages.
● As the percentage of orphan pages
increases, the crawl ratio should be
negatively impacted.
Few orphans
= Better crawl ratio
More orphans
= Lower crawl ratio
LESS THAN 100K PAGES MORE THAN 100K PAGES
From the analysis of the Dataset
New learnings!
From our past Experience
● This is very true on big and gigantic websites
only.
● Crawl budget cannibalization regardless of the
size of the website.
Structural related
elements
#6 - Internal PageRank
Expected Results
Diluting the Internal PageRank on Not Compliant Pages Should
Positively Impact Google’s Crawl Ratio on Compliant Pages
The popularity
spread into the
website internal
structure
A strong crawl
signal supposed
to pilot the
Googlebot(s)
CRAWL RATIO VS. % OF INTERNAL PAGERANK SPREAD ACROSS COMPLIANT PAGES From our past Experience
From the analysis of the Dataset
Confirmation
● If compliant pages get more PageRank, their
crawl ratio should improve.
Pro Tips:
● Don’t waste PR with nofollow and noindex tags.
● Crawl ratio ⇔ opportunity to improve your links.
better crawl ratio
=
rework your links!
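To make "% of internal PageRank spread across compliant pages" concrete, the sketch below runs the textbook PageRank power iteration on an internal link graph and sums the scores of the compliant pages. The damping factor of 0.85, the iteration count and the function names are assumptions. The slides do not describe how Botify or Google compute this value, so treat it as an approximation of the idea only.

```python
from typing import Dict, List, Set


def internal_pagerank(links: Dict[str, List[str]],
                      damping: float = 0.85,
                      iterations: int = 50) -> Dict[str, float]:
    """Textbook power-iteration PageRank over an internal link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def pagerank_share(rank: Dict[str, float], compliant: Set[str]) -> float:
    """Fraction of total internal PageRank held by compliant pages."""
    return sum(score for page, score in rank.items() if page in compliant)
```

Because the scores sum to 1, `pagerank_share` directly gives the fraction of internal popularity that lands on compliant pages; links pointing at not compliant pages lower it.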
The number of physical clicks from the home page
#7 - Depth
Expected Results
Slow Down Crawl Potential Crawl Budget Waste
# Folders Depth # Clicks from the Home Page
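Depth measured in clicks from the home page is the shortest-path distance in the internal link graph, which a breadth-first search computes directly. The sketch below assumes a hypothetical `links` mapping from each URL to the URLs it links to.

```python
from collections import deque
from typing import Dict, List


def click_depths(links: Dict[str, List[str]], home: str) -> Dict[str, int]:
    """Minimum number of clicks needed to reach each page from the home page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def average_depth(depths: Dict[str, int]) -> float:
    """Average depth over all pages reachable from the home page."""
    return sum(depths.values()) / len(depths)
```

Pages that never appear in the result cannot be reached by following links from the home page.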
CRAWL RATIO VS. AVG. DEPTH IN ANY WEBSITE STRUCTURE
From our past Experience
From the analysis of the Dataset
Confirmation
● We know depth is an SEO / UX problem:
○ Catalog size
○ Faceted navigation
○ Structure pruning to cut low value content
Avg. Depth
● Websites with a higher average depth
should be less crawled by Google.
#9 - Load Time
Expected Results
Idle Crawl
Huge Impact on Crawl
Ratio
From a web crawler’s point of view, we consider:
- The time to first byte (webserver responsiveness) +
- The time to download the page HTML source (the DOM).
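The load time defined above is the sum of two intervals, which can be approximated with the Python standard library. The function names are hypothetical, and `measure` is only a rough approximation: `urlopen` returns once the response headers have arrived, so the first body byte is used as a stand-in for the time to first byte.

```python
import time
import urllib.request


def load_time_ms(sent: float, first_byte: float, last_byte: float) -> float:
    """Time to first byte plus HTML download time, in milliseconds."""
    ttfb = first_byte - sent
    download = last_byte - first_byte
    return (ttfb + download) * 1000.0


def measure(url: str, timeout: float = 10.0) -> float:
    """Time one plain GET of the HTML source (no rendering, no sub-resources)."""
    sent = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read(1)                 # wait for the first body byte
        first_byte = time.perf_counter()
        response.read()                  # download the rest of the HTML
        last_byte = time.perf_counter()
    return load_time_ms(sent, first_byte, last_byte)
```

Images, scripts and stylesheets are deliberately left out, which matches the crawler-side view described on the slide rather than a browser-side page speed metric.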
CRAWL RATIO VS. LOAD TIMES IN MILLISECONDS From our past Experience
From the analysis of the Dataset
● When we look at all website sizes, load
times don’t seem to have a big impact on
Google’s crawl.
● With higher average load time, crawl ratio
should decrease.
Disturbing
fact here
Your
target
LESS THAN 10K PAGES
From the analysis of the Dataset
New learnings!
● Small websites ⇔ Low impact of load times
● Big websites ⇔ Huge impact of load times
● With higher average load time, crawl ratio
should decrease.
Limited
impact
MORE THAN 10K PAGES
From our past Experience
Dramatic
impact
#10 - Number of Internal Outlinks
Expected Results
Quantity is not Quality
Impact on crawl ratio
when too many
Either
Follow Nofollow
To a
compliant page
To a not
compliant page
CRAWL RATIO VS. NO. OF OUTLINKS PER PAGE CRAWL RATIO VS. NO. OF OUTLINKS TO NOT COMPLIANT PAGES
From the analysis of the Dataset
No Confirmation
From our past Experience
● Google’s crawl doesn’t seem to be impacted
by the number of outlinks.
● Fewer outlinks ⇔ Better crawl ratio
● Bad outlinks ⇔ Slightly lower crawl
ratio
#11 - Percentage of Content
Expected Results
Low percentage of “real” content
often means heavier pages
Heavier pages are more
difficult to crawl for
Google
Percentage
of
Content
REAL Content
TEMPLATE Content
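The split between real and template content can be illustrated with a deliberately naive heuristic: treat any text line that appears on every sampled page as template, and everything else as real content. This heuristic, the `pages` input (URL to list of visible text lines) and the function name are all assumptions; the slides do not describe how Botify separates template from content.

```python
from typing import Dict, List


def content_percentages(pages: Dict[str, List[str]]) -> Dict[str, float]:
    """Percentage of words on each page that are not part of the template."""
    if not pages:
        return {}
    # Assumption: a line shared by every sampled page belongs to the template.
    template = set.intersection(*(set(lines) for lines in pages.values()))
    result = {}
    for url, lines in pages.items():
        total = sum(len(line.split()) for line in lines)
        real = sum(len(line.split()) for line in lines if line not in template)
        result[url] = 100.0 * real / total if total else 0.0
    return result
```

The heuristic needs at least two pages that share a layout to be meaningful; with a single page nothing is left over as real content.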
LESS THAN 10K PAGES MORE THAN 1M PAGES
From the analysis of the Dataset
New learnings!
From the analysis of the Dataset
Confirmation
● Small websites => Limited impact of the % of
content vs. template
● Big websites => Huge impact of the % of
content vs. template
Limited
impact
Awesome
impact
#12 - Content Size (in words not ignored)
The number of words
on a page, excluding
the template.
Expected Results
The more content on
average, the more crawled
Yet a limited impact on
Google crawl
CRAWL RATIO VS. CONTENT SIZE (IN WORDS)
From the analysis of the Dataset
Confirmation
From our past Experience
● Content size impacts Google’s Crawl.
● Websites with more content should be more
crawled by Google but we do not expect a
very high impact on Google’s crawl.
● Content size is more impactful on Google’s
crawl than we expected.
New learnings!
From the analysis of the Dataset
New learnings!
● Content Size positively impacts Google’s crawl for every size of website.
Medium
impact
LESS THAN 100K PAGES MORE THAN 1M PAGES
Good
impact
Awesome
impact
BETWEEN 100K AND 1M PAGES
#1 - Website size dramatically impacts Google’s crawl ratio.
#2 - Even small websites are not crawled at 100% by Google. Crawl Budget matters.
#3 - Content size matters: Build high quality unique content
#4 - Orphans volume matters: Like water, don’t waste the crawl budget!
#5 - Structure depth matters: Don’t be afraid, prune the useless branches!
OPPORTUNITIES:
1. Analyze a larger set of websites (iterate on the dataset)
2. Extend the duration of the study (6 months, 12 months, 24 months)
3. Increase the list of KPIs to test (nofollow, noindex, etc.)
4. Cross even more SEO KPIs
5. Extend the data to Keywords (impressions, positions, clicks, etc.)
6. Consider seasonality in the analysis (trending topics, breakout topics, etc.)
Googlebot
Smartphone’s current
percentage of
Google’s crawl.
15%
GOOGLEBOT(S) DISTRIBUTION BY INDUSTRY
GOOGLEBOT(S) DISTRIBUTION
DESKTOP
MOBILE
From the analysis of the Dataset
From our past Experience
● Rolling out the Mobile-First Index takes a lot of time.
● Googlebot desktop is still very present.
Book a Demo to learn what Botify can do
for you:

How does Google crawl the web? - Botify at SMX Paris 2018

  • 1.
  • 2. Dimitri Brunel Search Data Strategist @botify.com - SEO manager in Marketing Agencies & Pure Players (previously) - Currently part of the Botify Search Data Strategist team How does Google crawl the web ? Shedding light on common misconceptions and theories about Googlebot and SEO Alpha Keita Search Data Strategist @botify.com - Onboarding & API training Manager @Botify - Currently part of the Botify Search Data Strategist team
  • 3.
  • 4.
  • 6. Do load times impact all websites the same way? How much do load times impact Google’s crawl?
  • 8. Add a scientific approach to the empiric one for detailed and data-backed insights. Scale-up the dataset from a single website to a full set of websites from different industries. Confirm or invalidate Google’s behavior and discover new ones. Be more efficient in our SEO in order to continually improve Googlebot’s efficiency and user experience. Improved SEO Methodology Real Insights More Precise Analysis Share a Belief
  • 9. Percentage of compliant pages (indexable pages) crawled by Google in 30 days. Average number of times a website’s URL was crawled by Google in 30 days. URLs that meet the following requirements: Crawl Ratio Crawl Frequency Compliant URL Canonical tag to self or not set HTTP 200 Status Code Text/HTML Content Index Status (no noindex meta tag)
  • 10.
  • 11. CRAWL RATIO 49% ACTIVE PAGES RATIO 23% CRAWL FREQUENCY 2.3 Percentage of compliant pages in the website’s structure crawled by Google in 30 days. Percentage of pages that have generated at least one organic visit in 30 days. Average number of times a website’s URL was crawled by Google in 30 days.
  • 12. A website’s size is one of the most important factors impacting Google’s crawl.
  • 13. Some KPIs like the number of orphan pages, the load time, or the percentage of words vs. template, have almost no impact on small websites but have a huge impact on big websites. Orphan Pages Load Time % of Words vs. Template The larger the website, the greater the impact: PageRank Depth Content Size Huge impact regardless of size: Other KPIs like the PageRank dilution, depth, or surprisingly content size, have a big impact on Google’s crawl, regardless of website size.
  • 14. The data showed that bad HTTP Codes only had a small impact on Google’s crawl.
  • 16. Data from Botify Analytics. Data from log files from the same websites. Data calculated with 30 days of logs. Websites that fall in one of the following industries: Retail, Publisher, Classified.
  • 17. 270 Websites. 413 Million pages crawled and analyzed by Botify. 6.2 Billion pages crawled by Google and analyzed by Botify.* *15% of the data crawled and analyzed each month by Botify.
  • 18. Industries Analyzed Dataset by Website Size (in pages)
  • 19. WEBSITE 1. Industry 2. Size STRUCTURAL KPIS 1. PageRank 2. Load Times 3. Depth 4. Outlinks 5. Content Size TYPE OF PAGES 1. Not Compliant Pages 2. Bad HTTP Codes 3. 304 HTTP Codes 4. Orphan Pages
  • 21. WEBSITE 1. Industry ✘ 2. Size ✓ STRUCTURAL KPIS 1. PageRank ✓ 2. Load Times ✓ 3. Depth ✓ 4. Outlinks ✘ 5. Content Size ✓ TYPE OF PAGES 1. Not Compliant Pages ✓ 2. Bad HTTP Codes ✘ 3. Orphan Pages ✓
  • 23. Expected Results: Similar crawl rate. Different crawl frequency depending on industry. CLASSIFIED / PUBLISHER / RETAILER
  • 24. CRAWL RATIO AND ACTIVE PAGES RATIO BY INDUSTRY ● Googlebot crawls the web impartially. ● Googlebot crawls impartially regardless of industry. ● Publishers tend to have more active pages (in %). From our past Experience From the analysis of the Dataset Confirmation
  • 25. CRAWL FREQUENCY BY INDUSTRY From the analysis of the Dataset New learnings!
  • 26. Expected Results: Decreasing crawl ratio. Adaptive crawl frequency. < 10K PAGES / > 10K PAGES / > 100K PAGES / > 1 MILLION PAGES
  • 27. CRAWL RATIO AND ACTIVE PAGES RATIO BY WEBSITE SIZE From our past Experience From the analysis of the Dataset Confirmation ● More pages means more difficulties for Googlebot. ● More pages means fewer active pages in the SERPs (in %). ● Small websites are better crawled by Google but still not crawled entirely. ● Big websites have a harder time effectively using Crawl Budget.
  • 28. CRAWL FREQUENCY BY WEBSITE SIZE From the analysis of the Dataset Confirmation Big websites tend to have more long tail pages that will be less frequently crawled by Google. Good news: this can be influenced with crawl budget optimization.
  • 30. #3 - Not Compliant Pages. A composite indicator, covering any of: noindex status; HTTP status codes other than 200; canonical tag not set to self; content that is neither text nor HTML. Risk: ● Badly indexable pages from a technical POV. ● Shows bad crawl signal for web spiders. Expected Results: Weakest Crawl Ratio; Weak Indexation; Lower Crawl Frequency.
  • 31. COMPLIANT PAGES CRAWLED BY BOTIFY VS. NOT COMPLIANT PAGES CRAWLED BY BOTIFY (413M pages crawled) From our past Experience From the analysis of the Dataset Confirmation ● The proportion of not compliant pages (37%) is still too high compared with the goal of total indexability (100% of compliant pages). ● The overall average shows that SEOs still have room for optimization. ● From our past experience we see that many websites still face this problem, usually because of: ○ Extensive use of noindex ○ Server errors ○ Incorrect canonical annotations
  • 32. CRAWLED COMPLIANT PAGES VS. CRAWLED NOT COMPLIANT PAGES From our past Experience From the analysis of the Dataset Confirmation ● As most websites have a huge proportion of not compliant pages, Google is on average wasting 16% of its time crawling these useless pages, when it could focus instead on more interesting pages for searchers. ● Google is wasting time crawling not compliant pages.
  • 33. CRAWL RATIO VS. % OF NOT COMPLIANT PAGES CRAWLED BY GOOGLE From our past Experience From the analysis of the Dataset Confirmation ● When the proportion of not compliant pages crawled by Google increases, the crawl ratio decreases. ● We expect that having more not compliant pages crawled by Google will have a negative impact on the crawl ratio of compliant pages.
  • 34. LESS THAN 100K PAGES MORE THAN 100K PAGES From the analysis of the Dataset Confirmation ● Low impact on small websites but huge on medium sites.
  • 35. #4 - Bad HTTP Codes. Expected Results: Slow down / stop crawl. Impact crawl efficiency. 404 / 302 / 500 / 200 / 304
  • 36. HTTP CODES DISTRIBUTION From our past Experience From the analysis of the Dataset New Learnings! ● The overall situation is quite good (code 200). ● The code 304 is truly underused. 304 is not commonly used by SEOs. ● From our experience, we see many problems related to bad HTTP codes: ○ Temporary redirects, redirect chains, redirect loops ○ Client errors, server errors...
  • 37. CRAWL RATIO VS. CRAWL SHARE IN % ON BAD HTTP CODES From our past Experience From the analysis of the Dataset New Learnings! ● We don’t see a huge impact on crawl ratio. ● Potential reason: most bad HTTP codes in the dataset are 3xx. These don’t consume much crawl budget. ● We could expect bad HTTP status codes to have a big impact on Google’s crawl ratio.
  • 38. #5 - Orphan Pages ● that are outside of the website structure, ● that we did not discover, ● that Google crawled, ● that received crawl budget. Expected Results Cannibalization of crawl budget Lowering the crawl ratio of the site structure PAGES Crawled by BOTIFY Crawled by GOOGLE Crawled by Google AND Botify
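The orphan page definition above amounts to a set difference between the URLs Google crawled and the URLs found in the site structure. Below is a minimal sketch of that logic, plus the share of crawl volume those URLs absorb. The function names, `structure_urls`, `google_hits`, and the sample values are all hypothetical and are not Botify's implementation.

```python
from typing import Dict, Set


def orphan_urls(structure_urls: Set[str], google_hits: Dict[str, int]) -> Set[str]:
    """URLs that Google crawled but that the site crawl did not discover."""
    return {
        url
        for url, hits in google_hits.items()
        if hits > 0 and url not in structure_urls
    }


def orphan_crawl_share(structure_urls: Set[str], google_hits: Dict[str, int]) -> float:
    """Percentage of Google's crawl volume spent on orphan URLs."""
    total = sum(google_hits.values())
    if total == 0:
        return 0.0
    orphans = orphan_urls(structure_urls, google_hits)
    orphan_hits = sum(google_hits[url] for url in orphans)
    return 100.0 * orphan_hits / total


# Hypothetical sample: URLs found by the site crawl, and Googlebot hits from logs.
structure_urls = {"/", "/a", "/b"}
google_hits = {"/": 6, "/a": 3, "/old-promo": 3}

print(orphan_urls(structure_urls, google_hits))
print(orphan_crawl_share(structure_urls, google_hits))
```

Counting hits rather than distinct URLs is a deliberate choice here, because crawl budget is spent per request and not per page.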
  • 39. CRAWL VOLUME ON STRUCTURE PAGES VS. CRAWL VOLUME ON ORPHAN PAGES From our past Experience From the analysis of the Dataset Confirmation ● On avg. orphan pages steal ¼ of the crawl. ● We see a lot of orphan URLs. ● Common reasons: ○ Old implementations or technical regressions ○ No DNS cleaning
  • 40. CRAWL RATIO VS. % OF ORPHAN PAGES CRAWLED BY GOOGLE From our past Experience From the analysis of the Dataset Confirmation ● Orphan pages tend to cannibalize crawl budget and impact the crawl ratio of the structural pages. ● As the percentage of orphan pages increases, the crawl ratio should be negatively impacted. Few orphans = Better crawl ratio More orphans = Lower crawl ratio
  • 41. LESS THAN 100K PAGES MORE THAN 100K PAGES From the analysis of the Dataset New learnings! From our past Experience ● This is very true on big and gigantic websites only. ● Crawl budget cannibalization whatever the size of the website.
  • 43. #6 - Internal PageRank: The popularity spread across the website’s internal structure. A strong crawl signal supposed to pilot the Googlebot(s). Expected Results: Diluting the Internal PageRank on Not Compliant Pages Should Positively Impact Google’s Crawl Ratio on Compliant Pages.
  • 44. CRAWL RATIO VS. % OF INTERNAL PAGERANK SPREAD ACROSS COMPLIANT PAGES From our past Experience From the analysis of the Dataset Confirmation ● If compliant pages get more PageRank, their crawl ratio should improve. Pro Tips: ● Don’t waste PR with nofollow and noindex tags. ● Crawl ratio ⇔ opportunity to improve your links. better crawl ratio = rework your links!
  • 45. #7 - Depth: The number of physical clicks from the home page. Expected Results: Slow Down Crawl; Potential Crawl Budget Waste. # Folders Depth / # Clicks from the Home Page
  • 46. CRAWL RATIO VS. AVG. DEPTH IN ANY WEBSITE STRUCTURE From our past Experience From the analysis of the Dataset Confirmation ● We know depth is an SEO / UX problem: ○ Catalog size ○ Faceted navigation ○ Structure pruning to cut low value content ● Websites with a higher average depth should be less crawled by Google.
  • 47. #9 - Load Time. From a web crawler’s point of view, we consider: the time to first byte (webserver responsiveness) + the time to download the page HTML source (the DOM). Expected Results: Idle Crawl; Huge Impact on Crawl Ratio.
  • 48. CRAWL RATIO VS. LOAD TIMES IN MILLISECONDS From our past Experience From the analysis of the Dataset ● When we look at all website sizes, load times don’t seem to have a big impact on Google’s crawl. ● With higher average load time, crawl ratio should decrease. Disturbing fact here / Your target
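As a small worked example of the load time definition on this slide (time to first byte plus HTML download time), the sketch below sums the two components per page and averages them across a site. The function names and the sample timings are hypothetical; real values would come from crawl or log data.

```python
from statistics import mean
from typing import Iterable, Tuple


def load_time_ms(ttfb_ms: float, html_download_ms: float) -> float:
    """Load time as a crawler sees it: time to first byte plus HTML download."""
    return ttfb_ms + html_download_ms


def average_load_time_ms(samples: Iterable[Tuple[float, float]]) -> float:
    """Average load time over (ttfb_ms, html_download_ms) pairs.

    Raises statistics.StatisticsError when no samples are given.
    """
    return mean(load_time_ms(ttfb, download) for ttfb, download in samples)


# Hypothetical timings for three pages, in milliseconds.
samples = [(120.0, 80.0), (300.0, 200.0), (180.0, 20.0)]

print(average_load_time_ms(samples))
```

Rendering time, images, and scripts are left out on purpose, since the slide limits the measure to the server response and the HTML source.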
  • 49. LESS THAN 10K PAGES From the analysis of the Dataset New learnings! ● Small websites ⇔ Low impact of load times ● Big websites ⇔ Huge impact of load times ● With higher average load time, crawl ratio should decrease. Limited impact MORE THAN 10K PAGES From our past Experience Dramatic impact
  • 50. #10 - Number of Internal Outlinks. Either: Follow / Nofollow; To a compliant page / To a not compliant page. Expected Results: Quantity is not Quality; Impact on crawl ratio when too many.
  • 51. CRAWL RATIO VS. NO. OF OUTLINKS PER PAGE / CRAWL RATIO VS. NO. OF OUTLINKS TO NOT COMPLIANT PAGES From the analysis of the Dataset No Confirmation From our past Experience ● Google’s crawl doesn’t seem to be impacted by the number of outlinks. ● Fewer outlinks ⇔ Better crawl ratio ● Bad outlinks ⇔ Slightly decrease the crawl ratio
  • 52. #11 - Percentage of Content Expected Results Low percentage of “real” content often means heavier pages Heavier pages are more difficult to crawl for Google Percentage of Content REAL Content TEMPLATE Content
  • 53. LESS THAN 10K PAGES MORE THAN 1M PAGES From the analysis of the Dataset New learnings! From the analysis of the Dataset Confirmation ● Small websites => Limited impact of the % of content vs. template ● Big websites => Huge impact of the % of content vs. template Limited impact Awesome impact
  • 54. #12 - Content Size (in words not ignored) The number of words on a page, excluding the template. Expected Results The more content on average, the more crawled Yet a limited impact on Google crawl
  • 55. CRAWL RATIO VS. CONTENT SIZE (IN WORDS) From the analysis of the Dataset Confirmation From our past Experience ● Content size impacts Google’s Crawl. ● Websites with more content should be more crawled by Google but we do not expect a very high impact on Google’s crawl. ● Content size is more impactful on Google’s crawl than we expected. New learnings!
  • 56. From the analysis of the Dataset New learnings! ● Content Size positively impacts Google’s crawl for every size of website. Medium impact LESS THAN 100K PAGES MORE THAN 1M PAGES Good impact Awesome impact BETWEEN 100K AND 1M PAGES
  • 58. Website size dramatically impacts Google’s crawl ratio. Even small websites are not crawled at 100% by Google. Crawl Budget matters. #1 #2
  • 59. Content size matters: Build high quality unique content Orphans volume matters: Like water, don’t waste the crawl budget! Structure depth matters: Don’t be afraid, prune the useless branches! #3 #4 #5
  • 61. OPPORTUNITIES : 1. Analyze a larger set of websites (iterate on the dataset) 2. Extend the duration of the study (6 months, 12 months, 24 months) 3. Increase the list of KPIs to test (nofollow, noindex, etc.) 4. Cross even more SEO KPIs 5. Extend the data to Keywords (impressions, positions, clicks, etc.) 6. Consider seasonality in the analysis (trending topics, breakout topics, etc.)
  • 64. GOOGLEBOT(S) DISTRIBUTION BY INDUSTRY / GOOGLEBOT(S) DISTRIBUTION DESKTOP MOBILE From the analysis of the Dataset From our past Experience ● Rolling out the Mobile-First Index takes a lot of time. ● Googlebot desktop is still very present.