A talk on SEO for aggregation websites such as comparison search engines, marketplaces, and classifieds platforms, covering the Panda Diet, internal linking, and more.
5. Most Behemoths Are Aggregation Websites with 1M+ Pages
Vertical Search Engines
• i.e. Comparison Shopping Engines (CSEs) and meta-search engines
• Scraping and aggregating price/fare and product information
• Partly relying on affiliate data and feeds
Classifieds
• Real estate, cars, jobs, holiday rentals, general classifieds
• Aggregating user-generated or previously published offers/ads
• Content usually expires after a certain timeframe
Marketplaces
• Aggregating supply (product/service feeds) and demand at the same time
• Suppliers often have several points of sale and syndicate data
Social Networks & Forums
• Vast amounts of user-generated content
• Insufficient control over quality and information architecture
Most of these are “intermediaries” doing “search”, and implicitly violate Google's guidelines.
6. Advantages & Challenges of Aggregators
Advantages
• Aggregation attracts demand (users) through superior availability, assortment (choice) and competition (price)
• High degree of automation
• Both market sides may create lots of content, data and value
• Extremely scalable and capital efficient
• Consequently build network effects and moats over time…
• …and become hyper-profitable and well defendable
Challenges
• Automation potentially creates billions of documents
• Quality of content/inventory is extremely diverse
• The Panda/Core algorithm sparked a structural decline of the whole sector
• Google positions its own verticals above organic SERPs
• Aggregators may potentially violate different Google guidelines:
  • Duplicate content (internal/external)
  • Thin content
  • Affiliate content
  • Indexable search
10. But It Has Gotten A Lot Better Recently…
“…there’s some really good stuff here. But there’s
also some really shady or iffy stuff here as well…
and we don’t know like how we should treat things
over all. That might be the case.” @JohnMu
23. Focus Areas of Concern for Huge Websites
SEO
Content
• Inventory
• Text
• Rich Media
• Video
• Advice
• Structured Data
• Tools & Apps
• Interactive Content
• …
Popularity
• Links
• Mentions
• Brand Search
• Comp. Brand Search
• Direct Type-Ins
• Sharing
• All available signals
Technical SEO
• Internal Linking
• URL Design
• Indexing
• Heading Tags
• Hreflang Setup
• Structured Data
• HTTPS/HTTP2
User Experience
• Bounce Rate
• Back To SERP
• Dwell Time
• Retention
• Trust
• Search Journey
• Satisfaction of Intent
PageSpeed
* Panda: the major 2011 Google update, named after Google engineer Navneet Panda
24. Today we‘ll learn:
1. Index Management
2. Crawl Budget Optimisation with Internal Linking
3. Making Users Happy!
4. Practice with Case Studies
25. Theory: Typical Page Quality (Qp) over Number of Pages (np)
[Chart: Page Quality (Qp) over Number of Pages (np), from 100,000 to 400,000 pages. Quality ranges from highest through useful and mediocre down to lowest: Homepage, Category, Category+Brand, Faceted Search, Thin Catalogue (low inventory), duplicate-content pages, “no results” pages.]
Page Quality (Qp) can be defined as content richness, engagement, and ultimately how useful the page is to the user. But also its revenue potential.
PROBLEM: Since Panda (2011) this structure has become toxic.
27. Theory: Typical Page Quality (Qp) over Number of Pages (np)
[Chart: the same Qp-over-np curve with a quality threshold (mediocre and better). Pages above the threshold stay in the INDEX (80,000); pages below it go NOINDEX (320,000). The new average quality jumps well above the old one, and rankings increase.]
Panda Diet: Let's cut some crap!
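A minimal sketch of the Panda Diet arithmetic above. The quality scores, the 0.5 threshold and the page counts are illustrative assumptions, not measured values:

```python
# Panda Diet sketch: split an inventory of pages into INDEX and NOINDEX
# buckets at a quality threshold, then compare average quality before/after.

def panda_diet(pages, threshold):
    """Return (index, noindex) lists given (url, quality) pairs."""
    index = [p for p in pages if p[1] >= threshold]
    noindex = [p for p in pages if p[1] < threshold]
    return index, noindex

def avg_quality(pages):
    return sum(q for _, q in pages) / len(pages)

# Hypothetical inventory: 400,000 pages, most of them thin or duplicated.
pages = (
    [("/category/%d" % i, 0.9) for i in range(20_000)]    # useful
    + [("/facet/%d" % i, 0.6) for i in range(60_000)]     # mediocre
    + [("/thin/%d" % i, 0.2) for i in range(320_000)]     # thin/dupe
)

index, noindex = panda_diet(pages, threshold=0.5)
print(len(index), len(noindex))   # 80000 320000
print(avg_quality(index) > avg_quality(pages))   # True: average quality rises
```

Cutting the 320,000 sub-threshold pages leaves the indexed set smaller but with a much higher average quality, which is the whole point of the diet.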
28. Identifying Low Quality Pages by Page-Type
Easy NOINDEX Targets
• “no results” pages
• Few-results pages (set an item threshold)
• Single review pages and other low-quality UGC
• Bulk product pages
• Any duplicate pages
• Faceted search without search demand
• Out-of-stock pages
• Expired offers/ads
• Parameters, etc.
If your site has more indexed pages than things on sale, you're doing it wrong!
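The page-type rules above can be sketched as a simple classifier. The `Page` fields and the item threshold are assumptions for illustration, not a fixed schema:

```python
# Rule-based sketch of the "easy NOINDEX targets": duplicates, expired
# offers, out-of-stock products, and thin/no-demand search pages.
from dataclasses import dataclass

MIN_RESULTS = 3  # assumed item threshold for a useful listing page

@dataclass
class Page:
    page_type: str          # e.g. "search", "product", "offer"
    result_count: int = 0
    is_duplicate: bool = False
    in_stock: bool = True
    expired: bool = False
    has_search_demand: bool = True

def should_noindex(p: Page) -> bool:
    if p.is_duplicate or p.expired:
        return True
    if p.page_type == "search":
        # "no results"/few-results pages and facets without demand
        return p.result_count < MIN_RESULTS or not p.has_search_demand
    if p.page_type == "product":
        return not p.in_stock
    return False

print(should_noindex(Page("search", result_count=0)))   # True
print(should_noindex(Page("product", in_stock=False)))  # True
print(should_noindex(Page("search", result_count=40)))  # False
```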
30. Identifying Low Quality Pages: Data Driven Approach
Data to support page quality decisions
• Revenue distribution on landing pages (Google Analytics)
• Engagement and commercial metrics per page type
• Conversion rate related to inventory count
• Demand data (search volume, PPC traffic, navigational traffic)
• “Indexation gap” (sitemaps: submitted vs. indexed)
• Crawling activity (server logs)
• Hint: Consider using de-indexing sitemaps to accelerate the Panda Diet
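The de-indexing sitemap hint can be sketched with the standard sitemap protocol: a temporary sitemap that lists the URLs you just set to noindex (or removed), each with a fresh lastmod, so crawlers revisit them quickly and pick up the directive. The URLs are hypothetical:

```python
# Sketch of a "de-indexing sitemap" generator using only the stdlib.
import datetime
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def deindex_sitemap(urls):
    """Build sitemap XML for URLs that should be recrawled and dropped."""
    root = ET.Element("urlset", xmlns=NS)
    today = datetime.date.today().isoformat()
    for loc in urls:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = today  # nudge a fresh recrawl
    return ET.tostring(root, encoding="unicode")

xml = deindex_sitemap([
    "https://example.com/thin/1",
    "https://example.com/no-results?q=foo",
])
print(xml)
```

Once Search Console reports the URLs as deindexed, the temporary sitemap can be removed again.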
31. Theory: Typical Page Quality (Qp) over Number of Pages (np)
[Chart: the same Qp-over-np curve as before.]
Truth is: this curve doesn't look like this…
32. Theory: Typical Page Quality (Qp) over Number of Pages (np)
[Chart: the curve redrawn with a far steeper drop in quality.]
Truth is: this curve doesn't look like this… BUT: more like THIS!
33. Theory: ACTUAL Page Quality (Qp) over Number of Pages (np)
[Chart: the curve redrawn once more; quality collapses almost immediately.]
Truth is: this curve doesn't look like this… BUT: more like THIS! ACTUALLY… like THIS!
34. Theory: ACTUAL Page Quality (Qp) over Number of Pages (np)
[Chart: the actual curve: the vast majority of pages sit at the lowest quality level.]
These pages typically…
• Never saw a visit, nor any conversions (GA organic landing pages)
• Aren't crawled any longer, as Google won't rank them anyway (server logs)
• Are not being considered for indexation (GSC Sitemaps monitor)
While 100% of your revenue is here!
42. If Noindex: Consequently „Orphanize“ Pages
[Diagram: Home links to pages One, Two and Three; the NOINDEX page gets its incoming internal links removed.]
Viable solutions for link removal
• Nofollow
• Dynamic serving (“cloaking”)
• Client-side JS
• PRG pattern
• Forms/buttons
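The PRG (POST/Redirect/GET) and forms/buttons options above can be sketched as a link renderer: crawlable targets get a normal anchor, noindexed targets get a POST form that the server answers with a 302 redirect, so bots never see a followable link. The noindex set, URLs and `/go` endpoint are hypothetical:

```python
# Sketch of "orphanizing" noindexed pages at render time: emit either a
# crawlable <a href> or a PRG-style POST form (POST -> 302 -> GET), which
# search engine bots do not follow as a link.

NOINDEX = {"/search?color=red&size=xl", "/offers/expired/123"}

def render_link(href, label):
    if href in NOINDEX:
        # PRG pattern: the server answers this POST with a 302 to `href`.
        return (
            f'<form method="post" action="/go">'
            f'<input type="hidden" name="to" value="{href}">'
            f'<button type="submit">{label}</button></form>'
        )
    return f'<a href="{href}">{label}</a>'

print(render_link("/category/shoes", "Shoes"))
print(render_link("/offers/expired/123", "Old offer"))
```

Users still reach the page with one click, which is why this is generally argued not to be cloaking: the experience is unchanged, only the crawlable link graph shrinks.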
43. Get Rid Of Pagination (Entirely)
Pagination Best Practice
• Pagination is a stupid offline concept
• More items, fewer pages, fewer problems
• Users like comprehensive pages (A/B test it)
• NOINDEX pagination if possible
• Remove links to those pages
• No pagination pages, no problems
• Make sure discovery remains intact
[Image caption: “No one, ever…”]
44. This useless shit… Gone (for Bots at least)
Social Profile Links
Locale Selector
Keep these on your Homepage or About Us page, but not on every page.
(If they are important for the user, why are they in the footer?)
47. Case Study: How to identify the least valuable pages?
1. Out-of-Stock Handling (OoS pages generate lots of HTML pages and poor UX)
   1. If OoS for good: 301 to the most similar page (parent category), or 410 if there's no alternative
   2. If potentially restocked: keep the page alive (200), offer a restock alert and/or alternatives
2. Faceted Search (Filters) & Indexable Site Search
   1. Set a minimum item threshold to define a “good” search result page that doesn't look like a SERP
   2. Build clusters where possible (typos, plurals, refined queries, entities)
   3. Apply quality thresholds (dwell time, bounce rate, conversion) to SERP-in-SERP pages (indexed internal search)
3. Pagination
   1. Show more items per page (3x more items = 1/3 of the pages)
   2. Best solution for pagination: no pagination
4. PDP (Product Detail Page) Reduction
   1. Get better at understanding shelf huggers and bestsellers using your data
   2. Advanced: predict page performance with machine learning (OEM, price, category, attributes, etc.)
   3. Merge variants into master products (sizes, patterns, colors, etc.)
5. Reviews & FAQ: use overview pages for reviews & questions; don't index single pieces of content
6. Don't build a self-fulfilling prophecy
   1. Allow triggers for re-indexation (PPC traffic, navigational demand, etc.)
60. Case Study: How to identify the least valuable pages?
1. Facebook Index Coverage: accessibility vs. page quality
2. Inactive/empty Groups, Pages, Users, Places
3. Privacy-aware users (or create an incentive to share publicly to improve landing page value)
4. Use engagement as a quality metric for post URLs (it doesn't get much better than this)
5. Marketplace (see Advanced Panda Diet)
6. Expired Events
7. …
65. Balance: Algorithmic Internal Linking for 1.000 Pages
1. New York
2. London
3. Paris
4. Rome
5. Amsterdam
6. Milan
7. Barcelona
8. Prague
9. Dublin
10. Berlin
1. Munich
2. Warsaw
3. Madrid
4. Copenhagen
5. Stockholm
6. San Francisco
7. Toronto
8. Hamburg
9. Rio de Janeiro
10. Cairo
1. Seattle
2. Marrakesh
3. Sofia
4. Wroclaw
5. Helsinki
6. Vancouver
7. Hanover
8. Marseille
9. Alicante
10. Edinburgh
First Tier: Top 10 (this class of pages gets 1,000 links each)
Second Tier: Random 10 out of Top 100 (this class of pages gets 100 links each)
Third Tier: Random 10 out of Top 1,000 (this class of pages gets 10 links each)
• Shuffle containers 2 and 3, but keep them static per page
• Include relevance scores/silos/topical proximity to improve UX
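The tiering above, including the "shuffle but keep static per page" rule, can be sketched by seeding the random selection with the page's own URL. Function names and URLs are illustrative assumptions:

```python
# Sketch of the three-tier link module: every page links the global Top 10,
# plus a random 10 out of the Top 100 and a random 10 out of the Top 1,000.
# Seeding the RNG with the page's own URL keeps the selection static per
# page (stable for crawlers) while still rotating across pages.
import random

def link_module(page_url, ranked_pages):
    """ranked_pages: list of URLs sorted by popularity, best first."""
    rng = random.Random(page_url)  # deterministic per page
    tier1 = ranked_pages[:10]                       # Top 10, always
    tier2 = rng.sample(ranked_pages[10:100], 10)    # 10 of Top 100
    tier3 = rng.sample(ranked_pages[100:1000], 10)  # 10 of Top 1,000
    return tier1 + tier2 + tier3

cities = [f"/city/{i}" for i in range(1000)]
links = link_module("/city/42", cities)
print(len(links))                                 # 30
print(links == link_module("/city/42", cities))   # True: static per page
```

Because each tier samples from a disjoint slice of the ranking, the module never links the same page twice; a relevance score or silo filter could replace the pure popularity ranking to improve topical proximity.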
66. Fix Internal Linking Using Bestseller Lists
1. Standard sorting: popularity
2. Dynamic bestseller lists for prioritisation
3. “New Arrivals” for discovery
4. Related products for completeness
5. Breadcrumb for bottom-up prioritisation
6. Prioritisation via sitemap: ask Santa about it!
69. Frequently Asked Questions
Q: How isn't this cloaking?
A: 1. It doesn't alter the user experience. 2. It only makes Google's job easier. 3. Take a look at Amazon, bro.
Q: I'm afraid I could lose all my long-tail revenue. *mimimi*
A: 1. There's usually no data confirming the long tail. 2. Rankings are usually not lost but substituted by superior pages. 3. Google actually prefers pages with good UX over the most specific result (Hummingbird, RankBrain instead of a perfect title string match).
Q: Should I remove all those pages in one drastic move? Wouldn't Google see that as a weakness?
A: It's always a good time to do the right thing!
Q: Should I really dynamically switch/flap index directives?
A: I think you should. See above.
Q: How does GoogleBot discover my content without pagination?
A: If you need pagination for discovery, you've got bigger fish to fry. Seriously…
70. What to remember…
1. We've been doing this for 10 years now (pre-Panda) and it has never backfired
2. This is most important if your website has more than 100,000 pages
3. Index bloat: millions of indexed HTML documents are not an asset but a liability. Indexing everything is inefficient by definition.
4. 80% (actually 95%) of your website usually is dead weight. And it's pulling down your best pages.
5. Analyse your potential with an organic landing page report
6. There's no black and white, but a reasonable amount of grey, which should be defined by data
7. Non-transactional content is (most likely) overrated. (Inventory = Content)