Ecommerce Crawling Issues: Facets, Filters & URLs
Faceted navigation is a powerful UX feature that helps users find products quickly, but for search engines, it is often a “crawl trap” that generates millions of low-value URLs. If left unmanaged, these filters can dilute your site’s authority and waste your limited crawl budget on duplicate content. In this guide, I will show you how to identify facet-driven crawl bloat and implement a technical strategy to ensure Googlebot only spends time on your most valuable pages.
How Faceted Navigation Explodes URL Space
Faceted navigation creates infinite crawl paths by allowing users to combine multiple filters, sorts, and parameters. The short answer: what is a convenient browsing experience for a human is a mathematical nightmare for a crawler.
The Mathematical Growth of Facet Combinations
Imagine an ecommerce store, MyShop Online, with 100 products. If you have 5 filter categories (Color, Size, Brand, Price, Material) and each has 5 options, the number of potential URL permutations is staggering: with each facet either unset or set to one of its 5 options, there are already 6^5 = 7,776 possible URLs, before multi-select or parameter ordering is even considered.
When you allow filters to be combined in any order (e.g., /shoes?color=blue&size=10 vs /shoes?size=10&color=blue), you create duplicate targets for the same content. Crawlers don’t “know” when to stop; they follow every unique link they discover until your crawl budget is exhausted.
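The growth is easy to quantify. Here is a rough sketch of the math for the MyShop Online example above, assuming single-select facets (the variable names are illustrative, not from any tool):

```typescript
// Counting the facet URL space for 5 categories with 5 options each.

function factorial(n: number): number {
  return n <= 1 ? 1 : n * factorial(n - 1);
}

function choose(n: number, k: number): number {
  return factorial(n) / (factorial(k) * factorial(n - k));
}

const categories = 5;
const options = 5;

// Each category is either unset or set to one of its 5 options:
const normalizedUrls = Math.pow(options + 1, categories); // 6^5 = 7,776

// If parameter order is NOT normalized, every k-facet combination
// also exists in k! different orderings:
let withOrderings = 0;
for (let k = 0; k <= categories; k++) {
  withOrderings += choose(categories, k) * Math.pow(options, k) * factorial(k);
}
// withOrderings = 458,026 distinct URLs, all pointing at the same 100 products
```

Even with ordering normalized, 7,776 URLs for a 100-product catalog is a 78:1 ratio of crawlable URLs to actual content.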
User-Useful Facets vs. Crawler Traps
You must distinguish between a facet that satisfies search intent and one that simply organizes data.
- User-Useful: A “Red Running Shoes” page has high search volume.
- Crawler Trap: A combination like “Red Running Shoes under $50, Size 11, Mesh Material, Rated 4-stars” has zero search volume but still generates a unique URL that Googlebot feels obligated to crawl.
Why Googlebot Treats Facets as Discoverable Pages
Googlebot discovers your site primarily through `<a>` tags. If your faceted navigation is built using standard anchor links, every filter click is a new discovery target.
- Internal Link Demand: Each checkbox in your UI that is wrapped in an `<a>` tag tells Google, "This is a page you should visit."
- Parameters as Unique Targets: Google treats `example.com/shop` and `example.com/shop?sort=price_asc` as two distinct URLs.
- Crawl Budget Drain: When Googlebot spends 80% of its time crawling "Sort by Price" or "Discount" filters, it has less time to discover your new product launches or updated content.
The Three Types of Facet URLs (and How to Treat Each)
To manage your URL space, you must categorize your facets into three buckets:
- Valuable Facet Pages (Index-worthy): These target specific long-tail keywords (e.g., “Men’s Leather Boots”). They should be crawlable, indexable, and included in your sitemap.
- Neutral Facet Pages (Crawl but don't index): These are useful for users but have no SEO value. Use a `noindex` tag, but keep them crawlable so link equity can flow through them to products.
- Toxic Facet Combinations (Must not be crawled): These are permutations or "Sort" parameters that provide no value. They should be blocked via `robots.txt` or handled with JavaScript to prevent discovery.
⭐ Pro Tip: Never use noindex on pages you have blocked in robots.txt. If Google can’t crawl the page, it can’t see the noindex directive, and the URL might still appear in the index if it has external links.
URL Parameter Strategy: Control at the Source
Designing a clean parameter schema is the first step to prevention.
- Consistent Parameter Ordering: Force your site to always list parameters in a specific order (e.g., alphabetical). This prevents `/shoes?color=red&size=10` and `/shoes?size=10&color=red` from existing as two separate URLs.
- Path-based Facets for SEO: For high-value facets, use a clean URL path (e.g., `/shoes/red/`) instead of a query parameter (`/shoes?color=red`).
- Avoid Session IDs: Never include session IDs or tracking parameters in the URL, as these create an infinite number of unique URLs for the exact same content.
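Parameter ordering can be enforced with a small normalization step at the routing or templating layer. This is a minimal sketch; `normalizeFacetUrl` is a hypothetical helper name, not part of any framework:

```typescript
// Canonicalize a facet URL by sorting query parameters alphabetically,
// so /shoes?size=10&color=red and /shoes?color=red&size=10 collapse
// into a single URL.
function normalizeFacetUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  const params = [...url.searchParams.entries()].sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0
  );
  url.search = new URLSearchParams(params).toString();
  return url.toString();
}
```

Run every internally generated filter link through a function like this before it is rendered, and redirect (301) any non-canonical ordering that arrives at the server.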
Internal Linking Rules for Faceted Navigation
The most effective way to stop crawl waste is to hide the links from the bot entirely while keeping them functional for the user.
- Use Buttons or POST Requests: Instead of `<a>` tags for filters that don't need to be indexed (like "Sort by Price"), use `<button>` elements or trigger the filter via a POST request.
- JavaScript Events: Implement non-essential filters with JavaScript that updates the page content without exposing a crawlable URL.
- Link Sculpting: Only use standard anchor tags for the “SEO Allowlist” combinations you want Google to discover.
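The allowlist approach can be sketched at render time. In this hypothetical example, only combinations on the SEO allowlist get a real `<a>` tag; everything else becomes a `<button>` that a crawler has no URL to follow (the allowlist entries are illustrative):

```typescript
// Link sculpting: render crawlable anchors only for approved facets.
const SEO_ALLOWLIST = new Set(["color=red", "brand=acme"]); // assumed examples

function renderFilterControl(param: string, value: string): string {
  const facet = `${param}=${value}`;
  if (SEO_ALLOWLIST.has(facet)) {
    // Crawlable: Googlebot discovers this URL via the <a> tag.
    return `<a href="/shoes?${facet}">${value}</a>`;
  }
  // Not crawlable: a button with a JS handler exposes no href to follow.
  return `<button type="button" data-facet="${facet}">${value}</button>`;
}
```

Your client-side code would attach a click handler to `data-facet` buttons that fetches and re-renders the product grid without creating a new crawlable URL.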
Robots.txt vs. Noindex vs. Canonical for Facets
Choosing the right directive is crucial for success. Here is how to decide:
- Use Robots.txt: To save crawl budget immediately. Use this to block “Sort,” “Price Range,” and “View” parameters.
- Use Noindex: When you want the page to stay out of the SERPs but still want Google to follow links on that page to find products.
- Use `rel="canonical"`: When you have multiple URLs showing similar content but want to consolidate the "ranking power" to one master page. Note that Google often treats canonicals as "suggestions" and may ignore them if the content differs too much.
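For the robots.txt route, Google supports `*` wildcards in `Disallow` rules, so toxic parameters can be blocked in a few lines. The parameter names below (`sort`, `price`, `view`) are placeholders for whatever your platform actually uses:

```
# Block low-value facet parameters from being crawled
User-agent: *
Disallow: /*?*sort=
Disallow: /*?*price=
Disallow: /*?*view=
```

Remember the Pro Tip above: don't combine these rules with `noindex` on the same URLs, since a blocked page's `noindex` can never be seen.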
🔖 See also: Google’s Guide to Large Site Crawl Budget Management
Implementing ItemList Schema for Facet Pages
When you have an “SEO-approved” facet page, you should use ItemList schema to help Google understand the entities on that page.
Step 1: Identify the products listed on the filtered page.
Step 2: Nest the Product entities within the itemListElement.
Step 3: Validate your code using the Rich Results Test.
```json
{
  "@context": "https://schema.org",
  "@type": "ItemList",
  "name": "Men's Red Running Shoes",
  "url": "https://myshoponline.com/shoes/mens/red",
  "itemListElement": [
    {
      "@type": "ListItem",
      "position": 1,
      "url": "https://myshoponline.com/products/speed-runner-2000-red"
    },
    {
      "@type": "ListItem",
      "position": 2,
      "url": "https://myshoponline.com/products/trail-blazer-pro-maroon"
    }
  ]
}
```
Measuring Success: Before & After Facet Cleanup
After implementing these changes, you must validate the results using Google Search Console (GSC) and your server log files.
- GSC Crawl Stats: Look for a sharp decrease in "Total crawl requests" for URLs containing query strings (i.e., a `?`).
- Index Coverage: You should see the "Excluded" count stabilize as toxic parameters are removed from the crawl path.
- Log File Analysis: Use a tool like Screaming Frog Log File Analyser to verify that Googlebot is now spending more time on your `/product/` and `/category/` folders and less on `/shop?sort=`.
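If you prefer to check the logs directly, the before/after comparison is a simple aggregation. This is a rough sketch assuming combined-format access logs; the function name and section labels are illustrative:

```typescript
// Count Googlebot requests per site section, splitting out
// parameterized URLs (e.g. /shop?sort=price_asc) from clean folders.
function googlebotHitsBySection(logLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of logLines) {
    if (!line.includes("Googlebot")) continue; // ignore other user agents
    const match = line.match(/"(?:GET|HEAD) ([^ ]+)/);
    if (!match) continue;
    const path = match[1];
    const section = path.includes("?")
      ? "parameterized"
      : "/" + (path.split("/")[1] ?? "") + "/"; // e.g. "/product/"
    counts.set(section, (counts.get(section) ?? 0) + 1);
  }
  return counts;
}
```

Run it against a log sample from before and after the cleanup; the share of hits in the "parameterized" bucket should fall while `/product/` and `/category/` rise. (For production use, verify Googlebot via reverse DNS rather than trusting the user-agent string.)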
The goal isn’t just to reduce crawling—it’s to ensure that every visit from Googlebot is a meaningful one that leads to better indexing and higher rankings for your core business entities.