Using Screaming Frog to Analyze Crawlability

Setting Up Screaming Frog SEO Spider for Accurate Crawl Simulation

Modern SEO is no longer just about checking tags; it’s about understanding how search engines discover, parse, and render your site. Screaming Frog is the industry standard for this task, but a default crawl often misses the nuances of how Google actually sees your pages.

In this guide, I will show you how to configure the SEO Spider to move beyond basic reporting and into high-level technical diagnostics.

Choosing the Correct User-Agent (Googlebot, Googlebot Smartphone, Custom)

What: The User-Agent (UA) is the “ID card” the crawler presents to your server. Why: Many servers and CDNs (like Cloudflare) treat different UAs with different priority levels or even serve different content (Dynamic Rendering). If you crawl as “Screaming Frog SEO Spider,” you aren’t seeing what Google sees.

How:

  1. Navigate to Configuration > User-Agent.
  2. Click the Preset User-Agents dropdown.
  3. Select Googlebot Smartphone.

Pro Tip: Since Google has moved to mobile-first indexing for almost all sites, you should default to the Smartphone UA. This ensures you are auditing the version of the site that determines your rankings.
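You can verify that switching the UA actually matters for your site before you crawl. The sketch below fetches the same URL twice with Python's stdlib, once as a generic client and once with a Googlebot Smartphone-style UA string, so you can diff the responses. The UA string is an example of the current format; check Google's crawler documentation for the up-to-date version, and the URL is a placeholder.

```python
# Sketch: compare what a server returns to a default client vs. a
# Googlebot Smartphone-style User-Agent, to spot dynamic rendering.
from urllib.request import Request, urlopen

# Example Googlebot Smartphone UA string (format per Google's docs;
# the embedded Chrome version changes over time).
GOOGLEBOT_SMARTPHONE = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

def fetch_html(url: str, user_agent: str) -> str:
    """Fetch a page while presenting the given User-Agent string."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Placeholder URL -- swap in a page from your own site:
# default = fetch_html("https://example.com/", "python-urllib")
# googlebot = fetch_html("https://example.com/", GOOGLEBOT_SMARTPHONE)
# if default != googlebot:
#     print("Server varies content by User-Agent")
```

If the two responses differ meaningfully, the site is doing UA-based dynamic rendering, and crawling with the wrong identity will give you a misleading audit.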

Enabling JavaScript Rendering and When to Use It

The short answer: If your site uses a framework like React, Angular, or Vue, or if content is injected via client-side JS, a standard “Text Only” crawl will return empty or misleading results.

What: JavaScript Rendering allows Screaming Frog to use an integrated Chromium browser to execute scripts and see the final DOM (Document Object Model).

How:

  1. Go to Configuration > Spider > Rendering.
  2. Change the dropdown from “Text Only” to JavaScript.
  3. Adjust the Rendering Timeout if your site is slow (the default is usually 5 seconds).

When to use it: Do not enable JS rendering for every crawl. It is resource-intensive and will significantly slow down your crawl speed. Enable it only when you need to audit content, links, or metadata that are not present in the initial HTML source code.
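A quick way to make that call is to check whether a known piece of page content exists in the raw, unrendered HTML. This is a minimal sketch of that check; the sample markup is invented for illustration:

```python
# Sketch: decide whether a page needs JavaScript rendering by checking
# if a phrase you know should be on the page appears in the raw
# (unrendered) HTML source.
def needs_js_rendering(raw_html: str, known_phrase: str) -> bool:
    """True if the phrase is missing from the initial HTML source,
    suggesting it is injected client-side and needs rendering."""
    return known_phrase not in raw_html

# A client-side React/Vue app often ships only an empty mount point:
spa_shell = '<html><body><div id="root"></div></body></html>'
server_rendered = "<html><body><h1>Blue Widget - $19.99</h1></body></html>"

print(needs_js_rendering(spa_shell, "Blue Widget"))        # True
print(needs_js_rendering(server_rendered, "Blue Widget"))  # False
```

If the phrase is already in the source, a fast "Text Only" crawl will see everything you need; if not, switch the rendering mode to JavaScript.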

Respecting vs. Ignoring robots.txt for Diagnostic Crawls

Google strictly follows robots.txt instructions. However, as a technical auditor, you sometimes need to see what’s behind the curtain.

  • Respect robots.txt: Use this for a “Google-eye view” of the site. It shows you exactly what Googlebot is allowed to crawl (remember: robots.txt controls crawling, not indexing).
  • Ignore robots.txt: Use this when you are hunting for “hidden” technical debt, such as old staging folders or dev environments that should have been deleted but are still linked internally.

🔖 Read more: Google’s Official Robots.txt Documentation
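The "respect" mode can be reproduced outside the Spider with Python's built-in robots.txt parser, which is handy for spot-checking individual URLs. The rules and URLs below are invented examples; RobotFileParser can also load a live file via set_url() and read().

```python
# Sketch: check which URLs Googlebot may fetch under a robots.txt file,
# using Python's stdlib parser. Rules are inlined here for illustration.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /staging/
Disallow: /dev/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://myshop.online/products/widget"))   # True
print(rp.can_fetch("Googlebot", "https://myshop.online/staging/old-home"))  # False
```

Anything that returns False here is exactly what a "Respect robots.txt" crawl will skip, and what an "Ignore" crawl will surface.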

Controlling Crawl Scope and Preventing Noise

A common mistake is letting the spider run wild on a 500,000-page site when you only need to audit a specific subfolder. This wastes hours of crawl time (and your machine’s memory) and buries the issues you actually care about in noisy data.

Using Include/Exclude Regex to Isolate Sections

If you only want to audit your ecommerce store’s product pages and ignore the blog, you must use Regex (Regular Expressions).

How to Exclude:

  1. Go to Configuration > Exclude.
  2. Enter the pattern: https://myshop.online/blog/.*
  3. This tells the spider to ignore any URL that starts with the blog path.

How to Include: If you only want to crawl the /products/ subfolder:

  1. Go to Configuration > Include.
  2. Enter: https://myshop.online/products/.*

Pro Tip: Always use the “Test” tab within the Include/Exclude window. Paste a few URLs to ensure your Regex is functioning as expected before starting a massive crawl.
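You can run the same sanity check locally with Python's re module. Screaming Frog matches Include/Exclude patterns against the full URL, which fullmatch approximates here; the shop URLs are the article's example domain:

```python
# Sketch: test Include/Exclude regex patterns against sample URLs
# before starting a large crawl, mirroring the built-in "Test" tab.
import re

exclude = re.compile(r"https://myshop\.online/blog/.*")
include = re.compile(r"https://myshop\.online/products/.*")

urls = [
    "https://myshop.online/products/blue-widget",
    "https://myshop.online/blog/how-to-choose-a-widget",
    "https://myshop.online/about",
]

for url in urls:
    verdict = "excluded" if exclude.fullmatch(url) else (
        "in scope" if include.fullmatch(url) else "out of scope")
    print(f"{url} -> {verdict}")
```

Note the escaped dot (`\.`): an unescaped `.` matches any character, which is a common source of regex patterns that are accidentally too broad.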

Avoiding Infinite Spaces (Facets, Calendars, Session IDs)

Even a well-scoped crawl can get trapped in “infinite spaces”: dynamically generated URL patterns that never end. Common culprits include:

  • Calendars: /events/?date=2024-01-01 (the bot can crawl indefinitely into the future).
  • Facets: Multiple combinations of filters (Color + Size + Price + Material).

The fix: Use Configuration > Spider > Limits to set a maximum “Crawl Depth.” For most ecommerce sites, a depth of 5–10 is sufficient to reach all critical products without getting lost in facet-hell.
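To see why facets explode so quickly, you can count the combinations. Assuming four independent filters (the counts below are invented for illustration), each facet contributes its option count plus one state for "filter not applied":

```python
# Sketch: count the filtered-URL combinations created by independent
# facets. Each facet contributes (options + 1) states; subtracting 1
# removes the single unfiltered page.
facets = {"color": 12, "size": 6, "price": 5, "material": 8}

combinations = 1
for options in facets.values():
    combinations *= options + 1
combinations -= 1  # exclude the no-filter state

print(combinations)  # 4913 crawlable filter URLs from just 4 facets
```

Four modest filters already produce thousands of crawlable URLs for a single category page, which is exactly the kind of space a depth limit keeps you out of.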

Custom Extraction and Advanced Data Collection

Sometimes the standard tabs don’t give you the data you need, such as checking if a specific “Add to Cart” button is present or if Product schema is missing a brand property.

Using XPath to Extract Critical Elements

What: XPath is a query language used to navigate through the HTML of a page. Why: It allows you to pull specific data points into a custom column in your crawl report.

How:

  1. Navigate to Configuration > Custom > Extraction.
  2. Select XPath from the dropdown.
  3. Paste your expression.

Example: Extracting the “In Stock” status from a product page:

//span[@class="inventory-status"]

Common Extraction Use-Cases:

  • Schema Validation: Extracting the @type from JSON-LD to ensure every product has Product schema.
  • Breadcrumb Path: //nav[@aria-label="Breadcrumb"]//li[last()] to see the final level of your taxonomy.
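You can dry-run expressions like these locally before pasting them into the Spider. Python's xml.etree supports a useful subset of XPath (attribute predicates, last()); note that it requires well-formed markup, unlike the rendered DOM Screaming Frog queries, and the sample page below is invented:

```python
# Sketch: test XPath-style expressions against a sample page with
# Python's stdlib ElementTree before using them in Custom Extraction.
import xml.etree.ElementTree as ET

page = """
<html><body>
  <nav aria-label="Breadcrumb">
    <li>Home</li><li>Widgets</li><li>Blue Widget</li>
  </nav>
  <span class='inventory-status'>In Stock</span>
</body></html>
"""

root = ET.fromstring(page)

# The "In Stock" status example from above:
status = root.find(".//span[@class='inventory-status']")
print(status.text)  # In Stock

# The final breadcrumb level (ElementTree uses / rather than // here):
crumb = root.find(".//nav[@aria-label='Breadcrumb']/li[last()]")
print(crumb.text)  # Blue Widget
```

If an expression returns nothing here but the element clearly exists, check the exact attribute values, since XPath predicates are exact-match and case-sensitive by default.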

Practical Workflow: From Crawl to Actionable Fixes

To turn a crawl into a professional audit, follow this 4-step diagnostic process:

  1. Step 1: The “Bulk” Audit. Check the Response Codes tab. Filter for 4xx and 5xx errors. These are your “P0” fixes—links that are completely broken.
  2. Step 2: The Canonical Audit. Look for Canonical > Non-Indexable. If a page points its canonical tag at a 404 or a redirect, Google will likely ignore the tag, and that page’s indexing signals are left in limbo.
  3. Step 3: The Link Equity Audit. View the Inlinks tab for your most important pages. If a high-margin product has only one internal link, it is under-linked and effectively buried (a page with zero inlinks is fully orphaned).
  4. Step 4: Validation. Use the Structured Data tab to ensure Google can parse your JSON-LD.

Crucial: Never deliver a raw Screaming Frog export to a client. Use the “Bulk Export” feature to isolate specific issues (e.g., “All 404s with their Source Inlinks”) so the developer knows exactly which pages to edit.
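That "404s with Source Inlinks" deliverable can be assembled from the export with a few lines of Python. The column names ("Source", "Destination", "Status Code") and URLs below are assumptions for illustration; match them to the headers in your actual CSV:

```python
# Sketch: group each broken destination URL under the pages that link
# to it, so a developer knows exactly which source pages to edit.
import csv
from collections import defaultdict
from io import StringIO

# Stand-in for an "All Inlinks"-style export file.
export = StringIO(
    "Source,Destination,Status Code\n"
    "https://myshop.online/,https://myshop.online/old-sale,404\n"
    "https://myshop.online/widgets,https://myshop.online/old-sale,404\n"
    "https://myshop.online/widgets,https://myshop.online/products/blue,200\n"
)

broken = defaultdict(list)
for row in csv.DictReader(export):
    if row["Status Code"].startswith(("4", "5")):  # client/server errors
        broken[row["Destination"]].append(row["Source"])

for dest, sources in broken.items():
    print(f"{dest} is broken; linked from {len(sources)} page(s):")
    for src in sources:
        print(f"  - {src}")
```

Grouping by destination rather than handing over a flat list means one fix (restore or redirect the target, then update the listed sources) resolves every occurrence at once.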

About Devender Gupta

Devender is an SEO Manager with over 6 years of experience in B2B, B2C, and SaaS marketing. Outside of work, he enjoys watching movies and TV shows and building small micro-utility tools.