Documentation crawler

How the crawler works

When you add a website, the product first runs an initial Search Console data pull for that property, then seeds crawling from your homepage and continues automatically (subject to your plan’s page limits) to build a picture of URLs, responses, metadata, and content-level signals. This crawl data supports technical visibility and the on-page checks described in On-page and technical signals.

Where crawling starts

Crawling typically begins from your site’s homepage at https://<your-domain>/. The crawler discovers internal links on pages it fetches and queues further URLs on the same site, up to the maximum number of pages allowed for your subscription (see Tiers, pages, and data retention). Failed fetch attempts are tracked for diagnostics but do not consume your paid page allowance.

By default, only links on the same hostname scope as your Search Console property (including the apex and www pair) count as internal; subdomains are excluded unless you enable that on the website overview. Details: Crawl scope and subdomains.
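
The discovery loop described above can be sketched as a breadth-first queue. This is only an illustration under stated assumptions, not the product's implementation: `crawl`, `same_site`, and the `fetch` callable are hypothetical names, and the in-memory `SITE` mapping stands in for real HTTP requests.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit


def same_site(url: str, domain: str) -> bool:
    """True when the host is the apex domain or its www twin (subdomains excluded)."""
    host = (urlsplit(url).hostname or "").lower()
    return host in {domain, f"www.{domain}"}


def crawl(domain: str, fetch, page_limit: int) -> tuple[list[str], list[str]]:
    """Breadth-first crawl seeded from the homepage.

    `fetch(url)` is a hypothetical callable that returns the hrefs found on
    the page, or None when the request fails.
    """
    start = f"https://{domain}/"
    queue, seen = deque([start]), {start}
    crawled, failed = [], []
    while queue and len(crawled) < page_limit:
        url = queue.popleft()
        links = fetch(url)
        if links is None:
            failed.append(url)  # kept for diagnostics, not counted against the limit
            continue
        crawled.append(url)
        for href in links:
            target = urljoin(url, href).split("#")[0]  # resolve and drop fragments
            if same_site(target, domain) and target not in seen:
                seen.add(target)
                queue.append(target)
    return crawled, failed


# Hypothetical site: one subdomain link and one URL that fails to fetch.
SITE = {
    "https://example.com/": ["/about", "https://blog.example.com/post", "/broken"],
    "https://example.com/about": ["/", "https://www.example.com/contact"],
    "https://www.example.com/contact": [],
}
crawled, failed = crawl("example.com", SITE.get, page_limit=10)
```

In this sketch the `blog.example.com` link is never queued, and the failed URL lands in `failed` without using up any of `page_limit`.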

User-Agent

Requests from the crawler identify themselves with this User-Agent string (also described in our Terms of Service):

SEO-Perception-Crawler/1.0 (+https://seoperception.com; mailto:support@seoperception.com)

You can use this string to recognize our traffic in server and firewall logs.

If you operate a strict firewall or bot policy, allow this User-Agent and the related contact address so we can help investigate crawl issues quickly. For questions, reach us at support@seoperception.com.
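
As a rough illustration of spotting this traffic in access logs, the sketch below assumes the common "combined" log format, where the User-Agent is the last double-quoted field on each line. The function name and the sample log lines are hypothetical; adjust the pattern to your server's actual log format.

```python
import re

# Product token taken from the User-Agent string above.
CRAWLER_TOKEN = "SEO-Perception-Crawler/"

# Assumption: combined log format, User-Agent is the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def is_crawler_hit(log_line: str) -> bool:
    """Return True when the line's User-Agent field names the crawler."""
    match = UA_FIELD.search(log_line)
    return bool(match) and CRAWLER_TOKEN in match.group(1)


# Hypothetical sample lines for illustration only.
crawler_line = (
    '203.0.113.7 - - [01/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"SEO-Perception-Crawler/1.0 (+https://seoperception.com; '
    'mailto:support@seoperception.com)"'
)
browser_line = (
    '198.51.100.4 - - [01/Jan/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"'
)
hits = [line for line in (crawler_line, browser_line) if is_crawler_hit(line)]
```

Matching on the product token rather than the full string keeps the filter working if the version number changes.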

robots.txt

For each site, we fetch robots.txt at the HTTPS host root and evaluate it using the standard rules: User-agent groups and Allow versus Disallow precedence, including wildcard path patterns where declared. URLs that the rules disallow for our crawler User-Agent are skipped, so crawling stays polite and in line with typical search-engine behavior.
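
To preview how a rule set might treat the crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the crawler's own parser: the standard-library module applies the first matching rule in file order and does not expand wildcard path patterns, so this example lists the Allow line first and avoids wildcards. The robots.txt content and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

CRAWLER_UA = (
    "SEO-Perception-Crawler/1.0 "
    "(+https://seoperception.com; mailto:support@seoperception.com)"
)

# Hypothetical robots.txt with a crawler-specific group and a default group.
ROBOTS_TXT = """\
User-agent: SEO-Perception-Crawler
Allow: /private/press/
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
press_ok = parser.can_fetch(CRAWLER_UA, "https://example.com/private/press/kit.html")
private_ok = parser.can_fetch(CRAWLER_UA, "https://example.com/private/data")
# The crawler-specific group replaces the "*" group, so /tmp/ is not blocked for it.
tmp_ok = parser.can_fetch(CRAWLER_UA, "https://example.com/tmp/cache")
```

A group that names the crawler takes the place of the `*` group for that agent, which is why `/tmp/cache` stays fetchable here.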

SEO Possibilities also compare your live robots.txt (same parsing as crawling) against stored crawl data—separately for each rule type below.

  • Sitemap-listed URL blocked for Googlebot — URLs discovered via XML sitemap but whose paths are disallowed for Googlebot-style matching are flagged on that URL’s row. This check is unchanged when we suppress paths for internal links.
  • Outgoing internal links → disallowed targets — We flag the page that contains the internal link (not the destination URL row) when link targets would be blocked for Googlebot. To reduce noise from intentional nav links (dashboard, login, checkout, …), we skip targets whose paths start with these defaults (case-insensitive): /dashboard, /my-account, /account, /login, /sign-in, /signup, /register, /cart, /checkout, /wp-admin, /admin.
  • Googlebot stricter than wildcard (*) — When default rules allow a URL but a Googlebot-specific section disallows it, we flag that URL’s row—often worth checking for staging mistakes.
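
Two of the checks above can be approximated as follows. This is a sketch under assumptions, not the product's logic: the helper names are hypothetical, the prefix list is copied from the defaults above, and Python's `urllib.robotparser` stands in for the real matcher (it does not expand wildcard paths).

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Default prefixes skipped for internal link targets (from the list above).
SKIP_PREFIXES = (
    "/dashboard", "/my-account", "/account", "/login", "/sign-in", "/signup",
    "/register", "/cart", "/checkout", "/wp-admin", "/admin",
)


def is_suppressed(url: str) -> bool:
    """Case-insensitive prefix match on the URL path."""
    return urlsplit(url).path.lower().startswith(SKIP_PREFIXES)


def blocked_link_targets(parser: RobotFileParser, links: list[str]) -> list[str]:
    """Internal link targets blocked for Googlebot, minus suppressed paths."""
    return [
        url for url in links
        if not is_suppressed(url) and not parser.can_fetch("Googlebot", url)
    ]


def googlebot_stricter(parser: RobotFileParser, url: str) -> bool:
    """True when the default (*) rules allow a URL that Googlebot rules block."""
    return parser.can_fetch("*", url) and not parser.can_fetch("Googlebot", url)


# Hypothetical robots.txt and link list for illustration.
parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /checkout

User-agent: Googlebot
Disallow: /checkout
Disallow: /staging/
""".splitlines())

links = [
    "https://example.com/checkout/pay",
    "https://example.com/staging/new",
    "https://example.com/blog/",
]
flagged = blocked_link_targets(parser, links)
```

Here only the `/staging/new` target is flagged: `/checkout/pay` is blocked but falls under a suppressed prefix, and `/blog/` is allowed.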

If you need the crawler to follow URLs that robots.txt would normally block (for example on a staging or special setup), you can turn on “Ignore robots.txt when crawling” on your account profile. Use this only when you understand the impact on your servers and policies; the setting applies to all crawling done under your user account.

XML sitemaps (Sitemap:)

The same robots.txt fetch is also used to read optional Sitemap: lines pointing at XML sitemap documents. Separately from link discovery, we can cross-check URLs listed in those sitemaps against HTTP behaviour (expecting a 200 response on the first hop). See Sitemap URLs and HTTP 200 for what that means and how it appears as a possibility.
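
The flow above (read Sitemap: lines, list the URLs in each sitemap, compare first-hop statuses against 200) might look like this in outline. The function names, sample documents, and status values are hypothetical, and the network requests are left out so that only the parsing and comparison steps are shown.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_lines(robots_txt: str) -> list[str]:
    """Collect the URLs declared on Sitemap: lines (field name is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def sitemap_locs(xml_text: str) -> list[str]:
    """Extract every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def not_ok_on_first_hop(first_hop_status: dict[str, int]) -> list[str]:
    """URLs whose first response (before following redirects) was not 200."""
    return [url for url, status in first_hop_status.items() if status != 200]


# Hypothetical inputs for illustration.
ROBOTS_TXT = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap.xml\n"
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/old-page</loc></url>
</urlset>
"""
declared = sitemap_lines(ROBOTS_TXT)
listed = sitemap_locs(SITEMAP_XML)
# Assumed statuses, as if each listed URL had been requested without following redirects.
problems = not_ok_on_first_hop({listed[0]: 200, listed[1]: 301})
```

A redirecting URL counts as a problem in this sketch even if its final destination returns 200, because only the first hop is compared.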

Feeds

Some URLs are detected as feeds (for example RSS or Atom) using response headers and URL patterns. They may still appear in crawl-backed lists (such as the technical URL table) as feed type rather than normal HTML.
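
A detection heuristic of this kind could be sketched as below. The specific content types and path hints are assumptions chosen for illustration; the actual signals the crawler uses are not listed here.

```python
from urllib.parse import urlsplit

# Assumed signals: registered feed media types and common feed path endings.
FEED_CONTENT_TYPES = {"application/rss+xml", "application/atom+xml"}
FEED_PATH_ENDINGS = ("/feed", "/rss", "/atom", ".rss", ".atom")


def looks_like_feed(url: str, content_type: str = "") -> bool:
    """Guess whether a response is a feed from its Content-Type header or URL path."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in FEED_CONTENT_TYPES:
        return True
    path = urlsplit(url).path.lower().rstrip("/")
    return path.endswith(FEED_PATH_ENDINGS)


by_header = looks_like_feed("https://example.com/blog/", "application/rss+xml; charset=utf-8")
by_path = looks_like_feed("https://example.com/feed/", "text/html")
normal_page = looks_like_feed("https://example.com/about", "text/html; charset=utf-8")
```

Checking the header first means a feed served from an unusual path is still recognized, while the path fallback covers servers that send a generic content type.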

What we collect (overview)

We do not publish a step-by-step list of every rule here — checks evolve — but at a high level the crawler supports:

  • HTTP layer — status codes and awareness of redirects in the navigation chain.
  • HTML metadata — title, meta description, canonical, robots meta, Open Graph tags where present, and related cues.
  • Page content signals — headings, word counts, images and alt text, internal and external link counts, presence of structured data (such as JSON-LD), viewport and lang, and favicon on the homepage.
  • Link health sampling — a sample of internal and external links may be followed to detect obvious broken or redirecting outbound targets.
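
To make a few of the HTML-level signals above concrete, here is a small sketch built on Python's standard `html.parser`. The class name, the fields collected, and the sample page are hypothetical, and this covers only a subset of what the crawler records.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collect a handful of on-page signals from one HTML document."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = None
        self.canonical = None
        self.headings = []            # heading tag names in document order
        self.images_missing_alt = 0
        self.has_json_ld = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.meta_description = attr.get("content")
        elif tag == "link" and "canonical" in (attr.get("rel") or "").lower().split():
            self.canonical = attr.get("href")
        elif tag in self.HEADING_TAGS:
            self.headings.append(tag)
        elif tag == "img" and not (attr.get("alt") or "").strip():
            self.images_missing_alt += 1
        elif tag == "script" and (attr.get("type") or "").lower() == "application/ld+json":
            self.has_json_ld = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page for illustration.
HTML = """\
<html lang="en"><head>
<title>Pricing | Example</title>
<meta name="description" content="Plans and pricing.">
<link rel="canonical" href="https://example.com/pricing">
<script type="application/ld+json">{"@type": "Product"}</script>
</head><body>
<h1>Pricing</h1><h2>Plans</h2>
<img src="a.png" alt="Chart"><img src="b.png">
</body></html>
"""
signals = PageSignals()
signals.feed(HTML)
```

After `feed`, the `signals` object holds the title, meta description, canonical URL, heading order, a count of images without alt text, and a JSON-LD presence flag.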

Those inputs feed scoring and categorization described conceptually under On-page and technical signals. A row-level view of crawled URLs lives in the website workspace.

Note: Server-side or in-app configuration can cap how many pages are processed per run so large installs stay responsive; that affects how quickly a big site is fully refreshed, not your subscription’s total page allowance. We intentionally crawl at a measured pace to avoid overwhelming origin servers and to reduce noisy false positives.


Google, PageSpeed, and PageSpeed Insights are trademarks of Google LLC. SEO Perception is not endorsed by or affiliated with Google. We use Google’s public PageSpeed Insights service because we find it useful for site owners.

© 2024 - 2026 SEO Perception. All rights reserved.

Built with love by Larsik Corp.