Documentation crawler

How the crawler works

When you add a website, SEO Perception crawls it automatically (subject to your plan’s page limits) to build a picture of URLs, responses, metadata, and content-level signals. This data supports technical visibility reporting and the on-page checks described in On-page and technical signals.

Where crawling starts

Crawling typically begins from your site’s homepage at https://<your-domain>/. The crawler discovers internal links on pages it fetches and queues further URLs on the same site, up to the maximum number of pages allowed for your subscription (see Tiers, pages, and data retention).
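The discovery loop described above can be pictured as a queue with a page budget. The following is a minimal sketch, not the crawler's actual code: the breadth-first order, the `crawl` function, the `get_internal_links` callback, and the in-memory `SITE` map are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start_url: str,
          get_internal_links: Callable[[str], Iterable[str]],
          max_pages: int) -> list[str]:
    """Discover URLs breadth-first from start_url, stopping at max_pages."""
    queue = deque([start_url])
    seen = {start_url}           # URLs already queued, so none is queued twice
    fetched: list[str] = []      # URLs actually processed, in order

    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)
        for link in get_internal_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


# Hypothetical in-memory site standing in for real page fetches.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c", "https://example.com/"],
    "https://example.com/b": [],
    "https://example.com/c": [],
}

# A plan limit of 3 pages stops the crawl before /c is reached.
pages = crawl("https://example.com/", lambda u: SITE.get(u, []), max_pages=3)
```

With a larger `max_pages`, the same call would go on to include `https://example.com/c`; the limit only decides where discovery stops.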

By default, only links within the same hostname scope as your Search Console property (including the apex and www pair) count as internal; subdomains are excluded unless you enable subdomain crawling on the website overview. Details: Crawl scope and subdomains.
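One way to read the scope rule above is as a hostname comparison. This is a rough sketch under stated assumptions: `is_internal`, `strip_www`, and the `include_subdomains` flag are hypothetical names, and links are assumed to be resolved to absolute URLs before the check.

```python
from urllib.parse import urlsplit


def strip_www(host: str) -> str:
    """Normalize a hostname so the apex and its www variant compare equal."""
    host = host.lower().rstrip(".")
    return host[4:] if host.startswith("www.") else host


def is_internal(link: str, property_host: str, include_subdomains: bool = False) -> bool:
    """Return True if link falls within the property's hostname scope."""
    host = urlsplit(link).hostname
    if host is None:                 # relative or malformed URL: not handled here
        return False
    base = strip_www(property_host)
    candidate = strip_www(host)
    if candidate == base:            # apex and www pair
        return True
    # Other subdomains count only when the option is switched on.
    return include_subdomains and candidate.endswith("." + base)
```

For example, `is_internal("https://blog.example.com/", "example.com")` is false by default and becomes true once `include_subdomains=True` is passed.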

How pages are loaded

Pages are opened in a headless browser (Playwright), similar to a real desktop session. The crawler uses a fixed 1920×1080 viewport, waits for the full load event, and applies a navigation timeout to each URL. Because it waits for the load event, deferred scripts can affect what is measured (for example, image attributes). This is closer to what users see than a bare HTTP fetch, but it is still an automated crawl and does not replace manual QA on every device.
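To reproduce a comparable page load locally, something like the following sketch may help. It uses Playwright's Python sync API. The Chromium choice, the 30-second timeout, and the `load_page` helper are assumptions, since the crawler's real browser engine and timeout value are not documented here.

```python
VIEWPORT = {"width": 1920, "height": 1080}   # fixed viewport from the description above
USER_AGENT = "SEO-Perception-Crawler/1.0 (+https://seoperception.com)"
NAV_TIMEOUT_MS = 30_000                      # assumed value; the real timeout is not documented


def load_page(url: str) -> dict:
    """Open url in headless Chromium and wait for the load event."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)   # browser engine is an assumption
        try:
            page = browser.new_page(viewport=VIEWPORT, user_agent=USER_AGENT)
            # wait_until="load" returns only after the load event has fired;
            # goto raises a Playwright TimeoutError if NAV_TIMEOUT_MS is exceeded.
            response = page.goto(url, wait_until="load", timeout=NAV_TIMEOUT_MS)
            return {
                "final_url": page.url,
                "status": response.status if response else None,
                "title": page.title(),
            }
        finally:
            browser.close()
```

Running `load_page("https://example.com/")` requires Playwright and its browser binaries to be installed; the returned dictionary is only a small subset of what a crawl would record.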

User-Agent

Requests from the crawler identify themselves with this User-Agent string (also described in our Terms of Service):

SEO-Perception-Crawler/1.0 (+https://seoperception.com)

You can use this string to recognize our traffic in server and firewall logs.
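As a simple illustration of spotting this traffic, the sketch below filters access-log lines by the crawler's product token. The log lines are invented samples in the common combined log format (using documentation-reserved IP ranges), and `is_crawler_hit` is a hypothetical helper.

```python
CRAWLER_TOKEN = "SEO-Perception-Crawler/"   # product token from the User-Agent string


def is_crawler_hit(log_line: str) -> bool:
    """Return True if the log line carries the crawler's User-Agent token."""
    return CRAWLER_TOKEN in log_line


# Invented sample lines, for illustration only.
sample_log = [
    '203.0.113.7 - - [01/Jan/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" '
    '"SEO-Perception-Crawler/1.0 (+https://seoperception.com)"',
    '198.51.100.4 - - [01/Jan/2026:10:00:02 +0000] "GET /about HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]

hits = [line for line in sample_log if is_crawler_hit(line)]
```

Keep in mind that any client can send an arbitrary User-Agent header, so a match indicates likely crawler traffic rather than proof of origin.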

robots.txt

For each site, we fetch robots.txt and apply a straightforward Disallow path check. URLs that are disallowed for the crawler are skipped so we stay within common expectations for polite crawling.

If you need the crawler to follow URLs that robots.txt would normally block (for example on a staging or special setup), you can turn on “Ignore robots.txt when crawling” on your account profile. Use this only when you understand the impact on your servers and policies; it applies to crawling for your user account.
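A Disallow path check of the kind described above could look roughly like this. It is a sketch under assumptions, not the crawler's real parser: rules are treated as plain path prefixes, only groups addressed to `*` or to an assumed `seo-perception-crawler` token are read, and `Allow` lines and wildcards are ignored. The `ignore_robots` flag mirrors the account setting conceptually.

```python
def parse_disallows(robots_txt: str, agent_token: str = "seo-perception-crawler") -> list[str]:
    """Collect Disallow paths from groups addressed to '*' or to agent_token."""
    disallows: list[str] = []
    group_agents: list[str] = []   # user-agents named by the group being read
    reading_agents = False         # True while consecutive User-agent lines continue

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()       # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                group_agents = []                 # a new group starts here
                reading_agents = True
            group_agents.append(value.lower())
        else:
            reading_agents = False
            applies = "*" in group_agents or agent_token in group_agents
            if field == "disallow" and value and applies:
                disallows.append(value)
    return disallows


def is_allowed(path: str, disallows: list[str], ignore_robots: bool = False) -> bool:
    """Prefix check: a path is blocked if it starts with any Disallow rule."""
    if ignore_robots:
        return True
    return not any(path.startswith(rule) for rule in disallows)


ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /cart

User-agent: OtherBot
Disallow: /
"""

rules = parse_disallows(ROBOTS)
```

Here `is_allowed("/admin/users", rules)` is false, while the same call with `ignore_robots=True` returns true, which is why that option should be enabled with care.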

Feeds

Some URLs are detected as feeds (for example RSS or Atom) using response headers and URL patterns. They may still appear in crawl-backed lists (such as the technical URL table) as feed type rather than normal HTML.
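Feed detection from headers and URL patterns might be approximated as below. The specific content types and the URL pattern are assumptions chosen for illustration; the crawler's actual rules are not listed in this document.

```python
import re
from typing import Optional

# Assumed signals: standard RSS/Atom media types and common feed URL endings.
FEED_CONTENT_TYPES = {"application/rss+xml", "application/atom+xml"}
FEED_URL_PATTERN = re.compile(r"(/feed/?$|/rss/?$|/atom/?$|\.rss$|\.atom$)", re.IGNORECASE)


def looks_like_feed(url: str, content_type: Optional[str]) -> bool:
    """Return True if the response header or the URL shape suggests a feed."""
    if content_type:
        # Strip parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type in FEED_CONTENT_TYPES:
            return True
    path = url.split("?", 1)[0].split("#", 1)[0]   # ignore query string and fragment
    return bool(FEED_URL_PATTERN.search(path))
```

Under these assumed rules, `https://example.com/blog/feed/` is classified as a feed from its URL alone, while `https://example.com/feedback` is not.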

What we collect (overview)

We do not publish a step-by-step list of every rule here, because checks evolve. At a high level, the crawler collects:

  • HTTP layer: status codes and any redirects in the navigation chain.
  • HTML metadata: title, meta description, canonical, robots meta, Open Graph tags where present, and related cues.
  • Page content signals: headings, word counts, images and alt text, internal and external link counts, presence of structured data (such as JSON-LD), viewport and lang, and favicon on the homepage.
  • Link health sampling: a sample of internal and external links may be followed to detect broken or redirecting link targets.

Those inputs feed scoring and categorization described conceptually under On-page and technical signals. A row-level view of crawled URLs lives in the website workspace.
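To give a feel for a few of the signals listed above, here is a small sketch that pulls a handful of them from static HTML with Python's standard `html.parser`. The `SignalParser` class and its field names are hypothetical, and the real crawler reads the rendered page in a browser rather than parsing raw markup like this.

```python
from html.parser import HTMLParser


class SignalParser(HTMLParser):
    """Collect a few on-page signals from an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.signals = {"title": "", "meta_description": None, "canonical": None,
                        "h1_count": 0, "images": 0, "images_without_alt": 0,
                        "has_json_ld": False}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (a.get("name") or "").lower() == "description":
            self.signals["meta_description"] = a.get("content")
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.signals["canonical"] = a.get("href")
        elif tag == "h1":
            self.signals["h1_count"] += 1
        elif tag == "img":
            self.signals["images"] += 1
            if "alt" not in a:                     # alt attribute absent entirely
                self.signals["images_without_alt"] += 1
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self.signals["has_json_ld"] = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.signals["title"] += data


HTML = """<html lang="en"><head>
<title>Example page</title>
<meta name="description" content="A short summary.">
<link rel="canonical" href="https://example.com/page">
<script type="application/ld+json">{"@type": "Article"}</script>
</head><body>
<h1>Heading</h1>
<img src="a.png" alt="Diagram"><img src="b.png">
</body></html>"""

parser = SignalParser()
parser.feed(HTML)
signals = parser.signals
```

For the sample markup, `signals` records one `h1`, two images of which one has no `alt` attribute, a canonical URL, and the presence of JSON-LD.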

Note: Server-side or in-app configuration can cap how many pages are processed per run so large installs stay responsive; that affects how quickly a big site is fully refreshed, not your subscription’s total page allowance.

Stay in the loop

Weekly SEO teardowns, algorithm update alerts, and performance tactics—when we publish them.

We respect your privacy: we do not sell your email or spam you.

SEO Perception

We take all the dry, technical SEO data nobody wants to read, connect the dots with AI and decades of SEO expertise, and show you the fixes that matter most plus the opportunities with the biggest upside.


© 2024 - 2026 SEO Perception. All rights reserved.

Built with love by Larsik Corp.