How the crawler works
When you add a website, SEO Perception crawls it automatically (subject to your plan’s page limits) to build a picture of URLs, responses, metadata, and content-level signals. That supports technical visibility and the kinds of on-page checks described in On-page and technical signals.
Where crawling starts
Crawling typically begins from your site’s homepage at https://<your-domain>/. The crawler discovers internal links on pages it fetches and queues further URLs on the same site, up to the maximum number of pages allowed for your subscription (see Tiers, pages, and data retention).
By default, only links on the same hostname as your Search Console property count as internal (the apex domain and its www variant are treated as one site); other subdomains are excluded unless you enable them on the website overview. Details: Crawl scope and subdomains.
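The scope rule above can be sketched as a small check. This is an illustrative helper, not the product's actual code; the function name `is_internal` and the exact normalization are assumptions based on the behavior described (apex and www treated as one site, other subdomains excluded).

```python
from urllib.parse import urlparse

def is_internal(url: str, property_host: str) -> bool:
    """Hypothetical sketch of the default scope rule: a link counts as
    internal when its hostname matches the property's hostname, with the
    apex domain and its www variant treated as the same site. Other
    subdomains do not match unless subdomain crawling is enabled."""
    host = (urlparse(url).hostname or "").lower()
    prop = property_host.lower()

    def strip_www(h: str) -> str:
        return h[4:] if h.startswith("www.") else h

    return strip_www(host) == strip_www(prop)

print(is_internal("https://www.example.com/page", "example.com"))   # True
print(is_internal("https://blog.example.com/post", "example.com"))  # False
```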
How pages are loaded
Pages are opened in a headless browser (Playwright), similar to a real desktop session. The crawler uses a fixed viewport (1920×1080), waits for the full load event (so deferred scripts can affect what is measured, for example image attributes), and uses a navigation timeout on each URL. This is closer to what users see than a bare HTTP fetch, but it is still an automated crawl — not a replacement for manual QA on every device.
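For readers who want to reproduce a comparable rendering environment, the setup described above (headless Chromium, 1920×1080 viewport, waiting for the `load` event, per-URL navigation timeout) maps directly onto the Playwright API. This is a sketch of an equivalent fetch, not the product's actual implementation; it requires `pip install playwright` and `playwright install chromium`.

```python
def fetch_rendered(url: str, timeout_ms: int = 30_000) -> str:
    """Illustrative sketch of loading a page the way the docs describe:
    headless browser, fixed 1920x1080 viewport, waiting for the full
    `load` event, with a navigation timeout applied per URL."""
    from playwright.sync_api import sync_playwright  # lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1920, "height": 1080})
        # wait_until="load" blocks until the load event fires, so deferred
        # scripts have run before the DOM is captured
        page.goto(url, wait_until="load", timeout=timeout_ms)
        html = page.content()
        browser.close()
        return html
```

Waiting for `load` rather than `domcontentloaded` is what lets script-added attributes (for example on images) show up in what is measured.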
User-Agent
Requests from the crawler identify themselves with this User-Agent string (also described in our Terms of Service):
SEO-Perception-Crawler/1.0 (+https://seoperception.com)
You can use this string to recognize our traffic in server and firewall logs.
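Because the User-Agent string is distinctive, a plain substring match is enough to pick out crawler hits in access logs, regardless of log format. A minimal sketch (the sample log line is fabricated for illustration):

```python
CRAWLER_UA = "SEO-Perception-Crawler/1.0 (+https://seoperception.com)"

def is_crawler_hit(log_line: str) -> bool:
    """Return True when an access-log line carries the crawler's
    User-Agent. A substring check suffices because the UA string is
    distinctive; the surrounding log format does not matter."""
    return CRAWLER_UA in log_line

sample = ('203.0.113.7 - - [01/Jan/2025:12:00:00 +0000] "GET / HTTP/1.1" '
          '200 5120 "-" "SEO-Perception-Crawler/1.0 (+https://seoperception.com)"')
print(is_crawler_hit(sample))  # True
```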
robots.txt
For each site, we fetch robots.txt and apply a straightforward Disallow path check. URLs that are disallowed for the crawler are skipped so we stay within common expectations for polite crawling.
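You can preview how a Disallow path check plays out for your own robots.txt using Python's standard-library parser. This mirrors the behavior described above in spirit; the crawler's actual matcher is not published, so treat this as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, parsed directly (no network fetch needed)
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

ua = "SEO-Perception-Crawler/1.0"
print(rp.can_fetch(ua, "https://example.com/public/page"))   # True
print(rp.can_fetch(ua, "https://example.com/private/page"))  # False
```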
If you need the crawler to follow URLs that robots.txt would normally block (for example on a staging or other special setup), you can turn on “Ignore robots.txt when crawling” in your account profile. Use this only when you understand the impact on your servers and policies; the setting applies to all crawling performed for your user account.
Feeds
Some URLs are detected as feeds (for example RSS or Atom) based on response headers and URL patterns. They may still appear in crawl-backed lists (such as the technical URL table), flagged as feeds rather than normal HTML pages.
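A combined header-and-URL-pattern check like the one described might look as follows. The specific content types and URL suffixes here are plausible guesses for illustration; the crawler's exact patterns are not published.

```python
FEED_CONTENT_TYPES = ("application/rss+xml", "application/atom+xml")
FEED_SUFFIXES = ("/feed", "/rss", ".rss", ".atom", "/atom.xml", "/rss.xml")

def looks_like_feed(url: str, content_type: str = "") -> bool:
    """Hypothetical feed detector combining a response-header check with
    common URL patterns, as the docs describe at a high level."""
    # Header check: strip parameters like "; charset=utf-8" first
    ct = content_type.split(";")[0].strip().lower()
    if ct in FEED_CONTENT_TYPES:
        return True
    # URL-pattern check: ignore the query string and a trailing slash
    path = url.split("?")[0].rstrip("/").lower()
    return path.endswith(FEED_SUFFIXES)

print(looks_like_feed("https://example.com/blog/feed"))                 # True
print(looks_like_feed("https://example.com/page", "text/html"))         # False
print(looks_like_feed("https://example.com/x", "application/rss+xml"))  # True
```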
What we collect (overview)
We do not publish a step-by-step list of every rule here — checks evolve — but at a high level the crawler supports:
- HTTP layer — status codes and awareness of redirects in the navigation chain.
- HTML metadata — title, meta description, canonical, robots meta, Open Graph tags where present, and related cues.
- Page content signals — headings, word counts, images and alt text, internal and external link counts, presence of structured data (such as JSON-LD), viewport and lang attributes, and the favicon on the homepage.
- Link health sampling — a sample of internal and external links may be followed to detect obviously broken or redirecting targets.
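To make the HTML-metadata layer concrete, here is a toy extractor for a few of the signals listed above (title, meta description, canonical link) using only Python's standard library. It is illustrative only; the product's actual checks are more extensive and are not published.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Toy extractor for a handful of the metadata signals listed above."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.canonical = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (a.get("name") or "").lower() == "description":
            self.description = a.get("content", "")
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

doc = """<html><head>
<title>Example</title>
<meta name="description" content="A sample page">
<link rel="canonical" href="https://example.com/">
</head><body></body></html>"""

m = MetaExtractor()
m.feed(doc)
print(m.title, "|", m.description, "|", m.canonical)
```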
Those inputs feed scoring and categorization described conceptually under On-page and technical signals. A row-level view of crawled URLs lives in the website workspace.
Note: Server-side or in-app configuration can cap how many pages are processed per run so large installs stay responsive; that affects how quickly a big site is fully refreshed, not your subscription’s total page allowance.