BROKENLINKSBOT

THE FRIENDLY CRAWLER THAT FINDS DEAD LINKS ON THE WEB

Oh my, the internet is broken

Like, really broken. There are billions of links out there pointing to pages that no longer exist, servers that gave up, and domains that wandered off into the sunset. It’s a mess.

Our BrokenLinksBot does a pretty darn good job of finding where the breaks are. It crawls the web, follows links on pages, and checks whether those links actually lead somewhere useful or straight into the void. When it finds a dead link, we help site owners find out about it, so they can fix it or replace it with something useful.

How we crawl (politely)

We take server etiquette very seriously. Your server will barely notice we’re there.

User-Agent. We identify ourselves as BrokenLinksBot/1.0 (+https://brokenlinks.io/bot). No sneaking around, proxy use, or pretending to be something we’re not. We’re not Chrome, definitely not Firefox, and don’t want to be.
Rate limiting. We make at most 1 request per second per domain. We’re patient like that.
429 backoff. If your server tells us to slow down with a 429 Too Many Requests, we listen. We double our delay each time (exponential backoff), respect Retry-After headers, and cap out at 60 seconds between requests. We then wait and automatically retry the request. We will not hammer your server. Ever.
Error backoff. If your server returns a 401, 403, or 5xx error, we also double our crawl delay using the same exponential backoff (capped at 60 seconds). Unlike with 429 responses, we don’t retry the failed request: we just slow down and move on.
Adaptive backoff. If your server is responding slowly, we notice. We automatically raise our crawl delay to match your server’s response time (capped at 60 seconds), so we never request faster than your server can comfortably respond. Slow servers get gentler crawling.
robots.txt. We fully comply with your robots.txt file per RFC 9309. If you say don’t crawl, we don’t crawl. Simple as that.
Crawl-Delay. We honor the Crawl-Delay directive in robots.txt for values between 1 and 60 seconds.
Noindex. We respect <meta name="robots" content="noindex"> tags and X-Robots-Tag: noindex HTTP headers. If a page says it doesn’t want to be indexed, we leave it alone.
GET only. We only make GET requests. We never submit forms, POST data, or interact with your site beyond following links.
No JavaScript. We use an HTML parser, not a headless browser. We don’t execute JavaScript, so JS-rendered links may not be discovered.
No authentication. We don’t crawl behind login walls or attempt to access authenticated content. If a page requires a login, we skip it.
Response size limit. We cap responses at 1 MB. We’re not downloading your video files.

In short: we will not overload your server. We’re here to help, not to cause problems.
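
To make the three backoff rules concrete, here is a minimal Python sketch of how a per-host delay could be tracked. It is an illustration only, not BrokenLinksBot’s actual code: the HostDelay class and its on_response method are hypothetical names, and two details the text does not spell out are assumed here (Retry-After is treated as a lower bound on the doubled delay, and a raised delay is kept rather than reset after a successful response).

```python
from dataclasses import dataclass
from typing import Optional

BASE_DELAY = 1.0   # at most 1 request per second per domain
MAX_DELAY = 60.0   # every backoff described above is capped at 60 seconds


@dataclass
class HostDelay:
    """Tracks the current crawl delay for one host (hypothetical helper)."""
    delay: float = BASE_DELAY

    def on_response(self, status: int, elapsed: float,
                    retry_after: Optional[float] = None) -> bool:
        """Update the delay after a response; return True if the request should be retried."""
        if status == 429:
            # 429 backoff: double the delay, honor Retry-After as a floor
            # (assumption), cap at 60 s, and retry after waiting.
            self.delay = min(MAX_DELAY, max(self.delay * 2, retry_after or 0.0))
            return True
        if status in (401, 403) or 500 <= status <= 599:
            # Error backoff: same doubling and cap, but no retry.
            self.delay = min(MAX_DELAY, self.delay * 2)
            return False
        # Adaptive backoff: never request faster than the server responded.
        # Assumption: the delay is only ever raised here, never lowered.
        self.delay = min(MAX_DELAY, max(self.delay, elapsed))
        return False
```

A crawler loop would sleep for delay seconds before the next request to that host and re-queue the URL only when on_response returns True.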

What data we collect

We only collect what we need to identify broken links and provide useful context:

Page titles. So we can tell you which page has the broken link.
Meta descriptions. For context about the page content.
Page body text. A text-only extract of the page (no HTML, scripts, or styles) used to provide context around broken links.
Link data. The URLs we find on your pages, their anchor text, and rel attributes (like nofollow). This is how we track what links to where.
Canonical URLs. To avoid reporting duplicates.

We do not collect personal data, cookies, form submissions, or login credentials. We’re looking at links, not reading your diary.
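
For a sense of what the link data looks like, here is a small Python sketch that pulls the page title and each link’s URL, anchor text, and rel values out of static HTML using the standard library’s html.parser. The LinkExtractor class and the dictionary layout are assumptions made for illustration; the real crawler’s parser and storage format are not described here.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the page title plus href, anchor text, and rel for each <a href> (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []         # one dict per link: href, text, rel
        self._in_title = False
        self._current = None    # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and attrs.get("href"):
            # rel is a space-separated token list, e.g. "nofollow noopener".
            rel = (attrs.get("rel") or "").split()
            self._current = {"href": attrs["href"], "text": "", "rel": rel}

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Text inside nested tags (such as <b>) still counts as anchor text.
        if self._in_title:
            self.title += data
        elif self._current is not None:
            self._current["text"] += data
```

Because nothing here executes scripts, links that only appear after JavaScript runs would not show up in links, which matches the "No JavaScript" note above.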

How we use this data

We use the data we collect to build a database of broken links across the web. This data is made available for commercial purposes: it helps SEO professionals, website owners, and digital agencies discover broken links and get them fixed.

How to block us

We respect your right to say no. Here’s how to block BrokenLinksBot using your robots.txt file.

Block us entirely:

User-agent: BrokenLinksBot
Disallow: /

Block specific paths:

User-agent: BrokenLinksBot
Disallow: /private/
Disallow: /admin/

Slow us down instead:

User-agent: BrokenLinksBot
Crawl-Delay: 10
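
If you want to sanity-check rules like these before deploying them, Python’s standard urllib.robotparser can evaluate them locally. This is only a rough check under stated assumptions: the example.com URLs are placeholders, and Python’s parser is not guaranteed to match BrokenLinksBot’s RFC 9309 matching in every edge case.

```python
from urllib.robotparser import RobotFileParser

# The path rules and the crawl delay from the examples above, combined.
rules = """\
User-agent: BrokenLinksBot
Disallow: /private/
Disallow: /admin/
Crawl-Delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# example.com is a placeholder host; only the path matters for matching.
blocked = parser.can_fetch("BrokenLinksBot", "https://example.com/private/report.html")
allowed = parser.can_fetch("BrokenLinksBot", "https://example.com/blog/")
delay = parser.crawl_delay("BrokenLinksBot")

print(blocked, allowed, delay)
```

Here blocked should be False, allowed should be True, and delay should be 10.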

We use a two-tier cache for robots.txt. For the first 24 hours after a fetch, we serve from cache without contacting your server at all. After that, for up to 30 days, we send conditional requests with If-None-Match (ETag) and If-Modified-Since headers. If your server responds with 304 Not Modified, we keep using the cached copy, which is much cheaper for your server than re-sending the full file. If your robots.txt changes, the new response replaces our cache immediately. Worst case, a change to block us takes effect within 24 hours.
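
The two tiers can be pictured as a small decision function. The Python sketch below is a hedged illustration, not the crawler’s real cache: CachedRobots and plan_fetch are hypothetical names, and it assumes that the 30-day window counts from the last full download while the 24-hour window restarts after each successful check (including a 304).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

FRESH_SECONDS = 24 * 60 * 60             # tier 1: serve from cache, no network
REVALIDATE_SECONDS = 30 * 24 * 60 * 60   # tier 2: conditional requests allowed


@dataclass
class CachedRobots:
    body: str
    fetched_at: float                    # when the full file was last downloaded
    checked_at: float                    # when it was last fetched or confirmed by a 304
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def plan_fetch(entry: Optional[CachedRobots], now: float) -> Tuple[str, Dict[str, str]]:
    """Return (action, headers); action is 'use_cache', 'revalidate', or 'fetch'."""
    if entry is None or now - entry.fetched_at >= REVALIDATE_SECONDS:
        return "fetch", {}               # nothing cached, or older than 30 days
    if now - entry.checked_at < FRESH_SECONDS:
        return "use_cache", {}           # within 24 hours: no request at all
    headers = {}
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return "revalidate", headers         # a 304 keeps the cached body
```

On a 304 the caller would update checked_at; on a 200 it would replace the whole entry, which is how a changed robots.txt takes over immediately.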

robots.txt: the fine print

We follow RFC 9309 (Robots Exclusion Protocol). Here’s exactly how we handle different scenarios:

No robots.txt (404). If your site doesn’t have a robots.txt file, we treat that as permission to crawl. Per the RFC, a missing file means no restrictions.
Server error (5xx). If your server returns a 5xx error when we request robots.txt, we play it safe and assume we’re not allowed to crawl. We’ll try again later.
Forbidden (403). Same as a server error: we err on the side of caution and don’t crawl.
Redirect (3xx). If robots.txt redirects, we treat it as if no robots.txt exists (allow all), the same handling RFC 9309 gives to an unavailable file.
HTML instead of text. Some misconfigured servers return an HTML page instead of a real robots.txt. We detect this and treat it as if no robots.txt exists.
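
The scenarios above boil down to a small lookup. The Python sketch below shows one way to express it; classify_robots_response and RobotsPolicy are hypothetical names, the HTML check is a simple heuristic chosen for illustration, and the cautious fallback for status codes the list does not mention is an assumption.

```python
from enum import Enum


class RobotsPolicy(Enum):
    ALLOW_ALL = "allow_all"        # crawl as if there were no restrictions
    DISALLOW_ALL = "disallow_all"  # do not crawl; try robots.txt again later
    PARSE = "parse"                # a real robots.txt: parse and obey it


def classify_robots_response(status: int, content_type: str = "",
                             body: str = "") -> RobotsPolicy:
    """Map a robots.txt HTTP response to a crawl policy (illustrative sketch)."""
    if status == 403 or 500 <= status <= 599:
        return RobotsPolicy.DISALLOW_ALL
    if status == 404 or 300 <= status <= 399:
        return RobotsPolicy.ALLOW_ALL
    if status == 200:
        # Heuristic (assumption): an HTML content type or an HTML-looking
        # body means the server sent a page, not a robots.txt file.
        start = body.lstrip().lower()
        looks_like_html = ("html" in content_type.lower()
                           or start.startswith(("<!doctype", "<html")))
        return RobotsPolicy.ALLOW_ALL if looks_like_html else RobotsPolicy.PARSE
    # Statuses not covered by the list above: assumed to be handled cautiously.
    return RobotsPolicy.DISALLOW_ALL
```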

Per-subdomain scope

As per the RFC, robots.txt applies to the specific host (subdomain) it’s served from. A robots.txt at www.example.com does not apply to blog.example.com. Each subdomain has its own robots.txt.

Redirects and cross-host checks

When we follow a redirect that lands on a different host, we fetch and check the robots.txt for that new host before continuing. We don’t assume that permission to crawl one subdomain means we can crawl another.
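
As a rough illustration of per-host scope, the Python sketch below derives the robots.txt location for any URL and decides whether a redirect has crossed onto a new host. Both function names are hypothetical, and comparing only the hostname (ignoring scheme and port) is a simplifying assumption.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def needs_new_robots_check(original_url: str, redirect_target: str) -> bool:
    """True when a redirect lands on a different host than the original URL."""
    # urlsplit(...).hostname is already lowercased, so the comparison
    # is case-insensitive.
    return urlsplit(original_url).hostname != urlsplit(redirect_target).hostname
```

For example, a redirect from https://www.example.com/a to https://blog.example.com/b would trigger a fetch of https://blog.example.com/robots.txt before the crawl continues.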

Verifying the bot

Want to make sure a request is actually from us and not someone pretending to be BrokenLinksBot? You can verify it with a reverse DNS lookup. Our crawlers resolve back to crawlerX.brokenlinks.io.

Step 1. Run a reverse DNS lookup on the IP address from your access logs:

$ host 203.0.113.50
50.113.0.203.in-addr.arpa domain name pointer crawlerX.brokenlinks.io.

Step 2. Confirm the hostname resolves back to the same IP:

$ host crawlerX.brokenlinks.io
crawlerX.brokenlinks.io has address 203.0.113.50

If both match, it’s us. If the reverse DNS doesn’t point to crawlerX.brokenlinks.io, someone is spoofing our user-agent string, and that’s not cool.
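
The same two-step check can be automated. The Python sketch below uses the standard socket module; is_brokenlinksbot is a hypothetical helper, and the hostname test (starts with "crawler", ends with ".brokenlinks.io") is an assumption based on the crawlerX.brokenlinks.io pattern shown above. The resolver functions are parameters so the logic can be exercised without network access.

```python
import socket


def is_brokenlinksbot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IPv4 address (illustrative sketch)."""
    try:
        # Step 1: reverse lookup, IP address to hostname.
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no PTR record, or the lookup failed
    # Assumed pattern for crawler hostnames, based on the example above.
    if not (hostname.startswith("crawler") and hostname.endswith(".brokenlinks.io")):
        return False
    try:
        # Step 2: forward lookup, hostname back to its IPv4 addresses.
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

Calling is_brokenlinksbot("203.0.113.50") with the default resolvers performs real DNS lookups, so results depend on live DNS records.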

Operator

This bot is operated by Two Phase LLC.

Mailing address: PO Box 14672, Jackson, WY 83002.

Physical office: 680 South Cache Street, Unit 100, Jackson, WY 83001.

Contact: [email protected]