STEP 11

🗃️ Our data

WHERE THE BROKEN-LINK DATABASE COMES FROM, HOW IT'S GATHERED, HOW OFTEN IT'S REFRESHED, AND WHAT WE DO (AND DON'T) COLLECT.

A recommendation engine is only as good as the data underneath it. So it’s worth knowing what’s in our index, how it got there, and how we keep it honest.

What we have 📦

There are three layers of data feeding the product.

🌐 A global broken-link index. Tens of millions of broken outbound links discovered across the public web by our own crawler, BrokenLinksBot. For each one we keep the source URL, the source domain, the broken destination URL, the anchor text, the sentence wrapped around the link, the source page title, the source meta description, and the error type (DNS failure, 404, redirect loop, etc.). This is what powers the Broken Link Database tab and the candidate pool every audit matches against.
🏷️ Domain-level signals. A domain-rating-style estimate (the BDR column) for each source domain, plus aggregated counts of inbound and outbound links at the domain level. We use these to filter for higher-authority sources and to rank opportunities.
🪞 Your site’s content fingerprint. When you start an audit, our crawler reads your sitemap and pulls a text-only extract of every page (no HTML, no scripts, no cookies, no logged-in content). We score each page on the five content dimensions (see Reading your Content Analysis) and build a topical fingerprint that captures what each page is about. That fingerprint is what we match against the global broken-link index.
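
As an illustration only, here is one way the per-link record described in the first layer could be modeled in Python. The class name and field names are hypothetical, not the product's actual schema, and deriving the source domain from the source URL is an assumption made to keep the sketch short.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlsplit


@dataclass(frozen=True)
class BrokenLinkRecord:
    """Hypothetical shape of one row in the broken-link index."""

    source_url: str               # page that contains the broken link
    broken_url: str               # destination that no longer resolves
    anchor_text: str              # visible text of the link
    surrounding_sentence: str     # sentence wrapped around the link
    source_title: str             # <title> of the source page
    source_meta_description: str  # meta description of the source page
    error_type: str               # e.g. "dns_failure", "404", "redirect_loop"
    found_at: datetime            # when the link was first discovered
    checked_at: datetime          # when the link was last re-verified

    @property
    def source_domain(self) -> str:
        # Assumption: the domain is derived from the source URL
        # rather than stored as a separate field.
        return urlsplit(self.source_url).hostname or ""
```

A record built from `https://Example.com/resources` would report `example.com` as its source domain, because `urlsplit` lowercases the hostname.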

How we crawl 🤖

The crawler that builds the broken-link database and the crawler that audits your site are the same code: BrokenLinksBot. It identifies itself honestly, rate-limits to one request per second per domain, backs off on 429 / 5xx responses, fully complies with robots.txt per RFC 9309, respects noindex meta tags and headers, only issues GET requests, doesn’t execute JavaScript, and doesn’t try to crawl behind logins. Full operator details, verification steps, and instructions for blocking the bot live on the BrokenLinksBot page.

If a site says no via robots.txt, we don’t crawl it. That site is then absent from our database, which is exactly what its owner asked for. The trade-off is real: a publisher that blocks us is a publisher whose pages will never appear as opportunities. We think respecting that choice is the right call.
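
To make the politeness rules concrete, here is a minimal Python sketch of how a per-domain delay could be computed. The one-second base delay and the 429 / 5xx back-off triggers come from the description above. The exponential doubling, the five-minute cap, and the function names are assumptions for illustration, not a description of BrokenLinksBot's actual code.

```python
BASE_DELAY_S = 1.0    # one request per second per domain (from the text above)
MAX_DELAY_S = 300.0   # assumed cap; not a documented value


def should_back_off(status: int) -> bool:
    """Return True for responses that ask the crawler to slow down."""
    return status == 429 or 500 <= status <= 599


def next_delay(status: int, prior_backoffs: int) -> float:
    """Seconds to wait before the next request to the same domain.

    prior_backoffs counts the back-off responses already seen in a row
    before this one. Doubling on each repeat is an assumed policy.
    """
    if not should_back_off(status):
        return BASE_DELAY_S
    return min(BASE_DELAY_S * 2 ** (prior_backoffs + 1), MAX_DELAY_S)
```

Under these assumptions a healthy response keeps the crawler at one request per second, a first 429 doubles the wait to two seconds, and repeated failures keep doubling until the cap is reached.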

How fresh is it? ⏱️

Broken links are never a one-time discovery. A page that 404s today might be restored next week; a page that resolves today might break tomorrow. So every link in the database has two timestamps you can see in the Timeline column:

📅 Found 👉 when we first discovered the link.
🔁 Checked 👉 when we last revisited it to confirm it’s still broken.

Recently checked rows are more reliable than stale ones. We continuously revisit known broken links to catch fixes and to age out false positives. Recheck cadence depends on how the link is behaving: links that have failed many times in a row get rechecked less often (they’re stable failures), while recently broken links get checked more frequently (they’re more likely to be in flux).
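
The cadence idea can be sketched in a few lines of Python. This is an illustration under stated assumptions: the one-day starting interval, the doubling rule, and the thirty-day ceiling are made-up values, and the function names are hypothetical. Only the general shape (more consecutive failures means a longer gap between rechecks) comes from the paragraph above.

```python
from datetime import datetime, timedelta

BASE_INTERVAL = timedelta(days=1)   # assumed interval for a freshly broken link
MAX_INTERVAL = timedelta(days=30)   # assumed ceiling for stable failures


def recheck_interval(consecutive_failures: int) -> timedelta:
    """Longer gaps for links that keep failing, shorter for new breakage."""
    # Clamp the exponent so the multiplication cannot overflow.
    exponent = min(max(consecutive_failures, 1) - 1, 10)
    return min(BASE_INTERVAL * 2 ** exponent, MAX_INTERVAL)


def next_check(checked_at: datetime, consecutive_failures: int) -> datetime:
    """When a link last verified at checked_at would be due again."""
    return checked_at + recheck_interval(consecutive_failures)
```

With these numbers, a link that has failed once is due again the next day, a link on its third consecutive failure waits four days, and a long-standing failure settles at the thirty-day ceiling.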

What we collect from your site 🔒

When you point an audit at your domain, BrokenLinksBot crawls it exactly as it would any other site, and we also keep a richer fingerprint for matching:

📄 Page titles so we can show you which page is being matched.
📝 Meta descriptions for context.
📰 Page body text as a plain-text extract, used for content scoring and topical matching.
🔗 Link data (URLs, anchor text, rel attributes) so we can find broken outbound links from your site too.
🏷️ Canonical URLs so we don’t double-count duplicates.

We do not collect personal data, cookies, form submissions, or login credentials. We don’t read pages behind authentication. We don’t track visitors. We’re looking at links and the text around them, not at users.

If at any point you want us to stop, block BrokenLinksBot in your robots.txt and we’ll be gone within 24 hours (we cache robots.txt for up to a day). If you want to delete everything, see the Danger zone in Account settings.
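
For reference, a blocking rule can be checked before you deploy it. The sketch below uses Python's standard `urllib.robotparser` to confirm that a sample robots.txt turns the crawler away while leaving other bots alone. It assumes the user-agent token to match is `BrokenLinksBot`; confirm the exact token on the BrokenLinksBot page. The `is_allowed` helper and the example URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt: block BrokenLinksBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: BrokenLinksBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network call
    return parser.can_fetch(user_agent, url)


blocked = not is_allowed(ROBOTS_TXT, "BrokenLinksBot", "https://example.com/any/page")
others_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/any/page")
```

Here `blocked` and `others_ok` should both be true. Because robots.txt is cached for up to a day, a new rule may take that long to be picked up.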

Where the data goes 🛣️

The database is shared across customers. That’s a feature, not a leak: when one site’s audit surfaces a particularly useful broken link, that same broken link can be matched to your pages too if your content is a credible fit. The shared index is what makes it possible to find good opportunities for a brand-new site on day one, instead of asking you to wait while we crawl the entire web from scratch.

What’s not shared:

🚫 Your content scores and page fingerprints are scoped to your account.
🚫 Your ratings (👍 / 👎 with reasons) are scoped to your account. Nobody else sees them.
🚫 Your opportunity shortlist is computed for you specifically; it’s not visible to other customers.

In aggregate, your ratings do help improve the matcher that all customers benefit from (more good ratings, better recommendations for everyone), but no individual rating is ever visible to anyone outside your account.

Tips 💡

📅 When in doubt about a row, glance at Checked. A row checked yesterday is more trustworthy than one checked three months ago.
🚧 If a competitor’s site is missing from the Broken Link Database, they may be blocking our crawler. That’s their call, not a bug on our end.
🧠 The freshness of your opportunities depends on how recently your audit ran. Trigger a re-audit (or wait for the scheduled refresh) if you’ve published a lot of new content recently.
📩 Have a question about a specific row? [email protected].