Perplexity’s Stealth Crawlers: Evading Website Directives and Eroding Trust

Perplexity’s Stealth Crawlers: A Breach of Trust

Cloudflare reports that Perplexity, an AI-powered answer engine, is using stealth, undeclared crawlers to bypass website no-crawl directives. This behavior, observed across millions of daily requests and tens of thousands of domains, directly contradicts established internet norms for bot transparency and respect for website preferences. Cloudflare has de-listed Perplexity as a verified bot and implemented new rules to block this deceptive crawling activity. The episode highlights a critical challenge to the foundational trust of the internet as AI models increasingly rely on web content.

Key points

Perplexity, an AI-powered answer engine, is using stealth, undeclared crawlers to evade website no-crawl directives.
When its declared user agents (PerplexityBot and Perplexity-User) are blocked, Perplexity obscures its crawling identity.
Perplexity repeatedly modifies its user agent, changes source ASNs, and ignores or fails to fetch robots.txt files.
Cloudflare customers reported Perplexity accessing content despite explicit robots.txt disallows and WAF rules blocking declared crawlers.
Tests on new, unindexed domains confirmed that Perplexity provided detailed information about restricted content.
Perplexity uses a generic browser user agent (e.g., Google Chrome on macOS) for stealth crawling, alongside its declared user agents.
The stealth crawler rotates through multiple undisclosed IPs, spread across different ASNs, to evade blocks.
This activity spans tens of thousands of domains and millions of requests daily, fingerprinted by Cloudflare using machine learning and network signals.
OpenAI is cited as an example of an AI company that adheres to best practices, respecting robots.txt and network-level blocks.
Cloudflare has de-listed Perplexity as a verified bot and added heuristics to its managed rules to block this stealth crawling.
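
To make the mechanism in the points above concrete, here is a minimal sketch of why robots.txt rules that target declared user agents do nothing against a crawler that presents itself as an ordinary browser. It uses Python's standard `urllib.robotparser`. The robots.txt content, the example URL, and the Chrome-on-macOS user agent string are illustrative assumptions, not values taken from Cloudflare's findings, and the helper names `build_parser` and `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows both declared Perplexity user agents.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
"""

# Illustrative generic browser string (Chrome on macOS). The exact value is an
# assumption made for this sketch.
GENERIC_CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(
    parser: RobotFileParser,
    user_agent: str,
    url: str = "https://example.com/private/page.html",  # placeholder URL
) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for ua in ("PerplexityBot", "Perplexity-User", GENERIC_CHROME_UA):
        print(f"{ua[:30]!r}: allowed={is_allowed(parser, ua)}")
```

Under these assumptions, the two declared agents are refused while the generic browser string is allowed, because no rule names it. A block keyed only on the user agent therefore depends on the crawler identifying itself honestly, which is why Cloudflare describes fingerprinting the traffic with machine learning and network signals instead.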

Takeaway

So, Perplexity decided to play hide-and-seek with website owners, apparently thinking “robots.txt” was merely a suggestion for less ambitious bots. It seems it missed the memo that the internet is built on trust, not on who can sneak past the digital bouncer with the most convincing fake ID. While its intention may have been to bring us more “answers,” it has inadvertently highlighted the importance of good manners in the digital realm. Perhaps a refresher course in “Internet Etiquette 101” is in order for our AI friends, because, let’s be honest, nobody likes a digital Peeping Tom, no matter how intelligent it claims to be.

Sources