got robots.txt ?
Dark Visitors - list of known AI (and other) agents on the internet : ‘the hidden ecosystem of autonomous chatbots and data scrapers’ https://darkvisitors.com/ #DarkVisitors #WebCrawlers #WorldWideWeb
"It’s pretty crazy that not only a) these bots shamelessly harvest all your data without asking for permission and b) they do it in such a brute-force manner.
My coworker and security expert António pointed me to #DarkVisitors, and I’ll probably be installing their #WordPressPlugin on all my sites. For what it’s worth."
@john_fisherman on #AIscraping
https://fred-rocha.medium.com/ai-crawler-bots-on-the-hunt-caf5a59ff478
The automatic #robots.txt generation from #darkvisitors only creates a file with 23 records. What about all the other agents, dozens or hundreds of them, on the #agents list?
```
curl -qs -X POST https://api.darkvisitors.com/robots-txts \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "agent_types": [
      "AI Assistant",
      "AI Data Scraper",
      "AI Search Crawler",
      "Undocumented AI Agent"
    ],
    "disallow": "/"
  }'
```
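For what it's worth, you can save the generated file and count the records yourself. A minimal sketch using the same request (assuming the API returns the robots.txt body directly, and counting records as `User-agent:` lines):
```
# Save the generated robots.txt and count its User-agent records
curl -qs -X POST https://api.darkvisitors.com/robots-txts \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{"agent_types": ["AI Assistant", "AI Data Scraper", "AI Search Crawler", "Undocumented AI Agent"], "disallow": "/"}' \
  -o robots.txt

grep -c '^User-agent:' robots.txt
```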
Anyone else seen that behaviour?
So I added the Dark Visitors plugin to my website this weekend.
What’s neat is seeing all the different bots/agents visiting the site that I wasn’t seeing in other analytics tools.
You might be familiar with what I'm terming the "Token Wars" - in which #LLM and #GenAI companies seek to ingest text, image, audio and video content to create their #ML models. Tokens are the basic unit of data input into these models - meaning that #scraping of web content is widespread.
In response, many sites - such as Reddit, Inc. and Stack Overflow - are entering into content-sharing deals with companies like OpenAI, or making their sites subscription-only.
Another solution that has emerged recently is content blocking based on user agent. In web programming, the client requesting a web page identifies itself - usually as a browser or a bot.
User agents can be disallowed in a website's robots.txt file - but only if the agent respects the robots.txt protocol, and many web scrapers do not. Taking this a step further, network providers like Cloudflare now offer solutions that block known token-scraper bots at the network level.
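For illustration, a couple of entries in such a robots.txt might look like this (GPTBot and CCBot are two widely documented AI crawler user agents; a generated file would list many more):
```
# Ask two well-known AI crawlers to stay away from the whole site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```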
I've been playing with one of these solutions, #DarkVisitors, for a couple of weeks after learning about it on The Sizzle, and was **amazed** at how much of the traffic to my websites was bots, crawlers and content scrapers.
(No backhanders here, it's just a very insightful tool)
WTF happened last night? Dark Visitors recorded nearly 1,200 hits from Mastodon instances fetching Open Graph data on my blog
#Mastodon #DarkVisitors #OpenGraph
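For context: when a link federates across Mastodon, every receiving instance fetches the page itself to build its own preview card, so one popular post can mean hundreds of near-simultaneous hits. One possible mitigation is rate-limiting those fetchers at the web server; a rough sketch, assuming nginx and that the instances can be matched on the "Mastodon" substring in their User-Agent (the zone name and paths here are made up):
```
# Inside the http {} block: map Mastodon-like user agents to a rate-limit key.
# Requests whose key is empty (everyone else) are exempt from the limit.
map $http_user_agent $mastodon_fetcher {
    default      "";
    ~*mastodon   $binary_remote_addr;
}

limit_req_zone $mastodon_fetcher zone=og_fetch:10m rate=1r/s;

server {
    listen 80;
    root /var/www/blog;

    location / {
        # Allow short bursts, then throttle preview-fetch stampedes
        limit_req zone=og_fetch burst=10 nodelay;
    }
}
```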