got robots.txt ?
Dark Visitors - list of known AI (and other) agents on the internet : ‘the hidden ecosystem of autonomous chatbots and data scrapers’ https://darkvisitors.com/ #DarkVisitors #WebCrawlers #WorldWideWeb
"It’s pretty crazy that not only a) these bots shamelessly harvest all your data without asking for permission and b) they do it in such a brute-force manner.
My coworker and security expert António pointed me to #DarkVisitors, and I’ll probably be installing their #WordPressPlugin on all my sites. For what it’s worth."
@john_fisherman on #AIscraping
https://fred-rocha.medium.com/ai-crawler-bots-on-the-hunt-caf5a59ff478
The automatic #robots.txt generation from #darkvisitors only creates a file with 23 records. What about all the other agents, dozens or hundreds of them, on the #agents list?
```
curl -qs -X POST https://api.darkvisitors.com/robots-txts \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "agent_types": [
      "AI Assistant",
      "AI Data Scraper",
      "AI Search Crawler",
      "Undocumented AI Agent"
    ],
    "disallow": "/"
  }'
```
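For what it's worth, you can save the generated file and count the records yourself. A minimal sketch using the same request (assuming the API returns the robots.txt body directly, and counting records as `User-agent:` lines):
```
# Save the generated robots.txt and count its User-agent records
curl -qs -X POST https://api.darkvisitors.com/robots-txts \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{"agent_types": ["AI Assistant", "AI Data Scraper", "AI Search Crawler", "Undocumented AI Agent"], "disallow": "/"}' \
  -o robots.txt

grep -c '^User-agent:' robots.txt
```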
Anyone else seen that behaviour?
So I added the Dark Visitors plugin to my website this weekend.
What’s neat is seeing all the different bots/agents visiting the site that I wasn’t seeing in other analytics tools.
You might be familiar with what I'm terming the "Token Wars" - in which #LLM and #GenAI companies seek to ingest text, image, audio and video content to create their #ML models. Tokens are the basic unit of data input into these models - meaning that #scraping of web content is widespread.
In response, many sites - such as Reddit, Inc. and Stack Overflow - are entering into content-sharing deals with companies like OpenAI, or making their sites subscription-only.
Another solution that has emerged recently is content blocking based on user agent. In web programming, the client requesting a web page identifies itself - usually as a browser or a bot.
User agents can be disallowed in a website's robots.txt file - but only if the agent respects the robots.txt protocol, and many web scrapers do not. Taking this a step further, network providers like Cloudflare now offer solutions that block known token-scraper bots at the network level.
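For illustration, a couple of entries in such a robots.txt might look like this (GPTBot and CCBot are two widely documented AI crawler user agents; a generated file would list many more):
```
# Ask two well-known AI crawlers to stay away from the whole site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```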
I've been playing with one of these solutions, #DarkVisitors, for a couple of weeks after learning about it on The Sizzle, and was **amazed** at how much of the traffic to my websites was bots, crawlers and content scrapers.
(No backhanders here, it's just a very insightful tool)
WTF happened last night? Dark Visitors recorded nearly 1,200 hits from Mastodon instances fetching Open Graph data on my blog
#Mastodon #DarkVisitors #OpenGraph
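For context: when a link federates across Mastodon, every receiving instance fetches the page itself to build its own preview card, so one popular post can mean hundreds of near-simultaneous hits. One possible mitigation is rate-limiting those fetchers at the web server; a rough sketch, assuming nginx and that the instances can be matched on the "Mastodon" substring in their User-Agent (the zone name and paths here are made up):
```
# Inside the http {} block: map Mastodon-like user agents to a rate-limit key.
# Requests whose key is empty (everyone else) are exempt from the limit.
map $http_user_agent $mastodon_fetcher {
    default      "";
    ~*mastodon   $binary_remote_addr;
}

limit_req_zone $mastodon_fetcher zone=og_fetch:10m rate=1r/s;

server {
    listen 80;
    root /var/www/blog;

    location / {
        # Allow short bursts, then throttle preview-fetch stampedes
        limit_req zone=og_fetch burst=10 nodelay;
    }
}
```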