AI Crawler Tracking
Make AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Googlebot and more) show up under Crawler activity. Most of these crawlers fetch the HTML without running JavaScript, so your tracking script can't see them. You catch them server-side instead, with one small feeder.
Why a feeder is needed
The tracking script in your page only fires for real people, because it needs a browser running JavaScript. Robots like GPTBot and ClaudeBot just grab the HTML and leave, so the script never runs and they stay invisible.
The feeder runs on the server (or a CDN in front of it), one step earlier than the browser. It spots bot requests and quietly reports them to the tracker. Real human page views are never forwarded, so there's no privacy impact and no added latency.
The one rule
Every method does the same thing: when a request's User-Agent looks like a bot, send one non-blocking POST to:
https://YOUR-HOST/ingest-logs?t=YOUR_KEY
with a JSON array of one hit (only url and ua are required):
[{ "url": "/some/path", "ua": "<user-agent>", "ip": "1.2.3.4",
   "referer": "", "country": "", "ts": "2026-06-21T10:00:00Z" }]
Your two values: both come from the snippet you already installed. YOUR-HOST is the host your beacon loads from (the data-key subdomain); YOUR_KEY is the data-key value. The dashboard's Copy AI setup prompt button pre-fills these for you.
Setup by platform
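Before the platform-specific snippets, here is a minimal, framework-free sketch of the one rule in Python. It only illustrates the shape of the payload and the request. The function names (looks_like_bot, build_hit, build_request) are made up for this example, the bot pattern is shortened on purpose, and ENDPOINT / KEY are placeholders for your own host and key.

```python
import json
import re
import urllib.request
from datetime import datetime, timezone

# Placeholders: substitute the host and key from your installed snippet.
ENDPOINT = "https://YOUR-HOST"
KEY = "YOUR_KEY"

# Shortened pattern for illustration; the platform snippets use a longer list.
BOT = re.compile(r"bot|crawl|spider|slurp|anthropic|Claude-|Perplexity", re.I)


def looks_like_bot(ua: str) -> bool:
    """Return True when the User-Agent matches the bot pattern."""
    return bool(ua) and BOT.search(ua) is not None


def build_hit(url: str, ua: str, ip: str = "", referer: str = "", country: str = "") -> list:
    """Build the JSON array of one hit; only url and ua are required."""
    return [{
        "url": url,
        "ua": ua,
        "ip": ip,
        "referer": referer,
        "country": country,
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }]


def build_request(payload: list) -> urllib.request.Request:
    """Prepare the POST to /ingest-logs without sending it."""
    return urllib.request.Request(
        ENDPOINT + "/ingest-logs?t=" + KEY,
        data=json.dumps(payload).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
```

To actually send the hit you would pass the prepared request to urllib.request.urlopen(req, timeout=1) from a background thread or task, so the visitor's response never waits on the tracker.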
WordPress
Upload the must-use plugin to wp-content/mu-plugins/ (create the folder if it doesn't exist). It auto-activates and forwards only bot requests. Set your host/key with MABLG_AI_ENDPOINT / MABLG_AI_KEY in wp-config.php if needed.
Cloudflare (works for any backend, incl. Shopify behind Cloudflare)
Deploy as a Worker on the site's zone with a route like example.com/*. This is the most complete option: it even sees cached pages that never reach the origin.
const ENDPOINT = 'https://YOUR-HOST'; // your tracker host (the data-key subdomain)
const KEY = 'mbai_xxxxxxxxxxxxxxxxxxxx'; // your write-key
const BOT = /bot|crawl|spider|slurp|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|CCBot|Meta-External|facebookexternalhit|python-requests|axios|node-fetch|Go-http|curl|wget|HeadlessChrome|scrapy/i;
export default {
  async fetch(request, env, ctx) {
    const ua = request.headers.get('user-agent') || '';
    if (BOT.test(ua)) {
      const url = new URL(request.url);
      const payload = [{
        url: url.pathname,
        ua,
        ip: request.headers.get('cf-connecting-ip') || '',
        referer: request.headers.get('referer') || '',
        country: request.headers.get('cf-ipcountry') || '',
        ts: new Date().toISOString(),
      }];
      // waitUntil lets the report finish after the response has been returned
      ctx.waitUntil(
        fetch(ENDPOINT + '/ingest-logs?t=' + KEY, {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify(payload),
        }).catch(function () {})
      );
    }
    return fetch(request); // always pass through to origin
  },
};
Next.js (middleware.ts)
import { NextResponse } from 'next/server';
import type { NextFetchEvent, NextRequest } from 'next/server';
const ENDPOINT = 'https://YOUR-HOST';
const KEY = 'mbai_xxxxxxxxxxxxxxxxxxxx';
const BOT = /bot|crawl|spider|slurp|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|CCBot|Meta-External|facebookexternalhit|python-requests|axios|node-fetch|Go-http|curl|wget|HeadlessChrome|scrapy/i;
// Typed parameters so the file compiles under strict TypeScript
export function middleware(req: NextRequest, event: NextFetchEvent) {
  const ua = req.headers.get('user-agent') || '';
  if (BOT.test(ua)) {
    const payload = [{
      url: req.nextUrl.pathname,
      ua,
      // x-forwarded-for can hold a comma-separated chain; keep the first (client) address
      ip: (req.headers.get('cf-connecting-ip') || req.headers.get('x-forwarded-for') || '')
        .split(',')[0].trim(),
      referer: req.headers.get('referer') || '',
      country: req.headers.get('cf-ipcountry') || '',
      ts: new Date().toISOString(),
    }];
    // waitUntil keeps the report running after the response is returned
    event.waitUntil(
      fetch(ENDPOINT + '/ingest-logs?t=' + KEY, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload),
      }).catch(function () {})
    );
  }
  return NextResponse.next();
}
// Skip Next.js internals, the favicon, and any path containing a dot (static files)
export const config = { matcher: ['/((?!_next/|favicon.ico|.*\\.).*)'] };
Express / Node
const ENDPOINT = 'https://YOUR-HOST';
const KEY = 'mbai_xxxxxxxxxxxxxxxxxxxx';
const BOT = /bot|crawl|spider|slurp|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|CCBot|Meta-External|facebookexternalhit|python-requests|axios|node-fetch|Go-http|curl|wget|HeadlessChrome|scrapy/i;
// Register before your routes so every request passes through it
app.use(function (req, res, next) {
  const ua = req.get('user-agent') || '';
  if ((req.method === 'GET' || req.method === 'HEAD') && BOT.test(ua)) {
    const payload = [{
      url: req.originalUrl.split('?')[0],
      ua,
      ip: (req.headers['cf-connecting-ip'] || req.headers['x-forwarded-for'] || req.socket.remoteAddress || '')
        .toString().split(',')[0].trim(),
      referer: req.get('referer') || '',
      ts: new Date().toISOString(),
    }];
    // Not awaited, so the request is never held up; Node 18+ has global fetch
    fetch(ENDPOINT + '/ingest-logs?t=' + KEY, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(payload),
    }).catch(function () {});
  }
  next();
});
Laravel (terminable middleware)
<?php
// app/Http/Middleware/AiCrawlerBeacon.php
namespace App\Http\Middleware;
use Illuminate\Support\Facades\Http;
class AiCrawlerBeacon
{
    private const ENDPOINT = 'https://YOUR-HOST';
    private const KEY = 'mbai_xxxxxxxxxxxxxxxxxxxx';
    private const BOT = '/bot|crawl|spider|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|CCBot|python-requests|curl|wget|HeadlessChrome|scrapy/i';
    public function handle($request, \Closure $next)
    {
        return $next($request);
    }
    // terminate() runs after the response has been sent when PHP runs under
    // FastCGI (e.g. PHP-FPM), so it does not hold up the page.
    public function terminate($request, $response): void
    {
        $ua = (string) $request->userAgent();
        if ($ua === '' || !preg_match(self::BOT, $ua)) return;
        try {
            // The nested array is sent as a JSON array of one hit
            Http::timeout(1)->post(self::ENDPOINT . '/ingest-logs?t=' . self::KEY, [[
                'url' => $request->getPathInfo(),
                'ua' => $ua,
                'ip' => $request->ip(),
                'referer' => (string) $request->headers->get('referer', ''),
                'ts' => now()->toIso8601String(),
            ]]);
        } catch (\Throwable $e) { /* ignore */ }
    }
}
Django
Add aicrawler.middleware.AiCrawlerMiddleware to MIDDLEWARE in settings.
# aicrawler/middleware.py
import re, json, threading, urllib.request
from datetime import datetime, timezone
ENDPOINT = "https://YOUR-HOST"
KEY = "mbai_xxxxxxxxxxxxxxxxxxxx"
BOT = re.compile(r"bot|crawl|spider|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|"
                 r"Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|"
                 r"CCBot|python-requests|curl|wget|HeadlessChrome|scrapy", re.I)
def _send(payload):
    # Runs in a background thread; any failure is swallowed on purpose
    try:
        req = urllib.request.Request(
            ENDPOINT + "/ingest-logs?t=" + KEY,
            data=json.dumps(payload).encode(),
            headers={"content-type": "application/json"}, method="POST")
        urllib.request.urlopen(req, timeout=1)
    except Exception:
        pass
class AiCrawlerMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response
    def __call__(self, request):
        ua = request.META.get("HTTP_USER_AGENT", "")
        if ua and BOT.search(ua):
            payload = [{
                "url": request.path,
                "ua": ua,
                "ip": request.META.get("HTTP_CF_CONNECTING_IP") or request.META.get("REMOTE_ADDR", ""),
                "referer": request.META.get("HTTP_REFERER", ""),
                "ts": datetime.now(timezone.utc).isoformat(),
            }]
            # Daemon thread so the report never delays the response
            threading.Thread(target=_send, args=(payload,), daemon=True).start()
        return self.get_response(request)
nginx / Apache / Caddy (ship access logs)
If you can't add code but own the web server, ship a JSON access log. Filter to bot User-Agents before sending to keep volume down.
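The filter-and-map step can be sketched in Python like this. It assumes each log line is one JSON object with the fields time, remote_addr, request_uri, http_user_agent and http_referer (the names used by the nginx format in this section). The function name hits_from_log_lines is made up for this example and the bot pattern is shortened.

```python
import json
import re

# Shortened pattern for illustration; reuse the full list from the snippets above in practice.
BOT = re.compile(r"bot|crawl|spider|slurp|anthropic|Claude-|Perplexity", re.I)


def hits_from_log_lines(lines):
    """Turn JSON access-log lines into hit dicts, keeping only bot User-Agents."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict):
            continue
        ua = entry.get("http_user_agent", "")
        if not ua or not BOT.search(ua):
            continue  # human traffic is never forwarded
        hits.append({
            "url": entry.get("request_uri", "").split("?")[0],  # drop the query string
            "ua": ua,
            "ip": entry.get("remote_addr", ""),
            "referer": entry.get("http_referer", ""),
            "ts": entry.get("time", ""),
        })
    return hits
```

This page only describes a JSON array of one hit per POST, so a cautious shipper would wrap each returned dict in its own one-element array rather than assume larger batches are accepted.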
log_format ai_json escape=json '{'
'"time":"$time_iso8601",'
'"remote_addr":"$remote_addr",'
'"request_uri":"$request_uri",'
'"http_user_agent":"$http_user_agent",'
'"http_referer":"$http_referer"'
'}';
access_log /var/log/nginx/ai.log ai_json;
# then ship bot lines to ENDPOINT/ingest-logs?t=KEY with Vector / Fluent Bit / a cron job
Shopify, Wix, Squarespace, Webflow
These platforms don't let you run server-side code on page requests and don't give you raw request logs, so from inside the platform you genuinely can't see a crawler hitting a page. Two realistic options:
- Put Cloudflare in front of the store (custom domain through Cloudflare → the Worker above). Complete, but a setup step, and leave /checkout unproxied on Shopify.
- Accept beacon-only. You still get the human half: shoppers arriving from a ChatGPT / Perplexity / Gemini answer, plus the AI-influence estimate. Crawler activity just stays thin for those sites.
Verify it works
Trigger a test hit (replace host/key):
curl -s "https://YOUR-HOST/ingest-logs?t=YOUR_KEY" \
-H 'content-type: application/json' \
-d '[{"url":"/install-test","ua":"Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)","ip":"20.171.207.1"}]'
# -> {"accepted":1}
Within ~1 minute it appears under Crawler activity. A real bot UA from a non-vendor IP is correctly shown as spoofed, not a verified crawler; that's the IP verification doing its job.
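This page doesn't describe how the tracker verifies IPs. One common approach is to match the source address against the CIDR ranges a crawler vendor publishes, and the sketch below shows that idea only. The classify function and the VENDOR_RANGES table are made up for this example, and the ranges are placeholders from documentation address space (RFC 5737), not real vendor ranges.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation space), NOT real vendor ranges.
VENDOR_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def classify(ua: str, ip: str) -> str:
    """Return 'verified', 'spoofed', or 'unknown' for a claimed crawler."""
    for name, cidrs in VENDOR_RANGES.items():
        if name.lower() in ua.lower():
            try:
                addr = ipaddress.ip_address(ip)
            except ValueError:
                return "spoofed"  # a missing or malformed IP cannot be verified
            if any(addr in ipaddress.ip_network(cidr) for cidr in cidrs):
                return "verified"
            return "spoofed"
    return "unknown"  # the UA does not claim to be a listed crawler
```

Under these placeholder ranges, a GPTBot User-Agent arriving from 192.0.2.10 would count as verified, while the same User-Agent from 203.0.113.5 would count as spoofed.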