AI Crawler Tracking

Make AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Googlebot and more) show up under Crawler activity. Most crawlers fetch the HTML without running JavaScript, so your tracking script can't see them. You catch them server-side instead, with one small feeder.

Why a feeder is needed

The tracking script in your page only fires when a browser runs its JavaScript, which in practice means real people. Robots like GPTBot and ClaudeBot just grab the HTML and leave, so the script never runs and they stay invisible.

The feeder runs on the server (or a CDN in front of it), one step earlier than the browser. It spots bot requests by their User-Agent and reports them to the tracker in the background. Requests that don't match the bot pattern are never forwarded, so human page views add no privacy exposure, and because the report is non-blocking it adds no latency to the response.

The one rule

Every method does the same thing: when a request's User-Agent looks like a bot, send one non-blocking POST to:

https://YOUR-HOST/ingest-logs?t=YOUR_KEY

with a JSON array of one hit (only url and ua are required):

[{ "url": "/some/path", "ua": "<user-agent>", "ip": "1.2.3.4",
   "referer": "", "country": "", "ts": "2026-06-21T10:00:00Z" }]

Your two values: both come from the snippet you already installed. YOUR-HOST is the host your beacon loads from (the data-key subdomain); YOUR_KEY is the data-key value. The dashboard's Copy AI setup prompt button pre-fills these for you.
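
As a framework-free illustration of the rule, the sketch below shows the two decisions every method makes: does the User-Agent look like a bot, and what does the one-hit JSON body look like. It is a minimal sketch, not the tracker's own code: the helper names (is_bot, build_body, ingest_url) are hypothetical, and the pattern is a shortened version of the one used in the platform snippets.

```python
import json
import re
from datetime import datetime, timezone

# Placeholders from the text above; replace with your own values.
ENDPOINT = "https://YOUR-HOST"
KEY = "YOUR_KEY"

# Shortened, illustrative pattern; the platform snippets use a longer list.
BOT = re.compile(r"bot|crawl|spider|anthropic|Claude-|Perplexity|curl|wget", re.I)


def is_bot(ua: str) -> bool:
    """Return True when the User-Agent matches the bot pattern."""
    return bool(ua) and BOT.search(ua) is not None


def build_body(path: str, ua: str, ip: str = "", referer: str = "",
               country: str = "") -> bytes:
    """Encode a single hit as the JSON array described above."""
    hit = {
        "url": path,  # required
        "ua": ua,     # required
        "ip": ip,
        "referer": referer,
        "country": country,
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps([hit]).encode("utf-8")


def ingest_url() -> str:
    """Build the POST target from the host and key."""
    return f"{ENDPOINT}/ingest-logs?t={KEY}"
```

A feeder calls is_bot() on each request and, only when it returns True, POSTs build_body(...) to ingest_url() from a background task so the response is never held up.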

Setup by platform

WordPress

Upload the must-use plugin to wp-content/mu-plugins/ (create the folder if it doesn't exist). It auto-activates and forwards only bot requests. Set your host/key with MABLG_AI_ENDPOINT / MABLG_AI_KEY in wp-config.php if needed.

Cloudflare (works for any backend, incl. Shopify behind Cloudflare)

Deploy as a Worker on the site's zone with a route like example.com/*. This is the most complete option: it even sees cached pages that never reach the origin.

const ENDPOINT = 'https://YOUR-HOST';      // your tracker host (the data-key subdomain)
const KEY = 'mbai_xxxxxxxxxxxxxxxxxxxx';   // your write-key
const BOT = /bot|crawl|spider|slurp|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|CCBot|Meta-External|facebookexternalhit|python-requests|axios|node-fetch|Go-http|curl|wget|HeadlessChrome|scrapy/i;

export default {
  async fetch(request, env, ctx) {
    const ua = request.headers.get('user-agent') || '';
    if (BOT.test(ua)) {
      const url = new URL(request.url);
      const payload = [{
        url: url.pathname,
        ua,
        ip: request.headers.get('cf-connecting-ip') || '',
        referer: request.headers.get('referer') || '',
        country: request.headers.get('cf-ipcountry') || '',
        ts: new Date().toISOString(),
      }];
      ctx.waitUntil(
        fetch(ENDPOINT + '/ingest-logs?t=' + KEY, {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify(payload),
        }).catch(function () {})
      );
    }
    return fetch(request); // always pass through to origin
  },
};

Next.js (middleware.ts)

import { NextResponse } from 'next/server';
import type { NextFetchEvent, NextRequest } from 'next/server';

const ENDPOINT = 'https://YOUR-HOST';
const KEY = 'mbai_xxxxxxxxxxxxxxxxxxxx';
const BOT = /bot|crawl|spider|slurp|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|CCBot|Meta-External|facebookexternalhit|python-requests|axios|node-fetch|Go-http|curl|wget|HeadlessChrome|scrapy/i;

// Typed parameters so the file compiles under TypeScript strict mode.
export function middleware(req: NextRequest, event: NextFetchEvent) {
  const ua = req.headers.get('user-agent') || '';
  if (BOT.test(ua)) {
    const payload = [{
      url: req.nextUrl.pathname,
      ua,
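      // x-forwarded-for can hold a comma-separated chain; keep only the first (client) address.
      ip: req.headers.get('cf-connecting-ip') || (req.headers.get('x-forwarded-for') || '').split(',')[0].trim(),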
      referer: req.headers.get('referer') || '',
      country: req.headers.get('cf-ipcountry') || '',
      ts: new Date().toISOString(),
    }];
    event.waitUntil(
      fetch(ENDPOINT + '/ingest-logs?t=' + KEY, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(payload),
      }).catch(function () {})
    );
  }
  return NextResponse.next();
}

export const config = { matcher: ['/((?!_next/|favicon.ico|.*\\.).*)'] };
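
To see which paths the matcher above lets through, the sketch below approximates it with an ordinary regular expression. This is an assumption-laden illustration: Next.js compiles matchers with its own path matching, so treat the Python pattern as a close stand-in rather than the exact compiled form, and the helper name middleware_runs is hypothetical.

```python
import re

# Same pattern as the matcher string, anchored to the whole path.
MATCHER = re.compile(r"/((?!_next/|favicon.ico|.*\.).*)")


def middleware_runs(path: str) -> bool:
    """Return True when the path would reach the middleware."""
    return MATCHER.fullmatch(path) is not None


for path in ["/", "/blog/post", "/_next/static/chunk.js", "/logo.png", "/docs/v1.2/intro"]:
    print(path, middleware_runs(path))
```

Note the last case: any path containing a dot is skipped, not only static files, so pages with dotted URL segments would need a narrower matcher to be tracked.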

Express / Node

const ENDPOINT = 'https://YOUR-HOST';
const KEY = 'mbai_xxxxxxxxxxxxxxxxxxxx';
const BOT = /bot|crawl|spider|slurp|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|CCBot|Meta-External|facebookexternalhit|python-requests|axios|node-fetch|Go-http|curl|wget|HeadlessChrome|scrapy/i;

app.use(function (req, res, next) {
  const ua = req.get('user-agent') || '';
  if ((req.method === 'GET' || req.method === 'HEAD') && BOT.test(ua)) {
    const payload = [{
      url: req.originalUrl.split('?')[0],
      ua,
      ip: (req.headers['cf-connecting-ip'] || req.headers['x-forwarded-for'] || req.socket.remoteAddress || '')
        .toString().split(',')[0].trim(),
      referer: req.get('referer') || '',
      ts: new Date().toISOString(),
    }];
    fetch(ENDPOINT + '/ingest-logs?t=' + KEY, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(payload),
    }).catch(function () {}); // Node 18+ has global fetch
  }
  next();
});

Laravel (terminable middleware)

<?php
// app/Http/Middleware/AiCrawlerBeacon.php
namespace App\Http\Middleware;

use Illuminate\Support\Facades\Http;

class AiCrawlerBeacon
{
    private const ENDPOINT = 'https://YOUR-HOST';
    private const KEY = 'mbai_xxxxxxxxxxxxxxxxxxxx';
    private const BOT = '/bot|crawl|spider|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|CCBot|python-requests|curl|wget|HeadlessChrome|scrapy/i';

    public function handle($request, \Closure $next) { return $next($request); }

    // Under FastCGI (e.g. PHP-FPM), terminate() runs after the response has been sent,
    // so it doesn't delay the page; the 1-second timeout bounds the cost elsewhere.
    public function terminate($request, $response): void
    {
        $ua = (string) $request->userAgent();
        if ($ua === '' || !preg_match(self::BOT, $ua)) return;
        try {
            Http::timeout(1)->post(self::ENDPOINT . '/ingest-logs?t=' . self::KEY, [[
                'url'     => $request->getPathInfo(),
                'ua'      => $ua,
                'ip'      => $request->ip(),
                'referer' => (string) $request->headers->get('referer', ''),
                'ts'      => now()->toIso8601String(),
            ]]);
        } catch (\Throwable $e) { /* ignore */ }
    }
}

Django

Add aicrawler.middleware.AiCrawlerMiddleware to MIDDLEWARE in settings.

# aicrawler/middleware.py
import re, json, threading, urllib.request
from datetime import datetime, timezone

ENDPOINT = "https://YOUR-HOST"
KEY = "mbai_xxxxxxxxxxxxxxxxxxxx"
BOT = re.compile(r"bot|crawl|spider|GPTBot|ClaudeBot|anthropic|Claude-|Perplexity|"
                 r"Google-Extended|Googlebot|Bingbot|Applebot|Amazonbot|Bytespider|"
                 r"CCBot|python-requests|curl|wget|HeadlessChrome|scrapy", re.I)

def _send(payload):
    try:
        req = urllib.request.Request(
            ENDPOINT + "/ingest-logs?t=" + KEY,
            data=json.dumps(payload).encode(),
            headers={"content-type": "application/json"}, method="POST")
        urllib.request.urlopen(req, timeout=1)
    except Exception:
        pass

class AiCrawlerMiddleware:
    def __init__(self, get_response): self.get_response = get_response
    def __call__(self, request):
        ua = request.META.get("HTTP_USER_AGENT", "")
        if ua and BOT.search(ua):
            payload = [{
                "url": request.path,
                "ua": ua,
                "ip": request.META.get("HTTP_CF_CONNECTING_IP") or request.META.get("REMOTE_ADDR", ""),
                "referer": request.META.get("HTTP_REFERER", ""),
                "ts": datetime.now(timezone.utc).isoformat(),
            }]
            threading.Thread(target=_send, args=(payload,), daemon=True).start()
        return self.get_response(request)

nginx / Apache / Caddy (ship access logs)

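If you can't add code but own the web server, ship a JSON access log. Filter to bot User-Agents before sending to keep volume down. The example is for nginx, where log_format belongs in the http block; Apache and Caddy can emit equivalent JSON logs with their own directives.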

log_format ai_json escape=json '{'
  '"time":"$time_iso8601",'
  '"remote_addr":"$remote_addr",'
  '"request_uri":"$request_uri",'
  '"http_user_agent":"$http_user_agent",'
  '"http_referer":"$http_referer"'
'}';
access_log /var/log/nginx/ai.log ai_json;
# then ship bot lines to ENDPOINT/ingest-logs?t=KEY with Vector, Fluent Bit, or a cron job
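
For the cron route, the filtering step might look like the sketch below. It reads lines in the ai_json format, keeps only bot User-Agents, and renames the fields to the url / ua / ip / referer / ts names from the hit format. That renaming is an assumption: if the endpoint accepts the raw log field names, the mapping isn't needed. The function names and the shortened pattern are hypothetical.

```python
import json
import re

# Shortened, illustrative pattern; reuse the full list from the platform snippets.
BOT = re.compile(r"bot|crawl|spider|anthropic|Claude-|Perplexity|curl|wget", re.I)


def line_to_hit(line: str):
    """Map one ai_json log line to a hit dict, or None if it should be skipped."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None  # partial or corrupt line
    if not isinstance(entry, dict):
        return None
    ua = entry.get("http_user_agent", "")
    if not ua or not BOT.search(ua):
        return None  # not a bot: never forwarded
    return {
        "url": entry.get("request_uri", "").split("?")[0],  # drop the query string
        "ua": ua,
        "ip": entry.get("remote_addr", ""),
        "referer": entry.get("http_referer", ""),
        "ts": entry.get("time", ""),
    }


def lines_to_hits(lines):
    """Return the bot hits found in an iterable of log lines."""
    hits = (line_to_hit(line) for line in lines)
    return [hit for hit in hits if hit is not None]
```

Each returned hit can then be POSTed as a one-element JSON array, the same way the other methods do.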

Shopify, Wix, Squarespace, Webflow

These platforms don't let you run server-side code on page requests and don't give you raw request logs, so from inside the platform you genuinely can't see a crawler hitting a page. Two realistic options:

  1. Put Cloudflare in front of the store (custom domain through Cloudflare → the Worker above). This gives complete coverage but takes an extra setup step; on Shopify, leave /checkout unproxied.
  2. Accept beacon-only. You still get the human half: shoppers arriving from a ChatGPT / Perplexity / Gemini answer, plus the AI-influence estimate. Crawler activity just stays thin for those sites.

Verify it works

Trigger a test hit (replace host/key):

curl -s "https://YOUR-HOST/ingest-logs?t=YOUR_KEY" \
  -H 'content-type: application/json' \
  -d '[{"url":"/install-test","ua":"Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)","ip":"20.171.207.1"}]'
# -> {"accepted":1}

Within about a minute the hit appears under Crawler activity. A genuine bot User-Agent sent from an IP outside the vendor's ranges is shown as spoofed rather than as a verified crawler; that's the IP verification doing its job.
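
How the tracker verifies IPs isn't described here, but the general idea can be sketched as a range check: a User-Agent that names a crawler only counts as verified when the request IP falls inside that vendor's published ranges. Everything in the block is an assumption for illustration, including the classify helper and the ranges, which are reserved documentation blocks and not real vendor addresses.

```python
import ipaddress

# HYPOTHETICAL ranges (RFC 5737 documentation blocks), not real vendor IPs.
# Real ranges are published by each crawler operator and change over time.
VENDOR_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],
    "ClaudeBot": [ipaddress.ip_network("198.51.100.0/24")],
}


def classify(ua: str, ip: str) -> str:
    """Return 'verified', 'spoofed', or 'unknown' for a request."""
    for name, networks in VENDOR_RANGES.items():
        if name.lower() not in ua.lower():
            continue
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            return "unknown"  # missing or malformed IP: nothing to check against
        in_range = any(addr in net for net in networks)
        return "verified" if in_range else "spoofed"
    return "unknown"  # UA doesn't name a crawler in the table


print(classify("Mozilla/5.0 (compatible; GPTBot/1.2)", "192.0.2.10"))
print(classify("Mozilla/5.0 (compatible; GPTBot/1.2)", "203.0.113.5"))
```

With these illustrative ranges, the first call is classified as verified and the second as spoofed, which mirrors what you see when sending a test hit from your own machine.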