
What this does

CloudFront’s Standard Logging V2 ships request logs to a destination of your choosing. You point an Amazon Data Firehose stream at Searchable’s ingest endpoint; Firehose batches the log records to Searchable, which classifies the AI bots and drops everything else.
No code changes are required. All configuration happens inside the AWS Console.

Prerequisites

A CloudFront distribution (works on any plan — Standard Logging V2 is a free CloudFront feature)
Permission to create Firehose delivery streams and edit the CloudFront distribution
A Searchable project with your domain confirmed
You only pay for the Firehose ingestion + delivery (typically a few cents per GB of AI-bot traffic). There is no CloudFront charge for enabling Standard Logging V2.

Setup

Step 1: Generate an integration token in Searchable

  1. Open your Searchable dashboard
  2. Go to Agent Analytics → Setup
  3. Pick Amazon CloudFront as your crawler source
  4. Click Generate token
Copy the token now: it starts with sa_… and won’t be shown again. You can always generate a new one if you lose it.
The endpoint URL is fixed:
https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs
Sanity check before pointing Firehose at the endpoint: a plain GET against the URL should return 200 OK. If it doesn’t, your network can’t reach the endpoint, and Firehose will silently buffer to the S3 backup bucket instead.
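A quick check from any shell (curl -i prints the response status line):

curl -i https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs
# the first line of output should be HTTP/2 200 (or HTTP/1.1 200 OK)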
Step 2: Open the CloudFront distribution's Logging tab

Open the CloudFront console → your distribution → Logging. Click Add standard log destination.
  • Destination type: Amazon Data Firehose
  • Output format: JSON
Then click Create new Firehose stream — this opens the Firehose wizard in a new tab.
Step 3: Create the Firehose delivery stream

In the Firehose wizard:
  • Source: Direct PUT
  • Destination: HTTP Endpoint
  • HTTP endpoint URL: https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs
  • Access key: paste the sa_… token from Searchable
  • Content encoding: GZIP
  • Buffer hints: 1 MiB / 60 seconds
Buffer hints matter. The Firehose default is 5 MiB, which sits right at Searchable’s 5 MB compressed-batch cap — high-traffic distributions can intermittently hit 413 Payload Too Large. Set the size hint to 1 MiB and the interval to 60 seconds.
Firehose can’t set arbitrary request headers, so it sends the Access key value as the X-Amz-Firehose-Access-Key header instead. Searchable’s endpoint accepts either that or Authorization: Bearer sa_…, so the same token works for Firehose, a custom Lambda relay, or a curl test.
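The console wizard is the path this guide follows, but if you manage AWS from the CLI, the equivalent stream looks roughly like this. A sketch only: the stream name, role ARN, and bucket ARN are placeholders, and the backup settings anticipate step 4:

aws firehose create-delivery-stream \
  --delivery-stream-name searchable-cloudfront-logs \
  --delivery-stream-type DirectPut \
  --http-endpoint-destination-configuration '{
    "EndpointConfiguration": {
      "Url": "https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs",
      "Name": "searchable",
      "AccessKey": "sa_YOUR_TOKEN"
    },
    "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
    "RequestConfiguration": {"ContentEncoding": "GZIP"},
    "S3BackupMode": "FailedDataOnly",
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "S3Configuration": {
      "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
      "BucketARN": "arn:aws:s3:::your-backup-bucket"
    }
  }'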
Step 4: Configure the S3 backup bucket

Firehose requires an S3 backup bucket for records it can’t deliver.
  • S3 backup bucket: any bucket you control
  • Backup mode: Failed data only
“Failed data only” means you only pay to store records the endpoint actually rejects (for example, after a token revocation). Don’t pick “All data”: it would duplicate every CloudFront log into S3 unnecessarily.
Finish creating the Firehose stream.
Step 5: Finish the CloudFront log destination

Return to the CloudFront tab and select the Firehose stream you just created.
Pick these standard log fields.
Required (the endpoint drops records missing any of these):
  • timestamp
  • sc-status
  • cs-method
  • cs-uri-stem
  • x-host-header
  • cs-user-agent
Recommended (improves enrichment + debugging; anything you select beyond the required set is preserved as custom_properties in Searchable):
  • c-ip
  • c-country
  • cs-protocol
  • cs-uri-query
  • cs-referer
  • sc-bytes
  • time-taken
  • x-edge-request-id
  • x-edge-location
  • x-edge-result-type
Save the log destination.
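For reference, one record in the resulting JSON log stream looks roughly like this (values are illustrative, and CloudFront’s exact timestamp and number formatting may differ):

{"timestamp": "1719858923", "sc-status": "200", "cs-method": "GET", "cs-uri-stem": "/pricing", "x-host-header": "example.com", "cs-user-agent": "GPTBot/1.0 (+https://openai.com/gptbot)", "c-ip": "203.0.113.10", "c-country": "US", "time-taken": "0.042", "x-edge-location": "IAD89-C1", "x-edge-result-type": "Miss"}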
Step 6: Verify in Searchable

Firehose batches every ~60 seconds, so events take a moment to appear after the first AI bot hit.
Return to Agent Analytics → Setup in Searchable. The Amazon CloudFront card should show Connected within a few minutes once an AI bot hits your site.

Verifying the connection

In Searchable:
  1. Go to Agent Analytics → Setup
  2. Look at the Amazon CloudFront card status
  3. Click Check if it still shows “Waiting for first event”
  • Waiting for first event: the Firehose stream is configured, but no AI bot has hit your site yet. Typical wait is a few hours for sites that are already indexed.
  • Connected: events are arriving. The card shows the count from the last 24 hours.
You can also confirm in AWS: Amazon Data Firehose → your stream → Monitoring. The stream’s metrics show incoming records (from CloudFront) and successful HTTP deliveries (to Searchable). The S3 backup bucket should stay near-empty when “Failed data only” is set.
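The same check works from the CLI. A sketch, assuming the stream name from step 3 (DeliveryToHttpEndpoint.Success is one of the HTTP-endpoint metrics Firehose publishes to CloudWatch; GNU date shown, use date -v-1H on macOS):

aws firehose describe-delivery-stream \
  --delivery-stream-name searchable-cloudfront-logs \
  --query 'DeliveryStreamDescription.DeliveryStreamStatus' --output text
# expect: ACTIVE

aws cloudwatch get-metric-statistics \
  --namespace AWS/Firehose \
  --metric-name DeliveryToHttpEndpoint.Success \
  --dimensions Name=DeliveryStreamName,Value=searchable-cloudfront-logs \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Sum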

What Searchable receives

For each request that matches an AI-bot user agent, Searchable receives:
  • HTTP method, path, and host (query strings stripped before storage)
  • User agent
  • Referer
  • Country code (from c-country)
  • Response status, response bytes
  • Edge timing (time-taken)
  • CloudFront edge metadata (x-edge-request-id, x-edge-location, x-edge-result-type) — preserved as custom_properties for debugging
Bodies, headers other than User-Agent / Referer, cookies, and full IPs are never sent or stored. The CloudFront edge request ID is also used to de-duplicate Firehose redeliveries server-side.

Multiple distributions

Each Firehose stream is bound to a single CloudFront distribution’s log destination. If your domain spans multiple distributions (for example, separate distributions for example.com and assets.example.com):
  1. Generate a separate integration token per distribution, or reuse a single token across all of them; both work
  2. Create one Firehose stream per distribution, all pointing at the same HTTP endpoint
  3. Searchable tags events by host, so you’ll still see them split by domain in the dashboard

Troubleshooting

Firehose “Destination error count” is climbing

Open the Firehose stream in the AWS Console → Monitoring → Destination error logs for the specific error. The common ones:
  • 401 Unauthorized — the sa_… token in the Access key field is missing, wrong, or has been revoked in Searchable. Re-paste the token, or generate a new one from Agent Analytics → Setup.
  • 413 Payload Too Large — the Firehose buffer hint is too big. Drop the size hint to 1 MiB and the interval to 60 seconds, then save. Firehose will retry the buffered batches once they shrink.
  • 400 Bad Request — the CloudFront log output format isn’t JSON, or the required fields aren’t selected. Edit the CloudFront log destination, set Output format to JSON, and confirm timestamp, sc-status, cs-method, cs-uri-stem, x-host-header, and cs-user-agent are all checked.
While errors are climbing, Firehose buffers the affected records into your S3 backup bucket — fix the root cause and the stream catches back up automatically.
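If error logging was left enabled when the stream was created, the same errors also land in CloudWatch Logs. Firehose’s default convention is a log group named after the stream, so a tail looks like this (stream name assumed from step 3):

aws logs tail /aws/kinesisfirehose/searchable-cloudfront-logs --since 1h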

No events are arriving

CloudFront only delivers logs for traffic that’s actually being served by the distribution. Things to check:
  • The CloudFront distribution is Enabled (not disabled or paused)
  • The Firehose stream is Active (not in CREATING or DELETING state)
  • Your domain in Searchable matches the alternate domain (CNAME) on the distribution (check Agent Analytics → Setup → Confirm your domain)
  • The log destination is attached to the distribution — easy to miss if the Firehose-creation wizard was opened separately and you forgot to come back to CloudFront’s Logging tab
If everything looks right, hit your site with a known AI user agent and wait ~60 seconds for the next Firehose batch:
curl -H "User-Agent: GPTBot/1.0 (+https://openai.com/gptbot)" https://yourdomain.com/

The S3 backup bucket is filling up

The “Failed data only” backup bucket only fills up when the HTTP endpoint rejects records, typically a token or config issue. Inspect a recent failed object in the bucket (a shell sketch follows this list):
  • It contains the original CloudFront records along with an error message from the endpoint (401, 413, 400, etc.)
  • Use the message to identify the root cause (see the Firehose 'Destination error count' is climbing section above)
  • Once you fix the config, new records flow through to Searchable; the records already in the backup bucket aren’t automatically re-delivered (Firehose treats the backup as terminal storage)
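A sketch for pulling a failed object from the shell (bucket name and object key are placeholders; the errorCode / errorMessage / rawData envelope fields are an assumption based on Firehose’s documented failed-delivery format, and jq 1.6+ is needed for @base64d):

aws s3 ls s3://your-backup-bucket/ --recursive | tail -n 5
aws s3 cp s3://your-backup-bucket/PATH/TO/OBJECT - | head -n 1 |
  jq -r '.errorCode, .errorMessage, (.rawData | @base64d)'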

Duplicate events in the dashboard

Firehose is at-least-once, so it can occasionally redeliver a batch after a transient network error. Searchable de-duplicates on x-edge-request-id server-side, so duplicate events don’t appear in the dashboard, provided you selected x-edge-request-id in the CloudFront log fields. If you skipped it, dedup falls back to a (timestamp, path, user-agent) heuristic that’s less precise. Edit the log destination, add x-edge-request-id, and save.

Posting logs without Firehose

The same endpoint accepts plain NDJSON POSTs (no Firehose envelope), so you can post log records directly from a Lambda or any custom collector. Use:
  • POST https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs
  • Authorization: Bearer sa_… header
  • Content-Type: application/x-ndjson (gzip optional — set Content-Encoding: gzip if you compress)
  • One CloudFront log record per line, using the same field names as Standard Logging V2 (cs-method, cs-uri-stem, x-host-header, cs-user-agent, sc-status, timestamp, etc.)
The endpoint replies with 204 No Content on success.
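Putting that together, a hand-rolled test post might look like this (the token and the record values are placeholders):

printf '%s\n' '{"timestamp":"1719858923","cs-method":"GET","cs-uri-stem":"/pricing","x-host-header":"example.com","cs-user-agent":"GPTBot/1.0 (+https://openai.com/gptbot)","sc-status":"200"}' |
  gzip |
  curl https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs \
    -X POST \
    -H "Authorization: Bearer sa_YOUR_TOKEN" \
    -H "Content-Type: application/x-ndjson" \
    -H "Content-Encoding: gzip" \
    --data-binary @- \
    -w "%{http_code}\n" -o /dev/null
# expect: 204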

Removing the integration

  1. AWS Console → CloudFront → your distribution → Logging → delete the standard log destination
  2. AWS Console → Amazon Data Firehose → your stream → delete the stream (and the S3 backup bucket if you no longer need it)
  3. Searchable → Agent Analytics → Setup → Tokens → revoke the token
Both sides are independent — revoking the token alone is enough to stop ingestion immediately, even if the Firehose stream stays configured (its deliveries will start returning 401 and land in the backup bucket).

Next steps

See the data

Open Agent Analytics to see which assistants are crawling your site.

Add Search Console

Correlate AI crawls with search demand.