
What this does

CloudFront’s Standard Logging V2 ships request logs to a destination of your choosing. You point an Amazon Data Firehose stream at Searchable’s ingest endpoint; Firehose batches the log records to Searchable, which classifies the AI bots and drops everything else.
No code changes are required. All configuration happens inside the AWS Console.

Prerequisites

A CloudFront distribution (works on any plan — Standard Logging V2 is a free CloudFront feature)
Permission to create Firehose delivery streams and edit the CloudFront distribution
A Searchable project with your domain confirmed
You only pay for the Firehose ingestion + delivery (typically a few cents per GB of AI-bot traffic). There is no CloudFront charge for enabling Standard Logging V2.

Setup

Step 1: Generate an integration token in Searchable

  1. Open your Searchable dashboard
  2. Go to Agent Analytics → Setup
  3. Pick Amazon CloudFront as your crawler source
  4. Click Generate token
Copy the token now: it starts with sa_… and won’t be shown again. You can always generate a new one if you lose it.
The endpoint URL is fixed:
https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs
Sanity check before pointing Firehose at the endpoint: a plain GET against the URL should return 200 OK. If it doesn’t, your network can’t reach the endpoint, and Firehose will silently buffer to the S3 backup bucket instead.
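A quick check from any shell (curl -i prints the response status line):

curl -i https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs
# the first line of output should be HTTP/2 200 (or HTTP/1.1 200 OK)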
Step 2: Open the CloudFront distribution's Logging tab

Open the CloudFront console → your distribution → Logging. Click Add standard log destination.
  • Destination type: Amazon Data Firehose
  • Output format: JSON
Then click Create new Firehose stream — this opens the Firehose wizard in a new tab.
Step 3: Create the Firehose delivery stream

In the Firehose wizard:
  • Source: Direct PUT
  • Destination: HTTP Endpoint
  • HTTP endpoint URL: https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs
  • Access key: paste the sa_… token from Searchable
  • Content encoding: GZIP
  • Buffer hints: 1 MiB / 60 seconds
Buffer hints matter. The Firehose default is 5 MiB, which sits right at Searchable’s 5 MB compressed-batch cap — high-traffic distributions can intermittently hit 413 Payload Too Large. Set the size hint to 1 MiB and the interval to 60 seconds.
Firehose can’t set arbitrary request headers, so it sends the Access key value as the X-Amz-Firehose-Access-Key header instead. Searchable’s endpoint accepts either that or Authorization: Bearer sa_…, so the same token works for Firehose, a custom Lambda relay, or a curl test.
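The console wizard is the path this guide follows, but if you manage AWS from the CLI, the equivalent stream looks roughly like this. A sketch only: the stream name, role ARN, and bucket ARN are placeholders, and the backup settings anticipate step 4:

aws firehose create-delivery-stream \
  --delivery-stream-name searchable-cloudfront-logs \
  --delivery-stream-type DirectPut \
  --http-endpoint-destination-configuration '{
    "EndpointConfiguration": {
      "Url": "https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs",
      "Name": "searchable",
      "AccessKey": "sa_YOUR_TOKEN"
    },
    "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
    "RequestConfiguration": {"ContentEncoding": "GZIP"},
    "S3BackupMode": "FailedDataOnly",
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "S3Configuration": {
      "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
      "BucketARN": "arn:aws:s3:::your-backup-bucket"
    }
  }'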
Step 4: Configure the S3 backup bucket

Firehose requires an S3 backup bucket for records it can’t deliver.
  • S3 backup bucket: any bucket you control
  • Backup mode: Failed data only
“Failed data only” means you only pay to store records the endpoint actually rejects (for example, after a token revocation). Don’t pick “All data”: it would duplicate every CloudFront log into S3 unnecessarily.
Finish creating the Firehose stream.
Step 5: Finish the CloudFront log destination

Return to the CloudFront tab and select the Firehose stream you just created.
Pick these standard log fields.
Required (the endpoint drops records missing any of these):
  • timestamp
  • sc-status
  • cs-method
  • cs-uri-stem
  • x-host-header
  • cs-user-agent
Recommended (improves enrichment + debugging; anything you select beyond the required set is preserved as custom_properties in Searchable):
  • c-ip
  • c-country
  • cs-protocol
  • cs-uri-query
  • cs-referer
  • sc-bytes
  • time-taken
  • x-edge-request-id
  • x-edge-location
  • x-edge-result-type
Save the log destination.
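For reference, one record in the resulting JSON log stream looks roughly like this (values are illustrative, and CloudFront’s exact timestamp and number formatting may differ):

{"timestamp": "1719858923", "sc-status": "200", "cs-method": "GET", "cs-uri-stem": "/pricing", "x-host-header": "example.com", "cs-user-agent": "GPTBot/1.0 (+https://openai.com/gptbot)", "c-ip": "203.0.113.10", "c-country": "US", "time-taken": "0.042", "x-edge-location": "IAD89-C1", "x-edge-result-type": "Miss"}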
Step 6: Verify in Searchable

Firehose batches every ~60 seconds, so events take a moment to appear after the first AI bot hit.
Return to Agent Analytics → Setup in Searchable. The Amazon CloudFront card should show Connected within a few minutes once an AI bot hits your site.

Verifying the connection

In Searchable:
  1. Go to Agent Analytics → Setup
  2. Look at the Amazon CloudFront card status
  3. Click Check if it still shows “Waiting for first event”
  • Waiting for first event: the Firehose stream is configured, but no AI bot has hit your site yet. Typical wait is a few hours for sites that are already indexed.
  • Connected: events are arriving. The card shows the count from the last 24 hours.
You can also confirm in AWS: Amazon Data Firehose → your stream → Monitoring. The stream’s metrics show incoming records (from CloudFront) and successful HTTP deliveries (to Searchable). The S3 backup bucket should stay near-empty when “Failed data only” is set.
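The same check works from the CLI. A sketch, assuming the stream name from step 3 (DeliveryToHttpEndpoint.Success is one of the HTTP-endpoint metrics Firehose publishes to CloudWatch; GNU date shown, use date -v-1H on macOS):

aws firehose describe-delivery-stream \
  --delivery-stream-name searchable-cloudfront-logs \
  --query 'DeliveryStreamDescription.DeliveryStreamStatus' --output text
# expect: ACTIVE

aws cloudwatch get-metric-statistics \
  --namespace AWS/Firehose \
  --metric-name DeliveryToHttpEndpoint.Success \
  --dimensions Name=DeliveryStreamName,Value=searchable-cloudfront-logs \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Sum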

What Searchable receives

For each request that matches an AI-bot user agent, Searchable receives:
  • HTTP method, path, and host (query strings stripped before storage)
  • User agent
  • Referer
  • Country code (from c-country)
  • Response status, response bytes
  • Edge timing (time-taken)
  • CloudFront edge metadata (x-edge-request-id, x-edge-location, x-edge-result-type) — preserved as custom_properties for debugging
Bodies, headers other than User-Agent / Referer, cookies, and full IPs are never sent or stored. The CloudFront edge request ID is also used to de-duplicate Firehose redeliveries server-side.

Multiple distributions

Each Firehose stream is bound to a single CloudFront distribution’s log destination. If your domain spans multiple distributions (for example, separate distributions for example.com and assets.example.com):
  1. Generate a separate integration token per distribution, or reuse a single token across all of them; both work
  2. Create one Firehose stream per distribution, all pointing at the same HTTP endpoint
  3. Searchable tags events by host, so you’ll still see them split by domain in the dashboard

Troubleshooting

Firehose “Destination error count” is climbing

Open the Firehose stream in the AWS Console → Monitoring → Destination error logs for the specific error. The common ones:
  • 401 Unauthorized — the sa_… token in the Access key field is missing, wrong, or has been revoked in Searchable. Re-paste the token, or generate a new one from Agent Analytics → Setup.
  • 413 Payload Too Large — the Firehose buffer hint is too big. Drop the size hint to 1 MiB and the interval to 60 seconds, then save. Firehose will retry the buffered batches once they shrink.
  • 400 Bad Request — the CloudFront log output format isn’t JSON, or the required fields aren’t selected. Edit the CloudFront log destination, set Output format to JSON, and confirm timestamp, sc-status, cs-method, cs-uri-stem, x-host-header, and cs-user-agent are all checked.
While errors are climbing, Firehose buffers the affected records into your S3 backup bucket — fix the root cause and the stream catches back up automatically.
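If error logging was left enabled when the stream was created, the same errors also land in CloudWatch Logs. Firehose’s default convention is a log group named after the stream, so a tail looks like this (stream name assumed from step 3):

aws logs tail /aws/kinesisfirehose/searchable-cloudfront-logs --since 1h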

No events are arriving

CloudFront only delivers logs for traffic that’s actually being served by the distribution. Things to check:
  • The CloudFront distribution is Enabled (not disabled or paused)
  • The Firehose stream is Active (not in CREATING or DELETING state)
  • Your domain in Searchable matches the alternate domain (CNAME) on the distribution (check Agent Analytics → Setup → Confirm your domain)
  • The log destination is attached to the distribution — easy to miss if the Firehose-creation wizard was opened separately and you forgot to come back to CloudFront’s Logging tab
If everything looks right, hit your site with a known AI user agent and wait ~60 seconds for the next Firehose batch:
curl -H "User-Agent: GPTBot/1.0 (+https://openai.com/gptbot)" https://yourdomain.com/

The S3 backup bucket is filling up

The “Failed data only” backup bucket only fills up when the HTTP endpoint rejects records, typically a token or config issue. Inspect a recent failed object in the bucket (a shell sketch follows this list):
  • It contains the original CloudFront records along with an error message from the endpoint (401, 413, 400, etc.)
  • Use the message to identify the root cause (see the Firehose 'Destination error count' is climbing section above)
  • Once you fix the config, new records flow through to Searchable; the records already in the backup bucket aren’t automatically re-delivered (Firehose treats the backup as terminal storage)
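A sketch for pulling a failed object from the shell (bucket name and object key are placeholders; the errorCode / errorMessage / rawData envelope fields are an assumption based on Firehose’s documented failed-delivery format, and jq 1.6+ is needed for @base64d):

aws s3 ls s3://your-backup-bucket/ --recursive | tail -n 5
aws s3 cp s3://your-backup-bucket/PATH/TO/OBJECT - | head -n 1 |
  jq -r '.errorCode, .errorMessage, (.rawData | @base64d)'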

Duplicate events in the dashboard

Firehose is at-least-once, so it can occasionally redeliver a batch after a transient network error. Searchable de-duplicates on x-edge-request-id server-side, so duplicate events don’t appear in the dashboard, provided you selected x-edge-request-id in the CloudFront log fields. If you skipped it, dedup falls back to a (timestamp, path, user-agent) heuristic that’s less precise. Edit the log destination, add x-edge-request-id, and save.

Posting logs without Firehose

The same endpoint accepts plain NDJSON POSTs (no Firehose envelope), so you can post log records directly from a Lambda or any custom collector. Use:
  • POST https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs
  • Authorization: Bearer sa_… header
  • Content-Type: application/x-ndjson (gzip optional — set Content-Encoding: gzip if you compress)
  • One CloudFront log record per line, using the same field names as Standard Logging V2 (cs-method, cs-uri-stem, x-host-header, cs-user-agent, sc-status, timestamp, etc.)
The endpoint replies with 204 No Content on success.
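Putting that together, a hand-rolled test post might look like this (the token and the record values are placeholders):

printf '%s\n' '{"timestamp":"1719858923","cs-method":"GET","cs-uri-stem":"/pricing","x-host-header":"example.com","cs-user-agent":"GPTBot/1.0 (+https://openai.com/gptbot)","sc-status":"200"}' |
  gzip |
  curl https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs \
    -X POST \
    -H "Authorization: Bearer sa_YOUR_TOKEN" \
    -H "Content-Type: application/x-ndjson" \
    -H "Content-Encoding: gzip" \
    --data-binary @- \
    -w "%{http_code}\n" -o /dev/null
# expect: 204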

Removing the integration

  1. AWS Console → CloudFront → your distribution → Logging → delete the standard log destination
  2. AWS Console → Amazon Data Firehose → your stream → delete the stream (and the S3 backup bucket if you no longer need it)
  3. Searchable → Agent Analytics → Setup → Tokens → revoke the token
Both sides are independent — revoking the token alone is enough to stop ingestion immediately, even if the Firehose stream stays configured (its deliveries will start returning 401 and land in the backup bucket).

Next steps

See the data

Open Agent Analytics to see which assistants are crawling your site.

Add Search Console

Correlate AI crawls with search demand.