Documentation Index
Fetch the complete documentation index at: https://docs.searchable.com/llms.txt
Use this file to discover all available pages before exploring further.
What this does
CloudFront’s Standard Logging V2 ships request logs to a destination of your choosing. We point a Kinesis Data Firehose stream at Searchable’s ingest endpoint: Firehose batches log records to us, we classify the AI-bot traffic, and drop everything else. No code changes are required; all configuration is done inside the AWS Console.
Prerequisites
A CloudFront distribution (works on any plan — Standard Logging V2 is a free CloudFront feature)
Permission to create Firehose delivery streams and edit the CloudFront distribution
A Searchable project with your domain confirmed
Setup
Generate an integration token in Searchable
- Open your Searchable dashboard
- Go to Agent Analytics → Setup
- Pick Amazon CloudFront as your crawler source
- Click Generate token
The token starts with `sa_…` and won’t be shown again. You can always generate a new one if you lose it. The endpoint URL is fixed: `https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs`
Open the CloudFront distribution's Logging tab
Open the CloudFront console → your distribution → Logging. Click Add standard log destination.
Then click Create new Firehose stream — this opens the Firehose wizard in a new tab.
| Field | Value |
|---|---|
| Destination type | Amazon Data Firehose |
| Output format | JSON |
Create the Firehose delivery stream
In the Firehose wizard:
| Field | Value |
|---|---|
| Source | Direct PUT |
| Destination | HTTP Endpoint |
| HTTP endpoint URL | https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs |
| Access key | paste the sa_… token from Searchable |
| Content encoding | GZIP |
| Buffer hints | 1 MiB / 60s |
Configure the S3 backup bucket
Firehose requires an S3 backup bucket for records it can’t deliver.
“Failed data only” means you only pay to store records the endpoint actually rejects (for example, after a token revocation). Don’t pick “All data”; it would duplicate every CloudFront log into S3 unnecessarily.
Finish creating the Firehose stream with these settings:
| Field | Value |
|---|---|
| S3 backup bucket | any bucket you control |
| Backup mode | Failed data only |
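For teams that script their infrastructure instead of clicking through the console, the same Firehose settings can be sketched as a boto3 `create_delivery_stream` call. This is a minimal sketch: the stream name, bucket ARN, and IAM role ARN are placeholders you must supply, and the actual API call is left commented out since it needs AWS credentials.

```python
# Sketch of the Firehose settings from the tables above as a boto3 call.
# Placeholders: the access token, stream name, role ARN, and bucket ARN.
ACCESS_TOKEN = "sa_YOUR_TOKEN_HERE"  # token generated in Searchable (placeholder)

http_endpoint_destination = {
    "EndpointConfiguration": {
        "Url": "https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs",
        "Name": "searchable-ingest",
        "AccessKey": ACCESS_TOKEN,
    },
    # Content encoding: GZIP
    "RequestConfiguration": {"ContentEncoding": "GZIP"},
    # Buffer hints: 1 MiB / 60 s
    "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
    # Backup mode: Failed data only
    "S3BackupMode": "FailedDataOnly",
    "S3Configuration": {
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-backup-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-firehose-backup-bucket",             # placeholder
    },
}

# Uncomment to actually create the stream (requires boto3 and AWS credentials
# with firehose:CreateDeliveryStream):
# import boto3
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="searchable-cloudfront-logs",
#     DeliveryStreamType="DirectPut",  # matches Source: Direct PUT
#     HttpEndpointDestinationConfiguration=http_endpoint_destination,
# )
print(http_endpoint_destination["S3BackupMode"])
```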
Finish the CloudFront log destination
Return to the CloudFront tab and select the Firehose stream you just created. Pick these standard log fields.
Required (the endpoint drops records missing any of these):
- `timestamp`
- `sc-status`
- `cs-method`
- `cs-uri-stem`
- `x-host-header`
- `cs-user-agent`
Optional (stored as `custom_properties` in Searchable):
- `c-ip`
- `c-country`
- `cs-protocol`
- `cs-uri-query`
- `cs-referer`
- `sc-bytes`
- `time-taken`
- `x-edge-request-id`
- `x-edge-location`
- `x-edge-result-type`
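With JSON output selected, one log record is a flat JSON object keyed by these field names. The sketch below shows the rough shape and the required-field check the endpoint applies; all field values here are illustrative, not real log data.

```python
import json

# Illustrative shape of one CloudFront Standard Logging V2 record in JSON
# output. Only the field names come from the lists above; values are made up.
record = {
    "timestamp": "1719403200",
    "sc-status": "200",
    "cs-method": "GET",
    "cs-uri-stem": "/pricing",
    "x-host-header": "example.com",
    "cs-user-agent": "GPTBot/1.0",
    # optional fields ride along and end up in custom_properties:
    "x-edge-location": "FRA56-C1",
    "x-edge-result-type": "Miss",
}

# The endpoint drops any record missing one of the required fields.
REQUIRED = {"timestamp", "sc-status", "cs-method",
            "cs-uri-stem", "x-host-header", "cs-user-agent"}
missing = REQUIRED - record.keys()
print("ok" if not missing else f"dropped, missing: {sorted(missing)}")
```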
Verifying the connection
In Searchable:
- Go to Agent Analytics → Setup
- Look at the Amazon CloudFront card status
- Click Check if it still shows “Waiting for first event”
| Status | What it means |
|---|---|
| Waiting for first event | The Firehose stream is configured but no AI bot has hit your site yet. Typical wait is a few hours for sites that are already indexed. |
| Connected | Events are arriving. The card shows the count from the last 24 hours. |
What Searchable receives
For each request that matches an AI-bot user agent, Searchable receives:
- HTTP method, path, and host (query strings stripped before storage)
- User agent
- Referer
- Country code (from `c-country`)
- Response status and response bytes
- Edge timing (`time-taken`)
- CloudFront edge metadata (`x-edge-request-id`, `x-edge-location`, `x-edge-result-type`), preserved as `custom_properties` for debugging
Headers other than `User-Agent` and `Referer`, cookies, and full IP addresses are never sent or stored. The CloudFront edge request ID is also used to de-duplicate Firehose redeliveries server-side.
Multiple distributions
Each Firehose stream is bound to a single CloudFront distribution’s log destination. If your domain spans multiple distributions (for example, separate distributions for `example.com` and `assets.example.com`):
- Generate one integration token, or reuse one across distributions — both work
- Create one Firehose stream per distribution, all pointing at the same HTTP endpoint
- Searchable tags events by host, so you’ll still see them split by domain in the dashboard
Troubleshooting
Firehose 'Destination error count' is climbing
Open the Firehose stream in the AWS Console → Monitoring → Destination error logs for the specific error. The common ones:
- `401 Unauthorized`: the `sa_…` token in the Access key field is missing, wrong, or has been revoked in Searchable. Re-paste the token, or generate a new one from Agent Analytics → Setup.
- `413 Payload Too Large`: the Firehose buffer hint is too big. Drop the size hint to 1 MiB and the interval to 60 seconds, then save. Firehose will retry the buffered batches once they shrink.
- `400 Bad Request`: the CloudFront log output format isn’t JSON, or the required fields aren’t selected. Edit the CloudFront log destination, set Output format to JSON, and confirm `timestamp`, `sc-status`, `cs-method`, `cs-uri-stem`, `x-host-header`, and `cs-user-agent` are all checked.
Status stays on 'Waiting for first event'
CloudFront only delivers logs for traffic that’s actually being served by the distribution. Things to check:
- The CloudFront distribution is Enabled (not disabled or paused)
- The Firehose stream is Active (not in `CREATING` or `DELETING` state)
- Your domain in Searchable matches the alternate domain (CNAME) on the distribution (check Agent Analytics → Setup → Confirm your domain)
- The log destination is attached to the distribution — easy to miss if the Firehose-creation wizard was opened separately and you forgot to come back to CloudFront’s Logging tab
Records keep landing in the S3 backup bucket
The “Failed data only” backup bucket only fills up when the HTTP endpoint rejects records — typically a token / config issue. Inspect a recent failed object in the bucket:
- It contains the original CloudFront records along with an error message from the endpoint (`401`, `413`, `400`, etc.)
- Use the message to identify the root cause (see the Firehose 'Destination error count' is climbing section above)
- Once you fix the config, new records flow through to Searchable; the records already in the backup bucket aren’t automatically re-delivered (Firehose treats the backup as terminal storage)
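A quick way to triage a backup object is to decode each failed record and group by error code. This sketch assumes the usual Firehose failed-delivery envelope, where each line is JSON carrying `errorCode`, `errorMessage`, and the original record base64-encoded in `rawData`; check one of your own objects for the exact keys, and note the sample input here is fabricated for illustration.

```python
import base64
import json

def summarize_failures(body: bytes):
    """Yield (error_code, original_record) for each failed-delivery line.

    Assumes each non-empty line is a JSON envelope with "errorCode" and a
    base64-encoded "rawData" field holding the original CloudFront record.
    """
    for line in body.splitlines():
        if not line.strip():
            continue
        envelope = json.loads(line)
        raw = base64.b64decode(envelope["rawData"])
        yield envelope.get("errorCode"), json.loads(raw)

# Illustrative object content, not real data:
sample_object = json.dumps({
    "errorCode": "401",
    "errorMessage": "Unauthorized: token revoked",
    "rawData": base64.b64encode(
        json.dumps({"cs-method": "GET", "cs-uri-stem": "/pricing"}).encode()
    ).decode(),
}).encode() + b"\n"

for code, rec in summarize_failures(sample_object):
    print(code, rec["cs-uri-stem"])
```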
I see duplicated events
Firehose is at-least-once, so it can occasionally redeliver a batch after a transient network error. Searchable de-duplicates on `x-edge-request-id` server-side, so duplicate events don’t appear in the dashboard, provided you selected `x-edge-request-id` in the CloudFront log fields. If you skipped it, dedup falls back to a (timestamp, path, user-agent) heuristic that’s less precise. Edit the log destination, add `x-edge-request-id`, and save.
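The fallback can be pictured as keying each event on that tuple. This is an illustrative sketch of why it is less precise, not Searchable’s actual implementation: without the edge request ID, two genuinely distinct requests that share timestamp, path, and user agent collapse into one.

```python
# Illustrative dedup sketch: prefer x-edge-request-id when present, otherwise
# fall back to the (timestamp, path, user-agent) tuple described above.
seen: set = set()

def is_duplicate(event: dict) -> bool:
    key = event.get("x-edge-request-id") or (
        event.get("timestamp"),
        event.get("cs-uri-stem"),
        event.get("cs-user-agent"),
    )
    if key in seen:
        return True
    seen.add(key)
    return False

batch = [
    {"timestamp": "1", "cs-uri-stem": "/a", "cs-user-agent": "GPTBot"},
    {"timestamp": "1", "cs-uri-stem": "/a", "cs-user-agent": "GPTBot"},  # redelivery
]
print([is_duplicate(e) for e in batch])  # → [False, True]
```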
I'd rather pipe logs through a custom Lambda instead of Firehose
The same endpoint accepts plain NDJSON POSTs (no Firehose envelope), so you can post log records directly from a Lambda or any custom collector. Use:
- `POST https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs`
- `Authorization: Bearer sa_…` header
- `Content-Type: application/x-ndjson` (gzip optional; set `Content-Encoding: gzip` if you compress)
- One CloudFront log record per line, using the same field names as Standard Logging V2 (`cs-method`, `cs-uri-stem`, `x-host-header`, `cs-user-agent`, `sc-status`, `timestamp`, etc.)
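The requirements above can be sketched with the standard library alone, as you might do inside a Lambda handler. The token and record values are placeholders; the actual send is left commented out so the sketch is safe to run anywhere.

```python
import gzip
import json
import urllib.request

# Sketch of posting CloudFront records directly (no Firehose), following the
# requirements listed above. TOKEN and the record values are placeholders.
TOKEN = "sa_YOUR_TOKEN_HERE"
records = [
    {"timestamp": "1719403200", "sc-status": "200", "cs-method": "GET",
     "cs-uri-stem": "/pricing", "x-host-header": "example.com",
     "cs-user-agent": "GPTBot/1.0"},
]

# One JSON record per line (NDJSON), gzip-compressed.
body = gzip.compress("\n".join(json.dumps(r) for r in records).encode() + b"\n")

req = urllib.request.Request(
    "https://searchable-tracker.searchable.workers.dev/v1/cloudfront-logs",
    data=body,
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/x-ndjson",
        "Content-Encoding": "gzip",
    },
    method="POST",
)
# resp = urllib.request.urlopen(req)  # expect 204 No Content on success
print(req.get_method(), len(body), "bytes")
```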
The endpoint returns `204 No Content` on success.
Removing the integration
- AWS Console → CloudFront → your distribution → Logging → delete the standard log destination
- AWS Console → Amazon Data Firehose → your stream → delete the stream (and the S3 backup bucket if you no longer need it)
- Searchable → Agent Analytics → Setup → Tokens → revoke the token
(If you revoke the token before deleting the stream, any remaining Firehose deliveries will fail with `401` and land in the backup bucket.)
Next steps
See the data
Open Agent Analytics to see which assistants are crawling your site.
Add Search Console
Correlate AI crawls with search demand.