# Send Amazon CloudFront traffic to Searchable

> Stream CloudFront standard logs to Searchable through Amazon Data Firehose — works on any CloudFront distribution

## What this does

CloudFront's [Standard Logging V2](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/standard-logging.html) ships request logs to a destination of your choosing. You point an Amazon Data Firehose stream at Searchable's ingest endpoint. Firehose batches the log records and delivers them to Searchable, which classifies requests from AI bots and drops everything else.

<Info>
  **No code changes.** All configuration is done inside the AWS Console.
</Info>

## Prerequisites

<Check>A CloudFront distribution (works on any plan — Standard Logging V2 is a free CloudFront feature)</Check>
<Check>Permission to create Firehose delivery streams and edit the CloudFront distribution</Check>
<Check>A Searchable project with your domain confirmed</Check>

<Tip>
  You only pay for the Firehose ingestion + delivery (typically a few cents per GB of AI-bot traffic). There is no CloudFront charge for enabling Standard Logging V2.
</Tip>

## Setup

<Steps>
  <Step title="Generate an integration token in Searchable">
    1. Open your Searchable dashboard
    2. Go to **LLM Analytics → Setup**
    3. Pick **Amazon CloudFront** as your crawler source
    4. Click **Generate token**

    Copy the token now — it starts with `sa_…` and won't be shown again. You can always generate a new one if you lose it.

    The endpoint URL is fixed:

    ```
    https://tracker.searchableanalytics.com/v1/cloudfront-logs
    ```

    <Tip>
      Sanity check before pointing Firehose at the endpoint: `curl -i https://tracker.searchableanalytics.com/v1/cloudfront-logs` should show a `200 OK` status line. If it doesn't, the endpoint is unreachable from your network. If Firehose can't reach it either, records will silently land in the S3 backup bucket instead of arriving in Searchable.
    </Tip>
  </Step>

  <Step title="Open the CloudFront distribution's Logging tab">
    Open the [CloudFront console](https://console.aws.amazon.com/cloudfront/v4/home) → your distribution → **Logging**.

    Click **Add standard log destination**.

    | Field                | Value                |
    | -------------------- | -------------------- |
    | **Destination type** | Amazon Data Firehose |
    | **Output format**    | JSON                 |

    Then click **Create new Firehose stream** — this opens the Firehose wizard in a new tab.
  </Step>

  <Step title="Create the Firehose delivery stream">
    In the Firehose wizard:

    | Field                 | Value                                                        |
    | --------------------- | ------------------------------------------------------------ |
    | **Source**            | Direct PUT                                                   |
    | **Destination**       | HTTP Endpoint                                                |
    | **HTTP endpoint URL** | `https://tracker.searchableanalytics.com/v1/cloudfront-logs` |
    | **Access key**        | paste the `sa_…` token from Searchable                       |
    | **Content encoding**  | GZIP                                                         |
    | **Buffer hints**      | **1 MiB** / **60s**                                          |

    <Warning>
      Buffer hints matter. The Firehose default is 5 MiB, which sits right at Searchable's 5 MB compressed-batch cap — high-traffic distributions can intermittently hit `413 Payload Too Large`. Set the size hint to **1 MiB** and the interval to **60 seconds**.
    </Warning>

    <Tip>
      Firehose can't set arbitrary request headers, so it sends the **Access key** value as the `X-Amz-Firehose-Access-Key` header instead. Searchable's endpoint accepts either that or `Authorization: Bearer sa_…`, so the same token works for Firehose, a custom Lambda relay, or a curl test.
    </Tip>
  </Step>

  <Step title="Configure the S3 backup bucket">
    Firehose requires an S3 backup bucket for records it can't deliver.

    | Field                | Value                  |
    | -------------------- | ---------------------- |
    | **S3 backup bucket** | any bucket you control |
    | **Backup mode**      | **Failed data only**   |

    "Failed data only" means you only pay to store records the endpoint actually rejects (for example, after a token revocation). Don't pick "All data" — it would duplicate every CloudFront log into S3 unnecessarily.

    Finish creating the Firehose stream.
  </Step>

  <Step title="Finish the CloudFront log destination">
    Return to the CloudFront tab and select the Firehose stream you just created.

    Pick these standard log fields:

    **Required** (the endpoint drops records missing any of these):

    * `timestamp`
    * `sc-status`
    * `cs-method`
    * `cs-uri-stem`
    * `x-host-header`
    * `cs-user-agent`

    **Recommended** (improves enrichment + debugging; anything you select beyond the required set is preserved as `custom_properties` in Searchable):

    * `c-ip`
    * `c-country`
    * `cs-protocol`
    * `cs-uri-query`
    * `cs-referer`
    * `sc-bytes`
    * `time-taken`
    * `x-edge-request-id`
    * `x-edge-location`
    * `x-edge-result-type`

    Save the log destination.
  </Step>

  <Step title="Verify in Searchable">
    Firehose batches every \~60 seconds, so events take a moment to appear after the first AI bot hit.

    Return to **LLM Analytics → Setup** in Searchable. The Amazon CloudFront card should show **Connected** within a few minutes once an AI bot hits your site.
  </Step>
</Steps>
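
If you would rather script the stream than click through the wizard, the Firehose settings from the steps above appear to map onto the `HttpEndpointDestinationConfiguration` argument of boto3's `create_delivery_stream`. The following is a minimal sketch under that assumption, not an official Searchable script. The helper name `searchable_stream_kwargs`, the `Name` value, and the role and bucket ARNs are placeholders, and the IAM role is assumed to already allow Firehose to write to the backup bucket.

```python
ENDPOINT_URL = "https://tracker.searchableanalytics.com/v1/cloudfront-logs"


def searchable_stream_kwargs(stream_name: str, token: str,
                             role_arn: str, backup_bucket_arn: str) -> dict:
    """Build keyword arguments for firehose.create_delivery_stream (hypothetical helper)."""
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # Source: Direct PUT
        "HttpEndpointDestinationConfiguration": {
            "EndpointConfiguration": {
                "Url": ENDPOINT_URL,
                "Name": "Searchable",  # display name only; an assumption
                "AccessKey": token,    # the sa_... token generated in Searchable
            },
            "RequestConfiguration": {"ContentEncoding": "GZIP"},
            # 1 MiB / 60 s keeps batches well under the 5 MB cap.
            "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
            "S3BackupMode": "FailedDataOnly",
            "RoleARN": role_arn,
            "S3Configuration": {"RoleARN": role_arn, "BucketARN": backup_bucket_arn},
        },
    }
```

With boto3 installed and AWS credentials configured, passing the result as `boto3.client("firehose").create_delivery_stream(**kwargs)` should create an equivalent stream. Attaching it to the distribution's log destination remains a separate step in the CloudFront console.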

## Verifying the connection

In Searchable:

1. Go to **LLM Analytics → Setup**
2. Look at the Amazon CloudFront card status
3. Click **Check** if it still shows "Waiting for first event"

| Status                      | What it means                                                                                                                          |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| **Waiting for first event** | The Firehose stream is configured but no AI bot has hit your site yet. Typical wait is a few hours for sites that are already indexed. |
| **Connected**               | Events are arriving. The card shows the count from the last 24 hours.                                                                  |

You can also confirm in AWS: **Amazon Data Firehose → your stream → Monitoring**. The stream's metrics show incoming records (from CloudFront) and successful HTTP deliveries (to Searchable). The S3 backup bucket should stay near-empty when "Failed data only" is set.
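
The same check can be scripted against CloudWatch instead of reading the Monitoring tab. This is a rough sketch: `firehose_metric_query` is a hypothetical helper that only builds the arguments for boto3's `get_metric_statistics`, and the metric names in the usage note are assumptions based on AWS's published Firehose metrics, so confirm them against what your stream's Monitoring tab actually lists.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def firehose_metric_query(stream_name: str, metric_name: str, hours: int = 1,
                          now: Optional[datetime] = None) -> dict:
    """Build keyword arguments for cloudwatch.get_metric_statistics (hypothetical helper)."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Firehose",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # five-minute buckets
        "Statistics": ["Sum"],
    }
```

For example, `boto3.client("cloudwatch").get_metric_statistics(**firehose_metric_query("my-stream", "IncomingRecords"))` would fetch the records CloudFront handed to the stream over the last hour. A metric such as `DeliveryToHttpEndpoint.Records` (name assumed) would cover the delivery side.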

## What Searchable receives

For each request that matches an AI-bot user agent, Searchable receives:

* HTTP method, path, and host (query strings stripped before storage)
* User agent
* Referer
* Country code (from `c-country`)
* Response status, response bytes
* Edge timing (`time-taken`)
* CloudFront edge metadata (`x-edge-request-id`, `x-edge-location`, `x-edge-result-type`) — preserved as `custom_properties` for debugging

Bodies, headers other than `User-Agent` / `Referer`, cookies, and full IPs are never sent or stored. The CloudFront edge request ID is also used to de-duplicate Firehose redeliveries server-side.
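
To make the de-duplication idea concrete, here is an illustrative sketch of keying records on `x-edge-request-id`, with the (timestamp, path, user-agent) fallback described under Troubleshooting. It is not Searchable's actual implementation. The function names are hypothetical, and mapping "path" to `cs-uri-stem` is an assumption.

```python
def dedup_key(record: dict) -> tuple:
    """Return a key identifying one request (illustrative, not Searchable's code)."""
    request_id = record.get("x-edge-request-id")
    if request_id:
        return ("id", request_id)
    # Fallback heuristic for records logged without the request ID field.
    return ("heuristic", record.get("timestamp"),
            record.get("cs-uri-stem"), record.get("cs-user-agent"))


def drop_redeliveries(records: list) -> list:
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

The heuristic key is weaker because two distinct requests for the same path, from the same user agent, within the same timestamp resolution would collapse into one. That is why selecting `x-edge-request-id` matters.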

## Multiple distributions

Each Firehose stream is bound to a single CloudFront distribution's log destination. If your domain spans multiple distributions (for example, separate distributions for `example.com` and `assets.example.com`):

1. Generate one integration token, or reuse one across distributions — both work
2. Create one Firehose stream per distribution, all pointing at the same HTTP endpoint
3. Searchable tags events by host, so you'll still see them split by domain in the dashboard

## Troubleshooting

<AccordionGroup>
  <Accordion title="Firehose 'Destination error count' is climbing">
    Open the Firehose stream in the AWS Console → **Monitoring → Destination error logs** for the specific error. The common ones:

    * **`401 Unauthorized`** — the `sa_…` token in the **Access key** field is missing, wrong, or has been revoked in Searchable. Re-paste the token, or generate a new one from **LLM Analytics → Setup**.
    * **`413 Payload Too Large`** — the Firehose buffer hint is too big. Drop the size hint to **1 MiB** and the interval to **60 seconds**, then save. Batches built after the change use the smaller size and should be accepted.
    * **`400 Bad Request`** — the CloudFront log output format isn't JSON, or the required fields aren't selected. Edit the CloudFront log destination, set **Output format** to **JSON**, and confirm `timestamp`, `sc-status`, `cs-method`, `cs-uri-stem`, `x-host-header`, and `cs-user-agent` are all checked.

    While errors are climbing, Firehose retries each batch for its configured retry duration and then writes the undelivered records to your S3 backup bucket. Once you fix the root cause, new records are delivered normally without further action.
  </Accordion>

  <Accordion title="Status stays on 'Waiting for first event'">
    CloudFront only delivers logs for traffic that's actually being served by the distribution. Things to check:

    * The CloudFront distribution is **Enabled** (not disabled or paused)
    * The Firehose stream is **Active** (not in `CREATING` or `DELETING` state)
    * Your domain in Searchable matches the alternate domain (CNAME) on the distribution (check **LLM Analytics → Setup → Confirm your domain**)
    * The log destination is attached to the distribution — easy to miss if the Firehose-creation wizard was opened separately and you forgot to come back to CloudFront's Logging tab

    If everything looks right, hit your site with a known AI user agent and wait \~60 seconds for the next Firehose batch:

    ```bash
    curl -H "User-Agent: GPTBot/1.0 (+https://openai.com/gptbot)" https://yourdomain.com/
    ```
  </Accordion>

  <Accordion title="Records keep landing in the S3 backup bucket">
    The "Failed data only" backup bucket only fills up when the HTTP endpoint rejects records — typically a token / config issue. Inspect a recent failed object in the bucket:

    * It contains the original CloudFront records along with an error message from the endpoint (`401`, `413`, `400`, etc.)
    * Use the message to identify the root cause (see the `Firehose 'Destination error count' is climbing` section above)
    * Once you fix the config, new records flow through to Searchable; the records already in the backup bucket aren't automatically re-delivered (Firehose treats the backup as terminal storage)
  </Accordion>

  <Accordion title="I see duplicated events">
    Firehose is at-least-once, so it can occasionally redeliver a batch after a transient network error. Searchable de-duplicates on `x-edge-request-id` server-side, so duplicate events don't appear in the dashboard — provided you selected `x-edge-request-id` in the CloudFront log fields. If you skipped it, dedup falls back to a (timestamp, path, user-agent) heuristic that's less precise. Edit the log destination, add `x-edge-request-id`, and save.
  </Accordion>

  <Accordion title="I'd rather pipe logs through a custom Lambda instead of Firehose">
    The same endpoint accepts plain NDJSON POSTs (no Firehose envelope), so you can post log records directly from a Lambda or any custom collector. Use:

    * `POST https://tracker.searchableanalytics.com/v1/cloudfront-logs`
    * `Authorization: Bearer sa_…` header
    * `Content-Type: application/x-ndjson` (gzip optional — set `Content-Encoding: gzip` if you compress)
    * One CloudFront log record per line, using the same field names as Standard Logging V2 (`cs-method`, `cs-uri-stem`, `x-host-header`, `cs-user-agent`, `sc-status`, `timestamp`, etc.)

    The endpoint replies with `204 No Content` on success.
  </Accordion>
</AccordionGroup>
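
For the custom Lambda relay option in the last entry above, the request could be assembled along these lines. This is a sketch using only the Python standard library, with `build_ndjson_request` as a hypothetical helper. It assumes each record is a dict keyed by the Standard Logging V2 field names, and it builds the request without sending it.

```python
import gzip
import json
import urllib.request

ENDPOINT_URL = "https://tracker.searchableanalytics.com/v1/cloudfront-logs"
REQUIRED_FIELDS = ("timestamp", "sc-status", "cs-method",
                   "cs-uri-stem", "x-host-header", "cs-user-agent")


def build_ndjson_request(records: list, token: str) -> urllib.request.Request:
    """Build (but do not send) a gzip-compressed NDJSON POST for the endpoint."""
    # Skip records the endpoint would drop for missing a required field.
    complete = [r for r in records if all(f in r for f in REQUIRED_FIELDS)]
    # NDJSON: one JSON object per line.
    body = "\n".join(json.dumps(r) for r in complete).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=gzip.compress(body),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/x-ndjson",
            "Content-Encoding": "gzip",
        },
    )
```

Sending it with `urllib.request.urlopen(build_ndjson_request(records, token))` should yield the `204 No Content` response described above. Filtering incomplete records client-side is a design choice that saves bandwidth, not something the endpoint requires.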

## Removing the integration

1. AWS Console → **CloudFront → your distribution → Logging** → delete the standard log destination
2. AWS Console → **Amazon Data Firehose → your stream** → delete the stream (and the S3 backup bucket if you no longer need it)
3. Searchable → **LLM Analytics → Setup → Tokens** → revoke the token

Both sides are independent — revoking the token alone is enough to stop ingestion immediately, even if the Firehose stream stays configured (its deliveries will start returning `401` and land in the backup bucket).

## Next steps

<CardGroup cols={2}>
  <Card title="See the data" icon="chart-line" href="/using-searchable/visibility-tracking">
    Open LLM Analytics to see which assistants are crawling your site.
  </Card>

  <Card title="Add Search Console" icon="google" href="/integrations/google-search-console">
    Correlate AI crawls with search demand.
  </Card>
</CardGroup>
