OpenAI rate limit monitoring, now with an 80% throttling alert

Elastic's OpenAI integration now polls rate limits every five minutes and checks them against real usage across every project and model. You can see RPM, TPM and IPM headroom before OpenAI hits you with an HTTP 429. A prebuilt alert fires when peak one-minute TPM utilization stays at or above 80% of your configured limit for three checks in a row, grouped by project and model, so one team's spike doesn't get lost in an org-wide average. OpenAI configures these limits per project, capped at or below your organization's overall ceiling. That means a single noisy project can burn through its own allocation while the rest of the org still has room, and until now that headroom stayed invisible until it ran out.

The first time most teams learn that their OpenAI project is close to a rate limit is when production traffic starts getting throttled with HTTP 429 responses. OpenAI enforces rate limits at the project level, not at the organization level, so a single noisy workload in one project can saturate that project's RPM or TPM ceiling while the rest of the organization still has plenty of room. Without OpenAI rate limit monitoring that compares configured limits against actual consumption, headroom is invisible until it runs out.

OpenAI API monitoring in Elastic: what's new

We're pleased to announce updates to the Elastic OpenAI integration. On top of the existing token usage and audit log coverage, the integration now polls OpenAI's List project rate limits Admin API per project and rolls the results up into both per-project and org-wide views. A new openai.rate_limits dataset feeds two new dashboard panels and a prebuilt threshold alert rule, so teams can see how close each project and model is to being throttled before users experience production impact.

What the integration polls: Usage, Audit Logs and Rate Limits APIs

The Elastic OpenAI integration is built for teams running applications on the OpenAI API platform. The people accountable for it are the developers shipping those services, the platform and SRE teams keeping them running, and the finance and FinOps owners answering "how much is our software consuming, and are we within our capacity envelopes?"

The integration collects from three OpenAI Admin API surfaces:

Usage API for usage counts across tokens, characters, seconds, sessions, bytes, and images, with project, model, user, and API key attribution where that Usage API surface provides it.
Audit Logs API for organization audit events such as API key creation, project changes, and user activity.
Rate Limits API for configured RPM, TPM, and IPM ceilings per project and per model, plus other limit dimensions where available; the new headroom views compare the per-minute request, token, and image limits against actual consumption.

Because everything is pulled from the Admin API at the organization level, platform teams get a unified view across every project, model and API key, alongside the rest of the telemetry they already monitor in Elastic, without touching application code or installing SDKs in every service.

What teams need to monitor when running on the OpenAI API

Four operational needs come up over and over for teams running production workloads on the OpenAI API.

Token usage attribution

A single OpenAI organization usually serves many internal teams and products, each with its own project, its own mix of models (GPT-5.5 Pro for the hardest reasoning tasks, GPT-5.4 for everyday traffic, GPT-5.4 nano for high-volume low-cost requests, and specialized models for images, audio and embeddings) and its own user and API key footprint. When usage patterns shift, the platform team needs to know which project, model and key is driving the change so they can attribute consumption back to the right team and decide which workloads should move to a cheaper model.

Rate limit headroom

OpenAI enforces per-model rate limits on requests per minute (RPM) and tokens per minute (TPM) at the project level, not at the organization level. The first time a team learns they're close to the ceiling is usually when production traffic starts being throttled. Surfacing configured limits alongside actual consumption, per project and per model, lets platform teams see headroom in advance, plan capacity, and request limit increases before users feel the impact.

Audit visibility

Security and compliance teams need to know who created API keys, who changed project settings, who invited or removed users, and when. The integration ingests OpenAI's organization audit log so those events land in the same Elastic deployment as the usage data, ready for correlation, alerting and long-term retention. Audit log ingestion has two prerequisites: audit logging must be enabled in your OpenAI organization, and the Admin API key used by the integration must belong to an Organization Owner, because OpenAI restricts audit-log access to that role. Without both, the openai.audit dataset stays empty.

Granularity for every audience

The same data needs to serve different cadences. SREs want one-minute resolution to catch spikes and trigger throttling alerts. Platform engineers want hourly views for capacity planning. Finance and FinOps owners want daily totals that roll up cleanly for reporting. A single integration that exposes all three granularities removes the need to maintain separate pipelines for each audience.

How does Elastic poll the OpenAI Admin API?

The integration runs on Elastic Agent and uses the CEL input to poll OpenAI's Admin API on a schedule. Authentication uses a single Admin API key, stored as an encrypted Fleet secret and redacted from agent logs. From a single configuration, the integration ingests datasets from three Admin API sources:

Usage API datasets (per project, model, user and API key, with each dataset tracking the unit OpenAI exposes for that workload):

openai.completions for chat and completion token counts (input, output, cached, audio input/output).
openai.embeddings for embedding token counts.
openai.moderations for moderation token counts.
openai.images for image counts and size dimensions.
openai.audio_speeches for text-to-speech character counts.
openai.audio_transcriptions for speech-to-text duration in seconds.
openai.code_interpreter_sessions for code interpreter session counts.
openai.vector_stores for vector store byte counts.

Audit Logs API dataset:

openai.audit for organization audit events such as API key creation, project changes and user activity.

Rate Limits API dataset:

openai.rate_limits (new) for snapshots of configured rate limits per project and per model, including RPM, TPM, IPM, and other limit fields where OpenAI returns them, paged across all active projects on each poll.
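
To make "paged across all active projects on each poll" concrete, here is a minimal Python sketch of cursor-style paging. It is an illustration only: the integration itself uses the CEL input, and the page fields used here (data, has_more, last_id) and the fetch_page callable are assumptions made for this example, not the integration's code.

```python
from typing import Callable, Iterator, Optional

# Hypothetical page shape: {"data": [...], "has_more": bool, "last_id": str}.
FetchPage = Callable[[str, Optional[str]], dict]


def iter_rate_limits(project_id: str, fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every rate limit entry for one project, following the cursor."""
    after: Optional[str] = None
    while True:
        page = fetch_page(project_id, after)
        for item in page.get("data", []):
            # Tag each row with the project it was fetched for.
            yield {"project_id": project_id, **item}
        if not page.get("has_more"):
            break
        after = page.get("last_id")
        if after is None:
            break  # Defensive: no cursor to continue from.


def snapshot(project_ids, fetch_page: FetchPage) -> list:
    """Collect one flat snapshot of limits across all given projects."""
    return [row for pid in project_ids for row in iter_rate_limits(pid, fetch_page)]
```

Passing the page fetcher in as a callable keeps the paging logic separate from the HTTP call, so it can be exercised with canned pages.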

Ingest pipelines handle parsing and field mapping so the data lands queryable, dashboard-ready, and aligned with the rest of Elastic Observability. Because the data is pulled from the Admin API at the organization level, you get this visibility without any application-side instrumentation or SDK changes.

What you need to set up OpenAI monitoring in Elastic

To get started with the Elastic OpenAI integration, you will need:

An Elastic deployment:
- Elastic Cloud Hosted (ECH) running a recent supported version, or
- Elastic Cloud Serverless, which has no version requirement and works out of the box.
An OpenAI organization with Admin API access.
An Admin API key provisioned by an Organization Owner from the OpenAI platform settings under Settings → Admin keys. Owner-level keys are required if you want the openai.audit dataset to populate.
Audit logging enabled in your OpenAI organization, if you want audit data.
Elastic Agent installed on a host with outbound HTTPS access to api.openai.com, or the agentless deployment option.

How to set up the OpenAI integration

Generate an Admin API key in the OpenAI platform settings.
In Kibana, go to Management → Integrations, search for OpenAI and click Add.
Choose your deployment mode: agentless for a zero-install experience, or Elastic Agent on your own host.
Tune the defaults if you need to. Each dataset has sensible defaults:
- Usage datasets poll every 5 minutes with 1-minute buckets. Each dataset exposes a finalization_grace setting that controls when a per-minute usage bucket is considered final. The default, 0s, favors freshness: a bucket is ingested as soon as its minute closes. OpenAI does not document this, but the integration team observed against the live API that bucket counts can keep rising for some time after the minute closes, so per-minute totals at 0s can undercount during heavy bursts. Setting finalization_grace to 15m, the recommended value for accurate counts, holds each bucket back until the grace window has elapsed, which brings ingested counts much closer to what the Usage API eventually reports. A small residual undercount can remain during very high-volume bursts, because OpenAI can revise per-minute counts upward beyond any fixed grace window. The cost is that dashboards and the rate limit headroom alert lag by the grace period.
- Rate limits polls every 5 minutes. Each poll captures the full set of configured RPM, TPM and IPM limits per project and per model.
- Audit logs polls on a separate cadence and ingests all org-level audit events. Remember the prerequisites: audit logging enabled in your OpenAI organization and an Organization-Owner Admin API key.
Open the integration assets in Kibana. Within minutes, usage, audit, and rate-limit data starts flowing, and the prebuilt dashboards and alert rule are ready to use.
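
The finalization_grace tradeoff in the usage dataset defaults can be pictured with a small Python sketch. The exact comparison the integration performs is not described here, so the is_final helper and its "bucket end plus grace" rule are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone


def is_final(bucket_start: datetime, now: datetime,
             grace: timedelta = timedelta(0),
             bucket_width: timedelta = timedelta(minutes=1)) -> bool:
    """Treat a bucket as final once its minute has closed and the grace has elapsed."""
    return now >= bucket_start + bucket_width + grace


now = datetime(2025, 1, 1, 12, 20, tzinfo=timezone.utc)
recent = datetime(2025, 1, 1, 12, 10, tzinfo=timezone.utc)

# With no grace, the 12:10 bucket is ingested as soon as 12:11 has passed.
fresh = is_final(recent, now)
# With a 15-minute grace, the same bucket is held back until 12:26.
held = is_final(recent, now, grace=timedelta(minutes=15))
```

In this sketch a longer grace simply shifts the moment a bucket becomes eligible for ingestion, which is why dashboards and the alert lag by the same amount.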

For the full configuration reference, see the OpenAI integration documentation.

What do the OpenAI rate limit dashboards show?

The integration ships with a prebuilt Kibana dashboard that gives you an immediate, queryable view of your organization's OpenAI API consumption. The overview pulls headline numbers (total tokens, total invocations, top models, top projects, top users and top API keys) into one place for a quick read on the state of your OpenAI usage. The screenshot below shows the OpenAI usage overview dashboard:

From the overview, you can drill into the views that answer the operational needs introduced earlier.

Token usage by model, project and user

The token metrics panels break down token consumption (input, output, cached input, audio input/output) for the token-based datasets (openai.completions, openai.embeddings, openai.moderations) by model and over time. This is the view that tells you where your token budget is actually going, which workloads are getting the most out of prompt caching, and which projects, users or API keys are driving the bulk of your token consumption. Filter by project or model to scope the view to a single team or product. Image, audio and vector-store consumption (measured in images, characters, seconds, sessions and bytes rather than tokens) is reported in dedicated sections of the same dashboard. The token metrics panels look like this:

Rate limit headroom: per project and model (new)

The new rate limit headroom panel joins the configured limits from openai.rate_limits against actual consumption from the usage datasets, per project_id and model. For each row it reports peak one-minute used, the configured limit, and utilization percentage for requests (RPM), tokens (TPM), and images (IPM). Rows are sorted by highest TPM utilization first, with RPM and IPM utilization as tie-breakers, so the highest token-pressure rows appear at the top of the list. Utilization is computed against the peak one-minute bucket in the lookback window, never a 5- or 15-minute sum, so the panel measures peak-minute pressure against the one-minute ceiling instead of averaging it away. The per-project rate limit headroom panel is shown below:
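
As a rough illustration of the arithmetic behind this panel (not the integration's actual query), the Python sketch below computes TPM utilization from the peak one-minute bucket. The row shape and the function name are hypothetical; RPM and IPM would follow the same pattern with request and image counts.

```python
from collections import defaultdict


def peak_minute_utilization(buckets, limits):
    """Return per-(project, model) peak one-minute usage against its limit.

    buckets: iterable of (project_id, model, minute, tokens) usage rows.
    limits:  {(project_id, model): configured tokens-per-minute limit}.
    """
    # Sum rows that share a minute first (for example, several API keys).
    per_minute = defaultdict(int)
    for project_id, model, minute, tokens in buckets:
        per_minute[(project_id, model, minute)] += tokens

    # Keep only the busiest minute per project and model.
    peaks = defaultdict(int)
    for (project_id, model, _minute), total in per_minute.items():
        key = (project_id, model)
        peaks[key] = max(peaks[key], total)

    rows = []
    for (project_id, model), peak in peaks.items():
        limit = limits.get((project_id, model))
        if not limit:
            continue  # No configured limit to compare against.
        rows.append({
            "project_id": project_id,
            "model": model,
            "peak": peak,
            "limit": limit,
            "utilization_pct": 100.0 * peak / limit,
        })
    # Highest pressure first, mirroring the panel's sort order.
    rows.sort(key=lambda r: r["utilization_pct"], reverse=True)
    return rows
```

With a 1,000 TPM limit and two minutes of 600 and 900 tokens, this reports 90% utilization, where summing the whole window against the one-minute ceiling would wrongly suggest 150%.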

Rate limit headroom: org-wide rollup by model (new)

Because OpenAI enforces limits per project, a single per-project view doesn't answer "how much total capacity do we have for gpt-image-2 across the organization?" The new org-wide rollup panel reports the same RPM, TPM and IPM metrics summed across all active projects for each model. Both the limit and the usage figures are indicative upper bounds rather than exact org-wide numbers (the limit is a sum of per-project ceilings; the usage is a sum of each project's peak minute, which may fall in different minutes across projects), but together they give platform teams a single number to plan against when they're sizing a new workload or deciding which project should absorb a new use case. The org-wide rollup panel is shown below:
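
To see why both rollup figures are upper bounds, here is a minimal Python sketch of the summation described above. The row keys (model, peak, limit) are hypothetical names chosen for the example.

```python
def org_rollup(rows):
    """Sum per-project limits and per-project peak minutes for each model.

    rows: iterable of dicts with "model", "peak" and "limit" keys,
    one per project and model.
    """
    out = {}
    for row in rows:
        agg = out.setdefault(row["model"], {"limit_sum": 0, "peak_sum": 0, "projects": 0})
        agg["limit_sum"] += row["limit"]
        # Each project's peak may come from a different minute, so this
        # sum can exceed any minute the organization actually experienced.
        agg["peak_sum"] += row["peak"]
        agg["projects"] += 1
    for agg in out.values():
        limit_sum = agg["limit_sum"]
        agg["utilization_pct"] = 100.0 * agg["peak_sum"] / limit_sum if limit_sum else None
    return out
```

Two projects that each peak at different minutes still add together here, which is exactly why the usage figure is indicative and not an exact org-wide peak.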

Behind the scenes, version 2.3.0 also normalizes usage counts into shared openai.base.usage_tokens and openai.base.usage_images fields across the usage datasets, so the headroom panels render correctly even when only a subset of usage datasets is enabled.

OpenAI audit log activity in Elastic

The audit panels surface organization audit events (API key creations, project changes, user invitations and login activity) so security and compliance teams can review who did what, when, alongside the usage data. The audit dashboard is shown below:

Out-of-the-box alert for rate limit headroom (new)

The integration ships with a prebuilt threshold alert rule template, [OpenAI] Rate limit headroom low, ready to install in one click from the integration's Assets tab.

The default behavior is tuned to be useful out of the box:

Runs every 5 minutes.
Looks back over the last 15 minutes.
Fires after 3 consecutive matches.
Triggers when peak one-minute TPM utilization reaches or exceeds 80% of the configured project/model limit.
Groups alerts by project_id::model, so an incident in one project on one model doesn't get lost in an org-wide aggregate.

The 80% threshold and other parameters are editable in Kibana after you install the rule, so each team can tune the alert to its own risk tolerance.
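
The default rule behavior can be modeled in a few lines of Python. This is a simplified sketch of "3 consecutive matches at or above 80%, grouped by project_id::model", not Kibana's rule engine; the HeadroomAlert class and its input shape are hypothetical.

```python
from collections import defaultdict


class HeadroomAlert:
    """Track consecutive threshold matches per "project_id::model" group."""

    def __init__(self, threshold_pct: float = 80.0, consecutive: int = 3):
        self.threshold_pct = threshold_pct
        self.consecutive = consecutive
        self._streaks = defaultdict(int)

    def check(self, utilization_by_group: dict) -> list:
        """Record one rule run and return the groups that should fire."""
        # A group with no data in this run breaks its streak.
        for group in list(self._streaks):
            if group not in utilization_by_group:
                self._streaks[group] = 0

        firing = []
        for group, pct in utilization_by_group.items():
            if pct >= self.threshold_pct:
                self._streaks[group] += 1
            else:
                self._streaks[group] = 0
            if self._streaks[group] >= self.consecutive:
                firing.append(group)
        return sorted(firing)
```

Because streaks are kept per group, a project that dips below the threshold on one run starts over, while a neighbor on the same model keeps counting independently.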

Custom OpenAI alerts and SLOs in Elastic Observability

As with every other Elastic integration, all OpenAI metrics and audit data are available to every capability in Elastic Observability, including SLOs, alerting, custom dashboards and in-depth log exploration.

For example, to keep token consumption under control across a single project, create a custom threshold rule that sums tokens from the relevant usage dataset and fires when the daily or hourly total crosses your budget. To track model mix, define an SLO in Elastic Observability that treats OpenAI requests on your approved lower-cost model families as the "good events" (the ones that count as meeting the target) and all OpenAI requests as the "total events", grouped by openai.base.project_id and openai.base.user_id. The ratio becomes your SLI; a 7-day rolling 80% target quickly surfaces projects and users overusing more expensive models.
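
The model-mix SLI is simply a ratio of good events to total events per group. The Python sketch below shows that arithmetic on in-memory rows; the tuple layout and the model names "small" and "large" are placeholders, and in practice the SLO would be defined in Elastic Observability and not computed by hand.

```python
from collections import defaultdict


def model_mix_sli(events, approved_models):
    """Return {(project_id, user_id): good / total} over request counts.

    events: iterable of (project_id, user_id, model, request_count).
    approved_models: set of model names that count as "good events".
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for project_id, user_id, model, count in events:
        key = (project_id, user_id)
        total[key] += count
        if model in approved_models:
            good[key] += count
    return {key: good[key] / total[key] for key in total if total[key] > 0}


events = [
    ("p1", "u1", "small", 90),
    ("p1", "u1", "large", 10),
    ("p2", "u2", "large", 5),
]
sli = model_mix_sli(events, approved_models={"small"})
# Groups whose ratio falls below the 80% target.
breaching = [key for key, ratio in sli.items() if ratio < 0.80]
```

Here the first group sits at 0.9 and meets an 80% target, while the second group, which only used the more expensive model, is flagged.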

Choosing OpenAI usage data granularity

The OpenAI usage data collected by the integration supports several cadences, with a fidelity-versus-freshness tradeoff to be aware of. One-minute usage buckets feed the rate limit headroom alert and near-real-time throttling notifications when a project approaches its ceiling. With finalization_grace set to 0s (the default), per-minute counts arrive within minutes but can undercount during heavy bursts. Raising finalization_grace to 15m brings counts much closer to the reconciled totals, at the cost of delaying the dashboards and alert by the grace period, and a small residual undercount can still remain for the busiest buckets. Hourly views support operational monitoring and capacity planning across projects and models. Daily aggregates roll up cleanly for FinOps reporting and reconciliation. An out-of-the-box alert ships for rate limit headroom ([OpenAI] Rate limit headroom low), and the same data can be reused for custom usage and budget thresholds without building anything from scratch.
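
The relationship between the three granularities is plain aggregation: hourly and daily views are sums over the same one-minute buckets. A minimal Python sketch, assuming in-memory (timestamp, tokens) pairs in place of an Elasticsearch aggregation:

```python
from collections import defaultdict
from datetime import datetime


def roll_up(minute_buckets, granularity: str) -> dict:
    """Sum one-minute token buckets into hourly or daily totals.

    minute_buckets: iterable of (datetime, tokens) pairs.
    granularity: "hour" or "day".
    """
    totals = defaultdict(int)
    for ts, tokens in minute_buckets:
        if granularity == "hour":
            key = ts.replace(minute=0, second=0, microsecond=0)
        elif granularity == "day":
            key = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        else:
            raise ValueError(f"unsupported granularity: {granularity}")
        totals[key] += tokens
    return dict(totals)


buckets = [
    (datetime(2025, 1, 1, 10, 5), 100),
    (datetime(2025, 1, 1, 10, 59), 50),
    (datetime(2025, 1, 1, 11, 0), 25),
]
hourly = roll_up(buckets, "hour")
daily = roll_up(buckets, "day")
```

Since the coarser views are sums of the minute buckets, any undercount in a minute bucket carries through to the hourly and daily totals built from it.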

Get started with OpenAI monitoring in Elastic

The Elastic OpenAI integration is available today in Elastic Cloud, including Elastic Cloud Hosted and Elastic Cloud Serverless. To get started, sign up for a free Elastic Cloud trial, provision an Admin API key in the OpenAI platform settings, and add the OpenAI integration from Kibana under Management → Integrations.

Within minutes you'll have token usage, audit activity, and rate limit headroom data flowing into Elasticsearch, with the prebuilt dashboards and the new throttling alert ready to use.
