Certificate expiry is the only class of production outage that prints its own end date at issuance. There is no anomaly to detect, no model to train, no unknown. And yet, expired certificates still take down payment gateways, internal APIs, and observability stacks every quarter, because the gap between knowing a certificate will expire and acting on it runs through spreadsheets, calendar reminders, and whoever happens to notice first. This post closes that gap: Elastic Synthetics, Osquery, ES|QL, and Elastic Workflows (generally available in 9.4), wired into a closed-loop TLS certificate monitoring pipeline that detects an expiring certificate, finds every affected host, rotates it, and verifies it. No human intervention required.
The reasoning behind the design (the trust deficit, the remediation gap, and the sense → think → act → verify loop that frames everything below) is laid out in Automated reliability: The architecture of self-healing enterprises. Read that one for the why. Certificates are the right first workload for this pattern because the signal is unambiguous, the corrective action is well-understood, and the blast radius of the automation itself is small enough to build trust before applying the same pattern to noisier domains.
The full source (synthetics journeys, sample workflow YAMLs, sample ES|QL alert queries, the ingest pipeline, and a short-lived-cert local lab) lives in the aiops-synthetics-lab repository. The article walks through the patterns; the repo has the complete artifacts.
Mapping the build to the closed loop
Each phase of the closed loop maps to a specific Elastic component and a corresponding artifact in the companion repository. The table below is a quick reference; the diagram that follows shows how data flows between the phases.
| Phase | Component | Lives in repo at |
|---|---|---|
| Sense | Synthetics browser journeys + Osquery certificates pack | journeys/, helpers/tls.ts, docs/ingest-pipeline-synthetics-browser.json |
| Think | ES|QL alert rule on synthetics-* deduplicated by CN + tags | docs/elastic-samples/alerts/ |
| Act | Smart Certificate Rotation Escalation and Remediation workflow | docs/elastic-samples/workflows/01-...yaml |
| Verify | Synthetics re-reads the rotated chain; canary curl_certificate confirms the fleet | docs/elastic-samples/workflows/02-...yaml, local-lab/ |
Figure 1: Each phase maps to a specific Elastic primitive, and each phase only relies on clean data from the previous one. No out-of-band coordination, no shared mental state.
Discovery: picking the right two methods
There are four common ways to discover certificates; only two of them deliver rapid time to value, as the table below shows.
| Method | Mechanism | Verdict |
|---|---|---|
| Known endpoints | Playwright browser tests validate the served chain | ✅ Validates exactly what the client sees, including intermediates |
| OS interrogation | Osquery queries local host keystores | ✅ Finds certificates on backend services no synthetic ever calls |
| CA log polling | Monitor CA issuance logs | ✗ Noisy — certificates are frequently issued and never deployed |
| Network discovery | Subnet scanners extract TLS handshakes | ✗ Triggers IDS, breaks legacy OT, friction with security ops |
Endpoint synthetics gives you the client view. Osquery gives you the host view. A load balancer can serve a valid public cert while a downstream internal host quietly serves an expired one to its peers; invisible from the outside, fully visible from the keystore. You need both.
The left side (yellow) shows agent-based monitoring: Elastic Agent runs on each host and uses the Osquery certificates pack to read the local operating system keystore — /etc/ssl on Linux, CurrentUser and LocalMachine stores on Windows. This is the host view: certificates that services hold regardless of whether those services expose a public endpoint. The right side (blue) shows agentless monitoring: Elastic Synthetics executes browser journeys against reachable HTTPS endpoints and captures the full certificate chain as a client would see it. Both paths normalize output to ECS tls.server.x509.* fields, so the workflow can join them on common_name without any translation step.
Figure 2: Synthetics validates what clients see; Osquery enumerates what hosts hold. Both feed ECS-mapped tls.server.x509.* so the workflow can join them on common_name without translation.
Architecture overview
The architecture has four horizontal layers, each corresponding to one phase of the loop. Signal collection (top) shows the three data sources: Synthetics is the primary alert driver, Osquery is queried by the workflow for host discovery, and canary curl_certificate validates impact on demand. Alerting runs an ES|QL rule on synthetics-* and fires when a cert falls within the expiry window. Smart escalation and remediation is the workflow itself: pre-flight severity mapping, host discovery via Osquery, then branching into auto-rotation for certs expiring within 7 days, Jira ticketing for 7-90 day windows, or PagerDuty for anything that needs a human. Impact validation runs the canary workflow on demand to measure blast radius before and after any rotation.
Figure 3: Four layers, one closed loop. Synthetics fires the alert, Osquery maps the target hosts, and canary validation measures impact.
Prerequisites
| Requirement | Notes |
|---|---|
| Elastic Stack 9.3+ | ES|QL LOOKUP JOIN is GA in 9.1; Elastic Workflows is GA in 9.4 (technical preview in 9.3) |
| Elastic Agent via Fleet | Hosts with the Osquery Manager integration enabled |
| Elastic Synthetics | A Synthetics project and a private (or managed) location |
| Network egress | Cluster can reach the Ansible Tower, Jira, Slack, and PagerDuty connectors |
| Node.js 18 LTS+ | For running the synthetics project locally |
Sense: signal collection
Clone the companion repository to get the complete set of artifacts used throughout this post: Synthetics browser journeys, Osquery pack configuration, Elastic Workflow YAML definitions, ES|QL alert queries, the ingest pipeline definition, and a local lab for testing against short-lived certificates. The npm ci step installs the Node.js dependencies required for running browser journeys locally.
git clone https://github.com/adrianchen-es/aiops-synthetics-lab.git
cd aiops-synthetics-lab
npm ci
CSV-driven endpoints
Endpoints are managed through a CSV file, so adding new hosts requires no code changes. Just edit journeys/tls-browser/tls-target-hosts.csv:
host,criticality,assertionText,assertionSelector
elastic.co,critical,Elastic,h1
internal-api.mycompany.com,high,,
payment-gateway.internal,critical,,
criticality becomes a journey tag (criticality:critical) which flows into the alert payload and drives escalation routing — no CMDB lookup required for basic triage. npm run generate:tls-targets rebuilds the TypeScript host module from the CSV and runs automatically before npm test and npm run push. The repo has additional groups under journeys/tls/, journeys/demos/, and journeys/kibana/ for non-browser TLS, badssl-style demos, and a multi-step Kibana login check, respectively; npm run push:tls and friends scope deploys to one folder.
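The generated module has a simple, stable shape; an illustrative sketch (the generator script in the repo is authoritative, and the interface name here is hypothetical):

```typescript
// Illustrative shape of helpers/tlsTargetHosts.tls-browser.generated.ts,
// rebuilt from the CSV by `npm run generate:tls-targets`.
export interface TlsTargetHost {
  host: string;
  criticality?: string;
  assertionText?: string;
  assertionSelector?: string;
}

export const TLS_TARGET_HOSTS: TlsTargetHost[] = [
  { host: 'elastic.co', criticality: 'critical', assertionText: 'Elastic', assertionSelector: 'h1' },
  { host: 'internal-api.mycompany.com', criticality: 'high' },
  { host: 'payment-gateway.internal', criticality: 'critical' },
];
```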
The browser TLS journey
Each host runs its own journey. Certificate inspection relies on Node's built-in tls module, where cert.fingerprint256 is precomputed by OpenSSL during the handshake, so it incurs no extra cost.
import { journey, step, expect } from '@elastic/synthetics';
import { TLS_TARGET_HOSTS } from '../../helpers/tlsTargetHosts.tls-browser.generated';
import { fetchCertInfo, checkCertTrusted, logCertInfo } from '../../helpers/tls';
for (const { host, criticality, assertionText, assertionSelector } of TLS_TARGET_HOSTS) {
journey(
{ name: `TLS Browser Check - ${host}`, tags: criticality ? [`criticality:${criticality}`] : [] },
({ page }) => {
step(`TLS Validation for ${host}:443`, async () => {
// Route-stub: gives Synthetics UI the actual hostname instead of about:blank.
await page.route('**/*', route => route.fulfill({ status: 200, body: 'TLS check context' }));
await page.goto(`https://${host}`, { waitUntil: 'commit' });
const cachedCert = await fetchCertInfo(host, 443);
const trusted = await checkCertTrusted(host, 443);
logCertInfo(host, 443, cachedCert); // emits TLS_CERT stdout line for ingest pipeline
expect(cachedCert.sha256).toMatch(/^([0-9A-F]{2}:){31}[0-9A-F]{2}$/);
expect(cachedCert.validTo.getTime(), `Certificate expired: ${host}`).toBeGreaterThan(Date.now());
expect(trusted, `CA untrusted: ${host}`).toBe(true);
});
step(`Navigate to ${host} and verify page content`, async () => {
await page.unroute('**/*'); // drop the route stub so the real page loads (page routes take precedence over context routes)
const response = await page.goto(`https://${host}`, { waitUntil: 'domcontentloaded', timeout: 30_000 });
expect(response).not.toBeNull();
if (assertionText && assertionSelector) {
await expect(page.locator(assertionSelector)).toContainText(assertionText);
}
});
}
);
}
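fetchCertInfo and checkCertTrusted wrap Node's built-in tls module, as noted above. A minimal sketch of what fetchCertInfo does under that assumption (the repo's helpers/tls.ts is authoritative):

```typescript
import tls from 'node:tls';

export interface CertInfo {
  sha256: string; // colon-separated hex, as OpenSSL formats fingerprint256
  validTo: Date;
}

export function fetchCertInfo(host: string, port: number): Promise<CertInfo> {
  return new Promise((resolve, reject) => {
    // rejectUnauthorized: false lets us inspect expired or untrusted chains;
    // trust is asserted separately (checkCertTrusted reads socket.authorized).
    const socket = tls.connect({ host, port, servername: host, rejectUnauthorized: false }, () => {
      const cert = socket.getPeerCertificate();
      socket.end();
      resolve({ sha256: cert.fingerprint256, validTo: new Date(cert.valid_to) });
    });
    socket.setTimeout(10_000, () => {
      socket.destroy();
      reject(new Error(`TLS timeout: ${host}:${port}`));
    });
    socket.on('error', reject);
  });
}
```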
Both successful and failed runs index tls.server.x509.* into synthetics-*; the alert rule reads both.
The ingest pipeline (don't skip this)
logCertInfo() emits a structured TLS_CERT stdout line per check. Without an ingest pipeline, that lands under synthetics.payload.message and the Kibana TLS Summary card stays empty even when journeys are succeeding. Most certificate-monitoring tutorials skip this step; it caught me out the first time.
curl -X PUT -H "Content-Type: application/json" \
-u "elastic:${ELASTIC_PASSWORD}" \
"https://<YOUR_CLUSTER_HOST>:9243/_ingest/pipeline/synthetics-browser%40custom" \
--data-binary @docs/ingest-pipeline-synthetics-browser.json
The pipeline parses the embedded JSON into tls.server.x509 and tls.server.hash, stripping colons from the fingerprint hex to match the UI's expected format.
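For orientation, the pipeline's shape is roughly the following. This is an illustrative sketch, not the repo file verbatim; the exact processors and grok pattern live in docs/ingest-pipeline-synthetics-browser.json:

```json
{
  "processors": [
    { "grok": {
        "field": "synthetics.payload.message",
        "patterns": ["TLS_CERT %{GREEDYDATA:_tmp.tls_json}"],
        "ignore_failure": true } },
    { "json": { "field": "_tmp.tls_json", "target_field": "tls.server", "ignore_failure": true } },
    { "gsub": { "field": "tls.server.hash.sha256", "pattern": ":", "replacement": "", "ignore_missing": true } },
    { "remove": { "field": "_tmp", "ignore_missing": true } }
  ]
}
```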
Deploying monitors via CI/CD
export KIBANA_URL="https://your-deployment.kb.us-east-1.aws.elastic-cloud.com"
export SYNTHETICS_API_KEY="<your-kibana-api-key>"
npm run push
Adding a new service is a CSV row in a PR, not a Kibana form fill; the same monitors-as-code idea covered for Logstash in Logstash Pipeline Management & Configuration with GitOps. The repo's GitHub Actions workflow runs tsc --noEmit, test:dry, and test:unit on every PR (none of which need network access, so the loop catches journey breakage before merge).
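A sketch of what that PR gate looks like with standard actions (the repo's workflow file under .github/workflows/ is authoritative):

```yaml
name: synthetics-ci
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18
      - run: npm ci
      - run: npx tsc --noEmit   # type-check journeys without emitting JS
      - run: npm run test:dry   # journey dry run, no network required
      - run: npm run test:unit
```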
Osquery host certificate inventory
Browser synthetics only sees certificates bound to active web listeners. Backend databases, internal gRPC services, and message queue brokers don't surface in any Playwright test. The Osquery pack queries each host's certificate store daily and writes to logs-osquery_manager.result* with these ECS mappings:
"ecs_mapping": {
"tls.server.x509.subject.common_name": { "field": "common_name" },
"tls.server.x509.not_before": { "field": "not_valid_before" },
"tls.server.x509.not_after": { "field": "not_valid_after" },
"tls.server.hash.sha1": { "field": "sha1" },
"file.path": { "field": "path" }
}
Without ECS mapping, certificate fields land in proprietary osquery.* columns. With it, tls.server.x509.subject.common_name is directly matchable against the Synthetics alert payload, which is what the workflow's host-discovery step depends on.
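For context, that ecs_mapping sits inside a scheduled pack query against osquery's certificates table. A minimal sketch of the pack entry (the repo's pack config is authoritative; the 86400-second interval matches the daily cadence described above):

```json
{
  "queries": {
    "certificate_inventory": {
      "query": "SELECT common_name, not_valid_before, not_valid_after, sha1, path FROM certificates;",
      "interval": 86400,
      "ecs_mapping": {
        "tls.server.x509.subject.common_name": { "field": "common_name" },
        "tls.server.x509.not_before": { "field": "not_valid_before" },
        "tls.server.x509.not_after": { "field": "not_valid_after" },
        "tls.server.hash.sha1": { "field": "sha1" },
        "file.path": { "field": "path" }
      }
    }
  }
}
```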
CN matching is necessary but not sufficient. Modern certificates carry many SANs and wildcard subjects. The host-discovery query joins on common_name because that is what the alert payload exposes most cleanly; in environments with heavy wildcard usage (*.svc.cluster.local, *.internal), extend the join to also match against tls.server.x509.alternative_names and filter out shared wildcards renewed centrally — otherwise one Let's-Encrypt-backed wildcard renewal will fan out the workflow to every host in the cluster on the same day.
Think: alerting on Synthetics data
The alert rule runs against synthetics-*, not Osquery. Synthetics captures the certificate as the client actually sees it, including intermediate-chain issues that host keystores can't detect. Osquery is consulted later, inside the workflow, to find which hosts to act on.
Create an Elasticsearch query rule with this ES|QL (full file: docs/elastic-samples/alerts/01-certificate-approaching-30d-expiring.md — note the repo file uses a 30-day window for stricter alerting; the 90-day version below matches my own preferred default):
FROM synthetics-*,*:synthetics-*
| WHERE tls.server.x509.not_before IS NOT NULL
AND tls.server.x509.not_after IS NOT NULL
AND tls.server.x509.subject.common_name IS NOT NULL
| STATS
record_count = COUNT(*),
tls.server.x509.not_after = MAX(tls.server.x509.not_after),
tls.server.x509.not_before = MIN(tls.server.x509.not_before),
monitor.name = VALUES(monitor.name),
@timestamp = MAX(@timestamp)
BY tls.server.x509.subject.common_name, tags
| WHERE NOW() + 90d > tls.server.x509.not_after
OR tls.server.x509.not_before > NOW()
| EVAL days_until_expiry = DATE_DIFF("days", NOW(),
    tls.server.x509.not_after)
| KEEP @timestamp, tls.server.x509.subject.common_name,
tls.server.x509.not_before, tls.server.x509.not_after,
monitor.name, tags, days_until_expiry
| SORT @timestamp
| LIMIT 1000
The STATS BY common_name, tags aggregation deduplicates to one row per certificate per criticality tag, so the rule fires once per expiring cert, not once per monitor execution per location. MAX(not_after) and MIN(not_before) pick the definitive validity window across all observed documents.
Two tuning notes worth flagging:
- ACME certificates (90-day validity) always sit inside a 90-day expiry window, so they match on every check cycle. Run a separate rule with a tighter threshold (< 14 days) and a different escalation path for them; a sketch follows below.
- For cross-cluster index patterns (*:synthetics-*), confirm tags is populated consistently across clusters before relying on it for routing.
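A sketch of that separate ACME-focused rule, keying the ACME class off observed validity duration (the 90-day and 14-day thresholds are illustrative):

```esql
FROM synthetics-*,*:synthetics-*
| WHERE tls.server.x509.not_after IS NOT NULL
    AND tls.server.x509.subject.common_name IS NOT NULL
| STATS not_after = MAX(tls.server.x509.not_after),
        not_before = MIN(tls.server.x509.not_before)
  BY tls.server.x509.subject.common_name, tags
| EVAL validity_days = DATE_DIFF("days", not_before, not_after)
| WHERE validity_days <= 90       // ACME-class certificates only
    AND NOW() + 14d > not_after   // tighter expiry window
| LIMIT 1000
```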
Act: smart certificate rotation escalation and remediation
The full workflow lives at docs/elastic-samples/workflows/01-smart-certificate-rotation-escalation-remediation.yaml. Copy it into Kibana → Stack Management → Workflows, swap connector IDs, and you have the picture below running. What follows is just the patterns you can't intuit by skimming the YAML.
Figure 4: Every leaf has a contract — auto-rotate, ticket, soft-notify, or page. The workflow never silently drops a certificate. The diagram shows how smart escalation and remediation operate; the reasoning for each stage is explained beneath it.
Figure 5: The same logic on the time axis instead of the decision axis.
| Stage | Behavior | Why it's interesting |
|---|---|---|
| 1. Pre-flight | Capture severity + business-hours flag | In 9.3, console steps act as expression evaluators; in 9.4, data.set is the named, testable equivalent |
| 2. Host discovery | ES|QL on logs-osquery_manager.* matching the alert's CN | Cross-source join: alert from Synthetics, target list from Osquery |
| 3. Guard | Email + abort if no Osquery host matches | Cert visible on the wire but not in any keystore = real finding |
| 4. Branch | < 7 days → rotate; otherwise → Jira ticket | Single threshold, no fuzzy logic, no ambiguity |
| 5. Rotate | Freshness check, then Ansible webhook + Slack | Circuit breaker before action — see snippet below |
| 6. Escalate | Soft email or hard PagerDuty page when rotation can't proceed safely | Severity + business hours decide the channel |
The patterns that matter
Setting workflow variables. The pre-flight steps compute and store values that every later branch reads. In 9.3, the only way to do this was to abuse console steps — a Liquid template in message renders to a string accessible at steps.<name>.output. In 9.4, data.set makes this an explicit, named, testable operation with typed output fields.
9.3 — console as expression evaluator (compatible with 9.3 and 9.4; not recommended for new workflows):
- name: business_hours_check
type: console
with:
message: |-
{%- assign hour = "now" | date: "%H", "Australia/Sydney" | plus: 0 %}
{%- if hour >= 8 and hour < 17 %}true{%- else %}false{%- endif %}
9.4 — data.set (preferred for new workflows):
- name: business_hours_check
type: data.set
with:
is_business_hours: |-
{%- assign hour = "now" | date: "%H", "Australia/Sydney" | plus: 0 -%}
{%- if hour >= 8 and hour < 17 %}true{%- else %}false{%- endif -%}
The data.set output is accessible at steps.business_hours_check.output.is_business_hours, a named field rather than a raw string. The same upgrade applies to the severity_mapping and pagerduty_criticality_mapping steps in the full workflow YAML. Swap the timezone to suit your team.
Cross-source host discovery. The alert fires on Synthetics; this query asks Osquery which hosts carry the expiring CN:
- name: find_affected_hosts
type: elasticsearch.esql.query
with:
query: |
FROM logs-osquery_manager.result-*
| WHERE tls.server.x509.subject.common_name == "{{ event.alerts[0]['tls.server.x509.subject.common_name'][0] }}"
AND `file.path` != "LocalMachine\\Certificate Enrollment Requests"
| EVAL days_until_expiry = DATE_DIFF("days", tls.server.x509.not_before, tls.server.x509.not_after)
| KEEP host.hostname, tls.server.x509.subject.common_name,
tls.server.x509.not_before, tls.server.x509.not_after,
file.path, days_until_expiry
format: json
Output columns are accessed positionally below: [0] host.hostname, [5] days_until_expiry. The Windows Certificate Enrollment Requests exclusion drops pending renewal requests from the keystore — they generate false matches mid-renewal.
The circuit breaker. Don't act on a host that can't currently observe itself:
- name: check_host_freshness
type: elasticsearch.esql.query
with:
query: |
FROM metrics-system.cpu-*
| WHERE host.name == "{{ steps.find_affected_hosts.output.values[0][0] }}"
| STATS latest_metric = MAX(@timestamp)
| EVAL latency_sec = (TO_LONG(NOW()) - TO_LONG(latest_metric)) / 1000
| LIMIT 1
- name: remediation_routing_logic
type: if
condition: 'steps.check_host_freshness.output.values[0][1] <= 300' # row 0, column 1 = latency_sec
steps:
- name: trigger_ansible_rotation
type: http
with:
url: "{{consts.ansible_webhook}}"
method: POST
body: { extra_vars: { target_host: "{{ steps.find_affected_hosts.output.values[0][0] }}" } }
headers: { Authorization: "Bearer {{consts.ansible_token}}" }
This is the trust deficit circuit breaker made concrete. My first version of this workflow didn't have the freshness check, and we cheerfully fired Ansible jobs at hosts that had already left the building. Tower returned 200 because the job queued, the workflow declared success, and the certificate never actually rotated. Two lines of YAML and one postmortem you don't have to write.
Ansible Tower is illustrative — type: http works with any webhook-addressable executor: Rundeck, AWX, Salt, GitHub Actions workflow_dispatch, a thin internal service in front of kubectl rollout restart. The pattern that matters is the health check before the action.
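For example, the same step shape pointed at a GitHub Actions workflow_dispatch endpoint; a hypothetical sketch in which the repo path, workflow filename, and consts.github_token are all placeholders:

```yaml
- name: trigger_rotation_via_gha
  type: http
  with:
    url: "https://api.github.com/repos/<org>/<repo>/actions/workflows/rotate-cert.yml/dispatches"
    method: POST
    headers:
      Authorization: "Bearer {{consts.github_token}}"
      Accept: "application/vnd.github+json"
    body:
      ref: main
      inputs:
        target_host: "{{ steps.find_affected_hosts.output.values[0][0] }}"
```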
The ansible_token literal in consts: is for prose readability. In production, inject the credential via Workflows' secret handling (or load it from a referenced connector) so it never lands in version-controlled YAML.
Severity-aware escalation when the circuit breaker fires. Low severity outside business hours gets a non-urgent email; everything else pages, with severity translated to PagerDuty's vocabulary. Same 9.3/9.4 pattern as above:
9.3:
- name: pagerduty_criticality_mapping
type: console
with:
message: |-
{%- assign sev = steps.severity_mapping.output -%}
{%- case sev -%}
{%- when "critical" -%}critical
{%- when "high" -%}error
{%- when "medium" -%}warning
{%- else -%}info
{%- endcase -%}
9.4:
- name: pagerduty_criticality_mapping
type: data.set
with:
severity: |-
{%- assign sev = steps.severity_mapping.output.severity -%}
{%- case sev -%}
{%- when "critical" -%}critical
{%- when "high" -%}error
{%- when "medium" -%}warning
{%- else -%}info
{%- endcase -%}
The mapping step has to run before hard_notification references its output; keep that ordering if you refactor. In 9.4, the data.set output is referenced by field name (steps.severity_mapping.output.severity); in 9.3, the console step exposes a raw string at steps.severity_mapping.output.
Verify: closing the loop
A self-healing system that takes an action and walks away is not self-healing: it's just hopeful. Verify is the phase that distinguishes a closed loop from a fire-and-forget script, and it's where most "automated remediation" implementations quietly fall short.
For certificates, verification has two natural anchors. The next Synthetics interval (1–10 minutes after rotation) re-reads the chain — the new not_after and fingerprint land in synthetics-* and the alert rule simply stops matching that CN. No alert is the right outcome. For deeper confidence, the canary curl_certificate workflow confirms that the fleet — not just the public endpoint — has the new cert. Add the following block at the end of the rotation success path:
- name: wait_for_rotation
type: wait
with:
duration: 120s
- name: verify_rotation
type: elasticsearch.esql.query
with:
query: |
FROM synthetics-*
| WHERE tls.server.x509.subject.common_name
== "{{ event.alerts[0]['tls.server.x509.subject.common_name'][0] }}"
AND @timestamp > NOW() - 5 minutes
| STATS new_not_after = MAX(tls.server.x509.not_after),
new_fingerprint = VALUES(tls.server.hash.sha256)
| EVAL days_remaining = DATE_DIFF("days", NOW(), new_not_after)
- name: rotation_outcome
type: if
condition: "${{ steps.verify_rotation.output.values[0][2] > 30 }}"
steps:
- name: notify_verified
type: slack
connector-id: "slack-sre-channel-uuid"
with:
message: "✅ Rotation verified. {{ steps.verify_rotation.output.values[0][2] }} days remaining."
else:
- name: notify_verification_failed
type: pagerduty
connector-id: "<your-pagerduty-connector-uuid>"
with:
eventAction: "trigger"
severity: "error"
summary: "Rotation triggered but Synthetics still shows old cert. Investigate Ansible job and DNS/cache."
The verify query reads from the same source the alert rule reads from, so the two definitions can never drift. The check is numeric (days_remaining > 30), nothing to misinterpret. And a failed verification escalates harder than the original: a cert that was about to expire and silently failed to rotate is the more dangerous state, because automation has already swallowed the warning a human would otherwise have seen.
The 120-second pause is conservative. For ACME certs renewed by a sidecar, the next monitor interval (often under a minute) is enough. For Ansible-driven PKI rotations that touch a reverse-proxy reload, two minutes feels about right in my experience; long enough not to chase a stale cache, short enough that on-call doesn't go to bed thinking the rotation worked.
To rehearse the whole loop end-to-end before pointing it at production, the repo's local-lab/ directory ships a Docker Compose setup that issues a one-day self-signed cert and serves it from Nginx or Apache. Renew the cert, restart the container, and watch Synthetics pick up the new fingerprint within one monitor cycle — same shape as a real rotation, with no production risk.
Skip Verify and you have built an open loop dressed up as a closed one. The trust deficit returns the first time a silent rotation failure becomes a 4 AM page.
Impact validation: the canary workflow
The combined workflow handles the predictable case. Some incidents fall outside the predictable case: a Synthetic monitor failing with a TLS handshake timeout where the chain is not obvious, or a situation where you want to confirm blast radius before committing to rotation. For these, an Osquery curl_certificate live query gives a client-side view of the TLS chain from the perspective of machines that actually communicate with the target. A reverse proxy can hold a valid public cert while a downstream internal host serves an expired CA-signed cert that breaks microservice calls: invisible to endpoint polling, fully visible from internal canary hosts.
Two workflow files are in the repo: 02-canary-certificate-impact-check.yaml (9.3-compatible) and 02-canary-certificate-impact-check-9.4.yaml (GA default for 9.4+). Two ways to invoke either:
- Manually in Kibana, with target_host and target_port inputs.
- Autonomously, as a registered Agent Builder tool used by the SRE Certificate Agent during triage — same registration pattern as How to Troubleshoot Kubernetes Pod Restarts & OOMKilled Events with Agent Builder.
The interesting moves inside the YAML are: dispatch a curl_certificate live query with agent_all: true, poll the live-query API until status is completed, then aggregate the asynchronous results with one ES|QL pass. The polling step is where 9.3 and 9.4 differ most visibly.
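The dispatch step itself is the simplest of the three; a hypothetical sketch of its shape (the repo YAML is authoritative, and /api/osquery/live_queries is Kibana's Osquery live-query API):

```yaml
- name: osquery_check_host
  type: kibana.request
  with:
    method: POST
    path: /api/osquery/live_queries
    body:
      agent_all: true
      query: "SELECT * FROM curl_certificate WHERE hostname = '{{inputs.target_host}}:{{inputs.target_port}}';"
```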
9.3 — foreach + intentional failure workaround:
consts:
items: [1] # single-element list; foreach runs once per retry attempt
steps:
- name: while_not_workaround_loop
type: foreach
foreach: "${{consts.items}}"
steps:
- name: osquery_check_query_completion
type: kibana.request
with:
method: GET
path: /api/osquery/live_queries/{{ steps.osquery_check_host.output.data.action_id }}
- name: wait_10s
type: wait
with:
duration: 10s
- name: retry
type: http
if: "${{ steps.osquery_check_query_completion.output.data.status != 'completed' }}"
with:
url: "https://httpbin.org/status/404" # deliberate 404 triggers on-failure
on-failure:
retry:
max-attempts: 9
9.4 — native while loop with data.set:
- name: poll_osquery_completion
type: while
condition: "${{ steps.osquery_poll_state.output.query_status != 'completed' }}"
max-iterations:
limit: 10
on-limit: fail
steps:
- name: wait_10s
type: wait
with:
duration: 10s
- name: check_query_completion
type: kibana.request
with:
method: GET
path: /api/osquery/live_queries/{{ steps.osquery_check_host.output.data.action_id }}
- name: osquery_poll_state
type: data.set
with:
query_status: "{{ steps.check_query_completion.output.data.status }}"
See the 9.3 → 9.4 migration guide for a full list of changes. The ES|QL aggregation step is identical in both versions:
FROM logs-osquery_manager.result*
| WHERE action_id == "{{ steps.osquery_check_host.output.data.queries[0].action_id }}"
| WHERE NOW() > tls.server.x509.not_after
| STATS expired_canary_hosts = COUNT_DISTINCT(agent.name)
BY certificate_validity_duration = DATE_DIFF("day",
tls.server.x509.not_before,
tls.server.x509.not_after)
expired_canary_hosts is the count of agents that confirmed expiry. certificate_validity_duration > 90 distinguishes long-lived PKI from short-lived ACME, which the workflow uses to route the response — replace the console placeholders in the YAML with Slack / PagerDuty / a follow-on remediation step as preferred.
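A sketch of that routing, assuming the aggregation step is named aggregate_canary_results and reading its output positionally ([0] expired_canary_hosts, [1] certificate_validity_duration):

```yaml
- name: route_on_cert_class
  type: if
  condition: "${{ steps.aggregate_canary_results.output.values[0][1] > 90 }}"  # long-lived PKI class
  steps:
    - name: notify_pki_team
      type: slack
      connector-id: "<your-slack-connector-uuid>"
      with:
        message: "PKI cert impact confirmed on {{ steps.aggregate_canary_results.output.values[0][0] }} canary hosts."
  else:
    - name: notify_acme_owner
      type: slack
      connector-id: "<your-slack-connector-uuid>"
      with:
        message: "ACME cert impact on {{ steps.aggregate_canary_results.output.values[0][0] }} canary hosts; check the renewal sidecar."
```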
Registering as an agent tool
Register the workflow as a tool so any Agent Builder agent can invoke it during triage:
POST kbn:/api/agent_builder/tools
{
"id": "o11y.canary_certificate_check",
"type": "workflow",
"description": "Queries curl_certificate from all fleet agents to confirm real TLS impact for a given hostname and port. Returns count of affected canary hosts and certificate type (PKI vs ACME).",
"tags": ["observability", "certificates"],
"configuration": {
"workflow_id": "<your-osquery-curl-workflow-id>",
"wait_for_completion": true
}
}
Packaging as a skill (9.4+)
Elastic 9.4 introduces skills, a higher-level abstraction over individual tools. Where a tool is a single discrete operation (run a query, call a webhook), a skill bundles instructions, tools, and reference context into a reusable capability pack that an agent loads selectively rather than holding in the system prompt at all times.
For certificate triage, a skill lets you combine the canary impact check tool with domain-specific instructions, for example, how to interpret PKI vs ACME results, when to escalate vs when to redirect to DNS, and which follow-on workflows to invoke. The agent selects the skill automatically when the conversation context signals a TLS problem, or users invoke it directly with a slash command (/cert-triage).
Skills are managed through the skill library in Agent Builder and are available to any agent in the workspace: one definition, shared across the SRE agent, the on-call agent, and any future agents that need certificate expertise. Retrieve available skills with GET /api/agent_builder/skills.
A typical agent-driven triage with the skill loaded:
SRE: "The Synthetics monitor for blueprint-portal is failing. Investigate."
Agent: [Reviews synthetics logs — TLS handshake error on blueprint-portal.internal]
Agent: [Invokes o11y.canary_certificate_check for blueprint-portal.internal:443]
Canary result: expired intermediate cert confirmed on 3 of 12 agents — PKI class (365-day cert)
Agent: "The Synthetic failure is caused by an expired intermediate CA on the downstream
gateway. Three canary hosts confirm. The leaf cert is valid — this is a CA renewal gap on
the internal chain, not a service certificate expiry."
SRE: "Execute the proxy cache flush workflow."
Agent: [Invokes proxy flush workflow via MCP]
Agent: "Flush confirmed. Monitor should recover within one check interval."
When the canary check comes back clean, the agent redirects to DNS, routing, or upstream service health without the SRE needing to query anything by hand.
Operational considerations
Observability of the workflow itself. Once people see this pattern, the question I get asked is always the same: "how do you know the workflow is still doing what you think?" The Workflows run history in Kibana shows every execution, status, and step output. And the clearest sign that the loop is closing comes straight from the alerts index:
// Certificates that fired the alert more than once in the last 14 days —
// strong signal that rotation is failing silently and the loop is open
FROM .alerts-observability.metrics.alerts-default
| WHERE @timestamp > NOW() - 14 days
AND kibana.alert.rule.name == "Certificate Expiry"
| STATS fires = COUNT(*) BY tls.server.x509.subject.common_name
| WHERE fires > 1
| SORT fires DESC
Pin that query somewhere visible. A CN appearing more than once means the workflow is acting, but the action isn't landing; exactly the failure mode this architecture exists to surface. The companion meta-alert query is checked in at docs/elastic-samples/alerts/02-certificate-workflow-verification.md.
Criticality context for backend hosts. The CSV tags field only covers endpoints with active Synthetic monitors. For Osquery-discovered certificates on backend services with no monitor, add a LOOKUP JOIN enterprise_cmdb ON host.name step before remediate_or_notify. Two patterns keep the CMDB index current: pull (Logstash sync of CMDB exports) or push (the workflow POSTs back via ServiceNow webhook when Osquery surfaces a host that isn't in the CMDB; that's a real finding worth investigating).
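A sketch of that enrichment extending the find_affected_hosts query, assuming enterprise_cmdb is an index created with index.mode: lookup and carrying hypothetical criticality and owner_team fields (join on host.name or host.hostname, whichever your CMDB index keys on):

```esql
FROM logs-osquery_manager.result-*
| WHERE tls.server.x509.subject.common_name == "{{ event.alerts[0]['tls.server.x509.subject.common_name'][0] }}"
| LOOKUP JOIN enterprise_cmdb ON host.name
| KEEP host.name, tls.server.x509.subject.common_name,
       tls.server.x509.not_after, criticality, owner_team
```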
Rehearse end-to-end before production. The repo's local-lab/ directory issues a 1-day self-signed cert, serves it via Nginx (port 8443) or Apache (port 8444), and includes a renew-tls-certs.sh script. Run the loop against it for a day or two before pointing the workflow at services that matter.
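If you want to see the moving parts without the script, the same effect is a few lines of OpenSSL; a hypothetical equivalent of renew-tls-certs.sh in which paths and the service name are illustrative:

```bash
# Issue a fresh 1-day self-signed cert, then serve it
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=localhost" \
  -keyout certs/server.key -out certs/server.crt
docker compose restart nginx   # Nginx picks up the new cert on :8443
```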
What success looks like
| Indicator | Healthy state |
|---|---|
| Repeat-fires per CN over 14 days | Near zero — every alert is acted on and verified within one cycle |
| Workflow success rate | Dominant; failed runs are mostly host-stale escalations, not bugs |
| Auto-rotation share of total runs | Increases as the team trusts the ≤ 7-day window |
| Manual Jira tickets | Drops as more cert classes graduate from "ticket" to "auto-rotate" |
| 4 AM cert pages | Approaches zero. If they don't, the loop isn't closing. Start with the repeat-fires query. |
The single highest-value metric is the last one. A successful self-healing build is one that no human notices for months at a time.
Repository structure
Figure 6: CSV in, monitors out. Sample workflow YAML, alert ES|QL, ingest pipeline, and a short-lived-cert local lab are all colocated and version-controlled together.
Full reference at github.com/adrianchen-es/aiops-synthetics-lab.
Next steps
- Clone the repo and edit journeys/tls-browser/tls-target-hosts.csv with your endpoints and criticality tags.
- Install the ingest pipeline once: PUT _ingest/pipeline/synthetics-browser@custom.
- Run npm run test:demos against revoked.badssl.com to confirm Sense and the ingest pipeline are wired correctly. Optionally point the journey at the local-lab/ Nginx (https://127.0.0.1:8443) to rehearse the full loop with 1-day certificates.
- Push monitors with npm run push.
- Deploy the Osquery pack via the Kibana API.
- Paste the ES|QL from docs/elastic-samples/alerts/01-…md into a Kibana ES|QL rule.
- Import docs/elastic-samples/workflows/01-…yaml as a Kibana Workflow and configure connector IDs. Add the Verify block before enabling auto-rotation in production.
- Optionally import docs/elastic-samples/workflows/02-…yaml and register it as an Agent Builder tool.
Try it on Elastic Cloud Serverless or Elastic Cloud.
Frequently asked questions
Why do TLS certificates keep expiring in production if the expiry date is known in advance?
The expiry date is known, but acting on it requires a reliable chain from detection to rotation — and that chain typically runs through spreadsheets, manual calendar reminders, and whoever happens to notice first. Automation closes that gap: Elastic Synthetics and Osquery provide continuous visibility into expiry state, while Elastic Workflows manages rotation and verification without manual intervention.
How is monitoring TLS certificates with Elastic Synthetics different from a simple cron job or alerting rule?
A Synthetics journey captures the full certificate chain as the client actually sees it, including intermediate CA issues that a host-side check misses entirely. It also maps results into tls.server.x509.* ECS fields, so the same data feeds both the alert rule and the workflow's host-discovery query — no translation layer, no schema drift between detection and remediation.
How do I make sure automated certificate rotation actually worked and didn't just queue a job?
After triggering rotation, the Elastic Workflow waits 120 seconds and queries synthetics-* with ES|QL to check whether the new not_after date is more than 30 days out. If the old certificate is still present, the workflow escalates to PagerDuty at higher severity than the original alert — because a silent rotation failure is more dangerous than the expiry warning it consumed.
What happens if Ansible fires a rotation job at a host that's already been decommissioned?
The workflow includes a circuit-breaker step that queries metrics-system.cpu-* before triggering Ansible. If the host hasn't reported telemetry in the last 300 seconds, the rotation is skipped and an escalation is sent instead. Without this check, the Ansible Tower job queues and returns 200, the workflow declares success, and the certificate never actually rotates.
When should I use the Osquery curl_certificate canary workflow instead of the main rotation workflow?
The main rotation workflow handles predictable expiry detected by Synthetics. The canary workflow is for triage when the failure mode is unclear — for example, a TLS handshake timeout where the root cause isn't obvious. It dispatches a curl_certificate live query to all fleet agents simultaneously and tells you which machines confirm the expiry, distinguishing a public endpoint failure from an internal chain failure that Synthetics can't see.
Does this pipeline require Elastic Synthetics monitors for every certificate I want to cover?
No. The Osquery component independently inventories the local certificate store on every enrolled host daily, surfacing certificates used by backend databases, gRPC services, and message brokers that have no HTTP endpoint. Synthetics covers what clients see from the outside; Osquery covers what hosts hold internally. Both write to tls.server.x509.* so the remediation workflow can join them on common name without a translation layer.
What are the limitations of joining Synthetics and Osquery data on common name?
Common name matching works well for simple certificate topologies. In environments with heavy wildcard usage (*.svc.cluster.local, *.internal), a single Let's Encrypt wildcard renewal can fan out the remediation workflow to every host in the cluster. The fix is to extend the join to tls.server.x509.alternative_names and filter out centrally managed wildcards before triggering per-host rotation.
What version of the Elastic Stack do I need to use Elastic Workflows for certificate monitoring?
Elastic Workflows is generally available in Elastic Stack 9.4 and was available as a Technical Preview in 9.3. The ES|QL LOOKUP JOIN used for host discovery requires 9.1 or later. For new deployments, target 9.4 to access the native while loop and data.set steps, which eliminate the workarounds required in 9.3.
What is the difference between a tool and a skill in Elastic Agent Builder?
A tool is a single discrete operation — run a query, call a webhook, trigger a workflow. A skill bundles instructions, tools, and reference context into a reusable capability pack that an agent loads selectively rather than keeping in the system prompt at all times. For certificate triage, registering the canary impact check as a tool gives an agent one action; packaging it as a skill gives the agent the context to interpret results, decide when to escalate, and invoke follow-on workflows automatically.
Related reading
- Automated Reliability: The Architecture of Self-Healing Enterprises — the inspiration for this build. Covers the why: the remediation gap, the trust deficit, and the sense → think → act → verify loop this implementation realizes.
- How to Troubleshoot Kubernetes Pod Restarts & OOMKilled Events with Agent Builder — a directly comparable Agent Builder triage pattern; the SRE Certificate Agent here uses the same registration model.
- Agentic CI/CD: Kubernetes Deployment Gates with Elastic MCP Server — extends the workflow-as-tool pattern to deployment pipelines; relevant if you want to gate releases on certificate health.
- Elastic Ramen: A CLI harness for SRE investigation and remediation — brings Agent Builder conversations, skills, and tools into the terminal so engineers can move from investigation to remediation in a single thread.