Elasticsearch network monitoring: cutting MTTR with ML

A Level 1 NOC analyst staring at 802.1x authentication failures across a Cisco ISE deployment would typically need a senior network engineer and 20 minutes of manual correlation to build an investigation checklist. With Elastic's ML anomaly detection and AI Assistant, that same checklist (root causes ranked by likelihood, certificate checks, RADIUS config steps) is generated in seconds from a single alert. This post walks through the exact setup: Elastic Agent, the Cisco ISE integration, a real-time anomaly detection job, and a Knowledge Base-backed AI Assistant that turns alert context into resolution guidance automatically.

To boldly go where no man has gone before.

Star Trek wasn't just a space odyssey. For today's network engineers, it's a fitting metaphor for where NOC operations are headed: into a new frontier powered by Machine Learning, AI, and unified telemetry at scale.

Anyone who has spent time in a Network Operations Center (NOC) knows the grind — wall-to-wall dashboards, "what's up" ping tools, and the exhausting art of sifting through thousands of log lines trying to pinpoint why a service went dark at 2 a.m. The old playbook of manual correlation, tribal knowledge, and war-room calls is hitting its limits as networks grow more complex and distributed.

The new playbook looks different. Instead of reacting to dashboards, NOC analysts are leveraging advanced Machine Learning algorithms and Artificial Intelligence to aggregate internal and external telemetry, surface anomalies automatically, and guide engineers through resolution — dramatically reducing Mean Time to Resolution (MTTR), or as many teams call it, Mean Time to Innocence (MTTI).

Elastic sits at the center of this shift. Its scalable, schema-on-write architecture can ingest terabytes of network telemetry, run real-time ML jobs against it, and surface AI-generated resolution guidance directly in the analyst's workflow. This post walks through exactly how to set that up — using a real customer use case as our guide.

Network Telemetry Sources: The Data Elastic Can Ingest

Network telemetry comes in many forms. Elastic can ingest and correlate all of them in a single platform, giving your NOC a unified view instead of siloed tools:

Router & Switch Syslogs
NetFlow / IPFIX
SNMP Traps & Polls
DNS & DHCP Logs
Firewall & VPN Logs
VPC Flow Logs (AWS/GCP/Azure)
802.1x / NAC Events
SD-WAN Telemetry

Until recently, there wasn't an intelligent way to aggregate and analyze these sources holistically. NetFlow gives 5-tuple IP traffic at scale, but correlating it with an 802.1x failure from your NAC platform requires manual pivoting between tools. Elastic removes that pivoting by putting both data sets in one platform. The customer use case below shows how. As you read it, consider the complexity of the problem and the pressure to resolve it quickly when the affected users are executives. The step-by-step instructions that follow show how to use Elastic's Machine Learning and AI Assistant to identify and resolve the problem.

Customer Use Case: A customer needed to troubleshoot Cisco ISE authentication logs. Executives could log into their laptops successfully when docked in the office, but lost network access when moving to conference rooms or the cafeteria. The NOC team suspected an 802.1x wireless issue but was overwhelmed trying to manually correlate failures across the ISE platform. Here is how they used Elastic to solve it, step by step.

Step-by-step Configuration

1. Install the Elastic Agent (Setup)

The Elastic Agent is the single, unified way to collect telemetry from your network devices and infrastructure. It replaces the need for multiple Beats or Logstash pipelines for most use cases. Deploy it on a collector host that can reach your network devices via syslog, SNMP, or API.

From Fleet in Kibana, navigate to Fleet → Agents → Add Agent. Select your platform, copy the generated install command, and run it on your collector host. The agent self-registers with Fleet — future policy changes are centrally managed, no SSH required.

Tip: For high-availability NOC deployments, deploy multiple Elastic Agents behind a load balancer pointing to your Fleet Server cluster. This ensures telemetry collection continues if a collector host goes down.

See screenshot below:

2. Configure an Agent Policy (Setup)

Agent Policies define what an Elastic Agent collects. In Fleet, create a new policy named "NOC-Network-Telemetry". Policies group integrations and apply to one or more agents — this lets you manage collection for an entire class of collector hosts as a single unit.

Within the policy, you'll add integrations for each telemetry source. Start with the integrations relevant to your environment (Cisco ISE in this use case), and add NetFlow, SNMP, and others as you expand coverage.

See screenshot below:

3. Add the Cisco ISE Integration (Integration)

In Fleet, go to Integrations → search "Cisco ISE" and add it to your policy. The integration ships pre-built ingest pipelines that parse ISE's syslog format into structured ECS fields automatically — no custom Grok patterns required.

Configure ISE to forward syslogs to your Elastic Agent collector on UDP/TCP 514. In the integration settings, match that port and set the log categories: authentication events, posture assessments, guest access, and profiler logs are the most valuable for 802.1x troubleshooting.

# ISE syslog target configuration
# Administration → System → Logging → Remote Logging Targets
 
Target Name:  elastic-noc-collector
IP Address:   10.10.5.20   # Your Elastic Agent host
Port:         514
Facility:     LOCAL7
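
Before pointing ISE at the collector, it can help to send one hand-made syslog line to the same address and port and then look for it in Discover. The sketch below is only an illustration: the function names, the `ise-test` hostname and the `elastic-check` tag are made up for this example, and the integration's pipeline is not expected to parse the message as a real ISE event. Because UDP is connectionless, a successful send does not prove that anything is listening.

```python
import socket

# Syslog facility and severity codes (LOCAL7 = 23, informational = 6).
FACILITY_LOCAL7 = 23
SEVERITY_INFO = 6


def build_syslog_line(message: str,
                      hostname: str = "ise-test",
                      tag: str = "elastic-check") -> bytes:
    """Return a minimal BSD-style syslog line with facility LOCAL7.

    hostname and tag are hypothetical placeholder values.
    """
    pri = FACILITY_LOCAL7 * 8 + SEVERITY_INFO  # PRI = facility * 8 + severity
    return f"<{pri}>{hostname} {tag}: {message}".encode("utf-8")


def send_test_message(host: str, port: int = 514) -> int:
    """Send one UDP datagram to the collector and return the bytes sent."""
    line = build_syslog_line("connectivity test from NOC workstation")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(line, (host, port))
```

Calling `send_test_message("10.10.5.20")` targets the example collector address used in the configuration above; substitute your own Elastic Agent host.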

Tip: Enable all severity levels (DEBUG through EMERGENCY) on your ISE syslog target during initial troubleshooting. You can filter in Elastic rather than missing events at the source. You can also configure ingest pipelines to modify or drop fields within the integration pipeline.

See screenshot below:

4. Explore the Data in Discover (Analysis)

Once ISE logs start flowing, head to Kibana → Discover and select the logs-cisco_ise.* data view. You'll see structured events with parsed fields like cisco.ise.message.id, event.outcome, and cisco.ise.nas.ip.

For the executive 802.1x problem, filter to failed authentication events. Look for event.outcome: failure combined with cisco.ise.message.id: 5400 (authentication failed; 5200 is the code ISE logs for a successful authentication). Add cisco.ise.nas.ip to your columns to immediately see which access points are generating failures. You'll quickly spot a pattern tied to conference rooms and the cafeteria.

# Quick KQL filters to start with in Discover:

# All authentication failures
event.outcome: "failure" AND cisco.ise.message.id: 5400
 
# Failures on wireless only
event.outcome: "failure" AND cisco.ise.network.device.groups: *Wireless*
 
# Specific endpoint MAC address
cisco.ise.endpoint.mac: "AA:BB:CC:DD:EE:FF"
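
When the same filter needs to run from a script rather than Discover, it can be written as an Elasticsearch Query DSL body. The following is a minimal sketch, assuming the field names shown in this post (`event.outcome`, `cisco.ise.nas.ip`) exist in your data view; the function name and its parameters are hypothetical.

```python
import json


def failed_auth_by_device(minutes: int = 60, top_n: int = 10) -> dict:
    """Build a search body: failed events in the last N minutes, grouped by NAS IP."""
    return {
        "size": 0,  # return only the aggregation, not individual documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event.outcome": "failure"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "aggs": {
            "by_device": {
                # Field name taken from this post; adjust to your mapping.
                "terms": {"field": "cisco.ise.nas.ip", "size": top_n}
            }
        },
    }


body = failed_auth_by_device(minutes=30, top_n=5)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the data stream pattern used above, for example from Kibana Dev Tools.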

5. Leverage ES|QL for Advanced Queries (Analysis)

Elastic's ES|QL takes analysis further than KQL by enabling SQL-style transformations, aggregations, and enrichments directly on your data. For the NOC, this means failure rate statistics, top-N offending devices, and time-bucketed trend analysis — all without building dashboards first.

// Top access points by authentication failure count
// failure_count: total failures per device
// distinct_reasons: number of different failure reasons that device reported

FROM logs-logen-logen_events-cisco_ise-log-*
| WHERE cisco_ise.log.failure_reason IS NOT NULL
| STATS failure_count = COUNT(*),
        distinct_reasons = COUNT_DISTINCT(cisco_ise.log.failure_reason)
  BY cisco_ise.log.network_device_ip
| SORT failure_count DESC
| LIMIT 10
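
If you run a query like this through the ES|QL query API (`POST /_query`) instead of Discover, the response arrives as a list of column definitions plus a list of value rows. The sketch below shows one way to turn that shape into per-device records. The helper name is hypothetical, and the sample response is hand-written for illustration, trimmed to two columns, with documentation-range IP addresses rather than real results.

```python
from typing import Any, Dict, List


def esql_rows(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Zip the ES|QL 'columns' and 'values' arrays into one dict per row."""
    names = [col["name"] for col in response["columns"]]
    return [dict(zip(names, row)) for row in response["values"]]


# Hand-written sample in the columns/values shape (not real data).
SAMPLE_RESPONSE = {
    "columns": [
        {"name": "failure_count", "type": "long"},
        {"name": "cisco_ise.log.network_device_ip", "type": "ip"},
    ],
    "values": [
        [412, "192.0.2.10"],
        [97, "192.0.2.11"],
    ],
}

for row in esql_rows(SAMPLE_RESPONSE):
    print(row["cisco_ise.log.network_device_ip"], row["failure_count"])
```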

Tip: ES|QL results can be pinned directly to a Kibana dashboard. Build your investigation queries first in Discover, then save them as panels — giving the NOC a live view of the problem while troubleshooting is still in progress.

See screenshot below:

6. Create a Machine Learning Job (Machine Learning)

Navigate to Machine Learning → Anomaly Detection → Create Job. Choose the "rare" job type and point it at your logs-cisco_ise.* index (your index name may be different). With a rare detector on the failure reason, the job learns which failure reasons normally appear in your authentication logs and flags values that are statistically unusual for a given time bucket. There are no static thresholds to configure and no manual model training; Elastic builds the baseline from the data automatically.

# ML Job Configuration
Job ID:       ise_auth_failure_anomaly
Detectors:    rare by "cisco_ise.log.failure.reason"
Influencers:  cisco_ise.log.failure.reason, host.hostname
Bucket span:  15m
Datafeed:     Real-time (continuous)
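
The same settings can be expressed as the JSON body for the create anomaly detection job API (`PUT _ml/anomaly_detectors/<job_id>`), which is useful when the job should live in version control. This is a sketch under assumptions: the description text and the `@timestamp` time field are assumed, the function name is hypothetical, and the datafeed that supplies the data is created separately and is not shown here.

```python
import json

JOB_ID = "ise_auth_failure_anomaly"


def build_job_body() -> dict:
    """Return a job body mirroring the configuration listed above."""
    return {
        "description": "Rare 802.1x authentication failure reasons (Cisco ISE)",
        "analysis_config": {
            "bucket_span": "15m",
            "detectors": [
                # 'rare by <field>' from the job configuration above
                {"function": "rare",
                 "by_field_name": "cisco_ise.log.failure.reason"}
            ],
            "influencers": ["cisco_ise.log.failure.reason", "host.hostname"],
        },
        # Assumes events carry the standard @timestamp field.
        "data_description": {"time_field": "@timestamp"},
    }


print(f"PUT _ml/anomaly_detectors/{JOB_ID}")
print(json.dumps(build_job_body(), indent=2))
```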

See screenshot below:

7. Create an Observability Alert Rule (Alerting)

Navigate to Observability → Alerts → Manage Rules → “Create Rule”. Select "Anomaly detection alert" and reference your ise_auth_failure_anomaly job. Set the anomaly score threshold to 70–80 and configure the action to notify your NOC via Slack, PagerDuty, or ServiceNow. After you are finished, select “Create Rule” at the bottom of the screen.

Tip: Set a flapping detection window of 30 minutes to prevent alert storms during brief transient spikes. Your NOC should only be paged for sustained anomalies, not blips that self-resolve.
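
The idea of paging only for sustained anomalies can be illustrated with a small gating function. This is not how Kibana implements flapping detection; it is a simplified sketch that pages only when the anomaly score stays at or above the threshold for a number of consecutive buckets. The function name and default values are assumptions chosen for the example (with a 15m bucket span, two consecutive buckets cover 30 minutes).

```python
from typing import Iterable


def should_page(scores: Iterable[float],
                threshold: float = 75.0,
                min_consecutive: int = 2) -> bool:
    """Return True once `min_consecutive` buckets in a row meet the threshold."""
    run = 0  # length of the current streak of high-scoring buckets
    for score in scores:
        run = run + 1 if score >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# A single spike is ignored; two high buckets in a row trigger a page.
print(should_page([10, 90, 20, 30]))
print(should_page([10, 80, 92, 40]))
```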

See screenshot below:

8. Review Alert Details in Observability (AI Assistant)

When an alert fires, navigate to Observability → Alerts and click "Alert details". You'll see the anomaly score trend, impacted devices, the time range of the anomaly, and correlated events — all in a single view.

See screenshot below:

9. Invoke the AI Assistant (AI Assistant)

On the alert detail page, click "Help me understand this alert". The Elastic AI Assistant reads the full alert context and provides a plain-language summary, probable root causes ranked by likelihood, and suggested investigation steps:

AI Assistant Response (example):
 
Alert Summary: Anomalous spike in 802.1x authentication failures
detected across 3 wireless APs between 09:15 and 09:45 UTC.
Failure count is 847% above the learned baseline.

Probable Root Causes:
  1. RADIUS server certificate expiry (ISE error 5406 in logs)
  2. VLAN misconfiguration on the wireless controller
  3. 802.1x supplicant profile mismatch on endpoints
 
Recommended Next Steps:
  1. Check ISE certificate validity: Admin > System > Certificates
  2. Verify RADIUS shared secret between WLC and ISE PSN
  3. Review Authentication Detail Report for affected MACs

See screenshot below:

In seconds, your Level 1 NOC analyst has a prioritized investigation checklist that would have taken a senior network engineer 20 minutes to build manually. This is MTTR reduction in action.

10. Configure the AI Knowledge Base (Knowledge Base)

Open the AI Assistant, click Actions → Manage Knowledge Base. Add static entries — your ISE troubleshooting runbooks, architecture notes, or past incident post-mortems. The assistant will reference these when generating analysis, tailoring guidance to your environment.

# Example Knowledge Base entry:
 
Title: ISE Error 5406 - RADIUS Authentication Failure Resolution
 
When ISE error 5406 is observed with wireless endpoints:
1. Verify ISE System Certificate: Admin > Certificates
2. Check PSN health: Admin > Deployment
3. Confirm RADIUS CoA is enabled on WLC for the affected SSID
4. Review Authentication Policy matching order
Escalate to: noc-tier2@company.com if not resolved in 30 min

See screenshot below:

11. Connect External Data Sources via Connectors (Integrations)

For a more dynamic Knowledge Base, you can leverage connectors. To do this, navigate to Stack Management → Connectors → “Create Connector”. Available types include ServiceNow, Jira, PagerDuty, GitHub, and generic webhooks. Once connected, the AI Assistant can query these systems when analyzing an alert (checking for recent changes, open tickets, known bugs, playbooks, or RSS feeds) without the analyst manually pivoting between tools.

Tip: The ServiceNow CMDB connector is particularly valuable. When the AI identifies an impacted device, it can automatically pull the device owner, maintenance window schedule, and change history. That context dramatically speeds up escalation decisions. You can also subscribe to vendor product feeds, so the assistant's context is updated as new bugs are reported against software releases.

See screenshot below:

12. The Future: MCP, Workflows & Agentic Operations (Coming Soon)

The current AI Assistant is already transforming NOC operations — but it represents only the beginning. The next evolution is agentic AI: instead of the AI responding to questions, it proactively takes action on your behalf.

Elastic is investing in Model Context Protocol (MCP) support and AI Workflows, which will enable the AI to automatically open a ServiceNow ticket, query the CMDB for the device owner, draft an executive impact summary, and walk a Level 1 analyst through step-by-step resolution, all triggered by a single anomaly alert. The analyst becomes the final approver, not the investigator. I'll cover this in more detail in the next post in this series.

Conclusion: A New Frontier, Today

The NOC of the future isn't defined by how many dashboards are on the wall; it's defined by how quickly the team can move from alert to resolution. The Elastic stack, with its unified telemetry ingestion, real-time ML anomaly detection, AI Assistant, and extensible Knowledge Base, gives today's NOC teams the tools to make that leap.

The Cisco ISE use case in this post is just one example. The same pattern — ingest, ML detect, AI triage, knowledge-base-guided resolution — applies to NetFlow anomalies, BGP route flaps, firewall policy violations, SD-WAN path degradation, and dozens of other network events your team deals with daily.

The frontier is open. Start with one integration, one ML job, and one Knowledge Base entry. The MTTR improvements will speak for themselves.
