Bahubali Shetti

AWS VPC Flow log analysis with GenAI in Elastic

Elastic has a set of embedded capabilities such as a GenAI RAG-based AI Assistant and a machine learning platform as part of the product baseline. These make analyzing the vast number of logs you get from AWS VPC Flows easier.


Elastic Observability provides a full observability solution, supporting metrics, traces, and logs for applications and infrastructure. For AWS deployments, VPC flow logs are critical for performance, network visibility, security, compliance, and overall management of your AWS environment. Some examples of what they reveal:

  1. Where traffic is coming from and going to, both into and out of the deployment and within it. This helps identify unusual or unauthorized communications.

  2. Traffic volumes - detecting spikes or drops that could indicate service issues in production or an increase in customer traffic.

  3. Latency and performance bottlenecks - with VPC Flow logs, you can look at the latency of a flow (inbound and outbound) and understand patterns.

  4. Accepted and rejected traffic, which helps determine where potential security threats and misconfigurations lie.

AWS VPC Flow logs are a great example of how valuable logs can be. Logging is an important part of observability, alongside metrics and tracing. The volume of logs that an application and its underlying infrastructure produce can be daunting with VPC Flow logs, but they also provide a significant amount of insight.

Before we proceed, it is important to understand what Elastic provides in managing AWS and VPC Flow logs:

  1. A full set of integrations to manage VPC Flows and the entire end-to-end deployment on AWS

  2. Elastic has a simple-to-use AWS Firehose integration

  3. Elastic’s tools such as Discover, spike analysis, and anomaly detection help provide you with better insights and analysis.

  4. A set of simple out-of-the-box dashboards

In today’s blog, we’ll cover how Elastic’s other features make analyzing VPC Flow logs and performing root cause analysis (RCA) even easier. Specifically, we will focus on managing the number of rejects, as this helps ensure there weren’t any unauthorized or unusual activities:

  1. Set up an easy-to-use SLO (newly released) to detect when things are potentially degrading

  2. Create an ML job to analyze different fields of the VPC Flow log

  3. Use our newly released RAG-based AI Assistant to help analyze the logs without needing to know Elastic’s query language or even how to graph data in Elastic

  4. Use ES|QL to understand and analyze latency patterns

In subsequent blogs, we will use the AI Assistant and ES|QL to show how to get other insights beyond just REJECT/ACCEPT from VPC Flow logs.

Prerequisites and config

If you plan on following this blog, here are some of the components and details we used to set up this demonstration:

SLO with VPC Flow Logs

Elastic’s SLO capability is based directly on the Google SRE Handbook, using the definitions and semantics described there. Hence, users can do the following with SLOs in Elastic:

  • Define an SLO on logs, not just metrics - Users can use KQL (a log-based query), service availability, service latency, a custom metric, a histogram metric, or a timeslice metric.
  • Define SLOs, SLIs, error budgets, and burn rates. Users can also choose occurrence-based versus timeslice-based budgeting. 
  • Manage all SLOs in a single location, with dashboards.
  • Trigger alerts from the defined SLO when the SLI is off target, the error budget is consumed, or the burn rate exceeds a threshold.

Setting up an SLO for VPC flow logs is easy. You simply create a query that defines your good events. In our case, we count all events where aws.vpcflow.action=ACCEPT as good, and we set the target at 85%. 
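As a sketch, the good-event definition is a simple KQL filter (assuming the field mappings from Elastic’s AWS VPC Flow integration):

```
aws.vpcflow.action : "ACCEPT"
```

Everything matching this filter counts as a good event; everything else eats into the 15% error budget implied by the 85% target.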

As the following example shows, over the last 7 days we have exceeded our budget by 43%, and the SLO has been out of compliance for that entire period.
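To make the error-budget math concrete, here is a small sketch of the occurrence-based budgeting arithmetic. Only the 85% target comes from this blog; the event counts are hypothetical numbers chosen for illustration:

```python
# Error-budget arithmetic for an occurrence-based SLO.
# Only the 85% target comes from the blog; the event counts are hypothetical.
target = 0.85
total_events = 100_000          # assumed total VPC flow events in the window
good_events = 78_550            # assumed events with aws.vpcflow.action=ACCEPT

error_budget = (1 - target) * total_events   # bad events allowed: 15,000
bad_events = total_events - good_events      # observed bad events: 21,450
budget_consumed = bad_events / error_budget  # 1.43 -> budget exceeded by 43%

print(f"SLI: {good_events / total_events:.2%}")
print(f"Error budget consumed: {budget_consumed:.0%}")
```

With these assumed counts, the SLI sits below the 85% target and the error budget is consumed at 143%, i.e., exceeded by 43%, matching the shape of the situation described above.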

Analyzing the SLO with AI Assistant

Now that we see there is an issue with the VPC Flows, we immediately turn to the AI Assistant to start analyzing the SLO. Because it's a chat interface, we simply open the AI Assistant and work through some simple analysis (see the animated GIF below for a demo):

AI Assistant analysis:

  • what were the top 3 source.address values that had aws.vpcflow.action=REJECT over the last 7 days, which is causing this SLO issue? - We wanted to see what could be causing the loss in error budget. Were there any particular source.address values causing a heavy reject rate?

    • The answer: A table with the highest count = 42670 and source.address = 79.110.62.185

    • A single source.address is causing the loss in SLO. 

  • What is the largest number of aws.vpcflow.action=REJECT in a 30 min time frame for the last 3 days where the source.address=79.110.62.185? - After identifying that a specific source.address is causing the loss in SLO, we wanted to understand whether its rejects were spiky or steady. 

    • The answer: "The largest number of aws.vpcflow.action=REJECT in a 30-minute time frame for the last 3 days where the source.address is 79.110.62.185 is 229. This occurred on 2024-06-01T04:00:00.000Z."

    • This means the REJECT rate must be low but fairly consistent, rather than spiky, over the last 7 days.

  • for the logs with source.address="79.110.62.185", was there any country code in the source.geo.country_iso_code field present? If yes, what is the value? - Next, we wanted to see where this traffic was coming from.

    • The answer: Yes, there is a country code present in the source.geo.country_iso_code field for logs with source.address="79.110.62.185". The value is BG (Bulgaria).
  • Is there a specific destination.address where source.address=79.110.62.185 is getting a aws.vpcflow.action=REJECT. Give me both the destination.address and the number of REJECTs for that destination.address?

    • The answer: destination.address of 10.0.0.27 is giving a reject number of 53433 in this time frame.
  • Graph the number of REJECT vs ACCEPT for source.address="79.110.62.185" over the last 7 days. The graph should be on a daily basis, in a single graph - We asked this question to see the comparison between ACCEPT and REJECT. 

    • The answer: See the animated GIF to see that the generated graph is fairly stable
  • Were there any source.address values that had a spike in reject rate in a 30 min period over the last 30 days? - We wanted to see if there were any other spikes. 

    • The answer: Yes, there was a source.address that had a spike in reject rate in a 30-minute period over the last 30 days. source.address: 185.244.212.67, Reject Count: 8975, Time Period: 2024-05-22T03:00:00.000Z
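For readers who prefer to run these questions themselves, the first one can be expressed in ES|QL roughly as follows. This is a sketch, not the query the AI Assistant generated; the index pattern logs-aws.vpcflow-* is an assumption based on Elastic’s standard data stream naming:

```
FROM logs-aws.vpcflow-*
| WHERE aws.vpcflow.action == "REJECT"
| STATS reject_count = COUNT(*) BY source.address
| SORT reject_count DESC
| LIMIT 3
```

The same STATS/SORT/LIMIT pattern, with the filter or grouping field swapped out, covers most of the other questions in the list above.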

Watch the flow


Potential issue:

The server handling requests from source 79.110.62.185 is potentially having an issue.

Again using logs, we asked the AI Assistant to give us the ENI IDs where the internal IP address was 10.0.0.27.

From our AWS console, we know this is the webserver. Further analysis in Elastic, together with the developers, revealed that a recently installed new version was causing a problem with connections.

Locating anomalies with ML

While using the AI Assistant is great for analyzing information, another important aspect of VPC flow management is to ensure you can manage log spikes and anomalies. Elastic has a machine learning platform that allows you to develop jobs to analyze specific metrics or multiple metrics to look for anomalies.

VPC Flow logs come with a large amount of information. The full set of fields is listed in AWS docs. We will use a specific subset to help detect anomalies.

We set up anomaly detection for aws.vpcflow.action=REJECT, which requires multi-metric anomaly detection in Elastic.

The config we used utilizes:

Detectors:

  • destination.address

  • destination.port

Influencers:

  • source.address

  • aws.vpcflow.action

  • destination.geo.region_iso_code

The way we set this up will help us understand if there is a large spike in REJECT/ACCEPT against destination.address values from a specific source.address and/or destination.geo.region_iso_code location.
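As a sketch, the multi-metric job configuration might look roughly like the following. The detector and influencer fields come from this blog; the bucket span, the high_count detector function, and the overall job structure are assumptions based on Elastic’s anomaly detection job schema, not the exact settings we used:

```python
# Hypothetical sketch of a multi-metric anomaly detection job config.
# Detector/influencer field names are from the blog; bucket_span and
# the high_count function are assumptions, not the blog's exact settings.
job_config = {
    "analysis_config": {
        "bucket_span": "15m",  # assumed bucket span
        "detectors": [
            {"function": "high_count", "partition_field_name": "destination.address"},
            {"function": "high_count", "partition_field_name": "destination.port"},
        ],
        "influencers": [
            "source.address",
            "aws.vpcflow.action",
            "destination.geo.region_iso_code",
        ],
    },
    "data_description": {"time_field": "@timestamp"},
}

# A job like this could then be created with the Python client, e.g.:
# es.ml.put_job(job_id="vpcflow-reject-spikes", **job_config)
```

Partitioning the detectors by destination fields while listing source.address and aws.vpcflow.action as influencers is what lets the job attribute a spike in rejects against a destination back to the source that caused it.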

Once run, the job reveals something interesting:

Notice that source.address 185.244.212.67 has had a high REJECT rate in the last 30 days. 

Notice where we found this before? In the AI Assistant!

While the AI Assistant is great for finding this sort of anomaly interactively, the ML job can be set up to run continuously and alert us on such spikes. This will help us understand if there are issues with the webserver, as we found above, or even potential security attacks.

Conclusion:

You’ve now seen how easily Elastic’s RAG-based AI Assistant can help analyze VPC Flow logs without needing to know the query syntax, where the data lives, or even the field names. You’ve also seen how Elastic can alert you on a potential issue or degradation in service via SLOs. As a recap, Elastic provides:

  1. A full set of integrations to manage VPC Flows and the entire end-to-end deployment on AWS

  2. A simple-to-use AWS Firehose integration

  3. Tools such as Discover, spike analysis, and anomaly detection that provide better insights and analysis

  4. A set of simple out-of-the-box dashboards

Try it out

Existing Elastic Cloud customers can access many of these features directly from the Elastic Cloud console. Not taking advantage of Elastic on the cloud? Start a free trial.

All of this is also possible in your environment. Learn how to get started today.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

In this blog post, we may have used or referred to third party generative AI tools, which are owned and operated by their respective owners. Elastic does not have any control over the third party tools and we have no responsibility or liability for their content, operation or use, nor for any loss or damage that may arise from your use of such tools. Please exercise caution when using AI tools with personal, sensitive or confidential information. Any data you submit may be used for AI training or other purposes. There is no guarantee that information you provide will be kept secure or confidential. You should familiarize yourself with the privacy practices and terms of use of any generative AI tools prior to use.

Elastic, Elasticsearch, ESRE, Elasticsearch Relevance Engine and associated marks are trademarks, logos or registered trademarks of Elasticsearch N.V. in the United States and other countries. All other company and product names are trademarks, logos or registered trademarks of their respective owners.