Elastic Observability Labs - Articles by Philipp Kahr

Monitoring service performance: An overview of SLA calculation for Elastic Observability

Mon, 24 Apr 2023 00:00:00 GMT

Elastic Stack provides many valuable insights for different users. Developers are interested in low-level metrics and debugging information. SREs are interested in seeing everything at once and identifying where the root cause is. Managers want reports that tell them how good service performance is and if the service level agreement (SLA) is met. In this post, we’ll focus on the service perspective and provide an overview of calculating an SLA.

Since version 8.8, we have built-in functionality to calculate SLOs. Check out our guide!

Foundations of calculating an SLA

There are many ways to calculate and measure an SLA. The most important part is the definition of the SLA, and as a consultant, I’ve seen many different ways. Some examples include:

Count of HTTP 2xx must be above 98% of all HTTP status
Response time of successful HTTP 2xx requests must be below x milliseconds
Synthetic monitor must be up at least 99%
95% of all batch transactions from the billing service need to complete within 4 seconds

Depending on the origin of the data, calculating the SLA can be easier or more difficult. For uptime (Synthetic Monitoring), we automatically provide SLA values and offer out-of-the-box alerts, so you can simply define a rule such as: alert when availability is below 98% for the last 1 hour.

I personally recommend using Elastic Synthetic Monitoring whenever possible to monitor service performance. Running HTTP requests and verifying the answers from the service, or doing fully fledged browser monitors and clicking through the website as a real user does, ensures a better understanding of the health of your service.

Sometimes this is impossible because you want to calculate the uptime of a specific Windows Service that does not offer any TCP port or HTTP interaction. Here the caveat applies that just because the service is running, it does not necessarily imply that the service is working fine.

Transforms to the rescue

We have identified our important service. In our case, it is the Steam Client Helper. There are two ways to solve this.

Lens formula

You can use Lens and a formula (for a deep dive into formulas, check out this blog). Use the search bar to filter down to the data you want, then use the formula option in Lens. We divide the count of records with Running as a state by the overall count of records. This is a nice solution when you need to calculate quickly and on the fly.

count(kql='windows.service.state: "Running" ')/count()

Using the formula posted above as the bar chart's vertical axis calculates the uptime percentage. We use an annotation to mark why there is a dip and why this service was below the threshold. The annotation is set to reboot, which indicates a reboot happening, and thus, the service was down for a moment. Lastly, we add a reference line and set this to our defined threshold at 98%. This ensures that a quick look at the visualization allows our eyes to gauge if we are above or below the threshold.
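To make the arithmetic behind the formula concrete, here is a minimal Python sketch of the same ratio: records in the Running state divided by all records. The sample data (60 one-minute samples with 2 of them not running) and the function name are made up for illustration. Lens evaluates the formula per time bucket; this sketch covers a single bucket only.

```python
from typing import Iterable


def availability(states: Iterable[str], up_state: str = "Running") -> float:
    """Share of records whose state equals up_state, between 0.0 and 1.0."""
    states = list(states)
    if not states:
        raise ValueError("no records in the selected time range")
    up = sum(1 for state in states if state == up_state)
    return up / len(states)


# Hypothetical hour of one-minute samples: 58 running, 2 stopped during a reboot.
samples = ["Running"] * 58 + ["Stopped"] * 2
ratio = availability(samples)

print(f"{ratio:.2%}")   # 96.67%
print(ratio >= 0.98)    # False, so this hour is below the 98% reference line
```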

Transform

What if you are interested not in just one service, but in multiple services that make up your SLA? That is where Transforms come in. A second issue is that the calculated value is only available inside Lens, so we cannot create any alerts on it.

Go to Transforms and create a pivot transform.

Add the following filter to narrow it down to the services data set only: data_stream.dataset: "windows.service". If you are interested in a specific service, for example to know whether a specific remote management service is up across your entire fleet, you can add it to the search bar as well.
Select date_histogram(@timestamp) and set it to your chosen unit. By default, Elastic Agent collects service states only every 60 seconds. I am going with 1 hour.
Select agent.name and windows.service.name as well.

Now we need to define an aggregation type. We will use a value_count of windows.service.state. That just counts how many records have this value.

Rename the value_count to total_count.
Add value_count for windows.service.state a second time and use the pencil icon to change it to a filter with a term condition on the value Running, and name it running.

This opens up a sub-aggregation. Once again, select value_count(windows.service.state) and rename it to values.
Now, the preview shows us the count of records with any states and the count of running.

Here comes the tricky part. We need to write some custom aggregations to calculate the percentage of uptime. Click on the copy icon next to the edit JSON config.
In a new tab, go to Dev Tools. Paste what you have in the clipboard.
Press the play button or use the keyboard shortcut ctrl+enter/cmd+enter and run it. This will create a preview of what the data looks like. It should give you the same information as in the table preview.
Now, we need to calculate the percentage of uptime with a bucket script that divides running.values by total_count, just like we did in the Lens visualization. If you named the columns differently or use more than a single value, adapt the paths accordingly.

"availability": {
  "bucket_script": {
    "buckets_path": {
      "up": "running>values",
      "total": "total_count"
    },
    "script": "params.up/params.total"
  }
}

This is the entire transform for me:

POST _transform/_preview
{
  "source": {
    "index": [
      "metrics-*"
    ]
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h"
        }
      },
      "agent.name": {
        "terms": {
          "field": "agent.name"
        }
      },
      "windows.service.name": {
        "terms": {
          "field": "windows.service.name"
        }
      }
    },
    "aggregations": {
      "total_count": {
        "value_count": {
          "field": "windows.service.state"
        }
      },
      "running": {
        "filter": {
          "term": {
            "windows.service.state": "Running"
          }
        },
        "aggs": {
          "values": {
            "value_count": {
              "field": "windows.service.state"
            }
          }
        }
      },
      "availability": {
        "bucket_script": {
          "buckets_path": {
            "up": "running>values",
            "total": "total_count"
          },
          "script": "params.up/params.total"
        }
      }
    }
  }
}

The preview in Dev Tools should work and be complete. Otherwise, you must debug any errors. Most of the time, it is the bucket script and the path to the values. You might have called it up instead of running. This is what the preview looks like for me.

{
  "running": {
    "values": 1
  },
  "agent": {
    "name": "AnnalenasMac"
  },
  "@timestamp": "2021-12-07T19:00:00.000Z",
  "total_count": 1,
  "availability": 1,
  "windows": {
    "service": {
      "name": "InstallService"
    }
  }
},

Now we only paste the bucket script into the transform creation UI after selecting Edit JSON. It looks like this:

Give your transform a name, set the destination index, and run it in continuous mode. When selecting continuous mode, make sure not to use @timestamp as the date field. Instead, opt for event.ingested. Our documentation explains this in detail.

Click Next, then Create and start. This can take a bit, so don’t worry.

To summarize, we have now created a pivot transform that uses a bucket script aggregation to calculate the uptime of a service as a percentage. There is a caveat: by default, Elastic Agent collects the service state only every 60 seconds. A service can be up exactly when its state is collected and down a few seconds later. If this matters and no other monitoring option, such as Elastic Synthetics, is possible, you might want to reduce the collection interval on the Agent side to get the service state every 30 or 45 seconds. Depending on how important your thresholds are, you can create multiple policies with different collection intervals. A critical server might collect the service state every 10 seconds because you need as much granularity and assurance of the metric’s correctness as possible. For normal workstations, where you just want to know if your remote access solution is up the majority of the time, a single metric every 60 seconds may be enough.
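The effect of the collection interval on an hourly bucket can be worked out with a little arithmetic. The sketch below is only an illustration under the assumptions already used in this post (1-hour buckets, a 98% target); the function names are hypothetical. It shows how many samples land in one bucket and how many of them may be non-running before the bucket drops below the target.

```python
def samples_per_hour(interval_seconds: int) -> int:
    """Number of service-state samples collected in a 1-hour bucket."""
    return 3600 // interval_seconds


def max_missed_samples(interval_seconds: int, target: float = 0.98) -> int:
    """Largest number of non-running samples per hour that still meets the target."""
    total = samples_per_hour(interval_seconds)
    missed = 0
    # Keep allowing one more missed sample while the ratio stays at or above target.
    while (total - (missed + 1)) / total >= target:
        missed += 1
    return missed


for interval in (60, 30, 10):
    total = samples_per_hour(interval)
    allowed = max_missed_samples(interval)
    print(f"{interval:>2}s interval: {total} samples/hour, "
          f"up to {allowed} non-running sample(s) at 98%")
# 60s interval: 60 samples/hour, up to 1 non-running sample(s) at 98%
# 30s interval: 120 samples/hour, up to 2 non-running sample(s) at 98%
# 10s interval: 360 samples/hour, up to 7 non-running sample(s) at 98%
```

With a 60-second interval every sample weighs roughly 1.67% of the hour, so two bad samples are enough to breach a 98% target. Shorter intervals make each sample weigh less and the metric less sensitive to unlucky timing.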

After you have created the transform, one additional benefit is that the result is stored in an index, like any other data in Elasticsearch. When you only build the visualization, the metric is calculated for that visualization and is not available anywhere else. Since the availability is now data, you can create a threshold alert and send it to your favorite connector (Slack, Teams, ServiceNow, email, and many more).

Visualizing the transformed data

The transform created a data view called windows-service. The first thing we want to do is change the format of the availability field to a percentage. This automatically tells Lens to format the field as a percentage, so you don’t need to select the format manually or do any calculations. Furthermore, in Discover, instead of seeing 0.5 you see 50%. Isn’t that cool? The same is possible for durations, like event.duration if you have it in nanoseconds! No more calculating on the fly and wondering whether you need to divide by 1,000 or 1,000,000.

We get this view by using a simple Lens visualization with the timestamp on the horizontal axis, set to a minimum interval of 1 day, and the average of availability on the vertical axis. Don’t worry, the other data will be populated once the transform finishes. We can add a reference line using the value 0.98 because our target is 98% uptime of the service.
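What the visualization and a threshold alert compute on the transformed data can be sketched in a few lines of Python. The documents below are invented examples shaped like the transform preview shown earlier (only @timestamp and availability are used), and the helper name is hypothetical. It averages the hourly availability per day and lists the days that fall below the 0.98 reference line.

```python
from collections import defaultdict

THRESHOLD = 0.98  # same value as the reference line


def daily_availability(docs):
    """Average the hourly availability values per calendar day (UTC date prefix)."""
    per_day = defaultdict(list)
    for doc in docs:
        day = doc["@timestamp"][:10]  # "2021-12-07T19:00:00.000Z" -> "2021-12-07"
        per_day[day].append(doc["availability"])
    return {day: sum(values) / len(values) for day, values in sorted(per_day.items())}


# Hypothetical hourly documents as written by the transform.
docs = [
    {"@timestamp": "2021-12-07T18:00:00.000Z", "availability": 1.0},
    {"@timestamp": "2021-12-07T19:00:00.000Z", "availability": 1.0},
    {"@timestamp": "2021-12-07T20:00:00.000Z", "availability": 0.9},
    {"@timestamp": "2021-12-08T08:00:00.000Z", "availability": 1.0},
    {"@timestamp": "2021-12-08T09:00:00.000Z", "availability": 1.0},
]

daily = daily_availability(docs)
breaches = [day for day, value in daily.items() if value < THRESHOLD]
print(breaches)  # ['2021-12-07']
```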

Summary

This blog post covered the steps needed to calculate the SLA for a specific data set in Elastic Observability, as well as how to visualize it. Using this calculation method opens the door to a lot of interesting use cases. You can change the bucket script and calculate, for example, the number of sales or the average basket size. Interested in learning more about Elastic Synthetics? Read our documentation or check out our free Synthetic Monitoring Quick Start training.

Collecting OpenShift container logs using Red Hat’s OpenShift Logging Operator

Tue, 16 Jan 2024 00:00:00 GMT

This blog explores a possible approach to collecting and formatting OpenShift Container Platform logs and audit logs with Red Hat OpenShift Logging Operator. We recommend using Elastic® Agent for the best possible experience! We will also show how to format the logs to Elastic Common Schema (ECS) for the best experience viewing, searching, and visualizing your logs. All examples in this blog are based on OpenShift 4.14.

Why use OpenShift Logging Operator?

A lot of enterprise customers use OpenShift as their orchestrating solution. The advantages of this approach are:

It is developed and supported by Red Hat
It can automatically update the OpenShift cluster along with the operating system to make sure that they are and remain compatible
It can speed up development life cycles with features like source-to-image
It uses enhanced security

In our consulting experience, this latter aspect causes friction with OpenShift administrators when we try to install an Elastic Agent to collect the logs of the pods. Indeed, Elastic Agent requires the files of the host to be mounted in the pod, and it also needs to be run in privileged mode. (Read more about the permissions required by Elastic Agent in the official Elasticsearch® Documentation). While the solution we explore in this post requires similar privileges under the hood, it is managed by the OpenShift Logging Operator, which is developed and supported by Red Hat.

Which logs are we going to collect?

In OpenShift Container Platform, we distinguish three broad categories of logs: audit, application, and infrastructure logs:

Audit logs describe the list of activities that affected the system by users, administrators, and other components.
Application logs are composed of the container logs of the pods running in non-reserved namespaces.
Infrastructure logs are composed of container logs of the pods running in reserved namespaces like openshift*, kube*, and default along with journald messages from the nodes.

In the following, we will consider only audit and application logs for the sake of simplicity. We will describe how to bring audit and application logs into the format expected by the Kubernetes integration to get the most out of Elastic Observability.

Getting started

To collect the logs from OpenShift, we must perform some preparation steps in Elasticsearch and OpenShift.

Inside Elasticsearch

We first install the Kubernetes integration assets. We are mainly interested in the index templates and ingest pipelines for the logs-kubernetes.container_logs and logs-kubernetes.audit_logs.

To format the logs received from the ClusterLogForwarder in ECS format, we will define a pipeline to normalize the container logs. The field naming convention used by OpenShift is slightly different from that used by ECS. To get a list of exported fields from OpenShift, refer to Exported fields | Logging | OpenShift Container Platform 4.14. To get a list of exported fields of the Kubernetes integration, you can refer to Kubernetes fields | Filebeat Reference [8.11] | Elastic and Logs app fields | Elastic Observability [8.11]. Further, specific fields like kubernetes.annotations must be normalized by replacing dots with underscores. This operation is usually done automatically by Elastic Agent.

PUT _ingest/pipeline/openshift-2-ecs
{
  "processors": [
    {
      "rename": {
        "field": "kubernetes.pod_id",
        "target_field": "kubernetes.pod.uid",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.pod_ip",
        "target_field": "kubernetes.pod.ip",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.pod_name",
        "target_field": "kubernetes.pod.name",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.namespace_name",
        "target_field": "kubernetes.namespace",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.namespace_id",
        "target_field": "kubernetes.namespace_uid",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_id",
        "target_field": "container.id",
        "ignore_missing": true
      }
    },
    {
      "dissect": {
        "field": "container.id",
        "pattern": "%{container.runtime}://%{container.id}",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_image",
        "target_field": "container.image.name",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.container.image",
        "copy_from": "container.image.name",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "copy_from": "kubernetes.container_name",
        "field": "container.name",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "kubernetes.container_name",
        "target_field": "kubernetes.container.name",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.node.name",
        "copy_from": "hostname",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "hostname",
        "target_field": "host.name",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "level",
        "target_field": "log.level",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "file",
        "target_field": "log.file.path",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "copy_from": "openshift.cluster_id",
        "field": "orchestrator.cluster.name",
        "ignore_failure": true
      }
    },
    {
      "dissect": {
        "field": "kubernetes.pod_owner",
        "pattern": "%{_tmp.parent_type}/%{_tmp.parent_name}",
        "ignore_missing": true
      }
    },
    {
      "lowercase": {
        "field": "_tmp.parent_type",
        "ignore_missing": true
      }
    },
    {
      "set": {
        "field": "kubernetes.pod.{{_tmp.parent_type}}.name",
        "value": "{{_tmp.parent_name}}",
        "if": "ctx?._tmp?.parent_type != null",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": [
          "_tmp",
          "kubernetes.pod_owner"
        ],
        "ignore_missing": true
      }
    },
    {
      "script": {
        "description": "Normalize kubernetes annotations",
        "if": "ctx?.kubernetes?.annotations != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.annotations.keySet());
        for(k in keys) {
          if (k.indexOf(".") >= 0) {
            def sanitizedKey = k.replace(".", "_");
            ctx.kubernetes.annotations[sanitizedKey] = ctx.kubernetes.annotations[k];
            ctx.kubernetes.annotations.remove(k);
          }
        }
        """
      }
    },
    {
      "script": {
        "description": "Normalize kubernetes namespace_labels",
        "if": "ctx?.kubernetes?.namespace_labels != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.namespace_labels.keySet());
        for(k in keys) {
          if (k.indexOf(".") >= 0) {
            def sanitizedKey = k.replace(".", "_");
            ctx.kubernetes.namespace_labels[sanitizedKey] = ctx.kubernetes.namespace_labels[k];
            ctx.kubernetes.namespace_labels.remove(k);
          }
        }
        """
      }
    },
    {
      "script": {
        "description": "Normalize special Kubernetes Labels used in logs-kubernetes.container_logs to determine service.name and service.version",
        "if": "ctx?.kubernetes?.labels != null",
        "source": """
        def keys = new ArrayList(ctx.kubernetes.labels.keySet());
        for(k in keys) {
          if (k.startsWith("app_kubernetes_io_component_")) {
            def sanitizedKey = k.replace("app_kubernetes_io_component_", "app_kubernetes_io_component/");
            ctx.kubernetes.labels[sanitizedKey] = ctx.kubernetes.labels[k];
            ctx.kubernetes.labels.remove(k);
          }
        }
        """
      }
    }
    ]
}
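Two steps of the pipeline above are easier to follow with a small example: the normalization of annotation keys (dots replaced by underscores) and the dissect of kubernetes.pod_owner into a field such as kubernetes.pod.replicaset.name. The following Python sketch mirrors that logic outside of Elasticsearch. It is not how the pipeline runs, the function names are hypothetical, and the sample annotation and owner values are invented.

```python
from typing import Dict, Tuple


def sanitize_keys(mapping: Dict[str, str]) -> Dict[str, str]:
    """Replace dots in keys with underscores, as the Painless scripts do."""
    return {key.replace(".", "_"): value for key, value in mapping.items()}


def pod_owner_field(pod_owner: str) -> Tuple[str, str]:
    """Split 'Type/name' and build the target field name for the owner."""
    parent_type, parent_name = pod_owner.split("/", 1)
    return f"kubernetes.pod.{parent_type.lower()}.name", parent_name


# Invented sample values for illustration only.
annotations = {"openshift.io/scc": "restricted-v2", "plain": "value"}
print(sanitize_keys(annotations))
# {'openshift_io/scc': 'restricted-v2', 'plain': 'value'}

print(pod_owner_field("ReplicaSet/my-app-5d9c7"))
# ('kubernetes.pod.replicaset.name', 'my-app-5d9c7')
```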

Similarly, to handle the audit logs like the ones collected by Kubernetes, we define an ingest pipeline:

PUT _ingest/pipeline/openshift-audit-2-ecs
{
  "processors": [
    {
      "script": {
        "source": """
        def audit = [:];
        def keyToRemove = [];
        for(k in ctx.keySet()) {
          if (k.indexOf('_') != 0 && !['@timestamp', 'data_stream', 'openshift', 'event', 'hostname'].contains(k)) {
            audit[k] = ctx[k];
            keyToRemove.add(k);
          }
        }
        for(k in keyToRemove) {
          ctx.remove(k);
        }
        ctx.kubernetes=["audit":audit];
        """,
        "description": "Move all the 'kubernetes.audit' fields under 'kubernetes.audit' object"
      }
    },
    {
      "set": {
        "copy_from": "openshift.cluster_id",
        "field": "orchestrator.cluster.name",
        "ignore_failure": true
      }
    },
    {
      "set": {
        "field": "kubernetes.node.name",
        "copy_from": "hostname",
        "ignore_failure": true
      }
    },
    {
      "rename": {
        "field": "hostname",
        "target_field": "host.name",
        "ignore_missing": true
      }
    },
    {
      "script": {
        "if": "ctx?.kubernetes?.audit?.annotations != null",
        "source": """
          def keys = new ArrayList(ctx.kubernetes.audit.annotations.keySet());
          for(k in keys) {
            if (k.indexOf(".") >= 0) {
              def sanitizedKey = k.replace(".", "_");
              ctx.kubernetes.audit.annotations[sanitizedKey] = ctx.kubernetes.audit.annotations[k];
              ctx.kubernetes.audit.annotations.remove(k);
            }
          }
          """,
        "description": "Normalize kubernetes audit annotations field as expected by the Integration"
      }
    }
  ]
}

The main objective of the pipeline is to mimic what Elastic Agent is doing: storing all audit fields under the kubernetes.audit object.
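As a rough illustration of what the first script processor does, the Python sketch below moves every top-level field under kubernetes.audit, except fields starting with an underscore and the small set of fields the script keeps at the root. The function name and the sample event are made up; the list of kept fields is taken from the script above.

```python
KEEP_AT_ROOT = {"@timestamp", "data_stream", "openshift", "event", "hostname"}


def nest_audit_fields(doc: dict) -> dict:
    """Return a copy of doc with audit fields moved under kubernetes.audit."""
    audit = {
        key: value
        for key, value in doc.items()
        if not key.startswith("_") and key not in KEEP_AT_ROOT
    }
    result = {key: value for key, value in doc.items() if key not in audit}
    result["kubernetes"] = {"audit": audit}
    return result


# Invented audit event for illustration only.
event = {
    "@timestamp": "2024-01-16T10:00:00.000Z",
    "hostname": "node-1",
    "verb": "get",
    "auditID": "abc-123",
}
print(nest_audit_fields(event))
# {'@timestamp': '2024-01-16T10:00:00.000Z', 'hostname': 'node-1',
#  'kubernetes': {'audit': {'verb': 'get', 'auditID': 'abc-123'}}}
```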

We are not going to use the conventional @custom pipeline approach because the fields must be normalized before invoking the logs-kubernetes.container_logs integration pipeline that uses fields like kubernetes.container.name and kubernetes.labels to determine the fields service.name and service.version. Read more about custom pipelines in Tutorial: Transform data with custom ingest pipelines | Fleet and Elastic Agent Guide [8.11].

The OpenShift Cluster Log Forwarder writes the data in the indices app-write and audit-write by default. It is possible to change this behavior, but it still tries to prepend the prefix “app” and the suffix “write”, so we opted to send the data to the default destination and use the reroute processor to send it to the right data streams. Read more about the Reroute Processor in our blog Simplifying log data management: Harness the power of flexible routing with Elastic and our documentation Reroute processor | Elasticsearch Guide [8.11] | Elastic.

In this case, we want to redirect the container logs (app-write index) to logs-kubernetes.container_logs and the Audit logs (audit-write) to logs-kubernetes.audit_logs:

PUT _ingest/pipeline/app-write-reroute-pipeline
{
  "processors": [
    {
      "pipeline": {
        "name": "openshift-2-ecs",
        "description": "Format the Openshift data in ECS"
      }
    },
    {
      "set": {
        "field": "event.dataset",
        "value": "kubernetes.container_logs"
      }
    },
    {
      "reroute": {
        "destination": "logs-kubernetes.container_logs-openshift"
      }
    }
  ]
}



PUT _ingest/pipeline/audit-write-reroute-pipeline
{
  "processors": [
    {
      "pipeline": {
        "name": "openshift-audit-2-ecs",
        "description": "Format the Openshift data in ECS"
      }
    },
    {
      "set": {
        "field": "event.dataset",
        "value": "kubernetes.audit_logs"
      }
    },
    {
      "reroute": {
        "destination": "logs-kubernetes.audit_logs-openshift"
      }
    }
  ]
}

Please note that because app-write and audit-write do not follow the data stream naming convention, we have to set the destination field in the reroute processor explicitly. The reroute processor will also fill in the data_stream fields for us. Elastic Agent does this step automatically at the source.
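The naming convention in question is type-dataset-namespace. The Python sketch below is a simplified illustration of why the default index names cannot be split that way while the destinations we chose can. The function name is hypothetical and the check is deliberately naive; it does not reproduce the exact validation rules Elasticsearch applies.

```python
from typing import Dict, Optional


def parse_data_stream(name: str) -> Optional[Dict[str, str]]:
    """Split a name into type, dataset, and namespace, or return None."""
    parts = name.split("-", 2)
    if len(parts) != 3 or not all(parts):
        return None
    return {"type": parts[0], "dataset": parts[1], "namespace": parts[2]}


print(parse_data_stream("logs-kubernetes.container_logs-openshift"))
# {'type': 'logs', 'dataset': 'kubernetes.container_logs', 'namespace': 'openshift'}

print(parse_data_stream("app-write"))
# None
```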

Further, we create the indices with the default pipelines we created to reroute the logs according to our needs.

PUT app-write
{
  "settings": {
      "index.default_pipeline": "app-write-reroute-pipeline"
   }
}


PUT audit-write
{
  "settings": {
    "index.default_pipeline": "audit-write-reroute-pipeline"
  }
}

Basically, what we did can be summarized in this picture:

Let us take the container logs. When the operator attempts to write to the app-write index, it invokes the default pipeline app-write-reroute-pipeline, which formats the logs in ECS and reroutes them to the logs-kubernetes.container_logs-openshift data stream. This calls the integration pipeline, which invokes the logs-kubernetes.container_logs@custom pipeline if it exists. Finally, the logs-kubernetes.container_logs pipeline may reroute the logs to another data set and namespace based on the elastic.co/dataset and elastic.co/namespace annotations, as described in the Kubernetes integration documentation, which in turn can lead to the execution of another integration pipeline.

Create a user for sending the logs

We are going to use basic authentication because, at the time of writing, it is the only supported authentication method for Elasticsearch in OpenShift logging. Thus, we need a role that allows the user to write to and read from the app-write and audit-write indices (required by the OpenShift agent) and that grants auto_configure on logs-*-* to allow custom Kubernetes rerouting:

PUT _security/role/YOURROLE
{
    "cluster": [
      "monitor"
    ],
    "indices": [
      {
        "names": [
          "logs-*-*"
        ],
        "privileges": [
          "auto_configure",
          "create_doc"
        ],
        "allow_restricted_indices": false
      },
      {
        "names": [
          "app-write",
          "audit-write"
        ],
        "privileges": [
          "create_doc",
          "read"
        ],
        "allow_restricted_indices": false
      }
    ],
    "applications": [],
    "run_as": [],
    "metadata": {},
    "transient_metadata": {
      "enabled": true
    }

}



PUT _security/user/YOUR_USERNAME
{
  "password": "YOUR_PASSWORD",
  "roles": ["YOURROLE"]
}

On OpenShift

On the OpenShift Cluster, we need to follow the official documentation of Red Hat on how to install the Red Hat OpenShift Logging and configure Cluster Logging and the Cluster Log Forwarder.

We need to install the Red Hat OpenShift Logging Operator, which defines the ClusterLogging and ClusterLogForwarder Resources. Afterward, we can define the Cluster Logging resource:

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: vector
      vector: {}

The Cluster Log Forwarder is the resource responsible for defining a daemon set that will forward the logs to the remote Elasticsearch. Before creating it, we need to create a secret containing the Elasticsearch credentials for the user we created previously. The secret must live in the same namespace where the ClusterLogForwarder will be deployed:

apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-password
  namespace: openshift-logging
type: Opaque
stringData:
  username: YOUR_USERNAME
  password: YOUR_PASSWORD
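The manifest above uses stringData, which accepts plain-text values. If you prefer the data field instead, Kubernetes expects each value to be base64-encoded. The short Python sketch below shows that conversion; the helper name is hypothetical and the values are the same placeholders as in the manifest.

```python
import base64
from typing import Dict


def to_secret_data(string_data: Dict[str, str]) -> Dict[str, str]:
    """Base64-encode plain-text values for use under a Secret's data field."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in string_data.items()
    }


# Placeholder credentials, as in the manifest above.
encoded = to_secret_data({"username": "YOUR_USERNAME", "password": "YOUR_PASSWORD"})
for key, value in encoded.items():
    print(f"{key}: {value}")
```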

Finally, we create the ClusterLogForwarder resource:

kind: ClusterLogForwarder
apiVersion: logging.openshift.io/v1
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
    - name: remote-elasticsearch
      secret:
        name: elasticsearch-password
      type: elasticsearch
      url: "https://YOUR_ELASTICSEARCH_URL:443"
      elasticsearch:
        version: 8 # The default is version 6 with the _type field
  pipelines:
    - inputRefs:
        - application
        - audit
      name: enable-default-log-store
      outputRefs:
        - remote-elasticsearch

Note that we explicitly set the Elasticsearch version to 8. Otherwise, the ClusterLogForwarder will send the _type field, which is not compatible with Elasticsearch 8. Also note that we collect only application and audit logs.

Result

Once the logs are collected and passed through all the pipelines, the result is very close to the out-of-the-box Kubernetes integration. There are important differences, like the lack of host and cloud metadata, which do not seem to be collected (at least without additional configuration). We can view the Kubernetes container logs in the logs explorer:

In this post, we described how you can use the OpenShift Logging Operator to collect the logs of containers and audit logs. We still recommend leveraging Elastic Agent to collect all your logs. It is the best user experience you can get. No need to maintain or transform the logs yourself to ECS formatting. Additionally, Elastic Agent uses API keys as the authentication method and collects metadata like cloud information that allow you in the long run to do more.

Learn more about log monitoring with the Elastic Stack.


The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

Optimizing cloud resources and cost with APM metadata in Elastic Observability

Wed, 16 Aug 2023 00:00:00 GMT

Application performance monitoring (APM) is much more than capturing and tracking errors and stack traces. Today’s cloud-based businesses deploy applications across various regions and even cloud providers. So, harnessing the power of metadata provided by the Elastic APM agents becomes more critical. Leveraging the metadata, including crucial information like cloud region, provider, and machine type, allows us to track costs across the application stack. In this blog post, we look at how we can use cloud metadata to empower businesses to make smarter and cost-effective decisions, all while improving resource utilization and the user experience.

First, we need an example application that allows us to monitor infrastructure changes effectively. We use a Python Flask application with the Elastic Python APM agent. The application is a simple calculator taking the numbers as a REST request. We utilize Locust — a simple load-testing tool to evaluate performance under varying workloads.

The next step includes obtaining the pricing information associated with the cloud services. Every cloud provider is different. Most of them offer an option to retrieve pricing through an API. But today, we will focus on Google Cloud and will leverage their pricing calculator to retrieve relevant cost information.

The calculator and Google Cloud pricing

To perform a cost analysis, we need to know the cost of the machines in use. Google provides a billing API and Client Library to fetch the necessary data programmatically. In this blog, we are not covering the API approach. Instead, the Google Cloud Pricing Calculator is enough. Select the machine type and region in the calculator and set the count to 1 instance. It will then report the total estimated cost for this machine. Doing this for an e2-standard-4 machine type results in 107.7071784 US$ for a runtime of 730 hours.

Now, let’s go to Kibana® where we will create a new index inside Dev Tools. Since we don’t want to analyze text, we will tell Elasticsearch® to treat every string as a keyword. The index name is cloud-billing. If I later want to do the same for Azure and AWS, I can add their documents to the same index.

PUT cloud-billing
{
  "mappings": {
    "dynamic_templates": [
      {
        "stringsaskeywords": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

Next up is crafting our billing document:

POST cloud-billing/_doc/e2-standard-4_europe-west4
{
  "machine": {
    "enrichment": "e2-standard-4_europe-west4"
  },
  "cloud": {
    "machine": {
       "type": "e2-standard-4"
    },
    "region": "europe-west4",
    "provider": "google"
  },
  "stats": {
    "cpu": 4,
    "memory": 8
  },
  "price": {
    "minute": 0.002459068,
    "hour": 0.14754408,
    "month": 107.7071784
  }
}

We create a document and set a custom ID. This ID combines the machine type and the region, since a machine’s cost may differ in each region. Automatic IDs could be problematic because I might want to update what a machine costs regularly. I could use a timestamped index for that and only ever use the latest matching document. But this way, I can update the document in place and don’t have to worry about it. I calculated the price down to minute and hour prices as well. The most important thing is the machine.enrichment field, which is the same as the ID. The same instance type can exist in multiple regions, but our enrich processor is limited to match or range. So we create a combined name that can be matched exactly, as in e2-standard-4_europe-west4. It’s up to you to decide whether you want the cloud provider in there as well, making it google_e2-standard-4_europe-west4.
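For reference, the hour and minute prices in the billing document follow directly from the monthly estimate and the 730-hour runtime reported by the pricing calculator. The Python sketch below reproduces that arithmetic and the ID convention. The variable names are my own, and Decimal is used only to keep the division exact.

```python
from decimal import Decimal

HOURS_PER_MONTH = 730  # runtime used by the pricing calculator estimate
MINUTES_PER_HOUR = 60

monthly = Decimal("107.7071784")  # e2-standard-4 estimate quoted in this post
hourly = monthly / HOURS_PER_MONTH
per_minute = hourly / MINUTES_PER_HOUR

machine_type = "e2-standard-4"
region = "europe-west4"
enrichment_id = f"{machine_type}_{region}"

print(enrichment_id)  # e2-standard-4_europe-west4
print(hourly)         # 0.14754408
print(per_minute)     # 0.002459068
```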

Calculating the cost

There are multiple ways of achieving this in the Elastic Stack. In this case, we will use an enrich policy, ingest pipeline, and transform.

The enrich policy is rather easy to set up:

PUT _enrich/policy/cloud-billing
{
  "match": {
    "indices": "cloud-billing",
    "match_field": "machine.enrichment",
    "enrich_fields": ["price.minute", "price.hour", "price.month"]
  }
}

POST _enrich/policy/cloud-billing/_execute

Don’t forget to run the _execute at the end. This is necessary to create the internal indices that the enrichment in the ingest pipeline reads from. One caveat around enrichment is that whenever you add new documents to the cloud-billing index, you need to rerun the _execute statement. The ingest pipeline is rather minimalistic: it builds the lookup key, calls the enrichment, and renames a field. This is where our machine.enrichment field comes in. The last processor calculates the total cost from the hourly price and the count of unique machines seen.

PUT _ingest/pipeline/cloud-billing
{
  "processors": [
    {
      "set": {
        "field": "_temp.machine_type",
        "value": "{{cloud.machine.type}}_{{cloud.region}}"
      }
    },
    {
      "enrich": {
        "policy_name": "cloud-billing",
        "field": "_temp.machine_type",
        "target_field": "enrichment"
      }
    },
    {
      "rename": {
        "field": "enrichment.price",
        "target_field": "price"
      }
    },
    {
      "remove": {
        "field": [
          "_temp",
          "enrichment"
        ]
      }
    },
    {
      "script": {
        "source": "ctx.total_price=ctx.count_machines*ctx.price.hour"
      }
    }
  ]
}
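To make the processor chain easier to follow, here is a plain-Python sketch of what happens to a single document on its way through the pipeline. It is only a mental model, not how Elasticsearch implements ingest processors. The ENRICH_INDEX dictionary stands in for the internal enrich index, and run_pipeline is a hypothetical helper name.

```python
import copy

# Stand-in for the enrich index, keyed by machine.enrichment.
# The values are taken from the billing document above.
ENRICH_INDEX = {
    "e2-standard-4_europe-west4": {
        "machine": {"enrichment": "e2-standard-4_europe-west4"},
        "price": {"minute": 0.002459068, "hour": 0.14754408, "month": 107.7071784},
    }
}


def run_pipeline(doc: dict) -> dict:
    ctx = copy.deepcopy(doc)
    # set: build the lookup key from machine type and region
    lookup = f'{ctx["cloud"]["machine"]["type"]}_{ctx["cloud"]["region"]}'
    ctx["_temp"] = {"machine_type": lookup}
    # enrich: copy the matching billing data into the target field
    match = ENRICH_INDEX.get(lookup)
    if match is None:
        raise KeyError(f"no billing document for {lookup}")
    ctx["enrichment"] = copy.deepcopy(match)
    # rename: move enrichment.price to price
    ctx["price"] = ctx["enrichment"].pop("price")
    # remove: drop the helper fields
    del ctx["_temp"]
    del ctx["enrichment"]
    # script: total cost for the bucket
    ctx["total_price"] = ctx["count_machines"] * ctx["price"]["hour"]
    return ctx


sample = {
    "cloud": {"machine": {"type": "e2-standard-4"}, "region": "europe-west4"},
    "count_machines": 3,  # hypothetical value for illustration
}
result = run_pipeline(sample)
print(result["total_price"])
```

In this sketch an unknown machine type raises an error, which is a reminder to add a billing document (and rerun _execute) for every machine type and region you run.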

Since this is all configured now, we are ready for our Transform. For this, we need a data view that matches the APM data streams: traces-apm*, metrics-apm.*, and logs-apm.*. For the Transform, go to the Transform UI in Kibana and configure it in the following way:

We are doing an hourly breakdown, so I get one document per service, per hour, per machine type. The interesting bit is the aggregations. I want the average CPU usage plus the 75th, 95th, and 99th percentiles, which lets me see how CPU usage is distributed within each hour. At the bottom, give the transform a name, set the destination index to cloud-costs, and select the cloud-billing ingest pipeline.

Here is the entire transform as a JSON document:

PUT _transform/cloud-billing
{
  "source": {
    "index": [
      "traces-apm*",
      "metrics-apm.*",
      "logs-apm.*"
    ],
    "query": {
      "bool": {
        "filter": [
          {
            "bool": {
              "should": [
                {
                  "exists": {
                    "field": "cloud.provider"
                  }
                }
              ],
              "minimum_should_match": 1
            }
          }
        ]
      }
    }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h"
        }
      },
      "cloud.provider": {
        "terms": {
          "field": "cloud.provider"
        }
      },
      "cloud.region": {
        "terms": {
          "field": "cloud.region"
        }
      },
      "cloud.machine.type": {
        "terms": {
          "field": "cloud.machine.type"
        }
      },
      "service.name": {
        "terms": {
          "field": "service.name"
        }
      }
    },
    "aggregations": {
      "avg_cpu": {
        "avg": {
          "field": "system.cpu.total.norm.pct"
        }
      },
      "percentiles_cpu": {
        "percentiles": {
          "field": "system.cpu.total.norm.pct",
          "percents": [
            75,
            95,
            99
          ]
        }
      },
      "avg_transaction_duration": {
        "avg": {
          "field": "transaction.duration.us"
        }
      },
      "percentiles_transaction_duration": {
        "percentiles": {
          "field": "transaction.duration.us",
          "percents": [
            75,
            95,
            99
          ]
        }
      },
      "count_machines": {
        "cardinality": {
          "field": "cloud.instance.id"
        }
      }
    }
  },
  "dest": {
    "index": "cloud-costs",
    "pipeline": "cloud-billing"
  },
  "sync": {
    "time": {
      "delay": "120s",
      "field": "@timestamp"
    }
  },
  "settings": {
    "max_page_search_size": 1000
  }
}
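If the pivot is hard to picture, the following Python sketch mimics its core idea on a handful of in-memory events: group by calendar hour, provider, region, machine type, and service, then average the CPU and count the distinct instance IDs. The events, the service name calculator, and the instance IDs are invented for illustration. The sketch counts instances exactly with a set, whereas the cardinality aggregation in Elasticsearch is approximate.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_pivot(events):
    """Group events per calendar hour and service, like the transform's group_by."""
    buckets = defaultdict(lambda: {"cpu": [], "instances": set()})
    for event in events:
        hour = event["@timestamp"].replace(minute=0, second=0, microsecond=0)
        key = (hour, event["cloud.provider"], event["cloud.region"],
               event["cloud.machine.type"], event["service.name"])
        bucket = buckets[key]
        bucket["instances"].add(event["cloud.instance.id"])
        cpu = event.get("system.cpu.total.norm.pct")
        if cpu is not None:  # not every APM document carries CPU metrics
            bucket["cpu"].append(cpu)
    rows = []
    for key, bucket in sorted(buckets.items(), key=lambda item: item[0]):
        rows.append({
            "key": key,
            "avg_cpu": mean(bucket["cpu"]) if bucket["cpu"] else None,
            "count_machines": len(bucket["instances"]),
        })
    return rows


def make_event(hour, minute, instance, cpu=None):
    event = {
        "@timestamp": datetime(2023, 4, 24, hour, minute),
        "cloud.provider": "gcp",
        "cloud.region": "europe-west4",
        "cloud.machine.type": "e2-standard-4",
        "service.name": "calculator",
        "cloud.instance.id": instance,
    }
    if cpu is not None:
        event["system.cpu.total.norm.pct"] = cpu
    return event


rows = hourly_pivot([
    make_event(8, 5, "vm-a", 0.10),
    make_event(8, 40, "vm-b", 0.30),
    make_event(8, 50, "vm-a"),
    make_event(9, 10, "vm-a", 0.20),
])
for row in rows:
    print(row["key"][0], row["avg_cpu"], row["count_machines"])
```

Each row corresponds to one document in cloud-costs, and count_machines is the value the ingest pipeline multiplies by the hourly price.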

Once the transform is created and running, we need a Kibana Data View for the index cloud-costs. For the transaction duration fields, use the custom formatter inside Kibana and set the format to “Duration” in “microseconds.”

With that, everything is arranged and ready to go.

Observing infrastructure changes

Below I created a dashboard that allows us to identify:

How much cost a certain service creates
CPU usage
Memory usage
Transaction duration
Cost-saving potential

From left to right, we want to focus on the very first chart. We have the bars representing the CPU as average in green and 95th percentile in blue on top. It goes from 0 to 100% and is normalized, meaning that even with 8 CPU cores, it will still read 100% usage and not 800%. The line graph represents the transaction duration, the average being in red, and the 95th percentile in purple. Last, we have the orange area at the bottom, which is the average memory usage on that host.

We immediately realize that our calculator does not need a lot of memory. Hovering over the graph reveals 2.89% memory usage. The e2-standard-8 machine that we are using has 32 GB of memory. We occasionally spike to 100% CPU in the 95th percentile. When this happens, we see that the average transaction duration spikes to 2.5 milliseconds. However, every hour this machine costs us a rounded 30 cents. Using this information, we can now downsize to a better fit. The average CPU usage is around 11-13%, and the 95th percentile is not that far away.

Because we are using 8 CPUs, one could now say that 12.5% represents a full core, but that is just an assumption on a piece of paper. Nonetheless, we know there is a lot of headroom, and we can downscale quite a bit. In this case, I decided to go to 2 CPUs and 2 GB of RAM, known as e2-highcpu-2. This should fit my calculator application better. We barely touched the RAM: 2.89% of 32 GB is roughly 1 GB in use. After the change and a reboot of the calculator machine, I started the same Locust test to identify my CPU usage and, more importantly, whether my transactions get slower and, if so, by how much. Ultimately, I want to decide whether 1 millisecond more latency is worth 10 more cents per hour. I added the change as an annotation in Lens.
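The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only the arithmetic from the text; the helper names are made up, and the "one full core" figure remains the paper assumption mentioned above, since load rarely maps cleanly onto a single core.

```python
def one_core_share_pct(cores: int) -> float:
    """Share of normalized CPU (0-100%) that one fully busy core represents."""
    return 100.0 / cores


def memory_in_use_gb(total_gb: float, used_pct: float) -> float:
    """Convert a memory usage percentage into gigabytes."""
    return total_gb * used_pct / 100.0


core_share = one_core_share_pct(8)        # e2-standard-8 has 8 CPUs
memory_gb = memory_in_use_gb(32, 2.89)    # 2.89% of 32 GB
print(f"one core = {core_share}% normalized CPU, memory in use = {memory_gb:.2f} GB")
```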

After letting it run for a bit, we can now identify the smaller host's impact. In this case, we can see that the average did not change. However, the 95th percentile (the value that 95% of all transactions stay below) did spike up. Again, it looks bad at first, but on closer inspection it went from ~1.5 milliseconds to ~2.1 milliseconds, a ~0.6 millisecond increase. Now you can decide whether avoiding that 0.6 millisecond increase is worth paying ~$180 more per month or if the current latency is good enough.

Conclusion

Observability is more than just collecting logs, metrics, and traces. Linking user experience to cloud costs allows your business to identify areas where you can save money. Having the right tools at your disposal will help you generate those insights quickly. Making informed decisions about how to optimize your cloud cost and ultimately improve the user experience is the bottom-line goal.

The dashboard and data view can be found in my GitHub repository. You can download the .ndjson file and import it using the Saved Objects inside Stack Management in Kibana.

Caveats

Pricing covers only the base machines, without disks, static public IP addresses, or any other additional cost such as licenses for operating systems. Furthermore, it excludes spot pricing, discounts, and free credits. Data transfer costs between services are also not included. We only calculate it based on the minute rate of the service running; we are not checking billing intervals from Google Cloud. In our case, we would bill per minute, regardless of what Google Cloud has. Using the count of unique instance IDs works as intended. However, if a machine is only running for one minute, we calculate it based on the hourly rate. So a machine running for one minute will cost the same as one running for 50 minutes, at least in how we calculate it. The transform uses calendar hour intervals; therefore, the buckets are 8 am-9 am, 9 am-10 am, and so on.
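The hourly-bucket caveat is easier to see with numbers. The sketch below assumes a machine that reports continuously between a start and an end time and counts how many calendar-hour buckets it shows up in; each bucket is then charged at the full hourly rate, as the ingest pipeline does. The function names and the example times are hypothetical.

```python
from datetime import datetime, timedelta

HOURLY_PRICE = 0.14754408  # price.hour from the billing document


def calendar_hours_touched(start: datetime, end: datetime) -> int:
    """Count calendar-hour buckets between start (inclusive) and end (exclusive)."""
    bucket = start.replace(minute=0, second=0, microsecond=0)
    hours = 0
    while bucket < end:
        hours += 1
        bucket += timedelta(hours=1)
    return hours


def cost_as_calculated(start: datetime, end: datetime, machines: int = 1) -> float:
    """Cost the way the transform and pipeline compute it: full hours only."""
    return calendar_hours_touched(start, end) * machines * HOURLY_PRICE


day = datetime(2023, 4, 24)
one_minute = cost_as_calculated(day.replace(hour=8, minute=10), day.replace(hour=8, minute=11))
fifty_minutes = cost_as_calculated(day.replace(hour=8, minute=5), day.replace(hour=8, minute=55))
across_boundary = cost_as_calculated(day.replace(hour=8, minute=59), day.replace(hour=9, minute=1))
print(one_minute, fifty_minutes, across_boundary)
```

By the same logic, a machine that runs for only two minutes but crosses a bucket boundary (8:59 to 9:01) appears in two buckets and is charged for two hours.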

Universal Profiling: Detecting CO2 and energy efficiency

Mon, 05 Feb 2024 00:00:00 GMT

A while ago, we posted a blog that detailed how we imported over 4 billion chess games with speed using Python and optimized the code leveraging our Universal Profiling™. This was based on Elastic Stack running on version 8.9. We are now on 8.12, and it is time to do a second part that shows how easy it is to observe compiled languages and how Elastic®’s Universal Profiling can help you determine the benefit of a rewrite, both from a cost and environmental friendliness angle.

Why efficiency matters — for you and the environment

Data centers are estimated to account for ~3% of global electricity consumption, and their usage is expected to double by 2030.* The cost of a digital service is a close proxy for its computing efficiency, and thus, being more efficient is a win-win: less energy consumed, smaller bill.

In the same scenario, companies want the ability to scale to more users while spending less for each user and are effectively looking into methods of reducing their energy consumption.

In this spirit, Universal Profiling comes equipped with data and visualizations to help determine where efficiency improvement efforts are worth the most.

Energy efficiency measures how much a digital service consumes to produce an output given an input. It can be measured in multiple ways, and we at Elastic Observability chose CO₂ emissions and annualized CO₂ emissions (more details on them later).

Let’s take the example of an e-commerce website: the energy efficiency of the “search inventory” process could be calculated as the average CPU time needed to serve a user request. Once the baseline for this value is determined, changes to the software delivering the search process may result in more or less CPU time consumed for the same feature, resulting in less or more efficient code.
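As a rough illustration of that baseline idea, the Python sketch below compares the average CPU time per request before and after a code change. All numbers are invented for the example, and the helper names are hypothetical; in practice the CPU time would come from your profiling data.

```python
def cpu_ms_per_request(total_cpu_ms: float, requests: int) -> float:
    """Average CPU time spent to serve one request."""
    return total_cpu_ms / requests


def relative_change(baseline: float, candidate: float) -> float:
    """Negative values mean the candidate needs less CPU time per request."""
    return (candidate - baseline) / baseline


# Hypothetical measurements for the "search inventory" feature
baseline = cpu_ms_per_request(total_cpu_ms=120_000, requests=10_000)
candidate = cpu_ms_per_request(total_cpu_ms=90_000, requests=10_000)
change = relative_change(baseline, candidate)
print(f"baseline={baseline} ms, candidate={candidate} ms, change={change:.0%}")
```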

How to set up and configure wattage and CO2

You can find a “Settings” button in the top-right corner of the Universal Profiling views. From there, you can customize the coefficient used to calculate CO₂ emissions tied to profiling data.

The values set here will be used only when the profiles gathered from host agents are not already associated with publicly known data certified by cloud providers. For example, suppose you have a hybrid cloud deployment with a portion of your workload running on-premise and a portion running in GCP. In that case, the values set here will only be used to calculate the CO₂ emissions for the on-premise machines; we already use all the coefficients as declared by GCP to calculate the emissions of those machines.

Python vs. Go

Our first blog post implemented a solution in Python to read chess games in PGN, a text representation of the games. It showed how Universal Profiler can be leveraged to identify slow functions and help you rewrite your code to be faster and more efficient. At the end of it, we were happy with the Python version. It is still used today to grab the monthly updates from the Lichess database and ingest them into Elasticsearch®. I always wanted a reason to work more with Go, so we rewrote the Python implementation in Go. We leveraged goroutines and channels to send data through message passing. You can see more about it in our GitHub repository.

Rewriting in Go also means switching from an interpreted language to a compiled one. As with everything in IT, this has benefits as well as disadvantages. One disadvantage is that we must ship debug symbols for the compiled binary. When we build the binary, we can use the symbtool program to ship the debug symbols. Without debug symbols, we see uninterpretable information as frames will be labeled with hexadecimal addresses in the flame graph rather than source code annotations.

First, make sure that your executable includes debug symbols. By default, Go builds with debug symbols. You can check this by running file yourbinary. The important part is that the output says not stripped.

file lichess
lichess: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, Go BuildID=gufIkqA61WnCh8haeW-2/lfn3ne3U_y8MGoFD4AvT/QJEykzbacbYEmEQpXH6U/MqVbk-402n1k3B8yPB6I, with debug_info, not stripped

Now we need to push the symbols using symbtool. You must create an Elasticsearch API key as the authentication method. In the Universal Profiler UI in Kibana®, an Add Data button in the top right corner will tell you exactly what to do. The command looks like this. The -e flag is where you pass the path of your executable file. In our case, this is lichess as above.

symbtool push-symbols executable -t "ApiKey" -u "elasticsearch-url" -e "lichess"

Now that debug symbols are available inside the cluster, we can run both implementations with the same file simultaneously and see what Universal Profiler can tell us about it.

Identifying CO2 and energy efficiency savings

Python is more frequently scheduled on the CPU. Thus, it runs more often on the hardware and contributes more to the machines’ resource usage.

We use the differential flame graph to identify and automatically calculate the difference in the following comparison. You need to filter on process.thread.name: “python3.11” in the baseline, and for the comparison, filter for lichess.

Looking at the impact on annualized CO₂ emissions, we see a decrease from 65.32kg of CO₂ for the Python solution to 16.78kg for the Go solution. That is a saving of 48.54kg of CO₂ over a year.

If we take a step back, we’ll want to figure out why Python produces so many more emissions. In the flamegraph view, we filter down to just Python and click on the first frame, called python3.11. A little popup tells us that it caused 32.95kg of emissions. In other words, the interpreter runtime alone accounts for roughly 50% of all emissions of the Python solution. Our program itself caused the other ~32kg of CO₂. By replacing the Python interpreter with a compiled Go binary, we immediately cut roughly 33kg of annual emissions.
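For anyone who wants to double-check the figures, here is the arithmetic as a short Python sketch. The three input values are the ones reported by the Universal Profiling UI above; the variable names and the derived percentages are my own additions.

```python
python_kg = 65.32        # annualized CO2 of the Python solution
go_kg = 16.78            # annualized CO2 of the Go solution
interpreter_kg = 32.95   # annualized CO2 of the python3.11 frame

savings_kg = round(python_kg - go_kg, 2)
reduction_pct = round(savings_kg / python_kg * 100, 1)
interpreter_share_pct = round(interpreter_kg / python_kg * 100, 1)
program_kg = round(python_kg - interpreter_kg, 2)

print(f"savings: {savings_kg} kg/year ({reduction_pct}% less than Python)")
print(f"interpreter share: {interpreter_share_pct}%, program itself: {program_kg} kg")
```

So the rewrite removes about three quarters of the annualized emissions, and the interpreter frame accounts for about half of the Python total.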

We can lock that box using a right click and click Show more information.

The Show more information link displays detailed information about the frame, like sample count, total CPU, core seconds, and dollar costs. We won’t go into more detail in this blog.

Reduce your carbon footprint today with Universal Profiling

This blog post demonstrates that rewriting your code base can reduce your carbon footprint immensely. Using Universal Profiler, you could do a quick PoC to showcase how much carbon emissions can be saved.

Learn how you can get started with Elastic Universal Profiling today.

The cluster for storing the data consists of three nodes, each with 64GB RAM and 32 CPU cores, running on GCP in Elastic Cloud.

The machine for sending the data is a GCP e2-standard-32, thus 128GB RAM and 32 CPU cores with a 500GB balanced disk to read the games from.

The file used for the games is this Lichess database containing 96,909,211 games. The extracted file size is 211GB.

Source:

*https://media.ccc.de/v/camp2023-57070-energy_consumption_of_data_centers