Optimizing cloud resources and cost with APM metadata in Elastic Observability


Application performance monitoring (APM) is much more than capturing and tracking errors and stack traces. Today’s cloud-based businesses deploy applications across various regions and even multiple cloud providers, which makes harnessing the metadata provided by the Elastic APM agents all the more important. This metadata, including crucial information like cloud region, provider, and machine type, allows us to track costs across the application stack. In this blog post, we look at how cloud metadata can empower businesses to make smarter, more cost-effective decisions, all while improving resource utilization and the user experience.

First, we need an example application that allows us to monitor infrastructure changes effectively. We use a Python Flask application instrumented with the Elastic APM Python agent. The application is a simple calculator that takes numbers via a REST request. We use Locust, a simple load-testing tool, to evaluate performance under varying workloads.
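A minimal sketch of such a setup could look like the following. The endpoint, service name, APM Server URL, and token here are illustrative placeholders, not the exact application used for this post:

# app.py: a minimal Flask calculator instrumented with the Elastic APM Python agent
# (service name, server URL, and token are placeholders)
from flask import Flask, jsonify, request
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)
app.config["ELASTIC_APM"] = {
    "SERVICE_NAME": "calculator",
    "SERVER_URL": "https://my-apm-server:8200",
    "SECRET_TOKEN": "changeme",
}
apm = ElasticAPM(app)  # the agent also collects cloud metadata (provider, region, machine type)

@app.route("/add")
def add():
    # GET /add?a=1&b=2 -> {"result": 3.0}
    a = float(request.args.get("a", 0))
    b = float(request.args.get("b", 0))
    return jsonify(result=a + b)

# locustfile.py: drives load against the calculator endpoint
from locust import HttpUser, task, between

class CalculatorUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def add(self):
        self.client.get("/add", params={"a": 1, "b": 2})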

The next step is obtaining the pricing information for the cloud services in use. Every cloud provider is different; most offer an API to retrieve pricing. Today, we will focus on Google Cloud and leverage its pricing calculator to retrieve the relevant cost information.

The calculator and Google Cloud pricing

To perform a cost analysis, we need to know the cost of the machines in use. Google provides a billing API and client library to fetch the necessary data programmatically. In this blog, we are not covering the API approach; the Google Cloud Pricing Calculator is enough. Select the machine type and region in the calculator and set the instance count to 1. It will then report the total estimated cost for this machine. Doing this for an e2-standard-4 machine type results in US$107.7071784 for a runtime of 730 hours (one month).
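The calculator only reports the monthly total, so to get the hourly and per-minute rates that we will store in the billing document below, we simply divide the 730-hour estimate down:

107.7071784 $ / 730 h  = 0.14754408 $ per hour
0.14754408 $ / 60 min  = 0.002459068 $ per minute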

Now, let’s go to Kibana®, where we will create a new index inside Dev Tools. Since we don’t want to analyze any text, we tell Elasticsearch® to treat every string field as a keyword. The index name is cloud-billing. If I later want to do the same for Azure and AWS, I can add those documents to the same index.

PUT cloud-billing
{
  "mappings": {
    "dynamic_templates": [
      {
        "stringsaskeywords": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

Next up is crafting our billing document:

POST cloud-billing/_doc/e2-standard-4_europe-west4
{
  "machine": {
    "enrichment": "e2-standard-4_europe-west4"
  },
  "cloud": {
    "machine": {
       "type": "e2-standard-4"
    },
    "region": "europe-west4",
    "provider": "google"
  },
  "stats": {
    "cpu": 4,
    "memory": 8
  },
  "price": {
    "minute": 0.002459068,
    "hour": 0.14754408,
    "month": 107.7071784
  }
}

We create a document and set a custom ID. This ID combines the instance type and the region, since a machine’s cost can differ per region. Automatic IDs would be problematic because I might want to update a machine’s price regularly. I could use a timestamped index and only ever query the latest matching document, but with a fixed ID I can simply update the document and not worry about it. I also broke the price down into per-minute and per-hour rates. The most important field is machine.enrichment, which is identical to the ID. The same instance type can exist in multiple regions, but our enrich policy is limited to match or range, so we create a name that matches explicitly, as in e2-standard-4_europe-west4. It’s up to you whether you also want the cloud provider in there and make it google_e2-standard-4_europe-west4.

Calculating the cost

There are multiple ways of achieving this in the Elastic Stack. In this case, we will use an enrich policy, an ingest pipeline, and a transform.

The enrich policy is rather easy to set up:

PUT _enrich/policy/cloud-billing
{
  "match": {
    "indices": "cloud-billing",
    "match_field": "machine.enrichment",
    "enrich_fields": ["price.minute", "price.hour", "price.month"]
  }
}

POST _enrich/policy/cloud-billing/_execute

Don’t forget to run the _execute call at the end. This is necessary to build the internal index that the enrich processor in the ingest pipeline uses. The ingest pipeline itself is rather minimalistic: it calls the enrichment and renames a field. This is where our machine.enrichment field comes in. One caveat around enrichment is that whenever you add new documents to the cloud-billing index, you need to rerun the _execute statement. The last processor calculates the total cost from the number of unique machines seen.

PUT _ingest/pipeline/cloud-billing
{
  "processors": [
    {
      "set": {
        "field": "_temp.machine_type",
        "value": "{{cloud.machine.type}}_{{cloud.region}}"
      }
    },
    {
      "enrich": {
        "policy_name": "cloud-billing",
        "field": "_temp.machine_type",
        "target_field": "enrichment"
      }
    },
    {
      "rename": {
        "field": "enrichment.price",
        "target_field": "price"
      }
    },
    {
      "remove": {
        "field": [
          "_temp",
          "enrichment"
        ]
      }
    },
    {
      "script": {
        "source": "ctx.total_price=ctx.count_machines*ctx.price.hour"
      }
    }
  ]
}
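Before wiring the pipeline into the transform, you can run a hand-crafted sample document through the simulate API to verify that the enrichment resolves. The document below is only an illustrative sample matching the billing document we indexed earlier; count_machines is included because the final script processor expects it:

POST _ingest/pipeline/cloud-billing/_simulate
{
  "docs": [
    {
      "_source": {
        "cloud": {
          "provider": "google",
          "region": "europe-west4",
          "machine": {
            "type": "e2-standard-4"
          }
        },
        "count_machines": 1
      }
    }
  ]
}

If everything is wired up correctly, the response contains the enriched price fields and a computed total_price.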

Since this is all configured now, we are ready for our transform. For this, we need a data view that matches the APM data streams: traces-apm*, metrics-apm.*, and logs-apm.*. Then go to the Transform UI in Kibana and configure the transform in the following way:

transform configuration

We are doing an hourly breakdown; therefore, we get one document per service, per hour, per machine type. The interesting bit is the aggregations: we want the average CPU usage as well as the 75th, 95th, and 99th percentiles, so we can see how CPU usage is distributed within each hour. At the bottom, give the transform a name, select cloud-costs as the destination index, and select the cloud-billing ingest pipeline.

Here is the entire transform as a JSON document:

PUT _transform/cloud-billing
{
  "source": {
    "index": [
      "traces-apm*",
      "metrics-apm.*",
      "logs-apm.*"
    ],
    "query": {
      "bool": {
        "filter": [
          {
            "bool": {
              "should": [
                {
                  "exists": {
                    "field": "cloud.provider"
                  }
                }
              ],
              "minimum_should_match": 1
            }
          }
        ]
      }
    }
  },
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1h"
        }
      },
      "cloud.provider": {
        "terms": {
          "field": "cloud.provider"
        }
      },
      "cloud.region": {
        "terms": {
          "field": "cloud.region"
        }
      },
      "cloud.machine.type": {
        "terms": {
          "field": "cloud.machine.type"
        }
      },
      "service.name": {
        "terms": {
          "field": "service.name"
        }
      }
    },
    "aggregations": {
      "avg_cpu": {
        "avg": {
          "field": "system.cpu.total.norm.pct"
        }
      },
      "percentiles_cpu": {
        "percentiles": {
          "field": "system.cpu.total.norm.pct",
          "percents": [
            75,
            95,
            99
          ]
        }
      },
      "avg_transaction_duration": {
        "avg": {
          "field": "transaction.duration.us"
        }
      },
      "percentiles_transaction_duration": {
        "percentiles": {
          "field": "transaction.duration.us",
          "percents": [
            75,
            95,
            99
          ]
        }
      },
      "count_machines": {
        "cardinality": {
          "field": "cloud.instance.id"
        }
      }
    }
  },
  "dest": {
    "index": "cloud-costs",
    "pipeline": "cloud-costs"
  },
  "sync": {
    "time": {
      "delay": "120s",
      "field": "@timestamp"
    }
  },
  "settings": {
    "max_page_search_size": 1000
  }
}
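If you create the transform through the API as shown above rather than the UI, remember that it still needs to be started; you can then check on its progress with the stats endpoint:

POST _transform/cloud-billing/_start

GET _transform/cloud-billing/_stats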

Once the transform is created and running, we need a Kibana data view for the cloud-costs index. For the transaction duration fields, use the custom field formatter inside Kibana and set the format to “Duration” with an input of “microseconds.”

cloud costs

With that, everything is arranged and ready to go.

Observing infrastructure changes

Below, I created a dashboard that allows us to identify:

  • How much cost a certain service generates
  • CPU usage
  • Memory usage
  • Transaction duration
  • Cost-saving potential

graphs

From left to right, we want to focus on the very first chart. The bars represent CPU usage, with the average in green and the 95th percentile in blue on top. The scale goes from 0 to 100% and is normalized, meaning that even with 8 CPU cores it reads 100% usage rather than 800%. The line graph represents the transaction duration, with the average in red and the 95th percentile in purple. Finally, the orange area at the bottom is the average memory usage on that host.

We immediately see that our calculator does not need a lot of memory. Hovering over the graph reveals 2.89% memory usage, and the e2-standard-8 machine that we are using has 32 GB of memory. We occasionally spike to 100% CPU in the 95th percentile, and when this happens the average transaction duration spikes to 2.5 milliseconds. Meanwhile, this machine costs us roughly 30 cents every hour. Using this information, we can now downsize to a better fit: the average CPU usage is around 11-13%, and the 95th percentile is not far above that.

Because we are using 8 CPUs, one could say that 12.5% represents one full core, but that is just an assumption on paper. Nonetheless, we know there is a lot of headroom, and we can downscale quite a bit. In this case, I decided to go with 2 CPUs and 2 GB of RAM, known as e2-highcpu-2. This should fit my calculator application better; we barely touched the RAM, and 2.89% of 32 GB is roughly 1 GB in use. After the change and a reboot of the calculator machine, I started the same Locust test to see the new CPU usage and, more importantly, whether my transactions get slower, and if so, by how much. Ultimately, I want to decide whether 1 millisecond more latency is worth 10 more cents per hour. I added the change as an annotation in Lens.

After letting it run for a bit, we can now see the impact of the smaller host. In this case, the average did not change. However, the 95th percentile (meaning 95% of all transactions are below this value) did spike up. It looks bad at first, but on closer inspection it only went from ~1.5 milliseconds to ~2.1 milliseconds, a ~0.6 millisecond increase. Now you can decide whether avoiding that 0.6 millisecond increase is worth paying ~$180 more per month, or whether the new latency is good enough.

Conclusion

Observability is more than just collecting logs, metrics, and traces. Linking user experience to cloud costs allows your business to identify areas where you can save money. Having the right tools at your disposal will help you generate those insights quickly. Making informed decisions about how to optimize your cloud cost and ultimately improve the user experience is the bottom-line goal.

The dashboard and data view can be found in my GitHub repository. You can download the .ndjson file and import it via the Saved Objects page inside Stack Management in Kibana.

Caveats

Pricing covers only the base machines, without any disks, static public IP addresses, or other additional costs such as operating system licenses. Furthermore, it excludes spot pricing, discounts, and free credits, and data transfer costs between services are also not included. We only calculate cost based on our own stored rates for the time the service is running; we are not checking billing intervals from Google Cloud. Counting unique instance.ids works as intended. However, if a machine runs for only one minute within an hour, we still calculate it at the full hourly rate, so a machine running for one minute costs the same as one running for 50 minutes, at least in our calculation. The transform uses calendar hour intervals; therefore, the buckets are 8 am-9 am, 9 am-10 am, and so on.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.