Elastic Cloud is now using Automated Certificate Management with Let's Encrypt


At Elastic we are committed to offering our customers the most secure way to run their workloads in the cloud. To help meet this commitment, we have migrated our Elastic Cloud Transport Layer Security (TLS) certificates to Let’s Encrypt, which lets us support sustainable and fully automated certificate management as our product offerings and available regions continue to expand.

To achieve this change, we have built a fully automated regional ACM (ACME Certificate Manager). This post walks through why we needed to change, what options we explored, how the implementation went, and how we use our own Elastic Uptime Monitoring and Kibana to monitor the certificates.

Why we needed to change

Certificate management was toilsome, costly, and time-consuming. It took hours to artisanally handcraft configurations to create CSRs, laboriously submit them to a Certificate Authority, cautiously ask our finance team to pay the bill, lovingly receive the bundles using the very-modern-and-fancy-enterprise-level “download” button, eagerly save them to a secure location for safe keeping, and cautiously update the numerous cloud providers and regions with the new certificates.

We are expanding the number of regions and services in Elastic Cloud, and changing the certificates took days even when we were focused and feverish, and often weeks given the complexity and the underlying risk of human error.

What options we explored

Initially, we explored multiple options to find the simplest and most reliable solution. After intensive research and discussion, we narrowed the candidates down to the following:

  • Developing our own app using ACME libraries
  • Using an internal tool that other Elastic teams use to manage DNS and certificates
  • Automating the process by calling an ACME client directly (certbot or the lego CLI)
  • Using the Terraform ACME provider together with an internal Terraform provider for Elastic Cloud

What we decided to use and why

At Elastic we reward impact and simplicity; the elegant solution is most often the one that requires us to build little ourselves.

The goal was a reliable, stable, and simple solution that is easy to maintain and update. To achieve that goal, we chose a solution based on Terraform, using an internal Terraform provider and a Terraform ACME certificate provider.

Terraform is widely used by the Elastic SRE team and the wider SRE community, so any member of the team can rapidly understand and maintain the ACM solution.

How that implementation went

Simplicity is not easy to achieve. During the development of the ACM, we encountered multiple obstacles, including:

  • The Let's Encrypt limit of 100 names per certificate
  • DNS-01 challenge validation, with DNS entries stored in different cloud service provider (CSP) accounts
  • Top-level domain certificates shared across multiple cloud provider accounts
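To make the first obstacle concrete, here is a minimal sketch of how a list of DNS names could be grouped so that no single certificate exceeds the 100-name limit. This is an illustration only, not the ACM's actual logic: the function name and the choice to sort and deduplicate are assumptions.

```python
from typing import List

# Let's Encrypt accepts at most 100 names on a single certificate.
MAX_NAMES_PER_CERT = 100


def split_into_certificates(
    names: List[str], limit: int = MAX_NAMES_PER_CERT
) -> List[List[str]]:
    """Group DNS names into batches that each fit on one certificate.

    Hypothetical helper: names are deduplicated and sorted so that the
    same input always produces the same batches.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    unique = sorted(set(names))
    # Slice the ordered names into consecutive chunks of at most `limit`.
    return [unique[i:i + limit] for i in range(0, len(unique), limit)]
```

With 250 distinct names, this yields three batches of 100, 100, and 50 names, each of which could back one certificate request.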

To automate the ACM successfully despite these challenges, we needed several teams to collaborate so that certificates could be shared across regions managed in different accounts, without introducing security or reliability risks and with only the minimal permissions required.

Because we aimed for minimal impact on our customers, finding the best way to communicate the changes required a huge amount of research and planning. Making sure we could rapidly support any customers affected by the transition was also quite a challenge for our engineering team. The collaboration between the Communication, Support, and Engineering teams gave customers the information and support they needed for a smooth change.

How it all works

The ACM solution has a single variables file covering all regions and subdomains, which is updated and reviewed by our SRE team. Every day, the ACM renews any certificate expiring within the next month, issues new certificates, and revokes certificates that are no longer required.

  1. A scheduled job runs the ACM daily.
  2. Terraform issues, renews, or revokes regional certificates using the ACME Terraform provider.
    1. The certificate is validated using the DNS-01 challenge on the corresponding DNS host for the region.
  3. Terraform stores the certificate and its key in HashiCorp Vault as the source of truth.
  4. Terraform pushes the certificate to the regional proxies using our internal elasticcloud Terraform provider.
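The daily decision described above (issue, renew, or revoke) can be sketched as a small reconciliation step. In the ACM this work is done by Terraform and its ACME provider; the Python below only illustrates the logic. The `Plan` type, the function name, and the reading of "the next month" as 30 days are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Set

# Assumption: "expiring in the next month" is treated as a 30-day window.
RENEWAL_WINDOW = timedelta(days=30)


@dataclass
class Plan:
    issue: List[str]   # desired, but no certificate exists yet
    renew: List[str]   # certificate exists and expires within the window
    revoke: List[str]  # certificate exists but is no longer desired


def plan_daily_run(
    desired: Set[str], issued: Dict[str, datetime], now: datetime
) -> Plan:
    """Compare the desired certificates (from the variables file) with
    the ones already issued, keyed by name with their expiry time."""
    existing = set(issued)
    issue = sorted(desired - existing)
    revoke = sorted(existing - desired)
    renew = sorted(
        name for name in desired & existing
        if issued[name] - now <= RENEWAL_WINDOW
    )
    return Plan(issue=issue, renew=renew, revoke=revoke)
```

A certificate that is still desired and has more than 30 days left appears in none of the three lists, so a run that finds nothing to do changes nothing.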

How we monitor the certificates

We use Elastic Uptime Monitoring and Kibana to check, report, and alert on certificate validity and expiration based on configurable thresholds.
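As a rough illustration of what a threshold-based expiry check involves, the sketch below converts a certificate's notAfter timestamp into days remaining and maps it to a status. It is not how Elastic Uptime Monitoring is implemented; the function names and the 30-day and 7-day thresholds are assumptions chosen for the example.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days left before a certificate expires.

    `not_after` uses the format Python's ssl module reports for a
    certificate's notAfter field, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now_epoch) / SECONDS_PER_DAY


def expiry_status(
    days_left: float, warn_days: float = 30, critical_days: float = 7
) -> str:
    """Map days remaining to a status using hypothetical thresholds."""
    if days_left <= 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"
```

Keeping the thresholds as parameters mirrors the idea of configurable alerting: the same check can warn early for some certificates and later for others.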

Wrapping up

The ACM solution allowed us to fully automate certificate creation and deployment. Removing human intervention has increased the stability of our services, and we can now expand and update our regions at a faster pace.