Automated Certificate Renewal

Throughout my career in software engineering, certificate management and renewal have been considered a difficult part of the job.

I've put this down to a few things:

  • Cryptography is a complex subject in its own right. Although engineers are not typically writing encryption algorithms themselves, the topic still seems daunting
  • The certificate renewal process doesn't always follow the good practices we use in our day-to-day development (code first, pipeline driven, tested). Typically, it's a set of 'scripts' run on an arbitrary virtual machine by the one or two individuals who know how it all works
  • The need to rotate certificates is infrequent (it could be quarterly, yearly or even every few years). This places a cognitive load on engineers, who find themselves asking: 'how do we do this again?' or 'has anything changed since we last did this?'

How is ClearBank solving these issues?

As our system has grown over the years, our requirement for certificates has also grown. We spent some time reviewing the certificate renewal process for our Service Fabric cluster.

The main issues we found include:

  • We rely on the Service Fabric cluster portal to inform us when the certificate is due for renewal. This requires an engineer to spot the banner that appears on the Service Fabric web interface and inform the team
  • The renewal of the cluster certificate is run through a pipeline that applies the changes via Terraform. An engineer is required to run this pipeline each time
  • For the client certificate, due to technical issues, we have to run a PowerShell script. This script is run via a pipeline but, again, it has to be triggered by an engineer, which adds unnecessary cognitive load

Our main objective was to automate the whole certificate renewal flow and remove any actions required by engineers.

Service Fabric certificate renewal

We have chosen to use Let's Encrypt as our certificate authority. The main benefit of Let's Encrypt is that it issues certificates automatically in response to a successful ACME request.

To automate the ACME request for new certificates, we have used the KeyVault Acmebot created by Tatsuro Shibamura. This tool is an Azure Function that periodically checks the expiry date of your certificates. If any certificate is due to expire, the Azure Function will initiate an ACME request for a new certificate.

Service Fabric requires two certificates: a cluster certificate and a client certificate. The cluster certificate allows the Service Fabric nodes to talk to each other. The client certificate allows us to expose APIs from services deployed to Service Fabric via Azure API Management.

Using a combination of the tooling mentioned, we have automated the renewal process for both certificates.

Tooling needed

The process

Initially, we need to ask the Acmebot to request the first instance of both the Service Fabric cluster and client certificates.

Using the ACME process, the Acmebot will get each certificate (client and cluster) and store it in an Azure Key Vault.

On a repeated schedule, the Acmebot will check the expiry date of all the certificates in the Key Vault. If it finds any certificates that are due to expire soon, it will automatically initiate another ACME certificate request.
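The scheduled expiry check can be pictured with a minimal Python sketch. This is not the Acmebot's actual implementation: the `StoredCertificate` type, the certificate names and the 30-day renewal window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StoredCertificate:
    """Hypothetical stand-in for a certificate held in the Key Vault."""
    name: str
    expires_on: datetime


def due_for_renewal(certs, now, renew_before=timedelta(days=30)):
    """Return the certificates that expire within the renewal window.

    The 30-day default is an assumed value, not a documented setting.
    """
    cutoff = now + renew_before
    return [cert for cert in certs if cert.expires_on <= cutoff]


# Example data: one certificate close to expiry, one with plenty of time left.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
certs = [
    StoredCertificate("sf-cluster", now + timedelta(days=10)),
    StoredCertificate("sf-client", now + timedelta(days=60)),
]

for cert in due_for_renewal(certs, now):
    print(f"{cert.name} is due for renewal")
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check easy to test against fixed dates.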

The Acmebot will store the new certificate in the Key Vault. The Key Vault will then emit an event notifying any interested resource that a new version of the certificate is available.
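To make the notification step concrete, here is a rough Python sketch of a consumer that filters a batch of events for new certificate versions. The event type and the `ObjectName` and `Version` field names follow the Event Grid schema for Key Vault as I understand it, so check them against the official documentation before relying on them. The vault name, certificate names and versions in the sample payload are made up.

```python
import json

# Event type assumed from the Event Grid schema for Key Vault.
NEW_VERSION_EVENT = "Microsoft.KeyVault.CertificateNewVersionCreated"


def new_certificate_versions(raw_events):
    """Extract (certificate name, version) pairs from a JSON batch of events."""
    events = json.loads(raw_events)
    return [
        (event["data"]["ObjectName"], event["data"]["Version"])
        for event in events
        if event.get("eventType") == NEW_VERSION_EVENT
    ]


# Illustrative payload only; real events carry more fields than shown here.
sample = json.dumps([
    {
        "eventType": NEW_VERSION_EVENT,
        "data": {"VaultName": "example-vault", "ObjectType": "Certificate",
                 "ObjectName": "sf-cluster", "Version": "v2"},
    },
    {
        "eventType": "Microsoft.KeyVault.SecretNewVersionCreated",
        "data": {"VaultName": "example-vault", "ObjectType": "Secret",
                 "ObjectName": "unrelated", "Version": "v1"},
    },
])

for name, version in new_certificate_versions(sample):
    print(f"new version {version} of certificate {name}")
```

Filtering on the event type first means unrelated vault activity, such as secret updates, never reaches the certificate handling code.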

With auto-renewal in place between the Acmebot and the Key Vault, it is now down to API Management and Service Fabric to watch for new versions of the certificate.

Both API Management and the Service Fabric cluster are configured to listen to the Key Vault events for new certificates. When a new certificate becomes available, both resources will download it and update the version they are using.

Both the API Management and Service Fabric resources are configured to identify their certificates by common name. This is important because it means the order in which the two resources download a new certificate does not matter, provided the certificates have not expired.
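The reason common names make the download order irrelevant can be shown with a simplified Python sketch. It is an illustration only: real validation also checks the issuer and the certificate chain, and the `PresentedCertificate` type, the common name and the thumbprints below are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PresentedCertificate:
    """Hypothetical, simplified view of a certificate presented by a peer."""
    common_name: str
    thumbprint: str
    expires_on: datetime


def is_trusted(cert, expected_common_name, now):
    """Accept any unexpired certificate carrying the expected common name.

    The thumbprint is deliberately ignored, so a renewed certificate is
    accepted without any configuration change on the receiving side.
    """
    return cert.common_name == expected_common_name and cert.expires_on > now


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
name = "cluster.example.com"  # assumed common name for illustration
old = PresentedCertificate(name, "AAAA", now + timedelta(days=5))
new = PresentedCertificate(name, "BBBB", now + timedelta(days=90))
expired = PresentedCertificate(name, "CCCC", now - timedelta(days=1))

# Old and new versions are both accepted during the rollover window.
print(is_trusted(old, name, now), is_trusted(new, name, now))
# An expired certificate is rejected even though its common name matches.
print(is_trusted(expired, name, now))
```

Had the check pinned a thumbprint instead, one side presenting the new certificate before the other had updated would fail until both were switched over.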

Conclusion

Now that this architecture has been implemented, the problems we described earlier are solved and the process is fully automated.

The tooling used gives us a lot of useful features out of the box. Key Vault provides the ideal place to store certificates, and the combination of the Azure Function and Let's Encrypt handles the complexity of certificate renewal.

Even with all these benefits, I believe engineers still need a level of awareness. At a minimum, the process needs to be clearly documented in a knowledge base.

Additionally, it's important to give engineers visibility of all types of events when they occur, good and bad. Examples include a certificate being renewed or the Azure Function failing. Keep in mind that the response to each of these events should be part of a runbook.

As ClearBank grows, so does the requirement for more certificates. With this process in place, we can scale to more certificates much more quickly while removing unnecessary load on the engineers.

Happy Renewing!

Michael Rodda

Senior DevOps Engineer, ClearBank