pts2024

How To Revoke And Replace 400 Million Certificates Without Breaking The Internet
2024-07-04, 11:45–12:20 (Europe/Paris), Amphitheater

In a delegated-trust environment like the WebPKI, revocation of trust in certificates and keys that are compromised is a critical aspect of security. But for many years, security experts have rightly been saying that revocation is broken: Certificate Revocation Lists don’t scale; the Online Certificate Status Protocol fails open, is expensive to run, and is a privacy risk; and mass-revocations can effectively take huge swathes of the internet offline. This talk will provide technical details behind three techniques that the tiny team at Let’s Encrypt is using to solve these problems at scale.


Let’s Encrypt is the world’s leading free and open (-source!) Certificate Authority. In the 10 years since the organization was founded, and the 5 years since the ACME (Automated Certificate Management Environment) protocol was standardized, we have issued over 4 billion TLS certificates. Every day we issue 4 million more; at any given time, our certificates cover 400 million unique domain names and account for 48% of all publicly trusted TLS certificates.

So what happens when something goes wrong, and all of those certificates need to be revoked?

Naively:

  • your CRLs will become so large that your HSM will refuse to sign them and clients will fail to download them;
  • your OCSP service will saturate your HSM’s signature capacity, potentially causing another incident as you fail to revoke all the certificates within the required timeframe;
  • millions of websites will display opaque and confusing TLS errors, training browser users to blindly click through important security warnings; and
  • the resulting spike in issuance as everyone tries to get a new un-revoked certificate will take down your entire infrastructure.

Not fun!

Let’s Encrypt has been in this situation. And from our past experience, we’ve implemented three techniques to mitigate the worst of these effects:

  • Sharding our CRLs, and providing these smaller shards to browser-based CRL aggregation systems like CRLite, CRLSets, and Valid;
  • Live-signing OCSP responses only when they are actually asked for, and caching those responses in a lightweight in-memory lookaside cache; and
  • ACME Renewal Information, an extension to the ACME protocol which allows the server to suggest to clients when they should renew their certificates.
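
To make the first technique concrete, here is a minimal Python sketch of CRL sharding. The abstract does not say how Let’s Encrypt assigns certificates to shards, so the hash-based assignment, the shard count, and the names `shard_for` and `build_shards` are all assumptions made for illustration. The property that matters is that a given certificate always lands in the same shard, so each shard stays small and can be signed and fetched independently.

```python
import hashlib
from collections import defaultdict

# Hypothetical shard count, chosen for illustration only.
NUM_SHARDS = 128


def shard_for(serial_hex: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a certificate serial (hex string) to a stable shard index.

    Assumption: shards are chosen by hashing the serial. A real CA may
    use a different rule, as long as the mapping never changes for an
    already-issued certificate.
    """
    digest = hashlib.sha256(bytes.fromhex(serial_hex)).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(revoked_serials, num_shards: int = NUM_SHARDS):
    """Group revoked serials into per-shard lists.

    Each list stands in for the entries of one small CRL that would be
    signed on its own, instead of one enormous CRL.
    """
    shards = defaultdict(list)
    for serial in revoked_serials:
        shards[shard_for(serial, num_shards)].append(serial)
    return {index: sorted(serials) for index, serials in shards.items()}
```

With this shape, a mass revocation grows every shard a little rather than growing one list without bound, which is what keeps each signing operation and each download manageable.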

In this talk I’ll cover why these problems are important to solve, the technical details of both the failure modes and our solutions, and how a team of just 13 engineers is able to keep all of this running.
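
The on-demand OCSP technique listed above follows the general lookaside-cache pattern, sketched here in Python. This is not Let’s Encrypt’s implementation: the class name `LookasideOCSPCache`, the `sign` callback (standing in for an HSM-backed signer), and the single fixed time-to-live are assumptions for illustration. The point is that the signer is only invoked for serials that someone actually asks about, and repeated requests are served from memory.

```python
import time
from typing import Callable, Dict, Tuple


class LookasideOCSPCache:
    """Sign OCSP responses only on request and cache them in memory."""

    def __init__(
        self,
        sign: Callable[[str], bytes],  # hypothetical signer, e.g. backed by an HSM
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sign = sign
        self._ttl = ttl_seconds
        self._clock = clock
        # serial -> (expiry time, signed response bytes)
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, serial: str) -> bytes:
        """Return a cached response if it is still fresh, else sign a new one."""
        now = self._clock()
        entry = self._entries.get(serial)
        if entry is not None and entry[0] > now:
            return entry[1]
        response = self._sign(serial)
        self._entries[serial] = (now + self._ttl, response)
        return response

    def invalidate(self, serial: str) -> None:
        """Drop a cached response, e.g. right after the certificate is revoked."""
        self._entries.pop(serial, None)
```

Under this design, signing load tracks the number of distinct certificates being queried within one TTL rather than the total number of certificates issued. A production version would also need bounded memory and eviction, which this sketch omits.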

See also: Slides

Aaron is the technical lead of the Let's Encrypt software development team, which builds the CA's validation and issuance software. His work both with ISRG and previously with the Chromium Project is focused on making the web a better place through open source initiatives.