
Why I migrated from Traefik to Caddy

First, let's define what Traefik is. Traefik is an open-source reverse proxy and load balancer for HTTP and TCP-based applications. It generates SSL certificates for you on the fly (based on a configuration defined in a static file, or dynamically using Docker networks and labels). The main advantage of this solution is that it is turnkey: the application was designed specifically to work with Docker, so it can detect containers on the network, read their labels and automatically route traffic to the right container (acting as a load balancer).

After a few weeks of use and a large number of managed sites (150+), Traefik proved to be quite poor at managing TLS certificates. When backed by a KV store (such as Consul), Traefik keeps all the certificates in one large JSON blob, gzipped and stored under a single key. A big disappointment for me.
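
You can see the problem for yourself by measuring the size of that key. A quick check, assuming Traefik 1.x and its default traefik KV prefix (the exact key may vary with your configuration):

# Size, in bytes, of the single value holding all the ACME data
consul kv get traefik/acme/account/object | wc -c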

Consul lets you share configuration and certificates between several servers that belong to the same Swarm cluster.

This problem may seem benign, but it is not. With a very large number of certificates, you very quickly run into a Consul limitation: a maximum of 512 KB per value. The only way I found to work around it was to compile a custom version of Consul that significantly raises this limit (at the risk of losing performance), using the following patch:

--- kvs_endpoint.go 2018-11-23 16:09:26.771017520 +0100
+++ kvs_endpoint.go.t   2018-11-23 16:10:10.462064157 +0100
@@ -16,7 +16,7 @@
    // maxKVSize is used to limit the maximum payload length
    // of a KV entry. If it exceeds this amount, the client is
    // likely abusing the KV store.
-   maxKVSize = 512 * 1024
+   maxKVSize = 5120 * 1024
 )

 func (s *HTTPServer) KVSEndpoint(resp http.ResponseWriter, req *http.Request) (interface{}, error) {
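
For reference, rebuilding Consul with the patch applied looks roughly like this (a sketch; maxkvsize.patch is the diff above saved to a file, and the Makefile target may differ between Consul releases):

git clone https://github.com/hashicorp/consul.git
cd consul
git apply maxkvsize.patch   # apply the diff shown above
make dev                    # build a patched development binary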

With this problem solved, several months went by without a hitch: certificates were correctly generated, stored and served. After this period of serenity, Traefik suddenly stopped renewing certificates for some sites (using the HTTP-01 challenge). I looked into where this bug could have come from, and came across several related issues on the Traefik repository.

Since I rely heavily on the HA side of things, I cannot do without the Swarm currently in place, and the certificates must keep being renewed. To date, I have not found any way to avoid the synchronization error of the KV store (Consul, etcd...). Containous has stated explicitly that HA will only be officially supported in the commercial version of Traefik. As a result, I found myself at a dead end: I had trusted a solution that took several days to implement, and it no longer meets my needs.

Looking for alternatives

After several hours of research and a bit of Reddit, several candidate solutions emerged.

However, the solution also had to meet several criteria:

  • Easy to configure;
  • Service discovery in the Swarm;
  • Automatic SSL certificate generation;
  • Mesh routing support (optional).

Several of the candidates seemed very complex to configure. Caddy, for its part, only partially met these criteria: it was not developed with Docker in mind, and it did not seem like a good candidate to me. Then, seeing this solution come up again and again, I asked myself: why do you hear so much about Caddy? Now I know.

Caddy to the rescue!

Caddy is written in Go and designed around plugins, so it is fully extensible. First, since version 0.11, Caddy can use Consul, through a plugin, to store the TLS certificates it generates. This is already a very good point, since it makes it possible to share certificates between several servers. Second, Caddy also has a plugin that listens to the Swarm and automatically generates an in-memory configuration based on the existing services/containers. And finally, Caddy's configuration is extremely simple, and it can handle DNS-01 and HTTP-01 challenges in parallel (unlike Traefik).
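
To give an idea of that simplicity, here is what a hand-written Caddyfile (v1 syntax) for a single proxied site looks like; example.com and the whoami:80 upstream are placeholders:

example.com {
    # Reverse proxy all requests to the container, preserving the original host headers
    proxy / whoami:80 {
        transparent
    }
}

TLS is automatic: with this configuration alone, Caddy obtains and renews the certificate for example.com by itself.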

To bundle Caddy and its plugins, I decided to build a custom Docker image, which actually contains only one file (excluding the CI part):

package main

import (
    "github.com/caddyserver/caddy/caddy/caddy/caddymain"

    // List of plugins (blank imports register them at init time)
    _ "github.com/lucaslorentz/caddy-docker-proxy/plugin"
    _ "github.com/pteich/caddy-tlsconsul"
)

func main() {
    caddymain.Run()
}
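
For completeness, here is a minimal sketch of the Dockerfile that builds this image (the Go and Alpine versions are assumptions; pin whatever matches your setup):

# Build stage: compile Caddy with the two plugins baked in
FROM golang:1.13 AS build
WORKDIR /src
COPY main.go .
# Pre-1.16 Go resolves and downloads the imports automatically at build time;
# CGO is disabled so the binary runs on Alpine
RUN go mod init custom-caddy && CGO_ENABLED=0 go build -o /usr/bin/caddy .

# Runtime stage: a small image with just the binary and CA certificates
FROM alpine:3.10
RUN apk add --no-cache ca-certificates
COPY --from=build /usr/bin/caddy /usr/bin/caddy
ENTRYPOINT ["/usr/bin/caddy"]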

Time for migration

To begin with, I migrated only one server to Caddy (the one hosting the most sites, obviously, to properly test the resilience of the solution). Here is what a typical service looked like with Traefik's labels:

version: "3.4"
services:
    custom-service:
        image: containous/whoami
        networks:
            - routable
        deploy:
            labels:
                traefik.port: "80"
                traefik.docker.network: routable
                traefik.frontend.rule: "Host:example.com"
                traefik.frontend.entryPoints: http,https

networks:
    routable:
        external: true

And now the same service, using Caddy's labels:

version: "3.4"
services:
    custom-service:
        image: containous/whoami
        networks:
            - routable
        deploy:
            labels:
                caddy.address: https://example.com
                caddy.targetport: "80"

networks:
    routable:
        external: true
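
Under the hood, caddy-docker-proxy turns these labels into an in-memory Caddyfile. The generated configuration is roughly equivalent to the following (in reality the upstream is the resolved task IP rather than the service name):

https://example.com {
    proxy / custom-service:80
}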

Nothing extraordinary here, except that Caddy works, renews all certificates correctly, and is fully customizable. And the icing on the cake: the TLS Consul plugin used with Caddy stores each SSL certificate under its own KV entry. Awesome.
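
You can verify this by listing the keys in Consul (assuming the plugin's default caddytls prefix); each site shows up as its own entry:

consul kv get -keys caddytls/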

The routing mesh (optional)

The technique here is the same for Traefik and Caddy. The purpose of mesh routing is to be able to point any IP at any server in the Swarm and still get the response from the right container. Personally, I'm not a fan of assigning IPs only to the manager nodes. So here's the technique I use:

version: "3.4"
services:
    consul:
        image: consul:latest
        command: agent -server -bootstrap-expect=1
        networks:
            - consul
        volumes:
            - "consul-data:/consul/data"
        deploy:
            mode: replicated
            replicas: 1
        environment:
            - CONSUL_LOCAL_CONFIG={"datacenter":"us_east2","server":true}
            - CONSUL_BIND_INTERFACE=eth0
            - CONSUL_CLIENT_INTERFACE=eth0

    docker-proxy:
        image: rancher/socat-docker
        networks:
            - caddy
        volumes:
            - /var/run/docker.sock:/var/run/docker.sock
        deploy:
            mode: replicated
            replicas: 1

    caddy:
        image: <custom-caddy-image>
        command: -email <redacted> -agree=true -log stdout -proxy-service-tasks=true -docker-validate-network=false
        networks:
            - routable
            - caddy
            - consul
        ports:
            - target: 80
              published: 80
              mode: host
            - target: 443
              published: 443
              mode: host
        deploy:
            mode: global
            update_config:
                parallelism: 10
                delay: 10s
            restart_policy:
                condition: on-failure
        environment:
            DOCKER_HOST: tcp://docker-proxy:2375
            CONSUL_HTTP_ADDR: consul:8500

volumes:
    consul-data:

networks:
    routable:
        external: true

    consul:
        driver: overlay

    caddy:
        driver: overlay

With this configuration, Caddy no longer runs only on the manager nodes, but on every node available in the Swarm. Since a given certificate is only ever generated by one instance of the application, running multiple Caddy instances poses no problem. IPs can now be pointed at any server in the Swarm, and Caddy will forward each request to the right container (even if it is not on the same server).
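
Deploying (or updating) the whole thing then comes down to a single command (caddy.yml is the compose file above; the stack name caddy is arbitrary):

docker stack deploy --compose-file caddy.yml caddy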

I now have, with Caddy, the equivalent of my old Traefik setup, and it fits all my needs.