At Open Systems we manage more than 3000 hosts around the world, acting as proxies, mail gateways, VPN devices, and so on. These hosts need to communicate with various central services, such as update servers or monitoring systems.

One challenge we faced is selecting the best of these central servers for each host. Take ClamAV anti-virus updates as an example: we want the software doing the updates to choose one of our caching proxies that is online and as close as possible to where the software is running.

We want to have "optimal server selection" for all software that we use on our devices, but not all software has built-in support for that. A general solution that always works is to implement the server selection by influencing the name resolution of the server. If ClamAV wants to fetch updates via an update server called updates.open.ch, we only need to make sure that the returned IP address is good, and we don't need any failover mechanisms in ClamAV itself.

One way to implement that is to use the built-in failover mechanisms of DNS.

DNS-based Failover

Figure: DNS-based failover for updates

The DNS standard, as defined by RFC 1034 and RFC 1035, mandates that when a DNS zone is resolved, all defined "NS" servers are tried. If one or more of the NS servers are unavailable, the others should be used instead. Also, it is recommended that the selection of the best server is made based on response time.

The DNS-based failover mechanism described here builds on that behavior. We define a DNS zone, called updates.open.ch in this article, that is served by two NS servers: updates1.open.ch and updates2.open.ch.

If one of the DNS servers, updates1 or updates2, becomes unavailable, then DNS queries will still continue to work thanks to the NS failover behavior implemented in DNS clients.

The trick that we used is to also operate the update server on those same servers, and to configure the DNS zones on both hosts differently, so that they always return their own IP address when queried for the name updates.open.ch (A or AAAA records).
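To make the trick concrete, here is a minimal simulation of it, not our actual setup. The NameServer class, the resolve function and the IP addresses (taken from the documentation ranges) are all hypothetical, and the resolver is simplified to try the servers in list order, whereas a real resolver would typically prefer the one that responds fastest.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NameServer:
    """Hypothetical stand-in for one NS host that also runs the update service."""
    name: str
    own_ip: str
    online: bool = True

    def answer(self, qname: str) -> Optional[str]:
        """Answer a query for the shared name with this host's own address."""
        if not self.online:
            return None  # models a query that times out
        if qname == "updates.open.ch":
            return self.own_ip
        return None  # this toy server knows no other names


def resolve(qname: str, nameservers: List[NameServer]) -> Optional[str]:
    """Simplified resolver: return the first answer any NS server gives."""
    for ns in nameservers:
        ip = ns.answer(qname)
        if ip is not None:
            return ip
    return None  # no NS server answered
```

Because the server that answers the query is also the server whose address is returned, a client can only ever receive the address of a host that was reachable a moment ago.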

Et voilà, a very cheap and effective failover mechanism, thanks to the built-in mechanisms of the DNS protocol.

This worked well for us, but we still faced some issues:

  • We couldn't easily configure a managed host to use a different set of IP addresses, without using a different name.
  • The mechanism requires access to a functioning DNS. This is not always the case, especially for certain internal firewalls, backend services, IDS systems, etc.
  • The DNS resolver mostly chooses the nearest server, but we couldn't make sure that this was always the case.

That's why we implemented a different approach: ClientHA.

ClientHA

Figure: ClientHA-based server selection

If you use Linux, you might know the nsswitch.conf file: it defines, among other things, in what databases to look when resolving hostnames, and is typically configured as follows:

hosts:  files dns

With glibc on Linux, this works under the hood through "NSS modules", which are plugins that take care of the name resolution. There is an nss_files module that looks up hostnames in /etc/hosts, and an nss_dns module that uses DNS. nsswitch.conf describes which modules should be used, and in which order.
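As a small illustration of the ordering, the sketch below extracts the module list from the hosts line of an nsswitch.conf. The function name hosts_modules is made up for this example, and the parsing is deliberately simplified: it drops comments and bracketed action items such as [NOTFOUND=return] and keeps only the service names, which glibc maps to shared libraries named libnss_<name>.so.2.

```python
import re
from typing import List


def hosts_modules(nsswitch_text: str) -> List[str]:
    """Return the NSS service names on the 'hosts:' line, in lookup order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line.startswith("hosts:"):
            continue
        spec = line[len("hosts:"):]
        # Remove action items like [NOTFOUND=return]; only names remain.
        spec = re.sub(r"\[[^\]]*\]", " ", spec)
        return spec.split()
    return []  # no hosts line present
```

For the typical configuration shown above, this returns the list ['files', 'dns'].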

ClientHA is also an NSS module and the nsswitch.conf on our managed hosts looks as follows:

hosts:  files clientha dns

ClientHA only knows how to resolve a few names such as updates.open.ch, and just ignores any other name resolution request. The beauty of it is that we can implement whatever behavior we want to resolve those names. We implement failover and server selection exactly as we want it to be.
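The fall-through behavior can be sketched as a chain of lookup functions. This is a simulation in Python, not the real module (actual NSS modules are shared libraries loaded by glibc); all names, tables and addresses below are hypothetical, and the selection step is a placeholder that simply takes the first candidate.

```python
from typing import Callable, Dict, List, Optional

Lookup = Callable[[str], Optional[str]]

# Hypothetical data standing in for /etc/hosts and the ClientHA name table.
HOSTS_FILE: Dict[str, str] = {"localhost": "127.0.0.1"}
CLIENTHA_NAMES: Dict[str, List[str]] = {
    "updates.open.ch": ["192.0.2.10", "198.51.100.10"],
}


def files_lookup(name: str) -> Optional[str]:
    """Stand-in for nss_files: look the name up in the hosts table."""
    return HOSTS_FILE.get(name)


def clientha_lookup(name: str) -> Optional[str]:
    """Resolve only the names we manage; ignore everything else."""
    candidates = CLIENTHA_NAMES.get(name)
    if candidates is None:
        return None  # not one of ours: let the next module try
    return candidates[0]  # placeholder for the real selection policy


def dns_lookup(name: str) -> Optional[str]:
    """Stand-in for nss_dns: pretend DNS resolves every name."""
    return "203.0.113.1"


def resolve(name: str, chain: List[Lookup]) -> Optional[str]:
    """Ask each module in order and return the first answer."""
    for lookup in chain:
        address = lookup(name)
        if address is not None:
            return address
    return None


# Same order as "hosts: files clientha dns".
CHAIN: List[Lookup] = [files_lookup, clientha_lookup, dns_lookup]
```

With this chain, resolve("updates.open.ch", CHAIN) is answered by the ClientHA stand-in, while any name it does not know falls through to the DNS stand-in.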

Currently, we have implemented a server selection based on Pingmachine, which pings all servers regularly and chooses the nearest available one.
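The selection rule itself is simple. The sketch below assumes the ping results are already available as a mapping from server address to the latest round-trip time in milliseconds, with None for a server that did not answer. The function name pick_nearest and this data format are assumptions made for illustration; they are not Pingmachine's actual interface.

```python
from typing import Mapping, Optional


def pick_nearest(rtt_ms: Mapping[str, Optional[float]]) -> Optional[str]:
    """Return the reachable server with the lowest round-trip time."""
    # Keep only servers that answered the last ping.
    reachable = {ip: rtt for ip, rtt in rtt_ms.items() if rtt is not None}
    if not reachable:
        return None  # nothing is available right now
    return min(reachable, key=lambda ip: reachable[ip])
```

A server that stops answering drops out of the candidate set on the next measurement, so failover and proximity are handled by the same rule.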

This solution is simple and generic, but much more flexible than the DNS-based approach. We also keep the main failover logic on the client side, which ensures that it is based on real availability measurements as seen from the host that needs to use the service.

Maybe it could be useful for you too! Have a look; it's open-source: