At Open Systems we manage more than 3000 hosts acting as proxies, mail gateways, VPN devices, etc., all around the world. These hosts need to communicate with various central services, such as update servers, or monitoring systems.
A challenge we faced is the optimal selection of these central servers. As an example, take ClamAV anti-virus updates: we want the updating software to choose one of our caching proxies that is online and as close as possible to where it is running.
We want to have "optimal server selection" for all software that we use on our
devices, but not all software has built-in support for that. A general
solution that always works is to implement the server selection by
influencing the name resolution of the server. If ClamAV wants to fetch updates via an update server called updates.open.ch, we only need to make sure that the returned IP address is a good one, and we don't need any failover mechanisms in the software itself.
One way to implement that is to use the built-in failover mechanisms of DNS.
The DNS standard, as defined by RFC 1034 and RFC 1035, allows a zone to be served by multiple "NS" servers, any of which may be queried when the zone is resolved. If one or more of the NS servers are unavailable, the others are used instead; it is also recommended that resolvers prefer the server with the best response time.
The DNS-based failover mechanism described here uses that fact: we define a DNS zone, called updates.open.ch in this article, which is served by two NS servers:
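For illustration, the delegation could look like this (the addresses below are placeholders from the documentation ranges, not our real ones):

```
updates.open.ch.     IN NS  updates1.open.ch.
updates.open.ch.     IN NS  updates2.open.ch.
updates1.open.ch.    IN A   192.0.2.1
updates2.open.ch.    IN A   198.51.100.1
```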
If one of the DNS servers, updates1 or updates2, becomes unavailable, DNS queries will continue to work thanks to the NS failover behavior implemented in DNS clients.
The trick we used is to also operate the update service on those same servers, and to configure the DNS zone on the two hosts differently, so that each always returns its own IP address when queried for the name updates.open.ch (A or AAAA records).
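As a sketch, the zone data on the two hosts might differ only in the A record each one returns (placeholder addresses again):

```
; On updates1:
updates.open.ch.   IN A   192.0.2.1      ; updates1's own address

; On updates2:
updates.open.ch.   IN A   198.51.100.1   ; updates2's own address
```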
Et voilà, a very cheap and effective failover mechanism, thanks to the built-in mechanisms of the DNS protocol.
This worked well for us, but we still faced some issues:
- We couldn't easily make a managed host use a different set of IP addresses without also using a different name.
- The mechanism requires access to a working DNS resolver, which is not always available, especially on certain internal firewalls, backend services, IDS systems, etc.
- The DNS resolver usually chooses the nearest server, but we couldn't guarantee that this was always the case.
That's why we implemented a different approach: ClientHA
If you use Linux, you might know the nsswitch.conf file: it defines, among other things, which databases to consult when resolving hostnames, and is typically configured as follows:
hosts: files dns
With the Linux glibc, this works under the hood via "NSS modules", plugins that take care of the name resolution. There is an nss_files module that looks up hostnames in /etc/hosts, and an nss_dns module that uses DNS. nsswitch.conf describes which modules should be used, and in which order.
ClientHA is also an NSS module and the
nsswitch.conf on our managed hosts
looks as follows:
hosts: files clientha dns
ClientHA only knows how to resolve a few names, such as updates.open.ch, and simply ignores any other name resolution request. The beauty of it is that we can implement whatever behavior we want for those names: failover and server selection exactly as we want them.
Currently, we have implemented a server selection based on Pingmachine, which pings all servers regularly and chooses the nearest available one.
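The selection step itself boils down to "nearest available wins". A small sketch under assumed data structures (Pingmachine's real interface differs):

```c
/* Pick the server with the lowest measured round-trip time, skipping
 * unreachable ones. The struct layout here is hypothetical. */
#include <stddef.h>

struct server {
    const char *ip;
    double rtt_ms;      /* latest RTT in milliseconds; < 0 means offline */
};

const char *pick_nearest(const struct server *s, size_t n)
{
    const char *best = NULL;
    double best_rtt = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i].rtt_ms < 0)
            continue;                       /* skip unavailable servers */
        if (best == NULL || s[i].rtt_ms < best_rtt) {
            best = s[i].ip;
            best_rtt = s[i].rtt_ms;
        }
    }
    return best;                            /* NULL if none is reachable */
}
```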
This solution is simple and generic, yet much more flexible than the DNS-based approach. Also, the main failover logic stays on the client side, ensuring that it is based on real availability measurements as seen from the host that needs to use the service.
Maybe it could be useful for you too! Have a look; it's open-source: