Unwrap the SERVFAIL


We recently released a new version of Cloudflare Resolver which adds a piece of information called “Extended DNS Errors” (EDE) along with the response code under certain circumstances. This will be helpful in tracing DNS resolution errors and figuring out what went wrong behind the scenes.

(image from: https://www.pxfuel.com/en/free-photo-expka)

A tight-lipped agent

The DNS protocol was designed to map domain names to IP addresses. To inform the client about the result of the lookup, the protocol has a 4 bit field, called response code/RCODE. The logic to serve a response might look something like this:

function lookup(domain) {
    ...
    switch result {
    case "No error condition":
        return NOERROR with client expected answer
    case "No record for the request type":
        return NOERROR
    case "The request domain does not exist":
        return NXDOMAIN
    case "Refuse to perform the specified operation for policy reasons":
        return REFUSE
    default("Server failure: unable to process this query due to a problem with the name server"):
        return SERVFAIL
    }
}

try {
    lookup(domain)
} catch {
    return SERVFAIL
}

Although the context hasn’t changed much, protocol extensions such as DNSSEC have been added, which makes the RCODE run out of space to express the server’s internal status. To keep backward compatibility, DNS servers have to squeeze various statuses into existing ones. This behavior could confuse the client, especially with the “catch-all” SERVFAIL: something went wrong but what exactly?

Most often, end users don’t talk to authoritative name servers directly, but use a stub and/or a recursive resolver as an agent to acquire the information it needs. When a user receives  SERVFAIL, the failure can be one of the following:

  • The stub resolver fails to send the request.
  • The stub resolver doesn’t get a response.
  • The recursive resolver, which the stub resolver sends its query to, is overloaded.
  • The recursive resolver is unable to communicate with upstream authoritative servers.
  • The recursive resolver fails to verify the DNSSEC chain.
  • The authoritative server takes too long to respond.

In such cases, it is nearly impossible for the user to know exactly what’s wrong. The resolver is usually the one to be blamed, because, as an agent, it fails to get back the answer, and doesn’t return a clear reason for the failure in the response.

Keep backward compatibility

It seems we need to return more information, but (there’s always a but) we also need to keep the behavior of existing clients unchanged.

One way is to extend the RCODE space, which came out with the Extension mechanisms for DNS or EDNS. It defines a 8 bit EXTENDED-RCODE, as high-order bits to current 4 bit RCODE. Together they make up a 12 bit integer. This changes the processing of RCODE, requires both client and server to fully support the logic unfortunately.

Another approach is to provide out-of-band data without touching the current RCODE. This is how Extended DNS Errors is defined. It introduces a new option to EDNS, containing an INFO-CODE to describe error details with an EXTRA-TEXT as an optional supplement. The option can be repeated as many times as needed, so it’s possible for the client to get a full error chain with detailed messages. The INFO-CODE is just something like RCODE, but is 16 bits wide, while the EXTRA-TEXT is an utf-8 encoded string. For example, let’s say a client sends a request to a resolver, and the requested domain has two name servers. The client may receive a SERVFAIL response with an OPT record (see below) which contains two extended errors, one from one of the authoritative servers that shows it’s not ready to serve, and the other from the resolver, showing it cannot connect to the other name server.

;; OPT PSEUDOSECTION:
; ...
; EDE: 14 (Not Ready)
; EDE: 23 (Network Error): (cannot reach upstream 192.0.2.1)
; ...

Google has something similar in their DoH JSON API, which provides diagnostic information in the “Comment” field.

Let’s dig into it

Our 1.1.1.1 service has an initial support of the draft version of Extended DNS Errors, while we are still trying to find the best practice. As we mentioned above, this is not a breaking change, and existing clients will not be affected. The additional options can be safely ignored without any problem, since the RCODE stays the same.

If you have a newer version of dig, you can simply check it out with a known problematic domain. As you can see, due to DNSSEC verification failing, the RCODE is still SERVFAIL, but the extended error shows the failure is “DNSSEC Bogus”.

$ dig @1.1.1.1 dnssec-failed.org

; <<>> DiG 9.16.4-Debian <<>> @1.1.1.1 dnssec-failed.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1111
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 6 (DNSSEC Bogus)
;; QUESTION SECTION:
;dnssec-failed.org.		IN	A

;; Query time: 111 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Wed Sep 01 00:00:00 PDT 2020
;; MSG SIZE  rcvd: 52

Note that Extended DNS Error relies on EDNS. So to be able to get one, the client needs to support EDNS, and needs to enable it in the request. At the time of writing this blog post, we see about 17% of queries that 1.1.1.1 received had EDNS enabled within a short time range. We hope this information will help you uncover the root cause of a SERVFAIL in the future.



Source link