Introduction

I had been troubleshooting a fairly tricky DNS resolution issue and wanted to share what I learned with all of you. Most of the DNS lookup issues I’ve encountered in my career have been the result of a misconfigured zone or record, or a server outage. I cannot recall a time with a Windows DNS server where I’ve had to adjust a default setting, but that’s exactly what I’ve had to do here.

In this article, I will walk you through the following areas.

  • What our DNS resolution path looked like before
  • The problem I ran into
  • Setting up the diagnostics to capture the problem
  • The research I did to understand what was going on
  • How I solved it
  • General recommendations

DNS Setup

Our server setup looks pretty straightforward. Keep in mind that the purpose of this Windows DNS server was to provide DNS resolution for domain-joined VMs. Until very recently, Azure had no native means to forward DNS requests to a domain controller. This forced you to either statically set DNS servers on your Azure VM ipConfiguration, or update the whole virtual network to forward DNS requests to a Windows DNS server. As of this writing, there is a new preview feature called Azure DNS Private Resolver which looks to remove the need for this setup. It’s something I’ll be exploring once it’s out of preview. The diagram below shows the resolution path, followed by a rough sketch of how it’s configured.


graph LR
    DNSClient((DNS Client))

    subgraph Windows DNS Server
        dnsServerIp[DNS Server IP]

        subgraph DNS Forwarders
            azureDNS[Azure DNS]
            end

        subgraph Conditional Forwarder Active Directory Domain
            DC1[Domain Controller 1 IP]
            DC2[Domain Controller 2 IP]
            end   
        end

    externalDNSServer[External DNS Server / Azure]
    adServers[Active Directory DNS Lookup]

    DNSClient --Lookup-->dnsServerIp
    dnsServerIp --Unknown Lookup-->azureDNS--Lookup external name server -->externalDNSServer
    dnsServerIp --AD Lookup-->DC1 & DC2 --Lookup Domain DNS--> adServers
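If you wanted to stand up a similar forwarding layout, the DnsServer PowerShell module sketch below is roughly what it takes. The domain name and domain controller IPs are placeholders for this example, not our actual values.

# Forward anything this server isn't authoritative for to Azure DNS.
Add-DnsServerForwarder -IPAddress 168.63.129.16

# Conditionally forward the Active Directory domain to its domain controllers.
# 'corp.contoso.com', 10.0.0.4, and 10.0.0.5 are placeholder values.
Add-DnsServerConditionalForwarderZone -Name 'corp.contoso.com' -MasterServers 10.0.0.4, 10.0.0.5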

Discovering and understanding the problem

There were two primary ways we discovered this issue.

  1. Azure Insights informed us that our apps were throwing an exception.
  2. Our proactive web health check would also inform us that one of our sites was intermittently offline.

An example of the errors we would see in both tools is below.

The remote name could not be resolved: 'record.domain.com' 
The remote name could not be resolved: 'record2.domain.com'

While the error was pretty self-explanatory, we were never able to reproduce it. Typically, the errors were just blips, lasting a few seconds before everything returned to healthy. This made the issue incredibly difficult to troubleshoot.

Diagnostics

Fortunately, we were using Windows DNS. One of the good things about running your own infrastructure is that you have the ability to see what’s going on. I knew Windows DNS had some pretty verbose logging capabilities, so that’s what I enabled. You can see the diagnostic settings I enabled via PowerShell below; I have almost every option enabled. After the output, there’s a sketch of how to turn on a similar configuration.

Get-DnsServerDiagnostics

SaveLogsToPersistentStorage          : False
Queries                              : True
Answers                              : True
Notifications                        : True
Update                               : True
QuestionTransactions                 : True
UnmatchedResponse                    : True
SendPackets                          : True
ReceivePackets                       : True
TcpPackets                           : True
UdpPackets                           : True
FullPackets                          : True
FilterIPAddressList                  : 
EventLogLevel                        : 7
UseSystemEventLog                    : False
EnableLoggingToFile                  : True
EnableLogFileRollover                : False
LogFilePath                          : C:\DNS\debug.txt
MaxMBFileSize                        : 500000000
WriteThrough                         : False
EnableLoggingForLocalLookupEvent     : False
EnableLoggingForPluginDllEvent       : False
EnableLoggingForRecursiveLookupEvent : True
EnableLoggingForRemoteServerEvent    : True
EnableLoggingForServerStartStopEvent : False
EnableLoggingForTombstoneEvent       : False
EnableLoggingForZoneDataWriteEvent   : False
EnableLoggingForZoneLoadingEvent     : False
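If you want to turn on comparable logging yourself, something like the Set-DnsServerDiagnostics call below should get you close. This is a sketch based on the settings shown above; adjust the log path and file size to whatever makes sense for your environment.

# Enable verbose query/answer and packet logging, plus recursive and remote
# server events, and write everything to a debug log on disk.
Set-DnsServerDiagnostics -Queries $true -Answers $true -Notifications $true -Update $true `
    -QuestionTransactions $true -UnmatchedResponse $true `
    -SendPackets $true -ReceivePackets $true -TcpPackets $true -UdpPackets $true -FullPackets $true `
    -EnableLoggingForRecursiveLookupEvent $true -EnableLoggingForRemoteServerEvent $true `
    -EnableLoggingToFile $true -LogFilePath 'C:\DNS\debug.txt' -MaxMBFileSize 500000000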

Diagnostic Results

Once I saw the error occur again, I gleefully started combing the logs. I was really hopeful it would be something simple and straightforward. Alas, the logs did not paint a picture that was easy to understand. Looking back with the information I’ll share further down, all the answers were there; I just didn’t understand them at the time.

9/13/2022 5:00:01 PM 02EC PACKET  0000015D4AE3B560 UDP Rcv redactedIpAddress        2d20   Q [0001   D   NOERROR] A      (5)record(16)domain(3)com(0)
9/13/2022 5:00:01 PM 02EC PACKET  0000015D4B368D80 UDP Snd 168.63.129.16            bd32   Q [0001   D   NOERROR] A      (5)record(16)domain(3)com(0)
9/13/2022 5:00:01 PM 02EC PACKET  0000015D4A022070 UDP Rcv redactedIpAddress        2d20   Q [0001   D   NOERROR] A      (5)record(16)domain(3)com(0)
9/13/2022 5:00:02 PM 02EC PACKET  0000015D4CFA9570 UDP Rcv redactedIpAddress        2d20   Q [0001   D   NOERROR] A      (5)record(16)domain(3)com(0)
9/13/2022 5:00:04 PM 02EC PACKET  0000015D4B0D7990 UDP Rcv redactedIpAddress        2d20   Q [0001   D   NOERROR] A      (5)record(16)domain(3)com(0)
9/13/2022 5:00:05 PM 179C PACKET  0000015D4AE3B560 UDP Snd redactedIpAddress        2d20 R Q [8281   DR SERVFAIL] A      (5)record(16)domain(3)com(0)
9/13/2022 5:00:05 PM 02EC PACKET  0000015D4CF40C80 UDP Rcv 168.63.129.16            bd32 R Q [8081   DR  NOERROR] A      (5)record(16)domain(3)com(0)

Searching “DNS SERVFAIL” and many other combinations of terms didn’t provide a ton of helpful information. Most of the results simply stated that the authoritative DNS server was having issues. While it would have been easy to say “must be the authoritative server having issues”, I had no proof of that per se. Part of the problem is that the DNS request is forwarded to Azure, which leaves me blind for part of the resolution path.

Over the course of a week or so, I thought maybe it was related to Azure, so I started playing around with different forwarders. First I pointed the forwarder at Google and Cloudflare. The same issue occurred, so it had to be the authoritative server, right? We were not even dealing with recursion yet (this particular record would ultimately need to be recursed); it was failing on the first lookup.

I started suspecting there might be a timeout issue or some sort of throttling going on. After tracking down the Microsoft DNS timeout articles (below), it seemed like I might be on to something.

  1. DNS Client Timeouts
  2. DNS Forwarder Timeouts

It took me a bit to break down each timeout and walk through the log snippet to see if any of them were reached. It was important to keep in mind that the timeouts are independent of each other; however, any one of them reaching its maximum elapsed time will end the DNS resolution process. With that in mind, let’s diagram the exact set of events in the log snippet so you can visualize what occurred, then break it down after the diagram.

  • In green, you can see the client portion of the timeout.
  • In blue, you can see the DNS forwarder portion of the timeout.
  • In purple, you can see the DNS recursion portion of the timeout.

sequenceDiagram
    autonumber
    actor  DNSClient
    participant WindowsDNS
    participant DNSForwarder
    participant ExternalDNS
    DNSClient->>WindowsDNS: record.domain.com?
        rect rgb(0, 255, 0)
        note over DNSClient: Starting timer for DNS client, elapsed time 0 seconds
        end
    
    WindowsDNS->>DNSForwarder: record.domain.com?
    note over WindowsDNS: I don't have this in cache, forwarding
        rect rgb(24, 14, 161)
        note over WindowsDNS: Starting timer for DNS Forwarder, elapsed time, 0 seconds
        end
        rect rgb(126, 12, 207)
        note over WindowsDNS: Starting timer for DNS Recursion, elapsed time, 0 seconds
        end
    DNSForwarder->>ExternalDNS: record.domain.com?
        rect rgb(0, 255, 0)
        note over DNSClient: Asking again, elapsed time 1 second
        end
    DNSClient->>WindowsDNS: record.domain.com?
        rect rgb(24, 14, 161)
        note over WindowsDNS: elapsed time, 1 second
        end
        rect rgb(126, 12, 207)
        note over WindowsDNS: elapsed time, 1 second
        end

        rect rgb(0, 255, 0)
        note over DNSClient: Asking again, elapsed time 2 seconds
        end
        DNSClient->>WindowsDNS: record.domain.com?

        rect rgb(24, 14, 161)
        note over WindowsDNS: elapsed time, 2 seconds
        end
        rect rgb(126, 12, 207)
        note over WindowsDNS: elapsed time, 2 seconds
        end
    
        rect rgb(24, 14, 161)
        note over WindowsDNS: elapsed time, 3 seconds, no response from forwarder, time to end
        end
        WindowsDNS->>DNSClient: record.domain.com, had an error SERVFAIL
        rect rgb(126, 12, 207)
        note over WindowsDNS: elapsed time, 3 seconds
        end
    
    note over DNSClient: :(

        rect rgb(126, 12, 207)
        note over WindowsDNS: elapsed time, between 3 and 4 seconds
        end
    DNSForwarder->>WindowsDNS: record.domain.com is located here
    note over WindowsDNS: Too late, my forwarder timeout expired!  Told the client you didn't respond.

In this case, the DNS forwarder was timing out before it got an answer. It was taking over 3 seconds for the forwarder (Azure DNS) to complete its first lookup. You can see that Azure DNS ultimately does respond, but it’s too late. Since the forwarder timeout had already expired, the client is sent a SERVFAIL response. This is different from an NXDOMAIN response, which is confirmation that a given record or domain does not exist. The source of the SERVFAIL would have been easier to identify if the Windows DNS logging had simply said something like “forwarder timeout exceeded waiting for a response, sending SERVFAIL”. Instead, the logging does nothing to help you understand what happened.

The 3-second timeout is mentioned here, under the ForwardingTimeout section.
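If you want to gauge how close a lookup is getting to that 3-second window, a simple spot check is to time a query sent straight to the forwarder from the DNS server. Resolve-DnsName and Measure-Command are standard cmdlets; record.domain.com is the placeholder name from the logs above.

# Time a lookup sent directly to Azure DNS (168.63.129.16).
# Anything approaching 3 seconds is at risk of tripping the forwarder timeout.
Measure-Command {
    Resolve-DnsName -Name 'record.domain.com' -Server 168.63.129.16 -Type A -DnsOnly
} | Select-Object TotalMilliseconds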

Why should a DNS server in Azure be different?

In most non-Azure DNS setups, you’d probably have more than one forwarder, and you’d likely have root hints enabled as a backup. We do not, for the reasons below.

  1. Azure only has one DNS server IP to send requests to.
  2. It’s not scalable long term to add conditional forwarders for Azure private DNS zones. You’d probably be OK managing a few, but it’s not something you’d want across the board.
  3. We can use conditional forwarders with multiple servers for specific domains, like Active Directory domain-integrated ones. As an aside, if your DNS server happens to be a domain controller, it’s likely your DNS server is also authoritative for your domain.
  4. We don’t want root hints, or secondary resolvers like Google DNS. The primary reason is split-horizon / private DNS zones. Private zones in this example would be things like Azure Private Endpoint name resolution or DNS zones that only exist internally. If the Azure DNS lookup times out and you fail over to a non-Azure server, one of a few things will happen (illustrated in the sketch after this list).
    1. You’ll resolve the public-facing IP instead of your desired private IP.
    2. You’ll get an NXDOMAIN because that DNS record / zone does not exist publicly.
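To make the split-horizon problem in item 4 concrete, here’s a rough comparison you could run; the privatelink record below is a hypothetical private endpoint name, not one of ours.

# Ask Azure DNS and a public resolver for the same private endpoint name.
# Inside the virtual network, Azure DNS returns the private IP; a public
# resolver will hand back a public-facing answer or NXDOMAIN instead.
Resolve-DnsName -Name 'mystorage.privatelink.blob.core.windows.net' -Server 168.63.129.16 -Type A
Resolve-DnsName -Name 'mystorage.privatelink.blob.core.windows.net' -Server 8.8.8.8 -Type A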

If you’re running in Azure and do NOT need to resolve Azure private DNS zones, then you can probably get away with this standard configuration.

Recommendation / Solution

This is what I’m going to do, and I’ll report back if I start running into other unexpected issues.

Below is a list of what we had in place before, and this is not changing.

  1. Set the DNS forwarder to 168.63.129.16 only.
  2. Leverage conditional forwarders for any other domain specific use case like AD.
  3. Disable root hints.

As for what I changed that made the difference? I increased the default DNS forwarder timeout from 3 seconds to 6. In my opinion, that’s a more than reasonable timeout in a single-forwarder design. The spirit of the default timeout was likely based on the assumption that you would have at least two forwarders and would also fail over to root hints.
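For reference, the change itself is a one-liner with the DnsServer module. It’s worth running Get-DnsServerForwarder first so you know what you’re starting from.

# Show the current forwarder list and timeout (the timeout defaults to 3 seconds).
Get-DnsServerForwarder

# Bump the forwarder timeout to 6 seconds and keep root hints disabled.
Set-DnsServerForwarder -Timeout 6 -UseRootHint $false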

Alternative ideas?

I suppose you could still increase the default forwarder timeout to something greater than 3 seconds and then add a backup forwarder for internet resolution. However, if internal Azure DNS resolution is important to your infrastructure, I feel this could cause more harm than good. Yes, you would still be able to resolve public addresses, but as described above, that might itself be a problem.

Closing

Most DNS servers respond in well under 3 seconds, which is why this was such a tough issue for us to troubleshoot. Even in this case, the issue was very sporadic, occurring at most twice a day. Ultimately, if a DNS resolution is taking longer than 3 seconds, that’s an issue worth investigating. However, taking longer than 3 seconds and not responding at all are two completely different things. This change simply gives external DNS resolution a bit more time to complete.

Hopefully this helps you if you’re running into similar issues, or just wanted to know how to troubleshoot Windows DNS.