I recently troubleshot a fairly tricky DNS resolution issue and want to share what I learned with all of you. Most of the DNS lookup issues I’ve encountered in my career have been the result of a misconfigured zone, a bad record, or a server outage. I cannot recall a time when I’ve had to adjust a default setting on a Windows DNS server, but that’s exactly what I had to do here.
In this article, I will walk you through the following areas.
- What our DNS resolution path looked like before
- The problem I ran into
- Setting up the diagnostics to capture the problem
- Research I did to understand what was going on
- How I solved it
- General recommendations
Our server looks pretty straightforward. Keep in mind that the purpose of this Windows DNS server is to provide DNS resolution for domain-joined VMs. Up until very recently, Azure had no native means of forwarding DNS requests to a domain controller. This forced you to either statically set DNS servers on your Azure VM ipConfiguration or update the whole virtual network to send DNS requests to a Windows DNS server. As of this writing, there is a new preview feature called Azure DNS Private Resolver, which looks to remove the need for this setup. It’s something I’ll be exploring once it’s out of preview.
graph LR
    DNSClient((DNS Client))
    subgraph Windows DNS Server
        dnsServerIp[DNS Server IP]
        subgraph DNS Forwarders
            azureDNS[Azure DNS]
        end
        subgraph Conditional Forwarder Active Directory Domain
            DC1[Domain Controller 1 IP]
            DC2[Domain Controller 2 IP]
        end
    end
    externalDNSServer[External DNS Server / Azure]
    adServers[Active Directory DNS Lookup]
    DNSClient --Lookup-->dnsServerIp
    dnsServerIp --Unknown Lookup-->azureDNS--Lookup external name server -->externalDNSServer
    dnsServerIp --AD Lookup-->DC1 & DC2 --Lookup Domain DNS--> adServers
There were two primary ways we discovered this issue.
- Azure Insights informed us that our apps were throwing an exception.
- Our proactive web health check would also inform us that one of our sites was intermittently offline.
Examples of the errors we would see in both tools are below.
The remote name could not be resolved: 'record.domain.com'
The remote name could not be resolved: 'record2.domain.com'
While the error was pretty self-explanatory, we were never able to reproduce it. Typically, the errors were only blips, lasting a few seconds before the site returned to healthy. This made the issue incredibly difficult to troubleshoot.
Fortunately, we were using Windows DNS. One of the good things about running your own infrastructure is that you can see what’s going on. I knew Windows DNS had some pretty verbose logging capabilities, so that’s what I enabled. You can see the diagnostic settings I enabled in the PowerShell output below. I have almost every option enabled.
Get-DnsServerDiagnostics

SaveLogsToPersistentStorage : False
Queries : True
Answers : True
Notifications : True
Update : True
QuestionTransactions : True
UnmatchedResponse : True
SendPackets : True
ReceivePackets : True
TcpPackets : True
UdpPackets : True
FullPackets : True
FilterIPAddressList :
EventLogLevel : 7
UseSystemEventLog : False
EnableLoggingToFile : True
EnableLogFileRollover : False
LogFilePath : C:\DNS\debug.txt
MaxMBFileSize : 500000000
WriteThrough : False
EnableLoggingForLocalLookupEvent : False
EnableLoggingForPluginDllEvent : False
EnableLoggingForRecursiveLookupEvent : True
EnableLoggingForRemoteServerEvent : True
EnableLoggingForServerStartStopEvent : False
EnableLoggingForTombstoneEvent : False
EnableLoggingForZoneDataWriteEvent : False
EnableLoggingForZoneLoadingEvent : False
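For anyone who wants to reproduce a similar logging setup, here is a minimal sketch using `Set-DnsServerDiagnostics` from the DnsServer module. The values simply mirror the output above. It assumes the `C:\DNS` folder already exists and that you run it in an elevated session on the DNS server itself. Treat it as a starting point rather than the exact command I ran.

```powershell
# Sketch: enable verbose DNS debug logging to a file.
# Values mirror the Get-DnsServerDiagnostics output shown above;
# the log path is an assumption, so change it to suit your server.
$diagnostics = @{
    Queries              = $true
    Answers              = $true
    Notifications        = $true
    Update               = $true
    QuestionTransactions = $true
    UnmatchedResponse    = $true
    SendPackets          = $true
    ReceivePackets       = $true
    TcpPackets           = $true
    UdpPackets           = $true
    FullPackets          = $true
    EnableLoggingToFile  = $true
    LogFilePath          = 'C:\DNS\debug.txt'
    MaxMBFileSize        = 500000000
}

# Splat the settings onto the local DNS server.
Set-DnsServerDiagnostics @diagnostics

# Read the settings back to confirm what is now enabled.
Get-DnsServerDiagnostics
```

Logging this much detail produces a lot of data on a busy server, so it is probably worth dialing it back once you have captured the failure.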
Once I saw the error occur again, I gleefully started combing the logs. I was really hopeful it would be something simple and straightforward. Alas, the logs did not paint a picture that was easy to understand. Looking back with the information I’ll share further down, all the answers were there. I just didn’t understand them at the time.
9/13/2022 5:00:01 PM 02EC PACKET 0000015D4AE3B560 UDP Rcv redactedIpAddress 2d20 Q [0001 D NOERROR] A (5)record(16)domain(3)com(0)
9/13/2022 5:00:01 PM 02EC PACKET 0000015D4B368D80 UDP Snd 18.104.22.168 bd32 Q [0001 D NOERROR] A (5)record(16)domain(3)com(0)
9/13/2022 5:00:01 PM 02EC PACKET 0000015D4A022070 UDP Rcv redactedIpAddress 2d20 Q [0001 D NOERROR] A (5)record(16)domain(3)com(0)
9/13/2022 5:00:02 PM 02EC PACKET 0000015D4CFA9570 UDP Rcv redactedIpAddress 2d20 Q [0001 D NOERROR] A (5)record(16)domain(3)com(0)
9/13/2022 5:00:04 PM 02EC PACKET 0000015D4B0D7990 UDP Rcv redactedIpAddress 2d20 Q [0001 D NOERROR] A (5)record(16)domain(3)com(0)
9/13/2022 5:00:05 PM 179C PACKET 0000015D4AE3B560 UDP Snd redactedIpAddress 2d20 R Q [8281 DR SERVFAIL] A (5)record(16)domain(3)com(0)
9/13/2022 5:00:05 PM 02EC PACKET 0000015D4CF40C80 UDP Rcv 22.214.171.124 bd32 R Q [8081 DR NOERROR] A (5)record(16)domain(3)com(0)
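Combing a debug log this verbose by hand gets tedious. The following is a rough sketch of how the interesting lines could be pulled out with `Select-String`. The log path is the one from the diagnostics settings, and the patterns are assumptions based only on the line layout in the snippet above, so they may need adjusting for your own log.

```powershell
# Assumed path: the LogFilePath from the diagnostics settings.
$logPath = 'C:\DNS\debug.txt'

# Responses the server sent ("Snd") that carried a SERVFAIL result code.
Select-String -Path $logPath -Pattern 'Snd\s.*SERVFAIL' |
    Select-Object -ExpandProperty Line

# Every packet that mentions one name, to see the full exchange around a failure.
# The name is written exactly as it appears in the log snippet;
# -SimpleMatch treats the parentheses literally instead of as regex groups.
Select-String -Path $logPath -Pattern '(5)record(16)domain(3)com(0)' -SimpleMatch |
    Select-Object -ExpandProperty Line
```

Comparing the timestamp of the `Snd` to the forwarder with the timestamp of the SERVFAIL sent back to the client is what eventually reveals how long the server actually waited.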
Searching for “DNS SERVFAIL” and many other combinations of terms didn’t turn up much helpful information. Most of the results simply stated that the authoritative DNS server was having issues. While it would have been easy to say “must be the authoritative server having issues”, I had no proof of that. Part of the problem is that the DNS request is forwarded to Azure, which leaves me blind for part of the resolution path.
Over the course of a week or so, I thought maybe it was related to Azure, so I started playing around with different forwarders. First I sent requests to Google and Cloudflare. The same issue occurred, so it had to be the authoritative server, right? We were not dealing with recursion yet (this particular record would ultimately need to be recursed), but this was failing on the first lookup.
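A lighter-weight way to compare resolvers, without touching the server's forwarder configuration, is to time direct lookups from a test machine. This is only a sketch and not what I did at the time. The record name is the placeholder from the errors above. The resolver list is an assumption: 168.63.129.16 is the Azure-provided DNS address (reachable only from inside an Azure virtual network), while 8.8.8.8 and 1.1.1.1 are Google and Cloudflare.

```powershell
# Placeholder name taken from the error messages earlier in the article.
$name = 'record.domain.com'

# Assumed resolver list: Azure-provided DNS, Google, Cloudflare.
$resolvers = '168.63.129.16', '8.8.8.8', '1.1.1.1'

foreach ($server in $resolvers) {
    # Time a single A-record lookup sent directly to this resolver.
    $elapsed = Measure-Command {
        Resolve-DnsName -Name $name -Server $server -Type A -DnsOnly -ErrorAction SilentlyContinue
    }
    # Print the resolver and how long the lookup took.
    '{0,-15} {1,8:N0} ms' -f $server, $elapsed.TotalMilliseconds
}
```

Because the failures were sporadic, a single run like this is unlikely to catch one. It is more useful when run on a schedule with the results written to a file.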
I started suspecting there might be some timeout issue or some sort of throttling going on. After tracking down the Microsoft DNS timeouts article (below), it seemed like I might be on to something.
It took me a bit to break down each timeout and walk through the log snippet to see if any of them were reached. It was important to keep in mind that the timeouts are independent of each other. However, any one of them reaching its maximum elapsed time would end the DNS resolution process. With that in mind, let’s diagram the exact set of events in the log snippet so you can visualize what occurred. We can break it down after the diagram.
- In green, you can see the client portion of the timeout.
- In blue, you can see the DNS forwarder portion of the timeout.
- In purple, you can see the DNS recursion portion of the timeout.
sequenceDiagram
    autonumber
    actor DNSClient
    participant WindowsDNS
    participant DNSForwarder
    participant ExternalDNS
    DNSClient->>WindowsDNS: record.domain.com?
    rect rgb(0, 255, 0)
        note over DNSClient: Starting timer for DNS client, elapsed time 0 seconds
    end
    WindowsDNS->>DNSForwarder: record.domain.com?
    note over WindowsDNS: I don't have this in cache, forwarding
    rect rgb(24, 14, 161)
        note over WindowsDNS: Starting timer for DNS Forwarder, elapsed time, 0 seconds
    end
    rect rgb(126, 12, 207)
        note over WindowsDNS: Starting timer for DNS Recursion, elapsed time, 0 seconds
    end
    DNSForwarder->>ExternalDNS: record.domain.com?
    rect rgb(0, 255, 0)
        note over DNSClient: Asking again, elapsed time 1 second
    end
    DNSClient->>WindowsDNS: record.domain.com?
    rect rgb(24, 14, 161)
        note over WindowsDNS: elapsed time, 1 second
    end
    rect rgb(126, 12, 207)
        note over WindowsDNS: elapsed time, 1 second
    end
    rect rgb(0, 255, 0)
        note over DNSClient: Asking again, elapsed time 2 seconds
    end
    DNSClient->>WindowsDNS: record.domain.com?
    rect rgb(24, 14, 161)
        note over WindowsDNS: elapsed time, 2 seconds
    end
    rect rgb(126, 12, 207)
        note over WindowsDNS: elapsed time, 2 seconds
    end
    rect rgb(24, 14, 161)
        note over WindowsDNS: elapsed time, 3 seconds, no response from forwarder, time to end
    end
    WindowsDNS->>DNSClient: record.domain.com, had an error SERVFAIL
    rect rgb(126, 12, 207)
        note over WindowsDNS: elapsed time, 3 seconds
    end
    note over DNSClient: :(
    rect rgb(126, 12, 207)
        note over WindowsDNS: elapsed time, between 3 and 4 seconds
    end
    DNSForwarder->>WindowsDNS: record.domain.com is located here
    note over WindowsDNS: Too late, my forwarder timeout expired! Told the client you didn't respond.
In this case, the DNS forwarder was timing out before it got an answer. It was taking over 3 seconds for the DNS forwarder (Azure’s DNS) to complete its first lookup. You can see that Azure DNS does ultimately respond, but it’s too late. Since the DNS forwarder timeout had already expired, the client is sent a SERVFAIL response. This is different from an NXDOMAIN response, which is confirmation that a given record or domain does not exist. The source of the SERVFAIL would have been easier to identify if the Windows DNS logging had simply said something like “forwarder timeout exceeded waiting for a response, sending SERVFAIL”. Instead, the logging does nothing to help you understand what has happened.
The 3-second timeout is documented in the Microsoft DNS timeouts article, under the section ForwardingTimeout.
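To check which timeouts a given server is actually using, the DnsServer module can report them directly. A minimal sketch, assuming it is run on the DNS server itself:

```powershell
# Forwarder settings: forwarder IP addresses, whether root hints are used,
# and the forwarding timeout in seconds.
Get-DnsServerForwarder

# Recursion settings: the overall recursion timeout, the additional timeout,
# and the retry interval, all in seconds.
Get-DnsServerRecursion
```

Seeing the forwarder and recursion timeouts side by side makes it easier to reason about which timer expires first in a trace like the one above.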
In most non-Azure DNS setups, you’d probably have more than one forwarder, and you’d likely have root hints enabled as a backup. We do not, and it’s for the reasons below.
- Azure only has one DNS server IP to send requests to.
- It’s not scalable long term to add conditional forwarders for Azure private DNS zones. You’d probably be ok managing a few, but it’s not something you’d want across the board.
- We can use conditional forwarders with multiple servers for specific domains, like Active Directory integrated ones. As an aside, if your DNS server happens to be a domain controller, it’s likely your DNS server is also authoritative for your domain.
- We don’t want root hints, or secondary resolvers like Google DNS. The primary reason is split-horizon / private DNS zones. Private zones, for this example, would be things like Azure Private Endpoint name resolution or DNS zones that only exist internally. If the Azure DNS server times out and you fail over to a non-Azure server, one of two things will happen.
- You’ll resolve the public facing IP instead of your desired private IP.
- You will get an NXDOMAIN because that DNS record / zone does not exist publicly.
If you’re running in Azure and do NOT need to resolve Azure private DNS zones, then you can probably get away with the standard configuration of multiple forwarders plus root hints.
This is what I’m going to do, and I’ll report back if I start running into other unexpected issues.
Below is a list of what we had in place before, and this is not changing.
- Set the DNS forwarder to Azure DNS (168.63.129.16) only.
- Leverage conditional forwarders for any other domain specific use case like AD.
- Disable root hints.
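The three items above could be expressed roughly as follows. This is a sketch rather than an export of our actual configuration. 168.63.129.16 is the Azure-provided DNS address, while the domain name and domain controller IPs are hypothetical placeholders.

```powershell
# Azure-provided DNS virtual IP, used as the only forwarder.
$azureDns = '168.63.129.16'

# Forward all unknown lookups to Azure DNS and do not fall back to root hints.
Set-DnsServerForwarder -IPAddress $azureDns -UseRootHint $false

# Conditional forwarder for the AD domain.
# 'corp.example.com' and the two IPs are hypothetical; use your own domain and DCs.
Add-DnsServerConditionalForwarderZone -Name 'corp.example.com' -MasterServers 10.0.0.4, 10.0.0.5
```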
As for what I changed that made the difference? I changed the default DNS forwarder timeout from 3 seconds to 6. In my opinion, this is a more than reasonable timeout in a single-forwarder design. The spirit of the default timeouts was likely based on an assumption that you would have at least two forwarders and could also fail over to root hints.
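For reference, the timeout change can be made with `Set-DnsServerForwarder`. A minimal sketch, assuming it is run on the DNS server and that the forwarder list is already configured:

```powershell
# Raise the forwarder timeout to 6 seconds.
Set-DnsServerForwarder -Timeout 6

# Read the value back to confirm the change took effect.
(Get-DnsServerForwarder).Timeout
```

The same setting is also exposed in the DNS Manager console on the Forwarders tab of the server properties, if you prefer the GUI.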
I suppose you could still increase the default forwarder timeout to something greater than 3 seconds, and then add a backup forwarder for internet resolution. However, I feel that if internal Azure DNS resolution is important to your infrastructure, this could cause more harm than it would solve. Yes, you would be able to resolve public addresses, but that might also be a problem.
Most DNS servers respond in less than 4 seconds, which is why this was such a tough issue for us to troubleshoot. Even in this case, the issue was very sporadic, occurring at most twice a day. Ultimately, if a DNS resolution is taking longer than 3 seconds to respond, that’s an issue. However, taking longer than 3 seconds and not responding at all are two completely different things. This change gives the external DNS resolution a bit more time to complete.
Hopefully this helps if you’re running into similar issues, or if you just want to know how to troubleshoot Windows DNS.