[TriLUG] What could be going on with my nameserver?
Aaron Joyner
aaron at joyner.ws
Tue Nov 1 18:06:30 EST 2005
Rick DeNatale wrote:
>I'm plagued by what looks like an intermittent problem with my nameserver setup.
>
>I'm running bind9 as a caching name server, and to resolve local
>addresses on my LAN.
>
>From time to time, resolution of internet names seems to stop for a
>while. Sometimes it's all external names, and sometimes it's only
>some. For example, right now I can resolve www.google.com, but not
>en.wikipedia.org.
>
>The bind configuration has a forward first directive, and a forwarders
>directive to forward to my netgear router which in turn forwards to
>the name servers it gets from my isp via dhcp. The router's local ip
>address is 192.168.0.11
>
>Here's some recent attempts to figure out what's going on using dig.
><trimmed>
>;; QUESTION SECTION:
>;www.google.com. IN A
>
>;; ANSWER SECTION:
>www.google.com. 310 IN CNAME www.l.google.com.
>www.l.google.com. 270 IN A 64.233.161.99
>www.l.google.com. 270 IN A 64.233.161.104
>www.l.google.com. 270 IN A 64.233.161.147
>
>;; QUESTION SECTION:
>;en.wikipedia.org. IN A
>
>;; ANSWER SECTION:
>en.wikipedia.org. 1288 IN CNAME rr.wikimedia.org.
>rr.wikimedia.org. 175 IN CNAME rr.pmtpa.wikimedia.org.
>rr.pmtpa.wikimedia.org. 1222 IN A 207.142.131.246
>
>;; QUESTION SECTION:
>;www.google.com. IN A
>
>;; ANSWER SECTION:
>www.google.com. 822 IN CNAME www.l.google.com.
>www.l.google.com. 231 IN A 64.233.161.99
>www.l.google.com. 231 IN A 64.233.161.104
>www.l.google.com. 231 IN A 64.233.161.147
><end trimmed>
>So I can get google resolved via my local nameserver, but I can only
>resolve en.wikipedia.org if I bypass the local nameserver and go
>directly to the netgear router.
>
>
The results you pasted above all have ANSWER sections with valid A
records, meaning that these were all successful DNS queries. I don't
doubt that you're having a DNS problem; I just wanted to highlight that
the output above doesn't show the failure itself, so my answers are
just speculation.
>As I said these problems seem to come and go. Resolution of local
>names seems solid (they're all in a local subdomain
>local.denhaven2.com). Restarting bind doesn't seem to make a
>difference.
>
>Any ideas?
>
>
I can make a pretty good educated guess. A good way to test it would be
to measure how long the queries keep failing, although that's tricky
unless you happen to be using it when the failures start (and you
probably would be). So here's the guess. Your NetGear router is
imperfect, and has a pretty slow CPU. When you look up a name, your BIND
server passes the query to the NetGear router. The router then attempts
to forward that query to the remote name server, get the response, and
return it to BIND. If that process takes more than X seconds (where I
don't know X off the top of my head), or fails for some other reason
(specifically something like an NXDOMAIN or SERVFAIL, which the NetGear
may incorrectly return if *it* gets a timeout), then BIND will
negatively cache that record. So for the next 5 to 10 mins (roughly)
BIND won't try to look up that name, because it just tried it, and it
failed, so obviously there's no sense trying it again right away (this
is a debatable point, of course).
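To make the guessed behavior concrete, here is a toy model of negative caching in Python. It is only a sketch and not how BIND is implemented: the class name `ToyNegativeCache`, the `upstream` callable, and the 300-second `negative_ttl` are all made up for illustration.

```python
import time


class ToyNegativeCache:
    """Toy model of negative caching; NOT BIND's actual implementation."""

    def __init__(self, upstream, negative_ttl=300.0, clock=time.monotonic):
        # upstream: hypothetical callable, name -> address, raises LookupError on failure
        self.upstream = upstream
        # negative_ttl: assumed value in seconds, chosen only for illustration
        self.negative_ttl = negative_ttl
        self.clock = clock
        # name -> time at which the remembered failure expires
        self.failed_until = {}

    def resolve(self, name):
        now = self.clock()
        expires = self.failed_until.get(name)
        if expires is not None and now < expires:
            # A recent failure is remembered: fail fast, send nothing upstream.
            raise LookupError(f"{name}: negatively cached")
        try:
            address = self.upstream(name)
        except LookupError:
            # Remember the failure so later lookups skip the upstream query.
            self.failed_until[name] = now + self.negative_ttl
            raise
        # A successful lookup clears any stale failure entry.
        self.failed_until.pop(name, None)
        return address
```

In this model, one slow or failed answer from the upstream makes the name unresolvable until the entry expires, even if the upstream recovers a second later. That matches the symptom of some names failing for a while and then coming back.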
So how can you detect this type of failure? Well, you can dig the local
nameserver right away, while it's failing, and look at the output. You'd
do this with something like `dig bad.domain.tld @localhost` on the BIND
server. You would see either a response with no ANSWER section, or a
"connection timed out" error. A timeout would indicate that the name is
not negatively cached, but that BIND is unable to look it up upstream.
An empty answer section is the likely result of a negatively cached
answer. Unfortunately, it's really hard to chase down and prove that
something is negatively cached in BIND, as it doesn't create an entry in
a dumpdb (that I've found), but it does fail quickly on queries with no
answer, and without generating an upstream query. You can dig locally
while running `tcpdump -ni eth0 port 53` and see if any traffic goes out
during the dig, which is one way, but I don't know of a better way to
sift for that information.
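If you want to save dig output during a failure and sort it afterwards, the distinction above can be sketched in a few lines of Python. The function name `classify_dig_output` and the three labels are made up for illustration, and it only looks for the "connection timed out" text and an ANSWER section containing at least one record.

```python
def classify_dig_output(text):
    """Return 'timeout', 'answered', or 'no-answer' for text printed by dig."""
    if "connection timed out" in text:
        # dig never got a reply from the server it queried.
        return "timeout"
    in_answer = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            if not stripped or stripped.startswith(";"):
                # A blank line or the next header ends the section.
                in_answer = False
                continue
            # A record line inside the ANSWER section.
            return "answered"
    # A reply arrived but carried no answer records.
    return "no-answer"
```

A "no-answer" result that comes back instantly, with nothing showing up in tcpdump, would point at a negatively cached entry; a "timeout" would point at the upstream path instead.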
Generally, there's no good reason to have that NetGear box in the
middle, and my gut instinct is that it's the problem. Configure a few
fast forwarders in your local BIND nameserver in its place, and go on
with life. If my suspicions aren't correct, and you can capture some
more definitive query output, perhaps I can help chase further into the
problem.
Best of luck!
Aaron S. Joyner