[TriLUG] What could be going on with my nameserver?

Rick DeNatale rick.denatale at gmail.com
Tue Nov 1 18:39:25 EST 2005


On 11/1/05, Aaron Joyner <aaron at joyner.ws> wrote:
> Rick DeNatale wrote:
>
> >I'm plagued by what looks like an intermittent problem with my nameserver setup.
> >
> >I'm running bind9 as a caching name server, and to resolve local
> >addresses on my LAN.
> >
> >From time to time, resolution of internet names seems to stop for a
> >while.  Sometimes it's all external names, and sometimes it's only
> >some.  For example, right now I can resolve www.google.com, but not
> >en.wikipedia.org.
> >
> >The bind configuration has a forward first directive, and a forwarders
> >directive to forward to my netgear router which in turn forwards to
> >the name servers it gets from my isp via dhcp. The router's local ip
> >address is 192.168.0.11
> >
> >Here's some recent attempts to figure out what's going on using dig.
> ><trimmed>
> >;; QUESTION SECTION:
> >;www.google.com.                        IN      A
> >
> >;; ANSWER SECTION:
> >www.google.com.         310     IN      CNAME   www.l.google.com.
> >www.l.google.com.       270     IN      A       64.233.161.99
> >www.l.google.com.       270     IN      A       64.233.161.104
> >www.l.google.com.       270     IN      A       64.233.161.147
> >
> >;; QUESTION SECTION:
> >;en.wikipedia.org.              IN      A
> >
> >;; ANSWER SECTION:
> >en.wikipedia.org.       1288    IN      CNAME   rr.wikimedia.org.
> >rr.wikimedia.org.       175     IN      CNAME   rr.pmtpa.wikimedia.org.
> >rr.pmtpa.wikimedia.org. 1222    IN      A       207.142.131.246
> >
> >;; QUESTION SECTION:
> >;www.google.com.                        IN      A
> >
> >;; ANSWER SECTION:
> >www.google.com.         822     IN      CNAME   www.l.google.com.
> >www.l.google.com.       231     IN      A       64.233.161.99
> >www.l.google.com.       231     IN      A       64.233.161.104
> >www.l.google.com.       231     IN      A       64.233.161.147
> ><end trimmed>
> >So I can get google resolved via my local nameserver, but I can only
> >resolve en.wikipedia.org if I bypass the local nameserver and go
> >directly to the netgear router.
> >
> >
> The results you pasted above all have ANSWER sections with valid A
> records, meaning that these were all successful dns queries.  I don't
> doubt that you're having a DNS problem, I just wanted to highlight that
> your above output doesn't show the problem clearly, so my answers are
> just speculation.

Actually you clipped out the offending results:
rick at frodo:~$ dig en.wikipedia.org

; <<>> DiG 9.2.4 <<>> en.wikipedia.org
;; global options:  printcmd
;; connection timed out; no servers could be reached

This was right after the successful lookup of google via my ns, and
right before the successful lookup of the same name (en.wikipedia.org)
specifying the netgear router as the name server.


> I can make a pretty good educated guess.  A good way to test it would be
> to isolate how long the queries fail, although that's definitely tricky
> if you're not using it when it starts to fail (although you'd probably
> have to be).  So here's the guess.  Your NetGear router is imperfect,
> and has a pretty slow CPU.  This leads to the condition where you may
> look up a name, and your BIND server looks up that name by passing the
> query to the NetGear router.  The router then attempts to forward that
> query to the remote name server, get the response, and return it to
> BIND.  If that process takes more than X seconds (where I don't know X
> off the top of my head), or fails for some other reason (specifically
> something like an NXDOMAIN or SERVFAIL, which the NetGear may incorrectly
> return if *it* gets a timeout), then BIND will negatively cache that
> record.  So for the next 5 to 10 mins (roughly) BIND won't try to look
> up that name, because it just tried it, and it failed, so obviously
> there's no sense trying it again right away (this is a debatable point,
> of course).
>
> So how can you detect this type of failure?  Well, you can dig the local
> nameserver right away, when it's failing, and look at the output.  You'd
> do this by something like `dig bad.domain.tld @localhost` on the BIND
> server.  You would see a result such as a response with no ANSWER
> section, or a "connection timed out" error.  Connection timed out would
> indicate that it's not negatively cached, but that it's unable to look
> up the name up-stream.  An empty answer section is a likely result of a
> negatively cached answer.  Unfortunately, it's really hard to chase down
> and prove that something is negatively cached in BIND, as it doesn't
> create an entry in a dumpdb (that I've found), but it does fail quickly
> on queries with no answer, and without generating an upstream query.
> You can dig locally while doing a `tcpdump -ni eth0 port 53` and see if
> any traffic goes out during the dig, which is one way, but I don't know
> of a better way to sift for that information.

Well, as the missing dig output shows, it seems to be timing out.
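
To tell a timeout apart from an empty answer without eyeballing dig
output, something like the following Python sketch could be used. It is
only an illustration under stated assumptions: the function names
(build_query, summarize, probe) are made up for this example, the 5 second
timeout is arbitrary, and it only reads the DNS header rather than parsing
the answer records. It sends a single UDP query for an A record and reports
"timed out" separately from an error rcode or an empty answer section.

```python
import socket
import struct


def build_query(name, qid=0x1234):
    """Build a minimal DNS query packet for an A record (recursion desired)."""
    # Header: id, flags (RD bit set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def summarize(response):
    """Return (rcode, answer_count) taken from the 12-byte DNS header."""
    _qid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, ancount


def probe(name, server, timeout=5.0):
    """Send one UDP query and describe the outcome as a short string."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _addr = sock.recvfrom(512)
        except socket.timeout:
            # No reply at all: the server (or its upstream) is not answering
            return "timed out (no reply from server)"
        except OSError as exc:
            return f"error: {exc}"
    rcode, ancount = summarize(data)
    if rcode != 0:
        return f"rcode {rcode} (2 = SERVFAIL, 3 = NXDOMAIN)"
    if ancount == 0:
        # A reply with no records is what a negatively cached answer might look like
        return "empty answer section"
    return f"{ancount} answer record(s)"
```

Calling probe("en.wikipedia.org", "127.0.0.1") and then
probe("en.wikipedia.org", "192.168.0.11") while the problem is happening
would compare the local nameserver against the router for the same name.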

> Generally, there's no good reason to have that NetGear box in the
> middle, and my gut instinct is that it's the problem.  Configure up a
> few fast forwarders in your local BIND nameserver, and go on with life.
> If my suspicions aren't correct, and you can gather some more definitive
> queries, perhaps I can help chase farther into the problem.

Aaron - Thanks.  The reason that I'm forwarding to my router is that
it gets my isp name servers via dhcp, and I can't figure out how to
get them from it in turn.  My assumption was that it was basically
just passing through dns requests.

The isp dns server addresses seem to be pretty stable (as does my
"dynamic" ip address, but that's another topic). So I've added those
as forwarders, and I guess I'll remove the router.  Are there other
socially acceptable stable servers to forward to?
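
For reference, a minimal sketch of what that forwarders change might look
like in the options block of named.conf. The two addresses are placeholders
from the 192.0.2.0/24 documentation range, not real ISP servers, and the
rest of the options block is omitted:

```
options {
    // try the forwarders first, then fall back to normal resolution
    forward first;
    forwarders {
        192.0.2.1;    // placeholder for the first isp name server
        192.0.2.2;    // placeholder for the second isp name server
    };
};
```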


--
Rick DeNatale

Visit the Project Mercury Wiki Site
http://www.mercuryspacecraft.com/


