[TriLUG] Google lookup problems...
Aaron S. Joyner
aaron at joyner.ws
Sat Apr 2 17:44:29 EST 2005
Benjamin Reed wrote:
> johnm wrote:
>
>>> So TriLUGers, weigh in please. :) Have you had problems resolving
>>> Google as of late?
>>
>
> I've had intermittent SERVFAIL issues with resolving www.google.com
> (you'll note that just "google.com" still resolves, it's weird).
>
> Lasts for maybe 5-10 minutes and then pops back.
>
First off, thanks to everyone who responded; I'm glad to know it's not
just me. After stepping away from the problem for a few hours and
taking another look, I think I have a better understanding of what's
going on. Here's the relevant portion of the output from `rndc
dumpdb`:
> ; authauthority
> l.google.com. 55996 NS a.l.google.com.
> 55996 NS b.l.google.com.
> 55996 NS c.l.google.com.
> 55996 NS d.l.google.com.
> ; authanswer
> a.l.google.com. 55768 A 216.239.53.9
> ; authanswer
> b.l.google.com. 55947 A 64.233.179.9
> ; authanswer
> c.l.google.com. 55951 A 64.233.161.9
> ; authanswer
> d.l.google.com. 55996 A 64.233.183.9
This basically confirms what I suspected before. The authauthority
(NS) records for l.google.com get refreshed every time the cache
updates the www.l.google.com record (whose TTL is 5 minutes), but the
response does not provide glue records with the IPs of these hosts,
only their names. This is the start of the problem, as shown by this
dig query:
> [asjoyner@bobjr asjoyner]$ dig -t any www.l.google.com @a.l.google.com +norec
>
> ; <<>> DiG 9.2.1 <<>> -t any www.l.google.com @a.l.google.com +norec
> ;; global options: printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35141
> ;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;www.l.google.com. IN ANY
>
> ;; ANSWER SECTION:
> www.l.google.com. 300 IN A 66.249.85.104
> www.l.google.com. 300 IN A 66.249.85.99
>
> ;; AUTHORITY SECTION:
> l.google.com. 86400 IN NS a.l.google.com.
> l.google.com. 86400 IN NS b.l.google.com.
> l.google.com. 86400 IN NS c.l.google.com.
> l.google.com. 86400 IN NS d.l.google.com.
>
> ;; Query time: 93 msec
> ;; SERVER: 216.239.53.9#53(a.l.google.com)
> ;; WHEN: Sat Apr 2 17:30:24 2005
> ;; MSG SIZE rcvd: 130
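For anyone who wants to check their own dig output for the same symptom, here is a rough sketch in Python. It only reads the section counts from dig's header line and flags a response that lists NS records in AUTHORITY with nothing in ADDITIONAL. The function names are my own invention, and it assumes the header format shown above; newer dig versions may count an EDNS OPT pseudo-record under ADDITIONAL, so treat it as a heuristic rather than a definitive test.

```python
import re


def section_counts(dig_output):
    """Return the per-section record counts from dig's header line."""
    match = re.search(
        r"QUERY: (\d+), ANSWER: (\d+), AUTHORITY: (\d+), ADDITIONAL: (\d+)",
        dig_output,
    )
    if match is None:
        raise ValueError("no dig header counts found")
    names = ("query", "answer", "authority", "additional")
    return dict(zip(names, (int(n) for n in match.groups())))


def missing_glue(dig_output):
    """True when the reply names servers in AUTHORITY but has no ADDITIONAL data."""
    counts = section_counts(dig_output)
    return counts["authority"] > 0 and counts["additional"] == 0


# Header line copied from the dig output above.
sample = ";; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 0"
print(missing_glue(sample))
```
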
The lookups will cycle through the various authoritative servers, but
the A records for those servers don't get refreshed along with the NS
records. Once those glue records expire, which admittedly takes a day,
the next query for a host under l.google.com will try to ask one of the
authoritative servers, which are unfortunately hosts within that very
domain. This is where the problem shows up. What server can it ask
now? It knows that a.l.google.com is authoritative for l.google.com,
and it needs to ask that server for its own address, but it has no
address to begin the query with. This causes the SERVFAIL error that
Ben describes above. I think this will invalidate the NS records,
perhaps negatively caching them for some short time? I can't seem to
find any documentation on precisely what BIND9 does with the associated
records when it gets a SERVFAIL while resolving NS records.
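To make the sequence above concrete, here is a toy model of the cache in Python. It is only a sketch of my theory, not of what BIND9 actually does internally: the cache layout, the function names, and the "RESTART-AT-PARENT" result are all made up for illustration, and a real resolver may well have fallbacks this ignores. The addresses are the ones from the dumpdb output above.

```python
DAY = 86400  # TTL on the NS and nameserver A records, in seconds

ZONE = "l.google.com"
NS_ADDRS = {
    "a.l.google.com": "216.239.53.9",
    "b.l.google.com": "64.233.179.9",
    "c.l.google.com": "64.233.161.9",
    "d.l.google.com": "64.233.183.9",
}


def in_zone(name, zone):
    """True if name is the zone itself or a host underneath it."""
    return name == zone or name.endswith("." + zone)


def cached_address(cache, name, now):
    """Return an unexpired cached A record for name, or None."""
    entry = cache.get((name, "A"))
    if entry is not None and entry[1] > now:
        return entry[0]
    return None


def lookup(cache, zone, now):
    """Say what a cache-only resolver could do for a query in zone at time now."""
    ns_entry = cache.get((zone, "NS"))
    if ns_entry is None or ns_entry[1] <= now:
        return "RESTART-AT-PARENT"  # delegation expired, so it gets re-learned
    for ns_name in ns_entry[0]:
        has_address = cached_address(cache, ns_name, now) is not None
        # An out-of-zone nameserver could be resolved independently.
        if has_address or not in_zone(ns_name, zone):
            return "OK"
    # Every nameserver lives inside the zone and none has a usable address.
    return "SERVFAIL"


# t = 0: NS and A records are all cached with a one-day TTL.
cache = {(ZONE, "NS"): (list(NS_ADDRS), DAY)}
for name, addr in NS_ADDRS.items():
    cache[(name, "A")] = (addr, DAY)

before = lookup(cache, ZONE, now=DAY - 1)

# A www.l.google.com refresh just before expiry renews the NS set only,
# because the reply carries no glue.
cache[(ZONE, "NS")] = (list(NS_ADDRS), (DAY - 1) + DAY)
after = lookup(cache, ZONE, now=DAY + 1)

print(before, after)
```

In this model the lookup succeeds just before the day is up and fails right after, because the NS set has been kept alive while the addresses behind it have quietly expired.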
So the problem is that a.l.google.com (and its companions) aren't
returning the IPs as glue records for the NS records that are returned
on queries against www.l.google.com. We can prove that a.l.google.com
is aware of the A record for its own IP address, since it returns it
when queried for it directly, but I don't know why it's not returning
that glue. I attempted to duplicate the behavior by setting "fetch-glue
no; recursion no;" on a similarly configured server, but BIND 9.2.1
seems to always hand back those glue records (as it really should).
Thanks again to everyone who responded confirming I wasn't the only one
to have seen this. I'll see what I can do about bringing it to the
attention of someone at Google to get it fixed. :)
Aaron S. Joyner