[TriLUG] Google lookup problems...
    Aaron S. Joyner 
    aaron at joyner.ws
       
    Sat Apr  2 17:44:29 EST 2005
    
    
  
Benjamin Reed wrote:
> johnm wrote:
>
>>> So TriLUGers, weigh in please.  :)  Have you had problems resolving 
>>> Google as of late?
>>
>
> I've had intermittent SERVFAIL issues with resolving www.google.com 
> (you'll note that just "google.com" still resolves, it's weird).
>
> Lasts for maybe 5-10 minutes and then pops back.
>
First off, let me say thanks to everyone who responded, I'm glad to know 
it's not just me.  Also, after going away from the problem for a few 
hours and taking another look, I think I have a better understanding of 
what's going on.  Here's the relevant portion of the output from `rndc 
dumpdb`:
> ; authauthority
> l.google.com.           55996   NS      a.l.google.com.
>                         55996   NS      b.l.google.com.
>                         55996   NS      c.l.google.com.
>                         55996   NS      d.l.google.com.
> ; authanswer
> a.l.google.com.         55768   A       216.239.53.9
> ; authanswer
> b.l.google.com.         55947   A       64.233.179.9
> ; authanswer
> c.l.google.com.         55951   A       64.233.161.9
> ; authanswer
> d.l.google.com.         55996   A       64.233.183.9
What this basically points out is what I suspected before.  The 
authauthority records (the NS set from the authority section) for 
l.google.com get refreshed every time my server re-fetches the 
www.l.google.com record (whose TTL is 5 mins), but the response carries 
no glue A records for those name servers, only their names.  This is 
the start of the problem, as shown by this dig query:
> [asjoyner at bobjr asjoyner]$ dig -t any www.l.google.com @a.l.google.com 
> +norec                    
>
> ; <<>> DiG 9.2.1 <<>> -t any www.l.google.com @a.l.google.com +norec
> ;; global options:  printcmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35141
> ;; flags: qr aa; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;www.l.google.com.              IN      ANY
>
> ;; ANSWER SECTION:
> www.l.google.com.       300     IN      A       66.249.85.104
> www.l.google.com.       300     IN      A       66.249.85.99
>
> ;; AUTHORITY SECTION:
> l.google.com.           86400   IN      NS      a.l.google.com.
> l.google.com.           86400   IN      NS      b.l.google.com.
> l.google.com.           86400   IN      NS      c.l.google.com.
> l.google.com.           86400   IN      NS      d.l.google.com.
>
> ;; Query time: 93 msec
> ;; SERVER: 216.239.53.9#53(a.l.google.com)
> ;; WHEN: Sat Apr  2 17:30:24 2005
> ;; MSG SIZE  rcvd: 130
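For anyone who wants to check a response for this mechanically, here is 
a minimal Python sketch.  It only parses the text that dig prints; the 
function name missing_glue is made up for illustration, and it assumes 
the usual "name TTL class type rdata" record layout shown above.

```python
def missing_glue(dig_output):
    """Return NS targets from the AUTHORITY section that have no A/AAAA
    record in the ADDITIONAL section of a dig text response."""
    section = None
    ns_targets = []
    glued = set()
    for raw in dig_output.splitlines():
        # Tolerate mail-style "> " quoting in front of each line.
        line = raw.lstrip("> ").strip()
        if line.startswith(";;") and line.endswith("SECTION:"):
            section = line.split()[1]          # e.g. "AUTHORITY"
            continue
        if not line or line.startswith(";"):
            continue                           # blank line or comment
        fields = line.split()
        if len(fields) < 5:
            continue
        name, rtype, rdata = fields[0], fields[3], fields[4]
        if section == "AUTHORITY" and rtype == "NS":
            ns_targets.append(rdata.lower())
        elif section == "ADDITIONAL" and rtype in ("A", "AAAA"):
            glued.add(name.lower())
    return [ns for ns in ns_targets if ns not in glued]
```

Fed the response above (ADDITIONAL: 0), it should list all four 
name servers as lacking glue.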
The lookups will cycle through the various authoritative servers, but 
the cached A records for those servers never get refreshed along the 
way.  Once those address records expire, which admittedly takes a day, 
the next query for a host under l.google.com will try to ask one of the 
authoritative servers, which are unfortunately hosts within that very 
domain.  This is the manifestation of the problem.  What server can it 
ask now?  It knows that a.l.google.com is authoritative for 
l.google.com, and it needs to ask that server for its own address, but 
it has no address to begin the query with.  This causes the SERVFAIL 
error that Ben is describing above.  I think that this will invalidate 
the NS records, perhaps negatively caching them for some short time?  I 
can't seem to find any documentation on precisely what BIND9 will do 
with the associated records when it gets a SERVFAIL for NS records.
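To make the timing concrete, here is a toy Python model of the cache 
behavior I'm describing.  It is not how BIND actually works internally; 
the class and function names are invented, the address is a 
documentation placeholder, and it assumes the resolver never falls back 
to the parent zone while the NS set is still cached.

```python
NS_TTL = 86400                      # TTL on the NS set, per the dig output
ZONE = "l.google.com."
SERVERS = ["a.l.google.com.", "b.l.google.com.",
           "c.l.google.com.", "d.l.google.com."]

class ToyCache:
    """Maps (name, type) to (expiry time, values). Hypothetical model."""
    def __init__(self):
        self.records = {}

    def put(self, name, rtype, ttl, values, now):
        self.records[(name, rtype)] = (now + ttl, list(values))

    def get(self, name, rtype, now):
        entry = self.records.get((name, rtype))
        if entry is None or entry[0] <= now:
            return None             # missing or expired
        return entry[1]

def lookup(cache, now):
    """Model one query for a host under ZONE at time `now` (seconds)."""
    ns_names = cache.get(ZONE, "NS", now)
    if ns_names is None:
        return "re-delegate"        # would restart from the parent zone
    if not any(cache.get(ns, "A", now) for ns in ns_names):
        return "SERVFAIL"           # NS names known, but no address to ask
    # The answer refreshes the NS set but carries no glue A records.
    cache.put(ZONE, "NS", NS_TTL, ns_names, now)
    return "NOERROR"

def simulate(step=300, horizon=2 * 86400):
    """Query every `step` seconds; return (time, status) of first failure."""
    cache = ToyCache()
    cache.put(ZONE, "NS", NS_TTL, SERVERS, 0)
    for ns in SERVERS:
        cache.put(ns, "A", NS_TTL, ["192.0.2.1"], 0)  # placeholder address
    for now in range(0, horizon, step):
        status = lookup(cache, now)
        if status != "NOERROR":
            return now, status
    return None, "NOERROR"
```

In this model the NS set stays alive because every answer renews it, 
while the server addresses quietly age out, so the first failure lands 
one day after the addresses were learned.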
So the problem is that a.l.google.com (and its companions) aren't 
returning the IPs as glue records for the NS records that are returned 
on queries against www.l.google.com.  We can prove that a.l.google.com 
is aware of the A record for its own IP address, since it returns it 
when queried for it directly, but I don't know why it's not returning 
that glue.  I attempted to duplicate the behavior by setting 
"fetch-glue no; recursion no;" on a similarly configured server, but 
BIND 9.2.1 seems to always be handing back those glue records (as it 
really should).
Thanks again to everyone who responded confirming I wasn't the only one 
to have seen this.  I'll see what I can do about bringing it to the 
attention of someone at Google to get it fixed.  :)
Aaron S. Joyner
    
    