[TriLUG] Problems last Friday morning

Tanner Lovelace clubjuggler at gmail.com
Mon Nov 22 14:53:44 EST 2004


Greetings folks,

I had hoped to get this message out over the weekend, but a trip
to the emergency room and 4 stitches for my 3 year old drove
everything else out of my head (everything's fine though).

Some of you may have noticed on Friday morning that certain
things like logging into the IMAP server or ssh into the old
login server, moya, didn't work.  This is what happened, as
near as I can tell.

When moya was first set up with the ldap/kerberos single sign-on,
we set up the connections that request user information from the
ldap server to use SSL.  This was set up with a self-signed certificate.
Later, TriLUG created its own certificate authority, which we used
to sign certificates for the web server, imap server, smtp server,
etc.  The LDAP service was never converted to use the new
certificate format, and later machines didn't even use ssl, mainly
because its chief use is if you're actually transmitting passwords
over ldap, which we're not (passwords are entirely handled by
kerberos, which doesn't transmit passwords over the wire at
all).  So, the ldap setup from moya to the server (yes, also
on moya) kept going using the old certificate.  This certificate
expired this past August!  We didn't actually realize it at the
time, though, for two reasons.  First, it happened right around the
time we moved the login server from moya to dargo (which doesn't
connect to the ldap server using SSL).  Second, the name service
caching daemon (nscd) was running on moya, and by the time
the certificate expired, it apparently had everyone (or at least most)
in its cache and could respond to requests even though ldap
wasn't working.  We did notice some things start acting
weirdly.  The user addition script stopped working correctly
and I incorrectly attributed it to the /home directory move.
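
An expiry like this is easy to miss, so here is a minimal sketch of
the kind of check that could flag it ahead of time.  It is not
something we actually run.  It assumes you already have the
certificate's notAfter date as text (for example from
"openssl x509 -noout -enddate"), and the function names, the 30 day
threshold and the example date are all hypothetical; I don't have
the real expiry date beyond "August".

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical warning threshold, not from the original setup


def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter time (negative if already expired).

    not_after uses the text form "Aug 15 00:00:00 2004 GMT".
    """
    now = now or datetime.now(timezone.utc)
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expires - now.timestamp()) / 86400


def status(not_after, now=None):
    """Classify a certificate as OK, EXPIRING SOON or EXPIRED."""
    days = days_until_expiry(not_after, now)
    if days < 0:
        return "EXPIRED"
    if days < WARN_DAYS:
        return "EXPIRING SOON"
    return "OK"


if __name__ == "__main__":
    # Made-up expiry date, checked as of the Friday of the outage.
    friday = datetime(2004, 11, 19, tzinfo=timezone.utc)
    print(status("Aug 15 00:00:00 2004 GMT", now=friday))
```

Run from cron against every certificate in use, something like this
would have complained well before August rather than leaving it to
be found during an outage.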

Anyway, what finally caused everything to fail was that late
Thursday night, around midnight, I restarted the ldap server
and nscd while doing routine maintenance.  As soon as they
were restarted, no one accessing moya (which is the IMAP
and mail server) could access their user credentials and as 
a result imap logins stopped working, and worse yet, mail
started bouncing with "No such user" error messages! :-(
This continued until about 9:30am when I finally figured
out what was happening and turned off SSL and fixed the
error.  So, as a result, people using the trilug mail server should
check to see if they've been suspended from any mailing lists
they're on, and if you were expecting an email between about
midnight and 9:30am on Friday morning, you should check with
whoever you were expecting it from to see if it bounced.
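
Since the failure only showed up once ldap and nscd were restarted,
a quick sanity check right after that kind of maintenance might
have caught it at midnight instead of 9:30.  The following is only
a sketch under that assumption: the function name and the account
names are hypothetical, and it relies on nothing more than the
standard pwd module, which asks the system's name service the same
way login and mail delivery do.

```python
import pwd


def unresolved_users(names, lookup=pwd.getpwnam):
    """Return the names that the given lookup cannot resolve.

    pwd.getpwnam raises KeyError when the name service has no entry
    for a user, which is the condition behind "No such user".
    """
    missing = []
    for name in names:
        try:
            lookup(name)
        except KeyError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical accounts that should always exist in the directory.
    expected = ["exampleuser1", "exampleuser2"]
    missing = unresolved_users(expected)
    if missing:
        print("cannot resolve:", ", ".join(missing))
```

A cached answer can hide a broken backend, as it did here, so a
check like this is most telling when run just after nscd has been
restarted.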

Anyway, we very much apologize for this screwup and hope
that it won't happen again.

Cheers,
Tanner Lovelace


