[TriLUG] How would you diagnose "random" system hangs?

Mon May 7 20:47:39 EDT 2007

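   You may have a hardware watchdog timer, or you can load the softdog
module; lsmod shows what is loaded. Check the BIOS for an option to turn on
a hardware watchdog if the board has one. A watchdog may detect the lockup.
The first thing to establish is whether the kernel is locked up totally or
whether it is just the GUI. For example, try the power button (button.ko
module). Does pressing it log anything? What were the last kernel messages
logged? (On Debian, /var/log/dmesg only holds the boot-time messages; the
running kernel log goes to /var/log/kern.log and /var/log/messages.) After
the reboot you can look back through /var/log/messages to where the
messages ended before your last boot.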
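
A rough sketch of the "lsmod and look for a watchdog" step, in Python. It
reads text in /proc/modules format (the same data lsmod prints) and picks out
module names that look like watchdog drivers. The name hints ("softdog",
"wdt", "watchdog") are my own guess at common naming, not an official list,
and the function names are made up for illustration.

```python
def loaded_modules(proc_modules_text):
    """Return module names from text in /proc/modules format.

    Each non-empty line starts with the module name as its first field.
    """
    return [line.split()[0]
            for line in proc_modules_text.splitlines()
            if line.strip()]


def watchdog_candidates(names):
    """Return the names that look like watchdog drivers.

    Heuristic only (an assumption): softdog is the software watchdog,
    and many hardware watchdog drivers carry "wdt" or "watchdog" in
    their name. A miss here does not prove there is no watchdog.
    """
    hints = ("softdog", "wdt", "watchdog")
    return [n for n in names if any(h in n.lower() for h in hints)]


# Typical use on the affected machine:
#   with open("/proc/modules") as f:
#       print(watchdog_candidates(loaded_modules(f.read())))
```

An empty result only means nothing matching is loaded right now; modprobe
softdog is still an option if the board has no hardware timer.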

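If you have multiple CPUs or cores (your Athlon64 is dual-core), then you
should see interrupts still being handled by the CPU that is not blocked;
the per-CPU counters in /proc/interrupts show this. You can also turn off a
CPU with a kernel boot parameter (maxcpus=1, for example) and see whether
the behavior changes.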
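
If the box is only half dead (say ssh still answers), the interrupt check
can be sketched like this: take two snapshots of /proc/interrupts a few
seconds apart and see which CPU's counters did not move. The function names
are hypothetical, and the parser assumes the usual layout of a header line
of CPU names followed by one row of counters per interrupt source.

```python
def per_cpu_totals(interrupts_text):
    """Sum the per-CPU counters from text in /proc/interrupts format.

    Rows that do not carry one numeric counter per CPU (such as a
    single-value ERR: row) are skipped.
    """
    lines = interrupts_text.splitlines()
    cpus = lines[0].split()              # header: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        counts = fields[1:1 + len(cpus)]  # skip the "NN:" label
        if len(counts) == len(cpus) and all(c.isdigit() for c in counts):
            for i, c in enumerate(counts):
                totals[i] += int(c)
    return dict(zip(cpus, totals))


def stalled_cpus(before, after):
    """Return CPUs whose total did not change between two snapshots."""
    return [cpu for cpu in after if after[cpu] == before.get(cpu)]


# Typical use: read /proc/interrupts, time.sleep(5), read it again,
# then call stalled_cpus(per_cpu_totals(first), per_cpu_totals(second)).
```

A CPU that shows up as stalled while the other keeps counting is a hint,
not proof; an idle CPU with a tickless kernel can also be quiet for a while.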

As a last resort, lsmod and rmmod your way down to as few modules as
possible and see if the lockup still occurs. The fact that it happens while
you are out suggests it is related to whatever is running while you are
out. What is running then (cron jobs, for instance)? You can also look at
the other logs in /var/log, such as those for httpd/apache and mysql, and
see where they end before the lockup.
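
To find where a log "ends" before the lockup, one approach is to look for
the longest silence between consecutive entries: the line just before the
gap is the last thing written before the hang, and the line after it is
usually the start of the next boot. A minimal sketch, assuming the classic
syslog timestamp ("May  7 20:47:39 ...") at the start of each line. That
format carries no year, so the caller has to supply one; the function names
are made up.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line.

    Returns None for lines that do not start with such a timestamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines, year):
    """Return (gap, line_before, line_after) for the longest silence.

    Returns None if fewer than two timestamped lines are found.
    """
    best = None
    prev_time = prev_line = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if prev_time is not None:
            gap = stamp - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = stamp, line
    return best


# Typical use:
#   with open("/var/log/messages") as f:
#       print(largest_gap(f, 2007))
```

Running the same thing over the apache and mysql logs and comparing the
"line before" timestamps narrows down when the machine actually stopped.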

   Finally, you can rebuild the kernel with a kernel debugger. Then you
could attach another computer via a null modem cable and set up your primary
computer to send kernel debugging information over the wire. Then, on the
monitoring machine (the one that is not locked up), you can issue a break
into the kernel debugger, look at the stack, and see exactly where the
kernel is locked. In fact, you can set up the kernel debugger to break when
a soft lockup is detected; when you then do a ps in the kernel debugger, it
will show you the exact process that locked. Then look at the call stack
and see what driver or other kernel code was the culprit. Red Hat has soft
lockup detection code too. You could diff against their patches and maybe
go crazy customizing your kernel with their code.
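
Before going as far as a debugger, it is cheap to check whether the soft
lockup detector already said something before the hang. A small sketch that
scans log lines for soft lockup reports and pulls out the CPU number when
one is present. The exact wording of the kernel message differs between
kernel versions, so this only matches on the phrase "soft lockup" and a
"CPU#N" tag; both the pattern and the function name are my assumptions.

```python
import re

# Assumption: the report tags the CPU as "CPU#<number>".
CPU_RE = re.compile(r"CPU#(\d+)")


def soft_lockup_reports(lines):
    """Return (cpu, line) pairs for lines mentioning a soft lockup.

    cpu is an int when a "CPU#N" tag is found on the line, else None.
    """
    reports = []
    for line in lines:
        if "soft lockup" not in line.lower():
            continue
        match = CPU_RE.search(line)
        cpu = int(match.group(1)) if match else None
        reports.append((cpu, line.rstrip("\n")))
    return reports


# Typical use:
#   with open("/var/log/messages") as f:
#       for cpu, line in soft_lockup_reports(f):
#           print(cpu, line)
```

No hits does not rule out a lockup; a hard hang with interrupts disabled
never gets the chance to log anything, which is where the watchdog and the
debugger come in.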

   I edited a book for a friend, "GNU/Linux Debugging and Troubleshooting".
He's not ready to release it yet. Only been working on it for 4-5 years!
http://www.lulu.com/content/424861 He uses it in his training company.

My $.02,
TimJowers

On 5/7/07, Robert Dale <robdale at gmail.com> wrote:
>
> On 5/6/07, Andrew Perrin <clists at perrin.socsci.unc.edu> wrote:
> > My home system has been freezing up at apparently random times --
> > usually when I'm at work, so I come home to a frozen machine that has
> > to be cold-booted. How would you go about checking this out? I've let
> > memtest86 run continuously for 24 hours with no errors, so I don't
> > think it's memory. I have the sensors reporting hourly to a log, and
> > there are no temperature concerns (generally between 37 and 40C).
> > There's nothing of interest in syslog that rings any bells to me. Any
> > ideas?
> >
> > The machine is an ASUS A8N-E, nForce chipset, with an Athlon64 dual-core
> > CPU and 4GB of RAM in it. It's running debian etch, but with a
> > home-compiled kernel 2.6.20.7.
>
> In my case, once it was a power supply, another was the mobo.
> You probably could find someone to load test your power supply.
> Hopefully, you don't run into the second situation.  Luckily I was
> just outside of my house when I heard the smoke detector go off.  Ran
> inside to the smell of burnt electronics and black smoke.  The mobo
> had literally exploded.
>
> --
> Robert Dale