[TriLUG] Question - Linux Server CPR

Jon Carnes jonc at nc.rr.com
Wed Mar 19 10:36:31 EST 2003


Hmmm... at that point his processor was jammed to the max.  The best
thing he could do is run a monitoring program (either external or
internal to the server) and send an alert whenever the load reaches some
critical level - but before it drives the box into un-usability.
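A minimal sketch of that kind of load check, in Python.  The names
load_is_critical and check_load and the limit of 2.0 per CPU are
assumptions for illustration, not anything from the original setup; pick
a threshold that fits the box.

```python
import os


def load_is_critical(load1, cpu_count, per_cpu_limit=2.0):
    """Return True when the 1-minute load average exceeds the per-CPU limit."""
    return load1 > per_cpu_limit * max(cpu_count, 1)


def check_load(per_cpu_limit=2.0):
    """Build a one-line status string from the current load average."""
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    load1, _load5, _load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    state = "ALERT" if load_is_critical(load1, cpus, per_cpu_limit) else "OK"
    return "%s: load %.2f on %d CPU(s)" % (state, load1, cpus)


if __name__ == "__main__":
    print(check_load())
```

The limit is scaled by CPU count so the same script can be reused on
single and multi-processor machines.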

He should monitor CPU/Memory/disk space.  In particular he wants to
monitor virtual memory.  I'll bet he's running an IDE disk subsystem...
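As a rough sketch of the memory and disk side, the script below reads
swap usage from /proc/meminfo and disk usage from the filesystem.  It
assumes a Linux box with /proc mounted; the function names are made up
for this example.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Fraction of swap in use, or 0.0 when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total


def disk_used_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    print("swap_used_pct=%.0f" % (100 * swap_used_fraction(info)))
    print("disk_used_pct=%.0f" % (100 * disk_used_fraction("/")))
```

Heavy swap use is usually the early sign that a box is about to become
unresponsive, so it is worth reporting alongside disk space.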

Last time I had to do this for a server it was a Mail server running an
IDE disk subsystem.  We solved the problems it was having by putting a
max limit on message sizes and changing the disk subsystem over to SCSI.

If they just set up some scripts to dump out the system's status, they
can run those scripts as cron jobs every five minutes.  I like to
mail these status messages to some localized server that is stable.
Then run a script on the incoming status mail to check it for any
problems.  If there are problems, the mail is forwarded to my account (or
an alert is sent out).
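The checking step on the receiving server could look something like the
sketch below.  It assumes the status mail body is plain text with one
key=value reading per line; that format, the key names, and the limits
are all hypothetical, so adjust them to whatever the status scripts
actually send.

```python
import email
import sys
from email import policy

# Hypothetical limits; keys must match what the status script writes.
LIMITS = {"load1": 4.0, "swap_used_pct": 50.0, "disk_used_pct": 90.0}


def find_problems(raw_message, limits=LIMITS):
    """Return 'key=value' strings for every reading above its limit."""
    message = email.message_from_string(raw_message, policy=policy.default)
    body = message.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    problems = []
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in limits:
            continue  # not a reading we know how to judge
        try:
            reading = float(value)
        except ValueError:
            continue  # ignore values that are not numbers
        if reading > limits[key]:
            problems.append("%s=%s" % (key, value.strip()))
    return problems


if __name__ == "__main__":
    # Reads one mail message on stdin, e.g. piped in by the mail system.
    found = find_problems(sys.stdin.read())
    if found:
        print("PROBLEMS: " + ", ".join(found))
        sys.exit(1)  # non-zero exit tells the caller to forward or alert
    print("OK")
```

On the monitored box, a crontab entry starting with */5 * * * * runs the
status script every five minutes.  Forwarding or paging is left to
whatever calls this checker, based on its exit status.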

Good Luck - Jon Carnes

On Wed, 2003-03-19 at 09:55, Scott Chilcote wrote:
> Hi Folks,
> 
> I had someone call me yesterday who had a seriously ill Linux server.
> 
> He said that the machine seemed heavily overloaded and was running very 
> slowly.  It would respond to its keyboard at the console, but only at 
> the rate of one keystroke every five to ten minutes.  Attempts to reach 
> it with ssh from another system were timing out with "connection reset 
> by peer".  Control-Alt-Delete wasn't having any effect (the command may 
> have been removed for security).
> 
> He wound up hitting the reset button, and now has hard disk errors that 
> e2fsck can't handle.  Quite a mess.  It's possible that he could have 
> logged in and issued commands (if he'd kept at it all night), but the 
> slowdown appeared to be worsening.
> 
> What he hoped I could tell him was whether there was any way to get the 
> kernel's attention under these circumstances.  The problem might have 
> been fixed by deleting an oversize /tmp file or killing some processes, 
> if there was a way to get in.  I don't know of any.
> 
> Is there a better way to handle this problem, other than "Don't let it 
> happen to begin with"?
> 
> Thanks,
> 
> Scott C.
> 
> 