[TriLUG] Remote server monitoring

Thu Sep 1 12:25:15 EDT 2005

> 
> Did you mention that its auto detection (perhaps its only feature
> advantage over Nagios) is notoriously prone to slaughter the network
> (1), 

Slaughtering the network and DoS-ing daemons that can't handle being
remotely monitored are, in my mind, two drastically different things. 
One should first understand a network before one sets about
implementing a tool to try to manage it.  OpenNMS makes it quite easy
to disable a given service poller, and SSH can be disabled with
service-level granularity, IP address range granularity, or host-level
granularity.  Which one of these didn't you like?
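
To make those three levels of granularity concrete, here is a minimal
Python sketch of a poll-exclusion check.  It is only an illustration:
the PollPolicy class, its field names, and the sample addresses are
invented for this example and are not OpenNMS's actual configuration
format or API.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class PollPolicy:
    """Hypothetical exclusion rules; NOT OpenNMS's real configuration model."""
    # Service-level: never poll this service anywhere.
    disabled_services: set = field(default_factory=set)
    # Range-level: service name -> list of networks to skip.
    excluded_ranges: dict = field(default_factory=dict)
    # Host-level: service name -> set of individual host addresses to skip.
    excluded_hosts: dict = field(default_factory=dict)

    def should_poll(self, host: str, service: str) -> bool:
        if service in self.disabled_services:
            return False
        addr = ipaddress.ip_address(host)
        if any(addr in net for net in self.excluded_ranges.get(service, [])):
            return False
        return host not in self.excluded_hosts.get(service, set())


# Sample policy (all values are made up for illustration).
policy = PollPolicy(
    disabled_services={"Telnet"},
    excluded_ranges={"SSH": [ipaddress.ip_network("10.1.0.0/16")]},
    excluded_hosts={"SSH": {"192.168.5.20"}},
)
```

With this policy, SSH is skipped for anything in 10.1.0.0/16 and for
192.168.5.20, while other services on those hosts are still polled.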

> and it's implemented entirely in Java (and consumes resources like
> your average java application, accordingly?)  

Gasp!!!  Whoever heard of implementing a business application in
Java?!?!?  There are a plethora of reasons why this makes sense and
Java is increasing in popularity, but that's an argument for a
different thread altogether.

On resource consumption...I don't know many folks that are deploying
business apps on anything less than P4 servers with 1GB RAM and big
fat disks (or SANs).  I don't know why anyone would want to plan for
less than that for an application upon which, potentially, your
ability to transact business depends.  Would it be smaller and more
efficient if implemented in another language?  Perhaps.  Then again,
it might also have decidedly less functionality.  For companies
with significant operational budgets or sizable networks, 512MB of RAM
is not where you hinge your choice of monitoring platforms.

> Also, don't forget to
> point out that it doesn't understand network topology, and as such will
> page you for services Y and Z that depend on X, whenever X goes down,
> because you can't express the dependencies.  

You're correct.  And in Nagios, you statically define those
dependencies, so when your network changes, you have to statically
re-define those dependencies.  I'd argue (and have for years) that
reflecting topology in network management should be a documentation
function of the tool, not a configuration dependency.  Nagios assumes
that the user will be aware of any changes to the network and will
respond by updating the configuration manually (and re-starting the
tool)--in practice, I've not seen many, if any, network operations
centers where the folks are in tune enough with the goings-on to stay
on top of this as they should.

I also must admit that suppressing multiple notifications on the same
root cause event was one of the first things that the old Oculan team
addressed in the commercial product offerings.  I don't recall if that
suppression technology was included in the OpenNMS 1.0 release or
not--I'd defer to Tarus on that one.
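
For anyone following along, the "Y and Z depend on X" case can be
sketched in a few lines of Python.  This is a generic illustration of
root-cause suppression under the assumption that a dependency map is
available from somewhere (hand-written or discovered); the function
names and the sample map are hypothetical, and this is not how either
Nagios or the Oculan/OpenNMS code actually implements it.

```python
def has_down_ancestor(item, down, depends_on, seen=None):
    """True if anything `item` depends on (directly or transitively) is down."""
    # Seed `seen` with the item itself so a dependency cycle cannot make
    # an item count as its own root cause.
    seen = {item} if seen is None else seen
    for parent in depends_on.get(item, ()):
        if parent in seen:
            continue
        seen.add(parent)
        if parent in down or has_down_ancestor(parent, down, depends_on, seen):
            return True
    return False


def notifications(down, depends_on):
    """Return only the down items not explained by a down dependency."""
    down = set(down)
    return sorted(i for i in down if not has_down_ancestor(i, down, depends_on))


# Hypothetical dependency map: Y and Z both depend on X.
depends_on = {"Y": ["X"], "Z": ["X"]}
```

When X, Y, and Z are all down, only X generates a notification; when
Y fails on its own, Y is reported as usual.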

> I'd definitely debate your
> usability interface point, as with Nagios' increased understanding of
> network topologies, it's able to graph the network  in a very clear and
> understandable manner, which makes blocking outages very fast to
> understand, and reduces troubleshooting time accordingly.  Not to
> mention that it is in-and-of-itself a self-documenting diagram of your
> network, which makes a great teaching aid for new folks on staff, as
> well as the fact it's the most likely document to be up to date in your
> entire documentation (as it's auto generated from the monitoring system).
> 

Topology interfaces (e.g., maps) are the single most useless interface
ever introduced.  They serve to sell a lot of network management tools
to folks that will never use them.  It's a very rare network that can
use and maintain a network map appropriately, and since they largely
go unmaintained, they are a "worst case scenario".  The only thing
worse than no documentation at all is inaccurate documentation that's
assumed correct.

Network maps do not scale and simply cannot keep up with the speed at
which network topologies are prone to change, given dynamic routing
protocols from spanning tree to OSPF/BGP to EIGRP to name your
preferred dynamic routing protocol here.

On the Nagios interface in general--please note that the product is
basically unusable by someone who's color blind, and in our business,
that's a very sizable component of the user base.

> Have I mentioned enough, or should we continue the holy war?  :)  I can
> go on for pages...

I don't see this as a holy war.  Holy wars are based on beliefs and
this exchange is (or at least should be) based in facts.

And you're not only going to have to go on for pages, but for years. 
I've been in the network management industry for way too long to be
swayed by one opinion in an email.

> 
> 1 - In all fairness, I've never set up OpenNMS.  I've only seen it in
> use, on networks I've managed.  I've seen it set up by people I don't
> know, people I know are idiots, and one person who I think is rather
> knowledgeable.  In all 3 cases I've seen its auto detection utterly
> choke OpenSSH's ability to take new incoming connections, by effectively
> flooding the hosts with connections until the host runs out of
> resources.  I'd rather not have a "monitoring" daemon that has more than
> a few times been the source of the problem it's complaining about.  I'll
> gladly admit the remote possibility that this may have been the fault of
> 3 separate and unrelated sets of people misconfiguring it in the same
> way on completely separate occasions.  But if that's the case, that's a
> design flaw in and of itself.  

So you argue that if this is a problem that's widely recurring, that
behavior should be turned off by default?  No argument here.  Does
that make it a "design flaw"?  Not hardly.

Conversely, Nagios ships with NOTHING turned on--nor any ability to
discover the network to which it's attached.  Which means for it to do
anything, it must be configured in detail by an administrator.  This
is not an uncommon approach by any means, but it's a problem that can
only be remedied by having someone on staff that's intimately familiar
with the configuration of the tool before it ever gets deployed and
used.  Most businesses would trade that for some out-of-the-box
functionality any day.

> My solution to the problem usually goes
> something like this:
> ps ax | grep java | awk '{print $1}' | xargs kill
> but then again I've been called "closed minded" when it comes to java,
> and I generally consider it a compliment.

You're a smart guy and I've enjoyed your posts to the list.  I'm
comfortable overlooking this one area of technology myopia.  I think
the "closed minded" comment is probably right on and am confident that
someday you'll recognize that it's not intended as a compliment, but
perhaps as constructive criticism.

Shane O.
========
Shane O'Donnell
shaneodonnell at gmail.com
====================