[TriLUG] Jihad! ( was Remote server monitoring)
Shane O'Donnell
shaneodonnell at gmail.com
Thu Sep 1 12:46:02 EDT 2005
William -
Thanks for keeping us grounded in reality.
You're really discussing two separate issues -- network availability
monitoring and systems performance monitoring. To almost any company,
these are at the core of their monitoring goals. Unfortunately, many
companies have grown to the point that organizationally they've split
up the server folks from the network folks (as opposed to having a
larger "services" focus, but that's another topic altogether).
What I've seen is that in companies where the function is split, the
server folks take on a "server-centric" view of the universe and
typically go about deploying agents to servers because, well, that's
what they do--maintain software on servers. They chalk up the spotty
availability to "network problems" that are outside their scope or
area of responsibility. There is nothing wrong with this approach,
until you get to the user's take on a situation. When a user can't
access a resource, there is a service-related problem that should not
involve finger-pointing. This means the server guys should have an
idea as to what's going on on the network (meaning they need insight
into simple and up-to-date availability reports) as well as data from
their servers over the period of time during which they can't be
reached. In practice, the solution usually leans toward the expertise
of whichever group ends up solving the problem: the network guys go
with a polling approach while the server guys fall toward the agent
side.
Personally, I trend toward the polling approach, for a few reasons:
- Agents can be a bitch to maintain
- Agents arguably intrude on (and steal resources from) the
systems/apps they're supposed to be managing/monitoring
- If you can't reach a box, there are usually bigger problems than
what's going on on the box
- Good polling solutions collect data over multiple time periods, so
small gaps can be interpolated for reporting purposes
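
To make that last point concrete, here's a rough sketch of gap-filling
for reporting. It is not how any particular NMS does it; the function
name and the max_gap threshold are made up for illustration. It takes
a list of evenly spaced samples where None marks a missed poll and
linearly interpolates only short gaps that have real data on both
sides.

```python
from typing import List, Optional


def fill_small_gaps(samples: List[Optional[float]],
                    max_gap: int = 2) -> List[Optional[float]]:
    """Linearly interpolate runs of missing samples (None) that are at
    most max_gap long and bounded by real values on both sides.
    Longer gaps and gaps at either end are left as None."""
    filled = list(samples)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                      # first missing index in this run
        while i < n and filled[i] is None:
            i += 1
        end = i                        # first index after the run
        gap = end - start
        if start == 0 or end == n or gap > max_gap:
            continue                   # no anchor on one side, or too long
        left, right = filled[start - 1], filled[end]
        step = (right - left) / (gap + 1)
        for k in range(gap):
            filled[start + k] = left + step * (k + 1)
    return filled
```

A single missed poll between 1.0 and 3.0 gets filled with 2.0, while a
long outage stays visible as a hole in the report rather than being
papered over.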
As an example, OpenNMS is configured so that if it discovers a box
running the Net-SNMP agent (which ships by default with Red Hat, SuSE,
etc.), it will collect performance metrics on CPUs, network
performance, disks (IIRC), and most interestingly, it will collect the
1-5-15 minute load metrics. All this data gets automagically slammed
into JRobin databases and graphs are dynamically generated by the UI,
on demand.
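
If you want to see what polling those load metrics looks like by hand,
here's a minimal sketch. It is not OpenNMS's collector; it just shells
out to Net-SNMP's snmpget and reads the three laLoad entries from
UCD-SNMP-MIB. The host name, the "public" community string, and the
function names are assumptions for illustration, and it assumes
snmpget is installed on the polling machine.

```python
import subprocess
from typing import Dict

# UCD-SNMP-MIB laLoad.1/.2/.3: the 1-, 5- and 15-minute load averages.
LOAD_OIDS = {
    "load1": "1.3.6.1.4.1.2021.10.1.3.1",
    "load5": "1.3.6.1.4.1.2021.10.1.3.2",
    "load15": "1.3.6.1.4.1.2021.10.1.3.3",
}


def parse_load_values(output: str) -> Dict[str, float]:
    """Parse value-only snmpget output (one value per line, -Oqv)."""
    values = [line.strip().strip('"')
              for line in output.splitlines() if line.strip()]
    if len(values) != len(LOAD_OIDS):
        raise ValueError("expected %d values, got %d"
                         % (len(LOAD_OIDS), len(values)))
    return {name: float(v) for name, v in zip(LOAD_OIDS, values)}


def poll_load(host: str, community: str = "public",
              timeout: int = 5) -> Dict[str, float]:
    """Poll one host for its load averages via the snmpget CLI."""
    cmd = ["snmpget", "-v2c", "-c", community, "-Oqv", host,
           *LOAD_OIDS.values()]
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout, check=True)
    return parse_load_values(result.stdout)
```

Calling poll_load("server.example.com") on a schedule and storing the
results is, in spirit, the polling approach: nothing beyond the stock
SNMP agent needs to live on the box.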
For reporting purposes, this out-of-the-box functionality is usually
sufficient. If you need to augment this with logs of performance
metrics from the remote machines, I'd recommend a lighter-weight
approach--cron jobs that capture df/netstat/load/proc data to a file
for access if the network is unavailable. If you need a solution that
reports on data collected on a batch basis from machines that are
regularly inaccessible, you'll probably want to look to a full-blown
agent solution--and you should be prepared for the maintenance
overhead that's associated therewith.
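
For the cron-job route, something along these lines would do. Treat it
as a sketch under assumptions: the log location, the JSON-lines
format, and the function names are my own choices, and it only
captures load averages and disk usage for one path (roughly the
load and df pieces; netstat and /proc data are left out).

```python
import json
import os
import shutil
import tempfile
import time

# Placeholder location; pick somewhere persistent on a real system.
LOGFILE = os.path.join(tempfile.gettempdir(), "perf-snapshots.log")


def capture_snapshot(path: str = "/") -> dict:
    """Collect a timestamped sample of load averages and disk usage."""
    load1, load5, load15 = os.getloadavg()   # Unix only
    disk = shutil.disk_usage(path)
    return {
        "time": int(time.time()),
        "load": [load1, load5, load15],
        "disk_path": path,
        "disk_used_pct": round(100.0 * disk.used / disk.total, 1),
    }


def append_snapshot(logfile: str, snapshot: dict) -> None:
    """Append one snapshot to the log as a single JSON line."""
    with open(logfile, "a") as fh:
        fh.write(json.dumps(snapshot) + "\n")


if __name__ == "__main__":
    append_snapshot(LOGFILE, capture_snapshot())
```

Run it every few minutes from cron and the file is there to pull once
the network comes back. You'd also want logrotate or similar so it
doesn't grow forever.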
Hope this helps,
Shane O.
On 9/1/05, William Sutton <william at trilug.org> wrote:
> Hmmm...
>
> The question wasn't entirely theoretical. We have an in-house developed
> system monitoring tool at $WORK to make sure that our servers aren't being
> bogged down by manufacturing processes (a lot of back-end stuff going on
> with databases and so-on). We also have a large worldwide VPN where
> segments run over hardware we don't own or control. Consequently, fixing
> the outages isn't an option...
>
> FWIW....
>
> On Thu, 1 Sep 2005, Tarus Balog wrote:
>
> >
> > On Sep 1, 2005, at 12:14 PM, William Sutton wrote:
> >
> > > It seems like a more sensible alternative to polling is to have
> > > separate
> > > tools for monitoring and data collection/reporting: Place the
> > > monitor on
> > > the servers, and allow them to queue up reports in event of network
> > > problems.
> >
> > Depends on what you want to monitor. I can have a program check if
> > apache is running on the server, but does that mean that server is
> > available in LA? New York? If all you care about is "is there an
> > apache process running on this server that I can connect to, from
> > this server" then, yeah. If you want to measure service availability,
> > you need to measure it from the user's point of view. If Travelocity
> > is slow, I go to Orbitz, whether or not the Travelocity server is
> > actually up as far as they are concerned. In my case, I want to
> > capture the user experience.
> >
> > You can also place "agents" on systems, but agent management outside
> > of what ships with an O/S can be problematic on an enterprise scale. I
> > guess you could write an agent to store performance data, like CPU,
> > disk, etc., and then report it up to an NMS, but many people would
> > rather spend resources to fix issues with the "spotty" network and
> > leave it at that.
> >
> > -T
> >
> > -----
> >
> > Tarus Balog
> > The OpenNMS Group, Inc.
> > Main : +1 919 545 2553 Fax: +1 503-961-7746
> > Direct: +1 919 647 4749 Skype: tarusb
> > Key Fingerprint: 8945 8521 9771 FEC9 5481 512B FECA 11D2 FD82 B45C
> >
> >
--
Shane O.
========
Shane O'Donnell
shaneodonnell at gmail.com
====================