[TriLUG] Jihad! (was: Remote server monitoring)

William Sutton william at trilug.org
Thu Sep 1 13:13:46 EDT 2005


Yes...at this point, it becomes rather an NP-hard sort of problem, best 
solved by an intelligent human...because what you have to be able to do is:

1. Recognize from the symptoms that something out of the ordinary is 
occurring on one or more systems
2. Identify similar systems that are not having problems
3. Identify the similarities between working/non-working systems and 
discard those as causal issues
4. Identify discrepancies between working/non-working systems
5. Filter the discrepancies based on what is natural (different hostname 
files, for example) and what is potentially problematic (different DNS 
servers, say)
6. Identify common discrepancies among multiple broken systems (say 3 
servers aren't responding and they all use the same DNS server....)
7. (Requires a human) Figure out what the heck it all means :)
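Steps 3-6 above can be sketched mechanically. This is a toy illustration, not a real tool; the host names and settings are made up, and a real system would pull the snapshots from the machines themselves:

```python
# Hedged sketch of steps 3-6: given per-host "settings" snapshots, flag
# settings on which every broken host agrees but no working host does.
# All host names and settings below are hypothetical.

def suspect_settings(working, broken):
    """Return {setting: value} for settings shared by all broken hosts
    and used by no working host (steps 4-6 above)."""
    keys = set()
    for cfg in list(working.values()) + list(broken.values()):
        keys.update(cfg)
    suspects = {}
    for key in sorted(keys):
        broken_vals = {cfg.get(key) for cfg in broken.values()}
        working_vals = {cfg.get(key) for cfg in working.values()}
        # Step 6: a single value common to all broken hosts that no
        # working host uses is a candidate cause.
        if len(broken_vals) == 1 and not (broken_vals & working_vals):
            suspects[key] = broken_vals.pop()
    return suspects

working = {"web1": {"dns": "10.0.0.1", "hostname": "web1"},
           "web2": {"dns": "10.0.0.1", "hostname": "web2"}}
broken = {"web3": {"dns": "10.0.9.9", "hostname": "web3"},
          "web4": {"dns": "10.0.9.9", "hostname": "web4"}}
print(suspect_settings(working, broken))  # -> {'dns': '10.0.9.9'}
```

Note that with two or more broken hosts, the "natural" differences of step 5 (hostnames, say) drop out automatically, because the broken hosts never agree on them.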

William

On Thu, 1 Sep 2005, Shane O'Donnell wrote:

> Comments embedded...
> 
> On 9/1/05, William Sutton <william at trilug.org> wrote:
> > A lot of good stuff....yes, there probably needs to be a better solution
> > to the whole problem...after all, how do you determine that lack of httpd
> > response stems from a server using a slow LDAP server for logins rather
> > than httpd being turned off or local system processes putting on a severe
> > load?
> > 
> 
> Actually, this is something that OpenNMS would report as "service
> unresponsive" instead of "down", assuming you configured the software
> to actually log in to the box.  Also, the LDAP server could be polled
> using LDAP and should also respond as "unresponsive".  "Unresponsive"
> means, in OpenNMS, that the TCP handshake completed but that the
> service never completed the entire synthetic transaction.
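The "down" vs. "unresponsive" distinction can be illustrated with a small poller. This is a generic sketch of the idea, not OpenNMS code; the probe and expect bytes stand in for whatever synthetic transaction fits the service:

```python
# Sketch of a poller that distinguishes "down" (TCP handshake fails)
# from "unresponsive" (handshake succeeds, but the synthetic
# transaction never completes). Not OpenNMS code; illustration only.

import socket

def check_service(host, port, probe=b"", expect=b"", timeout=3.0):
    """Send probe, look for expect in the reply, and classify the
    service as 'up', 'down', or 'unresponsive'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "down"          # TCP handshake never completed
    try:
        sock.settimeout(timeout)
        if probe:
            sock.sendall(probe)
        data = sock.recv(1024)
        return "up" if expect in data else "unresponsive"
    except OSError:            # includes timeouts on the open socket
        return "unresponsive"
    finally:
        sock.close()
```

For example, `check_service("www.example.com", 80, probe=b"HEAD / HTTP/1.0\r\n\r\n", expect=b"HTTP/")` would separate a refused connection ("down") from an httpd that accepts connections but never answers ("unresponsive").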
> 
> > As I said, I've been on the development side of an in-house system that
> > does client-based monitoring, so I have a particular viewpoint on the
> > entire process.  You've helped me see that it's not the only (or
> > necessarily best) solution.
> > 
> > I'm beginning to think that (even with all of the existing monitoring
> > tools) something should be developed that provides data from both inside
> > and outside a server, with a set of rule-based processes on the data
> > collection/reporting server that intelligently ties that information
> > together...
> > 
> 
> Ties it together AND compares it for anomalies between the two.
> This starts getting closer to root cause analysis.  I like the way you
> think.
> 
> 
> > e.g., a high CPU load, low memory usage, and high network latency when
> > connecting to a server, coupled with low network latency to another server
> > in the same rack, could tell you that server A is having LDAP server
> > issues (or whatever)....data that you can't get just looking at services
> > provided or internal vmstat data.
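A minimal sketch of that rule idea, with entirely hypothetical metric names and thresholds: inside-the-box metrics and outside-the-box probes land in one dict, and a diagnosis fires when a pattern matches:

```python
# Hypothetical rule-based correlation sketch. Metric names, thresholds,
# and rule wording are all made up for illustration.

RULES = [
    # (diagnosis, predicate over a combined metrics dict)
    ("auth-backend (e.g. LDAP) suspected",
     lambda m: m["cpu_load"] > 4.0
               and m["mem_used_pct"] < 50
               and m["latency_ms"] > 500
               and m["rack_neighbor_latency_ms"] < 20),
    ("network path suspected",
     lambda m: m["latency_ms"] > 500
               and m["rack_neighbor_latency_ms"] > 500),
]

def diagnose(metrics):
    """Return the diagnoses whose patterns match the metrics."""
    return [name for name, rule in RULES if rule(metrics)]

print(diagnose({"cpu_load": 6.1, "mem_used_pct": 30,
                "latency_ms": 900, "rack_neighbor_latency_ms": 5}))
# -> ['auth-backend (e.g. LDAP) suspected']
```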
> > 
> 
> And now you're going down an event correlation path--seeing discrete
> "occurrences" manifested as recognized patterns of events from
> different sources.  This is where this stuff gets
> interesting/tricky/network-specific.  It's also where many companies
> have spent beaucoup dollars trying to chase an elusive target and
> never see a real return on that investment.  It's a slippery slope,
> indeed.
> 
> Shane O.
> 
> > William
> > 
> > On Thu, 1 Sep 2005, Shane O'Donnell wrote:
> > 
> > > William -
> > >
> > > Thanks for keeping us grounded in reality.
> > >
> > > You're really discussing two separate issues -- network availability
> > > monitoring and systems performance monitoring.  To almost any company,
> > > these are at the core of their monitoring goals.  Unfortunately, many
> > > companies have grown to the point that organizationally they've split
> > > up the server folks from the network folks (as opposed to having a
> > > larger "services" focus, but that's another topic altogether).
> > >
> > > What I've seen is that in companies where the function is split, the
> > > server folks take on a "server-centric" view of the universe and
> > > typically go about deploying agents to servers because, well, that's
> > > what they do--maintain software on servers.  They chalk up the spotty
> > > availability to "network problems" that are outside their scope or
> > > area of responsibility.  There is nothing wrong with this approach,
> > > until you get to the user's take on a situation.  When a user can't
> > > access a resource, there is a service-related problem that should not
> > > involve finger-pointing.  This means the server guys should have an
> > > idea as to what's going on on the network (meaning they need insight
> > > into simple and up-to-date availability reports) as well as data from
> > > their servers over the period of time during which they can't be
> > > reached.  The solution usually ends up leaning toward the expertise
> > > of whichever group owns the problem: the network guys go with a
> > > polling approach while the server guys fall toward the agent side.
> > >
> > > Personally, I trend toward the polling approach, for a few reasons:
> > >
> > >  - Agents can be a bitch to maintain
> > >  - Agents arguably intrude on (and steal resources from) the
> > > systems/apps they're supposed to be managing/monitoring
> > >  - If you can't reach a box, there are usually bigger problems than
> > > what's going on on the box
> > >  - Good polling solutions collect data over multiple time periods, so
> > > small gaps can be interpolated for reporting purposes
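The last bullet can be illustrated with a toy interpolator. Assuming samples arrive at a fixed poll interval with None marking missed polls, short gaps are filled linearly for reporting (RRD-style tools do something similar during consolidation):

```python
# Toy sketch of gap interpolation for a polled time series. A real
# collector (RRDtool, JRobin) handles this internally; this just shows
# the idea. max_gap is an arbitrary illustrative limit.

def fill_gaps(samples, max_gap=2):
    """samples: floats with None for missed polls. Linearly interpolate
    runs of up to max_gap missing values bounded by known samples."""
    out = list(samples)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1                    # find the end of the gap
            gap = j - i
            if 0 < i and j < len(out) and gap <= max_gap:
                lo, hi = out[i - 1], out[j]
                for k in range(gap):      # evenly spaced fill values
                    out[i + k] = lo + (hi - lo) * (k + 1) / (gap + 1)
            i = j
        else:
            i += 1
    return out

print(fill_gaps([1.0, None, 3.0]))  # -> [1.0, 2.0, 3.0]
```

Gaps longer than max_gap are left as unknown, which is the honest choice for reporting: a long outage should show up as a hole, not as invented data.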
> > >
> > > As an example, OpenNMS is configured so that if it discovers a box
> > > running the Net-SNMP agent (which ships by default with Red Hat, SuSE,
> > > etc.), it will collect performance metrics on CPU, network, and disks
> > > (IIRC), and most interestingly, it will collect the 1-5-15 minute
> > > load averages.  All this data gets automagically slammed
> > > into JRobin databases and graphs are dynamically generated by the UI,
> > > on demand.
> > >
> > > For reporting purposes, this out-of-the-box functionality is usually
> > > sufficient.  If you need to augment this with logs of performance
> > > metrics from the remote machines, I'd recommend a lighter weight
> > > approach--cron jobs that capture df/netstat/load/proc data to a file
> > > for access if the network is unavailable.  If you need a solution that
> > > reports on data collected on a batch basis from machines that are
> > > regularly inaccessible, you'll probably want to look to a full-blown
> > > agent solution--and you should be prepared for the maintenance
> > > overhead that's associated therewith.
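The cron-job approach described here might look like the sketch below; the command list and log handling are assumptions, not a prescribed setup. Run it from cron and redirect output to a local file so the history survives a network outage:

```python
#!/usr/bin/env python3
# Hypothetical sketch of the lightweight approach: print a timestamped
# block of local health data; cron redirects it to a file that survives
# network loss, e.g.
#   */5 * * * * /usr/local/bin/snapshot.py >> /var/log/health.log
# The default command list is an assumption -- use whatever your boxes have.

import subprocess
import time

def snapshot(commands=("df -k", "uptime", "netstat -s")):
    """Run each command and return one timestamped text block."""
    parts = ["=== %s ===" % time.strftime("%Y-%m-%d %H:%M:%S")]
    for cmd in commands:
        try:
            out = subprocess.run(cmd.split(), capture_output=True,
                                 text=True).stdout
        except OSError as exc:           # command missing on this box
            out = "unavailable: %s\n" % exc
        parts.append("$ %s\n%s" % (cmd, out))
    return "\n".join(parts) + "\n"

if __name__ == "__main__":
    print(snapshot())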
> > >
> > > Hope this helps,
> > >
> > > Shane O.
> > > On 9/1/05, William Sutton <william at trilug.org> wrote:
> > > > Hmmm...
> > > >
> > > > The question wasn't entirely theoretical.  We have an in-house developed
> > > > system monitoring tool at $WORK to make sure that our servers aren't being
> > > > bogged down by manufacturing processes (a lot of back-end stuff going on
> > > > with databases and so-on).  We also have a large worldwide VPN where
> > > > segments run over hardware we don't own or control.  Consequently, fixing
> > > > the outages isn't an option...
> > > >
> > > > FWIW....
> > > >
> > > > On Thu, 1 Sep 2005, Tarus Balog wrote:
> > > >
> > > > >
> > > > > On Sep 1, 2005, at 12:14 PM, William Sutton wrote:
> > > > >
> > > > > > It seems like a more sensible alternative to polling is to have
> > > > > > separate
> > > > > > tools for monitoring and data collection/reporting:  Place the
> > > > > > monitor on
> > > > > > the servers, and allow them to queue up reports in event of network
> > > > > > problems.
> > > > >
> > > > > Depends on what you want to monitor. I can have a program check if
> > > > > apache is running on the server, but does that mean that server is
> > > > > available in LA? New York? If all you care about is "is there an
> > > > > apache process running on this server that I can connect to, from
> > > > > this server" then, yeah. If you want to measure service availability,
> > > > > you need to measure it from the user's point of view. If Travelocity
> > > > > is slow, I go to Orbitz, whether or not the Travelocity server is
> > > > > actually up as far as they are concerned. In my case, I want to
> > > > > capture the user experience.
> > > > >
> > > > > You can also place "agents" on systems, but agent management outside
> > > > > of what ships with an O/S can be problematic on an enterprise scale. I
> > > > > guess you could write an agent to store performance data, like CPU,
> > > > > disk, etc., and then report it up to an NMS, but many people would
> > > > > rather spend resources to fix issues with the "spotty" network and
> > > > > leave it at that.
> > > > >
> > > > > -T
> > > > >
> > > > > -----
> > > > >
> > > > > Tarus Balog
> > > > > The OpenNMS Group, Inc.
> > > > > Main  : +1 919 545 2553   Fax:   +1 503-961-7746
> > > > > Direct: +1 919 647 4749   Skype: tarusb
> > > > > Key Fingerprint: 8945 8521 9771 FEC9 5481  512B FECA 11D2 FD82 B45C
> > > > >
> > > > >
> > > > --
> > > > TriLUG mailing list        : http://www.trilug.org/mailman/listinfo/trilug
> > > > TriLUG Organizational FAQ  : http://trilug.org/faq/
> > > > TriLUG Member Services FAQ : http://members.trilug.org/services_faq/
> > > > TriLUG PGP Keyring         : http://trilug.org/~chrish/trilug.asc
> > > >
> > >
> > >
> > >
> > 
> 
> 
> 



More information about the TriLUG mailing list