[TriLUG] Clusters, performance, etc...

Ed Hill ed at eh3.com
Mon Nov 7 16:14:28 EST 2005


On Mon, 2005-11-07 at 15:25 -0500, Michael Alan Dorman wrote:
> Mark Freeze <mfreeze at gmail.com> writes:
> 
> > I have a friend who runs a business like mine and we have the same
> > basic setup. We normally receive files from customers that may be 50
> > to 100 MB. We run programs on these files that parse text, create
> > databases, purge records, and so on. Normal database
> > stuff. Converting and parsing records with the software that I have
> > written usually runs for about 1 hour on the larger files and we may
> > have 2 or 3 of these files each time a customer transmits data to us.

> Without more information, it's impossible to say.

Hi Mark,

Michael is right.  If your current jobs have few dependencies (that is,
the work has some locality), then you might get a good (perhaps even
linear) speedup from more machines.  Or you might have dependencies
(e.g. global I/O) that would actually make the work run *slower* when
it is cut up into pieces and smeared over multiple nodes.  Or anything
in between.
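
To make that concrete: a job like yours -- one input file in, one
result out, no shared state between files -- is the best case.  A rough
sketch of that pattern (process_file() and the incoming/*.txt layout
are just placeholders standing in for your real conversion code, not
anything you already have):

    # Each input file is an independent task, so workers never have to
    # coordinate -- the "few dependencies" case that scales well.
    from multiprocessing import Pool
    import glob

    def process_file(path):
        # Stand-in for your real parse/convert/purge step.
        with open(path) as f:
            records = f.read().splitlines()
        # ... parse fields, build database rows, drop bad records ...
        return path, len(records)

    if __name__ == "__main__":
        files = glob.glob("incoming/*.txt")     # assumed input layout
        with Pool(processes=4) as pool:         # one worker per core
            for path, count in pool.map(process_file, files):
                print(path, count, "records")

If, on the other hand, every file has to update one shared database or
read the others' output, the workers start waiting on each other and
the picture changes completely.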

We do a lot of cluster coding, and it's sometimes hard to see how
things will work out.  If you'd really like to get an idea of whether
there is anything to be gained, try a "trivial cluster" of two machines
and see where the bottlenecks are.
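
Before wiring anything up, it also helps to know whether the hour per
file goes to reading and writing or to actual crunching.  A rough
timing harness along these lines (parse_records() and the sample
filename are hypothetical placeholders for your own code and data) will
tell you which side the bottleneck is on -- if most of the time is I/O
against shared storage, more nodes won't buy you much:

    import time

    def parse_records(text):
        # Stand-in for your real parsing logic.
        return [line.split(",") for line in text.splitlines()]

    t0 = time.time()
    with open("sample_customer_file.txt") as f:  # hypothetical sample
        data = f.read()
    t1 = time.time()
    rows = parse_records(data)
    t2 = time.time()

    print("read  %.1f s" % (t1 - t0))
    print("parse %.1f s" % (t2 - t1))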

Ed

ps - Even a problem that can be readily parallelized might wind up
performing terribly on a cluster due to poor coding, but that's another
issue entirely...  ;-)

-- 
Edward H. Hill III, PhD
office:  MIT Dept. of EAPS;  Rm 54-1424;  77 Massachusetts Ave.
             Cambridge, MA 02139-4307
emails:  eh3 at mit.edu                ed at eh3.com
URLs:    http://web.mit.edu/eh3/    http://eh3.com/
phone:   617-253-0098
fax:     617-253-4464
