[TriLUG] Clusters, performance, etc...
Phillip Rhodes
mindcrime at cpphacker.co.uk
Mon Nov 7 21:23:10 EST 2005
Mark Freeze wrote:
> 2. Makes the second file into a dbase file
> 3. Runs another c++ program on the first file that examines each record in
> the file and compares it to another database (using proprietary code
> libraries supplied by our software vendor) that corrects any bad info in the
> address, adds a zip+4, adds carrier route info, etc...
That definitely sounds like something you could parallelize.
And you might not even have to re-code your program
(depending on some "ifs"). You could probably just divide
the input file into x chunks, where x is the number of nodes
you want working in parallel. Then sftp the chunks to the
respective nodes and kick off your processor program on each
one. Once every node finishes, sftp the processed chunks back
to a central location, concatenate them, and do whatever else
needs to be done.
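Just to make the idea concrete, here is a rough sketch of what
that round trip might look like from a controlling machine. It
assumes passwordless ssh/scp between the machines, and the node
names, the /work directory, and the ./process_file command are
all stand-ins for whatever you actually have:

  #!/bin/sh
  # Sketch of a controller script.  Node names, paths, and the
  # processor command below are placeholders.
  INPUT=input.dat
  NODES="node1 node2 node3 node4"
  N=4

  # 1. Split the input into N roughly equal chunks on line
  #    boundaries (pieces come out as chunk.aa, chunk.ab, ...).
  LINES=`wc -l < $INPUT`
  PER=`expr \( $LINES + $N - 1 \) / $N`
  split -l $PER $INPUT chunk.

  # 2. Ship one chunk to each node and start the processor in
  #    the background so the nodes run in parallel.
  set -- chunk.*
  for node in $NODES; do
      chunk=$1; shift
      scp $chunk $node:/work/input.dat
      ssh $node 'cd /work && ./process_file input.dat output.dat' &
  done
  wait    # returns once every remote job has exited

  # 3. Pull the processed chunks back and stitch them together
  #    in the same order they were handed out.
  : > final.dat
  for node in $NODES; do
      scp $node:/work/output.dat tmp.part
      cat tmp.part >> final.dat
  done

(I used scp rather than sftp there just to keep it to one-liners;
the same thing works with whatever transfer mechanism you settle
on.)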
Of course that leaves you needing to sort out some things like:
1. how to divide the input file. Possibly your existing
step 1 program could do it, or maybe you could use some
shell scripting and the tail command.
2. how to move the chunks. I mentioned sftp as an example,
but you could use ftp, nfs, samba, or pretty much anything.
3. controlling the worker processes... if the process
currently works by watching a directory for a file, you may
not have to do anything special. Otherwise you may have to
work out another way to kick off the process on the remote
nodes.
4. how to determine when each process is done with its
chunk, so you can gather up the pieces and continue
processing (one way to handle that is sketched below,
after this list).
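On point 4: if the jobs are kicked off over ssh as in the sketch
above, the wait call already tells you when everything has
finished. But if the jobs get started indirectly (say, the
program watches a directory, as in point 3), one common trick
(just an assumed convention here, not anything your vendor
provides) is to have each node drop a small marker file when it
finishes and have the controller poll for it. Something like:

  # On each node, wrap the processor so it leaves a marker
  # when it finishes successfully:
  #   ./process_file input.dat output.dat && touch /work/output.done

  # On the controller, poll until every node has its marker
  # (reusing the NODES list from the earlier sketch; the 30
  # second interval is arbitrary):
  for node in $NODES; do
      until ssh $node 'test -f /work/output.done'; do
          sleep 30
      done
  done
  # Once this loop falls through, it is safe to gather the chunks.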
HTH.
TTYL,
Phil
--
North Carolina - First In Freedom
Free America - Vote Libertarian
www.lp.org