[TriLUG] Clusters, performance, etc...
Phillip Rhodes
mindcrime at cpphacker.co.uk
Mon Nov 7 21:23:10 EST 2005
Mark Freeze wrote:
> 2. Makes the second file into a dbase file
> 3. Runs another c++ program on the first file that examines each record in
> the file and compares it to another database (using proprietary code
> libraries supplied by our software vendor) that corrects any bad info in the
> address, adds a zip+4, adds carrier route info, etc...
That definitely sounds like something you could parallelize.
And you might not even have to re-code your program
(depending on some "ifs"). You could probably just divide
the input file into x chunks, where x is the number of nodes
you want working in parallel. Then sftp the chunks to the
respective nodes and kick off your processor program on each
one. Once every node finishes, sftp the processed chunks back
to a central location, concatenate them, and do whatever else
needs to be done.
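Just to make the idea concrete, here is a rough sketch of what
that round trip might look like from a controlling machine. It
assumes passwordless ssh/scp between the machines, and the node
names, the /work directory, and the ./process_file command are
all stand-ins for whatever you actually have:

  #!/bin/sh
  # Sketch of a controller script.  Node names, paths, and the
  # processor command below are placeholders.
  INPUT=input.dat
  NODES="node1 node2 node3 node4"
  N=4

  # 1. Split the input into N roughly equal chunks on line
  #    boundaries (pieces come out as chunk.aa, chunk.ab, ...).
  LINES=`wc -l < $INPUT`
  PER=`expr \( $LINES + $N - 1 \) / $N`
  split -l $PER $INPUT chunk.

  # 2. Ship one chunk to each node and start the processor in
  #    the background so the nodes run in parallel.
  set -- chunk.*
  for node in $NODES; do
      chunk=$1; shift
      scp $chunk $node:/work/input.dat
      ssh $node 'cd /work && ./process_file input.dat output.dat' &
  done
  wait    # returns once every remote job has exited

  # 3. Pull the processed chunks back and stitch them together
  #    in the same order they were handed out.
  : > final.dat
  for node in $NODES; do
      scp $node:/work/output.dat tmp.part
      cat tmp.part >> final.dat
  done

(I used scp rather than sftp there just to keep it to one-liners;
the same thing works with whatever transfer mechanism you settle
on.)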
Of course that leaves you needing to sort out some things like:
1. how to divide the input file. Possibly your existing
step 1 program could do it, or maybe you could use some
shell scripting and the tail command.
2. how to move the chunks. I mentioned sftp as an example,
but you could use ftp, nfs, samba, or pretty much anything.
3. controlling the worker processes... if the process
currently works by watching a directory for a file, you may
not have to do anything special. Otherwise you may have to
work out another way to kick off the process on the remote
nodes.
4. how to determine when each process is done with its
chunk, so you can gather up the pieces and continue
processing (one way to handle that is sketched below,
after this list).
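On point 4: if the jobs are kicked off over ssh as in the sketch
above, the wait call already tells you when everything has
finished. But if the jobs get started indirectly (say, the
program watches a directory, as in point 3), one common trick
(just an assumed convention here, not anything your vendor
provides) is to have each node drop a small marker file when it
finishes and have the controller poll for it. Something like:

  # On each node, wrap the processor so it leaves a marker
  # when it finishes successfully:
  #   ./process_file input.dat output.dat && touch /work/output.done

  # On the controller, poll until every node has its marker
  # (reusing the NODES list from the earlier sketch; the 30
  # second interval is arbitrary):
  for node in $NODES; do
      until ssh $node 'test -f /work/output.done'; do
          sleep 30
      done
  done
  # Once this loop falls through, it is safe to gather the chunks.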
HTH.
TTYL,
Phil
--
North Carolina - First In Freedom
Free America - Vote Libertarian
www.lp.org