[TriLUG] Recommend some 'big iron'?

Andrew Perrin andrew_perrin at unc.edu
Sun Feb 3 20:06:30 EST 2002


Thanks for the link. I expect the jobs to be CPU-bound (and, to a lesser
extent, RAM-bound) rather than IO-bound. It's not searching so much as
doing fancy pattern matching. I'm working on algorithms to allow
automatic coding of qualitative data -- essentially, a sociological use
for document classification -- which will eventually reach very large
numbers of documents and categories. Software running on it will be:

- Linux (probably Debian)
- PostgreSQL
- R
- Perl
- Apache (to allow research staff to code texts)
- MAYBE ClaraOCR, to try to turn scanned documents into text

I'm going to go ahead and propose some fancy, brand-name solution (the Sun
or IBM probably) with the expectation that I'll probably end up cutting it
down to a cheaper solution.
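
One rough way to test the CPU-bound expectation before committing to
hardware is to compare CPU time against wall-clock time on a sample job.
The following is only a sketch in Python: count_matches and the sample
documents are hypothetical stand-ins, not the actual classification code,
and cpu_fraction is a helper name made up for illustration.

```python
import re
import time


def cpu_fraction(job, *args):
    """Run job(*args) and return (result, cpu_seconds / wall_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = job(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    # Guard against a zero-length interval on very fast jobs.
    fraction = cpu_used / wall_used if wall_used > 0 else 0.0
    return result, fraction


def count_matches(documents, pattern):
    """Hypothetical stand-in workload: count regex matches across documents."""
    compiled = re.compile(pattern)
    return sum(len(compiled.findall(doc)) for doc in documents)


if __name__ == "__main__":
    # Placeholder corpus; a real test would use a sample of the actual texts.
    docs = ["the quick brown fox jumps over the lazy dog"] * 20000
    total, fraction = cpu_fraction(count_matches, docs, r"\b\w+o\w*\b")
    print(f"matches: {total}, CPU/wall ratio: {fraction:.2f}")
```

A ratio close to 1.0 suggests the job spends its time computing; a ratio
well below 1.0 suggests it is mostly waiting on disk, the database, or
something else. It only counts CPU time used by the measuring process
itself, so work done inside a separate PostgreSQL server would not show up.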

----------------------------------------------------------------------
Andrew J Perrin - andrew_perrin at unc.edu - http://www.unc.edu/~aperrin
 Assistant Professor of Sociology, U of North Carolina, Chapel Hill
      269 Hamilton Hall, CB#3210, Chapel Hill, NC 27599-3210 USA


On 31 Jan 2002, Ed Hill wrote:

> On Thu, 2002-01-31 at 12:24, Andrew Perrin wrote:
> > Okay, this will be fun :)
> > 
> > I'm putting together a research grant for some fairly heavy text crunching
> > (categorizing thousands of documents using statistical methods). At the
> > moment the grant is in the "reach for the sky" phase, meaning look for the
> > best-possible technical solution. Eventually, of course, we will probably
> > have to cut down.
> > 
> > But for now, I'd like advice on hardware, potentially costing as much as
> > $25,000 for this project.  I'm open to clustered solutions as well as
> > single-machine solutions, although I don't want to spend much time keeping
> > the cluster going.  Things I've thought of include:
> > 
> > - IBM IntelliStation Z line
> > - Sun Enterprise 450 or 420R
> > - SGI 2200 or something like that (don't know SGI's line well)
> > - Building a standard Intel-based system (dual fast processors, 4 GB RAM,
> > fast SCSI disks, etc.)
> > 
> > So, what would you do?
> 
> 
> I assume by text processing you mean mostly integer work with some
> floating point.  In either case, you should be aware of the SPEC
> benchmarks:
> 
>   http://www.spec.org/osg/cpu2000/results/
> 
> and read how the benchmark scores are calculated before browsing.
> 
> You'll notice that, at the moment, the AMD Athlons are the best in terms
> of operations (either floating point or integer) per second per dollar. 
> You can get dual Athlon systems for very competitive prices online.  Or
> pick up a recent copy of Linux Journal and you'll see multiple ads for
> companies selling dual-Athlon systems that come with Linux pre-loaded
> and pre-configured.  For $25K you could build a small cluster.
> 
> But getting back to the original question: how do you know whether your
> application(s) will be CPU-bound?  If you're doing a lot of searching,
> your work is more likely to be IO-bound and in that case you're better
> off getting relatively cheaper/slower CPUs and putting your grant money
> into a large/fast SCSI array.
> 
> hth,
> Ed
> 
> 
> -- 
> Edward H. Hill III, PhD
> Post-Doctoral Researcher   |  Email:       ed at eh3.com, ehill at mines.edu
> Division of ESE            |  URL:         http://www.eh3.com
> Colorado School of Mines   |  Phone:       303-273-3483
> Golden, CO  80401          |  Fax:         303-273-3311
> Key fingerprint = 5BDE 4DA1 66BE 4F7B BC17  3A0C 932B 7266 1E76 F123
> 



