[TriLUG] Recommend some 'big iron'?
Andrew Perrin
andrew_perrin at unc.edu
Sun Feb 3 20:06:30 EST 2002
Thanks for the link. My expectation is that the jobs will be CPU (and, to
a lesser extent, RAM) bound, and less IO bound. It's not searching so much
as doing fancy pattern matching. I'm working on algorithms to allow
automatic coding of qualitative data -- essentially, a sociological use
for document classification -- which will eventually reach very large
numbers of documents and categories.

Software running on it will be:

- Linux (probably Debian)
- PostgreSQL
- R
- Perl
- Apache (to allow research staff to code texts)
- MAYBE ClaraOCR, to try to turn scanned documents into text

I'm going to go ahead and propose some fancy, brand-name solution (the Sun
or IBM, probably), expecting that I'll end up cutting it down to a cheaper
solution.
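
One rough way to test the CPU-bound expectation before buying anything is to
compare CPU time against wall-clock time for a representative job. The sketch
below is only an illustration in Python: cpu_fraction and count_matches are
hypothetical names, and count_matches is a toy stand-in for the real
pattern-matching code, not the actual classification algorithm.

```python
import re
import time


def cpu_fraction(job, *args):
    """Run job(*args) once and return CPU time divided by wall-clock time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process only
    job(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def count_matches(docs, patterns):
    """Toy stand-in workload: count documents matching at least one pattern."""
    compiled = [re.compile(p) for p in patterns]
    return sum(1 for doc in docs if any(c.search(doc) for c in compiled))


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    docs = ["grant money for text crunching", "fast SCSI disks"] * 50000
    patterns = [r"gr[ae]nt", r"cluster\w*"]
    print("pattern matching:", round(cpu_fraction(count_matches, docs, patterns), 2))
    print("sleeping (pure waiting):", round(cpu_fraction(time.sleep, 0.5), 2))
```

A ratio close to 1 suggests the job spends its time computing; a ratio well
below 1 suggests it is mostly waiting, typically on disk or network. Note that
process_time only counts the measuring process, so time spent inside a
separate PostgreSQL server would show up as waiting here.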
----------------------------------------------------------------------
Andrew J Perrin - andrew_perrin at unc.edu - http://www.unc.edu/~aperrin
Assistant Professor of Sociology, U of North Carolina, Chapel Hill
269 Hamilton Hall, CB#3210, Chapel Hill, NC 27599-3210 USA
On 31 Jan 2002, Ed Hill wrote:
> On Thu, 2002-01-31 at 12:24, Andrew Perrin wrote:
> > Okay, this will be fun :)
> >
> > I'm putting together a research grant for some fairly heavy text crunching
> > (categorizing thousands of documents using statistical methods). At the
> > moment the grant is in the "reach for the sky" phase, meaning look for the
> > best-possible technical solution. Eventually, of course, we will probably
> > have to cut down.
> >
> > But for now, I'd like advice on hardware, potentially costing as much as
> > $25,000 for this project. I'm open to clustered solutions as well as
> > single-machine solutions, although I don't want to spend much time keeping
> > the cluster going. Things I've thought of include:
> >
> > - IBM IntelliStation Z line
> > - Sun Enterprise 450 or 420R
> > - SGI 2200 or something like that (don't know SGI's line well)
> > - Building a standard Intel-based system (dual fast processors, 4G RAM,
> > fast SCSI disks, etc.)
> >
> > So, what would you do?
>
>
> I assume by text processing you mean mostly integer work with some
> floating point. In either case, you should be aware of the SPEC
> benchmarks:
>
> http://www.spec.org/osg/cpu2000/results/
>
> and read how the benchmark scores are calculated before browsing.
>
> You'll notice that, at the moment, the AMD Athlons are the best in terms
> of operations (either floating point or integer) per second per dollar.
> You can get dual Athlon systems for very competitive prices online. Or
> pick up a recent copy of Linux Journal and you'll see multiple ads for
> companies selling dual-Athlon systems that come with Linux pre-loaded
> and pre-configured. For $25K you could build a small cluster.
>
> But getting back to the original question: how do you know whether your
> application(s) will be CPU-bound? If you're doing a lot of searching,
> your work is more likely to be IO-bound and in that case you're better
> off getting relatively cheaper/slower CPUs and putting your grant money
> into a large/fast SCSI array.
>
> hth,
> Ed
>
>
> --
> Edward H. Hill III, PhD
> Post-Doctoral Researcher | Email: ed at eh3.com, ehill at mines.edu
> Division of ESE | URL: http://www.eh3.com
> Colorado School of Mines | Phone: 303-273-3483
> Golden, CO 80401 | Fax: 303-273-3311
> Key fingerprint = 5BDE 4DA1 66BE 4F7B BC17 3A0C 932B 7266 1E76 F123
>