[TriLUG] Linux(RedHat) kernel question(long/involved)

bak bak at picklefactory.org
Wed Oct 27 14:34:05 EDT 2010


Well, load average doesn't necessarily measure "total CPU time"... on
Linux it counts how many tasks are either runnable or stuck in
uninterruptible (D state) sleep, averaged over the last 1, 5, and 15
minutes.

Uninterruptible sleep is almost always I/O wait. So it's entirely
possible that your cluster jobs are blocked waiting for data they've
requested while your CPUs sit idle. That is to say, the latency on NFS
requests/replies to/from the SATAbeast boxes could be what's holding
you up.
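
A quick way to check which case you're in (these are just the stock
procps tools, adjust to taste) is to look for processes stuck in D
state and to watch the iowait column on the disk servers:

    # tasks in uninterruptible (D) sleep right now -- each one counts
    # toward the load average even though it's using no CPU
    ps -eo state,pid,comm | awk '$1 == "D"'

    # watch the "b" (blocked) and "wa" (iowait) columns for a minute
    vmstat 5 12

If load is sitting at 8-12 but user/system CPU is low and iowait is
high, the bottleneck is the storage path, not the scheduler.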

I would recommend collecting a day or two of sar(1) stats to better
quantify the problem. It would also help to understand what kind of
I/O the jobs actually do. Big sequential reads? Lots of little random
I/O?
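
Assuming the stock sysstat package is installed and the sa1 cron job is
collecting data (package names and paths may differ a bit on your
RedHat release), something along these lines on the disk servers would
be a start:

    # CPU utilization, including %iowait, from today's collected data
    sar -u

    # per-device throughput, queue length, and service times
    sar -d -p

    # NFS server activity
    sar -n NFSD

    # live view while jobs are running
    iostat -x 5

    # RPC retransmissions on the compute nodes -- a sign the NFS
    # servers aren't keeping up
    nfsstat -rc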

If you haven't yet, I'd go down that route before digging into the Linux
kernel's scheduler.

--bak

On 10/27/10 2:13 PM, Leslie(Pete) Boyd wrote:
> 
> Hello Trilug,
> 
>     I have lurked on this list for approximately 10 years and 
>     have learned so much from reading it.
>     
>     Now, I have a question regarding Linux kernel internals.
>     
>     At this site, I manage approximately 120 Linux servers and
>     several SGI Altix servers as well. 
>     
>     Now the question:
>      
>      How does the Linux kernel handle a multiprocessor box?
>      Situation: 1 to N users are logged into the box.
>                 How are the CPUs allocated to these users?
>                 How will the I/O be distributed for these users?
>                 
>     My reason for needing this information:
>     
>     We have 40 Dell R610 servers, each with 8 processor cores and
>     48GB of memory. Storage for this cluster consists of two 42TB
>     SATAbeast units attached via two QLogic Fibre Channel
>     controllers, with a dedicated server for each controller.
>     
>     The total configuration for this cluster is 40 Dell R610s, each
>     with dual quad-core CPUs (320 cores in all), and 72TB of raw
>     disk space.
>     
>     Torque is used to queue jobs for the cluster and MPICH is used
>     to distribute each job across the nodes. The RAID arrays are
>     formatted with ext4, and NFS with the automounter is used to
>     export the filesystems to each of the individual servers.
>     
>     Problem: When several jobs are running on the cluster, the load
>              average on the disk servers climbs above 8, sometimes
>              as high as 12, and the performance of the running jobs
>              drops.
>              
>     We are in the process of installing/configuring a Lustre
>     filesystem; however, the disks will remain attached, and I
>     need to solve the load problem.
>     
>     Thanks in advance for your suggestions and comments.
>                          
> ******************
> Leslie(Pete) Boyd         
> Contractor:               The rich man isn't the one who has the most,
> Vision Technologies       but the one who needs the least. 
> Senior Systems Engineer   
> US EPA  Rm. E460                --- IN GOD WE TRUST --  
> 919/541-1438                     
> ******************
> 
> 
