[TriLUG] Opteron >4GB w/RHEL 3

Mon Feb 16 10:02:42 EST 2004

No answers, but here's someone who's having the same problem with SuSE 8.2:

http://lists.suse.com/archive/suse-amd64/2003-Jul/0054.html

Kevin L. Hawkins
Sr. Linux Systems Analyst
VF Services, Inc.
(336) 424-3826

++++++++++++++++++++++++++++++++++

I'm working on setting up some new lab hardware, among which is a shiny
new dual Opteron server running RHEL 3.  The box has two Opteron 244
CPUs and 6GB of DDR ECC RAM installed in six 1GB sticks (there are 8
total DIMM slots).  The motherboard is a Tyan S2882 (a.k.a. Thunder K8S
Pro).  Everything seems to run fine in the limited testing I've done so
far, but every few seconds I see this appear in the syslog:

Feb 12 18:04:11 rtp-wbu-sh-m1 kernel: CPU 0: Silent Northbridge MCE
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel: Northbridge status a40000000005001b
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     GART TLB error generic level generic
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     extended error gart error
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     link number 0
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     error address valid
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     error uncorrected
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     previous error lost
Feb 12 18:04:11 rtp-wbu-sh-m1 kernel:     error address 00000000fafe1a68
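
For anyone who wants to pick the status word in that log apart by hand, here is a minimal sketch in Python. It assumes the value follows the general x86 machine-check status layout (flag bits 63..57, extended code in bits 19..16, error code in bits 15..0, and TLB-class codes of the form 0000 0000 0001 TTLL). That layout is an assumption on my part, not something taken from the kernel source or the log, and the names decode_status, FLAG_BITS, TRANSACTION and LEVEL are made up for illustration; check it against AMD's documentation before relying on it.

```python
# Sketch only: decode the Northbridge status value quoted in the log.
# ASSUMPTION: bit positions follow the general x86 machine-check
# status layout; verify against AMD's documentation.

STATUS = 0xA40000000005001B         # "Northbridge status" from the log
ERROR_ADDRESS = 0x00000000FAFE1A68  # "error address" from the log

# Flag names are illustrative labels, not the kernel's own strings.
FLAG_BITS = {
    63: "valid",
    62: "overflow",
    61: "uncorrected",
    60: "enabled",
    59: "misc valid",
    58: "address valid",
    57: "processor context corrupt",
}

TRANSACTION = ["instruction", "data", "generic", "reserved"]  # TT field
LEVEL = ["L0", "L1", "L2", "generic"]                         # LL field


def decode_status(status):
    """Split a 64-bit status word into flags and error-code fields."""
    flags = [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "extended_code": (status >> 16) & 0xF,
        "error_code": code,
    }
    # TLB-class error codes look like 0000 0000 0001 TTLL.
    if code >> 4 == 0x1:
        result["tlb_transaction"] = TRANSACTION[(code >> 2) & 3]
        result["tlb_level"] = LEVEL[code & 3]
    return result


if __name__ == "__main__":
    for key, value in decode_status(STATUS).items():
        print(f"{key}: {value}")
    # Where does the reported address sit relative to the 4 GiB line?
    print(f"address below 4 GiB: {ERROR_ADDRESS < 2**32}")
    print(f"address in MiB: {ERROR_ADDRESS / 2**20:.2f}")
```

Under these assumptions the TLB fields both come out as "generic", which agrees with the "GART TLB error generic level generic" line, and the logged address falls just under the 4 GiB boundary. Bit 62 (overflow) is clear in this value, so the "previous error lost" line may come from a different bit in the kernel's own table; I have not confirmed that.
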

I thought this looked like possibly bad RAM.  But when I pull out
two--*any* two--sticks of RAM, the error message goes away.  It seems to
be tied in to the fact that I have >4GB of memory.  According to the
motherboard manual, when you use more than 4 DIMMs on this board, you're
using a 128-bit (interleaved) memory configuration as opposed to a
64-bit (noninterleaved) configuration with 4 or fewer DIMMs (ref. page
30 of ftp://ftp.tyan.com/manuals/m_s2882_101.pdf), if that's any hint.
I've tried rearranging the DIMMs in every valid way listed in the
motherboard's manual to no avail.  I even ran memtest86
(www.memtest.org) just to be sure I didn't have bad RAM.  I'm using RHEL
stock kernel 2.4.21-9.ELsmp and had the same problem on 2.4.21-4.ELsmp.
The box seems to run fine, but those errors clogging up my syslog have
me worried.

Anyone know what might be happening here?  I'm not sure whether to
complain to the vendor that something is fishy with their hardware or
whether this is a software issue.

At Your Service,

--
Mark T. Voelker

[root@localhost root]# free
             total       used       free     shared    buffers     cached
Mem:       5976880     657272    5319608          0     105080     222268
-/+ buffers/cache:     329924    5646956
Swap:      2040244          0    2040244

[root@localhost root]# uname -a
Linux localhost.localdomain 2.4.21-9.ELsmp #1 SMP Thu Feb 12 16:03:39 EST 2004 x86_64 x86_64 x86_64 GNU/Linux
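
As a quick sanity check on the free output above, this small Python sketch converts the "total" figure into GiB and compares it with the 6GB installed. It assumes free is reporting in its default unit of 1024-byte kilobytes; the variable names are just for illustration.

```python
# Sketch: how much of the installed RAM does the kernel report?
# ASSUMPTION: `free` is printing kilobytes of 1024 bytes (its default).

KIB_PER_GIB = 1024 * 1024

mem_total_kib = 5976880   # "Mem: total" from the free output
installed_gib = 6         # six 1GB sticks, per the post

visible_gib = mem_total_kib / KIB_PER_GIB
missing_kib = installed_gib * KIB_PER_GIB - mem_total_kib

if __name__ == "__main__":
    print(f"visible to the kernel: {visible_gib:.2f} GiB")
    print(f"not counted in total:  {missing_kib / 1024:.1f} MiB")
    print(f"more than 4 GiB visible: {visible_gib > 4}")
```

So the kernel is counting well over 4GB in its total, which fits the observation that the messages only show up with more than four sticks installed. The difference between the installed amount and the reported total is not explained by this arithmetic.
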
