[SunRescue] Logging memory errors
rescue at sunhelp.org
Tue Dec 5 09:10:46 CST 2000
It's more a problem with how Intel boxes report ECC errors. As I understand
it, they aren't very verbose, and an NMI could mean other things.
Nick
On Tue, 5 Dec 2000, Dave Reader wrote:
>
>
> On Tue, 5 Dec 2000, Paul Khoury wrote:
>
> > On Mon, 04 Dec 2000 21:12:14 -0800, Paul Theodoropoulos wrote:
> >
> > >Dec 4 20:40:31 e4500a unix: Corrected MemMod Board 0 J3800
> > >Dec 4 20:40:31 e4500a unix: ECC Data Bit 11 was corrected
> > >
> > >I refuse to use anything but SPARC running Solaris for core
> > >infrastructure. Nothing is as reliable.
> >
> > How do the memory errors work, BTW? Does Solaris just map around them
> > in realtime? I'm sure Linux would have a fit if it encountered that.
>
> It's ECC memory - Error Checking and Correcting.
>
> It is possible for the ECC memory to seamlessly "heal" single-bit errors
> and raise an alert that an error has occurred.
>
> When this happens, it means "your memory has started to degrade and
> introduce errors. I'm correcting single-bit errors, but you'd better
> replace it before it gets worse" (ECC can only correct a single-bit
> error per word, and is there mainly to give you time to swap out the
> memory before the machine crashes horribly).
>
> With Linux, at least on x86 hardware (I've not seen ECC errors under
> Linux on SPARC), it will say something like "Received NMI - Maybe you have
> a memory problem? .. continuing anyway". Okay, so that's a little vague
> (perhaps because Linux is still only just breaking out into the market
> where ECC is the norm), but it does detect it, report it, and continue.
>
> dave.
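
To make the "corrects single-bit errors, only detects worse ones" behaviour
described above concrete, here is a minimal sketch of a SECDED (single error
correct, double error detect) code in Python. It is a toy: it protects a
4-bit value with an extended Hamming code, whereas real memory controllers
apply the same idea to much wider words. The function names encode() and
decode() are made up for this illustration and do not correspond to any
Solaris or Linux interface.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 is the overall parity bit
    # Data bits live at positions 3, 5, 6, 7; check bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # make the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits. It is 0 for
    # a valid codeword and equals the flipped position after a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2             # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: repair it (syndrome 0 means the parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                        # simulate one flipped memory cell
    print(decode(word))
    word[6] ^= 1                        # a second flip in the same word
    print(decode(word))
```

The "corrected" branch is the situation the Solaris log lines report: the
data is still delivered intact, and the event is worth logging. The
"uncorrectable" branch is the case where the hardware can only raise an
alarm, which is why a module that keeps logging corrected errors should be
replaced early.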