[geeks] computer room gallery 8-)
Eric Dittman
geeks at sunhelp.org
Fri Jan 4 22:30:57 CST 2002
> > The E10K was the one with the faulty cache design that Sun denied
> > for over a year (and fixed only if the customer signed an NDA),
> > wasn't it?
>
> I must respond, being somewhat sensitive about this issue...
>
> The problem was not E10K-specific, and Sun did not require an NDA to fix
> it. The issue was that the processor modules were originally designed
> with much smaller caches than the 4 MB and 8 MB used in the later units. Since
> they were small when originally designed, the decision was made not to use
> ECC on the cache memory. (IMO, a dumb idea) When the cache size grew, the
> density increased, and the SRAM became vulnerable to cosmic radiation and
> other particles.
I agree that designing cache without ECC was stupid. I heard about the
NDA being required for a fix in the early days from several places that
I trust to have the details.
> Sun kept sites it was investigating under NDA until they were sure what
> the problem was. (A factor that added to the confusion was that SRAM from
> one manufacturer seemed much more stable than that of another.) Once they
> knew what the issue really was, they produced a revised module, and have
> swapped them at any sites that evidence a problem. (Some sites seem more
> problematic than others, especially those with bad datacenter conditions.
> Also, I believe (not sure) that altitude can be an aggravating factor.)
I think requiring an NDA while investigating is terrible service. I've
never had to sign an NDA to get a vendor to investigate or debug a problem.
I don't think blaming the problems on the environment was anything more
than a delaying tactic.
There also appear to have been a couple of revised modules that didn't
actually fix the problem, since the cache was mirrored but still didn't
have ECC. There was also a fix from Sun that hurt performance.
> I think it was a wake-up call for Sun, as they have always been good as a
> reactive company, but not very proactive. I know that they are now
> putting a real effort into RAS in product design.
I hope they got new architects for their CPUs. The design problems they
had with the CPU module were not consistent with their earlier work.
--
Eric Dittman
dittman at dittman.net
Check out the DEC Enthusiasts Club at http://www.dittman.net/