Moshe Valdman from Israel wrote this question:
In a telecom system we have many memories and FPGAs. Theoretically we should have quite a high failure rate related to “single event upset”. I suspect we indeed have such failures, but these could also be just SW “bugs”. I have a difficult time convincing developers to add ECC, CRC, parity and other means to correct or at least detect such temporary failures. Can you share ideas on how to estimate the actual field failure rate related to SEUs and how to quantify the cost?
Charlie Slayman, our Ops SER Expert, wrote back:
Yes, you will have soft errors if you have memory and FPGAs. Typical rates for SRAM and flops vary between 100 and 1000 FIT per Mb or Mflop. DRAM rates are much lower, around 100 FIT/Gb. Check out Slide 25 of my presentation at last month’s IEEE SCV Reliability Society meeting archived at http://www.ewh.ieee.org/r6/scv/rl/archives.htm.
You can use these numbers as a rough rule of thumb to estimate a system soft error rate. If you actually see field failures significantly higher than this rate, then I would suspect software bugs or signal integrity as the dominant sources of errors.
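As a rough illustration of that rule of thumb, here is a minimal Python sketch. It relies on the standard definition of FIT (one failure per 10^9 device-hours) and on the rates quoted above. The board contents (64 Mb of SRAM, 2 Mflop of FPGA flops, 8 Gb of DRAM), the fleet of 1000 units, and the function names are hypothetical, chosen only to show the arithmetic.

```python
# FIT = failures per 10^9 device-hours (standard definition).
HOURS_PER_YEAR = 24 * 365  # 8760
DRAM_FIT_PER_GB = 100      # rate quoted above for DRAM


def system_fit(sram_mb, mflops, dram_gb, sram_flop_fit):
    """Sum the soft-error FIT of one unit.

    sram_flop_fit is the per-Mb / per-Mflop rate for SRAM and flops,
    somewhere in the quoted range of 100 to 1000.
    """
    return (sram_mb + mflops) * sram_flop_fit + dram_gb * DRAM_FIT_PER_GB


def upsets_per_year(total_fit, units_in_field=1):
    """Convert a FIT figure into expected upsets per year across a fleet."""
    return total_fit * 1e-9 * HOURS_PER_YEAR * units_in_field


# Hypothetical board: 64 Mb SRAM, 2 Mflop of FPGA flops, 8 Gb DRAM.
low = system_fit(64, 2, 8, sram_flop_fit=100)    # low end of the range
high = system_fit(64, 2, 8, sram_flop_fit=1000)  # high end of the range

for label, fit in (("low", low), ("high", high)):
    per_fleet = upsets_per_year(fit, units_in_field=1000)
    print(f"{label}: {fit} FIT per unit, about {per_fleet:.0f} upsets/year in 1000 units")
```

If the failure rate observed in the field sits well above the high-end figure, that points toward software bugs or signal integrity rather than soft errors. Note that this counts raw upsets; not every upset necessarily turns into a visible failure.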
Ironically, high solar activity increases the strength of the Earth’s magnetic field, which in turn reduces the neutron flux at sea level. Neutrons are the only particles that make it to terrestrial levels with any significant flux. The solar particles are scattered and deflected in the upper atmosphere. So the soft error rate of terrestrial systems is lower during high solar activity. (It’s a different story for satellite systems since they are high enough to be hit by energetic ions from the solar storms.) But the modulation is only about 20%, so you won’t see a big change in system error rate.
As far as justifying the use of parity, ECC and CRC, that depends on the design target for system reliability and the components used. At the soft error rates I quoted above, I would find it hard to believe that any form of complex telecom design could meet reasonable reliability targets without some form of ECC on large memory.
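To make the detection side concrete, here is a minimal sketch of the simplest of those mechanisms, a single even-parity bit per stored word. It is an illustration only, not a description of any particular memory controller, and the helper names are hypothetical. One parity bit detects any odd number of flipped bits but cannot correct them and misses double flips, which is why ECC is the usual choice for large memories.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Return the word together with the parity bit computed at write time."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word read back still matches its stored parity."""
    return parity_bit(word) == stored_parity


word, parity = store(0b10110010)

single_upset = word ^ (1 << 3)          # one bit flipped by a soft error
double_upset = single_upset ^ (1 << 5)  # a second bit flipped as well

print(check(word, parity))          # clean read passes
print(check(single_upset, parity))  # single-bit upset is detected
print(check(double_upset, parity))  # double-bit upset slips through
```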
Hope this helps.
If you have any comments or further insights into this, please respond to the blog.