ECC RAM (Error Correcting Code Random Access Memory)

Error-correcting code memory (ECC memory) is a type of computer data storage that can detect and correct the most common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing.

Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state. Most non-ECC memory cannot detect errors, although some non-ECC memory with parity support allows detection but not correction.

Problems

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them. Hence, the error rates increase rapidly with rising altitude; for example, compared to sea level, the rate of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km (the cruising altitude of commercial airplanes). As a result, systems operating at high altitudes require special provision for reliability.

Solutions

Several approaches have been developed to deal with unwanted bit-flips, including immunity-aware programming, RAM parity memory, and ECC memory.

This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to use an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most common error correcting code, a single-error correction and double-error detection (SECDED) Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected. Chipkill ECC is a more effective version that also corrects for multiple bit errors, including the loss of an entire memory chip.

Implementations

Seymour Cray famously said “parity is for farmers” when asked why he left this out of the CDC 6600. Later, he included parity in the CDC 7600, which caused pundits to remark that “apparently a lot of farmers buy computers”. The original IBM PC and all PCs until the early 1990s used parity checking. Later ones mostly did not. Many current microprocessor memory controllers, including almost all AMD 64-bit offerings, support ECC, but many motherboards and in particular those using low-end chipsets do not.ation needed]

An ECC-capable memory controller can detect and correct errors of a single bit per 64-bit “word” (the unit of bus transfer), and detect (but not correct) errors of two bits per 64-bit word. The BIOS in some computers, when matched with operating systems such as some versions of Linux, macOS, and Windows,ation needed] allows counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.

Some DRAM chips include “internal” on-chip error correction circuits, which allow systems with non-ECC memory controllers to still gain most of the benefits of ECC memory. In some systems, a similar effect may be achieved by using EOS memory modules.

Error detection and correction (EDAC) depends on an expectation of the kinds of errors that occur. Implicitly, it is assumed that the failure of each bit in a word of memory is independent, resulting in improbability of two simultaneous errors. This used to be the case when memory chips were one-bit wide, what was typical in the first half of the 1980s; later developments moved many bits into the same chip. This weakness is addressed by various technologies, including IBM’s Chipkill, Sun Microsystems’ Extended ECC, Hewlett Packard’s Chipspare, and Intel’s Single Device Data Correction (SDDC).

Cache

Many processors use error-correction codes in the on-chip cache, including the Intel Itanium and Xeon processors, the AMD Athlon, Opteron, all Zen- and Zen+-based processors (EPYC, EPYC Embedded, Ryzen and Ryzen Threadripper), and the DEC Alpha 21264.

As of 2006, EDC/ECC and ECC/ECC are the two most common cache error-protection techniques used in commercial microprocessors. The EDC/ECC technique uses an error-detecting code (EDC) in the level 1 cache. If an error is detected, data is recovered from ECC-protected level 2 cache. The ECC/ECC technique uses an ECC-protected level 1 cache and an ECC-protected level 2 cache. CPUs that use the EDC/ECC technique always write-through all STOREs to the level 2 cache, so that when an error is detected during a read from the level 1 data cache, a copy of that data can be recovered from the level 2 cache.

Registered Memory

Registered, or buffered, memory is not the same as ECC; these strategies perform different functions. It is usual for memory used in servers to be both registered, to allow many memory modules to be used without electrical problems, and ECC, for data integrity. Memory used in desktop computers is neither, for economy. However, unbuffered (not-registered) ECC memory is available, and some non-server motherboards support ECC functionality of such modules when used with a CPU that supports ECC. Registered memory does not work reliably in motherboards without buffering circuitry, and vice versa.

Advantages and Disadvantages

Ultimately, there is a trade-off between protection against unusual loss of data, and a higher cost.

ECC protects against undetected memory data corruption, and is used in computers where such corruption is unacceptable, for example in some scientific and financial computing applications, or in file servers. ECC also reduces the number of crashes that are especially unacceptable in multi-user server applications and maximum-availability systems. Most motherboards and processors for less critical application are not designed to support ECC so their prices can be kept lower. Some ECC-enabled boards and processors are able to support unbuffered (unregistered) ECC, but will also work with non-ECC memory; system firmware enables ECC functionality if the ECC RAM is installed.

ECC memory usually involves a higher price when compared to non-ECC memory, due to additional hardware required for producing ECC memory modules, and due to lower production volumes of ECC memory and associated system hardware. Motherboards, chipsets and processors that support ECC may also be more expensive.

ECC may lower memory performance by around 2–3 percent on some systems, depending on the application and implementation, due to the additional time needed for ECC memory controllers to perform error checking. However, modern systems integrate ECC testing into the CPU, generating no additional delay to memory accesses as long as no errors are detected.

Hardware