Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy

Wei Chen, Rui Gong, Fang Liu, Kui Dai, Zhiying Wang

School of Computer, National University of Defense Technology,

Changsha 410073, Hunan, China


Abstract- Triple Modular Redundancy (TMR) is widely used in dependable systems design to ensure high reliability against soft errors. Conventional TMR is effective in protecting sequential circuits but cannot mask soft errors in combinational circuits. This paper presents a new redundancy technique, Space-Time Triple Modular Redundancy, which improves the soft error tolerance of the combinational circuit. Its usefulness is demonstrated in a case study. The delay overhead and the fault tolerance of Space-Time Triple Modular Redundancy are compared with those of conventional Triple Modular Redundancy. Results show that Space-Time Triple Modular Redundancy is more effective than conventional Triple Modular Redundancy.

Keywords: soft error, fault tolerance, reliability, space-time triple modular redundancy, sequential circuit, combinational circuit.

1 Introduction

Integrated Circuits (ICs) used in computer systems and other electronic systems operating under radiation are susceptible to a phenomenon known as Single Event Upset (SEU), or soft error. A soft error is a transient effect induced by the passage of a single charged particle through the silicon. Because transistor dimensions keep shrinking, particles that were once considered negligible are now significant enough to cause upsets [1], which can perturb the operation of the integrated circuit. As computer systems and other electronic systems are widely used in radiation environments such as space vehicles, satellites and some military systems, the fault tolerance and reliability of ICs should be improved to keep systems working correctly in harsh environments.

Several techniques have been proposed to make designs reliable in the presence of soft errors. Triple modular redundancy (TMR) [2] is a technique commonly used to provide design hardening. It is used to protect sequential circuits, or storage elements. The conventional TMR technique has proved effective in protecting sequential circuits, but it cannot mask soft errors in combinational circuits.

A new TMR technique called Space-Time Triple Modular Redundancy (ST-TMR) is proposed in this paper. It is shown to improve the fault tolerance of combinational circuits effectively. Both the conventional TMR and ST-TMR are applied to a target application: a special counter system. Random faults are injected into the counter. By observing the value of the counter, the fault tolerance of the conventional TMR and ST-TMR is analyzed.

This paper is organized as follows. Section 2 introduces soft errors in sequential circuits and combinational circuits. Section 3 reviews the conventional TMR technique. In Section 4, the architecture and principle of ST-TMR are described in detail. A case study on a special counter protected under both the conventional TMR (S-TMR) and ST-TMR is presented in Section 5, and the main conclusions are given in Section 6.

2 Soft Errors in Sequential Circuits and Combinational Circuits

The circuits of a modern processor or other electronic system fall into two basic classes: sequential circuits and combinational circuits. Soft errors in these two classes have different impacts, so different approaches are required to protect the sequential circuit and the combinational circuit.

2.1 Soft Errors in Sequential Circuits

The main contribution to the soft error rate (SER) in current microprocessors comes from sequential circuits. Sequential circuits here refer to storage elements in general, such as registers, memories and flip-flops. A soft error in these circuits may result in a bit flip in the saved state, which may lead to incorrect execution. Storage elements take up a large part of the chip area in modern microprocessors. As a result, most modern microprocessors already incorporate mechanisms for detecting soft errors, such as the triple modular redundancy technique.

2.2 Soft Errors in Combinational Circuits

A particle that strikes a p-n junction within a combinational circuit may alter the value produced by the circuit. However, a transient change in the combinational circuit will not affect the results of a computation unless it is captured by a sequential circuit, as shown in Fig. 1(a). Transient changes on the clock signal or the reset signal will definitely cause the circuit to execute incorrectly, as shown in Fig. 1(b).

Fig. 1. (a) Transient fault in the combinational circuit; (b) transient fault on the clock signal

Past research has shown that combinational logic is much less susceptible to soft errors than memory elements [3, 4], and that the probability of a glitch from the combinational circuit being captured by the sequential circuit is very small. As a result, the mechanisms that most modern microprocessors already incorporate for detecting soft errors typically focus on protecting sequential elements, particularly storage cells.

With the trends toward reduced feature sizes and lower supply and threshold voltages, the soft error tolerance of combinational logic circuits is affected more than that of memory elements. In addition, higher clock frequencies increase the chance of a glitch being captured by a sequential element. Even though the SER of combinational circuits is currently smaller than that of sequential elements, it is expected to rise by 9 orders of magnitude between 1992 and 2011, when it will equal the SER of unprotected memory elements [5]. For processors where the sequential elements have been protected, combinational logic will quickly become the dominant source of soft errors. Further research is required into methods for protecting combinational logic from soft errors.

3 Triple Modular Redundancy Technique

Triple Modular Redundancy [2, 6, 7] has been widely used to improve fault tolerance by protecting storage elements. All memory elements are triplicated and their respective outputs are connected to a voter, as shown in Fig. 2. The voter selects the output of the majority of the components, so if one component fails, the error is not reflected in the voter output. For each bit, the voter is implemented with a few logic gates, as can be seen in Fig. 3.

Fig. 2. Storage cell protected by TMR

Fig. 3. Voter architecture
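The majority selection described above can be sketched in a few lines of Python. This is only an illustration, assuming the usual AND-OR form of a bitwise majority function; the gate-level structure in Fig. 3 may differ, and the name `majority_vote` is hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant register values.

    Each output bit is 1 when at least two of the three input bits are 1
    (assumed AND-OR form of the voter, applied to every bit position).
    """
    return (a & b) | (a & c) | (b & c)


correct = 0b1011
upset = correct ^ 0b0100  # one copy suffers a single bit flip

# The two intact copies outvote the upset copy.
print(bin(majority_vote(correct, upset, correct)))  # 0b1011
```

If two copies carry the same wrong bit, the same function returns the wrong value, which is why TMR only masks a fault in one of the three components.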

TMR has proved effective in protecting memory elements, or sequential circuits. But the conventional TMR described above cannot mask glitches from the combinational circuit. As shown in Fig. 4, the redundant registers of conventional TMR are controlled by the same clock. When a glitch from the combinational circuit reaches the sequential circuit at the rising edge of the clock, all three registers capture the glitch. Similarly, when a soft error occurs on the clock signal or the reset signal, all the redundant storage cells will operate incorrectly.

Fig. 4. Architecture of the conventional TMR in detail (reset signal is omitted)

4 Space-Time Triple Modular Redundancy Technique

A simple method to improve the soft error tolerance of the combinational circuit is to reduce the chance of a glitch being captured by the sequential circuit. Building on the space redundancy of the conventional TMR (S-TMR), this paper proposes a new type of TMR that adds time redundancy. As shown in Fig. 5, the Space-Time Triple Modular Redundancy (ST-TMR) triplicates the clock and skews the clocks by a delay δ, which improves the fault tolerance of the combinational circuit. As long as the glitch width is smaller than the clock skew, even if a glitch from the combinational circuit is captured at the rising edge of one clock, the other two sequential elements will not capture it.

Fig. 5. Architecture of space-time triple modular redundancy (reset signal is omitted)

ST-TMR is also effective in masking soft errors on the clock signal and the reset signal because of the triplication.

Because the clocks are skewed, the voter of ST-TMR is modified to vote on the majority value only after all three clock signals are stable.
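The effect of the clock skew on a single glitch can be illustrated with a small timing model. The sketch below is a simplification, not the circuit of Fig. 5: it assumes ideal registers that sample exactly at their rising edge (no setup or hold time), clock edges at `edge`, `edge + skew` and `edge + 2 * skew`, and a data line that is wrong only while the glitch is present. The function names are hypothetical.

```python
def sampled_bits(glitch_start, glitch_width, edge, skew):
    """Return what the three registers sample (1 = glitch captured).

    Assumption: the registers are clocked at edge, edge + skew and
    edge + 2 * skew (all times in ns), and a register captures the
    glitch only if its edge falls inside the glitch interval.
    """
    edges = [edge + i * skew for i in range(3)]
    glitch_end = glitch_start + glitch_width
    return [1 if glitch_start <= t < glitch_end else 0 for t in edges]


def glitch_masked(glitch_start, glitch_width, edge, skew):
    """The voter masks the glitch if at most one register captured it."""
    return sum(sampled_bits(glitch_start, glitch_width, edge, skew)) <= 1


# S-TMR corresponds to skew = 0: all three registers share one edge.
print(glitch_masked(9.8, 0.5, 10.0, 0.0))  # False: all three capture it
# ST-TMR with a 2 ns skew and a 0.5 ns glitch: only one register is hit.
print(glitch_masked(9.8, 0.5, 10.0, 2.0))  # True
# A 3 ns glitch spans two of the skewed edges, so it is not masked.
print(glitch_masked(9.8, 3.0, 10.0, 2.0))  # False
```

In this model a glitch narrower than the skew can cover at most one of the three edges, which matches the condition stated above.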

5 Case Study: A Counter Protected under S-TMR and ST-TMR

Though S-TMR and ST-TMR have similar architectures, they differ in delay cost and fault tolerance capability. In terms of delay, ST-TMR is slightly worse than S-TMR. As shown in Fig. 4, the delay of the S-TMR circuit is:

$t_{ff} + \delta_{com} + \delta_{voter}$ (1)

And as shown in Fig. 5, the delay of ST-TMR is:

$t_{ff} + \delta_{com} + \delta_{voter} + 2\delta$ (2)

However, the increase in delay caused by ST-TMR can be negligible compared with the improvement in fault tolerance capability.

In order to compare the two types of TMR, we target our experiment on a special counter, as shown in Fig. 6. The counter is cleared when the reset signal is active. It increments by 1 at every rising edge of the clock signal if ‘sig_full’ is inactive. If ‘sig_full’ is active, it is instead set to ‘11…11’ at the rising edge of the clock. The register in the counter can be treated as a sequential circuit, while the ‘sig_full’ signal can be treated as the output of a combinational circuit. Thus any soft error in the combinational circuit can be simulated as a glitch on the ‘sig_full’ signal.

This counter is hardened using both S-TMR and ST-TMR. Soft errors are injected into the counter in order to compare the fault tolerance of the conventional TMR and ST-TMR. The counter is described in VHDL and synthesized for the Xilinx XCV300 [8].

Fig. 6. The architecture of a counter
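The behavior of the counter described above can be sketched as a small behavioral model. The original design is written in VHDL; the Python below is only an illustration, and the 8-bit width, the wrap-around on overflow and the `Counter` class itself are assumptions that the text does not specify.

```python
class Counter:
    """Behavioral sketch of the counter (unprotected, single register)."""

    def __init__(self, width: int = 8):
        self.width = width                 # assumed width, not given in the text
        self.all_ones = (1 << width) - 1   # the '11...11' value
        self.value = 0

    def reset(self) -> None:
        """Clear the counter, as when the reset signal is active."""
        self.value = 0

    def rising_edge(self, sig_full: bool) -> None:
        """Update the register at a rising clock edge."""
        if sig_full:
            self.value = self.all_ones
        else:
            # Wrap-around on overflow is an assumption of this sketch.
            self.value = (self.value + 1) & self.all_ones


counter = Counter()
for _ in range(3):
    counter.rising_edge(sig_full=False)
print(counter.value)  # 3

# A glitch on 'sig_full' that coincides with a clock edge corrupts the count.
counter.rising_edge(sig_full=True)
print(counter.value)  # 255
```

Because a captured glitch on ‘sig_full’ changes the count so visibly, the counter value is a convenient indicator of whether an injected fault was masked.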

5.1 Fault Tolerance of Sequential Circuits

Assuming that the ‘sig_full’ signal, the reset signal, the clock signal and the voter are fault free, we injected 1000 faults into the counter within 1 ms while it was running, in order to investigate the fault tolerance of the sequential circuit protected under S-TMR and ST-TMR. Faults are injected randomly: they can occur at any time during the 1 ms and in any of the three redundant registers.

As shown in Fig. 7, both S-TMR and ST-TMR are effective in protecting the sequential circuit. ST-TMR is slightly more effective than S-TMR, because the voter of ST-TMR only works when the three clocks are stable, so the chance of voting the incorrect value is reduced.

There are still some soft errors which cannot be masked by S-TMR or ST-TMR: those where two or more soft errors occur in different redundant registers during the same clock cycle. Because the sequential circuit only updates at the rising edge of the clock, in that case the voter will vote the incorrect value and the sequential circuit will update with the incorrect value at the following rising edge of the clock. However, the probability of this is very small. Furthermore, the fault tolerance increases as the clock frequency increases, because the probability of two or more soft errors occurring in different redundant registers during the same clock cycle decreases as the clock period decreases.

Fig. 7. Fault tolerance of the counter protected under S-TMR and ST-TMR: (a) the clock frequency is 100MHz; (b) the clock frequency is 50MHz. ‘Fault tolerance’ on the Y-axis is the ratio of correct executions to total executions, obtained from 10000 fault injection experiments.
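The dependence of unmaskable faults on the clock period can be illustrated with a simple counting model. The sketch below is not the fault injection setup of the experiment; it assumes that every fault upsets exactly one of the three register copies at a random time, and that a clock cycle is unmaskable when faults hit at least two different copies within it. The function names and the fixed random seed are hypothetical choices.

```python
import random
from collections import defaultdict


def unmaskable_cycles(faults, period_ns):
    """Count clock cycles in which >= 2 different register copies are hit.

    faults: iterable of (time_ns, register_index) pairs.
    """
    hit = defaultdict(set)
    for time_ns, register in faults:
        hit[int(time_ns // period_ns)].add(register)
    return sum(1 for registers in hit.values() if len(registers) >= 2)


def random_faults(count, duration_ns, rng):
    """Draw faults uniformly in time and uniformly over the three copies."""
    return [(rng.uniform(0, duration_ns), rng.randrange(3)) for _ in range(count)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
faults = random_faults(1000, 1_000_000, rng)  # 1000 faults in 1 ms
for period_ns in (10, 20):  # 100 MHz and 50 MHz
    print(period_ns, "ns period:", unmaskable_cycles(faults, period_ns))
```

Two faults that fall in the same 10 ns cycle always share a 20 ns cycle as well, while the converse does not hold, so in this model the shorter clock period exposes the voter to fewer coinciding faults.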

5.2 Fault Tolerance of Combinational Circuits

As mentioned above, ‘sig_full’ can be treated as the output of a combinational circuit, so glitches can be injected on this signal to simulate soft errors in the combinational circuit. Assuming that the redundant registers, the reset signal, the clock signal and the voter are fault free, 1000 glitches are randomly injected on ‘sig_full’ within 1 ms while the counter is running. Results are shown in Table 1. All the results would be much better at a lower fault rate, since 1000 faults in 1 ms is far too frequent.

Table 1. Fault tolerance of combinational circuits protected under S-TMR and ST-TMR with different clock skew, different glitch width and different clock frequency. δ is the clock skew.

(a) Clock frequency = 100 MHz

| Glitch width (ns) | 0.5 | 1 | 2 | 3 |
|---|---|---|---|---|
| S-TMR | 7% | 7% | 4% | 3% |
| ST-TMR (δ = 2 ns) | 99% | 97% | 31% | 17% |
| ST-TMR (δ = 4 ns) | 96% | 96% | 92% | 37% |

(b) Clock frequency = 50 MHz

| Glitch width (ns) | 0.5 | 1 | 2 | 3 |
|---|---|---|---|---|
| S-TMR | 13% | 13% | 13% | 9% |
| ST-TMR (δ = 2 ns) | 96% | 96% | 92% | 49% |
| ST-TMR (δ = 4 ns) | 97% | 97% | 89% | 87% |

The fault tolerance of the combinational circuit protected by S-TMR is far lower than that of the sequential circuit. Clock skew and glitch width each influence the fault tolerance of the combinational circuit, while clock frequency does not have the same effect.

There are two reasons why some soft errors still cannot be masked by ST-TMR. One is that soft errors in this experiment are injected so frequently that two or more glitches can occur in succession, at more than one rising edge of the clocks. The other is that the glitch is so wide that it covers the skew of the clock.

5.3 Fault Tolerance of the Clock (Reset) Signal

The clock signal and the reset signal are global signals of an IC. Any glitch on these signals may cause incorrect operation. In this experiment, 1000 glitches are randomly injected on the clock signal, assuming that the redundant registers, the ‘sig_full’ signal, the reset signal and the voter are fault free. Results are shown in Table 2.

Table 2. Fault tolerance of the clock signal of the circuit protected under S-TMR and ST-TMR with different clock skew, different glitch width and different clock frequency. δ is the clock skew.

(a) Clock frequency = 100 MHz

| Glitch width (ns) | 0.5 | 1 | 2 | 3 |
|---|---|---|---|---|
| S-TMR | 0% | 0% | 0% | 0% |
| ST-TMR (δ = 0.5 ns) | 95% | 95% | 95% | 91% |
| ST-TMR (δ = 1 ns) | 96% | 95% | 95% | 90% |
| ST-TMR (δ = 2 ns) | 96% | 96% | 91% | 91% |

(b) Clock frequency = 50 MHz

| Glitch width (ns) | 0.5 | 1 | 2 | 3 |
|---|---|---|---|---|
| S-TMR | 0% | 0% | 0% | 0% |
| ST-TMR (δ = 0.5 ns) | 79% | 79% | 76% | 70% |
| ST-TMR (δ = 1 ns) | 83% | 83% | 77% | 78% |
| ST-TMR (δ = 2 ns) | 87% | 87% | 90% | 85% |

Conventional TMR cannot mask glitches on the clock signal, while ST-TMR is much more effective. Experiments on the reset signal give similar results.

For the same reasons as in Section 5.2, soft errors which are injected too frequently cannot be masked by ST-TMR.

5.4 Fault Tolerance of the Whole Counter

In the sections above, the fault tolerance of the combinational circuit, the sequential circuit and the clock signal has been investigated independently. In this section, soft errors are injected into the whole counter, so that every part of the counter can be a source of soft errors. 1000 faults are injected randomly into the register, the ‘sig_full’ signal, the clock signal and the reset signal. Results are shown in Fig. 8. They show again that ST-TMR is more effective in protecting integrated circuits against soft errors.

Fig. 8. Fault tolerance of a counter protected under S-TMR and ST-TMR: (a) the clock frequency is 100MHz; (b) the clock frequency is 50MHz.

6 Conclusion

Current technology trends (increased clock frequencies, reduced feature sizes, reduced supply and threshold voltages) have a negative effect on the soft error tolerance of circuits. They will lead to a substantially more rapid increase in the soft error rate of combinational circuits than of sequential circuits. Computer systems and other electronic systems are increasingly used in harsh environments where soft errors occur frequently. Research is required on methods for protecting combinational circuits in order to improve the fault tolerance of the whole system.

In this paper, a new TMR technique based on both space redundancy and time redundancy is proposed. ST-TMR not only protects the sequential circuit, but also masks faults from the combinational circuit and the clock (reset) signal. A case study demonstrates that ST-TMR is much more effective in improving the fault tolerance and reliability of computer systems and other electronic systems, though it introduces a small delay penalty.

In our future work, the relationship between clock skew, clock frequency, glitch width and the frequency of injected faults will be discussed in detail. This will help in finding the appropriate clock skew to achieve better fault tolerance when the clock frequency and the glitch width are given.

7 References

[1] A. Johnston, “Scaling and Technology Issues for Soft Error Rates,” 4th Annual Research Conference on Reliability, Stanford University, Oct. 2000.

[2] D.P. Siewiorek and R. S. Swarz, “Reliable Computer Systems: Design and Evaluation,” Digital Press, 1992.

[3] J. Gaisler, “Evaluation of a 32-bit microprocessor with built in concurrent error-detection,” In Twenty-Seventh Annual International Symposium on Fault-Tolerant Computing, pp. 42–46, 1997.

[4] P. Liden, P. Dahlgren, R. Johansson, and J. Karlsson, “On Latching Probability of Particle Induced Transients in Combinational Networks,” In Proceedings of the 24th Symposium on Fault-Tolerant Computing (FTCS-24), pp. 340–349, 1994.

[5] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, “Modeling the effect of technology trends on the soft error rate of combinational logic,” Proceedings International Conference on Dependable Systems and Networks, pp. 389–398, 23–26 June 2002.

[6] C. Carmichael, “Triple Module Redundancy Design Techniques for the Virtex™ Series,” Xilinx Application Note XAPP197, 2001.

[7] R. Hentschke, F. Marques, F. Lima, L. Carro, A. Susin, R. Reis, “Analyzing area and performance penalty of protecting different digital modules with Hamming code and triple modular redundancy,” Integrated Circuits and Systems Design, 15th Symposium, pp. 95–100, Sept. 2002.

[8] Xilinx, Inc., “Virtex™ 2.5 V Field Programmable Gate Arrays,” Xilinx Datasheet DS003, v2.4, Oct. 2000.
