BulletProof A defect-tolerant cmp switch architecture
- 格式:pdf
- 大小:460.61 KB
- 文档页数:12
00-300L -30-2L I -H 1141 | 02/03/2020 14-08 | t e c h n i c a l c h a n g e s r e s e r v e d FS100-300L-30-2LI-H1141Flow SensorTechnical dataFS100-300L-30-2LI-H1141100001034Medium temperature-25…+85 °C Application areaImmersion sensorApplication area liquids Bar length (L1)45 mmImmersion depth (L)16.9 mm(when using the supplied adapter) Process Pressure300 barFlow MonitoringResponse time T096 s Response time T053 sStandard flow range3…300 cm/sAny axial alignment of the sensor rod in the mediumExtended flow range1…300 cm/sExtended flow range comment Directed inflow to punch mark ±20 °Reproducibility 0.2…5 cm/s; for water 3…100 cm/s; 10…80 °C Temperature drift 0.5 cm/s × 1/K Temperature gradient≤ 300 K/min Temperature monitoringMeasuring range -25…85 °CFeatures■Screw-in adapter with process connection G1/2’’ male thread included in delivery ■M18 × 1.5 female to G1/2 inch male thread ■Electronics housing material/medium contact 1.4404 (316L)/1.4571 (316Ti)■Immersion depth 16.9 mm■Process value display via LED bar ■Flow monitoring for liquid media■Protection classes IP6K6K, IP6K7 and IP6K9K ■Adjustment of flow speed via teach function ■17…33 VDC■Analog output 4…20 mA ■M12 × 1 male connectorWiring diagramFunctional principleThe flow sensor functions according to thecalorimetric principle. The distinctive feature of this principle is that the flow rate correlates directly to the thermal loss of energy in theprobe. The increased loss of energy is therefore a direct measure of an increased flow rate.00-300L -30-2L I -H 1141 | 02/03/2020 14-08 | t e c h n i c a l c h a n g e s r e s e r v e d Technical data00-300L -30-2L I -H 1141 | 02/03/2020 14-08 | t e c h n i c a l c h a n g e s r e s e r v e dTechnical dataMounting instructions00-300L -30-2L I -H 1141 | 02/03/2020 14-08 | t e c h n i c a l c h a n g e s r e s e r v e dsensors without a switching output only have MAX/MIN Teach.00-300L -30-2L I -H 1141 | 02/03/2020 14-08 | t e c h n i c a l c h a n g e s r e s e r v e d LED displayLED ColorStatusDescriptionPWR GreenOnOperating voltage applied Device is operational FLT Red On Error displayed(for error pattern in combination with LEDs see manual)Off No errors displayed LOC Yellow On Device locked OffDevice unlockedFlashing Locking/unlocking process activeFLOW Yellow Flashing Teach mode/display of diagnostic data (see manual for specification)TEMPYellowFlashingTeach mode/display of diagnostic data (see manual for specification)For a detailed description of the display patterns and flashing codes see manual/operating instructions FS100 — compact flow sensors (D100002658)Accessories Wiring accessoriesDimension drawingTypeIdent. no.RKC4.4T-2/TEL6625013Connection cable, female M12, straight,4-pin, cable length: 2 m, sheath material:PVC, black; cULus approval; other cable lengths and qualities available, see WKC4.4T-2/TEL 6625025Connection cable, female M12, angled,4-pin, cable length: 2 m, sheath material:PVC, black; cULus approval; other cable lengths and qualities available, see Accessories。
As a programmer,writing an English report can be a critical task,especially when it comes to communicating technical details,project progress,or issues to stakeholders who may not be as familiar with the intricacies of coding.Heres a detailed guide on how to write an effective English report for programmers:1.Title:Start with a clear and concise title that summarizes the content of the report.For example,Project X Development Progress Report.2.Introduction:Briefly introduce the purpose of the report.Mention the projects name and its objectives.State the reports scope and the period it covers.3.Project Overview:Provide a brief description of the project.Outline the main goals and deliverables.Mention the team members and their roles.4.Development Progress:Detail the progress made since the last report or since the projects inception.Use bullet points or numbered lists to organize the information.Include specific features or functionalities that have been implemented.5.Technical Details:Describe any technical challenges faced and how they were overcome.Explain the technologies used,including programming languages,frameworks,and tools.Provide code snippets or diagrams if necessary to illustrate complex concepts.6.Testing and Quality Assurance:Discuss the testing strategies employed,including unit tests,integration tests,and user acceptance testing.Mention the results of the tests and any issues that were identified.7.Project Milestones:List the major milestones achieved,including deadlines met and deliverables completed.Highlight any upcoming milestones or deadlines.8.Issues and Risks:Identify any current issues or risks that may impact the projects timeline or quality. Discuss the steps being taken to mitigate these risks.9.Team Performance:Evaluate the performance of the development team.Mention any challenges in team dynamics and how they are being addressed.10.Future Plans:Outline the next steps or phases of the project.Provide a tentative schedule for upcoming tasks and milestones.11.Conclusion:Summarize the key points of the report.Reiterate the projects current status and future outlook.12.Appendices:Include any additional documentation,such as detailed technical specifications,user manuals,or supplementary data.13.References:List any external sources or references used in the report.14.Formatting and Language:Use clear and professional language.Ensure consistency in formatting,such as font size,headings,and bullet points. Proofread the report for grammar and spelling errors.15.Distribution:Specify who the report is intended for,such as project managers,team members,or clients.Include a distribution list if necessary.Remember,the goal of a programmers report is to communicate complex information in a clear,concise,and accessible manner.Tailor the level of technical detail to your audiences understanding and ensure that the report is easy to navigate with a logical flow of information.。
美丽心灵英文剧本台词.txt人生重要的不是所站的位置,而是所朝的方向。
不要用自己的需求去衡量别人的给予,否则永远是抱怨。
看电影学英语 A Beautiful Mind 《美丽心灵》-Professor: Mathematicians won the war. Mathematicians broke the Japanese codes, and built the A-bomb.mathematician: 数学家 break: 破译(密码) code: 密码 a-bomb: 原子弹是数学家赢得了二次大战。
是数学家破解了日本的密码,建造了原子炸弹。
Mathematicians... like you. The stated goal of the Soviets is global Communism stated: 陈述的,说明的,决定了的 goal: 目标 Soviet: 苏联 global: 全球的,全世界的communism: 共产主义就是……像你们这样的数学家。
苏联所定的目标是让共产党布遍全球。
In medicine or economics, in technology or space, battle lines are being drawn. medicine: 医学,医药 economics: 经济学 technology: 科技 space: 宇宙,太空 battle: 战争 line: 线 drawn: draw的过去分词,描写,刻画无论在医药还是经济上,在科技或太空技术上,战线已经分明了。
To triumph, we need results- publishable, applicable results.triumph: 得胜,成功 result: 结果,成绩 publishable: 可发表的,可出版的 applicable: 可应用的,实用的想要胜利,我们就需要有成果—可以发表,实用的成果。
+plus加号;正号-minus减号;负号±plus or minus正负号×is multiplied by乘号÷is divided by除号=is equal to等于号≠is not equal to不等于号≡is equivalent to全等于号≌is equal to or approximately equal to等于或约等于号≈is approximately equal to约等于号<is less than小于号>is more than大于号≮is not less than不小于号≯is not more than不大于号≤is less than or equal to小于或等于号≥is more than or equal to大于或等于号%per cent百分之…‰per mill千分之…∞infinity无限大号∝varies as与…成比例√(square) root平方根∵since; because因为∴hence所以∷equals, as (proportion)等于,成比例∠angle角⌒semicircle半圆⊙circle圆○circumference圆周πpi 圆周率△triangle三角形⊥perpendicular to垂直于∪union of并,合集∩intersection of 交,通集∫the integral of …的积分∑(sigma) summation of总和°degree度′minute分″second秒℃Celsius system摄氏度{open brace, open curly左花括号}close brace, close curly右花括号(open parenthesis, open paren左圆括号)close parenthesis, close paren右圆括号() brakets/ parentheses括号[open bracket 左方括号]close bracket 右方括号[] square brackets方括号.period, dot句号,点|vertical bar, vertical virgule竖线& amp;ampersand, and, reference, ref和,引用*asterisk, multiply, star, pointer星号,乘号,星,指针/slash, divide, oblique 斜线,斜杠,除号//slash-slash, comment 双斜线,注释符#pound井号\backslash, sometimes escape反斜线转义符,有时表示转义符或续行符~tilde波浪符.full stop句号,comma逗号:colon冒号;semicolon分号?question mark问号!exclamation mark (英式英语) exclamation point (美式英语)'apostrophe撇号-hyphen连字号-- dash 破折号...dots/ ellipsis省略号"single quotation marks 单引号""double quotation marks 双引号‖parallel 双线号~swung dash 代字号§section; division 分节号→arrow 箭号;参见号break 跳出当前循环case 开关语句分支char 声明字符型变量或函数const 声明只读变量continue 结束当前循环,开始下一轮循环default 开关语句中的“其他”分支do 循环语句的循环体double 声明双精度变量或函数else 条件语句否定分支(与 if 连用)enum 声明枚举类型extern 声明变量是在其他文件正声明(也可以看做是引用变量)float 声明浮点型变量或函数for 一种循环语句(可意会不可言传)goto 无条件跳转语句if 条件语句inlineint 声明整型变量或函数long 声明长整型变量或函数register 声明积存器变量restrictreturn 子程序返回语句(可以带参数,也看不带参数)short 声明短整型变量或函数signed 生命有符号类型变量或函数sizeof 计算数据类型长度static 声明静态变量struct 声明结构体变量或函数switch 用于开关语句typedef 用以给数据类型取别名(当然还有其他作用)union 声明联合数据类型unsigned 声明无符号类型变量或函数void 声明函数无返回值或无参数,声明无类型指针(基本上就这三个作用)volatile 说明变量在程序执行中可被隐含地改变while 循环语句的循环条件C++语言中常用的关键字、类名、函数及其他常用单词main主要的,主函数bool定义布尔型变量break跳出循环或switch语句case switch语句中使用char 定义字符型变量class 定义类const 定义常量continue 继续,结束本次循环default 缺省的,用于switch语句delete 释放由new分配的内存do 执行,do while循环double 定义双精度实型变量if else 如果。
BulletProof:A Defect-Tolerant CMP Switch Architecture Kypros Constantinides‡Stephen Plaza‡Jason Blome‡Bin Zhang†Valeria Bertacco‡Scott Mahlke‡T odd Austin‡Michael Orshansky†‡Advanced Computer Architecture Lab†Department of Electrical and Computer Engineering University of Michigan University of T exas at AustinAnn Arbor,MI48109Austin,TX,78712{kypros,splaza,jblome,valeria,bzhang@mahlke,austin}@ orshansky@AbstractAs silicon technologies move into the nanometer regime,tran-sistor reliability is expected to wane as devices become subject to extreme process variation,particle-induced transient errors, and transistor wear-out.Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure.In this paper,we examine the challenges of designing complex computing systems in the presence of transient and per-manent faults.We select one small aspect of a typical chip multi-processor(CMP)system to study in detail,a single CMP router switch.To start,we develop a unified model of faults,based on the time-tested bathtub ing this convenient abstraction, we analyze the reliability versus area tradeoffacross a wide spec-trum of CMP switch designs,ranging from unprotected designs to fully protected designs with online repair and recovery capabil-ities.Protection is considered at multiple levels from the entire system down through arbitrary partitions of the design.To bet-ter understand the impact of these faults,we evaluate our CMP switch designs using circuit-level timing on detailed physical lay-outs.Our experimental results are quite illuminating.Wefind that designs are attainable that can tolerate a larger number of defects with less overhead than na¨ıve triple-modular redundancy, using domain-specific techniques such as end-to-end error detec-tion,resource sparing,automatic circuit decomposition,and it-erative diagnosis and reconfiguration.1.IntroductionA critical aspect of any computer design is its ers expect a system to operate without failure when asked to perform a task.In reality,it is impossible to build a completely reliable system,consequently,vendors target design failure rates that are imperceptibly small[23].More-over,the failure rate of a population of parts in thefield must exhibit a failure rate that does not prove too costly to service.The reliability of a system can be expressed as the mean-time-to-failure(MTTF).Computing system reli-ability targets are typically expressed as failures-in-time,or FIT rates,where one FIT represents one failure in a billion hours of operation.In many systems today,reliability targets are achieved by employing a fault-avoidance design strategy.The sources of possible computing failures are assessed,and the neces-sary margins and guards are placed into the design to en-sure it will meet the intended level of reliability.For exam-ple,most transistor failures(e.g.,gate-oxide breakdown)can be reduced by limiting voltage,temperature and frequency [8].While these approaches have served manufacturers well for many technology generations,many device experts agree that transistor reliability will begin to wane in the nanome-ter regime.As devices become subject to extreme process variation,particle-induced transient errors,and transistor wearout,it will likely no longer be possible to avoid these faults.Instead,computer designers will have to begin to directly address system reliability through fault-tolerant de-sign techniques.Figure1illustrates the fault-tolerant design space we fo-cus on in this paper.The horizontal axis lists the type of device-level faults that systems might experience.The source of failures are widespread,ranging from transient faults due to energetic particle strikes[32]and electrical noise[28],to permanent wearout faults caused by electro-migration[13],stress-migration[8],and dielectric break-down[10].The vertical axis of Figure1lists design solutions to deal with faults.Design solutions range from ignoring any possible faults(as is done in many systems today),to detecting and reporting faults,to detecting and correcting faults,andfinally fault correction with repair capabilities. Thefinal two design solutions are the only solutions that can address permanent faults,with thefinal solution being the only approach that maintains efficient operation after encountering a silicon defect.In recent years,industry designers and academics have paid significant attention to building resistance to transient faults into their designs.A number of recent publications have suggested that transient faults,due to energetic par-ticles in particular,will grow in future technologies[5,16].A variety of techniques have emerged to provide a capa-bility to detect and correct these type of faults in storage, including parity or error correction codes(ECC)[23],and logic,including dual or triple-modular spatial redundancy [23]or time-redundant computation[24]or checkers[30]. Additional work has focused on the extent to which circuit timing,logic,architecture,and software are able to mask out the effects of transient faults,a process referred to as “derating”a design[7,17,29].In contrast,little attention has been paid to incorporat-ing design tolerance for permanent faults,such as silicon de-fects and transistor wearout.The typical approach used to-day is to reduce the likelihood of encountering silicon faults through post-manufacturing burn-in,a process that acceler-ates the aging process as devices are subjected to elevated temperature and voltage[10].The burn-in process accel-erates the failure of weak transistors,ensuring that,after burn-in,devices still working are composed of robust tran-sistors.Additionally,many computer vendors provide the ability to repair faulty memory and cache cells,via the in-MANUFACTURINGMainstream Solutions High-end Solutions Specialized Solutionsas a result of relentless down-scaling[27].Transistor Infant Mortality.Scaling has had adverse effects on the early failures of transistor devices.Tradition-ally,early transistor failures have been reduced through the use of burn-in.The burn-in process utilizes high voltage and temperature to accelerate the failure of weak devices, thereby ensuring that parts that survive burn-in only pos-sess robust transistors.Unfortunately,burn-in is becoming less effective in the nanometer regime,as deep sub-micron devices are subject to thermal run-away effects,where in-creased temperature leads to increased leakage current and increased leakage current leads to yet higher temperatures. The end results is that aggressive burn-in will destroy even robust transistors.Consequently,vendors may soon have to relax the burn-in process which will ultimately lead to more early-failures for transistors in thefield.2.1The Bathtub:A Generic Model for Semicon-ductor Hard FailuresTo derive a simple architect-friendly model of failures,we step back and return to the basics.In the semiconductor industry,it is widely accepted that the failure rate for many systems follows what is known as the bathtub curve,as il-lustrated in Figure2.We will adopt this time-tested fail-ure model for our research.Our goal with the bathtub-curve model is not to predict its exact shape and magnitude for the future(although we willfit published data to it to create“design scenarios”),but rather to utilize the bath-tub curve to illuminate the potential design space for future fault-tolerant designs.The bathtub curve represents device failure rates over the entire lifetime of transistors,and it is characterized by three distinct regions.•Infant Period:In this phase,failures occur very soon and thus the failure rate declines rapidly over time.These infant mortality failures are caused by latent manufacturing defects that surface quickly if a tem-perature or voltage stress is applied.•Grace Period:When early failures are eliminated, the failure rate falls to a small constant value where failures occur sporadically due to the occasional break-down of weak transistors or interconnect.•Breakdown Period:During this period,failures oc-cur with increasing frequency over time due to age-related wearout.Many devices will enter this period at roughly the same time,creating an avalanche effect and a quick rise in device failure rates.With the respect to Figure2,the model is represented with the following equations:F G+λL109(t+1)m),if0≤t<t A F(t)=F G,if t A≤t<t BF G+(t−t B)b,if t B≤t(t is measured in hours)Where the parameters of the model are as follows:•λL:average number of latent manufacturing defects per chip•m:infant period maturing factor•F G:grace period failure rate•t B:breakdown period start point•b:breakdown factor,is the average number of“killer”defects per chip,andγis an empirically estimated parameter with typical values be-tween0.01and0.02.The same work,provides formulas for deriving the maximum number of latent manufacturing de-fects that may surface during burn-in test.Based on these models,the average number of latent manufacturing defects per chip(140mm2)for current technologies(λL)is approx-imately0.005.In the literature,there are no clear trends how this value changes with technology scaling,thus we use the same rate for projections of future technologies.Grace Period Failure Rate(F G):For the grace pe-riod failure rate,we use reference data by[26].In[26],a microarchitecture-level model was used to estimate workload-dependent processor hard failure rates at different technolo-gies.The model used supports four main intrinsic failure mechanisms experienced by processors:elegtromigration, stress migration,time-dependence dielectric breakdown,and thermal cycling.For a predicted post-65nm fabrication tech-nology,we adopt their worst-case failure rate(F G)of55,000 FITs.Breakdown Period Start Point(t B):Previous work [27],estimates the time to dielectric breakdown using ex-trapolation from the measurement conditions(under stress) to normal operation conditions.We estimate the break-down period start point(t B)to be approximately12years for65nm CMOS at1.0V supply voltage.We were unable to find any predictions as to how this value will trend for fab-rication technologies beyond65nm,but we conservatively assume that the breakdown period will be held to periods beyond the expected lifetime of the product.Thus,we need not address reliable operation in this period,other than to provide a limited amount of resilience to breakdown for the purpose of allowing the part to remain in operation until it can be replaced.The maturing factor during the infant mortality periodand the breakdown factor during the breakdown period used, are m=0.02and b=2.5,respectively.3.A Fault Impact Evaluation Infrastructure In[7],we introduced a high-fidelity simulation infrastruc-ture for quantifying various derating effects on a design’s overall soft-error rate.This infrastructure takes into account circuit level phenomena,such as time-related,logic-related and microarchitectural-related fault masking.Since many tolerance techniques for permanent errors can be adapted to also provide soft error tolerance,the remainder of this work concentrates on the exploration of various defect tol-erance techniques.3.1Simulation Methodology for Permanent Faults In this section,we present our simulation infrastructure for evaluating the impact of silicon defects.We create a model for permanent fault parameters,and develop the in-frastructure necessary to evaluate the effects on a given de-sign.of the uni-in-theworktwo de-sign intact (golden model),while the otheris subject to fault injection (defect-exposed model).The structural specification of our design was synthesized from a Verilog description using the Synopsys Design Compiler.Our silicon defect model distributes defects in the design uniformly in time of occurrence and spatial location.Once a permanent failure occurs,the design may or may not con-tinue to function depending on the circuit’s internal struc-ture and the system architecture.The defect analyzer clas-sifies each defect as exposed,protected or unprotected but masked.In the context of defect evaluation,faults accumu-late over time until the design fails to operate correctly.The defect that brings the system to failure is the last injected defect in each experiment and it is classified as exposed.A defect may be protected if,for instance,it is thefirst one to occur in a triple-module-redundant design.An unprotected but masked defect is a defect that it is masked because it occurs in a portion of the design that has already failed,forina issuch defect is considered exposed.Finally,to gain statistical confidence of the results,we run the simulations described many times in a Monte Carlo modeling framework.3.2Reliability of the Baseline CMP Switch Design Thefirst experiment is an evaluation of the reliability of the baseline CMP switch design.In Figure4,we used the bathtub curvefitted for the post-65nm technology node as derived in Section2.The FIT rate of this curve is55000 during the grace period,which corresponds to a mean time to failure(MTTF)of2years.We used this failure rate in our simulation framework for permanent failures and we plotted the results.The baseline CMP design does not deploy any protec-tion technique against defects,and one defect is sufficient to bring down the system.Consequently,the graph of Fig-ure4shows that in a large parts population,50%of the parts will be defective by the end of the second year after shipment,and by the fourth year almost all parts will have failed.In this experiment,we have also analyzed a design variant which deploys triple-module-redundancy(TMR)at the full-system level(i.e.three CMP switches with voting gates at their outputs.Designs with TMR applied at dif-ferent granularities are evaluated at Section5)to present better defect tolerance.The TMR model used in this analysis is the classical TMR model which assumes that when a module fails it starts pro-ducing incorrect outputs,and if two or more modules fail, the output of the TMR voter will be incorrect.This model is conservative in its reliability analysis because it does not take into account compensating faults.For example,if two faults affect two independent output bits,then the voter cir-cuit should be able to correctly mask both faults.However, the benefit gained from accounting for compensating faults rapidly diminishes with a moderate number of defects be-cause the probabilities of fault independence are multiplied.Further,though the switch itself demonstrated a moder-ate number of independent fault sites,submodules within the design tended to exhibit very little independence.Also, in[21],it is demonstrated that even when TMR is applied on diversified designs(i.e.three modules with the same func-tionality but different implementation),the probability of independence is small.Therefore,in our reliability analysis, we choose to implement the classical TMR model and for the rest of the paper whenever TMR is applied,the classical TMR model is assumed.From Figure4,the simulation-based analysisfinds that TMR provides very little reliability improvements over the baseline designs,due to the few number of defects that can be tolerated by system-level TMR.Furthermore,the area of the TMR protected design is more than three times the area of the baseline design.The increase in area raises the probability of a defect being manifested in the design,which significantly affects the design’s reliability.In the rest of the paper,we propose and evaluate defect-tolerant techniques that are significantly more robust and less costly than tra-ditional defect-tolerant techniques.4.Self-repairing CMP Switch DesignThe goal of the Bulletproof project is to design a defect-tolerant chip-multiprocessor capable of tolerating significant levels of various types of defects.In this work,we address the design of one aspect of the system,a defect tolerant CMP switch.The CMP switch is much less complex than a modern microprocessor,enabling us to understand the en-tire design and explore a large solution space.Further,this switch design contains many representative components of larger designs includingfinite state machines,buffers,con-trol logic,and buses.4.1Baseline DesignThe baseline design,consists of a CMP switch similar to the one described in[19].This CMP switch provides wormhole routing pipelined at theflit level and implements credit-basedflow control functionality for a two-dimensional torus network.In the switch pipeline,headflits will pro-ceed through routing and virtual channel allocation stages, while allflits proceed through switch allocation and switch traversal stages.A high-level block diagram of the router architecture is depicted in Figure5.The implemented router is composed of four functional modules:the input controller,the switch arbiter,the cross-bar controller,and the crossbar.The input controller is re-sponsible for selecting the appropriate output virtual chan-nel for each packet,maintaining virtual channel state infor-mation,and bufferingflits as they arrive and await virtual channel allocation.Each input controller is enhanced with an8-entry32-bit buffer.The switch arbiter allocates virtual channels to the input controllers,using a priority matrix to ensure that starvation does not occur.The switch arbiter also implementsflow control by tallying credit information used to determine the amount of available buffer space at downstream nodes.The crossbar controller is responsible for determining and setting the appropriate control signals so that allocatedflits can pass from the input controllers to the appropriate output virtual channels through the inter-connect provided by the crossbar.The router design is specified in Verilog and was synthe-sized using the Synopsys Design Compiler to create a gate-high levela con-level Thisdom-inate heavily84%of the tech-4.2A design that is tolerant to permanent defects must pro-vide mechanisms that perform four central actions related to faults:detection,diagnosis,repair,and recovery.Fault detection identifies that a defect has manifested as an error in some signal.Normal operation cannot continue after fault detection as the hardware is not operating properly.Often fault detection occurs at a macro-level,thus it is followed by a diagnosis process to identify the specific location of the defect.Following diagnosis,the faulty portion of the de-sign must be repaired to enable proper system functionality. Repair can be handled in many ways,including disabling, ignoring,or replacing the faulty component.Finally,the system must recover from the fault,purging any incorrect data and recomputing corrupted values.Recovery essen-tially makes the defect’s manifestation transparent to the application’s execution.In this section,we discuss a range of techniques that can be applied to the baseline switch to make it tolerant of permanent defects.The techniques differ in their approach and the level at which they are applied to the design.In[9]the authors present the Reliable Router(RR),a switching element design for improved performance and re-liability within a mesh interconnect.The design relies on an adaptive routing algorithm coupled with a link level re-transmission protocol in order to maintain service in the presence of a single node or link failure within the network. Our design differs from the RR in that our target domain involves a much higher fault rate and focuses on maintaining switch service in the face of faults rather than simply routing around faulty nodes or links.However,the two techniques can be combined and provide a higher reliability multipro-cessor interconnection network.4.2.1General TechniquesThe most commonly used protection mechanisms are dual and triple modular redundancy,or DMR and TMR[23]. These techniques employ spatial redundancy combined with a majority voter.With permanent faults,DMR provides only fault detection.Hence,a single fault in either of the redundant components will bring the system down.TMRis more effective as it provides solutions to detection,re-covery,and repair.In TMR,the majority voter identifies a malfunctioning hardware component and masks its affects on the primary outputs.Hence,repair is trivial since the defective component is always just simply outvoted when it computes an incorrect value.Due to this restriction,TMR is inherently limited to tolerating a single permanent fault. Faults that manifest in either of the other two copies can-not be handled.DMR/TMR are applicable to both state and logic elements and thus are broadly applicable to our baseline switch design.Storage or state elements are often protected by parity or error correction codes(ECC)[23].ECC provides a lower overhead solution for state elements than TMR.Like TMR, ECC provides a unified solution to detection and recovery. Repair is again trivial as the parity computation masks the effects of permanent faults.In addition to the storage over-head of the actual parity bits,the computation of parity or ECC bits generally requires a tree of exclusive-ORs.This hardware has moderate overhead,but more importantly,it can often be done in parallel,thus not affecting latency.For our defect-tolerant switch,the application of ECC is limited due to the small fraction of area that holds state.4.2.2Domain-specific TechniquesThe properties of the wormhole router can be exploited to create domain-specific protection mechanisms.Here,we focus on one efficient design that employs end-to-end error detection,resource sparing,system diagnosis,and reconfig-uration.End-to-End Error Detection and Recovery Mech-anism.Within our router design,errors can be separated into two major classes.Thefirst class is comprised of data corrupting errors,for example a defect that alters the data of a routedflit,so that the routedflit is permanently cor-rupted.The second class is comprised of errors that cause functional incorrectness,for example a defect that causes a flit to be misrouted to a wrong output channel or to get lost and never reach any of the switch’s output channels.Thefirst class of errors,the data corrupting errors,can be addressed by adding Cyclic Redundancy Checkers(CRC)at each one of the switch’sfive output channels,as shown in Figure6(a).When an error is detected by a CRC checker, all CRC checkers are notified about the error detection and block any furtherflit routing.The same error detection sig-nal used to notify the CRC checkers also notifies the switch’s recovery logic.The switch’s recovery logic logs the error oc-currence by incrementing an error counter.In case the error counter surpasses a predefined threshold,the recovery logic signals the need for system diagnosis and reconfiguration.In case the error counter is still below the predefined threshold,the switch recovers its operation from the last “checkpointed”state,by squashing all inflightflits and rerout-ing the corruptedflit and all followingflits.This is ac-complished by maintaining an extra recovery head pointer at the input buffers.As shown in Figure6(a),each input buffer maintains an extra head pointer which indicates the lastflit stored in the buffer which is not yet checked by a CRC checker.The recovery head pointer is automatically incremented four cycles after the associated input controller grants access to the requested output channel,which is the latency needed to route theflit through the switch,once access to the destination channel is granted.In case of ais shown.Flits are split into two parts,which are independently routed through the switch pipeline. switch recovery,the recovery head pointer is assigned to the head pointer for allfive input buffers,and the switch recov-ers operations by starting rerouting theflits pointed by the head pointers.Further,the switch’s credit backflow mecha-nism needs to be adjusted accordingly since an input buffer is now considered full when the tail pointer reaches the re-covery head pointer.In order for the switch’s recovery logic to be able to distinguish soft from hard errors,the error counter is reset to zero at regular intervals.The detection of errors causing functional incorrectness is considerably more complicated because of the need to be able to detect misrouted and lostflits.A critical issue for the recovery of the system is to assure that there is at least one uncorrupted copy for eachflit inflight in the switch’s pipeline.This uncorruptedflit can then be used during re-covery.To accomplish this,we add a Buffer Checker unit to each input buffer.As shown in Figure6(b),the Buffer Checker unit compares the CRC checked incomingflit with the lastflit allocated into the input buffers(tailflit).Fur-ther,to guarantee the input buffer’s correct functionality, the Buffer Checker also maintains a copy of the head and the tail pointers which are compared with the input buffer’s pointers whenever a newflit is allocated.In the case that the comparison fails,the Buffer Checker signals an alloca-tion retry,to cover the case of a soft error.If the error per-sists,this means that there is a potential permanent errorin the design,and it signals the system diagnosis and recon-figuration procedures.By assuring that a correct copy of theflit is allocated into the input buffers and that the input buffer’s head/tail pointers are maintained correctly,we guar-antee that eachflit entering the switch will correctly reach the head of the queue and be routed through the switch’s pipeline.To guarantee that aflit will get routed to the correct out-put channel,theflit is split into two parts,as shown in Figure6(b).Each part will get its output channel requests from a different routing logic block,and access the requested output channel through a different switch arbiter.Finally, each part is routed through the cross-bar independently.To accomplish this,we add an extra routing logic unit and an extra switch arbiter.The status bits in the input controllers that store the output channel reserved by the headflit are duplicated as well.Since the cross-bar routes theflits at the bit-level,the only difference is that the responses to the cross-bar controller from the switch arbiter will not be the same for all theflit bits,but the responses for thefirst and the second parts of theflit arefitted from thefirst and sec-ond switch arbiters,respectively.If a defect causes aflit to be misrouted,it follows that a single defect can impact only one of the two parts of theflit,and the error will be caught later at the CRC check.The area overhead of the proposed error detection and recovery mechanism is limited to only10%of the switch’s area.The area overhead of the CRC checkers,the Recov-ery Logic and the Buffer Checker units is almost negligible. More specifically,the area of a single CRC checker is0.1% of the switch’s area and the area for the Buffer Checker and the Recovery Logic is much less significant.The area overhead of the proposed mechanism is dominated by the extra Switch Arbiter(5.7%),the extra Routing Logic units (5x0.5%=2.5%),and the additional CRC bits(1.5%).As we can see,the proposed error detection and recovery mech-anism has a10X times less area overhead than a na¨ıve DMR implementation.Resource Sparing.For providing defect tolerance to the switch design we use resource sparing for selected partitions of the switch.During the switch operation only one spare is active for each distinct partition of the switch.For each spare added in the design,there is an additional overhead for the interconnection and the required logic for enabling and disabling the spare.For resource sparing,we study two different techniques,dedicated sparing and shared sparing. In the dedicated sparing technique,each spare is owned by a single partition and can be used only when the specific par-tition fails.When shared sparing is applied,one spare can be used to replace a set of partitions.In order for the shared sparing technique to be applied,it requires multiple identi-cal partitions,such as the input controllers for the switch design.Furthermore,each shared spare requires additional interconnect and logic overhead because of its need of hav-ing the ability to replace more than one possible defective partitions.System Diagnosis and Reconfiguration.As a system diagnosis mechanism,we propose an iterative trial-and-error method which recovers to the last correct state of the switch, reconfigures the system,and replays the execution until no error is detected.The general concept is to iterate through each spared partition of the switch and swap in the spare for the current copy.For each swap,the error detection andde-in ofthe will be thisthis disabled.In order for the system diagnosis to operate,it maintains a set of bit vectors as follows:•Configuration Vector:It indicates which spare parti-tions are enabled.•Reconfiguration Vector:It keeps track of which config-urations have been tried and indicates the next config-uration to be tried.It gets updated at each iteration of system diagnosis.•Defects Vector:It keeps track of which spare partitions are defected.•Trial Vector:Indicates which spare partitions are en-abled for a specific system diagnosis iteration. Figure7,demonstrates an example where system diagno-sis is applied on a system with four partitions and two copies (one spare)for each partition.Thefirst copy of partition B has a detected defect(mapped at the Defects Vector).The defect in thefirst copy of partition D is a recently manifested defect and is the one that caused erroneous execution.Once the error is detected,the system recovers to the last correct state using the mechanism described in the previous section (see error detection and recovery mechanism),and it ini-tializes the Reconfiguration Vector.Next,the Trial Vector is computed using the Configuration,Reconfiguration,and Defects vectors.In case the Trial Vector is the same with the Configuration Vector(attempt2),due to a defected spare, the iteration is skipped.Otherwise,the Trial Vector is used as the current Configuration vector,indicating which spare partitions will be enabled for the current trial.The execu-tion is then replayed from the recovery point until the error detection point.In case the error is detected,a new trial is initiated by updating the Reconfiguration Vector and re-computing the Trial Vector.In case no error is detected, meaning that the trial configuration is a working configura-tion,the Trial Vector is copied to the Configuration Vector and the Defects Vector is updated with the located defected copy.If all the trial configurations are exhausted,which are equal to the number of partitions,and no working configu-ration was found,then the defect was a fatal defect and the。