Peking University, Department of Computer Science: Advanced Computer Architecture course slides (chx14_arch08_mm)
- Format: pdf
- Size: 5.34 MB
- Pages: 78
Introduction to Advanced Computer Architecture (Lecture 1)
Cheng Xu (程旭)
February 17, 2014

Textbook and Instructors
°Main textbook: Computer Architecture: A Quantitative Approach, 4th (2006) or 5th Edition (2012), Hennessy and Patterson
°Instructors:
  •Cheng Xu (程旭), Microprocessor Research and Development Center, Peking University
  •Liu Xianhua (刘先华), Microprocessor Research and Development Center, Peking University
°Teaching assistant: TBD
°Time and place: Mondays, 15:10 to 18:00, Room 203, Teaching Building 2

Course Goal
Learn and master the design techniques, machine structures, technology factors, and evaluation methods that will determine the concrete form of computers in the 21st century.
[Figure: computer architecture (instruction set design, organization, hardware) at the center, surrounded by technology, programming languages, operating systems, history, applications, hardware/software interface design (ISA), measurement and evaluation, and parallelism.]
[Figure: the same diagram with the surrounding factors technology, programming models, business models, history, applications, architectural design patterns, and measurement and evaluation.]

Computer Architecture
  •What do computer applications need?
  •What functional support does the operating system need?
  •Which features can an optimizing compiler exploit and implement?
  •What kinds of machines can we build?
  •What will future computers look like?
°Computer architecture researchers must have broad and deep professional knowledge!

Position of This Course in the Curriculum
[Figure: prerequisite map. Computer fundamentals: digital logic (basic logic units), computer organization and architecture (processor fundamentals), operating systems (memory management, scheduling, concurrency), compilers (code generation, optimization), data structures, application fundamentals, C programming.]
°Earlier courses: how things are implemented, the concrete details (knowing what).
°Advanced Computer Architecture:
  1. Analysis and evaluation (knowing why)
  2. Parallel computer architecture

Upheaval in Computer Design
°Most of last 50 years, Moore's Law ruled
  •Technology scaling allowed continual performance/energy improvements without changing the software model
°Last decade, technology scaling slowed/stopped
  •Dennard scaling over (supply voltage roughly fixed)
  •Moore's Law (cost/transistor) over?
  •No competitive replacement for CMOS anytime soon
  •Energy efficiency constrains everything
°No "free lunch" for software developers, who must consider:
  •Parallel systems
  •Heterogeneous systems

Today's Dominant Target Systems
°Mobile (smartphone/tablet)
  •>1 billion sold/year
  •Market dominated by ARM-ISA-compatible general-purpose processor in system-on-a-chip (SoC)
  •Plus sea of custom accelerators (radio, image, video, graphics, audio, motion, location, security, etc.)
°Warehouse-Scale Computers (WSCs)
  •100,000s of cores per warehouse
  •Market dominated by x86-compatible server chips
  •Dedicated apps, plus cloud hosting of virtual machines
  •Starting to see some GPU usage, but mostly general-purpose CPU code
°Embedded computing
  •Wired/wireless network infrastructure, printers
  •Consumer TV/Music/Games/Automotive/Camera/MP3

Charles Babbage (1791-1871)
°Lucasian Professor of Mathematics, Cambridge University, 1828-1839
°A true "polymath" with interests in many areas
°Frustrated by errors in printed tables, wanted to build machines to evaluate and print accurate tables
°Inspired by earlier work organizing human "computers" to methodically calculate tables by hand
[Image: copyright expired and in public domain. Obtained from Wikimedia Commons.]

Charles Babbage
°Difference Engine 1823
°Analytical Engine 1833
  •The forerunner of the modern digital computer!
Application
  -Mathematical tables: astronomy
  -Nautical tables: navy
Background
  -Any continuous function can be approximated by a polynomial (Weierstrass)
Technology
  -Mechanical: gears, Jacquard's loom, simple calculators

Difference Engine
A machine to compute mathematical tables
Weierstrass:
  •Any continuous function can be approximated by a polynomial
  •Any polynomial can be computed from difference tables
An example:
  f(n)  = n^2 + n + 41
  d1(n) = f(n) - f(n-1) = 2n
  d2(n) = d1(n) - d1(n-1) = 2
  f(n)  = f(n-1) + d1(n) = f(n-1) + (d1(n-1) + 2)
All you need is an adder!

  n        0    1    2    3    4
  d2(n)              2    2    2
  d1(n)         2    4    6    8
  f(n)    41   43   47   53   61

[Figure: Babbage's Difference Engine 1, 1832]

Analytical Engine
1833: Babbage's paper was published
  •Conceived during a hiatus in the development of the Difference Engine
Inspiration: Jacquard looms
  •Looms were controlled by punched cards
    -The set of cards with fixed punched holes dictated the pattern of weave => program
    -The same set of cards could be used with different colored threads => numbers
1871: Babbage dies
  •The machine remains unrealized.
It is not clear whether the Analytical Engine could be built even today using only mechanical technology.
[Figure: Babbage's Difference Engine 2 and Analytical Engine]

1834 Babbage Analytical Engine
[Figure: block diagram with the Mill, the Store, Printer, Punch, and a program made of Operation Cards and Variable Cards]
  •The Store: memory unit consisting of counter wheels
  •The Mill: the arithmetic unit, capable of four operations; it used a pair of registers and produced results stored in another register in the Store
  •Operation Cards: specified one of four operations
  •Variable Cards: specified the memory location to be used
  •Output: printer or punch

Analytical Engine
The first conception of a general-purpose computer
  1. The store, in which all variables to be operated upon, as well as all those quantities which have arisen from the results of the operations, are placed.
  2. The mill, into which the quantities about to be operated upon are always brought.
The program:
  Operation  variable1  variable2  variable3
An operation in the mill required feeding two punched cards and producing a new punched card for the store.
An operation to alter the sequence was also provided!

The First Programmer
Ada Byron, aka "Lady Lovelace", 1815-52
Ada's tutor was Babbage himself!

1937, Alan Turing
While not using the practical technology of the era, Alan Turing developed the idea of a "Universal Machine" capable of executing any describable algorithm, forming the basis for the concept of "computability". Perhaps more importantly, Turing's ideas differed from those of others who were solving arithmetic problems by introducing the concept of "symbol processing".

The First General-Purpose Electronic Computer: ENIAC (February 14, 1946)
Electronic Numerical Integrator and Computer
J. Presper Eckert & John Mauchly, Moore School, University of Pennsylvania
°Size: 80 feet long, 8.5 feet high
°18,000 vacuum tubes
°5,000 additions/sec
°The world's first general-purpose electronic computer
°Conditional jumps and programmability distinguished it from earlier machines
°Used for computing artillery firing tables
[Figure: an ENIAC accumulator, 28 vacuum tubes]

WW-2 Effort
ENIAC's application: ballistic calculations
  angle = f(location, tail wind, cross wind, air density, temperature, weight of shell, propellant charge, ...)

ENIAC Was NOT a "Stored Program" Device
°For each problem, someone analyzed the arithmetic processing needed and prepared wiring diagrams for the human "computers" to use when wiring the machine
°The process was time-consuming and error-prone
°Cleaning personnel often knocked cables out of their place and just put them back somewhere
[Figure: wiring the machine]

Electronic Discrete Variable Automatic Computer (EDVAC)
°ENIAC's programming system was external
  •Sequences of instructions were executed independently of the results of the calculation
  •Human intervention was required to take instructions "out of order"
°Eckert, Mauchly, John von Neumann and others designed EDVAC (1944) to solve this problem
  •The solution was the stored program computer: "program can be manipulated as data"
°The First Draft of a Report on the EDVAC was published in 1945, but carried only von Neumann's signature!
  •In 1973 the court of Minneapolis attributed the honor of inventing the computer to John Atanasoff

The von Neumann Machine: Stored Program Computer
IAS (Institute for Advanced Study) Computer, 1946
[Figure: Main Memory connected to the Arithmetic Logic Unit and the Program Control Unit, which connect to the I/O Equipment]
The stored-program idea: the instructions that make up a program are placed in memory ahead of time, just like data, and the computer then fetches and executes them one by one on its own.
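The stored-program idea described above can be sketched in a few lines of Python. This is only an illustrative toy, not the IAS or EDVAC design: the function name `run`, the four opcodes (LOAD, ADD, STORE, HALT), and the memory layout are all hypothetical choices made for this example. The point it shows is that instructions and data sit in one memory and the machine fetches instructions from it one at a time.

```python
def run(memory):
    """Execute a toy stored program; `memory` holds instructions and data alike."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # single accumulator register
    while True:
        opcode, address = memory[pc]   # fetch the instruction from memory
        pc += 1
        if opcode == "LOAD":
            acc = memory[address]      # read a data cell into the accumulator
        elif opcode == "ADD":
            acc += memory[address]
        elif opcode == "STORE":
            memory[address] = acc      # write the accumulator back to memory
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")


# Hypothetical layout: cells 0-3 hold the program, cells 4-6 hold data.
program = [
    ("LOAD", 4),
    ("ADD", 5),
    ("STORE", 6),
    ("HALT", 0),
    2,
    3,
    0,
]
result = run(program)
print(result[6])  # 5
```

Because the program is ordinary memory content, a STORE aimed at cells 0 to 3 would overwrite the program itself, which is the sense in which a "program can be manipulated as data".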
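Advanced Computer Architecture: Main Memory (Lecture 8)
Cheng Xu (程旭)
May 5, 2014

The Processor-Main Memory (DRAM) Latency Gap
°The performance (1/latency) gap grew by about 50% per year
°How do architects address this gap?
  •Put small, fast "cache" memories between CPU and DRAM.

Main Memory System Performance
°Latency: mainly related to the cache miss penalty
  •Access time: time between a request and the arrival of the word
  •Cycle time: minimum time between successive requests
°Bandwidth: mainly related to I/O performance
  •Bandwidth also matters for cache performance (miss penalty of large blocks, e.g. an L2 cache)
°Improving bandwidth is easier than reducing latency
°Improving memory system performance at the system (board) level is limited
°So improve memory system performance inside the chip

Core Memories (1950s & 60s)
°Core memory stored data as magnetization in iron rings
  •Iron "cores" woven into a 2-dimensional mesh of wires
  •Origin of the term "dump core"
°See: /acis/history/core.html
[Figure: the first magnetic core memory, from the IBM 405 Alphabetical Accounting Machine.]

Random Access Memory (RAM) Technology
°Why do computer designers need to understand RAM technology?
  •Processor performance is usually limited by memory bandwidth
  •As integrated-circuit density increases, some memory is integrated on the same chip as the processor
    -On-chip memory to meet special needs
    -Instruction cache
    -Data cache
    -Write buffer
°Why not build RAM out of flip-flops?
  •Density: RAM needs a much higher density

Static RAM Cell
6-transistor SRAM cell
[Figure: cross-coupled cell connected to the bit lines bit and bit_bar (the complement line) through pass transistors gated by word (row select)]
°Write:
  1. Drive the bit lines (bit, bit_bar)
  2. Select the row
°Read:
  1. Precharge both bit lines so that bit = bit_bar = Vdd
  2. Select the row
  3. The cell pulls one of the lines low
  4. The sense amplifier on the column detects the difference between bit and bit_bar

Typical SRAM Organization: 16 Words x 4 Bits
[Figure: a 16 x 4 array of SRAM cells. An address decoder driven by A0-A3 selects one of Word 0 to Word 15. Each of the four columns has a write driver and precharger fed by Din 0 to Din 3 (controlled by WrEn and Precharge) and a sense amplifier producing Dout 0 to Dout 3.]

Logic Diagram of a Typical SRAM
[Figure: 2^N words x M bit SRAM with an N-bit address input A, an M-bit data pin D, WE_L, and OE_L]
°Write enable is usually active low (WE_L)
°Din and Dout are combined into one pin, D:
  •A new control signal is needed: output enable (OE_L)
  •WE_L asserted (low), OE_L deasserted (high)
    -D is the data input
  •WE_L deasserted (high), OE_L asserted (low)
    -D is the data output
  •Both WE_L and OE_L asserted:
    -The result is undefined. Never do this!!!

Typical SRAM Timing
[Figure: timing diagrams for the 2^N words x M bit SRAM. Write timing: the write address on A and Data In on D must be stable for the write setup time before, and the write hold time after, the WE_L pulse. Read timing: with OE_L asserted, D leaves High-Z and carries valid Data Out one read access time after each read address is applied; before that it carries junk.]

A Closer Look at the SRAM Cell
[Figure: 6-transistor SRAM cell with transistors N1, N2, P1, P2 between bit and bit_bar, gated by word (row select)]
°An SRAM usually has many words (rows)
  •The bit lines are therefore long and have a large capacitance
  •Transistors N1, N2, P1, and P2 must be very small
°Transistors N1 and P1 do not have enough drive to switch the bit line quickly:
  •A sense amplifier must be added to compare bit and bit_bar

Problems with SRAM
°Six transistors take up a lot of chip area
°Suppose a "0" is stored in a cell:
  •Transistor N1 tries to pull bit to 0
  •Transistor P2 tries to pull bit_bar to 1
°But the bit lines are all precharged high, so are the P-type transistors really necessary?
[Figure: cell storing 0 with Select = 1, bit = 1, bit_bar = 0; N1 on, P1 off, N2 off, P2 on; both pass transistors on]

Problems with SRAM (continued)
°The P-type transistor (P2) serves three functions:
  •During a read, it drives the bit line high (Select = 1)
  •It keeps the gate of N1 high until the next write
  •During a read, it prevents the gate capacitance of N1 from leaking all of its charge onto the bit line
[Figure: the gate capacitance of N1 is set high when a "0" is written into the cell; P2 keeps it high until the next write]

4-Transistor RAM Cell
[Figure: 4-transistor dynamic RAM cell between bit lines b and b_bar, gated by Row Select, compared with the SRAM cell]
°Read:
  1. Precharge so that b = b_bar = Vdd
  2. Select the row
  3. Sense
  4. Amplify the data
  5. Write back: the charge consumed while reading the data must be restored
°Refresh:
  •A dummy read cycle
°Write:
  1. Drive the bit lines
  2. Select the row
°Advantage:
  •Smaller: removes 2 load devices and 1 supply connection
°Disadvantages:
  •Adds refresh cycles
  •Reduces noise immunity

1-Transistor Cell
[Figure: a single transistor gated by row select connects a storage capacitor to the bit line]
°Write:
  1. Drive the bit line
  2. Select the row
°Read:
  1. Precharge the bit line to Vdd
  2. Select the row
  3. The cell and the bit line share charge
    -Only a very small voltage change appears on the bit line
  4. Sense (a very sensitive sense amplifier)
    -Can detect a change of roughly one million electrons
  5. Write: restore the stored value
°Refresh:
  1. Only a dummy read of every cell is needed

Introduction to DRAM
°Dynamic RAM (DRAM):
  •Needs refresh
  •Very high density
  •Very low power (0.1 to 0.5 W active, 0.25 to 10 mW standby)
  •Very low cost per bit
  •Pin-sensitive; control pins:
    -Output enable (OE_L)
    -Write enable (WE_L)
    -Row address strobe (RAS)
    -Column address strobe (CAS)
[Figure: cell array of N bits addressed by row and column (about log2 N address bits in total), with a sense amplifier driving D. Note: a single sense amplifier means lower power and smaller area.]

Classical DRAM Organization
[Figure: a row decoder driven by the row address selects a word (row) line in the RAM cell array; the bit (data) lines feed the column selector and I/O circuits, which are driven by the column address and deliver the data. Each intersection represents one 1-transistor DRAM cell.]
°Row and column address together:
  •Select one bit at a time

Typical DRAM Organization
°Typical DRAMs access multiple bits in parallel
  •Example: 2 Mb DRAM = 256K x 8 = 512 rows x 512 columns x 8 bits
  •The row and column addresses are applied to all 8 planes in parallel
[Figure: eight 256 Kb DRAM planes (plane 0 to plane 7), each 512 rows x 512 columns, producing D<0> to D<7>]

Logic Diagram of a Typical DRAM
[Figure: 256K x 8 DRAM with a 9-bit address input A, an 8-bit data pin D, and control inputs RAS_L, CAS_L, WE_L, OE_L]
°The control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
°Din and Dout are combined into one pin, D:
  •WE_L asserted (low), OE_L deasserted (high)
    -D is the data input pin
  •WE_L deasserted (high), OE_L asserted (low)
    -D is the data output pin
°Row and column addresses share the same pins (A)
  •RAS_L goes low: the value on A is latched as the row address
  •CAS_L goes low: the value on A is latched as the column address

DRAM Architecture
[Figure: N row-address bits feed the row address decoder, which drives 2^N word lines (Row 1, Row 2, ...); M column-address bits feed the column decoder and sense amplifiers across 2^M bit lines (Col. 1, Col. 2, ...); a memory cell (one bit) sits at each intersection; the address is N+M bits wide and the data pin is D.]
  •Bits are stored in 2-dimensional arrays on chip
  •Modern chips have around 4 logical banks on each chip
    -Each logical bank is physically implemented as many smaller arrays

DRAM Operation: Three Steps
°Precharge
  •Charges bit lines to a known value; required before the next row access
°Row access (RAS)
  •Decode row address, enable addressed row (often multiple Kb in a row)
  •Bit lines share charge with the storage cells
  •The small change in voltage is detected by sense amplifiers, which latch the whole row of bits
  •Sense amplifiers drive bit lines full rail to recharge the storage cells
°Column access (CAS)
  •Decode column address to select a small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
  •On read, send latched bits out to chip pins
  •On write, change sense amplifier latches, which then charge storage cells to the required value
  •Can perform multiple column accesses on the same row without another row access (burst mode)

DRAM Write Cycle
[Figure: write timing for the 256K x 8 DRAM (A: 9 bits, D: 8 bits). A carries the row address, latched by RAS_L, then the column address, latched by CAS_L; D carries Data In; the WR access time is marked.]
°Every DRAM access begins with the assertion of RAS_L
°Early write cycle: WE_L asserted before CAS_L
°Late write cycle: WE_L asserted after CAS_L

DRAM Read Cycle
[Figure: read timing for the 256K x 8 DRAM. A carries the row address, then the column address; D stays High-Z until Data Out becomes valid after the read access time and the output enable delay.]
°Every DRAM access begins with the assertion of RAS_L
°Early read cycle: OE_L asserted before CAS_L
°Late read cycle: OE_L asserted after CAS_L
[Figure: DRAM read operation timing]

Main Memory Performance (Cycle Time vs. Access Time)
°DRAM (read/write) cycle time >> DRAM (read/write) access time
°DRAM (read/write) cycle time:
  •How frequently can we start a memory access?
  •Analogy: we can watch Olympic football only in the summer of years divisible by 4
°DRAM (read/write) access time:
  •Once we start an access, how long until we get the data?
  •Analogy: during the Olympics, once we want to watch, we wait at most a day for the next match
°DRAM bandwidth limitation:
  •Analogy: what if we also want to watch new world-class football in 2014?
[Figure: timeline showing the access time as a fraction of the cycle time]

Increasing Bandwidth: Interleaving
°Access pattern without interleaving (CPU, one memory): start access for D1, D1 available, then start access for D2
°Access pattern with 4-way interleaving (CPU, Memory Bank 0 to Memory Bank 3): access bank 0, access bank 1, access bank 2, access bank 3; by then bank 0 can be accessed again

Main Memory Performance
°Simple: CPU, cache, bus, and memory are all the same width (32 or 64 bits)
°Wide: CPU/Mux 1 word; Mux/cache, bus, and memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512 bits)
°Interleaved: CPU, cache, and bus 1 word; memory has N banks (4 modules); the example is word interleaved
[Figure: three organizations. Solution 1: high-bandwidth DRAM (CPU, cache, bus, memory). Solution 2: a wide data path between memory and cache (CPU, mux, cache, bus, memory). Solution 3: interleaved memory modules (CPU, cache, bus, four memory banks).]

Main Memory Performance
°Timing model (32-bit words)
  •1 cycle to send the address
  •6 cycles access time, 1 cycle to send the data
  •A cache block is 4 words
°Simple miss penalty = 4 x (1 + 6 + 1) = 32
°Wide miss penalty = 1 + 6 + 1 = 8
°Interleaved miss penalty = 1 + 6 + 4 x 1 = 11

[Figure: the main memory system in a computer]
[Figure: the CPU main memory access process]

Need for Error Correction!
°Motivation:
  •Failures per unit time are proportional to the number of bits!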
  •As DRAM cells shrink, they become more vulnerable
°Went through a period in which the failure rate was low enough without error correction that people didn't do correction
  •DRAM banks are too large now
  •Servers have always had corrected memory systems
°Basic idea: add redundancy through parity bits
  •Common configuration: random error correction
    -SEC-DED (single error correct, double error detect)
    -One example: 64 data bits + 8 parity bits (11% overhead)
  •Really want to handle failures of physical components as well
    -Organization is multiple DRAMs per DIMM, multiple DIMMs
    -Want to recover from a failed DRAM and a failed DIMM!
    -"Chip kill": handles failures the width of a single DRAM chip

Quest for DRAM Performance
1. Fast Page mode
  •Add timing signals that allow repeated accesses to the row buffer without another row access time
  •Such a buffer comes naturally, as each array will buffer 1024 to 2048 bits for each access
2. Synchronous DRAM (SDRAM)
  •Add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the DRAM controller
3. Double Data Rate (DDR SDRAM)
  •Transfer data on both the rising edge and the falling edge of the DRAM clock signal, doubling the peak data rate
  •DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts and offers higher clock rates: up to 400 MHz
  •DDR3 drops to 1.5 volts and offers higher clock rates: up to 800 MHz
°Improved bandwidth, not latency

Fast Memory Systems: DRAM Specific
°Multiple CAS accesses: several names (page mode)
  •Extended Data Out (EDO): 30% faster in page mode
°Newer DRAMs to address the gap; what will they cost, will they survive?
  •RAMBUS: startup company; reinvented the DRAM interface
    -Each chip a module vs. a slice of memory
    -Short bus between CPU and chips
    -Does own refresh
    -Variable amount of data returned
    -1 byte / 2 ns (500 MB/s per chip)
  •Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz)
    -DDR DRAM: two transfers per clock (on rising and falling edge)
  •Intel claims FB-DIMM is the next big thing
    -Stands for "Fully Buffered Dual In-line Memory Module"
    -Same basic technology as DDR, but uses a serial "daisy-chain" channel between different memory components

Evolution of DRAM Technology
[Figure: DRAM generations, throughput vs. latency]

Fast Page Mode (FPM) DRAM
°Conventional DRAM organization:
  •N rows x N columns x M bits
  •Reads and writes M bits at a time
  •Each M-bit access needs a full RAS/CAS cycle
°FPM DRAM
  •N x M latches hold one row
°After a row has been read into the latches
  •Only CAS is needed to access other M-bit blocks in that row
  •RAS_L stays asserted while CAS_L toggles
[Figure: an N-row x N-column x M-bit DRAM array addressed by the row address feeds an N x M SRAM row buffer; the column address selects the M-bit output]
°DRAM performance rating (x-y-y-y, for example 6-3-3-3)
  •x: first data access time in clock/bus cycles
  •y: successive burst data access time in clock/bus cycles

EDO DRAM (Extended Data Out) (20% to 40% performance improvement)
°EDO DRAM performance rating: 5-2-2-2 at 66 MHz
[Figure: FPM DRAM vs. EDO DRAM read timing]

Burst EDO DRAM

SDRAM (Synchronous DRAM)
°Based on DRAM technology (CAS, RAS, etc.)
°Allows multiple banks in one DIMM
  •The 168-pin SDRAM DIMM adds two pins, ba0 and ba1
°Uses a clock signal synchronous with the CPU or chipset
°Five groups of control signals, which combine into multiple commands
  •CS: chip select
  •RAS: row address strobe
  •CAS: column address strobe
  •WE: write enable
  •DQM: data mask (acts as output enable on reads)
°Better support for burst mode
°Programmable mode settings:
  •Burst length, sequence, ...
[Figure: SDRAM Mode Register]
[Figure: SDRAM read]

SDRAM Performance
°CAS latency is important
°x-y-y (for example 3-2-2)
  •CAS latency
  •RAS-to-CAS delay
  •RAS precharge time
°Clock frequency
  •PC66: 66 MHz
  •PC100: 100 MHz
  •PC133: 133 MHz

DDR SDRAM
°DDR: double data rate
°Data can be transferred on both the rising and the falling clock edge (bandwidth x 2!!)
°A relatively small change to the existing SDRAM architecture (existing production lines can be reused)
°SDRAM and DDR are both open standards (JEDEC) (Important!!)
[Figure: DDR-SDRAM Timing Diagram]
[Figure: DDR-2]

SDRAM Banks and Memory Specifications
°Figure 1:
  •4M x 1 bit x 32 chips
°Figure 2:
  •4 banks in a DIMM
°SIMM, DIMM
  •Single/dual in-line memory module
°Everything in use today is DIMM
°Clock frequency
  •PC1600: 100 MHz
    -100 x 2 x 8 = 1600 MB/s
  •PC2100: 133 MHz
  •PC2400: 150 MHz

DDR3 SDRAM

Other DRAM: VCDRAM (Virtual Channel DRAM, NEC)

SDRAM Timing (Single Data Rate)
°Micron 128 Mbit DRAM (using the 2 Meg x 16 bit x 4 bank version)
[Figure: timing diagram showing RAS (new bank), CAS, CAS latency, burst READ, and precharge]

Double-Data Rate (DDR2) DRAM
[Figure: Micron, 256Mb DDR2 SDRAM datasheet. Row, column, precharge, and next-row (Row') commands with the data burst; 200 MHz clock, 400 Mb/s data rate]

DDR vs DDR2 vs DDR3 vs DDR4
°All about increasing the rate at the pins
°Not an improvement in latency
  •In fact, latency can sometimes be worse
°Internal banks often consumed for increased bandwidth
°DDR4 (January 2011)
  •Samsung, …
  •Currently 2.13 Gb/sec
  •Target: 4 Gb/sec
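The "rate at the pins" that each DDR generation raises can be estimated with the same arithmetic the slides use for PC1600 (100 x 2 x 8 = 1600 MB/s). The Python sketch below is a minimal illustration; the function name `peak_bandwidth_mb_s` and its parameters are hypothetical, and the 8-byte width assumes a standard 64-bit DIMM data bus. It computes peak transfer rate only and says nothing about latency.

```python
def peak_bandwidth_mb_s(clock_mhz, transfers_per_clock=2, bus_bytes=8):
    """Peak rate in MB/s = clock (MHz) x transfers per clock x bus width (bytes)."""
    return clock_mhz * transfers_per_clock * bus_bytes


# DDR modules from the slides: two transfers per clock on a 64-bit (8-byte) bus.
for name, clock_mhz in [("PC1600", 100), ("PC2100", 133), ("PC2400", 150)]:
    print(name, peak_bandwidth_mb_s(clock_mhz), "MB/s")

# Single-data-rate SDRAM (PC100) for comparison: one transfer per clock.
print("PC100", peak_bandwidth_mb_s(100, transfers_per_clock=1), "MB/s")
```

With the rounded 133 MHz figure the PC2100 result comes out slightly above 2100 MB/s; the module name is a rounded label. Doubling the transfers per clock doubles this peak figure, but the row access and CAS latency of the underlying DRAM array are unchanged, which is why the slides describe DDR as a bandwidth improvement and not a latency improvement.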