
Managing Wire Delay in Large Chip-Multiprocessor Caches Bradford M. Beckmann and David A. Wood

Computer Sciences Department

University of Wisconsin—Madison

{beckmann, david}@cs.wisc.edu

Abstract

In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA [27] and NuRapid [12]) reduces average hit latency by migrating frequently used blocks towards the lower-latency banks. Transmission Line Caches (TLC) [6] use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, latency.

Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank.

In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, we demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, we observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, we show that stride-based prefetching between L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, we present a hybrid design that combines all three techniques and improves performance by an additional 2% to 19% over prefetching alone.

1 Introduction

Many factors, both technological and marketing, are driving the semiconductor industry to implement multiple processors per chip. Small-scale chip multiprocessors (CMPs), with two processors per chip, are already commercially available [24, 30, 44]. Larger-scale CMPs seem likely to follow as transistor densities increase [5, 18, 45, 28]. Due to the benefits of sharing, current and future CMPs are likely to have a shared, unified L2 cache [25, 37].

Wire delay plays an increasingly significant role in cache design. Design partitioning, along with the integration of more metal layers, allows wire dimensions to decrease slower than transistor dimensions, thus keeping wire delay controllable for short distances [20, 42]. For instance, as technology improves, designers split caches into multiple banks, controlling the wire delay within a bank. However, wire delay between banks is a growing performance bottleneck. For example, transmitting data 1 cm requires only 2-3 cycles in current (2004) technology, but will necessitate over 12 cycles in 2010 technology assuming a cycle time of 12 fanout-of-three delays [16]. Thus, L2 caches are likely to have hit latencies in the tens of cycles.

Increasing wire delay makes it difficult to provide uniform access latencies to all L2 cache banks. One alternative is Non-Uniform Cache Architecture (NUCA) designs [27], which allow nearer cache banks to have lower access latencies than farther banks. However, supporting multiple processors (e.g., 8) places additional demands on NUCA cache designs. First, simple geometry dictates that eight regular-shaped processors must be physically distributed across the 2-dimensional die. A cache bank that is physically close to one processor cannot be physically close to all the others. Second, an 8-way CMP requires eight times the sustained cache bandwidth. These two factors strongly suggest a physically distributed, multi-port NUCA cache design.

This paper examines three techniques, previously evaluated only for uniprocessors, for managing L2 cache latency in an eight-processor CMP. First, we consider using hardware-directed stride-based prefetching [9, 13, 23] to tolerate the variable latency in a NUCA cache design. While current systems perform hardware-directed strided prefetching [19, 21, 43], its effectiveness is workload dependent [10, 22, 46, 49]. Second, we consider cache block migration [12, 27], a recently proposed technique for NUCA caches that moves frequently accessed blocks to cache banks closer to the requesting processor. While block migration works well for uniprocessors, adapting it to CMPs poses two problems.

This work was supported by the National Science Foundation (CDA-9623632, EIA-9971256, EIA-0205286, and CCR-0324878), a Wisconsin Romnes Fellowship (Wood), and donations from Intel Corp. and Sun Microsystems, Inc. Dr. Wood has a significant financial interest in Sun Microsystems, Inc.

One, blocks shared by multiple processors are pulled in multiple directions and tend to congregate in banks that are equally far from all processors. Two, due to the extra freedom of movement, the effectiveness of block migration in a shared CMP cache is more dependent on "smart searches" [27] than its uniprocessor counterpart, yet smart searches are harder to implement in a CMP environment. Finally, we consider using on-chip transmission lines [8] to provide fast access to all cache banks [6]. On-chip transmission lines use thick global wires to reduce communication latency by an order of magnitude versus long conventional wires. Transmission Line Caches (TLCs) provide fast, nearly uniform access latencies. However, the limited bandwidth of transmission lines, due to their large dimensions, may lead to a performance bottleneck in CMPs.

This paper evaluates these three techniques against a baseline NUCA design with L2 miss prefetching, using detailed full-system simulation and both commercial and scientific workloads. We make the following contributions:
• Block migration is less effective for CMPs than previous results have shown for uniprocessors. Even with a perfect search mechanism, block migration alone only improves performance by an average of 3%. This is in part because shared blocks migrate to the middle, equally-distant cache banks, accounting for 40-60% of L2 hits for the commercial workloads.
• Transmission line caches in CMPs exhibit performance improvements comparable to previously published uniprocessor results [6]: 8% on average. However, contention for their limited bandwidth accounts for 26% of L2 hit latency.

• Hardware-directed strided prefetching hides L2 hit latency about as well as block migration and transmission lines reduce it. However, prefetching is largely orthogonal, permitting hybrid techniques.

• A hybrid implementation, combining block migration, transmission lines, and on-chip prefetching, provides the best performance. The hybrid design improves performance by an additional 2% to 19% over the baseline.
• Finally, prefetching and block migration improve network efficiency for some scientific workloads, while transmission lines potentially improve efficiency across all workloads.

2 Managing CMP Cache Latency

This section describes the baseline CMP design for this study and how we adapt the three latency management techniques to this framework.

2.1 Baseline CMP Design

We target eight-processor CMP chip designs assuming the 45 nm technology generation projected in 2010 [16]. Table 1 specifies the system parameters for all designs. Each CMP design assumes approximately 300 mm² of available die area [16]. We estimate eight 4-wide superscalar processors would occupy 120 mm² [29] and 16 MB of L2 cache storage would occupy 64 mm² [16]. The on-chip interconnection network and other miscellaneous structures occupy the remaining area.

As illustrated in Figure 1, the baseline design, denoted CMP-SNUCA, assumes a Non-Uniform Cache Architecture (NUCA) L2 cache, derived from Kim, et al.'s S-NUCA-2 design [27]. Similar to the original proposal, CMP-SNUCA statically partitions the address space across cache banks, which are connected via a 2D mesh interconnection network. CMP-SNUCA differs from the uniprocessor design in several important ways. First, it places eight processors around the perimeter of the L2 cache, effectively creating eight distributed access locations rather than a single centralized location. Second, the 16 MB L2 storage array is partitioned into 256 banks to control bank access latency [1] and to provide sufficient bandwidth to support up to 128 simultaneous on-chip processor requests. Third, CMP-SNUCA connects four banks to each switch and expands the link width to 32 bytes. The wider CMP-SNUCA network provides the additional bandwidth needed by an 8-processor CMP, but requires longer latencies as compared to the originally proposed uniprocessor network. Fourth, shared CMP caches are subject to contention from different processors' working sets [32], motivating 16-way set-associative banks with a pseudo-LRU replacement policy [40]. Finally, we assume an idealized off-chip communication controller to provide consistent off-chip latency for all processors.
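
To make the static partitioning concrete, the sketch below (ours, not from the paper) shows one plausible way addresses could be interleaved across the 256 banks; which address bits CMP-SNUCA actually uses is not stated above, so the bit selection here is an assumption.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical CMP-SNUCA-style static interleaving: 256 banks of 64 KB each
// with 64-byte blocks. The choice of bank-selection bits is illustrative only.
constexpr int kBlockBits = 6;    // 64-byte cache blocks
constexpr int kNumBanks  = 256;  // 256 x 64 KB = 16 MB of L2 storage

// Block-grained interleaving: the bits just above the block offset pick the bank.
inline unsigned bankOf(std::uint64_t paddr) {
    return static_cast<unsigned>((paddr >> kBlockBits) & (kNumBanks - 1));
}

int main() {
    // Two blocks 64 bytes apart map to adjacent banks under this scheme.
    std::printf("bank(0x1000) = %u\n", bankOf(0x1000));
    std::printf("bank(0x1040) = %u\n", bankOf(0x1040));
    return 0;
}
```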

2.2 Strided Prefetching

Strided or stride-based prefetchers utilize repeatable memory access patterns to tolerate cache miss latency [11, 23, 38]. Though the L1 cache filters many memory requests, L1 and L2 misses often show repetitive access patterns. Most current prefetchers utilize miss patterns to predict cache misses before they happen [19, 21, 43]. Specifically, current hardware prefetchers observe the stride between two recent cache misses, then verify the stride using subsequent misses. Once the prefetcher reaches a threshold of fixed strided misses, it launches a series of fill requests to reduce or eliminate additional miss latency.

We base our prefetching strategy on the IBM Power4 implementation [43] with some slight modifications. We evaluate both L2 prefetching (i.e., between the L2 cache and memory) and L1 prefetching (i.e., between the L1 and L2 caches). Both the L1 and L2 prefetchers contain three separate 32-entry filter tables: positive unit stride, negative unit stride, and non-unit stride. Similar to Power4, once a filter table entry recognizes 4 fixed-stride misses, the prefetcher allocates the miss stream into its 8-entry stream table. Upon allocation, the L1I and L1D prefetchers launch 6 consecutive prefetches along the stream to compensate for the L1 to L2 latency, while the L2 prefetcher launches 25 prefetches. Each prefetcher issues prefetches for both loads and stores because, unlike the Power4, our simulated machine uses an L1 write-allocate protocol supporting sequential consistency. Also, we model separate L2 prefetchers per processor, rather than a single shared prefetcher. We found that with a shared prefetcher, interference between the different processors' miss streams significantly disrupts the prefetching accuracy and coverage.¹
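
The following minimal C++ sketch (ours, not the Power4 or simulator code) illustrates the confirm-then-launch behavior for a single miss stream; a real design tracks many streams in the 32-entry filter tables and 8-entry stream tables described above, which this sketch omits.

```cpp
#include <cstdint>
#include <vector>

// Simplified stride detector for one miss stream: confirm a fixed stride
// across several misses, then launch a burst of prefetches along it.
class StrideStream {
public:
    StrideStream(int threshold, int depth) : threshold_(threshold), depth_(depth) {}

    // Called on each cache miss; returns the addresses to prefetch (possibly none).
    std::vector<std::uint64_t> onMiss(std::uint64_t missAddr) {
        std::vector<std::uint64_t> prefetches;
        if (seenMiss_) {
            std::int64_t stride = static_cast<std::int64_t>(missAddr - lastAddr_);
            if (stride != 0 && stride == lastStride_) {
                // Another miss at the same fixed stride confirms the stream.
                if (++confirmations_ >= threshold_ && !allocated_) {
                    allocated_ = true;                    // stream-table allocation
                    for (int i = 1; i <= depth_; ++i)     // e.g. 6 for L1, 25 for L2
                        prefetches.push_back(missAddr + static_cast<std::uint64_t>(i * stride));
                }
            } else {
                confirmations_ = 1;                       // stride changed: start over
                allocated_ = false;
            }
            lastStride_ = stride;
        }
        lastAddr_ = missAddr;
        seenMiss_ = true;
        return prefetches;
    }

private:
    int threshold_, depth_;
    bool seenMiss_ = false, allocated_ = false;
    int confirmations_ = 0;
    std::uint64_t lastAddr_ = 0;
    std::int64_t lastStride_ = 0;
};
```

Under the parameters above, StrideStream(4, 6) would approximate one L1 stream and StrideStream(4, 25) one L2 stream.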

2.3 Block Migration

Block migration reduces global wire delay from L2 hit latency by moving frequently accessed cache blocks closer to the requesting processor. Migrating data to reduce latency has been extensively studied in multiple-chip multiprocessors [7, 15, 17, 36, 41]. Kim, et al. recently applied data migration to reduce latency inside future aggressively-banked uniprocessor caches [27]. Their Dynamic NUCA (D-NUCA) design used a 2-dimensional mesh to interconnect 2-way set-associative banks, and dynamically migrated frequently accessed blocks to the closest banks. NuRapid used centralized tags and a level of indirection to decouple data placement from set indexing, thus reducing conflicts in the nearest banks [12]. Both D-NUCA and NuRapid assumed a single processor chip accessing the L2 cache network from a single location.

For CMPs, we examine a block migration scheme as an extension to our baseline CMP-SNUCA design. Similar to the uniprocessor D-NUCA design [27], CMP-DNUCA permits block migration by logically separating the L2 cache banks into 16 unique banksets, where an address maps to a bankset and can reside within any one bank of the bankset. CMP-DNUCA physically separates the cache banks into 16 different bankclusters, shown as the shaded "Tetris" pieces in Figure 1. Each bankcluster contains one bank from every bankset, similar to the uniprocessor "fair mapping" policy [27].

The bankclusters are grouped into three distinct regions. The 8 bankclusters closest to each processor form the local regions, shown by the 8 lightly shaded bankclusters in Figure 1. The 4 bankclusters that reside in the center of the shared cache form the center region, shown by the 4 darkest shaded bankclusters in Figure 1. The remaining 4 bankclusters form the inter, or intermediate, region. Ideally, block migration would maximize L2 hits within each processor's local bankcluster, where the uncontended L2 hit latency (i.e., load-to-use latency) varies from 13 to 17 cycles, and limit the hits to another processor's local bankcluster, where the uncontended latency can be as high as 65 cycles.

To reduce the latency of detecting a cache miss, the uniprocessor D-NUCA design utilized a "smart search" [27] mechanism using a partial tag array. The centrally-located partial tag structure [26] replicated the low-order bits of each bank's cache tags. If a request missed in the partial tag structure, the block was guaranteed not to be in the cache. This smart search mechanism allowed nearly all cache misses to be detected without searching the entire bankset.
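
To make the filtering property concrete, here is a small sketch (ours; everything beyond the basic low-order-bits idea is an assumption) of a partial-tag lookup that can prove a block is absent but can only suggest it might be present.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical partial-tag filter: it stores only the low-order bits of each
// cached block's tag, so a lookup can prove absence but never presence.
class PartialTagFilter {
public:
    explicit PartialTagFilter(unsigned partialBits)
        : mask_((std::uint64_t{1} << partialBits) - 1) {}

    void insert(std::uint64_t tag) { tags_.insert(tag & mask_); }

    void erase(std::uint64_t tag) {
        auto it = tags_.find(tag & mask_);
        if (it != tags_.end()) tags_.erase(it);   // remove one matching entry
    }

    // false -> guaranteed miss (no need to search the banks);
    // true  -> possible hit (the full bank lookup must still confirm it).
    bool mayBePresent(std::uint64_t tag) const { return tags_.count(tag & mask_) != 0; }

private:
    std::uint64_t mask_;
    std::unordered_multiset<std::uint64_t> tags_;
};
```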

In CMP-DNUCA, adopting a partial tag structure appears impractical. A centralized partial tag structure cannot be quickly accessed by all processors due to wire delays. Fully replicated 6-bit partial tag structures (as used in uniprocessor D-NUCA [27]) require 1.5 MB of state, an extremely high overhead.

1. Similar to separating branch predictor histories per thread [39], separating the L2 miss streams by processor significantly improves prefetcher performance (up to 14 times for the workload ocean).

Table 1. 2010 System Parameters

Memory System:
- split L1 I & D caches: 64 KB, 2-way, 3 cycles
- unified L2 cache: 16 MB (256 x 64 KB banks), 16-way, 6-cycle bank access
- L1/L2 cache block size: 64 Bytes
- memory latency: 260 cycles
- memory bandwidth: 320 GB/s
- memory size: 4 GB of DRAM
- outstanding memory requests/CPU: 16

Dynamically Scheduled Processor:
- clock frequency: 10 GHz
- reorder buffer / scheduler: 128 / 64 entries
- pipeline width: 4-wide fetch & issue
- pipeline stages: 30
- direct branch predictor: 3.5 KB YAGS
- return address stack: 64 entries
- indirect branch predictor: 256 entries (cascaded)

Figure 1. CMP-SNUCA Layout with CMP-DNUCA Bankcluster Regions (bankcluster key: local, inter., center)

More importantly, separate partial tag structures require a complex coherence scheme that updates address location state in the partial tags with block migrations. However, because architects may invent a solution to this problem, we evaluate CMP-DNUCA both with and without a perfect search mechanism.

2.4 On-chip Transmission Lines

On-chip transmission line technology reduces L2 cache access latency by replacing slow conventional wires with ultra-fast transmission lines [6]. The delay in conventional wires is dominated by a wire's resistance-capacitance product, or RC delay. RC delay increases with improving technology as wires become thinner to match the smaller feature sizes below. Specifically, wire resistance increases due to the smaller cross-sectional area, and sidewall capacitance increases due to the greater surface area exposed to adjacent wires. On the other hand, transmission lines attain significant performance benefit by increasing wire dimensions to the point where the inductance-capacitance product (LC delay) determines delay [8]. In the LC range, data can be communicated by propagating an incident wave across the transmission line instead of charging the capacitance across a series of wire segments. While techniques such as low-k intermetal dielectrics, additional metal layers, and more repeaters across a link will mitigate RC wire latency for short and intermediate links, transmitting data 1 cm will require more than 12 cycles in 2010 technology [16]. In contrast, on-chip transmission lines implemented in 2010 technology will transmit data 1 cm in less than a single cycle [6].
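
As background, the standard first-order delay models behind this contrast (textbook approximations, not taken from the paper) are:

```latex
% Distributed RC line of length l with resistance R' and capacitance C' per unit length:
t_{RC} \approx 0.38\, R'\, C'\, l^{2}
% LC transmission line: time of flight at the wave velocity v = 1/\sqrt{L'C'}:
t_{LC} = l\,\sqrt{L'\,C'}
```

The quadratic growth with length is what repeaters mitigate for conventional wires, while a transmission line's delay grows only linearly with length.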

While on-chip transmission lines achieve significant latency reduction, they sacrifice substantial bandwidth or require considerable manufacturing cost. To achieve transmission line signalling, on-chip wire dimensions and spacing must be an order of magnitude larger than minimum-pitch global wires. To attain these large dimensions, transmission lines must be implemented in the chip's uppermost metal layers. The sparseness of these upper layers severely limits the number of transmission lines available. Alternatively, extra metal layers may be integrated into the manufacturing process, but each new metal layer adds about a day of manufacturing time, increasing wafer cost by hundreds of dollars [47].

Applying on-chip transmission lines to reduce the access latency of a shared L2 cache requires efficient utilization of their limited bandwidth. Similar to our uniprocessor TLC designs [6], we first propose using transmission lines to connect processors with a shared L2 cache through a single L2 interface, as shown in Figure 2. Because transmission lines do not require repeaters, CMP-TLC creates a direct connection between the centrally located L2 interface and the peripherally located storage arrays by routing directly over the processors. Similar to CMP-SNUCA, CMP-TLC statically partitions the address space across all L2 cache banks. Sixteen banks (2 adjacent groups of 8 banks) share a common pair of thin 8-byte wide unidirectional transmission line links to the L2 cache interface. To mitigate the contention for the thin transmission line links, our CMP-TLC design provides 16 separate links to different segments of the L2 cache. Also, to further reduce contention, the CMP-TLC L2 interface provides a higher bandwidth connection (80-byte wide) between the transmission lines and processors than the original uniprocessor TLC design. Due to the higher bandwidth, requests encounter greater communication latency (2-10 cycles) within the L2 cache interface.

We also propose using transmission lines to quickly access the central banks in the CMP-DNUCA design. We refer to this design as CMP-Hybrid. CMP-Hybrid, illustrated in Figure 3, assumes the same design as CMP-DNUCA except that the closest switch to each processor has a 32-byte wide transmission line link to a center switch in the DNUCA cache. Because the processors are distributed around the perimeter of the chip and the distance between the processor switches and the center switches is relatively short (approximately 8 mm), the transmission line links in CMP-Hybrid are wider (32 bytes) than their CMP-TLC counterparts (8 bytes).

Figure 2. CMP-TLC Layout

Figure 3. CMP-Hybrid Layout

The transmission line links of CMP-Hybrid provide low-latency access to those blocks that tend to congregate in the center banks of the block-migrating NUCA cache (Section 5.3).

Figure 4 compares the uncontended L2 cache hit latency between the CMP-SNUCA, CMP-TLC, and CMP-Hybrid designs. The plotted hit latency includes L1 miss latency, i.e., it plots the load-to-use latency for L2 hits. While CMP-TLC achieves a much lower average hit latency than CMP-SNUCA, CMP-SNUCA exhibits lower latency to the closest 1 MB to each processor. For instance, Figure 4 shows all processors in the CMP-SNUCA design can access their local bankcluster (6.25% of the entire cache) in 18 cycles or less. CMP-DNUCA attempts to maximize the hits to this closest 6.25% of the NUCA cache through migration, while CMP-TLC utilizes a much simpler logical design and provides fast access for all banks. CMP-Hybrid uses transmission lines to attain similar average hit latency as CMP-TLC, as well as achieving fast access to more banks than CMP-SNUCA.

3 Methodology

We evaluated all cache designs using full system simulation of a SPARC V9 CMP running Solaris 9. Specifically, we used Simics [33] extended with the out-of-order processor model TFSim [34] and a memory system timing model. Our memory system implements a two-level directory cache-coherence protocol with sequential memory consistency. The intra-chip MSI coherence protocol maintains inclusion between the shared L2 cache and all on-chip L1 caches. All L1 requests and responses are sent via the L2 cache, allowing the L2 cache to maintain up-to-date L1 sharer knowledge. The inter-chip MOSI coherence protocol maintains directory state at the off-chip memory controllers and only tracks which CMP nodes contain valid block copies. Our memory system timing model includes a detailed model of the intra- and inter-chip network. Our network models all messages communicated in the system, including all requests, responses, replacements, and acknowledgments. Network routing is performed using a virtual cut-through scheme with infinite buffering at the switches.

We studied the CMP cache designs for various commercial and scientific workloads. Alameldeen, et al. described in detail the four commercial workloads used in this study [2]. We also studied four scientific workloads: two SPLASH-2 benchmarks [48], barnes (16k particles) and ocean (514x514), and two SPECOMP benchmarks [4], apsi and fma3d. We used a work-related throughput metric to address multithreaded workload variability [2]. Thus, for the commercial workloads, we measured transactions completed, and for the scientific workloads, runs were completed after the cache warm-up period indicated in Table 2. However, for the SPECOMP workloads using the reference input sets, runs were too long to be completed in a reasonable amount of time. Instead, these loop-based benchmarks were split by main loop completion. This allowed us to evaluate all workloads using throughput metrics, rather than IPC.

4 Strided Prefetching

Both on- and off-chip strided prefetching significantly improve the performance of our CMP-SNUCA baseline. Figure 5 presents runtime results for no prefetching, L2 prefetching only, and L1 and L2 prefetching combined, normalized to no prefetching. Error bars signify the 95% confidence intervals [3], and the absolute runtime (in 10K instructions per transaction or scientific benchmark) of the no-prefetch case is presented below. Figure 5 illustrates the substantial benefit from L2 prefetching, particularly for regular scientific workloads. L2 prefetching reduces the run times of ocean and apsi by 43% and 59%, respectively. Strided L2 prefetching also improves performance of the commercial workloads by 4% to 17%.

The L1 & L2 prefetching bars of Figure 5 indicate on-chip prefetching between each processor's L1 I and D caches and the shared L2 cache improves performance by an additional 6% on average.

Figure 4. CMP-SNUCA vs. CMP-TLC vs. CMP-Hybrid Uncontended L2 Hit Latency (latency in cycles vs. % of L2 cache storage)

Table 2. Evaluation Methodology

Bench    Fast Forward    Warm-up    Executed
Commercial Workloads (unit = transactions)
apache   500000          2000       500
zeus     500000          2000       500
jbb      1000000         15000      2000
oltp     100000          300        100
Scientific Workloads (unit = billion instructions)
barnes   None            1.9        run completion
ocean    None            2.4        run completion
apsi     88.8            4.64       loop completion
fma3d    190.4           2.08       loop completion

On-chip prefetching benefits all benchmarks except for jbb and barnes, which have high local L1 cache hit rates of 91% and 99%, respectively. Table 3 breaks down the performance of stride prefetching into prefetch rate, coverage, and accuracy (defined below Table 3). A prefetch hit is defined as the first reference to a prefetched block, including a "partial hit" to a block still in flight. Except for L2 prefetches in jbb and barnes, less than 12% of prefetch hits were partial hits. Overall, as communication latency increases, the significant performance improvement attainable by prefetching ensures its continued integration into future high performance memory systems.

5 Block Migration

CMP caches utilizing block migration must effectively manage multiple processor working sets in order to reduce cache access latency. Although Kim, et al. [27] showed block migration significantly reduced cache access latency in a non-prefetching uniprocessor NUCA cache, most future large on-chip caches will implement hardware-directed prefetching and service multiple on-chip processors. Our CMP-DNUCA design extends block migration to an 8-processor CMP cache and supports strided prefetching at the L1 and L2 cache levels. We characterize the working sets existing in a shared CMP cache in Section 5.1. Then we describe our CMP cache implementing block migration in Section 5.2, and present evaluation results in Section 5.3.

5.1 Characterizing CMP Working Sets

We analyze the working sets of the workloads running on an eight-processor CMP. Specifically, we focus on understanding the potential benefits of block migration in a CMP cache by observing the behavior of L2 cache blocks. We model a single-banked 16 MB inclusive shared L2 cache with 16-way associativity and a uniform access time to isolate the sharing activity from access latency and conflict misses. No prefetching is performed in these runs. To mitigate cold start effects, the runs are long enough that L2 cache misses outnumber physical L2 cache blocks by an order of magnitude.

Figure 6 shows the cumulative distribution of the number of processors that access each block. For the scientific workloads, the vast majority of all blocks (between 70.3% and 99.9%) are accessed by a single processor. Somewhat surprisingly, even the commercial workloads share relatively few blocks. Only 5% to 28% of blocks in the evaluated commercial workloads are accessed by more than two processors. Because relatively few blocks are shared across the entire workload spectrum, block migration can potentially improve all workloads by moving blocks towards the single processor that requests them.

Although relatively few blocks are shared in all workloads, a disproportionate fraction of L2 hits are satisfied by highly shared blocks for the commercial workloads. Figure 7 shows the cumulative distribution of L2 hits for blocks accessed by 1 to 8 processors. The three array-based scientific workloads (fma3d [4], apsi [4], and ocean [48]) exhibit extremely low inter-processor request sharing. Less than 9% of all L2 hits are to blocks shared by multiple processors.

Table 3. Prefetching Characteristics

             L1 I Cache              L1 D Cache              L2 Cache
benchmark    prefetches cov.  acc.   prefetches cov.  acc.   prefetches cov.  acc.
apache       7.6        18.3% 48.7%  5.3        9.3%  61.7%  7.0        39.0% 49.8%
jbb          2.5        30.1  57.0   2.0        7.7   35.9   3.0        38.3  36.8
oltp         15.1       26.7  53.5   1.4        5.9   59.6   1.9        35.0  50.3
zeus         11.3       19.0  47.9   4.9        15.7  78.6   8.0        47.2  56.7
barnes       0.0        9.1   38.7   0.2        3.0   20.8   0.0        12.3  22.8
ocean        0.0        22.1  50.9   17.6       85.9  88.3   4.0        91.3  87.5
apsi         0.0        10.8  38.5   5.6        46.7  99.4   5.7        98.7  98.8
fma3d        0.0        12.4  33.1   6.7        32.8  82.5   11.2       36.8  67.9

Figure 5. Normalized Execution: Strided Prefetching (CMP-SNUCA with no, L2-only, and L1 & L2 prefetching)

Definitions for Table 3: prefetch rate = prefetches / (instructions / 1000); coverage % = prefetchHits / (prefetchHits + misses) x 100; accuracy % = prefetchHits / prefetches x 100.

However, for barnes, utilizing a tree data structure [48], 71% of L2 hits are to blocks shared among multiple processors. For the commercial workloads, more than 39% of L2 hits are to blocks shared by all processors, with as many as 71% of hits for oltp. Figure 8 breaks down the request type over the number of processors to access a block. In the four commercial workloads, instruction requests (GET_INSTR) make up over 56% of the L2 hits to blocks shared by all processors. Kundu et al. [31] recently confirmed the high degree of instruction sharing in a shared CMP cache running oltp. The large fraction of L2 hits to highly shared blocks complicates block migration in CMP-DNUCA. Since shared blocks will be pulled in all directions, these blocks tend to congregate in the center of the cache rather than toward just one processor.

5.2 Implementing CMP Block Migration

Our CMP-DNUCA implementation employs block migration within a shared L2cache.This design strives to reduce additional state while providing correct and ef?cient allocation, migration and search policies.

Allocation. The allocation policy seeks an efficient initial placement for a cache block, without creating excessive cache conflicts. While 16-way set-associative banks help reduce conflicts, the interaction between migration and allocation can still cause unnecessary replacements. CMP-DNUCA implements a simple, static allocation policy that uses the low-order bits of the cache tag to select a bank within the block's bankset (i.e., the bankcluster). This simple scheme works well across most workload types. While not studied in this paper, we conjecture that static allocation also works well for heterogeneous workloads, because all active processors will utilize the entire L2 cache storage.
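
A minimal sketch of this static placement (ours; the specific bit positions are assumptions): the address picks the bankset, and the low-order tag bits pick the bank, i.e. the bankcluster, within that bankset.

```cpp
#include <cstdint>

// Hypothetical CMP-DNUCA placement: 16 banksets x 16 bankclusters = 256 banks.
// The exact address and tag bits used are assumptions for illustration only.
constexpr int kBlockBits       = 6;   // 64-byte blocks
constexpr int kNumBanksets     = 16;
constexpr int kNumBankclusters = 16;
constexpr int kTagShift        = 20;  // assumed start of the cache tag field

// The address statically determines the bankset ...
inline unsigned banksetOf(std::uint64_t paddr) {
    return static_cast<unsigned>((paddr >> kBlockBits) & (kNumBanksets - 1));
}

// ... while the low-order tag bits choose the initial bankcluster (the bank
// within that bankset), so placement needs no per-block state.
inline unsigned initialBankclusterOf(std::uint64_t paddr) {
    return static_cast<unsigned>((paddr >> kTagShift) & (kNumBankclusters - 1));
}
```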

Migration. We investigated several different migration policies for CMP-DNUCA. A migration policy should maximize the proportion of L2 hits satisfied by the banks closest to a processor. Directly migrating blocks to a requesting processor's local bankcluster increases the number of local bankcluster hits. However, direct migration also increases the proportion of costly remote hits satisfied by a distant processor's local bankcluster. Instead, CMP-DNUCA implements a gradual migration policy that moves blocks along the six-bankcluster chain: other local -> other inter -> other center -> my center -> my inter -> my local.

Figure 6. Cumulative Percentage of Unique L2 Blocks vs. # of Processors to Access the Block During its L2 Cache Lifetime

Figure 7. Cumulative Percentage of Total L2 Cache Hits vs. # of Processors to Access a Block During its L2 Cache Lifetime

Figure 8. Request Type Distribution (GETX, UPGRADE, GETS, GET_INSTR) vs. # of Processors to Access a Block During its L2 Cache Lifetime

The gradual migration policy allows blocks frequently accessed by one processor to congregate near that particular processor, while blocks accessed by many processors tend to move within the center banks. Furthermore, the policy separates the different block types without requiring extra state or complicated decision making. Only the current bank location and the requesting processor id are needed to determine which bank, if any, a block should be moved to.
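
The migration decision can be expressed as a one-step lookup along the chain; the sketch below is ours and only illustrates the region-to-region step, not the bank-level bookkeeping.

```cpp
#include <optional>

// Bankcluster regions of a block's current location, classified relative to
// the CPU that just accessed it.
enum class Region { OtherLocal, OtherInter, OtherCenter, MyCenter, MyInter, MyLocal };

// Gradual migration: move one step along the chain
// other local -> other inter -> other center -> my center -> my inter -> my local.
std::optional<Region> nextRegion(Region current) {
    switch (current) {
        case Region::OtherLocal:  return Region::OtherInter;
        case Region::OtherInter:  return Region::OtherCenter;
        case Region::OtherCenter: return Region::MyCenter;
        case Region::MyCenter:    return Region::MyInter;
        case Region::MyInter:     return Region::MyLocal;
        case Region::MyLocal:     return std::nullopt;   // already closest: no move
    }
    return std::nullopt;
}
```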

Search. Similar to the best performing uniprocessor D-NUCA search policy, we advocate using a two-phase multicast search for CMP-DNUCA. The goal of the two-phase search policy is to maximize the number of first phase hits, while limiting the number of futile request messages. Based on the previously discussed gradual migration policy, hits most likely occur in one of six bankclusters: the requesting processor's local or inter bankclusters, or the four center bankclusters. Therefore, the first phase of our search policy broadcasts a request to the appropriate banks within these six bankclusters. If all six initial bank requests miss, we broadcast the request to the remaining 10 banks of the bankset. Only after a request misses in all 16 banks of the bankset will a request be sent off chip. Waiting for 16 replies over two phases adds significant latency to cache misses. Unfortunately, as discussed in Section 2.3, implementing a smart search mechanism to minimize search delay in CMP-DNUCA is difficult. Instead, we provide results in the following section for an idealized smart search mechanism.
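
The two-phase search can be sketched as follows (ours; probe() is a placeholder for sending a request to one bank of the address's bankset and awaiting its reply, not a real interface).

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <vector>

// Returns true if the block was found on chip; otherwise the caller issues an
// off-chip request. probe(bankcluster, addr) stands in for the request message
// to the appropriate bank of the bankset and its reply.
bool twoPhaseSearch(std::uint64_t addr,
                    const std::array<int, 6>& likelyClusters,    // requester's local, inter, 4 center
                    const std::vector<int>& remainingClusters,   // the other 10 bankclusters
                    const std::function<bool(int, std::uint64_t)>& probe)
{
    // Phase 1: multicast to the six bankclusters where hits are most likely.
    for (int bc : likelyClusters)
        if (probe(bc, addr)) return true;

    // Phase 2: only after all six first-phase probes miss, try the remaining ten.
    for (int bc : remainingClusters)
        if (probe(bc, addr)) return true;

    // All 16 banks of the bankset missed: the request must go off chip.
    return false;
}
```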

A unique problem of CMP-DNUCA is the potential for false misses, where L2 requests fail to find a cache block because it is in transit from one bank to another. It is essential that false misses are not naively serviced by the directory; otherwise two valid block copies could exist within the L2 cache, creating a coherence nightmare. One possible solution is that requests could consult a second set of centralized on-chip tags, not affected by block migration, before going off chip. However, this second tag array would cost over 1 MB of extra storage and require approximately 1 KB of data to be read and compared on every lookup, because each set logically appears 256-way associative.

Instead, CMP-DNUCA compensates for false misses by relying on the directory sharing state to indicate when a possible false miss occurred. If the directory believes a valid block copy already exists on chip, the L2 cache stops migrations and searches for an existing valid copy before using the received data. Only after sequentially examining all banks of a bankset with migration disabled will a cache be certain the block isn't already allocated. While the meticulous examination ensures correctness, it is very slow. Therefore, it is important to ensure false misses don't happen frequently.

We significantly reduced the frequency of false misses by implementing a lazy migration mechanism. We observed that almost all false misses occur for a few hot blocks that are rapidly accessed by multiple processors. By delaying migrations by a thousand cycles, and canceling migrations when a different processor accesses the same block, CMP-DNUCA still performs at least 94% of all scheduled migrations, while reducing false misses by at least 99%. In apsi, for instance, lazy migration reduced the fraction of false misses from 18% of all misses to less than 0.00001%.
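
A rough sketch of the lazy-migration bookkeeping (ours; the 1000-cycle delay comes from the text, while the data structures and interface are assumptions):

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical lazy-migration bookkeeping, keyed by block address.
struct PendingMigration {
    int           requester;   // CPU whose access scheduled the migration
    std::uint64_t readyCycle;  // cycle at which the move may actually happen
};

class LazyMigrator {
public:
    explicit LazyMigrator(std::uint64_t delayCycles = 1000) : delay_(delayCycles) {}

    // Called on every L2 hit: schedule a migration toward 'cpu', or cancel a
    // pending one if a different processor touches the same block first.
    void onHit(std::uint64_t blockAddr, int cpu, std::uint64_t now) {
        auto it = pending_.find(blockAddr);
        if (it != pending_.end() && it->second.requester != cpu) {
            pending_.erase(it);                 // contended hot block: cancel migration
            return;
        }
        pending_[blockAddr] = {cpu, now + delay_};
    }

    // Returns true if the delayed migration for this block may now proceed.
    bool readyToMigrate(std::uint64_t blockAddr, std::uint64_t now) const {
        auto it = pending_.find(blockAddr);
        return it != pending_.end() && now >= it->second.readyCycle;
    }

private:
    std::uint64_t delay_;
    std::unordered_map<std::uint64_t, PendingMigration> pending_;
};
```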

5.3 Evaluating CMP Block Migration

Our CMP-DNUCA evaluation shows block migration creates a performance degradation unless a smart search mechanism is utilized. A smart search mechanism reduces contention by decreasing the number of unsuccessful request messages and reduces L2 miss latency by sending L2 misses directly off chip before consulting all 16 banks of a bankset.

The high demand for the equally distant central banks restricts the benefit of block migration for the commercial workloads. Figure 9 shows that in all four commercial workloads over 47% of L2 hits are satisfied in the center bankclusters. The high number of central hits directly correlates to the increased sharing in the commercial workloads previously shown in Figure 7. Figure 10 graphically illustrates the distribution of L2 hits for oltp, where the dark squares in the middle represent the heavily utilized central banks.

Conversely, CMP-DNUCA exhibits a mixture of behavior running the four scientific workloads. Due to a lack of frequently repeatable requests, barnes, apsi, and fma3d encounter 30% to 62% of L2 hits in the distant 10 bankclusters. These hits cost significant performance because the distant banks are only searched during the second phase. On the other hand, the scientific workload ocean contains repeatable requests and exhibits very little sharing. Figure 9 indicates CMP-DNUCA successfully migrates over 60% of ocean's L2 hits to the local bankclusters.

Figure 9. L2 Hit Distribution of CMP-DNUCA (fraction of total L2 hits in the local, inter., center 4, and other 10 bankclusters, per benchmark)

The dark colored squares of Figure 11 graphically display how well CMP-DNUCA is able to split ocean's data set into the local bankclusters.

The limited success of block migration, along with the slow two-phase search policy, causes CMP-DNUCA to actually increase L2 hit and L2 miss latency. Figure 12 indicates CMP-DNUCA only reduces L2 hit latency versus CMP-SNUCA for one workload, ocean. The latency increase for the other seven workloads results from second phase hits encountering 31 to 51 more delay cycles than CMP-SNUCA L2 hits. The added latency of second phase hits in CMP-DNUCA is due to the delay waiting for responses from the first phase requests. Furthermore, due to the slow two-phase search policy, L2 misses also encounter 23 to 65 more delay cycles compared to CMP-SNUCA.

A smart search mechanism solves CMP-DNUCA's slow search problems. Figure 12 shows the L2 hit latency attained by CMP-DNUCA with perfect search (perfect CMP-DNUCA), where a processor sends a request directly to the cache bank storing the block. Perfect CMP-DNUCA reduces L2 hit latency across all workloads by 7 to 15 cycles versus CMP-SNUCA. Furthermore, when the block isn't on chip, perfect CMP-DNUCA immediately generates an off-chip request, allowing its L2 miss latency to match that of CMP-SNUCA. Although the perfect search mechanism is infeasible, architects may develop practical smart search schemes in the future. We assume perfect searches for the rest of the paper to examine the potential benefits of block migration.

6 Targeting On-chip Latency In Isolation

On-chip prefetching performs competitively with the more extravagant techniques of block migration and transmission lines. A comparison between the bars labeled CMP-SNUCA L1&L2 pf and those labeled perfect CMP-DNUCA L2 pf and CMP-TLC L2 pf in Figure 13 reveals on-chip prefetching achieves the greatest single workload improvement over the baseline (CMP-SNUCA L2 pf): 22% for ocean.

Figure 10. oltp L2 Hit Distribution

Figure 11. ocean L2 Hit Distribution

The figures above illustrate the distribution of cache hits across the L2 cache banks. The Tetris shapes indicate the bankclusters and the shaded squares represent the individual banks. The shading indicates the fraction of all L2 hits to be satisfied by a given bank, with darker being greater. The top figure illustrates all hits, while the 8 smaller figures illustrate each CPU's hits.

Figure 12. Avg. L2 Hit Latency: No Prefetching (cycles, for CMP-SNUCA, CMP-DNUCA, and perfect CMP-DNUCA)

In addition, on-chip prefetching improves performance by at least 4% for the workloads zeus, apsi, and fma3d.

Block migration improves performance by 2% to 4% for 6 of the 8 workloads. However, for apache, an increase in cache conflicts causes a 4% performance loss. As illustrated by Figure 9, the four center bankclusters (25% of the total L2 storage) incur 60% of the L2 hits. The unbalanced load increases cache conflicts, resulting in a 13% increase in L2 misses versus the baseline design.

By directly reducing wire latency, transmission lines consistently improve performance, but bandwidth contention prevents them from achieving their full potential. Figure 13 shows transmission lines consistently improve performance by 3% to 10% across all workloads (8% on average). However, CMP-TLC would do even better, except for bandwidth contention that accounts for 26% of the L2 hit latency.

Overall, CMP-TLC is likely to improve a larger number of workloads because transmission lines reduce latency without relying on predictable workload behavior. On the other hand, prefetching and block migration potentially provide a greater performance improvement for a smaller number of workloads.

7 Merging Latency Management Techniques

None of the three evaluated techniques subsumes another; rather, the techniques can work in concert to manage wire latency. Section 7.1 demonstrates prefetching is mostly orthogonal to transmission lines and block migration, while Section 7.2 evaluates combining all three techniques: prefetching, block migration, and transmission lines.

7.1 Combining with On-chip Prefetching

Prefetching, which reduces latency by predicting the next cache block to be accessed, is orthogonal to block migration. However, for some scientific workloads, prefetching's benefit slightly overlaps the consistent latency reduction of CMP-TLC. The bars labeled CMP-DNUCA L1&L2 pf and CMP-TLC L1&L2 pf in Figure 13 show the performance of CMP-DNUCA and CMP-TLC combined with on-chip prefetching. A comparison between the L2 pf bars and the L1&L2 pf bars reveals L1 prefetching provides roughly equal improvement across all three designs. The only slight deviation is that on-chip prefetching improves CMP-TLC by 5% to 21% for the scientific workloads ocean, apsi, and fma3d, while improving CMP-DNUCA by 6% to 27%. While combining each technique with prefetching is straightforward, combining all three techniques together requires a different cache design.

7.2 Combining All Techniques

CMP-Hybrid combines prefetching, transmission lines, and block migration to achieve the best overall performance. Figure 14 shows that CMP-Hybrid combined with on- and off-chip prefetching (perfect CMP-Hybrid L1&L2 pf) reduces runtime by 2% to 19% compared to the baseline design. As previously shown in Figure 9, 25% to 62% of L2 hits in CMP-DNUCA are to the center banks for all workloads except ocean. By providing low-latency access to the center banks, Figure 15 indicates CMP-Hybrid (bars labelled H) reduces the average L2 hit latency for these seven workloads by 6 to 9 cycles versus CMP-DNUCA (bars labelled D).

Figure 13. Normalized Execution: Latency Reduction Techniques with Prefetching

Figure 14. Normalized Execution: Combining All Techniques

As previously mentioned in Section 6, link contention within the centralized network of CMP-TLC accounts for 26% of delay for L2 hits. CMP-Hybrid's high-bandwidth, distributed network reduces contention, allowing for better utilization of the transmission lines' low latency. Figure 15 shows L2 hits within CMP-Hybrid encounter up to 8 fewer contention delay cycles as compared to those within CMP-TLC (bars labelled T).

While CMP-Hybrid achieves impressive performance, one should note it also relies on a good search mechanism for its performance. Furthermore, CMP-Hybrid requires both extra manufacturing cost to produce on-chip transmission lines and additional verification effort to implement block migration.

8 Energy Ef?ciency

Although the majority of L2 cache power consumption is expected to be leakage power [14], we analyze each network's dynamic energy-delay product to determine the efficiency of each design. Similar to our previous study [6], we estimate the network energy by measuring the energy used by the wires as well as the switches. For conventional RC interconnect using repeaters, we measure the energy required to charge and discharge the capacitance of each wire segment. For transmission lines, we measure the energy required to create the incident wave. We do not include the dynamic energy consumed within the L2 cache banks, but we do note block migration requires accessing the storage banks about twice as often as the static designs.

Prefetching and block migration improve network efficiency for some scientific workloads, while transmission lines potentially improve efficiency across all workloads. Figure 16 plots the product of the networks' dynamic energy consumption and the design's runtime, normalized to the value of the CMP-SNUCA design running ocean. The two block-migrating designs, CMP-DNUCA and CMP-Hybrid, assume the perfect search mechanism. Figure 16 shows the high accuracy and coverage of L1 and L2 prefetching results in a reduction of network energy-delay by 18% to 54% for all designs with ocean. Also, by successfully migrating frequently requested blocks to the more efficiently accessed local banks, CMP-DNUCA achieves similar network efficiency as CMP-SNUCA for ocean despite sending extra migration messages. However, both block migration and L1 prefetching increase network energy-delay between 17% and 53% for the commercial workload oltp. On the other hand, those designs using transmission lines, CMP-TLC and CMP-Hybrid, reduce network energy-delay by 26% for oltp and 33% for ocean on average versus their counterpart designs that exclusively rely on conventional wires, CMP-SNUCA and CMP-DNUCA. In general, workload characteristics affect prefetching and block migration efficiency more than transmission line efficiency.

9 Conclusions

Managing on-chip wire delay in CMP caches is essential in order to improve future system performance. Strided prefetching is a common technique utilized by current designs to tolerate wire delay. As wire delays continue to increase, architects will turn to additional techniques such as block migration or transmission lines to manage on-chip delay. While block migration effectively reduces wire delay in uniprocessor caches, we discover block migration's capability to improve CMP performance relies on a difficult-to-implement smart search mechanism. Furthermore, the potential benefit of block migration in a CMP cache is fundamentally limited by the large amount of inter-processor sharing that exists in some workloads. On the other hand, on-chip transmission lines consistently improve performance, but their limited bandwidth becomes a bottleneck when combined with on-chip prefetching. Finally, we investigate a hybrid design, which merges transmission lines, block migration, and prefetching. Adding transmission lines from each processor to the center of the NUCA cache could alleviate the deficiencies of implementing block migration or transmission lines alone.

Figure 15. Avg. L2 Hit Latency: Combining Latency Reduction Techniques (contended and uncontended cycles for CMP-DNUCA (D), CMP-TLC (T), and CMP-Hybrid (H))

Figure 16. Normalized Network Energy-Delay: oltp and ocean (no, L2-only, and L1 & L2 prefetching for CMP-SNUCA, CMP-DNUCA, CMP-TLC, and CMP-Hybrid)

Acknowledgements

We thank Alaa Alameldeen, Mark Hill, Mike Marty, Kevin Moore, Phillip Wells, Allison Holloway, Luke Yen, the Wisconsin Computer Architecture Affiliates, Virtutech AB, the Wisconsin Condor group, the Wisconsin Computer Systems Lab, and the anonymous reviewers for their comments on this work.

References

[1] V. Agarwal, S. W. Keckler, and D. Burger. The Effect of Technology Scaling on Microarchitectural Structures. Technical Report TR-00-02, Department of Computer Sciences, UT at Austin, May 2001.
[2] A. R. Alameldeen, M. M. K. Martin, C. J. Mauer, K. E. Moore, M. Xu, D. J. Sorin, M. D. Hill, and D. A. Wood. Simulating a $2M Commercial Server on a $2K PC. IEEE Computer, 36(2):50–57, Feb. 2003.
[3] A. R. Alameldeen and D. A. Wood. Variability in Architectural Simulations of Multi-threaded Workloads. In Proceedings of the Ninth IEEE Symposium on High-Performance Computer Architecture, pages 7–18, Feb. 2003.
[4] V. Aslot, M. Domeika, R. Eigenmann, G. Gaertner, W. Jones, and B. Parady. SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance. In Workshop on OpenMP Applications and Tools, pages 1–10, July 2001.
[5] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 282–293, June 2000.
[6] B. M. Beckmann and D. A. Wood. TLC: Transmission Line Caches. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2003.
[7] R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum. Scheduling and Page Migration for Multiprocessor Compute Servers. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 1994.
[8] R. T. Chang, N. Talwalkar, C. P. Yue, and S. S. Wong. Near Speed-of-Light Signaling Over On-Chip Electrical Interconnects. IEEE Journal of Solid-State Circuits, 38(5):834–838, May 2003.
[9] T.-F. Chen and J.-L. Baer. Reducing Memory Latency via Non-Blocking and Prefetching Caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 51–61, Oct. 1992.
[10] T.-F. Chen and J.-L. Baer. A Performance Study of Software and Hardware Data Prefetching Schemes. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 223–232, Apr. 1994.
[11] T.-F. Chen and J.-L. Baer. Effective Hardware-Based Data Prefetching for High Performance Processors. IEEE Transactions on Computers, 44(5):609–623, May 1995.
[12] Z. Chishti, M. D. Powell, and T. N. Vijaykumar. Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, Dec. 2003.
[13] F. Dahlgren and P. Stenström. Effectiveness of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors. In Proceedings of the First IEEE Symposium on High-Performance Computer Architecture, pages 68–77, Feb. 1995.
[14] B. Doyle, R. Arghavani, D. Barlage, S. Datta, M. Doczy, J. Kavalieros, A. Murthy, and R. Chau. Transistor Elements for 30nm Physical Gate Lengths and Beyond. Intel Technology Journal, May 2002.
[15] B. Falsafi and D. A. Wood. Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 229–240, June 1997.
[16] International Technology Roadmap for Semiconductors. ITRS 2003 Edition. Semiconductor Industry Association, 2003.
[17] E. Hagersten, A. Landin, and S. Haridi. DDM - A Cache-Only Memory Architecture. IEEE Computer, 25(9):44–54, Sept. 1992.
[18] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, 20(2):71–84, March-April 2000.
[19] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal, Feb. 2001.
[20] R. Ho, K. W. Mai, and M. A. Horowitz. The Future of Wires. Proceedings of the IEEE, 89(4):490–504, Apr. 2001.
[21] T. Horel and G. Lauterbach. UltraSPARC-III: Designing Third-Generation 64-Bit Performance. IEEE Micro, 19(3):73–85, May/June 1999.
[22] D. Joseph and D. Grunwald. Prefetching Using Markov Predictors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 252–263, June 1997.
[23] N. P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 364–373, May 1990.
[24] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro, 24(2):40–47, Mar/Apr 2004.
[25] M. Karlsson, K. E. Moore, E. Hagersten, and D. A. Wood. Memory System Behavior of Java-Based Middleware. In Proceedings of the Ninth IEEE Symposium on High-Performance Computer Architecture, pages 217–228, Feb. 2003.
[26] R. E. Kessler, R. Jooss, A. Lebeck, and M. D. Hill. Inexpensive Implementations of Set-Associativity. In Proceedings of the 16th Annual International Symposium on Computer Architecture, May 1989.
[27] C. Kim, D. Burger, and S. W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Oct. 2002.
[28] P. Kongetira. A 32-way Multithreaded SPARC Processor. In Proceedings of the 16th HotChips Symposium, Aug. 2004.
[29] G. K. Konstadinidis et al. Implementation of a Third-Generation 1.1-GHz 64-bit Microprocessor. IEEE Journal of Solid-State Circuits, 37(11):1461–1469, Nov. 2002.
[30] K. Krewell. UltraSPARC IV Mirrors Predecessor. Microprocessor Report, pages 1–3, Nov. 2003.
[31] P. Kundu, M. Annavaram, T. Diep, and J. Shen. A Case for Shared Instruction Cache on Chip Multiprocessors Running OLTP. Computer Architecture News, 32(3):11–18, 2004.
[32] C. Liu, A. Sivasubramaniam, and M. Kandemir. Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs. In Proceedings of the Tenth IEEE Symposium on High-Performance Computer Architecture, Feb. 2004.
[33] P. S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, Feb. 2002.
[34] C. J. Mauer, M. D. Hill, and D. A. Wood. Full System Timing-First Simulation. In Proceedings of the 2002 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 108–116, June 2002.
[35] C. McNairy and D. Soltis. Itanium 2 Processor Microarchitecture. IEEE Micro, 23(2):44–55, March/April 2003.
[36] H. E. Mizrahi, J.-L. Baer, E. D. Lazowska, and J. Zahorjan. Introducing Memory into the Switch Elements of Multiprocessor Interconnection Networks. In Proceedings of the 16th Annual International Symposium on Computer Architecture, May 1989.
[37] B. A. Nayfeh and K. Olukotun. Exploring the Design Space for a Shared-Cache Multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, Apr. 1994.
[38] A. Pajuelo, A. González, and M. Valero. Speculative Dynamic Vectorization. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
[39] A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides. Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
[40] K. So and R. N. Rechtschaffen. Cache Operations by MRU Change. IEEE Transactions on Computers, 37(6):700–709, June 1988.
[41] V. Soundararajan, M. Heinrich, B. Verghese, K. Gharachorloo, A. Gupta, and J. Hennessy. Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors. In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 342–355, June 1998.
[42] D. Sylvester and K. Keutzer. Getting to the Bottom of Deep Submicron II: A Global Wiring Paradigm. In Proceedings of the 1999 International Symposium on Physical Design, pages 193–200, 1999.
[43] J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 System Microarchitecture. IBM Server Group Whitepaper, Oct. 2001.
[44] J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 System Microarchitecture. IBM Journal of Research and Development, 46(1), 2002.
[45] M. Tremblay, J. Chan, S. Chaudhry, A. W. Conigliaro, and S. S. Tse. The MAJC Architecture: A Synthesis of Parallelism and Scalability. IEEE Micro, 20(6):12–25, November-December 2000.
[46] D. M. Tullsen and S. J. Eggers. Limitations of Cache Prefetching on a Bus-Based Multiprocessor. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 278–288, May 1993.
[47] J. Turley. The Essential Guide to Semiconductors. Prentice Hall, 2003.
[48] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–37, June 1995.
[49] Z. Zhang and J. Torrellas. Speeding up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 188–199, June 1995.


RE:我是这么调用的(C语言) extern void DSP28x_usDelay(long time); 在需要延时的地方加入 DSP28x_usDelay(0x100000);//根据延迟时间写入参数

STM32延时函数

#include #include "delay.h" ////////////////////////////////////////////////////////////////////////////////// //使用SysTick的普通计数模式对延迟进行管理 //包括delay_us,delay_ms //***************************************************************************** *** //V1.2修改说明 //修正了中断中调用出现死循环的错误 //防止延时不准确,采用do while结构! ////////////////////////////////////////////////////////////////////////////////// static u8 fac_us=0;//us延时倍乘数 static u16 fac_ms=0;//ms延时倍乘数 //初始化延迟函数 //SYSTICK的时钟固定为HCLK时钟的1/8 //SYSCLK:系统时钟 void delay_init(u8 SYSCLK) { SysTick->CTRL&=0xfffffffb;//bit2清空,选择外部时钟HCLK/8 fac_us=SYSCLK/8; fac_ms=(u16)fac_us*1000; } //延时nms //注意nms的范围 //SysTick->LOAD为24位寄存器,所以,最大延时为: //nms<=0xffffff*8*1000/SYSCLK //SYSCLK单位为Hz,nms单位为ms //对72M条件下,nms<=1864 void delay_ms(u16 nms) { u32 temp; SysTick->LOAD=(u32)nms*fac_ms;//时间加载(SysTick->LOAD为24bit) SysTick->VAL =0x00; //清空计数器 SysTick->CTRL=0x01 ; //开始倒数 do { temp=SysTick->CTRL; } while(temp&0x01&&!(temp&(1<<16)));//等待时间到达 SysTick->CTRL=0x00; //关闭计数器 SysTick->VAL =0X00; //清空计数器 } //延时nus //nus为要延时的us数.

单片机C延时时间怎样计算

C程序中可使用不同类型的变量来进行延时设计。经实验测试,使用unsigned char类型具有比unsigned int更优化的代码,在使用时 应该使用unsigned char作为延时变量。以某晶振为12MHz的单片 机为例,晶振为12M H z即一个机器周期为1u s。一. 500ms延时子程序 程序: void delay500ms(void) { unsigned char i,j,k; for(i=15;i>0;i--) for(j=202;j>0;j--) for(k=81;k>0;k--); } 计算分析: 程序共有三层循环 一层循环n:R5*2 = 81*2 = 162us DJNZ 2us 二层循环m:R6*(n+3) = 202*165 = 33330us DJNZ 2us + R5赋值 1us = 3us 三层循环: R7*(m+3) = 15*33333 = 499995us DJNZ 2us + R6赋值 1us = 3us

循环外: 5us 子程序调用 2us + 子程序返回2us + R7赋值 1us = 5us 延时总时间 = 三层循环 + 循环外 = 499995+5 = 500000us =500ms 计算公式:延时时间=[(2*R5+3)*R6+3]*R7+5 二. 200ms延时子程序 程序: void delay200ms(void) { unsigned char i,j,k; for(i=5;i>0;i--) for(j=132;j>0;j--) for(k=150;k>0;k--); } 三. 10ms延时子程序 程序: void delay10ms(void) { unsigned char i,j,k; for(i=5;i>0;i--) for(j=4;j>0;j--) for(k=248;k>0;k--);

delay延时教程

delay延时教程(用的是12MHz晶振的MCS-51) 一、 1)NOP指令为单周期指令 2)DJNZ指令为双周期指令 3)mov指令为单周期指令 4)子程序调用(即LCALL指令)为双周期指令 5)ret为双周期指令 states是指令周期数, sec是时间,=指令周期×states,设置好晶振频率就是准确的了 调试>设置/取消断点”设置或移除断点,也可以用鼠标在该行双击实现同样的功能 二、编程最好: 1.尽量使用unsigned型的数据结构。 2.尽量使用char型,实在不够用再用int,然后才是long。 3.如果有可能,不要用浮点型。 4.使用简洁的代码,因为按照经验,简洁的C代码往往可以生成简洁的目标代码(虽说不是在所有的情况下都成立)。 5.在do…while,while语句中,循环体内变量也采用减减方法。 三、编辑注意: 1、在C51中进行精确的延时子程序设计时,尽量不要或少在延时子程序中定义局部变量,所有的延时子程序中变量通过有参函数传递。 2、在延时子程序设计时,采用do…while,结构做循环体要比for结构做循环体好。 3、在延时子程序设计时,要进行循环体嵌套时,采用先内循环,再减减比先减减,再内循环要好。 四、a:delaytime为us级 直接调用库函数: #include// 声明了void _nop_(void); _nop_(); // 产生一条NOP指令 作用:对于延时很短的,要求在us级的,采用“_nop_”函数,这个函数相当汇编NOP指令,延时几微秒。 eg:可以在C文件中通过使用带_NOP_( )语句的函数实现,定义一系列不同的延时函数,如Delay10us( )、Delay25us( )、Delay40us( )等存放在一个自定义的C 文件中,需要时在主程序中直接调用。如延时10 μs的延时函数可编写如下: void Delay10us( ) { _NOP_( ); _NOP_( );

单片机几个典型延时函数

软件延时:(asm) 晶振12MHZ,延时1秒 程序如下: DELAY:MOV 72H,#100 LOOP3:MOV 71H,#100 LOOP1:MOV 70H,#47 LOOP0:DJNZ 70H,LOOP0 NOP DJNZ 71H,LOOP1 MOV 70H,#46 LOOP2:DJNZ 70H,LOOP2 NOP DJNZ 72H,LOOP3 MOV 70H,#48 LOOP4:DJNZ 70H,LOOP4 定时器延时: 晶振12MHZ,延时1s,定时器0工作方式为方式1 DELAY1:MOV R7,#0AH ;;晶振12MHZ,延时0.5秒 AJMP DELAY DELAY2:MOV R7,#14H ;;晶振12MHZ,延时1秒DELAY:CLR EX0 MOV TMOD,#01H ;设置定时器的工作方式为方式1 MOV TL0,#0B0H ;给定时器设置计数初始值 MOV TH0,#3CH SETB TR0 ;开启定时器 HERE:JBC TF0,NEXT1 SJMP HERE NEXT1:MOV TL0,#0B0H MOV TH0,#3CH DJNZ R7,HERE CLR TR0 ;定时器要软件清零 SETB EX0 RET

C语言延时程序: 10ms延时子程序(12MHZ)void delay10ms(void) { unsigned char i,j,k; for(i=5;i>0;i--) for(j=4;j>0;j--) for(k=248;k>0;k--); } 1s延时子程序(12MHZ)void delay1s(void) { unsigned char h,i,j,k; for(h=5;h>0;h--) for(i=4;i>0;i--) for(j=116;j>0;j--) for(k=214;k>0;k--); }

用单片机实现延时(自己经验及网上搜集).

标准的C语言中没有空语句。但在单片机的C语言编程中,经常需要用几个空指令产生短延时的效果。这在汇编语言中很容易实现,写几个nop就行了。 在keil C51中,直接调用库函数: #include // 声明了void _nop_(void; _nop_(; // 产生一条NOP指令 作用:对于延时很短的,要求在us级的,采用“_nop_”函数,这个函数相当汇编NOP指令,延时几微秒。NOP指令为单周期指令,可由晶振频率算出延时时间,对于12M晶振,延时1uS。对于延时比较长的,要求在大于10us,采用C51中的循环语句来实现。 在选择C51中循环语句时,要注意以下几个问题 第一、定义的C51中循环变量,尽量采用无符号字符型变量。 第二、在FOR循环语句中,尽量采用变量减减来做循环。 第三、在do…while,while语句中,循环体内变量也采用减减方法。 这因为在C51编译器中,对不同的循环方法,采用不同的指令来完成的。 下面举例说明: unsigned char i; for(i=0;i<255;i++; unsigned char i; for(i=255;i>0;i--;

其中,第二个循环语句C51编译后,就用DJNZ指令来完成,相当于如下指令: MOV 09H,#0FFH LOOP: DJNZ 09H,LOOP 指令相当简洁,也很好计算精确的延时时间。 同样对do…while,while循环语句中,也是如此 例: unsigned char n; n=255; do{n--} while(n; 或 n=255; while(n {n--}; 这两个循环语句经过C51编译之后,形成DJNZ来完成的方法, 故其精确时间的计算也很方便。 其三:对于要求精确延时时间更长,这时就要采用循环嵌套的方法来实现,因此,循环嵌套的方法常用于达到ms级的延时。对于循环语句同样可以采用for,do…while,while结构来完成,每个循环体内的变量仍然采用无符号字符变量。 unsigned char i,j for(i=255;i>0;i--

单片机一些常用的延时与中断问题及解决方法

延时与中断出错,是单片机新手在单片机开发应用过程中,经常会遇到的问题,本文汇总整理了包含了MCS-51系列单片机、MSP430单片机、C51单片机、8051F的单片机、avr单片机、STC89C52、PIC单片机…..在内的各种单片机常见的延时与中断问题及解决方法,希望对单片机新手们,有所帮助! 一、单片机延时问题20问 1、单片机延时程序的延时时间怎么算的? 答:如果用循环语句实现的循环,没法计算,但是可以通过软件仿真看到具体时间,但是一般精精确延时是没法用循环语句实现的。 如果想精确延时,一般需要用到定时器,延时时间与晶振有关系,单片机系统一般常选用11.059 2 MHz、12 MHz或6 MHz晶振。第一种更容易产生各种标准的波特率,后两种的一个机器周期分别为1 μs和2 μs,便于精确延时。本程序中假设使用频率为12 MHz的晶振。最长的延时时间可达216=65 536 μs。若定时器工作在方式2,则可实现极短时间的精确延时;如使用其他定时方式,则要考虑重装定时初值的时间(重装定时器初值占用2个机器周期)。 2、求个单片机89S51 12M晶振用定时器延时10分钟,控制1个灯就可以 答:可以设50ms中断一次,定时初值,TH0=0x3c、TL0=0xb0。中断20次为1S,10分钟的话,需中断12000次。计12000次后,给一IO口一个低电平(如功率不够,可再加扩展),就可控制灯了。 而且还要看你用什么语言计算了,汇编延时准确,知道单片机工作周期和循环次数即可算出,但不具有可移植性,在不同种类单片机中,汇编不通用。用c的话,由于各种软件执行效率不一样,不会太准,通常用定时器做延时或做一个不准确的延时,延时短的话,在c中使用汇编的nop做延时 3、51单片机C语言for循环延时程序时间计算,设晶振12MHz,即一个机器周期是1us。for(i=0,i<100;i++) for(j=0,j<100;j++) 我觉得时间是100*100*1us=10ms,怎么会是100ms 答: 不可能的,是不是你的编译有错的啊 我改的晶振12M,在KEIL 4.0 里面编译的,为你得出的结果最大也就是40ms,这是软件的原因, 不可能出现100ms那么大的差距,是你的软件的原因。 不信你实际编写一个秒钟,利用原理计算编写一个烧进单片机和利用软件测试的秒程序烧进单片机,你会发现原理计算的程序是正确的

STM32的几种延时方法

STM32的几种延时方法(基于MDK固件库3.0,晶振8M) 单片机编程过程中经常用到延时函数,最常用的莫过于微秒级延时delay_us()和毫秒级delay_ms()。 1.普通延时法 这个比较简单,让单片机做一些无关紧要的工作来打发时间,经常用循环来实现,不过要做的比较精准还是要下一番功夫。下面的代码是在网上搜到的,经测试延时比较精准。 //粗延时函数,微秒 void delay_us(u16 time) { u16 i=0; while(time--) { i=10; //自己定义 while(i--) ; } } //毫秒级的延时 void delay_ms(u16 time) { u16 i=0; while(time--) { i=12000; //自己定义 while(i--) ; } } 2.SysTick 定时器延时 CM3 内核的处理器,内部包含了一个SysTick定时器,SysTick是一个24 位的倒计数定时器,当计到0 时,将从RELOAD 寄存器中自动重装载定时初值。只要不把它在SysTick控制及状态寄存器中的使能位清除,就永不停息。SysTick 在STM32 的参考手册里面介绍的很简单,其详细介绍,请参阅《Cortex-M3 权威指南》。 这里面也有两种方式实现: a.中断方式 如下,定义延时时间time_delay,SysTick_Config()定义中断时间段,在中断中递减time_delay,从而实现延时。 volatile unsigned long time_delay; // 延时时间,注意定义为全局变量 //延时n_ms void delay_ms(volatile unsigned long nms) { //SYSTICK分频--1ms的系统时钟中断 if (SysTick_Config(SystemFrequency/1000))

51单片机精确延时源程序

51单片机精确延时源程序 一、晶振为 11.0592MHz,12T 1、延时 1ms: (1)汇编语言: 代码如下: DELAY1MS: ;误差 -0.651041666667us MOV R6,#04H DL0: MOV R5,#71H DJNZ R5,$ DJNZ R6,DL0 RET (2)C语言: void delay1ms(void) //误差 -0.651041666667us { unsigned char a,b; for(b=4;b>0;b--) for(a=113;a>0;a--); } 2、延时 10MS: (1)汇编语言: DELAY10MS: ;误差 -0.000000000002us MOV R6,#97H DL0: MOV R5,#1DH DJNZ R5,$ DJNZ R6,DL0

RET (2)C语言: void delay10ms(void) //误差 -0.000000000002us { unsigned char a,b; for(b=151;b>0;b--) for(a=29;a>0;a--); } 3、延时 100MS: (1)汇编语言: DELAY100MS: ;误差 -0.000000000021us MOV R7,#23H DL1: MOV R6,#0AH I

棋影淘宝店:https://www.doczj.com/doc/719097678.html,QQ:149034219 DL0: MOV R5,#82H DJNZ R5,$ DJNZ R6,DL0 DJNZ R7,DL1 RET (2)C语言: void delay100ms(void) //误差 -0.000000000021us { unsigned char a,b,c; for(c=35;c>0;c--) for(b=10;b>0;b--) for(a=130;a>0;a--); } 4、延时 1S: (1)汇编语言: DELAY1S: ;误差 -0.00000000024us MOV R7,#5FH DL1: MOV R6,#1AH DL0: MOV R5,#0B9H DJNZ R5,$ DJNZ R6,DL0 DJNZ R7,DL1 RET (2)C语言: void delay1s(void) //误差 -0.00000000024us { unsigned char a,b,c; for(c=95;c>0;c--) for(b=26;b>0;b--)

单片机精确毫秒延时函数

单片机精确毫秒延时函数 实现延时通常有两种方法:一种是硬件延时,要用到定时器/计数器,这种方法可以提高CPU的工作效率,也能做到精确延时;另一种是软件延时,这种方法主要采用循环体进行。今天主要介绍软件延时以及单片机精确毫秒延时函数。 单片机的周期介绍在电子技术中,脉冲信号是一个按一定电压幅度,一定时间间隔连续发出的脉冲信号。脉冲信号之间的时间间隔称为周期;而将在单位时间(如1秒)内所产生的脉冲个数称为频率。频率是描述周期性循环信号(包括脉冲信号)在单位时间内所出现的脉冲数量多少的计量名称;频率的标准计量单位是Hz(赫)。电脑中的系统时钟就是一个典型的频率相当精确和稳定的脉冲信号发生器。 指令周期:CPU执行一条指令所需要的时间称为指令周期,它是以机器周期为单位的,指令不同,所需的机器周期也不同。对于一些简单的的单字节指令,在取指令周期中,指令取出到指令寄存器后,立即译码执行,不再需要其它的机器周期。对于一些比较复杂的指令,例如转移指令、乘法指令,则需要两个或者两个以上的机器周期。通常含一个机器周期的指令称为单周期指令,包含两个机器周期的指令称为双周期指令。 时钟周期:也称为振荡周期,一个时钟周期= 晶振的倒数。对于单片机时钟周期,时钟周期是单片机的基本时间单位,两个振荡周期(时钟周期)组成一个状态周期。 机器周期:单片机的基本操作周期,在一个操作周期内,单片机完成一项基本操作,如取指令、存储器读/写等。 机器周期=6个状态周期=12个时钟周期。 51单片机的指令有单字节、双字节和三字节的,它们的指令周期不尽相同,一个单周期指令包含一个机器周期,即12个时钟周期,所以一条单周期指令被执行所占时间为12*(1/ 晶振频率)= x s。常用单片机的晶振为11.0592MHz,12MHz,24MHz。其中11.0592MHz 的晶振更容易产生各种标准的波特率,后两种的一个机器周期分别为1 s和2 s,便于精确延时。 单片机精确毫秒延时函数对于需要精确延时的应用场合,需要精确知道延时函数的具体延

AVR单片机常用的延时函数

AVR单片机常用的延时函数 /******************************************************************** *******/ //C header files:Delay function for AVR //MCU:ATmega8 or 16 or 32 //Version: 1.0beta //The author: /******************************************************************** *******/ #include void delay8RC_us(unsigned int time) //8Mhz内部RC震荡延时Xus { do { time--; } while(time>1); } void delay8RC_ms(unsigned int time) //8Mhz内部RC震荡延时Xms { while(time!=0) { delay8RC_us(1000); time--; } } /******************************************************************** **********/ void delay1M_1ms(void) //1Mhz延时1ms { unsigned char a,b,c; for(c=1;c>0;c--) for(b=142;b>0;b--) for(a=2;a>0;a--); } void delay1M_xms(unsigned int x) //1Mhz延时xms { unsigned int i; for(i=0;i

STM32的几种延时方法

STM32的几种延时方法(基于MDK固件库,晶振8M)单片机编程过程中经常用到延时函数,最常用的莫过于微秒级延时delay_us( )和毫秒级delay_ms( )。 1.普通延时法 这个比较简单,让单片机做一些无关紧要的工作来打发时间,经常用循环来实现,不过要做的比较精准还是要下一番功夫。下面的代码是在网上搜到的,经测试延时比较精准。 断方式 如下,定义延时时间time_delay,SysTick_Config()定义中断时间段,在中断中递减time_delay,从而实现延时。 volatile unsigned long time_delay; 中断方式 主要仿照原子的《STM32不完全手册》。SYSTICK 的时钟固定为HCLK 时钟的1/8,在这里我们选用内部时钟源72M,所以SYSTICK的时钟为9M,即SYSTICK 定时器以9M的频率递减。SysTick 主要包含CTRL、LOAD、VAL、CALIB 等4 个寄存器, 程序如下,相当于查询法。 //仿原子延时,不进入systic中断

void delay_us(u32 nus) { u32 temp; SysTick->LOAD = 9*nus; SysTick->VAL=0X00;//清空计数器 SysTick->CTRL=0X01;//使能,减到零是无动作,采用外部时钟源 do { temp=SysTick->CTRL;//读取当前倒计数值 }while((temp&0x01)&&(!(temp&(1<<16))));//等待时间到达 SysTick->CTRL=0x00; //关闭计数器 SysTick->VAL =0X00; //清空计数器 } void delay_ms(u16 nms) { u32 temp; SysTick->LOAD = 9000*nms; SysTick->VAL=0X00;//清空计数器 SysTick->CTRL=0X01;//使能,减到零是无动作,采用外部时钟源 do { temp=SysTick->CTRL;//读取当前倒计数值 }while((temp&0x01)&&(!(temp&(1<<16))));//等待时间到达 SysTick->CTRL=0x00; //关闭计数器 SysTick->VAL =0X00; //清空计数器 } 三种方式各有利弊,第一种方式容易理解,但不太精准。第二种方式采用库函数,编写简单,由于中断的存在,不利于在其他中断中调用此延时函数。第三种方式直接操作寄存器,看起来比较繁琐,其实也不难,同时克服了以上两种方式的缺点,个人感觉比较好用。

CVAVR 软件中启动delay库,调用delay_ms()函数,自动带了喂狗程序

CV A VR 软件中启动delay.h库,调用delay_ms()函数,自动带了喂狗程序 近期在学习中发现个问题,CV A VR 中启动delay.h库,调用delay_ms()函数延时,系统怎么都不复位重启,即使打开看门狗熔丝位,看门狗也不会重启,找了很久原因,发现是调用调用系统自身带的delay_ms()函数引起的,换成自己的简单延时函数,问题就解决,看门狗可以正常工作,后面附带我自己写的简单延时函数。 后来查找问题,发现系统中的delay_ms()函数自带了喂狗程序,所以不会自动的重启,请大家放心使用,用延时函数看门狗不溢出是正常的。后面附带软件编辑后生产的汇编程序,一看就知道确实带了喂狗。 今天写出来供大家注意,不要犯我同样的问题。 /***************************************************** This program was produced by the CodeWizardA VR V1.25.9 Standard Chip type : A Tmega8L Program type : Application Clock frequency : 1.000000 MHz Memory model : Small External SRAM size : 0 Data Stack size : 256 *****************************************************/ #include #include // Declare your global variables here void main(void) { // Declare your local variables here // Input/Output Ports initialization // Port B initialization // Func7=In Func6=In Func5=In Func4=In Func3=In Func2=In Func1=In Func0=In // State7=T State6=T State5=T State4=T State3=T State2=T State1=T State0=T PORTB=0x00; DDRB=0x00; // Port C initialization // Func6=In Func5=In Func4=In Func3=In Func2=In Func1=In Func0=In // State6=T State5=T State4=T State3=T State2=T State1=T State0=T PORTC=0x00; DDRC=0x00; // Port D initialization // Func7=In Func6=In Func5=In Func4=In Func3=In Func2=In Func1=In Func0=In // State7=T State6=T State5=T State4=T State3=T State2=T State1=T

单片机写延时程序的几种方法

单片机写延时程序的几种方法 1)空操作延時(12MHz) void delay10us() { _NOP_(); _NOP_(); _NOP_(); _NOP_(); _NOP_(); _NOP_(); } 2)循環延時 (12MHz) Void delay500ms() { unsigned char i,j,k; for(i=15;i>;0;i--) for(j=202;j>;0;j--) for(k=81;k>;0;k--); }

延時總時間=[(k*2+3)*j+3]*i+5 k*2+3=165 us 165*j+3=33333 us 33333*i+5=500000 us=500 ms 3)計時器中斷延時(工作方式2) (12MHz) #include; sbit led=P1^0; unsigned int num=0; void main() { TMOD=0x02; TH0=6; TL0=6; EA=1; ET0=1; TR0=1; while(1) { if(num==4000) { num=0;

led=~led; } } } void T0_time() interrupt 1 { num++; } 4)C程序嵌入組合語言延時 #pragma asm …… 組合語言程序段 …… #pragma endasm KEIL軟件仿真測量延時程序延時時間

這是前段事件總結之延時程序、由於不懂組合語言,故NO.4無程序。希望對你有幫助!!! 對於12MHz晶振,機器周期為1uS,在執行該for循環延時程式的時候 Void delay500ms() { unsigned char i,j,k; for(i=15;i>;0;i--) for(j=202;j>;0;j--) for(k=81;k>;0;k--); } 賦值需要1個機器周期,跳轉需要2個機器周期,執行一次for循環的空操作需要2個機器周期,那么,對於第三階循環 for(k=81;k>;0;k--);,從第二階跳轉到第三階需要2機器周期,賦值需要1個機器周期,執行81次則需要2*81個機器周期,執行一次二階for循環的事件為81*2+1+2;執行了220次,則(81*2+3)*220+3,執行15次一階循環,則 [(81*2+3)*220+3]*15,由於不需要從上階跳往下階,則只加賦值的一個機器周期,另外進入該延時子函數和跳出該函數均需要2個機器周期,故

51单片机的几种精确延时

51单片机的几种精确延时.txt这是一个禁忌相继崩溃的时代,没人拦得着你,只有你自己拦着自己,你的禁忌越多成就就越少。自卑有多种档次,最高档次的自卑表现为吹嘘自己干什么都是天才。51单片机的几种精确延时实现延时通常有两种方法:一种是硬件延时,要用到定时器/计数器,这种方法可以提高CPU的工作效率,也能做到精确延时;另一种是软件延时,这种方法主要采用循环体进行。 1 使用定时器/计数器实现精确延时 单片机系统一般常选用11.059 2 MHz、12 MHz或6 MHz晶振。第一种更容易产生各种标准的波特率,后两种的一个机器周期分别为1 μs和2 μs,便于精确延时。本程序中假设使用频率为12 MHz的晶振。最长的延时时间可达216=65 536 μs。若定时器工作在方式2,则可实现极短时间的精确延时;如使用其他定时方式,则要考虑重装定时初值的时间(重装定时器初值占用2个机器周期)。 在实际应用中,定时常采用中断方式,如进行适当的循环可实现几秒甚至更长时间的延时。使用定时器/计数器延时从程序的执行效率和稳定性两方面考虑都是最佳的方案。但应该注意,C51编写的中断服务程序编译后会自动加上PUSH ACC、PUSH PSW、POP PSW和POP ACC 语句,执行时占用了4个机器周期;如程序中还有计数值加1语句,则又会占用1个机器周期。这些语句所消耗的时间在计算定时初值时要考虑进去,从初值中减去以达到最小误差的目的。 2 软件延时与时间计算 在很多情况下,定时器/计数器经常被用作其他用途,这时候就只能用软件方法延时。下面介绍几种软件延时的方法。 2.1 短暂延时 可以在C文件中通过使用带_NOP_( )语句的函数实现,定义一系列不同的延时函数,如Delay10us( )、Delay25us( )、Delay40us( )等存放在一个自定义的C文件中,需要时在主程序中直接调用。如延时10 μs的延时函数可编写如下: void Delay10us( ) { _NOP_( ); _NOP_( ); _NOP_( ); _NOP_( ); _NOP_( ); _NOP_( ); } Delay10us( )函数中共用了6个_NOP_( )语句,每个语句执行时间为1 μs。主函数调用Delay10us( )时,先执行一个LCALL指令(2 μs),然后执行6个_NOP_( )语句(6 μs),最后执行了一个RET指令(2 μs),所以执行上述函数时共需要10 μs。可以把这一函数

相关主题
文本预览
相关文档 最新文档