Cache Performance Optimizations for Parallel Lattice Boltzmann Codes
Intermediately Executed Code is the Key to Find Refactorings that Improve Temporal Data LocalityKristof BeylsElectronics and Information Systems(ELIS),Ghent University,Sint-Pietersnieuwstraat41,B-9000Gent,Belgiumkristof.beyls@elis.UGent.beErik H.D’Hollander Electronics and Information Systems(ELIS)Ghent University,Sint-Pietersnieuwstraat41,B-9000Gent,Belgiumerik.dhollander@elis.UGent.beABSTRACTThe growing speed gap between memory and processor makes an efficient use of the cache ever more important to reach high performance.One of the most important ways to im-prove cache behavior is to increase the data locality.While many cache analysis tools have been developed,most of them only indicate the locations in the code where cache misses occur.Often,optimizing the program,even after pin-pointing the cache bottlenecks in the source code,remains hard with these tools.In this paper,we present two related tools that not only pinpoint the locations of cache misses,but also suggest source code refactorings which improve temporal locality and thereby eliminate the majority of the cache misses.In both tools, the key tofind the appropriate refactorings is an analysis of the code executed between a data use and the next use of the same data,which we call the Intermediately Executed Code (IEC).Thefirst tool,the Reuse Distance VISualizer(RD-VIS),performs a clustering on the IECs,which reduces the amount of work tofind required refactorings.The second tool,SLO(short for“Suggestions for Locality Optimiza-tions”),suggests a number of refactorings by analyzing the call graph and loop structure of the ing these tools, we have pinpointed the most important optimizations for a number of SPEC2000programs,resulting in an average speedup of2.3on a number of different platforms. Categories and Subject DescriptorsD.3.4[Programming Languages]:Processors—Compil-ers,Debuggers,Optimization; D.2.8[Software Engineer-ing]:Metrics—Performance measuresGeneral TermsPerformance,Measurement,LanguagesPermission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on thefirst page.To copy otherwise,to republish,to post on servers or to redistribute to lists,requires prior specific permission and/or a fee.CF’06,May3–5,2006,Ischia,Italy.Copyright2006ACM1-59593-302-6/06/0005...$5.00.KeywordsTemporal Data Locality,Program Analysis,Refactoring, Program Optimizations,Performance Debugger,Loop Trans-formations1.INTRODUCTIONThe widening speed gap between processor and main mem-ory makes low cache miss rates ever more important.The major classes of cache misses are conflict and capacity misses. 
While conflict misses are caused by conflicts in the internal cache structure,capacity misses are caused by poor tempo-ral or spatial locality.In this paper,we propose two tools that help to identify the underlying reason of poor temporal data locality in the source code.1.1Related WorkIn recent years,compiler methods have been devised to automatically increase spatial data locality,by transform-ing the data layout of arrays and structures,so that data accessed close together in time also lays close together in the address space[9,11,17,19,22,33].On the other hand, temporal locality can only be improved by reordering the memory accesses so that the same addresses are accessed closer together.Advanced compiler methods to do this all target specific code patterns such as affine array expressions in regular loop nests[11,18,22],or specific sparse matrix computations[14,15,24,27].For more general program constructs,fully-automatic optimization seems to be very hard,mainly due to the difficulty of the required dependence analysis.Therefore,cache and data locality analysis tools and visualizers are needed to help programmers to refactor their programs for improved temporal locality.void ex(double*X,double*Y,int len,int N){int i,j,k;for(i=0;i<N;i++){for(j=1;j<len;j++)Y[j]=Y[j]*X[i];//39%of cache misses for(k=1;k<len;k+=2)Y[k]=(Y[k]+Y[k-1])/2.0;//61%of cache misses }}Figure1:First motivating example,view on cache misses given by traditional tools.N=10,len=100001.(c) ACM, 2006. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Computing Frontiers, May 2006. /10.1145/1128022.1128071(a)view of reference pairs with long-distance reuse inRDVIS.(b)histogram of long-distance reuses.Gray scales cor-respond to the arrows in(a).(c)graphical view of intermediately executed code in RDVIS,and associated clusteranalysis.(d)view of Intermediately Executed Code of resp.the light and the dark gray cluster in (c).Figure 2:First motivating example,views produced by RDVIS.The colors were manually changed to gray scale,to make the results readable in this black-and-white copy of the paper.Most existing cache and data locality analyzers measure the locality or the cache misses and indicate at which lo-cations in the source code,or for which data structures,most cache misses occur [2,4,5,8,12,13,20,21,23,28,29,32].While this information is helpful in identifying the main bottlenecks in the program,it can still be difficult to deduce a suitable program transformation from it.In this regard,a few of these tools provide additional support for finding the underlying cause of conflict misses (e.g.CVT[28],CacheVIZ[32],YACO[25])or the underlying cause of poor spatial locality (e.g.SIP[4]).In contrast,we present a method to help identify the un-derlying causes of poor temporal data locality.Basically,poor temporal locality results when a large amount of other data is accessed between two consecutive uses of the same data.Improving the locality requires diminishing the vol-ume of data accessed between use and reuse.The source code executed between use and reuse is responsible for ac-cessing the large data volume,resulting in a long reuse dis-tance .That source code is called the Intermediately Exe-cuted Code (IEC)of that reuse.Consequently,to improve the temporal data locality,a refactoring of the IEC is re-quired.In this paper,we present two tools that analyze the IEC in different ways to pinpoint the required 
refactorings:RD-VIS (Reuse Distance VISualizer),which has been discussed earlier in [7],and SLO (Suggestions for Locality Optimiza-tions).RDVIS represents the IEC as a set of basic blocks ex-ecuted between long-distance reuses.In a typical program,there are a huge number of data reuses,and consequently a huge number of corresponding IECs.RDVIS applies a cluster analysis to the IECs so that the main patterns of poor locality-generating source code are revealed.Based on a visual representation of the resulting clusters and high-lighting of the corresponding source code,the programmer can deduce the necessary program optimizations.In SLO,the loop structure and the call graph of the IEC is also taken into account,allowing it to go one step further than RDVIS.SLO pinpoints the exact source code refactorings that are needed to improve locality.Examples of such refac-torings are loop tiling,and computation fusion,which are demonstrated in the motivating examples in section 2.In section 3,reuse distances and associated terms are defined.Section 4describes how RDVIS analyzes the IEC.Section 5presents the analyses performed by SLO on the IEC to find the appropriate source code refactorings.In section 6,we provide a few case studies where these tools have been used to identify the required refactorings for a number of real-world programs from the SPEC2000benchmarks.For two of them,we applied the necessary transformations,leading to an average cross-platform speedup of about 2.3.Con-cluding remarks are given in section 7.2.MOTIV ATING EXAMPLESWe start by showing two small code examples where the indication of cache misses with traditional tools does not clearly reveal how to optimize the programs.Furthermore,we show how RDVIS and SLO visualize the intermediately executed code of long-distance reuses,and how that makes it easier to find source code refactorings that improve temporal locality.2.1Example 1:Intra-Procedural Loop ReusesThe code in figure 1shows a small piece of code,where a traditional tool would show that the first statement is responsible for about 39%of all cache misses and the second statement produces 61%of them.While this information indicates where cache misses occur,it is not directly clear how the locality of the program can be improved to diminish the number of cache misses.The views produced by our tools are shown in figure 2for RDVIS,and in figure 3for SLO.For each pair of refer-ences that generate many long-distance reuses,an arrow is drawn,starting at the reference that accesses the data first,and pointing to the reference that reuses that data after a long time.Figure 2(a)shows the four pairs of references that generate the majority of long-distance reuses:(Y[k],Y[j]),(Y[k-1],Y[j]),(Y[j],Y[k])and (Y[j],Y[k-1]).Figure 2(b)shows that each of those 4pairs generate about the same amount of long-distance reuses at distance 217,meaning that about 217other elements are accessed between those reuses.(In this example,N =10and len =100001).When the cache can contain 210elements (as indicated by the background),all the reuses at a larger distance lead to cache misses.So,the reuses at distance 217must be made smaller than 210,in other words,largely diminishing the amount of data accessed between use and reuse.To optimize each of the four arrows in figure 2(a),the first step is to pinpoint which code is responsible for generating the accesses between use and reuse.The second step is to refactor the code so that fewer data elements are accessed between use and reuse.RDVIS records the 
basic blocks executed between each use and reuse,and allows to visu-ally indicate the corresponding source code for each arrow.Besides an evaluation examining each arrow separately,RD-VIS also contains a cluster analysis.The arrows with similar IEC are put in the same cluster.As an example,figure 2(c)shows how RDVIS graphically represents the result of the cluster analysis.On the left hand side,the code executed(a)5different optimizations indicated by gray scale,with respect to the reuse distance of the reuses they optimize,as shown by SLO(b)Indication of the two optimizations for the reuses at distance 217,as indicated by SLO.The light gray op-timization indicates fusion of the two inner loops.The dark gray optimization requires tiling the outer loop.Figure 3:First motivating example,SLO view.The colors were manually altered to gray scale,to make the results readable in this black-and-white copy of the paper.between use and reuse is graphically represented.There are four horizontal bands,respectively representing the IEC of the four arrows in figure 2(a).In each band,the basic blocks in the program are represented,left to right.If a basic block is executed between use and reuse,it is colored in a shade of gray,otherwise it is nearly white.Originally,RDVIS pro-duces a colored view with higher contrast.Here,the colors were converted to enhance readability in black-and-white.Figure 2(c)shows that the code executed between use and reuse of arrows 1and 2are identical.Also the code exe-Figure4:Code and left-over long reuse distance after loop fusion.cuted between use and reuse of arrow3and4are identical. On the right hand side,the cluster dendrogram graphically indicates how“similar”the IEC is for each arrow.In this example,the user has manually selected two subclusters.It shows that52.6%of the long distance reuses are generated by the light gray cluster,while47.4%are generated by the dark gray cluster.Furthermore,infigure2(d),the IEC for the two clusters has been highlighted in the source code by RDVIS.The code that is executed between use and reuse is highlighted in bold.This shows that for the light gray cluster,the uses occur in the j-loop,while the reuses occur in the k-loop.Both the use and the reuse occur in the same iteration of the i-loop,since the loop control code:i<N; i++is not highlighted.These two arrows can be optimized by loop fusion,as is discussed in detail below.In the dark gray cluster,it shows that the control of loop i:i<N;i++ is executed between use and reuse.Hence the use and reuse occur in different iterations of the outer i-loop.The expe-rienced RDVIS-user recognizes from this pattern that loop tiling needs to be applied,as discussed in more detail below. 
In contrast to RDVIS,where the programmer needs to examine the Intermediately Executed Code to pinpoint op-timizations,SLO analyzes the IEC itself,and interactively indicates the optimizations that are needed.For example, infigure3(b),the required loop fusion and loop tiling are indicated by a bar on the left hand side.Furthermore,the histogram produced by SLO indicates which reuses can be optimized by which optimization in different colors,e.g.see figure3(a).The upper histogram shows the absolute num-ber of reuses at a given distance.The bottom histogram Figure5:Code and left-over long reuse distance after loop tiling.shows the fraction of reuses at a given distance that can be optimized by each transformation.Below,we explain how loop fusion and loop tiling can be used to improve the locality and performance.These two transformations are the most important optimizations for improving temporal locality in loops.2.1.1Optimizing Pattern1:Loop FusionFrom both the views produced by RDVIS(fig.2(d)at the top)and SLO(fig.3(b)at the top),it shows that about half of the long-distance reuses occur because element Y[j] is used in thefirst loop,and it is reused by references Y[k] and Y[k-1]in the second loop.The distance is long because between the reuses,all other elements of array Y are accessed by the same loops.For this pattern,the reuse distance can be reduced by loop fusion:instead of running over array Y twice,the computations from both loops are performed in one run over the array.In order to fuse the loops,thefirst loop is unrolled twice,after which they are fused,under the assumption that variable len is odd,resulting in the code in figure4.The histogram in thefigure shows that the long-distance reuses targeted have all been shortened to distances smaller than25.This results in a speedup of about1.9ona Pentium4system,due to fewer cache misses,see table1.2.1.2Optimizing Pattern2:Loop TilingAfter fusing the inner loops,the code can be analyzed again for the causes of the remaining long reuse distance patterns.Figure4shows how SLO indicates that all left-version exec.time speeduporig0.183sfused0.098s 1.87fused+tiled0.032s 5.72Table1:Running times and speedups of the code before and after optimizations,on a2.66Ghz Pen-tium4,for N=10,len=1000001.1double inproduct(double*X,double*Y,int len){ int i;double result=0.0;for(i=0;i<len;i++)result+=X[i]*Y[i];//50%of cache misses 5return result;}double sum(double*X,int len){int i;double result=0.0;10for(i=0;i<len;i++)result+=X[i];//50%of cache missesreturn result;}15double prodsum(double*X,double*Y,int len){ double inp=inproduct(X,Y,len);double sumX=sum(X,len);double sumY=sum(Y,len);return inp+sumX+sumY;20}Figure6:View on cache misses as provided by most traditional tools for the second example.over long reuse distances occur because the use is in one iteration of the i-loop,and the reuse is in a later iteration. Consequently,the tool indicates that the i-loop should be tiled,by displaying a bar to the left of the loop source code. 
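The fused loop of Figure 4 can be sketched as follows for the code of Figure 1: the j-loop is unrolled by two and merged with the k-loop, under the stated assumption that len is odd. This is a reconstruction for illustration rather than the paper's exact code; the tiling step discussed next (Figure 5) additionally chunks this fused loop so that only a bounded number of k-iterations (tilesize of them) are processed per pass of the i-loop.

/* Original loops from Figure 1 (39% / 61% of the cache misses):
 *   for (j = 1; j < len; j++)     Y[j] = Y[j] * X[i];
 *   for (k = 1; k < len; k += 2)  Y[k] = (Y[k] + Y[k-1]) / 2.0;
 *
 * Fused version (cf. Figure 4), assuming len is odd so that unrolling the
 * first loop by two lines up with the stride-2 second loop.  The reuse of
 * Y[k] and Y[k-1] between the two original loops now happens within a
 * single iteration, so its distance drops well below the cache size. */
void ex_fused(double *X, double *Y, int len, int N) {
    int i, k;
    for (i = 0; i < N; i++) {
        for (k = 1; k < len; k += 2) {
            Y[k]     = Y[k]     * X[i];        /* iteration j = k of the first loop   */
            Y[k + 1] = Y[k + 1] * X[i];        /* iteration j = k+1 (unrolled by two) */
            Y[k] = (Y[k] + Y[k - 1]) / 2.0;    /* matching iteration of the k-loop    */
        }
    }
}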
Loop tiling is applied when the long-distance reuses occur between different iterations of a single outer loop.When this occurs,it means that in a single iteration of that loop,more data is accessed than canfit in the cache.The principle idea behind loop tiling is to process less data in one iteration of the loop,so that data can be retained in the cache between several iterations of the loop.Figure5shows the code after tiling.Now,the inner j-loop executes at most50iterations (see variable tilesize),and hence the amount of data ac-cessed in the inner loop is limited.As a result,the reuses between different iterations of the i-loop are shortened from a distance of217to a distance between27and29,see the histograms in Figures4and5.Note that some reuses have increased in size:1in50reuses between iterations of the j-loop infigure4have increased from24–25to29–210(see dark bars infigure5).This is because1in50reuses in the original j-loop are now between iterations of the outer jj-loop.The end result is the removal of all long-distance reuses.As a result,the overall measured program speedup is5.7,see table1.2.2Example2:Inter-Procedural ReusesThe second example is shown infigure6.The code in function prodsumfirst calculates the inproduct of two arrays by calling inproduct,after which the sum of all elements in both arrays is computed by calling function sum.Most existing tools would show,in one way or another,that halfof the misses occur on line4,and the other half are causedby the code on line11.In contrast,RDVIS shows two reference pairs,indicatedby arrows,that lead to long distance reuses,seefigure7.By examining the highlighted code carefully,the program-mer canfind that uses occur in the call to inproduct,while reuses occur in one of the two calls to sum.Here,the pro-grammer must perform an interprocedural analysis of the IEC.SLO,on the other hand,performs the interprocedu-ral analysis for the programmer,and visualizes the result as shown infigure8.It clearly identifies that for half of the long-distances reuses,inproduct must be fused with thefirst call to sum,and for the other half inproduct must be fused with the second call to sum.3.BASIC DEFINITIONSIn this section,we review the basic terms and definitions that are used to characterize reuses in a program.Definition1.A memory access a x is a single access to memory,that accesses address x.A memory reference r is the source code construct that leads to a memory instructionat compile-time,which in turn generates memory accessesat run-time.The reference that generates memory access a xis denoted by ref(a x).The address accessed be a memory access is denoted by addr(a x),i.e.addr(a x)=x.Definition2.A memory access trace T is a sequence of memory accesses,indexed by a logical time.The differencein time between consecutive accesses in a trace is1.The time of an access a x is denoted by T[a x].Definition3.A reuse pair a x,a x is a pair of memory accesses in a trace such that both accesses address the same data,and there are no intervening accesses to that data. The use of a reuse pair is thefirst access in the pair;the reuse is the second access.A reference pair(r1,r2)is a pair of memory references. 
The reuse pairs associated with a reference pair (r1, r2) is the set of reuse pairs for which the use is generated by r1 and the reuse is generated by r2, and is denoted by reuses(r1, r2).

Definition 4. The Intermediately Executed Code (IEC) of a reuse pair ⟨a_x, a'_x⟩ is the code executed between T[a_x] and T[a'_x].

Definition 5. The reuse distance of a reuse pair from a trace is the number of unique memory addresses in that trace between use and reuse. Cache misses are identified by the reuses that have a distance larger than the cache size [6].

4. RDVIS: IEC ANALYSIS BY BASIC BLOCK VECTOR CLUSTERING

In RDVIS, the Intermediately Executed Code is represented by a basic block vector:

Definition 6. The basic block vector of a reuse pair ⟨a_x, a'_x⟩, denoted by BBV(⟨a_x, a'_x⟩), is a vector ∈ {0,1}^n, where n is the number of basic blocks in the program. When a basic block is executed between use and reuse, the corresponding vector element is 1, otherwise it is 0.

Figure 7: Indication of intermediately executed code by RDVIS. (a) IEC for first reference pair. (b) IEC for second reference pair.

Figure 8: Indication of locality optimizations by SLO. (a) Two required fusions of functions indicated by arrows. (b) The reuse distance histogram for the reuses optimized by the two arrows in (a), for len = 1000000.

The basic block vector of a reference pair (r1, r2), denoted by BBV((r1, r2)), is a vector ∈ [0,1]^n. The value of a vector element is the fraction of reuse pairs in reuses(r1, r2) for which the basic block is executed between use and reuse. More formally:

BBV((r1, r2)) = ( Σ_{⟨a_x, a'_x⟩ ∈ reuses(r1, r2)} BBV(⟨a_x, a'_x⟩) ) / #reuses(r1, r2)

In RDVIS, reference pairs are visually represented by arrows drawn on top of the source code, e.g. figure 2. The tool allows to highlight the code executed between use and reuse for each individual arrow. Additionally, RDVIS clusters arrows according to the similarity of their IEC.

The similarity (or rather dissimilarity) of the code executed between two reference pairs is computed as the Manhattan distance of the corresponding basic block vectors in the vector space [0,1]^n. When exactly the same code is executed between the reuses, the distance is 0; when the code is completely dissimilar, the distance is n. Based on the Manhattan distance, an agglomerative clustering is performed, which proceeds as follows. First, each reference pair forms a separate cluster. Then, iteratively, the two closest clusters are merged into a single cluster. The basic block vector corresponding with the new cluster is the average of the two basic block vectors that represent the merged clusters. The clustering stops when all reference pairs are combined into one large cluster. The distances between different subclusters are shown graphically in the dendrogram, and the user selects "interesting-looking" or "tight" subclusters; a minimal sketch of this distance and merge rule follows.
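The sketch below is written in C purely to make the dissimilarity measure and the merge rule concrete; it is not RDVIS source code, and it assumes the basic block vectors are stored as plain arrays of length n.

#include <math.h>
#include <stddef.h>

/* Manhattan distance between two basic block vectors in [0,1]^n:
 * 0 means the same code was executed between use and reuse,
 * n means the executed code was completely disjoint. */
static double bbv_distance(const double *bbv1, const double *bbv2, size_t n) {
    double d = 0.0;
    for (size_t b = 0; b < n; b++)
        d += fabs(bbv1[b] - bbv2[b]);
    return d;
}

/* One agglomerative step: the merged cluster is represented by the
 * average of the two representative vectors, as described in the text. */
static void bbv_merge(const double *c1, const double *c2, double *merged, size_t n) {
    for (size_t b = 0; b < n; b++)
        merged[b] = 0.5 * (c1[b] + c2[b]);
}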
E.g.in figure 2(c),the user selected two very tight subclusters:the light gray and the dark gray subcluster.Since similar code is executed between use and reuse in a tight subcluster,it is likely that the long-distance reference pairs can be optimized by the same refactoring,e.g.see figure 2(d).5.SLO:IEC ANALYSIS BY INTERPROCE-DURAL CONTROL FLOW INSPECTIONSLO aims to improve on RDVIS by analyzing the IEC further and automatically pinpoint the refactorings that are necessary to improve temporal locality,even in an interpro-cedural context.To make this possible,SLO tracks the loop headers (i.e.the basic blocks that control whether a loop body is executed [1])and the functions that are executed between use and reuse,using the following framework.5.1Step1:Determining the Least CommonAncestor FunctionFigure9:The Least Common Ancestor Frame (LCAF)of a reuse,indicated in the activation tree. The activation tree represents a given time during the execution of the code infigure6,assuming that the use occurs inside function inproduct,and the reuse occurs inside sum.SLO proceeds byfirst determining the function in which the refactoring must be applied.In a second step,the exact refactoring on which part of that function’s code is com-puted.The refactoring must be applied in the“smallest”function in which both the use and the reuse can be seen. This is formalized by the following definitions,and illus-trated infigure9.Definition7.The activation tree[1]of a running pro-gram is a tree with a node for every function call at run-time and edges pointing from callers to callees.The use site of a reuse pair a x,a x is the node cor-responding to the function invocation in which access a x occurs.The reuse site is the node where access a x occurs. The Least Common Ancestor Frame(LCAF)of a reuse pair a x,a x is the least common ancestor in the acti-vation tree of the use site and the reuse site of a x,a x .The Least Common Ancestor Function is the function that corresponds to the least common ancestor frame.The LCAF is the function where some refactoring is needed to bring use and reuse closer together.Once the LCAF has been determined,the loop structure of the LCAF is exam-ined,and the basic blocks in the LCAF executed between use and reuse.Definition8.The basic block in the LCAF,in which the use occurred(directly or indirectly through a function call), is called the Use Basic Block(UseBB)of a x,a x ;the basic block that contains the reuse is called the Reuse Ba-sic Block(ReuseBB)of a x,a x .5.2Step2:Analyzing the Control FlowStructure in the Least Common AncestorFunctionThe key to the analysis isfinding the loops that“carry”the reuses.These loops are found by determining the Non-nested Use and Non-nested Reuse Basic Blocks,as defined below(illustrated infigure10):Definition9.The Nested Loop Forest of a function is a graph,where each node represents a basic block in the function,and there are edges from a loop header to each basic block directly controlled by that loop header.The Outermost Executed Loop Header(OELH)ofa basic block BB with respect to a given reuse pair a x,a x is the unique ancestor of BB in the nested loop forest that has been executed between use a x and reuse a x, but does not have ancestors itself that are executed between use and reuse.The Non-nested Use Basic Block(NNUBB)of a x,a x is the OELH of the use basic block of a x,a x .The Non-nested Reuse Basic Block(NNRBB)of a x,a x is the OELH of the reuse basic block of a x,a x .5.3Step3:Determining the RequiredRefactoringRefactorings are 
determined by analyzing the NNUBB and NNRBB.We subdivide in3different patterns:Pattern1:Reuse occurs between iterations of a single loop.This occurs when NNUBB=NNRBB,and they are loop headers.Consequently,a single loop carries the reuses. This pattern arises when the loop traverses a“data struc-ture”1in every iteration of the loop.The distance of reuses across iterations can be made smaller by ensuring that onlya small part of the data structure is traversed in any given iteration.As such,reuses of data elements between consecu-tive iterations are separated by only a small amount of data, instead of the complete data structure.A number of transformations have been proposed to in-crease temporal locality in this way,e.g.loop tiling[26,30], data shackling[18],time skewing[31],loop chunking[3], data tiling[16]and sparse tiling[27].We call these transfor-mations tiling-like optimizations.An extreme case of sucha tiling-like optimization is loop permutation[22],where in-ner and outer loops are swapped,so that the long-distance accesses in different iterations of the outer loop become short-distance accesses between iterations of the inner loop. Examples of occurrences of this pattern are indicated by bars with the word“TILE L...”infigures3,4and5. Pattern2:Use is in one loop nest,the reuse in an-other.When NNUBB and NNRBB are different loop head-ers,reuses occur between different loops.The code tra-verses a data structure in the loop indicated by the NNUBB. The data structure is retraversed in the NNRBB-loop.The reuses can be brought closer together by only doing a sin-gle traversal,performing computations from both loops at the same time.This kind of optimization is known as loop fusion.We call the required transformation a fusion-like optimization.Examples of this pattern are indicated by bars with the word“FUSE L...”infigure3.Pattern3:NNUBB and NNRBB are not both loop head-ers.When one of NNUBB or NNRBB are not loop head-ers,it means that either the use or the reuse is not insidea loop in the LCAF.It indicates that data is accessed in one basic block(possibly indirectly through a function call), and the other access may or may not be in a loop.So, the reused data structure is traversed twice by two separate code pieces.In this case,bringing use and reuse closer to-gether requires that the computations done in the NNUBB and in the NNRBB are“fused”so that the data structure is1the data structure could be as small a single scalar variableor as large as all the data in the program。
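To make Definition 5 concrete, the following deliberately naive C sketch computes reuse distances directly from a trace of addresses. It is for illustration only (quadratic in the trace length) and says nothing about the instrumentation that RDVIS and SLO actually use.

#include <stdio.h>
#include <stddef.h>

/* Reuse distance of the access at position t in the trace (Definition 5):
 * the number of distinct addresses accessed since the previous access to
 * the same address.  Returns -1 for a cold access (no previous use).
 * A reuse whose distance exceeds the number of cache lines corresponds to
 * a cache miss, as stated in the text. */
static long reuse_distance(const unsigned long *trace, size_t t) {
    size_t prev = 0;
    int found = 0;
    for (size_t i = t; i-- > 0; ) {               /* previous access to the same address */
        if (trace[i] == trace[t]) { prev = i; found = 1; break; }
    }
    if (!found)
        return -1;
    long distinct = 0;                            /* count unique addresses in between */
    for (size_t j = prev + 1; j < t; j++) {
        int seen = 0;
        for (size_t k = prev + 1; k < j; k++)
            if (trace[k] == trace[j]) { seen = 1; break; }
        if (!seen)
            distinct++;
    }
    return distinct;
}

int main(void) {
    unsigned long trace[] = { 0x10, 0x20, 0x30, 0x20, 0x10 };
    for (size_t t = 0; t < sizeof trace / sizeof trace[0]; t++)
        printf("access %zu: reuse distance %ld\n", t, reuse_distance(trace, t));
    return 0;   /* the last access (0x10) has distance 2: {0x20, 0x30} */
}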
A perf performance-analysis example: using perf to improve cache utilization

Abstract: This article explains how to use perf to observe where a program wastes the cache, and then how to optimize the program to raise its cache hit rate. It also covers several common techniques for improving cache utilization.

1. Program locality

A well-written program usually exhibits locality of reference: it tends to reference data items that were referenced recently, or data located near them. The former is temporal locality, the latter spatial locality. Modern designs, from the hardware through the operating system to the applications, all exploit the principle of locality. At the hardware level, caches hold recently used instructions and data to make memory accesses more efficient. At the operating-system level, main memory is used to cache recently accessed disk blocks. At the application level, web browsers keep recently referenced documents on disk, and many web servers keep recently accessed documents on front-end disks; these caches can serve a large number of requests without involving the origin server. This article is concerned with locality at the hardware level.

2. The processor memory hierarchy

From the innermost level outward, the storage hierarchy of a computer consists of registers, caches (L1, L2, L3), main memory, disk, and remote file systems; moving outward, access latency grows and capacity increases. During execution, data first resides on disk, is then fetched into main memory, and is finally consumed by the CPU, usually (but not necessarily) through the cache. If the data is not in the cache and the CPU has to fetch it from main memory, that is a cache miss, and it carries a considerable time penalty.

3. How perf works, in brief

Perf is the system performance profiling tool that ships with the Linux kernel. Its advantage is its tight integration with the kernel: it is typically the first tool that can exploit new features added to the kernel. Perf can be used to find hot functions and to inspect the cache-miss ratio, and thereby helps developers tune program performance. Performance-tuning tools such as perf and OProfile all work by sampling the monitored target; in the simplest case, sampling is driven by the tick interrupt, i.e. a sample point is triggered inside the tick interrupt, and at each sample point the program's current context is examined.
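As a minimal example (not taken from the original article) of the kind of locality problem perf can expose, the two routines below sum the same matrix in row order and in column order. On a typical machine the column-order walk has a far higher cache-miss rate, and comparing the two runs with something like `perf stat -e cache-references,cache-misses` makes the difference visible (these are generic perf events; exact availability and naming vary by CPU).

#include <stdio.h>

#define N 2048

/* 2048 x 2048 doubles = 32 MB, far larger than any cache level. */
static double a[N][N];

/* Row-order walk: consecutive addresses, good spatial locality. */
static double sum_by_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-order walk: stride of N doubles, roughly one cache miss per
 * element once the working set exceeds the cache. */
static double sum_by_columns(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    /* Build with optimization, then compare, e.g.:
     *   perf stat -e cache-references,cache-misses ./a.out */
    printf("rows:    %f\n", sum_by_rows());
    printf("columns: %f\n", sum_by_columns());
    return 0;
}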
An overview of Akamai's product lines

[Author] This article lists the products in each of Akamai's major product lines and gives a brief introduction to them; from the product lines one can get a rough sense of Akamai's strategic direction and of how deeply its existing solutions have penetrated the market.

Akamai's offerings fall roughly into five product lines: web performance solutions, media delivery solutions, cloud security solutions, cloud networking solutions, and network operator solutions.

◆ Web performance line: mainly web-centric, covering acceleration of static pages and dynamic content, acceleration of IP-based applications (TCP-layer optimization, a product comparable to APPA), and services such as origin load balancing.

◆ Media delivery line: mainly acceleration of live and on-demand streaming, plus cloud streaming servers, a player, and finally a cloud storage product. This shows how strong and deep Akamai's solutions are: a true end-to-end media solution from the origin all the way to the end user, i.e. a complete "first mile to last mile" offering, which fundamentally shows that Akamai is a technology-driven company with serious R&D strength. Akamai also ships client-side software, the Akamai NetSession Interface; although it is mainly used for media delivery and downloads, it effectively provides last-mile acceleration and brings the CDN into the hands of ordinary users.

◆ Cloud security line: mainly website security, attack mitigation, and a WAF, areas where accumulated experience matters most. Through the acquisition of Prolexic, Akamai broadened this line's market reach: it can now protect IP-based applications and offers protection for entire data centers, covering everything from the lowest-level equipment up to the top-level applications with genuinely cloud-based protection.

◆ Cloud networking line: essentially two products built in cooperation with Cisco and Steelhead.

◆ Network operator line: here Akamai mainly sells licensed software; where the other lines sell services, this one is essentially selling pure technology.

Industry write-ups often describe Akamai's five solution families as follows: the Akamai platform comprises five product solutions, namely AQUA (for website-type application businesses), SOLA (for media and video delivery and software downloads), TERRA, KONA (protection against network attacks such as DDoS, plus network monitoring), and AURA (a product offered in cooperation with network operators).
Wine

Description: Wine is a "compatibility layer" for running Microsoft Windows programs on UNIX-like systems. Windows programs running in Wine behave like native Linux programs and do not suffer the performance penalties of an emulator. See the project's own pages for a more detailed introduction.

Installation

Warning: any file or resource that your user account can access is also accessible to programs run under Wine. Wine is not a sandbox; if security matters to you, consider running Wine inside a virtual machine instead.

Wine is obtained by installing the Wine package from the repositories; on 64-bit systems, the multilib repository needs to be enabled. In addition, you may want to install the Wine Gecko and Wine Mono packages, which are needed, respectively, by programs that depend on Internet Explorer and .NET. Wine can also download and install these components on demand later, but installing them ahead of time lets you use them offline and saves Wine from downloading them separately for every WINEPREFIX.
Platform differences

By default Wine is a 32-bit program, and the Arch package is built for i686, so it cannot run 64-bit Windows programs (which are rare anyway). The x86_64 Wine package, however, is currently compiled with --enable-win64. This switch activates the WoW64 variant of Wine: on Windows, this rather involved subsystem lets users run 32-bit and 64-bit Windows programs side by side, even in the same directory, whereas under Wine the user has to set up separate prefixes/directories. This Wine feature is still experimental, and users are advised to use a win32 WINEPREFIX; see the upstream documentation for details. In short, once WINEARCH=win32 is configured, Arch on x86_64 behaves exactly the same as Arch on i686.

Note: if winetricks or other programs misbehave in a 64-bit environment, try creating a new 32-bit WINEPREFIX; see the configuration section below.

Configuration

Wine is usually configured in one of the following ways. winecfg is Wine's graphical configuration program; it is started from a terminal with $ winecfg (or, to target a specific prefix, $ WINEPREFIX=~/.<prefix> winecfg). The Wine implementation of the Windows Control Panel is started with $ wine control. Finally, Wine's registry editor ($ wine regedit) can configure more than either of the previous two tools.
Contents
I. Cache coherence in multicore processors
II. Transactional memory based on lock-free mechanisms
   1. Basic transaction concepts
   2. Implementation flow (design)
   3. Cache states
   4. Transaction behavior
   5. Rollback mechanism
III. The TCC model
   1. Programming model
   2. The TCC system
IV. ASTM
   1. Background
   2. STM design
      2.1 Eager versus lazy acquisition
      2.2 Metadata structures
      2.3 Indirectly referenced objects
   3. Basic ASTM design
V. References

I. Cache coherence in multicore processors

Since later chapters rely on the multiprocessor cache coherence protocol, we first give a brief overview. The protocols that keep multiprocessor caches coherent are called cache coherence protocols, and the key to implementing one is to track the state of every shared data block. There are currently two classes of protocol, which use different techniques to track this sharing state. One is directory-based: the sharing state of each block of physical memory is kept in a single location called the directory. The other is the snooping protocol, whose concrete implementation we look at first.

With snooping, each cache keeps not only a copy of the data from physical memory but also a copy of each block's sharing state. The caches are reachable over a broadcast medium, and every cache controller monitors ("snoops on") that medium to determine whether it holds a copy of the block requested on the bus. Multiprocessors today generally use a write-invalidate protocol to maintain coherence. Its core idea is that a processor must secure exclusive access to a data block before writing it. If a read follows a write, then because the write is exclusive, all copies of the block in other caches must have been invalidated before the write; when the read occurs, it takes a read miss and is forced to fetch a fresh copy. If two writes to the same data are attempted at the same time, only one wins: the winner invalidates the copies in the other processors' caches, and the other writer must first obtain the new, updated copy of the data before it can complete its own write. The write-invalidate protocol therefore effectively serializes writes. The key to implementing it is to use the bus, or some other broadcast medium, to carry out the invalidations.
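The state transitions described above can be summarized in a few lines of C. This is a toy model of a simplified (MSI-style) write-invalidate snooping protocol, written only to illustrate the text; it is not a working simulator and ignores data transfer, write-back, and the additional states of real protocols.

/* Per-line state kept by each cache alongside its copy of the block. */
typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

/* Requests that a cache controller can observe on the broadcast medium. */
typedef enum { BUS_READ, BUS_WRITE_INVALIDATE } bus_msg_t;

/* Transition taken by the local cache when its own CPU touches the block.
 * A write must first broadcast BUS_WRITE_INVALIDATE to obtain exclusive
 * access; a read miss fetches a shared copy over the bus. */
static line_state_t on_cpu_access(line_state_t s, int is_write) {
    if (is_write)
        return MODIFIED;                    /* after invalidating all other copies */
    return (s == INVALID) ? SHARED : s;     /* read hit keeps its current state    */
}

/* Transition taken when another processor's request is snooped. */
static line_state_t on_snoop(line_state_t s, bus_msg_t m) {
    if (m == BUS_WRITE_INVALIDATE)
        return INVALID;                     /* the other writer wins: drop our copy   */
    if (m == BUS_READ && s == MODIFIED)
        return SHARED;                      /* supply the dirty block, keep it shared */
    return s;
}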
Project Pravega Storage Reimagined for a Streaming WorldOpen SourceCommunityMassiveData Growth Emergence of Real-Time Apps Infrastructure Commoditizationand Scale-OutRapid Dissemination of Data to Apps Monetize Datawith AnalyticsData Velocity and VarietyMarket DriversToday’s “Accidental Architecture”BatchReal-TimeInteractive explorationby Data ScientistsReal-time intelligence at the NOCSensorsMirrorMakerDR SiteMobile DevicesApp LogsA New Architecture Emerges: Streaming• A new class of streaming systems is emerging to address the accidental architecture’s problems and enable new applications not possible before •Some of the unique characteristics of streaming applications –Treat data as continuous and infinite–Compute correct results in real-time with stateful, exactly-once processing •These systems are applicable for real-time applications, batch applications, and interactive applications•Web-scale companies (Google, Twitter) are beginning to demonstrate the disruptive value of streaming systems•What are the implications for storage in a streaming world?Let’s Rewind A Bit: The Importance of Log Storage Traditional Apps/Middleware Streaming Apps/MiddlewareBLOCKS •Structured Data •Relational DBs FILES•Unstructured Data•Pub/Sub•NoSQL DBsOBJECTS•Unstructured Data•Internet Friendly (REST)•Scale over Semantics•GeoLOGS•Append-only•Low-latency•Tail Read/WriteThe Importance of Log StorageThe Fundamental Data Structure for Scale-out Distributed SystemsAPPEND-ONLY LOGx=5z=6y=2x=4a=7y=5……older newerHigh Throughput catch-up readsLow Latencytailing readswritesOur Goal: Refactor the “Accidental Storage Stack”Ingest Buffer & Pub/Sub“Pravega Streams”Scale-out SDSNoSQL DB SearchAnalytics EnginesUsing Logs as a Shared Storage PrimitiveIngest Buffer & Pub/SubProprietary Log StorageLocal Files DASKafkaNoSQL DBProprietary Log StorageLocal Files DASCassandra et alIntroducing Pravega StreamsA New Log Primitive Designed Specifically For Streaming Architectures•Pravega is an open source distributed storage service offering a new storage abstraction called a stream• A stream is the foundation for building reliable streaming systems: a high-performance,durable,elastic,and infinite append-only log with strict ordering and consistency• A stream is as lightweight as a file–you can create millions of them in a single cluster•Streams greatly simplify the development and operation of a variety ofdistributed systems: messaging, databases, analytic engines, search engines, and so onPravega Architecture Goals•All data is durable–Data is replicated and persisted to disk before being acknowledged •Strict ordering guarantees and exactly once semantics –Across both tail and catch-up reads–Client tracks read offset, Producers use transactions •Lightweight, elastic, infinite, high performance–Support tens of millions of streams–Dynamic partitioning of streams based on load and throughput SLO–Size is not bounded by the capacity of a single node–Low (<10ms) latency writes; throughput bounded by network bandwidth –Read pattern (e.g. 
many catch-up reads) doesn’t affect write performanceStreaming Storage SystemArchitectureStream AbstractionPravega Streaming ServiceCloud Scale Storage (Isilon or ECS)•High-Throughput •High-Scale, Low-CostLow-Latency StorageApache BookkeeperAuto-TieringCache (Rocks)Messaging AppsReal-Time / Batch / Interactive Predictive AnalyticsStream Processors: Spark, Flink, …Other Apps & MiddlewarePravega Design Innovations 1.Zero-Touch Dynamic Scaling-Automatically scale read/writeparallelism based on load and SLO -No service interruptions-No manual reconfiguration of clients -No manual reconfiguration of service resources2.Smart Workload Distribution-No need to over-provision servers for peak load3.I/O Path Isolation-For tail writes -For tail reads-For catch-up reads4.Tiering for “Infinite Streams”5.Transactions For “Exactly Once”Pravega FundamentalsSegments•Base storage primitive is a segment• A segment is an append-only sequence of bytes •Writes durably persisted before acknowledgement….. 01110110 01100001 0110110001000110AppendRead01000110SegmentPravega• A stream is composed of one or more segments•Routing key determines the target segment for a stream write•Write order preserved by routing key; consistent tail and catch-up readsStream (011010010111011001001010)011101100111011001101111011010010110100101101111PravegaStream•There are no architectural limits on the number of streams or segments •Each segment can live in a different server•System is not limited in any way by the capacity of a single server (011010010111011001001010)011101100111011001101111011010010110100101101111PravegaStreamSegment Sealing• A segment may be sealed• A sealed segment cannot be appended to any more•Basis for advanced features such as stream elasticity and transactions (011010010111011001001010)011101100111011001101111011010010110100101101111XPravegaPravega System ArchitectureCommodity ServerPravega ClientControllerBookkeeper Segment StoreCacheCommodity Server Segment StoreBookkeeper CacheCommodity ServerSegment StoreBookkeeperCacheStream Create, Open, Transactions, …Read, Write, …Long-Term Storage(Object, NFS, HDFS)Beyond the Fundamentals Stream Elasticity, Unbounded Streams, Transactions, Exactly Once•Data arrival volume increases –more parallelism needed!PravegaStreamAppend….. 01110110 01100001 01101100 01000110Stream•Seal original segment •Replace with two new ones!•New segments may be distributed throughout the cluster balancing load….. 
01110110 01100001 01101100AppendPravega010001100100011001000110Routing Key Space0.01.0TimeSplitSplitMerget 0t 1t 201000110010001100100011001000110010001100100011001000110010001100100011001000110Routing Key Space0.01.0Timet 0t 1t 201000110010001100100011001000110010001100100011001000110010001100100011001000110Key ranges are dynamically assigned to segments Split SplitMergeScale up Scale downStream ElasticityZero-Touch Scaling: Segment Splitting & MergingSegment 7Segment 4Segment 3Segment 1Segment 2Segment 0Time3210t0t1t2t3Segment 6Segment 5t4Data KeysStream SStreamUnbounded Streams•Segments are automatically tiered to long-term storage•Data in tiered segments is transparently accessible for catch-up reads •Preserves stream abstraction while lowering storage costs for older dataPravega01110110011101100110100101101001011011110110111101101111Long-T erm Storage (Object , HDFS, NFS)Segment Store Setup10 : 2Writer ID = 10Segment StoreXExactly OnceWriter ID = 10Ack Segment StoreID Position101ID Position102Writer ID = 10ID Position102Writer ID = 10Ack Segment Store ID Position103Stream01000110Initial State01110110 StreamBegin TXTX segments0111011001000110Stream01100001Write to TX01100001TX segments0111011001000110 Stream01100000Write toTX01100001TX segments0111011001000110 01100000StreamUpon commitSeal TX segments01110110010001100110000001100001Stream0100011001100001Merge TXsegments into stream segmentsUpon commit0111011001100000Stream01000110Initial State01110110 StreamBegin TXTX segments0111011001000110Stream01100001Write to TX01100001TX segments0111011001000110 Stream01100000Write toTX01100001TX segments0111011001000110 01100000StreamUpon abortSeal TX segments01110110 010001100110000001100001Transactional Semantics For “Exactly Once”New ItemStream 1, Segment 1…Stream 1, Segment 1, TX-230New ItemNew ItemStream ProcessorApp StateApp LogicWorker WorkerPravega Optimizations for Stream ProcessorsInput Stream (Pravega)…Worker…Segment Memory-SpeedStorageDynamically split input stream into parallel logs : infinite sequence, low-latency, durable, re-playable with auto-tiering from hot to cold storage.1Coordinate via protocol between streaming storage and streaming engine to systematically scale up and down the number of logs and source workers based on load variance over time2Support streaming write COMMIT operation to extend Exactly Once processing semantics across multiple, chained applications3S o c i a l , I o T P r o d u c e r sOutput Stream (Pravega)Stream Processor2nd AppSegmentSinkSegmentSegmentNautilus PlatformSecure | Integrated | Efficient | Elastic | ScalableA Turn-Key Streaming Data PlatformPravegaStreaming StorageFlinkStateful Stream ProcessorStreaming SQL API ZeppelinNotebook ExperienceNotebook APIK8SCommodity Servers or CloudS e c u r i t yServiceabilityStreaming Storage API Digital WorldReal-Time/Batch Analytics Frameworks and AppsInteractive ExplorationSummary1.“Streaming Architecture” replaces “Accidental Architecture”–Data: infinite/continuous vs. 
static/finite–Correctness in real-time: Exactly once processing + consistent storage 2.Pravega Streaming Storage Enables Storage Refactoring–Infinite, durable, scalable, re-playable, elastic append-only log–Open source project3.Unified Storage + Unified Data Pipeline–The New Data Stack!Comparing Pravega and Kafka Design PointsUnlike Kafka, Pravega is designed to be a durable and permanent storage systemQuality Pravega Goal Kafka Design Point Data Durability Replicated and persisted to disk before ACK Replicated but not persisted to disk before ACK Strict Ordering Consistent ordering on tail and catch-up reads Messages may get reorderedExactly Once Producers can use transactions for atomicity Messages may get duplicatedScale Tens of millions of streams per cluster Thousands of topics per clusterElastic Dynamic partitioning of streams based on load and SLO Statically configured partitionsSize Log size is not bounded by the capacity of any single nodePartition size is bounded by capacity of filesystemon its hosting nodeTransparently migrate/retrieve data from Tier 2 storage forolder parts of the logExternal ETL required to move data to Tier 2storage; no access to data via Kafka once movedPerformance Low (<10ms) latency durable writes; throughput bounded bynetwork bandwidthLow-latency achieved only by reducingreplication/reliability parametersRead pattern (e.g. many catch-up readers) does not affectwrite performanceRead patterns adversely affects write performancedue to reliance on OS filesystem cache✗✗✗✗✗✗✗✗✗。
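The routing-key mechanism sketched in these slides can be illustrated with a few lines of C. This is a conceptual sketch only, not the Pravega client API: the routing key is hashed into the key space [0.0, 1.0), and the write goes to whichever open segment currently owns the sub-range containing that point, so a split or merge only changes the range table. The hash below is an arbitrary stand-in, and the segment numbering is hypothetical.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    double lo, hi;        /* key-space range [lo, hi) owned by this segment */
    int    segment_id;
} segment_range_t;

/* 64-bit FNV-1a, mapped into [0,1); stands in for whatever hash the real
 * client applies to the routing key. */
static double key_to_unit_interval(const char *routing_key) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (const char *p = routing_key; *p; p++) {
        h ^= (uint8_t)*p;
        h *= 0x100000001b3ULL;
    }
    return (double)(h >> 11) / (double)(1ULL << 53);
}

static int segment_for_key(const char *routing_key,
                           const segment_range_t *table, int nsegments) {
    double x = key_to_unit_interval(routing_key);
    for (int i = 0; i < nsegments; i++)
        if (x >= table[i].lo && x < table[i].hi)
            return table[i].segment_id;
    return -1;            /* unreachable if the table covers [0,1) */
}

int main(void) {
    /* hypothetical state after one split: segment 1 was sealed and replaced
     * by segments 2 and 3, each owning half of its former key range */
    segment_range_t table[] = { {0.00, 0.50, 0}, {0.50, 0.75, 2}, {0.75, 1.00, 3} };
    printf("routing key 'sensor-42' -> segment %d\n",
           segment_for_key("sensor-42", table, 3));
    return 0;
}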
The placement ofmatrices when usingXOR-based hashingBavo Nootaert∗,Hans Vandierendonck∗,Koen De Bosschere∗∗ELIS,Ghent University,Sint-Pietersnieuwstraat41,9000Gent,BelgiumABSTRACTTo reduce the number of conflicts in caches,hash function are used.One family of hash function that has been reported to perform well in the presence of stride patterns are based on XOR-ing address bits together.We show that,in contrast to traditional modulo-2m-based hashing,for XOR-based hashing the placement of stride patterns and matrices has a significant impact on conflicts. KEYWORDS:cache,optimization,stride pattern,XOR-based hash function1IntroductionHash functions have been studied with the aim of avoiding conflict misses in caches[Schl93, Gonz97,Toph99].In scientific programs,stride patterns are common.It is important to choose a hash function so as to map as many strides as possible with few conflicts.Conven-tional bit selection has the disadvantage of mapping even strides onto only one halve of the sets.A well studied alternative are the XOR-based hash functions[Frai85,Gonz97,Vand05], and a special subclass thereof,the polynomial hash functions[Rau91].These functions are computed using arithmetic over GF(2)and can be evaluated using solely XOR-gates.In[Toph99]benchmarks are used to measure the performance of these hash functions. Analytical studies of XOR-based hash functions are usually limited to vector space pat-terns[Frai85,Vand05].Analyzing these patterns is rather restrictive,since this class con-tains only certain strides,aligned to certain base addresses.In[Rau91],some properties are proven about the effect of polynomial hash functions on stride patterns starting at base ad-dress0.Although some of the theorems can be extended to include a different base address [Noot05],they are aimed at determining equivalence among patterns,and generally can-not say whether a particular base address is actually a good choice.Indeed,the number of conflicts varies greatly depending on the alignment of the stride pattern.In Section2we model self interference in a matrix using stride patterns.In Section3 we use this model to show the potential impact of the base address,independent of any particular benchmark.10000 1500020000 25000 300000 2e+06 4e+06 6e+068e+06 1e+07n u m b e r o f c o n f l i c t s base address Figure 1:The number of conflicts for a 976×976matrix.2A model for self interference in a matrixWe focus on self interference [Lam91]in a matrix:if the cache is large enough,cross inter-ference can be ignored.Assume interference is mainly restricted to within separate columns or rows of a matrix.This assumption is valid if each column or row is reused repeatedly,without intermediate accesses to other columns of the same matrix.A matrix can now be considered as a collection of stride patterns.An S ×S matrix stored in row major order is made of S patterns with stride S (the columns),and S with stride 1(the rows).Let a be the starting address of a matrix A .The patterns with stride S start at addresses a ,a +1,...,a +S −1,and the others start at addresses a ,a +S ,...,a +(S −1)S .A conflict occurs when an element of a pattern is mapped to a set that already contains an element.The number of conflicts is independent of the order in which the elements are mapped.We define the number of conflicts of a matrix as the sum of the conflicts of the column access patterns and the row access patterns.3An exampleConsider a cache with 213sets and a mapping that XORs two slices of 13bits together,and discards any bits above 
the 26least significant.Assume for simplicity that a cache line can hold exactly one element of the matrix.Figure 1shows the number of conflicts for S =976and base addresses ranging from 0to 10,000,000.Here,optimal placement reduces the number of conflicts with about 59%compared to the worst case.As can be observed,there is some symmetry,which is described next.Proofs can be found in [Noot05].For a pattern with stride S of size W ,the number of conflicts is symmetric around the points b j ,given by:b j =2j +t + S 2 − W S 2−1,where t is the smallest integer that makes b 0positive.The range [0,2b j ]does not reach b j +1.So if one examines the base addresses starting from 0,the range ]2b j ,b j +1]contains new information.In this range,there may be an optimal base address that has not yet been discovered.After the range[0,2b0],which is symmetric since b0is its center,has been reflected around b1,both the original and its image are reflected again around b2,resulting in four copies of the range.After the next symmetry point,we have eight copies,and so on.It can be proven that for the model described in Section2,the number of conflicts for a matrix shows a similar symmetric behavior.The symmetry points for an S×S-matrix areb j =2j+t −S22,where t is the smallest integer that makes b 0positive.4ConclusionThe base address of a stride pattern or a matrix pattern has a significant impact on the num-ber of conflicts,when mapped using a XOR-based hash function.This is the opposite of conventional modulo-based hashing,where the base address has no effect at all on self in-terference.These results offer a new possibility for optimizing the number of conflicts. References[Frai85]J.F RAILONG,W.J ALBY,AND J.L ENFANT.XOR-schemes:aflexible data organi-zation in parallel memories.In Proceedings of the1985International Conference onParallel Processing,pages276–283,August1985.[Gonz97] A.G ONZÁLEZ,M.V ALERO,N.T OPHAM,AND J.P ARCERISA.Eliminating cache conflict misses through XOR-based placement functions.In ICS’97:Proceedings ofthe11th international conference on Supercomputing,pages76–83.ACM Press,1997. [Lam91]M.L AM,E.R OTHBERG,AND M.E.W OLF.The cache performance and opti-mizations of blocked algorithms.In Proceedings of the Fourth International Con-ference on Architectural Support for Programming Languages and Operating Systems,pages63–74,April1991.[Noot05] B.N OOTAERT,H.V ANDIERENDONCK,AND K.D E B OSSCHERE.How patterns in memory references affect the performance of hash functions in cache memories.Technical Report R105.002,Ghent University,March2005.[Rau91] B.R AU.Pseudo-randomly interleaved memory.In Proceedings of the18th Annual International Symposium on Computer Architecture,pages74–83,May1991. [Schl93]M.S CHLANSKER,R.S HAW,AND S.S IVARAMAKRISHNAN.Randomization and associativity in the design of placement-insensitive caches.Technical report,HPComputer Systems Laboratory,June1993.[Toph99]N.T OPHAM AND A.G ONZÁLEZ.Randomized Cache Placement for Eliminating Conflicts.IEEE Transactions on Computers,48(2):185–192,1999.[Vand05]H.V ANDIERENDONCK AND K.D E B OSSCHERE.XOR-based Hash Functions.IEEE Transactions on Computers,54(7):800–812,2005.。
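The hash function and the conflict model used in this example translate directly into code. The sketch below is an illustration, not the authors' tool: it computes the XOR-based set index from the 26 least significant element-address bits and counts the conflicts of the column access patterns of an S x S row-major matrix at a given base address, following the model of Section 2 with one matrix element per cache line.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define SET_BITS 13
#define NSETS    (1u << SET_BITS)          /* 2^13 sets, as in the example */

/* XOR the two 13-bit slices of the address; bits above the 26 least
 * significant are discarded by the final mask. */
static unsigned xor_set_index(uint64_t addr) {
    return (unsigned)((addr ^ (addr >> SET_BITS)) & (NSETS - 1));
}

/* Conflicts of the S column patterns (stride S) of an S x S matrix stored
 * in row-major order at base address `base`: an element conflicts when it
 * maps to a set that already holds an element of the same pattern. */
static long column_conflicts(uint64_t base, unsigned S) {
    static unsigned filled[NSETS];
    long conflicts = 0;
    for (unsigned col = 0; col < S; col++) {
        memset(filled, 0, sizeof filled);
        for (unsigned row = 0; row < S; row++) {
            unsigned set = xor_set_index(base + col + (uint64_t)row * S);
            if (filled[set]++)
                conflicts++;
        }
    }
    return conflicts;
}

int main(void) {
    /* sweeping `base` reproduces the kind of curve shown in Figure 1 */
    for (uint64_t base = 0; base <= 10000000; base += 1000000)
        printf("base %10llu: %ld conflicts\n",
               (unsigned long long)base, column_conflicts(base, 976));
    return 0;
}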
How LTO works

LTO, short for Link Time Optimization, is a key component of modern compiler toolchains. It is an optimization technique that takes place after the source code has been compiled into intermediate representations, during the linking phase. The primary objective of LTO is to improve the overall performance of the generated executable by performing cross-file optimizations.

At its core, LTO operates by gathering all the compiled intermediate representations, often in the form of object files, and allowing the optimizer to perform optimizations across these files. This is in contrast to traditional compilation, where optimizations are typically limited to a single source file. By considering the entire program, LTO can eliminate redundant code, inline functions more effectively, and rearrange the layout of data structures for improved cache locality.
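A minimal, hypothetical two-file example makes the point. With separate compilation, the call to scale() cannot be inlined into main(), because the compiler never sees its body; with -flto (supported by both gcc and clang, details vary by toolchain) the optimizer works on the combined intermediate representation at link time and can inline and constant-fold across the file boundary.

/* ---- scale.c ---- */
double scale(double x) {
    return 2.0 * x;
}

/* ---- main.c ---- */
#include <stdio.h>

double scale(double x);

int main(void) {
    double s = 0.0;
    for (int i = 0; i < 1000; i++)
        s += scale((double)i);   /* candidate for cross-module inlining under LTO */
    printf("%f\n", s);
    return 0;
}

/*
 * Without LTO:  cc -O2 -c scale.c main.c  &&  cc scale.o main.o
 * With LTO:     cc -O2 -flto -c scale.c main.c  &&  cc -O2 -flto scale.o main.o
 */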
FRIEDRICH-ALEXANDER-UNIVERSIT¨AT ERLANGEN-N¨URNBERG INSTITUT F¨UR INFORMATIK(MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG) Lehrstuhl f¨u r Informatik10(Systemsimulation)Cache Performance Optimizations for Parallel Lattice BoltzmannCodes in2DJens Wilke,Thomas Pohl,Markus Kowarschik,Ulrich R¨u deLehrstuhlbericht03-3Cache Performance Optimizations for Parallel LatticeBoltzmann Codes in2D∗Jens Wilke,Thomas Pohl,Markus Kowarschik,Ulrich R¨u deLehrstuhl f¨u r Informatik10(Systemsimulation)Institut f¨u r InformatikFriedrich–Alexander–Universit¨a t Erlangen–N¨u rnbergGermany{Jens.Wilke,Thomas.Pohl,Markus.Kowarschik,Ulrich.Ruede}@cs.fau.deAbstractWhen designing and implementing highly efficient scientific applications for parallel com-puters such as clusters of workstations,it is inevitable to consider and to optimize the single–CPU performance of the codes.For this purpose,it is particularly important that the codesrespect the hierarchical memory designs that computer architects employ in order to hide theeffects of the growing gap between CPU performance and main memory speed.In this paper,we present techniques to enhance the single–CPU efficiency of lattice Boltzmann methods whichare commonly used in computationalfluid dynamics.We show various performance results toemphasize the effectiveness of our optimization techniques.1IntroductionIn order to enhance the performance of any parallel scientific application,it is important to focus on two related optimization issues.Firstly,it is necessary to minimize the parallelization overhead itself.These efforts commonly target the choice of appropriate load balancing strategies as well as the minimization of communication overhead by hiding network latency and bandwidth.Secondly, it is necessary to exploit the individual parallel resources as efficiently as possible;e.g.,by achieving as much performance as possible on each CPU in the parallel environment.This is especially true for distributed memory systems found in computer clusters based on off–the–shelf workstations communicating via fast networks.Our current research focuses on this second optimization issue.In order to mitigate the effects of the growing gap between theoretically available processor speed and main memory performance,today’s computer architectures are typically based on hierarchical∗A slightly modified version of this paper has been accepted for publication atEuroPar’03.Capacity Bandwidth Latency16 KB16 GB/s 1 cycle256 KB32 GB/s5+ cycles1.5/3 MB32 GB/s12+ cycles2 GB 6.4 GB/s>100 cyclesFigure1:Memory architecture of a workstation based on an Intel Itanium2CPU with three levels of on–chip cache[12].memory designs,involving CPU registers,several levels of cache memories(caches),and main memory[10].Remote main memory and external memory(e.g.,hard disk drives)can be considered as the slowest components in any memory hierarchy.Fig.1illustrates the memory architecture of a current high performance workstation[12].Efficient execution in terms of work units per second can only be obtained if the codes exploit the underlying memory design.This is particularly true for numerically intensive codes.Unfortunately, current compilers cannot perform highly sophisticated code transformations automatically.Much of this optimization effort is therefore left to the programmer[8,14].Generally speaking,efficient parallelization and cache performance tuning can both be inter-preted as data locality optimizations.The underlying idea is to keep the data to be processed as close as possible to the 
corresponding ALU. From this viewpoint, cache optimizations form an extension of classical parallelization efforts.

Research has shown that the cache utilization of iterative algorithms for the numerical solution of linear systems can be improved significantly by applying suitable combinations of data layout optimizations and data access optimizations [3, 6]. The idea behind these techniques is to enhance the spatial locality as well as the temporal locality of the code [11]. Similar work focuses on other algorithms of numerical linear algebra [16] and hardware-oriented FFT implementations [7]. An overview of cache optimization techniques for numerical algorithms can be found in [13].

Our current work concentrates on improving the cache utilization of a parallel implementation of the lattice Boltzmann method (LBM), which represents a particle-based approach towards the numerical simulation of problems in computational fluid dynamics (CFD) [5, 17].

This paper is structured as follows. Section 2 contains a brief introduction to the LBM. Section 3 presents code transformation techniques to enhance the single-CPU performance of the LBM and introduces a compressed grid storage technique, which almost halves its data set size. The performance results of LBM implementations on various platforms are shown in Section 4. We conclude in Section 5.

2 The Lattice Boltzmann Method

It is important to mention up front that we only give a brief description of the LBM in this section, since the actual physics behind this approach are not essential for the application of our optimization techniques.

The usual approach towards solving CFD problems is based on the numerical solution of the governing partial differential equations, particularly the Navier-Stokes equations. The idea behind this approach is to discretize the computational domain using finite differences, finite elements or finite volumes, to derive algebraic systems of equations and to solve these systems numerically. In contrast, the LBM is a particle-oriented technique, which is based on a microscopic model of the moving fluid particles. Using x to denote the position in space, u to denote the particle velocity, and t to denote the time parameter, the so-called particle distribution function f(x, u, t) is discretized in space, velocity, and time. This results in the computational domain being regularly divided into cells (so-called lattice sites), where the current state of each lattice site is defined by an array of floating-point numbers that represent the distribution functions w.r.t. the discrete directions of velocity.

The LBM then works as follows. In each time step, the entire grid (lattice) is traversed, and the distribution function values at each site are updated according to the states of its neighboring sites in the grid at the previous discrete point in time. This update step consists of a stream operation, where the corresponding data from the neighboring sites are retrieved, and a collide operation, where the new distribution function values at the current site are computed according to a suitable model of the microscopic behavior of the fluid particles, preserving hydrodynamic quantities such as mass density, momentum density, and energy. Usually, this model is derived from the Boltzmann equation

∂f/∂t + ⟨u, ∇f⟩ = −(1/λ) (f − f^(0)),

which describes the time-dependent behavior of the particle distribution function f. In this equation, f^(0) denotes the equilibrium distribution function, λ is the relaxation time, ⟨·, ·⟩
denotes the standard inner product,and∇f denotes the gradient of f w.r.t.the spatial dimensions[17].Figure2:Representation of the LBM updating a single lattice site by a stream operation(left)and a subsequent collide operation(right).Fig.2shows the update of an individual lattice site in2D.The grid on the left illustrates the source grid corresponding to the previous point in time,the darker arrows represent the particle distribution function values being read.The grid on the right illustrates the destination grid corresponding to the current point in time,the dark arrows represent the new particle distribution function values;i.e.,after the collide operation.Note that,in every time step,the lattice may be traversed in any order since the LBM only accesses data corresponding to the previous point in time in order to compute new distribution function values.From an abstract point of view,the structure of the LBM thus parallels the structure of an implementation of Jacobi’s method for the iterative solution of linear systems on structured meshes.The time loop in the LBM corresponds to the iteration loop in Jacobi’s method.In contrast to the lattice gas approach[17]where hexagonal grids are more common,we focus on orthogonal grids,since they are almost exclusively used for the LBM.3Optimization Techniques for Lattice Boltzmann Codes3.1Fusing the stream and the collide operationA naive implementation of the LBM would perform two entire sweeps over the whole data set in every time step:one sweep for the stream operation,copying the distribution function values from each lattice site into its neighboring sites,and a subsequent sweep for the collide operation, calculating the new distribution function values at each site.Afirst step to improve performance is to combine the streaming and the collision step.The idea behind this so–called loop fusion technique is to enhance the temporal locality of the code and thus the utilization of the cache[1].Instead of passing through the data set twice per time step, the fused version retrieves the required data from the neighboring cells and immediately calculates the new distribution function values at the current site,see again Fig.2.Since this tuning step is both common and necessary for all subsequent transformations,we consider this fused version as our starting point for further cache optimizations.3.2Data Layout OptimizationsAccessing main memory is very costly compared to even the lowest cache level.Therefore,it is essential to choose a memory layout for the implementation of the LBM which allows the code to exploit the benefits of the hierarchical memory architecture.Since we need to maintain the grid data for any two successive points in time,our initial storage scheme is based on two arrays of records.Each of these records stores the distribution function values of an individual lattice site.Clustering the distribution function values is a reasonable approach since the smallest unit of data to be moved between main memory and cache is a cache block which typically contains several data items that are located adjacent in memory.In the following,we present two data layout transformations that aim at further enhancing the spatial locality of the LBM implementation;grid merging and grid compression.Array tMerged arrayArray t+1Figure3:The two arrays which store the grid data at time t and time t+1,respectively,are merged into a single array.1. 
Figure 4: Fused stream-and-collide step for the cell in the middle (left, panels 1 and 2), and layout of the two overlaid grids after applying the grid compression technique (right).

Grid merging. During a fused stream-and-collide step it is necessary to load data from the grid (array) corresponding to time t, to calculate new distribution function values, and to store these results into the other grid corresponding to time t+1. Due to our initial data layout (see above), the first two steps access data which are tightly clustered in memory. Storing the results, however, involves the access of memory locations that can be arbitrarily far away. In order to improve spatial locality, we introduce an interleaved data layout where the distribution function values of each individual site for two successive points in time are kept next to each other in memory. This transformation is commonly called array merging [11, 13]. Fig. 3 illustrates the application of this technique for the 2D case.

Grid compression. The idea behind this data layout is both to save memory and to increase spatial locality. We concentrate on the 2D case in this paper, while the extension of this technique to the 3D case is currently being implemented. In an implementation of the LBM in 2D, nearly half of the memory can be saved by exploiting the fact that only the data from the eight neighboring cells are required to calculate the new distribution function values at any regular site of the grid¹. It is therefore possible to overlay the two grids for both points in time, introducing a diagonal shift of one row and one column of cells into the stream-and-collide operation. The direction of the shift then determines the update sequence in each time step, since we may not yet overwrite those distribution function values that are still required henceforth.

In Fig. 4, the light gray area contains the values for the current time t. The two pictures on the left illustrate the stream-and-collide operation for a single cell: the values for time t+1 will be stored with a diagonal shift to the lower left. Therefore, the stream-and-collide sweep must also start with the lower left cell. Consequently, after one complete sweep, the new values (time t+1) are shifted compared to the previous ones (time t). For the subsequent time step, the sweep must start in the upper right corner. The data for time t+2 must then be stored with a shift to the upper right. After two successive sweeps, the memory locations of the distribution functions are the same as before. This alternating scheme is shown in the right picture of Fig. 4.

¹ For the sake of simplicity, we focus on regular sites and omit the description of how to treat boundary sites and obstacle sites separately.
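To make the indexing behind grid compression concrete, the following sketch shows the two alternating sweeps for a scalar stand-in: one double per site and a 9-point average instead of the full stream-and-collide operation on distribution function records. The array size, the names, and the mapping of ascending/descending index order onto the "lower left"/"upper right" corners of Fig. 4 are illustrative assumptions, and boundary/obstacle sites are ignored (cf. footnote 1).

#include <vector>

// Scalar stand-in for grid compression. The single array p overlays the
// grids for two successive points in time with a diagonal shift of one
// cell; two sweeps in opposite directions restore the original positions.
// p must hold (n+3) * (n+3) doubles: the overlaid logical n x n grids plus
// a margin whose boundary values are assumed to be maintained elsewhere.
void twoCompressedTimeSteps(std::vector<double>& p, int n) {
    const int s = n + 3;                                   // row stride
    auto at = [&](int i, int j) -> double& { return p[i * s + j]; };

    // Stand-in for the fused stream-and-collide update of one site.
    auto update = [&](int ci, int cj) {
        double sum = 0.0;
        for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj)
                sum += at(ci + di, cj + dj);
        return sum / 9.0;
    };

    // Sweep 1 (time t -> t+1): the value of site (i, j) lives at (i+1, j+1);
    // its new value is written with a diagonal shift towards lower indices,
    // i.e. to (i, j). Ascending order guarantees that no value which is
    // still required is overwritten.
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j)
            at(i, j) = update(i + 1, j + 1);

    // Sweep 2 (time t+1 -> t+2): the values now live at (i, j); the new
    // value is written with a shift back towards higher indices, i.e. to
    // (i+1, j+1). The sweep therefore starts in the opposite corner.
    for (int i = n; i >= 1; --i)
        for (int j = n; j >= 1; --j)
            at(i + 1, j + 1) = update(i, j);
}

Compared to keeping two full grids, the data set is stored only once (plus a small margin), which is where the memory saving of almost one half comes from.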
Figure 5: 1D blocking technique to enhance temporal locality.

3.3 Data Access Optimizations

Data access optimizations change the order in which the data are referenced in the course of the computation, while respecting all data dependencies. Our access transformations for implementations of the LBM are based on the loop blocking (loop tiling) technique. The idea behind this general approach is to divide the iteration space of a loop or a loop nest into blocks and to perform as much computational work as possible on each individual block before moving on to the next block. If the size of the blocks is chosen according to the cache size, loop blocking can significantly enhance cache utilization and, as a consequence, yield significant performance speedups [1, 8, 13]. In the following, we present two blocking approaches in order to increase the temporal locality of the 2D LBM implementation. Both blocking techniques take advantage of the stencil memory access pattern exhibited by the LBM. Because of this local operation, a grid site can be updated to time t+1 as soon as the sites it depends on have been updated to time t. It is important to point out that each of these access transformations can be combined with either of the two layout transformations which we have introduced in Section 3.2.

1D blocking. Fig. 5 illustrates an example where two successive time steps are blocked into a single pass through the grid. White cells have been updated to time t, light gray cells to time t+1, and dark gray cells to time t+2. It can be seen that in Grid 2, all data dependencies of the bottom row are fulfilled, so it can be updated to time t+2, as shown in Grid 3. This is performed repeatedly until the entire grid has been updated to time t+2, see Grids 4 to 10. The downside of this method is that even two rows may contain too much data to fit into the cache if the grid is too large. In this case, no performance gain will be observed.

2D blocking. Fig. 6 illustrates an example in which a 4×4 block of cells is employed. This means that four successive time steps are performed during a single pass through the grid. Since the amount of data contained in the 2D block is independent of the grid size, it will always (if the block size is chosen appropriately) fit into the highest possible cache level, regardless of the grid size. In Fig. 6, Grids 1 to 4 demonstrate the handling of one 4×4 block, and Grids 5 to 8 a second 4×4 block. The block which can be processed moves diagonally down and left in order to avoid violating data dependencies. Obviously, special handling is required for those sites near grid edges which cannot form a complete 4×4 block.
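The following is a minimal sketch of the 1D blocking scheme with two blocked time steps, using the same scalar stand-in update as in the grid compression sketch above rather than the real stream-and-collide operation; the function and variable names are illustrative, and the ghost/boundary rows are assumed to be maintained elsewhere.

#include <vector>

// 1D blocking stand-in: two time steps are performed in a single pass over
// the rows of the grid. a holds time t on entry and time t+2 on exit; b is
// scratch storage for time t+1. Both are (n+2) x (n+2) row-major arrays
// with a ghost/boundary margin (maintenance of the margin is omitted).
void blockedTwoTimeSteps(std::vector<double>& a, std::vector<double>& b, int n) {
    const int s = n + 2;                                   // row stride

    // Stand-in for the per-site stream-and-collide update (9-point average).
    auto update = [s](const std::vector<double>& src, std::vector<double>& dst,
                      int row, int col) {
        double sum = 0.0;
        for (int di = -1; di <= 1; ++di)
            for (int dj = -1; dj <= 1; ++dj)
                sum += src[(row + di) * s + (col + dj)];
        dst[row * s + col] = sum / 9.0;
    };

    for (int r = 1; r <= n; ++r) {
        // Advance row r from time t to time t+1.
        for (int c = 1; c <= n; ++c)
            update(a, b, r, c);

        // All time-t+1 neighbors of row r-1 are now available in b, so that
        // row can be advanced to time t+2 immediately, while the relevant
        // rows are still cache-resident. Overwriting the time-t values of
        // row r-1 in a is safe: later t -> t+1 updates only read rows >= r.
        if (r >= 2)
            for (int c = 1; c <= n; ++c)
                update(b, a, r - 1, c);
    }

    // The last row still has to be advanced to time t+2 (its upper neighbor
    // row in b is the boundary margin, assumed to be maintained elsewhere).
    for (int c = 1; c <= n; ++c)
        update(b, a, n, c);
}

Blocking more time steps, such as the eight-step variant measured in Section 4, extends the same wavefront idea, at the cost of keeping more rows cache-resident at once.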
4 Performance Results

In order to test and benchmark the various implementations of the LBM, a well-known problem in fluid dynamics known as the lid-driven cavity has been used. It consists of a closed box, where the top of the box, the lid, is continually dragged across the fluid in the same direction. The fluid eventually forms a circular flow around the center of the box [9].

Figure 6: 2D blocking technique to enhance temporal locality.

Figure 7: Performance in millions of lattice site updates per second (MLSUD/s) on machines based on an AMD Athlon XP 2400+ (left) and on an Intel Itanium 2 (right), respectively, for various grid sizes.

Performance. Fig. 7 demonstrates the performance of five different implementations of the LBM in ANSI C++ on two different machines² for various grid sizes. Note that problem size n means that the grid contains n² cells. The simple implementation involving a source and a destination grid is the slowest; the implementation using the layout based on grid compression is slightly faster. Two implementations of 1D blocking are shown, one with two time steps blocked and one with eight. Initially, they show a significant increase in performance. For large grids, however, performance worsens since the required data cannot fit into even the lowest level of cache. This effect is more dramatic on the AMD processor since it has only 256 kB of L2 and no L3 cache, whereas the Intel CPU even has 1.5 MB of L3 cache on-chip. Finally, the 2D blocked implementation shows a significant increase in performance for all grid sizes. It should be noted that each implementation based on the grid merging layout performs worse than its counterpart based on grid compression. Therefore, no performance results for the grid merging technique are shown.

² We use an AMD Athlon XP 2400+ based PC (2 GHz) [2], Linux, gcc 3.2.1, as well as an Intel Itanium 2 based HP zx6000 [12], Linux, Intel ecc V7.0. Aggressive compiler optimizations have been enabled in all experiments.

We have obtained similar performance gains on several further platforms. For example, our results include speedup factors of 2–3 on machines based on DEC Alpha 21164 and DEC Alpha 21264 CPUs. Both of them particularly benefit from large off-chip caches of 4 MB.

Cache Behavior. Fig. 8 demonstrates the behavior of the L1 and L2 cache of the AMD Athlon XP for the different implementations of the LBM. These results have been obtained by using the profiling tool PAPI [4]. When compared to Fig. 7, it can be seen that there is a strong correlation between the number of cache misses and the performance of the code. The correlation between the performance drop of the two 1D blocked implementations and their respective rise in L2 cache misses is especially dramatic. Additionally, Fig. 8 reveals the cause of the severe performance drop exhibited by the 2D blocked implementation at a grid size of 800²: the high number of cache misses are conflict misses, which occur when large amounts of data in memory are mapped to only a small number of cache lines [15].

Figure 8: Cache behavior of the AMD Athlon XP measured with PAPI (L1 and L2 cache misses per lattice site update, plotted over the grid size).

5 Conclusions and Future Work

Due to the still widening gap between CPU and memory speed, hierarchical memory architectures will continue to be a promising optimization target. The CPU manufacturers have already announced new generations of CPUs with several megabytes of on-chip cache in order to hide the slow access to main memory.

We have demonstrated the importance of considering the single-CPU performance before using parallel computing methods in the framework of a CFD code based on the LBM. By exploiting the benefits of the hierarchical memory architectures of current CPUs, we have obtained factors of 2–3 in performance on various machines. Unfortunately, the parallelization of cache-optimized codes is commonly tedious and error-prone due to their implementation complexity.

We are currently extending our techniques to the 3D case. From our experience in cache performance optimization of iterative linear solvers, we expect that appropriate data layout transformations and data access optimizations, in particular loop blocking, can be applied to improve the performance.

Acknowledgments

We wish to thank the members of the High Performance Computing Group at the Computing Center in Erlangen (RRZE) for their support.

References

[1] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers, San Francisco, CA, USA, 2001.

[2] AMD Corporation. AMD Athlon XP Processor 8 Data Sheet, 2002. Publication #25175 Rev. F.

[3] F. Bassetti, K. Davis, and D. Quinlan. Temporal Locality Optimizations for Stencil Operations within Parallel Object-Oriented Scientific Frameworks on Cache-Based Architectures. In Proc. of the Int. Conference on Parallel and Distributed Computing and Systems, pages 145–153, Las Vegas, NV, USA, 1998.

[4] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A Portable Programming Interface for Performance Evaluation on Modern Processors. Int. Journal of High Performance Computing Applications, 14(3):189–204, 2000.

[5] S. Chen and G. D. Doolen. Lattice Boltzmann Method for Fluid Flows. Annual Review of Fluid Mechanics, 30:329–364, 1998.
[6] C. C. Douglas, J. Hu, M. Kowarschik, U. Rüde, and C. Weiß. Cache Optimization for Structured and Unstructured Grid Multigrid. Electronic Transactions on Numerical Analysis, 10:21–40, 2000.

[7] M. Frigo and S. G. Johnson. FFTW: An Adaptive Software Architecture for the FFT. In Proc. of the Int. Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1381–1384, Seattle, WA, USA, 1998.

[8] S. Goedecker and A. Hoisie. Performance Optimization of Numerically Intensive Codes. SIAM, 2001.

[9] M. Griebel, T. Dornseifer, and T. Neunhoeffer. Numerical Simulation in Fluid Dynamics. SIAM, 1998.

[10] J. Handy. The Cache Memory Book. Academic Press, second edition, 1998.

[11] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., San Francisco, CA, USA, second edition, 1996.

[12] Intel Corporation. Intel Itanium 2 Processor Reference Manual, 2002. Document Number: 251110-001.

[13] M. Kowarschik and C. Weiß. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. In Algorithms for Memory Hierarchies, volume 2625 of LNCS. Springer, 2003.

[14] D. Loshin. Efficient Memory Programming. McGraw-Hill, 1998.

[15] G. Rivera and C.-W. Tseng. Data Transformations for Eliminating Conflict Misses. In Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada, 1998.

[16] R. C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software. In Proc. of the ACM/IEEE Supercomputing Conference, Orlando, FL, USA, 1998.

[17] D. A. Wolf-Gladrow. Lattice-Gas Cellular Automata and Lattice Boltzmann Models. Springer, 2000.