Paragon Parallel Programming Environment on Sun Workstations
Parallel Scavenge is a parallel garbage-collection algorithm. It is designed to raise garbage-collection throughput and is particularly well suited to multi-core processors and systems with large heaps.
2. How the Parallel Scavenge algorithm works

1. Parallel collection. Parallel Scavenge collects garbage using several threads at once, which raises the throughput of each collection.

2. Generational collection. Parallel Scavenge uses a generational strategy and divides the heap into a young generation and an old generation. The young generation is collected with a parallel copying algorithm, while the old generation is collected with a parallel mark-compact algorithm.

3. Adaptive tuning. Parallel Scavenge also supports adaptive tuning: it adjusts its collection strategy according to the current system load and the observed results of previous collections in order to reach better overall performance.
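In the HotSpot JVM this collector is selected and tuned with command-line flags. The following invocation is only a sketch — the application jar and the concrete values are placeholders — but the flags themselves are standard HotSpot options:

```
java -XX:+UseParallelGC \
     -XX:ParallelGCThreads=8 \
     -XX:MaxGCPauseMillis=200 \
     -XX:GCTimeRatio=99 \
     -XX:+UseAdaptiveSizePolicy \
     -jar app.jar
```

-XX:MaxGCPauseMillis and -XX:GCTimeRatio state the pause-time and throughput goals that the adaptive policy tries to meet; -XX:ParallelGCThreads caps the number of collector threads.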
3. Advantages of the Parallel Scavenge algorithm

1. Efficiency. Because the collection work is done in parallel, Parallel Scavenge finishes each garbage collection faster, which shortens the application's pause times.

2. Adaptability. Parallel Scavenge tunes itself to the system load and to the observed collection results, so it copes well with a range of application scenarios.

3. Optimised for multi-core processors. Because it collects in parallel, Parallel Scavenge is especially well suited to multi-core processors and large-memory systems and can fully exploit the performance of multi-core hardware.
4. Disadvantages of the Parallel Scavenge algorithm

1. Space overhead. Because collection is done in parallel, Parallel Scavenge needs extra space to hold temporary objects during a collection, which causes a certain amount of space overhead.

2. Card-table maintenance. During a collection Parallel Scavenge has to clean up the remembered set (card table), which adds some collection time and overhead.

3. Not suitable for every scenario. Although Parallel Scavenge performs very well on multi-core processors and large-memory systems, not every application scenario benefits from it, so the collector has to be chosen according to the concrete workload.
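Whether Parallel Scavenge fits a given workload is easiest to judge from the collector's own log output. A minimal way to obtain it (again only a sketch; the jar name is a placeholder) is:

```
# JDK 8 and earlier
java -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -jar app.jar
# JDK 9 and later (unified logging)
java -XX:+UseParallelGC -Xlog:gc* -jar app.jar
```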
The Tower of Hanoi Problem: Java Code

Introduction. The Tower of Hanoi is a classic recursion problem that captures the core idea of recursive algorithms. In this article we look in detail at how to solve it in Java.

What is the Tower of Hanoi? The Tower of Hanoi is an old mathematical puzzle that originated in India. The goal is to move a stack of disks from one of three pegs to another, moving only one disk at a time and never placing a larger disk on top of a smaller one. The puzzle is also known as the Tower of Brahma.
The three-disk instance. In the instance considered here there are three pegs and three disks. We need to move the three disks from the first peg to the second peg, using the third peg as an intermediate. As before, only one disk may be moved at a time, and a larger disk must never be placed on top of a smaller one.
Solution. The problem is solved naturally with recursion. Concretely, the whole process breaks down into three steps (the total number of moves they imply is worked out just after this list):
- Move the top two disks from the first peg to the third peg.
- Move the last (largest) disk from the first peg to the second peg.
- Move the two disks from the third peg to the second peg.
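Generalised to n disks, these three steps give the recurrence T(n) = 2*T(n-1) + 1 with T(1) = 1, which solves to 2^n - 1 moves; for the three disks used here that is 7 moves.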
Java implementation. The following code solves the problem in Java:

```
public class Hanoi {
    // Move n disks from peg 'from' to peg 'to', using peg 'via' as the intermediate.
    public static void move(int n, char from, char to, char via) {
        if (n == 1) {
            System.out.println(from + " -> " + to);   // a single disk moves directly
        } else {
            move(n - 1, from, via, to);               // park the top n-1 disks on the intermediate peg
            System.out.println(from + " -> " + to);   // move the largest disk to its target
            move(n - 1, via, to, from);               // bring the n-1 disks over to the target peg
        }
    }

    public static void main(String[] args) {
        int n = 3;                 // three disks
        move(n, 'A', 'B', 'C');    // move them from peg A to peg B, via peg C
    }
}
```

Code walkthrough. In the code above we define a class named Hanoi.
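Running the program as given (n = 3, pegs A, B and C) prints the seven moves of the optimal solution: A -> B, A -> C, B -> C, A -> B, C -> A, C -> B, A -> B.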
Parallel Algorithms for Computing Temporal AggregatesJose Alvin G.Gendrano Bruce C.Huang Jim M.RodrigueBongki Moon Richard T.SnodgrassDept.of Computer Science IBM Storage Systems Division Raytheon Missile Systems Co.University of Arizona9000S.Rita Road1151East Hermans RoadTucson,AZ85721Tucson,AZ85744Tucson,AZ85706jag,bkmoon,rts@ brucelee@ jmrodrigue@AbstractThe ability to model the temporal dimension is essen-tial to many applications.Furthermore,the rate of increase in database size and response time requirements has out-paced advancements in processor and mass storage tech-nology,leading to the need for parallel temporal database management systems.In this paper,we introduce a variety of parallel temporal aggregation algorithms for a shared-nothing architecture based on the sequential Aggregation Tree algorithm.Via an empirical study,we found that the number of processing nodes,the partitioning of the data, the placement of results,and the degree of data reduction ef-fected by the aggregation impacted the performance of the algorithms.For distributed results placement,we discov-ered that Time Division Merge was the obvious choice.For centralized results and high data reduction,Pairwise Merge was preferred regardless of the number of processing nodes, but for low data reduction,it only performed well up to32 nodes.This led us to a centralized variant of Time Division Merge which was best for larger configurations having low data reduction.1.IntroductionAggregate functions are an essential component of data query languages,and are heavily used in many applications such as data warehousing.Unfortunately,aggregate com-putation is traditionally expensive,especially in a tempo-ral database where the problem is complicated by having to compute the intervals of time for which the aggregate value holds.For example,finding the(time-varying)maximum salary of professors in the Computer Science Department This work was sponsored in part by National Science Foundation grants CDA-9500991and IRI-9632569,and National Science Foundation Research Infrastructure program EIA-9500991.The authors assume all responsibility for the contents of the paper.involves computing the temporal extent of each maximum value,which requires determining the tuples that overlap each temporal instant.In this paper,we present several new parallel algorithms for the computation of temporal aggregates on a shared-nothing architecture[8].Specifically,we focus on the Aggregation Tree algorithm[7]and propose several ap-proaches to parallelize it.The performance of the parallel algorithms relative to various data set and operational char-acteristics is of our main interest.The rest of this paper is organized as follows.Section2 gives a review of related work and presents the sequential algorithm on which we base our parallel algorithms.Our proposed algorithms on computing parallel temporal aggre-gates are then described in Section3.Section4presents empirical results obtained from the experiments performed on a shared-nothing Pentium cluster.Finally,Section5con-cludes the paper and gives an outlook to future work.2.Background and Related WorkSimple algorithms for evaluating scalar aggregates and aggregate functions were discussed by Epstein[5].A dif-ferent approach employing program transformation meth-ods to systematically generate efficient iterative programs for aggregate queries has also been suggested[6].Tumas extended Epstein’s algorithms to handle temporal aggre-gates[9];these were further extended by Kline[7].While the 
resulting algorithms were quite effective in a uniproces-sor environment,all suffer from poor scale-up performance, which identifies the need to develop parallel algorithms for computing temporal aggregates.Early research on developing parallel algorithms focused on the framework of general-purpose multiprocessor ma-chines.Bitton et al.proposed two parallel algorithms for processing(conventional)aggregate functions[1].The Subqueries with a Parallel Merge algorithm computes par-tial aggregates on each partition and combines the partialName Salary BeginEnd Richard40K18 Karen45K820 Nathan35K712 Nathan37K1821Count BeginEnd 178 2812 11218 31820 22021 121(a)Data Tuples(b)ResultTable1.Sample Database and Its TemporalAggregationresults in a parallel merge stage to obtain afinal result.An-other algorithm,Project by list,exploits the ability of the parallel system architecture to broadcast tuples to multi-ple processors.The Gamma database machine project[4] implemented similar scalar aggregates and aggregate func-tions on a shared-nothing architecture.More recently,par-allel algorithms for handling temporal aggregates were pre-sented[11],but for a shared-memory architecture.The parallel temporal aggregation algorithms proposed in this paper are based on the(sequential)Aggregation Tree algorithm(SEQ)designed by Kline[7].The aggregation tree is a binary tree that tracks the number of tuples whose timestamp periods contain an indicated time span.Each node of the tree contains a start time,an end time,and a count.When an aggregation tree is initialized,it begins with a single node containing(see the initial tree in Figure1).In the following example[7],there are tuples to be in-serted into an empty aggregation tree(see Table1(a)).The start time value,,of thefirst entry to be inserted splits the initial tree,resulting in the updated aggregation tree shown in Figure1.Because the original node and the new node share the same end date of,a count of1is assigned to the new leaf node.The aggregation tree after inserting the rest of the records in Table1(a)is shown in Figure1.To compute the number of tuples for the periodin this example,we simply take the count from the leaf node(which is),and add its parents’count val-ues.Starting from the root,the sum of the parents’counts is and adding the leaf count,gives a total of .The temporal aggregate results are given in Table1(b).Though SEQ correctly computes temporal aggregates,it is still a sequential algorithm,bounded by the resources of a single processor machine.This makes a parallel method for computing temporal aggregates desirable.After adding [18,∞)Figure1.Example run of the Sequential(SEQ)Aggregation Tree Algorithm3.Parallel Processing of Temporal AggregatesIn this section,we proposefive parallel algorithms for the computation of temporal aggregates.We start with two simple parallel extensions to the SEQ algorithm,the Sin-gle Aggregation Tree(abbreviated SAT)and Single Merge (SM)algorithms.We then go on to introduce the Time Divi-sion Merge with Centralizing step(TDM+C)and Pairwise Merge(PM)algorithms,which both require more coordi-nation,but are expected to scale better.Finally,we present the Time Division Merge(TDM)algorithm,a variant of TDM+C,which distributes the resulting relation,as differ-entiated from the centralized results produced by the other algorithms.3.1.Single Aggregation Tree(SAT)Thefirst algorithm,SAT,extends the Aggregation Tree algorithm by parallelizing disk I/O.Each worker node reads its data partition in parallel,constructs the 
valid-time peri-ods for each tuple,and sends these periods up to the coordi-nator.The central coordinator receives the periods from all the worker nodes,builds the complete aggregation tree,and returns thefinal result to the client.3.2.Single Merge(SM)The second parallel algorithm,SM,is more complex than SAT,in that it includes computational parallelism along with I/O parallelism.Each worker node builds a local aggregation tree,in parallel,and sends its leaf nodes to the coordinator.Unlike the SAT coordinator,which inserts periods into an aggregation tree,the SM coordinator merges each of the leaves it receives using a variant of merge-sort.The use of this efficient merging algorithm is possible since the worker nodes send their leaves in a temporally sorted order.Finally,after all the worker nodes finish sending their leaves,the coordinator returns the final result to the client.3.3.Time Division Merge with Coordinator(TDM+C)Like SM,the third parallel algorithm also extends the aggregation tree method by employing both computational and I/O parallelism (see Figure 2).The main steps for this algorithm are outlined in Figure 3.Local TreesFigure 2.Time Division Merge with Centraliz-ing Step (TDM+C)AlgorithmStep 1.Client requestStep 2.Build local aggregation trees Step 3.Calculate local partition sets Step 4.Calculate global partition set Step 5.Exchange data and merge Step 6.Merge local results Step 7.Return results to clientFigure 3.Major Steps for the TDM+C Algo-rithm3.3.1Overall AlgorithmTDM+C starts when the coordinator receives a temporal aggregate request from a client.Each worker node is in-structed to build a local aggregation tree using its data par-tition knowing the number of worker nodes,,participating in the query.After each worker node constructs its local aggregationtree,the tree is augmented in the following manner.Thenode traverses its aggregation tree in DFS order,propagat-ing the count values to the leaf nodes.The leaf nodes now contain the full local count for the periods they represent,and any parent nodes are discarded.After all worker nodes complete their aggregation trees,they exchange minimum (earliest)start time and maximum (latest)end time values to ascertain the overall timeline of the query.Timeline Covered By NodeFigure 4.Timeline divided into partitions,forming a global partition setThe leaves of a local aggregation tree are evenly split into local partitions,consisting of a period and a tuple count.Because each partition is split to have the same (or nearly)the same number of tuples,local partitions can have different durations.The local partition set (containing par-titions)from each processing node is then sent to the coor-dinator.The coordinator takes all local partition sets 1and com-putes global partitions (how this is done is discussed in the next section).After computing the global time partition set,the coor-dinator then naively assigns the period of the partitionto theworker node,and broadcasts the global partition set and respective assignments to all the nodes.The worker nodes then use this information to decide which local ag-gregation tree leaves to send,and to which worker nodes to send them to.Note that periods which span more than one global partition period are split and each part is assigned ac-cordingly(split periods do not affect the result correctness).Each worker node merges the leaves it receives with the leaves it already has to compute the temporal aggregate for their assigned global partitions.When all the worker nodes finish 
merging,the coordinator polls them for their results in sequential order.The coordinator concatenates the results and sends the final result to the client.1Atotal oflocal partitions are created byworker nodes.0591030350800100015005000100000505050 151515 303030Figure 5.Local Partition Sets from ThreeWorker Nodes3.3.2Calculating the Global Partition SetWe examine in more detail the computation of the global partition set by the coordinator.Recall that the coordinator receives from each worker node a local partition set,con-sisting of contiguous partitions.The goal is to temporally distribute the computation of thefinal result,with each node processing roughly the same number of leaf nodes.As an example,Figure5presents local partitions from worker nodes.The number between each hash mark seg-menting a local timeline represents the number of leaf nodes within that local partition.The total number of leaf nodes from the nodes is.The best plan is having leaf nodes to be processed by each node.Figure4illustrates the computation of the global partition set.We modified the SEQ algorithm to compute the global partition set,using the local partition information sent by the worker nodes.We treat the worker node local parti-tion sets as periods,inserting them into the modified ag-gregation tree.From Figure5,thefirst period to be in-serted is[5,9)(50),the fourth is[0,30)(15),and the seventh is[0,10)(30),and the ninth(last)is[1000,5000)(30).This use of the Aggregation Tree is entirely separate from the use of this same structure in computing the aggregate.Here we are concerned only with determining a division of the time-line into contiguous periods,each with approximately the same number of leaves.There are three main differences between our Modified Aggregation Tree algorithm used in this portion of TDM+C and the original Aggregation Tree[7],used in step2of Figure3.First,the“count”field of this aggregation tree node is incremented by the count value of the local parti-tion being inserted,rather than.Second,a parent node must have a count value of.When a leaf node is split and becomes a parent node,its count is split proportionally be-tween the two new leaf nodes based on the durations of their respective time periods.This new parent count becomes. Third,during an insertion traversal for a record,if the search traversal diverges to both subtrees,the record count is split proportionally between the2sub-trees.Inserted Records [5,9)(50), [9,800)(50), and [800,1500)(50)(a)First3Local Partitions(b)After partition4is addedFigure6.Intermediate Aggregation Tree As an example,suppose we inserted thefirst three lo-cal partitions,and now we are inserting the fourth one [0,30)(15).The current modified aggregation tree,before inserting the fourth local partition,is shown in Figure6a. Notice that for leaf node[5,9)(50),the count value is set to instead of(first difference).The second and third differences are exemplified when the fourth local partition is added.At the root node,we see that the period for this fourth partition overlaps the periods of the left sub-tree and the right sub-tree.In the original aggregation tree,we simply added to a node’s count in the left sub-tree and the right sub-tree at the appropriate places. 
Here,we see the third difference.We split this partition count of in proportion to the durations of the left and right sub-trees.The root left sub-tree contains a period[0,5) for a duration of time units.The fourth local partition period is[0,30),or time units.We compute the left sub-tree’s share of this local time partition’s count as,while the right sub-tree’s share is.In this case,the left sub-tree leaf node[0,5)now has a count of (see Figure6b).We now pass down the root right sub-tree,increasing its right leaf node count from[5,9)(50)to [5,9)(52)as its share of the newly added partition’s count,, is added,by using the same proportion calculation method. At leaf node[9,800)(50),the inserted partition’s count is now down to,since was taken by node[5,9)(52).Now,the second difference comes into play.Two new leaf nodes are created by splitting[9,800)(50).The new leaves are[9,30)and[30,800).Leaf[9,30)receives all the remaining inserted partition’s count of.The count of from[9,800)(50)is now divvied up amongst the two new leaf nodes.The left leaf node receives of the ,while the right leaf node receives.So the new left leaf node is now[9,30)(12),where comes from,and the new right leaf node shows as[30,800)(49).Again,see Figure6b for the result.Table2shows the leaf node values once all local time partitions from Figure5are inserted.Count Begin End170564593910121030443035043350800218001000401000150032150050009500010000Table2.All leaf node values in a tabular formatonce all9partitions from Figure5are insertedNow that the coordinator has the global span leaf counts and the optimal number of leaf nodes to be processed by each node,it canfigure out the global partition set.For each node(except the last one),we continue adding the span leaf counts until it matches or surpasses the optimal number of leaf nodes.When the sum is more than the optimal number, we break up the leaf node that causes this sum to be greater than the optimal number,such that the leaf node count divi-sion is done in proportion to the period duration.As an example,refer to Table2.We know that the optimal number of periods per global partition is.We add the leaf node counts from the top until we reach node [10,30)(12).The sum at this point is,or more than optimal.We break up[10,30)(12)into two leaf nodes such that thefirst leaf node period should contain a count of, and the newly created leaf node should contain -ing the same idea of proportional count division,we can see that[10,28)(11)and[28,30)(1)are the two new leaf nodes. 
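The proportional count division used throughout this construction is simple to state in code. The following C fragment is our own illustrative sketch (the names are not taken from the paper): it splits a leaf's tuple count between two adjacent sub-periods in proportion to their durations, and with the figures from the example above it reproduces the split of [10,30)(12) into [10,28)(11) and [28,30)(1).

```c
#include <stdio.h>

/* Split 'count' tuples between two adjacent sub-periods in proportion to
 * their durations; the right share is the remainder so the sum stays exact. */
static void split_count(long count, long left_dur, long right_dur,
                        long *left_share, long *right_share)
{
    long total = left_dur + right_dur;
    *left_share  = (count * left_dur + total / 2) / total;  /* rounded share */
    *right_share = count - *left_share;
}

int main(void)
{
    long l, r;
    split_count(12, 28 - 10, 30 - 28, &l, &r);              /* leaf [10,30)(12) split at time 28 */
    printf("[10,28) gets %ld, [28,30) gets %ld\n", l, r);   /* prints 11 and 1 */
    return 0;
}
```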
So thefirst global time partition has the period[0,28)and has a count of.The computation for the second global time partition starts at[28,30)(1).Continuing on,the global time parti-tions for this example are[0,28),[28,866),and[866,10000).The reader should be aware that this global time partition resolution algorithm is not perfect.The actual number of local aggregation tree leaves assigned to each worker node may not be identical.The reason is that the algorithm uses the local partition sets,which are just guides for the global partitioning.When a local partition has leaf nodes in pe-riod[9,800),the global partition scheme assumes a uniform distribution,while the actual leaf nodes distribution may be heavily skewed.3.3.3Expected PerformanceWe expect better scalability for TDM+C as compared to the SAT and SM algorithms because of the data redistribution and its load-balancing effect.However,there are two globalStep1.Client requestStep2.Build local aggregation treesStep3.While notfinal aggregation tree Mergebetween2nodesStep4.Return results to clientFigure7.Major Steps for the PM Algorithm synchronization steps that may limit the performance ob-tained.First,all of the local partition sets must be com-pleted before the global time set partitioning can begin.Sec-ond,all of the worker nodes must complete their merges and send their results to the coordinator before the client can re-ceive thefinal result.The next algorithm,PM,will attempt to obtain better performance,by replacing the two global synchronization steps with localized synchronization steps.3.4.Pairwise Merge(PM)The fourth parallel algorithm,PM(see Figure7),dif-fers from TDM+C in two ways.First,the coordinator is more involved than in TDM+C.Secondly,instead of all the worker nodes merging simultaneously,as in TDM+C,pairs of worker nodes merge when the opportunity presents itself. Which two worker nodes are paired is determined dynami-cally by the query coordinator.A worker node is available for merging when its local aggregation tree has been built.The worker node informs the query coordinator that it has completed its aggregation tree.The query coordinator then arbitrarily picks another worker node that had previously completed its aggregation tree,thereby allowing the two worker nodes to merge their leaves.Then,the query coordinator instructs the worker node with the least number of leaf nodes to send the leaves to the other node,the“buddy worker node”,which does the merging of leaves.Once a worker nodefinishes transmitting leaves to its buddy worker node,it is no longer a participant in the query. 
This buddying-up continues until the query coordinator as-certains that only one worker node is left,which contains the completed aggregation tree.The query coordinator then directs the sole remaining worker node to submit the results directly to the client.Figure8provides a conceptual picture of this“buddy”system.A portion of a PM aggregation tree may be merged mul-tiple times with other aggregation trees.The merge algo-rithm is a merge-sort variant operating on two sorted lists as input(the local list,and the received list).This merge is near linear,,in the number of leaf nodes to be merged.Sole Remaining Figure 8.Pairwise Merge (PM)Algorithm3.5.Time Division Merge (TDM)The fifth parallel algorithm,TDM,is identical to TDM+C,except that it has distributed result placement rather than centralized result placement.This algorithm simply eliminates the final coordinator results collection phase and completes with each worker node having a dis-tinct piece of the final aggregation tree.A distributed re-sult is useful when the temporal aggregate operation is a subquery in a much larger distributed query.This allows further localized processing on the individual node’s aggre-gation sub-result in a distributed and possibly more efficient manner.4.Empirical EvaluationFor the purposes of our evaluation,we chose the tempo-ral aggregate operation COUNT since it does not require that the attribute itself be sent.This simplifies the data struc-tures maintained while still exhibiting the characteristics of a temporal aggregate computation.Based on this tem-poral aggregate operation we perform a variety of perfor-mance evaluations on the five parallel algorithms presented.The matrix in Table 3summarizes the experiments we have done.Algorithms Covered NumProcessors 1SAT,PM,SM,TDM,TDM+C 2,4,8,16,32,642SAT,PM,SM,TDM,TDM+C 2,4,8,16,32,643SAT,PM,SM,TDM,TDM+C 2,4,8,16,32,644PM,SM,TDM,TDM+C 16Table 3.Experimental Case Matrix Summary4.1.Experimental EnvironmentThe experiments were conducted on a 64-node shared-nothing cluster of 200MHz Pentium machines,each with 128MB of main memory and a 2GB hard disk.The ma-chines were physically mounted on two racks of 32ma-chines.Connecting the machines was a 100Mbps switched Ethernet network,having a point-to-point bandwidth of 100Mbps and an aggregate bandwidth of 2.4Gbps in all-to-all communication.Each machine was booted with version 2.0.30of the Linux kernel.For message passing between the Pen-tium nodes,we used the LAM implementation of the MPI communication standard [2].With the LAM implemen-tation,we observed an average communication latency of 790microseconds and an average transfer rate of about 5Mbytes/second.4.2.Experimental ParametersTo help precisely define the parameters for each set of tests,we established an experiment classification scheme.Table 4lists the different parameters,and the set of param-eter values for each experiment.Synthetic datasets were generated to model relations which store time-varying information for each employee in a database.Each tuple has three attributes,an SSN attribute which is filled with random digits,a StartDate attribute,and an EndDate attribute.The SSN attribute refers to an en-try in a hypothetic employee relation.On the other hand,the StartDate and EndDate attributes are temporal instants which together construct a valid-time period.The data gen-eration method varies from one experiment to another and is described later.NumProcessors depends on the type of performance measurement.Scale-up experiments used 2,4,8,16,32,and 
64processing nodes,while the variable reduction ex-periment used a fixed set of 16nodes.To see the effects of data partitioning on the perfor-mance of the temporal algorithms,the synthetic tables were partitioned horizontally either by SSN or by StartDate.The SSN and StartDate partitioning schemes were attempts to model range partitioning based on temporal and non-temporal attributes [3].The tuple size was fixed at 41bytes/tuple.The tuple size was intentionally kept small and unpadded so that the gener-Parameter Exp4.3Exp4.4Exp4.5Exp4.62,4,8,16,32,642,4,8,16,32,642,4,8,16,32,6416by SSN by SSN by StartDate by StartDate41bytes41bytes41bytes41bytes65536tuples65536tuples65536tuples65536tuples*65536*65536*6553616*655360%100%0%0/20/40/60/80/100%Table4.Experiment Parametersated datasets could have more tuples before their size madethem difficult to work with.2All experiments except the single speed-up test used afixed database partition size of65,536tuples.This wasdone to facilitate cross-referencing of results between dif-ferent tests.Because of this,the16-node results of thescale-up experiments are directly comparable to the resultsof the16-node data reduction experiment.The total database size reflects the total number of tuplesacross all the nodes participating in a particular experimentrun.For scale-up tests,the total database size increased with the number of processing nodes.Finally,the amount of data reduction is100minus the ratio between the number of resulting leaves in thefinal aggregation tree and the original number of tuples in the dataset.A reduction of100percent means that a100-tuple dataset produces1leaf in thefinal aggregation tree because all the tuples have identical StartDates and EndDates.4.3.Baseline Scale-Up Performance:No Reductionand SSN PartitioningWe set up ourfirst experiment to compare the scale-up properties of the proposed algorithms on a dataset with no reduction.We will also use the measurements taken from this experiment as a baseline for later comparisons with sub-sequent experiments.The second column of Table4gives the parameters for this particular experiment.For this experiment,a synthetic dataset containing4M tuples was generated.Each tuple had a randomized SSN atrribute and was associated with distinct periods of unit length(i.e.,).The dataset was then sorted by SSN.3and were then distributed to the64 processing nodes.To measure the scale-up performance of the proposed al-gorithms,a series of6runs having2,4,8,16,32,and64 nodes,respectively,were carried out.Note that since we fixed the size of the dataset on each node,increasing the number of processors meant increasing the total database size.Timing results from this experiment are plotted in Fig-ure9and lead us to the following conclusions.2The total database size for the scale-up experiment at64processing nodes was64partitions65536tuples41bytes=171,966,464bytes.3Since the SSNfields are generated randomly,this has the effect of10203040506070248163264 TimeinSecondsNumber of Worker NodesSATSMPMTDMTDM+CFigure9.Scale-Up Results(4M tuple Datasetwith No Reduction and SSN Partitioning)SM performs better than SAT.Intuitively,since the dataset exhibits no reduction,both SM and SAT send all periods from the worker nodes to the coordinator.The rea-son behind SM’s performance advantage comes from the computational parallelism provided by building local aggre-gation trees on each worker node.Aside from potentially reducing the number of leaves passed on to the coordina-tor,this process of building local trees sorts 
the periods in temporal order.This sorting makes compiling the results more efficient4than SAT’s strategy of having to insert each valid-time period into thefinal aggregation tree.SAT exhibits the worst scale-up performance.This result is not surprising,since the only advantage SAT has over the original sequential algorithm comes from parallelized I/O. This single advantage does not make up for the additional communication overhead and the coordinator bottleneck.5 The performance difference between TDM and TDM+C increases with the number of nodes.For this observation, it is important to remember that TDM+C is simply TDM plus an additional result-collection phase that sends allfinal leaves to the coordinator,one worker node at a time.The performance difference increases with the number of nodesrandomizing the tuples in terms of StartDate and EndDatefields.4The SM coordinator uses a merge-sort variant in compiling and con-structing thefinal results.5In SAT,all the periods are sent to the coordinator which builds a single, but large,aggregation tree.because of the non-reducible nature of the dataset and the fact that scale-up experiments work with more data as the number of nodes increase.Among the algorithms that provide monolithic results, PM has the best scaleup performance up to32nodes.This is attributed to the multiple merge levels needed by PM.A PM computation needs at least merge levels where is the number of processing nodes.On the other hand,the TDM+C algorithm only merges local trees once but has three synchronization steps,as described in Section3.Withthis analysis in mind,we expected PM to perform better or as well as TDM+C for2,4,and8nodes,which have1,2, and3merge levels,respectively.We then expected TDM+C to outperform PM as more nodes are added,but we were suprised to realize that PM was still performing better than TDM+C up to perhaps50nodes.Tofind out what was going on behind the scenes,we used the LAM XMPI package[2]to visually track the pro-gression of messages within the various TDM+C and PM runs.This led us to the reason why TDM+C performed worse than PM for2to32nodes:TDM+C was slowed more by increased waiting time due to load-imbalance(computa-tion skew)as compared to PM.4.4.Scale-Up Performance:100%Reduction andSSN PartitioningThis experiment is designed to measure the effect of a significant amount of reduction(100%in this case)on the scale-up properties of the proposed algorithms.Table4 gives the parameters for this experiment.This experiment is modeled after thefirst one but with a synthetic dataset having100%reduction.This dataset was generated by creating4M tuples associated with the same period and having randomized SSN attributes.The syn-thetic dataset was then rearranged randomly6and split into 64partitions each having65,536tuples.This experiment,like thefirst one,is a scale-up experi-ment.Hence,it was conducted in much the same way.Tim-ing results from this experiment are plotted in Figure10and leads us to the following observations.All algorithms benefit from the100%data reduction. Comparing results from the baseline experiment with re-sults from the current experiment lead us to this observation. Because of the high degree of data reduction,the aggrega-tion trees do not grow as large as in thefirst experiment. 
With smaller trees,insertions of new periods take less time because there are fewer branches to traverse before reaching the insertion points.Because all of the presented algorithms use aggregation trees,they all experience increased perfor-mance.6The aggregation tree algorithm performs at its worst case when the dataset is sorted by time[7].10203040506070248163264 TimeinSecondsNumber of Worker NodesSATSMPMTDMTDM+CFigure10.Scale-Up Results(4M tuple Datasetwith100%Reduction and SSN Partitioning)With100%reduction,PM and TDM+C catch up to TDM.Aside from constructing smaller aggregation trees, a high degree of data reduction decreases the number of ag-gregation tree leaves exchanged between nodes.TDM does not send its leaves to a central node for result collection,so it does not transfer as many leaves as its peers.Because of this,TDM is not impacted by the amount of data reduction as much as either PM or TDM+C which end up performing as well as TDM.4.5.Scale-Up Performance:No Reduction andTime PartitioningThis experiment is designed to measure the effect of time partitioning on the scale-up properties of the proposed algo-rithms.The settings for this experiment are summarized in Table4.The dataset for this experiment was generated in a man-ner similar to thefirst one,but with StartDate rather than SSN partitioning.This was done by sorting the whole dataset by the StartDate attribute and then splitting it into 64partitions of64K tuples each.Time Partitioning did not significantly help any of the algorithms.We originally expected TDM and TDM+C to benefit from the time partitioning but we also realized that for this to happen,the partitioning must closely match the way the global time divisions are calculated.Because we randomly assigned partitions to the nodes,TDM did not benefit from the time partitioning.In fact,it even performed a little bit poorer in all but the16-node run.We attribute the small performance gaps to differences in how the partition-ing strategies interacting with the number of nodes made TDM redistribute mildly varying numbers of leaves across the runs.As for SM and PM,they exhibited no conclu-sive improvement because they were simple enough to work without considering how tuples were distributed across the various partitions.。
Paragon Parallel Programming Environment on SunWorkstationsStefan Lamberts,Georg Stellner,Arndt Bode and Thomas LudwigInstitut f¨u r InformatikLehrstuhl f¨u r Rechnertechnik und RechnerorganisationTechnische Universit¨a t M¨u nchenD-80290M¨u nchenlamberts,stellner,bode,ludwig@informatik.tu-muenchen.deOctober8,1993AbstractToday’s requirements for computational power are still not satisfied.Supercompu-ters on the one hand achieve good performancefigures for a great variety of applica-tions but are expensive to buy and maintain.Multiprocessors like the Paragon XP/Sare cheaper but require more effort to port applications.As one consequence,muchcomputational power of such systems is wasted with debugging these codes.An at-tempt to withdraw implementation and debugging codes from multiprocessor systemsis the usage of coupled workstations.A software environment for networks of worksta-tions allows for implementation and testing of applications.After having been testedthe applications can then be shifted to the multiprocessor systems by recompilation.The paper describes the design and implementation of an environment which allows touse Ethernet coupled Sun SPARC systems as a development platform for applicationstargeted for Intel Paragon XP/S systems.1MotivationScientific and commercial applications require much computational power.Today’s su-percomputer systems have been developed to satisfy these demands.Their computational power is sufficient to solve some of the so called Grand Challenge Problems.Typically, these machines are very expensive and difficult to maintain,e.g.they need a water-cooling system.A different architectural approach has been made to reduce the costs for such pow-erful machines.Assembling cheap and simple standard components such as processors and memory chips into a single machine saves purchase as well as maintenance costs.These machines are the classical distributed memory multiprocessor systems where standard mi-croprocessor nodes are interconnected with a high performance interconnection network. 
Intel’s Paragon XP/S system is a typical member of this class.A drawback of multiprocessors is that porting existing applications onto those systems requires enormous efforts.Applications have to be parallelized which leads to frequent test runs during the implementation.Therefore,much workload on multiprocessor systems consists of test and debugging runs.To withdraw some of this load an environment is needed which allows the implementation of applications for multiprocessor systems on different hardware platforms.Today,typical environments in universities and companies consist of several work-stations all interconnected via standard Ethernet.The basic architecture of multiprocessor systems and coupled workstations is similar:independent processing elements(nodes or workstations)which are interconnected.In difference to the multiprocessors’high perfor-mance interconnection network,workstations use a slower interconnect.In addition the network has to be shared with other machines and users which are also connected to the network.State-of-the-art multiprocessors like the Paragon currently offer a proprietary message-passing environment.An implementation of that library on coupled workstations would allow for using interconnected workstations as a development platform for applications where the production code shouldfinally run on a multiprocessor system.Message-passing libraries for coupled workstations which offer a user interface similar to a multiprocessor can withdraw workload from these systems.In addition to that,it is also applicable to use interconnected workstations as additional computational resource.During times of low system load on the workstations their aggregated computational power can be used to run production versions of applications.Two restrictions apply for this approach.First,the computational power offered by a number of workstations in a local area network does not reach today’s multiprocessor systems.And second,the communication speed of the interconnection network is several orders of magnitudes lower than the one of multiprocessor systems like the Paragon.Thus, suitable applications are restricted to those with limited demands concerning computational power and the granularity of parallelism should be medium or even better coarse.In the following we will describe the design and implementation of the Paragon OSF/1 communication library for Sun workstations which are interconnected via Ethernet.There-fore wefirst give a short description of the Paragon,its operating system and the message-passing library NX.After that,we introduce the design of the NXLIB message-passing library for coupled workstations.In chapter4we show in detail how this concept has been implemented.The last two chaptersfinally give a summary and an outlook on future work. 
2The Paragon and its Message-Passing InterfaceTo get a better understanding of the design and implementation we have chosen,we first present a short overview on the Paragon,its OSF/1operating system and the NX message-passing library.Intel’s Paragon is a MIMD[1]system with distributed memory.twoFigure1shows the basic architecture of the Paragon nodes.Each node consists ofIntel i860/XP[6]microprocessors:one to run the operating system and user applications (application processor)and one to handle the communication between the nodes(message processor).Both processors access the local memory which can be up to32MB large via acommon bus.A DMA controller(data transfer engine)allows for efficient data movement on each node.An additional expansion port and an I/O interface can be used to attach peripherals to a node.Finally a hardware monitoring chip has been integrated to provide low intrusion performance measurements on each node.The nodes of a Paragon system are interconnected in a two-dimensional mesh topology. Each node is connected to a special routing chip,the so called iMRC.The iMRC chip routesFigure2:Interconnection scheme of a Paragon systemthe messages between the nodes using a wormhole routing algorithm[7].The links between the iMRC chips are16bit wide and achieve a bidirectional communication bandwidth of 350MB/s.The nodes in a Paragon system are subdivided into three partitions:the I/O partition, the service partition and the compute partition.Figure3shows a typical configuration of a Paragon ually the largest partition in a configuration is the compute partition.Compute Partition Service Partition I/O PartitionFigure3:Different partitions in a Paragon systemParallel user applications are executed on the nodes in this partition.In contrast to that, interactive processes,like shells,editors etc.,are executed on the nodes in the service partition.Finally,the nodes in the I/O partition are used to connect I/O devices,like disks or local area networks,to the machine.Although the nodes are arranged in different partitions they execute the same operating system kernel.The Paragon operating system is a Mach3.0based implementation of theOSF/1operating system[8].It provides the user with a single system image of the machine. Any command which a user invokes during an interactive machine session is executed on any of the nodes in the service partition.Files can be transparently accessed from any node. File accesses are therefore transformed into corresponding requests to the nodes in the I/O partition.Parallel user applications on the compute partition make use of Intel’s message-passing library which is derived form the NX/2of the iPSC systems[9].Apart from synchronous, asynchronous and interrupt-driven communication calls,NX provides calls for the process management of parallel applications.Cooperating processes address each other via a node number and a process type(ptype).The node number is derived from the node where the process is executing,whereas the ptype can be modified via corresponding calls[3,4,2,5].The following section will introduce the concepts which were necessary to offer a similar system image on a network of workstations to the one available on the Paragon.This will include a more detailed discussion of some Paragon features where it seems appropriate. 
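To make this programming model concrete, the fragment below sketches how two cooperating application processes might exchange a message with NX-style calls. The prototypes follow the conventional NX interface described above (message type, buffer, length, destination node, ptype); the header name and exact signatures are assumptions and may differ in detail from the actual Paragon headers.

```c
#include <nx.h>        /* NX message-passing interface (header name assumed) */
#include <stdio.h>

#define TAG 42L        /* user-chosen message type */

int main(void)
{
    long   me    = mynode();    /* node number of this process */
    long   nodes = numnodes();  /* number of nodes allocated to the application */
    double buf[128];
    int    i;

    if (me == 0) {
        for (i = 0; i < 128; i++)
            buf[i] = (double) i;
        /* node 0 sends the buffer to node 1, ptype 0 */
        csend(TAG, (char *) buf, (long) sizeof(buf), 1L, 0L);
    } else if (me == 1) {
        /* node 1 blocks until a message of type TAG arrives */
        crecv(TAG, (char *) buf, (long) sizeof(buf));
        printf("node %ld of %ld received %ld bytes\n",
               me, nodes, (long) sizeof(buf));
    }
    return 0;
}
```

On the Paragon such a program would be linked with the –nx (or –lnx) option mentioned in the next section; under NXLIB the same source is simply recompiled for the workstation network.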
The following section will introduce the concepts which were necessary to offer a system image on a network of workstations similar to the one available on the Paragon. This will include a more detailed discussion of some Paragon features where it seems appropriate.

3 The Design of the Paragon Message-Passing Library for Workstations

Due to the predefined user interface of the environment, the design process of NXLIB was limited to finding a model for the Paragon node, a layering of the software and a mapping of Paragon partitions to the workstations. Each of the following three sections gives a short introduction to one of these topics.

3.1 The node model

In the following, the meaning of some frequently used terms will be explained. A parallel application on a Paragon system consists of two parts: the application processes on the compute partition and the controlling process of the application on one node of the service partition. Parallel applications have to be linked with a special linker option (-nx or -lnx), which includes the NX calls. Apart from the NX calls, the application processes can also make use of the OSF/1 system calls. In the following discussion the term Paragon node refers to the collection of a hardware Paragon node, the OSF/1 operating system kernel and a set of application processes running on top of that.

The basic means to model Paragon nodes on coupled workstations is virtualization. Consequently, the term virtual Paragon node (VPN) describes a Paragon node on a workstation. The hardware and software properties of a Paragon node which are not available on a workstation are virtualized in the NXLIB software environment. The VPN is the smallest unit of distribution in the NXLIB environment, i.e. upon startup the user can define how many VPNs to use and on which machine a specific VPN should be located. Aspects concerning the mapping of the VPNs to machines will be discussed in section 3.3. Currently, a standard lightweight process library is not yet available on every UNIX system.
Thus, the decision was made to use heavyweight UNIX processes to model a VPN on a workstation. Section 4.1 will show which processes are required to model VPNs and give a detailed description of their cooperation.

3.2 Layers of NXLIB

Important issues for a message-passing library for coupled workstations are portability and flexibility. A layering of the message-passing library has been designed to cover both aspects. Figure 4 shows the layers of the NXLIB environment.

Figure 4: Layers of the NXLIB environment

The basis is formed by the standard UNIX system call layer with its different interprocess communication calls. To achieve great flexibility concerning the communication protocol used for the implementation, NXLIB distinguishes between local and remote communication. Thus, for either case it is possible to use the protocol which achieves the best performance. Within the local and remote communication layer a protocol-specific addressing scheme is used.

The reliable communication layer provides reliable point-to-point communication calls regardless of the location of the communication partners. Its interface still uses the Paragon addressing scheme. The address conversion layer has been introduced to map Paragon addresses, consisting of a node number and a ptype, to the corresponding protocol-specific addresses. In addition to its address conversion task, this layer also determines whether a communication is local or remote. Provided with that information, the reliable communication layer can invoke the appropriate communication calls of either the local or the remote communication layer.

On a Paragon system the OSF/1 operating system provides a sophisticated buffer management. Its parameters can be configured upon the startup of an application with several command line switches. This mechanism allows the usage of the limited memory resources on each node to be adapted to the needs of an application. In addition, reserving enough buffer space may be required for certain applications to avoid deadlocks. A communication flow protocol has been included in the Paragon communication to avoid flooding a node's buffers with messages. The buffer management layer we have introduced is based on the simplifying assumption that unlimited buffer space is available on each machine. Unlike the Paragon, where incoming messages are placed in a prereserved memory area, in NXLIB the memory is allocated dynamically when a message arrives on a node. Consequently, a control flow protocol and the configuration parameters for the buffer sizes can be omitted in NXLIB.

The Paragon OSF/1 communication interface finally provides the user calls which are available on a Paragon system. The calls of the buffer management to insert and delete messages in the message table are used to map messages to the corresponding user calls. All user calls are therefore not directly based on communication but make use of calls which update the message table.
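A minimal sketch of how a send might flow through these layers is given below. All type and function names are invented for illustration and do not correspond to the actual NXLIB sources; in particular, the table lookup of the address conversion layer is only declared, not implemented.

/* Illustrative only: invented names standing in for the NXLIB layers. */
#include <stddef.h>
#include <unistd.h>

typedef struct {            /* result of the address conversion layer  */
    int  sockfd;            /* protocol-specific address (TCP socket)  */
    int  is_local;          /* located on the same workstation?        */
} peer_addr_t;

/* address conversion layer: map (node, ptype) to a protocol address;
 * the table lookup itself is omitted here                             */
extern int addr_lookup(long node, long ptype, peer_addr_t *out);

/* local and remote communication layer: both currently end up in the
 * same TCP socket calls                                                */
static int local_send(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

static int remote_send(int fd, const void *buf, size_t len)
{
    return write(fd, buf, len) == (ssize_t)len ? 0 : -1;
}

/* reliable communication layer: still addressed with node and ptype   */
int reliable_send(long node, long ptype, const void *buf, size_t len)
{
    peer_addr_t peer;

    if (addr_lookup(node, ptype, &peer) != 0)
        return -1;                  /* destination address not known   */

    return peer.is_local ? local_send(peer.sockfd, buf, len)
                         : remote_send(peer.sockfd, buf, len);
}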
3.3 Modeling Paragon partitions

A short overview of the three basic partitions on a Paragon system was already given in section 2. As an enhancement to the hardware definition of partitions, users can also define software partitions. These software partitions are composed of any selection of nodes in the compute partition. The Paragon OSF/1 operating system provides calls to define and modify such partitions. Similar to the UNIX file system, partitions have an owner (creator), access permissions and a name, and may be created hierarchically.

In a workstation environment the situation is different. One way to provide similar semantics is to use mapping files. Within the file, a table has to be specified which maps virtual node numbers to workstation names or Internet addresses, respectively. The owner, access permissions and name of the mapping table can be used to simulate the corresponding Paragon partition properties. In addition, the file system hierarchy can be used to model the hierarchical definition of the partitions. Thus, the mapping table defines a virtual compute partition.

A problem occurs for the service partition. It is not part of the Paragon partition management which is available to the user. Consequently, a different means has to be provided to establish a virtual service partition. This is simply done by defining the machine where the application has been started as the virtual service partition of the virtual Paragon on the workstations.

4 The Implementation of NXLIB

In the previous section we have shown which concepts were developed to virtualize a Paragon system on a network of workstations. The next sections will show how these concepts were realized.

4.1 Implementation of virtual Paragon nodes

The implementation of the virtual Paragon node concept includes the controlling process on the virtual service partition as well as the application processes on the virtual compute partition. In the following we first introduce how the VPN concept has been implemented on the virtual compute partition and then discuss the implementation on the virtual service partition.

The goal of virtual Paragon nodes is to have an equivalent to a Paragon node, which consists of the node hardware, the operating system on that node and the application processes on that node. A natural approach to model this environment is to introduce a daemon process which is responsible for virtualizing the node hardware and the operating system. The application processes' calls to NX communication routines are transformed into requests to the daemon process. As on a Paragon system, the application processes are clients which request some service from the operating system. In such an implementation, however, every system call would require an interprocess communication.

Figure 5: Processes and the distribution of the operating system on a VPN

To reduce the amount of interprocess communication we have introduced the following improvement. As not all system calls require the assistance of a centralized operating system, parts of the operating system's tasks have been migrated into the application processes. Figure 5 shows a virtual Paragon node with two application processes and their corresponding daemon. Operations which can be carried out without the assistance of a centralized operating system must be independent of each other and must not address common operating system structures or tables. NX send operations are an example, as they only require an address lookup of the destination and a transformation into interprocess communication calls. In contrast to that, changing the ptype of a process requires the daemon's assistance, as only one process on a virtual Paragon node is allowed to have a certain ptype. The daemon, as the central control instance which grants ptypes, can easily guarantee their uniqueness on a single virtual Paragon node.
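The following sketch illustrates this division of labour for the ptype example: the send path stays inside the application process, while a ptype change is forwarded to the daemon. The request and reply structures as well as the function name are invented for illustration and are not the actual NXLIB daemon protocol.

/* Illustrative only: invented request format for the daemon protocol. */
#include <unistd.h>

enum vpn_request { REQ_SET_PTYPE = 1 };

struct daemon_request {        /* sent from an application process     */
    int  kind;                 /* which service is requested           */
    long arg;                  /* e.g. the desired ptype               */
};

struct daemon_reply {
    int  status;               /* 0 = granted, -1 = refused            */
};

/* changing the ptype needs the daemon, which guarantees that a ptype
 * is unique among the application processes of one VPN                */
int vpn_set_ptype(int daemon_fd, long new_ptype)
{
    struct daemon_request req = { REQ_SET_PTYPE, 0 };
    struct daemon_reply   rep;

    req.arg = new_ptype;
    if (write(daemon_fd, &req, sizeof(req)) != (ssize_t)sizeof(req))
        return -1;
    if (read(daemon_fd, &rep, sizeof(rep)) != (ssize_t)sizeof(rep))
        return -1;
    return rep.status;
}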
The implementation of the VPN on the virtual service partition differs from the approach described above for the virtual compute partition. But as the tasks of the controlling process on the virtual service partition are different from those of the application processes on the virtual compute partition, this difference is no contradiction to a uniform implementation. In contrast to the application processes, where mainly computational work is done, the controlling process has the following jobs: starting the application, managing the processes, propagating signals, providing I/O facilities and terminating an application. If an implementation similar to the VPNs on the virtual compute partition had been chosen, frequent interprocess communication between the controlling process and its daemon would have been necessary. In addition, there is only one process on the virtual service partition, so a natural improvement is to join the controlling process and its daemon into one process. For applications which were linked with the -lnx linker option, the controlling process may also take part in the computation. This functionality is not affected by the decision to have only one process implement the VPN on the virtual service partition.

4.2 Implementing the layers of NXLIB

For the following discussion concerning the implementation of the layers of NXLIB, refer again to figure 4. To reduce the effort necessary to implement and test NXLIB, the decision was made to use a communication protocol which supports both local and remote communication. For that reason we have chosen TCP sockets, which also offer reliable point-to-point communication. Consequently, no additional code was necessary to achieve a reliable communication protocol. Nevertheless, the distinction between local and remote communication has been kept throughout the whole implementation: the local and the remote communication layer simply call the same basic communication functions. Exchanging the communication protocol in later versions is no problem, as only the calls in either the local or the remote layer have to be substituted. TCP sockets are addressed via a descriptor which is similar to a file descriptor.

The basis of the address conversion layer is a table where all necessary information about the processes is stored. Functions to add, delete, update and retrieve this information are provided by this layer. The address mapping information which is necessary to communicate between different processes can be extracted from the retrieved process descriptors. The reliable communication layer uses the calls of the address conversion layer to retrieve information about the destination process of a send or receive call. Based on this information, it issues the corresponding local or remote communication calls with the appropriate communication protocol addresses. For a further discussion of the implementation of the NX communication calls refer to section 4.3.

To handle incoming messages, the buffer management keeps a message table where they are stored. During a send call, a message type is associated with the message. The destination of the message is a process on a VPN with a dedicated ptype. A receive call on that VPN matches an incoming message if the current ptype of the process is the same as the one specified in the send call and if the message type is identical. Hence, the buffer management provides a set of calls to insert, retrieve and delete messages from the message table.
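As an illustration of these calls, the sketch below shows a simplified message table and the matching rule just described. The data structures are invented and much simpler than the actual NXLIB implementation; in particular there is no locking and no wakeup of waiting receives.

/* Illustrative only: a simplified message table as kept by the buffer
 * management layer.                                                   */
#include <stdlib.h>
#include <string.h>

struct message {
    long  type;                /* message type given in the send call  */
    long  dest_ptype;          /* ptype addressed by the sender        */
    char *data;                /* allocated dynamically on arrival     */
    long  len;
    struct message *next;
};

static struct message *msg_table;     /* unbounded, grows on demand    */

/* called when a message arrives: memory is allocated dynamically,
 * unlike the preallocated buffer areas on the Paragon                 */
void msg_insert(long type, long dest_ptype, const char *data, long len)
{
    struct message *m = malloc(sizeof(*m));

    m->type = type;
    m->dest_ptype = dest_ptype;
    m->data = malloc(len);
    memcpy(m->data, data, len);
    m->len = len;
    m->next = msg_table;
    msg_table = m;
}

/* a receive matches when the message type is identical and the message
 * was addressed to the current ptype of the receiving process          */
struct message *msg_find(long typesel, long current_ptype)
{
    struct message *m;

    for (m = msg_table; m != NULL; m = m->next)
        if (m->type == typesel && m->dest_ptype == current_ptype)
            return m;
    return NULL;
}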
Depending on whether or not a corresponding receive call was already placed, and on the type of the receive call, the Paragon OSF/1 communication layer invokes different actions. If a user-specified message handler has been installed with a previous hrecv call, the handler is invoked and the message is deleted from the table. If, on the other hand, a synchronous receive was placed before, the message is extracted and the call returns with it as a result. A previous call to an asynchronous irecv simply leaves the message in the table and marks it as received, so that later calls to probe functions can determine that the message is now available. In the case that no matching receive has been called at all, the message is inserted in the table until a later receive operation deletes it.

4.3 NX message-passing calls

Section 4.2 already explained the functionality of the different NXLIB layers. This section provides an overview of how these layers cooperate to simulate the Paragon message-passing calls on a network of workstations. The following topics will be addressed: first the basic concepts are presented, then the start of an application is described, and finally the address resolution protocol is explained.

4.3.1 Implementation concepts of NX message-passing calls

An important issue for message-passing programming libraries is the latency of the communication calls. To reduce the latency, it is desirable to use direct paths between communication partners. Every stage in an indirect scheme increases the latency, as additional calls are necessary until a message is sent. On the other hand, on most UNIX systems the descriptors which are available for open files and sockets are limited. A full interconnection of all application processes would therefore reduce the number of processes in an application drastically. Establishing and terminating a communication link between two processes for every communication call is not feasible either, as this would introduce much additional effort for every communication.

The basic assumption of our implementation is that typical parallel applications have a regular communication structure in the sense that certain processes regularly communicate with each other. Thus, two processes either are connected and use this communication path frequently during the computation, or they do not communicate at all. Consequently, communication paths need only be created for those processes that wish to communicate. As the communication structure of an application cannot be determined at start time, the interconnection of the processes can certainly not be done during the initialization of the application. So the communication paths between processes are set up on demand. Once established, a connection between two processes is kept until the application terminates.
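A sketch of this lazy connection setup is given below. The helper names, the fixed table size and the assumption of one application process per VPN are simplifications made for the example and are not taken from the NXLIB sources.

/* Illustrative only: connections are created on first use and cached. */
#define MAX_PEERS 256

static int conn_cache[MAX_PEERS];      /* socket per peer, -1 = none   */

void init_connections(void)
{
    int i;

    for (i = 0; i < MAX_PEERS; i++)
        conn_cache[i] = -1;            /* no connection yet            */
}

/* resolve a (node, ptype) address via the daemon and open a TCP
 * connection to the destination process; details omitted              */
extern int resolve_and_connect(long node, long ptype);

/* return a connected socket for the peer, creating it on first use;
 * once established, the connection is kept until the application ends */
int get_connection(long node, long ptype)
{
    int idx = (int)node;               /* simplified: one process/VPN  */

    if (conn_cache[idx] < 0)
        conn_cache[idx] = resolve_and_connect(node, ptype);

    return conn_cache[idx];
}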
Building up the connections on demand has the advantage that all interacting processes are fully interconnected, so communication latencies can be kept minimal for established communication links. And as only those processes are interconnected which need to communicate, more processes can participate in an application. The only drawback is that the first communication between two processes is more expensive than the following ones, because the connection has to be set up.

4.3.2 Start of an NXLIB application

As on a Paragon system, an application is started automatically if it was linked with the -nx linker option. If the -lnx switch was used, the programmer is responsible for calling the corresponding system calls in the controlling process. As the basic sequence of system calls is the same, the following describes only the -nx case. To start the application, the user simply types the name of the application at a command line prompt. The command is started as any conventional UNIX command and executes an nx_loadve call which initiates the creation of the application processes.

Figure 6: Starting an NXLIB application

The daemons on the remote machines are currently started via a standard Berkeley rsh command. They inherit the environment of the machine where the application was started. A prerequisite for starting the node program is that the binary on each workstation is located somewhere in the PATH environment variable of the machine where the controlling process is located.

4.3.3 Address resolution protocol

Concerning communication links with TCP sockets, the situation after the start of an application is as shown in figure 7: the daemon processes are connected to each other and the application processes are linked to their corresponding daemon.

Figure 7: Configuration after starting an NXLIB application

The address conversion layer within the daemons has information about all other daemons and the application processes of its associated VPN. The application processes, on the other hand, only have address information about their daemon. While an application is executing, further connections between application processes are created on demand when two application processes communicate for the first time.

If an application process tries to send a message to a VPN to which no connection exists, its address conversion layer cannot retrieve a process descriptor for this process. To get the information about the requested process, the address conversion layer contacts its daemon process with an ADR protocol unit. If the daemon can provide the requested information, it forwards it to the process with a DAA protocol unit. Otherwise the daemon contacts the daemon of the specified VPN with a DDR protocol unit. As this daemon is responsible for the VPN where the destination application process resides, the necessary addresses must be stored there; otherwise the application process does not yet exist and an error has occurred in the program. The daemon returns the addresses with a DDA protocol unit to the requesting daemon, which in turn updates its address conversion information and finally forwards the address to the application process with a DAA unit. In the last step the application process contacts the destination process with an AAR unit to establish a new socket connection.
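The protocol units involved in this exchange can be summarized as follows; the numeric values are invented, only the roles follow the description above.

/* Protocol units of the address resolution protocol as described in
 * the text; the encoding shown here is illustrative only.             */
enum nxlib_protocol_unit {
    ADR = 1,   /* application process asks its own daemon for an address */
    DAA = 2,   /* daemon answers an application process with the address */
    DDR = 3,   /* daemon asks the daemon of the destination VPN          */
    DDA = 4,   /* destination daemon answers the requesting daemon       */
    AAR = 5    /* application process contacts the destination process
                  to establish the new socket connection                 */
};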
The AAR step results in a new point-to-point connection between the two processes, which will be used for all further messages sent between the two application processes.

4.4 Implementing global operations

On a Paragon, global operations manipulate data which are distributed among the nodes of an application; for example, it is possible to calculate the sum of an array which is spread over the nodes. This requires the collection of data from every node. On a Paragon, algorithms using a minimal spanning tree communication structure are used to collect the data. The same implementation could be used for NXLIB, but as a network of workstations is coupled via an Ethernet bus, the messages are serialized anyway. Consequently, a similar optimization is not possible for global operations in NXLIB. The implementation of global calls in NXLIB uses a simpler approach which is not less efficient on a network of workstations. As the execution of a global operation synchronizes the application processes, the controlling process is used to collect, evaluate and distribute the result of a global operation. Therefore, all processes send a protocol unit via their daemon to the controlling process with the necessary parameters to carry out the operation. After that, the application processes wait until the answer from the controlling process arrives. The controlling process, on the other hand, collects the incoming protocol units from every VPN, then computes the requested global operation and finally forwards the result of the computation to all application processes.

4.5 Workstation-specific changes and restrictions

Although a network of coupled workstations basically has the same type of architecture as a multiprocessor system like the Paragon, there are differences which put several restrictions on the implementation of NXLIB. A short summary of these restrictions and changes is given in this section. The compiler and linker on a Paragon system use special switches (-nx or -lnx) to create parallel applications. Compilers and linkers on workstations do not have an equivalent switch. To support an easy-to-use compilation system for Paragon applications which should run with NXLIB, two special shell scripts have been provided. One to compile and link C