The Use of Multithreaded Processors in DASH
True/False review questions (computer basics):
1. The computer keyboard has exactly the same layout as the typewriter keyboard. (F)
2. To enter special computer-related codes, you may use some additional keys. (T)
3. We must use storage hardware to store computer instructions and data; otherwise they will be lost when the power is turned off. (T)
4. Office filing systems store data as electromagnetic signals or laser-etched spots. (F)
5. The processing hardware is mainly made up of the CPU and memory. (T)
6. The design of the CPU determines whether you can run simple or sophisticated software. (F)
7. The more sophisticated a software program is, the more instructions it contains. (T)
8. If you have a large memory in your computer, you will be able to work with and process a great amount of data and information at one time. (T)
Software Development Multiple-Choice Questions (senior high English, 50 items; excerpt):

1. The main language used in software development is _____.
A. Python  B. Java  C. C++  D. All of the above
Answer: D. Python, Java, and C++ are all commonly used programming languages in software development, so the answer is "all of the above".

2. Which one is not a software development tool?
A. Visual Studio  B. IntelliJ IDEA  C. Photoshop  D. Eclipse
Answer: C. Photoshop is image-editing software, not a software development tool. Visual Studio, IntelliJ IDEA, and Eclipse are all commonly used integrated development environments.

3. The process of finding and fixing bugs in software is called _____.
A. debugging  B. coding  C. testing  D. designing
Answer: A. Debugging means finding and fixing the errors in software; coding, testing, and designing refer to writing code, verifying behavior, and planning the software, respectively.

4. A set of instructions that a computer follows is called a _____.
A. program  B. algorithm  C. data structure  D. variable
Answer: A. A program is a set of instructions that a computer follows; an algorithm is a step-by-step procedure, a data structure organizes data, and a variable holds a value.

5. Which programming paradigm emphasizes objects and classes?
A. Procedural programming  B. Functional programming  C. Object-oriented programming  D. Logic programming
Answer: C.
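As a purely illustrative addition (not part of the original quiz), the paradigm named in the answer to question 5 can be shown in a few lines of Python, where data and behavior are bundled together in a class:

```python
# Illustrative only: a tiny class in the object-oriented style referenced in
# question 5. The Developer class and its fields are invented for this example.
class Developer:
    def __init__(self, name, language):
        self.name = name          # per-object state
        self.language = language

    def greet(self):
        # behavior bundled with the data it operates on
        return f"{self.name} writes {self.language}"

print(Developer("Ada", "Python").greet())  # -> "Ada writes Python"
```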
Multi-threading in PTC Creo Parametric

Multi-processor support in PTC Creo Parametric

Today, multi-core and multi-processor workstations are popular. PTC has been enhancing PTC Creo Parametric to leverage these technologies to improve application performance and the user experience. This whitepaper reviews how PTC Creo utilizes multiple processors today and provides insight into the roadmap for future multi-core and multi-processor support. For the purposes of this whitepaper, the term "multi-processor" includes both multiple processors and multiple cores on a single processor.

Some common misconceptions about these technologies and their application to PTC Creo are:
• Multi-threading is the only means of leveraging multiple processors.
• Performance benefits can be achieved by simply multi-threading PTC Creo.
• Multi-threading does not provide benefits on single-processor hardware.
• Multi-threading only provides performance benefits.

This whitepaper will address these and other common misconceptions and will provide information on:
• The benefits of multi-processor hardware
• The differences and similarities between multi-threading and multi-processing
• How PTC Creo applications leverage and benefit from these technologies

Multiple Processor Usage

With a multi-processor workstation, the Operating System (OS) has multiple processors available for use. If an application is multi-threaded, or several applications are running simultaneously, the OS can use several processors at once. Many factors determine whether multiple processors are actually used. For example:
• Does the application need to synchronize its different threads frequently?
• Are different applications connected such that they need frequent synchronization or need to "wait" for one another?

If the overhead of running on multiple processors is high, it may be more efficient to run the threads or applications on the same processor. The OS typically decides on the right strategy to use.

Symmetric Multi-threading

In symmetric multi-threading, the application divides a task so that several processors can execute the sub-tasks simultaneously. For example, the ray-tracing algorithm in PTC Creo Parametric's Advanced Rendering Extension splits the rendered image into finite pieces; the application computes the result for each piece individually and then merges them together.

Asymmetric Multi-threading

While symmetric multi-threading is efficient in many cases, not all tasks or functional areas are suitable for performance-related multi-threading. Some tasks are sequential in nature, meaning the sub-tasks must execute in a specific order. Executing these sub-tasks in parallel may not be efficient, or even possible, if one sub-task depends on the completion of another. One example is regeneration: each model or feature must regenerate in a specific order to preserve the user's design intent. Multi-threading these tasks is not only very difficult but may have a negative performance impact if the threads need to be synchronized frequently.

In an application like PTC Creo Parametric that depends upon robust regeneration, multi-threading can instead be utilized to improve the user experience by improving the accessibility of commands and the responsiveness of one or more applications. For example, Pro/ENGINEER Wildfire's retrieval thread is regularly interrupted (internally) to check whether the user wishes to interrupt the retrieval by pushing the "stop sign". Since the stop sign shares the same thread as retrieval, it is often unresponsive. PTC Creo uses multi-threading to keep the stop sign available during assembly retrieval, vastly improving the accessibility of this function. This is an example of asymmetric threading.

Running a "foreign" application inside another application is another form of asymmetric threading. An example of this is the ability to run the embedded browser in a separate thread inside PTC Creo. As a result, even when the browser thread is busy during a database check-in, PTC Creo remains responsive.

Multi-processing

While an application can run multi-threaded inside another application, it can be more beneficial to run the two applications as separate yet connected processes. The PTC Creo Parametric Mathcad integration uses this approach: PTC Creo Parametric and Mathcad run as separate processes.

In certain cases, it can be more beneficial to start several processes or instances of the same application rather than split a process into two or more threads. For instance, rather than have several threads computing similar but disconnected tasks inside one instance of PTC Creo, a utility can start several instances of PTC Creo and then merge the results. An example of this approach is the multi-objective design study capability within behavioral modeling (BMX), which utilizes distributed computing (dBATCH) to accomplish this.

Summary of functional areas in PTC Creo 2.0 leveraging multiple processors

Architecture (Multi-threading - Responsiveness/Performance)
PTC is restructuring PTC Creo into a set of services managed by the PTC Creo Agent to more effectively leverage multi-threading and multi-processor technologies and to improve performance and responsiveness. In PTC Creo 2.0, the PTC Creo Agent executes the Learning Connector, Windchill SocialLink, and Exit Logger services on separate threads.

Photo-realistic Rendering in the Advanced Rendering Extension (ARX) (Multi-threading - Performance)
Photo rendering is ideally suited to performance-related multi-threading. Ray tracing, which is the bulk of rendering, is broken down into many individual tasks via symmetric multi-threading, and the results are merged at the end.

Simulation using PTC Creo Simulate (Multi-threading - Performance)
Simulation is a compute-intensive task and hence a key area of focus for leveraging this technology. The solver in PTC Creo Simulate 2.0 is highly multi-threaded and scales up to 64 threads. PTC Creo 1.0 and 2.0 users can run their PTC Creo Simulate analyses on remote servers via the dBatch service. PTC Creo 2.0 loads large results files on a background thread, providing up to a 40% improvement in the elapsed time required to load results. Research is ongoing in other areas of PTC Creo Simulate where multi-threading/multi-processing can improve performance.

Assembly Retrieval (Multi-threading - Responsiveness/Performance)
Assembly retrieval is sequential in nature; each part is retrieved, placed, and evaluated one after another. PTC Creo's new Light Weight Graphics Technology sharply reduces the time to load an assembly, improving overall performance. PTC Creo uses a separate thread to retrieve the graphics data, speeding up the retrieval process. It also uses a separate thread for rendering the graphics, providing a continuous buildup of the assembly display and improving overall retrieval time. Beyond its direct effect on the performance of assembly retrieval, the new PTC Creo "Open Subset" capability builds on the ability to retrieve graphics in a separate thread, allowing users to select a predefined subset of an assembly in the graphics preview and retrieve it at runtime. Finally, PTC Creo provides improved command accessibility during retrieval and regeneration by executing certain tasks, such as the "stop sign", in separate threads.

Regeneration (Multi-threading - Responsiveness/Performance)
Assembly or part regeneration propagates changes, according to parametric relationships, to the affected members of an assembly. It is a crucial step in ensuring that a design is "up to date". PTC is researching ways to speed up regeneration; in PTC Creo 2.0, surface intersections have been multi-threaded to enhance regeneration speed. However, regeneration is still largely a sequential process: the regeneration algorithm must honor the design intent by evaluating features and cross-model dependencies in the order indicated by the user.

Graphics (Multi-threading - Responsiveness/Performance)
PTC Creo 2.0 now leverages the Graphics Processing Unit (GPU) for the enhanced-realism display mode and for transparency support. As a result, the enhanced-realism display mode is up to 30 times faster in PTC Creo Parametric 2.0 than in Pro/ENGINEER Wildfire 5.0, and GPU-accelerated transparency in PTC Creo Parametric 2.0 is up to 8 times faster than blended transparency in Pro/ENGINEER Wildfire 5.0. Additionally, tessellation in PTC Creo is now multi-threaded to improve rendering performance. Load time and display of the preview graphics in the File/Open and Appearances dialogs in PTC Creo are executed in a separate thread to enhance responsiveness.

Creo Embedded Browser (Multi-threading - Performance)
In PTC Creo 2.0, the embedded browser runs as a separate process by default. This enables users to continue working in PTC Creo while browser-intensive tasks like check-in or download run separately. In future releases, the PTC Creo Agent will manage the embedded browser to further improve user-interface responsiveness.

Behavioral Modeling (Multi-processing - Performance)
By distributing the load via distributed computing, multi-objective design studies within BMX can benefit greatly from multi-processing. dBatch starts new instances of PTC Creo, which run on processors on the local workstation or on other connected workstations.
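To illustrate the split-compute-merge pattern the whitepaper describes for ray tracing, here is a generic Python sketch (an addition for illustration, not PTC code; the render_piece function and chunking scheme are invented stand-ins):

```python
# A generic sketch of symmetric multi-threading: split a task into pieces,
# compute each piece on its own thread, merge the results in order. This
# mirrors the pattern described for ARX ray tracing; it is not PTC code.
# Note: in CPython the GIL limits parallelism for pure-Python work, but the
# split/compute/merge structure is what this example demonstrates.
from concurrent.futures import ThreadPoolExecutor

def render_piece(piece):
    # Stand-in for the real per-piece computation (e.g., ray-tracing a tile).
    return [x * x for x in piece]

def render(scene, workers=4):
    # Split the "image" into finite pieces, roughly one per worker.
    size = max(1, len(scene) // workers)
    pieces = [scene[i:i + size] for i in range(0, len(scene), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(render_piece, pieces)  # preserves piece order
    # Merge the per-piece results back together.
    return [pixel for piece in results for pixel in piece]

print(render(list(range(16))))
```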
Introduction to Computers (2nd Edition): Exercise Answer Reference

Chapter 1: Introduction to Computers

1. Define a computer and discuss its attributes.
A computer is an electronic device capable of performing various operations and processes based on a set of instructions. Its attributes include the ability to input, process, store, and output information, as well as the capability to execute complex calculations and perform tasks.

2. Differentiate between hardware and software.
Hardware refers to the physical components of a computer system, including the central processing unit (CPU), memory, storage devices, input/output devices, and peripherals. Software, on the other hand, represents the non-tangible parts of a computer system, such as the programs and data that are stored and executed by the hardware.

3. Explain the concept of data representation and discuss different numbering systems used in computer systems.
Data representation refers to the way data is stored and processed by a computer. Common numbering systems include the binary system (base 2), decimal system (base 10), octal system (base 8), and hexadecimal system (base 16). Each system has its own set of symbols and rules for representing numbers and characters.

Chapter 2: Computer Hardware

1. Discuss the major components of a computer system.
A computer system consists of several major components: the central processing unit (CPU), memory, storage devices, input/output devices, and peripherals. The CPU executes instructions and performs calculations, while memory stores data and instructions temporarily. Storage devices are used for long-term data storage, and input/output devices allow users to interact with the computer system.

2. Describe the functions and characteristics of the CPU.
The CPU is the central processing unit of a computer system and is responsible for executing instructions and performing calculations. It consists of two main components: the control unit, which manages the execution of instructions, and the arithmetic logic unit (ALU), which performs calculations and logical operations. The CPU's performance is determined by factors such as clock speed, cache size, and number of cores.

3. Explain the different types of memory in a computer system.
A computer system typically has two main types of memory: primary memory (RAM) and secondary memory (storage devices). RAM, or random access memory, is used for temporary data storage and is volatile, meaning its contents are lost when the power is turned off. Secondary memory, such as hard disk drives and solid-state drives, provides long-term storage for data even when the power is off.

Chapter 3: Operating Systems

1. Define an operating system and discuss its functions.
An operating system is software that manages computer hardware and software resources. Its functions include providing a user interface, managing memory and storage, coordinating the execution of applications, handling input/output operations, and ensuring system security and stability.

2. Explain the difference between a single-user and a multi-user operating system.
A single-user operating system is designed to be used by one user at a time; it provides a user interface and manages the computer's resources for that sole user. A multi-user operating system, on the other hand, allows multiple users to access the system simultaneously, sharing resources and executing their own programs concurrently.

3. Discuss the concept of virtualization and its advantages.
Virtualization is the process of creating a virtual version of a computer system or its resources. It allows multiple operating systems to run on a single physical machine, enabling better resource utilization, cost savings, and improved flexibility. Virtualization also provides isolation between different virtual machines, enhancing security and system stability.

In conclusion, this article provides a brief overview of the topics covered in the second edition of "Introduction to Computers". It includes explanations and answers to selected exercises, helping readers understand the fundamental concepts of computer science and technology. By studying these topics, readers can gain a strong foundation in computer knowledge and skills.
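As a small added illustration of the numbering systems mentioned in Chapter 1 (this snippet is not part of the textbook's answers), Python can print one value in all four bases:

```python
# The same value rendered in the four numbering systems discussed above.
n = 202
print(bin(n))   # binary (base 2):       0b11001010
print(oct(n))   # octal (base 8):        0o312
print(n)        # decimal (base 10):     202
print(hex(n))   # hexadecimal (base 16): 0xca
```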
Glossary of Terms

Operating system: An operating system is a program that manages the computer hardware. It is the one program running at all times on the computer (usually called the kernel), with everything else being system programs and application programs.

Multiprogramming: Multiprogramming is one of the most important aspects of operating systems. It increases CPU utilization by organizing jobs (code and data) so that the CPU always has one to execute.

Batch system: A batch system is one in which jobs are bundled together with the instructions necessary to allow them to be processed without intervention, improving system efficiency.
PySide2 QThread examples

1. We can use QThread to implement multithreading for task processing.
2. In PySide2, the QThread class provides a convenient way to create and manage threads.
3. QThread allows us to execute multiple tasks simultaneously in an application, avoiding blocking the main thread.
4. When performing time-consuming operations, QThread can improve the responsiveness of the program.
5. Custom thread classes can be created by inheriting from the QThread class.
6. The start() method of QThread is used to start the thread.
7. After calling the start() method of QThread, the thread will start executing the code in its run() method.
8. The finished signal of QThread can be used to capture the event of the thread completing execution.
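A minimal runnable sketch tying points 5 through 8 together (the Worker class, its loop, and the sleep call are invented for illustration; only QThread, its start() method, the run() override, and the finished signal come from the notes above):

```python
# Minimal PySide2 QThread example: subclass QThread, override run(),
# call start(), and connect the finished signal.
import sys
from PySide2.QtCore import QThread, QCoreApplication


class Worker(QThread):
    def run(self):
        # Code placed in run() executes on the new thread after start().
        for i in range(3):
            print(f"working... step {i}")
            self.sleep(1)  # simulate a time-consuming operation


if __name__ == "__main__":
    app = QCoreApplication(sys.argv)
    worker = Worker()
    # finished is emitted when run() returns; here it also quits the event loop.
    worker.finished.connect(app.quit)
    worker.start()  # begins executing run() on a separate thread
    sys.exit(app.exec_())
```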
Operating Systems Midterm Exam 40 (selected questions and answers)

Q: Provide two programming examples in which multithreading does not provide better performance than a single-threaded solution.
Answer: (1) Any kind of sequential program is not a good candidate to be threaded. An example of this is a program that calculates an individual tax return. (2) Another example is a "shell" program such as the C-shell or Korn shell. Such a program must closely monitor its own working space, such as open files, environment variables, and the current working directory.

Q: Describe the actions taken by a thread library to context-switch between user-level threads.
Answer: Context switching between user threads is quite similar to switching between kernel threads, although it is dependent on the threads library and how it maps user threads to kernel threads. In general, context switching between user threads involves taking a user thread off its LWP and replacing it with another thread. This act typically involves saving and restoring the state of the registers.

Q: Under what circumstances does a multithreaded solution using multiple kernel threads provide better performance than a single-threaded solution on a single-processor system?
Answer: When a kernel thread suffers a page fault, another kernel thread can be switched in to use the interleaving time in a useful manner. A single-threaded process, on the other hand, will not be capable of performing useful work when a page fault takes place. Therefore, in scenarios where a program might suffer from frequent page faults or has to wait for other system events, a multithreaded solution performs better even on a single-processor system.

Q: Which of the following components of program state are shared across threads in a multithreaded process?
a. Register values
b. Heap memory
c. Global variables
d. Stack memory
Answer: The threads of a multithreaded process share heap memory and global variables. Each thread has its own set of register values and its own stack.

Q: Can a multithreaded solution using multiple user-level threads achieve better performance on a multiprocessor system than on a single-processor system?
Answer: A multithreaded program comprising multiple user-level threads cannot make use of the different processors in a multiprocessor system simultaneously. The operating system sees only a single process and will not schedule the different threads of the process on separate processors. Consequently, there is no performance benefit associated with executing multiple user-level threads on a multiprocessor system.

Q: As described in Section , Linux does not distinguish between processes and threads. Instead, Linux treats both in the same way, allowing a task to be more akin to a process or a thread depending on the set of flags passed to the clone() system call. However, many operating systems, such as Windows XP and Solaris, treat processes and threads differently. Typically, such systems use a notation wherein the data structure for a process contains pointers to the separate threads belonging to the process. Contrast these two approaches for modeling processes and threads within the kernel.
Answer: On one hand, in systems where processes and threads are considered similar entities, some of the operating-system code can be simplified. A scheduler, for instance, can consider the different processes and threads on an equal footing, without requiring special code to examine the threads associated with a process during every scheduling step. On the other hand, this uniformity can make it harder to impose process-wide resource constraints in a direct manner; some extra complexity is required to identify which threads correspond to which process and to perform the relevant accounting tasks.

Q: The program shown in Figure uses the Pthreads API. What would be the output from the program at LINE C and LINE P?
Answer: Output at LINE C is 5. Output at LINE P is 0.

Q: Consider a multiprocessor system and a multithreaded program written using the many-to-many threading model. Let the number of user-level threads in the program be greater than the number of processors in the system. Discuss the performance implications of the following scenarios.
a. The number of kernel threads allocated to the program is less than the number of processors.
b. The number of kernel threads allocated to the program is equal to the number of processors.
c. The number of kernel threads allocated to the program is greater than the number of processors but less than the number of user-level threads.
Answer: When the number of kernel threads is less than the number of processors, some of the processors remain idle, since the scheduler maps only kernel threads to processors, not user-level threads. When the number of kernel threads is exactly equal to the number of processors, it is possible for all of the processors to be utilized simultaneously; however, when a kernel thread blocks inside the kernel (due to a page fault or while invoking system calls), the corresponding processor remains idle. When there are more kernel threads than processors, a blocked kernel thread can be swapped out in favor of another kernel thread that is ready to execute, thereby increasing the utilization of the multiprocessor system.

Q: Write a multithreaded Java, Pthreads, or Win32 program that outputs prime numbers. This program should work as follows: the user will run the program and enter a number on the command line. The program will then create a separate thread that outputs all the prime numbers less than or equal to the number entered by the user.
Answer: Please refer to the supporting Web site for the source-code solution.

Q: Modify the socket-based date server (Figure in Chapter 3) so that the server services each client request in a separate thread.
Answer: Please refer to the supporting Web site for the source-code solution.

Q: The Fibonacci sequence is the series of numbers 0, 1, 1, 2, 3, 5, 8, .... Formally, it can be expressed as fib(0) = 0, fib(1) = 1, and fib(n) = fib(n-1) + fib(n-2). Write a multithreaded program that generates the Fibonacci series using either the Java, Pthreads, or Win32 thread library. This program should work as follows: the user will enter on the command line the number of Fibonacci numbers that the program is to generate. The program will then create a separate thread that generates the Fibonacci numbers, placing the sequence in data that is shared by the threads (an array is probably the most convenient data structure). When the thread finishes execution, the parent thread will output the sequence generated by the child thread. Because the parent thread cannot begin outputting the Fibonacci sequence until the child thread finishes, the parent thread must wait for the child thread to finish, using the techniques described in Section .
Answer: Please refer to the supporting Web site for the source-code solution.

Q: Exercise in Chapter 3 specifies designing an echo server using the Java threading API. However, this server is single-threaded, meaning the server cannot respond to concurrent echo clients until the current client exits. Modify the solution so that the echo server services each client in a separate thread.
Answer: Please refer to the supporting Web site for the source-code solution.
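Since the book's solutions are only referenced, here is a minimal sketch of the Fibonacci exercise (the original asks for Java, Pthreads, or Win32; Python's threading module is substituted purely for illustration):

```python
# Sketch of the Fibonacci exercise above: the child thread fills a shared
# list, and the parent join()s it before printing, i.e., the "wait for the
# child to finish" technique the exercise refers to.
import sys
import threading

fib = []  # sequence shared between parent and child threads


def generate(n):
    # Generate the first n Fibonacci numbers into the shared list.
    a, b = 0, 1
    for _ in range(n):
        fib.append(a)
        a, b = b, a + b


if __name__ == "__main__":
    count = int(sys.argv[1])  # how many Fibonacci numbers to generate
    child = threading.Thread(target=generate, args=(count,))
    child.start()
    child.join()  # parent waits; it must not print before the child is done
    print(" ".join(str(x) for x in fib))
```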
The Use of Multithreaded Processors in DASH

Sugath Warnakulasuriya
email: sugath@

Abstract

DASH is a scalable shared-memory multiprocessor architecture which employs directory-based cache coherence. Due to the physical distribution of the memory in DASH, there is potential for long memory latency. Although a number of latency-reducing and latency-hiding techniques such as caching, a weaker memory consistency model, and pre-fetching are employed by DASH, the use of multithreaded processors is not currently included in the design. In this paper, the use of multithreaded processors in the DASH architecture is examined. A processor architecture which incorporates multithreading with superscalar capabilities is proposed, and a general discussion of issues regarding multithreading, as well as the incorporation of the proposed multithreaded processor into the existing DASH architecture, is presented.

1 - Introduction

DASH is a scalable shared-memory multiprocessor architecture which employs directory-based cache coherence. The memory is physically distributed among a potentially large number of processing nodes, leading to a NUMA (Non-Uniform Memory Access) model with relatively long latency for access to remote memory. Although DASH provides local caches and encourages the exploitation of data locality to minimize the need for remote memory access, the average memory access time is still significantly impacted by the long latencies associated with remote memory access. Although DASH supports a number of latency-reducing and latency-hiding techniques (including a weaker memory consistency model, caching, and pre-fetching), the use of multithreaded processors is not currently included in the design. In this paper, we explore the issues relating to the use of multithreaded processors in the DASH architecture.

An overview of the DASH architecture and its features is presented in Section 2. A general discussion of issues in multithreading is presented in Section 3. An architectural framework for a multithreaded superscalar processor is proposed in Section 4, followed by a brief discussion of the issues involved in incorporating the proposed processor into the existing DASH architecture (in Section 5).

2 - Overview of the DASH Architecture

DASH is a MIMD shared-memory multiprocessor architecture intended to be used for a broad range of applications. The architecture preserves the programmability advantages of a single address space while maintaining scalability, an attribute traditionally associated with distributed-memory systems ("multicomputers"). The DASH multiprocessor was designed to provide scalable memory bandwidth with limited overhead and cost, while reducing or hiding the latency associated with large memory systems.

DASH presents a single address space which encompasses all distributed shared memory. By doing so, it provides a simpler programming model than that provided in a distributed-memory system (where a programmer must structure the data to account for the distributed memory architecture, and must explicitly support the communication of shared data). A single address space also provides for easier program/data partitioning and dynamic load distribution, whereas in a message-passing system, tasks and data must be explicitly migrated.

The DASH architecture supports a general, scalable network topology. Each node in the network is a "cluster", with each cluster itself being a small-scale bus-based multiprocessor. In the proposed DASH configuration, each cluster is made of 4 processors.
Processors in a cluster share a single memory module, with the memory modules of all clusters making up the global memory. All processors in a cluster share a single directory and network interface logic, allowing the high cost of these specialized components to be amortized over multiple processors. Figure 1 presents a high-level view of the DASH architecture.

Figure 1 - Overview of DASH Architecture

A 64-processor prototype of DASH has been built to demonstrate the feasibility of, and to identify the limitations of, the DASH architecture. This prototype consists of 16 clusters, with each cluster containing 4 processors. The estimated peak performance of this 64-processor system is 1.6 GIPS (and 600 MFLOPS). The prototype has been implemented using Silicon Graphics 4D/340 processors and customized directory and controller hardware, and is based on a pair of wormhole-routed mesh networks (to support independent request and reply networks).

Traditionally, multiprocessor cache coherence is achieved using snoopy protocols. In these schemes, each processing node monitors all requests to memory and independently determines the state of its cache. Such multiprocessors are usually bus-based and employ broadcast messages to distribute state updates to memory (broadcasts of invalidation and/or update messages). However, such bus-based schemes are not scalable, since the fixed bus bandwidth is shared by all processors. Also, broadcasting memory requests to every cache may saturate the system, thereby making these schemes inappropriate for large-scale multiprocessors. Note, however, that a snoopy protocol is used for intra-cluster cache coherence in DASH.

To overcome these limitations, DASH supports general network interconnects and uses directories for cache coherence. Cache coherence is maintained using dedicated hardware (as opposed to software) in order to achieve better performance; this removes the burden of cache coherence from software. Cache coherence in DASH is maintained using point-to-point messages instead of broadcast messages.

Directories in DASH are partitioned and distributed, thereby eliminating the bottleneck problems associated with centralized directory schemes. This too contributes to the scalability of DASH. DASH directories use bit vectors of "presence bits" to maintain coherence. Although the hardware and memory overhead of directory-based coherence is a potential problem for machines with a large number of processors, many schemes which limit this overhead to a small fraction of total system cost have been proposed (limited-pointer directories, etc.). These schemes are beyond the scope of this paper and will not be discussed further.

The DASH protocol is based on the use of point-to-point messages. The protocol uses message forwarding to achieve a highly efficient, non-blocking scheme. It supports the automatic serialization of remote memory requests (serialization is performed by the cluster which currently owns a particular block). Also, acknowledgments are used to detect the "global completion" of memory operations. Memory requests originating from processors in the same cluster are merged, thereby making the intra-cluster cache coherence activities transparent to the outside.

By physically distributing memory throughout the system, DASH presents a NUMA (Non-Uniform Memory Access) model. Although all memory is globally accessible, some memory is more costly to access than other memory.
Conceptually, DASH presents a memory access hierarchy of 5 levels, with each lower level being more costly to access (the levels and their associated latencies are presented in Figure 2). This hierarchy allows the principle of locality to be exploited by maintaining data "closer" to where it is actually used, thereby reducing the average latency of memory accesses and reducing the bandwidth requirements of the global interconnect.

Figure 2 - DASH Memory Access Levels

The "memory access hierarchy" in DASH consists of the following levels:
• Processor Level - the desired data is found in the local processor cache (requires 1 processor clock for access)
• Local Cluster Level - the desired data is found in another processor's cache within the local cluster (requires 30 processor clocks for access)
• Directory Home Level - an up-to-date copy of the desired data is found at its "home", using the directory and main memory associated with the given address (requires 100 processor clocks for access)
• Remote Cluster Level - the desired data is found in a processor cache of a remote cluster (requires 135 processor clocks for access)

DASH employs both memory-latency-reducing techniques (decreasing the time required to access memory) and latency-hiding techniques (overlapping memory latency with additional computation). The primary latency-reducing technique is the use of local caches, which are especially effective for applications that display a great degree of locality. Also, the multi-level memory access hierarchy in DASH allows a further reduction in latency by pre-fetching data whose use can be anticipated, and which can therefore be brought closer.

The nature of the DASH cache coherency protocol also leads to a reduction in latency. This is due to the minimization of the number of nodes that must be accessed to satisfy a memory request, and the use of a non-blocking message forwarding scheme.

The latency-hiding techniques employed by DASH include support for data pre-fetching and the use of a weak memory consistency model. The pre-fetch operation supported is a non-binding, software-controlled operation, intended to be explicitly issued by a processor. It allows compilers/applications to aggressively pre-fetch values when data usage can be anticipated.

DASH also uses release consistency, which helps hide the latency of write operations (however, it does not help hide the latency of read misses). To support release consistency, a write buffer is used to hold all writes issued by a processor. Also, "fence" operations (which are explicit stall commands to the write buffer and/or processor) are supported in DASH. This allows the compiler/programmer to emulate any memory consistency model.

A latency-hiding technique not employed by DASH is the use of multithreaded processors. Multithreading is the overlapping of communication (to access remote memory) and useful computation, by utilizing an otherwise idle processor to execute another task while a long-latency memory operation is in progress. Note that this concept is analogous to the context switching caused by page faults in virtual memory systems. The need to build a working prototype, along with the lack of commercially available multithreaded processors, led to the omission of this promising latency-hiding technique from DASH. The remainder of this paper will discuss multithreading and its incorporation into DASH.

There are a number of other interesting features in the DASH architecture.
A description of these, as well as a more detailed description of the overall architecture and the coherence protocol, can be found in [6] and [7].

3 - Issues in Multithreading

The following subsections present a discussion of:
• The issues which have motivated the use of multithreaded processors
• The various choices involved in the design of multithreaded processors
• The tradeoffs associated with the use of multithreaded processors

This discussion is intended to lay the groundwork for the justification of the multithreaded processor proposed in Section 4.

3.1 - Latency hiding through increased resource utilization

Even with the use of caches and other methods such as pre-fetching to exploit locality in applications, it is unrealistic to expect that all references to remote memory can be eliminated. This has been sufficiently demonstrated in attempts to solve analogous problems such as cache block and virtual memory page placement/replacement. Furthermore, in multiprocessor applications, there is a great likelihood of processes residing on independent processors needing to work together to solve a single problem, hence requiring inter-process communication (IPC) and/or synchronization using shared variables. This demands that remote memory latency be further reduced or eliminated in order to maximize program performance.

For example, based on the DASH memory latencies presented in Figure 2, and assuming conservative "hit ratios" of 85% for the Processor Level, 8% for the Local Cluster Level, 5% for the Directory Home Level, and 2% for the Remote Cluster Level, an average memory latency of 10.95 cycles is obtained. This can be a large performance penalty to pay, especially for memory-intensive programs. During the execution of such applications, the processor will be idle for a substantial percentage of the time (while remote memory is accessed). This under-utilization of the processor in single-threaded systems is illustrated in Figure 3a.

Figure 3a - Under-utilization of a single-threaded processor

Assuming that all other means (both static and dynamic) of reducing and hiding latency have been attempted, methods to utilize the idle processor resources must now be explored. From this arises the motivation for multithreading: the utilization of the idle processor cycles during the memory operation of one thread to execute instructions of other threads. A simple example of increased processor utilization through multithreading is illustrated in Figure 3b.

Figure 3b - Increased processor utilization due to multithreading
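The 10.95-cycle figure is simply the hit-ratio-weighted sum of the Figure 2 latencies; the following few lines of Python (added here only to make the arithmetic explicit) reproduce it:

```python
# Expected memory latency as the hit-ratio-weighted sum of the DASH access
# levels (latencies in processor clocks from Figure 2; hit ratios as assumed
# in the text above).
levels = {
    "processor":      (0.85, 1),
    "local cluster":  (0.08, 30),
    "directory home": (0.05, 100),
    "remote cluster": (0.02, 135),
}
avg = sum(p * cycles for p, cycles in levels.values())
print(avg)  # 10.95 cycles
```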
3.2 - Context Switching Policies

Assuming that multithreading is to be used for hiding memory latency, a context switching policy must be established (such a policy dictates the conditions under which the processor switches from executing one thread to another). Although a number of context switch policies are described by Hwang in [4], only the following two policies are considered as candidates for the multithreaded processor proposed in the next section:
• Switch on cache miss: this policy calls for a thread to relinquish the processor when a cache miss (or a similar long-latency operation) is encountered. This policy is generally referred to as a "blocked" scheme [5]. The April multithreaded processor used in the Alewife system utilizes this type of policy [2].
• Switch on every cycle: this policy calls for a thread to relinquish the processor after each cycle, regardless of the type of instruction encountered. Instructions of multiple threads are interleaved and "serviced" in round-robin fashion. This policy is generally referred to as an "interleaved" scheme [5]. The HEP [12] and Tera [4] architectures utilize this type of policy.

According to studies by Laudon et al., cycle-by-cycle switching can yield better performance than comparable "blocked" schemes [5]. It has been concluded that the complexity of implementing cycle-by-cycle context switching is not overwhelming, and that as pipelines get deeper (and operate at a lower percentage of peak performance), the performance advantages of cycle-by-cycle switching can justify the additional complexity. In [5], it is also shown that the relative advantage of the interleaved scheme is greater for deeper pipelines, and that the performance increases with an increased number of threads. A more precise model of this behavior, along with a discussion of "linear" and "saturated" regions, is presented in [4].

3.3 - Additional Benefits of Multithreading

In addition to helping hide latency, multithreading can also help remove pipeline dependencies. Generally, it can be assumed that instructions in different threads are independent. So, in a cycle-by-cycle context switching scheme employing pipelined units, pipeline stalls due to dependencies are all but eliminated, assuming the number of concurrent threads matches the number of pipeline stages. A simple example of this, assuming four pipeline stages and four threads, is illustrated in Figure 4. Note that instructions I1 and I2 of thread T1 are dependent, and cannot be executed in successive pipeline steps in a single-threaded architecture. However, due to the interleaving of the instructions from the four independent threads, the dependency between instructions I1 and I2 of thread T1 has disappeared by the time I2 enters the pipeline.

Figure 4 - Elimination of pipeline stalls due to multithreading

An example of an actual implementation of such a scheme is the HEP processor, which, given 8 active threads, issues one instruction from each thread every 8 cycles. Since there are 8 pipeline stages in the HEP processor, the previous instruction of a thread completes execution prior to the issuance of a new instruction from the same thread, thereby eliminating pipeline stalls due to dependencies. This type of scheme reduces and/or eliminates the need for the specialized hardware and/or compiler resources required to resolve pipeline hazards.

Another important aspect of multithreading is that both hiding memory latency and reducing/eliminating pipeline stalls can be achieved with no impact on application software. Multithreading is completely transparent to the programmer and comes at no additional programming cost.

3.4 - Context Switching Costs

Generally, the cost of context switching should be small enough that it does not outweigh the benefits of multithreading. Intuitively, it can be assumed that more efficient and more complex context switching mechanisms are required to support the more aggressive context switching policies in order to maintain the benefits of multithreading. For example, the Tera architecture [4] (which calls for aggressive cycle-by-cycle context switching) provides highly efficient hardware context switching using 128 separate sets of registers (and other hardware) to maintain the states of the individual threads.
An earlier architecture, the HEP [12], uses a slightly different scheme, in which a set of 1024 general-purpose registers (referred to as "register memory") is shared by all threads in the system. Hardware is used to maintain the set of registers in the register memory which belongs to a particular thread (using base and limit registers) in order to protect the thread's state.

Although complex hardware is required for instruction issue and thread management in a cycle-by-cycle context switching scheme, the runtime context switch cost should be zero. In other words, the fact that successive instructions entering a pipeline may belong to multiple threads should be transparent to the pipeline, and it should have no impact on the delay in propagating instructions from one pipeline stage to another.

On the other hand, less aggressive context switching policies (i.e., switch on cache miss) do not require such efficient mechanisms to maintain the benefits of multithreading (due to their relative infrequency of use). Less aggressive policies can be supported with a less complex set of hardware (and perhaps even aided by software). Such an implementation (the April multithreaded processor), which uses the existing register windows in the Sparc processor with slight enhancements, is presented in [2]. According to Laudon [5], the cost of a context switch for this type of switching policy corresponds to the number of instructions from the context being swapped out that have already entered the pipeline (as these instructions must be removed from the pipeline).

Performance models for determining the efficiency of multithreaded processors which take into account the cost and the frequency of context switching are presented in [2] and [4].

3.5 - Drawbacks and Limitations of Multithreading

Although multithreading leads to better utilization of processor resources, helps hide memory latency, and may help hide dependencies, it poses several potential problems which must be resolved in order for it to be used.

3.5.1 - Cache Conflicts

Each thread has associated with it a "working set" of data. For optimal performance, this working set should reside in the processor cache when the thread executes (so that the maximum number of memory references lead to "cache hits"). In most applications, the working sets of independent threads are disjoint, leading to one or more of the following problems:
• Placing the working sets of all threads in the cache simultaneously may be impractical, given that an extremely large cache is likely to significantly decrease the cost-performance of a system. This is compounded by the fact that individual working sets may themselves be quite large. However, given the decreasing cost of hardware, a sufficiently large cache may be a viable solution.
• With a limited cache size, thrashing (leading to unacceptable memory traffic conditions) may occur as context is switched from one thread to another, requiring the working sets of threads to be swapped in and out of the cache. Such conditions are more likely when cycle-by-cycle context switching is performed.

Potential solutions for reducing the likelihood of thrashing conditions are discussed in the next section.

3.5.2 - Lack of Inherent Parallelism in Applications

In order for multithreaded processors to be effective, it is necessary to partition applications/problems into medium- to small-grain threads (i.e., to achieve task-level parallelism).
This is especially true in systems which rely on cycle-by-cycle context switching and deep pipelines to hide memory latencies. For example, the Tera architecture (described in [4]) requires 128 concurrent threads in order to achieve optimal performance.

Given that such processors are part of a large multiprocessor system, and that each such processor requires a set of 128 threads for optimal performance (to keep the pipeline full), it is difficult to imagine how a single application could be partitioned into such a large number of parallelizable threads given the current software technology for identifying parallelism. However, it should be kept in mind that some of the threads could be in support of the operating system; not all such threads need to come from a common application.

It should also be noted that an extensive set of advanced software tools is needed to fully utilize such a system. For example, new languages which allow more natural expression of the inherent parallelism in problems, as well as more advanced parallelizing compilers, are required.

3.5.3 - Performance of Single Threads

Although the interleaving of multiple threads leads to better processor resource utilization and throughput, the performance of a single thread may degrade as a result of multithreading.

For example, in cycle-by-cycle context switching policies, threads are "swapped out" regardless of the type of instruction encountered (i.e., whether a cache miss occurs or not). Furthermore, a thread is "serviced" once again only after a single instruction of each of the other active threads has been serviced. This tends to severely penalize threads with high cache hit ratios (which require little or no remote memory access), especially in the presence of a large number of other threads. More precisely, the maximum performance possible under cycle-by-cycle context switching is one instruction for every n cycles, where n equals the depth of the pipeline. So running a single thread at optimal performance is not possible.

Ideally, a multithreaded processor should also support the execution of successive instructions from a single thread, assuming there are no dependencies between the instructions and that the instructions do not cause long-latency operations. Such a multithreaded processor would then be able not only to hide latency and eliminate pipeline stalls when appropriate, but also to serve applications requiring near-peak performance as a single-threaded processor would. This has been taken into consideration in the design of the multithreaded architecture proposed in the next section.

3.6 - Multithreading support for superscalar processors

So far, multithreading has been discussed with little reference to some of the latest advancements in modern processor architectures. The most important among these is the concept of superscalar architectures, which employ multiple functional units (possibly multiple units of a particular type) and are capable of issuing multiple instructions in a single clock cycle.

Recent experience with the use of superscalar processors has revealed the following:
• The instruction-level parallelism available in a single instruction stream is limited
• Functional units in superscalar architectures are often under-utilized due to the lack of instruction-level parallelism in single instruction streams

In exploring multithreading, the most aggressive context switch strategy considered so far has been cycle-by-cycle context switching.
Given the availability of multiple functional units in superscalar architectures, and the desire to maximize the utilization of these functional units, the issuing of multiple instructions (from potentially independent threads) during a single clock cycle should be considered. In doing so, the concept of a "context switch" is blurred.

A simple example of the use of a multithreaded superscalar processor is illustrated in Figure 5. Note that the utilization of functional units is maximized due to the availability of instructions from multiple threads.

Figure 5 - Instruction issue using a multithreaded superscalar processor

The concept of incorporating multithreading into a superscalar architecture forms the basis for the multithreaded processor proposed in the following section.

4 - Proposed Design of a Multithreaded Superscalar Architecture

Based on the discussions in the previous section, a high-level design of a new multithreaded superscalar architecture is presented below. A high-level diagram of this architecture is presented in Figure 6.

Figure 6 - Multithreaded superscalar architecture

The following are the main components of the proposed multithreaded superscalar processor:
• Instruction Cache: maintains instructions belonging to multiple threads, read in from main memory. Instructions for the different threads are fetched from main memory under the guidance of the Thread Management Unit. The instruction cache size should be determined based on the maximum number of concurrent threads to be supported by the system and the overall issue rate of the processor.
• Thread Management Unit (TMU): examines the dependences of the instructions within each thread and schedules the execution of eligible instructions (based on dependency analysis). The instructions are scheduled by placing them in available slots in the instruction buffer. This component also manages the states of the individual threads.
• Instruction Buffer: manages very long instruction words (VLIWs). Each instruction word contains multiple instructions, each of which may belong to a different thread.
• Instruction Issue Logic: issues instructions from the instruction buffer based on the availability of functional units. This component may issue instructions belonging to multiple threads during a single clock cycle.
• Functional Units: a number of specialized functional units (integer units, floating-point units, branch units, etc.). The type and number of units is to be determined based on analysis of the utilization of resources by multiple concurrent threads.
• Register File: maintains a register set for each active thread in the system. The actual number of register sets required should be scalable, and should be determined based on the maximum number of concurrent threads to be supported by the system.
• Data Cache: maintains the working sets of data belonging to the active threads. Since a large number of threads may be active, a sufficiently large cache must be provided to hold multiple working sets simultaneously.

Although this is a simplified, very high-level design (or framework) for a multithreaded superscalar processor, it illustrates the components necessary for such an architecture.
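As a closing illustration (an addition to the paper, not part of the proposed design), a toy Python simulation of the round-robin, cycle-by-cycle interleaving discussed in Sections 3.2 and 3.3 shows how one thread's memory latency is hidden behind other threads' work:

```python
# Toy round-robin interleaved multithreading: each cycle, issue one
# instruction from the next thread that is not stalled on "memory".
# Each thread is a list of instruction costs: 1 = single-cycle op,
# larger values model long-latency memory operations.
from collections import deque

def simulate(threads):
    ready = deque((tid, deque(ops)) for tid, ops in enumerate(threads) if ops)
    stalled = []  # (cycle at which the thread becomes ready again, tid, ops)
    cycle = 0
    while ready or stalled:
        cycle += 1
        # Wake threads whose memory operation has completed.
        still = []
        for wake, tid, ops in stalled:
            if wake > cycle:
                still.append((wake, tid, ops))
            elif ops:
                ready.append((tid, ops))
        stalled = still
        if not ready:
            continue  # every thread is waiting on memory: an idle cycle
        tid, ops = ready.popleft()
        cost = ops.popleft()  # issue one instruction from this thread
        if cost > 1:
            stalled.append((cycle + cost, tid, ops))  # latency is overlapped
        elif ops:
            ready.append((tid, ops))  # rotate to the back: round-robin
    return cycle

# Two threads alternating compute and 10-cycle memory ops finish in 14
# cycles when interleaved, versus 24 cycles run back-to-back on one thread.
print(simulate([[1, 10, 1], [1, 10, 1]]))
```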