Efficient Shared Memory Programming on Software-DSM Systems
Benefits of shared memory in CUDA

CUDA (Compute Unified Device Architecture) is the parallel computing platform and programming model created by NVIDIA for its graphics processors (GPUs). In CUDA, shared memory is a special on-chip memory that resides on each streaming multiprocessor (SM) of the GPU and is shared by the threads of a thread block. Using shared memory well is very important for the performance of a CUDA program. This article takes a closer look at the benefits of shared memory and answers the common questions about it step by step.
I. How shared memory works

Before examining the benefits of shared memory, it helps to understand how it works. Shared memory is a high-bandwidth, low-latency on-chip memory located on the SM. When a thread block is scheduled for execution, its threads typically first copy the data they will operate on from global memory into shared memory; from then on, all threads in the block can read and write that data directly in shared memory without going through global memory. This greatly reduces the number of global memory accesses and improves the performance of the program.
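To make this concrete, here is a minimal CUDA kernel sketch of the staging pattern just described (the kernel name reverse_block and the launch assumptions are illustrative, not from the original text): each block loads a tile of global memory into shared memory, synchronizes, and then serves all further reads from shared memory.

    // Minimal sketch; assumes a launch of <<<n / 256, 256>>> with n a multiple of 256.
    __global__ void reverse_block(const float* in, float* out, int n) {
        __shared__ float tile[256];      // on-chip, one copy per thread block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[i];       // exactly one global-memory read per element
        __syncthreads();                 // make the whole tile visible to the block
        // The access below is served by shared memory, not global memory.
        out[i] = tile[blockDim.x - 1 - threadIdx.x];
    }

Each element is read from global memory once; the block-local reversal is then done entirely out of shared memory.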
II. The benefits of shared memory

1. Fewer global memory accesses: global memory is the slowest memory a GPU can access, while shared memory is among the fastest. By loading data into shared memory, the threads of a block can read and write it there directly, avoiding frequent accesses to global memory; this reduces memory latency and improves performance.

2. More efficient memory access: shared memory is on-chip memory on the SM, and each resident thread block receives its own partition of it, shared by all of that block's threads. When the threads of a block need the same data, they can read it from shared memory instead of each reloading it from global memory, which avoids redundant global loads and improves access efficiency.

3. More data reuse: shared memory does more than reduce global memory traffic; it also enables data reuse. When a thread block needs to access the same data several times, loading that data into shared memory keeps a copy on chip, and the threads of the block can then read it repeatedly without fetching it from global memory again.
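Tiled matrix multiplication is the classic illustration of this reuse. In the sketch below (a minimal version, assuming square n × n row-major matrices with n a multiple of the tile width; the names are illustrative), each element of A and B is loaded from global memory once per tile but then read TILE times from shared memory:

    #define TILE 16

    // C = A * B for n x n row-major matrices; assumes n is a multiple of TILE.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {
            // Each thread loads one element of each tile: one global read per element.
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            // Each loaded element is then reused TILE times from shared memory.
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }

Without shared memory, each thread would re-read its row of A and column of B from global memory for every multiplication; with the tiles, global traffic drops by roughly a factor of TILE.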
University of Wisconsin-Madison (UW-Madison). Zhou Yulong, 1101213442, Computer Applications.

Introduction to UW-Madison

The University of Wisconsin-Madison is located in Madison, the capital of the state of Wisconsin, and has a picturesque campus. Founded in 1848, it is a university with a history of more than 150 years. The University of Wisconsin is counted among the top three public universities in the United States and among the top ten research universities in the country. In the United States it is often regarded as a "public Ivy". Like other famous American public universities such as the University of California and the University of Texas, the University of Wisconsin is a system made up of several state universities, known as the University of Wisconsin System. In undergraduate education it ranks third among public universities, behind UC Berkeley and the University of Michigan, and it is placed eighth among American universities for the quality of its undergraduate education. According to the National Research Council, the University of Wisconsin has 70 subjects ranked in the top ten nationwide. In the Shanghai Jiao Tong University ranking it is listed 16th among the world's universities. The University of Wisconsin is one of the 60 members of the Association of American Universities.

Distinctive programs

UW-Madison offers more than 100 undergraduate majors, more than half of which also award master's and doctoral degrees. Journalism, biochemistry, botany, chemical engineering, chemistry, civil engineering, computer science, earth sciences, English, geography, physics, economics, German, history, linguistics, mathematics, business administration (MBA), microbiology, molecular biology, mechanical engineering, philosophy, Spanish, psychology, political science, statistics, sociology, zoology, and many other disciplines have considerable research and teaching strength, and most of them rank in the top ten of their fields among American universities.

Academic distinction

In terms of academic honors, the faculty and alumni of UW-Madison have so far been awarded 17 Nobel Prizes and 24 Pulitzer Prizes; 53 faculty members are members of the National Academy of Sciences, 17 of the National Academy of Engineering, and 5 of the National Academy of Education; in addition, 9 faculty members have won the National Medal of Science, 6 are Searle Scholars, and 4 have received MacArthur Fellowships. Although UW-Madison is best known for agriculture and the life sciences, what attracts the most attention, and is the biggest draw for many communication students, is Jack McLeod, who teaches in the school's journalism and mass communication program and is known in the field as a "master of modern American communication research".
Datasheet: Fujitsu Desktop ESPRIMO G9012
Fujitsu recommends Windows 11 Pro.

A smart and ultra-compact Mini PC for advanced office needs

The Fujitsu ESPRIMO G9012 is a smart Mini PC that offers advanced desktop PC functionality in an ultra-compact, space-saving design, making it a perfect device for working from home or in flexible workplaces with its single-cable USB Type-C® connection. Equipped with the latest 13th-generation Intel® Core™ processors and Intel vPro® technology, the device delivers an impressive performance-to-volume ratio.

A smart Mini PC with ultra-compact design: eye-catching, exclusive modern design
• Low-noise Mini PC, providing an ergonomic, clean, and clutter-free working environment
• Offers advanced desktop PC functionality with a volume of only 0.87 liters
• Impressive performance-to-volume ratio; easy to set up in minutes

Small in size but grand in performance: high performance in a small Mini PC design
• Latest Intel chipset technology and 13th-generation Intel® Core™ processor family
• Intel vPro® version available for the latest out-of-band management features
• Next-generation Intel integrated graphics

Flexibility without limits: pleasant working environment thanks to a quiet system
• Housing concept with different heights allows an ODD and upgrade options such as an internal power supply and a PCIe slot
• Flexibility for later device upgrades
• Various mounting options

Ultimate usability: a perfect fit for modern offices and workplaces
• Integration into the Fujitsu AIO Display P2410 enables a space-saving all-in-one system
• Can be operated horizontally or vertically, with easy upgrades of major components
• VESA mount for attaching the PC to displays, stands, or walls
• Ideal device for home-office scenarios thanks to its mobility and one-cable solution

Low power consumption and environmental impact: reduced environmental impact and reduced energy bills
• Highly efficient 80PLUS PLATINUM AC adapter, or an internal power supply as an option

Specification (ESPRIMO G9012 / ESPRIMO G9012 ESTARx)

Processor:
- Intel® Core™ i7-13700 (16 cores (8+8) / 24 threads, 2.1 GHz, 30 MB, Intel® UHD Graphics 770) **^
- Intel® Core™ i5-13500 (14 cores (6+8) / 20 threads, 2.5 GHz, 24 MB, Intel® UHD Graphics 770) *
- Intel® Core™ i5-13400 (10 cores (6+4) / 16 threads, 2.5 GHz, 20 MB, Intel® UHD Graphics 730) *
- Intel® Core™ i3-13100 (4 cores / 8 threads, 3.4 GHz, 12 MB, Intel® UHD Graphics 730) *
- Intel® Core™ i7-12700 (12 cores (8+4) / 20 threads, 2.1 GHz, 25 MB, Intel® UHD Graphics 770) **
- Intel® Core™ i5-12500 (6 cores / 12 threads, 3.0 GHz, 18 MB, Intel® UHD Graphics 770) *
- Intel® Core™ i5-12400 (6 cores / 12 threads, 2.5 GHz, 18 MB, Intel® UHD Graphics 730) *
- Intel® Core™ i3-12100 (4 cores / 8 threads, 3.3 GHz, 12 MB, Intel® UHD Graphics 730) *
- Intel vPro® Platform Logo (optional) with Intel® Core™ i5-12500, i5-13500, and Intel® Core™ i7 processors
* With Intel® Turbo Boost Technology 2.0 (clock speed and performance will vary depending on workload and other variables)
** With Intel® Turbo Boost Max Technology 3.0 (clock speed and performance will vary depending on workload and other variables)
^ A 90 W AC adapter is required for maximum CPU performance

Operating system pre-installed: Windows 11 Pro,
Windows 11 Home, or Windows 10 Pro (via the Windows 11 Pro downgrade option). Fujitsu recommends Windows 11 Pro for business.

Memory modules:
- 16 GB (1 module of 16 GB) DDR4, unbuffered, non-ECC, 3,200 MT/s, SO-DIMM
- 8 GB (1 module of 8 GB) DDR4, unbuffered, non-ECC, 3,200 MT/s, SO-DIMM
- 4 GB (1 module of 4 GB) DDR4, unbuffered, non-ECC, 3,200 MT/s, SO-DIMM

Graphics:
- Entry 3D: NVIDIA® T600, 4 GB, PCIe x16, 4x miniDP
- Entry 3D: NVIDIA® T400, 4 GB, PCIe x16, 3x miniDP
- Others: miniDP-to-DP adapter cable
* Optional in the internal-PSU version

HDD (2.5-inch):
- SATA III, 5,400 rpm, 1,000 GB
- SATA III, 5,400 rpm, 2,000 GB

M.2 SSD:
- SSD PCIe, 2,048 GB M.2 NVMe (Gen4), SED
- SSD PCIe, 1,024 GB M.2 NVMe (Gen4), SED
- SSD PCIe, 512 GB M.2 NVMe (Gen4), SED
- SSD PCIe, 256 GB M.2 NVMe (Gen4), Value, SED

Hard disk notes: One gigabyte equals one billion bytes when referring to hard disk drive capacity. SSD = Solid State Disk; SED = Self-Encrypting Drive. The 2.5-inch bay accepts devices of 7 mm height. Durability is in accordance with the manufacturer's indications on read and write cycles.

Optical drive: DVD Super Multi ultra-slim (tray), SATA (optional)

Wireless LAN: Intel® Wi-Fi 6E AX211 (2x2/160) Gig+ and Bluetooth 5.2 vPro (depends on OS support); SRD cat. 1 (6E frequency only in dedicated regions)
Wireless LAN notes: Wi-Fi 6E: the device does not operate in the 6 GHz band in all countries, depending on individual local regulations.

Expansion notes: The listed product components and optional functions are not available in all regions/countries. Energy Star certification only with a Windows operating system. Base unit for TCO Certified generation 9. The system is available in 3 different housing versions: small version (0.87 liter), ODD bay version (1.23 liter), and internal power supply / PCIe slot version (1.85 liter).

Base unit (ESPRIMO G9012 / ESPRIMO G9012 ESTARx)

Mainboard:
- Mainboard type: D4014-A (base system)
- Form factor: proprietary
- Chipset: Intel® Q670
- Processor socket: LGA1700
- Processor qty (max.): 1
- Supported RAM (max.): 64 GB
- Memory slots: 2 DIMM (DDR4)
- Memory frequency: 3,200 MT/s
- Memory notes: dual-channel support; for dual-channel performance, 2 memory modules have to be ordered, and the capacity per channel has to be the same.
  3,200 MT/s memory may be clocked down to 2,933/2,400 MT/s depending on the processor and memory configuration.

LAN: 10/100/1,000 Mbit/s Intel® I219LM
Integrated WLAN: optional; the device does not operate in the 6 GHz band in all countries, due to national regulations.

BIOS:
- BIOS version: AMI Aptio V
- UEFI Specification 2.6
- BIOS features: BIOS flash EPROM update by software; recovery BIOS; Unified Extensible Firmware Interface (UEFI)

Audio:
- Audio type: onboard
- Audio codec: Realtek ALC257N
- Audio features: internal speaker supports audio playback; High Definition audio

I/O controller onboard:
- Serial ATA total: 2
- Controller functions: Serial ATA III (6 Gbit/s), NCQ, AHCI, RAID 1/0

Interfaces:
- Front audio: 1x headset
- USB front: 1x USB 3.2 Gen1 (5 Gbps) Type-A; 1x USB 3.2 Gen2 (10 Gbps) Type-A; 1x USB 3.2 Gen2x2 (20 Gbps) Type-C (provides up to 15 W)
- USB rear (base system): 1x USB 2.0 Type-A (plus 2 ports optional*); 3x USB 3.2 Gen2 (10 Gbps) Type-A; 1x USB 3.2 Gen2 (10 Gbps) Type-C (supports DisplayPort 1.4 and power delivery sink**)
- DisplayPort: 1x (DisplayPort 1.4, DP++) (2nd port optional*, DP)
- Serial (RS-232): 1 (optional*)
- HDMI: 1x HDMI 2.1 (2nd port optional*)
- Ethernet (RJ-45): 1x RJ-45
- Interface module notes: only one of the interfaces marked as "optional" can be configured at the same time. ** With power delivery sink support, the system is powered by peripherals such as port replicators or displays; an extra power supply for the system is not needed. Peripherals must provide at least 90 W (for maximum CPU performance).

Input devices (optional): keyboard, mouse

Drive bays:
- Drive bays total: 2
- 2.5-inch internal bays: 1
- 5.25-inch external bays: 1
- Drive bay notes: the 5.25-inch bay is for a slim optical disc drive, available only with the ODD bay housing version
- M.2-2280: 1x on mainboard (PCIe 4.0 x4; up to 64 Gbit/s)

Slots:
- 1x PCI-Express 3.0 x4 (mech. x16), 168 mm / 6.61 inch
- M.2-2230: on mainboard, for the WLAN module

Graphics onboard:
- Graphics brand name: Intel® UHD Graphics 770 or Intel® UHD Graphics 730 (depends on CPU)
- Shared video memory: up to half of total system memory
- Graphics features: support for up to four independent displays; DirectX® 12; HDCP support; OpenCL™ 2.1; OpenGL® 4.6; Vulkan™; DisplayPort interface supports version 1.4 incl. Multi-Stream; HDMI™ 2.1; digital audio formats supported, including Dolby Digital Plus (7.1 channels)

Electrical values:
- Power efficiency note: 90 W AC adapter; power supply efficiency at 115/230 V: 89% average efficiency
- Rated voltage range: 100 V - 240 V
- Rated frequency range: 50 Hz - 60 Hz
- Operating voltage range: 90 V - 264 V
- Operating line frequency range: 47 Hz - 63 Hz
- Max. output of single power supply: AC adapter 90 W; internal PLATINUM power supply (optional) 150 W

Noise emission: according to ISO 7779:2010, ECMA-74. A-weighted sound power level LWAd (in B) / workplace-related A-weighted sound pressure level LpAm (in dB(A)).

Dimensions / weight / environmental:
- Dimensions (W x D x H): 146.5 x 164.5 x 36 mm (5.77 x 6.48 x 1.42 inch)
- Dimension notes: with the ODD drive bay, 146.5 x 164.5 x 51 mm (5.77 x 6.48 x 2.01 inch); with the internal power supply / PCIe slot version, 146.5 x 176.9 x 71 mm (5.77 x 6.97 x 2.80 inch)
- Operating position: vertical / horizontal (feet included)
- Weight (base): min. 0.86 kg (1x memory, 1x SSD) / max. 1.025 kg (2x memory, 2x SSD, HDD, Flexi, WLAN); ODD version: min. 1.2 kg / max. 1.29 kg
- Weight (internal PSU version): min. 1.45 kg / max. 1.59 kg (weight of AC adapter and power cord not included)
- Weight notes: actual weight may vary depending on configuration
- Operating ambient temperature: 10 - 35 °C (50 - 95 °F)
- Operating relative humidity: 5 - 85%

Compliance:
- Model: MPCG
- Certification: CE; RoHS (Restriction of Hazardous Substances); CCC (in progress); BSMI (in progress); ENERGY STAR®
- Compliance link: https:///sites/certificates

Additional software (preinstalled):
- McAfee® LiveSafe™ (award-winning antivirus protection for your PC and much more; 30-day trial pre-installed)
- Microsoft Office (1-month trial for new Microsoft® Office 365 customers; buy Microsoft Office)

Additional software (optional):
- Recovery DVD for Windows®
- Drivers & Utilities DVD (DUDVD)
- Nero Essentials XL
- Microsoft® Office Professional 2021
- Microsoft® Office Home and Business 2021 (a license must be bought to activate the pre-installed Microsoft Office; purchase and activation only in the region in which it was acquired)

Additional software (notes): Use of accompanying and/or additional software is subject to proactive acceptance of the respective license agreements/EULAs/subscription and support terms of the software manufacturer, as applicable for the relevant software, whether preinstalled or optional. The software may only be available bundled with a software support subscription which, depending on the software, may be subject to separate remuneration.

Security:
- Physical security: Kensington Lock support; eye for padlock
- System and BIOS security: embedded security (TPM 2.0); EraseDisk; Credential Guard ready and Device Guard capable (requires 8 GB or more system RAM and a PCIe NVMe SSD); Modern Standby and Legacy Standby selectable; write-protect option for the flash EPROM; control of all USB interfaces; external USB ports can be disabled separately; control of external interfaces; NIST SP 800-193 Platform Firmware Resiliency supported; NIST SP 800-155 BIOS Integrity Measurement supported; NIST SP 800-147 BIOS Protection supported
- User security: user and supervisor BIOS password; hard disk password; access protection via SmartCard reader (optional); AuthConductor Client Basic (secure authentication solution)
- Security notes: The properties of the product provide a baseline for product security and therefore end-customer IT security. However, these properties are not sufficient on their own to protect the product from all existing threats, such as intrusion attempts, data exfiltration, and other forms of cyberattack. To customize security settings, please use the configuration options available for the respective product. During operation, the IT security of this product is the responsibility of the respective administrator/end-user of the product. Please note that Fujitsu as a manufacturer does not make any policy prescriptions or advocacy statements regarding IT security best practices and/or general product operation.

Manageability:
- Manageability technology: DeskUpdate driver management; PXE 2.1 boot code; wake-up from S5 (off mode); intrusion switch; WoL (Wake on LAN); Intel AMT 16.0 (depending on processor)
- Manageability software: DeskView Client; DeskView Instant BIOS Management
- DeskView components: BIOS management (incl. security), inventory management, driver management, alarm management
- Supported standards: DMI (Desktop Management Interface), SMBIOS (System Management BIOS), PXE (Preboot Execution Environment), WMI (Windows Management Instrumentation), WBEM (Web-Based Enterprise Management), CIM (Common Information Model)
- Manageability link: /fts/manageability

Packaging information:
- Packaging dimensions: 489 x 218 x 165 mm
- Max. quantity per pallet: 48
- Packaging material weight: 702 g

Warranty: for warranty terms and conditions, see https:///hk/ap/warranty

More information

The word "Uvance" embodies the concept of "making all (universal) things move forward (advance) in a sustainable direction." Fujitsu Uvance leverages Fujitsu's technological capabilities and problem-solving expertise across seven key focus areas, including Sustainable Manufacturing, Consumer Experience, Healthy Living, Trusted Society, Digital Shifts, Business Applications, and Hybrid IT, to offer unprecedented value to customers, while contributing to the achievement of its ultimate purpose: "to make the world more sustainable by building trust in society through innovation." Through Fujitsu Uvance, we are committed to transforming the world into a place where people can live their lives enjoying prosperity and peace of mind, empowering each other to make the world more sustainable. For more information on Fujitsu Uvance, please visit https:///global/uvance

To grow in an uncertain world, it is imperative to stay resilient and agile in responding to complex and unforeseen challenges. Under the global business brand Uvance, Fujitsu provides a full range of reliable, best-in-class product, service, and solution offerings with digital innovation, from sustainable manufacturing to smart-city ecosystems, to create a more sustainable world and resolve social issues. This empowers customers to strengthen resilience through Fujitsu's trusted solutions using cutting-edge technologies, while increasing their business agility and improving the reliability of their IT operations. For more information, please visit https:///global

Contact us:
- Hong Kong: Fujitsu Business Technologies Asia Pacific Ltd., tel. (852) 3910-8228
- Singapore: Fujitsu Asia Pte Ltd., tel. (65) 6512-7555
- China: Fujitsu (China) Holdings Co., Ltd., PC China Division, tel. 86 (21) 58871000-8721
- Indonesia: PT Fujitsu Indonesia, tel. (62) 21-570-9330
- Philippines: Fujitsu Philippines, Inc., tel. (63) 2-8841-8488
- Malaysia: Fujitsu (Malaysia) Sdn. Bhd., tel. (60) 3-8230-4188
- Taiwan: Fujitsu Taiwan Ltd., tel. (886) 2-2311-2255
- Thailand: Fujitsu (Thailand) Co., Ltd., tel. (66) 0-2302-1500
- Vietnam: Fujitsu Vietnam Limited, tel. (84-24) 2220-3113
For countries not listed above, please contact the Hong Kong office.

Specification disclaimers

Not all features are available in all editions or versions of Windows. Systems may require upgraded and/or separately purchased hardware, drivers, software, or a BIOS update to take full advantage of Windows functionality. Windows 11 is automatically updated, and automatic updating is always enabled. ISP fees may apply, and additional requirements may apply over time for updates. GB = 1 billion bytes and TB = 1 trillion bytes when referring to hard disk drive capacity.
Accessible capacity may vary, also depending on the software used. Up to 20 GB of HDD space is reserved for system recovery. Shared memory depends on main memory size and operating system. Fujitsu shall not be liable for technical or editorial errors or omissions contained herein.

Ultrabook, Celeron, Celeron Inside, Core Inside, Intel, Intel Logo, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside Logo, Intel vPro, Intel Evo, Itanium, Itanium Inside, Pentium, Pentium Inside, vPro Inside, Xeon, Xeon Phi, Xeon Inside, Intel Agilex, Arria, Cyclone, Movidius, eASIC, Enpirion, Iris, MAX, Intel RealSense, Stratix, and Intel Optane are trademarks of Intel Corporation or its subsidiaries. USB Type-C® and USB-C™ are trademarks of USB Implementers Forum. All other trademarks are the property of their respective owners.

All rights reserved, including intellectual property rights. Technical data subject to modification and delivery subject to availability. Any liability that the data and illustrations are complete, actual, or correct is excluded. Designations may be trademarks and/or copyrights of the respective manufacturer, the use of which by third parties for their own purposes may infringe the rights of such owners. For further information see /global/about/resources/terms/.

© 2023 Fujitsu Business Technologies Asia Pacific Limited. Last update: May 31, 2023.
Maxwell Tuning Guide
DA-07173-001_v8.0 | June 2017 | Application Note

TABLE OF CONTENTS
Chapter 1. Maxwell Tuning Guide
  1.1. NVIDIA Maxwell Compute Architecture
  1.2. CUDA Best Practices
  1.3. Application Compatibility
  1.4. Maxwell Tuning
    1.4.1. SMM
      1.4.1.1. Occupancy
      1.4.1.2. Instruction Scheduling
      1.4.1.3. Instruction Latencies
      1.4.1.4. Instruction Throughput
    1.4.2. Memory Throughput
      1.4.2.1. Unified L1/Texture Cache
    1.4.3. Shared Memory
      1.4.3.1. Shared Memory Capacity
      1.4.3.2. Shared Memory Bandwidth
      1.4.3.3. Fast Shared Memory Atomics
    1.4.4. Dynamic Parallelism
Appendix A. Revision History

1.1. NVIDIA Maxwell Compute Architecture

Maxwell is NVIDIA's next-generation architecture for CUDA compute applications. Maxwell retains and extends the same CUDA programming model as in previous NVIDIA architectures such as Fermi and Kepler, and applications that follow the best practices for those architectures should typically see speedups on the Maxwell architecture without any code changes. This guide summarizes the ways that an application can be fine-tuned to gain additional speedups by leveraging Maxwell architectural features. (Throughout this guide, Fermi refers to devices of compute capability 2.x, Kepler refers to devices of compute capability 3.x, and Maxwell refers to devices of compute capability 5.x.)

Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves energy efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA's GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, compiler-based scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called SMM) to far exceed Kepler SMX efficiency.

The first Maxwell-based GPU is codenamed GM107 and is designed for use in power-limited environments like notebooks and small form factor (SFF) PCs. GM107 is described in a whitepaper entitled NVIDIA GeForce GTX 750 Ti: Featuring First-Generation Maxwell GPU Technology, Designed for Extreme Performance per Watt. (The features of GM108 are similar to those of GM107.) The first GPU using the second-generation Maxwell architecture is codenamed GM204. Second-generation Maxwell GPUs retain the power efficiency of the earlier generation while delivering significantly higher performance. GM204 is described in a whitepaper entitled NVIDIA GeForce GTX 980: Featuring Maxwell, The Most Advanced GPU Ever Made.

Compute programming features of GM204 are similar to those of GM107, except where explicitly noted in this guide. For details on the programming features discussed in this guide, please refer to the CUDA C Programming Guide.

1.2. CUDA Best Practices

The performance guidelines and best practices described in the CUDA C Programming Guide and the CUDA C Best Practices Guide apply to all CUDA-capable GPU architectures.
Programmers must primarily focus on following those recommendations to achieve the best performance. The high-priority recommendations from those guides are as follows:
‣ Find ways to parallelize sequential code.
‣ Minimize data transfers between the host and the device.
‣ Adjust kernel launch configuration to maximize device utilization.
‣ Ensure global memory accesses are coalesced.
‣ Minimize redundant accesses to global memory whenever possible.
‣ Avoid long sequences of diverged execution by threads within the same warp.

1.3. Application Compatibility

Before addressing specific performance tuning issues covered in this guide, refer to the Maxwell Compatibility Guide for CUDA Applications to ensure that your application is compiled in a way that is compatible with Maxwell.

1.4. Maxwell Tuning

1.4.1. SMM

The Maxwell Streaming Multiprocessor, SMM, is similar in many respects to the Kepler architecture's SMX. The key enhancements of SMM over SMX are geared toward improving efficiency without requiring significant increases in available parallelism per SM from the application.

1.4.1.1. Occupancy

The maximum number of concurrent warps per SMM remains the same as in SMX (i.e., 64), and factors influencing warp occupancy remain similar or improved over SMX:
‣ The register file size (64k 32-bit registers) is the same as that of SMX.
‣ The maximum number of registers per thread, 255, matches that of Kepler GK110. As with Kepler, however, experimentation should be used to determine the optimum balance of register spilling vs. occupancy.
‣ The maximum number of thread blocks per SM has been increased from 16 to 32. This should result in an automatic occupancy improvement for kernels with small thread blocks of 64 or fewer threads (shared memory and register file resource requirements permitting). Such kernels would have tended to under-utilize SMX, but less so SMM.
‣ Shared memory capacity is increased (see Shared Memory Capacity).

As such, developers can expect similar or improved occupancy on SMM without changes to their application. At the same time, warp occupancy requirements (i.e., available parallelism) for maximum device utilization are similar to or less than those of SMX (see Instruction Latencies).
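As a practical aside (an illustration, not text from the original guide), how these occupancy limits play out for a particular kernel and block size can be queried at run time with cudaOccupancyMaxActiveBlocksPerMultiprocessor, available since CUDA 6.5. A minimal sketch with a hypothetical kernel name:

    #include <cstdio>

    __global__ void my_kernel(float* data) { /* hypothetical kernel body */ }

    int main() {
        int blocks_per_sm = 0;
        // Resident blocks per multiprocessor for 128-thread blocks,
        // assuming no dynamically allocated shared memory.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel, 128, 0);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        int warps = blocks_per_sm * 128 / prop.warpSize;
        printf("Resident blocks/SM: %d (%d of the 64 warps an SMM can host)\n",
               blocks_per_sm, warps);
        return 0;
    }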
1.4.1.2. Instruction Scheduling

The number of CUDA Cores per SM has been reduced to a power of two; however, with Maxwell's improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of SMM means CUDA Cores per GPU will be substantially higher than in comparable Fermi or Kepler chips. SMM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design.

As with SMX, each SMM has four warp schedulers. Unlike SMX, however, all SMM core functional units are assigned to a particular scheduler, with no shared units. Along with the selection of a power-of-two number of CUDA Cores per SM, which simplifies scheduling and reduces stall cycles, this partitioning of SM computational resources in SMM is a major component of the streamlined efficiency of SMM.

The power-of-two number of CUDA Cores per partition simplifies scheduling, as each of SMM's warp schedulers issues to a dedicated set of CUDA Cores equal to the warp width. Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores.

1.4.1.3. Instruction Latencies

Another major improvement of SMM is that dependent math latencies have been significantly reduced. A consequence of this is a further reduction of stall cycles, as the available warp-level parallelism (i.e., occupancy) on SMM should be equal to or greater than that of SMX (see Occupancy), while at the same time each math operation takes less time to complete, improving utilization and throughput.

1.4.1.4. Instruction Throughput

The most significant changes to peak instruction throughputs in SMM are as follows:
‣ The change in the number of CUDA Cores per SM brings with it a corresponding change in peak single-precision floating-point operations per clock per SM. However, since the number of SMs is typically increased, the result is an increase in aggregate peak throughput; furthermore, the scheduling and latency improvements discussed above make this peak easier to approach.
‣ The throughput of many integer operations, including multiply, logical operations, and shift, is improved. In addition, there are now specialized integer instructions that can accelerate pointer arithmetic. These instructions are most efficient when data structures are a power of two in size.

As was already the recommended best practice, signed arithmetic should be preferred over unsigned arithmetic wherever possible for best throughput on SMM. The C language standard places more restrictions on overflow behavior for unsigned math, limiting compiler optimization opportunities.

1.4.2. Memory Throughput

1.4.2.1. Unified L1/Texture Cache

Maxwell combines the functionality of the L1 and texture caches into a single unit. As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.

In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt in to caching of global loads in its unified L1/texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.

Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.

The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler. Two new device attributes were added in CUDA Toolkit 6.0: globalL1CacheSupported and localL1CacheSupported. Developers who wish to have separately tuned paths for various architecture generations can use these fields to simplify the path selection process.

Note: enabling caching of globals in GM204 can affect occupancy. If per-thread-block SM resource usage would result in zero occupancy with caching enabled, the CUDA driver will override the caching selection to allow the kernel launch to succeed. This situation is reported by the profiler.
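For concreteness (an illustration, not text from the guide): the opt-in is applied on the nvcc command line, for example

    nvcc -arch=sm_52 -Xptxas -dlcm=ca app.cu    # cache global loads in the unified L1/texture cache

and the device attributes mentioned above can be read through cudaDeviceProp to select an architecture-specific path at run time:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.globalL1CacheSupported) {
        // choose the code path tuned for L1-cached global loads
    }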
1.4.3. Shared Memory

1.4.3.1. Shared Memory Capacity

With Fermi and Kepler, shared memory and the L1 cache shared the same on-chip storage. Maxwell, by contrast, provides dedicated space for the shared memory of each SMM, since the functionality of the L1 and texture caches has been merged in SMM. This increases the shared memory space available per SMM as compared to SMX: GM107 provides 64 KB of shared memory per SMM, and GM204 further increases this to 96 KB of shared memory per SMM.

This presents several benefits to application developers:
‣ Algorithms with significant shared memory capacity requirements (e.g., radix sort) see an automatic 33% to 100% boost in capacity per SM on top of the aggregate boost from higher SM count.
‣ Applications no longer need to select a preference of the L1/shared split for optimal performance. For purposes of backward compatibility with Fermi and Kepler, applications may optionally continue to specify such a preference, but the preference will be ignored on Maxwell, with the full 64 KB per SMM always going to shared memory.

While the per-SM shared memory capacity is increased in SMM, the per-thread-block limit remains 48 KB. For maximum flexibility on possible future GPUs, NVIDIA recommends that applications use at most 32 KB of shared memory in any one thread block, which would, for example, allow at least two such thread blocks to fit per SMM.

1.4.3.2. Shared Memory Bandwidth

Kepler SMX introduced an optional 8-byte shared memory banking mode, which had the potential to increase shared memory bandwidth per SM over Fermi for shared memory accesses of 8 or 16 bytes. However, applications could only benefit from this when storing these larger elements in shared memory (i.e., integers and fp32 values saw no benefit), and only when the developer explicitly opted into the 8-byte bank mode via the API.

To simplify this, Maxwell returns to the Fermi style of shared memory banking, where banks are always four bytes wide. Aggregate shared memory bandwidth across the chip remains comparable to that of corresponding Kepler chips, given increased SM count. In this way, all applications using shared memory can now benefit from the higher bandwidth, even when storing only four-byte items into shared memory and without specifying any particular preference via the API.

1.4.3.3. Fast Shared Memory Atomics

Kepler introduced a dramatically higher throughput for atomic operations to global memory as compared to Fermi. However, atomic operations to shared memory remained essentially unchanged: both architectures implemented shared memory atomics using a lock/update/unlock pattern that could be expensive in the case of high contention for updates to particular locations in shared memory.

Maxwell improves upon this by implementing native shared memory atomic operations for 32-bit integers and native shared memory 32-bit and 64-bit compare-and-swap (CAS), which can be used to implement other atomic functions with reduced overhead compared to the Fermi and Kepler methods. Refer to the CUDA C Programming Guide for an example implementation of an fp64 atomicAdd() using atomicCAS().
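The fp64 atomicAdd() referenced above is built from a CAS retry loop; the sketch below follows the well-known pattern from the CUDA C Programming Guide (the function name is arbitrary):

    __device__ double atomicAddDouble(double* address, double val) {
        // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it.
        unsigned long long int* address_as_ull = (unsigned long long int*)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
            // If another thread updated the location first, retry with its new value.
        } while (assumed != old);
        return __longlong_as_double(old);
    }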
1.4.4. Dynamic Parallelism

GK110 introduced a new architectural feature called Dynamic Parallelism, which allows the GPU to create additional work for itself. A programming model enhancement leveraging this feature was introduced in CUDA 5.0 to enable kernels running on GK110 to launch additional kernels onto the same GPU. SMM brings Dynamic Parallelism into the mainstream by supporting it across the product line, even in lower-power chips such as GM107. This will benefit developers, as it means that applications will no longer need special-case algorithm implementations for high-end GPUs that differ from those usable in more power-constrained environments.

Appendix A. Revision History

Version 1.0
‣ Initial public release.
Version 1.1
‣ Updated for second-generation Maxwell (compute capability 5.2).

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated. Copyright © 2012-2017 NVIDIA Corporation. All rights reserved.
Foreign literature on computer hardware

The full text consists of four sample essays, provided for the reader's reference.

Sample essay 1:

Introduction

1. Central Processing Unit (CPU)

Smith also discusses the concept of multi-core processors, which have become increasingly common in modern computers. These processors feature multiple cores that can work simultaneously to improve performance and efficiency. He notes that while multi-core processors can enhance the speed and responsiveness of a computer, they may also present challenges in terms of programming and software optimization.

3. Random Access Memory (RAM)

Sample essay 2:

Computer hardware is an important part of modern technological development and an indispensable part of our daily lives. In research abroad, articles on computer hardware appear in a steady stream; they not only advance the technology but also bring us a more convenient everyday experience.

1. Processors

The processor is one of the most central components of computer hardware; it is responsible for executing all kinds of computational tasks. In research abroad, processor performance and efficiency have long been a focus of study. A paper entitled "Energy-Efficient Heterogeneous Multicore Systems for Embedded Multimedia Applications" points out that, for embedded multimedia applications, designing energy-efficient heterogeneous multicore systems is an effective approach that can improve system performance while reducing power consumption.
DATA SHEET: M20 Internet Backbone Router

The M20 router's compact design offers tremendous performance and port density. The M20 router has a rich feature set that includes numerous advantages:
• Route lookup rates in excess of 40 Mpps for wire-rate forwarding performance
• Aggregate throughput capacity exceeding 20 Gbps
• Performance-based packet filtering, rate limiting, and sampling with the Internet Processor II™ ASIC
• Redundant System and Switch Board and redundant Routing Engine
• Market-leading port density and flexibility
• Production-proven routing software with Internet-scale implementations of BGP4, IS-IS, OSPF, MPLS traffic engineering, class of service, and multicasting applications

The M20™ Internet backbone router is a high-performance routing platform built for a variety of Internet applications, including high-speed access, public and private peering, hosting sites, and backbone core networks. The M20 router leverages proven M-series ASIC technology to deliver wire-rate performance and rich packet processing, such as filtering, sampling, and rate limiting. It runs the same JUNOS™ Internet software and shares the same interfaces that are supported by the M40™ Internet backbone router, providing a seamless upgrade path that protects your investment. Moreover, its compact design (14 in / 35.56 cm high) delivers market-leading performance and port density while consuming minimal rack space. The M20 router offers wire-rate performance, advanced features, internal redundancy, and scalability in a space-efficient package.

"It [JUNOS software] dramatically increases our confidence that we will have access to technology to keep scaling along with what the demands on the network are. We can keep running." (Michael O'Dell, Chief Scientist, UUNET Technologies, Inc.)

Architecture

The two key components of the M20 architecture are the Packet Forwarding Engine (PFE) and the Routing Engine, which are connected via a 100-Mbps link. Control traffic passing through the 100-Mbps link is prioritized and rate-limited to help protect against denial-of-service attacks.
• The PFE is responsible for packet forwarding performance. It consists of the Flexible PIC Concentrators (FPCs), physical interface cards (PICs), the System and Switch Board (SSB), and state-of-the-art ASICs.
• The Routing Engine maintains the routing tables and controls the routing protocols. It consists of an Intel-based PCI platform running JUNOS software.

The architecture ensures industry-leading service delivery by cleanly separating forwarding performance from routing performance. This separation ensures that stress experienced by one component does not adversely affect the performance of the other, since there is no overlap of required resources.

Leading-edge ASICs

The feature-rich M20 ASICs deliver a comprehensive hardware-based system for packet processing, including route lookups, filtering, sampling, rate limiting, load balancing, buffer management, switching, encapsulation, and de-encapsulation functions. To ensure a non-blocking forwarding path, all channels between the ASICs are oversized, dedicated paths.

Internet Processor and Internet Processor II ASICs

The Internet Processor™ ASIC, which was originally deployed with M20 routers, supports an aggregated lookup rate of over 40 Mpps. An enhanced version, the Internet Processor II ASIC, supports the same 40 Mpps lookup rate.
With over one million gates, this ASIC delivers predictable, high-speed forwarding performance with service flexibility, including filtering and sampling. The Internet Processor II ASIC is the largest, fastest, and most advanced ASIC ever implemented on a router platform and deployed in the Internet.

Distributed Buffer Manager ASICs

The Distributed Buffer Manager ASICs allocate incoming data packets throughout shared memory on the FPCs. This single-stage buffering improves performance by requiring only one write to and one read from shared memory. There are no extraneous steps of copying packets from input buffers to output buffers. The shared memory is completely nonblocking, which in turn prevents head-of-line blocking.

I/O Manager ASICs

Each FPC is equipped with an I/O Manager ASIC that supports wire-rate packet parsing, packet prioritizing, and queuing. Each I/O Manager ASIC divides the packets, stores them in shared memory (managed by the Distributed Buffer Manager ASICs), and re-assembles the packets for transmission.

Media-specific ASICs

The media-specific ASICs perform physical layer functions, such as framing. Each PIC is equipped with an ASIC or FPGA that performs control functions tailored to the PIC's media type.

Packet Forwarding Engine

[Figure: Logical view of the M20 architecture, centered on the Packet Forwarding Engine]

The PFE provides Layer 2 and Layer 3 packet switching, route lookups, and packet forwarding. The Internet Processor II ASIC forwards an aggregate of up to 40 Mpps for all packet sizes. The aggregate throughput is 20.6 Gbps half-duplex. The PFE supports the same ASIC-based features supported by all other M-series routers. For example, class-of-service features include rate limiting, classification, priority queuing, Random Early Detection, and Weighted Round Robin to increase bandwidth efficiency. Filtering and sampling are also available for restricting access, increasing security, and analyzing network traffic.

Finally, the PFE delivers maximum stability during exceptional conditions, while also providing a significantly lower part count. This stability reduces power consumption and increases mean time between failure.

Flexible PIC Concentrators

The FPCs house PICs and connect them to the rest of the PFE. There is a dedicated, full-duplex, 3.2-Gbps channel between each FPC and the core of the PFE. You can insert up to four FPCs in an M20 chassis. Each FPC slot supports one FPC or one OC-48c/STM-16 PIC. Each FPC supports up to four of the other PICs in any combination, providing unparalleled interface density and configuration flexibility. Each FPC contains shared memory for storing received data packets; the Distributed Buffer Manager ASICs on the SSB manage this memory. In addition, the FPC houses the I/O Manager ASIC, which performs a variety of queue management and class-of-service functions.

Physical Interface Cards

PICs provide a complete range of fiber-optic and electrical transmission interfaces to the network. The M20 router offers flexibility and conserves rack space by supporting a wide variety of PICs and port densities.
All PICs occupy one of four PIC spaces per FPC except for the OC-48c/STM-16 PIC, which occupies an entire FPC slot. An additional Tunnel Services PIC enables the M20 router to function as the ingress or egress point of an IP-IP unicast tunnel, a Cisco generic routing encapsulation (GRE) tunnel, or a Protocol Independent Multicast - Sparse Mode (PIM-SM) tunnel. For a list of available PICs, see the M-series Internet Backbone Routers Physical Interface Cards datasheet.

System and Switch Board

The SSB performs route lookup, filtering, and sampling, and provides switching to the destination FPC. Hosting both the Internet Processor II ASIC and the Distributed Buffer Manager ASICs, the SSB makes forwarding decisions, distributes data cells throughout memory, processes exception and control packets, monitors system components, and controls FPC resets. You can have one or two SSBs, ensuring automatic failover to a redundant SSB in case of failure.

Routing Engine

The Routing Engine maintains the routing tables and controls the routing protocols, as well as the JUNOS software processes that control the router's interfaces, the chassis components, system management, and user access to the router. These routing and software processes run on top of a kernel that interacts with the PFE.
• The Routing Engine processes all routing protocol updates from the network, so PFE performance is not affected.
• The Routing Engine implements each routing protocol with a complete set of Internet features and provides full flexibility for advertising, filtering, and modifying routes. Routing policies are set according to route parameters, such as prefixes, prefix lengths, and BGP attributes.

You can install a redundant Routing Engine to ensure maximum system availability and to minimize MTTR in case of failure.

JUNOS Internet Software

JUNOS software is optimized to scale to large numbers of network interfaces and routes. The software consists of a series of system processes running in protected memory on top of an independent operating system. The modular design improves reliability by protecting against system-wide failure, since the failure of one software process does not affect other processes.

[Figure: M20 router front and back views; 14 in high]

Copyright © 2000, Juniper Networks, Inc. All rights reserved. Juniper Networks is a registered trademark of Juniper Networks, Inc. Internet Processor, Internet Processor II, JUNOS, M5, M10, M20, M40, and M160 are trademarks of Juniper Networks, Inc. All other trademarks, service marks, registered trademarks, or registered service marks may be the property of their respective owners. All specifications are subject to change without notice. Part Number 100009-003, 09/00. www.juniper.net. Corporate headquarters: Juniper Networks, Inc., 1194 North Mathilda Avenue, Sunnyvale, CA 94089 USA. Phone 408 745 2000 or 888 JUNIPER; fax 408 745 2100. Juniper Networks, Inc. has sales offices worldwide; for contact information, refer to /contactus.html.
DSP56F826/D Rev. 0, 3/2001. Semiconductor Products Sector.
DSP56F826 Preliminary Technical Data

DSP56F826 16-bit Digital Signal Processor
• Up to 40 MIPS at 80 MHz core frequency
• DSP and MCU functionality in a unified, C-efficient architecture
• Hardware DO and REP loops
• MCU-friendly instruction set supports both DSP and controller functions: MAC, bit manipulation unit, 14 addressing modes
• 31.5K × 16-bit words Program Flash
• 512 × 16-bit words Program RAM
• 2K × 16-bit words Data Flash
• 4K × 16-bit words Data RAM
• 2K × 16-bit words BootFLASH
• Up to 64K × 16-bit words each of external memory expansion for Program and Data memory
• One Serial Peripheral Interface (SPI)
• One additional SPI or two optional Serial Communication Interfaces (SCI)
• One Synchronous Serial Interface (SSI)
• One general-purpose Quad Timer
• JTAG/OnCE™ for debugging
• 100-pin LQFP package
• 16 dedicated and 30 shared GPIO
• One Time-of-Day module

[Figure 1: DSP56F826 block diagram]

Part 1: Overview

1.1 DSP56F826 Features

1.1.1 Digital Signal Processing Core
• Efficient 16-bit DSP56800 family DSP engine with dual Harvard architecture
• As many as 40 million instructions per second (MIPS) at 80 MHz core frequency
• Single-cycle 16 × 16-bit parallel multiplier-accumulator (MAC)
• Two 36-bit accumulators, including extension bits
• 16-bit bidirectional barrel shifter
• Parallel instruction set with unique DSP addressing modes
• Hardware DO and REP loops
• Three internal address buses and one external address bus
• Four internal data buses and one external data bus
• Instruction set supports both DSP and controller functions
• Controller-style addressing modes and instructions for compact code
• Efficient C compiler and local variable support
• Software subroutine and interrupt stack with depth limited only by memory
• JTAG/OnCE debug programming interface

1.1.2 Memory
• Harvard architecture permits as many as three simultaneous accesses to program and data memory
• On-chip memory, including a low-cost, high-volume flash solution:
  - 31.5K × 16-bit words of Program Flash
  - 512 × 16-bit words of Program RAM
  - 2K × 16-bit words of Data Flash
  - 4K × 16-bit words of Data RAM
  - 2K × 16-bit words of BootFLASH
• Off-chip memory expansion capabilities programmable for 0, 4, 8, or 12 wait states:
  - as much as 64K × 16-bit data memory
  - as much as 64K × 16-bit program memory

1.1.3 Peripheral Circuits for the DSP56F826
• One general-purpose Quad Timer totalling 7 pins
• One Serial Peripheral Interface with 4 pins (or four additional GPIO lines)
• One Serial Peripheral Interface, or multiplexed with two Serial Communications Interfaces, totalling 4 pins
• Synchronous Serial Interface (SSI) with configurable six-pin port (or six additional GPIO lines)
• Sixteen (16) dedicated general-purpose I/O (GPIO) pins
• Thirty (30) shared general-purpose I/O (GPIO) pins
• Computer Operating Properly (COP) watchdog timer
• Two external interrupt pins
• External reset pin for hardware reset
• JTAG/On-Chip Emulation (OnCE™) for unobtrusive, processor-speed-independent debugging
• Software-programmable, phase-locked-loop-based frequency synthesizer for the DSP core clock
• Fabricated in high-density CMOS with 5V-tolerant, TTL-compatible digital inputs
• One Time-of-Day module

Energy information:
• Dual power supply, 3.3V and 2.5V
• Wait and multiple Stop modes available

1.2 DSP56F826 Description

The DSP56F826 is a member of the DSP56800 core-based family of Digital Signal Processors (DSPs).
It combines, on a single chip, the processing power of a DSP and the functionality of a microcontroller with a flexible set of peripherals to create an extremely cost-effective solution for general-purpose applications. Because of its low cost, configuration flexibility, and compact program code, the DSP56F826 is well-suited for many applications. The DSP56F826 includes many peripherals that are especially useful for applications such as noise suppression, ID tag readers, sonic/subsonic detectors, security access devices, remote metering, sonic alarms, POS terminals, and feature phones.

The DSP56800 core is based on a Harvard-style architecture consisting of three execution units operating in parallel, allowing as many as six operations per instruction cycle. The microprocessor-style programming model and optimized instruction set allow straightforward generation of efficient, compact code for both DSP and MCU applications. The instruction set is also highly efficient for C/C++ compilers, enabling rapid development of optimized control applications.

The DSP56F826 supports program execution from either internal or external memories. Two data operands can be accessed from the on-chip Data RAM per instruction cycle. The DSP56F826 also provides two external dedicated interrupt lines and up to 46 general-purpose input/output (GPIO) lines, depending on peripheral configuration.

The DSP56F826 DSP controller includes 31.5K words (16-bit) of Program Flash and 2K words of Data Flash (each programmable through the JTAG port), with 512 words of Program RAM and 4K words of Data RAM. It also supports program execution from external memory.

The DSP56F826 incorporates a total of 2K words of BootFLASH for easy customer inclusion of field-programmable software routines that can be used to program the main Program and Data Flash memory areas. Both Program and Data Flash memories can be independently bulk-erased or erased in page sizes of 256 words. The BootFLASH memory can also be either bulk- or page-erased.

This DSP controller also provides a full set of standard programmable peripherals, including one Synchronous Serial Interface (SSI), one Serial Peripheral Interface (SPI), the option to select a second SPI or two Serial Communications Interfaces (SCIs), and one Quad Timer. The SSI, SPI, and Quad Timer pins can be used as general-purpose inputs/outputs (GPIOs) if their dedicated functions are not required.

1.3 Best-in-Class Development Environment

The SDK (Software Development Kit) provides fully debugged peripheral drivers, libraries, and interfaces that allow programmers to create their unique C application code independent of component architecture. The CodeWarrior Integrated Development Environment is a sophisticated tool for code navigation, compiling, and debugging. A complete set of evaluation modules (EVMs) and development system cards supports concurrent engineering. Together, the SDK, CodeWarrior, and EVMs create a complete, scalable tools solution for easy, fast, and efficient development.

1.4 Product Documentation

The four documents listed in Table 1 are required for a complete description of, and proper design with, the DSP56F826. Documentation is available from local Motorola distributors, Motorola semiconductor sales offices, Motorola Literature Distribution Centers, or online at /semiconductors/.
Table 1. DSP56F826 Chip Documentation
- DSP56800 Family Manual: detailed description of the DSP56800 family architecture, the 16-bit DSP core processor, and the instruction set. Order number: DSP56800FM/D.
- DSP56824/F826/F827 User's Manual: detailed description of memory, peripherals, and interfaces of the DSP56824, DSP56F826, and DSP56F827. Order number: TBD.
- DSP56F826 Technical Data Sheet: electrical and timing specifications, pin descriptions, and package descriptions (this document). Order number: DSP56F826/D.
- DSP56F826 Product Brief: summary description and block diagram of the DSP56F826 core, memory, peripherals, and interfaces. Order number: DSP56F826PB/D.

1.5 Data Sheet Conventions

This data sheet uses the following conventions:
- OVERBAR: indicates a signal that is active when pulled low. For example, the RESET pin is active when low.
- "asserted": a high-true (active-high) signal is high, or a low-true (active-low) signal is low.
- "deasserted": a high-true (active-high) signal is low, or a low-true (active-low) signal is high.

Examples (signal/symbol, logic state, signal state, voltage; values for VIL, VOL, VIH, and VOH are defined by individual product specifications):
- PIN (low true): True / Asserted / VIL or VOL
- PIN (low true): False / Deasserted / VIH or VOH
- PIN (high true): True / Asserted / VIH or VOH
- PIN (high true): False / Deasserted / VIL or VOL

Part 2: Signal/Connection Descriptions

2.1 Introduction

The input and output signals of the DSP56F826 are organized into functional groups, as shown in Table 2 and as illustrated in Figure 2. Table 3 describes the signal or signals present on a pin.

Table 2. Functional Group Pin Allocations (functional group: number of pins)
- Power (VDD: 3, VDDIO: 4, VDDA: 1)
- Ground (VSS: 3, VSSIO: 4, VSSA: 1)
- PLL and clock: 3
- Address bus (alternately GPIO): 16
- Data bus: 16
- Bus control: 4
- Interrupt and program control: 5
- Dedicated general-purpose input/output: 16
- Synchronous Serial Interface (SSI) port: 6
- Serial Peripheral Interface (SPI) port (alternately GPIO): 4
- Serial Communications Interface (SCI) ports: 4
- Quad Timer module ports: 4
- JTAG/On-Chip Emulation (OnCE): 6

[Figure 2: DSP56F826 signals identified by functional group; alternate pin functionality is shown in parentheses.]

Table 3. DSP56F826 Signal and Package Information for the 100-pin LQFP

All inputs have a weak internal pull-up circuit associated with them. These pull-up circuits are always enabled, with two exceptions: (1) when a pin is owned by GPIO, the pull-up may be disabled under software control; (2) TCK has a weak pull-down circuit that is always active. Each entry below gives the signal name, pin number(s), type, and description.

A0-A7 (pins 24, 23, 22, 21, 18, 17, 16, 15), Output; alternately GPIOE0-GPIOE7, Input/Output. Address Bus: A0-A7 specify the address for external program or data memory accesses. Port E GPIO: these eight general-purpose I/O (GPIO) pins can be individually programmed as input or output pins.
After reset, the default state is Address Bus.

A8-A15 (pins 14, 13, 12, 11, 10, 9, 8, 7), Output; alternately GPIOA0-GPIOA7, Input/Output. Address Bus: A8-A15 specify the address for external program or data memory accesses. Port A GPIO: these eight general-purpose I/O (GPIO) pins can be individually programmed as input or output pins. After reset, the default state is Address Bus.

CLKO (pin 65), Output. Clock Output: this pin outputs a buffered clock signal. By programming the CLKO Select Register (CLKOSR), the user can select between outputting a version of the signal applied to XTAL and a version of the DSP master clock at the output of the PLL. The clock on this pin can be disabled by programming the CLKO Select Register (CLKOSR).

D0-D15 (pins 34-44 and 46-50), Input/Output. Data Bus: D0-D15 carry the data for external program or data memory accesses. D0-D15 are tri-stated when the external bus is inactive.

DE (pin 98), Output. Debug Event: DE provides a low pulse on recognized debug events.

DS (pin 28), Output. Data Memory Select: DS is asserted low for external data memory access.

EXTAL / CLOCKIN (pin 61), Input. External Crystal Oscillator Input: this input can be connected to a 4 MHz external crystal. If an external clock source of 4 MHz or less is used, EXTAL can be used as the input and XTAL must not be connected (see Section 3.5.2). External Clock Input: this input can be connected to an external 8 MHz clock (see Section 3.5). The input clock can be selected to provide the clock directly to the DSP core, or as the input clock for the on-chip PLL.

EXTBOOT (pin 25), Input. External Boot: this input is tied to VDD to force the device to boot from off-chip memory; otherwise, it is tied to ground.

GPIOB0-GPIOB7 (pins 66-73), Input or Output. Port B GPIO: these eight dedicated general-purpose I/O (GPIO) pins can be individually programmed as input or output pins. After reset, the default state is GPIO input.

GPIOD0-GPIOD7 (pins 74-79, 82, 83), Input or Output. Port D GPIO: these eight dedicated GPIO pins can be individually programmed as input or output pins. After reset, the default state is GPIO input.

IRQA (pin 32), Input. External Interrupt Request A: the IRQA input is a synchronized external interrupt request indicating that an external device is requesting service. It can be programmed to be level-sensitive or negative-edge-triggered.
If level-sensitive triggering is selected, an external pull up resistor is required for wired-OR operation.If the processor is in the Stop state and IRQA is asserted, the processor will exit the Stop state.IRQB33InputExternal Interrupt Request B —The IRQB input is an external interrupt request that indicates that an external device is requesting service. It can be programmed to be level-sensitive or negative-edge-triggered. If level-sensitive triggering is selected, an external pull up resistor is required for wired-OR operation.MISO GPIOF686Input/OutputInput/OutputSPI Master In/Slave Out (MISO)—This serial data pin is an input to a master device and an output from a slave device. The MISO line of a slave device is placed in the high-impedance state if the slave device is not selected.Port F GPIO —This General Purpose I/O (GPIO) pin can be individually programmed as input or output. After reset, the default state is MISO.MOSIGPIOF585Input/OutputInput/Output SPI Master Out/Slave In (MOSI)—This serial data pin is an output from a master device and an input to a slave device. The master device places data on the MOSI line a half-cycle before the clock edge that the slave device uses to latch the data.Port F GPIO —This General Purpose I/O (GPIO) pin can be individually programmed as input or output.PS 29Output Program Memory Select —PS is asserted low for external program memory access.RD26OutputRead Enable —RD is asserted during external memory read cycles. When RD is asserted low, pins D0–D15 become inputs and an external device is enabled onto the DSP data bus. When RD is deasserted high, the external data is latched inside the DSP. When RD is asserted, it qualifies the A0–A15, PS, and DS pins. RD can be connected directly to the OE pin of a Static RAM or ROM.Table 3. DSP56F826 Signal and Package Information for the 100 Pin LQFPAll inputs have a weak internal pull-up circuit associated with them. These pull-up circuits are always enabled .Exceptions:1.When a pn is owned by GPIO, then the pull-up may be disabled under software control.2.TCK has a weak pull-down circuit always active.Signal Name Pin No.TypeDescriptionIntroduction RESET45Input Reset—This input is a direct hardware reset on the processor. When RESETis asserted low, the DSP is initialized and placed in the Reset state. A Schmitttrigger input is used for noise immunity. When the RESET pin is deasserted,the initial chip operating mode is latched from the external boot pin. Theinternal reset signal will be deasserted synchronous with the internal clocks,after a fixed number of internal clocks.To ensure complete hardware reset, RESET and TRST should be assertedtogether. The only exception occurs in a debugging environment when ahardware DSP reset is required and it is necessary not to reset the OnCE/JTAG module. In this case, assert RESET, but do not assert TRST.RXD0 MOSI096InputInput/OutputReceive Data (RXD0)— receive data inputSPI Master Out/Slave In—This serial data pin is an output from a masterdevice, and an input to a slave device. The master device places data on theMOSI line one half-cycle before the clock edge the slave device uses to latchthe data.After reset, the default state is SCI input.RXD1 SS092InputInputReceive Data (RXD1)— receive data inputSPI Slave Select—In maste mode, this pin is used to arbitrate multiplemasters. 
In slave mode, this pin is used to select the slave.After reset, the default state is SCI input.SCLK GPIOF484Input/OutputInput/OutputSPI Serial Clock—In master mode, this pin serves as an output, clockingslaved listeners. In slave mode, this pin serves as the data clock input.Port F GPIO—This General Purpose I/O (GPIO) pin can be individuallyprogrammed as input or output.After reset, the default state is SCLK.SRCK GPIOC253Input/OutputInput/OutputSSI Serial Receive Clock (STCK)—This bidirectional pin provides theserial bit rate clock for the Receive section of the SSI. The clock signal canbe continuous or gated and can be used by both the transmitter and receiverin synchronous mode.Port C GPIO—This is a General Purpose I/O (GPIO) pin with the capabilityof being individually programmed as input or output.After reset, the default state is GPIO input.Table3. DSP56F826 Signal and Package Information for the 100 Pin LQFPAll inputs have a weak internal pull-up circuit associated with them. These pull-up circuits are always enabled. Exceptions:1.When a pn is owned by GPIO, then the pull-up may be disabled under software control.2.TCK has a weak pull-down circuit always active.SignalNamePin No.Type DescriptionSRD GPIOC051Input/OutputInput/OutputSSI Receive Data (SRD)—This input pin receives serial data and transfersthe data to the SSI Receive Shift Receiver.Port C GPIO—This is a General Purpose I/O (GPIO) pin with the capabilityof being individually programmed as input or output.After reset, the default state is GPIO input.SRFS GPIOC152Input/ OutputInput/OutputSSI Serial Receive Frame Sync (SRFS)—This bidirectional pin is used bythe receive section of the SSI as frame sync I/O or flag I/O. The STFS can beused only by the receiver. It is used to synchronize data transfer and can bean input or an output.Port C GPIO—This is a General Purpose I/O (GPIO) pin with the capabilityof being individually programmed as input or output.After reset, the default state is GPIO input.SS GPIOF787InputInput/OutputSPI Slave Select—In master mode, this pin is used to arbitrate multiplemasters. In slave mode, this pin is used to select the slave.Port F GPIO—This General Purpose I/O (GPIO) pin can be individuallyprogrammed as input or output.After reset, the default state is SS.STCK GPIOC556Input/ OutputInput/OutputSSI Serial Transmit Clock (STCK)—This bidirectional pin provides theserial bit rate clock for the transmit section of the SSI. The clock signal canbe continuous or gated. It can be used by both the transmitter and receiver insynchronous mode.Port C GPIO—This is a General Purpose I/O (GPIO) pin with the capabilityof being individually programmed as input or output.After reset, the default state is GPIO input.STD GPIOC354OutputInput/OutputSSI Transmit Data (STD)—This output pin transmits serial data from theSSI Transmitter Shift Register.Port C GPIO—This is a General Purpose I/O (GPIO) pin with the capabilityof being individually programmed as input or output.After reset, the default state is GPIO input.Table3. DSP56F826 Signal and Package Information for the 100 Pin LQFPAll inputs have a weak internal pull-up circuit associated with them. These pull-up circuits are always enabled. 
Exceptions:1.When a pn is owned by GPIO, then the pull-up may be disabled under software control.2.TCK has a weak pull-down circuit always active.SignalNamePin No.Type DescriptionIntroductionSTFSGPIOC455InputInput/OutputSSI Serial Transmit Frame Sync (STFS)—This bidirectional pin is used by the Transmit section of the SSI as frame sync I/O or flag I/O. The STFS can be used by both the transmitter and receiver in synchronous mode. It is used to synchronize data transfer and can be an input or output pin.Port C GPIO—This is a General Purpose I/O (GPIO) pin with the capability of being individually programmed as input or output. After reset, the default state is GPIO input.TA0-3GPIOF0-GPIOF391Input/OutputInput/OutputTA0–3—Timer F Channels 0, 1, 2, and 3Port F GPIO —These four General Purpose I/O (GPIO) pins can be individually programmed as input or output. After reset, the default state is Quad Timer.908988TCS99TCS —This pin is reserved for factory use. It must be tied to V SS for normal use. In block diagrams, this pin is considered an additional V SS.TCK 100InputTest Clock Input —This input pin provides a gated clock to synchronize the test logic and shift serial data to the JTAG/OnCE port. The pin is connected internally to a pull-down resistor.TDI 2InputTest Data Input —This input pin provides a serial input data stream to the JTAG/OnCE port. It is sampled on the rising edge of TCK and has an on-chip pull-up resistor.TDO 3OutputTest Data Output —This tri-statable output pin provides a serial output data stream from the JTAG/OnCE port. It is driven in the Shift-IR and Shift-DR controller states, and changes on the falling edge of TCK.TMS 1InputTest Mode Select Input —This input pin is used to sequence the JTAG TAP controller’s state machine. It is sampled on the rising edge of TCK and has an on-chip pull-up resistor.TRST 4InputTest Reset —As an input, a low signal on this pin provides a reset signal to the JTAG TAP controller. To ensure complete hardware reset, TRST should be asserted whenever RESET is asserted. The only exception occurs in a debugging environment when a hardware DSP reset is required and it is necessary not to reset the JTAG/OnCE module. In this case, assert RESET, but do not assert TRST.Table 3. DSP56F826 Signal and Package Information for the 100 Pin LQFPAll inputs have a weak internal pull-up circuit associated with them. These pull-up circuits are always enabled .Exceptions:1.When a pn is owned by GPIO, then the pull-up may be disabled under software control.2.TCK has a weak pull-down circuit always active.Signal Name Pin No.TypeDescriptionTXD0SCLK097Output Input/OutputTransmit Data (TXD0)—transmit data outputSPI Serial Clock—In master mode, this pin serves as an output, clocking slaved listeners. In slave mode, this pin serves as the data clock input. After reset, the default state is SCI output.TXD1MISO093Output Input/OutputTransmit Data (TXD1)—transmit data outputSPI Master In/Slave Out—This serial data pin is an input to a masterdevice and an output from a slave device. 
The MISO line of a slave device is placed in the high-impedance state if hte slave device is not selected.After reset, the default state is SCI output.V DD 20V DD Power —These pins provide power to the internal structures of the chip, and are generally connected to a 2.5V supply.V DD 64V DD V DD 94V DD V DDA 59V DDA Analog Power —This pin supplies an analog power source (generally 2.5V).V DDIO 5V DDIO Power —These pins provide power to the I/O structures of the chip, and are generally connected to a 3.3V supply.V DDIO 30V DDIO V DDIO 57V DDIO V DDIO 80V DDIO V SS 19V SS GND —These pins provide grounding for the internal structures of the chip. All should be attached to V SS.V SS 63V SS V SS 95V SS V SSA 60V SSA Analog Ground —This pin supplies an analog ground.V SSIO 6V SSIO GND In/Out —These pins provide grounding for the I/O ring on the chip.All should be attached to V SS.V SSIO 31V SSIO V SSIO 58V SSIO V SSIO81V SSIOTable 3. DSP56F826 Signal and Package Information for the 100 Pin LQFPAll inputs have a weak internal pull-up circuit associated with them. These pull-up circuits are always enabled .Exceptions:1.When a pn is owned by GPIO, then the pull-up may be disabled under software control.2.TCK has a weak pull-down circuit always active.Signal Name Pin No.Type DescriptionIntroductionWR27OutputWrite Enable —WR is asserted during external memory write cycles. When WR is asserted low, pins D0–D15 become outputs and the DSP puts data on the bus. When WR is deasserted high, the external data is latched inside the external device. When WR is asserted, it qualifies the A0–A15, PS, and DS pins. WR can be connected directly to the WE pin of a Static RAM.XTAL 62OutputCrystal Oscillator Output —This output connects the internal crystal oscillator output to an external crystal. If an external clock source over 4MHz is used, XTAL must be used as the input and EXTAL connected to V SS . For more information, please refer to Section 3.5.2.Table 3. DSP56F826 Signal and Package Information for the 100 Pin LQFPAll inputs have a weak internal pull-up circuit associated with them. These pull-up circuits are always enabled .Exceptions:1.When a pn is owned by GPIO, then the pull-up may be disabled under software control.2.TCK has a weak pull-down circuit always active.Signal Name Pin No.Type DescriptionPart 3 Specifications3.1 General CharacteristicsThe DSP56F826 is fabricated in high-density CMOS with 5-volt tolerant TTL-compatible digital inputs. The term 5-volt tolerant refers to the capability of an I/O pin, built on a 3.3V compatible process technology, to withstand a voltage up to 5.5V without damaging the device. Many systems have a mixture of devices designed for 3.3V and 5V power supplies. In such systems, a bus may carry both 3.3V and 5V- compatible I/O voltage levels. A standard 3.3V I/O is designed to receive a maximum voltage of 3.3V ± 10% during normal operation without causing damage. This 5V tolerant capability, therefore, offers the power savings of 3.3V I/O levels while being able to receive 5V levels without being damaged.Absolute maximum ratings given in Table4 are stress ratings only, and functional operation at the maximum is not guaranteed. Stress beyond these ratings may affect device reliability or cause permanent damage to the device.The DSP56F826 DC/AC electrical specifications are preliminary and are from design simulations. These specifications may not be fully tested or guaranteed at this early stage of the product life cycle. 
Finalized specifications will be published after complete characterization and device qualifications have been completed.CAUTIONThis device contains protective circuitry to guard againstdamage due to high static voltage or electrical fields.However, normal precautions are advised to avoidapplication of any voltages higher than maximum ratedvoltages to this high-impedance circuit. Reliability ofoperation is enhanced if unused inputs are tied to anappropriate logic voltage level (e.g., either or V DD or V SS).Table4. Absolute Maximum Ratings (V SS = 0 V)Characteristic Symbol Min Max Unit Supply voltage, core V DD V SS – 0.3 3.0V Supply voltage, IO and analog V DDIO V SS – 0.3 4.0V All other input voltages V IN V SS – 0.3V DDIO + 0.3V Current drain per pin excluding V DD, V SS I—10mA Junction temperature T J—150°C Storage temperature range T STG–55150°C。
Efficient Shared Memory Programming on Software-DSM Systems

Jie Tao¹, Wolfgang Karl¹, Carsten Trinitis², Hubert Eichner², and Tobias Klug²

¹ Institut für Technische Informatik, Universität Karlsruhe (TH), Kaiserstr. 12, 76128 Karlsruhe, Germany, {tao,karl}@a.de
² Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany, {trinitic,eichnerh,klug}@

Abstract

Software Distributed Shared Memory generally forms the basis for shared memory programming on cluster architectures. Due to their software-based implementation and page-based inter-node transfers, however, SDSMs usually do not show excellent performance. This paper discusses an approach for optimization with respect to data locality which can alleviate the overheads caused by communication across processor nodes. The approach relies on a simulation platform to acquire performance data about the runtime data operations; this information is then visualized by a visualization tool. Based on the presented access pattern, users can detect the dominating node for each data page and, in a next step, allocate the data on the node that mostly accesses it. The approach is verified on the simulation platform using standard benchmark applications, and the experimental results show an improvement of up to 70%.

1 Introduction

Cluster systems usually exhibit good scalability and cost-effectiveness. They are hence increasingly used in both research and commercial areas for High Performance Computing. Traditionally, message passing is used on such machines to parallelize applications. Over the last years, however, shared memory models have also been widely applied for programming cluster systems. A well-known example is OpenMP [2], a portable, standardized shared memory model initially designed for SMPs and currently being extended for clusters. In contrast to the message passing paradigm, shared memory models provide a programming interface semantically close to sequential programming, thus releasing the users from the burden of otherwise having to deal with communication within the source code.

A prerequisite for shared memory programming on clusters is a distributed shared memory (DSM) that establishes a global memory space visible to all processor nodes. As a result, a number of DSMs have been developed, either based solely on software mechanisms [7, 17, 19, 16] or also relying on additional hardware support [18].

However, shared memory programs often show poor performance on DSM systems, and this problem is especially critical for software-based DSMs (SDSM), primarily due to the management of shared data. Hence, this work focuses on mechanisms capable of reducing the overhead of data transfer across processor nodes.

On an SDSM system, the memory is actually distributed over the processor nodes. Hence, the global address space for programmers is not physical but virtual, and the data a processor needs for computation could reside on another node. In this case, SDSMs rely on the page fault handler of the operating system to fetch the data from the remote node. Nevertheless, with this scheme the whole page has to be copied, resulting in large inter-node traffic.
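This page-fault mechanism can be made concrete with standard POSIX primitives: the shared region is mapped without access rights, and a SIGSEGV handler fetches the faulting page before re-enabling access. The following is only a minimal sketch of the general technique, not ViSMI's implementation; fetch_page_from_remote() is a hypothetical placeholder for the communication layer.

    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    /* Hypothetical stand-in for the SDSM transport: copies the current
       contents of one page from the node hosting it into local memory. */
    extern void fetch_page_from_remote(void *page);

    static void dsm_segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        /* Round the faulting address down to its page boundary. */
        void *page = (void *)((uintptr_t)si->si_addr
                              & ~((uintptr_t)PAGE_SIZE - 1));

        fetch_page_from_remote(page);   /* the whole page crosses the network */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
        /* On return, the faulting instruction is restarted and succeeds. */
    }

    void dsm_protect_region(void *base, size_t size)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = dsm_segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* Revoke all rights: the first touch of every page traps into
           the handler above. */
        mprotect(base, size, PROT_NONE);
    }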
To tackle this performance problem, we use locality optimization schemes to optimally distribute the data on the dominating nodes. This means a data page will not be placed on the node first touching it, as traditional SDSMs usually do. Rather, it will be explicitly allocated by the programmer on the node that dominantly accesses it during the execution. For this, we simulate the execution of a program and collect information about the runtime data accesses. This information is then presented to the user with a visualization tool. Based on this tool, users are capable of detecting the dominating node for each data page.

The proposed approach has been verified using the simulation platform. Experimental results show a high improvement for most tested applications.

The remainder of this paper is organized as follows: Section 2 briefly describes the concept of software distributed shared memory and the related work. This is followed by a discussion of the efficiency problem of shared memory programming on SDSMs in Section 3, with ViSMI, an SDSM built for our InfiniBand cluster, as an example for a case study. In Section 4 the approach for optimization is described in detail, including the simulator, the visualization tool, and the optimization process. This is followed by the first experimental results in Section 5. The paper concludes with a short summary and a few future directions in Section 6.

2 SDSM and the Related Work

The basic idea behind software distributed shared memory is to provide the programmers with a virtually global address space on cluster architectures. This idea was first proposed by Kai Li [11] and implemented in IVY [12]. As the memories are actually distributed across the cluster, the required data could be located on a remote node, and multiple copies of shared data could exist. The latter leads to consistency issues, since a write operation on shared data has to be seen by other processors. For tackling these problems, SDSMs usually rely on the page fault handler of the operating system to support page protection mechanisms that implement invalidation-based consistency models.

The purpose of memory consistency models is to precisely characterize the behavior of the respective memory system by clearly defining the order in which memory operations are performed. Depending on the concrete requirements, this order can be more or less strict, leading to various consistency models.

The first consistency model, and also the most strict one, is Sequential Consistency [10], which forces a multiprocessor system to achieve the same result for any execution as if the operations of all processors were executed in some sequential order, with the operations of each individual processor appearing in this sequence in the order specified by its program. This provides an intuitive and easy-to-follow memory behavior; however, the strict ordering requires the memory system to propagate updates early and prohibits optimizations in both hardware and compilers. Hence, other models have been proposed to relax the constraints of Sequential Consistency with the goal of improving the overall performance.

Relaxed consistency models [1, 3, 7, 8] define a memory model in which programmers use explicit synchronization. Synchronizing memory accesses are divided into Acquires and Releases, where an Acquire allows access to shared data and ensures that the data is up-to-date, while a Release relinquishes this access right and ensures that all memory updates have been properly propagated. By separating the synchronization in this way, invalidations are only performed at a synchronization operation, thereby reducing the unnecessary invalidations caused by early coherence operations.

A well-known relaxed consistency model is Lazy Release Consistency (LRC) [8], in which the invalidations are propagated at acquire time. This allows any communication of write updates to be deferred until the data is actually needed. To reduce the communication caused by false sharing, where multiple unrelated shared data items are located on the same page, LRC protocols usually support a multiple-writer scheme. Within this scheme, multiple writable copies of the same page are allowed, and a clean copy is generated after an invalidation. Home-based Lazy Release Consistency (HLRC) [16], for example, implements such a multiple-writer scheme by specifying a home for each page. All updates to a page are propagated to the home node at synchronization points, such as lock release and barrier. Hence the page copy on the home is always up-to-date.
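From the programmer's perspective, the acquire/release discipline looks as sketched below. This is a schematic illustration of the protocol behavior just described; lock_acquire() and lock_release() are generic placeholder primitives, not the interface of a specific system.

    /* Two nodes incrementing a shared counter under (H)LRC.          */
    extern void lock_acquire(int lock_id);  /* apply pending write       */
                                            /* notices: stale pages      */
                                            /* become invalid            */
    extern void lock_release(int lock_id);  /* under HLRC: send diffs of */
                                            /* modified pages to their   */
                                            /* home nodes                */
    extern int *shared_counter;             /* shared data with a home   */

    void increment_shared_counter(void)
    {
        lock_acquire(0);
        /* If the page was invalidated, the first access faults and the
           up-to-date copy is fetched (under HLRC: from the home node). */
        (*shared_counter)++;
        lock_release(0);
        /* The update becomes visible to the next acquirer of lock 0.   */
    }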
Over the last years, a significant amount of work has been done in the area of software distributed shared memory. Quarks [19] is an example using relaxed consistency; it is a system aiming at a lean implementation of DSM that avoids complex, high-overhead protocols. TreadMarks [7] is the first implementation of shared virtual memory using the LRC protocol on a network of stock computers. Shasta [17] is another example that uses LRC to achieve a distributed shared memory on clusters. For the HLRC protocol, representative works include Rangarajan and Iftode's HLRC-VIA [16] for a GigaNet VIA-based network, VIACOMM [4], which implements a multithreaded HLRC also over VIA, and NEWGENDSM [14], an SDSM for InfiniBand clusters.

We also implemented an SDSM, called ViSMI (Virtual Shared Memory for InfiniBand) [15]. ViSMI provides a shared virtual space by implementing the HLRC protocol. In addition, it establishes an interface for developing shared memory programming models on cluster architectures.

3 Efficiency Problem of Shared Memory Programming on SDSMs: A Case Study

In order to depict the performance problem of shared memory applications running on SDSM-based clusters, we examine the parallel execution of OpenMP programs on top of our InfiniBand cluster. This OpenMP execution environment [5] is based on ViSMI; for adaptation to the ViSMI programming interface, we modified the Omni OpenMP compiler for SMPs [9] to build a new runtime library.

Figure 1: Speedup on systems with different numbers of processors

Figure 1 shows the performance of several OpenMP applications from different benchmark suites as well as a few self-coded small kernels. The speedup is measured on our InfiniBand cluster with different numbers of nodes. It can be observed that for most applications only a slight speedup has been achieved via parallel execution. For example, except for the SOR (Successive Over-Relaxation) code, all applications show a speedup lower than 3 when running on the cluster with 6 processors. The efficiency is thus less than 0.5.
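For reference, SOR, the best-scaling code above, is a typical relaxation kernel; a generic OpenMP formulation of such a kernel (not the exact benchmark source) is shown below. Its parallel row loop is also what later produces the block-wise access pattern discussed in Section 4.

    #define N 1024
    static double grid[N][N], next[N][N];

    /* One over-relaxed Jacobi-style sweep. The row loop is the parallel
       dimension, so each thread (on an SDSM: each node) touches a
       contiguous block of rows; pointer swap between sweeps omitted.   */
    void sor_sweep(double omega)
    {
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                next[i][j] = (1.0 - omega) * grid[i][j]
                           + omega * 0.25 * (grid[i-1][j] + grid[i+1][j]
                                           + grid[i][j-1] + grid[i][j+1]);
    }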
To detect the reason, we now examine the execution time breakdown of two sample codes: CG and Gauss. Figure 2 shows the experimental results. In Figure 2, exec. denotes the overall execution time, while comput. specifies the time for actually executing the application, page is the time for fetching pages, barrier denotes the time for barrier operations (this includes the time for synchronization and that for transferring diffs), lock is the time for performing locks, handler is the time needed by the communication thread, and overhead is the time for other protocol activities. The sum of all partial times equals the total exec. time.

Figure 2: Sample normalized execution time breakdown

It can be seen that for Gauss only roughly half of the time is spent on computation, and for CG this is even worse, with only 33% of the overall time used for actually running the program. Observing the time used for page transfer, it can be seen that both applications show a high proportion. In addition, synchronization activities, like lock and barrier, also take a significant amount of the execution time, and for CG this time is even larger than the time for computing. This demonstrates the necessity of optimizing the behavior of SDSMs with respect to page and synchronization operations. As the first step towards this kind of optimization, we focus on the former: optimization to reduce the time for data transfer.

4 The Approach

For such optimization we first need knowledge about the data accesses performed at runtime during the execution of an application. For this we rely on an existing simulation tool called SIMT [20].

SIMT is a multiprocessor simulator modeling the parallel execution of shared memory applications on systems with shared or distributed shared memory. As it is used for research work in the area of memory systems, SIMT mainly models the memory hierarchy, including arbitrary caches and the management of the main memory.

SIMT itself is based on Augmint [13], an event-driven multiprocessor simulator. Hence, it applies the same SPMD programming model, where parallelization is expressed with m4 macros in a fashion similar to the SPLASH and SPLASH-2 applications.
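An SPMD code skeleton in this style, expressed with ANL-style m4 macros as used by the SPLASH suites, might look as follows; the exact macro set and spelling vary between installations, so this sketch is purely illustrative.

    /* Illustrative SPMD skeleton with ANL/SPLASH-style m4 macros. */
    MAIN_ENV                        /* declare the shared environment  */

    #define NPROCS 4
    double *shared_data;

    void worker()
    {
        /* each simulated processor executes this function; sync is
           expressed with macros such as LOCK or BARRIER              */
    }

    int main(int argc, char *argv[])
    {
        MAIN_INITENV                /* initialize the runtime          */
        shared_data = (double *) G_MALLOC(1024 * sizeof(double));

        for (int p = 0; p < NPROCS - 1; p++)
            CREATE(worker);         /* spawn simulated processors      */
        worker();                   /* the main thread participates    */

        WAIT_FOR_END(NPROCS - 1);   /* join the workers                */
        MAIN_END
    }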
For simulating the parallel execution of OpenMP applications, we both extended SIMT and modified the Omni OpenMP compiler to map the OpenMP programming model to the simulation platform. In addition, we added mechanisms to SIMT to simulate the page fault activities of SDSM systems.

Like most simulators, SIMT provides extensive statistics on the execution of an application during the simulation process. This includes the elapsed simulation time, the simulated processor cycles, the total number of memory references, and information about page faults. In addition, SIMT delivers detailed information about data accesses in the form of memory access histograms. Each entry in the histogram corresponds to a data page and records the number of accesses performed by each processor. This enables the user to detect dominating nodes, providing the basis for determining a better data distribution leading to less inter-node communication.

For supporting this, we developed a visualization tool that presents the performance data in user-understandable graphical views. This visualizer uses different forms, like diagrams, tables, curves, and 3D charts, to illustrate the various aspects of data accesses at different granularities. The visualization can be based on either virtual pages or data structures. The latter directly supports the detection of optimization objects in the source code.

Figure 3: Sample views of the visualization tool (virtual page)

Figure 3 shows two sample views based on virtual pages. The left depicts a 2D block view that presents hot memory accesses over the application's whole shared virtual address space. For each page it shows the node (in color) that has issued the most accesses to the page. The number of accesses is depicted by the length of the column. This allows the user to detect the node where a page has to be hosted in order to reduce page copying, and thereby the data transfer caused by it, on an SDSM system.

The right of Figure 3 is a 3D view also showing the accesses to all virtual pages of the working set. In contrast to the 2D view, it separates the accesses into phases, where a phase can be a code region, a function, or a synchronization phase. In a single phase, each page is represented by a colored bar showing all accesses from all nodes: the higher the bar, the more accesses performed by the corresponding node. This enables easy detection of the dominating node for each virtual page. In addition, the viewpoint is adjustable, enabling a change of focus to an individual phase or observation of the changes in access behavior between phases. This enables a more detailed data allocation which corresponds to the different behavior in the various program phases. In the concrete example, different access patterns can be observed for the different phases. While most pages are dominantly accessed by a single node in the first, second, and fourth phases, they are also accessed by other nodes in the remaining phases. This indicates that for some applications separate optimizations should be performed for each phase. An accumulated view across all phases alone would have failed to show this behavior and would have led to incorrect conclusions.
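The dominating node shown in these views can be derived mechanically from the per-page histograms described above. The following sketch illustrates the idea; the structure and function names are ours, not SIMT's actual interface.

    /* Shape of a per-page access histogram: one counter per node. */
    #define MAX_NODES 32

    struct page_histogram {
        unsigned long accesses[MAX_NODES];  /* accesses issued per node */
    };

    /* The node issuing the most accesses to a page is the candidate
       home; this is the node the 2D block view colors for each page. */
    int dominating_node(const struct page_histogram *h, int num_nodes)
    {
        int best = 0;
        for (int n = 1; n < num_nodes; n++)
            if (h->accesses[n] > h->accesses[best])
                best = n;
        return best;
    }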
Figure 4: Sample views of the visualization tool (data structure)

Figure 4 illustrates a sample data structure view showing the memory accesses performed on the main working set of the numerical kernel SOR (Successive Over-Relaxation) on an eight-node system. This view directly shows accesses to the main data structure, a large dense matrix. Small squares within the diagram represent the array elements. Depending on the size of the array, each square can correspond to several elements. The color of a square shows the node that performs the most accesses to these elements. For the concrete example, a block access pattern can be observed, where a single processor node predominantly accesses a specific block. This diagram can hence direct the users to allocate data onto the dominating nodes.

5 Experimental Results

With the directions from the visualization tool, we can explicitly allocate the data in the source code with a specification of the home node. As a first step, we use the simulation platform to examine the influence of such optimizations. For enabling the specification of a home node for virtual pages, we implemented a macro ALLOC that can be used in the source code to give the number of a page, or of a block of pages, and the corresponding node.

Figure 5: Access behavior of SOR before (top) and after optimization (bottom)

Figure 5 illustrates a sample experimental result for the SOR code. The figure shows the access behavior of the first 100 pages of the working set. For each page, two numbers are presented: one for the accesses performed locally and the other for those fulfilled by issuing a page fault. Examining both diagrams, it can be clearly seen that after optimization fewer accesses are performed via page faults than before. This is caused by the correct placement of the data pages on the dominating nodes, which are specified as their homes.
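A possible use of the ALLOC macro for the SOR matrix is sketched below. Since the exact signature of ALLOC is not given here, the three-argument form (first page, number of pages, home node) is an assumption, declared as a function standing in for the macro; the loop makes each node the home of its own row block, matching the pattern in Figure 4.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define NPROCS    8

    /* Stand-in for the ALLOC macro described above; the argument list
       (first page, number of pages, home node) is an assumed form,
       not the documented one.                                         */
    void ALLOC(size_t first_page, size_t num_pages, int home_node);

    /* Host each row block of the SOR matrix on the node that
       dominantly accesses it.                                         */
    void place_sor_matrix(void *matrix, size_t bytes)
    {
        size_t first = (uintptr_t)matrix / PAGE_SIZE;
        size_t pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
        size_t chunk = pages / NPROCS;   /* pages per row block        */

        for (int node = 0; node < NPROCS; node++)
            ALLOC(first + (size_t)node * chunk, chunk, node);
    }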
To examine the impact of this optimization on the overall execution behavior, we tested a variety of OpenMP applications from different benchmark suites, including the NAS OpenMP Benchmark [6] and the self-ported OpenMP version of the SPLASH-II Benchmark [21]. We also used a few self-coded small kernels.

Figure 6 presents the experimental results. The figure gives the normalized improvement, obtained by dividing the execution time of the unoptimized version by that of the optimized version. For each application the same metric is measured on systems with 4, 8, 16, and 32 processor nodes individually. It can be observed that all applications show an improvement (value > 1.0), with the SOR code achieving the best performance. The improvement rate increases on systems with more processors, indicating that larger systems benefit more from this optimization.

6 Conclusions

In this paper, we introduce an approach for raising the efficiency of shared memory programming on SDSM-based cluster systems. This approach relies on a multiprocessor simulator to acquire the runtime memory access behavior, a visualization tool to display the performance data in a user-understandable manner, and explicit data allocation for less runtime inter-node communication. The experimental results on the simulation platform show the feasibility of this approach.

This work is an initial phase for this kind of optimization. In the next step, we will move it to the InfiniBand cluster. For this, performance data will be collected via the operating system, and ViSMI, the SDSM established for our cluster, will be extended to support the explicit specification of home nodes.

References

[1] A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. Software Versus Hardware Shared-Memory Implementation: A Case Study. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 106–117, April 1994.
[2] L. Dagum and R. Menon. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science & Engineering, 5(1):46–55, January 1998.
[3] L. Iftode and J. P. Singh. Shared Virtual Memory: Progress and Challenges. In Proceedings of the IEEE, Special Issue on Distributed Shared Memory, volume 87, pages 498–507, 1999.
[4] V. Iosevich and A. Schuster. Multithreaded Home-Based Lazy Release Consistency over VIA. In Proceedings of the 18th International Parallel and Distributed Processing Symposium, pages 59–68, April 2004.
[5] W. Karl, J. Tao, and C. Trinitis. Implementing an OpenMP Execution Environment on InfiniBand Clusters. In Proceedings of IWOMP 2005 (Lecture Notes in Computer Science), Eugene, USA, June 2005.
[6] H. Jin, M. Frumkin, and J. Yan. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011, NASA Ames Research Center, October 1999.
[7] P. Keleher, S. Dwarkadas, A. Cox, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the 1994 Winter Usenix Conference, pages 115–131, January 1994.
[8] P. Keleher. Lazy Release Consistency for Distributed Shared Memory. PhD thesis, Department of Computer Science, Rice University, January 1995.
[9] K. Kusano, S. Satoh, and M. Sato. Performance Evaluation of the Omni OpenMP Compiler. In Proceedings of the International Workshop on OpenMP: Experiences and Implementations (WOMPEI), volume 1940 of LNCS, pages 403–414, 2000.
[10] L. Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Transactions on Computers, 28(9):690–691, 1979.
[11] K. Li. Shared Virtual Memory on Loosely Coupled Multiprocessors. PhD thesis, Yale University, September 1986.
[12] K. Li. IVY: A Shared Virtual Memory System for Parallel Computing. In Proceedings of the International Conference on Parallel Processing, Vol. II Software, pages 94–101, 1988.
[13] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. In Proceedings of the 1996 International Conference on Computer Design, pages 486–491. IEEE Computer Society, October 1996.
[14] R. Noronha and D. K. Panda. Designing High Performance DSM Systems using InfiniBand Features. In DSM Workshop, in conjunction with the 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004.
[15] C. Osendorfer, J. Tao, C. Trinitis, and M. Mairandres. ViSMI: Software Distributed Shared Memory for InfiniBand Clusters. In Proceedings of the 3rd IEEE International Symposium on Network Computing and Applications (IEEE NCA04), pages 185–191, September 2004.
[16] M. Rangarajan and L. Iftode. Software Distributed Shared Memory over Virtual Interface Architecture: Implementation and Performance. In Proceedings of the 4th Annual Linux Showcase, Extreme Linux Workshop, pages 341–352, Atlanta, USA, October 2000.
[17] D. J. Scales, K. Gharachorloo, and C. A. Thekkath. Shasta: A Low Overhead, Software-Only Approach for Supporting Fine-Grain Shared Memory. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 174–185, October 1996.
[18] M. Schulz. SCI-VM: A Flexible Base for Transparent Shared Memory Programming Models on Clusters of PCs. In Proceedings of HIPS'99, volume 1586 of Lecture Notes in Computer Science, pages 19–33, Berlin, April 1999. Springer Verlag.
[19] M. Swanson, L. Stoller, and J. B. Carter. Making Distributed Shared Memory Simple, Yet Efficient. In Proceedings of the 3rd International Workshop on High-Level Parallel Programming Models and Supportive Environments, pages 2–13, March 1998.
[20] J. Tao, M. Schulz, and W. Karl. A Simulation Tool for Evaluating Shared Memory Systems. In Proceedings of the 36th Annual Simulation Symposium, pages 335–342, Orlando, Florida, April 2003.
[21] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, June 1995.