Parallel Computing (USTC Lecture Notes)
GPU Architecture in Detail and Performance Optimization (Part II)
Bin ZHOU, USTC, Autumn 2013

Announcements
- The following classes:
  - 11/23 Review and project review
  - 11/30 Final exam + project
  - 12/07, 12/14 Project
  - 12/21 Project defense
  - 12/28 or later: course close
- Deliverables: project source code + report + PPT
- Project deadline: 2013/12/18, 24:00

Contents
- Review of the last lecture, then continuation
- Optimization + new features in Kepler
- Tools for the project

Optimization
- "Make the best or most effective use of a situation or resource"

Last Lecture
- General guidelines
- Occupancy optimization
- Warp branch divergence
- Global memory access
- Shared memory access

Outline
- General guidelines II
- CPU-GPU interaction optimization
- Kepler in detail

Tools
- WinSCP: copy files from/to remote servers
- Notepad++: edit source files (with keyword highlighting)

GENERAL GUIDELINE II

Kernel Optimization Workflow (flowchart summarized)
- Find the limiter: memory bound, instruction bound, or latency bound.
- Memory bound: compare effective bandwidth to the peak GB/s; if far below peak, apply memory optimization.
- Instruction bound: compare achieved instruction rate to the peak inst/s; if far below peak, apply instruction optimization.
- Latency bound: apply configuration optimization.
- Otherwise: done.

General Optimization Strategies: Measurement
- Find the limiting factor in kernel performance:
  - Memory bandwidth bound (memory optimization)
  - Instruction throughput bound (instruction optimization)
  - Latency bound (configuration optimization)
- Measure effective memory/instruction throughput.

Memory Optimization
- Applies when the code is memory bound and the effective memory throughput is much lower than the peak.
- Purpose: access only the data that are absolutely necessary.
- Major techniques:
  - Improve the access pattern to reduce wasted transactions.
  - Reduce redundant accesses: read-only cache, shared memory (see the sketch below).
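The slides do not include a worked example at this point; the following minimal sketch, assuming a 1D 3-point stencil, illustrates both techniques: each thread block stages a tile of the input in shared memory with coalesced loads, then reuses it instead of issuing redundant global loads. The kernel name, the BLOCK size, and the RADIUS constant are illustrative choices, not from the lecture.

```cuda
#include <cuda_runtime.h>

#define RADIUS 1
#define BLOCK  256

// 3-point stencil: out[i] = in[i-1] + in[i] + in[i+1], with zero padding at the edges.
// Each block stages BLOCK + 2*RADIUS input elements in shared memory, so every global
// element is read once per block instead of up to three times.
// Assumes n is a multiple of BLOCK, so every thread has an in-range element.
__global__ void stencil1d(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + RADIUS;                   // index inside the tile

    tile[lid] = in[gid];                              // coalesced load of the main tile
    if (threadIdx.x < RADIUS) {                       // a few threads also load the halo
        tile[lid - RADIUS] = (gid >= RADIUS)     ? in[gid - RADIUS] : 0.0f;
        tile[lid + BLOCK]  = (gid + BLOCK < n)   ? in[gid + BLOCK]  : 0.0f;
    }
    __syncthreads();

    out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];  // reuse shared data only
}
```

A launch such as `stencil1d<<<n / BLOCK, BLOCK>>>(d_in, d_out, n)` would exercise the kernel; the point of the pattern is that the redundant neighbor reads hit shared memory rather than DRAM.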
Instruction Optimization
- Applies when the code is instruction bound.
  - A compute-intensive algorithm can easily become memory bound if you are not careful.
  - Typically, worry about instruction optimization only after memory and execution-configuration optimizations.
- Purpose: reduce the instruction count.
  - Use fewer instructions to get the same job done.
- Major techniques:
  - Use high-throughput instructions (e.g., wider loads).
  - Reduce wasted instructions: branch divergence, replays (bank conflicts), etc.

Latency Optimization
- Applies when the code is latency bound:
  - Both the memory and instruction throughputs are far from their peaks.
- Latency hiding: switching threads.
  - A warp stalls when one of its operands is not ready.
- Purpose: have enough warps in flight to hide latency.
- Major techniques: increase the number of active warps; increase ILP.

CPU-GPU INTERACTION

Minimize CPU-GPU Data Transfer
- Host-device data transfer has much lower bandwidth than global memory access: 16 GB/s (PCIe x16 Gen3) vs. 250 GB/s memory bandwidth and 3.95 Tinst/s compute (GK110).
- Minimize transfers:
  - Intermediate data can be allocated, operated on, and de-allocated directly on the GPU.
  - Sometimes it is even better to recompute data on the GPU than to transfer it.
  - Move CPU code to the GPU even when it shows no performance gain by itself, if doing so reduces data transfer.
- Group transfers:
  - One large transfer is much better than many small ones.

Overlap Memory Transfer with Computation
- A typical job has three phases over the PCIe bus:
  1. Copy input data from CPU memory to GPU memory.
  2. Load the GPU code and execute it.
  3. Copy the results from GPU memory back to CPU memory.
- Without overlap, T_total = T_HtoD + T_Exec + T_DtoH.
- For more overlap, split the work across streams, e.g. Stream 1: HD1, E1, DH1 and Stream 2: HD2, E2, DH2, so the copies of one chunk overlap the kernel of the other.

Single stream (no overlap; the slide omitted the direction flag on the final copy, restored here):

```cuda
cudaStreamCreate(&stream1);
cudaMemcpyAsync(d_data1, h_data1, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream1>>>(d_data1);
cudaMemcpyAsync(h_data1, d_data1, size, cudaMemcpyDeviceToHost, stream1);
cudaStreamSynchronize(stream1);
```

Two streams (copies and kernels issued breadth-first so work from different streams can overlap):

```cuda
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(d_data1, h_data1, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_data2, h_data2, size, cudaMemcpyHostToDevice, stream2);
kernel<<<grid, block, 0, stream1>>>(d_data1);
kernel<<<grid, block, 0, stream2>>>(d_data2);
cudaMemcpyAsync(h_data1, d_data1, size, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(h_data2, d_data2, size, cudaMemcpyDeviceToHost, stream2);
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
```
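The snippets above omit the setup. The self-contained sketch below adds it, under the assumption (standard for CUDA, though not stated on the slide) that the host buffers must be page-locked (pinned) via cudaMallocHost for the async copies to actually overlap kernel execution. The buffer names, chunk size, and the trivial `scale` kernel are illustrative.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Trivial kernel standing in for real work on one chunk.
__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;                  // elements per chunk
    const size_t bytes = N * sizeof(float);

    float *h[2], *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&h[i], bytes);       // pinned host memory: needed for real overlap
        cudaMalloc(&d[i], bytes);
        cudaStreamCreate(&s[i]);
        for (int j = 0; j < N; ++j) h[i][j] = (float)j;
    }

    // Each stream owns one chunk; its copies can overlap the other stream's kernel.
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
        scale<<<(N + 255) / 256, 256, 0, s[i]>>>(d[i], N);
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    }
    for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);

    printf("h[0][1] = %f\n", h[0][1]);      // expect 2.0

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(s[i]);
        cudaFree(d[i]);
        cudaFreeHost(h[i]);
    }
    return 0;
}
```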
KEPLER IN DETAIL

Kepler
- NVIDIA Kepler (GK110):
  - 1.31 TFLOPS double precision
  - 3.95 TFLOPS single precision
  - 250 GB/s memory bandwidth
  - 2,688 functional units (cores)
- Roughly equal to the #1 system on the Top500 list in 1997.

Kepler GK110 SMX vs. Fermi SM
- About 3x the performance.
- Power goes down.

New ISA Encoding: 255 Registers per Thread
- Fermi limit: 63 registers per thread.
  - A common Fermi performance limiter.
  - Leads to excessive spilling.
- Kepler: up to 255 registers per thread.
  - Especially helpful for FP64 apps.

Hyper-Q
- Feature of Kepler K20 GPUs that increases application throughput by enabling work to be scheduled onto the GPU in parallel.
- Two ways to take advantage of it:
  - CUDA streams: now they really are concurrent.
  - CUDA Proxy for MPI: concurrent CUDA MPI processes on one GPU.

Better Concurrency Support (diagram summarized)
- Fermi: stream queue management feeds a single work distributor with at most 16 active grids.
- Kepler GK110: a Grid Management Unit holds thousands of pending and suspended grids (including CUDA-generated work) and feeds a work distributor with 32 active grids across the SMX units.

Fermi Concurrency
- Fermi allows 16-way concurrency:
  - Up to 16 grids can run at once.
  - But CUDA streams multiplex into a single hardware work queue (e.g. A--B--C, P--Q--R, X--Y--Z end up serialized).
  - Overlap occurs only at stream edges.

Kepler Improved Concurrency
- Kepler allows 32-way concurrency.
- One hardware work queue per stream.
- Concurrency at full-stream level; no inter-stream dependencies.

Hyper-Q: Simultaneous Multiprocess (diagram summarized)
- Multiple CPU processes (A-F) share one GPU through the CUDA Proxy, a client-server software system.
- Without Hyper-Q, the processes time-share the GPU one after another and GPU utilization stays low; with Hyper-Q, their work runs concurrently and utilization is much higher.

What is Dynamic Parallelism?
- The ability to launch new kernels from the GPU:
  - Dynamically: based on run-time data.
  - Simultaneously: from multiple threads at once.
  - Independently: each thread can launch a different grid.
- Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.

What Does It Mean?
- With CUDA today the GPU is a co-processor driven by the CPU; with CUDA on Kepler it gains autonomous, dynamic parallelism.

New Types of Algorithms
- Recursive parallel algorithms, such as quicksort.
- Adaptive mesh algorithms, such as the Mandelbrot set: computational power is allocated to the regions of interest (see the sketch below).
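The slides only name these algorithm classes; the following minimal sketch, assuming a Kepler-class device (compute capability 3.5+), shows the recursive launch pattern: a kernel either processes its range directly or splits it and launches itself on the halves. The kernel name, the threshold, and the `process` helper are illustrative, not from the lecture.

```cuda
#include <cuda_runtime.h>

// Leaf work: process one element of the range (placeholder for real work).
__device__ void process(float *data, int i)
{
    data[i] *= 2.0f;
}

// Recursive subdivision with dynamic parallelism: each launch either handles its
// range directly or splits it in two and launches itself on the halves.
__global__ void subdivide(float *data, int lo, int hi)
{
    const int THRESHOLD = 256;
    int n = hi - lo;

    if (n <= THRESHOLD) {
        for (int i = lo + threadIdx.x; i < hi; i += blockDim.x)
            process(data, i);
        return;
    }

    if (threadIdx.x == 0) {
        int mid = lo + n / 2;
        // Child grids launched from the device; one launching thread is enough here.
        subdivide<<<1, 32>>>(data, lo, mid);
        subdivide<<<1, 32>>>(data, mid, hi);
        cudaDeviceSynchronize();   // wait for the children before returning
    }
}
```

Device-side launch nesting is limited (24 levels by default), so the threshold also bounds how deep this recursion can go.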
Familiar Programming Model
Host and device use the same launch syntax; here kernel B launches the child grids X, Y, Z (schematic code as on the slide):

```cuda
// CPU side
int main() {
    float *data;
    setup(data);
    A<<< ... >>>(data);
    B<<< ... >>>(data);
    C<<< ... >>>(data);
    cudaDeviceSynchronize();
    return 0;
}

// GPU side
__global__ void B(float *data) {
    do_stuff(data);
    X<<< ... >>>(data);
    Y<<< ... >>>(data);
    Z<<< ... >>>(data);
    cudaDeviceSynchronize();
    do_more_stuff(data);
}
```

Code Example
The same kernel appeared on several consecutive slide builds; it is shown once here, with the slide's missing direction flag on the device-side copy restored, and the accompanying notes collected below.

```cuda
__device__ float buf[1024];

__global__ void cnp(float *data)
{
    int tid = threadIdx.x;
    if (tid % 2)
        buf[tid / 2] = data[tid] + data[tid + 1];
    __syncthreads();

    if (tid == 0) {
        launch<<<128, 256>>>(buf);
        cudaDeviceSynchronize();       // wait for the child grid
    }
    __syncthreads();

    cudaMemcpyAsync(data, buf, 1024, cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();
}
```

Notes (collected from the slide builds):
- Launch is per-thread and asynchronous.
- CUDA primitives are per-block: launched kernels and CUDA objects like streams are visible to all threads in a thread block and cannot be passed to a child kernel.
- Sync includes all launches by any thread in the block.
- cudaDeviceSynchronize() does not imply __syncthreads().
- Launch implies a memory barrier: the child sees the parent's state at the time of launch.
- Sync implies invalidate: the parent sees the child's writes after synchronization.
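The slides stop at the device code. As a hedged usage sketch (assuming the `cnp` kernel and its `launch` child are defined in the same file; buffer sizes carried over from the example above), the host side and build flags might look like this. Dynamic parallelism requires compute capability 3.5+ and compilation with relocatable device code linked against the device runtime.

```cuda
// Build (standard flags for dynamic parallelism, assumed file name):
//   nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt -o example
#include <cuda_runtime.h>

__global__ void launch(float *buf) { /* child kernel used by cnp above */ }

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 1025 * sizeof(float));   // one extra element: cnp reads data[tid + 1]
    cudaMemset(d_data, 0, 1025 * sizeof(float));

    cnp<<<1, 1024>>>(d_data);          // parent grid; launches its children on the device
    cudaDeviceSynchronize();           // host waits for the parent and all child grids

    cudaFree(d_data);
    return 0;
}
```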
Yanshan University Course Notes: Introduction to Parallel Computing
Instructor: Guo Dongliang
Hours: 32, of which 8 are lab hours and 16 are for the level-3 course project.

Chapter 1: Introduction

1.1 Overview
The single-processor computer is about to become an outdated concept. The following factors must be considered when setting out to improve computer performance:
(1) It is difficult to raise performance further by relying on a single processor alone; even an extremely powerful single processor would have unacceptable power consumption. A more practical way to improve performance is to use many simple processors together, which may reach performance thousands of times that of today's single-processor computers.
(2) Observation shows that, unless parallel processing techniques are used, a program may actually run more slowly on a newer single-processor computer than on an older one.
Programming tools that can detect the parallel structure of a program according to a given algorithm have yet to be developed.
Such an algorithm must be able to detect whether the dependencies between variables are regular; and, whether or not they are regular, it must be able to speed up execution by parallelizing some of the program's subtasks while still guaranteeing correctness (a short illustration follows this list).
(3) The key to improving the performance of future computers lies in developing parallel programs, which involves work at every level: algorithms, program development, operating systems, compilers, and hardware.
(4) Parallel computing must take into account not only the number of processors involved, but also the communication between processors and between processors and memory.
The final gain in computing performance depends both on how much room the algorithm leaves for improvement and, even more, on how efficiently the processors execute the algorithm.
The gain in communication performance, in turn, depends on how quickly data can be supplied to and retrieved from the processors.
(5) The memory system is always slower than the processor, and because only a single word can be read or written at a time, its bandwidth is also limited.
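As a minimal illustration of the dependence issue raised in point (2) above (the loop bodies and names here are invented for illustration, not taken from the text): the first loop carries a dependence between iterations and cannot simply be run in parallel as written, while the second has no cross-iteration dependence, so a parallelizing tool may legally map it onto independent threads, for example as a CUDA kernel.

```cuda
// Loop-carried dependence: iteration i needs the result of iteration i-1,
// so the iterations cannot simply be executed in parallel.
void prefix_sum(float *a, int n) {
    for (int i = 1; i < n; ++i)
        a[i] = a[i] + a[i - 1];
}

// No cross-iteration dependence: every iteration touches only its own element,
// so the loop can legally become one independent thread per element.
__global__ void scale_all(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = 2.0f * a[i];
}
```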
This book mainly covers parallel algorithms and the hardware architectures designed to implement them.
Hardware and software influence one another: the environment in which any software ultimately runs consists of the underlying hardware, built from processors, together with the corresponding operating system. The opening part of this chapter introduces some basic concepts, and we then discuss the methods for realizing them and the limitations involved.

1.2 Automatic Parallel Programming
We are all familiar with how an algorithm is implemented in software.
When programming, one does not need to know the details of the target computer system, because the compiler takes care of them; yet programming and debugging still follow the model of sequential execution on a single central processing unit (CPU). To implement parallel algorithms, on the other hand, hardware and software have to interact much more closely than we might expect.