CUDA Chinese Manual
CUDA C++ Programming Guide (Overview)

The programming guide to the CUDA model and interface.

Changes in this release:
‣ Use CUDA C++ instead of CUDA C to clarify that CUDA C++ is a C++ language extension, not a C language.
‣ General wording improvements throughout the guide.
‣ Fixed minor typos in code examples.
‣ Updated .
‣ Added reference to NVRTC in .
‣ Clarified linear memory address space size in .
‣ Clarified usage of CUDA_API_PER_THREAD_DEFAULT_STREAM when compiling with nvcc in .
‣ Updated to use cudaLaunchHostFunc instead of the deprecated cudaStreamAddCallback.
‣ Clarified that the 8-GPU peer limit only applies to non-NVSwitch enabled systems in .
‣ Added section .
‣ Added reference to CUDA Compatibility in .
‣ Extended list of types supported by the __ldg() function in .
‣ Documented support for unsigned short int with .
‣ Added section .
‣ Added removal notice for deprecated warp vote functions on devices with compute capability 7.x or higher in .
‣ Added documentation for __nv_aligned_device_malloc() in .
‣ Added documentation of cudaLimitStackSize in CUDA Dynamic Parallelism.
‣ Added synchronization performance guideline to CUDA Dynamic Parallelism.
‣ Documented performance improvement of roundf() and round(), and updated the Maximum ULP Error Table for Mathematical Functions.
‣ Updated Performance Guidelines for devices of compute capability 7.x.
‣ Clarified the Shared Memory carveout description in Compute Capability 7.x.
‣ Added missing Stream CUStream Mapping to .
‣ Added remark about driver and runtime API interoperability, highlighting cuDevicePrimaryCtxRetain(), in Driver API.
‣ Updated default value of CUDA_CACHE_MAXSIZE and removed no-longer-supported environment variables from .
‣ Added new Unified Memory sections: , , .
‣ Added section .
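The changelog item above replaces the deprecated cudaStreamAddCallback with cudaLaunchHostFunc. A minimal sketch of the newer call (the function name and message are illustrative; it assumes the CUDA runtime is available):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Host callback enqueued into a stream; it runs after all preceding work
// in the stream completes. Note: the callback must not make CUDA calls.
void myHostFn(void *userData) {
    printf("host function ran: %s\n", (const char *)userData);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // ... enqueue kernels or async copies on `stream` here ...
    const char *msg = "stream work done";
    cudaLaunchHostFunc(stream, myHostFn, (void *)msg);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```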
Abstract:
I. Introduction
II. Definition and background of CUDA
III. CUDA architecture
  1. Unified memory architecture
  2. Thread scheduling
  3. Communication mechanisms
IV. The CUDA programming model
  1. Programming languages and compilers
  2. Kernel functions
  3. Device variables and device memory
V. CUDA application cases
  1. Image processing
  2. Deep learning
  3. High-performance computing
VI. CUDA trends and outlook

Main text:

I. Introduction

As the demand for high-performance computing keeps growing, GPU parallel computing has become a research hotspot.
CUDA (Compute Unified Device Architecture), introduced by NVIDIA, provides a powerful platform for general-purpose parallel computing. This article introduces CUDA in detail, covering its definition, background, architecture, programming model, and application cases.
II. Definition and Background of CUDA

CUDA is a general-purpose parallel computing architecture that lets developers use NVIDIA GPUs for high-performance computing. CUDA grew out of NVIDIA's efforts to tap the computing potential of GPUs, and aims to provide powerful parallel computing capability for fields such as research, engineering, and entertainment.
III. CUDA Architecture

The CUDA architecture consists of three main parts: the unified memory architecture, thread scheduling, and communication mechanisms.
1. Unified memory architecture: CUDA's unified memory lets the CPU and GPU share memory, simplifying the programming of parallel applications.
2. Thread scheduling: CUDA supports several levels of thread organization and scheduling, such as warps and grids, enabling efficient parallel execution.
3. Communication mechanisms: CUDA provides several memory spaces, such as shared memory, global memory, and device memory, to suit different application scenarios.
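The unified memory mentioned above can be illustrated with a short sketch (illustrative only; it assumes a CUDA-capable GPU and the CUDA runtime):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread doubles one element of a managed array.
__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 256;
    float *data = nullptr;
    // Managed (unified) memory is visible to both the CPU and the GPU.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    doubleElements<<<(n + 127) / 128, 128>>>(data, n);
    cudaDeviceSynchronize();  // wait before the CPU reads the results

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}
```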
IV. The CUDA Programming Model

The CUDA programming model comprises the programming languages, the compiler, and kernel functions.
1. Programming languages and compilers: CUDA supports programming languages such as C, C++, and Python, and the NVIDIA compiler turns the source code into executables.
2. Kernel functions: a kernel is a special function that executes on the GPU; developers implement parallel algorithms by writing kernels.
3. Device variables and device memory: CUDA provides device variables and device memory, allowing developers to operate on data directly on the GPU.
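Putting the pieces above together, a minimal kernel-plus-device-memory sketch might look as follows (illustrative; assumes the CUDA runtime):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds the corresponding elements of a and b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float ha[1024], hb[1024], hc[1024];
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2 * i; }

    // Allocate device memory and copy the inputs to the GPU.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("hc[10] = %f\n", hc[10]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```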
V. CUDA Application Cases

CUDA is widely used in many fields. Typical cases include:
1. Image processing: CUDA can process image and video data efficiently, for example in real-time image rendering and image recognition.
Turing Compatibility Guide for CUDA Applications
Application Note

Table of Contents
Chapter 1. Turing Compatibility
  1.1. About this Document
  1.2. Application Compatibility on Turing
  1.3. Compatibility between Volta and Turing
  1.4. Verifying Turing Compatibility for Existing Applications
    1.4.1. Applications Using CUDA Toolkit 8.0 or Earlier
    1.4.2. Applications Using CUDA Toolkit 9.x
    1.4.3. Applications Using CUDA Toolkit 10.0
  1.5. Building Applications with Turing Support
    1.5.1. Applications Using CUDA Toolkit 8.0 or Earlier
    1.5.2. Applications Using CUDA Toolkit 9.x
    1.5.3. Applications Using CUDA Toolkit 10.0
    1.5.4. Independent Thread Scheduling Compatibility
Appendix A. Revision History

Chapter 1. Turing Compatibility

1.1. About this Document

This application note, Turing Compatibility Guide for CUDA Applications, is intended to help developers ensure that their NVIDIA® CUDA® applications will run on GPUs based on the NVIDIA® Turing Architecture. This document provides guidance to developers who are already familiar with programming in CUDA C++ and want to make sure that their software applications are compatible with Turing.

1.2. Application Compatibility on Turing

The NVIDIA CUDA C++ compiler, nvcc, can be used to generate both architecture-specific cubin files and forward-compatible PTX versions of each kernel. Each cubin file targets a specific compute-capability version and is forward-compatible only with GPU architectures of the same major version number. For example, cubin files that target compute capability 3.0 are supported on all compute-capability 3.x (Kepler) devices but are not supported on compute-capability 5.x (Maxwell) or 6.x (Pascal) devices.
For this reason, to ensure forward compatibility with GPU architectures introduced after the application has been released, it is recommended that all applications include PTX versions of their kernels.

Note: CUDA Runtime applications containing both cubin and PTX code for a given architecture will automatically use the cubin by default, keeping the PTX path strictly for forward-compatibility purposes.

Applications that already include PTX versions of their kernels should work as-is on Turing-based GPUs. Applications that only support specific GPU architectures via cubin files, however, will need to be updated to provide Turing-compatible PTX or cubins.

1.3. Compatibility between Volta and Turing

The Turing architecture is based on Volta's instruction set architecture (ISA 7.0), extending it with new instructions. As a consequence, any binary that runs on Volta will be able to run on Turing (forward compatibility), but a Turing binary will not be able to run on Volta. Please note that Volta kernels using more than 64 KB of shared memory (via the explicit opt-in, see the CUDA C++ Programming Guide) will not be able to launch on Turing, as they would exceed Turing's shared memory capacity.

Most applications compiled for Volta should run efficiently on Turing, except where the application makes heavy use of the Tensor Cores, or where recompiling would allow use of new Turing-specific instructions. Volta's Tensor Core instructions can only reach half of the peak performance on Turing. Recompiling explicitly for Turing is thus recommended.

1.4. Verifying Turing Compatibility for Existing Applications

The first step is to check that Turing-compatible device code (at least PTX) is compiled into the application. The following sections show how to accomplish this for applications built with different CUDA Toolkit versions.

1.4.1.
Applications Using CUDA Toolkit 8.0 or Earlier

CUDA applications built using CUDA Toolkit versions 2.1 through 8.0 are compatible with Turing as long as they are built to include PTX versions of their kernels. To test that PTX JIT is working for your application, you can do the following:
‣ Download and install the latest driver from https:///drivers.
‣ Set the environment variable CUDA_FORCE_PTX_JIT=1.
‣ Launch your application.

When starting a CUDA application for the first time with the above environment flag, the CUDA driver will JIT-compile the PTX for each CUDA kernel that is used into native cubin code. If you set the environment variable above and then launch your program and it works properly, then you have successfully verified Turing compatibility.

Note: Be sure to unset the CUDA_FORCE_PTX_JIT environment variable when you are done testing.

1.4.2. Applications Using CUDA Toolkit 9.x

CUDA applications built using CUDA Toolkit 9.x are compatible with Turing as long as they are built to include kernels in either Volta-native cubin format (see Compatibility between Volta and Turing) or PTX format (see Applications Using CUDA Toolkit 8.0 or Earlier), or both.

1.4.3. Applications Using CUDA Toolkit 10.0

CUDA applications built using CUDA Toolkit 10.0 are compatible with Turing as long as they are built to include kernels in Volta-native or Turing-native cubin format (see Compatibility between Volta and Turing), or PTX format (see Applications Using CUDA Toolkit 8.0 or Earlier), or both.

1.5. Building Applications with Turing Support

When a CUDA application launches a kernel, the CUDA Runtime determines the compute capability of each GPU in the system and uses this information to automatically find the best matching cubin or PTX version of the kernel that is available. If a cubin file supporting the architecture of the target GPU is available, it is used; otherwise, the CUDA Runtime will load the PTX and JIT-compile that PTX to the GPU's native cubin format before launching it.
If neither is available, then the kernel launch will fail. The method used to build your application with either native cubin or at least PTX support for Turing depends on the version of the CUDA Toolkit used.

The main advantages of providing native cubins are as follows:
‣ It saves the end user the time it takes to JIT-compile kernels that are available only as PTX. All kernels compiled into the application must have native binaries at load time or else they will be built just-in-time from PTX, including kernels from all libraries linked to the application, even if those kernels are never launched by the application. Especially when using large libraries, this JIT compilation can take a significant amount of time. The CUDA driver will cache the cubins generated as a result of the PTX JIT, so this is mostly a one-time cost for a given user, but it is time best avoided whenever possible.
‣ PTX JIT-compiled kernels often cannot take advantage of architectural features of newer GPUs, meaning that native-compiled code may be faster or of greater accuracy.

1.5.1. Applications Using CUDA Toolkit 8.0 or Earlier

The compilers included in CUDA Toolkit 8.0 or earlier generate cubin files native to earlier NVIDIA architectures such as Maxwell and Pascal, but they cannot generate cubin files native to the Volta or Turing architectures. To allow support for Volta, Turing, and future architectures when using version 8.0 or earlier of the CUDA Toolkit, the compiler must generate a PTX version of each kernel.

Below are compiler settings that could be used to build mykernel.cu to run on Maxwell or Pascal devices natively and on Turing devices via PTX JIT. Note that compute_XX refers to a PTX version and sm_XX refers to a cubin version. The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both.
Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one must be PTX to provide Turing compatibility.

Windows

    nvcc.exe -ccbin "C:\vs2010\VC\bin"
        -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT"
        -gencode=arch=compute_50,code=sm_50
        -gencode=arch=compute_52,code=sm_52
        -gencode=arch=compute_60,code=sm_60
        -gencode=arch=compute_61,code=sm_61
        -gencode=arch=compute_61,code=compute_61
        --compile -o "Release\mykernel.cu.obj" "mykernel.cu"

Mac/Linux

    /usr/local/cuda/bin/nvcc
        -gencode=arch=compute_50,code=sm_50
        -gencode=arch=compute_52,code=sm_52
        -gencode=arch=compute_60,code=sm_60
        -gencode=arch=compute_61,code=sm_61
        -gencode=arch=compute_61,code=compute_61
        -O2 -o mykernel.o -c mykernel.cu

Alternatively, you may be familiar with the simplified nvcc command-line option -arch=sm_XX, which is a shorthand equivalent to the more explicit -gencode= command-line options used above. -arch=sm_XX expands to the following:

    -gencode=arch=compute_XX,code=sm_XX
    -gencode=arch=compute_XX,code=compute_XX

However, while the -arch=sm_XX command-line option does result in inclusion of a PTX back-end target by default, it can only specify a single target cubin architecture at a time, and it is not possible to use multiple -arch= options on the same nvcc command line, which is why the examples above use -gencode= explicitly.

1.5.2. Applications Using CUDA Toolkit 9.x

With versions 9.x of the CUDA Toolkit, nvcc can generate cubin files native to the Volta architecture (compute capability 7.0).
When using CUDA Toolkit 9.x, to ensure that nvcc will generate cubin files for all recent GPU architectures as well as a PTX version for forward compatibility with future GPU architectures, specify the appropriate -gencode= parameters on the nvcc command line as shown in the examples below.

Windows

    nvcc.exe -ccbin "C:\vs2010\VC\bin"
        -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT"
        -gencode=arch=compute_50,code=sm_50
        -gencode=arch=compute_52,code=sm_52
        -gencode=arch=compute_60,code=sm_60
        -gencode=arch=compute_61,code=sm_61
        -gencode=arch=compute_70,code=sm_70
        -gencode=arch=compute_70,code=compute_70
        --compile -o "Release\mykernel.cu.obj" "mykernel.cu"

Mac/Linux

    /usr/local/cuda/bin/nvcc
        -gencode=arch=compute_50,code=sm_50
        -gencode=arch=compute_52,code=sm_52
        -gencode=arch=compute_60,code=sm_60
        -gencode=arch=compute_61,code=sm_61
        -gencode=arch=compute_70,code=sm_70
        -gencode=arch=compute_70,code=compute_70
        -O2 -o mykernel.o -c mykernel.cu

Note that compute_XX refers to a PTX version and sm_XX refers to a cubin version. The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one should be PTX to provide compatibility with future architectures.

Also, note that CUDA 9.0 removes support for compute capability 2.x (Fermi) devices. Any compute_2x and sm_2x flags need to be removed from your compiler commands.

1.5.3. Applications Using CUDA Toolkit 10.0

With version 10.0 of the CUDA Toolkit, nvcc can generate cubin files native to the Turing architecture (compute capability 7.5).
When using CUDA Toolkit 10.0, to ensure that nvcc will generate cubin files for all recent GPU architectures as well as a PTX version for forward compatibility with future GPU architectures, specify the appropriate -gencode= parameters on the nvcc command line as shown in the examples below.

Windows

    nvcc.exe -ccbin "C:\vs2010\VC\bin"
        -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT"
        -gencode=arch=compute_50,code=sm_50
        -gencode=arch=compute_52,code=sm_52
        -gencode=arch=compute_60,code=sm_60
        -gencode=arch=compute_61,code=sm_61
        -gencode=arch=compute_70,code=sm_70
        -gencode=arch=compute_75,code=sm_75
        -gencode=arch=compute_75,code=compute_75
        --compile -o "Release\mykernel.cu.obj" "mykernel.cu"

Mac/Linux

    /usr/local/cuda/bin/nvcc
        -gencode=arch=compute_50,code=sm_50
        -gencode=arch=compute_52,code=sm_52
        -gencode=arch=compute_60,code=sm_60
        -gencode=arch=compute_61,code=sm_61
        -gencode=arch=compute_70,code=sm_70
        -gencode=arch=compute_75,code=sm_75
        -gencode=arch=compute_75,code=compute_75
        -O2 -o mykernel.o -c mykernel.cu

Note that compute_XX refers to a PTX version and sm_XX refers to a cubin version. The arch= clause of the -gencode= command-line option to nvcc specifies the front-end compilation target and must always be a PTX version. The code= clause specifies the back-end compilation target and can either be cubin or PTX or both. Only the back-end target version(s) specified by the code= clause will be retained in the resulting binary; at least one should be PTX to provide compatibility with future architectures.

1.5.4. Independent Thread Scheduling Compatibility

The Volta and Turing architectures feature Independent Thread Scheduling among threads in a warp. If the developer made assumptions about warp-synchronicity [1], this feature can alter the set of threads participating in the executed code compared to previous architectures. Please see Compute Capability 7.0 in the CUDA C++ Programming Guide for details and corrective actions.
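For instance, code that once relied on implicit warp-synchronous execution should use the explicit *_sync primitives instead. A sketch of a warp-level reduction written this way (illustrative; assumes a CUDA-capable device):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Tree reduction within one warp. Pre-Volta code often relied on implicit
// lockstep execution; with Independent Thread Scheduling, the *_sync
// variants with an explicit participation mask are required.
__global__ void warpSum(const int *in, int *out) {
    int v = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0) *out = v;
}

int main() {
    int h[32], *din, *dout, result;
    for (int i = 0; i < 32; ++i) h[i] = i;
    cudaMalloc(&din, sizeof(h));
    cudaMalloc(&dout, sizeof(int));
    cudaMemcpy(din, h, sizeof(h), cudaMemcpyHostToDevice);
    warpSum<<<1, 32>>>(din, dout);
    cudaMemcpy(&result, dout, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", result);  // 0+1+...+31 = 496
    cudaFree(din); cudaFree(dout);
    return 0;
}
```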
To aid migration, Volta and Turing developers can opt in to the Pascal scheduling model with the following combination of compiler options:

    nvcc -arch=compute_60 -code=sm_70 ...

[1] Warp-synchronous refers to an assumption that threads in the same warp are synchronized at every instruction and can, for example, communicate values without explicit synchronization.

Appendix A. Revision History

Version 1.0
‣ Initial public release.

Version 1.1
‣ Use CUDA C++ instead of CUDA C/C++.
‣ Updated references to the CUDA C++ Programming Guide and CUDA C++ Best Practices Guide.

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation ("NVIDIA") makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice. Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer ("Terms of Sale"). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document.
No contractual obligations are formed either directly or indirectly by this document.

OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© -2022 NVIDIA Corporation & affiliates. All rights reserved.
NVIDIA Corporation | 2788 San Tomas Expressway, Santa Clara, CA 95051
Abstract:
I. CUDA overview
  1. NVIDIA company background
  2. The rise of general-purpose GPU computing
  3. Definition and role of CUDA
II. CUDA architecture
  1. Parallel computing model
  2. Thread blocks and grids
  3. Shared memory and global memory
  4. Kernel functions and device functions
III. The CUDA programming model
  1. Device driver
  2. Application programming interfaces
  3. Compiler and runtime environment
  4. Programming language support
IV. CUDA application cases
  1. Image processing and computation
  2. Numerical computing and scientific simulation
  3. Data analysis and machine learning
  4. Deep learning and artificial intelligence
V. CUDA performance optimization
  1. Performance evaluation and debugging
  2. Data locality and load balancing
  3. The memory hierarchy and memory management
  4. Thread scheduling and load scheduling
VI. CUDA trends and outlook
  1. Co-evolution of hardware and software
  2. Competition and cooperation with other parallel computing frameworks
  3. The current state of CUDA development in China
  4. Prospects for CUDA in future computing

Main text:

CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture developed by NVIDIA, designed to exploit the GPU's parallel processing power for general-purpose computation.
As GPU performance and process technology have continued to advance, CUDA has gradually become a major force in parallel computing.

The CUDA architecture is built on a parallel computing model that divides a computation into many independent execution units through thread blocks and grids. On this foundation, CUDA offers a concise, easy-to-use programming model, including the device driver, application programming interfaces, and the compiler and runtime environment, so that several programming languages can be used for parallel computing.

CUDA is widely applied in image processing, numerical computing, data analysis, and other fields. With the rapid development of deep learning and artificial intelligence, CUDA is playing an ever more important role in these areas. To make full use of CUDA's performance, developers need to master a range of optimization techniques, including performance evaluation and debugging, data locality and load balancing, memory-hierarchy and memory management, and thread and load scheduling.

In China, the application and development of CUDA have also achieved notable results. Sustained attention and investment from government, industry, and academia have produced a series of internationally influential research results in the CUDA field.
NVIDIA Ampere GPU Architecture Tuning Guide
Application Note

Table of Contents
Chapter 1. NVIDIA Ampere GPU Architecture Tuning Guide
  1.1. NVIDIA Ampere GPU Architecture
  1.2. CUDA Best Practices
  1.3. Application Compatibility
  1.4. NVIDIA Ampere GPU Architecture Tuning
    1.4.1. Streaming Multiprocessor
      1.4.1.1. Occupancy
      1.4.1.2. Asynchronous Data Copy from Global Memory to Shared Memory
      1.4.1.3. Hardware Acceleration for Split Arrive/Wait Barrier
      1.4.1.4. Warp level support for Reduction Operations
      1.4.1.5. Improved Tensor Core Operations
      1.4.1.6. Improved FP32 throughput
    1.4.2. Memory System
      1.4.2.1. Increased Memory Capacity and High Bandwidth Memory
      1.4.2.2. Increased L2 capacity and L2 Residency Controls
      1.4.2.3. Unified Shared Memory/L1/Texture Cache
    1.4.3. Third Generation NVLink
Appendix A. Revision History

Chapter 1. NVIDIA Ampere GPU Architecture Tuning Guide

1.1. NVIDIA Ampere GPU Architecture

The NVIDIA Ampere GPU architecture is NVIDIA's latest architecture for CUDA compute applications. The NVIDIA Ampere GPU architecture retains and extends the same CUDA programming model provided by previous NVIDIA GPU architectures such as Turing and Volta, and applications that follow the best practices for those architectures should typically see speedups on the NVIDIA A100 GPU without any code changes. This guide summarizes the ways that an application can be fine-tuned to gain additional speedups by leveraging the NVIDIA Ampere GPU architecture's features. [1]

For further details on the programming features discussed in this guide, please refer to the CUDA C++ Programming Guide.

1.2. CUDA Best Practices

The performance guidelines and best practices described in the CUDA C++ Programming Guide and the CUDA C++ Best Practices Guide apply to all CUDA-capable GPU architectures.
Programmers must primarily focus on following those recommendations to achieve the best performance. The high-priority recommendations from those guides are as follows:
‣ Find ways to parallelize sequential code.
‣ Minimize data transfers between the host and the device.
‣ Adjust kernel launch configuration to maximize device utilization.
‣ Ensure global memory accesses are coalesced.
‣ Minimize redundant accesses to global memory whenever possible.
‣ Avoid long sequences of diverged execution by threads within the same warp.

[1] Throughout this guide, Kepler refers to devices of compute capability 3.x, Maxwell refers to devices of compute capability 5.x, Pascal refers to devices of compute capability 6.x, Volta refers to devices of compute capability 7.0, Turing refers to devices of compute capability 7.5, and NVIDIA Ampere GPU Architecture refers to devices of compute capability 8.x.

1.3. Application Compatibility

Before addressing specific performance tuning issues covered in this guide, refer to the NVIDIA Ampere GPU Architecture Compatibility Guide for CUDA Applications to ensure that your application is compiled in a way that is compatible with the NVIDIA Ampere GPU Architecture.

1.4. NVIDIA Ampere GPU Architecture Tuning

1.4.1. Streaming Multiprocessor

The NVIDIA Ampere GPU architecture's Streaming Multiprocessor (SM) provides the following improvements over Volta and Turing.

1.4.1.1. Occupancy

The maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64), and other factors influencing warp occupancy are:
‣ The register file size is 64K 32-bit registers per SM.
‣ The maximum number of registers per thread is 255.
‣ The maximum number of thread blocks per SM is 32 for devices of compute capability 8.0 (i.e., A100 GPUs) and 16 for GPUs with compute capability 8.6.
‣ For devices of compute capability 8.0 (i.e., A100 GPUs) shared memory capacity per SM is 164 KB, a 71% increase compared to V100's capacity of 96 KB.
‣ For GPUs with compute capability 8.6, shared memory capacity per SM is 100 KB.
‣ For devices of compute capability 8.0 (i.e., A100 GPUs) the maximum shared memory per thread block is 163 KB. For GPUs with compute capability 8.6, maximum shared memory per thread block is 99 KB.

Overall, developers can expect similar occupancy as on Volta without changes to their application.

1.4.1.2. Asynchronous Data Copy from Global Memory to Shared Memory

The NVIDIA Ampere GPU architecture adds hardware acceleration for copying data from global memory to shared memory. These copy instructions are asynchronous with respect to computation and allow users to explicitly control overlap of compute with data movement from global memory into the SM. These instructions also avoid using extra registers for memory copies and can also bypass the L1 cache. This new feature is exposed via the pipeline API in CUDA. For more information please refer to the section on Async Copy in the CUDA C++ Programming Guide.

1.4.1.3. Hardware Acceleration for Split Arrive/Wait Barrier

The NVIDIA Ampere GPU architecture adds hardware acceleration for a split arrive/wait barrier in shared memory. These barriers can be used to implement fine-grained thread control, producer-consumer computation pipelines, and divergent code patterns in CUDA. These barriers can also be used alongside the asynchronous copy. For more information on the arrive/wait barriers refer to the Arrive/Wait Barrier section in the CUDA C++ Programming Guide.

1.4.1.4. Warp level support for Reduction Operations

The NVIDIA Ampere GPU architecture adds native support for warp-wide reduction operations for 32-bit signed and unsigned integer operands.
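A minimal sketch of the new warp-wide add (illustrative; it requires a device of compute capability 8.0 or higher):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread contributes one value; __reduce_add_sync() returns the sum
// across all threads named in the mask (here: the full warp).
// Requires compute capability 8.0 or higher.
__global__ void warpAdd(const unsigned *in, unsigned *out) {
    unsigned total = __reduce_add_sync(0xffffffff, in[threadIdx.x]);
    if (threadIdx.x == 0) *out = total;
}

int main() {
    unsigned h[32], *din, *dout, result = 0;
    for (int i = 0; i < 32; ++i) h[i] = 1;  // 32 ones, so the sum is 32
    cudaMalloc(&din, sizeof(h));
    cudaMalloc(&dout, sizeof(unsigned));
    cudaMemcpy(din, h, sizeof(h), cudaMemcpyHostToDevice);
    warpAdd<<<1, 32>>>(din, dout);
    cudaMemcpy(&result, dout, sizeof(unsigned), cudaMemcpyDeviceToHost);
    printf("warp sum = %u\n", result);
    cudaFree(din); cudaFree(dout);
    return 0;
}
```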
The warp-wide reduction operations support arithmetic add, min, and max operations on 32-bit signed and unsigned integers, and bitwise and, or, and xor operations on 32-bit unsigned integers. For more details on the new warp-wide reduction operations refer to Warp Reduce Functions in the CUDA C++ Programming Guide.

1.4.1.5. Improved Tensor Core Operations

The NVIDIA Ampere GPU architecture includes new Third Generation Tensor Cores that are more powerful than the Tensor Cores used in Volta and Turing SMs. The new Tensor Cores use a larger base matrix size and add powerful new math modes including:
‣ Support for FP64 Tensor Core, using new DMMA instructions.
‣ Support for Bfloat16 Tensor Core, through HMMA instructions. The Bfloat16 format is especially effective for DL training scenarios. Bfloat16 provides an 8-bit exponent (i.e., the same range as FP32), a 7-bit mantissa, and 1 sign bit.
‣ Support for TF32 Tensor Core, through HMMA instructions. TF32 is a new 19-bit Tensor Core format that can be easily integrated into programs for more accurate DL training than 16-bit HMMA formats. TF32 provides an 8-bit exponent, a 10-bit mantissa, and 1 sign bit.
‣ Support for bitwise AND, along with the bitwise XOR introduced in Turing, through BMMA instructions.

[Table: evolution of matrix instruction sizes and supported data types for Tensor Cores across GPU architecture generations.]

For more details on the new Tensor Core operations refer to the Warp Matrix Multiply section in the CUDA C++ Programming Guide.

1.4.1.6. Improved FP32 throughput

Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.

1.4.2. Memory System

1.4.2.1.
Increased Memory Capacity and High Bandwidth Memory

The NVIDIA A100 GPU increases the HBM2 memory capacity from 32 GB in the V100 GPU to 40 GB in the A100 GPU. Along with the increased memory capacity, the bandwidth is increased by 72%, from 900 GB/s on Volta V100 to 1550 GB/s on A100.

1.4.2.2. Increased L2 capacity and L2 Residency Controls

The NVIDIA Ampere GPU architecture increases the capacity of the L2 cache to 40 MB in Tesla A100, which is 7x larger than Tesla V100. Along with the increased capacity, the bandwidth of the L2 cache to the SMs is also increased. The NVIDIA Ampere GPU architecture allows CUDA users to control the persistence of data in the L2 cache. For more information on the persistence of data in the L2 cache, refer to the section on managing the L2 cache in the CUDA C++ Programming Guide.

1.4.2.3. Unified Shared Memory/L1/Texture Cache

The NVIDIA A100 GPU, based on compute capability 8.0, increases the maximum capacity of the combined L1 cache, texture cache, and shared memory to 192 KB, 50% larger than the L1 cache in the NVIDIA V100 GPU. The combined L1 cache capacity for GPUs with compute capability 8.6 is 128 KB.

In the NVIDIA Ampere GPU architecture, the portion of the L1 cache dedicated to shared memory (known as the carveout) can be selected at runtime as in previous architectures such as Volta, using cudaFuncSetAttribute() with the attribute cudaFuncAttributePreferredSharedMemoryCarveout. The NVIDIA A100 GPU supports shared memory capacities of 0, 8, 16, 32, 64, 100, 132, or 164 KB per SM. GPUs with compute capability 8.6 support shared memory capacities of 0, 8, 16, 32, 64, or 100 KB per SM.

CUDA reserves 1 KB of shared memory per thread block. Hence, the A100 GPU enables a single thread block to address up to 163 KB of shared memory, and GPUs with compute capability 8.6 can address up to 99 KB of shared memory in a single thread block.
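The runtime carveout selection and the large per-block shared memory described above can be sketched as follows (illustrative values; assumes an A100-class device):

```cuda
#include <cuda_runtime.h>

// Kernel that uses dynamically sized shared memory.
__global__ void fillFromShared(float *out) {
    extern __shared__ float smem[];
    smem[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
    // Hint: prefer the maximum shared-memory carveout for this kernel.
    cudaFuncSetAttribute(fillFromShared,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         100);  // percentage of the maximum

    // Explicit opt-in is required for dynamic shared-memory allocations
    // above 48 KB (up to 163 KB per thread block on A100, per the text).
    cudaFuncSetAttribute(fillFromShared,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         163 * 1024);

    float *out;
    cudaMalloc(&out, 256 * sizeof(float));
    // Third launch parameter: dynamic shared-memory bytes for this launch.
    fillFromShared<<<1, 256, 64 * 1024>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```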
To maintain architectural compatibility, static shared memory allocations remain limited to 48 KB, and an explicit opt-in is also required to enable dynamic allocations above this limit. See the CUDA C++ Programming Guide for details.

Like Volta, the NVIDIA Ampere GPU architecture combines the functionality of the L1 and texture caches into a unified L1/texture cache which acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. Another benefit of its union with shared memory, similar to Volta's L1, is improvement in terms of both latency and bandwidth.

1.4.3. Third Generation NVLink

The third generation of NVIDIA's high-speed NVLink interconnect is implemented in A100 GPUs, which significantly enhances multi-GPU scalability, performance, and reliability with more links per GPU, much faster communication bandwidth, and improved error-detection and recovery features. The third generation NVLink has the same bi-directional data rate of 50 GB/s per link, but uses half the number of signal pairs to achieve this bandwidth. Therefore, the total number of links available is increased to twelve in A100, versus six in V100, yielding 600 GB/s bidirectional bandwidth versus 300 GB/s for V100.

NVLink operates transparently within the existing CUDA model. Transfers between NVLink-connected endpoints are automatically routed through NVLink, rather than PCIe. The cudaDeviceEnablePeerAccess() API call remains necessary to enable direct transfers (over either PCIe or NVLink) between GPUs. cudaDeviceCanAccessPeer() can be used to determine if peer access is possible between any pair of GPUs.

In the NVIDIA Ampere GPU architecture, remote NVLink accesses go through a Link TLB on the remote GPU. This Link TLB has a reach of 64 GB to the remote GPU's memory.
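The peer-access calls mentioned above can be combined as in the following sketch (illustrative; it assumes a system with at least two GPUs, indices 0 and 1):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Check and enable direct peer-to-peer access from GPU 0 to GPU 1.
// Transfers are routed over NVLink when the GPUs are NVLink-connected,
// otherwise over PCIe.
int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument: flags, must be 0
        printf("Peer access from device 0 to device 1 enabled.\n");
    } else {
        printf("Direct peer access between devices 0 and 1 is not available.\n");
    }
    return 0;
}
```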
Applications with remote random accesses may want to constrain the remotely accessed region to 64 GB for each peer GPU.

Appendix A. Revision History

Version 1.1
‣ Initial Public Release
‣ Added support for compute capability 8.6
VESA DisplayPort
DisplayPort and DisplayPort Compliance Logo, DisplayPort Compliance Logo for Dual-mode Sources, and DisplayPort Compliance Logo for Active Cables are trademarks owned by the Video Electronics Standards Association in the United States and other countries.

HDMI
HDMI, the HDMI logo, and High-Definition Multimedia Interface are trademarks or registered trademarks of HDMI Licensing LLC.

Copyright
© -2021 NVIDIA Corporation. All rights reserved.
NVIDIA Corporation | 2788 San Tomas Expressway, Santa Clara, CA 95051
CUDA Developer Guide for NVIDIA Optimus Platforms
Reference Guide

Table of Contents
Chapter 1. Introduction to Optimus
Chapter 2. CUDA Applications and Optimus
Chapter 3. Querying for a CUDA Device
3.1. Applications without Graphics Interoperability
3.2. Applications with Graphics Interoperability
3.3. CUDA Support with DirectX Interoperability
3.4. CUDA Support with OpenGL Interoperability
3.5. Control Panel Settings and Driver Updates

Chapter 1. Introduction to Optimus
NVIDIA® Optimus™ is a revolutionary technology that delivers great battery life and great performance, in a way that simply works. It automatically and instantaneously uses the best tool for the job: the high-performance NVIDIA GPU for GPU-compute applications, video, and 3D games, and low-power integrated graphics for applications like Office, web surfing, or email. The result is long-lasting battery life without sacrificing great graphics performance, delivering an experience that is fully automatic and behind the scenes.
When the GPU can provide an increase in performance, functionality, or quality over the IGP for an application, the NVIDIA driver will enable the GPU. When the user launches an application, the NVIDIA driver will recognize whether the application being run can benefit from using the GPU. If the application can benefit from running on the GPU, the GPU is powered up from an idle state and is given all rendering calls.
Using NVIDIA’s Optimus technology, when the discrete GPU is handling all the rendering duties, the final image output to the display is still handled by the Intel integrated graphics processor (IGP).
In effect, the IGP is only being used as a simple display controller, resulting in a seamless, flicker-free experience with no need to reboot.
When the user closes all applications that benefit from the GPU, the discrete GPU is powered off and the Intel IGP handles both rendering and display calls to conserve power and provide the highest possible battery life.
The beauty of Optimus is that it leverages standard industry protocols and APIs to work. From relying on standard Microsoft APIs when communicating with the Intel IGP driver, to utilizing the PCI Express bus to transfer the GPU’s output to the Intel IGP, there are no proprietary hoops to jump through.
This document provides guidance to CUDA developers and explains how NVIDIA CUDA APIs can be used to query for GPU capabilities in Optimus systems. It is strongly recommended to follow these guidelines to ensure CUDA applications are compatible with all notebooks featuring Optimus.

Chapter 2. CUDA Applications and Optimus
Optimus systems all have an Intel IGP and an NVIDIA GPU. Display heads may be electrically connected to the IGP or the GPU. When a display is connected to a GPU head, all rendering and compute on that display happens on the NVIDIA GPU, just like it would on a typical discrete system. When the display is connected to an IGP head, the NVIDIA driver decides whether an application on that display should be rendered on the GPU or the IGP. If the driver decides to run the application on the NVIDIA GPU, the final rendered frames are copied to the IGP’s display pipeline for scanout. Please consult the Optimus white paper for more details on this behavior: /object/optimus_technology.html.
CUDA developers should understand this scheme because it affects how applications should query for GPU capabilities.
For example, a CUDA application launched on the LVDS panel of an Optimus notebook (which is an IGP-connected display) would see that the primary display device is Intel’s graphics adapter, a chip not capable of running CUDA. In this case, it is important for the application to detect the existence of a second device in the system, the NVIDIA GPU, and then create a CUDA context on this CUDA-capable device even when it is not the display device.
For applications that require the use of Direct3D/CUDA or OpenGL/CUDA interop, there are restrictions that developers need to be aware of when creating a Direct3D or OpenGL context that will interoperate with CUDA. CUDA Support with DirectX Interoperability and CUDA Support with OpenGL Interoperability in this guide discuss this topic in more detail.

Chapter 3. Querying for a CUDA Device
Depending on the application, there are different ways to query for a CUDA device.

3.1. Applications without Graphics Interoperability
For CUDA applications, finding the best CUDA-capable device is done through the CUDA API: the functions cudaGetDeviceProperties (CUDA Runtime API) and cuDeviceComputeCapability (CUDA Driver API) are used.
Refer to the CUDA Samples deviceQuery and deviceQueryDrv for more details. The next two code samples illustrate the best method of choosing the CUDA-capable device with the best performance.

// CUDA Runtime API version
inline int cutGetMaxGflopsDeviceId()
{
    int current_device = 0, sm_per_multiproc = 0;
    int max_compute_perf = 0, max_perf_device = 0;
    int device_count = 0, best_SM_arch = 0;
    // Approximate CUDA cores per SM for SM major versions 0 through 3
    int arch_cores_sm[4] = { 1, 8, 32, 192 };
    cudaDeviceProp deviceProp;
    cudaGetDeviceCount(&device_count);

    // Find the best major SM architecture among the GPU devices
    while (current_device < device_count) {
        cudaGetDeviceProperties(&deviceProp, current_device);
        if (deviceProp.major > 0 && deviceProp.major < 9999) {
            best_SM_arch = max(best_SM_arch, deviceProp.major);
        }
        current_device++;
    }

    // Find the best CUDA-capable GPU device
    current_device = 0;
    while (current_device < device_count) {
        cudaGetDeviceProperties(&deviceProp, current_device);
        if (deviceProp.major == 9999 && deviceProp.minor == 9999) {
            sm_per_multiproc = 1;
        } else if (deviceProp.major <= 3) {
            sm_per_multiproc = arch_cores_sm[deviceProp.major];
        } else { // Device has SM major > 3
            sm_per_multiproc = arch_cores_sm[3];
        }
        int compute_perf = deviceProp.multiProcessorCount *
                           sm_per_multiproc * deviceProp.clockRate;
        if (compute_perf > max_compute_perf) {
            // If we found a GPU with SM major > 3, search only those
            if (best_SM_arch > 3) {
                // If this device matches best_SM_arch, choose it; otherwise pass
                if (deviceProp.major == best_SM_arch) {
                    max_compute_perf = compute_perf;
                    max_perf_device = current_device;
                }
            } else {
                max_compute_perf = compute_perf;
                max_perf_device = current_device;
            }
        }
        ++current_device;
    }

    cudaGetDeviceProperties(&deviceProp, max_perf_device);
    printf("\nDevice %d: \"%s\"\n", max_perf_device, deviceProp.name);
    printf("Compute Capability : %d.%d\n", deviceProp.major, deviceProp.minor);
    return max_perf_device;
}

// CUDA Driver API version
inline int cutilDrvGetMaxGflopsDeviceId()
{
    CUdevice current_device = 0, max_perf_device = 0;
    int device_count = 0, sm_per_multiproc = 0;
    int max_compute_perf = 0, best_SM_arch = 0;
    int major = 0, minor = 0, multiProcessorCount, clockRate;
    // Approximate CUDA cores per SM for SM major versions 0 through 3
    int arch_cores_sm[4] = { 1, 8, 32, 192 };
    cuInit(0);
    cuDeviceGetCount(&device_count);

    // Find the best major SM architecture among the GPU devices
    while (current_device < device_count) {
        cuDeviceComputeCapability(&major, &minor, current_device);
        if (major > 0 && major < 9999) {
            best_SM_arch = MAX(best_SM_arch, major);
        }
        current_device++;
    }

    // Find the best CUDA-capable GPU device
    current_device = 0;
    while (current_device < device_count) {
        cuDeviceGetAttribute(&multiProcessorCount,
                             CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, current_device);
        cuDeviceGetAttribute(&clockRate,
                             CU_DEVICE_ATTRIBUTE_CLOCK_RATE, current_device);
        cuDeviceComputeCapability(&major, &minor, current_device);
        if (major == 9999 && minor == 9999) {
            sm_per_multiproc = 1;
        } else if (major <= 3) {
            sm_per_multiproc = arch_cores_sm[major];
        } else {
            sm_per_multiproc = arch_cores_sm[3];
        }
        int compute_perf = multiProcessorCount * sm_per_multiproc * clockRate;
        if (compute_perf > max_compute_perf) {
            // If we found a GPU with SM major > 2, search only those
            if (best_SM_arch > 2) {
                // If this device matches best_SM_arch, choose it; otherwise pass
                if (major == best_SM_arch) {
                    max_compute_perf = compute_perf;
                    max_perf_device = current_device;
                }
            } else {
                max_compute_perf = compute_perf;
                max_perf_device = current_device;
            }
        }
        ++current_device;
    }

    char name[100];
    cuDeviceGetName(name, 100, max_perf_device);
    cuDeviceComputeCapability(&major, &minor, max_perf_device);
    printf("\nDevice %d: \"%s\"\n", max_perf_device, name);
    printf(" Compute Capability : %d.%d\n", major, minor);
    return max_perf_device;
}

3.2. Applications with Graphics Interoperability
For CUDA applications that use the CUDA interop capability with Direct3D or OpenGL, developers should be aware of the restrictions and requirements to ensure compatibility with the Optimus platform.
For CUDA applications that meet these descriptions:
1. The application requires CUDA interop capability with either Direct3D or OpenGL.
2. The application is not directly linked against cuda.lib or cudart.lib, or it uses LoadLibrary to dynamically load nvcuda.dll or cudart*.dll and uses GetProcAddress to retrieve function addresses from nvcuda.dll or cudart*.dll.
A Direct3D or OpenGL context has to be created before the CUDA context, and that Direct3D or OpenGL context needs to be passed into CUDA. See the sample calls below. Your application will need to create the graphics context first; the sample code below does not illustrate this.

Refer to the CUDA Samples simpleD3D9 and simpleD3D9Texture for details.
// CUDA/Direct3D9 interop
// You will need to create the D3D9 context first
IDirect3DDevice9 *g_pD3D9Device;   // Initialize D3D9 rendering device
// After creation, bind your D3D9 context to CUDA
cudaD3D9SetDirect3DDevice(g_pD3D9Device);

Refer to the CUDA Samples simpleD3D10 and simpleD3D10Texture for details.
// CUDA/Direct3D10 interop
// You will need to create a D3D10 context first
ID3D10Device *g_pD3D10Device;      // Initialize D3D10 rendering device
// After creation, bind your D3D10 context to CUDA
cudaD3D10SetDirect3DDevice(g_pD3D10Device);

Refer to the CUDA Sample simpleD3D11Texture for details.
// CUDA/Direct3D11 interop
// You will need to create the D3D11 context first
ID3D11Device *g_pD3D11Device;      // Initialize D3D11 rendering device
// After creation, bind your D3D11 context to CUDA
cudaD3D11SetDirect3DDevice(g_pD3D11Device);

Refer to the CUDA Samples simpleGL and postProcessGL for details.
// CUDA/OpenGL interop
// You will need to create the OpenGL context first
// After creation, bind your OpenGL context to CUDA
cudaGLSetGLDevice(deviceID);

On an Optimus platform, this type of CUDA application will not work properly. If a Direct3D or OpenGL graphics context is created before any CUDA API calls are used or initialized, the graphics context may be created on the Intel IGP.
The problem here is that the Intel IGP does not allow graphics interoperability with CUDA running on the NVIDIA GPU. If the graphics context is created on the NVIDIA GPU, then everything will work.
The solution is to create an application profile in the NVIDIA Control Panel. With an application profile, the Direct3D or OpenGL context will always be created on the NVIDIA GPU when the application gets launched. These application profiles can be created manually through the NVIDIA Control Panel (see Control Panel Settings and Driver Updates for details). Contact NVIDIA support to have this application profile added to the drivers, so future NVIDIA driver releases and updates will include it.

3.3. CUDA Support with DirectX Interoperability
The following steps, with code samples, illustrate what is necessary to initialize your CUDA application to interoperate with Direct3D9.
Note: An application profile may also be needed.
1. Create a Direct3D9 context:
// Create the D3D object
if ((g_pD3D = Direct3DCreate9(D3D_SDK_VERSION)) == NULL)
    return E_FAIL;
2. Find the CUDA device that is also a Direct3D device:
// Find the first CUDA capable device; you may also want to check the
// number of cores with cudaGetDeviceProperties for the best
// CUDA capable GPU (see the previous function for details)
for (g_iAdapter = 0; g_iAdapter < g_pD3D->GetAdapterCount(); g_iAdapter++) {
    D3DCAPS9 caps;
    if (FAILED(g_pD3D->GetDeviceCaps(g_iAdapter, D3DDEVTYPE_HAL, &caps)))
        continue;   // Adapter doesn't support Direct3D
    D3DADAPTER_IDENTIFIER9 ident;
    int device;
    g_pD3D->GetAdapterIdentifier(g_iAdapter, 0, &ident);
    cudaD3D9GetDevice(&device, ident.DeviceName);
    if (cudaSuccess == cudaGetLastError())
        break;
}
3. Create the Direct3D device:
// Create the D3DDevice
if (FAILED(g_pD3D->CreateDevice(g_iAdapter, D3DDEVTYPE_HAL, hWnd,
                                D3DCREATE_HARDWARE_VERTEXPROCESSING,
                                &g_d3dpp, &g_pD3DDevice)))
    return E_FAIL;
4. Bind the CUDA context to the Direct3D device:
// Now we need to bind a CUDA context to the DX9 device
cudaD3D9SetDirect3DDevice(g_pD3DDevice);

3.4. CUDA Support with OpenGL Interoperability
The following steps, with code samples, illustrate what is needed to initialize your CUDA application to interoperate with OpenGL.
Note: An application profile may also be needed.
1. Create an OpenGL context and OpenGL window:
// Create GL context
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_ALPHA | GLUT_DOUBLE | GLUT_DEPTH);
glutInitWindowSize(window_width, window_height);
glutCreateWindow("OpenGL Application");
// Default initialization of the back buffer
glClearColor(0.5, 0.5, 0.5, 1.0);
2. Create the CUDA context and bind it to the OpenGL context:
// Initialize the CUDA context (on top of the GL context)
int dev = 0, deviceCount;
cudaGetDeviceCount(&deviceCount);
cudaDeviceProp deviceProp;
for (int i = 0; i < deviceCount; i++) {
    cudaGetDeviceProperties(&deviceProp, i);
    // Inspect deviceProp here to choose which device to use as dev
}
cudaGLSetGLDevice(dev);

3.5. Control Panel Settings and Driver Updates
This section is for developers who create custom application-specific profiles. It describes how end users can be sure that their CUDA applications run on Optimus and how they can receive the latest updates on drivers and application profiles.
‣ Profile updates are frequently sent to end-user systems, similar to how virus definitions work. Systems are automatically updated with new profiles in the background with no user intervention required. Contact NVIDIA developer support to create the appropriate driver application profiles. This will ensure that your CUDA application is compatible on Optimus and included in these automatic updates.
‣ End users can create their own application profiles in the NVIDIA Control Panel and set when to switch and when not to switch to the NVIDIA GPU per application. The screenshot below shows you where to find these application profiles.
‣ NVIDIA regularly updates the graphics drivers through the NVIDIA Verde Driver Program.
End users will be able to download drivers which include application-specific Optimus profiles for NVIDIA-powered notebooks from: /object/notebook_drivers.html

Copyright
© 2013-2022 NVIDIA Corporation & affiliates. All rights reserved.
NVIDIA Corporation | 2788 San Tomas Expressway, Santa Clara, CA 95051
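The device-selection code in Chapter 3 of the Optimus guide above scores each GPU as multiprocessor count × cores-per-SM × clock rate and prefers devices with the newest SM architecture. A minimal sketch of that scoring logic in Python, slightly simplified (it always restricts the search to the newest architecture present) and using hypothetical device records in place of real CUDA queries:

```python
# Hypothetical device records standing in for cudaGetDeviceProperties() results;
# the field names and values are invented for this sketch.
devices = [
    {"name": "GPU 0", "major": 3, "multiprocessors": 8,  "clock_khz": 1000000},
    {"name": "GPU 1", "major": 5, "multiprocessors": 16, "clock_khz": 1100000},
]

# Approximate cores per SM for SM major versions 0 through 3,
# mirroring the guide's arch_cores_sm table.
ARCH_CORES_SM = [1, 8, 32, 192]

def cores_per_sm(major):
    # Majors above 3 reuse the last known entry, as in the guide's samples.
    return ARCH_CORES_SM[major] if major <= 3 else ARCH_CORES_SM[3]

def max_gflops_device(devs):
    # First restrict to the newest SM major architecture present...
    best_arch = max(d["major"] for d in devs)
    candidates = [d for d in devs if d["major"] == best_arch]
    # ...then pick the highest multiprocessors * cores/SM * clock estimate.
    return max(candidates,
               key=lambda d: d["multiprocessors"] * cores_per_sm(d["major"]) * d["clock_khz"])

print(max_gflops_device(devices)["name"])  # GPU 1
```

The estimate is only a rough throughput proxy, but it is usually enough to keep a CUDA application off the Intel IGP and on the discrete GPU.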
The most detailed Python CUDA tutorial in the Chinese-language community. This series is the second installment of an introduction to NVIDIA GPUs. It covers the basic workflow and core concepts of CUDA programming, using Python Numba to write GPU-parallel programs.
To better understand the GPU hardware architecture, readers are encouraged to read my first article.
1. GPU hardware and basic concepts: the differences between CPUs and GPUs, GPU architecture, and a brief introduction to the CUDA software stack.
2. Getting started with GPU programming: CUDA kernel functions; the Thread, Block, and Grid concepts; and simple parallel computation with Python Numba.
3. Advanced GPU programming: optimization techniques.
4. GPU programming in practice: using Python Numba to solve complex problems.
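The Thread, Block, and Grid concepts mentioned in item 2 can be previewed without any GPU: every thread computes a global index from its block and thread coordinates and processes one element. Below is a plain-Python emulation of 1-D indexing; this is an illustrative sketch of the execution model, not Numba's actual API:

```python
def run_grid(kernel, blocks_per_grid, threads_per_block, *args):
    # Serially emulate what the GPU does in parallel: every (block, thread)
    # pair executes the same kernel with its own coordinates.
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, *args)

def vector_add(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(out):                         # guard against excess threads
        out[i] = a[i] + b[i]

a = list(range(10))
b = [10] * 10
out = [0] * 10
# 3 blocks of 4 threads = 12 threads cover 10 elements
run_grid(vector_add, 3, 4, a, b, out)
print(out)  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

On a real GPU the two loops disappear: all (block, thread) pairs run concurrently, which is why the bounds guard on i is essential whenever the launched thread count exceeds the data size.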
A CUDA tutorial for Python users. Python is currently the most popular programming language and is widely used in deep learning, financial modeling, and scientific and engineering computing.
As an interpreted language, however, it is often criticized for its slow execution speed.
The Numba library, developed by the well-known Python distributor Anaconda, gives programmers CPU and GPU programming tools for Python that run tens of times faster than native Python, and sometimes much more.
With Numba for GPU programming you get: 1. Python's simple, easy-to-use syntax; 2. very fast development speed; 3. multiplied hardware acceleration.
To preserve Python's ease of use and rapid development while still achieving parallel speedups, this series approaches GPU programming from a Python perspective.
For an introduction to Numba itself, see my other article.
Even more exciting, Numba provides a GPU simulator, so even if you do not have a GPU machine at hand, you can use the simulator to start learning GPU programming!

First Steps in GPU Programming
As the saying goes, before the troops move, the provisions go first.
Before starting GPU programming, you need to be clear about a few concepts and have the relevant tools ready.
CUDA is a GPU programming framework that NVIDIA provides to developers; with it, programmers can write parallel programs with relative ease.
As mentioned in the first article of this series, the CPU and main memory are called the host, while the GPU and its memory (video memory) are called the device. The CPU cannot directly read device memory, and the GPU cannot directly read main memory; the host and the device must communicate over a bus.
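The host/device split described above can be mimicked in plain Python to make the memory model concrete: the "device" only ever works on its own copy of the data, and results must be copied back over the "bus". This is an illustrative toy (FakeDevice and its methods are invented for this sketch); real transfers go through CUDA APIs such as cudaMemcpy, or Numba's cuda.to_device and copy_to_host:

```python
import copy

class FakeDevice:
    """Toy stand-in for a GPU with its own separate memory; not a real CUDA API."""
    def __init__(self):
        self._mem = {}          # "device memory", invisible to the host

    def to_device(self, name, host_array):
        # Host -> device transfer over the "bus": the device gets a copy.
        self._mem[name] = copy.deepcopy(host_array)

    def kernel_scale(self, name, factor):
        # A "kernel" may only touch device memory, never host variables.
        self._mem[name] = [x * factor for x in self._mem[name]]

    def copy_to_host(self, name):
        # Device -> host transfer: the host gets a copy of the result.
        return copy.deepcopy(self._mem[name])

host_data = [1, 2, 3]
dev = FakeDevice()
dev.to_device("x", host_data)   # host -> device
dev.kernel_scale("x", 10)       # compute on the device only
result = dev.copy_to_host("x")  # device -> host
print(host_data, result)        # [1, 2, 3] [10, 20, 30]
```

Note that mutating device memory never changes the host list; forgetting the copy back to the host is a classic source of "my kernel did nothing" bugs.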
DU-05349-001_v9.2 | June 2018
Installation and Verification on Windows

TABLE OF CONTENTS
Chapter 1. Introduction
1.1. System Requirements
1.1.1. x86 32-bit Support
1.2. About This Document
Chapter 2. Installing CUDA Development Tools
2.1. Verify You Have a CUDA-Capable GPU
2.2. Download the NVIDIA CUDA Toolkit
2.3. Install the CUDA Software
2.3.1. Uninstalling the CUDA Software
2.4. Use a Suitable Driver Model
2.5. Verify the Installation
2.5.1. Running the Compiled Examples
Chapter 3. Compiling CUDA Programs
3.1. Compiling Sample Projects
3.2. Sample Projects
3.3. Build Customizations for New Projects
3.4. Build Customizations for Existing Projects
Chapter 4. Additional Considerations

CUDA® is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).
CUDA was developed with several design goals in mind:
‣ Provide a small set of extensions to standard programming languages, like C, that enable a straightforward implementation of parallel algorithms. With CUDA C/C++, programmers can focus on the task of parallelizing the algorithms rather than spending time on their implementation.
‣ Support heterogeneous computation where applications use both the CPU and GPU. Serial portions of applications are run on the CPU, and parallel portions are offloaded to the GPU. As such, CUDA can be incrementally applied to existing applications. The CPU and GPU are treated as separate devices that have their own memory spaces. This configuration also allows simultaneous computation on the CPU and GPU without contention for memory resources.
CUDA-capable GPUs have hundreds of cores that can collectively run thousands of computing threads. These cores have shared resources, including a register file and a shared memory.
The on-chip shared memory allows parallel tasks running on these cores to share data without sending it over the system memory bus.
This guide will show you how to install and check the correct operation of the CUDA development tools.

1.1. System Requirements
To use CUDA on your system, you will need the following installed:
‣ A CUDA-capable GPU
‣ A supported version of Microsoft Windows
‣ A supported version of Microsoft Visual Studio
‣ The NVIDIA CUDA Toolkit (available at /cuda-downloads)
The next two tables list the currently supported Windows operating systems and compilers.
Table 1 Windows Operating System Support in CUDA 9.2
Table 2 Windows Compiler Support in CUDA 9.2
x86_32 support is limited. See the x86 32-bit Support section for details.

1.1.1. x86 32-bit Support
Native development using the CUDA Toolkit on x86_32 is unsupported. Deployment and execution of CUDA applications on x86_32 is still supported, but is limited to use with GeForce GPUs. To create 32-bit CUDA applications, use the cross-development capabilities of the CUDA Toolkit on x86_64.
Support for developing and running x86 32-bit applications on x86_64 Windows is limited to use with:
‣ GeForce GPUs
‣ CUDA Driver
‣ CUDA Runtime (cudart)
‣ CUDA Math Library (math.h)
‣ CUDA C++ Compiler (nvcc)
‣ CUDA Development Tools

1.2. About This Document
This document is intended for readers familiar with Microsoft Windows operating systems and the Microsoft Visual Studio environment. You do not need previous experience with CUDA or experience with parallel computation.
Basic instructions can be found in the Quick Start Guide. Read on for more detailed instructions.
The setup of CUDA development tools on a system running the appropriate version of Windows consists of a few simple steps:
‣ Verify the system has a CUDA-capable GPU.
‣ Download the NVIDIA CUDA Toolkit.
‣ Install the NVIDIA CUDA Toolkit.
‣ Test that the installed software runs correctly and communicates with the hardware.

2.1. Verify You Have a CUDA-Capable GPU
You can verify that you have a CUDA-capable GPU through the Display Adapters section in the Windows Device Manager. Here you will find the vendor name and model of your graphics card(s). If you have an NVIDIA card that is listed in http:// /cuda-gpus, that GPU is CUDA-capable. The Release Notes for the CUDA Toolkit also contain a list of supported products.
The Windows Device Manager can be opened via the following steps:
1. Open a run window from the Start Menu
2. Run: control /name Microsoft.DeviceManager

2.2. Download the NVIDIA CUDA Toolkit
The NVIDIA CUDA Toolkit is available at /cuda-downloads. Choose the platform you are using and one of the following installer formats:
1. Network Installer: A minimal installer which later downloads packages required for installation. Only the packages selected during the selection phase of the installer are downloaded. This installer is useful for users who want to minimize download time.
2. Full Installer: An installer which contains all the components of the CUDA Toolkit and does not require any further download. This installer is useful for systems which lack network access and for enterprise deployment.
The CUDA Toolkit installs the CUDA driver and tools needed to create, build and run a CUDA application as well as libraries, header files, CUDA samples source code, and other resources.
Download Verification
The download can be verified by comparing the MD5 checksum posted at http:// /cuda-downloads/checksums with that of the downloaded file. If either of the checksums differ, the downloaded file is corrupt and needs to be downloaded again.
To calculate the MD5 checksum of the downloaded file, follow the instructions at http:// /kb/889768.

2.3. Install the CUDA Software
Before installing the toolkit, you should read the Release Notes, as they provide details on installation and software functionality.
The driver and toolkit must be installed for CUDA to function.
If you have not installed a stand-alone driver, install the driver from the NVIDIA CUDA Toolkit.
The installation may fail if Windows Update starts after the installation has begun. Wait until Windows Update is complete and then try the installation again.
Graphical Installation
Install the CUDA software by executing the CUDA installer and following the on-screen prompts.
Silent Installation
The installer can be executed in silent mode by executing the package with the -s flag. Additional parameters can be passed which will install specific subpackages instead of all packages. See the table below for a list of all the subpackage names.
Table 3 Possible Subpackage Names
For example, to install only the compiler and driver components:
<PackageName>.exe -s nvcc_9.2 Display.Driver
Extracting and Inspecting the Files Manually
Sometimes it may be desirable to extract or inspect the installable files directly, such as in enterprise deployment, or to browse the files before installation. The full installation package can be extracted using a decompression tool which supports the LZMA compression method, such as 7-zip or WinZip.
Once extracted, the CUDA Toolkit files will be in the CUDAToolkit folder, and similarly for the CUDA Samples and CUDA Visual Studio Integration. Within each directory is a .dll and .nvi file that can be ignored as they are not part of the installable files.
Accessing the files in this manner does not set up any environment settings, such as variables or Visual Studio integration. This is intended for enterprise-level deployment.

2.3.1. Uninstalling the CUDA Software
All subpackages can be uninstalled through the Windows Control Panel by using the Programs and Features widget.

2.4. Use a Suitable Driver Model
On Windows 7 and later, the operating system provides two driver models under which the NVIDIA driver may operate:
‣ The WDDM driver model is used for display devices.
‣ The Tesla Compute Cluster (TCC) mode of the NVIDIA driver is available for non-display devices such as NVIDIA Tesla GPUs and the GeForce GTX Titan GPUs; it uses the Windows WDM driver model.
The TCC driver mode provides a number of advantages for CUDA applications on GPUs that support this mode. For example:
‣ TCC eliminates the timeouts that can occur when running under WDDM due to the Windows Timeout Detection and Recovery mechanism for display devices.
‣ TCC allows the use of CUDA with Windows Remote Desktop, which is not possible for WDDM devices.
‣ TCC allows the use of CUDA from within processes running as Windows services, which is not possible for WDDM devices.
‣ TCC reduces the latency of CUDA kernel launches.
TCC is enabled by default on most recent NVIDIA Tesla GPUs. To check which driver mode is in use and/or to switch driver modes, use the nvidia-smi tool that is included with the NVIDIA driver installation (see nvidia-smi -h for details).
Keep in mind that when TCC mode is enabled for a particular GPU, that GPU cannot be used as a display device.
NVIDIA GeForce GPUs (excluding GeForce GTX Titan GPUs) do not support TCC mode.

2.5. Verify the Installation
Before continuing, it is important to verify that the CUDA Toolkit can find and communicate correctly with the CUDA-capable hardware. To do this, you need to compile and run some of the included sample programs.

2.5.1. Running the Compiled Examples
The version of the CUDA Toolkit can be checked by running nvcc -V in a Command Prompt window. You can display a Command Prompt window by going to:
Start > All Programs > Accessories > Command Prompt
CUDA Samples include sample programs in both source and compiled form.
To verify a correct configuration of the hardware and software, it is highly recommended that you run the deviceQuery program located at
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.2\bin\win64\Release
This assumes that you used the default installation directory structure. If CUDA is installed and configured correctly, the output should look similar to Figure 1.
Figure 1 Valid Results from deviceQuery CUDA Sample
The exact appearance and the output lines might be different on your system. The important outcomes are that a device was found, that the device(s) match what is installed in your system, and that the test passed.
If a CUDA-capable device and the CUDA driver are installed but deviceQuery reports that no CUDA-capable devices are present, ensure the device and driver are properly installed.
Running the bandwidthTest program, located in the same directory as deviceQuery above, ensures that the system and the CUDA-capable device are able to communicate correctly. The output should resemble Figure 2.
Figure 2 Valid Results from bandwidthTest CUDA Sample
The device name (second line) and the bandwidth numbers vary from system to system. The important items are the second line, which confirms a CUDA device was found, and the second-to-last line, which confirms that all necessary tests passed.
If the tests do not pass, make sure you do have a CUDA-capable NVIDIA GPU on your system and make sure it is properly installed.
To see a graphical representation of what CUDA can do, run the sample Particles executable at
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.2\bin\win64\Release
The project files in the CUDA Samples have been designed to provide simple, one-click builds of the programs that include all source code. To build the Windows projects (for release or debug mode), use the provided *.sln solution files for Microsoft Visual Studio 2010, 2012, or 2013.
You can use either the solution files located in each of the examples' directories in
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.2\<category>\<sample_name>
or the global solution files Samples*.sln located in
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.2
CUDA Samples are organized according to <category>. Each sample is organized into one of the following folders: 0_Simple, 1_Utilities, 2_Graphics, 3_Imaging, 4_Finance, 5_Simulations, 6_Advanced, 7_CUDALibraries.

3.1. Compiling Sample Projects
The bandwidthTest project is a good sample project to build and run. It is located in the NVIDIA Corporation\CUDA Samples\v9.2\1_Utilities\bandwidthTest directory.
If you elected to use the default installation location, the output is placed in CUDA Samples\v9.2\bin\win64\Release. Build the program using the appropriate solution file and run the executable. If all works correctly, the output should be similar to Figure 2.

3.2. Sample Projects
The sample projects come in two configurations, debug and release (where release contains no debugging information), and different Visual Studio projects.
A few of the example projects require some additional setup. The simpleD3D9 example requires the system to have a Direct3D SDK (June 2010 or later) installed and the Visual C++ directory paths (located in Tools > Options...) properly configured. Consult the Direct3D documentation for additional details.
These sample projects also make use of the $CUDA_PATH environment variable to locate where the CUDA Toolkit and the associated .props files are.
The environment variable is set automatically using the Build Customization CUDA 9.2.props file, and is installed automatically as part of the CUDA Toolkit installation process.
Table 4 CUDA Visual Studio .props locations
You can reference this CUDA 9.2.props file when building your own CUDA applications.

3.3. Build Customizations for New Projects
When creating a new CUDA application, the Visual Studio project file must be configured to include CUDA build customizations. To accomplish this, click File > New > Project... > NVIDIA > CUDA, then select a template for your CUDA Toolkit version. For example, selecting the "CUDA 9.2 Runtime" template will configure your project for use with the CUDA 9.2 Toolkit. The new project is technically a C++ project (.vcxproj) that is preconfigured to use NVIDIA's Build Customizations. All standard capabilities of Visual Studio C++ projects will be available.
To specify a custom CUDA Toolkit location, under CUDA C/C++, select Common, and set the CUDA Toolkit Custom Dir field as desired. Note that the selected toolkit must match the version of the Build Customizations.

3.4. Build Customizations for Existing Projects
When adding CUDA acceleration to existing applications, the relevant Visual Studio project files must be updated to include CUDA build customizations. This can be done using one of the following two methods:
1. Open the Visual Studio project, right-click on the project name, and select Build Customizations..., then select the CUDA Toolkit version you would like to target.
2. Alternatively, you can configure your project always to build with the most recently installed version of the CUDA Toolkit. First add a CUDA build customization to your project as above. Then, right-click on the project name and select Properties. Under CUDA C/C++, select Common, and set the CUDA Toolkit Custom Dir field to $(CUDA_PATH).
Note that the $(CUDA_PATH) environment variable is set by the installer.
While Option 2 will allow your project to automatically use any new CUDA Toolkit version you may install in the future, selecting the toolkit version explicitly, as in Option 1, is often better in practice, because if there are new CUDA configuration options added to the build customization rules accompanying the newer toolkit, you would not see those new options using Option 2.
If you use the $(CUDA_PATH) environment variable to target a version of the CUDA Toolkit for building, and you perform an installation or uninstallation of any version of the CUDA Toolkit, you should validate that the $(CUDA_PATH) environment variable points to the correct installation directory of the CUDA Toolkit for your purposes. You can access the value of the $(CUDA_PATH) environment variable via the following steps:
1. Open a run window from the Start Menu
2. Run: control sysdm.cpl
3. Select the "Advanced" tab at the top of the window
4. Click "Environment Variables" at the bottom of the window
Files which contain CUDA code must be marked as CUDA C/C++ files. This can be done when adding the file by right-clicking the project you wish to add the file to, selecting Add > New Item, selecting NVIDIA CUDA 9.2 > Code > CUDA C/C++ File, and then selecting the file you wish to add.
Note for advanced users: If you wish to try building your project against a newer CUDA Toolkit without making changes to any of your project files, go to the Visual Studio command prompt, change the current directory to the location of your project, and execute a command such as the following:
msbuild <projectname.extension> /t:Rebuild /p:CudaToolkitDir="drive:/path/to/new/toolkit/"
Now that you have CUDA-capable hardware and the NVIDIA CUDA Toolkit installed, you can examine and enjoy the numerous included programs.
To begin using CUDA to accelerate the performance of your own applications, consult the CUDA C Programming Guide, located in the CUDA Toolkit documentation directory.

A number of helpful development tools are included in the CUDA Toolkit, or are available for download from the NVIDIA Developer Zone, to assist you as you develop your CUDA programs, such as NVIDIA® Nsight™ Visual Studio Edition, NVIDIA Visual Profiler, and cuda-memcheck.

For technical support on programming questions, consult and participate in the developer forums at /cuda/.

Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright © 2009-2018 NVIDIA Corporation. All rights reserved.
CUDA: Supercomputing for the Masses

By Rob Farber

CUDA lets you use familiar programming concepts while developing software that runs on a GPU.

Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has conducted research on massively parallel computing at several national laboratories and is a co-founder of several startups. He can be reached at rob.farber@.
Are you interested in getting orders-of-magnitude performance increases over standard multi-core processors while programming in a high-level language such as C? And would you like that capability to scale across multiple devices? Many people (myself included) have attained this level of performance and scalability by using NVIDIA's CUDA (short for Compute Unified Device Architecture) to write inexpensive multi-threaded GPU programs. I emphasize "programming" deliberately, because CUDA is an architecture that works for you; it does not force your work into a limited set of performance libraries. With CUDA, you can apply your talents to designing software for the best performance on multi-threaded hardware, and have fun doing it, because figuring out the right computational mapping is genuinely interesting and the software development environment is reasonable and intuitive.

Part One

This article, the first in a series, introduces the capabilities of CUDA (through working code) and the thought process that will help you map applications onto multi-threaded hardware (such as a GPU) to obtain large performance increases. Of course, not every problem maps efficiently onto multi-threaded hardware, so I will discuss which do and which do not, and give you a common-sense feel for which mappings will work well.
"CUDA programming" and "GPGPU programming" are not the same thing (although CUDA runs on GPUs). Previously, writing software for a GPU meant programming in the language of the GPU. A friend of mine once described this process as pulling data from your elbow to get it in front of your eyes. CUDA lets you develop software that runs on a GPU using familiar programming concepts. It can also deliver better performance than going through a graphics-layer API, because it compiles the software directly to the hardware (GPU assembly language, for example) and so avoids that layer's overhead. You can pick whichever CUDA device suits you: Figures 1 and 2 show the CUDA N-body simulation program running on the discrete GPU of a laptop and of a desktop, respectively.
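To make the "familiar programming concepts" claim concrete, here is a minimal sketch of a complete CUDA C program (my own illustration, not code from the article). The kernel body is an ordinary C statement; the loop over elements is replaced by one GPU thread per element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element: a plain C loop body, with the
// loop index replaced by a per-thread index.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;                  // device (GPU) pointers
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int block = 256;                         // threads per block
    int grid  = (n + block - 1) / block;     // enough blocks to cover n
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", h_c[100]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Everything here except the `__global__` qualifier, the `cudaMalloc`/`cudaMemcpy` calls, and the `<<<grid, block>>>` launch syntax is standard C.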
CUDA Manual

Contents

1. CUDA Overview
2. CUDA Installation and Configuration
3. CUDA Programming Model
4. CUDA Memory Management
5. CUDA Thread Organization
6. CUDA Performance Optimization
7. CUDA Application Examples

Main Text
CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture introduced by NVIDIA, designed to harness NVIDIA GPUs for high-performance computing. CUDA lets developers exploit the powerful compute capability of NVIDIA GPUs for tasks such as high-performance scientific computing, data processing, and graphics rendering.

Before using CUDA, you first need to install and configure it. CUDA supports several operating systems, including Windows, Linux, and macOS. Installation requires downloading the corresponding CUDA Toolkit, which includes the CUDA driver, the CUDA SDK, and related tools. During configuration, take care to set the path environment variables correctly, so that library files can be linked properly when you build CUDA programs.
The CUDA programming model covers thread organization, memory management, and data transfer. CUDA thread organization describes how threads are created, managed, and synchronized on the GPU. CUDA provides two levels of thread organization: thread blocks and grids. A thread block is the basic unit of execution on the GPU, and a grid is composed of multiple thread blocks.
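As an illustrative sketch of how the two levels combine (the kernel and image dimensions here are hypothetical examples, not taken from this manual), each thread computes its own coordinates from the built-in variables `blockIdx`, `blockDim`, and `threadIdx`:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one pixel of a width x height
// image. Two-dimensional thread blocks tile a two-dimensional grid.
__global__ void scale(float *img, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column in the image
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row in the image
    if (x < width && y < height)                    // edge blocks overhang
        img[y * width + x] *= factor;
}

int main() {
    int width = 1920, height = 1080;
    float *d_img;
    cudaMalloc(&d_img, width * height * sizeof(float));

    dim3 block(16, 16);                             // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,     // blocks across
              (height + block.y - 1) / block.y);    // blocks down
    scale<<<grid, block>>>(d_img, width, height, 0.5f);

    cudaFree(d_img);
    return 0;
}
```

The grid dimensions are rounded up so that every pixel is covered, which is why the kernel must bounds-check its coordinates.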
CUDA memory management primarily involves the use of global memory and constant memory. Global memory is used mainly to store data, while constant memory is used to store data that does not change during program execution.
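A small sketch of this distinction (the polynomial kernel is my own hypothetical example): coefficients that never change during the kernel go into `__constant__` memory, while the input and output arrays live in global memory.

```cuda
#include <cuda_runtime.h>

// Constant memory: read-only on the device, written by the host.
// It is small (64 KB on most devices) and cached, so a value read
// by every thread is broadcast cheaply.
__constant__ float coeff[4];

__global__ void polynomial(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // Horner evaluation of coeff[0] + coeff[1]*v + coeff[2]*v^2 + coeff[3]*v^3
        y[i] = coeff[0] + v * (coeff[1] + v * (coeff[2] + v * coeff[3]));
    }
}

int main() {
    const float h_coeff[4] = {1.0f, 2.0f, 0.0f, 0.0f};   // y = 1 + 2x
    // Constant memory is written via its symbol, not an ordinary pointer.
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

    const int n = 1024;
    float *d_x, *d_y;                                    // global memory
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    polynomial<<<(n + 255) / 256, 256>>>(d_x, d_y, n);

    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```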
Performance optimization is the key to making CUDA programs run fast. First, partition thread blocks and grids sensibly so that the GPU's compute resources are fully utilized. Second, allocate memory judiciously, avoiding both shortage and waste. In addition, optimize data transfers and synchronization operations to reduce latency during program execution.
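One classic illustration of these guidelines (my own sketch; it also uses shared memory, an on-chip scratchpad beyond the two memory spaces discussed above) is a block-level sum reduction: global-memory reads are coalesced, intermediate sums stay on-chip, and `__syncthreads()` keeps the block's threads in step.

```cuda
#include <cuda_runtime.h>

// Block-level sum reduction: each block of 256 threads reduces 256
// input elements to a single partial sum in out[blockIdx.x].
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                 // one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f; // coalesced global read
    __syncthreads();                            // wait for the whole tile

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();                        // sync before the next step
    }
    if (threadIdx.x == 0)                       // one result per block
        out[blockIdx.x] = tile[0];
}
```

The partial sums can then be reduced by a second, smaller kernel launch (or on the host), avoiding any synchronization across blocks.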
CUDA application examples span many fields, such as scientific computing, image processing, deep learning, and artificial intelligence. These examples show how CUDA can be used to achieve high-performance computing, providing strong support for research in those fields. In short, CUDA, as a general-purpose parallel computing architecture, gives developers tremendous computing power.