Compiler Usage Guidelines for 64-Bit Operating Systems on AMD64 Platforms Application Note
Get Started with the Intel® oneAPI DPC++/C++ Compiler

Contents
Chapter 1: Get Started with the Intel® oneAPI DPC++/C++ Compiler
  Get Started on Linux*
  Get Started on Windows*
  Compile and Execute Sample Code

The Intel® oneAPI DPC++/C++ Compiler provides optimizations that help your applications run faster on Intel® 64 architectures on Windows* and Linux*, with support for the latest C, C++, and SYCL language standards. This compiler produces optimized code that can run significantly faster by taking advantage of the ever-increasing core count and vector register width in Intel® Xeon® processors and compatible processors.

The compiler helps you boost application performance through superior optimizations and Single Instruction Multiple Data (SIMD) vectorization, integration with Intel® Performance Libraries, and support for the OpenMP* 5.0/5.1 parallel programming model. The Intel® oneAPI DPC++/C++ Compiler compiles C++-based SYCL* source files for a wide range of compute accelerators, and is part of the Intel® oneAPI Toolkits.

Notices and Disclaimers
Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Get Started on Linux*

Before You Begin

Set Environment Variables
Before you can use the compiler, you must first set the environment variables by sourcing the environment script using the initialization utility. This initializes all the tools in one step.
1. Determine your installation directory, <install_dir>:
   a. If your compiler was installed in the default location by a root user or sudo user, the compiler is installed under /opt/intel/oneapi. In this case, <install_dir> is /opt/intel/oneapi.
   b. For non-root users, your home directory under intel/oneapi is used. In this case, <install_dir> is $HOME/intel/oneapi.
   c. For cluster or enterprise users, your admin team may have installed the compilers on a shared network file system. Check with your local admin staff for the location of the installation (<install_dir>).
2. Source the environment-setting script for your shell:
   a. bash: source <install_dir>/setvars.sh intel64
   b. csh/tcsh: source <install_dir>/setvars.csh intel64

Install GPU Drivers or Plug-ins (Optional)
You can develop oneAPI applications using C++ and SYCL* that will run on Intel, AMD*, or NVIDIA* GPUs.
To develop and run applications for specific GPUs you must first install the corresponding drivers or plug-ins:
•To use an Intel GPU, install the latest Intel GPU drivers.
•To use an AMD GPU, install the oneAPI for AMD GPUs plugin from Codeplay.
•To use an NVIDIA GPU, install the oneAPI for NVIDIA GPUs plugin from Codeplay.

Option 1: Use the Command Line
Invoke the compiler using the following syntax:
{compiler driver} [option] file1 [file2...]
For example:
icpx hello-world.cpp
For SYCL compilation, use the -fsycl option with the C++ driver:
icpx -fsycl hello-world.cpp

NOTE When using -fsycl, -fsycl-targets=spir64 is assumed unless -fsycl-targets is explicitly set in the command.

If you are targeting an AMD or NVIDIA GPU, refer to the corresponding Codeplay plugin get started guide for detailed compilation instructions:
•oneAPI for AMD GPUs Get Started Guide
•oneAPI for NVIDIA GPUs Get Started Guide

Option 2: Use the Eclipse* CDT
Follow these steps to invoke the compiler from within the Eclipse* CDT.

Install the Intel® Compiler Eclipse CDT plugin:
1. Start Eclipse.
2. Select Help > Install New Software.
3. Select Add to open the Add Site dialog.
4. Select Archive, browse to the directory <install_dir>/compiler/<version>/linux/ide_support, select the .zip file whose name starts with compiler, then select OK.
5. Select the options beginning with Intel, select Next, then follow the installation instructions.
6. When asked if you want to restart Eclipse*, select Yes.

Build a new project or open an existing project:
1. Open an existing project or create a new project in Eclipse.
2. Right-click on Project > Properties > C/C++ Build > Tool Chain Editor.
3. Select Intel DPC++/C++ Compiler from the right panel.

Set build configurations:
1. Open an existing project in Eclipse.
2. Right-click on Project > Properties > C/C++ Build > Settings.
3. Create or manage build configurations in the right panel.

Build a Program From the Command Line
Use the following steps to test your compiler installation and build a program.
1. Use a text editor to create a file called hello-world.cpp with the following contents:
#include <iostream>
int main()
{
    std::cout << "Hello, world!\n";
    return 0;
}
2. Compile hello-world.cpp:
icpx hello-world.cpp -o hello-world
The -o option specifies the file name for the generated output.
3. Now you have an executable called hello-world which can be run and gives immediate feedback:
hello-world
This outputs:
Hello, world!

You can direct and control compilation with compiler options. For example, you can create the object file and output the final binary in two steps:
1. Compile hello-world.cpp:
icpx hello-world.cpp -c
The -c option prevents linking at this step.
2. Use the icpx compiler to link the resulting application object code and output an executable:
icpx hello-world.o -o hello-world
The -o option specifies the generated executable file name.
Refer to Compiler Options for details about available options.

© Codeplay Software Limited. Intel, the Intel logo, Codeplay, Codeplay logo and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Get Started on Windows*

Before You Begin

Set Environment Variables
The compiler integrates into the following versions of Microsoft Visual Studio*:
•Visual Studio 2022
•Visual Studio 2019
•Visual Studio 2017

NOTE Support for Microsoft Visual Studio 2017 is deprecated as of the Intel® oneAPI 2022.1 release and will be removed in a future release.

For full functionality within Visual Studio, including debugging and development, Visual Studio Community Edition or higher is required. Visual Studio Express Edition allows only command-line builds. For all versions, Microsoft C++ support must be selected as part of the Visual Studio install.
For Visual Studio 2017 and later, you must use a custom install to select this option.

You typically do not need to set the environment variables on Windows, as the compiler command-line window sets these variables for you automatically. If you need to set the environment variables, run the environment script as described in the suite-specific Get Started documentation. The default installation directory (<install_dir>) is C:\Program Files (x86)\Intel\oneAPI.

Install GPU Drivers (Optional)
To develop and run applications for Intel GPUs you must first install the latest Intel GPU drivers.

Option 1: Use the Command Line in Microsoft Visual Studio
Invoke the compiler using the following syntax:
{compiler driver} [option] file1 [file2...]
To invoke the compiler using the command line from within Microsoft Visual Studio, open a command prompt and enter your compilation command. For example:
icx hello-world.cpp
For SYCL compilation, use the -fsycl option with the C++ driver:
icx -fsycl hello-world.cpp

NOTE When using -fsycl, -fsycl-targets=spir64 is assumed unless -fsycl-targets is explicitly set in the command.

Option 2: Use Microsoft Visual Studio

Project Support for the Intel® DPC++/C++ Compiler in Microsoft Visual Studio
New Microsoft Visual Studio projects for DPC++ are automatically configured to use the Intel® oneAPI DPC++/C++ Compiler. New Microsoft Visual C++* (MSVC) projects must be manually configured to use the Intel® oneAPI DPC++/C++ Compiler.

NOTE .NET-based CLR C++ project types are not supported by the Intel® oneAPI DPC++/C++ Compiler.
The specific project types will vary depending on your version of Visual Studio, for example: CLR Class Library, CLR Console App, or CLR Empty Project.

Use the Intel® DPC++/C++ Compiler in Microsoft Visual Studio
Exact steps may vary depending on the version of Microsoft Visual Studio in use.
1. Create a Microsoft Visual C++ (MSVC) project or open an existing project.
2. In Solution Explorer, select the project(s) to build with the Intel® oneAPI DPC++/C++ Compiler.
3. Open Project > Properties.
4. In the left pane, expand the Configuration Properties category and select the General property page.
5. In the right pane, change the Platform Toolset to the compiler you want to use:
   •For C++ with SYCL, select Intel® oneAPI DPC++ Compiler.
   •For C/C++, there are two toolsets. Select Intel C++ Compiler <major version> (example: 2021) to invoke icx. Select Intel C++ Compiler <major.minor> (example: 19.2) to invoke icl.
   Alternatively, you can specify a compiler version as the toolset for all supported platforms and configurations of the selected project(s) by selecting Project > Intel Compiler > Use Intel oneAPI DPC++/C++ Compiler.
6. Rebuild, using either Build > Project only > Rebuild for a single project or Build > Rebuild Solution for a solution.

Select Compiler Version
If you have multiple versions of the Intel® oneAPI DPC++/C++ Compiler installed, you can select which version you want from the Compiler Selection dialog box:
1. Select a project, then go to Tools > Options > Intel Compilers and Libraries > <compiler> > Compilers, where <compiler> is C++ or DPC++.
2. Use the Selected Compiler drop-down menu to select the appropriate version of the compiler.
3. Select OK.

Switch Back to the Microsoft Visual Studio C++ Compiler
If your project is using the Intel® oneAPI DPC++/C++ Compiler, you can choose to switch back to the Microsoft Visual C++ compiler:
1. Select your project in Microsoft Visual Studio.
2. Right-click and select Intel Compiler > Use Visual C++ from the context menu.
This action updates the solution file to use the Microsoft Visual Studio C++ compiler. All configurations of affected projects are automatically cleaned unless you select Do not clean project(s). If you choose not to clean projects, you will need to rebuild updated projects to ensure all source files are compiled with the new compiler.

Build a Program From the Command Line
Use the following steps to test your compiler installation and build a program.
1. Use a text editor to create a file called hello-world.cpp with the following contents:
#include <iostream>
int main()
{
    std::cout << "Hello, world!\n";
    return 0;
}
2. Compile hello-world.cpp:
icx hello-world.cpp
3. Now you have an executable called hello-world.exe which can be run and gives immediate feedback:
hello-world.exe
This outputs:
Hello, world!

You can direct and control compilation with compiler options. For example, you can create the object file and output the final binary in two steps:
1. Compile hello-world.cpp:
icx hello-world.cpp /c /Fohello-world.obj
The /c option prevents linking at this step and /Fo specifies the name for the object file.
2. Use the icx compiler to link the resulting application object code and output an executable:
icx hello-world.obj /Fehello-world.exe
The /Fe option specifies the generated executable file name.
Refer to Compiler Options for details about available options.

Compile and Execute Sample Code
Multiple code samples are provided for the Intel® oneAPI DPC++/C++ Compiler so that you can explore compiler features and familiarize yourself with how it works. For example:

Sample Project: OpenMP Offload Sample
Description: The OpenMP* Offload sample demonstrates some of the new OpenMP Offload features supported by the Intel® oneAPI DPC++/C++ Compiler.

Sample Project: Base: Vector Add Sample
Description: The Vector Add sample is the equivalent of a 'Hello, World!' sample for data parallel programs.
Building and running the code sample verifies that your development environment is set up correctly and demonstrates the use of the core features of DPC++.

Sample Project: Matrix Multiply Sample
Description: The Matrix Multiply sample is a simple program that multiplies together two large matrices and verifies the results. This program is implemented in two ways: using Data Parallel C++ (DPC++) and using OpenMP (OMP).

Sample Project: Adaptive Noise Reduction Sample
Description: The Adaptive Noise Reduction sample is a DPC++ reference design that demonstrates a highly optimized image sensor adaptive noise reduction (ANR) algorithm on an FPGA.

Next Steps
•Use the latest oneAPI Code Samples and follow along with the Intel® oneAPI Training Resources.
•Explore the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference on the Intel® Developer Zone.
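The Linux setup and two-step build described above can be condensed into a short script. This is a dry-run sketch: it only prints the commands rather than executing them, because icpx is available only after setvars.sh has been sourced. The install-directory defaults (/opt/intel/oneapi for root installs, $HOME/intel/oneapi otherwise) are the ones the guide documents.

```shell
#!/bin/sh
# Dry-run sketch of the Linux command-line workflow: print the setup and
# build commands instead of executing them.

oneapi_install_dir() {
  # Default <install_dir> per the guide: root/sudo installs land in
  # /opt/intel/oneapi, non-root installs in $HOME/intel/oneapi.
  if [ "$(id -u)" -eq 0 ]; then
    echo /opt/intel/oneapi
  else
    echo "$HOME/intel/oneapi"
  fi
}

build_commands() {
  # Two-step build from the guide: compile only (-c), then link (-o).
  echo "icpx $1.cpp -c"
  echo "icpx $1.o -o $1"
}

echo "source $(oneapi_install_dir)/setvars.sh intel64"
build_commands hello-world
```

For the one-step build, a single `icpx hello-world.cpp -o hello-world` replaces the two build commands; add -fsycl for SYCL sources.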
Building the Latest Hashcat on Windows
Download the latest version, v5.0.0 (release or development build).

GPU driver requirements:
•AMD GPUs on Linux require "RadeonOpenCompute (ROCm)" Software Platform (1.6.180 or later)
•AMD GPUs on Windows require "AMD Radeon Software Crimson Edition" (15.12 or later)
•Intel CPUs require "OpenCL Runtime for Intel Core and Intel Xeon Processors" (16.1.1 or later)
•Intel GPUs on Linux require "OpenCL 2.0 GPU Driver Package for Linux" (2.0 or later)
•Intel GPUs on Windows require "OpenCL Driver for Intel Iris and Intel HD Graphics"
•NVIDIA GPUs require "NVIDIA Driver" (367.x or later)

New features (items shown in bold in the official list):
•World's fastest password cracker
•World's first and only in-kernel rule engine
•Free to use
•Open source (MIT License)
•Multi-OS (Linux, Windows and macOS)
•Multi-platform (CPU, GPU, DSP, FPGA, etc.; everything that comes with an OpenCL runtime)
•Multi-hash (cracking multiple hashes at the same time)
•Multi-devices (utilizing multiple devices in the same system)
•Multi-device-types (utilizing mixed device types in the same system)
•Supports password candidate brain functionality
•Supports distributed cracking networks (using overlay)
•Supports interactive pause / resume
•Supports sessions
•Supports restore
•Supports reading password candidates from file and stdin
•Supports hex-salt and hex-charset
•Supports automatic performance tuning
•Supports automatic keyspace ordering markov-chains
•Built-in benchmarking system
•Integrated thermal watchdog
•Implemented with performance in mind
•... and much more

Alternatively, fetch the source code with git clone /hashcat/hashcat.
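For reference, the source-build route mentioned above can be sketched as a dry run. The repository URL is an assumption (the link in the post is truncated); hashcat's source tree builds with a plain make.

```shell
#!/bin/sh
# Dry-run sketch: print the clone-and-build commands for hashcat.
HASHCAT_REPO="https://github.com/hashcat/hashcat"  # assumed URL; the post's link is truncated

clone_and_build() {
  echo "git clone $HASHCAT_REPO"
  echo "cd hashcat && make"
}

clone_and_build
```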
Keil C51 Compiler Error Types
The C51 compiler reports three classes of diagnostics:
1. Fatal errors: a directive control line is wrong, a referenced source file or header file does not exist, and similar conditions.
2. Syntax and semantic errors: both occur in the source file. The compiler reports them without producing an object file, and only terminates compilation once the number of errors exceeds a limit.
3. Warnings: a warning does not prevent the object file from being produced, but problems may occur at run time. The programmer should weigh each warning and handle it appropriately.
D.1 Fatal Errors
Fatal errors are reported in one of the following formats:

C_51 FATAL ERROR
ACTION: <current action>
LINE: <line where the error occurred>
ERROR: <error message>
terminated

or

C_51 FATAL ERROR
ACTION: <current action>
FILE: <file where the error occurred>
ERROR: <error message>
terminated
C_51 TERMINATED

(1) ACTION information
*PARSING INVOKE-/#PRAGMA_LINE
An error occurred during lexical analysis of the control line specified by #pragma.
*ALLOCATING MEMORY
The system failed to allocate memory. Compiling a larger program requires 512 KB of memory.
*OPENING INPUT_FILE
The source file or header file was not found or could not be opened.
*CREATE LIST_FILE/OBJECT_FILE/WORK_FILE
The listed file could not be created. The disk may be full, or the file may already exist and be write-protected.
*PARSING SOURCE_FILE/ANALYZING DECLARATIONS
Too many external reference names were found while parsing the source program.
*GENERATING INTERMEDIATE CODE
The source code is translated into internal pseudo-code; this error usually means a function is too large and exceeds an internal limit.
*WRITING TO FILE
An error occurred while writing to a file (work, list, prelist, or object file).

(2) ERROR information
*MEMORY SPACE EXHAUSTED
All available system memory has been exhausted. At least 512 KB of memory is required.
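Because the fatal-error report has the fixed ACTION/LINE/ERROR layout shown above, its fields can be pulled out mechanically, for example when post-processing build logs. A minimal sketch (the sample message below is invented for illustration):

```shell
#!/bin/sh
# Extract a named field from a C51 fatal-error report that follows the
# documented "ACTION: / LINE: / ERROR:" layout.

report='C_51 FATAL ERROR
ACTION: OPENING INPUT_FILE
LINE: 17
ERROR: MEMORY SPACE EXHAUSTED
terminated'

field() {
  # Print the value following "<NAME>: " for the requested field name.
  printf '%s\n' "$report" | sed -n "s/^$1: //p"
}

field ACTION   # prints: OPENING INPUT_FILE
field ERROR    # prints: MEMORY SPACE EXHAUSTED
```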
Installing WRF on Ubuntu
I recently saw a thread discussing installing WRF on Ubuntu. Since my home computer runs Ubuntu and I wanted to install WRF anyway, I gave it a try. The compilers used here are gfortran and gcc; both are GNU software and can be downloaded for free. (PGI does not appear to be free; ifort and icc have free versions.)

Note: this test machine has only gfortran and gcc as its Fortran and C compilers, which simplified things considerably. In my experience, if multiple compilers are installed, some further configuration is needed, particularly when installing netCDF. This article does not cover that case.

Note: this article only tests the WRF installation itself; it does not cover installing WPS or NCL. Those will be discussed in separate posts.

The following settings are used below; adjust them to your own setup:
WRF install path: /home/ztftom/WRF
netCDF install path: /home/ztftom/util/netcdf

1. Confirm that gfortran and gcc are installed
gfortran is the Fortran compiler and gcc is the C compiler; both must be installed on the machine. Check with:
$ which gfortran
$ which gcc
If both compilers are installed, these commands print their locations; otherwise they print nothing. If they are missing, install them from the Ubuntu Software Center or with:
$ sudo apt-get install gfortran
$ sudo apt-get install gcc

2. Install netCDF
Find and download the installer from Unidata: /downloads/netcdf/index.jsp
$ tar -xvf netcdf-4.1.3.tar.gz
Enter the extracted directory and run:
./configure --disable-dap --disable-netcdf-4 --prefix=/home/ztftom/util/netcdf
Note: many attempts showed that the two --disable options above are required. --disable-dap is needed because a 'curl' library is missing, and --disable-netcdf-4 because WRF does not yet support netCDF-4. Then run:
$ make
$ make install
(Ubuntu's package manager actually provides netcdf; I only discovered this after installing it manually:
$ sudo apt-get install netcdf-bin
I have not tried it. If anyone has, please let me know how well it works.)

3. Download WRF ARW V3.3
URL: /wrf/users/download/get_source.html
Registration is required before downloading. The downloaded file is named WRFV3.3.TAR.gz. Extract it:
$ tar -xvf WRFV3.3.TAR.gz
This produces a WRFV3 directory; enter that directory.
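The compiler check in step 1 and the netCDF configure line can be wrapped in a small pre-flight script. The paths match the example setup above; treat this as a sketch, not part of the official WRF build system.

```shell
#!/bin/sh
# Pre-flight sketch for the WRF build: report whether the GNU compilers are
# on PATH (mirrors the `which` checks above) and print the netCDF configure
# line used in this guide.

check_compiler() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

check_compiler gfortran
check_compiler gcc

NETCDF_PREFIX=/home/ztftom/util/netcdf   # example path from this guide
echo "./configure --disable-dap --disable-netcdf-4 --prefix=$NETCDF_PREFIX"
```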
C Coding Standard: LDRA MISRA-C:2004 Standard v2.5.3 (Web)

Rules | Required | Advisory | Total
Implemented by LDRA | 103 | 17 | 120
Partially implemented by LDRA | 6 | 1 | 7
Not deemed to be statically analysable by a tool | 12 | 2 | 14
Total | 121 | 20 | 141

NOTE: Of the 14 rules deemed "not statically analysable", a number of them are checkable to some extent with facilities available in the LDRA tool suite. This may be through the application of a related programming standard, or some other technique such as dynamic analysis.

This information has been compiled using version 7.7.1 of the LDRA tool suite and is correct as of July 2008.

MISRA-C:2004 Coding Standard
The LDRA tool suite is developed and certified to BS EN ISO 9001:2000. This comparison uses the revised MISRA standard, MISRA-C:2004 "Guidelines for the use of the C language in critical systems", published in October 2004. LDRA Ltd. reserves the right to change any

MISRA-C:2004 Rule | Required/Advisory | MISRA Description | LDRA Rule Number and Description
1.1 | Required | All code shall conform to ISO 9899:1990 "Programming languages – C", amended and corrected by ISO/IEC 9899/COR1:1995, ISO/IEC 9899/AMD1:1995, and ISO/IEC 9899/COR2:1996. | 293 S Non ANSI construct used.
1.2 | Required | No reliance shall be placed on undefined or unspecified behaviour. | 412 S Undefined behaviour, \ before E-O-F.
1.3 | Required | Multiple compilers and/or languages shall only be used if there is a common defined interface standard for object code to which the languages/compilers/assemblers conform. | N/A
1.4 (with 5.1) | Required | The compiler/linker shall be checked to ensure that 31 character significance and case sensitivity are supported for external identifiers. | 17 D Identifier not unique within 31 characters.
1.5 | Advisory | Floating-point implementations should comply with a defined floating-point standard. | N/A
2.1 | Required | Assembly language shall be encapsulated and isolated. | 88 S Procedure is not pure assembler.
2.2 | Required | Source code shall only use /* … */ style comments. | 110 S Use of single line comment(s) //.
2.3 | Required | The character sequence /* shall not be used within a comment. | 119 S Nested comment found.
2.4 | Advisory | Sections of code should not be "commented out". | 302 S Comment possibly contains code.
3.1 | Required | All usage of implementation-defined behaviour shall be documented. | N/A
3.2 | Required | The character set and the corresponding encoding shall be documented. | N/A
3.3 | Advisory | The implementation of integer division in the chosen compiler should be determined, documented and taken into account. | 373 S Use of integer division.
3.4 | Required | All uses of the #pragma directive shall be documented and explained. | 69 S #pragma used.
3.5 | Required | If it is being relied upon, the implementation-defined behaviour and packing of bitfields shall be documented. | 316 S Bit field is not unsigned integral; 328 S Non bit field member in bitfield struct; 42 S Use of bit field in structure declaration; 226 S Bit field is not octal, hex or suffix u.
3.6 | Required | All libraries used in production code shall be written to comply with the provisions of this document, and shall have been subject to appropriate validation. | N/A
4.1 (with 7.1) | Required | Only those escape sequences that are defined in the ISO C standard shall be used. | 176 S Non standard escape sequence in source.
4.2 | Required | Trigraphs shall not be used. | 81 S Use of trigraphs.
5.1 | Required | Identifiers (internal and external) shall not rely on the significance of more than 31 characters. | 17 D Identifier not unique within 31 characters; 384 S Identifier matches macro name in 31 chars; 61 X Identifier match in *** chars.
5.2 - 5.5 | Required | Identifiers in an inner scope shall not use the same name as an identifier in an outer scope, and therefore hide that identifier. | 131 S Name reused in inner scope.
5.3 | Required | A typedef name shall be a unique identifier. | 112 S Typedef name redeclared; 374 S Name conflict with typedef; 16 X - 24 X Identifier reuse: typedef vs …
5.4 | Required | A tag name shall be a unique identifier. | 325 S Inconsistent use of tag; 4 X - 15 X Identifier reuse: tag vs …
5.5 | Advisory | No object or function identifier with static storage duration should be reused. | 18 D Identifier name reused; 25 X - 39 X Identifier reuse: ...
5.6 | Advisory | No identifier in one name space should have the same spelling as an identifier in another name space, with the exception of structure and union member names. | 91 S Name redeclared in another namespace (MR); 40 X - 48 X Identifier reuse: ...
5.7 | Advisory | No identifier name should be reused. | 383 S Identifier name matches macro name; 327 S Reuse of struct member name; 49 X - 60 X Identifier reuse: ...
6.1 | Required | The plain char type shall be used only for the storage and use of character values. | 329 S Operation not appropriate to plain char.
6.2 | Required | Signed and unsigned char type shall be used only for the storage and use of numeric values. | 93 S Value is not of appropriate type; 96 S Use of mixed mode arithmetic.
6.3 | Advisory | Typedefs that indicate size and signedness should be used in place of the basic types. | 90 S Basic type declaration used; 495 S Typedef has no size indication.
6.4 | Required | Bit fields shall only be defined to be of type unsigned int or signed int. | 73 S Bit field not signed or unsigned int.
6.5 | Required | Bit fields of type signed int shall be at least 2 bits long. | 72 S Signed bit field less than 2 bits wide.
7.1 | Required | Octal constants (other than zero) and octal escape sequences shall not be used. | 83 S Octal number found.
8.1 | Required | Functions shall have prototype declarations and the prototype shall be visible at both the function definition and call. | 24 D Procedure definition has no associated prototype; 496 S Function call with no prior declaration.
8.2 | Required | Whenever an object or function is declared or defined, its type shall be explicitly stated. | 326 S Declaration is missing type.
8.3 | Required | For each function parameter the type given in the declaration and definition shall be identical, and the return types shall also be identical. | 102 S Function and prototype return inconsistent; 103 S Function and prototype param inconsistent; 62 X Function prototype/defn return type mismatch; 63 X Function prototype/defn parameter type mismatch.
8.4 | Required | If objects or functions are declared more than once their types shall be compatible. | 1 X Declaration types do not match across a system; 360 S Incompatible type.
8.5 | Required | There shall be no definitions of objects or functions in a header file. | 286 S Functions defined in header file; 287 S Variable definitions in header file.
8.6 | Required | Functions shall be declared at file scope. | 296 S Function declared at block scope.
8.7 | Required | Objects shall be defined at block scope if they are only accessed from within a single function. | 25 D Scope of variable could be reduced.
8.8 | Required | An external object or function shall be declared in one and only one file. | 26 D Variable should be defined once in only one file; 60 D External object should be declared only once.
8.9 | Required | An identifier with external linkage shall have exactly one external definition. | 172 S Variable declared multiply; 33 D No real declaration for external variable.
8.10 | Required | All declarations and definitions of objects or functions at file scope shall have internal linkage unless external linkage is required. | 27 D Variable should be declared static; 61 D Procedure should be declared static; 461 S Identifier with ambiguous linkage.
8.11 | Required | The static storage class specifier shall be used in definitions and declarations of objects and functions that have internal linkage. | 27 D Variable should be declared static; 61 D Procedure should be declared static.
8.12 | Required | When an array is declared with external linkage, its size shall be stated explicitly or defined implicitly by initialisation. | 127 S Array has no bounds specified.
9.1 | Required | All automatic variables shall have been assigned a value before being used. | 5 D Procedure contains UR data flow anomalies.
9.2 | Required | Braces shall be used to indicate and match the structure in the non-zero initialisation of arrays and structures. | 105 S Struct field initialisation brace fault.
9.3 | Required | In an enumerator list, the "=" construct shall not be used to explicitly initialise members other than the first, unless all items are explicitly initialised. | 85 S Incomplete initialisation of enumerator.
10.1 | Required | The value of an expression of integer type shall not be implicitly converted to a different underlying type if: a) it is not a conversion to a wider integer type of the same signedness, or b) the expression is complex, or c) the expression is not constant and is a function argument, or d) the expression is not constant and is a return expression. | 93 S Value is not of appropriate type; 96 S Use of mixed mode arithmetic; 433 S Type conversion without cast; (a) 434 S Signed/unsigned conversion without cast; (a) 435 S Float/integer conversion without cast; (a) 446 S Narrower int conversion without cast; (a) 488 S Value outside range of underlying type; (b) 452 S No cast for widening complex int expression; (c) 458 S Implicit conversion: actual to formal param; (c) 491 S No cast for widening int parameter; (d) 101 S Function return type inconsistent; (d) 457 S Implicit int widening for function return.
10.2 | Required | The value of an expression of floating type shall not be implicitly converted to a different type if: a) it is not a conversion to a wider floating type, or b) the expression is complex, or c) the expression is a function argument, or d) the expression is a return expression. | 93 S Value is not of appropriate type; 96 S Use of mixed mode arithmetic; 433 S Type conversion without cast; (a) 435 S Float/integer conversion without cast; (a) 445 S Narrower float conversion without cast; (b) 451 S No cast for widening complex float expression; (c) 458 S Implicit conversion: actual to formal param; (c) 490 S No cast for widening float parameter; (d) 101 S Function return type inconsistent; (d) 456 S Implicit float widening for function return.
10.3 | Required | The value of a complex expression of integer type may only be cast to a type that is narrower and of the same signedness as the underlying type of the expression. | 93 S Value is not of appropriate type; 433 S Type conversion without cast; 332 S Widening cast on complex integer expression; 442 S Signed integral type cast to unsigned; 443 S Unsigned integral type cast to signed; 444 S Integral type cast to non-integral.
10.4 | Required | The value of a complex expression of floating type may only be cast to a narrower floating type. | 93 S Value is not of appropriate type; 433 S Type conversion without cast; 333 S Widening cast on complex float expression; 441 S Float cast to non-float.
10.5 | Required | If the bitwise operators ~ and << are applied to an operand of underlying type unsigned char or unsigned short, the result shall be immediately cast to the underlying type of the operand. | 334 S No cast when ~ or << applied to small types.
10.6 | Required | A "U" suffix shall be applied to all constants of unsigned type. | 331 S Literal value requires a U suffix.
11.1 | Required | Conversions shall not be performed between a pointer to a function and any type other than an integral type. | 94 S Casting operation on a pointer; 95 S Casting operation to a pointer.
11.2 | Required | Conversions shall not be performed between a pointer to object and any type other than an integral type, another pointer to object type or a pointer to void. | 94 S Casting operation on a pointer.
11.3 | Advisory | A cast should not be performed between a pointer type and an integral type. | 439 S Cast from pointer to integral type; 440 S Cast from integral type to pointer.
11.4 | Advisory | A cast should not be performed between a pointer to object type and a different pointer to object type. | 95 S Casting operation to a pointer.
11.5 | Required | A cast shall not be performed that removes any const or volatile qualification from the type addressed by a pointer. | 203 S Cast on a constant value; 344 S Cast on volatile value.
12.1 | Advisory | Limited dependence should be placed on C's operator precedence rules in expressions. | 96 S Use of mixed mode arithmetic (see 6.2 above); 361 S Expression needs brackets.
12.2 | Required | The value of an expression shall be the same under any order of evaluation that the standard permits. | 9 S Assignment operation in expression; 134 S Volatile variable in complex expression; (with 12.3) 35 D Expression has side effects; (with 12.4) 1 Q Call has execution order dependent side effects; (with 12.13) 30 S Deprecated usage of ++ or -- operators found.
12.3 | Required | The sizeof operator shall not be used on expressions that contain side effects. | 54 S Sizeof operator with side effects.
12.4 | Required | The right hand operand of a logical && or || operator shall not contain side effects. | 406 S Use of ++ or -- on RHS of && or || operator; 408 S Volatile variable accessed on RHS of && or ||.
12.5 | Required | The operands of a logical && or || shall be primary-expressions. | 49 S The operands of a logical && or || shall be primary-expressions.
12.6 (with 13.2) | Advisory | The operands of logical operators (&&, || and !) should be effectively Boolean. Expressions that are effectively Boolean should not be used as operands to operators other than (&&, || and !). | 114 S Expression is not Boolean.
12.7 | Required | Bitwise operators shall not be applied to operands whose underlying type is signed. | 50 S Use of shift operator on signed type; 120 S Use of bit operator on signed type.
12.8 | Required | The right hand operand of a shift operator shall lie between zero and one less than the width in bits of the underlying type of the left hand operand. | 51 S Shifting value too far.
12.9 | Required | The unary minus operator shall not be applied to an expression whose underlying type is unsigned. | 52 S Unsigned expression negated.
12.10 | Required | The comma operator shall not be used. | 53 S Use of comma operator.
12.11 | Advisory | Evaluation of constant unsigned integer expressions should not lead to wrap-around. | 493 S Numeric overflow; 494 S Numeric underflow.
12.12 | Required | The underlying bit representations of floating-point values shall not be used. | 345 S Bit operator with floating point operand.
12.13 | Advisory | The increment (++) and decrement (--) operators should not be mixed with other operators in an expression. | 30 S Deprecated usage of ++ or -- operators found (see 12.2).
13.1 | Required | Assignment operators shall not be used in expressions that yield a Boolean value. | 132 S Assignment operator in boolean expression.
13.2 | Advisory | Tests of a value against zero should be made explicit, unless the operand is effectively Boolean. | 114 S Expression is not Boolean (see 12.6).
13.3 | Required | Floating-point expressions shall not be tested for equality or inequality. | 56 S Equality comparison of floating point.
13.4 | Required | The controlling expression of a for statement shall not contain any objects of floating type. | 39 S Unsuitable type for loop variable.
13.5 | Required | The three expressions of a for statement shall be concerned only with loop control. | 429 S Empty middle expression in for loop; 430 S Inconsistent usage of loop control variable; 270 S For loop initialisation is not simple; 271 S For loop incrementation is not simple.
13.6 | Required | Numeric variables being used within a for loop for iteration counting shall not be modified in the body of the loop. | 55 D Modification of loop counter in loop body.
13.7 | Required | Boolean operations whose results are invariant shall not be permitted. | 139 S Construct leads to infeasible code; 140 S Infeasible loop condition found.
14.1 | Required | There shall be no unreachable code. | 1 J Unreachable code found.
14.2 | Required | All non-null statements shall either: a) have at least one side-effect however executed, or b) cause control flow to change. | 57 S Statement with no side effect.
14.3 | Required | Before preprocessing, a null statement shall only occur on a line by itself; it may be followed by a comment provided that the first character following the null statement is a white space character. | 58 S Null statement found.
14.4 | Required | The goto statement shall not be used. | 13 S goto detected.
14.5 | Required | The continue statement shall not be used. | 32 S Use of continue statement.
14.6 | Required | For any iteration statement there shall be at most one break statement used for loop termination. | 31 S Use of break statement in loop; 409 S More than one break statement in loop.
14.7 | Required | A function shall have a single point of exit at the end of the function. | 7 C Procedure has more than one exit point.
14.8 | Required | The statement forming the body of a switch, while, do ... while or for statement shall be a compound statement. | 11 S No brackets to loop body; 428 S No {} for switch.
14.9 | Required | An if (expression) construct shall be followed by a compound statement. The else keyword shall be followed by either a compound statement, or another if statement. | 12 S No brackets to then/else.
14.10 | Required | All if ... else if constructs shall be terminated with an else clause. | 59 S Else alternative missing in if; 477 S Empty else clause following else if.
15.0 | Required | The MISRA C switch syntax shall be used. | 60 S Empty switch statement; 385 S MISRA switch statement syntax violation.
15.1 (with 15.5) | Required | A switch label shall only be used when the most closely-enclosing compound statement is the body of a switch statement. | 245 S Case statement in nested block.
15.2 | Required | An unconditional break statement shall terminate every non-empty switch clause. | 62 S Switch case not terminated with break.
15.3 | Required | The final clause of a switch statement shall be the default clause. | 48 S No default case in switch statement; 322 S Default is not last case of switch; 410 S Switch empty default has no comment.
15.4 | Required | A switch expression shall not represent a value that is effectively Boolean. | 121 S Use of boolean expression in switch.
15.5 | Required | Every switch statement shall have at least one case clause. | 61 S Switch contains default only; 245 S Case statement in nested block.
16.1 | Required | Functions shall not be defined with a variable number of arguments. | 41 S Ellipsis used in procedure parameter list.
16.2 | Required | Functions shall not call themselves, either directly or indirectly. | 1 U Inter-file recursion found; 6 D Recursion in procedure calls found.
16.3 | Required | Identifiers shall be given for all of the parameters in a function prototype declaration. | 37 S Procedure parameter has a type but no identifier.
16.4 | Required | The identifiers used in the declaration and definition of a function shall be identical. | 36 D Prototype and definition name mismatch.
16.5 | Required | Functions with no parameters shall be declared with parameter type void. | 63 S Empty parameter list to procedure/function.
21 S Number of parameters does not match.Every switch statement shall have at least onecase clause.Functions shall not call themselves, either directlyor indirectly.The number of arguments passed to a functionshall match the number of parameters.All if … else if constructs shall be terminated with an else clause.N/A15.5Required 16.2Required 14.10Required15.3Required16.6Required15N/AThe final clause of a switch statement shall bethe default clause.LDRA Ltd. reserves the right to change anyMISRA-C:2004RuleRequired / Advisory LDRA RuleNumberMISRA DescriptionLDRA Rule Description98 SActual and formal parameters inconsistent (MR).59 DParameter should be declared const.62 DPointer parameter should be declared const.2 DFunction does not return a value on all paths.36 S Function has no return statement. 66 S Function with empty return expression.16.9Required 99 S A function identifier shall only be used with eithera preceding &, or with a parenthesised parameter list, which may be empty.Function use is not a call.16.10RequiredDataflow anomalies If a function returns error information, then thaterror information shall be tested.N/A87 S Use of pointer arithmetic. 
436 S Declaration does not specify an array.17.2Required438 SPointer subtraction shall only be applied topointers that address elements of the same array.Pointer subtraction not addressing one array.17.3Required 437 S >, >=, <, <= shall not be applied to pointer types except where they point to the same array.< > <= >= used on different object pointers.17.4Required 87 SArray indexing shall be the only allowed form of pointer arithmetic.See 17.117.5Advisory80 SThe declaration of objects should contain nomore than 2 levels of pointer indirection.Pointer indirection exceeds 2 levels.17.6Required 71 S The address of an object with automatic storage shall not be assigned to another object that may persist after the first object has ceased to exist.Pointer assignment to wider scope.465 S Struct/union not completely specified.481 SArray with no bounds in struct.17.1, 17.4RequiredAll exit paths from a function with non-void return type shall have an explicit return statement withan expression.Pointer arithmetic shall only be applied topointers that address an array or array element.A pointer parameter in a function prototype should be declared as pointer to const if the All structure and union types shall be complete at the end of a translation unit.Required16.8Required18.1Advisory16.7LDRA Ltd. reserves the right to change anyMISRA-C:2004Rule Required / Advisory LDRA Rule Number MISRA DescriptionLDRA Rule Description482 S Incomplete structure referenced.497 S Type is incomplete in translation unit.18.2Required 480 SAn object shall not be assigned to an overlappingobject.memcpy params access same variable.18.3Required N/A An area of memory shall not be reused forunrelated purposes.N/A18.4Required74 S Unions shall not be used. Union declared.75 S Executable code before an included file. 
78 S Macro parameter not in brackets.338 S#include preceded by non preproc directives.19.2Advisory 100 SNon-standard characters should not occur inheader file names in #include directives.#include filename is non conformant. 292 SNo space between #include and filename. 339 S #include directive with illegal items. 427 SFilename in #include not in < > or " ". 19.4Required79 SC macros shall only expand to a braced initialiser, a constant, a parenthesisedexpression, a type qualifier, a storage class specifier, or a do-while-zero construct.Macro contains unacceptable items67 S #Define used in a block. 426 S #undef used in a block.19.6Required 342 SExtra chars after preprocessor directive. 19.6, 20.0Required 68 S#undef used. 19.7Advisory 340 SA function should be used in preference to afunction-like macro.Use of function like macro. 19.8Required 324 SA function-like macro shall not be invoked withoutall of its arguments.Macro call has wrong number of parameters. 19.9Required341 S Arguments to a function-like macro shall notcontain tokens that look like preprocessingdirectives.Preprocessor construct as macro parameter.Macros shall not be #define’d or #undef’d within ablock.#undef shall not be used.#include statements in a file should only bepreceded by other preprocessor directives orcomments.The #include directive shall be followed by eithera or "filename" sequence.the end of a translation unit.19.3Required19.5Required 19.1AdvisoryRequired18.1LDRA Ltd. 
reserves the right to change anyMISRA-C:2004RuleRequired / Advisory LDRA Rule NumberMISRA Description LDRA Rule Description19.10Required78 SIn the definition of a function-like macro each instance of a parameter shall be enclosed inparentheses unless it is used as the operand of # or ##.In the definition of a function-like macro the whole definition, and each instance of a parameter, shall be enclosed in parentheses or braces.19.11Required337 SAll macro identifiers in preprocessor directives shall be defined before use, except in #ifdef and#ifndef preprocessor directives and the defined() operator.Undefined macro variable in #if.19.12Required 76 SThere shall be at most one occurrence of the # or## operators in a single macro definition.More than one of # or ## in a macro. 19.13Advisory 125 S The # and ## operators should not be/doc/9781b2552b160b4e767fcff3.html e of ## or # in a macro 335 S operator defined contains illegal items. 336 S #if expansion contains define operator.19.15Required243 S Precautions shall be taken in order to prevent thecontents of a header file being included twice.Included file not protected with #define.19.16Required147 S Preprocessing directives shall be syntacticallymeaningful even when excluded by thepreprocessor.Spurious characters after preprocessor directive126 S A #if has no #endif in the same file.343 S #else has no #if, etc in the same file. 86 S Attempt to define reserved word. 156 S Use of 'defined' keyword in macro body. 478 S Misra special prefix banned name.20.2Required 218 SThe names of standard library macros, objectsand functions shall not be /doc/9781b2552b160b4e767fcff3.html is used in standard libraries. 
20.3RequiredUnit TestingThe validity of values passed to library functions shall be checked.TBrun may be used to perform the necessary checks.Reserved identifiers, macros and functions in thestandard library, shall not be defined, redefinedor undefined.The defined preprocessor operator shall only beused in one of the two standard forms.All #else, #elif and #endif preprocessor directivesshall reside in the same file as the #if or #ifdef directive to which they are related.19.17Required19.14RequiredRequired20.1LDRA Ltd. reserves the right to change anyMISRA-C:2004。
A C Compiler for a Processor with a Reconfigurable Functional Unit
Zhi Alex Ye, Nagaraj Shenoy, Prithviraj Banerjee
Department of Electrical and Computer Engineering, Northwestern University
Evanston, IL 60201, USA
{ye, nagaraj, banerjee}@

ABSTRACT
This paper describes a C compiler for a mixed Processor/FPGA architecture in which the FPGA is a Reconfigurable Functional Unit (RFU). It presents three compilation techniques that can extract computations from applications to put into the RFU. The results show that large instruction sequences can be created and extracted by these techniques. An average speedup of 2.6 is achieved over a set of benchmarks.

1. INTRODUCTION
With the flexibility of the FPGA, reconfigurable systems are able to achieve significant speedups for some applications. Because the general-purpose processor and the FPGA each has its own suitable area of application, several architectures have been proposed that integrate a processor with an FPGA on the same chip.
In this paper, we describe a C compiler for such a Processor/FPGA system. The target architecture is Chimaera, a RISC processor with a Reconfigurable Functional Unit (RFU). We describe how the compiler identifies sequences of statements in a C program and changes them into RFU operations (RFUOPs), and we show the performance benefits that such optimizations achieve over a set of benchmarks.
The rest of the paper is organized into five sections. Section 2 discusses related work. Section 3 gives an overview of the Chimaera architecture. Section 4 discusses the compiler organization and implementation in detail: we first discuss a technique to enhance the size of instruction sequences, control localization; next, we describe the application of the RFU to SIMD Within A Register (SWAR) operations; lastly, we introduce an algorithm to identify RFUOPs in a basic block. Section 5 demonstrates experimental results.
We summarize the paper in Section 6.

2. RELATED WORK
Several architectures have been proposed to integrate a processor with an FPGA [6,7,8,9,13,14,15]. The usage of the FPGA can be divided into two categories: FPGA as a coprocessor, or FPGA as a functional unit.
In the coprocessor schemes, such as Garp [9], Napa [6], DISC [14], and PipeRench [7], the host processor is coupled with an FPGA-based reconfigurable coprocessor. The coprocessor usually has the ability to access memory and perform control flow operations. There is a communication cost between the coprocessor and the host processor of several cycles or more. Therefore, these architectures tend to map a large portion of the application, e.g. a loop, into the FPGA; one calculation in the FPGA usually corresponds to a task that takes several hundred cycles or more.
In the functional unit schemes, such as Chimaera [8], OneChip [15], and PRISC [13], the host processor is integrated with an FPGA-based Reconfigurable Functional Unit (RFU). One RFU Operation (RFUOP) can take on a task that usually requires several instructions on the host processor. As the functional unit is interfaced only with the register file, it cannot perform memory operations or control flow operations. The communication is faster than in the coprocessor scheme; for example, in the Chimaera architecture, after an RFUOP's configuration is loaded, an invocation of it has no communication overhead. This gives such architectures a larger range of applications: even in cases where only a few instructions can be combined into one RFUOP, we can still apply the optimization if the execution frequency is high enough.

3. CHIMAERA ARCHITECTURE
In this section, we review the Chimaera architecture to provide adequate background for explaining the compiler support for this architecture. More information about Chimaera can be found in [8].
The overall Chimaera architecture is shown in Figure 1.
The main component of the system is the Reconfigurable Functional Unit (RFU), which consists of FPGA-like logic designed to support high-performance computations. It gets its inputs from the host processor's register file, or from a shadow register file which duplicates a subset of the values in the host's register file. The RFU is capable of computing data-dependent operations (e.g., tmp=r2-r3, r5=tmp+r1), conditional evaluations (e.g., "if (b>0) a=0; else a=1;"), and multiple sub-word operations (e.g., four instances of 8-bit addition).
The RFU contains several configurations at the same time. An RFUOP instruction activates the corresponding configuration in the RFU. An RFU configuration itself determines from which registers it reads its operands; a single RFUOP can read from all the registers connected to the RFU and then put the result on the result bus. The maximum number of input registers is 9 in Chimaera. Each RFUOP instruction is associated with a configuration and an ID. For example, the execution sequence "r2=r3<<2; r4=r2+r5; r6=lw 0(r4)" can be optimized to "r4=RFUOP #1; r6=lw 0(r4)". Here #1 is the ID of this RFUOP and "r5+r3<<2" is the operation of the corresponding configuration. After an RFUOP instruction is fetched and decoded, the Chimaera processor checks the RFU for the configuration corresponding to the instruction ID. If the configuration is currently loaded in the RFU, the corresponding output is written to the destination register during the instruction writeback cycle; otherwise, the processor stalls while the RFU loads the configuration.

4. COMPILER IMPLEMENTATION
We have developed a C compiler for Chimaera which automatically maps some operations into RFUOPs. The generated code is currently run on a Chimaera simulator to gather performance information; a future version of the compiler will be integrated with a synthesis tool.
The compiler is built on the widely available GCC framework. Figure 2 depicts the phase ordering of the implementation.
The C code is parsed into the intermediate language of GCC, Register Transfer Language (RTL), which is then enhanced by several early optimizations such as common subexpression elimination, flow analysis, etc. The partially optimized RTL is passed through the Chimaera optimization phase, as explained below. The Chimaera-optimized RTL is then processed by later optimization phases such as instruction scheduling, register allocation, etc. Finally, the code for the target architecture is generated along with RFUOP configuration information.
From the compiler's perspective, an RFUOP is an operation with multiple register inputs and a single register output. The goal of the compiler is to identify suitable multiple-input-single-output sequences in the program and change them into RFUOPs.
Chimaera Optimization consists of three steps: Control Localization, SWAR Optimization and Instruction Combination. Due to the configuration loading time, these optimizations pay off only in the kernels of the programs; currently, we optimize only the innermost loops.
The first step, control localization, transforms some branches into one macroinstruction to form a larger basic block. The second step, SIMD Within A Register (SWAR) Optimization, searches the loop body for subword operations and unrolls the loop when appropriate. The third step, instruction combination, takes a basic block as input and extracts multiple-input-single-output patterns from the data flow graph; these patterns are changed into RFUOPs if they can be implemented in the RFU. The following subsections discuss the three steps in detail.

4.1 Control Localization
In order to get more speedup, we want to find larger and more RFUOPs. Intuitively, a larger basic block contains more instructions and thus offers more chances of finding larger and more RFUOPs. We find that the control localization technique [11][13] is useful in increasing the size of basic blocks.

[Figure 1: The overall Chimaera architecture.]
[Figure 2: Phase ordering of the C compiler for Chimaera.]
[Figure 3: Control localization. (a) Control flow graph before control localization; each oval is an instruction, and the dashed box marks the code sequence to be control localized. (b) Control flow graph after control localization.]

Figure 3 shows an example. After control localization, several branches are combined into one macroinstruction with multiple inputs and multiple outputs. In addition to enlarging the basic block, control localization sometimes finds RFUOPs directly: when a macroinstruction has only one output, and all the operations in it can be implemented in the RFU, the macroinstruction can be mapped into an RFUOP. Such an RFUOP speculatively computes the operations on all branch paths, and the result on the correct path, where the condition evaluates to true, is selected and put on the result bus. This macroinstruction is called a "CI macroin" and can be optimized by Instruction Combination.

4.2 SWAR Optimization
As a method to exploit medium-grain data parallelism, SIMD (single instruction, multiple data) has been used in parallel computers for many years. Extending this idea to general-purpose processors has led to a new version of SIMD, namely SIMD Within A Register (SWAR) [4]. The SWAR model partitions each register into fields that can be operated on in parallel; the ALUs are set up to perform multiple field-by-field operations. SWAR has been successful in improving multimedia performance. Most implementations of this concept are known as multimedia extensions, such as Intel MMX, HP MAX, and SUN SPARC VIS. For example, "PADDB A, B" is an instruction from Intel MMX: both operands A and B are 64-bit and are divided into eight 8-bit fields.
The instruction performs eight additions in parallel and stores the eight results to A.
However, current implementations of SWAR do not support a general SWAR model. Some of their limitations are:
• The input data must be packed and aligned correctly, which sometimes incurs packing and unpacking penalties.
• Most current hardware implementations support only 8, 16 and 32-bit field sizes. Other important sizes such as 2-bit and 10-bit are not supported.
• Only a few operations are supported. When the operation on one item becomes complex, SIMD is impossible. For example, the following code does not map well to a simple sequence of SIMD operations:

char out[100], in1[100], in2[100];
for (i = 0; i < 100; i++) {
    if ((in1[i] - in2[i]) > 10)
        out[i] = in1[i] - in2[i];
    else
        out[i] = 10;
}

With the flexibility of the FPGA, the RFU can support a more general SWAR model without the above disadvantages. The only requirement is that the output fields should fit within a single register. The inputs do not need to be stored in packed format, nor is there any limitation on alignment. In addition, complex operations can be performed; for example, the loop body above can be implemented in one RFUOP.
Our compiler currently supports an 8-bit field size, which is the size of "char" in C. In the current implementation, the compiler looks for opportunities to pack several 8-bit outputs into a word. In most cases this pattern occurs in loops with stride one, so the compiler searches for stores where the memory store size is a byte and the address changes by one each iteration; when such a pattern is found, the loop is unrolled four times. During loop unrolling, conventional optimizations such as local register renaming and strength reduction are performed. In addition, the four memory stores are changed to four sub-register movements.
For example,
"store_byte r1,address; store_byte r2,address+1; store_byte r3,address+2; store_byte r4,address+3;"
are changed into
"(r5,0)=r1; (r5,1)=r2; (r5,2)=r3; (r5,3)=r4;".
The notation (r,n) refers to the nth byte of register r. We generate a pseudo instruction "collective-move" that moves the four sub-registers into a word register, e.g. "r5=(r5,0)(r5,1)(r5,2)(r5,3)". In the data flow graph, the four outputs merge through this "collective-move" into one, so a multiple-input-single-output subgraph is formed. The next step, Instruction Combination, can recognize this subgraph and change it into an RFUOP when appropriate. Finally, a memory store instruction is generated to store the word register. The compiler then passes the unrolled copy to the instruction combination step.

4.3 Instruction Combination
The instruction combination step analyzes a basic block and changes RFU sequences into RFUOPs. It first determines which instructions can be implemented in the RFU, then identifies the RFU sequences, and finally selects the appropriate RFU sequences and changes them into RFUOPs.
We categorize instructions into Chimaera Instructions (CI) and Non-Chimaera Instructions (NCI). Currently CI includes logic operations, constant shifts and integer add/subtract; the "collective-move", "sub-register movement" and "CI macroin" are also considered CI. NCI includes other instructions such as multiplication/division, memory load/store, floating-point operations, etc.
The algorithm FindSequences in Figure 4 finds all the maximal instruction sequences for the RFU. It colors each node in the data flow graph (DFG): NCI instructions are marked BLACK; a CI instruction is marked BROWN when its output must be put into a register, that is, when the output is live-on-exit or is the input of some NCI instruction; other CI instructions are marked WHITE.
The RFU sequences are the subgraphs of the DFG that consist of BROWN and WHITE nodes.
The compiler currently changes all the identified sequences into RFUOPs. Under the assumption that every RFUOP takes one cycle and that the configuration loading time can be amortized over several executions, this gives an upper bound on the speedup we could expect from Chimaera. In the future, we will take into account other factors such as the FPGA size, configuration loading time, actual RFUOP execution time, etc.

5. EXPERIMENTAL RESULTS
We have tested the compiler's output through a set of benchmarks on the Chimaera simulator. The simulator is a modification of the SimpleScalar simulator [3]. The simulated architecture has 32 general-purpose 32-bit registers and 32 floating-point registers; the instruction set is a superset of the MIPS-IV ISA. Presently, the simulator executes the programs sequentially and gathers the execution statistics.
Early results on some benchmarks are presented in this section. Each benchmark is compiled in two ways: one using "gcc -O2", the other using our Chimaera compiler. We studied the differences between the two versions of assembly code as well as the simulation results. Among the benchmarks, decompress.c and compress.c are from the Honeywell benchmark suite [10], jacobi and life are from the Raw benchmark suite [2], image reconstruction [12] and dct [1] are implementations of two MPEG program kernels, and image restoration is an image processing program. They are denoted dcmp, cmp, life, jcbi, dct, rcn and rst in the following figures.
Table 1 shows the simulation results of the RFU optimizations. Insn1 and insn2 are the instruction counts without and with RFU optimization, and the speedup is calculated as insn1/insn2. The following three columns, IC, CL and SWAR, give the portion of the performance gain from Instruction Combination, Control Localization and SWAR respectively.
The three optimizations give an average speedup of 2.60.
The best speedup is up to 7.19.
To illustrate the impact of each optimization on the kernel sizes, we categorize instructions into four types: NC, IC, CL and SWAR. NC is the portion of instructions that cannot be optimized for Chimaera; NCI instructions and some non-combinable integer operations fall into this category. IC, CL and SWAR stand for the instructions that can be optimized by Instruction Combination, Control Localization and SWAR optimization respectively. Figure 5 shows the distribution of these four types of instructions in the program kernels. After the three optimizations, the kernel size is reduced by an average of 37.5%; of this amount, 22.3% comes from Instruction Combination, 9.8% from Control Localization and 5.4% from SWAR.

[Table 1: Performance results over some benchmarks. The "avg" row is the average of all benchmarks.]
[Figure 5: Distribution of the kernel instructions.]

Further analysis shows that 58.4% of the IC portion comes from address calculation. For example, the C code "int a[10], ...=a[i]" is translated to "r3=r2<<2, r4=r3+r1, r5=lw 0(r4)" in assembly, and the first two instructions can be combined in Chimaera. The large portion of address calculation indicates that our optimizations can be applied to a wide range of applications, as long as they have complex address calculations in their kernels. Furthermore, as address calculation is basically sequential, existing ILP architectures like superscalar and VLIW cannot take advantage of it. This suggests that we may expect speedup if we integrate an RFU into an advanced ILP architecture.
Figure 6 illustrates the frequencies of different RFUOP sizes. For Instruction Combination and Control Localization, most of the sizes are from 2 to 6. These small sizes indicate that the techniques benefit from the fast communication of the functional unit scheme; in the coprocessor scheme, the communication overhead would make them prohibitive to apply.
The SWAR optimization generally identifies much larger RFUOPs. The largest one comes from the image reconstruction benchmark, whose kernel is shown in Figure 7. In this case, a total of 52 instructions are combined in the RFU, which results in a speedup of 4.2.
These results are based on a simple sequential execution model. We have also simulated the architecture in an out-of-order execution environment, considering a superscalar host processor, different latencies of RFUOPs, and configuration loading time; those results are reported in [16].
In summary, the results show that the compilation techniques are able to create and find many instruction sequences for the RFU. Most of their sizes are several instructions, which demonstrates that the fast communication is necessary. The system gives an average speedup of 2.6.

6. CONCLUSION
This paper has described a C compiler for a Processor/FPGA architecture in which the FPGA serves as a Reconfigurable Functional Unit (RFU). We have introduced an instruction combination algorithm to identify RFU sequences of instructions in a basic block. We have also shown that the control localization technique can effectively enlarge basic blocks and find additional sequences. In addition, we have illustrated the RFU support for SWAR: by introducing "sub-register movement" and "collective-move", the instruction combination algorithm is able to identify complex SIMD instructions for the RFU. Finally, we have presented experimental results which demonstrate that these techniques can effectively create and identify larger and more RFU sequences. With the fast communication between the RFU and the processor, the system can achieve considerable speedups.

7. ACKNOWLEDGEMENTS
We would like to thank Scott Hauck for his contribution to this research. We would also like to thank the reviewers for their helpful comments. This work was supported by DARPA under Contract DABT-63-97-0035.

8. REFERENCES
[1] K. Atsuta. DCT implementation. http://marine.et.u-tokai.ac.jp/database/koichi.html.
[2] J. Babb, M. Frank, et al. The RAW Benchmark Suite: Computation Structures for General Purpose Computing. FCCM, Napa Valley, CA, April 1997.
[3] D. Burger and T. Austin. The SimpleScalar Tool Kit. University of Wisconsin-Madison Computer Sciences Department Technical Report #1342, June 1997.
[4] P. Faraboschi, et al. The Latest Word in Digital and Media Processing. IEEE Signal Processing Magazine, March 1998.
[5] R. J. Fisher and H. G. Dietz. Compiling for SIMD Within A Register. 1998 Workshop on Languages and Compilers for Parallel Computing, North Carolina, August 1998.
[6] M. B. Gokhale, et al. Napa C: Compiling for a Hybrid RISC/FPGA Architecture. FCCM 1998, CA, USA.
[7] S. C. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. R. Taylor, and R. Laufer. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. ISCA '99, May 1999, Atlanta, Georgia.
[8] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao. The Chimaera Reconfigurable Functional Unit. IEEE Symposium on FPGAs for Custom Computing Machines, 1997.
[9] J. R. Hauser and J. Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), Napa, CA, April 1997.
[10] Honeywell Inc. Adaptive Computing Systems Benchmarking. /projects/acsbench/
[11] W. Lee, R. Barua, et al. Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine. ASPLOS VIII, October 1998, CA, USA.
[12] S. Rathnam, et al. Processing the New World of Interactive Media. IEEE Signal Processing Magazine, March 1998.
[13] R. Razdan. PRISC: Programmable Reduced Instruction Set Computers. Ph.D. Thesis, Harvard University, Division of Applied Sciences, 1994.
[14] M. J. Wirthlin and B. L. Hutchings. A Dynamic Instruction Set Computer. FCCM, Napa Valley, CA, April 1995.
[15] R. D. Wittig and P. Chow. OneChip: An FPGA Processor with Reconfigurable Logic. FCCM, Napa Valley, CA, April 1996.
[16] Z. A. Ye, A. Moshovos, P. Banerjee, and S. Hauck. "Chimaera, a High Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit." Submitted to the 27th International Symposium on Computer Architecture (ISCA-2000).
DA-07173-001_v8.0 | June 2017 Application Note

TABLE OF CONTENTS
Chapter 1. Maxwell Tuning Guide
1.1. NVIDIA Maxwell Compute Architecture
1.2. CUDA Best Practices
1.3. Application Compatibility
1.4. Maxwell Tuning
1.4.1. SMM
1.4.1.1. Occupancy
1.4.1.2. Instruction Scheduling
1.4.1.3. Instruction Latencies
1.4.1.4. Instruction Throughput
1.4.2. Memory Throughput
1.4.2.1. Unified L1/Texture Cache
1.4.3. Shared Memory
1.4.3.1. Shared Memory Capacity
1.4.3.2. Shared Memory Bandwidth
1.4.3.3. Fast Shared Memory Atomics
1.4.4. Dynamic Parallelism
Appendix A. Revision History

1.1. NVIDIA Maxwell Compute Architecture
Maxwell is NVIDIA's next-generation architecture for CUDA compute applications. Maxwell retains and extends the same CUDA programming model as previous NVIDIA architectures such as Fermi and Kepler, and applications that follow the best practices for those architectures should typically see speedups on the Maxwell architecture without any code changes. This guide summarizes the ways in which an application can be fine-tuned to gain additional speedups by leveraging Maxwell architectural features.[1]
Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves energy efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA's GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, compiler-based scheduling, the number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called SMM) to far exceed Kepler SMX efficiency.
The first Maxwell-based GPU is codenamed GM107 and is designed for use in power-limited environments like notebooks and small form factor (SFF) PCs.
GM107 is described in a whitepaper entitled NVIDIA GeForce GTX 750 Ti: Featuring First-Generation Maxwell GPU Technology, Designed for Extreme Performance per Watt.[2] The first GPU using the second-generation Maxwell architecture is codenamed GM204. Second-generation Maxwell GPUs retain the power efficiency of the earlier generation while delivering significantly higher performance. GM204 is described in a whitepaper entitled NVIDIA GeForce GTX 980: Featuring Maxwell, The Most Advanced GPU Ever Made.

[1] Throughout this guide, Fermi refers to devices of compute capability 2.x, Kepler refers to devices of compute capability 3.x, and Maxwell refers to devices of compute capability 5.x.
[2] The features of GM108 are similar to those of GM107.

Compute programming features of GM204 are similar to those of GM107, except where explicitly noted in this guide. For details on the programming features discussed in this guide, please refer to the CUDA C Programming Guide.

1.2. CUDA Best Practices
The performance guidelines and best practices described in the CUDA C Programming Guide and the CUDA C Best Practices Guide apply to all CUDA-capable GPU architectures. Programmers must primarily focus on following those recommendations to achieve the best performance.

The high-priority recommendations from those guides are as follows:
‣ Find ways to parallelize sequential code.
‣ Minimize data transfers between the host and the device.
‣ Adjust kernel launch configuration to maximize device utilization.
‣ Ensure global memory accesses are coalesced.
‣ Minimize redundant accesses to global memory whenever possible.
‣ Avoid long sequences of diverged execution by threads within the same warp.

1.3. Application Compatibility
Before addressing specific performance tuning issues covered in this guide, refer to the Maxwell Compatibility Guide for CUDA Applications to ensure that your application is compiled in a way that is compatible with Maxwell.

1.4. Maxwell Tuning
1.4.1. SMM
The Maxwell Streaming Multiprocessor, SMM, is similar in many respects to the Kepler architecture's SMX. The key enhancements of SMM over SMX are geared toward improving efficiency without requiring significant increases in available parallelism per SM from the application.

1.4.1.1. Occupancy
The maximum number of concurrent warps per SMM remains the same as in SMX (i.e., 64), and factors influencing warp occupancy remain similar or improved over SMX:
‣ The register file size (64k 32-bit registers) is the same as that of SMX.
‣ The maximum registers per thread, 255, matches that of Kepler GK110. As with Kepler, experimentation should be used to determine the optimum balance of register spilling vs. occupancy, however.
‣ The maximum number of thread blocks per SM has been increased from 16 to 32. This should result in an automatic occupancy improvement for kernels with small thread blocks of 64 or fewer threads (shared memory and register file resource requirements permitting). Such kernels would have tended to under-utilize SMX, but less so SMM.
‣ Shared memory capacity is increased (see Shared Memory Capacity).

As such, developers can expect similar or improved occupancy on SMM without changes to their application. At the same time, warp occupancy requirements (i.e., available parallelism) for maximum device utilization are similar to or less than those of SMX (see Instruction Latencies).

1.4.1.2. Instruction Scheduling
The number of CUDA Cores per SM has been reduced to a power of two; however, with Maxwell's improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of SMM means CUDA Cores per GPU will be substantially higher vs. comparable Fermi or Kepler chips. SMM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design.

As with SMX, each SMM has four warp schedulers.
Unlike SMX, however, all SMM core functional units are assigned to a particular scheduler, with no shared units. Along with the selection of a power-of-two number of CUDA Cores per SM, which simplifies scheduling and reduces stall cycles, this partitioning of SM computational resources in SMM is a major component of the streamlined efficiency of SMM.

The power-of-two number of CUDA Cores per partition simplifies scheduling, as each of SMM's warp schedulers issues to a dedicated set of CUDA Cores equal to the warp width. Each warp scheduler still has the flexibility to dual-issue (such as issuing a math operation to a CUDA Core in the same cycle as a memory operation to a load/store unit), but single-issue is now sufficient to fully utilize all CUDA Cores.

1.4.1.3. Instruction Latencies
Another major improvement of SMM is that dependent math latencies have been significantly reduced; a consequence of this is a further reduction of stall cycles, as the available warp-level parallelism (i.e., occupancy) on SMM should be equal to or greater than that of SMX (see Occupancy), while at the same time each math operation takes less time to complete, improving utilization and throughput.

1.4.1.4. Instruction Throughput
The most significant changes to peak instruction throughputs in SMM are as follows:
‣ The change in number of CUDA Cores per SM brings with it a corresponding change in peak single-precision floating point operations per clock per SM. However, since the number of SMs is typically increased, the result is an increase in aggregate peak throughput; furthermore, the scheduling and latency improvements also discussed above make this peak easier to approach.
‣ The throughput of many integer operations including multiply, logical operations and shift is improved. In addition, there are now specialized integer instructions that can accelerate pointer arithmetic.
These instructions are most efficient when data structures are a power of two in size.

As was already the recommended best practice, signed arithmetic should be preferred over unsigned arithmetic wherever possible for best throughput on SMM. The C language standard places more restrictions on overflow behavior for unsigned math, limiting compiler optimization opportunities.

1.4.2. Memory Throughput
1.4.2.1. Unified L1/Texture Cache
Maxwell combines the functionality of the L1 and texture caches into a single unit. As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.

In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.

Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.

The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler. Two new device attributes were added in CUDA Toolkit 6.0: globalL1CacheSupported and localL1CacheSupported. Developers who wish to have separately-tuned paths for various architecture generations can use these fields to simplify the path selection process.

Enabling caching of globals in GM204 can affect occupancy.
If per-thread-block SM resource usage would result in zero occupancy with caching enabled, the CUDA driver will override the caching selection to allow the kernel launch to succeed. This situation is reported by the profiler.

1.4.3. Shared Memory
1.4.3.1. Shared Memory Capacity
With Fermi and Kepler, shared memory and the L1 cache shared the same on-chip storage. Maxwell, by contrast, provides dedicated space to the shared memory of each SMM, since the functionality of the L1 and texture caches has been merged in SMM. This increases the shared memory space available per SMM as compared to SMX: GM107 provides 64 KB shared memory per SMM, and GM204 further increases this to 96 KB shared memory per SMM.

This presents several benefits to application developers:
‣ Algorithms with significant shared memory capacity requirements (e.g., radix sort) see an automatic 33% to 100% boost in capacity per SM on top of the aggregate boost from higher SM count.
‣ Applications no longer need to select a preference of the L1/shared split for optimal performance. For purposes of backward compatibility with Fermi and Kepler, applications may optionally continue to specify such a preference, but the preference will be ignored on Maxwell, with the full 64 KB per SMM always going to shared memory.

While the per-SM shared memory capacity is increased in SMM, the per-thread-block limit remains 48 KB. For maximum flexibility on possible future GPUs, NVIDIA recommends that applications use at most 32 KB of shared memory in any one thread block, which would for example allow at least two such thread blocks to fit per SMM.

1.4.3.2. Shared Memory Bandwidth
Kepler SMX introduced an optional 8-byte shared memory banking mode, which had the potential to increase shared memory bandwidth per SM over Fermi for shared memory accesses of 8 or 16 bytes.
However, applications could only benefit from this when storing these larger elements in shared memory (i.e., integers and fp32 values saw no benefit), and only when the developer explicitly opted into the 8-byte bank mode via the API.

To simplify this, Maxwell returns to the Fermi style of shared memory banking, where banks are always four bytes wide. Aggregate shared memory bandwidth across the chip remains comparable to that of corresponding Kepler chips, given increased SM count. In this way, all applications using shared memory can now benefit from the higher bandwidth, even when storing only four-byte items into shared memory and without specifying any particular preference via the API.

1.4.3.3. Fast Shared Memory Atomics
Kepler introduced a dramatically higher throughput for atomic operations to global memory as compared to Fermi. However, atomic operations to shared memory remained essentially unchanged: both architectures implemented shared memory atomics using a lock/update/unlock pattern that could be expensive in the case of high contention for updates to particular locations in shared memory.

Maxwell improves upon this by implementing native shared memory atomic operations for 32-bit integers and native shared memory 32-bit and 64-bit compare-and-swap (CAS), which can be used to implement other atomic functions with reduced overhead compared to the Fermi and Kepler methods. Refer to the CUDA C Programming Guide for an example implementation of an fp64 atomicAdd() using atomicCAS().

1.4.4. Dynamic Parallelism
GK110 introduced a new architectural feature called Dynamic Parallelism, which allows the GPU to create additional work for itself. A programming model enhancement leveraging this feature was introduced in CUDA 5.0 to enable kernels running on GK110 to launch additional kernels onto the same GPU.

SMM brings Dynamic Parallelism into the mainstream by supporting it across the product line, even in lower-power chips such as GM107.
This will benefit developers, as it means that applications will no longer need special-case algorithm implementations for high-end GPUs that differ from those usable in more power-constrained environments.

Appendix A. Revision History
Version 1.0
‣ Initial Public Release
Version 1.1
‣ Updated for second-generation Maxwell (compute capability 5.2).

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright © 2012-2017 NVIDIA Corporation. All rights reserved.
MKPROM2 User's Manual
PROM image generator
Document: MKPROM2-UM
Cobham Gaisler AB

1. MKPROM2
1.1. Introduction
1.2. Source code
1.3. Usage
1.4. Creating ROM resident applications (DEPRECATED)
1.5. Internals
1.6. MKPROM2 general options
1.7. LEON2/3/4/5 memory controllers options
1.7.1. General options
1.7.2. Wait states options
1.7.3. Explicit memory controller parameters
1.7.4. EDAC output file options
1.8. LEON3 options
1.9. DDR/DDR2 controller options
1.10. SDCTRL64/FTSDCTRL64 controller options
1.11. FTAHBRAM controller options
1.12. SDCTRL controller options
1.13. SPI memory controller options
1.14. Custom controllers
1.15. Timer initialization
2. Examples
2.1. Quick start
2.1.1. Extracting mkprom2 parameters from GRMON
2.1.2. Loading boot image to ROM
2.1.3. Application start on system reset
2.2. Customizing with bdinit.o
3. Support

This document describes the MKPROM2 boot image generator.

1.1. Introduction
MKPROM2 is a utility program which converts a LEON RAM application image into a bootable ROM image. The resulting bootable ROM image contains system initialization code, an application loader and the RAM application itself.
The RAM application is compressed with a modified LZSS algorithm, typically achieving a compression factor of 2.

The system initialization code and the application loader operate in the following steps:
• The register file of the IU and FPU (if present) is initialized.
• The memory controller, UARTs and timer unit are initialized according to the specified options.
• The RAM application image stored in ROM is decompressed into RAM.
• Finally, the application is started, setting the stack pointer to the top of RAM.

MKPROM2 can create ROM images for ERC32 and LEON2/3/4/5 systems.

The words ROM and PROM are used in this document to denote normally non-volatile memories such as ROM, PROM, EPROM, EEPROM, Flash PROM, MRAM etc. The word RAM is used in this document to denote normally volatile memory such as SRAM, DRAM, SDRAM, and sometimes DDR and DDR2 SDRAM.

1.2. Source code
MKPROM2 comes with full source code included. The source code is located in the <mkprom-dir>/src directory.

The requirements for building MKPROM are a host C compiler and the GCC 7.2.0 version of BCC 2.0.7 or later. MKPROM can also be built with other cross-compilers, such as Linux or RTEMS toolchains, by specifying TOOLCHAIN=</compiler/path> when calling make.

To recompile mkprom, issue a make command inside the source directory. This will compile MKPROM2 into the default location, which is /opt/mkprom2 on Linux and c:/opt/mkprom on Windows. On Windows you should use the MinGW/MSYS build system.

Additionally, the options listed below can also be used with make to control how mkprom is built for different use cases:
• NO_MULITLIB: Controls whether the mkprom binaries should be built with multilib support or not. When cross-compiling with other compilers (Linux), NO_MULITLIB should be set. Default is disabled.
• CCPREFIX: Controls which compiler prefix mkprom defaults to when building a ROM image. Default is sparc-gaisler-elf.
• FREESTANDING: Controls whether mkprom should be compiled with the -ffreestanding flag. Default is enabled.
1.3. Usage
mkprom2 is a command line utility that takes a number of options and files to encapsulate:

mkprom2 -freq <mhz> [options] files

To generate a boot PROM for a typical system with a 50 MHz system clock, use:

$ mkprom2 -freq 50 -v -rmw -ramsize 1024 hello

which generates terminal output similar to:

MKPROM v2.0.62 - boot image generator for LEON applications
Copyright Cobham Gaisler AB 2004-2017, all rights reserved.
phead0: type: 1, off: 65536, vaddr: 40000000, paddr: 40000000, fsize: 27584, msize: 28008
section: .text at 0x40000000, size 26272 bytes
Uncoded stream length: 26272 bytes
Coded stream length: 14091 bytes
Compression Ratio: 1.864
section: .rodata at 0x400066a0, size 128 bytes
Uncoded stream length: 128 bytes
Coded stream length: 38 bytes
Compression Ratio: 3.368
section: .data at 0x40006720, size 1184 bytes
Uncoded stream length: 1184 bytes
Coded stream length: 572 bytes
Compression Ratio: 2.070
Creating LEON3 boot prom: prom.out
[...]
Success!

When executed, the PROM loader prints a configuration message at start-up:

tsim> run
MKPROM2 boot loader v2.0.62
Copyright Cobham Gaisler AB - all rights reserved
system clock : 50.0 MHz
baud rate : 19171 baud
prom : 512 K, (2/2) ws (r/w)
sram : 1024 K, 1 bank(s), 0/0 ws (r/w)
decompressing .text
decompressing .data
decompressing .jcr
starting hello
Hello world!

It is essential that the same -mflat, -qsvt and -msoft-float parameters are given to mkprom2 as were used when the binary was compiled.

1.4. Creating ROM resident applications (DEPRECATED)
Since BCC1 has reached end of life (EOL) and this feature depends on BCC1, creating ROM resident images with MKPROM2 is currently not supported.

1.5. Internals
mkprom2 is delivered with source code. mkprom2 is compiled from the source file mkprom.c.
mkprom2 creates a PROM image through the following steps:
• Parse option switches.
• Calculate the register initialization values from the switches.
• Read in ELF-format object files and extract the load location and section data from them.
• Dump the register values and section data into a file called dump.s. You can preserve and read this file using the -dump option.
• Use the cross-compiler toolchain to compile dump.s and link this file against the boot-loader object files. You can see the command that is issued by adding the -v (-V) switch to mkprom2.

The MKPROM2 binary distribution includes precompiled object code which is used when linking the boot loader. It is available in the installation subdirectory named lib/. The object code has been compiled with workarounds enabled for UT699. Compiling for UT699 represents a conservative approach which includes workarounds for the other LEON components. Also, MKPROM2 is by default compiled with MUL/DIV instructions disabled, which does not pose a big performance impact while covering more systems. In addition, workarounds for the following technical notes have been taken into account for the included object code:
• GRLIB-TN-0009
• GRLIB-TN-0010 (MMU is not enabled when the MKPROM boot loader executes)
• GRLIB-TN-0011 (MMU is not enabled when the MKPROM boot loader executes)
• GRLIB-TN-0012 (FPU operations are not performed by the MKPROM boot loader)
• GRLIB-TN-0013 (FPU operations are not performed by the MKPROM boot loader)
• GRLIB-TN-0020

There is in general no need for the user to recompile the MKPROM boot loader target object code. Advanced users who want to compile the target libraries can specify compiler options in the file src/multilibs_bcc2.

The MKPROM boot loader run-time is self-contained and does not bring in startup code provided by the tool chain used to link the boot image.

1.6. MKPROM2 general options
In GRMON, the command info mkprom2 is available for extracting mkprom2 parameters for the memory controller, timer, UART and interrupt controller.
The extracted parameters can be used as a starting point for mkprom2 and should be verified against the target system by the user.

grmon2> info mkprom2
Mkprom2 switches:
-leon3 -freq 48 -rmw -ramsize 8192 -sdram 128 -sdrambanks 2 -trfc 83

For hardware without an FPU, the -msoft-float option has to be given to mkprom2. Note that the FPU registers will be cleared regardless of the -msoft-float flag if an FPU is present; however, the FPU will be turned off when entering the application if -msoft-float has been given.

Table 1.1. Linking options
Table 1.2. General options

1.7. LEON2/3/4/5 memory controllers options
The LEON2/3/4/5 memory controller options apply to MCTRL, FTMCTRL, SRCTRL and FTSRCTRL.

1.7.1. General options
The options in Table 1.3 are used by the MKPROM2 tool to configure the memory controller. MKPROM2 uses these options to calculate suitable values which are used by the boot code to initialize the memory controller. The options in Table 1.3 are interpreted by MKPROM2 at the time when the boot image is generated.

Table 1.3. Memory controller configuration register options

1.7.2. Wait states options
The wait states parameters set the corresponding wait states field in the memory controller. It shall be noted that the effective number of wait states depends on the target system. In particular, some memory controllers have the synthesis parameter wsshift which affects how the wait states field maps to the effective number of wait states. For information on how the wait state fields are translated into actual wait states, consult the component documentation. For custom designs, see the GRLIB IP Core User's Manual.

Table 1.4. Memory controller wait states options

1.7.3. Explicit memory controller parameters
An optional way to specify the initialization values for the memory controller configuration registers is to use the options in Table 1.5. The -mcfg{1|2|3} options have priority over the parameters in Table 1.3 for memory controller initialization during boot.
However, the options in Table 1.3 are still used for calculating the initial stack pointer, RAM wash size and EDAC allocation when using an 8-bit SRAM bus.

It is recommended NOT to use the -mcfg{1|2|3} options at all. Instead, go through the memory related options in Table 1.3 and match them with the target system. This also provides a straightforward way to enable/disable individual options by only changing the human readable command line switch. If the -mcfg{1|2|3} options must be used, make sure to use all of them and do not include any other memory controller switches that influence the same registers.

To get a human readable printout of the current memory controller configuration from GRMON, use:

grmon2> info sys mctrl0
[...]
grmon2> info reg -v mctrl0
[...]

Table 1.5. Memory controller configuration direct options

1.7.4. EDAC output file options
The options in Table 1.6 allow generation of additional output files with PROM EDAC check bits.

Table 1.6. EDAC output file options

Currently the following IP cores are detected and initialized using plug and play: DDR2SPA, DDRSPA, SDCTRL, IRQMP, APBUART, GPTIMER, GRTIMER, MCTRL, FTMCTRL, FTSRCTRL, FTAHBRAM.

1.8. LEON3 options
Table 1.7. MKPROM2 options for LEON3

… stack for each processor. The convention in software is that [bss-end, end-of-stack] defines the available memory region for each processor. Finally, the IRQAMP controller can be configured using the -mpirqsel option. Below is an example of an AMP system with 2 processors: one RTEMS image running at 0x0, the other at 0x40000000.

$ mkprom2 \
-freq 50 \
-mp \
-mpstart 0x3 \
-mpirqsel 0 0 \
-mpirqsel 1 1 \
-mpuart 2 0xF0000000 0xf0001000 \
-mpstack 2 0x3fffff00 0x400fff00 \
-mpentry 2 0x0 0x40000000 \
rtems-tasks-0x00000000 rtems-tasks-0x40000000 -o amp.prom

1.9. DDR/DDR2 controller options
Table 1.8. MKPROM2 options for DDR/DDR2 controller

1.10. SDCTRL64/FTSDCTRL64 controller options
Table 1.9. MKPROM2 options for SDCTRL64/FTSDCTRL64 controller

1.11. FTAHBRAM controller options
Table 1.10. MKPROM2 options for FTAHBRAM controller

1.12. SDCTRL controller options
Table 1.11. MKPROM2 options for SDCTRL controller

1.13. SPI memory controller options
Table 1.12.
MKPROM2 options for SPI memory controller

1.14. Custom controllers
If the target LEON3 system contains a custom controller, the initialization of the controller must be made through the bdinit1 function. Below is an example of a suitable bdinit.c file. The file should be compiled with sparc-gaisler-elf-gcc -O2 -c -msoft-float, and mkprom2 should be run with the -bdinit option.

void bdinit1() {
    <.. your init code here ..>
}

void bdinit2() {
}

1.15. Timer initialization
This section describes the default initialization of GPTIMER and GRTIMER cores on LEON3 systems with AMBA plug and play.
• Only timer cores (GPTIMER/GRTIMER) on the first APB bus are initialized by mkprom2.
• GPTIMER cores are initialized before GRTIMER cores.
• The timer core prescaler reload value is set such that it underflows once every microsecond. The -freq parameter is used to calculate the prescaler value.
• The first subtimer of each timer core is configured with reload value 0xffffffff. Its control register is then initialized such that the subtimer is loaded, enabled and set in restart mode.
• The last subtimer of the first timer core (watchdog) is configured with reload value 300000000 (5 minutes). Its control register is not initialized (the reset value remains).
• All other subtimers are initialized with 0 in their counter value registers, reload value registers and control registers.

This default timer initialization can be overridden by bdinit1() as described in this document.

2. Examples
This chapter contains examples of how MKPROM2 can be used.

2.1. Quick start
This is a quick start tutorial on how to get up and running with MKPROM2 on a typical LEON system. The UT700 LEAP board is used for the purpose of this example.
2.1.1. Extracting mkprom2 parameters from GRMON
This example shows how GRMON can be used to extract the parameters to use with the mkprom2 frontend.

$ grmon -uart /dev/ttyUSB0
GRMON2 LEON debug monitor v2.0.87 pro version
[...]
using port /dev/ttyUSB0 @ 115200 baud
Device ID: 0x699
GRLIB build version: 4110
Detected system: UT699E/UT700
Detected frequency: 67 MHz
[...]
grmon2> info mkprom2
Mkprom2 switches:
-leon3 -freq 67 -nosram -sdram 32 -trfc 75 -trp 30 -baud 38417
grmon2>

The switches presented by the info mkprom2 command correspond to system properties that GRMON has probed as part of its initialization. This includes, for example:
• System clock frequency
• Memory controller configuration
• Properties of installed memory
• UART baud rate, as specified by the GRMON command line or the default.

These extracted parameters can be used directly with the mkprom2 command line tool:

$ sparc-rtems-gcc rtems-hello.c -O2 -o rtems-hello.elf
$ mkprom2 -leon3 -freq 67 -nosram -sdram 32 -trfc 75 -trp 30 -baud 38417 rtems-hello.elf -o rtems-hello.rom
MKPROM v2.0.65 - boot image generator for LEON applications
Copyright Cobham Gaisler AB 2004-2017, all rights reserved.
Creating LEON3 boot prom: rtems-hello.rom
Success!

Note that GRMON has limited knowledge of the installed external memory: the density is often correct but the timing parameters are estimated. This means that the memory controller parameters must be verified against the relevant data sheet by the user.

2.1.2. Loading boot image to ROM
Loading the boot image to the ROM is board dependent.
For the UT700 LEAP board, it can be loaded to the on-board MRAM using the GRMON2 load command:

grmon2> load -wprot rtems-hello.rom
00000000 .text 88.8kB / 88.8kB [===============>] 100%
Total size: 88.83kB (95.23kbit/s)
Entry point 0x0
Image rtems-hello.rom loaded

The option -wprot temporarily disables the memory controller write protection during the load operation.

As a sanity check, the GRMON2 command verify can be used to verify that the image was loaded correctly:

grmon2> verify rtems-hello.rom
00000000 .text 88.8kB / 88.8kB [===============>] 100%
Total size: 88.83kB (51.03kbit/s)
Entry point 0x0
Image of rtems-hello.rom verified without errors
grmon2> dis 0 4
0x00000000: 88100000 clr %g4
0x00000004: 09000000 sethi %hi(0x0), %g4
0x00000008: 81c120b0 jmp %g4 + 0xB0
0x0000000c: 01000000 nop

The first couple of instructions are also disassembled in the example above.

2.1.3. Application start on system reset
On system reset, the following message is displayed on the first UART (APBUART0):

MKPROM2 boot loader v2.0.65
Copyright Cobham Gaisler AB - all rights reserved
system clock : 67.0 MHz
baud rate : 38417 baud
prom : 512 K, (2/2) ws (r/w)
sdram : 32768 M, 1 bank(s), 9-bit column
sdram : cas: 2, trp: 45 ns, trfc: 90 ns, refresh 7.8 us
decompressing .text to 0x40000000
decompressing .data to 0x40023530
decompressing .jcr to 0x400246f0
starting rtems-hello.elf
[...]

2.2. Customizing with bdinit.o
MKPROM2 boot code initializes the most essential and common peripherals. Custom initializations can be defined by the user with the mkprom2 option -bdinit.

The reason behind the -bdinit concept is to limit the complexity of the standard boot loader code. Another reason is that custom initialization is highly board and application specific, so it would be hard to provide a slim and useful interface for all possible kinds of peripheral initializations.
The primary area for MKPROM2 built-in initialization is CPU internal registers, memory controller registers, the APBUART and optionally memory initialization when EDAC is used.

Usage of -bdinit is described in Table 1.2. A file named bdinit.o is picked up from the current directory and is assumed to contain entry points for the functions:
• bdinit0()
• bdinit1()
• bdinit2()

The bdinit.o file can be compiled for example with:

sparc-gaisler-rtems5-gcc -Os -Wall -Wextra -pedantic -c bdinit.c -o bdinit.o

Any LEON compatible compiler can be used to compile bdinit.o. It is important to provide the -c option to only do the compile step and not linking. It is also recommended to use optimization level -Os.

Care must be taken when implementing bdinit0() and bdinit1(), since they are called by the boot code before memory is available in the run-time. It is safe to keep them as empty void functions:

/*
 * bdinit0 is called before peripherals have been initialized and before RAM is
 * available.
 */
void bdinit0(void) { }

/*
 * bdinit1 is called after peripheral registers have been initialized, but
 * before RAM is available.
 */
void bdinit1(void) { }

/* bdinit2 is called after the MKPROM boot loader has initialized memory. */
void bdinit2(void)
{
    /* Do some custom initialization, for example clock gating. */
}

3. Support
When contacting support, please identify yourself in full, including company affiliation and site name and address.
Please identify exactly which product is used, specifying if it is an IP core (with the full name of the library distribution archive file), component, software version, compiler version, operating system version, debug tool version, simulator tool version, board version, etc.

The support service is only for paying customers with a support contract.

Cobham Gaisler AB
Kungsgatan 12
411 19 Gothenburg
Sweden
/Gaisler
*****************
T: +46 31 7758650
F: +46 31 421407

Cobham Gaisler AB reserves the right to make changes to any products and services described herein at any time without notice. Consult the company or an authorized sales representative to verify that the information in this document is current before using this product. The company does not assume any responsibility or liability arising out of the application or use of any product or service described herein, except as expressly agreed to in writing by the company; nor does the purchase, lease, or use of a product or service from the company convey a license under any patent rights, copyrights, trademark rights, or any other of the intellectual rights of the company or of third parties. All information is provided as is. There is no warranty that it is correct or suitable for any purpose, neither implicit nor explicit.

Copyright © 2022 Cobham Gaisler AB
Package 'odin'
October 2, 2023

Title ODE Generation and Integration
Version 1.2.5
Description Generate systems of ordinary differential equations (ODE) and integrate them, using a domain specific language (DSL). The DSL uses R's syntax, but compiles to C in order to efficiently solve the system. A solver is not provided, but instead interfaces to the packages 'deSolve' and 'dde' are generated. With these, while solving the differential equations, no allocations are done and the calculations remain entirely in compiled code. Alternatively, a model can be transpiled to R for use in contexts where a C compiler is not present. After compilation, models can be inspected to return information about parameters and outputs, or intermediate values after calculations. 'odin' is not targeted at any particular domain and is suitable for any system that can be expressed primarily as mathematical expressions. Additional support is provided for working with delays (delay differential equations, DDE), using interpolated functions during interpolation, and for integrating quantities that represent arrays.
License MIT + file LICENSE
URL https://github.com/mrc-ide/odin
BugReports https://github.com/mrc-ide/odin/issues
Imports R6, cinterpolate (>= 1.0.0), deSolve, digest, glue, jsonlite, ring, withr
Suggests dde (>= 1.0.0), jsonvalidate (>= 1.1.0), knitr, mockery, pkgbuild, pkgload, rlang, rmarkdown, testthat
VignetteBuilder knitr
RoxygenNote 7.1.1
Encoding UTF-8
Language en-GB
NeedsCompilation no
Author Rich FitzJohn [aut, cre], Thibaut Jombart [ctb], Imperial College of Science, Technology and Medicine [cph]
Maintainer Rich FitzJohn <***********************>
Repository CRAN
Date/Publication 2023-10-02 13:40:11 UTC

R topics documented:
can_compile
odin
odin_build
odin_ir
odin_ir_deserialise
odin_options
odin_package
odin_parse
odin_validate
Index

can_compile    Test if compilation is possible

Description
Test if compilation appears possible. This is used in some examples, and tries compiling a trivial C program with pkgbuild. Results are cached between
runs within a session so this should be fast to rely on.Usagecan_compile(verbose=FALSE,refresh=FALSE)Argumentsverbose Be verbose when running commands?refresh Try again to compile,skipping the cached value?DetailsWe use pkgbuild in order to build packages,and it includes a set of heuristics to locate and organise your C compiler.The most likely people affected here are Windows users;if you get this ensure that you have rtools ing pkgbuild::find_rtools()with debug=TRUE may be helpful for diagnosing compiler issues.odin3ValueA logical scalarExamplescan_compile()#will take~0.1s the first timecan_compile()#should be basically instantaneousodin Create an odin modelDescriptionCreate an odin model from afile,text string(s)or expression.The odin_version is a"standard evaluation"escape hatch.Usageodin(x,verbose=NULL,target=NULL,workdir=NULL,validate=NULL,pretty=NULL,skip_cache=NULL,compiler_warnings=NULL,no_check_unused_equations=NULL,options=NULL)odin_(x,verbose=NULL,target=NULL,workdir=NULL,validate=NULL,pretty=NULL,skip_cache=NULL,compiler_warnings=NULL,no_check_unused_equations=NULL,options=NULL)Argumentsx Either the name of afile to read,a text string(if length is greater than1elements will be joined with newlines)or an expression.verbose Logical scalar indicating if the compilation should be verbose.Defaults to the value of the option odin.verbose or FALSE otherwise.target Compilation target.Options are"c"and"r",defaulting to the option odin.target or"c"otherwise.workdir Directory to use for any generatedfiles.This is only relevant for the"c"target.Defaults to the value of the option odin.workdir or tempdir()otherwise.validate Validate the model’s intermediate representation against the included schema.Normally this is not needed and is intended primarily for development use.De-faults to the value of the option odin.validate or FALSE otherwise.pretty Pretty-print the model’s intermediate representation.Normally this is not needed and is intended primarily for 
development use.Defaults to the value of theoption odin.pretty or FALSE otherwise.skip_cache Skip odin’s cache.This might be useful if the model appears not to compile when you would expect it to.Hopefully this will not be needed often.Defaultsto the option odin.skip_cache or FALSE otherwise.4odin compiler_warningsPreviously this attempted detection of compiler warnings(with some degree ofsuccess),but is currently ignored.This may become supported again in a futureversion depending on underlying support in pkgbuild.no_check_unused_equationsIf TRUE,then don’t print messages about unused variables.Defaults to the optionodin.no_check_unused_equations or FALSE otherwise.options Named list of options.If provided,then all other options are ignored.DetailsDo not use odin::odin in a package;you almost certainly want to use odin_package instead.A generated model can return information about itself;odin_irValueAn odin_generator object(an R6class)which can be used to create model instances.User parametersIf the model accepts user parameters,then the parameter to the constructor or the$set_user() method can be used to control the behaviour when unknown user actions are passed into the model.Possible values are the strings stop(throw an error),warning(issue a warning but keep go-ing),message(print a message and keep going)or ignore(do nothing).Defaults to the option odin.unused_user_action,or warning otherwise.Delay equations with ddeWhen generating a model one must chose between using the dde package to solve the system or the default deSolve.Future versions may allow this to switch when using run,but for now this requires tweaking the generated code to a point where one must decide at generation.dde implements only the Dormand-Prince5th order dense output solver,with a delay equation solver that may perform better than the solvers in deSolve.For non-delay equations,deSolve is very likely to outperform the simple solver implemented.Author(s)Rich FitzJohnExamples##Compile the 
model;exp_decay here is an R6ClassGenerator and will##generate instances of a model of exponential decay:exp_decay<-odin::odin({deriv(y)<--0.5*yinitial(y)<-1},target="r")##Generate an instance;there are no parameters here so all instances##are the same and this looks a bit pointless.But this step isodin_build5 ##required because in general you don t want to have to compile the##model every time it is used(so the generator will go in a##package).mod<-exp_decay$new()##Run the model for a series of times from0to10:t<-seq(0,10,length.out=101)y<-mod$run(t)plot(y,xlab="Time",ylab="y",main="",las=1)odin_build Build an odin model generator from its IRDescriptionBuild an odin model generator from its intermediate representation,as generated by odin_parse.This function is for advanced use.Usageodin_build(x,options=NULL)Argumentsx An odin ir(json)object or output from odin_validate.options Options to pass to the build stage(see odin_optionsDetailsIn applications that want to inspect the intermediate representation rather before compiling,ratherthan directly using odin,use either odin_parse or odin_validate and then pass the result to odin::odin_build.The return value of this function includes information about how long the compilation took,if itwas successful,etc,in the same style as odin_validate:success Logical,indicating if compilation was successfulelapsed Time taken to compile the model,as a proc_time object,as returned by proc.time.output Any output produced when compiling the model(only present if compiling to C,and if the cache was not hit.model The model itself,as an odin_generator object,as returned by odin.ir The intermediate representation.error Any error thrown during compilationSee Alsoodin_parse,which creates intermediate representations used by this function.6odin_irExamples#Parse a model of exponential decayir<-odin::odin_parse({deriv(y)<--0.5*yinitial(y)<-1})#Compile the model:options<-odin::odin_options(target="r")res<-odin::odin_build(ir,options)#All 
results:res#The model:mod<-res$model$new()mod$run(0:10)odin_ir Return detailed information about an odin modelDescriptionReturn detailed information about an odin model.This is the mechanism through which coef works with odin.Usageodin_ir(x,parsed=FALSE)Argumentsx An odin_generator function,as created by odin::odinparsed Logical,indicating if the representation should be parsed and converted into an R object.If FALSE we return a json string.WarningThe returned data is subject to change for a few versions while I work out how we’ll use it. Examplesexp_decay<-odin::odin({deriv(y)<--0.5*yinitial(y)<-1},target="r")odin::odin_ir(exp_decay)coef(exp_decay)odin_ir_deserialise7 odin_ir_deserialise Deserialise odin’s IRDescriptionDeserialise odin’s intermediate model representation from a json string into an R object.Unlike the json,there is no schema for this representation.This function provides access to the same deserialisation that odin uses internally so may be useful in applications.Usageodin_ir_deserialise(x)Argumentsx An intermediate representation as a json stringValueA named listSee Alsoodin_parseExamples#Parse a model of exponential decayir<-odin::odin_parse({deriv(y)<--0.5*yinitial(y)<-1})#Convert the representation to an R objectodin::odin_ir_deserialise(ir)odin_options Odin optionsDescriptionFor lower-level odin functions odin_parse,odin_validate we only accept a list of options rather than individually named options.8odin_options Usageodin_options(verbose=NULL,target=NULL,workdir=NULL,validate=NULL,pretty=NULL,skip_cache=NULL,compiler_warnings=NULL,no_check_unused_equations=NULL,rewrite_dims=NULL,rewrite_constants=NULL,substitutions=NULL,options=NULL)Argumentsverbose Logical scalar indicating if the compilation should be verbose.Defaults to the value of the option odin.verbose or FALSE otherwise.target Compilation target.Options are"c"and"r",defaulting to the option odin.target or"c"otherwise.workdir Directory to use for any generatedfiles.This is only 
relevant for the"c"target.Defaults to the value of the option odin.workdir or tempdir()otherwise.validate Validate the model’s intermediate representation against the included schema.Normally this is not needed and is intended primarily for development use.De-faults to the value of the option odin.validate or FALSE otherwise.pretty Pretty-print the model’s intermediate representation.Normally this is not needed and is intended primarily for development use.Defaults to the value of theoption odin.pretty or FALSE otherwise.skip_cache Skip odin’s cache.This might be useful if the model appears not to compile when you would expect it to.Hopefully this will not be needed often.Defaultsto the option odin.skip_cache or FALSE otherwise.compiler_warningsPreviously this attempted detection of compiler warnings(with some degree ofsuccess),but is currently ignored.This may become supported again in a futureversion depending on underlying support in pkgbuild.no_check_unused_equationsIf TRUE,then don’t print messages about unused variables.Defaults to the optionodin.no_check_unused_equations or FALSE otherwise.rewrite_dims Logical,indicating if odin should try and rewrite your model dimensions(if us-ing arrays).If TRUE then we replace dimensions known at compile-time withliteral integers,and those known at initialisation with simplified and shared ex-pressions.You may get less-comprehensible error messages with this optionset to TRUE because parts of the model have been effectively evaluated duringprocessing.rewrite_constantsLogical,indicating if odin should try and rewrite all constant scalars.This is asuperset of rewrite_dims and may be slow for large models.Doing this willmake your model less debuggable;error messages will reference expressionsthat have been extensively rewritten,some variables will have been removedentirely or merged with other identical expressions,and the generated code maynot be obviously connected to the original code.odin_package9 substitutions 
Optionally,a list of values to substitute into model specification as constants, even though they are declared as user().This will be most useful in conjunctionwith rewrite_dims to create a copy of your model with dimensions known atcompile time and all loops using literal integers.options Named list of options.If provided,then all other options are ignored.ValueA list of parameters,of class odin_optionsExamplesodin_options()odin_package Create odin model in a packageDescriptionCreate an odin model within an existing package.Usageodin_package(path_package)Argumentspath_package Path to the package root(the directory that contains DESCRIPTION)DetailsI am resisting the urge to actually create the package here.There are better options than I cancome up with;for example devtools::create,pkgkitten::kitten,mason::mason,or creating DESCRIPTIONfiles using desc.What is required here is that your package:•Lists odin in Imports:•Includes useDynLib(<your package name>)in NAMESPACE(possibly via a roxygen comment @useDynLib<your package name>•To avoid a NOTE in R CMD check,import something from odin in your namespace(e.g., importFrom("odin","odin")s or roxygen@importFrom(odin,odin)Point this function at the package root(the directory containing DESCRIPTION and it will write outfiles src/odin.c and odin.R.Thesefiles will be overwritten without warning by running this again.10odin_parseExamplespath<-tempfile()dir.create(path)src<-system.file("examples/package",package="odin",mustWork=TRUE)file.copy(src,path,recursive=TRUE)pkg<-file.path(path,"package")#The package is minimal:dir(pkg)#But contains odin files in inst/odindir(file.path(pkg,"inst/odin"))#Compile the odin code in the packageodin::odin_package(pkg)#Which creates the rest of the package structuredir(pkg)dir(file.path(pkg,"R"))dir(file.path(pkg,"src"))odin_parse Parse an odin modelDescriptionParse an odin model,returning an intermediate representation.The odin_parse_version is a"stan-dard evaluation"escape 
hatch.Usageodin_parse(x,type=NULL,options=NULL)odin_parse_(x,options=NULL,type=NULL)Argumentsx An expression,character vector orfilename with the odin codetype An optional string indicating the the type of input-must be one of expression, file or text if provided.This skips the type detection code used by odin andmakes validating user input easier.options odin options;see odin_options.The primary options that affect the parse stage are validate and pretty.DetailsA schema for the intermediate representation is available in the package as schema.json.It issubject to change at this point.See Alsoodin_validate,which wraps this function where parsing might fail,and odin_build for building odin models from an intermediate representation.Examples#Parse a model of exponential decayir<-odin::odin_parse({deriv(y)<--0.5*yinitial(y)<-1})#This is odin s intermediate representation of the modelir#If parsing odin models programmatically,it is better to use#odin_parse_;construct the model as a string,from a file,or as a#quoted expression:code<-quote({deriv(y)<--0.5*yinitial(y)<-1})odin::odin_parse_(code)odin_validate Validate an odin modelDescriptionValidate an odin model.This function is closer to odin_parse_than odin_parse because it does not do any quoting of the code.It is primarily intended for use within other applications.Usageodin_validate(x,type=NULL,options=NULL)Argumentsx An expression,character vector orfilename with the odin codetype An optional string indicating the the type of input-must be one of expression, file or text if provided.This skips the type detection code used by odin andmakes validating user input easier.options odin options;see odin_options.The primary options that affect the parse stage are validate and pretty.Detailsodin_validate will always return a list with the same elements:success A boolean,TRUE if validation was successfulresult The intermediate representation,as returned by odin_parse_,if the validation was success-ful,otherwise NULLerror An 
error object if the validation was unsuccessful,otherwise NULL.This may be a classed odin error,in which case it will contain source location information-see the examples for details.messages A list of messages,if the validation returned any.At present this is only non-fatal infor-mation about unused variables.Author(s)Rich FitzJohnExamples#A successful validation:odin::odin_validate(c("deriv(x)<-1","initial(x)<-1"))#A complete failure:odin::odin_validate("")#A more interesting failurecode<-c("deriv(x)<-a","initial(x)<-1")res<-odin::odin_validate(code)res#The object res$error is an odin_error object:res$error#It contains information that might be used to display to a#user information about the error:unclass(res$error)#Notes are raised in a similar way:code<-c("deriv(x)<-1","initial(x)<-1","a<-1")res<-odin::odin_validate(code)res$messages[[1]]Indexcan_compile,2coef,6odin,3,5odin_(odin),3odin_build,5,11odin_ir,4,6odin_ir_deserialise,7odin_options,5,7,10,11odin_package,4,9odin_parse,5,7,10,11odin_parse_,11,12odin_parse_(odin_parse),10odin_validate,5,7,11,11pkgbuild::find_rtools(),2proc.time,5tempdir(),3,813。
Advanced Micro DevicesCompiler UsageGuidelines for 64-BitOperating Systems onAMD64 PlatformsApplication NotePublication # 32035 Revision: 3.18 Issue Date: June 2005© 2004, 2005 Advanced Micro Devices, Inc. All rights reserved.The contents of this document are provided in connection withAdvanced Micro Devices, Inc. (“AMD”) products. AMD makes norepresentations or warranties with respect to the accuracy or completeness ofthe contents of this publication and reserves the right to make changes tospecifications and product descriptions at any time without notice. No license,whether express, implied, arising by estoppel, or otherwise, to any intellectualproperty rights are granted by this publication. Except as set forth in AMD’sStandard Terms and Conditions of Sale, AMD assumes no liability whatsoever,and disclaims any express or implied warranty, relating to its productsincluding, but not limited to, the implied warranty of merchantability, fitnessfor a particular purpose, or infringement of any intellectual property right.AMD’s products are not designed, intended, authorized or warranted for use ascomponents in systems intended for surgical implant into the body, or in otherapplications intended to support or sustain life, or in any other application inwhich the failure of AMD’s product could create a situation where personalinjury, death, or severe property or environmental damage may occur. 
AMDreserves the right to discontinue or make changes to its products at any timewithout notice.TrademarksAMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.Pentium and MMX are registered trademarks of Intel Corporation.SPEC is a registered trademark of the Standard Performance Evaluation Corporation.Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.32035 Rev. 3.18 June 2005 Compiler Usage Guidelines for 64-Bit Operating Systems on AMD64 PlatformsContents 3ContentsRevision History (10)Chapter 1Introduction............................................................................................................11 1.1 Audience..........................................................................................................................11 1.2Intent of Document..........................................................................................................11 1.3 Definitions, Abbreviations, and Notation (11)1.4 Additional Documents (12)Chapter 2List of Compiler Vendors for AMD Athlon™ 64 and AMD Opteron™ Processors...............................................................................................................13 2.1 Compilers (64-Bit) for Linux. (13)2.1.1 GCC (13)2.1.2 Intel (14)2.1.3 PathScale (14)2.1.4 PGI (14)2.2 Compilers (64-Bit) for Microsoft ® Windows ® (14)2.2.1 Intel (14)2.2.2 Microsoft (14)2.2.3 PGI (14)2.3Compilers (64-bit) for Solaris..........................................................................................15 2.3.1 Sun (15)2.4 Compilers (32-Bit) for Linux...........................................................................................15 2.4.1 GCC. 
(15)2.4.2 Intel (15)2.4.3 PathScale (15)2.4.4 PGI (16)2.5 Compilers (32-Bit) for Microsoft Windows (16)2.5.1 Intel (16)2.5.2 Microsoft (16)2.5.3 PGI (16)Compiler Usage Guidelines for 64-Bit OperatingSystems on AMD64 Platforms32035 Rev. 3.18 June 20054 Contents2.6 Compilers (32-bit) for Sun Solaris (16)2.6.1 Sun (16)Chapter 3Performance-Centric Compiler Switches...........................................................17 3.1GCC Compilers (64-Bit) for Linux.................................................................................17 3.1.1Recommended Compiler Versions..........................................................................17 3.1.2 Invocation Commands.............................................................................................18 3.1.3Generic Performance Switches...............................................................................19 3.1.4 Other Switches........................................................................................................20 3.2 Intel Compilers (64-Bit) for Linux. (21)3.2.1 Invocation Commands.............................................................................................21 3.2.2Generic Performance Switches...............................................................................21 3.2.3 Other Switches........................................................................................................21 3.3 PathScale Compilers (64-Bit) for Linux. (22)3.3.1 Invocation Commands.............................................................................................22 3.3.2Generic Performance Switches...............................................................................22 3.3.3 Other Switches........................................................................................................22 3.4 PGI Compilers (64-Bit) for Linux.. 
(23)3.4.1 Invocation Commands.............................................................................................23 3.4.2Generic Performance Switches...............................................................................23 3.4.3 Other Switches........................................................................................................24 3.5 Intel Compilers (64-Bit) for Microsoft Windows.. (24)3.5.1 Invocation Commands.............................................................................................24 3.5.2Generic Performance Switches...............................................................................24 3.5.3 Other Switches.. (25)3.6 Microsoft Compilers (64-Bit) for Microsoft Windows (25)3.6.1 Invocation Commands (25)3.6.2Generic Performance Switches...............................................................................25 3.7 PGI Compilers (64-Bit) for Microsoft Windows (26)3.7.1 Invocation Commands.............................................................................................26 3.7.2Generic Performance Switches...............................................................................26 3.7.3 Other Switches.. (26)32035 Rev. 
3.18 June 2005 Compiler Usage Guidelines for 64-Bit Operating Systems on AMD64 PlatformsContents 53.8 Sun Compilers (64-bit) for Solaris (27)3.8.1 Invocation Commands.............................................................................................27 3.8.2Generic Performance Switches................................................................................27 3.8.3 Other Switches.........................................................................................................27 3.9 GCC Compilers (32-Bit) for Linux (28)3.9.1 Recommended Compiler Versions (28)3.9.2 Invocation Commands.............................................................................................29 3.9.3Generic Performance Switches................................................................................30 3.9.4 Other Switches.........................................................................................................31 3.10 Intel Compilers (32-Bit) for Linux. (32)3.10.1 Invocation Commands.............................................................................................32 3.10.2Generic Performance Switches................................................................................33 3.10.3 Other Switches.........................................................................................................33 3.11 PathScale Compilers (32-Bit) for Linux.. (34)3.11.1 Invocation Commands.............................................................................................34 3.11.2Generic Performance Switches................................................................................34 3.11.3 Other Switches.........................................................................................................34 3.12 PGI Compilers (32-Bit) for Linux.. 
(35)3.12.1 Invocation Commands.............................................................................................35 3.12.2Generic Performance Switches................................................................................36 3.12.3 Other Switches.........................................................................................................36 3.13 Intel Compilers (32-Bit) for Microsoft Windows (36)3.13.1 Invocation Commands.............................................................................................36 3.13.2Generic Performance Switches................................................................................37 3.13.3 Other Switches (37)3.14 Microsoft Compilers (32-Bit) for Microsoft Windows (38)3.14.1 Invocation Command (38)3.14.2Generic Performance Switches................................................................................38 3.14.3 Other Switches.........................................................................................................38 3.15 PGI Compilers (32-Bit) for Microsoft Windows. (39)3.15.1 Invocation Commands (39)Compiler Usage Guidelines for 64-Bit OperatingSystems on AMD64 Platforms32035 Rev. 3.18 June 20056 Contents3.15.2Generic Performance Switches...............................................................................39 3.15.3 Other Switches........................................................................................................40 3.16 Sun Studio Compilers (32-bit) for Solaris . (40)3.16.1 Invocation Commands.............................................................................................40 3.16.2Generic Performance Switches...............................................................................40 3.16.3 Other Switches.. 
(41)Chapter 4Troubleshooting and Portability Issues...............................................................43 4.1 GCC Compilers (64-Bit) for Linux (43)4.1.1 Compilation Errors (43)4.1.2 Link-Time Errors (44)4.1.3 Run-Time Errors (44)4.1.4Compiled and Linked Code Generates Unexpected Results...................................44 4.1.5Program Gives Unexpected Results or Exception Behavior...................................45 4.2Intel Compilers (64-Bit) for Linux..................................................................................45 4.3PathScale Compilers (64-Bit) for Linux.........................................................................46 4.4 PGI Compilers (64-Bit) for Linux.. (46)4.4.1 Interoperability Between Languages (46)4.4.2 Run-Time Errors.....................................................................................................48 4.4.3 Compiled and Linked Code Generates Unexpected Results.. (48)4.4.4Program Gives Unexpected Results or Terminates Unexpectedly.........................48 4.5Intel Compilers (64-Bit) for Microsoft Windows...........................................................49 4.6 Microsoft Compilers for (64-Bit) Microsoft Windows (49)4.6.1 Compilation Errors (49)4.6.2 Run-Time Errors (49)4.6.3Compiled and Linked Code Generates Unexpected Results...................................49 4.6.4Program Gives Unexpected Results or Exception Behavior...................................50 4.7PGI Compilers (64-Bit) for Microsoft Windows............................................................50 4.7.1Interoperability Between Languages.......................................................................50 4.7.2 Run-Time Errors.. (52)4.7.3Compiled and Linked Code Generates Unexpected Results...................................52 4.7.4 Program Gives Unexpected Results or Terminates Unexpectedly. (53)32035 Rev. 
3.18 June 2005 Compiler Usage Guidelines for 64-Bit Operating Systems on AMD64 PlatformsContents 74.8Sun Compilers (64-bit) for Solaris...................................................................................53 4.9 GCC Compilers (32-Bit) for Linux (53)4.9.1 Compilation Errors (53)4.9.2 Link-Time Errors (54)4.9.3 Run-Time Errors (54)4.9.4Compiled and Linked Code Generates Unexpected Results...................................54 4.9.5Program Gives Unexpected Results or Exception Behavior...................................55 4.10 Intel Compilers (32-Bit) for Linux. (55)4.10.1 Compilation Errors (55)4.10.2 Link-Time Errors (56)4.10.3Compiled and Linked Code Generates Unexpected Results...................................56 4.10.4Program Terminates Unexpectedly..........................................................................56 4.11PathScale Compilers (32-Bit) for Linux..........................................................................57 4.12 PGI Compilers (32-Bit) for Linux.. (57)4.12.1 Interoperability Between Languages (57)4.12.2 Run-Time Errors......................................................................................................59 4.12.3 Compiled and Linked Code Generates Unexpected Results.. (59)4.12.4Program Gives Unexpected Results or Terminates Unexpectedly..........................60 4.13 Intel Compilers (32-Bit) for Microsoft Windows (60)4.13.1 Compilation Errors...................................................................................................60 4.13.2 Compiled and Linked Code Generates Unexpected Results.. 
(60)4.13.3 Program Terminates Unexpectedly (61)4.13.4Program Gives Unexpected Results or Exception Behavior...................................61 4.14 Microsoft Compilers (32-Bit) for Microsoft Windows (61)4.14.1 Run-Time Errors (62)4.14.2Compiled and Linked Code Generates Unexpected Results...................................62 4.14.3Program Gives Unexpected Results or Exception Behavior...................................62 4.15PGI Compilers (32-Bit) for Microsoft Windows.............................................................63 4.15.1Interoperability Between Languages.......................................................................63 4.15.2 Run-Time Errors (63)4.15.3 Compiled and Linked Code Generates Unexpected Results (64)Compiler Usage Guidelines for 64-Bit OperatingSystems on AMD64 Platforms32035 Rev. 3.18 June 20058 Contents4.15.4Program Gives Unexpected Results or Terminates Unexpectedly.........................64 4.16 Sun Compilers (32-bit) for Solaris. (64)4.16.1 Compilation Errors..................................................................................................64 4.16.2 Compiled and Linked Code Generates Unexpected Results.. 
(65)Chapter 5Peak Options for SPEC ®-CPU2000.....................................................................67 5.1SuSE GCC 3.3.3 (64-Bit) C/C++ Compiler for Linux....................................................67 5.2Pathscale EKO 2.1 C/C++ Compiler (64-Bit) for Linux................................................68 5.3Pathscale EKO 2.1 Fortran Compiler (64-bit) for Linux................................................69 5.4PGI 6.0 Fortran Compiler (64-Bit) for Linux.................................................................70 5.5Intel 8.0 C/C++ Compiler for (32-Bit) Microsoft Windows...........................................71 5.6PGI 6.0 Fortran Compiler (32-Bit) for Microsoft Windows...........................................72 5.7Sun C/C++ Compiler (64-bit) for Solaris........................................................................73 5.8Sun Fortran Compiler (64-bit) for Solaris (74)32035 Rev. 3.18 June 2005 Compiler Usage Guidelines for 64-Bit OperatingSystems on AMD64 Platforms List of TablesTable 1. Summary of Compilers (13)Table 2. GCC Versions Included with Linux Distributions (17)Table 3. Recommended Option Switches for 64-Bit GCC Compilers for Linux (19)Table 4. Profile Guided Optimization for 64-Bit GCC Compilers for Linux (20)Table 5. GCC Versions Included with Linux Distributions (28)Table 6. Recommended Option Switches for 32-Bit GCC Compilers for Linux (30)Table 7. Profile Guided Optimization for 32-Bit GCC Compilers for Linux (31)Table 8. Recommended Option Switches for 32-Bit Intel Compilers for Linux (33)Table 9. Recommended Option Switches for 32-Bit Intel Compilers forMicrosoft® Windows® (37)Table 10. Unsafe Architecture Switches in 32-Bit Intel Compilers for Linux (57)Table 11. Unsafe Architecture Switches in 32-Bit Intel Compilers for Microsoft Windows (61)Table 12. Best-Known Peak Switches for the 64-Bit SuSE GCC 3.3.3 C/C++ Compiler for Linux (67)Table 13. 
Best-Known Peak Switches for the Pathscale 1.4 C/C++ Compiler for Linux (68)Table 14. Best-Known Peak Switches for the 64-bit Pathscale 1.4 Fortran Compilerfor Linux (70)Table 15. Best-Known Peak Switches for the 64-Bit PGI Fortran Compiler for Linux (71)Table 16. Best-Known Peak Switches for the 32-Bit Intel 8.0 C/C++ Compilerfor Microsoft Windows (71)Table 17. Best-Known Peak Switches for the 32-Bit PGI Fortran Compilerfor Microsoft Windows (73)Table 18. Best-Known Peak Switches for the 64-bit Sun C/C++ Compilers for Solaris (73)Table 19. Best-Known Peak Switches for the 64-bit Sun Fortran Compiler for Solaris (74)List of Tables 9Compiler Usage Guidelines for 64-Bit Operating Systems on AMD64 Platforms32035 Rev. 3.18 June 200510 Revision History Revision History Date Revision DescriptionJune 2005 3.18 Fourth public release.Updated generic performance switches for Sun Solaris in Section 3.8,Section 3.16, and Section 4.16.All updates since Revision 3.16 are marked with revision bars.June 2005 3.16 Third public release.February 2005 3.10 Second public release.October 2004 3.00 Initial public release.Chapter 1 IntroductionISVs and end-users of platforms for the AMD Athlon™ 64 and AMD Opteron™ processors have a significant interest in porting and tuning their applications for this architecture. Because several compilers are available for AMD64 architecture, evaluating them to choose the best-suited compiler for an application is a non-trivial task. This document provides a quick reference for optimization and portability switches for some commonly used compilers. The intent is to provide starting guidelines for porting and performance tuning applications and for increased performance of compiled code. The user should refer to the user’s guide for specific compilers for further tuning help or for troubleshooting problems that are beyond the simple diagnostic steps listed here. New compilers of interest are always on the horizon. 
This document may be updated when new compilers arrive or when the switches of the current compilers change significantly in newer versions.

1.1 Audience

This document is intended for ISVs and end-users of AMD Athlon 64 or AMD Opteron processor-based platforms who are interested in porting and tuning their applications for the AMD64 architecture.

1.2 Intent of Document

This document provides a quick reference for optimization and portability switches for some commonly used compilers for AMD Athlon 64 and AMD Opteron processor-based platforms.

1.3 Definitions, Abbreviations, and Notation

Switches and invocation commands are highlighted in bold text.

1.4 Additional Documents

Here is a short list of other resources for developers working with 64-bit operating systems.
• Software Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 25112
• System V Application Binary Interface (AMD64 Architecture Processor Supplement): /documentation
• PGI Compiler User's Guides: /resources/docs.htm
• Intel Compiler Manuals:
  /software/products/compilers/clin/docs/manuals.htm
  /software/products/compilers/flin/docs/manuals.htm
  /software/products/compilers/cwin/docs/manuals.htm
  /software/products/compilers/fwin/docs/manuals.htm
• Microsoft® Windows® AMD64 Application Binary Interface
• MSDN: /
• GNU Compiler Collection:
• GCC Online Documentation: /onlinedocs/
• Sun Studio Documentation: /prodtech/cc/reference/docs/index.html

Chapter 2 List of Compiler Vendors for AMD Athlon™ 64 and AMD Opteron™ Processors

The compiler vendors listed in this chapter are discussed in detail in subsequent chapters of this application note. This is not a comprehensive list of all compiler vendors for AMD Athlon™ 64 and AMD Opteron™ processors.

Table 1 lists the compiler vendors discussed in this document and shows whether a vendor provides 64-bit compilers, 32-bit compilers, or both for the Linux, Microsoft® Windows®, or Sun Solaris platforms.

Table 1.
Summary of Compilers

Compiler Vendor   Linux               Microsoft® Windows®   Sun Solaris
GCC               64-bit and 32-bit   –                     64-bit and 32-bit
Intel             64-bit and 32-bit   64-bit and 32-bit     –
PathScale         64-bit and 32-bit   –                     –
PGI               64-bit and 32-bit   64-bit and 32-bit     –
Microsoft®        –                   64-bit and 32-bit     –
Sun               –                   –                     64-bit and 32-bit

2.1 Compilers (64-Bit) for Linux

The following companies provide 64-bit compilers for Linux.

2.1.1 GCC

GCC provides C, C++, and Fortran compilers for AMD64 architecture-based systems running the Linux or Sun Solaris operating systems. This application note, however, does not discuss the GCC compilers for Sun Solaris; it discusses only the GCC compilers for Linux. Different Linux distributions offer different versions of the GCC compilers. This application note focuses on the recommended compilers for the following major Linux distributions: SuSE Linux Enterprise Server 8, SuSE Linux Enterprise Server 9, SuSE Linux 9.2, Red Hat Enterprise Linux 3, and Red Hat Enterprise Linux 4. It also briefly discusses the GCC 4.0 compiler, which is the current GCC compiler from the Free Software Foundation (FSF).

2.1.2 Intel

Intel provides C, C++, and Fortran compilers for EM64T and compatible architecture-based systems running the Linux operating system. The current version (as of April 2005) is 8.1.

2.1.3 PathScale

PathScale provides C, C++, and Fortran compilers for AMD64 architecture-based systems running the Linux operating system. The current version (as of April 2005) is 2.1.

2.1.4 PGI

PGI provides C, C++, and Fortran compilers for AMD64 architecture-based systems running the Linux operating system. The current version (as of April 2005) is 6.0.

2.2 Compilers (64-Bit) for Microsoft® Windows®

The following companies provide 64-bit compilers for Microsoft Windows.

2.2.1 Intel

Intel provides C/C++ and Fortran compilers for EM64T and compatible systems running the Microsoft Windows operating system.
The current version (as of April 2005) is 8.1.

2.2.2 Microsoft®

Microsoft provides C/C++ compilers for AMD64 architecture-based systems running the Microsoft Windows operating system. The current beta version is Visual Studio 2005 Beta 1, version 40607.

2.2.3 PGI

PGI provides C and Fortran compilers for AMD64 architecture-based systems running the Microsoft Windows operating system. The current version (as of April 2005) is 1.1 Beta.

2.3 Compilers (64-Bit) for Solaris

The following companies provide 64-bit compilers for x86 Solaris.

2.3.1 Sun

Sun provides C, C++, and Fortran compilers for AMD64 architecture-based systems running the Sun Solaris operating system. The current version (as of April 2005) is 5.7 and comes with the Sun Studio 10 developer tool suite.

2.4 Compilers (32-Bit) for Linux

The following companies provide 32-bit compilers for x86 Linux. These compilers also run on 64-bit Linux operating systems on AMD Athlon 64 or AMD Opteron processor-based platforms.

2.4.1 GCC

The GNU Compiler Collection (GCC) provides C, C++, and Fortran compilers for x86 Linux and Sun Solaris. This application note, however, does not discuss the GCC compilers for Sun Solaris; it discusses only the GCC compilers for Linux. Different Linux distributions offer different versions of the GCC compiler. This application note focuses on the recommended compilers for the following major Linux distributions for workstations and servers: SuSE Linux Enterprise Server 8, SuSE Linux Enterprise Server 9, SuSE Linux 9.2, Red Hat Enterprise Linux 3, and Red Hat Enterprise Linux 4. It also briefly discusses the GCC 4.0 compiler, which is the current GCC version from the Free Software Foundation (FSF).

2.4.2 Intel

Intel provides C, C++, and Fortran compilers for x86 Linux. The current version (as of April 2005) is 8.1.
This document also discusses two previous versions of the compiler, 8.0 and 7.1, because they are comparable in performance to the current version (when run on AMD platforms) and are still in use.

2.4.3 PathScale

PathScale provides C, C++, and Fortran compilers for x86 Linux. The current version (as of April 2005) is 2.1.

2.4.4 PGI

The Portland Group, Inc. (PGI) provides C, C++, and Fortran compilers for x86 Linux. The current version (as of April 2005) is 6.0.

2.5 Compilers (32-Bit) for Microsoft® Windows®

The following companies provide 32-bit compilers for Microsoft Windows.

2.5.1 Intel

Intel provides C, C++, and Fortran compilers for x86 Microsoft Windows. The current version (as of April 2005) is 8.1. This document also discusses two previous versions of the compiler, 8.0 and 7.1, because they are comparable in performance to the current version and are still in use.

2.5.2 Microsoft®

Microsoft provides C/C++ compilers for x86 Microsoft Windows. The current version (as of April 2005) is Microsoft Visual Studio 2005 Beta 1, version 40607.

2.5.3 PGI

PGI provides C and Fortran compilers for x86 Microsoft Windows. The current version (as of April 2005) is 6.0.

2.6 Compilers (32-Bit) for Sun Solaris

The following companies provide 32-bit compilers for Sun Solaris.

2.6.1 Sun

Sun provides C, C++, and Fortran compilers for the x86 Solaris 10 operating system. The current version (as of April 2005) is 5.7 and comes with the Sun Studio 10 developer tool suite.

Chapter 3 Performance-Centric Compiler Switches

This chapter describes switches that can be useful for the individual compilers. For each compiler, a list of generally recommended performance switches is provided. This list is further augmented by other switches that could prove beneficial for certain code bases.

3.1 GCC Compilers (64-Bit) for Linux

The 64-bit GCC compilers can be installed and run on 64-bit Linux on AMD Athlon™ 64 and AMD Opteron™ processors. The GCC compilers come in several flavors.
This section discusses the following GCC compilers:
• the gcc-ssa compiler supplied with Red Hat Enterprise Linux 3
• the gcc 3.3.3 compiler from SuSE Linux Enterprise Server 8
• gcc 3.4 from the Free Software Foundation (FSF)
• gcc 4.0 from the Free Software Foundation (FSF)
• the gcc 3.3.3 compiler from SuSE Linux Enterprise Server 9
• the gcc 3.4.1 compiler supplied with Red Hat Enterprise Linux 4
• the gcc 3.3.4 compiler from SuSE Linux 9.2

3.1.1 Recommended Compiler Versions

The Linux distributions from SuSE and Red Hat include a default 64-bit GCC compiler and optional GCC compilers. From a performance standpoint, the optional compilers are recommended. Table 2 shows the recommended (optional) compiler versions for the current SuSE and Red Hat distributions. These optional compilers are included on the product CDs/DVDs.

Table 2. GCC Versions Included with Linux Distributions

Linux Distribution               Default GCC        Recommended (Optional)   Package Name of
                                 Compiler Version   Compiler Version         Recommended Compiler
Red Hat Enterprise Linux 3       3.2                gcc-ssa                  gcc-ssa
SuSE Linux Enterprise Server 8   3.2                gcc 3.3.3                gcc33