NUMERICAL EXPERIMENTS WITH PARALLEL ORDERINGS FOR ILU PRECONDITIONERS
MICHELE BENZI, WAYNE JOUBERT, AND GABRIEL MATEESCU

Abstract. Incomplete factorization preconditioners such as ILU, ILUT and MILU are well-known robust general-purpose techniques for solving linear systems on serial computers. However, they are difficult to parallelize efficiently. Various techniques have been used to parallelize these preconditioners, such as multicolor orderings and subdomain preconditioning. These techniques may degrade the performance and robustness of ILU preconditionings. The purpose of this paper is to perform numerical experiments to compare these techniques in order to assess the most effective ways to use ILU preconditioning for practical problems on serial and parallel computers.

Key words. Krylov subspace methods, preconditioning, incomplete factorizations, sparse matrix orderings, additive Schwarz methods, parallel computing.

AMS subject classifications.

1. Introduction.

1.1. Motivation and focus. Krylov subspace methods [21] are customarily employed for solving linear systems arising from modeling large-scale scientific problems. The convergence of these iterative methods can be improved by preconditioning the linear system $Ax = b$ with a matrix $M$. The preconditioned system $M^{-1}Ax = M^{-1}b$ (for left preconditioning) can be solved faster than the original system if the preconditioner $M$ is an efficient and good approximation of $A$: efficient, in the sense that the cost of solving systems with $M$ is much smaller than the cost of solving systems with $A$, and good in the sense that the convergence rate for the preconditioned iteration is significantly faster than for the unpreconditioned one.

Incomplete factorization techniques [24], [17], [20] provide a good preconditioning strategy for solving linear systems with Krylov subspace methods. Usually, however, simply applying this strategy to the full naturally ordered linear system leads to a method with little parallelism. Incomplete factorization is also useful as an approximate subdomain solver in domain decomposition-based preconditioners, such as Additive Schwarz Method (ASM) preconditioning [23].

In this paper we study the effect of the following algorithm parameters on the convergence of preconditioned Krylov subspace methods: symmetric reorderings of the matrix, which apply to incomplete factorization (ILU) preconditioners; and subdomain overlap, which applies to Additive Schwarz preconditioners.

Symmetric permutations of the linear system were first used in direct factorization solution methods for reducing the operation count and memory requirements. For example, the Minimum Degree ordering is effective for direct solvers in that it tends to reduce the number of nonzeros of the $L$ and $U$ factors [12]. Other reorderings can have a beneficial effect on incomplete factorizations employed as preconditioners, e.g., by providing a parallel preconditioner [21]; the storage for the preconditioner is typically controlled by the incomplete factorization scheme. However, parallel orderings may degrade the convergence rate, and allowing fill may diminish the parallelism of the solver.

In this paper, we consider structurally symmetric matrices arising from partial differential equations (PDEs) discretized on structured grids using finite differences. We focus on symmetric permutations because they represent similarity transformations and preserve the spectrum of the linear system matrix. Furthermore, symmetric permutations preserve the structural symmetry of a matrix and the set of diagonal elements.
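To make the setup above concrete, the following is a minimal sketch of a Krylov solve preconditioned with an incomplete LU factorization. It uses SciPy rather than the Fortran/PETSc codes employed in the paper, and the model matrix, drop tolerance, and fill factor are illustrative assumptions, not the paper's settings.

```python
# Sketch: ILU-preconditioned GMRES on a 2D Poisson model problem (illustrative only).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 50                                            # grid points per side (assumed)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(T, T).tocsc()                      # 5-point-stencil Laplacian
b = np.ones(A.shape[0])

# ILUT-like incomplete factorization: drop_tol plays the role of the threshold,
# fill_factor loosely controls the amount of fill (not identical to ILUT(tau, p)).
ilu = spla.spilu(A, drop_tol=1e-3, fill_factor=10)
M = spla.LinearOperator(A.shape, matvec=ilu.solve)   # preconditioner action z = M^{-1} r

x, info = spla.gmres(A, b, M=M, restart=30)
print("gmres info =", info, " residual =", np.linalg.norm(b - A @ x))
```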
Additive Schwarz methods [23] derive a preconditioner by decomposing the problem domain into a number of overlapping subdomains, (approximately) solving each subdomain, and summing the contributions of the subdomain solves. Variants of ASM are obtained by varying the amount of overlap and the subdomain solvers. When the subdomains are solved approximately using an incomplete factorization, the resulting preconditioner can be thought of as a "parallel ILU" strategy. Like block Jacobi, ASM has good parallelism and locality, but these advantages could be offset by a high iteration count of the underlying Krylov subspace method.

1.2. Related work. A lexicographic ordering of the grid points in a regular two- or three-dimensional grid is obtained by scanning the grid nodes and assigning numbers to the nodes in the order in which they are seen in the scanning. A widely used ordering is the Natural Order (NO), which is the order induced by labeling the grid nodes from the bottom up, one horizontal line at a time and (for three-dimensional grids) scanning consecutive vertical planes.

Several reorderings have been considered in the literature as alternatives to the natural order. Among these are Minimum Degree (MD), Multiple Minimum Degree (MMD), Reverse Cuthill-McKee (RCM), Nested Dissection (ND), and Multicoloring (MCL). For a description of these reorderings see [6], [11], [12], [16], [21]. MMD, MD, and ND reorderings attempt to minimize the fill in the factors, while RCM reduces the bandwidth of the matrix.

The degree of parallelism (DOP) of a preconditioning algorithm is the number of processors that can work simultaneously on constructing or applying the preconditioner. MCL provides large-grain parallelism for no-fill incomplete LU factorization [21]. With multicoloring, the linear system is partitioned into subsets of equations that do not have any internal dependencies between unknowns. For direct solvers, fill-reducing (ND, MD, MMD) and bandwidth-reducing (RCM) reorderings are superior to NO [12]. The usefulness of these reorderings for incomplete factorization preconditioners is not well established.

A simple incomplete factorization of a matrix $A$ is $A = LU + R$, where the triangular matrices $L$ and $U$ have the same nonzero structure as the lower and upper triangular parts of $A$, respectively, and $R$ is the residual matrix. This strategy is known as no-fill ILU, and is denoted by ILU(0). For symmetric positive definite (SPD) problems, $U = L^T$ and the factorization is called no-fill incomplete Cholesky, denoted by IC(0). One can attempt to improve the effectiveness of an incomplete LU factorization by allowing fill-in in the triangular factors $L$, $U$. The ILU(1) method allows nonzero entries for the elements with level of fill at most one (see [21], pp. 278–281); the corresponding factorization for SPD problems is denoted by IC(1). A more sophisticated preconditioner is the dual-dropping ILU preconditioner [20], denoted by ILUT($\tau$, $p$), where $\tau$ is the dropping threshold and $p$ is the maximum number of fill nonzeros allowed in a row beyond those present in the original matrix.
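As a small illustration of the no-fill factorization just described, the dense sketch below (assumption-laden, small matrices only, not the authors' code) computes ILU(0) on a tiny 5-point Laplacian and evaluates the accuracy measure $\|R\|_F = \|A - LU\|_F$ and the stability measure $\|I - (LU)^{-1}A\|_F$ discussed in the next paragraphs.

```python
# Sketch: IKJ-variant ILU(0) restricted to the nonzero pattern of A (illustrative only).
import numpy as np

def ilu0(A):
    """Return unit-lower L and upper U that keep the nonzero pattern of A (no fill)."""
    n = A.shape[0]
    pattern = A != 0.0
    LU = A.astype(float).copy()
    for i in range(1, n):
        for k in range(i):
            if not pattern[i, k]:
                continue
            LU[i, k] /= LU[k, k]                  # multiplier l_ik
            for j in range(k + 1, n):
                if pattern[i, j]:                 # update only entries inside the pattern
                    LU[i, j] -= LU[i, k] * LU[k, j]
    return np.tril(LU, -1) + np.eye(n), np.triu(LU)

# 5-point Laplacian on a 3x3 interior grid (9 unknowns), natural ordering
T = np.diag([2.0] * 3) + np.diag([-1.0] * 2, 1) + np.diag([-1.0] * 2, -1)
A = np.kron(np.eye(3), T) + np.kron(T, np.eye(3))
L, U = ilu0(A)
R = A - L @ U                                     # residual of the incomplete factorization
E = np.eye(9) - np.linalg.solve(L @ U, A)         # deviation of (LU)^{-1} A from the identity
print("accuracy  ||R||_F            =", np.linalg.norm(R, "fro"))
print("stability ||I-(LU)^{-1}A||_F =", np.linalg.norm(E, "fro"))
```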
The effect of reorderings on the performance of ILU-type preconditioners has been studied by Duff and Meurant [7], Benzi et al. [2], and Saad [19], among others. Duff and Meurant have studied the impact of reorderings on incomplete factorization preconditioning for SPD problems. The sources of inaccuracy in incomplete factorizations and the effect of reorderings on accuracy and stability have been analyzed by Chow and Saad [4] and Benzi et al. [2]. Let $R = A - LU$. The residual matrix $R$ measures the accuracy of the incomplete factorization. Let $\|\cdot\|_F$ denote the Frobenius norm of a matrix. Chow and Saad [4] and Benzi et al. [2] have shown that for nonsymmetric problems $\|R\|_F$ may be an insufficient characterization of the convergence of the preconditioned iterative methods. This is in contrast with symmetric positive definite problems, where $\|R\|_F$ provides a good measure of convergence. These authors have shown that the stability of the triangular solves is gauged by the Frobenius norm of the deviation from the identity matrix, $\|I - (LU)^{-1}A\|_F$. For ILU(0), $\|R\|_F$ can be small while $\|I - (LU)^{-1}A\|_F$ may be very large, i.e., $(LU)^{-1}$ can be very ill-conditioned. Note that $I - (LU)^{-1}A = -(LU)^{-1}R$.

Benzi et al. [2] have shown that RCM and MD reorderings can be beneficial for problems that are highly nonsymmetric and far from diagonally dominant. Specifically, MD has the effect of stabilizing the ILU(0) triangular factors, while RCM improves the accuracy of incomplete factorizations with fill. Saad [19] has shown that improving the accuracy of the preconditioner, by replacing ILU(0) with dual-dropping ILUT($\tau$, $p$) preconditioning, greatly improves the performance of red-black (RB) ordering, so that by increasing $p$ the ILUT preconditioner induced by RB will eventually outperform the one induced by NO, as measured by iterations or floating point operations. RB ordering is the simplest variant of MCL, in which the grid points are partitioned into two independent subsets (see Subsection 2.1 and [21], page 366).

1.3. Contributions of the paper. Our main contribution is to show that parallel orderings can perform well even in a sequential environment, producing solvers requiring a lower wall-clock time (and in some cases reduced storage) than NO, especially for ILUT preconditioning of two-dimensional problems. We also show that for problems which are highly nonsymmetric and far from diagonally dominant these orderings are still better than NO, but they are outperformed by RCM. For such problems we also observe that parallel orderings can have a stabilizing effect on the ILU(0) preconditioner.

We propose and investigate a new MCL ordering, which allows for parallel ILU(1) and IC(1) preconditioners. We perform numerical experiments with multicoloring for nonsymmetric problems and incomplete factorizations with fill-in, extending the work done by Poole and Ortega [18] and by Jones and Plassmann [15], who studied the effect of MCL reorderings on IC(0) preconditioners for SPD problems. Our experiments suggest that for nonsymmetric problems, the loss in convergence rate caused by switching from NO to multicolorings for ILU(0) and ILU(1) preconditioners is compensated by the large DOP of the multicolorings.

We extend the study [7] in two ways: first, we show that RB applied to symmetric ILUT can outperform NO; second, we look at RB applied to nonsymmetric problems. We further the study of Benzi et al. [2] on orderings for nonsymmetric problems by considering MCL and ND orderings. We further the work of Saad by considering the performance of RB on larger problems (Saad considers problems with up to … unknowns, while we consider up to … unknowns) and by comparing RB with RCM and ND, in addition to NO. Finally, we assess the scalability of one-level overlapping ASM preconditioning for the set of model problems defined below.

1.4. Model problems. Although in practical experience it is often desirable to solve problems with complex physics on possibly unstructured grids, a minimal requirement of an effective parallel scheme is that it work well on simple, structured problems. For this reason, we focus on a small set of model problems on structured grids which are "tunable" in terms of problem size and difficulty. The numerical experiments throughout the paper use the following three
model PDE problems, all of them with homogeneous Dirichlet boundary conditions. Below we denote by $\Delta$ the Laplace operator in two or three dimensions.

Problem 1. The Poisson equation (1.1).

Problem 2. Convection-diffusion equation with convection in the xy-plane (1.2).

Problem 3. Convection-diffusion equation with convection in the z-direction (1.3).

[...] The toolkit uses left preconditioning, and the residual used in the stopping test corresponds to the preconditioned problem. The code for the experiments in Sections 3 and 5 has been compiled using the compiler options -pfa (automatic parallelization of do-loops), -n32 (new 32-bit objects), and -O0 (turn off optimization). The timing data are obtained using the Fortran intrinsic function dtime(), from which the user CPU time is extracted. The code uses right preconditioning. The runs in Section 3 use one processor, while those in Section 5 use eight processors.

2. Background.

2.1. Multicoloring orderings. Given a graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges, the graph coloring problem is to construct a partition of the set $V$ such that all vertices in the same part form an independent set, i.e., vertices in the same subset are not connected by any edge. The minimum number of colors necessary for a graph, $\chi(G)$, is the chromatic number of the graph. The relevance of the chromatic number from a computational point of view is that all unknowns in the same subset can be solved in parallel. Thus, the number of inherent sequential steps is greater than or equal to the chromatic number of the graph. With each matrix $A$ we can associate a graph $G(A)$ whose vertices correspond to the unknowns and whose edges connect $i$ and $j$ whenever $a_{ij} \neq 0$. For an arbitrary graph, finding the chromatic number is NP-hard, but in practice a suboptimal coloring suffices. For the 5-point stencil discretization, it is easy to find a 2-coloring of the graph, commonly called red-black coloring. For red-black coloring, the degree of parallelism in applying an IC(0) or ILU(0) preconditioner is $n/2$ ($n$ is the number of unknowns), which justifies considering this ordering strategy. The problem with this approach is that a no-fill preconditioner obtained with RB may result in poor convergence.
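As an illustration of the red-black coloring just described, the sketch below (SciPy-based, with an assumed grid size, not taken from the paper) builds the 5-point Laplacian on a small grid, applies the red-black permutation, and checks that the two same-color diagonal blocks of the permuted matrix are diagonal, which is what yields the $n/2$ degree of parallelism in the ILU(0)/IC(0) triangular solves.

```python
# Sketch: red-black ordering of the 5-point-stencil Laplacian (illustrative only).
import numpy as np
import scipy.sparse as sp

m = 4                                             # interior grid is m x m (assumed)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = sp.kronsum(T, T).tocsr()                      # 2D Laplacian, natural ordering

# color(i, j) = (i + j) mod 2: red nodes first, then black nodes
colors = (np.add.outer(np.arange(m), np.arange(m)) % 2).ravel()
p = np.concatenate([np.flatnonzero(colors == 0), np.flatnonzero(colors == 1)])

Arb = A[p, :][:, p]                               # symmetric permutation P A P^T
nred = np.count_nonzero(colors == 0)
Dr = Arb[:nred, :nred].tocsr()                    # red-red block
Db = Arb[nred:, nred:].tocsr()                    # black-black block
print("off-diagonal nonzeros in red / black blocks:",
      (Dr - sp.diags(Dr.diagonal())).count_nonzero(),
      (Db - sp.diags(Db.diagonal())).count_nonzero())   # both should be 0
```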
For SPD matrices arising from finite difference discretizations of two-dimensional elliptic problems, Duff and Meurant [7] have shown, by way of experiments, that RB reordering has a negative effect on the convergence of conjugate gradient preconditioned with IC(0). Poole and Ortega [18] have observed similar convergence behavior for multicolorings.

2.2. Scalability. Let $T(N, P, k)$ be the time complexity of executing an iterative linear solver on a parallel computer with $P$ processors, where $k$ is a parameter equal to the number of subproblems (for example, the number of subdomains in the case of ASM) into which the solver divides the original problem of size $N$, and $n$ is the subproblem size, $n = N/k$. Following Gustafson [13], an algorithm is scalable if the time complexity stays constant when the subproblem size is kept constant while the number of processors and the number of subproblems both increase $s$ times, i.e., $T(sN, sP, sk) = T(N, P, k)$. The scalability of an iterative linear solver can be decomposed into two types of scalability (see, for example, Cai et al. [3]): (i) algorithmic or numerical scalability, i.e., the number of iterations is independent of the problem size; (ii) parallel scalability, i.e., the parallel efficiency remains constant when $P$ grows.

In the case of linear solvers for problems arising from elliptic PDEs, it is likely that the two conditions for scalability cannot be simultaneously achieved; see, for example, Worley [26]. Hence, in practice a method is considered scalable if the number of iterations depends only weakly on the number of subproblems and the parallel efficiency decreases only slightly as $P$ increases.

3. Performance of serial implementations. In this section we are concerned with the serial performance of reorderings for ILU preconditioners. We look at the effect of RB, RCM, and ND on several ILU preconditioners.

If fill is allowed, the advantage of RB providing a parallel preconditioner is greatly diminished; however, by allowing fill the accuracy of the RB-induced preconditioner may improve so much that it outperforms NO, thereby making RB an attractive option for uniprocessor computers. Indeed, Saad has observed [19] that, for a given problem and dropping tolerance $\tau$, there is an amount of fill $p_0$ such that if $p \geq p_0$ then the preconditioner ILUT($\tau$, $p$) induced by RB outperforms the corresponding preconditioner induced by NO. On the other hand, it has been observed by Benzi et al. [2] that, for highly nonsymmetric problems, RCM outperforms NO when fill in the preconditioner is allowed. We further these studies by performing numerical experiments on NO, RB, and RCM, thereby comparing RB to RCM, and considering larger problems (more than 100,000 unknowns).

For the drop tolerance-based solvers we use global scaling, i.e., we divide all the matrix elements by the element of largest magnitude in the matrix. Thus, all entries in the scaled matrix are in the interval $[-1, 1]$. This scaling has no effect on the spectral properties of the preconditioned matrix, but it helps in the choice of the (absolute) drop tolerance, which is a number between 0 and 1. As an alternative, we tried the relative drop tolerance approach suggested in [20], where fill-ins in a given row are dropped whenever they are smaller in absolute value than the tolerance times the 2-norm of that row of $A$. For the problems considered in this paper, the two drop strategies gave similar results. For 2D problems we let … , and for 3D problems we let … .

The number of iterations (I), the solver times (T) in seconds, and the memory requirements of the preconditioners are shown in Tables 3.1–3.6. Here "fill" denotes the amount of fill-in (in addition to the nonzero entries in the matrix) in the fill-controlled preconditioners.
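As a sketch of how the reordering experiments of this section are set up (with SciPy stand-ins and assumed parameters, not the paper's codes or values), the following compares the fill produced by an ILUT-like factorization under the natural ordering and under RCM. Note that SciPy's spilu is not identical to ILUT($\tau$, $p$); permc_spec="NATURAL" is passed so that the supplied ordering is the one actually factored.

```python
# Sketch: fill of an incomplete factorization under NO vs. RCM (illustrative only).
import scipy.sparse as sp
import scipy.sparse.linalg as spla
from scipy.sparse.csgraph import reverse_cuthill_mckee

n = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(T, T).tocsr()                      # 2D Poisson model problem (assumed)

def ilut_fill(M):
    """nnz of the incomplete factors beyond nnz(M), for an ILUT-like factorization."""
    f = spla.spilu(M.tocsc(), drop_tol=1e-3, fill_factor=10, permc_spec="NATURAL")
    return (f.L.nnz + f.U.nnz) - M.nnz

perm = reverse_cuthill_mckee(A, symmetric_mode=True)   # RCM ordering of the graph of A
A_rcm = A[perm, :][:, perm]                            # symmetric permutation P A P^T

print("fill, natural ordering:", ilut_fill(A))
print("fill, RCM ordering:    ", ilut_fill(A_rcm))
```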
We denote by SILUT a modification of the ILUT algorithm which gives rise to a symmetric preconditioner whenever the original matrix is symmetric. Symmetry is exploited by storing only the upper triangular part of the factorization. Throughout the paper we use … as the unit for storage measurement. The solver time reported is the sum of the time to construct the preconditioner and the time taken by the iterations of the preconditioned Krylov subspace method. The bold fonts indicate the smallest iteration count, time, and fill for each preconditioner.

3.1. Symmetric positive definite problems. We consider symmetric ILUT preconditioning and incomplete Cholesky preconditioners with no fill and with level of fill one, denoted respectively by IC(0) and IC(1). We examine the effect of RB and ND orderings on the convergence of preconditioned CG. As already mentioned, allowing fill in the preconditioner largely limits the parallelism provided by RB ordering. Except for the IC(0) preconditioner, for which the degree of parallelism of applying the preconditioner is $n/2$, the other preconditioners have modest parallelism. The results are reported in Tables 3.1–3.2.

TABLE 3.1. Iterations (I), time (T), and fill versus problem size, for Problem 1, 2D domain, with different reorderings and preconditioners.

TABLE 3.2. Iterations (I), time (T), and fill versus problem size, for Problem 1, 3D domain, with different reorderings and preconditioners.

We may draw several conclusions from these results. First, nested dissection ordering is not competitive with the other orderings for IC(0), IC(1) and SILUT, thus we will exclude this ordering from the subsequent discussion. Second, our numerical results for IC(0) indicate the same performance for NO and RCM; this is in accordance with theoretical results [27] which show that the ILU(0) preconditioner induced by RCM is just a permutation of that induced by NO. Third, while NO and RCM are the best orderings for IC(0), the best ordering for IC(1) is RB, followed by RCM and NO. This is true for both 2D and 3D problems.

The best reordering for SILUT from an iteration count standpoint is RCM, followed by RB and NO. In the 2D case, notice that RB, even though requiring slightly more iterations than RCM, leads to a preconditioner whose fill is roughly 2/3 of that of the RCM preconditioner. This has an effect on the solver time: for small problem sizes (not shown) the best reordering is RCM, but for the larger problems RB is the best reordering in terms of CPU time. Here, the higher iteration count of RB as compared to RCM is outweighed by the lower cost of applying the RB preconditioner, so that overall RB becomes the best reordering for large problems with a moderately high level of fill such as 5 or 10.

In the 3D case the best ordering for SILUT depends on the level of fill allowed: for SILUT(.005,5) the fastest solution is obtained with RCM, whereas RB is the best ordering for
SILUT(.001,10). This appears to be due to the fact that RB does a better job at preserving sparsity in the incomplete factors, while the convergence rate is only marginally worse than with RCM.

3.2. Convection-diffusion problems. As we move from SPD problems to problems which are nonsymmetric and lack diagonal dominance, the relative merits of RB and RCM observed in the previous subsection change. In this subsection, we consider two- and three-dimensional instances of Problem 2, obtained with two different settings of the convection coefficients (a mildly and a moderately nonsymmetric case), and compare the performance of the RB, RCM, NO, and ND reorderings.

TABLE 3.3. Iterations (I), time (T), and fill versus problem size, for Problem 2, 2D domain, mildly nonsymmetric case, with different reorderings and preconditioners.

TABLE 3.4. Iterations (I), time (T), and fill versus problem size, for Problem 2, 3D domain, mildly nonsymmetric case, with different reorderings and preconditioners.

The effects of RB, RCM, and ND on the iteration count and execution time are illustrated in Tables 3.3–3.6. Notice that, as for the incomplete Cholesky factorization, ND gives poor performance under all aspects (iterations, time, storage), in all test cases except for the 3D domain with ILU(0).

First, consider the mildly nonsymmetric problems (Tables 3.3 and 3.4). In the two-dimensional case (Table 3.3), RCM and NO are equivalent (up to round-off) for ILU(0) and are the best orderings. However, for ILU(1) for the larger problems, RB is the best ordering from a time and iteration count standpoint. The second best ordering is RCM, followed by NO. Notice that RCM and NO induce the same size of the ILU(1) preconditioner, which is about half the size of the fill-in induced by RB. For ILUT, the time, iteration, and storage costs of RB and RCM are close, with RCM being somewhat better for the smaller problem sizes. For larger problems RCM is generally the least expensive in terms of time and iterations, while RB induces the smallest fill. Notice that for large fill the relative storage saving obtained with RB as compared to RCM is less than the corresponding saving in the SPD case.

In the three-dimensional case (Table 3.4), RCM is the best ordering (or nearly so) for all preconditioners, often by a wide margin. RB, which is much worse than NO with ILU(0), gives a performance that is close to that of NO with ILU(1) and ILUT(.005,5) and is somewhat better than NO with ILUT(.001,10).

Next, consider the moderately nonsymmetric problems (see Tables 3.5 and 3.6), where a marker in the tables indicates failure to converge in 5000 iterations.

TABLE 3.5. Iterations (I), time (T), and fill versus problem size, for Problem 2, 2D domain, moderately nonsymmetric case, with different reorderings and preconditioners.
For two-dimensional domains (Table 3.5), NO and RCM are still the best orderings for ILU(0), while RB is much worse than these two. For ILU(1), with a single exception, the best ordering is RCM; the performance of RB is about midway between RCM and NO. RCM outperforms RB by all criteria: iterations, time, and preconditioner size. This differs from the mildly nonsymmetric case, for which RB for ILU(1) typically wins in terms of CPU time. The results for ILUT are qualitatively similar to those for ILU(1), with RCM being the clear winner and RB being better than NO. Notice that the size of the ILUT(.001,10) preconditioner induced by RB is always larger than that of the RCM-induced preconditioner. This behavior is different from that observed for Problem 1 and the mildly nonsymmetric instance of Problem 2, and indicates that the advantage of RB leading to a smaller ILUT preconditioner size than RCM for large enough fill is problem-dependent.

TABLE 3.6. Iterations (I), time (T), and fill versus problem size, for Problem 2, 3D domain, moderately nonsymmetric case, with different reorderings and preconditioners.

The results for the 3D domain (see Table 3.6) are similar to those for 2D, with two exceptions. First, for ILU(0), RB and ND are better than NO and RCM for some problem sizes; note that ILU(0) with the natural ordering (as well as RCM) is unstable for these cases. Second, for ILUT, the minimum size of the preconditioner is induced by RB; this is similar to the behavior observed in Tables 3.2 and 3.4. A salient characteristic of the ILUT preconditioners for the 3D case is that the iteration counts are almost insensitive to the problem size, which suggests that the incomplete factorization is very close to a complete one.

3.3. Summary of results. The experiments above suggest that RB and RCM reorderings can be superior to NO: they may lead to a reduction in the number of iterations and, for threshold-based ILU, to a smaller amount of fill-in. For SPD problems and for mildly nonsymmetric problems, RB and RCM are the best reorderings for the set of preconditioners considered. The winner is either RB or RCM, depending on the type of preconditioner and on the problem size. On the other hand, RCM is the best reordering for highly nonsymmetric problems, for almost all preconditioners and problem sizes covered in this section; for the few cases where RCM is not the best choice, its performance is very close to the best choice. Therefore, the robustness of RCM makes it the best choice for problems with strong convection. We mention that similar conclusions hold when more adequate discretization schemes are used; see the experiments with locally refined meshes in [2]. While it was already known that RB can outperform NO provided enough fill is allowed (Saad [19]), here we have found that RCM is even better. We observe that, for the preconditioners with fill, the convergence is faster for the convection-diffusion problems than for the Poisson equation corresponding to the same problem size, preconditioner, and ordering.

We should make a comment on the methodology of these experiments. It should be pointed out that these experiments are not exhaustive, and ideally one would like to say that, for a given problem and for any fixed choice of parameters, one method is better than another, which would imply that the given method with its best parameters is better than
the other method with its best parameters, i.e., that the choice of the best ordering is not an artifact of the parameterization of the knobs. However, such a test would require a very large number of experiments with different parameter values, which would not be practical. Furthermore, we feel the results we have given do give a general sense of the comparative performance of the methods.

In this section we have not been concerned with the issue of parallelism, and the experiments above were meant to assess the effect of different orderings on the quality of ILU preconditionings in a sequential setting. The remainder of the paper is devoted to an evaluation of different strategies for parallelizing ILU-type preconditioners.

4. Additive Schwarz preconditioning. In this section we examine the effect of the problem size and of the number of subdomains (blocks the matrix is split into) on the convergence of ASM. The subdomain size is chosen such that the subdomain boundaries do not cross the z-axis. The experiments in this section use Problems 1 and 3. The number of subdomains is always a divisor (or a multiple) of the number of grid points along one dimension. Note that this is not always a power of 2.

The ASM preconditioner divides the matrix into overlapping blocks (corresponding to subdomains of the PDE problem), each of which is approximately solved with IC(0) in the SPD case and ILU(0) in the nonsymmetric case. We call these variants ASM.IC0 and ASM.ILU0, respectively. Since Additive Schwarz preconditioners can be improved by employing subdomain overlapping, we consider three levels of overlap; a zero overlap gives an approximate (since the subdomain solves are ILU(0)) block Jacobi preconditioner. We have employed the ASM preconditioner provided by the PETSc library. Conceptually, the preconditioner is formed by summing the (approximate) inverse of each block, where for each block a restriction operator extracts the coefficients corresponding to that block. A limited interpolation is employed, in which the off-block values for each block are ignored (this is PETSc's PC RESTRICT preconditioner). The number of processors is … ; more processors caused a performance degradation for the problem sizes considered here.

We perform two kinds of scalability experiments. In the first kind, we fix the problem size and increase the number of subdomains. In the second, we fix the subdomain size and increase the number of subdomains. We monitor the iteration counts and the execution times. The grid nodes are numbered row-wise in each xy-plane, so each matrix block corresponds to a subdomain in an xy-plane.
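For illustration only, here is a rough sketch of a one-level additive Schwarz preconditioner of the kind described above: the index set is split into overlapping blocks, each block is solved locally (exactly, for brevity, rather than with ILU(0)), and the contributions are summed. It is not the PETSc implementation used in the paper, and the block count and overlap are arbitrary assumptions.

```python
# Sketch: one-level additive Schwarz preconditioner for CG (illustrative only).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def asm_preconditioner(A, nblocks, overlap):
    ndof = A.shape[0]
    size = int(np.ceil(ndof / nblocks))
    blocks = []
    for k in range(nblocks):
        lo = max(0, k * size - overlap)
        hi = min(ndof, (k + 1) * size + overlap)
        idx = np.arange(lo, hi)
        if idx.size == 0:
            continue
        local_solve = spla.factorized(A[idx, :][:, idx].tocsc())  # local subdomain solver
        blocks.append((idx, local_solve))

    def apply(r):
        z = np.zeros_like(r)
        for idx, local_solve in blocks:
            z[idx] += local_solve(r[idx])     # restrict, solve locally, prolong, and sum
        return z

    return spla.LinearOperator(A.shape, matvec=apply)

n = 40
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(T, T).tocsr()                  # 2D Poisson model problem (assumed)
b = np.ones(A.shape[0])

M = asm_preconditioner(A, nblocks=8, overlap=4)
x, info = spla.cg(A, b, M=M)
print("cg info =", info, " residual =", np.linalg.norm(b - A @ x))
```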
zero-overlap gives the smallest number of iterations.。