
Multithreaded systolic computation:

A novel approach to performance enhancement of systolic arrays

R. Sernec, M. Zajc

Digital Signal Processing Laboratory

University of Ljubljana

Slovenia

Abstract

In this paper we propose a synergy of processing on parallel processor arrays (systolic or SIMD) and multithreading, named multithreaded systolic computation. Multithreaded systolic computation enables the simultaneous execution of independent data sets of the same algorithm, or even of different algorithms, on the systolic array level. This approach results in higher throughput and improved utilization of programmable systolic arrays whose processing elements are designed with pipelined functional units. The real benefit of multithreaded systolic computation is that it increases the throughput of a systolic array without changing the systolic algorithm.

The multithreaded systolic computation principle is demonstrated on a programmable systolic array executing a set of linear algebra algorithms. A cycle accurate simulation presents throughput improvement results for three different processing element models. We demonstrate that multithreaded systolic computation can provide throughput improvements that asymptotically approach the number of simultaneously executable threads.

Povzetek

V članku je predstavljen pristop, ki združuje vzporedna procesorska polja ter večnitnost, imenovan večnitno sistolično računanje. Večnitno sistolično računanje pomeni hkratno izvajanje neodvisnih podatkovnih nizov istega algoritma ali več različnih algoritmov na sistoličnem polju. Pristop omogoča povečanje prepustnosti in boljšo izrabo programljivih sistoličnih polj s cevovodnimi funkcionalnimi enotami. Prednost podanega pristopa je povečanje prepustnosti brez spreminjanja sistoličnega algoritma.

Večnitno sistolično računanje je predstavljeno na programljivem sistoličnem polju, ki izvaja nabor algoritmov linearne algebre. Povečanje prepustnosti sistoličnega polja predstavljajo rezultati simulacij treh različnih modelov procesorskih elementov. Izboljšanje prepustnosti se asimptotično približuje številu niti.

Keywords: multithreading, homogeneous processor array, systolic array, systolic algorithm, linear algebra.

Ključne besede: večnitnost, homogena procesorska polja, sistolični algoritem, linearna algebra.

1 Introduction

Modern digital signal processing (DSP) and image processing applications depend on high processing throughput and massive data bandwidths [1, 2]. DSP, image processing, and linear algebra algorithms have found an efficient implementation medium in the parallel computing domain. These algorithms can be efficiently mapped onto array processors (systolic arrays, wavefront arrays and SIMD arrays) due to their regularity at the data level [3, 4, 5]. The data parallelism property allows us to trigger execution on the whole processor array simultaneously with the same instruction, whereby each processing element (PE) operates on its own data element or data set.

Systolic arrays (SA) were designed to tackle fine-grained computation and exploit data level parallelism of these problems directly [4]. They feature a very homogeneous (regular) design and consist of a number of identical processing elements. These are all ideal features for system-on-chip implementations [6].

Designing systolic algorithms begins with the parallelisation of a sequential algorithm, the definition of a suitable signal flow graph (SFG) and systolisation, i.e. mapping of the resulting signal flow graph onto a systolic array. The result of mapping an algorithm to a systolic array is not always optimal in the maximum-throughput sense. In order to increase a systolic array's efficiency/throughput, several systolic algorithm transformation techniques can be applied [3, 4, 7]. These transformations are performed on the algorithm level, after the algorithm parallelisation and systolisation procedures have been completed. As a result, the systolic algorithm is altered, and consequently the structure/design of the underlying systolic array and the processing element functionality change, too.

The advantage of systolic array computing is a homogeneous pool of identical processing elements that results in the replication of execution resources. In addition to the systolic array structure, only one scalar processor is needed; it executes the serial parts of algorithms and at the same time controls the systolic array. Synchronization between the processing elements is achieved implicitly via synchronous inter processing element communication, i.e. data exchange. This keeps control overhead low and simplifies parallel programming.

The result of any systolic algorithm design procedure is a systolic array that can optimally execute only a particular systolic algorithm. To extend a systolic array's capability, programmability must be incorporated. Using programmable processing elements within the systolic array brings great flexibility in several respects:

The same physically implemented systolic array can execute different types of algorithms, which enables versatility of the physical implementation.

Algorithms can be changed in real-time, leading to greater flexibility in the environments that depend on multiple algorithms to solve the problem.

Suitable for system-on-chip integration in embedded systems; this is especially true for multimedia terminals, whose numerous filtering and image processing algorithms can thus all be executed on a single programmable processor array.

Our paper is focused on systolic arrays with programmable processing elements. In this respect we treat each processing element as a programmable processor with specialized inter processing element communication capabilities for efficient data exchange. Processing on systolic arrays can be presented as a combination of inner processing element operations and inter processing element communication operations, all controlled from a scalar processor. This allows us to decouple algorithm execution within the processing element from the algorithm's synchronization requirements, as specified by the selected algorithm's signal flow graph. Additionally, pipelining can be employed to achieve higher operational frequencies within the processing element, leading to throughput enhancement. However, this holds only as long as there are enough independent data elements using the same functional unit to prevent data hazards. This problem can be mitigated with the introduction of multithreading concepts within the systolic array: multithreaded systolic computation [1].

Multithreading is a technique for hiding various types of latencies and is used in contemporary operating systems, Java programs, etc. It is seen as a technique that can substantially push forward the throughput of 21st-century single processors [8]. The term thread refers to a single path of execution or control flow. There are several flavours of multithreading, which differ in thread switching intervals, in the synchronization mechanisms among threads and in the implementation requirements. At the finer granularity of processor clock cycles, multithreading can hide the latencies of jump and branch instructions or even of long latency instructions such as division or square root. This kind of implementation is covered in our contribution.

The model of a programmable processing element is the central point of the simultaneous, concurrent execution of many algorithms on one systolic array. To achieve consistent speedups with programmable and pipelined processing elements, we decompose the whole problem into smaller, unrelated tasks, or write programs that execute multiple algorithms simultaneously. Such simultaneous execution of independent data sets of the same algorithm, or even of different algorithms, on a systolic array is termed multithreaded systolic computation. It results in a higher systolic array throughput, due to improved utilization of the pipelined functional units within processing elements. It turns out that this concept can generate enough independent data sets that the throughput of each processing element is maximized, which in the end also increases the throughput of the whole systolic array. DSP and image processing algorithms as well as linear algebra offer many opportunities for multithreading on the processor array level, both systolic and SIMD.

This paper is organised as follows. Throughput limitations of systolic algorithms are presented next, covering also the traditional approaches to systolic algorithm throughput enhancement. Then the proposed multithreaded systolic computation concept is presented, together with a model of multithreaded systolic computation. Finally, the simulation results are collected: the algorithm selection and analysis are presented first, followed by a description of the simulation environment, and the results are given for two systolic algorithms, namely Gaussian elimination and Givens rotations.

2 Throughput limitations of systolic algorithms

Increasing the throughput of the systolic array is the main issue of our paper. In this section we give a brief introduction to systolic arrays and explore throughput limitations of single algorithm systolic arrays.

2.1 The systolic concept

Model of processing on systolic arrays

Systolic computation combines parallel processing and pipelining. Parallel processing is the result of parallel operation of all processing elements within a systolic array that co-operate in order to solve the problem faster. Pipelining is done on the level of the processing elements within the systolic array structure.

Each systolic algorithm viewed from the processing element perspective includes three distinct processing phases:

Data input. A new data item is input into a processing element from its neighbours or global memory (in case of bordering processing elements of the systolic array).

Algorithm processing. Algorithm processing specified by the processing element assignment of systolic algorithm is performed.

Data output. Result of the algorithm data processing phase is output to designated processing element’s neighbours or to global memory (in case of bordering processing elements of the systolic array).

All three phases constitute what is known as a systolic cycle. The systolic cycle is defined by a global clock, which synchronises data exchange between the processing elements within the systolic array.

Figure 1 depicts the processing timing within a classical single-threaded systolic processing element. Observe the distinctive I/O communication phases (I and O) and the algorithm processing phase (P). The next I phase cannot begin before the O phase from the previous systolic cycle is completed. The computations of the processing phase (P) within the processing element are executed in a number of smaller time steps designated as processing element cycles; each systolic cycle therefore consists of a number of processing element cycles. Within a systolic cycle a processing element must execute all computations defined by the systolic algorithm (the processing element assignment). When data enter a processing element (phase I), the data are processed as on a single processor. The algorithm execution phase commences (phase P), and the data exchange phase follows again (phase O). In the next systolic cycle all computations are repeated on a different set of data. The data communication phases are the synchronization mechanism between processing elements in a systolic array. In programmable systolic arrays this synchronization is achieved implicitly by the execution of certain inter processing element communication instructions.

Figure 1: Systolic processing phases within a processing element during a systolic cycle. I: input, P: processing, O: output.
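As a minimal illustration of the I-P-O cycle just described, the C sketch below runs one processing element through a few systolic cycles. The helper functions (read_neighbour, write_neighbour, pe_assignment) are stubs invented for the example, not an API of any concrete array.

#include <stdio.h>

typedef double datum;

/* Stubs standing in for neighbour/global-memory communication and for
   the processing element assignment of some systolic algorithm. */
static datum read_neighbour(void)     { return 1.0; }          /* phase I */
static void  write_neighbour(datum d) { printf("out: %f\n", d); } /* phase O */
static datum pe_assignment(datum x)   { return 2.0 * x; }      /* phase P */

int main(void)
{
    for (int cycle = 0; cycle < 4; cycle++) {  /* one iteration = one systolic cycle */
        datum x = read_neighbour();   /* I: input from neighbour or global memory */
        datum y = pe_assignment(x);   /* P: computation defined by the algorithm  */
        write_neighbour(y);           /* O: output to neighbour or global memory  */
    }
    return 0;
}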

2.2 Traditional approaches to throughput enhancement

Throughput of a systolic array can be limited for various reasons: low systolic algorithm efficiency, data synchronization between parts of a systolic array with different types of processing elements [4, 5], data dependencies, and long latency operations within a processing element. The systolic array efficiency and the systolic cycle length depend on the complexity of the processing element algorithm and on the underlying implementation of the processing elements. Both bound the systolic array throughput.

Many algorithm post-transformations can be applied to systolic solutions in order to enhance their effectiveness, lower the number of required processing elements, increase the efficiency, shorten the processing time or increase the throughput. Well known systolic array transformation techniques are c-slowing, folding, double pipelining, fast designs and two-level pipelining [3, 4, 7]. C-slowing is a rate-changing method in which c data flows are concurrently processed on the array in a time-shared fashion. Folding, double pipelining and fast designs are used for efficiency enhancement of systolic arrays, directly leading to a higher throughput of the algorithm. Two-level pipelining introduces pipelining on the processing element level and increases throughput proportionally to the number of pipeline stages within processing elements. All of these transformations are performed on the algorithm level, i.e. they treat the operations within each processing element as being executed within one systolic cycle, no matter how complicated the processing element algorithms themselves are. This assures proper data flow synchronization over the whole systolic array.

In order to achieve further throughput enhancements, a detailed look within a processing element on the level of the processing element clock cycle is necessary.

2.2.1 Latencies within systolic arrays

The definition of various latencies within a systolic array and a processing element is given next. We can observe latencies on a coarser systolic array level and finer processing element level. In the latter case, we take a look inside the processing (P) phase during systolic algorithm execution.

Systolic cycle level

Lower than 100% efficiency. Data flows between processing elements are spaced with dummy values in order to satisfy the requirement for proper data synchronization within a systolic array. Efficiency enhancing techniques outlined above can be used to deal with this problem. Alternatively, instead of dummy value processing, a data sample from another thread can be injected in the same processing element.

Synchronization problems. There are a number of systolic algorithms where the systolic array is composed of sub-arrays with different types of processing elements. Each sub-array of such a systolic array receives its own instruction sequence, commensurate with the processing element algorithm definitions. The instruction streams must be synchronized, i.e. they must execute the inter processing element communication operations that transfer data between different parts of the systolic array at the same time instant. Typical examples of such systolic algorithms are treated in Section 4.

Processing element clock cycle level

Data dependencies. Traditionally, non-pipelined functional units were preferred as building blocks within a processing element, thus eliminating data hazards. Non-pipelined functional units also exhibit lower latencies in terms of clock cycles and can execute recursions more effectively. With higher clock rates, pipelined functional units are preferred, but then we have to cope with data dependency problems. In this case hazard detection logic is required to preserve proper program flow. True data dependencies, i.e. read-after-write data hazards, force serialization of execution.

Long latency operations with pipelined functional units. The classical notation of systolic algorithms assumes that operations inside each processing element take zero time. However, real implementations of functional units take a fixed amount of time, expressed in processing element clock cycles, and all computations within a processing element must be accomplished during one systolic cycle. Functional unit operation latency is closely related to data hazards, and together they are responsible for lowering the throughput of systolic arrays. Simple arithmetic operations (multiply, add, subtract, multiply/accumulate) can be efficiently pipelined. After the initial latency period, and if there are enough independent operands supplied consecutively to a functional unit, its throughput becomes one datum per clock cycle. Multiple threads can successfully fill the pipelines of these functional units if clever instruction scheduling is performed.

Long latency operations with non-pipelined functional units. Several arithmetic operations cannot be efficiently pipelined (at least not with low real-estate requirements), for example division, square root computation, transcendental functions, etc.; instead they are executed iteratively. Their execution time does not change with the number of data samples to be processed, since there is no pipeline to fill up. More threads of the same algorithm do provide more independent instructions, but these are executed serially due to the resource constraints.
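The contrast between the two cases can be made concrete with a back-of-the-envelope cycle-count model. The C sketch below is our simplified approximation (round-robin issue, a single functional unit, a dependent chain of n operations per thread), not the simulator used in Section 4; the latency values merely mirror those in Table 2.

#include <stdio.h>

/* Pipelined unit: a dependent chain stalls L cycles per operation, but
   independent threads can be issued into the stall slots, so operations
   of one thread end up spaced max(L, M) cycles apart. */
static long pipelined_cycles(long n, long M, long L)
{
    long stride = (L > M) ? L : M;   /* spacing between dependent ops of one thread */
    return n * stride + M;           /* the last thread drains M cycles behind      */
}

/* Non-pipelined unit: every operation occupies the unit for L cycles,
   so threads serialize regardless of their independence. */
static long nonpipelined_cycles(long n, long M, long L)
{
    return n * M * L;
}

int main(void)
{
    const long n = 100, L_pipe = 5, L_iter = 17;
    for (long M = 1; M <= 8; M *= 2) {
        double s_pipe = (double)(M * pipelined_cycles(n, 1, L_pipe))
                        / pipelined_cycles(n, M, L_pipe);
        double s_iter = (double)(M * nonpipelined_cycles(n, 1, L_iter))
                        / nonpipelined_cycles(n, M, L_iter);
        printf("M=%ld  speedup(pipelined)=%.2f  speedup(non-pipelined)=%.2f\n",
               M, s_pipe, s_iter);
    }
    return 0;
}

Under this model the pipelined speedup approaches M until the issue rate saturates, while the non-pipelined speedup stays at 1.0 for any M, matching the qualitative behaviour described above.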

3 Multithreaded systolic computation

This section addresses the possibility of increasing the throughput of systolic arrays beyond that dictated by the systolic cycle. From the design-reuse perspective, the throughput increase should not be based on any algorithm transformation/mapping technique. The remaining part of the paper presents how the throughput can be significantly increased with the multithreaded systolic computation concept [1].

In the proposed approach, multiple independent algorithm threads (i.e. instances of the same algorithm) are interleaved within a given systolic cycle on the same systolic array. Data from multiple threads are pumped through the systolic array in blocks, resulting in a dramatic improvement of the net throughput. All threads share the same resources of the systolic array. Currently favoured processing element designs include single-cycle multiplier/accumulator data paths, which severely limit throughput (in terms of operating frequencies) but simplify the scheduling of operations. Instead, we propose the employment of pipelined functional units together with a sophisticated compiler/scheduler framework. Processing elements for multithreaded systolic computation could even be deeply pipelined to achieve this goal. Multithreading keeps the execution pipelines within the processing element full and thus increases the efficiency of the functional units. The proposed programmable processing element can exploit simultaneous multithreading of multiple algorithm instances within one systolic cycle.

Each thread can belong to a different systolic algorithm or be another instance of the same systolic algorithm. Data streams, i.e. systolic algorithms, pass through the systolic array and are executed without noticing the presence of the others, due to the programmable nature of the processing element. All threads share the processing element's resources, including the inter processing element communication data paths. The net result is a shortened algorithm-processing phase, achieved by running concurrently interleaved multiple independent algorithm threads on the processor array. The performance increase is due to the elimination of true data hazards within each algorithm, better functional unit utilisation and larger amounts of usable instruction level parallelism (ILP). The basic blocks describing the algorithm within the processing element become longer, thus creating more ILP. Functional unit pipelines within processing elements are kept busy by interleaving independent threads running simultaneously through the systolic array. As a side effect, the efficiency of the whole systolic array improves, since many algorithms can finish execution in a shorter time than they would if executed serially on the same systolic array.

3.1 Sources of threads within systolic arrays

For multithreaded systolic computation to thrive, it is necessary to present multiple threads to the systolic array. There are several sources of data sets that can be treated as unrelated threads:

Data vectors from different partitions. When problems are too big to fit directly onto a given systolic array, problem partitioning is used to process the sub-parts sequentially on the systolic array [3, 4]. In general, sub-parts are independent until the very end of the algorithm processing; therefore each sub-part can be regarded as an instance of the same algorithm. Data vectors from each sub-problem can thus be used as an independent thread.

Loop unrolled systolic cell algorithm. Viewing the execution from the systolic controller's point of view, we can see that the processing element computation is executed repetitively in a loop. The number of iterations equals the number of samples being processed (we disregard the additional processing steps needed for initial data loading to fill the systolic array and for data unloading to pump results out of it). In each systolic cycle, blocks of data corresponding to the unrolled loop instances are transferred between processing elements.

Multiple instances of the same algorithm. Some applications require running the same algorithm several times with different data sources (e.g. channel processing in mobile base stations or xDSL systems). The situation is very similar to loop unrolling, except that different instances of the same algorithm use different filter coefficients and thus have no common points. Algorithm processing within a processing element is completely independent and can be performed in parallel.

Simultaneous execution of different types of algorithms. The goal of processing different types of algorithms simultaneously is to utilize vacant functional units available within processing elements. At the same time it is also desirable that different types of algorithm threads use the same functional units, so that the functional units become fully utilized. However, when algorithms require non-pipelined, long latency functional units, resource contention is inevitable, resulting in longer processing.

Suitable combinations of the above. In the most complicated case, all of the above possibilities can be used to some extent to achieve higher speedup. For example, one algorithm can be unrolled and combined with a more complex counterpart.

3.2 Model of multithreaded systolic computation

We can examine multithreaded systolic computation on a logical level where all threads execute concurrently on the systolic array with the granularity of one processing element clock cycle. The implementation of multithreaded programs uses a packet data transfer approach: for each iteration of the systolic program, M data elements from M threads are input, processed and output. A graphical representation of I/O activities within a multithreaded systolic array is shown in Figure 2. Data elements of the input data vectors of the M threads are time-multiplexed into the multithreaded systolic array at a rate of one element per processing element clock cycle. Data elements of the result vectors are output at the same rate. This whole process constitutes a systolic cycle and repeats for all subsequent elements and all threads. Since the threads are independent data sets, there are no data dependencies present within processing element pipelines.

Figure 2: Logical representation of multithreaded systolic computation. Switch rate = 1/PE clock cycle.

Since all functional units within processing elements are pipelined, data are pumped at the pipeline rate of these functional units, compared to the systolic cycle rate in traditional systolic arrays (Figure 1). For each iteration, multiple unrelated data samples from all threads are input, processed and output within one systolic cycle (Figure 3). After a data sample of the first thread (O #1,1) exits the processing element, input of the next sample of the first thread (I #2,1) can begin. The process is repeated for each subsequent thread. Serialization of data elements from multiple threads occurs in the I/O phases, since we assume that only single-threaded I/O processing element bandwidth is available.

Figure 3: Multithreaded systolic processing phases within a processing element.

Next we identify three function blocks that could limit the performance increase of multithreaded systolic computation:

Inter processing element communication channels. We can expect, at some point, no further increase in systolic array throughput if the bisection bandwidth of the systolic array remains constant. This is the result of the required thread data serialisation.

Register file. Enough operand storage must be provided within each processing element to handle the increased number of data elements from multiple threads.


Delay elements. We proposed a logical view of multithreading on the systolic array in which all inter processing element communication channels are replicated. Delay elements are situated on these channels, and these delays can be implemented within the register file or local memory. The number of delay elements required in a multithreaded processing element is the sum of all delays specified for each thread.
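For example (with illustrative numbers), if the signal flow graph of one thread specifies two delays on a given communication channel and that of a second thread specifies three, the multithreaded processing element must provide five delay registers for that channel.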

main
{
  for (i = 0, j = 0, ..., M = 0; i < no_of_samples; i++, j++, ..., M++)
  {
    {
      data input thread i;
      data input thread j;
      ...
      data input thread M;
    }
    {
      algorithm execution of thread i;
      algorithm execution of thread j;
      ...
      algorithm execution of thread M;
    }
    {
      data output thread i;
      data output thread j;
      ...
      data output thread M;
    }
  }
}

Table 1: Algorithm of multithreaded systolic computation.

In single-threaded programming environments we can also simulate the appearance of multiple threads. Each thread executes a different data set of the same algorithm or a completely different algorithm. To simulate the simultaneous execution of multiple threads, all variables and data vectors belonging to different threads need to have unique names, so that they occupy different storage locations within the processing element. The pseudo source code in Table 1 shows the structure of a simulated M-threaded systolic program run on the main scalar processor and on a systolic processing element. Loop control is handled by the scalar processor.

All statements within the for loop execute on the processing elements. A large proportion of ILP can be uncovered within the algorithm processing phase, where all M threads compete for a common pool of pipelined execution units. The software scheduler is responsible for uncovering the potential ILP, which grows with the larger number of unrelated and independent operations in the longer basic blocks, and for scheduling these operations so that minimal execution times are achieved.
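The C fragment below illustrates the renaming idea on a toy multiply-accumulate kernel with two threads. The kernel, the data and the names (suffixes _t0, _t1) are invented for the example and are not the systolic programs simulated in Section 4.

#include <stdio.h>

#define N 8

int main(void)
{
    /* Per-thread state is renamed (_t0 / _t1) so the two threads occupy
       different storage locations within the processing element. */
    double x_t0[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double x_t1[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c_t0 = 0.5, c_t1 = 0.25;     /* distinct per-thread coefficients */
    double acc_t0 = 0.0, acc_t1 = 0.0;

    for (int i = 0; i < N; i++) {
        /* The two statements are mutually independent, so a VLIW scheduler
           can issue them to separate (or pipelined) multiply/add units. */
        acc_t0 += c_t0 * x_t0[i];       /* thread 0 */
        acc_t1 += c_t1 * x_t1[i];       /* thread 1 */
    }
    printf("thread0 = %f, thread1 = %f\n", acc_t0, acc_t1);
    return 0;
}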

The separation of the I/O and P phases within the systolic cycle is conditioned by the preservation of synchronization between processing elements on the systolic array. The length of the multithreaded systolic cycle is increased compared to the single-threaded systolic cycle, but several data samples from multiple threads are processed instead of only one.

4 Simulation results

4.1 Algorithm selection

A basic core of mathematical algorithms arising in problems of modern multimedia and image processing consists of linear algebra and linear operator theory. Specific tasks that need to be performed in real time in DSP systems include matrix-vector multiplication, matrix-matrix multiplication and addition, matrix inversion, solution of linear systems, least-squares approximate solution of linear systems, eigensystem solution, generalized eigensystem solution, and singular value decomposition of matrices. These algorithms are computationally intensive, requiring O(N³) operations for each data set of N input data samples.
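For illustration, at N = 512 an O(N³) algorithm already performs on the order of 512³ ≈ 1.3·10⁸ operations for a single data set, which a real-time system must sustain at the input sample rate.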

Figure 4: Systolic array. An array of interconnected processing elements.

To practically present the use of concepts described above, we focus on systolic algorithms for two well-known linear algebra algorithms, namely Gaussian elimination and Givens rotations [4, 5].

The systolic arrays for both algorithms are composed of a triangular sub-array on the left hand side (dotted processing elements are not active) and a square sub-array on the right hand side (Figure 4). All processing elements in the systolic array with the same fill pattern execute the same operation as defined by the systolic algorithms presented next.

4.1.1 Systolic algorithm for the Gaussian elimination method

Gaussian elimination is a standard method for solving linear systems of equations and is one of the most widely used algorithms for the LU decomposition of matrices [9]. The Gaussian elimination algorithm is based mainly on the multiply-accumulate operation, with the exception of the row multiplier computation, which requires a division (Figure 5).

Figure 5: Processing element assignment for the Gaussian elimination method. a) Diagonal, b) off-diagonal processing element (x' = x + m·r).

4.1.2 Systolic algorithm for the Givens rotations method

Givens rotations are usually considered a method for QR decomposition of matrices [9]. More generally, Givens rotations can be used for QR decomposition as well as for orthogonal triangularisation. The method is very attractive due to its numerical properties, i.e. stability and accuracy. As such it is a vital part of many computationally demanding methods such as solving linear systems of equations, least squares and eigenvalue problems. The Givens rotations method is rich in divide and square-root operations (Figure 6).

Diagonal processing element: if the incoming element x is zero, c' = 1, s' = 0; otherwise t = sqrt(x² + r²), c' = r / t, s' = x / t, r = t. Off-diagonal processing element: x' = -s·r + c·x, r = c·r + s·x.

Figure 6: Processing element assignment for the Givens rotations method. a) Diagonal, b) off-diagonal processing element.
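These assignments can be transcribed directly into C, as sketched below. The x = 0 guard that yields the identity rotation is our reading of the damaged condition in the original figure and should be treated as an assumption.

#include <math.h>
#include <stdio.h>

/* Diagonal PE: generates the rotation parameters (c', s') and updates the
   stored element r. Note the long-latency square root and divisions. */
static void diagonal_pe(double x, double *r, double *c, double *s)
{
    if (x == 0.0) {                            /* assumed guard: identity rotation */
        *c = 1.0; *s = 0.0;
    } else {
        double t = sqrt(x * x + (*r) * (*r));  /* long-latency sqrt   */
        *c = *r / t;                           /* long-latency divide */
        *s = x  / t;
        *r = t;
    }
}

/* Off-diagonal PE: applies the rotation (Figure 6b):
   x' = -s*r + c*x,  r' = c*r + s*x. */
static void off_diagonal_pe(double *x, double *r, double c, double s)
{
    double x_new = -s * (*r) + c * (*x);
    double r_new =  c * (*r) + s * (*x);
    *x = x_new;
    *r = r_new;
}

int main(void)
{
    double r = 3.0, x = 4.0, c, s;
    diagonal_pe(x, &r, &c, &s);        /* r becomes 5, c = 0.6, s = 0.8 */
    double x2 = 4.0, r2 = 3.0;
    off_diagonal_pe(&x2, &r2, c, s);   /* x2 is annihilated, r2 becomes 5 */
    printf("r=%f c=%f s=%f x'=%f r'=%f\n", r, c, s, x2, r2);
    return 0;
}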


4.2 Analysis of algorithms' computational requirements

An analysis of the algorithms' computational requirements, based on the above processing element assignments, is presented next. Figure 7 and Figure 8 collect the operation counts for Gaussian elimination and Givens rotations respectively. Input and output operations (I/O) represent the data that enter and exit a processing element within a single systolic cycle. The other columns in the respective figures present the number of individual arithmetic operations executed within a systolic cycle: multiplication (MUL), addition (ADD), division (DIV) and square-root computation (SQRT).

Figure 7: Operation count for Gaussian elimination: a) diagonal, b) off-diagonal processing elements.

Figure 8: Operation count for Givens rotations: a) diagonal, b) off-diagonal processing elements.

4.3 Simulation environment

The simulation is focused on the execution of computations within each processing element of the systolic array. We functionally simulate the operation of multithreaded systolic computation as executed on a processing element, employing a parameterised very long instruction word (VLIW) processor simulator. Within this simulation environment we have direct control over the number and type of functional units used in a specific model of the processing element, e.g. adders (ADD), multipliers (MUL) and divider/square rooters (DIV/SQRT).

Three different models of the processing element are treated in this paper. The parameters and names of the three simulated processing element models are given in Table 2. All three models have multiple deeply pipelined functional units. The computational functional units are pipelined into three or five stages (Table 2), except for the divide/square-root units, which have a latency of 17 processing element clock cycles. The number of memory and inter processing element communication functional units is fixed at two for model A and at eight for models B and C, and these units are two-stage pipelined. The operation issue width describes the maximum number of operations simultaneously executable within the processing element. There are no restrictions on the types of operations that can proceed concurrently, as long as there are no data or structural hazards.

PE model name   PE operation issue width   Add (T / L)   Mul (T / L)   Div/Sqrt (T / L)
A               5                          2 / 3         2 / 3         1 / 17
B               8                          2 / 5         2 / 5         2 / 17
C               8                          4 / 5         4 / 5         4 / 17

Table 2: Description of the processing element models used in the simulations. T: number of functional units of the given type, L: latency in processing element clock cycles.

4.4 Simulation results

The results present speedups achievable on a multithreaded systolic array running the selected algorithms on the presented processing element architectures. Speedups are based on the execution time ratios of multithreaded systolic computation vs. single threaded systolic computing.

Our definition does not take into account second-order effects such as data load/store to/from the systolic array. On the other hand, the speedups defined here reflect the true processing element clock cycle-by-cycle situation within the processing element, and as such they also reflect the net throughput of each processing element. Consequently, due to the synchronous operation of systolic arrays, we can with reasonable confidence take these speedups as those observed on the whole systolic array. In this respect, our definition of speedup can be regarded as equivalent to the global view of speedup defined as the execution time ratio of the application on the whole systolic array.

We can define the speedup S_MTH of a multithreaded program running one type of algorithm versus a single-threaded program running the same type of algorithm as

S_MTH = M × (execution time of a single-threaded program) / (execution time of the multithreaded program),

where M equals the number of threads. The multiplication factor M is needed because the single-threaded program runs only one instance of the algorithm with a predefined number of data points, whereas the multithreaded version runs M instances of the same algorithm concurrently, each executing the same number of data points. Both versions have to perform an equal job, and the equalisation constant is M. The speedup is calculated for the same processing element model running a different number of threads.
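As a purely hypothetical worked example of this definition: if one instance of an algorithm completes in 1000 processing element clock cycles when run single-threaded, and four interleaved instances complete together in 1100 cycles, then S_MTH = 4 × 1000 / 1100 ≈ 3.6, close to the thread count, because the four threads keep the pipelines filled.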

Simulation results are collected in Figure 9 and Figure 10. A series of algorithms executing simultaneously on a multithreaded systolic array finishes faster than a serial execution of the same tasks on the same systolic array. Note, though, that each task separately is not executed in a shorter time.

From Figure 9 we can observe that the speedup curve flattens already with two threads, since the algorithm contains long latency, non-pipelineable arithmetic operations and an insufficient number of resources is available within a given processing element. A similar observation applies to Figure 10, although higher speedups are achieved, since the long latency operation count is smaller for this algorithm. When multiple long latency functional units are available, the speedup continues to increase. It therefore pays off to equip processing elements with multiple long latency functional units if a systolic array is going to be more linear algebra oriented.

Figure 9: Multithreaded systolic computation of Givens rotations. Speedup.

Figure 10: Multithreaded systolic computation of Gaussian elimination. Speedup.

Multithreaded systolic computation creates the following effects:

PE utilization is increased. This is due to the fact that many independent operations are available, and these can be executed concurrently.

As a direct consequence of multithreading, the net throughput increases. As long as there are enough pipelined functional units and inter processing element communication channels, the speedup curve can experience a proportional increase.


Basic blocks of the code executed on processing elements become longer, which results in the opportunity for the extraction of more ILP.

Register requirements increase proportionally to the number of threads.

5 Conclusions

In this paper a novel technique for throughput enhancement of systolic arrays, termed multithreaded systolic computation, is presented. We describe the multithreaded systolic computation concept as an efficient speed-up mechanism and support it with simulation results.

Multithreaded systolic computation is very effective in increasing the throughput and utilization of systolic arrays. A linear increase in throughput is observed as long as the processing element functional units are pipelined and the algorithms do not exhibit very low computation-to-communication ratios. In the case of linear algebra algorithms that require lengthy square root and division operations that cannot be pipelined, multithreading cannot provide any significant benefit unless multiple functional units are employed.

The multithreaded systolic computation principle increases the number of operations within a given systolic cycle, and since the threads are independent of each other, more instruction level parallelism can easily be extracted from the code. Furthermore, pipelined functional units are kept busy, producing results at the theoretical maximum throughput rate. We note that the linear algebra algorithms which we selected as a proof of the proposed concept require very complex, long latency operations within each processing element (Figure 7 and Figure 8). Classical DSP algorithms (linear FIR and IIR filtering, pattern matching, etc.), on the other hand, require only multiply-accumulate type operations. As such they can experience even better speedup results [1].

Programmability of the individual processors of the systolic array offers an additional degree of freedom and is the key to a straightforward implementation of various algorithms with the multithreaded systolic computation concept. Multithreaded systolic computation provides speedups that asymptotically approach the number of threads executed simultaneously on the systolic array. The ideas outlined in the paper are applicable to systolic and SIMD arrays alike, although the simulation results are presented only for the systolic algorithms.

6 References

[1] R. Sernec, “Massively Parallel Architectures for Digital Signal Processing”, Ph.D. Thesis, University of Ljubljana, 2000.

[2] M. Zajc, "Faddeev algorithm based systolic structures for digital signal processing", Ph.D. Thesis, University of Ljubljana, 1999.

[3] H. T. Kung, C. E. Leiserson, "Systolic Arrays (for VLSI)," Technical Report CS 79-103, Carnegie Mellon University, 1978.

[4] N. Petkov, Systolic Parallel Processing, North-Holland, 1993.

[5] K. J. R. Liu, K. Yao (Eds.), High Performance VLSI Signal Processing: Innovative Architectures and Algorithms, vol. I, II, IEEE Press, 1998.

[6] J. Tasić, R. Sernec, M. Zajc, "Efficient SoC design with homogeneous processor arrays", to be presented at SCI'01 as an invited lecture, Florida, July 2001.

[7] Special issue on: Systolic Arrays, Computer, vol. 20, no. 7, July 1987.

[8] Special issue on: The Future of Micro Processors, Computer, vol. 30, no. 9, September 1997.

[9] G. H. Golub, C. F. Van Loan, Matrix Computations, 3rd edition, Johns Hopkins University Press, 1996.
