Des Autom Embed Syst (2013) 17:485–506
DOI 10.1007/s10617-014-9148-3

Effective file data-block placement for different types of page cache on hybrid main memory architectures

Penglin Dai · Qingfeng Zhuge · Xianzhang Chen · Weiwen Jiang · Edwin H.-M. Sha

Received: 15 January 2014 / Accepted: 17 September 2014 / Published online: 27 September 2014
© Springer Science+Business Media New York 2014

Abstract Hybrid main memory architectures employing both DRAM and non-volatile memories (NVMs) are becoming increasingly attractive due to the opportunities for exploring the benefits of various memory technologies, for example, high-speed writes on DRAM and low stand-by power consumption on NVMs. File data-block placement (FDP) on different types of page cache is one of the important problems that directly impact the performance and cost of file operations on a hybrid main memory architecture. Page cache is widely used in modern operating systems to expedite file I/O by mapping disk-backed file data-blocks in main memory to process space in virtual memory. In a hybrid main memory, different types of memory with different read/write costs can be allocated as page cache by the operating system. In this paper, we study the problem of file data-block placement on different types of page cache to minimize the total cost of file accesses in a program. We propose a dynamic programming algorithm, the FDP algorithm, to solve the problem optimally for simple programs. We develop an ILP model of the file data-block placement problem for programs composed of multiple regions with data dependencies. An efficient heuristic, the global file data-block placement (GFDP) algorithm, is proposed to obtain near-optimal solutions for the problem of global file data-block placement on hybrid main memory. Experiments on a set of benchmarks show the effectiveness of the GFDP algorithm compared with a greedy strategy and the ILP. Experimental results show that the GFDP algorithm reduces the total cost of file accesses by 51.3 % on average compared with the greedy strategy.

Keywords Hybrid main memory · Page cache · File data-block placement

P. Dai · Q. Zhuge · X. Chen · W. Jiang · E. H.-M. Sha
College of Computer Science, Chongqing University, Chongqing, China

Q. Zhuge (B) · E. H.-M. Sha
Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing 400044, China
e-mail: qfzhuge@…

1 Introduction

In light of recent advances in non-volatile memory technologies, hybrid main memory architectures containing both DRAM and various types of non-volatile memories (NVMs), such as phase-change memory (PCM) and ferroelectric RAM (FeRAM), are becoming a promising alternative to replace conventional DRAM main memory [3,5,6,12,17,20,21,23,31,32]. Non-volatile memories show advantages over DRAM in several aspects, including their non-volatile nature, low standby power consumption, and high density. However, non-volatile memory technologies are usually constrained by high costs of write operations compared with read operations, as well as limited write endurance. With hybrid main memory architectures, new opportunities arise in exploring the benefits of various memory technologies with low costs.

A range of new problems needs to be examined to fully understand hybrid main memory architectures. The problem of file data-block placement on different types of page cache is one of the important problems that directly impact the performance and cost of file I/O for systems using a hybrid main memory architecture.

Page cache is widely supported by operating systems, such as Linux and Windows, to expedite file I/O [4,9]. With page cache, data blocks loaded from secondary storage are organized in memory pages and cached in main memory for later accesses. Executable binaries such as libraries, for example, are typically accessed through page cache and mapped to process space in virtual memory. In Unix-like systems, many programs use the mmap() system call to conveniently create memory mappings on page cache. A mmap() system call allocates available memory pages for an opened file and returns a pointer to the accessed pages in page cache to process space. Then, a process is able to directly access and modify file data in page cache through the pointer. Data management in page cache is implemented with paging memory management and is transparent to user processes. Programs using file read/write system calls to access file data may not be aware of data accesses on page cache, because the operating system allocates available physical memory as page cache for file operations if not specified otherwise.
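This mapping-based access pattern can be sketched with a short example. The paper's examples use the C mmap(2) interface directly; the following is an analogous, minimal sketch using Python's standard mmap module (the file path and contents are illustrative assumptions, not from the paper):

```python
import mmap
import os
import tempfile

# Create a small disk-backed file (path and contents are illustrative).
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"hello page cache")

with open(path, "r+b") as f:
    # Map the whole file into the process address space. Reads and writes
    # now go through page-cache pages directly, without read()/write() copies.
    with mmap.mmap(f.fileno(), 0) as mm:
        assert mm[:5] == b"hello"
        mm[0:5] = b"HELLO"   # modify file data in place through the mapping
        mm.flush()           # write dirty pages back to the backing file

with open(path, "rb") as f:
    print(f.read())          # b'HELLO page cache'
```

As described above, the process never issues an explicit read or write for the mapped bytes; the kernel's paging machinery services the accesses from page cache.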

Because data accesses in memory are orders of magnitude faster than those on secondary storage, e.g., hard disks and flash disks, file access on page cache usually yields significant improvement in system performance [18,24]. A carefully designed data-block placement strategy benefits all kinds of programmed file I/O, either synchronous or asynchronous.

The architecture considered in this paper is a single-core system with hybrid main memory as shown in Fig. 1. Since all physical memory not assigned to the kernel or to user processes can be used as page cache, different types of memories can be allocated by the operating system as page cache. Hence, page caches on different types of memories have limited capacities. Also, read/write costs vary on different types of page caches. In this paper, we study the problem of file data-block placement on different types of page caches to minimize the total cost of file operations on a hybrid main memory. The file access cost considered in the file data-block placement problem can be latency or energy consumption, etc., depending on the optimization goal.

To the best of the authors' knowledge, there is no existing work studying the problem of placing file data-blocks on hybrid main memory consisting of DRAM and non-volatile memories. Existing techniques for the data placement problem mostly focus on allocating fixed-size scalar or array data to scratch-pad memories. These techniques cannot be directly applied to the file data-block placement problem because of the varying sizes of file data-blocks. Also, enhancements to the operating system and system calls need to be studied in order to take full advantage of file data-block placement on a hybrid memory architecture.


Fig. 1 A hybrid main memory architecture

In this paper, we propose techniques to solve the problem of file data-block placement on hybrid main memory architectures. The system call mmap() is used to illustrate the proposed techniques for solving the file data-block placement problem. Our techniques can also be applied to generate data-block placements for programs using read/write system calls in a similar way.

Our contributions include:

– We formally define the problem of file data-block placement for hybrid main memory architectures.

– An efficient algorithm, the file data-block placement (FDP) algorithm, is proposed to optimally solve the file data-block placement problem for a simple program accessing file data-blocks with varying sizes.

– We develop an ILP model to solve the file data-block placement problem for a program composed of multiple program regions with data dependencies.

– A heuristic algorithm, the global file data-block placement (GFDP) algorithm, is designed to solve the global file data-block placement problem for a program consisting of multiple program regions with data dependencies. It merges file data-blocks of sequential I/O onto the same type of memory to avoid scattered I/O on page cache.

– We propose an integrated method to solve the file data-block placement problem at the system level using the mmap() system call. It integrates compiler-based techniques, an extension of the mmap() system call, and simple modifications to the operating system to allocate page cache on specified memory types. This technique can be applied to programs using file read/write system calls for file I/O in a similar way with very few modifications.

The experiments are conducted with a set of benchmarks to show the effectiveness of the proposed techniques. The experimental results show the effectiveness of the proposed techniques by comparing the costs of solutions generated by the GFDP algorithm, a greedy algorithm, and the ILP method. The GFDP algorithm reduces the total cost of file accesses by 36.9 % on average compared with a greedy method. The ILP model produces the optimal solutions for small-size problems. However, it cannot find a solution for large-size problems within reasonable time.


The rest of this paper is organized as follows: The architecture model, application model, and a formal definition of the file data-block placement problem are presented in Sect. 2. A motivational example is provided in Sect. 3. The FDP algorithm is proposed in Sect. 4 to optimally solve the FDP problem in a program region. In Sect. 5, we propose the ILP model and the GFDP algorithm for solving the global file data-block placement problem. The experimental setup and results are described in Sect. 6. Related works are discussed in Sect. 7. Finally, Sect. 8 concludes this paper.

2 System model and problem definition

In this section, we present a hybrid main memory architecture with different types of page cache. An application model that utilizes page caches on a hybrid main memory to expedite file accesses is explained. A formal definition of the FDP problem is also presented. At the end of this section, we use user-space file operations as an example to illustrate a solution for allocating file data-blocks to designated page caches by integrating a compiler-based technique, an extension of the mmap() system call, and simple modifications to the operating system.

2.1 System model

The architecture model studied in this paper is a single-core system equipped with hybrid main memory as shown in Fig. 1. The hybrid main memory consists of multiple types of memory, such as DRAM, FeRAM, and PCM, with different access latencies and sizes. Therefore, page caches on various types of memories have various read/write costs and limited capacities. Applications can access data across multiple types of memories. The hybrid main memory architecture provides flexibility for the operating system to place file data-blocks on page caches in different types of memories to minimize access cost.

Page cache is implemented in the kernel with paging memory management. To improve the efficiency of file operations using page cache, the operating system checks for the existence of requested file data-blocks in page cache. If there exist valid pages of file data-blocks in page cache, the operating system directly fetches data from page cache. If file data-blocks are not cached, the operating system allocates enough memory space as page cache and loads the file data-blocks into the allocated memory space. Therefore, the capacities of different types of page caches are limited by the sizes of the corresponding types of memories.
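The lookup-or-load behavior described above can be sketched with a toy model (the class, method names, and block identifiers are our own illustrations, not a kernel API):

```python
class PageCacheSim:
    """Toy model of page-cache lookup: serve cached blocks directly,
    otherwise allocate space and load the block from secondary storage."""

    def __init__(self, read_block_from_disk):
        self.pages = {}                          # block id -> cached data
        self.read_block_from_disk = read_block_from_disk
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.pages:               # valid pages already cached
            self.hits += 1
        else:                                    # allocate and load from disk
            self.misses += 1
            self.pages[block_id] = self.read_block_from_disk(block_id)
        return self.pages[block_id]

cache = PageCacheSim(lambda b: f"data-{b}")
cache.read("fb1"); cache.read("fb2"); cache.read("fb1")
print(cache.hits, cache.misses)   # 1 2
```

The second access to "fb1" is served from the cache; only the first access to each block pays the secondary-storage cost.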

Many programs use the mmap() system call to create memory mappings on page cache. Then, a process is able to directly access data in page cache using a pointer mapped into process space without copying data between buffers. Modifications to file data in page cache are finalized on disk after the page cache space is un-mapped. A program sketch of file accesses using the traditional mmap() system call is shown in Fig. 2. File read/write operations using fread/fwrite system calls can also benefit from page cache implicitly through the operating system.

In a hybrid main memory architecture, various types of memories can be used as page cache for file data-blocks to reduce the file access cost with flexibility. In order to generate file data-block placements adapted to a running environment, we study the technique of file data-block placement for a program region, such as a basic block, that accesses files. A program composed of one or more program regions is shown in Fig. 3a. A program composed of multiple regions with data dependencies can be modeled by a directed acyclic graph (DAG) as shown in Fig. 3b.

Each node in the DAG represents a program region with file accesses. An edge between two nodes represents a data dependency between two program regions. For example, the program


Fig. 2 A program sketch of file accesses on page cache using a pointer returned by mmap()

Fig. 3 Graph model of a program divided into regions. a A program divided into regions. b A DAG for the program shown in (a)

in Fig. 3a is divided into four regions. Program region 1 is represented by node 1 in the DAG as shown in Fig. 3b. A data dependency from region 1 to region 2 is represented by an edge from node 1 to node 2 as shown in Fig. 3b. An execution sequence has to follow the precedence relationship defined by the data dependencies in a program. Note that a file can be accessed by multiple program regions.
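The precedence relationship can be made concrete with a small sketch. The exact edges of Fig. 3b are not reproduced in the text, so the 4-region DAG below is a hypothetical example of our own:

```python
from itertools import permutations

# Hypothetical DAG: region 1 precedes regions 2 and 3, which precede region 4.
regions = [1, 2, 3, 4]
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

def valid_orders(regions, edges):
    """Yield every execution sequence that respects all data dependencies."""
    for order in permutations(regions):
        pos = {u: i for i, u in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in edges):
            yield order

print(list(valid_orders(regions, edges)))   # [(1, 2, 3, 4), (1, 3, 2, 4)]
```

Only the relative order of regions 2 and 3 is free here; as Sect. 3 shows, choosing among such legal orders affects the total file access cost when regions share data-blocks.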

In this paper, we propose an integrated solution to exploit the advantages of FDP on hybrid memory. It involves slight modifications to the operating system and the compiler in order to explicitly choose an appropriate type of main memory for caching file data-blocks.

Modern operating systems, such as Linux kernel 2.6.11, use a buddy system algorithm to partition memory into various memory zones. In a hybrid memory system, each type of memory is considered as a memory zone. The free page frames of each memory zone are managed by a unique buddy system. When the page allocation function is called, a flag is added to specify a memory zone. Then, the corresponding buddy system is able to allocate free pages from the specified memory zone. The requested file data-blocks can thus be placed in the specified page cache according to the flag.

A typical mmap() system call is shown in Fig. 2. The mmap function asks the kernel to allocate a new virtual memory area for placing contiguous file data-blocks. The allocated page cache in this example holds 4,096 bytes of data, i.e., a file data-block of four pages. The mmap function returns a pointer to the beginning address of the allocated space for file data-blocks. Then, the user process is able to directly access and modify file data simply by using the returned pointer. The problem is that the mmap() shown in Fig. 2 cannot specify any memory zone explicitly.


Fig. 4 An extension of the mmap function

We propose to overload the mmap function by adding a specified memory zone as one of its parameters. In order to place file data on a designated memory zone, we introduce a parameter mem_zone to specify the type of memory zone. An example of the extended mmap function is shown in Fig. 4. Once the placement of file data-blocks is generated, the compiler is able to insert the memory-zone parameter into the extended mmap function to specify a designated memory type for allocating page cache. Then, the operating system is able to interpret the values of the parameters specified in the extended mmap function. It allocates available memory space in the designated memory zone using the buddy system. Finally, a pointer is returned to user process space.

When file data-blocks are accessed by the read and write system calls, the operating system implicitly places the file data-blocks in page cache if not otherwise specified. We can also extend the file read and write system calls in a similar way to place the file data-blocks in page cache on a designated memory zone.

2.2 Problem definition

Before we formally define the problem of FDP on hybrid main memory, we first need to define the placement function. A placement function P: FB → M is defined as a mapping function from file data-block fb_i ∈ FB to a page cache m_k ∈ M, indicating that file data-block fb_i is loaded to the page cache m_k.

Definition 1 (File data-block placement problem) Given a set of requested file data-blocks FB = {fb_1, fb_2, ..., fb_K} and their read/write frequencies, a set of page caches M = {m_1, m_2, ..., m_N} on different types of memories with various read/write costs and limited capacities, and also given the initial placement of file data-blocks P_0(fb_i) for fb_i ∈ FB, the objective of the file data-block placement problem is to determine the placement of the requested file data-blocks in FB on the multiple page caches in M on a hybrid main memory such that the file access cost is minimized.

3 Motivational example

In this section, we show the advantage of our technique with a motivational example. First, an example of file data-block placement for one program region is presented. Then, we show the impact of execution sequence and data-block sharing among multiple program regions on cost reduction during execution of the whole program.

We assume that there exist three types of page caches and a secondary storage as shown in Table 1a. The type M4 represents secondary storage. Table 1a lists the read cost and write cost of multiple types of page caches with their capacities. Table 1b lists the information of each requested file data-block, such as the number of reads and the number of writes, as well as the file block


Table 1 Input data for the motivational example

(a) Properties of page caches on different types of memories

Page caches   Capacity   Read cost   Write cost
M1            3          2           3
M2            6          7           7
M3            8          5           15
M4            17         25          50

(b) Access frequencies and sizes of file data-blocks

File data-blocks   # of Reads   # of Writes   Size
A                  2            3             3
B                  8            8             2
C                  9            2             1
D                  1            7             4
E                  7            4             2
F                  3            5             3
G                  8            6             2

Fig. 5 The result of file data-block placement on multiple page caches. a Result of file data-block placement using the LCF method. b The optimal result of file data-block placement

size. For example, file data-block A of size 3 is read two times and written three times. For simplicity, the initial location of all the file data-blocks is in the secondary storage M4.

We consider the case that several file data-blocks are requested during execution of one program region. We would like to place the requested file data-blocks on specified types of page cache in order to minimize the file access cost. Figure 5 shows the results of placement for the file data-blocks, as well as the access costs, using various techniques. Let us first look at the placement solution generated by a greedy method, the Low-Cost-First (LCF) algorithm, as shown in Fig. 5a. The basic idea of the LCF algorithm is to find the type of page cache with the lowest cost for each accessed file data-block. File data-block A is considered first, and it is placed in M1 because the cost is the lowest among all types of page caches. Then, file data-block B is placed in M2 since there is not enough space for placing B in M1. After placing all the file data-blocks, the total access cost is 1,455. The optimal placement in this case, however, is to place file data-blocks B and C together in M1 and A in M3, as shown in Fig. 5b. The total memory access cost can be reduced from 1,455 to 1,036, a 39.9 % reduction compared to the greedy LCF method.
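A minimal sketch of the Low-Cost-First baseline as we read it, using the data of Table 1 and the per-block cost of Eqs. (1)-(2) from Sect. 4; the processing order A..G and the tie-breaking rule are assumptions on our part, so only the placements of A and B confirmed by the text above are checked:

```python
# (capacity, read cost, write cost) per cache type, from Table 1a.
CACHES = {"M1": (3, 2, 3), "M2": (6, 7, 7), "M3": (8, 5, 15), "M4": (17, 25, 50)}
# (reads, writes, size) per block, from Table 1b.
BLOCKS = {"A": (2, 3, 3), "B": (8, 8, 2), "C": (9, 2, 1), "D": (1, 7, 4),
          "E": (7, 4, 2), "F": (3, 5, 3), "G": (8, 6, 2)}

def lcf(blocks, caches, initial="M4"):
    """For each block in turn, pick the feasible cache with the lowest
    moving-plus-access cost (move cost is zero if the block stays put)."""
    free = {m: cap for m, (cap, _, _) in caches.items()}
    placement, total = {}, 0
    for name, (rn, wn, size) in blocks.items():
        best = None
        for m, (_, rc, wc) in caches.items():
            if free[m] < size:
                continue
            move = 0 if m == initial else caches[initial][1] * size + wc * size
            cost = move + rn * rc + wn * wc
            if best is None or cost < best[0]:
                best = (cost, m)
        cost, m = best
        free[m] -= size
        placement[name] = m
        total += cost
    return placement, total

placement, total = lcf(BLOCKS, CACHES)
print(placement["A"], placement["B"])   # M1 M2
```

With these numbers, A is cheapest in M1 (move 25*3 + 3*3 = 84, access 2*2 + 3*3 = 13), and B falls back to M2 once M1 is full, matching the narrative above.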

In the following, we show a motivational example for the global file data-block placement problem. The program regions and the file data-blocks accessed by each region are shown in Fig. 6a. Figure 6a shows the access frequencies of file data-blocks in each region. For example,


Fig. 6 An example of global file data-block placement. a Access frequencies of file data-blocks in different regions. b The graph model of a program and the file data-blocks requested by multiple program regions

Table 2 Solutions of global file data-block placement for the example shown in Fig. 6

(a) The file data-block placement generated by the LCF algorithm

Region   A    B    C    D    E
1        m4   m1   m4   m2   m3
2        m1   m4   m2   m4   m4
3        m4   m1   m4   m2   m4
4        m1   m4   m2   m4   m3
5        m4   m1   m4   m2   m4
6        m1   m4   m2   m4   m4

(b) An optimal file data-block placement for the whole program

Region   A    B    C    D    E
1        m4   m3   m4   m2   m1
3        m3   m1   m4   m2   m4
2        m3   m1   m3   m2   m4
4        m3   m1   m3   m2   m4
6        m4   m1   m3   m2   m3
5        m4   m1   m3   m2   m3

the file data-block B is read eight times and written four times in program region 1. The sizes of file data-blocks A, B, C, D, and E are 2, 2, 4, 5, and 2, respectively. The information of the page caches is the same as in the previous example, as shown in Table 1a. The execution order generated by the LCF method is 1→2→3→4→5→6. The execution order and file data-block placement are shown in Table 2a. In this case, file block B is first placed in page cache m1 for the execution of region 1. Then, it is moved to page cache m4 in region 2. The execution sequence causes unnecessary "thrashing" between two different page caches, which incurs additional file access cost. The resulting total cost for the LCF method is 20,645.

The total cost of file accesses can be reduced by taking advantage of the common file data-blocks shared by regions executed consecutively. The execution sequence now becomes 1→3→2→4→6→5. As a result, file blocks C and D stay in the same page cache for region 1 and region 3. Therefore, the cost of moving data-blocks is saved. It turns out that this is the optimal solution for the motivational example, as shown in Table 2b. The total cost of file accesses with the optimal execution order is reduced from 20,645 to 9,555, a cost reduction of 53.8 % compared with the greedy method.

As we can see from this example, the execution order has a great impact on the access cost of file data-blocks because of data sharing among different program regions.

Table 3 Notations used in the FDP algorithm

Notation     Definition
M            A set of types of page caches, M = {m_1, m_2, ..., m_N}
FB           A set of file data-blocks accessed by the program, FB = {fb_1, fb_2, ..., fb_K}
S_i          The size of file data-block fb_i
RN_i         The number of reads for file data-block fb_i
WN_i         The number of writes for file data-block fb_i
SIZE_{m_i}   The capacity of page cache type m_i
RC_{m_i}     The cost of reading a unit size of file data-block on memory m_i
WC_{m_i}     The cost of writing a unit size of file data-block on memory m_i
fp_i         The file which contains file block fb_i
a_i          The start position of file block fb_i in the file fp_i
b_i          The end position of file block fb_i in the file fp_i

4 Solving the file data-block placement problem with dynamic programming

In this section, we present the FDP algorithm for solving the problem of file data-block placement on multiple types of page caches using dynamic programming. The algorithm solves the file data-block placement problem optimally for simple programs composed of one program region.

4.1 Notation and problem definition

Before presenting the algorithm, we first define the notations to be used in the algorithm and in the problem description. Table 3 shows the list of notations with detailed explanations.

The moving cost of a file data-block fb_k refers to the cost of reading the fb_k data-block on page cache m_i plus writing fb_k on page cache m_j, where i ≠ j. Hence, the moving cost is defined as follows:

MOVE(m_i, m_j, fb_k) = RC_{m_i} × S_k + WC_{m_j} × S_k    (1)

where S_k is the size of file data-block fb_k.

The memory access cost of a file data-block fb_i on page cache m_k refers to the total cost of file operations on page cache m_k for fb_i. The following equation calculates the file access cost for reading/writing file data-block fb_i on page cache m_k. The memory access cost is calculated as:

COST(m_k, fb_i) = RN_i × RC_{m_k} + WN_i × WC_{m_k}    (2)

Given a set of file data-blocks FB = {fb_1, fb_2, ..., fb_K} and their read/write frequencies in one program region, also given a set of page caches M = {m_1, m_2, ..., m_N} with their read/write costs and capacities, and given the initial placement of file data-blocks P_0(fb_i), i = 1, 2, ..., K, the objective of file data-block placement is to place the file data-blocks in FB on the page caches in M to achieve the minimum file access cost, as described by the following objective function.

min Σ_{k=1}^{K} MOVE(P_0(fb_k), P(fb_k), fb_k) + Σ_{k=1}^{K} COST(P(fb_k), fb_k)    (3)
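For concreteness, objective (3) can be evaluated for a candidate placement as follows; this is a sketch of ours, with the cost values taken from Table 1a and all variable names our own:

```python
def move(src, dst, size, rc, wc):
    """MOVE(m_i, m_j, fb_k) = RC_{m_i} * S_k + WC_{m_j} * S_k, Eq. (1)."""
    return 0 if src == dst else rc[src] * size + wc[dst] * size

def access(m, rn, wn, rc, wc):
    """COST(m_k, fb_i) = RN_i * RC_{m_k} + WN_i * WC_{m_k}, Eq. (2)."""
    return rn * rc[m] + wn * wc[m]

def total_cost(blocks, p0, p, rc, wc):
    """Objective (3): moving cost from the initial placement plus access cost."""
    return sum(move(p0[b], p[b], s, rc, wc) + access(p[b], rn, wn, rc, wc)
               for b, (rn, wn, s) in blocks.items())

rc = {"M1": 2, "M4": 25}            # read costs from Table 1a
wc = {"M1": 3, "M4": 50}            # write costs from Table 1a
blocks = {"A": (2, 3, 3)}           # (reads, writes, size) of block A
# Moving A from M4 to M1 costs 25*3 + 3*3 = 84; accessing it on M1 costs
# 2*2 + 3*3 = 13, so the objective value is 84 + 13 = 97.
print(total_cost(blocks, {"A": "M4"}, {"A": "M1"}, rc, wc))   # 97
```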

4.2 The file data-block placement algorithm

In this section, we present a dynamic programming algorithm, the file data-block placement (FDP) algorithm, to optimally solve the problem of file data-block placement for simple programs. The main idea is to transform this problem into a multi-dimensional 0-1 knapsack problem. The file data-blocks are treated as the items, with assigned values and sizes. Each type of page cache is considered as a knapsack with limited capacity. Then, we are able to solve the problem by dynamic programming.

The algorithm consists of three major steps:

First, we preprocess the file blocks. File blocks of sequential I/O can be placed consecutively to avoid multiple disk requests. Therefore, the information of the accessed file blocks should be preprocessed. We merge two file blocks fb_i and fb_j into one file block if they are adjacent or overlapping in the same file. Hence, the conditions for merging two file blocks are: (1) fp_i = fp_j; (2) a_j ≤ b_i ≤ b_j or a_i ≤ b_j ≤ b_i. The size, memory access cost, and moving cost of the merged file block are then defined as follows:

S_{ij} = S_i + S_j − S'_{ij}    (4)

COST(m, fb_{ij}) = COST(m, fb_i) + COST(m, fb_j)    (5)

MOVE(m_k, m_l, fb_{ij}) = RC_{m_k} × S_{ij} + WC_{m_l} × S_{ij}    (6)

where S_{ij} refers to the size of the merged file block and S'_{ij} refers to the overlapped size of the two merged file blocks. After that, we acquire the set of merged file blocks FB' = {fb'_1, fb'_2, ..., fb'_{K'}}.

Second, we compute the moving costs MOVE(m_i, m_j, fb'_k) by Eq. (1) and the access costs COST(m_i, fb'_k) by Eq. (2) for all pairs of page caches and file data-blocks.

In the third step, the optimal solution is generated by computing the cells of the dynamic programming table and tracing back through the table for the final placement. The details of the FDP algorithm are shown in Algorithm 4.1.

Each cell of the dynamic programming table is defined as C[j, s_1, s_2, ..., s_N], where j represents the merged file data-block fb'_j to be allocated in page cache, and s_i indicates the available space of page cache m_i. The initial value of s_i is equal to the capacity of page cache m_i. Each element C[j, s_1, s_2, ..., s_N] in the dynamic programming table represents the minimal cost when the first j file data-blocks are already placed in page cache while the remaining file data-blocks stay in their initial locations, and there are still s_i pages available in page cache m_i, where i = 1, 2, ..., N. We can calculate this table in a bottom-up fashion. The recursion function for computing C[j, s_1, s_2, ..., s_N] is shown in Eq. (7).

In order to trace the solution back from the dynamic programming table, we define a multi-dimensional array R[j, s_1, s_2, ..., s_N] to record the location of file data-block fb'_j along the computation.

The time complexity of the FDP algorithm shown in Algorithm 4.1 is O(M × C^N), where M is the number of file data-blocks, C is the maximal capacity among all types of page caches, and N is the number of types of page cache. Therefore, the FDP algorithm is a polynomial-time algorithm, because the number of memory types N is usually a given constant for a hybrid main memory architecture.

Algorithm 4.1 The FDP algorithm

Input: A set of file data-blocks FB = {fb_1, fb_2, ..., fb_K}, their sizes S_i, their read counts RN_i and write counts WN_i, where i = 1, 2, ..., K.
A set of types of page caches M with limited capacities SIZE_{m_i} and read/write costs RC_{m_i}/WC_{m_i}, i = 1, 2, ..., N. We assume that m_N is large enough to store all file data-blocks.
The initial location of each file data-block, P_0(fb_i), i = 1, 2, ..., K.
Output: An optimal placement of file data-blocks on page caches, P(fb_i), i = 1, 2, ..., K.
1: Merge the file blocks and acquire the set of merged file blocks FB' = {fb'_1, fb'_2, ..., fb'_{K'}}.
2: For each merged file data-block fb'_k and each page cache type m_i, calculate the moving cost MOVE(P_0(fb'_k), m_i, fb'_k) by Eq. (1).
3: For each merged file data-block fb'_k and each memory type m_i, calculate the memory access cost COST(m_i, fb'_k) by Eq. (2).
4: Initialize an array C[|FB'|+1, SIZE_{m_1}+1, SIZE_{m_2}+1, ..., SIZE_{m_N}+1] for computing the costs of file data-block placements.
5: Initialize an array R[|FB'|+1, SIZE_{m_1}+1, SIZE_{m_2}+1, ..., SIZE_{m_N}+1] to keep a record of the locations of file data-blocks.
6: for j ← 0 to |FB'| do
7:   for s_1 ← SIZE_{m_1} downto 0 do
8:     for s_2 ← SIZE_{m_2} downto 0 do
9:       ...
10:        Calculate C[j, s_1, s_2, ..., s_N] by Eq. (7).
11:        R[j, s_1, s_2, ..., s_N] ← the type of page cache on which the j-th file data-block is allocated in C[j, s_1, s_2, ..., s_N].
12:      end for
13:   end for
14: end for
15: The minimum total cost is the value in C[|FB'|, 0, 0, ..., 0].
16: Find the placement of the file data-blocks by tracing back through the array R from R[|FB'|, 0, 0, ..., 0].

C[j, s_1, s_2, ..., s_N] =
  Σ_{i=1}^{K} COST(P_0(fb_i), fb_i),   if j = 0;
  ∞,   if Σ_{i=1}^{N} s_i + Σ_{i=1}^{j} S_i > Σ_{k=1}^{N} SIZE_{m_k};
  min{ C[j−1, s_1, s_2, ..., s_N],
       min_{k ∈ {1,2,...,N}} { C[j−1, s_1, ..., s_k + S_j, ..., s_N]
         + COST(m_k, fb_j) − COST(P_0(fb_j), fb_j)
         + MOVE(P_0(fb_j), m_k, fb_j) } },   if Σ_{i=1}^{N} s_i + Σ_{i=1}^{j} S_i ≤ Σ_{k=1}^{N} SIZE_{m_k}.    (7)
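The recurrence of Eq. (7) can be sketched compactly as a top-down dynamic program memoized over (j, remaining capacities). The instance below is a toy example of our own (two small caches with the costs of M1 and M2 from Table 1a, plus an enlarged backing store playing the role of m_N); it is a sketch of the technique, not the paper's implementation:

```python
from functools import lru_cache

# (capacity, read cost, write cost) per cache type; the last entry is m_N.
caches = [(3, 2, 3), (6, 7, 7), (100, 25, 50)]
# (reads, writes, size) per merged block; all blocks start in m_N.
blocks = [(2, 3, 3), (8, 8, 2)]
INIT = len(caches) - 1

def solve(caches, blocks, init):
    @lru_cache(maxsize=None)
    def best(j, free):
        """Minimum cost of placing blocks j.. given the remaining capacities."""
        if j == len(blocks):
            return 0, ()
        rn, wn, size = blocks[j]
        candidates = []
        for k, (_, rc, wc) in enumerate(caches):
            if free[k] < size:
                continue
            move = 0 if k == init else caches[init][1] * size + wc * size
            cost = move + rn * rc + wn * wc          # Eq. (1) + Eq. (2)
            nfree = list(free)
            nfree[k] -= size
            sub, rest = best(j + 1, tuple(nfree))
            candidates.append((cost + sub, (k,) + rest))
        return min(candidates)

    return best(0, tuple(c[0] for c in caches))

cost, placement = solve(caches, blocks, INIT)
print(cost, placement)   # 227 (1, 0): block 0 goes to cache 1, block 1 to cache 0
```

On this instance the optimum is not greedy: block 0 takes the more expensive cache so that the heavily written block 1 can occupy the cheapest one, echoing the motivational example of Sect. 3.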

5 The ILP model and a heuristic for solving the global file data-block placement problem

In this section, an ILP model is presented to solve the global file data-block placement problem for programs composed of multiple program regions with data dependencies. Since the time complexity of the ILP model is exponential, a heuristic algorithm is proposed to generate near-optimal solutions in polynomial time.


5.1 The ILP model and notations

Let us consider a program composed of multiple program regions u_1, u_2, ..., u_T with data dependencies. The execution unit of the program is one region. There exist data dependencies between program regions. The whole program can be modeled by a directed acyclic graph G = (V, E), where V is the set of program regions and E is the set of dependency edges. Each region can access file data-blocks in secondary storage through file operations. The execution order of program regions, then, has to follow the data dependencies defined by the edges. Note that a file can be accessed by multiple regions.

The file operation cost of a program region u_i is the summation of the file access costs for all the file data-blocks accessed in program region u_i, computed by Eq. (8):

CS(u_i) = Σ_{k=1}^{K} COST(P_{u_i}(fb_k), fb_k), ∀u_i ∈ U    (8)

where fb_k refers to the file data-blocks accessed in program region u_i, and P_{u_i}(fb_k) denotes the location of file block fb_k in region u_i.

The moving cost between two regions is calculated as the summation of all the moving costs incurred when moving file data-blocks between regions u_i and u_j:

MV(u_i, u_j) = Σ_{k=1}^{K} MOVE(P_{u_i}(fb_k), P_{u_j}(fb_k), fb_k), ∀e(u_i, u_j) ∈ p    (9)

where p represents the execution order of the program regions, and e(u_i, u_j) ∈ p represents that region u_j executes immediately after region u_i finishes. If e(u_i, u_j) ∉ p, then MV(u_i, u_j) equals 0.

We de?ne the total ?le access cost of the whole program as the summation of ?le access costs in each region and moving costs between consecutive regions.The objective function of the global ?le data-block placement problem is de?ned as follows:min ??T i =1CS (u i )+ ?u i ,u j ∈U

MV (u i ,u j )??

(10)As we’ve seen in the motivational example,execution order of program regions has a great impact on the moving cost between regions because of ?le data-block sharing among consecutive program regions.

In the ILP formulation, we also preprocess the file data-blocks as described in Algorithm 4.1; the file data-blocks here refer to merged file data-blocks. In the following, a set of ILP formulations is presented to solve the global file data-block placement problem optimally. A list of the notations used in the ILP model is shown in Table 4.

5.1.1 Mapping constraints

The architectural model considered in this paper is a single-core system. Therefore, each step can only execute one program region:

Σ_{i=1}^{T} Step(d, u_i) = 1, d = 1, 2, ..., D    (11)

where D is the maximum number of steps.


Table 4 Notations used in the ILP model

Notation        Definition
U               The set of regions U = {u_1, u_2, ..., u_T}
Reg(d)          The program region executed at step d
RN(u_i, fb_k)   The number of reads for file data-block fb_k in region u_i
WN(u_i, fb_k)   The number of writes for file data-block fb_k in region u_i
RC(m_l)         The read cost per page on the type of page cache m_l
RW(m_l)         The write cost per page on the type of page cache m_l
SIZE(m_l)       The capacity of page cache m_l
Step(d, u_j)    = 1 if and only if region u_j is executed at step d

On the other hand, each program region is executed exactly once over all steps. This can be formulated as:

D

d=1

Step(d,u i)=1,i=1,2,...,T.(12)
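Constraints (11) and (12) together force the Step variables to form a permutation matrix: exactly one region per step and one step per region. As a quick illustration (a hypothetical Python sketch, not part of the paper's ILP solver), a candidate Step matrix can be checked directly:

```python
# Check mapping constraints (11) and (12) on a binary Step matrix.
# Step[d][i] == 1 iff region u_i is executed at step d (0-indexed here).
# Hypothetical helper; the paper gives only the ILP constraints.

def satisfies_mapping_constraints(step):
    D = len(step)      # number of steps
    T = len(step[0])   # number of regions
    # (11): each step executes exactly one region
    per_step = all(sum(step[d]) == 1 for d in range(D))
    # (12): each region is executed exactly once over all steps
    per_region = all(sum(step[d][i] for d in range(D)) == 1 for i in range(T))
    return per_step and per_region

# A valid schedule is exactly a permutation matrix:
ok = satisfies_mapping_constraints([[0, 1, 0],
                                    [1, 0, 0],
                                    [0, 0, 1]])
bad = satisfies_mapping_constraints([[1, 1, 0],
                                     [0, 0, 1],
                                     [0, 0, 0]])
print(ok, bad)  # True False
```

Any binary matrix satisfying both constraints encodes one total execution order of the T regions over D = T steps.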

We also define an array of binary variables X(d, u_j, fb_k, m_l) to denote whether file data-block fb_k is allocated on page cache type m_l while region u_j is executed at step d. If program region u_j is executed at step d, and there are file data-blocks to be accessed by u_j at that step, then some X(d, u_j, fb_k, m_l) must equal 1. Hence, for each step d and each program region u_j, the following inequalities hold:

Σ_{k=1}^{K} Σ_{l=1}^{N} X(d, u_j, fb_k, m_l) / A ≤ Step(d, u_j)   (13)

Σ_{k=1}^{K} Σ_{l=1}^{N} X(d, u_j, fb_k, m_l) ≥ Step(d, u_j)   (14)

where A is a sufficiently large number.

5.1.2 Precedence constraints

An edge e=(u i,u j)∈E indicates region u j depends on region u i.That is,u j is not able to execute until u i is?nished.This constraint can be de?ned as following inequality:

D

d=1Step(d,u i)×d≤

D

d=1

Step(d,u j)×d,?(u i,u j)∈E.(15)

To obtain the execution order of the program regions, we define the notation Reg(d) to denote the program region executed at step d:

Reg(d) = Σ_{j=1}^{T} Step(d, u_j) × u_j   (16)

5.1.3 Memory constraints

A file data-block fb_k accessed by a region u_j can only be placed on one type of page cache:

Σ_{d=1}^{D} Σ_{l=1}^{N} X(d, u_j, fb_k, m_l) = 1,  ∀u_j ∈ U, fb_k ∈ FB   (17)


Since the capacity of the page cache on each type of memory is limited, the total size of the file data-blocks allocated on a page cache must not exceed its capacity. This is formulated as follows:

Σ_{d=1}^{D} Σ_{k=1}^{K} X(d, u_j, fb_k, m_l) × S_k ≤ SIZE(m_l)   (18)

5.1.4 Execution relationship

In order to calculate the moving cost, we introduce an intermediate binary variable Z(d, u_i, u_j, fb_k, m_l, m_n). It equals 1 when the following conditions are satisfied: first, program region u_i is executed at step d and region u_j is executed at step d + 1; second, file data-block fb_k accessed by region u_i is allocated on page cache m_l; last, file data-block fb_k accessed by u_j is placed on page cache m_n.

When Z(d, u_i, u_j, fb_k, m_l, m_n) equals 1, region u_i executed at step d accesses file data-block fb_k on page cache m_l, while region u_j executed at step d + 1 accesses fb_k on page cache m_n; that is, X(d, u_i, fb_k, m_l) = X(d + 1, u_j, fb_k, m_n) = 1. The constraint is expressed as:

(X(d, u_i, fb_k, m_l) + X(d + 1, u_j, fb_k, m_n) − 1) / 2 ≤ Z(d, u_i, u_j, fb_k, m_l, m_n)
(X(d, u_i, fb_k, m_l) + X(d + 1, u_j, fb_k, m_n)) / 2 ≥ Z(d, u_i, u_j, fb_k, m_l, m_n)   (19)

In order to calculate the moving cost, we also define a binary variable Y(u_i, u_j, fb_k, m_l, m_n). It equals 1 when the following conditions hold: first, region u_i and region u_j are executed in consecutive order; second, region u_i accesses file data-block fb_k on page cache m_l, and region u_j accesses fb_k on page cache m_n. In other words, Y(u_i, u_j, fb_k, m_l, m_n) equals 1 if and only if there exists a step d such that Z(d, u_i, u_j, fb_k, m_l, m_n) equals 1.

Y(u_i, u_j, fb_k, m_l, m_n) = Σ_{d=1}^{D} Z(d, u_i, u_j, fb_k, m_l, m_n),  for l ≠ n   (20)

When l = n, file data-block fb_k is placed on the same page cache for both region u_i and region u_j. The moving cost is therefore zero, i.e., Y(u_i, u_j, fb_k, m_l, m_n) = 0.
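The pair of inequalities in (19) is the standard linearization of the logical AND of two binary variables: for binary Z, they force Z = 1 exactly when both X terms equal 1. This can be verified exhaustively (hypothetical Python sketch):

```python
# Exhaustive check that inequalities (19) force z = x AND y for binary
# x, y, z: the only binary z inside both bounds is the product x*y.
from itertools import product

def feasible_z(x, y):
    return [z for z in (0, 1)
            if (x + y - 1) / 2 <= z <= (x + y) / 2]

for x, y in product((0, 1), repeat=2):
    assert feasible_z(x, y) == [x * y]
print("z is forced to x AND y for every binary x, y")
```

This is why no explicit product term X × X is needed in the ILP: the two linear bounds alone pin Z to the conjunction.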

The total cost of file operations is composed of two parts: the file access cost and the moving cost. The file access cost function CS(u_j) represents the total cost of the file data-block accesses of program region u_j. It is calculated as the summation of the read/write costs of the file data-blocks accessed in the region:

CS(u_j) = Σ_{d=1}^{D} Σ_{k=1}^{K} Σ_{l=1}^{N} X(d, u_j, fb_k, m_l) × (RC(m_l) × RN(u_j, fb_k) + WC(m_l) × WN(u_j, fb_k)),  ∀u_j ∈ U   (21)

The total cost for moving file data-blocks is computed as the summation of the moving costs between any two consecutive regions in an execution sequence:

MV(u_i, u_j) = Σ_{k=1}^{K} Σ_{l=1}^{N} Σ_{n=1}^{N} Y(u_i, u_j, fb_k, m_l, m_n) × S_k × (RC(m_l) + WC(m_n)),  ∀(u_i, u_j) ∈ p.   (22)


5.1.5 Objective function

The objective of the global file data-block placement problem is to minimize the total cost of file accesses and file data-block movements during the execution of the program:

Min [ Σ_{j=1}^{T} CS(u_j) + Σ_{i=1}^{T} Σ_{j=1}^{T} MV(u_i, u_j) ]   (23)

The ILP generates the optimal execution sequence and the optimal file data-block placement for each region. However, as the size of the program increases, the computation time of the ILP becomes unacceptable.

5.2 The global file data-block placement algorithm

We propose a heuristic algorithm that generates near-optimal solutions efficiently for the global file data-block placement problem. The FDP algorithm is integrated into the GFDP algorithm to improve the quality of the solutions to the global problem.

The basic idea of the GFDP algorithm is to reduce the moving cost of the file data-blocks accessed by consecutive regions in an execution sequence. The GFDP algorithm first determines the execution order of the program regions, then applies the FDP algorithm to each program region to obtain an optimal placement for that region. To reduce the moving cost between two consecutive regions, the total size of the file data-blocks accessed by only one of the two regions should be as small as possible, and the total size of the file data-blocks accessed by both regions should be as large as possible. We use a weighted graph G_1 = (V_1, E_1) to model the file data-block access relationship between regions. The weight of the edge between two nodes is defined as the total size of the common file data-blocks accessed by both regions minus the total size of the file data-blocks accessed by only one of the two regions. Therefore, the weight of an edge e(u_i, u_j) is calculated as follows:

W(u_i, u_j) = Σ_{S_k ∈ u_i ∩ u_j} S_k − Σ_{S_l ∈ u_i ∪ u_j, S_l ∉ u_i ∩ u_j} S_l,   (24)

where u_i ∩ u_j represents the set of common file data-blocks accessed by both region u_i and region u_j, and u_i ∪ u_j represents the set of file data-blocks accessed by either region u_i or region u_j.
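Equation (24) can be evaluated directly with set operations: shared blocks contribute their sizes positively, exclusive blocks negatively. A hypothetical Python sketch (the region block sets and sizes are illustrative):

```python
# Edge weight (24): total size of blocks shared by both regions minus
# total size of blocks accessed by only one of them. Hypothetical
# sketch; each region is modeled as the set of block names it accesses.

def edge_weight(blocks_i, blocks_j, size):
    shared = blocks_i & blocks_j
    exclusive = (blocks_i | blocks_j) - shared
    return sum(size[b] for b in shared) - sum(size[b] for b in exclusive)

size = {"fb1": 2, "fb2": 1, "fb3": 3}
u1 = {"fb1", "fb2"}
u2 = {"fb1", "fb3"}
print(edge_weight(u1, u2, size))  # shared fb1 (2) - exclusive fb2+fb3 (4) = -2
```

A large positive weight thus marks a pair of regions that share most of their working set and are cheap to schedule back to back.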

The execution sequence of program regions with the maximum weight corresponds to a Hamiltonian path p in G_1. The weight of an execution order is the summation of the edge weights along path p, defined as follows:

W(p) = Σ_{e(u_i, u_j) ∈ p} W(u_i, u_j)   (25)

The GFDP algorithm consists of three major steps. In the first step, the program regions are classified into layers; no dependency exists between any two regions in the same layer. In the second step, the execution order within each layer is computed using a greedy strategy, and a solution is obtained by ordering the nodes layer by layer. In the third step, the FDP algorithm determines the placement for each region in that order. The details of the GFDP algorithm are described in Algorithm 5.1.

Step (1) Classify program regions into layers: Given a DAG G = (V, E), the depth Depth(u) of each node is computed iteratively by breadth-first search. Nodes at the same depth k are grouped into layer L_k.


Algorithm 5.1 GFDP algorithm: global file data-block placement algorithm

Input: A DAG G = (V, E), where V represents the set of program regions and E denotes the set of dependency relationships between regions;
a weight matrix W, where W(i, j) represents the weight between node i and node j;
all the input data of Algorithm 4.1.

Output: The file data-block placement for each region.

1: Initialize an array In(|V|) to record the number of incoming edges of each node.
2: Initialize an array Depth[V] to record the depth of each node.
3: Initialize a queue Seq to record the execution order of the nodes.
4: Initialize Depth ← 1, Seq ← ∅.
5: For each node i, calculate In(i) according to the input graph G.
6: Q ← the nodes i with Depth(i) = 1 and In(i) = 0.
7: Depth ← compute the depth of all nodes using breadth-first search.
8: for k ← 1 to max_{u∈V} Depth(u) do
9:   Create a layer L_k of the nodes whose depth equals k
10:  if k == 1 then
11:    tmp ← either endpoint of the maximal-weight edge within layer L_1
12:    Push that node into Seq and delete it from L_1
13:  end if
14:  while |L_k| > 0 do
15:    Find the maximal-weight edge e = (tmp, u_k), where u_k ∈ L_k
16:    tmp ← u_k
17:    Push u_k into Seq and delete it from L_k
18:  end while
19: end for
20: Initialize the locations of the file data-blocks for the first region in Seq.
21: j ← 1
22: while j ≤ |Seq| do
23:   Determine the file data-block placement for the j-th region by running the FDP algorithm
24:   Set the initial locations of the file data-blocks for the (j+1)-th region to the resulting locations for the j-th region
25:   j ← j + 1
26: end while
27: Output the execution order Seq and the file data-block placement for each region.

Step (2) Determine the execution order within each layer: The layers are processed in ascending order of depth. The execution order within each layer is computed by a greedy method. We first pick a node with the maximal-weight outgoing edge in layer L_1. Then, the node reached by the next-largest-weight edge is chosen from the adjacency list of the current node. This procedure is performed iteratively until all the nodes are processed. Finally, we obtain the execution order of the program regions.
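Steps (1) and (2) amount to a depth layering of the DAG followed by a greedy, weight-driven ordering inside each layer. The following hypothetical Python sketch illustrates the idea; the graph, the weight table W, and the tie-breaking for the first node are illustrative assumptions, not the paper's exact Algorithm 5.1:

```python
# Sketch of GFDP Steps (1)-(2): layer the DAG by depth, then order each
# layer greedily by edge weight. Hypothetical, for illustration only.
from collections import deque

def layers_by_depth(nodes, edges):
    """Depth of a node = 1 + max depth of its predecessors (roots have 1)."""
    preds = {u: set() for u in nodes}
    succs = {u: set() for u in nodes}
    for a, b in edges:
        preds[b].add(a)
        succs[a].add(b)
    depth = {u: 1 for u in nodes if not preds[u]}   # roots
    indeg = {u: len(preds[u]) for u in nodes}
    q = deque(depth)
    while q:
        u = q.popleft()
        for v in succs[u]:
            depth[v] = max(depth.get(v, 0), depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    layers = {}
    for u, d in depth.items():
        layers.setdefault(d, []).append(u)
    return [layers[d] for d in sorted(layers)]

def greedy_order(layers, W):
    """Within each layer, repeatedly follow the heaviest edge from the
    last chosen node (weight 0 if no edge is listed in W)."""
    seq = []
    for layer in layers:
        remaining = set(layer)
        while remaining:
            if not seq:   # first node overall: best outgoing weight in layer
                nxt = max(remaining, key=lambda v: max(
                    (W.get((v, u), 0) for u in remaining if u != v), default=0))
            else:
                nxt = max(remaining, key=lambda v: W.get((seq[-1], v), 0))
            seq.append(nxt)
            remaining.discard(nxt)
    return seq

nodes = [1, 2, 3, 4]
edges = [(1, 3), (2, 4)]            # 1 and 2 are roots; 3, 4 depend on them
W = {(1, 2): 5, (2, 3): 1, (3, 4): 2}
order = greedy_order(layers_by_depth(nodes, edges), W)
print(order)  # [1, 2, 3, 4]
```

Because whole layers are emitted in depth order, every region still runs after all of its predecessors, while the greedy choice inside a layer favors consecutive regions with heavy (block-sharing) edges.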

Step (3) Determine the file data-block placement for each region: We compute the file data-block placement of each region in execution order. The placement within a region is determined using the FDP algorithm. The initial locations of all the file data-blocks for the first region are on secondary storage, and the resulting placement of each region serves as the initial locations of the file data-blocks for the next region. The algorithm stops when all the regions have been processed.

In the following, we use the motivational example in Sect. 3 to illustrate how the GFDP algorithm solves the global file data-block placement problem. The program regions and file data-blocks are shown in Fig. 7a, and the access frequencies of the file data-blocks are the same as those in Fig. 6a. In the GFDP algorithm, the execution order is determined by the edge weights; the weight graph is constructed in Fig. 7b. The program starts at node 1. Then, node 3 is chosen, since the weight of the edge between nodes 1 and 3 is 4, which is larger than the weight between nodes 1 and 2. Following the process of the GFDP algorithm, the final


Fig. 7 An example of the global file data-block placement problem. a Global program regions with the file data-blocks they access. b The weight graph model for the regions in (a)

execution order is decided as 1 → 3 → 2 → 4 → 6 → 5, which is the same as the optimal solution for this example. The GFDP algorithm reduces the total file access cost by 53.8% compared with the LCF method. The GFDP algorithm generates a near-optimal solution because it reduces the cost of file data-block movements by exploiting file data-block sharing between consecutive regions.

The time complexity of Step 1 of the GFDP algorithm is O(V + E), since it traverses all the edges of the graph G. Step 2 traverses adjacency lists, and its computation time is O(V E). In Step 3, the FDP algorithm is performed V times to produce the file data-block placements for the V regions, so the computation time of Step 3 is O(V × M × C N). Therefore, the time complexity of the GFDP algorithm is O(V × M × C N + V × E), where M is the number of file data-blocks, C is the maximal capacity of the page caches, and N is the number of page caches on different types of memory.

6 Experiments

In this section, the effectiveness of the proposed technique is evaluated by comparing its experimental results with those of the Low-Cost-First algorithm. First, we present the experimental setup in Sect. 6.1; then, the experimental results are shown in Sect. 6.2.

6.1 Experimental setup

We simulate the process of global file data-block placement and obtain the total cost of a program with a dedicated simulator. The simulated system has one CPU, a hybrid main memory, and one hard disk. The hybrid main memory consists of three types of memory: DRAM, PCM, and FeRAM. Each can be used as page cache space. The capacities of the page caches are limited by the sizes of the corresponding memories. The unit size of a file data-block is one page, which is 4 KB. The size of the DRAM is relatively small compared with the other memory types. The hard disk is large enough to hold all the file data-blocks, and all the file data-blocks are initially placed on the hard disk. The details of the system configuration are shown in Table 5. The costs of the various types of memory are set manually, according to the costs reported in [30]. The read cost refers to the cost of reading one megabyte of file data on the memory


Table 5 System configuration

Architecture | Component and description
CPU | Frequency: 3.2 GHz
Memory 1 | DRAM; Size: 50 MB; Read cost: 4/MB; Write cost: 5/MB
Memory 2 | PCM; Size: 250 MB; Read cost: 6/MB; Write cost: 12/MB
Memory 3 | FeRAM; Size: 250 MB; Read cost: 5/MB; Write cost: 7/MB
Hard disk | Size: 2 GB; Sequential read cost: 400/MB; Write cost: 500/MB

type, and the write cost represents the cost of writing one megabyte of file data on that memory type.
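Since the costs in Table 5 are given per megabyte and a file data-block is one 4 KB page, the per-page cost of each cache type follows by simple scaling (one page is 4/1024 MB). A small sketch (hypothetical Python; the cost table mirrors Table 5):

```python
# Per-page read/write costs derived from the per-MB costs of Table 5,
# with a 4 KB page (4/1024 MB). Illustrative sketch.
PAGE_MB = 4 / 1024  # one page expressed in megabytes

cost_per_mb = {       # (read, write) costs per MB, from Table 5
    "DRAM":  (4, 5),
    "PCM":   (6, 12),
    "FeRAM": (5, 7),
    "disk":  (400, 500),
}

def page_cost(mem, op):
    read, write = cost_per_mb[mem]
    return (read if op == "read" else write) * PAGE_MB

# Reading a page from disk vs. from the DRAM page cache:
print(page_cost("disk", "read") / page_cost("DRAM", "read"))  # 100.0
```

The two-orders-of-magnitude gap between disk and memory reads is what makes the placement of frequently accessed blocks in page cache so profitable.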

The costs in our simulated system can be interpreted as energy consumption, bandwidth, etc., depending on the optimization goal. The costs of read/write operations on the hard disk are set substantially higher than those on the memories in our simulator. Therefore, the file data-block placements generated by the various techniques are not affected by the exact cost values for the hard disk. For example, the costs of small random read/write operations are significantly higher than those of sequential read/write operations on a hard disk; this difference, however, does not affect the resulting data-block placement.

Once a data-block placement is generated, all the data assigned to the page cache are accessed in memory. Data originally located on the hard disk can be brought into memory at the beginning of a program region, provided that there is enough space to hold the requested data-blocks and depending on the access frequency of each data-block.

The experiments are conducted with a set of benchmarks selected from the MiBench benchmark suite [10], including mad, jpeg, and adpcm, which contain frequent I/O operations. The benchmarks are divided into several program regions using a method similar to that in [33]. Each region executes several file I/O operations. The number of regions of each benchmark is shown in Table 6. The file read/write operations in these benchmarks are replaced by the mmap() system call. The input data, including the read/write frequencies and the sizes of the file data-blocks, are obtained through profiling.

6.2 Experimental results

All the benchmarks are processed with three file data-block placement techniques: (1) the Low-Cost-First (LCF) approach, a greedy approach that allocates each file data-block to the location with the minimum local cost; (2) the GFDP approach, which first classifies the program regions into multiple layers, determines the execution order of the program regions with a greedy strategy, and

Table 6 Number of regions in each benchmark

Benchmark | Number of regions
mad | 10
minimad | 8
adpcm | 11
rijndael | 12
sha | 10
ispell | 8
jpeg | 18
bitcount | 9
qsort | 7


Fig. 8 Costs of total file data accesses on hybrid main memory with multiple types of page caches, generated by the three techniques

then applies the FDP algorithm to determine the file data-block placement for each region; and (3) the ILP method.

Figure 8 shows the experimental results for memory cost. The cost in the "LCF" column represents the memory cost produced by the LCF algorithm. The cost in the "GFDP" column is the file data-block access cost of the GFDP algorithm, which shows a large reduction compared with the LCF approach. The "Imprv(L)" column under "GFDP" displays the percentage of cost improvement over the LCF algorithm, and the "Imprv(G)" column under "ILP" shows the cost reduction of the ILP method compared with the GFDP algorithm. According to Fig. 8, the GFDP algorithm performs much better than the LCF algorithm, especially on benchmarks with frequent I/O operations such as adpcm and jpeg. The GFDP algorithm achieves a cost reduction of 51.3% on average over the LCF algorithm across all benchmarks. The ILP method, which always produces the optimal solution, achieves a further cost reduction of 35.8% on average over the GFDP algorithm; the reduction is at least 5.5%, obtained for the benchmark "jpeg". However, the ILP cannot find a solution within reasonable time for benchmarks with large inputs, whereas the proposed algorithm achieves near-optimal solutions efficiently.

The cost reduction of the proposed technique comes from two major factors. One is the different access costs on the various memory types under different data-block placements. The other is the cost of data movement among different storage locations between program regions. The cost of context switching, even though it consumes many clock cycles and much energy from the system's point of view, is not considered a direct cost of file I/O in this paper. However, a carefully designed data-block placement technique can effectively reduce the occurrences of context switching.

7 Related works

Several emerging non-volatile memory technologies have been proposed for their advantages, such as low cost, shock resistance, high density, and low leakage power; however, they also have shortcomings, such as limited write endurance. Much research has been done to exploit their advantages and alleviate their shortcomings. Wang et al. proposed a novel physical-location-aware address mapping strategy for 3-D NAND flash memory, which can effectively enhance the reliability of NAND flash memory, in [35]. Liu et al. presented a write-activity-aware NAND flash memory management scheme, called PCM-FTL, to effectively enhance the endurance of PCM-based embedded systems in [22]. Hu et al. introduced scheduling, data migration, and recomputation techniques to reduce the number of write activities on NVMs in embedded chip multiprocessors (CMPs) with scratchpad memory (SPM) and non-volatile main memory in [16]. Hu et al. proposed an ILP formulation and a heuristic algorithm, Concatenation Scheduling, to schedule tasks in order to minimize the main memory access time and extend the lifetime of the NVM in [15]. Liu et al. proposed an application-specific wear-leveling technique, called Curling-PCM, to evenly distribute write activities across the PCM chip in order to improve the endurance of PCM in [21]. However, these works focus on optimization techniques for a single type of non-volatile memory; hybrid main memories combining the advantages of several memory types were not considered.

Hybrid main memory has been extensively studied in recent years because of its efficiency and flexibility in power management, and several hybrid main memory models have been proposed. Qureshi et al. proposed a PCM-based hybrid main memory system at the architecture level in [29]. Liu et al. proposed a variable partitioning technique for a hybrid main memory composed of DRAM and PRAM for DSPs in [23], reducing energy consumption compared with pure DRAM. Dhiman et al. proposed a hybrid main memory composed of PDRAM and DRAM in [8]. Park et al. proposed a power management strategy for a hybrid DRAM/PRAM-based main memory in [28]. Hu et al. proposed a software-enabled wear-leveling technique for hybrid PCM+DRAM main memory on embedded systems, to further extend the PCM's lifetime by distributing writes more evenly over the PCM, in [14]. Data accesses to files on secondary storage can be greatly expedited using page cache in main memory [11,19,27]. The existing techniques, however, do not consider the influence of data placement on hybrid memory, even though data placement directly impacts the latency and energy consumption of file accesses.

Researchers have studied several data placement techniques to reduce memory access cost and improve application performance. These techniques fall into two categories: static data placement and dynamic data placement. With a static technique, the locations of data are determined before execution. Panda et al. presented a static technique for scalar and array variables in [26]. Avissar et al. presented a static data partitioning approach for global and stack data in [2]. With a dynamic strategy, a program is divided into several regions; each region is a basic block, such as a loop, and data allocation instructions are inserted by the compiler before the execution of each region. Udayakumaran et al. proposed a greedy data placement for each region in [34]. Ozturk et al. presented a data placement approach for loop-intensive applications with regular data accesses in [25]. Chen et al. and Absar et al. proposed dynamic data placement techniques for irregular array access patterns in [1,7]. Hu et al. proposed an optimal dynamic placement strategy for array variables in [13]. Guo et al. proposed a dynamic global data placement algorithm in [36]. Static techniques cannot adapt data placement to the runtime environment; the trend of research is toward dynamic data placement techniques, which are more flexible because they reallocate data for each program region. Existing techniques, however, focus on variables with fixed sizes and are not suitable for placing file data of various sizes. The file data-block placement problem has not been studied previously. This paper focuses on the placement of file data-blocks loaded into main memory from secondary storage.
