An Analysis of Data Corruption in the Storage Stack
Lakshmi N. Bairavasundaram*, Garth R. Goodson†, Bianca Schroeder‡
Andrea C. Arpaci-Dusseau*, Remzi H. Arpaci-Dusseau*
*University of Wisconsin-Madison   †Network Appliance, Inc.   ‡University of Toronto
{laksh,dusseau,remzi}@cs.wisc.edu, garth.goodson@netapp.com, bianca@cs.toronto.edu
Abstract
An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this paper, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches since they occur the most.

We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances, including: (i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise class disk drives, (ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and (iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.
1 Introduction
One of the biggest challenges in designing storage systems is providing the reliability and availability that users expect. Once their data is stored, users expect it to be persistent forever, and perpetually available. Unfortunately, in practice there are a number of problems that, if not dealt with, can cause data loss in storage systems.

One primary cause of data loss is disk drive unreliability [16]. It is well-known that hard drives are mechanical, moving devices that can suffer from mechanical problems leading to drive failure and data loss. For example, media imperfections, and loose particles causing scratches, contribute to media errors, referred to as latent sector errors, within disk drives [18]. Latent sector errors are detected by a drive's internal error-correcting codes (ECC) and are reported to the storage system. Less well-known, however, is that current hard drives and controllers consist of hundreds-of-thousands of lines of low-level firmware code. This firmware code, along with higher-level system software, has the potential for harboring bugs that can cause a more insidious type of disk error: silent data corruption, where the data is silently corrupted with no indication from the drive that an error has occurred.

Silent data corruptions could lead to data loss more often than latent sector errors, since, unlike latent sector errors, they cannot be detected or repaired by the disk drive itself. Detecting and recovering from data corruption requires protection techniques beyond those provided by the disk drive. In fact, basic protection schemes such as RAID [13] may also be unable to detect these problems. The most common technique used in storage systems to detect data corruption is for the storage system to add its own higher-level checksum for each disk block, which is validated on each disk block read. There is a long history of enterprise-class storage systems, including ours, in using checksums in a variety of manners to detect data corruption [3, 6, 8, 22]. However, as we discuss later, checksums do not protect against all forms of corruption. Therefore, in addition to checksums, our storage system also uses file system-level disk block identity information to detect previously undetectable corruptions.
In order to further improve on techniques to handle corruption, we need to develop a thorough understanding of data corruption characteristics. While recent studies provide information on whole disk failures [11, 14, 16] and latent sector errors [2] that can aid system designers in handling these error conditions, very little is known about data corruption, its prevalence and its characteristics. This paper presents a large-scale study of silent data corruption based on field data from 1.53 million disk drives covering a time period of 41 months. We use the same data set as the one used in recent studies of latent sector errors [2] and disk failures [11]. We identify the fraction of disks that develop corruption, examine factors that might affect the prevalence of corruption, such as disk class and age, and study characteristics of corruption, such as spatial and temporal locality. To the best of our knowledge, this is the first study of silent data corruption in production and development systems.
We classify data corruption into three categories based on how it is discovered: checksum mismatches, identity discrepancies, and parity inconsistencies (described in detail in Section 2.3). We focus on checksum mismatches since they are found to occur the most. Our important observations include the following:

(i) During the 41-month time period, we observe more than 400,000 instances of checksum mismatches, 8% of which were discovered during RAID reconstruction, creating the possibility of real data loss. Even though the rate of corruption is small, the discovery of checksum mismatches during reconstruction illustrates that data corruption is a real problem that needs to be taken into account by storage system designers.

(ii) We find that nearline (SATA) disks and their adapters develop checksum mismatches an order of magnitude more often than enterprise class (FC) disks. Surprisingly, enterprise class disks with checksum mismatches develop more of them than nearline disks with mismatches.

(iii) Checksum mismatches are not independent occurrences, both within a disk and across different disks in the same storage system.

(iv) Checksum mismatches have tremendous spatial locality; on disks with multiple mismatches, it is often consecutive blocks that are affected.

(v) Identity discrepancies and parity inconsistencies do occur, but affect 3 to 10 times fewer disks than checksum mismatches affect.

The rest of the paper is structured as follows. Section 2 presents the overall architecture of the storage systems used for the study, and Section 3 discusses the methodology used. Section 4 presents the results of our analysis of checksum mismatches, and Section 5 presents the results for identity discrepancies and parity inconsistencies. Section 6 provides an anecdotal discussion of corruption, developing insights for corruption-proof storage system design. Section 7 presents related work and Section 8 provides a summary of the paper.
2 Storage System Architecture
The data we analyze is from tens-of-thousands of production and development Network Appliance™ storage systems (henceforth called the system) installed at hundreds of customer sites. This section describes the architecture of the system, its corruption detection mechanisms, and the classes of corruptions in our study.
2.1 Storage Stack
Physically, the system is composed of a storage-controller that contains the CPU, memory, network interfaces, and storage adapters. The storage-controller is connected to a set of disk shelves via Fibre Channel loops. The disk shelves house individual disk drives. The disks may either be enterprise class FC disk drives or nearline serial ATA (SATA) disks. Nearline drives use hardware adapters to convert the SATA interface to the Fibre Channel protocol. Thus, the storage-controller views all drives as being Fibre Channel (however, for the purposes of the study, we can still identify whether a drive is SATA or FC using its model type).

The software stack on the storage-controller is composed of the WAFL® file system, RAID, and storage layers. The file system processes client requests by issuing read and write operations to the RAID layer, which transforms the file system requests into logical disk block requests and issues them to the storage layer. The RAID layer also generates parity for writes and reconstructs data after failures. The storage layer is a set of customized device drivers that communicate with physical disks using the SCSI command set [23].
2.2 Corruption Detection Mechanisms
The system, like other commercial storage systems, is designed to handle a wide range of disk-related errors. The data integrity checks in place are designed to detect and recover from corruption errors so that they are not propagated to the user. The system does not knowingly propagate corrupt data to the user under any circumstance.

We focus on techniques used to detect silent data corruption, that is, corruptions not detected by the disk drive or any other hardware component. Therefore, we do not describe techniques used for other errors, such as transport corruptions reported as SCSI transport errors or latent sector errors. Latent sector errors are caused by physical problems within the disk drive, such as media scratches, "high-fly" writes, etc. [2, 18], and are detected by the disk drive itself through its inability to read or write sectors, or through its error-correction codes (ECC).

In order to detect silent data corruptions, the system stores extra information with disk blocks. It also periodically reads all disk blocks to perform data integrity checks. We now describe these techniques in detail.
Corruption Class       Possible Causes                                         Detection Operation
Checksum mismatch      Bit-level corruption; torn write; misdirected write     Any disk read
Identity discrepancy   Lost write; misdirected write                           File system read (file system-level block identity check)
Parity inconsistency   Memory corruption; lost write; bad parity calculation   Data scrub

Table 1: Corruption classes summary.
Figure 1: Data Integrity Segment. The figure shows the different on-disk formats used to store the data integrity segment of a disk block on (a) enterprise class drives with 520B sectors, and on (b) nearline drives with 512B sectors. The figure also shows (c) the structure of the 64-byte data integrity segment: in addition to the checksum and identity information for the data block, this structure also contains a checksum of itself.
2.2.1 Data Integrity Segment

In order to detect disk block corruptions, the system writes a 64-byte data integrity segment along with each disk block. Figure 1 shows two techniques for storing this extra information, and also describes its structure. For enterprise class disks, the system uses 520-byte sectors. Thus, a 4-KB file system block is stored along with 64 bytes of data integrity segment in eight 520-byte sectors. For nearline disks, the system uses the default 512-byte sectors and stores the data integrity segment for each set of eight sectors in the following sector. We find that the protection offered by the data integrity segment is well worth the extra space needed to store it.
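The sector arithmetic above can be checked directly; a minimal sketch (the constant names are illustrative, not from the actual system):

```python
# Sector-format arithmetic for storing a 4-KB file system block plus
# its 64-byte data integrity segment (DIS); names are illustrative.
FS_BLOCK = 4096   # 4-KB file system block
DIS_SIZE = 64     # data integrity segment

# Enterprise class drives use 520-byte sectors: the block and its DIS
# fit exactly in eight sectors.
assert 8 * 520 == FS_BLOCK + DIS_SIZE   # 4160 bytes

# Nearline drives use standard 512-byte sectors: eight data sectors
# are followed by a ninth sector holding the DIS for that set.
assert 8 * 512 == FS_BLOCK
print(f"nearline sectors per block: {8 + 1}")
print(f"nearline space overhead:    {1/9:.1%}")   # 11.1%
```

The trailing-sector scheme on nearline drives thus costs about one sector in nine, which is the extra space the text judges to be well worth the protection.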
One component of the data integrity segment is a checksum of the entire 4-KB file system block. The checksum is validated by the RAID layer whenever the data is read. Once a corruption has been detected, the original block can usually be restored through RAID reconstruction. We refer to corruptions detected by RAID-level checksum validation as checksum mismatches.

A second component of the data integrity segment is block identity information. In this case, the fact that the file system is part of the storage system is utilized. The identity is the disk block's identity within the file system (e.g., this block belongs to inode 5 at offset 100). This identity is cross-checked at file read time to ensure that the block being read belongs to the file being accessed. If, on file read, the identity does not match, the data is reconstructed from parity. We refer to corruptions that are not detected by checksums, but are detected through file system identity validation, as identity discrepancies.
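The two-level read validation described above (RAID-layer checksum check, then file-system-layer identity check) can be sketched as follows. The structures, names, and the use of CRC32 are illustrative assumptions; the paper does not specify the actual checksum algorithm or code:

```python
# Sketch of the two-level read validation: RAID-layer checksum check,
# then file-system-layer identity check. Structures, names, and the
# CRC32 checksum are assumptions for illustration only.
import zlib
from dataclasses import dataclass

@dataclass
class DataIntegritySegment:
    checksum: int   # checksum of the 4-KB data block
    inode: int      # identity: owning inode...
    offset: int     # ...and block offset within that file

def read_block(data, dis, expect_inode, expect_offset):
    # RAID layer: a failed comparison here is a checksum mismatch and
    # triggers reconstruction of the block through RAID.
    if zlib.crc32(data) != dis.checksum:
        raise IOError("checksum mismatch: reconstruct from RAID")
    # File system layer: a failed comparison here is an identity
    # discrepancy (e.g., a lost or misdirected write); the data is
    # reconstructed from parity.
    if (dis.inode, dis.offset) != (expect_inode, expect_offset):
        raise IOError("identity discrepancy: reconstruct from parity")
    return data

# A block recorded as belonging to inode 5 at offset 100 (the paper's
# example) passes validation only when read as part of that file.
block = b"\x00" * 4096
dis = DataIntegritySegment(zlib.crc32(block), inode=5, offset=100)
assert read_block(block, dis, 5, 100) == block
```

Note how a misdirected or lost write slips past the checksum (the stale block is internally consistent) but fails the identity comparison, which is exactly why the system layers both checks.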
2.2.2 Data Scrubbing

In order to proactively detect errors, the RAID layer periodically scrubs all disks. A data scrub issues read operations for each physical disk block, computes a checksum over its data, and compares the computed checksum to the checksum located in its data integrity segment. If the checksum comparison fails (i.e., a checksum mismatch), the data is reconstructed from other disks in the RAID group, after those checksums are also verified. If no reconstruction is necessary, the parity of the data blocks is generated and compared with the parity stored in the parity block. If the parity does not match the verified data, the scrub process fixes the parity by regenerating it from the data blocks. In a system protected by double parity, it is possible to definitively tell which of the parity or data blocks is corrupt.

We refer to these cases of mismatch between data and parity as parity inconsistencies. Note that data scrubs are unable to validate the extra file system identity information stored in the data integrity segment, since, by its nature, this information only has meaning to the file system and not the RAID-level scrub. Depending on system load, data scrubs are initiated on Sunday evenings. From our data, we find that an entire RAID group is scrubbed approximately once every two weeks on average. However, we cannot ascertain from the data that every disk in the study has been scrubbed.
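The scrub steps above can be sketched for a single-parity (XOR) stripe. This is a simplified illustration under stated assumptions (tiny blocks, CRC32 checksums, hypothetical function names), not the system's actual RAID code:

```python
# Sketch of a RAID data scrub over one single-parity (XOR) stripe,
# following the steps in the text. Block sizes, the CRC32 checksum,
# and all names are illustrative assumptions.
import zlib

def xor_blocks(blocks):
    """XOR a list of equal-sized blocks together."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

def scrub_stripe(data_blocks, checksums, parity):
    # Step 1: validate each block against the checksum in its data
    # integrity segment; on a mismatch, reconstruct the block from the
    # other (verified) disks in the RAID group.
    for i, (blk, ck) in enumerate(zip(data_blocks, checksums)):
        if zlib.crc32(blk) != ck:
            others = [b for j, b in enumerate(data_blocks) if j != i]
            data_blocks[i] = xor_blocks(others + [parity])
    # Step 2: regenerate parity from the verified data blocks and
    # compare with the stored parity; fix the stored parity on a
    # mismatch (a "parity inconsistency").
    expected = xor_blocks(data_blocks)
    if expected != parity:
        parity = expected
    return data_blocks, parity

# A 3-disk stripe: a corrupted block is reconstructed from the others.
data = [bytes([i + 1]) * 8 for i in range(3)]
cksums = [zlib.crc32(b) for b in data]
parity = xor_blocks(data)
damaged = [b"\x99" * 8] + data[1:]
fixed, parity_out = scrub_stripe(damaged, cksums, parity)
assert fixed == data and parity_out == parity
```

With single parity, as the sketch shows, a checksum mismatch pinpoints the bad data block; only the double-parity configuration mentioned in the text can also disambiguate a corrupt parity block from a corrupt data block.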
2.3 Corruption Classes

This study focuses on disk block corruptions caused by both hardware and software errors. Hardware bugs include bugs in the disk drive or the disk shelf firmware, bad memory, and adapter failures. Software bugs could also cause some corruption. In many cases, the cause of corruption cannot be identified. We detect different forms of corruption using the different data protection mechanisms in place. As mentioned earlier, we distinguish between these forms in our study. Table 1 gives a summary of these corruption classes.

• Checksum mismatches (CMs): This corruption class refers to cases where the corruption is detected from mismatched data and checksum. The cause could be (i) data content corrupted by components within the data path, or (ii) a torn write, wherein only a portion of the data block is written successfully, or (iii) a misdirected write, wherein the data is written to either the wrong disk or the wrong location on disk, thus overwriting and corrupting data [3, 15]. Checksum mismatches can be detected any time a disk block is read (file system reads, data scrubs, RAID reconstruction, and so on).

• Identity discrepancies (IDs): This corruption class refers to a mismatch detected when a disk block identity check is performed during a file system read. The cause could be (i) a lost write, which typically occurs because a write destined for disk is not written but thought of as written, or (ii) a misdirected write, where the original disk location is not updated. We are aware of actual cases when the disk firmware replied successfully to a write that was never written to stable media. Identity discrepancies can be detected only during file system reads.

• Parity inconsistencies (PIs): This corruption class refers to a mismatch between the parity computed from data blocks and the parity stored on disk despite the individual checksums being valid. This error could be caused by lost or misdirected writes, in-memory corruptions, processor miscalculations, and software bugs. Parity inconsistencies are detected only during data scrubs.

Our study primarily focuses on checksum mismatches, since we find that these corruptions occur much more frequently.
3 Methodology

This section describes some terminology, our data collection and analysis methodology, and the notation used to discuss our results.

3.1 Terminology

We use the following terms in the remaining sections.

Disk class: Enterprise class or nearline disk drives, with Fibre Channel and ATA interfaces respectively.

Disk family: A particular disk drive product. The same product (and hence a disk family) may be offered in different capacities. Typically, disks in the same family differ only in the number of platters and/or read/write heads [17].

Disk model: The combination of a disk family and a particular disk size. Note that this term does not imply an analytical or simulation model.

Disk age: The amount of time a disk has been in the field since its ship date, rather than the manufacture date. In practice, these two values are typically within a month of each other.

Corrupt block: A 4-KB file system block with a checksum mismatch.

Corrupt disk: A disk drive that has at least one corrupt block.
3.2 Data Collection and Analysis

We now describe our data collection and analysis methodology and some limitations.

Data collection: The storage system has a built-in, low-overhead mechanism called Autosupport to log important system events back to a central repository. These messages can be enabled for a variety of system events including disk errors. Not all customers enable logging, although a large percentage do. Those that do sometimes do so only after some period of initial use. These logs allow customized support based on observed events. Although these logs are primarily intended for support, they have also been utilized for analyzing various disk errors. In addition to our corruption study, this repository (the "Network Appliance Autosupport Database") has been used in disk failure [11] and latent sector error [2] studies.

Analysis: We study corruption instances that were logged in tens of thousands of storage systems for a period of 41 months starting in January 2004. These systems belong to a range of different models, run different versions of storage-controller software (perhaps with one or more updates during the study period), and contain many different models or versions of hardware components. In order to have a complete history of the activities of the disks used in the study, we constrain our sample to only those disks that were shipped after January 2004. Our sample consists of 1.53 million disk drives. These drives belong to 14 disk families and 31 distinct models. To derive statistically significant results, we often further constrain the sample set depending on the analysis being performed. For example, we sometimes use shorter time periods for our analysis so as to maximize the number of models we can study; clearly, not all disk families and models have been in the field for the same duration. The disk models we consider for each study may have one of the following constraints:

• Model has at least 1000 disks in the field for the time period being considered.

• Model has at least 1000 disks in the field and at least 15 corrupt disks for the time period being considered.

The first constraint is used for studies of factors that impact checksum mismatches, while other studies use the second constraint. In addition to the constraints on the model sample, we often restrict our data to include only the first 17 months since a drive was shipped. This helps make results more comparable, since many of the drives in the study were shipped on different dates and have been in the field for different amounts of time. While we usually present data for individual disk models, we sometimes also report averages (mean values) for nearline disks and enterprise class disks. Since the sample size for different disk models per disk class varies considerably, we weight the average by the sample size of each disk model in the respective class.

Limitations: The study has a few limitations that mostly stem from the data collection process. First, for a variety of reasons, disks may be removed from the system. Our study includes those disks up to the point of their removal from the system. Therefore, we may not observe errors from otherwise error-prone disks after some period of time. Second, since the logging infrastructure has been built with customized support as the primary purpose, the data can be used to answer most but not all questions that are interesting for a study such as ours. For example, while we can identify the exact disk when an error is detected during a scrub, we cannot verify that every disk in the study has been scrubbed periodically in the absence of errors.
3.3 Notation

We denote each disk drive model as family-type. For anonymization purposes, family is a single letter representing the disk family (e.g., Quantum Fireball EX) and type is a single number representing the disk's particular capacity. Although capacities are anonymized, relative sizes within a family are ordered by the number representing the capacity. For example, n-2 is larger than n-1, and n-3 is larger than both n-1 and n-2. The anonymized capacities do not allow comparisons across disk families. Disk families from A to E (upper case letters) are nearline disk families, while families from f to o (lower case letters) are enterprise class disk families. Lines on graphs labeled NL and ES represent the weighted average for nearline and enterprise class disk models respectively.

We present data as the probability of developing x checksum mismatches for a particular sample of disks. The notation P(X_T ≥ L) denotes the probability of a disk developing at least L checksum mismatches within T months since the disk's first use in the field.
4 Checksum Mismatches

This section presents the results of our analysis of checksum mismatches. We first provide basic statistics on the occurrence of checksum mismatches in the entire population of disk drives. We then examine various factors that affect the probability of developing checksum mismatches. Next, we analyze various characteristics of checksum mismatches, such as spatial locality. Further, we look for correlations between the occurrence of checksum mismatches and other system or disk errors. Finally, we analyze the source of the disk requests that discovered the mismatches.

4.1 Summary Statistics

During the 41-month period covered by our data, we observe a total of about 400,000 checksum mismatches. Of the total sample of 1.53 million disks, 3855 disks developed checksum mismatches: 3088 of the 358,000 nearline disks (0.86%) and 767 of the 1.17 million enterprise class disks (0.065%). Using our probability representation, P(X_t ≥ 1) = 0.0086 for nearline disks, and P(X_t ≥ 1) = 0.00065 for enterprise class disks, without any restriction on time t. This indicates that nearline disks may be more susceptible to corruption leading to checksum mismatches than enterprise class disks. On average, each disk developed 0.26 checksum mismatches. Considering only corrupt disks, that is, disks that experienced at least one checksum mismatch, the mean number of mismatches per disk is 104, the median is 3, and the mode (i.e., the most frequently observed value) is 1 mismatch per disk. The maximum number of mismatches observed for any single drive is 33,000.
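As a quick sanity check, the headline probabilities above follow directly from the reported counts:

```python
# Sanity-checking the summary statistics from the counts in the text.
nearline_corrupt, nearline_total = 3088, 358_000
enterprise_corrupt, enterprise_total = 767, 1_170_000

p_nl = nearline_corrupt / nearline_total       # P(X_t >= 1), nearline
p_es = enterprise_corrupt / enterprise_total   # P(X_t >= 1), enterprise

print(f"nearline:   {p_nl:.4f}")          # 0.0086
print(f"enterprise: {p_es:.5f}")          # 0.00066 (the text truncates to 0.065%)
print(f"ratio:      {p_nl / p_es:.0f}x")  # 13x, i.e., about an order of magnitude

# Mismatches per disk, averaged over the whole population:
print(f"per-disk mean: {400_000 / 1_530_000:.2f}")   # 0.26
```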
4.2 Factors

We examine the dependence of checksum mismatches on various factors: disk class, disk model, disk age, disk size, and workload.

4.2.1 Disk Class, Model and Age

Figures 2 and 3 show the probability of a disk developing checksum mismatches as it ages, for nearline and enterprise class disks respectively. The graphs plot the cumulative distribution function of the time until the first
Figure 2: Impact of disk age on nearline disks. The probability that a disk develops checksum mismatches as it ages is shown for nearline disk models. Note that the probability is cumulative.
checksum mismatch occurs. The figures can be represented as P(X_t ≥ 1) for t = {3, 6, 9, 12, 15, 17} months, i.e., the probability of at least one checksum mismatch after t months. Note the different y-axis scale for the nearline and enterprise class disks.

We see from the figures that checksum mismatches depend on disk class, disk model, and disk age.

Observation 1: Nearline disks (including the SATA/FC adapter) have an order of magnitude higher probability of developing checksum mismatches than enterprise class disks.

Figure 2 (line 'NL', the nearline average) shows that 0.66% of nearline disks develop at least one mismatch during the first 17 months in the field (P(X_17 ≥ 1) = 0.0066), while Figure 3(b) (line 'ES') indicates that only 0.06% of enterprise class disks develop a mismatch during that time (P(X_17 ≥ 1) = 0.0006).

Observation 2: The probability of developing checksum mismatches varies significantly across different disk models within the same disk class.

We see in Figure 2 that there is an order of magnitude difference between models 'C-1' and 'E-2' for developing at least one checksum mismatch after 17 months; i.e., P(X_17 ≥ 1) = 0.035 for 'C-1' and 0.0027 for 'E-2'.

Observation 3: Age affects different disk models differently with respect to the probability of developing checksum mismatches.

On average, as nearline disks age, the probability of developing a checksum mismatch is fairly constant, with some variation across the models. As enterprise class disks age, the probability of developing the first checksum mismatch decreases after about 6-9 months and then stabilizes.
4.2.2 Disk Size

Observation 4: There is no clear indication that disk size affects the probability of developing checksum mismatches.

Figure 4 presents the fraction of disks that develop checksum mismatches within 17 months of their ship date (i.e., the rightmost data points from Figures 2 and 3; P(X_17 ≥ 1)). The disk models are grouped within their families in increasing size. Since the impact of disk size on the fraction of disks that develop checksum mismatches is not constant across all disk families (it occurs in only 7 out of 10 families), we conclude that disk size does not necessarily impact the probability of developing checksum mismatches.

4.2.3 Workload

Observation 5: There is no clear indication that workload affects the probability of developing checksum mismatches.

The systems in the study collect coarse workload data, including the number of read and write operations, and the number of blocks read and written, for each week of our study. To study the effect of workload on checksum mismatches, we computed the correlation coefficient between the workload data and the number of checksum mismatches observed in the system.

We find that in all cases the correlation coefficient is less than 0.1 (in fact, in most cases less than 0.001), indicating no significant correlation between workload and checksum mismatches. However, these results might be due to having only coarse per-system rather than per-drive workload data. A system consists of at least 14 disks and can have as many as several hundred disks. Aggregating data across a number of disks might blur existing correlations between an individual drive's workload and corruption behavior.
4.3 Characteristics

In this subsection, we explore various characteristics of checksum mismatches. First, we analyze the number of mismatches developed by corrupt disks. Then, we examine whether mismatches are independent occurrences. Finally, we examine whether the mismatches have spatial or temporal locality.

4.3.1 Checksum mismatches per corrupt disk

Figure 5 shows the cumulative distribution function of the number of checksum mismatches observed per corrupt disk, i.e., the y-axis shows the fraction of corrupt
Figure 3: Impact of disk age on enterprise class disks. The probability that a disk develops checksum mismatches as it ages is shown for enterprise class disk models, in two panels: (a) Enterprise (set 1) and (b) Enterprise (set 2).
Figure 4: The impact of disk size. The figures show the fraction of disks with at least one checksum mismatch within 17 months of shipping to the field for (a) nearline disk models, and (b) enterprise class disk models.
Figure 5: Checksum mismatches per corrupt disk. The fraction of corrupt disks as a function of the number of checksum mismatches that develop within 17 months after the ship date, for (a) nearline disk models and (b) enterprise class disk models. Note that the x-axis is not linear in scale; the lines in the graph are used only to help distinguish the points of different disk models, and their slopes are not meaningful.
disks that have fewer than or equal to X corrupt blocks. The figure can be represented as P(X_17 ≤ x | X_17 ≥ 1) for x = {1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 1000}.

Observation 6: The number of checksum mismatches per corrupt disk varies greatly across disks. Most corrupt disks develop only a few mismatches each. However, a few disks develop a large number of mismatches.

Figure 5 shows that a significant fraction of corrupt disks (more than a third of all corrupt nearline disks and more than a fifth of corrupt enterprise disks) develop only one checksum mismatch. On the other hand, a small fraction of disks develop several thousand checksum mismatches. The large variability in the number of mismatches per drive is also reflected in the great difference between the mean and median: while the median is only 3 mismatches per drive, the mean is 78.

A more detailed analysis reveals that the distributions exhibit heavy tails. A large fraction of the total number of checksum mismatches observed in our study is experienced by a very small fraction of the corrupt disks. More precisely, 1% of the corrupt disks (the top 1% of corrupt disks with the largest number of mismatches) produce more than half of all mismatches recorded in the data.

Observation 7: On average, corrupt enterprise class disks develop many more checksum mismatches than corrupt nearline disks.

Figure 5(a) (line 'NL') and Figure 5(b) (line 'ES') show that within 17 months, 50% of corrupt disks (i.e., the median) develop about 2 checksum mismatches for nearline disks, but almost 10 for enterprise class disks. The trend also extends to a higher percentage of corrupt disks. For example, 80% of nearline corrupt disks have fewer than 20 mismatches, whereas 80% of enterprise class disks have fewer than 100 mismatches. Given that very few enterprise class disks develop checksum mismatches in the first place, in the interest of reliability and availability, it might make sense to replace the enterprise class disk when the first mismatch is detected.

Observation 8: Checksum mismatches within the same disk are not independent.

We find that the conditional probability of developing further checksum mismatches, given that a disk has at least one mismatch, is higher than the probability of developing the first mismatch. For example, while the probability that a nearline disk will develop one or more checksum mismatches in 17 months is only 0.0066, the conditional probability of developing more than 1 mismatch, given that the disk already has one mismatch, is as high as 0.6 (1 minus 0.4, the probability of exactly 1 block developing a checksum mismatch in Figure 5).
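This conditional-probability argument can be made explicit with the numbers from the text:

```python
# The conditional-probability argument, using the numbers in the text.
p_first = 0.0066       # P(X_17 >= 1) for nearline disks (Figure 2, line 'NL')
p_exactly_one = 0.4    # fraction of corrupt disks with exactly 1 CM (Figure 5)

# Given at least one mismatch, the probability of developing more:
p_more_given_one = 1 - p_exactly_one
print(p_more_given_one)                     # 0.6

# Roughly two orders of magnitude above the unconditional probability
# of the first mismatch, so mismatches are clearly not independent.
print(round(p_more_given_one / p_first))    # 91
```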
Finally, it is interesting to note that nearline disk model 'E-1' is particularly aberrant: around 30% of its corrupt disks develop more than 1000 checksum mismatches. We are currently investigating this anomaly.
4.3.2 Dependence between disks in same system

Observation 9: The probability of a disk developing a checksum mismatch is not independent of that of other disks in the same storage system.

While most systems with checksum mismatches have only one corrupt disk, we do find a considerable number of instances where multiple disks develop checksum mismatches within the same storage system. In fact, one of the systems in the study that used nearline disks had 92 disks develop checksum mismatches. Taking the maximum number of disks in the systems in the study into consideration, the probability of 92 disks developing errors independently is less than 1e-12, much less than 1e-05, the approximate fraction of systems represented by one system. This dependence is perhaps indicative of a common corruption-causing component, such as a shelf controller or adapter. We are aware of such components causing corruptions.
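The independence calculation above can be reproduced as a binomial tail. The system size n below is an illustrative assumption (the text only states that systems hold up to several hundred disks); the per-disk probability is the nearline figure from Section 4.1:

```python
# Binomial tail: probability that at least 92 of n disks develop
# checksum mismatches independently. n = 336 is a hypothetical system
# size; p is the nearline per-disk probability from Section 4.1.
import math

n, k_min, p = 336, 92, 0.0086

tail = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
           for k in range(k_min, n + 1))
print(f"P(X >= {k_min}) = {tail:.3g}")
assert tail < 1e-12   # consistent with the bound stated in the text
```

Under independence, the expected number of corrupt disks in such a system would be only n * p ≈ 3, so observing 92 is astronomically unlikely unless a shared component is at fault.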
4.3.3 Spatial Locality