Data swapping Variations on a theme by dalenius and reiss
- 格式:pdf
- 大小:156.80 KB
- 文档页数:16
sequ词根-回复Seq stands for "sequ" root, meaning "to follow" or "sequence". In this article, we will delve into the concept of sequence and explore its various applications in different fields such as mathematics, genetics, and computer science. Step by step, we will decipher the significance of sequence and its role in understanding patterns and processes. By the end of this article, you will have a comprehensive understanding of the sequ root and its implications.The concept of sequence is deeply embedded in the fabric of our everyday lives. We encounter sequences in different forms, whether it's the order in which we perform a series of tasks or the steps involved in solving a problem. The essence of sequence lies in recognizing the order and arranging elements accordingly.In mathematics, sequence plays a crucial role in understanding patterns and series. A sequence is a list of numbers arranged in a particular order. Each number in the sequence is called a term. The sequence can be finite or infinite, depending on the number of terms present. For example, the sequence 1, 2, 3, 4, 5 forms a finite sequence with five terms, while the sequence 2, 4, 6, 8, ... continues indefinitely, forming an infinite sequence.Moreover, sequences can follow specific patterns or rules. For instance, the Fibonacci sequence is a famous example, where each term is the sum of the previous two terms (1, 1, 2, 3, 5, 8, 13, ...). By understanding the rules governing a sequence, mathematicians can predict and model various phenomena.Moving beyond mathematics, the concept of sequence extends to genetics. In the field of genetics, DNA sequences are of utmost importance. DNA, short for deoxyribonucleic acid, carries the genetic instructions required for the development, functioning, and reproduction of all known living organisms.The human genome, for instance, is made up of billions of DNA base pairs arranged in a specific order. This order is critical for various biological processes, such as protein synthesis and gene regulation. By sequencing the DNA, scientists can read the genetic code and gain insights into the causes of genetic disorders, develop personalized medicine, and understand the evolutionary history of species.The sequencing technology has evolved over time, enablingscientists to unravel the complexities of DNA. Initially, sequencing a DNA sequence was a labor-intensive and time-consuming process. However, advancements in technology, such as the development of high-throughput sequencing methods like Next-Generation Sequencing (NGS), have revolutionized the field. These techniques allow for the rapid and cost-effective sequencing of DNA, enabling researchers to analyze complex genomes and identify genetic variations more efficiently.Besides genetics, the concept of sequence is also vital in computer science. In computer science, a sequence is defined as an ordered set of elements. The order of these elements is crucial as it determines how a computer program processes and manipulates the data.Sequences are used in various algorithms and data structures to solve problems efficiently. For example, in sorting algorithms like Merge Sort or Quick Sort, the elements are rearranged into a particular order by comparing and swapping them based on a predefined rule. Similarly, in data structures like linked lists or arrays, the elements are stored in a specific order, allowing for faster access and manipulation of data.Overall, the sequ root encompasses the concept of sequence and its implications across different fields. From mathematics to genetics to computer science, the understanding of sequencing plays a pivotal role in unraveling patterns, solving problems, and gaining insights into various phenomena. By recognizing the order and following the sequence, we can unlock the mysteries that surround us and make significant advancements in different areas of knowledge.。
数据分析英语试题及答案一、选择题(每题2分,共10分)1. Which of the following is not a common data type in data analysis?A. NumericalB. CategoricalC. TextualD. Binary2. What is the process of transforming raw data into an understandable format called?A. Data cleaningB. Data transformationC. Data miningD. Data visualization3. In data analysis, what does the term "variance" refer to?A. The average of the data pointsB. The spread of the data points around the meanC. The sum of the data pointsD. The highest value in the data set4. Which statistical measure is used to determine the central tendency of a data set?A. ModeB. MedianC. MeanD. All of the above5. What is the purpose of using a correlation coefficient in data analysis?A. To measure the strength and direction of a linear relationship between two variablesB. To calculate the mean of the data pointsC. To identify outliers in the data setD. To predict future data points二、填空题(每题2分,共10分)6. The process of identifying and correcting (or removing) errors and inconsistencies in data is known as ________.7. A type of data that can be ordered or ranked is called________ data.8. The ________ is a statistical measure that shows the average of a data set.9. A ________ is a graphical representation of data that uses bars to show comparisons among categories.10. When two variables move in opposite directions, the correlation between them is ________.三、简答题(每题5分,共20分)11. Explain the difference between descriptive andinferential statistics.12. What is the significance of a p-value in hypothesis testing?13. Describe the concept of data normalization and its importance in data analysis.14. How can data visualization help in understanding complex data sets?四、计算题(每题10分,共20分)15. Given a data set with the following values: 10, 12, 15, 18, 20, calculate the mean and standard deviation.16. If a data analyst wants to compare the performance of two different marketing campaigns, what type of statistical test might they use and why?五、案例分析题(每题15分,共30分)17. A company wants to analyze the sales data of its products over the last year. What steps should the data analyst take to prepare the data for analysis?18. Discuss the ethical considerations a data analyst should keep in mind when handling sensitive customer data.答案:一、选择题1. D2. B3. B4. D5. A二、填空题6. Data cleaning7. Ordinal8. Mean9. Bar chart10. Negative三、简答题11. Descriptive statistics summarize and describe thefeatures of a data set, while inferential statistics make predictions or inferences about a population based on a sample.12. A p-value indicates the probability of observing the data, or something more extreme, if the null hypothesis is true. A small p-value suggests that the observed data is unlikely under the null hypothesis, leading to its rejection.13. Data normalization is the process of scaling data to a common scale. It is important because it allows formeaningful comparisons between variables and can improve the performance of certain algorithms.14. Data visualization can help in understanding complex data sets by providing a visual representation of the data, making it easier to identify patterns, trends, and outliers.四、计算题15. Mean = (10 + 12 + 15 + 18 + 20) / 5 = 14, Standard Deviation = √[(Σ(xi - mean)^2) / N] = √[(10 + 4 + 1 + 16 + 36) / 5] = √52 / 5 ≈ 3.816. A t-test or ANOVA might be used to compare the means ofthe two campaigns, as these tests can determine if there is a statistically significant difference between the groups.五、案例分析题17. The data analyst should first clean the data by removing any errors or inconsistencies. Then, they should transformthe data into a suitable format for analysis, such ascreating a time series for monthly sales. They might also normalize the data if necessary and perform exploratory data analysis to identify any patterns or trends.18. A data analyst should ensure the confidentiality andprivacy of customer data, comply with relevant data protection laws, and obtain consent where required. They should also be transparent about how the data will be used and take steps to prevent any potential misuse of the data.。
市场异象与市场效率G. William Schwert威廉沃特Simon School of Business, University of Rochester西蒙商学院罗彻斯特大学This paper can be downloaded from theSocial Science Research Network Electronic Paper Collection:/abstract=目录1 引言 (2)2 挑选出的试验规律 (2)2.1可预见的资产的差别回报 (2)2.2各时期收益的预测性的不同 (7)3 不同类型的投资者的收益 (11)3.1个人投资者 (11)3.2机构投资者 (12)3.3套利限制 (14)4 长期回报 (15)5资产定价影响 (17)6公司金融的启示 (18)7结论 (19)Anomalies and Market Efficiency市场异象与市场效率G. William SchwertUniversity of Rochester, Rochester, NY 14627and National Bureau of Economic Research(国家经济研究局)October 2002摘要实践证明,市场异象似乎与现有的资产价格行为理论并不相符。
它表明市场并不是有效的,且相关的资产价格理论也有不足之处。
这篇论文的实证表明,规模效应,价值效应,周末效应以及股息率效应在一系列研究将他们公诸于世后他们的效果变弱了或者根本不出现这些效应了。
与此同时,操作者开始实践一些学术研究中运用的投资策略。
小公司一月效应在其首次发布在学术文献上以来其作用力不断减弱,尽管有证据表明还存在这种现象。
然而,有趣的是,这种现象并不存在于那些集中投资于小型股的组合回报投资者中。
所有的这些发现使得市场异象趋于表现而非真实。
随着这些非同寻常的发现而来的恶名,诱惑了很多学者去进一步调查市场异象,并试图去解释这种现象。
A User’s Guide toRockWare®Aq•QA®Version 1.1RockWare, Inc.Golden, Colorado, USACopyright © 2003–2004 Prairie City Computing, Inc. All rights reserved.Aq•QA® information and updates: Aq•QA® sales and support:RockWare, Inc.2221 East Street, Suite 101Golden, Colorado 80401 USASales: 303-278-3534, aqqa@Orders: 800-775-6745Fax: 303-278-4099Developer: Developed exclusively for RockWare, Inc. by:Prairie City Computing, Inc.115 West Main Street, Suite 400PO Box 1006Urbana, Illinois 61803-1006 USATrademarks: Aq•QA® and Prairie City Computing® are trademarks or registered trademarks of Prairie City Computing, Inc. RockWare® is a registered trademark of RockWare, Inc. All other trademarks used herein are the properties of their respective owners.Warranty: RockWare warrants that the original CD is free from defects in material and workmanship, assuming normal use, for a period of 90 days from the date of purchase. If a defect occurs during this time, you may return the defective CD to PCC, along with a dated proof of purchase, and RockWare will replace it at no charge. After 90 days, you can obtain a replacement for a defective CD by sending it and a check for $25 (to cover postage and handling) to RockWare. Except for the express warranty of the original CD set forth here, neither RockWare nor Prairie City Computing (PCC) makes any other warranties, express or implied. RockWare attempts to ensure that the information contained in this manual is correct as of the time it was written. We are not responsible for any errors or omissions. RockWare’s and PCC’s liability is limited to the amount you paid for the product. Neither RockWare not PCC is liable for any special, consequential, or other damages for any reason.Copying and Distribution: You are welcome to make backup copies of the software for your own use and protection, but you are not permitted to make copies for the use of anyone else. We put a lot of time and effort into creating this product, and we appreciate your support in seeing that it is used by licensed users only.End User License Agreement: Use of Aq•QA® is subject to the terms of the accompanying End User License Agreement. Please refer to that Agreement for details.ContentsA Guided Tour of Aq•QA®1About Aq•QA® (1)Data Sheet (1)Entering Data (2)Working With Data (4)Graphing Data (6)Replicates, Standards, and Mixing (11)The Data Sheet 13 About the Data Sheet (13)Creating a New Data Sheet (13)Opening an Existing Data Sheet (13)Layout of the Data Sheet (13)Selecting Rows and Columns (14)Reordering Rows and Columns (14)Adding Samples and Analytes (14)Deleting Samples and Analytes (15)Using Analyte Symbols (15)Data Cells (15)Entering Data (15)Changing Units (16)Using Elemental Equivalents (16)Notes and Comments (17)Flagging Data Outside Regulatory Limits (17)Saving Data (17)Exporting Data to Other Software (17)Printing the Data Sheet (18)Analytes 19 About Analytes (19)Analyte Properties (19)Changing the Properties of an Analyte (20)Creating a New Analyte (21)Analyte Libraries (21)Editing the Analyte Library (21)Updating Aq•QA Files (22)A User’s Guide to Aq•QA Contents • iData Analysis 23 About Data Analysis (23)Fluid Properties (23)Water Type (24)Dissolved Solids (24)Density (24)Electrical Conductivity (24)Hardness (25)Internal Consistency (25)Anion-Cation Balance (25)Measured TDS Matches Calculated TDS (26)Measured Conductivity Matches Calculated Value (26)Measured Conductivity and Ion Sums (26)Calculated TDS to Conductivity Ratio (26)Measured TDS to Conductivity Ratio (26)Organic Carbon Cannot Exceed Sum of Organics (26)Carbonate Equilibria (26)Speciation (27)Total Carbonate From Titration Alkalinity (27)Titration Alkalinity From Total Carbonate (27)Mineral Saturation (27)Partial Pressure of CO2 (27)Irrigation Waters (27)Salinity hazard (28)Sodium Adsorption Ratio (28)Exchangeable Sodium Ratio (28)Magnesium Hazard (28)Residual Sodium Carbonate (28)Reference (29)Geothermometry (29)Unit Conversion (30)Replicates, Standards, and Mixing 33 About Replicates, Standards, and Mixing (33)Comparing Replicate Analyses (33)Checking Against Standards (34)Fluid Mixing (34)Graphing Data 35 About Graphing Data (35)Time Series Plots (35)Series Plots (36)Cross Plots (37)Ternary Diagrams (37)Piper Diagrams (38)Durov Diagrams (39)Schoeller Diagrams (39)ii • A Guided Tour of Aq•QA® A User’s Guide to Aq•QAStiff Diagrams (40)Radial Plots (40)Ion Balance Diagrams (41)Pie Charts (41)Copying a Graph to Another Document (42)Saving Graphs (42)Tapping Aq•QA®’s Power 43 About Tapping Aq•QA®’s Power (43)Template for New Data Sheets (43)Exporting the Data Sheet (43)Subscripts, Superscripts, and Greek Characters (44)Analyte Symbols (44)Colors and Markers (44)Calculated Ions (44)Hiding Analytes and Samples (44)Selecting Display Fonts (45)Searching the Data Sheet (45)Arrow Key Behavior During Editing (45)Sorting Samples and Analytes (45)“Tip of the Day” (45)Appendix: Carbonate Equilibria 47 About Carbonate Equilibria (47)Necessary Data (47)Activity Coefficients (47)Apparent Equilibrium Constants (48)Speciation (49)Titration Alkalinity (49)Mineral Saturation (50)CO2 Partial Pressure (51)Index 53 A User’s Guide to Aq•QA Contents • iiiA Guided Tour of Aq•QA®About Aq•QAImagine you could keep the results of your chemical analyses in aspreadsheet developed especially for the purpose. A spreadsheet thatknows how to convert units, check your analyses for internal consistency,graph your data in the ways you want it graphed, and so on.A spreadsheet like that exists, and it’s called Aq•QA. Aq•QA was writtenby water chemists, for water chemists. Best of all, it is not only powerfulbut easy to learn, so you can start using it in minutes. Just copy the datafrom your existing ordinary spreadsheets, paste it into Aq•QA, andyou’re ready to go!To see what Aq•QA can do for you, take the guided tour below.Data SheetWhen you start Aq•QA, you see an empty Data Sheet. Click on File →Open…, move to directory “\Program Files\AqQA\Examples” and openfile “Example1.aqq”.The example Data SheetAnalyteSampleis arranged with samples in columns, and analytes – the things youmeasure – in rows.A User’s Guide to Aq•QA A Guided Tour of Aq•QA® • 12 • A Guided Tour of Aq•QA® A User’s Guide to Aq•QAYou can flip an Aq•QA Data Sheet so the samples are in rows andanalytes in columns by selecting View → Transpose Data Sheet . Clickon this tab again to return to the original view. Tip: Aq•QA by default labels analytes by name (Sodium, Potassium,Dissolved Solids, …), but by clicking on View → Show AnalyteSymbols you can view them by chemical symbol (Na, K, TDS, …). To include more samples or analytes in your Data Sheet, click on the “Add Sample” or “Add Analyte” button: Add asampleAdd an analyteSelect analyte(s)Select sample(s)Select valuesYou select analytes or samples by clicking on “handles”, marked in theData Sheet by small triangles. You can select the values associated withan analyte using a separate set of handles, next to the “Unit” column.Give it a try!Tip: To rearrange rows or columns, select one or more, hold down theAlt key, and drag them to the desired location. Entering DataTo see how to enter your own data into an Aq•QA Data Sheet, begin byselecting File → New . Add to the Data Sheet whatever analytes youneed, and delete any you don’t need.Tip: To delete analytes, select one or more and click on the button.To delete samples you have selected, click on the button.When you click on the “Add Analyte” button, you can pick from amonga number of predefined choices in various categories, such as “InorganicAnalytes”, “Organic Analytes”, and so on:lets youfromA number of commonly encountered data fields (Date, pH,Temperature, …) can be found in the “General” category.Tip: If you don’t find an analyte you need among the predefined choices,you can easily define your own by clicking on Analytes→NewAnalyte….To make your work easier, rearrange the analytes (select, hold down theAlt key, and drag) so they appear in the same order as in your data.Tip: You can add a number of analytes in a single step by clicking onAnalytes→Add Analytes….Set units for the various analytes, as necessary: right click in the unitfield and choose the desired units from under Change Units, or selectChange Units under Analytes on the menubar.Right click tochange unitsA User’s Guide to Aq•QA A Guided Tour of Aq•QA® • 3Tip: You can change the units for more than one analyte in one step.Simply select any number of analytes and right-click on the unit field.Tip: Analyses are sometimes reported in elemental equivalents. Forexample, sulfate might be reported as “SO4 (as S)”, bicarbonate as“HCO3 (as C)”, and so on. In this case, right click on the unit of such ananalyte and select Convert to Elemental Equivalents.You can now enter your data into the Data Sheet as you would in anordinary spreadsheet.Tip: If you have an analysis below the detection limit, you can enter afield such as “<0.01”. Aq•QA knows what this means. If the analysisreports an analyte was not detected, enter a string such as “n/d” or “--”.For missing data, enter a non-numeric string, or simply leave the entryblank.You of course can type data into the Data Sheet by hand, or paste thevalues into cells one-by-one. But it’s far easier to copy them from anordinary spreadsheet or other document as a block and paste them all atonce into the Aq•QA Data Sheet.Making sure the analytes appear in the same order as in your spreadsheet,copy the data block, click on the top, leftmost cell in the Aq•QA DataSheet, and select Edit → Paste, or touch ctrl+V.Tip: If there are more samples in a data block you are pasting than inyour Aq•QA Data Sheet, Aq•QA will make room automatically.Tip: If the data arranged in your spreadsheet in columns fall in rows inyour Aq•QA Data Sheet, or vice-versa, you can transpose the Data Sheet,or simply select Edit → Paste Special → Paste Transposed.Tip: You can flag data in an Aq•QA Data Sheet that fall outsideregulatory guidelines. Select Samples → Check Regulatory Limits, orclick on . Violations on the Data Sheet are flagged in red.Working With DataOnce you have entered your chemical analyses in the Data Sheet, Aq•QAcan tell you lots of useful information.Click on File → Open… and load file “Example2.aqq” from directory“\Program Files\AqQA\Examples”. To see Aq•QA’s analysis of one ofthe samples in the Data Sheet, select the sample by clicking on its handleand then click on the tab. This moves you to the DataAnalysis pane, which looks like4 • A Guided Tour of Aq•QA® A User’s Guide to Aq•QAClick on anybar to expandor close up acategoryClick herefor moreinformationThere are a number of categories in the Data Analysis pane. To open acategory, click on the corresponding bar. A second click on the bar closesthe category. Clicking on the symbol gives more information aboutthe category.Tip: You can view the data analysis for the previous or next sample inyour Data Sheet by clicking on the and buttons to the left andright of the top bar in the Data Analysis pane.The top category, Fluid Properties, identifies the water type, dissolvedsolids content, density, temperature-corrected conductivity, and hardness,as measured or calculated by Aq•QA.The next category, Internal Consistency, reports the results of a numberof Quality Assurance tests from the American Water Works Association“Standard Methods” reference. For example, Aq•QA checks that anionsand cations balance electrically, that TDS and conductivitymeasurements are consistent with the reported fluid composition, and soon.The Carbonate Equilibria category tells the speciation of carbonate insolution, carbonate concentration calculated from measured titrationalkalinity and vice-versa, the fluid’s calculated saturation state withrespect to the calcium carbonate minerals calcite and aragonite, and thecalculated partial pressure of carbon dioxide.A User’s Guide to Aq•QA A Guided Tour of Aq•QA® • 5The Irrigation Waters category shows the irrigation properties of asample, and the Geothermometry category shows the results of applyingchemical geothermometers to the samples, assuming they are geothermalwaters.Finally, the sample’s analysis is displayed in a broad range of units, frommg/kg to molal and molar.Tip: You can print results in the Data Analysis pane: open the categoriesyou want printed and click on File→Print…Graphing DataAq•QA can display the data in your Data Sheet on a number of the typesof plots most commonly used by water chemists.To try your hand at making a graph, make sure that you have file“Example2.aqq” open. If not, click on File→Open… and select the filefrom directory “\Program Files\AqQA\Examples”.On the Data Sheet, select the row for Iron. Hold down the ctrl key andselect the row for Manganese. Click on and select Time SeriesPlot. The graph appears in Aq•QA as a new pane.The result should look like:To change or Click here toSelect…delete a graph, right-click on its tab alter the graph’s appearanceanalytes…andsamples tographYou can select the analytes and samples to appear in the graph on thecontrol panel to the right of the plot. Right clicking on the pane’s tab,along the bottom of the Aq•QA window, lets you change the plot to adifferent type, or delete it.Tip: You can alter the appearance of a graph by clicking on theAdvanced Options… button on the graph pane.6 • A Guided Tour of Aq•QA® A User’s Guide to Aq•QAYou can copy the graph (Edit→Copy) and paste it into another program, such as an illustration program like Adobe® Illustrator® or Microsoft® PowePoint®, or a word processing program like Microsoft®Word®.Tip: Once you have pasted a graph into an illustration program, you can edit its appearance and content. To do so, select the graphic and ungroup the picture elements (you may need to ungroup them twice).You can also send it to a printer by clicking on File→Print.Tip: In addition to copying a graph to the clipboard, you can save it in a file in one of several formats: as a Windows® EMF file, an EncapsulatedPostScript® (EPS) file, or a bitmap. Select File→Save Image As… andselect the format from the “Save as type” dropdown menu.Tip: Select a linear or logarithmic vertical axis for a Series or TimeSeries plot by unchecking or checking the box labeled “Log Scale” onthe Advanced Options…→ dialog or dialog.Aq•QA can display your data on a broad variety of graphs and diagrams:simply choose a diagram type from the pulldown.In addition to Time Series plots, Aq•QA can produce the following typesof diagrams:Series Diagrams.A User’s Guide to Aq•QA A Guided Tour of Aq•QA® • 7Cross Plots, in linear and logarithmic coordinates.Ternary diagrams.8 • A Guided Tour of Aq•QA® A User’s Guide to Aq•QAPiper diagrams.Durov diagrams.A User’s Guide to Aq•QA A Guided Tour of Aq•QA® • 9Schoeller diagrams.Stiff diagrams.Radial diagrams.10 • A Guided Tour of Aq•QA® A User’s Guide to Aq•QAIon balance diagrams.Pie charts.Replicates, Standards, and MixingAq•QA can check replicate analyses, compare analyses to a standard, andfigure the compositions of sample mixtures.Replicate analyses are splits of the same sample that have been analyzedmore than once, whether by the same or different labs. The analyses,therefore, should agree to within a small margin of error.To see how this feature works, load (File → Open…) file“Replicates.aqq” from directory “\Program Files\AqQA\Examples”.Select samples PCC-2, PCC-2a, and PCC-2b: click on the handle forPCC-2, then hold down the shift key and click on the handle for PCC-2b.Now, click on the button on the toolbar.A new display appears at the right side of the Aq•QA Data Sheet, oralong the bottom if you have transposed it.A User’s Guide to Aq•QA A Guided Tour of Aq•QA® • 1112 • A Guided Tour of Aq•QA®A User’s Guide to Aq•QAThe display shows the coefficient of variation for each analyte, and whether this value falls within a certain tolerance. Small coefficients of variation indicate good agreement among the replicates. The tolerance, by default, is ±5, but you can set it to another value by clicking on Samples → Set Replicate Tolerance….A standard is a sample of well-known composition, one that wasprepared synthetically, or whose composition has already been analyzed precisely. Enter the known composition as a sample in the Data Sheet and click on Samples → Designate As Standard, or the button. Then select an analysis of the standard on the Data Sheet and click on Samples → Compare To Standard, or the button. The display at the right or bottom of the Data Sheet shows the error in the analysis, relative to the standard. Set the tolerance for the comparison, by default ±10, clicking on Samples → Set Standard Tolerance….To find the composition of a mixture of two or more samples, select two or more samples and click on the button on the toolbar. Thecomposition of the mixed fluid appears to the right or bottom of the Data Sheet.The Data SheetAbout the Data SheetThe Aq•QA® Data Sheet is a special spreadsheet that holds yourchemical data. The data is typically composed of the values measured forvarious analytes, for a number of samples.You can enter data into a Data Sheet and manipulate it, as describedbelow.Creating a New Data SheetTo create a new Aq•QA Data Sheet, select File → New, or touch ctrl+N.An empty Data Sheet, containing a number of analytes, but no data,appears.The appearance of new Data Sheets is specified by a template. You cancreate your own template so new Data Sheets contain the analytes youneed, in your choice of units, and ordered as you desire. For moreinformation, see Template for New Data Sheets in the TappingAqQA’s Power chapter of this guide.Opening an Existing Data SheetAq•QA files end with the extension “.aqq”. These files contain the dataentered in the Data Sheet, as well as any graphs produced and theprogram’s current configuration.You can open an existing Data Sheet by clicking on File → Open… andselecting a “.aqq” file, either one that you have previously saved or anexample file installed with the Aq•QA package. A number of examplefiles are installed in the “Examples” directory within the Aq•QAinstallation directory (commonly “\Program Files\AqQA”).Layout of the Data SheetAn Aq•QA Data Sheet contains the values measured for various analytes(Na+, Ca2+, HCO3−, and so on) for any number of samples that have beenanalyzed. Each piece of information about a sample is considered ananalyte, even sample ID, location, sampling date, and so on.A User’s Guide to Aq•QA The Data Sheet • 13By default, each analyte occupies a row in the Data Sheet, and thesamples fall in columns. You can reverse this arrangement, so analytesfall in columns and the samples occupy rows, by clicking on Edit →Transpose Data Sheet. To flip the Data Sheet back to its originalarrangement, click on this tab a second time.You can rearrange the order of analytes or symbols on the Data Sheet, asdescribed below under Reordering Rows and Columns.Selecting Rows and ColumnsTo select a row or column, click on the marker to the left of a row, or thetop of a column. The marker for a row or column appears as a smalltriangle. Analytes have two markers, one for selecting the entire analyte,and one for selecting only the analyte’s data values.You can select a range of rows or columns by holding down the leftmouse button on the marker at the beginning of the range, then draggingthe mouse to the marker at the end of the range. Alternatively, select thebeginning of the range, then hold down the shift button and click on themarker for the end of the range.To select a series of rows or columns that are not necessarily contiguouson the Data Sheet, select the first row or column, then hold down the ctrlkey and select subsequent rows or columns.By clicking on one of the small blue squares at the top or left of the DataSheet, you can select either the entire sheet, or all of the data values onthe sheet.Reordering Rows and ColumnsYou can easily rearrange the rows and columns of samples and analytesin your Data Sheet. To do so, first select a row or column, or a range ofrows and columns, as described under Selecting Rows andColumns. Then, holding down the alt key, press the left mouse button,drag the selection to its new position, and release the mouse button.Adding Samples and AnalytesTo include more samples or analytes in your Data Sheet, select onSamples → Add Sample, or Analytes → Add Analyte, or simply clickon the or buttons on the toolbar. To add several analytes at once,select Analytes → Add Analytes…, which opens a dialog box for thispurpose.When you add an analyte, you choose from among the large number thatAq•QA knows about. These are arranged in categories: inorganics,organics, biological assays, radioactivity, isotopes, and a generalcategory that includes things like pH, temperature, date, and samplelocation.14 • The Data Sheet A User’s Guide to Aq•QAIf you don’t find the analyte you need, you can quickly define your own.Select Analytes → New Analyte…, or New Analyte…from thedropdown menu. For more information about defining analytes, see theAnalytes chapter of the guide.Deleting Samples and AnalytesTo delete analytes or samples, select one or more and click on Analytes→ Delete, or Samples → Delete. Alternatively, select an analyte andclick on the button, or a sample and click on .Using Analyte SymbolsAnalytes are labeled with names such as Sodium, Calcium, andBicarbonate. If you prefer, you can view them labeled with thecorresponding chemical symbols, such as Na+, Ca2+, HCO3−. Simplyclick on View → Show Analyte Symbols. A second click on this tabreturns to labeling analytes by name.Data CellsEach cell in the data sheet contains one of several types of information:1. A numerical value, such as the concentration of a species.2. A character string.3. A date or a time.Numerical values are, most commonly, simply a number. You can,however, indicate a lack of data with a character string, such as “n/d” or“Not analyzed”, or just leaving the cell empty.If an analysis falls below the detection limit for a species, enter thedetection limit preceded by a “<”. For example, “<0.01”.Character strings, such as you might enter for the “Sample ID”, containany combination of characters, and can be of any length.You can enter dates in a variety of formats: “Sep 21, 2003”, 9/21/03”,“September 23”, and so on. Aq•QA will interpret your input and cast it inyour local format (e.g., mm/dd/yy in the U.S.). Similarly, enter time as“2:20 PM” or “14:20”. Append seconds, if you wish: 2:20:30 PM”.To change the width of the data cells (i.e., the column width), drag thedividing line between columns to the left or right. This changes the widthof all the data columns in the Data Sheet.Entering DataTo enter data into an Aq•QA Data Sheet, you can of course type it infrom the keyboard, or paste it into the cells in the Data Sheet, one by one.A User’s Guide to Aq•QA The Data Sheet • 15It is generally more expedient, however, to copy all of the values as ablock from a source file, such as a table in a word processing document,or a spreadsheet. To do so, set up your Aq•QA Data Sheet so that itcontains the same analytes as the source file, in the same order (seeAdding Samples and Analytes above, and Reordering Rows andColumns). You don’t necessarily need to add samples: Aq•QA will addcolumns (or rows) to accommodate the data you paste.Now, select the data block from the source document and copy it to theclipboard. Move to Aq•QA, click on the top, leftmost data cell, and selectEdit → Paste. If the source data is arranged in the opposite sense as yourData Sheet (the samples are in rows instead of columns, or vice-versa),transpose the Data Sheet (View → Transpose Data Sheet), or selectEdit → Paste Special → Paste Transposed.Changing UnitsYou can change the units of analytes on the Data Sheet at any time. Todo so, select one or more analytes, then click on Analytes → ChangeUnits. Alternatively, right click and choose a new unit from the optionsunder Change Units. If you have entered numerical data for the analyte(or analytes), you will be given the option of converting the values to thenew unit.Some unit conversions require that the program be able to estimatevalues for the fluid’s density, dissolved solids content, or both. If youhave entered values for the Density or Dissolved Solids analytes, Aq•QAwill use these values directly when converting units.If you have not specified this data for a sample, Aq•QA will calculateworking values for density and dissolved solids from the chemicalanalysis provided. It is best, therefore, to enter the complete analysis for asample before converting units, so the Aq•QA can estimate density anddissolved solids as accurately as possible.Aq•QA estimates density and dissolved solids using the methodsdescribed in the Data Analysis section of the User’s Guide, assuming atemperature of 20°C, if none is specified. Aq•QA can estimate densityover only the temperature range 0°C –100°C; outside this range, itassumes a value of 1.0 g/cm3, which can be quite inaccurate and lead toerroneous unit conversions.Using Elemental EquivalentsYou may find that some of your analytical results are reported aselemental equivalents. For example, sulfate might be reported as “SO4(as S)”, bicarbonate as “HCO3 (as C)”, and so on.In this case, select the analyte or analytes in question and click onAnalytes → Convert to Elemental Equivalents. Alternatively, select16 • The Data Sheet A User’s Guide to Aq•QAthe analyte(s), then right click on your selection and choose Convert toElemental Equivalents.To return to the default setting, select Analytes → Convert to Species,or select the Convert to Species option when you right-click.Notes and CommentsWhen you construct a Data Sheet, you may want to save certain notesand comments, such as a site’s location, who conducted the sampling,what laboratory analyzed the samples, and so on.To do so, select File → Notes and Comments… and type theinformation into the box that appears. This information will be savedwith your Aq•QA document; you may access it and alter it at any time.Flagging Data Outside Regulatory LimitsYou can highlight on the Data Sheet concentrations in excess of ananalyte’s regulatory limit. Select Samples → Check Regulatory Limits.Concentrations above the limit now appear highlighted in a red font.Select the tab a second time to disable the option. Touching ctrl+L alsotoggles the option.Aq•QA can maintain a regulatory limit for each analyte. The analytelibrary contains default limits based on U.S. water quality standards atthe time of compilation, but you should of course verify these againststandards as implemented locally. You can easily change the limit carriedfor an analyte, as described in the Analytes chapter of this guide.Saving DataBefore you exit Aq•QA, you will probably want to save your workspace,which includes the data in your Data Sheet, any graphs you have created,and so on, in a .aqq file.To save your workspace, select File → Save , or click on the button onthe Aq•QA toolbar.To save your workspace as a .aqq file under a different name, select File→ Save As… and specify the file’s new name.You may also want to save the data in the Data Sheet as a file that can beread by other applications, such as Microsoft® Excel®. For informationon saving data in this way, see the next section, Exporting Data toOther Software.Exporting Data to Other SoftwareWhen Aq•QA saves a .aqq file, it does so in a special format thatincludes all of the information about your Aq•QA session, such as theA User’s Guide to Aq•QA The Data Sheet • 17。
全文分为作者个人简介和正文两个部分:作者个人简介:Hello everyone, I am an author dedicated to creating and sharing high-quality document templates. In this era of information overload, accurate and efficient communication has become especially important. I firmly believe that good communication can build bridges between people, playing an indispensable role in academia, career, and daily life. Therefore, I decided to invest my knowledge and skills into creating valuable documents to help people find inspiration and direction when needed.正文:写关于兰州天气预报的英语作文全文共3篇示例,供读者参考篇1The Unpredictable Skies of Lanzhou: A Student's Perspective on Weather ForecastingAs a student hailing from the captivating city of Lanzhou, located in the heart of China's Gansu Province, I have developeda keen interest in the ever-changing weather patterns that grace our skies. Nestled along the Yellow River, our city is renowned for its unique geographical location, which often leads to unexpected meteorological phenomena. Keeping a watchful eye on the weather forecast has become a ritual for many of us, as it holds the key to planning our daily activities and ensuring our safety.Growing up in Lanzhou, I quickly learned that the weather here could be as unpredictable as the winding streets of our ancient city. One moment, the sun would be shining brightly, and the next, dark clouds would gather, threatening to unleash a torrential downpour. This capricious nature of the weather has taught me the importance of being prepared for any eventuality, whether it's carrying an umbrella or donning an extra layer of clothing.As a student, the weather forecast plays a crucial role in my academic life. During exam season, when the pressure is at its peak, a sudden change in weather conditions can significantly impact my ability to concentrate and perform at my best. On sweltering summer days, the scorching heat can drain my energy, making it challenging to focus on my studies. Conversely, duringthe bitter winter months, the frigid temperatures can make the journey to and from school a daunting task.However, it's not just the extremes that concern us; even the slightest variations in weather can have far-reaching consequences. A bout of heavy rain can turn the city's streets into treacherous rivers, making it difficult for students like myself to navigate our way to class. Conversely, a sudden snowfall can bring the city to a standstill, causing transportation disruptions and forcing schools to close unexpectedly.Despite the challenges posed by Lanzhou's unpredictable weather, I have learned to embrace the excitement it brings. Watching the clouds gather and dissipate, observing the subtle shifts in wind direction, and feeling the temperature fluctuations have become a part of my daily routine. It's a constant reminder of the power of nature and the importance of respecting the forces that shape our environment.In recent years, advances in technology have made weather forecasting more accurate and accessible than ever before. Meteorological agencies now employ sophisticated models and algorithms to predict weather patterns with increasing precision. As a tech-savvy student, I have come to rely heavily on mobile applications and online resources that provide real-time updateson weather conditions, ensuring that I am always prepared for whatever Mother Nature has in store.Yet, even with these technological advancements, there is still an element of uncertainty that surrounds weather forecasting in Lanzhou. The city's unique geographical location, nestled between the Qilian Mountains and the Yellow River basin, creates a complex interplay of factors that can influence the weather in unpredictable ways. It's a humbling reminder that, despite our best efforts, nature still holds the ultimate trump card.Nonetheless, the challenges posed by Lanzhou'sever-changing weather have taught me valuable lessons in resilience, adaptability, and respect for the natural world. As I navigate the academic and personal challenges that life as a student presents, I find solace in the knowledge that, just like the weather, every obstacle is temporary, and with perseverance and preparation, I can weather any storm.In conclusion, the weather forecast in Lanzhou is more than just a report on atmospheric conditions; it's a window into the city's unique character, a reflection of the unpredictable forces that shape our lives, and a constant reminder of the beauty and power of nature. As a student, embracing the uncertainties of theweather has taught me invaluable lessons that will undoubtedly serve me well as I embark on the journey of life beyond the classroom walls.篇2An Unexpected Surprise: Lanzhou's Peculiar Weather ForecastAs I was getting ready for school this morning, I couldn't help but notice the odd weather report on the TV. Living in Lanzhou, the capital city of Gansu Province in northwest China, we're accustomed to a relatively dry and continental climate. However, today's forecast seemed to defy all logic and normalcy.The cheery meteorologist on screen announced with a bright smile, "Good morning, Lanzhou! Brace yourselves for a delightfully unexpected surprise today. We're in for a burst of tropical weather conditions unlike anything we've experienced before!"I nearly choked on my breakfast, certain I had misheard. Tropical weather? In Lanzhou? The city rests on the upper reaches of the Yellow River, surrounded by the arid Gobi Desert and rugged mountain ranges. Surely, this had to be some kind of joke or technical glitch.Nevertheless, the forecast insisted on defying my disbelief. "That's right, folks! We can expect scorching temperatures reaching a sweltering 40°C (104°F), coupled with intense humidity levels of 90%. But that's not all! Get ready for torrential downpours, with rainfall accumulations of up to 300 millimeters (12 inches) throughout the day."My jaw must have hit the floor at that point. Lanzhou receives an average annual precipitation of merely 315 millimeters (12.4 inches), and the thought of that much rain falling in a single day was mind-boggling.As if that wasn't enough, the meteorologist continued, "And hold on to your hats, folks, because we're also anticipating hurricane-force winds gusting up to 200 kilometers per hour (124 mph)! It's going to be one wild ride, so make sure to secure any loose objects and seek shelter when necessary."I couldn't believe what I was hearing. Lanzhou, a city known for its dry, temperate climate, was supposedly going to transform into a tropical cyclone zone overnight. This had to be a prank, right?Skeptical yet intrigued, I decided to keep an open mind and see how the day would unfold. After all, stranger things have happened in the world of weather.As I stepped outside, the first thing that hit me was the overwhelming humidity. The air felt thick and heavy, like walking through a sauna. Beads of sweat immediately formed on my forehead, and my clothes clung to my skin uncomfortably.Pushing through the oppressive heat, I made my way to school, only to be met with a torrential downpour halfway there. Within seconds, I was drenched from head to toe, my backpack soaked through and weighing a ton. The streets quickly transformed into raging rivers, with water levels rising rapidly.Seeking refuge under a shop's awning, I watched in awe as the storm intensified. Winds howled ferociously, whipping debris through the air and threatening to sweep me off my feet. Trees bent precariously, their branches thrashing violently, and streetlights swayed ominously.Just when I thought the situation couldn't get any more surreal, a massive flash of lightning illuminated the sky, followed by a deafening clap of thunder that rattled the windows around me. The hair on the back of my neck stood on end, and I couldn't help but feel a sense of awe mixed with trepidation.As the hours ticked by, the tropical onslaught showed no signs of letting up. News reports flooded in, detailingwidespread flooding, power outages, and even a few tornado sightings on the city's outskirts.By the time I finally made it to school, the campus resembled a war zone. Fallen trees and branches littered the grounds, and the once-pristine lawns had transformed into muddy quagmires. Several classrooms were flooded, forcing the cancellation of afternoon classes.During our lunch break, my friends and I huddled in the cafeteria, swapping stories of our harrowing journeys through the storm. Some had even witnessed roof tiles being ripped off buildings or cars being swept away by the raging floodwaters.As the day drew to a close, the meteorologists offered a glimmer of hope, predicting that the tropical conditions would begin to subside by the following morning. However, they warned that the aftermath would be significant, with widespread damage and cleanup efforts required throughout the city.On my way home, I couldn't help but feel a sense of disbelief and wonder at the day's events. Lanzhou, a city known for its dry, temperate climate, had been transformed into a tropical paradise (or nightmare, depending on your perspective) in a matter of hours.As I finally reached the sanctuary of my home, I couldn't help but reflect on the incredible power of nature and the unpredictability of weather patterns. What had seemed like an ordinary day had turned into an adventure straight out of a Hollywood disaster movie.In the end, this unexpected tropical surprise left me with a newfound appreciation for the meteorologists and their tireless efforts to forecast and prepare us for Mother Nature's whims. It also served as a humbling reminder that no matter how advanced our technology or knowledge may be, the forces of nature still possess the ability to surprise and astound us.As for Lanzhou, well, let's just say we'll be stocking up on raincoats and umbrellas from now on, just in case the tropics decide to pay us another unexpected visit.篇3A Grey and Hazy Future: Lanzhou's Troubling Weather ForecastAs a student living in the city of Lanzhou, the capital of Gansu Province in northwest China, the weather forecast has become a topic of significant concern and anxiety for me and my classmates. Surrounded by mountains and situated in asemi-arid climate zone, Lanzhou has long faced environmental challenges, but recent projections paint a grim picture of what lies ahead.According to the latest meteorological data and climate models, Lanzhou is expected to experience an increasing number of days with severe air pollution, extreme temperatures, and water scarcity over the next decade. These alarming trends not only threaten our quality of life but also raise serious questions about the long-term sustainability of our city.One of the most pressing issues is the prevalence of air pollution, which has become an all-too-familiar aspect of life in Lanzhou. The city's geography, with its surrounding mountains trapping pollutants, combined with industrial emissions and vehicle exhaust, has created a perfect storm for poor air quality. In recent years, we've witnessed an alarming rise in the number of days with hazardous levels of particulate matter (PM2.5) and other harmful pollutants.The weather forecast for the upcoming years only paints a bleaker picture. Meteorologists predict that Lanzhou will experience a significant increase in the number of days with severe smog, often lasting for weeks at a time. During theseperiods, the air becomes thick and hazy, making it difficult to breathe and forcing schools and businesses to close temporarily.As a student, the impact of air pollution on our health and education is a major concern. Exposure to high levels of particulate matter has been linked to respiratory problems, heart disease, and even cognitive impairment. Numerous studies have shown that air pollution can negatively affect children's lung development and academic performance. It's heartbreaking to see my younger siblings and their classmates struggle with asthma and other respiratory issues exacerbated by the poor air quality.Unfortunately, air pollution is not the only challenge we face. The weather forecast also warns of an increase in the frequency and intensity of extreme temperatures, both hot and cold. Lanzhou's continental climate has always been characterized by hot summers and cold winters, but climate change is amplifying these extremes.During the summer months, we can expect more frequent and prolonged heatwaves, with temperatures soaring well above 40°C (104°F). These sco rching conditions not only make outdoor activities unbearable but also put a strain on the city's energy resources as air conditioning usage skyrockets. Moreover, therisk of heat-related illnesses, such as heat stroke and dehydration, becomes a significant concern for vulnerable populations, including the elderly and young children.In contrast, the winters in Lanzhou are expected to become even harsher, with longer periods of extreme cold and heavy snowfall. While we're accustomed to dealing with frigid temperatures, the forecast suggests that we may face more frequent and intense cold snaps, with temperatures plummeting below -20°C (-4°F). These conditions can lead to disruptions in transportation, power outages, and increased heating costs, placing a significant burden on families and the local economy.Perhaps the most alarming aspect of Lanzhou's weather forecast is the projected water scarcity. As a semi-arid region, Lanzhou has long relied on the Yellow River and other water sources for its freshwater supply. However, climate change is expected to further exacerbate the already strained water resources in the region.The forecast indicates a significant decrease in precipitation levels, coupled with more frequent and prolonged droughts. This combination could potentially lead to water shortages, affecting not only households but also agricultural production and industrial activities. The prospect of water rationing andpotential conflicts over limited resources is a genuine concern for our community.As students, we are taught about the importance of environmental stewardship and sustainable development, but the reality we face in Lanzhou makes it challenging to remain hopeful. We watch helplessly as our city grapples with the consequences of pollution, climate change, and resource depletion, and we worry about the future that awaits us.Despite the grim forecast, we must not lose hope. It is crucial for us, as the next generation, to actively participate in finding solutions and advocating for change. We must demand that our local authorities and policymakers take decisive action to address these issues, from implementing stricter environmental regulations to investing in renewable energy sources and water conservation measures.Furthermore, we must educate ourselves and our peers about the importance of adopting sustainable lifestyles. Simple actions, such as reducing energy consumption, using public transportation, and minimizing waste, can collectively make a significant impact. By fostering a culture of environmental awareness and responsibility, we can work towards mitigatingthe adverse effects of climate change and preserving our city's natural resources.As I look out of my classroom window and see the hazy skyline, I am reminded of the challenges we face. But I also see the determination and resilience of my peers, who refuse to accept this grim future as inevitable. We are the future of Lanzhou, and it is our responsibility to fight for a cleaner, more sustainable, and livable city.The weather forecast may paint a bleak picture, but it is also a call to action. By working together, embracing innovation, and prioritizing environmental protection, we can create a brighter future for Lanzhou – a future where we can breathe clean air, enjoy moderate temperatures, and have access to clean water. It won't be easy, but as students, we must be the driving force behind this change, for the sake of our city, our health, and the generations to come.。
不平衡系数英文In the realm of machine learning, the imbalance coefficient holds significant importance, especially when dealing with datasets that exhibit a significant disparity in the number of instances between different classes. This disparity, often referred to as class imbalance, can lead to challenges in training accurate and reliable models. The imbalance coefficient serves as a metric to quantify this imbalance, allowing researchers and practitioners to assess the severity of the issue and take appropriate measures to address it.The imbalance coefficient is typically calculated as a ratio between the number of instances in the majority class and the number of instances in the minority class. A higher imbalance coefficient indicates a more severe imbalance, which can lead to issues such as bias towards the majority class and poor performance on the minority class. This can be problematic in scenarios where accurate predictions on the minority class are crucial, such as in fraud detection or rare disease diagnosis.To address class imbalance, several strategies can be employed. One common approach is oversampling, which involves generating synthetic instances of the minority class to increase its representation in the dataset. Another approach is undersampling, which involves reducing the number of instances in the majority class to balance the classes. Both approaches aim to create a more balanced dataset that can lead to improved model performance.However, it is important to note that simply balancing the classes may not always be sufficient. The imbalance coefficient, although a useful metric, does not capture all the nuances of class imbalance. For instance, the distribution of instances within each class may still be highly skewed, even if the overall class counts are balanced. In such cases, more advanced techniques such as cost-sensitive learning or ensemble methods may be required to effectively handle the imbalance.In addition, it is crucial to evaluate model performance not just on the overall dataset but also on each individual class. Metrics such as precision, recall, and F1-score provide a more nuanced understanding of modelperformance, especially in the context of class imbalance. By monitoring these metrics, researchers and practitioners can assess whether their strategies to address imbalance are effective and make informed decisions about model improvements.In conclusion, the imbalance coefficient plays apivotal role in machine learning, particularly when dealing with datasets exhibiting class imbalance. It serves as a valuable metric to quantify the severity of the imbalance and guide strategies to address it. By understanding and effectively addressing class imbalance, researchers and practitioners can develop more accurate and reliable models that perform well across all classes, leading to improved outcomes in various real-world applications.**不平衡系数在机器学习中的重要性**在机器学习的领域中,不平衡系数具有重要意义,特别是在处理不同类别之间实例数量存在显著差异的数据集时。
variance和variation的用法Variance and variation are terms commonly used in statistics and probability to describe the extent, dispersion, or spread of data. While they are related concepts, there are subtle differences in their usage. In this essay, we will explore the definitions, applications, and significance of variance and variation in various fields.Variance is a statistical measure that quantifies how spread out or dispersed a set of values is. It is calculated as the average of the squared differences from the mean of the data set. The basic idea behind variance is to determine the average distance between each data point and the mean. A larger variance indicates a greater spread or dispersion of data, while a smaller variance indicates a more concentrated cluster of values around the mean.Variance is widely used in fields such as finance, economics, engineering, and physics. In finance, for example, variance is a key measure of volatility in asset prices. Higher variance implies greater price fluctuations, making an investment riskier. Economists use variance to assess the volatility of economic indicators like GDP, inflation rates, and stock market returns. In engineering, variance helps evaluate the consistency and reliability of processes or systems. For instance, in manufacturing, measuring the variance of product dimensions ensures quality control. In physics, variance is used to analyze the fluctuations or noise in experimental measurements.On the other hand, variation refers to the range or diversity of values within a data set or population. It provides a measure of how different individual observations are from one another. Variation can be expressed in several ways, such as the range (maximum minus minimum), interquartile range (middle 50% of observations), or coefficient of variation (standard deviation divided by the mean). Variation is used in fields including biology, genetics, ecology, and social sciences. In biology and genetics, variation is crucial for understanding the diversity of traits within a species or population. It helps researchers study genetic variability, evolution, and adaptability. In ecology, variation is used to analyze how different environmental factors impact species diversity, population dynamics, and ecosystem stability. Social scientists use variation to investigate differences in attitudes, behaviors, socioeconomic factors, or cultural practices across different groups or regions.While variance and variation share similarities, they serve distinct purposes in statistics. Variance focuses specifically on the dispersion or spread of data around the mean. It provides a quantitative measure of the average distance between individual values and the central tendency of the data set. On the other hand, variation encompasses a broader concept that considers the entire range of values or patterns in a data set. It quantifies the degree of diversity, heterogeneity, or variability within the set. Both variance and variation play vital roles in hypothesis testing, modeling, and decision-making. They help researchers and practitioners make inferences, draw comparisons, and evaluate statistical significance. For example, when testing the effectiveness of a new drug, variance allows researchers to assess the consistency and reliability of treatment outcomes. In a manufacturing process, variation analysis helps identify sources of defects, optimize performance, and minimize waste. Moreover, in social sciences, analyzing variation across different groups provides insights into social inequalities,policy implications, or cultural differences.In conclusion, variance and variation are critical statistical measures used to analyze the spread, diversity, or dispersion of data. Variance focuses on the differences between individual values and the mean, providing a measure of how spread out the values are. Variation, on the other hand, considers the entire range of values or patterns within a data set, quantifying the degree of diversity or heterogeneity. Both concepts are extensively applied in various fields and are instrumental in decision-making, evaluating statistical significance, and understanding patterns in data.。
Torpid Mixing of Simulated Tempering on the Potts ModelNayantara Bhatnagar Dana RandallAbstractSimulated tempering and swapping are two families of sam-pling algorithms in which a parameter representing temper-ature varies during the simulation.The hope is that this willovercome bottlenecks that cause sampling algorithms to beslow at low temperatures.Madras and Zheng demonstratethat the swapping and tempering algorithms allow efficientsampling from the low-temperature mean-field Ising model,a model of magnetism,and a class of symmetric bimodaldistributions[10].Local Markov chains fail on these distri-butions due to the existence of bad cuts in the state space.Bad cuts also arise in the-state Potts model,anotherfundamental model for magnetism that generalizes the Isingmodel.Glauber(local)dynamics and the Swendsen-Wangalgorithm have been shown to be prohibitively slow forsampling from the Potts model at some temperatures[1,2,6].It is reasonable to ask whether tempering or swappingcan overcome the bottlenecks that cause these algorithms toconverge slowly on the Potts model.We answer this in the negative,and give thefirst ex-ample demonstrating that tempering can mix slowly.Weshow this for the3-state ferromagnetic Potts model on thecomplete graph,known as the mean-field model.The slowconvergence is caused by afirst-order(discontinuous)phasetransition in the underlying ing this insight,wedefine a variant of the swapping algorithm that samples ef-ficiently from a class of bimodal distributions,including themean-field Potts model.1IntroductionThe standard approach to sampling via Markov chain MonteCarlo algorithms is to connect the state space of configura-tions via a graph called the Markov kernel.The Metropo-lis algorithm proscribes transition probabilities to the edgesof the kernel so that the chain will converge to any desireddistribution[14].Unfortunately,for some natural choices ofthe Markov kernel,the Metropolis Markov chain can con-temperature,is the goal distribution from which we wish to generate samples;at the highest temperature,is typically less interesting,but the rate of convergence is fast.A Markov chain that keeps modifying the distribution,interpolating between these two extremes,may produceuseful samples efficiently.Despite the extensive use of simulated tempering and swapping in practice,there hasbeen little formal analysis.A notable exception is work byMadras and Zheng[10]showing that swapping converges quickly for two simple,symmetric distributions,includingthe mean-field Ising model.1.2Results.In this work,we show that for the meanfieldPotts model,tempering and swapping require exponential time to converge to equilibrium.The slow convergenceof the tempering chain on the Potts model is caused by afirst-order(discontinuous)phase transition.In contrast,the Ising model studied by Madras and Zheng has a second-order(continuous)phase transition,which distinguishes why tempering works for one model and not the other.In addition,we give thefirst Markov chain algorithmthat is provably rapidly mixing on the Potts model.Tradi-tionally,swapping is implemented by defining a set of in-terpolating distributions where a parameter corresponding totemperature is varied.We make use of the fact that there is greaterflexibility in how we define the set of interpolants.Finally,our analysis extends the arguments of Madras and Zheng showing that swapping is fast on symmetric distribu-tions so as to include asymmetric generalizations.2Preliminaries2.1The-state Potts model.The Potts model was defined by R.B.Potts in1952to study ferromagnetism andanti-ferromagnetism[15].The interactions between particlesare modeled by an underlying graph with edges between particles that influence each other.Each of the verticesof the underlying graph is assigned one of differentspins(or colors).A configuration is an assignment of spins to the vertices,where denotes thespin at the vertex.The energy of a configuration is a function of the Hamiltonianwhere is the Kronecker-function that takes the value1if its arguments are equal and zero otherwise.Whenthe model corresponds to the ferromagnetic case where neighbors prefer the same color,while corresponds to the anti-ferromagnetic case where neighbors prefer to be differently colored.The state space of the-state ferromagnetic Potts model is the space of all-colorings of.We will thus use colorings and configurations interchangeably.Define the inverse temperaturewhere is the normalizing constant.Note that at,this is just the uniform distribution on all(not necessarily proper)-colorings of.We consider the ferromagnetic mean-field model where is the complete graph on vertices and all pairs of particles influence each other.For the3-state Potts model, .Let,and be the number of vertices assigned thefirst,second,and third colors.Letting ,we can rewrite the Gibbs distribution for the3-state Potts model asD EFINITION2.2.Let,then the mixing time isis rapidly mixing if the mixing time is bounded above by a polynomial in andconnect the state space,where vertices are configurations andedges are allowable1-step transitions.The transition proba-bilities on are defined asfor all,neighbors in,where is the maximumdegree of.It is easy to verify that if the kernel is connected then is the stationary distribution.For the Potts model,a natural choice for the Markovkernel is to connect configurations at Hamming distance one.Unfortunately,for large values of,the Metropolis algorithm converges exponentially slowly on the Potts modelfor this kernel[1,2].This is because the most probable states are largely monochromatic and to go from a predominantlyred configuration to a predominantly blue one we would have to pass through states that are highly unlikely at low temperatures.2.2.2Simulated tempering.Simulated tempering at-tempts to overcome this bottleneck by introducing a temper-ature parameter that is varied during the simulation,effec-tively modifying the distribution being sampled from.Letbe a set of inverse temperatures.The state space of the tempering chain iswhich we can think of as the union of copies of theoriginal state space,each corresponding to a different in-verse temperature.Our choice of corresponds to in-finite temperature where the Metropolis algorithm convergesrapidly to stationarity(on the uniform distribution),andis the inverse temperature at which we wish to sample.Weinterpolate by settingThe tempering Markov chain consists of two types of moves: level moves,which update the configuration while keeping the temperaturefixed,and temperature moves,which update the temperature while remaining at the same configuration.A level move Here is the Metropolis probability of going from to according to the stationary probability.A temperature move(the uniform distribution),for.A configuration in the swapping chain is an-tuple, where each component represents a configuration chosen from the distribution.The probability distribution is the product measureThe swapping chain also consists of two types of moves:A level moveA swap moveNotice that now the normalizing constants cancel out. Hence,implementing a move of the swapping chain is straightforward,unlike tempering where good approxima-tions for the partition functions are required.Zheng proved that fast mixing of the swapping chain implies fast mixing of the tempering chain[17],although the converse is unknown.For both tempering and swapping,we must be careful about how we choose the number of distributions. It is important that successive distributions and have sufficiently small variation distance so that temperature moves are accepted with nontrivial probability.However, must be small enough so that it does not blow up the running time of the algorithm.Following[10],we set. This ensures that for the values of at which we wish to sample,the ratio of and is bounded from above and below by a constant.3Torpid Mixing of Tempering on the Potts modelWe will show lower bounds on the mixing time of the tem-pering chain on the mean-field Potts model by bounding the spectral gap of the transition matrix of the chain.Letbe the eigenvalues of the transition ma-trix,so that for all.Let.The mixing time is related to the spectral gap of the chain by the following theorem(see[16]):T HEOREM3.1.Let.For all,(a).(b).The conductance,introduced by Jerrum and Sinclair,pro-vides a good measure of the mixing rate of a chain[7].For ,letThen,the conductance1is given byIt has been shown by Jerrum and Sinclair[7]that,for any reversible chain,the spectral gap satisfiesT HEOREM3.2.For any Markov chain with conductance and eigenvalue gap,1It suffices to minimize over,for any polynomial; this decreases the conductance by at most a polynomial factor(see[16]).Thus,to lower bound the mixing time it is sufficient to show that the conductance is small.If a chain converges rapidly to its stationary distribution it must have large conductance,indicating the absence of “bad cut,”i.e.,a set of edges of small capacity separating fromLet denote the set of configurations,where; and,configurations whereThis implieswhich occurs whengives the desired result.(ii)LetThefirst part of the lemma verifies that at the critical temperature there are3ordered modes(one for each color,by symmetry)and1disordered mode.In the next lemmas, we show that the disordered mode is separated from the ordered modes by a region of exponentially low density.To do this,we use the second part Lemma3.1and show that bounds the density of the separating region at each.Let, be the continuous extension of the discrete function.L EMMA3.2.For sufficiently large,the real functionand attains its maximum aton this line,wefindNeglecting factors not dependent on and simplifying usingStirling’s formula,we need to check for the stationary points of the function,we compare the quantities,where.AtAs is decreased,the slope of the lineis independent of.varies with.Thus,for some function,we have The claim follows by the second part of Lemma3.1..Let.Letbe the boundary of .The set defines a bad cut in the state space of the tempering chain.T HEOREM3.3.For sufficiently large,there exists such that.ing the definition of conductance,we havewhere.are within a linear factor of each other.By Theorem3.2the upper bound on bounds the spectral gap of the tempering chain at the inverse temperature.Applying Theorem3.1,wefind the tempering chain for the3-state Potts model mixes slowly.As a consequence of Zheng’s demonstrating that rapid mixing of the swapping chain implies fast mixing of the tempering chain[17],we also have established the slow mixing of the swapping chain for the mean-field Pott model.4Modifying the Swapping Algorithm for Rapid Mixing We now reexamine the swapping chain on two classes of distributions:one is an asymmetric exponential distribution (generalizing a symmetric distribution studied by Madras and Zheng[10]),and the other a class of the mean-field models.First,we show that swapping and tempering are fast on the exponential distribution.The proofs suggest that a key idea behind designing fast sampling algorithms for models withfirst-order phase transitions is to define a new set of interpolants that do not preserve the bad cut.We start with a careful examination of the exponential distribution since the proofs easily generalize to the new swapping algorithm applied to bimodal mean-field models.Example I:where is the normalizing constant.Define the interpolat-ing distributions for the swapping chain aswhere is a normalizing constant.T HEOREM4.1.The swapping chain with inverse tempera-tures,whereThe comparison theorem of Diaconis and Saloff-Coste is useful in bounding the mixing time of a Markov chain when the mixing time of a related chain on the same state space is known.Let and be two Markov chains on.Let and be the transition matrix and stationary distri-butions of and let and be those of.Letandbe sets of directed edges.Forsuch that,define a path,a sequence of states such that. Let denote the set of endpoints of paths that use the edge.T HEOREM4.2.(Diaconis and Saloff-Coste[3])Decomposition:. Define the projection4.1Swapping on the exponential distribution.We are now prepared to prove Theorem4.1.The state space for the swapping chain applied to Example I is.D EFINITION4.1.Let.The trace Tr where if and if,.The possible values of the trace characterize the partition we use.Letting be the set of configurations with trace,we have the decompositionThis partition of into sets offixed trace sets the stage for the decomposition theorem.The restrictions simulate the swapping Markov chain on regions offixed trace. The projectionIf we temporarily ignore swap moves on the restrictions, the restricted chains move independently according to the Metropolis probabilities on each of the distributions. The following lemma reduces the analysis of the restricted chains to analyzing the moves of at eachfixed tempera-ture.L EMMA4.1.(Diaconis and Saloff-Coste[3])For,let be a reversible Markov chain on afinite state space.Consider the product Markov chain on the product space,defined byNow restricted to each of the distributions is unimodal,suggesting that should be rapidly mixing at each temperature.Madras and Zheng formalize this in[10] and show that the Metropolis chain restricted to the positive or negative parts of mixes quickly.Thus,from Lemma 4.1and following the arguments in[10],we can conclude that each of the restricted Markov chains is rapidly mixing.Bounding the mixing rate of the projection:is an dimensional hypercube.The stationary probabilities of the projection chain are given by.This captures the idea that for the true projection chain,swap moves(transpositions)always have constant probability,and at the highest temperature there is high probability of changing sign.Of course there is a chance offlipping the bit at each higher temperature,but we will see that this is not even necessary for rapid mixing.To analyze RW1,we can compare it to an even simpler walk,RW2,that chooses any bit at random and updates it to 0or1with the correct stationary probabilities.It is easy to argue that RW2converges very quickly and we use this to infer the fast mixing of RW1.More precisely,let be a new chain on the hypercube for the purpose of the comparison.At each step it picksand updates the component by choosing exactly according to the appropriate stationary distribution at.In other words,the component is at stationarity as soon as it is ing the coupon collector’s theorem,we haveL EMMA4.2.The chain on mixes in timeand.We are now in a position to prove the following theorem. T HEOREM4.4.The projection.Letbe a single transition in from tothatflips the bit.The canonical path from to is the concatenation of three paths.In terms of tempering,is a heating phase and is a cooling phase.consists of swap moves from to;consists of one step thatflips the bit corre-sponding to the highest temperature to move to;consists of swaps until we reach.To bound in Theorem4.2,we will establish that(Transitions along)Let and.(4.2)First we considerLet us assume,without loss of generality,thatThen we have Case2:Therefore,again wefind equation4.1is satisfied.Case3:By the comparison theorem wefind thatFix con-stants and let be a large integer. The state space of the mean-field model consists of all spin configurations on the complete graph,namelyThe probability distribution over these configurations is determined by,inverse temperature, and,the-wise interactions between particles.The Hamiltonian is given bywhere is the Kronecker-function that takes the value1 if all of the arguments are equal and is0otherwise(whenwe set=1iff).The Gibbs distribution is where is the normalizing constant.This can be de-scribed by the model in Example II by takingand.It can be shown that this distribution is bi-modal for all values of and.A second important special case included in Example II is the-state Potts model where we restrict to the part of the state space such that.Note that.Consequently,sampling from is sufficient since we can randomly permute the colors once we obtain a sample and get a random configuration of the Potts model on the nonrestricted state space.Here we take,and and the Gibbs distribution becomes wherewhere is another normalizing con-stant.When is taken to be the constant function,then we obtain the distributions of the usual swapping algorithm. The Flat-Swap Algorithm:We shall see that this graduallyflattens out the total spins distributions uniformly,thus eliminating the bad cut that can occur when we take constant.The function effectively dampens the entropy(multinomial)just as the change in temperature dampens the energy term coming from the Hamiltonian.We have the following theorem.T HEOREM4.5.The Flat-Swap algorithm is rapidly mixing for any bimodal mean-field model.To prove Theorem4.5,we follow the strategy set forth for Theorem4.1,using decomposition and comparison in a similar manner.For simplicity,we concentrate our exposi-tion here on the Ising model in an externalfield.The advan-tage of this special case is that the total spins configurations form a one-parameter family(i.e.,the number of vertices as-signed+1),much like in Example I.The proofs for the gen-eral class of models,including the Potts model on,are analogous.We sketch the proof of Theorem4.5.For the Ising model,we have.Note that is easy to compute given.A simple calculation reveals that,forThus,all the total spins distributions have the same relative shape,but getflatter as is decreased.This no longerpreserves the non-analytic nature of the phase transition seen for the usual swap algorithm.It is this property that makes this choice of distributions useful.The total spins distribution for the Ising model is known to be bimodal,even in the presence of an externalfield.With our choice of interpolants,it now follows that all distributions are bimodal as well.Moreover,the minima of the distributions occur at the same location for all distributions.Let be the place at which these minima occur.In order to show that this swapping chain is rapidly mixing we use decomposition.Let be the state space of the swapping chain on the Ising model,where.Define the trace Tr, where if the number of s in is less than and let if the number of s in is at least.The analysis of the restricted chains given in[10]in the context of the Ising model without an externalfield can be readily adapted to show the restrictions are also rapidly mixing.The analysis of the projection is analogous to the arguments used to bound the mixing rate of the projection for Example I.Hence,we can conclude that the swapping algorithm is rapidly mixing for the mean-field Ising model at any temperature,with any externalfield.We leave the details,including the extension to the Potts model,for the full version of the paper.5ConclusionsSwapping,tempering and annealing provide a means,exper-imentally,for overcoming bottlenecks controlling the slow convergence of Markov chains.However,our results of-fer rigorous evidence that heuristics based on these methods might be incorrect if samples are taken after only a poly-nomial number of steps.In recent work,we have extended the arguments presented here to show an even more surpris-ing result;tempering can actually be slower than thefixed temperature Metropolis algorithm by an exponential multi-plicative factor.Many other future directions present themselves.It would be worthwhile to continue understanding examples when the standard(temperature based)interpolants fail to lead to efficient algorithms,but nonetheless variants of the swapping algorithm,such as presented in Section4.3,suc-ceed.The difficulty in extending our methods to more in-teresting examples,such as the Ising and Potts models on lattices,is that it is not clear how to define the interpolants. We would want a way to slowly modify the the entropy term in addition to the temperature,as we did in the mean-field case,to avoid the bad cut arising from the phase transition. It would be worthwhile to explore whether it is possible to determine a good set of interpolants algorithmically by boot-strapping,rather than analytically,as was done here,to de-fine a more robust family of tempering-like algorithms.AcknowledgmentsThe authors thank Christian Borgs,Jennifer Chayes,Claire Kenyon,and Elchanan Mossel for useful discussions. References[1] C.Borgs,J.T.Chayes, A.Frieze,J.H.Kim,P.Tetali,E.Vigoda,and V.H.Vu.Torpid mixing of some MCMC algorithms in statistical physics.Proc.40th IEEE Symposium on Foundations of Computer Science,218–229,1999.[2] C.Cooper,M.E.Dyer,A.M.Frieze,and R.Rue.MixingProperties of the Swendsen-Wang Process on the Complete Graph and Narrow Grids.J.Math.Phys.41:1499–1527: 2000.[3]P.Diaconis and parison theorems forreversible Markov chains.Annals of Applied Probability.3: 696–730,1993.[4] C.J.Geyer.Markov Chain Monte Carlo Maximum Likeli-puting Science and Statistics:Proceedings of the 23rd Symposium on the Interface(E.M.Keramidas,ed.),156-163.Interface Foundation,Fairfax Sta tion,1991.[5] C.J.Geyer and E.A.Thompson.Annealing Markov ChainMonte Carlo with Applications to Ancestral Inference.J.Amer.Statist.Assoc.90:909–920,1995.[6]V.K.Gore and M.R.Jerrum.The Swendsen-Wang ProcessDoes Not Always Mix Rapidly.J.Statist.Phys.97:67–86, 1995.[7]M.R.Jerrum and A.J.Sinclair.Approximate counting,uni-form generation and rapidly mixing Markov rma-tion and Computation.82:93–133,1989.[8]S.Kirkpatrick,L.Gellatt Jr.,and M.Vecchi.Optimization bysimulated annealing.Science.220:498–516,1983.[9]N.Madras and D.Randall.Markov chain decomposition forconvergence rate analysis.Annals of Applied Probability.12: 581–606,2002.[10]N.Madras and Z.Zheng.On the swapping algorithm.Random Structures and Algorithms.22:66–97,2003. [11] E.Marinari and G.Parisi.Simulated tempering:a new MonteCarlo scheme.Europhys.Lett.19:451–458,1992.[12]R.A.Martin and D.Randall.Sampling adsorbing staircasewalks using a new Markov chain decomposition method.Proc.41st Symposium on the Foundations of Computer Sci-ence(FOCS2000),492–502,2000.[13]R.A.Martin and D.Randall.Disjoint decomposition withapplications to sampling circuits in some Cayley graphs.Preprint,2003.[14]N.Metropolis,A.W.Rosenbluth,M.N.Rosenbluth,A.H.Teller,and E.Teller.Equation of state calculations by fast computing machines.Journal of Chemical Physics,21: 1087–1092,1953.[15]R.B.Potts.Some Generalized Order-disorder Transforma-tions Proceedings of the Cambridge Philosophical Society, 48:106–109,1952.[16] A.J.Sinclair.Algorithms for random generation&counting:a Markov chain approach.Birkh¨a user,1993.[17]Z.Zheng.Analysis of Swapping and Tempering Monte CarloAlgorithms.Dissertation,York Univ.,1999.。
DID SECURITIZATION LEAD TO LAX SCREENING?EVIDENCE FROM SUBPRIME LOANS∗B ENJAMIN J.K EYST ANMOY M UKHERJEEA MIT S ERUV IKRANT V IGA central question surrounding the current subprime crisis is whether the se-curitization process reduced the incentives offinancial intermediaries to carefully screen borrowers.We examine this issue empirically using data on securitized subprime mortgage loan contracts in the United States.We exploit a specific rule of thumb in the lending market to generate exogenous variation in the ease of securitization and compare the composition and performance of lenders’portfolios around the ad hoc threshold.Conditional on being securitized,the portfolio with greater ease of securitization defaults by around10%–25%more than a similar risk profile group with a lesser ease of securitization.We conduct additional anal-yses to rule out differential selection by market participants around the threshold and lenders employing an optimal screening cutoff unrelated to securitization as alternative explanations.The results are confined to loans where intermediaries’screening effort may be relevant and soft information about borrowers determines their creditworthiness.Ourfindings suggest that existing securitization practices did adversely affect the screening incentives of subprime lenders.I.I NTRODUCTIONSecuritization,converting illiquid assets into liquid securi-ties,has grown tremendously in recent years,with the universe of securitized mortgage loans reaching$3.6trillion in2006.The ∗We thank Viral Acharya,EffiBenmelech,Patrick Bolton,Daniel Bergstresser,Charles Calomiris,Douglas Diamond,John DiNardo,Charles Good-hart,Edward Glaeser,Dwight Jaffee,Chris James,Anil Kashyap,Jose Liberti, Gregor Matvos,Chris Mayer,Donald Morgan,Adair Morse,Daniel Paravisini, Karen Pence,Guillaume Plantin,Manju Puri,Mitch Petersen,Raghuram Ra-jan,Uday Rajan,Adriano Rampini,Joshua Rauh,Chester Spatt,Steve Schaefer, Henri Servaes,Morten Sorensen,Jeremy Stein,James Vickery,Annette Vissing-Jorgensen,Paul Willen,three anonymous referees,and seminar participants at Boston College,Columbia Law,Duke,the Federal Reserve Bank of Philadel-phia,the Federal Reserve Board of Governors,the London Business School,the London School of Economics,Michigan State,NYU Law,Northwestern,Oxford, Princeton,Standard and Poor’s,the University of Chicago Applied Economics Lunch,and the University of Chicago Finance Lunch for useful discussions.We also thank numerous conference participants for their comments.Seru thanks the Initiative on Global Markets at the University of Chicago forfinancial sup-port.The opinions expressed in the paper are those of the authors and do not reflect the views of the Board of Governors of the Federal Reserve System or Sorin Capital Management.Shu Zhang provided excellent research assistance. All remaining errors are our responsibility.benjamin.j.keys@,tmukherjee@ ,amit.seru@,vvig@.C 2010by the President and Fellows of Harvard College and the Massachusetts Institute of Technology.The Quarterly Journal of Economics,February2010307308QUARTERLY JOURNAL OF ECONOMICSoption to sell loans to investors has transformed the traditional role offinancial intermediaries in the mortgage market from“buy-ing and holding”to“buying and selling.”The perceived benefits of thisfinancial innovation,such as improving risk sharing and reducing banks’cost of capital,are widely cited(e.g.,Pennacchi [1988]).However,delinquencies in the heavily securitized sub-prime housing market increased by50%from2005to2007,forcing many mortgage lenders out of business and setting off a wave offi-nancial crises,which spread worldwide.In light of the central role of the subprime mortgage market in the current crisis,critiques of the securitization process have gained increased prominence (Blinder2007;Stiglitz2007).The rationale for concern over the“originate-to-distribute”model during the crisis derives from theories offinancial inter-mediation.Delegating monitoring to a single lender avoids the duplication,coordination failure,and free-rider problems associ-ated with multiple lenders(Diamond1984).However,for a lender to screen and monitor,it must be given appropriate incentives (H¨o lmstrom and Tirole1997),and this is provided by the illiquid loans on its balance sheet(Diamond and Rajan2003).By creating distance between a loan’s originator and the bearer of the loan’s default risk,securitization may have potentially reduced lenders’incentives to carefully screen and monitor borrowers(Petersen and Rajan2002).On the other hand,proponents of securitization argue that reputation concerns,regulatory oversight,or sufficient balance sheet risk may have prevented moral hazard on the part of lenders.What the effects of existing securitization practices on screening were thus remains an empirical question.This paper investigates the relationship between securitiza-tion and screening standards in the context of subprime mortgage loans.The challenge in making a causal claim is the difficulty of isolating differences in loan outcomes independent of contract and borrower characteristics.First,in any cross section of loans,those that are securitized may differ on observable and unobservable risk characteristics from loans that are kept on the balance sheet (not securitized).Second,in a time-series framework,simply doc-umenting a correlation between securitization rates and defaults may be insufficient.This inference relies on establishing the opti-mal level of defaults at any given point in time.Moreover,this ap-proach ignores macroeconomic factors and policy initiatives that may be independent of lax screening and yet may induce composi-tional differences in mortgage borrowers over time.For instance,DID SECURITIZATION LEAD TO LAX SCREENING?309 house price appreciation and the changing role of government-sponsored enterprises(GSEs)in the subprime market may also have accelerated the trend toward originating mortgages to riskier borrowers in exchange for higher payments.We overcome these challenges by exploiting a specific rule of thumb in the lending market that induces exogenous variation in the ease of securitization of a loan compared to another loan with similar observable characteristics.This rule of thumb is based on the summary measure of borrower credit quality known as the FICO score.Since the mid-1990s,the FICO score has become the credit indicator most widely used by lenders,rating agen-cies,and investors.Underwriting guidelines established by the GSEs,Fannie Mae and Freddie Mac,standardized purchases of lenders’mortgage loans.These guidelines cautioned against lend-ing to risky borrowers,the most prominent rule of thumb being not lending to borrowers with FICO scores below620(Avery et al. 1996;Loesch1996;Calomiris and Mason1999;Freddie Mac2001, 2007;Capone2002).1Whereas the GSEs actively securitized loans when the nascent subprime market was relatively small,since 2000this role has shifted entirely to investment banks and hedge funds(the nonagency sector).We argue that persistent adherence to this ad hoc cutoff by investors who purchase securitized pools from nonagencies generates a differential increase in the ease of securitization for loans.That is,loans made to borrowers which fall just above the620credit cutoff have a higher unconditional likelihood of being securitized and are therefore more liquid than loans below this cutoff.To evaluate the effect of securitization on screening decisions, we examine the performance of loans originated by lenders around this threshold.As an example of our design,consider two borrow-ers,one with a FICO score of621(620+)and the other with a FICO score of619(620−),who approach the lender for a loan. Screening to evaluate the quality of the loan applicant involves collecting both“hard”information,such as the credit score,and “soft”information,such as a measure of future income stability of the borrower.Hard information,by definition,is something that is easy to contract upon(and transmit),whereas the lender has to exert an unobservable effort to collect soft information(Stein 2002).We argue that the lender has a weaker incentive to base1.We discuss the620rule of thumb in more detail in Section III and in reference to other cutoffs in the lending market in Section IV.G.310QUARTERLY JOURNAL OF ECONOMICSorigination decisions on both hard and soft information,less care-fully screening the borrower,at620+,where there is an increase in the relative ease of securitization.In other words,because in-vestors purchase securitized loans based on hard information,the cost of collecting soft information is internalized by lenders when screening borrowers at620+to a lesser extent than at620−.There-fore,by comparing the portfolio of loans on either side of the credit score threshold,we can assess whether differential access to se-curitization led to changes in the behavior of lenders who offered these loans to consumers with nearly identical risk profiles.Using a sample of more than one million home purchase loans during the period2001–2006,we empirically confirm that the number of loans securitized varies systematically around the 620FICO cutoff.For loans with a potential for significant soft information—low documentation loans—wefind that there are more than twice as many loans securitized above the credit thresh-old at620+than below the threshold at620−.Because the FICO score distribution in the population is smooth(constructed from a logistic function;see Figure I),the underlying creditworthiness and demand for mortgage loans(at a given price)are the same for prospective buyers with a credit score of either620−or620+. Therefore,these differences in the number of loans confirm that the unconditional probability of securitization is higher above the FICO threshold;that is,it is easier to securitize620+loans.Strikingly,wefind that although620+loans should be of slightly better credit quality than those at620−,low-documentation loans that are originated above the credit threshold tend to default within two years of origination at a rate10%–25%higher than the mean default rate of5%(which amounts to roughly a0.5%–1%increase in delinquencies).As this result is conditional on observable loan and borrower character-istics,the only remaining difference between the loans around the threshold is the increased ease of securitization.Therefore, the greater default probability of loans above the credit threshold must be due to a reduction in screening by lenders.Because our results are conditional on securitization,we con-duct additional analyses to address selection on the part of bor-rowers,lenders,or investors as explanations for differences in the performance of loans around the credit threshold.First,we rule out borrower selection on observables,as the loan terms and borrower characteristics are smooth across the FICO score thresh-old.Next,selection of loans by investors is mitigated because theDID SECURITIZATION LEAD TO LAX SCREENING?311D e n s i t yFICO F IGURE IFICO Distribution (U.S.Population)The figure presents the FICO distribution in the U.S.population for 2004.The data are from an anonymous credit bureau,which assures us that the data exhibit similar patterns during the other years of our sample.The FICO distribution across the population is smooth,so the number of prospective borrowers in the local vicinity of a given credit score is similar.decisions of investors (special purpose vehicles,SPVs)are based on the same (smooth–through the threshold)loan and borrower variables as in our data (Kornfeld 2007).Finally ,strategic adverse selection on the part of lenders may also be a concern.However,lenders offer the entire pool of loans to investors,and,conditional on observables,SPVs largely follow a randomized selection rule to create bundles of loans out of these pools,suggesting that securitized loans would look similar to those that remain on the balance sheet (Comptroller’s Handbook 1997;Gorton and Souleles 2006).Furthermore,if at all present,this selection will tend to be more severe below the threshold,thereby biasing the results against our finding any screening effect.We also constrain our analysis to a subset of lenders who are not sus-ceptible to strategic securitization of loans.The results for these lenders are qualitatively similar to the findings using the full sample,highlighting that screening is the driving force behind our results.312QUARTERLY JOURNAL OF ECONOMICSCould the620threshold be set by lenders as an optimal cut-off for screening that is unrelated to differential securitization? We investigate further using a natural experiment in the pas-sage and subsequent repeal of antipredatory laws in New Jersey (2002)and Georgia(2003)that varied the ease of securitization around the threshold.If lenders used620as an optimal cutoff for screening unrelated to securitization,we would expect the pas-sage of these laws to have no effect on the differential screening standards around the threshold.However,if these laws affected the differential ease of securitization around the threshold,our hypothesis would predict an impact on the screening standards. Our results confirm that the discontinuity in the number of loans around the threshold diminished during a period of strict enforce-ment of antipredatory lending laws.In addition,there was a rapid return of a discontinuity after the law was revoked.Importantly, our performance results follow the same pattern,that is,screen-ing differentials attenuated only during the period of enforcement. Taken together,this evidence suggests that our results are indeed related to differential securitization at the credit threshold and that lenders did not follow the rule of thumb in all instances. Importantly,the natural experiment also suggests that prime-influenced selection is not at play.Once we have confirmed that lenders are screening more rig-orously at620−than620+,we assess whether borrowers were aware of the differential screening around the threshold.Although there is no difference in contract terms around the cutoff,bor-rowers may have an incentive to manipulate their credit scores in order to take advantage of differential screening around the threshold(consistent with our central claim).Aside from out-right fraud,it is difficult to strategically manipulate one’s FICO score in a targeted manner and any actions to improve one’s score take relatively long periods of time,on the order of three to six months(Fair Isaac).Nonetheless,we investigate further using the same natural experiment evaluating the performance effects over a relatively short time horizon.The results reveal a rapid return of a discontinuity in loan performance around the620 threshold,which suggests that rather than manipulation,our re-sults are largely driven by differential screening on the part of lenders.As a test of the role of soft information in screening incen-tives of lenders,we investigate the full documentation loan mar-ket.These loans have potentially significant hard informationDID SECURITIZATION LEAD TO LAX SCREENING?313 because complete background information about the borrower’s ability to repay is provided.In this market,we identify another credit cutoff,a FICO score of600,based on the advice of the three credit repositories.Wefind that twice as many full documenta-tion loans are securitized above the credit threshold at600+as below the threshold at600−.Interestingly,however,wefind no significant difference in default rates of full documentation loans originated around this credit threshold.This result suggests that despite a difference in ease of securitization across the thresh-old,differences in the returns to screening are attenuated due to the presence of more hard information.Ourfindings for full docu-mentation loans suggest that the role of soft information is crucial to understanding what worked and what did not in the existing securitized subprime loan market.We discuss this issue in more detail in Section VI.This paper connects several strands of the literature.Our evidence sheds new light on the subprime housing crisis,as discussed in the contemporaneous work of Doms,Furlong,and Krainer(2007),Gerardi,Shapiro,and Willen(2007),Dell’Ariccia, Igan,and Laeven(2008),Mayer,Piskorski,and Tchistyi(2008), Rajan,Seru,and Vig(2008),Benmelech and Dlugosz(2009),Mian and Sufi(2009),and Demyanyk and Van Hemert(2010).2This paper also speaks to the literature that discusses the benefits (Kashyap and Stein2000;Loutskina and Strahan2007),and the costs(Morrison2005;Parlour and Plantin2008)of securitization. In a related line of research,Drucker and Mayer(2008)document how underwriters exploit inside information to their advantage in secondary mortgage markets,and Gorton and Pennacchi(1995), Sufi(2006),and Drucker and Puri(2009)investigate how contract terms are structured to mitigate some of these agency conflicts.3 The rest of the paper is organized as follows.Section II pro-vides a brief overview of lending in the subprime market and de-scribes the data and sample construction.Section III discusses the framework and empirical methodology used in the paper,whereas Sections IV and V present the empirical results in the paper.Sec-tion VI concludes.2.For thorough summaries of the subprime mortgage crisis and the research which has sought to explain it,see Mayer and Pence(2008)and Mayer,Pence,and Sherlund(2009).3.Our paper also sheds light on the classic liquidity/incentives trade-off that is at the core of thefinancial contracting literature(see Coffee[1991],Diamond and Rajan[2003],Aghion,Bolton,and Tirole[2004],and DeMarzo and Urosevic [2006]).314QUARTERLY JOURNAL OF ECONOMICSII.L ENDING IN THE S UBPRIME M ORTGAGE M ARKETII.A.BackgroundApproximately60%of outstanding U.S.mortgage debt is traded in mortgage-backed securities(MBS),making the U.S.sec-ondary mortgage market the largestfixed-income market in the world(Chomsisengphet and Pennington-Cross2006).The bulk of this securitized universe($3.6trillion outstanding as of January 2006)is composed of agency pass-through pools—those issued by Freddie Mac,Fannie Mae,and Ginnie Mae.The remainder,ap-proximately,$2.1trillion as of January2006,has been securitized in nonagency securities.Although the nonagency MBS market is relatively small as a percentage of all U.S.mortgage debt,it is nevertheless large on an absolute dollar basis.The two mar-kets are separated based on the eligibility criteria of loans that the GSEs have established.Broadly,agency eligibility is estab-lished on the basis of loan size,credit score,and underwriting standards.Unlike the agency market,the nonagency(referred to as“sub-prime”in the paper)market was not always this size.This mar-ket gained momentum in the mid-to late1990s.Inside B&C Lending—a publication that covers subprime mortgage lending extensively—reports that total subprime lending(B&C origina-tions)grew from$65billion in1995to$500billion in2005.Growth in mortgage-backed securities led to an increase in securitization rates(the ratio of the dollar value of loans securitized divided by the dollar value of loans originated)from less than30%in1995 to over80%in2006.From the borrower’s perspective,the primary feature distin-guishing between prime and subprime loans is that the up-front and continuing costs are higher for subprime loans.4The sub-prime mortgage market actively prices loans based on the risk associated with the borrower.Specifically,the interest rate on the loan depends on credit scores,debt-to-income ratios,and the doc-umentation level of the borrower.In addition,the exact pricing may depend on loan-to-value ratios(the amount of equity of the borrower),the length of the loan,theflexibility of the interest rate(adjustable,fixed,or hybrid),the lien position,the property4.Up-front costs include application fees,appraisal fees,and other fees associ-ated with originating a mortgage.The continuing costs include mortgage insurance payments,principal and interest payments,late fees for delinquent payments,and fees levied by a locality(such as property taxes and special assessments).DID SECURITIZATION LEAD TO LAX SCREENING?315 type,and whether stipulations are made for any prepayment penalties.5For investors who hold the eventual mortgage-backed secu-rity,credit risk in the agency sector is mitigated by an implicit or explicit government guarantee,but subprime securities have no such guarantee.Instead,credit enhancement for nonagency deals is in most cases provided internally by means of a deal struc-ture that bundles loans into“tranches,”or segments of the overall portfolio(Lucas,Goodman,and Fabozzi2006).II.B.DataOur primary data set contains individual loan data leased from LoanPerformance.The database is the only source that pro-vides a detailed perspective on the nonagency securities market. The data include information on issuers,broker dealers/deal un-derwriters,servicers,master servicers,bond and trust adminis-trators,trustees,and other third parties.As of December2006, more than eight thousand home equity and nonprime loan pools (over seven thousand active)that include16.5million loans(more than seven million active)with over$1.6trillion in outstanding balances were included.LoanPerformance estimates that as of 2006,the data cover over90%of the subprime loans that are securitized.6The data set includes all standard loan application variables such as the loan amount,term,LTV ratio,credit score, and interest rate type—all data elements that are disclosed and form the basis of contracts in nonagency securitized mortgage pools.We now describe some of these variables in more detail.For our purpose,the most important piece of information about a particular loan is the creditworthiness of the borrower. The borrower’s credit quality is captured by a summary measure called the FICO score.FICO scores are calculated using vari-ous measures of credit history,such as types of credit in use and5.For example,the rate and underwriting matrix of Countrywide Home Loans Inc.,a leading lender of prime and subprime loans,shows how the credit score of the borrower and the loan-to-value ratio are used to determine the rates at which different documentation-level loans are made().6.Note that only loans that are securitized are reported in the LoanPerfor-mance munication with the database provider suggests that the roughly10%of loans that are not reported are for privacy concerns from lenders. Importantly for our purpose,the exclusion is not based on any selection crite-ria that the vendor follows(e.g.,loan characteristics or borrower characteristics). Moreover,based on estimates provided by LoanPerformance,the total number of nonagency loans securitized relative to all loans originated has increased from about65%in early2000to over92%since2004.316QUARTERLY JOURNAL OF ECONOMICSamount of outstanding debt,but do not include any information about a borrower’s income or assets(Fishelson-Holstein2005). The software used to generate the score from individual credit re-ports is licensed by the Fair Isaac Corporation to the three major credit repositories—TransUnion,Experian,and Equifax.These repositories,in turn,sell FICO scores and credit reports to lenders and consumers.FICO scores provide a ranking of potential bor-rowers by the probability of having some negative credit event in the next two years.Probabilities are rescaled into a range of 400–900,though nearly all scores are between500and800,with a higher score implying a lower probability of a negative event. The negative credit events foreshadowed by the FICO score can be as small as one missed payment or as large as bankruptcy.Bor-rowers with lower scores are proportionally more likely to have all types of negative credit events than are borrowers with higher scores.FICO scores have been found to be accurate even for low-income and minority populations(see Fair Isaac website www.myfi;also see Chomsisengphet and Pennington-Cross [2006]).More importantly,the applicability of scores available at loan origination extends reliably up to two years.By design,FICO measures the probability of a negative credit event over a two-year horizon.Mortgage lenders,on the other hand,are interested in credit risk over a much longer period of time.The continued acceptance of FICO scores in automated underwriting systems indicates that there is a level of comfort with their value in deter-mining lifetime default probability differences.7Keeping this as a backdrop,most of our tests of borrower default will examine the default rates up to24months from the time the loan is originated.Borrower quality can also be gauged by the level of documen-tation collected by the lender when taking the loan.The docu-ments collected provide historical and current information about the income and assets of the borrower.Documentation in the mar-ket(and reported in the database)is categorized as full,limited, or no documentation.Borrowers with full documentation provide verification of income as well as assets.Borrowers with limited documentation provide no information about their income but do7.An econometric study by Freddie Mac researchers showed that the pre-dictive power of FICO scores drops by about25%once one moves to a three to five–year performance window(Holloway,MacDonald,and Straka1993).FICO scores are still predictive,but do not contribute as much to the default rate prob-ability equation after thefirst two years.provide some information about their assets.“No-documentation”borrowers provide no information about income or assets,which is a very rare degree of screening lenience on the part of lenders. In our analysis,we combine limited and no-documentation bor-rowers and call them low-documentation borrowers.Our results are unchanged if we remove the very small portion of loans that are no-documentation.Finally,there is also information about the property beingfi-nanced by the borrower,and the purpose of the loan.Specifically, we have information on the type of mortgage loan(fixed rate,ad-justable rate,balloon,or hybrid)and the loan-to-value(LTV)ratio of the loan,which measures the amount of the loan expressed as a percentage of the value of the home.Typically loans are clas-sified as either for purchase or refinance,though for convenience we focus exclusively on loans for home purchases.8Information about the geography where the dwelling is located(ZIP code)is also available in the database.9Most of the loans in our sample are for owner-occupied single-family residences,townhouses,or condominiums(single-unit loans account for more than90%of the loans in our sam-ple).Therefore,to ensure reasonable comparisons,we restrict the loans in our sample to these groups.We also drop nonconven-tional properties,such as those that are FHA-or VA-insured or pledged properties,and also exclude buy down mortgages.We also exclude Alt-A loans,because the coverage for these loans in the database is limited.Only those loans with valid FICO scores are used in our sample.We conduct our analysis for the period January2001to December2006,because the securitization mar-ket in the subprime market grew to a meaningful size post-2000 (Gramlich2007).III.F RAMEWORK AND M ETHODOLOGYWhen a borrower approaches a lender for a mortgage loan,the lender asks the borrower tofill out a credit application.In addi-tion,the lender obtains the borrower’s credit report from the three credit bureaus.Part of the background information on the appli-cation and report could be considered“hard”information(e.g.,8.Wefind similar rules of thumb and default outcomes in the refinance market.9.See Keys et al.(2009)for a discussion of the interaction of securitization and variation in regulation,driven by the geography of loans and the type of lender.。
data-processing inequality的例子-回复Data processing inequality is a fundamental concept in information theory that states that no data processing can increase the amount of information contained in a dataset. This principle, formulated by Claude Shannon, is at the core of many theoretical and practical applications of data analysis and communication systems. In this article, we will explore the concept of data processing inequality and provide examples to illustrate its significance.To understand the data processing inequality, let's start by defining information. In information theory, information is measured by the amount of uncertainty reduced or eliminated when an event occurs. It is quantified using a metric called entropy. The entropy of a dataset represents the average amount of information carried by each item in the dataset.Now, suppose we have two datasets, A and B, where B is derived from A through some data processing algorithm. According to the data processing inequality, the average amount of information contained in B cannot be higher than the average amount of information in A. In other words, any data processing algorithm can only reduce or maintain the uncertainty in a dataset, but it cannotcreate new information or increase the amount of information.To better understand this concept, let's consider an example. Suppose we have a dataset A containing pictures of cats and dogs. Each picture is labeled as either "cat" or "dog." The entropy of the dataset A represents the average uncertainty or randomness associated with the process of predicting the label of a randomly chosen picture.Now, let's apply a data processing algorithm to dataset A, which extracts features from the pictures, such as color, texture, and shape. The processed dataset, B, contains the extracted features but not the original pictures. According to the data processing inequality, the entropy of B cannot be higher than the entropy of A. This means that the extracted features do not contain more information than the original pictures. They may contain less information due to the loss of specific image details during the feature extraction process.Another example that illustrates the data processing inequality is data compression. Suppose we have a dataset A consisting of a collection of documents. The entropy of A represents the averageuncertainty or randomness associated with predicting the next word in a randomly chosen document.If we apply a data compression algorithm to dataset A, the compressed dataset, B, will have a smaller size than dataset A, perhaps due to the removal of redundant information or through the use of more efficient coding schemes. However, according to the data processing inequality, the entropy of B cannot be higher than the entropy of A. This means that the compressed dataset does not contain more information than the original dataset. It may contain less information due to the lossy compression techniques employed.The data processing inequality has broad implications in various fields, including data analysis, communication systems, and machine learning. For example, in data analysis, the inequality reminds us that no matter how sophisticated our data processing techniques are, we cannot extract more information from the data than what is inherently present. It also guides the design of communication systems to ensure that data transmission and compression algorithms do not introduce information loss.In conclusion, the data processing inequality is a fundamental concept in information theory that states that no data processing can increase the amount of information contained in a dataset. Through examples such as image feature extraction and data compression, we have illustrated the significance of this principle. Understanding and applying the data processing inequality is crucial for effective data analysis and communication system design, helping to ensure the integrity and efficiency of information processing.。
假如我是一只鲨鱼英语作文全文共3篇示例,供读者参考篇1If I Were a SharkI can't help but daydream sometimes about what it would be like to be a different creature entirely. Sure, being human is pretty great - we have things like video games, pizza, and unlimited access to cat videos on the internet. But there's just something so fascinating about the animal world that makes me envious at times. If I could be any animal, I think I would choose to be a shark.I know what you're thinking - "Sharks? But they're giant killing machines!" Well, yes, sharks can be quite formidable predators. However, there's so much more to them than just their infamous jaws. For one, they're evolutionary marvels that have survived essentially unchanged for over 400 million years. That's approximately 200 million years before the first dinosaurs even existed! Talk about an ancient and veteran lineage.As a shark, I would be a member of the proud subclass Elasmobranchii, which includes sharks, rays, and skates. Wecartilaginous fishes are the awesomely older cousins of the bony fishes like tuna and salmon. While they get all the love from routine seafood consumers, we sharks are just straight-up harder and more hardcore.Can you imagine having a skeleton made entirely of cartilage rather than bone? It would make me literallysemi-flexible and able to squeeze through tiny crevices that my bony counterparts could never access. I'd basically be awater-based contortionist.Then there are the sensors. As a shark, my electro-sensitivity would be off the charts! I could detect the tiny electrical fields generated by muscle movements of potential prey from miles away. This bio-radar system would make me a finely-tuned hunting machine. No fish would be safe from my stealthy approach.But it wouldn't just be about the hunt. I'd also have an incredible sense of smell, with nostrils along the underside of my body that could pick up scent trails from incredible distances. I could track a single drop of blood in the ocean from a mile away. Not that I'd necessarily want to chomp whatever unlucky creatures were leaking, but hey, knowing is half the battle.My vision would be similarly enhanced for an aquatic lifestyle as a shark. In addition to excellent eyesight, I would possess the ability to detect tiny temperature variations through a special "sixth sense" organ called an ampullae of Lorenzini. Essentially, I'd be able to visualize swimmers and surfers from their body heat alone before they even knew I was there. Sneaky, sneaky.Of course, swimming itself would be a completely euphoric experience as a shark. I'd be sleek, streamlined, and able to cruise effortlessly at around 5-10 mph on average - a speed and grace unseen by clumsy terrestrial animals. I could travel hundreds of miles without getting tired thanks to my specialized body design for low-resistance motion through water.There would be no more reliance on inefficient forms of transportation that harm the planet. As a shark, I'd just keep on swimming endlessly in a totally eco-friendly manner. I'd basically be living the dream of a carbon-neutral existence!Speaking of dreams, can you imagine how crazy my sleep patterns would be? Like many sharks, I would likely fall into a sort of sleep-swimming trance where I could rest but continue making all the tiny movements needed to keep respiringthrough my gills. Naps and Netflix would take on a whole new meaning.Then again, maybe constantly swimming to breathe would get old after a while. I suppose I could always take a page from the incredible nurse sharks who have adapted away from the need for constant motion. With specially reinforced breathing muscles, they can simply laze around on the seafloor, breathing away without swimming at all. Nurse shark life would definitely be a contender for the path of most slothfulness.Of course, whichever mode of shark living I chose, the underwater scenery would always be breathtaking. I'd get to explore dazzling coral reefs, deep oceanic ridges, underwater caves and shipwrecks, all while silently gliding past in my new shark form. No longer bound by the limitations of human air supply, the ocean's bounty would be my eternal playground.There would be downsides though, I'll admit. I probably wouldn't get to hang out with my human friends and family very often for safety reasons. Trips to shopping malls and restaurants would also likely get a bit...hairy (well, I guess the proper term would be "finny"). My new eating habits of tearing prey to bits would take some getting used to as well.Still, I think the wondrous experiences of shark life would make up for the sacrifices. Can you imagine how epic an encounter with the great whites or tiger sharks would be if I were one of them? I'd get to be part of the most fearsome lineage of alpha predators our world has ever known.Or maybe I'd opt for a less bloodthirsty shark existence, cruising around tropical paradises as a whale shark - a 40-foot long, docile leviathan that sustains itself by filter feeding on microscopic plankton. Talk about an existence of pure, unadulterated bliss.Whatever species of shark I chose, perhaps the greatest aspect would be getting a first-hand glimpse into the deep wisdom and time-honed survival instincts of these ancient mariners. Sharks have outlived literally every single threat the planet has thrown at them through unimaginable eons of existence. They are relics with secrets to longevity that we humans can scarcely comprehend.As a shark, I would quite literally embody that precious ancient knowledge - an ambassador between eras as I swam purposefully onward, driven by the powerful currents of hundreds of millions of years of evolution. No longer just a human looking in, but a true living vessel of the sharks'time-honored journey and persistent dominion over the vast waters of our world. What an unparalleled perspective that would be!So yeah, while being a human is pretty neat, you can definitely see the appeal of swapping it out for life as one of the earth's most skillful, well-adapted, and successful predators. The sharks' way of experiencing our planet's oceans is something to be respected, studied, and daydreamed about by those of us stuck on dry land. If the opportunity to be reborn as one of these lords of the deep ever arose, I know I would seize it in a heartbeat (do sharks even have heartbeats?). The marine realm has so much left to teach us, and by far the best instructors would be the sharks.篇2If I Were a SharkHave you ever wondered what it would be like to be a shark? To glide through the vast depths of the ocean, a powerful apex predator feared and respected by all? As a student, I often find myself daydreaming about escaping the confines of the classroom and experiencing life from a different perspective. If Iwere a shark, my existence would be one of primal freedom and underwater majesty.Let me paint a picture of my life as a great white shark, the most iconic and infamous species of them all. I would be born into the cold, unforgiving waters of the Pacific Ocean, a mere pup among the thrashing currents. From the moment I emerged from my egg case, instinct would guide my every move as I navigated the treacherous realm that is my domain.As a newborn pup, I would be small and vulnerable, relying on my mother's protection until I grew large enough to fend for myself. Those early months would be a crash course in survival, teaching me to hunt efficiently and avoid the dangers that lurk in every corner of the deep. I would learn to respect the ocean's power, for it is a force that can uplift me to great heights or crush me without a second thought.Once I reached adolescence, the real adventure would begin. With a sleek, torpedo-shaped body and a mouthful of serrated teeth, I would become an unstoppable force, cutting through the water with remarkable agility and speed. My senses would be finely tuned, able to detect the faintest vibrations and electrical impulses that signal the presence of prey.The thrill of the hunt would course through my veins with every pursuit. I would stalk my quarry with patience and cunning, waiting for the perfect moment to strike. Then, with a powerful thrust of my tail, I would explode into action, jaws agape, ready to clamp down on my hapless victim with bone-crushing force. The rush of adrenaline would be unparalleled, a primal ecstasy that no land-dwelling creature could ever hope to experience.As an apex predator, fear would be a constant companion, but it would be a fear tinged with respect. Other ocean dwellers would give me a wide berth, sensing the raw power and ferocity that radiates from my very being. I would command the depths, a ruler of the underwater kingdom, bowing to no one and nothing.Of course, life as a shark would not be without its challenges.I would have to navigate treacherous currents, evade the clutches of killer whales and other formidable predators, and constantly search for new hunting grounds to sustain my voracious appetite. But these trials would only serve to sharpen my senses and hone my skills, forging me into an even more formidable creature of the deep.Imagine the thrill of breaching the surface, launching myself into the air in a spectacular display of power and grace. For thosebrief moments, I would be a creature of two worlds, defying gravity and basking in the warm embrace of the sun before plunging back into the depths whence I came.As a shark, I would be a part of something greater than myself, a link in the intricate chain of life that has existed since the dawn of time. My existence would be a delicate balance between predator and prey, hunter and hunted, a constant reminder of the cyclical nature of life and death in the ocean's vast expanse.But perhaps the greatest allure of being a shark lies in the freedom it would afford me. Unbound by the constraints of human society, I would roam the world's oceans at will, answering to no one but the primal urges that govern my every move. The open waters would be my playground, my kingdom, my entire universe.In those moments of solitary exploration, I would bear witness to sights that few humans have ever laid eyes upon. I would glide effortlessly past vibrant coral reefs teeming with life, explore vast underwater canyons and trenches, and venture into the inky blackness of the abyssal depths, where bizarre and otherworldly creatures dwell.Of course, my existence as a shark would not be without its dangers. Humans, with their insatiable appetites for profit and conquest, would pose a constant threat. Their voracious fishing practices, pollution, and disregard for the delicate balance of marine ecosystems would put my very survival at risk. But even in the face of such peril, I would persevere, a testament to the resilience and adaptability that has allowed my kind to thrive for millions of years.In the end, perhaps the greatest lesson I would learn as a shark is one of respect – respect for the ocean, respect for the intricate web of life that sustains us all, and respect for the primal forces that govern our existence. For in the depths of the sea, one quickly learns that we are all merely temporary visitors in a world far older and grander than we can possibly imagine.So, if you ever find yourself daydreaming about a life beyond the confines of the classroom, let your mind wander to the vast expanse of the ocean. Imagine yourself as a shark, a sleek and powerful predator gliding through the deep, reveling in the freedom and majesty of the underwater realm. Who knows? Perhaps in that imaginary existence, you might just find a newfound appreciation for the wonders of the natural world and the delicate balance that sustains us all.篇3If I Were a SharkIf I could be any animal in the world, I think I would choose to be a great white shark. Sure, being a cute bunny or majestic eagle would be nice, but there's just something so fascinating about sharks that draws me to them. They are the apex predators of the ocean, striking fear into the hearts of all who encounter them with their powerful jaws and seemingly soulless black eyes. Yet at the same time, sharks are magnificently adapted, sleek machines, perfectly suited for an existence of ceaseless hunting and survival of the fittest.As a great white, my life would be a constant quest for prey in the vast, cold depths of the open ocean. I would roam far from land, swimming hundreds of miles across унblue expanses of open water, always following the currents and temperature shifts that would lead me to concentrateions of nutrient-rich waters teeming with life. With my keen senses, I could detect a tiny electrical field given off by the muscle movements of a nearby seal from over a mile away. When the time was right, I would execute an ambush attack, exploding up from the darkness of the deep to snatch the hapless creature in my jaws.That first bone-crunching bite would be an exhilarating rush of power, the thrill of the hunt rewarded with a mouthful of warm flesh and a spray of crimson life force clouding the water around me. However, I wouldn't linger to savor the moment. As a shark, I can never rest, for I must keep swimming constantly to force water through my gills to breathe. After tearing my prey into chunks, I would have to quickly consume the entire seal before the scent of its blood attracted other, hungrier sharks to steal my quarry.As cold-blooded as my insatiable killer instinct would be, I would also have another hardwired drive: that of a perpetual wanderer. Even after a successful kill that should keep me fed for weeks, I would feel some primal urge pulling me to keep moving, to never stay in any one spot for too long. Maybe it would be some deeply ingrained survival tactic to avoid falling into a rut or depleting local resources. Or perhaps it is simply the lifestyle of the shark, a eternally-roving predator living with a nomadic existence with no place to truly call home.This unceasing wanderlust would lead me on a crisscrossing pattern of epic, years-long migrations across entire ocean basins in search of fertile hunting grounds. I would follow defined migratory routes encoded in the DNA of my kind, traversingthousands of miles of open ocean from the warm waters off Hawaii to the chilly Californian coast, then down again into the frigid, rich seas of Antarctica before returning full circle up along the coast of South America. All this without any maps or compasses, guided only by the eons-perfected instincts of my species and the positioning of the sun and stars.Over the decades of my life, I would visit more of the world's oceans than most humans could ever dream of. One year I could be patrolling the turquoise flats of the Bahamas, weaving through teeming reefs in pursuit of sea turtles and stingrays. The next, I could be cutting through the churning currents and upwellings of the North Atlantic, gorging myself on tuna and seals before riding the drifts down to warmer Caribbean waters for the winter. The following spring I may join the great white gangs swarming off the coast of South Africa, engaged in furious competition for the abundant seal colonies crowding the shorelines before heading back up to the Indian Ocean to bask in its tropical latitudes. No waters, from the most intensely pressurized hadspeths of the Mariana Trench to the frozen Ross Sea of Antarctica, would be off limits from my immense migratory cycles.This unending journey would not be without its share of drama and hardship, however. As an alpha predator, I would have to fiercely compete for territory and resources against rivals of equal size and prowess, battling in visceral clashes of teeth and muscle for dominance over fertile hunting grounds. Smaller sharks and other opportunistic marine creatures would constantly be looking to scavenge any of my leftovers, which I would have to jealously guard from their greedy jaws.Even prey that should be easy pickings would carry their own set of risks. A single miscalculated strike against a heavy bony fish or feisty sea lion could leave my mouth impaled with spines or my eyes ravaged from retaliation. And everywhere I went I would have to be vigilant for the deadliest threat - the distinctive sting of a great white's greatest foe, the brutal killer whale. These whales hunt in remorseless packs and could easily dispatch even the largest sharks with their teamwork and brawn.Life as an apex predator would be a constant struggle, and I would have to be willing to viciously Fight for every scrap of nourishment. But that's simply what it means to swim at the top of the food chain. Only the most relentlessly aggressive and strong sharks like myself could endure this harsh lifestyle.As a great white, I wouldn't have the capacity for deeper thoughts and emotions like humans do. But at the same time, I likely wouldn't suffer from the existential worries and neuroses that plague the human psyche. My entire being would be focused on simply achieving the core drives of survival, migration, and procreation that my DNA had hard-coded into my primal mind over millions of years of evolution. I wouldn't have doubts, regrets, or anxiety - just a pure flow of instinctual drives propelling me forward in an endless cycle of hunt, travel, and mate.Such a life may seem aimless, empty, and cold to a human. But for a shark, it would be a perfect existence. The freedom of the open ocean stretching to the horizon in all directions. The thrill of detecting prey and timing the perfect killing strike. The profound satisfaction of successfully passing on my genes to the next generation. As a great white, I would be living out an eternity-perfected cycle of ruthless yet graceful survival mastered over eons. While I may lament the lack of deeper consciousness and abstract thinking, there is also something to be said for such a purely primal existence disconnected from the stresses and complexities of the human world.Of course, eventually my reign as a great white shark would have to come to an end. But what an apex predator's demise that would be! Going out in a blaze of glory befitting the greatest fish in the sea - quite literally. Perhaps I would be locked in a desperate battle with a rivaling male over a fertile mate, our titanic bodies thrashing the surface into a reddish foam as we tore chunks from each other in a frenzy. Maybe I would finally meet my match against a killer whale pack too numerous to fend off alone. Or I may simply perish in honorable old age from lack of nourishment after sinking my last teeth into my final kill. However it played out, I can't think of a more appropriately primal and dramatic end for such a primal and dramatic life than that of the great white shark.。
堆叠替换系数和时间English answer:Stacking Replacement Ratio and Time.The stacking replacement ratio, also known as the replacement ratio, is a metric used to measure the efficiency of a stacking operation. It is calculated by dividing the number of cases that are stacked by the total number of cases that were available to be stacked. A higher stacking replacement ratio indicates that the stacking operation is more efficient.There are a number of factors that can affect the stacking replacement ratio, including the type of product being stacked, the size and shape of the cases, the stacking pattern used, and the equipment used to stack the cases.The time it takes to stack a pallet can also varydepending on a number of factors, including the number of cases that need to be stacked, the size and weight of the cases, the stacking pattern used, and the equipment used to stack the cases.In general, the larger the number of cases that need to be stacked, the longer it will take to stack the pallet. Similarly, the heavier the cases, the longer it will taketo stack the pallet. The stacking pattern used can also affect the time it takes to stack the pallet. Some stacking patterns are more efficient than others, and can save time. Finally, the equipment used to stack the cases can also affect the time it takes to stack the pallet. Some equipment is more efficient than others, and can save time.Chinese answer:堆叠替换系数,也称为替换系数,是用来衡量堆叠作业效率的一个指标。
英语作文三明治制作方法How to Make a Delicious Sandwich.The sandwich, a timeless delicacy that has captivated the hearts and stomachs of people across the globe, is a culinary masterpiece of simplicity and versatility. Whether it's a quick lunch on the go or a hearty meal for a cozy dinner, the sandwich is a perfect blend of flavors and textures that can be customized to suit any taste. In this article, we'll delve into the art of sandwich making, exploring the essential ingredients, techniques, and variations that make each sandwich a unique culinary experience.The Basics of Sandwich Making.Before delving into the vast world of sandwich variations, it's important to understand the basics. At its core, a sandwich consists of three main components: the bread, the filling, and the condiments.Bread.The bread is the foundation of any sandwich and can greatly impact its overall flavor and texture. Common bread choices include whole wheat, white, rye, multigrain, and even gluten-free options. The bread should be soft and slightly pliable to allow for easy biting and chewing. Toasting the bread can add a delightful crispiness and enhance the flavor, especially when using denser breadslike rye or whole wheat.Filling.The filling is what gives the sandwich its characterand personality. It can range from simple spreads likebutter or mayonnaise to elaborate combinations of meats, cheeses, vegetables, and condiments. Common fillingsinclude ham, turkey, roast beef, tuna, chicken, cheeseslike cheese, Swiss, or cheddar, and vegetables like lettuce, tomato, cucumber, and avocado. The key to a great sandwichis to balance the flavors and textures of the filling,ensuring that each bite is a harmonious blend of salt, sweet, sour, and bitter.Condiments.Condiments are what tie the sandwich together, adding zest and complexity to the flavor profile. Mustard, ketchup, mayonnaise, pickle relish, and hot sauce are all popular condiments that can be used to enhance the flavor of the filling and provide a refreshing contrast to the bread. Experimenting with different condiments and combining themin unique ways can create sandwiches that are both unique and delicious.Step-by-Step Guide to Making a Classic Sandwich.Now, let's delve into the step-by-step process of making a classic ham and cheese sandwich.Step 1: Prepare the Bread.Start by selecting your bread. For this classicsandwich, a soft whole wheat or white bread slice works well. Toast the bread lightly if desired, to add a touch of crispiness.Step 2: Spread the Condiments.Apply a thin layer of your chosen condiment to oneslice of the bread. Mustard or mayonnaise are excellent choices for this step, but feel free to experiment with other condiments or even create a blend of your own.Step 3: Add the Filling.Layer the filling on top of the condiment. For a ham and cheese sandwich, place a few slices of ham and cheese on the condiment-spread slice of bread. The order of the filling can be adjusted according to personal preference, but a common rule of thumb is to place the wetter ingredients (like cheese or mayonnaise) towards the middle of the sandwich to prevent them from leaking out.Step 4: Top with the Second Slice of Bread.Gently press the second slice of bread onto the filling, sandwiching it all together. If desired, you can use arolling pin or the palm of your hand to gently press downon the sandwich, ensuring that the filling is evenly distributed and the bread adheres well.Step 5: Cut and Serve.Using a sharp knife, cut the sandwich in halfdiagonally or horizontally, depending on your preference. Serve immediately for maximum freshness and enjoy!Variations and Innovations.The beauty of the sandwich lies in its versatility and adaptability. Once you've mastered the basics, you can experiment with a wide range of variations and innovationsto create sandwiches that are uniquely yours. Try swapping out the ham and cheese for turkey and Swiss, or experiment with different bread types like sourdough or ciabatta. Add vegetables like cucumber, tomato, or sprouts for a crispertexture and extra nutrients. Or, get creative with condiments and try combinations like wasabi and soy sauce, or chili and lime juice.In Conclusion.The sandwich is a timeless delicacy that can be enjoyed by anyone, anywhere, and at any time. With its simple yet versatile nature, it's easy to see why it has become a global favorite. By understanding the basics of sandwich making and experimenting with different ingredients and techniques, you can create sandwiches that are both delicious and unique. So, why wait? Grab your ingredients, get creative, and enjoy the sandwich-making journey!。
NISSData Swapping as aDecision Problem Shanti Gomatam, Alan F. Karr and Ashish P. Sanil Technical Report Number 140January, 2004National Institute of Statistical Sciences19 T. W. Alexander DrivePO Box 14006Research Triangle Park, NC 27709-4006Data Swapping as a Decision ProblemShanti Gomatam∗,Alan F.Karr and Ashish P.SanilNational Institute of Statistical SciencesResearch Triangle Park,NC27709–4006,USA{sgomatam,karr,ashish}@January6,2004AbstractWe construct a decision-theoretic formulation of data swapping in which quantitative mea-sures of disclosure risk and data utility are employed to select one release from a possibly largeset of candidates.The decision variables are the swap rate,swap attribute(s)and possibly,con-straints on the unswapped attributes.Risk–utility frontiers,consisting of those candidates notdominated in(risk,utility)space by any other candidate,are a principal tool for reducing thescale of the decision problem.Multiple measures of disclosure risk and data utility,includingutility measures based directly on use of the swapped data for statistical inference,are intro-duced.Their behavior and resulting insights into the decision problem are illustrated usingdata from the Current Population Survey,the well-studied“Czech auto worker data”and dataon schools and administrators generated by the National Center for Education Statistics.1IntroductionData swapping[12,24,26]is a technique for statistical disclosure limitation that works at the mi-crodata(individual data record)level.Confidentiality protection is achieved by selectively modi-fying a fraction of the records in the database by switching a subset of attributes between selected pairs of records.Data swapping makes it impossible for an intruder to be certain of having identi-fied an individual or entity in the database,because no record is certain to be unaltered.A formal definition of data swapping is given in[26],using elementary swaps.An elementary swap is a selection of two records from the microdata and an interchange of the values of attributes being swapped for these two records.When the candidates for each swap pair are picked at random we will refer to the resulting swaps as random swaps.We assume that elements of a swap pair are picked without replacement,so that no record appears in more than one swap pair.We also allow only true swaps,in the sense that both the swap attribute and at least one unswapped attribute ∗Currently at US Food and Drug Administration,Rockville,MD.must differ.1For multiple swap attributes,all attributes are swapped simultaneously,and all swap attributes must differ.The algorithm to perform the swapping is described in Appendix A and[20].In the past,implementation of data swapping by statistical agencies has been a matter of judge-ment.This leads to conservative behavior,erring on the side of too much protection of confiden-tiality rather than risking too little.Moreover,compared to immense attention to the effects of data swapping on confidentiality,much less attention has been paid to the effects of data swapping on the usefulness of the released data.Clearly data swapping distorts the data:joint distributions involving both swapped and unswapped attributes change.This decreases the value of the data for purposes such as statistical inference.Confidentiality protection and data utility must be traded off:they are,in economic terminology,substitutes—more of one entails less of the other.In this paper,we formulate(implementation of)data swapping as a decision problem with explicit tradeoff of quantified measures of disclosure risk and data utility.In its simplest form, this problem entails selection of one or more swap attributes and the swap rate,the fraction of records for which swapping occurs.More complex versions of the problem allow constraints on unswapped attributes.For example,an unswapped attribute may be forced to remain unchanged (preventing swapping across geographical boundaries,for example)or forced to change.Our formulation of data swapping as a decision problem appears in§2,together with two com-plementary approaches to solving the problem.In§3and4we introduce particular measures of disclosure risk and data utility,the latter conceptualized in part as lack of data distortion.These are illustrated using example data from the Current Population Survey(CPS)[4].In§5we describe risk–utility tradeoffs for three databases—CPS data,data on school administrators from the Na-tional Center for Education Statistics(NCES),and the Czech automobile worker database in[11];§6contains a concluding discussion.2Problem FormulationIn this section,we formulate data swapping as a decision problem:what must be decided(§2.1) and how quantified measures of disclosure risk and data utility(§2.2)facilitate solution of the problem(§2.3).Model-based frameworks for trading off risk and utility have been considered by[10]and [29].In[10],the terminology“R–U confidentiality map”is used for this tradeoff in the context of top-coding.The same phrase is used in[15]to denote a simulation experiment for perturbed(by addition of noise)multivariate data.A Bayesian approach to contrasting risk and utility for cell suppression is studied in[29].A risk–utility approach statistical disclosure limitation for tabular data,in which releases are marginal subtables of a large contingency table,appears in[7,8,14].1In some early versions of our software[20],a looser definition of“true swap”was employed,which required that each record change,but not that the database change.For example,with Age with swap attribute,(Age=≥50,Sex= Male)↔(Age<50,Sex=Male)would have been a true swap under the earlier formulation,but no longer is one.2.1Structure of the Decision ProblemConsider a database D consisting of a single table of data records having only categorical at-tributes.Much of the formulation in this paper but fewer of the specifics such as measures of disclosure risk and data distortion,generalizes to“continuous”attributes.The decision problem for data swapping involves three principal stages.2Thefirst stage is to decide whether to use data swapping at all,and whether to use data swap-ping alone or in conjunction with other strategies for statistical disclosure limitation.This choice lies largely outside the realm of this paper,and may be dictated by agency practice,political issues or scientific considerations.In general,data swapping is used in situations where the release of altered microdata is preferred to that of(possibly exact)summaries or analyses of the data.Extensions of our risk–utility paradigm may allow quantification of tradeoffs among multiple statistical disclosure limitation strategies,although clearly additional research is required before this becomes a reality.Second,if data swapping is employed,disclosure risk and data utility measures must be se-lected,which are used as shown in§2.3to perform the third stage of the decision process.Exam-ples of such measures appear in§3and4.Third,the release must be selected from some set R cand of candidate releases,which ordinarily entails choosing theSwap rate,the fraction of records in the database D for which swapping will occur.Swap attributes,those attributes whose values are exchanged between randomly selected pairs of records in D.Constraints on the unswapped attributes,which are optional.Such constraints may require or forbid equality of unswapped attributes.More specifically,as in§2.2,candidate releases are parameterized by a swap rate,the swap at-tributes and constraints,and constructed by actually performing the swap.Then,values of dis-closure risk and data utility are computed for each candidate release,and used to select which candidate to release.Although in principle the risk–utility paradigm in§2.2–2.3can be used to select all three of these,we envision that it will be used frequently to select swap attributes,less frequently to select the swap rate,and only rarely to select the constraints.Ordinarily,constraints would be imposed exogenously on the basis of domain knowledge.For example,it may be declared that swapping may not occur across state lines,because doing so would lead to released microdata that are in-consistent with state-level totals available elsewhere.Similarly,constraints may be necessary to prevent physically infeasible(and hence detectable)swapped records,such as males who have un-dergone hysterectomies.Even in such cases,however,our methods can still be used to evaluate the impact of the constraints on disclosure risk and data utility.2In effect,one decision precedes all of these:to release microdata at all,as opposed to summaries or statistical analyses of the data.As more external databases become available and record linkage technologies improve,any useful release of microdata may be too threatening to confidentiality.An initial look at a“world without microdata”appears in[13].2.2Mathematical RepresentationLet d be the number of(categorical)attributes in the pre-swap database D pre.The mathematical abstraction of the decision problem laid out in§2.1entails specification of candidate releases,a disclosure risk measure and a data utility measure.Releases.We parameterize candidate releases asR=(r,AS1,...,AS d),(1) where r is the swap rate,and for each attribute i,the attribute specificationAS i∈{S,F,C,U}(2) determines whether attribute i is swapped(S),must remainfixed(F),must change(C),or is neither swapped nor constrained(U).Release Space.The space R of all possible releases,which we term the release space,is uncountably infinite because there are uncountably many possible swap rates.Even for afixed swap rate,there are on the order of4d−1/2possible releases,corresponding to all possible com-binations of S,F,C and U in(2)other than(C,...,C),(F,...,F),(S,...,S)and(U,...,U) and accounting for complementarity—swapping one set of attributes is equivalent to swapping its complement.Candidate Release Space.In many settings,therefore,it is convenient or necessary to consider a smaller set R cand of candidate releases.For example,in§5.1,where d=8,there are136 candidate releases corresponding to three swap rates,all possible one-and two-attribute swaps, and no constraints.Note that candidate releases correspond to parameterized rather than actual releases.For each release R∈R cand we construct an actual release—a post-swap database D post(R),using the algorithm in Appendix A.Define the actual candidate release spaceR act cand=D post(R):R∈R cand,(3)one of whose elements will be released.The selection problem is to choose which one.Its solution, which we describe in§2.3,requires quantified measures of disclosure risk and data utility;specific examples for data swapping are presented in§3–4.Because the data swapping algorithm in Appendix A entails randomization,there is ambiguity in(3):different choices of the randomization seed yield different post-swap databases,even for the same parameterized release R in(1).It is even possible to include the randomization seed in the choice problem,but for simplicity we do not.In fact,when there is little possibility of confusion, we treat R∈R cand and D post(R)∈R act cand as synonymous.Disclosure Risk.The disclosure risk measure is a function DR:R→R with the inter-pretation that DR(R)is the disclosure risk associated with the release R.3If R cand is immutable, 3This is an example of the simplification from the preceding paragraph.Strictly speaking,disclosure risk is a function of D post(R)rather than R,and indeed,the examples in§3show this.then of course DR,as well as the data utility measure DU,need only be defined on it,and not necessarily on all of R.The disclosure risk function need not have any particular properties other than sensibly abstracting disclosure risk.However,in settings such as tabular data,in which the release space is partially ordered,the disclosure risk measure must be monotone with respect to the partial order.Data Utility.The data utility measure is a function DU:R→R with the interpretation that DU(R)is the utility of the release R.2.3Solution of the Decision ProblemGiven disclosure risk and data utility measures,the data swapping decision problem can be solved in two distinct ways.Utility Maximization.In this case,the optimal release R∗is chosen that maximizes data utility subject to an upper bound constraint on disclosure risk:DU(R)R∗=arg max R∈Rcand(4)s.t.DR(R)≤α,whereαis the bound on disclosure risk,which must be specified by the decision maker.Risk–Utility Frontiers.Especially but not only if R cand is small,then it may be more insight-ful simply to compare releases R in terms of risk and utility simultaneously,using the partial order RU defined byR1 RU R2⇔DR(R2)≤DR(R1)and DU(R2)≥DU(R1).(5) If R1 RU R2,then clearly R2is preferred to R1because it has both lower disclosure risk and higher greater utility.Only elements of R cand on the risk–utility frontier∂R cand consisting of the maximal elements of R cand with respect to the partial order(5)need be considered further. Ordinarily,as illustrated schematically in Figure1and for real data in Figures3and4,the frontier is much smaller than R putation of the frontier is not difficult:although no algorithm can avoid worst case complexity of#{R cand}2,where#{R cand}is the number of cases in R cand, algorithms exist whose average complexity is of order#{R cand}.Selection of a release on the risk–utility frontier can be done by assessing the risk–utility bal-ance subjectively or quantitatively,by means of an objective function that relates risk and utility. To illustrate,the dashed line in Figure1corresponds to a linear risk-utility relationship of the formDR=a×DU+c,and thefigure identifies the release on∂R cand that is optimal for a particular value of a.Simi-lar approaches have been used in economics to maximize consumer utility for the purchase of a combination of two commodities.Risk–utility frontiers also facilitate solution of the utility maximization problem(4),because the optimal release R∗must lie on the frontier.Figure1:Conceptual risk-utility frontier and optimal release for a linear tradeoff between risk and utility.3Disclosure Risk MeasuresHere we describe two disclosure risk measures that are both derived from the concept that re-identification of data subjects is the primary threat to confidentiality.3.1Small Cell CountsEspecially for census data,population uniques or near uniques are potentially riskier than other elements.For categorical data,these elements are contained in small count cells in the contingency table created by using all attributes in the data.The n-rule,which is widely used in statistical disclosure limitation[25],considers records that fall in cells with count(strictly)less than n(typically n=3)to be at risk.Reflecting this,we define risk as the proportion of unswapped records in small count cells in the table created fromthe post-swap data:DR(R)=C1,C2Number of unswapped records in D post(R)Total number of unswapped records in D post,(6)where C1and C2are the cells in the full data table associated with D post(R)with counts of1and 2respectively.Unlike the date distortion measures in§4,which are stated for categorical data but generalize readily to continuous data,the disclosure measure of(6)makes sense only for categorical data.3.2Record LinkageA number of authors(for example,[5],[9],[16],[22],[27],[26],[28])have considered disclosure risk measures based on re-identification through record linkage.For example,let D ext be an exter-nal database containing attributes in common with D(and the same attributes in common with any D post(R)),and for each record r∈D post(R)let n(r)=n(R;r)be the number of records in D ext that agree with r on the common attributes.These are candidates for linkage to r.For purposes of statistical disclosure limitation,larger values of n(r)are better,because they make record linkage more uncertain.A disclosure risk measure that captures this isDR (R)=Number of records in D post(R)with n(r)≤βTotal number of records in D post(R),(7)whereβis a threshold.4Data Utility MeasuresLet D pre denote the database prior to swapping,and let D post(R)denote the post-swap database for candidate release R.In this section,we describe two classes of data utility measures that capture the extent to which D post(R)differs from D pre.Thefirst of these(§4.1)measures explicitly the distortion introduced by data swapping.Distortion is data disutility,so that if DD is a measure of data distortion,then DU=−DD is the associated measure of data utility.Direct measures of distortion are general but blunt.They are disconnected from specific uses of the data,such as statistical inference.In§4.2we present data utility measures that quantify the extent to which inferences(in our case,using log-linear models)based on D post(R)differ from those based on D pre.4.1Data DistortionRecall that the data are categorical.Our data distortion measures are based on viewing D pre and D post(R)as contingency tables,and thus(when normalized)as distributions on the space I index-ing cells in these tables.4We let D pre(c)be the cell count associated with cell c∈I.The distortion measures all have the formDD(R)=d(D pre,D post(R)),(8) where d is a metric on an appropriate space of distributions.Recall also that data swapping changes only joint distributions of the attributes that involve both swap attributes and unswapped attributes. Distortion measures of the form(8)involve all attributes.Hellinger distance[17]is given byHD(D pre,D post(R))=1√2c∈ID pre(c)−D post(R,c)2.(9)4Mathematically,I is the Cartesian product of the sets of category values for each attribute.Note that the same absolute difference between D pre(c)and D post(R,c)affects the Hellinger dis-tance to a greater extent when the value of D pre(c)is small.Hellinger distance also corresponds to Cressie–Read divergence[6]withλ=−0.5.Total variation distance is given byTV(D pre,D post(R))=12c∈ID pre(c)−D post(R,c).(10)Entropy change is based on Shannon entropy,which for D pre is given byE(D pre)=−c∈I D pre(c)logD pre(c),and is conventionally interpreted at the amount of uncertainty in D pre.Entropy change,then, constitutes another measure of data distortion:EC(D pre,D post(R))=E(D post(R))−E(D pre).(11) Positive values of EC(D pre,D post(R))indicate that swapping has increased the uncertainty in the data.Related distortion measures involving conditional entropy have also been employed[26].We illustrate these measures using an8-attribute database CPS-8D extracted from the1993 CPS.The attributes,abbreviations we use for them and category values appear in Table1.There are48,842data records;the associated full table contains2880cells,of which1695are non-zero.5 In reality,the fact that we have survey rather than census data would represent additional protection against disclosure.Figure2shows the values of HD(D pre,D post(R)),TV(D pre,D post(R))and EC(D pre,D post(R)) for the CPS-8D data for24candidate releases corresponding to swap rates of1%,5%and10% and all single-attribute swaps.(These and other results in this paper were produced using—in this case a prototype of—the NISS Data Swapping Toolkit[18].)As expected,distortion increases as the swap rate increases,approximately linearly.Figure2shows rather dramatically that swapping some attributes induces more distortion than swapping others,an issue that we discuss at greater length in§5.In general,though,the three distortion measures track each other very closely,and in particular,total variation distance and entropy change result in almost the same ordering of swap variables.Hellinger distance shows a somewhat different ordering,to which Age and AvgHrs appear to contribute the most.Additional data distortion measures that are restricted to two-attribute databases(or more gen-erally,if only distortion of bivariate distributions is of interest),appear in Appendix B.4.2Inference-Based Measures of UtilityAs noted in the lead-in to this section,data distortion is a blunt measure of data utility because it does not address directly inferences that are drawn from the post-swap data.There is,of course, 5This is not a realistic level of sparsity.Attribute Name(ShortName)Abbreviation CategoriesAge(in years)(Age)A<25,25–55,>55Employer Type(WrkTyp)W Govt.,Priv.,Self-Emp.,Other Education(Educ)E<HS,HS,Bach,Bach+,CollMarital Status(MarStat)M Married,OtherRace(Race)R White,Non-WhiteSex(Sex)S Male,FemaleAverage Weekly Hours Worked(Hours)H<40,40,>40Annual Salary(Income)I<$50K,$50K+Table1:Attributes and attributes categories for the CPS-8D data.Figure2:Graph of Hellinger(top)and total variation(middle)distances and entropy change(bot-tom)for1%swap( ),5%( )and10%swap rates(+).indirect information,because nearly all inference procedures are in some sense“continuous”with respect to the data,so that low distortion implies nearly correct inference.Here,by contrast,we describe data utility measures that account explicitly for inference in the form of log-linear models [2]of the data.Let M∗=M∗(D pre)be the“optimal”log-linear model of the pre-swap database D pre,ac-cording to some criterion,for example,the Akaike information criterion(AIC)[1]or Bayes infor-mation criterion(BIC)[21].Concretely,M∗can be thought of in terms of its minimal sufficient statistics—a set of marginal subtables of the contingency table associated with D pre representing the highest-order interactions present.Let L M∗(·)be the log-likelihood function associated with M∗.Then as measure of data utility we employ the log-likelihood ratioDU llm(R)=L M∗(D post(R))−L M∗(D pre).(12) Although in general DU llm(R)<0in(12),because of the randomization in data swapping,this is not a logical necessity.The rationale is that higher values of DU llm(R)indicate that M∗remains a good model for D post(R).This is not,however,completely equivalent to saying that the same inferences would be drawn from D post(R)as from D pre,since data users do not have access to M∗.A more complex inference-based measure of utility might,for example,compare M∗to a similarly optimal model M∗(D post(R))of the post-swap data.Precisely how to do so,however,requires further research. One example would be whether M∗(D pre)and M∗(D post(R))have the same minimal sufficient statistics,but this measure is highly discontinuous.In§5.2we illustrate DU llm for the“Czech auto worker data,”an intensively studied(see,for example,[11,7]),6-attribute database containing risk factors for coronary thrombosis for1841 Czechoslovakian automobile factory workers who took part in a prospective epidemiological study. The associated contingency table,which contains26=64cells and is not sparse,appears in Table2.The six dichotomous attributes are defined as follows:A indicates whether the worker “smokes,”B corresponds to“strenuous mental work,”C corresponds to“strenuous physical work,”D corresponds to“systolic blood pressure,”E corresponds to“ratio ofβandαlipoproteins,”andF represents“family anamnesis of coronary heart disease.”There are three high risk cells,one with count1and two with count two.5Risk–Utility TradeoffsIn this section,we illustrate risk–utility tradeoffs for a variety of databases and utility measures: the CPS-8D database(§5.1),school administrator data from the NCES(§5.3)and the Czech auto-mobile worker database of Table2(§5.2).Rather than a“full factorial”design of all risk measures and all utility measures on each database,we report selected results that illuminate our risk–utility methodology.B no yesF E D C A no yes no yesneg<3<140no444011267yes1291451223≥140no35128033yes1096779≥3<140no23327066yes5080713≥140no24257357yes5163716pos<3<140no57219yes91714≥140no43118yes141752≥3<140no731414yes91623≥140no401311yes51444Table2:The Czech automobile worker database from[11].High-risk cells are shown by boxes.5.1CPS-8D DataHere we illustrate risk–utility tradeoffs for the CPS-8D data for a candidate release space R cand containing108cases corresponding to candidate releases comprising all(8)single-attribute swaps and all(28)two-attribute swaps together with swap rates of1%,2%and10%of the data.The disclosure risk measure is given by(6)and data utility is derived from Hellinger distance-measureddistortion:DU(R)=−DD(R)=−HDD pre,D post(R).The results,which were obtained using the NISS Data Swapping Toolkit,are shown separately for each of the three swap rates in Figure3,with the swap attributes identified,and with all three rates on one plot in Figure4.Since DU=−DD,these plots are reversed left–to–right as compared to Figure1,and∂R cand is now the“southwest boundary.”In Figure3,lines connect the cases on the frontier∂R cand.A user who has already decided on a rate need only look at the plot corresponding to that rate and make a decision as to which candidate release on the frontier best captures the relevant risk and utility tolerances.For example, and restricting attention to the2%rate,if the optimization criterion of(4)were employed,which in this case translates toR∗=arg min R∈RcandDD(R)s.t.DR(R)≤α,(13)and ifα=.14,then the optimal release corresponds to swap attributes Sex and WrkTyp,which is labeled by“WS”in the middle panel of Figure3.Alternatively,a user who is undecided about the swap rate would select from the combined frontier∂R cand generated by putting together all swaps for the rates of interest,as in Figure4.The frontier for the combined plot is a strict subset of the union of the three individual frontiers.For example,the10%swap of Educ,which was on the frontier for the10%swap rate,is dominated by many1%and2%swaps.Figure4also clearly illustrates how distortion increases and risk decreases with increasing swap rate.Single swaps tend to be riskier than pair swaps but show less mean distortion than pair swaps.As swap rate increases,variability in both risk and Hellinger distance increases.5.2Czech Automobile Worker DataThe log-linear model based data utility measure DU llm(R)in(12)was calculated for the Czech automobile worker data in Table2for21releases corresponding to all one-and two-attribute swaps,with a single swap rate of10%,with the“batch swap”capability[19]of the NISS Data Swapping Toolkit used to perform the swapping.The optimal model M∗=M∗(D pre)under either AIC or BIC has as sufficient statistics the marginal subtables[ABC D],[ADE],[F B].(14)This model is also well-recognized as the“best”model on the basis of domain knowledge[11,23].Figure5shows the associated risk–utility plot,with risk given by(6).Points there are labeled by swap attributes,with A,...,F the single-attribute swaps and fe,...,ba the two-attribute swaps. Since this is a risk–utility(not risk–distortion)plot,it is comparable to Figure1.The frontier is the southeast boundary of the set of candidate releases:∂R cand=B,ed,fe,ec,fa,fd.In Figure5,the points fc and fb are clearly anomalous:they have extremely low utility.One(butonly one of these)corresponds to a minimal in(14).One obvious question is whether the inference-based utility measure DU llm actually“picksup”some sort of signal that is obscured by the(general but as we termed it“blunt”)Hellingerdistance data distortion measure of(9).Figure6plots(DU llm(R),HD)pairs for the same21cases appearing in Figure5.The relationship is ambiguous at best,which we interpret as meaning thatDU llm(R)and HD are indeed different.Indeed,ignoring the anomalous points fc and fb,there seems to be little apparent relationship between DU llm(R)and HD.5.3NCES DataHere we illustrate insights produced by our decision-theoretic formulation of data swapping,using data from the NCES.Specifically,we use eight categorical attributes extracted from the1993 Common Core of Data(CCD)Public Elementary/Secondary School Universe Survey datafile and the1993–94Schools and Staffing Survey(SASS)Public and Private Administrator datafile.The attribute names and category values appear in Table3.。
Data Swapping:Variations on a Theme by Dalenius and ReissStephen E.Fienberg1, and Julie McIntyre21Department of StatisticsCenter for Automated Learning and DiscoveryCenter for Computer Communications and SecurityCarnegie Mellon University,Pittsburgh,PA15213-3890,USAfienberg@2Department of StatisticsCarnegie Mellon University,Pittsburgh PA15213-3890,USAjulie@Abstract.Data swapping,a term introduced in1978by Dalenius andReiss for a new method of statistical disclosure protection in confidentialdata bases,has taken on new meanings and been linked to new statisticalmethodologies over the intervening twenty-five years.This paper revis-its the original(1982)published version of the the Dalenius-Reiss dataswapping paper and then traces the developments of statistical disclo-sure limitation methods that can be thought of as rooted in the originalconcept.The emphasis here,as in the original contribution,is on bothdisclosure protection and the release of statistically usable data bases.Keywords:Bounds table cell entries;Constrained perturbation;Con-tingency tables;Marginal releases;Minimal sufficient statistics;Rankswapping.1IntroductionData swapping wasfirst proposed by Tore Dalenius and Steven Reiss(1978) as a method for preserving confidentiality in data sets that contain categori-cal variables.The basic idea behind the method is to transform a database by exchanging values of sensitive variables among individual records.Records are exchanged in such a way to maintain lower-order frequency counts or marginals. Such a transformation both protects confidentiality by introducing uncertainty about sensitive data values and maintains statistical inferences by preserving certain summary statistics of the data.In this paper,we examine the influence of data swapping on the growingfield of statistical disclosure limitation.Concerns over maintaining confidentiality in public-use data sets have in-creased since the introduction of data swapping,as has access to large,comput-erized databases.When Dalenius and Reissfirst proposed data swapping,it was in many ways a unique approach the problem of providing quality data to users Currently Visiting Researcher at CREST,INSEE,Paris,France.J.Domingo-Ferrer and V.Torra(Eds.):PSD2004,LNCS3050,pp.14–29,2004.c Springer-Verlag Berlin Heidelberg2004Data Swapping:Variations on a Theme by Dalenius and Reiss15 while protecting the identities of subjects.At the time most of the approaches to disclosure protection had essentially no formal statistical content,e.g.,see the1978report of the Federal Committee on Statistical Methodology,FCSM (1978),for which Dalenius served as as a consultant.Although the original procedure was little-used in practice,the basic idea and the formulation of the problem have had an undeniable influence on subse-quent methods.Dalenius and Reiss were thefirst to cast disclosure limitation firmly as a statistical problem.Following Dalenius(1977),Dalenius and Reiss define disclosure limitation probabilistically.They argue that the release of data is justified if one can show that the probability of any individual’s data being compromised is appropriately small.They also express a concern regarding the usefulness of data altered by disclosure limitation methods by focusing on the type and amount of distortion introduced in the data.By construction,data swapping preserves lower order marginal totals and thus has no impact on in-ferences that derive from these statistics.The current literature on disclosure limitation is highly varied and combines the efforts of computer scientists,official statisticians,social scientists,and statis-ticians.The methodologies employed in practice are often ad hoc,and there are only a limited number of efforts to develop systematic and defensible approaches for disclosure limitation(e.g.,see FCSM,1994;and Doyle et al.,2001).Among our objectives here are the identification of connections and common elements among some of the prevailing methods and the provision of a critical discus-sion of their comparative effectiveness1.What we discovered in the process of preparing this review was that many of those who describe data swapping as a disclosure limitation method either misunderstood the Dalenius-Reiss arguments or attempt to generalize them in directions inconsistent with their original pre-sentation.The paper is organized as follows.First,we examine the original proposal by Dalenius and Reiss for data swapping as a method for disclosure limitation, focusing on the formulation of the problem as a statistical one.Second,we ex-amine the numerous variations and refinements of data swapping that have been suggested since its initial appearance.Third,we discuss a variety of model-based methods for statistical disclosure limitation and illustrate that these have basic connections to data swapping.2Overview of Data SwappingDalenius and Reiss originally presented data swapping as a method for disclosure limitation for databases containing categorical variables,i.e.,for contingency tables.The method calls for swapping the values of sensitive variables among records in such a way that the t-order frequency counts,i.e.,entries in the the 1The impetus for this review was a presentation delivered at a memorial session for Tore Dalenius at the2003Joint Statistical Meetings in San Franciso,California.Tore Dalenius made notable contributions to statistics in the areas of survey sampling and confidentiality.In addition to the papers we discuss here,we especially recommend Dalenius(1977,1988)to the interested reader.16Stephen E.Fienberg and Julie McIntyret-way marginal table,are preserved.Such a transformed database is said to be t-order equivalent to the original database.The justification for data swapping rests on the existence of sufficient num-bers of t-order equivalent databases to introduce uncertainty about the true values of sensitive variables.Dalenius and Reiss assert that any value of a sensi-tive variable is protected from compromise if there is at least one other database or table,t-order equivalent to the original one,that assigns it a different value.It follows that an entire database or contingency table is protected if the values of sensitive variables are protected for each individual.The following simple exam-ple demonstrates how data swaps can preserve second-order frequency counts. Example:Table1contains data for three variables for seven individuals.Sup-pose variable X is sensitive and we cannot release the original data.In particular, notice that record number5is unique and is certainly at risk for disclosure from release of the three-way tabulated data.However,is it safe to release the two-way marginal tables?Table1b shows the table after a data-swapping transformation.Values of X were swapped between records1and5and between records4and7.When we display the data in tabular form as in Table2,we see that the two-way marginal tables have not changed from the original data.Summing over any dimension results in the same2-way totals for the swapped data as for the original data. Thus,there are at least two data bases that could have generated the same set of two-way tables.The data for any single individual cannot be determined with certainty from the release of this information alone.Table1.Swapping X values for two pairs of records in a3-variable hypothetical example(a)Original Data Record X Y Z 1010201030014001511161007100(b)Swapped Data Record X Y Z 1110201030014101501161007000An important distinction arises concerning the form in which data are re-leased.Releasing the transformed data set as microdata clearly requires that enough data are swapped to introduce sufficient uncertainty about the true val-ues of individuals’data.In simple cases such as the example in Table1above, appropriate data swaps,if they exist,can be identified by trial and error.However identifying such swaps in larger data sets is difficult.An alternative is to release the data in tabulated form.All marginal tables up to order t are unchanged by the transformation.Thus,tabulated data can be released by showing the exis-tence of appropriate swaps without actually identifying them.Schl¨o rer(1981)Data Swapping:Variations on a Theme by Dalenius and Reiss17 Table2.Tabular versions of original and swapped data from Table1(a)Original DataZY X01 002 1201YX01020101(a)Swapped DataZYX010111111YX01011110discusses some the trade-offs between the two approaches and we return to this issue later in the context of extensions to data swappping.Dalenius and Reiss developed a formal theoretical framework for data swap-ping upon which to evaluate its use as a method for protecting confidentiality. They focus primarily on the release of data in the form of2-way marginal to-tals.They present theorems and proofs that seek to determine conditions on the number of individuals,variables,and the minimum cell counts under which data swapping can be used to justify the release of data in this form.They argue that release is justified by the existence of enough2-order equivalent databases or tables to ensure that every value of every sensitive variable is protected with high probability.In the next section we discuss some of the main theoretical results presented in the paper.Many of the details and proofs in the original text are unclear,and we do not attempt to verify or replace them.Most important for our discussion is the statistical formulation of the problem.It is the probabilistic concept of disclosure and the maintenence of certain statistical summaries that has proved influential in thefield.2.1Theoretical Justification for Data SwappingConsider a database in the form of an N×V matrix,where N is the number of individuals and V is the number of variables.Suppose that each of the V variables is categorical with r≥2categories.Further define parameters a i,i≥1,that describe lower bounds on the marginal counts.Specifically,a i=N/m i where m i is the minimum count in the i-way marginal table.Dalenius and Reiss consider the release of tabulated data in the form of2-way marginal tables.In theirfirst result,they consider swapping values of a single variable among a random selection of k individuals.They then claim that the probability that the swap will result in a2-equivalent database isp≈r(V−1)r (πk)(V−1)(r−1).Observations:1.The proof of this result assumes that only1variable is sensitive.2.The proof also assumes that variables are independent.Their justificationis:“each pair of categories will have a large overlap with respect to k.”18Stephen E.Fienberg and Julie McIntyreBut the specific form of independence is left vague.The2-way margins for X are in fact the minimal sufficient statistics for the model of conditional independence of the other variables given X(for further details,see Bishop, Fienberg,and Holland,1975).Dalenius and Reiss go on to present results that quantify the number of potential swaps that involve k individuals.Conditions on V,N,and a2follow that ensure the safety of data released as2-order statistics.However the role of k in the discussion of safety for tabulated data is unclear.First they let k=V to get a bound on the expected number of data swaps.Thefirst main result is:Theorem1.If V<N/a2,V≥4,and N≥14a1F1/(V−1)V(V r−r+1)/(V−1)for some function F then the expected number of possible data-swaps of k=V individuals involving afixed variable is≥F.Unfortunately,no detail or explaination is given about the function F.Condi-tions on V,N,and a2that ensure the safety of data in2-way marginal tables are stated in the following theorem:Theorem2.If V<N/a2,andN{log(5NV p∗)}2/(V−1)≥a1V(V r−r+1)/(V−1)where p∗=log(1−p)/log(p),then,with probability p,every value in the database is2-safe.Observations:1.The proof depends on the previous result that puts a lower bound on theexpected number of data swaps involving k=V individuals.Thus the result is not about releasing all2-way marginal tables but only those involving a specic variable,e.g.,X.2.The lower bound is a function F,but no discussion of F is provided.In reading this part of the paper and examining the key results,we noted that Dalenius and Reiss do not actually swap data.They only ask about possible data swaps.Their sole purpose appears to have been to provide a framework for evaluating the likelihood of disclosure.In part,the reason for focusing on the release of tabulated data is that identi-fying suitable data swaps in large databases is difficult.Dalenius and Reiss do ad-dress the use of data swapping for release of microdata involving non-categorical data.Here,it is clear that a database must be transformed by swapping before it can safely be released;however,the problem of identifying enough swaps to protect every value in the data base turns out to be computationally impractical.A compromise,wherein data swapping is performed so that t-order frequency counts are approximately preserved,is suggested as a more feasible approach. Reiss(1984)gives this problem extensive treatment and we discuss it in more detail in the next section.Data Swapping:Variations on a Theme by Dalenius and Reiss19 We need to emphasize that we have been unable to verify the theoretical results presented in the paper,although they appear to be more specialized that the exposition suggests,e.g.,being based on a subset of2-way marginals and not on all2-way marginals.This should not be surprising to those faminiliar with the theory of log-linear models for contingency tables,since the cell probabilities for the no2nd-order interaction model involving the2-way margins does not have an explicit functional representation(e.g.,see Bishop,Fienberg,and Holland,1975). For similar reasons the extension of these results to orders greater than2is far from straightforward,and may involve only marginals that specify decomposable log-linear models(c.f.,Dobra and Fienberg,2000).Nevertheless,wefind much in the authors’formulation of the disclosure limi-tation problem that is important and interesting,and that has proved influential in later theoretical developments.We summarize these below.1.The concept of disclosure is probabilistic and not absolute:(a)Data release should be based on an assessment of the probability of theoccurrence of a disclosure,c.f.,Dalenius(1977).(b)Implicit in this conception is the trade-offbetween protection and util-ity.Dalenius also discusses this in his1988Statistics Sweden monograph.He notes that essentially there can be no release of information without some possibility of disclosure.It is in fact the responsibility of data man-agers to weigh the risks.Subjects/respondents providing data must also understand this concept of confidentiality.(c)Recent approaches rely on this trade-offnotion,e.g.,see Duncan,et al.(2001)and the Risk-Utility frontiers in NISS web-data-swapping work (Gomatam,Karr,and Sanil,2004).2.Data utility is defined statistically:(a)The requirement to maintain a set of marginal totals places the emphasison statistical utility by preserving certain types of inferences.Although Dalenius and Reiss do not mention log-linear models,they are clearly focused on inferences that rely on t-way and lower order marginal totals.They appear to have been thefirst to make this a clear priority.(b)The preservation of certain summary statistics(at least approximately)is a common feature among disclosure limitation techniques,although until recently there was little reference to the role these statistics have for inferences with regard to classes of statistical models.We next discuss some of the immediate extensions by Delanius and Reiss to their original data swapping formulation and its principal initial application. Then we turn to what others have done with their ideas.2.2Data Swapping for Microdata ReleasesTwo papers followed the original data swapping proposal and extended those methods.Reiss(1984)presented an approximate data swapping approach for the release of microdata from categorical databases that approximately preserves20Stephen E.Fienberg and Julie McIntyret-order marginal totals.He computed relevant frequency tables from the original database,and then constructed a new database elementwise to be consistent with these tables.To do this he randomly selected the value of each element according to probability distribution derived from the original frequency tables and and then updated the table each time he generated a new element.Reiss,Post,and Dalenius(1982)extended the original data swapping idea to the release of microdatafiles containing continuous variables.For continu-ous data,they chose data swaps to maintain generalized moments of the data, e.g.,means,variances and covariances of the set of variables.As in the case of categorical data,finding data swaps that provide adequate protection while pre-serving the exact statistics of the original database is impractical.They present an algorithm for approximately preserving generalized k th order moments for the case of k=2.2.3Applying Data Swapping to Census Data ReleasesThe U.S.Census Bureau began using a variant of data swapping for data releases from the1990decennial census.Before implementation,the method was tested with extensive simulations,and the release of both tabulations and microdata was considered(for details,see Navarro,et al.(1988)and Griffin et al.(1989)). The results were considered to be a success and essentially the same methodology was used for actual data releases.Fienberg,et al.(1996)describe the specifics of this data swapping methodol-ogy and compare it against Dalenius and Reiss’proposal.In the Census Bureau’s version,records are swapped between census blocks for individuals or households that have been matched on a predetermined set of k variables.The(k+1)-way marginals involving the matching variables and census block totals are guaran-teed to remain the same;however,marginals for tables involving other variables are subject to change at any level of tabulation.But,as Willenborg and de Waal (2001)note,swapping affects the joint distribution of swapped variables,i.e, geography,and the variables not used for matching,possibly attenuating the association.One might aim to choose the matching variables to approximate conditional independence between the swapping variables and the others.Because the swapping is done between blocks,this appears to be consistent with the goals of Dalenius and Reiss,at least as long as the released marginals are those tied to the swapping.Further,the method actually swaps a specified (but unstated)number of records between census blocks,and this becomes a data base from which marginals are released.However the release of margins that have been altered by swapping suggests that the approach goes beyond the justification in Dalenius and Reiss.Interestingly,the Census Bureau description of their data swapping methods makes little or no reference to Dalenius and Reiss’s results,especially with regard to protection.As for ultility,the Bureau focuses on achieving the calculation of summary statistics in released margins other than those left unchanged by swapping(e.g.,correlation coefficients)rather than on inferences with regard to the full cross-classification.Data Swapping:Variations on a Theme by Dalenius and Reiss21 Procedures for the U.S.2000decennial census were similar,although with modifications(Zayatz2002).In particular,unique records that were at more risk of disclosure were targeted to be involved in swaps.While the details of the approach remain unclear,the Office of National Statistics in the United Kingdom has also applied data swapping as part of its disclosure control procedures for the U.K.2001census releases(see ONS,2001).3Variations on a Theme–Extensions and Alternatives3.1Rank SwappingMoore(1996)described and extended the rank-based proximity swapping algo-rithm suggested for ordinal data by Brian Greenberg in an1987unpublished manuscript.The algorithmfinds swaps for a continuous variable in such a way that swapped records are guaranteed to be within a specified rank-distance of one another.It is reasonable to expect that multivariate statistics computed from data swapped with this algorithm will be less distorted than those computed af-ter an unconstrained swap.Moore attempts to provide rigorous justification for this,as well as conditions on the rank-proximity between swapped records that will ensure that certain summary statistics are preserved within a specified in-terval.The summary statistics considered are the means of subsets of a swapped variable and the correlation between two swapped variables.Moore makes a cru-cial assumption that values of a swapped variable are uniformly distributed on the interval between its bottom-coded and top-coded values,although few of those who have explored rank swapping have done so on data satisfying such an assumption.He also includes both simulations(e.g.,for skewed variables)and some theoretical results on the bias introduced by two independent swaps on the correlation coefficient.Domingo-Ferrer and Torra(2001a,2001b)use a simplified version of rank swapping and in a series of simulations of microdata releases and claim that it provides superior performance among methods for masking continuous data. Trotinni(2003)critiques their performance measures and suggests great caution in interpreting their results.Carlson and Salabasis(2002)also present a data-swapping technique based on ranks that is appropriate for continuous or ordinally scaled variables.Let X be such a variable and consider two databases containing independent samples of X and a second variable,Y Suppose that these databases,S1=[X1,Y1]and S2=[X2,Y2]are ranked with respect to X.Then for large sample sizes,thecorresponding ordered values of X1and X2should be approximately equal.The authors suggest swapping X1and X2to form the new databases,S∗1=[X1,Y2] and S∗2=[X2Y1].The same method can be used given only a single sample by randomly dividing the database into two equal parts,ranking and performing the swap,and then recombining.Clearly this method,in either variation,maintains univariate moments of the data.Carlson and Salabasis’primary concern,however,is the effect of the data swap on the correlation between X and Y.They examine analytically22Stephen E.Fienberg and Julie McIntyrethe case where X and Y are bivariate normal with correlation coefficientρ, using theory of order statistics andfind bounds onρ.The expected deterioration in the association between the swapped variables increases with the absolute magnitude ofρand decreases with sample size.They support these conclusions by simulations.While this paper provides thefirst clear statistical description of data swap-ping in the general non-categorical situation,it has a number of shortcomings. In particular,Fienberg(2002)notes that:(1)the method is extremely waste-ful of the data,using1/2or1/3according to the variation chosen and thus is highly ineffecient.Standard errors for swapped data are approximately40%to 90%higher than for the original unswapped data;(2)the simulations and theory apply only to bivariate correlation coefficients and the impact of the swapping on regression coefficients or partial correlation coefficients is unclear.3.2NISS Web-Based Data SwappingResearchers at the National Institute of Statistical Science(NISS),working with a number of U.S.federal agencies,have developed a web-based tool to perform data swapping in databases of categorical variables.Given user-specified param-eters such as the swap variables and the swap rate,i.e.,the proportion of records to be involved in swaps,this software produces a data set for release as micro-data.For each swapping variable,pairs of records are randomly selected and values for that variable exchanged if the records differ on at least one of the unswapped attributes.This is performed iteratively until the designated num-ber of records have been swapped.The system is described in Gomatam,Karr, Chunhua,and Sanil(2003).Documentation and free downloadable versions of the software are available from the NISS web-page,.Rather than aiming to preserve any specific set of statistics,the NISS pro-cedure focuses on the trade-offbetween disclosure risk and data utility.Both risk and utility diminish as the number of swap variables and the swap rate increase.For example,a high swapping rate implies that data are well-protected from compromise,but also that their inferential properties are more likely to be distorted.Gomatam,Karr and Sanil(2004)formulate the problem of choosing optimal values for these parameters as a decision problem that can be viewed in terms of a risk-utility frontier.The risk-utility frontier identifies the greatest amount of protection achievable for any set of swap variables and swap rate.One can measure risk and utility in a variety of ways,e.g.,the proportion of unswapped records that fall into small-count cells(e.g.,with counts less than 3)in the tabulated,post-swapped data base.Gomatam and Karr(2003,2004) examine and compare several“distance measures”of the distortion in the joint distributions of categorical variables that occurs as a result of data swapping, including Hellinger distance,total variation distance,Cramer’s V,the contin-gency coefficient C,and entropy.Gomatam,Karr,and Sanil(2004)consider a less general measures of utility—the distortion in inferences from a specific statistical analysis,such as a log-linear model analysis.Data Swapping:Variations on a Theme by Dalenius and Reiss23 Given methods for measuring risk and utility,one can identify optimal re-leases are empirically byfirst generating a set of candidate releases by performing data swapping with a variety of swapping variables and rates and then measur-ing risk and utility on each of the candidate releases and provide a means of making comparisons.Those pairs that dominate in terms of having low risk and high utility comprise a risk-utility frontier that leads optimal swaps for allow-able levels of risk.Gomatam,Karr,and Sanil(2003,2004)provide a detailed discussion of choosing swap variables and swap rates for microdata releases of categorical variables.3.3Data Swapping and Local RecodingTakemura(2002)suggests a disclosure limitation procedure for microdata that combines data swapping and local recoding(similar to micro-aggregation).First, he identifies groups of individuals in the database with similar records.Next,he proposes“obscuring”the values of sensitive variables either by swapping records among individuals within groups,or recoding the sensitive variables for the entire group.The method works for both continuous and categorical variables.Takemura suggests using matching algorithms to identify and pair similar individuals for swapping,although other methods(clustering)could be used. The bulk of the paper discusses optimal methods for matching records,and in particular he focuses on the use of Edmond’s algorithm which represents individuals as nodes in a graph,linking the nodes with edges to which we attach weights,and then matches individuals by a weighting maximization algorithm. The swapping version of the method bears considerable resemblance to rank swapping,but the criterion for swapping varies across individuals.3.4Data ShufflingMulalidhar and Sarathy(2003a,2003b)report on their variation of data swap-ping which they label as data shuffling,in which they propose to replace sensitive data by simulated data with similar distributional properties.In particular,sup-pose that X represents sensitive variables and S non-sensitive variables.Then they propose a two step approach:–Generate new data Y to replace X by using the conditional distribution of X given S,f(X|S),so that f(X|S,Y)=f(X|S).Thus they claim that the released versions of the sensitive data,i.e.,Y,provide an intruder with no additional information about f(X|S).One of the problems is,of course,thatf is unknown and thus there is information in Y.–Replace the rank order values of Y with those of X,as in rank swapping. They provide some simulation results that they argue show the superiority of their method over rank swapping in terms of data protection with little or no loss in the ability to do proper inferences in some simple bivariate and trivariate settings.。