Data Swapping: Variations on a Theme by Dalenius and Reiss
The Root "Sequ"

"Sequ" is a root meaning "to follow," and it gives us the word "sequence." This article examines the concept of sequence and its applications in mathematics, genetics, and computer science, and the role it plays in understanding patterns and processes.

The concept of sequence is woven into everyday life, whether in the order in which we perform a series of tasks or in the steps involved in solving a problem. The essence of sequence lies in recognizing order and arranging elements accordingly.

In mathematics, a sequence is a list of numbers arranged in a particular order; each number in the list is called a term. A sequence can be finite or infinite, depending on the number of terms present. For example, 1, 2, 3, 4, 5 is a finite sequence with five terms, while 2, 4, 6, 8, ... continues indefinitely, forming an infinite sequence. Sequences can also follow explicit rules. The Fibonacci sequence is a famous example: each term is the sum of the previous two (1, 1, 2, 3, 5, 8, 13, ...). By understanding the rule that governs a sequence, mathematicians can predict and model a wide range of phenomena.

Beyond mathematics, the concept of sequence extends to genetics. DNA, short for deoxyribonucleic acid, carries the genetic instructions required for the development, functioning, and reproduction of all known living organisms. The human genome, for instance, is made up of billions of DNA base pairs arranged in a specific order.
This order is critical for biological processes such as protein synthesis and gene regulation. By sequencing DNA, scientists can read the genetic code, investigate the causes of genetic disorders, develop personalized medicine, and trace the evolutionary history of species. Sequencing technology has evolved considerably: what was once a labor-intensive, time-consuming process has been transformed by high-throughput methods such as Next-Generation Sequencing (NGS), which allow rapid, cost-effective sequencing of DNA and let researchers analyze complex genomes and identify genetic variations far more efficiently.

The concept of sequence is equally vital in computer science, where a sequence is defined as an ordered collection of elements. The order matters because it determines how a program processes and manipulates data. Sequences appear throughout algorithms and data structures: sorting algorithms such as Merge Sort and Quick Sort rearrange elements into a particular order by comparing and swapping them according to a predefined rule, while structures such as linked lists and arrays store elements in a specific order to support fast access and manipulation.

In short, the root "sequ" runs through mathematics, genetics, and computer science alike. Recognizing order and following the sequence is fundamental to uncovering patterns, solving problems, and making progress across many areas of knowledge.
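The Fibonacci rule mentioned above translates directly into code. Here is a short Python sketch (the function name is my own choosing) that builds the sequence by repeatedly applying the rule "each term is the sum of the previous two":

```python
def fibonacci(n):
    """Return the first n terms of the Fibonacci sequence (1, 1, 2, 3, ...)."""
    terms = []
    a, b = 1, 1
    for _ in range(n):
        terms.append(a)
        a, b = b, a + b  # each new term is the sum of the previous two
    return terms

print(fibonacci(7))  # -> [1, 1, 2, 3, 5, 8, 13]
```

The same pattern (start from known terms, apply a rule to extend the list) works for any rule-defined sequence, such as the even numbers 2, 4, 6, 8, ...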
Data Analysis English Test (with Answers)

Part I. Multiple Choice (2 points each, 10 points total)

1. Which of the following is not a common data type in data analysis?
   A. Numerical  B. Categorical  C. Textual  D. Binary

2. What is the process of transforming raw data into an understandable format called?
   A. Data cleaning  B. Data transformation  C. Data mining  D. Data visualization

3. In data analysis, what does the term "variance" refer to?
   A. The average of the data points
   B. The spread of the data points around the mean
   C. The sum of the data points
   D. The highest value in the data set

4. Which statistical measure is used to determine the central tendency of a data set?
   A. Mode  B. Median  C. Mean  D. All of the above

5. What is the purpose of using a correlation coefficient in data analysis?
   A. To measure the strength and direction of a linear relationship between two variables
   B. To calculate the mean of the data points
   C. To identify outliers in the data set
   D. To predict future data points

Part II. Fill in the Blanks (2 points each, 10 points total)

6. The process of identifying and correcting (or removing) errors and inconsistencies in data is known as ________.
7. A type of data that can be ordered or ranked is called ________ data.
8. The ________ is a statistical measure that shows the average of a data set.
9. A ________ is a graphical representation of data that uses bars to show comparisons among categories.
10. When two variables move in opposite directions, the correlation between them is ________.

Part III. Short Answer (5 points each, 20 points total)

11. Explain the difference between descriptive and inferential statistics.
12. What is the significance of a p-value in hypothesis testing?
13. Describe the concept of data normalization and its importance in data analysis.
14. How can data visualization help in understanding complex data sets?

Part IV. Calculation (10 points each, 20 points total)

15. Given a data set with the following values: 10, 12, 15, 18, 20, calculate the mean and standard deviation.
16. If a data analyst wants to compare the performance of two different marketing campaigns, what type of statistical test might they use and why?

Part V. Case Analysis (15 points each, 30 points total)

17.
A company wants to analyze the sales data of its products over the last year. What steps should the data analyst take to prepare the data for analysis?

18. Discuss the ethical considerations a data analyst should keep in mind when handling sensitive customer data.

Answers

Part I. Multiple Choice
1. D  2. B  3. B  4. D  5. A

Part II. Fill in the Blanks
6. Data cleaning
7. Ordinal
8. Mean
9. Bar chart
10. Negative

Part III. Short Answer
11. Descriptive statistics summarize and describe the features of a data set, while inferential statistics make predictions or inferences about a population based on a sample.
12. A p-value indicates the probability of observing the data, or something more extreme, if the null hypothesis is true. A small p-value suggests that the observed data is unlikely under the null hypothesis, leading to its rejection.
13. Data normalization is the process of scaling data to a common scale. It is important because it allows for meaningful comparisons between variables and can improve the performance of certain algorithms.
14. Data visualization provides a visual representation of the data, making it easier to identify patterns, trends, and outliers in complex data sets.

Part IV. Calculation
15. Mean = (10 + 12 + 15 + 18 + 20) / 5 = 75 / 5 = 15. Standard deviation = √[Σ(xi − mean)² / N] = √[(25 + 9 + 0 + 9 + 25) / 5] = √(68 / 5) = √13.6 ≈ 3.69.
16. A t-test or ANOVA might be used to compare the means of the two campaigns, as these tests can determine whether there is a statistically significant difference between the groups.

Part V. Case Analysis
17. The data analyst should first clean the data by removing any errors or inconsistencies. Then, they should transform the data into a suitable format for analysis, such as creating a time series of monthly sales. They might also normalize the data if necessary and perform exploratory data analysis to identify any patterns or trends.
18. A data analyst should ensure the confidentiality and privacy of customer data, comply with relevant data protection laws, and obtain consent where required.
They should also be transparent about how the data will be used and take steps to prevent any potential misuse of the data.
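The arithmetic in question 15 can be checked with a few lines of Python, using only the standard library and the population form of the standard deviation (dividing by N, as the question's formula implies):

```python
import math

data = [10, 12, 15, 18, 20]

mean = sum(data) / len(data)
# population variance: average squared deviation from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = math.sqrt(variance)

print(mean)               # -> 15.0
print(round(std_dev, 2))  # -> 3.69
```

Note that the sample standard deviation (dividing by N − 1) would give a slightly larger value, about 4.12; which form is expected should be stated in an exam question.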
Anomalies and Market Efficiency

G. William Schwert
Simon School of Business, University of Rochester

This paper can be downloaded from the Social Science Research Network Electronic Paper Collection: /abstract=

Contents
1 Introduction
2 Selected Empirical Regularities
2.1 Predictable Differences in Asset Returns
2.2 Differences in Return Predictability Across Periods
3 Returns to Different Types of Investors
3.1 Individual Investors
3.2 Institutional Investors
3.3 Limits to Arbitrage
4 Long-Run Returns
5 Implications for Asset Pricing
6 Implications for Corporate Finance
7 Conclusions

Anomalies and Market Efficiency
G. William Schwert
University of Rochester, Rochester, NY 14627
and National Bureau of Economic Research
October 2002

Abstract

Empirically, market anomalies appear to be inconsistent with existing theories of asset-pricing behavior. They indicate that markets are not efficient, or that the underlying asset-pricing theories are deficient. The evidence in this paper shows that the size effect, the value effect, the weekend effect, and the dividend-yield effect weakened or disappeared entirely after the studies that publicized them appeared. At the same time, practitioners began to implement the investment strategies used in this academic research. The small-firm January effect has steadily weakened since it was first reported in the academic literature, although some evidence suggests it still exists. Interestingly, however, the effect does not appear in the returns of portfolios that concentrate on small-capitalization stocks. All of these findings suggest that market anomalies are more apparent than real. The notoriety that accompanies these unusual findings has tempted many scholars to investigate anomalies further and to try to explain them.
A User's Guide to RockWare® Aq•QA® Version 1.1

RockWare, Inc., Golden, Colorado, USA
Copyright © 2003–2004 Prairie City Computing, Inc. All rights reserved.

Aq•QA® information and updates:

Aq•QA® sales and support:
RockWare, Inc.
2221 East Street, Suite 101
Golden, Colorado 80401 USA
Sales: 303-278-3534, aqqa@
Orders: 800-775-6745
Fax: 303-278-4099

Developer: Developed exclusively for RockWare, Inc. by:
Prairie City Computing, Inc.
115 West Main Street, Suite 400
PO Box 1006
Urbana, Illinois 61803-1006 USA

Trademarks: Aq•QA® and Prairie City Computing® are trademarks or registered trademarks of Prairie City Computing, Inc. RockWare® is a registered trademark of RockWare, Inc. All other trademarks used herein are the properties of their respective owners.

Warranty: RockWare warrants that the original CD is free from defects in material and workmanship, assuming normal use, for a period of 90 days from the date of purchase. If a defect occurs during this time, you may return the defective CD to PCC, along with a dated proof of purchase, and RockWare will replace it at no charge. After 90 days, you can obtain a replacement for a defective CD by sending it and a check for $25 (to cover postage and handling) to RockWare. Except for the express warranty of the original CD set forth here, neither RockWare nor Prairie City Computing (PCC) makes any other warranties, express or implied. RockWare attempts to ensure that the information contained in this manual is correct as of the time it was written. We are not responsible for any errors or omissions. RockWare's and PCC's liability is limited to the amount you paid for the product. Neither RockWare nor PCC is liable for any special, consequential, or other damages for any reason.

Copying and Distribution: You are welcome to make backup copies of the software for your own use and protection, but you are not permitted to make copies for the use of anyone else.
We put a lot of time and effort into creating this product, and we appreciate your support in seeing that it is used by licensed users only.

End User License Agreement: Use of Aq•QA® is subject to the terms of the accompanying End User License Agreement. Please refer to that Agreement for details.

Contents

A Guided Tour of Aq•QA®
  About Aq•QA®
  Data Sheet
  Entering Data
  Working With Data
  Graphing Data
  Replicates, Standards, and Mixing

The Data Sheet
  About the Data Sheet
  Creating a New Data Sheet
  Opening an Existing Data Sheet
  Layout of the Data Sheet
  Selecting Rows and Columns
  Reordering Rows and Columns
  Adding Samples and Analytes
  Deleting Samples and Analytes
  Using Analyte Symbols
  Data Cells
  Entering Data
  Changing Units
  Using Elemental Equivalents
  Notes and Comments
  Flagging Data Outside Regulatory Limits
  Saving Data
  Exporting Data to Other Software
  Printing the Data Sheet

Analytes
  About Analytes
  Analyte Properties
  Changing the Properties of an Analyte
  Creating a New Analyte
  Analyte Libraries
  Editing the Analyte Library
  Updating Aq•QA Files

Data Analysis
  About Data Analysis
  Fluid Properties
    Water Type
    Dissolved Solids
    Density
    Electrical Conductivity
    Hardness
  Internal Consistency
    Anion-Cation Balance
    Measured TDS Matches Calculated TDS
    Measured Conductivity Matches Calculated Value
    Measured Conductivity and Ion Sums
    Calculated TDS to Conductivity Ratio
    Measured TDS to Conductivity Ratio
    Organic Carbon Cannot Exceed Sum of Organics
  Carbonate Equilibria
    Speciation
    Total Carbonate From Titration Alkalinity
    Titration Alkalinity From Total Carbonate
    Mineral Saturation
    Partial Pressure of CO2
  Irrigation Waters
    Salinity Hazard
    Sodium Adsorption Ratio
    Exchangeable Sodium Ratio
    Magnesium Hazard
    Residual Sodium Carbonate
    Reference
  Geothermometry
  Unit Conversion

Replicates, Standards, and Mixing
  About Replicates, Standards, and Mixing
  Comparing Replicate Analyses
  Checking Against Standards
  Fluid Mixing

Graphing Data
  About Graphing Data
  Time Series Plots
  Series Plots
  Cross Plots
  Ternary Diagrams
  Piper Diagrams
  Durov Diagrams
  Schoeller Diagrams
  Stiff Diagrams
  Radial Plots
  Ion Balance Diagrams
  Pie Charts
  Copying a Graph to Another Document
  Saving Graphs

Tapping Aq•QA®'s Power
  About Tapping Aq•QA®'s Power
  Template for New Data Sheets
  Exporting the Data Sheet
  Subscripts, Superscripts, and Greek Characters
  Analyte Symbols
  Colors and Markers
  Calculated Ions
  Hiding Analytes and Samples
  Selecting Display Fonts
  Searching the Data Sheet
  Arrow Key Behavior During Editing
  Sorting Samples and Analytes
  "Tip of the Day"

Appendix: Carbonate Equilibria
  About Carbonate Equilibria
  Necessary Data
  Activity Coefficients
  Apparent Equilibrium Constants
  Speciation
  Titration Alkalinity
  Mineral Saturation
  CO2 Partial Pressure

Index

A Guided Tour of Aq•QA®

About Aq•QA

Imagine you could keep the results of your chemical analyses in a spreadsheet developed especially for the purpose. A spreadsheet that knows how to convert units, check your analyses for internal consistency, graph your data in the ways you want it graphed, and so on.

A spreadsheet like that exists, and it's called Aq•QA. Aq•QA was written by water chemists, for water chemists. Best of all, it is not only powerful but easy to learn, so you can start using it in minutes.
Just copy the data from your existing ordinary spreadsheets, paste it into Aq•QA, and you're ready to go!

To see what Aq•QA can do for you, take the guided tour below.

Data Sheet

When you start Aq•QA, you see an empty Data Sheet. Click on File → Open…, move to directory "\Program Files\AqQA\Examples" and open file "Example1.aqq".

[Screenshot: the example Data Sheet, with callouts marking an analyte row and a sample column.]

The example Data Sheet is arranged with samples in columns, and analytes – the things you measure – in rows.

You can flip an Aq•QA Data Sheet so the samples are in rows and analytes in columns by selecting View → Transpose Data Sheet. Click on this tab again to return to the original view.

Tip: Aq•QA by default labels analytes by name (Sodium, Potassium, Dissolved Solids, …), but by clicking on View → Show Analyte Symbols you can view them by chemical symbol (Na, K, TDS, …).

To include more samples or analytes in your Data Sheet, click on the "Add Sample" or "Add Analyte" button.

[Screenshot: toolbar buttons for adding a sample and an analyte, and handles for selecting analytes, samples, and values.]

You select analytes or samples by clicking on "handles", marked in the Data Sheet by small triangles. You can select the values associated with an analyte using a separate set of handles, next to the "Unit" column. Give it a try!

Tip: To rearrange rows or columns, select one or more, hold down the Alt key, and drag them to the desired location.

Entering Data

To see how to enter your own data into an Aq•QA Data Sheet, begin by selecting File → New.
Add to the Data Sheet whatever analytes you need, and delete any you don't need.

Tip: To delete analytes, select one or more and click on the delete-analyte button on the toolbar. To delete samples you have selected, click on the delete-sample button.

When you click on the "Add Analyte" button, you can pick from among a number of predefined choices in various categories, such as "Inorganic Analytes", "Organic Analytes", and so on.

[Screenshot: the Add Analyte dialog, which lets you pick analytes from predefined categories.]

A number of commonly encountered data fields (Date, pH, Temperature, …) can be found in the "General" category.

Tip: If you don't find an analyte you need among the predefined choices, you can easily define your own by clicking on Analytes → New Analyte….

To make your work easier, rearrange the analytes (select, hold down the Alt key, and drag) so they appear in the same order as in your data.

Tip: You can add a number of analytes in a single step by clicking on Analytes → Add Analytes….

Set units for the various analytes, as necessary: right click in the unit field and choose the desired units from under Change Units, or select Change Units under Analytes on the menubar.

Tip: You can change the units for more than one analyte in one step. Simply select any number of analytes and right-click on the unit field.

Tip: Analyses are sometimes reported in elemental equivalents. For example, sulfate might be reported as "SO4 (as S)", bicarbonate as "HCO3 (as C)", and so on. In this case, right click on the unit of such an analyte and select Convert to Elemental Equivalents.

You can now enter your data into the Data Sheet as you would in an ordinary spreadsheet.

Tip: If you have an analysis below the detection limit, you can enter a field such as "<0.01". Aq•QA knows what this means.
If the analysis reports an analyte was not detected, enter a string such as "n/d" or "--". For missing data, enter a non-numeric string, or simply leave the entry blank.

You of course can type data into the Data Sheet by hand, or paste the values into cells one-by-one. But it's far easier to copy them from an ordinary spreadsheet or other document as a block and paste them all at once into the Aq•QA Data Sheet.

Making sure the analytes appear in the same order as in your spreadsheet, copy the data block, click on the top, leftmost cell in the Aq•QA Data Sheet, and select Edit → Paste, or touch ctrl+V.

Tip: If there are more samples in a data block you are pasting than in your Aq•QA Data Sheet, Aq•QA will make room automatically.

Tip: If the data arranged in your spreadsheet in columns fall in rows in your Aq•QA Data Sheet, or vice-versa, you can transpose the Data Sheet, or simply select Edit → Paste Special → Paste Transposed.

Tip: You can flag data in an Aq•QA Data Sheet that fall outside regulatory guidelines. Select Samples → Check Regulatory Limits, or click on the corresponding toolbar button. Violations on the Data Sheet are flagged in red.

Working With Data

Once you have entered your chemical analyses in the Data Sheet, Aq•QA can tell you lots of useful information.

Click on File → Open… and load file "Example2.aqq" from directory "\Program Files\AqQA\Examples". To see Aq•QA's analysis of one of the samples in the Data Sheet, select the sample by clicking on its handle and then click on the Data Analysis tab. This moves you to the Data Analysis pane.

[Screenshot: the Data Analysis pane. Click on any bar to expand or close up a category; click on the information symbol for more information.]

There are a number of categories in the Data Analysis pane. To open a category, click on the corresponding bar. A second click on the bar closes the category.
Clicking on the information symbol gives more information about the category.

Tip: You can view the data analysis for the previous or next sample in your Data Sheet by clicking on the arrow buttons to the left and right of the top bar in the Data Analysis pane.

The top category, Fluid Properties, identifies the water type, dissolved solids content, density, temperature-corrected conductivity, and hardness, as measured or calculated by Aq•QA.

The next category, Internal Consistency, reports the results of a number of Quality Assurance tests from the American Water Works Association "Standard Methods" reference. For example, Aq•QA checks that anions and cations balance electrically, that TDS and conductivity measurements are consistent with the reported fluid composition, and so on.

The Carbonate Equilibria category tells the speciation of carbonate in solution, the carbonate concentration calculated from measured titration alkalinity and vice-versa, the fluid's calculated saturation state with respect to the calcium carbonate minerals calcite and aragonite, and the calculated partial pressure of carbon dioxide.

The Irrigation Waters category shows the irrigation properties of a sample, and the Geothermometry category shows the results of applying chemical geothermometers to the samples, assuming they are geothermal waters.

Finally, the sample's analysis is displayed in a broad range of units, from mg/kg to molal and molar.

Tip: You can print results in the Data Analysis pane: open the categories you want printed and click on File → Print….

Graphing Data

Aq•QA can display the data in your Data Sheet on a number of the types of plots most commonly used by water chemists.

To try your hand at making a graph, make sure that you have file "Example2.aqq" open. If not, click on File → Open… and select the file from directory "\Program Files\AqQA\Examples".

On the Data Sheet, select the row for Iron. Hold down the ctrl key and select the row for Manganese.
Click on the graph button and select Time Series Plot. The graph appears in Aq•QA as a new pane.

[Screenshot: a time series plot. A control panel at the right selects the analytes and samples to graph; right-clicking a pane's tab lets you change or delete the graph; the Advanced Options… button alters the graph's appearance.]

You can select the analytes and samples to appear in the graph on the control panel to the right of the plot. Right clicking on the pane's tab, along the bottom of the Aq•QA window, lets you change the plot to a different type, or delete it.

Tip: You can alter the appearance of a graph by clicking on the Advanced Options… button on the graph pane.

You can copy the graph (Edit → Copy) and paste it into another program, such as an illustration program like Adobe® Illustrator® or Microsoft® PowerPoint®, or a word processing program like Microsoft® Word®.

Tip: Once you have pasted a graph into an illustration program, you can edit its appearance and content. To do so, select the graphic and ungroup the picture elements (you may need to ungroup them twice).

You can also send it to a printer by clicking on File → Print.

Tip: In addition to copying a graph to the clipboard, you can save it in a file in one of several formats: as a Windows® EMF file, an Encapsulated PostScript® (EPS) file, or a bitmap.
Select File → Save Image As… and select the format from the "Save as type" dropdown menu.

Tip: Select a linear or logarithmic vertical axis for a Series or Time Series plot by unchecking or checking the box labeled "Log Scale" on the Advanced Options… dialog.

Aq•QA can display your data on a broad variety of graphs and diagrams: simply choose a diagram type from the pulldown. In addition to Time Series plots, Aq•QA can produce the following types of diagrams:

- Series diagrams
- Cross plots, in linear and logarithmic coordinates
- Ternary diagrams
- Piper diagrams
- Durov diagrams
- Schoeller diagrams
- Stiff diagrams
- Radial diagrams
- Ion balance diagrams
- Pie charts

Replicates, Standards, and Mixing

Aq•QA can check replicate analyses, compare analyses to a standard, and figure the compositions of sample mixtures.

Replicate analyses are splits of the same sample that have been analyzed more than once, whether by the same or different labs. The analyses, therefore, should agree to within a small margin of error.

To see how this feature works, load (File → Open…) file "Replicates.aqq" from directory "\Program Files\AqQA\Examples". Select samples PCC-2, PCC-2a, and PCC-2b: click on the handle for PCC-2, then hold down the shift key and click on the handle for PCC-2b. Now, click on the replicates button on the toolbar.

A new display appears at the right side of the Aq•QA Data Sheet, or along the bottom if you have transposed it. The display shows the coefficient of variation for each analyte, and whether this value falls within a certain tolerance. Small coefficients of variation indicate good agreement among the replicates.
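For reference, the coefficient of variation is a standard statistic: the standard deviation of the replicate measurements divided by their mean, usually expressed in percent. The manual does not state whether Aq•QA uses the sample (N − 1) or population (N) form, so the following Python sketch, which is independent of Aq•QA and uses hypothetical values, assumes the sample form:

```python
import math

def coefficient_of_variation(values):
    """Percent coefficient of variation: 100 * (sample std deviation / mean)."""
    mean = sum(values) / len(values)
    # sample (N - 1) standard deviation of the replicate measurements
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return 100.0 * std / mean

# three replicate analyses of one analyte (hypothetical values, mg/L)
replicates = [4.1, 4.0, 4.2]
cv = coefficient_of_variation(replicates)
print(round(cv, 2))  # -> 2.44, comfortably inside a 5 percent tolerance
```

A large coefficient of variation for an analyte would flag that the replicate analyses of that analyte disagree and deserve a second look.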
The tolerance, by default, is ±5, but you can set it to another value by clicking on Samples → Set Replicate Tolerance….

A standard is a sample of well-known composition, one that was prepared synthetically, or whose composition has already been analyzed precisely. Enter the known composition as a sample in the Data Sheet and click on Samples → Designate As Standard, or the corresponding toolbar button. Then select an analysis of the standard on the Data Sheet and click on Samples → Compare To Standard, or its toolbar button. The display at the right or bottom of the Data Sheet shows the error in the analysis, relative to the standard. Set the tolerance for the comparison, by default ±10, by clicking on Samples → Set Standard Tolerance….

To find the composition of a mixture of two or more samples, select two or more samples and click on the mixing button on the toolbar. The composition of the mixed fluid appears to the right or bottom of the Data Sheet.

The Data Sheet

About the Data Sheet

The Aq•QA® Data Sheet is a special spreadsheet that holds your chemical data. The data is typically composed of the values measured for various analytes, for a number of samples. You can enter data into a Data Sheet and manipulate it, as described below.

Creating a New Data Sheet

To create a new Aq•QA Data Sheet, select File → New, or touch ctrl+N. An empty Data Sheet, containing a number of analytes, but no data, appears.

The appearance of new Data Sheets is specified by a template. You can create your own template so new Data Sheets contain the analytes you need, in your choice of units, and ordered as you desire. For more information, see Template for New Data Sheets in the Tapping Aq•QA's Power chapter of this guide.

Opening an Existing Data Sheet

Aq•QA files end with the extension ".aqq".
These files contain the data entered in the Data Sheet, as well as any graphs produced and the program's current configuration.

You can open an existing Data Sheet by clicking on File → Open… and selecting a ".aqq" file, either one that you have previously saved or an example file installed with the Aq•QA package. A number of example files are installed in the "Examples" directory within the Aq•QA installation directory (commonly "\Program Files\AqQA").

Layout of the Data Sheet

An Aq•QA Data Sheet contains the values measured for various analytes (Na+, Ca2+, HCO3−, and so on) for any number of samples that have been analyzed. Each piece of information about a sample is considered an analyte, even sample ID, location, sampling date, and so on.

By default, each analyte occupies a row in the Data Sheet, and the samples fall in columns. You can reverse this arrangement, so analytes fall in columns and the samples occupy rows, by clicking on Edit → Transpose Data Sheet. To flip the Data Sheet back to its original arrangement, click on this tab a second time.

You can rearrange the order of analytes or samples on the Data Sheet, as described below under Reordering Rows and Columns.

Selecting Rows and Columns

To select a row or column, click on the marker to the left of a row, or the top of a column. The marker for a row or column appears as a small triangle. Analytes have two markers, one for selecting the entire analyte, and one for selecting only the analyte's data values.

You can select a range of rows or columns by holding down the left mouse button on the marker at the beginning of the range, then dragging the mouse to the marker at the end of the range.
Alternatively, select the beginning of the range, then hold down the shift key and click on the marker for the end of the range.

To select a series of rows or columns that are not necessarily contiguous on the Data Sheet, select the first row or column, then hold down the ctrl key and select subsequent rows or columns.

By clicking on one of the small blue squares at the top or left of the Data Sheet, you can select either the entire sheet, or all of the data values on the sheet.

Reordering Rows and Columns

You can easily rearrange the rows and columns of samples and analytes in your Data Sheet. To do so, first select a row or column, or a range of rows and columns, as described under Selecting Rows and Columns. Then, holding down the alt key, press the left mouse button, drag the selection to its new position, and release the mouse button.

Adding Samples and Analytes

To include more samples or analytes in your Data Sheet, click on Samples → Add Sample, or Analytes → Add Analyte, or simply click on the corresponding buttons on the toolbar. To add several analytes at once, select Analytes → Add Analytes…, which opens a dialog box for this purpose.

When you add an analyte, you choose from among the large number that Aq•QA knows about. These are arranged in categories: inorganics, organics, biological assays, radioactivity, isotopes, and a general category that includes things like pH, temperature, date, and sample location.

If you don't find the analyte you need, you can quickly define your own. Select Analytes → New Analyte…, or New Analyte… from the dropdown menu. For more information about defining analytes, see the Analytes chapter of the guide.

Deleting Samples and Analytes

To delete analytes or samples, select one or more and click on Analytes → Delete, or Samples → Delete. Alternatively, select an analyte or sample and click on the corresponding delete button on the toolbar.

Using Analyte Symbols

Analytes are labeled with names such as Sodium, Calcium, and Bicarbonate.
If you prefer, you can view them labeled with the corresponding chemical symbols, such as Na+, Ca2+, HCO3−. Simply click on View → Show Analyte Symbols. A second click on this tab returns to labeling analytes by name.

Data Cells

Each cell in the data sheet contains one of several types of information:

1. A numerical value, such as the concentration of a species.
2. A character string.
3. A date or a time.

Numerical values are, most commonly, simply a number. You can, however, indicate a lack of data with a character string, such as "n/d" or "Not analyzed", or just leaving the cell empty. If an analysis falls below the detection limit for a species, enter the detection limit preceded by a "<". For example, "<0.01".

Character strings, such as you might enter for the "Sample ID", contain any combination of characters, and can be of any length.

You can enter dates in a variety of formats: "Sep 21, 2003", "9/21/03", "September 23", and so on. Aq•QA will interpret your input and cast it in your local format (e.g., mm/dd/yy in the U.S.). Similarly, enter time as "2:20 PM" or "14:20". Append seconds, if you wish: "2:20:30 PM".

To change the width of the data cells (i.e., the column width), drag the dividing line between columns to the left or right. This changes the width of all the data columns in the Data Sheet.

Entering Data

To enter data into an Aq•QA Data Sheet, you can of course type it in from the keyboard, or paste it into the cells in the Data Sheet, one by one. It is generally more expedient, however, to copy all of the values as a block from a source file, such as a table in a word processing document, or a spreadsheet. To do so, set up your Aq•QA Data Sheet so that it contains the same analytes as the source file, in the same order (see Adding Samples and Analytes above, and Reordering Rows and Columns).
You don’t necessarily need to add samples: Aq•QA will add columns (or rows) to accommodate the data you paste.

Now, select the data block from the source document and copy it to the clipboard. Move to Aq•QA, click on the top, leftmost data cell, and select Edit → Paste. If the source data is arranged in the opposite sense from your Data Sheet (the samples are in rows instead of columns, or vice versa), transpose the Data Sheet (View → Transpose Data Sheet), or select Edit → Paste Special → Paste Transposed.

Changing Units

You can change the units of analytes on the Data Sheet at any time. To do so, select one or more analytes, then click on Analytes → Change Units. Alternatively, right-click and choose a new unit from the options under Change Units. If you have entered numerical data for the analyte (or analytes), you will be given the option of converting the values to the new unit.

Some unit conversions require that the program be able to estimate values for the fluid’s density, dissolved solids content, or both. If you have entered values for the Density or Dissolved Solids analytes, Aq•QA will use these values directly when converting units. If you have not specified this data for a sample, Aq•QA will calculate working values for density and dissolved solids from the chemical analysis provided. It is best, therefore, to enter the complete analysis for a sample before converting units, so that Aq•QA can estimate density and dissolved solids as accurately as possible.

Aq•QA estimates density and dissolved solids using the methods described in the Data Analysis section of the User’s Guide, assuming a temperature of 20°C if none is specified. Aq•QA can estimate density only over the temperature range 0°C–100°C; outside this range, it assumes a value of 1.0 g/cm³, which can be quite inaccurate and lead to erroneous unit conversions.

Using Elemental Equivalents

You may find that some of your analytical results are reported as elemental equivalents.
For example, sulfate might be reported as “SO4 (as S)”, bicarbonate as “HCO3 (as C)”, and so on. In this case, select the analyte or analytes in question and click on Analytes → Convert to Elemental Equivalents. Alternatively, select the analyte(s), then right-click on your selection and choose Convert to Elemental Equivalents. To return to the default setting, select Analytes → Convert to Species, or select the Convert to Species option when you right-click.

Notes and Comments

When you construct a Data Sheet, you may want to save certain notes and comments, such as a site’s location, who conducted the sampling, what laboratory analyzed the samples, and so on. To do so, select File → Notes and Comments… and type the information into the box that appears. This information will be saved with your Aq•QA document; you may access it and alter it at any time.

Flagging Data Outside Regulatory Limits

You can highlight on the Data Sheet concentrations in excess of an analyte’s regulatory limit. Select Samples → Check Regulatory Limits. Concentrations above the limit now appear highlighted in a red font. Select the option a second time to disable it; pressing ctrl+L also toggles the option.

Aq•QA can maintain a regulatory limit for each analyte. The analyte library contains default limits based on U.S. water quality standards at the time of compilation, but you should of course verify these against standards as implemented locally.
You can easily change the limit carried for an analyte, as described in the Analytes chapter of this guide.

Saving Data

Before you exit Aq•QA, you will probably want to save your workspace, which includes the data in your Data Sheet, any graphs you have created, and so on, in a .aqq file. To save your workspace, select File → Save, or click on the save button on the Aq•QA toolbar. To save your workspace as a .aqq file under a different name, select File → Save As… and specify the file’s new name.

You may also want to save the data in the Data Sheet as a file that can be read by other applications, such as Microsoft® Excel®. For information on saving data in this way, see the next section, Exporting Data to Other Software.

Exporting Data to Other Software

When Aq•QA saves a .aqq file, it does so in a special format that includes all of the information about your Aq•QA session, such as the
The document consists of two parts: an author profile and the main text.

Author profile: Hello everyone, I am an author dedicated to creating and sharing high-quality document templates. In this era of information overload, accurate and efficient communication has become especially important. I firmly believe that good communication can build bridges between people, playing an indispensable role in academia, career, and daily life. Therefore, I decided to invest my knowledge and skills into creating valuable documents to help people find inspiration and direction when needed.

Main text: an English essay about the weather forecast in Lanzhou. Three sample essays are provided for the reader's reference.

Essay 1

The Unpredictable Skies of Lanzhou: A Student's Perspective on Weather Forecasting

As a student hailing from the captivating city of Lanzhou, located in the heart of China's Gansu Province, I have developed a keen interest in the ever-changing weather patterns that grace our skies. Nestled along the Yellow River, our city is renowned for its unique geographical location, which often leads to unexpected meteorological phenomena. Keeping a watchful eye on the weather forecast has become a ritual for many of us, as it holds the key to planning our daily activities and ensuring our safety.

Growing up in Lanzhou, I quickly learned that the weather here could be as unpredictable as the winding streets of our ancient city. One moment, the sun would be shining brightly, and the next, dark clouds would gather, threatening to unleash a torrential downpour. This capricious nature of the weather has taught me the importance of being prepared for any eventuality, whether it's carrying an umbrella or donning an extra layer of clothing.

As a student, the weather forecast plays a crucial role in my academic life. During exam season, when the pressure is at its peak, a sudden change in weather conditions can significantly impact my ability to concentrate and perform at my best. On sweltering summer days, the scorching heat can drain my energy, making it challenging to focus on my studies.
Conversely, during the bitter winter months, the frigid temperatures can make the journey to and from school a daunting task.

However, it's not just the extremes that concern us; even the slightest variations in weather can have far-reaching consequences. A bout of heavy rain can turn the city's streets into treacherous rivers, making it difficult for students like myself to navigate our way to class. Conversely, a sudden snowfall can bring the city to a standstill, causing transportation disruptions and forcing schools to close unexpectedly.

Despite the challenges posed by Lanzhou's unpredictable weather, I have learned to embrace the excitement it brings. Watching the clouds gather and dissipate, observing the subtle shifts in wind direction, and feeling the temperature fluctuations have become a part of my daily routine. It's a constant reminder of the power of nature and the importance of respecting the forces that shape our environment.

In recent years, advances in technology have made weather forecasting more accurate and accessible than ever before. Meteorological agencies now employ sophisticated models and algorithms to predict weather patterns with increasing precision. As a tech-savvy student, I have come to rely heavily on mobile applications and online resources that provide real-time updates on weather conditions, ensuring that I am always prepared for whatever Mother Nature has in store.

Yet, even with these technological advancements, there is still an element of uncertainty that surrounds weather forecasting in Lanzhou. The city's unique geographical location, nestled between the Qilian Mountains and the Yellow River basin, creates a complex interplay of factors that can influence the weather in unpredictable ways.
It's a humbling reminder that, despite our best efforts, nature still holds the ultimate trump card.

Nonetheless, the challenges posed by Lanzhou's ever-changing weather have taught me valuable lessons in resilience, adaptability, and respect for the natural world. As I navigate the academic and personal challenges that life as a student presents, I find solace in the knowledge that, just like the weather, every obstacle is temporary, and with perseverance and preparation, I can weather any storm.

In conclusion, the weather forecast in Lanzhou is more than just a report on atmospheric conditions; it's a window into the city's unique character, a reflection of the unpredictable forces that shape our lives, and a constant reminder of the beauty and power of nature. As a student, embracing the uncertainties of the weather has taught me invaluable lessons that will undoubtedly serve me well as I embark on the journey of life beyond the classroom walls.

Essay 2

An Unexpected Surprise: Lanzhou's Peculiar Weather Forecast

As I was getting ready for school this morning, I couldn't help but notice the odd weather report on the TV. Living in Lanzhou, the capital city of Gansu Province in northwest China, we're accustomed to a relatively dry and continental climate. However, today's forecast seemed to defy all logic and normalcy.

The cheery meteorologist on screen announced with a bright smile, "Good morning, Lanzhou! Brace yourselves for a delightfully unexpected surprise today. We're in for a burst of tropical weather conditions unlike anything we've experienced before!"

I nearly choked on my breakfast, certain I had misheard. Tropical weather? In Lanzhou? The city rests on the upper reaches of the Yellow River, surrounded by the arid Gobi Desert and rugged mountain ranges. Surely, this had to be some kind of joke or technical glitch.

Nevertheless, the forecast insisted on defying my disbelief. "That's right, folks!
We can expect scorching temperatures reaching a sweltering 40°C (104°F), coupled with intense humidity levels of 90%. But that's not all! Get ready for torrential downpours, with rainfall accumulations of up to 300 millimeters (12 inches) throughout the day."My jaw must have hit the floor at that point. Lanzhou receives an average annual precipitation of merely 315 millimeters (12.4 inches), and the thought of that much rain falling in a single day was mind-boggling.As if that wasn't enough, the meteorologist continued, "And hold on to your hats, folks, because we're also anticipating hurricane-force winds gusting up to 200 kilometers per hour (124 mph)! It's going to be one wild ride, so make sure to secure any loose objects and seek shelter when necessary."I couldn't believe what I was hearing. Lanzhou, a city known for its dry, temperate climate, was supposedly going to transform into a tropical cyclone zone overnight. This had to be a prank, right?Skeptical yet intrigued, I decided to keep an open mind and see how the day would unfold. After all, stranger things have happened in the world of weather.As I stepped outside, the first thing that hit me was the overwhelming humidity. The air felt thick and heavy, like walking through a sauna. Beads of sweat immediately formed on my forehead, and my clothes clung to my skin uncomfortably.Pushing through the oppressive heat, I made my way to school, only to be met with a torrential downpour halfway there. Within seconds, I was drenched from head to toe, my backpack soaked through and weighing a ton. The streets quickly transformed into raging rivers, with water levels rising rapidly.Seeking refuge under a shop's awning, I watched in awe as the storm intensified. Winds howled ferociously, whipping debris through the air and threatening to sweep me off my feet. 
Trees bent precariously, their branches thrashing violently, and streetlights swayed ominously.

Just when I thought the situation couldn't get any more surreal, a massive flash of lightning illuminated the sky, followed by a deafening clap of thunder that rattled the windows around me. The hair on the back of my neck stood on end, and I couldn't help but feel a sense of awe mixed with trepidation.

As the hours ticked by, the tropical onslaught showed no signs of letting up. News reports flooded in, detailing widespread flooding, power outages, and even a few tornado sightings on the city's outskirts.

By the time I finally made it to school, the campus resembled a war zone. Fallen trees and branches littered the grounds, and the once-pristine lawns had transformed into muddy quagmires. Several classrooms were flooded, forcing the cancellation of afternoon classes.

During our lunch break, my friends and I huddled in the cafeteria, swapping stories of our harrowing journeys through the storm. Some had even witnessed roof tiles being ripped off buildings or cars being swept away by the raging floodwaters.

As the day drew to a close, the meteorologists offered a glimmer of hope, predicting that the tropical conditions would begin to subside by the following morning. However, they warned that the aftermath would be significant, with widespread damage and cleanup efforts required throughout the city.

On my way home, I couldn't help but feel a sense of disbelief and wonder at the day's events. Lanzhou, a city known for its dry, temperate climate, had been transformed into a tropical paradise (or nightmare, depending on your perspective) in a matter of hours.

As I finally reached the sanctuary of my home, I couldn't help but reflect on the incredible power of nature and the unpredictability of weather patterns.
What had seemed like an ordinary day had turned into an adventure straight out of a Hollywood disaster movie.

In the end, this unexpected tropical surprise left me with a newfound appreciation for the meteorologists and their tireless efforts to forecast and prepare us for Mother Nature's whims. It also served as a humbling reminder that no matter how advanced our technology or knowledge may be, the forces of nature still possess the ability to surprise and astound us.

As for Lanzhou, well, let's just say we'll be stocking up on raincoats and umbrellas from now on, just in case the tropics decide to pay us another unexpected visit.

Essay 3

A Grey and Hazy Future: Lanzhou's Troubling Weather Forecast

As a student living in the city of Lanzhou, the capital of Gansu Province in northwest China, the weather forecast has become a topic of significant concern and anxiety for me and my classmates. Surrounded by mountains and situated in a semi-arid climate zone, Lanzhou has long faced environmental challenges, but recent projections paint a grim picture of what lies ahead.

According to the latest meteorological data and climate models, Lanzhou is expected to experience an increasing number of days with severe air pollution, extreme temperatures, and water scarcity over the next decade. These alarming trends not only threaten our quality of life but also raise serious questions about the long-term sustainability of our city.

One of the most pressing issues is the prevalence of air pollution, which has become an all-too-familiar aspect of life in Lanzhou. The city's geography, with its surrounding mountains trapping pollutants, combined with industrial emissions and vehicle exhaust, has created a perfect storm for poor air quality. In recent years, we've witnessed an alarming rise in the number of days with hazardous levels of particulate matter (PM2.5) and other harmful pollutants.

The weather forecast for the upcoming years only paints a bleaker picture.
Meteorologists predict that Lanzhou will experience a significant increase in the number of days with severe smog, often lasting for weeks at a time. During these periods, the air becomes thick and hazy, making it difficult to breathe and forcing schools and businesses to close temporarily.

As a student, the impact of air pollution on our health and education is a major concern. Exposure to high levels of particulate matter has been linked to respiratory problems, heart disease, and even cognitive impairment. Numerous studies have shown that air pollution can negatively affect children's lung development and academic performance. It's heartbreaking to see my younger siblings and their classmates struggle with asthma and other respiratory issues exacerbated by the poor air quality.

Unfortunately, air pollution is not the only challenge we face. The weather forecast also warns of an increase in the frequency and intensity of extreme temperatures, both hot and cold. Lanzhou's continental climate has always been characterized by hot summers and cold winters, but climate change is amplifying these extremes.

During the summer months, we can expect more frequent and prolonged heatwaves, with temperatures soaring well above 40°C (104°F). These scorching conditions not only make outdoor activities unbearable but also put a strain on the city's energy resources as air conditioning usage skyrockets. Moreover, the risk of heat-related illnesses, such as heat stroke and dehydration, becomes a significant concern for vulnerable populations, including the elderly and young children.

In contrast, the winters in Lanzhou are expected to become even harsher, with longer periods of extreme cold and heavy snowfall. While we're accustomed to dealing with frigid temperatures, the forecast suggests that we may face more frequent and intense cold snaps, with temperatures plummeting below -20°C (-4°F).
These conditions can lead to disruptions in transportation, power outages, and increased heating costs, placing a significant burden on families and the local economy.

Perhaps the most alarming aspect of Lanzhou's weather forecast is the projected water scarcity. As a semi-arid region, Lanzhou has long relied on the Yellow River and other water sources for its freshwater supply. However, climate change is expected to further exacerbate the already strained water resources in the region.

The forecast indicates a significant decrease in precipitation levels, coupled with more frequent and prolonged droughts. This combination could potentially lead to water shortages, affecting not only households but also agricultural production and industrial activities. The prospect of water rationing and potential conflicts over limited resources is a genuine concern for our community.

As students, we are taught about the importance of environmental stewardship and sustainable development, but the reality we face in Lanzhou makes it challenging to remain hopeful. We watch helplessly as our city grapples with the consequences of pollution, climate change, and resource depletion, and we worry about the future that awaits us.

Despite the grim forecast, we must not lose hope. It is crucial for us, as the next generation, to actively participate in finding solutions and advocating for change. We must demand that our local authorities and policymakers take decisive action to address these issues, from implementing stricter environmental regulations to investing in renewable energy sources and water conservation measures.

Furthermore, we must educate ourselves and our peers about the importance of adopting sustainable lifestyles. Simple actions, such as reducing energy consumption, using public transportation, and minimizing waste, can collectively make a significant impact.
By fostering a culture of environmental awareness and responsibility, we can work towards mitigating the adverse effects of climate change and preserving our city's natural resources.

As I look out of my classroom window and see the hazy skyline, I am reminded of the challenges we face. But I also see the determination and resilience of my peers, who refuse to accept this grim future as inevitable. We are the future of Lanzhou, and it is our responsibility to fight for a cleaner, more sustainable, and livable city.

The weather forecast may paint a bleak picture, but it is also a call to action. By working together, embracing innovation, and prioritizing environmental protection, we can create a brighter future for Lanzhou – a future where we can breathe clean air, enjoy moderate temperatures, and have access to clean water. It won't be easy, but as students, we must be the driving force behind this change, for the sake of our city, our health, and the generations to come.
The Imbalance Coefficient

In the realm of machine learning, the imbalance coefficient holds significant importance, especially when dealing with datasets that exhibit a significant disparity in the number of instances between different classes. This disparity, often referred to as class imbalance, can lead to challenges in training accurate and reliable models. The imbalance coefficient serves as a metric to quantify this imbalance, allowing researchers and practitioners to assess the severity of the issue and take appropriate measures to address it.

The imbalance coefficient is typically calculated as the ratio between the number of instances in the majority class and the number of instances in the minority class. A higher imbalance coefficient indicates a more severe imbalance, which can lead to issues such as bias towards the majority class and poor performance on the minority class. This can be problematic in scenarios where accurate predictions on the minority class are crucial, such as in fraud detection or rare disease diagnosis.

To address class imbalance, several strategies can be employed. One common approach is oversampling, which involves generating synthetic instances of the minority class to increase its representation in the dataset. Another approach is undersampling, which involves reducing the number of instances in the majority class to balance the classes. Both approaches aim to create a more balanced dataset that can lead to improved model performance.

However, it is important to note that simply balancing the classes may not always be sufficient. The imbalance coefficient, although a useful metric, does not capture all the nuances of class imbalance. For instance, the distribution of instances within each class may still be highly skewed, even if the overall class counts are balanced.
In such cases, more advanced techniques such as cost-sensitive learning or ensemble methods may be required to effectively handle the imbalance.

In addition, it is crucial to evaluate model performance not just on the overall dataset but also on each individual class. Metrics such as precision, recall, and F1-score provide a more nuanced understanding of model performance, especially in the context of class imbalance. By monitoring these metrics, researchers and practitioners can assess whether their strategies to address imbalance are effective and make informed decisions about model improvements.

In conclusion, the imbalance coefficient plays a pivotal role in machine learning, particularly when dealing with datasets exhibiting class imbalance. It serves as a valuable metric to quantify the severity of the imbalance and guide strategies to address it. By understanding and effectively addressing class imbalance, researchers and practitioners can develop more accurate and reliable models that perform well across all classes, leading to improved outcomes in various real-world applications.
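As a concrete sketch, the ratio, the random-oversampling strategy, and the per-class metrics discussed above can be written in a few lines of plain Python. The helper names here (imbalance_coefficient, random_oversample, per_class_metrics) are illustrative, not taken from any particular library:

```python
import random

def imbalance_coefficient(labels):
    """Ratio of the majority-class count to the minority-class count."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / min(counts.values())

def random_oversample(data, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until every
    class matches the majority-class count (simple oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extras = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extras:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1 for a single class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

On a 90/10 binary dataset, for example, the coefficient is 9.0; after random oversampling both classes contain 90 examples and the coefficient drops to 1.0.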
Uses of variance and variation

Variance and variation are terms commonly used in statistics and probability to describe the extent, dispersion, or spread of data. While they are related concepts, there are subtle differences in their usage. In this essay, we will explore the definitions, applications, and significance of variance and variation in various fields.

Variance is a statistical measure that quantifies how spread out or dispersed a set of values is. It is calculated as the average of the squared differences from the mean of the data set. The basic idea behind variance is to determine the average distance between each data point and the mean. A larger variance indicates a greater spread or dispersion of data, while a smaller variance indicates a more concentrated cluster of values around the mean.

Variance is widely used in fields such as finance, economics, engineering, and physics. In finance, for example, variance is a key measure of volatility in asset prices. Higher variance implies greater price fluctuations, making an investment riskier. Economists use variance to assess the volatility of economic indicators like GDP, inflation rates, and stock market returns. In engineering, variance helps evaluate the consistency and reliability of processes or systems. For instance, in manufacturing, measuring the variance of product dimensions ensures quality control. In physics, variance is used to analyze the fluctuations or noise in experimental measurements.

On the other hand, variation refers to the range or diversity of values within a data set or population. It provides a measure of how different individual observations are from one another. Variation can be expressed in several ways, such as the range (maximum minus minimum), interquartile range (middle 50% of observations), or coefficient of variation (standard deviation divided by the mean). Variation is used in fields including biology, genetics, ecology, and social sciences.
In biology and genetics, variation is crucial for understanding the diversity of traits within a species or population. It helps researchers study genetic variability, evolution, and adaptability. In ecology, variation is used to analyze how different environmental factors impact species diversity, population dynamics, and ecosystem stability. Social scientists use variation to investigate differences in attitudes, behaviors, socioeconomic factors, or cultural practices across different groups or regions.

While variance and variation share similarities, they serve distinct purposes in statistics. Variance focuses specifically on the dispersion or spread of data around the mean. It provides a quantitative measure of the average distance between individual values and the central tendency of the data set. On the other hand, variation encompasses a broader concept that considers the entire range of values or patterns in a data set. It quantifies the degree of diversity, heterogeneity, or variability within the set.

Both variance and variation play vital roles in hypothesis testing, modeling, and decision-making. They help researchers and practitioners make inferences, draw comparisons, and evaluate statistical significance. For example, when testing the effectiveness of a new drug, variance allows researchers to assess the consistency and reliability of treatment outcomes. In a manufacturing process, variation analysis helps identify sources of defects, optimize performance, and minimize waste. Moreover, in social sciences, analyzing variation across different groups provides insights into social inequalities, policy implications, or cultural differences.

In conclusion, variance and variation are critical statistical measures used to analyze the spread, diversity, or dispersion of data. Variance focuses on the differences between individual values and the mean, providing a measure of how spread out the values are.
Variation, on the other hand, considers the entire range of values or patterns within a data set, quantifying the degree of diversity or heterogeneity. Both concepts are extensively applied in various fields and are instrumental in decision-making, evaluating statistical significance, and understanding patterns in data.
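To make the distinction concrete, here is a small Python sketch that computes the variance alongside the three variation measures named above: the range, the interquartile range, and the coefficient of variation. The function name dispersion_summary is illustrative:

```python
import statistics

def dispersion_summary(xs):
    """Summarize spread: population variance (dispersion around the
    mean) plus three variation measures (range, IQR, coefficient
    of variation)."""
    mean = statistics.fmean(xs)
    variance = statistics.pvariance(xs)        # average squared deviation
    data_range = max(xs) - min(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)  # quartiles
    cv = statistics.pstdev(xs) / mean          # std. dev. relative to mean
    return {"variance": variance, "range": data_range,
            "iqr": q3 - q1, "cv": cv}
```

For the data 1, 2, 3, 4, 5, 6, 7, this gives a variance of 4, a range of 6, an interquartile range of 4, and a coefficient of variation of 0.5.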
Torpid Mixing of Simulated Tempering on the Potts Model

Nayantara Bhatnagar    Dana Randall

Abstract

Simulated tempering and swapping are two families of sampling algorithms in which a parameter representing temperature varies during the simulation. The hope is that this will overcome bottlenecks that cause sampling algorithms to be slow at low temperatures. Madras and Zheng demonstrate that the swapping and tempering algorithms allow efficient sampling from the low-temperature mean-field Ising model, a model of magnetism, and a class of symmetric bimodal distributions [10]. Local Markov chains fail on these distributions due to the existence of bad cuts in the state space.

Bad cuts also arise in the q-state Potts model, another fundamental model for magnetism that generalizes the Ising model. Glauber (local) dynamics and the Swendsen-Wang algorithm have been shown to be prohibitively slow for sampling from the Potts model at some temperatures [1, 2, 6]. It is reasonable to ask whether tempering or swapping can overcome the bottlenecks that cause these algorithms to converge slowly on the Potts model.

We answer this in the negative, and give the first example demonstrating that tempering can mix slowly. We show this for the 3-state ferromagnetic Potts model on the complete graph, known as the mean-field model. The slow convergence is caused by a first-order (discontinuous) phase transition in the underlying distribution. Using this insight, we define a variant of the swapping algorithm that samples efficiently from a class of bimodal distributions, including the mean-field Potts model.

1 Introduction

The standard approach to sampling via Markov chain Monte Carlo algorithms is to connect the state space of configurations via a graph called the Markov kernel. The Metropolis algorithm prescribes transition probabilities on the edges of the kernel so that the chain will converge to any desired distribution [14]. Unfortunately, for some natural choices of the Markov kernel, the Metropolis Markov chain can converge slowly.

Tempering and swapping simulate a sequence of distributions π_0, …, π_M that interpolate in temperature: at the lowest temperature, π_M is the goal distribution from which we wish to generate samples; at the highest temperature, π_0 is typically less interesting, but the rate of convergence is fast. A Markov chain that keeps modifying the distribution, interpolating between these two extremes, may produce useful samples efficiently. Despite the extensive use of simulated tempering and swapping in practice, there has been little formal analysis. A notable exception is work by Madras and Zheng [10] showing that swapping converges quickly for two simple, symmetric distributions, including the mean-field Ising model.

1.2 Results. In this work, we show that for the mean-field Potts model, tempering and swapping require exponential time to converge to equilibrium. The slow convergence of the tempering chain on the Potts model is caused by a first-order (discontinuous) phase transition. In contrast, the Ising model studied by Madras and Zheng has a second-order (continuous) phase transition, which distinguishes why tempering works for one model and not the other.

In addition, we give the first Markov chain algorithm that is provably rapidly mixing on the Potts model. Traditionally, swapping is implemented by defining a set of interpolating distributions where a parameter corresponding to temperature is varied. We make use of the fact that there is greater flexibility in how we define the set of interpolants. Finally, our analysis extends the arguments of Madras and Zheng showing that swapping is fast on symmetric distributions so as to include asymmetric generalizations.

2 Preliminaries

2.1 The q-state Potts model. The Potts model was defined by R. B. Potts in 1952 to study ferromagnetism and anti-ferromagnetism [15]. The interactions between particles are modeled by an underlying graph G = (V, E), with edges between particles that influence each other. Each of the vertices of the underlying graph is assigned one of q different spins (or colors). A configuration σ is an assignment of spins to the vertices, where σ_i denotes the spin at the vertex i. The energy of a configuration is a function of the
Hamiltonian

H(σ) = −J Σ_{(i,j) ∈ E} δ(σ_i, σ_j),

where δ is the Kronecker δ-function that takes the value 1 if its arguments are equal and zero otherwise. When J > 0 the model corresponds to the ferromagnetic case, where neighbors prefer the same color, while J < 0 corresponds to the anti-ferromagnetic case, where neighbors prefer to be differently colored. The state space Ω of the q-state ferromagnetic Potts model is the space of all q-colorings of V. We will thus use colorings and configurations interchangeably. Define the inverse temperature β; the Gibbs distribution is

π_β(σ) = e^{−β H(σ)} / Z(β),

where Z(β) = Σ_σ e^{−β H(σ)} is the normalizing constant. Note that at β = 0, this is just the uniform distribution on all (not necessarily proper) q-colorings of V.

We consider the ferromagnetic mean-field model, where G is the complete graph K_n on n vertices and all pairs of particles influence each other. For the 3-state Potts model we take q = 3 and J = 1. Let a, b, and c be the number of vertices assigned the first, second, and third colors. Letting x = (a, b, c), we can rewrite the Gibbs distribution for the 3-state Potts model as

π_β(x) = (1 / Z(β)) (n! / (a! b! c!)) e^{β (C(a,2) + C(b,2) + C(c,2))},

where C(·,2) denotes the number of pairs.

Definition 2.2. Let d(t) = max_x ||P^t(x, ·) − π||_tv; then the mixing time is

τ(ε) = min { t : d(t′) ≤ ε for all t′ ≥ t }.

A chain is rapidly mixing if the mixing time is bounded above by a polynomial in n and log(1/ε).

2.2.1 The Metropolis algorithm. Markov kernels connect the state space: vertices are configurations and edges are allowable one-step transitions. The transition probabilities on a kernel K are defined as

P(x, y) = (1/Δ) min(1, π(y)/π(x))

for all x ≠ y that are neighbors in K, where Δ is the maximum degree of K. It is easy to verify that if the kernel is connected then π is the stationary distribution.

For the Potts model, a natural choice for the Markov kernel is to connect configurations at Hamming distance one. Unfortunately, for large values of β, the Metropolis algorithm converges exponentially slowly on the Potts model for this kernel [1, 2]. This is because the most probable states are largely monochromatic, and to go from a predominantly red configuration to a predominantly blue one we would have to pass through states that are highly unlikely at low temperatures.

2.2.2 Simulated tempering. Simulated tempering attempts to overcome this bottleneck by introducing a temperature parameter that is varied during the simulation, effectively modifying the distribution being
sampled from. Let β₀ = 0 < β₁ < · · · < β_M be a set of inverse temperatures. The state space of the tempering chain is Ω × {0, . . . , M}, which we can think of as the union of M + 1 copies of the original state space, each corresponding to a different inverse temperature. Our choice of β₀ = 0 corresponds to infinite temperature, where the Metropolis algorithm converges rapidly to stationarity (on the uniform distribution), and β_M is the inverse temperature at which we wish to sample. We interpolate by setting π_i(σ) ∝ e^{−β_i H(σ)} at level i. The tempering Markov chain consists of two types of moves: level moves, which update the configuration while keeping the temperature fixed, and temperature moves, which update the temperature while remaining at the same configuration. A level move at level i updates the configuration with the Metropolis probability of going from σ to τ according to the stationary distribution π_i. A temperature move proposes changing the level from i to i ± 1 (chosen uniformly) and accepts with the corresponding Metropolis probability.

2.2.3 The swapping algorithm. A configuration in the swapping chain is an (M + 1)-tuple (x₀, . . . , x_M), where each component x_i represents a configuration chosen from the distribution π_i. The probability distribution is the product measure π(x₀, . . . , x_M) = Π_i π_i(x_i). The swapping chain also consists of two types of moves. A level move updates a single component x_i according to the Metropolis chain for π_i. A swap move proposes exchanging the configurations at two adjacent levels i and i + 1, and accepts with probability min(1, π_i(x_{i+1}) π_{i+1}(x_i) / (π_i(x_i) π_{i+1}(x_{i+1}))). Notice that now the normalizing constants cancel out. Hence, implementing a move of the swapping chain is straightforward, unlike tempering, where good approximations for the partition functions are required. Zheng proved that fast mixing of the swapping chain implies fast mixing of the tempering chain [17], although the converse is unknown. For both tempering and swapping, we must be careful about how we choose the number of distributions M + 1. It is important that successive distributions π_i and π_{i+1} have sufficiently small variation distance so that temperature moves are accepted with nontrivial probability. However, M must be small enough that it does not blow up the running time of the algorithm. Following [10], we set the inverse temperatures accordingly.
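The two kinds of moves can be made concrete. Below is a minimal sketch of a swapping chain on the toy bimodal distribution π_β(x) ∝ exp(β|x|) over the integers {−L, . . . , L}, in the spirit of the exponential example analyzed later. The schedule BETAS, the range L, and the ±1 proposal are our own illustrative choices, not the paper's:

```python
import math
import random

# Sketch of the swapping (parallel tempering) chain on a toy bimodal
# target pi_beta(x) ∝ exp(beta * |x|), x in {-L, ..., L}.
L = 15
BETAS = [i / 4 for i in range(5)]          # beta_0 = 0 < ... < beta_M

def log_pi(beta, x):
    """Unnormalized log-density at inverse temperature beta; Z never appears."""
    return beta * abs(x)

def swapping_step(state, rng):
    M = len(BETAS) - 1
    if rng.random() < 0.5:
        # Level move: Metropolis update of one component at its own level.
        i = rng.randrange(M + 1)
        x = state[i]
        y = max(-L, min(L, x + rng.choice((-1, 1))))
        if math.log(max(rng.random(), 1e-300)) < log_pi(BETAS[i], y) - log_pi(BETAS[i], x):
            state[i] = y
    else:
        # Swap move: exchange configurations at adjacent levels i, i+1,
        # accepted with min(1, pi_i(y) pi_{i+1}(x) / (pi_i(x) pi_{i+1}(y))).
        # Only unnormalized densities appear: the constants cancel.
        i = rng.randrange(M)
        x, y = state[i], state[i + 1]
        dlog = (log_pi(BETAS[i], y) + log_pi(BETAS[i + 1], x)
                - log_pi(BETAS[i], x) - log_pi(BETAS[i + 1], y))
        if math.log(max(rng.random(), 1e-300)) < dlog:
            state[i], state[i + 1] = y, x

rng = random.Random(1)
state = [0] * len(BETAS)
for _ in range(20000):
    swapping_step(state, rng)
print(state)
```

Because the swap acceptance ratio is a ratio of unnormalized densities, no partition function is ever computed, which is exactly the practical advantage over tempering noted in the text.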
This ensures that for the values of β at which we wish to sample, the ratio of π_i and π_{i+1} is bounded from above and below by a constant.

3 Torpid Mixing of Tempering on the Potts Model

We will show lower bounds on the mixing time of the tempering chain on the mean-field Potts model by bounding the spectral gap of the transition matrix of the chain. Let 1 = λ₀ > λ₁ ≥ · · · be the eigenvalues of the transition matrix. The mixing time is related to the spectral gap 1 − λ₁ of the chain by the following theorem (see [16]):

THEOREM 3.1. Let ε > 0. For all starting states x,
(a) τ_x(ε) ≤ (1/(1 − λ₁)) log(1/(π(x)ε)).
(b) τ(ε) ≥ (λ₁/(2(1 − λ₁))) log(1/(2ε)).

The conductance, introduced by Jerrum and Sinclair, provides a good measure of the mixing rate of a chain [7]. For S ⊆ Ω, let

  Q(S, S̄) = Σ_{x∈S, y∉S} π(x)P(x, y),  Φ_S = Q(S, S̄)/π(S).

Then, the conductance¹ is given by Φ = min_{S : π(S) ≤ 1/2} Φ_S. It has been shown by Jerrum and Sinclair [7] that, for any reversible chain, the spectral gap satisfies:

THEOREM 3.2. For any Markov chain with conductance Φ and eigenvalue gap 1 − λ₁,

  Φ²/2 ≤ 1 − λ₁ ≤ 2Φ.

¹It suffices to minimize over sets S of at least inverse-polynomial stationary probability; this decreases the conductance by at most a polynomial factor (see [16]).

Thus, to lower bound the mixing time it is sufficient to show that the conductance is small. If a chain converges rapidly to its stationary distribution it must have large conductance, indicating the absence of a "bad cut," i.e., a set of edges of small capacity separating some set of states from its complement.

The first part of the lemma verifies that at the critical temperature there are 3 ordered modes (one for each color, by symmetry) and 1 disordered mode. In the next lemmas, we show that the disordered mode is separated from the ordered modes by a region of exponentially low density. To do this, we use the second part of Lemma 3.1, which bounds the density of the separating region at each β. Let f be the continuous extension of the discrete function.

LEMMA 3.2. For sufficiently large n, the real function f attains its maximum on the stated line; there, neglecting factors not dependent on n and simplifying using Stirling's formula, we need to check
for the stationary points of the function, comparing the relevant quantities at each stationary point. As β is decreased, the slope of the corresponding line varies. Thus, for a suitable function of β, the claim follows by the second part of Lemma 3.1. Let B be the boundary of the disordered region; this set defines a bad cut in the state space of the tempering chain.

THEOREM 3.3. For sufficiently large n, there exists a set in the tempering chain's state space with exponentially small conductance.

Using the definition of conductance, we find that the relevant quantities are within a linear factor of each other. By Theorem 3.2, the upper bound on the conductance bounds the spectral gap of the tempering chain at the inverse temperature at which we wish to sample. Applying Theorem 3.1, we find that the tempering chain for the 3-state Potts model mixes slowly. As a consequence of Zheng's result that rapid mixing of the swapping chain implies fast mixing of the tempering chain [17], we have also established the slow mixing of the swapping chain for the mean-field Potts model.

4 Modifying the Swapping Algorithm for Rapid Mixing

We now reexamine the swapping chain on two classes of distributions: one is an asymmetric exponential distribution (generalizing a symmetric distribution studied by Madras and Zheng [10]), and the other a class of mean-field models. First, we show that swapping and tempering are fast on the exponential distribution. The proofs suggest that a key idea behind designing fast sampling algorithms for models with first-order phase transitions is to define a new set of interpolants that do not preserve the bad cut. We start with a careful examination of the exponential distribution, since the proofs generalize easily to the new swapping algorithm applied to bimodal mean-field models.

Example I: an exponential distribution with normalizing constant Z. Define the interpolating distributions for the swapping chain analogously, each with its own normalizing constant.

THEOREM 4.1. The swapping chain with suitably chosen inverse temperatures is rapidly mixing on this distribution.

The comparison theorem of Diaconis and Saloff-Coste is useful in bounding the mixing time of a Markov chain when the mixing time of a related chain on the same state space is known. Let
P and P̃ be two Markov chains on a common state space. Let P and π denote the transition matrix and stationary distribution of the first, and let P̃ and π̃ be those of the second. Let E = {(x, y) : P(x, y) > 0} and Ẽ = {(x, y) : P̃(x, y) > 0} be sets of directed edges. For each pair (x, y) such that P̃(x, y) > 0, define a path γ_{xy}, a sequence of states x = x₀, x₁, . . . , x_k = y such that each step traverses an edge of E. Let Γ(z, w) denote the set of endpoints of paths that use the edge (z, w).

THEOREM 4.2. (Diaconis and Saloff-Coste [3]) The spectral gaps of the two chains differ by at most a factor measuring the congestion of the chosen paths.

Decomposition: we partition the state space, analyze the restriction of the chain to each piece, and define the projection chain on the pieces.

4.1 Swapping on the exponential distribution. We are now prepared to prove Theorem 4.1. The state space for the swapping chain applied to Example I is the product of M + 1 copies of the original state space.

DEFINITION 4.1. Let x = (x₀, . . . , x_M). The trace Tr(x) records the sign of each component x_i. The possible values of the trace characterize the partition we use. Letting Ω_s be the set of configurations with trace s, we have the decomposition of the state space into these sets. This partition into sets of fixed trace sets the stage for the decomposition theorem. The restrictions simulate the swapping Markov chain on regions of fixed trace. The projection chain captures the moves between these sets. If we temporarily ignore swap moves on the restrictions, the restricted chains move independently according to the Metropolis probabilities on each of the distributions.
The following lemma reduces the analysis of the restricted chains to analyzing the moves at each fixed temperature.

LEMMA 4.1. (Diaconis and Saloff-Coste [3]) For i = 0, . . . , M, let M_i be a reversible Markov chain on a finite state space. Consider the product Markov chain on the product space, defined by choosing a coordinate at random and updating it according to the corresponding chain; up to the factor M + 1, its spectral gap is the smallest gap among the component chains.

Now each distribution restricted to the positive or negative part is unimodal, suggesting that the restricted chain should be rapidly mixing at each temperature. Madras and Zheng formalize this in [10] and show that the Metropolis chain restricted to the positive or negative part mixes quickly. Thus, from Lemma 4.1 and following the arguments in [10], we can conclude that each of the restricted Markov chains is rapidly mixing.

Bounding the mixing rate of the projection: the projection's state space is an (M + 1)-dimensional hypercube. The stationary probabilities of the projection chain are given by the weights of the corresponding trace sets. This captures the idea that for the true projection chain, swap moves (transpositions) always have constant probability, and at the highest temperature there is high probability of changing sign. Of course there is a chance of flipping the bit at each higher temperature, but we will see that this is not even necessary for rapid mixing. To analyze RW1, we can compare it to an even simpler walk, RW2, that chooses any bit at random and updates it to 0 or 1 with the correct stationary probabilities. It is easy to argue that RW2 converges very quickly, and we use this to infer the fast mixing of RW1. More precisely, let RW2 be a new chain on the hypercube for the purpose of the comparison. At each step it picks a coordinate i uniformly at random and updates that component by choosing a value exactly according to the appropriate stationary distribution. In other words, the component is at stationarity as soon as it is updated. Using the coupon collector's theorem, we have:

LEMMA 4.2. The chain RW2 on the hypercube mixes in time O(M log M).

We are now in a position to prove the following theorem.
THEOREM 4.4. The projection chain RW1 is rapidly mixing.

Let e be a single transition in RW2 that flips one bit. The canonical path for e is the concatenation of three paths; in terms of tempering, the first is a heating phase and the last a cooling phase. The first consists of swap moves carrying the relevant coordinate to the highest temperature; the second consists of one step that flips the bit corresponding to the highest temperature; the third consists of swaps until we reach the endpoint. To bound the congestion in Theorem 4.2, we establish inequality (4.2) for the transitions along each path. First we consider the swap moves; let us assume, without loss of generality, an ordering of the relevant stationary weights. Case 2: again we find equation (4.1) is satisfied. Case 3: similar. By the comparison theorem we find that RW1 mixes rapidly.

Example II: a class of mean-field models. Fix constants and let n be a large integer. The state space of the mean-field model consists of all spin configurations on the complete graph. The probability distribution over these configurations is determined by the inverse temperature β and the k-wise interactions between particles. The Hamiltonian is a sum, over k-tuples of vertices, of the Kronecker δ-function that takes the value 1 if all of its arguments are equal and is 0 otherwise (when k = 1 we set δ = 1 iff the argument equals a distinguished spin). The Gibbs distribution is π(σ) = e^{−βH(σ)}/Z, where Z is the normalizing constant. This can be described by the model in Example II with an appropriate choice of parameters. It can be shown that this distribution is bimodal for all values of the constants and β.

A second important special case included in Example II is the q-state Potts model, where we restrict to a part of the state space fixed by an ordering of the color classes. Consequently, sampling from the restricted space is sufficient, since we can randomly permute the colors once we obtain a sample and get a random configuration of the Potts model on the unrestricted state space. Here the Gibbs distribution becomes a distribution over the color counts, with another normalizing constant. When the dampening function below is taken to be the constant function, we obtain the distributions of the usual swapping algorithm.
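The bimodality of the total spins distribution can be seen directly. The sketch below computes unnormalized log-weights for the number of +1 spins in a mean-field Ising model, assuming the common scaling H = −(2k − n)²/(2n) for a configuration with k plus-spins; the paper's exact constants may differ, so this illustrates only the shape:

```python
import math

def total_spin_distribution(n, beta):
    """Unnormalized log-weights of k = #(+1 spins), mean-field Ising.

    Assumed scaling: H = -(2k - n)**2 / (2n).  The entropy term is the
    binomial coefficient C(n, k), computed via lgamma.
    """
    return [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + beta * (2 * k - n) ** 2 / (2 * n)
            for k in range(n + 1)]

n = 100
low = total_spin_distribution(n, beta=1.5)   # low temperature: bimodal
high = total_spin_distribution(n, beta=0.5)  # high temperature: unimodal

# At beta = 1.5 the balanced count k = n/2 sits in a valley between two
# symmetric modes; at beta = 0.5 it is the global maximum.
assert low[50] < max(low) and low.index(max(low)) != 50
assert high.index(max(high)) == 50
print(low.index(max(low)), high.index(max(high)))
```

The valley at k = n/2 under the ordinary temperature schedule is exactly the kind of "bad cut" that the flat-swap interpolants below are designed to remove.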
The Flat-Swap Algorithm: We shall see that this choice gradually flattens out the total spins distributions uniformly, thus eliminating the bad cut that can occur when we take the constant function. The function effectively dampens the entropy (multinomial) term just as the change in temperature dampens the energy term coming from the Hamiltonian. We have the following theorem.

THEOREM 4.5. The Flat-Swap algorithm is rapidly mixing for any bimodal mean-field model.

To prove Theorem 4.5, we follow the strategy set forth for Theorem 4.1, using decomposition and comparison in a similar manner. For simplicity, we concentrate our exposition here on the Ising model in an external field. The advantage of this special case is that the total spins configurations form a one-parameter family (i.e., the number of vertices assigned +1), much like in Example I. The proofs for the general class of models, including the Potts model, are analogous.

We sketch the proof of Theorem 4.5. For the Ising model, the interpolants are easy to compute given the total spin count. A simple calculation reveals that all the total spins distributions have the same relative shape, but get flatter as the dampening is decreased. This no longer preserves the non-analytic nature of the phase transition seen for the usual swap algorithm. It is this property that makes this choice of distributions useful. The total spins distribution for the Ising model is known to be bimodal, even in the presence of an external field. With our choice of interpolants, it now follows that all distributions are bimodal as well. Moreover, the minima of the distributions occur at the same location for all distributions. Let m be the place at which these minima occur. In order to show that this swapping chain is rapidly mixing we use decomposition. Consider the state space of the swapping chain on the Ising model. Define the trace Tr(x), whose i-th entry is 0 if the number of +1 spins in x_i is less than m, and 1 if it is at least m. The analysis of the restricted chains given in [10] in the context of the Ising model without an
external field can be readily adapted to show the restrictions are also rapidly mixing. The analysis of the projection is analogous to the arguments used to bound the mixing rate of the projection for Example I. Hence, we can conclude that the swapping algorithm is rapidly mixing for the mean-field Ising model at any temperature, with any external field. We leave the details, including the extension to the Potts model, for the full version of the paper.

5 Conclusions

Swapping, tempering and annealing provide a means, experimentally, for overcoming bottlenecks controlling the slow convergence of Markov chains. However, our results offer rigorous evidence that heuristics based on these methods might be incorrect if samples are taken after only a polynomial number of steps. In recent work, we have extended the arguments presented here to show an even more surprising result: tempering can actually be slower than the fixed temperature Metropolis algorithm by an exponential multiplicative factor.

Many other future directions present themselves. It would be worthwhile to continue understanding examples where the standard (temperature based) interpolants fail to lead to efficient algorithms, but nonetheless variants of the swapping algorithm, such as presented in Section 4.3, succeed. The difficulty in extending our methods to more interesting examples, such as the Ising and Potts models on lattices, is that it is not clear how to define the interpolants. We would want a way to slowly modify the entropy term in addition to the temperature, as we did in the mean-field case, to avoid the bad cut arising from the phase transition. It would be worthwhile to explore whether it is possible to determine a good set of interpolants algorithmically by bootstrapping, rather than analytically, as was done here, to define a more robust family of tempering-like algorithms.

Acknowledgments

The authors thank Christian Borgs, Jennifer Chayes, Claire Kenyon, and Elchanan Mossel for useful discussions.
References

[1] C. Borgs, J. T. Chayes, A. Frieze, J. H. Kim, P. Tetali, E. Vigoda, and V. H. Vu. Torpid mixing of some MCMC algorithms in statistical physics. Proc. 40th IEEE Symposium on Foundations of Computer Science, 218–229, 1999.
[2] C. Cooper, M. E. Dyer, A. M. Frieze, and R. Rue. Mixing properties of the Swendsen-Wang process on the complete graph and narrow grids. J. Math. Phys. 41:1499–1527, 2000.
[3] P. Diaconis and L. Saloff-Coste. Comparison theorems for reversible Markov chains. Annals of Applied Probability. 3:696–730, 1993.
[4] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface (E. M. Keramidas, ed.), 156–163. Interface Foundation, Fairfax Station, 1991.
[5] C. J. Geyer and E. A. Thompson. Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc. 90:909–920, 1995.
[6] V. K. Gore and M. R. Jerrum. The Swendsen-Wang process does not always mix rapidly. J. Statist. Phys. 97:67–86, 1995.
[7] M. R. Jerrum and A. J. Sinclair. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation. 82:93–133, 1989.
[8] S. Kirkpatrick, C. D. Gelatt Jr., and M. Vecchi. Optimization by simulated annealing. Science. 220:498–516, 1983.
[9] N. Madras and D. Randall. Markov chain decomposition for convergence rate analysis. Annals of Applied Probability. 12:581–606, 2002.
[10] N. Madras and Z. Zheng. On the swapping algorithm. Random Structures and Algorithms. 22:66–97, 2003.
[11] E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhys. Lett. 19:451–458, 1992.
[12] R. A. Martin and D. Randall. Sampling adsorbing staircase walks using a new Markov chain decomposition method. Proc. 41st Symposium on the Foundations of Computer Science (FOCS 2000), 492–502, 2000.
[13] R. A. Martin and D. Randall. Disjoint decomposition with applications to sampling circuits in some Cayley graphs. Preprint, 2003.
[14] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
[15] R. B. Potts. Some generalized order-disorder transformations. Proceedings of the Cambridge Philosophical Society, 48:106–109, 1952.
[16] A. J. Sinclair. Algorithms for Random Generation and Counting: a Markov Chain Approach. Birkhäuser, 1993.
[17] Z. Zheng. Analysis of Swapping and Tempering Monte Carlo Algorithms. Dissertation, York Univ., 1999.
Data Swapping: Variations on a Theme by Dalenius and Reiss

Stephen E. Fienberg¹ and Julie McIntyre²

¹ Department of Statistics, Center for Automated Learning and Discovery, Center for Computer Communications and Security, Carnegie Mellon University, Pittsburgh, PA 15213-3890, USA. fienberg@
² Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213-3890, USA. julie@

Abstract. Data swapping, a term introduced in 1978 by Dalenius and Reiss for a new method of statistical disclosure protection in confidential data bases, has taken on new meanings and been linked to new statistical methodologies over the intervening twenty-five years. This paper revisits the original (1982) published version of the Dalenius-Reiss data swapping paper and then traces the developments of statistical disclosure limitation methods that can be thought of as rooted in the original concept. The emphasis here, as in the original contribution, is on both disclosure protection and the release of statistically usable data bases.

Keywords: Bounds for table cell entries; Constrained perturbation; Contingency tables; Marginal releases; Minimal sufficient statistics; Rank swapping.

1 Introduction

Data swapping was first proposed by Tore Dalenius and Steven Reiss (1978) as a method for preserving confidentiality in data sets that contain categorical variables. The basic idea behind the method is to transform a database by exchanging values of sensitive variables among individual records. Records are exchanged in such a way as to maintain lower-order frequency counts or marginals.
Such a transformation both protects confidentiality by introducing uncertainty about sensitive data values and maintains statistical inferences by preserving certain summary statistics of the data. In this paper, we examine the influence of data swapping on the growing field of statistical disclosure limitation.

Concerns over maintaining confidentiality in public-use data sets have increased since the introduction of data swapping, as has access to large, computerized databases. When Dalenius and Reiss first proposed data swapping, it was in many ways a unique approach to the problem of providing quality data to users

Currently Visiting Researcher at CREST, INSEE, Paris, France.

J. Domingo-Ferrer and V. Torra (Eds.): PSD 2004, LNCS 3050, pp. 14–29, 2004. © Springer-Verlag Berlin Heidelberg 2004

while protecting the identities of subjects. At the time, most of the approaches to disclosure protection had essentially no formal statistical content, e.g., see the 1978 report of the Federal Committee on Statistical Methodology, FCSM (1978), for which Dalenius served as a consultant.

Although the original procedure was little-used in practice, the basic idea and the formulation of the problem have had an undeniable influence on subsequent methods. Dalenius and Reiss were the first to cast disclosure limitation firmly as a statistical problem. Following Dalenius (1977), Dalenius and Reiss define disclosure limitation probabilistically. They argue that the release of data is justified if one can show that the probability of any individual's data being compromised is appropriately small. They also express a concern regarding the usefulness of data altered by disclosure limitation methods by focusing on the type and amount of distortion introduced in the data. By construction, data swapping preserves lower order marginal totals and thus has no impact on inferences that derive from these statistics.

The current literature on disclosure limitation is highly
varied and combines the efforts of computer scientists, official statisticians, social scientists, and statisticians. The methodologies employed in practice are often ad hoc, and there are only a limited number of efforts to develop systematic and defensible approaches for disclosure limitation (e.g., see FCSM, 1994; and Doyle et al., 2001). Among our objectives here are the identification of connections and common elements among some of the prevailing methods and the provision of a critical discussion of their comparative effectiveness.¹ What we discovered in the process of preparing this review was that many of those who describe data swapping as a disclosure limitation method either misunderstood the Dalenius-Reiss arguments or attempted to generalize them in directions inconsistent with their original presentation.

The paper is organized as follows. First, we examine the original proposal by Dalenius and Reiss for data swapping as a method for disclosure limitation, focusing on the formulation of the problem as a statistical one. Second, we examine the numerous variations and refinements of data swapping that have been suggested since its initial appearance. Third, we discuss a variety of model-based methods for statistical disclosure limitation and illustrate that these have basic connections to data swapping.

2 Overview of Data Swapping

Dalenius and Reiss originally presented data swapping as a method for disclosure limitation for databases containing categorical variables, i.e., for contingency tables. The method calls for swapping the values of sensitive variables among records in such a way that the t-order frequency counts, i.e., entries in the

¹ The impetus for this review was a presentation delivered at a memorial session for Tore Dalenius at the 2003 Joint Statistical Meetings in San Francisco, California. Tore Dalenius made notable contributions to statistics in the areas of survey sampling and confidentiality. In addition to the papers we discuss here, we especially recommend
Dalenius (1977, 1988) to the interested reader.

t-way marginal table, are preserved. Such a transformed database is said to be t-order equivalent to the original database.

The justification for data swapping rests on the existence of sufficient numbers of t-order equivalent databases to introduce uncertainty about the true values of sensitive variables. Dalenius and Reiss assert that any value of a sensitive variable is protected from compromise if there is at least one other database or table, t-order equivalent to the original one, that assigns it a different value. It follows that an entire database or contingency table is protected if the values of sensitive variables are protected for each individual. The following simple example demonstrates how data swaps can preserve second-order frequency counts.

Example: Table 1 contains data for three variables for seven individuals. Suppose variable X is sensitive and we cannot release the original data. In particular, notice that record number 5 is unique and is certainly at risk for disclosure from release of the three-way tabulated data. However, is it safe to release the two-way marginal tables? Table 1b shows the table after a data-swapping transformation. Values of X were swapped between records 1 and 5 and between records 4 and 7. When we display the data in tabular form as in Table 2, we see that the two-way marginal tables have not changed from the original data. Summing over any dimension results in the same 2-way totals for the swapped data as for the original data.
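The arithmetic of this example can be verified mechanically. The following sketch encodes the seven records of Table 1(a), performs the two stated X-swaps, and checks that every 2-way marginal table is unchanged while the 3-way table is not (the helper names are ours):

```python
from collections import Counter
from itertools import combinations

# The seven records of Table 1(a), as (X, Y, Z) triples.
original = [(0, 1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1),
            (1, 1, 1), (1, 0, 0), (1, 0, 0)]

def swap_x(records, i, j):
    """Exchange the X values of records i and j (0-indexed)."""
    recs = [list(r) for r in records]
    recs[i][0], recs[j][0] = recs[j][0], recs[i][0]
    return [tuple(r) for r in recs]

# Swap X between records 1 & 5 and between records 4 & 7 (1-indexed).
swapped = swap_x(swap_x(original, 0, 4), 3, 6)

def margin(records, dims):
    """Frequency counts of the marginal table over the given dimensions."""
    return Counter(tuple(r[d] for d in dims) for r in records)

# All three 2-way marginal tables are preserved by the swaps ...
for dims in combinations(range(3), 2):
    assert margin(original, dims) == margin(swapped, dims)
# ... but the full 3-way table changes, hiding the unique record 5.
assert margin(original, (0, 1, 2)) != margin(swapped, (0, 1, 2))
print("2-way marginals preserved; 3-way table altered")
```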
Thus, there are at least two databases that could have generated the same set of two-way tables. The data for any single individual cannot be determined with certainty from the release of this information alone.

Table 1. Swapping X values for two pairs of records in a 3-variable hypothetical example

  (a) Original data        (b) Swapped data
  Record  X Y Z            Record  X Y Z
  1       0 1 0            1       1 1 0
  2       0 1 0            2       0 1 0
  3       0 0 1            3       0 0 1
  4       0 0 1            4       1 0 1
  5       1 1 1            5       0 1 1
  6       1 0 0            6       1 0 0
  7       1 0 0            7       0 0 0

An important distinction arises concerning the form in which data are released. Releasing the transformed data set as microdata clearly requires that enough data are swapped to introduce sufficient uncertainty about the true values of individuals' data. In simple cases such as the example in Table 1 above, appropriate data swaps, if they exist, can be identified by trial and error. However, identifying such swaps in larger data sets is difficult. An alternative is to release the data in tabulated form. All marginal tables up to order t are unchanged by the transformation. Thus, tabulated data can be released by showing the existence of appropriate swaps without actually identifying them. Schlörer (1981) discusses some of the trade-offs between the two approaches, and we return to this issue later in the context of extensions to data swapping.

Table 2. Tabular versions of original and swapped data from Table 1

  (a) Original data            (b) Swapped data
             Z=0  Z=1                     Z=0  Z=1
  X=0, Y=0:   0    2           X=0, Y=0:   1    1
  X=0, Y=1:   2    0           X=0, Y=1:   1    1
  X=1, Y=0:   2    0           X=1, Y=0:   1    1
  X=1, Y=1:   0    1           X=1, Y=1:   1    0

Dalenius and Reiss developed a formal theoretical framework for data swapping upon which to evaluate its use as a method for protecting confidentiality.
They focus primarily on the release of data in the form of 2-way marginal totals. They present theorems and proofs that seek to determine conditions on the number of individuals, variables, and the minimum cell counts under which data swapping can be used to justify the release of data in this form. They argue that release is justified by the existence of enough 2-order equivalent databases or tables to ensure that every value of every sensitive variable is protected with high probability. In the next section we discuss some of the main theoretical results presented in the paper. Many of the details and proofs in the original text are unclear, and we do not attempt to verify or replace them. Most important for our discussion is the statistical formulation of the problem. It is the probabilistic concept of disclosure and the maintenance of certain statistical summaries that has proved influential in the field.

2.1 Theoretical Justification for Data Swapping

Consider a database in the form of an N × V matrix, where N is the number of individuals and V is the number of variables. Suppose that each of the V variables is categorical with r ≥ 2 categories. Further define parameters a_i, i ≥ 1, that describe lower bounds on the marginal counts. Specifically, a_i = N/m_i, where m_i is the minimum count in the i-way marginal table. Dalenius and Reiss consider the release of tabulated data in the form of 2-way marginal tables. In their first result, they consider swapping values of a single variable among a random selection of k individuals. They then claim that the probability that the swap will result in a 2-equivalent database is

  p ≈ r(V−1)r (πk)(V−1)(r−1).

Observations:
1. The proof of this result assumes that only 1 variable is sensitive.
2. The proof also assumes that variables are independent. Their justification is: "each pair of categories will have a large overlap with respect to k."

But the specific form of independence is left vague. The 2-way margins for X are in fact the
minimal sufficient statistics for the model of conditional independence of the other variables given X (for further details, see Bishop, Fienberg, and Holland, 1975).

Dalenius and Reiss go on to present results that quantify the number of potential swaps that involve k individuals. Conditions on V, N, and a₂ follow that ensure the safety of data released as 2-order statistics. However, the role of k in the discussion of safety for tabulated data is unclear. First they let k = V to get a bound on the expected number of data swaps. The first main result is:

Theorem 1. If V < N/a₂, V ≥ 4, and

  N ≥ 14 a₁ F^{1/(V−1)} V^{(Vr−r+1)/(V−1)}

for some function F, then the expected number of possible data-swaps of k = V individuals involving a fixed variable is ≥ F.

Unfortunately, no detail or explanation is given about the function F. Conditions on V, N, and a₂ that ensure the safety of data in 2-way marginal tables are stated in the following theorem:

Theorem 2. If V < N/a₂ and

  N {log(5NV p*)}^{2/(V−1)} ≥ a₁ V^{(Vr−r+1)/(V−1)},

where p* = log(1 − p)/log(p), then, with probability p, every value in the database is 2-safe.

Observations:
1. The proof depends on the previous result that puts a lower bound on the expected number of data swaps involving k = V individuals. Thus the result is not about releasing all 2-way marginal tables but only those involving a specific variable, e.g., X.
2. The lower bound is a function F, but no discussion of F is provided.

In reading this part of the paper and examining the key results, we noted that Dalenius and Reiss do not actually swap data. They only ask about possible data swaps. Their sole purpose appears to have been to provide a framework for evaluating the likelihood of disclosure.

In part, the reason for focusing on the release of tabulated data is that identifying suitable data swaps in large databases is difficult. Dalenius and Reiss do address the use of data swapping for release of microdata involving non-categorical data. Here, it is clear that a database must be transformed by swapping before it can safely be
released; however, the problem of identifying enough swaps to protect every value in the data base turns out to be computationally impractical. A compromise, wherein data swapping is performed so that t-order frequency counts are approximately preserved, is suggested as a more feasible approach. Reiss (1984) gives this problem extensive treatment, and we discuss it in more detail in the next section.

We need to emphasize that we have been unable to verify the theoretical results presented in the paper, although they appear to be more specialized than the exposition suggests, e.g., being based on a subset of 2-way marginals and not on all 2-way marginals. This should not be surprising to those familiar with the theory of log-linear models for contingency tables, since the cell probabilities for the no-2nd-order-interaction model involving the 2-way margins do not have an explicit functional representation (e.g., see Bishop, Fienberg, and Holland, 1975).
For similar reasons the extension of these results to orders greater than 2 is far from straightforward, and may involve only marginals that specify decomposable log-linear models (cf. Dobra and Fienberg, 2000).

Nevertheless, we find much in the authors' formulation of the disclosure limitation problem that is important and interesting, and that has proved influential in later theoretical developments. We summarize these below.

1. The concept of disclosure is probabilistic and not absolute:
(a) Data release should be based on an assessment of the probability of the occurrence of a disclosure, cf. Dalenius (1977).
(b) Implicit in this conception is the trade-off between protection and utility. Dalenius also discusses this in his 1988 Statistics Sweden monograph. He notes that essentially there can be no release of information without some possibility of disclosure. It is in fact the responsibility of data managers to weigh the risks. Subjects/respondents providing data must also understand this concept of confidentiality.
(c) Recent approaches rely on this trade-off notion, e.g., see Duncan, et al. (2001) and the Risk-Utility frontiers in NISS web-data-swapping work (Gomatam, Karr, and Sanil, 2004).

2. Data utility is defined statistically:
(a) The requirement to maintain a set of marginal totals places the emphasis on statistical utility by preserving certain types of inferences. Although Dalenius and Reiss do not mention log-linear models, they are clearly focused on inferences that rely on t-way and lower order marginal totals. They appear to have been the first to make this a clear priority.
(b) The preservation of certain summary statistics (at least approximately) is a common feature among disclosure limitation techniques, although until recently there was little reference to the role these statistics have for inferences with regard to classes of statistical models.

We next discuss some of the immediate extensions by Dalenius and Reiss to their original data swapping formulation and its principal initial
application. Then we turn to what others have done with their ideas.

2.2 Data Swapping for Microdata Releases

Two papers followed the original data swapping proposal and extended those methods. Reiss (1984) presented an approximate data swapping approach for the release of microdata from categorical databases that approximately preserves t-order marginal totals. He computed relevant frequency tables from the original database, and then constructed a new database elementwise to be consistent with these tables. To do this he randomly selected the value of each element according to a probability distribution derived from the original frequency tables and then updated the tables each time he generated a new element.

Reiss, Post, and Dalenius (1982) extended the original data swapping idea to the release of microdata files containing continuous variables. For continuous data, they chose data swaps to maintain generalized moments of the data, e.g., means, variances and covariances of the set of variables. As in the case of categorical data, finding data swaps that provide adequate protection while preserving the exact statistics of the original database is impractical. They present an algorithm for approximately preserving generalized k-th order moments for the case of k = 2.

2.3 Applying Data Swapping to Census Data Releases

The U.S. Census Bureau began using a variant of data swapping for data releases from the 1990 decennial census. Before implementation, the method was tested with extensive simulations, and the release of both tabulations and microdata was considered (for details, see Navarro, et al. (1988) and Griffin et al. (1989)).
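A much-simplified sketch of Reiss's elementwise construction, assuming we condition only on the first attribute (his actual scheme handles general t-order tables): because each pair count is drawn down without replacement, the 2-way marginals involving attribute 0 are reproduced exactly, while the remaining 2-way marginals are preserved only approximately.

```python
import random
from collections import Counter

def approx_swap_release(records, seed=0):
    """Build a synthetic database record by record, sampling each attribute
    from the *remaining* frequency counts and updating them after each draw.
    A simplified sketch of Reiss's (1984) approximate approach."""
    rng = random.Random(seed)
    n_attrs = len(records[0])
    # Remaining 2-way counts between attribute 0 and each other attribute.
    pair = {j: Counter((r[0], r[j]) for r in records) for j in range(1, n_attrs)}
    first = Counter(r[0] for r in records)
    release = []
    for _ in records:
        v0 = rng.choices(list(first), weights=first.values())[0]
        first[v0] -= 1
        rec = [v0]
        for j in range(1, n_attrs):
            # Sample attribute j conditionally on v0 from remaining counts.
            cand = {vj: c for (a, vj), c in pair[j].items() if a == v0 and c > 0}
            vj = rng.choices(list(cand), weights=cand.values())[0]
            pair[j][(v0, vj)] -= 1
            rec.append(vj)
        release.append(tuple(rec))
    return release
```

Because sampling is without replacement, the counts can never be exhausted prematurely: whenever v0 is still drawable, at least one (v0, vj) pair remains for every j.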
The results were considered to be a success and essentially the same methodology was used for actual data releases.

Fienberg, et al. (1996) describe the specifics of this data swapping methodology and compare it against Dalenius and Reiss' proposal. In the Census Bureau's version, records are swapped between census blocks for individuals or households that have been matched on a predetermined set of k variables. The (k+1)-way marginals involving the matching variables and census block totals are guaranteed to remain the same; however, marginals for tables involving other variables are subject to change at any level of tabulation. But, as Willenborg and de Waal (2001) note, swapping affects the joint distribution of swapped variables, i.e., geography, and the variables not used for matching, possibly attenuating the association. One might aim to choose the matching variables to approximate conditional independence between the swapping variables and the others.

Because the swapping is done between blocks, this appears to be consistent with the goals of Dalenius and Reiss, at least as long as the released marginals are those tied to the swapping. Further, the method actually swaps a specified (but unstated) number of records between census blocks, and this becomes a database from which marginals are released. However, the release of margins that have been altered by swapping suggests that the approach goes beyond the justification in Dalenius and Reiss.

Interestingly, the Census Bureau description of their data swapping methods makes little or no reference to Dalenius and Reiss's results, especially with regard to protection. As for utility, the Bureau focuses on achieving the calculation of summary statistics in released margins other than those left unchanged by swapping (e.g., correlation coefficients) rather than on inferences with regard to the full cross-classification.

Procedures for the U.S. 2000 decennial census were similar, although
with modifications (Zayatz, 2002). In particular, unique records that were at more risk of disclosure were targeted to be involved in swaps. While the details of the approach remain unclear, the Office for National Statistics in the United Kingdom has also applied data swapping as part of its disclosure control procedures for the U.K. 2001 census releases (see ONS, 2001).

3 Variations on a Theme – Extensions and Alternatives

3.1 Rank Swapping

Moore (1996) described and extended the rank-based proximity swapping algorithm suggested for ordinal data by Brian Greenberg in a 1987 unpublished manuscript. The algorithm finds swaps for a continuous variable in such a way that swapped records are guaranteed to be within a specified rank-distance of one another. It is reasonable to expect that multivariate statistics computed from data swapped with this algorithm will be less distorted than those computed after an unconstrained swap. Moore attempts to provide rigorous justification for this, as well as conditions on the rank-proximity between swapped records that will ensure that certain summary statistics are preserved within a specified interval. The summary statistics considered are the means of subsets of a swapped variable and the correlation between two swapped variables. Moore makes a crucial assumption that values of a swapped variable are uniformly distributed on the interval between its bottom-coded and top-coded values, although few of those who have explored rank swapping have done so on data satisfying such an assumption. He also includes both simulations (e.g., for skewed variables) and some theoretical results on the bias introduced by two independent swaps on the correlation coefficient.

Domingo-Ferrer and Torra (2001a, 2001b) use a simplified version of rank swapping in a series of simulations of microdata releases and claim that it provides superior performance among methods for masking continuous data.
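The core of such a rank-proximity swap can be sketched as follows; the pairing rule here (a uniform choice among the next few ranks) is our simplification for illustration, not Moore's exact algorithm.

```python
import random

def rank_swap(values, max_rank_dist, seed=0):
    """Rank-based proximity swap sketch: each not-yet-swapped record is paired
    with a partner whose rank lies within max_rank_dist, and values exchanged."""
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])  # indices by rank
    out = list(values)
    swapped = set()
    for pos, i in enumerate(order):
        if i in swapped:
            continue
        # Candidate partners: records at later ranks within the allowed distance.
        cands = [j for j in order[pos + 1 : pos + 1 + max_rank_dist] if j not in swapped]
        if cands:
            j = rng.choice(cands)
            out[i], out[j] = out[j], out[i]
            swapped.update({i, j})
    return out
```

Because every exchange is between records at most max_rank_dist ranks apart, the displacement of any single value is bounded, which is what makes means over subsets and correlations degrade gracefully rather than arbitrarily.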
Trottini (2003) critiques their performance measures and suggests great caution in interpreting their results.

Carlson and Salabasis (2002) also present a data-swapping technique based on ranks that is appropriate for continuous or ordinally scaled variables. Let X be such a variable and consider two databases containing independent samples of X and a second variable, Y. Suppose that these databases, S1 = [X1, Y1] and S2 = [X2, Y2], are ranked with respect to X. Then for large sample sizes, the corresponding ordered values of X1 and X2 should be approximately equal. The authors suggest swapping X1 and X2 to form the new databases, S∗1 = [X1, Y2] and S∗2 = [X2, Y1]. The same method can be used given only a single sample by randomly dividing the database into two equal parts, ranking and performing the swap, and then recombining. Clearly this method, in either variation, maintains univariate moments of the data.

Carlson and Salabasis' primary concern, however, is the effect of the data swap on the correlation between X and Y. They examine analytically the case where X and Y are bivariate normal with correlation coefficient ρ, using the theory of order statistics, and find bounds on ρ. The expected deterioration in the association between the swapped variables increases with the absolute magnitude of ρ and decreases with sample size. They support these conclusions by simulations.

While this paper provides the first clear statistical description of data swapping in the general non-categorical situation, it has a number of shortcomings.
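The single-sample variant can be sketched directly; the names and the tie-breaking rule (sorting pairs lexicographically) are our own choices, not the authors'.

```python
import random

def cs_rank_swap(pairs, seed=0):
    """Carlson-Salabasis style swap, single-sample variant (sketch):
    split the (x, y) records into two halves at random, rank each half by x,
    exchange the x columns between equally ranked records, then recombine."""
    rng = random.Random(seed)
    recs = list(pairs)
    rng.shuffle(recs)
    half = len(recs) // 2
    s1 = sorted(recs[:half])           # each half ranked by x (ties broken by y)
    s2 = sorted(recs[half : 2 * half])
    swapped = [(x2, y1) for (x1, y1), (x2, _) in zip(s1, s2)]
    swapped += [(x1, y2) for (x1, _), (x2, y2) in zip(s1, s2)]
    return swapped + recs[2 * half :]  # an odd leftover record is kept as-is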
In particular, Fienberg (2002) notes that: (1) the method is extremely wasteful of the data, using 1/2 or 1/3 of it according to the variation chosen, and thus is highly inefficient. Standard errors for swapped data are approximately 40% to 90% higher than for the original unswapped data; (2) the simulations and theory apply only to bivariate correlation coefficients, and the impact of the swapping on regression coefficients or partial correlation coefficients is unclear.

3.2 NISS Web-Based Data Swapping

Researchers at the National Institute of Statistical Sciences (NISS), working with a number of U.S. federal agencies, have developed a web-based tool to perform data swapping in databases of categorical variables. Given user-specified parameters such as the swap variables and the swap rate, i.e., the proportion of records to be involved in swaps, this software produces a data set for release as microdata. For each swapping variable, pairs of records are randomly selected and values for that variable exchanged if the records differ on at least one of the unswapped attributes. This is performed iteratively until the designated number of records have been swapped. The system is described in Gomatam, Karr, Chunhua, and Sanil (2003). Documentation and free downloadable versions of the software are available from the NISS web page.

Rather than aiming to preserve any specific set of statistics, the NISS procedure focuses on the trade-off between disclosure risk and data utility. Both risk and utility diminish as the number of swap variables and the swap rate increase. For example, a high swapping rate implies that data are well-protected from compromise, but also that their inferential properties are more likely to be distorted. Gomatam, Karr and Sanil (2004) formulate the problem of choosing optimal values for these parameters as a decision problem that can be viewed in terms of a risk-utility frontier. The risk-utility frontier identifies the greatest amount of protection achievable for any set of swap variables and
swap rate. One can measure risk and utility in a variety of ways, e.g., the proportion of unswapped records that fall into small-count cells (e.g., with counts less than 3) in the tabulated, post-swap database. Gomatam and Karr (2003, 2004) examine and compare several “distance measures” of the distortion in the joint distributions of categorical variables that occurs as a result of data swapping, including Hellinger distance, total variation distance, Cramer's V, the contingency coefficient C, and entropy. Gomatam, Karr, and Sanil (2004) consider a less general measure of utility: the distortion in inferences from a specific statistical analysis, such as a log-linear model analysis.

Given methods for measuring risk and utility, one can identify optimal releases empirically by first generating a set of candidate releases, performing data swapping with a variety of swap variables and rates, and then measuring risk and utility on each candidate release to allow comparisons. Those pairs that dominate in terms of having low risk and high utility comprise a risk-utility frontier that leads to optimal swaps for allowable levels of risk. Gomatam, Karr, and Sanil (2003, 2004) provide a detailed discussion of choosing swap variables and swap rates for microdata releases of categorical variables.

3.3 Data Swapping and Local Recoding

Takemura (2002) suggests a disclosure limitation procedure for microdata that combines data swapping and local recoding (similar to micro-aggregation). First, he identifies groups of individuals in the database with similar records. Next, he proposes “obscuring” the values of sensitive variables either by swapping records among individuals within groups, or by recoding the sensitive variables for the entire group. The method works for both continuous and categorical variables. Takemura suggests using matching algorithms to identify and pair similar individuals for swapping, although other
methods (e.g., clustering) could be used. The bulk of the paper discusses optimal methods for matching records, and in particular he focuses on the use of Edmonds' algorithm, which represents individuals as nodes in a graph, links the nodes with weighted edges, and then matches individuals by a weight-maximization algorithm. The swapping version of the method bears considerable resemblance to rank swapping, but the criterion for swapping varies across individuals.

3.4 Data Shuffling

Muralidhar and Sarathy (2003a, 2003b) report on their variation of data swapping, which they label data shuffling, in which they propose to replace sensitive data by simulated data with similar distributional properties. In particular, suppose that X represents sensitive variables and S non-sensitive variables. Then they propose a two-step approach:

– Generate new data Y to replace X by using the conditional distribution of X given S, f(X|S), so that f(X|S,Y) = f(X|S). Thus they claim that the released versions of the sensitive data, i.e., Y, provide an intruder with no additional information about f(X|S). One of the problems is, of course, that f is unknown and thus there is information in Y.

– Replace the rank order values of Y with those of X, as in rank swapping.

They provide some simulation results that they argue show the superiority of their method over rank swapping in terms of data protection with little or no loss in the ability to do proper inferences in some simple bivariate and trivariate settings.
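A toy sketch of these two steps, assuming a simple linear-normal model for f(X|S) (the authors' actual procedure is more general, and all names here are ours): the released column contains exactly the original x values, reassigned according to the ranks of the simulated draws.

```python
import random

def data_shuffle(x, s, seed=0):
    """Data-shuffling sketch: fit X ~ a + b*S by least squares, simulate
    replacement values from the fitted normal conditional, then substitute
    the original x values in the rank order of the simulated draws."""
    rng = random.Random(seed)
    n = len(x)
    mx, ms = sum(x) / n, sum(s) / n
    b = sum((si - ms) * (xi - mx) for xi, si in zip(x, s)) / sum(
        (si - ms) ** 2 for si in s)
    a = mx - b * ms
    resid = [xi - (a + b * si) for xi, si in zip(x, s)]
    sd = (sum(r * r for r in resid) / max(n - 2, 1)) ** 0.5
    y = [a + b * si + rng.gauss(0, sd) for si in s]  # step 1: simulated values
    # Step 2, rank replacement: release the original x values, but assigned
    # to records in the rank order of the simulated y.
    order = sorted(range(n), key=lambda i: y[i])
    x_sorted = sorted(x)
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = x_sorted[rank]
    return out
```

The second step is what distinguishes shuffling from pure synthesis: every released value is a genuine x value, so univariate statistics are preserved exactly, while the record-to-value assignment carries only the information in the fitted conditional.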