Table of Contents
About Daymak
Safety
Mobility Scooter Diagram
Riding Instructions
Starting the Vehicle
Steering Lock
Driving the Vehicle
Features: Right Handlebar
Display Dashboard
Left Handlebar
Radio / Mp3
Remote Control
Bluetooth App
Bluetooth Features
Storage
Seat
Charging Your Unit
Best Practices
Safety and Troubleshooting
Mirrors
Kill Switch
Brake Lock
60 Second Check
Service

About Daymak
Daymak is one of Canada's largest alternative vehicle providers. We design, engineer, manufacture, import and repair everything from recreational dirt bikes, go-karts and electric golf cars to alternative transportation solutions such as e-bikes and gas scooters.
Our electric bicycles represent an energy-efficient and eco-friendly alternative for people who need to get around the city. They greatly increase the practicality of bicycle transportation in urban centres. Costing only a few cents to charge, an e-bike can make city life more convenient and much less expensive.
While there are many new green technologies that are still in their infancy, electric bicycles have been developing for 40 years or more. E-bike technology has been dramatically refined since the introduction of the first custom-conversion bicycles. Today, electric bicycles are a supremely reliable and affordable means of transportation.
Daymak is constantly developing new eco-friendly alternative transportation strategies, led by its own Research and Development department in Toronto, Canada. We are always improving our products. Our innovative in-house engineering and quality testing provide customers with many new kinds of reliable, eco-friendly vehicles, designed to help change the lives of our customers and the world.
Daymak warranties, services, and stocks parts for everything it sells. We support our products.
Please feel free to visit our website. You'll find the latest in cool transportation solutions, support for the products you've purchased, and contact information.

Safety
When operating the Rickshaw King, please make sure you adhere to the following:
Always check your mirrors and blind spots when operating the vehicle.
Turn on the headlights when you need additional visibility.
Make sure that your battery power is sufficient before you go out to ride.
Obey all laws of the road.
Periodically charge the unit when it is not in use for long periods of time.
If you bring your charger, avoid shaking or rattling it while riding.
Perform the 60 second check before riding.
Do not take a second passenger.
Do not overcharge the battery by leaving the charger in the charging port. Once the battery is fully charged, remove the charger immediately. Do not try to operate the unit while charging.
Do not let anyone under the age of 16 operate this vehicle.
Do not make sharp or abrupt turns at high speeds, to avoid tipping.
Do not operate under the influence of drugs or alcohol.
Do not completely submerge the unit in water.
Do not operate in harsh weather conditions.
For customer service call 1-800-649-9320.

Mobility Scooter Diagram
Diagram 1: This diagram illustrates the various parts of your mobility scooter. Please note that many of these parts are not user-serviceable and should be repaired only by trained professionals. This is especially true of the electrical systems and the mechanical components.
A) Mirrors  B) Windshield  C) Seat  D) Basket  E) Brake Lights  F) Safety Wheels  G) Rear Wheels  H) Front Wheels  I) Turn Signals  J) Headlight

To Start the Vehicle
Once you have received the vehicle,
sit on the unit and put the keys in the ignition. The ignition is located on the main dash of the unit, below the right handlebar.
To turn on the unit, turn the key clockwise so that it points to the right position. To turn off the unit, turn it in the opposite direction.

Steering Lock
To prevent theft, the Rickshaw King comes with a locking mechanism that locks the wheel perpendicular to the unit.
To engage this:
1) Turn the handlebars so they are facing to the left.
2) With the key in the off position, push the key in further and turn it counterclockwise. Reverse this process to unlock it.
Please note: when engaging the steering lock, if you cannot get it over the stop and into the wheel-lock position, try moving the handlebars slightly in the opposite direction.

Driving the Vehicle
Make sure that you are properly seated on the unit and that the vehicle is on (you will see that the dashboard is lit up). Use the throttle on the right handlebar by rotating it towards you. Pictured to the right are the throttle and the braking mechanism.
Brakes (silver part): pull this towards you to slow down and disengage the motor.
Throttle (gold and black part): rotate this towards you to drive.
We will now go through all the features.

Features
A) High Beam / Low Beam - Push this to the up position to aim the lights higher, or down to aim them lower.
B) Rear Brake Handle - Pull this towards you to engage the rear brakes. Use this first when stopping, before the front brakes.
C) Left Handlebar - Use this to steer the unit.
D) Heated Handgrips - Set this to 1 to turn on heat to the handles. Set it to 0 to turn it off.
E) Horn - Press this to honk the horn.
F) Turn Signals - Push this switch to the left to engage the left turn signal. Set it to the middle and press it in to turn it off. Push it to the right for the right turn signal.

Features
The Rickshaw King features a digital LED display that shows your speed, travel time and mode.
A) Speed - Your current speed in km/h.
B) Battery / Voltage Reading - Shows how much power you have; 5 bars means a full battery.
*Please note the Rickshaw King uses a 60V battery, so a fully charged unit should read approximately 68V.
**Please note that the accurate remaining-battery reading is the one shown at cruising speed, not at a stop or during acceleration.
C) Odometer - Shows the total kilometres travelled.
Thermometer - Shows the outside temperature.

Features
A) Speed Setting - This dial sets the speed you want to travel at. Turn the white marker towards the turtle to go slower and towards the rabbit to go faster.
B) Lights - Set it to the rightmost position for all lights off, the middle position for the rear lights on, and the leftmost position to turn on all lights.
C) Flasher - Press this button to flash the headlights on and off quickly.
D) High Speed / Low Speed - Set the switch to 1 for low speed and to 0 for high speed.
E) Forward / Reverse - Set the switch to 0 to go forward and to 1 to go in reverse.
F) Front Brake - Pull this towards you to engage the front brake.

Radio / Mp3
The Rickshaw King comes with a built-in radio and Mp3 player so you can listen to your music. The Mp3 player can take either a USB stick or a MicroUSB card. Put your songs onto your drive as Mp3s for compatibility.
To access the Mp3 player, unlock the compartment on the centre steering column of the unit with your keys.
A) MicroUSB Card Reader  B) USB Stick Reader
Please note that to play music on your Rickshaw King you must use the remote. See Remote Control for more information.

Remote Control
The Rickshaw King comes with a remote control that allows you to arm and disarm the alarm system as well as change the songs that you are listening to.
A) Alarm / Volume Reset - Once the unit is turned off, press this button to turn on the alarm. Press it again to turn the alarm off. While the unit is on, press this to turn off the radio and reset the volume to low.
*Please note that once you turn the alarm on there is a 5 second delay before it is activated.
B) Play / Pause Music - Press this to turn on the radio / Mp3 player; press it a second time to pause or stop the radio / Mp3 player.
C) Seek Reverse / Last Song / Volume Increase - Press this button to find a station with a lower frequency or to play the previous song on the Mp3 player. Press and hold this button to increase the volume.
D) Seek Forward / Next Song - Press this button to find a station with a higher frequency or to play the next song on the Mp3 player.
E) Source - Press this button once to switch between the Mp3 player and the radio.

Bluetooth App
The Rickshaw King comes with the Daymak Drive Bluetooth app, which allows you to control your speed, acceleration and more. Available for both Android and iOS on the Google Play Store and the App Store, it connects via your smartphone to change the core performance of your vehicle.
To connect to your unit, turn on the Rickshaw King, open the app on your smartphone and tap the Bluetooth device that says Daymak Drive.
*Please note that the first time you connect to the unit you will need a password; input 12345678.
For more information on the Daymak Drive app, you can check out the following links:
/manuals/DAYMAK_APP_MANUAL_IOS.pdf - iOS
/manuals/Daymak%20Drive%20Manual.pdf - Android

Fast Start: Fast start takes priority when it is enabled at the same time as soft start. It reduces the time needed to reach maximum speed; higher values make your bike accelerate quicker (values 1-10). Fast start supersedes soft start when it is activated.
Soft Start: Shows the start mode and range adjustment of the controller while starting. The range is divided into 10 grades; the higher the grade, the slower the start. In this screen, soft start can be set, turned on or off, and turned up or down. The acceleration of your bike will slow down; higher values make acceleration slower (values 1-10).
Overspeed On/Off: Weak magnetic (field weakening) overdrive grades (10 grades). The higher the grade, the faster; that is, the speed goes up to 120%-130%. Low speed ratio is the speed of first gear (10%-80%), working together with the low speed switch.
Forward/Reverse: Spins the motor in the positive or negative direction. Only works if the motor supports this.
Manual Cruise: Turns manual cruise on or off. The controller holds the real-time speed, working together with the manual button. ON/OFF values: the controller keeps the real-time speed when this is turned on.
Auto Cruise: Turn on the Auto Cruise button, or hold the throttle position for 8 seconds, and auto cruise begins. If manual cruise is turned on, auto cruise is invalid. When this is on, the rider must hold the throttle position for 8 seconds to allow the controller to hold the speed. Auto cruise does not work when manual cruise is on.
Speed Limit: Adjusts the highest speed of the vehicle (30%-60%). Too low a speed limit affects starting torque. You can limit the speed of the motor to 30-60%; low values can affect acceleration.
Reverse Speed Limit: Adjusts the highest speed in reverse (10%-100%).
A limit set much too low affects reverse torque. Similar to the speed limit, this controls the spin rate of the motor in reverse (10%-100%); low values will affect acceleration in reverse.
EBS Braking Force: The intensity range of the electronic braking (10 grades); the higher the grade, the stronger the braking, working together with the braking function. Increases the sensitivity of your electronic braking (values 1-10); higher values require less pressure on the brake lever.
Battery Current Limit (A): Adjusts the maximum output current of the battery (50%-100%). Too small an output current affects starting torque. Smaller values can have an effect on acceleration and torque.
Phase Current Limit (A): Adjusts the maximum phase current of the motor (50%-100%). Too small a phase current affects starting torque.
Hall Sensor Phase Angle: The motor hall installation angle (120° or 60°). The phases cannot be matched if the wrong angle is chosen.
Eco Mode: After starting, all currents become weaker. A preset mode that reduces battery current after the bike starts; it is suitable for a small battery and increases mileage.
Adjust Accelerator Curve: Switches the throttle between linear control and nonlinear control, increasing controllability at low speed.
Boost: Higher output mode, raising torque by 20%. Auto shutoff when the controller temperature exceeds 80°C. High torque mode: the controller pushes the torque up by 20%, or until the motor temperature reaches 80°C.
Low Voltage Cutoff: Adjusts the cut-off voltage of the controller. When the battery reaches this voltage, the controller stops working and protects itself. The adjustable range depends on the controller setting. This feature allows you to set a low voltage threshold; if your battery reaches that setting, the unit will shut off.
Motor Lock: Manually locks the motor so that the vehicle cannot be moved. This mode is kept even after the power is turned off, until the next power-on, unless it is shut down from the app; it takes effect when the power switch is turned on. This feature disables the motor and can be used as an anti-theft measure or as a kill switch. (On/Off values)
Restore Factory Settings: Restores the controller's original factory settings. The controller's internal parameters are restored, and all adjusted parameters are replaced with the factory values. A confirmation is requested after clicking, and the command is sent directly without pressing the send button. This restores the settings back to their original state.

Storage
The Rickshaw King comes with a basket for storage as well as a storage compartment under the seat.
To access the storage compartment under the seat, unlock it using the key and lift the seat up.

Seat
The Rickshaw King features an adjustable seat to make sure you are comfortable when you ride.
To pivot the seat, angling it forward or back, lift the lever pictured above and push the seat in the direction you want it to go while holding the lever.
To slide the seat forward or backward, lift the lever pictured above and push the seat in the direction you want it to go while holding the lever.

Charging Your Unit
On the outside of the unit, under the seat, you will find a charging port.
*Please note there is a little flap that you must lift up before you see the port shown in the image on the right.
To charge the unit, take the charger that came with the Rickshaw King and plug it into a wall socket. Then take the other end of the charger and plug it into the unit.
The charger light will be RED while it is charging and GREEN when it is done.

Best Practices
Charge as needed; it is not necessary to run the battery down to 0 before charging.
Plug the charger into the wall first and then into the unit.
Do not leave the charger connected after the battery is fully charged. If you see the charger light is green, unplug it.
It is recommended to get an outlet timer and set it for the recommended charging hours, so that as soon as charging is done the timer cuts the power completely.

Safety and Troubleshooting

Mirrors
If you are having trouble adjusting the mirrors, you may need to adjust them at the base. Lift the cover as seen in figure A and use a wrench to loosen the bolt at figure B. Reposition the mirror as needed, screw it back in, retighten the bolt, and put the cover back over it.

Kill Switch
If you turn the keys in the ignition and get no power at all (nothing lights up), either your battery is completely dead or, more likely, the kill switch is set to off. Lift up the rear seat (see the Storage section for more information) and you will find the kill switch on the left. Flip that switch if it is set to off.

Brake Lock
To prevent the Rickshaw King from rolling backwards when the unit is off, you need to put on the brake lock.
To engage the brake lock:
1) Pull the rear brake (left brake handle) towards you.
2) Using your index finger, pull the black lever towards you as well.
3) Line up the black lever with the raised stop position and then release the rear brake.

60 Second Check
Check the brake levers to make sure they are both firm.
Check the tires to make sure they are firm and that all bolts are tight.
Turn on your bike and check your battery to make sure that you can get to where you need to go.
Make sure your mirrors are tightly secured and adjusted as needed to see behind you.

Service
For all technical service please contact us at 1-800-649-9320 or visit us online. We recommend that you don't try to fix things yourself and that you have a trained technician do any service work.

THANK YOU FOR CHOOSING DAYMAK.
Lesson 1: The Middle Eastern Bazaar
1) Little donkeys thread their way among the throngs of people.
Little donkeys make their way in and out of the moving crowds.
2) Then as you penetrate deeper into the bazaar, the noise of the entrance fades away, and you come to the muted cloth-market.
Then as you go deeper into the market, the noise of the entrance gradually disappears, and you come to the silent cloth-market.
3) They narrow down their choice and begin the really serious business of beating the price down.
After careful search, comparison and some preliminary bargaining, they reduce their choices and get down to the really serious business of convincing the shopkeeper to lower the price.
4) He will price the item high, and yield little in the bargaining.
He will ask a high price for the item and refuse to cut it down by any significant amount.
5) As you approach it, a tinkling and banging and clashing begins to impinge on your ear.
As you get near it, a variety of sounds begin to strike your ear.

Lesson 2: Hiroshima -- the "Liveliest" City in Japan
1) Serious-looking men spoke to one another as if they were oblivious of the crowds about them.
They were so absorbed in their conversation that they seemed not to pay any attention to the people around them.
2) The cab driver's door popped open at the very sight of a traveler.
As soon as the taxi driver saw a traveler, he immediately opened the door.
3) The rather arresting spectacle of little old Japan adrift amid beige concrete skyscrapers is the very symbol of the incessant struggle between the kimono and the miniskirt.
The traditional floating houses among high modern buildings represent the constant struggle between old tradition and new development.
4) I experienced a twinge of embarrassment at the prospect of meeting the mayor of Hiroshima in my socks.
I suffered from a strong feeling of shame when I thought of the scene of meeting the mayor of Hiroshima wearing only my socks.
5) The few Americans and Germans seemed just as inhibited as I was.
The few Americans and Germans seemed just as restrained as I was.
6) After three days in Japan, the spinal column becomes extraordinarily flexible.
After three days in Japan one gets quite used to bowing to people as a ritual to show gratitude.
7) I was about to make my little bow of assent, when the meaning of these last words sank in, jolting me out of my sad reverie.
I was on the point of showing my agreement by nodding when I suddenly realized what he meant. His words shocked me out of my sad, dreamy thinking.
8) I thought somehow I had been spared.
I thought for some reason or other no harm had been done to me.

Lesson 3: Ships in the Desert
1.
The prospects of a good catch looked bleak.
It was not at all possible to catch a large amount of fish.
2. He moved his finger back in time to the ice of two decades ago.
Following the layers of ice in the core sample, his finger came to the place where the layer of ice was formed twenty years ago.
3. ...keeps its engines running to prevent the metal parts from freeze-locking together.
It keeps its engines running for fear that, if they were stopped, the metal parts would freeze solid and the engines would not be able to start again.
4. Considering such scenarios is not a purely speculative exercise.
Thinking about how a series of events might happen as a consequence of the thinning of the polar cap is not just a kind of exercise in conjecture (speculation); it has practical value.
5. Acre by acre, the rain forest is being burned to create fast pasture for fast-food beef...
Bit by bit, trees in the rain forest are felled and the land is cleared and turned into pasture where cattle can be raised quickly and slaughtered, and the beef used in hamburgers.
6. ...which means we are silencing thousands of songs we have never even heard.
Since miles of forest are being destroyed and the habitat for these rare birds no longer exists, thousands of birds which we have not even had a chance to see will become extinct.
7. We are ripping matter from its place in the earth in such volume as to upset the balance between daylight and darkness.
We are using and destroying resources in such a huge amount that we are disturbing the balance between daylight and darkness.
8. Or have our eyes adjusted so completely to the bright lights of civilization that we can't see these clouds for what they are...
Or have we been so accustomed to the bright electric lights that we fail to understand the threatening implication of these clouds?
9. To come at the question another way...
To put forward the question in a different way.
10. ...and have a great effect on the location and pattern of human societies.
...and greatly affect the living places and activities of human societies.
11. We seem oblivious of the fragility of the earth's natural systems.
We seem unaware that the earth's natural systems are delicate.
12. And this ongoing revolution has also suddenly accelerated exponentially.
And this continuing revolution has also suddenly developed at a speed that doubled and tripled the original speed.

Lesson 4: Everyday Use
1. She thinks her sister has held life always in the palm of one hand...
She thinks that her sister has a firm control of her life.
2. "No" is a word the world never learned to say to her.
She could always have anything she wanted, and life was extremely generous to her.
3. Johnny Carson has much to do to keep up with my quick and witty tongue.
The popular TV talk show star, Johnny Carson, who is famous for his witty and glib tongue, has to try hard if he wants to catch up with me.
4.
It seems to me I have talked to them always with one foot raised in flight.
It seems to me that I have talked to them always ready to leave as quickly as possible.
5. She washed us in a river of make-believe.
She imposed on us lots of falsity.
6. ...burned us with a lot of knowledge we didn't necessarily need to know.
She imposed on us a lot of knowledge that is totally useless to us.
7. Like good looks and money, quickness passed her by.
She is not bright, just as she is neither good-looking nor rich.
8. A dress down to the ground, in this hot weather.
Dee wore a very long dress even on such a hot day.
9. You can see me trying to move a second or two before I make it.
You can see me trying to move my body a couple of seconds before I finally manage to push myself up.
10. Anyhow, he soon gives up on Maggie.
Soon he realizes that won't work with Maggie, so he stops trying to shake hands with her.
11. Though, in fact, I probably could have carried it back beyond the Civil War through the branches.
As I see Dee is getting tired of this, I don't want to go on either. In fact, I could have traced it far back before the Civil War along the branches of the family tree.
12. Every once in a while he and Wangero sent eye signals over my head.
Now and then he and Dee communicated through eye contact in a secretive way.
13. Less than that!
If Maggie put the old quilts on the bed, they would be in rags in less than five years.
14. This was the way she knew God to work.
She knew this was God's arrangement.

Lesson 5: Speech on Hitler's Invasion of the U.S.S.R.
1. Hitler was counting on enlisting capitalist and Right Wing sympathies in this country and the U.S.A.
Hitler was hoping that if he attacked Russia, he would win in Britain and the U.S. the support of those who were enemies of Communism.
2. Winant said the same would be true of the U.S.A.
Winant said the United States would adopt the same attitude.
3. ...my life is much simplified thereby.
In this way, my life is made much easier; in this case, it will be much easier for me to decide on my attitude towards events.
4. I see the German bombers and fighters in the sky, still smarting from many a British whipping, delighted to find what they believe is an easier and a safer prey.
I can see the German bombers and fighters in the sky, who, after suffering severe losses in the aerial battle of England, now feel happy because they think they can easily beat the Russian air force without heavy loss.
5. We shall be strengthened and not weakened in determination and in resources.
We shall be more determined and shall make better and fuller use of our resources.
6. Let us redouble our exertions, and strike with united strength while life and power remain.
Let us strengthen our unity and our efforts in the fight against Nazi Germany while we have not yet been overwhelmed and while we are still powerful.

Lesson 6: Blackmail
1. The house detective's piggy eyes surveyed her sardonically from his gross jowled face.
The house detective's small narrow eyes looked her up and down scornfully from his fat face with a heavy jowl.
2. Pretty neat set-up you folks got.
This is a pretty nice room that you have got.
3. The obese body shook in an appreciative chuckle.
The fat body shook in a chuckle because the man was enjoying the fact that he could afford to do whatever he liked, and also because he was appreciating the fact that the Duchess knew why he had come.
4. He lowered the level of his incongruous falsetto voice.
He had an unnaturally high-pitched voice; now, he lowered the pitch.
5. The words spat forth with sudden savagery, all pretense of blandness gone.
Ogilvie spat out the words, throwing away his politeness.
6. The Duchess of Croydon -- three centuries and a half of inbred arrogance behind her -- did not yield easily.
The Duchess was supported by her arrogance, coming from parents of noble families with a history of three centuries and a half. She wouldn't give up easily.
7. "It's no go, old girl, I'm afraid. It was a good try."
It's no use. What you did just now was a good attempt at trying to save the situation.
8. "That's more like it," Ogilvie said. He lit the fresh cigar. "Now we're getting somewhere."
"That's more acceptable," Ogilvie said. He lit another cigar. "Now we're making some progress."
9. ...his eyes sardonically on the Duchess as if challenging her objection.
...he looked at the Duchess sardonically, as if he wanted to see whether she dared to object to his smoking.
10. The house detective clucked his tongue reprovingly.
The house detective made noises with his tongue to show his disapproval.

Lesson 9: Mark Twain -- Mirror of America
1. ...a man who became obsessed with the frailties of the human race.
...a man who became constantly preoccupied by the moral weaknesses of mankind.
2. Mark Twain digested the new American experience before sharing it with the world as writer and lecturer.
Mark Twain first observed and absorbed the new American experience, and then introduced it to the world in his books and lectures.
3. The cast of characters set before him in his new profession was rich and varied -- a cosmos.
In his new profession he could meet people of all kinds.
4. Broke and discouraged, he accepted a job as reporter with the Virginia City Territorial Enterprise...
With no money and feeling frustrated, he accepted a job as a reporter with the Territorial Enterprise in Virginia City.
5. Mark Twain began digging his way to regional fame as a newspaper reporter and humorist.
Mark Twain began working hard to become well known locally as a newspaper reporter and humorist.
6. ...and when she projects a new surprise, the grave world smiles as usual, and says, "Well, that is California all over."
...and when California makes a plan for a new surprise, the solemn people in other states of the U.S. smile as usual, making the comment, "That's typical of California."
7. Bitterness fed on the man who had made the world laugh.
The man who had made the world laugh was himself consumed by bitterness.
Advanced English, Book 2: after-class translation exercises
Paraphrase, Unit 1:
1. Little donkeys thread their way among the throngs of people.
(Chinese translation: The little donkeys pass through the bustling crowds.)
Little donkeys make their way in and out of the moving crowds, or pass through them.
2. Then as you penetrate deeper into the bazaar, the noise of the entrance fades away, and you come to the muted cloth-market.
(Chinese translation: Then, as you make your way deeper into the bazaar, the clamour of the entrance gradually fades away, and the quiet cloth-market lies before you.)
Then as you go deeper into the market, the noise of the entrance gradually disappears and you come to the silent cloth-market.
3. They narrow down their choice and begin the really serious business of beating the price down.
(Chinese translation: They narrow their range of choices and begin bargaining in earnest.)
After careful search, comparison and some preliminary bargaining, they reduce their choices and get down to the really serious business of convincing the shopkeeper to lower the price.
4. He will price the item high, and yield little in the bargaining.
(Chinese translation: He will quote a very high price and will hardly make any concessions in the bargaining.)
Sqoop "column index out of range"
Sqoop is a tool for transferring data between relational databases (such as MySQL and Oracle) and the Hadoop ecosystem (such as Hive and HBase).
It lets users import data from a relational database into a Hadoop cluster, or export data from a Hadoop cluster back into a relational database.
However, when importing data with Sqoop, you may sometimes run into a "column index out of range" error.
Below, the cause of this error and the ways to resolve it are explained step by step.
Cause of the error: a "column index out of range" error usually means that a column index referenced during the import lies beyond the number of columns actually present in the source data. This can happen for the following reasons:
1. Wrong column index: when importing, you may have mistakenly referenced a column index that is outside the range of columns in the data source.
2. Changed data source: if the number of columns in the source was changed before the import, a previously specified column index may be out of date.
3. Empty columns in the source: if the data source contains empty columns, Sqoop may fail to determine the column count correctly, which leads to the error.
Solutions: several common ways to resolve the "column index out of range" error are discussed below.
1. Check the column index: first, make sure the column reference you provide is correct and actually exists in the data source.
You can run a suitable query or command against the data source to inspect the columns and their count.
Then pass the correct column specification to the Sqoop command, as in the sketch below.
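As a concrete illustration of step 1 (not taken from the original article), the sketch below makes the imported column set explicit. Sqoop selects columns by name rather than by numeric index, so listing them with the --columns argument keeps the import aligned with the source schema. The connection details, table name and column names (dbhost, shopdb, orders, order_id and so on) are hypothetical placeholders.

# Hypothetical MySQL source; replace the host, database, credentials,
# table and column names with your own.
# --columns pins the imported column set so it cannot drift out of
# sync with the source table's schema.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shopdb \
  --username sqoop_user \
  --password-file /user/sqoop/pwd.txt \
  --table orders \
  --columns "order_id,customer_id,order_date,total_amount" \
  --target-dir /data/raw/orders \
  --num-mappers 1

If the error persists, comparing this column list against the output of DESCRIBE orders on the database side is a quick way to spot a stale or misspelled column name.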
2. Update the data source: if you have added or removed columns in the data source, make sure its metadata is up to date before importing.
This can be done by running the appropriate ALTER TABLE statement or by using your database administration tool.
3. Ignore empty columns: if the data source contains empty (null) columns, you can try telling Sqoop how to handle them.
You can use Sqoop's --null-non-string option to specify how Sqoop should treat null values in non-string columns.
For example, setting --null-non-string '\\N' writes null values as the special string \N, so that Sqoop can handle the column count correctly. A sketch of this is shown after this paragraph.
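To illustrate that option, the sketch below reuses the hypothetical import from above. --null-string covers text columns and --null-non-string covers all other column types; both are standard Sqoop import arguments, and the table and connection details remain placeholders.

# Nulls in the hypothetical orders table are written as \N so that
# downstream tools such as Hive treat them as missing values rather
# than as empty strings.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shopdb \
  --username sqoop_user \
  --password-file /user/sqoop/pwd.txt \
  --table orders \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --target-dir /data/raw/orders_with_nulls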
Mark Twain -- Mirror of America
By Noel Grove

1. Most Americans remember Mark Twain as the father of Huck Finn's idyllic cruise through eternal boyhood and Tom Sawyer's endless summer of freedom and adventure. Indeed, this nation's best-loved author was every bit as adventurous, patriotic, romantic, and humorous as anyone has ever imagined. I found another Twain -- one who grew cynical, bitter, saddened by the profound personal tragedies life dealt him, a man who became obsessed with the frailties of the human race, who saw clearly ahead a black wall of night.

2. Tramp printer, river pilot, Confederate guerrilla, prospector, starry-eyed optimist, acid-tongued cynic: the man who became Mark Twain was born Samuel Langhorne Clemens, and he ranged across the nation for more than a third of his life, digesting the new American experience before sharing it with the world as writer and lecturer. He adopted his pen name from the cry heard in his steamboat days, signaling two fathoms (12 feet) of water -- a navigable depth.
Analysing RNA-Seq data with the“DESeq”packageSimon AndersEuropean Molecular Biology Laboratory(EMBL),Heidelberg,Germanysanders@fs.tum.delast change:2010-06-11AbstractIn RNA-Seq and related assay types(including comparative ChIP-Seq etc.),one works with tables of count data,which report,for each sample,the number of reads that have beenassigned to a gene(or other types of entities).The package“DESeq”provides a powerfultool to estimate the variance in such data and test for differential expression.1The presentvignette explains the use of the package;for an exposition of the statistical method employed,see our paper.21Quick startThisfirst section just shows the commands necessary for an analysis at a glance.For a more gentle introduction,skip this section onfirst reading and start reading at Section2.The DESeq package expects count data,as obtained,e.g.,from an RNA-Seq or other high-throughput sequencing(HTS)experiment,in form of a matrix of integer values.Each column corresponds to a sample,i.e.,typically one run on the sequencer.Each row corresponds to entity for which you count hits,e.g.,a gene,an exon,a binding region in ChIP-Seq,a window in CNV-Seq,or the like.Important:Each column must stem from an independent experiment or sample.If you spread sample material from one experiment over several“lanes”of the sequencer in order to get better coverage,you must sum up the counts from the lanes to get a single column.Failing to do so will result in incorrect variance estimation and overly optimistic p valuesLet’s say you have the counts in a matrix or data frame countTable,and you further have a factor conds with as many element as there are columns in countTable that indicates treatment groups,i.e.,data in the following form:>head(countsTable)T1a T1b T2T3N1N2Gene_000010020011Other Bioconductor packages for this use case(but employing different methods)are edgeR and baySeq.2The companion paper for DESeq is:S.Anders,W.Huber:“Differential expression analysis for sequence count data”.This paper is currently under review.A preprint can be obtained from Nature Preceedings:http: ///10.1038/npre.2010.4282.11Gene_000022081251926Gene_00003302000Gene_000047584241149271257Gene_00005101640410Gene_00006129126451223243149>conds[1]T T T Tb N NLevels:N T TbThen,the minimal set of commands to run a full analysis is:>cds<-newCountDataSet(countsTable,conds)>cds<-estimateSizeFactors(cds)>cds<-estimateVarianceFunctions(cds)>res<-nbinomTest(cds,"T","N")The last command tests for differential expression between the conditions labelled"T"and"N". 
It returns a data frame with p values(raw and adjusted),mean values,fold changes,and other useful information,which looks as follows:>head(res)id baseMean baseMeanA baseMeanB foldChange log2FoldChange1Gene_000010.45096310.39386510.536610 1.36242080.44617242Gene_0000217.947248816.002757520.863986 1.30377440.38269433Gene_00003 1.0629635 1.77160580.0000000.0000000-Inf4Gene_00004171.8057235128.6778649236.497511 1.83790360.87806115Gene_0000511.302188014.2894570 6.8212840.4773648-1.06683586Gene_00006198.2748364218.2198341168.3573400.7715034-0.3742556 pval padj resVarA resVarB11.00000001.00000000.324701130.5867700920.52473450.93275430.552669770.5805913130.36207480.87762990.629578350.0000000040.24518420.76797530.060332050.4698535050.67708470.98614801.274423340.3987763360.58255840.95876180.186978710.017003032PreparationsAs example data,we use Tag-Seq data from an experiment studying certain human tissue culture samples,which P.Bertone kindly permitted us to use.As these data are not yet published, we have obscured annotation data and will,for now,remain vague concerning their biological properties.We will amend this once Bertone and coworkers have published their paper.They extracted mRNA from the cultures and sequenced only the3’end of the transcripts (Tag-Seq)with an Illumina GenomeAnalyzer,one lane per sample.They got from6.8to13.6 mio reads from each lane,which they assigned to genes.They were able to assign30%to50% of the tags unambiguously to annotated genes and produced a table that gives these counts.3 3An easy way to produce such a table from the output of the aligner is to use the htseq-count script distributed with the HTSeq package.(Even though HTSeq is a Python package,you do not need to know any Python to use htseq-count.)See http://www-huber.embl.de/users/anders/HTSeq/doc/count.html.2A version of this table is distributed with the DESeq package as example data in afile called “TagSeqExample.tab”.The system.file function allows to see where R has stored thefile when the package was installed:>library(DESeq)>exampleFile=system.file("extra/TagSeqExample.tab",package="DESeq")>exampleFile[1]"/tmp/RtmpsEAocA/Rinst55786267/DESeq/extra/TagSeqExample.tab"It is a tab-delimitedfile with column headers in thefirst line.We read it in with>countsTable<-read.delim(exampleFile,header=TRUE,stringsAsFactors=TRUE) >head(countsTable)gene T1a T1b T2T3N1N21Gene_000010020012Gene_0000220812519263Gene_000033020004Gene_0000475842411492712575Gene_000051016404106Gene_00006129126451223243149To obtain such a table for your own data,you will need other software;this is out of the scope of DESeq.In the course materials from the Workshops section of the Bioconductor web page, you mightfind further information how to do this with the ShortRead and IRanges packages.Thefirst column is the gene ID.(We have shuffled the table rows,removed the RefSeq IDs and replaced them with dummy identifiers of the form“Gene NNNNN”.)We use the gene IDs for the row names and remove the gene ID column:>rownames(countsTable)<-countsTable$gene>countsTable<-countsTable[,-1]We are now left with six columns,referring to the six samples.Thefirst four(labelled“T1a”,“T1b”,“T2”,and“T3”)are from cancerous tissue,the last two(labelled“N1”,“N2”)are from healthy tissue and served as control.We code this information in the following vector,which assigns each sample a“condition”: >conds<-c("T","T","T","Tb","N","N")where“T”stands for a sample derived from a certain tumour type and“N”for a sample derived from non-pathological tissue.Thefirst three 
samples had a very similar histopathological phe-notype,while the fourth sample was atypical,and hence,we assign it another condition(“Tb”).We can now instantiate a CountDataSet,which is the central data structure in the DESeq package:>cds<-newCountDataSet(countsTable,conds)The CountDataSet class is derived from the eSet class and so shares all features of this standard Bioconductor class.Furthermore,accessors are provided for its data slots.For example, the counts can be accessed with the counts function.>head(counts(cds))3T1a T1b T2T3N1N2Gene_00001002001Gene_000022081251926Gene_00003302000Gene_000047584241149271257Gene_00005101640410Gene_00006129126451223243149One feature derived from the eSet class is the possibility to subset.We can remove thefirst sample(i.e.,thefirst column)as follows>cds<-cds[,-1]We remove it because samples T1a and T1b were derived from the same individuum and are hence more similar than the others.In order to keep the present example simple we continue without sample T1a.Asfirst processing step,we have to estimate the effective library size.This information is called the“size factors”vector,as the package only needs to now the relative library sizes.So,if a non-differentially expressed gene produces twice as many counts in one sample than in another, the size factor for this sample should be twice as large as the one for the other sample.You could simply use the actual total numbers of reads and assign them to the cds object:>libsizes<-c(T1a=6843583,T1b=7604834,T2=13625570,T3=12291910,+N1=12872125,N2=10502656)>sizeFactors(cds)<-libsizes[-1]However,one seems to get better results by estimating the size factors from the count data. The function estimateSizeFactors does that for you.(See the man page of estimateSizeFac-torsForMatrix for technical details on the calculation.)>cds<-estimateSizeFactors(cds)>sizeFactors(cds)T1b T2T3N1N20.55873941.58230961.12704251.28693370.87469983Variance estimationAs explained in detail in the paper,the core assumption of this method it that the mean is a good predictor of the variance,i.e.,that genes with a similar expression level also have similar variance across replicates.Hence,we need to estimate for each condition a function that allows to predict the variance from the mean.This estimation is done by calculating,for each gene,the sample mean and variance within replicates and thenfitting a curve to this data.This computation is performed by the following command.>cds<-estimateVarianceFunctions(cds)In order to use the package,you do not need to know what precisely these raw variance functions estimate. 
For the interested reader,a few extra details are given here:The point of the variance functions is to predict how much variance one should expect for counts at a certain level.For example,let us assume that we have found123tags for a certain gene in the“T1b”sample.We may now calculate the expected“raw variance”as follows.First we get the“base level”,by which we mean this count value divided by the size factor.This makes the values from different columns comparable.Then,we insert this41101001000100000.00.51.01.52.0base means q u a r e d c o e f f i c i e n t o f v a r i a t i o nN T _maxbase mean densityFigure 1:Plot to show the estimated variances (as squared coefficients of variation (SCV),i.e.,variance over squared mean),produced with the function scvPlot .into the raw variance function to get the estimated “raw variance”which needs to be scaled up to the count level by multiplying with the size factor (squared,because this is a variance).Once we add the expected shot-noise variance (i.e.,the variance due to the Poisson counting process),which is equal to the count value,we get the full variance.The square root of this full variance is then the estimated standard deviation for count values at the given level (provided,of course,that our fundamental assumption is right that the mean allows to get a reasonable prediction for the variance).>countValue <-123>baseLevel <-countValue /sizeFactors(cds)["T1b"]>rawVarFuncForGB <-rawVarFunc(cds,"T")>rawVariance <-rawVarFuncForGB(baseLevel )>fullVariance <-countValue +rawVariance *sizeFactors(cds)["T1b"]^2>sqrt(fullVariance )T1b 71.89861attr(,"size")[1]2Of course,you do not have to do the calculation just outlined yourself,the package does this automatically.If you are confident that the package did a good job in estimating the variance functions,you may now skip directly to the Section 4.If you,however,would like to check whether the fit was good,the rest of this sections explains how to inspect and verify the variance function estimates.The function ’scvPlot’shows all the base variance functions in one plot:5>scvPlot(cds,ylim=c(0,2))In the produced plot(Fig.1),the x axis is the base mean,the y axis the squared coefficient of variation(SCV),i.e.,the ratio of the variance at base level to the square of the base mean.The solid lines are the SCV for the raw variances,i.e.,the noise due to biological replication.There is one coloured solid line per condition,and,in case there are non-replicated conditions,a dashed black line for the maximum of the raw variances,which is used for these.On top of the variance,there is shot noise,i.e.,the Poissonean variance inherent to the process of counting reads.The amount of shot noise depends on the size factor,and hence,for each sample,a dotted line in the colour of its condition is plotted above the solid line.The dotted line is the base variance,i.e.,the full variance,scaled down to base level by the size factors.The vertical distance between solid and dotted lines is the shot noise.The solid black line is a density estimate of the base means:Only were there is an appreciable number of base mean values,the variance estimates can be expected to be accurate.For the condition“Tb”,we cannot estimate a variance function as we have no replicates. 
When a variance estimate is needed for“Tb”,the package will use the maximum of the variances estimated for all the other conditions.To see the assignment of conditions to variance functions, use the rawVarFuncTable accessor function:>rawVarFuncTable(cds)N T Tb"N""T""_max"It is instructive to observe at which count level the biological noise starts to dominate the shot noise.At low counts,where shot noise dominates,higher sequencing depth(larger library size) will improve the signal-to-noise ratio while for high counts,where the biological noise dominates, only additional biological replicates will help.One should check whether the base variance functions seem to follow the empirical variance well.To this end,two diagnostic functions are provided.The function varianceFitDiagnostics returns,for a specified condition,a data frame with four columns:the mean base level for each gene,the base variance as estimated from the count values of this gene only,and thefitted base variance,i.e.,the predicted value from the localfit through the base variance estimates from all genes.As one typically has few replicates,the single-gene estimate of the base variance can deviate wildly from thefitted value.To see whether this might be too wild,the cumulative prob-ability for this ratio of single-gene estimate tofitted value is calculated from theχ2distribution, as explained in the paper.These values are the fourth column.>diagForT<-varianceFitDiagnostics(cds,"T")>head(diagForT)baseMean baseVar fittedRawVar fittedBaseVar pchisqGene_000010.63198760.79881666.319876e-097.652518e-010.69307480Gene_0000210.950897822.67401186.932106e+018.258113e+010.39971516Gene_000030.63198760.79881666.319876e-097.652518e-010.69307480Gene_00004151.3237122 1.94159617.733836e+037.917068e+030.01249452Gene_0000515.5819200340.81225331.297212e+02 1.485888e+020.87009670Gene_00006255.26701191771.24137102.171358e+04 2.202268e+040.22328186 We may now plot the per-gene estimates of the base variance against the base levels and draw a line with thefit from the local regression:601234−4−202468log10(diagForT$baseMean)l o g 10(d i a g F o r T $b a s e V a r)Figure 2:Diagnostic plot to check the fit of the variance function.70.00.20.40.60.81.00.00.20.40.60.81.0Residuals ECDF plot for condition 'T'chi−squared probability of residual E C D F3.2e−01 .. 3.0e+003.1e+00 .. 1.2e+011.2e+01 .. 3.1e+013.1e+01 .. 6.5e+016.5e+01 .. 1.3e+021.3e+02 .. 3.1e+023.1e+02 ..4.7e+04expected 0.00.20.40.60.8 1.00.00.20.40.60.81.0Residuals ECDF plot for condition 'N'chi−squared probability of residualE C D F3.9e−01 .. 2.3e+002.5e+00 .. 1.0e+011.0e+01 .. 2.8e+012.8e+01 .. 6.1e+016.1e+01 .. 1.3e+021.3e+02 .. 
3.2e+023.2e+02 ..4.2e+04expectedFigure 3:Another diagnostic plot to check the fit of the variance functions.This one is produced with the function residualsEcdfPlot .>smoothScatter(log10(diagForT$baseMean),log10(diagForT$baseVar))>lines(log10(fittedBaseVar)~log10(baseMean),+diagForT[order(diagForT$baseMean),],col="red")As one can see (Fig.2),the fit (red line)follows the single-gene estimates well,even though the spread of the latter is considerable,as one should expect,given that each variance value is estimated from just three values.Another way to study the diagnostic data is to check whether the probabilities in the fourth column of the diagnostics data frame are uniform,as they should be.One may simply look at the histogram of diagForGB$pchisq but a more convenient way is the function residualsEcdfPlot ,which show empirical cumulative density functions (ECDF)stratified by base level.We look at them for the conditions “T”and “N”:>par(mfrow=c(1,2))>residualsEcdfPlot(cds,"T")>residualsEcdfPlot(cds,"N")Fig.3shows the output.In both cases,the ECDF curves follow the diagonal well,i.e.,the fit is good.Only for very low counts (below 10),the deviations become stronger,but as at these levels,shot noise dominates,this is no reason for concern.If in your data the residuals ECDF plot indicates problems with the fit,you may want to manually adjust the variance estimates.If the ECDF curves are below the green line,variance is underestimated,and if you test for differential expression (see next section)you will get too low p values (and hence,too many false positives).If the ECDF curves are above the green line,variance is overestimated,which leads to too high p values (and hence,an overestimation of the false discovery rate).The first case (curves below the green line)may indicate a serious problem that might com-promise your results.However,this seems to rarely happen (and I’d appreciate if you could sent me a mail if you observe it with real data so I can investigate).The second case (curves above8the green line)is usually nothing to worry about;it only causes DESeq to be conservative with the tests.4Calling differential expressionHaving estimated and verified the variance–mean dependence,it is now straight-forward to look for differentially expressed genes.To contrast two conditions,e.g.,to see whether there is dif-ferential expression between conditions“N”and“T”,we simply call the function nbinomTest.It performs the tests as described in the paper and returns a data frame with the p value and other useful data.>res<-nbinomTest(cds,"N","T")>head(res)id baseMean baseMeanA baseMeanB foldChange log2FoldChange1Gene_000010.60180610.57162470.6319876 1.10559880.14482792Gene_0000216.597513622.244129410.95089780.4923051-1.02237553Gene_000030.31599380.00000000.6319876Inf Inf4Gene_00004201.7601420252.1965718151.32371220.6000229-0.73691065Gene_0000511.42612437.270328515.5819200 2.1432209 1.09978066Gene_00006217.4247729179.5825340255.2670119 1.42144680.5073601 pval padj resVarA resVarB11.00000001.00000000.533279700.521930508920.44019040.92891830.683642660.132157684730.33027220.90251720.000000000.521930509940.46016670.92891830.389113310.000139036550.58167740.96426730.426276123.029*******60.51904490.93666990.016963710.1080731022For each gene,we get its mean expression level(at the base scale)as a joint estimate from both conditions,and estimated separately for each condition,the fold change from thefirst to the second condition,the logarithm(to basis2)of the fold change,and the p value for the statistical 
significance of this change.The padj column contains the p values,adjusted for multiple testing with the Benjamini-Hochberg procedure(see the standard R function p.adjust),which controls false discovery rate(FDR).The last two columns show the ratio of the single gene estimates for the base variance to thefitted value.This may help to notice false hits due to“variance outliers”. Any hit that has a very large value in one these two columns should be checked carefully.Let usfirst plot the log2fold changes against the base means,colouring in red those genes that are significant at10%FDR.>plotDE<-function(res)+plot(+res$baseMean,+res$log2FoldChange,+log="x",pch=20,cex=.1,+col=ifelse(res$padj<.1,"red","black"))>plotDE(res)See Fig.4for the plot.As we will use this plot more often,we have stored its code in a function.We canfilter for the significant genes,91e−011e+011e+03−10−5510res$baseMeanr e s $l o g 2F o l d C h a n g eFigure 4:MvA plot for the contrast “T”vs.“N”.10>resSig<-res[res$padj<.1,]and list,e.g.,the most significantly differentially expressed genes:>head(resSig[order(resSig$pval),])id baseMean baseMeanA baseMeanB foldChange log2FoldChange 12236Gene_122361314.27690.00000002628.5538Inf Inf 8420Gene_08420520.44180.00000001040.8835Inf Inf 10387Gene_10387844.31130.00000001688.6226Inf Inf 3806Gene_03806637.59600.00000001275.1920Inf Inf 4189Gene_04189261.01090.0000000522.0217Inf Inf 17263Gene_17263453.09080.5716247905.61001584.27410.62961 pval padj resVarA resVarB122362.940849e-215.408809e-170.000000e+0023.221548420 2.352988e-202.163808e-160.000000e+0023.20920103874.964117e-203.043335e-160.000000e+0023.341673806 4.092018e-191.881510e-150.000000e+0023.297604189 1.174467e-174.320159e-140.000000e+0022.74140172636.374589e-171.954024e-131.584562e-0523.11462We may also want to look at the most strongly down-regulated of the significant genes, >head(resSig[order(resSig$foldChange,-resSig$baseMean),])id baseMean baseMeanA baseMeanB foldChange log2FoldChange 12457Gene_12457243.2076486.415200-Inf 16153Gene_16153230.3059460.611900-Inf 14803Gene_14803140.1458280.291600-Inf 3664Gene_03664138.2713276.542600-Inf 6705Gene_06705136.3325272.665000-Inf429Gene_00429113.0759226.151900-Inf pval padj resVarA resVarB124572.308952e-101.179618e-0718.393626650161532.534270e-099.322057e-0710.980916510148031.880220e-085.239545e-068.2526722203664 4.937565e-091.539181e-069.4491457306705 4.174475e-081.037526e-0532.652755470429 3.086259e-087.994714e-060.053034760or at the most strongly up-regulated ones:>head(resSig[order(-resSig$foldChange,-resSig$baseMean),])id baseMean baseMeanA baseMeanB foldChange log2FoldChange 12236Gene_122361314.276902628.5538Inf Inf 10387Gene_10387844.311301688.6226Inf Inf 3806Gene_03806637.596001275.1920Inf Inf 8420Gene_08420520.441801040.8835Inf Inf 11756Gene_11756269.72540539.4509Inf Inf 4189Gene_04189261.01090522.0217Inf Inf pval padj resVarA resVarB11051015200.00.51.01.5density.default(x = res$resVarA, from = 0, to = 20, na.rm = TRUE)N = 18392 Bandwidth = 0.09163D e n s i t yFigure 5:Density of residual variance ratios.122362.940849e-215.408809e-17023.22154103874.964117e-203.043335e-16023.341673806 4.092018e-191.881510e-15023.297608420 2.352988e-202.163808e-16023.20920117567.271243e-161.485919e-12022.3543741891.174467e-174.320159e-1422.74140The test is based on the assumption that the fitted variance,i.e.,the variance as deduced from the mean vie the raw variance functions,is a good estimate for a gene’s true variance.We have tested the appropriateness of this approach above with the plot 
produced by residualsEcdfPlot and concluded that it seems to hold well for most genes.The res object gives us two columns to have a closer look at this,namely resVarA and resVarB .These contain the residual variance quotients,i.e.the ratio of the variance as calculated only from the counts for the gene under consideration to the fitted variance.We can plot the density of these ratios (Fig.5):>plot(density(res$resVarA,na.rm=TRUE,from=0,to=20),col="red")>lines(density(res$resVarB,na.rm=TRUE,from=0,to=20),col="blue")>xg <-seq(0,20,length.out=1000);lines(xg,dchisq(xg,df=1),col="grey")The first two lines estimate the density of the quotients for conditions A and B and plot them in red and blue.If the model holds,these should agree with a χ2distribution with 1degree of freedom (we have two replicates for each condition,and the number of degrees of freedom12is one less than the number of replicates).The third line adds the theoretical density function in grey.The fact that the curves agree well is not surprising;we have seen this already in the residual ECDF plots(which show the same information,but in a way that makes it easier to see deviations).We can also see that hardly any genes have a ratio exceeding,say,20.In fact,there are, however,a few such genes,but we cannot see them in a density plot:>table(res$resVarA>15|res$resVarB>15)FALSE TRUE18186206From the chi2distribution,we expect such high ratios to only occur for maybe two genes: >(1-pchisq(15,df=1))*nrow(counts(cds))[1]2.016910Hence,these genes seem to be“variance outliers”,and it may be prudent to exclude them from the list of significant hits.(Of course,the threshold of15was chosen ad hoc here and other thresholds of the same order of magnitude would be defensible as well.)5Working partially without replicatesIf you have replicates for one condition but not for the other,you can still proceed as before.As already stated above,the testing function will simply take the maximum of all estimated variance function for conditions without replicates.If we consider this acceptable,we can contrast the single“Tb”sample against the two“N”samples.>resTbvsN<-nbinomTest(cds,"N","Tb")We produce the same plot as before,again with>plot(+resTbvsN$baseMean,+resTbvsN$log2FoldChange,+log="x",pch=20,cex=.1,+col=ifelse(resTbvsN$padj<.1,"red","black"))The result(Fig.6)shows the same symmetry in up-and down-regulation as in Fig.4but a striking asymmetry in the boundary line for significance.This has an easy explanation:low counts suffer from proportionally stronger shot noise than high counts,and this is more pronounced in the “Tb”data than in the“N”data due to the lack of replicates.Hence a stronger signal is required to call a down-regulation significant than for an up-regulation.6Working without any replicatesProper replicates are essential to interpret a biological experiment.After all,if one compares two conditions andfinds a difference,how else would one know that this difference is due to the different conditions and would not have arisen between replicates,as well,just due to noise?131101001000−10−5510resTbvsN$baseMeanr e s T b v s N $l o g 2F o l d C h a n g eFigure 6:MvA plot for the contrast “Tb”vs.“N”.14Hence,any attempt to work without any replicates will lead to conclusions of very limited reliability.Nevertheless,such experiments are often undertaken,especially in HTS,and the DESeq pack-age can deal with them,even though the soundness of the results may depend very much on the circumstances.Our primary assumption is still that the mean is a good 
predictor for the variance.Hence, if a number of genes with similar expression level are compared between replicates,we expect that their variation is of comparable magnitude.Once we accept this assumption,we may argue as follows:Given two samples from different conditions and a number of genes with comparable expression levels,of which we expect only a minority to be influenced by the condition,we may take the variance estimated from comparing their count rates across conditions as ersatz for a proper estimate of the variance across replicates.After all,we assume most genes to behave the same within replicates as across conditions,and hence,the estimated variance should not change too much due to the influence of the hopefully few differentially expressed genes.Furthermore, the differentially expressed genes will only cause the variance estimate to be too high,so that the test will err to the side of being too conservative,i.e.,we only lose power.We shall now see how well this works for our example data,even though it has rather many differentially expressed genes.We reduce our count data set to just two columns,one“T”and one“N”sample:>cds2<-cds[,c("T1b","N1")]Now,without any replicates at all,the estimateVarianceFunctions function will refuse to proceed unless we instruct it to ignore the condition labels and estimate the variance by treating all samples as if they were replicates of the same condition:>cds2<-estimateVarianceFunctions(cds2,method="blind")Now,we can attempt tofind differential expression:>res2<-nbinomTest(cds2,"N","T")Unsurprisingly,wefind much fewer hits,as can be seen from the plot(Fig.7)>plot(+res2$baseMean,+res2$log2FoldChange,+log="x",pch=20,cex=.1,+col=ifelse(res2$padj<.1,"red","black"))and from this table,tallying the number of significant hits in our previous and our new,restricted analysis:>addmargins(table(res_sig=res$padj<.1,res2_sig=res2$padj<.1)) res2_sigres_sig FALSE TRUE SumFALSE155337015603TRUE414201615Sum1594727116218As can be seen,we have still found about1/5of the hits,and only a reassuringly small number of new(and potentially false)hits.15。
Using XPSPEAK Version 4.1 November 2000Contents Page Number XPS Peak Fitting Program for WIN95/98 XPSPEAK Version 4.1 (1)Program Installation (1)Introduction (1)First Version (1)Version 2.0 (1)Version 3.0 (1)Version 3.1 (2)Version 4.0 (2)Version 4.1 (2)Future Versions (2)General Information (from R. Kwok) (3)Using XPS Peak (3)Overview of Processing (3)Appearance (4)Opening Files (4)Opening a Kratos (*.des) text file (4)Opening Multiple Kratos (*.des) text files (5)Saving Files (6)Region Parameters (6)Loading Region Parameters (6)Saving Parameters (6)Available Backgrounds (6)Averaging (7)Shirley + Linear Background (7)Tougaard (8)Adding/Adjusting the Background (8)Adding/Adjusting Peaks (9)Peak Types: p, d and f (10)Peak Constraints (11)Peak Parameters (11)Peak Function (12)Region Shift (13)Optimisation (14)Print/Export (15)Export (15)Program Options (15)Compatibility (16)File I/O (16)Limitations (17)Cautions for Peak Fitting (17)Sample Files: (17)gaas.xps (17)Cu2p_bg.xps (18)Kratos.des (18)ASCII.prn (18)Other Files (18)XPS Peak Fitting Program for WIN95/98 XPSPEAKVersion 4.1Program InstallationXPS Peak is freeware. Please ask RCSMS lab staff for a copy of the zipped 3.3MB file, if you would like your own copyUnzip the XPSPEA4.ZIP file and run Setup.exe in Win 95 or Win 98.Note: I haven’t successfully installed XPSPEAK on Win 95 machines unless they have been running Windows 95c – CMH.IntroductionRaymond Kwok, the author of XPSPEAK had spent >1000 hours on XPS peak fitting when he was a graduate student. During that time, he dreamed of many features in the XPS peak fitting software that could help obtain more information from the XPS peaks and reduce processing time.Most of the information in this users guide has come directly from the readme.doc file, automatically installed with XPSPEAK4.1First VersionIn 1994, Dr Kwok wrote a program that converted the Kratos XPS spectral files to ASCII data. Once this program was finished, he found that the program could be easily converted to a peak fitting program. Then he added the dreamed features into the program, e.g.∙ A better way to locate a point at a noise baseline for the Shirley background calculations∙Combine the two peaks of 2p3/2 and 2p1/2∙Fit different XPS regions at the same timeVersion 2.0After the first version and Version 2.0, many people emailed Dr Kwok and gave additional suggestions. He also found other features that could be put into the program.Version 3.0The major change in Version 3.0 is the addition of Newton’s Method for optimisation∙Newton’s method can greatly reduce the optimisation time for multiple region peak fitting.Version 3.11. Removed all the run-time errors that were reported2. A Shirley + Linear background was added3. The Export to Clipboard function was added as requested by a user∙Some other minor graphical features were addedVersion 4.0Added:1. The asymmetrical peak function. See note below2. Three additional file formats for importing data∙ A few minor adjustmentsThe addition of the Asymmetrical Peak Function required the peak function to be changed from the Gaussian-Lorentzian product function to the Gaussian-Lorentzian sum function. Calculation of the asymmetrical function using the Gaussian-Lorentzian product function was too difficult to implement. 
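Since the choice between the two functions matters when re-opening old *.xps files, it can help to see what generic Gaussian-Lorentzian sum and product lineshapes look like. The sketch below (in Python) is an illustration only: these are commonly used textbook forms, not necessarily the exact parameterization implemented in XPSPEAK (the program's own parameter list appears under Peak Parameters), and the function names, as well as the convention that m is the Lorentzian fraction, are assumptions made here.

import numpy as np

def gl_sum(x, x0, fwhm, area, m):
    # Gaussian-Lorentzian *sum*: m*Lorentzian + (1-m)*Gaussian, both with the
    # same FWHM and unit area, scaled by `area`.  m=0 is pure Gaussian, m=1 pure Lorentzian.
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))       # Gaussian sigma for this FWHM
    gauss = np.exp(-0.5 * ((x - x0) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    gamma = fwhm / 2.0                                       # Lorentzian half-width
    lorentz = gamma / (np.pi * ((x - x0) ** 2 + gamma ** 2))
    return area * (m * lorentz + (1.0 - m) * gauss)

def gl_product(x, x0, fwhm, height, m):
    # Gaussian-Lorentzian *product* form, written in terms of peak height rather than area.
    u = 4.0 * (x - x0) ** 2 / fwhm ** 2
    return height * np.exp(-np.log(2.0) * (1.0 - m) * u) / (1.0 + m * u)

# Example: a synthetic peak at 285.0 eV binding energy, FWHM 1.2 eV, 30% Lorentzian.
be = np.linspace(280.0, 290.0, 501)
curve = gl_sum(be, x0=285.0, fwhm=1.2, area=1000.0, m=0.3)

The practical point of the note above is visible in these formulas: for the same %GL value the sum and product forms give different shapes, which is why a *.xps file optimised with one function has to be re-optimised (with a different %GL) after switching to the other.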
The software of some instruments uses the sum function, while others use the product function, so both functions are available in XPSPEAK.See Peak Function, (Page 12) for details of how to set this up.Note:If the selection is the sum function, when the user opens a *.xps file that was optimised using the Gaussian-Lorentzian product function, you have to re-optimise the spectra using the Gaussian-Lorentzian sum function with a different %Gaussian-Lorentzian value.Version 4.1Version 4.1 has only two changes.1. In version 4.0, the printed characters were inverted, a problem that wasdue to Visual Basic. After about half year, a patch was received from Microsoft, and the problem was solved by simply recompiling the program2. The import of multiple region VAMAS file format was addedFuture VersionsThe author believes the program has some weakness in the background subtraction routines. Extensive literature examination will be required in order to revise them. Dr Kwok intends to do that for the next version.General Information (from R. Kwok)This version of the program was written in Visual Basic 6.0 and uses 32 bit processes. This is freeware. You may ask for the source program if you really want to. I hope this program will be useful for people without modern XPS software. I also hope that the new features in this program can be adopted by the XPS manufacturers in the later versions of their software.If you have any questions/suggestions, please send an email to me.Raymund W.M. KwokDepartment of ChemistryThe Chinese University of Hong KongShatin, Hong KongTel: (852)-2609-6261Fax:(852)-2603-5057email: rmkwok@.hkI would like to thank the comments and suggestions from many people. For the completion of Version 4.0, I would like to think Dr. Bernard J. Flinn for the routine of reading Leybold ascii format, Prof. Igor Bello and Kelvin Dickinson for providing me the VAMAS files VG systems, and my graduate students for testing the program. I hope I will add other features into the program in the near future.R Kwok.Using XPS PeakOverview of Processing1. Open Required Files∙See Opening Files (Page 4)2. Make sure background is there/suitable∙See Adding/Adjusting the Background, (Page 8)3. Add/adjust peaks as necessary∙See Adding/Adjusting Peaks, (Page 9), and Peak Parameters, (Page 11)4. Save file∙See Saving Files, (Page 6)5. Export if necessary∙See Print/Export, (Page 15)AppearanceXPSPEAK opens with two windows, one above the other, which look like this:∙The top window opens and displays the active scan, adds or adjusts a background, adds peaks, and loads and saves parameters.∙The lower window allows peak processing and re-opening and saving dataOpening FilesOpening a Kratos (*.des) text file1. Make sure your data files have been converted to text files. See the backof the Vision Software manual for details of how to do this. Remember, from the original experiment files, each region of each file will now be a separate file.2. From the Data menu of the upper window, choose Import (Kratos)∙Choose directory∙Double click on the file of interest∙The spectra open with all previous processing INCLUDEDOpening Multiple Kratos (*.des) text files∙You can open up a maximum of 10 files together.1. Open the first file as above∙Opens in the first region (1)2. In the XPS Peak Processing (lower) window, left click on 2(secondregion), which makes this region active3. 
Open the second file as in Step2, Opening a Kratos (*.des) text file,(Page 4)∙Opens in the second region (2)∙You can only have one description for all the files that are open. Edit with a click in the Description box4. Open further files by clicking on the next available region number thenfollowing the above step.∙You can only have one description for all the files that are open. Edit with a click in the Description boxDescriptionBox 2∙To open a file that has already been processed and saved using XPSPEAK, click on the Open XPS button in the lower window. Choose directory and file as normal∙The program can store all the peak information into a *.XPS file for later use. See below.Saving Files1. To save a file click on the Save XPS button in the lower window2. Choose Directory3. Type in a suitable file name4. Click OK∙Everything that is open will be saved in this file∙The program can also store/read the peak parameter files (*.RPA)so that you do not need to re-type all the parameters again for a similar spectrum.Region ParametersRegion Parameters are the boundaries or limits you have used to set up the background and peaks for your files. These values can be saved as a file of the type *.rpa.Note that these Region Parameters are completely different from the mathematical parameters described in Peak Parameters, (Page 11) Loading Region Parameters1. From the Parameters menu in the upper window, click on Load RegionParameters2. Choose directory and file name3. Click on Open buttonSaving Parameters1. From the Parameters menu in the XPS Peak Fit (Upper) window, clickon Save Region Parameters2. Choose directory and file name3. Click on the Save buttonAvailable BackgroundsThis program provides the background choices of∙Shirley∙Linear∙TougaardAveraging∙ Averaging at the end points of the background can reduce the time tofind a point at the middle of a noisy baseline∙ The program includes the choices of None (1 point), 3, 5, 7, and 9point average∙ This will average the intensities around the binding energy youselect.Shirley + Linear Background1. The Shirley + Linear background has been added for slopingbackgrounds∙ The "Shirley + Linear" background is the Shirley background plus astraight line with starting point at the low BE end-point and with a slope value∙ If the slope value is zero , the original Shirley calculation is used∙ If the slope value is positive , the straight line has higher values atthe high BE side, which can be used for spectra with higher background intensities at the high BE side∙ Similarly, a negative slope value can be used for a spectrum withlower background intensities at the high BE side2. The Optimization button may be used when the Shirley background is higher at some point than the signal intensities∙ The program will increase the slope value until the Shirleybackground is below the signal intensities∙ Please see the example below - Cu2p_bg.xps - which showsbackground subtraction using the Shirley method (This spectrum was sent to Dr Kwok by Dr. Roland Schlesinger).∙ A shows the problematic background when the Shirley backgroundis higher than the signal intensities. In the Shirley calculation routine, some negative values were generated and resulted in a non-monotonic increase background∙ B shows a "Shirley + Linear" background. 
The slope value was inputby trial-and-error until the background was lower than the signal intensities∙ C was obtained using the optimisation routineA slope = 0B slope = 11C slope = 15.17Note: The background subtraction calculation cannot completely remove the background signals. For quantitative studies, the best procedure is "consistency". See Future Versions, (Page 2).TougaardFor a Tougaard background, the program can optimise the B1 parameter by minimising the "square of the difference" of the intensities of ten data points in the high binding energy side of the range with the intensities of the calculated background.Adding/Adjusting the BackgroundNote: The Background MUST be correct before Peaks can be added. As with all backgrounds, the range needs to include as much of your peak as possible and as little of anything else as possible.1. Make sure the file of interest is open and the appropriate region is active2. Click on Background in the upper window∙The Region 0 box comes up, which contains the information about the background3. Adjust the following as necessary. See Note.∙High BE (This value needs to be within the range of your data) ∙Low BE (This value needs to be within the range of your data) NOTE: High and Low BE are not automatically within the range of your data. CHECK CAREFULLY THAT BOTH ENDS OF THE BACKGROUND ARE INSIDE THE EDGE OF YOUR DATA. Nothing will happen otherwise.∙No. of Ave. Pts at end-points. See Averaging, (Page 7)∙Background Type∙Note for Shirley + Linear:To perform the Shirley + Linear Optimisation routine:a) Have the file of interest openb) From the upper window, click on Backgroundc) In the resulting box, change or optimise the Shirley + LinearSlope as desired∙Using Optimize in the Shirley + Linear window can cause problems. Adjust manually if necessary3. Click on Accept when satisfiedAdding/Adjusting PeaksNote: The Background MUST be correct before peaks can be added. Nothing will happen otherwise. See previous section.∙To add a peak, from the Region Window, click on Add Peak ∙The peak window appears∙This may be adjusted as below using the Peak Window which will have opened automaticallyIn the XPS Peak Processing (lower) window, there will be a list of Regions, which are all the open files, and beside each of these will be numbers representing the synthetic peaks included in that region.Regions(files)SyntheticPeaks1. Click on a region number to activate that region∙The active region will be displayed in the upper window2. Click on a peak number to start adjusting the parameters for that peak.∙The Processing window for that peak will open3. Click off Fix to adjust the following using the maximum/minimum arrowkeys provided:∙Peak Type. (i.e. orbital – s, p, d, f)∙S.O.S (Δ eV between the two halves of the peak)∙Position∙FWHM∙Area∙%Lorenzian-Gaussian∙See the notes for explanations of how Asymmetry works.4. Click on Accept when satisfiedPeak Types: p, d and f.1. Each of these peaks combines the two splitting peaks2. The FWHM is the same for both the splitting peaks, e.g. a p-type peakwith FWHM=0.7eV is the combination of a p3/2 with FWHM at 0.7eV anda p1/2 with FWHM at 0.7eV, and with an area ratio of 2 to 13. If the theoretical area ratio is not true for the split peaks, the old way ofsetting two s-type peaks and adding the constraints should be used.∙The S.O.S. stands for spin orbital splitting.Note: The FWHM of the p, d or f peaks are the FWHM of the p3/2,d5/2 or f7/2, respectively. The FWHM of the combined peaks (e.g. 
combination of p3/2and p1/2) is shown in the actual FWHM in the Peak Parameter Window.Peak Constraints1. Each parameter can be referenced to the same type of parameter inother peaks. For example, for four peaks (Peak #0, 1, 2 and 3) with known relative peak positions (0.5eV between adjacent peaks), the following can be used∙Position: Peak 1 = Peak 0 + 0.5eV∙Position: Peak 2 = Peak 1 + 0.5eV∙Position: Peak 3 = Peak 2 + 0.5eV2. You may reference to any peak except with looped references.3. The optimisation of the %GL value is allowed in this program.∙ A suggestion to use this feature is to find a nice peak for a certain setting of your instrument and optimise the %GL for this peak.∙Fix the %GL in the later peak fitting process when the same instrument settings were used.4. This version also includes the setting of the upper and lower bounds foreach parameter.Peak ParametersThis program uses the following asymmetric Gaussian-Lorentzian sumThe program also uses the following symmetrical Gaussian-Lorentzian product functionPeak FunctionNote:If the selection is the sum function, when the user opens a *.xps file that was optimised using the Gaussian-Lorentzian product function, you have to re-optimise the spectra using the Gaussian-Lorentzian sum function with a different %Gaussian-Lorentzian value.∙You can choose the function type you want1. From the lower window, click on the Options button∙The peak parameters box comes up∙Select GL sum for the Gaussian-Lorentzian sum function∙Select GL product for the Gaussian-Lorentzian product function. 2. For the Gaussian-Lorentzian sum function, each peak can have sixparameters∙Peak Position∙Area∙FWHM∙%Gaussian-Lorentzian∙TS∙TLIf anyone knows what TS or TL might be, please let me know. Thanks, CMH3. Each peak in the Gaussian-Lorentzian product function can have fourparameters∙Peak Position∙Area∙FWHM∙%Gaussian-LorentzianSince peak area relates to the atomic concentration directly, we use it as a peak parameter and the peak height will not be shown to the user.Note:For asymmetric peaks, the FWHM only refers to the half of the peak that is symmetrical. The actual FWHM of the peak is calculated numerically and is shown after the actual FWHM in the Peak Parameter Window. If the asymmetric peak is a doublet (p, d or f type peak), the actual FWHM is the FWHM of the doublet.Region ShiftA Region Shift parameter was added under the Parameters menu∙Use this parameter to compensate for the charging effect, the fermi level shift or any change in the system work function∙This value will be added to all the peak positions in the region for fitting purposes.An example:∙ A polymer surface is positively charged and all the peaks are shifted to the high binding energy by +0.5eV, e.g. aliphatic carbon at 285.0eV shifts to 285.5eV∙When the Region Shift parameter is set to +0.5eV, 0.5eV will be added to all the peak positions in the region during peak fitting, but the listed peak positions are not changed, e.g. 285.0eV for aliphatic carbon. Note: I have tried this without any actual shift taking place. If someone finds out how to perform this operation, please let me know. Thanks, CMH.In the meantime, I suggest you do the shift before converting your files from the Vision Software format.OptimisationYou can optimise:1. A single peak parameter∙Use the Optimize button beside the parameter in the Peak Fitting window2. The peak (the peak position, area, FWHM, and the %GL if the "fix" box isnot ticked)∙Use the Optimize Peak button at the base of the Peak Fitting window3. 
A single region (all the parameters of all the peaks in that region if the"fix" box is not ticked)∙Use the Optimize Region menu (button) in the upper window4. All the regions∙Use the Optimize All button in the lower window∙During any type of optimisation, you can press the "Stop Fitting" button and the program will stop the process in the next cycle.Print/ExportIn the XPS Peak Fit or Region window, From the Data menu, choose Export or Print options as desiredExport∙The program can export the ASCII file of spectrum (*.DAT) for making high quality figures using other software (e.g. SigmaPlot)∙It can export the parameters (*.PAR) for further calculations (e.g. use Excel for atomic ratio calculations)∙It can also copy the spectral image to the system clipboard so that the spectral image can be pasted into a document (e.g. MS WORD). Program Options1. The %tolerance allows the optimisation routine to stop if the change inthe difference after one loop is less that the %tolerance2. The default setting of the optimisation is Newton's method∙This method requires a delta value for the optimisation calculations ∙You may need to change the value in some cases, but the existing setting is enough for most data.3. For the binary search method, it searches the best fit for each parameterin up to four levels of value ranges∙For example, for a peak position, in first level, it calculates the chi^2 when the peak position is changed by +2eV, +1.5eV, +1eV, +0.5eV,-0.5eV, -1eV, -1.5eV, and -2eV (range 2eV, step 0.5eV) ∙Then, it selects the position value that gives the lowest chi^2∙In the second level, it searches the best values in the range +0.4eV, +0.3eV, +0.2eV, +0.1eV, -0.1eV, -0.2eV, -0.3eV, and -0.4eV (range0.4eV, step 0.1eV)∙In the third level, it selects the best value in +0.09eV, +0.08eV, ...+0.01eV, -0.01eV, ...-0.09eV∙This will give the best value with two digits after decimal∙Level 4 is not used in the default setting∙The range setting and the number of levels in the option window can be changed if needed.4. The Newton's Method or Binary Search Method can be selected byclicking the "use" selection box of that method.5. The selection of the peak function is also in the Options window.6. The user can save/read the option parameters with the file extension*.opa∙The program reads the default.opa file at start up. Therefore, the user can customize the program options by saving the selectionsinto the default.opa file.CompatibilityThe program can read:∙Kratos text (*.des) files together with the peak fitting parameters in the file∙The ASCII files exported from Phi's Multiplex software∙The ASCII files of Leybold's software∙The VAMAS file format∙For the Phi, Leybold and VAMAS formats, multiple regions can be read∙For the Phi format, if the description contains a comma ",", the program will give an error. (If you get the error, you may use any texteditor to remove the comma)The program can also import ASCII files in the following format:Binding Energy Value 1 Intensity Value 1Binding Energy Value 2 Intensity Value 2etc etc∙The B.E. 
list must be in ascending or descending order, and the separation of adjacent B.E.s must be the same∙The file cannot have other lines before and after the data∙Sometimes, TAB may cause a reading error.File I/OThe file format of XPSPEAK 4.1 is different from XPSPEAK 3.1, 3.0 and 2.0 ∙XPSPEAK 4.1 can read the file format of XPSPEAK 3.1, 3.0 and 2.0, but not the reverse∙File format of 4.1 is the same as that of 4.0.LimitationsThis program limits the:∙Maximum number of points for each spectrum to 5000∙Maximum of peaks for all the regions to 51∙For each region, the maximum number of peaks is 10. Cautions for Peak FittingSome graduate students believe that the fitting parameters for the best fitted spectrum is the "final answer". This is definitely not true. Adding enough peaks can always fit a spectrum∙Peak fitting only assists the verification of a model∙The user must have a model in mind before adding peaks to the spectrum!Sample Files:gaas.xpsThis file contains 10 spectra1. Use Open XPS to retrieve the file. It includes ten regions∙1-4 for Ga 3d∙5-8 for Ga 3d∙9-10 for S 2p2. For the Ga 3d and As 3d, the peaks are d-type with s.o.s. = 0.3 and 0.9respectively3. Regions 4 and 8 are the sample just after S-treatment4. Other regions are after annealing5. Peak width of Ga 3d and As 3d are constrained to those in regions 1 and56. The fermi level shift of each region was determined using the As 3d5/2peak and the value was put into the "Region Shift" of each region7. Since the region shift takes into account the Fermi level shift, the peakpositions can be easily referenced for the same chemical components in different regions, i.e.∙Peak#1, 3, 5 of Ga 3d are set equal to Peak#0∙Peak#8, 9, 10 of As 3d are set equal to Peak#78. Note that the %GL value of the peaks is 27% using the GL sum functionin Version 4.0, while it is 80% using the GL product function in previous versions.18 Cu2p_bg.xpsThis spectrum was sent to me by Dr. Roland Schlesinger. It shows a background subtraction using the Shirley + Linear method∙See Shirley + Linear Background, (Page 7)Kratos.des∙This file shows a Kratos *.des file∙This is the format your files should be in if they have come from the Kratos instrument∙Use import Kratos to retrieve the file. See Opening Files, (Page 4)∙Note that the four peaks are all s-type∙You may delete peak 2, 4 and change the peak 1,3 to d-type with s.o.s. = 0.7. You may also read in the parameter file: as3d.rpa. ASCII.prn∙This shows an ASCII file∙Use import ASCII to retrieve the file∙It is a As 3d spectrum of GaAs∙In order to fit the spectrum, you need to first add the background and then add two d-type peaks with s.o.s.=0.7∙You may also read in the parameter file: as3d.rpa.Other Files(We don’t have an instrument that produces these files at Auckland University., but you may wish to look at them anyway. See the readme.doc file for more info.)1. Phi.asc2. Leybold.asc3. VAMAS.txt4. VAMASmult.txtHave Fun! July 1, 1999.。
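As a companion to the Cu2p_bg.xps example and the two-column ASCII format described above, the following rough sketch shows how a Shirley-type background with an added linear term can be computed from a binding-energy/intensity file such as the ASCII.prn sample. It only illustrates the general idea of the Shirley + Linear Background section; the endpoint handling, the iteration count, and the use of NumPy are assumptions, and XPSPEAK's own routine may differ in detail.

import numpy as np

def shirley_linear_background(be, intensity, slope=0.0, n_iter=50):
    # Iterative Shirley background plus a straight line anchored at the low-BE endpoint.
    # be, intensity: arrays ordered from low to high binding energy.
    # slope: extra linear term; slope=0 reproduces a plain Shirley background.
    i_low, i_high = intensity[0], intensity[-1]
    linear = slope * (be - be[0])            # zero at the low-BE end, rises toward high BE
    background = i_low + linear
    for _ in range(n_iter):
        signal = intensity - background
        # Integrated signal from the low-BE end up to each point (trapezoid rule).
        cumulative = np.concatenate(([0.0], np.cumsum(
            0.5 * (signal[1:] + signal[:-1]) * np.diff(be))))
        total = cumulative[-1]
        if total == 0:
            break
        # Shirley step: the background at a point is proportional to the signal area below it,
        # scaled so that the background meets the data at both endpoints.
        background = i_low + linear + (i_high - i_low - linear[-1]) * cumulative / total
    return background

# Two-column ASCII data: "binding energy  intensity" per line, equal BE steps.
be, counts = np.loadtxt("ASCII.prn", unpack=True)
order = np.argsort(be)                       # accept ascending or descending input
be, counts = be[order], counts[order]
bg = shirley_linear_background(be, counts, slope=0.0)
peak_only = counts - bg

As the manual stresses, such a subtraction cannot completely remove the background signal; the useful property is consistency from spectrum to spectrum.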
Getting More From Out-of-Core ColumnsortGeeta Chaudhry and Thomas H.Cormen{geetac,thc}@Dartmouth College Department of Computer ScienceAbstract.We describe two improvements to a previous implementationof out-of-core columnsort,in which data reside on multiple disks.Thefirst improvement replaces asynchronous I/O and communication calls bysynchronous calls within a threaded framework.Experimental runs showthat this improvement reduces the running time to approximately half ofthe running time of the previous implementation.The second improve-ment uses algorithmic and engineering techniques to reduce the numberof passes over the data from four to three.Experimental evidence showsthat this improvement yields modest performance gains.We expect thatthe performance gain of this second improvement increases when the rel-ative speed of processing and communication increases with respect todisk I/O speeds.Thus,as processing and communication become fasterrelative to I/O,this second improvement may yield better results thanit currently does.1IntroductionIn a previous paper[1],the authors reported on an out-of-core sorting program based on Leighton’s columnsort algorithm[2].By some resource measures—specifically,disk time and processor time plus disk time—our columnsort-based algorithm was more sorting-efficient than the renowned NOW-Sort program[3].1 Unlike NOW-Sort,the implementation of our algorithm performed interproces-sor communication and disk I/O using only standard,off-the-shelf software,such as MPI[4]and MPI-2[5].The present paper explores two improvements to the implementation of our out-of-core program:1.Overlapping I/O,computation,and communication by means of a threadedimplementation in which all calls to MPI and MPI-2functions are syn-chronous.Asynchrony is provided by the standard pthreads package.The previous implementation achieved asynchrony by making calls to the asyn-chronous versions of MPI and MPI-2functions.2.Reducing the number of passes over the data from four down to three.Analgorithmic observation makes this reduction possible.We shall refer to our prior implementation as the non-threaded4-pass imple-mentation,and to our new implementations as the threaded4-pass and threaded 3-pass implementations.We shall see that the threaded4-pass implementation can reduce the overall running time down to approximately half of the non-threaded4-pass implementation’s running time.Moreover,the threaded4-pass implementation allows greaterflexibility in memory usage by both the user and the program itself.The effect of reducing the number of passes is more modest; experiments show that the running time of the threaded3-pass implementation is between91.5%and94.6%of that of the threaded4-pass implementation.The remainder of this paper is organized as follows.Section2summarizes columnsort and presents the out-of-core algorithm originally described in[1]. Section3outlines the differences between the non-threaded4-pass implemen-tation with asynchronous MPI and MPI-2calls and the threaded4-pass im-plementation.In Section4,we describe our threaded3-pass implementation. 
We give empirical results for all implementations.Finally,Section5offers some concluding remarks.2BackgroundIn this section,we review our non-threaded4-pass implementation of columnsort from the previous paper[1].After presenting the basic columnsort algorithm,we describe its adaptation to an out-of-core setting.We conclude this section with a discussion of the performance results.The Basic Columnsort Algorithm.Columnsort sorts N records.Each record contains a key.In columnsort,the records are arranged into an r×s matrix, where N=rs,s is a divisor of r,and r≥2(s−1)2.When columnsort completes, the matrix is sorted in column-major order.That is,each column is sorted,and the keys in each column are no larger than the keys in columns to the right.Columnsort proceeds in eight steps.Steps1,3,5,and7are all the same: sort each column individually.Each of steps2,4,6,and8permutes the matrix entries as follows:Step2:Transpose and reshape:Wefirst transpose the r×s matrix into an s×r matrix.Then we“reshape”it back into an r×s matrix by taking each row of r entries and rewriting it as an r/s×s submatrix.For example, the column with r=6entries a b c d e f is transposed into a6-entry row with entries a b c d e f and then reshaped into the2×3submatrix a b c d e f . Step4:Reshape and transpose:Wefirst reshape each set of r/s rows intoa single r-element row and then transpose the matrix.This permutation isthe inverse of that of step2.Step6:Shift down by r/2:We shift each column down by r/2positions, wrapping into the next column.That is,we shift the top half of each column into the bottom half of that column,and we shift the bottom half of each column into the top half of the next column.Step8:Shift up by r/2:We shift each column up by r/2positions,wrapping around into the previous column.This permutation is the inverse of that of step6.Our Out-of-Core Columnsort.Our adaptation of columnsort to an out-of-core setting assumes that the machine has P processors P0,P1,...,P P−1and D disks D0,D1,...,D D−1.When D=P,each processor accesses exactly one disk over the entire course of the algorithm.When D<P,we require that there be P/D processors per node and that they share the node’s disk;in this case,each processor accesses a distinct portion of the disk.In fact,in our implementation, we treat this distinct portion as a separate“virtual disk,”allowing us to assume that D≥P.When D>P,each processor has exclusive access to D/P disks. We say that a processor owns the D/P disks that it accesses.We use buffers that hold exactly r records.Each processor has several such buffers.For convenience,our current implementation assumes that all parame-ters(including r)are powers of2.There is an upper limit on the number of records N in thefile to be sorted. 
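To make the eight steps concrete, the following sketch carries them out in core on a small r x s matrix. It is an illustration only, not the authors' out-of-core code; the NumPy representation and the use of -inf/+inf padding to realize the two shift steps are choices made here, under the stated requirements that s divides r and r >= 2(s-1)^2.

import numpy as np

def columnsort(a):
    # In-core columnsort of an r-by-s array, following steps 1-8 described above.
    # Returns the records sorted in column-major order.
    r, s = a.shape
    assert r % s == 0 and r >= 2 * (s - 1) ** 2, "columnsort size requirements violated"

    a = np.sort(a, axis=0)                               # step 1: sort each column
    a = a.flatten(order="F").reshape(r, s, order="C")    # step 2: transpose and reshape
    a = np.sort(a, axis=0)                               # step 3: sort each column
    a = a.flatten(order="C").reshape(r, s, order="F")    # step 4: reshape and transpose
    a = np.sort(a, axis=0)                               # step 5: sort each column

    # Steps 6-8: shift every column down by r/2 (bottom halves wrap into the next
    # column), sort the shifted columns, then shift back up by r/2.  Padding the
    # leading and trailing half-columns with -inf/+inf is one standard way to do this.
    half = r // 2
    padded = np.concatenate((np.full(half, -np.inf),
                             a.flatten(order="F").astype(float),
                             np.full(half, np.inf)))                 # step 6
    shifted = np.sort(padded.reshape(r, s + 1, order="F"), axis=0)   # step 7
    back = shifted.flatten(order="F")[half:half + r * s]             # step 8
    return back.reshape(r, s, order="F")

# Example: 16 random keys arranged as an 8-by-2 matrix (8 >= 2*(2-1)**2).
rng = np.random.default_rng(1)
m = rng.integers(0, 100, size=(8, 2)).astype(float)
out = columnsort(m)
assert np.array_equal(out.flatten(order="F"), np.sort(m, axis=None))

Reading the result in column-major order gives the fully sorted sequence, which is exactly the sortedness guarantee the out-of-core algorithm below relies on.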
Recalling that N = rs and that each buffer of r records must fit in the memory of a single processor, this limit occurs because of the columnsort requirement that r ≥ 2(s−1)². If we simplify this requirement to r ≥ 2s², we have the restriction that r ≥ 2(N/r)², which is equivalent to N ≤ r^{3/2}/√2.

[…] single sorted run.

Communicate phase: Each record is destined for a specific column, depending on which even-numbered columnsort step this pass is performing. In order to get each record to the processor that owns this destination column, processors exchange records. We use asynchronous MPI calls to perform this communication. In passes 2, 3, and 4, this phase requires every processor to send some data to every other processor. In pass 3, the communication is simpler, since every processor communicates with only two other processors.

Permute phase: Having received records from other processors, each processor rearranges them into the correct order for writing.

Write phase: Each processor writes a set of records onto the disks that it owns. These records are not necessarily all written consecutively onto the disks, though they are written as a small number of sorted runs. Again, we use asynchronous MPI-2 calls to perform the writes.

Note that the use of asynchronous MPI and MPI-2 calls allows us to overlap local sorting, communication, and I/O. At any particular time, processor j might be communicating records belonging to column j+kP, locally sorting records in column j+(k+1)P, reading column j+(k+2)P, and writing column j+(k−1)P. However, our non-threaded 4-pass implementation does not overlap reading with either local sorting or writing.

Performance Results. Table 1 summarizes the performance results of several implementations of out-of-core columnsort. Our performance goal is to sort large volumes of data while consuming as few resources as possible. The results are for 64-byte records with an integer key at the beginning of each record. The input files are generated using the drand function to generate the value of each key.

The results in Table 1 are for two different clusters of SMPs. The first system is a cluster of 4 Sun Enterprise 450 SMPs. We used only one processor on each node. Each processor is an UltraSPARC-II, running at 296 MHz and with 128 MB of RAM, of which we used 40 MB for holding data. The nodes are connected by an ATM OC-3 network, with a peak speed of 155 megabits per second. This system has one disk per node, each an IBM DNES309170 spinning at 7200 RPM and with an average latency of 4.17 msec. All disks are on Ultra2 SCSI buses. The MPI and MPI-2 implementations are part of Sun HPC ClusterTools 4.

The second system is also a cluster of 4 Sun Enterprise 450 SMPs. Here, we used two processors on each node. Other than the disks, the components are the same as in the first system. This system has 8 disks, 4 of which are the same IBM disks as in the first system, and the other 4 of which are Seagate ST32171W, also spinning at 7200 RPM and also with an average latency of 4.17 msec.

We conclude this section by discussing two additional features of our implementations:
statically at coding time and is not fullyflexible. For example,the implementation is unable to adapt automatically to a faster network,faster CPU,or faster I/O.Our threaded4-pass implementation uses the standard pthreads package to overlap I/O,communication,and processing. It uses only synchronous MPI and MPI-2calls.Because the non-threaded4-pass implementation uses static scheduling,its memory usage is static and has to be known at coding time.Consequently,this implementation is unable to adapt to an increased amount of available memory. Our threaded4-pass implementation,on the other hand,maintains a global pool of memory buffers,the number of which is set at the start of each run of the program.One can determine the optimum number of buffers for a given configuration by means of a small number of experimental runs.Basic Structure.In order to overlap the usage of resources in a dynamic manner,we created four threads per processor.One thread is responsible for all the disk I/O functions,one does all the interprocessor communication,one does all the sorting,and thefinal thread does all the permuting.We shall refer to the four threads as the I/O,communicate,sort,and permute threads,respectively. The threads operate on buffers,each capable of holding exactly r records,and which are drawn from a global pool.The threads communicate with each other via a standard semaphore mechanism.The read and write functions appear in a common thread—the I/O thread—because they will serialize at the disk anyway.Figure1illustrates how the threads work by displaying the history of a column within a given round of a given pass.1.The I/O thread acquires a buffer b from the global pool and performs the readphase of column c by reading the column from the disk into this buffer.The I/O thread suspends while the read happens,and when the read completes, the I/O thread wakes up.2.When the I/O thread wakes up after the read completes,it signals the sortthread,which picks up buffer b and performs the sort phase of column c on it.3.The sort thread signals the communicate thread,which picks up buffer band performs the communicate phase of column c on it.The communicate phase suspends during interprocessor communication,and after communica-tion completes,it wakes up.4.The communicate thread signals the permute thread,which picks up buffer band performs the permute phase of column c on it.5.Finally,after the permute phase completes,the permute thread signals theI/O thread,which picks up buffer b and writes it out to disk.The I/O threadFig.1.The history of a buffer b as it progresses within a given round of a given pass.The I/O thread acquires the buffer from the global pool and then reads into it from disk.The I/O thread suspends during the read,and when it wakes up,it signals the sort thread.The sort thread sorts buffer b and signals the communicate thread. 
The communicate thread suspends during interprocessor communication,and when it wakes up,it signals the permute thread.The permute thread then permutes buffer b and signals the I/O thread.The I/O thread writes the buffer to disk,suspending during the write.When the I/O thread wakes up,it releases buffer b back to the global pool.suspends during the write,and when the write completes,the I/O thread wakes up and releases buffer b back to the global pool.The pthreads implementation may preempt any thread at any time.Thus, during the time that a given thread considers itself as active,it might not actually be running on the CPU.The sort,permute,and communicate threads allocate additional buffers for their own use.Each of these threads allocates one buffer at the beginning of the program and uses it throughout the entire run.Thus,the total memory usage of the threaded4-pass implementation is three buffers more than are created in the global pool.Performance Results.From the columns labeled“threaded4-pass”in Ta-ble1,we see that the threaded4-pass implementation takes only49.8%of the time taken by the non-threaded4-pass implementation on the cluster with4pro-cessors and4disks.On the cluster with8processors and8disks,the threaded4-pass implementation takes57.1%as much time as the non-threaded4-pass implementation.What accounts for this significant improvement in running time?Due to the highly asynchronous nature of each of the implementations,we were unable to obtain accurate breakdowns of where any of them were truly spending their time.We were able to obtain the amounts of time that each thread considered itself as active,but because threads may be preempted,these times may not be reflective of the times that the threads were actually running on the CPU. 
Similarly,the timing breakdown for the non-threaded4-pass implementation is not particularly accurate.Our best guess is that the gains come primarily from two sources.First is increasedflexibility in scheduling,which we discussed above in regard to the mo-tivation for threaded implementations.The second source of performance gain is that the MPI calls in the threaded4-pass implementation are synchronous, whereas they are asynchronous in the non-threaded4-pass implementation.Ap-parently,asynchronous MPI calls incur a significant overhead.Although there is an overhead cost due to threads,the benefits of schedulingflexibility and syn-chronous MPI calls in the threaded4-pass implementation outweigh this cost.We conducted a set of ancillary tests to verify that a program with threads and synchronous MPI calls is faster than a single-threaded program with asyn-chronous MPI calls.Thefirst test overlaps computation and I/O,and the second test overlaps computation and communication.We found that by converting a single thread with asynchronous MPI calls to a threaded program with syn-chronous MPI calls,the computation-and-I/O program ran7.8%faster and the computation-and-communication program ran23.8%faster.Moreover,there is a qualitative benefit of the threaded4-pass implementa-tion.Because all calls to MPI and MPI-2functions are synchronous,the code itself is cleaner,and it is easier to modify.4Threaded3-Pass ImplementationThis section describes how to reduce the number of passes in the implementation given in the previous section from four to three.The key observation is the pairing observation from[1]:We can combine steps6–8of columnsort by pairing adjacent columns.We sort the bottom r/2entries of each column along with the top r/2entries of the next column,placing the sorted r entries into the same positions.The top r/2entries of the leftmost column were already sorted by step5 and can therefore be left alone,and similarly for the bottom r/2entries of the rightmost column.Basic Structure.To take advantage of the pairing implementation,we com-bine steps5–8of columnsort—passes3and4in a4-pass implementation—into one pass.Figure2shows how.In the4-pass implementation,the communicate,Fig.2.How passes3and4of the threaded4-pass implementation are combined into a single pass of the threaded3-pass implementation.The communicate,permute,and write phases of pass3,and the read phase of pass4are replaced by a single communicate phase.permute,and write phases of pass3,along with the read phase of pass4,merely shift each column down by r/2rows(wrapping the bottom half of each column into the top half of the next column).We replace these four phases by a single communicate phase.In the threaded3-pass implementation,thefirst two passes are the same as in the threaded4-pass implementation.To further understand how the last pass works,let us examine a typical round(i.e.,neither thefirst nor the last round) in this pass.At the start of the round,some processor P k contains in a buffer r/2records left over from the previous round.When the round completes,some other processor P l,where l=(k−1)mod P,will contain r/2leftover records, and it will serve as P k in the next round.The round proceeds as follows:1.Each processor except for P k reads in a column,so that P−1columns areread in.2.Each processor except for P k sorts its column locally.3.With the exception of P k,each processor P i sends thefirst r/2of its sortedelements to processor P(i−1)mod P.After all sends are complete,each pro-cessor except for P l holds r 
records(r/2that it had prior to the send,andr/2that it just received),and processor P l holds r/2records,which it keeps aside in a separate buffer to be used in the next round.This communicate phase replaces the four phases shaded in Figure2.4.Each processor except for P l sorts its column locally.5.To prepare for the write in PDM order,each processor(except P l)sendsr/P records to every processor(including P l and itself).6.Each processor locally permutes the r(P−1)/P records it has received toput them into the correct PDM order.7.Each processor writes the r(P−1)/P records to the disks that it owns.Thefirst and last rounds have minor differences from the middle rounds. Thefirst round processes all P columns,and the last round may process fewer than P−1columns.Because thefirst round processes P columns,the number of rounds is1+⌈(s−P)/(P−1)⌉.Thread Structure.In the threaded3-pass implementation,thefirst two passes have the same thread structure as in the threaded4-pass implementation,but the third pass is different.As we have just seen,the third pass has seven phases. As before,each phase is assigned to a single thread,except that the read and write phases are assigned to a single I/O thread.Consequently,there is one I/O thread,two communicate threads,two sort threads,and one permute thread. This thread structure raises two additional issues.First,the number of additional buffers increases.Other than the I/O thread, each thread requires an r-record buffer.Since there arefive non-I/O threads,this implementation requiresfive buffers more than are in the global pool.That is, the threaded3-pass implementation requires two buffers more than the threaded 4-pass implementation.Thefigures for memory used in Table1reflect the larger number of buffers required for the threaded3-pass implementation.Second,because there are two communicate threads,each making calls to MPI,not all MPI implementations are suitable.Some MPI implementations are unreliable when multiple threads perform communication.Fortunately,the MPI implementations that we have used—on both the Sun cluster and Silicon Graphics Origin2000systems—support multiple threads that communicate via MPI calls.Performance Results.By inspection of Table1,we see that on the4-processor,4-disk and on the8-processor,8-disk clusters,the threaded3-pass implementation takes94.6%and91.5%,respectively,of the time used by the threaded4-pass implementation.The improvement due to reducing the number of passes from four to three is not as marked as that from introducing a threaded implementation.The observed running times lead to two questions.First,what accounts for the improvement that we see in eliminating a pass?Second,why don’t we see more of an improvement?Compared to the threaded4-pass implementation,the threaded3-pass im-plementation enjoys one significant benefit and one additional cost.Not surpris-ingly,the benefit is due to less disk I/O.The pass that is eliminated reads andwrites each record once,and so the3-pass implementation performs only75%as much disk I/O as the4-pass implementation.This reduced amount of disk I/O is the only factor we know of that accounts for the observed improvement.The added cost is due to more sort phases and more communicate phases.In the3-pass implementation,each round contains two sort and two communicate phases,for a total of2+2⌈(s−P)/(P−1)⌉sort phases and the same number of communicate phases.Together,the two rounds of the4-pass implementation—which are replaced by the last pass of the3-pass implementation—perform2s/P 
sort phases and2s/P communicate phases.Since the problem is out-of-core,we have s>P,which in turn implies that2+2⌈(s−P)/(P−1)⌉≥2s/P.Thus, the3-pass implementation always has more sort and communicate phases than the4-pass implementation.The degree to which the3-pass implementation improves upon the4-pass implementation depends on the relative speeds of disks,processing,and commu-nication in the underlying system.The25%reduction in the number of passes does not necessarily translate into a25%reduction in the overall time.That is because although the combined third pass of the3-pass implementation writes and reads each record only once,all the other work(communication and sort-ing)of the two passes still has to be done.Therefore,when the last two passes of the4-pass implementation are relatively I/O bound,we would expect the3-pass implementation to be significantly faster than the4-pass implementation. Conversely,when the last two passes are not I/O bound,the advantage of the 3-pass implementation is reduced.In fact,the3-pass implementation can even be slower than the4-pass implementation!On the clusters whose results ap-pear in Table1,the processing and network speeds are not particularly fast. Our detailed observations of the4-pass implementation reveal that in the last two passes,the I/O and communication times are close.Hence,these passes are only slightly I/O bound,and therefore we see only a modest gain in the3-pass implementation.One would expect that technology will evolve so that the speed of processing and communication networks will increase more rapidly than the speed of disk I/O.If this prediction holds,then the last two passes of the threaded4-pass im-plementation will become increasingly I/O bound,and so the relative advantage of the threaded3-pass implementation will become more prominent.5ConclusionWe have seen two ways to improve upon our earlier,non-threaded,4-pass imple-mentation of out-of-core columnsort.This original implementation had perfor-mance results that,by certain measures,made it competitive with NOW-Sort. The two improvements make the implementation even faster.One can characterize the threaded4-pass implementation as an engineering effort,which yielded substantially better performance on the two clusters on which it was tested.On the other hand,the threaded3-pass implementation has both algorithmic and engineering aspects.On the particular clusters that served as our testbed, the performance gains were modest.We note,however,that if we were to run the threaded3-pass implementation on a cluster with faster processors and a faster network,we would expect to see a more significant performance improvement. Our future work includes such experimental runs.We intentionally omitted a non-threaded3-pass implementation,even though that would have completed all four cases in the space of threaded vs.non-threaded and4-pass vs.3-pass implementations.The authors chose to not imple-ment that option because,having seen the significant performance improvement from the threaded implementation,the substantial effort required to produce a non-threaded,3-pass implementation would not have been worthwhile.We have identified several other directions for future work.Can we bypass√the size limitation given by N≤r3/2/。