Novel Techniques for Fraud Detection in Mobile Telecommunication Networks

Yves Moreau, Bart Preneel, Dept. Electrical Engineering-ESAT, K.U.Leuven, {yves.moreau,bart.preneel}@esat.kuleuven.ac.be
Peter Burge, John Shawe-Taylor, Dept. Computer Science, Royal Holloway, {peteb,jst}@
Christof Stoermann, Siemens AG, christof.stoermann@zfe.siemens.de
Chris Cooke, Vodafone, chris.cooke@
Group: Network

Abstract: This paper discusses the status of research on detection of fraud undertaken as part of the European Commission-funded ACTS ASPeCT (Advanced Security for Personal Communications Technologies) project. A first task has been the identification of possible fraud scenarios and of typical fraud indicators which can be mapped to data in toll tickets. Currently, the project is exploring the detection of fraudulent behaviour based on a combination of absolute and differential usage. Three approaches are being investigated: a rule-based approach, and two approaches based on neural networks, where both supervised and unsupervised learning are considered. Special attention is being paid to the feasibility of the implementations.

Introduction:
It is estimated that the mobile communications industry loses several million ECUs per year due to fraud. Therefore, prevention and early detection of fraudulent activity is an important goal for network operators. It is clear that the additional security measures taken in GSM and in the future UMTS (Universal Mobile Telecommunications System) make these networks less vulnerable to fraud than the analogue networks. Nevertheless, certain types of commercial fraud are very hard to preclude by technical means. It is also anticipated that the introduction of new services can lead to the development of new ways to defraud the system. The use of sophisticated fraud detection techniques can assist in the early detection of commercial frauds, and will also reduce the effectiveness of technical frauds.
One of the tasks of the European Commission-funded ACTS project ASPeCT (Advanced Security for Personal Communications Technologies) [1] is the development of new techniques and concepts for the detection of fraud in mobile telecommunication networks. This paper reports on the progress made during the first year; for a more detailed status report, the reader is referred to [2]. The remainder of this paper is organised as follows: Section 1 discusses the identification of possible fraud scenarios and of fraud indicators; Section 2 discusses the general approach of user profiling; Sections 3 and 4 present, respectively, the rule-based approach and the neural network based approach to fraud detection.

Section 1: Possible frauds and their indicators
The first stage of the work consists of the identification of possible fraud scenarios in telecommunications networks and particularly in mobile phone networks. These scenarios have been classified by the technical manner in which they are committed; an investigation has also been undertaken to identify which parts of the mobile telecommunications network are abused in order to commit any particular fraud. Another characteristic that has been studied is whether a fraud is a technical fraud operated for financial gain, or fraud related to personal use and hence not committed for profit.
A further classification is achieved by considering whether the network abuse is the result of administrative fraud, procurement fraud, or application fraud.
Subsequently, typical indicators have been identified which may be used for the purposes of detecting fraud committed using mobile telephones. In order to provide an indication of the likely ability of particular indicators to identify a specific fraud, these indicators have been classified both by their type and by their use.
The different types are:
• usage indicators, related to the way in which a mobile telephone is used;
• mobility indicators, related to the mobility of the telephone;
• deductive indicators, which arise as a by-product of fraudulent behaviour (e.g., overlapping calls and velocity checks).
Indicators have also been classified by use:
• primary indicators can, in principle, be employed in isolation to detect fraud;
• secondary indicators provide useful information in isolation (but are not sufficient by themselves);
• tertiary indicators provide supporting information when combined with other indicators.
A selection has been made of those scenarios which cannot be easily detected using existing tools, but which could be identified using more sophisticated approaches.
The potential fraud indicators have been mapped to the network data required to measure them. The information required to monitor the use of the communications network is contained in the toll tickets. Toll tickets are data records containing details pertaining to every mobile phone call attempt. They are transmitted to the network operator by the cells or switches that the mobile phone was communicating with. They are used to determine the charge to the subscriber, but they also provide information about customer usage and thus facilitate the detection of any possible fraudulent use. It has been investigated which fields in the GSM toll tickets can be used as indicators of fraudulent behaviour.
Before use in the fraud detection engine, the toll tickets are preprocessed. An essential component of this process is the encryption of all personal information in the toll tickets (such as telephone numbers). This allows for the protection of the privacy of users during the development of the fraud detection tools, while at the same time the network operators will be able to obtain the identity of fraudulent users.
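The preprocessing idea can be sketched as follows; the keyed-hash scheme, key handling and field names below are invented for illustration (the paper only specifies that personal data are encrypted, not how):

```python
import hashlib
import hmac

# Invented illustration of the preprocessing step: personal fields are
# replaced by keyed pseudonyms before toll tickets reach the detection
# engine. A keyed hash stands in for the unspecified encryption scheme;
# only the operator, who holds the key, can link a pseudonym back to a
# subscriber (by recomputing it for a candidate number).

SECRET_KEY = b"held by the network operator"

def pseudonymise(number: str) -> str:
    return hmac.new(SECRET_KEY, number.encode(), hashlib.sha256).hexdigest()[:16]

ticket = {"charged_imsi": "262011234567890", "non_charged_party": "+32161234567"}
safe = {field: pseudonymise(value) for field, value in ticket.items()}
print(safe)  # the same subscriber always maps to the same pseudonym
```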
Section 2: User profiling
Absolute or differential analysis
Existing fraud detection systems tend to interrogate sequences of toll tickets, comparing a function of the various fields with fixed criteria known as triggers. A trigger, if activated, raises an alert status which cumulatively would lead to an investigation by the network operator. Such fixed-trigger systems perform what is known as an absolute analysis of the toll tickets and are good at detecting the extremes of fraudulent activity.
Another approach to the problem is to perform a differential analysis. Here we monitor the behavioural patterns of the mobile phone, comparing its most recent activities with a history of its usage. Criteria can then be derived to use as triggers that are activated when the usage patterns of the mobile phone change significantly over a short period of time. A change in the behaviour pattern of a mobile phone is a common characteristic in nearly all fraud scenarios, excluding those committed on subscription where no behavioural pattern is established.
There are many advantages to performing a differential analysis through profiling the behaviour of a user. Firstly, certain behavioural patterns may be considered anomalous for one type of user, and hence potentially indicative of fraud, that are considered acceptable for another. With a differential analysis, flexible criteria can be developed that detect any change in usage based on a detailed history profile of user behaviour. This takes fraud detection down to the personal level, comparing like with like, enabling detection of less obvious frauds that may only be noticed at the personal usage level. An absolute usage system would not detect fraud at this level. In addition, because a typical user is not a fraudster, the majority of criteria that would have triggered an alarm in an absolute usage system will be seen as a large change in behaviour in a differential usage system. In this way a differential analysis can be seen as incorporating the absolute approach.

The differential approach
Most fraud indicators do not become apparent from an individual toll ticket. With the possible exception of a velocity trap, we can only gain confidence in detecting a real fraud by investigating a fairly long sequence of toll tickets. This is particularly the case when considering more subtle changes in a user's behaviour by performing a differential analysis.
A differential usage system requires information concerning the user's history of behaviour plus a more recent sample of the mobile phone's activities. An initial approach might be to extract and encode information from toll tickets and to store it in record format. This would require two windows or spans over the sequence of transactions for each user. The shorter sequence might be called the Current User Profile (CUP) and the longer sequence the User Profile History (UPH). Both profiles could be treated and maintained as finite-length queues. When a new toll ticket arrives for a given user, the oldest entry from the UPH would be discarded and the oldest entry from the CUP would move to the back of the UPH queue. The new record encoded from the incoming toll ticket would then join the back of the CUP queue.
Clearly it is not optimal to search and retrieve historical information concerning a user's activities prior to each calculation, on receipt of a new toll ticket. A more suitable approach is to compute a single cumulative CUP and UPH for each user from the incoming toll tickets, which can be stored as individual records, possibly in a database. To maintain the concept of having two different spans over the toll tickets without retaining a database record for each toll ticket, we need to decay both profiles before the influence of a new toll ticket can be taken into consideration. A straightforward decay factor may not be suitable, as this could dilute information relating to encoded parameters stored in the user's profile; an important concern here is the potential creation of false behaviour patterns. Several decaying schemes are currently being investigated.
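As an illustration of the cumulative two-span idea, the following minimal sketch maintains the CUP and UPH as exponentially decayed averages of encoded toll-ticket features. The decay constants, encoded fields and the differential trigger are assumptions made for this example, not the scheme adopted by the project:

```python
from collections import defaultdict

# Minimal sketch of cumulative two-span profiles (CUP/UPH) kept as
# exponentially decayed averages. Decay constants, encoded fields and the
# trigger threshold are invented for illustration.

CUP_DECAY = 0.80   # short span: recent toll tickets dominate
UPH_DECAY = 0.99   # long span: slowly evolving behavioural history

cup, uph = defaultdict(float), defaultdict(float)

def encode(ticket):
    """Encode a few fraud-relevant toll-ticket fields as numbers."""
    return {
        "duration": ticket["chargeable_duration"],
        "international": 1.0 if ticket["b_type_of_number"] == "international" else 0.0,
    }

def update(profile, features, decay):
    """Decay the stored values, then fold in the newly encoded toll ticket."""
    for key, value in features.items():
        profile[key] = decay * profile[key] + (1.0 - decay) * value

def on_toll_ticket(ticket):
    features = encode(ticket)
    update(cup, features, CUP_DECAY)
    update(uph, features, UPH_DECAY)
    # Differential trigger: recent usage deviates strongly from the history
    # (the history check also avoids false alarms for brand-new users).
    if uph["duration"] > 1.0 and cup["duration"] > 5 * uph["duration"]:
        print("alert: recent call durations far exceed the historical profile")

for _ in range(50):   # an established pattern of short national calls ...
    on_toll_ticket({"chargeable_duration": 3.0, "b_type_of_number": "national"})
on_toll_ticket({"chargeable_duration": 45.0, "b_type_of_number": "international"})  # ... then a jump
```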
Relevant toll ticket data
There are two important requirements for user profiling. Firstly, efficiency is of foremost concern for storing the user data and for performing updates. Secondly, user profiles have to provide a precise description of user behaviour to facilitate reliable fraud detection. All the information that a fraud detection tool needs to handle is derived from the toll tickets provided by the network operator.
The following toll ticket components have been identified as the most fraud-relevant measures:
• Charged_IMSI (International Mobile Subscriber Identity; identifies the user)
• First_Cell_Id (location characteristic for mobile-originating calls)
• Chargeable_Duration (basis for all cost estimations)
• B_Type_of_Number (for distinguishing between national and international calls)
• Non_Charged_Party (the number dialled)
These components are continually extracted from the toll tickets and incorporated into the user profiles in a cumulative manner. It is also anticipated that the analysis of cell congestion can provide useful ancillary information.

Section 3: Rule-based approach to fraud detection
In ASPeCT, several approaches are taken to identify fraudulent behaviour. In the rule-based approach, both the absolute and the differential usage are verified against certain rules. This approach works best with user profiles containing explicit information, to which fraud criteria given as rules can refer. User profiles are maintained for the directory number of the calling party (A-number), for the directory number of the called party (B-number), and also for the cells used to make or receive the calls. A-number profiles represent user behaviour and are useful for the detection of most types of fraud, while B-number profiles point to hot destinations and thus allow the detection of frauds based upon call forwarding. All deviations from normal user behaviour resulting from the different analysing processes are collected, and alarms are finally raised if the results in combination fulfil given alarm criteria.
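The combination of deviations into an alarm might be sketched as follows; the individual rules, weights and alarm threshold here are invented, whereas in the actual tool such criteria are programmed in the PDAL rule language described below:

```python
# Sketch of how individual rule results might combine into an alarm. The
# rules, weights and alarm threshold are invented for this example.

def absolute_rules(cup):
    # Fixed criteria, independent of the user's own history.
    yield ("long recent calls", cup["duration"] > 60.0, 0.6)
    yield ("mostly international", cup["international"] > 0.8, 0.7)

def differential_rules(cup, uph):
    # Criteria relative to the user's historical profile.
    if uph["duration"] > 0:
        yield ("duration jump", cup["duration"] > 5 * uph["duration"], 0.5)
    yield ("new international usage",
           cup["international"] > 0.5 and uph["international"] < 0.05, 0.7)

def alarm(cup, uph, threshold=1.0):
    """Collect all deviations; raise an alarm when their combined weight
    fulfils the given alarm criterion."""
    deviations = list(absolute_rules(cup)) + list(differential_rules(cup, uph))
    score = sum(weight for _, fired, weight in deviations if fired)
    return score >= threshold

print(alarm(cup={"duration": 90.0, "international": 0.9},
            uph={"duration": 10.0, "international": 0.0}))  # -> True
```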
The implementation of this solution is based on an existing rule-based tool for audit trail analysis, PDAT (Protocol Data Analysis Tool) [3]. PDAT is a rule-based tool for intrusion detection developed by Siemens ZFE (Corporate Research and Development). PDAT works in heterogeneous environments, supports on-line analysis, and provides a throughput of about 200 KB of input per second. Important goals were flexibility and broad applicability, including the analysis of general protocol data, which is achieved by the special language PDAL (Protocol Data Analysis Language). PDAL allows the programming of analysis criteria as well as a GUI-aided configuration of the analysis at run-time.
Intrusion detection and mobile fraud detection are quite similar problem fields, and the flexibility and broad applicability of PDAT are promising for using this tool for mobile fraud detection too. The main difference between intrusion detection and mobile fraud detection seems to be the kind of input data. The recording for intrusion detection produces 50 MB per day per user, but only for the few users of one UNIX system. In comparison, fraud detection has to deal with a huge number of mobile phone subscribers (roughly one million), each of whom, however, produces only about 300 bytes of data per day. PDAT was able to keep all interim results in main memory, since only a few users had to be dealt with. For fraud detection, however, intermediate data has to be stored on hard disk. Because of these new requirements it was necessary to develop some completely new concepts, such as user profiling and fast swapping for the updating of user profiles. Also, the internal architecture had to be changed to a great extent. The new architecture is depicted in Figure 1.

[Figure 1: Architecture of the rule-based fraud detection tool]

Section 4: Neural network based approach to fraud detection
A second approach to identifying fraudulent behaviour uses neural networks. The multiplicity and heterogeneity of the fraud scenarios require the use of intelligent detection systems. The fraud detection engine has to be flexible enough to cope with the diversity of fraud. It should also be adaptive in order to face new fraud scenarios, since fraudsters are likely to develop new forms of fraud once older attacks become impractical. Further, fraud appears in the billing system as abnormal usage patterns in the toll ticket records of one or more users. The function of the fraud detection engine is to recognise such patterns and produce the necessary alarms. High flexibility and adaptivity for a pattern recognition problem directly point to neural networks as a potential solution. Neural networks are systems of elementary decision units that can be adapted by training in order to recognise and classify arbitrary patterns; the interaction of a large number of elementary units makes it possible to learn arbitrarily complex tasks. For fraud detection in telephone networks, neural network engines are currently being developed worldwide [4,5]. As a closely related application, neural networks are now routinely used for the detection of credit card fraud.
There are two main forms of learning in neural networks: unsupervised learning and supervised learning. In unsupervised learning [4], the network groups similar training patterns into clusters. It is then up to the user to recognise what class or behaviour has to be associated with each cluster. When patterns are presented to the network after training, they are assigned to the cluster they are closest to, and are recognised as belonging to the class corresponding to that cluster. In supervised learning [5], the patterns have to be labelled a priori as belonging to some class. During learning, the network tries to adapt its units so that it produces the correct label at its output for each training pattern. Once training is finished the units are frozen, and when a new pattern is presented, it is classified according to the output produced by the network.
Unsupervised learning presents some difficulties. The problem is that patterns have to be presented - that is, encoded - in such a way that the data from fraudulent usage will form groups that are distinct enough from regular data. On the other hand, such systems can be trained using clean data only. With supervised learning, the difficulty is that one must obtain a significant amount of fraudulent data and label it as such, which represents a significant effort. Further, it is not clear how such systems will handle new fraud strategies. Therefore, neither approach appears to be a priori superior to the other, and both directions are being investigated.
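The contrast between the two learning modes can be sketched with modern off-the-shelf stand-ins (scikit-learn, which this work of course predates); the toy profile encoding and model choices are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy user-profile vectors: [mean call duration (min), fraction of international calls]
normal = rng.normal([3.0, 0.05], [1.0, 0.02], size=(500, 2))
fraud = rng.normal([30.0, 0.60], [5.0, 0.10], size=(20, 2))

# Unsupervised: cluster unlabelled profiles; an analyst must then decide
# which cluster, if any, corresponds to fraudulent behaviour.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit(np.vstack([normal, fraud]))

# Supervised: needs profiles labelled fraud/non-fraud up front, but then
# classifies new patterns directly at its output.
X = np.vstack([normal, fraud])
y = np.array([0] * len(normal) + [1] * len(fraud))
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
print(clf.predict([[25.0, 0.5]]))  # -> [1]: flagged as fraud-like
```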
References
[1] ACTS AC095, project ASPeCT: Initial report on security requirements. AC095/ATEA/W21/DS/P/02/B, February 1996.
[2] ACTS AC095, project ASPeCT: Definition of fraud detection concepts. AC095/KUL/W22/DS/P/06/A, September 1996.
[3] J. Katzer, T. Mehlhart, C. Wolff: PDAT - ein Protokolldaten-Analysewerkzeug für sichere Betriebssysteme und Anwendungen. Unix in Deutschland - GUUG 1993, Network Verlag, Hagenburg, Germany, 1993.
[4] Barson, S. Field, N. Davey, G. McAskie, R. Frank: The Detection of Fraud in Mobile Phone Networks. Neural Network World, Vol. 6, No. 4, pp. 477-484, 1996.
[5] Yuhas: Toll-Fraud Detection. Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications, ed. J. Alspector, R. Goodman, and T.X. Brown, pp. 239-244, Lawrence Erlbaum Associates, 1993.
A review of feature selection techniques in bioinformatics

Abstract
Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques.
In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common as well as upcoming bioinformatics applications.

1 INTRODUCTION
During the last decade, the motivation for applying feature selection (FS) techniques in bioinformatics has shifted from being an illustrative example to becoming a real prerequisite for model building. In particular, the high-dimensional nature of many modelling tasks in bioinformatics, going from sequence analysis over microarray analysis to spectral analyses and literature mining, has given rise to a wealth of feature selection techniques being presented in the field.
In this review, we focus on the application of feature selection techniques. In contrast to other dimensionality reduction techniques like those based on projection (e.g. principal component analysis) or compression (e.g. using information theory), feature selection techniques do not alter the original representation of the variables, but merely select a subset of them. Thus, they preserve the original semantics of the variables, hence offering the advantage of interpretability by a domain expert.
While feature selection can be applied to both supervised and unsupervised learning, we focus here on the problem of supervised learning (classification), where the class labels are known beforehand. The interesting topic of feature selection for unsupervised learning (clustering) is a more complex issue, and research into this field is recently getting more attention in several communities (Liu and Yu, 2005; Varshavsky et al., 2006).
The main aim of this review is to make practitioners aware of the benefits, and in some cases even the necessity, of applying feature selection techniques. Therefore, we provide an overview of the different feature selection techniques for classification: we illustrate them by reviewing the most important application fields in the bioinformatics domain, highlighting the efforts done by the bioinformatics community in developing novel and adapted procedures. Finally, we also point the interested reader to some useful data mining and bioinformatics software packages that can be used for feature selection.

2 FEATURE SELECTION TECHNIQUES
As many pattern recognition techniques were originally not designed to cope with large amounts of irrelevant features, combining them with FS techniques has become a necessity in many applications (Guyon and Elisseeff, 2003; Liu and Motoda, 1998; Liu and Yu, 2005). The objectives of feature selection are manifold, the most important ones being: (a) to avoid overfitting and improve model performance, i.e. prediction performance in the case of supervised classification and better cluster detection in the case of clustering, (b) to provide faster and more cost-effective models and (c) to gain a deeper insight into the underlying processes that generated the data.
However, the advantages of feature selection techniques come at a certain price, as the search for a subset of relevant features introduces an additional layer of complexity in the modelling task. Instead of just optimizing the parameters of the model for the full feature subset, we now need to find the optimal model parameters for the optimal feature subset, as there is no guarantee that the optimal parameters for the full feature set are equally optimal for the optimal feature subset (Daelemans et al., 2003). As a result, the search in the model hypothesis space is augmented by another dimension: the one of finding the optimal subset of relevant features. Feature selection techniques differ from each other in the way they incorporate this search in the added space of feature subsets in the model selection.
In the context of classification, feature selection techniques can be organized into three categories, depending on how they combine the feature selection search with the construction of the classification model: filter methods, wrapper methods and embedded methods. Table 1 provides a common taxonomy of feature selection methods, showing for each technique the most prominent advantages and disadvantages, as well as some examples of the most influential techniques.

Table 1. A taxonomy of feature selection techniques. For each feature selection type, we highlight a set of characteristics which can guide the choice of a technique suited to the goals and resources of practitioners in the field.

Filter techniques assess the relevance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated, and low-scoring features are removed. Afterwards, this subset of features is presented as input to the classification algorithm. Advantages of filter techniques are that they easily scale to very high-dimensional datasets, that they are computationally simple and fast, and that they are independent of the classification algorithm. As a result, feature selection needs to be performed only once, after which different classifiers can be evaluated.
A common disadvantage of filter methods is that they ignore the interaction with the classifier (the search in the feature subset space is separated from the search in the hypothesis space), and that most proposed techniques are univariate. This means that each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance when compared to other types of feature selection techniques. In order to overcome the problem of ignoring feature dependencies, a number of multivariate filter techniques were introduced, aiming at the incorporation of feature dependencies to some degree.
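As a concrete sketch of the filter paradigm (using scikit-learn as a convenient stand-in; the score function and cut-off are arbitrary choices), features are ranked once, independently of any classifier, and the retained subset can then be fed to different classifiers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB

# Toy data: 2000 features of which only a handful are informative.
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=10, random_state=0)

# Filter step: univariate ANOVA F-score, computed independently of any classifier.
selector = SelectKBest(score_func=f_classif, k=50).fit(X, y)
X_reduced = selector.transform(X)

# The retained subset can now be handed to any classifier (or several in turn).
clf = GaussianNB().fit(X_reduced, y)
print(X_reduced.shape, clf.score(X_reduced, y))
```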
Whereas filter techniques treat the problem of finding a good feature subset independently of the model selection step, wrapper methods embed the model hypothesis search within the feature subset search. In this setup, a search procedure in the space of possible feature subsets is defined, and various subsets of features are generated and evaluated. The evaluation of a specific subset of features is obtained by training and testing a specific classification model, rendering this approach tailored to a specific classification algorithm. To search the space of all feature subsets, a search algorithm is then 'wrapped' around the classification model. However, as the space of feature subsets grows exponentially with the number of features, heuristic search methods are used to guide the search for an optimal subset. These search methods can be divided into two classes: deterministic and randomized search algorithms. Advantages of wrapper approaches include the interaction between feature subset search and model selection, and the ability to take feature dependencies into account. A common drawback of these techniques is that they have a higher risk of overfitting than filter techniques and are very computationally intensive, especially if building the classifier has a high computational cost.
In a third class of feature selection techniques, termed embedded techniques, the search for an optimal subset of features is built into the classifier construction, and can be seen as a search in the combined space of feature subsets and hypotheses. Just like wrapper approaches, embedded approaches are thus specific to a given learning algorithm. Embedded methods have the advantage that they include the interaction with the classification model, while at the same time being far less computationally intensive than wrapper methods.
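The following sketch contrasts the other two paradigms: a greedy forward wrapper search that cross-validates the deployed classifier on each candidate subset, and an embedded selector where an L1 penalty performs the selection during classifier construction. All modelling choices are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# Wrapper: a deterministic (greedy forward) search; each candidate subset is
# scored by cross-validating the very classifier we intend to deploy.
wrapper = SequentialFeatureSelector(KNeighborsClassifier(),
                                    n_features_to_select=5, cv=5).fit(X, y)
print("wrapper picked:", np.flatnonzero(wrapper.get_support()))

# Embedded: the L1 penalty drives irrelevant coefficients to exactly zero,
# so selection happens inside the classifier construction itself.
embedded = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("embedded kept:", np.flatnonzero(embedded.coef_[0] != 0))
```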
3 APPLICATIONS IN BIOINFORMATICS
3.1 Feature selection for sequence analysis
Sequence analysis has a long-standing tradition in bioinformatics. In the context of feature selection, two types of problems can be distinguished: content and signal analysis. Content analysis focuses on the broad characteristics of a sequence, such as the tendency to code for proteins or the fulfilment of a certain biological function. Signal analysis on the other hand focuses on the identification of important motifs in the sequence, such as gene structural elements or regulatory elements.
Apart from the basic features that just represent the nucleotide or amino acid at each position in a sequence, many other features, such as higher order combinations of these building blocks (e.g. k-mer patterns), can be derived, their number growing exponentially with the pattern length k. As many of them will be irrelevant or redundant, feature selection techniques are then applied to focus on the subset of relevant variables.

3.1.1 Content analysis
The prediction of subsequences that code for proteins (coding potential prediction) has been a focus of interest since the early days of bioinformatics. Because many features can be extracted from a sequence, and most dependencies occur between adjacent positions, many variations of Markov models were developed. To deal with the high number of possible features, and the often limited amount of samples, Salzberg et al. (1998) introduced the interpolated Markov model (IMM), which used interpolation between different orders of the Markov model to deal with small sample sizes, and a filter method (χ2) to select only relevant features. In further work, Delcher et al. (1999) extended the IMM framework to also deal with non-adjacent feature dependencies, resulting in the interpolated context model (ICM), which crosses a Bayesian decision tree with a filter method (χ2) to assess feature relevance. Recently, the avenue of FS techniques for coding potential prediction was further pursued by Saeys et al. (2007), who combined different measures of coding potential prediction, and then used the Markov blanket multivariate filter approach (MBF) to retain only the relevant ones.
A second class of techniques focuses on the prediction of protein function from sequence. The early work of Chuzhanova et al. (1998), who combined a genetic algorithm with the Gamma test to score feature subsets for classification of large subunits of rRNA, inspired researchers to use FS techniques to focus on important subsets of amino acids that relate to the protein's functional class (Al-Shahib et al., 2005). An interesting technique is described in Zavaljevsky et al. (2002), using selective kernel scaling for support vector machines (SVM) as a way to assess feature weights, and subsequently remove features with low weights.
The use of FS techniques in the domain of sequence analysis is also emerging in a number of more recent applications, such as the recognition of promoter regions (Conilione and Wang, 2005) and the prediction of microRNA targets (Kim et al., 2006).

3.1.2 Signal analysis
Many sequence analysis methodologies involve the recognition of short, more or less conserved signals in the sequence, representing mainly binding sites for various proteins or protein complexes. A common approach to finding regulatory motifs is to relate motifs to gene expression levels using a regression approach. Feature selection can then be used to search for the motifs that maximize the fit to the regression model (Keles et al., 2002; Tadesse et al., 2004). In Sinha (2003), a classification approach is chosen to find discriminative motifs. The method is inspired by Ben-Dor et al. (2000), who use the threshold number of misclassification (TNoM, see further in the section on microarray analysis) to score genes for relevance to tissue classification. From the TNoM score, a P-value is calculated that represents the significance of each motif. Motifs are then sorted according to their P-value.
Another line of research is performed in the context of the gene prediction setting, where structural elements such as the translation initiation site (TIS) and splice sites are modelled as specific classification problems. The problem of feature selection for structural element recognition was pioneered in Degroeve et al. (2002) for the problem of splice site prediction, combining a sequential backward method with an embedded SVM evaluation criterion to assess feature relevance. In Saeys et al. (2004), an estimation of distribution algorithm (EDA, a generalization of genetic algorithms) was used to gain more insight into the relevant features for splice site prediction. Similarly, the prediction of TIS is a suitable problem to apply feature selection techniques. In Liu et al. (2004), the authors demonstrate the advantages of using feature selection for this problem, using the feature-class entropy as a filter measure to remove irrelevant features.
In future research, FS techniques can be expected to be useful for a number of challenging prediction tasks, such as identifying relevant features related to alternative splice sites and alternative TIS.

3.2 Feature selection for microarray analysis
During the last decade, the advent of microarray datasets stimulated a new line of research in bioinformatics. Microarray data pose a great challenge for computational techniques, because of their large dimensionality (up to several tens of thousands of genes) and their small sample sizes (Somorjai et al., 2003).
Furthermore, additional experimental complications like noise and variability render the analysis of microarray data an exciting domain.
In order to deal with these particular characteristics of microarray data, the obvious need for dimension reduction techniques was realized (Alon et al., 1999; Ben-Dor et al., 2000; Golub et al., 1999; Ross et al., 2000), and soon their application became a de facto standard in the field. Whereas in 2001 the field of microarray analysis was still claimed to be in its infancy (Efron et al., 2001), a considerable and valuable effort has since been made to contribute new and adapt known FS methodologies (Jafari and Azuaje, 2006). A general overview of the most influential techniques, organized according to the general FS taxonomy of Section 2, is shown in Table 2.

Table 2. Key references for each type of feature selection technique in the microarray domain.

3.2.1 The univariate filter paradigm: simple yet efficient
Because of the high dimensionality of most microarray analyses, fast and efficient FS techniques such as univariate filter methods have attracted most attention. The prevalence of these univariate techniques has dominated the field, and up to now comparative evaluations of different classification and FS techniques over DNA microarray datasets have only focused on the univariate case (Dudoit et al., 2002; Lee et al., 2005; Li et al., 2004; Statnikov et al., 2005). This domination of the univariate approach can be explained by a number of reasons:
• the output provided by univariate feature rankings is intuitive and easy to understand;
• the gene ranking output fulfils the objectives and expectations that bio-domain experts have when wanting to subsequently validate the result by laboratory techniques or to explore literature searches; such experts may not feel the need for selection techniques that take gene interactions into account;
• the possible unawareness of subgroups of gene expression domain experts about the existence of data analysis techniques to select genes in a multivariate way;
• the extra computation time needed by multivariate gene selection techniques.
Some of the simplest heuristics for the identification of differentially expressed genes include setting a threshold on the observed fold-change differences in gene expression between the states under study, and the detection of the threshold point in each gene that minimizes the number of training sample misclassifications (threshold number of misclassification, TNoM (Ben-Dor et al., 2000)). However, a wide range of new or adapted univariate feature ranking techniques has since then been developed. These techniques can be divided into two classes: parametric and model-free methods (see Table 2).
Parametric methods assume a given distribution from which the samples (observations) have been generated. The two-sample t-test and ANOVA are among the most widely used techniques in microarray studies, although the usage of their basic form, possibly without justification of their main assumptions, is not advisable (Jafari and Azuaje, 2006). Modifications of the standard t-test to better deal with the small sample size and inherent noise of gene expression datasets include a number of t- or t-test-like statistics (differing primarily in the way the variance is estimated) and a number of Bayesian frameworks (Baldi and Long, 2001; Fox and Dimmic, 2006).
Although Gaussian assumptions have dominated the field, other types of parametric approaches can also be found in the literature, such as regression modelling approaches (Thomas et al., 2001) and Gamma distribution models (Newton et al., 2001).
Due to the uncertainty about the true underlying distribution of many gene expression scenarios, and the difficulty of validating distributional assumptions because of small sample sizes, non-parametric or model-free methods have been widely proposed as an attractive alternative that makes less stringent distributional assumptions (Troyanskaya et al., 2002). Many model-free metrics, frequently borrowed from the statistics field, have demonstrated their usefulness in many gene expression studies, including the Wilcoxon rank-sum test (Thomas et al., 2001), the between-within classes sum of squares (BSS/WSS) (Dudoit et al., 2002) and the rank products method (Breitling et al., 2004).
A specific class of model-free methods estimates the reference distribution of the statistic using random permutations of the data, allowing the computation of a model-free version of the associated parametric tests. These techniques have emerged as a solid alternative to deal with the specificities of DNA microarray data, and do not depend on strong parametric assumptions (Efron et al., 2001; Pan, 2003; Park et al., 2001; Tusher et al., 2001). Their permutation principle partly alleviates the problem of small sample sizes in microarray studies, enhancing robustness against outliers.
We also mention promising types of non-parametric metrics which, instead of trying to identify differentially expressed genes at the whole population level (e.g. comparison of sample means), are able to capture genes which are significantly disregulated in only a subset of samples (Lyons-Weiler et al., 2004; Pavlidis and Poirazi, 2006). These types of methods offer a more patient-specific approach for the identification of markers, and can select genes exhibiting complex patterns that are missed by metrics that work under the classical comparison of two prelabelled phenotypic groups. In addition, we also point out the importance of procedures for controlling the different types of errors that arise in this complex multiple testing scenario of thousands of genes (Dudoit et al., 2003; Ploner et al., 2006; Pounds and Cheng, 2004; Storey, 2002), with a special focus on contributions for controlling the false discovery rate (FDR).
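A minimal sketch of the permutation principle, combined with Benjamini-Hochberg control of the FDR, is given below; the data, statistic and permutation budget are arbitrary choices for illustration, not a reimplementation of any of the cited methods:

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_per_group = 1000, 8
# Toy expression matrix: two phenotype groups; the first 50 genes truly shifted.
X = rng.normal(size=(n_genes, 2 * n_per_group))
X[:50, n_per_group:] += 1.5
labels = np.array([0] * n_per_group + [1] * n_per_group)

def diff_stat(X, labels):
    """Per-gene difference of group means (a deliberately simple statistic)."""
    return X[:, labels == 1].mean(axis=1) - X[:, labels == 0].mean(axis=1)

observed = diff_stat(X, labels)

# Model-free reference distribution: recompute the statistic under random
# relabellings of the samples.
n_perm = 500
perm_stats = np.empty((n_perm, n_genes))
for i in range(n_perm):
    perm_stats[i] = diff_stat(X, rng.permutation(labels))
pvals = (np.abs(perm_stats) >= np.abs(observed)).mean(axis=0)

# Benjamini-Hochberg: the largest k with p_(k) <= (k/m)*q controls the FDR at q.
q = 0.05
order = np.argsort(pvals)
thresh = q * np.arange(1, n_genes + 1) / n_genes
passed = pvals[order] <= thresh
k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
print(f"{k} genes called significant at FDR {q}")
```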
We also mention promising types of non-parametric metrics which, instead of trying to identify differentially expressed genes at the whole population level (e.g. comparison of sample means), are able to capture genes which are significantly dysregulated in only a subset of samples (Lyons-Weiler et al., 2004; Pavlidis and Poirazi, 2006). These methods offer a more patient-specific approach to the identification of markers, and can select genes exhibiting complex patterns that are missed by metrics that work under the classical comparison of two prelabelled phenotypic groups. In addition, we also point out the importance of procedures for controlling the different types of errors that arise in this complex multiple testing scenario of thousands of genes (Dudoit et al., 2003; Ploner et al., 2006; Pounds and Cheng, 2004; Storey, 2002), with a special focus on contributions for controlling the false discovery rate (FDR).

3.2.2 Towards more advanced models: the multivariate paradigm for filter, wrapper and embedded techniques

Univariate selection methods have certain restrictions and may lead to less accurate classifiers by, e.g., not taking into account gene–gene interactions. Thus, researchers have proposed techniques that try to capture these correlations between genes. The application of multivariate filter methods ranges from simple bivariate interactions (Bø and Jonassen, 2002) towards more advanced solutions exploring higher order interactions, such as correlation-based feature selection (CFS) (Wang et al., 2005; Yeoh et al., 2002) and several variants of the Markov blanket filter method (Gevaert et al., 2006; Mamitsuka, 2006; Xing et al., 2001). The Minimum Redundancy-Maximum Relevance (MRMR) (Ding and Peng, 2003) and Uncorrelated Shrunken Centroid (USC) (Yeung and Bumgarner, 2003) algorithms are two other solid multivariate filter procedures, highlighting the advantage of using multivariate methods over univariate procedures in the gene expression domain.

Feature selection using wrapper or embedded methods offers an alternative way to perform a multivariate gene subset selection, incorporating the classifier's bias into the search and thus offering an opportunity to construct more accurate classifiers. In the context of microarray analysis, most wrapper methods use population-based, randomized search heuristics (Blanco et al., 2004; Jirapech-Umpai and Aitken, 2005; Li et al., 2001; Ooi and Tan, 2003), although a few examples use sequential search techniques (Inza et al., 2004; Xiong et al., 2001). An interesting hybrid filter-wrapper approach is introduced in Ruiz et al. (2006), crossing a univariately pre-ordered gene ranking with an incrementally augmenting wrapper method.

Another characteristic of any wrapper procedure concerns the scoring function used to evaluate each gene subset found. As the 0–1 accuracy measure allows for comparison with previous works, the vast majority of papers use this measure. However, recent proposals advocate the use of methods for the approximation of the area under the ROC curve (Ma and Huang, 2005), or the optimization of the LASSO (Least Absolute Shrinkage and Selection Operator) model (Ghosh and Chinnaiyan, 2005). ROC curves certainly provide an interesting evaluation measure, especially suited to the demand for screening different types of errors in many biomedical scenarios.

The embedded capacity of several classifiers to discard input features, and thus propose a subset of discriminative genes, has been exploited by several authors. Examples include the use of random forests (a classifier that combines many single decision trees) in an embedded way to calculate the importance of each gene (Díaz-Uriarte and Alvarez de Andrés, 2006; Jiang et al., 2004). Another line of embedded FS techniques uses the weights of each feature in linear classifiers, such as SVMs (Guyon et al., 2002) and logistic regression (Ma and Huang, 2005). These weights are used to reflect the relevance of each gene in a multivariate way, and thus allow for the removal of genes with very small weights.

Partially due to the higher computational complexity of wrapper and, to a lesser degree, embedded approaches, these techniques have not received as much interest as filter proposals. However, an advisable practice is to pre-reduce the search space using a univariate filter method, and only then apply wrapper or embedded methods, hence fitting the computation time to the available resources.
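The weight-based embedded strategy can be sketched briefly: train a linear classifier, drop the genes with the smallest absolute weights, and repeat, in the spirit of the SVM-based recursive feature elimination of Guyon et al. (2002). The use of scikit-learn, the toy data and the subset size are assumptions of this sketch.

    # Minimal sketch: embedded selection via linear SVM weights (RFE-style).
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 500))             # 60 samples x 500 genes (toy)
    y = rng.integers(0, 2, size=60)

    kept = np.arange(X.shape[1])
    while kept.size > 20:                      # target subset size (arbitrary)
        svm = LinearSVC(C=1.0, dual=False).fit(X[:, kept], y)
        w = np.abs(svm.coef_).ravel()          # per-gene relevance weights
        kept = kept[np.argsort(w)[kept.size // 2:]]  # drop the weakest half
    print("selected genes:", kept)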
3.3 Mass spectra analysis

Mass spectrometry (MS) technology is emerging as a new and attractive framework for disease diagnosis and protein-based biomarker profiling (Petricoin and Liotta, 2003). A mass spectrum sample is characterized by thousands of different mass/charge (m/z) ratios on the x-axis, each with a corresponding signal intensity value on the y-axis. A typical MALDI-TOF low-resolution proteomic profile can contain up to 15,500 data points in the spectrum between 500 and 20,000 m/z, and the number of points grows even further with higher resolution instruments. For data mining and bioinformatics purposes, it can initially be assumed that each m/z ratio represents a distinct variable whose value is the intensity.

As Somorjai et al. (2003) explain, the data analysis step is severely constrained by both high-dimensional input spaces and their inherent sparseness, just as is the case with gene expression datasets. Although the number of publications on mass spectrometry based data mining is not comparable to the level of maturity reached in the microarray analysis domain, an interesting collection of methods has been presented in the last 4–5 years (see Hilario et al., 2006; Shin and Markey, 2006 for recent reviews) since the pioneering work of Petricoin et al. (2002).

Starting from the raw data, and after an initial step to reduce noise and normalize the spectra from different samples (Coombes et al., 2007), the crucial next step is to extract the variables that will constitute the initial pool of candidate discriminative features. Some studies employ the simplest approach of considering every measured value as a predictive feature, thus applying FS techniques over initial huge pools of about 15,000 variables (Li et al., 2004; Petricoin et al., 2002), up to around 100,000 variables (Ball et al., 2002). On the other hand, a great deal of the current studies perform aggressive feature extraction procedures using elaborate peak detection and alignment techniques (see Coombes et al., 2007; Hilario et al., 2006; Shin and Markey, 2006 for a detailed description of these techniques). These procedures tend to reduce the dimensionality from which supervised FS techniques will start their work to fewer than 500 variables (Bhanot et al., 2006; Ressom et al., 2007; Tibshirani et al., 2004). A feature extraction step is thus advisable to bring the computational costs of many FS techniques to a feasible size in these MS scenarios.
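A minimal sketch of such a dimensionality-reducing extraction step is shown below: raw (m/z, intensity) pairs are averaged within fixed m/z bins. Real pipelines use the peak detection and alignment procedures cited above; the binning approach, sizes and toy spectrum here are illustrative assumptions only.

    # Minimal sketch: reduce a raw spectrum to a few hundred binned features.
    import numpy as np

    rng = np.random.default_rng(0)
    mz = np.linspace(500, 20_000, 15_000)          # ~15,000 raw m/z points
    intensity = rng.gamma(2.0, 1.0, size=mz.size)  # toy intensity profile

    n_bins = 300                                   # target feature count
    edges = np.linspace(mz[0], mz[-1], n_bins + 1)
    idx = np.digitize(mz, edges[1:-1])             # bin index per data point
    features = np.array([intensity[idx == b].mean() for b in range(n_bins)])
    print(features.shape)                          # (300,) candidate features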
Table 3 presents an overview of FS techniques used in the domain of mass spectrometry. Similar to the domain of microarray analysis, univariate filter techniques seem to be the most commonly used, although the use of embedded techniques is certainly emerging as an alternative. Although the t-test maintains a high level of popularity (Liu et al., 2002; Wu et al., 2003), other parametric measures such as the F-test (Bhanot et al., 2006), and a notable variety of non-parametric scores (Tibshirani et al., 2004; Yu et al., 2005), have also been used in several MS studies. Multivariate filter techniques, on the other hand, are still somewhat underrepresented (Liu et al., 2002; Prados et al., 2004).

Table 3. Key references for each type of feature selection technique in the domain of mass spectrometry

Wrapper approaches have demonstrated their usefulness in MS studies through a group of influential works. Different types of population-based randomized heuristics are used as search engines in most of these papers: genetic algorithms (Li et al., 2004; Petricoin et al., 2002), particle swarm optimization (Ressom et al., 2005) and ant colony procedures (Ressom et al., 2007). It is worth noting that while the first two references start the search procedure in ~15,000 dimensions by considering each m/z ratio as an initial predictive feature, aggressive peak detection and alignment processes reduce the initial dimension to about 300 variables in the last two references (Ressom et al., 2005; Ressom et al., 2007).

An increasing number of papers use the embedded capacity of several classifiers to discard input features. Variations of the popular method originally proposed for gene expression domains by Guyon et al. (2002), which uses the weights of the variables in the SVM formulation to discard features with small weights, have been broadly and successfully applied in the MS domain (Jong et al., 2004; Prados et al., 2004; Zhang et al., 2006). Based on a similar framework, the weights of the input masses in a neural network classifier have been used to rank the features' importance in Ball et al. (2002). The embedded capacity of random forests (Wu et al., 2003) and other types of decision tree-based algorithms (Geurts et al., 2005) constitutes an alternative embedded FS strategy.

4 DEALING WITH SMALL SAMPLE DOMAINS

Small sample sizes, and their inherent risk of imprecision and overfitting, pose a great challenge for many modelling problems in bioinformatics (Braga-Neto and Dougherty, 2004; Molinaro et al., 2005; Sima and Dougherty, 2006). In the context of feature selection, two initiatives have emerged in response to this novel experimental situation: the use of adequate evaluation criteria, and the use of stable and robust feature selection models.

4.1 Adequate evaluation criteria
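One evaluation criterion that is repeatedly stressed for small-sample settings is that feature selection must be performed inside each cross-validation fold, so the held-out samples never influence which features are chosen; selecting on the full dataset first yields optimistically biased error estimates. The sketch below illustrates this with assumed scikit-learn components and toy data.

    # Minimal sketch: feature selection inside the cross-validation loop.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 1000))          # 50 samples, 1000 features
    y = rng.integers(0, 2, size=50)          # random labels: truth is ~0.5

    scores = []
    for train, test in StratifiedKFold(n_splits=5).split(X, y):
        sel = SelectKBest(f_classif, k=20).fit(X[train], y[train])  # train only
        clf = KNeighborsClassifier().fit(sel.transform(X[train]), y[train])
        scores.append(clf.score(sel.transform(X[test]), y[test]))
    print("accuracy estimate:", np.mean(scores))   # stays near chance (0.5)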
Tina Memo No. 1997-002, published in Neural Networks Journal, 10, 2, pp. 315-326, 1997.

Supervised Learning Extensions to the CLAM Network.

N. A. Thacker, I. A. Abraham and P. Courtney.

Last updated 1/12/1998.

Imaging Science and Biomedical Engineering Division, Medical School, University of Manchester, Stopford Building, Oxford Road, Manchester, M13 9PT.

1 Abstract

The Contextual Layered Associative Memory (CLAM) has been developed as a self-generating structure which implements a probabilistic encoding scheme. The training algorithms are geared towards the unsupervised generation of a layerable associative mapping [17]. We show here that the resulting structure will support layers which can be trained to produce outputs that approximate conditional probabilities of classification. Unsupervised and supervised learning algorithms operate independently, permitting the unsupervised representational layer to be developed before supervision is available. The system thus supports learning which is inherently more flexible than conventional node labeling schemes.

[Symbols used: S^{1/2}, ∞, −∞, P(x|w), ∆z, δ, f(S); layer labels I, J, K, L; node indices i, j, k, l, m, n; inputs I_i; node outputs o_i; class layer X, class node x, class output O_x]

2 Introduction

In a previous paper [17] it was shown how probabilistic encoding of outputs (estimates with confidences) allows the layering of associative memories for such hierarchical recognition tasks. The CLAM architecture is composed of two sorts of layers: unsupervised representational layers, which feed into supervised classification layers. The output response of nodes in the unsupervised representational layers is defined according to the probability that each could have been chosen as a maximum on the basis of the dot-product metric (reasons for this choice will be given later). This probability is proportional to the firing rate that would be obtained in a neuronal system operating a winner-take-all scheme such as that suggested by Grossberg in [4].
Furthermore, each associative layer is structurally identical to all others. Here we show how such unsupervised representational layers can be used to provide input to a different sort of layer for classification purposes. These classification layers are used in conjunction with a supervised training regime which supports the calculation of conditional probabilities of classification. This contrasts with conventional node labeling methods generally adopted for associative classification, which preserve less information.

In developing the system it was considered important that the performance should be invariant, wherever possible, to principled changes in the input data. By this we mean that the assumptions made and constraints applied to the input data should be self-consistent and maintained throughout the system. For instance, arbitrary partitioning of the input data should not alter network performance. An example of this in CLAM is the handling of information made available in the form of frequency histograms. Such histograms consist of a number of bins defined with upper and lower boundaries. The content of each bin represents the number of times an event has yielded a value between these boundaries. The size and number of bins are parameters of the data collection system, but the actual location of their boundaries is arbitrary. We would regard this displacement of the defined positions of bin boundaries as a 'principled' change in the input data, which would alter individual bin contents but should not change network performance.

The information capacity of a fixed architecture will always be limited, and we are interested here in developing an architecture which is capable of continual learning; thus a flexible architecture is required. Such flexible architectures would require algorithms permitting the addition and removal of nodes and connections, thus altering the information capacity to match the task. In particular we would want the network performance to be stable with respect to node addition and removal, in the sense that the classification ability of the network should remain broadly unchanged rather than degrade abruptly when a fresh node is added. Of course, classification ability should improve as the weights of the new node are adjusted. Sparse connectivity achieves the goal of stability because it ensures local representation and thus limits the impact of changes to the few connected nodes. As a side effect, such a network would train in a time which is of linear order with respect to the number of input data patterns, since there are far fewer weights to adjust. This is an extremely desirable characteristic absent from many network architectures and algorithms.

One can summarise the desirable characteristics of a general learning network as follows:

• Independent supervised and unsupervised learning - to make use of limited amounts of prior knowledge as and when available
• Online training and continuous learning - prior knowledge to be made explicit and not hidden in training or architectural parameters
• Flexible architecture with automatic self-generation - to deal with the varying information capacity of different tasks
• Probabilistic outputs - to permit the multi-layering of networks required for hierarchical classification tasks
• Linear order learning time

The CLAM algorithms deliver a network architecture with these properties.
This paper contains a summary of the essential CLAM algorithms and the extensions necessary to support supervised training: that is, an unsupervised training algorithm and a node-generation algorithm, and a supervised training algorithm. The overall system is demonstrated by the classification of significantly overlapping data distributions. We show that in this task (as with other supervised network architectures) the discrimination power of the network approaches optimality, in the sense that the decision boundaries are positioned in accordance with the underlying probability distributions. The previous paper [17] concentrated on the unsupervised aspects of the original CLAM, which included temporal integration to learn sets of patterns. The present paper discusses the combined unsupervised and supervised architecture for learning conditional probabilities.

The rest of the paper is organised as follows: Section 3 recaps the previous paper by describing the structure and function of the unsupervised layers for spatial and temporal integration in a form which gives resistance to node addition and removal, while Section 4 explains the training algorithms for unsupervised learning and node generation. Section 5 discusses the architecture of the supervised layer, and the overall model of CLAM is presented in the form of a diagram. Section 6 outlines a simple test used to demonstrate the performance of the extended network, and Section 7 closes the paper with a reminder of the main elements and gives pointers to the application of CLAM in the area of machine vision. Some remarks which are felt to relate this statistically-based architecture to physiological neuronal systems are made during the course of the paper in the form of footnotes.

3 Designing a Layered Associative Memory

The CLAM system is an extension of conventional associative network pattern classifiers, allowing a layered classification architecture to be developed. Figure 1 shows an example of CLAM being used to recognise sets of patterns presented serially.

Figure 1: Network Architecture

The network algorithms are based on a simple stochastic model of neuronal function, where the output of particular nodes is determined by the time-averaged response of a nearest-neighbour classification network to noisy input data and noisy internode connections (weights). In what follows we assume that the effects of such noise can be represented with mean values of input I_i and connection strength z_ij and their respective uncertainties δI_i and δz_ij. The classification scheme requires computation of the probability P_j that the input pattern would have best matched the pattern encoded at each node j on the basis of a Bayesian norm (or dot product) of the I_i and z_ij, as in Equation (1). This is as used in Bayesian classifiers and is one of the simplest possible functions of neurons and synapses used in neural networks.

    D_j = \sum_i I_i z_{ij}    (1)

subject to the expected errors δI_i and δz_ij. This form was chosen in preference to a Euclidean distance measure because connections do not need to be maintained for components of the input pattern which are expected to be zero. It is therefore particularly suitable for a system which is to have a flexible architecture, as it permits smooth variation in network performance subject to the addition and removal of nodes and connections by incremental training from zero. The standard Euclidean metric is not adequate, as it is a distance measure that can only be defined in a space of fixed dimensionality and thus a fixed number of connected nodes. The weighted Euclidean metric, which can give smooth performance under variable dimensionality, requires an additional mechanism to deal with the weights, whereas with the dot-product metric this is all handled automatically.
This dot product must also generate normalised templates of input patterns on the inputs connecting to each node, so that Equation (2) holds.

    \sum_i z_{ij}^2 = 1    (2)

All networks use an implicit subjective function for information retrieval [9] from which constraints on the properties of suitable input patterns can be derived. In the case of CLAM, the use of the dot product implies that the input vector must have components which are a measure of the relative significance of each feature to the representation. We suggest that the logical extension to this line of reasoning is that equal amounts of evidence in different parts of the input space should carry equal influence in the selection of a maximum node. Referring back to the example of data presented in the form of a frequency histogram, this is equivalent to saying that the initial definition of input space partitions due to the positioning of the histogram bin boundaries should not have an effect on the performance of the trained system. We conclude [Appendix 1] that if we are deriving inputs for the network from measures of relative significance S_i (defined as proportional to the probability of observing feature i) which add linearly, then the correct function of significance suitable for input to a network using the dot-product metric is the square root, Equation (3).

    I_i = S_i^{1/2}    (3)

It is important to note that this result has been derived purely on the basis of using the dot product to index into stored patterns in a principled way, that is, applying the implicit constraints to restrict the form of the input. In this case propagating the constraint restricts the form of the input function. Similar reasoning can be proposed with respect to the normalisation of z_ij, the stored weights in Equation (2), enforced during training for the storage of all patterns with equal total significance. Comparing with the dot-product definition and replacing the stored pattern z_ij with a template of previous data S_ij^{1/2}, the node activation function now becomes Equation (4).

    D_j = \sum_i S_i^{1/2} S_{ij}^{1/2}    (4)

This discrete measure is a form of the Bhattacharyya distance measure L_B [8] for comparing continuous probability density functions P(x|w_j), when S_i and S_ij equate to the two probability density functions in Equation (5).

    L_B = -\ln \int_{-\infty}^{\infty} (P(x|w_1) P(x|w_2))^{1/2} dx    (5)

This is important, as any variable we may wish to use for input to CLAM may be represented in the form of a probability density distribution, for example statistical frequency occurrence histograms of a measurement variable. This process can be considered analogous to the process of fuzzification often described in the fuzzy logic literature, but here, as we will show, the method can be justified in the context of a strictly Bayesian formalism.
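The square-root encoding and dot-product matching of Equations (2)-(4) can be written in a few lines; the toy histograms below are illustrative assumptions.

    # Minimal sketch: CLAM-style input encoding and node activation.
    import numpy as np

    def encode(hist):
        """I_i = S_i^{1/2}, normalised so that sum(I**2) == 1 (Eqs. 2-3)."""
        s = np.asarray(hist, dtype=float)
        return np.sqrt(s / s.sum())

    def activation(I, z):
        """D_j = sum_i I_i z_ij, a discrete Bhattacharyya affinity (Eq. 4)."""
        return float(I @ z)

    template = encode([5, 10, 20, 10, 5])   # stored pattern S_ij^{1/2}
    probe = encode([4, 12, 19, 10, 5])      # similar input histogram
    print(activation(probe, template))      # near 1 for a close match

Repartitioning the bins changes the individual terms but, for slowly varying distributions, leaves the affinity nearly unchanged, which is in the spirit of the bin-boundary invariance argued for above.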
As has been demonstrated elsewhere [6], choosing the node with the largest D_j is equivalent to choosing that with the lowest χ² (chi-squared) statistic, the common maximum likelihood estimator. If we consider a simple neuronal model employing a winner-take-all mechanism, the frequency output of each node will be constant for constant input frequencies and must be proportional to the node output probabilities P_j. The probabilistic form of the output allows a greater amount of information regarding the input pattern to be preserved. It has the rather unusual feature, when compared to most other network models, that the output of any one node is not simply a function of its inputs. This is due to the winner-take-all strategy used in the underlying stochastic model to generate the time-averaged output probabilities. These probabilities P_j can be generated in all cases by Monte-Carlo simulation of the noise fluctuations, but can be approximated analytically (though using an empirical formulation) assuming that all fluctuations in frequency coding within the network are uncorrelated and normally distributed [Appendix 2]. The probabilistic interpretation is not a complete specification of the output o_j, as it must clearly also be dependent upon the magnitude of the input.

The CLAM network has been designed to be layerable, so that the time-integrated outputs (mean output firing frequencies) from one layer can be used as inputs to the next. The buffered output from an associative layer thus has the correct statistical properties to be used in this way. Layered associative memories make possible hierarchical classification for representing and recognising sets of input vectors. When layering the network, the output is integrated over time and must produce values which are invariant to temporal segmentation of the input. That is, the total number of pulses propagated through each part of the network must be the same regardless of the order of presentation of the patterns. These constraints specify the analytic form of the output from each node o_j [Appendix 3], which is a measure of the significance of the match between the input pattern and the stored weights, Equation (6).

    o_j = P_j |I|^2    (6)

In the case of the winning node, the stored template and input pattern will be very similar, thus the modulus of the input |I| can be replaced with the locally computed D_j. Probabilistic encoding correctly normalises the output data so that correct account is taken of all components forming the input to the next layer. Input to a node k in the next layer is obtained in accordance with our previous result [Appendix 1] and is simply Equation (7).

    I_k = (\sum_t o_j)^{1/2}    (7)

where the summation implies a temporal addition of the outputs generated by all of the current set of patterns presented to layer J for classification by layer K. Thus the generated output is independent of the order of presentation of data to the previous layer.
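Because P_j is defined by a winner-take-all competition under noise, it can always be estimated by Monte-Carlo simulation, as the text notes. The following sketch does exactly that; the noise levels, patterns and trial count are illustrative assumptions (the analytic approximation of Appendix 2 is not reproduced here).

    # Minimal sketch: Monte-Carlo estimate of the winner probabilities P_j.
    import numpy as np

    def winner_probabilities(I, Z, dI=0.05, dZ=0.05, n_trials=20_000, seed=0):
        """P_j = P(node j wins the dot-product competition under noise)."""
        rng = np.random.default_rng(seed)
        wins = np.zeros(Z.shape[1])
        for _ in range(n_trials):
            In = I + rng.normal(0, dI, size=I.shape)   # noisy input
            Zn = Z + rng.normal(0, dZ, size=Z.shape)   # noisy weights
            wins[np.argmax(In @ Zn)] += 1              # D_j = sum_i I_i z_ij
        return wins / n_trials

    Z = np.array([[0.8, 0.6, 0.447],
                  [0.6, 0.8, 0.894]])      # 2 inputs x 3 unit-norm templates
    I = np.array([0.8, 0.6])               # input close to the first node
    print(winner_probabilities(I, Z))      # sums to 1 over nodes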
4 Training Associative Layers

The associative layers of the network can be used as an interpolative memory system, using simple weighted averages of values V_j associated with each output node to predict the expectation value of the variable <V>. We have found that such an interpolation mechanism can be an order of magnitude more efficient at encoding output mappings than conventional nearest-neighbour systems. By this we mean that the interpolation can give better resolution for fewer nodes. If we choose, as in Equation (8),

    V_j = D_j z_{ij}    (8)

then the system is capable of making a prediction, T_i, of its own input I_i, as in Equation (9).

    T_i = \sum_j P_j D_j z_{ij}    (9)

This has been used to facilitate a noise-filtering resonance algorithm with bidirectional weights z_ij and z_ji used to predict the input values. This resonance algorithm, developed from ideas of Grossberg [4], is crucial to the performance of the system in a noisy environment and has been detailed elsewhere [17]. In the normal case, without contamination from outlier data, the bidirectional weights should be equal (z_ij = z_ji).

The training algorithm, which minimises the squared error on the reconstructed input, follows naturally in Equation (10)

    \Delta z_{ij} = k_j^{-1} P_j (I_i - T_i)    (10)

where k_j is given by Equation (11)

    \Delta k_j = P_j (D_j - \beta k_j)    (11)

The term k_j ensures that learning proceeds as if forming a weighted mean of the examples of the input pattern, with flexibility for change at a level determined by β. The number of extra presentations of data required to retrain part of the network is of the order 1/β.

A fixed network architecture has a limit on information capacity. In an environment where the network is to be continually learning, we must have ways of expanding this architecture by generating more nodes. The node-generation algorithm proposed here embodies the basic principle that, in a purely unsupervised learning regime, the density of nodes should be determined by the inherent precision of the input patterns (limited by the representations used), and not by the within-class distribution of the training data sample as in other architectures. The generation rule is to generate a new node whenever max(P_j) exceeds a given threshold ρ for any input pattern. For normalised input patterns, this implies that node generation may be triggered simply on the basis of node output. This node-generation algorithm also ensures that all distinguishable patterns are encoded over several nodes. This is what is needed if the output of the network is to accurately reflect small changes in the input pattern. It is very similar, in principle, to the radial basis function approach suggested by [2] when the radial functions are normalised. This process has now been accepted as a standard method for improving the generalisation characteristics of interpolating systems [3]. The expected resolution of the system, embodied in the reproducibility of weight values δz_ij, is currently fixed.
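Equations (9)-(11) and the node-generation rule translate directly into code. The sketch below assumes P_j has already been computed (e.g. by the Monte-Carlo sketch above); the constants, the starting value of k for a fresh node, and the omission of weight re-normalisation per Equation (2) are illustrative simplifications.

    # Minimal sketch: unsupervised CLAM weight update and node generation.
    import numpy as np

    def train_step(I, Z, k, P, beta=0.01):
        """One pattern presentation; returns updated (Z, k)."""
        D = I @ Z                          # D_j = sum_i I_i z_ij
        T = Z @ (P * D)                    # T_i = sum_j P_j D_j z_ij (Eq. 9)
        Z = Z + np.outer(I - T, P / k)     # dz_ij = P_j (I_i - T_i)/k_j (Eq. 10)
        k = k + P * (D - beta * k)         # dk_j = P_j (D_j - beta k_j) (Eq. 11)
        return Z, k                        # (column re-normalisation omitted)

    def maybe_add_node(I, Z, k, P, rho=0.5):
        """Recruit a node initialised to the current input when max(P_j)
        exceeds rho, so patterns become encoded over several nodes until
        the winner probability splits among the duplicates."""
        if P.max() > rho:
            Z = np.column_stack([Z, I])    # new weights start at the input
            k = np.append(k, 1.0)          # fresh learning factor (assumed)
        return Z, k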
However, algorithms are being sought to allow these values to be adapted subject to negative reinforcement. The limiting accuracy is defined by the expected error on the inputs δI_i when δz_ij = 0. The weights to new nodes are initialised by the current input pattern, so that subsequent training only modifies these values slightly and the generated architectures can be used after a relatively short training period. This is to be compared with other neural architectures which initialise weights at random and thus take some time to settle down. Indeed, such ideas as sparse connectivity and local representation over a few nodes, which emerge thanks to the node-generation algorithm, result in a network which trains in a time which is of linear order with respect to the number of input data patterns, as mentioned in the introduction. During the previous study both the training algorithm and the node-generation algorithm were shown to converge to stable solutions for a fixed region of pattern space [17]. This stability is provided by both the training algorithm and the increasing learning factor k_j. After a period of learning, the outputs generated by the network for a particular input pattern thus become reproducible to within a known error limit and the training algorithm ceases to modify the weights.

5 Supervised Classification Layer

Supervised training of associative classification networks is generally done by node labeling; that is, each node in the trained network is associated with a particular class type. There exist training algorithms for these systems which have been shown to be capable of approximating the performance of a Bayes classifier [10]. These classifiers are capable of returning the most likely class given the inputs, but do not return any information on how likely this classification is to be wrong. Provided the classes are well separated in feature space, ambiguity of classification will not be a problem. However, as this situation is rare for real problems, it would seem that current methods have limited applicability. With node labeling methods the decision boundaries are inherent in the network architecture and thus fixed for a particular classification task. A completely new architecture must be generated if the classification choices are re-specified for the same input data. This would be a limitation in a general learning environment, particularly if a system were trying to develop its own classification scheme without prior access to a pre-defined set of classes. Thus, it is important for a classification scheme to be developed which preserves more information about the result of the classification process and is flexible enough to cope with modification of the classification scheme.

We can address first the information preservation aspect of the classification problem. Given the outputs of the CLAM network P(j|I) (the probability that the input pattern I is consistent with the exemplar stored at each node j), information would be preserved if we were able to compute the conditional probability P(C|I) that an input belonged to any externally defined class C. Other neural network architectures can be shown to approximate the same output response when trained with the appropriate algorithms [15], but do not have the internal structure to guarantee adequacy of the architecture to solve the task. In the case of CLAM, however, the result can be directly obtained as the following derivation demonstrates. Starting with Equation (12):

    P(C|I) = \sum_j P(C, j|I)    (12)

where j denotes a mutually exclusive data partition such as the selection of the nearest node.
Then, from Bayes' theorem, we can obtain Equation (13)

    P(C|I) = \sum_j P(C|j, I) P(j|I)    (13)

but in a fully trained network, knowledge of I gives no additional information once j is known, so that P(C|j, I) ≈ P(C|j), giving Equation (14)

    P(C|I) ≈ \sum_j P(C|j) P(j|I)    (14)

This is a standard form for probability recoding by integrating over a marginal variable, as suggested by [5] and used recently in other neural network architectures [12]. The method is directly comparable to the method of mixture model classification used in the field of statistical pattern recognition [14]. Here we have the slight refinement that P(j|I) is defined for a locally computed winner-take-all selection mechanism operating a statistically-based comparison metric for comparing frequency-coded signals. Thus we need access to the quantity P(C|j), which is the probability that the input pattern is of class C given that node j has been chosen as the maximum. For the output O_x from the classification node x, these connections can be trained using a running-average algorithm, Equation (15)

    \Delta z_{jx} = k_x^{-1} P_j \alpha (C_x/\alpha - O_x)    (15)

where C_x is a measure of certainty of the classification result x, which could either be a constant value (say 1) for a perfect supervisor, or can be any independently scaled probability (for example αP(C|I) in the case of information provided by another classification system such as an associative network input stage). The factor α is a measure of the confidence of the supervision and ensures that training ceases if no information is available. The overall form of this training algorithm is distinctly Hebbian, in the sense that reinforcement is largest when the activities o_j and O_x at both ends of the connection z_jx are large. The learning rate is controlled by k_x, which is determined by Equation (16)

    \Delta k_x = \alpha P_j - \beta k_x    (16)

The β term again determines the steady-state flexibility of learning for dealing with pattern variation. For the case β = 0 this equation provides a best estimate of P(C|j) in the maximum likelihood sense and so can be regarded as optimal learning. The overall probability can then be computed using an additional layer of neural architecture with distinct class nodes x connected to the intermediate mapping nodes j by weights z_jx proportional to P(C|j). This is presented in Figure 2, which shows both the unsupervised and the supervised layers.

Figure 2: Probability Calculation

We then compute the output O_x from the node x using Equation (17)

    O_x = \sum_j o_j z_{jx}    (17)

which will be proportional to the conditional probabilities that we seek, giving information about all possible classifications x. As will be realised, this supervised training algorithm requires just one additional classification layer and can thus operate independently of the unsupervised training processes in the intermediate layer J. Connectivity between layers J and X can be sparse, as we only need to create connections between those nodes in both layers which have correlated output activities (large P(C|j)). The intermediate association layer J must be stable so that the classification layer X can learn to compute the conditional probabilities. This property can be guaranteed by the CLAM algorithms. The resulting hybrid classification system is capable of learning in supervised and unsupervised as well as mixed supervision environments. It has been shown [13] that such architectures are capable of extremely rapid learning. This result is borne out in our experience.
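The supervised layer of Equations (14)-(17) amounts to learning a running average of P(C|j) on the connections z_jx and recoding node probabilities into class probabilities. The sketch below assumes a perfect supervisor (C_x = α = 1 for the true class), uses the node probabilities directly in place of the o_j (unit-magnitude input), and treats the k_x bookkeeping loosely; all sizes and constants are illustrative.

    # Minimal sketch: supervised recoding layer on top of node outputs.
    import numpy as np

    def supervised_step(P, Zx, kx, target, alpha=1.0, beta=0.0):
        """One labelled presentation; `target` is a one-hot class vector."""
        O = P @ Zx                             # O_x = sum_j o_j z_jx (Eq. 17)
        Zx = Zx + np.outer(P, alpha * (target - O)) / kx   # Eq. (15)
        kx = kx + alpha * P.sum() - beta * kx  # learning-rate control (Eq. 16)
        return Zx, kx

    n_nodes, n_classes = 8, 3
    Zx = np.zeros((n_nodes, n_classes))        # converges toward P(C|j)
    kx = np.ones(n_classes)
    P = np.full(n_nodes, 1.0 / n_nodes)        # toy winner probabilities
    Zx, kx = supervised_step(P, Zx, kx, np.array([1.0, 0.0, 0.0]))
    print(P @ Zx)                              # class outputs ~ P(C|I) (Eq. 14)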
6 Performance Test

For a given input, the network should provide an estimate of the conditional probability for a particular classification. In order to verify the analytic approximation to the probability that the input pattern I is consistent with a template stored at a node in layer J, a comparison was carried out using a Monte-Carlo simulation. We also wish to demonstrate that the network can cope with non-linearly separable and disjoint classes. Note that unsupervised learning and node-generation ability were demonstrated in the previous study. For an arbitrary data distribution, the only information available on which to base any estimates is the correspondence of input patterns with output classifications. Under these conditions, the best estimate for the conditional probability P(C|I) is given simply by the ratio of the number of times that the particular input vector I has mapped to class C, divided by the total number of times that input vector I has occurred. It is therefore possible to determine the true probabilities, to any required accuracy, by means of a Monte-Carlo simulation. These can then be compared with the network outputs to evaluate performance and convergence properties.

Figure 3: Class Occurrences

The data for the Monte-Carlo simulation and training of the CLAM network were generated according to a standard Gaussian distribution. The use of dot products for the similarity measure in this network imposes certain constraints on the type of data we can use as input, and in particular it is necessary to generate normalised input vectors for the Monte-Carlo simulation if the results are to be compared with those of the network. In our test we have used an input data set generated as 3-element vectors representing xyz positions on the surface of a unit-radius sphere. This simple data set was partitioned into three classes according to location on the sphere. Random 3-element vectors I_1, I_2 and I_3 were then used as input to the first layer of the network. Figure 3 shows a two-dimensional scatterplot projection of class occurrences. It shows that class A has a bi-modal distribution, with two overlapping classes, B and C, between its peaks. The expected probabilities are simply found by sampling the frequency of different class occurrences at a number of points along a trajectory in the input space (solid line in Figure 5), effectively a k-nearest-neighbour classifier. Figure 4 shows how intermediate nodes were positioned after training to map the input space to a resolution specified by the estimated error on the input components, parameters δz_ij and δI_i.

Figure 4: Node Positions

Figure 5: Network Output

To compare the CLAM estimates of the conditional probabilities with those of our Monte-Carlo simulation, we sampled the data at 61 points along the chosen trajectory through the input space. The output of the CLAM network is then found for 11 of these points (Figure 5). The node outputs clearly model the correct probability density distributions of the input data. Optimal Bayesian classification is achieved simply by taking the maximum probability. For this purpose the new network will perform in a similar manner to more conventional node labelling.
Combined use of unsupervised and supervised learning for daily peak load forecasting

M.R. Amin-Naseri *, A.R. Soroush

Department of Industrial Engineering, Tarbiat Modares University, Jalal Alahmad Highway, P.O. Box 14115-143, Tehran, Iran

Received 22 May 2007; accepted 6 January 2008. Available online 10 March 2008.

Energy Conversion and Management 49 (2008) 1302-1308. doi:10.1016/j.enconman.2008.01.016

Abstract

In this paper, we have aimed to present a hybrid neural network model for daily electrical peak load forecasting (PLF). Since peak loads usually follow similar patterns, classification of data improves the accuracy of the forecasts. Several factors in peak load, e.g. weather temperature, relative humidity, wind speed and cloud cover, were introduced into the model in order to enhance forecast quality. Most classification attempts in the literature have been intuitive and empty of justification. In this paper, we have proposed a novel approach for clustering data by using a self-organizing map. The Davies–Bouldin validity index was introduced to determine the best clusters. A feed forward neural network (FFNN) has been developed for each cluster to provide the PLF. Eight training algorithms have also been used in order to train the proposed FFNNs. Applying principal component analysis (PCA) decreased the dimensions of the network's inputs and led to a simpler architecture. To evaluate the effectiveness of the proposed hybrid model (PHM), forecasting has been performed by developing a FFNN that uses the un-clustered data. The results proved the superiority and effectiveness of the PHM. Linear regression (LR) models have also been developed for PLF, and the results indicated that the PHM produces considerably better forecasts than those of the LR models. Furthermore, the results show that the suggested clustering approach significantly improves the forecasting results of regression analysis too. © 2008 Elsevier Ltd. All rights reserved.

Keywords: Forecasting; Peak load; Clustering; Artificial neural networks; Self-organizing map; Feed forward neural networks

1. Introduction

Since electrical peak load forecasting plays a significant role in the effective and economic operation of power utilities, it has long been of interest to researchers and academics. Peak load forecasting (PLF) for the subsequent day has a key role in electrical power system operation, unit commitment and energy scheduling. Developing an accurate and robust peak load forecasting methodology can lead to more accurate forecasting of electricity consumption. Furthermore, an accurate peak load forecast can significantly reduce the cost of operating power systems [1,2]. Thus, researchers have used various techniques in the past for peak load forecasting. Some have used time series and linear regression models in PLF [3–5]. In such methodologies, the relationship between independent variables and the dependent variable (the forecast) is determined through a mathematical equation, which is usually linear. However, modeling the complex correlation between the load and input variables like weather conditions and differences between days of the week makes such methods quite difficult to use. Throughout the last two decades, a great deal of research has been devoted to using artificial neural networks (ANN) for PLF [6–18]. ANN techniques, with their high capability in non-linear modeling, have gained widespread use in general forecasting and particularly in electrical peak load forecasting. In this paper, artificial neural network techniques have been used in order to forecast daily electrical peak loads.
* Corresponding author. Tel.: +9821880110013344; fax: +982188005040. E-mail address: Amin_nas@modares.ac.ir (M.R. Amin-Naseri).

Since peak electrical load is strongly influenced by weather conditions and consumer behaviors throughout the days of the year, more or less similar consumption patterns can be observed throughout the year. Therefore, classifying the data into somewhat similar clusters can lead to noise reduction and, therefore, higher accuracy. Different types of classifications have been proposed in the PLF literature. However, most of the proposed methods seem to be intuitive, with no justifiable reasoning. Huang et al. classified the year into four clusters according to the months of the year [11]. Iizaka et al. divided days of the year into spring and summer days [12]. Khotanzad et al. merely set holidays aside from other days of the year [13]. To forecast hourly peak loads, Kim et al. employed a Kohonen neural network model for clustering week days. Having classified the week into four clusters, namely Tuesday through Friday, Saturday, Sunday and Monday, they divided seasonal loads into three clusters (spring and fall, summer, and winter) using wavelet transforms. Eventually, they used a second order regression model to forecast the hourly peak load [14]. Mori and Yuihara used deterministic annealing for input data clustering and made forecasts only for the summer months (July, August and September) and workdays. Nevertheless, no tests or validations were presented to evaluate the performance of the achieved clusters [15]. Saini and Soni, for instance, divided the year into four clusters: rainy, dry, summer and winter months [17,18]. Amin-Naseri and Soroush classified the days of the year into five clusters by using a SOM. They then developed a FFNN for each data cluster and trained them to forecast peak loads. The results showed significant improvements due to the use of clustered data [19].

This paper proposes a self-organizing map (SOM) technique for classifying the daily peak loads of the year (the input data) into clusters that are most similar based on some criteria. Then, for each cluster of data, a feed forward neural network (FFNN) model is developed to forecast the daily peak load. To evaluate the effectiveness of the proposed models, the results obtained are compared with those obtained by using un-clustered data. In addition, a regression model is developed by using the clustered data for comparison. The results point out the superiority of the proposed hybrid model. In the following sections, we will analyze the input data in Section 2 and then proceed to develop the SOM neural network for clustering the data in Section 3. Forecasting models using the FFNN will be presented in Section 4. Finally, the summary and main conclusions of the paper are presented in Section 5.

2. Data exploration

The daily peak load data in this research were extracted from the Tehran Regional Electric Utility Company for a period of 4 years from March 21, 1999 to March 20, 2003, equivalent to 1461 days. The data concerning weather temperature were obtained from the Iranian Meteorological Institute. Fig. 1 depicts the dispersion of peak load over the days under study in three consecutive years. The peak load shows, as seen in Fig. 1, various trends through different periods in each year. With the beginning of spring and the Iranian New Year holidays (which last from March 21 to April 2), peak loads decrease, but as summer approaches and the weather gets warmer, electricity consumption rises.
Then, as the heat recedes, the peak load also dips, but it rises again as the weather gets colder. Thus, classifying days of the year based on daily peak loads can greatly enhance the accuracy of forecasting.

Fig. 1. Time series of daily electrical peak load.

3. Clustering by using SOM

Data classification in most research on short term load forecasting has been done intuitively, for historical data have mainly been treated as calendar date dependent only [3–5,10–13,16–18]. As a matter of fact, daily load characteristics for each time interval, even on the same week day, differ because of holidays, special days, etc. That is why the self-organizing map has been introduced to cluster the historical data, so that it could lead to higher forecasting accuracy [20].

Clustering consists of partitioning the data into a set of clusters Q_i, i = 1, ..., C. Optimal clustering can be defined as a partitioning that leads to minimal distances within clusters and maximal distances between them [21]. The SOM normally consists of a regular, usually two-dimensional (2-D), grid of map units. Each unit is represented by a prototype vector m_i = [m_i1, ..., m_id], in which d is the input vector dimension. A neighborhood relation connects each unit to the adjacent ones. The number of map units, which may range from a few dozen up to thousands, determines the accuracy and generalization power of the SOM. As the training progresses, the SOM makes an elastic net that folds onto the 'cloud' created by the input data, so that data points near one another in the input space are mapped onto nearby map units. Hence, the SOM may be seen as a topology-preserving mapping from the input space onto the two-dimensional grid of map units. Fig. 2 shows an example of a self-organizing map.

A self-organizing map is trained iteratively. First, a sample vector x is chosen randomly from the input data set at each training step. Then, the distances between x and all the prototype vectors are calculated. The best matching unit (BMU), here denoted by b, is the map unit with the prototype closest to x:

    \| x - m_b \| = \min_i \{ \| x - m_i \| \}    (1)

The prototype vectors are then updated: the BMU and its topological neighbors are shifted nearer to the input vector in the input space. In this paper, the dimension of the input vector d is assumed to be 2, consisting of two input elements, daily peak load and temperature. The number of map units, i, varies depending upon the number of clusters. There is no pre-defined rule to specify the number of clusters. However, Vesanto and Alhoniemi suggested that a range of 2 to \sqrt{N} clusters be examined, where N is the number of samples in the data set. The SOM algorithm used in this paper minimizes the following error function:

    E = \sum_{k=1}^{C} \sum_{x \in Q_k} h_{bk} \| x - m_k \|^2    (2)

in which C represents the number of map units. The neighborhood kernel h_{bk} is centered at unit b, which is the BMU of vector x, and evaluated for unit k.

To select the best one among different clusterings, a validity index is needed to evaluate the alternatives. The validity index used in this paper is the Davies–Bouldin index:

    \frac{1}{C} \sum_{k=1}^{C} \max_{l \neq k} \frac{S_c(Q_k) + S_c(Q_l)}{d_{ce}(Q_k, Q_l)}    (3)

in which S_c represents the within-cluster distance, d_ce the between-cluster distance and C the number of clusters. Since low values of the Davies–Bouldin index are a sign of good clustering results for spherical clusters, it can be appropriate for evaluating SOM clustering [21].
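Equation (3) is easy to compute directly; the sketch below implements it with Euclidean norms, using within-cluster mean distances for S_c and centroid distances for d_ce (one common reading of these quantities; the toy data are an assumption).

    # Minimal sketch: Davies-Bouldin validity index of Eq. (3).
    import numpy as np

    def davies_bouldin(X, labels):
        ids = np.unique(labels)
        cents = np.array([X[labels == k].mean(axis=0) for k in ids])
        S = np.array([np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
                      for i, k in enumerate(ids)])
        worst = []
        for i in range(len(ids)):
            ratios = [(S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                      for j in range(len(ids)) if j != i]
            worst.append(max(ratios))        # max over l != k
        return float(np.mean(worst))         # lower is better

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
    print(davies_bouldin(X, np.array([0] * 50 + [1] * 50)))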
3.1. Input data normalization

As seen in Fig. 1, the minimum and maximum load show an annual increase over the year before. Thus, the data for each year have to be clustered separately by means of the self-organizing map, for the load on any given day is higher than that of the corresponding day in the previous year; clustering each year separately prevents the same day of different years from falling into different clusters. Hence, the data for each year of the three year period are trained separately.

Because the peak load values are much greater than the temperature values, the effect of the latter may be overshadowed, leaving the peak load as the only factor influencing the SOM. Thus, the input data (both peak load and temperature) need to be normalized. In this research, standard normalization to zero mean and unit standard deviation has been adopted.

3.2. Network array pattern and distance norm

The neurons used for classifying input vectors may be arranged in different arrays. Thus, in order to determine the best classification, the input vectors have to be trained with various numbers of neurons and various arrays. In this paper, input vectors consisting of two elements (daily peak load and temperature at the time the peak occurs) have been trained with 2-12 neurons (clusters). The 17 arrays trained are (1,2), (1,3), (1,4), (2,2), (1,5), (2,3), (1,6), (1,7), (1,8), (2,4), (1,9), (3,3), (1,10), (2,5), (1,11), (1,12) and (2,6), in which the first element represents the number of neurons along the peak load characteristic and the second the number of neurons along the temperature characteristic. Furthermore, the Euclidean distance norm has been used to evaluate the within cluster and between cluster distances.

3.3. SOM training

The SOM has been trained by using a random order incremental training algorithm. All of the training vectors (or sequences) are presented once per epoch, each time in a different random order, and the network weight and bias values are updated after each individual presentation. Training stops once any of the following conditions is met: the maximum number of iterations (epochs) is reached, the performance is minimized to the goal, or the time limit is exceeded.

At each training step, the winning neuron is identified by the network first, and then the weights of the winning neuron, together with those of the neurons adjacent to it, are shifted nearer to the input vector. The weights of the winning neuron are changed in proportion to half the learning rate; the learning rate and the neighborhood distance then determine which neurons in the neighborhood of the winning neuron change during the two training phases [22]. The network in this research was trained for 1000 epochs. A sketch of the normalization and array search described above follows below.
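This sketch reuses the train_som and assign_clusters helpers from the previous sketch; raw_yearly_data is a hypothetical (days, 2) array of one year's peak loads and temperatures, and treating each (rows, cols) array as rows * cols units on a one-dimensional map is a simplification of the 2-D case.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def standardize(X):
    """Zero-mean, unit-standard-deviation scaling of each input element."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# The 17 array patterns examined in the paper.
arrays = [(1, 2), (1, 3), (1, 4), (2, 2), (1, 5), (2, 3), (1, 6), (1, 7), (1, 8),
          (2, 4), (1, 9), (3, 3), (1, 10), (2, 5), (1, 11), (1, 12), (2, 6)]

X = standardize(raw_yearly_data)      # raw_yearly_data: hypothetical placeholder
scores = {}
for rows, cols in arrays:
    m = train_som(X, n_units=rows * cols)
    labels = assign_clusters(X, m)
    if len(np.unique(labels)) > 1:    # the index needs at least two occupied clusters
        scores[(rows, cols)] = davies_bouldin_score(X, labels)

best = min(scores, key=scores.get)    # the lowest Davies-Bouldin index wins
print("best array:", best)
```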
3.4. Outputs of clustering

As already mentioned, the SOM neural network was trained for 17 different arrays using the random order incremental training algorithm and the two phases of ordering and tuning. The results of training each array determine the coordinates of the center of each neuron and the location of each input vector within the clusters. Then, in order to select the optimum clustering among the various arrays, the value of the Davies–Bouldin validity index was calculated using the Euclidean distance. The results clearly show which clustering is the best and most suitable one. Fig. 3 displays the Davies–Bouldin validity indices for each array.

Fig. 3. Davies–Bouldin index values for each array.

As seen in Fig. 3, the lowest Davies–Bouldin index belongs to the array with five neurons, and increasing the number of neurons brings no improvement in the results. Thus, the selected array is (1,5), i.e. in the optimum situation, the data are classified into one cluster according to the peak load and into five clusters according to the temperature. The output and final clustering results are given in Table 1.

Table 1
Output and final clustering results

Cluster 1: Saturdays through Wednesdays of the last two weeks in June, July, August and September; Thursdays of the last week of June, July and August, and of the first two weeks in September.
Cluster 2: Saturdays through Wednesdays of the last two weeks in May, the first two weeks in June and October, and the first week in November; Thursdays of the last two weeks in May, the first three weeks in June and the last two weeks in September; Fridays and holidays from June to September.
Cluster 3: The New Year holidays (the last two weeks in March); Saturdays through Wednesdays in April and the first two weeks in May; Thursdays of April, the first three weeks in May, October and the first week in November; Fridays and holidays of April, the first three weeks in May, October, and the first week in November.
Cluster 4: All Fridays and holidays from the second week in November to the second week in March.
Cluster 5: Saturdays through Thursdays of the last three weeks in November, December, January, February and the first three weeks in March.

In the following section, we show that using the clustered data significantly enhances the accuracy of the forecast.

4. Forecasting

Having completed the data clustering, we can now proceed to forecast the peak electrical load for the next day using feed forward neural networks (FFNN). Though very popular as general non-linear approximators, FFNNs may not be efficient for load forecasting because of the variety of the input data. Hence, the clustered input data (obtained by using the SOM neural network) are introduced to the FFNNs, and five FFNNs are designed to deal with the five clusters of data. Data from a three year period were used as the training set, and data covering a one year period were used as the test set. In the following five subsections, we proceed to the design of the FFNN architecture. A minimal sketch of this per-cluster arrangement follows below.
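In the sketch, scikit-learn's MLPRegressor stands in for the paper's MATLAB networks; X_train, y_train, cluster_train, X_test and cluster_test are hypothetical arrays holding the input features, next-day peak load targets and SOM cluster labels.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

hidden = {0: 36, 1: 10, 2: 31, 3: 17, 4: 2}   # per-cluster sizes (see Section 4.3)
models = {}
for c in range(5):                            # one FFNN per SOM cluster
    mask = cluster_train == c
    net = MLPRegressor(hidden_layer_sizes=(hidden[c],),
                       activation="logistic",  # sigmoid hidden layer (Section 4.3)
                       solver="lbfgs", max_iter=2000, random_state=0)
    models[c] = net.fit(X_train[mask], y_train[mask])

# Each test day is forecast by the network of the cluster it belongs to.
y_pred = np.array([models[c].predict(x.reshape(1, -1))[0]
                   for x, c in zip(X_test, cluster_test)])
```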
4.1. Input data for each cluster

The factors that may influence the daily electrical peak load on the various day types of the year (New Year holidays, Saturdays through Wednesdays, Thursdays, and Fridays and holidays) have been considered as follows:

New Year holidays of cluster 3: the day of the month, year number, temperature and cloud cover for that particular day; peak load, temperature and cloud cover for the similar day one and two days before.

Fridays and holidays of cluster 2: month number, year number, temperature and wind speed for that particular day; peak load, temperature and wind speed for the previous week and two weeks before.

Fridays and holidays of cluster 3: month number, year number, temperature and cloud cover for that particular day; peak load, temperature and cloud cover for the previous week and two weeks before.

Fridays and holidays of cluster 4: month number, year number, temperature and cloud cover for that particular day; peak load, temperature and cloud cover for the previous week and two weeks before.

Thursdays of cluster 1: month number, year number, temperature and relative humidity for that particular day; peak load, temperature and relative humidity for the previous week and two weeks before.

Thursdays of cluster 2: month number, year number, temperature and wind speed for that particular day; peak load, temperature and wind speed for the previous week and two weeks before.

Thursdays of cluster 3: month number, year number, temperature and cloud cover for that particular day; peak load, temperature and cloud cover for the previous week and two weeks before.

Thursdays of cluster 5: month number, year number, temperature and cloud cover for that particular day; peak load, temperature and cloud cover for the previous week and two weeks before.

Saturdays through Wednesdays of cluster 1: month number, year number, temperature and relative humidity for that particular day; peak load, temperature and relative humidity for the previous day and the previous week.

Saturdays through Wednesdays of cluster 2: month number, year number, temperature and wind speed for that particular day; peak load, temperature and wind speed for the previous day and the previous week.

Saturdays through Wednesdays of cluster 3: month number, year number, temperature and cloud cover for that particular day; peak load, temperature and cloud cover for the previous day and the previous week.

Saturdays through Wednesdays of cluster 5: month number, year number, temperature and cloud cover for that particular day; peak load, temperature and cloud cover for the previous day and the previous week.

4.2. Data pre-processing

By performing certain pre-processing steps on the network inputs and targets, we can make network training more efficient. Usually, it is best to scale the inputs and targets so that they fall within a specific range; this is especially beneficial when sigmoid functions are used in the hidden layers. In cases where the input vector dimension is large, the vector components may be highly correlated, and it is then quite helpful to reduce the dimension of the input vectors.

Principal component analysis (PCA) is an effective technique for performing this task. Using PCA will (a) orthogonalize the input vector components so that they become uncorrelated, (b) order the resulting orthogonal components (the principal components), putting those with the largest variation first, and (c) eliminate the components that contribute least to the variation in the data set. The input data have been normalized to zero mean and unit variance [22]. In performing the PCA, the contribution threshold was set to 1%, so that PCA eliminated the principal components contributing less than 1% to the total variation of the data set.

Using PCA reduced the variables of clusters 1 and 4 from 10 to 8 components, and the variables of clusters 2 and 5 from 10 to 9 components. The variables of cluster 3, however, remained unchanged at 10. A sketch of this step appears below.
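A minimal sketch of the 1% rule using scikit-learn's PCA; X_cluster is a hypothetical standardized input matrix (zero mean, unit variance) for one cluster.

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_cluster)                    # fit all principal components
keep = pca.explained_variance_ratio_ > 0.01   # drop components contributing < 1%
X_reduced = pca.transform(X_cluster)[:, keep] # components are already variance-ordered
print("kept", int(keep.sum()), "of", X_cluster.shape[1], "components")
```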
4.3. Number of layers, neurons and transfer functions

Depending on the nature of the problem, many transfer functions can be applied to the different layers of the network. Using sigmoid and hyperbolic tangent transfer functions in the hidden layers keeps the output of their neurons within the ranges [0,1] and [-1,1], respectively, thus safeguarding the training process from failure by preventing very large values that could disrupt the network.

In this research, a two layer network with one hidden layer has been developed. To find the best number of neurons in the hidden layer, a range of 1-50 neurons was examined; the results showed that the best number is 36 for cluster 1, 10 for cluster 2, 31 for cluster 3, 17 for cluster 4 and 2 for cluster 5. There is a single neuron in the output layer, providing the one day ahead peak load forecast. Also, because the outputs are positive, a logistic sigmoid transfer function has been used in the hidden layer.

4.4. Network training algorithms

Training is implemented by modifying the connection weights in an orderly manner using a suitable method, usually called a training algorithm (or learning equation). When an input is presented to the network, the learning equation attempts to adjust the weights so that the desired output is produced. In this research, having examined several learning equations, the best ones have been used to train the proposed FFNNs [22,23]: Levenberg-Marquardt (LM), Broyden-Fletcher-Goldfarb-Shanno (BFGS), one step secant (OSS), resilient back propagation (RP), scaled conjugate gradient (SCG), conjugate gradient Fletcher-Reeves (CGF), conjugate gradient Polak-Ribiere (CGP) and conjugate gradient Powell-Beale (CGB). The results show that each of these algorithms produces the best results in some clusters. Specifically, the LM algorithm gives the best results in clusters 2 and 5, the SCG algorithm in cluster 1, the BFGS algorithm in cluster 3 and the CGB algorithm in cluster 4. A sketch of the hidden-size search with a quasi-Newton solver follows below.
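scikit-learn does not provide Levenberg-Marquardt, so the quasi-Newton 'lbfgs' solver (of the BFGS family) stands in for the paper's per-cluster algorithm choice; X_tr, y_tr, X_val and y_val are hypothetical train/validation splits.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

best_size, best_mae = None, np.inf
for size in range(1, 51):                  # the paper examines 1-50 hidden neurons
    net = MLPRegressor(hidden_layer_sizes=(size,),
                       activation="logistic",   # sigmoid hidden layer
                       solver="lbfgs", max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    mae = np.mean(np.abs(net.predict(X_val) - y_val))
    if mae < best_mae:
        best_size, best_mae = size, mae
print("best hidden size:", best_size)
```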
4.5. Performance criteria

The performance criteria are all estimates of the prediction error, that is, of the differences between the artificial neural network output values and the known target values over all patterns. The best model, of course, is the one whose estimate has the least error. Many performance criteria have been used in the literature. In this research, the mean absolute error (MAE), the mean absolute percentage error (MAPE), the mean squared error (MSE), the maximum absolute error (MAXAE) and the maximum absolute percentage error (MAXAPE) have been used over the training set and the test set. Furthermore, R, the correlation index between the targets and the outputs, has been used [9]. A sketch computing these criteria appears at the end of this section.

4.6. Forecasting results

As mentioned, forecasting the peak load for one day ahead was performed by developing five feed forward neural networks that use the five clustered data sets as their inputs. To evaluate the effectiveness of clustering by using the SOM, another FFNN was developed that uses the un-clustered data for the PLF. The overall results obtained by using the proposed hybrid model (PHM) and by using the FFNN with un-clustered data are shown in Table 2. As can be seen in the table, using only a feed forward neural network with un-clustered data (FFNN) produces a mean absolute percentage error (MAPE) of 2.02, whereas the MAPE decreases to 1.83 when using the clustered data obtained through the proposed hybrid model (PHM). Considering the other performance criteria (i.e. MAE, MSE and MAXAE) also proves that using the PHM significantly improves the forecasting results (except for the MAXAPE, which differs only slightly).

Table 2
The effectiveness of the proposed hybrid model

Performance criteria   PHM     FFNN
MAE                    71.0    79.8
MAPE                   1.83    2.02
MSE                    8422    11771
MAXAPE                 10.53   10.15
MAXAE                  307.7   393.8

To evaluate the accuracy of the forecasting results obtained by using the PHM, linear regression models have also been used for PLF and the results compared. To do so, five linear regression models that use the five clustered data sets (we call them hybrid linear regression models, HLR) have been developed. Table 3 shows the forecasts obtained by using the PHM and the HLR for each of the five clusters created by using the SOM. Comparing the forecasting results, we see that the proposed hybrid model produces better forecasts in most clusters. In clusters 2, 3, 4 and 5, the forecasts made by the PHM show significant superiority in all performance criteria over the HLR. In cluster 1, which pertains to the hot days of the year, the hybrid regression models present relatively good forecasts, but even in these cases, the PHM results in better forecasts. The value of R, the correlation between the target and the output, is particularly low for the regression model in cluster 4.

Table 3
Comparison of results obtained from the proposed hybrid model (PHM) and the hybrid linear regression model (HLR)

Cluster    Performance criteria   PHM      HLR
Cluster 1  MAE                    64.9     66.9
           MAPE                   1.5      1.56
           MSE                    6812     7950
           R                      0.972    0.967
           MAXAPE                 5.33     6.06
           MAXAE                  224.0    254.8
Cluster 2  MAE                    65.7     81.9
           MAPE                   1.68     2.13
           MSE                    7569     11232
           R                      0.961    0.94
           MAXAPE                 8.16     11.42
           MAXAE                  259.7    363.3
Cluster 3  MAE                    88.6     113.7
           MAPE                   2.61     3.51
           MSE                    13332    21118
           R                      0.948    0.914
           MAXAPE                 10.53    11.63
           MAXAE                  307.7    313.1
Cluster 4  MAE                    64.3     72.3
           MAPE                   1.81     2.04
           MSE                    6032     9188
           R                      0.924    0.856
           MAXAPE                 4.90     5.90
           MAXAE                  192.9    232.4
Cluster 5  MAE                    67.9     83.8
           MAPE                   1.62     2.01
           MSE                    6987     10024
           R                      0.959    0.941
           MAXAPE                 5.87     6.02
           MAXAE                  239.4    245.6

Furthermore, to observe the effect of the proposed clustering model on the performance of regression models, a general linear regression model that uses only the un-clustered data (LR) has also been developed. The overall forecasting performances of these models are shown in Table 4. As can be seen from the table, the MAPE of the linear regression model with un-clustered data is 2.54, while using the clustered data (HLR) reduces the forecasting error to 2.25. The same holds for all the other criteria, i.e. MAE, MSE and MAXAE, indicating that the proposed clustering model using the SOM improves regression analysis as well.

Table 4
The effect of clustering on the regression models

Performance criteria   HLR     LR
MAE                    84.6    97.3
MAPE                   2.25    2.54
MSE                    12132   16207
MAXAPE                 11.63   13.86
MAXAE                  363.3   537.9
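The criteria of Section 4.5 can be computed directly with NumPy, as in the following sketch; y_true and y_pred are hypothetical arrays of targets and forecasts, and strictly positive loads are assumed for the percentage errors.

```python
import numpy as np

def performance(y_true, y_pred):
    """MAE, MAPE, MSE, MAXAE, MAXAPE and correlation index R (Section 4.5)."""
    err = y_pred - y_true
    pct = 100.0 * np.abs(err) / y_true        # assumes strictly positive loads
    r = np.corrcoef(y_true, y_pred)[0, 1]     # correlation between target and output
    return {"MAE": np.mean(np.abs(err)),
            "MAPE": np.mean(pct),
            "MSE": np.mean(err ** 2),
            "MAXAE": np.max(np.abs(err)),
            "MAXAPE": np.max(pct),
            "R": r}
```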
5. Conclusion

In this paper, we have developed a hybrid neural network model for daily peak load forecasting. A novel approach has been proposed for clustering the input data, using a self-organizing map to classify the days of the year based upon two factors: the peak load and the temperature at the time of the peak load. The Davies–Bouldin validity index was used to determine the best clustering, which ultimately led to five clusters. Peak load forecasting was then performed by developing five feed forward neural networks, one for each of the five clusters. Several training algorithms, including LM, SCG, BFGS and CGB, were tested and the best one selected for each cluster. The application of principal component analysis (PCA) reduced the dimensions of the networks' inputs and yielded simpler architectures.

To evaluate the effectiveness of the proposed hybrid model, forecasting was also performed using a single FFNN with un-clustered data. The results proved the superiority and effectiveness of the proposed hybrid model in all performance criteria. To evaluate the accuracy of the forecasting results obtained by using the PHM, five linear regression models that use the five clustered data sets (HLR) were developed; comparing the forecasting results proved the significant superiority of the PHM over the HLR in all performance criteria. Furthermore, to observe the effect of the proposed clustering model on the performance of regression models, a general linear regression model using only the un-clustered data was also developed. The overall forecasting performances of these models indicate that the proposed clustering model using a SOM significantly improves the forecasting results of regression analysis too.

Acknowledgements

The authors wish to thank the Tehran Regional Electric Utility Company, especially Mr. Nasiri, for supplying the data required for this research.

References

[1] Tzafestas S, Tzafestas E. Computational intelligence techniques for short-term electric load forecasting. J Intell Robotic Syst 2001;31:7-68.
[2] Metaxiotis K, Kagiannas A, Askounis D, Psarras J. Artificial intelligence in short term electric load forecasting: a state-of-the-art survey for the researcher. Energy Convers Manage 2003;44:1525-34.
[3] Amjady N. Short-term hourly load forecasting using time-series modeling with peak load estimation capability. IEEE Trans Power Syst 2001;16(4):798-805.
[4] Petridis V, Kehagias A, Petrou L, Bakirtzis A, Kiartzis S, Panagiotou H. A Bayesian multiple models combination method for time series prediction. J Intell Robotic Syst 2001;31:69-89.
[5] Zivanovic R. Local regression based short-term load forecasting. J Intell Robotic Syst 2001;31:115-27.
[6] Carpinteiro OAS, Alves Da Silva AP. A hierarchical neural model in short-term load forecasting. Appl Soft Comput 2004;4:405-12.
[7] Carpinteiro OAS, Alves Da Silva AP. A hierarchical self-organizing map model in short-term load forecasting. J Intell Robotic Syst 2001;31(1-3):105-13.
[8] Carpinteiro OAS, Alves Da Silva AP. A hierarchical neural model in short-term load forecasting. IEEE; 2000. p. 120-4.
[9] Hippert HS, Pedreira CE, Souza RC. Neural networks for short-term load forecasting: a review and evaluation. IEEE Trans Power Syst 2001;16(1):44-55.
[10] Hsu CC, Chen CY. Regional load forecasting in Taiwan - applications of artificial neural networks. Energy Convers Manage 2003;44:1941-9.
[11] Huang HG, Hwang RC, Hsieh JG. A new artificial intelligent peak power load forecaster based on non-fixed neural networks. Electr Power Energy Syst 2002;24:245-50.
[12] Iizaka T, Matsui T, Fukuyama Y. A novel daily peak load forecasting method using analyzable structured neural network. IEEE T&D Asia, Yokohama; 2002. p. 1-6.
[13] Khotanzad A, Afkhami-Rohani R, Maratukulam D. ANNSTLF - artificial neural network short-term load forecaster - generation three. IEEE Trans Power Syst 1998;13(4):1413-22.
[14] Kim C, Yu I, Song YH. Kohonen neural network and wavelet transform based approach to short-term load forecasting. Electr Power Energy Syst 2002;63:169-76.
[15] Mori H, Yuihara A. Deterministic annealing clustering for ANN-based short-term load forecasting. IEEE Trans Power Syst 2001;16(3):545-51.
[16] Murto P. Neural network models for short-term load forecasting. Thesis, Helsinki University of Technology, Department of Engineering Physics and Mathematics; 1998.
[17] Saini LM, Soni MK. Artificial neural network based peak load forecasting using conjugate gradient methods. IEEE Trans Power Syst 2002;17(3):907-12.
[18] Saini LM, Soni MK. Artificial neural network based peak load forecasting using Levenberg-Marquardt and quasi-Newton methods. IEE Proc-Gener Transm Distrib 2002.
[19] Amin-Naseri MR, Soroush AR. A hybrid neural network model for daily peak load forecasting using a novel clustering approach. In: IASTED international conference on artificial intelligence and soft computing; 2006. p. 104-9.
[20] Oja M, Kaski S, Kohonen T. Bibliography of self-organizing map (SOM) papers: 1998-2001 addendum. Neural Comput Surveys 2002;3:1-156.
[21] Vesanto J, Alhoniemi E. Clustering of the self-organizing map. IEEE Trans Neural Networks 2000;11(3):586-600.
[22] Demuth H, Beale M. Neural network toolbox for use with MATLAB. Version 4. The MathWorks, Inc. (CD-ROM); 2002.
[23] Hagan MT, Demuth H, Beale M. Neural network design. 1st ed. Boston: PWS Publishing Company; 1996.