Mining Business Topics in Source Code using Latent Dirichlet Allocation

Girish Maskeri, Santonu Sarkar
SETLabs, Infosys Technologies Limited
Bangalore 560100, India
girish_rama@, santonu_sarkar@

Kenneth Heafield
California Institute of Technology
CA, USA
kpu@kheafi

ABSTRACT
One of the difficulties in maintaining a large software system is the absence of documented business domain topics and of a correlation between these domain topics and the source code. Without such a correlation, people without any prior application knowledge would find it hard to comprehend the functionality of the system. Latent Dirichlet Allocation (LDA), a statistical model, has emerged as a popular technique for discovering topics in a large text document corpus. But its applicability in extracting business domain topics from source code has not been explored so far. This paper investigates LDA in the context of comprehending large software systems and proposes a human-assisted approach based on LDA for extracting domain topics from source code. This method has been applied on a number of open source and proprietary systems. Preliminary results indicate that LDA is able to identify some of the domain topics and is a satisfactory starting point for further manual refinement of topics.

Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—Restructuring, reverse engineering, and reengineering

General terms
Theory, Algorithms, Experimentation

Keywords
Maintenance, Program comprehension, LDA

1. INTRODUCTION
Large legacy software systems often exist in a state of disorganization with poor or no documentation. Adding new features and fixing bugs in such a system is highly error prone and time consuming, since the original authors of the system are generally no longer available. Moreover, the people maintaining the code-base do not comprehend the functional purpose of different program elements (functions, files, classes, data structures etc.) and the roles they play in fulfilling the various functional services offered by the system.

When a software system is small, one can understand its functional architecture by manually browsing the source code. For large systems, practitioners often rely on program analysis techniques such as call graphs, control and data flow, and slicing [2]. Though this is very helpful for understanding the structural intricacies of the system, it helps little in comprehending the functional intent of the system. The reason is not difficult to understand. Structural analysis techniques work on the structural information derived from a set of program source code. The structural information is at a very low level of granularity, such as files, functions, data structures, variable usage dependencies, function calls and so on. This information hardly reveals any underlying functional purpose. For a large system, this information becomes overwhelmingly large for a maintainer to manually derive any functional intent out of it. Moreover, for a system with millions of lines of code, the memory requirement to hold the structural information and perform
various analyses often becomes a bottleneck.

An important step in comprehending the functional intent of the system (or the intended functional architecture) is to identify the business topics that exist in the system, around which the high level components (or modules) have been implemented. For example, consider a large banking application that deals with customers, bank accounts, credit cards, interest and so on. A maintainer without any application knowledge will find it difficult to add new interest calculation logic. However, if it is possible to extract business topics such as "customer" and "interest" from the source code and establish an association between "interest" and various program elements, it would be of immense help to the maintainer in identifying the related functions, files, classes and data structures and carrying out the required changes for interest calculation. This in turn can make a novice programmer much more productive in maintaining a system, especially when the software is in the order of millions of lines of code with little documentation.

A plausible technique for identifying such topics is to derive semantic information by extracting and analyzing various important "keywords" from the source code text [13]. It is often the case that the original authors of the code leave hints to the meaning of program elements in the form of keywords in files, functions, data type names and so on. For
importance of lin-guistic information such as identifier names and comments in program comprehension.For instance,Biggerstaffet al.[5]have suggested assignment of domain concepts as an ap-proach to program comprehension.Tonella et al.[8]have proposed function names and signatures to obtain domain specific information.Anquetil et al.[3]have suggested that the information obtained from the name offile often carry the functional intent of the source code specified in thefile. Wilde et al.[24]have also suggested usage of linguistic in-formation to identify the functional intent of the system. Since then,linguistic information has been used in various program analysis and maintenance tasks such as traceabil-ity between external documentation and source code[4,16], feature location1[17,21,25],identifying high level concep-tual clones[15]and so on.Recently,linguistic information has also been used to iden-tify topics in source code and subsequently used for software clustering[13]and software categorization[11].Kuhn et al.[13]have used Latent Semantic Analysis(LSA) [9]based approach for identifying topics in source code by 1Feature location is sometimes referred to as concept loca-tion semantically clustering software artifacts such as methods,files or packages based on identifier names and comments. Our approach differs from that of Kuhn et al.in two ways. Firstly,and most importantly,our interpretation of a“topic”is different from that of Kuhn.Kuhn interprets semanti-cally clustered software artifacts(like methods,files etc.)as topics whereas we interpret a set of semantically related lin-guistic terms derived from identifier names and comments as a“topic”.Another important difference is in the ap-proach used for semantic clustering.While Kuhn et al.have adopted LSA to cluster a set of meaningful software arti-facts,our approach of clustering linguistic terms is based on the Latent Dirichlet Allocation.Kawaguchi et al.[11]uses linguistic information in source code for automatically identifying categories and categoriz-ing open source repositories.A cluster of related identifiers is considered as a“category”.In our case,we consider a cluster of terms derived from identifiers as a“topic”;thus a topic can certainly be considered synonymous to a“cat-egory”.However,our approach of clustering semantically related identifier terms differs from the approach suggested by Kawaguchi et al.[11].Kawaguchi et al.first uses LSA to derive pairwise similarity between the terms and then apply a clustering algorithm to cluster similar terms together.The LDA based approach we have adopted alleviates the need of having two steps.Since LDA is essentially a topic modeling technique,it not only discovers similarity between terms,it also creates a cluster of similar terms to form a topic.In the rest of this section,we provide a brief description of LDA and its use in extracting topics from text documents.2.1LDALatent Dirichlet Allocation(LDA)[7]is a statistical model, specifically a topic model,originally used in the area of nat-ural langauge processing for representing text documents. 
The basic idea of LDA is that a document can be considered as a mixture of a limited number of topics, and each meaningful word in the document can be associated with one of these topics. Given a corpus of documents, LDA attempts to discover the following:

• It identifies a set of topics
• It associates a set of words with a topic
• It defines a specific mixture of these topics for each document in the corpus.

LDA has been applied to extract topics from text documents. For instance, Newman et al. [19] applied LDA to derive 400 topics such as "September 11 attacks", "Harry Potter", "Basketball" and "Holidays" from a corpus of 330,000 New York Times news articles and represent each news article as a mixture of these topics. LDA has also been applied for identification of topics in a number of different areas. For instance, LDA has been used to find scientific topics from abstracts of papers published in the Proceedings of the National Academy of Sciences [10]. McCallum et al. [18] have proposed LDA to extract topics from social networks and applied it to a collection of 250,000 Enron emails. A variation on LDA has also been used by Steyvers et al. [22] to analyze 160,000 abstracts from the "CiteSeer" computer science collection. Recently, Zheng et al. [6] have applied LDA to obtain various biological concepts from a protein-related corpus. These applications of LDA seem to indicate that the technique can be effective in identifying latent topics and summarizing large corpora of text documents.

2.1.1 LDA Model
For the sake of completeness, we briefly introduce the LDA model. A thorough and complete description of the LDA model can be found in [7]. The vocabulary for describing the LDA model is as follows:

word: A word is a basic unit defined to be an item from a vocabulary of size W.
document: A document is a sequence of N words denoted by d = (w_1, ..., w_N), where w_n is the n-th word in the sequence.
corpus: A corpus is a collection of M documents denoted by D = {d_1, ..., d_M}.

In statistical natural language processing, it is common to model each document d as a multinomial distribution θ^(d) over T topics, and each topic z_j, j = 1...T, as a multinomial distribution φ^(j) over the set of words W. In order to discover the set of topics used and the distribution of these topics in each document in a corpus of documents D, we need to obtain an estimate of φ and θ. Blei et al. [7] have shown that the existing techniques for estimating φ and θ are slow to converge, and propose a new model: LDA. The LDA-based model assumes a prior Dirichlet distribution on θ, thus allowing the estimation of φ without requiring the estimation of θ.

LDA assumes the following generative process for creating a document [7]:

1. Choose N ∼ Poisson(ξ): select the number of words N.
2. Choose θ ∼ Dir(α): select θ from the Dirichlet distribution parameterized by α.
3. For each w_n ∈ w do
   (a) Choose a topic z_n ∼ Multinomial(θ)
   (b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability φ^(z_n)

In this model, the various distributions, namely the set of topics, the topic distribution for each of the documents, and the word probabilities for each of the topics, are in general intractable for exact inference [7]. Hence a wide variety of approximate algorithms are considered for LDA. These algorithms attempt to maximize the likelihood of the corpus given the model. A few algorithms have been proposed for fitting the LDA model to a text corpus, such as variational Bayes [7], expectation propagation [14], and Gibbs sampling [10].
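To make the generative process above concrete, the following is a minimal sketch that follows steps 1 to 3 literally to sample a synthetic document. It is an illustration only, not the authors' tool: the topic count T, vocabulary size W, and the values of α, β and ξ are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

T, W = 3, 10                       # number of topics, vocabulary size (illustrative)
alpha, beta, xi = 0.1, 0.01, 8     # Dirichlet and Poisson parameters (illustrative)

# One multinomial phi^(j) over the W words per topic, drawn from Dir(beta).
phi = rng.dirichlet([beta] * W, size=T)

def generate_document():
    n = rng.poisson(xi)                         # 1. N ~ Poisson(xi)
    theta = rng.dirichlet([alpha] * T)          # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(n):                          # 3. for each word position
        z = rng.choice(T, p=theta)              # 3a. z_n ~ Multinomial(theta)
        words.append(rng.choice(W, p=phi[z]))   # 3b. w_n ~ phi^(z_n)
    return words

print(generate_document())  # a list of word indices, e.g. [7, 7, 2, 7, 0]
```

Fitting LDA runs this story in reverse: given only the observed words, an inference algorithm such as the Gibbs sampler infers plausible values of φ and θ.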
3. APPLYING LDA TO SOURCE CODE
Given that LDA has been successfully applied to large corpora of text data (as discussed in Section 2.1), it is interesting to explore (i) how applicable it is in the context of source code and (ii) how effective the technique is in identifying business topics in a large software system. To apply LDA to source code, we consider a software system to be a collection of source code files, and the software system is associated with a set of business domain concepts (or topics). For instance, the Apache web server implements functionality associated with http-proxy, authentication, server, caching, and so on. Similarly, a database server like PostgreSQL implements functionality related to storage management. Moreover, there exists a many-to-many relationship between these topics (like authentication and storage management) and the source code files that implement them. Thus a source code file can be thought of as a mixture of these domain topics.

Applying LDA to source code now reduces to mapping the source code entities of a software system to the LDA model, as described in Table 1.

word: We define domain-specific keywords, extracted from the names of program elements such as functions, files, and data structures, and from comments, to be the vocabulary set with cardinality V. A word w is an item from this vocabulary.
document: A source code file becomes a document in LDA parlance. For our purpose, we represent a document f_d = (w_1, w_2, ..., w_N) as a sequence of N domain-specific keywords.
corpus: The software system S = {f_1, f_2, ..., f_M}, having M source code files, forms the corpus.

Table 1: Mapping LDA to Source Code

Given this mapping, application of LDA to a source code corpus is not difficult. Given a software system consisting of a set of source code files, domain-related words are extracted from each of the source code files. Using these, a source code file-word matrix is constructed, where source code files form the rows, domain words form the columns, and each cell represents the weighted occurrence of the word (representing the column) in the source code file (representing the row). This source code file-word matrix is provided as input to LDA. The result of LDA is a set of topics and a distribution of these topics in each source code file. A topic is a collection of domain words along with the importance of each word to the topic, represented as a numeric fraction.
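As a minimal illustration of this pipeline (not our tool, which Section 4 describes), such a file-word matrix can be handed to an off-the-shelf LDA implementation such as scikit-learn's; the files, keywords, and cell values below are invented:

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    # Rows are source code files, columns are domain keywords; each cell
    # holds the weighted occurrence of that keyword in that file (invented).
    keywords = ["order", "user", "ssl", "cache", "log", "status"]
    files = ["OrderDetails.java", "ssl_engine.c", "cache_mgr.c"]
    wd = np.array([
        [7, 2, 0, 0, 1, 3],
        [0, 1, 9, 0, 2, 0],
        [0, 0, 1, 8, 2, 0],
    ])

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    theta = lda.fit_transform(wd)   # topic distribution per file
    # Normalize pseudo-counts to get each topic's word distribution phi.
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

    for f, dist in zip(files, theta):
        print(f, dist.round(2))     # each file as a mixture of topics
    for j in range(phi.shape[0]):
        top = np.argsort(phi[j])[::-1][:3]
        print(f"topic {j}:", [(keywords[i], round(phi[j][i], 3)) for i in top])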
4. IMPLEMENTATION
We have implemented a tool for static analysis of code. Figure 1 shows the part of this tool that specifically deals with topic extraction, topic location identification [5, 24], visualization of the topic distribution, and a modularity analysis based on domain topics.

[Figure 1: Tool block diagram]

The main input of LDA-based topic extraction is a document-word matrix wd[w, f_d] = η, where η is a value indicating the importance of the word w in the file f_d. We will shortly describe an approach to compute η based on the number and the place of occurrences of w in f_d. Our current implementation uses the Gibbs sampling method [10], which uses a Markov chain Monte Carlo method to converge to the target distributions in an iterative manner. A detailed description of this method is not in the scope of this paper.

Input Parameters.
Our tool, based on the LDA approach, takes the two parameters α and β (as described in Section 2.1.1) and creates the distribution φ of words over topics and the distribution θ of topics over documents. In addition to the parameters α and β, the tool also requires the number of iterations. Recall that LDA defines a topic as a probability distribution over all the terms in the corpus. We have defined a user-specified cut-off value Ψ which is used to discern the most important words for each topic.
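The role of the cut-off Ψ can be sketched as follows: given a topic's word distribution φ, only words whose probability meets or exceeds Ψ are reported for that topic. The Ψ value below is assumed for illustration (our experiments in Section 6 use Ψ = 0.001), and the probabilities are patterned on the SSL topic of Section 5:

    PSI = 0.01  # user-specified cut-off (assumed value for illustration)

    # Probability of each word under one topic (illustrative numbers).
    phi_j = {"ssl": 0.373722, "expr": 0.042501, "init": 0.033207,
             "engine": 0.026447, "lookup": 0.012083, "ca": 0.009548}

    # Keep only the most important words for the topic.
    top_words = sorted(((w, p) for w, p in phi_j.items() if p >= PSI),
                       key=lambda wp: -wp[1])
    print(top_words)  # "ca" (0.009548) falls below PSI and is dropped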
4.1 Keyword Extraction
In order to create a document-word matrix for LDA, it is extremely important to find meaningful domain-related words in the source code. Approaches that apply LDA to natural language documents consider every word in the document for this purpose. However, unlike a plain text document, a source code file is a structured document, and it is not appropriate to assume that every word in the file is domain-related. Firstly, a large percentage of words in source code files constitute programming language syntax, such as for, if, class, return, while, and so on. Moreover, domain keywords are often embedded inside identifier names as subwords, and identifier names need to be split appropriately to extract the relevant subwords.

Given this observation, we propose the following steps to extract meaningful keywords from source code files:

1. Fact extraction.
2. Splitting of identifiers and comments into keywords.
3. Stemming the keywords to their respective common roots.
4. Filtering of keywords to eliminate keywords that do not indicate any business concept.

Fact Extraction.
Fact extraction is the process of parsing source code documents and extracting metadata of interest such as files, functions, function dependencies, data structures, etc. We have used Source Navigator [1] for extracting facts from source code.

Identifier Splitting.
Splitting identifiers into meaningful subwords is essential because, unlike a natural language text where each word is independent and can be found in a dictionary, source code identifier names are generally not independent words but sequences of meaningful chunks of letters and acronyms, delimited by some character or through some naming convention. For instance, the function named "add_auth_info" in the httpd-2.0.53 source code consists of three meaningful chunks of letters, "add", "auth", and "info", delimited by "_". The term we use for such a meaningful chunk of letters is "keyword". Each identifier is split into a set of keywords. In order to perform this splitting, it is essential to know how the identifier names are delimited. There are a number of schemes, such as underscore, hyphen, or capitalization of specific letters (camel case), as in "getLoanDetails". We have implemented a regular expression based identifier splitting program in Perl.

Stemming.
Keywords in source code are used both in singular and in plural form, for example "loan" and "loans". In our analysis it is not necessary to consider these as two different keywords, so we unify all such keywords by stemming them to their common root. It is also a common practice [23, 12] to stem terms in order to improve the results of analysis. We have used Porter's stemming algorithm [20] to stem all words to a common root.

Filtering.
Not all keywords are indicators of topics in the domain. For instance, keywords such as "get" and "set" are very generic, and a stop-words list is employed to filter out such terms.
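Our splitter is implemented in Perl; purely as an illustration of steps 2-4, the following Python sketch splits on delimiters and camel case, stems with Porter's algorithm (via NLTK here), and filters against a stop-words list. The stop-word set shown is a small illustrative subset:

    import re
    from nltk.stem import PorterStemmer   # Porter's stemming algorithm [20]

    STOP_WORDS = {"get", "set", "for", "if", "class", "return", "while"}
    stemmer = PorterStemmer()

    def split_identifier(name):
        """Split an identifier into keywords on delimiters and camel case."""
        words = []
        for part in re.split(r"[_\-]+", name):  # add_auth_info -> add, auth, info
            # getLoanDetails -> get, Loan, Details
            words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
        return [w.lower() for w in words]

    def extract_keywords(identifier):
        """Split, stem, and filter one identifier into domain keywords."""
        return [stemmer.stem(w) for w in split_identifier(identifier)
                if w not in STOP_WORDS]

    print(extract_keywords("add_auth_info"))   # ['add', 'auth', 'info']
    print(extract_keywords("getLoanDetails"))  # ['loan', 'detail']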
4.2 File Keyword Mapping
Having identified a set of unique keywords for a given software system, we now compute the wd matrix. For this purpose, we compute a weighted sum of the number of occurrences of a word w in a file f_d as follows:

1. We define a heuristic weightage λ : {lt} → ℕ that assigns a user-defined positive integer to a "location type" lt. A location type lt denotes the structural part of a source code file, such as the file name, a function name, formal parameter names, a comment, a data structure name, and so on, from which a keyword has been obtained. The weightage given to a word obtained from a function name may differ from that given to a word obtained from a data structure name.

2. The importance factor wd[w, f_d] for a word w occurring in the file f_d is computed as the weighted sum of the frequencies of occurrence of w over each location type lt_i in the source code file f_d. That is,

   wd[w, f_d] = Σ_{lt_i} λ(lt_i) × ν(w, f_d, lt_i)

   where ν(w, f_d, lt_i) denotes the frequency of occurrence of the word w in the location type lt_i of the source file f_d.

In order to illustrate the calculation of the importance factor wd, consider the following code snippet from the file OrderDetails.java:

public class OrderDetails implements java.io.Serializable {
    private String orderId;
    private String userId;
    private String orderDate;
    private float orderValue;
    private String orderStatus;

    public String getOrderStatus() {
        return (orderStatus);
    }
    ...
}

As discussed in Section 4.1, identifier names are split to obtain meaningful domain words, and an importance factor is calculated for each of the words. One such word extracted from the above code snippet is "Order", which occurs in the names of different types of identifiers: in the class name, in attribute names, and in a method name. These different types of sources for words constitute our set of location types lt. Generally, in an object-oriented system, classes represent domain objects and their names are more likely to yield domain words that are important for that class. Hence λ(class) is generally assigned a higher value by domain experts than λ(attribute). Let us assume that in this particular example λ(class) equals 2, λ(attribute) equals 1, and λ(method) equals 1. The importance factor of the word "Order" in the above code snippet, calculated according to the formula given above, is 7:

wd[Order, OrderDetails.java] = 2×1 + 1×4 + 1×1 = 7

Similarly, the weighted occurrence is calculated for other words such as "details", "user", and "status".
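The computation above is a direct weighted sum and can be sketched as follows, using the λ values and occurrence counts from the OrderDetails.java example:

    # Location-type weights lambda(lt), as assigned in the example above.
    LAMBDA = {"class": 2, "attribute": 1, "method": 1}

    # nu("Order", OrderDetails.java, lt): the word "Order" occurs once in
    # the class name, four times in attribute names, once in a method name.
    NU = {"class": 1, "attribute": 4, "method": 1}

    wd_order = sum(LAMBDA[lt] * NU[lt] for lt in NU)
    print(wd_order)  # 2*1 + 1*4 + 1*1 = 7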
4.3 Topic Labeling
LDA could not satisfactorily derive a human-understandable label for an identified topic. In most cases, the terms from which a label could be derived are abbreviations of business concepts or acronyms. As a result, it is hard to create a meaningful label for a topic automatically. In the current version of the tool, identified topics are labeled manually.

5. CASE STUDIES
We have tested our approach on a number of open source and proprietary systems. In the rest of this section, we discuss the results obtained, using some of the topics as examples.

5.1 Topic Extraction for Apache
We extracted 30 topics for Apache. For the sake of brevity we list only two topics, namely "SSL" and "Logging". Table 2(a) lists the top keywords for the topic "SSL" and their corresponding probabilities of occurrence when a random keyword is generated from the topic "SSL".

Our tool is able to extract not just domain topics, but also infrastructure-level topics and cross-cutting topics. For instance, "logging" is a topic that cuts across files and modules. Our tool, based on LDA, is able to cluster all logging-related keywords together, as shown in Table 2(b), which lists the top keywords for the topic "Logging" and their corresponding probability values.

(a) Topic labeled as SSL
Keyword     Probability
ssl         0.373722
expr        0.042501
init        0.033207
engine      0.026447
var         0.022222
ctx         0.023067
ptemp       0.017153
mctx        0.013773
lookup      0.012083
modssl      0.011238
ca          0.009548

(b) Topic labeled as Logging
Keyword     Probability
log         0.141733
request     0.036017
mod         0.0311
config      0.029871
name        0.023725
headers     0.021266
autoindex   0.020037
format      0.017578
cmd         0.01512
header      0.013891
add         0.012661

Table 2: Sample topics extracted from Apache source code

5.2 Topic Extraction for Petstore
In order to investigate the effect of naming on topic extraction results, we considered Petstore, a J2EE blueprint implementation by Sun Microsystems. Being a reference J2EE implementation, it follows good Java naming conventions, and a large number of its identifiers have meaningful names.

As shown in Table 3(a), we are able to successfully group all "contact information" related terms together. What is more significant in this example is that the top keywords "info" and "contact" are meaningful and indicative of the probable name of the topic. For example, if we concatenate these two keywords into "info contact", it can be considered a valid label for the "contact information" topic. Similarly, in the case of the "address information" topic shown in Table 3(b), the concatenation of the top keywords "address" and "street" can be used to label the "address information" topic. It can be observed from the sample topics extracted that a good naming convention yields more meaningful names, thereby simplifying the process of labeling the topics.

(a) Topic labeled as Contact Information
Keyword     Probability
info        0.418520
contact     0.295719
email       0.050116
address     0.040159
family      0.040159
given       0.036840
telephone   0.026884
by          0.000332

(b) Topic labeled as Address Information
Keyword     Probability
address     0.398992
street      0.105818
city        0.055428
code        0.055428
country     0.055428
zip         0.055428
name1       0.050847
state       0.046267
name2       0.046267
end         0.005039
add         0.009548

Table 3: Sample topics extracted from Petstore source code

5.3 Synonymy and Polysemy Resolution
One of the key factors in extracting coherent topics and grouping semantically related keywords together is the ability of the algorithm employed to resolve synonymy: different words having the same meaning. We have observed that our tool is able to resolve synonymy to a good extent, since LDA models topics in a file and words in a topic using multinomial probability distributions. For instance, consider the topic labeled as "transaction" in PostgreSQL, shown in Table 4. LDA has identified that "transaction" and "xact" are synonymous and has grouped them together in a single cluster, as shown below.

Keyword       Probability
transaction   0.149284
namespace     0.090856
commit        0.035349
xact          0.035349
visible       0.029506
current       0.029506
abort         0.026585
names         0.026585
command       0.023663
start         0.020742
path          0.017821

Table 4: Transaction and xact synonymy resolution by LDA

What is more interesting is the fact that our tool has been able to resolve polysemy: the same word having different meanings in source code. A polyseme can appear in multiple domain topics depending on the context. The reason our tool is able to identify polysemes is not difficult to understand. Note that LDA models a topic as a distribution over terms; therefore it is perfectly valid for a term to appear in two topics with different probability values. Furthermore, LDA tries to infer a topic for a given term with the knowledge of the context of the word, i.e., the document where the word appears. For instance, in the Linux kernel source code we observed that the term "volume" is used in the context of sound control as well as in the context of file systems. LDA is able to differentiate between these different uses of the term and has grouped the same term into different topics.
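Mechanically, a polyseme can be flagged by checking in how many topics a term's probability clears the threshold Ψ. The topic-word probabilities below are illustrative, patterned on the "volume" example, with Ψ = 0.001 as in our experiments (Section 6.1):

    PSI = 0.001  # threshold used in our experiments (Section 6.1)

    # phi[topic][word]: illustrative probabilities patterned on "volume".
    phi = {
        "sound_1":      {"volume": 0.032, "mixer": 0.050},
        "sound_2":      {"volume": 0.009, "card": 0.040},
        "file_systems": {"volume": 0.004, "inode": 0.060},
    }

    def topics_for(word):
        """Return the topics in which `word` is an indicator (p >= PSI)."""
        return [t for t, dist in phi.items() if dist.get(word, 0.0) >= PSI]

    print(topics_for("volume"))  # appears in all three topics: a polyseme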
6. DISCUSSION
In this section we discuss various factors that impact the results obtained, followed by the benefits and limitations of our approach.

6.1 Effect of the Number of Topics
Our approach to topic extraction accepts the number of topics to be extracted as an input from the user. We have observed that varying the number of topics has a significant impact on polysemy resolution. For instance, consider the example of polysemy resolution for the keyword "volume" in the Linux kernel, discussed in Section 5.3. We conducted our experiment on the Linux kernel source code twice. Both times we kept all the parameters, namely α, β, the number of iterations, and the cut-off threshold Ψ, the same, except for the number of topics. In the first experiment the number of topics T was set to 50, and in the second experiment T was set to 60. In both experiments, among the topics extracted there were two "sound"-related topics and one "file systems" topic. Table 5 lists the probabilities of the keyword "volume" in the "sound" and "file systems" topics for both experiments.

Topic type           "volume" probability (T=50)   "volume" probability (T=60)
Sound topic 1        0.024                         0.032
Sound topic 2        0.009                         0.009
File systems topic   <0.0002                       0.004

Table 5: Effect of the number of topics on polysemy resolution in the Linux kernel

In our experiments we have used a threshold value Ψ of 0.001 for determining whether a keyword belongs to a topic. If the probability of a keyword associated with a topic is less than 0.001, we do not consider that keyword an indicator of that topic. In view of this, it can be observed from Table 5 that in experiment 1 the keyword "volume" has a probability of less than 0.0002 for the topic "file systems". Hence "volume" is associated with only the two sound-related topics and not with the "file systems" topic. However, in experiment 2, when the number of topics was increased to 60, the probability of "volume" for the topic "file systems" is 0.004. This probability is greater than our threshold of 0.001, and hence "volume" is also considered an indicator of the "file systems" topic, apart from the sound-related topics. The polysemy of the keyword "volume" is revealed only in the second experiment, with 60 topics.

6.2 Discovering the Optimal Number of Topics
The problem of identifying the optimal number of topics is not specific to source code alone; topic extraction from text documents faces a very similar problem. Griffiths and Steyvers recommend trying different numbers of topics T and suggest using the maximum likelihood method on P(w|T) [10]. Applying this technique to extract topics from Linux suggests that the optimal number of topics in the case of Linux is 270, as shown in Figure 2.

[Figure 2: Inferring the optimal number of topics for Linux]

However, automatically inferring the number of topics by maximizing the likelihood is not without problems.
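For concreteness, the selection procedure recommended by Griffiths and Steyvers amounts to fitting the model at several candidate values of T and keeping the one with the highest (approximate) likelihood. The sketch below uses scikit-learn's variational likelihood bound rather than the P(w|T) estimate of [10], and the candidate values are illustrative; in practice the score should be computed on held-out documents:

    from sklearn.decomposition import LatentDirichletAllocation

    def best_num_topics(wd, candidates=(50, 100, 200, 270)):
        """Fit LDA for each candidate T and keep the best-scoring one.

        score() returns an approximate log-likelihood bound; higher is
        better. Scoring here reuses the training matrix for brevity.
        """
        scores = {}
        for T in candidates:
            lda = LatentDirichletAllocation(n_components=T, random_state=0)
            lda.fit(wd)
            scores[T] = lda.score(wd)
        return max(scores, key=scores.get), scores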