Lesson 1 Engineering Drawings
1. assembly drawing 装配图
2. balloon 零件序号
3. detail drawing 零件图
4. view 视图
5. full-section view 全剖视图
6. broken-section view 局部剖视图
7. convention 惯例,规范
8. cutting plane 剖切面
9. sketch drawing 草图
10. interchangeable 可互换的
11. craftsman 工匠,工人
12. tolerance 公差
13. procurement 采购,获得
14. exploded drawing 分解视图
15. pattern maker 模型工,翻铸工
16. machinist 机械师,机工

Lesson 2 Mechanics
1. mechanics 力学
2. acceleration 加速度
3. deformation 变形
4. stress 应力
5. subdiscipline 分支学科
6. statics 静力学
7. dynamics 动力学
8. mechanics of materials 材料力学
9. fluid mechanics 流体力学
10. kinematics 运动学
11. continuum mechanics 连续介质力学
12. static equilibrium 静力平衡
13. Newton's first law 牛顿第一定律
14. susceptibility 敏感性
15. Newton's second law of motion 牛顿第二运动定律
16. yield strength 屈服强度
17. ultimate strength 极限强度
18. failure by buckling 屈曲破坏
19. stiffness 刚度
20. Young's modulus 杨氏模量
21. macroscopic 宏观的
22. microscopic 微观的
23. computational fluid dynamics (CFD) 计算流体力学
24. trajectory 轨迹
25. astrophysics 天体物理学
26. celestial 天体的
27. robotics 机器人学
28. biomechanics 生物力学
29. rigid body 刚体

Lesson 3 Engineering Materials
1. polymer 聚合物
2. ceramics 陶瓷
3. stiff 硬的,刚性的
4. fracture 断裂,折断
5. transparent 透明的
6. lustrous 有光泽的
7. delocalized 离域的
8. ferrous 铁的,含铁的
9. nonferrous 非铁的,不含铁的
10. tailored 定制的,特制的
11. hardness 硬度
12. tensile strength 抗拉强度
13. toughness 韧性
14. quenching 淬火
15. tempering 回火
16. stainless 不锈的
17. shield 防护,屏蔽,遮挡

Lesson 4 Mechanical Design
1. mechanism 机构,机械
2. thermal 热的,热量
3. switch 开关
4. cam 凸轮
5. valve 阀门
6. beam 梁
7. phenomena 现象
8. screw 螺钉,螺杆
9. fastener 紧固件
10. spring 弹簧
11. gear 齿轮
12. durability 耐用性
13. femur 股骨
14. quadriceps 股四头肌
15. optimization method 最优化方法
16. stiffness 刚度
17. stock 原料,备料
18. noninterference 互不干扰

Lesson 5 Machinery Components
1. pulley 滑轮,带轮
2. torque 扭矩,转(力)矩
3. sheave 槽轮,滑轮
4. disassembly 拆卸,分解
5. stock 棒料,库存
6. Woodruff key 半圆键,月牙键
7. axle 轮轴,车轴
8. spline 花键;用花键连接
9. bushing 轴瓦,轴衬
10. involute spline 渐开线花键
11. spindle 主轴,轴
12. groove 沟,槽;刻沟,刻槽
13. residual stress 残余应力
14. coupling 联轴器
15. distortion 变形,挠曲
16. misalignment 未对准,不对中
17. referred to as 把……称为
Glossary of Terms, Chinese-English Comparison (名词解释中英文对比)

social networks 社会网络; abductive reasoning 溯因推理; action recognition 行为识别; active learning 主动学习; adaptive systems 自适应系统; adverse drug reactions 药物不良反应; algorithm design and analysis 算法设计与分析; algorithm 算法; artificial intelligence 人工智能; association rule 关联规则; attribute value taxonomy 属性分类规范; autonomous agent 自动代理; autonomous systems 自动系统; background knowledge 背景知识; bayes methods 贝叶斯方法; bayesian inference 贝叶斯推断; belief propagation 置信传播; better understanding 内涵理解; big data 大数据; biological network 生物网络; biological sciences 生物科学; biomedical domain 生物医学领域; biomedical research 生物医学研究; biomedical text 生物医学文本; boltzmann machine 玻尔兹曼机; bootstrapping method 拔靴法; case based reasoning 实例推理; causal models 因果模型; citation matching 引文匹配; classification 分类; classification algorithms 分类算法; cloud computing 云计算; cluster-based retrieval 聚类检索; clustering 聚类; clustering algorithms 聚类算法; cognitive science 认知科学; collaborative filtering 协同过滤; collaborative ontology development 联合本体开发; collaborative ontology engineering 联合本体工程; commonsense knowledge 常识; communication networks 通讯网络; community detection 社区发现; complex data 复杂数据; complex dynamical networks 复杂动态网络; complex network 复杂网络; computational biology 计算生物学; computational complexity 计算复杂性; computational intelligence 智能计算; computational modeling 计算模型; computer animation 计算机动画; computer networks 计算机网络; computer science 计算机科学; concept clustering 概念聚类; concept formation 概念形成; concept learning 概念学习; concept map 概念图; concept model 概念模型; concept modelling 概念建模; conceptual model 概念模型; conditional random field 条件随机场模型; conjunctive queries 合取查询; constrained least squares 约束最小二乘; convex programming 凸规划; convolutional neural networks 卷积神经网络; customer relationship management 客户关系管理.

data analysis 数据分析; data center 数据中心; data clustering 数据聚类; data compression 数据压缩; data envelopment analysis 数据包络分析; data fusion 数据融合; data generation 数据生成; data handling 数据处理; data hierarchy 数据层次; data integration 数据整合; data integrity 数据完整性; data intensive computing 数据密集型计算; data management 数据管理; data mining 数据挖掘; data model 数据模型; data partitioning 数据划分; data point 数据点; data privacy 数据隐私; data security 数据安全; data stream 数据流; data structure 数据结构; data visualization 数据可视化; data warehouse 数据仓库; data warehousing 数据仓库; database management systems 数据库管理系统; database management 数据库管理; data interlinking 数据互联; data linking 数据链接; decision analysis 决策分析; decision maker 决策者; decision making 决策; decision models 决策模型; decision rule 决策规则; decision support system 决策支持系统; decision tree 决策树; deep belief network 深度信念网络; deep learning 深度学习; default reasoning 默认推理; density estimation 密度估计; design methodology 设计方法论; dimension reduction 降维; dimensionality reduction 降维; directed graph 有向图; disaster management 灾害管理; disastrous event 灾难性事件; discovery 知识发现; dissimilarity 相异性; distributed databases 分布式数据库; distributed query 分布式查询; document clustering 文档聚类; domain experts 领域专家; domain knowledge 领域知识; domain specific language 领域专用语言; dynamic databases 动态数据库; dynamic logic 动态逻辑; dynamic network 动态网络; dynamic system 动态系统; earth mover's distance EMD 距离; education 教育; efficient algorithm 有效算法; electronic commerce 电子商务; electronic health records 电子健康档案; entity disambiguation 实体消歧; entity recognition 实体识别; entity resolution 实体解析; event detection 事件检测; event extraction 事件抽取; event identification 事件识别; exhaustive indexing 完整索引; expert system 专家系统; explanation based learning 解释学习; factor graph 因子图; feature extraction 特征提取; feature selection 特征选择; feature space 特征空间; first order logic 一阶逻辑; formal logic 形式逻辑; formal meaning representation 形式意义表示; formal semantics 形式语义; formal specification 形式描述; frame based system 基于框架的系统; frequent itemsets 频繁项目集; frequent pattern 频繁模式; fuzzy clustering 模糊聚类; fuzzy data mining 模糊数据挖掘; fuzzy logic 模糊逻辑; fuzzy set theory 模糊集合论; fuzzy set 模糊集; fuzzy sets 模糊集合; fuzzy systems 模糊系统.

gaussian processes 高斯过程; gene expression data 基因表达数据; gene expression 基因表达; generative model 生成模型; genetic algorithm 遗传算法; genome wide association study 全基因组关联分析; graph classification 图分类; graph clustering 图聚类; graph data 图数据; graph database 图数据库; graph mining 图挖掘; graph partitioning 图划分; graph query 图查询; graph structure 图结构; graph theory 图论; graph visualization 图形可视化; graphical user interface 图形用户界面; health care 卫生保健; heterogeneous data source 异构数据源; heterogeneous data 异构数据; heterogeneous database 异构数据库; heterogeneous information network 异构信息网络; heterogeneous network 异构网络; heterogeneous ontology 异构本体; heuristic rule 启发式规则; hidden markov model 隐马尔可夫模型; hierarchical clustering 层次聚类; homogeneous network 同构网络; human centered computing 人机交互技术; human computer interaction 人机交互; human interaction 人机交互; human robot interaction 人机交互; image classification 图像分类; image clustering 图像聚类; image mining 图像挖掘; image reconstruction 图像重建; image retrieval 图像检索; image segmentation 图像分割; inconsistent ontology 本体不一致; incremental learning 增量学习; inductive learning 归纳学习; inference mechanisms 推理机制; inference rule 推理规则; information cascades 信息级联; information diffusion 信息扩散; information extraction 信息提取; information filtering 信息过滤; information integration 信息集成; information network analysis 信息网络分析; information network mining 信息网络挖掘; information network 信息网络; information processing 信息处理; information resource management 信息资源管理; information retrieval models 信息检索模型; information retrieval 信息检索; information science 情报科学; information sources 信息源; information system 信息系统; information technology 信息技术; information visualization 信息可视化; instance matching 实例匹配; intelligent assistant 智能辅助; intelligent systems 智能系统; interaction network 交互网络; interactive visualization 交互式可视化; kernel function 核函数; kernel operator 核算子; keyword search 关键字检索; knowledge reuse 知识再利用; knowledge acquisition 知识获取; knowledge base 知识库; knowledge based system 知识系统; knowledge building 知识建构; knowledge capture 知识获取; knowledge construction 知识建构; knowledge discovery 知识发现; knowledge extraction 知识提取; knowledge fusion 知识融合; knowledge integration 知识集成; knowledge management systems 知识管理系统; knowledge management 知识管理; knowledge model 知识模型; knowledge reasoning 知识推理; knowledge representation 知识表达; knowledge sharing 知识共享; knowledge storage 知识存储; knowledge technology 知识技术; knowledge verification 知识验证.

language model 语言模型; language modeling approach 语言模型方法; large graph 大图; life science 生命科学; linear programming 线性规划; link analysis 链接分析; link prediction 链接预测; linked data 关联数据; location based services 基于位置的服务; logic programming 逻辑编程; logical implication 逻辑蕴涵; logistic regression logistic 回归; machine learning 机器学习; machine translation 机器翻译; management system 管理系统; manifold learning 流形学习; markov chains 马尔可夫链; markov processes 马尔可夫过程; matching function 匹配函数; matrix decomposition 矩阵分解; maximum likelihood estimation 最大似然估计; medical research 医学研究; mixture of gaussians 混合高斯模型; mobile computing 移动计算; multi agent systems 多智能体系统; multimedia 多媒体; natural language processing 自然语言处理; nearest neighbor 近邻; network analysis 网络分析; network formation 组网; network structure 网络结构; network theory 网络理论; network topology 网络拓扑; network visualization 网络可视化; neural networks 神经网络; nonlinear dynamics 非线性动力学; nonmonotonic reasoning 非单调推理; nonnegative matrix factorization 非负矩阵分解; object detection 目标检测; object oriented 面向对象; object recognition 目标识别; online community 网络社区; online social networks 在线社交网络; ontology alignment 本体映射; ontology development 本体开发; ontology engineering 本体工程; ontology evolution 本体演化; ontology extraction 本体抽取; ontology interoperability 本体互操作性; ontology language 本体语言; ontology mapping 本体映射; ontology matching 本体匹配; ontology versioning 本体版本; ontology 本体论; open government data 政府公开数据; opinion analysis 舆情分析; opinion mining 意见挖掘; outlier detection 孤立点检测.

parallel processing 并行处理; patient care 病人医疗护理; pattern classification 模式分类; pattern matching 模式匹配; pattern mining 模式挖掘; pattern recognition 模式识别; personal data 个人数据; prediction algorithms 预测算法; predictive models 预测模型; privacy preservation 隐私保护; probabilistic logic 概率逻辑; probabilistic model 概率模型; probability distribution 概率分布; project management 项目管理; pruning technique 修剪技术; quality management 质量管理; query expansion 查询扩展; query language 查询语言; query processing 查询处理; query rewrite 查询重写; question answering system 问答系统; random forest 随机森林; random graph 随机图; random processes 随机过程; random walk 随机游走; range query 范围查询; RDF database 资源描述框架数据库; RDF query 资源描述框架查询; RDF repository 资源描述框架存储库; RDF storage 资源描述框架存储; real time 实时; recommender systems 推荐系统; record linkage 记录链接; recurrent neural network 递归神经网络; regression 回归; reinforcement learning 强化学习; relation extraction 关系抽取; relational database 关系数据库; relational learning 关系学习; relevance feedback 相关反馈; resource description framework 资源描述框架; restricted boltzmann machines 受限玻尔兹曼机; retrieval models 检索模型; rough set theory 粗糙集理论; rough set 粗糙集; rule based system 基于规则系统; rule based 基于规则; rule induction 规则归纳; rule learning 规则学习.

schema mapping 模式映射; schema matching 模式匹配; scientific domain 科学域; search problems 搜索问题; semantic (web) technology 语义技术; semantic analysis 语义分析; semantic annotation 语义标注; semantic computing 语义计算; semantic integration 语义集成; semantic interpretation 语义解释; semantic model 语义模型; semantic network 语义网络; semantic relatedness 语义相关性; semantic relation learning 语义关系学习; semantic search 语义检索; semantic similarity 语义相似度; semantic web rule language 语义网规则语言; semantic web 语义网; semantic workflow 语义工作流; semi supervised learning 半监督学习; sensor data 传感器数据; sensor networks 传感器网络; sentiment analysis 情感分析; sequential pattern 序列模式; service oriented architecture 面向服务的体系结构; shortest path 最短路径; similar kernel function 相似核函数; similarity measure 相似性度量; similarity relationship 相似关系; similarity search 相似搜索; similarity 相似性; situation aware 情境感知; social behavior 社交行为; social influence 社会影响; social interaction 社交互动; social learning 社会学习; social life networks 社交生活网络; social machine 社交机器; social media 社交媒体; social network analysis 社会网络分析; social network 社交网络; social science 社会科学; social tagging system 社交标签系统; social tagging 社交标签; social web 社交网页; sparse coding 稀疏编码; sparse matrices 稀疏矩阵; sparse representation 稀疏表示; spatial database 空间数据库; spatial reasoning 空间推理; statistical analysis 统计分析; statistical model 统计模型; string matching 串匹配; structural risk minimization 结构风险最小化; structured data 结构化数据; subgraph matching 子图匹配; subspace clustering 子空间聚类; supervised learning 有监督学习; support vector machines 支持向量机; system dynamics 系统动力学.

tag recommendation 标签推荐; taxonomy induction 分类体系归纳; temporal logic 时态逻辑; temporal reasoning 时序推理; text analysis 文本分析; text classification 文本分类; text data 文本数据; text mining technique 文本挖掘技术; text mining 文本挖掘; text summarization 文本摘要; thesaurus alignment 同义对齐; time frequency analysis 时频分析; time series analysis 时间序列分析; time series data 时间序列数据; time series 时间序列; topic model 主题模型; topic modeling 主题建模; transfer learning 迁移学习; triple store 三元组存储; uncertainty reasoning 不精确推理; undirected graph 无向图; unified modeling language 统一建模语言; unsupervised learning 无监督学习; upper bound 上界; user behavior 用户行为; user generated content 用户生成内容; utility mining 效用挖掘; visual analytics 可视化分析; visual content 视觉内容; visual representation 视觉表征; visualisation 可视化; visualization technique 可视化技术; visualization tool 可视化工具; web 2.0 网络2.0; web forum web 论坛; web mining 网络挖掘; web of data 数据网; web ontology language 网络本体语言; web pages web 页面; web resource 网络资源; web science 万维科学; web search 网络检索; web usage mining web 使用挖掘; wireless networks 无线网络; world knowledge 世界知识; world wide web 万维网; xml database 可扩展标记语言数据库.

附录 2 Data Mining 知识图谱 (15 second-level nodes, 93 third-level nodes). Table columns: 领域 (domain), 二级分类 (second-level category), 三级分类 (third-level category).
iDM: A Unified and Versatile Data Model for Personal Dataspace Management*

Jens-Peter Dittrich, Marcos Antonio Vaz Salles
ETH Zurich, 8092 Zurich, Switzerland (dbis.ethz.ch)

*This work was partially supported by the Swiss National Science Foundation (SNF) under contract 200021-112115.

ABSTRACT
Personal Information Management Systems require a powerful and versatile data model that is able to represent a highly heterogeneous mix of data such as relational data, XML, file content, folder hierarchies, emails and email attachments, data streams, RSS feeds, and dynamically computed documents, e.g. ActiveXML [3]. Interestingly, until now no approach has been proposed that is able to represent all of the above data in a single, powerful yet simple data model. This paper fills this gap. We present the iMeMex Data Model (iDM) for personal information management. iDM is able to represent unstructured, semi-structured and structured data inside a single model. Moreover, iDM is powerful enough to represent graph-structured data, intensional data, and infinite data streams. Further, our model enables the representation of the structural information available inside files. As a consequence, the artificial boundary between the inside and the outside of a file is removed, which enables a new class of queries. As iDM allows the representation of the whole personal dataspace [20] of a user in a single model, it is the foundation of the iMeMex Personal Dataspace Management System (PDSMS) [16, 14, 47]. This paper also presents results of an evaluation of an initial iDM implementation in iMeMex, which show that iDM can be efficiently supported in a real PDSMS.

1. INTRODUCTION
1.1 Motivation
Personal information consists of a highly heterogeneous data mix of emails, XML, LaTeX and Word documents, pictures, music, address book entries, and so on. Personal information is typically stored in files scattered among multiple file systems (local or network), multiple machines (local desktop, network share, mail server) and, most of all, different file formats (XML, LaTeX, Office, email formats, etc.).
This work is motivated by the fact that much of the data stored in those files represents information, such as hierarchies and graphs, that is not exploited by current PIM tools to narrow search and query results. Moreover, there is an artificial gap between the structural information inside files and the outside structural information established by the files & folders hierarchies employed by the user to classify her files. This paper proposes an elegant solution to close these gaps.
Key to our approach is to map all available data to a single graph data model, no matter whether it belongs inside or outside a file. For instance, our model may be applied to represent at the same time data pertaining to the file system (folder hierarchies) and data pertaining to the contents inside a file (structural graph). Consequently, the artificial boundary between the inside and the outside of a file is removed.
At the same time, we also remove the boundary between different subsystems such as file systems and IMAP email servers, as we map that information to the same graph model. This enables new kinds of queries that are not supported by state-of-the-art PIM tools.

1.2 The Problem
EXAMPLE 1 [INSIDE VERSUS OUTSIDE FILES]. Personal information includes semi-structured data (XML, Word or other Office documents¹) as well as graph-structured data (LaTeX). These documents are stored as files on the file system. Consider the following query:

Query 1: "Show me all LaTeX 'Introduction' sections pertaining to project PIM that contain the phrase 'Mike Franklin'."

The query references information that is partly kept outside files on the file system, i.e. all project folders related to the PIM project. Another part of the query references information kept inside certain LaTeX files, i.e. all introduction sections containing the phrase 'Mike Franklin'. With current technology, this query cannot be issued in one single request by the user, as it has to bridge that inside-outside file boundary. The user may only search the file system using simple system tools like grep, find, or a keyword search engine. However, these tools may return a large number of results which would have to be examined manually to determine the final result. Even when a matching file is encountered, then, for structured file formats like Microsoft PowerPoint, the user typically has to conduct a second search inside the file to find the desired information [13]. Moreover, state-of-the-art operating systems, including the upcoming WinFS, do not support at all the exploitation of structured information inside the user's documents. The structured information inside the files is in a data cage and cannot be used to refine the query.
¹Open Office has stored documents in XML since version 1.0. MS Office 12, appearing at the end of 2006, will also enable storage of files using zipped XML.

Desiderata: What is missing is a technology that enables users to execute queries that bridge the divide between the graph-structured information inside their files and the outside file system.

EXAMPLE 2 [FILES VERSUS EMAIL ATTACHMENTS]. Email has become one of the most important applications of personal information management. Consider a simple project management scenario. When managing multiple projects, e.g. industry projects, research projects, PhD students, lectures, and so on, you may decide to store documents of big projects in a separate folder on your local hard disk. For smaller projects you may decide to keep that information as attachments to email messages you exchanged with members of the project team. Now let's consider the following query:

Query 2: "Show me all documents pertaining to project 'OLAP' that have a figure containing the phrase 'Indexing Time' in its label."

With state-of-the-art technology this type of query cannot be computed, as it requires considering the structural information available in the different folder hierarchies outside the files, i.e. both the folder hierarchy on the local hard disk and the folder hierarchy on the IMAP server. Again, the structural constraint "all figures containing" has to be evaluated considering structural information present inside certain files, i.e. locally available files and email attachments.
Desiderata: What is missing is a technology that abstracts from the different outside subsystems like file systems and email servers to enable a unified approach to querying graph-structured personal information.

1.3 Our Contributions
This paper presents an elegant solution to the above problems. The core idea is to introduce a unified, versatile and yet simple data model. All personal information like semi-structured documents (LaTeX, Word or other Office documents), relational data, file content, folder hierarchies, email, RSS feeds and even data streams can be instantiated in that model. Since we are able to represent all data inside a single model, we are then in the position to use a single powerful query language to query all data within a single query. This paper presents our data model. The full specification of our query language will be presented as part of future work.
In summary, this paper makes the following contributions:
1. We present the iMeMex Data Model (iDM) for personal information management. iDM allows for the unified representation of all personal information like XML, relational data, file content, folder hierarchies, email, data streams and RSS feeds inside a single data model.
2. We present how to instantiate existing data like XML, relations, files & folders and data streams inside our model. Further, we show how to instantiate iDM subgraphs based on the structural content of files, such as tree structures (XML) and graph structures (LaTeX).
3. We show that all parts of the iMeMex Data Model (iDM) can be computed lazily. Because of this, iDM is able to support so-called intensional data, i.e. data that is obtained by executing a query or calling a remote service. For this reason, iDM can model approaches such as Active XML [3] as a use-case. In contrast to the latter, however, iDM is not restricted to the XML domain.
4. iDM is the foundation for the iMeMex Personal Dataspace Management System (PDSMS) [16, 14, 47]. A PDSMS provides integrated services on the whole dataspace [20] of a user, that is, the total of all personal information pertaining to a given individual. We present the core architecture of iMeMex and show how iDM was implemented in that system.
5. Using an initial iDM implementation in the iMeMex Personal Dataspace Management System, we present results of experiments. The results of our experiments demonstrate that iDM can be efficiently implemented in a real PDSMS, providing both efficient indexing times and interactive response times.
This paper is structured as follows. Section 2 presents an overview and the formal definition of the iMeMex Data Model (iDM) for personal information management. Section 3 then describes how to represent XML, files & folders, as well as data streams using iDM. After that, Section 4 discusses how to compute an iDM graph.
Section 5 gives an overview of the implementation of iDM and its query language iQL in the iMeMex Personal Dataspace Management System. Section 6 reviews related work. Section 7 presents experiments with our initial iDM implementation in iMeMex. Finally, Section 8 concludes the paper.

2. iMeMex DATA MODEL
This section is structured into an overview (2.1), the formal definition of our model (2.2), and a series of examples (2.3).

2.1 Overview
The iMeMex Data Model (iDM) uses the following important ideas to represent the heterogeneous data mix found in personal information management:
- In iDM, all personal information available on a user's desktop is exposed through a set of resource views. A resource view is a sequence of components that express structured, semi-structured and unstructured pieces of the underlying data. Thus all personal data items, though varied in their representation, are exposed in iDM uniformly. For example, every node in a files & folders hierarchy as well as every element in an XML document would be represented in iDM by one distinct resource view.
- Resource views in iDM are linked to each other in arbitrary directed graph structures. The connections from a given resource view to other resource views are given by one of its components.
- In contrast to XML approaches [45, 3], iDM does not impose the need to convert data to a physical XML document representation before query processing may take place. Rather, we favor a clear separation between the logical and physical representation of data. Data may be dynamically represented in our model during query processing although it remains primarily stored in its original format and location. Note that this does not preclude a system that implements iDM from providing facilities such as indexing or replication of the original data to speed up query evaluation.
- We introduce resource view classes to precisely define the types and representation of resource view components. Any given resource view may or may not comply to a resource view class. We show how to use resource view classes to constrain our model to represent data in files & folders, relations, XML and data streams in Section 3.
- Resource views may be given extensionally (e.g., files & folders or tuples in a relational store) or computed intensionally (e.g., as a result to a query). Further, resource views may contain components that are finite as well as infinite. Those aspects of our model are explored in Section 4.

2.2 Resource Views
In this section, we formally define a resource view and explain its constituent parts. After that, we present examples of resource view graphs (see Section 2.3). A resource view V_i is given by four components (η_i, τ_i, χ_i, γ_i)²:

η_i NAME COMPONENT: η_i is a finite string that represents the name of V_i.

τ_i TUPLE COMPONENT: τ_i is a 2-tuple (W, T), where W is a schema and T is one single tuple that conforms to W. The schema W = ⟨a_j⟩, j = 1, 2, ..., k is defined as a sequence of attributes, where attribute a_j is the name of a role played by some domain³ D_j in W. The tuple T = ⟨v_j⟩, j = 1, 2, ..., k is a sequence of atomic values, where value v_j is an element of the domain D_j of the attribute a_j.

χ_i CONTENT COMPONENT: χ_i is a string of symbols taken from an alphabet Σ_c. The content χ_i may be finite or infinite. When χ_i is finite, it takes the form of a finite sequence of symbols denoted ⟨c_1, ..., c_l⟩, c_j ∈ Σ_c, j = 1, ..., l; when χ_i is infinite, the respective sequence is infinite and we denote χ_i = ⟨c_1, ..., c_l⟩, l → ∞.

γ_i GROUP COMPONENT: γ_i is a 2-tuple (S, Q), where S is a (possibly empty) set of resource views and Q is a (possibly empty) ordered sequence of resource views. Further:
(i) The set S and the sequence Q may be finite or infinite. When S is finite, we denote S = {V_s1, ..., V_sm}; when it is infinite, we denote S = {V_s1, ..., V_sm}, m → ∞. Likewise, when Q is finite, we denote Q = ⟨V_q1, ..., V_qn⟩; when Q is infinite, we denote Q = ⟨V_q1, ..., V_qn⟩, n → ∞.
(ii) S ∩ Q = ∅, i.e., S and Q are disjoint⁴.
(iii) Assume a resource view V_i has a non-empty γ_i component. If there exists a resource view V_k for which V_k ∈ S ∪ Q holds, we say that V_k is directly related to V_i, i.e., V_i → V_k. Any given resource view may be directly related to zero, one or many other resource views.
(iv) If V_i → V_j → ... → V_k, we say that resource view V_k is indirectly related to V_i, i.e., V_i ⇝ V_k.
If any of the components of a resource view is empty, we denote its value, where convenient, by the empty n-tuple () or by the empty sequence ⟨⟩.

²We use the subscript notation to denote one specific resource view and its components.
³A domain is considered to be a set of atomic values. In the remainder, whenever we mention attributes or domains, we refer to the definitions given above. Our definitions of domains and attributes conform to the ones given in [19].
⁴For the sake of simplicity, we extend notation and use the symbols ∩ and ∪ to denote intersection not only between sets, but also between a set and a sequence. The ∩ operation returns a set containing the (distinct) elements of the sequence that also belong to the intersected set. The ∪ operation returns a set containing the (distinct) elements that belong either to the set or to the sequence.

The η_i component is a name used to refer to the resource view. It is of service for the construction of path queries, discussed in Section 5.1. The τ_i component has a similar definition as the one given for tuples in a relational data model [9, 19]. One important difference is that the schema W is defined for each tuple, instead of for a set of tuples. This decision of course creates a tension for representing sets of resource views with the same structural features. Schematic information is important in this setting, and we introduce it in our model by defining resource view classes (see Section 3). The χ_i component represents arbitrary unstructured content. For iDM, it is merely a sequence of atomic symbols from some alphabet, such as characters in a file's content or in an XML text node. The χ_i component may be finite or infinite. Allowing infinite content is useful to naturally represent media streams in our model. Finally, the γ_i component creates a directed graph structure that connects related views. Note that we impose no restriction on this graph, so that we may represent trees, DAGs and cyclic graphs. If the relative order among the connections established between resource views is of importance, we represent them in the sequence Q of γ_i. Otherwise, they are represented in the set S. Like the χ_i component, both the set and the sequence of γ_i may be finite or infinite. The infinite case is useful for representing data streams as infinite sequences of resource views in our model.
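To make the four-component definition concrete, the following is a minimal Python sketch of a resource view as a plain data structure. It is an illustration of the definition only, not the iMeMex implementation; all class and field names are our own.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

@dataclass(eq=False)  # eq=False keeps instances hashable (by identity)
class ResourceView:
    # eta: the name component, a finite string (or empty)
    name: Optional[str] = None
    # tau: the tuple component (W, T): a per-view schema of attribute
    # names plus one tuple of atomic values conforming to it
    schema: Tuple[str, ...] = ()
    values: Tuple[object, ...] = ()
    # chi: the content component, a sequence of symbols (here: a string)
    content: Optional[str] = None
    # gamma: the group component (S, Q): an unordered set plus an
    # ordered sequence of directly related resource views
    group_set: Set["ResourceView"] = field(default_factory=set)
    group_seq: List["ResourceView"] = field(default_factory=list)

    def directly_related(self) -> List["ResourceView"]:
        # V_i -> V_k holds for every view V_k in S ∪ Q
        return list(self.group_set) + self.group_seq
```

Note how nothing in this structure restricts the graph induced by the group components, so trees, DAGs and cycles can all be represented, matching restriction (iii) above.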
2.3 Examples
In Figure 1(a), we present a typical files & folders hierarchy containing information on some research projects of one of the authors. There is one high-level folder that groups all research projects, and we show two sub-folders for specific projects. The 'PIM' folder is further expanded to reveal one LaTeX document for one version of this VLDB 2006 paper, one Microsoft Word document for a grant proposal, and one folder link to the top-level 'Projects' folder. The contents of those documents are also partially displayed. Both documents contain sections, subsections and their corresponding text. In addition, the document 'vldb2006.tex' also contains a reference to section 'Preliminaries' in the subsection 'The Problem'.

[Figure 1: iDM represents heterogeneous personal information as a single resource view graph. The resource view graph represents the whole dataspace of a user.]

Unified Representation. As we may notice in Figure 1(a), there is a gap between the graph-structured information inside files and the hierarchies present on the outside file system. We show the representation of the same information in iDM in Figure 1(b). Nodes denote resource views and edges denote the connections induced by the resource views' group components. Each node is labeled with the resource view's name component. In iDM, what we call files & folders are only resource views. Each file or folder is represented as one resource view in Figure 1(b). Further, the data stored inside files is also uniformly represented as resource views. The document class, title, abstract and document portions of the VLDB 2006 paper are some of these resource views. They are directly related to the 'vldb2006.tex' file resource view. In addition, any structural information present in the document portion of that file is represented as resource views. Note that the same applies to the 'Grant.doc' resource view. This is possible as an increasing number of office tools, such as Microsoft Office 12, OpenOffice and LaTeX, offer semi-structured or graph-structured document formats that make it feasible to write content converters that obtain good-quality resource view graphs. In addition, structure extraction techniques [18, 12, 35] may be used in conjunction with our approach to further increase the quality of resource view subgraphs extracted from content components.

Intensional Data. One important aspect of our model is that the resource view graph is a logical representation of personal information. That logical representation does not need to be fully materialized at once and may be computed lazily. Personal information does not have to be imported to or exported from our data model, but rather represented in it. This is in sharp contrast to XML approaches, which are usually tied to having a physical representation of the whole data as an XML document before querying may be carried out (see Section 4).

Resource Views and their Components. Let us show the detailed representation of one resource view present in Figure 1. Consider the 'PIM' folder. We represent it as a resource view V_PIM = (η_PIM, τ_PIM, χ_PIM, γ_PIM), in which:
η_PIM = 'PIM';
τ_PIM = (W, T), where W = ⟨creation time: date, size: integer, last modified time: date⟩ and T = ⟨'19/03/2005 11:54', 4096, '22/09/2005 16:14'⟩, i.e., the τ_PIM component represents the filesystem properties associated with the 'PIM' folder;
χ_PIM = ⟨⟩;
γ_PIM = (S, Q), where S = {V_vldb2006.tex, V_Grant.doc, V_All Projects} and Q = ⟨⟩.
Note that the children of the 'PIM' folder in the filesystem are represented as resource views directly related to V_PIM. These resource views are V_vldb2006.tex, V_Grant.doc, and V_All Projects. The resource views V_vldb2006.tex and V_Grant.doc have their η and τ components defined analogously to V_PIM, their γ component corresponding to the resource views in their document content and their χ component equal to the binary stream of each file. The V_All Projects resource view is also represented analogously to V_PIM, except that its γ component corresponds to the folder 'Projects'. Here and in the remainder, we will omit
the empty components from the resource view notation as they are clear from the context. Therefore, we denote V_PIM = ('PIM', τ_PIM, γ_PIM). We also use V_PIM to represent a view named 'PIM' where there is no risk of confusion with other views with the same name.

Graph Structures. Some data models, such as XML and traditional files & folders, organize data in a tree structure. The extension of that structure to represent graphs is usually done through the inclusion of second-class citizens in the model, such as links. iDM naturally represents graph structures by the connections induced by the resource views' group components. The relevance of representing and querying arbitrary graph data has been increasingly recognized in the literature [24, 30]. For example, the V_Projects → V_PIM → V_All Projects → V_Projects path of directly related resource views in Figure 1(b) forms a cycle in the resource view graph. Further, in Figure 1(b), resource view V_Preliminaries is directly related to both views V_document and V_ref.

Schemas. Personal information is trapped inside a large array of data cages. Some solutions to this problem propose to cast the user's information into a rigid set of developer-defined schemas, e.g. Microsoft WinFS [44]. They defend that the relational model is the most appropriate underlying representation for desktop data. Other proposals, e.g. Apple Spotlight [4], abolish schemas and employ search engine technology to represent all data as completely unstructured. In contrast, iDM offers a schema-agnostic graph representation of personal information. We may employ resource view classes, defined in the following section, to enrich iDM with schematic information and to constrain it to represent a wide array of data models typically used to represent personal data.

3. INSTANTIATING SPECIALIZED DATA MODELS
In the following, we show how to instantiate specialized data models like files & folders, XML, as well as data streams using iDM. Due to space constraints, we omit details on how to instantiate relational data. These instantiations will, however, become clear to the reader based on the discussion for the other data models. Table 1 summarizes the instantiations for the specialized data models.

3.1 Resource View Classes
For convenience, we introduce resource view classes, which constrain iDM to represent a particular data model.

DEFINITION 2 (RESOURCE VIEW CLASS). Given a set of resource views D = {V_i}, i = 1, ..., n, we define a resource view class C as a set of formal restrictions on the η_i, τ_i, χ_i and γ_i components of all views V_i ∈ D. The formal restrictions specified by a resource view class include:
1. EMPTINESS OF COMPONENTS: specifies whether a certain subset of the components of resource views that obey the resource view class must be empty or not.
2. SCHEMA OF τ: determines the schema W that the τ components of resource views must have.
3. FINITENESS OF χ OR γ: enforces the content χ or the group γ elements S or Q to be either finite, infinite or empty.
4. CLASSES OF DIRECTLY RELATED RESOURCE VIEWS: given a resource view V_i, determines the set of resource view classes that are acceptable for any resource view V_j such that V_i → V_j.

When a view V_i conforms to the resource view class C, we denote it V_i^C. Accordingly, we denote its components η_i^C, τ_i^C, χ_i^C, and γ_i^C. A given resource view may obey directly to only one class. One may, however, organize resource view classes in generalization (or specialization) hierarchies. When a resource view obeys to a given class C, it automatically obeys to all classes that are generalizations of C. In this sense, our classes have an object-oriented flavor [8]. Resource view classes may be used by application developers to provide pre-defined schema information on sets of resource views. Note that a full specification of schemas, including several different categories of integrity constraints, exceeds the scope of this paper. Note, additionally, that not all resource views need to have a resource view class attached to them. This means that, unlike the relational [9] or object-oriented approaches [8], which support only a schema-first data modeling strategy, our model also supports schema-later and even schema-never [20].

Table 1: Important resource view classes to represent files & folders, relations, XML, data streams, and RSS.

Description         | Class     | η    | τ           | χ   | γ: S                                               | γ: Q
File                | file      | N_f  | (W_FS, T_f) | C_f | ∅                                                  | ⟨⟩
Folder              | folder    | N_F  | (W_FS, T_F) |     | {V_child1, ..., V_childm}, child ∈ {file, folder}  | ⟨⟩
Relational tuple    | tuple     |      | (W_R, t_i)  |     | ∅                                                  | ⟨⟩
Relation            | relation  | N_R  | ()          |     | {V_tuple1, ..., V_tuplem}, τ_tuple_i = (W_R, t_i)  | ⟨⟩
Relational database | reldb     | N_DB | ()          |     | {V_relation1, ..., V_relationm}                    | ⟨⟩
XML text node       | xmltext   |      | ()          | C_t | ∅                                                  | ⟨⟩
XML element         | xmlelem   | N_E  | (W_E, T_E)  |     | ∅                                                  | ⟨V_xmlnode1, ..., V_xmlnoden⟩, xmlnode ∈ {xmltext, xmlelem}
XML document        | xmldoc    |      | ()          |     | ∅                                                  | ⟨V_xmlelem_root⟩
XML file            | xmlfile   | N_f  | (W_FS, T_f) | C_f | ∅                                                  | ⟨V_xmldoc_doc⟩
Data stream         | datstream |      | ()          |     | ∅                                                  | ⟨V_1, ..., V_n⟩, n → ∞
Tuple stream        | tupstream |      | ()          |     | ∅                                                  | ⟨V_tuple1, ..., V_tuplen⟩, n → ∞
RSS/ATOM stream     | rssatom   |      | ()          |     | ∅                                                  | ⟨V_xmldoc1, ..., V_xmldocn⟩, n → ∞ (or: same as xmldoc)

3.2 Files & Folders
Traditional filesystems can be seen as trees in which internal nodes represent non-empty folders and leaves represent files or empty folders. Each node in the tree has a fixed set of properties, such as size, creation time, last modified time, etc. The attributes that appear on each node are defined in a filesystem-level schema W_FS = {size: int, creation time: date, last modified time: date, ...}. The sequence of values that conform to that schema is expressed per node and, for a node n, we denote this sequence T_n. Each node also has a name, denoted N_n. In addition, if n is a file node, then it has an associated content C_n.
In iDM we consider a 'file' just one out of many possible resource views on user data. In order to have iDM represent the files & folders data model, we define a file resource view class, denoted file, that constrains a view instance V_i^file to represent a file f as follows:
V_i^file = (η_i^file, τ_i^file, χ_i^file), where: η_i^file = N_f, τ_i^file = (W_FS, T_f), and χ_i^file = C_f.
Based on this definition we can recursively define the concept of a 'folder'. A folder resource view class, denoted folder, constrains a view instance V_i^folder to represent a folder F as follows:
V_i^folder = (η_i^folder, τ_i^folder, γ_i^folder), where: η_i^folder = N_F, τ_i^folder = (W_FS, T_F), and γ_i^folder = ({V_child1, ..., V_childm}, ⟨⟩), child ∈ {file, folder}.
Further, as discussed in Section 2, iDM may be used to exploit semi-structured content inside files. One special case of this type of content is XML. We may thus specialize the file resource view class to define an XML file resource view class, denoted xmlfile, in which the group component is non-empty and defined as: γ_i^xmlfile = (∅, ⟨V_doc^xmldoc⟩). The resource view V_doc^xmldoc pertains to resource view class xmldoc and is the document resource view for the XML document. The xmldoc resource view class is defined in Subsection 3.3. Note that other specializations of the file resource view class may be defined analogously for other document formats, e.g. LaTeX.
3.3 XML
We assume that XML data is represented according to the definitions in the XML Information Set [46]. Due to space constraints, we only discuss how to instantiate the core subset of the XML Information Set (document, element, attribute, character) in iDM.
In order to have iDM represent the XML data model, we first define an XML text resource view class, denoted xmltext, that constrains a view instance V_i^xmltext to represent a character information item with text content C_t = ⟨c_1, ..., c_n⟩ as follows:
V_i^xmltext = (χ_i^xmltext), where: χ_i^xmltext = C_t.
As a second step, we need to represent element information items in our model. An element information item E has a name N_E, a set of attributes with values T_E and schema W_E, and a sequence of children, each of which is either a text or an element information item. We define an XML element resource view class, xmlelem, that constrains a view instance V_i^xmlelem to represent an element information item E as follows:
V_i^xmlelem = (η_i^xmlelem, τ_i^xmlelem, γ_i^xmlelem), where: η_i^xmlelem = N_E, τ_i^xmlelem = (W_E, T_E), and γ_i^xmlelem = (S, Q), S = ∅, Q = ⟨V_xmlnode1, ..., V_xmlnoden⟩, xmlnode ∈ {xmltext, xmlelem}.
Note that XML attribute nodes are modelled by the τ_i^xmlelem component of V_i^xmlelem. Finally, we can define an XML document resource view class, xmldoc, to represent a document information item as follows:
V_i^xmldoc = (γ_i^xmldoc), where: γ_i^xmldoc = (∅, ⟨V_root^xmlelem⟩), and V_root^xmlelem is a view that represents the root element information item of the document.
Figure 2 shows an example of an instantiation of an XML fragment in iDM. Each node in the XML document is represented as a resource view. We use an expanded notation in the figure to represent nodes in the resource view graph and explicitly indicate their components. The connections among views are given by the resource views' γ components.

[Figure 2: XML fragment represented as a resource view graph.]

3.4 Data Streams
We assume that a data stream is an infinite data source delivering data items. For the moment we do not require any restrictions on the type of data items. To constrain iDM to represent a generic data stream model, we define a generic data stream resource view class, datstream, that restricts a resource view V_i^datstream as follows:
V_i^datstream = (γ_i^datstream), where: γ_i^datstream = (∅, ⟨V_1, ..., V_n⟩), n → ∞.
Figure 3 shows schematically how the instantiation of a data stream occurs in iDM. Each data item delivered by that stream is represented as a resource view. Depending on the data model in which the item is represented, the corresponding resource views may be of different classes. The data stream itself is represented as a resource view whose γ component contains all corresponding data item views.

[Figure 3: A data stream is represented as a resource view graph.]

Examples of data streams include streams that deliver tuples. A tuple stream resource view class, tupstream, is defined to restrict a view instance V_i^tupstream as follows:
V_i^tupstream = (γ_i^tupstream), where: γ_i^tupstream = (∅, ⟨V_tuple1, ..., V_tuplen⟩), n → ∞.
Another example of a data stream is an RSS/ATOM stream delivering XML messages. An RSS stream resource view class, rssatom, is defined to restrict a resource view V_i^rssatom as follows:
V_i^rssatom = (γ_i^rssatom), where: γ_i^rssatom = (∅, ⟨V_xmldoc1, ..., V_xmldocn⟩), n → ∞.
Note that RSS/ATOM streams, even though they are frequently called 'streams', are just simple XML documents that are provided by a growing number of web servers. There is no notification mechanism for the clients interested in those documents. So, one may argue that an alternative representation for an RSS/ATOM stream is the same as given for an XML document (see also Table 1).
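The XML instantiation can also be sketched in a few lines of Python, again reusing the hypothetical ResourceView structure from Section 2.2: element information items become xmlelem views (η from the tag, τ from the attributes, ordered children in Q) and character data becomes xmltext views. This is our own sketch of the mapping, not iMeMex code.

```python
import xml.etree.ElementTree as ET

def view_from_element(elem: ET.Element) -> ResourceView:
    """Map an element information item to an xmlelem resource view."""
    children = []
    if elem.text and elem.text.strip():
        # character information item -> xmltext view (content only)
        children.append(ResourceView(content=elem.text.strip()))
    for child in elem:
        children.append(view_from_element(child))
        if child.tail and child.tail.strip():
            children.append(ResourceView(content=child.tail.strip()))
    # attribute nodes are modelled by the tuple component (W_E, T_E)
    return ResourceView(name=elem.tag,
                        schema=tuple(elem.attrib.keys()),
                        values=tuple(elem.attrib.values()),
                        group_seq=children)  # ordered children go in Q

root = ET.fromstring("<section title='Introduction'>Hello <b>world</b></section>")
root_view = view_from_element(root)                 # xmlelem view
xmldoc_view = ResourceView(group_seq=[root_view])   # xmldoc view wraps the root
```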
4. COMPUTING THE iDM GRAPH
This section outlines how to compute resource views and resource view graphs. In the following, we show that all components of a resource view may be computed lazily (Section 4.1). After that, we discuss three paradigms to compute resource view components: extensional components (Section 4.2), intensional components (Section 4.3), and infinite components (Section 4.4).

4.1 Lazy Resource Views
It is important to understand that all components of a resource view may be computed on demand (aka lazily or intensionally). We do not require that any of the components of the iDM graph are materialized beforehand. Whether the components of a resource view are materialized or not is hidden by modelling a resource view as an interface consisting of four get-methods, one for each component:

Interface ResourceView {
  getNameComponent(): return η
  getTupleComponent(): return τ
  getContentComponent(): return χ
  getGroupComponent(): return γ
}

As a consequence, each resource view hides how, when and even where the different components are computed. For instance, in Figure 1 the subgraph representing the contents of the LaTeX file 'vldb2006.tex' may be transformed to an iDM graph if a user requests that information, i.e. when she calls getGroupComponent() on resource view V_vldb2006.tex. In the following sections, we discuss how to obtain the data returned by the four different get*Component methods.
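The interface above suggests a straightforward lazy implementation: each component is computed only when first requested and then cached. The following Python sketch (our own illustration, with hypothetical names) wraps a folder as a resource view whose group component is materialized on demand.

```python
import os

class LazyFolderView:
    """A folder resource view whose gamma component is computed on demand."""
    def __init__(self, path: str):
        self._path = path
        self._group = None  # gamma not materialized yet

    def get_name_component(self):
        return os.path.basename(self._path)

    def get_tuple_component(self):
        st = os.stat(self._path)  # fetched from the subsystem when asked
        return (("size", "creation time", "last modified time"),
                (st.st_size, st.st_ctime, st.st_mtime))

    def get_content_component(self):
        return None  # folders have an empty chi component

    def get_group_component(self):
        if self._group is None:  # compute once, on first access
            self._group = {LazyFolderView(os.path.join(self._path, c))
                           for c in os.listdir(self._path)}
        return (self._group, ())  # (S, Q)
```

The same pattern extends naturally to intensional views (the group component runs a query or calls a remote service) and to infinite components (the group component returns a generator instead of a set).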
About the Tutorial
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics. In this tutorial, we will learn the various features of Python Pandas and how to use them in practice.

Audience
This tutorial has been prepared for those who seek to learn the basics and various functions of Pandas. It will be specifically useful for people working with data cleansing and analysis. After completing this tutorial, you will find yourself at a moderate level of expertise from where you can take yourself to higher levels of expertise.

Prerequisites
You should have a basic understanding of computer programming terminologies. A basic understanding of any of the programming languages is a plus. The Pandas library uses most of the functionalities of NumPy. It is suggested that you go through our tutorial on NumPy before proceeding with this tutorial. You can access it from: NumPy Tutorial.

Disclaimer & Copyright
Copyright 2017 by Tutorials Point (I) Pvt. Ltd. All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited to reuse, retain, copy, distribute or republish any contents or a part of contents of this e-book in any manner without written consent of the publisher. We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents including this tutorial.
If you discover any errors on our website or in this tutorial, please notify us at ******************************************.

Table of Contents
About the Tutorial; Audience; Prerequisites; Disclaimer & Copyright; Table of Contents
1. Pandas – Introduction
2. Pandas – Environment Setup
3. Pandas – Introduction to Data Structures: Dimension & Description; Series; DataFrame; Data Type of Columns; Panel
4. Pandas – Series: pandas.Series; Create an Empty Series; Create a Series from ndarray; Create a Series from dict; Create a Series from Scalar; Accessing Data from Series with Position; Retrieve Data Using Label (Index)
5. Pandas – DataFrame: pandas.DataFrame; Create DataFrame; Create an Empty DataFrame; Create a DataFrame from Lists; Create a DataFrame from Dict of ndarrays / Lists; Create a DataFrame from List of Dicts; Create a DataFrame from Dict of Series; Column Selection; Column Addition; Column Deletion; Row Selection, Addition, and Deletion
6. Pandas – Panel: pandas.Panel(); Create Panel; Selecting the Data from Panel
7. Pandas – Basic Functionality: DataFrame Basic Functionality
8. Pandas – Descriptive Statistics: Functions & Description; Summarizing Data
9. Pandas – Function Application: Table-wise Function Application; Row or Column Wise Function Application; Element Wise Function Application
10. Pandas – Reindexing: Reindex to Align with Other Objects; Filling while ReIndexing; Limits on Filling while Reindexing; Renaming
11. Pandas – Iteration: Iterating a DataFrame; iteritems(); iterrows(); itertuples()
12. Pandas – Sorting: By Label; Sorting Algorithm
13. Pandas – Working with Text Data
14. Pandas – Options and Customization: get_option(param); set_option(param,value); reset_option(param); describe_option(param); option_context()
15. Pandas – Indexing and Selecting Data: .loc(); .iloc(); .ix(); Use of Notations
16. Pandas – Statistical Functions: Percent_change; Covariance; Correlation; Data Ranking
17. Pandas – Window Functions: .rolling() Function; .expanding() Function; .ewm() Function
18. Pandas – Aggregations: Applying Aggregations on DataFrame
19. Pandas – Missing Data: Cleaning / Filling Missing Data; Replace NaN with a Scalar Value; Fill NA Forward and Backward; Drop Missing Values; Replace Missing (or) Generic Values
20. Pandas – GroupBy: Split Data into Groups; View Groups; Iterating through Groups; Select a Group; Aggregations; Transformations; Filtration
21. Pandas – Merging/Joining: Merge Using 'how' Argument
22. Pandas – Concatenation: Concatenating Objects; Time Series
23. Pandas – Date Functionality
24. Pandas – Timedelta
25. Pandas – Categorical Data: Object Creation
26. Pandas – Visualization: Bar Plot; Histograms; Box Plots; Area Plot; Scatter Plot; Pie Chart
27. Pandas – IO Tools: read.csv
28. Pandas – Sparse Data
29. Pandas – Caveats & Gotchas
30. Pandas – Comparison with SQL

1. Pandas – Introduction
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools using its powerful data structures.
The name Pandas is derived from "panel data", an econometrics term for multidimensional data. In 2008, developer Wes McKinney started developing pandas when in need of a high-performance, flexible tool for the analysis of data.
Prior to Pandas, Python was majorly used for data munging and preparation; it contributed little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

Key Features of Pandas
∙ Fast and efficient DataFrame object with default and customized indexing.
∙ Tools for loading data into in-memory data objects from different file formats.
∙ Data alignment and integrated handling of missing data.
∙ Reshaping and pivoting of data sets.
∙ Label-based slicing, indexing and subsetting of large data sets.
∙ Columns from a data structure can be deleted or inserted.
∙ Group by data for aggregation and transformations.
∙ High-performance merging and joining of data.
∙ Time series functionality.

2. Pandas – Environment Setup
The standard Python distribution doesn't come bundled with the Pandas module. A lightweight alternative is to install Pandas using the popular Python package installer, pip:

pip install pandas

If you install the Anaconda Python package, Pandas will be installed by default with the following:

Windows
∙ Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy stack. It is also available for Linux and Mac.
∙ Canopy (https:///products/canopy/) is available as a free as well as commercial distribution with the full SciPy stack for Windows, Linux and Mac.
∙ Python (x,y) is a free Python distribution with the SciPy stack and Spyder IDE for Windows OS. (Downloadable from http://python-xy.github.io/)

Linux
Package managers of the respective Linux distributions are used to install one or more packages in the SciPy stack.
For Ubuntu users:
sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
For Fedora users:
sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel

3. Pandas – Introduction to Data Structures
Pandas deals with the following three data structures:
∙ Series
∙ DataFrame
∙ Panel
These data structures are built on top of the NumPy array, which means they are fast.

Dimension & Description
The best way to think of these data structures is that the higher-dimensional data structure is a container of its lower-dimensional data structure. For example, a DataFrame is a container of Series, and a Panel is a container of DataFrames.

Data Structure | Dimensions | Description
Series         | 1          | 1D labeled homogeneous array, size-immutable.
Data Frames    | 2          | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.
Panel          | 3          | General 3D labeled, size-mutable array.

Building and handling two or more dimensional arrays is a tedious task, and a burden is placed on the user to consider the orientation of the data set when writing functions. But using Pandas data structures, the mental effort of the user is reduced. For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.

Mutability
All Pandas data structures are value mutable (can be changed) and, except Series, all are size mutable.
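A short snippet illustrates the point about labels versus axes; the data here is made up and any recent pandas version will do:

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [4.5, 3.2], "stock": [10, 25]},
    index=["apples", "oranges"],
)
print(df.loc["apples", "price"])  # select by row/column label -> 4.5
print(df.sum(axis=0))             # the same data phrased positionally:
                                  # sum down axis 0, i.e. one total per column
```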
Series is size immutable.
Note: DataFrame is widely used and is one of the most important data structures. Panel is used very little.

Series
A Series is a one-dimensional array-like structure with homogeneous data. For example, the following series is a collection of integers: 10, 23, 56, 17, 52, 61, 73, 90, 26, 72.
Key Points
∙ Homogeneous data
∙ Size immutable
∙ Values of data mutable

DataFrame
A DataFrame is a two-dimensional array with heterogeneous data. For example:

Name  | Age | Gender | Rating
Steve | 32  | Male   | 3.45
Lia   | 28  | Female | 4.6
Vin   | 45  | Male   | 3.9
Katie | 38  | Female | 2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.

Data Type of Columns
The data types of the four columns are as follows:

Column | Type
Name   | String
Age    | Integer
Gender | String
Rating | Float

Key Points
∙ Heterogeneous data
∙ Size mutable
∙ Data mutable

Panel
A Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent a panel graphically, but it can be illustrated as a container of DataFrames.
Key Points
∙ Heterogeneous data
∙ Size mutable
∙ Data mutable
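To tie the structures back to code, here is a minimal sketch creating the Series and the sales-team DataFrame described above. It uses the standard pandas API; note that Panel was removed in pandas 0.25, so only the first two structures are shown.

```python
import pandas as pd

# Series: 1-D, homogeneous, size-immutable
s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])

# DataFrame: 2-D, heterogeneously typed columns, size-mutable
df = pd.DataFrame({
    "Name":   ["Steve", "Lia", "Vin", "Katie"],
    "Age":    [32, 28, 45, 38],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Rating": [3.45, 4.6, 3.9, 2.78],
})
print(df.dtypes)      # object, int64, object, float64: one dtype per column
df["Team"] = "Sales"  # size-mutable: adding a column is allowed
```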
Text Joins in an RDBMS for Web Data Integration
Luis Gravano, Panagiotis G. Ipeirotis (Columbia University); Nick Koudas, Divesh Srivastava (AT&T Labs–Research)
Email: {gravano,pirot}@ {koudas,divesh}@

ABSTRACT
The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is the matching of strings that refer to the same entity.
In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness reasons.
Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.

Categories and Subject Descriptors: H.2.5 [Database Management]: Heterogeneous Databases; H.2.4 [Database Management]: Systems—Relational databases, Textual databases; H.2.8 [Database Management]: Database Applications—Data mining
General Terms: Algorithms, Measurement, Performance, Experimentation
Keywords: text indexing, data cleaning, approximate text matching
Copyright is held by the author/owner(s). WWW2003, May 20–24, 2003, Budapest, Hungary. ACM 1-58113-680-3/03/0005.

1. INTRODUCTION
The integration of information from heterogeneous web sources is of central interest for applications such as catalog data integration and warehousing of web data (e.g., job advertisements and announcements). Such data is typically textual and can be obtained from disparate web sources in a variety of ways, including web site crawling and direct access to remote databases via web protocols. The integration of such web data exhibits many semantics- and performance-related challenges.
Consider a price-comparison web site, backed by a database, that combines product information from different vendor web sites and presents the results under a uniform interface to the user. In such a situation, one cannot assume the existence of global identifiers (i.e., unique keys) for products across the autonomous vendor web sites. This raises a fundamental problem: different vendors may use different names to describe the same product. For example, a vendor might list a hard disk as "Western Digital 120Gb 7200rpm," while another might refer to the same disk as "Western Digiral HDD 120Gb" (due to a spelling mistake) or even as "WD 120Gb 7200rpm" (using an abbreviation). A simple equality comparison on product names will not properly identify these descriptions as referring to the same entity. This could result in the same product entity from different vendors being treated as separate products, defeating the purpose of
the price-comparison web site. To effectively address the integration problem, one needs to match multiple textual descriptions, accounting for:
• erroneous information (e.g., typing mistakes);
• abbreviated, incomplete or missing information;
• differences in information "formatting" due to the lack of standard conventions (e.g., for addresses);
or combinations thereof.
Any attempt to address the integration problem has to specify a measure that effectively quantifies "closeness" or "similarity" between string attributes. Such a similarity metric can help establish that "Microsoft Windows XP Professional" and "Windows XP Pro" correspond to the same product across the web sites/databases, and that these are different from the "Windows NT" product. Many approaches to data integration use a text matching step, where similar textual entries are matched together as potential duplicates. Although text matching is an important component of such systems [1, 21, 23], little emphasis has been paid to the efficiency of this operation.
Once a text similarity metric is specified, there is a clear requirement for algorithms that process the data from the multiple sources to identify all pairs of strings (or sets of strings) that are sufficiently similar to each other. We refer to this operation as a text join. To perform such a text join on data originating at different web sites, we can utilize "web services" to fully download and materialize the data at a local relational database management system (RDBMS). Once this materialization has been performed, problems and inconsistencies can be handled locally via text join operations. It is desirable for scalability and effectiveness to fully utilize the RDBMS capabilities to execute such operations.
In this paper, we present techniques for performing text joins efficiently and robustly in an unmodified RDBMS. Our text joins rely on the cosine similarity metric [20], which has been successfully used in the past in the WHIRL system [4] for a similar data integration task. Our contributions include:
• A purely-SQL sampling-based strategy to compute approximate text joins; our technique, which is based on the approximate matrix multiplication algorithm in [2], can be fully executed within standard RDBMSs, with no modification of the underlying query processing engine or index infrastructure.
• A thorough experimental evaluation of our algorithms, including a study of the accuracy and performance of our approach against other applicable strategies. Our experiments use large, real-life data sets.
• A discussion of the merits of alternative string similarity metrics for the definition of text joins.
The remainder of this paper is organized as follows. Section 2 presents background and notation necessary for the rest of the discussion, and introduces a formal statement of our problem. Section 3 presents SQL statements to preprocess relational tables so that we can apply the sampling-based text join algorithm of Section 4. Then, Section 5 presents the implementation of the text join algorithm in SQL. A preliminary version of Sections 3 and 5 appears in [12]. Section 6 reports a detailed experimental evaluation of our techniques in terms of both accuracy and performance, and in comparison with other applicable approaches. Section 7 discusses the relative merits of alternative string similarity metrics. Section 8 reviews related work. Finally, Section 9 concludes the paper and discusses possible extensions of our work.

2. BACKGROUND AND PROBLEM
In this section, we first provide notation and background for text joins, which we follow with a formal
definition of the problem on which we focus in this paper.
We denote with Σ* the set of all strings over an alphabet Σ. Each string in Σ* can be decomposed into a collection of atomic "entities" that we generally refer to as tokens. What constitutes a token can be defined in a variety of ways. For example, the tokens of a string could simply be defined as the "words" delimited by special characters that are treated as "separators" (e.g., ' '). Alternatively, the tokens of a string could correspond to all of its q-grams, which are overlapping substrings of exactly q consecutive characters, for a given q. Our forthcoming discussion treats the term token as generic, as the particular choice of token is orthogonal to the design of our algorithms. Later, in Section 6 we experiment with different token definitions, while in Section 7 we discuss the effect of token choice on the characteristics of the resulting similarity function.
Let R1 and R2 be two relations with the same or different attributes and schemas. To simplify our discussion and notation we assume, without loss of generality, that we assess similarity between the entire sets of attributes of R1 and R2. Our discussion extends to the case of arbitrary subsets of attributes in a straightforward way. Given tuples t1 ∈ R1 and t2 ∈ R2, we assume that the values of their attributes are drawn from Σ*. We adopt the widely used vector-space retrieval model [20] from the information retrieval field to define the textual similarity between t1 and t2.
Let D be the (arbitrarily ordered) set of all unique tokens present in all values of attributes of both R1 and R2. According to the vector-space retrieval model, we conceptually map each tuple t ∈ Ri to a vector v_t ∈ ℝ^|D|. The value of the j-th component v_t(j) of v_t is a real number that corresponds to the weight of the j-th token of D in v_t. Drawing an analogy with information retrieval terminology, D is the set of all terms and v_t is a document weight vector.
Rather than developing new ways to define the weight vector v_t for a tuple t ∈ Ri, we exploit an instance of the well-established tf.idf weighting scheme from the information retrieval field. (tf.idf stands for "term frequency, inverse document frequency.") Our choice is further supported by the fact that a variant of this general weighting scheme has been successfully used for our task by Cohen's WHIRL system [4]. Given a collection of documents C, a simple version of the tf.idf weight for a term w and a document d is defined as tf_w · log(idf_w), where tf_w is the number of times that w appears in document d and idf_w is |C|/n_w, where n_w is the number of documents in the collection C that contain term w. The tf.idf weight for a term w in a document is high if w appears a large number of times in the document and w is a sufficiently "rare" term in the collection (i.e., if w's discriminatory power in the collection is potentially high). For example, for a collection of company names, relatively infrequent terms such as "AT&T" or "IBM" will have higher idf weights than more frequent terms such as "Inc."
For our problem, the relation tuples are our "documents," and the tokens in the textual attributes of the tuples are our "terms." Consider the j-th token w in D and a tuple t from relation Ri.
Then tf_w is the number of times that w appears in t. Also, idf_w is |Ri|/n_w, where n_w is the total number of tuples in relation Ri that contain token w. The tf.idf weight for token w in tuple t ∈ Ri is v_t(j) = tf_w · log(idf_w). To simplify the computation of vector similarities, we normalize vector v_t to unit length in Euclidean space after we define it. The resulting weights correspond to the impact of the terms, as defined in [24]. Note that the weight vectors will tend to be extremely sparse for certain choices of tokens; we shall seek to utilize this sparseness in our proposed techniques.

DEFINITION 1. Given tuples t1 ∈ R1 and t2 ∈ R2, let v_t1 and v_t2 be their corresponding normalized weight vectors and let D be the set of all tokens in R1 and R2. The cosine similarity (or just similarity, for brevity) of v_t1 and v_t2 is defined as sim(v_t1, v_t2) = Σ_{j=1}^{|D|} v_t1(j) · v_t2(j).

Since the vectors are normalized, this measure corresponds to the cosine of the angle between vectors v_t1 and v_t2, and has values between 0 and 1. The intuition behind this scheme is that the magnitude of a component of a vector expresses the relative "importance" of the corresponding token in the tuple represented by the vector. Intuitively, two vectors are similar if they share many important tokens. For example, the string "ACME" will be highly similar to "ACME Inc," since the two strings differ only in the token "Inc," which appears in many different tuples and hence has low weight. On the other hand, the strings "IBM Research" and "AT&T Research" will have lower similarity as they share only one relatively common term.
The following join between relations R1 and R2 brings together the tuples from these relations that are "sufficiently close" to each other, according to a user-specified similarity threshold φ:

DEFINITION 2. Given two relations R1 and R2, together with a similarity threshold 0 < φ ≤ 1, the text join R1 ⋈φ R2 returns all pairs of tuples (t1, t2) such that t1 ∈ R1 and t2 ∈ R2, and sim(v_t1, v_t2) ≥ φ.

The text join "correlates" two relations for a given similarity threshold φ. It can be easily modified to correlate arbitrary subsets of attributes of the relations. In this paper, we address the problem of computing the text join of two relations efficiently and within an unmodified RDBMS:

PROBLEM 1. Given two relations R1 and R2, together with a similarity threshold 0 < φ ≤ 1, we want to efficiently compute (an approximation of) the text join R1 ⋈φ R2 using "vanilla" SQL in an unmodified RDBMS.

In the sequel, we first describe our methodology for deriving, in a preprocessing step, the vectors corresponding to each tuple of relations R1 and R2 using relational operations and representations. We then present a sampling-based solution for efficiently computing the text join of the two relations using standard SQL in an RDBMS.
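As a rough in-memory illustration (not from the paper), the following Python sketch implements Definitions 1 and 2 naively, using whole words as tokens and the per-relation idf formula given above; the tiny relations and the threshold are invented for the example:

import math
from collections import Counter

def idf(relation):
    # idf_w = |R_i| / n_w, where n_w counts the tuples of R_i containing token w
    n = Counter(tok for tup in relation for tok in set(tup))
    return {w: len(relation) / c for w, c in n.items()}

def weight_vector(tokens, idf_map):
    # v_t(j) = tf_w * log(idf_w), then normalized to unit Euclidean length
    tf = Counter(tokens)
    v = {w: f * math.log(idf_map[w]) for w, f in tf.items()}
    norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {w: x / norm for w, x in v.items()}

def text_join(r1, r2, phi):
    # Definition 2: return all (t1, t2) index pairs with sim(v_t1, v_t2) >= phi
    v1 = [weight_vector(t, idf(r1)) for t in r1]
    v2 = [weight_vector(t, idf(r2)) for t in r2]
    return [(i, j)
            for i, a in enumerate(v1)
            for j, b in enumerate(v2)
            if sum(x * b.get(w, 0.0) for w, x in a.items()) >= phi]

r1 = [["western", "digital", "120gb", "7200rpm"],
      ["maxtor", "40gb"], ["ibm", "laptop"]]
r2 = [["wd", "120gb", "7200rpm"], ["windows", "nt"], ["toshiba", "40gb"]]
print(text_join(r1, r2, phi=0.4))   # [(0, 0), (1, 2)]

This is exactly the quadratic "compare every pair" computation that the paper's sampling technique is designed to avoid.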
3. TUPLE WEIGHT VECTORS
In this section, we describe how we define auxiliary relations to represent tuple weight vectors, which we later use in our purely-SQL text join approximation strategy.
As in Section 2, assume that we want to compute the text join R1 ⋈φ R2 of two relations R1 and R2. D is the ordered set of all the tokens that appear in R1 and R2. We use SQL expressions to create the weight vector associated with each tuple in the two relations. Since, for some choice of tokens, each tuple is expected to contain only a few of the tokens in D, the associated weight vector is sparse. We exploit this sparseness and represent the weight vectors by storing only the tokens with non-zero weight. Specifically, for a choice of tokens (e.g., words or q-grams), we create the following relations for a relation Ri:
• RiTokens(tid, token): Each tuple (tid, w) is associated with an occurrence of token w in the Ri tuple with id tid. This relation is populated by inserting exactly one tuple (tid, w) for each occurrence of token w in a tuple of Ri with tuple id tid. This relation can be implemented in pure SQL and the implementation varies with the choice of tokens. (See [10] for an example of how to create this relation when q-grams are used as tokens.)
• RiIDF(token, idf): A tuple (w, idf_w) indicates that token w has inverse document frequency idf_w (Section 2) in relation Ri. The SQL statement to populate relation RiIDF is shown in Figure 1(a). This statement relies on a "dummy" relation RiSize(size) (Figure 1(f)) that has just one tuple indicating the number of tuples in Ri.
• RiTF(tid, token, tf): A tuple (tid, w, tf_w) indicates that token w has term frequency tf_w (Section 2) for the Ri tuple with tuple id tid. The SQL statement to populate relation RiTF is shown in Figure 1(b).
• RiLength(tid, len): A tuple (tid, l) indicates that the weight vector associated with the Ri tuple with tuple id tid has a Euclidean norm of l. (This relation is used for normalizing weight vectors.) The SQL statement to populate relation RiLength is shown in Figure 1(c).
• RiWeights(tid, token, weight): A tuple (tid, w, n) indicates that token w has normalized weight n in the Ri tuple with tuple id tid. The SQL statement to populate relation RiWeights is shown in Figure 1(d). This relation materializes a compact representation of the final weight vector for the tuples in Ri.
• RiSum(token, total): A tuple (w, t) indicates that token w has a total added weight t in relation Ri, as indicated in relation RiWeights. These numbers are used during sampling (see Section 4). The SQL statement to populate relation RiSum is shown in Figure 1(e).

Figure 1: Preprocessing SQL statements to create auxiliary relations for relation Ri.
(a) Relation with token idf counts:
INSERT INTO RiIDF(token, idf)
SELECT T.token, LOG(S.size) - LOG(COUNT(UNIQUE(*)))
FROM RiTokens T, RiSize S
GROUP BY T.token, S.size
(b) Relation with token tf counts:
INSERT INTO RiTF(tid, token, tf)
SELECT T.tid, T.token, COUNT(*)
FROM RiTokens T
GROUP BY T.tid, T.token
(c) Relation with weight-vector lengths:
INSERT INTO RiLength(tid, len)
SELECT T.tid, SQRT(SUM(I.idf*I.idf*T.tf*T.tf))
FROM RiIDF I, RiTF T
WHERE I.token = T.token
GROUP BY T.tid
(d) Final relation with normalized tuple weight vectors:
INSERT INTO RiWeights(tid, token, weight)
SELECT T.tid, T.token, I.idf*T.tf/L.len
FROM RiIDF I, RiTF T, RiLength L
WHERE I.token = T.token AND T.tid = L.tid
(e) Relation with total token weights:
INSERT INTO RiSum(token, total)
SELECT R.token, SUM(R.weight)
FROM RiWeights R
GROUP BY R.token
(f) Dummy relation used to create RiIDF:
INSERT INTO RiSize(size)
SELECT COUNT(*) FROM Ri

Given two relations R1 and R2, we can use the SQL statements in Figure 1 to generate relations R1Weights and R2Weights with a compact representation of the weight vectors for the R1 and R2 tuples. Only the non-zero tf.idf weights are stored in these tables. Interestingly, RiWeights and RiSum are the only tables that need to be preserved for the computation of R1 ⋈φ R2 that we describe in the remainder of the paper: all other tables are just necessary to construct RiWeights and RiSum.
The space overhead introduced by these tables is moderate. Since the size of RiSum is bounded by the size of RiWeights, we just analyze the space requirements for RiWeights. Consider the case where q-grams are the tokens of choice. (As we will see, a good value is q = 3.) Then each tuple Ri.tj of relation Ri can contribute up to approximately |Ri.tj| q-grams to relation RiWeights, where |Ri.tj| is the number of characters in Ri.tj. Furthermore, each tuple in RiWeights consists of a tuple id tid, the actual token (i.e., q-gram in this case), and its associated weight. Then, if C bytes are needed to represent tid and weight, the total size of relation RiWeights will not exceed Σ_{j=1}^{|Ri|} (C + q) · |Ri.tj| = (C + q) · Σ_{j=1}^{|Ri|} |Ri.tj|, which is a (small) constant times the size of the original table Ri. If words are used as the token of choice, then we have at most |Ri.tj| tokens per tuple in Ri. Also, to store the token attribute of RiWeights we need no more than one byte for each character in the Ri.tj tuples. Therefore, we can bound the size of RiWeights by 1 + C times the size of Ri.
Again, in this case the space overhead is linear in the size of the original relation Ri.
Given the relations R1Weights and R2Weights, a baseline approach [13, 18] to compute R1 ⋈φ R2 is shown in Figure 2. This SQL statement performs the text join by computing the similarity of each pair of tuples and filtering out any pair with similarity less than the similarity threshold φ. This approach produces an exact answer to R1 ⋈φ R2 for φ > 0. Unfortunately, as we will see in Section 6, finding an exact answer with this approach is prohibitively expensive, which motivates the sampling-based technique that we describe next.

Figure 2: Baseline approach for computing the exact value of R1 ⋈φ R2.
SELECT r1w.tid AS tid1, r2w.tid AS tid2
FROM R1Weights r1w, R2Weights r2w
WHERE r1w.token = r2w.token
GROUP BY r1w.tid, r2w.tid
HAVING SUM(r1w.weight*r2w.weight) ≥ φ

4. SAMPLING-BASED TEXT JOINS
The result of R1 ⋈φ R2 only contains pairs of tuples from R1 and R2 with similarity φ or higher. Usually we are interested in high values for threshold φ, which should result in only a few tuples from R2 typically matching each tuple from R1. The baseline approach in Figure 2, however, calculates the similarity of all pairs of tuples from R1 and R2 that share at least one token. As a result, this baseline approach is inefficient: most of the candidate tuple pairs that it considers do not make it to the final result of the text join. In this section, we describe a sampling-based technique [2] to execute text joins efficiently, drastically reducing the number of candidate tuple pairs that are considered during query processing.
The sampling-based technique relies on the following intuition: R1 ⋈φ R2 could be computed efficiently if, for each tuple tq of R1, we managed to extract a sample from R2 containing mostly tuples suspected to be highly similar to tq. By ignoring the remaining (useless) tuples in R2, we could approximate R1 ⋈φ R2 efficiently. The key challenge then is how to define a sampling strategy that leads to efficient text join executions while producing an accurate approximation of the exact query results. The discussion of the technique is organized as follows:
• Section 4.1 shows how to sample the tuple vectors of R2 to estimate the tuple-pair similarity values.
• Section 4.2 describes an efficient algorithm for computing an approximation of the text join.
The sampling algorithm described in this section is an instance of the approximate matrix multiplication algorithm presented in [2], which computes an approximation of the product A = A1 · ... · An, where each Ai is a numeric matrix. (In our problem, n = 2.) The actual matrix multiplication A' = A2 · ... · An happens during a preprocessing, off-line step. Then, the on-line part of the algorithm works by processing the matrix A1 row by row.

4.1 Token-Weighted Sampling
Consider tuple tq ∈ R1 with its associated token weight vector v_tq, and each tuple ti ∈ R2 with its associated token weight vector v_ti. When tq is clear from the context, to simplify the notation we use σi as shorthand for sim(v_tq, v_ti). We extract a sample of R2 tuples of size S for tq as follows:
• Identify each token j in tq that has non-zero weight v_tq(j), 1 ≤ j ≤ |D|.
• For each such token j, perform S Bernoulli trials over each ti ∈ {t1, ..., t|R2|}, where the probability of picking ti in a trial depends on the weight of token j in tuple tq ∈ R1 and in tuple ti ∈ R2. Specifically, this probability is p_ij = v_tq(j) · v_ti(j) / TV(tq), where TV(tq) = Σ_{i=1}^{|R2|} σi is the sum of the similarities of tuple tq with each tuple ti ∈ R2. In Section 5 we show how we can implement the sampling step even if we do not know the value of TV(tq).
Let Ci be the number of times that ti appears in the sample of size S. It follows that:

THEOREM 1. The expected value of (Ci/S) · TV(tq) is σi.

The proof of this theorem follows from an argument similar to that in [2] and from the observation that the mean of the process that generates Ci is Σ_{j=1}^{|D|} v_tq(j) · v_ti(j) / TV(tq) = σi / TV(tq).
Theorem 1 establishes that, given a tuple tq ∈ R1, we can obtain a sample of size S of tuples ti such that the frequency Ci of tuple ti can be used to approximate σi. We can then report ⟨tq, ti⟩ as part of the answer of R1 ⋈φ R2 for each tuple ti ∈ R2 such that its estimated similarity with tq (i.e., its estimated σi) is φ' or larger, where φ' = (1 − ε)φ is a threshold slightly lower than φ. (For all practical purposes, ε is treated as a positive constant less than 1.)
Given R1, R2, and a threshold φ, our discussion suggests the following strategy for the evaluation of the R1 ⋈φ R2 text join, in which we process one tuple tq ∈ R1 at a time:
• Obtain an individual sample of size S from R2 for tq, using vector v_tq to sample tuples of R2 for each token with non-zero weight in v_tq.
• If Ci is the number of times that tuple ti appears in the sample for tq, then use (Ci/S) · TV(tq) as an estimate of σi.
• Include tuple pair ⟨tq, ti⟩ in the result only if (Ci/S) · TV(tq) ≥ φ' (or equivalently Ci ≥ (S/TV(tq)) · φ'), and filter out the remaining R2 tuples.
This strategy guarantees that we can identify all pairs of tuples with similarity of at least φ, with a desired probability, as long as we choose an appropriate sample size S. So far, the discussion has focused on obtaining an R2 sample of size S individually for each tuple tq ∈ R1. A naive implementation of this sampling strategy would then require a scan of relation R2 for each tuple in R1, which is clearly unacceptable in terms of performance. In the next section we describe how the sampling can be performed with only one sequential scan of relation R2.
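The following Python fragment simulates the per-tuple sampling of Section 4.1, purely to illustrate Theorem 1; it is not the paper's SQL realization. It assumes sparse weight dictionaries such as those built in the earlier sketch, and that tq shares at least one token with R2 (so that TV(tq) > 0):

import random

def sample_estimates(v_tq, r2_vectors, S):
    # sigma_i and TV(t_q) computed directly here for clarity;
    # Section 5 shows how to avoid knowing TV(t_q) in advance
    sims = [sum(w * vi.get(t, 0.0) for t, w in v_tq.items()) for vi in r2_vectors]
    TV = sum(sims)                              # TV(t_q) = sum_i sigma_i (assumed > 0)
    counts = [0] * len(r2_vectors)
    for t, wq in v_tq.items():                  # each non-zero token of t_q ...
        for i, vi in enumerate(r2_vectors):
            p = wq * vi.get(t, 0.0) / TV        # p_ij = v_tq(j) * v_ti(j) / TV(t_q)
            counts[i] += sum(random.random() < p for _ in range(S))  # S Bernoulli trials
    # Theorem 1: E[(C_i / S) * TV(t_q)] = sigma_i
    return [c / S * TV for c in counts]

Running this with a large S and comparing the returned estimates against the exact sims list gives a quick empirical check of the theorem.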
4.2 Practical Realization of Sampling
As discussed so far, the sampling strategy requires extracting a separate sample from R2 for each tuple in R1. This extraction of a potentially large set of independent samples from R2 (i.e., one per R1 tuple) is of course inefficient, since it would require a large number of scans of the R2 table. In this section, we describe how to adapt the original sampling strategy so that it requires one single sample of R2, following the "presampling" implementation in [2]. We then show how to use this sample to create an approximate answer for the text join R1 ⋈φ R2.
As we have seen in the previous section, for each tuple tq ∈ R1 we should sample a tuple ti from R2 in a way that depends on the v_tq(j) · v_ti(j) values. Since these values are different for each tuple of R1, a straightforward implementation of this sampling strategy requires multiple samples of relation R2. Here we describe an alternative sampling strategy that requires just one sample of R2: First, we sample R2 using only the v_ti(j) weights from the tuples ti of R2, to generate a single sample of R2. Then, we use the single sample differently for each tuple tq of R1. Intuitively, we "weight" the tuples in the sample according to the weights v_tq(j) of the tq tuples of R1. In particular, for a desired sample size S and a target similarity φ, we realize the sampling-based text join R1 ⋈φ R2 in three steps:
1. Sampling: We sample the tuple ids i and the corresponding tokens from the vectors v_ti for each tuple ti ∈ R2. We sample each token j from a vector v_ti with probability v_ti(j). (We define Sum(j) as the total weight of the j-th token in relation R2, Sum(j) = Σ_{i=1}^{|R2|} v_ti(j). These weights are kept in relation R2Sum.) We perform S trials, yielding approximately S samples for each token j. We insert into R2Sample tuples of the form ⟨i, j⟩ as many times as there were successful trials for the pair. Alternatively, we can create tuples of the form ⟨i, j, c⟩, where c is the number of successful trials. This results in a compact representation of R2Sample, which is preferable in practice.
Then,we sample from vectors of R1and use R2to weight the sam-ples.Then,we take the union of these as ourfinal result.We refer to this as a symmetric text join.We will evaluate this technique experimentally in Section6.In this section we have described how to approximate the text join R1 IφR2by using weighted sampling.In the next section,we show how this approximate join can be completely implemented ina standard,unmodified RDBMS.5.SAMPLING AND JOINING TUPLE VEC-TORS IN SQLWe now describe our SQL implementation of the sampling-based join algorithm of Section4.2.Section5.1addresses the Sampling step,while Section5.2focuses on the Weighting and Thresholding steps for the asymmetric versions of the join.Finally,Section5.3 discusses the implementation of a symmetric version of the approx-imate join.5.1Implementing the Sampling Step in SQL Given the RiWeights relations,we now show how to implement the Sampling step of the text join approximation strategy(Sec-tion4.2)in SQL.For a desired sample size S and similarity thresh-oldφ,we create the auxiliary relation shown in Figure3.As the SQL statement in thefigure shows,we join the relations RiWeights and RiSum on the token attribute.The P attribute for a tuple inthe result is the probability RiWeights.weightRiSum.total with which we shouldpick this tuple(Section4.2).Conceptually,for each tuple in the output of the query of Figure3we need to perform S trials,pick-ing each time the tuple with probability P.For each successfulINSERT INTO RiSample(tid,token,c)SELECT rw.tid,rw.token,ROUND(S*rw.weight/rs.total,0)AS c FROM RiWeights rw,RiSum rsWHERE rw.token=rs.token ANDROUND(S*rw.weight/rs.total,0)>0Figure4:A deterministic version of the Sampling step,which results in a compact representation of RiSample.SELECT r1w.tid AS tid1,r2s.tid AS tid2FROM R1Weights r1w,R2Sample r2s,R2Sum r2sum,R1V r1vWHERE r1w.token=r2s.token ANDr1w.token=r2sum.token ANDr1w.tid=r1v.tidGROUP BY r1w.tid,r2s.tid,HAVING SUM(r1w.weight*r2sum.total/)≥S∗φ / Figure5:Implementing the Weighting and Thresholding steps in SQL.This query corresponds to the asymmetric execution of the sampling-based text join,where we sample R2and weight the sample using R1.trial,we insert the corresponding tuple tid,token in a relation RiSample(tid,token),preserving duplicates.The S trials can be implemented in various ways.One(expensive)way to do this is as follows:We add“AND P≥RAND()”in the WHERE clause of the Figure3query,so that the execution of this query corresponds to one“trial.”Then,executing this query S times and taking the union of the all results provides the desired answer.A more efficient al-ternative,which is what we implemented,is to open a cursor on the result of the query in Figure3,read one tuple at a time,perform S trials on each tuple,and then write back the result.Finally,a pure-SQL“simulation”of the Sampling step deterministically de-fines that each tuple will result in Round(S·RiWeights.weightRiSum.total)“suc-cesses”after S trials,on average.This deterministic version of the query is shown in Figure42.We have implemented and run exper-iments using the deterministic version,and obtained virtually the same performance as with the cursor-based implementation of sam-pling over the Figure3query.In the rest of the paper–to keep the discussion close to the probabilistic framework–we use the cursor-based approach for the Sampling step.5.2Implementing the Weighting and Thresh-olding Steps in SQLSection4.2described the Weighting and Thresholding steps as two separate steps.In practice,we can 
combine them into one SQL statement,shown in Figure5.The Weighting step is implemented by the SUM aggregate in the HA VING clause.We weight each tuple from the sample according to R1W eights.weight·R2Sum.totalR1V.T V, which corresponds to v t q(j)·Sum(j)V q(see Section4.2).Then,we can count the number of times that each particular tuple pair ap-pears in the results(see GROUP BY clause).For each group,the result of the SUM is the number of times C i that a specific tuple pair appears in the candidate set.To implement the Thresholding step,we apply the countfilter as a simple comparison in the HA V-ING clause:we check whether the frequency of a tuple pair at least matches the count threshold(i.e.,C i≥ST V(t q)φ ).Thefinal out-put of this SQL operation is a set of tuple id pairs with expected similarity of at leastφ.The SQL statement in Figure5can be fur-ther simplified by completely eliminating the join with the R1V 2Note that this version of RiSample uses the compact representation in which each tid-token pair has just one associated row.。
Heterogeneous Analysis

Heterogeneous analysis refers to the process of examining diverse and varied data sources to gain insights, identify patterns, and make informed decisions. It involves analyzing data that come in different types, formats, or structures, or from different sources.

Heterogeneous analysis can be applied in various fields, including business, science, the social sciences, and technology. It often requires combining and integrating data from multiple sources, such as structured databases, unstructured text documents, images, and audio and video files.

The goal of heterogeneous analysis is to uncover hidden patterns, relationships, and trends that may not be evident when analyzing each data source separately. By bringing together different types of data, researchers or analysts can gain a more comprehensive understanding of the phenomenon under study and make more accurate predictions or conclusions.

The methods used in heterogeneous analysis can vary depending on the characteristics of the data being analyzed. They may include statistical techniques, data mining, machine learning, natural language processing, image or signal processing, and other advanced analytical tools.

Overall, heterogeneous analysis allows for a more holistic and comprehensive approach to data analysis, enabling organizations and researchers to extract valuable insights and make data-driven decisions across a wide range of domains.
Patent title: HETEROGENEOUS INDEXING FOR ANNOTATION SYSTEMS
Inventors: CRAGUN, Brian; RICE, Julia; SCHWARZ, Peter; SWOPE, William; TRAN, Hoa
Application number: EP2004051082
Filing date: 2004-06-10
Publication number: WO04/114148P1
Publication date: 2004-12-29
Abstract: Methods, systems, and articles of manufacture for indexing annotations made for a variety of different types of (i.e., heterogeneous) data objects are provided. A set of parameters uniquely identifying an annotated data object may be converted to an index comprising a set of index values, each corresponding to a column in a homogeneous index table. In order to accommodate the indexing of heterogeneous data objects, a mapping may be provided for each different type (or classification) of data object that may be annotated, which defines how the identifying parameters of that type will be mapped to the columns of the homogeneous index table.
Applicants: CRAGUN, Brian; RICE, Julia; SCHWARZ, Peter; SWOPE, William; TRAN, Hoa
Addresses: New Orchard Road, Armonk, NY 10504 US; PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU GB; 2613 24th Street NW, Rochester, MN 55901 US; 973 Helena Drive, Sunnyvale, CA 94087 US; 3108 Jenkins Avenue, San Jose, CA 95118 US; 15155 Sycamore Drive, Morgan Hill, CA 95037 US; 5022 19 1/2 Avenue NW, Rochester, MN 55901 US
Nationalities: US; GB; US; US; US; US; US
Agent: LITHERLAND, David, Peter
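As a toy illustration of the patent's general idea (the type names and column layout below are invented, not taken from the patent), a per-type mapping can flatten the identifying parameters of heterogeneous objects into one homogeneous index row:

INDEX_COLUMNS = ("type", "key1", "key2", "key3")

# One mapping per annotatable object type, as the abstract describes
MAPPINGS = {
    "db_row":   lambda o: ("db_row", o["table"], str(o["row_id"]), ""),
    "document": lambda o: ("document", o["path"], str(o["offset"]), str(o["length"])),
}

def index_entry(obj_type, obj):
    # Convert an annotated object's identifying parameters to a homogeneous index row
    return dict(zip(INDEX_COLUMNS, MAPPINGS[obj_type](obj)))

print(index_entry("document", {"path": "/reports/q1.txt", "offset": 120, "length": 42}))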
Research Frontiers in Advanced Data Mining Technologies and Applications
Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign

Abstract. Research in data mining has two general directions: theoretical foundations, and advanced technologies and applications. In this talk, we will focus on the research issues for advanced technologies and applications in data mining and discuss some recent progress in this direction, including (1) pattern mining, usage, and understanding, (2) information network analysis, (3) stream data mining, (4) mining moving object data, RFID data, and data from sensor networks, (5) spatiotemporal and multimedia data mining, (6) biological data mining, (7) text and Web mining, (8) data mining for software engineering and computer system analysis, and (9) data cube-oriented multidimensional online analytical processing.

Data mining, as the confluence of multiple intertwined disciplines, including statistics, machine learning, pattern recognition, database systems, information retrieval, the World-Wide Web, and many application domains, has achieved great progress in the past decade [1]. Similar to many research fields, data mining has two general directions: theoretical foundations, and advanced technologies and applications. Here we focus on advanced technologies and applications in data mining and discuss some recent progress in this direction. Notice that some popular research topics, such as privacy-preserving data mining, are not covered in the discussion for lack of space/time. Our discussion is organized into nine themes, and we briefly outline the current status and research problems in each theme.

1 Pattern Mining, Pattern Usage, and Pattern Understanding
Frequent pattern mining has been a focused theme in data mining research for over a decade. Abundant literature has been dedicated to this research and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent itemset mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structural pattern mining, correlation mining, associative classification, and frequent-pattern-based clustering, as well as their broad applications.
Recently, studies have proceeded to scalable methods for mining colossal patterns where the size of the patterns could be rather large so that step-by-step growth using an Apriori-like approach does not work, methods for pattern compression, extraction of high-quality top-k patterns, and understanding patterns by context analysis and generation of semantic annotations. Moreover, frequent patterns have been used for effective classification by top-k rule generation for long patterns and discriminative frequent pattern analysis. Frequent patterns have also been used for clustering of high-dimensional biological data. Scalable methods for mining long, approximate, compressed, and sophisticated patterns for advanced applications, such as biological sequences and networks, and the exploration of mined patterns for classification, clustering, correlation analysis, and pattern understanding will still be interesting topics in research.

2 Information Network Analysis
Google's PageRank algorithm has started a revolution in Internet search. However, since information network analysis covers many additional aspects and needs scalable and effective methods, the systematic study of this domain has just started, with many interesting issues to be explored.
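For reference, a minimal power-iteration sketch of the PageRank algorithm just mentioned (the standard textbook formulation with damping factor 0.85; the tiny example graph is invented, and none of this is specific to the survey):

def pagerank(links, d=0.85, iters=50):
    nodes = list(links)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1.0 - d) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if outs:
                share = d * pr[n] / len(outs)   # distribute rank along out-links
                for m in outs:
                    nxt[m] += share
            else:                               # dangling node: spread mass uniformly
                for m in nodes:
                    nxt[m] += d * pr[n] / len(nodes)
        pr = nxt
    return pr

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))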
Information network analysis has broad applications, covering social and biological network analysis, computer network intrusion detection, software program analysis, terrorist network discovery, and Web analysis.
One interesting direction is to treat information networks as graphs and further develop graph mining methods. Recent progress on graph mining and its associated structural pattern-based classification and clustering, graph indexing, and similarity search will play an important role in information network analysis. Moreover, since information networks often form huge, multidimensional heterogeneous graphs, mining noisy, approximate, and heterogeneous subgraphs based on different applications for the construction of application-specific networks with sophisticated structures will help information network analysis substantially. The discovery of the power-law distribution of information networks and the rules on density evolution of information networks will help develop effective algorithms for network analysis. Finally, the study of link analysis, heterogeneous data integration, user-guided clustering, and user-based network construction will provide essential methodology for in-depth study in this direction.

3 Stream Data Mining
Stream data refers to data that flows into the system in vast volume, changes dynamically, is possibly infinite, and contains multi-dimensional features. Such data cannot be stored in traditional database systems; moreover, most systems may only be able to read the stream once, in sequential order. This poses great challenges for the effective mining of stream data.
With substantial research, progress has been made on efficient methods for mining frequent patterns in data streams, multidimensional analysis of stream data (such as the construction of stream cubes), stream data classification, stream clustering, stream outlier analysis, rare event detection, and so on. The general philosophy is to develop single-scan algorithms that collect information about stream data in tilted time windows, exploring micro-clustering, limited aggregation, and approximation.
data will become increasingly popular,with far-reaching implications.For example,mining satellite images may help detect forestfire,find unusual phenomena on earth,and predict hurricanes,weather patterns,and global warming trends.Research in this domain needs the confluence of multiple disciplines including im-age processing,pattern recognition,parallel processing,and data mining.Automatic categorization of images and videos,classification of spatiotemporal data,finding fre-quent/sequential patterns and outliers,spatial collocation analysis,and many other tasks have been studied popularly.With the mounting in many applications,scalable analysis of spatiotemporal and multimedia data will be an important research frontier for a long time.6Biological Data MiningWith the fast progress of biological research and the accumulation of vast amount of biological data(especially,a great deal of it has been made available on the Web),bi-ological data mining has become a very activefield,including comparative genomics, evolution and phylogeny,biological databases and data integration,biological sequence analysis,biological network analysis,biological image analysis,biological literature analysis(e.g.,PubMed),and systems biology.This domain is largely overlapped with4J.Hanbioinformatics but data mining researchers has been emphasizing on integrating biolog-ical databases with biological data integration,constructing biological data warehouses, analyzing biological networks,and developing various kinds of scalable bio-data min-ing algorithms.Advances in biology,medicine,and bioinformatics provide data miners with abun-dant real data sets and a broad spectrum of challenging research problems.It is expected that an increasing number of data miners will devoted themselves to this domain and make contributions to the advances in both bioinformatics and data mining.7Text and Web MiningThe Web has become the ultimate information access and processing platform,housing not only billions of link-accessed“pages”,containing textual data,multimedia data,and linkages,on the surface Web,but also query-accessed“databases”on the deep Web. 
With the advent of Web2.0,there is an increasing amount of dynamic“workflow”emerging.With its penetrating deeply into our daily life and evolving into unlimited dynamic applications,the Web is central in our information infrastructure.Its virtually unlimited scope and scale render immense opportunities for data mining.Text mining and information extraction have been applied not only to Web mining but also to the analysis of other kinds of semi-structured and unstructured informa-tion,such as digital libraries,biological information systems,business intelligence and customer relationship management,computer-aided instructions,and office automation systems.There are lots of research issues in this domain,which takes the collaborative efforts of multiple disciplines,including information retrieval,databases,data mining,natural language processing,and machine learning.Some promising research topics include heterogeneous information integration,information extraction,personalized informa-tion agents,application-specific partial Web construction and mining,in-depth Web semantics analysis,and turning Web into relatively structured information-base.8Data Mining for Software Engineering and Computer System AnalysisSoftware program executions and computer system/network operations potentially gen-erate huge amounts of data.Data mining can be performed on such data to monitor system status,improve system performance,isolate software bugs,detect software pla-giarism,analyze computer system faults,uncover network intrusions,and recognize system malfunctions.Data mining for software and system engineering can be partitioned into static anal-ysis and dynamic/stream analysis,based on whether the system can collect traces be-forehand for post-analysis or it must react at real time to handle online data.Differ-ent methods have been developed in this domain by integration and extension of the methods developed in machine learning,data mining,pattern recognition,and statis-tics.However,this is still a rich domain for data miners with further development of sophisticated,scalable,and real-time data mining methods.Research Frontiers in Advanced Data Mining Technologies and Applications5 9Data Cube-Oriented Multidimensional Online Analytical ProcessingViewing and mining data in multidimensional space will substantially increase the power andflexibility of data analysis.Data cube computation and OLAP(online analytical processing)technologies developed in data warehouse have substantially in-creased the power of multidimensional analysis of large datasets.Besides traditional data cubes,there are recent studies on construction of regression cubes,prediction cubes,and other sophisticated statistics-oriented data cubes.Such multi-dimensional, especially high-dimensional,analysis tools will ensure data can be analyzed in hier-archical,multidimensional structures efficiently andflexibly at user’sfinger tips.This leads to the integration of online analytical processing with data mining,called OLAP mining.We believe that OLAP mining will substantially enhance the power andflexibility of data analysis and bring the analysis methods derived from the research in machine learning,pattern recognition,and statistics into convenient analysis of massive data with hierarchical structures in multidimensional space.It is a promising researchfield that may lead to the popular adoption of data mining in information industry. Reference1.Han,J.,Kamber,M.:Data Mining:Concepts and Techniques(2nd ed.).Morgan Kaufmann(2006)。
Analyzing Differentially Expressed Genes in Soybean Dwarf Mutants by Transcriptome Sequencing

In recent years, with the development of sequencing technology, transcriptome sequencing has become one of the important means of studying gene expression in organisms. Through transcriptome sequencing, we can learn the gene expression profile of a particular organism at a given growth stage or under a given environmental adaptation. In soybean research, using transcriptome sequencing to study differentially expressed genes in dwarf mutants can help us gain a deeper understanding of the molecular mechanisms of soybean growth and development.

Dwarf mutants are a class of mutants whose plants remain shorter during growth than normal individuals of the same species. In soybean, the discovery of dwarf mutants is of great significance for improving soybean yield and disease resistance. Studying differentially expressed genes in dwarf mutants by transcriptome sequencing can reveal the molecular mechanisms underlying the mutants' growth and development, and thus provide a theoretical basis for soybean breeding.

When using transcriptome sequencing to study differentially expressed genes in soybean dwarf mutants, suitable experimental and control materials must first be chosen. Generally, a wild type at a growth stage similar to that of the mutant can be chosen as the control material. Then, the mutant and control samples are sequenced with high-throughput sequencing technology to obtain a large amount of transcriptome sequencing data. Next, quality control and filtering are applied to the sequencing data to remove low-quality reads and adapter sequences. The reads are then aligned to the reference genome with an alignment algorithm to obtain the expression level of each gene. Finally, the differentially expressed genes between the mutant and the control samples are screened, followed by functional annotation and enrichment analysis.
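As a sketch of the screening step, assuming a per-gene statistics table with hypothetical column names (a log2 fold change and a multiple-testing-adjusted p-value), the common |log2FC| ≥ 1 and padj < 0.05 filter could look like this in pandas:

import pandas as pd

deg_table = pd.read_csv("mutant_vs_wildtype.csv")   # hypothetical input file
up   = deg_table[(deg_table["log2FC"] >=  1) & (deg_table["padj"] < 0.05)]
down = deg_table[(deg_table["log2FC"] <= -1) & (deg_table["padj"] < 0.05)]
candidates = pd.concat([up, down]).sort_values("padj")
candidates.to_csv("candidate_DEGs.csv", index=False)

The cutoffs are conventional defaults, not values prescribed by this article; in practice they are chosen per experiment.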
The results of differential expression analysis usually reveal the differences in gene expression between the mutant and control samples and yield a list of candidate genes. We can further validate these candidate genes with biological experiments to understand their specific functions in soybean growth and development.

By studying the differentially expressed genes in dwarf mutants through transcriptome sequencing, we can determine which genes change during the mutant's growth and development, and thus understand the mutant's developmental mechanisms. In addition, through functional annotation and enrichment analysis, we can also learn which biological processes and pathways the differentially expressed genes participate in.

Transcriptome sequencing has broad application prospects in soybean breeding. Studying the differentially expressed genes in dwarf mutants can provide an important theoretical foundation for improving soybean yield and disease resistance. At the same time, transcriptome sequencing can also help us discover more potential genes for soybean genetic improvement and functional studies. In summary, using transcriptome sequencing to study the differentially expressed genes in soybean dwarf mutants helps us understand the molecular mechanisms of soybean growth and development.
Data Indexing for Heterogeneous Multiple Broadcast Channels
Andrew Y. Ho and Dik Lun Lee
Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Email: andrewho@t.hk, dlee@t.hk

Abstract
This paper studies a heterogeneous multiple channel environment (HMCE), in which the channels are controlled by different wireless operators. To the best of our knowledge, there is no previous research on this scenario. In this paper, we first present an architecture for HMCE that makes use of a centralized index server to broadcast, on a dedicated index channel, index information about the broadcast data. An analogy can be drawn between HMCE and the WWW: the wireless operators are web sites and the index channel is Google; Google indexes web pages so that users can find the web pages they want, whereas in HMCE the index channel indexes the data channels to help mobile users find the data on the air. We propose three indexing methods to reduce the time and energy used to search for data in HMCE. Simulation results are obtained to evaluate the performance of the proposed methods.

1. Introduction
The rapid advance of wireless communication technologies during the last decade has brought a new vision to the computing industry – from traditional wired and stationary desktops to the fast-growing area of mobile computing. The trend of using notebook computers, palm-size computers, and personal digital assistants (PDAs) is already in full swing. Furthermore, the enhancement in the reliability, transmission, and speed of wireless links facilitates the mobility of communication. Usually, a wireless communication environment consists of two sets of entities: a large number of users equipped with mobile devices (mobile clients – MCs) and a relatively small number of stationary mobile service stations (MSSs) that have base stations (BSs) or access points (APs) attached to provide wireless communication in geographical areas known as cells. Unlike an MSS, an MC is able to move freely from cell to cell and poses queries to retrieve data provided by the MSS.
There are two major modes for MCs to access information on wireless channels: the pull-based on-demand mode, which collects the queries sent by the MCs through an uplink channel and then delivers the requested data through the downlink channel; and the push-based broadcasting mode, which broadcasts data on the broadcast channel continuously according to previous data access statistics in order to reduce access latency and the consumption of bandwidth, thus effectively allowing a large number of MCs to access information simultaneously. Furthermore, as receiving messages consumes less power than sending messages, MCs are able to last longer on their limited power supplies in the broadcasting environment.
Because of business, economic and technical reasons, some service providers may want to use multiple low-bandwidth channels to achieve high combined bandwidth instead of getting a single high-bandwidth channel. In this homogeneous multi-channel environment, the server has full control over all channels in terms of scheduling data on the channels and will most likely use a fixed indexing and scheduling scheme.
A mobile client typically subscribes to one or a few wireless operators, and as such it is easy to identify the available channels and scan them to pick out the interesting information.
In this paper, we study a heterogeneous multiple channel environment (HMCE), which, in contrast, consists of a large number of wireless operators, ranging from phone companies to amateurs operating in public radio frequency bands (e.g., Starbucks) [9]. The wireless operators disseminate information on channels that are not related or organized. An analogy can be drawn between HMCE and the WWW: in HMCE, the data channels are the web sites and the broadcast data are the web pages. As information is disseminated through different service providers on different wireless broadcast channels, it is difficult for mobile users to identify the available channels (cf. web sites), let alone those containing the data they want (cf. web pages). They need to have knowledge about what channels are available and which channels carry their requested data. Since the broadcast pattern may change dynamically over time, any static channel information pre-programmed into the mobile devices may become outdated very soon. To find out where the expected data will be broadcast, the most straightforward method for MCs is to search all broadcast channels, but this is very time consuming and uses up a lot of battery power. Another way is to announce the data indices through a dedicated index channel so that MCs can identify where the data will be broadcast. For HMCE, we propose to make use of a centralized index server that broadcasts index information about the broadcast data on a dedicated index channel. Mobile clients listen to the index channel and then tune in to the data channel according to the index information and scan for the data they want. In other words, the index server/channel is the "Google" of the data on the air.
Researchers have proposed different broadcast scheduling [1, 2, 3] and indexing schemes [4, 5, 6] in order to reduce power consumption for a single broadcast channel. Yet scheduling and indexing methods used in single-channel broadcast may not be directly applicable to a multi-channel environment. There are also studies on multiple channel scheduling and indexing [3, 4, 7] that focus on a homogeneous multiple channel environment. To the best of our knowledge, there is no previous research on the HMCE scenario.
In this paper, the architecture of HMCE is proposed. We introduce indexing methods to reduce the time and energy used to search for data on multiple data channels. Three indexing models are described. Simulation results are obtained to evaluate the performance of the three proposed methods.
The rest of this paper is organized as follows. Section 2 introduces the background and related work on wireless broadcast. Section 3 describes the proposed methods for data indexing. Section 4 evaluates the performance of the proposed methods. Section 5 concludes the paper.

2. Background
In a single channel environment, one way of reducing power consumption is selective tuning [6, 5], which enables MCs to switch into active mode (power consuming) only when the expected data is being broadcast. A server is required to broadcast indexing information to make selective tuning work.
The (1, m) indexing scheme [6, 5] is an index allocation method in which the complete index is broadcast m times in a broadcast cycle. An MC traverses the index buckets and determines the offset to the requested data bucket.
The tree-based indexing scheme [6, 5] was introduced, in which an index is only partially replicated in the broadcast cycle. In this scheme, the data file is associated with a B+-tree index structure. The signature-based indexing scheme was proposed [8] for real-time information filtering. Basically, to access information, a query signature is constructed and compared with the broadcast signature. If the signatures match, all records indexed by the signature are read until they are checked for correctness or until the expected record is found in the information frame.
Scheduling and indexing methods used in single channel broadcast may not be directly applied to a multiple channel environment. New algorithms and modified methods were thus proposed, although these algorithms did not address certain issues related to a HMCE. In [3], Hameed and Vaidya integrated the online algorithm with alternate labeling by assigning instances of the data item from a single channel schedule into a multiple channel schedule. In [4], Hsu et al. suggested a method for indexing and scheduling a multiple channel broadcast that considered data access frequencies based on distributed indexing [5]. In [7], Ke et al. proposed a scheduling method that addresses conflicts between data pages. The page-based strategy (PB) aims at allocating the data pages by their own access frequencies. The request-based strategy (RB) is based on user requests rather than pages. The conflict-free version (CFV) of the RB further enhances the schedule by checking whether the pages in the same request are assigned to the same time slot.

3. Data Dissemination in Heterogeneous Multiple Channel Environments
Three indexing schemes in the centralized model for data dissemination in a HMCE are proposed.

3.1. Basic Model
The key component of the centralized model is the central index server (CIS), which is used to manage and broadcast index information about the data being broadcast on all of the broadcasting channels. In the architecture, we define a broadcast agent (BA) as any individual that has data to be broadcast on the broadcast channel.

Figure 1. Centralized HMCE

Each BA is connected to the CIS and CO through a wired network (see Figure 1). The CIS is responsible for broadcasting index information on a dedicated wireless channel for the whole HMCE, while the CO is only responsible for broadcasting a data message (DM) on its own wireless channels for the BAs who subscribe to it.
Assume that a BA has data to be broadcast. The first thing for it to do is to send a "data-to-send" notification (DTS) to the CIS through the wired network (Figure 1). The content of the DTS follows a standard format defined by the CIS, which includes the BA's identification, the channel ID that the BA has subscribed to, the message identification, and a list of key attributes describing the data. When the CIS receives the DTS, it extracts the information from the DTS and converts it into the index message (IM) format, which contains a header with the BA's ID, the channel ID, the message ID, the IM size, the number of attributes, and a pointer to the start of the attributes. Next, the CIS puts the IM into the broadcast queue for index broadcast.
Once the CIS receives the DTS, it is required to reply to the BA with the earliest time at which the BA can broadcast its data (Figure 1). The reply message is called the "earliest-send-time" notification (EST). The value of the EST is equal to the broadcast end time of the BA's IM.
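A toy sketch of the two message records exchanged so far (the field names follow the description above, but the exact wire format is not specified there, and the size and pointer fields of the IM header are simplified away):

from dataclasses import dataclass
from typing import List

@dataclass
class DTS:                       # "data-to-send" notification, BA -> CIS
    ba_id: str
    channel_id: str
    message_id: str
    attributes: List[str]        # key attributes describing the data

@dataclass
class IM:                        # index message broadcast by the CIS
    ba_id: str
    channel_id: str
    message_id: str
    attributes: List[str]

def dts_to_im(dts: DTS) -> IM:
    # The CIS converts each received DTS into an IM and queues it for broadcast
    return IM(dts.ba_id, dts.channel_id, dts.message_id, list(dts.attributes))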
The EST is very important for serializing the IM and the indexed data in the centralized model. Suppose the EST were omitted and a BA broadcast its data a short while after sending the DTS to the CIS. If there are many IMs queued up at the CIS, it will take a long time for the CIS to broadcast all of them, so there is a chance that the BA broadcasts its data before the CIS has broadcast the corresponding IM. In this case, the data broadcast by the BA is not properly indexed, and the corresponding IM broadcast by the CIS will be invalid. For data retrieval, if an MC reads this IM and tunes into the specified data channel, it will wait forever, or terminate after a timeout period, since the data have already been broadcast on that channel. With the EST returned by the CIS, the BA can ensure that its data will be properly indexed, provided the data are sent at or after the EST.

Once the BA receives an EST from the index server, it waits until the indicated time. At that moment, the BA can send its data to the service provider, i.e., the CO (Figure 1). Whether the CO broadcasts the BA’s data immediately or appends it to an internal broadcast queue depends on the traffic of the channel.

Whenever an MC wants to retrieve data from the wireless channel, it tunes into the index channel and filters the broadcast IMs by listening to the channel until it finds an IM containing an attribute that matches its request. The MC can then tune into the data channel indicated by the IM header and wait for the requested data to be broadcast. Since each IM header contains the corresponding IDs of the BA and the message, the MC only needs to check both IDs in a DM to determine whether that DM is the one indicated by the IM. Otherwise, the MC can doze off until the end of the incorrect DM. The MC may also end the retrieval process if no attribute matching its request can be found in the IMs within a certain period of time.

3.2. Signature Model

The environment of the signature model is the same as that of the basic model. The major difference concerns how indices are constructed. In the basic model, whenever the CIS receives a DTS sent by a BA, all attributes in the attribute list attached to the DTS are used to construct the IM, and each IM is responsible for only one DM. If there are many BAs and all of them send DTSs with long attribute lists, it will take a long time for the CIS to broadcast all the IMs on the index channel. Since the EST is equal to the end time of the corresponding IM broadcast, the BAs also need to wait a long time before they can start sending their data to the COs.

In the signature model, to reduce the size of the IM, signatures are used instead of real attributes. An attribute list signature (ALS) is formed by hashing each attribute in an attribute list (AL) into a random bit string, and then superimposing all the bit strings together (Figure 2).

Figure 2. Generation of attribute list signature

During the filtering process, a query signature is constructed in the same way as the ALS. The query signature is then compared with the ALS using a bitwise AND: if every set bit of the query signature is also set in the ALS, the query is potentially in the AL.

The CIS maintains two lists for managing signatures: a channel signature (CS) list for storing all CSs, with exactly one entry per data channel in the system, and an order-array for storing the broadcast order and time of each entry in the CS list.
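The signature machinery of Figure 2 can be sketched as follows. The 128-bit signature length and the 20-bit m-size mirror the simulation settings in Section 4; the specific hash construction is an assumption, since the scheme only requires that each attribute map to a pseudo-random bit string.

```python
import hashlib
from typing import Iterable

SIG_BITS = 128  # signature length (16 bytes, as in Section 4)
M_SIZE = 20     # 1-bits per attribute signature (Section 4.3's best value)

def attribute_signature(attr: str) -> int:
    """Hash one attribute into a SIG_BITS-wide bit string with up to
    M_SIZE bits set (bit positions may collide)."""
    sig = 0
    for i in range(M_SIZE):
        digest = hashlib.sha256(f"{attr}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig

def superimpose(signatures: Iterable[int]) -> int:
    """Bitwise-OR signatures together: attribute signatures into an
    ALS, or the ALSs destined for one channel into its CS."""
    result = 0
    for s in signatures:
        result |= s
    return result

def may_match(cs: int, query_sig: int) -> bool:
    """Signature filter: the query is potentially in the channel's ALs
    iff every 1-bit of the query is also set in the CS. A positive
    answer can still be a false drop."""
    return cs & query_sig == query_sig

als = superimpose(attribute_signature(a) for a in ("stock", "weather"))
assert may_match(als, attribute_signature("weather"))
```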
Upon receiving a DTS from a BA, the CIS extracts the channel ID and the ALS from it. The ALS is then superimposed onto the CS of the corresponding channel. The main purpose of the order-array is to keep track of the broadcast time of each non-empty CS (it is not necessary to broadcast an empty CS for indexing purposes). Thus, the CIS can respond to the BA with an EST that depends on which channel the BA uses. During signature broadcast, the CIS goes through each non-empty entry in the order-array, and only those CSs in the array are broadcast. Moreover, after broadcasting, all CSs and the corresponding order-array entries are cleared for future superimpositions.

In the signature model, indices are only signatures that guide the MCs to the data channel where the requested data may be found. Consequently, the BAs are required to send with each data item an AL to their CO for broadcast, so that the MCs can check whether the query attribute is actually in the AL attached to the broadcast data.

A false drop can result from a signature comparison. This happens when an MC finds a CS matching the query signature but the corresponding channel does not in fact contain the requested data item. In this case, the MC needs to leave the data channel and tune back into the index channel to filter other CSs. To detect the occurrence of a false drop, a timeout for searching the data channel is required: the MC tunes into the indicated data channel after finding a matching CS and waits for the data message, so if the length of the searching period were not specified and the CS were in fact a false drop, the MC would wait forever on the data channel without finding any useful data.

3.3. Signature Model with Operator’s Feedback

An improper length of the searching period can lengthen the retrieval time. No matter how carefully the length of the searching period is chosen, there remains the possibility of an ‘incorrect false drop’. An incorrect false drop happens when the MC finds a matching CS on the index channel but cannot find the correct DM within the searching period on the corresponding data channel. In fact, the DM is broadcast on the indicated data channel, but after the MC’s expiration time. Incorrect false drops occur more frequently when the searching period is too short or the traffic load of the data channel is heavy, since in both cases the correct DM cannot be broadcast within the MC’s searching period.

To eliminate the effect of incorrect false drops, at least one CS needs to be broadcast within one searching period before the corresponding DM is broadcast. Hence, when the MC reads the CS and switches to the data channel, the DM will arrive during the searching period. Accomplishing this requires cooperation from all the COs.

Figure 3. Signature model with operator’s feedback

In both the basic and signature models, once the BA receives an EST, it waits until the specified time to send the data. But the BA has no idea when the data message will be broadcast on the data channel by its CO. As a result, the BA has no way of ensuring that at least one CS containing the ALS of the data is broadcast within the searching period prior to the broadcast of the data. Feedback from the CO plays an important role in notifying the BA of the time of the data broadcast.
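To illustrate how the searching period separates genuine false drops from incorrect ones, here is a minimal MC-side sketch. The schedule representation and the names are hypothetical; a real client reads buckets off the air rather than inspecting a schedule.

```python
from typing import List, Optional, Tuple

# (broadcast_time, attribute_list, message_id) for each DM on one channel
Schedule = List[Tuple[float, List[str], int]]

def search_data_channel(schedule: Schedule, tune_in: float,
                        query_attr: str,
                        searching_period: float) -> Optional[int]:
    """MC behaviour after a CS match: scan the data channel until the
    query attribute is found in a DM's AL or the searching period
    expires. Returns the matching message id, or None on a false drop."""
    deadline = tune_in + searching_period
    for t, attribute_list, message_id in schedule:
        if t < tune_in:
            continue              # broadcast before we tuned in: missed
        if t > deadline:
            return None           # timeout: an *incorrect* false drop
        if query_attr in attribute_list:
            return message_id     # verified against the real AL
    return None                   # genuine false drop: data never appears

# A DM carrying "weather" at t=12 is missed with a searching period of 5
# (an incorrect false drop) but found with a searching period of 10.
schedule = [(4.0, ["stock"], 1), (12.0, ["weather"], 2)]
print(search_data_channel(schedule, 3.0, "weather", 5.0))   # -> None
print(search_data_channel(schedule, 3.0, "weather", 10.0))  # -> 2
```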
In contrast to the signature model, when the CO receives the data from the BA, besides appending the data to the broadcast queue, it also gives feedback to the BA that requested the data broadcast, telling it the time at which the DM will be broadcast on the channel. After getting this feedback from the CO, if the difference between the CS end time and the data broadcast time is more than one searching period, the BA can send the same ALS to the CIS again, one searching period prior to the broadcast time of the data, while avoiding any overlap with the original searching period. The reason for this is straightforward: the MC is still searching for the correct attribute on the data channel during the searching period, so if the re-sent signature were broadcast within this period, it would be gone before the MC switched back to the index channel. The purpose of the re-sent ALS is only to prevent an incorrect false drop. Since the corresponding data is already in the CO’s internal broadcast queue, a second EST reply is not required by the BA. The message flow in the feedback model is shown in Figure 3. The MC uses the same procedure for data retrieval, but when an incorrect false drop occurs and the MC switches back to the index channel after the timeout, the MC can use the re-sent signature on the index channel and switch back to the data channel to retrieve the DM.

4. Performance Evaluation

This section evaluates the performance of the three proposed heterogeneous multiple channel broadcast models by simulation. The primary performance metric used for evaluating the models is the MC’s average access time.

For all experiments, CSIM18 [10] was used to implement the simulation, and the same parameters were used across the simulation environment. We assume that the capacity of the wired network is much larger than that of a wireless broadcast channel (1 byte per unit of time), so notification messages such as the DTS and EST, and any data going through the wired network, do not affect the performance of the system; consequently, time spent on wired networks is not counted. For the attributes used in the simulation, a vocabulary is constructed; each time, a BA chooses 1 to 25 words from the vocabulary as its data attributes, and an MC likewise chooses one word from the vocabulary as its query attribute. The length of each index message in the basic model is the size of the attributes plus a 3-byte header (BA ID, message ID, and message size), ranging from 9 to 153 bytes in total (i.e., 6 bytes per attribute word). For the signature models, a 128-bit (16-byte) signature is used for each channel. The size of the data that a BA sends to its CO ranges from 10 to 10000 bytes; for the basic model, 2 bytes of IDs are added for verification, while in the signature models, the size of the attributes is added.

4.1. Impact of the number of channels on MC’s data access

MC access time measures the time elapsed from the moment the MC poses a request and starts listening on the wireless channels to the moment the requested data is received. MC access time is investigated with respect to the number of channels; the result is shown in Figure 4.

Figure 4. MC access time

In the figure, MC access time decreases as the number of channels increases. This is because the workload on each channel is reduced by using a larger number of channels, thus enabling each data broadcast to start earlier.
The figure also shows that MC access time in the feedback model achieves the lowest values at the beginning of the experiments, while the signature model has slightly higher values due to longer data retrieval times. As the number of channels increases, MC access time in all models tends to converge to the same values and stays unchanged regardless of further increases in the number of channels. The reasons are as follows. First, the time at which a CO receives a data message and the time at which it broadcasts that message get closer. As the EST is also the end time of the index broadcast, MC access time can be roughly formulated as the sum of the time spent filtering on the index channel, the data queuing time, and the time for broadcasting the data message. Since the average data sizes in both models are the same, the filtering time becomes the main factor influencing the access time.

Figure 5. MC on index channels

Figure 5 shows the time that the MC spends on the index channel. The graph clearly shows that the MC spends more time in the basic model than in the signature models, due to the difference in index size. As a result, MC access time in the basic model reaches higher values in the experiments.

4.2. Impact of the searching period

Figure 6 shows MC access time with respect to different lengths of the searching period in the signature and feedback models. Figure 7 shows the number of false drops with respect to the same searching periods.

Figure 6. Access time

Figure 7. False drops

As shown in the figures, a short searching period causes many false drops, since the MC does not have enough time to reach the correct data message before the end of the searching period; it therefore interprets all the missed data as false drops. In the feedback model, since the ALS of the data is re-sent to the CIS for broadcast again on the index channel, the MC is able to retrieve the CS a second time. Moreover, as the ALS is re-sent within one searching period prior to the actual data broadcast, once the MC gets the re-sent CS, it will most likely receive the data message on the data channel. Therefore, the MC in the feedback model incurs less access time than in the signature model. As the length of the searching period increases, the number of false drops decreases, since the MC is able to eliminate incorrect false drops. This also explains why the access times of the two models tend to converge as the searching period increases.

4.3. Impact of m-size

The number of false drops can influence MC access time, but the number of false drops is itself influenced by the m-size. The m-size of a signature is the number of 1’s generated by the signature generator (usually a hash function). If the m-size is too small, there will be too many 0’s in the signature, resulting in underutilized signatures. If the m-size is too large, there will be too many 1’s in the signatures, resulting in weakened filtering capability, because signatures with many 1’s are much more likely to match a query signature by chance. Figure 8 shows the number of false drops against the m-size.

Figure 8. False drops vs. m-size

The figure shows that the lowest false drop rate is obtained with an m-size of 20 bits. Ideally, an m-size of half the signature size (64 bits in this case) would be best, since the number of distinct bit patterns of an n-bit string is maximized when it contains equal numbers of 1’s and 0’s (C(n, n/2) patterns), thus reducing the chance of collision (a simplified numerical sketch of the trade-off follows).
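Under a uniformity assumption, this trade-off has a simple approximate closed form: if each superimposed signature sets m independent random bits in an F-bit signature and D signatures are OR-ed into one CS, a single-attribute query falsely matches with probability roughly (1 - (1 - m/F)^D)^m. The sketch below uses F = 128 as in the simulation; D = 10 is an arbitrary illustrative load.

```python
def false_drop_prob(m: int, F: int = 128, D: int = 10) -> float:
    """Approximate false-drop probability for superimposed coding:
    each of D superimposed signatures sets m random bits out of F,
    and a query (also m bits) falsely matches when all of its bits
    happen to be set in the resulting channel signature."""
    p_bit = 1 - (1 - m / F) ** D   # P[a given CS bit equals 1]
    return p_bit ** m

# Sweeping m shows the U-shape: too few 1's underutilize the
# signature, too many 1's make every query match by chance.
for m in (2, 4, 8, 16, 32, 64):
    print(m, round(false_drop_prob(m), 5))
# -> m=2: ~0.021, m=4: ~0.0055, m=8: ~0.0026,
#    m=16: ~0.0076, m=32: ~0.16, m=64: ~0.94
```

In this idealized model the optimum sits near m = F ln 2 / D rather than at F/2, which already suggests that the best m-size depends on the system load, not only on the signature length.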
However, in the proposed models, each CS is in fact a superimposition of multiple ALSs from different BAs, and each ALS is in turn generated from a different number of attributes and thus contains a different number of 1’s. This explains why the measured optimal m-size differs from the theoretical result, which assumes uniformity among the ALSs. In a real operational environment, it is difficult to predict the optimal m-size, because a change in system loading can change the optimal m-size.

4.4. Impact of system loading on MC access time

System loading is defined as the fraction of time that the broadcast channels are occupied broadcasting data messages. In the experiment, the loading is increased by adding agents to the system so as to increase the usage of each data channel. The maximum value ‘1’ means that all channels are occupied at every instant during the simulation.

Figure 9. MC access time vs. system loading

Figure 9 shows MC access time with respect to different system loadings. With light channel loading, less data is sent; hence, the number of false drops is reduced. As a result, the MC in the signature and feedback models achieves low and similar access times. As channel loading increases, the number of data broadcasts also increases; thus, the CO’s broadcast queue fills with waiting data messages, which further increases MC access time by delaying the broadcast of the requested data. The fact that the feedback model has the lowest access time is the result of the feedback mechanism with signature re-sending.

In addition, when loading increases, the difference in MC access time between the basic and signature models also increases, which is mainly caused by the lengthy AL attached to each IM. In the basic model, the number of BA requests is directly related to the number of IMs: if there are 300 BA requests, 300 ALs are broadcast on the index channel. If the MC’s query attribute is in the last BA’s request, the MC must scan through all 300 IMs before switching to the data channel, which is time-consuming compared to the signature models. Although ALs also exist in the signature models, the number of ALs that the MC has to process is reduced, since each channel holds only a fraction of all ALs, and the MC is only required to filter the ALs that exist on the data channel. Therefore, the difference in access time increases as loading increases.

5. Conclusion and Future Work

In this paper, we propose an HMCE architecture consisting of independent wireless channel operators, broadcast agents, and a centralized index server. Compared to existing data dissemination schemes for a multi-channel environment, HMCE provides a channel through which MCs are able to find out where to fetch their desired data. Furthermore, three indexing methods applicable to the HMCE architecture were proposed, namely the basic model, the signature model, and the signature model with channel operator’s feedback (the feedback model). The basic model uses data attributes from broadcast agents to form index messages. The signature model superimposes attribute list signatures for the purpose of indexing. Finally, the signature model with channel operator’s feedback is an enhancement of the signature model that reduces the effect of incorrect false drops. We showed that both the signature model and the feedback model are significantly better than the basic model.

Our work on HMCE represents the first attempt, to the best of our knowledge, to address a heterogeneous, autonomous broadcast environment.
Much more research needs to be done to investigate different architectures (e.g., COs and the CIS could communicate directly in scheduling the index and data broadcasts [9]) and other indexing schemes.

Acknowledgment

This work was supported in part by grants from the Research Grants Council of Hong Kong (Grant No. HKUST 6179/03E).

References

[1] S. Acharya, R. Alonso, M. Franklin, and S. Zdonik, “Broadcast disks: Data management for asymmetric communication environments,” in Proceedings of the ACM Conference on Management of Data (SIGMOD 95), pp. 199-210, San Jose, California, May 1995.

[2] V. Gondhalekar, R. Jain, and J. Werth, “Scheduling on airdisks: Efficient access to personalized information services via periodic wireless data broadcast,” in IEEE International Conference on Communications (ICC 97), vol. 3, pp. 1276-1280, Montreal, 1997.

[3] S. Hameed and N.H. Vaidya, “Log-time algorithm for scheduling single and multiple channel data broadcast,” in Proceedings of the 3rd Annual ACM/IEEE International Conference on Mobile Computing and Networking (MOBICOM 97), Budapest, September 1997.

[4] C.H. Hsu, G. Lee, and A.L.P. Chen, “Index and data allocation on multiple broadcast channels considering data access frequencies,” in Proceedings of the 3rd International Conference on Mobile Data Management (MDM 2002), pp. 87-93, January 2002.

[5] T. Imielinski, S. Viswanathan, and B.R. Badrinath, “Energy efficient indexing on air,” in Proceedings of the ACM Conference on Management of Data (SIGMOD 94), pp. 25-36, May 1994.

[6] T. Imielinski, S. Viswanathan, and B.R. Badrinath, “Data on air: Organization and access,” IEEE Transactions on Knowledge and Data Engineering, 9(3), May/June 1997.

[7] C.H. Ke, C. Lee, and C.C. Chen, “Broadcast scheduling for multiple channels in wireless information systems,” in Proceedings of the National Computer Symposium (NCS 99), pp. 525-532, Taipei, Taiwan, December 1999.

[8] W.C. Lee and D.L. Lee, “Signature caching techniques for information broadcast and filtering in mobile environments,” ACM/Baltzer Journal of Wireless Networking (WINET), 5(1), pp. 57-67, 1999.

[9] A.Y. Ho, “Data indexing in heterogeneous multiple broadcast channels,” M.Phil. dissertation, Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong, HKSAR, 2003.

[10] CSIM18, Mesquite Software Inc.