当前位置:文档之家› 大基因组数据与生物信息学英文及翻译

大基因组数据与生物信息学英文及翻译

大基因组数据与生物信息学英文及翻译
大基因组数据与生物信息学英文及翻译

Big Genomic Data in Bioinformatics Cloud

Abstract

The achievement of Human Genome project has led to the proliferation of genomic sequencing data. This along with the next generation sequencing has helped to reduce the cost of sequencing, which has further increased the demand of analysis of this large genomic data. This data set and its processing has aided medical researches.

Thus, we require expertise to deal with biological big data. The concept of cloud computing and big data technologies such as the Apache Hadoop project, are hereby needed to store, handle and analyse this data. Because, these technologies provide distributed and parallelized data processing and are efficient to analyse even petabyte (PB) scale data sets. However, there are some demerits too which may include need of larger time to transfer data and lesser network bandwidth, majorly.

人类基因组计划的实现导致基因组测序数据的增殖。这与下一代测序一起有助于降低测序的成本,这进一步增加了对这种大基因组数据的分析的需求。该数据集及其处理有助于医学研究。

因此,我们需要专门知识来处理生物大数据。因此,需要云计算和大数据技术(例如Apache Hadoop项目)的概念来存储,处理和分析这些数据。因为,这些技术提供分布式和并行化的数据处理,并且能够有效地分析甚至PB级的数据集。然而,也有一些缺点,可能包括需要更大的时间来传输数据和更小的网络带宽,主要。

Introduction

The introduction of next generation sequencing has given unrivalled levels of sequence data. So, the modern biology is incurring challenges in the field of data management and analysis.

A single human's DNA comprises around 3 billion base pairs (bp) representing approximately 100 gigabytes (GB) of data. Bioinformatics is encountering difficulty in storage and analysis of such data. Moore's Law infers that computers double in speed and half in size every 18 months. And reports say that the biological data will accumulate at even faster pace [1]. Sequencing a human genome has decreased in cost from $1 million in 2007 to $1 thousand in 2012. With this falling cost of sequencing and after the completion of the Human Genome project in 2003, inundate of biological sequence data was generated. Sequencing and cataloguing genetic information has increased many folds (as can be observed from the GenBank database of NCBI). Various medical research institutes like the National Cancer Institute are continuously targeting on sequencing of a million genomes for the understanding of biological pathways and genomic variations to predict the cause of the disease. Given, the whole genome of a tumour and a matching normal tissue sample consumes 0.1 T

B of compressed data, then one million genomes will require 0.1 million TB, i.e. 103 PB (petabyte) [2]. The explosion of Biology's data (the scale of the data exceeds a single machine) has made it more expensive to store, process and analyse compared to its generation. This has stimulated the use of cloud to avoid large capital infrastructure and maintenance costs.

In fact, it needs deviation from the common structured data (row-column organisation) to a semi-structured or unstructured data. And there is a need to develop applications that execute in parallel on distributed data sets. With the effective use of big data in the healthcare sector, a

reduction of around 8% in expenditure is possible, that would account for $300 billion saving annually.

下一代测序的引入给出了无与伦比的序列数据水平。因此,现代生物学在数据管理和分析领域面临挑战。单个人类DNA包含约30亿个碱基对(bp),表示约100吉字节(GB)的数据。生物信息学在这种数据的存储和分析中遇到困难。摩尔定律推测,计算机速度增加了一倍,每18个月大小减少一半。报告说,生物数据将以更快的速度积累[1]。人类基因组测序的成本从2007年的100万美元降至2012年的1千美元。随着测序成本的下降,在2003年人类基因组项目完成后,产生了生物序列数据的淹没。测序和编目遗传信息已经增加了许多倍(如从NCBI的GenBank数据库可以观察到的)。诸如国家癌症研究所的各种医学研究机构正在连续地将一百万个基因组的测序用于理解生物学途径和基因组变异以预测疾病的原因。假定肿瘤的全基因组和匹配的正常组织样品消耗0.1 TB的压缩数据,则一百万基因组将需要10万TB,即103 PB(petabyte)[2]。生物学数据的爆炸(数据的规模超过单个机器)使得与其一代相比存储,处理和分析更昂贵。这刺激了云的使用,以避免大的资本基础设施和维护成本。

实际上,它需要从公共结构化数据(行- 列组织)偏移到半结构化或非结构化数据。并且需要开发在分布式数据集上并行执行的应用程序。随着医疗行业大数据的有效利用,支出减少约8%,每年可节省3000亿美元。

Review

Cloud computing

Cloud computing is defined as "a pay-per-use model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [3]. Some of the major concepts involved are grid computing, distributed systems, parallelised programming and visualization technology. A single physical machine can host multiple virtual machines through virtualisation technology. Problem with grid computing was that effort was majorly spent on maintaining the robustness and resilience of the cluster itself. Big data technologies now have identified solutions to process huge parallelised data sets cost effectively. Cloud computing and big data technologies are two different things, one is facilitating the cost effective storage and the other is a Platform as a Service (PaaS), respectively。

Three types of clouds are: public cloud, Private cloud and Hybrid cloud. First one refers to resources like infrastructure, applications, platforms, etc. made available to general public, accessible only through Internet on "pay as you go" basis. Second one refers to virtualised cloud infrastructure owned, housed and managed by a single organisation. Third one refer to the connection of private and public, for scalability and fault tolerance via Virtual Private Networking (VPN). A fourth model is also proposed, namely Community Cloud. Here organisations like public sector organisations, having sameinterest, can contribute financially towards a cloud infrastructure.

云计算被定义为“用于实现对可快速供应和释放的可配置计算资源(例如,网络,服务器,存储,应用和服务)的共享池的方便,按需的网络访问的按使用付费模型与最小的管理努力或服务提供者交互“[3]。涉及的一些主要概念是网格计算,分布式系统,并行编程和可视化技术。单个物理机器可以通过虚拟化技术托管多个虚拟机。网格计算的问题是,努力主要花在维护集群本身的鲁棒性和弹性。大数据技术现在已经确定了以成本有效的方式处理

大量并行数据集的解决方案。云计算和大数据技术是两个不同的事情,一个是促进成本有效的存储,另一个是分别是平台即服务(PaaS)。

三种类型的云是:公共云,私有云和混合云。第一种是指向一般公众提供的基础设施,应用程序,平台等资源,只能通过互联网以“按需付费”的方式访问。第二个是指由单个组织拥有,安置和管理的虚拟化云基础设施。第三种是指私有和公共连接,通过虚拟专用网(VPN)实现可扩展性和容错。还提出了第四个模型,即社区云。这里的组织像公共部门组织,具有相同的兴趣,可以贡献财务到云基础设施。

Genomics through big data technologies

With the implementation of big data technologies in storing, processing and analysing genomics data of medical research can profoundly impact mankind. Timely processing of data, and subsequent analysis are still a challenge. Solutions could be implementation of leading big data technologies like Hadoop. There have been studies regarding the utilisation of Apache Hadoop platform in bioinformafics projects [4].

随着大数据技术在存储,处理和分析基因组学数据的实施,医学研究可以深刻地影响人类。及时处理数据,以及随后的分析仍然是一个挑战。解决方案可能是实施领先的大数据技术,如Hadoop。已经有关于Apache Hadoop平台在生物信息学项目中的利用的研究[4]。

Bioinformatics tools developed

MapReduce projects [5]

Crossbow project [6]

B1astReduce project [7]

C1oudBurst [8]

CrossBow [9}

Cloudera

Cloudera, being the service provider in the big data platform is the leading Apache Hadoop software. It is contributing >50% of its output into open source (Apache licensed))projects, drawing a cutting edge in the development of big data technology and the Hadoop framework. It was established by Google, Yahoo and Facebook leading engineers along with an Oracle executive, who were later joined by the founder of Apache Hadoop project. [3] Cloudera is a pioneer of big data and cloud computing in the biomedical researches. The chief scienfist and the co-founder of Cloudera, is aiming to dedicate 25% of their time towards the use of computational biology in genomics [10]. Hence, leading pioneers of big data and computational biology along with leading multinationals are now committing to aid medical discoveries through contribution towards analysis of large biological data, for the understanding, diagnosis and treatment of diseases. In fact, this is the need of the hour, because the annual growth for healthcare computing is going to be around 20.5% through 2017 [11].

Cloudera作为大数据平台中的服务提供商,是领先的Apache Hadoop软件。它将其50%的输出贡献给开源(Apache许可)项目,在大数据技术和Hadoop框架的开发中占据了前沿。它由Google,雅虎和Facebook领先的工程师与Oracle高管共同建立,后来他们被Apache Hadoop项目的创始人加入。[3]

Cloudera是大数据和云计算在生物医学研究领域的先驱。Cloudera的首席科学家和联合创始人,旨在将25%的时间用于在基因组学中使用计算生物学[10]。因此,大数据和计算

生物学的领先先驱与领先的跨国公司现在承诺通过对大型生物数据的分析,疾病的理解,诊断和治疗的贡献,帮助医疗发现。事实上,这是小时的需要,因为到2017年,医疗保健计算的年增长率将达到20.5%左右[11]。

Hadoop

Two key modules: i) MapReduce ii) Hadoop Distributed File System (HDFS)

1. A computational program is divided into many small sub-problems. Distributed on multiple nodes of the computer.

2. A distributed file system for storing data on these nodes.

Such softwares are designed for load balancing among different nodes and allowing distributed processing of large datasets, enabling fault-tolerant parallelized analysis. Bioinformatics cloud involve services like data storage, acquisition, analysis, etc. as the cloud platform delivers hosted services over the Internet. It could be categorized into four categories namely, Data as a Service, Software as a Service, Platform as a Service, and Infrastructure as a Service [12-16}.

两个关键模块:i)MapReduce ii)Hadoop分布式文件系统(HDFS)

计算程序被分为许多小的子问题。分布在计算机的多个节点上。

2.用于在这些节点上存储数据的分布式文件系统。

这样的软件被设计用于在不同节点之间的负载平衡,并允许大型数据集的分布式处理,使得容错并行化分析成为可能。生物信息云涉及数据存储,采集,分析等服务,因为云平台通过Internet提供托管服务。它可以分为四类:数据即服务,软件即服务,平台即服务和基础设施即服务[12-16]。

Data as a service (DaaS)

Bioinformatics clouds are dependent on data for downstream analyses. "It is reported that annual worldwide sequencing capacity is beyond 13 Pbp and on an increase by a factor of five every year" [17]. Due to this unrevealed explosion of data, Data as a Service (DaaS) delivery via Internet has gained importance. It provides dynamic data access on demand, along with up-to-date data access to a wide range of devices, connected over the Web.

Amazon Web Services (AWS) provide a centralized cloud of public data sets (e.g. archives of GenBank, Ensembl databases, 1000 Genomes,Model Organism Encyclopedia, Unigene, etc.) of biology,economics, etc. as services [18}.

生物信息学云取决于下游分析的数据。“据报道,全球每年的测序能力超过13 Pbp,每年增加5倍”[17]。由于这种数据泄露的爆炸式增长,通过因特网的数据即服务(DaaS)交付已变得越来越重要。它可根据需要提供动态数据访问,以及通过Web连接的各种设备的最新数据访问。

亚马逊网络服务(AWS)提供生物学,经济学等作为服务的公共数据集(例如GenBank,Ensembl数据库,1000基因组,模型生物百科全书,Unigene等的归档)的集中云。

Software as a service (SaaS)

SaaS delivers a large variety of software services online for different types of data analysis facilitating remote access of various heavy bioinformatics softwares. Thus, it eliminates the need for local installation, thereby easing software maintenance. Up-to-date cloud-based services for bioinformatic data analysis has made life

easy for the users.

Efforts have been made to develop cloud-scale and cloud-based sequence mapping [19], multiple sequence alignment [20], expression analysis [21], identification of epistatic interactions of SNPs (single nucleotide polymorphisms) [22], and NGS (Next-Generation Sequencing).

SaaS在线提供各种各样的软件服务,用于不同类型的数据分析,便于远程访问各种重型生物信息学软件。因此,它消除了对本地安装的需要,从而简化软件维护。最新的基于云的生物信息数据分析服务为用户带来了轻松的生活。

已经开发了云尺度和基于云的序列作图[19],多重序列比对[20],表达分析[21],SNPs (单核苷酸多态性)上位相互作用的鉴定[22]和NGS 下一代测序)。

Platform as a service (PaaS)

PaaS allow users to develop, test and use cloud applications in an environment where computer resources scale to match application demand automatically and dynamically. This scalability factor helps in developing applications for biological data.

Two PaaS platforms:

1. Eoulsan, cloud-based-for high-throughput sequencing analyses [23];

2. Galaxy Cloud, cloud-scale-for large-scale data analyses [24].

PaaS允许用户在计算机资源自动和动态地扩展以匹配应用程序需求的环境中开发,测试和使用云应用程序。这种可扩展性因素有助于开发生物数据的应用程序。

两个PaaS平台:

1. Eoulsan,基于云的高通量测序分析[23];

2. Galaxy Cloud,云规模- 用于大规模数据分析[24]。

Infrastructure as a service (IaaS)

IaaS delivers all kinds of resources (virtualized) including CPU (hardwares), OS (softwares) etc. summing up a full computer infrastructure, reaching to the full potential of computer resources via Internet. Virtualized resources can be accessed as a public utility by users and thereby paying for the cloud resources that they utilize. Flexibility and customization give freedom to different users to access different cloud resources, as per their requirement, thus meeting the customized needs of different users.

Examples:

1. Cloud Bio Linux is a virtual machine that is publicly accessible for high-performance bioinformatics computing [25].

2. Clo VR is a portable virtual machine that incorporates several pipelines for automated sequence analysis [26].

IaaS提供各种资源(虚拟化),包括CPU(硬件),操作系统(软件)等等,总计完整的计算机基础设施,通过互联网充分发挥计算机资源的潜力。虚拟化资源可以作为用户的公用设施访问,从而为他们使用的云资源付费。灵活性和定制使得不同用户可以根据自己的需求访问不同的云资源,从而满足不同用户的定制需求。

例子:

1. Cloud Bio Linux是一个可以高性能生物信息学计算公开访问的虚拟机[25]。

2. Clo VR是一种便携式虚拟机,它包含了几个用于自动序列分析的管道[26]。

Bioinformatics cloud

Data in the cloud

Initial method of analysis involve downloading of data from NCBI, Ensembl, etc. and installation of softwares locally on in-house computers. Placing data and loading softwares in cloud, make a way to deliver them as DaaS or SaaS. Both can be seamlessly integrated into cloud. thus, storing of biological data achieves the aim of big data analysis within the cloud. We are using conventional biological databases instead of cloud based. But, for larger sequencing projects, generating ultra-large volumes of data, would require cloud for big data analysis and sharing [27,28]. Project like Genome 10K, 1001 Genomes Project, 1KITE, TCGA etc., are similar kind of projects requiring big data analysis, where solutions of complex biological queries involves utilization of big data tools [29].

初始分析方法涉及从NCBI,Ensembl等下载数据,并在本地计算机上安装软件。在云中放置数据和加载软件,使其成为DaaS或SaaS。两者都可以无缝集成到云。因此,生物数据的存储实现了云内大数据分析的目的。我们使用传统的生物数据库而不是云。但是,对于更大的测序项目,生成超大量的数据,将需要云进行大数据分析和共享[27,28]。像Genome 10K,1001 Genomes Project,1KITE,TCGA等项目是类似的需要大数据分析的项目,其中复杂生物查询的解决方案涉及大数据工具的利用[29]。

Transferring big data

the bottleneck of cloud computing is the transfer of data into cloud. Instead of physically shiping hard drives to the cloud center, a promising solution could be the integration of innovative transferring technologies with cloud computing. One is cloud-based Easy Genomics for high speed genomic data transfer. there was a successful event of transferring genomic data across Pacific Ocean at a rate of about 10 Gigabits per second which proved technologies to be capable of dealing with big data over the Web. Apart from this, there are technologies like data compression and Peer-to-Peer (P2P) data distribution to aid big data transfer [30].

云计算的瓶颈是将数据传输到云中。而不是将硬盘驱动器物理运送到云中心,一个有前途的解决方案可能是将创新的传输技术与云计算集成。一种是基于云的Easy Genomics,用于高速基因组数据传输。有一个成功的事件,以大约10吉比特每秒的速度跨太平洋传输基因组数据,这证明了技术能够通过网络处理大数据。除此之外,还有诸如数据压缩和对等(P2P)数据分发等技术来帮助大数据传输[30]。

Cloud-based programming

the analysis task is implemented as pipeline through linkages between the outputs of tools with the inputs of other tools, to automate the system. Development of customized pipelines is needed for the large-scale automated and configurable data analysis on a cloud-based environment.

Similar programming paradigm is adopted through Hadoop, where a single task is distributed over multiple nodes. Computational skills are required for the development of cloud-based pipelines in Hadoop without the requirement of extensive coding, rather the setting up a system for data exchange to pave the way for programming environment [31].

分析任务通过工具输出与其他工具的输入之间的联系来实现为管道,以使系统自动化。需要开发定制管道以在基于云的环境上进行大规模自动化和可配置的数据分析。

通过Hadoop采用类似的编程范例,其中单个任务分布在多个节点上。在Hadoop中开发基于云的管线需要计算技能,而不需要大量编码,而是建立一个用于数据交换的系统为编程环境铺平道路[31]。

Bioinformatics cloud

Presently, the biggest cloud provider is Amazon, providing commercial clouds for big data processing. Google is another provider allowing users to develop web applications and analyse data. there is more to be done with commercial clouds to provide ample data and software, along with keeping pace of the emerging needs of researches, which require customized clouds for bioinformatics analysis. Open access and public availability of data and software are of equal significance [32]. the availability of the cloud publicly to the scientific community is essential when data and softwares are in cloud [33]. It ensures data integration, reproducible analyses, maximum scope for sharing.

目前,最大的云提供商是亚马逊,为大数据处理提供商业云。Google是另一个供应商,允许用户开发网络应用程序和分析数据。还需要做更多的工作来提供充足的数据和软件,以及保持研究的新兴需求的步伐,这需要定制云的生物信息学分析。开放获取和数据和软件的公共可用性同等重要[32]。当数据和软件在云中时,云对科学界公开的可用性是至关重要的[33]。它确保数据集成,可重复分析,最大范围的共享。

Potential Challenges

Genomics researches with enormous amounts of data has recognized the potential benefits of moving to the cloud, but at the same time cloud computing raises some concerns as well. The optimization of the genomics analysis for the cloud has provided efficient and timely services. For instance, data can be easily run from sequencing facility to analysis pipeline on the cloud, as it is generated. However, there is need to be aware of various potential challenges in adopting cloud computing technologies.

Hadoop programming requires a high level of Java expertise; it needs to be simplified to a SQL like interface to generate parallelized programs. Standardisation of reporting and summarisation of results is a problem which is not much addressed; need is to develop better analytics and visualisation technologies. Hadoop with no front end visualisation is difficult to set, use and maintain; efforts are being made towards introducing developer friendly management interfaces instead of shell/command line interfaces.

Considering the scale of the genomic data that needs to be transmitted over internet, it takes considerably large amount of time (might extend to weeks at times). thus, the rate of transfer of data remains a bottleneck of the technology [36]. Data tenancy is another challenge. Mostly clouds provide lesser capability on data and service interoperability, making it difficult for a customer to move data and services back to an in-house IT environment or to migrate from one provider to another. Moreover, data privacy legislation, legal ownership and responsibility pertaining to data stored between international zones points at another challenge [37]. Nevertheless, genomics and proteomics research projects for sure exhibit the applications for next generation

cloud based computational biology and it essentially has the potential to revolutionise the pace of research in life sciences.

具有大量数据的基因组学研究已经认识到移动到云的潜在好处,但同时云计算也引起了一些关注。云的基因组分析的优化提供了高效和及时的服务。例如,数据可以容易地从测序设备运行到云上的分析流水线,因为它是生成的。然而,需要了解采用云计算技术的各种潜在挑战。

Hadoop编程需要高水平的Java专业知识;它需要简化为类似SQL的接口来生成并行程序。标准化报告和总结结果是一个没有得到很多解决的问题;需要开发更好的分析和可视化技术。Hadoop没有前端可视化是很难设置,使用和维护;正在努力引入开发者友好的管理接口而不是shell /命令行接口。

考虑到需要通过因特网传输的基因组数据的规模,需要相当大量的时间(可能延长到几个星期)。因此,数据传输的速率仍然是该技术的瓶颈[36]。数据租赁是另一个挑战。大多数云对数据和服务互操作性提供较少的能力,使得客户难以将数据和服务移回到内部IT环境或从一个提供商迁移到另一个。此外,数据隐私立法,法律所有权和与存储在国际区域之间的数据有关的责任指出了另一个挑战[37]。然而,基因组学和蛋白质组学研究项目肯定会展示下一代基于云的计算生物学的应用,它本质上有可能改变生命科学研究的步伐。

Security

Privacy and confidentiality is something that is must to maintain especially when dealing with health information. Cloud computing offers the use of data encryption, password protection, secure data transfer, processes’ audits, and the implementation of respective policies against data breeches and malicious use [34]. the involvement of an external entity for data storage and processing services offers added security concerns. Logging access to the data, role-based access, third party certifications, computer network security, notification alarms, change trackers, cloud usage term and associated services are made to address these concerns [35].

隐私和保密是在处理健康信息时必须保持的。云计算提供了数据加密,密码保护,安全数据传输,流程审计以及针对数据流量和恶意使用实施相应策略的使用[34]。外部实体参与数据存储和处理服务提供了额外的安全问题。记录对数据的访问,基于角色的访问,第三方认证,计算机网络安全,通知报警,变化跟踪,云使用期限和相关服务,以解决这些问题[35]。

Future in microbiology research

Petabytes of raw information can revolutionize microbiology research if we are successful to figure out how to use this gold mine. Winston Hide says “In the last five years, more scientificc data has been generated than in the entire history of mankind”. Today the data generation is light-years faster that it was just a few years ago and thus we can’t imagine the amount of digital information a vailable to us now. Like to study respiratory disease we require capturing huge quantities of data for air quality and then match it with equivalently large datasets, are studies which involve big data. We need to engage lots of eyes in this process.

如果我们成功地想出如何使用这个金矿,那么几百亿的原始信息可以革命微生物研究。温斯顿·史密斯说:“在过去五年里,生成的科学数据比人类整个历史更多。今天,数据生成比仅仅几年前的光年快,因此我们无法想象我们现在可用的数字信息量。像研究呼吸系

统疾病一样,我们需要捕获大量的空气质量数据,然后将其与等量的大数据集相匹配,是涉及大数据的研究。我们需要在这个过程中吸引大量的眼睛。

Conclusion

Cloud computing has seen a lot of hype and excitement recently but in the biotech industry it is gradually getting a recognition as a serious alternative to the hardware infrastructures already existing. Parallel DNA sequencing generates massive amount of data, and its interdisciplinary nature employ cloud computing and big data technologies in life sciences. It facilitates high throughput analytics allowing users to interrogate vast data in no time. Metagenomics, systems biology and protein structure prediction require extensive use of big data technology[12]. Metagenomics as a result of genomics revolution gave way to the sequence based analysis of the microbiome(i.e. microbial genomes), which is going to be several orders of magnitude bigger [13]. Try counting total no. of bacterial cells on earth; must be in the range of 1030, most of them still unidentified.the discovery of novel genes encode new proteins whose structure and function needs to be characterised [14].

Next generation cloud based computational biology has the potential to revolutionise life sciences. Cloud-based resources classified as DaaS, SaaS, PaaS and IaaS bears great promises in addressing big data analysis, developing variety of services for data storage, acquisition and analysis by integration of data and softwares, as efficient high-speed transfer technologies to aid the transfer of big data. It provides a light programming environment along with develop customized pipelines publicly accessible to the whole scientific community. Despite existing challenges yet to overcome, the potential advantages that these technologies can bring to the genomic research far outweigh the disadvantages.

云计算近来已经引起了大量的炒作和兴奋,但在生物技术行业,它逐渐被认为是已经存在的硬件基础设施的一个严重替代品。平行DNA测序产生大量的数据,其跨学科性质在生命科学中采用云计算和大数据技术。它促进高吞吐量分析,允许用户在没有时间查询大量数据。宏基因组学,系统生物学和蛋白质结构预测需要广泛使用大数据技术[12]。宏基因组学作为基因组学革命的结果让位于基于序列的微生物组(即微生物基因组)的分析,这将是几个数量级更大[13]。尝试计数总数。的地球上的细菌细胞;必须在1030的范围内,其中大多数仍然是未鉴定的。新基因的发现编码新的蛋白质,其结构和功能需要被表征[14]。

下一代基于云的计算生物学有可能改变生命科学。分类为DaaS,SaaS,PaaS和IaaS的基于云的资源在解决大数据分析,开发各种服务用于数据存储,通过集成数据和软件获取和分析,作为高效的高速传输技术来帮助传输大数据。它提供了轻量级的编程环境,并开发了可供整个科学界公开访问的定制管道。尽管存在着尚待克服的挑战,但这些技术可能为基因组研究带来的潜在优势远远超过了缺点。

大数据单位的换算与翻译

大数据单位的换算与翻译 近几年来,大数据这个词越来越频繁地出现在各种媒体文章上,出现各行各业人士的口中。人工翻译的行业也难免受其影响。在这方面,赛迪翻译亦深有体会。 首先也是最为重要的一点是大数据方面的词语频繁出现。 以往,我们说数据大小,常常使用的单位是MB、GB,而现在我们经常会看到TB、PB、EB、ZB、YB、BB、NB、DB。身为翻译人员,不免要弄清楚这些单位的大小和译法。 从大小方面,这几种单位依然延续了1024 的进制。即,后一个单位是前一个单位的1024 倍。在此,赛迪翻译总结了这些大数据单位。具体的大小如下: 1KB (Kilobyte) = 1024B ,即2 的10 次方字节,读音千字节 1MB (Megabyte) = 1024KB,即2 的20 次方字节,读音兆字节 1GB (Gigabyte) = 1024MB,即2 的30 次方字节,读音吉字节 1TB (Terabyte) = 1024GB,即2 的40 次方字节,读音太字节 1PB (Petabyte) = 1024TB,即2 的50 次方字节,读音拍字节 1EB (Exabyte) = 1024PB,即2 的60 次方字节,读音艾字节 1ZB (Zettabyte) = 1024EB,即2 的70 次方字节,读音泽字节 1YB (Yottabyte) = 1024ZB,即2 的80 次方字节,读音尧字节 1 BB (BrontoByte)= 1024 YB,即 2 的90 次方字节,读音波字节 1NB (NonaByte) = 1024BB,即2 的100 次方字节,读音诺字节 1DB (DoggaByte) = 1024NB,即2 的110 次方字节,读音刀字节

双语:中国姓氏英文翻译对照大合集

[ ]

步Poo 百里Pai-li C: 蔡/柴Tsia/Choi/Tsai 曹/晁/巢Chao/Chiao/Tsao 岑Cheng 崔Tsui 查Cha 常Chiong 车Che 陈Chen/Chan/Tan 成/程Cheng 池Chi 褚/楚Chu 淳于Chwen-yu

D: 戴/代Day/Tai 邓Teng/Tang/Tung 狄Ti 刁Tiao 丁Ting/T 董/东Tung/Tong 窦Tou 杜To/Du/Too 段Tuan 端木Duan-mu 东郭Tung-kuo 东方Tung-fang F: 范/樊Fan/Van

房/方Fang 费Fei 冯/凤/封Fung/Fong 符/傅Fu/Foo G: 盖Kai 甘Kan 高/郜Gao/Kao 葛Keh 耿Keng 弓/宫/龚/恭Kung 勾Kou 古/谷/顾Ku/Koo 桂Kwei 管/关Kuan/Kwan

郭/国Kwok/Kuo 公孙Kung-sun 公羊Kung-yang 公冶Kung-yeh 谷梁Ku-liang H: 海Hay 韩Hon/Han 杭Hang 郝Hoa/Howe 何/贺Ho 桓Won 侯Hou 洪Hung 胡/扈Hu/Hoo

花/华Hua 宦Huan 黄Wong/Hwang 霍Huo 皇甫Hwang-fu 呼延Hu-yen J: 纪/翼/季/吉/嵇/汲/籍/姬Chi 居Chu 贾Chia 翦/简Jen/Jane/Chieh 蒋/姜/江/ Chiang/Kwong 焦Chiao 金/靳Jin/King 景/荆King/Ching

大数据时代英语演讲

Hello, everyone. As we all know, we are now living at the age ofbig data, which leads a revolution that transforms how we think. But many people have half know of big data, they haven’t adopted to the tremendous change. So today I’d like to talk about the three peculiarities of big data. I will be very glad if my speech could help you. Firstly, all samples rather than sampling analysis. With the development of technology, we are able to process massive data, from which we can get more reliable result than through sampling analysis. So we can give up sampling analysis in most cases. Secondly, efficiency rather than accuracy. At the age of big data, we concern more on efficiency rather than accuracy. In other words, we should allow faults to improve efficiency. Comparing with massive data, some faults do not influence the final result. Thirdly, correlation is as important as causality. We can’t deny the importance of causality ,but sometimes it is difficult or unnecessary to explore it. For example, the recommendation system of Amazon doesn’t know why the customer who likes Hemingway’s works is likely to buy Fitzgerald’s works, it just recommends and helps Amazon sell more than 100 times books than before Amazon using it. From this example, we can conclude that correlation plays an important role at the age of big data,so please don’t overlook it.

大英文翻译

001 不忘初心,牢记使命。 Remain true to our original aspiration and keep our mission firmly in mind. 002 这是我国发展新的历史方位。 This is a new historic juncture in China’s development. 003 新时代我国社会主要矛盾是人民日益增长的美好生活需要和不平衡不充分的发展之间的矛盾。 The principal contraction facing Chinese society in the new era is that between unbalanced and inadequate development and the people’s ever-growing needs for a better life. 004 党在新时代的强军目标是建设一支听党指挥、能打胜仗、作风优良的人民军队,把人民军队建设成为世界一流军队。 The Party’s goal of building a strong military in the new era is to build the people’s forces into world-class forces that obey the Party’s command, can fight and win, and maintain excellent conduct. 005 推动构建人类命运共同体 build a community with a shared future for mankind 006 近代以来久经磨难的中华民族迎来了从站起来、富起来到强起来的伟大飞跃。 The Chinese nation, which since modern times began had endured so much for so long, has achieved a tremendous transformation ——it has stood up, grown rich, and become strong. 007 行百里者半九十。 As the Chinese saying goes, the last leg of a journey just marks the halfway point. 008 敢于直面问题,敢于刮骨疗毒。 We must have the courage to face problems squarely, be braced for the pain. 009 不断增强党的政治领导力、思想领导力、群众组织力、社会号召力。 We must keep on strengthening the Party’s ability to lead politically, to guide through theory, to organize the people, and to inspire society. 010 保持政治定力,坚持实干兴邦。 We must maintain our political orientation, do the good solid work that sees our country thrive. 011 坚持党对一切工作的领导。 Ensuring Party leadership over all work. 012 坚持以人民为中心。 Committing to a people-centered approach. 013 坚持全面深化改革。 Continuing to comprehensively deepen reform.

大数据时代的翻译理论与实践

大数据时代的翻译理论与实践 专业方向:翻译理论与实践班级:14级英语语言文学研究生 作者:张琦学号:81420379 摘要:此篇文献综经过查阅大量文章,总结了翻译理论与实践之现状与大数据时代对翻译理论与实践之影响。对于前者,即翻译理论与实践之现状,又包括当下翻译理论与实践现状:翻译理论与实践之关系;对于理论方面的现状包括翻译标准,翻译原则方面,以及翻译家翻译法之研究;而实践方面包括对翻译教学等方面研究。而大数据时代的来临,渐渐对翻译理论与实践的影响包括对翻译理论和翻译实践的影响。本文献综述通过此种分类方式,可以有效的探讨翻译理论与实践受大数据时代的影响。、关键字:翻译理论;实践;现状;大数据时代;影响 近年来对于翻译的研究从理论到对实践方面的研究。然而,随着大数据时代的来临,翻译理论与实践也在一定程度上受到影响。翻译的理论与实践也从传统信息受限单一的状态到数据信息共享状态,进而对翻译理论与实践方面产生一定的影响。 针对大数据对翻译理论与实践的影响,本人查阅中国知网等网站,万方数据等网站,发现涉及此的文献不在少数。 本文献综述从翻译理论与实践的现状开始探讨,将其从理论与实践方面分别进行探究。对于理论方面又继续细分,包括对翻译规则与翻译标准现状之探讨,以及对当代翻译家的翻译法与翻译特色的介绍;对于实践方面,本人从对翻译教学等方面介绍其现状。而后,引入大数据时代对翻译理论与实践的影响,也从对其理论与实践方面开始分析。 对于所查阅之文献,本人对其进行归纳整理,认为对大数据时代的翻译理论与实践的研究有所帮助。 一、翻译理论与实践的现状 在大数据时代影响之前,翻译理论与实践存在于闭塞状态,我们对翻译的理论与实践的探讨也比较狭隘。本文将从对翻译的理论与实践的现状进行介绍。先介绍翻译理论与实践的关系,然后对于理论部分,包括对翻译的标准、规则等方面的介绍;对于实践部分,包括对教学等方面的介绍。 理论与实践的关系

大数据时代英文翻译

Era of Big Data is a woman's age; women in the gene can accumulate and deal with big data/ women are born to accumulate and deal with big data. Many men and children, in fact, have been wondering about this special ability of women. Like, as a child, just as soon as you entered the house your Mother said immediately in a suspicious tone: “Liu zhijun, you didn’t do well in the exam today, did you.” Another example, you just have a glance at the mobile phone, your wife laughs: “Does Er gou the next door ask you to play games?” One more ex ample, when you close the door and make a phone call, your girlfriend will cry: “Who are shot in bed?” They are sometimes right, sometimes wrong. However, On the whole, the accuracy rate is higher than chance level. When they are wrong, men would sneer women always give way to foolish fancies; when they are right, men would say women are sensitive animal maybe with more acute sensory organs. Anyway, that is a guess. It has already scared man that overall accuracy rate is higher than the random level. In order to adapt to this point, the male also developed a very strong skills against reconnaissance. This part is beyond the scope of this article, so no more details about it. Some studies, such as Hanna Holmes’s paper, have indi cated that the white matter of the female’s brain is higher than that of the male. So they have very strong imagination of connecting things together. Some recent studies have shown that women are better than men in the "date" memory. That is the reason why they are able to remember all the birthdays, anniversaries, and even some of the great day of unimportant friends. No matter whether these results are true or not, I am afraid that this is not women's most outstanding ability. Women's most remarkable ability is a long-term tracking of some seemingly unimportant data to form their own baseline and pattern. Once the patterns of these data points are significantly different from the baseline she is familiar with, she knows something unusual. In their daily life, women do not consider the difference between causality and correlation. They believe in the principle: "There must be something wrong out of something unusual." People who talk about big data often take Lin Biao as an example. Lin Biao recorded some detailed and unimportant data after a battle. Such as seized guns, the proportion of rifles and pistols, the age levels of war prisoners, seized grain, whether they are sorghum or millet, etc., all of which were unavoidably recorded in the book. Others laughed at him. But later, he determined where the enemy headquarters were according to these data. What women do is almost the same. A girl A has a secret crush on boy B, but she usually doesn’t contact him directly. Two days later, I asked her if she wanted to ask him to have dinner together. She said he was playing. I wondered “how do you know that?” She said that boy B usually is on the line Gmail at 8:00 am, away status at8:30am, for he goes out to buy coffee and breakfast, on line again at 9:00am, busy status, for he is at work, away again at12:30am for lunch, on line for whole evenings, maybe for reading or playing games. His buddy C is on line at10:00 am, still online till 2:00am next day. He is a boy who gets up late and stays up late. His buddy D is on line for the most of the day. However, the most important pattern is that there are 2-3 days per week, during which they would be offline or away for 3-4 hours together. Conclusion: they are playing together.

中国姓氏英文翻译大全

中国姓氏英文翻译大全- - [b][u]中国姓氏英文翻译大全- -[/u][/b] A: 艾--Ai 安--Ann/An 敖--Ao B: 巴--Pa 白--Pai 包/鲍--Paul/Pao 班--Pan 贝--Pei 毕--Pih 卞--Bein 卜/薄--Po/Pu 步--Poo 百里--Pai-li C:

蔡/柴--Tsia/Choi/Tsai 曹/晁/巢--Chao/Chiao/Tsao 岑--Cheng 崔--Tsui 查--Cha 常--Chiong 车--Che 陈--Chen/Chan/Tan 成/程--Cheng 池--Chi 褚/楚--Chu 淳于--Chwen-yu D: 戴/代--Day/Tai 邓--Teng/Tang/Tung 狄--Ti 刁--Tiao 丁--Ting/T 董/东--Tung/Tong 窦--Tou 杜--To/Du/Too

段--Tuan 端木--Duan-mu 东郭--Tung-kuo 东方--Tung-fang E: F: 范/樊--Fan/Van 房/方--Fang 费--Fei 冯/凤/封--Fung/Fong 符/傅--Fu/Foo G: 盖--Kai 甘--Kan 高/郜--Gao/Kao 葛--Keh 耿--Keng 弓/宫/龚/恭--Kung 勾--Kou

古/谷/顾--Ku/Koo 桂--Kwei 管/关--Kuan/Kwan 郭/国--Kwok/Kuo 公孙--Kung-sun 公羊--Kung-yang 公冶--Kung-yeh 谷梁--Ku-liang H: 海--Hay 韩--Hon/Han 杭--Hang 郝--Hoa/Howe 何/贺--Ho 桓--Won 侯--Hou 洪--Hung 胡/扈--Hu/Hoo 花/华--Hua 宦--Huan 黄--Wong/Hwang

互联网时代的11个单词

Internet 互联网WWW 万维网Internet,译为互联网,或者因特网,其实这已经反应了两种翻译专有名词的基本思路,译“互联网”,属于从词义本身出发,inter+net;而译为“因特网”,则属于音译法。英译汉如此,汉译英又何尝不能如此呢?于是“福娃”不再是Friendlies,索性Fuwa了。其实译为“因特网”,对于我们的父辈来说,也许并不陌生。《国际歌》的最后几句,“英特纳雄耐尔,一定要实现”,International,共产国际,国际主义,据查最早是瞿秋白音译为“英特纳雄耐尔”。 Digital 数字的,数码的十多年前,美国MIT教授Nicolas Negroponte的名著,Being Digital,成为启迪众多互联网先驱的宝典。这个词的重要性,怎么说都不过分。就拿单词来说吧。比如,DC与DV已成为每个家庭的必备。那么下次向您不通晓英语的父母亲解释以下,为什么这里是两个D打头吧。顺便清点一下您有多少数码设备吧。 Grassroots 草根除了表示“基层的”,这个词最早和自下而上的选举活动有关,比如Wiki给的一个解释是:a grassroots political movement is one driven by the constituents of a community. 之后所有的衍生意思,其实都离不开这一本意。比如Blog,博客,web+log,网络日志;比如Audition海选。 Anonymity 匿名Real Name System 实名制如果网络是实名制的,网络将会怎样?至少现在,是可以匿名的,on the Internet, no one knows you are a dog. 于是据说网络的游戏规则是不一样的,谩骂诽谤是家常便饭的,很多时候也不需要为自己的言行负责任的。而一旦你用现实世界的游戏规则去评判网络虚拟世界,只能感慨,“不是我不明白,这世界变化快”。 Mob 暴民不要被这个词的中文意思吓倒。网络暴民,已经司空见惯。那是真正的overwhelming. 课上和学生说过另一个词,smart mob. 有兴趣的可以去查询一下。至少比单纯的网络暴民要有内涵得多。 Censorship 审查这是一个与道德标准甚至意识形态高度相关的单词。当您下次在网上看到某些敏感词汇用拼音缩写或者中间加星号来表示时,就属于规避审查,evasion of censorship. Disruptive 颠覆性的个人觉得比译为“破坏性的”要更准确,因为该词本意是指某项技术对某领域产生根本性的影响,比如wiki给的定义是A disruptive technology or disruptive innovation is a technological innovation, product, or service that eventually overturns the existing dominant technology or status quo product in the market.比如Google, Y ouTube这个当量的,当得起disruptive. Ecosystem 生态环境互联网技术的发展,带动了一系列相关产业的发展,于是有了upstream, downstream,上游,下游产业。一个简单例子就是,昂立口译项目的不断发展壮大,于是提供大量口译教程的出版社,成了我们的上游,而昂立学院旁边的麻辣烫每天中午人满为患,则勉强也算下游产业了。突破这样的线性思维方式,将所有的相关产业及相互关系都放在一起考虑,就有了business ecosystem,商业生态环境。 网民netizen,网~从网民,Net+Citizen,到以网字开头的构词方式,网吧,网恋,网游,网管,至少在口译处理时,还不能指望都象网民一样,有一个词可以精确对应,不妨先拆开来解释更安全。

英文翻译常用技巧大全

英汉两种语言在句法、词汇、修辞等方面均存在着很大的差异,因此在进行英汉互译时必然会遇到很多困难,需要有一定的翻译技巧作指导。常用的翻译技巧有增译法、省译法、转换法、拆句法、合并法、正译法、反译法、倒置法、包孕法、插入法、重组法和综合法等。这些技巧不但可以运用于笔译之中,也可以运用于口译过程中,而且应该用得更加熟练,因为口译工作的特点决定了译员没有更多的时间进行思考。 1.增译法:指根据英汉两种语言不同的思维方式、语言习惯和表达方式,在翻译时增添一些词、短句或句子,以便更准确地表达出原文所包含的意义。这种方式多半用在汉译英里。汉语无主句较多,而英语句子一般都要有主语,所以在翻译汉语无主句的时候,除了少数可用英语无主句、被动语态或"There be…"结构来翻译以外,一般都要根据语境补出主语,使句子完整。英汉两种语言在名词、代词、连词、介词和冠词的使用方法上也存在很大差别。英语中代词使用频率较高,凡说到人的器官和归某人所有的或与某人有关的事物时,必须在前面加上物主代词。因此,在汉译英时需要增补物主代词,而在英译汉时又需要根据情况适当地删减。英语词与词、词组与词组以及句子与句子的逻辑关系一般用连词来表示,而汉语则往往通过上下文和语序来表示这种关系。因此,在汉译英时常常需要增补连词。英语句子离不开介词和冠词。另外,在汉译英时还要注意增补一些原文中暗含而没有明言的词语和一些概括性、注释性的词语,以确保译文意思的完整。总之,通过增译,一是保证译文语法结构的完整,二是保证译文意思的明确。如: (1)What about calling him right away? 马上给他打个电话,你觉得如何?(增译主语和谓语) (2)If only I could see the realization of the four modernizations. 要是我能看到四个现代化实现该有多好啊!(增译主句) (3) Indeed, the reverse is true 实际情况恰好相反。(增译名词) (4) 就是法西斯国家本国的人民也被剥夺了人权。 Even the people in the fascist countries were stripped of their human

大数据时代的自然语言处理

言处理的专著并不多见,国内已有的几本专著(包括译著),除了2008年清华大学出版社出版的该书第一版和2010年中国科学技术大学出版的冯志伟教授的《自然语言处理的形式模型》以外,大多数是10年以前撰写的。而《自然语言处理的形式模型》对统计方法的介绍较为简单。随着大数据时代的到来,统计方法的发展日新月异,很多最新方法和新模型是这两本专著中未能包含的。国外这一领域的主要专著是美国麻省理工学院出版社于1999年出版(2000年校正) 的克里斯托夫·曼宁斯(Christopher D. Manning) 和辛里奇·舒尔策(Hinrich Schütze)撰写的Foundations of Statistical Natural Language Process- ing (2005年由苑春法等翻译成中文),以及2000年普伦蒂斯·霍尔出版社(Prentice Hall)出版的丹尼尔·朱拉斯凯(Daniel Jurafsky)和詹姆斯·马丁(James H. Martin)撰写的Speech and Language Processing: An Introduction to Natural Language Processing, Com- putational Linguistics, and Speech Recognition (2005年由冯志伟和孙乐翻译成中文。2009年该书出版了第2版) 。一方面,这些外文专著出版的时间仍然较早,而另一方面,它们对很多中文信息处理的最新进展都没有涉及,更不涉及我国的少数民族语言信息处理技术,如维语人名识别、藏文分词等。《统计自然语言处理(第2版)》恰好弥补了这些缺失。(2)在写作方式上,作者首先从分析问题入手,介绍 大数据时代的自然语言处理 ——评《统计自然语言处理(第2版)》 关键词:自然语言处理 统计方法 专著 赵东岩 北京大学 网络搜索、机器翻译、智能问答、信息安全等一系列与自然语言处理相关的应用需求,在大数据时代更为人们关注。云计算、大数据、社会计算、数据挖掘等一批新术语也如雨后春笋般涌现,成为众多会议和论坛讨论的话题。然而,当人们拂去表层繁花,拨开缭乱云雾,静下心来思考:大数据时代对自然语言处理技术的根本挑战是什么?近十年来统计自然语言处理研究有哪些实质性的进展?自然语言理解技术在网络信息处理、多语言机器翻译和人机交互中有哪些实际应用?对于这些问题,每一位专家都会从不同的视角给出答案。中国科学院自动化研究所研究员宗成庆撰写的《统计自然语言处理(第2版)》,对自然语言处理的核心技术及其最新进展进行了全面、系统的阐述。基于多年的深入研究与总结提炼、经过缜密思考和严谨论证,他给出了对上述问题较为深刻与独到的回答,为当前自然语言处理技术的深入研究和应用开发提供了翔实的资料。 《统计自然语言处理(第2版)》是清华大学出版社2013年8月出版的。全书共16章,87万字。综观全书,该书具有如下特点:(1)内容新颖,非常全面。该书16章内容几乎涵盖了自然语言处理领域的每一个侧面,从词法到语义,从理论到应用,大多都是近年来该领域最新的研究成果和先进技术。如此丰富的内容和新颖的技术,是在已有的自然语言处理专著中所没有的。国内外有关自然语

大型火力发电厂专业词汇中英文翻译大全

主要监视参数 UNIT LOAD __MW 机组负荷 FW FLOW __t/h 给水流量 MN STM PRS __MPa 主蒸汽压力 MN STM TEM __℃主蒸汽温度 RH STM PRS __ MPa 再热蒸汽压力 RH STM PRS __℃再热蒸汽压力 FNC PRS __ KPa 炉膛压力 TOTAL FUEL __t/h 总燃料量 V ACUUM __ KPa 真空 UNIT FREQ __Hz 机组频率 GEN VOLT __KV 发电机出口电压 GEN CRRT __A 发电机出口电流 POWER MONITOR 电量监视POWER SUPPL Y MONITORING 供电监视 TO TEST SCREEN 去试验画面 Communication OK 联络正常 PRI 优先级 SEC 次级 SYSTEM STATUS 系统状态System Status Display 系统状态显示 Green绿色:Normal Mode 正常模式 Cyan蓝色:Standby Mode 备用模式 Yellow黄色:Backup Mode 备用模式 White白色:Startup Mode 启动模式 Red红色:Drop in Alarm or Fault 报警或故障显示 Orange橙色:Fatled Mode 故障模式 Magenta红紫色:Operator Attention 操作注意事项 Gray灰色:Drop off Highway 信息高速公路中断 Select Drop Priar to Selecting Menu Function DROP DETAILS 中断详情 ACK DROP ALARM 动作失败报警 CLR DROP ALARM 语音失败报警 汽机 MAIN/RHT STM SYS 主/再热蒸汽系统AUX STM 辅汽 BFPT 给水泵汽轮机 CND WTR SYS 凝结水系统 CNSR 凝汽器 COLD RH STM 再热蒸汽冷端 DRN FLTK 疏水扩容器 FEED WTR SYS 给水系统

英文翻中文的八大翻译技巧

英文翻中文的八大翻译技巧 要真正掌握英文翻译的技巧并非易事。这是因为英译汉时会遇到各种各样的困难,下面就和大家分享英文翻中文的八大翻译技巧推荐,来欣赏一下吧。 英文翻中文的八大翻译技巧推荐 首先是英文理解难,这是学习、使用英文的人的共同感觉,由于两国历史、文化、风俗习惯的不同,所以一句英文在英美人看来顺理成章,而在中国人看来却是颠颠倒倒、断断续续,极为别扭。 二是中文表达难,英译汉有时为了要找到一个合适的对等词汇,往往被弄得头昏眼花,好像在脑子里摸一个急于要开箱子的钥匙,却没有。 另外,英译汉时对掌握各种文化知识的要求很高,因为我们所翻译的*,其内容可能涉及到极为广博的知识领域,而这些知识领域多半是我们不大熟悉的外国的事情,如果不具备相应的文化知识难免不出现一些翻译中的差错或笑话。 正是因为英译汉时会遇到这么多的困难,所以,我们必须通过翻译实践,对英汉两种不同语言的特点加以对比、概况和总结,

以找出一般的表达规律来,避免出现一些不该出现的翻译错误,而这些表达的规律就是我们所说的翻译技巧。 一、词义的选择和引伸技巧 英汉两种语言都有一词多类和一词多义的现象。一词多类就是指一个词往往属于几个词类,具有几个不同的意义;一词多义就是同一个词在同一词类中又往往有几个不同的词义。在英译汉的过程中,我们在弄清原句结构后,就要善于运用选择和确定原句中关键词词义的技巧,以使所译语句自然流畅,完全符合汉语习惯的说法;选择确定词义通常可以从两方面着手: 1、根据词在句中的词类来选择和确定词义 They are as like as two peas .他们相似极了。(形容词) He likes mathematics more than physics .他喜欢数学甚于喜 欢物理。(动词) Wheat, oat, and the like are cereals .小麦、燕麦等等皆系谷类。(名词) 2、根据上下文联系以及词在句中的搭配关系来选择和确定词义。 He is the last man to come .他是最后来的。 He is the last person for such a job .他最不配干这个工作。

大数据时代汉英语际对应词的挖掘

摘要文章分析了大数据时代词典编纂可用或可参考数据的特点,探索如何从海量数据中挖掘汉英语际对应词等词汇知识,还简要探讨了与数据或语料使用相关的问题。挖掘实践表明:充分利用可用资源,从纷杂的大数据中可以挖掘出所需的词汇知识,但目前仍需专业人员进行筛选、认定和解读。词典要保持生命力必须及时修订和收录新词。对于汉英词典来说,提供汉语词语的地道英语对应词会提升其实用价值。研究语际对应词挖掘不仅有助于编纂出符合用户需求的双语词典,对构建大数据语言资源库和开发挖掘分析软件也有参考价值。 关键词大数据时代对应词新词挖掘汉英词典 一、研究背景 移动互联网的飞速发展加快了媒体融合的进程,也使传统的词典学研究和词典编纂实践面临挑战。不仅纸质词典,就连掌上型电子词典也遭到了前所未有的冷遇。人们更喜欢通过智能手机或计算机查询在线网络词典或离线电子词典。 大数据的应用前景广阔。但是,词典学研究和词典编纂可用的数据是大数据吗?词典学研究和词典编纂真的需要大数据吗?我们尝试从大数据时代词典学研究和编纂实践可用数据的特点出发回答第一个问题,结合汉英语际对应词的挖掘实例分析回答第二个问题,还简要探讨与数据或语料使用相关的问题。 二、大数据与编纂词典的可用数据 1. 大数据的特点 传统意义上的“数据”指的是“有根据的数字”。现在,“数据”不仅指“数字”,还统称一切保存在电脑中的信息(包括文本、声音、视频等)。(赵勇,徐轲2014)在这个信息爆炸的时代,经过一定时间的积累就会出现海量或巨量的数据。过去,计算机存储信息或数据的计量单位用gb/gigabytes(1gb=1024mb)就已经很大了。现在用到tb/terabytes(1tb=1024gb),pb/petabytes(1pb=1024tb=1048576gb),甚至更大的计量单位。但是,不能简单地认为数量大就是大数据。大数据的体量巨大,不仅存储量大,计算量也大,超出了传统数据处理方法所能管理和处理的能力。 现在具有代表性的观点是大数据具备4v特征:(1)数据量庞大(volume)。(2)数据呈现多样性(variety),不但类型多(如文本、网页、图片、音频、视频和位置信息等),而且来自多种数据源,不仅有结构化数据,更多的是半结构化数据和非结构化数据。(3)时效性(velocity),即数据增长速度快、变化速度快,处理速度也要求快,包括大量的在线或 (4)实时数据分析处理。例如电子商务对销售数据的实时快速分析就意味着能及时抓住商机。 数据价值高(value),但价值密度低,即价值与数据总量之比很低,需要对海量的数据进行挖掘分析才能形成用户价值。如在长时间连续的监控视频中查找犯罪线索,有用的数据可能只有短短几秒钟。(赵勇,徐轲2014;严霄凤,张德馨2013;宗威,吴锋2013) 2. 大数据的定义 信息时代的“数据”概念是明确的,但是对于“大数据”至今还没有一个公认的标准定义。 美国国家科学基金会(nsf)将大数据定义为:“由科学仪器、传感设备、互联网交易、电子邮件、音频视频软件、网络点击流等多种数据源生成的大规模、多元化、复杂、长期的分布式数据集。”(黄南霞,谢辉,王学东2013) 李战怀、王国仁和周傲英(2013)从数据库研究者的视角对大数据进行了解读,认为大数据是个笼统的概念。他们指出:“与应用密切相关的各类数据都属于大数据范畴,大数据强调支持实际应用所涉及到的多个来源且相互关联的大量、高速、异构数据;世界上凡是可以表达出来的信息都是数据;当为了一个具体的应用而需要把大量的不同类型、质量各异的数据及时进行处理时,这些数据就进入了大数据的范畴。”

中国姓氏英文翻译大全

A: 艾--Ai 安--Ann/An 敖--Ao B: 巴--Pa 白--Pai 包/鲍--Paul/Pao 班--Pan 贝--Pei 毕--Pih 卞--Bein 卜/薄--Po/Pu 步--Poo 百里--Pai-li C: 蔡/柴--Tsia/Choi/Tsai 曹/晁/巢--Chao/Chiao/Tsao 岑--Cheng 崔--Tsui 查--Cha 常--Chiong 车--Che 陈--Chen/Chan/Tan 成/程--Cheng 池--Chi 褚/楚--Chu 淳于--Chwen-yu D: 戴/代--Day/Tai 邓--Teng/Tang/Tung 狄--Ti 刁--Tiao 丁--Ting/T 董/东--Tung/Tong 窦--Tou 杜--To/Du/Too 段--Tuan 端木--Duan-mu 东郭--Tung-kuo 东方--Tung-fang E: F: 范/樊--Fan/Van 房/方--Fang 费--Fei 冯/凤/封--Fung/Fong 符/傅--Fu/Foo G: 盖--Kai 甘--Kan 高/郜--Gao/Kao 葛--Keh 耿--Keng 弓/宫/龚/恭--Kung 勾--Kou 古/谷/顾--Ku/Koo 桂--Kwei 管/关--Kuan/Kwan 郭/国--Kwok/Kuo 公孙--Kung-sun 公羊--Kung-yang 公冶--Kung-yeh 谷梁--Ku-liang H: 海--Hay 韩--Hon/Han 杭--Hang 郝--Hoa/Howe 何/贺--Ho 桓--Won 侯--Hou 洪--Hung 胡/扈--Hu/Hoo 花/华--Hua 宦--Huan 黄--Wong/Hwang 霍--Huo 皇甫--Hwang-fu 呼延--Hu-yen I: J: 纪/翼/季/吉/嵇/汲/籍/姬--Chi 居--Chu 贾--Chia 翦/简--Jen/Jane/Chieh 蒋/姜/江/--Chiang/Kwong 焦--Chiao 金/靳--Jin/King 景/荆--King/Ching 讦--Gan K: 阚--Kan 康--Kang 柯--Kor/Ko 孔--Kong/Kung 寇--Ker 蒯--Kuai 匡--Kuang L: 赖--Lai 蓝--Lan 郎--Long 劳--Lao 乐--Loh 雷--Rae/Ray/Lei 冷--Leng 黎/郦/利/李--Lee/Li/Lai/Li 连--Lien 廖--Liu/Liao 梁--Leung/Liang 林/蔺--Lim/Lin 凌--Lin 柳/刘--Liu/Lau 龙--Long 楼/娄--Lou 卢/路/陆鲁--Lu/Loo 伦--Lun 罗/骆--Loh/Lo/Law/Lam/Rowe 吕--Lui/Lu 令狐--Lin-hoo M: 马/麻--Ma 麦--Mai/Mak 满--Man/Mai 毛--Mao

相关主题
文本预览
相关文档 最新文档