AVATAR information extraction system
How Avatarify works

Avatarify is a deep learning technique that animates the face in a static picture, so that a still image can move like real video. Its pipeline involves several steps: face detection, facial landmark localization, expression transfer, and image compositing. Each step is described in detail below.
1. Face detection: Avatarify first uses a deep learning detector to locate faces in the input image. Convolutional neural network (CNN) based detectors are the usual choice; MTCNN (Multi-Task Cascaded Convolutional Network), for example, localizes faces effectively.
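As an illustration, a detection step of this kind might look like the following sketch. It assumes the third-party facenet-pytorch package (pip install facenet-pytorch) and a hypothetical input file portrait.jpg; Avatarify itself may use a different detector.

```python
# Hedged sketch: MTCNN face detection via facenet-pytorch (assumed installed).
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(keep_all=True)        # keep_all=True returns every detected face
image = Image.open("portrait.jpg")     # hypothetical input image
boxes, probs = detector.detect(image)  # bounding boxes and confidence scores

if boxes is not None:
    for box, prob in zip(boxes, probs):
        print(f"face at {box.tolist()}, confidence {prob:.2f}")
```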
2. Facial landmark localization: once a face is found, Avatarify uses a learned model to locate key facial landmarks, typically around the eyes, nose, and mouth. Widely used landmark tools include Dlib and OpenPose, which can place these keypoints accurately in an image.
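For example, Dlib's 68-point landmark model can be used as in this sketch; the model file shape_predictor_68_face_landmarks.dat must be downloaded separately, and the input file name is a placeholder.

```python
# Hedged sketch: 68-point facial landmark localization with dlib.
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("portrait.jpg")        # hypothetical input image
for face in detector(img):
    shape = predictor(img, face)                 # 68 (x, y) keypoints
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    print(f"detected {len(points)} landmarks, first five: {points[:5]}")
```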
3. Expression transfer: using the landmark positions as conditioning information, Avatarify performs expression transfer with a Conditional Generative Adversarial Network (CGAN). A CGAN consists of two modules: a generator, which turns the static facial expression into a moving one, and a discriminator, which judges whether a generated expression looks real. Through CGAN training, Avatarify learns to generate, from a single static picture, dynamic facial expressions that stay consistent with the original image.
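The toy sketch below shows the general shape of such a conditional generator/discriminator pair in PyTorch. It is illustrative only: all layer sizes are invented, and real Avatarify-style systems use far more elaborate convolutional motion-transfer architectures rather than these fully connected stubs.

```python
# Toy conditional GAN skeleton (invented sizes; not the actual Avatarify model).
import torch
import torch.nn as nn

LANDMARK_DIM = 68 * 2      # flattened (x, y) facial keypoints
IMG_DIM = 64 * 64 * 3      # flattened low-resolution RGB frame

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + LANDMARK_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, IMG_DIM), nn.Tanh())

    def forward(self, src_img, driving_landmarks):
        # condition the static source face on the driving landmarks
        return self.net(torch.cat([src_img, driving_landmarks], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + LANDMARK_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1), nn.Sigmoid())

    def forward(self, img, landmarks):
        # score how plausible the frame is for the given landmark condition
        return self.net(torch.cat([img, landmarks], dim=1))
```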
4. Image compositing: finally, Avatarify composites the generated expression frames with the original image to produce an animated video. Blending each generated frame into the source picture makes it look as though the face in the original photo is moving. Compositing methods range from interpolation-based blending to deep learning based approaches.
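A minimal compositing step could use OpenCV's Poisson blending, as in this sketch; the file names, the all-ones mask, and the centered placement are assumptions.

```python
# Hedged sketch: blending a generated face frame into the source photo.
import cv2
import numpy as np

face_frame = cv2.imread("generated_face.png")    # hypothetical generated frame
background = cv2.imread("original_photo.png")    # hypothetical source photo

mask = 255 * np.ones(face_frame.shape[:2], dtype=np.uint8)
center = (background.shape[1] // 2, background.shape[0] // 2)

blended = cv2.seamlessClone(face_frame, background, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("composited.png", blended)
```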
Note that training such a model demands large amounts of data and compute. Commonly used datasets include CelebA and the MUG Facial Expression Database. To further improve the realism of the generated expressions, techniques such as cycle-consistent GANs (CycleGAN) and transfer learning can also be applied.
Bonas ID photo capture: product overview (a specialist provider of ID photo capture solutions, 2015-04-08)

I. Second-generation ID card portrait capture solutions

To meet increasingly diverse user needs, Bonas (博纳思) keeps innovating in second-generation ID card portrait capture. Its BNS100 capture system currently comes in three editions: standard, self-service, and mobile.

1. BNS100 Portrait Capture System V2.6 (standard edition): a compliant ID photo in just three mouse clicks. The standard edition was developed in-house in strict accordance with GA 461-2004, "Technical requirements for digital photographs used in resident identity card production", and holds a Ministry of Public Security certification (certificate no. 公京检第051216号). It captures portraits with a tablet's built-in camera and processes them with the "BNSFacePROC" core algorithm, which performs automatic face localization, cropping, color correction, hair-edge processing, background removal, tilt correction, highlight suppression, and sharpness adjustment. The product targets household-registration police officers photographing citizens for second-generation ID cards. Although feature-rich (including automatic ID-number digit upgrading and on-save compliance checking), it is deliberately simple to operate: no specialist computer knowledge is needed, and all functions can be mastered quickly. Thanks to its ease of use, low maintenance, high degree of automation, and high photo pass rate, it enjoys a strong reputation and a dominant market share in the industry.

Features. Capture: photo preview with automatic voice prompts. Intelligent processing: face finding, automatic cropping, color correction, background removal, highlight suppression, sharpness adjustment, tilt correction, and hair-edge processing. Manual adjustments and settings: background strength, midtones, saturation, and more. Validation: on save, photos are checked against the Ministry of Public Security standard and stored only if compliant.

2. BNS100 Portrait Capture System, self-service edition ("self-service shooting, shoot until satisfied"): lets citizens take part and photograph themselves. Like the standard edition, it was developed in strict accordance with GA 461-2004 and holds the same Ministry of Public Security certification (certificate no. 公京检第051216号).
How Avatarify works (alternative description)

Avatarify is a machine learning technique that animates the facial region of a static picture. It works by combining two neural networks: a pretrained face recognition network and a generative adversarial network (GAN).

First, Avatarify uses the pretrained face network to detect and locate faces in the input picture. The network identifies faces accurately and marks key facial points such as the eyes and mouth.

Next, Avatarify uses the GAN to carry out the expression transfer. The GAN has two parts: a generator, which converts the static input image into moving facial expressions, and a discriminator, which judges whether the generator's output looks real. By learning the correspondence between input images and real video of moving faces, the generator produces convincing dynamic expressions.

Training relies on large collections of real video and static images. Over many iterations, the adversarial game between the two networks pushes the generator toward more realistic expressions while the discriminator gets better at telling real frames from generated ones.
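Schematically, one adversarial training iteration looks like the sketch below. It assumes generator and discriminator modules of the kind outlined earlier plus batches of (static image, real frame, landmark) inputs; the loss formulation is the standard binary cross-entropy GAN objective, not necessarily the one Avatarify uses.

```python
# Schematic GAN training step (assumed module interfaces; illustrative only).
import torch
import torch.nn.functional as F

def train_step(gen, disc, opt_g, opt_d, static_img, real_frame, landmarks):
    # --- discriminator update: push real frames toward 1, fakes toward 0 ---
    fake_frame = gen(static_img, landmarks).detach()
    d_real = disc(real_frame, landmarks)
    d_fake = disc(fake_frame, landmarks)
    d_loss = (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- generator update: try to make the discriminator output 1 on fakes ---
    g_score = disc(gen(static_img, landmarks), landmarks)
    g_loss = F.binary_cross_entropy(g_score, torch.ones_like(g_score))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```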
Once training is complete, Avatarify can animate the face in any input still: given a single static image, the trained model turns the facial region into moving expressions, so the face appears to be acting in real time.

Its applications are broad, spanning video editing, visual effects, and animated expressions for virtual characters. In video editing, animating a still subject's expression makes footage livelier; in effects work, Avatarify can add dynamic expressions to the people in a scene, increasing realism; for virtual characters, it can drive facial expressions in real time, making avatars more lifelike.

In short, Avatarify is a machine learning technique that animates faces in static pictures. By combining a pretrained face recognition network with a generative adversarial network, it converts still images into dynamic expressions and supports a wide range of uses in video editing, visual effects, and virtual character animation.
Research on privacy-preserving image feature extraction for cloud computing

With the rapid development of cloud computing, ever more individuals and organizations store their data in the cloud so that it can be accessed and shared from anywhere, and the accompanying privacy concerns have attracted wide attention. In a cloud environment, user data typically sits on servers outside the user's control, which makes privacy protection challenging; when image data is stored and processed, protecting image privacy becomes an especially pressing problem.

Image feature extraction is an important research direction in image processing: its goal is to derive representative features from an image for further analysis and processing. Research on privacy-preserving feature extraction in the cloud focuses on protecting image privacy while keeping the extracted features effective. This article reviews the current state of research and its open challenges, and proposes a cloud based privacy-preserving feature extraction method.

I. State of research

Research on image privacy protection has made some progress, concentrated mainly in image encryption, watermarking, and privacy-preserving algorithms. For cloud based feature extraction, however, traditional image privacy techniques are not fully applicable, since storage and processing take place in the cloud. New privacy mechanisms are needed that preserve feature effectiveness while reflecting the characteristics of the cloud environment.
Current work on the problem falls into three main areas:

1. Image encryption: encrypting an image protects its privacy, but traditional encryption also severely damages the feature information in the image, making effective feature extraction and analysis impossible.

2. Privacy-preserving algorithms: researchers have proposed targeted techniques such as anonymization and blurring. These protect privacy to a degree, but they struggle to keep the image features useful at the same time.

3. Watermarking: embedding a watermark supports both privacy and copyright protection. Watermarks, however, also interfere with feature extraction, and in a cloud setting, where images travel over public networks during transfer and processing, the security of the watermark itself is hard to guarantee.
All of these methods face a tension between protecting image privacy and keeping image features effective (the toy demo below makes this concrete), so a new approach is urgently needed. A privacy-preserving feature extraction method designed for the cloud must take full account of the characteristics and challenges of the cloud environment.
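The trade-off can be made tangible with a few lines of OpenCV: pixelating a face region hides identity but visibly reduces the number of local features a standard detector finds. The bounding box and file name below are invented for the demo.

```python
# Toy demo: anonymization degrades feature extraction (assumed inputs).
import cv2

img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input
orb = cv2.ORB_create()
kp_before = orb.detect(img, None)

x, y, w, h = 60, 40, 120, 160                        # assumed face region
small = cv2.resize(img[y:y + h, x:x + w], (8, 8))    # pixelate: down- then upsample
img[y:y + h, x:x + w] = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

kp_after = orb.detect(img, None)
print(f"{len(kp_before)} keypoints before vs {len(kp_after)} after anonymization")
```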
Movie information extraction with Python

Movie information extraction is the task of using computing techniques to obtain useful information from movie-related data. This article introduces methods and techniques for implementing it in Python.

I. Data collection

Extraction starts with collecting movie data, which can be obtained in many ways, such as crawling movie websites or calling an API. The collected data should then be saved in a structured format such as JSON or CSV.
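A minimal collection script might call a JSON API and persist the result; the endpoint, parameters, and response shape below are invented for illustration.

```python
# Hedged sketch: fetching movie records from a hypothetical API and saving JSON.
import json
import requests

resp = requests.get("https://api.example.com/movies", params={"year": 2023})
resp.raise_for_status()
movies = resp.json()                      # assume a list of movie dictionaries

with open("movies.json", "w", encoding="utf-8") as f:
    json.dump(movies, f, ensure_ascii=False, indent=2)
```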
II. Extraction methods

1. Title extraction: analyzing a movie's title can yield its name, release year, and subtitle. String matching and regular expressions are the usual tools.
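For instance, if titles follow a "Name (Year): Subtitle" convention (an assumption; real data will need more patterns), a regular expression can split out the parts:

```python
# Regex-based title parsing for an assumed "Name (Year): Subtitle" format.
import re

raw_title = "Blade Runner (1982): The Final Cut"
pattern = r"(?P<name>.+?)\s*\((?P<year>\d{4})\)(?:\s*:\s*(?P<subtitle>.+))?"
match = re.match(pattern, raw_title)
if match:
    print(match.group("name"))      # Blade Runner
    print(match.group("year"))      # 1982
    print(match.group("subtitle"))  # The Final Cut
```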
2. Cast extraction: cast information is an important part of movie extraction. Actors' names and roles can be derived from the cast list, the film's synopsis, and similar text, typically with natural language processing.
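One common approach is named entity recognition. The sketch below uses spaCy's small English model (installed via python -m spacy download en_core_web_sm) and simply collects PERSON entities, leaving role assignment (actor vs. director) to later, context-dependent logic.

```python
# Hedged sketch: extracting person names from movie text with spaCy NER.
import spacy

nlp = spacy.load("en_core_web_sm")   # pretrained English pipeline (assumed installed)
text = "Directed by Ridley Scott, the film stars Harrison Ford and Sean Young."
doc = nlp(text)

people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
print(people)   # e.g. ['Ridley Scott', 'Harrison Ford', 'Sean Young']
```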
3. Director extraction: the director is likewise important. The director's name, nationality, and other details can be extracted from the film's credits using keyword extraction or entity recognition.

4. Plot extraction: the plot description is a key target. Plot keywords and themes can be derived from synopses and reviews using text classification or sentiment analysis.
5. Rating extraction: ratings are one of the key indicators in movie extraction. Scores and audience evaluations can be obtained from rating data and viewer reviews using statistical analysis or machine learning.
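At its simplest, rating extraction ends in aggregation, as in this tiny example; the field names and the 10-point scale are assumptions.

```python
# Tiny aggregation example over ratings gathered from several sources.
from statistics import mean

ratings = [
    {"source": "site_a", "score": 8.1},
    {"source": "site_b", "score": 7.6},
    {"source": "site_c", "score": 8.4},
]
print(f"aggregate rating: {mean(r['score'] for r in ratings):.2f} / 10")
```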
III. Applications

Movie information extraction supports all kinds of movie-related applications, such as recommendation, ticketing, and review analysis. Extracted information helps users understand films better, enables personalized recommendations, and improves the user experience.

Summary: this article covered data collection together with title, cast, director, plot, and rating extraction. These techniques make it possible to build personalized recommendation, ticketing management, review analysis, and other movie applications. We hope this overview gives readers a deeper understanding of movie information extraction.
Analysis of how the US "NSA-brand validator" trojan is implanted

Today, the US reportedly uses a "validator" to attack and track Chinese young people online. Why call it a trojan? Because it has trojan-like capabilities: it can bypass authentication, automatically fetch the latest passwords, and send email; it runs without any special hardware or software environment; and it can spread freely between websites. Think about what could happen if such a program were installed on your computer and then operated remotely, downloading files or passing your photos to others; that is exactly what makes it a "trojan".

The claim is as follows: once a user goes online, the first thing implanted is the US "validator", also called the "NSA-brand validator". It consists of two parts, a server and a client. The server supplies validation information to the client; the client receives that information and converts it into local validation state. But nothing reveals where these components live, presumably by design, so investigators had to search on their own. After considerable effort the server side was finally located, but then something stranger turned up: it was guarded by yet another "validator", one so strong it could not be cracked. Alas, a dead end.
Four AI approaches to creating your own Avatar

What exactly is the wildly popular "Avatar"? With the rise of the metaverse, the word appears ever more often. In 2009, James Cameron's 3D science fiction blockbuster Avatar introduced the English word to many people. Yet the word was not coined by the director: it comes from Sanskrit and is an important term in Hinduism. According to the Cambridge English Dictionary, avatar today carries three main senses.

[Image: dictionary entry for "avatar" © Cambridge University Press]

Originally, Avatar derives from the Sanskrit avatarana, formed from ava ("off, down") plus tarati ("cross over"); its literal meaning is "descent", referring to a deity's incarnation in the human world, especially the god Vishnu descending in human or animal form. The word entered English in 1784. In 1985, Chip Morningstar and Joseph Romero used "Avatar" for users' online personas while designing the online role-playing game Habitat for Lucasfilm Games (LucasArts). Then in 1992, science fiction writer Neal Stephenson's novel Snow Crash described a metaverse parallel to the real world in which every real person has a network double, an Avatar; this was the word's first appearance in mass media.

In the Internet era, programmers adopted "avatar" for the image that represents a user or the user's persona in software systems: what we commonly call a profile picture or "personal show". The avatar can be a three-dimensional figure in an online game or virtual world, or the two-dimensional image used in forums and online communities; either way, it is a token that stands for the user.

From QQ Show to Avatar: letting users create their own avatar has become standard in applications of every kind, and with advancing technology user avatars have evolved from plain 2D images to 3D figures.
Overview of methods for acquiring facial information (with article template)

1. Introduction

1.1 Overview. Methods for acquiring facial information use various technical means and tools to collect and analyze human faces in order to obtain information about identity, emotional state, distinguishing features, and more. With continuing advances in technology, facial information acquisition has become an important research direction in many fields, including security surveillance, human-computer interaction, and psychological research. This article surveys acquisition methods and discusses their applications and challenges in related fields.

1.2 Structure. The article has five parts: introduction, main body, state of research, experiments and application case studies, and conclusions and outlook. The introduction presents the overview and structure, giving the reader the overall framework of the article. The main body introduces the importance of facial information and two common acquisition methods. The state-of-research section reviews the development of face recognition technology and summarizes existing facial data collection techniques along with their open problems and challenges. The experiments section describes the experimental design and data collection process and analyzes application cases from two specific domains. Finally, the conclusion summarizes current problems, proposes directions for solving them, and predicts future trends in facial information acquisition.
1.3 Purpose. This article aims to give a systematic overview of methods for acquiring facial information and to examine in depth their applications and challenges across domains. Surveying and analyzing the field can help advance acquisition methods, broaden their range of applications, and surface relevant questions and recommendations. As technology accelerates and demand grows for information about individual characteristics and emotional states, mastering facial information acquisition will become an indispensable technical capability in many fields. By organizing and summarizing existing work, this article offers researchers a comprehensive view of the field and a basis for further in-depth study.

2. Main body

2.1 The importance of facial information. The face is one of the most important human features and carries a wealth of information; it plays a critical role in human communication, emotional expression, and identity recognition. Facial expressions not only convey emotion and social meaning but can also reflect indicators of health status, disease diagnosis, and behavior analysis. Accurately acquiring facial information is therefore significant for solving a wide range of problems.
Health industry standard of the People's Republic of China WS/T XXX-2009 (EMR01.00; ICS and record numbers to be assigned)

Clinical document basic template: medical summary dataset (CDA basic template: the dataset of medical summary). Issued by the Ministry of Health of the People's Republic of China.

Contents: Foreword; 1 Scope; 2 Normative references; 3 Dataset metadata; 4 Data element catalog (4.1 Data element attributes: 4.1.1 Common attributes, 4.1.2 Other attributes; 4.2 Data element value-domain code tables); 5 Data element index.

Foreword. The Medical Summary Dataset is one component of China's basic dataset standards for electronic medical records. This standard provides a basic dataset standard for medical summary records in healthcare services, with standardized terminology, clear definitions, and unambiguous semantics, so as to normalize the basic recorded content of medical summaries, achieve consistency and comparability of medical summary information in collection, storage, publication, exchange, and other uses, and guarantee its effective exchange, statistics, and sharing. The standard takes the medical summary records kept by hospitals as its object, uses the data element as the unit of identification, and is organized in a summary catalog format. The dataset contains 94 data elements and 15 value-domain code tables. The standard was proposed by the Health Information Standards Committee of the Ministry of Health and is under the jurisdiction of the Ministry of Health; the drafting organization and principal drafters are to be completed.

Medical summary dataset

1 Scope. This standard specifies the content scope, classification and coding, and data elements with their value-domain code standards for the medical summary record dataset. It applies to health administrative departments at all levels and medical and health institutions of all types nationwide.

2 Normative references. The provisions of the documents below become provisions of this standard through citation. For dated references, all subsequent amendments (excluding corrigenda) and revisions do not apply to this standard.
Avatar Information Extraction System

T. S. Jayram, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, Huaiyu Zhu
IBM Almaden Research Center
{jayram,rajase,rsriram,vaithyan,huaiyu}@

Abstract

The AVATAR Information Extraction System (IES) at the IBM Almaden Research Center enables high-precision, rule-based information extraction from text documents. Drawing from our experience, we propose the use of probabilistic database techniques as the formal underpinnings of information extraction systems so as to maintain high precision while increasing recall. This involves building a framework where rule-based annotators can be mapped to queries in a database system. We use examples from AVATAR IES to describe the challenges in achieving this goal. Finally, we show that deriving precision estimates in such a database system presents a significant challenge for probabilistic database systems.

1 Introduction

Text analytics is a mature area of research concerned with the problem of automatically analyzing text to extract structured information. Examples of common text analytic tasks include entity identification (e.g., identifying persons, locations, organizations) [1], relationship detection (e.g., person X works in company Y) [9], and co-reference resolution (identifying different variants of the same entity either in the same document or in different documents) [8]. Text analytic programs used for information extraction are called annotators, and the objects extracted by them are called annotations. Traditionally, such annotations have been directly absorbed into applications. Increasingly, due to the complex needs of today's enterprise applications (such as Community Information Management [3]), there is a need for infrastructure that enables information extraction, manages the extracted objects, and provides an easy interface to applications. Moreover, a very important prerequisite for the use of annotations in enterprise applications is high precision.

At the IBM Almaden Research Center we are currently building the AVATAR Information Extraction System (IES) to tackle some of these challenges. Drawing from our experience in building the AVATAR IES infrastructure, in this paper we make a case for the use of probabilistic database techniques as the formal underpinnings of such information extraction systems.

Annotations and Rules. Annotators in AVATAR IES are classified into two categories based on their input:

• Base annotators operate over the document text, independent of any other annotator.
• Derived annotators operate on document text as well as the annotation objects produced by other annotators.

[Figure 1: Thresholding annotator precisions]

Every annotation object produced by an annotator is characterized by a well-defined set of operations. We refer to each such operation as a rule. A set of rules within an annotator that together produce an annotation object is called a meta-rule. As an example, consider a simple base annotator that identifies occurrences of person names in the text of a document. An example of a meta-rule used by such an annotator would be (informally) R1: look for the presence of a salutation such as Dr. or Mr. followed by a capitalized word. Meta-rule R1 would identify "Dr. Stonebraker" as a candidate Person annotation. Since a derived annotation depends on other annotation objects, the concept of meta-rule history is useful:

Definition 1 (Meta-Rule History): The meta-rule history H(a) of an annotation object a is defined as follows: if a is a base annotation produced by meta-rule R, then H(a) = <R>; if a is a derived annotation object produced by a meta-rule R that operates on previously defined (base or derived) annotation objects a1, ..., ak, then H(a) = <R, H(a1), ..., H(ak)>.

The confidence in the accuracy of an annotation is related to its meta-rule history. For example, the person annotator mentioned above may use a different meta-rule R2 that looks for capitalized words that may or may not be person names (e.g., to identify "Bill" as a candidate Person). Intuitively, the annotator has higher confidence in R1 and, therefore, higher confidence in the accuracy of the objects produced by R1. To formalize this intuition, we characterize the precision of individual annotation objects as follows:

Definition 2 (Annotation Object Precision): The precision prec(a) of an annotation object a is defined as the confidence value in [0, 1] given by the annotator to all objects that can be produced with the same meta-rule history as that of a.

Definition 3 (High-precision Information Extraction (HPIE) System): An information extraction system in which the precision of all annotation objects is above a threshold α (α close to 1.0) is a high-precision information extraction system.

In accordance with Definitions 1 and 2, the precision for derived annotations is computed using the entire meta-rule history of the corresponding base and derived annotators. This approach has two severe drawbacks. First, the combinatorial explosion in the space of meta-rules renders information extraction systems of any reasonable size intractable. Second, there is a sparsity problem arising out of the fact that there may not be enough evidence (i.e., not enough annotation objects) to obtain precision estimates for all meta-rule histories.

Current Implementation in AVATAR IES. The current implementation of AVATAR IES uses the UIMA [6] workflow engine to execute annotators. Annotations produced in AVATAR IES are stored in an annotation store implemented using a relational database (DB2). The store allows multiple derived annotators to be executed without having to re-execute the base annotators. To overcome problems associated with maintaining meta-rule history, AVATAR IES implicitly assumes that all α-filtered input objects to a derived annotator are of equal precision. However, this approach has several problems, which we explain below using Figure 1. In this figure, derived annotator DA1 has inputs from two base annotators BA1 and BA2. For an application that requires an HPIE system with α = 0.9, the naive approach of α-filtering the objects of BA1 and BA2 may not be sufficient for DA1 to produce derived objects with α ≥ 0.9. The reason is that a derived annotator might require input annotations to be filtered at thresholds different from α in order to produce derived annotations above α. To account for this, annotators are thresholded differently (the β's in Figure 1) for consumption by derived annotators. This process can become complicated if multiple β's are required for a single base annotator whose output is consumed by different derived annotators. To minimize the need to tune a large number of β's, the current implementation of AVATAR IES only allows two β settings for each annotator, namely high and medium.

Motivation for Probabilistic Databases. Even under the simplistic assumptions made in our implementation, two problems remain. First, as AVATAR IES scales to a large number of annotators, the task of setting β's can quickly become intractable. Second, the choice of β has a significant impact on the recall of derived annotators. As an example, consider a derived annotator PersonPhone (see Section 2.2) that uses Person annotations produced by a base annotator. In AVATAR IES, by switching the β for Person from medium to high, the number of annotation objects produced by PersonPhone over the Enron email data set [4] drops from 910 to 580.

Below, we motivate the use of probabilistic database techniques [2, 7] as a potential solution to these problems. Our first step in this direction is to view the precision of an annotation object in probabilistic terms.

Assumption 4: Consider an object a of type type(a). The precision prec(a) of a can be interpreted as a probability as follows: let ã be drawn at random from the set of annotation objects whose meta-rule histories are identical to that of a. Then prec(a) equals the probability that ã is truly an object of type type(a). Formally, prec(a) = P(tt(ã) = type(a) | H(ã) = H(a)).

Assumption 5: Let R be a meta-rule in a derived annotator that takes as input k objects of specific types T1, ..., Tk. Let the derived annotation object a be produced by R using annotation objects a1, a2, ..., ak of types T1, ..., Tk, respectively, i.e., a = R(a1, ..., ak). Let ã and ãi correspond to a and ai, for i = 1 ... k, as defined in Assumption 4. Then tt(ã) = T implies that tt(ãi) = Ti for all i.

Proposition 6: Using Assumptions 4 and 5, we can express the precision of an annotation object produced by a derived annotator as

  prec(a) = P(tt(ã) = T | H(ã) = H(a))
          = P(tt(ã) = T | ã = R(ã1, ..., ãk), {tt(ãi) = Ti}, {H(ãi) = H(ai)})   (meta-rule-prec)
          × P({tt(ãi) = Ti} | ã = R(ã1, ..., ãk), {H(ãi) = H(ai)})              (input-prec)
As an example,consider a derived annotator PersonPhone(see Section2.2)that uses Person annotations pro-duced by a base annotator.In AVATAR IES,by switching theβfor Person from medium to high,the number of annotation objects produced by PersonPhone over the Enron email data set[4]drops from910to580.Below,we motivate the use of probabilistic database techniques[2,7]as a potential solution to these prob-lems.Ourfirst step in this direction is to view the precision of an annotation object in probabilistic terms.Assumption4:Consider an object a of type type(a).The precision prec(a)of a can be interpreted as a probability as follows:let a be drawn at random from the set of annotation objects whose meta-rule histories are identical to that of a.Then prec(a)equals the probability that a is truly an object of type type(a)4.Formally, prec(a)=P(tt( a)=type(a)|H(a)=H(˜a)).Assumption5:Let R be a meta-rule in a derived annotator that takes as input k objects of specific types T1,...,T k.Let the derived annotation object a be produced by R using annotation objects a1,a2,...,a k of types T1,...,T k,respectively,i.e.,a=R(a1,...,a k).Let a and a i correspond to a and a i,for i=1...k,as defined in Assumption4.Then,tt( a)=T=⇒∀i:tt( a i)=T i.Proposition6:Using Assumptions4and5,we can express the precision of an annotation object produced by a derived annotator asprec(a)=P(tt( a)=T|H(a)=H( a))=P(tt( a)=T| a=R( a1,..., a k),{tt( a i)=T i},{H(a i)=H( a i)})(meta-rule-prec)·P({tt( a i)=T i}| a=R( a1,..., a k),{H(a i)=H( a i)})(input-prec)Figure2:Annotation StoreIn the proposition above,the expression(meta-rule-prec)represents meta-rule precision while the expres-sion(input-prec)represents the overall input precision(see Section4.3for details and examples).While As-sumption4has allowed us to reduce the problem of computing precision of derived object to one of comput-ing probabilities for meta-rule precision and overall input precision,the expressions in Proposition6appear intractable.To obtain a practical solution,we believe that a better understanding of annotators and their depen-dencies is essential.In the rest of the paper we describe the internals of AVATAR IES and connections to probabilistic databases as appropriate.In Section2.2,we describe a generic template for AVATAR rule-based annotators.In Section3, we consider a simple probability model for base annotators.Finally,in Section4,we discuss briefly efficiency issues related to derived annotators.2Rule-based Annotators2.1Annotator Data ModelEach annotation produced by an IES can be viewed as a structured object with a well-defined type.The overall output of an IES,called an annotation store,is a collection of such objects as defined below:Definition7(annotation store):An annotation store S=(T,O)consists of a set of types T,a set of objects O,and two distinguished types D,S∈T,such that:•there is a special attribute text for type D•∀x∈O type(x)∈T•∀x∈O such that type(x)=D,type(x)=S,there exist attributes doc and span with type(x.doc)=D and type(x.span)=S.In this definition,D is called the document type,S is called the span type,and every other type in T is an annotation type.The special attribute text refers to the raw text of each document.The doc attribute of an annotation object A points to the source document from which A was extracted.Similarly,the span attribute of A describes the portion of A.doc.text where A was mentioned.Figure2shows a simple annotation store with one document object of type Email and three annotation objects.Each oval in thefigure 
represents one object and is labeled as A:B where A is the ID of the object and B is the type.The rectangular boxes represent atomic attribute values.In this example the span type S contains a pair of integers begin and end that store the character offsets of the piece of the text corresponding to the annotation.For the purposes of this paper,we assume that every annotation is extracted from a single document and that the span is interpreted relative to the text of this document.4procedure Annotator(d,A d)Features(R f,d)2.foreach r∈R gCandidates=Candidates∪ApplyFeatures step,the input document D is tokenized and each individual token is added to the set of features Features.Further,for each feature F,an attribute dictU(resp.dictA)is set to true if the corresponding token matches an entry in D u(resp.D a).•The set of rules R g for identifying person names are:(i)r uu:a pair of features that are adjacent to each other in the document text and both of which are labeled with D u(e.g.,michael stonebraker),(ii)r ua:a pair of features that are adjacent to each other in the document text and are labeled with D u and D a respectively (e.g.,james gray),(iii)r u:a feature that is labeled with D u(e.g.,michael),and(iv)r a:a feature that is labeled with D a(e.g.,gray).•Step3consists of a single consolidation rule r d that executes the following logic:If two candidates o1and o2are such that the o1.span contains o2.span,discard o2.This rule is applied repeatedly to produce Results.5Note that SimplePerson is a simple but powerful annotator that we use in this paper to illustrate some key ideas. In reality,a person annotator will use a significantly larger set of rules that exploit feature attributes such as capitalization,the presence of salutations(e.g.,Dr.,Mr.),etc.Derived annotator PersonPhone.Derived annotator PersonPhone identifies people’s phone numbers from documents.This annotator takes in as input the document D and a set of Person,Phone and ContactPattern annotations identified by Base annotators.Here(i)Person annotations are produced by a person annotator such as SimplePerson,(ii)Phone annotations are produced by an annotator that identifies telephone numbers,and (iii)ContactPattern annotations are produced by an annotator that looks for occurrences of phrases such as “at”,“can be(reached|called)at”and“’s number is”.Given these inputs,the PersonPhone annotator is fully described by the following two rules:•Candidate generation rule r seq:If a triple Person,ContactPattern,Phone appear sequentially in the docu-ment,create a candidate PersonPhone annotation.An example instance identified using this rule is shown in Figure2.•Consolidation rule r d as described earlier.3Probability Model for Base AnnotatorsLet us revisit the rules of the SimplePerson annotator that was described in Section2.2.Since the only consoli-dation rule in this annotator r d is a discard rule,it does not affect the probability assigned to the result objects. 
Therefore, each generation rule in SimplePerson fully describes a meta-rule. For each candidate generation rule in the SimplePerson annotator, let us assume the corresponding precision values are available to us (e.g., we are given prec(r_uu), prec(r_ua), etc.). The precision value associated with a rule is a measure of the confidence that the annotator writer has in the accuracy of that rule. An experimental procedure to compute such precision values would entail running the annotator on a labeled document collection and setting prec(r) to be the fraction of objects produced by rule r that indeed turned out to be persons. Irrespective of how such precision values are computed, we make the same assumption about candidate generation rules as we did in Assumption 4 for meta-rules. For example, if o is an object of type Person produced by rule r_ua upon examining the text "James Gray", we have P(tt(o) = Person) = prec(r_ua). Thus, for annotators with discard-only consolidation rules, knowing the precision values for generation rules is sufficient to assign probabilities to the result annotations. However, more complex base annotators use a mixture of merge rules and discard rules to perform consolidation. The task of computing annotation probabilities for such annotators is significantly more complicated.

To illustrate, let us consider a more sophisticated person annotator ComplexPerson that uses an additional dictionary D_s containing salutations (such as Dr., Mr., Prof., etc.). Besides the generation and consolidation rules of SimplePerson, the extended ComplexPerson annotator uses a generation rule r_s and a merge rule r_m, where:

Rule r_s: generate a candidate annotation if there is a pair of features in the document text that are adjacent to each other and are such that the first one is labeled with D_s and the second one is labeled with either D_u or D_a (e.g., "Dr. Michael" and "Dr. Stonebraker" would both be matched by this rule).

Rule r_m: given two candidates o1 and o2 such that o1.span overlaps with o2.span, produce a new candidate object by merging the two spans into a single larger span.

For instance, given the piece of text "Dr. Michael Stonebraker", rule r_s would produce object o1 corresponding to "Dr. Michael", rule r_uu would produce object o2 corresponding to "Michael Stonebraker", and rule r_m would merge o1 and o2 to produce o3 corresponding to "Dr. Michael Stonebraker". From our earlier assumption, we know that P(tt(o1) = Person) = prec(r_s) and P(tt(o2) = Person) = prec(r_uu). However, to assign a probability to object o3, we need a meaningful probability model to compute P(tt(o3) = Person), given the above two probabilities. In general, designing a probability model to handle consolidation rules that involve merging of candidates to produce new annotation objects is an open problem. Today, in AVATAR, without such a model, we are forced to manually tune base annotators until we obtain desired precision levels.

4 Derived Annotators

Based on our experiences with AVATAR IES on various applications such as email, the IBM intranet, and Internet blogs, the real power lies in the extraction of domain-specific derived annotations, which are usually very large in number. As part of our attempts to make AVATAR IES more efficient, we try to describe the rules in the annotator template of Section 2.2 as queries over the AVATAR annotation store. Such a view opens the door to exploiting appropriate indices, evaluating alternate plans, and, in general, performing cost-based optimization of derived annotators. We begin by describing some of our current challenges in mapping rules to queries. We then briefly discuss several issues that arise in computing the precision of derived annotations in the context of probabilistic databases.

4.1 Rules as Queries

In this section, we give examples that demonstrate that candidate generation and consolidation rules can be expressed as queries (using the standard Object Query Language (OQL) [5] syntax). For ease of exposition, we only consider discard rules in the consolidation step. As our first example, the candidate generation rule r_seq for the PersonPhone annotator can be written as shown below:

  Query q1.
    CREATE P as person, Ph as phone,
           span as SpanMerge(P.span, CP.span, Ph.span), doc as P.doc
    FROM   Person P, ContactPattern CP, Phone Ph
    WHERE  ImmSucc(P.span, CP.span) AND ImmSucc(CP.span, Ph.span)
           AND P.doc = CP.doc AND CP.doc = Ph.doc

In the above, ImmSucc(span1, span2) is a user-defined function that returns true if "span2" occurs immediately after "span1" in the original document text. SpanMerge is another function that produces a new span by concatenating the set of spans provided as input. Note that we use a CREATE statement to indicate that a new annotation object is produced that contains the features P and Ph as attributes "person" and "phone" respectively. Similarly, rule r_d can be written as:

  Query q2.
    Candidates - (SELECT O2
                  FROM Candidates O1, Candidates O2
                  WHERE SpanContains(O1.span, O2.span))

where SpanContains(span1, span2) is a user-defined function that returns true if "span1" contains "span2". Notice how the recursive application of rule r_d is captured using set difference.

4.2 Challenges in Modeling Complex Rules as Queries

While the example candidate generation rules presented above map easily to queries, AVATAR IES uses several more complex rules for which the mapping is not obvious. We present an example below from our suite of rule-based annotators for extracting reviews from blogs: specifically, the MusicReview annotator that identifies an informal music review in a blog. A simplistic version of this annotator works as follows. A_d consists of MusicDescription phrases identified by an earlier annotator (e.g., "lead singer sang well", "Danny Sigelman played drums"). The MusicReview annotator identifies contiguous occurrences of MusicDescription (based on some proximity condition across successive entries), groups them into blocks, and marks each block as a MusicReview.

We now present the candidate generation and consolidation rules for a simplistic version of this annotator, which is interested in block sizes up to two, i.e., it identifies occurrences of the patterns MD and MD,MD in the document (MD denotes the input MusicDescription annotations). R_g = {r_md, r_2md} is a pair of rules that identifies occurrences of MD and MD,MD respectively. The corresponding queries are given below.

  Query q3.
    CREATE MD as first, MD as second
    FROM   MusicDescription MD

  Query q4.
    CREATE MD1 as first, MD2 as second
    FROM   MusicDescription MD1, MusicDescription MD2
    WHERE  Window(MD1.span, MD2.span, m) AND BeginsBefore(MD1.span, MD2.span)

where Window(span1, span2, m) is a user-defined function that returns true if the two spans are within a distance of m, and BeginsBefore(span1, span2) returns true if "span1" begins before "span2". There is a single consolidation rule r_d1, given below. This rule discards candidates that are completely contained in some other candidate, or have another MusicDescription inside them.

  Query q5.
    Candidates - (SELECT O2
                  FROM Candidates O1, Candidates O2
                  WHERE BeginsBefore(O1.first.span, O2.first.span)
                    AND EndsAfter(O1.second.span, O2.second.span)
                  UNION
                  SELECT O2
                  FROM Candidates O1, Candidates O2
                  WHERE BeginsBefore(O1.first.span, O2.first.span)
                    AND O1.second = O2.second
                  UNION
                  SELECT O2
                  FROM Candidates O1, Candidates O2
                  WHERE O1.first = O2.first
                    AND EndsAfter(O1.second.span, O2.second.span))

For larger block sizes, note that both rules become significantly more complex. The following challenges arise in modeling the rules as queries: (i) the ability to express proximity conditions across objects based on their location in the document (e.g., identify two MD's appearing adjacent to each other within m characters with no other MD between them); (ii) the ability to group multiple such objects together, where each pair satisfies some proximity condition; and (iii) the ability to retain groups in decreasing order of size.

4.3 Computing Precision of Derived Annotations

The precision of derived annotation objects is given by Proposition 6 in Section 1. In the expression given in Proposition 6, the precision of a derived annotation is the product of the rule precision and the precision of the input objects. The computation of the rule precision for derived annotators is similar to that for base annotators; therefore all the issues discussed in Section 3 are relevant to derived annotators as well. We now turn to computing the second term, involving the precision of the input objects. Without any additional information, one reasonable assumption is the following:

Assumption 8: The overall input precision of the input annotations in Proposition 6 is a product of the precisions of the individual input objects and is independent of the rule R. In other words,

  P({tt(ãi) = Ti} | ã = R(ã1, ..., ãk), {H(ãi) = H(ai)}) = ∏_i P(tt(ãi) = Ti | H(ãi) = H(ai))

Although the above assumption allows us to account for the overall input precision, it is invalid for most derived annotators. In particular, we believe that most derived annotator rules enhance our confidence in the precision of the input annotations. For example, let us revisit the PersonPhone annotator described in Section 2.2. This annotator has a candidate generation rule r_seq that operates on annotations generated by three base annotators, namely Person, ContactPattern, and Phone. Consider the example annotation identified by this rule shown in Figure 2. The regular expressions used in the ContactPattern annotator and the sequencing condition (ImmSucc) in the rule r_seq have a strong correlation with the average precision of the Person annotation. In fact, in the current implementation of this annotator in AVATAR IES, the precision of the Person annotations that satisfy this additional constraint is 97.5%, significantly higher than the overall precision of Person annotations under the β = medium setting. Since Assumption 8 assumes that P(tt(Per1) = Person) is independent of {H(PP1)} and the rule r_seq, this significantly lowers the computed precision for PersonPhone objects, and results in lowering the recall of the PersonPhone annotator.

The combination of mapping rules to queries and accounting for the dependencies as described above presents a significant challenge for probabilistic database systems. This is the focus of our ongoing work in this area. We believe that this requires enhancing the existing semantics for probabilistic database query evaluation, and will lead to a fresh set of open problems for efficient query evaluation under these enhanced semantics.
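To make the interaction between Proposition 6 and Assumption 8 concrete, consider a small worked example with invented numbers (only the 97.5% figure above comes from the text). Suppose r_seq has meta-rule precision 0.95, and the α-filtered Person and Phone inputs have precisions 0.90 and 0.99 respectively. Under Assumption 8,

  prec(a) = 0.95 × (0.90 × 0.99) ≈ 0.85,

so the PersonPhone object would be dropped by an α = 0.9 filter. If instead the model could exploit the dependency noted above, taking the effective Person precision conditioned on r_seq firing to be 0.975, the estimate becomes

  prec(a) = 0.95 × (0.975 × 0.99) ≈ 0.92,

which survives the filter. Under these assumed numbers, Assumption 8 systematically underestimates derived precision and thereby depresses recall.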
References

[1] H. Cunningham. Information extraction - a user guide. Technical Report CS-97-02, University of Sheffield, 1997.
[2] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, pages 864-875, 2004.
[3] A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. Community information management. In IEEE Data Engineering Bulletin, March 2006.
[4] Enron dataset. /enron/.
[5] O. Schadow et al. The Object Data Standard: ODMG 3.0. Morgan Kaufmann, 2000.
[6] D. Ferrucci and A. Lally. UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, June 2004.
[7] N. Fuhr and T. Roelleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32-66, 1997.
[8] J. F. McCarthy and W. G. Lehnert. Using decision trees for coreference resolution. In IJCAI, pages 1050-1055, 1995.
[9] N. Kambhatla. Combining lexical, syntactic and semantic features with maximum entropy models for extracting relations. In Proc. of the 42nd Anniversary Meeting of the Association for Computational Linguistics (ACL 04), 2004.