Design and Implementation of Distributed Web Crawler for Open Access Journal

YANG Zhenxiong, CAI Zurui, CHEN Guohua+, TANG Yong, ZHANG Long

School of Computer, South China Normal University, Guangzhou 510000, China

+ Corresponding author: E-mail: chengh3@https://www.doczj.com/doc/eb13527178.html

YANG Zhenxiong, CAI Zurui, CHEN Guohua, et al. Design and implementation of distributed Web crawler for open access journal. Journal of Frontiers of Computer Science and Technology, 2014, 8(10): 1187-1194.

Abstract: Open access journals are deep Web resources dispersed across the Internet. Traditional search engines have difficulty indexing them, so users cannot reach open access journals directly through a search engine, and these open resources go to waste. This paper proposes a novel focused Web crawler with a distributed architecture to collect the open access journal resources scattered throughout the Internet. The architecture adopts a master-slave design consisting of one master control center and multiple distributed crawler nodes, and the paper proposes a method for extracting academic information from open access journals based on user-predefined rules. The distributed crawler nodes can be adjusted dynamically and use a Chrome browser plug-in mechanism to achieve scalability and deployment flexibility.

Key words: distributed Web crawler; open access journal; plug-in mechanism

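Abstract (Chinese version): Open access (OA) journals are deep Web resources scattered across the Internet. Traditional search engines cannot index them and therefore cannot meet users' needs for OA journal resources, which wastes these open resources. To address how to centrally collect the scattered …

ISSN 1673-9418 CODEN JKYTA8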

Journal of Frontiers of Computer Science and Technology

1673-9418/2014/08(10)-1187-08

doi: 10.3778/j.issn.1673-9418.1405051

E-mail: fcst@https://www.doczj.com/doc/eb13527178.html, https://www.doczj.com/doc/eb13527178.html, Tel: +86-10-89056056

* The National Natural Science Foundation of China under Grant No. 61272067; the National High Technology Research and Development Program of China (863 Program) under Grant No. 2013AA01A212; the National Key Technology R&D Program of China under Grant No. 2012BAH27F05; the Natural Science Foundation of Guangdong Province (Team Research Project) under Grant No. S2012030006242; the Major Scientific and Technological Project of Guangdong Province under Grant No. 2012A080104019; the Science and Technology Planning Project of Guangdong Province under Grant No. 2011B080100031.

Received 2014-04, Accepted 2014-06.

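CNKI online-first publication: 2014-07-01, https://www.doczj.com/doc/eb13527178.html,/kcms/doi/10.3778/j.issn.1673-9418.1405051.html

YANG Zhenxiong, CAI Zurui, CHEN Guohua, et al. Design and implementation of distributed Web crawler for open access journal [J]. Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2014, 8(10): 1187-1194.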
