Heritrix

格式：docx
大小：621.34 KB
文档页数：54

下载文档原格式

Heritrix使用指南04

图10-37 Job提交后的Console界面（3）在面版的右测，它显示了当前Java虚拟机的一些状态，如图10-38所示，可以看到当前的堆大小为4184KB，而已经被使用了3806KB，另外，最大的堆内容可以达到65088KB，也就是在64M左右。

图10-38 内存状态显示（4）此时，单击面版中的“Start”链接，就会将此时处于“Pending”状态的抓取任务激活，令其开始抓取（5）在图10-39中，刚才还处于“Start”状态的链接已经变为了Hold状态。

这表明，抓取任务已经被激活。

图10-39 抓取开始（6）此时，面版中出现了一条抓取状态栏，它清楚的显示了当前已经被抓取的链接数量，另外还有在队列中等待被抓取的链接数量，然后用一个百分比显示出来。

（7）在绿红相间的长条左侧，是几个实时的运行状态，其中包括抓取的平均速度（KB/s）和每秒钟抓取的链接数（URIs/sec），另外的统计还包括抓取任务所消耗的时间和剩余的时间，不过这种剩余时间一般都不准，因为URI的数量总是在不断变化，每当分析一个网页，就会有新的URI加入队列中。

如图10-40所示。

图10-40 抓取的速度和时间（8）在绿红相间的长条右侧，是当前的负载，它显示了当前活跃的线程数量，同时，还统计了Heritrix内部的所有队列的平均长度。

如图10-41所示。

图10-41 线程和队列负载（9）从图10-40和图10-41中看到，真正的抓取任务还没有开始，队列中的总URI数量，以及下载的速率都还基本为0。

这应该还处于接收种子URL的网页信息的阶段。

让我们再来看一下当Heritrix运行一段时间后，整个系统的资源消耗和进度情况。

（10）在图10-42中，清楚的看到系统的资源消耗。

其中，每秒下载的速率已经达到了23KB，另外，平均每秒有19.3个URI被抓取。

在负载方面，初设的50个线程均处于工作状态，最长的队列长度已经达到了415个URI，平均长度为5。

Heritrix3.1.0源码解析（二十二）

Heritrix3.1.0源码解析（⼆⼗⼆）本⽂继续分析Heritrix3.1.0系统的源码，其实本⼈感觉接下来待分析的问题不是⼀两篇⽂章能够澄清，本⼈不能因为迫于表述⽽乱了问题本⾝的章法，接下来的分析的Heritrix3.1.0系统封装HttpClient组件可能要分⼏篇⽂章来解析我们知道，Heritrix3.1.0系统是通过封装HttpClient组件（⾥⾯封装了Socket）来与服务器通信的，Socket的输出流写⼊数据，输⼊流接收数据那么Heritrix3.1.0系统是怎样封装Httpclient（Heritrix3.1.0系统是采⽤的以前的Apache版本）组件的呢?我们可以看到，在FetchHTTP处理器⾥⾯有⼀段静态代码块，⽤于注册Socket⼯⼚，分别⽤于HTTP通信与HTTPS通信协议（基于TCP协议通信，⾄于两者的关系本⽂就不再分析了，不懂的读者可以参考⽹络通信⽅⾯的教程）/*** 注册http和https协议*/static {Protocol.registerProtocol("http", new Protocol("http",new HeritrixProtocolSocketFactory(), 80));try {ProtocolSocketFactory psf = new HeritrixSSLProtocolSocketFactory();Protocol p = new Protocol("https", psf, 443);Protocol.registerProtocol("https", p);} catch (KeyManagementException e) {e.printStackTrace();} catch (KeyStoreException e) {e.printStackTrace();} catch (NoSuchAlgorithmException e) {e.printStackTrace();}}上⾯的两个类HeritrixProtocolSocketFactory和HeritrixSSLProtocolSocketFactory都实现了HttpClient组件的ProtocolSocketFactory接⼝，⽤于创建客户端Socket对象（HeritrixSSLProtocolSocketFactory类间接实现了ProtocolSocketFactory接⼝）ProtocolSocketFactory接⼝定义了创建SOCKET对象的⽅法（package mons.httpclient.protocol）/*** A factory for creating Sockets.** Both {@link ng.Object#equals(ng.Object) Object.equals()} and* {@link ng.Object#hashCode() Object.hashCode()} should be overridden appropriately.* Protocol socket factories are used to uniquely identify <code>Protocol</code>s and* <code>HostConfiguration</code>s, and <code>equals()</code> and <code>hashCode()</code> are* required for the correct operation of some connection managers.** @see Protocol** @author Michael Becke* @author <a href="mailto:mbowler@">Mike Bowler</a>** @since 2.0*/public interface ProtocolSocketFactory {/*** Gets a new socket connection to the given host.** @param host the host name/IP* @param port the port on the host* @param localAddress the local host name/IP to bind the socket to* @param localPort the port on the local machine** @return Socket a new socket** @throws IOException if an I/O error occurs while creating the socket* @throws UnknownHostException if the IP address of the host cannot be* determined*/Socket createSocket(String host,int port,InetAddress localAddress,int localPort) throws IOException, UnknownHostException;/*** Gets a new socket connection to the given host.** @param host the host name/IP* @param port the port on the host* @param localAddress the local host name/IP to bind the socket to* @param localPort the port on the local machine* @param params {@link HttpConnectionParams Http connection parameters}** @return Socket a new socket** @throws IOException if an I/O error occurs while creating the socket* @throws UnknownHostException if the IP address of the host cannot be* determined* @throws ConnectTimeoutException if socket cannot be connected within the* given time limit** @since 3.0*/Socket createSocket(String host,int port,InetAddress localAddress,int localPort,HttpConnectionParams params) throws IOException, UnknownHostException, ConnectTimeoutException;/*** Gets a new socket connection to the given host.** @param host the host name/IP* @param port the port on the host** @return Socket a new socket** @throws IOException if an I/O error occurs while creating the socket* @throws UnknownHostException if the IP address of the host cannot be* determined*/Socket createSocket(String host,int port) throws IOException, UnknownHostException;}HeritrixProtocolSocketFactory类实现了上⾯的ProtocolSocketFactory接⼝（⽤于HTTP通信）public class HeritrixProtocolSocketFactory implements ProtocolSocketFactory {/*** Constructor.*/public HeritrixProtocolSocketFactory() {super();}@Overridepublic Socket createSocket(String host, int port, InetAddress localAddress,int localPort) throws IOException, UnknownHostException {// TODO Auto-generated method stubreturn new Socket(host, port, localAddress, localPort);}@Overridepublic Socket createSocket(String host, int port, InetAddress localAddress,int localPort, HttpConnectionParams params) throws IOException,UnknownHostException, ConnectTimeoutException {// TODO Auto-generated method stub// Below code is from the DefaultSSLProtocolSocketFactory#createSocket// method only it has workarounds to deal with pre-1.4 JVMs. I've// cut these out.if (params == null) {throw new IllegalArgumentException("Parameters may not be null");}Socket socket = null;int timeout = params.getConnectionTimeout();if (timeout == 0) {socket = createSocket(host, port, localAddress, localPort);} else {socket = new Socket();InetAddress hostAddress;Thread current = Thread.currentThread();if (current instanceof HostResolver) {HostResolver resolver = (HostResolver)current;hostAddress = resolver.resolve(host);} else {hostAddress = null;}InetSocketAddress address = (hostAddress != null)?new InetSocketAddress(hostAddress, port):new InetSocketAddress(host, port);socket.bind(new InetSocketAddress(localAddress, localPort));try {socket.connect(address, timeout);} catch (SocketTimeoutException e) {// Add timeout info. to the exception.throw new SocketTimeoutException(e.getMessage() +": timeout set at " + Integer.toString(timeout) + "ms.");}assert socket.isConnected(): "Socket not connected " + host;}return socket;}@Overridepublic Socket createSocket(String host, int port) throws IOException,UnknownHostException {// TODO Auto-generated method stubreturn new Socket(host, port);}/*** All instances of DefaultProtocolSocketFactory are the same.* @param obj Object to compare.* @return True if equal*/public boolean equals(Object obj) {return ((obj != null) &&obj.getClass().equals(HeritrixProtocolSocketFactory.class));}/*** All instances of DefaultProtocolSocketFactory have the same hash code.* @return Hash code for this object.*/public int hashCode() {return HeritrixProtocolSocketFactory.class.hashCode();}}HeritrixSSLProtocolSocketFactory类通过SecureProtocolSocketFactory实现SecureProtocolSocketFactory接⼝(间接实现了ProtocolSocketFactory接⼝)⽤于HTTPS通信SecureProtocolSocketFactory接⼝⽅法如下/*** A ProtocolSocketFactory that is secure.** @see mons.httpclient.protocol.ProtocolSocketFactory** @author Michael Becke* @author <a href="mailto:mbowler@">Mike Bowler</a>* @since 2.0*/public interface SecureProtocolSocketFactory extends ProtocolSocketFactory {/*** Returns a socket connected to the given host that is layered over an* existing socket. Used primarily for creating secure sockets through* proxies.** @param socket the existing socket* @param host the host name/IP* @param port the port on the host* @param autoClose a flag for closing the underling socket when the created* socket is closed** @return Socket a new socket** @throws IOException if an I/O error occurs while creating the socket* @throws UnknownHostException if the IP address of the host cannot be* determined*/Socket createSocket(Socket socket,String host,int port,boolean autoClose) throws IOException, UnknownHostException;}HeritrixSSLProtocolSocketFactory类实现上⾯的SecureProtocolSocketFactory接⼝/*** Implementation of the commons-httpclient SSLProtocolSocketFactory so we* can return SSLSockets whose trust manager is* {@link org.archive.httpclient.ConfigurableX509TrustManager}.** We also go to the heritrix cache to get IPs to use making connection.* To this, we have dependency on {@link HeritrixProtocolSocketFactory};* its assumed this class and it are used together.* See {@link HeritrixProtocolSocketFactory#getHostAddress(ServerCache,String)}.** @author stack* @version $Id: HeritrixSSLProtocolSocketFactory.java 6637 2009-11-10 21:03:27Z gojomo $ * @see org.archive.httpclient.ConfigurableX509TrustManager*/public class HeritrixSSLProtocolSocketFactory implements SecureProtocolSocketFactory { // static final String SERVER_CACHE_KEY = "heritrix.server.cache";static final String SSL_FACTORY_KEY = "heritrix.ssl.factory";/**** Socket factory with default trust manager installed.*/private SSLSocketFactory sslDefaultFactory = null;/*** Shutdown constructor.* @throws KeyManagementException* @throws KeyStoreException* @throws NoSuchAlgorithmException*/public HeritrixSSLProtocolSocketFactory()throws KeyManagementException, KeyStoreException, NoSuchAlgorithmException{// Get an SSL context and initialize it.SSLContext context = SSLContext.getInstance("SSL");// I tried to get the default KeyManagers but doesn't work unless you// point at a physical keystore. Passing null seems to do the right// thing so we'll go w/ that.context.init(null, new TrustManager[] {new ConfigurableX509TrustManager(ConfigurableX509TrustManager.DEFAULT)}, null);this.sslDefaultFactory = context.getSocketFactory();}@Overridepublic Socket createSocket(String host, int port, InetAddress clientHost,int clientPort)throws IOException, UnknownHostException {return this.sslDefaultFactory.createSocket(host, port,clientHost, clientPort);}@Overridepublic Socket createSocket(String host, int port)throws IOException, UnknownHostException {return this.sslDefaultFactory.createSocket(host, port);}@Overridepublic synchronized Socket createSocket(String host, int port,InetAddress localAddress, int localPort, HttpConnectionParams params)throws IOException, UnknownHostException {// Below code is from the DefaultSSLProtocolSocketFactory#createSocket// method only it has workarounds to deal with pre-1.4 JVMs. I've// cut these out.if (params == null) {throw new IllegalArgumentException("Parameters may not be null");}Socket socket = null;int timeout = params.getConnectionTimeout();if (timeout == 0) {socket = createSocket(host, port, localAddress, localPort);} else {SSLSocketFactory factory = (SSLSocketFactory)params.getParameter(SSL_FACTORY_KEY);//SSL_FACTORY_KEYSSLSocketFactory f = (factory != null)? factory: this.sslDefaultFactory;socket = f.createSocket();Thread current = Thread.currentThread();InetAddress hostAddress;if (current instanceof HostResolver) {HostResolver resolver = (HostResolver)current;hostAddress = resolver.resolve(host);} else {hostAddress = null;}InetSocketAddress address = (hostAddress != null)?new InetSocketAddress(hostAddress, port):new InetSocketAddress(host, port);socket.bind(new InetSocketAddress(localAddress, localPort));try {socket.connect(address, timeout);} catch (SocketTimeoutException e) {// Add timeout info. to the exception.throw new SocketTimeoutException(e.getMessage() +": timeout set at " + Integer.toString(timeout) + "ms.");}assert socket.isConnected(): "Socket not connected " + host;}return socket;}@Overridepublic Socket createSocket(Socket socket, String host, int port,boolean autoClose)throws IOException, UnknownHostException {return this.sslDefaultFactory.createSocket(socket, host,port, autoClose);}public boolean equals(Object obj) {return ((obj != null) && obj.getClass().equals(HeritrixSSLProtocolSocketFactory.class));}public int hashCode() {return HeritrixSSLProtocolSocketFactory.class.hashCode();}}HTTPS通信的SOCKET对象是通过SSLSocketFactory sslDefaultFactory（SSLSocket⼯⼚）对象创建的，为了创建SSLSocketFactory sslDefaultFactory对象Heritrix3.1.0系统定义了X509TrustManager接⼝的实现类ConfigurableX509TrustManager（⽤于SSL通信，⾃动接收证书）/*** A configurable trust manager built on X509TrustManager.** If set to 'open' trust, the default, will get us into sites for whom we do* not have the CA or any of intermediary CAs that go to make up the cert chain* of trust. Will also get us past selfsigned and expired certs. 'loose'* trust will get us into sites w/ valid certs even if they are just* selfsigned. 'normal' is any valid cert not including selfsigned. 'strict'* means cert must be valid and the cert DN must match server name.** Based on pointers in* <a href="/commons/httpclient/sslguide.html">SSL* Guide</a>,* and readings done in <a* href="/j2se/1.4.2/docs/guide/security/jsse/JSSERefGuide.html#Introduction">JSSE* Guide</a>.** TODO: Move to an ssl subpackage when we have other classes other than* just this one.** @author stack* @version $Id: ConfigurableX509TrustManager.java 6637 2009-11-10 21:03:27Z gojomo $*/public class ConfigurableX509TrustManager implements X509TrustManager{/*** Logging instance.*/protected static Logger logger = Logger.getLogger("org.archive.httpclient.ConfigurableX509TrustManager");public static enum TrustLevel {/*** Trust anything given us.** Default setting.** See <a href="/egs/.ssl/TrustAll.html">* e502. Disabling Certificate Validation in an HTTPS Connection</a> from* the java almanac for how to trust all.*/OPEN,/*** Trust any valid cert including self-signed certificates.*//*** Normal jsse behavior.** Seemingly any certificate that supplies valid chain of trust.*/NORMAL,/*** Strict trust.** Ensure server has same name as cert DN.*/STRICT,}/*** Default setting for trust level.*/public final static TrustLevel DEFAULT = TrustLevel.OPEN;/*** Trust level.*/private TrustLevel trustLevel = DEFAULT;/*** An instance of the SUNX509TrustManager that we adapt variously* depending upon passed configuration.** We have it do all the work we don't want to.*/private X509TrustManager standardTrustManager = null;public ConfigurableX509TrustManager()throws NoSuchAlgorithmException, KeyStoreException {this(DEFAULT);}/*** Constructor.** @param level Level of trust to effect.** @throws NoSuchAlgorithmException* @throws KeyStoreException*/public ConfigurableX509TrustManager(TrustLevel level)throws NoSuchAlgorithmException, KeyStoreException {super();TrustManagerFactory factory = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());// Pass in a null (Trust) KeyStore. Null says use the 'default'// 'trust' keystore (KeyStore class is used to hold keys and to hold// 'trusts' (certs)). See 'X509TrustManager Interface' in this doc://// /j2se/1.4.2/docs/guide/security/jsse/JSSERefGuide.html#Introduction factory.init((KeyStore)null);TrustManager[] trustmanagers = factory.getTrustManagers();if (trustmanagers.length == 0) {throw new NoSuchAlgorithmException(TrustManagerFactory.getDefaultAlgorithm() + " trust manager not supported");}this.standardTrustManager = (X509TrustManager)trustmanagers[0];this.trustLevel = level;}@Overridepublic void checkClientTrusted(X509Certificate[] certificates, String type) throws CertificateException {if (this.trustLevel.equals(TrustLevel.OPEN)) {return;}this.standardTrustManager.checkClientTrusted(certificates, type);}@Overridepublic void checkServerTrusted(X509Certificate[] certificates, String type) throws CertificateException {if (this.trustLevel.equals(TrustLevel.OPEN)) {try {this.standardTrustManager.checkServerTrusted(certificates, type);if (this.trustLevel.equals(TrustLevel.STRICT)) {logger.severe(TrustLevel.STRICT + " not implemented.");}} catch (CertificateException e) {if (this.trustLevel.equals(TrustLevel.LOOSE) &&certificates != null && certificates.length == 1){// If only one cert and its valid and it caused a// CertificateException, assume its selfsigned.X509Certificate certificate = certificates[0];certificate.checkValidity();} else {// If we got to here, then we're probably NORMAL. Rethrow.throw e;}}}@Overridepublic X509Certificate[] getAcceptedIssuers() {return this.standardTrustManager.getAcceptedIssuers();}}---------------------------------------------------------------------------本系列Heritrix 3.1.0 源码解析系本⼈原创转载请注明出处博客园刺猬的温驯。

搜索引擎实例讲解

提取链主要是获得资源进行dns转换填写请求和响应表单抽取链当提取完成时抽取感兴趣的htmljavascript通常那里有新的也适合的uri此时uri仅仅被发现不会被评估存储爬行结果返回内容和抽取特性过滤完存储
搜索引擎
Heritrix介绍
• 在一个搜索引擎的开发过程中，使用一个合适的爬虫来获得所需要的网页信息是第一步，这一步是整个系统成功的基础。 • Heritrix是一个纯由Java开发的、开源的Web 网络爬虫，用户可以使用它从网络上抓取想要的资源。 • 它来自于。 • Heritrix最出色之处在于它的可扩展性，开发者可以扩展它的各个组件，来实现自己的抓取逻辑。
Modules
• CrawlScope用于配置当前应该在什么范围内抓取网页链接。比如，如果选择BroadScope，则表示当前抓取的范围不受限制，但如果选择了 HostScope，则表示抓取的范围在当前的Host内 • Frontier则是一个URL的处理器，它将决定下一个被处理的URL是什么。同时，它还会将经由处理器链所解析出来的URL加入到等待处理的队列中去。 • PreProcessor：这个队列中，所有的处理器都是用来对抓取时的一些先决条件做判断的。比如判断robot.txt的信息等，它是整个处理器链的入口。 • Fetcher：从名称上看，它用于解析网络传输协议，比如解析DNS、HTTP 或FTP等。 • Extractor：它的名字就很好的揭示了它的作用。它主要用是于解析当前获取到的服务器返回内容，这些内容通常是以字符串形式缓存的。在这个队列中，包括了一系列的工具，如解析HTML、CSS等。在解析完毕，取出页面中的URL后，将它们放入队列中，等待下次继续抓取。 • Writer：主要是用于将所抓取到的信息写入磁盘。通常写入磁盘时有两种形式，一种是采用压缩的方式写入，在这里被称为Arc方式，另一种则采用镜象方式写入。当然处理起来，镜象方式要更为容易一些， • PostProcessor：在整个抓取解析过程结束后，进行一些扫尾的工作，比如将前面Extractor解析出来的URL有条件的加入到待处理队列中去。

java爬虫框架有哪些,各有什么特点

java爬虫框架有哪些，各有什么特点目前主流的Java爬虫框架主要有Nutch、Crawler4j、WebMagic、scrapy、WebCollector等，各有各的特点，大家可以根据自己的需求选择使用，下面为大家详细介绍常见的java爬虫框架有哪些？各有什么特点？常见的java爬虫框架有哪些1、NutchNutch是一个基于Lucene，类似Google的完整网络搜索引擎解决方案，基于Hadoop的分布式处理模型保证了系统的性能，类似Eclipse 的插件机制保证了系统的可客户化，而且很容易集成到自己的应用之中。

总体上Nutch可以分为2个部分：抓取部分和搜索部分。

抓取程序抓取页面并把抓取回来的数据做成反向索引，搜索程序则对反向索引搜索回答用户的请求。

抓取程序和搜索程序的接口是索引，两者都使用索引中的字段。

抓取程序和搜索程序可以分别位于不同的机器上。

下面详细介绍一下抓取部分。

Nutch抓取部分：抓取程序是被Nutch的抓取工具驱动的。

这是一组工具，用来建立和维护几个不同的数据结构：web database，a set of segments，and the index。

下面逐个解释这三个不同的数据结构：1、The web database，或者WebDB。

这是一个特殊存储数据结构，用来映像被抓取网站数据的结构和属性的集合。

WebDB 用来存储从抓取开始（包括重新抓取）的所有网站结构数据和属性。

WebDB 只是被抓取程序使用，搜索程序并不使用它。

WebDB 存储2种实体：页面和链接。

页面表示网络上的一个网页，这个网页的Url作为标示被索引，同时建立一个对网页内容的MD5 哈希签名。

跟网页相关的其它内容也被存储，包括：页面中的链接数量（外链接），页面抓取信息（在页面被重复抓取的情况下），还有表示页面级别的分数score 。

链接表示从一个网页的链接到其它网页的链接。

因此WebDB 可以说是一个网络图，节点是页面，链接是边。

基于Heritrix限定爬虫的设计与实现

０引言
随着互联网中网页数量的急剧增长，面对如此庞大的网络资源，快速准确地找到自己需要的信息变得越来越困难。因此
元搜索引擎：元搜索引擎并没有自己的机器人和搜索算法，
而是将用户的查询请求同时递交给多个搜索引擎，它的注意力放在改进用户界面及用不同的方法过滤策略，将其他搜索引擎
ＺｈａｎｇＭｉｎＳｕｎＭｉｎ
（ＣｏｌｌｅｇｅｏｆＩｎｆｏｒｍａｔｉｏｎＥｎｇｉｎｅｅｒｉｎｇ，Ｄａｌｉａｎ西ｅ
Ａｂｓｔｒａｃｔ
，Ｄａｌｉａｎ１１６６２２，Ｌｉａｏｎｉｎｇ，Ｃｈｉｎａ）
ｔｈｅｗｅｂｐａｇｅｓｏｆａｐａｒｔｉｃｕｌａｒｗｅｂｓｉｔｅ，ｏｒｏｆａｃｅｔｒａｉｎｒｅｇｉｏｎ，ｓｏｔｈｅｃｏｍｍｏｎｓｐｉｄｅｒｃａｎｂｅｏｆｎｏｈｅｌｐ．Ａｃｃｏｒｄｉｎｇｔｏｔｈｅｓｈｏｒｔｃｏｍｉｎｇｏｆｃｏｍｍｏｎｓｐｉｄｅｒ，ｉｎｔｈｉｓｐａｐｅｒｗｅｅｌａｂｏｒａｔｅｔｈｅｒｅｌａｔｅｄｃｏｎｃｅｐｔａｎｄｔｈｅｔｅｃｈｎｏｌｏｇｉｅｓｏｆｔｈｅｑｕａｌｉｆｉｅｄｓｐｉｄｅｒ，ａｎｄｉｍｐｌｅｍｅｎｔｂａｓｅｄｏｎＨｅｒｉｔｒｉｘｆｒａｍｅｗｏｒｋａｎｄｔｈｒｏｕｇｈＩＰａｄｄｒｅｓｓｔｈｅｑｕａｌｉｉｆｅｄｓｐｉｄｅｒｃｒａｗｌｉｎｇｗｅｂｐａｇｅｓｏｆｔｈｅｈｏｓｔｏｆａｃｅｔａｒｉｎａｒｅａｏｎｌｙ．Ｉｎｅｎｄｏｆｔｈｅｐａｐｅｒ，ｒｅｌｅｖａｎｔｅｘｐｅｉｒｍｅｎｔｓｈｏｗｓｔｈａｔｔｈｅｑｕａｌｉｉｆｅｄｓｐｉｄｅｒｉｓｒｅａｓｏｎａｂｌｅａｎｄｐｒａｃｔｉｃａ１．

Heritrix 3.0新特性新功能介绍

本博客属原创文章,转载请注明出处:/articles/heritrix3-3.html
Heritrix3.0新特性很给力.从性能,功能,灵活配置和灵活控制上都改进很大,可以说更适合垂直抓取了
一.英文原文 ,点击查看
1. Ability to run multiple crawl jobs simultaneously. The only limit on the number of crawl jobs that can run concurrently is the memory allocated to Heritrix.
concurrent crawling connections to a single site.
9. A Scripting Console that accepts script input in various formats such as AppleScript and ECMAScript. Scripting can be used to programmaticly access
This would cause the crawl to crash. Heritrix 3.0 eliminates these problems, allowing stable processing of large scale scrawls.
7. Increased flexibility when modifying a running crawl. Running crawls can be modified by using the Bean Browser or by using the Action Directory.
3. 可以动态更改抓取,并且不用重启.以前版本更改抓取的话,如增量一些类,更改order.xml配置,都需要停止Heritrix再更改,3.0则可以动态修改,可以从以下几个方面:

基于Heritrix的网络爬虫研究与应用

（北方工业大学，北京１００１４４）
摘要：主要介绍了主题搜索引擎、网络爬虫的基本概念和Ｈｅｒｉｔｒｉｘ系统的体系结构，分析了Ｈｅｒｉｔｒｉｘ的工作流程，
在Ｈｅｒｉｔｒｉｘ框架的基础上进行扩展和优化。通过一个实例，实现了对京东网图书信息ｒａｗｌＯｒｄｅｒ，是整个抓取工作的起始点，在一次抓取过程中通常需要设置许多属性值，最简洁实用
是用来不停地抓取不同的网页，可以使用它从网络上抓取
丰富的信息资源。Ｈｅｒｉｔｒｉｘ最出色之处是它具有良好的可扩展性和可自定义开发，开发者可以在它的组件模块上加人自己的业务逻辑进行二次开发，以实现自己特有的抓
服务功能，再将检索到的相关信息呈现给用户，供用户使用］。搜索引擎一般由网络爬虫程序部件、索引数据库部
件和用户查询接口部件３部分组成］。
１Ｈｅｒｉｔｒｉｘ体系结构
Ｈｅｒｉｔｒｉｘ是由Ｊａｖａ语言开发的一种开放源代码的网
取目标。
中，如何才能快速有效地找到自己想要的某个主题信息是
用户面临的一个难题。搜索引擎是指根据相关的搜索策略与算法从互联网上采集网页资源，然后对网页信息进行选择、过滤、处理后保留有用的信息资源，为用户提供查询

Heritrix使用指南01

Lucene很强大，这点在前面的章节中，已经作了详细介绍。

但是，无论多么强大的搜索引擎工具，在其后台，都需要一样东西来支援它，那就是网络爬虫Spi der。

网络爬虫，又被称为蜘蛛Spider，或是网络机器人、BOT等，这些都无关紧要，最重要的是要认识到，由于爬虫的存在，才使得搜索引擎有了丰富的资源。

Heritrix是一个纯由Java开发的、开源的Web网络爬虫，用户可以使用它从网络上抓取想要的资源。

它来自于。

Heritrix最出色之处在于它的可扩展性，开发者可以扩展它的各个组件，来实现自己的抓取逻辑。

本章就来详细介绍一下Heritrix和它的各个组件。

10.1 Heritrix的使用入门要想学会使用Heritrix，当然首先得能把它运行起来。

然而，运行Heritrix并非一件容易的事，需要进行很多配置。

在Heritrix的文档中对它的运行有详细的介绍，不过尽管如此，笔者仍然花了大量时间，才将其配置好并运行成功。

10.1.1 下载和运行HeritrixHeritrix的下载页面为：/downloads.html。

从上面可以链接到SourceForge的下载页面。

当前Heritrix的最新版本为1.10。

（1）在下载完Heritrix的完整开发包后，解压到本地的一个目录下，如图10-1所示。

图10-1 Heritrix的目录结构其中，Heritrix所用到的工具类库都存于lib下，heritrix-1.10.1.jar是Her itrix的Jar包。

另外，在Heritrix目录下有一个conf目录，其中包含了一个很重要的文件：heritrix.properties。

（2）在heritrix.properties中配置了大量与Heritrix运行息息相关的参数，这些参数主要是配置了Heritrix运行时的一些默认工具类、WebUI的启动参数，以及Heritrix的日志格式等。

当第一次运行Heritrix时，只需要修改该文件，为其加入WebUI的登录名和密码，如图10-2所示。

基于Heritrix的内容搜索引擎系统

且在信息存储和描述领域变得越来越流行。系统中采用ｊｖ本ａａ
语言，于ＸＡＨ基类，现对Ｘ文档的处理。基ＰＴ实Ｍｌ
抓取内容分析模块：抓取的网页内容，们需要做进一对我步的过滤．除同样的内容，照统一格式进行存放，于系列剔按对
通过对文章标题、布时间段、者、容分别设定搜索条件，发作内从而达到更加精确的定位。网页内容抓取模块：设定好的关键字输入Ｈｒｒ，动将ｅｉｉ启ｔｘ异步抓取过程。
并存储相关的内容。它对内容来者不拒，不对页面进行内容上
数据．是各种应用程序之间进行数据传输的最常用的工具，并
本文选择Ｈｅｉｉ为研究基础，是因为其良好的扩展ｒｒｔｘ作性，开发人员可以任意添加过滤组件、取逻辑组件．而构建抓从自己所需要的互联网内容检索系统。
终止．可以设定抓取时段，于用户对系列专题的跟踪处理。还用
ＸＭＬ文件处理模块：ＭＬ（ＸｔｎｉｌＭａｋｐＬｎｕｇ）ＸＥｅｓｂｅｒｕａｇａｅ指
可扩展标记语言，Ｗ３是Ｃ的推荐标准，设计用来传输和存储被
入手，力于提出一种集搜索、容分析、果输出的一体化搜索系统。致内结

基于Heritrix的面向电子商务网站增量爬虫研究

第９第７卷期
２１年７月００
软件导刊
ＳｏｔｒｉｅｆｗａｅＧｕｄ
Ｖｏ．７１Ｎｏ．９Ｊ１２１ｕ．００
基于Ｈｒｒｅｉｉｔｘ的面向电子商务网站增量爬虫研究
杨颂，阳柳波欧

（南大学软件学院，南长沙４０８）湖湖１０２
商品信息，实现了增量抓取。并关键词：ｒｒ；量抓取；行策略；Ｈｅｔｘ增ｉｉ爬电子商务中图分类号：３３ＴＰ９文献标识码：Ａ文章编号：６２７０（０００一０８０１７ — ８０２１）７ｏ３ — ２
１Ｈｅｉｉ简介ｒｔｘｒ
Ｈｒｒｅｉｉ目是一个开源的、可扩展的Ｗｅｔｘ项ｂ爬虫项目．基于Ｊｖａａ语言实现。用其出色的可扩展性，发者可以扩展它利开
２基于Ｈｅｉｉ的增量爬虫设计ｒｔｘｒ
（ｏ）在创建Ｊｂ的过程中设置Ｊｂ名称、述、子等信息，Ｊｂ，ｏｏ描种种子可以是一个或多个。个Ｊｂ还要配置处理链和运行时参每ｏ
有页面重新抓取，则需要重新对该范围下所有页面重新抓取。为了在满足带宽及其他资源的限制下。最大化地维护抓取页面的时新性，障用户搜索时，保结果是最新的，而不是过时的或不存在的信息，必须通过增量抓取来实现，本文制定了相应的增

1、下载文档前请自行甄别文档内容的完整性，平台不提供额外的编辑、内容补充、找答案等附加服务。
2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
3、如文档侵犯您的权益，请联系客服反馈,我们会尽快为您处理(人工客服工作时间：9:00-18:30)。

Lucene很强大，这点在前面的章节中，已经作了详细介绍。

但是，无论多么强大的搜索引擎工具，在其后台，都需要一样东西来支援它，那就是网络爬虫Spi der。

网络爬虫，又被称为蜘蛛Spider，或是网络机器人、BOT等，这些都无关紧要，最重要的是要认识到，由于爬虫的存在，才使得搜索引擎有了丰富的资源。

Heritrix是一个纯由Java开发的、开源的Web网络爬虫，用户可以使用它从网络上抓取想要的资源。

它来自于。

Heritrix最出色之处在于它的可扩展性，开发者可以扩展它的各个组件，来实现自己的抓取逻辑。

本章就来详细介绍一下Heritrix和它的各个组件。

10.1 Heritrix的使用入门要想学会使用Heritrix，当然首先得能把它运行起来。

然而，运行Heritrix并非一件容易的事，需要进行很多配置。

在Heritrix的文档中对它的运行有详细的介绍，不过尽管如此，笔者仍然花了大量时间，才将其配置好并运行成功。

10.1.1 下载和运行HeritrixHeritrix的下载页面为：/downloads.html。

从上面可以链接到SourceForge的下载页面。

当前Heritrix的最新版本为1.10。

（1）在下载完Heritrix的完整开发包后，解压到本地的一个目录下，如图10-1所示。

图10-1 Heritrix的目录结构其中，Heritrix所用到的工具类库都存于lib下，heritrix-1.10.1.jar是Her itrix的Jar包。

另外，在Heritrix目录下有一个conf目录，其中包含了一个很重要的文件：heritrix.properties。

当第一次运行Heritrix时，只需要修改该文件，为其加入WebUI的登录名和密码，如图10-2所示。

图10-2 修改Heritrix的WebUI的登录名和密码其中，用户名和密码是以一个冒号进行分隔，使用者可以指定任何的字符串做为用户名密码，图中所示只不过延续了Heritrix以前版本中默认的用户名和密码而已。

（3）在设置完登录名和密码后，就可以开始运行Heritrix了。

Heritrix有多种方式启动，例如，可以使用CrawlController，以后台方式加载一个抓取任务，即为编程式启动。

不过最常见的还是以WebUI的方式启动它。

（4）Heritrix的主类为org.archive.crawler.Heritrix，运行它，就可以启动Heritrix。

当然，在运行它的时候，需要为其加上lib目录下的所有jar包。

以下是笔者在命令行中启动Heritrix时所使用的批处理文件，此处列出，仅供读者参考（笔者的Heritrix目录是位于E盘的根目下，即E:\heritrix）。

代码10.1java -Xmx512m-Dheritrix.home=e:\\heritrix-cp"E:\\heritrix\\lib\\commons-cod ec-1.3.jar;E:\\heritrix\\lib\\commons-collections-3.1.jar;E:\\heritrix\\lib\\dnsjava-1.6.2.jar; E:\\heritrix\\lib\\poi-scratchpad-2.0-RC1-20031102.jar;E:\\heritrix\\lib\\commons-logging -1.0.4.jar;E:\\heritrix\\lib\\commons-httpclient-3.0.1.jar;E:\\heritrix\\lib\\commons-cli-1.0. jar;E:\\heritrix\\lib\\mg4j-1.0.1.jar;E:\\heritrix\\lib\\javaswf-CVS-SNAPSHOT-1.jar;E:\\her itrix\\lib\\bsh-2.0b4.jar;E:\\heritrix\\lib\\servlet-tomcat-4.1.30.jar;E:\\heritrix\\lib\\junit-3.8.1.jar;E:\\heritrix\\lib\\jasper-compiler-tomcat-4.1.30.jar;E:\\heritrix\\lib\\commons-lang -2.1.jar;E:\\heritrix\\lib\\itext-1.2.0.jar;E:\\heritrix\\lib\\poi-2.0-RC1-20031102.jar;E:\\her itrix\\lib\\jetty-4.2.23.jar;E:\\heritrix\\lib\\commons-net-1.4.1.jar;E:\\heritrix\\lib\\libidn-0.5.9.jar;E:\\heritrix\\lib\\ant-1.6.2.jar;E:\\heritrix\\lib\\fastutil-5.0.3-heritrix-subset-1.0.jar; E:\\heritrix\\lib\\je-3.0.12.jar;E:\\heritrix\\lib\\commons-pool-1.3.jar;E:\\heritrix\\lib\\jas per-runtime-tomcat-4.1.30.jar;E:\\heritrix\\heritrix-1.10.1.jar" org.archive.crawler.He ritrix（5）在上面的批处理文件中，将Heritrix所用到的所有的第三方Jar包都写进了classpath中，同时执行了org.archive.crawler.Heritrix这个主类。

图10 -3为Heritrix启动时的画面。

图10-3 Heritrix的启动画面（6）在这时，Heritrix的后台已经对服务器的8080端口进行了监听，只需要通过浏览器访问http://localhost:8080，就可以打开Heritrix的WebUI了。

如图10-4所示。

图10-4 Heritrix的WebUI的登录界面（7）在这个登录界面，输入刚才在Heritrix.properties中预设的WebUI的用户名和密码，就可以进入如图10-5所示的Heritrix的WebUI的主界面。

图10-5 登录后的界面（8）当看到这个页面的时候，就说明Heritrix已经成功的启动了。

在页面的中央有一道状态栏，用于标识当前正在运行的抓取任务。

如图10-6所示：图10-6 抓取任务的状态栏在这个WebUI的帮助下，用户就可以开始使用Heritrix来抓取网页了。

10.1.2 在Eclipse里配置Heritrix的开发环境讲完了通过命令行方式启动的Heritrix，当然要讲一下如何在Eclipse中配置H eritrix的开发环境，因为可能需要对代码进行调试，甚至修改一些它的源代码，来达到所需要的效果。

下面来研究一下Heritrix的下载包。

（1）webapps文件夹是用来提供Servlet引擎的，也就是提供Heritrix的Web UI的部分，因此，在构建开发环境时必不可少。

conf文件夹是用来提供配置文件的，因此也需要配置进入工程。

Lib目录下主要是Heritrix在运行时需要用到的第三方的软件，因此，需要将其设定到Eclipse的Build Path下。

最后就是Heritrix的jar包了，将其解压，可以看到其内部的结构如图10-7所示。

图10-7 Heritrix的Jar包的结构（2）根据图10-7所示，应该从Heritrix的源代码包中把这些内容取出，然后放置到工程中来。

Heritrix的源代码包解压后，只有两个文件夹，如图10-8所示。

图10-8 Heritrix的源代码包的结构（3）只需在src目录下，把图10-7中的内容配全，就可以将工程的结构完整了。

如图10-9所示。

图10-9 src目录下的内容（4）图10-10和图10-11是笔者机器上的Heritrix在Eclipse中的工程配置好后的截图，以及workspace中文件夹的预览。

图10-10 Eclipse工程视图下的包结构图10-11 文件夹中的工程其中，org目录内是Heritrix的源代码，另外，笔者将conf目录去掉了，直接将heritrix.properties文件放在了工程目录下。

在图10-10中，读者可能没有看到Heritrix所使用到的Jar包，这是因为在工程视图中，它们被过滤器过滤掉了，实际上，所有lib目录下的jar包都已经被加进了build path中。

（5）不过，读者很有可能遇到这样的情况，那就是在将所有的jar包都导入后，工程编译完成，却发现在左边的package explorer中出现了大量的编译错误。

如图10-12所示。

图10-12 出现的编辑错误（6）随便打开一个出错的文件，如图10-13所示，会发现大量的错误都来自于“assert”关键字。

这种写法似乎Eclipse不认识。

图10-13 出错的程序（7）解决问题的关键在于，Eclipse的编译器不认识assert这个关键字。

可以在“选项”菜单中将编译器的语法样式改为5.0，也就是JDK1.5兼容的语法，然后重启编译整个工程就可以了。

如图10-14所示。

图10-14 改变编译器的语法等级（8）在重新编译完整个工程后，笔者的Eclipse中仍然出现了一个编译错误，那就是在org.archive.io.ArchiveRecord类中，如图10-15所示。

图10-15 一个仍然存在的错误从代码看来，这是因为在使用条件表达式，对strippedFileName这个String 类型的对象赋值时，操作符的右则出现了一个char型的常量，因此影响了编译。

暂且不论为什么在Heritrix的源代码中会出现这样的错误，解决问题的办法就是将char变成String类型，即：buffer.append(strippedFileName != null? strippedFileName: "-");（9）当这样修改完后，整个工程的错误就被全部解决了，也就可以开始运行He ritrix了。

在Eclipse下运行org.archive.crawler.Heritrix类，如图10-16所示。

图10-16 在Eclipse中运行Heritrix（10）当看到图10-17所示的界面时，就说明Heritrix已经成功的在Eclipse 中运行，也就意味着可以使用Eclipse来对Heritrix进行断点调试和源码修改了。