Crawling the Web
ABSTRACT

Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test in step (c) must be done well over ten thousand times per second, against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.

A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and infinite cache). We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective: in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon.

1. INTRODUCTION

A recent Pew Foundation study [31] states that "Search engines have become an indispensable utility for Internet users" and estimates that as of mid-2002, slightly over 50% of all Americans have used web search to find information. Hence, the technology that powers web search is of enormous practical interest. In this paper, we concentrate on one aspect of the search technology, namely the process of collecting web pages that eventually constitute the search engine corpus.

Search engines collect pages in many ways, among them direct URL submission, paid inclusion, and URL extraction from non-web sources, but the bulk of the corpus is obtained by recursively exploring the web, a process known as crawling or spidering. The basic algorithm is:

(a) Fetch a page.
(b) Parse it to extract all linked URLs.
(c) For all the URLs not seen before, repeat (a)–(c).

Crawling typically starts from a set of seed URLs, made up of URLs obtained by other means as described above and/or of URLs collected during previous crawls. Sometimes crawls are started from a single well-connected page or a directory, but in this case a relatively large portion of the web (estimated at over 20%) is never reached. See [9] for a discussion of the graph structure of the web that leads to this phenomenon.

If we view web pages as nodes in a graph, and hyperlinks as directed edges among these nodes, then crawling becomes a process known in mathematical circles as graph traversal. Various strategies for graph traversal differ in their choice of which node among the nodes not yet explored to explore next; a minimal sketch of the traversal loop is given below.
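The following is a minimal, illustrative sketch of steps (a)–(c) viewed as graph traversal, written in Python. It is not the crawler studied in this paper; the fetch and extract_links helpers are assumed stand-ins for real HTTP and HTML-parsing code. Its purpose is to make the membership test of step (c) explicit: every extracted URL is checked against the set of URLs seen so far.

    from collections import deque

    def crawl(seed_urls, fetch, extract_links, max_pages=1000):
        """Toy crawl loop: fetch(url) returns page content, extract_links(page)
        returns the absolute URLs linked from it (both are assumed helpers)."""
        seen = set(seed_urls)          # the membership test of step (c) lives here
        frontier = deque(seed_urls)    # URLs discovered but not yet downloaded
        pages = 0
        while frontier and pages < max_pages:
            url = frontier.popleft()           # (a) fetch a page
            page = fetch(url)
            pages += 1
            for link in extract_links(page):   # (b) parse it to extract linked URLs
                if link not in seen:           # (c) keep only URLs not seen before
                    seen.add(link)
                    frontier.append(link)
        return seen

Using a queue makes the sketch explore nodes in breadth-first order; a stack would give depth-first order. At the scale discussed above (on the order of a thousand fetches per second), the seen set grows far too large for main memory, and keeping it on disk behind a small in-memory cache is exactly the problem this paper studies.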
Two standard strategies for graph traversal are Depth First Search (DFS) and Breadth First Search (BFS) – they are easy to implement and taught in many introductory algorithms classes (see for instance [34]). However, crawling the web is not a trivial programming exercise but a serious algorithmic and system design challenge because of the following two factors.

1. The web is very large. Currently, Google [20] claims to have indexed over 3 billion pages. Various studies [3, 27, 28] have indicated that, historically, the web has doubled every 9–12 months.

2. Web pages are changing rapidly. If "change" means "any change", then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].

These two factors imply that to obtain a reasonably fresh and complete snapshot of the web, a search engine must crawl at least 100 million pages per day. Therefore, step (a) must be executed about 1,000 times per second, and the membership test in step (c) must be done well over ten thousand times per second, against a set of URLs that is too large to store in main memory. In addition, crawlers typically use a distributed architecture to crawl more pages in parallel, which further complicates the membership test: it is possible that the membership question can only be answered by a peer node, not locally.

A crucial way to speed up the membership test is to cache a (dynamic) subset of the "seen" URLs in main memory. The main goal of this paper is to investigate in depth several URL caching techniques for web crawling. We examined four practical techniques (random replacement, static cache, LRU, and CLOCK) and compared them against two theoretical limits (clairvoyant caching and infinite cache) when run against a trace of a web crawl that issued over one billion HTTP requests. We found that simple caching techniques are extremely effective even at relatively small cache sizes such as 50,000 entries, and we show how these caches can be implemented very efficiently.

The paper is organized as follows: Section 2 discusses the various crawling solutions proposed in the literature and how caching fits in their model. Section 3 presents an introduction to caching techniques and describes several theoretical and practical algorithms for caching. We implemented these algorithms under the experimental setup described in Section 4. The results of our simulations are depicted and discussed in Section 5, and our recommendations for practical algorithms and data structures for URL caching are presented in Section 6. Section 7 contains our conclusions and directions for further research.

2. CRAWLING

Web crawlers are almost as old as the web itself, and numerous crawling systems have been described in the literature. In this section, we present a brief survey of these crawlers (in historical order) and then discuss why most of these crawlers could benefit from URL caching.

The crawler used by the Internet Archive [10] employs multiple crawling processes, each of which performs an exhaustive crawl of 64 hosts at a time. The crawling processes save non-local URLs to disk; at the end of a crawl, a batch job adds these URLs to the per-host seed sets of the next crawl.

The original Google crawler, described in [7], implements the different crawler components as different processes.
A single URL server process maintains the set of URLs to download; crawling processes fetch pages; indexing processes extract words and links; and URL resolver processes convert relative into absolute URLs, which are then fed to the URL server. The various processes communicate via the file system.

For the experiments described in this paper, we used the Mercator web crawler [22, 29]. Mercator uses a set of independent, communicating web crawler processes. Each crawler process is responsible for a subset of all web servers; the assignment of URLs to crawler processes is based on a hash of the URL's host component. A crawler that discovers a URL for which it is not responsible sends this URL via TCP to the crawler that is responsible for it, batching URLs together to minimize TCP overhead. We describe Mercator in more detail in Section 4.

Cho and Garcia-Molina's crawler [13] is similar to Mercator. The system is composed of multiple independent, communicating web crawler processes (called "C-procs"). Cho and Garcia-Molina consider different schemes for partitioning the URL space, including URL-based (assigning a URL to a C-proc based on a hash of the entire URL), site-based (assigning a URL to a C-proc based on a hash of the URL's host part), and hierarchical (assigning a URL to a C-proc based on some property of the URL, such as its top-level domain).

The WebFountain crawler [16] is also composed of a set of independent, communicating crawling processes (the "ants"). An ant that discovers a URL for which it is not responsible sends this URL to a dedicated process (the "controller"), which forwards the URL to the appropriate ant.

UbiCrawler (formerly known as Trovatore) [4, 5] is again composed of multiple independent, communicating web crawler processes. It also employs a controller process which oversees the crawling processes, detects process failures, and initiates fail-over to other crawling processes.

Shkapenyuk and Suel's crawler [35] is similar to Google's; the different crawler components are implemented as different processes. A "crawling application" maintains the set of URLs to be downloaded and schedules the order in which to download them. It sends download requests to a "crawl manager", which forwards them to a pool of "downloader" processes. The downloader processes fetch the pages and save them to an NFS-mounted file system. The crawling application reads those saved pages, extracts any links contained within them, and adds them to the set of URLs to be downloaded.

Any web crawler must maintain a collection of URLs that are to be downloaded. Moreover, since it would be unacceptable to download the same URL over and over, it must have a way to avoid adding URLs to the collection more than once. Typically, avoidance is achieved by maintaining a set of discovered URLs, covering the URLs in the frontier as well as those that have already been downloaded. If this set is too large to fit in memory (which it often is, given that there are billions of valid URLs), it is stored on disk, and caching popular URLs in memory is a win: caching allows the crawler to discard a large fraction of the URLs without having to consult the disk-based set.

Many of the distributed web crawlers described above, namely Mercator [29], WebFountain [16], UbiCrawler [4], and Cho and Garcia-Molina's crawler [13], are comprised of cooperating crawling processes, each of which downloads web pages, extracts their links, and sends these links to the peer crawling process responsible for them; a minimal sketch of this kind of routing is given below.
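As a concrete illustration of host-based partitioning (the scheme used by Mercator, and the site-based scheme of Cho and Garcia-Molina), the Python sketch below assigns each URL to a crawler process by hashing the host component of the URL. The function names and the choice of hash are illustrative assumptions, not code from any of the crawlers above.

    import hashlib
    from urllib.parse import urlsplit

    def responsible_process(url: str, num_processes: int) -> int:
        """Index of the crawler process responsible for url, based on a hash
        of the URL's host component (illustrative choice of hash)."""
        host = urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_processes

    def route(urls, me, num_processes):
        """Split discovered URLs into those this process (index `me`) keeps
        and those it must forward to peer processes."""
        local, remote = [], []
        for u in urls:
            (local if responsible_process(u, num_processes) == me else remote).append(u)
        return local, remote

Because the hash depends only on the host, all URLs from a given web server are handled by the same crawler process, which is what makes each process responsible for a well-defined subset of all web servers.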
However, there is no need to send a URL to a peer crawling process more than once. Maintaining a cache of URLs and consulting that cache before sending a URL to a peer crawler goes a long way toward reducing transmissions to peer crawlers, as we show in the remainder of this paper.

3. CACHING

In most computer systems, memory is hierarchical, that is, there exist two or more levels of memory, representing different tradeoffs between size and speed. For instance, in a typical workstation there is a very small but very fast on-chip memory, a larger but slower RAM memory, and a very large and much slower disk memory. In a network environment, the hierarchy continues with network-accessible storage and so on. Caching is the idea of storing frequently used items from a slower memory in a faster memory. In the right circumstances, caching greatly improves the performance of the overall system and hence it is a fundamental technique in the design of operating systems, discussed at length in any standard textbook [21, 37]. In the web context, caching is often mentioned in the context of a web proxy caching web pages [26, Chapter 11]. In our web crawler context, since the number of visited URLs becomes too large to store in main memory, we store the collection of visited URLs on disk and cache a small portion in main memory.

Caching terminology is as follows: the cache is memory used to store equal-sized atomic items. A cache has size k if it can store at most k items. At each unit of time, the cache receives a request for an item. If the requested item is in the cache, the situation is called a hit and no further action is needed. Otherwise, the situation is called a miss or a fault. If the cache has fewer than k items, the missed item is added to the cache. Otherwise, the algorithm must choose either to evict an item from the cache to make room for the missed item, or not to add the missed item. The caching policy or caching algorithm decides which item to evict. The goal of the caching algorithm is to minimize the number of misses.

Clearly, the larger the cache, the easier it is to avoid misses. Therefore, the performance of a caching algorithm is characterized by the miss ratio for a given cache size. In general, caching is successful for two reasons:

- Non-uniformity of requests. Some requests are much more popular than others. In our context, for instance, a link to a popular portal site is a much more common occurrence than a link to the authors' home pages.

- Temporal correlation or locality of reference. Current requests are more likely to duplicate requests made in the recent past than requests made long ago. The latter terminology comes from the computer memory model – data needed now is likely to be close in the address space to data recently needed. In our context, temporal correlation occurs first because links tend to be repeated on the same page – we found that on average about 30% are duplicates, cf. Section 4.2 – and second, because pages on a given host tend to be explored sequentially and they tend to share many links. For example, many pages on a Computer Science department server are likely to share links to other Computer Science departments in the world, well-known papers, etc.

Because of these two factors, a cache that contains popular requests and recent requests is likely to perform better than an arbitrary cache; the small harness below makes this measurement concrete.
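The following is a minimal sketch (not the authors' simulator) of how a miss ratio can be measured for a replacement policy over a stream of URL requests. The policy shown is random replacement, chosen only because it needs no bookkeeping; the cache size, the interface, and the linear-time eviction are assumptions made purely for illustration.

    import random

    def miss_ratio(requests, k, rng=None):
        """Replay `requests` (an iterable of URLs or fingerprints) against a
        cache holding at most k items under random replacement; return the
        fraction of requests that miss."""
        rng = rng or random.Random(0)
        cache = set()
        misses = total = 0
        for item in requests:
            total += 1
            if item in cache:
                continue                     # hit: nothing to do
            misses += 1                      # miss (a fault)
            if len(cache) < k:
                cache.add(item)              # room left: just add the item
            else:
                # evict a random cached item (O(k) here; fine for a sketch)
                cache.remove(rng.choice(list(cache)))
                cache.add(item)
        return misses / total if total else 0.0

In the paper's experiments the items are 64-bit URL fingerprints (Section 4.2) and the cache size k ranges over many values, with the eviction step replaced by each of the policies described next.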
Caching algorithms try to capture this intuition in various ways. We now describe some standard caching algorithms, whose performance we evaluate in Section 5.

3.1 Infinite cache (INFINITE)

This is a theoretical algorithm that assumes that the size of the cache is larger than the number of distinct requests.

3.2 Clairvoyant caching (MIN)

More than 35 years ago, László Belady [2] showed that if the entire sequence of requests is known in advance (in other words, the algorithm is clairvoyant), then the best strategy is to evict the item whose next request is farthest away in time. This theoretical algorithm is denoted MIN because it achieves the minimum number of misses on any sequence and thus it provides a tight bound on performance.

3.3 Least recently used (LRU)

The LRU algorithm evicts the item in the cache that has not been requested for the longest time. The intuition for LRU is that an item that has not been needed for a long time in the past will likely not be needed for a long time in the future, and therefore the number of misses will be minimized in the spirit of Belady's algorithm. Despite the admonition that "past performance is no guarantee of future results", sadly verified by the current state of the stock markets, in practice LRU is generally very effective. However, it requires maintaining a priority queue of requests. This queue has a processing time cost and a memory cost. The latter is usually ignored in caching situations where the items are large.

3.4 CLOCK

CLOCK is a popular approximation of LRU, invented in the late sixties [15]. An array of mark bits M0, M1, …, Mk corresponds to the items currently in the cache of size k. The array is viewed as a circle, that is, the first location follows the last. A clock handle points to one item in the cache. When a request X arrives, if the item X is in the cache, then its mark bit is turned on. Otherwise, the handle moves sequentially through the array, turning the mark bits off, until an unmarked location is found. The cache item corresponding to the unmarked location is evicted and replaced by X. (A minimal code sketch of this policy is given at the end of this section.)

3.5 Random replacement (RANDOM)

Random replacement (RANDOM) completely ignores the past. If the item requested is not in the cache, then a random item from the cache is evicted and replaced. In most practical situations, random replacement performs worse than CLOCK, but not much worse. Our results exhibit a similar pattern, as we show in Section 5. RANDOM can be implemented without any extra space cost; see Section 6.

3.6 Static caching (STATIC)

If we assume that each item has a certain fixed probability of being requested, independently of the previous history of requests, then at any point in time the probability of a hit in a cache of size k is maximized if the cache contains the k items that have the highest probability of being requested.

There are two issues with this approach: the first is that in general these probabilities are not known in advance; the second is that the independence of requests, although mathematically appealing, is antithetical to the locality of reference present in most practical situations.

In our case, the first issue can be finessed: we might assume that the most popular k URLs discovered in a previous crawl are pretty much the k most popular URLs in the current crawl. (There are also efficient techniques for discovering the most popular items in a stream of data [18, 1, 11]. Therefore, an on-line approach might work as well.)
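Before continuing with STATIC, here is the promised minimal sketch of the CLOCK policy of Section 3.4. The data layout (parallel arrays plus a dictionary for lookup) is an illustrative assumption rather than the implementation evaluated in this paper, and variants of CLOCK differ on whether a newly installed item starts marked.

    class ClockCache:
        """CLOCK replacement: mark bits arranged in a circle with a moving handle."""

        def __init__(self, k):
            self.k = k
            self.slots = [None] * k      # cached items, one per slot
            self.marks = [False] * k     # the mark bits M0 ... Mk
            self.index = {}              # item -> slot, for O(1) lookup
            self.hand = 0                # the clock handle

        def request(self, item):
            """Return True on a hit, False on a miss (after installing the item)."""
            slot = self.index.get(item)
            if slot is not None:
                self.marks[slot] = True  # hit: turn the mark bit on
                return True
            # miss: advance the handle, clearing marks, until an unmarked slot is found
            while self.marks[self.hand]:
                self.marks[self.hand] = False
                self.hand = (self.hand + 1) % self.k
            victim = self.slots[self.hand]
            if victim is not None:
                del self.index[victim]   # evict the item in the unmarked slot
            self.slots[self.hand] = item
            self.index[item] = self.hand
            self.marks[self.hand] = False   # new item starts unmarked in this variant
            self.hand = (self.hand + 1) % self.k
            return False

Only one bit per cached item and a single handle are needed, which is why CLOCK is a cheap approximation of LRU. We now return to the STATIC policy.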
Of course, for simulation purposes we can do a first pass over our input to determine the k most popular URLs, and then preload the cache with these URLs, which is what we did in our experiments.

The second issue above is the very reason we decided to test STATIC: if STATIC performs well, then the conclusion is that there is little locality of reference. If STATIC performs relatively poorly, then we can conclude that our data manifests substantial locality of reference, that is, successive requests are highly correlated.

4. EXPERIMENTAL SETUP

We now describe the experiment we conducted to generate the crawl trace fed into our tests of the various algorithms. We conducted a large web crawl using an instrumented version of the Mercator web crawler [29]. We first describe the Mercator crawler architecture, and then report on our crawl.

4.1 Mercator crawler architecture

A Mercator crawling system consists of a number of crawling processes, usually running on separate machines. Each crawling process is responsible for a subset of all web servers, and consists of a number of worker threads (typically 500) responsible for downloading and processing pages from these servers.

Each worker thread repeatedly performs the following operations: it obtains a URL from the URL Frontier, which is a disk-based data structure maintaining the set of URLs to be downloaded; downloads the corresponding page using HTTP into a buffer (called a RewindInputStream, or RIS for short); and, if the page is an HTML page, extracts all links from the page. The stream of extracted links is converted into absolute URLs and run through the URL Filter, which discards some URLs based on syntactic properties. For example, it discards all URLs belonging to web servers that contacted us and asked not to be crawled.

The URL stream then flows into the Host Splitter, which assigns URLs to crawling processes using a hash of the URL's host name. Since most links are relative, most of the URLs (81.5% in our experiment) will be assigned to the local crawling process; the others are sent in batches via TCP to the appropriate peer crawling processes. Both the stream of local URLs and the stream of URLs received from peer crawlers flow into the Duplicate URL Eliminator (DUE). The DUE discards URLs that have been discovered previously. The new URLs are forwarded to the URL Frontier for future download.

In order to eliminate duplicate URLs, the DUE must maintain the set of all URLs discovered so far. Given that today's web contains several billion valid URLs, the memory requirements to maintain such a set are significant. Mercator can be configured to maintain this set as a distributed in-memory hash table (where each crawling process maintains the subset of URLs assigned to it); however, this DUE implementation (which reduces URLs to 8-byte checksums, and uses the first 3 bytes of the checksum to index into the hash table) requires about 5.2 bytes per URL, meaning that it takes over 5 GB of RAM per crawling machine to maintain a set of 1 billion URLs per machine. These memory requirements are too steep in many settings, and in fact they exceeded the hardware available to us for this experiment. Therefore, we used an alternative DUE implementation that buffers incoming URLs in memory but keeps the bulk of URLs (or rather, their 8-byte checksums) in sorted order on disk.
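The sketch below illustrates this kind of disk-backed duplicate eliminator under simplifying assumptions: a sorted in-memory list stands in for the sorted file of checksums on disk, a plain set serves as the buffer, and the checksum function is a placeholder. It is not Mercator's implementation; the merge it performs when the buffer fills is the expensive operation discussed in the next paragraph.

    import bisect

    class DiskBackedDUE:
        """Duplicate URL eliminator sketch: a small in-memory buffer of checksums
        in front of a large sorted store (standing in for the sorted disk file)."""

        def __init__(self, buffer_capacity, checksum=hash):
            self.buffer = set()               # recently discovered checksums
            self.sorted_store = []            # sorted "disk file" of checksums
            self.buffer_capacity = buffer_capacity
            self.checksum = checksum          # e.g. an 8-byte fingerprint function

        def _in_store(self, c):
            i = bisect.bisect_left(self.sorted_store, c)
            return i < len(self.sorted_store) and self.sorted_store[i] == c

        def is_new(self, url):
            """Return True the first time a URL is seen, False for duplicates."""
            c = self.checksum(url)
            if c in self.buffer or self._in_store(c):
                return False
            self.buffer.add(c)
            if len(self.buffer) >= self.buffer_capacity:
                self._merge()                 # expensive: rewrites the sorted store
            return True

        def _merge(self):
            # In Mercator this is a merge of the buffer into the sorted file on
            # disk; here we simply rebuild the sorted store in memory.
            self.sorted_store = sorted(set(self.sorted_store) | self.buffer)
            self.buffer.clear()

Putting a URL cache in front of is_new lets most duplicate URLs be discarded without ever entering the buffer, which is the effect on merge frequency quantified in the following discussion.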
Whenever the in-memory buffer fills up, it is merged into the disk file (which is a very expensive operation due to disk latency), and newly discovered URLs are passed on to the Frontier.

Both the disk-based DUE and the Host Splitter benefit from URL caching. Adding a cache to the disk-based DUE makes it possible to discard incoming URLs that hit in the cache (and thus are duplicates) instead of adding them to the in-memory buffer. As a result, the in-memory buffer fills more slowly and is merged less frequently into the disk file, thereby reducing the penalty imposed by disk latency. Adding a cache to the Host Splitter makes it possible to discard incoming duplicate URLs instead of sending them to the peer node, thereby reducing the amount of network traffic. This reduction is particularly important in a scenario where the individual crawling machines are not connected via a high-speed LAN (as they were in our experiment), but are instead globally distributed. In such a setting, each crawler would be responsible for web servers "close to it".

Mercator performs an approximation of a breadth-first search traversal of the web graph. Each of the (typically 500) threads in each process operates in parallel, which introduces a certain amount of non-determinism to the traversal. More importantly, the scheduling of downloads is moderated by Mercator's politeness policy, which limits the load placed by the crawler on any particular web server. Mercator's politeness policy guarantees that no server ever receives multiple requests from Mercator in parallel; in addition, it guarantees that the next request to a server will only be issued after a multiple (typically 10×) of the time it took to answer the previous request has passed. Such a politeness policy is essential to any large-scale web crawler; otherwise the crawler's operator becomes inundated with complaints.

4.2 Our web crawl

Our crawling hardware consisted of four Compaq XP1000 workstations, each one equipped with a 667 MHz Alpha processor, 1.5 GB of RAM, 144 GB of disk, and a 100 Mbit/sec Ethernet connection. The machines were located at the Palo Alto Internet Exchange, quite close to the Internet's backbone.

The crawl ran from July 12 until September 3, 2002, although it was actively crawling only for 33 days: the downtimes were due to various hardware and network failures. During the crawl, the four machines performed 1.04 billion download attempts, 784 million of which resulted in successful downloads. 429 million of the successfully downloaded documents were HTML pages. These pages contained about 26.83 billion links, equivalent to an average of 62.55 links per page; however, the median number of links per page was only 23, suggesting that the average is inflated by some pages with a very high number of links.

Earlier studies reported an average of only 8 links [9] or 17 links [33] per page. We offer three explanations as to why we found more links per page. First, we configured Mercator not to limit itself to URLs found in anchor tags, but rather to extract URLs from all tags that may contain them (e.g. image tags). This configuration increases both the mean and the median number of links per page. Second, we configured it to download pages up to 16 MB in size (a setting that is significantly higher than usual), making it possible to encounter pages with tens of thousands of links. Third, most studies report the number of unique links per page, whereas the numbers above include duplicate copies of a link on a page.
If we only consider unique links per page, then the average number of links is 42.74 and the median is 17.

The links extracted from these HTML pages, plus about 38 million HTTP redirections that were encountered during the crawl, flowed into the Host Splitter. In order to test the effectiveness of various caching algorithms, we instrumented Mercator's Host Splitter component to log all incoming URLs to disk. The Host Splitters on the four crawlers received and logged a total of 26.86 billion URLs.

After completion of the crawl, we condensed the Host Splitter logs. We hashed each URL to a 64-bit fingerprint [32, 8]. Fingerprinting is a probabilistic technique; there is a small chance that two URLs have the same fingerprint. We made sure there were no such unintentional collisions by sorting the original URL logs and counting the number of unique URLs. We then compared this number to the number of unique fingerprints, which we determined using an in-memory hash table on a very-large-memory machine. This data reduction step left us with four condensed Host Splitter logs (one per crawling machine), ranging from 51 GB to 57 GB in size and containing between 6.4 and 7.1 billion URLs.

In order to explore the effectiveness of caching with respect to inter-process communication in a distributed crawler, we also extracted a sub-trace of the Host Splitter logs that contained only those URLs that were sent to peer crawlers. These logs contained 4.92 billion URLs, or about 19.5% of all URLs. We condensed the sub-trace logs in the same fashion. We then used the condensed logs for our simulations.

5. SIMULATION RESULTS

We studied the effects of caching with respect to two streams of URLs:

1. A trace of all URLs extracted from the pages assigned to a particular machine. We refer to this as the full trace.

2. A trace of all URLs extracted from the pages assigned to a particular machine that were sent to one of the other machines for processing. We refer to this trace as the cross subtrace, since it is a subset of the full trace.

The reason for exploring both these choices is that, depending on other architectural decisions, it might make sense to cache only the URLs to be sent to other machines, or to use a separate cache just for this purpose.

We fed each trace into implementations of each of the caching algorithms described above, configured with a wide range of cache sizes. We performed about 1,800 such experiments. We first describe the algorithm implementations, and then present our simulation results.

5.1 Algorithm implementations

The implementation of each algorithm is straightforward. We use a hash table to find each item in the cache. We also keep a separate data structure of the cache items, so that we can choose one for eviction. For RANDOM, this data structure is simply a list. For CLOCK, it is a list and a clock handle, and the items also contain "mark" bits. For LRU, it is a heap, organized by last access time. STATIC needs no extra data structure, since it never evicts items. MIN is more complicated, since for each item in the cache MIN needs to know when the next request for that item will be. We therefore describe MIN in more detail.

Let A be the trace or sequence of requests, that is, At is the item requested at time t. We create a second sequence Nt containing the time when At next appears in A. If there is no further request for At after time t, we set Nt = ∞.
Formally, Nt = min{ t' > t : At' = At }, and Nt = ∞ if no such t' exists.

To generate the sequence Nt, we read the trace A backwards, that is, from tmax down to 0, and use a hash table with key At and value t. For each item At, we probe the hash table. If it is not found, we set Nt = ∞ and store (At, t) in the table. If it is found, we retrieve (At, t'), set Nt = t', and replace (At, t') by (At, t) in the hash table.

Given Nt, implementing MIN is easy: we read At and Nt in parallel, and hence for each item requested, we know when it will be requested next. We tag each item in the cache with the time when it will be requested next, and if necessary evict the item with the highest value for its next request, using a heap to identify it quickly. A minimal sketch of this two-pass construction is given below.

5.2 Results

We present the results for only one crawling host. The results for the other three hosts are quasi-identical. Figure 2 shows the miss rate over the entire trace (that is, the percentage of misses out of all requests to the cache) as a function of the size of the cache.
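To make the MIN construction of Section 5.1 concrete, the following minimal sketch builds Nt by a backward pass with a hash table and then simulates clairvoyant eviction with a heap keyed by next-request time. The names and the in-memory representation of the trace are illustrative assumptions, not the authors' code.

    import heapq
    import math

    def next_use_times(trace):
        """Backward pass: N[t] is the next time trace[t] is requested, or infinity."""
        N = [math.inf] * len(trace)
        last_seen = {}                       # item -> earliest later time it was seen
        for t in range(len(trace) - 1, -1, -1):
            item = trace[t]
            N[t] = last_seen.get(item, math.inf)
            last_seen[item] = t
        return N

    def min_misses(trace, k):
        """Clairvoyant (MIN) policy: evict the cached item whose next request
        is farthest away in time; return the number of misses."""
        N = next_use_times(trace)
        cache = {}                           # item -> time of its next request
        heap = []                            # (-next_request_time, item): max-heap via negation
        misses = 0
        for t, item in enumerate(trace):
            if item in cache:
                cache[item] = N[t]           # hit: update the item's next-request time
            else:
                misses += 1
                if len(cache) >= k:
                    # pop until the heap entry matches the cache (stale entries are skipped)
                    while True:
                        neg_time, victim = heapq.heappop(heap)
                        if victim in cache and cache[victim] == -neg_time:
                            del cache[victim]
                            break
                cache[item] = N[t]
            heapq.heappush(heap, (-N[t], item))
        return misses

Reading At and Nt "in parallel", as described above, corresponds here to indexing N[t] inside the forward loop, and the heap plays the role of the tags used to identify the item whose next request is farthest away.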
Crawling the Web

Gautam Pant1, Padmini Srinivasan1,2, and Filippo Menczer3

1 Department of Management Sciences
2 School of Library and Information Science
The University of Iowa, Iowa City, IA 52242, USA
email: gautam-pant,padmini-srinivasan@
3 School of Informatics
Indiana University, Bloomington, IN 47408, USA
email: fil@

Summary. The large size and the dynamic nature of the Web highlight the need for continuous support and updating of Web based information retrieval systems. Crawlers facilitate the process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate "focus" within their crawlers to harvest application or topic specific collections. We discuss the basic issues related with developing a crawling infrastructure. This is followed by a review of several topical crawling algorithms, and evaluation metrics that may be used to judge their performance. While many innovative applications of Web crawling are still being invented, we take a brief look at some developed in the past.

1 Introduction

Web crawlers are programs that exploit the graph structure of the Web to move from page to page. In their infancy such programs were also called wanderers, robots, spiders, fish, and worms, words that are quite evocative of Web imagery. It may be observed that the noun 'crawler' is not indicative of the speed of these programs, as they can be considerably fast. In our own experience, we have been able to crawl up to tens of thousands of pages within a few minutes (on a Pentium 4 workstation with an Internet2 connection).

From the beginning, a key motivation for designing Web crawlers has been to retrieve Web pages and add them or their representations to a local repository. Such a repository may then serve particular application needs such as those of a Web search engine. In its simplest form a crawler starts from a seed page and then uses the external links within it to attend to other pages.
The process repeats with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher level objective is reached. Behind this simple description lies a host of issues related to network connections, spider traps, canonicalizing URLs, parsing HTML pages, and the ethics of dealing with remote Web servers. In fact a current generation Web crawler can be one of the most sophisticated yet fragile parts [5] of the application in which it is embedded.

Were the Web a static collection of pages we would have little long term use for crawling. Once all the pages had been fetched to a repository (like a search engine's database), there would be no further need for crawling. However, the Web is a dynamic entity with subspaces evolving at differing and often rapid rates. Hence there is a continual need for crawlers to help applications stay current as new pages are added and old ones are deleted, moved or modified.

General purpose search engines serving as entry points to Web pages strive for coverage that is as broad as possible. They use Web crawlers to maintain their index databases [3], amortizing the cost of crawling and indexing over the millions of queries received by them. These crawlers are blind and exhaustive in their approach, with comprehensiveness as their major goal. In contrast, crawlers can be selective about the pages they fetch and are then referred to as preferential or heuristic-based crawlers [10, 6]. These may be used for building focused repositories, automating resource discovery, and facilitating software agents. There is a vast literature on preferential crawling applications, including [15, 9, 31, 20, 26, 3]. Preferential crawlers built to retrieve pages within a certain topic are called topical or focused crawlers. Synergism between search engines and topical crawlers is certainly possible, with the latter taking on the specialized responsibility of identifying subspaces relevant to particular communities of users. Techniques for preferential crawling that focus on improving the "freshness" of a search engine have also been suggested [3].

Although a significant portion of this chapter is devoted to description of crawlers in general, the overall slant, particularly in the latter sections, is towards topical crawlers. There are several dimensions about topical crawlers that make them an exciting object of study. One key question that has motivated much research is: How is crawler selectivity to be achieved? Rich contextual aspects such as the goals of the parent application, lexical signals within the Web pages, and also features of the graph built from pages already seen are all reasonable kinds of evidence to exploit. Additionally, crawlers can and often do differ in their mechanisms for using the evidence available to them.

A second major aspect that is important to consider when studying crawlers, especially topical crawlers, is the nature of the crawl task. Crawl characteristics such as queries and/or keywords provided as input criteria to the crawler, user-profiles, and desired properties of the pages to be fetched (similar pages, popular pages, authoritative pages etc.) can lead to significant differences in crawler design and implementation. The task could be constrained by parameters like the maximum number of pages to be fetched (long crawls vs. short crawls) or the available memory. Hence, a crawling task can be viewed as a constrained multi-objective search problem.
However, the wide variety of objective functions, coupled with the lack of appropriate knowledge about the search space, make the problem a hard one. Furthermore, a crawler may have to deal with optimization issues such as local vs. global optima [28].

The last key dimension is regarding crawler evaluation strategies necessary to make comparisons and determine circumstances under which one or the other crawlers work best. Comparisons must be fair and made with an eye towards drawing out statistically significant differences. Not only does this require a sufficient number of crawl runs, but also sound methodologies that consider the temporal nature of crawler outputs. Significant challenges in evaluation include the general unavailability of relevant sets for particular topics or queries. Thus evaluation typically relies on defining measures for estimating page importance.

The first part of this chapter presents a crawling infrastructure and within this describes the basic concepts in Web crawling. Following this, we review a number of crawling algorithms that are suggested in the literature. We then discuss current methods to evaluate and compare performance of different crawlers. Finally, we outline the use of Web crawlers in some applications.

2 Building a Crawling Infrastructure

Figure 1 shows the flow of a basic sequential crawler (in section 2.6 we consider multi-threaded crawlers). The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs which may be provided by a user or another program. Each crawling loop involves picking the next URL to crawl from the frontier, fetching the page corresponding to the URL through HTTP, parsing the retrieved page to extract the URLs and application specific information, and finally adding the unvisited URLs to the frontier. Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the page corresponding to the URL. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead-end for the crawler. The crawler has no new page to fetch and hence it stops.

Fig. 1. Flow of a basic sequential crawler (crawling loop).

Crawling can be viewed as a graph search problem. The Web is seen as a large graph with pages at its nodes and hyperlinks as its edges. A crawler starts at a few of the nodes (seeds) and then follows the edges to reach other nodes. The process of fetching a page and extracting the links within it is analogous to expanding a node in graph search. A topical crawler tries to follow edges that are expected to lead to portions of the graph that are relevant to a topic.
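To make the crawling loop of Figure 1 concrete, here is a minimal sketch in Java of a sequential, breadth-first version of that loop. The fetchPage and extractUrls helpers are hypothetical placeholders for the fetching and parsing components described in the remainder of this section, and the frontier is reduced to a simple queue with a seen-set.

import java.util.*;

// A minimal sketch of the basic sequential crawling loop of Figure 1.
// fetchPage and extractUrls are hypothetical stubs for the components
// described in sections 2.3 and 2.4.
public class SequentialCrawler {

    public static void crawl(List<String> seeds, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(seeds); // unvisited URLs (FIFO: breadth-first)
        Set<String> seen = new HashSet<>(seeds);          // URLs already added to the frontier
        int fetched = 0;

        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.removeFirst();          // pick the next URL to crawl
            String page = fetchPage(url);                 // fetch the page through HTTP
            if (page == null) continue;                   // fetch failed; move on
            fetched++;
            for (String link : extractUrls(page, url)) {  // parse out linked URLs
                if (seen.add(link)) {                     // keep only URLs not seen before
                    frontier.addLast(link);
                }
            }
        }
    }

    static String fetchPage(String url) { return null; }  // see section 2.3
    static List<String> extractUrls(String page, String baseUrl) {
        return Collections.emptyList();                   // see section 2.4
    }
}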
2.1 Frontier

The frontier is the to-do list of a crawler that contains the URLs of unvisited pages. In graph search terminology the frontier is an open list of unexpanded (unvisited) nodes. Although it may be necessary to store the frontier on disk for large scale crawlers, we will represent the frontier as an in-memory data structure for simplicity. Based on the available memory, one can decide the maximum size of the frontier. Due to the large amount of memory available on PCs today, a frontier size of 100,000 URLs or more is not exceptional. Given a maximum frontier size we need a mechanism to decide which URLs to ignore when this limit is reached. Note that the frontier can fill rather quickly as pages are crawled. One can expect around 60,000 URLs in the frontier with a crawl of 10,000 pages, assuming an average of about 7 links per page [30].

The frontier may be implemented as a FIFO queue in which case we have a breadth-first crawler that can be used to blindly crawl the Web. The URL to crawl next comes from the head of the queue and the new URLs are added to the tail of the queue. Due to the limited size of the frontier, we need to make sure that we do not add duplicate URLs into the frontier. A linear search to find out if a newly extracted URL is already in the frontier is costly. One solution is to allocate some amount of available memory to maintain a separate hash-table (with URL as key) to store each of the frontier URLs for fast lookup. The hash-table must be kept synchronized with the actual frontier. A more time consuming alternative is to maintain the frontier itself as a hash-table (again with URL as key). This would provide fast lookup for avoiding duplicate URLs. However, each time the crawler needs a URL to crawl, it would need to search and pick the URL with the earliest time-stamp (time when a URL was added to the frontier). If memory is less of an issue than speed, the first solution may be preferred. Once the frontier reaches its maximum size, the breadth-first crawler can add only one unvisited URL from each new page crawled.

If the frontier is implemented as a priority queue we have a preferential crawler which is also known as a best-first crawler. The priority queue may be a dynamic array that is always kept sorted by the estimated score of unvisited URLs. At each step, the best URL is picked from the head of the queue. Once the corresponding page is fetched, the URLs are extracted from it and scored based on some heuristic. They are then added to the frontier in such a manner that the order of the priority queue is maintained. We can avoid duplicate URLs in the frontier by keeping a separate hash-table for lookup. Once the frontier's maximum size (MAX) is exceeded, only the best MAX URLs are kept in the frontier. If the crawler finds the frontier empty when it needs the next URL to crawl, the crawling process comes to a halt. With a large value of MAX and several seed URLs the frontier will rarely reach the empty state.
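To illustrate the best-first variant, the sketch below implements a bounded frontier as a score-ordered structure together with a hash set for duplicate lookup. The class and method names are illustrative and not taken from any particular crawler.

import java.util.*;

// Sketch of a bounded best-first frontier: URLs ordered by estimated score,
// with a separate hash set for fast duplicate detection.
public class BestFirstFrontier {
    private final int max;                                   // maximum frontier size (MAX)
    private final TreeMap<Double, Deque<String>> byScore =   // score -> URLs with that score, best first
            new TreeMap<>(Collections.reverseOrder());
    private final Set<String> inFrontier = new HashSet<>();  // hash-table for duplicate lookup
    private int size = 0;

    public BestFirstFrontier(int max) { this.max = max; }

    // Add a scored unvisited URL; once MAX is exceeded, drop the worst URL.
    public void add(String url, double score) {
        if (!inFrontier.add(url)) return;                    // duplicate: already in the frontier
        byScore.computeIfAbsent(score, s -> new ArrayDeque<>()).addLast(url);
        size++;
        if (size > max) {
            Map.Entry<Double, Deque<String>> worst = byScore.lastEntry();
            inFrontier.remove(worst.getValue().removeLast());
            if (worst.getValue().isEmpty()) byScore.remove(worst.getKey());
            size--;
        }
    }

    // Pick the best URL, or null if the frontier is empty (the crawl then halts).
    public String next() {
        if (size == 0) return null;
        Map.Entry<Double, Deque<String>> best = byScore.firstEntry();
        String url = best.getValue().removeFirst();
        if (best.getValue().isEmpty()) byScore.remove(best.getKey());
        inFrontier.remove(url);
        size--;
        return url;
    }
}

A score-keyed TreeMap is used here instead of java.util.PriorityQueue because the latter gives no cheap way to discard the worst entry when the MAX limit is exceeded.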
At times, a crawler may encounter a spider trap that leads it to a large number of different URLs that refer to the same page. One way to alleviate this problem is by limiting the number of pages that the crawler accesses from a given domain. The code associated with the frontier can make sure that every consecutive sequence of k (say 100) URLs picked by the crawler contains only one URL from a given fully qualified host name. As side-effects, the crawler is polite by not accessing the same Web site too often [14], and the crawled pages tend to be more diverse.

2.2 History and Page Repository

The crawl history is a time-stamped list of URLs that were fetched by the crawler. In effect, it shows the path of the crawler through the Web starting from the seed pages. A URL entry is made into the history only after fetching the corresponding page. This history may be used for post crawl analysis and evaluations. For example, we can associate a value with each page on the crawl path and identify significant events (such as the discovery of an excellent resource). While history may be stored occasionally to the disk, it is also maintained as an in-memory data structure. This provides for a fast lookup to check whether a page has been crawled or not. This check is important to avoid revisiting pages and also to avoid adding the URLs of crawled pages to the limited size frontier. For the same reasons it is important to canonicalize URLs (section 2.4) before adding them to the history.

Once a page is fetched, it may be stored/indexed for the master application (such as a search engine). In its simplest form a page repository may store the crawled pages as separate files. In that case, each page must map to a unique file name. One way to do this is to map each page's URL to a compact string using some form of hashing function with low probability of collisions (for uniqueness of file names). The resulting hash value is used as the file name. We use the MD5 one-way hashing function that provides a 128 bit hash code for each URL. Implementations of MD5 and other hashing algorithms are readily available in different programming languages (e.g., refer to the Java 2 security framework). The 128 bit hash value is then converted into a 32 character hexadecimal equivalent to get the file name. For example, the content of one crawled URL is stored into a file named 160766577426e1d01fcb7735091ec584. This way we have fixed-length file names for URLs of arbitrary size. (Of course, if the application needs to cache only a few thousand pages, one may use a simpler hashing mechanism.)
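The URL-to-file-name mapping can be obtained with the MD5 implementation available through Java's MessageDigest class; the small helper below is one way to do it. The URL used in the main method is a made-up example.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Map a URL of arbitrary length to a fixed 32 character hexadecimal file name
// by taking the MD5 hash of the URL string.
public class UrlFileName {
    public static String fileNameFor(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));   // 128 bit hash
            // Convert the 16 digest bytes to a zero-padded 32 character hex string.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);            // MD5 is required of every JDK
        }
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("http://www.example.com/"));             // hypothetical URL
    }
}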
The page repository can also be used to check if a URL has been crawled before by converting it to its 32 character file name and checking for the existence of that file in the repository. In some cases this may render unnecessary the use of an in-memory history data structure.

2.3 Fetching

In order to fetch a Web page, we need an HTTP client which sends an HTTP request for a page and reads the response. The client needs to have timeouts to make sure that an unnecessary amount of time is not spent on slow servers or in reading large pages. In fact we may typically restrict the client to download only the first 10-20KB of the page. The client needs to parse the response headers for status codes and redirections. We may also like to parse and store the last-modified header to determine the age of the document. Error-checking and exception handling is important during the page fetching process since we need to deal with millions of remote servers using the same code. In addition it may be beneficial to collect statistics on timeouts and status codes for identifying problems or automatically changing timeout values. Modern programming languages such as Java and Perl provide very simple and often multiple programmatic interfaces for fetching pages from the Web. However, one must be careful in using high level interfaces where it may be harder to find lower level problems. For example, with Java one may want to use the java.net.Socket class to send HTTP requests instead of using the more ready-made java.net.HttpURLConnection class.

No discussion about crawling pages from the Web can be complete without talking about the Robot Exclusion Protocol. This protocol provides a mechanism for Web server administrators to communicate their file access policies; more specifically to identify files that may not be accessed by a crawler. This is done by keeping a file named robots.txt under the root directory of the Web server (i.e., at the path /robots.txt on that server). This file provides access policy for different User-agents (robots or crawlers). A User-agent value of '*' denotes a default policy for any crawler that does not match other User-agent values in the file. A number of Disallow entries may be provided for a User-agent. Any URL that starts with the value of a Disallow field must not be retrieved by a crawler matching the User-agent. When a crawler wants to retrieve a page from a Web server, it must first fetch the appropriate robots.txt file and make sure that the URL to be fetched is not disallowed. More details on this exclusion protocol can be found at /wc/norobots.html. It is efficient to cache the access policies of a number of servers recently visited by the crawler. This would avoid accessing a robots.txt file each time you need to fetch a URL. However, one must make sure that cache entries remain sufficiently fresh.
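As a rough illustration of the fetching step, the sketch below uses java.net.HttpURLConnection with connect and read timeouts and downloads only the first 20KB of each page. It assumes the robots.txt check described above has already been performed, and it glosses over redirection handling and character-set detection; the User-Agent string is a placeholder.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of a page fetcher with timeouts and a 20KB download limit.
// Assumes the robots.txt policy for the host has already been checked.
public class Fetcher {
    private static final int MAX_BYTES = 20 * 1024;    // download only the first ~20KB

    public static String fetch(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5000);               // do not wait too long for slow servers
            conn.setReadTimeout(5000);
            conn.setRequestProperty("User-Agent", "example-crawler");  // placeholder identity
            int status = conn.getResponseCode();        // status codes and redirects should be examined
            if (status != HttpURLConnection.HTTP_OK) return null;
            long lastModified = conn.getLastModified(); // may be stored to estimate document age

            byte[] buf = new byte[4096];
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (InputStream in = conn.getInputStream()) {
                int n;
                while ((n = in.read(buf)) != -1 && out.size() < MAX_BYTES) {
                    out.write(buf, 0, n);               // stop after roughly MAX_BYTES
                }
            }
            return out.toString(StandardCharsets.UTF_8.name());   // assume UTF-8 for simplicity
        } catch (Exception e) {
            return null;                                // timeouts and errors are common; log and move on
        }
    }
}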
2.4 Parsing

Once a page has been fetched, we need to parse its content to extract information that will feed and possibly guide the future path of the crawler. Parsing may imply simple hyperlink/URL extraction or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree (see section 2.5). Parsing might also involve steps to convert the extracted URL to a canonical form, remove stopwords from the page's content and stem the remaining words. These components of parsing are described next.

URL Extraction and Canonicalization

HTML parsers are freely available for many different languages. They provide the functionality to easily identify HTML tags and associated attribute-value pairs in a given HTML document. In order to extract hyperlink URLs from a Web page, we can use these parsers to find anchor tags and grab the values of associated href attributes. However, we do need to convert any relative URLs to absolute URLs using the base URL of the page from where they were retrieved.

Different URLs that correspond to the same Web page can be mapped onto a single canonical form. This is important in order to avoid fetching the same page many times. Here are some of the steps used in typical URL canonicalization procedures (a code sketch applying several of these rules appears at the end of this subsection):

• convert the protocol and hostname to lowercase; for example, a URL beginning with HTTP:// and an uppercase hostname is converted to its all-lowercase equivalent.
• remove the 'anchor' or 'reference' part of the URL; hence, a URL ending in faq.html#what is reduced to the same URL ending in faq.html.
• perform URL encoding for some commonly used characters such as '~'. This would prevent the crawler from treating a path containing ~pant/ as a different URL from the same path containing %7Epant/.
• for some URLs, add trailing '/'s; a URL with and without a trailing '/' must map to the same canonical form. The decision to add a trailing '/' will require heuristics in many cases.
• use heuristics to recognize default Web pages. File names such as index.html or index.htm may be removed from the URL with the assumption that they are the default files. If that is true, they would be retrieved by simply using the base URL.
• remove '..' and its parent directory from the URL path. Therefore, the URL path /%7Epant/BizIntel/Seeds/../ODPSeeds.dat is reduced to /%7Epant/BizIntel/ODPSeeds.dat.
• leave the port numbers in the URL unless it is port 80. As an alternative, leave the port numbers in the URL and add port 80 when no port number is specified.

It is important to be consistent while applying canonicalization rules. It is possible that two seemingly opposite rules work equally well (such as that for port numbers) as long as you apply them consistently across URLs. Other canonicalization rules may be applied based on the application and prior knowledge about some sites (e.g., known mirrors).

As noted earlier, spider traps pose a serious problem for a crawler. The "dummy" URLs created by spider traps often become increasingly larger in size. A way to tackle such traps is by limiting the URL sizes to, say, 128 or 256 characters.

Stoplisting and Stemming

When parsing a Web page to extract content information or in order to score new URLs suggested by the page, it is often helpful to remove commonly used words or stopwords such as "it" and "can". This process of removing stopwords from text is called stoplisting. Note that the Dialog system recognizes no more than nine words ("an", "and", "by", "for", "from", "of", "the", "to", and "with") as the stopwords. In addition to stoplisting, one may also stem the words found in the page. The stemming process normalizes words by conflating a number of morphologically similar words to a single root form or stem. For example, "connect," "connected" and "connection" are all reduced to "connect." Implementations of the commonly used Porter stemming algorithm [29] are easily available in many programming languages. One of the authors has experienced cases in the bio-medical domain where stemming reduced the precision of the crawling results.
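The code sketch referred to above applies several of the canonicalization rules using the standard java.net.URI class. It is one reasonable interpretation of those rules rather than the exact procedure of any particular crawler, and it leaves out the '~' re-encoding rule.

import java.net.URI;

// Sketch of a URL canonicalizer: lowercase the scheme and host, drop the
// fragment ('anchor'), drop the default HTTP port, apply the default-page
// heuristic, and normalize '..' segments in the path.
public class Canonicalizer {
    public static String canonicalize(String url) throws Exception {
        URI uri = new URI(url).normalize();                      // resolves '..' and '.' in the path
        String scheme = uri.getScheme().toLowerCase();
        String host = uri.getHost().toLowerCase();
        int port = uri.getPort();
        if (port == 80) port = -1;                               // leave out the default HTTP port
        String path = uri.getPath();
        if (path == null || path.isEmpty()) path = "/";          // add a trailing '/'
        if (path.endsWith("/index.html") || path.endsWith("/index.htm")) {
            path = path.substring(0, path.lastIndexOf('/') + 1); // heuristic: default Web page
        }
        // Rebuild the URL without the fragment part.
        return new URI(scheme, null, host, port, path, uri.getQuery(), null).toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input; prints http://www.example.com/a/faq.html
        System.out.println(canonicalize("HTTP://WWW.Example.COM:80/a/b/../faq.html#what"));
    }
}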
2.5 HTML tag tree

Crawlers may assess the value of a URL or a content word by examining the HTML tag context in which it resides. For this, a crawler may need to utilize the tag tree or DOM structure of the HTML page [8, 24, 27]. Figure 2 shows a tag tree corresponding to an HTML source. The <html> tag forms the root of the tree and various tags and texts form nodes of the tree. Unfortunately, many Web pages contain badly written HTML. For example, a start tag may not have an end tag (it may not be required by the HTML specification), or the tags may not be properly nested. In many cases, the <html> tag or the <body> tag is altogether missing from the HTML page. Thus structure-based criteria often require the prior step of converting a "dirty" HTML document into a well-formed one, a process that is called tidying an HTML page. This includes both insertion of missing tags and the reordering of tags in the page. Tidying an HTML page is necessary for mapping the content of a page onto a tree structure with integrity, where each node has a single parent. Hence, it is an essential precursor to analyzing an HTML page as a tag tree. Note that analyzing the DOM structure is only necessary if the topical crawler intends to use the HTML document structure in a non-trivial manner. For example, if the crawler only needs the links within a page, and the text or portions of the text in the page, one can use simpler HTML parsers. Such parsers are also readily available in many languages.

Fig. 2. An HTML page and the corresponding tag tree. The HTML source of the example page is:

<html>
<head><title>Projects</title></head>
<body>
<h4>Projects</h4>
<ul>
<li> <a href="blink.html">LAMP</a> Linkage analysis with multiple processors.</li>
<li> <a href="nice.html">NICE</a> The network infrastructure for combinatorial exploration.</li>
<li> <a href="amass.html">AMASS</a> A DNA sequence assembly algorithm.</li>
<li> <a href="dali.html">DALI</a> A distributed, adaptive, first-order logic theorem prover.</li>
</ul>
</body>
</html>
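As an example of such a simpler parser, the sketch below uses the open-source jsoup library (one possible choice among many, not one prescribed by this chapter) to build a tag tree from an HTML string, tidy it in the process, and list the hyperlinks with their anchor text; the base URL is a made-up example used to resolve relative links.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: parse an HTML page into a tag tree and list its hyperlinks with anchor text.
// jsoup also repairs badly written HTML while building the tree.
public class LinkExtractor {
    public static void main(String[] args) {
        String html = "<html><body><h4>Projects</h4>"
                + "<ul><li><a href=\"blink.html\">LAMP</a> Linkage analysis.</li></ul></body></html>";
        // The base URL is needed to resolve relative links such as blink.html.
        Document doc = Jsoup.parse(html, "http://www.example.com/projects/");
        for (Element a : doc.select("a[href]")) {        // walk the anchor nodes of the tag tree
            String absolute = a.attr("abs:href");        // relative URL resolved against the base
            String anchorText = a.text();
            System.out.println(absolute + "  (" + anchorText + ")");
        }
    }
}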
2.6 Multi-threaded Crawlers

A sequential crawling loop spends a large amount of time in which either the CPU is idle (during network/disk access) or the network interface is idle (during CPU operations). Multi-threading, where each thread follows a crawling loop, can provide reasonable speed-up and efficient use of available bandwidth. Figure 3 shows a multi-threaded version of the basic crawler in Figure 1. Note that each thread starts by locking the frontier to pick the next URL to crawl. After picking a URL it unlocks the frontier, allowing other threads to access it. The frontier is again locked when new URLs are added to it. The locking steps are necessary in order to synchronize the use of the frontier that is now shared among many crawling loops (threads). The model of multi-threaded crawler in Figure 3 follows a standard parallel computing model [18]. Note that a typical crawler would also maintain a shared history data structure for a fast lookup of URLs that have been crawled. Hence, in addition to the frontier it would also need to synchronize access to the history.

Fig. 3. A multi-threaded crawler model.

The multi-threaded crawler model needs to deal with an empty frontier just like a sequential crawler. However, the issue is less simple now. If a thread finds the frontier empty, it does not automatically mean that the crawler as a whole has reached a dead-end. It is possible that other threads are fetching pages and may add new URLs in the near future. One way to deal with the situation is by sending a thread to a sleep state when it sees an empty frontier. When the thread wakes up, it checks again for URLs. A global monitor keeps track of the number of threads currently sleeping. Only when all the threads are in the sleep state does the crawling process stop. More optimizations can be performed on the multi-threaded model described here, as for instance to decrease contentions between the threads and to streamline network access.
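The coordination just described can be sketched as follows. This is a minimal illustration, in Java, of the locking of the shared frontier and of the sleep-until-all-threads-are-idle termination rule; the fetching and parsing steps are stubbed out and the shared history is reduced to a seen-set.

import java.util.*;

// Sketch of the multi-threaded crawling model: worker threads share a frontier,
// lock it around picks and inserts, sleep when it is empty, and stop only when
// every thread is asleep (the global monitor is the 'sleeping' counter).
public class MultiThreadedCrawler {
    private final Deque<String> frontier = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();  // shared history; accessed only under the lock
    private final int numThreads;
    private int sleeping = 0;                          // number of threads waiting for work
    private boolean done = false;

    public MultiThreadedCrawler(Collection<String> seeds, int numThreads) {
        frontier.addAll(seeds);
        seen.addAll(seeds);
        this.numThreads = numThreads;
    }

    // Pick the next URL under the lock, or return null once every thread is idle.
    private synchronized String nextUrl() throws InterruptedException {
        while (frontier.isEmpty() && !done) {
            sleeping++;
            if (sleeping == numThreads) {      // all threads idle and no URLs left: dead-end
                done = true;
                notifyAll();
            } else {
                wait();                        // sleep until another thread adds new URLs
            }
            sleeping--;
        }
        return done ? null : frontier.removeFirst();
    }

    // Add newly extracted URLs under the lock and wake up sleeping threads.
    private synchronized void addUrls(Collection<String> urls) {
        for (String u : urls) {
            if (seen.add(u)) frontier.addLast(u);
        }
        notifyAll();
    }

    private void crawlLoop() {
        try {
            String url;
            while ((url = nextUrl()) != null) {
                String page = fetch(url);                          // network I/O outside the lock
                if (page != null) addUrls(extractUrls(page, url));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public void start() {
        for (int i = 0; i < numThreads; i++) new Thread(this::crawlLoop).start();
    }

    // Stubs for the components of sections 2.3 and 2.4.
    private String fetch(String url) { return null; }
    private List<String> extractUrls(String page, String base) { return Collections.emptyList(); }
}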
This section has described the general components of a crawler. The common infrastructure supports at one extreme a very simple breadth-first crawler and at the other end crawler algorithms that may involve very complex URL selection mechanisms. Factors such as frontier size, page parsing strategy, crawler history and page repository have been identified as interesting and important dimensions to crawler definitions.

3 Crawling Algorithms

We now discuss a number of crawling algorithms that are suggested in the literature. Note that many of these algorithms are variations of the best-first scheme. The difference is in the heuristics they use to score the unvisited URLs, with some algorithms adapting and tuning their parameters before or during the crawl.

3.1 Naive Best-First Crawler

The naive best-first crawler was one of the crawlers detailed and evaluated by the authors in an extensive study of crawler evaluation [22]. This crawler represents a fetched Web page as a vector of words weighted by occurrence frequency. The crawler then computes the cosine similarity of the page to the query or description provided by the user, and scores the unvisited URLs on the page by this similarity value. The URLs are then added to a frontier that is maintained as a priority queue based on these scores. In the next iteration each crawler thread picks the best URL in the frontier to crawl, and returns with new unvisited URLs that are again inserted in the priority queue after being scored based on the cosine similarity of the parent page. The cosine similarity between the page p and a query q is computed by:

sim(q, p) = \frac{v_q \cdot v_p}{\|v_q\| \, \|v_p\|}   (1)

where v_q and v_p are term frequency (TF) based vector representations of the query and the page respectively, v_q \cdot v_p is the dot (inner) product of the two vectors, and \|v\| is the Euclidean norm of the vector v. More sophisticated vector representations of pages, such as the TF-IDF [32] weighting scheme often used in information retrieval, are problematic in crawling applications because there is no a priori knowledge of the distribution of terms across crawled pages. In a multiple thread implementation the crawler acts like a best-N-first crawler where N is a function of the number of simultaneously running threads. Thus best-N-first is a generalized version of the best-first crawler that picks N best URLs to crawl at a time. In our research we have found the best-N-first crawler (with N = 256) to be a strong competitor [28, 23], showing clear superiority on the retrieval of relevant pages. Note that the best-first crawler keeps the frontier size within its upper bound by retaining only the best URLs based on the assigned similarity scores.
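The scoring step of the naive best-first crawler amounts to building term frequency vectors and applying equation (1). The sketch below does exactly that with a deliberately crude tokenizer, omitting the stoplisting and stemming of section 2.4; the query and page strings in the main method are made-up examples.

import java.util.*;

// Sketch of the naive best-first scoring step: represent the query and the page
// as term-frequency vectors and compute their cosine similarity (equation 1).
public class CosineScorer {

    // Build a term-frequency vector with a very crude tokenizer.
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) v.merge(term, 1, Integer::sum);
        }
        return v;
    }

    // sim(q, p) = (v_q . v_p) / (||v_q|| ||v_p||)
    static double cosine(Map<String, Integer> q, Map<String, Integer> p) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            dot += e.getValue() * p.getOrDefault(e.getKey(), 0);
        }
        double nq = norm(q), np = norm(p);
        return (nq == 0 || np == 0) ? 0 : dot / (nq * np);
    }

    static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int f : v.values()) sum += (double) f * f;
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = tf("web crawling algorithms");
        Map<String, Integer> page = tf("Crawling the Web: a survey of crawling algorithms and systems.");
        System.out.println(cosine(query, page));   // unvisited URLs on the page would receive this score
    }
}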
3.2 SharkSearch

SharkSearch [15] is a version of FishSearch [12] with some improvements. It uses a similarity measure like the one used in the naive best-first crawler for estimating the relevance of an unvisited URL. However, SharkSearch has a more refined notion of potential scores for the links in the crawl frontier. The anchor text, text surrounding the links or link-context, and inherited score from ancestors influence the potential scores of links. The ancestors of a URL are the pages that appeared on the crawl path to the URL. SharkSearch, like its predecessor FishSearch, maintains a depth bound. That is, if the crawler finds unimportant pages on a crawl path it stops crawling further along that path. To be able to track all the information, each URL in the frontier is associated with a depth and a potential score. The depth bound (d) is provided by the user while the potential score of an unvisited URL is computed as:

score(url) = \gamma \cdot inherited(url) + (1 - \gamma) \cdot neighborhood(url)   (2)

where \gamma < 1 is a parameter, the neighborhood score signifies the contextual evidence found on the page that contains the hyperlink URL, and the inherited score is obtained from the scores of the ancestors of the URL. More precisely, the inherited score is computed as:

inherited(url) = \begin{cases} \delta \cdot sim(q, p) & \text{if } sim(q, p) > 0 \\ \delta \cdot inherited(p) & \text{otherwise} \end{cases}   (3)

where \delta < 1 is again a parameter, q is the query, and p is the page from which the URL was extracted.

The neighborhood score uses the anchor text and the text in the "vicinity" of the anchor in an attempt to refine the overall score of the URL by allowing for differentiation between links found within the same page. For that purpose, the SharkSearch crawler assigns an anchor score and a context score to each URL. The anchor score is simply the similarity of the anchor text of the hyperlink containing the URL to the query q, i.e. sim(q, anchor_text). The context score on the other hand broadens the context of the link to include some nearby words. The resulting augmented context, aug_context, is used for computing the context score as follows:

context(url) = \begin{cases} 1 & \text{if } anchor(url) > 0 \\ sim(q, aug\_context) & \text{otherwise} \end{cases}   (4)

Finally we derive the neighborhood score from the anchor score and the context score as:

neighborhood(url) = \beta \cdot anchor(url) + (1 - \beta) \cdot context(url)   (5)
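Read together, equations (2)-(5) specify how the potential score of an unvisited URL is assembled from its inherited, anchor and context scores. The sketch below simply transcribes them; the sim values are assumed to come from the cosine similarity of equation (1), and the values chosen for gamma, delta and beta are purely illustrative.

// Sketch: computing the SharkSearch potential score of an unvisited URL by
// transcribing equations (2)-(5). sim values are computed elsewhere with
// equation (1); GAMMA, DELTA and BETA are tunable parameters < 1.
public class SharkSearchScore {
    static final double GAMMA = 0.8, DELTA = 0.5, BETA = 0.8;   // illustrative values only

    // Equation (3): inherited score passed down from the parent page p.
    static double inherited(double simQueryParent, double inheritedParent) {
        return simQueryParent > 0 ? DELTA * simQueryParent : DELTA * inheritedParent;
    }

    // Equation (4): context score from the anchor's augmented context.
    static double context(double anchorScore, double simQueryAugContext) {
        return anchorScore > 0 ? 1.0 : simQueryAugContext;
    }

    // Equation (5): neighborhood score from the anchor and context scores.
    static double neighborhood(double anchorScore, double contextScore) {
        return BETA * anchorScore + (1 - BETA) * contextScore;
    }

    // Equation (2): overall potential score of the URL in the frontier.
    static double score(double inheritedScore, double neighborhoodScore) {
        return GAMMA * inheritedScore + (1 - GAMMA) * neighborhoodScore;
    }

    public static void main(String[] args) {
        // Hypothetical similarity values for one extracted URL.
        double simQueryParent = 0.3, inheritedParent = 0.2;
        double anchorScore = 0.1, simQueryAugContext = 0.25;

        double inh = inherited(simQueryParent, inheritedParent);
        double ctx = context(anchorScore, simQueryAugContext);
        double nbr = neighborhood(anchorScore, ctx);
        System.out.println(score(inh, nbr));
    }
}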