
A Survey on PageRank Computing

Internet Mathematics Vol. 2, No. 1: 73-120

Pavel Berkhin

Abstract. This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages, independent of their textual content and based solely on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and of the data structures that underlie PageRank computing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking.

We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRank and the role of such indices in applications other than web search. We also discuss link-based search personalization and outline some aspects of PageRank infrastructure, from associated measures of convergence to link preprocessing.

1. Introduction

The tremendous popularity of Internet search engines presents a striking example of a successful fusion of many developments in different areas of computer science and related fields, from statistics to information retrieval to natural language processing. One of the most difficult problems in web search is the ranking of the results recalled in response to a user query. Since contemporary web crawls discover billions of pages, broad topic queries may result in recalling hundreds of thousands of pages containing the query terms. Only a few dozen of the recalled pages are actually presented to the user. Moreover, these results are presented in order of relevance. A variety of ranking features are used by Internet search engines to come up with a good order. Some of the query-independent features are based on page content. Some utilize the hyperlink structure of the web.

The ascent of Google to prominence is to a certain degree associated with the concept of a new relevance feature of the second kind, PageRank. PageRank assigns authority weights to each page of a web graph based on the web hyperlink structure. It is based on the assumption that authoritative pages usually contain links to good pages, thereby endorsing them to a degree inversely proportional to the number of their out-links. This fundamental assumption is not always true in practice, partly because good pages may erroneously link to bad ones, partly because they have a natural tendency to emphasize pages within their own sites, and partly because of the consistent efforts of malicious sites containing spam. While this affects its usefulness, PageRank remains an important relevance feature when combined with other ranking components. Many issues are related to PageRank computing: from filtering, weighting, and compressing the link data, to the most advantageous modification of the underlying model, to finding fast methods of computing its solution.

© A K Peters, Ltd. 1542-7951/05 $0.50 per page

PageRank is based on a random surfer model [Page et al. 98] and may be viewed as a stationary distribution of a Markov chain. The PageRank solution is the principal eigenvector of a linear system that can be found via the power method. We discuss this model and some technical modifications (the teleportation mechanism) needed to safeguard the existence of a solution and the convergence of the iterative process in Section 2.

The sheer size of the World Wide Web presents a challenge to PageRank computing. Different developments to speed up the power method include extrapolation, adaptive, and block structure methods. PageRank can also be reformulated in terms of a linear system. Linear systems have a rich arsenal of fast iterative methods. The effective solution of PageRank depends on a special enumeration of its nodes corresponding to a DAG structure. We discuss effective procedures for PageRank computing and its stability with respect to changes in the web graph and in the teleportation coefficient in Section 3.

Simultaneously with the random surfer model, a different but closely related approach, the HITS algorithm [Kleinberg 99], was invented. This algorithm results in two authority vectors, while PageRank results in one. More importantly, HITS is designed to work at query time, while PageRank is precomputed in advance of its utilization in a search. HITS, its modifications, and other later models leading to a concept of authority weights defined on a directed graph, such as, for example, SALSA, as well as the use of such weights in applications other than web search relevance ranking, are surveyed in Section 4.

The PageRank vector serves as only one of the many relevance features for ranking web search results. Generic PageRank corresponds to a uniform teleportation. In principle, PageRanks with different teleportations can be used. A nonuniform teleportation makes sense, since it leads to a topical or user-specific personalized PageRank [Page et al. 98, Haveliwala 02b]. Computing many such PageRanks is difficult, but in some cases it may be significantly optimized. Recent results in personalized PageRank are presented in Section 5.

Remarkable efforts have been made to facilitate the PageRank computing infrastructure. They range from special means to estimate PageRank convergence, to web graph analysis, to special data compression techniques. We present them in Section 6.

2. Definitions and Fundamentals

In this section we formally define PageRank and consider the theoretical foundations for its existence and for the convergence of the power iteration.

2.1. Definitions

The PageRank vector (PageRank citation ranking weight) was introduced in [Page et al. 98] and [Brin and Page 98]. It assigns to pages authority scores that are higher for pages with many in-links from authoritative pages with relatively few out-links.

Let W(G, L) be a directed graph with vertices (or nodes) G that are web HTML pages and directed edges L that are hyperlinks. The edges L can be associated with an n × n sparse adjacency matrix that we denote by the same symbol L, where n = |G| is the number of web pages. An element L_ij is equal to 1 if a page i has a link i → j to a page j and is equal to 0 otherwise. The out-degree of a node i is the number of outgoing links, deg(i) = Σ_j L_ij. All the definitions can be generalized to a weighted directed graph, in which case L_ij ≥ 0. A transition matrix is defined as P_ij = L_ij / deg(i) when deg(i) > 0. For such i it is row-stochastic, meaning that the i-row elements sum to one. We set P_ij = 0 when deg(i) = 0.

Consider the following random surfer model. A surfer travels along the directed graph. If at a step k he is located at a page i, then at the next step k + 1 he moves uniformly at random to any out-neighbor j of i. Obviously, pages can be considered as n states of a Markov chain with a transition matrix P. Given a distribution p(k) = (p(k)_i) of probabilities for a surfer to be at a page i at a step k, the probability of being at a page j at the next step is

    p(k+1)_j = Σ_{i→j} p(k)_i / deg(i) = Σ_i P_ij p(k)_i,   or   p(k+1) = P^T p(k).   (2.1)

(If all pages have out-links, p(k+1) is indeed a probability distribution.) An equilibrium state of the transformation (2.1) correctly reflects our modeling objectives. The more incoming links a page has, the higher its importance. The importance of a page j is also proportional to the importance of its in-neighbors and inversely proportional to the neighbors' out-degrees. This motivates the following definition.

Definition 2.1. A PageRank vector is a stationary point of the transformation (2.1) with nonnegative components (a stationary distribution for a Markov chain),

    p = Ap,   A = P^T.   (2.2)

Since we are interested in a probability distribution, the sum of the p-components is assumed to be one. We use the L1 norm ||x|| = Σ |x_i| as our standard norm.

This definition has a problem. The matrix P has zero rows for dangling pages (pages that have no outgoing links), since for them deg(i) = 0, while a Markov chain transition matrix has to be row-stochastic. Dangling pages serve as sinks or attractors: some p-volume comes to them but then cannot leave. Different remedies have been explored. One approach [Page et al. 98, Section 2.7] suggests getting rid of dangling pages. The rationale is that these pages do not influence any other pages. Notice that deleting dangling pages generates new dangling pages. Another alternative considered in that paper is to renormalize P^T p(k+1) so that it remains a probability distribution, to compensate for sinks. The authors of [Kamvar et al. 03b] consider adding the deleted dangling pages back on final iterations. Still another option, advocated by [Jeh and Widom 02b], is to add a self-link to each dangling page and, after computation, to adjust the constructed PageRank. Another approach [Bianchini et al. 03] suggests introducing an ideal page (sink) with a self-link, to which each dangling page links. Along a similar line, see
[Eiron et al. 04]. The most widely used device [Page et al. 98, Haveliwala et al. 03b, Kamvar et al. 03c, Langville and Meyer 05] modifies the matrix P by adding artificial links that uniformly connect dangling pages to all G pages. Let v be a distribution, for example, v = e, e = e_n = (1/n, ..., 1/n), and let d be a dangling page indicator, d_i = δ(deg(i), 0). Here, as usual, δ(a, b) = 1 when a = b and is zero otherwise. Consider the matrix

    P′ = P + d · v^T.   (2.3)

Rows corresponding to dangling pages are changed to v. The trick is popular since, as we will see shortly, it is also useful in the general situation of nondangling pages and is easy to compute. Aside from any particular approach, the decisive fact is that there are many dangling pages to deal with. (In reality, on a Yahoo! web graph from data collected in August 2004 and consisting of billions of pages, more than 60% of pages have not been crawled, and therefore these pages are dangling for sure. Around 14% of actually crawled pages are also dangling. In a corresponding host graph 62% of the pages are dangling pages.)
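As a concrete illustration of Equation (2.3), the following sketch (our own toy example; the function names and the four-page graph are ours, not the survey's) builds the row-stochastic matrix P′ by patching dangling rows with the teleportation distribution v:

```python
# Hypothetical sketch: build P from adjacency lists, then P' = P + d * v^T,
# i.e., replace each dangling (all-zero) row of P by the distribution v.
def transition_matrix(out_links, n):
    """P[i][j] = 1/deg(i) for each link i -> j; dangling rows stay all zero."""
    P = [[0.0] * n for _ in range(n)]
    for i, targets in out_links.items():
        for j in targets:
            P[i][j] = 1.0 / len(targets)
    return P

def patch_dangling(P, v):
    """Equation (2.3): rows of dangling pages are changed to v."""
    return [row[:] if any(row) else v[:] for row in P]

n = 4
out_links = {0: [1, 2], 1: [2], 2: [0]}   # page 3 is dangling
v = [1.0 / n] * n                         # uniform teleportation distribution
P1 = patch_dangling(transition_matrix(out_links, n), v)
for row in P1:
    assert abs(sum(row) - 1.0) < 1e-12    # P' is row-stochastic
```

The dense representation is only for exposition; in practice P is kept sparse and P′ is never materialized, as discussed around Equation (2.5) below.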

When does the stationary point in (2.2) exist, and when does the process in (2.1), called the power method or power iteration, converge to the solution? For a nonnegative row-stochastic matrix P, independent of the initial distribution p(0) (elements are nonnegative and sum to one), the power method converges to a principal eigenvector p of A, p has positive elements, and the principal eigenvalue λ1 = 1 is simple, as long as two conditions are satisfied:

1. The graph W(G, L) is strongly connected, meaning that there is a directed path from each node to any other node.

2. The graph W(G, L) is aperiodic, meaning that for any pages i and j there are paths from i to j (with whatever repetitions) of any length l except for a finite set of lengths.

While Condition 2 is guaranteed for the web, Condition 1 is routinely violated. For example, there are many pages without in-links. To satisfy both conditions, the trick used for dangling pages is repeated: the matrix is modified to

    P″ = cP′ + (1 − c)E,   E = (1, ..., 1) · v^T,   0 < c < 1.   (2.4)

The original P is sparse. We can still benefit from this fortunate fact after the transformations (2.3) and (2.4), since the multiplication with matrix P″ can be carried out [Kamvar et al. 03c] in terms of the original P: the vector y = Ax = P″^T x can be computed by the following code:

    y = cP^T x,
    γ = ||x|| − ||y||,
    y = y + γv.   (2.5)
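The three lines of (2.5) can be turned into a runnable sketch (the adjacency-list storage and the toy graph below are our assumptions for illustration); note that only the sparse original P is ever touched, and the scalar γ silently accounts for both dangling mass and teleportation:

```python
# Minimal sketch of the multiplication (2.5): y = P''^T x computed as
# y = c P^T x followed by y += gamma * v, gamma = ||x||_1 - ||c P^T x||_1.
def matvec(out_links, x, v, c):
    n = len(x)
    y = [0.0] * n
    for i, targets in out_links.items():
        w = c * x[i] / len(targets)
        for j in targets:
            y[j] += w                      # y = c P^T x, using sparsity of P
    gamma = sum(x) - sum(y)                # mass lost to dangling pages + teleportation
    return [y_j + gamma * v_j for y_j, v_j in zip(y, v)]

out_links = {0: [1, 2], 1: [2], 2: [0]}    # page 3 is dangling
v = [0.25] * 4
y = matvec(out_links, [0.25] * 4, v, c=0.85)
assert abs(sum(y) - 1.0) < 1e-12           # probability mass is preserved
```

Because γ is recovered from the L1 norms, neither the dense rank-one correction d·v^T of (2.3) nor the matrix E of (2.4) is ever formed.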

General information related to PageRank can be found in the survey article [Langville and Meyer 04a], which contains some explanatory examples corresponding to small graphs. Henzinger [Henzinger 00] also reviews related issues.

2.2. Related Theory

The Perron-Frobenius theorem for positive matrices is proved in [Axelsson 94, Chapter 4]. Under Condition 1 of strong connectivity, it guarantees that λ1 is simple and that the corresponding eigenvector has positive components. The fact that the eigenvalue λ1 = 1 is a consequence of stochasticity. Condition 2 of aperiodicity relates not to the existence of a solution but to the convergence of (2.1), which is not guaranteed in its absence. For example, for a strongly connected two-node graph a ↔ b that does not satisfy the aperiodicity condition (since all paths from a to b have odd lengths), the iterations (2.1) for p(0) = (q, 1 − q) oscillate without convergence between p(0) and p(1) = (1 − q, q).

A matrix is reducible if, after a permutation of coordinates into two nonempty groups, the first of size d and the second of size n − d, it has the block triangular form

    P = [ P11  P12 ]
        [ 0    P22 ],   (2.6)

where dim(P11) = d × d and dim(P22) = (n − d) × (n − d). It is called irreducible otherwise [Axelsson 94, Definition 4.2, Theorem 4.2]. An adjacency matrix and a transition matrix of a directed graph are irreducible if and only if the associated graph is strongly connected [Axelsson 94, Theorem 4.3]. If the graph W is not strongly connected, then factorization of its strongly connected components (SCC) leads to a directed acyclic graph (DAG), discussed in Section 3.4. Every final DAG node (i.e., one having no out-links) corresponds to a SCC that serves as an attractor in the corresponding Markov chain. In plain language, all the p-volume gradually gets sucked into such states. Condition 1 of strong connectivity is necessary for the existence of a unique positive solution of (2.2). If P is reducible, then the system (2.2) either has multiple positive solutions, if P12 = 0 (combinations of solutions for the two blocks), or has no positive solution (since p1 = P11^T p1 and P11 is not stochastic if P12 ≠ 0).

So far, we only know λ1. A less recognized fact is that the eigenvalues of P′ and P″ are interrelated: λ1(P″) = λ1(P′) = 1, and λj(P″) = cλj(P′) for j > 1 [Langville and Meyer 05]. In particular, the second eigenvalue λ2(P″) ≤ c. In reality, for a web graph λ2(P″) = c. This amazing fact, first proven by [Haveliwala and Kamvar 03], is important for extrapolation. It has the astonishing consequence that the teleportation mechanism introduced to secure strong connectivity also determines the convergence rate. The authors also prove that if

    A = (cP + (1 − c)E)^T,   E = (1, ..., 1) · v^T,   (2.7)

where P is row-stochastic and has at least two irreducible closed subsets, then A has a second eigenvalue λ2 = c. The web graph W(G, L) has many strongly connected components [Kumar et al. 00]. A proof can also be found in [Langville and Meyer 05].

Further information about random walks on directed graphs can be found in [Motwani and Raghavan 95, Ahuja et al. 93, Lovász 96]. Not all of the theory is relevant to PageRank. The Markov model defined by a finite graph with a moderate number of outgoing links has its own intrinsic features. It has been observed [Page et al. 98] that the web graph is expanding, meaning that the number of vertices reachable from a (not too large) subset exceeds by a factor α the cardinality of the initial set. A graph has a high expansion factor if and only if its principal eigenvalue is sufficiently larger than the second eigenvalue (the "eigenvalue gap"). A random walk on a graph is rapidly-mixing if the iterations converge in log(|G|) time, which in turn depends on the principal eigenvalue separation.

As we explained, the pages G can be viewed as Markov chain nodes with a transition matrix P. The theory of Markov chains includes the important concept of ergodic states. Let N(i, k) be the number of times a surfer visited page i during the first k steps. The solution of Equation (2.2) is related to N. When strong connectivity and aperiodicity are in place, the Ergodic theorem [Motwani and Raghavan 95, Theorem 6.2] says that each state is ergodic:

    lim_{k→∞} N(i, k)/k = p_i.
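The ergodic limit can be observed directly by simulation. The sketch below (our own toy chain; graph, seed, and step count are arbitrary choices) runs a teleporting random surfer and compares visit frequencies with the stationary distribution obtained by power iteration on the same chain:

```python
# Illustration of N(i,k)/k -> p_i: simulate the teleporting surfer, then
# compare empirical visit frequencies with the power-iteration fixed point.
import random

out_links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # no dangling pages here
n, c, steps = 4, 0.85, 200_000
random.seed(7)

visits = [0] * n
page = 0
for _ in range(steps):
    visits[page] += 1
    if random.random() < c:
        page = random.choice(out_links[page])   # follow a random out-link
    else:
        page = random.randrange(n)              # teleport uniformly
freq = [x / steps for x in visits]

# Power-iteration reference for the same Markov chain.
p = [1.0 / n] * n
for _ in range(200):
    q = [(1 - c) / n] * n
    for i, targets in out_links.items():
        for j in targets:
            q[j] += c * p[i] / len(targets)
    p = q

assert max(abs(a - b) for a, b in zip(freq, p)) < 0.02
```

With a fixed seed and 200,000 steps the empirical frequencies agree with the stationary vector to about two decimal places, which is the qualitative content of the theorem (convergence is only O(1/√k), so simulation is an illustration, not a computing method).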

The transition matrix P is related to the adjacency matrix L by the formula P = D^{-1} · L, where the diagonal matrix D = {deg(i)} contains the out-degrees. Generalizations of this formulation are discussed in [Ding et al. 01]. Brandes and Cornelsen [Brandes and Cornelsen 03] present a general view of other similar constructions for directed graphs; independently of PageRank, which induces an order over G, they also describe a method of spectral graph layout resulting in another ordering of undirected graphs. The method relates to the important concepts of the Laplacian matrix and the minimum energy state Fiedler vector.

PageRank can be formulated in a form alternative to a spectral problem. Since (1, ..., 1)^T p = ||p|| = 1, the PageRank vector, in addition to the eigensystem p = Ap, A = P″^T, also satisfies the equation

    p = cP′^T p + (1 − c)v = cP^T p + (1 − c + cs)v,   (2.8)

where s = s(p) = d^T · p is a "dangling sink." The actual value of s affects the solution only through rescaling it. Therefore, any s, for example, s = 0 (which results in the spectral eigensystem solution in the absence of dangling pages), can be used. This fact has important implications. We elaborate on this topic in Section 3.5. Finally, modifying P with teleportation can be avoided [Langville and Meyer 05, Tomlin 03] if in (2.7) G is augmented with an additional ideal sink node n + 1:

    p̃ = P̃ p̃,   P̃ = [ cP′   (1 − c)(1, ..., 1)^T ]
                      [ v^T   0                  ],   p̃ = [ p, (1 − c) ]^T.

This construction is also useful in handling dangling pages [Eiron et al. 04, Lee et al. 03].

2.3. Usage

The traditional application of PageRank is ranking web search results. The appealing quality of PageRank is that it is available at query time. Since PageRank only uses web graph connectivity and is query-independent, it has to be blended with other features, including query-dependent features. Along with PageRank, another similar algorithm reviewed below, called HITS [Kleinberg 99], works at query time. The benefits of PageRank are best for under-specified (broad topic) queries, for example "University" [Page et al. 98]. In other words, it is best for queries with very high recall based on a straightforward all-term inclusion procedure. From an application standpoint, PageRank has some problems. For example, an artificially high rank is assigned to copyright warnings, disclaimers, and highly interlinked mailing archives. Another problem is a potential topic drift. Different remedies have been suggested, starting with the insightful article [Bharat and Henzinger 98] (which we get back to in the context of HITS). The major issue is how to merge connectivity and content data. Internet search engines blend PageRank with other relevance features using proprietary ranking formulas and other techniques, and any statistics reflecting its significance in this regard are not public. There is no certainty about the importance of personalized PageRank either, except for the fact that no other obvious link-based personalization feature exists. HostRank, a PageRank constructed over the host graph, is another useful relevance feature.


Algorithm 1. (Poor man's PageRank algorithm.)

Input: a transition matrix P, a teleportation vector v, and a coefficient c
Output: PageRank p

begin
    Initialize p(0) = v, k = 0
    repeat
        p(k+1) = cP^T p(k)
        γ = ||p(k)|| − ||p(k+1)||
        p(k+1) = p(k+1) + γv
        δ = ||p(k+1) − p(k)||
        k = k + 1
    until δ < ε
    return p(k)
end

Before moving further, we would like to present the reader with the "Poor man's PageRank algorithm" implementation of the power iteration that takes (2.5) into account; see Algorithm 1.
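Algorithm 1 transcribes directly into a few dozen lines of code. The sketch below is our own rendering (function names, the toy graph, and the tolerance are ours), using adjacency lists for the sparse P and the L1 stopping criterion of the listing:

```python
# A direct transcription of Algorithm 1 (poor man's PageRank), using the
# sparse multiplication trick (2.5) inside each power iteration.
def pagerank(out_links, n, v, c=0.85, eps=1e-10):
    p = v[:]                                        # p(0) = v
    while True:
        q = [0.0] * n
        for i, targets in out_links.items():
            for j in targets:
                q[j] += c * p[i] / len(targets)     # p(k+1) = c P^T p(k)
        gamma = sum(p) - sum(q)                     # gamma = ||p(k)|| - ||p(k+1)||
        q = [q_j + gamma * v_j for q_j, v_j in zip(q, v)]
        delta = sum(abs(a - b) for a, b in zip(q, p))   # L1 change
        p = q
        if delta < eps:
            return p

out_links = {0: [1, 2], 1: [2], 2: [0]}             # page 3 is dangling
n = 4
p = pagerank(out_links, n, [1.0 / n] * n)
assert abs(sum(p) - 1.0) < 1e-9
assert p[2] == max(p)           # page 2 collects the most endorsements
```

Since γ restores exactly the mass lost to dangling pages and teleportation, every iterate remains a probability distribution, so no renormalization step is needed.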

3. Acceleration of Convergence

As we explained, our goal is to find a stationary point p of the eigensystem

    p = Ap,   A = P″^T.   (3.1)

We start with some distribution p(0) (e.g., p(0) = e) and use the power iteration

    p(k+1) = Ap(k)   (3.2)

to compute the eigenvector associated with the principal eigenvalue λ1 = 1. According to (2.5), the A-multiplication successfully exploits the sparse nature of the transition matrix P (on average, there are between 7 and 20 out-links per page, depending on the pruning used), and thus a major iteration has O(n) computational complexity. This is good news; however, the graph size n is very large. We use a simple L1 stopping criterion for transparency, but we will return to this issue in Section 6.1.

Before we investigate more complex algorithms, what is the convergence rate of the power iteration? We already know that λ1(P″) = 1 and λ2(P″) = c. It is easy to see that the convergence rate of the power iteration process for any P″ is λ2(P″)/λ1(P″), or c in our case. We would like to provide somewhat simpler reasoning, specifically related to PageRank, that leads to a slightly weaker result. Following [Bianchini et al. 03], if we subtract from Equation (2.8), p = cP^T p + (1 − c)v, its iterative version (3.2), we get

    p − p(k+1) = cP^T (p − p(k)) = c^{k+1} (P^T)^{k+1} (p − p(0)),

and hence, since ||P^T|| ≤ 1, the rate of convergence is bounded by c. The number of iterations k ≈ log(ε)/log(c) needed to achieve ε-accuracy is independent of the graph.
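The estimate k ≈ log(ε)/log(c) is easy to evaluate numerically (our own back-of-envelope check; the particular ε values are arbitrary):

```python
# Iterations needed for L1-accuracy eps when the convergence rate is bounded
# by c; the bound does not depend on the graph size at all.
import math

def iterations_needed(c, eps):
    return math.ceil(math.log(eps) / math.log(c))

k = iterations_needed(0.85, 1e-8)
assert 110 <= k <= 118            # about 114 iterations for c = 0.85
assert iterations_needed(0.85, 1e-3) < k   # looser accuracy needs fewer sweeps
```

This is why the teleportation coefficient c doubles as a convergence knob: lowering c speeds up the iteration but moves the model further from the pure link-following surfer.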

Next we review different algorithmic methods to accelerate the power method (3.2) and other methods of computing PageRank. In Section 6 we talk about infrastructural means to achieve computational efficiency. Since a web graph is volatile, there is a natural limit on how accurate we would like to be. Also, the whole adventure of computing PageRank makes sense only if the ideal object is stable. This is also one of our concerns in this section.

3.1. Extrapolation Methods

Extrapolation methods are based on the expansion of the iterates p(k) in a series

    p(k) = u1 + λ2^k u2 + ··· + λm^k um + ···,   (3.3)

where the terms um are (not orthonormal) eigenvectors of A (we utilize that λ1 = 1). These methods try to improve the approximation of the ideal solution u1 by p(k) through clever suppression of the higher terms. More specifically, while computing power iterations, from time to time we construct a linear combination of several iterates that exactly eliminates one or more harmonics after the very first term and regard the remaining tail as relatively small. In doing so, we assume that 1 > |λ2| ≥ |λ3| ≥ ... ≥ |λm| ≥ ... and carefully analyze the coefficients of the first m terms in the series (3.3). The constructed combination is used as a closer starting point for further iterations.

Kamvar, Schlosser, and Garcia-Molina [Kamvar et al. 03c] present a variety of extrapolation methods. The simplest, the Aitken method, aims at eliminating the second subprincipal term. Setting m = 2 and truncating the tail (denoted by "..."), we get

    p(k−2) = u1 + u2 + ...
    p(k−1) = u1 + λ2 u2 + ...
    p(k) = u1 + λ2^2 u2 + ....   (3.4)

These are three equations with three unknowns, and we can eliminate two of them, u2 and λ2. This is what the Aitken method does. After an initial speedup, the acceleration is not significant. They also suggest a straightforward generalization, a quadratic extrapolation that is based on a similar set of equations for m = 3,

    p(k−3) = u1 + u2 + u3 + ...
    p(k−2) = u1 + λ2 u2 + λ3 u3 + ...
    p(k−1) = u1 + λ2^2 u2 + λ3^2 u3 + ...
    p(k) = u1 + λ2^3 u2 + λ3^3 u3 + ....   (3.5)

Based on some matrix algebra, the authors developed formulas to eliminate the unknowns corresponding to the second and third terms. Both the Aitken method and quadratic extrapolation are applied periodically within (3.2). The reported experimental speedup for quadratic extrapolation is 59%.

The presented extrapolation techniques were immediately superseded by an elegant discovery: λ2(P″) = c [Haveliwala and Kamvar 03] (see Section 2.2). The rate of convergence ||p(k+1) − p(k)|| / ||p(k) − p(k−1)|| → |λ2/λ1| = c is confirmed by experiments. In the Markov chain context, |λ1| − |λ2| = 1 − c is known as stability. The authors also relate the eigenvectors corresponding to λ2 = c to spam detection. In extrapolation methods, knowing λ2 has a profound consequence: we do not need to bother computing it in Equations (3.4) and (3.5).

Extrapolation methods developed in [Haveliwala et al. 03b] utilize this discovery. A d-extrapolation starts with a representation similar to that in Equation (3.5) under the assumption that m = d + 1 and the λj for j = 2 : d + 1 form a system of d-roots of c^d. This means, in particular, that λj^d = c^d. For example, for d = 2, λ2 = c and λ3 = −c. We see now that

    p(k−d) = u1 + u2 + ··· + u_{d+1} + ...
    p(k) = u1 + λ2^d u2 + ··· + λ_{d+1}^d u_{d+1} + ...   (3.6)

and, hence, the equation

    p_new = u1 = (p(k) − c^d p(k−d)) / (1 − c^d)

becomes a viable extrapolation. It is suggested to be used periodically in conjunction with the power method (3.2). For c = 0.85 the best speedups correspond to d = 4 (25.8%) and d = 6 (30%). In our experience this extrapolation works as predicted with large graphs.
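The d-extrapolation step itself is a one-liner. The sketch below (our own synthetic check; the vectors u1, u2, u3 are fabricated to satisfy the d = 2 eigenvalue assumption exactly) verifies that the combination recovers the principal term:

```python
# Sketch of the d-extrapolation step: p_new = (p(k) - c^d p(k-d)) / (1 - c^d),
# assuming lambda_j^d = c^d for j = 2..d+1. Applied occasionally between
# ordinary power iterations.
def extrapolate(p_k, p_k_minus_d, c, d):
    f = c ** d
    return [(a - f * b) / (1 - f) for a, b in zip(p_k, p_k_minus_d)]

# Synthetic check for d = 2 (so lambda_2 = c, lambda_3 = -c): the combination
# must recover the principal term u1 exactly.
c, d, k = 0.85, 2, 6
u1, u2, u3 = [0.5, 0.3, 0.2], [0.1, -0.1, 0.0], [0.05, 0.0, -0.05]

def iterate(k):
    return [a + c**k * b + (-c)**k * g for a, b, g in zip(u1, u2, u3)]

p_new = extrapolate(iterate(k), iterate(k - d), c, d)
assert max(abs(a - b) for a, b in zip(p_new, u1)) < 1e-12
```

On a real graph the tail terms beyond u_{d+1} do not vanish, so p_new is only a better starting point for further power iterations, not the exact solution.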

3.2. Adaptive Method

There is no need to compute PageRank components exactly. How much accuracy is actually required is an intriguing question that we revisit in Section 6.1. Whatever the required accuracy is, there is a disparity in the magnitudes of PageRank components: the authorities of pages differ dramatically, and most values are small (the lower bound (1 − c)/n is achieved on pages without in-links). A less transparent observation is that small components converge much faster, even when measured on the relative scale ε_i(k) = |p(k+1)_i − p(k)_i| / p(k)_i [Kamvar et al. 03a]. For example, the authors show that, on a graph of 280,000 nodes (3M links) for c = 0.85, around 80% of PageRank components converged (ε_i(k) ≤ 10^{-3}) in less than 15 iterations, and their average magnitude was 0.37 of the average magnitude of the remaining components. The proposed acceleration, the adaptive method, simply stops iterating on already converged components. Assume that p = (p_N, p_C) is a split of p into m not yet converged and (n − m) already converged components. Let A_N and A_C be the corresponding m × n and (n − m) × n submatrices of A. Then, we have

    p(k+1)_N = A_N p(k),   p(k+1)_C = p(k)_C,   (3.7)

where p(k)_C ≈ A_C p(k), and we only need to iterate over the nonconverged p_N. When m becomes smaller, the size of A_N becomes smaller as well, and many computations are avoided. The modified algorithm presented in the paper actually deletes the links corresponding to converged pages from the sparse matrix A. This means a decomposition A_N = [A_NN, A_NC]. The explicit reordering of A is important, and a smart strategy is used to enable housekeeping. The savings are about 20% in overall time. Little hope for reordering a several-billion-node web graph currently exists, which restricts the application of this method to smaller graphs such as, for example, a host graph. The adaptive method requires slightly more major iterations k to achieve convergence.

3.3. Block Structure Method

The web has a particular structure. On a very primitive level, there are many more links within the same domains (hosts) than between pages of different domains (hosts). Kamvar et al. brilliantly exploit this idea [Kamvar et al. 03b]. The intuition behind the idea is that blocks roughly represent the whole graph, but there are many fewer blocks than pages, and they are more interconnected. Therefore, computing an "aggregate" PageRank is easy. If we are also able to approximate PageRank within blocks, which is again a series of smaller problems, then we may find a good starting point from which to ignite a full-graph algorithm. The authors compute PageRank via the algorithm blockRank in several steps. A slightly updated version of this development is presented in Algorithm 2.

Algorithm 2. (Block structure method (blockRank).)

Input: a graph W over nodes G = ∪ G_I, I = 1 : N, a teleportation vector v, and a coefficient c
Output: PageRank p

begin
    for I = 1 : N do
        let P_II be a transition matrix over block G_I
        for i ∈ G_I set v_{I,i} = v_i / Σ_{j ∈ G_I} v_j
        l_I = pageRank(P_II, v_I, v_I)
    end
    for I = 1 : N do
        v̂_I = Σ_{i ∈ G_I} v_i
        for J = 1 : N do
            L̂_IJ = Σ_{i ∈ I, j ∈ J} P_ij l_i
        end
    end
    let P̂ be a transition matrix corresponding to the weighted block structure L̂
    b = pageRank(P̂, v̂, v̂)
    set s = (b_1 l_1, b_2 l_2, ..., b_N l_N)
    p = pageRank(P, s, v)
end
return p

We assume that the full graph W(G, L) is defined over nodes G that can be divided into (disjoint) blocks G_I, so that G = ∪ G_I, I = 1 : N (block indices are denoted by capitals, I, J). Let pageRank(P, p(0), v) denote a general purpose (eigensystem) PageRank-computing algorithm for a transition matrix P, starting with p(0), utilizing a teleportation vector v, and using whatever acceleration (the Kendall distance (see (6.1)) is suggested as the stopping criterion). The method starts with computing N local PageRanks l_I. At the next step we aggregate connectivity information to the block level, introducing a weighted directed graph structure L̂ on the set of blocks. A link from a block I to a block J has weight L̂_IJ = Σ_{i ∈ I, j ∈ J} P_ij l_i, combining links between their constituents weighted by their local PageRanks. We then compute the BlockRank authority vector b for the block-level transition matrix P̂. Finally, a vector s that is a rough approximation to the full PageRank is assembled. It is used as the initial guess to compute the full PageRank.

This algorithm has many useful properties: (1) it does not need much accuracy in computing local PageRanks, (2) it allows obvious parallelization, (3) it may keep within-the-block and between-the-block connectivity data in core memory, and (4) it benefits from the fact that relatively few blocks change after a subsequent crawl. We return to this method in the personalization context.

Similar ideas for computing HostRank are also advocated by [Broder et al. 04]. There the context is reversed: from constructing a good approximation s to be used as an initial guess, to considering s as the final object, called HostRank. Other host aggregations, in particular uniform l_I, are suggested. The described methodology has actually been tried on real-life AltaVista 2003 web data.

The block approach to PageRank performs well on large web graphs and probably constitutes the most practical acceleration technique currently available.
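The three stages of blockRank can be condensed into a small sketch. The code below is a heavily simplified version of Algorithm 2 on a toy two-block graph (the partition, graph, and helper names are ours; for brevity the block-level chain keeps only the aggregated block adjacency rather than the weighted edges L̂_IJ the paper uses):

```python
# Simplified blockRank sketch: local PageRanks per block seed an aggregate
# block-level walk, whose solution assembles a starting vector s for the
# full-graph computation.
def pagerank(out_links, n, p0, v, c=0.85, eps=1e-9):
    p = p0[:]
    while True:
        q = [0.0] * n
        for i, ts in out_links.items():
            for j in ts:
                q[j] += c * p[i] / len(ts)
        g = sum(p) - sum(q)
        q = [a + g * b for a, b in zip(q, v)]
        if sum(abs(a - b) for a, b in zip(q, p)) < eps:
            return q
        p = q

blocks = [[0, 1], [2, 3]]                       # a toy two-host partition
out_links = {0: [1], 1: [0, 2], 2: [3], 3: [2, 0]}
n = 4

# 1. Local PageRank l within each block (links leaving the block are dropped).
local = [0.0] * n
for block in blocks:
    inside = set(block)
    sub = {i: [j for j in out_links[i] if j in inside] for i in block}
    sub = {i: ts for i, ts in sub.items() if ts}
    idx = {i: t for t, i in enumerate(block)}
    m = len(block)
    l = pagerank({idx[i]: [idx[j] for j in ts] for i, ts in sub.items()},
                 m, [1.0 / m] * m, [1.0 / m] * m)
    for i in block:
        local[i] = l[idx[i]]

# 2. Block-level chain (adjacency only in this sketch), then BlockRank b.
N = len(blocks)
which = {i: I for I, block in enumerate(blocks) for i in block}
pairs = {(which[i], which[j]) for i, ts in out_links.items() for j in ts}
block_links = {I: [J for J in range(N) if (I, J) in pairs] for I in range(N)}
b = pagerank(block_links, N, [1.0 / N] * N, [1.0 / N] * N)

# 3. Assemble the starting vector s and finish on the full graph.
s = [b[which[i]] * local[i] for i in range(n)]
p = pagerank(out_links, n, s, [1.0 / n] * n)
assert abs(sum(p) - 1.0) < 1e-8
```

Because each local PageRank sums to one within its block and b sums to one over blocks, the assembled s is itself a probability distribution, which is exactly what makes it a legal starting point for the full-graph iteration.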

3.4. DAG Structure of the Web

A different, but related, block structure of the web is induced by general graph factorization. Consider blocks equal to the (disjoint) strongly connected components (SCC) of W. Since strong connectivity is an equivalence relationship, we can factorize G into some number q of super-nodes, each representing one SCC block. The super-nodes form a directed acyclic graph (DAG). The DAG structure effectively defines a partial "flow" order on the super-graph. The largest, central SCC contains Yahoo!, MSN, and similar common sites. There are also preceding super-nodes (that have paths leading to the central component) and succeeding super-nodes (that have paths leading from the central component). So, the blocked graph looks like a bow tie. Finding the set of all SCCs is related to depth-first search (DFS). In graph theory, an indexation that enumerates nodes within SCCs contiguously and the SCCs themselves in increasing "flow" order is known under the term topological sort. For insights on out-of-core memory DFS, see [Chiang et al. 95].
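For graphs that fit in memory, the SCC condensation and its flow order come out of two DFS passes. The sketch below (our own in-memory toy; Kosaraju's algorithm is used here, though Tarjan's single-pass variant would serve equally well) lists the components source-first:

```python
# Condense a directed graph into its SCCs, listed in topological ("flow")
# order: components with no incoming cross-links come first.
def sccs_in_topological_order(out_links, n):
    order, seen = [], [False] * n

    def dfs1(u):                        # first pass: record finish order
        seen[u] = True
        for w in out_links.get(u, []):
            if not seen[w]:
                dfs1(w)
        order.append(u)

    for u in range(n):
        if not seen[u]:
            dfs1(u)

    rev = {u: [] for u in range(n)}     # reversed graph for the second pass
    for u, ts in out_links.items():
        for w in ts:
            rev[w].append(u)

    comp = [-1] * n

    def dfs2(u, label):
        comp[u] = label
        for w in rev[u]:
            if comp[w] == -1:
                dfs2(w, label)

    label = 0
    for u in reversed(order):           # Kosaraju's second pass
        if comp[u] == -1:
            dfs2(u, label)
            label += 1

    groups = [[] for _ in range(label)]
    for u in range(n):
        groups[comp[u]].append(u)
    return groups                       # source components first

# Two 2-cycles joined by a one-way bridge: {0,1} must precede {2,3}.
g = {0: [1], 1: [0, 2], 2: [3], 3: [2]}
res = sccs_in_topological_order(g, 4)
assert res == [[0, 1], [2, 3]]
```

Renumbering pages so that each returned group is contiguous yields exactly the block upper triangular permutation discussed next; for web-scale graphs the recursive DFS must of course be replaced by the out-of-core techniques cited above.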

Under such an enumeration (corresponding to a permutation of the original P), we get a block upper triangular representation

    P = [ P11  P12  ...  P1q ]
        [      P22  ...  P2q ]
        [           ...      ]
        [                Pqq ],   (3.8)

with zero blocks below the diagonal.

It establishes a relation between strong connectivity and irreducibility (see (2.6)): a matrix P is irreducible if and only if there is a single SCC (q = 1). The corresponding partitioning of PageRank p leads to significant improvements [Arasu et al. 02]. Equation (2.8) has the form

    p = cP^T p + b,   b = const · v.   (3.9)

Contrary to some references, const is not (1 − c) but depends on a sink term in Equation (2.8). This does not matter, since the solutions only differ by a scalar factor. In view of (3.8), Equation (3.9) can be cast as a system of linear equations, j = 1 : q,

    p_j = cP_jj^T p_j + f_j,   f_j = b_j + cP_1j^T p_1 + ··· + cP_{j−1,j}^T p_{j−1}.   (3.10)

We do not know how this method performs in practice on large graphs.
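The forward substitution in (3.10) is easy to demonstrate on a toy block triangular case (our own example; blocks, values, and the dense inner Jacobi solver are illustrative assumptions):

```python
# Solve (3.10) block by block: with SCCs in flow order, each block equation
# p_j = c P_jj^T p_j + f_j is solved before its successors, which only ever
# read already-computed components through the off-diagonal blocks.
def solve_block(A_block, f, c=0.85, eps=1e-12):
    """Jacobi solve of p = c * A_block p + f, A_block = P_jj^T (dense toy)."""
    p = f[:]
    m = len(f)
    while True:
        q = [f[i] + c * sum(A_block[i][k] * p[k] for k in range(m))
             for i in range(m)]
        if sum(abs(a - b) for a, b in zip(q, p)) < eps:
            return q
        p = q

c = 0.85
b = [0.05, 0.05, 0.05]              # b = const * v with v uniform, const = 0.15
# Block upper triangular P: block 1 = {0}, block 2 = {1, 2}.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
p1 = solve_block([[0.0]], [b[0]], c)                 # P_11 = 0, so p_1 = b_1
f2 = [b[1] + c * P[0][1] * p1[0],                    # f_2 = b_2 + c P_12^T p_1
      b[2] + c * P[0][2] * p1[0]]
A22 = [[0.0, 1.0], [1.0, 0.0]]                       # P_22^T
p2 = solve_block(A22, f2, c)
p = p1 + p2
assert abs(sum(p) - 1.0) < 1e-9
```

Each sub-problem is a strictly smaller contraction with the same rate c, and the off-diagonal blocks are applied only once, which is the source of the reported savings.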

3.5.Spectral and Equation Solvers

So far we tried to accelerate the solution of the spectral problem for eigensys-tem(3.1)withλ1=1.Meanwhile,Equation(3.9)reformulates a problem in terms of a linear system:

(1?cP T)p=b(=const·v).(3.11)

We would like to reflect briefly on this simple fact. While solving eigensystem (3.1) is based on the simple iterative procedure (3.2), analogous simple iterations

p^{(k)} = c P^T p^{(k-1)} + b,   (3.12)

known as the Jacobi method, can be used to find a solution of the linear system (3.11). While the eigensystem formulation of PageRank is well studied, finding an effective solution of the linear PageRank equation (3.11) is a relatively new field. The connection between (normalized) eigensystem PageRank p_E and linear PageRank p_L is defined by the formula

p_L = \frac{1-c}{c \cdot \mathrm{sink}(p) + 1 - c}\, p_E.   (3.13)

Linear PageRank p_L has an application advantage: it depends linearly on the teleportation vector, while traditional eigensystem PageRank p_E does not. As a consequence, it is easy to construct linear combinations of, for example, topic-specific PageRanks that are crucial in personalization.
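This linearity is easy to demonstrate. In the sketch below (illustrative graph and topic vectors, c = 0.85 assumed), the linear PageRank of a mixture of two teleportation vectors equals the same mixture of the individual PageRanks:

```python
def linear_pagerank(out_links, v, c=0.85, iters=200):
    """Jacobi iteration p <- c P^T p + (1 - c) v on a sink-free graph;
    this is the linear PageRank of (3.11) up to the constant factor."""
    p = {u: (1 - c) * v[u] for u in out_links}
    for _ in range(iters):
        nxt = {u: (1 - c) * v[u] for u in out_links}
        for u, outs in out_links.items():
            share = c * p[u] / len(outs)
            for w in outs:
                nxt[w] += share
        p = nxt
    return p

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
v1 = {"a": 1.0, "b": 0.0, "c": 0.0}   # topic vector 1
v2 = {"a": 0.0, "b": 1.0, "c": 0.0}   # topic vector 2
mix = {u: 0.5 * v1[u] + 0.5 * v2[u] for u in graph}

p1 = linear_pagerank(graph, v1)
p2 = linear_pagerank(graph, v2)
pm = linear_pagerank(graph, mix)
# pm coincides with the same mixture of p1 and p2 (linearity in v).
```

The eigensystem PageRank p_E, by contrast, would be renormalized per vector, so mixing the solutions would not in general reproduce the solution for the mixed teleportation vector.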

Linear systems have a rich arsenal of different iterative methods [Axelsson 94]. The simplest ones, Jacobi and Gauss-Seidel, were successfully tried for Equation (3.9) [Arasu et al. 02, Del Corso et al. 04]. The Jacobi method is already defined by (3.12). The Gauss-Seidel method is similar, but it reutilizes already computed components:

p_i^{(k+1)} = (1-c) v_i + c \sum_{j<i} a_{ij}\, p_j^{(k+1)} + c \sum_{j>i} a_{ij}\, p_j^{(k)}.   (3.14)

Thus, the representation (3.8) becomes very useful: many terms in Equation (3.14) vanish. In numerical experiments Gauss-Seidel saves up to 40% of iterations, though it has overhead costs. It has an important advantage of working in place. In experiments it works fine with large graphs. However, Gauss-Seidel has one disadvantage: it is very hard to parallelize.
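A minimal sketch of the in-place Gauss-Seidel sweep (3.14) follows; the three-node test graph and c = 0.85 are assumptions for illustration, and a_{ij} is taken as 1/deg(j) when page j links to page i:

```python
def gauss_seidel_pagerank(in_links, deg, v, c=0.85, sweeps=100, tol=1e-12):
    """in_links[i]: pages j linking to page i; deg[j]: out-degree of j.
    Pages are assumed enumerated 0..n-1, e.g., in the order of (3.8).
    Each sweep updates p in place, reusing components already computed
    in the current sweep (j < i), as in (3.14)."""
    n = len(v)
    p = [(1 - c) * v[i] for i in range(n)]
    for _ in range(sweeps):
        delta = 0.0
        for i in range(n):
            new = (1 - c) * v[i] + c * sum(p[j] / deg[j] for j in in_links[i])
            delta += abs(new - p[i])
            p[i] = new  # in-place update: later rows see the new value
        if delta < tol:
            break
    return p
```

On the 3-cycle 0 → 1 → 2 → 0 with a uniform teleportation vector this converges to the uniform distribution, as symmetry requires; the single vector p is the only state kept between sweeps, which is the "working in place" advantage noted above.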

Two other simple splitting methods are the successive overrelaxation method (SOR), which depends on a parameter ω, and the similar symmetric successive overrelaxation method (SSOR), both known to be faster than Gauss-Seidel. For a detailed description of their application to finite Markov chains, see [Stewart 99]. For general information see [Axelsson 94, Chapters 5–7] and [Golub and Loan 96]. One important practical observation that is more about art than about science is that the performance of these methods depends on the enumeration order, and there is experimental evidence that (3.8) is beneficial. The feasibility of SOR for a small W is numerically proven, since a good ω of the form 1 + ε can be found. The existence of a good ω for large web graphs with more than five to ten million nodes is questionable.

It follows from Equation (3.11) [Haveliwala et al. 03a] that p = const · (I − c P^T)^{−1} v = Qv. Here, (I − c P^T) is diagonally dominant and invertible. When v has only one nonzero element, p coincides with a column of the matrix Q. So, in principle, Q contains full information for personalization (see Section 5). Computing Q in practice is, of course, infeasible.

3.6. Advanced Numerical Linear Algebra Methods

While previous sections were dealing either with straightforward acceleration of the power method or with traditional methods for the related linear PageRank formulation, here we review a few approaches that substantially deviate from this paradigm. Indeed, PageRank represents a huge but standard instance of computing a stationary distribution for an irreducible Markov chain that, in turn, is an instance of a general eigensystem problem, which, as we explained, can be translated into a linear system formulation. Recently, an interest in applying modern numerical linear algebra methods to PageRank has emerged. At this moment it is too early to judge how they will perform on real-life web graphs. Since the reviewed methods are based on complex constructions that are beyond the scope of our presentation, we provide no details. The reader can find explanations and further tips in the references mentioned below.

A comprehensive review of numerical methods related to finding stationary distributions of Markov chains can be found in [Stewart 99]. Along with iterative methods, finding equilibrium can be performed by direct methods. A direct projection method for finding a Markov chain stationary distribution is presented in [Benzi 04].

In Section 3.3, the block structure method was considered. The idea of aggregating different states (pages in our case) in several subgraphs is very well known in the theory of computing stationary distributions for Markov chains under the name of the aggregation/disaggregation method. Its iterative analog was applied to PageRank [Langville and Meyer 04b]. The method is equivalent to preconditioning of the power method by LU factorization. Extensive analysis of the convergence properties of this algorithm has been done by [Ipsen and Kirkland 04]. We do not know how it performs on real web graphs. Lee, Golub, and Zenios [Lee et al. 03] study the performance on relatively large web graphs of a two-stage decomposition technique related to an aggregation of dangling pages (corresponding to (3.8) with q = 2). In the language of Markov chains, the property of having upper triangular form is called lumpability. The authors report that the algorithm converges in 20% of the original running time.

The Krylov subspace for a matrix A is defined as

K_k(A, v) = \mathrm{span}\{ v, Av, A^2 v, \ldots, A^{k-1} v \}.

Modern numerical linear algebra uses these subspaces in different advanced iterative procedures; [Golub and Greif 04] applies different versions of the celebrated Arnoldi algorithm, which is an instance of such a procedure. The Arnoldi algorithm and its variations (e.g., combination with SVD) are used to find the rank-one subspace of I − P^T. This insightful report also studies the dependency of PageRank on the teleportation coefficient c (see Section 3.7).

Experiments with an application of other instances of Krylov subspace methods, GMRES, BiCG, and BiCGSTAB, to linear system PageRank can be found in [Gleich et al. 04]. In many cases these methods outperform the power method and the Jacobi method by 50%, even though their convergence is not stationary, meaning that between iterations the errors behave erratically. Experiments also showed that the convergence of these methods degrades less than the convergence of the power method with a reduction in teleportation when c → 1. The big advantage of these methods is that they are parallelizable. Parallel versions have been benchmarked on several web graphs. For another effort to parallelize PageRank, see [Manaskasemsak and Rungsawang 04].

3.7. Sensitivity Analysis

The teleportation parameter is something that we added to the model for purely technical reasons. When teleportation gradually decreases (c → 1), the negative effects of spam decrease as well, but so does the convergence rate. For this reason it is important to find the dynamics of PageRank p(c) as a function of c [Golub and Greif 04]. Linear system PageRank satisfies the equation

p = c P^T p + (1 − c) v.   (3.15)

We would like to find p′ = ∂p/∂c. Differentiating with respect to c, we get

p′ = c P^T p′ + P^T p − v,

or after substituting P^T p from Equation (3.15), we get

p′ = c P^T p′ + (p − v)/c = c P^T p′ + (1 − c) \tilde{v},  \qquad  \tilde{v} = (p − v)/(c(1 − c)).

This elegant formula shows that p′ satisfies the same linear system as PageRank itself for the right-hand side \tilde{v}. So, (1) stability decreases when c → 1, and (2) the components that deviate the most from the teleportation vector suffer the highest instability.

Any web graph is only an approximate snapshot of reality that depends on many factors, from crawler performance and configuration to a setting of parameters defining link pruning to an actual constant appearance of new pages and a decay of old ones. Therefore, computed PageRank is defined over volatile data, and its stability with respect to changes in an underlying web graph becomes a very important consideration that makes or breaks its practical ranking utility. It is also very important computationally, since there is no point in iterating trying to attain accuracy beyond the natural bounds of sensitivity.

The stability of PageRank with respect to small perturbations in the structure of a graph W is studied in [Ng et al. 01b, Ng et al. 01a]. They prove a bound on the change in PageRank caused by a change of links touching (coming to or from) a subset of pages C:

\| \tilde{p} - p \|_1 \le \frac{2}{1-c}\, E_C,  \qquad  E_C = \sum_{i \in C} p_i.

The authors also analyze the HITS algorithm (see Section 4.1) and provide examples that demonstrate that PageRank is more stable than HITS. They show that the sensitivity of HITS depends on the eigengap, which is the difference between the principal and second eigenvalues of L^T L, where the matrix L is the adjacency matrix of a subgraph associated with the HITS algorithm.
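The Ng et al. bound can be illustrated on a toy graph (the graph, c = 0.85, and the perturbed page are assumptions for the example): only the out-links of the single page in C are changed, and the L1 change in PageRank stays below 2E_C/(1 − c).

```python
def pagerank(P, c=0.85, iters=1000):
    """Power iteration for uniform-teleportation PageRank on a
    row-stochastic, sink-free P."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [c * sum(P[j][i] * p[j] for j in range(n)) + (1 - c) / n
             for i in range(n)]
    return p

# Pages 0 and 1 form the core; pages 2, 3, 4 each link into the core.
P = [[0.0, 1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0, 0.0]]
p = pagerank(P)

# Change only the out-links of C = {4}: page 4 now points at page 1.
P2 = [row[:] for row in P]
P2[4] = [0.0, 1.0, 0.0, 0.0, 0.0]
p2 = pagerank(P2)

c = 0.85
l1_change = sum(abs(a - b) for a, b in zip(p, p2))
bound = 2.0 / (1 - c) * p[4]  # E_C = p_4, the PageRank mass of C
```

Because page 4 has a tiny PageRank (only teleportation feeds it), the bound is far below the trivial L1 limit of 2, so the inequality is informative rather than vacuous here.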

Similar results regarding the robustness of PageRank are reported in [Bianchini et al. 02, Bianchini et al. 03]. The authors studied subsets (communities) of pages and found useful representations for the quantity E_C introduced above that they called the "energy" of C. They got a slightly better bound

\| \tilde{p} - p \|_1 \le \frac{2c}{1-c}\, E_C.

The next stage in sensitivity research is presented in [Borodin et al. 01] and [Lempel and Moran 03]. In addition to L1 stability, these papers also study rank stability, which is the stability of indices not with respect to changes in weights but with respect to changes in induced rankings. Ranking measures (for example, (6.1)) are explained in Section 6.1. The authors analyze several models other than PageRank (like, for example, HITS and SALSA) that are introduced in the next section. Their (mostly negative asymptotic) results concerning stability and rank stability were obtained under certain special conditions on graph topology and were formulated as estimates in terms of a number of altered links.

From a practical standpoint, a number of altered links is a less attractive measure than the weighted measure E_C utilizing page importance (for example, crawlers that are mostly responsible for web graph perturbation distinguish between "important" and "unimportant" pages, and E_C provides a convenient abstraction for this). Further improvement to stability estimates in terms of a weighted measure was achieved by [Lee and Borodin 96]. The authors prove that it is enough to take into account only the weights of a "forward" subset F ⊆ C of pages whose out-links (not in-links) have been changed:

\| \tilde{p} - p \|_1 \le \frac{2c}{1-c} \sum_{i \in F} p_i.

A number of interesting results are obtained in [Chien et al. 01]. In addition to sensitivity analysis, the authors try to address the problem of incrementally recomputing PageRank after small perturbations (evolution) in the web structure. Adding some edges results in a new matrix \tilde{P} = P + E. The authors rigorously analyze the problem of inserting a single link. The suggested method is to (1) isolate a subgraph W1, where the influence of a change is significant, from the total web graph; (2) collapse the remaining nodes of W \ W1 into a heavy supernode Ω; and (3) compute an approximate PageRank on this new graph structure W1 + Ω.

Constructing a subgraph can be outlined as follows: assign a unit weight to a node i of a newly inserted link i → j, and propagate this weight from i, transferring c/deg(i) to each out-link including j. After several steps of the breadth-first-search (BFS) algorithm, retain the accessed nodes with a total weight above a prescribed threshold. Transition probabilities to/from the supernode also have instructive definitions. For example, the transition term from a supernode Ω to

a node j ∈ W1 is defined in terms of uniform PageRank p as

P_{\Omega j} = \frac{1}{Q} \sum_{i \in \Omega} p_i P_{ij},  \qquad  Q = \sum_{i \in \Omega} p_i.
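The BFS weight propagation that selects W1 might be sketched as follows; the level-by-level propagation, hop count, and threshold are assumptions for illustration, not the exact algorithm of [Chien et al. 01]:

```python
def influence_subgraph(out_links, i, c=0.85, hops=3, threshold=1e-3):
    """Breadth-first weight propagation from node i: each level passes
    c/deg(u) of its weight along every out-link of u; nodes whose
    accumulated weight reaches the threshold are retained as W1."""
    weight = {i: 1.0}   # total accumulated weight per node
    level = {i: 1.0}    # weight arriving at the current BFS level
    for _ in range(hops):
        nxt = {}
        for u, w_u in level.items():
            outs = out_links.get(u, [])
            if not outs:
                continue
            share = c * w_u / len(outs)
            for v in outs:
                nxt[v] = nxt.get(v, 0.0) + share
        for v, x in nxt.items():
            weight[v] = weight.get(v, 0.0) + x
        level = nxt
    return {u for u, x in weight.items() if x >= threshold}
```

On the chain 0 → 1 → 2 → 3, two hops from node 0 with a threshold of 0.5 retain {0, 1, 2}: node 3 is never reached, and the weights decay geometrically with the damping factor.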

In practice, each crawl results in new IDs for all pages. Finding related IDs is a major problem. A much more humble suggestion is to simply initialize a small portion of outstanding pages with their former PageRank values, distributing the rest of the authority uniformly.

4. Other Approaches

Along with the random surfer model, other usages of hyperlink data were suggested for the purpose of computing the authority weight of a web page. Historically, [Larson 96] was one of the first to apply ideas of bibliometrics to the web. An even earlier pre-Internet attempt to utilize graph structure was done by [Frisse 88]. Another approach [Carrière and Kazman 97] suggests characterizing a page by the number of its in-links and introduces the concept of a neighborhood subgraph. A major boost in the context of web search relevance is associated with the appearance of the HITS algorithm. We survey it and other similar algorithms in this section. We also review applications of PageRank beyond ranking the web pages.

4.1. HITS Algorithm: Hubs and Authorities

A framework similar to PageRank computing introduced in [Kleinberg 99] has several distinctive features: (1) it works with a web subgraph specific to a particular query, rather than with the full graph W; (2) it computes two weights, an authority weight x_i and a hub weight y_i, for each page instead of one PageRank weight; and (3) it allows clustering of results for multi-topic or polarized queries such as "jaguar" or "abortion."

A subgraph used in the computations is constructed as follows. The top t (around 200) results recalled for a given query are picked according to a text-based relevance criterion. This is a root set. All pages pointed to by out-links of the root set are added, along with up to d (around 50) pages corresponding to in-links of each page in the root set (some hubs have enormous numbers of in-links). The result is a focused (or neighborhood, or augmented) subgraph corresponding to a query. One of the goals is to allow the engine to report highly relevant pages that do not even contain the query term.

The authority weight of a page is an aggregated significance of the hubs that point to it ("beauty lies in the eye of the beholder"), while the hub weight of
