
Automatic Meaning Discovery Using Google


Rudi Cilibrasi*

CWI

Paul Vitanyi†

CWI, University of Amsterdam

Abstract

We present a new theory of relative semantics between objects, based on information distance and Kolmogorov complexity. This theory is then applied to construct a method to automatically extract the meaning of words and phrases from the world-wide-web using Google page counts. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings. The world-wide-web is the largest database on earth, and the latent semantic context information entered by millions of independent users averages out to provide automatic meaning of useful quality. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies, and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert crafted WordNet categories.

1 Introduction

Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like "the four-letter genome of a mouse," or "the text of War and Peace by Tolstoy." There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in background common knowledge in humankind, like "home" or "red." To make computers more intelligent one would like to represent meaning in computer-digestible form. Long-term and labor-intensive efforts like the Cyc project [19] and the WordNet project [30] try to establish semantic relations between common objects, or, more precisely, names for those objects. The idea is to create a semantic web of such vast proportions that rudimentary intelligence, and knowledge about the real world, spontaneously emerge. This comes at the great cost of designing structures capable of manipulating knowledge, and entering high quality contents in these structures by knowledgeable human experts. While the efforts are long-running and large scale, the overall information entered is minute compared to what is available on the world-wide-web.

The rise of the world-wide-web has enticed millions of users to type in trillions of characters to create billions of web pages of on average low quality contents.

* Supported in part by the Netherlands BSIK/BRICKS project, and by NWO project 612.55.002. Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: Rudi.Cilibrasi@cwi.nl.

† Part of this work was done while the author was on sabbatical leave at National ICT of Australia, Sydney Laboratory at UNSW. Supported in part by the EU Project RESQ IST-2001-37559, the ESF QiT Programme, the EU NoE PASCAL, and the Netherlands BSIK/BRICKS project. Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: Paul.Vitanyi@cwi.nl.

Dagstuhl Seminar Proceedings 06051

Kolmogorov Complexity and Applications

http://drops.dagstuhl.de/opus/volltexte/2006/629

The sheer mass of the information available about almost every conceivable topic makes it likely that extremes will cancel and the majority or average is meaningful in a low-quality approximate sense. We devise a general method to tap the amorphous low-grade knowledge available for free on the world-wide-web, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available for all by using any search engine that can return aggregate page-count estimates for a large range of search-queries, like Google.

Previously, we and others developed a compression-based method to establish a universal similarity metric among objects given as finite binary strings [2,22,23,7,8], which was widely reported [17,18,12]. Such objects can be genomes, music pieces in MIDI format, computer programs in Ruby or C, pictures in simple bitmap formats, or time sequences such as heart rhythm data. This method is feature-free in the sense that it doesn't analyze the files looking for particular features; rather it analyzes all features simultaneously and determines the similarity between every pair of objects according to the most dominant shared feature. The crucial point is that the method analyzes the objects themselves. This precludes comparison of abstract notions or other objects that don't lend themselves to direct analysis, like emotions, colors, Socrates, Plato, Mike Bonanno and Albert Einstein. While the previous method that compares the objects themselves is particularly suited to obtain knowledge about the similarity of objects themselves, irrespective of common beliefs about such similarities, here we develop a method that uses only the name of an object and obtains knowledge about the similarity of objects by tapping available information generated by multitudes of web users. Here we are reminded of the words of D.H. Rumsfeld [28]: "A trained ape can know an awful lot / Of what is going on in this world / Just by punching on his mouse / For a relatively modest cost!"

1.1 An Example: While the theory we propose is rather intricate, the resulting method is simple enough. We give an example: At the time of doing the experiment, a Google search for "horse" returned 46,700,000 hits. The number of hits for the search term "rider" was 12,200,000. Searching for the pages where both "horse" and "rider" occur gave 2,630,000 hits, and Google indexed 8,058,044,651 web pages. Using these numbers in the main formula (3.6) we derive below, with N = 8,058,044,651, this yields a Normalized Google Distance between the terms "horse" and "rider" as follows:

NGD(horse, rider) ≈ 0.443.

In the sequel of the paper we argue that the NGD is a normed semantic distance between the terms in question, usually in between 0 (identical) and 1 (unrelated), in the cognitive space invoked by the usage of the terms on the world-wide-web as filtered by Google. Because of the vastness and diversity of the web this may be taken as related to the current objective meaning of the terms in society. We did the same calculation when Google indexed only one-half of the current number of pages: 4,285,199,774. It is instructive that the probabilities of the used search terms didn't change significantly over this doubling of pages, with the number of hits for "horse" equal to 23,700,000, for "rider" equal to 6,270,000, and for "horse, rider" equal to 1,180,000. The NGD(horse, rider) we computed in that situation was ≈ 0.460. This is in line with our contention that the relative frequencies of web pages containing search terms give objective information about the semantic relations between the search terms. If this is the case, then the Google probabilities of search terms and the computed NGD's should stabilize (become scale invariant) with a growing Google database.
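For concreteness, here is a minimal sketch (ours, not part of the paper) that plugs the page counts quoted above into the formula (3.6) derived in Section 3; the base of the logarithm is immaterial, since the formula is a ratio of differences of logarithms.

    import math

    def ngd(f_x, f_y, f_xy, n):
        # Normalized Google Distance from page counts f(x), f(y), f(x,y) and N, as in (3.6)
        lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
        return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

    # counts reported in the text for the two snapshots of the Google index
    print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))  # approx. 0.443
    print(ngd(23_700_000, 6_270_000, 1_180_000, 4_285_199_774))   # approx. 0.460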

1.2 Related Work: There is a great deal of work in both cognitive psychology [34], linguistics, and computer science, about using word (phrases) frequencies in text corpora to develop measures for word similarity or word association, partially surveyed in [31,33], going back to at least [32]. One of the most successful is Latent Semantic Analysis (LSA) [34] that has been applied in various forms in a great number of applications. We discuss LSA and its relation to the present approach in Appendix A. As with LSA, many other previous approaches of extracting meaning from text documents are based on text corpora that are many orders of magnitude smaller, and that are in local storage, and on assumptions that are more refined, than what we propose. In contrast, [10,1] and the many references cited there use the web and Google counts to identify lexico-syntactic patterns or other data. Again, the theory, aim, feature analysis, and execution are different from ours, and cannot meaningfully be compared. Essentially, our method below automatically extracts meaning relations between arbitrary objects from the web in a manner that is feature-free, up to the search-engine used, and computationally feasible. This seems to be a new direction altogether.

1.3 Outline of This Work: The main thrust is to develop a new theory of semantic distance between a pair of objects, based on (and unavoidably biased by) a background contents consisting of a database of documents. An example of the latter is the set of pages constituting the world-wide-web. Semantic relations between pairs of objects are distilled from the documents by just using the number of documents in which the objects occur, singly and jointly (irrespective of location or multiplicity). The theoretical underpinning is based on the theory of Kolmogorov complexity [24], and expressed as coding and compression. This allows us to express and prove properties of absolute relations between objects that cannot even be expressed by other approaches. The theory, application, and the particular NGD formula to express the bilateral semantic relations are (as far as we know) not equivalent to any earlier theory, application, and formula in this area. The current paper is a next step in a decade of cumulative research in this area, of which the main thread is [24,2,25,23,7,8] with [22,3] using the related approach of [26]. We first start with a technical introduction outlining some notions underpinning our approach: Kolmogorov complexity, information distance, and the compression-based similarity metric (Section 2). Then we give a technical description of the Google distribution, the Normalized Google Distance, and the universality of these notions (Section 3). While it may be possible in principle that other methods can use the entire world-wide-web to determine relative meaning between terms, we do not know of a method that both uses the entire web, or computationally can use the entire web, and (or) has the same aims as our method. To validate our method we therefore cannot compare its performance to other existing methods. Ours is a new proposal for a new task. We validate the method in the following way: by theoretical analysis, by anecdotal evidence in a plethora of applications, and by systematic and massive comparison of accuracy in a classification application compared to the uncontroversial body of knowledge in the WordNet database. In Section 3 we give the theoretic underpinning of the method and prove its universality. In Section 4 we present a plethora of clustering and classification experiments to validate the universality, robustness, and accuracy of our proposal. In Section 5 we test repetitive automatic performance against uncontroversial semantic knowledge: We present the results of a massive randomized classification trial we conducted to gauge the accuracy of our method against the expert knowledge as implemented over the decades in the WordNet database. The actual experimental data can be downloaded from [5]. The method is implemented as an easy-to-use software tool available on the web [6], available to all.

1.4 Materials and Methods: The application of the theory we develop is a method that is justified by the vastness of the world-wide-web, the assumption that the mass of information is so diverse that the frequencies of pages returned by Google queries average the semantic information in such a way that one can distill a valid semantic distance between the query subjects. It appears to be the only method that starts from scratch, is feature-free in that it uses just the web and a search engine to supply contents, and automatically generates relative semantics between words and phrases. A possible drawback of our method is that it relies on the accuracy of the returned counts. It is well-known, see [1], that the returned Google counts are inaccurate, especially if one uses the boolean OR operator between search terms. For example, a search for "horse" returns 106 million hits, "rider" 30.1 million hits, but "horse OR rider" 116 million hits, violating set theory. (These numbers held at the time of writing this sentence. Above, we used an example with the same search terms but with numbers based on Google searches of the year before.) The AND operator we use is less problematic, and we do not use the OR operator. Furthermore, Google apparently estimates the number of hits based on samples, and the number of indexed pages changes rapidly. To compensate for the latter effect, we have inserted a normalizing mechanism in the CompLearn software. Linguists judge the accuracy of Google counts trustworthy enough: In [20] (see also the many references to related research) it is shown that web searches for rare two-word phrases correlated well with the frequency found in traditional corpora, as well as with human judgments of whether those phrases were natural. Thus, Google is the simplest means to get the most information. Note, however, that a single Google query takes a fraction of a second, and that Google restricts every IP address to a maximum of (currently) 500 queries per day, although they are cooperative enough to extend this quotum for noncommercial purposes. The experimental evidence provided here shows that the combination of Google and our method yields reasonable results, gauged against common sense ('colors' are different from 'numbers') and against the expert knowledge in the WordNet database. A reviewer suggested downscaling our method by testing it on smaller text corpora. This does not seem useful. Clearly performance will deteriorate with decreasing database size. A thought experiment using the extreme case of a single web page consisting of a single term suffices. Practically addressing this issue is begging the question. Instead, in Section 3 we theoretically analyze the relative semantics of search terms established using all of the web, and its universality with respect to the relative semantics of search terms using subsets of web pages.

2 Technical Preliminaries

The basis of much of the theory explored in this paper is Kolmogorov complexity. For an introduction and details see the textbook [24]. Here we give some intuition and notation. We assume a fixed reference universal programming system. Such a system may be a general computer language like LISP or Ruby, and it may also be a fixed reference universal Turing machine in a given standard enumeration of Turing machines. The latter choice has the advantage of being formally simple and hence easy to theoretically manipulate. But the choice makes no difference in principle, and the theory is invariant under changes among the universal programming systems, provided we stick to a particular choice. We only consider universal programming systems such that the associated set of programs is a prefix code, as is the case in all standard computer languages. The Kolmogorov complexity of a string x is the length, in bits, of the shortest computer program of the fixed reference computing system that produces x as output. The choice of computing system changes the value of K(x) by at most an additive fixed constant. Since K(x) goes to infinity with x, this additive fixed constant is an ignorable quantity if we consider large x.

One way to think about the Kolmogorov complexity K(x) is to view it as the length, in bits, of the ultimate compressed version from which x can be recovered by a general decompression program. Compressing x using the compressor gzip results in a file x_g with (for files that contain redundancies) the length |x_g| < |x|. Using a better compressor bzip2 results in a file x_b with (for redundant files) usually |x_b| < |x_g|; using a still better compressor like PPMZ results in a file x_p with (for again appropriately redundant files) |x_p| < |x_b|. The Kolmogorov complexity K(x) gives a lower bound on the ultimate value: for every existing compressor, or compressors that are possible but not known, we have that K(x) is less than or equal to the length of the compressed version of x. That is, K(x) gives us the ultimate value of the length of a compressed version of x (more precisely, of a version from which x can be reconstructed by a general purpose decompressor), and our task in designing better and better compressors is to approach this lower bound as closely as possible.
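As an illustration of this point (our sketch, not the paper's), the following uses off-the-shelf compressors from the Python standard library; zlib stands in for gzip, bz2 for bzip2, and lzma for a stronger compressor such as PPMZ, which has no standard Python binding. Each compressed length is a computable upper bound on K(x), up to an additive constant depending on the compressor.

    import zlib, bz2, lzma

    x = b"the quick brown fox jumps over the lazy dog " * 200   # a redundant file
    sizes = {
        "raw": len(x),
        "zlib (gzip-like)": len(zlib.compress(x, 9)),
        "bz2 (bzip2)": len(bz2.compress(x, 9)),
        "lzma": len(lzma.compress(x)),
    }
    print(sizes)   # each compressed size upper-bounds K(x) up to an additive constant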

2.1 Normalized Information Distance: In [2] we considered the following notion: given two strings x and y, what is the length of the shortest binary program in the reference universal computing system such that the program computes output y from input x, and also output x from input y. This is called the information distance and denoted as E(x,y). It turns out that, up to a negligible logarithmic additive term,

E(x,y) = K(x,y) - min{K(x), K(y)}.

This distance E(x,y) is actually a metric: up to close precision we have E(x,x) = 0, E(x,y) > 0 for x ≠ y, E(x,y) = E(y,x), and E(x,y) ≤ E(x,z) + E(z,y), for all x, y, z. We now consider a large class of admissible distances: all distances (not necessarily metric) that are nonnegative, symmetric, and computable in the sense that for every such distance D there is a prefix program that, given two strings x and y, has binary length equal to the distance D(x,y) between x and y. Then,

E(x,y) ≤ D(x,y) + c_D,   (2.1)

where c_D is a constant that depends only on D but not on x, y, and we say that E(x,y) minorizes D(x,y) up to an additive constant. We call the information distance E universal for the family of computable distances, since the former minorizes every member of the latter family up to an additive constant. If two strings x and y are close according to some computable distance D, then they are at least as close according to distance E. Since every feature in which we can compare two strings can be quantified in terms of a distance, and every distance can be viewed as expressing a quantification of how much of a particular feature the strings have in common (the feature being quantified by that distance), the information distance determines the distance between two strings according to the dominant feature in which they are similar. This means that, if we consider more than two strings, every pair may have a different dominant feature. If small strings differ by an information distance which is large compared to their sizes, then the strings are very different. However, if two very large strings differ by the same (now relatively small) information distance, then they are very similar. Therefore, the information distance itself is not suitable to express true similarity. For that we must define a relative information distance: we need to normalize the information distance. Such an approach was first proposed in [22] in the context of genomics-based phylogeny, and improved in [23] to the one we use here. The normalized information distance (NID) has values between 0 and 1, and it inherits the universality of the information distance in the sense that it minorizes, up to a vanishing additive term, every other possible normalized computable distance (suitably defined).

In the same way as before we can identify the computable normalized distances with computable similarities according to some features, and the NID discovers for every pair of strings the feature in which they are most similar, and expresses that similarity on a scale from 0 to 1 (0 being the same and 1 being completely different in the sense of sharing no features). Considering a set of strings, the feature in which two strings are most similar may be a different one for different pairs of strings. The NID is defined by

NID(x,y) = (K(x,y) - min(K(x), K(y))) / max(K(x), K(y)).   (2.2)

It has several wonderful properties that justify its description as the most informative metric [23].

2.2 Normalized Compression Distance: The NID is uncomputable since the Kolmogorov complexity is uncomputable. But we can use real data compression programs to approximate the Kolmogorov complexities K(x), K(y), K(x,y). A compression algorithm defines a computable function from strings to the lengths of the compressed versions of those strings. Therefore, the number of bits of the compressed version of a string is an upper bound on the Kolmogorov complexity of that string, up to an additive constant depending on the compressor but not on the string in question. Thus, if C is a compressor and we use C(x) to denote the length of the compressed version of a string x, then we arrive at the Normalized Compression Distance:

NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),   (2.3)

where for convenience we have replaced the pair (x,y) in the formula by the concatenation xy. This transition raises several tricky problems, for example how the NCD approximates the NID if C approximates K, see [8], which do not need to concern us here. Thus, the NCD is actually a family of compression functions parameterized by the given data compressor C. The NID is the limiting case, where K(x) denotes the number of bits in the shortest code for x from which x can be decompressed by a general purpose computable decompressor.
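A minimal sketch of (2.3), with bzip2 playing the role of the compressor C; any real compressor could be substituted.

    import bz2

    def C(s: bytes) -> int:
        # length of the compressed version of s, an upper bound on K(s) up to a constant
        return len(bz2.compress(s, 9))

    def ncd(x: bytes, y: bytes) -> float:
        cx, cy, cxy = C(x), C(y), C(x + y)   # the pair (x, y) replaced by the concatenation xy
        return (cxy - min(cx, cy)) / max(cx, cy)

Similar inputs typically yield a value close to 0 and unrelated inputs a value close to 1, mirroring the NID scale.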

3 Theory of Googling for Meaning

Every text corpus or particular user combined with a frequency extractor defines its own relative frequencies of words and phrases usage. In the world-wide-web and Google setting there are millions of users and text corpora, each with its own distribution. In the sequel, we show (and prove) that the Google distribution is universal for all the individual web users' distributions. The number of web pages currently indexed by Google is approaching 10^10. Every common search term occurs in millions of web pages. This number is so vast, and the number of web authors generating web pages is so enormous (and can be assumed to be a truly representative very large sample from humankind), that the probabilities of Google search terms, conceived as the frequencies of page counts returned by Google divided by the number of pages indexed by Google, approximate the actual relative frequencies of those search terms as actually used in society. Based on this premiss, the theory we develop in this paper states that the relations represented by the Normalized Google Distance (3.6) approximately capture the assumed true semantic relations governing the search terms. The NGD formula (3.6) only uses the probabilities of search terms extracted from the text corpus in question. We use the world-wide-web and Google, but the same method may be used with other text corpora like the King James version of the Bible or the Oxford English Dictionary and frequency count extractors, or the world-wide-web again and Yahoo as frequency count extractor. In these cases one obtains a semantics of the search terms biased by that text corpus and frequency extractor. To obtain the true relative frequencies of words and phrases in society is a major problem in applied linguistic research. This requires analyzing representative random samples of sufficient sizes. The question of how to sample randomly and representatively is a continuous source of debate. Our contention that the web is such a large and diverse text corpus, and Google such an able extractor, that the relative page counts approximate the true societal word- and phrases usage, starts to be supported by current real linguistics research [35,20].

3.1 The Google Distribution: Let the set of singleton Google search terms be denoted by S. In the sequel we use both singleton search terms and doubleton search terms {{x,y} : x,y ∈ S}. Let the set of web pages indexed (capable of being returned) by Google be Ω. The cardinality of Ω is denoted by M = |Ω|, and at the time of this writing 8·10^9 ≤ M ≤ 9·10^9 (and presumably greater by the time of reading this). Assume that a priori all web pages are equi-probable, with the probability of being returned by Google being 1/M. A subset of Ω is called an event. Every search term x usable by Google defines a singleton Google event x ⊆ Ω of web pages that contain an occurrence of x and are returned by Google if we do a search for x. Let L: Ω → [0,1] be the uniform mass probability function. The probability of such an event x is L(x) = |x|/M. Similarly, the doubleton Google event x ∩ y ⊆ Ω is the set of web pages returned by Google if we do a search for pages containing both search term x and search term y. The probability of this event is L(x ∩ y) = |x ∩ y|/M. We can also define the other Boolean combinations: ¬x = Ω\x and x ∪ y = ¬(¬x ∩ ¬y), each such event having a probability equal to its cardinality divided by M. If e is an event obtained from the basic events x, y, ..., corresponding to basic search terms x, y, ..., by finitely many applications of the Boolean operations, then the probability L(e) = |e|/M.

Google events capture in a particular sense all background knowledge about the search terms concerned available (to Google) on the web. The Google event x, consisting of the set of all web pages containing one or more occurrences of the search term x, thus embodies, in every possible sense, all direct context in which x occurs on the web.

REMARK 1. It is of course possible that parts of this direct contextual material link to other web pages in which x does not occur and thereby supply additional context. In our approach this indirect context is ignored. Nonetheless, indirect context may be important and future refinements of the method may take it into account. ◊

The event x consists of all possible direct knowledge on the web regarding x. Therefore, it is natural to consider code words for those events as coding this background knowledge. However, we cannot use the probability of the events directly to determine a prefix code, or, rather the underlying information content implied by the probability. The reason is that the events overlap and hence the summed probability exceeds 1. By the Kraft inequality [11] this prevents a corresponding set of code-word lengths. The solution is to normalize: We use the probability of the Google events to define a probability mass function over the set {{x,y} : x,y ∈ S} of Google search terms, both singleton and doubleton terms. There are |S| singleton terms, and (|S| choose 2) doubletons consisting of a pair of non-identical terms. Define

N = ∑_{{x,y}⊆S} |x ∩ y|,

counting each singleton set and each doubleton set (by definition unordered) once in the summation. Note that this means that for every pair {x,y} ⊆ S, with x ≠ y, the web pages z ∈ x ∩ y are counted three times: once in x = x ∩ x, once in y = y ∩ y, and once in x ∩ y.

Since every web page that is indexed by Google contains at least one occurrence of a search term, we have N ≥ M. On the other hand, web pages contain on average not more than a certain constant α search terms. Therefore, N ≤ αM. Define

g(x) = g(x,x),   g(x,y) = L(x ∩ y)M/N = |x ∩ y|/N.   (3.4)

Then, ∑_{{x,y}⊆S} g(x,y) = 1. This g-distribution changes over time, and between different samplings from the distribution. But let us imagine that g holds in the sense of an instantaneous snapshot. The real situation will be an approximation of this. Given the Google machinery, these are absolute probabilities which allow us to define the associated prefix code-word lengths (information contents) for both the singletons and the doubletons. The Google code G is defined by

G(x) = G(x,x),   G(x,y) = log 1/g(x,y).   (3.5)

In contrast to strings x, where the complexity C(x) represents the length of the compressed version of x using compressor C, for a search term x (just the name for an object rather than the object itself) the Google code of length G(x) represents the shortest expected prefix-code word length of the associated Google event x. The expectation is taken over the Google distribution g. In this sense we can use the Google distribution as a compressor for the Google "meaning" associated with the search terms. The associated NCD, now called the normalized Google distance (NGD), is then defined by (3.6), and can be rewritten as the right-hand expression:

NGD(x,y) = (G(x,y) - min(G(x), G(y))) / max(G(x), G(y))
         = (max{log f(x), log f(y)} - log f(x,y)) / (log N - min{log f(x), log f(y)}),   (3.6)

where f(x) denotes the number of pages containing x, and f(x,y) denotes the number of pages containing both x and y, as reported by Google. This NGD is an approximation to the NID of (2.2) using the prefix code-word lengths (Google code) generated by the Google distribution as defining a compressor approximating the length of the Kolmogorov code, using the background knowledge on the web as viewed by Google as conditional information. In practice, we use the page counts returned by Google for the frequencies, and we have to choose N. From the right-hand term in (3.6) it is apparent that by increasing N we decrease the NGD, everything gets closer together, and by decreasing N we increase the NGD, everything gets further apart. Our experiments suggest that every reasonable value (M or a value greater than any f(x)) can be used as normalizing factor N, and our results seem in general insensitive to this choice. In our software, this parameter N can be adjusted as appropriate, and we often use M for N. The following are the main properties of the NGD (as long as we choose parameter N ≥ M):

1. The range of the NGD is in between 0 and ∞ (sometimes slightly negative if the Google counts are untrustworthy and state f(x,y) > min{f(x), f(y)}, see Section 1.4);

(a) If x = y, or if x ≠ y but frequency f(x) = f(y) = f(x,y) > 0, then NGD(x,y) = 0. That is, the semantics of x and y in the Google sense is the same.

(b) If frequency f(x) = 0, then for every search term y we have f(x,y) = 0, and the NGD(x,y) = log f(y) / log(N/f(y)).

2. The NGD is always nonnegative and NGD(x,x) = 0 for every x. For every pair x, y we have NGD(x,y) = NGD(y,x): it is symmetric. However, the NGD is not a metric: it does not satisfy NGD(x,y) > 0 for every x ≠ y. As before, let x denote the set of web pages containing one or more occurrences of x. For example, choose x ≠ y with x = y, that is, two different terms that occur on exactly the same set of web pages. Then, f(x) = f(y) = f(x,y) and NGD(x,y) = 0. Nor does the NGD satisfy the triangle inequality NGD(x,y) ≤ NGD(x,z) + NGD(z,y) for all x, y, z. For example, choose z = x ∪ y, x ∩ y = ∅, x = x ∩ z, y = y ∩ z, and |x| = |y| = √N. Then, f(x) = f(y) = f(x,z) = f(y,z) = √N, f(z) = 2√N, and f(x,y) = 0. This yields NGD(x,y) = 1 and NGD(x,z) = NGD(z,y) = 2/log(N/4), which violates the triangle inequality for N > 64.

3. The NGD is scale-invariant in the following sense: If the number N of pages indexed by Google (accounting for the multiplicity of different search terms per page) grows, the number of pages containing a given search term goes to a fixed fraction of N, and so does the number of pages containing a given conjunction of search terms. This means that if N doubles, then so do the f-frequencies. For the NGD to give us an objective semantic relation between search terms, it needs to become stable when the number N grows unboundedly.
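To make the role of the normalizing factor N concrete, the following small sketch (ours) re-evaluates the horse/rider counts of Section 1.1: with the f-counts held fixed, growing N shrinks the NGD, whereas scaling the counts proportionally with N leaves the NGD unchanged, which is the scale-invariance just described.

    import math

    def ngd(f_x, f_y, f_xy, n):   # same helper as in the Section 1.1 sketch
        num = max(math.log(f_x), math.log(f_y)) - math.log(f_xy)
        return num / (math.log(n) - min(math.log(f_x), math.log(f_y)))

    f = (46_700_000, 12_200_000, 2_630_000)
    n = 8_058_044_651
    print(ngd(*f, n))                             # approx. 0.443
    print(ngd(*f, 100 * n))                       # smaller: counts fixed, N grown
    print(ngd(*(100 * c for c in f), 100 * n))    # approx. 0.443 again: scale-invariant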

3.2 Universality of Google Distribution: A central notion in the application of compression to learning is the notion of "universal distribution," see [24]. Consider an effective enumeration P = p_1, p_2, ... of probability mass functions with domain S. The list P can be finite or countably infinite.

DEFINITION 1. A probability mass function p_u occurring in P is universal for P if for every p_i in P there is a constant c_i > 0, with ∑_{i≠u} c_i ≥ 1, such that for every x ∈ S we have p_u(x) ≥ c_i · p_i(x). Here c_i may depend on the indexes u, i, but not on the functional mappings of the elements of list P nor on x.

If p_u is universal for P, then it immediately follows that for every p_i in P, the prefix code-word length for source word x, see [11], associated with p_u, minorizes the prefix code-word length associated with p_i, by satisfying log 1/p_u(x) ≤ log 1/p_i(x) + log 1/c_i, for every x ∈ S.

In the following we consider partitions of the set of web pages, each subset in the partition together with a probability mass function of search terms. For example, we may consider the list A = 1, 2, ..., a of web authors producing pages on the web, and consider the set of web pages produced by each web author, or some other partition. "Web author" is just a metaphor we use for convenience. Let web author i of the list A produce the set of web pages Ω_i and denote M_i = |Ω_i|. We identify a web author i with the set of web pages Ω_i he produces. Since we have no knowledge of the set of web authors, we consider every possible partition of Ω into one or more equivalence classes, Ω = Ω_1 ∪ ··· ∪ Ω_a, Ω_i ∩ Ω_j = ∅ (1 ≤ i ≠ j ≤ a ≤ |Ω|), as defining a realizable set of web authors A = 1, ..., a.

Consider a partition of Ω into Ω_1, ..., Ω_a. A search term x usable by Google defines an event x_i ⊆ Ω_i of web pages produced by web author i that contain search term x. Similarly, x_i ∩ y_i is the set of web pages produced by i that is returned by Google searching for pages containing both search term x and search term y. Let

N_i = ∑_{{x,y}⊆S} |x_i ∩ y_i|.

Note that there is an α_i ≥ 1 such that M_i ≤ N_i ≤ α_i M_i. For every search term x ∈ S define a probability mass function g_i, the individual web author's Google distribution, on the sample space {{x,y} : x,y ∈ S} by

g_i(x) = g_i(x,x),   g_i(x,y) = |x_i ∩ y_i|/N_i.   (3.7)

Then, ∑_{{x,y}⊆S} g_i(x,y) = 1.

THEOREM 3.1. Let Ω_1, ..., Ω_a be any partition of Ω into subsets (web authors), and let g_1, ..., g_a be the corresponding individual Google distributions. Then the Google distribution g is universal for the enumeration g, g_1, ..., g_a.

Proof. We can express the overall Google distribution in terms of the individual web author's distributions:

g(x,y) = ∑_{i∈A} (N_i/N) g_i(x,y).

Consequently, g(x,y) ≥ (N_i/N) g_i(x,y). Since also g(x,y) ≥ g(x,y), we have shown that g(x,y) is universal for the family g, g_1, ..., g_a of individual web author's Google distributions, according to Definition 1.

REMARK 2. Let us show that, for example, the uniform distribution L(x) = 1/s (s = |S|) over the search terms x ∈ S is not universal, for s > 2. By the requirement ∑ c_i ≥ 1, the sum taken over the number a of web authors in the list A, there is an i such that c_i ≥ 1/a. Taking the uniform distribution on say s search terms assigns probability 1/s to each of them. By the definition of universality of a probability mass function for the list of individual Google probability mass functions g_i, we can choose the function g_i freely (as long as a ≥ 2, and there is another function g_j to exchange probabilities of search terms with). So choose some search term x and set g_i(x) = 1, and g_i(y) = 0 for all search terms y ≠ x. Then, we obtain 1/s = L(x) ≥ c_i g_i(x) = c_i ≥ 1/a. This yields the required contradiction for s > a ≥ 2. ◊

3.3 Universality of Normalized Google Distance: Every individual web author produces both an individual Google distribution g_i, and an individual prefix code-word length G_i associated with g_i (see [11] for this code) for the search terms.

DEFINITION 2. The associated individual normalized Google distance NGD_i of web author i is defined according to (3.6), with G_i substituted for G.

These Google distances NGD_i can be viewed as the individual semantic distances according to the bias of web author i. These individual semantics are subsumed in the general Google semantics in the following sense: The normalized Google distance is universal for the family of individual normalized Google distances, in the sense that it is about as small as the least individual normalized Google distance, with high probability. Hence the Google semantics as evoked by all of the web society in a certain sense captures the biases or knowledge of the individual web authors. In Theorem 3.2 we show that, for every k ≥ 1, the inequality

NGD(x,y) < β NGD_i(x,y) + γ,   (3.8)

with β = (min{G_i(x), G_i(y)} - log k)/min{G_i(x), G_i(y)} and γ = (log kN/N_i)/min{G(x), G(y)}, is satisfied with probability going to 1 with growing k.

REMARK 3. To interpret (3.8), we observe that in case G_i(x) and G_i(y) are large with respect to log k, then β ≈ 1. If moreover log N/N_i is large with respect to log k, then γ ≈ (log N/N_i)/min{G(x), G(y)}. Let us estimate γ under reasonable assumptions. Without loss of generality assume G(x) ≤ G(y). If f(x) = |x|, the number of pages returned on query x, then G(x) = log(N/f(x)). Thus, γ ≈ (log N - log N_i)/(log N - log f(x)). The uniform expectation of N_i is N/|A|, and N divided by that expectation of N_i equals |A|, the number of web authors producing web pages. The uniform expectation of f(x) is N/|S|, and N divided by that expectation of f(x) equals |S|, the number of Google search terms we use. Thus, the more the number of search terms exceeds the number of web authors, the more γ goes to 0 in expectation. ◊

REMARK 4. To understand (3.8), we may consider the codelengths involved as the Google database changes over time. It is reasonable to expect that both the total number of pages as well as the total number of search terms in the Google database will continue to grow for some time. In this period, the sum total probability mass will be carved up into increasingly smaller pieces for more and more search terms. The maximum singleton and doubleton codelengths within the Google database will grow. But the universality property of the Google distribution implies that the Google distribution's code length for almost all particular search terms will be within a negligible error of the best codelength among any of the individual web authors. The size of this gap will grow more slowly than the codelength for any particular search term over time. Thus, the coding space that is suboptimal in the Google distribution's code is an ever-smaller piece (in terms of proportion) of the total coding space. ◊

THEOREM 3.2. (i) For every pair of search terms x, y, the set of web authors B for which (3.8) holds has ∑{N_i/N : i ∈ B} > (1 - 1/k)^2. That is, if we select a web page uniformly at random from the total set Ω of web pages, then we have probability > (1 - 1/k)^2 that (3.8) holds for the web author i that authored the selected web page.

(ii) For every web author i ∈ A, the g_i-probability concentrated on the pairs of search terms for which (3.8) holds is at least (1 - 2/k)^2.

Proof. The prefix code-word lengths G_i associated with g_i satisfy G(x) ≤ G_i(x) + log N/N_i and G(x,y) ≤ G_i(x,y) + log N/N_i. Substituting G(x,y) by G_i(x,y) + log N/N_i in the middle term of (3.6), we obtain

NGD(x,y) ≤ (G_i(x,y) - max{G(x), G(y)} + log N/N_i) / min{G(x), G(y)}.   (3.9)

Markov's Inequality says the following: Let p be any probability mass function; let f be any nonnegative function with p-expected value E = ∑_i p(i) f(i) < ∞. For E > 0 we have ∑_i {p(i) : f(i)/E > k} < 1/k.

(i) Define a probability mass function p(i) = N_i/N for i ∈ A. For every x ∈ S, the Google probability mass value g(x) equals the expectation of the individual Google probability mass values at x, that is, ∑_{i∈A} p(i) g_i(x). We can now argue as follows: for every x as above, we can consider x fixed and use "g_i(x)" for "f(i)", and "g(x)" for "E" in Markov's Inequality. Then, ∑_i {p(i) : g_i(x)/g(x) > k} < 1/k, and therefore ∑_i {p(i) : g_i(x)/g(x) ≥ k} > 1 - 1/k. Hence, there is an x-dependent subset of web authors B_x = {i ∈ A : g_i(x)/g(x) ≥ k}, such that ∑{p(i) : i ∈ B_x} > 1 - 1/k. By definition g_i(x) ≥ k g(x) for i ∈ B_x. Then, for i ∈ B_x, we have G_i(x) - log k ≤ G(x). Substitute G_i(x) - log k for G(x), with probability > (1 - 1/k) that G_i(x) - log k ≤ G(x) for the web author i that authored a web page that is selected uniformly at random from the set Ω of all web pages. Recall that ∑{p(i) : i ∈ B_x} = ∑{N_i : i ∈ B_x}/N. Similarly this holds for search term y with respect to the equivalently defined B_y. The combination of search terms x and y therefore satisfies both G_i(x) - log k ≤ G(x) and G_i(y) - log k ≤ G(y), for i ∈ B with B = B_x ∩ B_y. Then, ∑{p(i) : i ∈ B} = ∑{N_i : i ∈ B}/N ≥ (∑{N_i : i ∈ B_x}/N) × (∑{N_i : i ∈ B_y}/N) > (1 - 1/k)^2. Hence, the probability that both search terms x and y satisfy G_i(x) - log k ≤ G(x) and G_i(y) - log k ≤ G(y), respectively, for the web author i that authored a web page selected uniformly at random from Ω, is greater than (1 - 1/k)^2. Substitute both G(x) and G(y) according to these probabilistically satisfied inequalities in (3.9), both in the max-term in the numerator, and in the min-term in the denominator. This proves item (i).

(ii) Fix web author i ∈ A. We consider the conditional probability mass functions g*(x) = g(x | x ∈ S) and g_i*(x) = g_i(x | x ∈ S) over single search terms: The g_i*-expected value of g*(x)/g_i*(x) is

∑_x g_i*(x) (g*(x)/g_i*(x)) ≤ 1.

Then, by Markov's Inequality,

∑_x {g_i*(x) : g*(x) ≤ j g_i*(x)} > 1 - 1/j.   (3.10)

Let ∑_{x∈S} g(x) = h and ∑_{x∈S} g_i(x) = h_i. Since the probability of an event of a doubleton set of search terms is not greater than that of an event based on either of the constituent search terms, 1 ≥ h, h_i ≥ 1/2. Therefore, 2g(x) ≥ g*(x) ≥ g(x) and 2g_i(x) ≥ g_i*(x) ≥ g_i(x). Then, for the search terms x satisfying (3.10), we have

∑_x {g_i(x) : g(x) ≤ 2j g_i(x)} > 1 - 1/j.

For the x's with g(x) ≤ 2j g_i(x) we have G_i(x) ≤ G(x) + log 2j. Substitute G_i(x) - log 2j for G(x) (there is g_i-probability ≥ 1 - 1/j that G_i(x) - log 2j ≤ G(x)) and G_i(y) - log 2j ≤ G(y) in (3.9), both in the max-term in the numerator, and in the min-term in the denominator. Noting that the two g_i-probabilities (1 - 1/j) are independent, the total g_i-probability that both substitutions are justified is at least (1 - 1/j)^2. Substituting k = 2j proves item (ii).

Therefore, the Google normalized distance minorizes every normalized compression distance based on a particular user's generated probabilities of search terms, with high probability up to an error term that in typical cases is ignorable.

4 Applications and Experiments

4.1 Hierarchical Clustering: We used the CompLearn software tool, available on the web [6], the same tool that has been used in our earlier papers [8,7], to construct trees representing hierarchical clusters of objects in an unsupervised way. However, now we use the normalized Google distance (NGD) instead of the normalized compression distance (NCD). The method works by first calculating a distance matrix whose entries are the pairwise NGD's of the terms in the input list. Then it calculates a best-matching unrooted ternary tree using a novel quartet-method style heuristic based on randomized hill-climbing using a new fitness objective function for the candidate trees. Let us briefly explain what the method does; for more explanation see [9,8]. Given a set of objects as points in a space provided with a (not necessarily metric) distance measure, the associated distance matrix has as entries the pairwise distances between the objects. Regardless of the original space and distance measure, it is always possible to configure n objects in n-dimensional Euclidean space in such a way that the associated distances are identical to the original ones, resulting in an identical distance matrix. This distance matrix contains the pairwise distance relations according to the chosen measure in raw form. But in this format that information is not easily usable, since for n > 3 our cognitive capabilities rapidly fail. Just as the distance matrix is a reduced form of information representing the original data set, we now need to reduce the information even further in order to achieve a cognitively acceptable format like data clusters. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) that agrees with the distance matrix according to a fidelity measure.

Figure 1: Colors and numbers arranged into a tree using NGD.

This allows us to extract more information from the data than just flat clustering (determining disjoint clusters in dimensional representation). This method does not just take the strongest link in each case as the "true" one and ignore all others; instead the tree represents all the relations in the distance matrix with as little distortion as is possible. In the particular examples we give below, as in all clustering examples we did but not depicted, the fidelity was close to 1, meaning that the relations in the distance matrix are faithfully represented in the tree. The objects to be clustered are search terms consisting of the names of colors, numbers, and some tricky words. The program automatically organized the colors towards one side of the tree and the numbers towards the other, Figure 1. It arranges the terms which have as only meaning a color or a number, and nothing else, on the farthest reach of the color side and the number side, respectively. It puts the more general terms black and white, and zero, one, and two, towards the center, thus indicating their more ambiguous interpretation. Also, things which were not exactly colors or numbers are also put towards the center, like the word "small". We may consider this an example of automatic ontology creation. As far as the authors know there do not exist other experiments that create this type of semantic meaning from nothing (that is, automatically from the web using Google). Thus, there is no baseline to compare against; rather the current experiment can be a baseline to evaluate the behavior of future systems.
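The following sketch (ours) shows the shape of such a pipeline: it fills the pairwise NGD matrix for a list of terms and, as a stand-in for CompLearn's randomized quartet-tree heuristic, hands the matrix to ordinary average-linkage agglomerative clustering from SciPy. Here page_count(query) is a hypothetical helper that returns the number of pages a search engine reports for the query, and n is the normalizing factor N of Section 3.

    import math
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    def ngd(x, y, n, page_count):
        # f(x), f(y): pages containing each term; f(x,y): pages containing both
        fx, fy, fxy = page_count(x), page_count(y), page_count(x + " " + y)
        lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
        return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

    def ngd_tree(terms, n, page_count):
        k = len(terms)
        d = [[0.0] * k for _ in range(k)]
        for i in range(k):
            for j in range(i + 1, k):
                d[i][j] = d[j][i] = ngd(terms[i], terms[j], n, page_count)
        # condensed distance vector, then agglomerative clustering (not the quartet method)
        return linkage(squareform(d), method="average")

The resulting linkage can be drawn as a dendrogram; CompLearn instead searches for the unrooted ternary tree that maximizes its fitness score S(T).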

4.2 Dutch 17th Century Painters: In the example of Figure 2, the names of fifteen paintings by Steen, Rembrandt, and Bol were entered. In the experiment, only painting title names were used; the associated painters are given below. We do not know of comparable experiments to use as baseline to judge the performance; this is a new type of contents clustering made possible by the existence of the web and search engines.

Figure 2: Hierarchical clustering of pictures (CompLearn version 0.8.19, compressor: google, tree score S(T) = 0.940019).

The painters and paintings used are as follows:

Rembrandt van Rijn: Hendrickje slapend; Portrait of Maria Trip; Portrait of Johannes Wtenbogaert; The Stone Bridge; The Prophetess Anna;

Jan Steen: Leiden Baker Arend Oostwaert; Keyzerswaert; Two Men Playing Backgammon; Woman at her Toilet; Prince's Day; The Merry Family;

Ferdinand Bol: Maria Rey; Consul Titus Manlius Torquatus; Swartenhout; Venus and Adonis.

4.3 English Novelists: Another example is English novelists. The authors and texts used are:

William Shakespeare: A Midsummer Night's Dream; Julius Caesar; Love's Labours Lost; Romeo and Juliet;

Jonathan Swift: The Battle of the Books; Gulliver's Travels; Tale of a Tub; A Modest Proposal;

Oscar Wilde: Lady Windermere's Fan; A Woman of No Importance; Salome; The Picture of Dorian Gray.

The clustering is given in Figure 3, and to provide a feeling for the figures involved we give the associated NGD matrix in Figure 4. The S(T) value in Figure 3 gives the fidelity of the tree as a representation of the pairwise distances in the NGD matrix (1 is perfect and 0 is as bad as possible. For details see [6,8]). The question arises why we should expect this. Are names of artistic objects so distinct? (Yes. The point also being that the distances from every single object to all other objects are involved. The tree takes this global aspect into account and therefore disambiguates other meanings of the objects to retain the meaning that is relevant for this collection.) Is the distinguishing feature subject matter or title style? (In these experiments with objects belonging to the cultural heritage it is clearly a subject matter.

Figure 3: Hierarchical clustering of authors (CompLearn version 0.8.19, compressor: google, tree score S(T) = 0.940416).

The pairwise NGD values are as follows (columns in the same order as the rows; only the row labels are shown):

A Woman of No Importance:   0.000 0.458 0.479 0.444 0.494 0.149 0.362 0.471 0.371 0.300 0.278 0.261
A Midsummer Night's Dream:  0.458 -0.011 0.563 0.382 0.301 0.506 0.340 0.244 0.499 0.537 0.535 0.425
A Modest Proposal:          0.479 0.573 0.002 0.323 0.506 0.575 0.607 0.502 0.605 0.335 0.360 0.463
Gulliver's Travels:         0.445 0.392 0.323 0.000 0.368 0.509 0.485 0.339 0.535 0.285 0.330 0.228
Julius Caesar:              0.494 0.299 0.507 0.368 0.000 0.611 0.313 0.211 0.373 0.491 0.535 0.447
Lady Windermere's Fan:      0.149 0.506 0.575 0.565 0.612 0.000 0.524 0.604 0.571 0.347 0.347 0.461
Love's Labours Lost:        0.363 0.332 0.607 0.486 0.313 0.525 0.000 0.351 0.549 0.514 0.462 0.513
Romeo and Juliet:           0.471 0.248 0.502 0.339 0.210 0.604 0.351 0.000 0.389 0.527 0.544 0.380
Salome:                     0.371 0.499 0.605 0.540 0.373 0.568 0.553 0.389 0.000 0.520 0.538 0.407
Tale of a Tub:              0.300 0.537 0.335 0.284 0.492 0.347 0.514 0.527 0.524 0.000 0.160 0.421
The Battle of the Books:    0.278 0.535 0.359 0.330 0.533 0.347 0.462 0.544 0.541 0.160 0.000 0.373
The Picture of Dorian Gray: 0.261 0.415 0.463 0.229 0.447 0.324 0.513 0.380 0.402 0.420 0.373 0.000

Figure 4: Distance matrix of pairwise NGD's

To stress the point we used "Julius Caesar" of Shakespeare. This term occurs on the web overwhelmingly in other contexts and styles. Yet the collection of the other objects used, and the semantic distance towards those objects, determined the meaning of "Julius Caesar" in this experiment.) Does the system get confused if we add more artists? (Representing the NGD matrix in bifurcating trees without distortion becomes more difficult for, say, more than 25 objects. See [8].) What about other subjects, like music, sculpture? (Presumably, the system will be more trustworthy if the subjects are more common on the web.) These experiments are representative for those we have performed with the current software. We did not cherry-pick the best outcomes. For example, all experiments with these three English writers, with different selections of four works of each, always yielded a tree such that we could draw a convex hull around the works of each author, without overlap. Interestingly, a similar experiment with Russian authors gave worse results. This performance can be a baseline to judge future systems against. The reader can do his own experiments to satisfy his curiosity using our publicly available CompLearn software tool [6], also used in the depicted experiments. On the CompLearn web site the ongoing cumulated results of all (in December 2005 some 160) experiments by the public, including the ones depicted here, are recorded.

4.4 SVM–NGD Learning: We augment the Google method by adding a trainable component of the learning system. Here we use the Support Vector Machine (SVM) as a trainable component. For the SVM method used in this paper, we refer to the exposition [4]. We use the LIBSVM software for all of our SVM experiments.

The setting is a binary classification problem on examples represented by search terms. We require a human expert to provide a list of at least 40 training words, consisting of at least 20 positive examples and 20 negative examples, to illustrate the contemplated concept class. The expert also provides, say, six anchor words a_1, ..., a_6, of which half are in some way related to the concept under consideration. Then, we use the anchor words to convert each of the 40 training words w_1, ..., w_40 to 6-dimensional training vectors v_1, ..., v_40. The entry v_{j,i} of v_j = (v_{j,1}, ..., v_{j,6}) is defined as v_{j,i} = NGD(w_j, a_i) (1 ≤ j ≤ 40, 1 ≤ i ≤ 6). The training vectors are then used to train an SVM to learn the concept, and then test words may be classified using the same anchors and trained SVM model.
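A hedged sketch of this setup, with scikit-learn's SVC (a wrapper around LIBSVM) as the trainable component; the ngd function and the word lists are assumed to be supplied as described above.

    from sklearn.svm import SVC

    def to_vector(word, anchors, ngd):
        # v_{j,i} = NGD(w_j, a_i): one coordinate per anchor word
        return [ngd(word, a) for a in anchors]

    def train_concept(pos_words, neg_words, anchors, ngd):
        X = [to_vector(w, anchors, ngd) for w in pos_words + neg_words]
        y = [1] * len(pos_words) + [0] * len(neg_words)
        clf = SVC(kernel="rbf")   # kernel width and error cost may be tuned by cross-validation
        clf.fit(X, y)
        return clf

    # a new term is classified with the same anchors:
    # train_concept(...).predict([to_vector("heat stroke", anchors, ngd)])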

In Figure 5, we trained using a list of "emergencies" as positive examples, and a list of "almost emergencies" as negative examples. The figure is self-explanatory. The accuracy on the test set is 75%. In Figure 6 the method learns to distinguish prime numbers from non-prime numbers by example; it illustrates several common features of our method that distinguish it from the strictly deductive techniques.

4.5 NGD Translation: Yet another potential application of the NGD method is in natural language translation. (In the experiment below we don't use SVMs to obtain our result, but determine correlations instead.) Suppose we are given a system that tries to infer a translation-vocabulary among English and Spanish. Assume that the system has already determined that there are five words that appear in two different matched sentences, but the permutation associating the English and Spanish words is, as yet, undetermined. This setting can arise in real situations, because English and Spanish have different rules for word-ordering. Thus, at the outset we assume a pre-existing vocabulary of eight English words with their matched Spanish translation. Can we infer the correct permutation mapping the unknown words using the pre-existing vocabulary as a basis? We start by forming an NGD matrix using additional English words of which the translation is known, Figure 7.

Training Data

Positive Training (22 cases): avalanche, bomb threat, broken leg, burglary, car collision, death threat, fire, flood, gas leak, heart attack, hurricane, landslide, murder, overdose, pneumonia, rape, roof collapse, sinking ship, stroke, tornado, train wreck, trapped miners

Negative Training (25 cases): arthritis, broken dishwasher, broken toe, cat in tree, contempt of court, dandruff, delayed train, dizziness, drunkenness, enumeration, flat tire, frog, headache, leaky faucet, littering, missing dog, paper cut, practical joke, rain, roof leak, sore throat, sunset, truancy, vagrancy, vulgarity

Anchors (6 dimensions): crime, happy, help, safe, urgent, wash

Testing Results

Positive tests predicted positive: electrocution, heat stroke, homicide, looting, meningitis, robbery, suicide
Negative tests predicted positive: pregnancy, traffic jam
Positive tests predicted negative: sprained ankle
Negative tests predicted negative: acne, annoying sister, campfire, desk, mayday, meal

Accuracy: 15/20 = 75.00%

Figure 5: Google-SVM learning of "emergencies."

Training Data

Positive Training (21 cases): 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73

Negative Training (22 cases): 4, 6, 8, 9, 10, 12, 14, 15, 16, 18, 20, 21, 22, 24, 25, 26, 27, 28, 30, 32, 33, 34

Anchors (5 dimensions): composite, number, orange, prime, record

Testing Results

Positive tests predicted positive: 79, 83, 89, 91, 97, 101, 103, 107, 109
Negative tests predicted positive: 110
Negative tests predicted negative: 36, 38, 40, 42, 44, 45, 46, 48, 49

Accuracy: 18/19 = 94.74%

Figure 6: Google-SVM learning of primes.

Given starting vocabulary (English/Spanish): tooth/diente, joy/alegria, tree/arbol, electricity/electricidad, table/tabla, money/dinero, sound/sonido, music/musica.

Unknown-permutation vocabulary: English words plant, car, dance, speak, friend; Spanish words (correspondence unknown) bailar, hablar, amigo, coche, planta.

Figure 7: English-Spanish Translation Problem

Predicted (optimal) permutation (English/Spanish): plant/planta, car/coche, dance/bailar, speak/hablar, friend/amigo.

Figure 8: Translation Using NGD

We label the columns by the translation-known English words, the rows by the translation-unknown words. The entries of the matrix are the NGD's of the English words labeling the columns and rows. This constitutes the English basis matrix. Next, consider the known Spanish words corresponding to the known English words. Form a new matrix with the known Spanish words labeling the columns in the same order as the known English words. Label the rows of the new matrix by choosing one of the many possible permutations of the unknown Spanish words. For each permutation, form the NGD matrix for the Spanish words, and compute the pairwise correlation of this sequence of values to each of the values in the given English word basis matrix. Choose the permutation with the highest positive correlation. If there is no positive correlation, report a failure to extend the vocabulary. In this example, the computer inferred the correct permutation for the testing words, see Figure 8.
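The procedure can be sketched as follows (our illustration, assuming an ngd function applied to the respective language's search terms): score every permutation of the unknown Spanish words by the correlation between its NGD matrix and the English basis matrix, and keep the best positively correlated one.

    from itertools import permutations
    import math

    def ngd_matrix(row_words, col_words, ngd):
        # rows: translation-unknown words; columns: translation-known words
        return [ngd(r, c) for r in row_words for c in col_words]

    def pearson(u, v):
        n = len(u)
        mu, mv = sum(u) / n, sum(v) / n
        cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
        su = math.sqrt(sum((a - mu) ** 2 for a in u))
        sv = math.sqrt(sum((b - mv) ** 2 for b in v))
        return cov / (su * sv)

    def best_permutation(known_en, known_es, unknown_en, unknown_es, ngd):
        basis = ngd_matrix(unknown_en, known_en, ngd)       # English basis matrix
        best, best_corr = None, 0.0
        for perm in permutations(unknown_es):
            corr = pearson(ngd_matrix(list(perm), known_es, ngd), basis)
            if corr > best_corr:
                best, best_corr = perm, corr
        return best   # None means no positive correlation: report failure

The search space here is only 5! = 120 permutations; the paper reports that the correct assignment of Figure 8 was recovered.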

5 Systematic Comparison with WordNet Semantics

WordNet [30] is a semantic concordance of English. It focusses on the meaning of words by dividing them into categories. We use this as follows. A category we want to learn, the concept, is termed, say, "electrical", and represents anything that may pertain to electronics. The negative examples are constituted by simply everything else. This category represents a typical expansion of a node in the WordNet hierarchy. In an experiment we ran, the accuracy on the test set is 100%: It turns out that "electrical terms" are unambiguous and easy to learn and classify by our method. The information in the WordNet database is entered over the decades by human experts and is precise. The database is an academic venture and is publicly accessible. Hence it is a good baseline against which to judge the accuracy of our method in an indirect manner. While we cannot directly compare the semantic distance, the NGD, between objects, we can indirectly judge how accurate it is by using it as the basis for a learning algorithm. In particular, we investigated how well semantic categories as learned using the NGD–SVM approach agree with the corresponding WordNet categories. For details about the structure of WordNet we refer to the official WordNet documentation available online. We considered 100 randomly selected semantic categories from the WordNet database. For each category we executed the following sequence. First, the SVM is trained on 50 labeled training samples. The positive examples are randomly drawn from the WordNet database in the category in question. The negative examples are randomly drawn from a dictionary. While the latter examples may be false negatives, we consider the probability negligible. Per experiment we used a total of six anchors, three of which are randomly drawn from the WordNet database category in question, and three of which are drawn from the dictionary. Subsequently, every example is converted to a 6-dimensional vector using NGD. The i-th entry of the vector is the NGD between the i-th anchor and the example concerned (1 ≤ i ≤ 6). The SVM is trained on the resulting labeled vectors. The kernel-width and error-cost parameters are automatically determined using five-fold cross validation. Finally, testing of how well the SVM has learned the classifier is performed using 20 new examples in a balanced ensemble of positive and negative examples obtained in the same way, and converted to 6-dimensional vectors in the same manner, as the training examples. This results in an accuracy score of correctly classified test examples. We ran 100 experiments. The actual data are available at [5]. A histogram of agreement accuracies is shown in Figure 9. On average, our method turns out to agree well with the WordNet semantic concordance made by human experts. The mean of the accuracies of agreements is 0.8725. The variance is ≈ 0.01367, which gives a standard deviation of ≈ 0.1169. Thus, it is rare to find agreement less than 75%. The total number of Google searches involved in this randomized automatic trial is upper bounded by 100 × 70 × 6 × 3 = 126,000. A considerable savings resulted from the fact
