nettime's_roving_reporter on Tue, 27 Aug 2002 15:49:18 +0200 (CEST)



<nettime> googlewatch: PageRank -- Google's Original Sin


     [via <tbyfield@panix.com>]

<http://www.google-watch.org/pagerank.html>

                      PageRank: Google's Original Sin

                              by Daniel Brandt
                                August 2002

     By 1998, the dot-com gold rush was in full swing. Web search
     engines had been around since 1995, and had been immediately touted
     by high-tech pundits (and Forbes magazine) as one more element in
     the magical mix that would make us all rich. Such innovations meant
     nothing less than the end of the business cycle.

     But the truth of the matter, as these same pundits conceded after
     the crash, was that the false promise of easy riches put
     bottom-line pressures on companies that should have known better.
     One of the most successful of the earliest search engines was
     AltaVista, then owned by Digital Equipment Corporation. By 1998 it
     began to lose its way. All the pundits were talking "portals," so
     AltaVista tried to become a portal, and forgot to work on improving
     its search ranking algorithms.

     Even by 1998, it was clear that the average search engine returned
     far too many results for the one or two keywords a searcher
     typically entered. AltaVista offered numerous ways
     to zero in on specific combinations of keywords, but paid much less
     attention to the "ranking" problem. Ranking, or the ordering of
     returned results according to some criteria, was where the action
     should have been. Users don't want to figure out Boolean logic, and
     they will not be looking at more than the first twenty matches out
     of the thousands that might be produced by a search engine. What
     really matters is how useful the first page of results appears on
     search engine A, as opposed to the results produced by the same
     terms entered into engine B. AltaVista was too busy trying to be a
     portal to notice that this was important.

                                Enter Google

     By early 1998, Stanford University grad students Larry Page and
     Sergey Brin had been playing around with a particular ranking
     algorithm. They presented a paper titled "The Anatomy of a
     Large-Scale Hypertextual Web Search Engine" at a World Wide Web
     conference. With Stanford as the assignee and Larry Page as the
     inventor, a patent was filed on January 9, 1998. By the time it was
     finally granted on September 4, 2001 (Patent No. 6,285,999), the
     algorithm was known as "PageRank," and Google was handling 150
     million search queries per day. AltaVista continued to fade; even
     two changes of ownership didn't make a difference.

     Google hyped PageRank, because it was a convenient buzzword that
     satisfied those who wondered why Google's engine did, in fact,
     provide better results. Even today, Google is proud of their
     advantage. The hype approaches the point where bloggers sometimes
     have to specify what they mean by "PR" -- do they mean PageRank,
     the algorithm, or do they mean the Public Relations that Google
     does so well:

     PageRank relies on the uniquely democratic nature of the web by
     using its vast link structure as an indicator of an individual
     page's value. In essence, Google interprets a link from page A to
     page B as a vote, by page A, for page B. But, Google looks at more
     than the sheer volume of votes, or links a page receives; it also
     analyzes the page that casts the vote. Votes cast by pages that are
     themselves "important" weigh more heavily and help to make other
     pages "important."

     Google goes on to admit that other variables are also used, in
     addition to PageRank, in determining the relevance of a page. While
     the broad outlines of these additional variables are easily
     discerned by webmasters who study how to improve the ranking of
     their websites, the actual details of all algorithms are considered
     trade secrets by Google, Inc. It's in Google's interest to make it
     as difficult as possible for webmasters to cheat on their rankings.

                          It's all in the ranking

     Beyond any doubt, search engines have become increasingly important
     on the web. E-commerce is very attuned to the ranking issue,
     because higher ranking translates directly into more sales. Engines
     have devised various methods to monetize ranking, such as paid
     placement, pay per click, and pay for inclusion. On June 27, 2002,
     the U.S. Federal Trade Commission
     issued guidelines that recommended that any ranking results
     influenced by payment, rather than by impartial and objective
     relevance criteria, ought to be clearly labeled as such in the
     interests of consumer protection. It appears, then, that any
     algorithm that can reasonably pretend to be objective, such as
     PageRank, will remain an important aspect of web searching for the
     foreseeable future.

     Not only have engines improved their ranking methods, but the web
     has grown so huge that most surfers use search engines several
     times a day. All portals have built-in search functions, and most
     of them have to rely on one of a handful of established search
     engines to provide results. That's because only a few engines have
     the capacity to "crawl" or "spider" more than two billion web pages
     frequently enough to keep their database current. Google is perhaps
     the only engine that is known for consistent, predictable crawling,
     and that's only been true for less than two years. It takes almost
     a week to cover the available web, and another week to calculate
     PageRank for every page. Google's main update cycle is about 28
     days, which is a bit too slow for news-hungry surfers. In August
     2001 they also began a second "mini-crawl" for news sites, which
     are now checked every day. Results from each crawl are mingled
     together, giving the searcher an impression of freshness.

     For the average webmaster, the mechanics of running a successful
     site have changed dramatically from 1996 to 2002. This is due
     almost entirely to the increased importance of search engines. Even
     though much of the dot-com hype collapsed in 2000 and 2001 (a
     welcome relief to noncommercial webmasters who remembered the
     pre-hype days), the fact remains that by now, search engines are
     the fundamental consideration for almost every aspect of web design
     and linking. It's close to a wag-the-dog situation. That's why the
     algorithms that search engines consider to be consistent with the
     FTC's idea of impartial and objective ranking criteria deserve
     closer scrutiny.

                   What objective criteria are available?

     Ranking criteria fall into three broad categories. The first is
     link popularity, which is used by a number of search engines to
     some extent. Google's PageRank is the original form of "link pop,"
     and remains its purest expression. The next category is on-page
     characteristics. These include font size, title, headings, anchor
     text, word frequency, word proximity, file name, directory name,
     and domain name. The last is content analysis. This generally takes
     the form of on-the-fly clustering of produced results into two or
     more categories, which allows the searcher to "drill down" into the
     data in a more specific manner. Each method has its place. Search
     engines use some combination of the first two, or they use on-page
     characteristics alone, or perhaps even all three methods.

     Content analysis is very difficult, but also very enticing. When it
     works, it allows for the sort of graphical visualization of results
     that can give a search engine an overnight reputation for
     innovation and excellence. But many times it doesn't work well,
     because computers are not very good at natural language processing.
     They cannot understand the nuances within a large stack of prose
     from disparate sources. Also, most top engines work with dozens of
     languages, which makes content analysis more difficult, since each
     language has its own nuances. There are several search engines that
     have made interesting advances in content analysis and even
     visualization, but Google is not one of them. The most promising
     aspect of content analysis is that it can be used in conjunction
     with link pop, to rank sites within their own areas of
     specialization. This provides an extra dimension that addresses
     some of the problems of pure link popularity.

     Link popularity, which is "PageRank" to Google, is by far the most
     significant portion of Google's ranking cocktail. While in some
     cases the on-page characteristics of one page can trump the
     superior PageRank of a competing page, it's much more common for a
     low PageRank to completely bury a page that has perfect on-page
     relevance by every conceivable measure. To put it another way, it's
     frequently the case that a page with both search terms in the
     title, and in a heading, and in numerous internal anchors, will get
     buried in the rankings because the sponsoring site isn't
     sufficiently popular, and is unable to pass sufficient PageRank to
     this otherwise perfectly relevant page. In December 2000, Google
     came out with a downloadable toolbar attachment that made it
     possible to see the relative PageRank of any page on the web. Even
     the dumbed-down resolution of this toolbar, in conjunction with
     studying the ranking of a page against its competition, allows for
     considerable insight into the role of PageRank.
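
     The exact weighting is a trade secret, but the imbalance described
     above can be pictured with a purely hypothetical blend of the two
     scores; the weights and the scoring function below are invented
     for illustration:

          # Purely hypothetical blend -- the real weights are secret.
          # It only illustrates how a dominant link-popularity term can
          # bury a page with perfect on-page relevance.
          def combined_score(pagerank, onpage_relevance,
                             w_link=0.8, w_page=0.2):   # assumed weights
              return w_link * pagerank + w_page * onpage_relevance

          # A perfectly relevant page on an unpopular site...
          print(combined_score(pagerank=0.1, onpage_relevance=1.0))  # roughly 0.28
          # ...still scores below a mediocre page on a popular site.
          print(combined_score(pagerank=0.6, onpage_relevance=0.3))  # roughly 0.54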

     Moreover, PageRank drives Google's monthly crawl, such that sites
     with higher PageRank get crawled earlier, faster, and deeper than
     sites with low PageRank. For a large site with an average-to-low
     PageRank, this is a major obstacle. If your pages don't get
     crawled, they won't get indexed. If they don't get indexed in
     Google, people won't know about them. If people don't know about
     them, then there's no point in maintaining a website. Google starts
     over on every site each 28-day cycle, so pages missed once stand an
     excellent chance of being missed on the next cycle as well. In
     short, PageRank is the soul and essence of Google, driving both the
     all-important crawl and the all-important rankings. By 2002
     Google was universally recognized as the world's most popular
     search engine.
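
     The crawl-ordering effect described above can be pictured as a
     simple priority queue. Google has never published how its
     scheduler actually works, so the PageRank figures and URLs below
     are invented purely for illustration:

          # Hypothetical sketch of a PageRank-driven crawl order.
          import heapq

          # (negative PageRank, url) so higher PageRank pops first
          frontier = [(-8.1, "http://bigportal.example/"),
                      (-2.3, "http://smallsite.example/"),
                      (-0.2, "http://smallsite.example/deep/page.html")]
          heapq.heapify(frontier)
          while frontier:
              rank, url = heapq.heappop(frontier)
              print(f"crawl {url} (PageRank {-rank})")
          # The deep, low-PageRank page is always last in line -- and if
          # the cycle's crawling budget runs out first, it never gets
          # crawled at all.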

                       How does PageRank measure up?

     In the first place, Google's claim that "PageRank relies on the
     uniquely democratic nature of the web" must be seen for what it is,
     which is pure hype. In a democracy, every person has one vote. In
     PageRank, rich people get more votes than poor people, or, in web
     terms, pages with higher PageRank have their votes weighted more
     than the votes from lower pages. As Google explains, "Votes cast by
     pages that are themselves 'important' weigh more heavily and help
     to make other pages 'important.'" In other words, the rich get
     richer, and the poor hardly count at all. This is not "uniquely
     democratic," but rather it's uniquely tyrannical. It's corporate
     America's dream machine, a search engine where big business can
     crush the little guy. This alone makes PageRank more closely
     related to the "pay for placement" schemes frowned on by the
     Federal Trade Commission than to those "impartial and objective
     ranking criteria" that the FTC exempts from labeling.

     Secondly, only big guys can have big databases. If your site has an
     average PageRank, don't even bother making your database available
     to Google's crawlers, because they most likely won't crawl all of
     it. This is important for any site that has more than a few
     thousand pages and a home page rated at about five or less on the
     toolbar's crude scale.

     Thirdly, in order for Google to access the links to crawl a deep
     site of thousands of pages, a hierarchical system of doorway pages
     is needed so that the crawler can start at the top and work its way
     down. A single site with thousands of pages typically has all
     external links coming into the home page, and few or none coming
     into deep pages. The home page PageRank therefore gets distributed
     to the deep pages by virtue of the hierarchical internal linking
     structure. But by the time the crawler gets to the real "meat" at
     the bottom of the tree, these pages frequently end up with a
     PageRank of zero. This zero is devastating for the ranking of that
     page, even assuming that Google's crawler gets to it, and it ends
     up in the index, and it has excellent on-page characteristics. The
     bottom line is that only big, popular sites can put their databases
     on the web and expect Google to cover their data adequately. And
     that's true even for websites that had their data on the web long
     before Google started up in 1999.
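
     The arithmetic of that decay is easy to sketch. The numbers below
     are invented and the model is deliberately crude (it ignores the
     small base value every page receives and any links arriving from
     outside the site), but it shows how a hierarchical site spends its
     home-page PageRank on the way down:

          # Crude illustration, not Google's algorithm: at each level the
          # parent's PageRank contribution is split among its internal
          # links, so deep pages receive only a sliver of it.
          home_pagerank = 100.0     # hypothetical units for the home page
          links_per_page = 50       # assumed fan-out at every level
          damping = 0.85            # damping factor from the 1998 paper
          pr = home_pagerank
          for depth in range(1, 5):
              pr = damping * pr / links_per_page
              print(f"depth {depth}: roughly {pr:.4f}")
          # depth 1: roughly 1.7000
          # depth 2: roughly 0.0289
          # depth 3: roughly 0.0005
          # depth 4: roughly 0.0000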

                       What about non-database sites?

     There are other areas where PageRank has a negative effect, even
     for sites without a lot of data. The nature of PageRank is so
     discriminatory that it's rather like the exact opposite of
     affirmative action. While many see affirmative action as reverse
     discrimination, no one would claim (apart from economists who
     advocate more tax cuts for the rich) that the opposite, which would
     be deliberate discrimination in favor of the already-privileged, is
     a solution for anything. Yet this is essentially what Google
     claims.

     Those who launch new websites in 2002 have a much more difficult
     time getting traffic to their sites than they did before Google
     became dominant. The first step for a new site is to get listed in
     the Open Directory Project. This is used by Google to seed the
     crawl every month. But even after a year of trying to coax links to
     the new site from other established sites, its webmaster can
     expect fewer than 30 visitors per day. Sites with a respectable
     PageRank, on the other hand, get tens of thousands of visitors per
     day. That's the scale of things on the web -- a scale that is best
     expressed by the fact that Google's zero-to-ten toolbar is a
     logarithmic scale, perhaps with a base of six. To go from an old
     PageRank of four to a new rank of five requires several times more
     incoming links. This is not easy to achieve. The cure for cancer
     might already be on the web somewhere, but if it's on a new site,
     you won't find it.
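
     If the toolbar really is logarithmic with a base of six (the base
     is a guess, as noted above, not a published figure), then each
     extra toolbar point stands for roughly six times as much
     underlying PageRank:

          # A base-6 logarithmic scale is an assumption, not Google's
          # published spec; it just shows why one toolbar point is such
          # a large jump in raw PageRank.
          base = 6
          for toolbar_value in range(3, 7):
              print(toolbar_value, "->", base ** toolbar_value)
          # 3 -> 216, 4 -> 1296, 5 -> 7776, 6 -> 46656 (arbitrary units)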

     PageRank also encourages webmasters to change their linking
     patterns. On search engine optimization forums, webmasters even
     discuss charging for little ads with links, priced according to the
     PageRank they've achieved for their site. Paying for such ads would
     benefit sites with a lower PageRank. Sometimes these
     PageRank achievements are the result of link farms or other shady
     practices, which Google tries to detect and then penalizes with a
     PageRank of zero. At other times professional optimizers get away
     with spammy techniques. Mirror sites and duplicate pages on other
     domains are now forbidden by Google and swiftly punished, even when
     there are good reasons for maintaining such sites. Overall, linking
     patterns have changed significantly because of Google. Many
     webmasters are stingy about giving out links (since each extra link
     dilutes the PageRank they pass to any given site), at the same time
     that they're desperate for more links from others.

                           What should Google do?

     We feel that PageRank has run its course. Google doesn't have to
     abandon it entirely, but they should de-emphasize it. The first
     step is to stop reporting PageRank on the toolbar. This would mute
     the awareness of PageRank among optimizers and webmasters, and
     remove some of the bizarre effects that such awareness has
     engendered. The next step would be to replace all mention of
     PageRank in their own public relations documentation, in favor of
     general phrases about how link popularity is one factor among many
     in their ranking algorithms. And Google should adjust the balance
     between their various algorithms so that excellent on-page
     characteristics are not completely cancelled by low link
     popularity.

     PageRank must be streamlined so that the "tyranny of the rich"
     characteristics are scaled down in favor of a more egalitarian
     approach to link popularity. This would greatly simplify the
     complex and recursive calculations that are now required to rank
     two billion web pages, which must be very expensive for Google. The
     crawl must not be PageRank driven. There should be a way for Google
     to arrange the crawl so that if a site cannot be fully covered in
     one cycle, Google's crawlers can pick up where they left off on the
     next cycle.
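
     One way to picture such a resumable crawl (a sketch of the
     suggestion above, not anything Google has described) is simply to
     save each site's unvisited frontier between cycles; the filename
     and structure here are invented:

          # Hypothetical sketch of a crawl that resumes where it stopped.
          import json, os

          FRONTIER_FILE = "site_frontier.json"   # assumed per-site file

          def load_frontier(seed_urls):
              """Resume last cycle's leftover URLs, or start at the top."""
              if os.path.exists(FRONTIER_FILE):
                  with open(FRONTIER_FILE) as f:
                      return json.load(f)
              return list(seed_urls)

          def save_frontier(unvisited_urls):
              """Remember what was not reached, for the next cycle."""
              with open(FRONTIER_FILE, "w") as f:
                  json.dump(list(unvisited_urls), f)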

     Google is so important to the web these days that it probably
     ought to be a public utility. Regulatory interest from agencies
     such as the FTC is entirely appropriate, but we feel that the FTC
     addressed only the most blatant abuses among search engines.
     Google, which only recently began using sponsored links and ad
     boxes, was not even an object of concern to the Ralph Nader group,
     Commercial Alert, which complained to the FTC.

     This was a mistake, because Commercial Alert failed to look closely
     enough at PageRank. Some aspects of PageRank, as presently
     implemented by Google, are nearly as pernicious as pay for
     placement. There is no question that the FTC should regulate
     advertising agencies that parade as search engines, in the
     interests of protecting consumers. Google is still a search engine,
     but not by much. They can remain a search engine only by fixing
     PageRank's worst features.

     _________________

     Daniel Brandt is founder and president of Public Information
     Research, Inc., a tax-exempt public charity that sponsors NameBase.
     He began compiling NameBase in 1982, from material that he started
     collecting in 1974, and is now the programmer and webmaster for
     PIR's several sites. He participates in various forums where
     webmasters share observations about the often-secretive algorithms,
     bugs, and behavior of various search engines. Brandt has been
     watching Google's interaction with NameBase ever since Google, in
     October 2000, became the first search engine to go "deep" on PIR's
     main site by crawling thousands of dynamic pages.

                               Google Watch 

#  distributed via <nettime>: no commercial use without permission
#  <nettime> is a moderated mailing list for net criticism,
#  collaborative text filtering and cultural politics of the nets
#  more info: majordomo@bbs.thing.net and "info nettime-l" in the msg body
#  archive: http://www.nettime.org contact: nettime@bbs.thing.net