A fundamental question in network science is how to measure the relative popularity of a given node. The word “popularity” has no precise mathematical meaning, so we need to think creatively as to how we would interpret this subjective notion.
One answer is to consider a function of the number of edges incident with a node. For example, Katy Perry has over one hundred million Twitter followers and so has over one hundred million directed edges pointing to her on the network; we call that her in-degree. Under any reasonable interpretation, Katy Perry is popular.
However, in-degree is superficially easy to manipulate. We could create, for example, one hundred and fifty million fake Twitter accounts and all have them follow any given account. It’s impractical sounding, but if popularity were simply in-degree, then it is possible to link spam accounts to artificially increase their popularity.
How do we avoid this kind of link spamming?
I like to tell the story in my graduate class on networks about the anarchist librarian. Imagine you were looking for a book in a library, say something on weather. You go to the weather section, but you find only books on fish. That is the anarchist librarian hard at work, misfiling all the books!
The web may be viewed as a vast, on-line library (or as Jon Kleinberg said in one of his talks, it may be viewed as a crowd). There are all kinds of people adding all kinds of things to the library, and moving books around without caring about which book is an authority on any given subject.
For example, suppose you were looking for information on heart disease on the web. When you search “heart disease” you would hope to get accurate medical information rather than an ad for a $19.99 herbal extract.
Sergey Brin and Larry Page devised PageRank in part to overcome these challenges. PageRank is one way to measure the popularity of websites. The algorithm was a critical tool in making Google a success, especially in the earlier development of the company.
What is it?
In short: PageRank is the stationary distribution of a random walk on a directed graph with teleportation.
Let me explain in simpler terms. Imagine a web surfer moving randomly from site to site on the web, following hyperlinks to move from one site to another. That’s called a random walk. At a given site, the surfer has an equal probability to move to any other sites by links. Occasionally, the surfer gets bored and moves to a randomly chosen site. That is the teleportation part.
PageRank is the probability of landing on a site assuming the surfer is following pages and teleporting as described above. It can be computed precisely something called a PageRank matrix, which mathematically encodes the random walk with teleportation. You only need to know the link structure of the underlying network to find its corresponding PageRank matrix.
Linear algebra kicks in and tells us that the PageRank matrix is ergodic, so by the Perron-Frobenius theorem, it has an eigenvector with eigenvalue 1. This eigenvector is the stationary distribution piece That eigenvector has coordinates which are the PageRank scores for each corresponding node. It may sound complicated to the uninitiated, but the PageRank matrix can be written down by any first-year undergraduate taking a linear algebra course.
Computing the PageRank eigenvector is a little trickier, especially with large matrices, so there is a numerical technique called the power method that is helpful here. The power method only uses matrix multiplication and converges quickly to approximate the PageRank eigenvector after a few iterations.
Why is PageRank effective?
PageRank is an effective ranking tool because it uses only the underlying link structure of the web. You can’t spam search results, for instance, by adding lots of text to a site. PageRank measures popularity recursively: you are popular if you are linked to by other popular sites. This self-referential aspect of PageRank is why eigenvectors are useful here.
Teleportation is useful to avoid what are called spider traps, where the random walk would enter a node with no edges coming out of it. Teleportation gets you out of spider traps. There are many tweaks we could make to PageRank, such as modifying the teleportation based on your own search habits, and that’s the idea behind personalized PageRank.
PageRank applies, of course, to many other kinds of networks, not just the web graph. It has found applications in disparate topics ranging from social networks to protein interaction networks.
My first book A Course on the Web Graph has a chapter on PageRank and is appropriate for readers with a mathematical background. Amy Langville and Carl Meyer wrote a book Google’s PageRank and Beyond that I recommend for those who want to deeper into the topic. Recently, Steven Strogatz gave an interview discussing eigenvectors in three minutes without formulas. That is worth a listen.
Is PageRank still useful to Google in its searches? Possibly so, but it may not play as prominent a role as it did in the past. There are many advanced, algorithmic tools available now in web search. Notice how Google completes your query before you can fully type it in? And how it sometimes feels like it is reading your mind? That’s machine learning/artificial intelligence and an algorithm Google calls Rank Brain.
Whatever role PageRank plays in web search, it remains an important tool in the arsenal of network science in our quest to better understand real-world networks.
One thought on “All about PageRank”
[…] PageRank is the stationary distribution of a random walk on a directed graph with teleportation. […]