### Google PageRank

In this CentraleSupélec problem, we studied the Perron-Frobenius theorem. This theorem has many applications, one of which lies at the heart of the Google search engine.

We say that \(M>0\) if \(m_{ij}>0\) for all \(i,j\), and we denote by \(\rho(A)=\max\{|\lambda|,\ \lambda\in \mathrm{Sp}(A)\}\in\mathbb{R}^+\) the spectral radius of \(A\).

Let \(A \in M_n(\mathbb{R})\) with \(A>0\). Perron's theorem states that:

- \(\rho(A)>0\), and \(\rho(A)\) is an eigenvalue of \(A\) of multiplicity 1: \(\dim(\ker(A-\rho(A) I))=1\).
- There is an eigenvector \(X>0\) of \(A\), i.e. with strictly positive components, associated with the eigenvalue \(\rho(A)\).
- Any other eigenvalue \(\lambda\) of \(A\) in \(\mathbb{C}\) satisfies \(|\lambda|<\rho(A)\).

When a search is performed on Google, the algorithm must select and rank web pages according to their relevance. To do this, it assigns scores to each web page:

- A score directly evaluates the relevance of the page to the search, for example by comparing the words contained in the page with those present in the search.
- Another score is attributed according to the intrinsic interest of the page: this is the PageRank algorithm.

The principle is to estimate the quality of a web page by analyzing all the links pointing to it: the more a page is cited by quality pages, the higher its score.

The score \(s(x_i)\) of a page is thus the sum of the scores of the pages pointing to it, each weighted by the inverse of that page's total number of outgoing links: \[s(x_i) = \sum_{x_j\to x_i} \frac{1}{d(x_j)}s(x_j) \tag{1}\] where \(d(x)\) is the total number of links referenced by the page \(x\) (if a page has no outgoing links, we consider that it references itself and set \(d(x)=1\)).

If we let \(N\) be the total number of web pages, \(S\in M_{1N}(\mathbb{R})\) the row vector whose components are the \(s(x_i)\), and \(M \in M_N(\mathbb{R})\) the matrix defined by: \[ m_{ij} = \left\{ \begin{array}{ll} \frac{1}{d(x_i)} & \mbox{if } x_i \mbox{ points to } x_j, \\ 0 & \mbox{otherwise,} \end{array} \right. \] then the vector of scores satisfies the equation \[ \begin{alignat*}{1} SM&=S \tag{2} \\ \Leftrightarrow M^TS^T &= S^T \end{alignat*} \] We are therefore looking for an eigenvector of \(M^T\), with strictly positive components, associated with the eigenvalue \(1\).
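As a sketch, the matrix \(M\) can be built directly from the link structure. The `links` dictionary below is a hypothetical 3-page toy graph, not taken from the text; a page with no outgoing links is treated as referencing itself, so that \(d(x)=1\) as above.

```python
# Hypothetical toy web graph (not from the text): page index -> outgoing links.
links = {0: [1, 2], 1: [0], 2: []}

N = len(links)
# m[i][j] = 1/d(x_i) if x_i points to x_j, 0 otherwise.
M = [[0.0] * N for _ in range(N)]
for i, out in links.items():
    out = out or [i]  # a page with no outgoing links references itself: d(x) = 1
    for j in out:
        M[i][j] = 1.0 / len(out)
```

Each row of `M` sums to 1: the matrix built this way is stochastic, which is the key property used below.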

We can rephrase this problem in terms of probabilities. We can represent the web as a graph, whose vertices are the web pages, with an edge directed from \(x_i\) to \(x_j\) to symbolize a link to \(x_j\) present in the page \(x_i\).

Let's imagine that we're moving around the vertices of this graph: starting from a page \(x_i\), we choose equiprobably one of the \(d(x_i)\) outgoing links.

This defines a homogeneous Markov chain, and in fact a random walk on the web graph.

The probability of a transition from state \(x_i\) to state \(x_j\) is \[ \begin{alignat*}{1} p_{ij} &= \left\{ \begin{array}{ll} \frac{1}{d(x_i)} & \mbox{if } x_i \mbox{ points to } x_j, \\ 0 & \mbox{otherwise,} \end{array} \right. \\ &= m_{ij} \end{alignat*} \] which shows that \(M\) is the transition matrix of the Markov chain. If \(S_n\) is the row vector giving the probabilities of being in each state at time \(n\), we have: \[ S_n M = S_{n+1} \]

Here's an example of a graph with 5 web pages:

The associated transition matrix is: \[ M=\begin{bmatrix} 0 &\frac{1}{2} &\frac{1}{2} & 0 & 0 \\\frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ \frac{1}{3} & 0 & \frac{1}{3} & \frac{1}{3} & 0 \end{bmatrix} \]
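As a quick sanity check (a Python sketch, not part of the original problem), the matrix above can be written down with exact fractions and verified to be row-stochastic:

```python
from fractions import Fraction as F

# Transition matrix of the 5-page example, with exact arithmetic.
M = [
    [F(0), F(1, 2), F(1, 2), F(0), F(0)],
    [F(1, 2), F(0), F(1, 2), F(0), F(0)],
    [F(0), F(0), F(0), F(0), F(1)],
    [F(0), F(0), F(0), F(1), F(0)],
    [F(1, 3), F(0), F(1, 3), F(1, 3), F(0)],
]

# Every row sums to exactly 1, as required of a transition matrix.
row_sums = [sum(row) for row in M]
```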

Finding \(S\in M_{1N}(\mathbb{R})\), with positive components, satisfying \[ SM=S \] amounts to finding an invariant measure (or an invariant probability, if we normalize the components so that they sum to 1).

The difficulty lies in the structure of the web. The matrix \(M\) is sparse (it contains many zeros), and there are many pages with no outgoing links, which are therefore "absorbing" states of the Markov chain. We can show that equation (2) then has no "interesting" solution, i.e. one with \(s(x_i)>0\) for all \(i\): every *transient* state of the Markov chain automatically gets \(s(x_i)=0\) (see box), which makes the solution useless for ranking pages.
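This degeneracy is easy to observe numerically on the 5-page example (a sketch; the iteration count is an arbitrary choice): iterating \(S_{n+1}=S_nM\) from the uniform distribution drives all of the probability mass into the absorbing page \(x_4\), leaving every transient page with score 0.

```python
# Transition matrix of the 5-page example; page x_4 (index 3) is absorbing.
M = [
    [0, 0.5, 0.5, 0, 0],
    [0.5, 0, 0.5, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1 / 3, 0, 1 / 3, 1 / 3, 0],
]
N = len(M)

# Power iteration S_{n+1} = S_n M, starting from the uniform distribution.
S = [1 / N] * N
for _ in range(1000):
    S = [sum(S[i] * M[i][j] for i in range(N)) for j in range(N)]
```

After enough iterations, \(S \approx (0,0,0,1,0)\): only the absorbing page keeps a nonzero score.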

The idea is to modify the walk on the graph, allowing the chain to restart from a random page with a small probability, for example \(\beta=0.15\). More precisely:

- with probability \(1-\beta\), we follow a link as before.
- with probability \(\beta\), any web page is chosen equiprobably.

The probability of the transition from state \(x_i\) to state \(x_j\) becomes \[ \begin{alignat*}{1} \tilde{p}_{ij}&= \left\{ \begin{array}{ll} (1-\beta)\frac{1}{d(x_i)} + \beta\frac{1}{N} & \mbox{if } x_i \mbox{ points to } x_j, \\ \beta\frac{1}{N} & \mbox{otherwise,} \end{array} \right. \\ &= (1-\beta)m_{ij} + \beta\frac{1}{N} \end{alignat*} \]

In our example, the graph becomes the following, with the new probability-\(\frac{\beta}{5}\) transitions shown in gray:

The associated transition matrix is: \[ \tilde{M}=\begin{bmatrix} \frac{\beta}{5} &\frac{1-\beta}{2}+\frac{\beta}{5} & \frac{1-\beta}{2}+\frac{\beta}{5} & \frac{\beta}{5} & \frac{\beta}{5} \\\frac{1-\beta}{2} +\frac{\beta}{5}& \frac{\beta}{5} & \frac{1-\beta}{2} +\frac{\beta}{5}& \frac{\beta}{5} & \frac{\beta}{5}\\ \frac{\beta}{5} & \frac{\beta}{5} & \frac{\beta}{5} & \frac{\beta}{5} & (1-\beta)+\frac{\beta}{5} \\ \frac{\beta}{5} & \frac{\beta}{5} & \frac{\beta}{5} & (1-\beta)+\frac{\beta}{5} & \frac{\beta}{5} \\ \frac{1-\beta}{3}+\frac{\beta}{5} & \frac{\beta}{5} & \frac{1-\beta}{3}+\frac{\beta}{5} & \frac{1-\beta}{3}+\frac{\beta}{5} & \frac{\beta}{5} \end{bmatrix} \]
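The same matrix can be generated programmatically (a sketch in Python): \(\tilde{M}=(1-\beta)M+\frac{\beta}{N}J\), where \(J\) is the all-ones matrix, \(\beta=0.15\) and \(N=5\).

```python
beta, N = 0.15, 5

# Transition matrix M of the 5-page example.
M = [
    [0, 0.5, 0.5, 0, 0],
    [0.5, 0, 0.5, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1 / 3, 0, 1 / 3, 1 / 3, 0],
]

# M_tilde[i][j] = (1 - beta) * m_ij + beta / N: strictly positive everywhere.
M_tilde = [[(1 - beta) * M[i][j] + beta / N for j in range(N)]
           for i in range(N)]
```

Every entry of `M_tilde` is strictly positive and every row still sums to 1, which is exactly what Perron's theorem will need.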

The transition matrix \(\tilde{M}\) defined by \(\tilde{m}_{ij}= \tilde{p}_{ij}\) is irreducible, and we even have \(\tilde{M}^T>0\): Perron's theorem applies. Since the matrix \(\tilde{M}\) is stochastic (\(\sum_{j=1}^N \tilde{m}_{ij}=1\) for all \(i\)), we can show that \(\rho(\tilde{M}^T)=1\). Perron's theorem then guarantees the existence of a vector \(X>0\) such that \(\tilde{M}^TX=X\); setting \(S=\frac{1}{\sum_i X_i}X^T\), we get \(S>0\) and \(S\tilde{M}=S\). It is then easy to prove that this invariant probability \(S\) is unique.
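The invariant probability \(S\) can be approximated by power iteration. The sketch below uses arbitrary choices (200 iterations, uniform starting distribution); convergence is geometric because \(\tilde{M}>0\) has a unique fixed probability vector.

```python
beta, N = 0.15, 5

# Transition matrix M of the 5-page example, then its damped version.
M = [
    [0, 0.5, 0.5, 0, 0],
    [0.5, 0, 0.5, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1 / 3, 0, 1 / 3, 1 / 3, 0],
]
M_tilde = [[(1 - beta) * M[i][j] + beta / N for j in range(N)]
           for i in range(N)]

# Power iteration S_{n+1} = S_n * M_tilde from the uniform distribution.
S = [1 / N] * N
for _ in range(200):
    S = [sum(S[i] * M_tilde[i][j] for i in range(N)) for j in range(N)]

# Residual ||S * M_tilde - S||_inf: ~0 once S is (numerically) invariant.
residual = max(abs(sum(S[i] * M_tilde[i][j] for i in range(N)) - S[j])
               for j in range(N))
```

Every component of the resulting \(S\) is strictly positive, so each page receives a usable PageRank score.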

###### References

S. U. Pillai, T. Suel and Seunghun Cha, "The Perron-Frobenius theorem: some of its applications", IEEE Signal Processing Magazine, vol. 22, no. 2, pp. 62-75, March 2005.

Sergey Brin and Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine", Computer Networks, vol. 30, pp. 107-117, 1998.

##### A few definitions about Markov chains

\(E\) is an at most countable (finite or countably infinite) state space, and \(X\) is a homogeneous Markov chain on \(E\) with transition matrix \(M\).

For any \(x\in E\) such that \(P(X_0=x)>0\), we denote \[ P_x=P(.\mid X_0=x) \] the probability law conditioned on the event \(X_0=x\).

For \(y\in E\), we define the *first hitting time* of \(y\) by \[ T_y= \inf\{n\in \mathbb{N}^*,\ X_n=y\} \] with the convention \(\inf \emptyset = +\infty\).

We say that \(x\) leads to \(y\) if \(P_x(T_y < +\infty)>0\).

The states \(x\) and \(y\) are said to *communicate* if \(x\) leads to \(y\) and \(y\) leads to \(x\). This is an equivalence relation.

A chain is *irreducible* if all its states communicate.

A state \(x\) is *recurrent* if \(P_x(T_x<+\infty)=1\): starting from state \(x\), the chain is certain to return to it.

A state \(x\) is *transient* if \(P_x(T_x<+\infty)<1\): starting from state \(x\), the chain may never return to it.
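Both notions can be illustrated on the 5-page example of the main text (a Monte-Carlo sketch; the function name, seed, trial count and walk cap are arbitrary choices). The absorbing page \(x_4\) returns to itself with probability 1, hence is recurrent, while page \(x_1\) is transient (a short first-step computation gives \(P_{x_1}(T_{x_1}<+\infty)=5/8\)).

```python
import random

# Transition matrix of the 5-page example (page x_4, index 3, is absorbing).
M = [
    [0, 0.5, 0.5, 0, 0],
    [0.5, 0, 0.5, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [1 / 3, 0, 1 / 3, 1 / 3, 0],
]

def estimate_return_prob(start, trials=4000, max_steps=100, seed=0):
    """Monte-Carlo estimate of P_x(T_x < +inf), the return probability.

    Walks longer than max_steps are counted as 'never returns'; for this
    small chain the truncation error is negligible.
    """
    rng = random.Random(seed)
    returns = 0
    for _ in range(trials):
        state = start
        for _ in range(max_steps):
            state = rng.choices(range(len(M)), weights=M[state])[0]
            if state == start:
                returns += 1
                break
    return returns / trials
```

`estimate_return_prob(3)` gives exactly 1.0 (the absorbing page always returns at the first step), while `estimate_return_prob(0)` fluctuates around the exact value \(5/8\).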