A PageRank View of Research Rank

The present day “publish or perish” madness in academia involves counting number of papers researchers have published, as well as the number of citations their papers got.

(a) One might argue that not all citations are equally valuable: a citation in a paper that is itself often cited is more valuable than a citation in a paper that no one cites. Design a xPageRank style algorithm which would rank papers according to their “importance”, and then use such an algorithm to rank researchers by their “importance”.

(b) Assume now that you do not have information on the citations in each published paper, but instead you have for every researcher a list of other researchers who have cited him and how many times they cited him. Design again a PageRank style algorithm which would rank researchers by their importance.

Solution-(a):

we can view this problem as PageRank problem, where each paper is viewed as webpage, and citation is viewed as hyperlink, and the paper rank result is viewed as webpage rank result. Thus, we would have a system of equations, one for each paper P in the paper library:

\left\{ \rho(P) = \sum_{P_i \rightarrow P}{\frac{\rho(P_i)}{\#(P_i)}} \right\}_{P \ \in \ \text{Paper-Library}}

it means that the importance of paper P is dependent on the importance score of the papers that cited P.

and we use matrix C_1 to represent the citation relationship such that:

c(i,j) = \left\{  \frac{1}{\#(P_i)} \space If\ paper\ P_i\ cited\ P_j  \atop  0 \space \text{otherwise}  \right.

There might be dangling node paper which reference no other one, we can do C_1 + \frac{1}{M}\mathbf{de^T} to make them pointing every one else to resolve the dangling problem. Besides, when a reader is diving in one paper, he might have a chance to jump to another paper randomly rather than following the citation list, we denote it as \theta , finally the Google-Matrix like paper citation matrix can be written as:

\mathbb{C} = \theta\left( C_1 + \frac{1}{M}\mathbf{de^{T}} \right) + \frac{1 - \theta}{M}\mathbf{ee^T}

in the process of iteratively computing paper rank, we use vector \pi^*_i to represent the intermediate result after round i_{th} . So we keep running following iterative equation iteratively

{(\pi^*_{i+1})}^T = {(\pi^*_{i})}^T\mathbb{C}

until the \pi^* converge. And then, the converged vector contains corresponding importance score for each paper.

Finally, the importance score of researcher R_i equals to sum of all his(her) own papers importance score, i.e

R_i = \sum_{if\ R_i\ own\ paper\ P_j}\pi[j]

Solution-(b):

Analysis:

In this problem, it still can be modeled as PageRank problem like problem-(a), but it needs a little modification on how we construct citation matrix.

Firstly, Since we know how a researcher R_i received citations from other researchers like R_j , we could only construct a receiving matrix rather than citation matrix. Then we do a transpose on it, it will becomes a citation matrix again.

Secondly, When we are calculating adjacency relation r(i,j) , instead of assign it to be \left\{\frac{1}{\text{numbers of node i points out}}\right\} (a.k.a \frac{1}{\#(P_i)} ), we make it equals to \left\{\frac{\text{how many times of node j points to i}}{\text{total number of i was pointed to}}\right\} , in which the times of node j pointing to i means that researcher i having received citations from a specific researcher j for k times. And total number of i was pointed to means how many citations R_i has received in total, and we denote it with \#(R_i)

Procedure:

From analysis, we can have equation

r(i,j) = \left\{  \frac{k}{\#(R_i)} \space If\ researcher\ R_i\ received\ \ citations\ from\ R_j\space for \space k \space times  \atop  0 \space \text{otherwise}  \right.

Using this equation, we can construct the researcher citation receiving matrix \mathbb{R} , where \mathbb{R}[i][j] stands for researcher R_i receive the weighted importance score from R_j . Then we do a transpose to it, it turns into a citation matrix again, a.k.a R^T , which will work with Google matrix construction equation again. we denote it with C_1

Similarly, to tackle dangling node and randomization problem, we use following equation to construct Google-Matrix like researcher citation relation matrix \mathbb{C} with random walk probability

\theta :

\mathbb{C} = \theta\left( C_1 + \frac{1}{M}\mathbf{de^{T}} \right) + \frac{1 - \theta}{M}\mathbf{ee^T}

Then, we use \pi_i^* to denotes the importance vector of researchers, and run iteratively on following equation

(\pi_{i+1}^*)^T = (\pi_i^*)^T \mathbb{C}

until it converges, then the \pi^* contains corresponding importance score for each researcher, and the increasing rank of importance scores is the rank of researcher.

Posted in TechNotes | Leave a comment

A Simple Gotha on PageRank

Below is given the graph of the Internet one milisecond after the Big Bang. Construct the corresponding Google matrix with α = 7/8.

Figure\ 1:\ The\ Internet\ 1ms\ after\ the\ Big\ Bang

Solution:

With equation

g(i,j) = \left\{  \frac{1}{\#(P_i)} \space if \space P_i \rightarrow P_j  \atop  0 \space \text{otherwise}  \right.

we can obtain the topology matrix G_1 :

G_1 =  \begin{matrix}  0 & 1/4 & 1/4 & 1/4 & 0 & 0 & 0 & 1/4 \\  1/4 & 0 & 1/4 & 0 & 1/4 & 0 & 1/4 & 0 \\  0 & 1/3 & 0 & 1/3 & 0 & 0 & 1/3 & 0 \\  0 & 0 & 0 & 0 & 1/2 & 1/2 & 0 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\  1/3 & 0 & 1/3 & 0 & 0 & 0 & 1/3 & 0 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\  1/3 & 0 & 1/3 & 0 & 1/3 & 0 & 0 & 0  \end{matrix}

we found that it has dangling node, so we need to eliminate it by make it connects to every node, thus G_1 + \frac{1}{M}\mathbf{d}e^{T} . In addition, a random walk with \alpha = 7/8 should also be taken into consideration, as a result, we would have the Google matrix equation:

7/8\left( G_1 + \frac{1}{M}\mathbf{de^{T}} \right) + \frac{1 - 7/8}{M}\mathbf{ee^T}

Finally, the Google matrix is calculated to be:

G =  \begin{matrix}  1/64 & 15/64 & 15/64 & 15/64 & 1/64 & 1/64 & 1/64 & 15/64 \\  15/64 & 1/64 & 15/64 & 1/64 & 15/64 & 1/64 & 15/64 & 1/64 \\  1/64 & 59/192 & 1/64 & 59/192 & 1/64 & 1/64 & 59/192 & 1/64 \\  1/64 & 1/64 & 1/64 & 1/64 & 29/64 & 29/64 & 1/64 & 1/64 \\  1/8 & 1/8 & 1/8 & 1/8 & 1/8 & 1/8 & 1/8 & 1/8 \\  59/192 & 1/64 & 59/192 & 1/64 & 1/64 & 1/64 & 59/192 & 1/64 \\  1/8 & 1/8 & 1/8 & 1/8 & 1/8 & 1/8 & 1/8 & 1/8 \\  59/192 & 1/64 & 59/192 & 1/64 & 59/192 & 1/64 & 1/64 & 1/64  \end{matrix}

In case there is any ambiguity, python style pseudo code is provided.

def constructGoogleMatrix(adjacency, alpha):
    """
    :type   adjacency: Dict{node,Set{node}}
    :rtype  G: List[List[float]]
    """

    G = [[0]*len(adjacency) for i in range(len(adjacency))]

    for node in adjacency:
        if len(adjacency[node]) > 0:
            weight = 1 / len(adjacency[node])
            for neighbour in adjacency[node]:
                G[node.num][neighbour.num] = weight*alpha + (1-alpha)/len(adjacency)
        else: # dangling node
            G[node.num][0:] = [(1/len(adjacency))*alpha + (1-alpha)/len(adjacency)] * len(G[node.num])
    return G
Posted in TechNotes | Tagged , , | Leave a comment