03 December 2013
PageRank starts with a simple and clever idea: the importance of a page is determined by how many pages link to it, and how important they are. It's a link-analysis algorithm, and it ranks pages by how important they are to the entire collection.
In other words, when a source page (like this blog) links to another, target page (like a Wikipedia article), some of the source's importance should transfer over to the target.
The only things you need are pages and their links. It's a graph structure.
A graph can be stored in SQL using 2 tables, Nodes and Edges
CREATE TABLE Nodes (NodeId int not null ,NodeWeight decimal(10,5) not null ,NodeCount int not null default(0) ,HasConverged bit not null default(0) ,constraint NodesPK primary key clustered (NodeId) ) CREATE TABLE Edges (SourceNodeId int not null ,TargetNodeId int not null ,constraint EdgesPK primary key clustered (SourceNodeId, TargetNodeId) ,constraint EdgeChk check SourceNodeId <> TargetNodeId --ignore self references )
Imagine a node is an object like a web page. An edge is a pointer from a source node to a target node, like a hyperlink pointing from one web page to another.
We are preventing a few things. We ignore edges where a node points back to itself. And a node is allowed to point to another node only once.
Imagine the Internet has just 4 web pages: pages 1, 2, 3 and 4. These pages have links between them:
We can store this in SQL
INSERT INTO Nodes (NodeId, NodeWeight) VALUES (1, 0.25) ,(2, 0.25) ,(3, 0.25) ,(4, 0.25) INSERT INTO Edges (SourceNodeId, TargetNodeId) VALUES (2, 1) --page 2 links to pages 1 and 3 ,(2, 3) ,(3, 1) --page 3 links to page 1 ,(4, 1) -- page 4 links to the 3 other pages ,(4, 2) ,(4, 3)
Initially, we give all nodes the same weight, 0.25. All nodes start out equal.
We need to know how many edges each node has pointing away from it. Imagine counting the number of links a web page has going somewhere else. Let's calculate that and store it in our Nodes table.
declare @TotalNodeCount int set @TotalNodeCount = (select count(*) from Nodes) UPDATE n --if a node has 0 edges going away then assign it the total # of nodes. SET n.NodeCount = isnull(x.TargetNodeCount, @TotalNodeCount) FROM Nodes n LEFT OUTER JOIN ( SELECT SourceNodeID, TargetNodeCount = count(*) FROM Edges GROUP BY SourceNodeId ) as x ON x.SourceNodeID = n.NodeId
Let's look at the most important part: calculating how we transfer weight from source nodes to targets.
SELECT e.TargetNodeId ,TransferWeight = sum(n.NodeWeight / n.NodeCount) * @DampingFactor FROM Nodes n INNER JOIN Edges e ON n.NodeId = e.SourceNodeId GROUP BY e.TargetNodeId
You can see there's a damping factor, which is the percentage of a node's weight that gets transferred via its edge to another node.
We also include a margin of error, which is a small number that represents an acceptable amount of precision.
Now we're ready to run PageRank.
declare @DampingFactor decimal(3,2) = 0.85 --set the damping factor ,@MarginOfError decimal(10,5) = 0.001 --set the stable weight ,@TotalNodeCount int ,@IterationCount int = 1 -- we need to know the total number of nodes in the system set @TotalNodeCount = (select count(*) from Nodes) -- iterate! WHILE EXISTS ( -- stop as soon as all nodes have converged SELECT * FROM dbo.Nodes WHERE HasConverged = 0 ) BEGIN UPDATE n SET NodeWeight = 1.0 - @DampingFactor + isnull(x.TransferWeight, 0.0) -- a node has converged when its existing weight is the same as the weight it would be given -- (plus or minus the stable weight margin of error) ,HasConverged = case when abs(n.NodeWeight - (1.0 - @DampingFactor + isnull(x.TransferWeight, 0.0))) < @MarginOfError then 1 else 0 end FROM Nodes n LEFT OUTER JOIN ( -- Here's the weight calculation in place SELECT e.TargetNodeId ,TransferWeight = sum(n.NodeWeight / n.NodeCount) * @DampingFactor FROM Nodes n INNER JOIN Edges e ON n.NodeId = e.SourceNodeId GROUP BY e.TargetNodeId ) as x ON x.TargetNodeId = n.NodeId -- for demonstration purposes, return the value of the nodes after each iteration SELECT @IterationCount as IterationCount ,* FROM Nodes set @IterationCount += 1 END
This example takes 5 iterations to complete. It turns out computing PageRank for the entire world wide web takes only 100 iterations.
I'd recommend you try this out yourself. My code is available on GitHub.