PageRank in SQL

03 December 2013

One of the most profound ideas in the last 20 years is PageRank, the original algorithm of Google's search engine.

PageRank starts with a simple and clever idea: the importance of a page is determined by how many pages link to it, and how important they are. It's a link-analysis algorithm, and it ranks pages by how important they are to the entire collection.

In other words, when a source page (like this blog) links to another, target page (like a Wikipedia article), some of the source's importance should transfer over to the target.

The only things you need are pages and their links. It's a graph structure.

SQL and Graphs

A graph can be stored in SQL using 2 tables, Nodes and Edges

CREATE TABLE Nodes
(NodeId int not null
,NodeWeight decimal(10,5) not null
,NodeCount int not null default(0)
,HasConverged bit not null default(0)
,constraint NodesPK primary key clustered (NodeId)
)

CREATE TABLE Edges (SourceNodeId int not null ,TargetNodeId int not null ,constraint EdgesPK primary key clustered (SourceNodeId, TargetNodeId) ,constraint EdgeChk check SourceNodeId <> TargetNodeId --ignore self references )

Imagine a node is an object like a web page. An edge is a pointer from a source node to a target node, like a hyperlink pointing from one web page to another.

We are preventing a few things. We ignore edges where a node points back to itself. And a node is allowed to point to another node only once.

Imagine a Tiny Web

Imagine the Internet has just 4 web pages: pages 1, 2, 3 and 4. These pages have links between them:

  • Page 2 links to pages 1 and 3
  • Page 3 links to page 1
  • Page 4 links to pages 1, 2 and 3.
  • Page 1 links to nothing.

We can store this in SQL

INSERT INTO Nodes (NodeId, NodeWeight)
VALUES
 (1, 0.25)
,(2, 0.25)
,(3, 0.25)
,(4, 0.25)

INSERT INTO Edges (SourceNodeId, TargetNodeId) VALUES (2, 1) --page 2 links to pages 1 and 3 ,(2, 3) ,(3, 1) --page 3 links to page 1 ,(4, 1) -- page 4 links to the 3 other pages ,(4, 2) ,(4, 3)

Initially, we give all nodes the same weight, 0.25. All nodes start out equal.

Get Ready!

We need to know how many edges each node has pointing away from it. Imagine counting the number of links a web page has going somewhere else. Let's calculate that and store it in our Nodes table.

declare @TotalNodeCount int

set @TotalNodeCount = (select count(*) from Nodes)

UPDATE n --if a node has 0 edges going away then assign it the total # of nodes. SET n.NodeCount = isnull(x.TargetNodeCount, @TotalNodeCount) FROM Nodes n LEFT OUTER JOIN ( SELECT SourceNodeID, TargetNodeCount = count(*) FROM Edges GROUP BY SourceNodeId ) as x ON x.SourceNodeID = n.NodeId

Get Set!

Let's look at the most important part: calculating how we transfer weight from source nodes to targets.

  • The weight of each target node is the sum of the nodes' weights that point to it, sum(n.NodeWeight)
  • Divide each source node's weight by the number of nodes it links to, n.NodeCount
  • Only transfer part of the weight, multiplying by the @DampingFactor, 0.85

SELECT
     e.TargetNodeId
    ,TransferWeight = sum(n.NodeWeight / n.NodeCount) * @DampingFactor
FROM Nodes n
INNER JOIN Edges e
  ON n.NodeId = e.SourceNodeId
GROUP BY e.TargetNodeId

You can see there's a damping factor, which is the percentage of a node's weight that gets transferred via its edge to another node.

We also include a margin of error, which is a small number that represents an acceptable amount of precision.

Now we're ready to run PageRank.

GO!

declare @DampingFactor decimal(3,2) = 0.85 --set the damping factor
    ,@MarginOfError decimal(10,5) = 0.001 --set the stable weight
    ,@TotalNodeCount int
    ,@IterationCount int = 1

-- we need to know the total number of nodes in the system set @TotalNodeCount = (select count(*) from Nodes)

-- iterate! WHILE EXISTS ( -- stop as soon as all nodes have converged SELECT * FROM dbo.Nodes WHERE HasConverged = 0 ) BEGIN

<span class="k">UPDATE</span> <span class="n">n</span> <span class="k">SET</span> 
<span class="n">NodeWeight</span> <span class="o">=</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="o">@</span><span class="n">DampingFactor</span> <span class="o">+</span> <span class="k">isnull</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">TransferWeight</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">)</span>

<span class="c1">-- a node has converged when its existing weight is the same as the weight it would be given</span>
<span class="c1">-- (plus or minus the stable weight margin of error)</span>
<span class="p">,</span><span class="n">HasConverged</span> <span class="o">=</span> <span class="k">case</span> <span class="k">when</span> <span class="k">abs</span><span class="p">(</span><span class="n">n</span><span class="p">.</span><span class="n">NodeWeight</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span><span class="p">.</span><span class="mi">0</span> <span class="o">-</span> <span class="o">@</span><span class="n">DampingFactor</span> <span class="o">+</span> <span class="k">isnull</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">TransferWeight</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">)))</span> <span class="o">&lt;</span> <span class="o">@</span><span class="n">MarginOfError</span> <span class="k">then</span> <span class="mi">1</span> <span class="k">else</span> <span class="mi">0</span> <span class="k">end</span>
<span class="k">FROM</span> <span class="n">Nodes</span> <span class="n">n</span>
<span class="k">LEFT</span> <span class="k">OUTER</span> <span class="k">JOIN</span>
<span class="p">(</span>
    <span class="c1">-- Here&#39;s the weight calculation in place</span>
    <span class="k">SELECT</span>
        <span class="n">e</span><span class="p">.</span><span class="n">TargetNodeId</span>
        <span class="p">,</span><span class="n">TransferWeight</span> <span class="o">=</span> <span class="k">sum</span><span class="p">(</span><span class="n">n</span><span class="p">.</span><span class="n">NodeWeight</span> <span class="o">/</span> <span class="n">n</span><span class="p">.</span><span class="n">NodeCount</span><span class="p">)</span> <span class="o">*</span> <span class="o">@</span><span class="n">DampingFactor</span>
    <span class="k">FROM</span> <span class="n">Nodes</span> <span class="n">n</span>
    <span class="k">INNER</span> <span class="k">JOIN</span> <span class="n">Edges</span> <span class="n">e</span>
      <span class="k">ON</span> <span class="n">n</span><span class="p">.</span><span class="n">NodeId</span> <span class="o">=</span> <span class="n">e</span><span class="p">.</span><span class="n">SourceNodeId</span>
    <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">e</span><span class="p">.</span><span class="n">TargetNodeId</span>
<span class="p">)</span> <span class="k">as</span> <span class="n">x</span>
<span class="k">ON</span> <span class="n">x</span><span class="p">.</span><span class="n">TargetNodeId</span> <span class="o">=</span> <span class="n">n</span><span class="p">.</span><span class="n">NodeId</span>

<span class="c1">-- for demonstration purposes, return the value of the nodes after each iteration</span>
<span class="k">SELECT</span>
    <span class="o">@</span><span class="n">IterationCount</span> <span class="k">as</span> <span class="n">IterationCount</span>
    <span class="p">,</span><span class="o">*</span>
<span class="k">FROM</span> <span class="n">Nodes</span>

<span class="k">set</span> <span class="o">@</span><span class="n">IterationCount</span> <span class="o">+=</span> <span class="mi">1</span>

END

Victory!

This example takes 5 iterations to complete. It turns out computing PageRank for the entire world wide web takes only 100 iterations.

I'd recommend you try this out yourself. My code is available on GitHub.

Happy Coding!

Permalink

SQL Saturday 265 - Tips for Presenting

05 November 2013

This previous Saturday was SQL Saturday Oregon. My favorite was a workshop Q&A about how to give technical presentations, organized by Jes Borland. Present were a good mix of newcomers, experienced speakers, and inexperienced speakers. Here's what I took away from the session.

Why Should I Speak?

  • Technical presentations make you more comfortable when speaking at work.
  • Technical presentations make you more comfortable speaking in public, which is a great skill.
  • You meet new and interesting people. The networking is great.
  • It's a great way to conquer a fear.

Getting Started

  • Use the 'Rule of 3' - present 3 key ideas, and that's it.
  • Rehearse. A lot.
  • Record yourself. See how you look afterwards. Adjust. Repeat
  • Relax before you get on stage. Take a deep breath.
  • Start off someplace friendly. Work "lunch brownbags" are good. SQL Saturdays and user groups are great.
  • Tell people to ask questions at the end. Then they won't interrupt when you're speaking.
    • If you feel confident that you can get back on topic, allow questions during the presentation.
  • Do not go on tangents.
  • If you are running short of time and need to skip ahead, don't advance through many different slides if you need to catch up. People feel like they're missing out.
    • Bring up a demo, flip to the right slide, and then flip back
    • Or, go directly to the appropriate slide if you're running in 'Presenter View'
  • Add a smiley face to your laptop background or post-it. It'll make you smile more. That helps the audience bond with you.
  • Do not, DO NOT B.S. your way out of question you can't answer. Say "I don't know, let's work on it together afterwards" and keep going.
  • Most of the session demand is at the 200 level (advanced beginner or intermediate). Most people wear many hats, and they're trying to learn and progress. They haven't specialized enough to need a 400- or 500-level talk because they can't.
  • Add a little humor, especially self-deprecating humor.

Titles and Abstracts

  • First, start with ideas.
  • Then build an abstract from those ideas
  • Build an outline
  • Send the abstract (and outline) to speakers who are approachable. Ask for their help. They probably will.
  • Titles are difficult, and important. Often people will only look at the titles
  • Putting funny jokes, or pop-culture references in titles doesn't translatein large, culturally diverse events.
  • Having a slightly cute presentation title depends on the conference culture. SQL Saturdays and PASS Summit love slightly cute titles. Tech-Ed really doesn't.

Audience Selection

  • Explain what the session is about. Then tell people it's OK to leave if it isn't what they expect. This is a good thing.
    • It's not personal.
  • You end up with a more engaged audience, and fewer difficult audience members.
  • This improves your session evaluations, which ask "did you learn what you were expecting to learn" and "did the session meet your expectations"

Being present

  • Look at your audience, not the screen
  • Use a presenter mouse or presenter 'wand' and walk around. You can advance the slides remotely. This works great for slides, not demos.
  • Ask for a show of hands. It keeps people engaged.

How to Keep From Speaking Too Quickly

  • Practice.
  • Add time marks on the slides. Slow down if you see yourself going too fast.

Difficult Audience Members

  • They're relatively rare.
  • Most common type is someone who wants to be the "smartest person in the room", and stump the presenter
  • Say, "That's interesting, but a bit off topic. If you and anyone else wants to chat, come up afterwards. We can work on it together."

Demos

  • Practice them
  • NEVER, EVER type in a demo
  • Optionally, record a demo, and play it back live.

How to Recover When You Bomb Onstage

  • Laugh
  • Apologize. "Oops. Lost my train of thought. Sorry about that"
  • Take a deep breath
  • Move on
  • Practice screwing up in rehearsal. Practice recovering.
  • Have a plan B if things go sideways, especially for a demo
Permalink