PASS Summit - A Veteran's Guide

02 November 2012

There have been some excellent posts about SQL PASS Summit: more advice than you can shake a stick at. I'm going to focus on 3 topics: picking sessions, endurance, and follow up.

Picking Sessions

This will be my 5th PASS Summit. I have experimented with different approaches to picking sessions, and come up with a guide. Time is valuable.

  1. Go to Dr. David DeWitt's keynote/session. He is an intellectual powerhouse, and never to be missed.
  2. Narrow down your choices to the better presenters; don't filter by subject yet. A good presenter can turn a topic from dry to riveting. A mediocre presenter does the reverse.
  3. Pick a mix of relevant topics. As a database developer, I choose a mix of performance tuning, T-SQL tips and tricks, architecture discussions, and professional development. Again, only consider the best presenters.
  4. Don't judge a session by its level. 500-level sessions are always advanced; everything else depends on the presenter.
  5. Pick the session where you will ask more questions.
  6. Lightning talks are great. Go to them if nothing else awesome is happening.
  7. If nothing fits for a given time slot, take a break. Write down your notes, go for a walk, let your batteries recharge. Be social.
  8. Sessions presented by Microsoft execs can be hit-or-miss.
  9. After each session, take notes on whether you like the topic and presenter. Use that for future reference.

Endurance

The week of PASS is grueling: 18-hour days of mental and social stimulation are common.

  • Don't expect to remember what you've seen in a session. Write it down for later.
  • Drink lots of liquid. I have a cup of coffee, a bottle of water before lunch, water with lunch, 2 bottles of water in the afternoon, and a beer with dinner.
  • Eat lightly at lunch. Otherwise you will be sleepy in the afternoon.
  • Get the session recordings. There are always conflicts where two great sessions are happening at the same time. The recordings are great for this. They're also a great deal.
  • Use paper. Paper never runs out of batteries, is lighter than anything electronic, and is amazingly versatile when you want to draw diagrams.
  • Chat with 2-3 new people each session. Learn a bit about what they're doing, their expertise and challenges. Write this down on their business card. By the end of PASS you'll have met a couple dozen people.
  • Wisdom often sounds simplistic, obvious, or old. Don't dismiss an idea or technique for those reasons; the best ideas are simple and obvious in hindsight.
  • Drink in extreme moderation. Few things are more pointless than attending a session with a hangover.
  • There is always something happening in the evening. Check Twitter to find out what. Barring that, go to Tap House; it's the de facto watering hole.

Follow Up

The week of PASS Summit is too much for anyone to absorb fully. So, don't expect to. A lot can be done in the following weeks.

  • Go over your notes. Try out what you've learned. Doing something is the next step in learning after hearing about it.
  • Follow up with people you've met. If you know of a blog post or contact particularly relevant to their job, role, or challenge, share your knowledge.
  • Watch the sessions with your team. Lunch brownbags are great for this. Bring popcorn.

I hope to see you at PASS.


Data Scientist, Part 1.1

26 October 2012

We learn from other people. We often learn from others with similar interests. Who are they? How do we find them?

I am studying at the University of Washington to become a data scientist. My first course, Introduction to Data Science, started a few weeks ago. All of the students completed a survey. Our professor, Dr. Bill Howe, made the anonymized results public. I put the data and my analysis on GitHub for anyone to see.

I am interested in three questions:

  1. What data science topics are students most interested in?
  2. How similar are students' preferences?
  3. How should we group students together?

Pick The Winner

Which data science topics are the most popular? Let's use boxplots to find out; they are an effective way to summarize this data.

This can be done in 7 lines of R code, once ggplot2 is loaded:

library(ggplot2) #provides ggplot() and geom_boxplot()

survey <- read.csv(file="ReshapedResponses.csv", header=TRUE)
survey[,3] <- survey[,3]+2 #go from range (-2 to 2) to (0 to 4)
orderedSurvey = with(survey, reorder(Question, Response, median))

p <- ggplot(data=survey, aes(x=Response, y=orderedSurvey))+
  geom_boxplot(notch=TRUE)+
  labs(x="Importance", y="Question")
plot(p)

The results are clear. My fellow students want to learn about techniques to work with "big data" or "fairly big data" in both practical and abstract ways. There is also a desire to understand machine learning.

Am I like You?

My fellow students are a diverse group. How similar are our learning preferences?

If we pick a random student, S1, how similar are their preferences to those of another student, S2? If we can calculate a single metric that measures similarity for one pair of students, we can calculate that metric for all pairs of students. This is a Euclidean distance problem, the kind that comes up in clustering algorithms.
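As a minimal sketch of that metric, here is the distance between two students computed by hand and with R's built-in dist(). The answers for s1 and s2 are invented for illustration; they are not taken from the survey.

```r
# Hypothetical answers from two students to 9 survey questions (scale 0 to 4).
# These values are made up for illustration only.
s1 <- c(4, 3, 0, 2, 4, 1, 3, 2, 4)
s2 <- c(3, 3, 1, 2, 2, 1, 4, 0, 4)

# Euclidean distance: square root of the sum of squared differences.
manual <- sqrt(sum((s1 - s2)^2))

# dist() computes the same metric for every pair of rows in a matrix.
pairwise <- dist(rbind(s1, s2))

print(manual)
print(pairwise)
```

A smaller distance means more similar preferences; two students with identical answers would have a distance of 0.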

We'll use R again, and again it takes 7 lines of code:

 
survey <- read.csv(file="surveyresults.csv",header=TRUE,sep="\t")
survey.clean <- survey[,5:13]+2 #go from range [-2 2] to [0 4]
survey.matrix <- as.matrix(survey.clean) #create a matrix
survey.dist <- dist(survey.matrix) #compute matrix distance
survey.mds <- cmdscale(survey.dist) #compute mds of the distance
survey.mds.df <- as.data.frame(survey.mds) #get a table again
qplot(data=survey.mds.df, x=V1, y=V2) #plot the results

The result shows each student's learning preference compared to their classmates'.

Finding Niches

Which groups of students in that chart have similar learning preferences? How do we find those groups? How big should they be?

Time for another algorithm: k-means clustering. This algorithm is used to find the k best clusters for a set of points. For example, if k=3, it would group all the data into 3 clusters.

Since k is not chosen automatically, we need to find a good value for it. There are a few different approaches to consider. We'll use the elbow method: look at a graph of k vs efficiency, and identify where the line bends (like an elbow). It's at k=5.
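One way to draw such an elbow graph is sketched below. It runs on synthetic points rather than the survey data, and it assumes "efficiency" means the share of total variance explained by the clusters (betweenss / totss from kmeans()); the post does not say which measure was actually used.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Synthetic 2-D points standing in for the MDS coordinates (made-up data):
# three well-separated blobs of 20 points each.
points <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 8, sd = 0.5), ncol = 2)
)

# For each candidate k, record the share of total variance the clusters explain.
ks <- 1:8
explained <- sapply(ks, function(k) {
  fit <- kmeans(points, centers = k, nstart = 10)
  fit$betweenss / fit$totss
})

# The "elbow" is the k after which adding another cluster stops helping much.
plot(ks, explained, type = "b", xlab = "k", ylab = "Share of variance explained")
```

With this synthetic data the curve should flatten after k=3, because the points were generated as three blobs. Real survey data will rarely bend so sharply.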

Great! Let's see the clusters. We can do that with 2 lines of R code:

 
survey.kmeans <- kmeans(x=survey.mds, centers=5)
qplot(data=survey.mds.df, x=V1, y=V2, color=factor(survey.kmeans$cluster))

We can see there are 5 groups. They are 3, 6, 8, 8, and 12 people in size, and cover ~80% of the variation in learning preferences between students. That's pretty good. If this data weren't anonymous, a teacher could use it to group together students with similar interests. Success!
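Group sizes and a "variation covered" figure like the ones above can be read from the object that kmeans() returns. The sketch below uses 37 random points as a stand-in for the survey, and it assumes the ~80% corresponds to betweenss / totss, which the post does not state explicitly.

```r
set.seed(1)  # fix the random start so the run is repeatable

# 37 made-up 2-D points standing in for the students' MDS coordinates.
points <- matrix(rnorm(74), ncol = 2)

fit <- kmeans(points, centers = 5, nstart = 10)

# Number of points assigned to each of the 5 clusters.
sizes <- fit$size

# Share of total variance explained by the grouping (between / total sum of squares).
covered <- fit$betweenss / fit$totss

print(sort(sizes))
print(round(covered, 2))
```

The sizes and percentage printed here come from random data, so they will not match the survey's numbers.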

Resources and Next Steps

This may seem intimidating for IT professionals or developers. It isn't. There are fantastic resources available. You just need data, a computer, time, and curiosity.

I have progressed this far using only:

What's next? More data. More algorithms. More questions. And above all, more insight.
