PASS Summit Attendance and Predictions

17 December 2014

Technical conferences live and die by their community. A conference with an engaged audience and talented speakers will be very successful. The goal of conference organizers, then, is to identify and develop good speakers and good content, so their audience thrives.

I joined the PASS Programs Team earlier this year to help predict session attendance for PASS Summit 2014. These predictions were used to help assign different-sized rooms for sessions.

However, the true test of a prediction is to compare it to reality. In this case, we have the predictions from the analysis I did before Summit, and I now have the conference data on attendance and attendee feedback.

This post will analyze session attendance and compare it with the predictions.

Where Is Everybody

How many people were attending Summit at different times?

The most popular times were midday (5,400 on Wednesday) and early afternoon (4,300 on Thursday). The mornings were relatively empty. Friday was quieter as well, with fewer than 1,300 people in attendance.

Measuring Error

With any prediction project, the definition of an error metric is critically important. There are a few classic examples:

  • The difference between prediction and reality
  • The percentage difference between prediction and reality
  • Root mean squared error (RMSE)
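Here is a minimal sketch of those three metrics in Python. The function name and the sample numbers are made up for illustration; they are not the actual Summit data.

```python
import math

def error_metrics(predicted, actual):
    """Return per-session differences, percentage differences, and overall RMSE."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    # Plain difference: positive means we predicted too many people.
    diffs = [p - a for p, a in zip(predicted, actual)]
    # Percentage difference relative to actual attendance (None if nobody came).
    pct_diffs = [(p - a) * 100 / a if a else None for p, a in zip(predicted, actual)]
    # RMSE: square each difference, average, then take the square root.
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return diffs, pct_diffs, rmse

# Hypothetical predicted and actual head counts for three sessions.
predicted = [120, 80, 200]
actual = [100, 100, 150]
diffs, pct_diffs, rmse = error_metrics(predicted, actual)
print(diffs, pct_diffs, round(rmse, 1))
```

Note that RMSE punishes a few large misses much more than many small ones, since each difference is squared before averaging.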

It turns out my predictions were wildly bad. Some sessions had a predicted attendance of 323... and 12 people showed up. That's just awkward.

Redefining Error

My predictions of sessions were inaccurate using common error metrics. They were also useful.

Overcrowded sessions are worse than empty sessions. It's OK for a session room to be half empty. It's only awkward when a room is 85% empty or so.

However, it's really bad when a session is even 5% over capacity, because it means people are standing, getting turned away, etc.

Let's redefine "error" to mean underprediction: when more people show up to a session than predicted.

There were just 2 sessions that were underpredicted (the ones above the dotted line, above).
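As a rough sketch, this one-sided definition of error looks something like the following in Python. The session records and field names are hypothetical, not the real Summit data set.

```python
def shortfall(predicted, actual):
    """One-sided error: how many more people showed up than predicted (never negative)."""
    return max(0, actual - predicted)

def underpredicted(sessions):
    """Return only the sessions where actual attendance exceeded the prediction."""
    return [s for s in sessions if shortfall(s["predicted"], s["actual"]) > 0]

# Hypothetical sessions: one badly over-predicted, one under-predicted, one exact.
sessions = [
    {"title": "Session A", "predicted": 323, "actual": 12},
    {"title": "Session B", "predicted": 150, "actual": 180},
    {"title": "Session C", "predicted": 90, "actual": 90},
]
misses = underpredicted(sessions)
print([s["title"] for s in misses])
```

Under this definition, Session A counts as zero error even though the prediction was far too high, which is exactly the trade-off described above.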

Musical Chairs

People don't sit right next to each other at conferences. It's not considered polite (because of elbow room, not enough deodorant, etc.). A room will feel full well before all of the chairs are occupied.

Let's count the number of sessions that were over 90% full.

Year    Sessions    % of Sessions
2011    32          14%
2012    50          17%
2013    51          17%
2014    18          8%

That's an improvement. Let's look at our success criterion, sessions that are 20-89% full:

Year    Sessions    % of Sessions
2011    181         77%
2012    202         67%
2013    196         65%
2014    154         64%

We can see that the % of sessions with a good attendance range stayed about the same as in 2012 and 2013. That's because our other failure case, rooms that are less than 20% full, grew:

Year    Sessions    % of Sessions
2011    23          10%
2012    51          17%
2013    55          18%
2014    67          28%
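The three buckets used in the tables above can be sketched in a few lines of Python. The exact boundaries are an assumption (here a room at exactly 90% full counts as overcrowded), and the sample rooms are hypothetical; the 2014 counts are the ones from the tables.

```python
from collections import Counter

def occupancy_bucket(attendance, capacity):
    """Classify one session by how full its room was."""
    fill = attendance / capacity
    if fill >= 0.90:   # assumed boundary for "over 90% full"
        return "overcrowded"
    if fill >= 0.20:   # the 20-89% success range
        return "good"
    return "mostly empty"

def bucket_shares(counts):
    """Turn per-bucket session counts into whole-number percentages."""
    total = sum(counts.values())
    return {bucket: round(100 * n / total) for bucket, n in counts.items()}

# Hypothetical (attendance, capacity) pairs.
rooms = [(95, 100), (60, 100), (12, 100)]
print(Counter(occupancy_bucket(a, c) for a, c in rooms))

# 2014 session counts taken from the three tables.
print(bucket_shares({"overcrowded": 18, "good": 154, "mostly empty": 67}))
```

Run on the 2014 counts, the shares come out to 8%, 64%, and 28%, matching the tables.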

We've traded a painful problem (overcrowded sessions) for an awkward one (mostly-empty rooms). We still had 18 sessions that were overcrowded. Clearly we should do better next year.

However, this year's PASS Summit had the fewest overcrowded rooms since 2011. During the conference, I heard anecdotes that rooms were better allocated this year. I'll call that a win.

Keep Going

There's no reason I should be the only person looking at this data. The #sqlpass community is large and technically savvy. To that end, I've made all of the raw data public.

I didn't account for the time of day, conference fatigue, or the presentation topics. There's quite a bit of additional work we can do to improve session attendance predictions and room allocation.

You can contact me anytime via Twitter (@DevNambi) or email (me@devnambi.com).

Cheap Computing with AWS Spot Instances

16 November 2014

You can run Amazon Web Services' VMs cheaply by bidding for computing capacity, using Spot Instances. The virtual machines (instances) you get are identical to on-demand VMs. The only difference is the pricing.

To do this, you request spot instances, and specify the maximum bid per hour you'll pay. If your maximum bid is more than the current bid for that type of VM, your request is granted. As long as the current bid price is less than your maximum price, you'll keep your computing capacity. If your maximum bid is ever less than the current bid price, then your instance is destroyed and its capacity given to a higher bidder.
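The bidding rule above can be sketched as a tiny simulation. This is an illustration only: the function name and prices are hypothetical, the price is sampled once per hour, and it assumes you are billed at the current spot price rather than at your maximum bid.

```python
def simulate_spot_instance(max_bid, hourly_prices):
    """Walk through hourly spot prices and return (hours_run, total_cost).

    The instance runs while the current price is at or below max_bid and is
    terminated the first time the price rises above it.
    """
    hours_run = 0
    total_cost = 0.0
    for price in hourly_prices:
        if price > max_bid:
            break  # outbid: the instance is destroyed
        hours_run += 1
        total_cost += price  # assumption: billed at the market price, not the bid
    return hours_run, total_cost

# Hypothetical hourly spot prices in dollars.
prices = [0.10, 0.12, 0.30, 0.11]
hours, cost = simulate_spot_instance(max_bid=0.25, hourly_prices=prices)
print(hours, round(cost, 2))
```

With a maximum bid of $0.25, the instance survives the first two hours and is lost in the third, when the price jumps to $0.30.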

My first reaction to spot instances was disbelief. Why should I use a virtual machine that can be destroyed at a moment's notice? I'd never get any work done! Then I saw the price tag...

Spot instances are far cheaper than their on-demand brethren.

If you specify a high maximum bid (say, 3x the on-demand price), here's the daily cost over 90 days to run 7 different instance types in the Oregon (us-west-2) region:

Instance Type    On-Demand    Spot     Discount
t1.micro         $0.48        $0.13    73%
m3.medium        $1.67        $3.91    -134%
c3.xlarge        $12.39       $2.30    54%
r3.4xlarge       $10.00       $6.82    80%
h1.4xlarge       $73.96       $5.91    92%
g2.2xlarge       $15.50       $2.71    82%
cc2.8xlarge      $47.77       $7.14    85%

Not everything is a deal. A few VMs (like m3.mediums) were more expensive than on-demand VMs. However, most types and locations, including most of the powerful choices, were much less expensive.
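For reference, the discount is just the percentage saved relative to the on-demand price. Here is a small sketch using two rows from the table; the function name is made up for illustration.

```python
def spot_discount(on_demand, spot):
    """Percentage saved by running on spot instead of on-demand (negative = costs more)."""
    return (on_demand - spot) / on_demand * 100

# Daily costs from the table above.
print(round(spot_discount(0.48, 0.13)))  # t1.micro
print(round(spot_discount(1.67, 3.91)))  # m3.medium
```

The t1.micro row works out to a 73% discount, while the m3.medium row comes out at -134%, meaning spot cost more than twice the on-demand price.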

When computing capacity is this cheap, economics starts to change. It may be cheaper to spend developer time to re-architect an application to run on spot instances. A company that runs dozens or hundreds of AWS instances may well save money by using spot instances.

Startups have it easy. They can build their system architectures to use this environment from the start. Of the 30-odd startups where I have contacts, all of them use spot instances widely, to save money.

There are several approaches to doing so, including:

  • Run everything statelessly, using queues and permanent storage.
  • Run spot instances as part of resilient architectures that can compensate for failures.
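The stateless approach can be sketched like this. It is a toy model, not a production design: an in-memory `queue.Queue` stands in for a durable queue service, a dict stands in for permanent storage, and `terminate_after` is a hypothetical knob that simulates a spot instance being reclaimed.

```python
import queue

def run_worker(tasks, results, terminate_after=None):
    """Pull tasks until the queue is empty or the (simulated) instance is terminated.

    The worker keeps no state of its own: unfinished work stays in the queue
    and finished work lives in `results`, so any replacement can take over.
    """
    handled = 0
    while terminate_after is None or handled < terminate_after:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            break
        results[task] = task * task  # stand-in for real work, written to storage
        handled += 1
    return handled

tasks = queue.Queue()
for n in range(1, 6):
    tasks.put(n)
results = {}

first = run_worker(tasks, results, terminate_after=2)  # instance lost after 2 tasks
second = run_worker(tasks, results)                    # replacement finishes the rest
print(first, second, results)
```

A real system also has to handle a task that was in flight when the instance died, for example by only removing it from the queue after its result has been stored.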

It's All About the Money

To learn more about spot instance pricing, let's look at history. This is easy; AWS exposes a price history API and documents it, so downloading the data is straightforward. I pulled down 90 days' history for every Amazon region, availability zone (AZ), and instance type.
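One way to pull this history today is the EC2 `DescribeSpotPriceHistory` call through boto3. The sketch below is an assumption about tooling, not a record of what was actually used for this post; the helper name `fetch_spot_history` is made up, and the client is passed in so the function itself needs no credentials to import.

```python
from datetime import datetime, timedelta, timezone

def fetch_spot_history(ec2_client, instance_type, days=90, product="Linux/UNIX"):
    """Yield (timestamp, availability_zone, hourly_price) for one instance type."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    # The paginator follows NextToken for us across result pages.
    paginator = ec2_client.get_paginator("describe_spot_price_history")
    pages = paginator.paginate(
        StartTime=start,
        EndTime=end,
        InstanceTypes=[instance_type],
        ProductDescriptions=[product],
    )
    for page in pages:
        for record in page["SpotPriceHistory"]:
            # SpotPrice comes back as a string, so convert it to a float.
            yield (
                record["Timestamp"],
                record["AvailabilityZone"],
                float(record["SpotPrice"]),
            )

# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-2")
#   rows = list(fetch_spot_history(ec2, "m3.medium"))
```

The call is per region, so covering every region means creating one client per region and repeating the request.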

All of the prices below are the median daily cost to run a VM with an infinite bid, unless noted otherwise.
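To make "median daily cost with an infinite bid" concrete, here is a rough sketch of the calculation. It assumes the price history is a list of price-change events and that an instance that is never outbid pays the price in effect at the top of each hour; the function names and sample prices are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from statistics import median

def daily_costs(events, start, days):
    """Cost of running one VM for each of `days` days, starting at `start`.

    `events` is a time-sorted list of (datetime, hourly_price) price changes,
    and the first event must be at or before `start`.
    """
    times = [t for t, _ in events]
    costs = []
    for d in range(days):
        day_start = start + timedelta(days=d)
        total = 0.0
        for h in range(24):
            moment = day_start + timedelta(hours=h)
            idx = bisect_right(times, moment) - 1  # latest change at or before this hour
            if idx < 0:
                raise ValueError("no price known at %s" % moment)
            total += events[idx][1]
        costs.append(total)
    return costs

def median_daily_cost(events, start, days):
    """Median of the per-day costs over the window."""
    return median(daily_costs(events, start, days))

# Hypothetical price changes over three days.
start = datetime(2014, 11, 1)
events = [
    (datetime(2014, 11, 1, 0), 0.01),
    (datetime(2014, 11, 2, 12), 0.02),
    (datetime(2014, 11, 3, 0), 0.05),
]
print(daily_costs(events, start, 3), median_daily_cost(events, start, 3))
```

Using the median rather than the mean keeps a single day with a price spike from dominating the number.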

Prices by Region

To start, prices vary dramatically by location. The discounts you see below are the % difference in price between a spot instance and an on-demand instance in the same location.

As we can see from the graph above, the median discount between on-demand and spot instances to run a VM for a day ranges from 38% (São Paulo) to 72% (Oregon).

Prices by AZ and Instance Category

Looking deeper, some Availability Zones (AZs) have far larger discounts than others. A few instance categories are really cheap.

Not all AZs have the same discounts, even in the same region. This doesn't make sense; it's evidence of inefficient bidding by AWS users. However, it's fantastic news for bargain hunters.

For example, if I had a workload in Northern Virginia (us-east-1) that needed a general-purpose instance, I'd pick the AZ with a 39% discount (us-east-1b) instead of the one with the -20% discount (us-east-1d).

The differences are larger across regions. In Oregon (us-west-2) we could run 30 m3.2xlarge VMs for less than $100 a day, instead of ~$400 a day for on-demand instances.

That's $100 a day for 900GB of RAM, 780 compute units, and 4.8TB of SSD storage.

Prices by Instance Type

Some instance types have reliably larger discounts.

High-memory types (the r3 family), the h1.4xlarge storage type, and cluster computing types often have deep discounts.

Hunting for Deals

Now let's look for the biggest deals we can find. Let's look at every single instance type, per region, per AZ.

The best deals are in the upper right, which have the most cores per $ and the most GB of RAM per $.

For the last 90 days, the best deal was in the Tokyo region (ap-northeast-1). You could run a cr1.8xlarge instance (244GB of RAM, 88 compute units, 240GB of SSD) for $9 a day, instead of the usual $98 a day.

Let's say we have a large, distributed-computing workload. Common examples are physics simulations, genome sequencing, or web log analysis. We could spend $252 for 4 cr1.8xlarge instances:

  • 976GB of RAM
  • 352 compute units
  • 960GB of SSD
  • 10Gb networking
  • 168 hours to get your work done
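The numbers in this list are simple multiplication: 4 instances at $9 a day for 7 days (168 hours). A quick sketch, with a made-up helper name, to check the arithmetic:

```python
def cluster_totals(count, days, daily_price, ram_gb, compute_units, ssd_gb):
    """Total cost and pooled resources for `count` identical instances."""
    return {
        "cost": count * days * daily_price,
        "ram_gb": count * ram_gb,
        "compute_units": count * compute_units,
        "ssd_gb": count * ssd_gb,
        "hours": days * 24,
    }

# Four cr1.8xlarge instances for a week, using the Tokyo figures quoted above.
totals = cluster_totals(count=4, days=7, daily_price=9,
                        ram_gb=244, compute_units=88, ssd_gb=240)
print(totals)
```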

When I do distributed data processing work, I dream about having resources like this.

Cloud Myths

Let's use this historical pricing to look at some myths about spot instances:

Don't use overseas datacenters, because they're too expensive

We already looked at price discounts by region and AZ. There are some regions that don't have huge discounts (São Paulo), but many others that do (Tokyo, Singapore, Australia, Ireland).

I'm guessing this comes from the fact that some overseas regions don't have all of the instance types yet, and the missing ones include some instance categories that have deep discounts, like the high-storage category.

GPU instances are expensive; Bitcoin miners are eating up all of the capacity

Again, no. The g2.2xlarge GPU-specific instance has a median discount of 83% across all regions. If we look at this instance type across all regions and AZs, we can see that the typical cost to run one is in the $2.1-$4 a day range, which is far cheaper than its $15.40 a day on-demand price.

Big instances don't help your application

This is often true. Very, very few developers or sysadmins know how well their applications scale, because they don't have the time or resources to test them under varying load, and on different computers of different sizes. That's a topic for another day.

If your workload doesn't benefit from having lots of memory or cores in a single machine, then you're better off running smaller VMs with good single-threaded CPU speed (the c3 family).

The cheapest spot price for c3 spot instances over 90 days was $0.70 a day for c3.large VMs in the Tokyo region. Those VMs have 2 Ivy Bridge cores, 3.75GB of RAM, and 32GB of SSD.

If your workload can be broken down into small, independent chunks of work (still single-threaded), you could spend $20 a day for 28 of those VMs.

The core question is never "how do I get the biggest computer for cheap"; it's "how do I do my work for the least amount of money".

Don't run big instances because they're more expensive than smaller ones

Let's say you do know how well your applications scale. If your workload parallelizes well and works faster with lots of RAM, SSD, and compute cores, then larger instances are a great deal. Optimize for RAM-per-dollar, or cores-per-dollar, or SSD-per-dollar. In that case, your cheapest options are:

CPU Bottleneck

For this, you want an instance with as many compute units as possible. If we look at compute units per $, the cheapest options have been the cc2.8xlarge and g2.2xlarge instances, usually in the Oregon or Tokyo regions. You can run a cc2.8xlarge instance and its 88 compute units for as little as $7 a day.

Memory Bottleneck

For this, you want an instance with as much memory as possible. If we look at GB of RAM per $, the cheapest options have been the cr1.8xlarge and r3.8xlarge instances, usually in the Tokyo (ap-northeast-1) region. You can run instances with 244GB of RAM for as little as $9 a day.

I/O Bottleneck

For this, you want an instance with SSD storage. If we look at SSD per $, the cheapest options have been the storage-centric h1.4xlarge instances. A single instance has 2,048GB of SSD storage and can be run for as little as $5.20 a day. That's ~400GB of SSD for a dollar a day.
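Picking an instance by bottleneck boils down to dividing a resource by the daily price and taking the largest result. The sketch below uses only the lowest prices and specs quoted in this post; the data structure and function name are made up for illustration, and specs not mentioned in the text are simply left out.

```python
# Lowest daily spot prices and specs quoted in this post.
INSTANCES = {
    "cc2.8xlarge": {"daily_price": 7.00, "compute_units": 88},
    "cr1.8xlarge": {"daily_price": 9.00, "ram_gb": 244, "compute_units": 88, "ssd_gb": 240},
    "h1.4xlarge": {"daily_price": 5.20, "ssd_gb": 2048},
    "c3.large": {"daily_price": 0.70, "ram_gb": 3.75, "ssd_gb": 32},
}

def best_per_dollar(instances, resource):
    """Return (instance name, units of `resource` per dollar per day) for the best deal."""
    per_dollar = {
        name: spec[resource] / spec["daily_price"]
        for name, spec in instances.items()
        if resource in spec  # skip instances whose spec for this resource isn't listed
    }
    name = max(per_dollar, key=per_dollar.get)
    return name, per_dollar[name]

for resource in ("compute_units", "ram_gb", "ssd_gb"):
    name, value = best_per_dollar(INSTANCES, resource)
    print(resource, name, round(value, 1))
```

With these figures, the cc2.8xlarge wins on compute units per dollar, the cr1.8xlarge on RAM per dollar, and the h1.4xlarge on SSD per dollar, at roughly 394GB per dollar a day.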

Be Contrary

Spot pricing is complicated, because it's the combination of several different topics:

This is the area where quantitative finance folks thrive. Lucky for us, there's a simple way to find deals when bidding:

Buy What's Not Popular
