The Seattle Art Museum with Data

05 November 2013

How can data help the Seattle Art Museum?

This was a thought exercise from my last University of Washington Data Science class.

In the Beginning was the Context

The goal of the Seattle Art Museum (SAM) is to be "a welcome place for people to connect with art and to consider its relationship to their lives." Art is undergoing critical challenges. The emphasis on STEM in schools and civil discourse has de-emphasized (and often de-funded) arts and the humanities for the next generation.

The amount of content people consume is increasing, but it's all online. Art museums can't compete against Reddit for mass appeal.

People aren't doing very well, either. The average family has less income and job stability than 30 years ago, even as the US GDP is up 350% and the stock market is up 500+%. Chances are you have more immediate things on your mind than art.

Art and Relevance

Google searches for "art museum" have been trending steadily downward over the past several years.

In 2011 SAM had 600K visitors, which is ~16% of the Seattle metro population. It's worse than that, because many visitors are tourists or repeat customers. A more useful number is the membership count, 48K, which is ~1.3% of the Seattle metro population.

Oh, and that was the most popular year SAM ever had, because of a Picasso exhibit.

I asked 30 random people in downtown Seattle what they thought about the art museum, which was less than 3 blocks away.

  • 14 of them didn't know there was an art museum there
  • 13 of them had never been there, and didn't seem interested
  • 3 had been to the art museum, but wouldn't go again soon because "it's too expensive" and "it wasn't interesting"
  • 0 of them knew it was there and would go again.

The plural of anecdote is not data. However, this experience reinforced what I found via Google Trends: most people don't think of art at all, let alone in their daily routine. Therefore the biggest challenge is to find ways to make art more relevant to people's lives.

Goals and Ladders

The best way to tackle a huge goal is to break it down into its key pieces.

  • Figure out how art can be relevant for different groups of people
  • Tailor SAM to different groups of people
  • Find ways to bring those different groups of people to SAM itself

That's it. You'll notice that making SAM more profitable isn't on the list. It's a secondary goal at best, because there's no point in having a rich museum that nobody visits.

From Goals To Questions

Once we have goals, let's start asking questions. Let's think of art museums as fundamentally a data problem, with the potential for data-driven results.

Groups of People

People are different, and react to experiences differently. What are the characteristics of SAM visitors? What kind of art does each person enjoy? Why do they come to the museum?

How can we identify why certain groups of people enjoy certain kinds of art? What can we do to make that art even more interesting to them?

Groups of Art

What pieces of art are similar? How can we identify that? How can we identify new art and who will enjoy it?

What is the best way to use this information to tailor current and future art exhibits? What new, interesting art style will be a hit, even if nobody knows it yet?

Art Online

People are spending an increasing amount of time online. That's helpful; art is a very visual medium. How can we use that? What visual services can be used to spread the word about art? Facebook? Pinterest?

After all, 61% of people in the US have smartphones, and 85%+ of people in the US have an Internet connection.

How can art be used to illuminate the current events and daily routines of our lives?

Science and Experiments

A key concept in data science is the scientific method. Each of the questions above can be measured and tested using controlled experiments.

Figure out how art is relevant for different groups of people

People spend more time near art they enjoy. Give every visitor a tablet or guidebook with an RFID tag, and use that to track where each visitor goes inside the museum. Identify the demographics and characteristics of people who visit each exhibit.
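To make the tracking idea concrete, here is a minimal sketch of turning raw tag reads into time spent per exhibit. The read format, the tag IDs, the exhibit names, and the rule that a visitor stays at an exhibit until their next read are all assumptions for illustration; a real RFID system would produce noisier data.

```python
from collections import defaultdict

# Hypothetical tag reads: (visitor_id, exhibit, seconds since opening).
reads = [
    ("tag1", "impressionism", 0),
    ("tag1", "modern", 600),
    ("tag1", "exit", 900),
    ("tag2", "modern", 100),
    ("tag2", "exit", 160),
]

def dwell_seconds(tag_reads):
    """Total seconds per (visitor, exhibit), from each read to that visitor's next read."""
    by_visitor = defaultdict(list)
    for visitor, exhibit, ts in tag_reads:
        by_visitor[visitor].append((ts, exhibit))

    totals = defaultdict(int)
    for visitor, events in by_visitor.items():
        events.sort()  # order each visitor's reads by time
        # Pair every read with the next one; the final read gets no time.
        for (start, exhibit), (end, _) in zip(events, events[1:]):
            totals[(visitor, exhibit)] += end - start
    return dict(totals)
```

Calling `dwell_seconds(reads)` gives one number per visitor and exhibit, which is the raw material for the demographic breakdown.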

But most people don't visit the museum. We need a larger sample, from a broader audience. Let's have a monthly museum 'free day' to cast a wider net, and collect more data.

Once we have enough data, let's build a predictive model to anticipate who will visit what art. Then let's re-arrange the museum and see if our predictions hold true. If they do, then we're onto something.

This is a type of collaborative filtering called user-user similarity.
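As a rough sketch of what user-user similarity could look like here, the code below compares visitors by how they split their time across exhibits, then guesses how long someone would spend at an exhibit they haven't seen. The visitor names, exhibit names, and minute counts are made up, and cosine similarity is just one common choice of measure, not the only one.

```python
import math

# Hypothetical dwell times in minutes: visitor -> {exhibit: minutes}.
dwell = {
    "visitor_a": {"impressionism": 30, "modern": 5},
    "visitor_b": {"impressionism": 25, "modern": 10},
    "visitor_c": {"sculpture": 40, "modern": 5},
}

def cosine_similarity(u, v):
    """Cosine similarity between two sparse dwell-time vectors."""
    shared = set(u) & set(v)
    dot = sum(u[e] * v[e] for e in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(target, data):
    """Return the other visitor whose dwell pattern is closest to target's."""
    others = (name for name in data if name != target)
    return max(others, key=lambda name: cosine_similarity(data[target], data[name]))

def predict_minutes(target, exhibit, data):
    """Similarity-weighted average of other visitors' time at an exhibit."""
    weighted, total = 0.0, 0.0
    for name, visits in data.items():
        if name == target or exhibit not in visits:
            continue
        sim = cosine_similarity(data[target], visits)
        weighted += sim * visits[exhibit]
        total += sim
    return weighted / total if total > 0 else 0.0
```

With this toy data, `most_similar("visitor_a", dwell)` picks `visitor_b`, since both spend most of their time on impressionism. Re-arranging the museum and comparing `predict_minutes` against what people actually do is the experiment.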

Tailor SAM to different groups of people

The previous section's data gives us more than just user behavior; it also tells us about art similarity. If 97% of the people who visit exhibit X also visit exhibit Y, then those exhibits are probably similar.

We can use this to build similarity groupings for art. Let's identify characteristics about each group of art. Then we can use that to identify new kinds of art that we think will bring people to the museum.

This is a type of collaborative filtering called item-item similarity.
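The "people who visit X also visit Y" measure is easy to sketch. The visit logs and the threshold below are hypothetical, and a plain co-visit rate is only one of several ways to score item-item similarity.

```python
from itertools import permutations

# Hypothetical visit logs: each set is the exhibits one visitor stopped at.
visits = [
    {"X", "Y"},
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"Z"},
]

def co_visit_rate(x, y, logs):
    """Fraction of visitors to exhibit x who also visited exhibit y."""
    saw_x = [v for v in logs if x in v]
    if not saw_x:
        return 0.0
    return sum(1 for v in saw_x if y in v) / len(saw_x)

def similar_pairs(logs, threshold):
    """Ordered exhibit pairs (x, y) whose co-visit rate meets the threshold."""
    exhibits = sorted(set().union(*logs))
    return [
        (x, y)
        for x, y in permutations(exhibits, 2)
        if co_visit_rate(x, y, logs) >= threshold
    ]
```

Note that the rate is not symmetric: in this toy data everyone who saw X also saw Y, but only half of the people who saw Z also saw X. That asymmetry is worth keeping in mind before calling two exhibits "similar".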

Find ways to bring those different groups of people to SAM itself

The insight from the previous 2 sections provides the grist for a highly successful marketing campaign.

For example, if men ages 18-25 like impressionist art, then showing Monet paintings in video games would make a lot of sense.

Finally, we would customize the Seattle Art Museum website to display different content depending on the user.


What Now?

"The future is already here - it's just not evenly distributed" - William Gibson 

I suspect what I've described above would be far more advanced than most museums could do. However, it's something that most retail stores do every day. Larger companies like Amazon, Netflix, eBay and Etsy do this far more, because it's a key advantage to their business.

After talking with nonprofits, I realized the biggest challenge isn't coming up with an idea, but finding skilled engineers. That's where you, dear reader, come in. The majority of my readers are data professionals. You have the ability to help a nonprofit or small business grow this way.


PASS Summit 2013 Keynote - Back to Basics

22 October 2013

Dr. David DeWitt recently presented a keynote (video, slides) for PASS Summit 2013 on the new Hekaton query engine. I was impressed by how the new engine design is rooted in basic engineering principles.

First Principles

Software engineers and IT staff are bound to the economics and practicalities of the computing industry. These trends define what we can reasonably do.

1: It's About Latency

Peter Norvig, Director of Research at Google, famously wrote Numbers Every Programmer Should Know, describing the latency of different operations.

When a CPU is doing work, the job of the rest of the computer is to feed it data and instructions. Reading 1MB of data from memory is ~ 800 times faster than reading it sequentially from a disk.

A recent source of hype has been "in-memory" technology. These products are based on a constraint: RAM is far, far faster than the disk or network.

"In-memory" means "stored in RAM". It's hard to market "stored in RAM" as the new hotness when it's been around for decades.

2: It's About Money

The price of CPU cycles has dropped dramatically. So has the cost of basic storage and RAM.

You can buy a 10-core server with 1 terabyte of RAM for $50K. That's cheaper than hiring a single developer or DBA. It is now cost effective to fit database workloads entirely into memory.

3: It's About Humility

I can write code that is infinitely fast, has 0 bugs, and is infinitely scalable. How? By removing it.

The best way to make something faster is to have it do less work.

4: It's About Physics

CPU scaling is running out of headroom. Even if Moore's Law isn't ending, it has transformed into something less useful. Single-threaded performance hasn't improved in some time. The current trend is to add cores.

What software and hardware companies have done is add support for parallel and multicore programming. Unfortunately, parallel programming is notoriously difficult, and runs head-first into a painful problem:

Amdahl's Law

As the amount of parallel code increases, the serial part of the code becomes the bottleneck.
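Amdahl's Law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. Here is a small sketch of that formula; the function and parameter names are my own.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when only parallel_fraction of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

def max_speedup(parallel_fraction):
    """Upper bound on speedup as the core count grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0 else 1.0 / serial_fraction
```

For code that is 95% parallel, 10 cores give a speedup of a bit under 7x, and no number of cores can beat roughly 20x. The remaining 5% of serial code sets the ceiling.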

Luckily for us, truly brilliant people, like Dr. Maurice Herlihy, have invented entirely parallel architectures.

5: It's About Quality Data

"Big Data" is all the rage nowadays. The number and quality of sensors has increased dramatically, and people are putting more of their information online. A few places have truly big data, like Facebook, Google or the NSA.

For most companies, however, the volume of quality data isn't increasing at nearly as rapid a pace. I see this all the time; OLTP databases are growing at a much slower pace than their associated 'big data' click-streams.

6: It's About Risk

Systems are not upgraded quickly. IT professionals live with a hard truth: change brings risk. For existing systems the benefit of change must outweigh the cost.

Many NoSQL deployments are in new companies or architectures because they don't have to migrate and re-architect an existing (and presumably working) system.

Backwards compatibility is a huge selling point. It reduces risk.

7: It's About Overhead

Brilliant ideas don't come from large groups. The most impressive changes come from small groups of dedicated people.

However, most companies have overhead (email, managers, PMs, accounting, etc). It is easy to destroy a team's productivity by adding overhead.

I have been on teams where 3 weeks of design/coding/testing work required 4 months of planning and project approvals.

Overhead drains productive time and morale.

Smart companies realize this and build isolated labs.

The Keynote

Dr. DeWitt's keynote covered how these basic principles contributed to the Hekaton project.

  1. Be Faster, Cheaper: Assume the workload is entirely in memory because RAM is cheap. Optimize data structures for random access.
  2. Do Less Work: Reduce instructions-per-transaction using compiled procedures.
  3. Avoid Amdahl's Law: Avoid locks and latches using MVCC and a latch-free design. The only shared objects I could identify were the clock generator and the transaction log.
  4. Sell to Real People: Build it into SQL Server with backwards compatibility to encourage adoption.
  5. Build It Smartly: Use a small team of dedicated professionals. The Jim Gray Systems Lab has 9 staff and 7 grad students. Microsoft's Hekaton team had 7 people. That's it.


I have hope for the new query engine, but also concerns:

  • It's only in SQL Server Enterprise Edition ($$$$). Microsoft's business folks clearly aren't encouraging wide adoption of this feature.
  • The list of restrictions for compiled stored procedures makes them useless without major code changes.
  • The new cost model and query optimizer will have bugs. It took years of revisions for the existing optimizer to stabilize.

Here Endeth the Lesson:

  1. Make architecture changes based on sound engineering principles.
  2. Assemble a small group of brilliant people, and then get out of the way.