How can data help the Seattle Art Museum?
This was a thought exercise from my last University of Washington Data Science class.
The goal of the Seattle Art Museum (SAM) is to be "a welcome place for people to connect with art and to consider its relationship to their lives." Art faces serious challenges. The emphasis on STEM in schools and civil discourse has de-emphasized (and often de-funded) arts and the humanities for the next generation.
The amount of content people consume is increasing, but it's all online. Art museums can't compete against Reddit for mass appeal.
People aren't doing very well, either. The average family has less income and job stability than 30 years ago, even as the US GDP is up 350% and the stock market is up more than 500%. Chances are you have more immediate things on your mind than art.
Google searches for "art museum" have been trending steadily downward over the past several years.
In 2011 SAM had 600K visitors, which is equivalent to ~16% of the Seattle metro population. It's worse than that, because many visitors are tourists or repeat customers. A more useful number is the membership count: 48K, or ~1.3% of the Seattle metro population.
Oh, and that was the most popular year SAM ever had, because of a Picasso exhibit.
I asked 30 random people in downtown Seattle what they thought about the art museum, which was less than 3 blocks away.
The plural of anecdote is not data. However, this experience reinforced what I found via Google Trends: most people don't think of art at all, let alone in their daily routine. Therefore the biggest challenge is to find ways to make art more relevant to people's lives.
The best way to tackle a huge goal is to break it down into its key pieces.
That's it. You'll notice a non-goal is to make SAM more profitable. It's a secondary goal, because there's no point in having a rich museum that nobody visits.
Once we have goals, let's start asking questions. Let's think of art museums as fundamentally a data problem, with the potential for data-driven results.
People are different, and react to experiences differently. What are the characteristics of SAM visitors? What kind of art does each person enjoy? Why do they come to the museum?
How can we identify why certain groups of people enjoy certain kinds of art? What can we do to make that art even more interesting to them?
What pieces of art are similar? How can we identify that? How can we identify new art and who will enjoy it?
What is the best way to use this information to tailor current and future art exhibits? What new, interesting art style will be a hit, even if nobody knows it yet?
People are spending an increasing amount of time online. That's helpful; art is a very visual medium. How can we use that? What visual services can be used to spread the word about art? Facebook? Pinterest?
How can art be used to illuminate the current events and daily routines of our lives?
A key concept in data science is the scientific method. Each of the questions above can be measured and tested using controlled experiments.
People spend more time near art they enjoy. Give every visitor a tablet or guidebook with an RFID tag, and use that to track where each visitor goes inside the museum. Identify the demographics and characteristics of people who visit each exhibit.
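As a minimal sketch of what that tracking data could look like, assume each RFID ping records a visitor, an exhibit, and a timestamp. The field layout, the `exit` marker, and the sample values are all hypothetical; the only idea taken from the text is that time spent near an exhibit stands in for enjoyment.

```python
from collections import defaultdict

# Hypothetical RFID pings: (visitor_id, exhibit, seconds_since_entry).
# Each ping marks the moment a visitor's tag is first seen at a location.
pings = [
    ("v1", "impressionism", 0),
    ("v1", "modern", 600),
    ("v1", "exit", 900),
    ("v2", "modern", 0),
    ("v2", "impressionism", 120),
    ("v2", "exit", 1320),
]

def dwell_times(pings):
    """Return {visitor: {exhibit: seconds spent}} from raw pings."""
    by_visitor = defaultdict(list)
    for visitor, exhibit, t in pings:
        by_visitor[visitor].append((t, exhibit))

    result = {}
    for visitor, events in by_visitor.items():
        events.sort()  # order each visitor's pings by time
        totals = defaultdict(int)
        # Time at an exhibit is the gap until the visitor's next ping.
        for (t0, exhibit), (t1, _) in zip(events, events[1:]):
            totals[exhibit] += t1 - t0
        result[visitor] = dict(totals)
    return result

times = dwell_times(pings)
print(times)
```

Joining these totals with whatever demographic data visitors agree to share would give the per-exhibit profiles described above.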
But, most people don't visit the museum. We need a larger sample, from a broader audience. Let's have a monthly museum 'free day' to cast a wider net, and collect more data.
Once we have enough data, let's build a predictive model to anticipate who will visit what art. Then let's re-arrange the museum and see if our predictions hold true. If they do, then we're onto something.
This is a type of collaborative filtering called user-user similarity.
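Here is a rough sketch of user-user similarity, assuming we already know how many minutes each visitor spent at each exhibit. The visitor names, exhibit names, and numbers are made up, and cosine similarity is just one common choice of measure, not necessarily the one a real system would use.

```python
import math

# Hypothetical minutes spent per exhibit, per visitor.
visits = {
    "ann": {"impressionism": 20, "modern": 5},
    "bob": {"impressionism": 18, "modern": 4, "sculpture": 1},
    "cat": {"sculpture": 25, "modern": 10},
}

def cosine(a, b):
    """Cosine similarity between two sparse {exhibit: minutes} vectors."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(user, visits):
    """Return the other visitor whose behavior is closest to `user`."""
    scores = [(cosine(visits[user], visits[o]), o) for o in visits if o != user]
    return max(scores)[1]

print(most_similar("ann", visits))
```

If Ann and Bob behave alike, the exhibits Bob lingered at that Ann hasn't seen are reasonable things to recommend to her.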
The previous section's data gives us more than just user behavior; it also tells us about art similarity. If 97% of the people who visit exhibit X also visit exhibit Y, then those exhibits are probably similar.
We can use this to build similarity groupings for art. Let's identify characteristics about each group of art. Then we can use that to identify new kinds of art that we think will bring people to the museum.
This is a type of collaborative filtering called item-item similarity.
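The "97% of people who visit X also visit Y" measure can be sketched directly. This assumes we have the set of exhibits each visitor stopped at; the data below is invented for illustration, and a real system would likely use a more robust similarity score.

```python
# Hypothetical record of which exhibits each visitor stopped at.
visited = {
    "ann": {"X", "Y"},
    "bob": {"X", "Y", "Z"},
    "cat": {"Z"},
    "dan": {"X", "Y"},
}

def co_visit_rate(x, y, visited):
    """Fraction of visitors to exhibit x who also visited exhibit y."""
    saw_x = [exhibits for exhibits in visited.values() if x in exhibits]
    if not saw_x:
        return 0.0
    return sum(1 for exhibits in saw_x if y in exhibits) / len(saw_x)

print(co_visit_rate("X", "Y", visited))
print(co_visit_rate("Z", "X", visited))
```

Note that the rate is not symmetric: everyone who saw X may also have seen Y without the reverse being true, so it's worth checking both directions before calling two exhibits similar.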
The insight from the previous 2 sections provides the grist for a highly successful marketing campaign.
For example, if men ages 18-25 like impressionist art, then showing Monet paintings in video games would make a lot of sense.
Finally, we would customize the Seattle Art Museum website to display different content depending on the user.
"The future is already here - it's just not evenly distributed" - William Gibson
I suspect what I've described above would be far more advanced than most museums could do. However, it's something that most retail stores do every day. Larger companies like Amazon, Netflix, eBay and Etsy do this far more, because it's a key advantage to their business.
After talking with nonprofits, I realized the biggest challenge isn't coming up with an idea, but finding skilled engineers. That's where you, dear reader, come in. The majority of my readers are data professionals. You have the ability to help a nonprofit or small business grow this way.
Dr. David DeWitt recently presented a keynote (video, slides) for PASS Summit 2013 on the new Hekaton query engine. I was impressed by how the new engine design is rooted in basic engineering principles.
Software engineers and IT staff are bound to the economics and practicalities of the computing industry. These trends define what we can reasonably do.
When a CPU is doing work, the job of the rest of the computer is to feed it data and instructions. Reading 1MB of data from memory is ~800 times faster than reading it sequentially from a disk.
The latest hype is "in-memory" technology. These products are based on a constraint: RAM is far, far faster than the disk or network.
"In-memory" means "stored in RAM". It's hard to market "stored in RAM" as the new hotness when it's been around for decades.
The price of CPU cycles has dropped dramatically. So has the cost of basic storage and RAM.
You can buy a 10-core server with 1 terabyte of RAM for $50K. That's cheaper than hiring a single developer or DBA. It is now cost effective to fit database workloads entirely into memory.
I can write code that is infinitely fast, has 0 bugs, and is infinitely scalable. How? By removing it.
The best way to make something faster is to have it do less work.
CPU scaling is running out of headroom. Even if Moore's Law isn't ending, it has transformed into something less useful. Single-threaded performance hasn't improved in some time. The current trend is to add cores.
What software and hardware companies have done is add support for parallel and multicore programming. Unfortunately, parallel programming is notoriously difficult, and runs head-first into a painful problem:
As the amount of parallel code increases, the serial part of the code becomes the bottleneck.
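This is usually formalized as Amdahl's law: if a fraction p of the work can run in parallel on n cores, the best possible speedup is 1 / ((1 - p) + p / n). A quick sketch, with the 90% figure chosen purely as an example:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Example: 90% of the work is parallel (an illustrative number).
for cores in (1, 10, 100, 1000):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With 90% parallel code, no number of cores can deliver more than a 10x speedup, because the remaining 10% still runs serially.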
"Big Data" is all the rage nowadays. The number and quality of sensors has increased dramatically, and people are putting more of their information online. A few places have truly big data, like Facebook, Google or the NSA.
For most companies, however, the volume of quality data isn't increasing nearly as rapidly. I see this all the time; OLTP databases are growing at a much slower pace than their associated 'big data' click-streams.
Systems are not upgraded quickly. IT professionals live with a hard truth: change brings risk. For existing systems the benefit of change must outweigh the cost.
Many NoSQL deployments are in new companies or architectures because they don't have to migrate and re-architect an existing (and presumably working) system.
Backwards compatibility is a huge selling point. It reduces risk.
Brilliant ideas don't come from large groups. The most impressive changes come from small groups of dedicated people.
However, most companies have overhead (email, managers, PMs, accounting, etc). It is easy to destroy a team's productivity by adding overhead.
I have been in teams where 3 weeks of design/coding/testing work required 4 months of planning and project approvals.
Overhead drains productive time and morale.
Smart companies realize this and build isolated labs:
Dr. DeWitt's keynote covered how these basic principles contributed to the Hekaton project.
I have hope for the new query engine, but also concerns: