The Road Through Strata - Wednesday

12 February 2014

This was my second day at Strata.

Keynotes

The keynotes at Strata were very short, 5-10 minutes each. This was a mixed blessing; presenters were brief, but many of them used nothing but data buzzwords. I was strongly reminded of The Worst Speech in the World.

P (quality) = (reality / buzzwords + reality).

However, there were two amazing speakers: Farrah Bostic and David Epstein. They made clear points, had a bit of light humor, and were refreshing immune to buzzword-itis.

Farrah Bostic’s argument was “How we decide what to measure, matters.” Market research, surveys and focus groups are more biased than we think, leading to flawed decisions and flawed results. I’ve seen the results of this anecdotally, when people making decisions using the data they had rather than the problem they had.

David Epstein had two points. The first is that before collecting tons of data, you should determine what is important and what can be changed. Collecting data, and then analyzing it, should enable change that is possible. His second point was the famous “10,000 hours of practice” study was based on a flawed study of 40 gifted violinists; it isn’t generally applicable. Even the original researcher, K.A. Ericsson, called the hype around the 10,000 hours idea “the danger of delegating education to journalists.”

Big Data: Too Few Artists

This was the earth-shaking session of the day. Chris Re is a Stanford professor with a stupefying vision for data analysis.

A challenge with data problems, like software engineering, is to reduce the time between idea and product. One huge bottleneck is the cognitive/human time required to build a good model from data.

Building a good model requires iterating over 2 steps:

Getting data and extracting features from it
Testing any and all features against various models to see which combinations are meaningful.

The second step can be streamlined, even automated.

Automated Machine Learning

(IMAGE OF SKYNET)

For everything but the largest data sets, it is computationally/economically possible to run hundreds, even thousands, of machine learning models on a data set and use statistical methods to identify the best ones.

This is an old idea. Data scientists tune machine learning models using hyperparameters all the time. I often use hyperparameter searches when tuning a model; it’s a fast way to tune a good model.

This leaves us with the first step: generating features.

It’s About the Features, Stupid

One of the big lessons in machine learning is more data trumps a complicated model. This is exemplified in the seminal paper “The unreasonable effectiveness of data.”

Another lesson is better features trumps a complicated model. The best data scientists spend time adding features to their data (feature engineering).

Deep Dive

Chris’ ideas are brought to fruition in DeepDive, a system that has a user define features, but not machine learning or statistics. The tool does all of the machine learning and statistical analysis, and then shows the results. It’s already been used on paleobiology data (extracting data from PDF-formatted research papers) with promising results.

I’ll be following this closely.

Thinking with Data

Max Shron’s premise was simple: good data science benefits from having an intellectual framework. The detail of this session is in his new book.

Scoping

“How do we make sure we’re solving the right problem?”

Data scientists aren’t the first to ask that question. Designers have this problem all the time, worse than we do. Vague, conflicting requests are a fact of life.

Borrowing from designers and their scoping framework can:

Help defining a data problem clearly by asking careful questions.
Reduce the chance of error by using mockups. A fake graph can be very helpful.
Help deliver a clear presentation by copying narrative structure: setup, conflict, resolution, and denouement.

Arguments

Convincing people of something, even with data, is a form of argument. Data scientists can benefit from 2500 years of work in the humanities, rhetoric, and social sciences.

Knowing the structure of an argument can help with:

Clarifying what you need to convince people of.
Anticipating objections and questions, so you can be prepared
Identifying indirect and opportunity costs that may trip up your ideas
Keeping your presentation concise by not covering already-agreed-upon terms and definitions.

This was the most intellectual of the sessions I attended, and one of the most helpful.

Tracking zzzzz

In contrast, Monica Rogati’s session was lighthearted and utterly entertaining. This was an amazing example of telling a story using data.

The topic? Sleep.

As a data scientist for Jawbone, Monica is effectively running the world’s largest sleep study, with access to 50 million nights’ sleep. Some findings:

Hawaii is the most sleep deprived. Vermont is the least.
The conventional wisdom for jet lag is 1 day recovery time 1-hour time zone. It actually takes 2 days.
A coast-to-coast trip takes 6-7 days to recover from.
Fishing, pets, hiking, and softball are correlated with more sleep.
Socializing at work, personal grooming, and commuting are correlated with less.

I’ll be looking at this session again, looking for presentation tips.

Errata

I asked 30 people at Data after Dark, and 23 of them knew how to count cards. No wonder we’re playing poker and not blackjack!
A vendor booth offering free Bud Light at the Exhibit Hall was completely empty. No surprise; there was free scotch and craft beer nearby. Competition matters.
I found intriguing research paper on the theory behind join algorithms (warning, math heavy)

That’s it for tonight. Until tomorrow, data nerds!