This was my last day at the Strata conference.
Buzzwords are the new stopwords.
The vast majority of the keynotes were nothing but buzzwords, again. The audience reacted logically; they ignored the presenters and checked email, Facebook, and Twitter. One guy was trading stocks.
A notable exception was Matei Zaharia’s presentation on Spark. Spark is one of the most popular big data projects around, and Matei presented real stories and details.
The best keynote was James Burke’s. He was illuminating, funny, and persuasive. His argument was that history and discovery are messy and full of unexpected change.
Some memorable quotes:
Discovery and progress happen between disciplines, not through specialization. I see this all the time in software teams. The most creative work comes from groups of disparate people working together.
Our society rewards specialists more than generalists. The result is a larger number of narrower niches; we know more and more about less and less. Broad thinkers are desperately needed but not valued.
I enjoy thinking in systems. I was taught and raised this way, fortunately. Being able to see the trees and the forest comes in very handy. I encourage everyone to try this.
Technical change usually doesn’t cause problems directly. Its biggest headaches are predominantly due to side effects.
Facebook is a great example. Posting your personal info to friends isn’t controversial. What’s controversial is when none of it is private anymore and it’s visible to employers, parents, random strangers, and stalkers.
Society is reactive to scientific, technical, and industrial change. It’s important to be mindful of that.
One great session was Hadley Wickham’s talk on R. R is one of the most popular languages for data analysis, and one I use daily.
One of Hadley’s points was that it is good to write code when doing analysis.
The two projects Hadley is working on are dplyr and ggvis.
Dplyr is pretty amazing: it lets you write query-like operations in R and run them against data frames, data cubes, or even backends like an RDBMS or BigQuery. I’m reminded of LINQ and lambda expressions.
One of the beautiful parts of dplyr is that it’s declarative: you say what you want done, but not exactly how. Anyone familiar with SQL will feel right at home.
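To make that concrete, here’s a minimal sketch of the dplyr style using the built-in mtcars data set (my own toy example, not one from the talk):

```r
library(dplyr)

# Declarative, SQL-like verbs chained over a data frame.
mtcars %>%
  filter(cyl >= 6) %>%                  # WHERE cyl >= 6
  group_by(cyl) %>%                     # GROUP BY cyl
  summarise(avg_mpg = mean(mpg),        # aggregate within each group
            n_cars  = n()) %>%
  arrange(desc(avg_mpg))                # ORDER BY avg_mpg DESC
```

Because the verbs describe intent rather than execution, dplyr can translate the same pipeline into SQL and push it down to a database backend.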
Ggvis is the other package Hadley is working on. It’s the update to ggplot2, and it produces interactive visualizations using HTML, JavaScript, and SVG. It’s built using Vega and Shiny.
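Here’s a small sketch of the ggvis flavor (again my own toy example): an interactive scatterplot with a slider controlling point size, rendered in the browser.

```r
library(ggvis)

# An interactive scatterplot: ggvis renders to HTML/SVG via Vega,
# and Shiny powers the live slider control.
mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points(size := input_slider(10, 300, label = "Point size"))
```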
IPython notebooks are the de facto way to share data analysis, for several reasons.
Brian Granger gave a great series of demos about the upcoming IPython 2, which is going to be even more user-friendly. I’m looking forward to it.
One of my favorite sessions, this was a panel discussion with Drew Conway, Jake Porway, Rayid Ghani, and Elena Eneva. They discussed how data science can be used for social good.
The key takeaways:
Talking to dozens of people and attending many sessions led me to some unexpected conclusions…
Breakthroughs happen in 3 ways:
Those are in descending order of difficulty.
Data integration is not a solved problem
Chris Re mentioned a study done for various CTOs. The conclusion was stark: if you’re a CTO faced with a big integration challenge, your best course of action is to quit.
People are messy
It seems like data professionals have a bit of OCD. We like things to be clean and orderly.
However, people are messy. They come in all shapes and sizes, with biases, irrational behavior and communication headaches. We have to accept people as they are or face a constant impedance mismatch with the very people we are supposed to serve.
Work on big problems
I met some amazing data scientists over the past few days. Most of them will never be famous, even if they’re exceptionally smart.
They work on boring projects. Nobody cares if a brilliant data scientist works on online advertising, or a new kind of social media platform, or becomes yet another high-finance quant.
However, people do notice the data scientist who changes how a city does building inspections. What matters is relative impact.
This isn’t a new idea. Michael Lewis’ Moneyball was about more than stats coming into baseball; it was a beautiful example of how quantitative skill can have a dramatic impact in areas where it isn’t currently applied. For example:
Want to change the world? Find out where all the money goes in education (it’s not to teachers). Build a platform to crowdsource finance for farmers and remove all the middlemen. Figure out how music affects the brain.
Build big things.
This was my second day at Strata.
The keynotes at Strata were very short, 5-10 minutes each. This was a mixed blessing; presenters were brief, but many of them used nothing but data buzzwords. I was strongly reminded of The Worst Speech in the World.
P(quality) = reality / (buzzwords + reality)
However, there were two amazing speakers: Farrah Bostic and David Epstein. They made clear points, had a bit of light humor, and were refreshingly immune to buzzword-itis.
Farrah Bostic’s argument was “How we decide what to measure, matters.” Market research, surveys, and focus groups are more biased than we think, leading to flawed decisions and flawed results. I’ve seen this anecdotally, when people made decisions based on the data they happened to have rather than the problem they actually had.
David Epstein had two points. The first: before collecting tons of data, you should determine what is important and what can actually be changed, so that collecting and analyzing data serves change that is possible. The second: the famous “10,000 hours of practice” rule was based on a flawed study of 40 gifted violinists; it isn’t generally applicable. Even the original researcher, K.A. Ericsson, called the hype around the 10,000-hours idea “the danger of delegating education to journalists.”
This was the earth-shaking session of the day. Chris Re is a Stanford professor with a stupefying vision for data analysis.
A challenge with data problems, as in software engineering, is reducing the time from idea to product. One huge bottleneck is the cognitive/human time required to build a good model from data.
Building a good model requires iterating over two steps: (1) generating features from the data, and (2) building and evaluating models on those features.
The second step can be streamlined, even automated.
(IMAGE OF SKYNET)
For everything but the largest data sets, it is computationally/economically possible to run hundreds, even thousands, of machine learning models on a data set and use statistical methods to identify the best ones.
This is an old idea; data scientists search over hyperparameters to tune machine learning models all the time. I often use hyperparameter searches myself; they’re a fast way to find a good model.
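As a toy illustration (my own example, not Chris’), here’s a small hyperparameter search in R: score a grid of k values for k-nearest neighbours on held-out data and keep the best one.

```r
library(class)  # for knn()

set.seed(42)
idx   <- sample(nrow(iris), 100)  # simple train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit and score one value of the hyperparameter k.
accuracy_for <- function(k) {
  pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = k)
  mean(pred == test$Species)
}

grid   <- 1:25                    # the hyperparameter grid
scores <- sapply(grid, accuracy_for)
best_k <- grid[which.max(scores)]
cat("best k =", best_k, "with held-out accuracy", max(scores), "\n")
```

Scale the same loop out over hundreds of model families and you have the brute-force search Chris described.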
This leaves us with the first step: generating features.
One of the big lessons in machine learning is that more data trumps a complicated model. This is exemplified in the seminal paper “The Unreasonable Effectiveness of Data.”
Another lesson is that better features trump a complicated model. The best data scientists spend their time adding features to their data (feature engineering).
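A trivial sketch of what feature engineering means in practice (hypothetical data, purely for illustration): deriving columns from a raw timestamp that a model can actually use.

```r
# Hypothetical event log with a single raw timestamp column.
events <- data.frame(
  ts = as.POSIXct(c("2014-02-13 09:15:00", "2014-02-15 23:40:00"))
)

# Derived features: far more useful to a model than the raw timestamp.
events$hour       <- as.integer(format(events$ts, "%H"))  # time of day
events$weekday    <- weekdays(events$ts)                  # day of week (English locale assumed)
events$is_weekend <- events$weekday %in% c("Saturday", "Sunday")
```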
Chris’ ideas are brought to fruition in DeepDive, a system in which the user defines features but not the machine learning or statistics. The tool does all of the machine learning and statistical analysis, then shows the results. It’s already been used on paleobiology data (extracting data from PDF-formatted research papers) with promising results.
I’ll be following this closely.
Max Shron’s premise was simple: good data science benefits from having an intellectual framework. The details of this session are in his new book.
“How do we make sure we’re solving the right problem?”
Data scientists aren’t the first to ask that question. Designers have this problem all the time, worse than we do. Vague, conflicting requests are a fact of life.
Borrowing from designers and their scoping framework can:
Convincing people of something, even with data, is a form of argument. Data scientists can benefit from 2500 years of work in the humanities, rhetoric, and social sciences.
Knowing the structure of an argument can help with:
This was the most intellectual of the sessions I attended, and one of the most helpful.
In contrast, Monica Rogati’s session was lighthearted and utterly entertaining. This was an amazing example of telling a story using data.
The topic? Sleep.
As a data scientist for Jawbone, Monica is effectively running the world’s largest sleep study, with access to 50 million nights’ sleep. Some findings:
I’ll be revisiting this session for presentation tips.
That’s it for tonight. Until tomorrow, data nerds!