This was my second day at Strata.
The keynotes at Strata were very short, 5-10 minutes each. This was a mixed blessing; presenters were brief, but many of them used nothing but data buzzwords. I was strongly reminded of The Worst Speech in the World.
P(quality) = reality / (buzzwords + reality)
Farrah Bostic’s argument was “How we decide what to measure, matters.” Market research, surveys, and focus groups are more biased than we think, leading to flawed decisions and flawed results. I’ve seen this anecdotally: people making decisions based on the data they had rather than on the problem they had.
David Epstein had two points. The first is that before collecting tons of data, you should determine what is important and what can be changed; collecting and then analyzing data should enable change that is actually possible. His second point was that the famous “10,000 hours of practice” rule was based on a flawed study of 40 gifted violinists; it isn’t generally applicable. Even the original researcher, K.A. Ericsson, called the hype around the 10,000 hours idea “the danger of delegating education to journalists.”
A challenge with data problems, like software engineering, is to reduce the time between idea and product. One huge bottleneck is the cognitive/human time required to build a good model from data.
Building a good model requires iterating over 2 steps:

1. Generating features from the raw data.
2. Training and evaluating models on those features.
The second step can be streamlined, even automated.
(IMAGE OF SKYNET)
For everything but the largest data sets, it is computationally/economically possible to run hundreds, even thousands, of machine learning models on a data set and use statistical methods to identify the best ones.
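That model-search step can be sketched in a few lines: fit many candidate models and keep the one with the best cross-validated error. Here is a minimal, self-contained illustration; the toy dataset and the two candidate models are my own examples, not anything from the talks.

```python
import random
import statistics

random.seed(0)

# Toy dataset: y is roughly 2x + 1 with Gaussian noise.
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5))
        for x in (i / 10 for i in range(100))]

def fit_mean(train):
    """Baseline model: always predict the mean of y."""
    m = statistics.fmean(y for _, y in train)
    return lambda x: m

def fit_linear(train):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(train)
    sx = sum(x for x, _ in train)
    sy = sum(y for _, y in train)
    sxx = sum(x * x for x, _ in train)
    sxy = sum(x * y for x, y in train)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def cv_mse(fit, data, k=5):
    """Mean squared error, averaged over k cross-validation folds."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit(train)
        errors.append(sum((model(x) - y) ** 2 for x, y in test) / len(test))
    return statistics.fmean(errors)

# Score every candidate and keep the best.
candidates = {"mean baseline": fit_mean, "linear": fit_linear}
scores = {name: cv_mse(fit, data) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best)
```

Scaled up to hundreds of model families and hyperparameter settings, this is the same loop tools like scikit-learn’s `GridSearchCV` automate.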
This leaves us with the first step: generating features.
One of the big lessons of machine learning is that more data trumps a more complicated model. This is exemplified in the seminal paper “The Unreasonable Effectiveness of Data.”
Chris Ré’s ideas are brought to fruition in DeepDive, a system that lets users define features without writing any machine learning or statistics. The tool does all of the machine learning and statistical analysis, and then shows the results. It’s already been used on paleobiology data (extracting data from PDF-formatted research papers) with promising results.
I’ll be following this closely.
“How do we make sure we’re solving the right problem?”
Data scientists aren’t the first to ask that question. Designers have this problem all the time, worse than we do. Vague, conflicting requests are a fact of life.
Borrowing from designers and their scoping framework can:
Convincing people of something, even with data, is a form of argument. Data scientists can benefit from 2500 years of work in the humanities, rhetoric, and social sciences.
Knowing the structure of an argument can help with:
This was the most intellectual of the sessions I attended, and one of the most helpful.
In contrast, Monica Rogati’s session was lighthearted and utterly entertaining. This was an amazing example of telling a story using data.
The topic? Sleep.
As a data scientist for Jawbone, Monica is effectively running the world’s largest sleep study, with access to 50 million nights’ sleep. Some findings:
I’ll be looking at this session again, looking for presentation tips.
That’s it for tonight. Until tomorrow, data nerds!
This was my first day at Strata. Here’s what I found.
I made the very mistake I warned against yesterday: I went to sessions based on the topic, and not the quality of the speaker.
I’ll be more selective about my sessions for the next couple days.
I asked a dozen people, from a variety of industries, what they did for a living. I also asked how they ensured their work wasn’t being used to make more profit in an unethical way.
Nobody had an answer to the latter question. I’m fervently hoping this is due to my low sample size and not broadly representative of the data analytics community.
In addition to my ethical survey I had the chance to talk to people from a D.C. startup, the Lawrence Berkeley Lab, Microsoft Research, Netflix, Etsy, Vertafore, the Department of Defense, and Sage Bionetworks. Everyone was ridiculously smart, and most of them were data scientists.
I came prepared with a list of questions:
I found some common elements:
The range of subject areas covered was immense.
There were some boring problems discussed…
Luckily, I was saved by the amount of discussion on data-intensive genomics…
The societal benefit from this work could be immense. I understand why he was so cheerful when he talked.
I was impressed by the quality of thought put into the project:
Deep neural networks have gotten a lot of press lately, mostly because they can work well on problems most ML algorithms struggle with (image recognition, speech recognition, machine translation).
Ilya Sutskever gave a good, useful intro to deep neural networks. ‘Deep’ in this case refers to 8-10 layers of neurons ‘hidden’ between the input and output layers; a traditional neural net has 1-2 hidden layers.
The reasoning for looking at roughly 10 layers is great. Humans can do a variety of tasks in about 0.1 seconds, but neurons are slow: each can fire only about 100 times per second. In 0.1 seconds a chain of neurons can therefore fire about 10 times in sequence (100 firings/second × 0.1 seconds), so any human task completed that quickly must be achievable by a network only about 10 layers deep.
One of the big problems with neural networks is that they require a lot of data to train at this depth. They are also not intuitive to tune, a topic Ilya didn’t cover at all in his session. Still, it was a good 101-level talk.
“Give me explainability or give me depth”
For more, I’d recommend the Neural Networks Blog.
The reception afterwards was mostly dull. The food was good, and free. The vendors, however, were spreading their own particular flavors of FUD.
I asked 11 different vendors for the data to back up claims behind their value propositions. The responses were a comic mix of dumbfounded expressions, misdirection, and spin. It’s hilarious that companies selling to data and analysis professionals don’t use data to back up their marketing claims.
I find myself excited about the potential to meet awesome people and learn amazing things.
I’m looking forward to tomorrow.