Machine Learning Conferences: Strata

This week I will be at the Strata conference. It’s the place to be if you’re a data scientist. I shall be there, blogging the whole time.

Getting There

I spent about 90 minutes looking at different options and found a price difference of $600+ between my first choice and the cheapest option.

Well Begun is Half Done

Conferences are expensive and brief. My uber-goal is to be as effective as possible in a limited time. There isn’t time to see everything or meet everyone. I must be selective with my time.

I have been to many technical conferences over the years, and developed tactics that work well.

Conferences last longer than adrenaline. Sleep, nutrition, and hydration are critically important. A brain needs fuel.

Conference Content

Great speakers are good teachers regardless of their session topic. Great topics are only sometimes presented well. I attend sessions based on speaker(s)’ quality.

I often learn as much from a session recording as I can by attending. A great example is DataGotham’s YouTube channel.

Signal

I learn from the brightest people I can find: engineers, scientists, and researchers. They are always practitioners of some kind.

I contact people before the conference. I say I want to chat, mention some topics we have in common, and ask if they’d like to meet or exchange contact info.

I follow most of the speakers on Twitter, and pay close attention to what sessions they recommend. Those are invariably good.

I prepare a list of questions. What do I want to learn? What are the most useful questions to ask? Which ones minimize my own bias?

I go to as many informal/collaborative events as I can. It’s amazing what I can learn from someone when they let their hair down.

Strata has some good options for this:

…and Noise

I rarely learn anything useful from salespeople, marketers, recruiters, or PMs. I avoid them.

I automatically disqualify anyone who is sexist, racist, or otherwise mean. I try to call them on their behavior, and then avoid them. I have better things to do than deal with their crap.

Great technology sells itself, often by word of mouth. I wonder why companies even do technical marketing. Good engineers have finely honed bullsh*t filters.

A heavily-marketed product is often an inferior product. A company that spends  on marketing is choosing to spend less in R&D. I have an anti-marketing bias when making purchasing decisions for this reason.

Jobs and Careers

The best positions often aren’t advertised. Recruited positions are often terrible.

Technical people can spot talent. I network with technical folks. I avoid recruiters.

Very few of my contacts stay in the same job for more than 5 years; 1-2 years is typical. I find it helpful to cultivate useful contacts, especially people who work in healthy companies or ethical industries.

I follow Wheaton’s Law. I network to meet people, learn from them, and lay the groundwork for a potential next gig. The least I can do is return the favor, often. It’s ethical and pays dividends.

Day 1

The Good

It’s important to explain data science products to users. They don’t trust black boxes.
Online learning (updating a model incrementally) is very useful, but it’s still not available for most algorithms.
Assembling, transforming, and cleaning data still takes 80%+ of the time.
The age-old machine learning headaches live on: overfitting, outlier detection and removal, feature extraction, the curse of dimensionality, and mismatched tooling for prototyping vs. production.

The Bad

I made the very mistake I warned against yesterday: I went to sessions based on the topic, and not the quality of the speaker.

I missed out on amazing sessions by John Foreman, Jeff Heer, and Carlos Guestrin.

I’ll be more selective about my sessions for the next couple days.

The Ugly

I asked a dozen people, from a variety of industries, what they did for a living. I also asked how ensured their work wasn’t being used to make more profit in an unethical way.

Nobody had an answer to the latter question. I’m fervently hoping this is due to my low sample size and not broadly representative of the data analytics community.

Meeting People

In addition to my ethical survey I had the chance to talk to people from a D.C. startup, the Lawrence Berkeley Lab, Microsoft Research, Netflix, Etsy, Vertafore, the Department of Defense, and Sage Bionetworks. Everyone was ridiculously smart, and most of them were data scientists.

I came prepared with a list of questions:

Questions

What’s your name? Are you from the Bay Area? Where do you work?
What are you passionate about? What do you like to do?
What’s your ideal problem to solve?
What projects do you wish other people would help you with?
What’s one question you wish people would ask you?
What do you think people should pay more attention to?

I found some common elements:

They are learning, and confused by, the myriad software stacks and languages available nowadays.
They all want to learn from each other.
They all want more in-depth sessions.

Data-Intensive Everything

The range of subject areas covered was immense.

Data-Intensive Physics

Sloan Digital Sky Survey (SDSS)
Large Synoptic Survey Telescope (LSST)
Search for Extra-Terrestrial Intelligence (SETI)
Large Hadron Collider (LHC)

Data-Intensive Medicine

Personalized medicine
Predictive health - preventative care
Early detection (for cancers)

Data-Intensive Cybersecurity

Intrusion detection
Fraud monitoring

Data-Intensive IT

Automatic root-cause analysis
Monitoring with intelligent anomaly detection
Capacity analysis and automatic scaling

Data-Intensive Cruft

There were some boring problems discussed…

Show people more interesting ads
Recommend movies, books, or news articles to people
Recommend matches on a dating site
Improve high-frequency trading systems

Luckily, I was saved by the amount of discussion on data-intensive genomics…

Data-Intensive Genomics

On Monday night I attended a Big Data Science meetup, and the best presenter was Frank Nothaft, a grad student at UC Berkeley, working on large-scale genomics.

Why?

The cost of processing a genome is now less than $1,000 per genome. Also, the price is dropping faster than Moore’s law. The cost of computation may be the bottleneck in genomics.
Personalized medicine is now possible. Doctors would have the ability to identify which genetic traits make us more or less susceptible to different diseases, cancers, and so on.
Data volumes are large. 200-1,000GB per genome.
Analyzing enough genomes to do population analyses requires petabytes of data.

The societal benefit from this work could be immense. I understand why he was so cheerful when he talked.

How

I was impressed by the quality of thought put into the project:

Use open-source, popular software stacks because they will improve over time
Add interface support for many languages, such as Python, C++, C#, PHP, Ruby, etc.
Identify tools that are best at each part of the software stack to improve performance and scalability. In this case, that’s Apache Spark, Avro, Parquet, and HDFS.
Using a columnar data store (Parquet) on top of HDFS for data storage. Genomic data can be efficiently stored in a columnar format and this leads to better parallelism.
Use an interoperable data-storage setup (Avro) to support multiple interfaces
Add support for SQL-like queries (Shark, Impala)
Test the performance and scalability, both at a single node and scaling out.

There’s a lot more detail, available on the website, the in-depth research paper, or the entirely-public codebase.

Deep Neural Networks

Deep neural networks have gotten a lot of press lately, mostly because they can work well on problems most ML algorithms struggle with (image recognition, speech recognition, machine translation).

Ilya Sutskever gave a good, useful intro into deep neural networks. ‘Deep’ in this case refers to 8-10 layer of neurons ‘hidden’ between the input and output layers. A traditional neural net has 1-2 hidden layers.

The reasoning to look at 10 layers is great. Humans can do a variety of things in 0.1 seconds. However, neurons are pretty slow; they can only fire about 100/second. Therefore a human task that happens in under 0.1 seconds takes only 10 layers of neurons to do.

One of the big problems behind neural networks is that they require a lot of data to train at this depth. They are also not intuitive to tune; Ilya didn’t go over that at all in his session. It was a good 101-level talk.

“Give me explainability or give me depth”

For more, I’d recommend the Neural Networks Blog.

Open Reception

The reception afterwards was mostly dull. The food was good, and free. The vendors, however, were spreading their own particular flavors of FUD.

I asked 11 different vendors for the data to back up claims behind their value propositions. The responses were a comic mix of dumbfounded expressions, misdirection, and spin. It’s hilarious that companies selling to data and analysis professionals don’t use data to back up their marketing claims.

Day 2

Keynotes

The keynotes at Strata were very short, 5-10 minutes each. This was a mixed blessing; presenters were brief, but many of them used nothing but buzzwords. I was strongly reminded of The Worst Speech in the World.

P (quality) = (reality / buzzwords + reality).

However, there were two amazing speakers: Farrah Bostic and David Epstein. They made clear points, had a bit of light humor, and were refreshing immune to buzzword-itis.

Farrah Bostic’s argument was “How we decide what to measure, matters.” Market research, surveys and focus groups are more biased than we think, leading to flawed decisions and flawed results. I’ve seen the results of this anecdotally, when people making decisions using the data they had rather than the problem they had.

David Epstein had two points. The first is that before collecting tons of data, you should determine what is important and what can be changed. Collecting data, and then analyzing it, should enable change that is possible. His second point was the famous “10,000 hours of practice” study was based on a flawed study of 40 gifted violinists; it isn’t generally applicable. Even the original researcher, K.A. Ericsson, called the hype around the 10,000 hours idea “the danger of delegating education to journalists.”

Big Data: Too Few Artists

This was the earth-shaking session of the day. Chris Re is a Stanford professor with a stupefying vision for data analysis.

A challenge with data problems, like software engineering, is to reduce the time between idea and product. One huge bottleneck is the cognitive/human time required to build a good model from data.

Building a good model requires iterating over 2 steps:

Getting data and extracting features from it
Testing any and all features against various models to see which combinations are meaningful.

The second step can be streamlined, even automated.

Automated Machine Learning

(TO DO IMAGE OF SKYNET)

For everything but the largest data sets, it is computationally/economically possible to run hundreds, even thousands, of machine learning models on a data set and use statistical methods to identify the best ones.

This is an old idea. Data scientists tune machine learning models using hyperparameters all the time. I often use hyperparameter searches when tuning a model; it’s a fast way to tune a good model.

This leaves us with the first step: generating features.

It’s About the Features, Stupid

One of the big lessons in machine learning is more data trumps a complicated model. This is exemplified in the seminal paper “The unreasonable effectiveness of data.”

Another lesson is better features trumps a complicated model. The best data scientists spend time adding features to their data (feature engineering).

Deep Dive

Chris’ ideas are brought to fruition in DeepDive, a system that has a user define features, but not machine learning or statistics. The tool does all of the machine learning and statistical analysis, and then shows the results. It’s already been used on paleobiology data (extracting data from PDF-formatted research papers) with promising results.

I’ll be following this closely.

Thinking with Data

Max Shron’s premise was simple: good data science benefits from having an intellectual framework. The detail of this session is in his new book.

Scoping

“How do we make sure we’re solving the right problem?”

Data scientists aren’t the first to ask that question. Designers have this problem all the time, worse than we do. Vague, conflicting requests are a fact of life.

Borrowing from designers and their scoping framework can:

Help defining a data problem clearly by asking careful questions.
Reduce the chance of error by using mockups. A fake graph can be very helpful.
Help deliver a clear presentation by copying narrative structure: setup, conflict, resolution, and denouement.

Arguments

Convincing people of something, even with data, is a form of argument. Data scientists can benefit from 2500 years of work in the humanities, rhetoric, and social sciences.

Knowing the structure of an argument can help with:

Clarifying what you need to convince people of.
Anticipating objections and questions, so you can be prepared
Identifying indirect and opportunity costs that may trip up your ideas
Keeping your presentation concise by not covering already-agreed-upon terms and definitions.

This was the most intellectual of the sessions I attended, and one of the most helpful.

Tracking zzzzz

In contrast, Monica Rogati’s session was lighthearted and utterly entertaining. This was an amazing example of telling a story using data.

The topic? Sleep.

As a data scientist for Jawbone, Monica is effectively running the world’s largest sleep study, with access to 50 million nights’ sleep. Some findings:

Hawaii is the most sleep deprived. Vermont is the least.
The conventional wisdom for jet lag is 1 day recovery time 1-hour time zone. It actually takes 2 days.
A coast-to-coast trip takes 6-7 days to recover from.
Fishing, pets, hiking, and softball are correlated with more sleep.
Socializing at work, personal grooming, and commuting are correlated with less.

I’ll be looking at this session again, looking for presentation tips.

Errata

I asked 30 people at Data after Dark, and 23 of them knew how to count cards. No wonder we’re playing poker and not blackjack!
A vendor booth offering free Bud Light at the Exhibit Hall was completely empty. No surprise; there was free scotch and craft beer nearby. Competition matters.
I found intriguing research paper on the theory behind join algorithms (warning, math heavy)

Day 3

Keynotes

Buzzwords are the new stopwords.

The vast majority of the keynotes were nothing but buzzwords, again. The audience reacted logically; they ignored the presenters and checked email, Facebook, and Twitter. One guy was trading stocks.

A notable exception was Matei Zaharia’s presentation on Spark. Spark is one of the most popular big data projects around, and Matei presented real stories and details.

James Burke Keynote

The best keynote was James Burke’s. He was illuminating, funny, and persuasive. His argument was history and discovery are messy and full of unexpected change.

Some memorable quotes:

“Information causes change. If it doesn’t, it isn’t information” - Claude Shannon
“Anybody could have done that. I just got there first”
“The number of potential connections in a brain is greater than the number of .atoms in the known universe. You have one of these. What are you doing with it?”
“We are constrained to predict the future from the past. That’s all we have”

Thinking in Systems

Discovery and progress happens between disciplines, not through specialization. I see this all the time in software teams. The most creative work comes from groups of disparate people working together.

Our society rewards specialists more than generalists. The result is a larger number of narrower niches; we know more and more about less and less. Broad thinkers are desperately needed but not valued.

I enjoy thinking in systems. I was taught and raised this way, fortunately. Being able to see the trees and the forest comes in very handy. I encourage everyone to try this.

You Never Expect the Spanish Inquisition What Really Happens

Technical change usually doesn’t cause problems directly. Its biggest headaches are predominantly due to side effects.

Facebook is a great example. Posting your personal info to friends isn’t controversial. What’s controversial is when none of it’s private anymore and visible to employers, parents, random strangers, and stalkers.

Society is reactionary to scientific, technical and industrial change. It’s important to be mindful of that.

Expressing Yourself in R

One great session was Hadley Wickham’s session on R. R is one of the most popular languages for data analysis, and one I use daily.

One of Hadley’s points is that it is good to code when doing analysis.

It’s reproducible
It helps with automation
Code, as text, is a precise form of communication.

The two projects Hadley is working on are dplyr and ggviz.

Dplyr is pretty amazing; it’s a way to create query-like operations in R and have them work against data frames, data cubes, or even backends like RDBMS or BigQuery. I’m reminded of LINQ and lambda expressions.

One of the beautiful parts of dplyr is that it’s declarative. You code what you want done, but not exactly how. Anyone familiar with SQL will feel right at home.

Ggviz is the other package Hadley is working on. It’s the update to ggplot2, and produces interactive visualizations using HTML, JavaScript, and SVG. It is built using Vega and Shiny.

IPython Notebooks

IPython notebooks are the de facto way to share data analysis, for several reasons:

They can self-contain data, code, and output graphics.
They are inherently reproducible.
They support many languages.

Brian Granger gave a great series of demos about the upcoming IPython 2, which is going to be even more user-friendly. I’m looking forward to it.

Data for Good

One of my favorite sessions, this was a panel discussion between Drew Conway, Jake Porway, Rayid Ghani, and Elena Eneva. They were discussing how data science can be used for social good.

The key takeaways:

The most important activity is listening. You’re only valuable when you solve real problems.
Not all problems are data problems.
You don’t need to build a complicated model. Simple models often go far.
More than 1/3 of the audience had volunteered with a nonprofit at some point. It was a very civic-minded audience
A lot of the best presenters were in the audience. That was oddly heartening.
The turnover for data scientists is high. Consider social impact the next time you’re looking for a gig.

Closing Thoughts

Talking to dozens of people and attending many sessions led me to some unexpected conclusions…

Breakthroughs happen in 3 ways:

Designing a new algorithm in statistics or machine learning
Applying an existing algorithm in stats/ML to a new kind of system (bigger scale or a new language)
Applying stats/ML to a problem/industry that hasn’t seen it before.

Those are in descending order of difficulty.

Data Integration is not a solved problem

Chris Re mentioned a study done for various CTOs. The result was stark: if you’re a CTO faced with a big integration challenge, your best course of action is to quit.

People are messy

It seems like data professionals have a bit of OCD. We like things to be clean and orderly.

However, people are messy. They come in all shapes and sizes, with biases, irrational behavior and communication headaches. We have to accept people as they are or face a constant impedance mismatch with the very people we are supposed to serve.

Work on big problems

I met some amazing data scientists over the past few days. Most of them will never be famous, even if they’re exceptionally smart.

They work on boring projects. Nobody cares if a brilliant data scientist works on online advertising, or a new kind of social media platform, or becomes yet another high-finance quant.

However, people do notice when the data scientist who changes how a city does building inspections. What matters is relative impact.

This isn’t a new idea. Michael Lewis’ Moneyball was about more than stats coming into baseball; it was a beautiful example that quantitative skill can have a dramatic impact in areas where it doesn’t currently exist. For example:

Construction
Agriculture
Fashion
Music
Art
Consumer advice
Education
Government
Campaign finance
99% of the nonprofits in the world

Want to change the world? Find out where all the money goes in education (it’s not to teachers). Build a platform to crowdsource finance for farmers and remove all the middlemen. Figure out how music affects the brain.

Build big things.

Published 10 February 2014