The Road Through Strata - Day 1
11 February 2014
This was my first day at Strata. Here's what I found.
- It's important to explain data science products to users. They don't trust black boxes.
- Online learning (updating a model incrementally) is very useful, but it's still not available for most algorithms.
- Assembling, transforming, and cleaning data still takes 80%+ of the time.
- The age-old machine learning headaches live on: overfitting, outlier detection and removal, feature extraction, the curse of dimensionality, and mismatched tooling for prototyping vs. production.
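The online-learning point above is worth a concrete illustration. Here's a minimal sketch of incremental updating with a toy perceptron in plain Python — the data, learning rate, and feature names are made up for illustration, not from any Strata session:

```python
# Toy online (incremental) learner: a perceptron updated one example
# at a time, instead of being retrained on the full dataset.

def predict(weights, bias, x):
    """Linear threshold prediction: returns +1 or -1."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

def update(weights, bias, x, y, lr=0.1):
    """One online update: nudge the model only if it was wrong."""
    if predict(weights, bias, x) != y:
        weights = [w + lr * y * xi for w, xi in zip(weights, x)]
        bias = bias + lr * y
    return weights, bias

# A stream of (features, label) pairs arriving one at a time.
stream = [([1.0, 2.0], 1), ([2.0, 1.0], 1),
          ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]

weights, bias = [0.0, 0.0], 0.0
for x, y in stream * 5:          # several passes over the "stream"
    weights, bias = update(weights, bias, x, y)

print(predict(weights, bias, [1.5, 1.5]))   # prints 1
```

The appeal is that each example is seen once and discarded; the model never needs the full dataset in memory — which is exactly why it's frustrating that most algorithms don't support this style of update.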
I made the very mistake I warned against yesterday: I went to sessions based on the topic, and not the quality of the speaker.
I missed out on amazing sessions by John Foreman, Jeff Heer, and Carlos Guestrin.
I'll be more selective about my sessions for the next couple days.
I asked a dozen people, from a variety of industries, what they did for a living. I also asked how they ensured their work wasn't being used to make more profit in an unethical way.
Nobody had an answer to the latter question. I'm fervently hoping this is due to my low sample size and not broadly representative of the data analytics community.
In addition to my ethical survey I had the chance to talk to people from a D.C. startup, the Lawrence Berkeley Lab, Microsoft Research, Netflix, Etsy, Vertafore, the Department of Defense, and Sage Bionetworks. Everyone was ridiculously smart, and most of them were data scientists.
I came prepared with a list of questions:
- What's your name? Are you from the Bay Area? Where do you work?
- What are you passionate about? What do you like to do?
- What's your ideal problem to solve?
- What projects do you wish other people would help you with?
- What's one question you wish people would ask you?
- What do you think people should pay more attention to?
I found some common elements:
- They are all trying to learn the myriad software stacks and languages available today, and are confused by them.
- They all want to learn from each other.
- They all want more in-depth sessions.
The range of subject areas covered was immense.
- Sloan Digital Sky Survey (SDSS)
- Large Synoptic Survey Telescope (LSST)
- Search for Extra-Terrestrial Intelligence (SETI)
- Large Hadron Collider (LHC)
- Personalized medicine
- Predictive health - preventative care
- Early detection (for cancers)
- Intrusion detection
- Fraud monitoring
- Automatic root-cause analysis
- Monitoring with intelligent anomaly detection
- Capacity analysis and automatic scaling
There were some boring problems discussed...
- Show people more interesting ads
- Recommend movies, books, or news articles to people
- Recommend matches on a dating site
- Improve high-frequency trading systems
Luckily, I was saved by the amount of discussion on data-intensive genomics...
On Monday night I attended a Big Data Science meetup, and the best presenter was Frank Nothaft, a grad student at UC Berkeley, working on large-scale genomics.
- Processing a genome now costs less than $1,000, and the price is dropping faster than Moore's law. The cost of computation may become the bottleneck in genomics.
- Personalized medicine is now possible. Doctors would have the ability to identify which genetic traits make us more or less susceptible to different diseases, cancers, and so on.
- Data volumes are large: 200-1,000 GB per genome.
- Analyzing enough genomes to do population analyses requires petabytes of data.
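A quick sanity check on those numbers, using only the figures quoted above (the 500 GB midpoint is my own rough choice, not an official estimate):

```python
# Back-of-the-envelope check on the data volumes quoted above.
GB_PER_GENOME = 500            # roughly the middle of the 200-1,000 GB range
PB = 1_000_000                 # gigabytes per petabyte (decimal units)

genomes_per_petabyte = PB / GB_PER_GENOME
print(genomes_per_petabyte)    # 2000.0 genomes fill one petabyte
```

At these rates, a population study of 100,000 genomes would land around 50 PB, which matches the "petabytes" claim.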
The societal benefit from this work could be immense. I understand why he was so cheerful when he talked.
I was impressed by the quality of thought put into the project:
- Use open-source, popular software stacks because they will improve over time
- Add interface support for many languages, such as Python, C++, C#, PHP, Ruby, etc.
- Identify tools that are best at each part of the software stack to improve performance and scalability. In this case, that's Apache Spark, Avro, Parquet, and HDFS.
- Use a columnar data store (Parquet) on top of HDFS for data storage. Genomic data stores efficiently in a columnar format, which also leads to better parallelism.
- Use an interoperable data-storage setup (Avro) to support multiple interfaces
- Add support for SQL-like queries (Shark, Impala)
- Test the performance and scalability, both at a single node and scaling out.
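The columnar-storage point can be illustrated without Parquet itself. Here's a toy sketch of row vs. column layout in plain Python — the field names are invented for illustration:

```python
# Row-oriented layout: each record stored together, so scanning
# one field still touches every record.
rows = [
    {"position": 100, "base": "A", "quality": 30},
    {"position": 101, "base": "C", "quality": 35},
    {"position": 102, "base": "G", "quality": 40},
]

# Column-oriented layout (the idea behind Parquet): each field stored
# contiguously, so a query over one field reads only that column --
# and uniform per-column types compress well and split cleanly
# across parallel workers.
columns = {
    "position": [100, 101, 102],
    "base": ["A", "C", "G"],
    "quality": [30, 35, 40],
}

# Average quality: the columnar version touches one list, not every record.
avg_row = sum(r["quality"] for r in rows) / len(rows)
avg_col = sum(columns["quality"]) / len(columns["quality"])
print(avg_row, avg_col)   # 35.0 35.0
```

At genomic scale the difference is reading one column of a multi-hundred-gigabyte file versus all of it.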
There's a lot more detail available on the project website, in the in-depth research paper, and in the entirely public codebase.
Deep Neural Networks
Deep neural networks have gotten a lot of press lately, mostly because they can work well on problems most ML algorithms struggle with (image recognition, speech recognition, machine translation).
Ilya Sutskever gave a good, useful intro to deep neural networks. 'Deep' in this case refers to 8-10 layers of neurons 'hidden' between the input and output layers; a traditional neural net has 1-2 hidden layers.
The reasoning for looking at ~10 layers is elegant. Humans can do a variety of recognition tasks in about 0.1 seconds, but neurons are pretty slow: they fire only about 100 times per second, i.e. roughly 10 ms per firing. A task a human completes in 0.1 seconds can therefore involve only about 10 sequential firings, which suggests about 10 layers of neurons.
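That back-of-the-envelope argument, spelled out in code:

```python
# The depth argument from the talk, as arithmetic.
task_ms = 100          # a fast human recognition task: ~0.1 s = 100 ms
firing_ms = 10         # a neuron fires ~100 times/second -> ~10 ms per firing

sequential_steps = task_ms // firing_ms
print(sequential_steps)   # 10 -- roughly ten sequential layers of neurons
```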
One of the big problems with deep networks is that they require a lot of data to train at this depth. They're also not intuitive to tune; Ilya didn't cover tuning at all in his session. It was a good 101-level talk.
"Give me explainability or give me depth"
For more, I'd recommend the Neural Networks Blog.
The reception afterwards was mostly dull. The food was good, and free. The vendors, however, were spreading their own particular flavors of FUD.
I asked 11 different vendors for the data to back up claims behind their value propositions. The responses were a comic mix of dumbfounded expressions, misdirection, and spin. It's hilarious that companies selling to data and analysis professionals don't use data to back up their marketing claims.
I find myself excited about the potential to meet awesome people and learn amazing things.
I'm looking forward to tomorrow.