15 June 2014
A few weeks ago I had the privilege of attending of the PASS Business Analytics conference (PASS BAC) with 750 other data geeks in San Jose, CA.
The last conference I had attended was Strata, another data-to-decisions shindig. I’ve decided to compare and contrast the two.
The SqlPass and Strata communities have some things in common:
Everyone wants to get more out of their data. Businesses are desperate to do so. We data scientists, analysts and engineers are paid to work with data.
We’re in the enviable position of being well paid to do the work we love. Most such jobs pay poorly: teaching, global health, cancer research, nursing, cooking, music, art, English, history, biology, physics, etc.
One of the best events was Wednesday’s unconference session. It was a rare chance to crowdsource discussion topics and expertise.
One of the problems everyone has is cleaning up data. Surveys show that over 80% of a data project is spent cleaning up data.
This is, sadly, an area with comparatively few tools. People built their own solutions, using 3 different approaches.
Clean with Code
This is the first approach everybody thinks of when facing a data-cleaning problem. They find patterns in good data, and use rules (if/then, case, switch statements) to clean up data. This is often the ‘T’ (transform) in an ETL process.
This approach has flaws. It requires lots of development time to prototype, implement, and validate. It’s hard to debug, especially if there’s a data pipeline with multiple, sequential data-cleaning steps. It also does not handle new errors or edge cases well, so it requires ongoing developer support to maintain.
Clean with Vendors
This is the second approach. Some data-cleaning problems are so common that many businesses face them. Cleaning up addresses is a canonical example.
When you’re facing a common problem, use a common solution. Buy a third-party product or service and have it do the tricky work. After all, there’s no reason thousands of companies need to build their own address-cleaning pipelines.
However, this approach doesn’t make sense when the data to clean is internal (custom) or sensitive (SSNs, banking data, medical records).
Clean with Math
The third approach is my favorite: math. Specifically, statistics.
There’s an entire branch of statistics devoted to detecting when data is missing or if an outlier can be safely discarded. Most importantly, there are techniques here that let an analyst know whether missing/outlier data meaningfully changes the result of a problem. This is what scientific research has done for decades.
The other half of this approach is machine learning. Let’s see using our address problem. Imagine that we strip an address down to its component parts (building number, street, apartment/suite, city, post code, etc.). We can then train a machine learning model to look at an address’ text, assign probabilities that each bit of an address belongs to one of those elements, and predict how likely that piece of data is what we want.
There are very few tools for this approach thus far. That means you have to build these solutions in-house, and for that you need a machine-learning savvy developer. They’re awfully rare.
After getting a disappointing Data Science Bingo score during the Thursday morning keynote, I set out to find out how common certain tools and techniques were in my audience. I asked 33 people which technologies they knew about or used.
In addition to asking about specific tools, I also asked about tool-neutral techniques for analyzing and processing data.
The attendees may be future data scientists, but this sample suggests a distinct gap in their machine learning, statistical, and engineering skills. I found only a few promising analysts in my survey.
The main limitation of PASS BAC is that it’s a Microsoft-centric conference on data analysis. That’s like having a World Cup with only the U.S., Canada, and Belize.
The most common command-line tools for data analysis are Python and R; they have amazing libraries and support reproducible research. Microsoft has Excel, which has terrible library support, and no support for reproducible research. It’s contributed to world-economy-changing, fascist-government-inspiring screwups.
One way to think of code is that it’s a precise form of communication to other people as well as to machines. The only practical way to do command-line data analysis on the MS stack is to run a Linux VM in Azure. Oops.
Strata and PASS BAC have one huge difference: their implicit messages are different.
This shows up a lot in different lessons. PASS BAC sessions are predominantly about how to help analysts solve X or Y problem.
Strata sessions are about solving X problem, and evolving the solution to an automated process or data product (Google Search results, Netflix movie recommendations, Hotmail’s spam filter, and LinkedIn’s “People You May Know” suggestions).
PASS BAC assumes there are data-hungry, precocious analysts in every company that will magically find insights if you feed them a diet of raw data and Excel. That’s a dangerous assumption.
The maturity curve of BI is long, but it bends towards automation.
This is an organizational challenge: how do we empower people to solve challenges that are beyond their current capabilities? There are 4 options.
Option A: dumb down the problem so people can solve it using their existing skills
Option B: educate people so they can tackle the problem.
Option C: use a different approach so there are no people in the first place
Option D: tackle a different problem
Tool-building vendors take the first approach. Their value proposition is “we solve your biggest challenge using our deus-ex-machine tool.” That sounds a lot better than “your staff’s outdated skills are your biggest challenge, we just help out with a smaller problem.”.
Another common challenge I heard discussed was communication lapses. A battle of resources, priorities, and perspectives between developers and IT admins is still ongoing for the PASS BAC attendees.
There were some common lessons that people liked:
Strata was heavily represented by startups, academia, and scientific researchers; PASS BAC had established industries, such as insurance, finance, and retail.
These audiences are different, with different goals. Strata felt fast, even frantic at times. Startups are always rushed. Scientific research never has enough resourcing.
PASS BAC was slower, more methodical. The goal of the audience isn’t to disrupt an industry, but to defend one. They have far more resources, and assume they have lots of time.
When companies are slow to adopt data-driven decision-making, I wonder how they do some things:
A huge part of an organization improving using data is the cultural approach decision-makers make with it. A key question is what decision makers do when the results of data analysis disagree with their ‘gut’.
The best session, by far, was David McCandless’ (@DavidMc, Information Is Beautiful) .
I was awestruck by how he used visualizations to illustrate the gaps in human perception. For example, using relative amounts to illustrate what one billion looks like, a million lines of code, or the probabilistic results in the Drake Equation.
It makes sense to use visuals to communicate, since we use 75% of our sensory neurons to process what we see. Data visualization is a type of communication. Its goals are the same as any kind of communication: clarity, emotional connection, and minimizing misunderstandings. If you want to learn the principles of data visualization, study communication and journalism.
There are a few common things to do and avoid.
Ways to Succeed
Ways to Fail
Despite having a fun time speaking and chatting at PASS BAC, I doubt I’ll go back. It’s a useful conference for BI practitioners in established industries, looking for incremental improvements. There’s nothing wrong with that, but it’s not for me.
I can network better at a local meetup, learn more by doing my own research or reading a good research paper. I can reach a wider audience with a blog post.