Becoming a Data Scientist

22 November 2018

As a data scientist, one of the most common questions I hear is “How do I become a data scientist? How can I do what you do?”

Here are my answers to that and similar questions…

1. What’s it like to be a data scientist? What does your day to day job look like?

Well, my work is about communication and curiosity.

Before I start any data science effort, I want to know how and why it’s needed. If I deliver perfect predictions or recommendations, what will someone do with them? Is it ethical? Well thought out? Do they even know?

If someone does not know what they will do with predictions and analysis, it’s my job to tell them that.

For useful projects, I try to build up my curiosity. What am I trying to predict? What am I trying to recommend? What do I need to understand about the problem? How does everything work?

A huge part of this is humility; I’m trying to identify my customers’ blind spots and my own, and to compensate for them. I’m trying to make correct assumptions as shortcuts, to avoid “analysis paralysis”. This involves asking lots of questions, thinking with a pen and paper, and asking again.

Once I have a better understanding of a problem, I start looking at data. What is the ideal data set for this problem? Do we have it? (No). What data do we have? Is that going to be good enough to answer the questions we have?

Sometimes my projects stop here, and I tell people to collect data before going any further. Usually I have to tell people what kind of data they need to have the accuracy they’re looking for.

If I have data I can use, my job becomes very domestic: cleaning. 80% of my time is acquiring, cleaning, and transforming data sets into a format I can use. The final 20%: feature extraction, machine learning, validating predictions, and making fun visuals.
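To give a feel for that 80%, here is a minimal sketch of a cleaning pass using only the Python standard library. The sample data, the column names, the `MISSING` set, and the `clean_rows` helper are all made up for illustration; real data sets are messier and usually need rules specific to the problem.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, inconsistent casing,
# several spellings of "missing", and one duplicated row.
RAW_CSV = """id,name,age,city
1, Alice ,34,Seattle
2,bob,,seattle
2,bob,,seattle
3,Carol,n/a,Tacoma
4,Dave,41,
"""

# Assumed spellings of "no value" in this made-up data set.
MISSING = {"", "n/a", "na", "null", "none"}


def clean_rows(raw_text):
    """Parse CSV text and return a list of cleaned, de-duplicated dict rows."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Clean: trim whitespace and map "missing" spellings to None.
        values = {}
        for key, value in row.items():
            value = (value or "").strip()
            values[key] = None if value.lower() in MISSING else value

        # Transform: enforce types and consistent casing.
        record = {
            "id": int(values["id"]),
            "name": values["name"].title() if values["name"] else None,
            "age": int(values["age"]) if values["age"] else None,
            "city": values["city"].title() if values["city"] else None,
        }

        # Drop exact duplicates, keeping the first occurrence.
        fingerprint = tuple(record.items())
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(record)
    return cleaned


rows = clean_rows(RAW_CSV)
for r in rows:
    print(r)
```

Even in this toy case, most of the code is about deciding what counts as "missing" or "the same row", which is where the time really goes.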

Finally, once I have good results, I put on my cynic’s hat, and try to prove myself wrong. What assumptions did I make? What am I not seeing? What else could be going on that I didn’t check for?

…the end of this explanation is usually when the developers I talk to turn pale and walk away. This is a very human, very messy process, and the code I write is only part of my work.

2. How do you think your degree helped / did not help you?

My degree was useful in some ways:

  • Data structures classes. Knowing the pros and cons of arrays, hash tables, and trees has been essential.
  • Data algorithms classes. Big-O notation has come in handy ever since.
  • How to communicate technical concepts to non-technical audiences. How to tell stories.
  • A little experience gathering requirements
  • Training in advanced statistics.
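
As a small illustration of the first two bullets, the sketch below counts the steps needed to find one value in an unsorted array scan (O(n)) versus a halving search over a sorted array, which behaves like walking a balanced tree (O(log n)). The function names and the choice of 1024 items are mine, picked only for this example; the hash-table case uses Python's built-in `set`.

```python
def linear_search_steps(items, target):
    """Scan an array front to back; return how many items were checked."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Halve a sorted array on each step; return how many items were checked."""
    steps = 0
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


n = 1024
data = list(range(n))
target = n - 1  # the last element: worst case for the linear scan

linear = linear_search_steps(data, target)
binary = binary_search_steps(data, target)
in_hash = target in set(data)  # hash lookup: O(1) on average

print(f"linear scan: {linear} steps, halving search: {binary} steps, in set: {in_hash}")
```

The gap between the two step counts grows quickly as the data gets larger, which is why these trade-offs matter once a data set no longer fits comfortably in a spreadsheet.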

In other ways my degree was missing important things:

  • How to get, clean, and transform data
  • The craft of software engineering (abstraction, coupling, code reviews, collaborative development)
  • The key lessons learned by software development over the last 40 years (keep it simple, ask probing questions about requirements)
  • Scientific computing (manipulating data, statistical programming, etc.)
  • How to visualize data effectively

3. Do you wish you had done something extra at school to help prepare for industry?

Yes indeed. I wish I had done lots of small side projects. Building recommendation engines, data classifiers, topic analyses…it would all be great training.

I wish I had known the most popular languages and tools, so I could study them. Linux. SQL. Python. R. Jupyter notebooks.

4. What do you think about going into industry versus graduate school?

I don’t have a complete answer to this, because I don’t have the graduate school experience to compare it to.

I’ve only had the chance to do in-depth research into a subject once, when studying student behavior as a UW data scientist. Graduate students have more opportunity there. However, there’s a huge difference between paying to learn in graduate school versus being paid to work as a data scientist.

5. How did you get your first job after college?

Well, I was naïve in college. I listened to my advisors and perfected my résumé and cover letter. I thought that was what I needed to get a job.

It turns out that what matters is networking. The success rate for online applications is ~1%; for networking and referrals it is ~20%.

I submitted hundreds of job applications. Sometimes 10+ a day, with variations on the same cover letter and resume. Nothing ever came of them. I finally finagled an interview as a system administrator for an advertising startup through a friend.

Over the last 13 years I have grown from a sysadmin, to a SQL developer, to a senior software engineer, to a data scientist.

6. What qualities do you think make a great data scientist?

  • Curiosity. It takes a desire to learn to poke at data and work over a problem for weeks, and from many angles.
  • Grit. It takes determination, stubbornness, and a willingness to get your hands dirty. I’ve lost track of the number of odd software errors I’ve run into. Anyone who expects the work to be as easy as using Excel is in for a rude awakening.
  • Communication. Every data project I’ve ever been on has been a team effort. Just as important is the ability to work with whoever your customers are.
  • Integrity. It’s easy to find results that aren’t there, to find insights that aren’t true. It takes integrity to admit when you’re wrong, when you haven’t found anything.
  • Ambition. A passion to experiment, to fail, to recover and worry away at a problem. In short, the same qualities I would expect from a tinkerer, builder, and mad chemist.

In many ways these are the same qualities I would look for in any scientist.

7. If you were starting over again today, what would you do? What do you wish you knew sooner?

I’d practice my communication skills more and get used to working on a team. Things like body language and humor are important.

I’d be less optimistic about changing a company to suit me. I can be successful in many different industries. That doesn’t mean I want to. I shouldn’t stay too long in any one place. When the company politics and organizational culture hinder me too much, I should trust my gut and leave. I should take more risks.

It’s important for me to do my part to help humanity as a whole, to give back to society. Only organizations that have that mission will suit me well.

8. How can I become a data scientist?

I know of 3 ways:

  • Academia. Master’s or PhD in computer science, data science, machine learning, or computational math. This is the ‘traditional’ path. Some of the benefits are learning in a community, lots of support, and recognition when you’re done. It’s also the most expensive, and kinda slow.
  • DIY. Study on your own. Find blogs and online resources and meetups and learn from them. Do side projects and blog about the results. This is the fastest way, but it requires the most self motivation.
  • Current Role. Grow into the role in your day job. You need a supportive boss and work culture to attempt this. This also takes the longest to do. It’s hard to learn skills not immediately needed in your job. On the bright side, it’s free, and low risk.

Joining Fred Hutch

11 May 2018

After my last job, I decided to work somewhere already working to make the world more equitable and just. I am fortunate to be a data scientist and engineer; it gives me the opportunity to work in many places.

I was lucky enough to find, apply for, and get a job as a data engineer at the Fred Hutchinson Cancer Research Center. I’ve lost friends and family to cancer, the same as everyone else.

My new day job is to build a data ‘commonwealth’, where researchers can upload data, process it, and share it. It is an evolutionary step in data intensive science, after open source scientific computing and open access research. Helping scientists with reproducible research and “building upon the work of others” can dramatically accelerate the pace of scientific discovery. That’s my dream for this job.

My evening plans involve learning about cancer biology, genomics, and bioinformatics. My next career goal is to be both a data scientist and cancer researcher.

I’m hoping to find the time to write about data engineering, bioinformatics, cancer biology, and more. Stay tuned :)