Data Scientist, Part 1

22 October 2012

I want to be able to find data, analyze it, and turn it into meaning. Anywhere. Any time. I want to become a data scientist.

The term “data scientist” is not well defined. Many people try to define it precisely. This is ironic, because a huge part of the job is to use data to find knowledge and to measure uncertainty. To everyone offering a definition, I ask: where is your data?

Luckily, there is some agreement about what data scientists do and know. Armed with that basic information, I started studying. I learn the most when I scatter-gather.

Scatter

First, I read a lot of books and blog posts. I watched videos. I started learning different tools and libraries.

The whole time, I was careful to take notes, looking for similarities, patterns, and connections between topics. Who are the key people in each field? What are the most popular concepts and tools? What lessons and warnings keep cropping up?

Some common skills and tools emerged:

Skill	Popular Tools
Statistics	R, Python, SciPy, NumPy
Programming / Scripting	Python, Java, Ruby, regular expressions
Working at scale (“big data”)	Hadoop, Hive, Pig, HBase, Impala
Infrastructure	Linux, AWS
Visualization	Tableau, ggplot2, D3.js
Storytelling	N/A
Domain knowledge	N/A
Linear algebra	R, Python
Machine learning	R, Python, Mahout
RDBMS	SQL queries, MySQL, PostgreSQL, SQL Server
NoSQL	MongoDB, Cassandra, Redis
Files	Log parsing, regular expressions

Gather

After collecting all of this information, I puzzled through what the data meant. The most common lessons are:

Machine learning (ML) is immensely powerful. Using ML tools is quite easy. Understanding why and how they work is hard.
There are hundreds of ways to analyze data. Data scientists must quickly determine which approaches are relevant and which are not.
Compared with other disciplines, data science does not have the same depth of common knowledge or training. Object-oriented programming and data warehousing are mature disciplines. Data science is very young. Therefore, judgment is key.
Learn by doing. Pick a question. Find some data. Do some analysis. Communicate it. Reflect. Repeat.
Learn from your mistakes.
Learn from other people’s mistakes.
Some of the brightest data scientists speak publicly, have blogs, and are on Twitter. Learn from them.
This is serious work, but also a lot of fun. Enjoy yourself.

The more I learn about this work, the more I love it.