Data Scientist, Part 1

22 October 2012

I want to be able to find data, analyze it, and turn it into meaning. Anywhere. Any time. I want to become a data scientist.

The term "data scientist" is not well defined. Many people try to define it precisely. This is ironic, because a huge part of the job is to use data to find knowledge and to measure uncertainty. For all of these people with definitions, I ask: where is your data?

Luckily, there is some agreement about what data scientists do and know. Armed with that basic information, I started studying. I learn the most when I scatter-gather.


First, I read a lot of books and blog posts. I watched videos. I started learning different tools and libraries.

The whole time, I was careful to take notes, looking for similarities, patterns, and connections between topics. Who are the key people in each field? What are the most popular concepts and tools? What lessons and warnings keep cropping up?

Some common skills and tools emerged:

Skill                           Popular tools
Statistics                      R, Python, SciPy, NumPy
Programming / Scripting         Python, Java, Ruby, regular expressions
Working at scale ("big data")   Hadoop, Hive, Pig, HBase, Impala
Infrastructure                  Linux, AWS
Visualization                   Tableau, ggplot2, D3.js
Storytelling                    N/A
Domain knowledge                N/A
Linear algebra                  R, Python
Machine learning                R, Python, Mahout
RDBMS                           SQL queries, MySQL, PostgreSQL, SQL Server
NoSQL                           Mongo, Cassandra, Redis
Files                           Log parsing, regular expressions


After collecting all of this information, I puzzled through what the data meant. The most common lessons are:

The more I learn about this work, the more I love it.