“Data Science” is a term coined by DJ Patil (@DPatil) and Jeffrey Hammerbacher (@hackingdata). The term became popular after an article, *Data Scientist, the Sexiest Job of the 21st Century*. It’s a term similar to “webmaster”: it covers multiple roles. It is also *not* easy, as Joseph Misiti eloquently describes.

Data scientists do 3 things:

- Ask Questions
- Write Code
- Do Math

Math + Code = Machine Learning (ML). It’s such a dynamic field of study that the best sources are often blogs, research papers and conferences instead of books or classes. Here are the resources I’ve found most helpful.

- Dataists
- Kaggle’s No Free Hunch
- Pedro Domingos’s Overview of Machine Learning
- Jeremy Kun’s Introduction to Machine Learning
- BigML’s Everything You Want to Know about ML
- Buck Woody’s Setting up a Data Science Laboratory

- Machine Learning for Hackers, by Drew Conway (@drewconway) and John Myles White (@johnmyleswhite). There is also a Python derivative, Will It Python.
- Data Analysis Using Open Source Tools, by Philipp Janert

- Coursera’s Machine Learning class
- Coursera’s Introduction to Data Science

The most popular languages for machine learning are R and Python. The second tier includes SQL, Java, and Matlab, along with Hadoop-ecosystem tools such as Mahout and Hive.

My favorite place to start with R is John Cook’s Introduction to R for programmers.

Python is popular because it’s simple and has more libraries than God. The most popular Python libraries for machine learning are NumPy, SciPy, scikit-learn and pandas.

There are *thousands* of machine learning algorithms. Luckily for us, a few have risen to prominence. The most popular machine learning algorithms are:

- Decision Trees / Random Forests
- Linear Regression
- Logistic Regression
- Association Rules
- K-Means Clustering
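To make one item from that list concrete, here is a minimal sketch of simple linear regression (one input, one output) fitted by ordinary least squares. It uses only the Python standard library; the function names `fit_line` and `predict` are mine, made up for illustration, and a real project would more likely reach for one of the libraries mentioned above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Hypothetical helper for illustration, not a library API.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    x_bar, y_bar = mean(xs), mean(ys)

    # Sum of squared deviations of x, and of x-deviations times y-deviations.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new x value."""
    slope, intercept = model
    return slope * x + intercept
```

Calling `fit_line([1, 2, 3, 4], [3, 5, 7, 9])` recovers the line y = 2x + 1, since those points lie exactly on it. The other algorithms in the list follow the same pattern: fit parameters from data, then use them to predict.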

Machine learning uses a lot of matrix math. It’s quite easy to do matrix math in SQL. Joe Celko has some great examples of matrix math in SQL.
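As a rough illustration of the idea (not taken from Celko's examples), a matrix can be stored as one row per cell, and matrix multiplication then becomes a join plus a `SUM`. The sketch below runs that query through Python's built-in `sqlite3` module; the table layout `(i, k, v)` and the function name `matmul_sql` are my own assumptions.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two matrices (lists of rows) using a SQL join.

    Hypothetical helper: each matrix is stored as (row, column, value) cells.
    """
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")

    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE a (i INTEGER, k INTEGER, v REAL)")
        conn.execute("CREATE TABLE b (k INTEGER, j INTEGER, v REAL)")
        conn.executemany(
            "INSERT INTO a VALUES (?, ?, ?)",
            [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)],
        )
        conn.executemany(
            "INSERT INTO b VALUES (?, ?, ?)",
            [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)],
        )

        # C[i][j] = sum over k of A[i][k] * B[k][j]
        cells = conn.execute(
            """
            SELECT a.i, b.j, SUM(a.v * b.v)
            FROM a JOIN b ON a.k = b.k
            GROUP BY a.i, b.j
            """
        ).fetchall()
    finally:
        conn.close()

    # Rebuild a list-of-rows matrix from the (i, j, value) cells.
    result = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i, j, v in cells:
        result[i][j] = v
    return result
```

For example, `matmul_sql([[1, 2], [3, 4]], [[5, 6], [7, 8]])` gives `[[19.0, 22.0], [43.0, 50.0]]`. The one-row-per-cell layout is also handy for sparse matrices, because zero cells can simply be left out of the tables.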

I want to be a data scientist. I want to learn in the most efficient way. I want to learn from the best.

One of the foremost data scientists is Hilary Mason (blog, @hmason). She has a tremendous ability to make difficult concepts easy to understand. See: An Introduction to Machine Learning in 30 Minutes.

What did I learn from that video? This can be **fun**!

In addition to learning the necessary math, I should use the most appropriate tools. A little sleuthing turned up a survey of the data scientists competing at Kaggle.com.

The winner? R, the open-source tool for statistical analysis.

The other tool to learn? Python, due to its ease of use and large number of libraries.

Combined, those two tools make it easy to find, consume, and analyze data from many places. Next up: math.