15 August 2012

“Data Science” is a term usually credited to DJ Patil (@DPatil) and Jeffrey Hammerbacher (@hackingdata). It became popular after the article *Data Scientist: The Sexiest Job of the 21st Century*. Like “webmaster,” the term covers multiple roles. It is also *not* easy, as Joseph Misiti eloquently describes.

Data scientists do three things:

- Ask Questions
- Write Code
- Do Math

Math + Code = Machine Learning (ML). It’s such a dynamic field of study that the best sources are often blogs, research papers and conferences instead of books or classes. Here are the resources I’ve found most helpful.

- Dataists
- Kaggle’s No Free Hunch
- Pedro Domingos’s Overview of Machine Learning
- Jeremy Kun’s Introduction to Machine Learning
- BigML’s Everything You Want to Know about ML
- Buck Woody’s Setting up a Data Science Laboratory

- Machine Learning for Hackers, by Drew Conway (@drewconway) and John Myles White (@johnmyleswhite). There is also a Python derivative, Will It Python.
- Data Analysis Using Open Source Tools, by Philipp Janert

- Coursera’s Machine Learning class
- Coursera’s Introduction to Data Science

The most popular languages for machine learning are R and Python. The second tier of popular languages and tools includes SQL, Java, Matlab, and the Hadoop ecosystem (Mahout and Hive).

My favorite place to start with R is John Cook’s Introduction to R for programmers.

Python is popular because it’s simple and has more libraries than God. The most popular Python libraries for machine learning are NumPy, SciPy, scikit-learn and pandas.

There are *thousands* of machine learning algorithms. Luckily for us, a few have risen to prominence. The most popular machine learning algorithms are:

- Decision Trees / Random Forests
- Linear Regression
- Logistic Regression
- Association Rules
- K-Means Clustering
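
To make one of these concrete, here is a minimal sketch of linear regression with a single input variable, written in plain Python rather than with any of the libraries above. The function name `fit_line` and the toy data are assumptions made up for illustration; in practice you would more likely reach for a library implementation.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Toy data (hypothetical) generated from y = 2x + 1 with no noise.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the toy data has no noise, the fitted line should recover the slope and intercept that generated it. Real data will not be this tidy.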

Machine learning uses a lot of matrix math. It’s quite easy to do matrix math in SQL. Joe Celko has some great examples of matrix math in SQL.
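
As a rough illustration of the idea (this is not one of Joe Celko’s examples), here is a sketch of matrix multiplication in SQL, run through Python’s built-in `sqlite3` module. It assumes each matrix is stored as one row per cell; the table names `a` and `b` and the column names `i`, `j` and `val` are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Store each matrix as (row index, column index, value) triples.
conn.executescript("""
    CREATE TABLE a (i INTEGER, j INTEGER, val REAL);
    CREATE TABLE b (i INTEGER, j INTEGER, val REAL);
""")

# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] (hypothetical data).
a_cells = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
b_cells = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", a_cells)
conn.executemany("INSERT INTO b VALUES (?, ?, ?)", b_cells)

# C[i][j] = SUM over k of A[i][k] * B[k][j]:
# join A's column index to B's row index, then sum per output cell.
product = conn.execute("""
    SELECT a.i, b.j, SUM(a.val * b.val) AS val
    FROM a
    JOIN b ON a.j = b.i
    GROUP BY a.i, b.j
    ORDER BY a.i, b.j
""").fetchall()

for row, col, val in product:
    print(row, col, val)
```

The join-and-group pattern is the whole trick: the join pairs up the cells that get multiplied, and `GROUP BY` adds them up per output cell. A side benefit of the one-row-per-cell layout is that sparse matrices only need rows for their non-zero cells.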