15 August 2012
"Data scientist" is a term coined by DJ Patil (@DPatil) and Jeffrey Hammerbacher (@hackingdata). The term became popular after the article Data Scientist: The Sexiest Job of the 21st Century. It's a term similar to "webmaster": it covers multiple roles. The work is also not easy, as Joseph Misiti eloquently describes.
Data scientists do 3 things:
- Ask Questions
- Write Code
- Do Math
Math + Code = Machine Learning (ML). It's such a dynamic field of study that the best sources are often blogs, research papers and conferences instead of books or classes. Here are the resources I've found most helpful.
- Machine Learning for Hackers, by Drew Conway (@drewconway) and John Myles White (@johnmyleswhite).
- Data Analysis Using Open Source Tools, by Philipp Janert
- Hilary Mason's (@hmason) Machine Learning in 30 Minutes
- The Strata conference videos
The most popular languages for machine learning are R and Python. The second tier of popular languages and tools includes SQL, Java, Hadoop/Mahout/Hive, and MATLAB.
My favorite place to start with R is John Cook's Introduction to R for programmers.
Python is popular because it's simple and has more libraries than God. The most popular Python libraries for machine learning are NumPy, SciPy, scikit-learn and pandas.
There are thousands of machine learning algorithms. Luckily for us, a few have risen to prominence. The most popular machine learning algorithms are:
Machine learning uses a lot of matrix math. It's quite easy to do matrix math in SQL. Joe Celko has some great examples of matrix math in SQL.
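To make the matrix-math-in-SQL idea concrete, here is a minimal sketch of matrix multiplication done with a join. It is not taken from Joe Celko's examples. The layout is my own assumption: each matrix is stored as one row per cell, and the table names `a` and `b`, the column names `i`, `k`, `j` and `val`, and the helper `matmul_sql` are all hypothetical. I'm using Python's built-in `sqlite3` module only so the SQL can be run without setting up a database server.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two matrices (lists of lists) with a SQL join.

    Assumed layout: one table row per matrix cell, (row, column, value).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (i INTEGER, k INTEGER, val REAL)")
    conn.execute("CREATE TABLE b (k INTEGER, j INTEGER, val REAL)")

    # Flatten each matrix into (row, column, value) tuples.
    conn.executemany(
        "INSERT INTO a VALUES (?, ?, ?)",
        [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)],
    )
    conn.executemany(
        "INSERT INTO b VALUES (?, ?, ?)",
        [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)],
    )

    # C[i][j] = sum over k of A[i][k] * B[k][j]:
    # join on the shared index k, then sum per output cell.
    cells = conn.execute(
        "SELECT a.i, b.j, SUM(a.val * b.val) "
        "FROM a JOIN b ON a.k = b.k "
        "GROUP BY a.i, b.j"
    ).fetchall()
    conn.close()

    # Rebuild a list-of-lists result from the (row, column, value) cells.
    result = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i, j, v in cells:
        result[i][j] = v
    return result


product = matmul_sql([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

The whole multiplication is the single `SELECT`: the join pairs up cells that share the inner index `k`, and `GROUP BY` with `SUM` collapses them into one value per output cell. A nice side effect of the one-row-per-cell layout is that sparse matrices only need rows for their non-zero cells.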