15 August 2012
“Data Science” is a term coined by DJ Patil (@DPatil) and Jeffrey Hammerbacher (@hackingdata). The term became popular after the article Data Scientist: The Sexiest Job of the 21st Century. Like “webmaster”, it is a term that covers multiple roles. Data science is also not easy, as Joseph Misiti eloquently describes.
Data scientists do 3 things:
Math + Code = Machine Learning (ML). It’s such a dynamic field of study that the best sources are often blogs, research papers and conferences instead of books or classes. Here are the resources I’ve found most helpful.
The most popular languages for machine learning are R and Python. The second tier of popular languages and tools consists of SQL, Java, Hadoop/Mahout/Hive, and Matlab.
My favorite place to start with R is John Cook’s Introduction to R for programmers.
Python is popular because it’s simple and has more libraries than God. The most popular Python libraries for machine learning are NumPy, SciPy, scikit-learn and pandas.
There are thousands of machine learning algorithms. Luckily for us, a few have risen to prominence. The most popular machine learning algorithms are:
Machine learning uses a lot of matrix math. It’s quite easy to do matrix math in SQL. Joe Celko has some great examples of matrix math in SQL.
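To make the claim concrete, here is a minimal sketch of matrix multiplication in SQL. It is not taken from Joe Celko's examples; it assumes each matrix is stored as a table of (row, column, value) triples, and it runs the query against an in-memory SQLite database through Python's built-in sqlite3 module. The table names `a` and `b`, the column names, and the helper functions are made up for illustration.

```python
import sqlite3


def to_rows(matrix):
    """Flatten a dense matrix (list of lists) into (row, col, value) triples."""
    return [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)]


def sql_matmul(a, b):
    """Multiply matrices a and b with a SQL join plus GROUP BY (illustrative helper)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per matrix cell.
    conn.execute("CREATE TABLE a (row_num INTEGER, col_num INTEGER, val REAL)")
    conn.execute("CREATE TABLE b (row_num INTEGER, col_num INTEGER, val REAL)")
    conn.executemany("INSERT INTO a VALUES (?, ?, ?)", to_rows(a))
    conn.executemany("INSERT INTO b VALUES (?, ?, ?)", to_rows(b))

    # Cell (i, j) of the product is the sum over k of a[i][k] * b[k][j].
    # The join pairs up matching k values; GROUP BY collects them per cell.
    cursor = conn.execute(
        """
        SELECT a.row_num, b.col_num, SUM(a.val * b.val)
        FROM a JOIN b ON a.col_num = b.row_num
        GROUP BY a.row_num, b.col_num
        """
    )

    # Rebuild a dense result so it is easy to inspect.
    result = [[0.0] * len(b[0]) for _ in a]
    for i, j, v in cursor:
        result[i][j] = v
    conn.close()
    return result


product = sql_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
```

The triple layout also suits sparse matrices, since zero cells can simply be left out of the tables.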