I want to be able to find data, analyze it, and turn it into meaning. Anywhere. Any time. I want to become a data scientist.

The term "data scientist" is not well defined. Many people try to define it *precisely*. This is ironic, because a huge part of the job is to *use data* to find knowledge and to measure uncertainty. For all of these people with definitions, I ask: where is your data?

Luckily, there is some agreement about what data scientists do and know. Armed with that basic information, I started studying. I learn the most when I *scatter-gather*.

First, I read a lot of books and blog posts. I watched videos. I started learning different tools and libraries.

The whole time I was careful to take notes, looking for similarities, for patterns, for connections between topics. Who are the key people in each field? What are the most popular concepts and tools? What lessons and warnings keep cropping up?

Some common skills and tools emerged:

| Skill | Popular Tools |
|---|---|
| Statistics | R, Python, SciPy, NumPy |
| Programming / Scripting | Python, Java, Ruby, regular expressions |
| Working at scale ("big data") | Hadoop, Hive, Pig, HBase, Impala |
| Infrastructure | Linux, AWS |
| Visualization | Tableau, ggplot2, D3.js |
| Storytelling | N/A |
| Domain knowledge | N/A |
| Linear algebra | R, Python |
| Machine learning | R, Python, Mahout |
| RDBMS | SQL queries, MySQL, PostgreSQL, SQL Server |
| NoSQL | MongoDB, Cassandra, Redis |
| Files | Log parsing, regular expressions |

After collecting all of this information, I puzzled through what the data meant. The most common lessons are:

- Machine learning (ML) techniques are immensely powerful. Using ML tools is quite easy. Understanding *why* they work and *how* they work is hard.
- There are hundreds of ways to analyze data. Data scientists must quickly determine which approach(es) are relevant and which are not.
- Compared with other disciplines, data science does not have the same depth of common knowledge or training. Object-oriented programming and data warehousing are mature disciplines. Data science is very young. Therefore, judgment is key.
- Learn by doing. Pick a question. Find some data. Do some analysis. Communicate it. Reflect. Repeat.
- Learn from your mistakes.
- Learn from other people's mistakes.
- Some of the brightest data scientists speak publicly, have blogs, and are on Twitter. Learn from them.
- This is serious work, but also a lot of fun. Enjoy yourself.

The more I learn about this work, the more I love it.

*"You'll love this expensive neighborhood. It has great schools" - Every Realtor, ever*

Maybe.

This is the second blog post analyzing the quality of schools using data. In the last post we identified a key success metric and saw the quality of schools. Now we'll ask: Are good schools always in expensive neighborhoods?

**WARNING:** Correlation != causation. We'll see correlations between housing prices and school quality. This does not mean school quality changes *because* of the price of a house. Children don't become better students because their bedroom has fancier paint.

When parents buy a house, a 'good neighborhood' means 'near good schools'. Parents will spend as much as they can afford to be near a good school. Let's see if it's a good idea to spend more; below is a graph of schools' High Achiever % compared with house prices.

There is a rough trend where school quality increases as house prices increase. Let's see how much of a trend there is. A little linear math (least squares regression) produces this equation.

```
High Achiever % = Median House Price * 0.0000002439 - 0.0274391
R^2 = 0.259, SSE = 0.818, MSE = 0.00223, p-value < 0.0001
```

An R² value of 1 means the line fits the data perfectly. An R² value of 0 means there is no linear relationship at all. Here R² is 0.259, so house prices account for roughly a quarter of the variation in High Achiever % (a correlation coefficient of about 0.51). The p-value is very small (*far less than 0.05*), meaning that a trend like this is very unlikely to be a fluke. So, we're not grasping at straws.
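For anyone who wants to try a fit like this, here is a minimal sketch of least squares regression in plain Python. The `fit_line` helper and the price/score pairs are made up for illustration; they are not the school data behind the numbers above, so the output will not match the equation in this post.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: squared error left over after the fit. SST: total variation in y.
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - sse / sst
    return slope, intercept, r_squared


# Hypothetical (median house price, High Achiever fraction) pairs,
# invented for illustration only.
prices = [150_000, 250_000, 360_000, 500_000, 750_000, 1_000_000]
achievers = [0.03, 0.02, 0.12, 0.07, 0.20, 0.16]

slope, intercept, r_squared = fit_line(prices, achievers)
print(f"High Achiever % = Median House Price * {slope:.10f} + {intercept:.7f}")
print(f"R^2 = {r_squared:.3f}")
```

R² here is one minus the share of variation the line fails to explain (SSE divided by the total variation), which is why a perfect fit gives 1 and a useless fit gives 0.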

We're interested in schools with a high High Achiever % that is not due to house prices. This leftover part is also known as the *residual*.

```
High Achiever % Not Due To House Prices =
High Achiever % - High Achiever % Due to House Prices (equation)
```

This is the same data as above, except without the influence of house prices. Here we can see which schools are good deals; they have positive values.

For example, Mercer Island high school has a 16% High Achiever value, which is quite good. However, it is a *very* low score for the cost, because the median house costs over $1 million. Therefore its residual value is negative (it's not worth the price). In contrast, Friday Harbor high school has a 27% High Achiever value, with a median house price of around $360K. It's a far better deal.
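To make the arithmetic concrete, here is a small sketch that plugs those two schools into the house-price equation above. The function names are my own, and the prices are rounded assumptions ("over $1 million" becomes exactly $1,000,000, "around $360K" becomes $360,000), so treat the results as approximate.

```python
# Coefficients copied from the house-price regression above.
SLOPE = 0.0000002439
INTERCEPT = -0.0274391


def expected_high_achiever(median_house_price):
    """High Achiever fraction the trend line predicts for a given price."""
    return median_house_price * SLOPE + INTERCEPT


def residual(high_achiever, median_house_price):
    """Actual High Achiever fraction minus the part explained by price."""
    return high_achiever - expected_high_achiever(median_house_price)


# (High Achiever fraction, assumed median house price)
schools = {
    "Mercer Island": (0.16, 1_000_000),
    "Friday Harbor": (0.27, 360_000),
}

for name, (high_achiever, price) in schools.items():
    print(f"{name}: residual = {residual(high_achiever, price):+.3f}")
```

With these rounded prices, Mercer Island lands about 5.6 points below the trend line and Friday Harbor about 21 points above it.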

Parents could use this data to make informed decisions about where to move. For example, we can see that houses over $400K don't provide much (if any) additional improvement in school quality.

What if we want to rent, instead of buying? Let's look at school quality compared to rent price:

There is a correlation between rental prices and school quality. More linear math, and we find:

```
High Achiever % = Average Rent * 0.0001197 - 0.0684526
R^2 = 0.247, SSE = 0.724, MSE = 0.00282, p-value < 0.0001
```

There is still a correlation, but it's slightly weaker than with house prices (an R² value of 0.247 instead of 0.259). Still, we can use the linear equation to plot the residual, looking for good deals:

Since rent is a slightly weaker predictor of school quality than house price, there are more potential deals when renting instead of buying.

**The Takeaway**: Don't assume schools are better because they're in a pricier neighborhood; that's not always true. And consider renting instead of buying.

Sometimes we don't have to dig too deeply. This basic level of analysis is sufficient for parents on a budget.

However, there is a *lot* more insight in this data. In the next post we will look at which factors influence school quality. Stay tuned for more data!