Find Good Schools Using Data - Part One

17 September 2012

This is the first in a series of blog posts analyzing the quality of schools using data. I'll go over how to identify a good school and what factors influence the quality of a school. Finally, I'll test some pieces of common wisdom to see if they are accurate.


"The mind is not a vessel to be filled but a fire to be lit" - Plutarch

Parents care about their children, and sacrifice to help them advance in the world. One of the biggest expenses is putting their child into a good school, often by moving near one. They believe, rightly, that a great education enables their children to fulfill their potential.

Unfortunately, there is no simple way to find a good school. Parents gather what information they can from family, friends, 'top 100' articles, etc. Let's use data instead.

What is Success?

What is the definition of a good school? A good school teaches children to identify and achieve their goals in life. That means a school will instill confidence, fluid intelligence, street savvy, and a broad base of knowledge in its students. Sadly for data scientists and analysts, these traits are hard to quantify.

Let's use a narrower definition: a good school is one that prepares students to successfully complete college and start a career. A study by the US Department of Education found that intense classes are the best predictor of college completion. These classes are often called Advanced Placement, International Baccalaureate, or Honors classes. We can use AP test scores and counts as a metric to measure the quality of a school.

A good school should exhibit the following behavior:

  • A high percentage of the students who take an AP test receive a good score.
  • Students take several advanced classes, not just one. This is a sign of intellectual breadth as well as depth.
  • A high percentage of students take advanced classes, or that percentage is rising over time. A rising percentage shows that the school is improving.

Show Me the Data!

Now that we have our goals, we need data! I collected data for every school in Washington state. This includes AP test counts, AP scores, SAT scores, school budget information, teacher counts, and teacher evaluations. Because students are also influenced by their environment, I collected neighborhood data for school zip codes as well, covering factors such as crime rates, parents' education levels, and urban density. This gives us 655 high schools to use for our analysis, with over 30 variables for each school.

Let's start with the obvious question: which schools are the best in the state? I have defined a single metric for success: High Achiever %. It is the percentage of students in the school who take an AP test and achieve a good score on it.

We Have A Winner

At the very top is Newport High School, which sends 52% of its students to AP tests, and 79% of them achieve a good score, for a High Achiever % of 41.2%. The runner-up is Interlake High School, which sends 48% of its students to AP tests, and 77% achieve a good score, for a High Achiever % of 39.6%.
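
As a quick sanity check, High Achiever % can be read as the participation rate multiplied by the pass rate among test takers. Here is a minimal sketch in Python, assuming that reading of the definition; the function name is hypothetical. The published rates are rounded to whole percentages, so the product only approximates the reported figure.

```python
def high_achiever_pct(participation_rate, pass_rate):
    """Percentage of all students who take an AP test and get a good score.

    Both inputs are fractions between 0 and 1. This assumes the metric is
    simply participation multiplied by the pass rate among test takers.
    """
    return 100.0 * participation_rate * pass_rate


# Newport High School, using the rounded rates quoted in the text.
newport = high_achiever_pct(0.52, 0.79)
print(f"{newport:.1f}%")
```

With the rounded inputs this gives about 41.1%, close to the 41.2% reported for Newport.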

It is not clear why those schools are the best. It's also not clear whether they are the best for the price, since they are in very expensive neighborhoods.

There's a lot of variation: different schools have dramatically different High Achiever % values. Let's look at the data a little differently, by measuring High Achiever % and Tests Per Student independently. Ideally a good school would score well on both.

As we can see here, there appears to be a relationship between High Achiever % and Tests Per Student. That makes sense; a school that can educate its students extremely well in one subject is likely to do so in multiple subjects.
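
One way to put a number on that relationship is a Pearson correlation coefficient. The sketch below computes it with plain Python. The five data points are made-up illustrative values, not the Washington state data, and the choice of Pearson's r is an assumption rather than the method used for this post.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical schools: (High Achiever %, Tests Per Student), illustration only.
high_achiever = [5.0, 12.0, 20.0, 31.0, 41.0]
tests_per_student = [0.1, 0.3, 0.4, 0.7, 0.9]

r = pearson_r(high_achiever, tests_per_student)
print(f"r = {r:.2f}")
```

A value of r near 1 means the two measures rise together, near 0 means no linear relationship, and near -1 means one falls as the other rises.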

In the next post, we will look at the relationship between school quality and income. Are good schools always in expensive neighborhoods? Find out!


Data Science

15 August 2012

"Data Science" is a term coined by DJ Patil (@DPatil) and Jeffrey Hammerbacher (@hackingdata). The term became popular after an article, "Data Scientist: The Sexiest Job of the 21st Century". Like "webmaster", it is a term that covers multiple roles. It is also not easy, as Joseph Misiti eloquently describes.

Data scientists do 3 things:

  1. Ask Questions
  2. Write Code
  3. Do Math

Math + Code = Machine Learning (ML). It's such a dynamic field of study that the best sources are often blogs, research papers and conferences instead of books or classes. Here are the resources I've found most helpful.



Books

  1. Machine Learning for Hackers, by Drew Conway (@drewconway) and John Myles White (@johnmyleswhite)
  2. Data Analysis with Open Source Tools, by Philipp Janert

Videos

  1. Hilary Mason's (@hmason) Machine Learning in 30 Minutes
  2. The Strata conference videos
  3. DataGotham

Conferences

  1. Strata
  2. DataGotham



The most popular languages for machine learning are R and Python. The second tier of popular languages and tools includes SQL, Java, Hadoop/Mahout/Hive, and Matlab.

My favorite place to start with R is John Cook's Introduction to R for programmers.

Python is popular because it's simple and has more libraries than God. The most popular Python libraries for machine learning are NumPy, SciPy, scikit-learn and pandas.


There are thousands of machine learning algorithms. Luckily for us, a few have risen to prominence. The most popular machine learning algorithms are:


Machine learning uses a lot of matrix math, and it's quite easy to do matrix math in SQL. Joe Celko has some great examples.
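
As a small illustration of the idea, a matrix can be stored as one row per cell, and matrix multiplication then becomes a join plus a GROUP BY. The sketch below runs such a query from Python using the standard-library sqlite3 module and an in-memory database. The table and column names (`a`, `b`, `row_num`, `col_num`, `val`) and the helper `matmul_sql` are hypothetical, and this is not taken from Joe Celko's examples.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two matrices (lists of lists) with a SQL join and GROUP BY."""
    conn = sqlite3.connect(":memory:")
    try:
        # Store each matrix as (row, column, value) triples, one row per cell.
        for name, matrix in (("a", a), ("b", b)):
            conn.execute(
                f"CREATE TABLE {name} (row_num INTEGER, col_num INTEGER, val REAL)"
            )
            conn.executemany(
                f"INSERT INTO {name} VALUES (?, ?, ?)",
                [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)],
            )
        # C[i][j] = SUM over k of A[i][k] * B[k][j]
        cells = conn.execute(
            """
            SELECT a.row_num, b.col_num, SUM(a.val * b.val)
            FROM a JOIN b ON a.col_num = b.row_num
            GROUP BY a.row_num, b.col_num
            ORDER BY a.row_num, b.col_num
            """
        ).fetchall()
    finally:
        conn.close()

    # Rebuild a dense list-of-lists result from the (row, column, value) rows.
    result = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i, j, total in cells:
        result[i][j] = total
    return result


product = matmul_sql([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19.0, 22.0], [43.0, 50.0]]
```

The one-row-per-cell layout is also convenient for sparse matrices, since zero cells can simply be left out of the table.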