Groceries as a Data Problem

05 January 2017

The Benefits of Cooking

People spend a lot of money on healthcare, mostly on preventable diseases like Type-2 diabetes and heart disease. Poor diet is a key reason why. Restaurants, frozen meals, and fast food use ingredients that are worse for you than stuff you can cook yourself.

There are many structural reasons why it is hard to eat healthy (food deserts, the high cost of healthy food, the lower profit margin on home cooking). It's also psychologically harder to cook than to eat out.

I surveyed ~200 friends, and heard the same thing:

"I don't know what to make for dinner"

After inquiring a bit more, I reframed the question:

"I don't know what recipe(s) I could make with the ingredients in the house, or what groceries to keep stocked."

Aha! It turns out one of the obstacles to eating healthy is the psychological challenge of figuring out what to cook. That's great, because it's an information retrieval problem. Data to the rescue!


A Data Problem

I have a recipe data set containing 24K recipes, each of which lists ingredients like chicken, tomatoes, or olive oil. There are ~11K unique ingredients and 216K recipe-ingredient combinations to look at.

I prefer to start simply, by counting.
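
Here is a minimal sketch of that counting step in Python. The three toy recipes and the function name are made up for illustration; the real data set and the code used for this post are not shown here.

```python
from collections import Counter

# Hypothetical toy data: each recipe is a list of ingredient names.
recipes = [
    ["salt", "pepper", "chicken", "olive oil"],
    ["salt", "sugar", "flour", "eggs"],
    ["salt", "tomatoes", "olive oil", "garlic"],
]

def ingredient_counts(recipes):
    """Count how many recipes use each ingredient."""
    counts = Counter()
    for ingredients in recipes:
        # set() keeps an ingredient listed twice in one recipe from counting twice
        counts.update(set(ingredients))
    return counts

counts = ingredient_counts(recipes)
pairs = sum(counts.values())   # number of recipe-ingredient combinations
unique = len(counts)           # number of unique ingredients

print(counts.most_common(3), pairs, unique)
```

On the full data set, the same two totals would correspond to the 216K combinations and ~11K unique ingredients mentioned above.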

Common Ingredients

Just 50 common ingredients account for half of the 216K recipe-ingredient combinations.

This is great news; I can keep a modest number of ingredients in stock and be able to make a huge variety of recipes. The frequency of ingredients follows a power law curve.
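
One way to arrive at a number like "50 ingredients cover half" is to sort ingredients by frequency and add them up until the running total reaches the target share. The sketch below assumes the counts live in a `collections.Counter`; the function name and the sample counts are hypothetical.

```python
from collections import Counter

def ingredients_needed(counts, target=0.5):
    """Smallest number of top ingredients whose combined count
    reaches `target` (a fraction) of all recipe-ingredient pairs."""
    total = sum(counts.values())
    running = 0
    # most_common() yields (ingredient, count) from most to least frequent
    for k, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= target * total:
            return k
    return len(counts)

# Hypothetical counts, not taken from the real data set.
sample = Counter({"salt": 40, "sugar": 30, "pepper": 15,
                  "cornstarch": 10, "pecans": 5})

print(ingredients_needed(sample, 0.5))
print(ingredients_needed(sample, 0.9))
```

With a long-tailed distribution, the answer stays small for a 50% target and grows quickly as the target approaches 100%.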

Unsurprisingly, the most frequent ingredients are salt, sugar, and pepper. There are also some I didn't expect: cornstarch, pecans, mayonnaise.

Non-perishable Ingredients

Let's look at non-perishable ingredients. These 60 items make up 60% of the non-perishable ingredients used across all recipes. By 'non-perishable' I mean an ingredient that lasts at least a month in the fridge or pantry in my house.

These are the staples to keep handy, and to buy in bulk.

Perishable Ingredients

Let's look at perishable ingredients next. These 21 ingredients are 60% of the perishable items in all recipes. That's another small number of things to keep around, and many of these also last quite a while in a refrigerator.
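
The perishable and non-perishable splits can be sketched the same way: label each ingredient by hand, separate the counts, and measure what share of each group the top few items cover. The label set, function names, and numbers below are illustrative assumptions, not the labels used for this post.

```python
from collections import Counter

# Hypothetical hand-made labels; anything not listed is treated as non-perishable.
PERISHABLE = {"onions", "garlic", "tomatoes", "eggs"}

def split_by_shelf_life(counts, perishable=PERISHABLE):
    """Return (perishable_counts, non_perishable_counts)."""
    perish = Counter({i: n for i, n in counts.items() if i in perishable})
    stable = Counter({i: n for i, n in counts.items() if i not in perishable})
    return perish, stable

def top_n_share(counts, n):
    """Fraction of all uses in `counts` covered by its n most frequent items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(n)) / total

# Hypothetical counts for illustration only.
sample = Counter({"salt": 6, "sugar": 2, "flour": 2, "onions": 3, "eggs": 1})
perish, stable = split_by_shelf_life(sample)

print(top_n_share(stable, 1), top_n_share(perish, 1))
```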

I suspect that the most frequent ingredients are common because they are used in many cultures. Onions, garlic, tomatoes, and eggs are used around the world in a huge variety of dishes.

This is the start of several posts on food and data. Stay tuned!


Regular Expressions in SQL Server

29 December 2016

Databases store text, and the best way to manipulate text is to use a regular expression ('regex'). Using regular expressions in SQL queries has been possible in many database engines for decades.

Now you can use regular expressions in SQL Server queries, too. I've created an open-source project, sql-server-regex, that lets you run regular expressions in T-SQL queries using scalar and table-valued functions.


The most common regular expression use cases are supported, including Match, Split, Group Match, and Replace.
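
For readers new to these four operations, here is what each one means, shown with Python's standard `re` module. This is not the sql-server-regex API; the project's actual T-SQL function names and signatures are in its GitHub repository. The sample strings are made up.

```python
import re

text = "posted 2016-12-29"

# Match: find the first substring that fits the pattern
first_date = re.search(r"\d{4}-\d{2}-\d{2}", text).group(0)

# Group Match: pull out the parenthesized pieces of a match
year, month, day = re.search(r"(\d{4})-(\d{2})-(\d{2})", text).groups()

# Split: break a string wherever the pattern matches
parts = re.split(r"\s*[,;]\s*", "salt, sugar;pepper")

# Replace: substitute every match with new text
masked = re.sub(r"\d", "#", text)

print(first_date, (year, month, day), parts, masked)
```

In a database engine, Match and Replace map naturally onto scalar functions (one value in, one value out), while Split and Group Match return several values per input, which is where table-valued functions fit.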

You can use it with all versions of SQL Server that support SQL CLR integration. That's every version since SQL Server 2005, except for SQL Azure.

Next Steps

The sql-server-regex code is being tuned for performance and tested for edge-case bugs. If you'd like to help, fork the code on GitHub and get going!