Tag Archives: statistics

A Good Book Has No Ending

“A good book has no ending” – R.D. Cumming

Last week I was immensely lucky to be at SQL Cruise Alaska, and to attend Buck Woody’s (b / t) session on career development. A mere two hours later, I came away with about 5 pages of notes, a stunned expression, and the realization that I have not been sufficiently proactive in my own career. I’ll have a more complete write up in a few days, but one of the things Buck did was lay down a challenge for everyone there.

The challenge is for each of us to read 12 books that will help us advance our careers, and then do book reviews on our respective blogs. Here are my book choices, in no particular order:

  1. Hadoop, the Definitive Guide, by Tom White. For a developer who does business intelligence work, Hadoop/MapReduce is now a requisite skill. This is double true with companies that have large amounts of data (web logs, etc). I’ll be using what I learn to analyze the stock market data I have been downloading from online APIs.
  2. Basic Business Statistics, by Berenson, Levine, Krehbiel, Stephan. Working with large data sets is important. Equally important is the ability to know what kinds of analysis are best for a given problem. That means statistical analysis. My knowledge in this area is insufficient, hence this book choice.
  3. The Long Tail, by Chris Anderson. The reason to read this book is because of a question: how does the Internet affect businesses? Therefore, how does the Internet affect business needs? Therefore, how does the Internet change who is needed in future businesses?
  4. Pro SQL Azure, by Scott Klein and Herve Roggero. I expect most medium-sized businesses, and all start-ups, to eventually use cloud computing for their IT needs. The cost benefits are too great to ignore. For me, that means learning a lot more about the best practices when doing database work ‘in the cloud’.
  5. On  Writing Well, by William Zinsser. As someone without a lot of writing experience, gaining writing skill is critical. My style as a blogger is not that accessible, and I believe my readers would benefit if I expressed ideas in a more intuitive fashion.
  6. 97 Things Every Programmer Should Know, by Kevlin Henney. As a working developer, I occasionally get ‘stuck’ in a problem and forget about tips and tricks that can help. Memorizing said tricks would make me more effective when I write code.
  7. Applied Architecture Patterns on the Microsoft Platform, by Seroter, Fairweather, Thomas, Sexton, and Ramani. This book was recommended to me by Brent Ozar (b / t), and I had the luxury of skimming through it one day at a library. In 10 minutes I had a half-dozen options to solve design problems I had been struggling with. I would recommend this book to senior developers and budding architects.
  8. Adapt: Why Success Always Starts With Failure, by Tim Harford. I read Tim Harford’s The Undercover Economist two months ago, and am highly impressed by how well he can convey complex ideas using intuitive examples. Rapid prototyping and innovation is the best way for any company to produce new products and services. That means building lots of projects, many of which will fail, in order to find a few runaway successes.
  9. The Mythical Man-Month, by Frederick Brooks. This is one of the classic books in software engineering because it spells out a lot of common wisdom, and then explains why it’s all wrong. I’m very curious about which aphorisms and assumptions I am using that are incorrect.
  10. The Data Warehouse Lifecycle Toolkit, by Kimball, Reeves, Ross, and Thornthwaite. I read this book four years ago, when I was starting as a junior developer. I didn’t understand a lot of it, but the basics I picked up were enough for me to rapidly understand what my mentors were trying to teach me. I hope to pick up a lot more nuance this time around.
  11. Design Is How It Works, by Jay Greene. A lot of the truly iconic products and services out there are successful because of amazing design. As a developer who does a lot of back-end work, user-facing design is not one of my strengths. This would help remedy that.
  12. The Innovator’s Way, by Peter Denning and Robert Dunham. I don’t believe a developer can be successful unless they are creative and innovative. Working in a large company makes that tricky sometimes. Having some best practices and the proper mental attitude would be very, very helpful.

The second part of the book challenge is to check on at least one other person at SQL Cruise to make sure they’re reading as well. I’m tagging a few other people. Enjoy!

T-SQL Tuesday #16: Statistics using CLR Aggregates

T-SQL Tuesday logoIt is T-SQL Tuesday again, brought to you by Jes Schultz Borland (@Grrl_Geek). The theme for this T-SQL Tuesday is Aggregations.

For my T-SQL Tuesday post, I’ve written up some CLR aggregation functions that can do statistical analysis. I’ve built three of the most common: median(), percentile(), and covariance().

Here’s my code in a ZIP file: MathCLR – By Dev Nambi.

Median

Median finds the middle number in a set of data. If you sorted the numbers and then picked the exact middle value, that is the median.

I’m using the same sample data set from my last post, which is about the runtime of a backup job over the past week. As we can see here, the median runtime is 33 minutes.

Percentile

Percentile finds the number under which X percent of the values fall. Using our sample data set, we can find that 80% of the backup runtimes are under 480 minutes.

The way to use the function is to pass in the data set as the first parameter, and the percentile you’re looking for as the second.

Covariance

Covariance is less well known. It shows the relationship between two sets of numbers. It finds the  correlations between two sets of numbers.

Using our data set, let’s see if there’s a relationship between backup time and the amount of data backed up. As we can see below, the covariance is negative. That means that either the backups get faster as they get larger (unlikely), or something else is at work.

How about a relationship between disk queue length on the backup file server and the runtime? Aha! That shows a very high covariance, suggesting that the disks on the backup file server are the bottleneck. Either the backup server is too slow, or else some other process is slowing it down.

 

Good luck with your analysis!

No Perf for You! Read-only Databases stops Auto Create Stats

Read-only databases cannot create new statistics

A while back, I was working with Kendra Little (@KendraLittleblog), and we were using log shipping to make warm standby copies of reporting databases. The warm standby databases were marked as READ_ONLY. However, this meant the query optimization process couldn’t create new statistics (MSDN) when users ran ad-hoc SELECT statements.

Query Plans With Missing Statistics

Using a one-table unit test, we are able to see that a read-only database cannot create stats. The estimating query plan shows a warning symbol, symbolizing that the query optimizer cannot create statistics on necessary columns.

Query Plan With Missing Stats

Click on the image to see the warning in detail and actual query plan

Query Plans With Manually-Created Statistics

Let’s suppose that there are stats on every single column of every single table. Then, the query optimizer can do a better job of producing good plans. Using the query in our one-table unit test now produces a plan without warnings.

Query Plan With Stats

Click on the image to see the actual query plan

MANUALLY CREATED STATISTICS SCRIPT AND TEST

My solution is to manually create statistics on every column where they don’t already exist. Below is an example script and a unit test for it.

Script: Manually-Create-Stats.sql
Unit Test:  Manually-Create-Stats-Test.sql

Performance results

Unit Test Perf Results

The performance results are pretty clear. Using statistics in a simple, one-table unit test creates a slight performance boost. I have seen gains of 10-30% with many reporting queries; the effect of missing statistics as you join more tables together, use larger tables, and use columns for filtering.