Hardware and Architecture

13 November 2012

Software engineers and data professionals are constrained by the direction of computing hardware (Norvig). CPU speed has improved much faster than RAM speed, and has dramatically outpaced disk and network speed.

The Basics

  • CPU / GPU capacity is increasing
  • Memory speed isn't really keeping up, but it's doing alright
  • Storage and networking are barely getting faster

An application can execute 8 million CPU instructions in the time it takes to do a single disk seek. It can execute 150 million instructions in the time it takes a packet to make a round trip over the Internet.

If your application goes to disk or the network, its throughput drops by several orders of magnitude. If it has to do this synchronously, its performance is pretty well shot.

Let's bring these speeds to human scale.

  • A single CPU instruction takes about 1/2 a second.
  • An L2 cache reference takes 7 seconds.
  • Reading from RAM takes 100 seconds.
  • A single disk seek takes almost 17 weeks.
  • Reading 1MB from disk takes ~8 months.
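
These conversions are easy to sanity-check. Here is a minimal Python sketch, assuming the widely cited approximate latency figures of the era (your hardware will differ); the trick is simply stretching one nanosecond of real latency into one second of human time:

    # Stretch time by a factor of one billion: 1 ns of real latency becomes
    # 1 second of "human" time. The nanosecond figures are rough approximations
    # of the commonly quoted latency numbers, not measurements of any machine.
    LATENCIES_NS = {
        "one CPU instruction":   0.5,
        "L2 cache reference":    7,
        "main memory reference": 100,
        "disk seek":             10_000_000,   # 10 ms
        "read 1 MB from disk":   20_000_000,   # 20 ms
    }

    def human(seconds):
        """Render a duration in the largest convenient unit."""
        for unit, size in (("months", 2_629_800), ("weeks", 604_800),
                           ("days", 86_400), ("hours", 3_600), ("minutes", 60)):
            if seconds >= size:
                return f"{seconds / size:.1f} {unit}"
        return f"{seconds:.1f} seconds"

    for name, nanoseconds in LATENCIES_NS.items():
        human_scale_seconds = nanoseconds  # the 10**9 stretch cancels the 10**-9 in "nano"
        print(f"{name:22s} -> {human(human_scale_seconds)}")

    # one CPU instruction    -> 0.5 seconds
    # L2 cache reference     -> 7.0 seconds
    # main memory reference  -> 1.7 minutes
    # disk seek              -> 3.8 months  (~16.5 weeks)
    # read 1 MB from disk    -> 7.6 months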

Implications

The abundance of CPU speed and the relative scarcity of disk and network speed have big implications for application design.

Algorithms and data structures that save disk/network usage at the expense of CPU cost are making an excellent trade. A canonical example is using compression to fit more of an application's data into L1 cache, L2 cache, and RAM.
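
As a rough illustration of that trade, here is a sketch using Python's zlib as a stand-in for whatever codec a real platform uses; the payload is deliberately repetitive, so treat the ratio as illustrative rather than typical:

    import json
    import zlib

    # A repetitive payload, like a column of log rows.
    rows = [{"status": "OK", "region": "us-east-1", "latency_ms": i % 50}
            for i in range(10_000)]
    raw = json.dumps(rows).encode("utf-8")

    packed = zlib.compress(raw, 6)          # spend CPU cycles here...
    print(f"raw bytes:        {len(raw):,}")
    print(f"compressed bytes: {len(packed):,}")
    print(f"ratio:            {len(raw) / len(packed):.1f}x")  # ...to save slow disk/network I/O

    # Decompressing on the other side is pure CPU work -- exactly the resource
    # these hardware trends say we have in abundance.
    assert zlib.decompress(packed) == raw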

The most compelling features of some large data applications exploit compression and the memory hierarchy:

  • SQL Server columnstore indexes (xVelocity) - data is compressed so that far more of it fits in the CPU caches.
  • SQL Server Hekaton - keeps everything in memory, so no disk access is needed.
  • Google's Dremel/BigQuery - data is stored in a columnar format and highly compressed (see the sketch after this list).
  • Caching applications/tiers use RAM instead of disk because it is roughly two orders of magnitude (80-100X) faster.
  • Hadoop clusters regularly compress their files for better performance.
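
The columnar entries in that list share one idea: values within a single column resemble each other, so storing a column contiguously gives the compressor far more redundancy to work with than interleaving whole rows. A toy sketch, with JSON plus zlib standing in for real columnstore formats:

    import json
    import zlib

    # The same 10,000-row table serialized two ways. Grouping values by column
    # puts similar values next to each other, which typically compresses better.
    rows = [{"country": "US" if i % 3 else "CA", "price": 9.99, "qty": i % 5}
            for i in range(10_000)]

    row_major = json.dumps(rows).encode()
    col_major = json.dumps({
        "country": [r["country"] for r in rows],
        "price":   [r["price"] for r in rows],
        "qty":     [r["qty"] for r in rows],
    }).encode()

    for label, blob in (("row-major", row_major), ("column-major", col_major)):
        packed = zlib.compress(blob, 9)
        print(f"{label:12s} raw={len(blob):>8,}  compressed={len(packed):>7,}")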

It is cheaper to move code to the data than the reverse. For just this reason, a Hadoop cluster tries to send the map() and reduce() code to the nodes that already hold the data.
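
To see the size asymmetry, here is a minimal in-process sketch of map() and reduce() functions (the classic word count). The functions are a few hundred bytes of code; a real cluster serializes them, runs them next to each data block, and only moves the small intermediate results around:

    from collections import defaultdict

    def map_fn(line):
        """Emit (word, 1) pairs for one line of input."""
        for word in line.split():
            yield word.lower(), 1

    def reduce_fn(word, counts):
        """Combine all the counts emitted for one word."""
        return word, sum(counts)

    def local_mapreduce(lines):
        """Run the same logic in one process; a real cluster ships map_fn to
        the nodes that hold the data instead of shipping the data to map_fn."""
        groups = defaultdict(list)
        for line in lines:
            for key, value in map_fn(line):
                groups[key].append(value)
        return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

    print(local_mapreduce(["the quick brown fox", "jumps over the lazy dog"]))
    # {'brown': 1, 'dog': 1, 'fox': 1, 'jumps': 1, 'lazy': 1, 'over': 1, 'quick': 1, 'the': 2}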

Web applications transmit compressed data to the browser, which can use JavaScript and abundant CPU power to do complicated rendering. This exploits two hardware trends:

  1. Compressed data is smaller than a corresponding graphic, reducing network delay.
  2. There is more CPU capacity on 1,000,000 clients than on 200 servers, so companies can buy less hardware for the same user load.

Big Fat Checks

The logic is different for enterprises. Proven, reliable options are more important than fast or cheap ones. Large businesses will spend more on tiered SAN storage than on helping their engineers learn about caching. Why? Because the former is vendor-supported and proven to work over decades.

This is a losing strategy. The best applications use hardware trends to their advantage, rather than trying to overcome them with sheer scale or spending.

Basic compute resources (quad-core CPUs, 16-64GB of RAM, 2-3TB of magnetic disk) are dirt cheap and getting cheaper. Hardware and core applications are rapidly becoming commodities or services.

Challenge = Opportunity

Many engineers haven't grasped these ideas, and even fewer architects have. That's a shame, but it means dramatic performance improvements are within reach for anyone willing to think carefully.

For any hard-core engineers out there, I have a humble request: study compression algorithms. Write them into existing, easy-to-use libraries in as many languages as you can. In the long run, this has the potential to reduce cost and increase performance across a massive number of applications.

The 2010-2020 decade in computing is turning out to be an exciting one. Let's help shape it to be even better.


Big Data - Brains vs. Hype

12 November 2012

This past week was PASS Summit 2012, the annual gathering of database professionals to learn about database management, business intelligence, and upcoming trends. A common theme this year was big data.

Few people could agree on a definition or the implications of big data; there is a lot of FUD going around. This is tragically ironic, because big data tools and techniques arose as a backlash against misinformation and silos. The original MapReduce paper (Dean, 2004) describes why it was built: to let more Google employees do more analysis on massive data sets and help Google build better products.

The goal: Make data driven decisions.

It's about Brains, Stupid

Data has no value on its own. Storing all your data in Hadoop gives you nothing but a very big bill.

The goal isn't big data. The goal is better decisions.

To have better decisions, you need data and brains. The benefit comes when talented people analyze data, and use that analysis to make things better. Great tools are not a panacea.

Here is a handy guide to help you decide whether your organization can build, use, and benefit from a big data project:

For each question, add the points for your answer (Yes or No):

  • Does your organization make data-driven decisions more than 50% of the time? (Yes: +100, No: -500)
  • Does your organization currently make data-driven decisions more than 80% of the time? (Yes: +200, No: -100)
  • Do you have people with the statistical and programming skill to analyze data well? (Yes: +200, No: -200)
  • Do you currently let projects fail, and learn from the failure? (Yes: +75, No: -150)
  • Are your colleagues curious and open-minded? (Yes: +50, No: -200)
  • Do you use your existing data as much as possible? (Yes: +50, No: -50)
  • Think of a data set relevant to your business. Can you brainstorm at least 10 uses for it? (Yes: +50, No: -100)
  • Do you understand the design and use of 'big data' tools well enough to tell marketing from reality? (Yes: +25, No: -400)

Add up all of your scores:

  • 300+: Your team and culture can make things happen. Go forth and be awesome.
  • 200 to 300: You can probably benefit from a big data project, but be sure to address gaps.
  • 100 to 200: You'll benefit more from helping your organization become more data driven, curious, and honest than by any magic project or product.
  • Less than 0: Watch out for pointy-haired bosses and dysfunctional office politics.
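
If you prefer code to checklists, the scoring amounts to this small sketch (the questions and points are the ones from the table; the answer list at the bottom is a made-up example):

    # (question, points if yes, points if no) -- same values as the table above.
    QUIZ = [
        ("Data-driven decisions more than 50% of the time",          100, -500),
        ("Data-driven decisions more than 80% of the time",          200, -100),
        ("People with statistical and programming skill",            200, -200),
        ("Let projects fail and learn from the failure",              75, -150),
        ("Curious and open-minded colleagues",                        50, -200),
        ("Existing data used as much as possible",                    50,  -50),
        ("Can brainstorm at least 10 uses for a relevant data set",   50, -100),
        ("Can tell big-data marketing from reality",                  25, -400),
    ]

    def score(answers):
        """Sum the points for one organization's yes/no answers, in quiz order."""
        return sum(yes_points if answer else no_points
                   for (_, yes_points, no_points), answer in zip(QUIZ, answers))

    # Hypothetical shop: good skills and culture, weak on actually using its data.
    print(score([True, False, True, True, True, False, True, True]))  # 350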

Big data is valuable only for certain business problems, within certain organizational cultures, and with certain types of people involved. It is not a panacea.

The real panacea, as always, is having smart people, a curious/honest organizational culture, and a collective desire to do amazing things.
