Software engineers and data professionals are constrained by the direction of computing hardware (Norvig). CPU speed has improved more quickly than the speed of RAM, and dramatically outpaced disk/network speed.
An application can execute 8 million CPU instructions in the time it takes to do one disk seek. It can execute 150 million instructions in the time a packet takes to make a round trip over the Internet.
If your application goes to disk or the network, its performance capability slows down by several orders of magnitude. If the application has to do this synchronously, its performance is pretty well shot.
Let's bring these speeds to human scale.
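One way to make the scale concrete is to pretend a single CPU instruction takes one second and see how long the slower operations would then feel. The sketch below uses only the two ratios quoted above; the one-second baseline and all names are assumptions chosen for illustration.

```python
# Ratios quoted in the text; the 1-second-per-instruction baseline is an assumption.
INSTRUCTIONS_PER_DISK_SEEK = 8_000_000
INSTRUCTIONS_PER_NETWORK_ROUND_TRIP = 150_000_000

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def human_scale_seconds(instructions: int, seconds_per_instruction: float = 1.0) -> float:
    """Time an operation would take if each CPU instruction lasted a human-scale interval."""
    return instructions * seconds_per_instruction


disk_seek_days = human_scale_seconds(INSTRUCTIONS_PER_DISK_SEEK) / SECONDS_PER_DAY
round_trip_years = human_scale_seconds(INSTRUCTIONS_PER_NETWORK_ROUND_TRIP) / SECONDS_PER_YEAR

print(f"Disk seek: about {disk_seek_days:.0f} days")
print(f"Internet round trip: about {round_trip_years:.1f} years")
```

At that scale a disk seek is roughly three months of waiting, and an Internet round trip is nearly five years.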
The abundance of CPU speed and the relative scarcity of disk/network speed have big implications for application design.
Algorithms and data structures that save disk/network usage at the expense of CPU cost are making an excellent trade. A canonical example is using compression to fit more of an application into L1 cache, L2 cache, and RAM.
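As a rough illustration of that trade, the sketch below spends CPU time to shrink a block of data with Python's standard `zlib` module. The sample records are made up and deliberately repetitive, so the ratio it prints says nothing about real workloads; it only shows the shape of the exchange.

```python
import zlib

# Hypothetical sample data: repetitive, log-like records (an assumption for illustration).
records = "\n".join(
    f"2012-11-09,server-{i % 4},status=OK,latency_ms={i % 10}" for i in range(10_000)
).encode("utf-8")

# Spend CPU cycles once to make the data smaller to store, cache, or send.
compressed = zlib.compress(records, 6)
ratio = len(records) / len(compressed)

# Spend CPU cycles again to get the exact original bytes back.
restored = zlib.decompress(compressed)

print(f"raw: {len(records)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```

Fewer bytes means fewer disk seeks and network packets, which are the expensive operations; the compress and decompress calls are the cheap ones.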
The most compelling features of some large data applications use compression:
It is cheaper to move code to the data than the reverse. A Hadoop cluster tries to send the map() and reduce() code to the nodes that already have the data, for just this reason.
The logic is different for enterprises. Proven, reliable options are more important than fast or cheap ones. Large businesses are willing to spend more to buy tiered SAN storage than to help their engineers learn about caching. Why? Because the former is vendor-supported and proven to work over decades.
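The idea of moving code to the data can be sketched as a placement rule: when choosing where to run a task, prefer a node that already stores the block it needs. This is a toy illustration, not Hadoop's actual scheduler; `pick_node`, the block IDs, and the node names are all hypothetical.

```python
from typing import Dict, List, Set


def pick_node(block_id: str, block_locations: Dict[str, Set[str]], free_nodes: List[str]) -> str:
    """Prefer a free node that already stores the block; otherwise fall back to any free node."""
    if not free_nodes:
        raise ValueError("no free nodes available")
    local_nodes = block_locations.get(block_id, set())
    for node in free_nodes:
        if node in local_nodes:
            return node  # only the small task code has to travel
    return free_nodes[0]  # no local copy is free, so the data must cross the network


# Hypothetical cluster state: which nodes hold a copy of which block.
block_locations = {"block-1": {"node-a", "node-c"}, "block-2": {"node-b"}}

local_choice = pick_node("block-1", block_locations, ["node-b", "node-c"])
fallback_choice = pick_node("block-2", block_locations, ["node-a"])
print(local_choice, fallback_choice)
```

The first call lands on `node-c`, which already holds the block; the second has no local option and falls back to `node-a`, paying the network cost the rule tries to avoid.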
This is a losing strategy. The best applications use hardware trends to their advantage, rather than try to overcome them using sheer scale or price.
The cost of basic compute resources (quad-core CPUs, 16-64GB of RAM, 2-3TB of magnetic disk) is dirt cheap and getting cheaper. Hardware and core applications are rapidly becoming a commodity or service.
These ideas aren't grasped by many engineers, and certainly not many architects. That's a shame, but it means that it's very possible to achieve dramatic performance improvements by thinking carefully.
For any hard-core engineers out there, I have a humble request: study compression algorithms. Write them into existing, easy-to-use libraries in as many languages as you can. In the long run, this has the potential to reduce cost and increase performance across a massive number of applications.
The 2010-2020 decade in computing is turning out to be an exciting one. Let's help shape it to be even better.
This past week was PASS Summit 2012, the annual gathering of database professionals to learn about database management, business intelligence, and upcoming trends. A common theme this year was big data.
Few people could agree on a definition or the implications of big data; there is a lot of FUD going around. This is tragically ironic because big data tools and techniques arose as a backlash to misinformation and silos. The original MapReduce paper (Dean, 2004) describes why it was built: a way for more Google employees to do more analysis on massive data sets, to help Google build better products.
The goal: Make data-driven decisions.
Data has no value on its own. Storing all your data in Hadoop gives you nothing but a very big bill.
The goal isn't big data. The goal is better decisions.
To have better decisions, you need data and brains. The benefit comes when talented people analyze data, and use that analysis to make things better. Great tools are not a panacea.
Here is a handy guide to help you decide whether your organization can build, use, and benefit from a big data project:
| Question | Yes | No |
|---|---|---|
| Does your organization make data-driven decisions more than 50% of the time? | +100 | -500 |
| Does your organization currently make data-driven decisions more than 80% of the time? | +200 | -100 |
| Do you have people with statistical and programming skill to analyze data well? | +200 | -200 |
| Do you currently let projects fail, and learn from the failure? | +75 | -150 |
| Are your colleagues curious and open-minded? | +50 | -200 |
| Do you use your existing data as much as possible? | +50 | -50 |
| Think of a relevant data set to your business. Can you brainstorm at least 10 uses for it? | +50 | -100 |
| Do you understand the design and use of 'big data' tools enough to identify marketing vs. reality? | +25 | -400 |
Add up all of your scores:
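The tally itself is simple arithmetic, sketched below. It assumes the first number in each row is the score for a "yes" answer and the second is the score for a "no"; the function name and the sample answers are hypothetical.

```python
from typing import List, Tuple

# (points if yes, points if no), in the order the questions appear in the guide.
# Assumption: the positive number in each row is the "yes" score.
SCORES: List[Tuple[int, int]] = [
    (100, -500),
    (200, -100),
    (200, -200),
    (75, -150),
    (50, -200),
    (50, -50),
    (50, -100),
    (25, -400),
]


def total_score(answers: List[bool]) -> int:
    """Sum the yes/no points for one organization's answers."""
    if len(answers) != len(SCORES):
        raise ValueError(f"expected {len(SCORES)} answers, got {len(answers)}")
    return sum(yes if answer else no for answer, (yes, no) in zip(answers, SCORES))


best_case = total_score([True] * 8)
worst_case = total_score([False] * 8)
print(best_case, worst_case)
```

Under that assumption the totals range from -1700 (every answer "no") to +750 (every answer "yes"), so a single "no" on a heavily weighted question can wipe out several "yes" answers.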
Big data is valuable only for certain business problems, in certain organizational cultures, and with certain types of people involved. It is not a panacea.
The real panacea, as always, is having smart people, a curious/honest organizational culture, and a collective desire to do amazing things.