Compressing Tabular Data

11 January 2013

I have a bandwidth problem. I am a software developer who builds services using ‘the cloud’. I have to copy large amounts of data between datacenters, like those run by Amazon Web Services or Microsoft’s Azure. Such WAN copies are predominantly slow and pricey. Much of the the data is tabular (stored as rows and columns). I use .csv files a lot. I need bigger tubes.

A solution is to compress the files. CSV files compress somewhat well. CPU capacity is cheap and abundant, especially compared to network bandwidth. Like all software developers, my solutions must optimize for cost and speed.

There is still a problem for large data sets. Even with 90% compression (1 TiB -> 10 GiB), data transfer takes time. There is a better way.

It’s All About the Algorithms

The key is to understand compression algorithms. They analyze one file at a time, in sequential order, looking at blocks of data. The more alike the data blocks, the better the compression.

Csv files compress inefficiently, because a csv file is stored row by row. The data across a row can be very different (for example: numbers, strings, dates, decimals). The data down an entire column is likely to be very similar (for example: all numbers or all dates).

A better way to compress tabular data is column by column instead of row by row. This is already part of large-scale data tools like Vertica, SQL Server’s Columnstore indexes, and Cassandra’s secondary indexes.

A simple way to turn a CSV file into a column-oriented (columnar) format is to save each column to a separate file. To load the data back in, read a single line from each file (column), and ‘stitch’ the data back together into a row.

Show Me the Numbers

All algorithms should be tested. The ‘data_per_formatted’ data from a Kaggle competition is a good test set. The csv file is 7.69 GiB uncompressed. The columnar set of files for this data is the same size.

A simple Powershell script ran both csv and columnar data sets through 20 different compression tests, using 7Zip as the compression tool. Each test recorded the time to compress and the resulting file size. The tests themselves were run on a desktop.

The columnar data compresses more efficiently than the csv data. It can also be faster.

As we can see, the best csv test reduced the data to 581 MiB. The best columnar test reduced the data to 402 MiB, which is a 31% improvement. Equally important, columnar data compresses better than csv data no matter which compression algorithm you use.

Time to Copy

The goal is to reduce the time to copy data from one place to another, including the time to compress and decompress the data.

Using compression is almost always a good choice. Its time advantage decreases only when the network speed approaches the speed of compression itself.

Compression is most appropriate for data in a certain size range, between 1 GiB and 20 TiB. Smaller data sets don’t take very long to copy. Larger data sets larger are best copied using the massively scalable bandwidth of FedEx.

These problems aren’t going away. The continuing rise of distributed system architectures, cloud computing, larger data volumes guarantees, and IT budgetary pressure guarantee that. Dramatic improvements in speed and cost can be achieved by applying existing tools in unexpected ways.


Observations on Software Engineering

01 January 2013

Software engineering is a very human craft, full of trade-offs. These trade-offs don’t behave in a linear way. The reaction of “if __ is good, do more of it” _is therefore dangerously misguided.

Let’s see this visually:

Software becomes more useful as features are added. However, software becomes useless as it becomes too complicated for its users.

When an application has a small codebase, it is hard to change because it is highly coupled. When an application has a large codebase, it is hard to change because nobody understands the whole system.

The ideal balance is in the middle. Remember, all code is debt.

A common reaction to the trade-off above is to use existing components and tools instead of building your own. However, the time and cost of system integration between components spirals out of control as their number increases.

The Silicon Triangle

Pick two. Pick any two.

People want speed, scope, and total control. However, increased scope and control each requires more time. This is a classic trilemma. I call this variation of the Iron Triangle the Silicon Triangle.

Assets and Liabilities

Asset Liability
Commonly used features Rarely used features
Normal users Edge-case users
System uptime System complexity
System scalability System bottlenecks
Automation Manual processes
Simplicity Complexity
Test coverage Lines of code
Domain expertise Ignorance and hubris

Where Do We Go From Here?

These compromises will affect everything you do. Find your own balance and wisdom. Reflect on your failures and successes. Use that insight to be wiser and thus more effective.

Happy New Year!

Note: these graphs don’t use real data; they are for illustration purposes.