11 January 2013
I have a bandwidth problem. I am a software developer who builds services using ‘the cloud’. I have to copy large amounts of data between datacenters, like those run by Amazon Web Services or Microsoft’s Azure. These WAN copies are slow and expensive. Much of the data is tabular (stored as rows and columns), and I use .csv files a lot. I need bigger tubes.
One solution is to compress the files. CSV files compress reasonably well, and CPU capacity is cheap and abundant, especially compared to network bandwidth. Like all software developers, I need solutions that optimize for cost and speed.
There is still a problem for large data sets. Even with 90% compression (1 TiB -> ~100 GiB), the transfer takes a long time. There is a better way.
The key is to understand how compression algorithms work. They analyze one file at a time, sequentially, looking at blocks of data. The more alike the blocks, the better the compression.
CSV files compress inefficiently because a CSV file is stored row by row. The data across a row can be very different (for example: numbers, strings, dates, decimals), while the data down an entire column is likely to be very similar (for example: all numbers or all dates).
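For example, a tiny, made-up three-row file interleaves the types on disk, while the same data arranged by column keeps like values together:

```
orders.csv (row by row)        split by column
-------------------------      --------------------------------------
2013-01-08,Widget,19.99        dates.txt     names.txt     prices.txt
2013-01-09,Gadget,24.50        2013-01-08    Widget        19.99
2013-01-10,Widget,19.99        2013-01-09    Gadget        24.50
                               2013-01-10    Widget        19.99
```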
A better way to compress tabular data is column by column instead of row by row. This is already part of large-scale data tools like Vertica, SQL Server’s Columnstore indexes, and Cassandra’s secondary indexes.
A simple way to turn a CSV file into a column-oriented (columnar) format is to save each column to a separate file. To load the data back in, read a single line from each file (column), and ‘stitch’ the data back together into a row.
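The scripts themselves aren’t shown here, so the following is just a minimal sketch of the splitting step in PowerShell; the file names are placeholders, and the naive comma split does not handle quoted fields or embedded commas.

```powershell
# Minimal sketch: split a CSV (with a header row) into one file per column.
# Assumptions: paths are placeholders; Split(',') ignores quoting, so this
# only works for simple CSV files without embedded commas.
$sourceFile = Join-Path (Get-Location) "data.csv"
$columnDir  = Join-Path (Get-Location) "columns"
New-Item -ItemType Directory -Force -Path $columnDir | Out-Null

$reader = New-Object System.IO.StreamReader($sourceFile)
$header = $reader.ReadLine().Split(',')

# One streaming writer per column, named col000.txt, col001.txt, ...
$writers = foreach ($i in 0..($header.Length - 1)) {
    New-Object System.IO.StreamWriter((Join-Path $columnDir ("col{0:D3}.txt" -f $i)))
}

while (-not $reader.EndOfStream) {
    $fields = $reader.ReadLine().Split(',')
    for ($i = 0; $i -lt $writers.Length; $i++) {
        $writers[$i].WriteLine($fields[$i])
    }
}

$reader.Close()
$writers | ForEach-Object { $_.Close() }
```

Stitching is the reverse: open a reader for every column file, read one line from each, and join them with commas to rebuild the row.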
All algorithms should be tested. The ‘data_per_formatted’ data from a Kaggle competition is a good test set. The CSV file is 7.69 GiB uncompressed; the columnar set of files for the same data is, of course, the same total size.
A simple PowerShell script ran both the CSV and the columnar data sets through 20 different compression tests, using 7-Zip as the compression tool. Each test recorded the time to compress and the resulting file size. The tests themselves were run on a desktop.
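The actual script and its 20 configurations aren’t reproduced here; a cut-down harness along these lines could produce the same kind of timing and size numbers, assuming the 7-Zip command-line tool 7z.exe is on the PATH and using a handful of illustrative settings.

```powershell
# Sketch of a compression test harness (illustrative settings, not the
# original 20 tests). Assumes the 7-Zip command-line tool 7z is on the PATH.
$sourcePath = Join-Path (Get-Location) "columns"   # or the single CSV file
$configs = @(
    @{ Name = "zip-fast" ; Args = "-tzip -mx=1" },
    @{ Name = "zip-max"  ; Args = "-tzip -mx=9" },
    @{ Name = "7z-lzma2" ; Args = "-t7z -m0=lzma2 -mx=9" },
    @{ Name = "7z-bzip2" ; Args = "-t7z -m0=bzip2 -mx=9" }
)

$results = foreach ($c in $configs) {
    $archive = Join-Path (Get-Location) "$($c.Name).archive"
    Remove-Item $archive -ErrorAction SilentlyContinue
    $switches = $c.Args.Split(' ')

    # Time the compression run and record the archive size.
    $elapsed = Measure-Command {
        & 7z a $switches $archive $sourcePath | Out-Null
    }

    [pscustomobject]@{
        Test    = $c.Name
        Seconds = [math]::Round($elapsed.TotalSeconds, 1)
        SizeMiB = [math]::Round((Get-Item $archive).Length / 1MB, 1)
    }
}

$results | Format-Table -AutoSize
```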
The columnar data compresses more efficiently than the CSV data. It can also be faster to compress.
The best CSV test reduced the data to 581 MiB. The best columnar test reduced it to 402 MiB, a 31% improvement. Equally important, the columnar data compresses better than the CSV data no matter which compression algorithm you use.
The goal is to reduce the time to copy data from one place to another, including the time to compress and decompress the data.
Using compression is almost always the right choice. Its time advantage shrinks only when the network is so fast that it approaches the throughput of the compressor itself.
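To make the trade-off concrete, here is a back-of-the-envelope calculation using the sizes from the tests above and assumed (not measured) throughput numbers for the WAN link and the compressor.

```powershell
# Back-of-the-envelope comparison. The WAN bandwidth and compressor/decompressor
# throughputs below are assumptions for illustration, not measured results.
$rawMiB          = 7.69 * 1024        # uncompressed CSV from the test data set
$compressedMiB   = 402                # best columnar result above
$wanMiBps        = 50 / 8 / 1.048576  # assumed 50 Mbit/s link, as MiB/s
$compressMiBps   = 20                 # assumed compression throughput
$decompressMiBps = 60                 # assumed decompression throughput

$copyRaw        = $rawMiB / $wanMiBps
$copyCompressed = ($rawMiB / $compressMiBps) +
                  ($compressedMiB / $wanMiBps) +
                  ($rawMiB / $decompressMiBps)

"Copy raw CSV:               {0:N0} s" -f $copyRaw
"Compress, copy, decompress: {0:N0} s" -f $copyCompressed
```

With these assumptions the compressed path wins by better than 2x; as the link gets faster the two totals converge, and once the link outruns the compressor the raw copy wins.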
Compression is most appropriate for data in a certain size range, roughly between 1 GiB and 20 TiB. Smaller data sets don’t take very long to copy. Larger data sets are best copied using the massively scalable bandwidth of FedEx.
These problems aren’t going away. The continuing rise of distributed system architectures, cloud computing, larger data volumes, and IT budgetary pressure guarantees that. Dramatic improvements in speed and cost can be achieved by applying existing tools in unexpected ways.