Using a Variation on RLE to Achieve Lossless Compression for Near-consecutive or Grouped Tabular Data

Jack Devey

5.00/5 (3 votes)

9 Jun 2023CPOL3 min read

7.9K

Introducing a lossless compression mechanism for data structured in a table or matrix

In scenarios where there is a requirement for storing masses of loosely consecutive or grouped data, a variation of run-length-encoding that repeats content until overridden can be implemented to achieve a form of lossless compression.

Introduction

I’m working on a project called Track that among other things, requires me to store data about lots of trains in a tabular format. In the UK, most trains (or rolling stock) are ordered by operating companies in batches, leading to groups of almost identical stock. Each unit is numbered in a near consecutive way, and most of the time, this is the only way to distinguish between stock from the same group.

Therefore, when it comes to storing this as a table, there is quite a lot of unnecessary repetition, which can be efficiently removed using a variation on run-length-encoding.

Examples

It’s probably best if I try and explain with some examples. Let’s say we need to store the following data about five trains:

Number	Operator	Carriages	Built
453101	AWR	8	2010
453102	AWR	8	2011
453103	AWR	5	2011
453104	LWR	5	2011
453105	LWR	5	2011

The first thing you might notice is that a lot of the data is grouped, and the only thing that changes every time is the number. This is a perfect candidate for variable-length run-length-encoding implementation.

Number	Operator	Carriages	Built
453101	AWR	8	2010
+	/	/	+
+	/	5	/
+	LWR	/	/
+	/	/	/

As you can see, the first record is stored in full, this sets the defaults for each of the following cells. Any occurrence where data is not filled (represented with a slash /) will be replaced with the previous value, which at the start will be taken from the first record.

Occurrences where data is incremented from the value above are represented with a plus +, this can be seen in the number column.

How Does this Differ from RLE?

RLE is a very simple algorithm that is used to compress data by storing the number of times a value is repeated. For example, the following data:

Apple Apple Apple Apple Banana Banana Apple Apple Apple

can be compressed to:

4 Apple 2 Banana 3 Apple

The difference between this and my implementation is that my implementation only stores a value once until it is overidden, and thus only works with data represented vertically. ‘Rotating’ the data to the vertical and compressing it with my implementation would result in:

Apple
/
/
/
Banana
/
Apple
/
/

Which as you can see, is more complex than the original data! It’s only really when data is stored in a tabular format with multiple columns of data that my algorithm can be used to gain a substantial effect.

Performance

I’ve implemented this technique on a larger dataset of 87 trains with size of 10,430 bytes and the compressed result is only 3,123 bytes, which gives a reduction of 70.3%.

Caveats

This technique is not suitable for all data, and there are a few caveats to be aware of:

The data must be grouped in some way, with lots of repetition or consecutive data that can be removed and represented through slashes / or pluses +.
The data must be able to be stored in a tabular format (as proven above).
Processing the data is slower, as it must be decompressed before it can be used.

This is a very simple technique that can be used to achieve a form of lossless compression for data that is stored in a tabular format. It’s not suitable for all data, but it can be very useful in certain scenarios as shown in the examples.

History

9^th June, 2023: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)