In scenarios where there is a requirement for storing masses of loosely consecutive or grouped data, a variation of run-length-encoding that repeats content until overridden can be implemented to achieve a form of lossless compression.
Introduction
I’m working on a project called Track that among other things, requires me to store data about lots of trains in a tabular format. In the UK, most trains (or rolling stock) are ordered by operating companies in batches, leading to groups of almost identical stock. Each unit is numbered in a near consecutive way, and most of the time, this is the only way to distinguish between stock from the same group.
Therefore, when it comes to storing this as a table
, there is quite a lot of unnecessary repetition, which can be efficiently removed using a variation on run-length-encoding.
Examples
It’s probably best if I try and explain with some examples. Let’s say we need to store the following data about five trains:
Number | Operator | Carriages | Built |
453101 | AWR | 8 | 2010 |
453102 | AWR | 8 | 2011 |
453103 | AWR | 5 | 2011 |
453104 | LWR | 5 | 2011 |
453105 | LWR | 5 | 2011 |
The first thing you might notice is that a lot of the data is grouped, and the only thing that changes every time is the number. This is a perfect candidate for variable-length run-length-encoding implementation.
Number | Operator | Carriages | Built |
453101 | AWR | 8 | 2010 |
+ | / | / | + |
+ | / | 5 | / |
+ | LWR | / | / |
+ | / | / | / |
As you can see, the first record is stored in full, this sets the defaults for each of the following cells. Any occurrence where data is not filled (represented with a slash /
) will be replaced with the previous value, which at the start will be taken from the first record.
Occurrences where data is incremented from the value above are represented with a plus +
, this can be seen in the number column.
How Does this Differ from RLE?
RLE is a very simple algorithm that is used to compress data by storing the number of times a value is repeated. For example, the following data:
Apple Apple Apple Apple Banana Banana Apple Apple Apple
can be compressed to:
4 Apple 2 Banana 3 Apple
The difference between this and my implementation is that my implementation only stores a value once until it is overidden, and thus only works with data represented vertically. ‘Rotating’ the data to the vertical and compressing it with my implementation would result in:
Apple
/
/
/
Banana
/
Apple
/
/
Which as you can see, is more complex than the original data! It’s only really when data is stored in a tabular format with multiple columns of data that my algorithm can be used to gain a substantial effect.
Performance
I’ve implemented this technique on a larger dataset of 87 trains with size of 10,430 bytes and the compressed result is only 3,123 bytes, which gives a reduction of 70.3%.
Caveats
This technique is not suitable for all data, and there are a few caveats to be aware of:
- The data must be grouped in some way, with lots of repetition or consecutive data that can be removed and represented through slashes
/
or pluses +
. - The data must be able to be stored in a tabular format (as proven above).
- Processing the data is slower, as it must be decompressed before it can be used.
This is a very simple technique that can be used to achieve a form of lossless compression for data that is stored in a tabular format. It’s not suitable for all data, but it can be very useful in certain scenarios as shown in the examples.
History
- 9th June, 2023: Initial version