Introduction
My wish to share my latest LZSS decompressor, Nakamichi 'Loonette', has everything to do with the hope that someone (in the future) will improve on it.
No sane person needs such slowness, yet in order to explore the potential of deep LZSS (on English texts) I dove into the deep.
Background
One of the main goals is to boost I/O bandwidth (linear reads) 2x or 3x by using superfast decompression; NTFS uses some kind of LZSS to achieve exactly that. Recently I saw M.2 SSD drives breaking the 1GB/s linear-read barrier, yet next to super implementations such as LzTurbo and LZ4 those 1024MB/s look like an automobile from the '70s. Now we reach for ten times as much.
A Glimpse at Compression Approach
Right into the ZIP face-off:
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>dir
02/20/2015 06:00 PM 33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/13/2015 03:46 AM 13,365,137 Agatha_Christie_89-ebooks_TXT.tar.lz4
02/20/2015 05:37 PM 11,745,883 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/20/2015 09:04 PM 11,173,343 Agatha_Christie_89-ebooks_TXT.tar.zip
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>timer32.exe 7za.exe t Agatha_Christie_89-ebooks_TXT.tar.zip
7-Zip (A) 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
Processing archive: Agatha_Christie_89-ebooks_TXT.tar.zip
Testing Agatha_Christie_89-ebooks_TXT.tar
Everything is Ok
Size: 33258496
Compressed: 11173343
Kernel Time = 0.015 = 3%
User Time = 0.421 = 92%
Process Time = 0.436 = 95% Virtual Memory = 2 MB
Global Time = 0.457 = 100% Physical Memory = 4 MB
The warhorse 7-Zip decompresses at roughly 33,258,496/0.436/1024/1024 = 72MB/s, while Nakamichi 'Loonette' runs at 256MB/s (shown further below). Of course, ZIP looks back over a mere 64KB while 'Loonette' sees 512MB, which means that on BIG&REDUNDANT TEXTS the compression ratio will go up from 3:1 to 4:1.
As stated, compression is the slowest part because only brute-force memmem() match finding is used. Choosing match lengths of either 12 or 8 bytes simplifies the compression if hashing is to be implemented, as the sketch below illustrates.
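To make the 'brutal' part concrete, here is a minimal sketch (my reconstruction for illustration, not the author's actual encoder; FindMatchBrutal and BIG_WINDOW are hypothetical names): at each position the next 12 bytes, then the next 8, are searched for in the window behind the cursor. memmem() is a GNU/glibc extension; substitute your own search on other platforms.

#define _GNU_SOURCE       /* memmem() is a GNU extension */
#include <string.h>
#include <stddef.h>

#define BIG_WINDOW (512u*1024*1024)  /* the biggest 'Loonette' Sliding Window */

/* Returns the match length (12, 8, or 0 for none) and on success fills
   *offset with the backward distance; brute force, hence the slowness. */
static size_t FindMatchBrutal (const char* buf, size_t pos, size_t size, size_t* offset)
{
    size_t window = (pos < BIG_WINDOW) ? pos : BIG_WINDOW;
    const char* windowStart = buf + pos - window;
    static const size_t lengths[2] = { 12, 8 };   /* try the longer match first */
    for (int i = 0; i < 2; i++) {
        size_t len = lengths[i];
        if (pos + len > size) continue;           /* not enough lookahead left */
        const char* hit = memmem(windowStart, window, buf + pos, len);
        if (hit) {
            *offset = (size_t)(buf + pos - hit);  /* backward distance */
            return len;
        }
    }
    return 0;  /* no match found: the encoder emits a literal run instead */
}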
The chosen order, size-wise (match length : offset bytes = ratio), is:
// 12:2 = 6
// 8:2 = 4
// 12:3 = 4
// 12:4 = 3
// 8:3 = 2.6
// 8:4 = 2
The second number stands for the bytes used to encode match offsets, i.e., 2, 3, or 4; for example, a 12-byte match paid for with a 2-byte token yields a 12:2 = 6:1 ratio.
The compression is based on classical LZSS (Lempel–Ziv–Storer–Szymanski); I used blueprints by a Japanese coder (many thanks, Ito) and simplified both the encoder and decoder.
The attached archive (in the Download section) contains the C source and its Assembly counterpart generated by the latest Intel C optimizer.
A Glimpse at Decompression Approach
My obsession with superfast textual processing resulted in the third-fastest (LzTurbo and LZ4 being first and second) textual decompressor known to me. It is called Nakamichi 'Tengu' and decompresses 1GB of English texts on an i5 4690K @3.5GHz (no turbo) as follows:
>Nakamichi_Tengu_XMM_PREFETCH_4096.exe DDETT-Definitive_Decompression_English_Texts_Torture.tar.Nakamichi
Nakamichi 'Tengu', written by Kaze, based on Nobuo Ito's LZSS source,
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 359115942 bytes ...
RAM-to-RAM performance: 1728 MB/s.
Memory pool starting address: 0000000016394080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 50442 clocks or 10.394MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 16%
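By the way, the memcpy() yardstick printed above can be reproduced with a trivial harness along these lines (a sketch, not the author's benchmark code, under my assumption that the reported 'clocks' are plain clock() ticks):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLOCKSIZE (512ULL*1024*1024)  /* one 512MB block, copied 1024 times */

int main (void) {
    char *src = (char*)malloc(BLOCKSIZE);
    char *dst = (char*)malloc(BLOCKSIZE);
    if (!src || !dst) { puts("Not enough memory"); return 1; }
    memset(src, 0x55, BLOCKSIZE); memset(dst, 0xAA, BLOCKSIZE); /* touch all pages */
    clock_t t = clock();
    for (int i = 0; i < 1024; i++) memcpy(dst, src, BLOCKSIZE);
    t = clock() - t;
    printf("memcpy(): (512MB block); %lluMB copied in %ld clocks or %.3fMB per clock\n",
           1024*(BLOCKSIZE>>20), (long)t, (double)(1024*(BLOCKSIZE>>20))/(double)t);
    free(src); free(dst);
    return 0;
}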
>lz4 -9 -Sx -b -T1 DDETT-Definitive_Decompression_English_Texts_Torture.tar
Nb of threads = 1 ; Compression Level = 9
Not enough memory for 'DDETT-Definitive_Decompression_English_Texts_Torture.tar' full size;
testing 864 MB only...
Loading DDETT-Definitive_Decompression_English_Texts_Torture.tar...
DDETT-Definitiv : 905969664 -> 315340335 ( 34.81%), 22.6 MB/s , 2116.8 MB/s
>lz4 -9 -Sx -b DDETT-Definitive_Decompression_English_Texts_Torture.tar
Nb of threads = 4 ; Compression Level = 9
Not enough memory for 'DDETT-Definitive_Decompression_English_Texts_Torture.tar' full size;
testing 896 MB only...
Loading DDETT-Definitive_Decompression_English_Texts_Torture.tar...
DDETT-Definitiv : 939524096 -> 329287352 ( 35.05%), 85.9 MB/s , 8265.6 MB/s
Yann, the author of LZ4, wrote an almost perfect decompressor; AFAIU it utilizes 1/5 of the absolute maximum both single-threaded (2116/10394) and multi-threaded (8265/4????). AMAZING!
However, my word is for Nakamichi 'Loonette', featuring 8KB/2MB/512MB, that is (16-3)-bit/(24-3)-bit/(32-3)-bit, sliding windows with 2/3/4-byte-long offsets. Only match lengths of 12 (up to 16) and 8 are used.
So, this etude is all about writing a heavily optimized LZSS decompression function capable of extracting compressed English texts at INSANE speeds despite the latencies (and, for that matter, bandwidth) of today's RAM and CPU caches. Putting aside the boring (both speed-wise and size-wise) memory limitations, behold the simplicity of Loonette's decompression:
// Requires <stdint.h>, <string.h> and <xmmintrin.h>/<emmintrin.h> for the _N_prefetch/_N_XMM builds.
unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
    char* retLOCAL = ret;
    char* srcLOCAL = src;
    char* srcEndLOCAL = src+srcSize;
    unsigned int DWORDtrio;
    while (srcLOCAL < srcEndLOCAL) {
        DWORDtrio = *(unsigned int*)srcLOCAL;
#ifndef _N_GP
#ifdef _N_prefetch_4096
        _mm_prefetch((char*)(srcLOCAL + 64*64), _MM_HINT_T0);
#endif
#endif
        // The two lowest bits (plus one) give the token size in bytes (1..4);
        // zero out whatever was read beyond the token:
        DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
        if ( (DWORDtrio & 0x03) == 0x00 ) {
            // Literal run: (DWORDtrio>>3) bytes follow the 1-byte header;
            // 16 bytes are transferred blindly, the surplus is overwritten later.
#ifdef _N_GP
            memcpy(retLOCAL, (const char *)( (uint64_t)(srcLOCAL+1) ), 16);
#endif
#ifdef _N_XMM
            SlowCopy128bit( (const char *)( (uint64_t)(srcLOCAL+1) ), retLOCAL );
#endif
            retLOCAL+= (DWORDtrio>>3);
            srcLOCAL+= (DWORDtrio>>3)+1;
        } else {
            // Match: (DWORDtrio>>3) is the backward offset within the destination,
            // 13/21/29 bits wide for 2/3/4-byte tokens (the 8KB/2MB/512MB windows);
            // bit 2 selects the match length: 12 when clear, 8 when set.
#ifdef _N_GP
            memcpy(retLOCAL, (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>3)) ), 16);
#endif
#ifdef _N_XMM
            SlowCopy128bit( (const char *)( (uint64_t)(retLOCAL-(DWORDtrio>>3)) ), retLOCAL );
#endif
            srcLOCAL+= 1+(DWORDtrio&0x03); retLOCAL+= 12-(DWORDtrio&0x04);
        }
    }
    return (unsigned int)(retLOCAL - ret);
}
A hundred bytes of beauty!
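The helper SlowCopy128bit is not defined in the excerpt above; judging by the movdqu pair in the Assembly listing further below, a minimal sketch of it (an assumption matching the _N_XMM build, not the verbatim source) is a single unaligned 16-byte SSE2 load/store:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Note the (source, target) argument order used at the call sites above. */
void SlowCopy128bit (const char *SOURCEmem, char *TARGETmem) {
    _mm_storeu_si128( (__m128i *)TARGETmem, _mm_loadu_si128((const __m128i *)SOURCEmem) );
}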
And a quick showdown versus LZ4 (on my laptop's Core 2 Q9550s @2833MHz) to see how jammed the existing CPU-RAM subsystem is:
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>Nakamichi_Loonette_XMM_PREFETCH_4096.exe
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source,
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Usage: Nakamichi filename
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>Nakamichi_Loonette_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source,
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Compressing 33258496 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 4220
NumberOf(Short)Matches[Short]Window (8): 412312
NumberOf(Short)Matches[Medium]Window (8): 848071
NumberOf(Short)Matches[Big]Window (8): 353660
NumberOf(Long)Matches[Short]Window (12): 191049
NumberOf(Long)Matches[Medium]Window (12): 794178
NumberOf(Long)Matches[Big]Window (12): 595339
RAM-to-RAM performance: 1839 bytes/s.
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>dir
02/20/2015 06:09 AM 33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/13/2015 03:46 AM 13,365,137 Agatha_Christie_89-ebooks_TXT.tar.lz4
02/20/2015 05:37 PM 11,745,883 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/20/2015 03:06 AM 86,636 Nakamichi_Loonette.c
02/20/2015 03:07 AM 445,206 Nakamichi_Loonette_XMM_PREFETCH_4096.cod
02/20/2015 12:21 PM 117,248 Nakamichi_Loonette_XMM_PREFETCH_4096.exe
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>Nakamichi_Loonette_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Loonette', written by Kaze, based on Nobuo Ito's LZSS source,
babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 11745883 bytes ...
RAM-to-RAM performance: 256 MB/s.
Memory pool starting address: 0000000001040080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 192239 clocks or 2.727MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 9%
D:\_KAZE\EBOOKs_for_benching\Nakamichi_Loonette_Intel15>lz4 -9 -Sx -b -T1 Agatha_Christie_89-ebooks_TXT.tar
Nb of threads = 1 ; Compression Level = 9
Agatha_Christie : 33258496 -> 13365090 ( 40.19%), 12.7 MB/s , 909.2 MB/s
Nakamichi 'Loonette' has its own homepage: http://www.sanmayce.com/Hayabusa/.
Points of Interest
I hope someone is as passionate as I am and improves on it; that's the name of the game: 'Reaching for the stars.'
The French artist Rachelle Bartel, a.k.a. Lillycat, created this soulful piece of art.
Add-on from 2015-Feb-23:
I am happy to share an even more refined 'Loonette', called 'Loonette-Hoshimi'.
Since Agatha Christie is the queen of English prose, quite naturally her complete works are the best test dataset for English-text compression.
The Guinness Book of World Records lists Christie as the best-selling novelist of all time. Her novels have sold roughly 2 billion copies ... /Wikipedia/
D:\Nakamichi_Loonette_Intel15_Hoshimi>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Compressing 33258496 bytes ...
\; Each rotation means 64KB are encoded; Done 100%
NumberOfFullLiterals (lower-the-better): 171
Legend: MatchLengths: 4=Tiny, 8=Short, 12=Medium, 16=Long; WindowSizes: 2/3/4=Short/Medium/Long
NumberOf(Tiny)Matches[Short]Window (4)[2]: 325551
NumberOf(Short)Matches[Short]Window (8)[2]: 240654
NumberOf(Medium)Matches[Short]Window (12)[2]: 80607
NumberOf(Long)Matches[Short]Window (16)[2]: 52966
NumberOf(Tiny)Matches[Medium]Window (4)[3]: 382909
NumberOf(Short)Matches[Medium]Window (8)[3]: 677654
NumberOf(Medium)Matches[Medium]Window (12)[3]: 440648
NumberOf(Long)Matches[Medium]Window (16)[3]: 239806
NumberOf(Short)Matches[Long]Window (8)[4]: 272624
NumberOf(Medium)Matches[Long]Window (12)[4]: 461732
NumberOf(Long)Matches[Long]Window (16)[4]: 266965
RAM-to-RAM performance: 1249 bytes/s.
D:\Nakamichi_Loonette_Intel15_Hoshimi>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe Agatha_Christie_89-ebooks_TXT.tar.Nakamichi /test
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 10854012 bytes ...
RAM-to-RAM performance: 320 MB/s.
Memory pool starting address: 0000000000E60080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 194229 clocks or 2.699MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 11%
D:\Nakamichi_Loonette_Intel15_Hoshimi>7za.exe a -tzip -mx9 Agatha_Christie_89-ebooks_TXT.tar.zip Agatha_Christie_89-ebooks_TXT.tar
D:\Nakamichi_Loonette_Intel15_Hoshimi>tangelo_32bit.exe c Agatha_Christie_89-ebooks_TXT.tar Agatha_Christie_89-ebooks_TXT.tar.tangelo
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method0 Agatha_Christie_89-ebooks_TXT.tar -method 0
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method1 Agatha_Christie_89-ebooks_TXT.tar -method 1
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method2 Agatha_Christie_89-ebooks_TXT.tar -method 2
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method3 Agatha_Christie_89-ebooks_TXT.tar -method 3
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method4 Agatha_Christie_89-ebooks_TXT.tar -method 4
D:\Nakamichi_Loonette_Intel15_Hoshimi>zpaq64.exe add Agatha_Christie_89-ebooks_TXT.tar.method5 Agatha_Christie_89-ebooks_TXT.tar -method 5
D:\Nakamichi_Loonette_Intel15_Hoshimi>dir
02/23/2015 03:33 PM 33,258,496 Agatha_Christie_89-ebooks_TXT.tar
02/23/2015 03:52 PM 33,277,230 Agatha_Christie_89-ebooks_TXT.tar.method0.zpaq
02/23/2015 03:52 PM 11,709,902 Agatha_Christie_89-ebooks_TXT.tar.method1.zpaq
02/23/2015 03:52 PM 9,557,649 Agatha_Christie_89-ebooks_TXT.tar.method2.zpaq
02/23/2015 03:53 PM 6,462,103 Agatha_Christie_89-ebooks_TXT.tar.method3.zpaq
02/23/2015 03:54 PM 6,387,670 Agatha_Christie_89-ebooks_TXT.tar.method4.zpaq
02/23/2015 03:56 PM 6,036,509 Agatha_Christie_89-ebooks_TXT.tar.method5.zpaq
02/23/2015 03:03 PM 10,854,012 Agatha_Christie_89-ebooks_TXT.tar.Nakamichi
02/23/2015 03:10 PM 6,475,761 Agatha_Christie_89-ebooks_TXT.tar.tangelo
02/23/2015 03:12 PM 11,173,343 Agatha_Christie_89-ebooks_TXT.tar.zip
02/23/2015 05:51 AM 99,469 Nakamichi_Loonette_Hoshimi.c
02/23/2015 05:51 AM 771,399 Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.cod
02/23/2015 05:51 AM 118,784 Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe
D:\Nakamichi_Loonette_Intel15_Hoshimi>
Along with strengthening the compression ratio, I was lucky to boost decompression speed and reduce the code by 2 bytes:
; Nakamichi 'Loonette-Hoshimi' decompression loop, 74-14+2=98 bytes long:
; mark_description "Intel(R) C++ Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.0.108 Build 20140";
; mark_description "-O3 -QxSSE2 -D_N_XMM -D_N_prefetch_4096 -FAcs";
.B7.3::
00014 41 0f 18 8a 00
10 00 00 prefetcht0 BYTE PTR [4096+r10]
0001c b8 ff ff ff ff mov eax, -1
00021 41 8b 12 mov edx, DWORD PTR [r10]
00024 89 d1 mov ecx, edx
00026 83 f1 03 xor ecx, 3
00029 c1 e1 03 shl ecx, 3
0002c d3 e8 shr eax, cl
0002e 23 d0 and edx, eax
00030 89 d0 mov eax, edx
00032 83 e0 03 and eax, 3
00035 75 18 jne .B7.5
.B7.4::
00037 f3 41 0f 6f 42
01 movdqu xmm0, XMMWORD PTR [1+r10]
0003d c1 ea 04 shr edx, 4
00040 f3 41 0f 7f 01 movdqu XMMWORD PTR [r9], xmm0
00045 4c 03 ca add r9, rdx
00048 ff c2 inc edx
0004a 4c 03 d2 add r10, rdx
0004d eb 22 jmp .B7.6
.B7.5::
0004f 89 d1 mov ecx, edx
00051 83 e2 0c and edx, 12
00054 c1 e9 04 shr ecx, 4
00057 ff c0 inc eax
00059 83 c2 04 add edx, 4
0005c 48 f7 d9 neg rcx
0005f 49 03 c9 add rcx, r9
00062 4c 03 d0 add r10, rax
00065 f3 0f 6f 01 movdqu xmm0, XMMWORD PTR [rcx]
00069 f3 41 0f 7f 01 movdqu XMMWORD PTR [r9], xmm0
0006e 4c 03 ca add r9, rdx
.B7.6::
00071 4d 3b d0 cmp r10, r8
00074 72 9e jb .B7.3
XMM transfers are slow on Core 2: they are unaligned, and the penalties are high, especially when RAM is accessed. So on 3rd/4th/5th-generation Intel CPUs, Nakamichi 'Loonette-Hoshimi' will surpass 1GB/s easily. Below is my attempt to put Nakamichi 'Loonette-Hoshimi' on the map, or rather to juxtapose it with one of the beasts in compression.
Notes on the methods of the superb ZPAQ (written by Dr. Matt Mahoney and released to the public domain):
Method 0 is not just like tar where the files are concatenated together with some headers in between. It is still the zpaq archive format, where the files are fragmented and deduplicated and packed into blocks. The blocks are stored in non context modeled format, which means split into smaller blocks (64K) with 4 byte headers to indicate their size. There are separate blocks storing fragment SHA-1 hashes and file names. So there is more overhead than with tar but sometimes some savings if you have duplicate files or fragments.
Method 1 uses LZ77, compressing by replacing duplicate strings with pointers to previous occurrences.
Method 2 is the same but spends more time looking for better matches (using a suffix array instead of a hash table).
Method 3 uses either BWT (context sorting) or LZ77 for long matches and an order 1 context model and arithmetic coding for literals depending on the file type.
Method 4 either uses LZ77, BWT or a high order context model.
Method 5 uses a complex, high order context mixing model with over 20 bit prediction components.
Also, Dr. Mahoney wrote a very cute tool called 'fv' that draws a very informative image of the structure of a given file, letting one see the hotspots of string repetitions. Legend (how to interpret the image):
The following diagrams show the distribution of string matches of length 1 (black), 2 (red), 4 (green), and 8 (blue). The horizontal axis represents the position in the file. The vertical axis shows the distance backwards to the previous match on a logarithmic scale. The major tick marks reading upwards are 1, 10, 100, 1000, etc.
I will try to interpret three different files: the original, the Hoshimi archive, and the strongest ZPAQ archive.
First 'fv-Agatha_Christie_89-ebooks_TXT.tar.bmp':
The blue band at the top shows that matches of length 8 are most often separated by at least 10^4 bytes (4 major tick marks) up to the entire length of the file.
More precisely, 3x10^4=30,000.
The red band shows that matches of length 2 are separated by about 4x10 to 4x100 bytes.
Second 'fv-Agatha_Christie_89-ebooks_TXT.tar.Nakamichi.bmp':
A good compressor should erase all the blues and greens from the picture; here 'Hoshimi' failed to compress some vague remnants of green (4-byte-long matches) in the area between 1x10^6 and 1x10^7, that is, there are some uncompressed 4-byte chunks 1MB..10MB apart from each other.
The red band shows that matches of length 2 are separated by about 2x10^3 to 5x10^5 bytes.
Vertical stripes with no color are 'problems': in those parts of the file the compressor was blinded by something and couldn't see any correlations.
Third 'fv-Agatha_Christie_89-ebooks_TXT.tar.method5.zpaq.bmp':
The red band shows that matches of length 2 are separated by about 3x10^3 to 3x10^5 bytes.
The literal translation of Hoshimi (星見) is 'looking at stars', or more poetically, 'stargazing'.
Hanami (花見, lit. "flower viewing") is the Japanese traditional custom of enjoying the transient beauty of flowers, ...
Also, in Buddhism moongazing (月見 - tsukimi) is of great importance, and there is a sungazing (? - ?) branch of Yoga called Surya Yoga.
If a muse comes to me, the next variant will be called 'Tsukimi'.
Add-on from 2015-Mar-23:
I didn't like the 'if-else' branching, so 'Loonette-Hoshimi' became branchless; sadly, the code size jumped from 98 to 125 bytes. As for the decompression speed, the Q9550s disappoints (see the 'enwik8' test further below), but I don't mind the penalties; the etude is just fine.
// Requires <stdint.h>, <string.h> and <emmintrin.h> for the _N_XMM build.
unsigned int Decompress (char* ret, char* src, unsigned int srcSize) {
    char* retLOCAL = ret;
    char* srcLOCAL = src;
    char* srcEndLOCAL = src+srcSize;
    unsigned int DWORDtrio;
    unsigned int Flag;
    uint64_t FlagMASK; uint64_t FlagMASKnegated;
    while (srcLOCAL < srcEndLOCAL) {
        DWORDtrio = *(unsigned int*)srcLOCAL;
        DWORDtrio = DWORDtrio&( 0xFFFFFFFF >> ((3-(DWORDtrio & 0x03))<<3) );
        // Flag is 1 for a literal run, 0 for a match; it is turned into a pair of
        // all-ones/all-zeros masks so both cases blend without a jump.
        // (The cast makes the all-ones mask span the full 64-bit pointer width;
        // without it the subtraction is done in 32 bits.)
        Flag=!(DWORDtrio & 0x03);
        FlagMASKnegated= (uint64_t)Flag - 1; FlagMASK= ~FlagMASKnegated;
        // Transfer source: srcLOCAL+1 for a literal run, retLOCAL-(DWORDtrio>>4) for a match:
#ifdef _N_XMM
        SlowCopy128bit( (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), retLOCAL);
#endif
#ifdef _N_GP
        memcpy(retLOCAL, (const char *)( 1+ ((uint64_t)(srcLOCAL)&FlagMASK) + ((uint64_t)(retLOCAL-(DWORDtrio>>4)-1)&FlagMASKnegated) ), 16);
#endif
        // Strides, blended the same way; the match length is (1+((DWORDtrio>>2)&0x03))<<2,
        // i.e. 4/8/12/16 - Tiny/Short/Medium/Long per the legend above:
        retLOCAL+= ((uint64_t)((DWORDtrio>>4))&FlagMASK) + ((uint64_t)((1+((DWORDtrio>>2)&0x03))<<2)&FlagMASKnegated) ;
        srcLOCAL+= ((uint64_t)((DWORDtrio>>4)+1)&FlagMASK) + ((uint64_t)(1+(DWORDtrio&0x03))&FlagMASKnegated) ;
    }
    return (unsigned int)(retLOCAL - ret);
}
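For the curious, here is the branchless-select trick in isolation (a generic sketch with a hypothetical name, not Nakamichi-specific): a 0/1 flag becomes an all-zeros or all-ones mask, and two candidates are blended arithmetically instead of via a jump.

#include <stdint.h>

/* Returns a when flag==1, b when flag==0 - no branch involved. */
static uint64_t Select64 (unsigned int flag, uint64_t a, uint64_t b) {
    uint64_t maskB = (uint64_t)flag - 1;  /* flag==1 -> 0x00...00; flag==0 -> 0xFF...FF */
    uint64_t maskA = ~maskB;
    return (a & maskA) + (b & maskB);     /* exactly one term survives */
}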
The 'enwik8' test, done on the Core 2 Q9550s:
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>Nakamichi_Loonette-Hoshimi_branchless.exe enwik8.Nakamichi /bench
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 192 MB/s.
Memory pool starting address: 0000000002620080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193784 clocks or 2.706MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 7%
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>\sha1sum.exe enwik8
57b8363b814821dc9d47aa4d41f58733519076b2 enwik8
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>Nakamichi_Loonette_Hoshimi_XMM_PREFETCH_4096.exe enwik8.Nakamichi /bench
Nakamichi 'Loonette-Hoshimi', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Decompressing 34968896 bytes ...
RAM-to-RAM performance: 256 MB/s.
Memory pool starting address: 0000000002710080 ... 64 byte aligned, OK
Copying a 512MB block 1024 times i.e. 512GB READ + 512GB WRITTEN ...
memcpy(): (512MB block); 524288MB copied in 193453 clocks or 2.710MB per clock
RAM-to-RAM performance vs memcpy() ratio (bigger-the-better): 9%
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>\sha1sum.exe enwik8
57b8363b814821dc9d47aa4d41f58733519076b2 enwik8
D:\_KAZE\Nakamichi_Loonette_Hoshimi_vs_enwik8>
I don't believe I can refine it further.
History
The first Nakamichi was written a year ago; as of now, 2015-Feb-20, 'Loonette' is the latest and most refined.