Benchmarking the basic task of finding a needle in a haystack, an overview, showcasing (scalar and vector, side-by-side) superfast memmem() functions written in C.
Introduction
It's vector time.
It is time to benchmark the basic task of finding a needle in a haystack, using vectors.
Having implemented a simplistic vector memmem()
etude, glad to share it. Showcasing (scalar and vector, side-by-side) superfast memmem()
functions written in C.
The console tool attached here is only a wrapper around these functions, being a skeleton itself, NyoTengu serves well as a blueprint (starting point) for further development of search tools - with more functionality.
Background
The theme is all over the books and what not, yet, exact searching is not optimized, it is an ugly fact, I am not aware of function (let alone tool) working at 10GB/s rates on modern machines, see below how my mainstream laptop executes these shared C etudes - compiled with GCC and ICL.
Full-Text Exact Showdown — Fastest GREP vs NyoTengu
The testmachine: laptop running Windows 10, i5-7200U @2.5GHz, 8GB DDR4 2133MHz:
Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle: "Linus Torvalds"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle1.txt | 11,714,194,285 B/s; 0.084 seconds | 602 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle1.txt | 11,441,771,162 B/s; 0.086 seconds | 602 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "Linus Torvalds" li...tar | N.A.; 0.256 seconds | 602 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle: "5.8.5"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle2.txt | 9,553,323,495 B/s; 0.103 seconds | 74028 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle2.txt | 9,461,464,615 B/s; 0.104 seconds | 74028 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "5.8.5" li...tar | N.A.; 0.212 seconds | 74028 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: enwiki-20201001-all-titles (1,193,068,592 bytes)
Needle: "grep"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe enwiki-20201001-all-titles grep.txt | 12,051,197,898 B/s; 0.099 seconds | 173 |
| NyoTengu_YMM_GCC730_64bit.exe enwiki-20201001-all-titles grep.txt | 11,583,190,213 B/s; 0.103 seconds | 173 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "grep" enwiki-2...titles | N.A.; 0.378 seconds | 173 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: gcc-7.3.0.tar (615,567,360 bytes)
Needle: "SIMD"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe gcc-7.3.0.tar SIMD.txt | 11,837,833,846 B/s; 0.052 seconds | 3606 |
| NyoTengu_YMM_GCC730_64bit.exe gcc-7.3.0.tar SIMD.txt | 11,192,133,818 B/s; 0.055 seconds | 3606 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "SIMD" gcc-7.3.0.tar | N.A.; 0.121 seconds | 3606 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle: "ACGT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 4.txt | 3,508,357,471 B/s; 0.956 seconds | 1607850 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 4.txt | 4,651,858,173 B/s; 0.721 seconds | 1607850 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "ACGT" "SCB_DNA...-28" | N.A.; 5.701 seconds | 1607850 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle: "AACCGGTT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher | Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 8.txt | 2,187,860,236 B/s; 1.533 seconds | 2723 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 8.txt | 1,968,303,839 B/s; 1.704 seconds | 2723 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "AACCGGTT" "SCB_...-28" | N.A.; 5.320 seconds | 2723 |
+------------------------------------------------------------------------------+-----------------------------------+---------+
In order to reproduce the benchmark, here is the package (except the four corpora):
The four corpora are downloadable outside the package:
- https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.8.5.tar.xz
- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz
- https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-all-titles.gz
- https://ftp.gnu.org/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz
Note: The fastest GREP - RIPGREP (https://github.com/BurntSushi/ripgrep) received 22,000 stars on GitHub, as far as I see, it is second only to the Linux kernel with 99,000 stars!
Verdict: The current NyoTengu is 2x-5x faster than current RIPGREP, this is the 256bit world, who knows what will happen in the 512bit one...
I intend to replace the memcmp()
instances with Railgun_Trolldom_64
's comparator (not using external code)...
I believe, Railgun_Nyotengu_XMM_YMM_ZMM()
is currently the FASTEST memmem()
available, will be glad to be proven wrong. :P
Using the Code
The attached NyoTengu.c is 3022 lines long, it is compilable as follows:
On Windows, with ICL:
icl /O3 Nyotengu.c -DXMMtengu -D_WIN32_ENVIRONMENT_
-D_N_HIGH_PRIORITY /FeNyotengu_XMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DYMMtengu -D_WIN32_ENVIRONMENT_
-D_N_HIGH_PRIORITY /FeNyotengu_YMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DZMMtengu -D_WIN32_ENVIRONMENT_
-D_N_HIGH_PRIORITY /FeNyotengu_ZMM_IntelV150_32bit /FAcs
On *nix, with GCC:
gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC_64bit.elf
-DYMMtengu -D_POSIX_ENVIRONMENT_
The following snippet was generated by GCC 7.3.0 in this way:
gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.exe
-DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY
gcc -O3 -mavx2 -fomit-frame-pointer -S NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.asm
-DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY
so, the whole ASM fragment:
.seh_proc Railgun_Nyotengu_XMM_YMM_ZMM
Railgun_Nyotengu_XMM_YMM_ZMM:
pushq %r15
.seh_pushreg %r15
pushq %r14
.seh_pushreg %r14
pushq %r13
.seh_pushreg %r13
pushq %r12
.seh_pushreg %r12
pushq %rbp
.seh_pushreg %rbp
pushq %rdi
.seh_pushreg %rdi
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
subq $120, %rsp
.seh_stackalloc 120
.seh_endprologue
movl %r8d, %r13d
movq %rcx, %rsi
movq %rdx, %r10
leaq (%rcx,%r13), %r15
cmpl %r9d, %r8d
jb .L293
movl %r9d, %ebp
cmpl $3, %r9d
jbe .L321
movl (%rdx), %eax
movl %eax, 80(%rsp)
cmpl $127, %r8d
ja .L269
movl %eax, %edi
movl %eax, %r13d
movl %eax, %r14d
movzbl %al, %esi
sall $24, %edi
leal -1(%r9), %r9d
sall $8, %r13d
leaq (%rcx,%rbp), %rdx
sall $16, %r14d
movl %edi, 72(%rsp)
movslq %r9d, %rdi
movzwl %r13w, %r13d
andl $16711680, %r14d
negq %rbp
movq %rdi, 80(%rsp)
movl %eax, %r12d
movl %esi, 32(%rsp)
jmp .L275
.p2align 4,,10
.L270:
movl %eax, %r8d
movl $1, %ecx
andl $65280, %r8d
cmpl %r13d, %r8d
je .L274
movl %eax, %r8d
movl $2, %ecx
andl $16711680, %r8d
cmpl %r14d, %r8d
je .L274
xorl %ecx, %ecx
andl $-16777216, %eax
cmpl 72(%rsp), %eax
setne %cl
addq $3, %rcx
.L274:
addq %rcx, %rdx
cmpq %rdx, %r15
jb .L293
.L275:
leaq (%rdx,%rbp), %rbx
movl (%rbx), %eax
cmpl %r12d, %eax
jne .L270
movq %rdx, %r8
movl %r9d, %eax
subq 80(%rsp), %r8
movl $1, %ecx
xorl %edi, %edi
jmp .L271
.p2align 4,,10
.L273:
leal (%rax,%rdi), %esi
cmpl %esi, %r9d
jne .L272
cmpl %r11d, 32(%rsp)
setne %r11b
movzbl %r11b, %r11d
addl %r11d, %edi
.L272:
addl $1, %ecx
addq $1, %r8
subl $1, %eax
je .L319
.L271:
movl %ecx, %r11d
movsbl (%r10,%r11), %r11d
cmpb (%r8), %r11b
je .L273
leal 1(%rdi), %ecx
addq %rcx, %rdx
cmpq %rdx, %r15
jnb .L275
.p2align 4,,10
.L293:
xorl %ebx, %ebx
.L319:
movq %rbx, %rax
addq $120, %rsp
popq %rbx
popq %rsi
popq %rdi
popq %rbp
popq %r12
popq %r13
popq %r14
popq %r15
ret
.p2align 4,,10
.L321:
leal -1(%r9), %eax
movsbl (%rdx), %ebx
addq %rbp, %rsi
movsbl (%rdx,%rax), %eax
sall $8, %ebx
addl %eax, %ebx
movzbl %bh, %edi
cmpl $3, %r9d
je .L265
.p2align 4,,10
.L261:
movsbl -2(%rsi), %eax
movsbl -1(%rsi), %ecx
sall $8, %eax
addl %ecx, %eax
cmpl %eax, %ebx
je .L322
leaq 1(%rsi), %rax
cmpb %dil, %cl
je .L267
addq $2, %rsi
cmpq %rsi, %r15
jnb .L261
jmp .L293
.p2align 4,,10
.L324:
cmpb %cl, 1(%r10)
je .L323
leaq 1(%rsi), %rax
cmpb %cl, %dil
je .L286
.L325:
movq %rsi, %rax
xorl %esi, %esi
cmpb %dil, %dl
setne %sil
leaq 2(%rsi,%rax), %rsi
.L263:
cmpq %rsi, %r15
jb .L293
.L265:
movsbl -3(%rsi), %eax
movsbl -1(%rsi), %r8d
movzbl -2(%rsi), %ecx
sall $8, %eax
movl %r8d, %edx
addl %r8d, %eax
cmpl %eax, %ebx
je .L324
leaq 1(%rsi), %rax
cmpb %cl, %dil
jne .L325
.L286:
movq %rax, %rsi
jmp .L263
.p2align 4,,10
.L267:
cmpq %rax, %r15
jb .L293
movq %rax, %rsi
jmp .L261
.p2align 4,,10
.L269:
shrl $5, %r8d
leaq 32(%rcx), %rbx
vpbroadcastd 80(%rsp), %ymm4
movq %r15, 104(%rsp)
leal -1(%r8), %r14d
movq %rbx, 88(%rsp)
leaq 4096(%rcx), %rax
movl %r9d, 216(%rsp)
salq $5, %r14
movq %rax, %r12
movq %r14, 72(%rsp)
xorl %r14d, %r14d
movq %r13, 96(%rsp)
movq %rdx, %r13
vmovdqu %ymm4, 32(%rsp)
.L278:
vmovdqu 32(%rsp), %ymm3
prefetcht0 (%r12)
vpcmpeqd 1(%rsi,%r14), %ymm3, %ymm0
vpcmpeqd (%rsi,%r14), %ymm3, %ymm1
vpcmpeqd 3(%rsi,%r14), %ymm3, %ymm2
vpor %ymm0, %ymm1, %ymm1
vpcmpeqd 2(%rsi,%r14), %ymm3, %ymm0
vpor %ymm0, %ymm2, %ymm0
vpor %ymm0, %ymm1, %ymm0
vmovmskps %ymm0, %eax
testl %eax, %eax
jne .L326
.L276:
addq $32, %r14
cmpq 72(%rsp), %r14
jb .L278
movq %r13, %r10
movq 96(%rsp), %r13
movq 104(%rsp), %r15
movl 216(%rsp), %r12d
subq %r14, %r13
cmpq %r13, %rbp
ja .L294
movl 80(%rsp), %ebx
leal -1(%r12), %r9d
leaq (%r14,%rbp), %rax
negq %rbp
addq %rsi, %rax
movzbl %bl, %edi
movl %ebx, %edx
movl %ebx, %r13d
movl %ebx, %r12d
movl %edi, 32(%rsp)
sall $8, %edx
movl %ebx, %edi
sall $16, %r13d
sall $24, %edi
movzwl %dx, %r14d
andl $16711680, %r13d
movl %edi, 72(%rsp)
movslq %r9d, %rdi
movq %rdi, 80(%rsp)
jmp .L284
.p2align 4,,10
.L279:
movl %edx, %r8d
movl $1, %ecx
andl $65280, %r8d
cmpl %r14d, %r8d
je .L283
movl %edx, %r8d
movl $2, %ecx
andl $16711680, %r8d
cmpl %r13d, %r8d
je .L283
xorl %ecx, %ecx
andl $-16777216, %edx
cmpl 72(%rsp), %edx
setne %cl
addq $3, %rcx
.L283:
addq %rcx, %rax
cmpq %rax, %r15
jb .L294
.L284:
leaq (%rax,%rbp), %rbx
movl (%rbx), %edx
cmpl %r12d, %edx
jne .L279
movq %rax, %r8
movl %r9d, %edx
subq 80(%rsp), %r8
movl $1, %ecx
xorl %edi, %edi
.p2align 4,,10
.L280:
movl %ecx, %r11d
movsbl (%r10,%r11), %r11d
cmpb (%r8), %r11b
jne .L327
leal (%rdx,%rdi), %esi
cmpl %esi, %r9d
jne .L281
cmpl %r11d, 32(%rsp)
setne %r11b
movzbl %r11b, %r11d
addl %r11d, %edi
.L281:
addl $1, %ecx
addq $1, %r8
subl $1, %edx
jne .L280
vzeroupper
jmp .L319
.p2align 4,,10
.L326:
movq 88(%rsp), %rax
leaq (%r14,%rsi), %r15
leaq (%rax,%r14), %rdi
vzeroupper
jmp .L277
.p2align 4,,10
.L328:
addq $1, %r15
cmpq %r15, %rdi
je .L276
.L277:
movq %rbp, %r8
movq %r13, %rdx
movq %r15, %rcx
movq %r15, %rbx
call memcmp
testl %eax, %eax
jne .L328
jmp .L319
.p2align 4,,10
.L322:
leaq -2(%rsi), %rbx
jmp .L319
.p2align 4,,10
.L327:
leal 1(%rdi), %ecx
jmp .L283
.p2align 4,,10
.L323:
leaq -3(%rsi), %rbx
jmp .L319
.p2align 4,,10
.L294:
xorl %ebx, %ebx
vzeroupper
jmp .L319
.seh_endproc
My wish is for other coders to refine it even further, having a toy to play with is something that quickly brings results.
Points of Interest
The performance of the AVX512 tool yet remains to be seen, if a fellow member can run the benchmark on a CPU supporting it, please do and share the output here in Comments Section below.
History
- 17th October, 2020: Finished initial version