Fastest Fulltext (Vector&Scalar) Exact Searcher?!

Sanmayce

4.33/5 (3 votes)

21 Oct 2020Public Domain2 min read

19.1K

209

A fulltext CLI tool reporting number of exact matches, FAST!

Benchmarking the basic task of finding a needle in a haystack, an overview, showcasing (scalar and vector, side-by-side) superfast memmem() functions written in C.

Download source - 28.8 KB

Introduction

It's vector time.
It is time to benchmark the basic task of finding a needle in a haystack, using vectors.

Having implemented a simplistic vector memmem() etude, glad to share it. Showcasing (scalar and vector, side-by-side) superfast memmem() functions written in C.

The console tool attached here is only a wrapper around these functions, being a skeleton itself, NyoTengu serves well as a blueprint (starting point) for further development of search tools - with more functionality.

Background

The theme is all over the books and what not, yet, exact searching is not optimized, it is an ugly fact, I am not aware of function (let alone tool) working at 10GB/s rates on modern machines, see below how my mainstream laptop executes these shared C etudes - compiled with GCC and ICL.

Full-Text Exact Showdown — Fastest GREP vs NyoTengu

The testmachine: laptop running Windows 10, i5-7200U @2.5GHz, 8GB DDR4 2133MHz:

Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle:   "Linus Torvalds"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle1.txt                 | 11,714,194,285 B/s; 0.084 seconds |     602 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle1.txt                    | 11,441,771,162 B/s; 0.086 seconds |     602 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "Linus Torvalds" li...tar |               N.A.; 0.256 seconds |     602 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: linux-5.8.5.tar (983,992,320 bytes)
Needle:   "5.8.5"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe linux-5.8.5.tar needle2.txt                 |  9,553,323,495 B/s; 0.103 seconds |   74028 |
| NyoTengu_YMM_GCC730_64bit.exe linux-5.8.5.tar needle2.txt                    |  9,461,464,615 B/s; 0.104 seconds |   74028 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "5.8.5" li...tar          |               N.A.; 0.212 seconds |   74028 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: enwiki-20201001-all-titles (1,193,068,592 bytes)
Needle:   "grep"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe enwiki-20201001-all-titles grep.txt         | 12,051,197,898 B/s; 0.099 seconds |     173 |
| NyoTengu_YMM_GCC730_64bit.exe enwiki-20201001-all-titles grep.txt            | 11,583,190,213 B/s; 0.103 seconds |     173 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "grep" enwiki-2...titles  |               N.A.; 0.378 seconds |     173 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: gcc-7.3.0.tar (615,567,360 bytes)
Needle:   "SIMD"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe gcc-7.3.0.tar SIMD.txt                      | 11,837,833,846 B/s; 0.052 seconds |    3606 |
| NyoTengu_YMM_GCC730_64bit.exe gcc-7.3.0.tar SIMD.txt                         | 11,192,133,818 B/s; 0.055 seconds |    3606 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "SIMD" gcc-7.3.0.tar      |               N.A.; 0.121 seconds |    3606 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle:   "ACGT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 4.txt               |  3,508,357,471 B/s; 0.956 seconds | 1607850 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 4.txt                  |  4,651,858,173 B/s; 0.721 seconds | 1607850 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "ACGT" "SCB_DNA...-28"    |               N.A.; 5.701 seconds | 1607850 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

Haystack: SCB_DNA-Genome-Homo-sapiens-GCA_000001405.28-2019-02-28 (3,353,989,743 bytes)
Needle:   "AACCGGTT"
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Searcher                                                                     |          Performance; Search Time | Matches |
+------------------------------------------------------------------------------+-----------------------------------+---------+
| Nyotengu_YMM_IntelV150_64bit.exe "SCB_DNA...-2019-02-28" 8.txt               |  2,187,860,236 B/s; 1.533 seconds |    2723 |
| NyoTengu_YMM_GCC730_64bit.exe "SCB_DNA...-2019-02-28" 8.txt                  |  1,968,303,839 B/s; 1.704 seconds |    2723 |
| ripgrep-12.1.1-x86_64-pc-windows-gnu -c -F --stats "AACCGGTT" "SCB_...-28"   |               N.A.; 5.320 seconds |    2723 |
+------------------------------------------------------------------------------+-----------------------------------+---------+

In order to reproduce the benchmark, here is the package (except the four corpora):

www.sanmayce.com/Railgun/Benchmark_SRC-(Linus-Torvalds)_DNA-(HUMAN-GENOME)_TXT-(enwiki-all-titles)_unfinished_Nyotengu.zip

The four corpora are downloadable outside the package:

https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.8.5.tar.xz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz
https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-all-titles.gz
https://ftp.gnu.org/gnu/gcc/gcc-7.3.0/gcc-7.3.0.tar.xz

Note: The fastest GREP - RIPGREP (https://github.com/BurntSushi/ripgrep) received 22,000 stars on GitHub, as far as I see, it is second only to the Linux kernel with 99,000 stars!

Verdict: The current NyoTengu is 2x-5x faster than current RIPGREP, this is the 256bit world, who knows what will happen in the 512bit one...

I intend to replace the memcmp() instances with Railgun_Trolldom_64's comparator (not using external code)...

I believe, Railgun_Nyotengu_XMM_YMM_ZMM() is currently the FASTEST memmem() available, will be glad to be proven wrong. :P

Using the Code

The attached NyoTengu.c is 3022 lines long, it is compilable as follows:

On Windows, with ICL:

icl /O3 Nyotengu.c -DXMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_XMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DYMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_YMM_IntelV150_32bit /FAcs
icl /O3 Nyotengu.c -DZMMtengu -D_WIN32_ENVIRONMENT_ 
        -D_N_HIGH_PRIORITY /FeNyotengu_ZMM_IntelV150_32bit /FAcs

On *nix, with GCC:

gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC_64bit.elf 
        -DYMMtengu -D_POSIX_ENVIRONMENT_

The following snippet was generated by GCC 7.3.0 in this way:

gcc -O3 -mavx2 -fomit-frame-pointer NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.exe 
        -DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY
gcc -O3 -mavx2 -fomit-frame-pointer -S NyoTengu.c -o NyoTengu_YMM_GCC730_64bit.asm 
        -DYMMtengu -D_WIN32_ENVIRONMENT_ -D_N_HIGH_PRIORITY

so, the whole ASM fragment:

ASM

    .seh_proc    Railgun_Nyotengu_XMM_YMM_ZMM
Railgun_Nyotengu_XMM_YMM_ZMM:
    pushq    %r15
    .seh_pushreg    %r15
    pushq    %r14
    .seh_pushreg    %r14
    pushq    %r13
    .seh_pushreg    %r13
    pushq    %r12
    .seh_pushreg    %r12
    pushq    %rbp
    .seh_pushreg    %rbp
    pushq    %rdi
    .seh_pushreg    %rdi
    pushq    %rsi
    .seh_pushreg    %rsi
    pushq    %rbx
    .seh_pushreg    %rbx
    subq    $120, %rsp
    .seh_stackalloc    120
    .seh_endprologue
    movl    %r8d, %r13d
    movq    %rcx, %rsi
    movq    %rdx, %r10
    leaq    (%rcx,%r13), %r15
    cmpl    %r9d, %r8d
    jb    .L293
    movl    %r9d, %ebp
    cmpl    $3, %r9d
    jbe    .L321
    movl    (%rdx), %eax
    movl    %eax, 80(%rsp)
    cmpl    $127, %r8d
    ja    .L269
    movl    %eax, %edi
    movl    %eax, %r13d
    movl    %eax, %r14d
    movzbl    %al, %esi
    sall    $24, %edi
    leal    -1(%r9), %r9d
    sall    $8, %r13d
    leaq    (%rcx,%rbp), %rdx
    sall    $16, %r14d
    movl    %edi, 72(%rsp)
    movslq    %r9d, %rdi
    movzwl    %r13w, %r13d
    andl    $16711680, %r14d
    negq    %rbp
    movq    %rdi, 80(%rsp)
    movl    %eax, %r12d
    movl    %esi, 32(%rsp)
    jmp    .L275
    .p2align 4,,10
.L270:
    movl    %eax, %r8d
    movl    $1, %ecx
    andl    $65280, %r8d
    cmpl    %r13d, %r8d
    je    .L274
    movl    %eax, %r8d
    movl    $2, %ecx
    andl    $16711680, %r8d
    cmpl    %r14d, %r8d
    je    .L274
    xorl    %ecx, %ecx
    andl    $-16777216, %eax
    cmpl    72(%rsp), %eax
    setne    %cl
    addq    $3, %rcx
.L274:
    addq    %rcx, %rdx
    cmpq    %rdx, %r15
    jb    .L293
.L275:
    leaq    (%rdx,%rbp), %rbx
    movl    (%rbx), %eax
    cmpl    %r12d, %eax
    jne    .L270
    movq    %rdx, %r8
    movl    %r9d, %eax
    subq    80(%rsp), %r8
    movl    $1, %ecx
    xorl    %edi, %edi
    jmp    .L271
    .p2align 4,,10
.L273:
    leal    (%rax,%rdi), %esi
    cmpl    %esi, %r9d
    jne    .L272
    cmpl    %r11d, 32(%rsp)
    setne    %r11b
    movzbl    %r11b, %r11d
    addl    %r11d, %edi
.L272:
    addl    $1, %ecx
    addq    $1, %r8
    subl    $1, %eax
    je    .L319
.L271:
    movl    %ecx, %r11d
    movsbl    (%r10,%r11), %r11d
    cmpb    (%r8), %r11b
    je    .L273
    leal    1(%rdi), %ecx
    addq    %rcx, %rdx
    cmpq    %rdx, %r15
    jnb    .L275
    .p2align 4,,10
.L293:
    xorl    %ebx, %ebx
.L319:
    movq    %rbx, %rax
    addq    $120, %rsp
    popq    %rbx
    popq    %rsi
    popq    %rdi
    popq    %rbp
    popq    %r12
    popq    %r13
    popq    %r14
    popq    %r15
    ret
    .p2align 4,,10
.L321:
    leal    -1(%r9), %eax
    movsbl    (%rdx), %ebx
    addq    %rbp, %rsi
    movsbl    (%rdx,%rax), %eax
    sall    $8, %ebx
    addl    %eax, %ebx
    movzbl    %bh, %edi
    cmpl    $3, %r9d
    je    .L265
    .p2align 4,,10
.L261:
    movsbl    -2(%rsi), %eax
    movsbl    -1(%rsi), %ecx
    sall    $8, %eax
    addl    %ecx, %eax
    cmpl    %eax, %ebx
    je    .L322
    leaq    1(%rsi), %rax
    cmpb    %dil, %cl
    je    .L267
    addq    $2, %rsi
    cmpq    %rsi, %r15
    jnb    .L261
    jmp    .L293
    .p2align 4,,10
.L324:
    cmpb    %cl, 1(%r10)
    je    .L323
    leaq    1(%rsi), %rax
    cmpb    %cl, %dil
    je    .L286
.L325:
    movq    %rsi, %rax
    xorl    %esi, %esi
    cmpb    %dil, %dl
    setne    %sil
    leaq    2(%rsi,%rax), %rsi
.L263:
    cmpq    %rsi, %r15
    jb    .L293
.L265:
    movsbl    -3(%rsi), %eax
    movsbl    -1(%rsi), %r8d
    movzbl    -2(%rsi), %ecx
    sall    $8, %eax
    movl    %r8d, %edx
    addl    %r8d, %eax
    cmpl    %eax, %ebx
    je    .L324
    leaq    1(%rsi), %rax
    cmpb    %cl, %dil
    jne    .L325
.L286:
    movq    %rax, %rsi
    jmp    .L263
    .p2align 4,,10
.L267:
    cmpq    %rax, %r15
    jb    .L293
    movq    %rax, %rsi
    jmp    .L261
    .p2align 4,,10
.L269:
    shrl    $5, %r8d
    leaq    32(%rcx), %rbx
    vpbroadcastd    80(%rsp), %ymm4
    movq    %r15, 104(%rsp)
    leal    -1(%r8), %r14d
    movq    %rbx, 88(%rsp)
    leaq    4096(%rcx), %rax
    movl    %r9d, 216(%rsp)
    salq    $5, %r14
    movq    %rax, %r12
    movq    %r14, 72(%rsp)
    xorl    %r14d, %r14d
    movq    %r13, 96(%rsp)
    movq    %rdx, %r13
    vmovdqu    %ymm4, 32(%rsp)
.L278:
    vmovdqu    32(%rsp), %ymm3
    prefetcht0    (%r12)
    vpcmpeqd    1(%rsi,%r14), %ymm3, %ymm0
    vpcmpeqd    (%rsi,%r14), %ymm3, %ymm1
    vpcmpeqd    3(%rsi,%r14), %ymm3, %ymm2
    vpor    %ymm0, %ymm1, %ymm1
    vpcmpeqd    2(%rsi,%r14), %ymm3, %ymm0
    vpor    %ymm0, %ymm2, %ymm0
    vpor    %ymm0, %ymm1, %ymm0
    vmovmskps    %ymm0, %eax
    testl    %eax, %eax
    jne    .L326
.L276:
    addq    $32, %r14
    cmpq    72(%rsp), %r14
    jb    .L278
    movq    %r13, %r10
    movq    96(%rsp), %r13
    movq    104(%rsp), %r15
    movl    216(%rsp), %r12d
    subq    %r14, %r13
    cmpq    %r13, %rbp
    ja    .L294
    movl    80(%rsp), %ebx
    leal    -1(%r12), %r9d
    leaq    (%r14,%rbp), %rax
    negq    %rbp
    addq    %rsi, %rax
    movzbl    %bl, %edi
    movl    %ebx, %edx
    movl    %ebx, %r13d
    movl    %ebx, %r12d
    movl    %edi, 32(%rsp)
    sall    $8, %edx
    movl    %ebx, %edi
    sall    $16, %r13d
    sall    $24, %edi
    movzwl    %dx, %r14d
    andl    $16711680, %r13d
    movl    %edi, 72(%rsp)
    movslq    %r9d, %rdi
    movq    %rdi, 80(%rsp)
    jmp    .L284
    .p2align 4,,10
.L279:
    movl    %edx, %r8d
    movl    $1, %ecx
    andl    $65280, %r8d
    cmpl    %r14d, %r8d
    je    .L283
    movl    %edx, %r8d
    movl    $2, %ecx
    andl    $16711680, %r8d
    cmpl    %r13d, %r8d
    je    .L283
    xorl    %ecx, %ecx
    andl    $-16777216, %edx
    cmpl    72(%rsp), %edx
    setne    %cl
    addq    $3, %rcx
.L283:
    addq    %rcx, %rax
    cmpq    %rax, %r15
    jb    .L294
.L284:
    leaq    (%rax,%rbp), %rbx
    movl    (%rbx), %edx
    cmpl    %r12d, %edx
    jne    .L279
    movq    %rax, %r8
    movl    %r9d, %edx
    subq    80(%rsp), %r8
    movl    $1, %ecx
    xorl    %edi, %edi
    .p2align 4,,10
.L280:
    movl    %ecx, %r11d
    movsbl    (%r10,%r11), %r11d
    cmpb    (%r8), %r11b
    jne    .L327
    leal    (%rdx,%rdi), %esi
    cmpl    %esi, %r9d
    jne    .L281
    cmpl    %r11d, 32(%rsp)
    setne    %r11b
    movzbl    %r11b, %r11d
    addl    %r11d, %edi
.L281:
    addl    $1, %ecx
    addq    $1, %r8
    subl    $1, %edx
    jne    .L280
    vzeroupper
    jmp    .L319
    .p2align 4,,10
.L326:
    movq    88(%rsp), %rax
    leaq    (%r14,%rsi), %r15
    leaq    (%rax,%r14), %rdi
    vzeroupper
    jmp    .L277
    .p2align 4,,10
.L328:
    addq    $1, %r15
    cmpq    %r15, %rdi
    je    .L276
.L277:
    movq    %rbp, %r8
    movq    %r13, %rdx
    movq    %r15, %rcx
    movq    %r15, %rbx
    call    memcmp
    testl    %eax, %eax
    jne    .L328
    jmp    .L319
    .p2align 4,,10
.L322:
    leaq    -2(%rsi), %rbx
    jmp    .L319
    .p2align 4,,10
.L327:
    leal    1(%rdi), %ecx
    jmp    .L283
    .p2align 4,,10
.L323:
    leaq    -3(%rsi), %rbx
    jmp    .L319
    .p2align 4,,10
.L294:
    xorl    %ebx, %ebx
    vzeroupper
    jmp    .L319
    .seh_endproc

My wish is for other coders to refine it even further, having a toy to play with is something that quickly brings results.

Points of Interest

The performance of the AVX512 tool yet remains to be seen, if a fellow member can run the benchmark on a CPU supporting it, please do and share the output here in Comments Section below.

History

17^th October, 2020: Finished initial version

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication