Benchmark memchar (with GCC builtins)
Iakh via Digitalmars-d
digitalmars-d at puremagic.com
Mon Nov 2 18:33:14 PST 2015
On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu
wrote:
> Could you please take a look at GCC's generated code and
> implementation of memchr? -- Andrei
So I did. I rewrote the code to do the main work in cacheLineSize
chunks, which is what the GLIBC version does.
The main loop looks like this:
-----
do
{
    // ptr16 is 64-byte aligned
    ubyte16 r1 = __builtin_ia32_pcmpeqb128(ptr16[0], niddles);
    ubyte16 r2 = __builtin_ia32_pcmpeqb128(ptr16[1], niddles);
    ubyte16 r3 = __builtin_ia32_pcmpeqb128(ptr16[2], niddles);
    ubyte16 r4 = __builtin_ia32_pcmpeqb128(ptr16[3], niddles);
    // Merge the four compare results so one test covers all 64 bytes
    r3 = __builtin_ia32_pmaxub128(r1, r3);
    r4 = __builtin_ia32_pmaxub128(r2, r4);
    r4 = __builtin_ia32_pmaxub128(r3, r4);
    mask = __builtin_ia32_pmovmskb128(r4);
    if (mask != 0)
    {
        // A match is somewhere in this chunk: check each 16-byte block in order
        mask = __builtin_ia32_pmovmskb128(r1);
        mixin(CheckMask); // Check and return value
        ++ptr16; num -= 16;
        mask = __builtin_ia32_pmovmskb128(r2);
        mixin(CheckMask);
        ++ptr16; num -= 16;
        // r3 and r4 were overwritten by pmaxub above, so recompute them
        r3 = __builtin_ia32_pcmpeqb128(*ptr16, niddles);
        mask = __builtin_ia32_pmovmskb128(r3);
        mixin(CheckMask);
        ++ptr16; num -= 16;
        r4 = __builtin_ia32_pcmpeqb128(*ptr16, niddles);
        mask = __builtin_ia32_pmovmskb128(r4);
        mixin(CheckMask);
    }
    num -= 64;
    ptr16 += 4;
}
while (num > 0);
-----
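For readers without the SSE2 builtins at hand, the same control flow can be sketched in portable C. This is only an illustration under stated assumptions, not the code that was benchmarked: `match_mask16` and `memchr_chunked` are hypothetical names, the per-block mask is built with a scalar loop standing in for pcmpeqb + pmovmskb, a bitwise OR of the masks stands in for the pmaxub merge, and the tail is handled byte by byte instead of reading a full aligned block. `__builtin_ctz` (GCC/Clang) plays the role assumed here for the CheckMask step: finding the lowest set bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for pcmpeqb + pmovmskb on one 16-byte block:
   bit i of the result is set when block[i] == needle. */
static unsigned match_mask16(const unsigned char *block, unsigned char needle)
{
    unsigned mask = 0;
    for (unsigned i = 0; i < 16; ++i)
        mask |= (unsigned)(block[i] == needle) << i;
    return mask;
}

/* Hypothetical memchr-like search that works in 64-byte chunks. */
const void *memchr_chunked(const void *s, int c, size_t n)
{
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char needle = (unsigned char)c;

    /* Scalar prologue: advance until p is 64-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 63) != 0) {
        if (*p == needle)
            return p;
        ++p;
        --n;
    }

    /* Main loop: four 16-byte blocks (one 64-byte chunk) per iteration. */
    while (n >= 64) {
        unsigned m0 = match_mask16(p, needle);
        unsigned m1 = match_mask16(p + 16, needle);
        unsigned m2 = match_mask16(p + 32, needle);
        unsigned m3 = match_mask16(p + 48, needle);

        /* One combined test per chunk; only on a hit do we look at
           the individual blocks, in address order. */
        if ((m0 | m1 | m2 | m3) != 0) {
            if (m0) return p + __builtin_ctz(m0);
            if (m1) return p + 16 + __builtin_ctz(m1);
            if (m2) return p + 32 + __builtin_ctz(m2);
            return p + 48 + __builtin_ctz(m3);
        }
        p += 64;
        n -= 64;
    }

    /* Scalar tail for the last n < 64 bytes. */
    while (n > 0) {
        if (*p == needle)
            return p;
        ++p;
        --n;
    }
    return NULL;
}
```

The point of the combined test is that the common case (no match in the chunk) costs a single branch per 64 bytes; the per-block work is only paid once, on the chunk that contains the match.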
And my best result:
-----
Naive:      21.46   TickDuration(132842482)
SIMD:       1.161   TickDuration(7188211)
(was)SIMD:  3.04    TickDuration(18920182)
C:          1       TickDuration(6189222)
-----
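The first number in each row reads as the run's tick count divided by the C baseline's tick count. The post does not state this explicitly, so treat it as an assumption; the helper name `relative_time` below is hypothetical. A minimal sketch of that arithmetic:

```c
/* Relative time of a run against a baseline run, both in ticks.
   Assumes the ratios in the table are ticks / baseline ticks. */
static double relative_time(long long ticks, long long baseline_ticks)
{
    return (double)ticks / (double)baseline_ticks;
}
```

With the C row (6189222 ticks) as the baseline, the Naive row gives about 21.46 and the SIMD row about 1.161, matching the table. The (was)SIMD row gives about 3.06 against this baseline rather than the 3.04 shown, so that figure may have been taken against a different baseline run.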