The same code (below), when compiled by the Intel compiler and run on both Ivy Bridge and Haswell, runs much faster on HSW than on IVB. (I can't disclose the exact speedup since HSW is not yet released, but the new instructions involved haven't been secret for a while now, and neither have the BKMs.)
The new gather instructions are useful, and the compiler uses them automatically. Of course, there are no miracles: as I increase A and B, performance goes down, and eventually it becomes on par with the old implementation because memory latency becomes the limiting factor.
#include <stdio.h>
#include <x86intrin.h>          // __rdtsc()

#define A (2*1024)
#define B 64

typedef unsigned long long ticks;
#define rdtscll() __rdtsc()     // read the time-stamp counter

int a[A];
int b[B];

int main(void) {
    int i, j, sum = 0;
    ticks t1, t2, min = 100000000;
    for (i = 0; i < A; i++) a[i] = i;                  // Init arrays
    for (i = 0; i < B; i++) b[i] = (i*113 + 113) % A;
    for (j = 0; j < 10000; j++) {
        sum = 0;
        t1 = rdtscll();
        for (i = 0; i < B; i++) sum += a[b[i]];        // indexed loads -> gathers
        t2 = rdtscll();
        if (min > (t2 - t1)) min = t2 - t1;            // measure best case, don't warm up
    }
    printf("sum = %i, min = %llu\n", sum, min);        // print sum so the compiler can't drop the loop
    return 0;
}
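To make it concrete, the inner loop the compiler vectorizes corresponds to something you could also write by hand with AVX2 intrinsics. The sketch below (my own illustration, not the compiler's output) uses _mm256_i32gather_epi32 (VPGATHERDD) to do eight indexed loads per instruction, and checks the result against the scalar loop:

#include <stdio.h>
#include <immintrin.h>   // AVX2 intrinsics; compile with -mavx2 (or icc -xCORE-AVX2)

#define A (2*1024)
#define B 64

int a[A];
int b[B];

int main(void) {
    int i, sum = 0, vsum = 0;
    for (i = 0; i < A; i++) a[i] = i;
    for (i = 0; i < B; i++) b[i] = (i*113 + 113) % A;

    /* Scalar reference */
    for (i = 0; i < B; i++) sum += a[b[i]];

    /* Same work with gathers: 8 int32 indexed loads per VPGATHERDD */
    __m256i acc = _mm256_setzero_si256();
    for (i = 0; i < B; i += 8) {
        __m256i idx  = _mm256_loadu_si256((__m256i const *)&b[i]);
        __m256i vals = _mm256_i32gather_epi32(a, idx, 4);  /* scale 4 = sizeof(int) */
        acc = _mm256_add_epi32(acc, vals);
    }

    /* Horizontal sum of the 8 accumulator lanes */
    int lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    for (i = 0; i < 8; i++) vsum += lanes[i];

    printf("scalar = %d, gather = %d\n", sum, vsum);
    return 0;
}

Note the gather still performs eight cache accesses under the hood, which is why the speedup evaporates once the indexed loads start missing cache.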