The same code (below), when compiled by the Intel compiler and run on both Ivy Bridge and Haswell, runs much faster on HSW than on IVB. (I can't disclose the exact speedup since HSW is not yet released, but the new instructions involved haven't been secret for a while now, and neither have the BKMs.)
The new gather instructions are useful, and the compiler uses them automatically. Of course, there are no miracles: as I increase A and B, performance goes down, and eventually it becomes on par with the old implementation because memory latency becomes the limiting factor.
#include <stdio.h>
#include <x86intrin.h>          // __rdtsc()

#define A (2*1024)
#define B 64

typedef unsigned long long ticks;
#define rdtscll() __rdtsc()     // read the time-stamp counter

int a[A];
int b[B];

int main(void) {
    int i, j, sum = 0;
    ticks t1, t2, min = 100000000;
    for (i = 0; i < A; i++) a[i] = i;                  // Init arrays
    for (i = 0; i < B; i++) b[i] = (i*113 + 113) % A;
    for (j = 0; j < 10000; j++) {
        sum = 0;
        t1 = rdtscll();
        for (i = 0; i < B; i++) sum += a[b[i]];        // indexed loads -> gathers
        t2 = rdtscll();
        if (min > (t2 - t1)) min = t2 - t1;            // measure best case, don't warm up
    }
    printf("sum = %i, min = %llu\n", sum, min);        // print sum so the compiler can't drop the loop
    return 0;
}
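To make it concrete, the inner loop the compiler vectorizes corresponds to something you could also write by hand with AVX2 intrinsics. The sketch below (my own illustration, not the compiler's output) uses _mm256_i32gather_epi32 (VPGATHERDD) to do eight indexed loads per instruction, and checks the result against the scalar loop:

#include <stdio.h>
#include <immintrin.h>   // AVX2 intrinsics; compile with -mavx2 (or icc -xCORE-AVX2)

#define A (2*1024)
#define B 64

int a[A];
int b[B];

int main(void) {
    int i, sum = 0, vsum = 0;
    for (i = 0; i < A; i++) a[i] = i;
    for (i = 0; i < B; i++) b[i] = (i*113 + 113) % A;

    /* Scalar reference */
    for (i = 0; i < B; i++) sum += a[b[i]];

    /* Same work with gathers: 8 int32 indexed loads per VPGATHERDD */
    __m256i acc = _mm256_setzero_si256();
    for (i = 0; i < B; i += 8) {
        __m256i idx  = _mm256_loadu_si256((__m256i const *)&b[i]);
        __m256i vals = _mm256_i32gather_epi32(a, idx, 4);  /* scale 4 = sizeof(int) */
        acc = _mm256_add_epi32(acc, vals);
    }

    /* Horizontal sum of the 8 accumulator lanes */
    int lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    for (i = 0; i < 8; i++) vsum += lanes[i];

    printf("scalar = %d, gather = %d\n", sum, vsum);
    return 0;
}

Note the gather still performs eight cache accesses under the hood, which is why the speedup evaporates once the indexed loads start missing cache.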