u-arch latency improvements
Oct. 12th, 2011, 10:42 am

Q: How can it be that the Intel architects developing the Nehalem u-arch managed to make unaligned access latency lower? Aren't all latencies fixed due to some concrete constraints?
A: No, almost nothing is fixed; there is just a tradeoff between different design decisions, balancing latencies against the power/transistor budget. In fact, each major u-arch makes some latency adjustments (not all of them for the better).
Consider the lineage
Dothan -> Merom -> Nehalem -> Haswell -> the next one (which I think I cannot name, though Wikipedia tries :)
At each step, some design decisions persist (L1I and L1D sizes; later, the L2 size), some improve gradually (the number of instructions in flight in the OOO engine, the number of execution ports), some latencies go up (L1D read: 3 -> 4 -> 5 cycles; the latencies of some generic x86 instructions), and some go down (unaligned loads, store forwarding, the latencies of some SIMD instructions).
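Load-to-use latency of the kind quoted above is something you can approximate yourself with the classic pointer-chasing trick: each load's address depends on the previous load's result, so the loads serialize and the time per iteration approximates the L1D latency. A minimal sketch (mine, not from any Intel material; the buffer size and iteration count are arbitrary choices):

/* chase.c - approximate L1D load-to-use latency by pointer chasing.
 * Build: cc -O2 chase.c */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024                  /* 1024 * 8 B = 8 KiB, fits in L1D */
#define ITERS 100000000UL

int main(void) {
    void **ring = malloc(N * sizeof *ring);
    for (size_t i = 0; i < N; i++)          /* each slot points to the next */
        ring[i] = &ring[(i + 1) % N];

    void **p = ring;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERS; i++)
        p = *p;                             /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* printing p keeps the compiler from dropping the chase */
    printf("%.2f ns per load (p=%p)\n", ns / ITERS, (void *)p);
    free(ring);
    return 0;
}

Multiply the nanoseconds per load by the core clock frequency to get the latency in cycles.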
The main idea here is that, e.g., an unaligned load will always be more expensive than an aligned one; the question is whether it is worth spending more transistors to lower each particular latency. Usually spending more transistors makes it possible to make a latency smaller; e.g., the discussion in Hennessy and Patterson of the transistor budget for implementing cache sets applies here.
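The expensive unaligned case is a load that straddles a cache-line boundary. The same pointer-chasing idea can compare it against the aligned case; again a hedged sketch of mine (offsets and penalties will differ per u-arch, and I deliberately quote no numbers):

/* split.c - aligned load vs. one crossing a 64-byte line boundary.
 * Build: cc -O2 split.c */
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ITERS 50000000UL

static double chase(char *base, size_t offset) {
    memcpy(base + offset, &base, sizeof base);  /* plant a self-pointer */
    char *p = base;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERS; i++) {
        memcpy(&p, p + offset, sizeof p);       /* one dependent load */
        __asm__ volatile("" : "+r"(p));         /* keep the chain opaque */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (p != base) return -1.0;                 /* should not happen */
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
}

int main(void) {
    void *mem;
    if (posix_memalign(&mem, 64, 128))          /* 64-byte-aligned buffer */
        return 1;
    char *buf = mem;
    printf("aligned:    %.2f ns/load\n", chase(buf, 0));
    /* an 8-byte load at offset 60 straddles the first 64-byte line */
    printf("line-split: %.2f ns/load\n", chase(buf, 60));
    free(buf);
    return 0;
}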
With each new u-arch, thanks to Moore's law, there is an additional transistor and power budget to spend. So architects simply simulate some set of benchmarks with different latency parameters, trying to find the best fit. No magic.
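To make "no magic" concrete, here is a deliberately toy illustration of that exploration loop (all numbers are my own and entirely hypothetical): sweep candidate latency configurations, estimate a weighted benchmark cost for each, and keep the cheapest one that fits the transistor budget. The real methodology uses cycle-accurate simulators and large benchmark suites, not three hardcoded structs:

/* sweep.c - toy design-space exploration over latency configurations */
#include <stdio.h>

struct config {
    int l1d_lat;            /* L1D load latency, cycles */
    int unaligned_extra;    /* extra cycles for an unaligned load */
    double transistors;     /* relative transistor cost */
};

int main(void) {
    /* hypothetical candidates: lower latencies cost more transistors */
    struct config cands[] = {
        {3, 8, 1.0},        /* fast L1D, slow unaligned path */
        {4, 2, 1.3},        /* slower L1D, dedicated unaligned hardware */
        {5, 1, 1.6},
    };
    /* hypothetical benchmark mix: loads per 100 instructions and the
     * fraction of those loads that are unaligned */
    double loads = 30.0, unaligned_frac = 0.05, budget = 1.5;

    double best = 1e30;
    int best_i = -1;
    for (int i = 0; i < 3; i++) {
        if (cands[i].transistors > budget)
            continue;                       /* over the transistor budget */
        double cost = loads * (cands[i].l1d_lat
                               + unaligned_frac * cands[i].unaligned_extra);
        if (cost < best) { best = cost; best_i = i; }
    }
    printf("best config: #%d (%.1f load cycles per 100 instructions)\n",
           best_i, best);
    return 0;
}

With this mix, config #0 wins; raise unaligned_frac to 0.5 and config #1 wins instead, which is exactly the workload-driven tradeoff described above.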
P.S. I am not a CPU architect, and even in this post I try not to reveal actual timings (too lazy to check which are confidential and which are public).