Here is some simplified pseudocode:
function f(array_A)
{
    SoA = transform1(array_A);
    array_D = 0;                               // accumulator
    for i = 0 to 4
    {
        array_B = load_array_from_SoA(SoA, i);
        array_C = transform2(array_B);
        array_D += transform3(array_C);
    }
    return transform4(array_D);
}
SoA is a structure of arrays.
Each transform function takes an array, does a lot of computation on it, and returns a transformed array.
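For concreteness, the data has roughly this shape (hypothetical layout and signature; the names, element type, and sizes are made up for illustration):

#include <stddef.h>

#define N 4096                 // illustrative array length

struct SoA {                   // one contiguous array per field
    float x[N];
    float y[N];
    float z[N];
    float w[N];
};

// every transform has roughly this shape: read one array, write another
void transform2(const float *in, float *out, size_t n);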
Baseline performance: let's say 200 seconds.
Optimization 1. Enable compiler vectorization and resolve the vectorization hazards. All transform functions now get vectorized by the Intel compiler.
Execution time is now 20 seconds.
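One typical hazard of this kind is possible pointer aliasing; a fix of that kind looks roughly like this (the loop body is a placeholder, not the real math):

#include <stddef.h>

// restrict removes the aliasing hazard that blocks auto-vectorization;
// with it, the compiler vectorizes this loop on its own.
void transform2(const float *restrict in, float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * in[i] + 1.0f;   // placeholder computation
}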
Optimization 2. For every transform function, rewrite the inner loops with hand-coded intrinsics, without any clever tricks.
Execution time is now 12 seconds.
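By "hand-coded intrinsics without clever tricks" I mean a direct widening of the scalar loop, something like this (AVX shown; the body is still a placeholder, and it assumes n is a multiple of 8 with 32-byte-aligned buffers):

#include <immintrin.h>
#include <stddef.h>

void transform2_intrin(const float *in, float *out, size_t n)
{
    const __m256 one = _mm256_set1_ps(1.0f);
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(in + i);            // aligned load of 8 floats
        v = _mm256_add_ps(_mm256_mul_ps(v, v), one);  // same placeholder computation
        _mm256_store_ps(out + i, v);                  // aligned store
    }
}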
Profiling in Intel VTune Amplifier found that, out of the 12 seconds:
2 seconds: transform1
2 seconds: load_array_fromSoA
2 seconds: transform2
3 seconds: transform3
3 seconds: transform4
CPI is 0.8 and everything is in the L1 cache. I need another ~20% speedup; the target is 10 seconds. If I get CPI down to 0.7, I'll almost make it (12 s × 0.7 / 0.8 ≈ 10.5 s).
Optimization 3. Changed transform1 to return plain arrays, removed load_array_from_SoA, and instead added an array pointer argument to transform2.
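In rough C terms the change was something like this (hypothetical signatures, invented just to show the shape of the refactoring):

#include <stddef.h>

// before: transform1 built the SoA and load_array_from_SoA copied one array out of it
// after:  transform1 fills four plain arrays directly and transform2 reads through a
//         pointer, so the separate copy step disappears from f()'s loop
void transform1(const float *A, float *out_arrays[4], size_t n);
void transform2(const float *in, float *out, size_t n);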
Profiling:
2 seconds: transform1
4.2 seconds: transform2
What? Everything is in L1; there are no misses or other data hazards. Need to find out what the hell is happening.
Reverted this optimization.
Optimization 4. Changed transform2's vectorized algorithm using clever SIMD tricks; call the result transform2_v2. Ran a u-bench of the function in a tight loop: transform2_v2 is 1.2x faster than transform2 there. IACA also shows a ~20% throughput increase from this change.
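The u-bench is essentially this shape (the timing harness below is an illustrative sketch, not the exact one): run the kernel back-to-back on the same L1-resident buffers and compare cycles per call for transform2 and transform2_v2.

#include <stddef.h>
#include <x86intrin.h>   // __rdtsc

typedef void (*kernel_fn)(const float *in, float *out, size_t n);

// average cycles per call for one kernel run back-to-back in a tight loop;
// only that kernel's uops are ever in flight, unlike inside f()
static double ubench(kernel_fn fn, const float *in, float *out, size_t n, int reps)
{
    unsigned long long t0 = __rdtsc();
    for (int r = 0; r < reps; ++r)
        fn(in, out, n);
    unsigned long long t1 = __rdtsc();
    return (double)(t1 - t0) / (double)reps;
}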
Replaced transform2 with transform2_v2 in the application, and now it takes 2.2 seconds instead of 2.
Why? Because in the real app the pipeline holds a mix of almost 200 u-ops coming from transform1, load_array_from_SoA, transform2, and transform3, whereas in the u-bench it holds only transform2 (or transform2_v2). transform2_v2's throughput is better, but its latency is not, especially once the influence of the surrounding code is taken into account.
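Not the same trade transform2_v2 makes, but a much simpler illustration of why a tight-loop number doesn't automatically carry over to the full app is a reduction written with one vs. four accumulators (this kernel and its names are mine, only for illustration):

#include <stddef.h>

// One accumulator: every add depends on the previous one, so alone in a tight
// loop this is latency-bound -- but it issues few uops and leaves most of the
// machine free for neighbouring code.
float sum_1acc(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

// Four accumulators: clearly better standalone throughput because the chains
// overlap, at the price of more in-flight uops and port pressure -- the kind
// of win a tight-loop u-bench or IACA rewards, and the kind that can shrink
// once the out-of-order window is shared with other code.
float sum_4acc(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)    // remainder
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}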
Thinking about Optimization 5 now...