AVX on the Sandy Bridge architecture is only suitable for floating point and for a very limited set of integer ops. In future CPU generations, it may be extended to support a broader range of integer operations.
Even then, I think it will still only support vector operations on 32-bit (or wider) values. So when, e.g., some crypto code needs a vectorized ROTL on 16-bit ints, I have to write something like:
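// C++ vector pseudo-code (each operation applies per 32-bit element)
__m256i REG; // input: two 16-bit values packed in each 32-bit element
__m256i CARRY;
// repeat for each bit shifted.
CARRY = REG & 0x80008000; // save the top bit of each 16-bit value
REG = REG * 2; // main shift
CARRY = CARRY / (1 << 15); // carry shift (unsigned): move saved bits to bit 0 of each 16-bit value
REG = REG & 0xFFFEFFFE; // clear bit 0 of each 16-bit value (drops the bit that spilled in from the neighbour)
REG = REG | CARRY; // set carry bits in REG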
I don't see how this sequence can be made smaller and I don't see how it can be easily pipelined.
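To sanity-check the per-bit sequence, here is a minimal scalar sketch of the same idea on a single 32-bit word holding two 16-bit values. It is not AVX code; it only mirrors what each 32-bit element would do. The names rotl16x2_step, rotl16x2 and rotl16 are made up for this illustration, and it assumes the spilled bit is cleared with an AND mask before the carry is OR-ed in.

```cpp
#include <cstdint>

// One ROTL-by-1 step on two 16-bit values packed in a 32-bit word.
// (Hypothetical helper for illustration only.)
inline uint32_t rotl16x2_step(uint32_t reg) {
    uint32_t carry = reg & 0x80008000u; // save the top bit of each 16-bit value
    reg <<= 1;                          // main shift (same as REG * 2)
    carry >>= 15;                       // move saved bits to bit 0 of each value
    reg &= 0xFFFEFFFEu;                 // drop the bit that spilled across the boundary
    return reg | carry;                 // put the carry bits back in
}

// Rotate both packed 16-bit values left by n bits, one bit at a time.
inline uint32_t rotl16x2(uint32_t reg, unsigned n) {
    for (unsigned i = 0; i < (n & 15u); ++i) {
        reg = rotl16x2_step(reg);
    }
    return reg;
}

// Plain 16-bit rotate, used only as a reference to compare against.
inline uint16_t rotl16(uint16_t x, unsigned n) {
    n &= 15u;
    if (n == 0) return x;
    return static_cast<uint16_t>((x << n) | (x >> (16 - n)));
}
```

For any input, rotl16x2(w, n) should equal rotl16 applied to the upper and lower halves of w separately and then packed back together.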