PREFETCHW shows up in both the Linux kernel source and the Windows kernel, but there is a catch: the instruction was first added to AMD CPUs, and when it first appeared on Intel CPUs it merely stopped raising SIGILL and otherwise did nothing.
Eventually it was implemented properly, but I still get questions from customers about when it can actually be faster. Here is my latest answer, from an email:
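On a write, the cache line state changes to Modified, and going to Modified from the Exclusive state is faster than going from Shared to Modified, because a Shared line may have copies in other cores' caches that must be invalidated first. PREFETCHW brings the line in already in the Exclusive state, so the write that happens later can be faster. If the cache line is fetched from DRAM and no other core holds a copy, an ordinary read would typically bring it in as Exclusive anyway, so there is no difference.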
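To make the pattern concrete, here is a minimal sketch in C of prefetching with write intent before a read-modify-write. It assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` takes `rw = 1` as a write hint. Whether the compiler actually emits PREFETCHW depends on the target flags (for example `-mprfchw`); without them it may emit a different prefetch or none at all. The `account` struct, `compute_fee`, and `deposit` are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared record that other cores may also be reading. */
struct account {
    long balance;
};

/* Hypothetical stand-in for work that does not touch the account. */
static long compute_fee(long amount)
{
    return amount / 100;
}

/* Adds (amount - fee) to the balance and returns the new balance. */
long deposit(struct account *acct, long amount)
{
    assert(acct != NULL);

    /* rw = 1: hint that the line will be written; locality = 3: keep it
     * in all cache levels. This is only a hint and never faults. */
    __builtin_prefetch(acct, 1, 3);

    /* Independent work gives the prefetch time to complete. */
    long fee = compute_fee(amount);

    /* If the hint became PREFETCHW, the line may already be Exclusive,
     * so this write can skip the Shared-to-Modified upgrade. */
    acct->balance += amount - fee;
    return acct->balance;
}
```

Any gain would be expected only when the line is likely sitting in Shared state in another core's cache; for data that only one thread touches, the hint should make no measurable difference.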