zeroupper causes incorrect results












1














I have a following code inside a for loop



    dataInt = _mm_loadu_si128((__m128i *) (&x[i]));
__m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);

converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));

_mm256_storeu_ps(&y[i], converted);
_mm256_zeroupper();


It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).



I'm using mingw compiler with posix thread version 7.2



When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.



The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.



Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.



EDIT: I extracted the code. It looks like this:



#include <iostream>
#include <limits>
#include <immintrin.h>
#include <xmmintrin.h>
int __attribute__ ((__target__ ("avx2,sse4.2"))) main(){

/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());

__m128i dataInt;
int runs = 32;
int16_t source[32];
float target[32];
int i = 0;
for (; i < 32; ++i) {
source[i] = std::numeric_limits<int16_t>::min()+i;
}

i=0;
for (; i < runs; i += 8) {
// _mm256_zeroupper();

dataInt = _mm_loadu_si128((__m128i *) (&source[i]));

__m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);

__m256 maxVinFloat = _mm256_set1_ps(max_val);
converted = _mm256_div_ps(converted, maxVinFloat);

_mm256_storeu_ps(&target[i], converted);
_mm256_zeroupper();
}
i = 0;
for (; i < 32; ++i) {
std::cout << target [ i ] <<" ";
}}


However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.



EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation










share|improve this question




















  • 1




    You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
    – Peter Cordes
    Nov 20 '18 at 16:59






  • 2




    gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
    – Marc Glisse
    Nov 20 '18 at 17:05










  • @PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
    – TStancek
    Nov 21 '18 at 9:18










  • @PeterCordes I added the extracted version of the function.
    – TStancek
    Nov 21 '18 at 12:22


















1














I have a following code inside a for loop



    dataInt = _mm_loadu_si128((__m128i *) (&x[i]));
__m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);

converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));

_mm256_storeu_ps(&y[i], converted);
_mm256_zeroupper();


It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).



I'm using mingw compiler with posix thread version 7.2



When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.



The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.



Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.



EDIT: I extracted the code. It looks like this:



#include <iostream>
#include <limits>
#include <immintrin.h>
#include <xmmintrin.h>
int __attribute__ ((__target__ ("avx2,sse4.2"))) main(){

/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());

__m128i dataInt;
int runs = 32;
int16_t source[32];
float target[32];
int i = 0;
for (; i < 32; ++i) {
source[i] = std::numeric_limits<int16_t>::min()+i;
}

i=0;
for (; i < runs; i += 8) {
// _mm256_zeroupper();

dataInt = _mm_loadu_si128((__m128i *) (&source[i]));

__m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);

__m256 maxVinFloat = _mm256_set1_ps(max_val);
converted = _mm256_div_ps(converted, maxVinFloat);

_mm256_storeu_ps(&target[i], converted);
_mm256_zeroupper();
}
i = 0;
for (; i < 32; ++i) {
std::cout << target [ i ] <<" ";
}}


However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.



EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation










share|improve this question




















  • 1




    You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
    – Peter Cordes
    Nov 20 '18 at 16:59






  • 2




    gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
    – Marc Glisse
    Nov 20 '18 at 17:05










  • @PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
    – TStancek
    Nov 21 '18 at 9:18










  • @PeterCordes I added the extracted version of the function.
    – TStancek
    Nov 21 '18 at 12:22
















1












1








1







I have a following code inside a for loop



    dataInt = _mm_loadu_si128((__m128i *) (&x[i]));
__m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);

converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));

_mm256_storeu_ps(&y[i], converted);
_mm256_zeroupper();


It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).



I'm using mingw compiler with posix thread version 7.2



When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.



The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.



Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.



EDIT: I extracted the code. It looks like this:



#include <iostream>
#include <limits>
#include <immintrin.h>
#include <xmmintrin.h>
int __attribute__ ((__target__ ("avx2,sse4.2"))) main(){

/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());

__m128i dataInt;
int runs = 32;
int16_t source[32];
float target[32];
int i = 0;
for (; i < 32; ++i) {
source[i] = std::numeric_limits<int16_t>::min()+i;
}

i=0;
for (; i < runs; i += 8) {
// _mm256_zeroupper();

dataInt = _mm_loadu_si128((__m128i *) (&source[i]));

__m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);

__m256 maxVinFloat = _mm256_set1_ps(max_val);
converted = _mm256_div_ps(converted, maxVinFloat);

_mm256_storeu_ps(&target[i], converted);
_mm256_zeroupper();
}
i = 0;
for (; i < 32; ++i) {
std::cout << target [ i ] <<" ";
}}


However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.



EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation










share|improve this question















I have a following code inside a for loop



    dataInt = _mm_loadu_si128((__m128i *) (&x[i]));
__m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);

converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));

_mm256_storeu_ps(&y[i], converted);
_mm256_zeroupper();


It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).



I'm using mingw compiler with posix thread version 7.2



When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.



The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.



Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.



EDIT: I extracted the code. It looks like this:



#include <iostream>
#include <limits>
#include <immintrin.h>
#include <xmmintrin.h>
int __attribute__ ((__target__ ("avx2,sse4.2"))) main(){

/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());

__m128i dataInt;
int runs = 32;
int16_t source[32];
float target[32];
int i = 0;
for (; i < 32; ++i) {
source[i] = std::numeric_limits<int16_t>::min()+i;
}

i=0;
for (; i < runs; i += 8) {
// _mm256_zeroupper();

dataInt = _mm_loadu_si128((__m128i *) (&source[i]));

__m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);

__m256 maxVinFloat = _mm256_set1_ps(max_val);
converted = _mm256_div_ps(converted, maxVinFloat);

_mm256_storeu_ps(&target[i], converted);
_mm256_zeroupper();
}
i = 0;
for (; i < 32; ++i) {
std::cout << target [ i ] <<" ";
}}


However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.



EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation







c++ gcc mingw avx avx2






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 21 '18 at 11:17

























asked Nov 20 '18 at 14:39









TStancek

1637




1637








  • 1




    You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
    – Peter Cordes
    Nov 20 '18 at 16:59






  • 2




    gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
    – Marc Glisse
    Nov 20 '18 at 17:05










  • @PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
    – TStancek
    Nov 21 '18 at 9:18










  • @PeterCordes I added the extracted version of the function.
    – TStancek
    Nov 21 '18 at 12:22
















  • 1




    You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
    – Peter Cordes
    Nov 20 '18 at 16:59






  • 2




    gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
    – Marc Glisse
    Nov 20 '18 at 17:05










  • @PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
    – TStancek
    Nov 21 '18 at 9:18










  • @PeterCordes I added the extracted version of the function.
    – TStancek
    Nov 21 '18 at 12:22










1




1




You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59




You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59




2




2




gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05




gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05












@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18




@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18












@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22






@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22














0






active

oldest

votes











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53395416%2fzeroupper-causes-incorrect-results%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























0






active

oldest

votes








0






active

oldest

votes









active

oldest

votes






active

oldest

votes
















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53395416%2fzeroupper-causes-incorrect-results%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Wiesbaden

Marschland

Dieringhausen