zeroupper causes incorrect results
I have a following code inside a for loop
dataInt = _mm_loadu_si128((__m128i *) (&x[i]));
__m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);
converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));
_mm256_storeu_ps(&y[i], converted);
_mm256_zeroupper();
It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).
I'm using mingw compiler with posix thread version 7.2
When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.
The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.
Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.
EDIT: I extracted the code. It looks like this:
#include <iostream>
#include <limits>
#include <immintrin.h>
#include <xmmintrin.h>
int __attribute__ ((__target__ ("avx2,sse4.2"))) main(){
/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());
__m128i dataInt;
int runs = 32;
int16_t source[32];
float target[32];
int i = 0;
for (; i < 32; ++i) {
source[i] = std::numeric_limits<int16_t>::min()+i;
}
i=0;
for (; i < runs; i += 8) {
// _mm256_zeroupper();
dataInt = _mm_loadu_si128((__m128i *) (&source[i]));
__m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);
__m256 maxVinFloat = _mm256_set1_ps(max_val);
converted = _mm256_div_ps(converted, maxVinFloat);
_mm256_storeu_ps(&target[i], converted);
_mm256_zeroupper();
}
i = 0;
for (; i < 32; ++i) {
std::cout << target [ i ] <<" ";
}}
However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.
EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation
c++ gcc mingw avx avx2
add a comment |
I have a following code inside a for loop
dataInt = _mm_loadu_si128((__m128i *) (&x[i]));
__m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);
converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));
_mm256_storeu_ps(&y[i], converted);
_mm256_zeroupper();
It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).
I'm using mingw compiler with posix thread version 7.2
When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.
The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.
Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.
EDIT: I extracted the code. It looks like this:
#include <iostream>
#include <limits>
#include <immintrin.h>
#include <xmmintrin.h>
int __attribute__ ((__target__ ("avx2,sse4.2"))) main(){
/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());
__m128i dataInt;
int runs = 32;
int16_t source[32];
float target[32];
int i = 0;
for (; i < 32; ++i) {
source[i] = std::numeric_limits<int16_t>::min()+i;
}
i=0;
for (; i < runs; i += 8) {
// _mm256_zeroupper();
dataInt = _mm_loadu_si128((__m128i *) (&source[i]));
__m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);
__m256 maxVinFloat = _mm256_set1_ps(max_val);
converted = _mm256_div_ps(converted, maxVinFloat);
_mm256_storeu_ps(&target[i], converted);
_mm256_zeroupper();
}
i = 0;
for (; i < 32; ++i) {
std::cout << target [ i ] <<" ";
}}
However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.
EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation
c++ gcc mingw avx avx2
1
You normally never need to manually insert a_mm256_zeroupper()
. Compilers insert thevzeroupper
asm instruction as needed. But still an interesting question; I would have expected GCC to respect_mm256_zeroupper()
placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59
2
gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05
@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18
@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22
add a comment |
I have a following code inside a for loop
dataInt = _mm_loadu_si128((__m128i *) (&x[i]));
__m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);
converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));
_mm256_storeu_ps(&y[i], converted);
_mm256_zeroupper();
It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).
I'm using mingw compiler with posix thread version 7.2
When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.
The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.
Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.
EDIT: I extracted the code. It looks like this:
#include <iostream>
#include <limits>
#include <immintrin.h>
#include <xmmintrin.h>
int __attribute__ ((__target__ ("avx2,sse4.2"))) main(){
/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());
__m128i dataInt;
int runs = 32;
int16_t source[32];
float target[32];
int i = 0;
for (; i < 32; ++i) {
source[i] = std::numeric_limits<int16_t>::min()+i;
}
i=0;
for (; i < runs; i += 8) {
// _mm256_zeroupper();
dataInt = _mm_loadu_si128((__m128i *) (&source[i]));
__m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);
__m256 maxVinFloat = _mm256_set1_ps(max_val);
converted = _mm256_div_ps(converted, maxVinFloat);
_mm256_storeu_ps(&target[i], converted);
_mm256_zeroupper();
}
i = 0;
for (; i < 32; ++i) {
std::cout << target [ i ] <<" ";
}}
However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.
EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation
c++ gcc mingw avx avx2
I have a following code inside a for loop
dataInt = _mm_loadu_si128((__m128i *) (&x[i]));
__m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);
converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));
_mm256_storeu_ps(&y[i], converted);
_mm256_zeroupper();
It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).
I'm using mingw compiler with posix thread version 7.2
When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.
The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.
Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.
EDIT: I extracted the code. It looks like this:
#include <iostream>
#include <limits>
#include <immintrin.h>
#include <xmmintrin.h>
int __attribute__ ((__target__ ("avx2,sse4.2"))) main(){
/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());
__m128i dataInt;
int runs = 32;
int16_t source[32];
float target[32];
int i = 0;
for (; i < 32; ++i) {
source[i] = std::numeric_limits<int16_t>::min()+i;
}
i=0;
for (; i < runs; i += 8) {
// _mm256_zeroupper();
dataInt = _mm_loadu_si128((__m128i *) (&source[i]));
__m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);
__m256 converted = _mm256_cvtepi32_ps(val_unpacked);
__m256 maxVinFloat = _mm256_set1_ps(max_val);
converted = _mm256_div_ps(converted, maxVinFloat);
_mm256_storeu_ps(&target[i], converted);
_mm256_zeroupper();
}
i = 0;
for (; i < 32; ++i) {
std::cout << target [ i ] <<" ";
}}
However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.
EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation
c++ gcc mingw avx avx2
c++ gcc mingw avx avx2
edited Nov 21 '18 at 11:17
asked Nov 20 '18 at 14:39
TStancek
1637
1637
1
You normally never need to manually insert a_mm256_zeroupper()
. Compilers insert thevzeroupper
asm instruction as needed. But still an interesting question; I would have expected GCC to respect_mm256_zeroupper()
placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59
2
gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05
@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18
@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22
add a comment |
1
You normally never need to manually insert a_mm256_zeroupper()
. Compilers insert thevzeroupper
asm instruction as needed. But still an interesting question; I would have expected GCC to respect_mm256_zeroupper()
placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59
2
gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05
@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18
@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22
1
1
You normally never need to manually insert a
_mm256_zeroupper()
. Compilers insert the vzeroupper
asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper()
placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?– Peter Cordes
Nov 20 '18 at 16:59
You normally never need to manually insert a
_mm256_zeroupper()
. Compilers insert the vzeroupper
asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper()
placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?– Peter Cordes
Nov 20 '18 at 16:59
2
2
gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05
gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05
@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18
@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18
@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22
@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22
add a comment |
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53395416%2fzeroupper-causes-incorrect-results%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53395416%2fzeroupper-causes-incorrect-results%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
You normally never need to manually insert a
_mm256_zeroupper()
. Compilers insert thevzeroupper
asm instruction as needed. But still an interesting question; I would have expected GCC to respect_mm256_zeroupper()
placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?– Peter Cordes
Nov 20 '18 at 16:59
2
gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05
@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18
@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22