zeroupper causes incorrect results

I have a following code inside a for loop

    dataInt = _mm_loadu_si128((__m128i *) (&x[i]));

    __m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);

    __m256 converted = _mm256_cvtepi32_ps(val_unpacked);



    converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));



    _mm256_storeu_ps(&y[i], converted);

    _mm256_zeroupper();

It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).

I'm using mingw compiler with posix thread version 7.2

When I compile the program without optimisation, it runs just fine, but when I turn on optimisation (I do not have control over the individual optimisations, it is inside a project I am working on, but it should be using lvl of optimisation -O3 ), I start to get wrong results.

The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.

Seemingly the optimisation does not respect the placement of the zeroupper instruction and calls it somewhere in the middle of the loop and not at the very end, thus discarding useful data. Is something like that possible? I could not find any discussion regarding this topic on the internet.

EDIT: I extracted the code. It looks like this:

#include <iostream>

#include <limits>

#include <immintrin.h>

#include <xmmintrin.h>  

 int  __attribute__ ((__target__ ("avx2,sse4.2"))) main(){



/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());



__m128i dataInt;

int runs = 32;

int16_t source[32];

float target[32];

int i = 0;

for (; i < 32; ++i) {

    source[i] = std::numeric_limits<int16_t>::min()+i;

}



i=0;

for (; i < runs; i += 8) {

    // _mm256_zeroupper();



     dataInt = _mm_loadu_si128((__m128i *) (&source[i]));



      __m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);

    __m256 converted =  _mm256_cvtepi32_ps(val_unpacked);



    __m256 maxVinFloat = _mm256_set1_ps(max_val);

    converted = _mm256_div_ps(converted, maxVinFloat);



    _mm256_storeu_ps(&target[i], converted);

    _mm256_zeroupper();

}

i = 0;

for (; i < 32; ++i) {

    std::cout << target [ i ] <<"  ";

}}

However when compiled on online compilers, the output is fine, even when using lvl 3 optimisation. But my Clion using compiler described in my original post outputs several infinities, because the register maxVinFloat with max values is composed of zeroes in one half of the register. So,it seems the register is optimised to be initialised only once and other iterations of the cycle output infinities.

EDIT2: My mistake, it does output infinities on online compilers, but I forgot to remove the volatile part (that fixes the problem) when testing it, just run that code
here https://www.tutorialspoint.com/compile_cpp_online.php with -O2 optimisation

edited Nov 21 '18 at 11:17

asked Nov 20 '18 at 14:39

TStancek

1637

1

You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59

2

gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05

@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18

@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22

add a comment |

I have a following code inside a for loop

    dataInt = _mm_loadu_si128((__m128i *) (&x[i]));

    __m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);

    __m256 converted = _mm256_cvtepi32_ps(val_unpacked);



    converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));



    _mm256_storeu_ps(&y[i], converted);

    _mm256_zeroupper();

It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).

I'm using mingw compiler with posix thread version 7.2

The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.

EDIT: I extracted the code. It looks like this:

#include <iostream>

#include <limits>

#include <immintrin.h>

#include <xmmintrin.h>  

 int  __attribute__ ((__target__ ("avx2,sse4.2"))) main(){



/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());



__m128i dataInt;

int runs = 32;

int16_t source[32];

float target[32];

int i = 0;

for (; i < 32; ++i) {

    source[i] = std::numeric_limits<int16_t>::min()+i;

}



i=0;

for (; i < runs; i += 8) {

    // _mm256_zeroupper();



     dataInt = _mm_loadu_si128((__m128i *) (&source[i]));



      __m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);

    __m256 converted =  _mm256_cvtepi32_ps(val_unpacked);



    __m256 maxVinFloat = _mm256_set1_ps(max_val);

    converted = _mm256_div_ps(converted, maxVinFloat);



    _mm256_storeu_ps(&target[i], converted);

    _mm256_zeroupper();

}

i = 0;

for (; i < 32; ++i) {

    std::cout << target [ i ] <<"  ";

}}

edited Nov 21 '18 at 11:17

asked Nov 20 '18 at 14:39

TStancek

1637

1

You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59

2

gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05

@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18

@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22

add a comment |

I have a following code inside a for loop

    dataInt = _mm_loadu_si128((__m128i *) (&x[i]));

    __m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);

    __m256 converted = _mm256_cvtepi32_ps(val_unpacked);



    converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));



    _mm256_storeu_ps(&y[i], converted);

    _mm256_zeroupper();

It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).

I'm using mingw compiler with posix thread version 7.2

The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.

EDIT: I extracted the code. It looks like this:

#include <iostream>

#include <limits>

#include <immintrin.h>

#include <xmmintrin.h>  

 int  __attribute__ ((__target__ ("avx2,sse4.2"))) main(){



/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());



__m128i dataInt;

int runs = 32;

int16_t source[32];

float target[32];

int i = 0;

for (; i < 32; ++i) {

    source[i] = std::numeric_limits<int16_t>::min()+i;

}



i=0;

for (; i < runs; i += 8) {

    // _mm256_zeroupper();



     dataInt = _mm_loadu_si128((__m128i *) (&source[i]));



      __m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);

    __m256 converted =  _mm256_cvtepi32_ps(val_unpacked);



    __m256 maxVinFloat = _mm256_set1_ps(max_val);

    converted = _mm256_div_ps(converted, maxVinFloat);



    _mm256_storeu_ps(&target[i], converted);

    _mm256_zeroupper();

}

i = 0;

for (; i < 32; ++i) {

    std::cout << target [ i ] <<"  ";

}}

edited Nov 21 '18 at 11:17

asked Nov 20 '18 at 14:39

TStancek

1637

I have a following code inside a for loop

    dataInt = _mm_loadu_si128((__m128i *) (&x[i]));

    __m256i val_unpacked = _mm256_cvtepi16_epi32(dataInt);

    __m256 converted = _mm256_cvtepi32_ps(val_unpacked);



    converted = _mm256_div_ps(converted, _mm256_set1_ps(max_val));



    _mm256_storeu_ps(&y[i], converted);

    _mm256_zeroupper();

It basically just converts vector of int16 to floats in range [-1,1] (max_val is const variable and equal to numeric_limit::max).

I'm using mingw compiler with posix thread version 7.2

The problem is in the zeroupper instruction. When I remove it in mode with optimisations, it again gives correct results.

EDIT: I extracted the code. It looks like this:

#include <iostream>

#include <limits>

#include <immintrin.h>

#include <xmmintrin.h>  

 int  __attribute__ ((__target__ ("avx2,sse4.2"))) main(){



/*volatile*/ float max_val = static_cast<float> (std::numeric_limits<int16_t>::max());



__m128i dataInt;

int runs = 32;

int16_t source[32];

float target[32];

int i = 0;

for (; i < 32; ++i) {

    source[i] = std::numeric_limits<int16_t>::min()+i;

}



i=0;

for (; i < runs; i += 8) {

    // _mm256_zeroupper();



     dataInt = _mm_loadu_si128((__m128i *) (&source[i]));



      __m256i val_unpacked =_mm256_cvtepi16_epi32(dataInt);

    __m256 converted =  _mm256_cvtepi32_ps(val_unpacked);



    __m256 maxVinFloat = _mm256_set1_ps(max_val);

    converted = _mm256_div_ps(converted, maxVinFloat);



    _mm256_storeu_ps(&target[i], converted);

    _mm256_zeroupper();

}

i = 0;

for (; i < 32; ++i) {

    std::cout << target [ i ] <<"  ";

}}

c++ gcc mingw avx avx2

edited Nov 21 '18 at 11:17

asked Nov 20 '18 at 14:39

TStancek

1637

edited Nov 21 '18 at 11:17

asked Nov 20 '18 at 14:39

TStancek

1637

edited Nov 21 '18 at 11:17

asked Nov 20 '18 at 14:39

TStancek

1637

asked Nov 20 '18 at 14:39

TStancek

1637

asked Nov 20 '18 at 14:39

TStancek

1637

1

You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59

2

gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05

@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18

@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22

add a comment |

1

You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59

2

gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05

@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18

@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22

You normally never need to manually insert a _mm256_zeroupper(). Compilers insert the vzeroupper asm instruction as needed. But still an interesting question; I would have expected GCC to respect _mm256_zeroupper() placement in the source and treat it as zeroing the top half of every ``__m256` that was in scope at the time. Can you add enough C++ code to the question to make a Minimal, Complete, and Verifiable example we can copy/paste into a compiler, e.g. on godbolt.org, and look at the asm output?
– Peter Cordes
Nov 20 '18 at 16:59

gcc.gnu.org/bugzilla/show_bug.cgi?id=82735
– Marc Glisse
Nov 20 '18 at 17:05

@PeterCordes Working on it, but seems that extracting the code removes the problem. It is going to take a lot more time (that I do not currently have). My original intention was just to scan my options quickly, to get information whether some else dealt with this before or whether it is known to be a problem. But seems I am gonna have to dig a lot deeper.
– TStancek
Nov 21 '18 at 9:18

@PeterCordes I added the extracted version of the function.
– TStancek
Nov 21 '18 at 12:22

add a comment |

0

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53395416%2fzeroupper-causes-incorrect-results%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

0

active

oldest

votes

0

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

5MnECd6,hiyRD5fjcCYx W

搜尋此網誌

Ytukyg