Most performant byte reordering

jars121

Hi,

I receive SPI data from a remote processor which is acting as the SPI slave. The remote processor packs bytes into a buffer, which is shifted out via DMA. The SPI master initiates transfers with a word length of 32 bits. I'm having an issue whereby the received bytes are incorrectly ordered:

//Array as sent via DMA from the remote SPI slave

uint8_t remoteArray[256];
for (int i = 0; i < 256; i++) {  // int counter: a uint8_t can never reach 256
    remoteArray[i] = static_cast<uint8_t>(i);
}

//Array as received on the SPI master

for (int i = 0; i < 256; i++) {
    qDebug() << masterArray[i];
}

I expected the masterArray output to be 0, 1, 2, 3, 4, 5, 6, 7, 8, etc., but each group of 4 bytes is reversed, i.e. 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8. I can reorder these in a loop, but I'd like to understand what's happening here, as well as the most performant way to ensure the data is correctly ordered.

If I use an SPI transfer word length of 8 on the SPI master, the data is correctly ordered. However I need to use a word length of 32 bits for hardware-specific reasons.

JonB

@jars121 said in Most performant byte reordering:

The SPI master initiates transfers with a word length of 32 bits.

I know nothing about "SPI", but from the code & output you show it looks like it is being sent as a 32-bit integer in reverse order?

Christian Ehrlicher

@jars121 said in Most performant byte reordering:

as well as the most performant way to ensure the data is correctly ordered.

How fast this is depends on how well your compiler optimizes it into machine code. A simple loop like this should be enough:

void reverse(const char *in, char *out)
{
    for (int i = 0; i < 256; i += 4) {
        const auto ofs = i;  // i already advances 4 bytes per group
        out[ofs + 0] = in[ofs + 3];
        out[ofs + 1] = in[ofs + 2];
        out[ofs + 2] = in[ofs + 1];
        out[ofs + 3] = in[ofs + 0];
    }
}

Kent-Dorfman

The first thing you need to do is to verify that your remote slave is in fact sending the bytes in the order that you think it is. It probably is not.

jars121

@JonB said in Most performant byte reordering:

@jars121 said in Most performant byte reordering:

The SPI master initiates transfers with a word length of 32 bits.

I know nothing about "SPI", but from the code & output you show it looks like it is being sent as a 32-bit integer in reverse order?

Thanks for your input. It certainly appears that way, but as I'll detail in a response below, the data on the wire is in the correct order.

@Christian-Ehrlicher said in Most performant byte reordering:

@jars121 said in Most performant byte reordering:

as well as the most performant way to ensure the data is correctly ordered.

How fast this is depends on how well your compiler optimizes it into machine code. A simple loop like this should be enough:
void reverse(const char *in, char *out)
{
    for (int i = 0; i < 256; i += 4) {
        const auto ofs = i;  // i already advances 4 bytes per group
        out[ofs + 0] = in[ofs + 3];
        out[ofs + 1] = in[ofs + 2];
        out[ofs + 2] = in[ofs + 1];
        out[ofs + 3] = in[ofs + 0];
    }
}

Thanks for providing that! This is the loop approach I'd already tested which works perfectly well. I'm hoping to understand why the ordering issue is occurring so perhaps another approach could be explored.

@Kent-Dorfman said in Most performant byte reordering:

The first thing you need to do is to verify that your remote slave is in fact sending the bytes in the order that you think it is. It probably is not.

I've checked the MISO line with my oscilloscope and can see that the data out of the slave is in the correct order. I.e. 0, 1, 2, 3, 4, 5, 6, 7, 8. This leads me to believe that the 32-bit SPI word length is the culprit here and is using a reverse byte order for some reason.

jars121

I've come across the following, which is included in the description of the spi_transfer struct within the Linux kernel SPI driver:

In-memory data values are always in native CPU byte order, translated from the wire byte order (big-endian except with SPI_LSB_FIRST)

I've tried setting SPI_LSB_FIRST, but this has no impact (despite not returning an error), so it may be a hardware limitation.
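
To make that kernel comment concrete, here is a minimal sketch of what it implies for a 32-bit word, assuming the first byte on the wire becomes the most significant byte of the word. The function names (assembleWord, storeNative, hostIsLittleEndian) are hypothetical helpers for illustration, not part of the kernel or Qt.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Model of a 32-bit SPI word shifted in MSB first (assumption):
// the first byte on the wire ends up as the most significant byte.
std::uint32_t assembleWord(const std::uint8_t *wire)
{
    assert(wire != nullptr);
    return (std::uint32_t(wire[0]) << 24) |
           (std::uint32_t(wire[1]) << 16) |
           (std::uint32_t(wire[2]) << 8)  |
            std::uint32_t(wire[3]);
}

// Store the word in memory in native CPU byte order and
// return the bytes in increasing address order.
std::array<std::uint8_t, 4> storeNative(std::uint32_t word)
{
    std::array<std::uint8_t, 4> mem{};
    std::memcpy(mem.data(), &word, sizeof word);
    return mem;
}

// True if the least significant byte is stored at the lowest address.
bool hostIsLittleEndian()
{
    const std::uint16_t probe = 1;
    std::uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little-endian master, the wire bytes 0, 1, 2, 3 form the word 0x00010203, and reading that word back byte by byte gives 3, 2, 1, 0, which matches the output I'm seeing. With an 8-bit word length each byte is its own word, so there is nothing to reorder.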

SimonSchroeder

The reason for this is most likely endianness: there is big endian and little endian. If the two sides use different byte orders, every multi-byte value arrives with its bytes reversed. In a simplified view, Intel x86 was the only one doing little endian and everybody else was doing big endian. In the modern world, ARM processors share little endian with x86 but can be switched to big endian.

However, I cannot provide you with any short-cut solution as I don't know SPI either. I would expect that all processors have a way to do the byte swap efficiently (maybe some SSE on x86). https://stackoverflow.com/questions/105252/how-do-i-convert-between-big-endian-and-little-endian-values-in-c lists some built-in commands to do byte swaps with VS and GCC.

jars121

Thanks for your input everyone. This is definitely an Endianness issue. In the end I've packaged each 32-bit sequence in reverse order on the remote processor so the data is received and parsed correctly on the SPI Master. I had hoped there was a hardware configuration that would change the SPI parsing Endianness but it looks like there wasn't.

JonB

@jars121
Now that we are happy you do indeed need to swap the bytes/endianness, let's go back to your original question:

Most performant byte reordering

The algorithm @Christian-Ehrlicher showed you does indeed do the job, simply. But is it the most "performant"? I haven't looked at the code it generates, and I don't know how clever compiling optimized might make it.

But "byte swappers" have been around for a long time in C/C++. Presumably they can take advantage of machine code to be efficient. You don't say which platform/compiler you are on, but I note (for 32-bit) that MSVC has

unsigned long _byteswap_ulong(unsigned long value);

and GCC has

uint32_t __builtin_bswap32 (uint32_t x)

If you are going to do this a lot and really care about "performant" you might examine how these compare to your own code?! :)
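
As a rough sketch of how those intrinsics could be applied to a whole buffer: the helper names swap32 and swapWordsInPlace are made up for this example, and the buffer length is assumed to be a multiple of 4. The memcpy calls avoid alignment and aliasing problems and are normally optimized away.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_ulong
#endif

// Swap the byte order of one 32-bit value, using a compiler
// intrinsic where one is known and plain shifts otherwise.
inline std::uint32_t swap32(std::uint32_t v)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(v);
#elif defined(_MSC_VER)
    return _byteswap_ulong(v);
#else
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
#endif
}

// Reverse the bytes of every 4-byte group in buf, in place.
void swapWordsInPlace(std::uint8_t *buf, std::size_t len)
{
    assert(len % 4 == 0);
    for (std::size_t i = 0; i < len; i += 4) {
        std::uint32_t w;
        std::memcpy(&w, buf + i, sizeof w);
        w = swap32(w);
        std::memcpy(buf + i, &w, sizeof w);
    }
}
```

Calling swapWordsInPlace(masterArray, 256) after each transfer would turn 3, 2, 1, 0, 7, 6, 5, 4 back into 0, 1, 2, 3, 4, 5, 6, 7. Whether it actually beats the plain loop is something to measure on the target.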

artwaw

@jars121 if you know that your source uses a certain endianness, you can make use of this https://doc.qt.io/qt-5/qtendian.html#details and save yourself trouble?

JonB

@jars121
From @artwaw's link, one of qFromLittleEndian()/qFromBigEndian() looks like it will do your swapping, and only if it is necessary on your platform. Whether it does it efficiently I don't know because I didn't look at its definition....

kkoehne

@JonB said in Most performant byte reordering:

Whether it does it efficiently I don't know because I didn't look at its definition....

It's not hard to find the definition though ...

https://code.woboq.org/qt5/qtbase/src/corelib/global/qendian.cpp.html

You see that there are special #ifdefs for SSSE3, AVX2, and SSE2. And because Thiago (the Qt Core maintainer) works for Intel, I think it's most likely rather optimized, at least on the x86/x64 architectures ;)

JonB

@kkoehne
Indeed. And the fact that there are "special calls" in code looks promising. But knowing when those code cases apply and whether they are "performant" compared to one's own C++ loop is beyond me! Hence left as an exercise to the reader ;-)

Christian Ehrlicher

You're aware that we're talking about 256 bytes here? How high is the data rate that we need to discuss whether SIMD instructions are really needed? Measure before use!
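
A rough way to measure, as a sketch only: time many runs of the simple loop over a 256-byte buffer with std::chrono. The names reverseGroups and averageNsPerCall are invented for this example, and a micro-benchmark like this only gives a ballpark figure that depends on compiler flags and hardware.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Byte-wise reversal of each complete 4-byte group.
void reverseGroups(const std::uint8_t *in, std::uint8_t *out, std::size_t len)
{
    for (std::size_t i = 0; i + 3 < len; i += 4) {
        out[i + 0] = in[i + 3];
        out[i + 1] = in[i + 2];
        out[i + 2] = in[i + 1];
        out[i + 3] = in[i + 0];
    }
}

// Average time per 256-byte reorder, in nanoseconds.
double averageNsPerCall(std::size_t iterations)
{
    assert(iterations > 0);
    std::uint8_t in[256];
    std::uint8_t out[256];
    for (int i = 0; i < 256; ++i) {
        in[i] = static_cast<std::uint8_t>(i);
    }
    volatile std::uint8_t sink = 0;  // keeps the result observable
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < iterations; ++n) {
        in[0] = static_cast<std::uint8_t>(n);  // vary the input each run
        reverseGroups(in, out, sizeof in);
        sink = out[3];
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Compare the result of something like averageNsPerCall(1000000) with the time one SPI transfer of 256 bytes takes; if the reorder is a tiny fraction of that, SIMD will not buy you anything.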

SimonSchroeder

@kkoehne said in Most performant byte reordering:

You see that there are special #ifdefs for SSSE3, AVX2, and SSE2. And because Thiago (the Qt Core maintainer) works for Intel, I think it's most likely rather optimized, at least on the x86/x64 architectures ;)

There are #ifdefs distinguishing between different platforms. It is not decided at runtime which version you choose. I am not sure for which platform Qt is precompiled (and most people will use a precompiled version of Qt). Most definitely you will not get AVX2. For that extra bit of performance (if there is some) you would need to compile Qt yourself accordingly. And this totally depends on which processors (up to which age) you target.
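
To illustrate what compile-time selection means, here is a small sketch. It is not Qt's code; compiledSimdPath is a made-up function, and the macros are the ones GCC and Clang predefine depending on flags such as -mavx2 or -mssse3 (MSVC signals SSE2 differently, so this is an assumption about the toolchain).

```cpp
#include <cassert>

// Reports which branch the preprocessor kept when this file was
// compiled. The choice is fixed in the binary and does not depend
// on the CPU the program later runs on.
const char *compiledSimdPath()
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "generic";
#endif
}
```

Built with default x86-64 flags this typically reports SSE2, even on a CPU that supports AVX2, which is why a precompiled library will not use the faster path unless it was built with the matching flags.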

Also note that the source code only has SIMD implementations for x86. For other processors, like ARM, there is no optimization.