casts, if's and array-access.
-
Hello everyone,
I have the following situation.
I have a specific data-storage class that is accessed from all over my program, rapidly and very often, 90% of the time read-only. So my intention is to make the access as fast as possible.
The data is stored in a 3D member array, e.g. m_data[x][y][z];
To prevent access violations, I would have to add three if-statements to check whether the requested data point is part of the array. I believe, however, that this will take a significant amount of time.
The dimensions are [0-15], [0-15], [0-255], so my idea was that casting the values to something in range might do the trick, but as far as I know there is no half-byte type.
So maybe std::bitset<4> would work!? Is this the right approach, or am I overthinking it and should just stick with the three if-statements?
-
Assuming your variables are unsigned, this should be enough:
```cpp
if (x < 16 && y < 16 && z < 256)
```
All in a single statement. To gain a little bit more time, you can pass x, y, z as const refs or even do perfect forwarding. And of course, you can compile with -O3 to get more compiler optimisations.
However, what I recommend you do first (before even starting any implementation) is to write and run a benchmark on your existing code. Only when you have statistically significant data on your current solution will you be able to test whether other solutions really gain you anything.
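For a rough first measurement you don't even need the full application; a tiny harness around a stand-in accessor already gives ballpark numbers. A sketch in plain C++ (all names here are hypothetical, std::chrono only):

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the accessor being measured.
std::uint64_t table[16][16][256];

std::uint64_t lookup(unsigned x, unsigned y, unsigned z)
{
    // Bounds check, falling back to 0 for out-of-range indices.
    return (x < 16 && y < 16 && z < 256) ? table[x][y][z] : 0;
}

// Times `iterations` calls and returns nanoseconds per call. Run it a few
// times and look at the spread, not at a single number.
double benchLookup(unsigned iterations)
{
    volatile std::uint64_t sink = 0; // stops the optimiser from deleting the loop
    const auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i)
        sink += lookup(i % 16, (i / 16) % 16, i % 256);
    const auto t1 = std::chrono::steady_clock::now();
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    return double(ns) / iterations;
}
```

Compile it with the same optimisation flags you intend to ship with, otherwise the numbers are meaningless.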
-
Hi, thanks for the answer.
Are
```cpp
if (x < 16 && y < 16 && z < 256)
```
and
```cpp
if (x < 16) { if (y < 16) { if (z < 256) { ... } } }
```
handled differently? I assumed the compiler would reduce both down to the same process steps.
I still have in the back of my mind, from one of my earliest informatics lessons, that comparing is one of the heaviest operations, as in it takes the most steps.
Whereas casting - e.g. int to short - is one step, simply dropping part of the allocated memory. But that might also be a remnant of my assembler days...
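For what it's worth, the "cast into range" idea can be written as a bit-mask, since all three dimensions are powers of two. A sketch with made-up names - note it silently wraps out-of-range indices instead of rejecting them, which is a semantic change, not just a speed-up:

```cpp
#include <cstdint>

// Hypothetical storage; 16 x 16 x 256 as in the thread.
std::uint64_t g_data[16][16][256];

// Because 16 and 256 are powers of two, masking forces any index into
// range without a branch. An invalid index wraps around to a valid one
// instead of being rejected - different semantics from the if-checks.
std::uint64_t &maskedRef(unsigned x, unsigned y, unsigned z)
{
    return g_data[x & 0xFu][y & 0xFu][z & 0xFFu]; // 0-15, 0-15, 0-255
}
```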
You're right, I should do some testing before going forward.
-
@J.Hilk said in casts, if's and array-access.:
Are [...] handled differently? I assumed the compiler would reduce both down to the same process steps.
Check the assembly output to verify; I'm not sure. If you compress them into a single statement you definitely get the same result.
I still have in the back of my mind, from one of the earliest informatics lessons, that comparing is one of the heaviest operations, as in it takes the most steps. Whereas casting - e.g. int to short - is one step, simply dropping part of the allocated memory.
I think modern CPUs are heavily optimised in that area. Maybe you are right, I have not checked. That's why I recommend benchmarking, both the execution time and CPU ticks.
Another, perhaps a bit crazy, idea: use SIMD instructions, namely vec_cmpgt(a, b) to compare all 3 values in a single CPU operation. More info here.
-
@J.Hilk "maybe casting the values to something in range would do the trick" - in this case, if the caller passes an invalid index it will still get something, but is this the intended behaviour? What would you do in those three if-statements if one of the indices is out of range? Throw an exception? Return a default value? ...?
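To make the two obvious policies concrete, here's a hedged sketch (C++17, hypothetical names) - one accessor that signals failure in the return type and one that takes a caller-chosen default:

```cpp
#include <cstdint>
#include <optional>

// Sketch of two ways to make "index out of range" explicit instead of
// silently returning some element.
struct Storage {
    std::uint64_t m_data[16][16][256] = {};

    // Option 1: signal failure in the return type.
    std::optional<std::uint64_t> tryGet(unsigned x, unsigned y, unsigned z) const
    {
        if (x < 16 && y < 16 && z < 256)
            return m_data[x][y][z];
        return std::nullopt;
    }

    // Option 2: fall back to a caller-chosen default value.
    std::uint64_t getOr(unsigned x, unsigned y, unsigned z, std::uint64_t fallback) const
    {
        if (x < 16 && y < 16 && z < 256)
            return m_data[x][y][z];
        return fallback;
    }
};
```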
-
@J.Hilk I don't think SIMD will improve anything in this case. SIMD helps if you have many iterations, but in this case there is only one. It will probably take longer to load the data into the appropriate SIMD registers than to do those 3 ifs.
-
You could in principle go with one if, but as @sierdzio said: benchmark your code first and make sure this is a bottleneck. Here's one suggestion how you could "simplify" the checking:

```cpp
T m_data[16][16][256];
static const T * const end = &m_data[15][15][255];

// Then in the function that makes the check:
T * item = &m_data[x][y][z]; //< You may need to modify this, so there's no dereferencing, but you get the point
return item > end ? defaultValue : *item;
```
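One caveat worth spelling out about the pointer comparison: it only rejects addresses past the final element, so an out-of-range middle index that still lands inside the allocation is silently aliased to another element. A self-contained sketch (hypothetical names, flat offset arithmetic):

```cpp
#include <cstdint>

std::uint64_t m_data[16][16][256];
const std::uint64_t *const lastElem = &m_data[15][15][255];

// Sketch of the pointer-compare idea. Caveat: it only rejects addresses
// *past the final element*. An out-of-range middle index, e.g. y = 20
// with x = 0, still computes an address inside the array - it silently
// aliases a different element - so the check does not catch it.
std::uint64_t pointerGet(unsigned x, unsigned y, unsigned z)
{
    // Flat offset from the base, so nothing is dereferenced before the
    // comparison (still only a sketch, not sanctioned pointer arithmetic
    // for wildly wrong indices).
    const std::uint64_t *item = &m_data[0][0][0] + ((x * 16u + y) * 256u + z);
    return item > lastElem ? std::uint64_t(0) : *item;
}
```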
-
Thanks to everyone for the contribution.
As suggested, I did some testing with the three methods discussed in this thread.
I'll post my code and my results for further discussion:
It's my first test in this direction, so I'm not sure if I've done everything as I should have. I forced the random numbers to be valid, to guarantee the maximum time for each validation check in each method. I also fell back to Qt for this...
```cpp
// storage class
// ----------------------------------------- header -------------------------
#ifndef DATASTORAGE_H
#define DATASTORAGE_H

#include <QObject>

class dataStorage : public QObject
{
    Q_OBJECT
public:
    explicit dataStorage(QObject *parent = nullptr);

    uint64_t &getvalueMethod_1(const uint &x, const uint &y, const uint &z);
    uint64_t &getvalueMethod_2(const uint &x, const uint &y, const uint &z);
    uint64_t &getvalueMethod_3(const uint &x, const uint &y, const uint &z);

signals:

public slots:

protected:
    uint64_t m_data[16][16][256];
    const uint64_t *m_last;
};
#endif // DATASTORAGE_H

// ------------------------------ source ------------------------
#include "datastorage.h"

dataStorage::dataStorage(QObject *parent) : QObject(parent)
{
    for (int x(0); x < 16; x++)
        for (int y(0); y < 16; y++)
            for (int z(0); z < 256; z++)
                m_data[x][y][z] = qrand();

    m_last = &m_data[15][15][255];
}

uint64_t &dataStorage::getvalueMethod_1(const uint &x, const uint &y, const uint &z)
{
    if (x < 16)
        if (y < 16)
            if (z < 256)
                return m_data[x][y][z];
    return m_data[0][0][0];
}

uint64_t &dataStorage::getvalueMethod_2(const uint &x, const uint &y, const uint &z)
{
    if (x < 16 && y < 16 && z < 256)
        return m_data[x][y][z];
    return m_data[0][0][0];
}

uint64_t &dataStorage::getvalueMethod_3(const uint &x, const uint &y, const uint &z)
{
    uint64_t *item = &m_data[x][y][z];
    return item > m_last ? m_data[0][0][0] : *item;
}

// ------------------------------------- main ---------------------------------
#include "datastorage.h"
#include <QCoreApplication> // was missing; main uses QCoreApplication
#include <QDebug>
#include <QVector>
#include <QElapsedTimer>

uint getRand(int max, int min)
{
    return rand() % (max - min + 1) + min;
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    dataStorage dStorage;

    QVector<QVector<uint> > vindex;
    for (uint32_t i = 0; i < 10000000; i++)
        vindex.append(QVector<uint>({getRand(15,0), getRand(15,0), getRand(255,0)}));

    QElapsedTimer eTime;
    uint64_t value;
    qint64 average1(0), average2(0), average3(0);

    for (int k(0); k < 20; k++) {
        qDebug() << "Test-cycle:" << k + 1;

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_1(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 1: " << eTime.elapsed();
        average1 += eTime.elapsed();

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_2(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 2: " << eTime.elapsed();
        average2 += eTime.elapsed();

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_3(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 3: " << eTime.elapsed();
        average3 += eTime.elapsed();
    }

    average1 /= 20;
    average2 /= 20;
    average3 /= 20;

    qDebug() << "Average Time for Method 1:" << average1;
    qDebug() << "Average Time for Method 2:" << average2;
    qDebug() << "Average Time for Method 3:" << average3;

    return a.exec();
}
```
the result was:
- Average Time for Method 1: 11757ms
- Average Time for Method 2: 11587ms
- Average Time for Method 3: 12012ms
My conclusion from this would be:
Overall the three methods are equally fast!
With method 3, the pointer check, being slightly slower, and the inline if-check being the fastest.
-
Can you pull the assembly for those three?
-
Additional notes:
You will want to make those methods inline, otherwise I feel the call instruction cost is pretty significant compared to the function's body (and also the QVector::at where the arguments are taken from the precomputed array).
Edit: Have also a brief look at this table.
-
I couldn't see getRand defined, but if it does what I think it does, then you have a systematic error - you're accessing only real elements of the array (not out-of-bound ones). Which may put test 3 at a disadvantage (or it may not, I'm not sure either way).
-
There's no difference, and can never be any, between:
```cpp
if (x < 16) if (y < 16) if (z < 256) return m_data[x][y][z];
```
and
```cpp
if (x < 16 && y < 16 && z < 256)
```
which is due to the short-circuit evaluation of && and ||. That's also the most probable reason you get virtually identical results for tests 1 and 2.
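The short-circuit claim is easy to convince yourself of by counting evaluations; a tiny sketch with made-up names:

```cpp
// Both spellings evaluate the same conditions in the same order and stop
// at the first false one.
int evals = 0;

bool check(bool v) { ++evals; return v; }

bool nested(bool a, bool b, bool c)
{
    if (check(a))
        if (check(b))
            if (check(c))
                return true;
    return false;
}

bool combined(bool a, bool b, bool c)
{
    // && short-circuits: check(b) only runs if check(a) was true, etc.
    return check(a) && check(b) && check(c);
}
```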
-
Hi @kshegunov,
As you suggested, I wrapped the getvalueMethod_X functions in an inline statement. getRand is defined right above main:
```cpp
uint getRand(int max, int min)
{
    return rand() % (max - min + 1) + min;
}
```
But you're right, it would be better to also have values that are outside the intended range, so I made the following changes:
```cpp
for (uint32_t i = 0; i < 3333333; i++)
    vindex.append(QVector<uint>({getRand(15,0), getRand(15,0), getRand(255,0)}));
for (uint32_t i = 3333333; i < 6666666; i++)
    vindex.append(QVector<uint>({getRand(30,15), getRand(15,0), getRand(255,0)}));
for (uint32_t i = 6666666; i < 10000000; i++)
    vindex.append(QVector<uint>({getRand(15,0), getRand(15,0), getRand(400,254)}));
```
and pushed the number of cycles to 60.
I'm however not sure what you mean by:
and also the QVector::at where the arguments are taken from the precomputed array
AFAIK
```cpp
uint64_t m_data[16][16][256];
```
doesn't have an at operator. I did change the initialization, however, so the CPU has one step less:
```cpp
inline uint64_t &getvalueMethod_3(const uint &x, const uint &y, const uint &z)
{
    uint64_t *item(&m_data[x][y][z]);
    return item > m_last ? m_data[0][0][0] : *item;
}
```
The result with those changes:
- Average Time for Method 1: 10813 ms
- Average Time for Method 2: 10819 ms
- Average Time for Method 3: 11313 ms
Overall about 10% faster than before, so the call instruction did indeed take up a significant amount of time!
Methods 1 and 2 are even closer, as expected.
But no change for method 3.
-
@J.Hilk said in casts, if's and array-access.:
I'm however not sure what you mean by
I meant here:
```cpp
value = dStorage.getvalueMethod_1(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
```
but I just realized QVector::at returns a const reference, so never mind that. However, you may want to consider using a statically sized C-style array, as you may be getting a lot of cache misses due to the implicit-sharing pointer indirection from the vector - another thing to try.
But no change for method 3.
I never gave any guarantees it'd be faster; that's why you're profiling to begin with ;)
And while on the subject, did you get that assembly? And do say how you compile your tests - with -g -O2?
-
I actually tested it with '[]' calls and it doubled the execution time.
may be getting a lot of cache misses due to the implicit sharing pointer indirection from the vector
the variables are passed as:
```cpp
inline uint64_t &getvalueMethod_2(const uint &x, const uint &y, const uint &z)
```
with QVector::at returning a const reference, I'm directly passing the memory, am I not?
And while on the subject, did you get that assembly?
I'm sorry, but I'm not familiar with that term! It's something I would do to start my lawn mower. But that's probably just me as a non-native speaker x)
And do say how do you compile your tests, with -g -O2?
Haven't touched the default compiler settings. My project that would potentially use this class would use the default settings as well, so I thought better not to touch them.
I did however make an embarrassing mistake, that I shall not comment further on...
but I'm now down to 300-1000 nsec per function call. The 3 functions are virtually the same in access time.
Safe to say: "Use whatever is better for your OCD code structure!"
-
By assembly, @kshegunov meant this: https://stackoverflow.com/questions/137038/how-do-you-get-assembler-output-from-c-c-source-in-gcc
By comparing the assembly code that the compiler produces, we can really see how well optimized the code is after the compiler has done its magic.
I did however make an embarrassing mistake, that I shall not comment further on...
but I'm now down to 300-1000 nsec per function call. The 3 functions are virtually the same in access time.
Oh, now you've made me super curious! :-)
All in all, good to know that all these solutions are equally valid. We've all learned a bit here.
-
@J.Hilk said in casts, if's and array-access.:
I actually tested it with '[]' calls and it doubled the execution time.
This is very odd, are you sure you didn't do anything strange? I'd expect it to be a bit faster, but not noticeably.
I'm directly passing the memory, am I not?
Yes, but the QVector has a data pointer inside which is there to implement implicit sharing. So you're actually getting something like:
```cpp
QVectorData * d;           //< Already in the vector implementation
return d->elements[index]; //< Indirection through the data pointer
```
albeit most of it is probably inlined by the compiler. Still the memory fetch is there.
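A contiguous plain-struct array for the precomputed indices would avoid that hop entirely; a sketch (hypothetical names, plain std::vector):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Instead of QVector<QVector<uint>> - one heap block per inner vector plus
// the implicit-sharing d-pointer hop - keep the precomputed indices in one
// contiguous array so iterating them is cache-friendly.
struct Index { unsigned x, y, z; };

std::vector<Index> makeIndices(std::size_t n)
{
    std::vector<Index> v;
    v.reserve(n); // one allocation, contiguous storage
    for (std::size_t i = 0; i < n; ++i)
        v.push_back({unsigned(i % 16), unsigned((i / 16) % 16), unsigned(i % 256)});
    return v;
}
```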
I'm sorry, but I'm not familiar with that term! It's something I would do to start my lawn mower. But that's probably just me as a non-native speaker x)
Okay, I mean the compiled binary code - the assembly. You can get that from Creator by selecting Debug > Operate by instruction from the menu when you have an active debug session.
Haven't touch the default compiler settings.
Well, perhaps you should - or rather, create a profile target in Creator. It'd include the -g -O2 switches if you work with gcc, meaning the compiler will generate debug info, but will also perform the standard release optimizations. The idea is to have the optimized code (what you'd have in release) without stripping all the debug info - allowing for reasonable profiling.
but I'm now down to 300-1000 nsec per function call.
That's in agreement with the table I posted earlier - a couple or so (short) instructions; a few clock ticks.
The 3 functions are virtually the same in access time.
Don't rush it, we would still love to do some more dissecting ;P
I share @sierdzio's curiosity.
-
Hi,
Shouldn't it rather be:
```cpp
QVector<uint> paramVector = vindex.at(i);
value = dStorage.getvalueMethod_1(paramVector.at(0), paramVector.at(1), paramVector.at(2));
```
to ensure the compiler can optimise the access a bit?