[Unsolved] casts, if's and array-access.
-
@J.Hilk I don't think SIMD will improve anything in this case. SIMD helps when you have many iterations, but here there is only one. It would probably take longer to load the data into the appropriate SIMD registers than to do those 3 if's.
-
You could in principle go with one `if`, but as @sierdzio said: benchmark your code first and make sure this is a bottleneck. Here's one suggestion how you could "simplify" the checking:

```cpp
T m_data[15][15][255];
static const T * const end = &m_data[14][14][254];

// Then in the function that makes the check:
T * item = &m_data[x][y][z]; //< You may need to modify this, so there's no dereferencing, but you get the point
return item > end ? defaultValue : *item;
```
-
Thanks to everyone for the contributions.
As suggested, I did some testing with the 3 methods discussed in this thread.
I'll post my code and my results for further discussion:
It's my first test in this direction, so I'm not sure if I've done everything like I should have. I forced the random numbers to be valid, to guarantee the max time for each validation check in each method. I also fell back to Qt for this...
```cpp
// storage class
// ----------------------------------------- header -------------------------
#ifndef DATASTORAGE_H
#define DATASTORAGE_H

#include <QObject>

class dataStorage : public QObject
{
    Q_OBJECT
public:
    explicit dataStorage(QObject *parent = nullptr);

    uint64_t &getvalueMethod_1(const uint &x, const uint &y, const uint &z);
    uint64_t &getvalueMethod_2(const uint &x, const uint &y, const uint &z);
    uint64_t &getvalueMethod_3(const uint &x, const uint &y, const uint &z);

signals:

public slots:

protected:
    uint64_t m_data[16][16][256];
    const uint64_t *m_last;
};

#endif // DATASTORAGE_H

// ------------------------------------ source -------------------------------
#include "datastorage.h"

dataStorage::dataStorage(QObject *parent) : QObject(parent)
{
    for (int x(0); x < 16; x++)
        for (int y(0); y < 16; y++)
            for (int z(0); z < 256; z++)
                m_data[x][y][z] = qrand();
    m_last = &m_data[15][15][255];
}

uint64_t &dataStorage::getvalueMethod_1(const uint &x, const uint &y, const uint &z)
{
    if (x < 16)
        if (y < 16)
            if (z < 256)
                return m_data[x][y][z];
    return m_data[0][0][0];
}

uint64_t &dataStorage::getvalueMethod_2(const uint &x, const uint &y, const uint &z)
{
    if (x < 16 && y < 16 && z < 256)
        return m_data[x][y][z];
    return m_data[0][0][0];
}

uint64_t &dataStorage::getvalueMethod_3(const uint &x, const uint &y, const uint &z)
{
    uint64_t *item = &m_data[x][y][z];
    return item > m_last ? m_data[0][0][0] : *item;
}

// ------------------------------------- main --------------------------------
#include "datastorage.h"
#include <QCoreApplication> // was missing in the original listing
#include <QDebug>
#include <QVector>
#include <QElapsedTimer>

uint getRand(int max, int min)
{
    return rand() % (max - min + 1) + min;
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    dataStorage dStorage;

    QVector<QVector<uint> > vindex;
    for (uint32_t i = 0; i < 10000000; i++)
        vindex.append(QVector<uint>({getRand(15, 0), getRand(15, 0), getRand(255, 0)}));

    QElapsedTimer eTime;
    uint64_t value;
    qint64 average1(0), average2(0), average3(0);

    for (int k(0); k < 20; k++) {
        qDebug() << "Test-cycle:" << k + 1;

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_1(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 1:" << eTime.elapsed();
        average1 += eTime.elapsed();

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_2(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 2:" << eTime.elapsed();
        average2 += eTime.elapsed();

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_3(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 3:" << eTime.elapsed();
        average3 += eTime.elapsed();
    }

    average1 /= 20;
    average2 /= 20;
    average3 /= 20;
    qDebug() << "Average Time for Method 1:" << average1;
    qDebug() << "Average Time for Method 2:" << average2;
    qDebug() << "Average Time for Method 3:" << average3;

    return a.exec();
}
```
the result was:
- Average Time for Method 1: 11757ms
- Average Time for Method 2: 11587ms
- Average Time for Method 3: 12012ms
My conclusion from this would be:
Overall the 3 methods are equally fast!
With method 3, the pointer check, being slightly slower, and the combined if-check (method 2) being the fastest -
Can you pull the assembly for those three?
Additional notes:

- You will want to make the methods inline, otherwise I feel the `call` instruction cost is pretty significant compared to the function's body (and also the `QVector::at` calls, where the arguments are taken from the precomputed array).
- Edit: Also have a brief look at this table.
- I couldn't see `getRand` defined, but if it does what I think it does, then you have a systematic error - you're accessing only real elements of the array (not out-of-bounds ones). Which may put test 3 at a disadvantage (or it may not, I'm not sure either way).
- There's no difference, and can never be any, between:

  ```cpp
  if (x < 16) if (y < 16) if (z < 256) return m_data[x][y][z];
  ```

  and

  ```cpp
  if (x < 16 && y < 16 && z < 256)
  ```

  which is due to the short-circuit evaluation of `&&` and `||`. That's also the most probable reason you get virtually identical results for tests 1 and 2.
-
hi, @kshegunov
Like you suggested, I declared the `getvalueMethod_X` functions inline. `getRand` is defined right above main:

```cpp
uint getRand(int max, int min) { return rand() % (max - min + 1) + min; }
```

But you're right, it would be better to also have values that are outside the intended range, so I made the following changes:

```cpp
for (uint32_t i = 0; i < 3333333; i++)
    vindex.append(QVector<uint>({getRand(15, 0), getRand(15, 0), getRand(255, 0)}));
for (uint32_t i = 3333333; i < 6666666; i++)
    vindex.append(QVector<uint>({getRand(30, 15), getRand(15, 0), getRand(255, 0)}));
for (uint32_t i = 6666666; i < 10000000; i++)
    vindex.append(QVector<uint>({getRand(15, 0), getRand(15, 0), getRand(400, 254)}));
```

and pushed the number of cycles to 60.
I'm however not sure what you mean with:

> and also the QVector::at where the arguments are taken from the precomputed array

AFAIK `uint64_t m_data[16][16][256];` doesn't have an `at` operator. I changed the initialization however, for the CPU to have one less step:

```cpp
inline uint64_t &getvalueMethod_3(const uint &x, const uint &y, const uint &z)
{
    uint64_t *item(&m_data[x][y][z]);
    return item > m_last ? m_data[0][0][0] : *item;
}
```
The result with those changes:
- Average Time for Method 1: 10813
- Average Time for Method 2: 10819
- Average Time for Method 3: 11313
Overall 10% faster than previously, so the call instruction did indeed take up a significant amount of time!
Methods 1 and 2 are even closer, as expected.
But no change for method 3. -
@J.Hilk said in casts, if's and array-access.:

> I'm however not sure what you mean with

I meant here:

```cpp
value = dStorage.getvalueMethod_1(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
```

but I just realized `QVector::at` returns a const reference, so never mind that. However, you may want to consider using a statically sized C-style array, as you may be getting a lot of cache misses due to the implicit-sharing pointer indirection from the vector - another thing to try.

> But no change for method 3.

I never gave any guarantees it'd be faster; that's why you're profiling to begin with ;)
And while on the subject, did you get that assembly? And do say, how do you compile your tests - with `-g -O2`? -
I actually tested it with '[]' calls and it doubled the execution time.

> may be getting a lot of cache misses due to the implicit sharing pointer indirection from the vector

the variables are passed as:

```cpp
inline uint64_t &getvalueMethod_2(const uint &x, const uint &y, const uint &z)
```

With `QVector::at` returning a const reference, I'm directly passing the memory, am I not?

> And while on the subject, did you get that assembly?

I'm sorry, but I'm not familiar with that term! It's something I would do to start my lawn mower. But that's probably just me as a non-native speaker x)

> And do say how do you compile your tests, with -g -O2?

Haven't touched the default compiler settings. My project, which would potentially use this class, would use the default settings as well, so I thought better not to touch them.
I did however make an embarrassing mistake, that I shall not comment further on...
but I'm now down to 300-1000 nsec per function call. The 3 functions are virtually the same in access time.
Safe to say: "Use whatever is better for your OCD-Code-structure!"
-
By assembly, @kshegunov meant this: https://stackoverflow.com/questions/137038/how-do-you-get-assembler-output-from-c-c-source-in-gcc
By comparing the assembly code that the compiler produces, we can really see how well optimized the code is after the compiler has done its magic.

> I did however make an embarrassing mistake, that I shall not comment further on...
> but I'm now down to 300-1000 nsec per function call. The 3 functions are virtually the same in access time.

Oh, now you've made me super curious! :-)
All in all, good to know that all these solutions are equally valid. We've all learned a bit here.
-
@J.Hilk said in casts, if's and array-access.:

> I actually tested it with '[]' calls and it doubled the execution time.

This is very odd; are you sure you didn't do anything strange? I'd expect `[]` to be a bit faster, not noticeably slower.
> I'm directly passing the memory, am I not?

Yes, but the `QVector` has a data pointer inside, which is there to implement implicit sharing. So you're actually getting something like:

```cpp
QVectorData * d;           //< Already in the vector implementation
return d->elements[index]; //< Indirection through the data pointer
```

albeit most of it is probably inlined by the compiler. Still, the memory fetch is there.
> I'm sorry, but I'm not familiar with that term! It's something I would do to start my lawn mower. But that's probably just me as a non-native speaker x)

Okay, I meant the compiled binary code - the assembly. You can get that from Creator by selecting Debug > Operate by instruction from the menu when you have an active debug session.
> Haven't touched the default compiler settings.

Well, perhaps you should, or rather create a profile target for Creator. It'd include the `-g -O2` switches if you work with gcc, meaning the compiler will generate debug info, but will also perform the standard release optimizations. The idea is to have the optimized code (what you'd have in release) without stripping all the debug info - allowing for reasonable profiling.

> but I'm now down to 300-1000 nsec per function call.

That's in agreement with the table I posted earlier - a couple or so (short) instructions; a few clock ticks.

> The 3 functions are virtually the same in access time.

Don't rush it, we would still love to do some more dissecting ;P
I share @sierdzio's curiosity. -
Hi,
Shouldn't it rather be:

```cpp
QVector<uint> paramVector = vindex.at(i); // uint, to match vindex's element type
value = dStorage.getvalueMethod_1(paramVector.at(0), paramVector.at(1), paramVector.at(2));
```

to ensure the compiler can optimise the access a bit?