[Unsolved] casts, if's and array-access.
-
@J.Hilk I don't think SIMD will improve anything in this case. SIMD helps when you have many iterations, but here there is only one. It would probably take longer to load the data into the appropriate SIMD registers than to do those 3 if's.
-
You could in principle go with one `if`, but as @sierdzio said: benchmark your code first and make sure this is a bottleneck. Here's one suggestion how you could "simplify" the checking:

```cpp
T m_data[15][15][255];
static const T * const end = &m_data[14][14][254];

// Then in the function that makes the check:
T * item = &m_data[x][y][z]; //< You may need to modify this, so there's no dereferencing, but you get the point
return item > end ? defaultValue : *item;
```
-
Thanks to everyone for the contributions.
As suggested, I did some testing with the 3 methods discussed in this thread.
I'll post my code and my results for further discussion:
It's my first test in this direction, so I'm not sure if I've done everything like I should have. I forced the random numbers to be valid, to guarantee the max time for each validation check in each method. I also fell back to Qt for this...
```cpp
// storage class
// ----------------------------------------- header -------------------------
#ifndef DATASTORAGE_H
#define DATASTORAGE_H

#include <QObject>

class dataStorage : public QObject
{
    Q_OBJECT
public:
    explicit dataStorage(QObject *parent = nullptr);

    uint64_t &getvalueMethod_1(const uint &x, const uint &y, const uint &z);
    uint64_t &getvalueMethod_2(const uint &x, const uint &y, const uint &z);
    uint64_t &getvalueMethod_3(const uint &x, const uint &y, const uint &z);

signals:

public slots:

protected:
    uint64_t m_data[16][16][256];
    const uint64_t *m_last;
};

#endif // DATASTORAGE_H

// ------------------------------------ source -------------------------------
#include "datastorage.h"

dataStorage::dataStorage(QObject *parent) : QObject(parent)
{
    for (int x(0); x < 16; x++)
        for (int y(0); y < 16; y++)
            for (int z(0); z < 256; z++)
                m_data[x][y][z] = qrand();
    m_last = &m_data[15][15][255];
}

uint64_t &dataStorage::getvalueMethod_1(const uint &x, const uint &y, const uint &z)
{
    if (x < 16)
        if (y < 16)
            if (z < 256)
                return m_data[x][y][z];
    return m_data[0][0][0];
}

uint64_t &dataStorage::getvalueMethod_2(const uint &x, const uint &y, const uint &z)
{
    if (x < 16 && y < 16 && z < 256)
        return m_data[x][y][z];
    return m_data[0][0][0];
}

uint64_t &dataStorage::getvalueMethod_3(const uint &x, const uint &y, const uint &z)
{
    uint64_t *item = &m_data[x][y][z];
    return item > m_last ? m_data[0][0][0] : *item;
}

// ------------------------------------- main --------------------------------
#include "datastorage.h"
#include <QCoreApplication> // was missing in the original listing
#include <QDebug>
#include <QVector>
#include <QElapsedTimer>

uint getRand(int max, int min)
{
    return rand() % (max - min + 1) + min;
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    dataStorage dStorage;

    QVector<QVector<uint> > vindex;
    for (uint32_t i = 0; i < 10000000; i++)
        vindex.append(QVector<uint>({getRand(15, 0), getRand(15, 0), getRand(255, 0)}));

    QElapsedTimer eTime;
    uint64_t value;
    qint64 average1(0), average2(0), average3(0);

    for (int k(0); k < 20; k++) {
        qDebug() << "Test-cycle:" << k + 1;

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_1(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 1:" << eTime.elapsed();
        average1 += eTime.elapsed();

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_2(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 2:" << eTime.elapsed();
        average2 += eTime.elapsed();

        eTime.start();
        for (int j(0); j < 10; j++)
            for (uint32_t i = 0; i < 10000000; i++)
                value = dStorage.getvalueMethod_3(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
        qDebug() << "Time for Method 3:" << eTime.elapsed();
        average3 += eTime.elapsed();
    }

    average1 /= 20;
    average2 /= 20;
    average3 /= 20;
    qDebug() << "Average Time for Method 1:" << average1;
    qDebug() << "Average Time for Method 2:" << average2;
    qDebug() << "Average Time for Method 3:" << average3;

    return a.exec();
}
```
the result was:
- Average Time for Method 1: 11757ms
- Average Time for Method 2: 11587ms
- Average Time for Method 3: 12012ms
My conclusion from this would be:
Overall the 3 methods are equally fast!
With method 3, the pointer check, being slightly slower, and the combined if-check (method 2) being the fastest -
Can you pull the assembly for those three?
Additional notes:

- You will want to make the methods inline, otherwise I feel the `call` instruction cost is pretty significant compared to the function's body (and also the `QVector::at` calls, where the arguments are taken from the precomputed array).
- Edit: Also have a brief look at this table.
- I couldn't see `getRand` defined, but if it does what I think it does, then you have a systematic error - you're accessing only real elements of the array (not out-of-bounds ones). Which may put test 3 at a disadvantage (or it may not, I'm not sure either way).
- There's no difference, and can never be any, between:

  ```cpp
  if (x < 16) if (y < 16) if (z < 256) return m_data[x][y][z];
  ```

  and

  ```cpp
  if (x < 16 && y < 16 && z < 256)
  ```

  which is due to the short-circuit evaluation of `&&` and `||`. That's also the most probable reason you get virtually identical results for tests 1 and 2.
-
hi, @kshegunov
Like you suggested, I declared the `getvalueMethod_X` functions inline. `getRand` is defined right above main:

```cpp
uint getRand(int max, int min) { return rand() % (max - min + 1) + min; }
```

But you're right, it would be better to also have values that are outside the intended range, so I made the following changes:

```cpp
for (uint32_t i = 0; i < 3333333; i++)
    vindex.append(QVector<uint>({getRand(15, 0), getRand(15, 0), getRand(255, 0)}));
for (uint32_t i = 3333333; i < 6666666; i++)
    vindex.append(QVector<uint>({getRand(30, 15), getRand(15, 0), getRand(255, 0)}));
for (uint32_t i = 6666666; i < 10000000; i++)
    vindex.append(QVector<uint>({getRand(15, 0), getRand(15, 0), getRand(400, 254)}));
```

and pushed the number of cycles to 60.
I'm however not sure what you mean with:

> and also the QVector::at where the arguments are taken from the precomputed array

AFAIK `uint64_t m_data[16][16][256];` doesn't have an `at` operator. I changed the initialization however, for the CPU to have one less step:

```cpp
inline uint64_t &getvalueMethod_3(const uint &x, const uint &y, const uint &z)
{
    uint64_t *item(&m_data[x][y][z]);
    return item > m_last ? m_data[0][0][0] : *item;
}
```
The result with those changes:
- Average Time for Method 1: 10813
- Average Time for Method 2: 10819
- Average Time for Method 3: 11313
Overall 10% faster than previously, so the call instruction did indeed take up a significant amount of time!
Methods 1 and 2 are even closer, as expected.
But no change for method 3. -
@J.Hilk said in casts, if's and array-access.:

> I'm however not sure what you mean with

I meant here:

```cpp
value = dStorage.getvalueMethod_1(vindex.at(i).at(0), vindex.at(i).at(1), vindex.at(i).at(2));
```

but I just realized `QVector::at` returns a const reference, so never mind that. However, you may want to consider using a statically sized C-style array, as you may be getting a lot of cache misses due to the implicit-sharing pointer indirection from the vector - another thing to try.

> But no change for method 3.

I never gave any guarantees it'd be faster; that's why you're profiling to begin with ;)
And while on the subject, did you get that assembly? And do say, how do you compile your tests - with `-g -O2`? -
I actually tested it with '[]' calls and it doubled the execution time.

> may be getting a lot of cache misses due to the implicit sharing pointer indirection from the vector

the variables are passed as:

```cpp
inline uint64_t &getvalueMethod_2(const uint &x, const uint &y, const uint &z)
```

With `QVector::at` returning a const reference, I'm directly passing the memory, am I not?

> And while on the subject, did you get that assembly?

I'm sorry, but I'm not familiar with that term! It's something I would do to start my lawn mower. But that's probably just me as a non-native speaker x)

> And do say how do you compile your tests, with -g -O2?

Haven't touched the default compiler settings. My project, which would potentially use this class, would use the default settings as well, so I thought better not to touch them.
I did however make an embarrassing mistake, that I shall not comment further on...
but I'm now down to 300-1000 nsec per function call. The 3 functions are virtually the same in access time.
Safe to say: "Use whatever is better for your OCD-Code-structure!"
-
By assembly, @kshegunov meant this: https://stackoverflow.com/questions/137038/how-do-you-get-assembler-output-from-c-c-source-in-gcc
By comparing the assembly code that the compiler produces, we can really see how well optimized the code is after the compiler has done its magic.

> I did however make an embarrassing mistake, that I shall not comment further on...
> but I'm now down to 300-1000 nsec per function call. The 3 functions are virtually the same in access time.

Oh, now you've made me super curious! :-)
All in all, good to know that all these solutions are equally valid. We've all learned a bit here.
-
@J.Hilk said in casts, if's and array-access.:

> I actually tested it with '[]' calls and it doubled the execution time.

This is very odd; are you sure you didn't do anything strange? I'd expect `[]` to be a bit faster, not noticeably slower.
> I'm directly passing the memory, am I not?

Yes, but the `QVector` has a data pointer inside, which is there to implement implicit sharing. So you're actually getting something like:

```cpp
QVectorData * d;           //< Already in the vector implementation
return d->elements[index]; //< Indirection through the data pointer
```

albeit most of it is probably inlined by the compiler. Still, the memory fetch is there.
> I'm sorry, but I'm not familiar with that term! It's something I would do to start my lawn mower. But that's probably just me as a non-native speaker x)

Okay, I meant the compiled binary code - the assembly. You can get that from Creator by selecting Debug > Operate by instruction from the menu when you have an active debug session.
> Haven't touched the default compiler settings.

Well, perhaps you should, or rather create a profile target for Creator. It'd include the `-g -O2` switches if you work with gcc, meaning the compiler will generate debug info, but will also perform the standard release optimizations. The idea is to have the optimized code (what you'd have in release) without stripping all the debug info - allowing for reasonable profiling.

> but I'm now down to 300-1000 nsec per function call.

That's in agreement with the table I posted earlier - a couple or so (short) instructions; a few clock ticks.

> The 3 functions are virtually the same in access time.

Don't rush it, we would still love to do some more dissecting ;P
I share @sierdzio's curiosity. -
Hi,
Shouldn't it rather be:

```cpp
QVector<uint> paramVector = vindex.at(i); // uint, to match vindex's element type
value = dStorage.getvalueMethod_1(paramVector.at(0), paramVector.at(1), paramVector.at(2));
```

to ensure the compiler can optimise the access a bit?