To Mutex or not to Mutex
-
Hi everyone,
I'm considering dropping a QMutex lock/unlock call for an extra bit of performance, because we all strive for peak efficiency, right!? 😉 I would like some opinions on it.
I have a QByteArray that one thread updates constantly in a while loop. This happens at the end, after a lot of different calculations beforehand. The updating is done via memcpy.
The other thread periodically (via QTimer) takes the data from that QByteArray and, again using memcpy, creates a QImage out of it. This is the only cross-thread operation.
The only case where I could see a conflict happen is when both memcpy calls happen at the exact same time and one operation is stopped halfway through, so that the other operation outruns the first one. That should be a very, very rare situation. If it happens at all, I could imagine that the RAM access call is maybe queued for the different threads/operations?
I don't really know how the hardware does this; maybe someone has more insight on it.
Greetings
-
@J.Hilk
Yes, the conflict will happen on simultaneous thread memcpys, with a potential partial-copy thread switch. No, you are "wicked" if you don't mutex/lock :)
You should not be thinking about "the RAM access call is maybe queued for the different threads/operations" or "how the hardware does this".
You are at the mercy of whatever memcpy happens to do. That is compiler-specific at best, and could also vary between debug and release, optimization levels, etc. For all you know it could be a byte-by-byte software loop with nothing special happening in hardware, and totally interruptible. So you must mutex, if you want me to use & trust your code :)
In any case:
I have a QByteArray that one thread updates constantly in a while loop. This happens at the end, after a lot of different calculations beforehand. The updating is done via memcpy
Do you mean there is a lot of QByteArray calculation & updating before you hit the shared memcpy, or do you mean you're doing lots of memcpys as you go along? If it's the former, the overhead of a memcpy lock may be minimal compared to the calculation/updating time? If it's the latter, can't you do all the calculation & updating in a thread-only memory area and then do just one final shared-memory memcpy, so there is only a need for the mutex on that one occurrence?
-
@JonB said in To Mutex or not to Mutex:
@J.Hilk
Yes, the conflict will happen on simultaneous thread memcpys, with a potential partial-copy thread switch. No, you are "wicked" if you don't mutex/lock :)
You should not be thinking about "the RAM access call is maybe queued for the different threads/operations" or "how the hardware does this".
You are at the mercy of whatever memcpy happens to do. That is compiler-specific at best, and could also vary between debug and release, optimization levels, etc. For all you know it could be a byte-by-byte software loop with nothing special happening in hardware, and totally interruptible. So you must mutex, if you want me to use & trust your code :)
I've been reading up on it; it's way more complicated than one thinks, obviously. All cores share the same RAM bus, therefore RAM access is actually queued. There are systems with more buses, but they are few and far between.
However, my program is most likely not even going down to that level. The two threads will have the memory of the QByteArray in their respective caches, and here is where cache coherence comes into play.
If a cache memory block is written to, it will not update the other caches until the write operation is done and a "sufficient" amount of time has passed.
So I should be fine :)
In any case:
Do you mean there is a lot of QByteArray calculation & updating before you hit the shared memcpy, or do you mean you're doing lots of memcpys as you go along? If it's the former, the overhead of a memcpy lock may be minimal compared to the calculation/updating time? If it's the latter, can't you do all the calculation & updating in a thread-only memory area and then do just one final shared-memory memcpy, so there is only a need for the mutex on that one occurrence?
That's actually what I'm doing. I'm "misusing" the QByteArray as a char* and memory allocation. Also, I have 2 QByteArrays: one to do operations on and one to store the result in, to make it accessible to the other thread. I do this to prevent "interim states".
-
@J.Hilk
You will do as you please :) But I have no idea why you're going down to some cache level. I see three issues at least:
- That link talks about "In a shared memory multiprocessor system with a separate cache memory for each processor". What makes you think your threads have anything to do with separate processors?
- Caches have some limited size; memcpy may be exceeding that.
- I believe you're still assuming a memcpy is some "atomic" operation. Why should it be, when it may be just a byte-by-byte software loop? I don't see how any of this caching addresses what happens if you get halfway through one memcpy and it then swaps to the other thread. This has nothing to do with the "caches being in sync"; even assuming they are, it simply means the second memcpy will be copying from an "incomplete/interrupted" source area, and you'll get garbage at one end or the other of the destination area...?
-
I'm with @JonB on this. memcpy-ing a large chunk of data is not atomic; your OS can interrupt it halfway.
My company does training/consultation. One of the common stories we hear goes like this: "My code used to work just fine, but it stopped working when I upgraded my PC!" This is a symptom of race conditions.
@J.Hilk said in To Mutex or not to Mutex:
I'm considering to drop a QMutex lock/unlock call for the extra bit of performance
How much is this "extra bit", exactly?
-
@J.Hilk said in To Mutex or not to Mutex:
Therefore RAM access is actually queued; there are systems with more buses but they are few and far between.
Yes, that's the von Neumann architecture; even the instruction fetching is on that same bus. In contrast, the Harvard architecture has separate buses for data and for instructions. I don't see the connection to your problem, though.
The thing is this: You have something in memory that is to be accessed by two threads. This means inevitably one of the threads (or both) is going to be suspended by the scheduler of your OS, and then, after yielding to some other task, will be woken up again. Now this reentry is what's going to screw with your idea that memory access is queued, as it's non-deterministic; it can happen at any time. You don't have access to the low level, nor do you really care whether the fetches over the bus are queued; you care about the way your threads access the memory segment.
All that said, you should indeed drop the mutex, but not because it is slow (it's actually rather fast) and not because you don't need locking ...
If I understood your original problem correctly, then you need to go and read about the producer-consumer on a circular buffer (that uses semaphores for synchronization). That's how you can squeeze out the performance you want. :D -
Thanks everyone for the feedback, I appreciate it.
@JKSH said in To Mutex or not to Mutex:
I'm with @JonB on this. memcpy-ing a large chunk of data is not atomic; your OS can interrupt it halfway.
That's good to know, I wasn't aware of that!
How much is this "extra bit", exactly?
To be honest, no idea. It won't be much though, as my recursive function beforehand is the most time-consuming part. And probably not the most effective way to do the needed calculations :-(
@kshegunov said in To Mutex or not to Mutex:
The thing is this: You have something in memory that is to be accessed by two threads. This means inevitably one of the threads (or both) is going to be suspended by the scheduler of your OS, and then, after yielding to some other task, will be woken up again. Now this reentry is what's going to screw with your idea that memory access is queued, as it's non-deterministic; it can happen at any time. You don't have access to the low level, nor do you really care whether the fetches over the bus are queued; you care about the way your threads access the memory segment.
Right, even though memcpy is meant to be the fastest library routine for memory-to-memory copy, it's not safe from being put to "sleep" by the OS halfway through the task.
All that said, you should indeed drop the mutex, but not because it is slow (it's actually rather fast) and not because you don't need locking ...
If I understood your original problem correctly, then you need to go and read about the producer-consumer on a circular buffer (that uses semaphores for synchronization). That's how you can squeeze out the performance you want. :D
I'll indeed have to read up on that. Thanks for the reference.
I decided to share the code with you all.
It's part of a coding challenge I had with a friend (he just started to learn), and we already compared our ways/results, so no need for secrecy ;-) We were supposed to create a visual fractal based on the Abelian_sandpile_model.
Here you can find the QWidget & QThread & QMutex implementation:
Sindepile-Challenge-QWidget
Here you can find the QML & QtConcurrent & (no)QMutex implementation:
Sindepile-Challenge-QML
-
@J.Hilk said in To Mutex or not to Mutex:
Right, even though memcpy is meant to be the fastest library routine for memory-to-memory copy, it's not safe from being put to "sleep" by the OS halfway through the task.
The speed would depend on the actual C library implementation. For Windows (XP) the standard library just did a word-by-word copy on the aligned memory block. So in any case it wasn't that fast; but then again, how else can it be done? You have at most one memory read per CPU cycle.
I'll have to indeed read that up. Thanks for the reference.
Dig up the Qt example on the producer-consumer (I was too lazy to do it for you). If memory serves me, it should cover your case.
-
@kshegunov
I did, I think you mean this one: Wait Conditions Example
Very interesting approach and I can see uses for this, but I don't think it is useful for my situation.
- I'm populating the data array of a QImage with the calculated values. I don't want faulty interim images, so I'll have to wait until all calculations are done before I start copying.
- If I lock and unlock between each byte that is copied, to tell the other thread that part of the data is ready to be copied, I'll end up - for an 800x800 image - with 640000 mutex locks, unlocks, byte-count increments & notification signals. At first glance, this seems slower than locking before the copy function and unlocking after.
- this is with a buffer size of 4 bytes (ARGB), of course
-
@J.Hilk said in To Mutex or not to Mutex:
I did, I think you mean this one Wait Conditions Example
Yes that's the one.
I'm populating the data array of a QImage with the calculated values. I don't want faulty interim images, so I'll have to wait until all calculations are done before I start copying.
The point is you should not copy at all. Just create a biggish buffer, and while one thread's working on part of it, the other can read the already produced data (another part of the buffer). You don't need interim data; just modify the counters accordingly and free as many resources as the thread provided.
If I lock and unlock between each byte that is copied, to tell the other thread that part of the data is ready to be copied, I'll end up - for an 800x800 image - with 640000 mutex locks, unlocks, byte-count increments & notification signals. At first glance, this seems slower than locking before the copy function and unlocking after.
You misunderstand me, sir. A mutex is a mutually exclusive lock that's going to serialize access to the whole buffer. That's not the idea here. The idea is to write one part of the buffer without restricting the consumer thread from reading other parts of the same buffer. This is allowed as long as they don't read/write the same element (byte).
this is with a buffer size of 4 bytes (ARGB), of course
I don't follow; you have a buffer of bytes. What you put in it is an implementation detail. You could just as well write whole images, or single bytes, or arrays of integers, or whatever.
-
OK, this is new for me, so let's see if I got this right ;-)
In the example, the following objects are shared between the 2 threads:

```cpp
const int DataSize = 100000;
const int BufferSize = 8192;
char buffer[BufferSize];
QWaitCondition bufferNotEmpty;
QWaitCondition bufferNotFull;
QMutex mutex;
int numUsedBytes = 0;
```
DataSize is the actual size of the data that should, in my case, be copied.
BufferSize is the number of bytes you want to be at least ahead of the other thread. This can be anything between 1 and DataSize. I guess reasonable, in my case, would be a pixel (4 bytes) or a line (bytesPerLine).
The producer thread has this in its run function:

```cpp
for (int i = 0; i < DataSize; ++i) {
    mutex.lock();
    if (numUsedBytes == BufferSize)
        bufferNotFull.wait(&mutex);
    mutex.unlock();

    buffer[i % BufferSize] = "ACGT"[QRandomGenerator::global()->bounded(4)];

    mutex.lock();
    ++numUsedBytes;
    bufferNotEmpty.wakeAll();
    mutex.unlock();
}
```
For the edge case BufferSize = DataSize this could be simplified to

```cpp
mutex.lock();
for (int i = 0; i < DataSize; ++i) {
    buffer[i % BufferSize] = "ACGT"[QRandomGenerator::global()->bounded(4)];
}
mutex.unlock();
```
The example also requires that reading and writing happen simultaneously all the time, otherwise buffer[i] will be overwritten before the other thread can read it.
In my case reading happens every 50 ms (20 Hz frequency), whereas writing happens irregularly, whenever the recursive function finishes one iteration. But nevertheless, I could adjust the example for my needs, I think:
```cpp
char buffer[rows * cols];
QMutex mutex;
int numLinesWritten = 0;
int numLinesRead = 0;
QWaitCondition waitForReader;
QWaitCondition waitForWriter;

// Producer
mutex.lock();
if (numLinesRead == 0) { // other thread currently reads line 0
    waitForReader.wait(&mutex);
}
mutex.unlock();

mutex.lock();
numLinesWritten = 0;
mutex.unlock();

for (; numLinesWritten < rows; ) {
    mutex.lock();
    if (numLinesWritten > numLinesRead)
        waitForReader.wait(&mutex);
    mutex.unlock();

    for (uint x(0); x < cols; x++) {
        buffer[cols * numLinesWritten + x] = m_data[cols * numLinesWritten + x];
    }

    mutex.lock();
    ++numLinesWritten;
    waitForWriter.wakeAll();
    mutex.unlock();
}
```
The consumer looks similar; I don't have the time to do that right now.
This is off the top of my head and may be wrong :-) But it seems unnecessarily complex, with lots of extra additions (on numLinesWritten & numLinesRead), multiple locks and unlocks, and wake calls.
Correct me where I'm wrong, of course.
Like I said, for my case it could be faster to simply wrap the read and write in a lock, like I did, because the reader tries to read the data irregularly every 20 ms.
```cpp
// Producer
...
if (m_Mutex.tryLock(0)) { // if the other thread currently reads, do not wait to lock; skip this write cycle
    memcpy(m_GrainsCopy, m_Grains, arraySize);
    m_Mutex.unlock();
}
// restart while

// Consumer
m_Mutex.lock(); // if not free, wait until it is
for (uint i(0); i < arraySize; i++) {
    memcpy(image + (i * 4), colors.at(m_GrainsCopy[i]), 4);
}
m_Mutex.unlock();
```
-
You have now completely confused me ... :)
As I am an old-fashioned guy, I'll work with semaphores here. Consider the following very short snippet, which should be more than sufficient for your case.
Common data:

```cpp
const int bufferSize = 20000; // or w/e
QByteArray data(bufferSize);  // the shared buffer
QSemaphore bytesToRead(0), bytesToWrite(bufferSize); // sync
```
Producer
```cpp
DataType dataToWrite;
int bytes = sizeof(DataType);
bytesToWrite.acquire(bytes);
// write the data to the actual buffer
bytesToRead.release(bytes);
```
Consumer
```cpp
DataType dataToRead;
int bytes = sizeof(DataType);
bytesToRead.acquire(bytes);
// read the data from the buffer
bytesToWrite.release(bytes);
```
And that's all - easy peasy. :)
PS.
Are you working on a genetics problem, as this "ACGT" sounds very much like the four nitrogen bases?
-
Thanks, I'll take a look at it :)
@kshegunov said in To Mutex or not to Mutex:
PS.
Are you working on a genetics problem, as this "ACGT" sounds very much like the four nitrogen bases?
No, that is directly from the example on Qt's website.
You can find my project here
-
@J.Hilk said in To Mutex or not to Mutex:
You can find my project here
Yep, I glanced at it; it looks okay, more or less. That's a very odd way of doing threading, however. Why not use signals and slots to transfer the data between the worker and the GUI? Or at least use a thread-safe queue with your data to push into and pull out of; it seems a bit simpler than writing to a buffer and then reading it back.
-
@kshegunov said in To Mutex or not to Mutex:
That's a very odd way of doing threading however. Why not use signals and slots to transfer the data between the worker and the GUI? Or at least use a thread-safe queue with your data to push and pull out of, seems a bit simpler than writing to a buffer and then reading it back?
😳
A very simple reason, from my point of view at least. As long as the while loop is running, the event loop of the thread is "on halt"; therefore no signals are emitted/received during that time.
If I wanted to pass the data to the GUI thread via signal/slot, I would either have to call processEvents or exit the loop and re-enter it via a QTimer or QMetaObject::invokeMethod (with Qt::QueuedConnection as a parameter).
This slows down the calculation of the next value significantly. QMetaObject::invokeMethod was the approach I used at first; I changed it later.
A QQueue, however, may be an option; I haven't thought about that one, to be honest :)
-
@J.Hilk said in To Mutex or not to Mutex:
As long as the while loop is running, the event loop of the thread is "on halt"; therefore no signals are emitted/received during that time.
Emission of signals is not affected by a blocked event loop; only the execution of the slot connected to the signal is, and only if the receiving object's thread's event loop is blocked. :)
-
@kshegunov said in To Mutex or not to Mutex:
@J.Hilk said in To Mutex or not to Mutex:
As long as the while loop is running, the event loop of the thread is "on halt"; therefore no signals are emitted/received during that time.
Emission of signals is not affected by a blocked event loop; only the execution of the slot connected to the signal is, and only if the receiving object's thread's event loop is blocked. :)
😱