Random crashes after switching to multi-thread
-
Hello everyone,
Recently, we switched our application to multi-threading to keep some heavy work from freezing the main event loop.
Since then, some of our users are experiencing random crashes.
As always, we got dumps from Google-Breakpad and they are showing crashes in each and every corner of our application.
Some are coming from QSlotObject (apparently, the slot it is trying to call isn't there anymore), some are coming from getters (for example, I'm calling isNull on a QSharedPointer and it returns false, but immediately after, it crashes accessing the underlying pointer), and so on...
Right now, the only explanation we can give is memory corruption or something similar, as if instances of our classes were just disappearing randomly.
The main problem is that we didn't manage to reproduce these crashes. The application is running fine for a lot of users, including myself. It never crashed like this.
So we're kind of stuck here... The only things we have are dumps from Google-Breakpad and assumptions.
That's why I'm asking for help here.
Does anybody have an idea where these random crashes may come from?
Any advice? Where should we start? Does this have anything to do with memory corruption?
Thanks in advance for your help.
If you need more information or anything else, please ask.
-
Since it is difficult to give you an exact answer, I can only tell you from my projects that what you describe is likely caused by programming mistakes. The difficult part about these errors in multithreaded applications is that they're not deterministic, and the probability that you encounter them might be different on every machine your program runs on. So they are extremely difficult to chase down and can usually be fixed by an experienced developer checking the code.
If you can't reproduce the bug, there is no other way than eliminating it by investigating the code where it crashes, reading documentation and trying to solve it in theory. Imo, this kind of bug is tricky to solve; you also encounter it in robotics and machine programming, where several control loops are running in parallel. It works 99% of the time, but there seems to be a certain combination that makes the machine fail. There, too, the only chance you have is to isolate all control loops and make sure each works correctly, in theory, with all outside combinations.
-
Hi,
@Moonlight-Angel said:
Since then, some of our users are experiencing random crashes.
...
The main problem is that we didn't manage to reproduce these crashes. The application is running fine for a lot of users, including myself.
This is a classic case of a race condition: http://stackoverflow.com/questions/34510/what-is-a-race-condition
On your machine, the correct thread "wins the race". However, on your users' machines, the wrong thread "wins the race" and causes a crash. To eliminate the crash, you must make your threads wait for each other at appropriate times.
See http://doc.qt.io/qt-5/threads-synchronizing.html as a starting point.
Also, make sure you obey the rules of reentrancy and thread-safety: http://doc.qt.io/qt-5/threads-reentrancy.html (for example, all QWidget functions must only be called in the GUI thread)
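To illustrate what "making threads wait for each other" means, here is a minimal sketch in plain standard C++ rather than Qt (with Qt, QMutex and QMutexLocker would play the roles of std::mutex and std::lock_guard). The class SharedCounter and the function countWithThreads are made up for this example and are not taken from the application discussed in this thread.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state: every read and write goes through the same mutex.
class SharedCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);  // released when lock goes out of scope
        ++value_;
    }
    long value() const {
        std::lock_guard<std::mutex> lock(mutex_);  // readers lock too
        return value_;
    }
private:
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Starts threadCount threads that each increment the counter
// incrementsPerThread times, waits for all of them, and returns the total.
long countWithThreads(int threadCount, int incrementsPerThread) {
    assert(threadCount > 0 && incrementsPerThread >= 0);
    SharedCounter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&counter, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();  // the counter must not be destroyed before the threads finish
    }
    return counter.value();
}
```

Without the lock in increment(), the threads could interleave their read-modify-write steps and the total would only sometimes be correct, which is the kind of machine-dependent behaviour described above.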
-
Thanks for your answers.
@cybercatalyst Thanks for clarifying that, you just confirmed what I was afraid of :D.
However, I'm not sure that the crashes come from some code being executed incorrectly or anything like this. It's really weird to understand and thus to explain.
For example, I have an instance of the logger that is created at startup. Suddenly, in the middle of nowhere, trying to access that instance causes a crash. I just can't understand how this is even possible?!
That instance is created once at startup and then passed by reference to the classes that could possibly make use of the logger.
@JKSH We had one case of a race condition and, with the dump, we were able to solve it, because it crashed at a specific moment, in a specific class that was prone to race conditions.
The other crashes are occurring in classes where I think there can't be race conditions.
I was aware of reentrancy and thread-safety while refactoring the application but yes, maybe I just made a mistake at some place, but the dumps we're receiving don't tell much, or maybe I'm not interpreting them correctly, who knows. Anyway, all the classes shared between multiple threads are protected with mutexes.
All the things that are GUI-related are indeed processed in the GUI thread. We are only using signals to update the GUI, all the other threads aren't even aware of the GUI, they're just emitting signals and the GUI is connected to them with a QueuedConnection.
Correct me if I'm wrong, but when one class is destroyed, all the slots connected to signals from this class are disconnected too, right? Then why is my application sometimes crashing because it is apparently calling some slot in the GUI that doesn't exist anymore?
Thanks again for your answers. I'll try to investigate more in-depth, but I'm kind of lost right now. \o/
-
@Moonlight-Angel said:
when one class is destroyed, all the slots connected to signals from this class are disconnected too, right ?
That's right.
Then why is my application sometimes crashing because it is apparently calling some slot in the GUI that doesn't exist anymore ?
How do you do your deletions? Don't use the regular delete (because the object could have already started processing signals/events at the point when you delete it). Make sure you use QObject::deleteLater().
If you're still stuck (and you're on Linux or OS X), try a thread checker: http://www.kdab.com/helgrind-howto/
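The ordering idea behind deferred deletion can be sketched with a toy, single-threaded event loop in plain standard C++. This is only a model of the idea, not of how Qt implements QObject::deleteLater(); ToyEventLoop, Receiver, deleteLaterToy and runDeferredDeletionDemo are all hypothetical names invented for the example.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy event loop: runs queued tasks in FIFO order on one thread.
class ToyEventLoop {
public:
    void post(std::function<void()> task) {
        assert(task != nullptr);
        tasks_.push(std::move(task));
    }
    void run() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop();
            task();
        }
    }
private:
    std::queue<std::function<void()>> tasks_;
};

// Hypothetical receiver that records what happens to it.
class Receiver {
public:
    explicit Receiver(std::vector<std::string>& log) : log_(log) {}
    ~Receiver() { log_.push_back("destroyed"); }
    void onSignal() { log_.push_back("slot called"); }
private:
    std::vector<std::string>& log_;
};

// Deferred deletion: the destruction is queued behind the work already pending,
// instead of happening immediately.
void deleteLaterToy(ToyEventLoop& loop, Receiver* receiver) {
    loop.post([receiver] { delete receiver; });
}

std::vector<std::string> runDeferredDeletionDemo() {
    std::vector<std::string> log;
    ToyEventLoop loop;
    Receiver* receiver = new Receiver(log);
    loop.post([receiver] { receiver->onSignal(); });  // a queued "slot" call
    // A plain `delete receiver;` here would leave the queued call with a dangling pointer.
    deleteLaterToy(loop, receiver);
    loop.run();
    return log;
}
```

In this model the queued call runs first and the object is destroyed afterwards, so nothing ever touches a dead object.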
-
@JKSH Most of the classes that contain signals are class members allocated on the stack and they are destroyed when the class itself is destroyed, so there's no call to deleteLater().
But this didn't seem to be a problem before; the slots were disconnected correctly.
I'll check helgrind. According to your link, with the latest versions of Qt there's no need to patch/compile Qt manually, so let's give it a try.
BTW, the project is pretty huge, that's why I'm lost ATM: there are so many things to check and so many possible sources that can cause these crashes :/.
While I'm on it, do you think it would help if I was able to reproduce the crashes on a system where I can run the application in a debugger? Maybe I'll see if I can reproduce it in some VM with the same "specs" as the users' computers.
-
@Moonlight-Angel said:
@JKSH Most of the classes that contain signals are class members allocated on the stack and they are destroyed when the class itself is destroyed, so there's no call to deleteLater().
Then make sure the top-level class is not destroyed while its members are in the middle of processing signals/events.
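One common way to guarantee this, sketched here in plain standard C++ rather than Qt, is to have the owning class stop its worker thread and wait for it in the destructor, before any member is destroyed. With QThread, the rough equivalent is calling quit() and wait() before the owner goes away. Owner and runOwnerOnce are hypothetical names used only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical owner of a background thread that uses the owner's members.
class Owner {
public:
    Owner() : worker_([this] { run(); }) {}

    ~Owner() {
        stop_ = true;                 // ask the worker to finish its loop
        assert(worker_.joinable());
        worker_.join();               // wait here, before any member is destroyed
    }

    int ticks() const { return ticks_.load(); }

private:
    void run() {
        while (!stop_) {
            ++ticks_;  // touches a member: only safe while the Owner is alive
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::atomic<bool> stop_{false};
    std::atomic<int> ticks_{0};
    std::thread worker_;  // declared last so it starts after the other members exist
};

// Creates an Owner, waits until its worker has run at least once,
// then lets the Owner go out of scope and returns the tick count seen.
int runOwnerOnce() {
    int seen = 0;
    {
        Owner owner;
        while (owner.ticks() == 0) {
            std::this_thread::yield();
        }
        seen = owner.ticks();
    }  // ~Owner() joins the worker here
    return seen;
}
```

If the destructor skipped the join, the worker could still be inside run() while ticks_ and stop_ were being destroyed, which would show up as exactly the kind of "object disappeared" crash described earlier in the thread.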
-
@Moonlight-Angel
Deletion and creation are by far not the only operations that you need to be aware of in multithreaded applications. For example, if you read a variable in one thread and write it from another thread without synchronization, you have a data race, and your program can crash or misbehave. You can make sure this doesn't happen by using mutexes or semaphores.
-
@cybercatalyst As I said above, I'm already using mutexes.
@Moonlight-Angel said:
I was aware of reentrancy and thread-safety while refactoring the application but yes, maybe I just made a mistake at some place, but the dumps we're receiving don't tell much, or maybe I'm not interpreting them correctly, who knows. Anyway, all the classes shared between multiple threads are protected with mutexes.
-
I bet a beer you made a mistake on that. But unfortunately that's up to you, since we cannot access the code.
-
@cybercatalyst That's totally possible, this was my first time playing with multi-threading.
I'm still not sure if I implemented some things correctly.
For example, the Logger instance I'm passing to some classes.
I'm storing the reference as a class member and made a getter to be able to access the logger inside the methods when I need it.
However, I didn't protect the getter with mutexes, as the Logger methods are protected. The getter simply returns the reference; it doesn't change anything in the class.
Do I need to protect the getter too?
-
Since you never write to the logger reference itself (only, probably, to members of the logger), you do not need to guard the getter. However, I am not sure whether a forum is the right format for finding this type of issue. In this case I think it is better for you to hire someone experienced to chase down the bugs for you. Otherwise this thread might get veeery long :)
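The setup described above can be sketched as follows in plain standard C++. Logger, Component and logFromThreads are hypothetical stand-ins, not the classes from the actual application. The sketch assumes the reference is set once in the constructor and never changed, and that the logger outlives everything that uses it.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical logger whose methods protect their own internal state.
class Logger {
public:
    void log(const std::string& line) {
        std::lock_guard<std::mutex> lock(mutex_);
        lines_.push_back(line);
    }
    std::size_t lineCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return lines_.size();
    }
private:
    mutable std::mutex mutex_;
    std::vector<std::string> lines_;
};

// Hypothetical class that stores the logger reference as a member.
class Component {
public:
    explicit Component(Logger& logger) : logger_(logger) {}

    // No lock needed: the reference is set once and only read afterwards.
    Logger& logger() const { return logger_; }

    void work(int lineCount) {
        for (int i = 0; i < lineCount; ++i) {
            logger().log("tick");
        }
    }
private:
    Logger& logger_;
};

std::size_t logFromThreads(int threadCount, int linesPerThread) {
    assert(threadCount > 0 && linesPerThread >= 0);
    Logger logger;                 // must outlive the component and every thread using it
    Component component(logger);
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&component, linesPerThread] { component.work(linesPerThread); });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return logger.lineCount();
}
```

The unguarded getter is safe here only because of the lifetime guarantee: if the Logger were destroyed while a thread still held the reference, the getter would hand out a dangling reference and no mutex inside Logger could help.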
-
@cybercatalyst You're right.
Thanks for your answers anyway, it really helped me. ;)