Intermittent segFault on closing, in malloc_consolidate, in debug binary only, when using dialog from dynamic lib

Moschops · wrote on 5 Mar 2014, 10:09

I have a Qt application that sometimes exhibits a segFault on closing if I have used a dialog that is built separately as a dynamic library (on the Linux build only; the Windows build doesn't exhibit the segFault).

The stack trace shows it happening in malloc_consolidate, as called by dl_fini (that is, the Qt application has exited and we're now in the tidy-up phase).

It does not happen in my release build of the same Qt application and separate dialog library. I struggle to come up with an explanation for why it would happen in the debug build but never in the release build (there's nothing very special about the difference between my debug and release builds: debug has debugging symbols and no optimisation, release has the reverse).

It's a big application and valgrind churns out lots of things (many of them deep inside the Qt libraries), none of which jump out at me, and nothing that makes me think it would only happen in the debug build.

I have a separate, lightweight wrapper executable that does little more than show the dialog from the dynamic library (this is why the dialog lives in a separate library: so it can effectively be run as a standalone dialog), and that never exhibits a segFault on exit. So I'm trying to come up with a way that the interaction of the main application and this dynamic library dialog could cause a segFault on exit (in the debug build), and I'm coming up blank. Does anyone have any ideas? I'm wondering about threads; the main application has many threads, which is a significant difference.

It also seems to be the case that if the user, once finished with the dialog from the dynamic library, does some other things in the main application first (things that do not call upon the dynamic library), the segFault does not occur on exit, but just waiting a while and then closing still produces the segFault.

Chris Kawa · wrote on 5 Mar 2014, 13:00

It's next to impossible to say anything without looking at the code.
It can be a number of things.
It depends on where and how you create the dialog, how you access it, and how and where you delete it. It might be something the dialog does that corrupts some other part of memory. It might be the way you load or unload your library. Maybe you're loading a release library in a debug program or the other way around? Countless possible reasons.
When a program crashes in debug and not in release (or vice versa) it's usually a memory corruption problem like a double delete, writing out of bounds of some array or referencing a deleted pointer.
These types of corruption often have random effects (or worse - no visible effects in some configurations).
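
To make the "referencing a deleted pointer" case concrete, here is a minimal sketch in plain standard C++, not taken from the application in question. The type `Dialog` and the function `observerSeesDeletion` are hypothetical names used only for illustration. The idea is that an observer which can ask "is the object still alive?" turns a silent dangling access into a checkable condition; for QObject-derived classes, QPointer plays a similar role.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a dialog object; not the real class.
struct Dialog {
    int result = 0;
};

// Returns true if the observer can tell that the object has been deleted.
bool observerSeesDeletion() {
    auto owner = std::make_shared<Dialog>();  // single owning handle
    std::weak_ptr<Dialog> observer = owner;   // non-owning observer

    owner.reset();                            // the one and only delete

    // A raw Dialog* would now dangle silently; the weak_ptr reports it.
    return observer.expired();
}

// Returns the value read through the observer, or -1 if the object is gone.
int readThroughObserver(const std::weak_ptr<Dialog>& observer) {
    if (auto alive = observer.lock()) {       // lock() yields null once deleted
        return alive->result;
    }
    return -1;
}
```

This does not find an existing corruption, but checking suspect pointers this way can narrow down which object is being used after deletion.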

Moschops · wrote on 5 Mar 2014, 13:08

Yes. The code is huge. Many, many files and many, many lines and all sorts.

I'm hoping that some of the odd facts (like it happening only when the dialog is called from one application and not from the relatively lightweight wrapper, or that it only happens in the debug version and not the release) might be a meaningful clue.

For example, if the problem was simply trashing some memory inside the dialog, I'd expect to see the segfault in the lightweight wrapper application and the bigger application as well (I think). The library is linked to at compile time, rather than loaded at run-time (I think - I'm still feeling my way with this; the original coders are no longer available). I'm also wondering why doing something (pretty much anything) in the application from which the dialog was called seems to quell the segFault - if it was a double free or some such, I'd expect that fiddling around for a while would make no difference - if it's going to get a second free at application exit, fiddling around with the application first should make no difference.
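
One way to check the "linked at compile time" guess from inside the program, sketched here under the assumption of a glibc-based Linux build: `dl_iterate_phdr` from `<link.h>` walks the shared objects currently mapped into the process. The helpers `loadedObjects` and `isLoaded` are hypothetical names for illustration. If a call at the very top of main(), before any dialog is shown, already lists the dialog library, that would suggest it is a link-time dependency rather than something loaded later at run-time.

```cpp
#include <link.h>    // dl_iterate_phdr, dl_phdr_info (glibc, Linux-specific)
#include <cstddef>
#include <string>
#include <vector>

// Callback invoked once per loaded object; appends its path to the vector.
static int collectName(struct dl_phdr_info* info, std::size_t, void* data) {
    auto* names = static_cast<std::vector<std::string>*>(data);
    // The main executable is usually reported with an empty name.
    names->push_back(info->dlpi_name ? info->dlpi_name : "");
    return 0;  // returning 0 continues the iteration
}

// Paths of all shared objects mapped into the process right now.
std::vector<std::string> loadedObjects() {
    std::vector<std::string> names;
    dl_iterate_phdr(collectName, &names);
    return names;
}

// True if any loaded object's path contains the given fragment.
bool isLoaded(const std::string& fragment) {
    for (const auto& name : loadedObjects()) {
        if (name.find(fragment) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

Calling `isLoaded` with part of the dialog library's file name, once at startup and once after the dialog has been used, would show whether the set of loaded objects changes.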

I suspect it is either a double free or a trashed chunk of memory, but the characteristics of it (particularly that doing something first and then closing avoids the segFault) make it maddeningly hard to theorise about.

Chris Kawa · wrote on 5 Mar 2014, 13:26

These types of bugs can't be rationalized or discussed because they're not deterministic. It all depends on the current layout of memory.

E.g. something might get deleted once and then another time and nothing (obvious) happens. But if you happen to allocate something in there between the deletes (a simple "int i" might suffice sometimes) it's a crash with an incomprehensible call stack.

But I don't want to steer you towards double deletion. It might be anything really. I once spent half a day debugging an app because two coincidences clashed - one guy typed auto i = instead of auto& i = and another didn't make a class non-copyable where it should have been. The crashes occurred in totally unrelated parts of the code and seemed to depend on yet more unrelated user actions.
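
A minimal sketch of how that auto i = versus auto& i = clash can turn into a double release. The type `Handle`, the counter `g_releases` and both functions are hypothetical, made up for illustration and not taken from the app I was debugging. The destructor only counts, standing in for a real delete or free, so the sketch stays safe to run.

```cpp
#include <cassert>
#include <type_traits>

int g_releases = 0;  // how many times the pretend resource was released

// Hypothetical owner of a resource that was left copyable by mistake.
struct Handle {
    Handle() = default;
    Handle(const Handle&) = default;  // the mistake: copying is allowed
    ~Handle() { ++g_releases; }       // stands in for delete/free
};

// `auto h =` copies, so two objects release the same resource.
int releasesWithCopy() {
    g_releases = 0;
    {
        Handle owner;
        auto h = owner;   // copy
        (void)h;
    }
    return g_releases;
}

// `auto& h =` only aliases the owner, so there is a single release.
int releasesWithReference() {
    g_releases = 0;
    {
        Handle owner;
        auto& h = owner;  // reference, no new object
        (void)h;
    }
    return g_releases;
}

// Deleting the copy operations turns the typo into a compile error.
struct SafeHandle {
    SafeHandle() = default;
    SafeHandle(const SafeHandle&) = delete;
    SafeHandle& operator=(const SafeHandle&) = delete;
};
static_assert(!std::is_copy_constructible<SafeHandle>::value,
              "auto h = someSafeHandle; would not compile");
```

With a real resource behind the destructor, the copy version is a double free whose effect depends on whatever the heap looks like at that moment.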

I would suggest using the debugger. Don't wait until it crashes. Break somewhere before your dialog is created and step through to follow what it's doing (AKA "rubber duck debugging":http://en.wikipedia.org/wiki/Rubber_duck_debugging).

Moschops · wrote on 5 Mar 2014, 14:31

Good advice, but given that there are hundreds of files and tens of thousands of lines of code, it's going to take a while :(

I'm ramping up valgrind and assorted other tools to see what I get (I did try valgrind before, but now I'm going to have to go deeper).

Edit: Nuts. It doesn't seem to exhibit under valgrind. Makes me wonder about race hazards.