Huge number of threads => out of memory?



  • Hello

    I have to decode huge text data (about 150GB) composed of "messages" (about 5kB) and of course, I use multithreading to help to reduce the decoding time like for all my heavy tasks.

    The program is really simple, there is no interface, only data to proceed. Unfortunately, I have a problem. The main loop reads the data and sends it to threads via MyFutureWatcher::setFuture(QtConcurrent::run(this, &decodeMessage, messageString). I expected the setFuture call to be blocked after a certain number of calls waiting for some threads to be completed. This is not the case and of course as the file reading is faster than the message decoding on a i5 4 cores PC (there is no problem on a R7 6C/12T PC), there is a stack overflow somewhere after 5 minutes, strangely not trapped in debug mode.

    My first attempt to fix the problem was to add a MyFutureWatcher::waitForFinished() each 1000 messages read. Unexpectedly the problem is only reduced but not solved!?!? The only dirty way to solve it is to add a Sleep(400) :-/ to speed down the main loop. I tried in Qt5.4, Qt5.7 and Qt5.8 mingw and Qt5.7 and Qt5.8 VS2013 with the same behavior.

    My questions now:

    • Has somebody had the problem and solved it?
    • Is there something to check stacks size in the watcher to wait when stacks are full or more than a limit? At least is there a method to know the number pending threads in the watcher. I didn't see anything in the doc... :-(
    • Why MyFutureWatcher::waitForFinished() does not work as I expected? It seems to be OK in all my other programs.

    Thanx for your anwers!



  • There is something fishy going on here. can you post the code for MyFutureWatcher and decodeMessage


  • Moderators

    @Nmut
    why do you create a new thread for each message?! This is a you encountered just overkill.
    Isn't it enough to simply have one thread and just send the message (one-by-one) via a queued signal to the thread?
    The thread then can use QQueue to queue each message and processes them.



  • @VRonin
    I will not post my code :-P.
    Here is a small code that shows the problem.

    #include "widget.h"
    
    #include <QFutureWatcher>
    #include <QtConcurrent>
    
    Widget::Widget(QWidget *parent) :
        QWidget(parent)
    {
        setupUi(this);
    }
    
    void Widget::on__pbRun_clicked()
    {
        QFutureWatcher<void> watcher;
    
        for (int i = 1; i < INT_MAX; ++i)       // from 1 to avoid watcher to wait at first loop!
        {
            QByteArray message;     // new instance to avoid Qt lazy copy optimizations! :-D
            message.fill('x', 4096);
            watcher.setFuture(QtConcurrent::run(this, Widget::decodeMessage, message));
    
            if (i % 100000 == 0)
                watcher.waitForFinished();
        }
    
        watcher.waitForFinished();
    }
    
    void Widget::decodeMessage(const QByteArray & message)
    {
        Q_UNUSED(message)
    
        // do some really heavy and long stuff!!! :-P
        while(true);
    }
    

    In this example, the waitForFinish in the main loop works, but you have the problem of overflow if you comment this waitForFinish...

    @raven-worx
    I'm doing this way as it is very simple (2 lines of codes for thread management). As the decoding task is heavy, the thread management overhead is not noticeable.
    Anyway, your solution will not solve the stack oversize problem as the queue will crow forever.

    Thanks for your answers.


  • Moderators

    @Nmut said in Huge number of threads => out of memory?:

    As the decoding task is heavy, the thread management overhead is not noticeable.

    sure, the thread management slows down the whole process. Since there are so many threads one cycle/round of all threads takes longer of course.

    Anyway, your solution will not solve the stack oversize problem as the queue will crow forever.

    as i suggested to use QQueue it won't grow to infinity. Every consumption (dequeue) will remove if from the queue. Thats why it is called queue.



  • @raven-worx

    1. I checked the overhead of thread management, I'm not a complete beginner you know 8-D!
      In this specific case, I have 11.3 x the mono thread speed (direct call) using a 6C/12T processor, that is very satisfying result... :-P

    2. Sure it is growing as we queue faster than we dequeue! You are implementing the same behaviour as Qt one by yourself... My concern is about the control of this Qt queuing process (in QFuture watcher or in QThreadPool, I don't know).

    For sure I have a problem in my code as the dummy sample (really dummy with infinite loops :-D) I wrote has no problem when implementing the waitForFinish in the main loop, but this is only part of the problem.



  • I might have missed something, but wouldn't QtConcurrent help? This would at least reduce the painful task of creating threads and setting QFuture...



  • ok, lets remove everything that has to do with multithreading.

    INT_MAX is (usually) 2,147,483,647

    you are creating 2bln QByteArray each of size 4kB = 8TB of RAM

    No system can run your code

    QtConcurrent::mapped is probably what you need if you can find a way not to load all the input data at once



  • @JohanSolo
    I only use QFutureWatcher to be blocked until all the thread are finished.
    If you check my code, it is VERY simple, as a monothreading code but 2 declarations and 2 lines of code... I don't understand your comment.

    @VRonin
    Sorry for the confusion, this is not real code but some dummy one to highlight my problem (no guards for overflows in Qt)!!! :-D
    Of course infinite loops are not very good programming. And I don't deliver buggy code with smileys! :-P

    I'm doing some coding and answering to users by phone in parallel to this thread, may be some things are not clear.



  • My point is: the argument you pass to the method run in parallel has to be stored in memory both before and after executing your code (as long as the QFuture is alive).

    So if you do something similar to the above is only natural you are running out of memory



  • @Nmut said in Huge number of threads => out of memory?:

    @JohanSolo
    I only use QFutureWatcher to be blocked until all the thread are finished.
    If you check my code, it is VERY simple, as a monothreading code but 2 declarations and 2 lines of code... I don't understand your comment.

    I read a bit too fast your code actually, my bad!

    Is was suggesting without saying something similar to @VRonin, QtConcurrent::mapped does not require you to set the QFuture manually, and still blocks until the processing is finished.

    Edit: sometimes I should just shut up.



  • @VRonin
    Yes, then I hope that some Qt multi threading guru will help me to manage by myself the size of the "stacks".
    What solutions I imagine:

    • check the data stack size => wait if > 200MB for example
    • check the number of pending threads => wait if > 50 for example

    QFutureWatcher::waitForFinish method I usually use in my multithreading codes doesn't seam to work in this case... Maybe I found a bug with my "unaccurate" usage.

    The goal is to have very simple code that works on all the client's different desktop as fast as possible.

    @JohanSolo
    I never used QtConcurrent::mapped in this context. In this case I have some input text data that I interpret, I can't easily do some "batches" for QtConcurrent::mapped. Maybe you highlight a usage I don't know?
    Edit: Every answer is a new step to the truth, whatever the direction! :-)



  • How are you storing/acquiring the 5kB "messages" at the moment? I mean before doing any computation on them



  • @VRonin
    I'm reading the data from different files (28 days, 1 file a day) and doing some fast pre-computations (cleaning, formatting, validation) and then I send the data in a QByteArray to the decoding method (here are the treads) to save delta with other data source in a binary format.
    The problem I have is to adapt the data throughput (messages read from files) to the decoding threads capacity (no storage needed in this phase) to have the better efficiency whatever the client PC from 2 cores to 16 cores/32 threads (no problem here as my 6C/12T does the job correctly, the SSD throughput is the limit!). Users are so impatient! :-D



  • Could you save the pre-computed file (in a QTemporaryFile?), store all of them in a container (QVector<QTemporaryFile*> ?) and use QtConcurrent::mapped on them?

    mapped will use QThread::idealThreadCount threads simultaneously



  • @VRonin
    This can be a solution. The size of temporary data is then not a problem. But the data size (drive space usage) and the time can be a problem (disk writes and reads)...
    I see another simple solution when I was reading my answer before replying to you. The best way to split the tasks is to have a thread for each day, each thread is pre-processing and decoding in one pass. This is far to be optimal (bad load balancing, hard drive concurrent access) but it is a very simple solution.

    I continue to dig in the problem, I feel my solution is really to use QFuturWatcher::waitForFinished(). I have to understand why it doesn't work as expected.



  • I gave up with this interesting problem, too much things to manage.
    There is a problem with QFutureWatcher::waitForFinished in some specific cases and some missing controls for QThreadPool and QFutureWatcher, but I can find a dirty workaround... I don't have enough time to work on Qt these days.
    Why projects are so time critical? :-D


  • Qt Champions 2016

    If I understand your issue correctly, you'd only need one semaphore for blocking to ensure you don't overflow the QtConcurent run queue. It'd look something like this:

    static QSemaphore barrier(5000);
    
    barrier.acquire();
    QtConcurrent::run([&barrier] () -> void {
        QThread::sleep(1000); //< Simulate a long running operation
        barrier.release(); //< Free up a resource for the data producer
    });
    

    You may also want to look at the producer-consumer example that comes with Qt.



  • @kshegunov
    You are right, my problem is very similar to this producer-consumer example. The producer (files reading, main loop) must wait for the consumer (data decoding) on 2 or 4 threads processors and fast HDD but on more than 10 treads processors, the consumers (the threads) must wait for the producer.
    I will study these examples (semaphore and wait condition examples).
    Unfortunately, this will not explain why the example I posted here works as expected (waitForFinished that synchronizes the threads) but not my final code.

    Edit: I just tried with semaphore. It works as expected BUT this is REALLY not efficient. On 12 threads, it is about 2 times slower than my version with QFutureWatcher::waitForFinished(). I don't have other machines for the moment to test with but I suppose the efficiency is poor on all type of processors.


  • Qt Champions 2016

    @Nmut said in Huge number of threads => out of memory?:

    Unfortunately, this will not explain why the example I posted here works as expected (waitForFinished that synchronizes the threads) but not my final code.

    waitForFinished is not synchronizing any threads, and you have an error in the example posted above. You're resetting the future to the watcher at each iteration and at some point you're stopping the loop to wait for the last operation (not all of them).

    I just tried with semaphore. It works as expected BUT this is REALLY not efficient.

    You'll have to show benchmark results with the test code to claim that. Designing benchmark code isn't exactly trivial.

    On 12 threads, it is about 2 times slower than my version with QFutureWatcher::waitForFinished(). I don't have other machines for the moment to test with but I suppose the efficiency is poor on all type of processors.

    Firstly, QtConcurrent::run doesn't run the job immediately, it has a thread pool and it puts the job in a queue and whenever there's a free thread from that pool then it executes it. Calling QtConcurrent::run multiple times does not create new threads.
    Secondly, on the i5 you have 4 cores, meaning that you can have 4 threads executing in parallel everything else is scheduled. One core can execute a single thread at any one time, the "illusion" of threading on single core is created by the OS's scheduler, which allocates time slots and puts to sleep one thread to switch execution to another (so called context switching). Having more threads than the number of cores will not give you any efficiency, on the contrary, it will even reduce the speed.



  • I appreciate your clear answer but this highlights that I'm not clear in my posts...

    @kshegunov said in Huge number of threads => out of memory?:

    waitForFinished is not synchronizing any threads, and you have an error in the example posted above. You're resetting the future to the watcher at each iteration and at some point you're stopping the loop to wait for the last operation (not all of them).

    I'm trying to use QFutureWatcher::waitForFinish() to let Qt execute the jobs I stacked (in QFutureWatcher or QThreadPool, I don't know the internal mechanisms but for sure it is my problem of memory as Qt has to store the data somewhere! :-D).
    I call waitForFinish every xxx calls to wait for all the tasks to be executed, stopping the main loop to avoid stack overflow. This is not efficient but far better than mutex/semaphore usage (years of Qt experience in muti-threading). This is my preferred way to multi-thread my applications as this is very simple code, really easy to read and to maintain.
    This is the first time I use this for this number of thread (10000+) and of course I understand that this is sub optimal and strange.
    My concern is about the waitForFinish that is not working as I expect for the first time (not waiting for ALL the tasks to be completed, but for the first time I use it several times to balance producer/consumer tasks, not only to wait at the end of threads/set of batchs).

    You'll have to show benchmark results with the test code to claim that. Designing benchmark code isn't exactly trivial.

    This is only for my specific case, only using a huge data set when I time my code with a stopwatch, no complex benchmarking to elaborate here! :-)
    It is an average on the same code (just the thread management is different) and 10 data sets. I use a Ryzen R5 1600X and the 12 threads example is for me a good one as this is the only processor that does not need for the producer to wait for the consumer. so the benchmarks are valid. Of course there non regression tests to validate the code modifications (output data checked).

    Firstly, QtConcurrent::run doesn't run the job immediately, it has a thread pool and it puts the job in a queue and whenever there's a free thread from that pool then it executes it. Calling QtConcurrent::run multiple times does not create new threads.

    I understand that and this is what I need! I want to stack the jobs and I want Qt to manage the underlying complexity! :-P

    Secondly, on the i5 you have 4 cores, meaning that you can have 4 threads executing in parallel everything else is scheduled. One core can execute a single thread at any one time, the "illusion" of threading on single core is created by the OS's scheduler, which allocates time slots and puts to sleep one thread to switch execution to another (so called context switching). Having more threads than the number of cores will not give you any efficiency, on the contrary, it will even reduce the speed.

    Of course.
    Maybe you missed the point that I test my program on different machines (i5 and i7 at work, Pentium and Ryzen at home) to simulate my program's target computers. I have to insure that my program will perform with the best efficiency whatever the target machine is.


Log in to reply
 

Looks like your connection to Qt Forum was lost, please wait while we try to reconnect.