Huge number of threads => out of memory?

Nmut · 8 Nov 2017, 09:37

@JohanSolo
I only use QFutureWatcher to block until all the threads are finished.
If you check my code, it is VERY simple: the same as single-threaded code plus 2 declarations and 2 lines of code... I don't understand your comment.

@VRonin
Sorry for the confusion, this is not real code but some dummy one to highlight my problem (no guards for overflows in Qt)!!! :-D
Of course infinite loops are not very good programming. And I don't deliver buggy code with smileys! :-P

I'm coding and answering users on the phone in parallel with this thread, so maybe some things are not clear.

VRonin · 11 Aug 2017, 09:53

My point is: the argument you pass to the method run in parallel has to be stored in memory both before and after executing your code (as long as the QFuture is alive).

So if you do something similar to the above, it is only natural that you run out of memory.
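The lifetime argument can be illustrated outside Qt with a minimal standard C++ sketch. This is not Qt code and says nothing about Qt's internals; `LifetimeReport` and `payloadLifetime` are hypothetical names made up for the illustration. A queued closure holding a 5 kB payload plays the role of a pending job: the payload stays allocated until the closure itself is destroyed, even after the producer has dropped its own reference.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical report type: was the payload alive at each point in time?
struct LifetimeReport {
    bool aliveWhileQueued;
    bool aliveAfterTaskDestroyed;
};

LifetimeReport payloadLifetime()
{
    auto payload = std::make_shared<std::vector<char>>(5u * 1024u);
    std::weak_ptr<std::vector<char>> probe = payload;

    // The closure keeps its own reference, like a queued job keeps its arguments.
    std::function<std::size_t()> task = [payload] { return payload->size(); };
    payload.reset();  // the producer drops its reference

    LifetimeReport report{};
    report.aliveWhileQueued = !probe.expired();
    assert(task() == 5u * 1024u);  // the job can still read its argument
    task = nullptr;                // job finished and discarded
    report.aliveAfterTaskDestroyed = !probe.expired();
    return report;
}
```

With 10000 such jobs pending, roughly 10000 x 5 kB stays allocated until the jobs (and whatever handles refer to them) go away.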

JohanSolo · 8 Nov 2017, 09:51

@Nmut said in Huge number of threads => out of memory?:

@JohanSolo
I only use QFutureWatcher to block until all the threads are finished.
If you check my code, it is VERY simple: the same as single-threaded code plus 2 declarations and 2 lines of code... I don't understand your comment.

I read your code a bit too fast actually, my bad!

~~I was suggesting, without saying it, something similar to @VRonin: QtConcurrent::mapped does not require you to set the QFuture manually, and still blocks until the processing is finished.~~

Edit: sometimes I should just shut up.

Nmut · 8 Nov 2017, 09:54

@VRonin
Yes, so I hope some Qt multithreading guru will help me manage the size of the "stacks" myself.
Solutions I can imagine:

check the data stack size => wait if > 200MB for example
check the number of pending threads => wait if > 50 for example

The QFutureWatcher::waitForFinished method I usually use in my multithreading code doesn't seem to work in this case... Maybe I found a bug with my "inaccurate" usage.

The goal is to have very simple code that works as fast as possible on all the clients' different desktops.

@JohanSolo
I never used QtConcurrent::mapped in this context. In this case I have some input text data that I interpret, I can't easily do some "batches" for QtConcurrent::mapped. Maybe you highlight a usage I don't know?
Edit: Every answer is a new step to the truth, whatever the direction! :-)

VRonin · 11 Aug 2017, 10:09

How are you storing/acquiring the 5kB "messages" at the moment? I mean before doing any computation on them

Nmut · replied to VRonin on 11 Aug 2017, 10:09

@VRonin
I'm reading the data from different files (28 days, 1 file a day), doing some fast pre-computations (cleaning, formatting, validation) and then sending the data in a QByteArray to the decoding method (this is where the threads are) to save the delta with another data source in a binary format.
The problem I have is to adapt the data throughput (messages read from files) to the decoding threads' capacity (no storage needed in this phase) to get the best efficiency whatever the client PC, from 2 cores to 16 cores/32 threads (no problem here as my 6C/12T does the job correctly, the SSD throughput is the limit!). Users are so impatient! :-D

VRonin · 8 Nov 2017, 11:31

Could you save the pre-computed file (in a QTemporaryFile?), store all of them in a container (QVector<QTemporaryFile*> ?) and use QtConcurrent::mapped on them?

mapped will use QThread::idealThreadCount threads simultaneously

Nmut · 15 Aug 2017, 17:44

@VRonin
This can be a solution. The size of the temporary data is then not a problem. But the drive space usage and the time (disk writes and reads) can be a problem...
I saw another simple solution while rereading my answer before replying to you. The easiest way to split the tasks is to have one thread per day, each thread pre-processing and decoding in one pass. This is far from optimal (bad load balancing, concurrent hard drive access) but it is a very simple solution.

I'll keep digging into the problem; I feel my solution really is to use QFutureWatcher::waitForFinished(). I have to understand why it doesn't work as expected.

Nmut · 17 Aug 2017, 08:07

I gave up on this interesting problem, too many things to manage.
There is a problem with QFutureWatcher::waitForFinished in some specific cases and some missing controls for QThreadPool and QFutureWatcher, but I can find a dirty workaround... I don't have enough time to work on Qt these days.
Why are projects so time critical? :-D

kshegunov · 17 Aug 2017, 17:41

If I understand your issue correctly, you'd only need one semaphore for blocking to ensure you don't overflow the QtConcurrent run queue. It'd look something like this:

static QSemaphore barrier(5000);

barrier.acquire();
// barrier has static storage duration, so the lambda uses it without capturing it
QtConcurrent::run([] () -> void {
    QThread::msleep(1000); //< Simulate a long running operation (1 second)
    barrier.release(); //< Free up a resource for the data producer
});

You may also want to look at the producer-consumer example that comes with Qt.
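For readers without Qt at hand, the same throttling idea can be sketched in standard C++17. This is an analogue under stated assumptions, not the Qt implementation: `Semaphore` is a hand-written stand-in for QSemaphore, `runBounded` is a hypothetical driver, and it starts one `std::thread` per job only to keep the sketch short (a real program would hand the jobs to a pool, as QtConcurrent does).

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    explicit Semaphore(int permits) : count(permits) {}
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex);
        ready.wait(lock, [this] { return count > 0; });
        --count;
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex);
            ++count;
        }
        ready.notify_one();
    }
private:
    std::mutex mutex;
    std::condition_variable ready;
    int count;
};

// Starts `jobs` tasks but never lets more than `limit` be pending at once.
// Returns the highest number of pending tasks that was observed.
int runBounded(int jobs, int limit)
{
    assert(limit > 0);
    Semaphore barrier(limit);
    std::mutex statsMutex;
    int inFlight = 0;
    int peak = 0;
    std::vector<std::thread> tasks;
    for (int i = 0; i < jobs; ++i) {
        barrier.acquire();  // the producer blocks here once `limit` jobs are pending
        {
            std::lock_guard<std::mutex> lock(statsMutex);
            peak = std::max(peak, ++inFlight);
        }
        tasks.emplace_back([&] {
            std::this_thread::yield();  // stand-in for the real work
            {
                std::lock_guard<std::mutex> lock(statsMutex);
                --inFlight;
            }
            barrier.release();  // free a slot for the producer
        });
    }
    for (auto &t : tasks)
        t.join();
    return peak;
}
```

The producer pays one acquire per job, so with very short jobs the synchronization cost can become visible; the limit only bounds how much pending data exists at any time.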

Nmut · 18 Aug 2017, 08:15

@kshegunov
You are right, my problem is very similar to this producer-consumer example. The producer (file reading, main loop) must wait for the consumer (data decoding) on 2- or 4-thread processors with a fast HDD, but on processors with more than 10 threads, the consumers (the threads) must wait for the producer.
I will study these examples (semaphore and wait condition examples).
Unfortunately, this will not explain why the example I posted here works as expected (waitForFinished that synchronizes the threads) but not my final code.

Edit: I just tried with semaphore. It works as expected BUT this is REALLY not efficient. On 12 threads, it is about 2 times slower than my version with QFutureWatcher::waitForFinished(). I don't have other machines to test with for the moment, but I suppose the efficiency is poor on all types of processors.

kshegunov · 18 Aug 2017, 12:53

@Nmut said in Huge number of threads => out of memory?:

Unfortunately, this will not explain why the example I posted here works as expected (waitForFinished that synchronizes the threads) but not my final code.

waitForFinished is not synchronizing any threads, and you have an error in the example posted above. You're resetting the future to the watcher at each iteration and at some point you're stopping the loop to wait for the last operation (not all of them).

I just tried with semaphore. It works as expected BUT this is REALLY not efficient.

You'll have to show benchmark results with the test code to claim that. Designing benchmark code isn't exactly trivial.

On 12 threads, it is about 2 times slower than my version with QFutureWatcher::waitForFinished(). I don't have other machines to test with for the moment, but I suppose the efficiency is poor on all types of processors.

Firstly, QtConcurrent::run doesn't run the job immediately, it has a thread pool and it puts the job in a queue and whenever there's a free thread from that pool then it executes it. Calling QtConcurrent::run multiple times does not create new threads.
Secondly, on the i5 you have 4 cores, meaning that you can have 4 threads executing in parallel everything else is scheduled. One core can execute a single thread at any one time, the "illusion" of threading on single core is created by the OS's scheduler, which allocates time slots and puts to sleep one thread to switch execution to another (so called context switching). Having more threads than the number of cores will not give you any efficiency, on the contrary, it will even reduce the speed.

Nmut · 20 Aug 2017, 08:00

I appreciate your clear answer but this highlights that I'm not clear in my posts...

@kshegunov said in Huge number of threads => out of memory?:

waitForFinished is not synchronizing any threads, and you have an error in the example posted above. You're resetting the future to the watcher at each iteration and at some point you're stopping the loop to wait for the last operation (not all of them).

I'm trying to use QFutureWatcher::waitForFinished() to let Qt execute the jobs I stacked (in QFutureWatcher or QThreadPool, I don't know the internal mechanisms, but this is surely the source of my memory problem as Qt has to store the data somewhere! :-D).
I call waitForFinished every xxx calls to wait for all the tasks to be executed, stopping the main loop to avoid stack overflow. This is not efficient but far better than mutex/semaphore usage (years of Qt experience in multi-threading). This is my preferred way to multi-thread my applications as this is very simple code, really easy to read and to maintain.
This is the first time I use this for this number of threads (10000+) and of course I understand that this is suboptimal and strange.
My concern is that waitForFinished is not working as I expect for the first time (it is not waiting for ALL the tasks to be completed; but this is also the first time I call it several times to balance producer/consumer tasks, not only to wait at the end of the threads/set of batches).

You'll have to show benchmark results with the test code to claim that. Designing benchmark code isn't exactly trivial.

This is only for my specific case, only using a huge data set when I time my code with a stopwatch, no complex benchmarking to elaborate here! :-)
It is an average over 10 data sets with the same code (only the thread management differs). I use a Ryzen R5 1600X, and the 12-thread example is a good one for me as this is the only processor where the producer does not need to wait for the consumer, so the benchmarks are valid. Of course there are non-regression tests to validate the code modifications (output data checked).

Firstly, QtConcurrent::run doesn't run the job immediately, it has a thread pool and it puts the job in a queue and whenever there's a free thread from that pool then it executes it. Calling QtConcurrent::run multiple times does not create new threads.

I understand that and this is what I need! I want to stack the jobs and I want Qt to manage the underlying complexity! :-P

Secondly, on the i5 you have 4 cores, meaning that you can have 4 threads executing in parallel everything else is scheduled. One core can execute a single thread at any one time, the "illusion" of threading on single core is created by the OS's scheduler, which allocates time slots and puts to sleep one thread to switch execution to another (so called context switching). Having more threads than the number of cores will not give you any efficiency, on the contrary, it will even reduce the speed.

Of course.
Maybe you missed the point that I test my program on different machines (i5 and i7 at work, Pentium and Ryzen at home) to simulate my program's target computers. I have to ensure that my program will perform with the best efficiency whatever the target machine is.

kshegunov · 20 Aug 2017, 10:48

@Nmut said in Huge number of threads => out of memory?:

I'm trying to use QFutureWatcher::waitForFinished() to let Qt execute the jobs I stacked (in QFutureWatcher or QThreadPool,

Which is what I said - future watcher does not stack anything. It's a utility class to watch a specific future (as the name suggests).

I call waitForFinished every xxx calls to wait for all the tasks to be executed, stopping the main loop to avoid stack overflow.

You may intend that, but it's not how it works, and I have already written it in this and in my previous post.
If you want future synchronization, then look at QFutureSynchronizer.

This is not efficient but far better than mutex/semaphore usage (years of Qt experience in multi-threading).

It's better how? As a friendly suggestion: you should really take the time to understand what's going on before making such a claim.

This is my preferred way to multi-thread my applications as this is very simple code, really easy to read and to maintain.

I don't know how to let you down gently, so I won't beat around the bush - it is wrong, plain and simple.

This is the first time I use this for this number of threads (10000+) and of course I understand that this is suboptimal and strange.

I already told you this number of threads isn't useful; moreover, QtConcurrent::run doesn't start new threads, it reuses the old ones. Where you get this number of threads is really beyond my ability to comprehend.

My concern is that waitForFinished is not working as I expect for the first time (it is not waiting for ALL the tasks to be completed; but this is also the first time I call it several times to balance producer/consumer tasks, not only to wait at the end of the threads/set of batches).

Because you're using it wrongly, it's not intended to synchronize threads, its usage is to observe a specific job and notify you when said job is done. One task - one future - one future watcher!

I understand that and this is what I need! I want to stack the jobs and I want Qt to manage the underlying complexity!

It already does all that. The only thing that I have put in my example is to prevent a prolific producer from overflowing the pending jobs queue.

Maybe you missed the point that I test my program on different machines (i5 and i7 at work, Pentium and Ryzen at home) to simulate my program's target computers. I have to ensure that my program will perform with the best efficiency whatever the target machine is.

Then make sure your program is working properly before doing benchmarks. Opening the task manager and seeing 100% on all cores would be a good indication that all the cores are crunching the numbers as fast as they can.
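The "one task - one future" point can be sketched in standard C++ (an analogue only: `std::future` stands in for QFuture, keeping all handles in a vector stands in for what QFutureSynchronizer does in Qt, and `decode` / `decodeAll` are hypothetical names). Overwriting a single handle on every iteration leaves only the last task observable; keeping every handle is what allows waiting for all of them.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Hypothetical stand-in for the real decoding step.
int decode(int message)
{
    return message * 2;
}

// Launch one task per message, keep every future, then wait for all of them.
int decodeAll(const std::vector<int> &messages)
{
    std::vector<std::future<int>> pending;
    pending.reserve(messages.size());
    for (int m : messages)
        pending.push_back(std::async(std::launch::async, decode, m));

    int sum = 0;
    for (auto &f : pending)
        sum += f.get();  // blocks until this particular task is done
    return sum;
}
```

When the loop over `pending` finishes, every task has completed, which is the guarantee a single repeatedly reassigned watcher cannot give.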

Nmut · 20 Aug 2017, 11:39

@kshegunov said in Huge number of threads => out of memory?:

Which is what I said - future watcher does not stack anything. It's a utility class to watch a specific future (as the name suggests).

Understood!

You may intend that, but it's not how it works, and I have already written it in this and in my previous post.
If you want future synchronization, then look at QFutureSynchronizer.

THIS is the answer I expected! Thank you very much!!!
This is not as efficient as I expected but really better than the mutex/semaphore option.

It's better how? As a friendly suggestion: you should really take the time to understand what's going on before making such a claim.

Just performance side... But OK, this is a bad usage! :-/

I already told you this number of threads isn't useful; moreover, QtConcurrent::run doesn't start new threads, it reuses the old ones. Where you get this number of threads is really beyond my ability to comprehend.

Bad wording. I know that a thread pool is used, and I "ask" for a task (not a thread) to be executed in the existing/available threads (idealThreadCount = number of logical cores; SMT is useful for me in this specific case).

It already does all that. The only thing that I have put in my example is to prevent a prolific producer from overflowing the pending jobs queue.

You understood my problem, but your solution was not performing as expected.

Then make sure your program is working properly before doing benchmarks. Opening the task manager and seeing 100% on all cores would be a good indication that all the cores are crunching the numbers as fast as they can.

Yes. This is my way to code: ensuring the algo is correct, then checking that the results are correct with simple single-threaded code, then multi-threading the application, then profiling the application (good load balancing, maximum efficiency on all types of processors).

BTW, one more question for you :-P: how can I detect SMT in Qt? In some cases my performance is better using only "real" cores. For now, I either benchmark the processor at startup or, for really simple code, test on typical client targets and force the thread count to the number of physical cores as needed. This is not a satisfying solution as the program will not cope with new architectures (I had some big surprises with Ryzen for example).

kshegunov · 20 Aug 2017, 14:25

@Nmut said in Huge number of threads => out of memory?:

This is not as efficient as I expected but really better than the mutex/semaphore option.

Perhaps not, but I'm curious why do you think it's better than a mutex-semaphore concurrent queue?

Just performance side...

I can assure you, the lower level you get the more control and efficiency you are going to get, with the cost of a more complex code however.

Bad wording.

Okay. Then we understand each other.

You understood my problem, but your solution was not performing as expected.

The performance bottleneck may be in a different place entirely. I can't tell without seeing the code, or a test case or some kind of a tangible fact to base a recommendation on.

how to detect in Qt SMT as in some cases, my performances are better using only "real" cores.

Not through Qt: it makes no distinction between "physical" and "logical" cores, and QThread::idealThreadCount() counts logical ones. Operating systems do expose the CPU topology, but only through platform-specific interfaces, nothing portable. You can usually disable hyperthreading (or the equivalent technology) in the BIOS/(U)EFI interface. Otherwise this is a hardware implementation detail, and Qt gives you no direct control over it (or way to distinguish the two).

This is not a satisfying solution as the programme will not cope with new architecture (I had some big surprises with Ryzen for example).

As a work-around you can pass the number of threads as a command line argument. This way you can specify exactly the number of threads you want at runtime and control it manually depending on the CPU.
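The command-line work-around could look like the following sketch in standard C++. `resolveThreadCount` is a hypothetical helper, and how the value reaches it (for example a `--threads N` option) is an assumption left to the caller; the fallback is the logical core count reported by the standard library.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>
#include <thread>

// Returns the thread count requested in `arg`, or the logical core count
// when `arg` is missing, not a number, or out of range.
unsigned resolveThreadCount(const char *arg)
{
    unsigned fallback = std::thread::hardware_concurrency();
    if (fallback == 0)
        fallback = 1;  // hardware_concurrency() may return 0 when unknown
    if (arg == nullptr)
        return fallback;

    char *end = nullptr;
    const long requested = std::strtol(arg, &end, 10);
    const bool parsed = (end != arg) && (*end == '\0');
    if (!parsed || requested < 1 || requested > std::numeric_limits<int>::max())
        return fallback;
    return static_cast<unsigned>(requested);
}
```

In a Qt program the result would typically be applied once at startup, for instance with `QThreadPool::globalInstance()->setMaxThreadCount(...)`, so that QtConcurrent uses the chosen number of threads.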

PS.
If you still want, here's an excerpt/example from a program of mine that uses mutexes and semaphores to realize a thread-safe queue which is filled through Calculator::addData and the data's processed in Calculator::CalculatorWorker::run. It will squeeze every last drop of computational power you have. What to do with the processed data is left open (I cut out that part from the code).

class CalculationData;

class Calculator
{
	friend class CalculatorWorker;

public:
	~Calculator();

public:
	void begin();
	void finish();
	void addData(const CalculationData &); //< Put more data to be processed.

private:
	bool fetch(CalculationData &);

	// Calculation workers' queue
	class CalculatorWorker;

	QVector<CalculatorWorker *> workers;
	QSemaphore queueAvailability, inputNeeds;
	QMutex queueMutex;
	QQueue<CalculationData> queue;

	static const qint32 maxQueueSize;	// In data items
};

class Calculator::CalculatorWorker : public QThread
{
public:
	CalculatorWorker(Calculator *);

protected:
	void run() override;

private:
	Calculator * controller;
};

// ------------------------------------------------------------------------------------------------------------------ //
const qint32 Calculator::maxQueueSize = 10000;	// In data items

Calculator::~Calculator()
{
	finish();
}

void Calculator::begin()
{
	inputNeeds.release(maxQueueSize);

	// Start the worker threads
	qint32 workersNumber = QThread::idealThreadCount();
	workers.resize(workersNumber);

	for (qint32 i = 0; i < workersNumber; i++)  {
		workers[i] = new CalculatorWorker(this);
		workers[i]->start();
	}
}

void Calculator::finish()
{
	qint32 workersNumber = workers.size();
	if (workersNumber > 0)  {			// Wait for all workers to finish
		queueAvailability.release(workersNumber);

		for (qint32 i = 0; i < workersNumber; i++)  {
			workers[i]->wait();
			delete workers[i];
		}

		workers.clear();
	}

	inputNeeds.acquire(inputNeeds.available());		// Zero out
}

void Calculator::addData(const CalculationData & data)
{
	inputNeeds.acquire();	// First wait for more data to be consumed if needed (do not overfill the queue).

	QMutexLocker lock(&queueMutex);
	queue.enqueue(data);
	queueAvailability.release();
}

bool Calculator::fetch(CalculationData & data)
{
	queueAvailability.acquire();
	QMutexLocker lock(&queueMutex);
	if (queue.size() <= 0)
		return false;

	data = queue.dequeue();

	inputNeeds.release();

	return true;
}

// ------------------------------------------------------------------------------------------------------------------ //

Calculator::CalculatorWorker::CalculatorWorker(Calculator * parent)
	: controller(parent)
{
}

void Calculator::CalculatorWorker::run()
{
	CalculationData data;
	while (controller->fetch(data))  {
		// Process the data and do stuff with it ...
	}
}

Using it is rather simple:

int main( ... )
{
    Calculator manager;

    manager.begin();  // Start the worker threads
    while ( ... )  {
        CalculationData data;
        // Read/fetch whatever is needed ...

        manager.addData(data);  // Fill the queue with data to process
    }

    manager.finish();  //< Just for completeness, it will do that when the object's destroyed anyway.
    return 0;
}

Nmut · 20 Aug 2017, 14:58

@kshegunov said in Huge number of threads => out of memory?:

Perhaps not, but I'm curious why do you think it's better than a mutex-semaphore concurrent queue?

Just what I see from tests: only about 2x speedup with multiple threads on an i7 4/8 processor => not great efficiency. I have no time to investigate (and I'm on vacation this week, I don't want to spend too much time on that! :-D).

I can assure you, the lower level you get the more control and efficiency you are going to get, with the cost of a more complex code however.

In this specific case, the multi-threading overhead (semaphore management) is noticeable (tasks duration < 50ms). Of course a way to improve the code is to merge tasks (batches of 10 to 1000 jobs to reduce the impact of overhead but with the drawback of code complexity). This is new for me, I'm used to tasks lasting 10 seconds to 10 minutes.
BTW I will probably have to move to batches of xxx jobs: even though the QFutureSynchronizer solution is what I expected, the overhead is noticeable...

The performance bottleneck may be in a different place entirely. I can't tell without seeing the code, or a test case or some kind of a tangible fact to base a recommendation on.

No problem, I always have to profile my programs deeply: scientific computing on desktop PCs with huge data sets of hundreds of GB. Of course the users need their results ASAP as this is a replacement for supercomputer usage...
If I show you the algo, I don't think you will see the bottlenecks at first sight :-D ! Anyway I cannot disclose the code.

You can't, the abstraction of "physical" vs "logical" cores goes beyond the scope of Qt, or the OS for that matter. You can usually disable the hyperthreading (or equivalent technology) at the BIOS/(U)EFI interface. This is strictly hardware implementation and you don't have that fine control (or distinction) directly.
As a work-around you can pass the number of threads as a command line argument. This way you can specify exactly the number of threads you want at runtime and control it manually depending on the CPU.

Letting the user choose the number of threads is unfortunately not possible: I'm sure I would receive tons of Problem Reports about performance within the day! :-)
I will keep using my dummy computation tests or processor type tests!

PS.
If you still want, here's an excerpt/example from a program of mine that uses mutexes and semaphores to realize a thread-safe queue which is filled through Calculator::addData and the data's processed in Calculator::CalculatorWorker::run. It will squeeze every last drop of computational power you have. What to do with the processed data is left open (I cut out that part from the code).

Thank you for your code! And again thank you for your great help.

kshegunov · 20 Aug 2017, 21:06

@Nmut said in Huge number of threads => out of memory?:

In this specific case, the multi-threading overhead (semaphore management) is noticeable (tasks duration < 50ms). Of course a way to improve the code is to merge tasks (batches of 10 to 1000 jobs to reduce the impact of overhead but with the drawback of code complexity).

You definitely need that, no matter what threading technology you choose. Such small tasks are inefficient for many reasons, not only because of the synchronization primitives. Batch them up in vector/array contiguously in memory (prefer std::vector here to avoid Qt's d-ptr indirection), so you don't get constant cache misses and stay away from heap (re)allocations. Aside from that you may want to explore lock-free queues[1, 2], although I really doubt the benefit will be that significant, especially if you batch the data first.
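The batching advice can be sketched as a small helper in standard C++. `makeBatches` is a hypothetical name and the batch size is whatever profiling suggests; each batch is a contiguous `std::vector`, so one queued job then processes a whole batch instead of a single message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Split `items` into contiguous batches of at most `batchSize` elements.
template <typename T>
std::vector<std::vector<T>> makeBatches(const std::vector<T> &items, std::size_t batchSize)
{
    assert(batchSize > 0);
    std::vector<std::vector<T>> batches;
    batches.reserve((items.size() + batchSize - 1) / batchSize);  // allocate the outer vector once
    for (std::size_t begin = 0; begin < items.size(); begin += batchSize) {
        const std::size_t end = std::min(begin + batchSize, items.size());
        batches.emplace_back(items.begin() + static_cast<std::ptrdiff_t>(begin),
                             items.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return batches;
}
```

With 2500 messages and a batch size of 1000 this yields three jobs (1000, 1000 and 500 items), so the per-job synchronization cost is paid three times instead of 2500 times.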

If I show you the algo, I don't think you will see the bottlenecks at first sight

Don't be so sure, you don't know what I do. In any case the point is moot, since you can't disclose any code.

Nmut · replied to kshegunov on 20 Aug 2017, 21:06

@kshegunov said in Huge number of threads => out of memory?:

You definitely need that, no matter what threading technology you choose. Such small tasks are inefficient for many reasons, not only because of the synchronization primitives. Batch them up in vector/array contiguously in memory (prefer std::vector here to avoid Qt's d-ptr indirection), so you don't get constant cache misses and stay away from heap (re)allocations. Aside from that you may want to explore lock-free queues[1, 2], although I really doubt the benefit will be that significant, especially if you batch the data first.

The performance is mainly linked to memory management; the data size is far larger than the caches (some MB). I collect and merge big data before doing some computations and statistics. This is surprisingly fast on modern processors. If I have some spare time, I will dig in this direction but I don't expect much improvement for the specific problem we are talking about.
And you are right, I have to write some generic code for this new type of problem. Until now I mostly had very long tasks.

Don't be so sure, you don't know what I do. In any case the point is moot, since you can't disclose any code.

I do not doubt your skills. The code is just sometimes a little bit confusing and obfuscated in its old parts.

Thank you for the references.