Using Qt in a 25k connection tcp/ip socket server



  • Given the known kernel performance issues and modern hardware, thread-per-client-connection architectures are actually preferred if you need fewer than 100k connections per server.

    This is, of course, impacted heavily by the underlying socket handling method (async I/O, select, epoll, I/O completion ports, whatever...)

    What does Qt use under the hood on Linux? Is that constant/configurable/dependent upon how you use sockets?

    Sorry for asking without doing much investigation, but it may be Qt version dependent, and when it comes to networking stuff - personal experience is always HUGELY valuable.

    Does anyone do scalable networking work with Qt? Normally I wouldn't for a TCP/IP server but I have been loving using Qt this past year...


  • Qt Champions 2016

    Hi,

    @VRHans said in Using Qt in a 25k connection tcp/ip socket server:

    Given the known kernel performance issues and modern hardware, thread-per-client-connection architectures are actually preferred if you need fewer than 100k connections per server.

    Eh, if you say so. I'd prefer not to start 10k threads, much less 100k, and I'm pretty sure I'd hit the descriptor limit way before that.

    What does Qt use under the hood on Linux? Is that constant/configurable/dependent upon how you use sockets?

    Use for what? If you mean QTcpSocket, then it's a wrapper around the BSD sockets API. Pretty standard stuff.

    Does anyone do scalable networking work with Qt?

    Very many people.

    Normally I wouldn't for a TCP/IP server but I have been loving using Qt this past year...

    Why wouldn't you? What are you thinking of doing (so we/I can actually give you straight to the point answers)?

    Kind regards.



  • Thanks for the response :)

    @kshegunov said in Using Qt in a 25k connection tcp/ip socket server:

    Eh, if you say so. I'd prefer not to start 10k threads, much less 100k, and I'm pretty sure I'd hit the descriptor limit way before that.

    Well, usually the first thing you do is change the descriptor limit to be larger at process startup (just write to the file), and as for the threads, you pool them so you only pay to start them once, and depending upon expected load you spawn at startup. At 10k connections, thread context switches are unimportant as cache misses are the biggest impact. The kernel issue relates to the way the default kernels poll descriptors.

    When you really want to be performant at a really large scale (say 250k connections and above) then you need to be obsessed with cache efficiency (in fact you should spend a lot of time writing cache efficient data structures that are lock free) and using your own memory manager (beware of failing to clean up critical data before re-using, though.)

    What does Qt use under the hood on Linux? Is that constant/configurable/dependent upon how you use sockets?

    Use for what? If you mean QTcpSocket, then it's a wrapper around the BSD sockets API. Pretty standard stuff.

    What I mean is "by what mechanisms does the Qt networking system monitor those file descriptors?"

    Does anyone do scalable networking work with Qt?

    Very many people.

    Normally I wouldn't for a TCP/IP server but I have been loving using Qt this past year...

    Why wouldn't you? What are you thinking of doing (so we/I can actually give you straight to the point answers)?

    Because if you're dealing with real scale you need to be 100% aware of what your server is doing each step of the way - and having an abstraction layer (such as Qt) is generally a bad idea. But I love me some Qt, so I guess I'll just have to experiment a bit.

    Luckily I need about 50k connections per server in an environment with a large amount of memory, and I'll be able to take advantage of process affinity.

    Cheers!


  • Qt Champions 2016

    @VRHans said in Using Qt in a 25k connection tcp/ip socket server:

    Thanks for the response :)

    I like a good jigsaw puzzle, that's all. Don't get any wrong ideas. ;P

    Well, usually the first thing you do is change the descriptor limit to be larger at process startup (just write to the file)

    Nope. This solves no problem at all, it only delays the inevitable realization that whatever resources you have you'll consume them dry at some point if you're not smart about it. (wow that sounded like a Greenpeace ad :D )

    as for the threads, you pool them so you only pay to start them once, and depending upon expected load you spawn at startup

    Then you don't have "thread per client". In any case you should state the desired connection density, rather than just a connection number, e.g. 1k/sec.

    At 10k connections, thread context switches are unimportant as cache misses are the biggest impact.

    Assuming you have 10k threads running, which is what "thread per client" implies, then sorry, this sounds very wrong. A thread context switch is a terrible, terrible thing. And with a high number of active threads (i.e. not sleeping) you're going to degrade performance really fast. The context switch will just explode the number of cache line invalidations for both the data cache and the instruction cache. Even if you assume reentrancy on the part of the code executed by the threads (i.e. lockless), the CPU will (probably) be forced to re-fetch instructions constantly to keep up with the context switching.

    The kernel issue relates to the way the default kernels poll descriptors.

    I don't know what you mean here.

    When you really want to be performant at a really large scale (say 250k connections and above) then you need to be obsessed with cache efficiency (in fact you should spend a lot of time writing cache efficient data structures that are lock free)

    Well, I suppose, although I'm not quite convinced cache efficiency is key. Even if you have something similar to DMA running between the ethernet controller and the CPU, all in all the data will be transferred over the single bus you have with von Neumann archs, thus it'd be "slow". Anyway, the TCP/IP stack has enough overhead as it is, just on its own (e.g. acknowledgement packets), so I don't know how obsessed with cache misses you should be. Additionally, it ultimately will depend on what you're doing: if you're doing calculations in said threads, then by all means optimize away; if not ... eh ...!

    using your own memory manager

    For what reason?

    What I mean is "by what mechanisms does the Qt networking system monitor those file descriptors?"

    QSocketNotifier, which if memory serves me is a powered-up select() over a file descriptor. In any case I like its API much more than the native one ... no surprise there.

    Because if you're dealing with real scale you need to be 100% aware of what your server is doing each step of the way - and having an abstraction layer (such as Qt) is generally a bad idea.

    Or you could dig in and know how the abstraction layer works. In any case Qt, if I recall correctly, only wraps the system API for TCP; at least for the desktop platforms, I have no idea what the state is with embedded.

    I'll just have to experiment a bit.

    Good idea. We can argue the theory all we want, in the end the experiment will decide what's the gain or loss of using Qt for this.

    Kind regards.



  • @kshegunov said in Using Qt in a 25k connection tcp/ip socket server:

    @VRHans said in Using Qt in a 25k connection tcp/ip socket server:

    Thanks for the response :)

    I like a good jigsaw puzzle, that's all. Don't get any wrong ideas. ;P

    I know :). All the fun in a TCP/IP server is in designing it. Once you start building it, it's less fun...

    Well, usually the first thing you do is change the descriptor limit to be larger at process starup (just write to the file)

    Nope. This solves no problem at all, it only delays the inevitable realization that whatever resources you have you'll consume them dry at some point if you're not smart about it. (wow that sounded like a Greenpeace ad :D )

    I'm not sure what you mean here, but if you plan to have 50k connections, you need to ensure you allow enough FDs to handle the sockets. What I have done in the past is have configuration values MAX_CONNECTIONS and MAX_PENDING_CONNECTIONS, add them together at startup, then add that to the system default FD count to get the total number allowed, and write that to the system configuration file. This ensures you don't exhaust file descriptors (I don't know what the default is nowadays, but it used to be less than 2048 on many systems.)
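
    For illustration, the same bump can also be done from inside the process with getrlimit/setrlimit(2) rather than writing the system configuration file - a minimal POSIX-only sketch, where MAX_CONNECTIONS and MAX_PENDING_CONNECTIONS stand for the configuration values described above (the numbers here are made up):

    ```cpp
    #include <sys/resource.h>
    #include <cstdio>

    // Hypothetical configuration values, as described above.
    constexpr rlim_t MAX_CONNECTIONS = 50000;
    constexpr rlim_t MAX_PENDING_CONNECTIONS = 1000;

    // Raise the soft descriptor limit to cover the expected socket count,
    // capped at the hard limit the kernel allows this process.
    bool raiseFdLimit(rlim_t wanted)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            return false;
        rl.rlim_cur = (wanted > rl.rlim_max) ? rl.rlim_max : wanted;
        return setrlimit(RLIMIT_NOFILE, &rl) == 0;
    }

    int main()
    {
        if (!raiseFdLimit(MAX_CONNECTIONS + MAX_PENDING_CONNECTIONS))
            std::perror("setrlimit");
        struct rlimit rl;
        getrlimit(RLIMIT_NOFILE, &rl);
        std::printf("soft fd limit now: %llu\n",
                    (unsigned long long)rl.rlim_cur);
        return 0;
    }
    ```

    Note that an unprivileged process can only raise its soft limit up to the hard limit; going beyond that still needs the system-wide configuration the post describes.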

    as for the threads, you pool them so you only pay to start them once, and depending upon expected load you spawn at startup...)

    Then you don't have "thread per client". In any case you should state the desired connection density, rather than just a connection number, e.g. 1k/sec.

    Yes you do: when you accept a new connection, you acquire a thread from the thread pool and accept the socket on that thread. When the client disconnects or is disconnected, you clean up on that thread and return it to the thread pool.
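
    The accept-and-hand-off flow can be sketched with a plain worker pool; the integer clientFd below is just a hypothetical stand-in for an accepted socket descriptor, so no real networking is involved:

    ```cpp
    #include <atomic>
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // A fixed pool of workers pulling accepted "connections" (integer
    // stand-ins for socket descriptors) from a shared queue.
    class ConnectionPool {
    public:
        ConnectionPool(unsigned threads, std::atomic<int> &served)
            : served_(served) {
            for (unsigned i = 0; i < threads; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~ConnectionPool() {   // drains the queue, then joins the workers
            {
                std::lock_guard<std::mutex> lock(m_);
                done_ = true;
            }
            cv_.notify_all();
            for (auto &w : workers_) w.join();
        }
        // Called from the accepting thread: hand the client to a worker.
        void submit(int clientFd) {
            {
                std::lock_guard<std::mutex> lock(m_);
                pending_.push(clientFd);
            }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                int fd;
                {
                    std::unique_lock<std::mutex> lock(m_);
                    cv_.wait(lock, [this] { return done_ || !pending_.empty(); });
                    if (pending_.empty()) return;  // shut down and drained
                    fd = pending_.front();
                    pending_.pop();
                }
                (void)fd;   // serve the client here; on disconnect the
                ++served_;  // worker loops back into the pool
            }
        }
        std::atomic<int> &served_;
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<int> pending_;
        std::vector<std::thread> workers_;
        bool done_ = false;
    };

    // Submit n fake client fds to a pool of `threads` workers; returns how
    // many were served once the pool has drained and shut down.
    int serveAll(int n, unsigned threads) {
        std::atomic<int> served{0};
        {
            ConnectionPool pool(threads, served);
            for (int fd = 0; fd < n; ++fd)
                pool.submit(fd);
        }
        return served.load();
    }

    int main() {
        std::printf("served %d clients\n", serveAll(100, 4));
        return 0;
    }
    ```

    The point of the pattern is that thread start-up is paid once, while each connection only pays for a queue push and a wake-up.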

    At 10k connections, thread context switches are unimportant as cache misses are the biggest impact.

    Assuming you have 10k threads running, which is what "thread per client" implies, then sorry, this sounds very wrong. A thread context switch is a terrible, terrible thing.

    It used to be, circa 2003, but the expense is lower now; more importantly, at a given scale the relative costs of a context switch versus poor data locality and increased connection handling swap - and the context switch becomes the more desirable trade-off.

    And with a high number of active threads (i.e. not sleeping) you're going to degrade performance really fast. The context switch will just explode the number of cache line invalidations for both the data cache and the instruction cache. Even if you assume reentrancy on the part of the code executed by the threads (i.e. lockless), the CPU will (probably) be forced to re-fetch instructions constantly to keep up with the context switching.

    It depends. In generalized desktop performance or, worst case, game development, context switches are bad - primarily because the relative cache costs are low. When you're handling 250,000 client connections, the context switching is unimportant because other costs are much more critical to your performance. At this scale, your goal is to update each client connection every 200-400 ms.

    The kernel issue relates to the way the default kernels poll descriptors.

    I don't know what you mean here.

    What I mean is that after about 6,000 connections, the kernel's default methodology for handling FDs becomes inefficient. It's why, when you run an Apache server (for example) and you begin struggling at 8k clients, and you decide to double the performance of the machine (memory, CPU, whatever), you see only a slight increase in effective connection support (e.g. 9k clients despite doubling the hardware performance).
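
    The inefficiency being alluded to is presumably the linear scan that select()/poll() perform over the whole descriptor set on every call; epoll on Linux instead reports only the descriptors that are ready. A Linux-only sketch, using a pipe as a stand-in for a client socket:

    ```cpp
    #include <sys/epoll.h>
    #include <unistd.h>
    #include <cstdio>
    #include <utility>

    // Register one descriptor with epoll and poll for readiness twice;
    // returns {ready count before data arrives, ready count after}.
    // Linux-only; the pipe's read end stands in for a client socket.
    std::pair<int, int> readinessDemo() {
        int fds[2];
        if (pipe(fds) != 0) return {-1, -1};

        int epfd = epoll_create1(0);
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = fds[0];
        epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);

        epoll_event out;
        int before = epoll_wait(epfd, &out, 1, 0);  // nothing readable yet

        (void)write(fds[1], "x", 1);                // data "arrives"
        int after = epoll_wait(epfd, &out, 1, 0);   // one descriptor ready

        close(epfd); close(fds[0]); close(fds[1]);
        return {before, after};
    }

    int main() {
        auto r = readinessDemo();
        std::printf("ready before: %d, after: %d\n", r.first, r.second);
        return 0;
    }
    ```

    With 50k registered sockets of which a handful are active, epoll_wait still only hands back the active ones, whereas select() would rescan all 50k on every call.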

    When you really want to be performant at a really large scale (say 250k connections and above) then you need to be obsessed with cache efficiency (in fact you should spend a lot of time writing cache efficient data structures that are lock free)

    Well, I suppose, although I'm not quite convinced cache efficiency is key. Even if you have something similar to DMA running between the ethernet controller and the CPU, all in all the data will be transferred over the single bus you have with von Neumann archs, thus it'd be "slow". Anyway, the TCP/IP stack has enough overhead as it is, just on its own (e.g. acknowledgement packets), so I don't know how obsessed with cache misses you should be. Additionally, it ultimately will depend on what you're doing: if you're doing calculations in said threads, then by all means optimize away; if not ... eh ...!

    Hehe, you've struck on one of the approaches for speeding things up - DMA between user space (polling FDs yourself directly rather than through the kernel) and the NIC! :)

    using your own memory manager

    For what reason?

    For the very same caching reasons I mention above, and to avoid allocation/de-allocation entirely (plus it's very important for long-running processes to mitigate memory fragmentation issues rather than relying on the "just dump the process and create a new one" method).
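
    As a rough illustration of such a memory manager, a fixed-size free-list pool recycles blocks instead of touching the heap allocator on the hot path - and note that it deliberately does not clear recycled blocks, which is exactly the re-use hazard mentioned earlier in the thread:

    ```cpp
    #include <cstddef>
    #include <vector>

    // Fixed-size block pool: freed blocks go onto an intrusive free list
    // and are handed straight back on the next acquire(), so steady-state
    // operation performs no heap allocation at all.
    class BlockPool {
    public:
        BlockPool(std::size_t blockSize, std::size_t count)
            : blockSize_(blockSize < sizeof(void *) ? sizeof(void *) : blockSize),
              storage_(blockSize_ * count) {
            // Thread every block onto the free list up front.
            for (std::size_t i = 0; i < count; ++i)
                release(storage_.data() + i * blockSize_);
        }
        void *acquire() {
            if (!head_) return nullptr;  // pool exhausted
            void *block = head_;
            head_ = *static_cast<void **>(head_);
            return block;
        }
        void release(void *block) {
            // NOTE: contents are NOT cleared - scrub sensitive data
            // yourself before the block is re-used.
            *static_cast<void **>(block) = head_;
            head_ = block;
        }
    private:
        std::size_t blockSize_;
        std::vector<unsigned char> storage_;
        void *head_ = nullptr;
    };
    ```

    Because the backing storage is one contiguous buffer sized at startup, there is nothing to fragment, which is the long-running-process argument made above.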

    What I mean is "by what mechanisms does the Qt networking system monitor those file descriptors?"

    QSocketNotifier, which if memory serves me is a powered-up select() over a file descriptor. In any case I like its API much more than the native one ... no surprise there.

    Because if you're dealing with real scale you need to be 100% aware of what your server is doing each step of the way - and having an abstraction layer (such as Qt) is generally a bad idea.

    Or you could dig in and know how the abstraction layer works. In any case Qt, if I recall correctly, only wraps the system API for TCP; at least for the desktop platforms, I have no idea what the state is with embedded.

    I'll just have to experiment a bit.

    Good idea. We can argue the theory all we want, in the end the experiment will decide what's the gain or loss of using Qt for this.

    Talking about it is fun! Making it less so (but still fun...) ;)

    Cheers!



  • Why not prefer an event-based server, or combine an event-based server with threads? Like listening for connections in the main thread and using a thread pool with a limited number of threads to handle the incoming connections? Qt5 supports asynchronous operation and provides a mature event queue, thread pool and so on. Combining the power of event-based + multi-threaded should not be too difficult with Qt5.

    Any drawbacks compared with a thread-per-client server? Thanks

    ps : Not an expert in network programming, but I have some interest in server design problems. I only want to learn something from this post.


  • Qt Champions 2016

    @tham said in Using Qt in a 25k connection tcp/ip socket server:

    Why not prefer an event-based server, or combine an event-based server with threads? Like listening for connections in the main thread and using a thread pool with a limited number of threads to handle the incoming connections?

    That'd be my solution of choice, yes. You'd be best off striving to use a number of threads equal to the number of cores on your CPU. Then you have (almost) no overhead from "emulating" multithreading on a single hardware thread (i.e. what context switches are). Of course I assume reentrancy throughout. Locking and waiting is a different kettle of fish entirely.
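
    Sizing such a pool to the core count can be done portably; a minimal sketch (hardware_concurrency() may legitimately return 0 when the count is unknown, hence the fallback):

    ```cpp
    #include <algorithm>
    #include <cstdio>
    #include <thread>

    // One worker per hardware thread, with a sane minimum when the
    // runtime cannot determine the core count.
    unsigned poolSize() {
        return std::max(1u, std::thread::hardware_concurrency());
    }

    int main() {
        std::printf("pool size: %u worker threads\n", poolSize());
        return 0;
    }
    ```

    With Qt, the same number could be fed to QThreadPool::setMaxThreadCount(); QThreadPool's default is in fact based on the detected core count.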

    Any drawbacks compared with a thread-per-client server?

    Not readily I could see, no.

    Not an expert in network programming

    Me neither. I'm a lowly physicist. ;)

    @VRHans said in Using Qt in a 25k connection tcp/ip socket server:

    Once you start building it, it's less fun...

    I can give you half the code already. I have an example TCP server (thread-per-client) for my daemon module, here

    I'm not sure what you mean here, but if you plan to have 50k connections, you need to ensure you allow enough FDs to handle the sockets. What I have done in the past is have configuration values MAX_CONNECTIONS and MAX_PENDING_CONNECTIONS, add them together at startup, then add that to the system default FD count to get the total number allowed, and write that to the system configuration file. This ensures you don't exhaust file descriptors (I don't know what the default is nowadays, but it used to be less than 2048 on many systems.)

    Yes, that's my bad introducing the ambiguity. I meant the descriptors your threads will consume, while you were talking about the socket descriptors. Don't forget practically everything on Linux goes with a file descriptor, processes and threads included.

    Yes you do: when you accept a new connection, you acquire a thread from the thread pool and accept the socket on that thread. When the client disconnects or is disconnected, you clean up on that thread and return it to the thread pool.

    Okay, this I understand. We agree on this, pooling is good (Qt provides for it readily from the core module, QThreadPool).

    It used to be, circa 2003, but the expense is lower now; more importantly, at a given scale the relative costs of a context switch versus poor data locality and increased connection handling swap - and the context switch becomes the more desirable trade-off.

    It depends. In generalized desktop performance or, worst case, game development, context switches are bad - primarily because the relative cache costs are low. When you're handling 250,000 client connections, the context switching is unimportant because other costs are much more critical to your performance. At this scale, your goal is to update each client connection every 200-400 ms.

    You should already provide good data locality and this is all independent of how many threads you run. Even one thread's performance degrades when you invalidate the fetched cache lines constantly (i.e. you're going all over the RAM). All this said, a context switch can never be desirable. I'll give you a very simple example that illustrates the issue. Consider the following simple program:

    int main(int, char **)
    {
        int sum = 0, sumsum = 0;
        for (int i = 0; true; i++)  {
            sum += i;
            sumsum += sum;
        }
    
        return 0;
    }
    

    Right, so g++ generates the following (extracted from Qt Creator) for the above code (AT&T flavor):

            2 [1]	{
    0x400666                   55                    push   %rbp
    0x400667  <+0x0001>        48 89 e5              mov    %rsp,%rbp
    0x40066a  <+0x0004>        89 7d ec              mov    %edi,-0x14(%rbp)
    0x40066d  <+0x0007>        48 89 75 e0           mov    %rsi,-0x20(%rbp)
            3 [1]	    int sum = 0, sumsum = 0;
    0x400671  <+0x000b>        c7 45 fc 00 00 00 00  movl   $0x0,-0x4(%rbp)
    0x400678  <+0x0012>        c7 45 f8 00 00 00 00  movl   $0x0,-0x8(%rbp)
            4 [1]	    for (int i = 0; true; i++)  {
    0x40067f  <+0x0019>        c7 45 f4 00 00 00 00  movl   $0x0,-0xc(%rbp)
            5 [1]	        sum += i;
    0x400686  <+0x0020>        8b 45 f4              mov    -0xc(%rbp),%eax
    0x400689  <+0x0023>        01 45 fc              add    %eax,-0x4(%rbp)
            6 [1]	        sumsum += sum;
    0x40068c  <+0x0026>        8b 45 fc              mov    -0x4(%rbp),%eax
    0x40068f  <+0x0029>        01 45 f8              add    %eax,-0x8(%rbp)
            4 [2]	    for (int i = 0; true; i++)  {
    0x400692  <+0x002c>        83 45 f4 01           addl   $0x1,-0xc(%rbp)
            5 [2]	        sum += i;
    0x400696  <+0x0030>        eb ee                 jmp    0x400686 <main(int, char**)+32>
    

    Basically you have a couple of instructions for the additions in the loop, an addition for the counter and of course the unconditional jump that is the for loop itself. So why did I pull that ugliness out? I want you to consider what the scheduler does when it performs a context switch (e.g. we are running that nice loop but the OS decides we have a thread that we have to yield to). Suppose we are at +0x0023, we just loaded the value of i into the eax register and we are ready to do the addition, then ... oh, no! we got our CPU time pulled out from under us. So it looks like this:

    0x400686  <+0x0020>        8b 45 f4              mov    -0xc(%rbp),%eax
    # Yeah baby! We are ready to rumble. We have i and will now do additiooon!
    #< Context switch: Thread was suspended.
    #< Load new instruction cache line
    #< Invalidate registers' values
    #< Invalidate data cache
    # -- ( Run other thread ... )
    # -- ( Suspend other thread - we are going to be run again ...)
    #< Load our instruction cache line (we had it, but we lost it when we were suspended)
    #< Invalidate registers' values (we had them, lost them, same reason).
    #< Invalidate data cache (... do you get where I'm going with this ... )
    #< Context switched back: Thread resumed; Reentry:
    0x400689  <+0x0023>        01 45 fc              add    %eax,-0x4(%rbp)
    # Hm, I thought we were going to do additions, instead we were sleeping forever waiting for some other thread ...
    

    Cool, huh? Now imagine you have 10k threads fighting for 4 or 8 hardware threads (cores). What'd be happening (especially on Linux, where you don't have thread priorities) is this:

    # We want to run our loop
            4 [1]	    for (int i = 0; true; i++)  {
    0x40067f  <+0x0019>        c7 45 f4 00 00 00 00  movl   $0x0,-0xc(%rbp)
            5 [1]	        sum += i;
    0x400686  <+0x0020>        8b 45 f4              mov    -0xc(%rbp),%eax
    #< Context switch (yield to another thread); Reentry:
    0x400689  <+0x0023>        01 45 fc              add    %eax,-0x4(%rbp)
    #< Context switch (yield to another thread); Reentry:
            6 [1]	        sumsum += sum;
    0x40068c  <+0x0026>        8b 45 fc              mov    -0x4(%rbp),%eax
    0x40068f  <+0x0029>        01 45 f8              add    %eax,-0x8(%rbp)
    #< Context switch (yield to another thread); Reentry:
            4 [2]	    for (int i = 0; true; i++)  {
    0x400692  <+0x002c>        83 45 f4 01           addl   $0x1,-0xc(%rbp)
    #< Context switch (yield to another thread); Reentry:
            5 [2]	        sum += i;
    #Finally loop around.
    0x400696  <+0x0030>        eb ee                 jmp    0x400686 <main(int, char**)+32>
    # Oh, no, we had context switching all over and our cache lines were practically useless, because we needed to refetch them constantly ... :(
    

    I hope I made my point.

    What I mean is that after about 6,000 connections, the kernel's default methodology for handling FDs becomes inefficient.

    And you think one that you write would be more efficient? Forgive me for saying so, but I find this dubious.

    It's why, when you run an Apache server (for example) and you begin struggling at 8k clients, and you decide to double the performance of the machine (memory, CPU, whatever), you see only a slight increase in effective connection support (e.g. 9k clients despite doubling the hardware performance).

    If you mean a typical Apache server (i.e. one that forks itself for each client), well then it's small wonder. There's a lot of work that goes into forking a process (a fair bit more than starting a thread, which is itself somewhat heavy on Linux). And all the above considerations also apply.
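
    The relative weight of forking versus starting a thread is easy to measure; a rough POSIX-only micro-benchmark sketch (the numbers vary a lot between machines, so no expected output is claimed):

    ```cpp
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <sys/wait.h>
    #include <unistd.h>

    using Clock = std::chrono::steady_clock;

    // Time n fork()+waitpid() round trips, in microseconds.
    long forkCostUs(int n) {
        auto t0 = Clock::now();
        for (int i = 0; i < n; ++i) {
            pid_t pid = fork();
            if (pid == 0) _exit(0);      // child does nothing and exits
            waitpid(pid, nullptr, 0);
        }
        return std::chrono::duration_cast<std::chrono::microseconds>(
                   Clock::now() - t0).count();
    }

    // Time n thread create+join round trips, in microseconds.
    long threadCostUs(int n) {
        auto t0 = Clock::now();
        for (int i = 0; i < n; ++i)
            std::thread([] {}).join();
        return std::chrono::duration_cast<std::chrono::microseconds>(
                   Clock::now() - t0).count();
    }

    int main() {
        const int n = 200;
        std::printf("%d forks:   %ld us\n", n, forkCostUs(n));
        std::printf("%d threads: %ld us\n", n, threadCostUs(n));
        return 0;
    }
    ```

    Whichever way the ratio comes out on a given box, the benchmark makes the earlier point concrete: per-client process creation carries a fixed cost that a pooled design pays only once.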

    Hehe, you've struck on one of the approaches for speeding things up - DMA between user space (polling FDs yourself directly rather than through the kernel) and the NIC!

    I don't really find this a good idea. It basically means you expose the hardware to the user space, going around kernel and drivers and all. Sounds wrong.

    For the very same caching reasons I mention above, and to avoid allocation/de-allocation entirely

    You can't do that. If you need fast allocations/deallocations then use the stack! A stack allocation is the weight of a single addition instruction.

    plus it's very important for long-running processes to mitigate memory fragmentation issues rather than relying on the "just dump the process and create a new one" method.

    Linux already defragments your heap pretty well.

    Talking about it is fun! Making it less so (but still fun...) ;)

    That came out as one long, long post. Check out my example app and see if it will help you with your experiments.

    Kind regards.



  • @tham That's what I'm doing for now in order to ascertain where things break down when using Qt.

    Sometimes this type of approach is the best for a given scenario (especially given the load you plan to support for individual requests), sometimes it's not.

    It's also the easiest way to go (which is attractive as well!)

