QTreeView with lots of items is really slow. Can it be optimised or is something buggy?

Cipherstream

Since it appeared that the cause of the issue was in the C++ Qt code I installed Qt C++ dev env and was able to build and debug the code that @IgKh provided.

I tracked down the cause of the many calls to rowCount and columnCount for the list view. They were coming from line 2600 onwards in:
https://code.qt.io/cgit/qt/qtbase.git/tree/src/widgets/itemviews/qlistview.cpp?h=6.8

The method QListModeViewBase::doStaticLayout() iterates over every row in the model to precalculate offsets for items in the list. This code does an optimisation where if gridSize has been set it will use the grid size for all items instead of calculating it for every single item. I confirmed that this optimisation had an effect by adding the line self.view.setGridSize( QtCore.QSize(18,18) ) to my python code.

The list view already has another existing optimisation for item sizes enabled by doing self.view.setUniformItemSizes(True). This "fixed item size" flag should also be used in this setup code in the same way the fixed size from a grid setting is used.

However if you remember, list view was calling rowCount and columnCount twice for every item in the model. The 2nd calls are done in this same method when checking if the row is hidden. The same loop that iterates over all items in the model also does an "is hidden" check on every item. The "is hidden" check looks to see if each item is present in the list of "hidden rows". An optimisation that can be done for this is to check if that list of hidden rows is empty, in which case no items are hidden and so it doesn't need to individually check whether every single item is hidden.

The first optimisation makes sense since if a "uniform item size" optimisation flag is set, then the code shouldn't check every single item in the list for its size!
The second optimisation might be up for debate as to whether it should be included or not. It does make a drastic change in processing time for python list views with lots of items however.

It turns out that Qt is very easy to build from source on windows, so I was able to make a custom Qt build and test my optimisations with my original python code.

Without either optimisation:
time taken to display list with 70000000 items = 358.775 seconds

With "size" only optimisation:
time taken to display list with 70000000 items = 198.576 seconds

With "size" and "hidden" optimisations:
time taken to display list with 70000000 items = 0.992 seconds

So the python code went from taking 6 minutes preprocessing before showing the view to just under 1 second.

Christian Ehrlicher

Wrt the rowCount calls: https://codereview.qt-project.org/c/qt/qtbase/+/601341

Cipherstream

I looked into optimising the TreeView and unfortunately I can't see any way to optimise it without a major rewrite. :(

I assumed there was an issue with ListView as it did not make sense that a 1D list of items would take longer than the 2D TableView. This assumption turned out to be correct, and I was hoping that upon finding the fix it may also apply to TreeView.

The TreeView slowdown also occurs during its layout function QTreeViewPrivate::layout() where it loops over every row calling model->index() directly and model->rowCount.() via QTreeViewPrivate::hasVisibleChildren(). It is again the huge number of calls to these model functions across the C++/python barrier (due to the model being in python code) that cause the slowdown.

Interestingly while looking into the TreeView code I saw that it did implement an "is hidden" optimisation like I was suggesting for ListView. Based off that I think both ListView optimisations I suggested should be added to the source code. I'll add a post below with the changes.

Cipherstream

@Christian-Ehrlicher

These are the two optimisations that I feel should be made for ListView. Would you be able to submit them for me?
(Feel free to change them as needed)

1) uniform / fixed size optimisation

qtbase\src\widgets\itemviews\qlistview.cpp: line 2561
original:

void QListModeViewBase::doStaticLayout(const QListViewLayoutInfo &info)
{
    const bool useItemSize = !info.grid.isValid();
    const QPoint topLeft = initStaticLayout(info);
    QStyleOptionViewItem option;
    initViewItemOption(&option);
    option.rect = info.bounds;
    option.rect.adjust(info.spacing, info.spacing, -info.spacing, -info.spacing);

    // The static layout data structures are as follows:
    // One vector contains the coordinate in the direction of layout flow.
    // Another vector contains the coordinates of the segments.
    // A third vector contains the index (model row) of the first item
    // of each segment.

    int segStartPosition;
    int segEndPosition;
    int deltaFlowPosition;
    int deltaSegPosition;
    int deltaSegHint;
    int flowPosition;
    int segPosition;

    if (info.flow == QListView::LeftToRight) {
        segStartPosition = info.bounds.left();
        segEndPosition = info.bounds.width();
        flowPosition = topLeft.x();
        segPosition = topLeft.y();
        deltaFlowPosition = info.grid.width(); // dx
        deltaSegPosition = useItemSize ? batchSavedDeltaSeg : info.grid.height(); // dy
        deltaSegHint = info.grid.height();
    } else { // flow == QListView::TopToBottom
        segStartPosition = info.bounds.top();
        segEndPosition = info.bounds.height();
        flowPosition = topLeft.y();
        segPosition = topLeft.x();
        deltaFlowPosition = info.grid.height(); // dy
        deltaSegPosition = useItemSize ? batchSavedDeltaSeg : info.grid.width(); // dx
        deltaSegHint = info.grid.width();
    }

    for (int row = info.first; row <= info.last; ++row) {
        if (isHidden(row)) { // ###
            flowPositions.append(flowPosition);
        } else {
            // if we are not using a grid, we need to find the deltas

optimised

void QListModeViewBase::doStaticLayout(const QListViewLayoutInfo &info)
{
    const bool useItemSize = !info.grid.isValid() && !uniformItemSizes();
    const QPoint topLeft = initStaticLayout(info);
    QStyleOptionViewItem option;
    initViewItemOption(&option);
    option.rect = info.bounds;
    option.rect.adjust(info.spacing, info.spacing, -info.spacing, -info.spacing);
    const QSize uniformSize = (uniformItemSizes() && cachedItemSize().isValid())
                        ? cachedItemSize() : itemSize(option, modelIndex(info.first));
    const QSize fixedSize = info.grid.isValid() ? info.grid : uniformSize;

    // The static layout data structures are as follows:
    // One vector contains the coordinate in the direction of layout flow.
    // Another vector contains the coordinates of the segments.
    // A third vector contains the index (model row) of the first item
    // of each segment.

    int segStartPosition;
    int segEndPosition;
    int deltaFlowPosition;
    int deltaSegPosition;
    int deltaSegHint;
    int flowPosition;
    int segPosition;

    if (info.flow == QListView::LeftToRight) {
        segStartPosition = info.bounds.left();
        segEndPosition = info.bounds.width();
        flowPosition = topLeft.x();
        segPosition = topLeft.y();
        deltaFlowPosition = fixedSize.width(); // dx
        deltaSegPosition = useItemSize ? batchSavedDeltaSeg : fixedSize.height(); // dy
        deltaSegHint = fixedSize.height();
    } else { // flow == QListView::TopToBottom
        segStartPosition = info.bounds.top();
        segEndPosition = info.bounds.height();
        flowPosition = topLeft.y();
        segPosition = topLeft.x();
        deltaFlowPosition = fixedSize.height(); // dy
        deltaSegPosition = useItemSize ? batchSavedDeltaSeg : fixedSize.width(); // dx
        deltaSegHint = fixedSize.width();
    }

    for (int row = info.first; row <= info.last; ++row) {
        if (isHidden(row)) { // ###
            flowPositions.append(flowPosition);
        } else {
            // if we are not using a fixed size, we need to find the deltas

2) is hidden optimisation

qtbase\src\widgets\itemviews\qlistview_p.h: line 358
original:

    inline bool isHidden(int row) const {
        QModelIndex idx = model->index(row, 0, root);
        return isPersistent(idx) && hiddenRows.contains(idx);
    }

optimised:

    inline bool isHidden(int row) const {
        if (hiddenRows.isEmpty())
            return false;
        QModelIndex idx = model->index(row, 0, root);
        return isPersistent(idx) && hiddenRows.contains(idx);
    }

Explanation:

QListModeViewBase::doStaticLayout() is called when the view is initially setup or resized. It loops over every item in the model calculating item layout based off the size and hidden attributes for each item. Currently the view is accessing the model once per item to calculate the size and then a second time per item when checking if it is hidden.

When using python wrappers, calls between C++ code and python code is not as "cheap" as calls between C++ code. This potentially applies to other wrappers for Qt too.

Now look at the case where you have a list view in C++ code accessing a model in python code, and the model has 10 million items. This will do 20 million calls beteen C++ and python code every time QListModeViewBase::doStaticLayout() is called. This creates a huge lag when initially showing or resizing a list view.

Optimisation Recommendations:

QListView already has an optimisation flag for 'uniformItemSizes'. If this is set then it can assume all items are the same size and should not have to get the size individually for each item. This flag and behaviour should be used in QListModeViewBase::doStaticLayout().
QTreeView already has an optimisation for QTreeViewPrivate::isRowHidden() which first checks if the list of hidden rows is empty, in which case there are no hidden rows and so no need to check each individual item to see if they are hidden. This behaviour should be added to QListViewPrivate::isHidden().

Optimisation Test Results:

A listview and model were created with 70 million items.
The time taken by doStaticLayout() to process this before showing it was:

Using python wrappers:

359 seconds before optimisation
199 seconds with "fixed size" optimisation
1 second with "fixed size" and "is hidden" optimisation

In C++ code:

2.2 seconds before optimisation
1.5 seconds after both optimisations

So even the C++ code gets a 30% speed boost.
This was on a pretty fast PC, on a slower device the 30% speed boost in C++ code might be more meaningful.

Cipherstream

While looking through the ListView code I realised that it also had item limits due to INT_MAX size (ie the largest value a signed 32bit integer can hold).

This occurs due to the values calculated and stored in QList<int> QListModeViewBase::flowPositions;. These values are basically item index * row height (for a normal top to bottom list).

The default row height appears to be 18 and the value for INT_MAX is 2147483647. So 2147483647 / 18 = 119,304,647

So the maximum number of items for ListView is just under 120 million items. I tested this and indeed it fails to handle 120m items.

A fix for this might be to use 64bit signed integers like
```QList<int64_t> QListModeViewBase::flowPositions;``, and then check anywhere where flowPositions values are handled to make sure that are treated correctly for 64bit values.

JonB

@Cipherstream
Your work on tracking down and optimising has been amazing. I would not want in any way to detract from that, but may I make a couple of general observations.

You keep pointing the figure at "calls between C++ and python code" as being a big issue. But earlier I found and suggested to you this is not the case? Scroll up to my https://forum.qt.io/post/813335 where I claim timings show there is "no" overhead across the language boundaries. If you do millions of direct, Python calls to some standalone nonVirtualRowCount() function, no C++ involved, I find exactly the same timing.
You were only able to improve (some of) the speed by being prepared to go into C++ and change the source of Qt. That improved C++ by 30%, which is absolutely not to be sniffed at, but Python by a factor of 360x! That indicates to me that Python has a problem, and in general Qt is written for C++. Nobody notices the overhead of calling rowCount() because it's either inlined or so fast as to be insignificant. But if it is indeed the case that calling a simple, tiny Python function is that much slower than C++ (which I did not know and am horrified at the timings) then you may always encounter other places where something innocuous in C++ is turning out to be horrific in Python.

As I say, not in any way to dismiss the case you have examined and the enormous improvement you have achieved by changing the C++ code. Just that there may always be issues from Python if speed is critical.

IgKh

@JonB said in QTreeView with lots of items is really slow. Can it be optimised or is something buggy?:

which I did not know and am horrified at the timings

A kind of poorly kept secret is that CPython (which is pretty much is Python) is considered to be a mere "reference implementation" and its' maintainers put implementation simplicity above performance, or at least that was the case in the past. The implication of this is that CPython is horrendously slow, even for a dynamic scripting language. There is no JIT, every line of bytecode is interpreted as it is executed. There is no inlining, function and method calls are dynamically dispatched by looking for string keys in a series of dictionaries for every call. Absolutely every little thing is allocated on the heap with the full PyObject overhead and has reference counting. FFI calls into Python require acquiring the Global Interpreter Lock, which is not huge overhead on your typical GUI app (as the GIL is normally not contended), but is not free.

This situation led to the fact that most high-performance Python toolkits and libraries are implemented in C, Fortran, C++ or Rust and are exposed via bindings. PyQt / PySide included. For a Wireshark-like application like @Cipherstream is describing, that's probably the approach that will be the path of least resistance too - write the data layer in C++ (the data model, and the custom delegate which you'll probably end up needing) and expose those to Python via Shiboken. Leave to Python the overall construction and management of the UI, where it has the relative advantage.

Cipherstream

@JonB
Yes it seems python really shows its speed limitations once you start calling things 10s of millions of times xD

You are right with your timings showing that python calls by themselves are indeed slow. In my last writeup I am trying to point out that when rowCount etc are called within C++ code, there is minimal cost. So the devs might easily overlook (or not care) about calling them one or more times for every item in the model. It it not until these calls are routed out of the fast C++ code and into the slow python code that it becomes really noticable.

With that said, the fact that there is an optimisation flag for "uniform sizes" means that it must have come up as an issue at some point. So it makes sense for the code to use that same flag to skip variable size processing during layout too. The same goes for the "is hidden" processing where treeview already implements the optimisation I put forward for use in llistview. So it must also have come up as being something worth optimising in the past.

Cipherstream

@IgKh
I have been tossing up how to proceed, since I do need to use a treeview for some parts of my app. I wanted to make the app in python so that it would easily work cross platform.

Stay 100% python and just have slow load times for big files. Smaller files are still very quick.
Have C++ module for bottleneck code (like you suggest), but then once there is some C++ code it makes it less easily cross platform compatible.
Make it fully C++

I am not really happy with any of the options hehe, so progress has stalled for now. Maybe if I went with option 2 but made the C++ code be part of a pip installable module, it would still stay easily cross platform...

IgKh

@Cipherstream Option #2 is really not a big issue. More Python packages than you'd guess have a compiled component, and Python packaging tools handle that case well by building binary wheels. (As much as anything about the Python packaging ecosystem can be described as "well")