How to scroll QWebEnginePage efficiently?



  • I am trying to scroll the search results of Bing image search to the end, but I have not found a working solution yet.

    bing_image_search::bing_image_search(QObject *parent) : //initialize....
    {
        auto *web_page = &get_web_page();
        connect(web_page, &QWebEnginePage::scrollPositionChanged,
                this, &bing_image_search::web_page_scroll_position_changed);
    }
      
    void bing_image_search::parse_page_link(QPointF const &point)
    {
        if(state_ != state::parse_page_link){
            return;
        }
    
        get_web_page().toHtml([this, point](QString const &contents)
        {
            qDebug()<<"get image link contents";
            QRegularExpression reg("(search\\?view=detailV2[^\"]*)");
            auto iter = reg.globalMatch(contents);
            QStringList links;
            while(iter.hasNext()){
                QRegularExpressionMatch match = iter.next();
                if(match.captured(1).right(20) != "ipm=vs#enterinsights"){
                    QString url = QUrl("https://www.bing.com/images/" + match.captured(1)).toString();
                    url.replace("&amp;", "&");
                    links.push_back(url);
                }
            }
            links.removeDuplicates();
            qDebug()<<"total match link:"<<links.size();
            if(links.size() > img_page_links_.size()){
                links.swap(img_page_links_);
            }
            if((size_t)img_page_links_.size() >= max_search_size_){
                state_ = state::parse_img_link;
            }else{
                get_web_page().findText("See more images", QWebEnginePage::FindFlag(), [this](bool found)
                {
                    if(found){
                        qDebug()<<"found See more images";
                        get_web_page().runJavaScript("document.getElementsByClassName(\"btn_seemore\")[0].click();"
                                                     "window.scrollTo(0, document.body.scrollHeight);");
                    }else{
                        qDebug()<<"cannot find See more images";
                        get_web_page().runJavaScript(js_scroll_to_window_height(1000), [this](QVariant const &result)
                        {
                            qDebug()<<"scroll page result:"<<result;
                            if(!result.toList()[0].toBool()){
                                state_ = state::parse_img_link;
                                parse_imgs_link();
                            }
                        });
                    }
                });
            }
        });
    }
    
    void bing_image_search::scroll_web_page(QPointF const &point)
    {
        // A timer is needed when the web view is shown on screen: the view may
        // not update in time, so the page may not yet have enough room to scroll
        // down, and the signal scrollPositionChanged is then never emitted.
        // TODO: replace this poor workaround
        QTimer::singleShot(1000, [=]()
        {
            if(state_ == state::parse_page_link){
                parse_page_link(point);
            }
        });
    }
      
    void bing_image_search::web_page_scroll_position_changed(const QPointF &point)
    {
        static size_t index = 0;
        qDebug()<<index++<<":"<<point.y();
        scroll_web_page(point);
    }
    

    JavaScript for "js_scroll_to_window_height":

    namespace{
    
    QString doc_height()
    {
        return QString(
                    "function doc_height(){"
                    "  return Math.max("
                    "    document.body.scrollHeight, document.documentElement.scrollHeight,"
                    "    document.body.offsetHeight, document.documentElement.offsetHeight,"
                    "    document.body.clientHeight, document.documentElement.clientHeight);"
                    "}"
                    );
    }
    
    }
    
    QString js_scroll_to_window_height(qreal offset)
    {
        return doc_height() + QString("\n"
                    "var dheight = doc_height();"
                    "function scrollPage(){"
                    "  var cur_height = window.innerHeight + window.pageYOffset;"
                    "  if(Math.abs(window.pageYOffset - document.body.scrollHeight) < %1){"
                    "    return [false, cur_height, dheight];"
                    "  }else{"
                    "    window.scrollTo(0, window.pageYOffset + %1);"
                    "    return [true, cur_height, dheight];"
                    "  }"
                    "}"
                    "scrollPage()").arg(offset);
    }
    

    I have two problems:

    1 : I cannot find a way to scroll down the web page without the help of a timer (function "scroll_web_page"). Is there a better way to scroll the page?
    2 : I tried the Stack Overflow solutions for detecting whether the browser window has scrolled to the bottom, but none of them worked as expected. I modified them a bit, but the result depends heavily on luck: sometimes the bottom is detected, sometimes it is not.

    PS: scrolling the page does not emit the loadFinished signal.
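
    A note on problem 2, offered as a sketch and not a confirmed fix: the check in "scrollPage" compares window.pageYOffset (the top edge of the viewport) with document.body.scrollHeight. At the bottom of a page, pageYOffset is roughly scrollHeight minus window.innerHeight, so that difference stays close to the viewport height and the test can only pass when the view is shorter than the offset. That may explain why detection looks like luck. The version below compares the bottom edge of the viewport instead. The names isAtBottom and scrollStep and the tolerance parameter are hypothetical and not part of the original code.

```javascript
// Pure helper: true when the bottom edge of the viewport is within
// `tolerance` pixels of the end of the document.
function isAtBottom(scrollY, viewportHeight, docHeight, tolerance) {
  return scrollY + viewportHeight >= docHeight - tolerance;
}

// Browser-side wrapper (hypothetical name). It touches `window` and
// `document`, so it is only meant to be called inside a web page,
// for example through QWebEnginePage::runJavaScript.
// Returns [scrolled, viewportBottom, docHeight], like the original script.
function scrollStep(step, tolerance) {
  var docHeight = Math.max(document.body.scrollHeight,
                           document.documentElement.scrollHeight);
  var viewportBottom = window.pageYOffset + window.innerHeight;
  if (isAtBottom(window.pageYOffset, window.innerHeight, docHeight, tolerance)) {
    return [false, viewportBottom, docHeight];
  }
  window.scrollBy(0, step);
  return [true, viewportBottom, docHeight];
}
```

    With a 1200 px tall view at the end of a 5000 px page, pageYOffset is 3800: the original test computes |3800 - 5000| = 1200, which is not below 1000, while isAtBottom(3800, 1200, 5000, 2) reports the bottom. A small tolerance is used because scroll positions can be fractional.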



  • One of the problems is that after I scroll the page, the height of the scroll bar may change, and the web view needs time to reflect the change. If I scroll the page too fast, the scrolling may end too early. No matter what I tried, I have to rely on a timer and tune some parameters for each specific search engine (Google, Bing, Flickr). Is this normal for web scraping, or did I do something wrong? (I hope I am wrong, because I do not like changing parameters here and there.) Thanks.
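
    One way to reduce the per-site tuning described above, sketched under assumptions: instead of treating a single fixed delay as "the page has finished growing", poll the document height and stop only once it has stayed unchanged for several consecutive polls. This still needs a timer to drive the polling, but the stop condition becomes the page's own behaviour instead of a delay tuned per site. The name makeHeightStabilityTracker and its parameter are hypothetical.

```javascript
// Returns a function that is fed the current document height on each poll.
// It reports true once the height has been identical for
// `requiredStablePolls` consecutive polls after it was first seen.
function makeHeightStabilityTracker(requiredStablePolls) {
  var lastHeight = -1;
  var stableCount = 0;
  return function (currentHeight) {
    if (currentHeight === lastHeight) {
      stableCount += 1;
    } else {
      // Height changed (new results were loaded): start counting again.
      lastHeight = currentHeight;
      stableCount = 0;
    }
    return stableCount >= requiredStablePolls;
  };
}
```

    In a page this could be driven by setInterval, feeding in document.documentElement.scrollHeight and scrolling again while the tracker returns false. The poll interval and the number of stable polls remain assumptions to choose, though they are less site-specific than a single scroll delay.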

