Download and regex parse an url source code

    Mr Gisa
    wrote on 10 May 2018, 16:10 last edited by
    #1

    I was wondering, how can I download a web page source code and get all the links in it?

    I have this regex here:

    ((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+(:[0-9]+)?|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)
    

    But now I need to know how to use it to parse the entire html code.

      Gojir4
      wrote on 10 May 2018, 16:16 last edited by Gojir4 10 May 2018, 16:17
      #2

      You can test your regex using a tool like Expresso to ensure it works correctly with your file content.

      But maybe QXmlQuery could be more efficient (and avoid some headaches) in this case.

        Mr Gisa
        wrote on 10 May 2018, 16:39 last edited by
        #3

        I couldn't find an example extracting urls using QXmlQuery.

          Gojir4
          wrote on 10 May 2018, 17:16 last edited by
          #4

          @Mr-Gisa I can provide some code that I used to parse Doxygen HTML files and extract some URLs from them (enums, functions, classes, etc.), but unfortunately not before tomorrow morning, as I have it at work...

            Gojir4
            wrote on 10 May 2018, 17:18 last edited by Gojir4 10 May 2018, 17:21
            #5

            Just to clarify, do you mean standard HTML URLs, like this: <a href="www.url.to/retrieve"> ?

              Mr Gisa
              wrote on 10 May 2018, 17:25 last edited by
              #6

              Yes, exactly.

                Gojir4
                wrote on 10 May 2018, 17:59 last edited by
                #7

                @Mr-Gisa I realize now that I pointed you to QXmlQuery and forgot about your regex. The regex in your first post looks overcomplicated. Did you try something like this:

                <a href="(?<url>[\w\.]*)">(?<label>.*)</a>
                

                Or do you need to match a very specific URL format? If so, can you show an example?
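
For plain <a href="..."> links, a simpler scan along these lines may be enough. This is only a sketch under stated assumptions: it uses std::regex from the C++ standard library instead of QRegularExpression so that it builds without Qt, the helper name extractHrefs is made up for illustration, and the pattern only covers double-quoted href values.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every double-quoted href
// attribute found on an <a ...> tag. This is a best-effort text scan,
// not an HTML parser.
std::vector<std::string> extractHrefs(const std::string &html)
{
    // "<a", at least one whitespace character, any attributes, then href="...".
    static const std::regex anchor(
        R"rx(<a\s[^>]*?href\s*=\s*"([^"]*)")rx",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), anchor), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());   // capture group 1 = href value
    }
    return urls;
}
```

Calling extractHrefs on the page source returns the href values in document order. Relative or scheme-less values such as www.url.to/retrieve come back unchanged, so resolving them against the page URL would be a separate step.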

                  Mr Gisa
                  wrote on 10 May 2018, 18:04 last edited by
                  #8

                  Yes, I need to match URLs in many formats. That is why I used the regex I posted: I found it to be more complete, and it matches a wide range of URLs.

                    Gojir4
                    wrote on 11 May 2018, 07:10 last edited by
                    #9

                    @Mr-Gisa If you are still interested, here is the full code of my application, which parses Doxygen HTML files using QXmlQuery. The code is not well commented, but feel free to ask if you need details. I will try to remember, because I wrote it a year ago... You will see that I sometimes mixed QXmlQuery and QRegularExpression to achieve my goal.

                    That's not exactly what you are trying to achieve, but it may help you understand how QXmlQuery works.
                    I cannot provide the Doxygen HTML files for confidentiality reasons, but you can easily generate some from an arbitrary project or C++ header file.

                    #include <QBuffer>
                    #include <QDebug>
                    #include <QCommandLineParser>
                    #include <QCoreApplication>
                    #include <QFileInfo>
                    #include <QTextStream>
                    #include <QXmlQuery>
                    #include <QJsonDocument>
                    #include <QJsonObject>
                    #include <QRegularExpression>
                    #include <iostream>
                    
                    #ifdef WIN32
                    #include <windows.h>
                    HANDLE hOut = NULL;
                    HANDLE hErr = NULL;
                    #endif
                    
                    QTextStream o(stdout);
                    QTextStream e(stderr);
                    
                    //const QString INDEX_QUERY = "for $a in doc($index)/html/body/div/div/table/tr/td/a\n"
                    //                            "where $a/@class = 'el'\n"
                    //                            "order by $a/text()\n"
                    //                            "return concat($a/text(), '-', $a/@href, ';')";
                    
                    const QString INDEX_QUERY = "for $item in doc($root)/html/body/div/div/table/tr/td/a\n"
                                                "where $item[@target]\n"
                                                "return concat($item/text(), '-', $item/@href, '\n')";
                    
                    
                    const QString CLASS_QUERY = "for $item in doc($root)/html/body/div/table/tr\n"
                                                "where contains($item/@class, 'memitem:')\n"
                                                "return $item";
                    
                    const QString MEMBER_QUERY = "for $item in doc($root)/html/tr/td/a\n"
                                                 "return concat($item/text(), \'-\', $item/@href,'\n')";
                    
                    const QString GLOBAL_INDEX_QUERY = "for $item in doc($root)/html/body/div/table/tr/td/a\n"
                                                       "where $item/@class = 'el' and $item[@href] and $item[not (@title)]\n"
                                                       "return concat($item/text(), '-', $item/@href, '\n')";
                    
                    const QString ENUM_QUERY = "for $item in doc($root)/html/body/div/table/tr\n"
                                               "where $item/td/text()[contains(., 'enum' )]\n"
                                               "return concat($item/td[@class = 'memItemRight'], ';')";
                    
                    const QString ENUM_QUERY_URL_FILTER = "for $item in doc($root)/html/body/div/table/tr\n"
                                                          "where $item/td/text()[contains(., 'enum' )]\n"
                                                          "return $item/td[@class = 'memItemRight']";
                    
                    const QString ENUM_URL_QUERY = "for $item in doc($root)/html/td\n"
                                                   "return $item/a[1]/@href/string()";
                    
                    
                    
                    const QString CLASS_HEADER_METHODS = "pub-methods";
                    const QString CLASS_HEADER_ATTRIBS = "pub-attribs";
                    const QString CLASS_HEADER_CONSTANTS = "pub-types";
                    const QString CLASS_HEADER_SIGNALS = "signals";
                    
                    enum ConsoleColors{
                        BLACK = 0,
                        DARK_BLUE = 1,
                        DARK_GREEN = 2,
                        DARK_CYAN = 3,
                        DARK_RED = 4,
                        DARK_MAGENTA = 5,
                        DARK_YELLOW = 6,
                        LIGHT_GREY = 7,
                        DARK_GREY = 8,
                        BLUE = 9,
                        GREEN = 10,
                        CYAN = 11,
                        RED = 12,
                        MAGENTA = 13,
                        YELLOW = 14,
                        WHITE = 15
                    
                    };
                    
                    
                    void msg(const QString &msg, int color = WHITE){
                    #ifdef WIN32
                        SetConsoleTextAttribute(hOut, color);
                    #endif
                        o << msg << endl;
                        o.flush();
                    #ifdef WIN32
                        SetConsoleTextAttribute(hOut, WHITE);
                    #endif
                    
                    }
                    int err(const QString &msg, int color = RED){
                    #ifdef WIN32
                        SetConsoleTextAttribute(hErr, color);
                    #endif
                        e << msg << endl;
                        e.flush();
                    #ifdef WIN32
                        SetConsoleTextAttribute(hErr, WHITE); // reset the handle that was coloured above
                    #endif
                        return 1;
                    }
                    
                    bool evaluateFile(const QString &filepath, const QString &query, QString &result, const QString &varName = "root"){
                        QString str;
                        QBuffer buf;
                        QFile f(filepath);
                        if(!f.open(QFile::ReadOnly)){
                            err(QString("Cannot open %1").arg(f.fileName()));
                            return false;
                        }
                    
                        str = QString(f.readAll());
                        f.close();
                        
                        str.replace("<html xmlns=\"http://www.w3.org/1999/xhtml\">", "<html>");
                        buf.setData(str.toUtf8());
                        buf.open(QIODevice::ReadOnly);
                    
                        QXmlQuery indexQuery;
                        indexQuery.bindVariable(varName, &buf);
                        indexQuery.setQuery(query);
                    
                        if(!indexQuery.evaluateTo(&result)){
                            return false;
                        }
                        return true;
                    }
                    
                    bool evaluateString(const QString &s, const QString &query, QString &result, const QString &varName = "root", bool wrapInHtml = true){
                        QString str = s;
                        QBuffer buf;
                        if(wrapInHtml){
                            str = "<html>\n" + str + "\n</html>";
                        } else {
                            str.replace("<html xmlns=\"http://www.w3.org/1999/xhtml\">", "<html>");
                        }
                        buf.setData(str.toUtf8());
                        buf.open(QIODevice::ReadOnly);
                    
                        QXmlQuery indexQuery;
                        indexQuery.bindVariable(varName, &buf);
                        indexQuery.setQuery(query);
                    
                        if(!indexQuery.evaluateTo(&result)){
                            return false;
                        }
                        result = result.trimmed();
                        return true;
                    }
                    
                    struct ClassFile{
                        QString file;
                        QString className;
                    
                        bool operator ==(const ClassFile &other) const {
                            return this->file == other.file && this->className == other.className;
                        }
                    };
                    
                    int main(int argc, char *argv[])
                    {
                        QCoreApplication a(argc, argv);
                    
                    #ifdef WIN32
                        hOut = GetStdHandle(STD_OUTPUT_HANDLE);
                        hErr = GetStdHandle(STD_ERROR_HANDLE);
                    #endif
                    
                        QCommandLineParser parser;
                        parser.setApplicationDescription("EM4315 Test Bench - Html Help Parser Helper");
                        parser.addHelpOption();
                        QCommandLineOption outputOption(QStringList() << "o" << "output", "Output file name, default is 'doc_links.json'", "file");
                        parser.addOption(outputOption);
                        parser.addPositionalArgument("annotated", QCoreApplication::translate("main", "Path to annotated.html file from doxygen"));
                        parser.addPositionalArgument("global", QCoreApplication::translate("main", "Path to group__gglobal.html file from doxygen"));
                    
                        // Process the actual command line arguments given by the user
                        parser.process(a.arguments());
                    
                        const QStringList posArgs = parser.positionalArguments();
                        if(posArgs.isEmpty()){
                            return err("Error: annotated.html path missing");
                        } else if(!QFileInfo::exists(posArgs.first())){
                            return err("Error: annotated.html file not found");
                        }
                    
                        QString path = posArgs.first();
                    
                        //Get object and types
                        //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                    
                        //Index file
                        QString result;
                        if(!evaluateFile(path, INDEX_QUERY, result)){
                            return err(QString("Index evaluation failed"));
                        }
                        msg(QString("Result: %1").arg(result));
                    
                        //Make list of each files for each class (object and types)
                        QList<ClassFile> classes;
                        for(const QString &s : result.split("\n", QString::SkipEmptyParts)){
                            int pos = s.indexOf("-");
                            if((pos < 0) || (pos >= (s.size() -1)))
                                continue;
                            ClassFile cf;
                            cf.className = s.mid(0, pos).trimmed();
                            cf.file = QFileInfo(path).absolutePath() + "/" + s.mid(pos + 1).trimmed();
                            if(classes.contains(cf))
                                continue;
                            classes << cf;
                            msg(QString("Class found: %1 - %2").arg(cf.className, cf.file), GREEN);
                        }
                    
                        //Process each class file and extract members
                        QVariantMap helpMap;
                        for(const ClassFile &cf : classes){
                    
                            helpMap.insert(cf.className, QFileInfo(cf.file).fileName());
                           
                            QString classRes;
                            if(!evaluateFile(cf.file, CLASS_QUERY, classRes)){
                                err(QString("Parsing error class: %1").arg(cf.className), YELLOW);
                                continue;
                            }
                    
                            QString enumRes;
                            if(!evaluateFile(cf.file, ENUM_QUERY, enumRes)){
                                err(QString("Enum parsing error class: %1").arg(cf.className), YELLOW);
                                enumRes = "";
                            } else {
                    
                                QString enumUrlFiltered;
                                if(!evaluateFile(cf.file, ENUM_QUERY_URL_FILTER, enumUrlFiltered)){
                                    err(QString("Enum parsing error class: %1").arg(cf.className), YELLOW);
                                    enumUrlFiltered = "";
                                }
                    
                                QString enumUrlRes;
                                //msg(enumUrlFiltered);
                                if(!evaluateString(enumUrlFiltered, ENUM_URL_QUERY, enumUrlRes)){
                                    err(QString("Enum url parsing error class: %1").arg(cf.className), YELLOW);
                                    enumUrlRes = "";
                                }
                    
                    
                                //Parse enum result
                                QStringList enums = enumRes.split(";", QString::SkipEmptyParts);
                                int i = 0 ;
                                QStringList enumUrls = enumUrlRes.split(" ", QString::SkipEmptyParts);
                                //QString currentUrl;
                                for(const QString &e : enums){
                                    if(i >= enumUrls.size())
                                        break;
                                    
                                    QRegularExpression htmlReplace("<br/>|<[/]{0,1}b>");
                                    QRegularExpression getEnumsRegEx("\\{([\\w\\W]+?)\\}");
                    
                                    QRegularExpressionMatch getEnumMatch = getEnumsRegEx.match(e);
                                    if(getEnumMatch.hasMatch()){
                                        QString content = getEnumMatch.captured();
                                        content = content.replace(htmlReplace, "");
                                        QStringList constants = content.split(",", QString::SkipEmptyParts);
                                        for(QString konst: constants){
                                            konst = konst.replace(QRegularExpression("^\\{[\\s\\\\n]*"), "").trimmed();
                                            int pos = konst.indexOf("=");
                                            const QString fullName =  cf.className + "." + (pos > 0 ? konst.left(pos).trimmed() : konst);
                    
                                            helpMap[fullName] = enumUrls.at(i).trimmed();
                                            //currentUrl = enumUrls.at(i).trimmed();
                                        }
                    
                                    }
                                    i++;
                                }
                            }
                    
                    
                    
                            //Parse class members        
                            QString memberRes;
                            if(!evaluateString(classRes, MEMBER_QUERY, memberRes)){
                                err(QString("Parsing error member: %1").arg(classRes), YELLOW);
                                continue;
                            }
                    
                            QStringList members = memberRes.split("\n", QString::SkipEmptyParts);
                            if(members.isEmpty())
                                continue;
                    
                            for(const QString &m : members){
                                int pos = m.indexOf("-");
                                if((pos < 1 )|| (pos >= (m.size()-1)))
                                    continue;
                    
                                if((cf.className == m.mid(0, pos).trimmed()) && cf.className.at(0).isUpper())
                                    helpMap.insert(cf.className, m.mid(pos + 1).trimmed());
                                else
                                    helpMap.insert(cf.className + "." + m.mid(0, pos).trimmed(), m.mid(pos + 1).trimmed());
                            }
                    
                        }
                    
                    
                    
                        //Get globals items
                        //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                        if(posArgs.size() >= 2){
                    
                            path = posArgs.at(1);
                            //Index file
                            QString globalRes;
                            if(!evaluateFile(path, GLOBAL_INDEX_QUERY, globalRes)){
                                return err(QString("Global index evaluation failed"));
                            }
                    
                            //Parse query result
                            //Format "name-url"
                            QStringList globals = globalRes.split("\n", QString::SkipEmptyParts);
                            if(!globals.isEmpty()){
                                for(const QString &m : globals){
                                    int pos = m.indexOf("-");
                                    if((pos < 1 )|| (pos >= (m.size()-1)))
                                        continue;
                                    helpMap.insert(m.mid(0, pos).trimmed(), m.mid(pos + 1).trimmed());
                                }
                            }
                    
                            msg(QString("Global result: %1").arg(globalRes));
                        }
                    
                        //qDebug() << "Help map: " << helpMap;
                        QString output = parser.value(outputOption);
                        if(output.isEmpty())
                            output = "doc_links.json";
                    
                        QJsonObject json = QJsonObject::fromVariantMap(helpMap);
                        QFile fOut(output);
                        QJsonDocument jdoc(json);
                        if(!fOut.open(QFile::WriteOnly))
                            return err(QString("Cannot open output file %1").arg(fOut.fileName()));
                    
                        fOut.write(jdoc.toJson());
                        fOut.close();
                    
                        return 0;
                    }
                    
                    
                      JonB
                      wrote on 11 May 2018, 07:30 last edited by JonB 11 May 2018, 07:35
                      #10

                      @Mr-Gisa
                      Just to be clear: web page source code is written in HTML + whatever (e.g. JavaScript, etc.). Qt does not have an HTML parser. You cannot reliably parse any HTML using regular expressions. You cannot reliably recognise what bits are really URLs in the web page with regular expressions.

                      What you can do is use regular expressions to "guess" what bits might be genuine URLs. You might find some spurious, "extra" ones, and you might miss some "genuine" ones. It might do better or worse on some HTML sources (e.g. Doxygen) than others.

                      So long as you are happy with this "approximation" that's OK. But you can never guarantee that you will get them all correctly.

                      I don't really know what @Gojir4 's code is doing. HTML is not XML, so if it's relying on HTML input being parseable as well-formed XML it will fail. XHTML is XML, and so will parse, but not many sites produce XHTML. Maybe Doxygen does, others do not.
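
To make the "spurious" and "missed" cases concrete, here is a small illustration. It is a sketch, not Qt code: it uses std::regex from the standard library, and the function name naiveHrefScan, the constant kSample and the example.* URLs are all made up for this demonstration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Naive scan for double-quoted href values; deliberately simple so the
// failure modes are easy to see.
std::vector<std::string> naiveHrefScan(const std::string &html)
{
    static const std::regex href(R"rx(href\s*=\s*"([^"]*)")rx",
                                 std::regex::icase);
    std::vector<std::string> found;
    for (std::sregex_iterator it(html.begin(), html.end(), href), end;
         it != end; ++it) {
        found.push_back((*it)[1].str());
    }
    return found;
}

// Sample input (invented for illustration):
//  - the first link sits inside an HTML comment, so a browser ignores it
//  - the second link uses single quotes, which the pattern does not cover
const std::string kSample =
    "<!-- <a href=\"http://old.example/\">gone</a> -->\n"
    "<a href='http://real.example/'>live</a>\n";
```

Running naiveHrefScan(kSample) reports only the commented-out link and skips the live one, which is exactly the kind of approximation described above. Widening the pattern narrows the gap but cannot close it.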

                        Gojir4
                        wrote on 11 May 2018, 08:03 last edited by
                        #11

                        @JonB You are right, Doxygen produces XHTML. I didn't notice that, sorry. I naively assumed that HTML and XML could be parsed the same way.

                          Gojir4
                          wrote on 11 May 2018, 08:13 last edited by
                          #12

                          From the QXmlQuery documentation:
                          "QXmlQuery is typically used to query XML data, but it can also query non-XML data that has been modeled to look like XML."

                          and then the code example parses an HTML file:

                            QXmlQuery query;
                            query.setQuery("doc('index.html')/html/body/p[1]");
                          

                          I'm a little bit confused about this right now.

                            JonB
                            wrote on 11 May 2018, 09:01 last edited by JonB 11 May 2018, 09:23
                            #13

                            @Gojir4
                            Yes, note the

                            "non-XML data that has been modeled to look like XML"

                            and, further down the page:

                            "The example uses QXmlQuery to match the first paragraph of an XML document and then output the result to a device as XML."

                            So (bearing in mind I know nothing about this!), what exactly does the doc('index.html') deliver? In http://doc.qt.io/qt-5/xmlprocessing.html I can see it mentions:

                            "When Qt XML Patterns loads an XML resource, e.g., using the fn:doc() function"

                            but I can't click on that. Where is fn:doc() documented?

                            EDIT
                            OK, fn:doc() is just an XQuery function for accessing the document object.

                            So that assumes that you already have a parsed document. All the examples I can see anywhere other than that example access a .xml file, not a .html one, which is as I would expect.

                            So I assume this will only work for you if the particular HTML file you pass happens to parse as XML, i.e. it's either XHTML in the first place, or the HTML it contains does not have anything HTML-but-not-XML in it (which may be the case for some HTML documents but not others).

                            Try putting, say, precisely <br> (and no </br>) somewhere in your HTML and see if it still parses? <br> is a common example of legal HTML, but is not legal in XHTML or XML...?
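
The <br> experiment can be illustrated without Qt. The following is a toy well-formedness check, not what QXmlQuery does internally: the function name tagsBalanced and the tag pattern are assumptions made for this sketch, and it ignores comments, CDATA sections and attribute values that contain '>'.

```cpp
#include <regex>
#include <string>
#include <vector>

// Toy check: returns true when every opening tag has a matching closing
// tag in the right order. Self-closing tags such as <br/> need no partner.
bool tagsBalanced(const std::string &markup)
{
    // Group 1: "/" for a closing tag, group 2: tag name,
    // group 3: "/" for a self-closing tag.
    static const std::regex tag(
        R"rx(<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>)rx");

    std::vector<std::string> open;   // stack of currently open tag names
    for (std::sregex_iterator it(markup.begin(), markup.end(), tag), end;
         it != end; ++it) {
        const std::string name = (*it)[2].str();
        const bool closing = (*it)[1].length() > 0;
        const bool selfClosing = (*it)[3].length() > 0;

        if (closing) {
            if (open.empty() || open.back() != name)
                return false;        // closing tag does not match the open one
            open.pop_back();
        } else if (!selfClosing) {
            open.push_back(name);
        }
    }
    return open.empty();
}
```

With this check, "<p>a<br/>b</p>" is accepted while "<p>a<br>b</p>" is rejected, because the unclosed br is still on the stack when </p> arrives. That is the same class of problem an XML parser would report for HTML-only markup.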

                              Gojir4
                              wrote on 11 May 2018, 09:07 last edited by
                              #14

                              @JonB I think the fn:doc is part of the XQuery/XPath specification

                                JonB
                                wrote on 11 May 2018, 09:16 last edited by
                                #15

                                @Gojir4
                                See my EDIT above.

                                  Gojir4
                                  wrote on 11 May 2018, 09:55 last edited by
                                  #16

                                  @JonB You are right: tags without a corresponding closing tag, such as <br>, are not handled by XQuery; you get the error "Opening and ending tag mismatch".
                                  But, depending on the input format, I think this could be handled easily by making some replacements in the HTML code before evaluating it with XQuery. That's what I did when I used XQuery.
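A minimal sketch of the kind of pre-processing replacement described here, assuming the only problem is a handful of void elements written without a closing slash. It uses std::regex rather than a Qt class so that it stays self-contained; the function name selfCloseVoidTags and the list of tags are illustrative assumptions, not code from this thread.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites void elements written HTML-style
// (e.g. <br>, <img src="a.png">) as self-closing tags (<br/>, <img src="a.png"/>)
// so that an XML parser no longer reports a tag mismatch for them.
// The tag list is an assumption; extend it for the input you actually have.
std::string selfCloseVoidTags(const std::string &html)
{
    static const std::regex voidTag(
        R"(<(br|hr|img|input|meta|link)\b([^>]*?)\s*/?>)",
        std::regex::icase);
    // $1 is the tag name, $2 is the (possibly empty) attribute text.
    return std::regex_replace(html, voidTag, "<$1$2/>");
}
```

For example, selfCloseVoidTags("<p>a<br>b</p>") gives "<p>a<br/>b</p>". It does not cope with a > inside an attribute value, with unclosed <p> or <li> elements, or with entities such as &nbsp;, so it only helps when you control the input format.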

                                    JonB
                                    wrote on 11 May 2018, 10:18 last edited by
                                    #17

                                    @Gojir4

                                    But, depending on the input format, I think this could be handled easily by making some replacements in the HTML code before evaluating it with XQuery. That's what I did when I used XQuery.

                                    And I do not think that is "easy", precisely because, as I said, you don't have a parser for HTML, and regular expressions are a hack which at best work "approximately" and at worst get it all wrong! That's all I was trying to warn the OP about: it won't be robust for his random HTML pages. If it works for you/him, good luck!

                                      Gojir4
                                      wrote on 11 May 2018, 10:36 last edited by
                                      #18

                                      @JonB said in Download and regex parse an url source code:

                                      and regular expressions are a hack which at best work "approximately"

                                      I don't agree with that. In my opinion, regexes are extremely powerful and work as expected if used correctly. I agree they are not designed for parsing code, but combined with another "search algorithm" such as XQuery, or with simple string manipulation, they can achieve almost anything. I have been using regexes for years now, and I have never seen an "approximative" result, except, of course, when the regular expression itself was badly defined.
                                      But, that's only my opinion.

                                        JonB
                                        wrote on 11 May 2018, 10:40 last edited by
                                        #19

                                        @Gojir4
                                        I never said regular expressions themselves are "approximative"! Of course they work. But if you cannot correctly parse the input (HTML in this case), then what they recognise can be, and often is, simply wrong. Your regular expression for recognising a URL might, for example, pick one up from inside a commented-out fragment without knowing it has done so. That may or may not matter to you/the OP, I don't know.

                                        There are plenty of posts on, say, Stack Overflow explaining why HTML cannot be correctly parsed/recognised via regular expressions.
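To illustrate the commented-out case, here is a small sketch using std::regex (chosen over Qt's classes only to keep it self-contained). The function findHrefs and the example URLs are hypothetical, and the pattern is a deliberately simple href matcher, not the OP's regex.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every double-quoted href="..."
// attribute found anywhere in the text. It has no notion of HTML structure,
// so it cannot tell live markup from comments, scripts or CDATA.
std::vector<std::string> findHrefs(const std::string &html)
{
    static const std::regex href(R"rx(href\s*=\s*"([^"]*)")rx",
                                 std::regex::icase);
    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), href), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str()); // capture group 1 is the URL text
    }
    return urls;
}
```

Given the input <a href="http://live.example/">x</a><!-- <a href="http://dead.example/">y</a> -->, findHrefs returns both URLs, including the one that a browser would never render. An HTML parser would report only the first as a link.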

                                          Mr Gisa
                                          wrote on 11 May 2018, 13:48 last edited by
                                          #20

                                          I solved the problem by using the myhtml library; it's fast and did the trick really nicely.
