[Solved] Filter out substrings from text using patterns and regular expressions



  • Hello Experts,

    I am new to Qt and stuck with regular expressions.
    Please help!

    I need to fetch a list of strings from a large text (there may be more than 20 matches).
    It is like scraping: I need to extract the strings that match a pattern, for example:

    <a href="<substring1>"><substring2></a>

    or

    <h>substr</h>

    and so on: many strings from a single big string (like a web page source).

    Please suggest an approach (code examples would be a big help).

    Regards

    Zain


  • Moderators

    Why don't you use a proper XML parser?



  • Read first:

    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

    then:

    If you have a very limited subset of HTML/XML, you might be able to do it with regular expressions. But beware: if this is a web page, your program will break sooner or later (i.e. when the webmaster changes things around).
    Now, what exactly didn't you get to work with QRegExp?



  • Thanks for the replies,

    My actual task is to fetch some strings (URLs, headings) and trigger some element actions, like clicking the buttons on a web page.
    I have done it in another IDE (Flex) by using regex 'groups variables' and getElementById().
    I just wanted to know whether these regex 'groups variables' and getElementById() kinds of things are available in Qt or not.
    If yes, how can I use them (examples would be helpful)?
    Or are there any other, more suitable options for these tasks?

    Thanks in advance



    Still not quite enough information. Give an example of the literal input and say how you want to parse it, i.e. which parts you want your program to grasp syntactically and which parts you want to extract.

    Further, I'm not familiar with Adobe Flex, so what do you mean by "groups variables", and what do they do?

    //EDIT: From the input you have in your first post: why not use a regex like
    <a href="([^Q]*)">([^<]*)</a>
    (replace Q with a quote, i.e. ", because this forum freaks out when I try to write it in there, although it's certainly not supposed to freak out due to the code tags).
    Note that it will fail on nested tags, e.g. <a href="bla">hello <b>boom</b></a>. That's why you shouldn't parse more complex languages like XML/HTML with regular expressions (see the Chomsky hierarchy: it is provable that regular expressions can never parse arbitrarily nested XML/HTML, no matter how smart you are in using them).
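
    A minimal sketch of the capture-group approach suggested above. It is an illustration under stated assumptions, not code from this thread. It uses std::regex from the C++ standard library so that it compiles without Qt; with QRegExp the same loop would be built on indexIn() and cap(1)/cap(2). The helper names extractLinks and extractHeadings are hypothetical, and the `*` quantifiers after each character class are assumed, since a class without a quantifier matches only a single character.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects (href, link text) pairs from flat anchors
// of the form <a href="...">...</a>. Capture group 1 is the URL and
// group 2 is the text. Anchors with nested tags inside are not matched.
std::vector<std::pair<std::string, std::string>> extractLinks(const std::string& html) {
    // Assumed pattern: the one from the post plus '*' quantifiers.
    static const std::regex linkRe(R"re(<a href="([^"]*)">([^<]*)</a>)re");
    std::vector<std::pair<std::string, std::string>> links;
    for (std::sregex_iterator it(html.begin(), html.end(), linkRe), end; it != end; ++it) {
        links.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return links;
}

// Hypothetical helper: collects the text of flat <h>...</h> elements,
// the second pattern from the first post.
std::vector<std::string> extractHeadings(const std::string& html) {
    static const std::regex headRe(R"re(<h>([^<]*)</h>)re");
    std::vector<std::string> headings;
    for (std::sregex_iterator it(html.begin(), html.end(), headRe), end; it != end; ++it) {
        headings.push_back((*it)[1].str());
    }
    return headings;
}
```

    Calling extractLinks on the nested example from the post, <a href="bla">hello <b>boom</b></a>, yields no match at all, which is the limitation described above. For real web pages a proper parser remains the safer choice.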



  • Thanks to all,
    It is working now.

