Split a QString with a regexp and keep the separators
Is there a way to use QString's split method with multiple separators and keep the separators at the end of each string in the resulting QStringList? For example:
QString aLongString;
QRegExp sep("( |\\.|\\?)");
QStringList stringList = aLongString.split(sep);
Thanks in advance.
QString::split(..) has an overload for QRegExp, so of course you can split the string using regular expressions.
Appending the separator to the end of each substring is not possible with the default QString::split(..) functionality. You need a more complicated algorithm to do it. I prefer this one:
Find all occurrences of separators (in your case -
-Insert after each separator one QChar with MAX_INT code-
-Split QString with QChar(MAX_INT)-
Just push_back the pieces you got into a QStringList
Something like that.
UPD: I won't give advice while I am tired.
Tucnak, why don't you just put the individual parts into a QStringList while searching for all occurrences?
Inserting into a string is a comparatively expensive operation: All the chars following after the insertion point need to be moved.
[quote author="Tobias Hunger" date="1352586191"]Tucnak, why don't you just put the individual parts into a QStringList while searching for all occurrences?
Inserting into a string is a comparatively expensive operation: All the chars following after the insertion point need to be moved.[/quote]
Thanks, Tobias. I am really tired, so I wrote stupid advice like this one. Of course you are right.
Thank you both for your replies. Tobias, can you elaborate a little bit? If I get it right, the general idea is to create an empty QStringList and start appending substrings based on the separators? In that case, I suppose I will have to work with indexes and ranges?
panosk: Yeap, you got that right.
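The index-and-range idea confirmed here can be sketched as follows. This is only an illustration using the C++ standard library (std::string and std::regex) rather than Qt, so that it compiles without any Qt setup; the function name splitKeepingSeparators is hypothetical. With Qt the same loop would presumably use QRegExp::indexIn() with matchedLength() to find each separator and QString::mid() to copy each range, but that variant is not shown or tested here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` at every match of `sep`, keeping each
// matched separator at the end of the piece that precedes it.
// Assumes `sep` never matches an empty string.
std::vector<std::string> splitKeepingSeparators(const std::string &text,
                                                const std::regex &sep)
{
    std::vector<std::string> parts;
    std::size_t start = 0; // index where the current piece begins

    for (std::sregex_iterator it(text.begin(), text.end(), sep), end;
         it != end; ++it) {
        // The piece ends where the matched separator ends.
        const std::size_t stop =
            static_cast<std::size_t>(it->position() + it->length());
        parts.push_back(text.substr(start, stop - start));
        start = stop;
    }

    // Whatever follows the last separator is the final piece.
    if (start < text.size())
        parts.push_back(text.substr(start));

    return parts;
}
```

Because nothing is dropped, concatenating the returned pieces in order gives back the original string, and no characters are ever inserted into or moved within the source string.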
OK, thanks a lot. Still, it would be nice to have a KeepSeparator option in QString::split(). Such an option, along with the existing KeepEmptyParts, would make splitting and rejoining strings even more convenient in a non-destructive manner :)
panosk: I really do not see the use case. You know the separator, otherwise you would not be able to split.
If you don't then you are better off parsing the string properly.
I assume you are still trying to parse class="whatever" from HTML? That is something that will go very wrong using RegExps, so do not do that. There are lots of ways a regexp-based approach will break down here.
@Tobias: My problem is that I have to use many separators and not only one. In the snippet I wrote in my first post (ignore the white space, I included it for variety's sake), the string will be split as expected, but then I cannot reconstruct it because the separators are lost. So, for example, I don't know which strings end in a full stop and which in a question mark.
I'm trying to achieve some sort of plain text sentence tokenization. I would never use regexps for parsing HTML or XML -- I always prefer a parser in such cases.
Eventually I will have to build a proper tokenizer, but it's not a priority right now so I'm trying to find the most convenient way to do it.