[SOLVED] String Encoding problem between Python, XHTML, Javascript
-
I'm putting together a minimal WYSIWYG editor using QWebView, which simply uses the contentEditable support and a new XMLSerializer() to output the DOM as serialized XML. To save, the serialized string is passed back from Javascript to Python.
I suspect that the translation between javascript strings and QString by the pyqtSlot(str) decorator is introducing an encoding problem and I don't know how to be stricter with encoding.
Overall the strategy fundamentally works, but there must be some kind of encoding problem between the setContent() call which passes in the original HTML and the serialized string I get from the Javascript 'save' callback.
This has become obvious through, for example, the bad handling of some smart-quotes, which get turned into some kind of junk as if the unicode hasn't been processed properly (while the rest of the serialised XML string is fine).
The steps are as follows...
-
the HTML is loaded from a file. At this point the smart quotes are properly encoded (viewable in Chrome), and an example line looks like the following (imagine the smart quotes)...
@‘step into the user’s shoes’ and ‘walk the user’s walk’
@ -
the HTML is loaded into a QWebView using [python] ...
@self.view.setContent(buf.buffer(), "application/xhtml+xml", base_url)
@
...and the example line still renders in the view apparently with smart quotes...
@‘step into the user’s shoes’ and ‘walk the user’s walk’
@ -
the HTML DOM is serialised using [javascript]
@s = (new XMLSerializer()).serializeToString(document.documentElement)
@
...and the example line (checked by debugging and reading the string) still renders in the Javascript console with smart quotes...
@‘step into the user’s shoes’ and ‘walk the user’s walk’
@ -
the Serialized string is passed back as an argument to a pyqtSlot-decorated object with a save(...) method which was previously exposed to the QWebView page().mainFrame() using addToJavascriptWindowObject(...)
@editor.save(str)
@ -
The QString arriving at the Editor object's save method is then saved to a file, where the previously mentioned extract now looks like...
@?step into the user?s shoes? and ?walk the user?s walk?
@
What should I be doing to correctly handle passing in the string from the javascript side, or receiving it on the python side, to avoid this encoding issue?
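The '?' characters in the saved extract are exactly what a lossy conversion to ASCII produces. The sketch below is plain Python 3 with no Qt involved, and it is only an assumption that the slot's string conversion behaves this way. It reproduces the same symptom from the example line and shows that a UTF-8 encoding of the same text round-trips intact.

```python
# The example line from the post, with the smart quotes written as escapes
# (U+2018 and U+2019) so the source file itself stays pure ASCII.
original = (
    "\u2018step into the user\u2019s shoes\u2019 "
    "and \u2018walk the user\u2019s walk\u2019"
)

# A lossy conversion: every character outside ASCII becomes a single '?'.
lossy = original.encode("ascii", "replace")

# A lossless conversion: UTF-8 can represent every Unicode character.
faithful = original.encode("utf-8")

print(lossy.decode("ascii"))
print(faithful.decode("utf-8") == original)
```

If the saved file looks like the first printed line, the text was probably squeezed through an 8-bit codec somewhere between the Javascript string and the file.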
The full code I'm using is available at https://github.com/cefn/firmware-codesign-readinglog/tree/master/ui
-
-
I've just created a much simpler test case which recreates the problem. https://github.com/cefn/firmware-codesign-readinglog/tree/master/ui/test
Perhaps someone with knowledge of throwing around string encodings in Qt can have a look at the Editor#save() method and figure out what needs to be done so that saved_test.html and test.html are identical after running test.py .
Currently the loaded file and saved file are identical except the Smart Quotes which are badly encoded for some unknown reason. Promising, but the encoding problem is a show-stopper :(
If these two files can be made identical, then it should be possible to throw together a WYSIWYG editor in QWebView in just a few lines of Javascript, using HTML's contentEditable support.
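A minimal sketch of how the "identical files" check could be automated, in plain Python 3. The helper names files_identical and first_difference are hypothetical, and the two demo files are tiny stand-ins written to a temporary directory rather than the real test.html and saved_test.html.

```python
import filecmp
import os
import tempfile


def files_identical(path_a, path_b):
    """Return True if both files have exactly the same bytes."""
    # shallow=False forces a comparison of the contents, not just os.stat() data.
    return filecmp.cmp(path_a, path_b, shallow=False)


def first_difference(path_a, path_b):
    """Return the byte offset of the first mismatch, or None if identical."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Stand-in files: one with real smart quotes, one with '?' in their place.
demo_dir = tempfile.mkdtemp()
loaded = os.path.join(demo_dir, "test.html")
saved = os.path.join(demo_dir, "saved_test.html")
with open(loaded, "wb") as f:
    f.write("<p>\u2018hi\u2019</p>".encode("utf-8"))
with open(saved, "wb") as f:
    f.write(b"<p>?hi?</p>")

print(files_identical(loaded, saved))
print(first_difference(loaded, saved))
```

The offset returned by first_difference points at the first badly encoded byte, which makes it easier to see which character went wrong.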
-
OK, so I found a hack which does the job. It involves iterating over every character in the javascript string and storing the Unicode codepoint in an array as a javascript number.
This array is passed to python as a QVariantList, which appears in python as a list of floats, which can then be wrangled through int(), unichr(), join() and encode() into a UTF-8 encoded byte string suitable to be written to file. Nasty as hell, but it works.
JAVASCRIPT SIDE
@ function getChars(s) {
    var chars = [];
    for (var i = 0; i < s.length; i++) {
        chars.push(s.charCodeAt(i));
    }
    return chars;
}
var ser = new XMLSerializer();
var mystr = ser.serializeToString(document.documentElement);
editor.save(getChars(mystr));
@
PYTHON SIDE
@ @pyqtSlot("QVariantList")
def save(self, serialized):
    # come in as floats from javascript
    domchars = [unichr(int(entry)) for entry in serialized]
    domunicode = ''.join(domchars)
    domascii = domunicode.encode("UTF-8")
    f = open("saved_" + filepath, 'w')
    f.write(domascii)
    f.close()
@
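The Python side of the hack above relies on unichr(), which no longer exists in Python 3. Since charCodeAt() returns UTF-16 code units, characters outside the Basic Multilingual Plane arrive as two numbers (a surrogate pair). A sketch of the same reconstruction for Python 3, with no Qt involved, might look like this. The function name code_units_to_utf8 is hypothetical, and the input is assumed to be a plain list of numbers like the one the slot receives.

```python
import struct


def code_units_to_utf8(serialized):
    """Rebuild UTF-8 bytes from a list of UTF-16 code units.

    The entries may be floats (as they arrive from Javascript), so each
    one is converted to int first.
    """
    units = [int(entry) for entry in serialized]
    # Pack the code units as little-endian unsigned 16-bit values...
    raw = struct.pack("<%dH" % len(units), *units)
    # ...and let the UTF-16 decoder recombine any surrogate pairs.
    text = raw.decode("utf-16-le")
    return text.encode("utf-8")


# U+2018, 's', U+2019 as they would arrive from getChars()
print(code_units_to_utf8([8216.0, 115.0, 8217.0]))
```

Decoding through UTF-16 rather than calling chr() on each entry is a design choice: chr() on each half of a surrogate pair would leave lone surrogates in the string, which cannot be encoded as UTF-8.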
The original problematic version, which demonstrates the problem of getting unicode strings out of QWebView, can be seen at https://github.com/cefn/firmware-codesign-readinglog/blob/7c25475ba27f565403b64aafc364012437d85a1e/ui/test/test.py
...and the fixed up version which loads and saves UTF-8 XHTML without change is at...
https://github.com/cefn/firmware-codesign-readinglog/blob/4b70f47db95bd2dabf33dae1ec747eaf0664b28d/ui/test/test.py
-
Now even better. I've found that calling toUtf8() turns the unicode array implicit in the QString passed by the original @pyqtSlot decorator into UTF-8 bytes which can be written to file without messing about, and which preserve special characters. I've no idea why no combination of python str(), bytearray(), encode() and decode() operations seemed to achieve this, but it's done now...
https://github.com/cefn/firmware-codesign-readinglog/blob/8c315b85c14f83539313bace54453565ba8aa9f6/ui/test/test.py
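Once the UTF-8 bytes are in hand, the file-writing step can be sketched independently of Qt. The helper below is hypothetical (write_utf8 is not a name from the linked code) and is written for Python 3. It accepts either a text string or a bytes-like object, for instance the bytes obtained from toUtf8(), and always writes in binary mode so that nothing re-encodes the data on the way to disk.

```python
import os
import tempfile


def write_utf8(path, payload):
    """Write payload to path as UTF-8 bytes and return the bytes written."""
    if isinstance(payload, str):
        data = payload.encode("utf-8")  # text: encode explicitly
    else:
        data = bytes(payload)  # already bytes-like: write unchanged
    with open(path, "wb") as f:  # binary mode, no implicit codec
        f.write(data)
    return data


# Demo with a stand-in string in a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "saved_test.html")
written = write_utf8(out_path, "\u2018walk the user\u2019s walk\u2019")
with open(out_path, "rb") as f:
    read_back = f.read()
print(read_back == written)
```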