【Solved】How to detect the character encoding of a web page programmatically?
-
I want to fetch the source of a web page with Qt or PyQt. I know how to download the raw (still encoded) bytes, but I then need the right codec to convert them into plain text. So the question is: how can I detect the character encoding of a web page programmatically? Can anyone help?
This page is encoded in UTF-8:
http://www.flvxz.com/getFlv.php?url=aHR0cDojI3d3dy41Ni5jb20vdTk1L3ZfT1RFM05UYzBNakEuaHRtbA==
and this one is encoded in GB2312:
http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html
Your answer should be tested on these two pages.
-
The encoding of an HTML page is declared by a <META> tag, like:
@<meta http-equiv="content-type" content="text/html; charset=UTF-8">@
This tag should be in the <HEAD> section. If it is missing, the W3C Validator will complain about the page!
Still, there may be pages where it is missing. In that case your best bet is to look for a BOM (Byte Order Mark) at the beginning of the file to detect UTF-8, UTF-16 or UTF-32. If no BOM is present, I'd assume Latin-1 or UTF-8.
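A minimal sketch of that BOM check in Python, using the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the returned codec names are my own choices for illustration, not anything Qt provides.

```python
import codecs

# Longer BOMs are listed first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so UTF-32 has to be tested before UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a Python codec name if data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None
```

The names "utf-16", "utf-32" and "utf-8-sig" are Python codecs that consume the BOM themselves while decoding, so the result can be passed straight to `bytes.decode()`.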
--
One question is how to parse the HTML document up to the charset <META> tag if you don't know the encoding yet. Well, I think everything up to that point should be plain ASCII characters, so it should decode correctly as ASCII, Latin-1 or UTF-8. Only UTF-16 and UTF-32 cause a problem, as they are very different! So you need to check for the UTF-16 and UTF-32 BOMs before everything else, I suppose. If the BOM indicates neither UTF-16 nor UTF-32, read up to the <META> tag as plain ASCII (Latin-1). Then you'll know.
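To illustrate reading up to the <META> tag as Latin-1, here is a rough Python sketch. The helper name `find_meta_charset`, the regular expression and the 4096-byte `limit` are assumptions made for this example. A real HTML parser handles more edge cases (comments, unusual attribute order).

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="content-type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    r"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def find_meta_charset(data, limit=4096):
    """Return the charset named in a <meta> tag near the start of data, or None."""
    # Latin-1 maps every byte to a character, so this decode never fails,
    # and ASCII markup survives unchanged for any ASCII-compatible encoding.
    head = data[:limit].decode("latin-1")
    match = _META_CHARSET.search(head)
    return match.group(1).lower() if match else None
```

This only works for ASCII-compatible encodings, which is exactly why UTF-16 and UTF-32 have to be ruled out first.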
--
To sum up, I'd go like this:
File starts with UTF-32 BOM -> It's UTF-32. Stop here. (Check this one first: the little-endian UTF-32 BOM, FF FE 00 00, begins with the little-endian UTF-16 BOM, FF FE.)
File starts with UTF-16 BOM -> It's UTF-16. Stop here.
In all other cases, assume Latin-1 for now and read up to the <META> charset tag.
As soon as that tag is encountered, you know the correct encoding -> Switch to the correct encoding.
If the charset tag doesn't appear before </HEAD>, assume UTF-8 for the rest of the file.
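Put together, the procedure could look roughly like the following Python sketch. `decode_html` is a hypothetical helper name, the regular expression is a simplification, and the fallback for unknown charset names is my own addition. I haven't tested it against the two pages from the question.

```python
import codecs
import re

_META = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def decode_html(data):
    """Decode raw HTML bytes: BOM first, then <meta> charset, then UTF-8."""
    # A BOM settles the question. UTF-32 is tested before UTF-16 because
    # its little-endian BOM starts with the UTF-16 little-endian BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return data.decode("utf-32")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig", errors="replace")

    # Look for a charset declaration only in the part before </head>.
    head_end = data.lower().find(b"</head>")
    head = data if head_end == -1 else data[:head_end]
    match = _META.search(head)
    # No declaration found: assume UTF-8.
    encoding = match.group(1).decode("ascii") if match else "utf-8"

    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The page named a charset Python does not know; fall back to UTF-8.
        return data.decode("utf-8", errors="replace")
```

With PyQt, the bytes would come from something like `reply.readAll().data()`. `errors="replace"` keeps a single bad byte from aborting the whole decode.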
-
@
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtNetwork import *
import sys
import chardet


def slotSourceDownloaded(reply):
    redirctLocation = reply.header(QNetworkRequest.LocationHeader)
    redirctLocationUrl = reply.url() if not redirctLocation else redirctLocation
    # print(redirctLocationUrl, reply.header(QNetworkRequest.ContentTypeHeader))

    if reply.error() != QNetworkReply.NoError:
        print('11111111', reply.errorString())
        return

    pageCode = reply.readAll()
    # Let chardet guess the encoding from the raw bytes; it may return None
    charCodecInfo = chardet.detect(pageCode.data())
    fallbackName = charCodecInfo['encoding'] or 'utf-8'
    textStream = QTextStream(pageCode)
    # codecForHtml() checks the <meta> charset tag and only uses the
    # chardet guess as the default when no tag is found
    codec = QTextCodec.codecForHtml(pageCode, QTextCodec.codecForName(fallbackName))
    textStream.setCodec(codec)
    content = textStream.readAll()
    print(content)

    if content == '':
        print('---------', 'cannot find any resource !')
        return

    reply.deleteLater()
    qApp.quit()


if __name__ == '__main__':
    app = QCoreApplication(sys.argv)
    manager = QNetworkAccessManager()
    url = input('input url :')
    request = QNetworkRequest(QUrl.fromEncoded(QUrl.fromUserInput(url).toEncoded()))
    request.setRawHeader(b"User-Agent", b'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')
    # Connect the slot before sending the request
    manager.finished.connect(slotSourceDownloaded)
    manager.get(request)
    sys.exit(app.exec_())
@