【Solved】How to detect the character encoding of a web page programmatically?

redstoneleo

I want to fetch the source of a web page with Qt or PyQt. I know how to download the raw (encoded) bytes, but I then need the right codec to convert them into plain text. So the question is: how can I detect the character encoding of a web page programmatically? Can anyone help?

This page is encoded in UTF-8:
http://www.flvxz.com/getFlv.php?url=aHR0cDojI3d3dy41Ni5jb20vdTk1L3ZfT1RFM05UYzBNakEuaHRtbA==

And this one is encoded in GB2312:

http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html

Please test your answer on these two pages.

bjanuario

It's a great question, mate ;) Have a look at the WebKit classes to see whether any of them can help you with this.

MuldeR

The encoding of an HTML page is normally declared by a <META> tag, like:
@<meta http-equiv="content-type" content="text/html; charset=UTF-8">@

This tag should be in the <HEAD> section. If it is missing, the W3C Validator will complain about the page!

Still, there may be pages where it is missing. In that case your best bet is to look for a BOM (Byte Order Mark) at the beginning of the file to detect UTF-8, UTF-16 or UTF-32. If none is present, I'd assume Latin-1 or UTF-8.
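A minimal sketch of the BOM check in plain Python, in case it helps. The function name `sniff_bom` and the returned tuple are my own choices for illustration, not part of Qt or any other library; the BOM byte sequences come from the standard `codecs` module.

```python
import codecs

# Longer marks go first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# same two bytes as the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if data has no BOM.

    `data` is the raw page as bytes. Hypothetical helper for illustration.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

With a BOM present you would decode `data[bom_length:]` using the returned encoding; without one, you fall through to the <META> lookup or to a default.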

--

One question is how to parse the HTML document up to the point where the charset <META> tag is if you don't know the encoding yet. Well, I think everything up to that point should be plain ASCII characters, so it should decode correctly as ASCII, Latin-1 or UTF-8. Only UTF-16 and UTF-32 are a problem, as they are very different! So you need to check for the UTF-16 and UTF-32 BOM before everything else, I suppose. If neither UTF-16 nor UTF-32 is detected via BOM, read up to the <META> tag as plain ASCII (Latin-1). Then you'll know.

--

To sum up, I'd go like this:

File starts with UTF-32 BOM -> It's UTF-32. Stop here. (Check this one first: the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.)

File starts with UTF-16 BOM -> It's UTF-16. Stop here.

In all other cases, assume Latin-1 for now and read up to the <META> charset tag.

As soon as that tag is encountered, you know the correct encoding -> Switch to the correct encoding.

If the charset tag doesn't appear before </HEAD>, assume UTF-8 for the rest of the file.
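The steps above can be sketched in plain Python roughly as follows. This is only an illustration under a few assumptions: the names `detect_html_encoding` and `decode_html` are made up for the example, the <META> tag is found with a simple regular expression rather than a real HTML parser, and a document without </HEAD> is searched in full. Matching on the raw bytes has the same effect as decoding as Latin-1 first, since every byte maps to exactly one character.

```python
import codecs
import re

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM starts with
# the UTF-16 LE BOM. Python's "utf-16"/"utf-32"/"utf-8-sig" codecs consume
# the BOM themselves when decoding.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF8, "utf-8-sig"),
]

# Matches both <meta charset="..."> and
# <meta http-equiv="content-type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)
HEAD_END = re.compile(rb"</head\s*>", re.IGNORECASE)


def detect_html_encoding(data, default="utf-8"):
    """Guess the encoding of raw HTML bytes (hypothetical helper)."""
    # Steps 1 and 2: a BOM settles the question immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 3: only look at the part before </head>, if there is one.
    end = HEAD_END.search(data)
    head = data[:end.start()] if end else data
    # Step 4: take the charset from the first matching <meta> tag.
    match = META_CHARSET.search(head)
    if match:
        name = match.group(1).decode("ascii").lower()
        try:
            codecs.lookup(name)  # make sure Python knows this codec
            return name
        except LookupError:
            pass
    # Step 5: nothing declared, fall back to the default.
    return default


def decode_html(data):
    """Decode raw HTML bytes using the detected encoding."""
    return data.decode(detect_html_encoding(data), errors="replace")
```

For the two test pages you would pass the downloaded bytes to `decode_html`. Note that this ignores the charset that the server may send in the HTTP Content-Type header, which is worth checking as well.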

redstoneleo

@
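from PyQt4.QtCore import QCoreApplication, QTextCodec, QTextStream, QUrl
from PyQt4.QtNetwork import QNetworkAccessManager, QNetworkReply, QNetworkRequest
import sys
import chardet


def slotSourceDownloaded(reply):
    # Free the reply once control returns to the event loop.
    reply.deleteLater()

    # Where the server redirected us to, if anywhere (kept for debugging).
    redirctLocation = reply.header(QNetworkRequest.LocationHeader)
    redirctLocationUrl = reply.url() if not redirctLocation else redirctLocation
    #print(redirctLocationUrl, reply.header(QNetworkRequest.ContentTypeHeader))

    if reply.error() != QNetworkReply.NoError:
        print('Download failed:', reply.errorString())
        QCoreApplication.quit()
        return

    pageCode = reply.readAll()

    # chardet guesses the encoding from the raw bytes; the guess may be None.
    charCodecInfo = chardet.detect(pageCode.data())
    guessedName = charCodecInfo['encoding'] or 'UTF-8'
    # codecForName() returns None for names Qt does not know, so fall back to UTF-8.
    guessedCodec = (QTextCodec.codecForName(guessedName.encode('ascii'))
                    or QTextCodec.codecForName(b'UTF-8'))

    # codecForHtml() looks for a BOM or a <meta> charset in the page and
    # uses the chardet guess only if it finds neither.
    codec = QTextCodec.codecForHtml(pageCode, guessedCodec)

    textStream = QTextStream(pageCode)
    textStream.setCodec(codec)
    content = textStream.readAll()

    if not content:
        print('---------', 'cannot find any resource !')
    else:
        print(content)

    QCoreApplication.quit()


if __name__ == '__main__':
    app = QCoreApplication(sys.argv)
    manager = QNetworkAccessManager()
    # Connect before sending the request so the finished signal cannot be missed.
    manager.finished.connect(slotSourceDownloaded)
    url = input('input url :')
    request = QNetworkRequest(QUrl.fromEncoded(QUrl.fromUserInput(url).toEncoded()))
    request.setRawHeader(b'User-Agent', b'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')
    manager.get(request)
    sys.exit(app.exec_())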
@