[Solved] How to detect the character encoding of a web page programmatically?

    redstoneleo wrote on 8 Jun 2013, 16:12 (#1)

    I want to fetch the source of a web page with Qt or PyQt. I know how to download the raw (still encoded) bytes, but to convert them into text I need to know which codec to use. So the question is: how can I detect the character encoding of a web page programmatically? Can anyone help?

    This page is encoded in UTF-8:
    http://www.flvxz.com/getFlv.php?url=aHR0cDojI3d3dy41Ni5jb20vdTk1L3ZfT1RFM05UYzBNakEuaHRtbA==

    and this one is encoded in GB2312:

    http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html

    Your answer should be tested on these two pages.

      bjanuario wrote on 8 Jun 2013, 18:20 (#2)

      It's a great quest, mate ;) Check whether any of the WebKit classes offer something that can help you with this.

        MuldeR wrote on 9 Jun 2013, 22:35 (#3)

        The encoding of an HTML page is normally declared by a <META> tag, like:
        @<meta http-equiv="content-type" content="text/html; charset=UTF-8">@

        This tag should be in the <HEAD> section. If it is missing, the W3C Validator will complain that the page is invalid!

        Still, there may be pages where it is missing. In that case your best bet is to look for a BOM (Byte Order Mark) at the beginning of the file to detect UTF-8, UTF-16 or UTF-32. If none is present, I'd assume Latin-1 or UTF-8.
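To illustrate the BOM check, here is a minimal Python sketch that uses only the standard `codecs` module. The function name `sniff_bom` and the table `_BOMS` are hypothetical names chosen for this example; they come neither from the thread nor from Qt.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the UTF-32 marks must be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length) for raw bytes, or (None, 0) if no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

If a BOM is found, the text could then be decoded with something like `data[length:].decode(encoding)`. A UTF-16 LE document whose first character is U+0000 would be misread as UTF-32 LE, but that should not occur in real HTML.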

        --

        One question is how to parse the HTML document up to the point where the charset <META> tag is if you don't know the encoding yet. Well, I think everything up to that point should be plain ASCII characters, so it should decode correctly as ASCII, Latin-1 or UTF-8. Only UTF-16 and UTF-32 cause a problem, as they are very different! So you need to check for a UTF-16 or UTF-32 BOM before everything else, I suppose. If neither UTF-16 nor UTF-32 is detected via BOM, read up to the <META> tag as plain ASCII (Latin-1). Then you'll know.

        --

        To sum up, I'd go like this:

        File starts with a UTF-32 BOM -> It's UTF-32. Stop here. (Check this before UTF-16: the UTF-32 little-endian BOM, FF FE 00 00, begins with the UTF-16 little-endian BOM, FF FE.)

        File starts with a UTF-16 BOM -> It's UTF-16. Stop here.

        In all other cases, assume Latin-1 for now and read up to the <META> charset tag.

        As soon as that tag is encountered, you know the correct encoding -> Switch to that encoding.

        If the charset tag doesn't appear before </HEAD>, assume UTF-8 for the rest of the file.
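The steps above can be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions: `guess_html_encoding` and both regular expressions are hypothetical names written for this example, the pattern is a deliberately simple approximation rather than a real HTML parser, and it also accepts the shorter `<meta charset="...">` form. Searching the raw bytes works because the markup up to the tag is ASCII-compatible, as argued above.

```python
import re

# Simplified patterns (assumptions, not a full HTML parser).
_META_CHARSET = re.compile(
    br"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)
_HEAD_END = re.compile(br"</head\s*>", re.IGNORECASE)


def guess_html_encoding(data):
    """Guess the encoding of raw HTML bytes: BOM first, then <meta>, then UTF-8."""
    # UTF-32 must be tested before UTF-16 because FF FE 00 00 starts with FF FE.
    if data.startswith((b"\xff\xfe\x00\x00", b"\x00\x00\xfe\xff")):
        return "utf-32"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    # Only look at the bytes before </head>; if there is no </head>,
    # fall back to searching the whole document.
    head_end = _HEAD_END.search(data)
    head = data[: head_end.start()] if head_end else data
    match = _META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # No declaration found before </head>: assume UTF-8.
    return "utf-8"
```

A caller might then decode with `data.decode(guess_html_encoding(data), errors="replace")`. Python's `utf-16` and `utf-32` codecs consume the BOM themselves, so it does not need to be stripped first.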

          redstoneleo wrote on 27 Oct 2013, 09:22 (#4)

          @
          import sys

          import chardet
          from PyQt4.QtCore import QCoreApplication, QTextCodec, QTextStream, QUrl
          from PyQt4.QtNetwork import QNetworkAccessManager, QNetworkReply, QNetworkRequest


          def slotSourceDownloaded(reply):
              # Release the reply once control returns to the event loop.
              reply.deleteLater()

              if reply.error() != QNetworkReply.NoError:
                  print('download failed:', reply.errorString())
                  QCoreApplication.quit()
                  return

              pageCode = reply.readAll()  # raw, still-encoded bytes (QByteArray)

              # Let chardet guess the encoding; it reports None when it cannot decide.
              charCodecInfo = chardet.detect(pageCode.data())
              guessedName = (charCodecInfo['encoding'] or 'UTF-8').encode('ascii')
              fallbackCodec = QTextCodec.codecForName(guessedName)
              if fallbackCodec is None:  # chardet named an encoding Qt does not know
                  fallbackCodec = QTextCodec.codecForName(b'UTF-8')

              # codecForHtml() looks for a BOM or a <meta> charset declaration
              # and falls back to the chardet guess if it finds neither.
              codec = QTextCodec.codecForHtml(pageCode, fallbackCodec)
              textStream = QTextStream(pageCode)
              textStream.setCodec(codec)
              content = textStream.readAll()

              if content == '':
                  print('cannot find any resource!')
              else:
                  print(content)
              QCoreApplication.quit()


          if __name__ == '__main__':
              app = QCoreApplication(sys.argv)
              manager = QNetworkAccessManager()
              # Connect before issuing the request so the signal cannot be missed.
              manager.finished.connect(slotSourceDownloaded)
              url = input('input url: ')
              request = QNetworkRequest(QUrl.fromUserInput(url))
              request.setRawHeader(b'User-Agent', b'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')
              manager.get(request)
              sys.exit(app.exec_())
          @
