Problem to getting Chinese and Japanese words from Page Source

    Zain
    wrote on last edited by
    #4

    Thanks for the reply.

    I am not sure whether the incoming data is UTF-8 or not. I just saw the following tag in the <head> section of the page source, and the pages contain Japanese and Chinese words.
    @<meta content="text/html; charset=utf-8" http-equiv="Content-Type">@

    I have changed my code here:
    @
    request->setRawHeader( "Accept-Charset", "utf-8" );
    request->setRawHeader( "charset", "utf-8" );
    @

    But nothing changed; I got the same result.
    My main need is just to get the page source as it is into a string and then apply some regular expressions to that string. The text file is only there so I can see what I am getting in the QString object.

    I have also set the file encoding in the Qt Creator editor settings (Default Encoding set to UTF-8). Does that make sense?

    I am stuck here. Could you please send me example code for checking the default codec and for setting QTextStream::setCodec() explicitly?

    Thanks
    Zain

      Chris Kawa
      Lifetime Qt Champion
      wrote on last edited by
      #5

      The setting you are changing in QtCreator is for the encoding of your source files in the editor. It has nothing to do with the running program or the network request/reply encoding.

      I don't know. Maybe the webpage is lying in the meta tag? Check the response headers (QNetworkReply::rawHeaderList()) and see what encoding is set there.
      If that's not it, I'd save the page from a browser and check the actual bytes of those Japanese characters in a hex editor to see what encoding they are in.

        Zain
        wrote on last edited by
        #6

        Thanks, Krzysztof Kawa, for the reply.
        I got this list of response headers after running

        @ qDebug()<<reply->rawHeaderList(); @

        ("Date", "Server", "X-Powered-By", "Content-Encoding", "Vary", "Keep-Alive", "Connection", "Content-Type")

        Can you please help me understand what this means and how to find out the encoding from it?

          Chris Kawa
          Lifetime Qt Champion
          wrote on last edited by
          #7

          This tells you what headers are attached to the reply. One of them is "Content-Encoding", so now you can call QNetworkReply::rawHeader("Content-Encoding") to see what it says. Also look at "Content-Type": its charset parameter, if present, names the character encoding the response uses. It should be, but may or may not be, the same as what the meta tag said.
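
To make the header check concrete: when a server declares a character set, it does so in the charset parameter of the Content-Type value (for example `text/html; charset=Shift_JIS`), while Content-Encoding only names the compression. Below is a minimal, Qt-free sketch of pulling that parameter out of a header value. The function name `charsetFromContentType` is made up for illustration; in the code from this thread the input would be whatever `reply->rawHeader("Content-Type")` returns.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns the lower-cased charset parameter of a
// Content-Type header value, or an empty string if none is declared.
std::string charsetFromContentType(const std::string& headerValue)
{
    // Work on a lower-cased copy so "Charset=UTF-8" and "charset=utf-8" match.
    std::string lower(headerValue);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    const std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos)
        return std::string();

    std::string::size_type begin = pos + key.size();
    // The value may be quoted: charset="utf-8"
    if (begin < lower.size() && lower[begin] == '"')
        ++begin;

    // The value ends at a quote, a semicolon, whitespace, or the end of the string.
    std::string::size_type end = begin;
    while (end < lower.size() && lower[end] != ';' && lower[end] != '"'
           && !std::isspace(static_cast<unsigned char>(lower[end])))
        ++end;

    return lower.substr(begin, end - begin);
}
```

An empty result means the server left the character set unstated, in which case the meta tag inside the page is the next place to look.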

            Zain
            wrote on last edited by
            #8

            Thanks, Krzysztof Kawa, for the reply.

            I tried @ qDebug()<<reply->rawHeader("Content-Encoding");@

            and got "gzip" in the debug window.

            Any idea, please, what "gzip" means?

            I would also like to mention that in the page source I get, a Chinese word like 书籍 shows up as "& #20070;& #31821;". (I have put a space between & and # here so that the post displays the codes; the real string has no space. These codes are the equivalent of the word 书籍, and if I write them without the space the post just shows 书籍.)

            So can you please help me convert these codes back into Chinese characters in my Qt GUI application when reading from the file?

            I also updated the code that writes to the file, like this:
            @
            QTextStream data( &outputFile );
            data.setCodec("UTF-8");
            data<<htmlString;
            @
            But the file still shows the same codes, not Chinese characters.
            I also changed the Qt Creator setting: Edit -> Select Encoding -> selected UTF-8 and "Save with Encoding". Does that make any sense?

              Chris Kawa
              Lifetime Qt Champion
              wrote on last edited by
              #9

              Leave the Qt Creator settings alone; they are for the code editor only. They matter, for example, if you wanted to write something like this in your code:
              @
              QString s = "ąęśćźżół";
              @
              and save that .cpp file as a utf-8 file, which is a bad idea on its own. It has nothing to do with your case.

              "gzip" just means that the page is sent compressed to save bandwidth, so it doesn't help much: Content-Encoding describes the compression, not the character set. The character set, if the server declares one, is the charset parameter of the "Content-Type" header.

              It seems that the page doesn't use UTF-8 encoded characters for the Chinese text but HTML entities (the &#number; things), and this is the text you should get in the string variable, not the Chinese characters (QString doesn't parse HTML). You would have to parse it on your own, display the content in something HTML-aware, or do a crude replace of those entities.

              This all means that the actual source might be in a plain 1-byte encoding, like Windows-1251.
              Try QString::fromLatin1() instead of QString::fromUtf8(), but it shouldn't really matter for the entities: they are plain ASCII, and ASCII bytes decode the same way in Latin-1 and UTF-8 (only bytes above 0x7F differ).
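
The "crude replace" of numeric entities could look roughly like the sketch below. It is plain C++ rather than Qt, the names `appendUtf8` and `decodeNumericEntities` are invented for this example, and it only handles numeric references such as `&#20070;` or `&#x4E66;`; named entities like `&amp;` are deliberately left untouched. The output is UTF-8, so in a Qt program it could then be passed to QString::fromUtf8().

```cpp
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to 'out'.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: replaces &#NNNN; and &#xHHHH; with UTF-8 bytes.
// Anything that is not a well-formed numeric reference is copied as-is.
std::string decodeNumericEntities(const std::string& in)
{
    std::string out;
    std::string::size_type i = 0;
    while (i < in.size()) {
        if (in.compare(i, 2, "&#") == 0) {
            std::string::size_type j = i + 2;
            int base = 10;
            if (j < in.size() && (in[j] == 'x' || in[j] == 'X')) {
                base = 16;
                ++j;
            }
            const std::string::size_type digitsBegin = j;
            while (j < in.size()
                   && (base == 16 ? std::isxdigit(static_cast<unsigned char>(in[j]))
                                  : std::isdigit(static_cast<unsigned char>(in[j]))))
                ++j;
            const std::string::size_type count = j - digitsBegin;
            // Cap the digit count so the parsed value cannot overflow.
            if (count > 0 && count <= 7 && j < in.size() && in[j] == ';') {
                const unsigned long cp =
                    std::strtoul(in.substr(digitsBegin, count).c_str(), nullptr, base);
                // Accept only valid code points, excluding NUL and surrogates.
                if (cp > 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF)) {
                    appendUtf8(out, static_cast<std::uint32_t>(cp));
                    i = j + 1;
                    continue;
                }
            }
        }
        out += in[i++];
    }
    return out;
}
```

For the string from this thread, `decodeNumericEntities("&#20070;&#31821;")` yields the six UTF-8 bytes of 书籍. Displaying the page in an HTML-aware widget avoids the need for this step altogether, which is the trade-off mentioned above.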

                Zain
                wrote on last edited by
                #10

                Hi Chris,

                I resolved my issue by using QWebView instead of QNetworkAccessManager: I got the HTML source as it is, and then used QString::fromUtf8() and QTextStream::setCodec("UTF-8") for reading from and writing to the file.

                But I am still confused about why it did not work with QNetworkAccessManager.

                Thanks for your help.

                  Chris Kawa
                  Lifetime Qt Champion
                  wrote on last edited by
                  #11

                  Out of curiosity I did a little local test, and it all seems to work "out of the box".
                  Here's my code:
                  @
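                  // Wherever the request is started (e.g. the MainWindow constructor):
                  QNetworkAccessManager* nam = new QNetworkAccessManager(this);
                  connect(nam, SIGNAL(finished(QNetworkReply*)), this, SLOT(finished(QNetworkReply*)));
                  QNetworkRequest rq(QUrl("file:///C:/Test/index.html"));
                  nam->get(rq);

                  // Slot called once the whole reply has arrived:
                  void MainWindow::finished(QNetworkReply* reply)
                  {
                      QByteArray response = reply->readAll();
                      // Decode the raw bytes as UTF-8; passing the size keeps embedded NULs from truncating the text.
                      ui->plainTextEdit->setPlainText(QString::fromUtf8(response.constData(), response.size()));
                      reply->deleteLater(); // the reply is not deleted automatically
                  }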
                  @

                  This is what it looks like in the browser:
                  !http://img819.imageshack.us/img819/1424/browserv.jpg(page with chinese characters in a browser)!

                  And this is what my QNAM gets:
                  !http://img687.imageshack.us/img687/8936/namnq.jpg(page with chinese characters in a QNAM)!

                    Zain
                    wrote on last edited by
                    #12

                    Hi Chris,

                    Thanks for the reply.

                    The above code works like a charm for me, but only for the pages that contain Chinese words, not for those containing Japanese and French characters.

                    I tried the same code on a test.html page containing Japanese words, just like your index.html, and it works fine there. But when I apply it to the page I actually need, I get diamond symbols like the following instead of the Japanese characters from that page source.

                    ���ׂẴJ�e�S���

                    I have found one difference between the three pages.

                    In the Chinese page
                    @<meta content="text/html; charset=UTF-8" http-equiv="content-type">@

                    In the Japanese page
                    @<meta content="text/html; charset=Shift_JIS" http-equiv="content-type">@

                    And in the French page
                    @<meta content="text/html; charset=iso-8859-1" http-equiv="content-type">@

                    Does that make any sense?

                    Can you please suggest how, using QNetworkAccessManager, I can get the web page source as it is for pages containing Japanese and French characters, as I already can for the page containing Chinese characters?

                    Thanks again for your support.
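
The three meta tags above point at the likely cause: only the first page is UTF-8, so decoding the Shift_JIS and iso-8859-1 pages as UTF-8 produces replacement characters. A rough, Qt-free sketch of the idea follows: read the declared charset from the raw bytes first, then decode accordingly. The names `charsetFromMeta` and `latin1ToUtf8` are hypothetical, the 2048-byte scan window is an arbitrary assumption, and only the trivial iso-8859-1 case is converted here; Shift_JIS needs a real codec table (in Qt 4/5, QTextCodec::codecForName() or QTextCodec::codecForHtml() are the usual candidates for that step).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: finds "charset=" near the start of an HTML document
// and returns the lower-cased name after it, or "" if none is found.
// The markup around the charset name is ASCII, so the undecoded bytes
// can be searched directly.
std::string charsetFromMeta(const std::string& htmlBytes)
{
    std::string head = htmlBytes.substr(0, 2048); // assumed scan window
    std::transform(head.begin(), head.end(), head.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    const std::string::size_type pos = head.find(key);
    if (pos == std::string::npos)
        return std::string();

    std::string::size_type begin = pos + key.size();
    // Covers <meta charset="utf-8"> as well as content="...; charset=utf-8".
    if (begin < head.size() && (head[begin] == '"' || head[begin] == '\''))
        ++begin;

    std::string::size_type end = begin;
    while (end < head.size()
           && (std::isalnum(static_cast<unsigned char>(head[end]))
               || head[end] == '-' || head[end] == '_'))
        ++end;

    return head.substr(begin, end - begin);
}

// Converts ISO-8859-1 bytes to UTF-8. Each Latin-1 byte is the code point
// U+0000..U+00FF, so bytes above 0x7F become two-byte UTF-8 sequences.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    out.reserve(latin1.size());
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With the Japanese page's tag as input, `charsetFromMeta` returns "shift_jis", which is the signal that QString::fromUtf8() is the wrong decoder for that reply.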

                      Zain
                      wrote on last edited by
                      #13

                      Hi Chris,

                      To make it clearer, here is an example URL of a web page that contains Japanese characters; I need to get the page source of this page as it is in a QString object.

                      http://www.amazon.co.jp/BUFFALO-外付けハードディスク-Regza-HD-LB2-0TU2-フラストレーションフリーパッケージ/dp/B0052VIGXA/ref=sr_1_1?s=electronics&ie=UTF8&qid=1366439116&sr=1-1
