Problem to getting Chinese and Japanese words from Page Source

13 Posts 3 Posters 7.4k Views 1 Watching
  • Chris Kawa (Lifetime Qt Champion) wrote (#2):

    Two things come to mind.

    You're setting an "Accept-Charset" header in your request with the value "win1251,utf-8", so it's likely that you're not getting a UTF-8 response. Using QString::fromUtf8() on it might lead to the garbage you're getting. Try changing the header to just "utf-8", without the "q" parameter (which is 1 by default).

    The other thing: check which codec is set on the QTextStream using the codec() member. It uses the system locale encoding by default, so if that's not already UTF-8, try setting it explicitly with QTextStream::setCodec().

    Edit. Btw. QFile does not throw exceptions, so your try/catch is useless. The usual construct is this:
    @
    if (outputFile.open(QIODevice::WriteOnly | QIODevice::Text))
    {
        // do something with the file
        // you don't have to call outputFile.close(); the file is closed when outputFile goes out of scope
    }
    else
    {
        // couldn't open the file
    }
    @

    • JKSH (Moderators) wrote (#3):

      Edit: Krzysztof Kawa beat me to it ;)

      [quote author="Zain" date="1365498293"]
      @
      void MainWindow::on_btnGetSource_clicked()
      {
      //...
      request->setRawHeader( "Accept-Charset", "win1251,utf-8;q=0.7,*;q=0.7" );
      //...
      }

      void MainWindow::replyFinished()
      {
      //...
      QString htmlString=QString::fromUtf8(reply->readAll());
      //...
      }

      @[/quote]
      Have you made sure that your data is encoded in UTF-8? Your request accepts Windows-1251 too. If the data is not UTF-8, QString::fromUtf8() will give wrong results.

      Qt Doc Search for browsers: forum.qt.io/topic/35616/web-browser-extension-for-improved-doc-searches

      • Zain wrote (#4):

        Thanks for the reply.

        I am not sure whether the incoming data is UTF-8 or not. I just saw the following in the <head> section of the page source, and the pages contain Japanese and Chinese words.
        @<meta content="text/html; charset=utf-8" http-equiv="Content-Type">@

        I have changed my code like this:
        @
        request->setRawHeader( "Accept-Charset", "utf-8" );
        request->setRawHeader( "charset", "utf-8" );
        @

        But nothing changed; I got the same result.
        My main need is just to get the page source as-is into a string and then apply some regular expressions to that string. The text file is only there so I can see what I am getting in the QString object.

        I have set the default file encoding in the Qt Creator editor settings to UTF-8. Does that make any difference?

        I am stuck here. Can you please send me example code for checking the default codec and for setting it explicitly with QTextStream::setCodec()?

        Thanks
        Zain

        • Chris Kawa (Lifetime Qt Champion) wrote (#5):

          The setting you are changing in Qt Creator is for the encoding of your source files in the editor. It has nothing to do with the running program or the network request/reply encoding.

          I don't know. Maybe the webpage is lying in the meta tag? Check the response headers (QNetworkReply::rawHeaderList()) and see what encoding is set there.
          If that's not it, I'd save the page from a browser and check the actual bytes of those Japanese characters in a hex editor to see what encoding they are in.
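
For the byte-level check suggested above, a hex editor is not strictly required. The following is a minimal sketch in plain standard C++ (no Qt), assuming the raw reply bytes are available as a std::string; the name looksLikeUtf8 is made up for illustration. It reports whether a byte sequence is well-formed UTF-8, which is a quick way to spot Shift_JIS or Latin-1 data that would turn into garbage in QString::fromUtf8().

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` is a well-formed UTF-8 sequence.
bool looksLikeUtf8(const std::string& bytes)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long minForLength[] = {0x0, 0x80, 0x800, 0x10000};

    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // decoded code point
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;       // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;   // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minForLength[extra]) return false;       // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range

        i += extra + 1;
    }
    return true;
}

// Usage:
//   looksLikeUtf8("\xE4\xB9\xA6\xE7\xB1\x8D")  -> true  (书籍 encoded as UTF-8)
//   looksLikeUtf8("\x93\xFA\x96\x7B")          -> false (日本 encoded as Shift_JIS)
```

Passing the check does not prove the data is UTF-8 (pure ASCII passes too), but failing it is a strong hint that another encoding is in play.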

          • Zain wrote (#6):

            Thanks for the reply, Krzysztof.
            This is what I got for the response headers after adding

            @ qDebug()<<reply->rawHeaderList(); @

            ("Date", "Server", "X-Powered-By", "Content-Encoding", "Vary", "Keep-Alive", "Connection", "Content-Type")

            Can you please help me understand what this means and how to find out the encoding from it?

            • Chris Kawa (Lifetime Qt Champion) wrote (#7):

              This tells you which headers are attached to the reply. One of them is "Content-Encoding", so now you can call QNetworkReply::rawHeader("Content-Encoding") to get the actual encoding the response uses. It should be the same as what the meta tag says, but it may not be.
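
A note on the headers: Content-Encoding normally describes compression (gzip, deflate), while the character set, when the server declares one, is carried as a charset parameter of the Content-Type header. Below is a small sketch in standard C++ of pulling that parameter out of a header value such as the one returned by reply->rawHeader("Content-Type"). The function names are hypothetical, and the parsing is deliberately simplified rather than a full implementation of the HTTP header grammar.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: lower-cases the ASCII letters of a string.
std::string toLowerAscii(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical helper: returns the lower-cased charset parameter of a
// Content-Type header value, or an empty string if none is present.
std::string charsetFromContentType(const std::string& headerValue)
{
    const std::string lower = toLowerAscii(headerValue);
    const std::string key = "charset=";

    std::size_t pos = lower.find(key);
    if (pos == std::string::npos)
        return std::string();
    pos += key.size();

    // The value runs up to the next parameter separator or the end of the header.
    const std::size_t end = lower.find(';', pos);
    const std::string value =
        lower.substr(pos, end == std::string::npos ? std::string::npos : end - pos);

    // Trim surrounding whitespace and optional quotes.
    const char* strip = " \t\"'";
    const std::size_t first = value.find_first_not_of(strip);
    if (first == std::string::npos)
        return std::string();
    const std::size_t last = value.find_last_not_of(strip);
    return value.substr(first, last - first + 1);
}

// Usage:
//   charsetFromContentType("text/html; charset=Shift_JIS")  -> "shift_jis"
//   charsetFromContentType("text/html")                     -> ""
```

An empty result means the server did not declare a charset in the header, in which case the meta tag in the page is the next place to look.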

              • Zain wrote (#8):

                Thanks for the reply, Krzysztof.

                I tried @ qDebug()<<reply->rawHeader("Content-Encoding");@

                and got "gzip" in the debug window.

                Any idea what "gzip" means here?

                I should also mention that in the page source I get, a Chinese word like 书籍 shows up as "& #20070;& #31821;". (I have put a space between & and # here only so the forum shows the code; the real string has no space. This code is the equivalent of 书籍, and without the space the post would render it as 书籍.)

                So can you please help me convert this code back into Chinese characters in my Qt GUI application when reading from the file?

                I also updated the code that writes to the file:
                @
                QTextStream data( &outputFile );
                data.setCodec("UTF-8");
                data<<htmlString;
                @
                But the file still shows the same code, not the Chinese characters.
                I also changed a Qt Creator setting: Edit -> Select Encoding -> UTF-8, then "Save with Encoding". Does that make any difference?

                • Chris Kawa (Lifetime Qt Champion) wrote (#9):

                  Leave the Qt Creator settings alone; they only affect the code editor. They matter if, for example, you wanted to write something like this in your code:
                  @
                  QString s = "ąęśćźżół";
                  @
                  and save that .cpp file as UTF-8, which is a bad idea on its own. It has nothing to do with your case.

                  "gzip" just means that the page is sent compressed to save bandwidth, so it doesn't tell us much.

                  It seems the page doesn't use UTF-8 encoded bytes for the Chinese characters but HTML numeric character references (the &#number; things), and that is the text you should get in the string variable, not the Chinese characters (QString doesn't parse HTML). You would have to parse it on your own, display the content in something HTML-aware, or do a crude replace of those entities.
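
The "crude replace" could look roughly like the sketch below. It is written in standard C++ on UTF-8 std::string data rather than QString, and it only handles numeric references (decimal and hexadecimal), not named ones such as &amp;amp;. All function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of a code point to `out`.
void appendUtf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: value of a digit in the given base (10 or 16), or -1.
int digitValue(char c, int base)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (base == 16) {
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    }
    return -1;
}

// Replaces references such as "&#20070;" or "&#x4E66;" with UTF-8 bytes.
// Anything that is not a complete, valid numeric reference is copied unchanged.
std::string decodeNumericEntities(const std::string& in)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '&' && i + 2 < in.size() && in[i + 1] == '#') {
            std::size_t j = i + 2;
            int base = 10;
            if (in[j] == 'x' || in[j] == 'X') { base = 16; ++j; }

            unsigned long cp = 0;
            std::size_t digits = 0;
            while (j < in.size() && digits < 8) {   // cap digits to avoid overflow
                const int d = digitValue(in[j], base);
                if (d < 0) break;
                cp = cp * static_cast<unsigned long>(base) + static_cast<unsigned long>(d);
                ++j;
                ++digits;
            }

            const bool valid = digits > 0 && j < in.size() && in[j] == ';'
                && cp > 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
            if (valid) {
                appendUtf8(out, cp);
                i = j + 1;      // continue after the ';'
                continue;
            }
        }
        out += in[i];
        ++i;
    }
    return out;
}

// Usage:
//   decodeNumericEntities("&#20070;&#31821;")  -> the UTF-8 bytes of 书籍
```

The result is UTF-8, so on the Qt side it could then be handed to QString::fromUtf8().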

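                  This all means that the actual source might be in a plain 1-byte encoding, like Windows-1251.
                  Try QString::fromLatin1() instead of QString::fromUtf8(). For the entity text itself it shouldn't really matter, since the entities are plain ASCII and ASCII is a subset of both Latin-1 and UTF-8. Latin-1 bytes above 0x7F are not valid UTF-8, though.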
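
To make the relationship between the two encodings concrete: only the ASCII range (0x00 to 0x7F) has the same byte representation in Latin-1 and UTF-8, while each Latin-1 byte from 0x80 upwards becomes two bytes in UTF-8. A minimal sketch in standard C++, with a made-up function name, of converting Latin-1 bytes to UTF-8:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Latin-1 byte values map one-to-one to Unicode code points U+0000..U+00FF.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string out;
    out.reserve(latin1.size());
    for (char ch : latin1) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            out += ch;                                   // ASCII: unchanged
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));   // lead byte
            out += static_cast<char>(0x80 | (b & 0x3F)); // continuation byte
        }
    }
    return out;
}

// Usage:
//   latin1ToUtf8("caf\xE9")   -> "caf\xC3\xA9"  ("café" in UTF-8)
//   latin1ToUtf8("&#20070;")  -> "&#20070;"     (pure ASCII is untouched)
```

This is why decoding Latin-1 data with a UTF-8 decoder garbles accented letters but leaves ASCII text, including numeric entities, intact.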

                  • Zain wrote (#10):

                    Hi Chris,

                    I resolved my issue by using QWebView instead of QNetworkAccessManager. That gave me the HTML source as-is, and then I used QString::fromUtf8() and QTextStream::setCodec("UTF-8") for reading from and writing to the file.

                    But I am still confused about why it did not work with QNetworkAccessManager.

                    Thanks for your help.

                    • Chris Kawa (Lifetime Qt Champion) wrote (#11):

                      Out of curiosity I did a little local test, and it all seems to work "out of the box".
                      Here's my code:
                      @
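                      // Setup, e.g. in the MainWindow constructor:
                      QNetworkAccessManager* nam = new QNetworkAccessManager(this);
                      connect(nam, SIGNAL(finished(QNetworkReply*)), this, SLOT(finished(QNetworkReply*)));
                      QNetworkRequest rq(QUrl("file:///C:/Test/index.html"));
                      nam->get(rq);

                      // Slot called when the request completes:
                      void MainWindow::finished(QNetworkReply * reply)
                      {
                          QByteArray response = reply->readAll();
                          // Decode the raw bytes as UTF-8 (correct only if the page really is UTF-8)
                          ui->plainTextEdit->setPlainText(QString::fromUtf8(response.constData(), response.size()));
                          reply->deleteLater(); // the receiver is responsible for deleting the reply
                      }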
                      @

                      This is what it looks like in the browser:
                      !http://img819.imageshack.us/img819/1424/browserv.jpg(page with chinese characters in a browser)!

                      And this is what my QNAM gets:
                      !http://img687.imageshack.us/img687/8936/namnq.jpg(page with chinese characters in a QNAM)!

                      • Zain wrote (#12):

                        Hi Chris,

                        Thanks for the reply.

                        The code above works like a charm for me, but only for pages containing Chinese words, not for those containing Japanese or French characters.

                        I tried the same code on a test.html page containing Japanese words, just like your index.html, and it works fine there. But when I apply it to the page I actually need, I get diamond symbols like these instead of the Japanese characters from the page source:

                        ���ׂẴJ�e�S���

                        I have found one difference between the three pages.

                        On the Chinese page:
                        @<meta content="text/html; charset=UTF-8" http-equiv="content-type">@

                        On the Japanese page:
                        @<meta content="text/html; charset=Shift_JIS" http-equiv="content-type">@

                        And on the French page:
                        @<meta content="text/html; charset=iso-8859-1" http-equiv="content-type">@

                        Does that make any difference?

                        Can you please suggest how I can use QNetworkAccessManager to get the page source as-is for pages with Japanese and French characters, just as I did for the page with Chinese characters?

                        Thanks again for your support.
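
The differing meta tags are very likely the explanation: bytes in Shift_JIS or ISO-8859-1 are not valid UTF-8, so QString::fromUtf8() produces replacement characters for them. One approach is to read the declared charset from the raw bytes first and pick the decoder accordingly. The sketch below, in standard C++ with hypothetical function names, does only the first half (finding the declared charset) and is a simplification of real HTML encoding sniffing. In Qt 4 and Qt 5, QTextCodec::codecForHtml() and QTextCodec::codecForName() followed by toUnicode() cover detection and decoding.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: lower-cases the ASCII letters of a string.
std::string toLowerAscii(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical helper: returns the lower-cased charset declared in a <meta>
// tag near the start of the raw HTML bytes, or an empty string if none is found.
// Assumption: the declaration is ASCII-readable, which holds for UTF-8,
// Shift_JIS and ISO-8859-1 pages like the ones quoted above.
std::string charsetFromMeta(const std::string& html)
{
    // Only inspect the beginning of the document, where <head> lives.
    const std::string head = toLowerAscii(html.substr(0, 1024));
    const std::string key = "charset=";

    std::size_t pos = head.find(key);
    if (pos == std::string::npos)
        return std::string();
    pos += key.size();

    // Skip an optional opening quote, as in <meta charset="utf-8">.
    if (pos < head.size() && (head[pos] == '"' || head[pos] == '\''))
        ++pos;

    // The name ends at the first quote, space, separator or tag end.
    const std::size_t end = head.find_first_of("\"' \t;/>", pos);
    return head.substr(pos, end == std::string::npos ? std::string::npos : end - pos);
}

// Usage:
//   charsetFromMeta("<meta content=\"text/html; charset=Shift_JIS\" http-equiv=\"content-type\">")
//     -> "shift_jis"
```

A charset declared in the HTTP Content-Type header, if present, normally takes precedence over the meta tag, so the meta tag is best treated as a fallback.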

                        • Zain wrote (#13):

                          Hi Chris,

                          To make it clearer, here is an example URL of a web page that has Japanese characters. I need to get the page source of this page as-is into a QString object.

                          http://www.amazon.co.jp/BUFFALO-外付けハードディスク-Regza-HD-LB2-0TU2-フラストレーションフリーパッケージ/dp/B0052VIGXA/ref=sr_1_1?s=electronics&ie=UTF8&qid=1366439116&sr=1-1
