Problem getting Chinese and Japanese words from page source



  • Hello All,

    In my application I fetch the page source of the requested URLs, assign it to a QString, and then apply a RegExp to the page source to extract particular information.

    My problem is that some pages contain Chinese and Japanese words, which I need to get as they are and use.
    But when I get the reply and assign it to a QString or save it to a file, I get either ????? or some unpredictable characters instead of the Chinese or Japanese words.

    I need to get the Chinese and Japanese words as they are, both to display them in my application and to save them to a file.

    Here is my code:
    @
    void MainWindow::on_btnGetSource_clicked()
    {
    QString url=ui->txtPageUrl->text();
    QNetworkAccessManager *qname=new QNetworkAccessManager(this);
    QNetworkReply *reply;
    QNetworkRequest *request=new QNetworkRequest(QUrl(url));
    request->setRawHeader( "Accept-Charset", "win1251,utf-8;q=0.7,*;q=0.7" );
    request->setRawHeader( "charset", "utf-8" );
    reply=qname->get(*request);
    QObject::connect(reply,SIGNAL(finished()),this,SLOT(replyFinished()));
    }

    void MainWindow::replyFinished()
    {
    QNetworkReply *reply = qobject_cast<QNetworkReply *>(sender());

    QString htmlString=QString::fromUtf8(reply->readAll());
    
    reply->deleteLater();
    
                //Code to Browse file and save pagesource string into file
    
                      QString outputFilename = QFileDialog::getSaveFileName(
                              this,
                              tr("Save Document"),
                              QDir::currentPath(),
                              tr("Document files (*.txt );;All files (*.*)"
                                 ));
    
    
                      try
                      {
                          QFile outputFile(outputFilename);
                          outputFile.open(QIODevice::WriteOnly | QIODevice::Text);
    
                          /* Check it opened OK */
                          if(!outputFile.isOpen())
                          {
                               qDebug("Unable to open file,Please try again.");
    
                              return;
                          }
    
    
                          QTextStream data( &outputFile );
    
                          data<<htmlString;
                          /* Close the file */
                          outputFile.close();
                          qDebug("Source Code Saved Successfully.");
    
                      }
                      catch(...)
                      {
                          qDebug("Unable to open file,Please try again.");
    
                      }
                   //End of save
    

    //Append page source to plain text edit
    ui->txtPageSource->appendPlainText(htmlString);
    }

    @

    I tried setting the file encoding in Qt Creator (Default Encoding set to UTF-8), but nothing changed.

    Please help me figure out how to deal with this.

    Thanks in Advance.
    Zain


  • Moderators

    Two things come to mind.

    You're setting an "Accept-Charset" header in your request with the value "win1251,utf-8", so it's likely that you're not getting a UTF-8 response. Using QString::fromUtf8() on it might lead to the garbage you're getting. Try changing the header to just "utf-8", without the "q" parameter (which defaults to 1).

    The other thing: check which codec is set on the QTextStream using the codec() member. It uses the system locale encoding by default, so if it's not already UTF-8, try setting it explicitly with QTextStream::setCodec().

    Edit. Btw. QFile does not throw exceptions, so your try/catch is useless. The usual construct is this:
    @
    if(outputFile.open(QIODevice::WriteOnly | QIODevice::Text))
    {
    //do something with the file
    //you don't have to call outputFile.close(), it will be closed when outputFile goes out of scope
    }
    else
    {
    //couldn't open file
    }
    @


  • Moderators

    Edit: Krzysztof Kawa beat me to it ;)

    [quote author="Zain" date="1365498293"]
    @
    void MainWindow::on_btnGetSource_clicked()
    {
    //...
    request->setRawHeader( "Accept-Charset", "win1251,utf-8;q=0.7,*;q=0.7" );
    //...
    }

    void MainWindow::replyFinished()
    {
    //...
    QString htmlString=QString::fromUtf8(reply->readAll());
    //...
    }

    @[/quote]Have you made sure that your data is encoded in UTF-8? You accept Windows-1251 too. If your data is not UTF-8, then QString::fromUtf8() will give wrong results.
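
One way to answer that question is to test the raw bytes before decoding them. The following is a minimal sketch in plain standard C++ (no Qt), assuming the reply bytes have been copied into a std::string; `looksLikeUtf8` is a hypothetical helper name, not a Qt function. A failed check proves the data is not UTF-8. A passed check is only a strong hint, since pure ASCII passes as well.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8. Rejects stray continuation
// bytes, truncated sequences, overlong forms, surrogates, and code points
// above U+10FFFF.
bool looksLikeUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // continuation bytes expected after the lead byte
        unsigned long cp = 0;    // code point being decoded
        unsigned long min = 0;   // smallest code point allowed for this length
        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return false;       // continuation byte or invalid lead byte

        if (extra > n - i - 1)
            return false;        // sequence is cut off at the end of the data
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;    // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF)
            return false;        // overlong encoding or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;        // UTF-16 surrogates are not valid in UTF-8
        i += extra + 1;
    }
    return true;
}
```

In the code from the question, the bytes returned by reply->readAll() could be copied into a std::string (QByteArray provides constData() and size()) and passed to this function before QString::fromUtf8() is called.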



  • Thanks for Reply.

    I am not sure whether the incoming data is UTF-8 or not. I just saw the following in the <head> section of the page source, and the pages contain Japanese and Chinese words.
    @<meta content="text/html; charset=utf-8" http-equiv="Content-Type">@

    I have changed my code here:
    @
    request->setRawHeader( "Accept-Charset", "utf-8" );
    request->setRawHeader( "charset", "utf-8" );
    @

    But nothing changed; I got the same result.
    My main need is just to get the page source as it is into a string and apply some RegExp to that string. I only used the text file here to see what I am getting in the QString object.

    I have set the file encoding in Qt Creator (Default Encoding set to UTF-8). Does that make sense?

    I am stuck here. Can you please send me example code for checking the default codec and setting it explicitly with QTextStream::setCodec()?

    Thanks
    Zain


  • Moderators

    The setting you are changing in QtCreator is for the encoding of your source files in the editor. It has nothing to do with the running program or the network request/reply encoding.

    I don't know. Maybe the webpage is lying in the meta tag? Check the response headers (QNetworkReply::rawHeaderList()) and see what encoding is set there.
    If that's not it, I'd save the page from a browser and check the actual bytes of those Japanese characters in a hex editor to see what encoding they are in.



  • Thanks Krzysztof Kawa for reply.
    I got this in the response headers after adding

    @ qDebug()<<reply->rawHeaderList(); @

    ("Date", "Server", "X-Powered-By", "Content-Encoding", "Vary", "Keep-Alive", "Connection", "Content-Type")

    Can you please help me understand what this means and how to find out the encoding type from it?


  • Moderators

    This tells you what headers are attached to the reply. One of them is "Content-Encoding", so now you can call QNetworkReply::rawHeader("Content-Encoding") to get the actual encoding the response uses. It should be the same as what the meta tag says, but it may not be.



  • Thanks Krzysztof Kawa for reply.

    I tried @ qDebug()<<reply->rawHeader("Content-Encoding");@

    and got "gzip" in the debug window.

    Any idea, please, what "gzip" means here?

    I would like to share that when I get the page source, a Chinese word like 书籍 shows up as "& #20070;& #31821;" (I have put a space between & and # here so that you can see the code I am getting; the real string has no space between & and #. This code is the equivalent of the word 书籍, and if I write it without the space it is displayed as 书籍 here in the post).

    So can you please help me convert this code back into the Chinese word in my Qt GUI application when reading from the file?

    I also updated my code for writing into the file, like this:
    @
    QTextStream data( &outputFile );
    data.setCodec("UTF-8");
    data<<htmlString;
    @
    But the same code shows up in the file, not the Chinese characters.
    I also changed the Qt Creator setting: Edit->Select Encoding->selected UTF-8 and "Save with Encoding". Does that make sense?


  • Moderators

    Leave the Qt Creator settings alone, they are for the code editor only. They matter, for example, if you wanted to write something like this in your code:
    @
    QString s = "ąęśćźżół";
    @
    and save that .cpp file as a utf-8 file, which is a bad idea on its own. It has nothing to do with your case.

    "gzip" just means that the page is sent compressed to save bandwidth, so it doesn't help much.

    It seems that the page doesn't use UTF-8 bytes for the Chinese characters but HTML entities (the &number; things), and this is the text you should get in the string variable, not the Chinese characters (QString doesn't parse HTML). You would have to parse it on your own, display this content in something HTML-aware, or do a crude replace of those entities.

    This all means that the actual source might be in a plain 1-byte encoding, like Windows-1251.
    Try QString::fromLatin1() instead of QString::fromUtf8(), but it shouldn't really matter for the entities themselves, as they are plain ASCII, which decodes the same way as Latin-1 and as UTF-8.
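
The "crude replace" of entities can be sketched as follows. This is a minimal example in plain standard C++ (no Qt), assuming only decimal references such as the ones quoted above; hexadecimal references (&#x...;) and named entities (&amp;) are left untouched. `appendUtf8` and `decodeNumericEntities` are hypothetical helper names chosen for illustration.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of code point `cp` to `out`.
static void appendUtf8(std::string &out, unsigned long cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replaces decimal numeric character references ("&#20070;") with the
// UTF-8 bytes of the referenced character. All other text, including
// malformed references, is copied unchanged.
std::string decodeNumericEntities(const std::string &html)
{
    std::string out;
    std::size_t i = 0;
    while (i < html.size()) {
        if (html.compare(i, 2, "&#") == 0) {
            std::size_t j = i + 2;
            unsigned long cp = 0;
            // Stop accumulating once the value is out of range (avoids overflow).
            while (j < html.size() && html[j] >= '0' && html[j] <= '9' && cp <= 0x10FFFF) {
                cp = cp * 10 + static_cast<unsigned long>(html[j] - '0');
                ++j;
            }
            const bool hasDigits = j > i + 2;
            const bool isSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (hasDigits && j < html.size() && html[j] == ';'
                && cp > 0 && cp <= 0x10FFFF && !isSurrogate) {
                appendUtf8(out, cp);
                i = j + 1;       // skip past the terminating ';'
                continue;
            }
        }
        out += html[i];
        ++i;
    }
    return out;
}
```

The result is a UTF-8 byte string, so on the Qt side it could then be turned into a QString with QString::fromUtf8(). For anything beyond simple numeric references, an HTML-aware component is likely the safer route.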



  • Hi Chris,

    I resolved my issue by using QWebView instead of QNetworkAccessManager. I got the HTML source as it is, and then used QString::fromUtf8() and QTextStream::setCodec("UTF-8") for reading from and writing to the file.

    But I am still confused about why it did not work with QNetworkAccessManager.

    Thanks for your help.


  • Moderators

    Out of curiosity I did a little local test, and it all seems to work "out of the box".
    Here's my code:
    @
    QNetworkAccessManager* nam = new QNetworkAccessManager(this);
    connect(nam, SIGNAL(finished(QNetworkReply*)), this, SLOT(finished(QNetworkReply*)));
    QNetworkRequest rq(QUrl("file:///C:/Test/index.html"));
    nam->get(rq);

    void MainWindow::finished(QNetworkReply * reply)
    {
    QByteArray response = reply->readAll();
    ui->plainTextEdit->setPlainText(QString::fromUtf8(response.data()));
    }
    @

    This is what it looks like in the browser:
    !http://img819.imageshack.us/img819/1424/browserv.jpg(page with chinese characters in a browser)!

    And this is what my QNAM gets:
    !http://img687.imageshack.us/img687/8936/namnq.jpg(page with chinese characters in a QNAM)!



  • Hi Chris,

    Thanks for reply.

    The above code works like a charm for me, but only for pages that contain Chinese words, not for those containing Japanese and French characters.

    I applied the same code to a test.html page containing Japanese words, just like your index.html.
    For that page the code works fine, but when I apply it to the page I actually need, I get the following kind of diamond symbols instead of the Japanese characters from the page source.

    ���ׂẴJ�e�S���

    I have found one thing that differs between the three pages.

    On the Chinese page:
    @<meta content="text/html; charset=UTF-8" http-equiv="content-type">@

    On the Japanese page:
    @<meta content="text/html; charset=Shift_JIS" http-equiv="content-type">@

    And on the French page:
    @<meta content="text/html; charset=iso-8859-1" http-equiv="content-type">@

    Does this make any sense?

    Can you please suggest how, using QNetworkAccessManager, I can get the source of a web page containing Japanese and French characters as it is, just as I did with the page containing Chinese characters?

    Thanks again for your support.
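
The three meta tags above declare three different charsets, which would explain why decoding every page with QString::fromUtf8() only works for the UTF-8 one. Below is a minimal sketch in plain standard C++ (no Qt) of pulling the charset name out of a Content-Type value, whether it comes from the meta tag's content attribute or from the "Content-Type" response header listed earlier. `charsetFromContentType` is a hypothetical helper name, and the parsing is deliberately simple.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Extracts the charset parameter from a Content-Type value such as
// "text/html; charset=Shift_JIS". Returns an empty string if the value
// carries no charset parameter.
std::string charsetFromContentType(const std::string &contentType)
{
    // Lower-case copy for a case-insensitive search; offsets stay identical.
    std::string lower = contentType;
    for (std::size_t i = 0; i < lower.size(); ++i)
        lower[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(lower[i])));

    const std::string key = "charset=";
    const std::size_t pos = lower.find(key);
    if (pos == std::string::npos)
        return std::string();

    std::size_t begin = pos + key.size();
    // Skip an optional opening quote around the value.
    if (begin < contentType.size()
        && (contentType[begin] == '"' || contentType[begin] == '\''))
        ++begin;

    // The value ends at a quote, a semicolon, whitespace, or the end of input.
    std::size_t end = begin;
    while (end < contentType.size()) {
        const char c = contentType[end];
        if (c == ';' || c == '"' || c == '\''
            || std::isspace(static_cast<unsigned char>(c)))
            break;
        ++end;
    }
    return contentType.substr(begin, end - begin);
}
```

In a Qt 4/5 application the resulting name could presumably be passed to QTextCodec::codecForName(), and the returned codec's toUnicode() applied to the bytes from reply->readAll() in place of QString::fromUtf8(). This has not been tested against the pages in question.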



  • Hi Chris,

    To make this clearer, here is an example web page URL that contains Japanese characters. I need to get the page source of this web page as it is into a QString object.

    http://www.amazon.co.jp/BUFFALO-外付けハードディスク-Regza-HD-LB2-0TU2-フラストレーションフリーパッケージ/dp/B0052VIGXA/ref=sr_1_1?s=electronics&ie=UTF8&qid=1366439116&sr=1-1

