Download all images from an HTML file
-
I want to download all the images referenced in an HTML file and store them in a folder (folder name: image). After all images are downloaded, the HTML file should be edited to point to the local images.
I want to do everything in this function. How should I approach this task?
bool save_images(QString filename)
{
    QString content;
    qDebug() << filename;
    QFile file(filename);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        qDebug() << "unable to open file";
        return false;
    } else {
        content = file.readAll();
        qDebug() << content;
        file.close();
    }
    if (!file.open(QIODevice::WriteOnly | QIODevice::Text)) {
        qDebug() << "unable to write to file";
        return false;
    } else {
        qDebug() << "write to file here";
        /*
        QTextStream out(&file);
        out << "shit";
        file.close();
        */
    }
    return true;
}
-
I think you are a little too unspecific, so I will reply without too many details:
-
Read in the .html-file.
-
I suggest using a QRegularExpression to extract the proper links (also see QRegExp for some more information about the syntax).
-
Maybe manipulate the paths to the images, in case they are relative to whatever website they are from.
-
Take a look at the Network Download Example (some of the other examples related to networking might be interesting, too) and implement something similar to download the images. This usually involves waiting for the replies, so you might want to make your application multi-threaded.
-
Store the images locally. Maybe use a QMap or something like it to keep track of which linked image refers to which local one.
-
Again, use a QRegularExpression (or a simple string-replacement, if possible) to replace the image-paths inside the .html-file with local ones.
-
Profit.
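For what it's worth, the string-handling steps above (find the links, fix up the paths, remember which remote image belongs to which local file, rewrite the HTML) could be sketched like this. It is only a rough sketch under a few assumptions: it uses std::regex from the standard library instead of QRegularExpression so that it compiles without Qt, it only looks at double-quoted src attributes, and the helper names resolveUrl, localName and localizeImages as well as the naming scheme (last path segment, placed in image/) are made up for illustration. The download itself is not shown.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: make a src value absolute.
// base is assumed to look like "http://en.wikitolearn.org" (scheme + host, no trailing slash).
std::string resolveUrl(const std::string &src, const std::string &base)
{
    if (src.compare(0, 2, "//") == 0)                  // protocol-relative: reuse the scheme of base
        return base.substr(0, base.find("//")) + src;
    if (!src.empty() && src[0] == '/')                 // host-relative: prepend scheme + host
        return base + src;
    return src;                                        // anything else is assumed to be absolute already
}

// Hypothetical naming scheme: last path segment of the URL, stored in the "image" folder.
std::string localName(const std::string &url)
{
    return "image/" + url.substr(url.find_last_of('/') + 1);
}

// Rewrites every double-quoted <img ... src="..."> value to a local path and
// records in downloads which remote URL each local file has to be fetched from.
std::string localizeImages(const std::string &html, const std::string &base,
                           std::map<std::string, std::string> &downloads)
{
    static const std::regex img(R"re(<img\b[^>]*?\bsrc="([^"]+)")re", std::regex::icase);
    std::string out;
    std::size_t last = 0;  // end of the part of html that has been copied so far
    for (std::sregex_iterator it(html.begin(), html.end(), img), end; it != end; ++it) {
        const std::smatch &m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position(1));
        const std::string url = resolveUrl(m[1].str(), base);
        const std::string local = localName(url);
        downloads[local] = url;              // local path -> remote URL
        out.append(html, last, pos - last);  // copy everything up to the src value
        out += local;                        // write the local path instead of the link
        last = pos + static_cast<std::size_t>(m.length(1));
    }
    out.append(html, last, std::string::npos);  // copy the remainder of the document
    return out;
}
```

Calling localizeImages(html, "http://en.wikitolearn.org", downloads) returns the rewritten HTML and fills downloads with local path / remote URL pairs that the download step can iterate over. Links that carry everything in the query string (like the MathShowImage ones) would need a different naming scheme, for example one based on the hash parameter.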
PS: I don't know what you want to do with this program, but it might actually be easier to use something other than C++ for this task - I believe things like Python or maybe Perl would be appropriate tools for what you described (in case you have experience with any of them). But of course, Qt is always a fine choice, too. :)
-
-
Hey @thEClaw, sorry for not being specific enough.
This is the whole C++ code where every operation will be done:
C++
#include "dbmanager.h"
#include <QtSql>
#include <QSqlDatabase>
#include <QSqlDriver>
#include <QCoreApplication>
#include <QDebug>
#include <QNetworkAccessManager>
#include <QNetworkRequest>
#include <QNetworkReply>
#include <QUrl>
#include <QUrlQuery>
#include <QJsonObject>
#include <QJsonDocument>
#include <QByteArray>
#include <QFile>
#include <QRegularExpression>
#include <QString>
#include <QTextStream>

dbmanager::dbmanager(QObject *parent) : QObject(parent)
{
}

bool add_in_db(int pageid, int revid)
{
    QDir databasePath;
    QString path = databasePath.currentPath() + "WTL.db";
    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE"); // not dbConnection
    db.setDatabaseName(path);
    if (!db.open()) {
        qDebug() << "error in opening DB";
    } else {
        qDebug() << "connected to DB";
    }

    QSqlQuery query;
    query.prepare("INSERT INTO pages (page_ID,page_revision) "
                  "VALUES (? , ?)");
    query.bindValue(0, pageid);
    query.bindValue(1, revid);
    if (query.exec()) {
        qDebug() << "done";
        return (true);
    } else {
        qDebug() << query.lastError();
    }
    return (false);
}

bool save_images(QString filename)
{
    QString content;
    qDebug() << filename;
    QFile file(filename);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text)) {
        qDebug() << "unable to open file";
        return false;
    } else {
        content = file.readAll();
        //qDebug() << content ;
        file.close();
    }
    if (!file.open(QIODevice::WriteOnly | QIODevice::Text)) {
        qDebug() << "unable to write to file";
        return false;
    } else {
        qDebug() << "write to file here";
        QTextStream out(&file);
        out << content;
        file.close();
    }
    return true;
}

void dbmanager::add()
{
    QString text;
    int pageid, revid;

    // create custom temporary event loop on stack
    QEventLoop eventLoop;

    // "quit()" the event-loop, when the network request "finished()"
    QNetworkAccessManager mgr;
    QObject::connect(&mgr, SIGNAL(finished(QNetworkReply*)), &eventLoop, SLOT(quit()));

    // the HTTP request
    QNetworkRequest req( QUrl( QString("http://en.wikitolearn.org/api.php?action=parse&page=Linear%20Algebra/Sets&format=json") ) );
    QNetworkReply *reply = mgr.get(req);
    eventLoop.exec();

    if (reply->error() == QNetworkReply::NoError) {
        //success
        //qDebug() << "Success" <<reply->readAll();
        QString html = (QString)reply->readAll();
        QJsonDocument jsonResponse = QJsonDocument::fromJson(html.toUtf8());
        QJsonObject jsonObj = jsonResponse.object();
        text = jsonResponse.object()["parse"].toObject()["text"].toObject()["*"].toString();
        pageid = jsonResponse.object()["parse"].toObject()["pageid"].toInt();
        revid = jsonResponse.object()["parse"].toObject()["revid"].toInt();
        text = text.replace("\n","");
        text = text.replace("'/index.php", "http://en.wikitolearn.org/index.php");
        text = text.replace("&amp;","&");
        text = text.replace("MathShowImage&amp;", "MathShowImage&");
        text = text.replace("mode=mathml'", "mode=mathml""");
        text = text.replace("<meta class=\"mwe-math-fallback-image-inline\" aria-hidden=\"true\" style=\"background-image: url(" ,"<img style=\"background-repeat: no-repeat; background-size: 100% 100%; vertical-align: -0.838ex;height: 2.843ex;\"" "src=");
        text = text.replace("<meta class=\"mwe-math-fallback-image-display\" aria-hidden=\"true\" style=\"background-image: url(" ,"<img style=\"background-repeat: no-repeat; background-size: 100% 100%; vertical-align: -0.838ex;height: 2.843ex;\"" "src=");
        text = text.replace("&mode=mathml);" , "&mode=mathml\">");
        // qDebug() << text;
        qDebug() << pageid;
        delete reply;
    } else {
        //failure
        qDebug() << "Failure" << reply->errorString();
        delete reply;
    }

    QDir dir;
    QString Folder_name = QString::number(pageid);
    if (QDir(Folder_name).exists()) {
        qDebug() << " already exist ";
    } else {
        dir.mkdir(Folder_name);
        QString filename = Folder_name + ".html";
        QFile file(filename);
        file.open(QIODevice::WriteOnly | QIODevice::Text);
        QTextStream out(&file);
        out << text;
        // optional, as QFile destructor will already do it:
        file.close();

        bool success = add_in_db(pageid, revid);
        if (success == true) {
            qDebug() << "entry added to DB successfully ";
        } else {
            qDebug() << " failed to add in DB ";
        }
        success = save_images(filename);
    }

    /* * ***************************************************** */
    /************************** DB CODE was here ********************/
}

void dbmanager::del()
{
    qDebug() << "DELETION CODE GOES HERE";
}
I don't know whether I should paste the whole HTML here (it's too messy: no formatting), but I can paste the important URLs that I need.
IMG URLs:
<img style="background-repeat: no-repeat; background-size: 100% 100%; vertical-align: -0.838ex;height: 2.843ex;"src=http://en.wikitolearn.org/index.php?title=Special:MathShowImage&hash=2af9544640fe8b97375512027efaaccd&mode=mathml">
Another type of IMG URL is this one (I have included the surrounding HTML, too):
<p>The Bitwise XOR (exclusive or) performs a logical XOR function, which is equivalent to adding two bits and discarding the carry. The result is zero only when we have two zeroes or two ones. XOR can be used to toggle the bits between 1 and 0.\n</p>\n<div class=\"thumb tright\"><div class=\"thumbinner\" style=\"width:302px;\"><a href=\"/File:Xor.png\" class=\"image\"><img alt=\"\" src=\"//pool.wikitolearn.org/images/thumb/7/76/Xor.png/300px-Xor.png\" width=\"300\" height=\"150\" class=\"thumbimage\" srcset=\"//pool.wikitolearn.org/images/thumb/7/76/Xor.png/450px-Xor.png 1.5x, //pool.wikitolearn.org/images/thumb/7/76/Xor.png/600px-Xor.png 2x\" /></a> <div class=\"thumbcaption\"><div class=\"magnify\"><a href=\"/File:Xor.png\" class=\"internal\" title=\"Enlarge\"></a></div>xor</div></div></div>
-
Since you seem to have some problems with QRegularExpression (Read the documentation! It's the best I have ever seen!), I will give you a small example. I happen to be using something like this in a program I am working on:
QRegularExpression re("<img (?<junk>.*?) src=(?<path>\\S+?) (?<junk2>.*?)>", QRegularExpression::CaseInsensitiveOption);
The ?<name> blocks are there to provide named captures in the matches. The first one is called junk - it is allowed to contain anything but the "src" attribute (.*? matches any character in a "lazy" manner). junk2 behaves equivalently; if you don't need to capture (i.e. store for later access) these groups, you could simplify the expression a bit.
path will store the download link for the image. \\S will match every non-whitespace character, and +? means that there has to be at least one such character (matched lazily). You will have to adapt the expression if path is enclosed in " signs, depending on your specific case.
Here is a bit of code I copied and slightly modified from the documentation for QRegularExpression:
// Note the space before the closing '>': the pattern above expects one after the path.
QString s = "<img style=\"background-repeat: no-repeat; background-size: 100% 100%; vertical-align: -0.838ex;height: 2.843ex;\" src=http://en.wikitolearn.org/index.php?title=Special:MathShowImage&hash=2af9544640fe8b97375512027efaaccd&mode=mathml >";
// globalMatch() iterates over all matches; looping on a single match object would never terminate.
QRegularExpressionMatchIterator it = re.globalMatch(s);
while (it.hasNext()) {
    QRegularExpressionMatch match = it.next();
    QString junk = match.captured("junk");   // junk == style="background-repeat: no-repeat; background-size: 100% 100%; vertical-align: -0.838ex;height: 2.843ex;"
    QString junk2 = match.captured("junk2"); // junk2 == ""
    QString path = match.captured("path");   // path == http://en.wikitolearn.org/index.php?title=Special:MathShowImage&hash=2af9544640fe8b97375512027efaaccd&mode=mathml
}
Feed in the contents of your .html-file instead of s, take care of some minute details (path enclosed in " or not? Maybe no space before junk2? etc.), and then you should be able to do whatever you want with your links.
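As an illustration of those minute details, here is a rough sketch of a pattern that accepts the src value in double quotes, in single quotes, or without any quotes, and does not insist on a space after it. It is written with std::regex instead of QRegularExpression so that it compiles without Qt (std::regex has no named captures, so numbered groups are used), and the function name imageSources is made up for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Collects the src value of every <img> tag in html.
// Group 1: double-quoted value, group 2: single-quoted value, group 3: unquoted value.
std::vector<std::string> imageSources(const std::string &html)
{
    static const std::regex img(
        R"re(<img\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))re",
        std::regex::icase);

    std::vector<std::string> result;
    for (std::sregex_iterator it(html.begin(), html.end(), img), end; it != end; ++it) {
        const std::smatch &m = *it;
        for (std::size_t g = 1; g <= 3; ++g) {
            if (m[g].matched) {             // exactly one of the three alternatives took part
                result.push_back(m[g].str());
                break;
            }
        }
    }
    return result;
}
```

The alternation itself is ordinary regex syntax, so the pattern should carry over to QRegularExpression (with the backslashes doubled if you do not use a raw string literal). Note that a malformed tag such as src=http://...mathml"> would leave the stray " at the end of the unquoted value.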
I hope that gets you started.