Coding a web scraper - I have a big question
I've coded scrapers before, but this time there's something new I've never implemented.
There's some login form with a reCAPTCHA v2.
I've been thinking of showing a pop-up window with the actual page so the user can enter the login details and "solve" the CAPTCHA. And then steal the session cookies once the protected content is reached and continue scraping from there as usual.
Is this feasible?
I'd say yes, but could somebody show me the right path to follow to implement the "pop up" part? I.e., using this or that component, etc.
Remember, I actually need to browse/render the real login page and somehow control whatever happens inside, such as server responses or cookies.
Thanks for your help.
Isn't the point of a captcha to prevent bots from scraping?
Maybe, but you can understand our wariness. It's a JS based system, so you need to run the whole JS machinery alongside the HTML layout engine. Probably QWebEngine is what you want, however I've not used it myself. Thereafter you do what a browser'd do, I assume.
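For the "steal the session cookies and continue scraping" step from the question, here is a rough sketch of how captured cookies could be turned into a Cookie request header. It is plain C++ with no Qt dependency. The CapturedCookie struct and both function names are made up for illustration. It assumes the cookies were already collected from the embedded browser, for example by listening to the cookieAdded signal of the profile's QWebEngineCookieStore. Domain matching is deliberately simplified: paths, expiry, the Secure flag and host-only cookies are ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical plain record for one cookie captured from the embedded browser.
struct CapturedCookie {
    std::string name;
    std::string value;
    std::string domain;  // e.g. "example.com" or ".example.com"
};

// Returns true if `host` equals `domain` or is a subdomain of it.
// Simplification: a leading dot is ignored and every cookie is treated
// as a domain cookie (host-only cookies are not distinguished).
bool domainMatches(const std::string& host, std::string domain) {
    if (!domain.empty() && domain.front() == '.') domain.erase(0, 1);
    if (domain.empty()) return false;
    if (host == domain) return true;
    if (host.size() <= domain.size()) return false;
    const std::size_t offset = host.size() - domain.size();
    // The character before the suffix must be a dot, so that
    // "badexample.com" does not match "example.com".
    return host[offset - 1] == '.' &&
           host.compare(offset, domain.size(), domain) == 0;
}

// Builds the value of a "Cookie:" request header for `host`
// from the captured cookies, in the order they were captured.
std::string cookieHeaderFor(const std::string& host,
                            const std::vector<CapturedCookie>& cookies) {
    std::string header;
    for (const auto& c : cookies) {
        if (!domainMatches(host, c.domain)) continue;
        if (!header.empty()) header += "; ";
        header += c.name + "=" + c.value;
    }
    return header;
}
```

The resulting string could then be set as the Cookie header on whatever HTTP client does the actual scraping. If the scraper itself uses Qt networking, handing the captured cookies to a cookie jar is probably the more natural route than building the header by hand.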
@kshegunov Maybe I should have used different words. But I understand what you're saying.
Thanks for your suggestion. I'm using QWebEngineView. It looks like the simplest option.
One thing I've noticed is that the web engine CRAWLS in a debug build. This is a big annoyance because the login page becomes very unresponsive (not to mention the reCAPTCHA...) and I have no explanation for it.
In the release build, it works flawlessly.
Unfortunately I really have no idea. I've never even looked at the documentation of that module, I just know it exists (for the mentioned purpose). Hopefully someone with an idea is going to pick up the thread and give you a decent suggestion.