How to extract text from PDF?
-
wrote on 31 Oct 2024, 20:07 last edited by
Hi everyone,
How can I extract text from a PDF file? 😁
-
wrote on 31 Oct 2024, 21:53 last edited by ChrisW67
Welcome to the forum.
Assuming you want to do this using Qt, there's no out-of-the-box way to achieve this.
How you go about it depends on what you are doing this for, how you want to handle non-text content, how you want to handle layout, what platform you are on, and so on.
You might get away with something like Ghostscript:
gs -sDEVICE=txtwrite -o output.txt input.pdf
(or a Windows equivalent, e.g. gswin64c)
-
Hi and welcome to devnet,
Another option is to convert your PDF to images and use something like Tesseract to do OCR on them.
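A rough sketch of that image-plus-OCR route, in case it helps. It assumes the pdftoppm (from poppler-utils) and tesseract command-line tools are installed and on the PATH. The function names, the 300 dpi default and the "eng" language are just illustrative choices, not anything prescribed by either tool.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_command(pdf_path, prefix, dpi=300):
    """Build the pdftoppm call that renders each page to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_command(image_path, language="eng"):
    """Build the tesseract call that prints the recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", language]


def page_number(image_path):
    """Read the page number from a name such as page-03.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, language="eng", dpi=300):
    """Render every page to PNG, OCR each image and return the joined text."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_command(pdf_path, prefix, dpi), check=True)
        pages = []
        # Sort numerically so page 10 does not end up before page 2.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(ocr_command(image, language), check=True,
                                    capture_output=True, encoding="utf-8")
            pages.append(result.stdout)
    return "\n".join(pages)
```

Calling ocr_pdf("input.pdf") should return the recognised text as a single string. OCR is slow and lossy compared with reading the text layer directly, so this is mostly worth it for scanned documents.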
-
wrote on 1 Nov 2024, 02:46 last edited by hskoglund 11 Jan 2024, 03:22
Hi, on Ubuntu there's pdftotext (part of the poppler-utils package).
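If you want to call pdftotext from a script, something along these lines might work. It is only a sketch: it assumes pdftotext is on the PATH, and the helper names are made up for the example. Passing "-" as the output file asks pdftotext to write to stdout, and -layout asks it to try to keep the original physical layout.

```python
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build a pdftotext call; "-" as the output file means stdout."""
    command = ["pdftotext"]
    if keep_layout:
        # Try to preserve the physical layout of the page (columns, tables).
        command.append("-layout")
    command += [str(pdf_path), "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(pdftotext_command(pdf_path, keep_layout),
                            check=True, capture_output=True, encoding="utf-8")
    return result.stdout
```

For example, extract_text("input.pdf", keep_layout=True) should give you the text of the whole document, and raises subprocess.CalledProcessError if pdftotext reports a failure.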
There's also the QPdfDocument class, which has a getAllText() function. However, it looks like you have to compile/build QPdfDocument yourself, i.e. it's not included in the Qt installer.