How to invoke Sumatra from another app to convert PDFs to text?

roklasfon

Hi everybody. I have downloaded Sumatra PDF reader and it is a really great application. I want to use it to convert PDF files to text files from another application. Is there any function or library in Sumatra PDF that I can include in my program, so that my program can use Sumatra PDF to convert a PDF file to text?

GitHubRulesOK

PDF to text is a variable goal.

PDF pages can be textual vector objects (glyphs / characters / words) placed via coordinates within a “page zone”, so converting them to sequential plain text is difficult in those cases.
The characters could also simply be pixel images, which need OCR conversion to letters in word sequences.

Those tasks are usually beyond the scope of a viewing application, so only in a few cases can SumatraPDF simply save pages that are already stored as text into a text file.
Take this example: the first paragraph as saved from the PDF using any viewer app.

Points to note:
The first three lines of the exported text do not visibly appear in the PDF until the end of the page. That is a common feature of PDF files (what you see is NOT the order in which it is stored, which matters especially for accessibility users).

Graphic lines and images are not exported as text. Audio readers may need hidden descriptive text tags.

Also, as plain text, the vertical text can only be exported as horizontal words.
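
The storage-order point can be illustrated with a small Python sketch. Everything in it is hypothetical: the Span type, the sample coordinates, and the simplification that y is measured from the top of the page are assumptions for illustration, not anything SumatraPDF or MuPDF exposes. It only shows why text that is positioned by coordinates has to be re-sorted before it reads like plain text.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical positioned piece of text; not a type from any PDF library."""
    x: float   # left edge, in points from the left of the page
    y: float   # baseline, in points from the TOP of the page (assumed for simplicity)
    text: str


def reading_order(spans: list[Span], line_tolerance: float = 2.0) -> list[str]:
    """Group spans into lines by similar y, then read each line left to right."""
    lines: list[list[Span]] = []
    for span in sorted(spans, key=lambda s: s.y):
        # Start a new line unless this span sits on (almost) the same baseline.
        if lines and abs(lines[-1][0].y - span.y) <= line_tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    return [" ".join(s.text for s in sorted(line, key=lambda s: s.x)) for line in lines]


# Storage order here is deliberately scrambled: the footer is stored first.
stored = [
    Span(72, 700, "Page 1"),
    Span(150, 100, "notes"),
    Span(72, 100, "My"),
    Span(72, 120, "Second line"),
]
print([s.text for s in stored])   # order as stored in the file
print(reading_order(stored))      # order as a reader would see it
```

Real pages add columns, rotated text and overlapping boxes, which is why a simple sort like this is only a starting point.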

So it is best to use SumatraPDF to pass pages or the whole file to a dedicated third-party converter, choosing between different apps based on the type of source.
SumatraPDF is based on MuPDF, which has OCR capability, but it's difficult to use, so I suggest adding another tool.
One application that handles many PDF types well is an editor such as Tracker Exchange; by integrating two-way exchange with SumatraPDF you get a fast viewer combined with a reasonable converter.

The best way to use SumatraPDF during conversion is as a previewer and a source for manual cut and paste, so that you can ensure the result is as near as dammit 100% similar. So here is a page converted using inline HTML

roklasfon

Thank you for your answer, but I am very sorry, I understood nothing from it. All I want is this: since Sumatra PDF is an open source application, I think it should have a library or function that I can include in my code, so that I can use Sumatra PDF to convert any PDF to a text file. For example, the Word application has OLE functions to convert Word files to text files. Is there anything like that in Sumatra PDF? I hope you understand me now.

roklasfon

I have found this code in a Google group:

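FUNCTION Sumatra_SelectAll( cPanel )
   // Window handle of the SumatraPDF frame embedded in cPanel (0 if there is none)
   LOCAL nHFrame := Sumatra_FrameHandle( cPanel )

   IF nHFrame != 0
      // Send the "Select All" menu command to that window
      SendMessage( nHFrame, 273 /*WM_COMMAND*/, 422 /*IDM_SELECT_ALL*/, 0 )
   ENDIF

RETURN NIL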

Can you explain it for me?

GitHubRulesOK

There are many forks of SumatraPDF, and several involve DLL wrappers, but as far as I know the primary calling method supported by SumatraPDF is a limited range of DDE directives, and the code you quote is unfamiliar. What is the Google page it was found on?
Some programmers may have used SumatraPDF as an embedded app, and that in turn may have included functions like sending the keys CTRL+C to copy any embedded text to the clipboard, as you can do yourself.
The point I am making is that PDF text is not much use for general conversion, unlike the ordinary / rich text in a word-processing file, which is always usable.

SumatraPDF is a viewer based on MuPDF, which has a JavaScript-based API that SumatraPDF does not currently use, nor does it use the Tesseract conversion features of MuPDF.
So it may be easier to look at calling MuPDF directly (which has PDF in > text out features) without trying to go via SumatraPDF's rendering, which is mainly for screen viewing.

In short, whilst a word processor can make a PDF from images and characters, PDF viewers are not designed to make words, only to carry the images and shapes of letters to the pixel screen or printer.

roklasfon

Thank you very much for your precious answer, my dear brother. But how am I going to call MuPDF directly? Does MuPDF have functions? Can you give me an example, please?

GitHubRulesOK

The fuller, low-level MuPDF document manipulation API is described in https://mupdf.com/docs/api/
However, there is simpler, higher-level access described in https://www.mupdf.com/docs/manual-mutool-run.html

read(fileName)
Read the contents of a file and return them as a UTF-8 decoded string.

readline()
Read one line of input from stdin and return it as a string.

require(module)
Load a JavaScript module.

write(...)
Print arguments to stdout, separated by spaces.

Also see the simplest of all methods, which is similar to copy and paste but via the CLI:
https://www.mupdf.com/docs/manual-mutool-convert.html and
https://www.mupdf.com/docs/manual-mutool-draw.html, and here is the result of

mutool draw -o output.txt in.pdf
which we could call from SumatraPDF as either
mutool draw -o output.txt "%1" for the current file or
mutool draw -o output.txt "%1" %p for the current page
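
For calling the same command from another program, here is a minimal sketch in Python (not Harbour) of the general shape. The function names are made up for illustration, and it assumes mutool is installed and on the PATH. Passing the arguments as a list avoids quoting problems with file names that contain spaces.

```python
import shutil
import subprocess


def mutool_text_command(pdf_path, txt_path, pages=None):
    """Build the argument list for `mutool draw -o <txt> <pdf> [pages]`."""
    command = ["mutool", "draw", "-o", txt_path, pdf_path]
    if pages:
        # Optional page selection, e.g. "3" for the current page only.
        command.append(pages)
    return command


def pdf_to_text_file(pdf_path, txt_path, pages=None):
    """Run mutool (assumed to be on PATH) and return True if it reported success."""
    if shutil.which("mutool") is None:
        raise FileNotFoundError("mutool was not found on PATH")
    result = subprocess.run(mutool_text_command(pdf_path, txt_path, pages))
    return result.returncode == 0


# Only the command construction is shown here; nothing is executed.
print(mutool_text_command("in.pdf", "output.txt"))
print(mutool_text_command("MyNotes Readme.pdf", "output.txt", "3"))
```

Calling pdf_to_text_file("in.pdf", "output.txt") would then run the conversion, provided mutool is available.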

roklasfon

Hi my brother, thank you very much for your detailed answer, but sorry, I got confused. Can you tell me simply, step by step, which function I have to use to convert a PDF file to text and store the text in an array so that I can use it in my program? Please concentrate on answering just this question and don't mention anything else. Simply and step by step, please.

GitHubRulesOK

If, and only if, the text was stored in the file as plain text, then it is simple to save the PDF text content using SumatraPDF.

The example you found uses the Windows Select All command to gather whatever it finds in a customised copy of SumatraPDF, then uses its own method (in the background) to save the contents and import them into its own application.
Here is a very simple manual example where I, as a user, can simply use Save As to export the text. (That could easily be automated by macro applications such as TinyTask or AutoHotkey.)

I can see how that same .txt was saved to a file by reading it back into SumatraPDF.

It will not have exactly the same layout, but ALL the words are there.
However, mutool draw will do a much better job of keeping some whitespace, so using

mutool draw -o output.txt "MyNotes Readme.pdf"
this is the output.txt

A magazine with images will only save the text between the images.
If I wish to save the images I would also need to use clipboard functions or an application that can export parts, like mutool draw for text and mutool extract for images, and there are other conversions like OCR.

Anything more complex in SumatraPDF is beyond its native functions and will require programming to manipulate any outputs.

roklasfon

Thank you, my dear brother, for your answer. So this is the converting function you mentioned in your answer:
mutool draw -o output.txt "MyNotes Readme.pdf"
Can I use this function in a program directly? What about "-o": how can I write it in a program, for example in the Harbour programming language? Which library do I have to include so that my program can reach this function, and which file is it in?
Note: I have downloaded the source code file "mupdf-1.18.0-source.tar.gz". In which file is the function you mentioned?

GitHubRulesOK

I believe Xalier may use SumatraPDF as a plugin simply to view PDF files;
see License - Commercial use / Distribution with 3rd Party Software and https://idlagam.com/forum/resources/plugin.63/

I do not know how Harbour would call mutool in the background, especially as my knowledge is mainly Windows, but you are looking at a Linux .tar.gz?
It would most likely be via the shell; there is a hint at https://github.com/harbour/core/blob/master/tests/osshell.prg

Once you hand over the PDF file location to mutool in the shell and it builds the output (perhaps in a temp folder), I note there are example .prg files (e.g. https://github.com/harbour/core/blob/master/tests/fileio.prg) for reading text files that you could use to test reading sample output, but I do not know how Harbour moves that into arrays.
I also see there that there may be some string limits, which suggests it would be wise to test and set a maximum number of pages' worth of text (so as not to exceed working memory), but I do not know Harbour's limits.
It is best to raise such usage with others who program in Harbour at https://groups.google.com/g/harbour-users/
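
As a language-neutral illustration of the "read the output file back into an array, with a size limit" step, here is a small Python sketch. It is not Harbour code; the function name and the 1,000,000-byte default limit are arbitrary choices for illustration, and it assumes the text file is UTF-8. A temporary file stands in for the output that mutool would have written.

```python
import os
import tempfile


def read_text_lines(txt_path, max_bytes=1_000_000):
    """Return the file's lines as a list, refusing files larger than max_bytes."""
    size = os.path.getsize(txt_path)
    if size > max_bytes:
        raise ValueError(f"{txt_path} is {size} bytes; limit is {max_bytes}")
    # Assumes UTF-8 text; undecodable bytes are replaced rather than raising.
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()


# Stand-in for a file that mutool would have written to a temp folder.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "output.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("First line\nSecond line\n")
    lines = read_text_lines(sample)

print(lines)
```

The same three steps (check the size, read the file, split into lines) should map onto whatever file and string functions Harbour provides.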

roklasfon

My brother, you still haven't answered me. My question is so simple, but you still don't get it. MuPDF is an open source tool, right? That means all its functions are available. Just tell me where the mutool function is located so I can include it in my Harbour code. Please tell me the location of the source code of this function. I don't want to use it from a shell command line; I want to include its source code in my program, so that my application itself uses this function. I hope it is clear now. Just tell me where I can find the source code of this function.

GitHubRulesOK

The latest development source is available directly from the git repository:

git clone --recursive git://git.ghostscript.com/mupdf.git

roklasfon

There are a lot of files at this link. Which one of them has the source code for saving a PDF file as text?

GitHubRulesOK

There is not one; hundreds are needed to handle all the programmable objects in an Adobe PDF. MuPDF is just like Ghostscript (its bigger sibling): it is a library of functions that in turn depends on other third-party libraries.

open file
handle encryption
read index
deflate objects
read vectors
decode/define glyphs
convert some objects to text
convert some to lines that look like text etc etc

roklasfon

Does that mean I should include them all?

GitHubRulesOK

I do not know how many of those files your needs would require. You only need one small part of everything.
Without knowing exactly what type of PDF you use as input (and there are as many PDF structures as there are PDF generators), the most appropriate solution for your needs is possibly a single addressable PDF API package (which SumatraPDF is not) that does not need the shell. Most of those in that group are not open source or “freeware”; MuPDF itself is open source under the AGPL, with a commercial licence available for closed-source use.

If you are using a high-level language such as Java or Python, there are often language-specific PDF API tools on GitHub that can target text extraction, like https://pdfreader.readthedocs.io/en/latest/tutorial.html#tutorial-content

Your need to integrate directly into Harbour will have been encountered many times by Harbour programmers, so my advice again is to ask those who have already reinvented that wheel what they found easiest or most efficient for accessing PDF internals.

For example Sumatra PDFVIEW discussion is at https://hmgforum.com/search.php?keywords=pdfview&sid=a02879f5347dc059346ffd31f75e4389

SumatraPeter

Let me be very frank here. I’ve been silently following this discussion so far and it seems to me you’re either a novice at Harbour in particular or I suspect programming in general. First off, this isn’t a code-writing service; it’s a forum primarily for end users to ask for help with Sumatra, report bugs and so on. Second, even the dev Krzysztof doesn’t have the time to hand-hold novice coders - it’s clear from his prior (perfectly reasonable IMO) responses that folks are generally expected to figure things out for themselves when it comes to Sumatra’s open source code. Third, if you need help with the MuPDF API and not Sumatra or its UI additions, you need to either delve into that code yourself and figure it out or ask the MuPDF devs (I highly doubt they’ll have the time to hand-hold either). You can also try specifically coding-oriented help/discussion sites such as StackOverflow and CodeProject. Last but not least, even after you do manage to figure out where the MuPDF functions (written in C) are that deal with PDF to text conversion, just how do you plan to use them in your Harbour code? I’m not at all familiar with Harbour, but I would be highly surprised if it’s going to be as simple as “include [MuPDF’s C] source code in my program”.