SumatraPDF (I currently use the 64-bit Windows version, v3.1.2) has long had an issue with copying and pasting Unicode text. Please see these examples, which copy and paste correctly with Adobe Acrobat Reader and Chrome’s PDF reader, but not with SumatraPDF.
Since the source docs appear to have been processed by Tesseract (Greek), I think it is possibly related to https://github.com/sumatrapdfreader/sumatrapdf/issues/544#
I previously found there that re-processing them through a different Tesseract caller resolved the underlying word-spacing problem.
Here I use PDFXedit integrated with SumatraPDF to OCR the Pooh example from https://willus.com/k2pdfopt/help/ocr.shtml It is a first shot without training, but it's good enough to show that it is the way the Tesseract engine application is configured that causes the extra line feeds, paragraph spacing, etc.
And here is the text pasted into WordPad after spell checking in Xchange, then justified and saved as a PDF for reading in SumatraPDF.
It can then be copied and pasted again with no problems, 100% word-perfect, without OCR.
@GitHubRules, your post is not related to the issue I intended to discuss, so I will clarify. If you try copying and pasting the Greek Unicode text in those docs (e.g. book1.pdf from the link in my original post) from SumatraPDF into MS Word or into translate.google.com, you’ll see that the Unicode values SumatraPDF puts on the clipboard are completely incorrect (see screenshots below). I believe this is because SumatraPDF does not correctly evaluate the “ToUnicode” CMap in the font dictionary within the PDF files. See Section 9.10.3 of the PDF 1.7 “Document Management” guide.
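To make the ToUnicode point concrete, here is a rough Python sketch of what evaluating such a CMap involves. It is not SumatraPDF's code, and the CMap fragment is a made-up sample, not one extracted from book1.pdf. The names `SAMPLE_CMAP`, `parse_tounicode` and `decode_codes` are hypothetical, and the parser is deliberately simplified: it handles `bfchar` entries and the plain form of `bfrange` only (no array form, no destinations outside the Basic Multilingual Plane).

```python
import re

# Made-up ToUnicode CMap fragment for illustration (not from book1.pdf).
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
2 beginbfchar
<01> <03B1>
<02> <03B2>
endbfchar
1 beginbfrange
<03> <05> <03B3>
endbfrange
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build a {character code: Unicode string} table from a CMap string."""
    mapping = {}
    # bfchar: one source code mapped to one UTF-16BE encoded destination.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange (plain form): consecutive codes map to consecutive code points.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def decode_codes(codes, mapping):
    """Translate character codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    table = parse_tounicode(SAMPLE_CMAP)
    print(decode_codes([1, 2, 3, 4, 5], table))
```

If a viewer skips this table and treats the raw character codes as if they were Unicode values, the clipboard ends up with unrelated characters, which would match the symptom in the screenshots.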
Hmmm…okay, in the latest version of SumatraPDF, 3.2, just released today, this problem seems to be fixed. The Unicode characters are correct, but I still get a linefeed after every word, so it does not paste the same as you show above. What exact version of SumatraPDF did you use above, and on what version of Windows?
I am not saying SumatraPDF is perfect and handles word spacing correctly.
I too get the linefeeds from that type of OCR input (hence my comment about poor spacing), but if the source has conventional text word spacing (even when the spaces are widened by justification, as shown in my first comment), then the spaces are treated as inter-word spaces.
I have added the above sample, book1.pdf, for investigation to https://github.com/sumatrapdfreader/sumatrapdf/issues/544#issuecomment-599272224
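Until the spacing is handled better, the pasted text can be cleaned up after the fact. The following is only a workaround sketch in Python, not anything SumatraPDF does itself: the hypothetical helper `unwrap_words` assumes that blank lines mark paragraph breaks and that any single linefeed is really a word space.

```python
import re


def unwrap_words(pasted):
    """Join words split by single linefeeds; keep blank-line paragraph breaks."""
    text = pasted.replace("\r\n", "\n")        # normalise Windows line endings
    paragraphs = re.split(r"\n\s*\n", text)    # blank lines separate paragraphs
    # str.split() with no argument collapses every run of whitespace.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


if __name__ == "__main__":
    print(unwrap_words("Winnie\r\nthe\r\nPooh\r\n\r\nChapter\r\nOne"))
```

If the pasted text has no blank lines between paragraphs, this merges everything into a single paragraph, so it only suits sources where the paragraph gaps survive the copy.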
Forgive my confusion. I’m new to this thread. Are you a developer on the project? Is there a list of developers somewhere?
No, I am only one of two triage moderators.
The tea-boy, Chief Executive and sole current developer is KJK.