
thegrapevine

Can SumatraPDF add support for ignoring word accents in the search box?

Example:

Coração
Coracao

GitHubRulesOK

Everything is possible, but that would be a massive change, much greater than I have seen in other apps. The search box is a one-character pony, just like most find boxes in Windows: it simply searches for the next character that matches the input.
So for every language with e it would need to also match è, é, ê, ë, É, È, Ê and Ë, plus those in other languages. Multiply that by at least the 12 variants of A that I know of, then the same for all other letters, and it suddenly becomes a significant slowdown to decide where the next match is.

Then compound that with a second letter, so the permutations now multiply, not counting uppercase, and so on; by the time you have typed in a word of, say, 7 characters, the slowdown and the misfires get worse.
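
To make that concrete, here is a rough sketch in Python (in no way SumatraPDF's actual code; the variant table is invented and incomplete) of what per-character variant matching looks like. Each test that used to be a single character comparison becomes a check against a whole set of variants:

    # Illustrative only: naive accent-tolerant find-as-you-type.
    VARIANTS = {
        "e": set("eèéêëÈÉÊË"),
        "a": set("aàáâãäåÀÁÂÃÄÅ"),
    }

    def naive_match(page, query):
        # Each position is tested character by character; each test is a
        # membership check against a variant set instead of one equality.
        for start in range(len(page) - len(query) + 1):
            if all(page[start + i] in VARIANTS.get(q, {q})
                   for i, q in enumerate(query.lower())):
                return start
        return -1

    print(naive_match("Ils se fâchèrent.", "facherent"))  # prints 7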

Since PDF does not really contain the logic to store or extract words, except as ink blobs that on occasion happen to match basic font entries, it is amazing that we can sometimes get any search match at all.

thegrapevine

I’ve been a programmer for 15 years and I understand the concern. Could the Boyer-Moore algorithm reduce this slowness? That would be wonderful for SumatraPDF. For now I still have to resort to Adobe Acrobat Reader to use the search box.
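
For reference, here is a minimal Boyer-Moore sketch in Python (bad-character rule only, the textbook form; no claim that this is what Acrobat or SumatraPDF actually run). The point is that a mismatch lets the pattern jump forward instead of re-testing every position:

    def boyer_moore(text, pattern):
        """Bad-character-rule Boyer-Moore; first match index or -1."""
        m, n = len(pattern), len(text)
        if m == 0:
            return 0
        last = {ch: i for i, ch in enumerate(pattern)}  # rightmost positions
        i = j = m - 1                  # compare pattern right to left
        while i < n:
            if text[i] == pattern[j]:
                if j == 0:
                    return i           # full match
                i -= 1
                j -= 1
            else:
                # jump past the mismatching text character
                i += m - min(j, 1 + last.get(text[i], -1))
                j = m - 1
        return -1

    print(boyer_moore("ils se facherent", "facherent"))  # prints 7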

GitHubRulesOK

You are probably correct that Acrobat may use such an algorithm; in some cases a plain search for “al” will stop at “ał”, but in a rough test with Polish it did not stop in ałl the cases I tried, so it looks like they may use a more customised search.
I also have to agree that Adobe’s search seems very useful, but it cannot handle words with one “wrońg” character, so a Boyer-Moore approach where a level of confidence is applied would be attractive.

Popo_Lino

Me too; I personally use it more often (I mean Adobe Acrobat Reader).

timeSlice

I am curious why simple pre-processing would not do the trick. By “pre-processing” I mean removing all accents in the first place, and performing a regular search afterwards.

Methods based on Unicode Normalization Form D (NFD) can be used for accent removal (for example, see stripAccents). You could also take a look at this comment on SO with some important points.
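
A minimal sketch of that idea, assuming Python’s standard unicodedata module (stripAccents, if it is the Apache Commons Lang helper, works on the same decompose-then-drop-marks principle):

    import unicodedata

    def strip_accents(s):
        # Decompose to NFD, then drop combining marks (category Mn).
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")

    print(strip_accents("Coração"))   # Coracao
    print(strip_accents("wrońg"))     # wrong
    print(strip_accents("ałl"))       # ałl: Polish ł has no decomposition

One caveat worth flagging: letters such as the Polish ł carry no combining mark in NFD, so they survive stripping; that is presumably among the points the linked SO comment makes.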

GitHubRulesOK

Hmm, the text is usually encoded in something similar to a zip stream, and often encrypted, so the unpacking order is: decrypt, decode the stream, determine the font definition, look up the font, decode the character shape, convert to pixels, present to screen, many times over.

Then, as a secondary stage, either:
the user selects a block of characters, the font is looked up, the characters are decoded if possible, and the result is presented to the clipboard as the best plain text possible (warts and all),
or
the currently expanded displayed text is searched for the first best-matching character, which is highlighted if anything matches the user input (but there may be nothing to match).

timeSlice

Simplistically, I suppose that “pre-processing” (done only once) would yield a separate “search source” mapped to the “display source”. Search would then be performed against that “search source”, and the actual highlight would be done at the correlated place in the “display source”.
As far as I can understand from your answer, there’s a catch in the “mapping” part, isn’t there?
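
A sketch of that two-source idea, assuming the folding is done per character and using plain string indices for the “mapping” (in a real viewer the display side is glyph geometry, which is where the catch lives):

    import unicodedata

    def build_search_source(display):
        """Fold once; remember which display index each folded char came from."""
        folded, back = [], []
        for i, ch in enumerate(display):
            for c in unicodedata.normalize("NFD", ch):
                if unicodedata.category(c) != "Mn":
                    folded.append(c.lower())
                    back.append(i)      # folded position -> display position
        return "".join(folded), back

    display = "Ils se fâchèrent."
    source, back = build_search_source(display)
    hit = source.find("facherent")
    if hit >= 0:
        start = back[hit]
        end = back[hit + len("facherent") - 1] + 1
        print(display[start:end])       # highlights "fâchèrent"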

GitHubRulesOK

PDF was designed to put text as inked shapes on paper using a set range of print-engine fonts, and thus was never designed to be reverse engineered. Over the years, as PDF became more complex, the user expectation was that text extraction could improve with embedded fonts and lookup tables; the reality is that it is generally worse, with more ways to fail.

Converting HTML or Word docs to PDF is a retrograde step, since those formats are based on sequential words, line feeds, sentences and paragraph markers, none of which were ever added to PDF’s methodology. Many PDF constructs are never going to reproduce the source words: a page of ink can be placed in any order from its lower-left origin, and we are lucky that most converters will try to place words sequentially in the opposite direction, at the same font height/style, on common lines. A “good” PDF does not need to include any characters at all, and definitely no white space as seen between words or paragraphs, just vectors that define the borders of the ink particles; but for portable convenience in minimising file size, that includes using font lookup tables.

Stack Overflow and many other forums have tens of thousands of “why can’t I, since it should be easy to” questions. For one related accented-text extraction issue, see python - Encoding problems when extracting text with PyPDF2 - Stack Overflow, but the majority are “why can’t I find a character with an accent” rather than the topic here.

timeSlice

I understand that PDF is a highly cumbersome format; thank you for the additional details on this topic.
On the other hand, I still don’t clearly see the complexity of this feature request.

So, for the sake of good order, let me rephrase the original request more specifically.
The goal: implement accent-insensitive search. For example, a search for the word “facherent” must find “fâchèrent” at the top of page 22. This PDF from Project Gutenberg is used as the reference example.

Please forget about my previous pre-processing proposal, to avoid unnecessary confusion.

I do not fully understand this rationale. Basically, we need to search for the next character that matches the input. That operation already includes at least the “look up font” and “decode character” (if possible) steps, right? So, to implement accent-insensitive search, we need only one additional step per character: strip the accent(s) from the decoded character. The comparison (match) count stays unchanged. For the previous example, â would be “converted” to a (and è to e) before the match. That way the permutations, and the consequent rapid increase in difficulty, can be avoided, can’t they?
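
As a sketch of that single extra step (Python again; the helper names are mine, and in the real pipeline the characters would come from the font-decoding steps above), the comparison count per character indeed stays at one:

    import unicodedata

    def fold(ch):
        # The one extra step: take the base character of the NFD decomposition.
        return unicodedata.normalize("NFD", ch)[0].lower()

    def chars_match(page_char, query_char):
        return fold(page_char) == fold(query_char)   # still one comparison

    print(chars_match("â", "a"), chars_match("è", "e"))  # True True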

GitHubRulesOK

This is how the text looks as plain text for searching

Now, assuming each letter is in reality just a number of bits saved as bytes, let’s see how searching by numbers works. This is not a true demonstration, but it illustrates the underlying complexities.

SumatraPDF.exe -zoom "fit width" -page 1 -search "4 1 5" b.pdf

So if we were to use another 5 (e) in place of the 4 (è), we could expect a totally different result.
SumatraPDF.exe -zoom "fit width" -page 1 -search "5 1 5" b.pdf

A further complication is when the font is built from sequential numbers. If a scanned page used the sequence è, é, ê, ë, É, È, Ê, Ë, e, E, we could say characters 0-9 are all the same letter, but we would need some means of saying which of them are the “case” to ignore. And fonts don’t have to declare which shapes are different; they must merely be given different numbers, and those numbers may be used for identical shapes of inking.
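
As a toy model of that (the numbers and the mapping are invented for illustration, not real PDF structures): the page stores only glyph numbers, and whether a number can be folded back to a letter at all depends on an optional, possibly wrong, ToUnicode-style table in the font:

    page_glyphs = [7, 4, 1, 5]                     # what the page actually stores
    to_unicode = {1: "a", 4: "è", 5: "e", 7: "f"}  # optional; may be absent or wrong

    def search_glyphs(glyphs, mapping, query):
        # No mapping entry means no character to fold or compare.
        text = "".join(mapping.get(g, "\ufffd") for g in glyphs)
        return text.find(query)

    print(search_glyphs(page_glyphs, to_unicode, "èae"))  # 1
    print(search_glyphs(page_glyphs, to_unicode, "eae"))  # -1 unless è is folded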

thegrapevine

GitHubRulesOK

Well spotted.
I forgot that’s where the setting is, but even my old trimmed-down Acrobat (100 MB, 224 files, 46 folders) with that feature uses more code/disk space for those scripted extras than SumatraPDF does in its whole single file.

thegrapevine

What do you think about using the D programming language? https://dlang.org/ D’s Phobos library has regular expressions (Regex) and a Boyer-Moore finder (BoyerMooreFinder). The D programming language is compatible with C and C++. It creates extremely small files, and is extremely fast like C.