Sumatra PDF is a PDF, ePub, MOBI, CHM, XPS, DjVu, CBZ, CBR reader for Windows

Can SumatraPDF add support for ignoring word accents in the search box?

Everything is possible, but that would be a massive change, much greater than I have seen in other apps. The search box is a one-character pony, just like most find boxes in Windows: it simply searches for the next character that matches the input.
So for every search on e it would also need to match è, é, ê, ë, É, È, Ê, Ë, plus the variants in other languages. Multiply that by at least the 12 variants of A that I know of, then do the same for every other letter, and it suddenly becomes a significant slowdown to decide where the next match is.

Then compound that with a second letter, so the candidate combinations multiply (not even counting uppercase), and so on. By the time you have typed in a word of, say, 7 characters, the slowdown and the misfires get worse.
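To illustrate the blow-up described above, here is a minimal sketch (not SumatraPDF's actual code, and the variant tables are only partial examples) of a naive accent-insensitive scan, where each typed character expands into a class of accented variants that must all be checked at every position:

```python
# Hypothetical illustration: each plain letter maps to a class of variants.
# Real coverage would need such a table for every letter in every script.
VARIANTS = {
    "e": "eèéêëĕėęě",
    "a": "aàáâãäåāăą",
}

def matches_at(text, pos, query):
    """Check whether `query` matches `text` at `pos`, treating each
    query character as equivalent to any of its accented variants."""
    for i, ch in enumerate(query):
        if pos + i >= len(text):
            return False
        allowed = VARIANTS.get(ch.lower(), ch.lower())
        if text[pos + i].lower() not in allowed:
            return False
    return True

def find_next(text, query, start=0):
    # Naive scan: every position tests every variant class, so a
    # 7-character query multiplies the per-position work accordingly.
    for pos in range(start, len(text) - len(query) + 1):
        if matches_at(text, pos, query):
            return pos
    return -1
```

For example, `find_next("résumé", "resume")` would report a hit at position 0, at the cost of a set-membership test per character per position.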

Since PDF does not really contain the logic to store or extract words, except as ink blobs of which some occasionally match basic font entries, it is amazing that we can get any search match at all.

I’ve been a programmer for 15 years and I understand the concern. Could the Boyer–Moore algorithm reduce this slowness? That would be wonderful for Sumatra PDF. For now I still have to resort to Adobe Acrobat Reader to use the search box.

You are probably correct that Acrobat may use such an algorithm: in some cases searching for “al” will stop at “ał”, but in a rough test with Polish it did not stop at “ał” in all the cases I tried, so it looks like they may use a more customised search.
I also have to agree that Adobe’s search seems very useful, but it cannot handle words with one “wrońg” character, so a Boyer–Moore-style approach where a level of confidence is applied would be attractive.
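A note of caution on the suggestion above: Boyer–Moore proper accelerates *exact* substring search; tolerating a “wrońg” character is approximate matching, which is a different technique. Purely as an illustration of the confidence idea (not anything Acrobat or SumatraPDF actually does), here is a sketch that accepts a match with at most a given number of wrong characters:

```python
def fuzzy_find(text, query, max_errors=1):
    """Report the first position where `query` matches `text` with at
    most `max_errors` mismatching characters, plus the error count.
    Returns None when no position is good enough."""
    for pos in range(len(text) - len(query) + 1):
        # Count character mismatches at this position (case-insensitive).
        errors = sum(1 for i, ch in enumerate(query)
                     if text[pos + i].lower() != ch.lower())
        if errors <= max_errors:
            return pos, errors
    return None
```

So searching for "wrong" in text containing "wrońg" would succeed with one reported error; the error count is the crude "level of confidence".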

Me too; I personally use it (Adobe Acrobat Reader, I mean) more often.

I am curious why simple pre-processing would not do the trick. By “pre-processing” I mean removing all accents in the first place, then performing a regular search.

Methods based on Unicode Normalization Form D (NFD) can be used for accent removal (for example, see stripAccents). You could also take a look at this comment on SO, which raises important points.
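For reference, the NFD approach mentioned above is only a few lines in Python with the standard `unicodedata` module: decompose each character so accents become separate combining marks, then drop the marks. Note the caveat (relevant to the Polish examples in this thread) that some letters, such as “ł”, are single letters rather than base-plus-accent, so NFD leaves them unchanged:

```python
import unicodedata

def strip_accents(s):
    # NFD decomposes accented characters into base char + combining marks;
    # combining marks have Unicode category "Mn" (Mark, nonspacing).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
```

So `strip_accents("wrońg")` gives "wrong", but `strip_accents("ał")` still contains "ł"; full language coverage needs extra per-language mappings on top of NFD.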

Hmm, the text is usually encoded as something similar to a zip stream, and often encrypted, so the unpacking order is: decrypt, decode the stream, determine the font definition, look up the font, decode the character shape, convert to pixels, present to screen, many times over.

The secondary steps are then either:
user selects a block of characters, look up the font, decode each character if possible, present to the clipboard as the best plain text possible (warts and all), or
search the currently expanded display text for the first best-matching character and highlight it if anything matches the user input (but there may be nothing to match).

Simplistically, I suppose that “pre-processing” (done only once) would yield a separate “search source” mapped to the “display source”. The search would then be performed against that “search source”, and the actual highlight would be done at the correlated place in the “display source”.
As far as I can understand from your answer, there’s a catch in the “mapping” part, isn’t there?
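For what it’s worth, the “search source” idea sketches out straightforwardly once you have extracted text to work with; the hard part, as described elsewhere in this thread, is getting reliable text out of the PDF in the first place. A minimal sketch, assuming the display text is already available as a string:

```python
import unicodedata

def build_search_source(display):
    """Build an accent-stripped 'search source' plus an index map so a
    hit in the stripped text can be traced back to the original."""
    search_chars, index_map = [], []
    for i, ch in enumerate(display):
        for c in unicodedata.normalize("NFD", ch):
            if unicodedata.category(c) != "Mn":
                search_chars.append(c)
                index_map.append(i)  # each kept char remembers its origin
    return "".join(search_chars), index_map

def find_and_map(display, query):
    # Assumes `query` is already accent-free; a real implementation
    # would strip the query the same way.
    search, index_map = build_search_source(display)
    pos = search.find(query)
    if pos < 0:
        return None
    # Map the hit back to a start/end range in the display text,
    # which is where the highlight would be drawn.
    start = index_map[pos]
    end = index_map[pos + len(query) - 1] + 1
    return start, end
```

Here `find_and_map("a wrońg word", "wrong")` maps back to the span covering "wrońg" in the display text, so the highlight lands on the accented original.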

PDF was designed to put text as inked shapes on paper using a set range of print-engine fonts; it was never designed to be reverse engineered. Over the years, as PDF became more complex, users came to expect that text extraction would improve with embedded fonts and lookup tables; the reality is that it is generally worse, with more ways to fail.

Converting HTML or Word docs to PDF is a retrograde step, since those formats are based on sequential words, line feeds, sentences and paragraph markers, none of which were ever added to PDF’s methodology. So many PDF constructs are never going to reproduce the source words, since a page of ink can be placed in any order from its lower-left origin; we are lucky that most converters will try to place words sequentially in the opposite direction at the same font height/style on common lines. A “good” PDF does not need to include any characters, and definitely no white space as seen between words or paragraphs, just vectors that define the borders of the ink particles; but for portable convenience in minimising file size, that includes using font lookup tables.

StackOverflow and many other forums have tens of thousands of “why can’t I, since it should be easy to” questions. For one related accented-text extraction issue, see “python - Encoding problems when extracting text with PyPDF2 - Stack Overflow”, but the majority are “why can’t I find a character with an accent”, rather than the topic here.