Adding regex to search

Winfried

Hello,

It seems that no PDF viewer currently supports regular expressions for searching contents, e.g. “(keyword1|keyword2)” to search for two different words.

I suggest adding this feature to SumatraPDF.

Cheers,

SumatraPeter

Found an old request that was rejected:

1112

You might use Total Commander with the “wdx_xpdfsearch_2.00_beta_1” plugin.

whugemann

I have just used the freeware Agent Ransack by Mythicsoft as a workaround. It can search for regular expressions in a lot of file formats, including PDF. Like Google Search, it displays the context in which the regex was found, which you can then search for in your favourite PDF viewer.

Nevertheless, it would be a good feature to have natively in SumatraPDF.

GitHubRulesOK

The way a PDF stores characters causes problems: it is difficult to ensure that sequential letters (which are subject to ligatures and kerning when printed) can be extracted as words with spaces. Agreed, a means of skipping letters should in theory be trivial, but see Can SumatraPDF add support for ignoring word accents in the search box?

With some complex heuristic programming it would be possible to support arbitrary character sequences (as Acrobat appears to do), but as the illustration in the ignoring-accents thread shows, the code needed to stop at the correct match would not be trivial.
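To illustrate the ligature and accent point, here is a minimal Python sketch. It is not how SumatraPDF works internally; it only assumes that a text extractor has returned a string containing a ligature glyph and an accented letter, and uses standard Unicode normalization to fold both before a regex is applied. The function name `fold` is made up for this example.

```python
import re
import unicodedata

def fold(text: str) -> str:
    """Expand compatibility ligatures and drop combining accents."""
    # NFKD splits U+FB01 (the "fi" ligature) into "f" + "i",
    # and U+00E9 ("e" with acute) into "e" + a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical text as a PDF extractor might return it.
extracted = "The \ufb01nal r\u00e9sum\u00e9"

print(re.findall(r"final|resume", extracted))        # []
print(re.findall(r"final|resume", fold(extracted)))  # ['final', 'resume']
```

Note that folding changes the length of the string, so match offsets in the folded text no longer line up with the glyphs on the page. That is one reason stopping at (and highlighting) the correct match is harder than finding it.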

Some tools can use the SumatraPDF iFilter (1 file). The above-mentioned Agent Ransack (many features, but fairly large: 85 MB installed/portable in over 950 files) searches using an iFilter such as SumatraPDF's, and Xpdf pdftotext (10.7 MB in 213 files), to extract searchable text.
Lighter is dnGrep (13 MB installed in 30 files, of which only 1 MB is Xpdf pdftotext plus Everything). I did not get that one working in portable mode, and I suspect it would not be as powerful for PDF searches without the extra 212 support files.

Clearly Xpdf is a favourite text extractor for getting PDF contents where possible, after which those text files are parsed with regex libraries. Note, however, that many textual PDF files are not searchable even if you use regex. If you can use regex in Acrobat, it's best to use its much deeper search facility.

Winfried

Thanks much for the technical explanation.

whugemann

I have just read your explanations in Can SumatraPDF add support for ignoring word accents in the search box? and had a quick look at the content of a PDF file in Notepad++. I have to admit that it is quite a mess.
Nevertheless, it seems that you already discovered ways to convert this into a stream of text, as one can search for text in a PDF with SumatraPDF. Thus searching for regex would basically mean searching this somehow existing stream of text by other criteria.
I understand that even simple text search is not very reliable and searching for regex will make things worse, but you could show the user a disclaimer before doing the search, like: ‘No matches is no definitive proof that the pattern does not exist.’
In my case, I had to search a bunch of PDFs for the decimal separator. Easy with regex, difficult without.
Well, I have my personal workaround with Agent Ransack, and there are possibly more urgent issues than regex search. Nevertheless, it would be a distinguishing feature for SumatraPDF (besides the others!).
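For the decimal separator case, a rough Python sketch of the kind of pattern meant here, run on text that has already been extracted from a PDF. The pattern and the function name `separators_used` are illustrative assumptions, not something taken from Agent Ransack or SumatraPDF. It deliberately skips strings that look like grouped thousands or dates, so it will miss some cases, and an empty result is no proof that no decimal numbers exist.

```python
import re

# A digit run, one separator, a digit run, and nothing number-like
# directly before or after it (so "1,234.56" and "12.05.2023" are skipped).
DECIMAL = re.compile(r"(?<![\d.,])\d+([.,])\d+(?![.,]?\d)")

def separators_used(text: str) -> set:
    """Return the set of decimal separators found in plain text."""
    return {m.group(1) for m in DECIMAL.finditer(text)}

print(separators_used("Length 3,5 m and width 2,75 m."))  # {','}
print(separators_used("Paid 1,234.56 on 12.05.2023"))     # set()
```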

GitHubRulesOK

Probably the most common programmatic way of scraping a PDF is Python, where either xpdf / poppler pdftotext or PyMuPDF (which uses the same PDF rendering as SumatraPDF) extracts the text, and regex can then be used to detect fonts, words and line endings.
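A minimal sketch of that Python route, assuming PyMuPDF is installed (`pip install pymupdf`). The helper names `search_pages` and `pdf_page_texts` are made up for this example; only `fitz.open` and `page.get_text()` come from PyMuPDF itself.

```python
import re

def search_pages(page_texts, pattern):
    """Yield (page_number, matched_text) per regex hit; pages count from 1."""
    rx = re.compile(pattern)
    for number, text in enumerate(page_texts, start=1):
        for m in rx.finditer(text):
            yield number, m.group(0)

def pdf_page_texts(path):
    """Yield the plain text of each page, extracted with PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the regex part runs without it
    with fitz.open(path) as doc:
        for page in doc:
            yield page.get_text()

# The regex part works on any list of page strings:
pages = ["alpha keyword1", "nothing here", "keyword2 and keyword1"]
print(list(search_pages(pages, r"keyword1|keyword2")))
# [(1, 'keyword1'), (3, 'keyword2'), (3, 'keyword1')]
```

With a real file this would be called as `search_pages(pdf_page_texts("mydemo.pdf"), r"keyword1|keyword2")`. How well it works still depends on how cleanly the text can be extracted from that particular PDF.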

I find that too heavy in terms of overhead apps for a simpler, known search term, where Windows Explorer could find a known word using the find-in-folder box, or a single command line with an optional start page, e.g.

\SumatraPDF.exe -page 1 -search "Live Demo" mydemo.pdf

would do the job across most formats, or could be called on multiple files, not just PDFs.

By far the lightest way to do a regex search in PDFs (at most double the disk space for code) would be to simply use MuTool or pdftotext (xpdfreader.com) to export the textual content as a txt or OCR.txt file, then run findstr, grep or a regex over that file, and then call SumatraPDF as above with the resulting fixed string and the page number to start at.
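The steps above could be sketched roughly as follows. This is only one possible arrangement, under several assumptions: `pdftotext` and `SumatraPDF.exe` are on the PATH, pdftotext separates pages with form feed characters, and the `-page` / `-search` options are used exactly as in the command line shown earlier. The function names are invented for this example.

```python
import re
import subprocess

def first_hit(text, pattern):
    """Return (page, matched_string) for the first regex hit, or None.

    Pages are assumed to be separated by form feeds, as pdftotext emits.
    """
    for page, page_text in enumerate(text.split("\f"), start=1):
        m = re.search(pattern, page_text)
        if m:
            return page, m.group(0)
    return None

def sumatra_command(pdf, page, term, exe="SumatraPDF.exe"):
    """Build the viewer command line from the page and the fixed string."""
    return [exe, "-page", str(page), "-search", term, pdf]

def open_first_hit(pdf, pattern):
    """Extract text, regex-search it, then open the viewer at the hit."""
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf, "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    hit = first_hit(result.stdout, pattern)
    if hit:
        subprocess.Popen(sumatra_command(pdf, *hit))
    return hit

sample = "intro page\fprice 3,5 EUR\fend"
print(first_hit(sample, r"\d+,\d+"))           # (2, '3,5')
print(sumatra_command("mydemo.pdf", 2, "3,5"))
```

If the regex matches across a line break, the matched string will contain a newline and may not be found again by the viewer's plain text search, so short single-line matches are the safer case.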

Much of that can be scripted in a bash/cmd file with a few lines, but users have such a broad range of needs that the scripts become personal to each user's skill and desires; that is where Python, a higher-level language, or a full-blown app starts to kick in.

Custom regex searching rapidly turns into heavy NLP-type searching. As a viewer, SumatraPDF is perfect for the subtask of showing a search result.