A way to search for new lines?

cedd

In a lot of command line programs there are backslash escapes that let you specify new lines, tabs, etc., like this: \n, \r\n, \t. It can be helpful to use these when searching manuals that mention the same term many times but use it only once as a header, with a new line (or some other spacing symbol) preceding it.

So if I’m searching for subsection 11 of article 200 in a code book, instead of searching “11” I would search it as “\n11” and that will often narrow it down enough.
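To make the idea concrete, here is a minimal sketch in plain Python. The text is invented for illustration and this is not how SumatraPDF searches; it only shows that “\n11” hits the one “11” that starts a line, while “11” hits every mention.

```python
# Hypothetical extracted text from a code book; the wording is invented.
text = (
    "Article 200\n"
    "10 See subsection 11 for exceptions.\n"
    "11 Grounded conductors shall be identified.\n"
    "12 As required by subsection 11, see above.\n"
)

def find_all(haystack: str, needle: str) -> list[int]:
    """Return the start offset of every occurrence of needle in haystack."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits

plain_hits = find_all(text, "11")      # every mention of 11
heading_hits = find_all(text, "\n11")  # only an 11 that starts a line
print(len(plain_hits), len(heading_hits))  # prints: 3 1
```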

GitHubRulesOK

I would normally say there are no line ending symbols in a PDF, since the characters are placed by XY co-ordinates and can be stored in a random order. A piece of text that reads “Here be” (on the right) and “ye Gold” (lower down on the left) could be extracted as He\n re b\n e ye\n Go\n ld\n, with any of those objects in any order; the \n here is just the physical break between XY values. That is an extreme but very common example of PDF storage, and that is where the order has not been randomised.
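A rough sketch of what an extractor has to do with such fragments, in Python. The coordinates, the fragment list and the `rebuild_lines` helper are all made up for illustration; real extractors such as MuPDF use far more careful heuristics than this bucket-by-baseline approach.

```python
from collections import defaultdict

# Hypothetical fragments as (x, y, text), listed in storage order, not reading order.
# PDF y grows upwards, so a larger y is higher on the page.
fragments = [
    (40.0, 680.0, "ye "),
    (300.0, 700.0, "Here "),
    (61.0, 680.0, "Gold"),
    (335.0, 700.0, "be"),
]

def rebuild_lines(frags, y_tolerance=2.0):
    """Bucket fragments by baseline, then read each bucket left to right."""
    rows = defaultdict(list)
    for x, y, text in frags:
        # Crude bucketing: baselines within roughly y_tolerance share a row.
        rows[round(y / y_tolerance)].append((x, text))
    lines = []
    for key in sorted(rows, reverse=True):  # top of the page first
        lines.append("".join(t for _, t in sorted(rows[key])))
    # The "\n" is injected here; it never existed in the fragments.
    return "\n".join(lines)

print(rebuild_lines(fragments))
```

The only newline in the result is the one the function itself adds between baselines, which is the point being made: it is a guess from geometry, not something stored in the file.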

Perhaps a simple sample may explain that better. Ignore the table in the background, which may have been drawn before or after the textual content, and remember that “words” are just different length blocks of font entries (there are no separate words in “ink”, just letters that may include different positive or negative white space).

Here is part of that PDF content; nowhere is a literal (\n) line ending or (\t) tab between columns to be seen.

/F10 14.04 Tf (About this Code:) Tj ET Q 
q BT 0 g 42.52 723.41 Td 0 -12.00 Td /F9 12.00 Tf (Demo of how to convert and Download HTML page to PDF file with CSS, using JavaScript and jQuery.) Tj
ET Q
0.78 0.78 0.78 rg
42.52 671.06 100.35 -60.66 re B
BT /F2 16 Tf 18.4 TL 0 g 51.02 618.91 Td
(Person) Tj ET
0.78 0.78 0.78 rg 142.87 671.06 202.75 -60.66 re B
BT /F2 16 Tf 18.4 TL 0 g 151.37 618.91 Td
(Contact) Tj
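
For anyone who wants to check that claim against the excerpt, here is a small Python sketch. It is a simplification that assumes the strings contain no nested or escaped parentheses (true for this excerpt, not for PDF in general), and it is not a real content stream parser.

```python
import re

# The content stream excerpt quoted above, copied as-is.
stream = """/F10 14.04 Tf (About this Code:) Tj ET Q
q BT 0 g 42.52 723.41 Td 0 -12.00 Td /F9 12.00 Tf (Demo of how to convert and Download HTML page to PDF file with CSS, using JavaScript and jQuery.) Tj
ET Q
0.78 0.78 0.78 rg
42.52 671.06 100.35 -60.66 re B
BT /F2 16 Tf 18.4 TL 0 g 51.02 618.91 Td
(Person) Tj ET
0.78 0.78 0.78 rg 142.87 671.06 202.75 -60.66 re B
BT /F2 16 Tf 18.4 TL 0 g 151.37 618.91 Td
(Contact) Tj
"""

# Strings handed to the "show text" operator Tj.
shown = re.findall(r"\(([^()]*)\)\s*Tj", stream)

# "x y Td" moves the text position; layout lives here, not inside the strings.
moves = re.findall(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td", stream)

for s in shown:
    print(repr(s))
print(moves)

# True if any shown string carries a newline or tab character.
has_layout_chars = any("\n" in s or "\t" in s for s in shown)
print(has_layout_chars)
```

The line breaks in the stream itself only separate operators; “Person” and “Contact” end up in different table cells purely because of the two Td moves.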

This is the MuPDF extraction, where a line ending is injected wherever a text block finishes. The common complaint is that tabs have not been injected for the gaps between “words” in the “table”, but it is just text and lines without any association. So should tabs be inserted in place of bigger word gaps? Again, there are no words.

Hello
About this Code:
Demo of how to convert and Download HTML page to PDF file with CSS, using JavaScript and jQuery.
Person
Contact
Country
John
+2345678910
Germany
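
To show why “insert a tab for bigger gaps” is a guess rather than a fact, here is a hedged Python sketch. The coordinates, the thresholds and the `join_row` helper are invented for illustration; nothing here is taken from MuPDF.

```python
# Hypothetical fragments on one baseline: (x_start, x_end, text). Values are invented.
table_row = [(51.0, 100.0, "John"), (151.0, 240.0, "+2345678910"), (354.0, 420.0, "Germany")]
sentence = [(10.0, 30.0, "Here"), (33.0, 45.0, "be")]

def join_row(frags, tab_gap=30.0, space_gap=2.0):
    """Join fragments left to right: tab for a wide gap, space for a narrow one."""
    out = []
    prev_end = None
    for x0, x1, text in sorted(frags):
        if prev_end is not None:
            gap = x0 - prev_end
            if gap >= tab_gap:      # arbitrary threshold, an assumption
                out.append("\t")
            elif gap >= space_gap:  # arbitrary threshold, an assumption
                out.append(" ")
        out.append(text)
        prev_end = x1
    return "".join(out)

print(repr(join_row(table_row)))
print(repr(join_row(sentence)))
```

Change `tab_gap` or the font size and the same page gives a different answer, which is why each extractor ends up with its own compromise.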

Each PDF decompiler can produce a different result from the above, and some may replace white space with tabs or keep text blocks on one line. But PDF was designed as a one way only process, so it discards new line (\n), tab (\t) and paragraph markers. Likewise there is no font style (\b, \i), nor anything like \u for underline or strike-out; those are just rectangles in page media space!

cedd

Wow, thanks for the super detailed reasoning. So without knowing exactly how the SumatraPDF search function works, I’m assuming the MuPDF extraction is what it searches and then when it finds a match there it figures out where that place in the MuPDF extraction corresponds to in the actual PDF and jumps to it in the reader.
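
Spelling out that assumption as a Python sketch (this is my guess at the general shape, not SumatraPDF's or MuPDF's actual code; the `Char`, `layout` and `locate` names and all coordinates are hypothetical): every extracted character keeps its page position, the search runs over the joined text, and the match offset is mapped back to a position.

```python
from typing import NamedTuple

class Char(NamedTuple):
    ch: str
    page: int
    x: float
    y: float

def layout(lines, page=1, x0=42.0, y0=700.0, advance=6.0, leading=14.0):
    """Build a fake extraction: one Char per character plus an injected newline per line."""
    chars = []
    for row, line in enumerate(lines):
        y = y0 - row * leading
        for col, ch in enumerate(line):
            chars.append(Char(ch, page, x0 + col * advance, y))
        # The injected line ending gets the position just past the last glyph.
        chars.append(Char("\n", page, x0 + len(line) * advance, y))
    return chars

def locate(chars, query):
    """Search the joined text and map the match offset back to a page position."""
    text = "".join(c.ch for c in chars)
    offset = text.find(query)
    if offset == -1:
        return None
    hit = chars[offset]
    return hit.page, hit.x, hit.y

chars = layout(["About this Code:", "Person", "Contact"])
print(locate(chars, "Contact"))   # (1, 42.0, 672.0)
print(locate(chars, "\nPerson"))  # matches via the injected newline
```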

If that’s the case, couldn’t the newlines generated by MuPDF be used to match with “\n”? I recognize they’re not technically new lines from the view of the PDF source, but from the example you’ve given it appears they still roughly function as newlines, separating line-wise related content into meaningful chunks. Obviously the usefulness of these newlines would depend somewhat on how each PDF is sourced and extracted but if a similar extraction was made for my code book example, these newlines would likely be used to great advantage.

And if the assumptions I’ve made up to this point have been true, and the sample you’ve provided is a more or less common extraction, it seems a “\s” symbol (for spaces) would be possible as well if it was programmed to match the spaces on the linewise chunks of the MuPDF extraction. And after checking quickly it appears SumatraPDF already does something like this but normalizes one or more spaces in the search bar to be just one space. Which honestly seems adequate since the cases where you’d want to match more than one space are likely quite rare.
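
The normalization I think I'm seeing could look something like this in Python. It is only a guess at the behaviour from the outside, and `normalize_spaces` and `contains_normalized` are names I made up.

```python
import re

def normalize_spaces(s: str) -> str:
    """Collapse each run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", s)

def contains_normalized(text: str, query: str) -> bool:
    """Match after normalizing both sides, so extra spaces in the query do not matter."""
    return normalize_spaces(query) in normalize_spaces(text)

extracted = "Person\nContact\nCountry"
print(contains_normalized(extracted, "Person   Contact"))  # True
print(contains_normalized(extracted, "Person\tContact"))   # True
```

One side effect, if the extracted text is normalized this way too, is that a line break and a space become indistinguishable, so a “\n” in the query would have nothing left to match against.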

GitHubRulesOK

SumatraPDF relies heavily on resisting unnecessary change to MuPDF, which in turn relies heavily on other libraries to handle text, which in turn rely on font libraries being used correctly in a PDF. Most commonly they are not used correctly, and that's the cause of many issues.
There are many common files with very bad characters or spacing, and the current mix is the best compromise (outside of using complex heuristics deliberately tuned to different cases, which Acrobat slows down to do). For searching problems see

cedd

I understand. Regardless thanks for explaining things clearly.

thegrapevine

Thank you very much for the explanation. Sumatra PDF is the best PDF reader forever due to its portability and light weight. And that’s why I love Sumatra PDF.

GitHubRulesOK

I know it's not much help, but bearing the above in mind, when searching for section 11 it helps to add the following “word”. Here I have the advantage of bookmarks, but I could have started from the contents table on page 2.

cedd

I do use that method sometimes as well and it is useful. The code book example I provided is kind of an extreme case honestly. The advantage of being able to search with \n is it doesn’t require being as precise in your search. A regex search requires even less precision but obviously the pdf format makes an implementation of that wonky at best as you’ve demonstrated with your examples.

To make my example more concrete, I’m referencing the NEC (National Electrical Code). Each article and section is numbered with a dot as a separator. So 200.118 represents article 200, section 118. Each of these indexes may be referenced in the 1000 page handbook anywhere between 5 and 50 times.

Does this make me use a different pdf reader to read this? No, lol. SumatraPDF is still an all around amazing pdf reading experience. It’s a feature that would be nice but the alternatives aren’t bad just a little slower.

I used to read a lot of manpages on my linux box a couple years ago. Now those aren’t pdfs they’re just text files but with a little practice with these “search hacks” you could skip through those documents like lightning reading them in vim. Lots of fun nerding out on the endless things you can do with just a shell and some text files lol

ilyaz

Well, as an objection this seems to be too literal and too contrived. Nowhere in PDF is the space between the words encoded either — but still, SumatraPDF has no problem with search strings with/without spaces.

As pasting from SumatraPDF shows, there is already some code to “decide” whether a glyph starts/ends a text line. So, when/if RegularExpression search is implemented, implementing ^ and $ should not be too technically complicated. And IMO the original poster wanted exactly the support of ^ and $ .
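
As a sketch of what ^ and $ would buy, here is Python's `re` module in multiline mode run over an invented extraction (the sample text is made up, and SumatraPDF has no such search today as far as this thread shows). It assumes the extractor has already decided where lines start and end.

```python
import re

# Hypothetical extracted text; only the section number comes from the thread.
text = (
    "See 200.118 for details.\n"
    "200.118 Section heading\n"
    "As required by 200.118\n"
)

anywhere = re.findall(r"200\.118", text)
at_line_start = re.findall(r"^200\.118", text, flags=re.MULTILINE)
at_line_end = re.findall(r"200\.118$", text, flags=re.MULTILINE)

print(len(anywhere), len(at_line_start), len(at_line_end))  # prints: 3 1 1
```

With `re.MULTILINE`, ^ and $ anchor at each injected line break instead of only at the ends of the whole text, so the quality of the result depends entirely on how good those injected breaks are.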

GitHubRulesOK

It has tons of problems since it is built on 3rd party extractors, hence the many open issues that need fixing first. However, the text libraries changed yet again today, so let's see what may be better or worse.

“a glyph starts/ends a text line.”
Not exactly. A line starts with an X location, then none, one or more letters, then another X location on the same Y level, with more letters or not. PDF was not designed for editing; it was designed for lasers. Think of those letters as stencil cutters for inking and you have covered all types of text outlines: SVG, vector TTF, CID, custom fonts, images of emojis or drop caps, or just plain Chinese words and triple character composite letters like Myanmar text, etc. The text etched on a page could be in an appendix and then floating on a numbered screen page but not in a printout, or vice versa. PDF = WYSInotWYG