How to optimize scanned ebooks for fast display in SumatraPDF

GL1zdA

Hi,
I have quite a lot of ebooks that are scanned books with OCR text (books from the Open Library, available via archive.org). The problem is that they often render slowly in SumatraPDF, and searching is extremely slow. I’ve tried some online services to shrink the PDFs, but didn’t like the results, so I’ve decided to tinker with GhostScript and optimize the ebooks myself. I started with a command like:

gswin64 ^
  -o downsampled96.pdf ^
  -sDEVICE=pdfwrite ^
  -dDownsampleColorImages=true ^
  -dDownsampleGrayImages=true ^
  -dDownsampleMonoImages=true ^
  -dColorImageResolution=96 ^
  -dGrayImageResolution=96 ^
  -dMonoImageResolution=96 ^
  -dColorImageDownsampleThreshold=1.0 ^
  -dGrayImageDownsampleThreshold=1.0 ^
  -dMonoImageDownsampleThreshold=1.0 ^
   %1

I then tested various resolutions with a 1100-page book.

With 72 PPI the books are harder to read, but display and search in SumatraPDF are blazing fast.

With 96 PPI they look better and are quite readable. The quality could be better, but rendering and searching are still fast.

With 150 PPI they look quite good, rendering is still fast enough, but searching starts to feel very slow.
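For anyone who wants to repeat the comparison, the three runs could be scripted roughly like this. It is only a sketch: the flags are the ones from my command above, but the helper names (`build_gs_command`, `output_name`, `run_all`) are made up for illustration, and it assumes `gswin64` is on the PATH.

```python
import subprocess
import time


def build_gs_command(src, dst, ppi, exe="gswin64"):
    """Return the Ghostscript argument list for one downsampling run."""
    args = [exe, "-o", dst, "-sDEVICE=pdfwrite"]
    # Same three settings for colour, greyscale and monochrome images.
    for kind in ("Color", "Gray", "Mono"):
        args += [
            f"-dDownsample{kind}Images=true",
            f"-d{kind}ImageResolution={ppi}",
            f"-d{kind}ImageDownsampleThreshold=1.0",
        ]
    args.append(src)
    return args


def output_name(src, ppi):
    """Derive an output file name such as book_downsampled96.pdf."""
    stem = src[:-4] if src.lower().endswith(".pdf") else src
    return f"{stem}_downsampled{ppi}.pdf"


def run_all(src, resolutions=(72, 96, 150)):
    """Run Ghostscript once per resolution and report the conversion time."""
    for ppi in resolutions:
        dst = output_name(src, ppi)
        start = time.perf_counter()
        subprocess.run(build_gs_command(src, dst, ppi), check=True)
        print(f"{ppi} PPI -> {dst} in {time.perf_counter() - start:.1f} s")
```

Calling `run_all("book.pdf")` would produce one output file per resolution. The time it prints is only the conversion time; rendering and search speed still have to be judged by opening each file in SumatraPDF.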

Do you have any suggestions for what I can do to improve the quality of the output without sacrificing search speed? I’ve seen the GhostScript options, but I’m not sure which ones I should try.

GitHubRulesOK

Compressed pages are slower than uncompressed ones, as it takes time to decompress them for reading. The number of colours also adds to the compressed size. Similarly for fonts: OCR is often the worst slowdown, as each character can need many bytes to describe that one letter, rather than the text being stored as a full line string. The web archive is renowned for excessive compression that is very slow to expand.
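To give a feel for the per-letter overhead, here is a rough sketch that compares two simplified PDF text fragments for the same line: one positions every glyph separately, the other shows the whole line at once. The `Tm` and `Tj` operators are standard PDF, but the function names, coordinates and glyph advance are invented for illustration, and this is not what any particular OCR engine actually writes.

```python
def per_character_ops(text, x=72.0, y=700.0, advance=6.0):
    """One text-matrix + show-text pair per glyph (simplified)."""
    ops = []
    for i, ch in enumerate(text):
        ops.append(f"1 0 0 1 {x + i * advance:.2f} {y:.2f} Tm ({ch}) Tj")
    return "\n".join(ops)


def per_line_ops(text, x=72.0, y=700.0):
    """A single text-matrix + show-text pair for the whole line."""
    return f"1 0 0 1 {x:.2f} {y:.2f} Tm ({text}) Tj"


# Sample text without parentheses or backslashes, which would need
# escaping inside a PDF string.
line = "The quick brown fox jumps over the lazy dog"

print("per character:", len(per_character_ops(line)), "bytes")
print("per line:     ", len(per_line_ops(line)), "bytes")
```

The per-character form comes out many times larger for the same visible text, and a reader has to stitch those fragments back together when searching.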

When scanning documents, thresholding can remove the background, so that lossless greyscale PNG or TIFF pages are very readable at low resolution for OCR, whereas JPEG will be bigger and less clear due to artefacts.
Good word training for each language reduces the description of strings, so they are stored more like a word or a line at a time rather than as single letters.
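As a toy illustration of thresholding, the sketch below maps every greyscale value to pure black or pure white. The function name, the cutoff of 160 and the tiny sample "page" are all made up; a real scan would go through an imaging tool and would need a cutoff chosen per book or even per page.

```python
def threshold(pixels, cutoff=160):
    """Map each greyscale value (0-255) to pure black (0) or pure white (255)."""
    return [[0 if value < cutoff else 255 for value in row] for row in pixels]


page = [
    [235, 228, 240, 231],  # yellowish paper background
    [229, 40, 55, 236],    # dark ink strokes
    [238, 35, 180, 233],   # 180 = faint show-through from the reverse side
]

clean = threshold(page)
for row in clean:
    print(row)
```

After thresholding, the paper tone and the show-through both become plain white, which is what lets a lossless format stay small and sharp.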

So the biggest influence on the whole document comes before OCR. You can then try to improve on that downstream, but you need to be cautious in your methodology, as it may need a page-by-page solution. Bursting and merging is often counterproductive unless you re-OCR.