Replace text and background colours in PDF

kfeuerherm · October 22, 2022, 4:11pm

In light of recent vision degradation, I’m looking to set up a script or the like which can read in an OCRed PDF and output another but with the white-ish background replaced by black and the black type replaced by, say, green?

I’m not a great expert, so I’m hoping someone can help me do a decent job of it.

patdavid · October 22, 2022, 8:13pm

GIMP may not be the best tool for this?

The best option off the top of my head would be to see if your PDF viewing software has an option for inverting the PDF. That way you can load any PDF and invert it directly in your reader.

Another option would be to use something like ImageMagick. This is especially helpful if the PDF has multiple pages and you want to invert them all at once quickly and easily (I think it’s the -negate option?)

kfeuerherm · October 22, 2022, 8:22pm

Thanks. At the moment I’m using Acrobat Reader. I am able to invert printed-to-PDF files just fine (I set them to green on black), but no joy with pages as images. I was thinking about opening as layers in GIMP and then maybe cutting anything below a certain darkness level and then changing the rest, or something along those lines…

I’ve looked at Imagemagick in the past for some reason, maybe it’s time for me to look again.

Thanks!

Claes · October 22, 2022, 8:24pm

Good evening, @kfeuerherm,

Just a question: would not yellow on black give you a bit better contrast?

Kind regards,
Claes in Lund, Sweden

patdavid · October 22, 2022, 8:30pm

Indeed, as @Claes mentions - if this is for deficient vision consider which colors you’re going to swap to in order to increase contrast. I like to use the WebAIM tool for designing with vision in mind:

https://webaim.org/resources/contrastchecker/

kfeuerherm · October 22, 2022, 8:31pm

Yes, that is an option. I’ve found the green more pleasing, though. Could be because that’s what monitors were like when I started out… But either way, the problem is the same…

kfeuerherm · October 22, 2022, 8:33pm

I was not aware of this tool. I will check it out!

The issue is that I have developed a ‘tethered floater’, first in one eye, six years ago, now in both. And they’re both smack bang in the centre.

So, ordinarily my vision is fine; it’s just that the floaters and the ink sort of melt together, and the brightness of white light sort of creates a fog over everything (I do have brief moments of total clarity…).

Terry · October 22, 2022, 9:57pm

I am not sure if this is what you want to achieve with GIMP.

I duplicated the layer, I then inverted it. I created a new layer filled with green and set the blend to darken only.

Good luck. Let me know if this helps. Of course images will be a big problem, but layer masks could be used to retain parts of the original page, including images.
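For anyone who wants to script this, the layer recipe above comes down to simple per-pixel arithmetic. Here is a minimal Python sketch of that arithmetic only (it is not GIMP code). It assumes 8-bit RGB pixels and treats "darken only" as taking the smaller value of each channel; the function names are made up for illustration.

```python
# Per-pixel arithmetic behind "invert, then a green layer set to darken only".
# Assumes 8-bit RGB tuples; all names here are illustrative.

GREEN = (0, 255, 0)  # green at 100%, red and blue at 0%


def invert(pixel):
    """Invert each channel, so black becomes white and vice versa."""
    return tuple(255 - c for c in pixel)


def darken_only(top, bottom):
    """Darken-only blend: keep the smaller value of each channel."""
    return tuple(min(t, b) for t, b in zip(top, bottom))


def recolor(pixel, tint=GREEN):
    """Turn a black-on-white input pixel into a tint-on-black pixel."""
    return darken_only(tint, invert(pixel))


for name, px in [("paper", (255, 255, 255)), ("ink", (0, 0, 0))]:
    print(name, px, "->", recolor(px))
```

Passing a different tint, for example recolor(px, tint=(255, 255, 0)) for yellow, changes the text colour while the background stays black.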

kfeuerherm · October 22, 2022, 10:06pm

YES! That’s EXACTLY what I wanted to do … on the whole file. I’m not familiar with these options, so I will have to try that. But assuming that we can script it to do every page, that would be what I had in mind!

(I’m not concerned about the images. I can look at the original for that; they’re a problem with print-to-PDFs too, when I’ve changed the colour scheme…)

Terry · October 22, 2022, 10:55pm

Maybe someone else can help write a script. I don’t use scripts much so I lack skills in that department. I am glad the method works. It is very quick to do, so a script should be possible. Of course the green color could be made brighter if that helps. I just used the colour green set to 100% and the blue and red set to 0%.

Good luck

rich2005 · October 23, 2022, 10:17am

As previous posts say, this is possible using Gimp. One problem is going to be the number of pages to process.

One way is to split the PDF into pages and process each page in a batch file.

However, Gimp can do them all at once using an old (ancient) script, Multiple-Layer-Actions, applied twice:

Invert to get white text on black
Colorize to get green text (only works because of the black background)

Works like this 1 minute demo

Snags:
Depending on the number of pages and the resolution (PPI) chosen, it is going to be a big file in memory.
If exported, due to low internal compression, Gimp makes big PDFs.

Where to get the script (it is somewhere in the defunct gimp-plugin-registry):
Get it here on my storage: https://filedn.com/lkb9dw6mEfXSsOu9uKLaM14/Multiple-Layer-Actions.zip
Unzip it and put Multiple-Layer-Actions.scm in your Gimp user profile scripts folder.

kfeuerherm · October 23, 2022, 3:23pm

Thanks, I’ll try that as soon as I have a chance (right now have to prep for tomorrow’s classes… reading green print-to-PDFs)

I have script-fu installed, I believe, so should be good.

RE: snags; can always split the file first if need be. PPI is determined by who gives me the file. I find that’s about as low as I’d want them to go, so I suspect for me that’s an invariable.

Much appreciated, folks!

rich2005 · October 23, 2022, 4:58pm

…PPI is determined by who gives me the file…

That might well be the case, but by default Gimp will initially open the import dialogue at 100 ppi. It is up to the user to set a different value.
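As a rough worked example of why the import resolution matters: the pixel size of a rasterised page is the page size in inches times the ppi. The sketch below assumes a US Letter page (8.5 x 11 in) purely for illustration, and the function name is made up.

```python
def page_pixels(width_in, height_in, ppi):
    """Pixel dimensions of a page rasterised at the given resolution."""
    return round(width_in * ppi), round(height_in * ppi)


for ppi in (100, 150, 300):
    w, h = page_pixels(8.5, 11, ppi)
    print(f"{ppi} ppi: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Going from 100 to 300 ppi triples each dimension, so every page holds nine times as many pixels, which is where the memory and file size go.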

Edit: You are going to print these? That is a lot of black ink / toner.

kfeuerherm · October 23, 2022, 5:05pm

Ok. I’ve just tried the method. I am finding that the images within GIMP are larger (8.5 x 11) than the actual page, somehow, so I’m guessing that it’s about matching the resolution there to whatever was used before?

The file size has expanded by a factor of 4 as you suggested, but if that’s all then I can live with that.

I do think ideally they should scan at a larger PPI, though…

kfeuerherm · October 23, 2022, 5:09pm

And the OCR info has been lost. But I can live with that too, I can have one read aloud while I look at the other.

Here’s what I mean by page size mismatch:

Terry · October 24, 2022, 12:36am

I am not sure if you are flattening the image to reduce all the layers and masks. That would reduce the file size compared to retaining the layers and masks. Also would a brighter green than I used be easier to read?

kfeuerherm · October 24, 2022, 1:11am

I didn’t, actually, so I guess I could try that.

I just picked an arbitrary green for the moment for the purpose of testing. So no worries in that regard.

Tomorrow’s a full day so I’ll try another go at it on Tuesday.

Thanks so much for your help!

snibgo · October 24, 2022, 3:56am

… I’m looking to set up a script or the like which can read in an OCRed PDF and output another …

I’m not sure what an “OCRed PDF” is.

There are two cases: (1) scanned documents, where each PDF page is a single raster image, the scan of a single page, and (2) text is recorded as vector data (infinitely scalable).

(1) Scanned documents.

I would use pdfimages to extract all the raster images from the PDF. Then I would use ImageMagick to change the colours to whatever I wanted. For my old eyes, I like white text on a black background, which is often a simple “-negate” operation. But you can have any colours you like.

(I would not use ImageMagick to read the source PDF files. This is because IM will rasterize each PDF page, but when each page is already a raster image, this causes re-sampling of the image, which lowers the readability.)
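A possible way to script the two steps for case (1), sketched in Python. It only builds and prints the command lines rather than running them. It assumes a pdfimages build that supports the -png option and an ImageMagick 7 "magick" binary on the PATH; the file names and helper names are hypothetical.

```python
import shlex


def extract_cmd(pdf_path, prefix):
    """pdfimages command that writes the embedded images as PNG files."""
    return ["pdfimages", "-png", pdf_path, prefix]


def negate_cmd(src, dst):
    """ImageMagick command that inverts the colours of one image."""
    return ["magick", src, "-negate", dst]


# Print the commands for inspection; nothing is executed here.
print(shlex.join(extract_cmd("in.pdf", "page")))
print(shlex.join(negate_cmd("page-000.png", "neg-000.png")))
```

To actually run a command, each list could be passed to subprocess.run(cmd, check=True). Because the images are extracted as stored, no resampling takes place.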

(2) Text is vector data.

From the OP comment …

And the OCR info has been lost. But I can live with that too, I can have one read aloud while I look at the other.

… I think this is the case.

“Ordinary” PDF documents, such as those in scientific journals, typically do not have rasterised text. A PDF viewer can change colours. For example: Adobe Acrobat Reader, Edit, Preferences, Accessibility, tick “Use High-Contrast colors” for white on black or a few others, or click on “Custom Color” for other combinations. Sadly, some PDF documents use gray text instead of black. Adobe Acrobat Reader can’t make this text actually black or white. Annoying.

Ideally, there would be a FOSS PDF editor that could do this simple change of colours. I am not aware of any such tool.

ImageMagick or Gimp can be used to rasterize each page, at some specified dpi. (IM does this via Ghostscript.) Then we can do whatever changes we want. This loses the “vector” nature of text, so it is no longer searchable. For example, using IM:

magick -density 300 -background White in.pdf[0-9] -alpha Background -alpha off -negate out-0-9.pdf

This converts just the first ten pages.

We can use pdfunite, if we want, to join the first ten pages to the next ten pages, and so on.
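The ten-pages-at-a-time approach could be scripted along these lines. This is a sketch in Python that only builds the command lines; the page count has to be known in advance (25 below is a made-up figure), and the helper names and output file names are hypothetical.

```python
def page_batches(page_count, batch_size=10):
    """Zero-based, inclusive (first, last) page ranges covering the document."""
    return [(start, min(start + batch_size, page_count) - 1)
            for start in range(0, page_count, batch_size)]


def negate_batch_cmd(pdf_path, first, last, density=300):
    """Same magick command as above, for one range of pages."""
    return ["magick", "-density", str(density), "-background", "White",
            f"{pdf_path}[{first}-{last}]",
            "-alpha", "Background", "-alpha", "off", "-negate",
            f"out-{first}-{last}.pdf"]


def unite_cmd(parts, out_path):
    """pdfunite command that joins the partial PDFs in the given order."""
    return ["pdfunite", *parts, out_path]


batches = page_batches(25)  # hypothetical 25-page document
parts = [negate_batch_cmd("in.pdf", a, b)[-1] for a, b in batches]
print(batches)
print(unite_cmd(parts, "out-all.pdf"))
```

Working in batches keeps only a few rasterised pages in memory at once, which matters at 300 dpi.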

kfeuerherm · October 26, 2022, 5:11pm

Just managed to do this again after a few attempts (it hung up once or twice and did odd things).

Doing a flatten did in fact reduce the file size to just about the original, and a manual crop got rid of the overhanging empty space (I guess this must have been scanned to US letter even though the book is trade). Thanks. Not too concerned about the exact shade at the moment, I’m just doing proof-of-concept before I get carried away.

@snibgo Yes, that’s what I meant by OCRed PDF. Option #1 as you have it; this is what I am working with. What you call “ordinary” PDF is what I have been using in another class I teach, and there of course it’s a simple matter of changing the colour in the reader and it remains nice and crisp, as you say; no problem in this case. For the future, my best option is to try for these resources. Unfortunately, I’m in “production mode” and don’t have the luxury of replacing everything. I may be able to do it for next term, but it will involve work and the book orders are supposed to be placed now… sigh.

So with what I have right now, I’m not excited by the quality of text I am achieving, but at least it gets me where I need to go after a fashion.

I’ve been thinking of cutting the spine on a spare copy and putting it through a sheet feeder to scan it myself. Then at least I know what I’m starting with…

@rich2005 No, not going to print these! Just for on-screen viewing, thanks

snibgo · October 26, 2022, 6:36pm

For option #1, where each PDF page is one raster image (a scanned page), I have experimented with Gimp, with just one page. It has the same problem as ImageMagick: it doesn’t realise the page contains only one raster image, so it will re-rasterise the page, which resamples the embedded raster image.

If you know the exact density of the scan that made the PDF, this isn’t a big problem. But extracting the images with pdfimages avoids the problem entirely.

I suppose pdfimages is widely available. For Windows, it is available in the Cygwin distribution.

In ImageMagick, if we start with black letters on a white background, and want to change this to red on blue (yuck!):

magick in.jpg +level-colors red,blue out.jpg

Of course, if the PDF contained diagrams in colour, these might become unreadable.
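As I understand it, for a grey pixel +level-colors amounts to a straight-line mapping: black goes to the first colour, white to the second, and greys land in between. Here is a minimal Python sketch of that mapping. It is an illustration of the idea rather than ImageMagick's own code, and it assumes 8-bit values and a made-up function name.

```python
def level_colors(value, black_to, white_to):
    """Map a grey level 0..255 linearly between two RGB colours."""
    t = value / 255
    return tuple(round(b + (w - b) * t) for b, w in zip(black_to, white_to))


RED, BLUE = (255, 0, 0), (0, 0, 255)

print(level_colors(0, RED, BLUE))    # black ink
print(level_colors(255, RED, BLUE))  # white paper
```

Swapping in (0, 255, 0) and (0, 0, 0) as the two colours gives green type on a black background.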