How to speed up import > crop > export in GIMP

intrelis · November 24, 2023, 2:01pm

Or any other software, if there are better options.

I have several hundred single-page PDFs that I need to crop in order to compile later. The cropping needs to be manual and cannot be done automatically for all pages, since some scans have a single page, others a double page, and the position of the page on the image differs. (Unless there is software that can detect that? Is there?)

Right now I’m just opening a PDF as a layer in GIMP, clicking Import on the pop-up window with the DPI and resolution options, positioning the layer in the center of the already resized image canvas, then Export As JPG, confirm file name, confirm JPG options, and repeat.

I wanted to ask if there is a way to do this faster. I can’t even import more than one PDF at a time without having to confirm the import resolution for every single one. Or maybe I could export multiple layers as separate files at once? Maybe it is stupid to be doing this with GIMP. I have no idea. Any advice is greatly appreciated.

Edit: OS is Windows 11, but I do have Linux Mint on another system if necessary.

Claes · November 24, 2023, 2:34pm

Hi @intrelis, and welcome!

Operating system?

I am not certain, but perhaps pdfarranger would suit your needs?
— hmmmm perhaps not crop

Have fun!
Claes in Lund, Sweden

lphilpot · November 24, 2023, 3:24pm

There’s also PDFsam but again I don’t know if cropping is there, how it works, etc. Maybe worth a glance though.

paulmatth · November 24, 2023, 3:42pm

Hello, on Linux there’s a program called pdfimages that extracts images from pdfs.
Simple to use:

$ pdfimages -j your.pdf img

This will extract all the images in your.pdf to jpgs named img-000.jpg, img-001.jpg, etc. Tiff and png also available. Perhaps that will speed up your workflow.
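For several hundred PDFs, the same call can be looped over a folder. Here is a minimal Python sketch, assuming pdfimages (part of poppler-utils) is installed and on the PATH. The function names and folder layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build the pdfimages call for one PDF.

    The image root is the PDF's own stem, so page12.pdf yields
    page12-000.jpg, page12-001.jpg, ... inside out_dir.
    """
    return ["pdfimages", "-j", str(pdf), str(out_dir / pdf.stem)]


def extract_all(src_dir: Path, out_dir: Path) -> None:
    """Run pdfimages on every *.pdf in src_dir (hypothetical folder layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src_dir.glob("*.pdf")):
        # check=True stops at the first PDF that pdfimages cannot read
        subprocess.run(pdfimages_command(pdf, out_dir), check=True)
```

Calling extract_all(Path("scans"), Path("extracted")) would process a folder named scans. Note that -j only writes JPEG for images already stored as JPEG inside the PDF; other images come out in pdfimages' default formats unless you ask for PNG or TIFF instead.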

snibgo · November 24, 2023, 3:48pm

If pdfimages does what you need, I suggest using that.

Another possibility is ImageMagick. This will rasterize the page(s), making one image per page. Then perhaps “-trim” will remove white space that surrounds whatever text or images you care about. Then it will save as JPEG, or whatever you want.

For example, assuming you want just the first page (number zero):

magick in.pdf[0] -trim out.jpg

If the first page is blank, you could test for that, and process the second page instead.

For “several hundred PDFs”, this could help, even if it only helps for 90% of the PDFs.
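A rough sketch of running that over a whole folder and collecting the files where it fails, so the remaining ones can be done by hand. It assumes ImageMagick 7 (the magick command) with Ghostscript available for reading PDFs. The 300 DPI default, the folder arguments and the function names are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def trim_command(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Build: magick -density DPI in.pdf[0] -trim out_dir/in.jpg"""
    return [
        "magick",
        "-density", str(dpi),  # set before the input so it applies to rasterizing
        f"{pdf}[0]",           # first page only (pages count from zero)
        "-trim",
        str(out_dir / f"{pdf.stem}.jpg"),
    ]


def convert_all(src_dir: Path, out_dir: Path, dpi: int = 300) -> list[Path]:
    """Convert every *.pdf in src_dir; return the PDFs where magick failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        result = subprocess.run(trim_command(pdf, out_dir, dpi))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

On scans the border is rarely a perfectly uniform colour, so -trim alone may remove little or nothing; adding a -fuzz percentage before -trim is the usual thing to experiment with.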

paulmatth · November 24, 2023, 3:57pm

Using convert instead of magick, the out.jpg contains two images (that’s exactly what that page holds), while pdfimages turns out two separate photos, so better for Kard.

Btw, welcome Kard!

Ofnuts · November 24, 2023, 5:59pm

I have a script that:

Exports and closes the current image
Opens the next image in sequence (alphabetical or numerical) in the same directory
Can be assigned to a hot key

This saves you two trips to the file selector, including one where you have to hunt for the next file.

So, I have done exactly this kind of thing on my expense receipts, and the whole process becomes:

File > Open the first image
Crop
Hit the File-next key
Crop
Hit the File-next key
Crop
Hit the File-next key
Crop
Hit the File-next key
Crop
… etc.

But this requires the input and output files to be identical (you could hack it though). However, since Gimp asks plenty of questions when opening a PDF, it could be better in your case to mass-convert your PDFs to JPG before starting (using ImageMagick or something else).

See ofn-file-next on Ofnuts’ Gimp Tools downloads.

Thomas_Do · November 25, 2023, 11:38am

I am not sure if I really understand your needs and workflow (an example would be nice). Stirling PDF is a new tool that does all kinds of cool things to PDFs. However, it is under heavy development and not all modules are stable yet.

snibgo · November 25, 2023, 1:06pm

Yes.

Without an example, we can only give very general advice. With an example, we may be able to give specific solutions.

If you can’t link to samples because of confidentiality, perhaps you can mockup an example with Gimp etc.

rvietor · November 25, 2023, 3:05pm

(Text) PDFs can be text + layout commands (the “page description” part of PDF) or images, or a combination.
Both can be easily extracted from a PDF file, but you tend to lose the formatting part.

Do you just need the text, or is the formatting/layout on the page important?

intrelis · November 25, 2023, 5:21pm

Hey guys, thank you all. Sorry, I have no idea how this forum works and how to reply to specific people. Can’t even seem to be able to edit my post anymore.

@Ofnuts Thank you, that seems the quickest way excluding a full automation of the process, will give it a try.

I am not sure pdfimages is what I’m looking for, since I don’t need to extract JPG from PDF, I don’t really care what the format of the pages is, I just need them all cropped and to be the same size/resolution. Say you have a 12 000 x 18 000 image file, you crop a part of it (in the proper aspect ratio), which will differ from image to image, then export to 2970 x 4200.
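To make that step concrete, here is a small Python sketch that only illustrates the arithmetic and the equivalent ImageMagick call. It assumes the crop position and width are still picked by hand for each scan. The function names are hypothetical, and the trailing "!" on the resize geometry forces the exact output size.

```python
from pathlib import Path

TARGET_W, TARGET_H = 2970, 4200  # output size in pixels, as stated above


def crop_height(width: int) -> int:
    """Height that keeps a crop of the given width in the target aspect ratio."""
    return round(width * TARGET_H / TARGET_W)


def crop_resize_command(src: Path, dst: Path, x: int, y: int, width: int) -> list[str]:
    """Build: magick src -crop WxH+X+Y +repage -resize 2970x4200! dst"""
    height = crop_height(width)
    return [
        "magick", str(src),
        "-crop", f"{width}x{height}+{x}+{y}",  # region chosen by hand per scan
        "+repage",                             # drop the leftover page offset
        "-resize", f"{TARGET_W}x{TARGET_H}!",  # "!" forces the exact output size
        str(dst),
    ]
```

For example, a crop 9900 pixels wide needs to be 14000 pixels high, and every output then ends up at 2970 x 4200 regardless of how large the cropped region was.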

I don’t need to extract the text, the languages are rather old and not supported. Training the OCR is a project for another day. Just images is fine.

In this folder you can find two examples of the raw pdf files and one example of how I want it to look after cropping: https://1drv.ms/f/s!AhAK45tMbmCHg557h-CC-pLgN8kCKA?e=kVAyrU

No need to straighten or clean up, as that will be done automatically when compiling later with Abbyy FineReader.

If the question is why not do the cropping with Abbyy, good question. The program doesn’t seem to be able to handle this task properly. Despite importing 1200DPI scans and exporting as 300DPI, after selecting a size for all pages in the document, it automatically adds white space around the cropped pages instead of scaling them up (which it should do considering there is more than enough resolution for it). So I need all the pages cropped to identical size before importing. There is always the chance I’m doing something wrong, but I even spoke with their support team and they weren’t able to provide a solution.

Ofnuts · November 25, 2023, 6:12pm

intrelis wrote: “So I need all the pages cropped to identical size before importing.”

If so:

The Crop tool can be told to work only on a fixed size
You can save this as a Preset
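Since the cropped regions differ in size from scan to scan, fixing the aspect ratio rather than a pixel size may fit this case better. That is an assumption on top of the advice above, not part of it. Reducing the 2970 x 4200 target to its smallest whole-number ratio gives the value to enter, as this short Python sketch shows:

```python
from math import gcd


def reduced_ratio(width: int, height: int) -> tuple[int, int]:
    """Reduce width:height to its smallest whole-number form."""
    d = gcd(width, height)
    return width // d, height // d


print(reduced_ratio(2970, 4200))  # (99, 140)
```

So 99:140 would be the ratio to type into the Crop tool's Fixed field with "Aspect ratio" selected.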