Or any other software, if there are better options.
I have several hundred single-page PDFs that I need to crop so I can compile them later. The cropping needs to be manual and cannot be done automatically for all pages, since some scans have a single page, others a double page, and the position of the page on the image differs. (Unless there is software that can detect that? Is there?)
Right now I’m just opening a PDF as a layer in GIMP, clicking Import on the pop-up window with the DPI and resolution options, positioning the layer in the center of the already resized image canvas, then Export As JPG, confirming the file name, confirming the JPG options, and repeating.
I wanted to ask if there is a way to do this faster. I can’t even import more than one PDF at a time without having to confirm the import resolution for every single one. Or maybe I could export multiple layers as separate files at once? Maybe it is stupid to be doing this with GIMP. I have no idea. Any advice is greatly appreciated.
Edit: OS is Windows 11, but I do have Linux Mint on another system if necessary.
Hello, on Linux there’s a program called pdfimages that extracts images from PDFs.
Simple to use:
$ pdfimages -j your.pdf img
This will extract all the images in your.pdf to JPGs named img-000.jpg, img-001.jpg, etc. TIFF and PNG output are also available. Perhaps that will speed up your workflow.
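For several hundred files, the same call could be looped from a small Python script, which would also run on Windows provided pdfimages (part of Poppler) is installed and on the PATH. This is only a sketch: the function names and folder arguments are made up for illustration, and each PDF's own name is used as the image prefix so the outputs do not overwrite each other.

```python
import subprocess
import sys
from pathlib import Path


def pdfimages_command(pdf_path, out_dir):
    """Build the pdfimages call for one PDF, using the PDF's name as the image prefix."""
    pdf_path = Path(pdf_path)
    prefix = Path(out_dir) / pdf_path.stem
    return ["pdfimages", "-j", str(pdf_path), str(prefix)]


def extract_all(in_dir, out_dir):
    """Run pdfimages once per PDF found directly inside in_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        subprocess.run(pdfimages_command(pdf, out_dir), check=True)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        extract_all(sys.argv[1], sys.argv[2])
    else:
        print("usage: python extract_images.py INPUT_FOLDER OUTPUT_FOLDER")
```

With this, a file called page_001.pdf would give page_001-000.jpg, page_001-001.jpg and so on in the output folder.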
If pdfimages does what you need, I suggest using that.
Another possibility is ImageMagick. This will rasterize the page(s), making one image per page. Then perhaps “-trim” will remove white space that surrounds whatever text or images you care about. Then it will save as JPEG, or whatever you want.
For example, assuming you want just the first page (number zero):
magick in.pdf[0] -trim out.jpg
If the first page is blank, you could test for that, and process the second page instead.
For “several hundred PDFs”, this could help, even if it only helps for 90% of the PDFs.
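A batch version of that command might look like the Python sketch below. Some assumptions on my part: the function names and folder arguments are hypothetical, ImageMagick 7 (the magick command) and Ghostscript are installed, and I added -density 300 because ImageMagick otherwise rasterizes PDFs at 72 DPI, which is probably too coarse for scans. Adjust or drop that value as needed.

```python
import subprocess
import sys
from pathlib import Path


def magick_trim_command(pdf_path, out_dir, density=300):
    """Build the ImageMagick call: rasterize page 0, trim the border, save as JPG."""
    pdf_path = Path(pdf_path)
    out_file = Path(out_dir) / (pdf_path.stem + ".jpg")
    return [
        "magick",
        "-density", str(density),   # must come before the input PDF
        f"{pdf_path}[0]",           # [0] selects the first page
        "-trim", "+repage",
        str(out_file),
    ]


def convert_all(in_dir, out_dir, density=300):
    """Convert every PDF found directly inside in_dir, one JPG per PDF."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        subprocess.run(magick_trim_command(pdf, out_dir, density), check=True)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        convert_all(sys.argv[1], sys.argv[2])
    else:
        print("usage: python trim_pdfs.py INPUT_FOLDER OUTPUT_FOLDER")
```

The pages where -trim guesses wrong would still need a manual crop afterwards, but they would already be plain JPGs by then.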
Using convert instead of magick, the out.jpg contains two images (that’s exactly what that page holds), while pdfimages turns out two separate photos, so better for Kard.
Opens the next image in sequence (alphabetical or numerical) in the same directory
Can be assigned to a hot key
This saves you two trips to the file selector, including one where you have to hunt for the next file.
So, I have done exactly this kind of thing on my expense receipts, and the whole process becomes:
File > Open the first image
Crop
Hit the File-next key
Crop
Hit the File-next key
Crop
Hit the File-next key
Crop
Hit the File-next key
Crop
… etc.
But this requires the input and output files to be identical (you could hack it though). However, since GIMP asks plenty of questions when opening a PDF, it could be better in your case to mass-convert your PDFs to JPG before starting (using ImageMagick or something else).
I am not sure if I really understand your needs and workflow (an example would be nice). Stirling PDF is a new tool that does all kinds of cool things to PDFs. However, it is under heavy development and not all modules are stable yet.
(Text) PDFs can be text plus layout commands (the “page description” part of PDF), images, or a combination.
Both can be easily extracted from a pdf file, but you tend to lose the formatting part.
Do you just need the text, or is the formatting/layout on the page important?
Hey guys, thank you all. Sorry, I have no idea how this forum works and how to reply to specific people. Can’t even seem to be able to edit my post anymore.
@Ofnuts Thank you, that seems the quickest way excluding a full automation of the process, will give it a try.
I am not sure pdfimages is what I’m looking for, since I don’t need to extract JPG from PDF, I don’t really care what the format of the pages is, I just need them all cropped and to be the same size/resolution. Say you have a 12 000 x 18 000 image file, you crop a part of it (in the proper aspect ratio), which will differ from image to image, then export to 2970 x 4200.
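To make that crop-then-export step concrete, here is a rough Python sketch of the arithmetic. It rests on assumptions: the rough box coordinates would come from wherever the page gets marked by hand, the names fit_box_to_aspect and crop_command are made up for illustration, and the ImageMagick call is only built, not run. The box is grown to the 2970 x 4200 ratio around its center, with no check that it stays inside the image.

```python
TARGET_W, TARGET_H = 2970, 4200  # export size mentioned in the post


def fit_box_to_aspect(x, y, w, h, target_w=TARGET_W, target_h=TARGET_H):
    """Grow a rough crop box to the target aspect ratio, keeping its center.

    No bounds check: the returned box may extend past the image edges.
    """
    if w * target_h < h * target_w:
        # Box is too narrow for the target ratio: widen it.
        new_w, new_h = round(h * target_w / target_h), h
    else:
        # Box is too wide (or already right): make it taller.
        new_w, new_h = w, round(w * target_h / target_w)
    new_x = x + (w - new_w) // 2
    new_y = y + (h - new_h) // 2
    return new_x, new_y, new_w, new_h


def crop_command(src, dst, box, target_w=TARGET_W, target_h=TARGET_H):
    """Build (not run) an ImageMagick call: crop to box, then scale to the target size."""
    x, y, w, h = box
    geometry = f"{w}x{h}{x:+d}{y:+d}"  # e.g. 9900x14000+50+2000
    return ["magick", src, "-crop", geometry, "+repage",
            "-resize", f"{target_w}x{target_h}!", dst]


box = fit_box_to_aspect(1000, 2000, 8000, 14000)
print(box)
print(" ".join(crop_command("scan.jpg", "page.jpg", box)))
```

For the example box this prints (50, 2000, 9900, 14000) and the command magick scan.jpg -crop 9900x14000+50+2000 +repage -resize 2970x4200! page.jpg, so every page would end up at exactly the same pixel size whatever the size of the original crop.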
I don’t need to extract the text, the languages are rather old and not supported. Training the OCR is a project for another day. Just images is fine.
No need to straighten or clean up, as that will be done automatically when compiling later with Abbyy FineReader.
If the question is why not do the cropping with Abbyy, good question. The program doesn’t seem to be able to handle this task properly. Despite importing 1200DPI scans and exporting as 300DPI, after selecting a size for all pages in the document, it automatically adds white space around the cropped pages instead of scaling them up (which it should do considering there is more than enough resolution for it). So I need all the pages cropped to identical size before importing. There is always the chance I’m doing something wrong, but I even spoke with their support team and they weren’t able to provide a solution.