Working with afre_cleantext filter and G'MIC plugin

@afre

Hello, I'm trying G'MIC's afre_cleantext to clean up this book page photo example and convert it to B/W, with the page background removed and clean, crisp text, before saving to PDF or DjVu format.

Would you explain the right sequence for finding the best position for each of the 4 sliders, to achieve the best text quality and no visible background? What values would be best for this example? Can you suggest any journal articles or web links with recommendations for cleaning scanned book pages?

Can I binarize the image with this plugin, i.e. turn the 24-bit color original into B/W after finding the best settings, which means using a threshold to convert all pixels to black or white?

When using G'MIC as a Paint.NET plugin, how can I reset the image or undo changes to the entire image after pressing the Apply button, without closing the G'MIC window? Can you add such buttons to the plugin, so that users would not need to exit and restart it each time to reset the image?

You’ll want to talk to @afre :slight_smile:

Actually, my questions about a Reset Image or Undo Changes button are more generic, unless @afre developed the entire G'MIC. Also, I hope some experienced users will chime in.

@sambul81 Welcome to the forum! I have changed the title, category and tags to better reflect the topic. I can see that you have several questions bundled in your OP. I will address each.

1 Digitization is a topic on its own. If it is a few pages, sure, going this route is reasonable; but if it is an entire book, then you would be better off getting the binding chopped off and digitizing with a fast scanner and OCR. That avoids the awkwardly shaded pages caused by the binding and narrow inner margin. Libraries, and digital book and audio services, do this all of the time.

One other thing to consider is copyright.

As for literature on scanning, there are a few threads with participants who seem to be in the know. Maybe post there. Anyway, I think I have given you enough to think about here to be able to do independent research.

2 The key is to experiment with the parameters. There is no right way, although I have attempted to put the ones with the greatest influence first. The issue is that I made the filter in a hurry and based it on only one person's scans. It may require further tweaking. If you could provide more sample pages, that would be great.

3 I don’t think true B&W is a solution. When people think of B&W scanning, they are actually thinking of grey scale. B&W won’t allow for smooth edges and transitions, and would in fact interfere with OCR if that is where you are going. The feature to consider is contrast, which can be adjusted by the filter or the simple levels or curves tools.

4 The behaviour of the plugin won't be changing. Once the filter is applied, the data is handed off to the app and committed. Two things you could do are undo in the app proper, or select the "new layer" option in the plugin to output the result to a new layer on top of your original. The latter is the best method because you can compare the result and also do things like blending and masking with other layers.

PS I totally understand your frustrations with the plugin. I normally avoid it altogether and stick with the CLI because the preview is rarely accurate and I just found out that I can’t zoom out beyond 100%, which is awkward for your sample image. @David_Tschumperle, could you please ask the plugin dev to allow zooming out beyond 100% when in preview(0) mode? Thanks!

I found time to try afre_cleantext on your image.

gmic 484.jpg afre_cleantext 9,.6,85,95 o 484_cleantext.png

Remarks

1 Parameters could be tweaked some more.

2 The input image was dimly illuminated, blurry and noisy. My filter won't be able to pick up the parts of the letters where there is that much variation, although it has done a reasonable job. We can try, but the cleaner the input image, the better. Best to capture one page at a time with a scanning stand and lights.

I see that you migrated from the Paint.NET forum. I want to mention that it isn't possible to do that in Paint.NET, but it is possible within Krita or GIMP.

Hi @afre, thanks for the detailed reply. In this case, is it possible to transfer control to the main Paint.NET window while the G'MIC window is still open? Currently it's possible to switch between these windows, but Paint.NET remains unresponsive until the G'MIC window is closed. This would allow reverting changes without closing G'MIC. As you pointed out, the G'MIC preview is not accurate (why?), and it's very inconvenient to wait for G'MIC to exit and launch again for every test change to the image.

Here are 2 test sets. Reference books are often printed on very thin paper, so the printing on the back shows through. Some areas on the back and front pages are highlighted with a black background; the front highlighting must be preserved and the back highlighting removed. There are side "finger" areas ("Code") which must remain untouched. Also, on some pages the back text overlaps a front photo or illustration area; it is desirable to remove that text while keeping the illustrations fully untouched.

https://workupload.com/file/jyxGvv3G
https://workupload.com/file/npm5Tx8h

Binarization is still often done when converting to DjVu to keep the book size small, meaning about 60 MB per book with 1000 content-rich, clean pages, such as in the linked example. The print visible from the back side can be removed without damaging the front-side print by an algorithm that identifies mirrored letters.

The original small-font text doesn't seem to be exactly black, yet I was able to improve it a little with your filter. I hope that, if the full color is preserved, you can improve the code to keep the letters fully visible while the background is fully removed. One way to achieve that is to improve the letters with unsharpening, thickening, smoothing and despeckle algorithms, then make the colored letters black throughout the entire outline.
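
As a very rough illustration of that kind of chain in G'MIC terms (an untested sketch; page.png is a placeholder file name and the numbers are only values to experiment with: sharpen for unsharpening, erode to thicken dark strokes, median to smooth and despeckle):

gmic page.png to_gray sharpen 150 erode 2 median 2 normalize 0,255 o page_enhanced.png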

The text can be flattened after cleanup by a different restoration package, and the white background can then be replaced, when converting to DjVu, with an easier-to-read grey or light brown background.

Digression

I will be frank with you for a moment: I won’t be able to deliver on what you ask. I have a challenging life and am here as an outlet for wholesomeness. Keeping in the theme of the season, I am one whom you would call the “least of these”.

Secondly, my filters are meant to be minimalist, though powerful; they aren't meant to be a full solution stack. Of course, you may use them at any stage of your workflow. There are also plenty of free, open-source and commercial products out there that are much more feature-complete than my code.

As you may have noticed, I have been directing the discussion toward problem solving instead of relying on particular tools. Tools are interchangeable but not skill and experience. That said, your discussion has given me a sense of what to do next with this filter.

Lastly, I would like to reiterate the issue of copyright. It seems like you want to scan the whole textbook. Maybe not this particular one. Now, I won’t tell you what to do but I encourage you to honour those who made this textbook.

Back to the Q&A

@Reptorian says it is not. I don’t have time to explore Paint.NET and will defer to his reply.

It is inaccurate because the processing is applied to the preview itself, at its size and zoom level, plus whatever the plugin+app combo does to the preview. In other words, you aren't filtering the actual image at all. For some filters that is completely fine, but for more sophisticated ones, not so much, or not at all.

You listed a bunch of items. As long as the features aren't too different in kind and shade, they should be preserved. Keep in mind that the filter does global processing and doesn't use machine learning to detect objects or perform other advanced processing of that sort. It will most certainly not preserve the photographs. PDF conversion programs do multiple passes for different content. In any case, even the best apps and services have problems with this. It takes human intervention to get it right.

The Black and White parameters already allow you to narrow the grey scale range to a minimum. All you would need to do is apply a final threshold to convert the image to B&W, which is a very basic and simple thing to do, and better left to the app's native tools.
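
For example, from the CLI something along these lines would do it; the afre_cleantext values are just the ones I used earlier and the 128 cut-off is only a guess to tune (normalize stretches the result to 0-255, ge turns it into a 0/1 mask, mul scales that back to 0/255):

gmic 484.jpg afre_cleantext 9,.6,85,95 to_gray normalize 0,255 ge 128 mul 255 o 484_bw.png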

In terms of storage space, as hinted in a previous post, your goal should ultimately be OCR. Text costs nothing. Textbooks are mostly empty space anyway.

I can certainly improve the filter and no doubt I will eventually. However, many of these steps are things that you should look into yourself. Moreover, as said, high quality scanning would definitely help. Processing is only as good as the inputs it is given and how skilled and experienced the person doing it is.

Not sure I understand that. Besides, I don't intend to scan any book entirely, just various pages, which is permitted by copyright law. OCR would be overkill given the time required for manual proofreading of the text. I really appreciate your intention to improve the filter, and your processing suggestions.

To offer more ideas: adding an algorithm that identifies mirrored letters and changes their color to white would make it easy to clean the show-through from the back side on thin, translucent pages of various document scans, even handwritten ones. A separate interesting case is cleaning the background from pages flattened with BookRestorer, since flattening introduces some deviations from standard font outlines. As you may know, some packages like Acrobat identify the standard fonts used on scanned pages during OCR, and use the same fonts for the OCR text to preserve the document's format and appearance. :sweat_smile:
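
To illustrate the mirrored-letters idea as a very rough, untested sketch: assuming the scan of the back side is available and roughly aligned with the front (the file names and the darkness cut-off of 100 are made up, and any front text overlapping the show-through would be whitened too), the back page could be mirrored, thresholded into a mask of its dark text, and used to push those areas of the front page to white:

gmic front.png back.png to_gray mirror[1] x le[1] 100 mul[1] 255 max[0,1] o front_clean.png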

What he likely means is that it's not possible NOW. What I mean is asking the plugin dev to make it possible in the near future. :sob:

No, it is more a limitation of the plugin API system in Paint.NET. I can keep the G'MIC window open in Krita, minimize it to do whatever, then apply a G'MIC filter after maximizing, because Krita doesn't have that issue. You need to ask Rick Brewster about anything related to the PDN plugin system; we can't do anything about PDN development.

If you do not mind using GIMP 2.10, here is a one-click output.

I used the Color to Gray method that is listed in the Desaturate module under the Colors tab. This module takes time to operate with its default parameters; however, for your images, you can tweak the parameters to make it act instantaneously. The parameters I tried were: Radius: 300, Samples: 4 and Iterations: 1. (I have cropped the image a bit and also resized it so as to save server space on this forum.)

EDIT: I just noticed that this method does not work well if photos are present in the text. I tried the Mono Mixer method of desaturation and then tweaked the contrast using the Levels tool to get this:


This method will work for normal pages too.


@shreedhar Good preprocessing step. It doesn’t have to be hard. :slight_smile:
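
Roughly the same idea from the G'MIC CLI, for anyone who prefers the command line (page.tif is a placeholder file name; the cut values act as levels-style black and white points and are only a starting point to tune per page):

gmic page.tif to_gray cut 40,220 normalize 0,255 o page_gray.tif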


Thanks. What is the purpose of this step of converting color to grayscale? It will only help if it doesn't result in some pixel loss in the front-page font, so this page is a good example to try. The ultimate cleanup goal is to remove the translucent text showing through from the back page. The image would then go to Book Restorer to be flattened. If the cleanup was done improperly, major background dirt will appear at flattening (geometric correction).

Once flattened, the image would go to a DjVu package for conversion, where binarization will likely follow to cut the resulting DjVu file size. DjVu Solo 3.1 is free and still the most efficient compression choice, but it lacks cleanup and image-enhancement tools and custom presets. For that, the DjVu Document Express Desktop & Enterprise combo is often used, or their popular derivatives.

Success at this step directly depends on the quality of the background cleanup and font-appearance improvement in the previous steps, though automatic segmentation of the image into text and pictures by some derivative DjVu packages, with different processing per segment, may improve scan quality. If binarization produces a serious loss in font quality, it is omitted, but the resulting DjVu file will be larger and background cleanliness may worsen. The PDF alternative isn't even worth discussing, due to its larger file size and scarce cleanup tools.

Some books are copyright protected, and others aren't anymore. Imagine an old encyclopedia with 1500-2000 pages; that's where single-page file size really matters, especially for reading books on mobile devices. The better the scan and cleanup quality, the higher the chance of binarizing it with good results for a smaller file size. :pleading_face:

I found that thread, but the examples in it were too easy to fix with existing cleanup tools, and they are not typical of real-life dirty old paper archive scans or reference-book page scans with very thin, translucent pages. It was a good starting point though. :yum:

I think I might have some idea of how to solve your issue, but first I would need to know how binarization works. And then, I would need to know whether the leftover areas are smaller than the text.

I don't really know what you mean by pixel loss, but here is the Mono Mixer + Levels method applied to the good example. You can download this and put it through your process. I will be interested to know if it is good enough!

Your pic is actually 8-bit and smaller in size, while in scanned-book-page processing it's common to double the picture size before any cleanup. If you can post the processed picture in TIFF without changing its size compared to the original, I can check whether subsequent processing reveals any defects introduced or left by your cleanup. :rofl:

A user or the software just selects a threshold (possibly automatically), and everything above or below it is converted to white or black. Of course it works well only if the background is clean and the scanned text is pre-improved (by unsharpening, thickening, smoothing, despeckling, etc.) in such a way that converting the colored text to black does not result in loss of its quality or outline, and the appearance of the resulting fonts is ideally close to the book's fonts, i.e. not too thick, fine enough, etc.

Not sure what you mean by "leftover areas". If you mean the show-through text and text highlighting from the back page, ideally it should be completely removed if possible. If you mean embedded photos, they can be selected on the page to bypass processing; this is called image segmentation. The problem is that the back text also shows up on the front photos, so the challenge is to remove it from the photos too, possibly by a different method. :cry:

Here it is:

This is how it looks flattened and binarized, at a 127 KB file size. It is readable, which is the ultimate goal. But some of the text outline is missing, and more font-outline improvement beforehand is desirable, so that it isn't partially lost at binarization.

Ideally the font outline should be filled without defects, and its pixel color spectrum should be narrowed during cleanup to be close to black, so it isn't cut off to white later at binarization, creating small visible defects in the outline. The background might also contain near-black pixels as a result, and removing them may then require a different filter such as despeckle. :sweat_smile:

A note to the site admin: since this site can't display the DjVu format when uploaded, there is little sense in blocking links to other hosting sites for DjVu files here, unless you enable showing them here. :blush:

It is too late here in India now. I will try to post another TIFF file tomorrow, if you are interested.