Working with afre_cleantext filter and G'MIC plugin

Digression

I will be frank with you for a moment: I won’t be able to deliver on what you ask. I have a challenging life and am here as an outlet for wholesomeness. Keeping in the theme of the season, I am one whom you would call the “least of these”.

Secondly, my filters are meant to be minimalist, though powerful; they aren’t meant to be a full solution stack. Of course, you may use them at any stage of your workflow. There are also plenty of free, open source and commercial products out there that are much more feature complete than my code.

As you may have noticed, I have been directing the discussion toward problem solving instead of relying on particular tools. Tools are interchangeable but not skill and experience. That said, your discussion has given me a sense of what to do next with this filter.

Lastly, I would like to reiterate the issue of copyright. It seems like you want to scan the whole textbook. Maybe not this particular one. Now, I won’t tell you what to do but I encourage you to honour those who made this textbook.

Back to the Q&A

@Reptorian says it is not. I don’t have time to explore Paint.NET and will defer to his reply.

The result is inaccurate because the processing is applied to the preview itself, at whatever size and zoom level it has, plus whatever else the plugin+app combo does to the preview. In other words, you aren’t filtering the actual image at all. For some filters that is completely fine, but for more sophisticated ones, not so much, or not at all.

You listed a bunch of items. As long as the features aren’t too different in kind and shade, they should be preserved. Keep in mind that the filter does global processing and doesn’t have machine learning to detect objects and other advanced processing of that sort. It will most certainly not preserve the photographs. PDF conversion programs do multiple passes for different content. In any case, even the best apps and services have problems with this. It takes human intervention to get it right.

The Black and White parameters already allow you to narrow the grey-scale range to a minimum. All you would need to do is apply a final threshold and convert the image to B&W, which is a very basic operation and better left to the app’s native tools.
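If you want to script that final threshold step outside the app, here is a minimal sketch in Python with Pillow and NumPy (my own example, not part of afre_cleantext; the file names and threshold value are assumptions to be tuned per scan):

```python
# Minimal sketch: final threshold to pure black and white.
# Assumes an already-cleaned grayscale scan saved as "page_clean.png".
from PIL import Image
import numpy as np

THRESHOLD = 180  # assumed value; lower keeps more faint strokes

img = np.asarray(Image.open("page_clean.png").convert("L"))
bw = np.where(img < THRESHOLD, 0, 255).astype(np.uint8)  # dark -> black, rest -> white
Image.fromarray(bw).save("page_bw.png")
```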

In terms of storage space, as hinted in a previous post, your goal should ultimately be OCR. Text costs nothing. Textbooks are mostly empty space anyway.

I can certainly improve the filter and no doubt I will eventually. However, many of these steps are things that you should look into yourself. Moreover, as said, high quality scanning would definitely help. Processing is only as good as the inputs it is given and how skilled and experienced the person doing it is.

Not sure I understand that? Besides, I don’t intend to scan any book entirely, just various pages, which is permitted by copyright law. OCR would be overkill given the time required for manual proofreading of the text. I really appreciate your intention to improve the filter, and your processing suggestions.

To offer more ideas: adding an algorithm that identifies mirrored letters and changes their colour to white would make it easy to clean back-side text bleed-through on the thin, semi-transparent pages of various document scans, even handwritten ones. An interesting standalone case is cleaning the background of pages flattened with BookRestorer, since flattening introduces some deviations from standard font outlines. As you may know, some packages like Acrobat identify the standard fonts used on scanned pages during OCR, and reuse those fonts for the OCR text to preserve the document’s format and appearance. :sweat_smile:
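Not the mirrored-letter matching suggested above, but as a much simpler stand-in here is a rough Python/Pillow/NumPy sketch of the basic move of pushing suspected bleed-through pixels to white, relying only on back-side text being fainter than front-side ink (the band limits and file names are assumptions):

```python
# Rough sketch of bleed-through suppression by intensity band; this is NOT
# mirrored-letter detection, just a crude stand-in for illustration.
from PIL import Image
import numpy as np

FRONT_INK_MAX = 110   # assumed: front-page ink is darker than this
PAPER_MIN = 225       # assumed: clean paper is lighter than this

gray = np.asarray(Image.open("scan_page.png").convert("L")).copy()

# Pixels between the two limits are treated as show-through from the back page
# (fainter than front ink, darker than paper) and are pushed to white.
bleed = (gray > FRONT_INK_MAX) & (gray < PAPER_MIN)
gray[bleed] = 255
Image.fromarray(gray).save("scan_page_debled.png")
```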

What he likely means is that it’s not possible NOW. What I mean is asking the plugin dev to make it possible in the near future. :sob:

No, it is more a limitation of the plugin API system in Paint.NET. In Krita, I can keep the G’MIC window open, minimize it to do whatever, then apply a G’MIC filter after maximizing, because Krita doesn’t have that issue. You need to ask Rick Brewster about anything related to the PDN plugin system; we can’t do anything about PDN development.

If you do not mind using GIMP 2.10, here is a one-click output.

I used the Color to Gray method listed under the Desaturate entry in the Colors menu. This operation takes time under its default parameters, but for your images you can tweak the parameters to make it act instantaneously. The parameters I tried were: Radius: 300, Samples: 4 and Iterations: 1. (I have cropped the image a bit and also resized it to save this forum’s server space.)

EDIT: I just noticed that this method does not work well if photos are present in the text. I tried the Mono Mixer method of desaturation and then tweaked the contrast using the Levels tool to get this:


This method will work for normal pages too.
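For readers who want to reproduce the Mono Mixer + Levels route outside GIMP, here is a rough Python equivalent (not GIMP’s actual code; the mixer weights and level points are assumptions to be tuned per scan):

```python
# Rough Python equivalent of the "channel mix + levels" idea described above.
from PIL import Image
import numpy as np

R_W, G_W, B_W = 0.2, 0.7, 0.1       # assumed channel-mixer weights
BLACK_POINT, WHITE_POINT = 60, 200  # assumed Levels input points

rgb = np.asarray(Image.open("scan_page.jpg").convert("RGB")).astype(np.float64)
gray = rgb[..., 0] * R_W + rgb[..., 1] * G_W + rgb[..., 2] * B_W

# Levels: map [BLACK_POINT, WHITE_POINT] to [0, 255], clipping everything else.
gray = (gray - BLACK_POINT) / (WHITE_POINT - BLACK_POINT) * 255.0
gray = np.clip(gray, 0, 255).astype(np.uint8)
Image.fromarray(gray).save("scan_page_gray.png")
```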


@shreedhar Good preprocessing step. It doesn’t have to be hard. :slight_smile:


Thanks. What is the purpose of this step of converting colour to grayscale? It will only help if it doesn’t cause pixel loss in the front-page font, so this page is a good example to try. The ultimate cleanup goal is to remove the semi-transparent text showing through from the back page. The image would then go to Book Restorer to be flattened. If the cleanup was done improperly, major background dirt will appear at flattening (geometric correction).

Once flattened, the image would go to a DjVu package for conversion, where binarization will likely follow to cut the resulting DjVu file size. DjVu Solo 3.1 is free and still the most efficient compression choice, but it lacks cleanup and image-enhancement tools and custom presets. For that, the DjVu Document Express Desktop & Enterprise combo is often used, or their popular derivatives.

Success at this step depends directly on the quality of the background cleanup and font-appearance improvements in the previous steps, though the automatic segmentation of a page into text and pictures by some derivative DjVu packages, with different processing for each segment, can improve scan quality. If binarization produces a serious loss in font quality, it is omitted, but the resulting DjVu file will be larger and background cleanliness may worsen. I’m not even talking about the PDF alternative, given its larger file size and scarce cleanup tools.

Some books are copyright protected, and others aren’t anymore. Imagine an old encyclopedia with 1500-2000 pages: that’s where per-page file size really matters, especially for reading on mobile devices. The better the scan and cleanup quality, the higher the chance of binarizing it with good results for a smaller file size. :pleading_face:

I found that thread, but the examples in it were too easy to fix with existing cleanup tools, and they are not typical of real-life dirty old paper archive scans, or reference-book page scans with very thin, semi-transparent pages. It was a good starting point though. :yum:

I think I might have some idea of how to solve your issue, but first I would need to know how binarization works. Then I would need to know whether the leftover areas are smaller than the text.

I don’t really know what you mean by pixel loss, but here is the Mono Mixer + Levels method applied to the good example. You can download this and put it through your process. I will be interested to know whether it is good enough!

Your pic is actually 8-bit and smaller in size, while in scanned-book-page processing it’s common to double the image size before any cleanup. If you can post the processed picture as TIFF without changing its size compared to the original, I can check whether subsequent processing reveals any defects introduced or left by your cleanup. :rofl:

A user or the software just selects a threshold (automatically or manually), then converts everything above it to white and everything below it to black. Of course, it works well only if the background is clean and the scanned text has been pre-improved (by unsharp masking, thickening, smoothing, despeckling, etc.) in such a way that converting the coloured text to black doesn’t cause quality or outline loss, and the resulting fonts ideally stay close to the book’s fonts, i.e. not too thick, fine enough, and so on.
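For the “software auto selects the threshold” case, here is a small self-contained sketch using Otsu’s classic method in Python with NumPy and Pillow (file names are placeholders; DjVu encoders use their own binarizers, this just illustrates the idea):

```python
# Auto-threshold binarization with Otsu's method (illustrative only).
import numpy as np
from PIL import Image

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the threshold that maximises between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    probs = hist / gray.size
    cum_probs = np.cumsum(probs)                    # weight of the "dark" class
    cum_means = np.cumsum(probs * np.arange(256))   # cumulative intensity mean
    global_mean = cum_means[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (global_mean * cum_probs - cum_means) ** 2 / (cum_probs * (1.0 - cum_probs))
    return int(np.argmax(np.nan_to_num(between)))

gray = np.asarray(Image.open("page_flattened.tif").convert("L"))
t = otsu_threshold(gray)
bw = np.where(gray <= t, 0, 255).astype(np.uint8)   # below threshold -> black, above -> white
Image.fromarray(bw).save("page_binarized.png")
print("auto-selected threshold:", t)
```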

Not sure what you mean by “leftover areas”? If you mean the semi-transparent text and text highlighting showing through from the back page, ideally it should be removed completely if possible. If you mean embedded photos, they can be selected on the page to bypass processing; this is called image segmentation. The problem is that the back-page text also shows up over the front-page photos, so the challenge is to remove it from the photos too, possibly by a different method. :cry:

Here it is:

This is how it looks flattened and binarized, at a 127 KB file size. It is readable, which is the ultimate goal, but some of the text outline is missing, and more font-outline improvement is desired beforehand so it isn’t partially lost at binarization.

Ideally the font outline should be filled without defects, and during cleanup its pixel colour spectrum should be narrowed towards black, so it isn’t cut off to white later at binarization, leaving small visible defects in the outline. The background might also contain near-black pixels as a result, and their removal may then require a different filter such as despeckle. :sweat_smile:
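As an illustration of that two-step idea (narrow the ink spectrum towards black, then despeckle the stray near-black background dots), here is a hedged Python sketch; the near-black limit and the median-filter size are assumptions:

```python
# Sketch: push near-black text pixels fully to black, then despeckle
# isolated dark dots with a small median filter.
from PIL import Image, ImageFilter
import numpy as np

NEAR_BLACK = 90   # assumed: anything darker than this is treated as text ink

gray = np.asarray(Image.open("page_cleaned.png").convert("L")).copy()
gray[gray <= NEAR_BLACK] = 0          # narrow the ink spectrum towards pure black

img = Image.fromarray(gray)
img = img.filter(ImageFilter.MedianFilter(size=3))   # simple despeckle pass
img.save("page_ink_narrowed.png")
```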

A word to the site admin: since this site can’t display the DjVu format when it is uploaded, there is little sense in blocking links to other hosting sites for DjVu files here, unless you enable displaying them here. :blush:

It is too late now here in India. I will try to post another tiff file tomorrow, if you are interested.

No problem. I’m definitely interested. :sleeping:

OK. Here is a TIFF file obtained using RawTherapee 5.7. The advantage of this approach is that you can apply the attached processing profile to all images without having to open each one of them.
It is a 16-bit TIFF generated from an 8-bit JPEG, so the same processing applied to an original 16-bit file may yield a better result.

Processing file: 484RawTherapee-1.tif.out.pp3 (11.9 KB)
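If anyone wants to batch-apply that .pp3 from a script rather than from the RawTherapee GUI, something along these lines should work (a sketch that assumes rawtherapee-cli is installed and on your PATH; double-check the flags against rawtherapee-cli --help for your version, and the folder names are placeholders):

```python
# Batch-apply a .pp3 processing profile to every JPEG in "scans/" via rawtherapee-cli.
import subprocess
from pathlib import Path

PROFILE = "484RawTherapee-1.tif.out.pp3"
OUT_DIR = Path("processed")
OUT_DIR.mkdir(exist_ok=True)

for jpg in sorted(Path("scans").glob("*.jpg")):
    subprocess.run(
        ["rawtherapee-cli",
         "-o", str(OUT_DIR),   # output directory
         "-p", PROFILE,        # processing profile to apply
         "-t",                 # write TIFF output
         "-c", str(jpg)],      # input file (must come last)
        check=True,
    )
```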

The text recovery is quite noticeable, but I was unable to remove enough background noise without losing some of the text outline, even with intermediate processing in a specialised book-restoration package before further processing in a DjVu cleaner and converter. In other words, we’re facing the same task of narrowing the colour spectrum of all the text-outline pixels closer to black before binarization.

Besides, this RawTherapee TIFF output can’t be converted directly by the DjVu tools for some reason, and so has to be re-saved as TIFF by another graphics package.
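If a plain re-save is all that is needed, even a two-line Pillow script (just one possible “another graphics package”, chosen here for illustration; file names are placeholders) can do it:

```python
# Re-save the RawTherapee TIFF through Pillow so other tools will accept it.
from PIL import Image

Image.open("page_rawtherapee.tif").save("page_resaved.tif")
```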

As mentioned, there are many ways to reach the goal. This time I am not using afre_cleantext but afre_contrastfft (I haven’t written the GUI part yet).


It looks like you’re closer to the goal of what @sambul81 wants than I thought.

It may be; here’s a 50 KB DjVu page to support that.

Actually, there is a large community of book lovers who want “what sambul81 wants”. :joy: And it was obvious from the start that the man is highly intelligent and a bright talent. You guys rock!

Looks like I don’t need the collab after all. Don’t get too excited: I haven’t released the GUI yet.

I know, because… here are 2 test sets. Hope you won’t forget about… :crazy_face: