Working with afre_cleantext filter and G'MIC plugin

Thanks. What is the purpose of this step, converting color to grayscale? It will only help if it doesn’t cause any pixel loss in the front-page font, so this page is a good example to try. The ultimate cleanup goal is to remove the transparent text showing through from the back page. The image would then go to Book Restorer to be flattened. If the cleanup was done improperly, major background dirt will appear at flattening (geometric correction).

Once flattened, the image would go to a DjVu package for conversion, where binarization will likely follow to cut down the resulting DjVu file size. DjVu Solo 3.1 is free and still the most efficient compression choice, but it lacks cleanup and image-enhancement tools and custom presets. For those, the DjVu Document Express Desktop & Enterprise combo is often used, or their popular derivatives.

The success of this step directly depends on the quality of the background cleanup and font-appearance improvement in the previous steps, though automatic segmentation of a page into text and picture zones by some derivative DjVu packages, with different processing per segment, may improve scan quality. If binarization produces a serious loss in font quality, it is omitted, but the resulting DjVu file will be larger and background cleanliness may worsen. The PDF alternative isn’t even worth discussing, given its larger file size and scarce cleanup tools.

Some books are copyright protected, and others aren’t anymore. Imagine an old encyclopedia with 1500-2000 pages: that’s where per-page file size really matters, especially for reading on mobile devices. The better the scan and cleanup quality, the higher the chance of binarizing it with good results and a smaller file size. :pleading_face:

I found that thread, but the examples in it were too easy to fix with existing cleanup tools, and they are not typical of real-life scans of dirty old archive paper or reference-book pages with very thin, transparent pages. It was a good starting point though. :yum:

I think I might have an idea for solving your issue, but first I would need to know how binarization works. And then I would need to know whether the leftover areas are smaller than the text.

I don’t really know what you mean by pixel loss. But here is the Mono mixer + Levels method applied to the good example. You can download this and put it through your process. I will be interested to know if it is good enough!
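
For anyone curious what that does under the hood, here is a minimal sketch of a channel-mixer-to-grayscale plus levels adjustment in Python with Pillow/NumPy. The channel weights, level endpoints and file names are placeholders, not the actual settings used on the sample above.

```python
from PIL import Image
import numpy as np

# Load the page as float RGB.
rgb = np.asarray(Image.open("page_color.tif").convert("RGB")).astype(np.float32)

# "Mono mixer": weighted sum of R, G and B into a single grayscale channel.
# Weighting toward the channel where the bleed-through is faintest helps hide it.
weights = np.array([0.1, 0.2, 0.7])      # assumed mix, not the real settings
mono = rgb @ weights

# "Levels": clip and stretch so the paper goes white and the ink goes black.
black_point, white_point = 40.0, 220.0   # assumed endpoints
mono = np.clip((mono - black_point) / (white_point - black_point) * 255.0, 0, 255)

Image.fromarray(mono.astype(np.uint8)).save("page_mono.tif")
```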

Your pic is actually 8-bit and smaller in size, while in scanned book-page processing it’s common to double the image size before any cleanup. If you can post the processed picture as a TIFF without changing its size compared to the original, I can check whether subsequent processing reveals any defects introduced or left by your cleanup. :rofl:

A user or the software selects a threshold (manually or automatically), and everything above or below it is converted to white or black. Of course this works well only if the background is clean and the scanned text has been pre-improved (by unsharp masking, thickening, smoothing, despeckling, etc.) in such a way that converting the colored text to black doesn’t degrade its quality or outline, and the resulting fonts ideally stay close to the book’s fonts, i.e. not too thick, fine enough, and so on.
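
To make the threshold idea concrete, here is a minimal sketch of global-threshold binarization in Python with Pillow/NumPy. The threshold value and file names are placeholders; real tools often pick the threshold automatically, e.g. with Otsu’s method.

```python
from PIL import Image
import numpy as np

# Load the cleaned-up page as 8-bit grayscale.
gray = np.asarray(Image.open("page_clean.tif").convert("L"))

threshold = 128                # user- or auto-selected cut-off (placeholder)
binary = np.where(gray > threshold, 255, 0).astype(np.uint8)

# Save as a 1-bit TIFF, which is what the bitonal DjVu encoder compresses.
Image.fromarray(binary).convert("1").save("page_bw.tif")
```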

Not sure what you mean by “leftover areas”? If you mean transparent text and text highlighting showing through from the back page, ideally it should be completely removed if possible. If you’re talking about embedded photos, they can be selected on the page to bypass processing; this is called image segmentation. The problem is that the back text also shows up on the front photos, so the challenge is to remove it from the photos too, possibly by a different method. :cry:
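
A tiny illustration of that segmentation idea (not how any particular DjVu package actually implements it): binarize the page, then paste back a hand-picked photo rectangle so it keeps its continuous tones. The coordinates and file names are made up.

```python
from PIL import Image
import numpy as np

gray = np.asarray(Image.open("page_clean.tif").convert("L"))

# Binarize the text layer with a simple global threshold.
binary = np.where(gray > 128, 255, 0).astype(np.uint8)

# Photo zone (rows top:bottom, columns left:right) that bypasses binarization.
top, bottom, left, right = 150, 700, 100, 600
binary[top:bottom, left:right] = gray[top:bottom, left:right]

Image.fromarray(binary).save("page_segmented.tif")
```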

Here it is:

This is how it looks flattened and binarized at a 127 KB file size. It is readable, which is the ultimate goal. But some of the text outline is missing, and more font-outline improvement beforehand is desirable so it isn’t partially lost at binarization.

Ideally the font outline should be filled without defects, and its pixel color range should be narrowed toward black during cleanup, so it isn’t cut off to white later at binarization, leaving small visible defects in the outline. As a result the background may also end up containing near-black pixels, and removing those may then require a different filter such as despeckle. :sweat_smile:
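
Here is a rough sketch of that two-part idea in Python with Pillow: a levels-style remap that pushes text pixels toward black and paper toward white, followed by a median filter acting as a simple despeckle for stray near-black background dots. The cut-off values and file names are illustrative only.

```python
from PIL import Image, ImageFilter
import numpy as np

gray = np.asarray(Image.open("page_clean.tif").convert("L")).astype(np.float32)

# Narrow the tonal range: everything at or below black_point becomes pure
# black, everything at or above white_point becomes pure white.
black_point, white_point = 60.0, 200.0   # assumed endpoints
stretched = np.clip((gray - black_point) / (white_point - black_point) * 255.0, 0, 255)

# A 3x3 median filter removes isolated dark specks left in the background.
cleaned = Image.fromarray(stretched.astype(np.uint8)).filter(ImageFilter.MedianFilter(size=3))
cleaned.save("page_prepped.tif")
```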

To the site admin: since this site can’t display the DjVu format when uploaded, there is little sense in blocking links to other hosting sites for DjVu files here, unless you enable displaying them. :blush:

It is too late now here in India. I will try to post another tiff file tomorrow, if you are interested.

No problem. I’m definitely interested. :sleeping:

OK. Here is a TIFF file obtained using the RawTherapee 5.7 program. The advantage of this is that you can apply the attached processing profile to all images without having to open each one of them.
It is a 16-bit TIFF generated from an 8-bit JPEG, so the same processing applied to an original 16-bit file may yield better results.

Processing file: 484RawTherapee-1.tif.out.pp3 (11.9 KB)
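
If it helps, the batch application can also be scripted. Below is a rough sketch that calls rawtherapee-cli from Python with the attached profile; the folder names are placeholders, and the flags (-p profile, -o output, -t for TIFF, -c before the input list) are from memory, so please check them against your RawTherapee version.

```python
import glob
import subprocess

profile = "484RawTherapee-1.tif.out.pp3"   # the profile attached above
pages = sorted(glob.glob("scans/*.jpg"))   # hypothetical input folder

subprocess.run(
    ["rawtherapee-cli",
     "-p", profile,        # apply this processing profile to every page
     "-o", "processed",    # write results into this folder
     "-t",                 # output TIFF
     "-c", *pages],        # input files must come after -c
    check=True,
)
```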

The text recovery is quite noticeable, but I was unable to remove enough background noise without losing some of the text outline, even with intermediate processing in a specialized book-restoration package before further processing in the DjVu cleaner and converter. In other words, we’re facing the same task of narrowing the color range of all text-outline pixels closer to black before binarization.

Besides, for some reason this RawTherapee TIFF can’t be converted directly by the DjVu tools, so it has to be re-saved as TIFF by another graphics package.

As mentioned, there are many ways to reach the goal. This time I am not using afre_cleantext but afre_contrastfft (I haven’t written the GUI part yet).

It looks like you’re closer to the goal of what @sambul81 wants than I thought.

It may be; here’s a 50 KB DjVu page to support that.

Actually, there is a large community of book lovers who want “what sambul81 wants”. :joy: And it was obvious from the start that the man is highly intelligent and a bright talent. You guys rock!

Looks like I don’t need the collab after all. Don’t get too excited: I haven’t released the GUI yet.

I know, because… here are 2 test sets. Hope you won’t forget about… :crazy_face:

There are a couple more processing steps that could be added: Lightness/Contrast and erode/dilate. With those, afre would have a working alternative to afre_cleantext.
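
For reference, here is a small sketch of that erode/dilate step using Pillow’s rank filters (not the G'MIC implementation): on dark text over light paper, a MinFilter grows the strokes and a MaxFilter shrinks them back, so the pair acts like a morphological close that fills small gaps in the outline. The filter sizes and file names are just examples.

```python
from PIL import Image, ImageFilter

img = Image.open("page_bw.tif").convert("L")

# Dilate the dark strokes: MinFilter keeps the darkest pixel in each 3x3 window.
thickened = img.filter(ImageFilter.MinFilter(size=3))

# Erode them back with MaxFilter, closing small breaks in the text outline.
closed = thickened.filter(ImageFilter.MaxFilter(size=3))

closed.save("page_morph.tif")
```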

I wonder if there has been any progress on the new “immature” plugin lately? :yum:

I have found that the easiest way to eliminate bleed-through of text from the other side of the page is to place a black sheet of paper behind the page being scanned. I use it all the time and it works great.

Would you point to a suitable black material on eBay or the like? Or where did you get your sheets?

@Bilbo Yes, it is all about technique as I have been saying all along.

@sambul81 I have decided not to. You should improve your scanning technique first. Call it tough love. :slight_smile:
