Tesseract Teaser

Recent changes
Table of contents
Links to this page


My latest posts can be found here:
Previous blog posts:
Additionally, some earlier writings:
Not the four-dimensional cube sort of Tesseract, this is the Optical Character Recognition (OCR) "Tesseract", software that takes an image of some lettering and produces a plain text file containing the text.

Or not. It's well-known that this is "Actually Quite Hard(tm)" and Tesseract does a pretty good job "Out of the Box" with very little messing about. But the other day I ran across something that has me utterly baffled. Let me share my bebafflement with you.

Here's an image that I want to convert to plain text. To the eye it's looks pretty straight-forward, and I have quite a few examples that, to the eye, look almost identical, and which tesseract handles without a problem.

This one, however, produces this result:

  TUE 03/04/2018
  0h30m (DR)

OK, it's not so bad for three of the lines, but the third line? Where does that come from? How does it get that?

(Hah! One commenter has said that on the third line if you screw up your eyes and squint you might be able to see "rFsE10" in the background, in the "black", not in the foreground. Maybe, just maybe, that explains where "rFsE10" comes from.)

Well, I'm accustomed to this, and I played with the settings for a bit, and I played with the image for a bit, but if the settings were right for this image, they turned out to be wrong for another, and I have a lot of these that I need to convert as a batch, so I need settings that will work for them all.

So I read the "man page" for tesseract, and discovered the "get.images" option. This will dump to a file the image it ends up using. So I did that, and I got the result you see here. It's clean, it's binary, and it's clearly legible. So why is it getting the answer so wrong?

Then I thought - "Aha! Let's feed that image into tesseract!"

And that's when I got my first surprise. Feeding this image, the one tesseract created, back into tesseract, the answer was this:

  TUE 03/04/2018
  Oh30m (DR)

How can that be different ?!?

The image is the one tesseract output, which we can only assume is the one it's using for the character recognition, and yet it gives a different (and beautifully correct!) answer!

I'm ... well ... stunned! And stumped. Why should this be so?

OK, a number of people have been in touch to say that they don't follow my reasoning. My guess is that most people won't care, so I'm reluctant to extend this page, but if you are confused as to why I am so confused then please, please let me know and I'll write up a more detailed description.

To me this just defies common sense, and to paraphrase Niels Bohr: "If this behaviour of tesseract hasn't profoundly shocked you, you haven't understand it yet."

And in case you're wondering, if you repeat this process and feed the processed image into tesseract and ask for a dump, you get back exactly the same image. So that really is the one it's using second time round, but even though it outputs it first time round, it's not the one it's using.

Does that make sense to you?

It doesn't make sense to me.

<<<< Prev <<<<
More Mental Model Missteps
>>>> Next >>>>
Pending ...

https://mathstodon.xyz/@ColinTheMathmo You can follow me on Mathstodon.

Of course, you can also
follow me on twitter:


Send us a comment ...

You can send us a message here. It doesn't get published, it just sends us an email, and is an easy way to ask any questions, or make any comments, without having to send a separate email. So just fill in the boxes and then

Your name :
Email :
Message :



Links on this page

Site hosted by Colin and Rachel Wright:
  • Maths, Design, Juggling, Computing,
  • Embroidery, Proof-reading,
  • and other clever stuff.

Suggest a change ( <-- What does this mean?) / Send me email
Front Page / All pages by date / Site overview / Top of page

Universally Browser Friendly     Quotation from
Tim Berners-Lee
    Valid HTML 3.2!