Not the four-dimensional cube sort of Tesseract, this is the
Optical Character Recognition (OCR) "Tesseract", software that
takes an image of some lettering and produces a plain text
file containing the text.
Or not. It's well-known that this is "Actually Quite Hard(tm)" and Tesseract does a pretty good job "Out of the Box" with very little messing about. But the other day I ran across something that has me utterly baffled. Let me share my bebafflement with you.
This one, however, produces this result:
OK, it's not so bad for three of the lines, but the third line? Where does that come from? How does it get that?
(Hah! One commenter has said that on the third line if you screw up your eyes and squint you might be able to see "rFsE10" in the background, in the "black", not in the foreground. Maybe, just maybe, that explains where "rFsE10" comes from.)
Well, I'm accustomed to this. I played with the settings for a bit, and I played with the image for a bit, but whenever the settings were right for this image they turned out to be wrong for another. I have a lot of these to convert as a batch, so I need settings that will work for them all.
Then I thought - "Aha! Let's feed that image into tesseract!"
And that's when I got my first surprise. Feeding this image, the one tesseract created, back into tesseract, the answer was this:
How can that be different?!?
The image is the one tesseract output, which we can only assume is the one it's using for the character recognition, and yet it gives a different (and beautifully correct!) answer!
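For anyone who wants to try the same round trip, here is a minimal sketch in Python. It is not necessarily how the runs above were done. It assumes the `tesseract` command is on the PATH, and that the `tessedit_write_images` config variable makes tesseract save its processed input as `tessinput.tif` in the working directory. Check both against your installed version. The names `pass1` and `pass2` are made up for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_cmd(image, out_base, write_images=False):
    """Build the argument list for one tesseract run (nothing is executed)."""
    cmd = ["tesseract", str(image), str(out_base)]
    if write_images:
        # Assumption: this config variable asks tesseract to dump the image
        # it works from as tessinput.tif in the current working directory.
        cmd += ["-c", "tessedit_write_images=true"]
    return cmd


def ocr(image, out_base, workdir=".", write_images=False):
    """Run tesseract in workdir and return the recognised text."""
    subprocess.run(tesseract_cmd(image, out_base, write_images),
                   cwd=workdir, check=True)
    # tesseract appends ".txt" to the output base name.
    return (Path(workdir) / f"{out_base}.txt").read_text()


def round_trip(image, workdir="."):
    """OCR an image, then OCR the processed image tesseract wrote out."""
    # Resolve first, so the path still works after changing directory.
    first = ocr(Path(image).resolve(), "pass1", workdir, write_images=True)
    second = ocr("tessinput.tif", "pass2", workdir)
    return first, second
```

Calling `round_trip("scan.png")` on one of the troublesome images (the file name is a placeholder) returns both texts, so they can be compared side by side.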
I'm ... well ... stunned! And stumped. Why should this be so?
Does that make sense to you?
It doesn't make sense to me.