Monday, November 5, 2018

Family History — Part 40


Wrestling with OCR: Making Optical Character Recognition Work for You


by Brenna Corbit, Technical Services Librarian


After having spent a few months on our genealogical road trip, I think it is time to take a rest and give you advice on how to wrest the best from optical character recognition (OCR). I have often addressed OCR in the past, but I recommend you reread an earlier blog before proceeding: Family History Tips--Part 13

Recently, I noticed in Newspapers.com that when I view an article I have clipped, there is a link below the clipping that says “show article text (OCR).” The pictured obituary presents a good example of what I am conveying.

I was searching for an obituary of Walter T. Swoyer, who died 21 December 1928, but my phrased, non-phrased, and spelling variation searches yielded nothing. So, I manually went through the digital images and found the obituary on the day following his death. Nope, no spelling errors in the title line. Once again, OCR failed to correctly read the text. But what is interesting is the OCR link, which read his name as Walter T. Swover, not Swoyer. Thus, I tested this search as a phrase with the misspelling and, voila, found it instantly.

I often stress the use of wild cards and truncation rather than using fuzzy search filters. If you remember, wild cards are asterisks and question marks that represent many letter variations in a word: e.g. “Mierzejewski” = “m*r*sk?.” But some OCR search engines do not allow such search aids. So what’s a genealogist to do?
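For anyone who wants to see how such a pattern behaves before trying it on a site, here is a minimal sketch using Python's standard fnmatch module, which happens to use the same two symbols. The list of readings is made up for illustration, and a given search engine may interpret wild cards differently.

```python
from fnmatch import fnmatchcase

def matches(pattern, word):
    """Case-insensitive wild card match: * = any run of characters, ? = exactly one."""
    return fnmatchcase(word.lower(), pattern.lower())

# Hypothetical OCR readings of one surname, invented for this example.
readings = ["Mierzejewski", "Mierzejewsky", "Mienzejewski", "Miller"]

for word in readings:
    print(word, matches("m*r*sk?", word))
```

Note that the third reading, where the R has been misread as an N, no longer fits the pattern. A wild card only helps when the letters you keep in the pattern were read correctly.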

If a Y and a V look similar to OCR, then I could simply try letter changes. I also notice in the article text that an E becomes an A for the word “he,” and that the word “of” appears as “oi.” If I looked at a hundred OCR transcriptions of article text, I could probably present a thousand or more variations. And you could end up trying so many spelling variations you would forget how to spell your own name as you walk off into the sunset finger-flipping your lips. My advice? Look carefully at the word you are searching, throw conventional spelling to the four winds, and start seeing the world as OCR does.

There is no quick and easy fix. If you know the date range of an article, then I would recommend manually searching the digital frames of the newspapers by date if OCR fails. But if you want to find those juicy stories of which no known date is available, and you have an inkling there could be more than what OCR finds, then just let your vision blur.

Here is a crazy idea that actually works. Spell the name on a word processor in a font similar to newsprint, squint your eyes to slightly blur the text and see what letter replacements are possible. I just did this with the word Britton typed in a serifed font. I squinted my eyes into a blur and saw it spelled Bnitton. I searched with this spelling and came up with 79 hits in Pennsylvania papers alone. I clipped one and indeed the OCR transcription read the name as Bnitton.
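If squinting gets tiring, the letter swaps mentioned in this post (Y read as V, E as A, F as I, R as N) can also be tried systematically. The following is only a rough sketch: the swap table and the function name are my own assumptions for illustration, not a list of everything OCR gets wrong, and you would add to the table as you spot new misreadings in your own clippings.

```python
from itertools import product

# Letter confusions seen in this post; illustrative only, not exhaustive.
OCR_CONFUSIONS = {
    "y": ["v"],
    "e": ["a"],
    "f": ["i"],
    "r": ["n"],
}

def ocr_variants(name, max_swaps=1):
    """Return spellings of `name` with 1 to `max_swaps` letters replaced."""
    options = []
    for ch in name:
        subs = OCR_CONFUSIONS.get(ch.lower(), [])
        # Keep the capitalization of the letter being replaced.
        subs = [s.upper() if ch.isupper() else s for s in subs]
        options.append([ch] + subs)

    variants = set()
    for combo in product(*options):
        swaps = sum(1 for a, b in zip(name, combo) if a != b)
        if 0 < swaps <= max_swaps:
            variants.add("".join(combo))
    return sorted(variants)

print(ocr_variants("Swoyer"))
print(ocr_variants("Britton"))
```

With this table, the list for Swoyer includes Swover and the list for Britton contains Bnitton, the two misreadings found above. Raising max_swaps produces spellings with several letters changed at once, at the cost of a much longer list to search through.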

Another option to blur the fonts would be one-too-many martinis, which would result from calming yourself down after having spent too many hours pulling your hair out while wrestling with OCR.

Happy searching!
