Searching Family History with Optical Character Recognition (OCR)
by Brenna Corbit, Technical Services Librarian
I have already mentioned index searches that use optical character recognition (OCR), but I feel this subject needs more attention. Most database queries search manually indexed information, such as names, dates or places. OCR, on the other hand, makes every word within a document searchable. From what I understand, OCR is a computer program that is trained to read images of text much like the human eye. Thus, OCR has opened a whole new world of researching primary sources such as newspapers. OCR can even be trained to read handwritten information.
Before OCR, the only way to find an article within a newspaper, such as an obituary, was to search a date range within a microfilm reel or digital image set. Unfortunately, we may not have the date of a particular incident such as an accident or lawsuit. Now OCR lets us find the good, the bad and the ugly about our ancestors. But it does have some drawbacks.
I have often seen OCR’s inability to read a large headline, while at the same time it will find the same word hidden within a paragraph. Flaws such as smudged text, wrinkles, folds, tears and tape marks on the original document are other OCR challenges. Many images are digital versions of microfiche and microfilm. In those days, a photographer couldn’t see if an image was captured clearly, resulting in flaws that appeared later in the developing room. Therefore, to work within this framework of pluses and minuses, here is my advice.
- Use phrase searching; otherwise OCR will return both the exact phrase and pages where the words appear separately: e.g. “Jonathan L. Reber.” Don’t forget variations of the name, too: “J. L. Reber,” “Jonathan Lash Reber,” “Jonathan Raber.” Even try inverting the name: “Reber, Jonathan.”
- Contrary to the above, do not use phrase searching. This applies specifically to Google’s digital newspaper archive, which rarely returns results for a phrase search. In fact, it finds very little because it was a project that was left online in its unfinished, ruinous state. It is still a valuable source, but it is best searched by browsing the images manually.
- Next of kin do not always appear in obituaries as a full name: e.g. “Thomas McCann is survived by three sons, Jeremy, Michael, and Robert, all of Philadelphia.” Therefore, phrase searches should be avoided in some cases. Unfortunately, most digital image databases do not have proximity searching, which means a search for McCann and Robert will cast a large net.
- Search instead for a person associated with your ancestor, such as a sister, uncle, or cousin. Recently, I phrase-searched a “Henry S. Miller” trying to prove a connection to a Seidel family. My query had a few hits, including his obituary, but no lineal proof. Therefore, I searched for information on “Henry Seidel” and his son “Franklin G. Seidel,” names my Henry Miller was closely associated with. Lo and behold, I found a large article about the Seidel farm where Henry Miller worked. It stated that he was a nephew of Henry Seidel. The phrase was never spotted by OCR.
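For readers comfortable with a little scripting, the advice above can be sketched in Python. This is only an illustration under stated assumptions: the function names name_variants and near are hypothetical, the double-quote phrase syntax is the common convention but may differ between databases, and the proximity check works only on OCR text you have copied or saved yourself, not inside any database's search box. It also does not guess spelling variations such as “Raber”; those still need to be added by hand.

```python
import re


def name_variants(first, last, middle=None):
    """Return quoted phrase-search strings for common forms of a name.

    Hypothetical helper: spelling variants (e.g. Raber for Reber)
    are not generated and must be added manually.
    """
    variants = [
        f"{first} {last}",        # plain name
        f"{last}, {first}",       # inverted name
        f"{first[0]}. {last}",    # first initial only
    ]
    if middle:
        m = middle[0]
        variants += [
            f"{first} {middle} {last}",    # full middle name
            f"{first} {m}. {last}",        # middle initial
            f"{first[0]}. {m}. {last}",    # both initials
        ]
    # Wrap each form in double quotes, the usual phrase-search syntax.
    return [f'"{v}"' for v in variants]


def near(text, word_a, word_b, max_gap=10):
    """Return True if the two words occur within max_gap words of each other.

    A simple stand-in for proximity searching on text you already have.
    """
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)


if __name__ == "__main__":
    for phrase in name_variants("Jonathan", "Reber", middle="Lash"):
        print(phrase)

    obituary = ("Thomas McCann is survived by three sons, Jeremy, "
                "Michael, and Robert, all of Philadelphia.")
    print(near(obituary, "McCann", "Robert"))
```

Each printed phrase can be pasted into a database's search box one at a time. The near function shows what a proximity search would do with the McCann obituary: it matches “McCann” and “Robert” only when they sit close together, rather than anywhere on the same page.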