NEW TOOL TO PIGGYBACK ON MILLIONS OF WEB USERS TO DIGITIZE BOOKS

PREV ARTICLE NEXT ARTICLE FULL ISSUE PREV FULL ISSUE

V11 2008 INDEX E-SYLUM ARCHIVE

The E-Sylum: Volume 11, Number 33, August 17, 2008, Article 25

NEW TOOL TO PIGGYBACK ON MILLIONS OF WEB USERS TO DIGITIZE BOOKS

Numismatic researchers have more and more online resources as out-of-copyright books are scanned and added to Internet archives daily. But optical character recognition technologies only go so far, and human assistance is often required to decipher words read by computers. To aid in this monumental task, researchers at Carnegie-Mellon University in Pittsburgh have devised a new tool to enlist the unwitting help of millions of web users. -Editor

The system of squiggly characters that must be typed correctly to gain access to certain Web sites has annoyed online users for years. Now, the primary inventor of the security technique wants to make amends -- by making Web users decipher even more squiggly words.

Luis von Ahn devised the system of distorted images of letters and numbers in 2000 as a way for email providers, online ticket sellers and other Web services to weed out online undesirables such as computerized ticket scalpers and spammers. The droopy characters, called Captchas, are irritating to humans, but are usually indecipherable to computers, providing a security screen for online service providers.

After widespread adoption of the Captcha system, Mr. von Ahn, a 29-year-old assistant professor at Carnegie Mellon University here, is putting his technique to work in another security scheme, dubbed ReCaptcha. This time, he wants users to assist with what he thinks is an important public service: helping get old books and newspapers online as part of digitized libraries.

The ReCaptcha system presents each user with two words containing distorted characters. Both words come from an old book or newspaper article that has been scanned into an online library. One word has been recognized by the scanning computer, but the other -- possibly because of a smudge or other imperfection on the original document -- can't be made out by the computer's software.

The user tries to decipher the distorted characters and types in both words. If the user matches the computer with the already understood word, then the user's reading of the mystery word is registered.

Other Web users will be shown the same unknown word. When three people have typed in the troublesome characters the same way, the mystery word is considered solved and the system transmits it to the library database so the word can be inserted into the document's digitized version.

When the ReCaptcha project is fully up and running, this month or in early September, Mr. von Ahn expects it to process about 160 books a day being scanned by the Internet Archive, a San Francisco nonprofit. The Internet Archive has paid employees scanning 1,000 books a day at 70 public and university libraries, mostly in the U.S., from the Library of Congress to the Allen County Public Library, in Fort Wayne, Ind.

Computers that run ReCaptcha automatically retrieve scanned books from the Internet Archive's servers, process them, and return the results, devoting an average of nine minutes of working time per book.

Most of the books can be digitized using typical optical character recognition software. Those that prove troublesome are to be handled by ReCaptcha.

"It's a really mind-blowing application," says Internet Archive founder Brewster Kahle.

To read the complete article, see: Web-Security Inventor Charts a Squigglier Course (http://online.wsj.com/article/SB121857740590934637.html)

Wayne Homren, Editor

The Numismatic Bibliomania Society is a non-profit organization promoting numismatic literature. See our web site at coinbooks.org.

To submit items for publication in The E-Sylum, write to the Editor at this address: whomren@gmail.com

To subscribe go to: https://my.binhost.com/lists/listinfo/esylum

PREV ARTICLE NEXT ARTICLE FULL ISSUE PREV FULL ISSUE

V11 2008 INDEX E-SYLUM ARCHIVE