With Terror in Mind, a Formulaic Way to Parse Sentences
<http://www.nytimes.com/2005/03/03/technology/circuits/03next.html?8cir=&pagewanted=print&position=>

The New York Times

March 3, 2005
WHAT'S NEXT

With Terror in Mind, a Formulaic Way to Parse Sentences
By NOAH SHACHTMAN

MAYBE sixth-grade English was more helpful than you thought. One of the dullest grammar exercises is being used to help find potential terrorists, and save companies a bundle.

Diagramming sentences - picking out subject, verb, object, adjective and other parts of speech - has been a staple of middle and high school grammar lessons for decades. Now, with financing from the Central Intelligence Agency, a California firm is using the technique to comb through e-mail messages and chat room talks, which can be a rich lode of corporate and government information, and a tough one to mine.

Figuring out the connections among people, places and things is something computer algorithms do pretty well, as long as that information is structured, or categorized and put into a database. Looking through a company's customer file for a person named Bonds, for example, is fairly simple.

But if the data is unstructured - if the word "bonds" hasn't been classified as the name of a ballplayer or as an investment option - searching becomes much more difficult.

For people in business or in public service, only 20 percent or so of their information is kept in formal databases, noted Nick Patience, an analyst with the 451 Group, a technology research firm. The rest is unstructured, tucked away in e-mail messages, call logs, memos and instant messages.

Attensity, based in Palo Alto, Calif., and financed in part by In-Q-Tel, the C.I.A.'s investment arm, has developed a method to parse electronic documents almost instantly, and diagram all of the sentences inside. ("Moby-Dick," for instance, took all of nine and a half seconds.) By labeling subjects and verbs and other parts of speech, Attensity's software gives the documents a definable structure, a way to fit into a database. And that helps turn day-to-day chatter into information that is relevant and usable.

"They take the language that people use every day and compile it in a way that a machine can use," Mr. Patience said. "And that allows people to start using this tremendous amount of intelligence which has gone untapped."

Whirlpool, the home appliance manufacturer, is now using Attensity software to help cull information from the 400,000 customer service calls the company receives each month.

Tom Welke, a Whirlpool general manager, said the company realized it needed help in March 2002, during a microwave oven recall. The machines were arcing, producing electrical sparks, which caused the food inside to smoke.

Mr. Welke decided to pore through records of recent customer calls by searching for the words "arcing" and "smoke." His team found 18,500 records that matched. Six people then spent a weekend reading the results, eventually coming up with 700 calls from customers potentially related to the problem.

As a comparison, Mr. Welke then ran the same records through a program from Attensity, which had recently paid him a sales call.

"It could tell if the microwave was smoking or if the chicken was smoking hot or if the customer was eating smoked chicken," he said. "It came up with 542 instances in about 10 seconds."

Whirlpool is now spending a quarter-million dollars a year on Attensity's expertise, joining companies like John Deere, General Motors and Honeywell as Attensity customers.

But wringing profits out of unstructured data for corporate America is only about 40 percent of the software maker's business. The rest is in government work, for groups like the Federal Bureau of Investigation, the National Security Agency and the Defense Intelligence Agency.

The software helps federal researchers look for clues to terrorist and criminal activities in "the text from the dispatches from around the world, the field reports, the newspaper articles and the chat rooms," said David L. Bean, Attensity's co-founder.

"The intelligence community has plenty of systems for doing six degrees of separation, for putting two and two together," Dr. Bean said. "But they need structured data in order to do it. We give them that structure."

The intelligence agencies declined to discuss whether they use the software. But Kris Alexander, a former intelligence analyst for the United States military's Central Command, noted that "putting unstructured information into anything that would organize it would be very helpful."

"We have guys who can crack hard drives," Mr. Alexander said. "Getting the information out is easy. The hard part is sharing it, and organizing it, so that everybody in an agency, even nonexperts, can use it."

Attensity's algorithms can do more than get a document ready to categorize, however. The software ferrets out meaning in sentences as they are being diagramed. If the word "purchase" is used as a verb, the person doing the buying is tagged as a possible customer. If the phrase "plastic explosive" is used as an object, the subject is labeled as a potential enemy.

For now, though, Attensity works only with English. That is a weakness the company's competitors in the world of structuring data are quick to point out.

Inxight Software, of Sunnyvale, Calif., for example, produces software that turns grammatical relationships into mathematical formulas, allowing it to parse documents in 31 languages.

Intelliseek, of Cincinnati, plucks entities - proper names and places - from blog entries as a way to categorize them. The company's software will also characterize a document as positive or negative based on the words it contains.

Oracle and the other major database makers also build in some limited functions for extracting information from unstructured texts. But those systems usually rely on the person using it teaching the algorithms what they need to know - that in a legal document, for example, "sued" and "filed charges" are rough equivalents. With Attensity's software, that kind of instruction is often unnecessary.

"Attensity shows how the words all relate to one another - all the actors, objects and actions in a document, and how they connect," said Gayle von Eckartsberg, a spokeswoman for In-Q-Tel, which also provides financing for Inxight and Intelliseek.

Perfect sentences are not required for the software to work, said Dr. Bean, the son of a high school English teacher. Instead of using strict grammar laws, Attensity relies on constantly reapplying heuristics - rules of thumb - to sort out subject from object.

Dangling participles, misspelled words and grammar-mangling slang can all be handled, allowing Attensity to crunch Internet relay chats, instant messenger conversations and other King's English refugees as easily as it would parse a textbook.

But that does not mean, Dr. Bean added, that students should stop doing their grammar homework or paying attention in school.

--
-----------------
R. A. Hettinga <mailto: rah@ibuc.com>
The Internet Bearer Underwriting Corporation <http://www.ibuc.com/>
44 Farquhar Street, Boston, MA 02131 USA
"... however it may deserve respect for its usefulness and antiquity,
[predicting the end of the world] has not been found agreeable to
experience." -- Edward Gibbon, 'Decline and Fall of the Roman Empire'