With Terror in Mind, a Formulaic Way to Parse Sentences

R.A. Hettinga rah at shipwright.com
Thu Mar 3 14:57:28 PST 2005


<http://www.nytimes.com/2005/03/03/technology/circuits/03next.html?8cir=&pagewanted=print&position=>

The New York Times

March 3, 2005
WHAT'S NEXT

With Terror in Mind, a Formulaic Way to Parse Sentences
 By NOAH SHACHTMAN


MAYBE sixth-grade English was more helpful than you thought. One of the
dullest grammar exercises is being used to help find potential terrorists,
and save companies a bundle.

Diagramming sentences - picking out subject, verb, object, adjective and
other parts of speech - has been a staple of middle and high school grammar
lessons for decades. Now, with financing from the Central Intelligence
Agency, a California firm is using the technique to comb through e-mail
messages and chat room talks, which can be a rich lode of corporate and
government information, and a tough one to mine.

 Figuring out the connections among people, places and things is something
computer algorithms do pretty well, as long as that information is
structured, or categorized and put into a database. Looking through a
company's customer file for a person named Bonds, for example, is fairly
simple.

 But if the data is unstructured - if the word "bonds" hasn't been
classified as the name of a ballplayer or as an investment option -
searching becomes much more difficult.
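
A minimal sketch of that contrast, using made-up records (the field names and sample notes are illustrative assumptions, not data from the article): a lookup on a labeled field returns exactly the person, while a keyword scan over free text cannot tell the ballplayer from the investment.

```python
# Hypothetical structured customer file: every value sits in a labeled field.
customers = [
    {"last_name": "Bonds", "city": "San Francisco"},
    {"last_name": "Smith", "city": "Palo Alto"},
]

# Hypothetical unstructured notes: the word "bonds" carries no label.
notes = [
    "Bonds hit another home run last night.",
    "Clients are moving money from stocks into bonds.",
]

# Structured search: match the surname field directly.
structured_hits = [c for c in customers if c["last_name"] == "Bonds"]

# Unstructured search: a plain keyword scan matches both senses of the word.
unstructured_hits = [n for n in notes if "bonds" in n.lower()]

print(len(structured_hits), "structured hit(s)")
print(len(unstructured_hits), "unstructured hit(s), senses mixed")
```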

 For people in business or in public service, only 20 percent or so of
their information is kept in formal databases, noted Nick Patience, an
analyst with the 451 Group, a technology research firm. The rest is
unstructured, tucked away in e-mail messages, call logs, memos and instant
messages.

 Attensity, based in Palo Alto, Calif., and financed in part by In-Q-Tel,
the C.I.A.'s investment arm, has developed a method to parse electronic
documents almost instantly, and diagram all of the sentences inside.
("Moby-Dick," for instance, took all of nine and a half seconds.) By
labeling subjects and verbs and other parts of speech, Attensity's software
gives the documents a definable structure, a way to fit into a database.
And that helps turn day-to-day chatter into information that is relevant
and usable.

 "They take the language that people use every day and compile it in a way
that a machine can use," Mr. Patience said. "And that allows people to
start using this tremendous amount of intelligence which has gone untapped."

Whirlpool, the home appliance manufacturer, is now using Attensity software
to help cull information from the 400,000 customer service calls the
company receives each month.

 Tom Welke, a Whirlpool general manager, said the company realized it
needed help in March 2002, during a microwave oven recall. The machines
were arcing, producing electrical sparks, which caused the food inside to
smoke.

 Mr. Welke decided to pore through records of recent customer calls by
searching for the words "arcing" and "smoke." His team found 18,500 records
that matched. Six people then spent a weekend reading the results,
eventually coming up with 700 calls from customers potentially related to
the problem.

 As a comparison, Mr. Welke then ran the same records through a program
from Attensity, which had recently paid him a sales call.

"It could tell if the microwave was smoking or if the chicken was smoking
hot or if the customer was eating smoked chicken," he said. "It came up
with 542 instances in about 10 seconds."
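
The difference Mr. Welke describes can be sketched as follows. The subject and verb annotations here are written by hand for illustration; they stand in for whatever a parser would produce and are not Attensity output.

```python
# Hypothetical call records. "subject" and "verb" are hand-written stand-ins
# for the grammatical roles a sentence parser would assign.
records = [
    {"text": "The microwave was smoking after ten seconds.",
     "subject": "microwave", "verb": "smoke"},
    {"text": "The chicken came out smoking hot.",
     "subject": "chicken", "verb": "come"},
    {"text": "I was eating smoked chicken when the timer broke.",
     "subject": "customer", "verb": "eat"},
]

def keyword_hits(rows, stem):
    """Return every record whose raw text contains the stem."""
    return [r for r in rows if stem in r["text"].lower()]

def role_hits(rows, subject, verb):
    """Return only records where the given subject performs the given verb."""
    return [r for r in rows if r["subject"] == subject and r["verb"] == verb]

by_keyword = keyword_hits(records, "smok")          # matches all three calls
by_role = role_hits(records, "microwave", "smoke")  # matches only the real fault

print(len(by_keyword), "keyword match(es)")
print(len(by_role), "role-based match(es)")
```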

Whirlpool is now spending a quarter-million dollars a year on Attensity's
expertise, joining companies like John Deere, General Motors and
Honeywell as Attensity customers. But wringing profits out of unstructured
data for corporate America is only about 40 percent of the software maker's
business. The rest is in government work, for groups like the Federal
Bureau of Investigation, the National Security Agency and the Defense
Intelligence Agency.

 The software helps federal researchers look for clues to terrorist and
criminal activities in "the text from the dispatches from around the world,
the field reports, the newspaper articles and the chat rooms," said David
L. Bean, Attensity's co-founder.

 "The intelligence community has plenty of systems for doing six degrees of
separation, for putting two and two together," Dr. Bean said. "But they
need structured data in order to do it. We give them that structure."

 The intelligence agencies declined to discuss whether they use the
software. But Kris Alexander, a former intelligence analyst for the United
States military's Central Command, noted that "putting unstructured
information into anything that would organize it would be very helpful."

 "We have guys who can crack hard drives," Mr. Alexander said. "Getting the
information out is easy. The hard part is sharing it, and organizing it, so
that everybody in an agency, even nonexperts, can use it."

Attensity's algorithms can do more than get a document ready to categorize,
however. The software ferrets out meaning in sentences as they are being
diagramed. If the word "purchase" is used as a verb, the person doing the
buying is tagged as a possible customer. If the phrase "plastic explosive"
is used as an object, the subject is labeled as a potential enemy.
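
The two rules just described can be sketched over already-parsed sentences. This is an illustration only: the Triple type, the tag_subject function and the sample sentences are assumptions made here, and the article does not describe how Attensity actually implements its rules.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """A diagramed sentence reduced to subject, verb and object (hypothetical type)."""
    subject: str
    verb: str
    obj: str

# Inflected forms are listed by hand; a real system would normalize verbs itself.
PURCHASE_VERBS = {"purchase", "purchases", "purchased"}

def tag_subject(triple):
    """Apply the article's two example rules and return tags for the subject."""
    tags = []
    if triple.verb.lower() in PURCHASE_VERBS:
        tags.append("possible customer")
    if "plastic explosive" in triple.obj.lower():
        tags.append("potential enemy")
    return tags

triples = [
    Triple("Alice", "purchased", "a microwave oven"),
    Triple("Bob", "requested", "plastic explosive"),
    Triple("Carol", "returned", "a blender"),
]

labels = {t.subject: tag_subject(t) for t in triples}
for name, tags in labels.items():
    print(name, tags)
```

Because the rules read grammatical roles rather than raw keywords, a sentence that merely mentions a purchase by someone else would not tag the wrong person.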

For now, though, Attensity works only with English. That is a weakness the
company's competitors in the world of structuring data are quick to point
out.

 Inxight Software, of Sunnyvale, Calif., for example, produces software
that turns grammatical relationships into mathematical formulas, allowing
it to parse documents in 31 languages. Intelliseek, of Cincinnati, plucks
entities - proper names and places - from blog entries as a way to
categorize them. The company's software will also characterize a document
as positive or negative based on the words it contains.
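
Word-based scoring of that kind can be sketched as a simple count. The word lists and the polarity function are invented for illustration and say nothing about how Intelliseek's software works internally.

```python
import re

# Tiny hand-picked word lists, assumed for this sketch only.
POSITIVE = {"good", "great", "love", "reliable"}
NEGATIVE = {"bad", "broken", "hate", "awful"}

def polarity(text):
    """Label text by counting positive words minus negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sample in ("I love this reliable oven",
               "The door is broken and I hate it",
               "It arrived on Tuesday"):
    print(polarity(sample), "<-", sample)
```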

 Oracle and the other major database makers also build in some limited
functions for extracting information from unstructured texts. But those
systems usually rely on the person using them to teach the algorithms what
they need to know - that in a legal document, for example, "sued" and
"filed charges" are rough equivalents.
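
That sort of hand-taught equivalence can be sketched as a lookup table the user maintains. The table, the LEGAL_ACTION label and the concepts_in function are hypothetical and do not correspond to any database vendor's actual API.

```python
import re

# User-supplied equivalences: each phrase maps to one shared concept label.
EQUIVALENTS = {
    "sued": "LEGAL_ACTION",
    "filed charges": "LEGAL_ACTION",
}

def concepts_in(text):
    """Return the sorted concept labels whose phrases appear as whole words."""
    lowered = text.lower()
    found = set()
    for phrase, concept in EQUIVALENTS.items():
        # Word boundaries keep "sued" from matching inside "pursued".
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(concept)
    return sorted(found)

print(concepts_in("The tenant sued the landlord."))
print(concepts_in("The state filed charges on Monday."))
print(concepts_in("Police pursued the suspect."))
```

Every new phrase has to be added to the table by hand, which is the burden the following paragraph says Attensity's approach often avoids.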

 With Attensity's software, that kind of instruction is often unnecessary.
"Attensity shows how the words all relate to one another - all the actors,
objects and actions in a document, and how they connect," said Gayle von
Eckartsberg, a spokeswoman for In-Q-Tel, which also provides financing for
Inxight and Intelliseek.

 Perfect sentences are not required for the software to work, said Dr.
Bean, the son of a high school English teacher. Instead of using strict
grammar laws, Attensity relies on constantly reapplying heuristics - rules
of thumb - to sort out subject from object. Dangling participles,
misspelled words and grammar-mangling slang can all be handled, allowing
Attensity to crunch Internet relay chats, instant messenger conversations
and other King's English refugees as easily as it would parse a textbook.

But that does not mean, Dr. Bean added, that students should stop doing
their grammar homework or paying attention in school.

-- 
-----------------
R. A. Hettinga <mailto: rah at ibuc.com>
The Internet Bearer Underwriting Corporation <http://www.ibuc.com/>
44 Farquhar Street, Boston, MA 02131 USA
"... however it may deserve respect for its usefulness and antiquity,
[predicting the end of the world] has not been found agreeable to
experience." -- Edward Gibbon, 'Decline and Fall of the Roman Empire'




