Re: Re: Memex Oil Gush

20 Apr 2015


On Mon, Apr 20, 2015 at 10:20:28AM -0400, grarpamp wrote:
...
Some memex bits now open sourced...
http://www.forbes.com/sites/thomasbrewster/2015/04/17/darpa-nasa-and-partner...
...
TJBatchExtractor is what’s going open source today. It allows a user to
extract data, such as a name, organisation or location, from advertisements.
this sounds interesting. so far there was opencalais from reuters, which did
this, but only as a centralized service, if gratis, or you could build your
own corpora if your domain is not covered by the widely available ones.
however, there are lots of problems with non-english names. for evaluating
such entity extractors i recommend testing them with a data set containing
eu public officials, with names in greek, bulgarian and some latin-speaking
country and some slavic-speaking one, and you have something that can confuse
such entity extraction quite sufficiently. i guess i'm gonna give this a test,
maybe it's better. but i guess this again mostly depends on the corpus.
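
a rough sketch of the kind of test i mean, in python. everything here is an
assumption for illustration: the names are made-up placeholders (not real
officials), `evaluate` and `naive_ascii_extractor` are hypothetical helpers,
and nothing below is TJBatchExtractor's or opencalais' actual interface. to
test a real extractor you would wrap it in a function that takes a string and
returns the person names it found.

```python
import re
from typing import Callable, Iterable, Set, Tuple

# Made-up placeholder sentences with the names we expect to be found.
# Replace these with a real, hand-labelled data set before drawing conclusions.
SAMPLES = [
    ("John Smith signed the report.", {"John Smith"}),
    ("Ο Γιώργος Παπαδόπουλος μίλησε χθες.", {"Γιώργος Παπαδόπουλος"}),
    ("Иван Петров подписа документа.", {"Иван Петров"}),
    ("João Silva votou contra.", {"João Silva"}),
    ("Jiří Novák přijel do Bruselu.", {"Jiří Novák"}),
]


def naive_ascii_extractor(text: str) -> Iterable[str]:
    """Deliberately naive stand-in: runs of two or more capitalised ASCII words."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", text)


def evaluate(
    extract: Callable[[str], Iterable[str]],
    samples: Iterable[Tuple[str, Set[str]]],
) -> Tuple[float, float]:
    """Return (precision, recall) of `extract` over exact-match name sets."""
    tp = fp = fn = 0
    for text, expected in samples:
        found = set(extract(text))
        tp += len(found & expected)   # names correctly extracted
        fp += len(found - expected)   # extracted but not expected
        fn += len(expected - found)   # expected but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    precision, recall = evaluate(naive_ascii_extractor, SAMPLES)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

the naive stand-in only catches the plain ascii name and misses the greek,
cyrillic and accented ones, which is exactly the failure mode such a data set
is meant to expose. exact string matching is a crude scoring choice; a real
evaluation would probably also want to count partial matches.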

-- 
otr fp: https://www.ctrlc.hu/~stef/otr.txt