On Mon, Apr 20, 2015 at 10:20:28AM -0400, grarpamp wrote:
> Some memex bits now open sourced...
> http://www.forbes.com/sites/thomasbrewster/2015/04/17/darpa-nasa-and-partner...
> TJBatchExtractor is what’s going open source today. It allows a user to extract data, such as a name, organisation or location, from advertisements.

this sounds interesting. so far there was OpenCalais from reuters, which did this, but only as a centralized service, if gratis, or you could build your own corpora if your domain is not covered by the widely available ones.

however, there are lots of problems with non-english names. to evaluate such entity extractors i recommend testing them with a data set of eu public officials, with names in greek, bulgarian, some latin-speaking country and some slavic-speaking one. that gives you something that can confuse such entity extraction quite sufficiently.

i guess i'm gonna give this a test, maybe it's better. but i guess this again mostly depends on the corpus.

-- 
otr fp: https://www.ctrlc.hu/~stef/otr.txt
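
p.s. a minimal sketch of the kind of test i mean, in python. everything in it is an assumption for illustration: the sentences and names are made-up placeholders (not real officials), `ascii_baseline` is a deliberately naive stand-in extractor (it is not TJBatchExtractor or OpenCalais), and `evaluate` is just a hypothetical helper that counts exact matches. to test a real extractor you would swap in a function that takes a string and returns a set of names.

```python
import re
import unicodedata

# Made-up placeholder sentences with the names we expect to be found.
SAMPLES = [
    ("Statement by Νίκος Παπαδόπουλος on fisheries.", {"Νίκος Παπαδόπουλος"}),
    ("Иван Петров chaired the session.", {"Иван Петров"}),
    ("A reply from María García followed.", {"María García"}),
    ("Questions went to Jiří Novák.", {"Jiří Novák"}),
    ("John Smith took notes.", {"John Smith"}),
]

def ascii_baseline(text):
    """Naive stand-in extractor: runs of two or more capitalised ASCII words."""
    return set(re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", text))

def normalise(name):
    """Use one Unicode form so composed and decomposed accents compare equal."""
    return unicodedata.normalize("NFC", name).strip()

def evaluate(extractor, samples):
    """Return (precision, recall) over exact name matches."""
    tp = fp = fn = 0
    for text, gold in samples:
        found = {normalise(n) for n in extractor(text)}
        gold = {normalise(n) for n in gold}
        tp += len(found & gold)   # names found and expected
        fp += len(found - gold)   # names found but not expected
        fn += len(gold - found)   # names expected but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    p, r = evaluate(ascii_baseline, SAMPLES)
    print(f"precision={p:.2f} recall={r:.2f}")
```

the ascii-only baseline finds just the english name here, so recall collapses while precision still looks fine. that is the sort of gap i would look for in a real extractor before trusting it on non-english data.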