
Ben Holiday wrote:
If you have access to a shell, and to the news spool, you can generate some quick lists by hopping into the directory of any newsgroup that interests you and doing:
cat * | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq > my-big-ol-wordlist
With most unixes that will generate an alphabetized list of all the unique words in your source text, converted to lowercase. I've had some problems with tr on a few machines, however. Adding '-c' after 'uniq' will tell you how many times each word occurred (useful for grepping out words that appear too infrequently or too frequently).
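A rough sketch of that frequency filtering, for anyone who wants it spelled out. The function names (word_counts, filter_by_count) and the thresholds are made up for illustration, and awk stands in for grep here because comparing counts numerically is easier that way. It assumes a tr that accepts plain A-Z ranges, as the pipeline above does.

```shell
# word_counts: same pipeline as above, but with 'uniq -c' so every
# output line has the form "<count> <word>".
# With no file arguments it reads standard input.
word_counts() {
    cat "$@" | tr -cs 'A-Za-z' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c
}

# filter_by_count MIN MAX: read "<count> <word>" lines on stdin and print
# only the words whose count lies between MIN and MAX inclusive.
# The '$2 != ""' check skips the empty "word" that appears when the
# input starts with a non-letter.
filter_by_count() {
    awk -v min="$1" -v max="$2" \
        '$2 != "" && $1 >= min && $1 <= max { print $2 }'
}
```

From inside a newsgroup directory, something like "word_counts * | filter_by_count 2 500 > ../my-filtered-wordlist" would keep words seen at least twice and at most 500 times. Writing the output outside the directory keeps it from being picked up by the '*' on a later run.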
Actually I am fairly sure that your selection of words will be mediocre at best. There are words (such as nethermost, insatiable, insufferable) that are almost never used in news. - Igor.