Here (here being at the bottom of the message :) is the code for the stylometry program. Note that I specified that the stylometry also involved a calculator -- that's because the shell script only processes your data to get the numbers you need out; the tough part is still up to you.

After it runs, you have:

A. A file, ./counts, containing word counts like so. (The first line is the ever-present quirk, which occurs because I have yet to master sed.)

       1689
        550 THE
        344 AND
        316 TO
        ...

B. Output to the screen, like so:

    [wc/uwc]
       1738   12561   77775   <-- Lines/words/bytes for the original file
       2557    5113   31226   <-- First part is the number of *different* words used in the document, ignore the rest.
    [word counts]             <-- A juicy excerpt from the counts file
        550 THE
        344 AND
        316 TO
        271 A
        195 OF
    [punc frequency: comma/period/hyphen/quote/semi]
    584                       <-- Number of commas
    1536                      <-- Periods
    79                        <-- Dashes
    315                       <-- Double-quote marks
    10                        <-- Semicolons
    [and/or/but as sentence-splitters]
    24                        <-- Occurrences of "and," (including comma -- that's the point)
    12                        <-- "or,"
    7                         <-- "but,"

There are too many things you can calculate from this output for me to enumerate, although the ratios of words to periods, commas, semicolons, and conjunctions as sentence splitters are rather useful: compare two or three of a known author's documents to find his/her characteristics, then compare that to your unknown and see if you've got a match.
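The calculator step described above could be handed to awk. What follows is only a minimal sketch, not part of the original program: the function names `count_char`, `ratio`, and `stylo_ratios` are made up for illustration, and it counts individual punctuation marks with `tr`, whereas `grep -c` reports the number of lines containing a mark.

```shell
#!/bin/sh
# Sketch only: words-per-mark ratios for one text file.
# count_char, ratio and stylo_ratios are hypothetical helper names.

# Count every occurrence of a single character in a file
# (tr -cd deletes everything except that character).
count_char() {
    tr -cd "$1" < "$2" | wc -c | tr -d ' '
}

# Print "<label>: <words/marks> words per mark", guarding against division by zero.
# $1 = label, $2 = word count, $3 = mark count
ratio() {
    awk -v label="$1" -v w="$2" -v n="$3" 'BEGIN {
        if (n == 0) printf "%s: none found\n", label
        else        printf "%s: %.1f words per mark\n", label, w / n
    }'
}

# Print the comma, period and semicolon ratios for the file named in $1.
stylo_ratios() {
    words=$(wc -w < "$1" | tr -d ' ')
    ratio "comma"     "$words" "$(count_char ',' "$1")"
    ratio "period"    "$words" "$(count_char '.' "$1")"
    ratio "semicolon" "$words" "$(count_char ';' "$1")"
}

# Run on the file given as the first argument, if any.
if [ "$#" -gt 0 ]; then
    stylo_ratios "$1"
fi
```

Fed the sample figures above by hand, `ratio comma 12561 584` prints `comma: 21.5 words per mark`. Run the same thing over two or three texts by a known author and over the unknown text, then compare the lines side by side.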
[Note that the whole sed mess is supposed to be one line]

    #!/bin/sh
    # prep: Prepares a text for analysis
    # Uppercase everything, turn non-word characters into spaces, collapse runs
    # of spaces, put one word per line, then count and rank the words.
    sed "y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/;s/[^A-Z']/ /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;s/  / /g;y/ /\n/;"<$1|sort|uniq -c|sort -rn>./counts
    echo [wc/uwc]
    wc<$1
    wc<./counts
    echo [word counts]
    grep -wie "the" -e "and" -e "to" -e "a" -e "of" < counts
    echo [punc frequency: comma/period/hyphen/quote/semi]
    # (grep -c counts the lines that contain a match, not the individual marks)
    grep -c ","<$1
    # (an unescaped "." matches any character, so this counts non-empty lines;
    #  "\." would match only literal periods)
    grep -c "."<$1
    grep -c "-"<$1
    grep -c \"<$1
    grep -c "\;"<$1
    echo [and/or/but as sentence-splitters]
    grep -c "and,"<$1
    grep -c "or,"<$1
    grep -c "but,"<$1

---------------------------------------------------------------------------
Randall Farmer
rfarmer@hiwaay.net
http://hiwaay.net/~rfarmer
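One possible way around the quirky first line of ./counts is to let `tr` do the splitting instead of the long sed chain. This is a sketch of that alternative rather than the original method; `word_counts` is a hypothetical name, and the alphabet is spelled out so the result does not depend on how the locale treats ranges.

```shell
#!/bin/sh
# Sketch only: word-frequency table built with tr instead of sed.
# word_counts is a hypothetical name, not part of the original script.
#
# Steps in the pipeline:
#   1. uppercase the text
#   2. turn every run of characters that are not letters or apostrophes
#      into a single newline, which leaves one word per line
#   3. drop any empty line (one can appear if the text starts with punctuation)
#   4. count identical words and sort by descending count
word_counts() {
    tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' < "$1" |
        tr -cs "ABCDEFGHIJKLMNOPQRSTUVWXYZ'" '\n' |
        grep . |
        sort | uniq -c | sort -rn
}

# Run on the file given as the first argument, if any.
if [ "$#" -gt 0 ]; then
    word_counts "$1"
fi
```

Redirecting the output to ./counts gives a file in the same count-then-word layout, minus the line that counts empty entries.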
There was a significant glitch in the version of the stylometry aid I posted...read on if you care. Also, does anyone know where I can find some *real* stylometry programs (i.e., ones that do the math, etc.)?

===============================================================================

I added the conjunctions-as-sentence-splitters check after originally writing the message with the program in it, and messed it up while updating the message to match. Once you make the changes below, it should catch most of the intended conjunctions (not all of them -- specifically, not the ones with the comma on a different line from the conjunction).
    echo [and/or/but as sentence-splitters]
    grep -c "and,"<$1
            ^^^^^^ Should be ", and"
    grep -c "or,"<$1
            ^^^^^ Should be ", or"
    grep -c "but,"<$1
            ^^^^^^ Should be ", but"
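To also catch the cases where the comma and the conjunction sit on different lines, one option is to join the lines before counting. The sketch below is an assumption-laden alternative, not a drop-in for the corrected grep lines: `count_splitter` is a hypothetical name, it counts every occurrence rather than matching lines, it ignores case, and it requires a non-letter after the conjunction (so ", android" is not counted as ", and").

```shell
#!/bin/sh
# Sketch only: count ", and" / ", or" / ", but" even when a line break
# separates the comma from the conjunction.
# count_splitter is a hypothetical name, not part of the original script.
# $1 = conjunction (lowercase), $2 = file
count_splitter() {
    awk -v conj="$1" '
        # Join all lines into one lowercased string, separated by single spaces.
        { text = text " " tolower($0) }
        END {
            # Trailing space so a conjunction at the very end can still match.
            text = text " "
            # Comma, optional blanks, the conjunction, then any non-letter.
            # gsub returns the number of replacements; "&" leaves the text unchanged.
            n = gsub(",[ \t]*" conj "[^a-z]", "&", text)
            print n + 0
        }' "$2"
}

# Run on the file given as the first argument, if any.
if [ "$#" -gt 0 ]; then
    for c in and or but; do
        count_splitter "$c" "$1"
    done
fi
```

Because these are occurrence counts, they will usually come out a little higher than the line counts from `grep -c`.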
---------------------------------------------------------------------------
Randall Farmer
rfarmer@hiwaay.net
http://hiwaay.net/~rfarmer