Re: Stylometry

To: [email protected]
Subject: Re: Stylometry
From: Randall Farmer <[email protected]>
Date: Tue, 18 Nov 1997 21:59:43 -0600 (CST)
cc: [email protected]
Reply-To: Randall Farmer <[email protected]>
Sender: [email protected]


Here (here being at the bottom of the message :) is the code for the stylometry
program. Note that I said the stylometry also involves a calculator -- that's
because the shell script only processes your text to pull out the numbers you
need; the tough part is still up to you.

After it runs, you have 

A. a file, ./counts, containing word counts like so. (The first line, a bare
number with no word after it, is the ever-present quirk: it counts the empty
lines the sed command leaves behind, which occurs because I have yet to master
sed.)

1689 
 550 THE
 344 AND
 316 TO
...

and B. Output to the screen, like so:

[wc/uwc]
      1738     12561     77775 <-- Lines/words/bytes for the original file
      2557      5113     31226 <-- First part is the number of *different*
                                   words used in the document, ignore the rest.
[word counts] <-- A juicy excerpt from the counts file
 550 THE
 344 AND
 316 TO
 271 A
 195 OF
[punc frequency: comma/period/hyphen/quote/semi]
584 <-- Number of commas
1536 <-- Periods
79 <-- Dashes
315 <-- Double-quote marks
10 <-- Semicolons
[and/or/but as sentence-splitters]
24 <-- Occurrences of "and," (including comma -- that's the point)
12 <-- "or,"
7 <-- "but,"

There are too many things you can calculate from this output for me to
enumerate, although the ratios of words to periods, commas, semicolons, and
conjunctions used as sentence splitters are rather useful. Compare two or three
of a known author's documents to find his/her characteristics, then compare
those to your unknown and see if you've got a match.
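
For the calculator step, here is a minimal sketch of the ratio arithmetic in
shell, using figures from the sample output above (12561 words, 584 commas,
10 semicolons, 7 occurrences of "but,"). The ratio helper and its name are an
assumption of this sketch, not part of the prep script; it only divides two
numbers with awk.

```shell
#!/bin/sh
# ratio: print NUM/DEN to two decimal places, or "n/a" when DEN is zero.
# (Hypothetical helper for illustration; not part of the prep script.)
ratio() {
    awk -v n="$1" -v d="$2" \
        'BEGIN { if (d == 0) print "n/a"; else printf "%.2f\n", n / d }'
}

# Figures copied from the sample output: total words, then punctuation counts.
words=12561
echo "words per comma:     $(ratio "$words" 584)"
echo "words per semicolon: $(ratio "$words" 10)"
echo "words per \"but,\":    $(ratio "$words" 7)"
```

Run the same thing over each known document and over the unknown, then put
the ratios side by side.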

[Note that the whole sed mess is supposed to be one line]

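#!/bin/sh
# prep: Prepares a text for analysis
# Usage: prep FILE

# Uppercase everything, turn every character that isn't a letter or an
# apostrophe into a space, squeeze runs of spaces, then put one word per
# line and count the duplicates.
sed "y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/;s/[^A-Z']/ /g;s/  */ /g;y/ /\n/" <"$1" | sort | uniq -c | sort -rn >./counts

echo "[wc/uwc]"
wc <"$1"
wc <./counts
echo "[word counts]"
grep -wi -e "the" -e "and" -e "to" -e "a" -e "of" <./counts
echo "[punc frequency: comma/period/hyphen/quote/semi]"
# tr -cd deletes everything except the named character, so wc -c counts
# occurrences. grep -c would count matching lines instead, and an
# unescaped "." matches any character at all.
tr -cd ',' <"$1" | wc -c
tr -cd '.' <"$1" | wc -c
tr -cd '-' <"$1" | wc -c
tr -cd '"' <"$1" | wc -c
tr -cd ';' <"$1" | wc -c
echo "[and/or/but as sentence-splitters]"
# -o prints each match on its own line (supported by GNU and BSD grep);
# -w keeps "band," and "for," from counting, -i catches "And,".
grep -owi "and," <"$1" | wc -l
grep -owi "or," <"$1" | wc -l
grep -owi "but," <"$1" | wc -l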
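
The bare number at the top of ./counts comes from empty lines reaching uniq.
If that bothers you, here is a sketch of one way around it, assuming a tr that
pads the second set (GNU and BSD tr both do). The wordcounts name is made up
for this example, and this is an alternative to the sed line, not the method
the script uses.

```shell
#!/bin/sh
# wordcounts: read text on stdin, print "count WORD" lines, most frequent
# first. (Hypothetical function name, for illustration only.)
wordcounts() {
    tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' |
        tr -cs "A-Z'" '\n' |   # each run of other characters becomes one newline
        grep -v '^$' |         # drop the empty line a leading separator leaves
        sort | uniq -c | sort -rn
}
```

Used as: wordcounts < sometext > ./counts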

---------------------------------------------------------------------------
Randall Farmer
    [email protected]
    http://hiwaay.net/~rfarmer

Follow-Ups:
- Doh! (Re: Stylometry)
  - From: Randall Farmer <[email protected]>
