Article 8670 of comp.lang.perl:
Xref: feenix.metronet.com comp.lang.perl:8670
Path: feenix.metronet.com!news.utdallas.edu!hermes.chpc.utexas.edu!cs.utexas.edu!howland.reston.ans.net!newsserver.jvnc.net!yale.edu!yale!zip.eecs.umich.edu!not-for-mail
From: seraphim@umcc.umcc.umich.edu (Henry Hardy)
Newsgroups: comp.lang.perl
Subject: Zipfian perl script -- help to improve
Date: 6 Dec 1993 14:28:12 -0500
Organization: none
Lines: 22
Message-ID: <2e014c$6un@umcc.umcc.umich.edu>
NNTP-Posting-Host: umcc.umcc.umich.edu
Summary: program to count number of occurrences of words in a text
Keywords: word rank order frequency analysis script Zipf linguistics cryptography Perl sed awk

Here's a one-line script I wrote to do word rank-order frequency analysis on
a text (messages from sci.physics).  I am doing a "Zipfian analysis" after
the work of George Zipf.  The script takes a file called infile and writes
each word's number of occurrences to outfile, sorted in ascending order of
count (alphabetical within each count).
Here is the script:

perl -ne 'print map("$_\n", split);' infile | sort | uniq -c | sort -n > outfile

Now, since I have never used perl before, I need a bit of guidance to improve
this thing.

1)	Downcase all capital letters to minuscule (i.e. 'uncapitalize' words,
	acronyms etc.), so that 'The' and 'the' etc. are collapsed together.

2)	Break words on all non-alphanumeric characters, not just whitespace.

If someone can come up with an elegant way of doing these (or even an
inelegant way) in sed, awk, or perl, please respond.  Thanks!

--HH.

seraphim@umcc.umich.edu


