Article 8979 of comp.lang.perl:
Xref: feenix.metronet.com sci.lang:5114 comp.lang.perl:8979 comp.programming:4409
Newsgroups: sci.lang,comp.lang.perl,comp.programming
Path: feenix.metronet.com!news.ecn.bgu.edu!mp.cs.niu.edu!vixen.cso.uiuc.edu!uwm.edu!math.ohio-state.edu!darwin.sura.net!europa.eng.gtefsd.com!library.ucla.edu!agate!boulder!wraeththu.cs.colorado.edu!tchrist
From: tchrist@wraeththu.cs.colorado.edu (Tom Christiansen)
Subject: soundex questions
Message-ID: <CI6wA1.1nC@Colorado.EDU>
Sender: news@Colorado.EDU (USENET News System)
Organization: University of Colorado, Boulder
Date: Fri, 17 Dec 1993 17:22:49 GMT
Lines: 166

I have been thinking of soundex matching, except I know nearly nothing
about it.  I do have a function someone posted once that attempts to deal
it.  I'm not sure whether the algorithm here is "right" or not.  The
algorithm is:

# return the Soundex value of a string using the following rules:
#
#   1) remove W and H
#   2) remove all vowels except in the first position (A E I O U Y)
#   3) recode characters per table:
#           A E I O U Y             0
#           B F P V                 1
#           C G J K Q S X Z         2
#           D T                     3
#           L                       4
#           M N                     5
#           R                       6
#
#   4) if two adjacent digits are now identical, remove one
#   5) truncate to six digits or pad out the result with zeroes to
#   make six digits  
#   6) replace the first digit with the first character from the
#   original word 

Here's what it shows for some sample misspellings

    S52350  sunstem
    S52350  sunstom
    Z52350  zonstem
    Z52350  zonstum
    S52365  sonsterm
    S52365  sonstrom
    S52365  sonstromb
    S52365  sonstromm
    S52365  sunstorm
    S52365  sunstromb
    S52365  sunstrum
    Z52365  zonstorm
    S53236  sondstrom
    S53236  sondstrum
    S53236  soundstorm
    S53236  soundstromboner
    S53236  sundstrom

Have you ever played with soundex?   What might one do with these?
Well, you should be able to look up hits that are close to you
numerically and suggest them as possible alternatives.  It would
take a different database format of course, but that's ok.

The problem is that it's not too smart.  
Some questions/issues:

1.  What does it only produce with a six-characters return key?  

2.  Why doesn't it collapse the initial character as well (S and Z, P
    and B, etc).   

3.  Some of consonant clusters could stand being munged up a bit, like
    -mb -nd, etc.

5.  Maybe vowels should have their own series?  

    1.  Y -> I
	W -> U
    2.  Collapse duplicates
    3.  Score remaining vowel clusters into two or three sets,
	based on open/closed:
	    O U OU EU UE AU EAU
	    I E EI IE AE EA AI 
	The problem with A is "father", "cat", "cake".  I'd say 
	more often it's with the latter set than the former.

    I don't think "coil" and "cowl" should be so close, either.

6.  The liquids (L's and R's) in <VOWEL><L or R><CONSONANT> seem too 
    significant, R's perhaps more than L's.  "order" and "odor" are
    closer than it things.

7.  What about all the digraphs?  Do you dare think about them
    or not?  TH SH CH GN KN PH GH all come to mind.  The problem
    is that they all can give false readings in medial positions, 
    as in "cathair" versus "catheter".  Perhaps only in initial
    and final positions?  Some should know about leading 
    silent letter and throw them out (GN KN), others maps
    into single letter (PH => F), whereas others just go 
    with whatever series they would normally go in, e.g.
    TH would be in the "D T" series, SH would be in with the 
    S's, etc.  Hm... I guess that's why they throw the H's
    out?   But I don't like this:

	C30000  cot
	C23000  caught

    That might not be able to be done right, since then you 
    have to discern "draught" is closer to "raft" than it 
    is to "route", which is itself closer to "drought".

    Ug.

Code follows for people wanting to sample it.

#!/usr/bin/perl

while (<>) {
    chop;
    print &soundex($_), "\t", $_, "\n"; 
} 

# soundex.pl
# by George Armhold <armhold@dimacs.rutgers.edu> 3/22/92
# improvements by Marc Arnold <marc@mit.edu>

# return the Soundex value of a string using the following rules:
#
#   1) remove W and H
#   2) remove all vowels except in the first position (A E I O U Y)
#   3) recode characters per table:
#           A E I O U Y             0
#           B F P V                 1
#           C G J K Q S X Z         2
#           D T                     3
#           L                       4
#           M N                     5
#           R                       6
#
#   4) if two adjacent digits are now identical, remove one
#   5) truncate to six digits or pad out the result with zeroes to
#   make six digits  
#   6) replace the first digit with the first character from the
#   original word 

sub soundex {
# takes a string as an argument, and returns its soundex value

    local($pattern) = @_;

    # upper-case the pattern to normalize matches
    $pattern =~ tr/a-z/A-Z/;

    # remove all but alphanumerics, and H,W
    $pattern =~ tr/A-GI-VX-Z0-9//cd;

    # remove all vowels after 1st letter
    ## substr($pattern, 1, length($pattern)) =~ tr/AEIOUY//d;

    # save first char
    local($first) = substr($pattern, 0, 1);
   
    # replaces letters with numbers and squish identical numbers
    $pattern =~ tr/BFPVCGJKQSXZDTLMNR0-9/1111222222223345560-9/ds;

    # remove all vowels after 1st letter
    substr($pattern, 1, length($pattern)) =~ tr/AEIOUY//d;

    # replace first letter
    substr($pattern, 0, 1) = $first;

    # pad on zeroes if necessary and truncate
    substr($pattern."000000", 0, 6); 
}

1;				# because this is a require'd file
-- 
    Tom Christiansen      tchrist@cs.colorado.edu       
      "Will Hack Perl for Fine Food and Fun"
	Boulder Colorado  303-444-3212


Article 9018 of comp.lang.perl:
Xref: feenix.metronet.com sci.lang:5128 comp.lang.perl:9018 comp.programming:4420
Newsgroups: sci.lang,comp.lang.perl,comp.programming
Path: feenix.metronet.com!news.utdallas.edu!corpgate!bnrgate!bnr.co.uk!pipex!howland.reston.ans.net!spool.mu.edu!sgiblab!sgigate.sgi.com!olivea!pagesat!news.cerf.net!netlabs!lwall
From: lwall@netlabs.com (Larry Wall)
Subject: Re: soundex questions
Message-ID: <1993Dec19.041550.4229@netlabs.com>
Sender: news@netlabs.com
Nntp-Posting-Host: scalpel.netlabs.com
Organization: NetLabs, Inc.
References: <CI6wA1.1nC@Colorado.EDU> <jfhCI7wLu.MI1@netcom.com>
Date: Sun, 19 Dec 1993 04:15:50 GMT
Lines: 69

In article <jfhCI7wLu.MI1@netcom.com> jfh@netcom.com (Jack Hamilton) writes:
: tchrist@wraeththu.cs.colorado.edu (Tom Christiansen) wrote:
: 
: Well, here we were talking about you on the train just the other day, and
: Bang, you post about a subject I'm interested in.  (I decided you ought to
: look like Larry Wall and Larry Wall ought to look like you, by the way.) 

Not unless you think a cute bald viking looks like a Honda mechanic.  :-)

: >I have been thinking of soundex matching, except I know nearly nothing
: >about it.  I do have a function someone posted once that attempts to deal
: >it.  I'm not sure whether the algorithm here is "right" or not.
: 
: I don't think there is a "right" algorithm, although the one in Knuth is
: probably the "standard" algorithm. 

It's hard to claim that any algorithm is "right" for a problem in fuzzy
logic.  The basic problems with soundex is that it's trying to solve a
number of problems at once, and getting about half of the way there.
There are several sources of error in the process.

	Misperception of spoken sounds.
	Mistranscription of perceived sounds to writing.
	Inadequacy of writing to convey spoken distinctions.
	Quantization boundary effects of the algorithm.

Ideally, the computer should be taking the actual spoken sounds and
computing the distance in "speech" space to all potential matches
(I'll let the linguists argue about whether it should be etic or emic
(not to be confused with emetic :-), and if emic, how you handle
dialectic differences while doing phoneme recognition), then displaying
the list in increasing order of linguistic distance.  The soundex
algorithm has a rather crude notion of distance: it only distinguishes
"short" from "long", just like area codes in the phone system (no pun
intended) back in the days when you could get charged long distance
for calling someone across the street.

Even if you limit yourself to processing written text (this is, after
all, cross-posted to comp.lang.perl), you could probably do much better
with an approximate matching algorithm that tried not to throw so much
information away at the outset, but kept a better notion of linguistic
distance.  One thing the soundex system does do pretty good at is
regularizing the dimensionality of the linguistic space.  Perhaps
if each "chunk" of soundex data that currently turns into a byte could
instead be turned into a location in a small space of its own, then
a larger space could be constructed of all the smaller spaces.  The
question then becomes how many different kinds of small spaces you
need.  Minimally, a vowel cluster space and a consonent cluster space,
but you could differentiate word initial and word final, or use alternate
spaces depending on surrounding choices.  The phonologist in me is
starting to go nuts.  How many megabytes am I allowed to use?

: Soundex attempts to map the sound of a name to the spelling of a name, and
: how words are pronounced depends on a lot of different things.  It
: certainly depends on the language (the standard algorithm wouldn't work very
: well for French, for example) and on the regional and personal speech
: patterns of the speaker.  Proper names tend to preserve complicated
: spellings with simplified pronunciations - think of Chomondeley-Magdalen
: (which I've probably misspelled) or Leichester Square.   

The constuction of the overall space from the small spaces could
probably make some guesses about this sort of thing.  The prototypical
pronunciation of a given name could be stored in a dictionary, and
distances compared with that.

At some point it becomes more efficient to simply ask, "How do you spell that?"

[lERiy ual]
lwall@netlabs.com


Article 9021 of comp.lang.perl:
Xref: feenix.metronet.com sci.lang:5129 comp.lang.perl:9021 comp.programming:4422
Path: feenix.metronet.com!news.ecn.bgu.edu!usenet.ins.cwru.edu!howland.reston.ans.net!vixen.cso.uiuc.edu!moe.ksu.ksu.edu!cbs.ksu.ksu.edu!news
From: Steve Davis <strat@cis.ksu.edu>
Newsgroups: sci.lang,comp.lang.perl,comp.programming
Subject: Re: soundex questions
Date: 19 Dec 1993 06:13:30 -0600
Organization: Kansas State University
Lines: 72
Sender: strat@cbs.ksu.ksu.edu (Steve Davis)
Message-ID: <2f1ghaINN4dd@cbs.ksu.ksu.edu>
References: <CI6wA1.1nC@colorado.edu>
Reply-To: strat@cis.ksu.edu (Steve Davis)
NNTP-Posting-Host: cbs.ksu.ksu.edu

tchrist@wraeththu.cs.colorado.edu (Tom Christiansen) writes:

:Have you ever played with soundex?   What might one do with these?

I've played with a couple variations in writing a spelling checker.  The
algorithm I wound up using is this:  (in perl even!)

# "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# ".123.12..22455.12623.1.2.2"

for (0..255) {
    push(@map, ".");
}

$map[ord('B')] = 1; $map[ord('C')] = 2; $map[ord('D')] = 3; 
$map[ord('F')] = 1; $map[ord('G')] = 2; $map[ord('J')] = 2;
$map[ord('K')] = 2; $map[ord('L')] = 4; $map[ord('M')] = 5;
$map[ord('N')] = 5; $map[ord('P')] = 1; $map[ord('Q')] = 2;
$map[ord('R')] = 6; $map[ord('S')] = 2; $map[ord('T')] = 3;
$map[ord('V')] = 1; $map[ord('X')] = 2; $map[ord('Z')] = 2;

sub soundex {	
    y/a-z/A-Z/;			             # Map to upper case. 
    @chars = unpack("C*",$_);	             # Split up by characters.
    $first = pack("C", shift(@chars));       # Save first letter.
    $_ = join("",@map[@chars]);	             # Map to [.0-9] (see above) 
    tr/0-9//s;			             # Remove adjacent duplicates.
    s/\.//go;			             # Remove the placeholder.
    $_ = $_ ? $_ : "0";                      # "0" for no pattern.
    $soundex = "$first$_";
    return $soundex;
}


:The problem is that it's not too smart.  

Well, it was never intended to be.  For the purposes of a spelling checker,
you might want to remove the last consonant sound ('s, 'ing, and so on) and
try looking for root words.  Generally this means lopping off the last
digit of the return value.

:1.  What does it only produce with a six-characters return key?  

They don't all return six characters.  The one I found on the net returns
either four or 'L0' where L is th efirst letter.

:2.  Why doesn't it collapse the initial character as well (S and Z, P
:    and B, etc).   

The obvious example of soundex usage is looking up a persons name out of an
enormous database.  Is that name "smith" or "smythe"?  After running it
through soundex, it generally doesn't matter.  But in this case, there
would be an incredible number of hits just on the common last name.
Keeping the first letter intact helps keep the match hits low and throw out
unreasonable matches.

:3.  Some of consonant clusters could stand being munged up a bit, like
:    -mb -nd, etc.

Yes, sadly.

[ Some more questions and points deleted. ]

I agree with most of them.  Please, write your own! :-)



-- 
                                               Steve Davis (strat@cis.ksu.edu)
                                                       Kansas State University

A billing computer that cheats is not a great public relations ploy.


Article 11895 of comp.lang.perl:
Path: feenix.metronet.com!news.utdallas.edu!convex!cs.utexas.edu!howland.reston.ans.net!gatech!udel!MathWorks.Com!noc.near.net!newshost.meiko.com!not-for-mail
From: mike@meiko.com (Mike Stok)
Newsgroups: comp.lang.perl
Subject: Soundex (again :-)
Date: 23 Mar 1994 19:44:35 -0500
Organization: Meiko Scientific, Inc., MA
Lines: 272
Message-ID: <2mqnpj$qk4@hibbert.meiko.com>
NNTP-Posting-Host: hibbert.meiko.com

Thanks to Rich Pinder <rpinder@hsc.usc.edu> for finding a little bug in my
soundex code I posted a while back.  This showed up when he compared it
with the output from Oracle's soundex function, and were caused by leading
characters which were different but shared the same soundex code.

Here's a fixed shar file...

Mike

#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 03/24/1994 00:35 UTC by Mike.Stok@meiko.concord.ma.us
# Source directory /tmp_mnt/develop/sw/misc/mike/soundex
#
# existing files will NOT be overwritten unless -c is specified
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#   1677 -r--r--r-- soundex.pl
#   2408 -r-xr-xr-x soundex.t
#
# ============= soundex.pl ==============
if test -f 'soundex.pl' -a X"$1" != X"-c"; then
	echo 'x - skipping soundex.pl (File already exists)'
else
echo 'x - extracting soundex.pl (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'soundex.pl' &&
package soundex;
X
;# $Id: soundex.pl,v 1.2 1994/03/24 00:30:27 mike Exp $
;#
;# Implementation of soundex algorithm as described by Knuth in volume
;# 3 of The Art of Computer Programming, with ideas stolen from Ian
;# Phillips <ian@pipex.net>.
;#
;# Mike Stok <Mike.Stok@meiko.concord.ma.us>, 2 March 1994.
;#
;# Knuth's test cases are:
;# 
;# Euler, Ellery -> E460
;# Gauss, Ghosh -> G200
;# Hilbert, Heilbronn -> H416
;# Knuth, Kant -> K530
;# Lloyd, Ladd -> L300
;# Lukasiewicz, Lissajous -> L222
;#
;# $Log: soundex.pl,v $
;# Revision 1.2  1994/03/24  00:30:27  mike
;# Subtle bug (any excuse :-) spotted by Rich Pinder <rpinder@hsc.usc.edu>
;# in the way I handles leasing characters which were different but had
;# the same soundex code.  This showed up comparing it with Oracle's
;# soundex output.
;#
;# Revision 1.1  1994/03/02  13:01:30  mike
;# Initial revision
;#
;#
;##############################################################################
X
;# $soundex'noCode is used to indicate a string doesn't have a soundex
;# code, I like undef other people may want to set it to 'Z000'.
X
$noCode = undef;
X
;# main'soundex
;#
;# usage:
;#
;# @codes = &main'soundex (@wordList);
;# $code = &main'soundex ($word);
;#
;# This strenuously avoids $[
X
sub main'soundex
{
X  local (@s, $f, $fc, $_) = @_;
X
X  foreach (@s)
X  {
X    tr/a-z/A-Z/;
X    tr/A-Z//cd;
X
X    if ($_ eq '')
X    {
X      $_ = $noCode;
X    }
X    else
X    {
X      ($f) = /^(.)/;
X      tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
X      ($fc) = /^(.)/;
X      s/^$fc+//;
X      tr///cs;
X      tr/0//d;
X      $_ = $f . $_ . '000';
X      s/^(.{4}).*/$1/;
X    }
X  }
X
X  wantarray ? @s : shift @s;
}
X
1;
SHAR_EOF
chmod 0444 soundex.pl ||
echo 'restore of soundex.pl failed'
Wc_c="`wc -c < 'soundex.pl'`"
test 1677 -eq "$Wc_c" ||
	echo 'soundex.pl: original size 1677, current size' "$Wc_c"
fi
# ============= soundex.t ==============
if test -f 'soundex.t' -a X"$1" != X"-c"; then
	echo 'x - skipping soundex.t (File already exists)'
else
echo 'x - extracting soundex.t (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'soundex.t' &&
#!./perl
;#
;# $Id: soundex.t,v 1.2 1994/03/24 00:30:27 mike Exp $
;#
;# test module for soundex.pl
;#
;# $Log: soundex.t,v $
;# Revision 1.2  1994/03/24  00:30:27  mike
;# Subtle bug (any excuse :-) spotted by Rich Pinder <rpinder@hsc.usc.edu>
;# in the way I handles leasing characters which were different but had
;# the same soundex code.  This showed up comparing it with Oracle's
;# soundex output.
;#
;# Revision 1.1  1994/03/02  13:03:02  mike
;# Initial revision
;#
;#
X
require '../lib/soundex.pl';
X
$test = 0;
print "1..13\n";
X
while (<DATA>)
{
X  chop;
X  next if /^\s*;?#/;
X  next if /^\s*$/;
X
X  ++$test;
X  $bad = 0;
X
X  if (/^eval\s+/)
X  {
X    ($try = $_) =~ s/^eval\s+//;
X
X    eval ($try);
X    if ($@)
X    {
X      $bad++;
X      print "not ok $test\n";
X      print "# eval '$try' returned $@";
X    }
X  }
X  elsif (/^\(/)
X  {
X    ($in, $out) = split (':');
X
X    $try = "\@expect = $out; \@got = &soundex $in;";
X    eval ($try);
X
X    if (@expect != @got)
X    {
X      $bad++;
X      print "not ok $test\n";
X      print "# expected ", scalar @expect, " results, got ", scalar @got, "\n";
X      print "# expected (", join (', ', @expect),
X	    ") got (", join (', ', @got), ")\n";
X    }
X    else
X    {
X      while (@got)
X      {
X	$expect = shift @expect;
X	$got = shift @got;
X
X	if ($expect ne $got)
X	{
X	  $bad++;
X	  print "not ok $test\n";
X	  print "# expected $expect, got $got\n";
X	}
X      }
X    }
X  }
X  else
X  {
X    ($in, $out) = split (':');
X
X    $try = "\$expect = $out; \$got = &soundex ($in);";
X    eval ($try);
X
X    if ($expect ne $got)
X    {
X      $bad++;
X      print "not ok $test\n";
X      print "# expected $expect, got $got\n";
X    }
X  }
X
X  print "ok $test\n" unless $bad;
}
X
__END__
#
# 1..6
#
# Knuth's test cases, scalar in, scalar out
#
'Euler':'E460'
'Gauss':'G200'
'Hilbert':'H416'
'Knuth':'K530'
'Lloyd':'L300'
'Lukasiewicz':'L222'
#
# 7..8
#
# check default bad code
#
'2 + 2 = 4':undef
undef:undef
#
# 9
#
# check array in, array out
#
('Ellery', 'Ghosh', 'Heilbronn', 'Kant', 'Ladd', 'Lissajous'):('E460', 'G200', 'H416', 'K530', 'L300', 'L222')
#
# 10
#
# check array with explicit undef
#
('Mike', undef, 'Stok'):('M200', undef, 'S320')
#
# 11..12
#
# check setting $soundex'noCode
#
eval $soundex'noCode = 'Z000';
('Mike', undef, 'Stok'):('M200', 'Z000', 'S320')
#
# 13
#
# a subtle difference between me & oracle, spotted by Rich Pinder
# <rpinder@hsc.usc.edu>
#
CZARKOWSKA:C622
SHAR_EOF
chmod 0555 soundex.t ||
echo 'restore of soundex.t failed'
Wc_c="`wc -c < 'soundex.t'`"
test 2408 -eq "$Wc_c" ||
	echo 'soundex.t: original size 2408, current size' "$Wc_c"
fi
exit 0

-- 
The "usual disclaimers" apply.    | Meiko
Mike Stok                         | 130C Baker Ave. Ext
Mike.Stok@meiko.concord.ma.us     | Concord, MA 01742
Meiko tel: (508) 371 0088         | 


