Soundex

SOUNDEX

[From IBM document C20-1707-0, Identification Techniques, 1969.]

Sounds-Like Codes

Under this type of system, in order to overcome the problem of variation in the spelling of names and provide a "blocking factor" for organizing a name file, the records are filed under a phonetically coded version of the individual's name. Filing by the phonetic code collects records of names that sound alike. Disadvantages are the similar indexing of names that do no actually sounds alike and very poor alphabetic sequence of the file. Different types of sounds-alike coding systems are described below.

Alphanumeric Coding

The most commonly used phonetic coding technique adds a three-digit numeric code to the first letter of the last name. This phonetic code uses a coding table (Figure 16) and a set of syntactical coding rules as shown in the next column.

Value Letter Phonic Category

1 B,F,P,V (Labials)

2 C,G,J,K,Q,S,X,Z (Gutterals and sibilants)

3 D,T (Dentals)

4 L (Long liquid)

5 M,N (Nasals)

6 R (Short liquid)

Figure 16. A phonetic coding table

The basic coding rules are:

Vowels in positions other than the first position of the name are given no value; therefore A, E, I, O, U, and Y are not coded.
Consonants that sound alike are given the same numeric value.
Consonants H and W are given no value;
Double letters are coded as a single letter (for example, Rugge = RG = R200).
When two or more letters with the same numerical value are together, they are considered as one letter (for example, Scott = ST = S300, because C has the same value as S, and two T's are considered as a single T; Stock = STC = S320, since K has the same value as C).
Names with internal spacing or punctuation (for example, D'Amico = DMC = D520; Von Zann = VNZN = V525).
If the name is too long for the code, only the first three valid consonants after the initial letter are coded (e.g. Anderson = ANDRSN = ANDR = A536).
if the first letter of the first name or middle initial is A, E, I, O, U, Y, W, or H, it is coded as zero. If there is no middle initial, the code is zero (e.g., Day, Abner = D = (A)BN = D000-015).

Example: This code made up of one alphabetic character and three digits for last name, three digits for first name and one digit for middle initial would be as follows:

Johnson, Thomas J. = JNSN TMS J. = J525-352-2

Simple Numeric Coding

Another technique is entirely numeric and offers fewer synonyms by using fifteen values rather than six. It uses essentially the same basic rules as indicated above, except that A, E, I, O, U, Y, W and H are coded when they are the first letter of the last name, first name, or middle initial (Figure 17). In addition, a set of constant multipliers is used to convert the letter value to digits (Figure 18). The products of these calculations are then summed to give the identification code.

Value Letter

0 Nothing (blank)

1 B

2 C

3 D

4 F

5 G

6 J

7 K, Q

8 L

9 M

10 N

11 P, V

12 R

13 S

14 T

15 X, Z

16 A (Use only

17 E for first

18 H letter last

19 I name, first

20 O letter first

21 U name, and

22 W middle

23 Y initial.)

Figure 17. Phonetic coding table (The Computable Name Code)

Value Letter

41,000,000 First letter last name

2,560,000 First consonant last name

160,000 Second consonant last name

10,000 Third consonant last name

400 First letter first name

25 First consonant first name

1 Middle initial

Figure 18. Value table for constant multipliers (The Computable Name Code)

Example: The code is made up of first letter and three consonants of last name, first letter aand one consonant of first name, and middle initial.

Johnson, Thomas J. = JNSN TM J.
J = 6 × 41,000,000 = 246,000,000
N = 10 × 2,560,000 = 25,600,000
S = 13 × 160,000 = 2,080,000
N = 10 × 10,000 = 100,000
T = 14 × 400 = 5,600
M = 9 × 25 = 225
J = 6 × 1 = 6

"Thomas J. Johnson" becomes 273,785,831

A variation of the approach adds the last letter of first and last name to the value table for constant multipliers, creating a twelve-position identification code.

Numeric Frequency Coding

Another system attempts to distribute better the number of names to be scanned in any one sounds-like grouping. The sounds-like coding technique tends to distribute names disproportionately. To achieve a better balance, phonetic frequency coding equates infrequently used letters with more frequent letters by assigning them a numeric value. Frequency of letter occurrence was measured by counting pages of the San Francisco telephone directory. The coding tables for this code are given in Figures 19 and 20.

The phonetic frequency coding rules are as follows:

Initial phonetic scan
- Make double letters single, and CK=K, DT=T, SC=S, KN=N, MN=N.
- Eliminate vowels, W, H, and Y.
Coding scan
- Replace first letter of name with appropriate value from phonetic frequency table 1 (Figure 19).
- Compare the last two or three letters of last name with the phonetic frequency common suffix table (Figure 20) and, if a match is found, use the value as a second digit. If no match is found, use the value found from phonetic frequency value table 3 searching by the last phonetic letter of last name.
- Determine the last three phonetic digits from phonetic frequency value table 2, using the first letter of first name and rescanning the second and third letters of last name.

Table 1 Table 2 Table 3

0 S, Z 0 S, Z 0 B, C, K, Q, V

1 C, K, Q, V 1 C, K, Q 1 D, T

2 F, P, U, W 2 F, P, X 2 F, L, P

3 A, B 3 A, B 3 G, J, X

4 L, O, R 4 O, R 4 M, N

5 D, H, I 5 D, H, I 5 R

6 E, M, N, X 6 M, N 6 S, Z

7 G, J, T, Y 7 G, J, T, Y 7 A, E, H, I, O, U, W, Y

8 8 U, V, W 8

9 9 E, L 9

Figure 19. Phonetic frequency value tables.

Value Suffix

3 TRS

7 DRS, MN, TR

8 PRS, SN, STN

9 SR, STR, TD, TN

Figure 20. Phonetic frequency common suffix table.

Example: The code is made up of one digit for first letter of last name, one digit for suffix or last letter of last name, one digit for first letter of first name, and two digits for second and third letters of last name.

Johnson, Thomas J.

Initial phonetic scan: JNSN, T

Coding Scan: J (first letter last name - Table 1) = 7

SN (suffix - common suffix table) = 8

T (first letter first name - table 2) = 7

N (second letter last name - table 2) = 6

S (third letter last name - table 2) = 0

Total = 78760

If the name were spelled Johnzon, Thomas J., coding would be as shown below.

Johnzon, Thomas J.

Initial phonetic scan: JNZN, T

Coding Scan: J (first letter last name - Table 1) = 7

N (last letter last name - table 3
since neither "ZN" or "TZN"
are common suffixes) = 4

T (first letter first name - table 2) = 7

N (second letter last name - table 2) = 6

Z (third letter last name - table 2) = 0

Total = 74760

Original scans:

Comments to Webmaster

Click here for the Home page.

Updated August 1, 2012