Chris Straw
SHARE:

My Notes: Levenshtein distance and/or Metaphone

aka things that sound alike

Levenshtein distance and Metaphone 3 are two different algorithms primarily used in the context of text and speech processing. However, they each serve very different purposes.

Levenshtein Distance:

Levenshtein distance, also known as edit distance, measures the difference between two strings. Specifically, it’s defined as the minimum number of single-character edits (insertions, deletions or substitutions) needed to change one word into the other. For example, the Levenshtein distance between “kitten” and “sitting” is 3:

  1. kitten -> sitten (replace “k” with “s”)
  2. sitten -> sittin (replace “e” with “I”)
  3. sittin -> sitting (insert “g” at the end)

Levenshtein distance is widely used in applications like spell-checking, DNA sequence alignment, and Natural Language Processing.

https://en.wikipedia.org/wiki/Levenshtein_distance
https://github.com/Turnerj/Quickenshtein

https://gist.github.com/Davidblkx/e12ab0bb2aff7fd8072632b396538560

using System;

namespace Algorithms
{
    public static class LevenshteinDistance
    {
        /// <summary>
        ///     Calculate the difference between 2 strings using the Levenshtein distance algorithm
        /// </summary>
        /// <param name="source1">First string</param>
        /// <param name="source2">Second string</param>
        /// <returns></returns>
        public static int Calculate(string source1, string source2) //O(n*m)
        {
            var source1Length = source1.Length;
            var source2Length = source2.Length;

            var matrix = new int[source1Length + 1, source2Length + 1];

            // First calculation, if one entry is empty return full length
            if (source1Length == 0)
                return source2Length;

            if (source2Length == 0)
                return source1Length;

            // Initialization of matrix with row size source1Length and columns size source2Length
            for (var i = 0; i <= source1Length; matrix[i, 0] = i++){}
            for (var j = 0; j <= source2Length; matrix[0, j] = j++){}

            // Calculate rows and collumns distances
            for (var i = 1; i <= source1Length; i++)
            {
                for (var j = 1; j <= source2Length; j++)
                {
                    var cost = (source2[j - 1] == source1[i - 1]) ? 0 : 1;

                    matrix[i, j] = Math.Min(
                        Math.Min(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1),
                        matrix[i - 1, j - 1] + cost);
                }
            }
            // return result
            return matrix[source1Length, source2Length];
        }
    }
}

Metaphone 3

Metaphone 3 is an algorithm for English phonetic encoding. It transforms an English word into a rough approximation of its pronunciation. So, words that sound similar will have the same or similar Metaphone codes, even if their spelling is quite different.

For example, the words “Knight” and “Night” would have the same Metaphone 3 encoding since they sound identical when spoken, despite their different spellings.

Metaphone algorithms (like Metaphone 3) are used in applications where matching based on sound rather than exact spelling is important. This includes voice recognition, name matching in databases, and grouping words by phonetic similarity.

http://www.amorphics.com/index.html

Double Metaphone (Metaphone 2)

Metaphone 2 is a phonetic algorithm published by Lawrence Philips in 2000 as an improvement upon his earlier Metaphone algorithm. It is designed to encode English words into phonetic keys that represent how they sound when spoken. It’s a part of a sequence of algorithms that Philips developed: Metaphone, Double Metaphone (often referred to as Metaphone 2), and Metaphone 3.

Metaphone 2 or Double Metaphone significantly expanded upon the original Metaphone algorithm to include more accurate handling of spelling inconsistencies and better support for non-English words that have been widely integrated into English vocabulary. Unlike Metaphone, Double Metaphone can return a primary and an alternate phonetic encoding for a given word to account for different possible pronunciations.

This phonetic algorithm is especially useful in cases where you want to match words that sound similar, even if they are spelled quite differently. Applications often include things like voice recognition, name-matching in databases, and grouping words together by their phonetic similarity.

However, Double Metaphone is still not perfect and has a few shortcomings in capturing the English language’s phonetic diversity and complexity. These issues are addressed to a certain extent by Metaphone 3, the next iteration of the algorithm. As of my last training data in September 2021, Metaphone 3 is a commercial product and is not freely available as Metaphone and Double Metaphone.

In summary

While Levenshtein distance measures the “spelling distance” between two words, Metaphone 3 helps capture phonetic similarity, irrespective of spelling. They can both be useful, depending on the specific requirements of your project.

SQL Server/PostgreSQL notes:

https://www.red-gate.com/simple-talk/blogs/string-comparisons-in-sql-edit-distance-and-the-levenshtein-algorithm/

https://www.encora.com/insights/finding-duplicate-addresses-using-the-levenshtein-distance-metric-in-sql

https://gist.github.com/radityopw/33708e4b65d947225797

SQL Server Metaphone 3

https://github.com/Phil-Factor/SQLMetaPhone

IF  OBJECT_ID('dbo.Metaphone','FN') IS NOT NULL --drop any existing metaphone function
   DROP FUNCTION dbo.Metaphone
go
CREATE FUNCTION dbo.Metaphone
/**
summary:   >
The Metaphone  phonetic algorithm was devised by Lawrence Philips in 1990.
It reduces words to their basic sounds, but produces a more accurate encoding,
than Soundex for matching words that sound similar. 
Metaphone is a built-in operator in a number of systems such as PHP but there
seemed to be no available SQL Version until I wrote this. It is merely
a reverse engineering of the original published algorithm but tweaked to ensure
that it gave the same result as the PHP version.
Author: Phil Factor
Revision: 1.0
date: 21 Jan 2017
example: >
	Select dbo.Metaphone ('opportunities')
	--OPRTNTS
Parameters: 
	-- @String (a word -all punctuation will be stripped out)
  A string representing the Metaphone equivalent of the word. 
**/  
(
	@String VARCHAR(30)
)
RETURNS VARCHAR(10)
AS
BEGIN
DECLARE  @New BIT, @ii INT, @Metaphone VARCHAR(28), @Len INT, @where INT;
DECLARE @This CHAR, @Next CHAR, @Following CHAR, @Previous CHAR, @silent BIT;

SELECT @String = UPPER(LTRIM(COALESCE(@String, ''))); --trim and upper case
SELECT @where= PATINDEX ('%[^A-Z]%',@String COLLATE Latin1_General_CI_AI ) 
WHILE  @where>0 --strip out all non-alphabetic characters!
	BEGIN
	SELECT @String=STUFF(@string,@where,1,'')
	SELECT @where=PATINDEX ('%[^A-Z]%',@String COLLATE Latin1_General_CI_AI ) 
    END
IF(LEN(@String) < 2) RETURN  @String

--do the start of string stuff first.
--If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
-- "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright"
IF SUBSTRING(@String, 1, 2) IN ( 'KN', 'GN', 'PN', 'AE', 'WR' )
  SELECT @String = STUFF(@String, 1, 1, '');
-- Beginning of word: "x" change to "s" as in "Deng Xiaopeng"
IF SUBSTRING(@String, 1, 1) = 'X'
  SELECT @String = STUFF(@String, 1, 1, 'S');
-- Beginning of word: "wh-" change to "w" as in "Whatsoever"
IF @String LIKE 'WH%'
  SELECT @String = STUFF(@String, 1, 1, 'W');
-- Set up for While loop 
SELECT @Len = LEN(@String), @Metaphone = '', -- Initialize the main variable 
  @New = 1, -- this variable only used next 10 lines!!! 
  @ii = 1; --Position counter
--
WHILE((LEN(@Metaphone) <= 8) AND (@ii <= @Len))
  BEGIN --SET up the 'pointers' for this loop-around }
  SELECT @Previous =
    CASE WHEN @ii > 1 THEN SUBSTRING(@String, @ii - 1, 1) ELSE '' END,
    -- originally a nul terminated string }
    @This = SUBSTRING(@String, @ii, 1),
    @Next =
      CASE WHEN @ii < @Len THEN SUBSTRING(@String, @ii + 1, 1) ELSE '' END,
    @Following =
      CASE WHEN((@ii + 1) < @Len) THEN SUBSTRING(@String, @ii + 2, 1) ELSE
                                                                        '' END
   -- 'CC' inside word 
  --SELECT @Previous,@this,@Next,@Following,@New,@ii,@Len,@Metaphone
  /* Drop duplicate adjacent letters, except for C.*/
  IF @This=@Previous AND @This<> 'C' 
	BEGIN
	--we do nothing 
	SELECT @New=0
    END
  /*Drop all vowels unless it is the beginning.*/
  ELSE IF @This IN ( 'A', 'E', 'I', 'O', 'U' )
    BEGIN
    IF @ii = 1 --vowel at the beginning
      SELECT @Metaphone = @This;
    /* B -> B unless at the end of word after "m", as in "dumb", "Comb" */
    END;
  ELSE IF @This = 'B' AND NOT ((@ii = @Len) AND (@Previous = 'M'))
         BEGIN
         SELECT @Metaphone = @Metaphone + 'B';
         END;
         -- -mb is silent 
 /*'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-',
 in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. 
 Otherwise, 'C' transforms to 'K'.*/
  ELSE IF @This = 'C'
         BEGIN -- -sce, i, y = silent 
         IF NOT (@Previous= 'S') AND (@Next IN ( 'H', 'E', 'I', 'Y' )) --front vowel set 
           BEGIN
			   IF(@Next = 'I') AND (@Following = 'A')
				 SELECT @Metaphone = @Metaphone + 'X'; -- -cia- 
			   ELSE IF(@Next IN ( 'E', 'I', 'Y' ))
				 SELECT @Metaphone = @Metaphone + 'S'; -- -ce, i, y = 'S' }
			   ELSE IF(@Next = 'H') AND (@Previous = 'S')
				 SELECT @Metaphone = @Metaphone + 'K'; -- -sch- = 'K' }
			   ELSE IF(@Next = 'H')
				 BEGIN
				   IF(@ii = 1) AND ((@ii + 2) <= @Len) 
					 AND NOT(@Following IN ( 'A', 'E', 'I', 'O', 'U' ))
					   SELECT @Metaphone = @Metaphone + 'K';
				   ELSE
					 SELECT @Metaphone = @Metaphone + 'X';
				   END
           End  
		 ELSE 
           SELECT @Metaphone = @Metaphone +CASE WHEN @Previous= 'S' THEN '' else 'K' end;
         	   -- Else silent 
         END; -- Case C }
  /*'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' 
  transforms to 'T'.*/
  ELSE IF @This = 'D'
         BEGIN
         SELECT @Metaphone = @Metaphone
           + CASE WHEN(@Next = 'G') AND (@Following IN ( 'E', 'I', 'Y' )) --front vowel set 
                 THEN 'J' ELSE 'T' END;
         END;
  ELSE IF @This = 'G'
         /*Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' 
 if followed by 'N' or 'NED' and is at the end.
 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. 
 Otherwise, 'G' transforms to 'K'.*/
    BEGIN
  SELECT @silent = 
    CASE WHEN (@Next = 'H') AND (@Following IN ('A','E','I','O','U'))
	AND (@ii > 1) AND (((@ii+1) = @Len) OR ((@Next = 'n') AND
    (@Following = 'E') AND SUBSTRING(@String,@ii+3,1) = 'D') AND ((@ii+3) = @Len)) 
-- Terminal -gned 
  AND (@Previous = 'i') AND (@Next = 'n')
  THEN 1 
 -- if not start and near -end or -gned.) 
  WHEN (@ii > 1) AND (@Previous = 'D')-- gnuw
    AND (@Next IN ('E','I','Y')) --front vowel set 
  THEN 1 -- -dge, i, y 
  ELSE 0 END
  IF NOT (@silent=1)
    SELECT @Metaphone = @Metaphone 
	+ CASE WHEN (@Next IN ('E','I','Y')) --front vowel set 
      THEN  'J' ELSE  'K' END
  END
  /*Drop 'H' if after vowel and not before a vowel.
  or the second char of  "-ch-", "-sh-", "-ph-", "-th-", "-gh-"*/

  ELSE IF @This = 'H'
    BEGIN
    IF NOT ( (@ii= @Len) OR (@Previous IN  ( 'C', 'S', 'T', 'G' ))) 
	   AND (@Next IN ( 'A', 'E', 'I', 'O', 'U' ) )
     SELECT @Metaphone = @Metaphone + 'H';
         -- else silent (vowel follows) }
    END;
  ELSE IF @This IN --some get no substitution
       ( 'F', 'J', 'L', 'M', 'N', 'R' )
     BEGIN
     SELECT @Metaphone = @Metaphone + @This;
     END;
  /*'CK' transforms to 'K'.*/
  ELSE IF @This = 'K'
     BEGIN
     IF(@Previous <> 'C')
       SELECT @Metaphone = @Metaphone + 'K';
     END;
  /*'PH' transforms to 'F'.*/
  ELSE IF @This = 'P'
    BEGIN
      IF(@Next = 'H') SELECT @Metaphone = @Metaphone + 'F', @ii = @ii + 1;
      -- Skip the 'H' 
      ELSE
        SELECT @Metaphone = @Metaphone + 'P';
      END;
  /*'Q' transforms to 'K'.*/
  ELSE IF @This = 'Q'
    BEGIN
      SELECT @Metaphone = @Metaphone + 'K';
    END;
  /*'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.*/
  ELSE IF @This = 'S'
    BEGIN
    SELECT @Metaphone = @Metaphone + 
	  CASE 
		WHEN(@Next = 'H')
		 OR( (@ii> 1) AND (@Next = 'i') 
		  AND (@Following IN ( 'O', 'A' ) )
		  ) 
		THEN 'X' ELSE 'S' END;
     END;
  /*'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms 
to '0'. Drop 'T' if followed by 'CH'.*/
  ELSE IF @This = 'T'
    BEGIN
    SELECT @Metaphone = @Metaphone
      + CASE 
	    WHEN(@ii = 1) AND (@Next = 'H') AND (@Following = 'O') 
	       THEN 'T' -- Initial Tho- }
        WHEN(@ii > 1) AND (@Next = 'i') 
		     AND (@Following IN ( 'O', 'A' )) 
		  THEN 'X'
        WHEN(@Next = 'H') THEN '0'
        WHEN NOT((@Next = 'C') AND (@Following = 'H')) 
		  THEN 'T'
        ELSE '' END;
         -- -tch = silent }
    END;
  /*'V' transforms to 'F'.*/
  ELSE IF @This = 'V'
    BEGIN
    SELECT @Metaphone = @Metaphone + 'F';
    END;
  /*'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.*/
  /*Drop 'Y' if not followed by a vowel.*/
  ELSE IF @This IN ( 'W', 'Y' )
    BEGIN
    IF @Next IN ( 'A', 'E', 'I', 'O', 'U' )
      SELECT @Metaphone = @Metaphone + @This;
     --else silent 
     /*'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.*/
    END;
  ELSE IF @This = 'X'
    BEGIN
      SELECT @Metaphone = @Metaphone + 'KS';
    END;
  /*'Z' transforms to 'S'.*/
  ELSE IF @This = 'Z'
     BEGIN
       SELECT @Metaphone = @Metaphone + 'S';
     END;
  ELSE
  RETURN 'error with '''+ @This+ '''';
  -- end
  SELECT @ii = @ii + 1;
  END; -- While 
return @Metaphone 
END
go
/*
Check against the PHP implementation*/
SELECT * FROM 
(SELECT dbo.Metaphone ('craven') AS Attempt,'craven' AS original ,'KRFN' AS canonical                      
UNION ALL SELECT dbo.Metaphone ('platitudinous'),'platitudinous','PLTTTNS'
UNION ALL SELECT dbo.Metaphone ('woodcarvings'),'woodcarvings','WTKRFNKS'
UNION ALL SELECT dbo.Metaphone ('overlaid'),'overlaid','OFRLT'
UNION ALL SELECT dbo.Metaphone ('solitaries'),'solitaries','SLTRS'
UNION ALL SELECT dbo.Metaphone ('beatific'),'beatific', 'BTFK'
UNION ALL SELECT dbo.Metaphone ('plaza'),'plaza','PLS'
UNION ALL SELECT dbo.Metaphone ('paramilitary'),'paramilitary','PRMLTR'
UNION ALL SELECT dbo.Metaphone ('synod'),'synod','SNT'
UNION ALL SELECT dbo.Metaphone ('marinas'),'marinas','MRNS'
UNION ALL SELECT dbo.Metaphone ('hyperventilation'),'hyperventilation','PRFNTLXN'
UNION ALL SELECT dbo.Metaphone ('celebrant'),'celebrant','SLBRNT'
UNION ALL SELECT dbo.Metaphone ('pipsqueaks'),'pipsqueaks','PPSKKS'
UNION ALL SELECT dbo.Metaphone ('dazzles'),'dazzles', 'TSLS'
UNION ALL SELECT dbo.Metaphone ('bloodbaths'),'bloodbaths','BLTB0S'
UNION ALL SELECT dbo.Metaphone ('lotion'),'lotion','LXN'
UNION ALL SELECT dbo.Metaphone ('agreeable'),'agreeable','AKRBL'
UNION ALL SELECT dbo.Metaphone ('shariah'),'shariah','XR'
UNION ALL SELECT dbo.Metaphone ('direction'),'direction','TRKXN'
UNION ALL SELECT dbo.Metaphone ('constricts'),'constricts','KNSTRKTS'
UNION ALL SELECT dbo.Metaphone ('avowedly'),'avowedly','AFWTL'
UNION ALL SELECT dbo.Metaphone ('exorcisms'),'exorcisms','EKSRSSMS'
UNION ALL SELECT dbo.Metaphone ('starches'),'starches','STRXS'
UNION ALL SELECT dbo.Metaphone ('poses'),'poses','PSS'
UNION ALL SELECT dbo.Metaphone ('levies'),'levies','LFS'
UNION ALL SELECT dbo.Metaphone ('clicks'),'clicks','KLKS'
UNION ALL SELECT dbo.Metaphone ('minstrels'),'minstrels','MNSTRLS'
UNION ALL SELECT dbo.Metaphone ('propounding'),'propounding','PRPNTNK'
UNION ALL SELECT dbo.Metaphone ('opalescent'),'opalescent','OPLSNT'
UNION ALL SELECT dbo.Metaphone ('hotline'),'hotline','HTLN'
UNION ALL SELECT dbo.Metaphone ('soporifically'),'soporifically','SPRFKL'
UNION ALL SELECT dbo.Metaphone ('python'),'python','P0N'
UNION ALL SELECT dbo.Metaphone ('drab'),'drab','TRB'
UNION ALL SELECT dbo.Metaphone ('appraised'),'appraised','APRST'
UNION ALL SELECT dbo.Metaphone ('commotions'),'commotions','KMXNS'
UNION ALL SELECT dbo.Metaphone ('defeatists'),'defeatists','TFTSTS'
UNION ALL SELECT dbo.Metaphone ('dispensations'),'dispensations','TSPNSXNS'
UNION ALL SELECT dbo.Metaphone ('downfall'),'downfall','TNFL'
UNION ALL SELECT dbo.Metaphone ('naturalising'),'naturalising','NTRLSNK')k

WHERE attempt <> canonical
IF @@RowCount>0 RAISERROR( 'As you can see, there was a problem somewhere',16,1)