Let the spelling loose …
What do Callie and Kelly have in common (except for the double ‘l’ in the middle)? What about “no” and “know”, or “Ceasar’s” and “scissors” and what about “message” and “massage”? You definitely got it – Callie and Kelly, “no” and “know”, “Ceasar’s” and “scissors” sound alike, but are spelled quite differently. “message” and “massage” on the other hand differ by only one vowel (“a” vs “e”) but their pronunciation is not at all the same.
It’s a well known fact for many languages that ortography does not determine the pronunciation of words. English is a classic example. George Bernard Shaw was the attributed author of “ghoti” as an alternative spelling of “fish”. And while phonology often reflects the current state of the development of the language, orthography may often lag centuries behind. And while English is notorious for that phenomenon it is not the only one. Swedish, French, Portuguese, among others, all have their ortography/pronunciation discrepancies.
Phonetic Algorithms
So how do we represent things that sound similar but are spelled different? It’s not trivial but for most cases it is not impossible either. Soundex is probably the first algorithm to tackle this problem. It is an example of the so called phonetic algorithms which attempt to solve the problem of giving the same encoding to strings which are pronounced in a similar fashion. Soundex was designed for English only but has its limits. DoubleMetaphone (DM) is one of the possible replacements and relatively successful. Designed by Lawrence Philips in the beginning of 1990s it not only deals with native English names but also takes proper care of foreign names so omnipresent in the language. And what is more – it can output two possible encodings for a given name, hence the “Double” in the naming of the algorithm, – an anglicised and a native (be that Slavic, Germanic, Greek, Spanish, etc.) version.
By relying on DM one can encode all the four names in the title of this post as “PRN”. The name George will get two encodings – JRJ and KRK, the second version reflecting a possible German pronunciation of the name. And a name with Polish origin, like Adamowicz, would also get two encodings – ATMTS and ATMFX, depending on whether you pronounce the “cz” as the English “ch” in “church” or “ts” in “hats”.
The original implementation by Lawrence Philips allowed a string to be encoded only with 4 characters. However, in most subsequent
implementations of the algorithm this option is parameterized or just omitted.
Apache Commons Codec has an implementation of the DM among others (Soundex, Metaphone, RefinedSoundex, ColognePhonetic, Coverphone, to
name just a few.) and here is a tiny example with it:
import org.apache.commons.codec.language.DoubleMetaphone;
public class DM {
public static void main(String[] args) {
String s = "Adamowicz";
DoubleMetaphone dm = new DoubleMetaphone();
// Default encoding length is 4!
// Let's make it 10
dm.setMaxCodeLen(10);
System.out.println("Alternative 1: " + dm.doubleMetaphone(s) +
// Remember, DM can output 2 possible encodings:
"nAlternative 2: " + dm.doubleMetaphone(s, true));
}
}
The above code will print out:
Alternative 1: ATMTS
Alternative 2: ATMFX
It is also relatively straightforward to do phonetic search with Solr. You just need to ensure that you add the phonetic analysis to a field which contains names in your schema.xml:
Enhancements
While DM does perform quite well, at first sight, it has its limitations. We should know that it still originated from the English language and although it aims to tackle a variety of non-native borrowings most of the rules are English-centric. Suppose you work on any of the Scandinavian languages (Swedish, Danish, Norwegian, Icelandic) and one of the names you want to encode is “Örjan”. However, “Orjan” and “Örjan” get different encodings – ARJN vs RJN. Why is that? One look under the hood (the implementation in DoubleMetaphone.java) will give you the answer:
private static final String VOWELS = "AEIOUY";
So the Scandinavian vowels “ö”, “ä”, “å”, “ø” and “æ” are not present. If we just add these then compile and use the new version of the DM implementation we get the desired output – ARJN for both “Örjan” and “Orjan”.
Finally, if you don’t want to use DM or maybe it is really not suitable for your task, you still may use the same principles and create your own encoder by relying on regular expressions for example. Suppose you have a list of bogus product names which are just (mis)spelling variations of some well known names and you want to search for the original name but get back all ludicrous variants. Here is one albeit very naïve way to do it. Given the following names:
CupHoulder
CappHolder
KeepHolder
MacKleena
MackCliiner
MacqQleanAR
Ma’cKcle’an’ar
and with a bunch of regular expressions you can easily encode them as “cphldR” and “mclnR”.
String[] ar = new String[]{"CupHoulder", "CappHolder", "KeepHolder",
"MacKleena", "MackCliiner", "MacqQleanAR", "Ma'cKcle'an'ar"};
for (String a : ar) {
a = a.toLowerCase();
a = a.replaceAll("[ae]r?$", "R");
a = a.replaceAll("[aeoiuy']", "");
a = a.replaceAll("pp+", "p");
a = a.replaceAll("q|k", "c");
a = a.replaceAll("cc+", "c");
System.out.println(a);
}
You can now easily find all the ludicrous spellings of “CupHolder” och “MacCleaner”.
I hope this blogpost gave you some ideas of how you can use phonetic algorithms and their principles in order to better discover names and entities that sound alike but are spelled unlike. At Findwise we have done a number of enhancements to DM in order to make it work better with Swedish, Danish and Norwegian.
References
You can learn more about Double Metaphone from the following article by the creator of the algorithm:
http://drdobbs.com/cpp/184401251?pgno=2
A German phonetic algorithm is the Kölner Phonetik:
http://de.wikipedia.org/wiki/Kölner_Phonetik
And SfinxBis is a phonetic algorithm based on Soundex and is Swedish specific:
http://www.swami.se/projekt/sfinxbis.68.html