Phonetic search is useful when we search for names that sound similar to a given name (our query), e.g. searching for Tomas when the person’s real name is actually Thomas. A typical use case is a people search application on an intranet. There are a few things to consider, and this blog post highlights them. If you want more information than is found in this blog post, please contact us.
1.1 Choosing algorithms
There are several different algorithms and implementations of phonetic search today. The most commonly known algorithm is called “Soundex”. In 1990 an improvement of Soundex called Metaphone was created. The Metaphone algorithm takes into account some discrepancies between spelling and pronunciation in English. However, both algorithms were developed to work with the English language and are therefore not suited for use with other languages.
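To make the idea of a phonetic code concrete, here is a minimal sketch of the classic American Soundex rules in Python. It is an illustration only, not the implementation used in any particular search engine, and the function name `soundex` is our own choice.

```python
def soundex(name: str) -> str:
    """Return a 4-character American Soundex code, or "" if no A-Z letters."""
    # Consonant groups that share a digit; vowels, Y, H and W have no digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    # Keep plain A-Z letters only (accented letters are dropped in this sketch).
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it

    # Pad with zeros or truncate to one letter plus three digits.
    return (result + "000")[:4]


print(soundex("Tomas"), soundex("Thomas"))
```

Both names from the introduction receive the same code, which is why a Soundex-based index would return Thomas for the query Tomas.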
The Double Metaphone (DM) algorithm is another attempt to extend the encoding capabilities by allowing two encoded versions of the string (hence the name “double”). The algorithm takes into account names of Slavic, Germanic, Romance and other origins that are common in the United States and hence in American English. This previous blog post (http://www.findwise.com/blog/bryan-brian-briane-bryne-or-what-was-his-name-again/) discusses some issues with Double Metaphone.
At Findwise, we have created an extension of the DM algorithm to properly handle Scandinavian languages (Swedish, Danish and Norwegian) as well as Polish. In addition, we have a separate implementation, inspired by DM, which handles Arabic names written in the Latin alphabet.
Metaphone 3 is a phonetic algorithm created by the same author as Metaphone and Double Metaphone. In contrast to the earlier Metaphone algorithms, Metaphone 3 is not open source; its source code is sold by Anthropomorphic Software (http://amorphics.com/). The algorithm is said to increase the accuracy from 89% to 98% compared to the Double Metaphone algorithm for names and words commonly used in American English.
Another phonetic algorithm worth mentioning is Beider-Morse Phonetic Matching (BMPM). BMPM is another extension of Soundex and currently has support mainly for Jewish names as written in Russian, Polish, German, Romanian, Hungarian, Hebrew, French, Spanish and English.
The phonetic search algorithms can be implemented either as a plugin to the search engine alone, or in both the processing framework and the search service.
An alternative to phonetic search is an extensive use of synonyms. By using synonyms, a more controlled set of alternative spellings of names can be maintained, thus providing high-precision matching of names. However, the list must contain all possible spelling variations of all possible names in order to have full coverage. Consider the case where Norwegians try to search for Finnish first names. Do you know how a Norwegian speaker would typically spell the names “Miikka” or “Jouko”? If you don’t, the synonym solution will not be complete.
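The synonym approach can be sketched as a lookup table of spelling groups. The groups below are hypothetical examples (taken from names mentioned in this post), and `build_index` and `expand_query` are illustrative helper names, not part of any search engine API.

```python
# Hypothetical, hand-maintained groups of spellings treated as the same name.
NAME_VARIANTS = [
    {"thomas", "tomas"},
    {"bryan", "brian", "briane", "bryne"},
]


def build_index(groups):
    """Map every spelling to the full group it belongs to."""
    index = {}
    for group in groups:
        for spelling in group:
            index[spelling] = group
    return index


def expand_query(name, index):
    """Return all known spellings for a name, or just the name itself."""
    key = name.lower()
    return sorted(index.get(key, {key}))


index = build_index(NAME_VARIANTS)
print(expand_query("Tomas", index))   # a name that is covered by the list
print(expand_query("Miikka", index))  # a name nobody added to the list
```

The second call illustrates the coverage problem: a name that is missing from the list is searched for exactly as typed, so any misspelling of it finds nothing.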
1.2 Using a phonetic search algorithm
When you implement a phonetic search algorithm, you must consider the risk of so-called “false positives” as your result set grows. (False positives are results in the result set that are not a good match for the query.) Regardless of which algorithm you use, you will always have false positives in your result set, and they need to be handled well.
1.2.1 Recall or precision
In any search scenario, a balance needs to be struck between high precision (all returned results are relevant) and high recall (all relevant results are returned, possibly along with many irrelevant ones). Phonetic search is a way to increase recall at the expense of precision, which may in some cases be problematic. Therefore you may want to return the alternative name spellings only if there is no exact match (high precision), instead of returning other spellings of the same name with the initial search results (high recall).
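The trade-off can be made concrete with a small calculation. The sketch below uses made-up result sets for a single query; the names and the helper `precision_recall` are hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query, given two collections of ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: the user is looking for either of these two people.
relevant = {"Thomas Berg", "Tomas Berg"}

exact_results = {"Tomas Berg"}
phonetic_results = {"Thomas Berg", "Tomas Berg", "Tamas Borg", "Timo Burg"}

print(precision_recall(exact_results, relevant))     # exact matching
print(precision_recall(phonetic_results, relevant))  # phonetic matching
```

With this data, exact matching gives precision 1.0 and recall 0.5, while phonetic matching gives precision 0.5 and recall 1.0.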
1.2.2 User interfaces
Phonetic search may be confusing to the end user. A user accepts the fact that no results are returned when they misspell a word, but finds it harder to accept that there are thousands of results for a query that is quite precise. There are a few ways of dealing with this.
One way is to add a checkbox by the search box, where the user can decide if phonetic search should be applied or not. By giving the user the possibility to decide, it is more likely that the user understands why the results were returned.
Another solution is to return phonetic matches only if there are no good matches to the original query. This is a common solution that also means that you can return a text explaining that no exact matches were found. This will however mean that you will not know if there are other people with a similar name to the person that you found.
The easiest solution to dealing with confused users wondering why there are so many hits for a query is to not present the number of results that were returned. This is more effective than most people realize. If you keep scrolling down the result list, then you are probably looking for something that is similar to what you searched for, while users that spelled correctly find their result at the top of the list, and do not realize that there are a lot of results that are not matching their query.
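A minimal sketch of this fallback strategy is shown below. The function `toy_key` is a deliberately crude stand-in for a real phonetic encoder (it only lowercases and drops the letter h), and `search_people` is a hypothetical helper, not an API of any search engine.

```python
def toy_key(name):
    """Crude placeholder for a phonetic encoder: lowercase and drop 'h'."""
    return name.lower().replace("h", "")


def search_people(query, names, key=toy_key):
    """Return (results, message); phonetic matches are used only as a fallback."""
    q = query.lower()
    exact = [n for n in names if n.lower() == q]
    if exact:
        return exact, None  # exact hits: no phonetic matching, no message

    similar = [n for n in names if key(n) == key(query)]
    if similar:
        message = f'No exact matches were found for "{query}". Showing similar names.'
    else:
        message = f'No matches were found for "{query}".'
    return similar, message


names = ["Thomas", "Tomas", "Maria"]
print(search_people("Tomas", names))   # exact hit only; Thomas is not shown
print(search_people("Tohmas", names))  # no exact hit; falls back to similar names
```

The first call shows the drawback described above: because Tomas matches exactly, the user never learns that there is also a Thomas.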
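Hiding the result count works best when exact matches are ranked above phonetic ones. The sketch below shows one way to do that, assuming a simple in-memory list of names; `toy_key` is again a crude placeholder for a real phonetic encoder and `rank_matches` is a hypothetical helper.

```python
def toy_key(name):
    """Crude placeholder for a phonetic encoder: lowercase and drop 'h'."""
    return name.lower().replace("h", "")


def rank_matches(query, names, key=toy_key):
    """Return exact and phonetic matches, with exact matches listed first."""
    q = query.lower()
    candidates = [n for n in names if n.lower() == q or key(n) == key(query)]
    # False sorts before True, so exact matches come first; the sort is stable,
    # which keeps the original order within each of the two groups.
    return sorted(candidates, key=lambda n: n.lower() != q)


print(rank_matches("Tomas", ["Thomas", "Maria", "Tomas", "tomas"]))
```

A user who spelled the name correctly sees their match at the top, and only a user who keeps scrolling reaches the phonetic matches.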
1.3 Conclusion
Phonetic search is a good way to increase the recall when searching for names, but remember that increasing the recall will lower the precision. This needs to be addressed somehow, either programmatically (try to decrease the number of results or rank the results internally) or in the user interface (only return phonetic results in some cases or hide the number of results returned). Whatever the chosen solution is, we, at Findwise, have the experience and expertise to adapt, enhance or even create new phonetic algorithms that will suit your purpose and the underlying data or language.
Please visit our website or contact us for more information.
Hi.
Do you plan to release your extension to Metaphone for better Polish language handling? Maybe as a C/C++ lib (even closed source) so we can use it, for example, in PostgreSQL or PHP as an extension?
There is a general lack of good phonetic search algorithms for the Polish language :/
Best regards.
Hi,
Great that you are asking about the Polish language!
Yep, we do have a nice phonetic module for Polish at Findwise. We would be very glad to talk to you more about it. If you are interested, contact our CTO Helge Legernes to discuss business matters, or me if you just want to chat about it more.
/Paweł Wróblewski