Machine Learning for Complex Language Entry

Posted: April 15th, 2011 | Filed under: Machine Learning in the Real World

Editor’s note: We’d like to invite people with interesting machine learning and data analysis applications to explain the techniques that are working for them in the real world on real data. This guest post describes an open-source browser add-on that uses machine learning techniques to make it easier for people around the world to communicate.

Authors: Kevin Scannell and Michael Schade

Many languages around the world use the familiar Latin alphabet (A-Z), but in order to represent the sounds of the language accurately, their writing systems employ diacritical marks and other special characters.    For example:

  • Vietnamese (Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm),
  • Hawaiian  (Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo),
  • Ewe (Amesiame kpɔ mɔ abu tame le eɖokui si eye wòaɖe eƒe susu agblɔ faa mɔxexe manɔmee),
  • and hundreds of others.

Speakers of these languages have difficulty entering text into a computer because keyboards are often not available, and even when they are, typing special characters can be slow and cumbersome.    Also, in many cases, speakers may not be completely familiar with the “correct” writing system and may not always know where the special characters belong.   The end result is that for many languages, the texts people type in emails, blogs, and social networking sites are left as plain ASCII, omitting any special characters, and leading to ambiguities and confusion.

To solve this problem, we have created a free and open source Firefox add-on that allows users to type texts in plain ASCII, and then automatically adds all diacritics and special characters in the correct places, a process we call “Unicodification”. The add-on uses a machine learning approach, employing both character-level and word-level models trained on data crawled from the web for more than 100 languages.

It is easiest to describe our algorithm with an example.   Let’s say a user is typing Irish (Gaelic), and they enter the phrase nios mo muinteoiri fiorchliste with no diacritics.   For each word in the input, we check to see if it is an “ascii-fied” version of a word that was seen during training.

  • In our example, for two of the words, there is exactly one candidate unicodification in the training data: nios is the asciification of the word níos which is very common in our Irish data, and muinteoiri is the asciification of múinteoirí, also very common.   As there are no other candidates, we take níos and múinteoirí as the unicodifications.
  • There are two possibilities for mo; it could be correct as is, or it could be the asciification of mó.   When there is an ambiguity of this kind, we rely on standard word-level n-gram language modeling; in this case, the training data contains many instances of the set phrase níos mó, and no examples of níos mo, so mó is chosen as the correct answer.
  • Finally, the word fiorchliste doesn’t appear at all in our training data, so we resort to a character-level model, treating each character that could admit a diacritic as a classification problem.  For each language, we train a naive Bayes classifier using trigrams (three character sequences) in a neighborhood of the ambiguous character as features.   In this case, the model classifies the first “i” as needing an acute accent, and leaves all other characters as plain ASCII, thereby (correctly) restoring fiorchliste to fíorchliste.
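The word-level part of the walkthrough above can be sketched in a few lines of Python. This is a minimal illustration, not the add-on's actual code: the function names (`asciify`, `train`, `unicodify`), the toy corpus, and the bigram tie-breaking rule are all assumptions made for the example, and the character-level fallback is left as a pluggable function that here simply returns the word unchanged.

```python
import unicodedata
from collections import Counter, defaultdict


def asciify(word):
    """Strip combining marks, e.g. 'níos' -> 'nios'.

    Assumption: this only handles characters that decompose into a base
    letter plus marks; letters such as Ewe 'ɖ' would need an explicit table.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences):
    """Collect candidate unicodifications and word bigram counts."""
    candidates = defaultdict(Counter)  # ASCII form -> Counter of observed forms
    bigrams = Counter()                # (previous word, word) -> count
    for sentence in sentences:
        words = unicodedata.normalize("NFC", sentence).split()
        for word in words:
            candidates[asciify(word)][word] += 1
        bigrams.update(zip(words, words[1:]))
    return candidates, bigrams


def unicodify(text, candidates, bigrams, fallback=lambda word: word):
    """Restore diacritics word by word, left to right."""
    output = []
    for word in text.split():
        options = candidates.get(word)
        if not options:
            # Never seen in training: hand off to a character-level model.
            output.append(fallback(word))
        elif len(options) == 1:
            # Exactly one candidate: take it.
            output.append(next(iter(options)))
        else:
            # Ambiguous: prefer the candidate seen most often after the
            # previous output word, breaking ties by overall frequency.
            prev = output[-1] if output else None
            output.append(
                max(options, key=lambda c: (bigrams[(prev, c)], options[c]))
            )
    return " ".join(output)


# Toy training data, invented for this example.
corpus = ["níos mó", "níos mó múinteoirí", "mo theach", "mo mhadra", "mo chara"]
candidates, bigrams = train(corpus)
print(unicodify("nios mo muinteoiri fiorchliste", candidates, bigrams))
# níos mó múinteoirí fiorchliste
```

With this toy corpus, mo is more frequent than mó overall, but the bigram count for níos mó wins, which is the behavior described for the ambiguous case. The unseen word fiorchliste passes through untouched because no character-level model is plugged in here.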

The example above illustrates the ability of the character-level models to handle never-before-seen words; in this particular case fíorchliste is a compound word, and the character sequences in the two pieces fíor and chliste are relatively common in the training data.  Character-level modeling is also an effective way of handling morphologically complex languages, where there can be thousands or even millions of forms of any given root word, so many that one is lucky to see even a small fraction of them in a training corpus.  But the chances of seeing individual morphemes are much higher, and these are captured reasonably well by the character-level models.
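To make the character-level idea concrete, here is a small naive Bayes sketch that treats each character as a classification problem and uses trigrams around it as features. It is written under several assumptions that are not taken from the paper: the window radius, the position-tagged feature encoding, add-one smoothing, the class name `CharModel`, and the tiny training list are all invented for illustration.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def asciify(word):
    """Strip combining marks, e.g. 'fíor' -> 'fior'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def features(plain, i, radius=2):
    """Position-tagged character trigrams in a window around index i."""
    padded = "^" * (radius + 1) + plain + "$" * (radius + 1)
    centre = i + radius + 1
    return [
        (off, padded[centre + off - 1 : centre + off + 2])
        for off in range(-radius, radius + 1)
    ]


class CharModel:
    def __init__(self):
        self.class_counts = defaultdict(Counter)  # ASCII char -> restored chars
        self.feat_counts = defaultdict(Counter)   # (ASCII, restored) -> features
        self.vocab = set()                        # all features seen in training

    def train(self, words):
        for word in words:
            word = unicodedata.normalize("NFC", word)
            plain = asciify(word)
            if len(plain) != len(word):
                continue  # skip words that do not align character by character
            for i, (a, u) in enumerate(zip(plain, word)):
                self.class_counts[a][u] += 1
                for feat in features(plain, i):
                    self.feat_counts[(a, u)][feat] += 1
                    self.vocab.add(feat)

    def classify(self, plain, i):
        a = plain[i]
        classes = self.class_counts.get(a)
        if not classes:
            return a  # character never seen in training: leave as is
        total = sum(classes.values())
        feats = features(plain, i)

        def log_score(u):
            counts = self.feat_counts[(a, u)]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(classes[u] / total)            # class prior
            for feat in feats:
                score += math.log((counts[feat] + 1) / denom)
            return score

        return max(classes, key=log_score)

    def restore(self, plain):
        return "".join(self.classify(plain, i) for i in range(len(plain)))


# Toy training words, invented for this example; note that the compound
# 'fíorchliste' itself is not among them.
model = CharModel()
model.train(["fíor", "fíoruisce", "fíoras", "chliste", "cliste", "glic", "tine", "mise"])
print(model.restore("fiorchliste"))
# fíorchliste
```

The first “i” is restored to “í” because the trigrams around it match those seen in the fíor words, while the second “i” stays plain because its context matches chliste and cliste.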

We are far from the first to have studied this problem from the machine learning point of view (full references are given in our paper), but this is the first time that models have been trained for so many languages, and made available in a form that will allow widespread adoption in many language communities.

We have done a detailed evaluation of the performance of the software for all of the languages (all the numbers are in the paper) and this raised a number of interesting issues.

First, we were only able to do this on such a large scale because of the availability of training text on the web in so many languages.   But experience has shown that web texts are much noisier than texts found in traditional corpora. Does this have an impact on the performance of statistical systems?   The short answer appears to be “yes,” at least for the problem of unicodification.   In cases where we had access to high quality corpora of books and newspaper texts, we achieved substantially better performance.

Second, it is probably no surprise that some languages are much harder than others.   A simple baseline algorithm is to leave everything as plain ASCII, and this performs quite well for languages like Dutch which have only a small number of words containing diacritics (this baseline gets 99.3% of words correct for Dutch).    In Figure 1 we plot the word-level accuracy of our system against this baseline.
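Word-level accuracy against the leave-as-ASCII baseline can be computed along the following lines. This is a sketch of the metric as described here, not the evaluation script from the paper; the function names and the ten-word sample are made up for the example, and the 99.3% figure above comes from real Dutch data, not from this toy list.

```python
import unicodedata


def asciify(word):
    """Strip combining marks, e.g. 'één' -> 'een'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_accuracy(system, gold_words):
    """Fraction of gold words recovered exactly from their ASCII form."""
    gold_words = [unicodedata.normalize("NFC", w) for w in gold_words]
    correct = sum(1 for gold in gold_words if system(asciify(gold)) == gold)
    return correct / len(gold_words)


def ascii_baseline(ascii_word):
    """The simplest baseline: leave every word as plain ASCII."""
    return ascii_word


# Invented sample: only 'één' carries diacritics.
gold = "een één twee drie vier vijf zes zeven acht negen".split()
print(word_accuracy(ascii_baseline, gold))
# 0.9
```

Any restoration function that maps an ASCII word to a restored word can be passed in place of `ascii_baseline`, which makes it easy to compare systems on the same gold text.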

But recall there are really two models at play, and we could ask about the relative contribution of, say, the character-level model to the performance of the system.   With this in mind, we introduce a second “baseline” which omits the character-level model entirely.   More precisely, given an ASCII word as input, it chooses the most common unicodification that was seen in the training data, and leaves the word as ASCII if there were no candidate unicodifications in the training data.   In Figure 2 we plot the word-level accuracy of our system against this improved baseline.  We see that the contribution of the character model is really quite small in most cases, and not surprisingly several of the languages where it helps the most are morphologically quite complex, like Hungarian and Turkish (though Vietnamese is not).  In quite a few cases, the character model actually hurts performance, although our analyses show that this is generally due to noise in the training data: a lot of noise in web texts is English (and hence almost pure ASCII) so the baseline will outperform any algorithm that tries to add diacritics.
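The improved baseline described above amounts to a lookup table from each ASCII form to its most frequent restoration. A minimal sketch, with hypothetical function names and an invented word list:

```python
import unicodedata
from collections import Counter, defaultdict


def asciify(word):
    """Strip combining marks, e.g. 'mó' -> 'mo'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train_most_frequent(corpus_words):
    """Map each ASCII form to the restoration seen most often in training."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        word = unicodedata.normalize("NFC", word)
        counts[asciify(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def most_frequent_baseline(table, ascii_word):
    """No context and no character model: look up, else leave as ASCII."""
    return table.get(ascii_word, ascii_word)


# Invented training words: 'mo' occurs three times, 'mó' twice.
table = train_most_frequent("mo mo mo mó mó níos".split())
print(most_frequent_baseline(table, "nios"))         # níos
print(most_frequent_baseline(table, "mo"))           # mo
print(most_frequent_baseline(table, "fiorchliste"))  # fiorchliste
```

Because this baseline ignores context, it would return mo even after níos, and because it has no character model, unseen words such as fiorchliste are left untouched. Those are the two gaps that the n-gram and character-level models are meant to close.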

The Firefox add-on works by communicating with the web service via its stable API, and we have a number of other clients including a vim plugin (written by fellow St. Louisan Bill Odom) and Perl, Python, and Haskell implementations.    We hope that developers interested in supporting language communities around the world will consider integrating this service in their own software.

Please feel free to contact us with any questions, comments, or suggestions.

  • dpl

    nice, one question: wouldn’t you say it’d be even more useful as thunderbird addon?
    have you thought about it?

  • Michael Schade

    Hi dpl, thanks for writing. It would definitely be very useful as a Thunderbird add-on, and this is something we want to do. Really, the bulk of the code is there since it’s a Firefox add-on, we just need to make some modifications to let it hook into the Thunderbird menu system and gather text.

    Currently, we’re working on a new version of the Firefox add-on that’ll remove the need to manually accentuate, but I’ll see about pushing up the priority of the Thunderbird add-on so it makes it into one of the upcoming releases. Thanks for writing!

  • dpl

    thanks for answering :)

  • dpl

    and another question: in theory, does that mean your server gets to read everything i type while using the plugin? how do you deal with privacy issues?

  • Michael Schade

    Sure, great question! While technically the server “sees” anything you accentuate, there is no code that actually stores this data (with the notable exception of people providing feedback, which is opt-in every time and only stores the selected text plus a context word on either side). The source code is publicly available, so people are always free to double check us on this.

    So, we never store text without the user’s permission, and the only case in which we would consider doing so is if the user wants to provide feedback text with the proper diacritics in place so that we can feed them into the system later to improve results.

    Regarding overall privacy, due to the community-hosted nature of our solution in which language communities can donate servers for hosting, we route all traffic first through servers that we personally run. Given our personal promise that we abide by the above policy on not storing user data, we consider these proxy servers to be fully trusted access points. From here, we forward the accentuation on to the language community servers, in the process decoupling user agent/IP/other identifying information from the text itself, which helps reduce security concerns (e.g., understanding a user’s full conversation by piecing together smaller accentuations), since the language community-run servers that actually do the accentuation don’t know who the client is for each request.

    Stepping away from privacy for a moment, these proxy servers also have the super nice benefit of allowing us to instantly add new servers, remove problem servers (e.g., security issue, downtime, etc.), and load balance requests accordingly.

    If you want more information, our privacy policy is available to read in full. I of course welcome any more questions about that or anything else.

    Nice hearing again from you :)

  • Michael Schade

    Also, I just wanted to point out for clarity that this add-on only sends things off to us when you request it to do so. That is, you have to right click and choose to accentuate a given text field or selection of text before it’s ever sent to our servers in the first place, so you get to choose what data is sent and what isn’t.

  • Stephany Filimon Wilkes

    I haven’t played with Accentuate yet, but want to thank you for the excellent paper and post and nicely designed solution. Nicely done.

  • Michael Schade

    Thanks for writing in, Stephany, I appreciate your kind words! If you’re wanting a quick way to try it and have a Twitter account, we actually just launched a Twitter demo account: @accentuateus

    You can try any of the supported languages (the ones listed in green) by just tweeting your text followed by @accentuateus and the language code.

    For example: “My tu bo ke hoach la chan ten lua @accentuateus vie” and it’ll tweet you back, that way you can try it without having to install anything.

  • Matthew R. Goodman

    You mention you are interested in hearing from people in the DM field.  What is the best way to get in contact with you guys?

    Thanks, and nice blog.

  • Philip Resnik

    I’m afraid this post, and the paper, are misrepresenting the prior art.  I don’t doubt that there is practical value in the solution promoted here, but Yarowsky’s work, cited briefly in the paper, applied purely data-driven machine learning techniques to this problem on a large scale more than 15 years ago.  In their literature review, the authors lump that paper (Yarowsky 1994) in with others that “rely on pre-existing NLP resources such as electronic dictionaries and part-of-speech taggers”, but this suggests they may not have read that paper with care.  Yarowsky’s paper utilized absolutely no pre-existing dictionaries, taggers, or any other resources; the range of possible diacritizations and contextual evidence for their disambiguation were discovered solely from statistical analysis of 49 million words of monolingual Spanish text and 20 million words of monolingual French text, which was very large coverage for the era.  Again, I’m sure there is significant value in the scaling up that’s done here, but I do hope the authors will amend their attribution of credit when it comes to innovations in approaches to this problem.

  • Kevin Scannell

    Philip, there was certainly no intent to misrepresent prior work on this problem. I’ll rephrase the comment following the citation to Yarowsky’s paper and the others – I was trying to distinguish earlier approaches from ones based on character n-grams.  “These papers all rely on pre-existing NLP resources…” is, I agree, incorrect as stated.
