Naive Language Detector

Last week, as I was working on my new project ‘Complete’ (a personalized autocomplete extension for Gmail), I was searching for a solution that could correctly detect the language of a text. I thought finding one would be easy, since I only needed it to work on long texts.

The first solution I considered might have fit the project’s needs, had it not been based on the NLTK stopwords corpus and limited to only 14 languages. Besides this solution, I found a few others, which were a bit too heavy or complex for my needs. Not being entirely satisfied with the available solutions, I set out to build my own. You can find my code here and some more details about it throughout this post.

 1. ‘data.json’:

In my code there is a file called ‘data.json’, which is in fact the model for my solution. It was built by taking the 1,000 most common words for each language from here, and then filtering them down to the 50 most common words that are unique to each language.
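
For illustration, here is a minimal sketch of how such a model could be built. This is my own reconstruction, not the original build script; it assumes the raw frequency lists have already been collected into a word_lists dict mapping each language to its 1,000 most common words, ordered by frequency, and the input file name is hypothetical:

    import json

    def build_model(word_lists, size=50):
        model = {}
        for lang, words in word_lists.items():
            # Words that also appear in another language's list are not
            # discriminative, so collect them for filtering.
            other_words = set()
            for other_lang, other in word_lists.items():
                if other_lang != lang:
                    other_words.update(other)
            # Keep the most common words that are unique to this language.
            unique = [w for w in words if w not in other_words]
            model[lang] = unique[:size]
        return model

    # Hypothetical input file holding the raw frequency lists:
    with open('word_lists.json') as f:
        word_lists = json.load(f)
    with open('data.json', 'w') as f:
        json.dump(build_model(word_lists), f, ensure_ascii=False)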

 2. ‘language_detector.py’:

This is my Python code that uses the model. Step A: the code converts the whole text into a set of tokens (by splitting the text into sentences and then each sentence into a list of tokens). Step B: the code computes the intersection between the text’s token set and each language’s word set in my model, and keeps the language with the highest overlap (which is the result).
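
As a rough sketch, the two steps could look like this. This is my own reconstruction of the logic above, not the exact code from the repository; the lowercasing and the minimum-overlap threshold are my assumptions:

    import json
    from nltk.tokenize import sent_tokenize, word_tokenize  # needs NLTK 'punkt' models

    with open('data.json') as f:
        MODEL = {lang: set(words) for lang, words in json.load(f).items()}

    def detect_language(text, threshold=3):
        # Step A: split the text into sentences, then each sentence into tokens.
        tokens = set()
        for sentence in sent_tokenize(text):
            tokens.update(token.lower() for token in word_tokenize(sentence))
        # Step B: score each language by the size of its intersection
        # with the text's token set, and keep the best one.
        best_lang, best_score = None, 0
        for lang, words in MODEL.items():
            score = len(tokens & words)
            if score > best_score:
                best_lang, best_score = lang, score
        # Return None when nothing matches well enough (no false positives).
        return best_lang if best_score >= threshold else None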

 3. Comments about my solution:

  • As I already mentioned, this solution works only on long texts (emails, news articles, full documents…), not phrases or single sentences.
  • Theoretically, this solution should not produce false positives. It returns ‘None’ if it cannot identify the language of the given text.
  • In my code, I used NLTK to split the text into tokens. If you don’t want to use it, you can just use the built-in Python .split() method.
  • Feel free to use my ‘data.json’ model and implement my code in any other programming language.

Have fun,

Shlomi Babluki.

How To Easily Recognize People’s Names In A Text

Named Entity Recognition (or just NER) is one of the more traditional tasks in Natural Language Processing. The definition of the task is very simple: build an automatic tool that can recognize and classify names in any piece of text.

For example:

Input: “Barack Obama is the president of the United States of America”

Output: “[Barack Obama] (Person) is the president of the [United States of America] (Location)”

Today NER is considered a “solved problem”, since the accuracy of modern NER systems is over 90%! However, I decided to take it as a case study, to show you how important it is to have a good understanding of the NLP problem you want to solve before writing even a single line of code.

The “Problem” with Traditional NER systems

Traditional NER systems are quite heavy and complex. They are built using training data sets, statistical methods, heuristics and complex algorithms. Moreover, when people started to work on these systems 30 or 40 years ago, having a simple dictionary with all the possible names in the world was not an option at all!

But today, in some cases, the story is quite different…

Case 1 – News articles

Let’s take the definition from the beginning of this post and change it a little bit:

Build an automatic tool that can recognize and classify people’s names in news articles.

So yes, any traditional NER system can solve this task, but in this case a much simpler solution might also work pretty well. Take a look at my Python code for dummy NER for news articles.

I just use a simple regular expression to extract strings that “look like names”, and then validate them with the Freebase API.
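
The extraction step can be approximated in a few lines. This is my own sketch; the exact pattern in the original code may differ, and the Freebase lookup is reduced to a stub here, since the Freebase API has since been shut down:

    import re

    # Sequences of two to four capitalized words "look like" full names.
    NAME_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+ ){1,3}[A-Z][a-z]+\b")

    def candidate_names(text):
        return set(NAME_PATTERN.findall(text))

    def is_person(candidate):
        # Placeholder for the Freebase API validation described above:
        # look the candidate up and check whether the top result is
        # typed as a person. Any local name dictionary could be
        # substituted here.
        raise NotImplementedError

    def extract_names(text):
        return [name for name in candidate_names(text) if is_person(name)]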

I ran it on this article and got these results:

Jon Lester, Alex Rodriguez, Ryan Dempster, Joe Girardi, Jake Peavy

Comments about my code:

  1. Obviously it runs quite slowly, since it makes external API calls, but if you really want to, you can find a way to download Freebase’s data (or something similar, like Wikipedia’s) and run it locally.
  2. I’m sure you can improve it a little by adding some special cases, like handling ’s at the end of a name or ignoring middle names, etc. (a trivial example of the first case follows this list).
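
For example, the possessive case from point 2 is a one-line fix on each candidate before validation (a trivial sketch of my own):

    import re

    def strip_possessive(candidate):
        # "Barack Obama's" -> "Barack Obama"
        return re.sub(r"[’']s$", "", candidate)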

Why does it work?

  1. First, we only want to recognize people’s names.
  2. According to a little research I did over a year ago, the number of names mentioned in news articles at any single point in time is only around 20,000 (a very small set of names!).
  3. Whenever a new name comes up in the news, someone will probably add it to Freebase / Wikipedia within just a few hours.
  4. Usually, in every news article, the full names of the people in the text (“Barack Obama”, not just “Obama”) are written at least once (most likely near the beginning).

Case 2 – Any Article?

As I already said, 30 or 40 years ago having a dictionary with all the possible names in the world wasn’t a real option, but today we actually do have a very good dictionary of names – Facebook users!

So I took my code and added another layer that uses the Facebook API to validate names. You can grab my full code from here.
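
Reusing the helpers from the earlier sketch, the extra layer is just one more check per candidate. Again, this is my own reconstruction, with the actual Facebook API call reduced to a stub, since it needs an access token and its endpoints have changed over the years:

    def is_facebook_user(candidate):
        # Placeholder for a Facebook API user search on the candidate
        # string; a non-empty result set counts as a valid name.
        raise NotImplementedError

    def extract_names_any_text(text):
        # Freebase catches the newsworthy names; the Facebook layer
        # acts as a fallback for everyone else.
        return [name for name in candidate_names(text)
                if is_person(name) or is_facebook_user(name)]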

This time I ran it on this article, and got these results:

Oz Katz, Emily Engelson, Shlomi Babluki, Lior Degani, Ohad Frankfurt, Shira Abel, Kevin Systrom

So now we theoretically have: an automatic tool that can recognize and classify people’s names in any text.

Conclusion

The idea behind this post was to show you that sometimes a small change in the problem can lead to a much simpler, more naive solution. The key in such cases is to analyze the problem deeply before even starting to think about possible solutions.

Welcome to the NLP World

For some reason I feel that NLP (Natural Language Processing) is considered an “academic” field. While I don’t have a degree in this field, I do have quite a bit of practical experience. In the past few years I have developed several NLP systems: a public transportation route planner, a remote television program recorder, an appointment scheduling system, and a few others. I am proud to say I have developed real products that thousands of people use every day!

First I want to apologize to the academics, as you may not agree with many of the things in this post (or the ones that follow).

The goal of NLP systems:

In simple words, the goal of an NLP system is to convert (or “translate”) human text, such as a news article, text message, search request, or Facebook status, into a well-defined data structure that is readable by a computer.

A very simple example – a system that recognizes flight search requests:

Possible inputs (all describing the same flight):

  • I’m looking for a flight from Madrid, Spain to London, England
  • I’m looking for a flight from Spain, Madrid to England, London
  • I’m looking for a flight to London, England, from Madrid, Spain
  • I’m looking for a flight to England, London, from Spain, Madrid

The result:

<from>
  <city>Madrid</city>
  <country>Spain</country>
</from>
<to>
  <city>London</city>
  <country>England</country>
</to>

* Different inputs in a human language, all mapping to the same XML result.
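
To make the example concrete, here is a toy parser that handles exactly these four requests. It is a sketch of my own: the tiny gazetteer and the patterns are assumptions, and a real system would handle far more variation:

    import re

    FROM_RE = re.compile(r"from ([A-Z][a-z]+), ([A-Z][a-z]+)")
    TO_RE = re.compile(r"to ([A-Z][a-z]+), ([A-Z][a-z]+)")
    CITIES = {'Madrid', 'London'}  # toy gazetteer: decides which part is the city

    def parse_place(a, b):
        # Return (city, country) regardless of the order they were written in.
        return (a, b) if a in CITIES else (b, a)

    def parse_flight_request(text):
        from_city, from_country = parse_place(*FROM_RE.search(text).groups())
        to_city, to_country = parse_place(*TO_RE.search(text).groups())
        return '\n'.join([
            '<from>',
            f'  <city>{from_city}</city>',
            f'  <country>{from_country}</country>',
            '</from>',
            '<to>',
            f'  <city>{to_city}</city>',
            f'  <country>{to_country}</country>',
            '</to>',
        ])

    print(parse_flight_request(
        "I'm looking for a flight to England, London, from Spain, Madrid"))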

Usually the NLP system is not a standalone system, but one module of a larger system. In most cases the result of the NLP engine is used to retrieve some information (in our example: searching for a relevant flight in the schedule), and the final result is then sent back to the user.

The cycle of a standard system:

User → User input → NLP system → Database, Information center → Final Result → User
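
In code, that cycle is just a thin pipeline around the NLP module (hypothetical function names, reusing parse_flight_request from the sketch above, to show the structure only):

    def query_flight_schedule(parsed_request):
        # Placeholder: look the parsed request up in a real flight database.
        return ['(matching flights from the schedule would go here)']

    def format_answer(flights):
        return 'Found flights:\n' + '\n'.join(flights)

    def handle_request(user_input):
        parsed = parse_flight_request(user_input)  # the NLP system
        flights = query_flight_schedule(parsed)    # database / information center
        return format_answer(flights)              # final result back to the user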

More NLP systems:

For further reading, here are some well-known uses of NLP:

Semantic role labeling
Named-entity recognition (NER)
Document classification
Language identification

Timeout – My life outside the NLP world:

I would like to introduce you to LEGO Mindstorms, a nice kit for learning about the robotics world:

Follow me on Twitter or contact me: shlomibabluki@gmail.com.

Shlomi Babluki