NER | The Tokenizer

Named Entity Recognition (or just NER) is one of the more traditional tasks in Natural Language Processing. The definition of the task is very simple: build an automatic tool that can recognize and classify names in any piece of text.

For example:

Input: “Barack Obama is the president of the United States of America”

Output: “[Barack Obama] (Person) is the president of the [United States of America] (Location)”

Today NER is considered a “solved problem”, since the accuracy of modern NER systems is over 90%! However, I decided to take it as a case study to show you how important it is to understand the NLP problem you want to solve before writing a single line of code.

The “Problem” with Traditional NER systems

Traditional NER systems are quite heavy and complex. They are built using training data sets, statistical methods, heuristics and complex algorithms. Moreover, when people started to work on these systems 30 or 40 years ago, having a simple dictionary with all the possible names in the world was not an option at all!

But today, in some cases, the story is quite different…

Case 1 – News articles

Let’s take the definition from the beginning of this post and change it a little bit:

Build an automatic tool that can recognize and classify people’s names in news articles.

So yes, any traditional NER system can solve this task, but in this case a much simpler solution might also work pretty well. Take a look at my Python code for dummy NER for news articles.

I just use a simple regular expression to extract strings that “look like names” and then validate them with the Freebase API.
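To make the idea concrete, here is a minimal sketch of the two steps (extract candidates with a regular expression, then validate them). This is not the code linked above: the pattern, the `find_people` and `is_known_person` functions and the tiny `KNOWN_PEOPLE` set are all assumptions made for illustration, and the set only stands in for the external lookup.

```python
import re

# Two or more consecutive capitalized words, e.g. "Jon Lester".
# This pattern is an assumption, not the expression used in the original code.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")

# Hypothetical stand-in for the knowledge base (Freebase in this post).
KNOWN_PEOPLE = {"Jon Lester", "Alex Rodriguez", "Joe Girardi"}


def is_known_person(candidate):
    """Assumed validator; a real version would query an external source here."""
    return candidate in KNOWN_PEOPLE


def find_people(text):
    """Return validated names in order of first appearance, without duplicates."""
    found = []
    for candidate in NAME_PATTERN.findall(text):
        if is_known_person(candidate) and candidate not in found:
            found.append(candidate)
    return found


sample = "Jon Lester pitched for the Red Sox while Joe Girardi watched from New York."
print(find_people(sample))  # ['Jon Lester', 'Joe Girardi']
```

Note that the regular expression also picks up “Red Sox” and “New York”. That is fine, because the validation step is what throws them away.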

I ran it on this article and got these results:

Jon Lester, Alex Rodriguez, Ryan Dempster, Joe Girardi, Jake Peavy

Comments about my code:

Obviously it works quite slowly since it uses external API calls, but if you really want to, you can find a way to download Freebase’s data (or something similar, like Wikipedia) and run it locally.
I’m sure you can improve it a little bit by adding some special cases, like handling ’s at the end of a name or ignoring middle names.
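The two special cases mentioned above could look roughly like this. It is only a sketch: the function names are hypothetical, and dropping every word between the first and the last one is a crude heuristic that will mangle some names.

```python
import re


def strip_possessive(candidate):
    """Drop a trailing 's or ’s, so "Joe Girardi's" becomes "Joe Girardi"."""
    return re.sub(r"['’]s$", "", candidate)


def drop_middle_names(candidate):
    """Keep only the first and last word of a name with three or more words."""
    parts = candidate.split()
    if len(parts) > 2:
        return f"{parts[0]} {parts[-1]}"
    return candidate


def normalize(candidate):
    """Clean up a candidate before it is sent to the validation step."""
    return drop_middle_names(strip_possessive(candidate.strip()))


print(normalize("Joe Girardi's"))         # Joe Girardi
print(normalize("Barack Hussein Obama"))  # Barack Obama
```

Running the normalization before the lookup means both forms of a name end up as the same key.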

Why does it work?

First, we only want to recognize people’s names.
According to a little research I did over a year ago, the number of names mentioned in news articles at any single point in time is only around 20,000 (= a very small set of names!).
Whenever a new name comes up in the news, someone will probably add it to Freebase / Wikipedia within just a few hours.
Usually in every news article the full names of the people in the text (“Barack Obama” and not just “Obama”) are written at least once (most likely at the beginning).

Case 2 – Any Article?

As I already said, 30 or 40 years ago having a dictionary with all the possible names in the world wasn’t a real option, but today we actually do have a very good dictionary of names – Facebook users!

So I took my code and added another layer that uses the Facebook API to validate names. You can grab my full code from here.
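One way to picture the extra layer is as a list of validators that are tried in order. The sketch below assumes a candidate is accepted as soon as any layer recognizes it, which may not be exactly how the full code combines them. The two lookup functions and their sets are hypothetical stand-ins for the real API calls.

```python
# Hypothetical stand-ins for the two external sources.
KNOWLEDGE_BASE_NAMES = {"Kevin Systrom"}
SOCIAL_NETWORK_NAMES = {"Oz Katz", "Shira Abel"}


def in_knowledge_base(name):
    """Assumed first layer (Freebase in this post)."""
    return name in KNOWLEDGE_BASE_NAMES


def in_social_network(name):
    """Assumed second layer (Facebook in this post)."""
    return name in SOCIAL_NETWORK_NAMES


def validate_with_layers(candidate, validators):
    """Accept a candidate as soon as any validator recognizes it."""
    return any(validator(candidate) for validator in validators)


validators = [in_knowledge_base, in_social_network]
candidates = ["Kevin Systrom", "Oz Katz", "Tel Aviv"]
people = [c for c in candidates if validate_with_layers(c, validators)]
print(people)  # ['Kevin Systrom', 'Oz Katz']
```

Because `any` stops at the first match, the slower layer is only consulted for names the first one does not know.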

This time I ran it on this article, and got these results:

Oz Katz, Emily Engelson, Shlomi Babluki, Lior Degani, Ohad Frankfurt, Shira Abel, Kevin Systrom

So now we theoretically have an automatic tool that can recognize and classify people’s names in any text.

Conclusion

The idea behind this post was to show you that sometimes a small change in the problem can lead to a much simpler, more naive solution. The key in such cases is to deeply analyze the problem before even starting to think about possible solutions.


How To Easily Recognize People’s Names In A Text
