Naive Language Detector

Last week, as I was working on my new project ‘Complete‘ (A personalized autocomplete extension for Gmail), I was searching for a solution that would be able to correctly detect the language of a text. I thought finding one should be easy since I needed it to be able to work only on long texts.

The first solution I thought to incorporate could have fitted the project needs, had it not been based on the NLTK stopwords corpus, and supported only 14 languages. Besides this solution, I found a few other ones, which were a bit too heavy or complex for my needs. Not being entirely satisfied with the available solutions I set out to build my own one. You can find my code here and some more details about it throughout this post.

 1. ‘data.json’:

In my code there is a file called ‘data.json’, that is in fact the model for my solution. It was built by using the 1,000 most common words for each language from here, and then filtering from it the 50 most common and unique words for each language.

 2. ‘’:

This is my Python code that uses the model. Step A: my code converts the whole text into a set of tokens (by splitting the text into sentences and then each sentence into a list of tokens). Step B: my code checks the intersection between the token set of the text and each language set in my model and store the highest value (which will be the result).

 3. Comments about my solution:

  • As I already mentioned, this solution will work only with long texts (emails, news articles, full documents…), not phrases or single sentences.
  • Theoretically, this solution should not have ‘false-positive’. It returns ‘None’ if it cannot identify the language of the given text.
  • In my code, I used NLTK to split the text into tokens. If you don’t want to use it, you can just use regular Python .split() function.
  • Feel free to use my ‘data.json’ model and implement my code in any other programming language.

Have fun,

Shlomi Babluki.

How To Easily Recognize People’s Names In A Text

Named Entity Recognition (or just NER) is one of the more traditional tasks done with Natural Language Processing. The definition of the task is very simple :- build an automatic tool that can recognize and classify names in any piece of text.

For example:

Input: “Barack Obama is the president of the United States of America”

Output: “[Barack Obama] (Person) is the president of the [United States of America] (Location)”

Today NER is considered as a “Solved problem”, since the accuracy of modern NER systems is over 90%! However, I decided to take it as a case study to show you how important it is to have a good understanding of the NLP problem you want to solve before even starting to write a single line of code.

The “Problem” with Traditional NER systems

Traditional NER systems are quite heavy and complex. They are built using training sets of data, statistic methods, heuristics and complex algorithms. Moreover, when people started to work on these systems 30 or 40 years ago, having a simple dictionary with all the possible names in the world was not an option at all!

But today, in some cases, the story is quite different…

Case 1 – News articles

Lets take the definition from the beginning of this post and change it a little bit:

Building an automatic tool that can recognize and classify people’s names in any text news articles.

So yes, any traditional NER system can solve this task, but in this case a much simpler solution might also work pretty well. Take a look at my Python code for dummy NER for news articles.

I just use simple regular expression to extract strings that “look like names” and then validate them with the Freebase API.

I run it on this article, and got these results:

Jon Lester, Alex Rodriguez, Ryan Dempster, Joe Girardi, Jake Peavy

Comments about my code:

  1. Obviously it works quite slowly since it uses external API calls, but if you really want you can find a way to download Freebase’ data (or something similar like Wikipedia) and run it locally.
  2. I’m sure you can improve it a little bit by adding some special cases like handling ‘s at the end of the name or ignoring middle names… etc

Why does it work?

  1. First, we only want to recognize people’s names.
  2. According to a little research I did over a year ago, the number of names that are mentioned in news articles in a single point of time is around only 20,000 (= a very small set of names!).
  3. Whenever a new name comes up in the news, someone will probably add it to Freebase / Wikipedia within just a few hours.
  4. Usually in every news article the full names of the people in the text (“Barack Obama” and not just “Obama”) are written at least once (most likely at the beginning).

Case 2 – Any Article?

As I already said, 30 or 40 years ago having a dictionary with all the possible names in the world wasn’t a real option, but today we actually do have a very good dictionary of names – Facebook users!

So I took my code and added another layer that uses Facebook API to validate names. You can grab my full code from here.

This time I ran it on this article, and got these results:

Oz Katz, Emily Engelson, Shlomi Babluki, Lior Degani, Ohad Frankfurt, Shira Abel, Kevin Systrom

So now we theoretically have :- an automatic tool that can recognize and classify people’s names in any text.


The idea behind this post was to show you that sometimes a small change in the problem might lead into a much simpler and naive solution. The key in such cases is to deeply analyze the problem before even starting to think about the possible solutions.

An Efficient Way to Extract the Main Topics from a Sentence

Last week, while working on new features for our product, I had to find a quick and efficient way to extract the main topics/objects from a sentence. Since I’m using Python, I initially thought that it’s going to be a very easy task to achieve with NLTK. However, when I tried its default tools (POS tagger, Parser…), I indeed got quite accurate results, but performance was pretty bad. So I had to find a better way.

Like I did in my previous post, I’ll start with the bottom line – Here you can find my code for extracting the main topics/noun phrases from a given sentence. It works fine with real sentences (from a blog/news article). It’s a bit less accurate compared to the default NLTK tools, but it works much faster!

I ran it on this sentence –

“Swayy is a beautiful new dashboard for discovering and curating online content.”

And got this result –

This sentence is about: Swayy, beautiful new dashboard, online content

The first time you run the code, it loads the brown corpus into memory, so it might take a few seconds.


From the linguistic aspect, we usually say that the main “building blocks” of a sentence are Noun Phrases (NP) and Verb Phrases (VP). The Noun Phrases are usually the topics or objects in the sentence, or in simple words – this is what the sentence is talking about, while Verb Phrases describe some action between the objects in the sentence. Take this example:

“Facebook acquired Instagram”
About Who/What? – Facebook and Instagram > Noun Phrases
What happened? – acquired (=acquisition) > Verb Phrase

My goal was to extract only the Noun Phrases from the sentence, so I had to define some simple patterns which describe the structure of a Noun Phrase, for example:
NN = content
JJ+NN = visual content
NN+NN = content marketing

*NN = noun, JJ = adj…

Computer Science:

Now, I believe that some of you probably ask – “Wait! What? Why don’t you use parsing?”
So, first – you’re right! The known method to convert a sentence into Noun and Verb Phrases (or in other words – a tree..) is parsing. However, the problem with parsing algorithms is that their complexity is quite bad. For example CYK algorithm has the complexity of O(n^3 * |G|) !
The second problem is that full-parsing was a bit of an overkill for what I wanted to achieve.

My Solution:

First, I decided to define my own Part of Speech tagger. Luckily I found this article which was very useful.  Second, I decided to define some “Semi-CFG”, which holds the patterns of the Noun Phrases.

So in one sentence – My code just tags the sentence with my tagger, then searches for NP patterns in the sentence.


Here I’m going to give you a quick overview of my code:

bigram_tagger – I use the NLTK taggers classes to define my own tagger. As you can see it’s built from 3 different taggers and it’s trained with the brown corpus.

cfg – This is my “Semi-CFG”. It includes the basic rules to match a regular Noun Phrase.

tokenize_sentence – Split the sentence into tokens (single words).

normalize_tags – Since there are many tags in the brown corpus, I just rename some of them.

extract – This is our main method. Split the sentence, tag it, and search for patterns.

Lines 96-97 – The different between these lines, is that line 97 also accepts single nouns. The meaning of this condition, is that you’ll get more results per sentence – but some of the results will be false positive! You can overcome / ignore the false positive results by using words frequencies or by defining some special dictionary according to your needs.

The bottom line

As I already said, the best way to extract Noun/Verb phrases from a sentence is by using parsing. However, if you need to do it fast and you want to be able to process many sentences / full documents in a very short time – I suggest you to take an approach like the one I illustrated above.

Build your own summary tool!

After Yahoo! acquired Summly and Google acquired Wavii, there is no doubt that auto summarization technologies are a hot topic in the industry. So as a NLP freak, I decided  to give a quick overview and hands-on experience on how these technologies actually work.

Since some of you might not read the entire post, I decided to start with the bottom line – Here you can find my Python implementation for a very naive summarization algorithm. This algorithm extracts the key sentence from each paragraph in the text. The algorithm works nicely on news & blog articles, and it usually cuts around 60-70% of the original text. As an example I ran it on this article (about my company, Swayy), and got the following results:

Swayy is a beautiful new dashboard for discovering and curating online content [Invites]

Lior Degani, the Co-Founder and head of Marketing of Swayy, pinged me last week when I was in California to tell me about his startup and give me beta access.
One week later, I’m totally addicted to Swayy and glad I said nothing about the spam (it doesn’t send out spam tweets but I liked the line too much to not use it for this article).
What is Swayy? It’s like Percolate and LinkedIn recommended articles, mixed with trending keywords for the topics you find interesting, combined with an analytics dashboard that shows the trends of what you do and how people react to it.
After I decided that I trusted the service, I added my Facebook and LinkedIn accounts.
The analytics side isn’t as interesting for me right now, but that could be due to the fact that I’ve barely been online since I came back from the US last weekend.
It was the suggested content that impressed me the most.
Yes, we’re in the process of filing a patent for it.
Ohad Frankfurt is the CEO, Shlomi Babluki is the CTO and Oz Katz does Product and Engineering, and I [Lior Degani] do Marketing.
➤ Want to try Swayy out without having to wait? Go to this secret URL and enter the promotion code thenextweb

Original Length 4529

Summary Length 1307
Summary Ratio: 71.1415323471

Summarization Technologies:

Today there are two common approaches to “attacking” the summarization mission. The first approach tries to analyze the text, and to rewrite or rephrase it in a short way. As far as I know, until today this approach didn’t achieve any substantial results. The second approach ,which is similar to my naive algorithm, tries to extract the key sentences from the text, and then tries to put them together properly. One famous algorithm that implements this approach is TextRank.

Our Summarization Algorithm

I’m going to explain step-by-step my naive algorithm. I’ll use both programming and computer science terminology. Before you continue, in case you didn’t do it already, I suggest to to take a quick look at the code.

The intersection function:

This function receives two sentences, and returns a score for the intersection between them.
We just split each sentence into words/tokens, count how many common tokens we have, and then we normalize the result with the average length of the two sentences.
Computer Science: f(s1, s2) = |{w | w in s1 and w in s2}| / ((|s1| + |s2|) / 2)

The sentences dictionary:

This part is actually the “Heart” of the algorithm. It receives our text as input, and calculates a score for each sentence. The calculations is composed of two steps:

In the first step we split the text into sentences, and store the intersection value between each two sentences in a matrix (two-dimensional array). So values[0][2] will hold the intersection score between sentence #1 and sentence #3.
Computer Science: In fact, we just converted our text into a fully-connected weighted graph! Each sentence is a node in the graph and the two-dimensional array holds the weight of each edge!

In the second step we calculate an individual score for each sentence and store it in a key-value dictionary, where the sentence itself is the key and the value is the total score. We do that just by summing up all its intersections with the other sentences in the text (not including itself).
Computer Science: We calculates the score for each node in our graph. We simply do that by summing all the edges that are connected to the node.

Building the summary:

Obviously, the final step of our algorithm is generating the final summary. We do that by splitting our text into paragraphs, and then we choose the best sentence from each paragraph according to our sentences dictionary.
Computer Science: The Idea here is that every paragraph in the text represents some logical subset of our graph, so we just pick the most valuable node from each subset!

Why this works

There are two main reasons why this algorithm works: The first (and obvious) reason is that a paragraph is a logical atomic unit of the text. In simple words – there is probably a very good reason why the author decided to split his text that way. The second (and maybe less obvious..) reason is that if two sentences have a good intersection, they probably holds the same information. So if one sentence has a good intersection with many other sentences, it probably holds some information from each one of them- or in other words, this is probably a key sentence in our text!

Your turn!

If you read until here, you probably want to play a little bit with the algorithm. Here are a few things you can try to do:

1. In order to make it more useful, you may wrap it with some tool that extracts content from a URL. I personally like Goose, but you may use other tools like Readability, boilerpipe or others. It obviously won’t improve your algorithm, but it can be really nice just to pick a URL and see the resulting summary.

2. In my code I intentionally didn’t use any other packages. You can explore the NLTK or OpenNLP packages and use their methods for splitting and tokenizing text. They usually provide much better methods for that.

3. Play with the intersection function. As I already wrote, you may use stopwords or stemming (these both tools are included in NLTK!) and see how they change the result. You can also play with the normalization part of the equation (try to divide the result with different factors).

4. Create a new variation of the algorithm. For example, instead of picking the best sentence from each paragraph, try and pick the 2-3 most important paragraphs (In this case- each node of your graph is a full paragraph, instead of a single sentence!)

5. Use the title! – In my code I just print it at the top of the summary, but I’m sure it can be useful (For example – try to add it with some factor in your intersection function).

6. Check it on other input languages – Although I tested this code only on English, theoretically it should work in any other language!

Obviously this algorithm is just a simple Proof of concept, but I hope it gave you some general knowledge about how summarization technologies works. I also have to mention that in this post I introduced only one approach to complete this task, while indeed there are a few others!

Leave your comments below, or just contact me directly on Twitter.

Shlomi Babluki.