python | The Tokenizer

Naive Language Detector

Last week, while working on my new project ‘Complete’ (a personalized autocomplete extension for Gmail), I was looking for a way to reliably detect the language of a text. I thought finding one would be easy, since it only needed to work on long texts.

The first solution I considered might have fit the project's needs, had it not been based on the NLTK stopwords corpus, which limited it to only 14 languages. I also found a few other solutions, but they were too heavy or complex for my needs. Not entirely satisfied with what was available, I set out to build my own. You can find my code here, and more details about it throughout this post.

1. ‘data.json’:

In my code there is a file called ‘data.json’, which is in fact the model for my solution. I built it by taking the 1,000 most common words for each language from here, and then filtering that list down to the 50 most common words that are unique to each language.
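A minimal sketch of how such a model could be built, assuming "unique" means "does not appear in any other language's list". The function name `build_model`, the `per_language` parameter, and the tiny word lists are hypothetical and only illustrate the filtering step; they are not taken from the original code.

```python
from collections import Counter

def build_model(common_words, per_language=50):
    """Keep, for each language, its most frequent words that no other language shares.

    common_words maps a language name to a list of words, most frequent first.
    """
    # Count in how many languages each word appears.
    language_count = Counter()
    for words in common_words.values():
        language_count.update(set(words))

    model = {}
    for language, words in common_words.items():
        # Preserve frequency order, drop words shared with another language.
        unique = [w for w in words if language_count[w] == 1]
        model[language] = unique[:per_language]
    return model

# Toy input (assumption): real lists would hold 1,000 words per language.
sample = {
    "english": ["the", "of", "and", "no", "in"],
    "spanish": ["de", "la", "que", "no", "en"],
}
model = build_model(sample, per_language=3)
```

Here "no" is dropped from both lists because it appears in both languages, which is the property that lets the detector treat every model word as evidence for exactly one language.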

2. ‘language_detector.py’:

This is my Python code that uses the model. It works in two steps:

Step A: the code converts the whole text into a set of tokens (by splitting the text into sentences and then each sentence into a list of tokens).
Step B: the code computes the intersection between the token set of the text and each language set in the model, and keeps the language with the largest intersection (which will be the result).
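The two steps can be sketched roughly as follows. This is an illustration under assumptions, not the original ‘language_detector.py’: it tokenizes with a regular expression instead of NLTK, and the names `tokenize`, `detect_language` and `toy_model` are hypothetical.

```python
import re

def tokenize(text):
    # Step A: turn the whole text into a set of lowercase tokens.
    # (The post uses NLTK for this; a regex is a simpler stand-in.)
    return set(re.findall(r"\w+", text.lower()))

def detect_language(text, model):
    # Step B: intersect the text tokens with each language's word set
    # and keep the language with the largest overlap.
    tokens = tokenize(text)
    best_language, best_score = None, 0
    for language, words in model.items():
        score = len(tokens & set(words))
        if score > best_score:
            best_language, best_score = language, score
    # None means no model word was found in the text.
    return best_language

# Toy model (assumption): the real data.json holds 50 words per language.
toy_model = {
    "english": ["the", "and", "with"],
    "spanish": ["el", "que", "con"],
}
```

Calling `detect_language("El perro que corre con el gato", toy_model)` picks the Spanish set, while a text with no model words at all yields `None`.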

3. Comments about my solution:

As I already mentioned, this solution will work only with long texts (emails, news articles, full documents…), not phrases or single sentences.
Theoretically, this solution should not produce false positives. It returns ‘None’ if it cannot identify the language of the given text.
In my code, I used NLTK to split the text into tokens. If you don’t want to use it, you can just use Python’s built-in .split() method.
Feel free to use my ‘data.json’ model and implement my code in any other programming language.

Have fun,

Shlomi Babluki.

Build your own summary tool!

After Yahoo! acquired Summly and Google acquired Wavii, there is no doubt that auto-summarization technologies are a hot topic in the industry. So, as an NLP freak, I decided to give a quick overview of how these technologies actually work, along with some hands-on experience.

Since some of you might not read the entire post, I decided to start with the bottom line: here you can find my Python implementation of a very naive summarization algorithm. This algorithm extracts the key sentence from each paragraph in the text. It works nicely on news and blog articles, and it usually cuts around 60-70% of the original text. As an example, I ran it on this article (about my company, Swayy), and got the following results:

Swayy is a beautiful new dashboard for discovering and curating online content [Invites]

Lior Degani, the Co-Founder and head of Marketing of Swayy, pinged me last week when I was in California to tell me about his startup and give me beta access.
One week later, I’m totally addicted to Swayy and glad I said nothing about the spam (it doesn’t send out spam tweets but I liked the line too much to not use it for this article).
What is Swayy? It’s like Percolate and LinkedIn recommended articles, mixed with trending keywords for the topics you find interesting, combined with an analytics dashboard that shows the trends of what you do and how people react to it.
After I decided that I trusted the service, I added my Facebook and LinkedIn accounts.
The analytics side isn’t as interesting for me right now, but that could be due to the fact that I’ve barely been online since I came back from the US last weekend.
It was the suggested content that impressed me the most.
Yes, we’re in the process of filing a patent for it.
Ohad Frankfurt is the CEO, Shlomi Babluki is the CTO and Oz Katz does Product and Engineering, and I [Lior Degani] do Marketing.
➤ Want to try Swayy out without having to wait? Go to this secret URL and enter the promotion code thenextweb

Original Length 4529
Summary Length 1307
Summary Ratio: 71.1415323471

Summarization Technologies:

Today there are two common approaches to “attacking” the summarization problem. The first approach tries to analyze the text and to rewrite or rephrase it in a shorter form. As far as I know, this approach has not yet achieved any substantial results. The second approach, which is similar to my naive algorithm, tries to extract the key sentences from the text and then put them together properly. One famous algorithm that implements this approach is TextRank.

Our Summarization Algorithm

I’m going to explain my naive algorithm step by step, using both programming and computer science terminology. Before you continue, in case you haven’t done so already, I suggest you take a quick look at the code.

The intersection function:

This function receives two sentences, and returns a score for the intersection between them.
We just split each sentence into words/tokens, count how many tokens the two sentences have in common, and then normalize the result by the average length of the two sentences.
Computer Science: f(s1, s2) = |{w | w in s1 and w in s2}| / ((|s1| + |s2|) / 2)
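A minimal sketch of the formula above, assuming whitespace tokenization and measuring each sentence's length as its number of distinct tokens. The function name `sentences_intersection` is a hypothetical label, and the empty-sentence guard is an addition to avoid dividing by zero.

```python
def sentences_intersection(sent1, sent2):
    # Split each sentence into a set of tokens.
    s1 = set(sent1.split())
    s2 = set(sent2.split())

    # Guard against empty input, which would make the average length zero.
    if not s1 or not s2:
        return 0.0

    # Number of common tokens, normalized by the average sentence length.
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)
```

For example, "the cat sat" and "the cat ran" share two of their three tokens, so the score is 2 / 3. Identical sentences score 1, and sentences with nothing in common score 0.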

The sentences dictionary:

This part is actually the “heart” of the algorithm. It receives our text as input and calculates a score for each sentence. The calculation consists of two steps:

In the first step we split the text into sentences, and store the intersection value between each pair of sentences in a matrix (two-dimensional array). So values[0][2] will hold the intersection score between sentence #1 and sentence #3.
Computer Science: In fact, we just converted our text into a fully-connected weighted graph! Each sentence is a node in the graph and the two-dimensional array holds the weight of each edge!

In the second step we calculate an individual score for each sentence and store it in a key-value dictionary, where the sentence itself is the key and the value is the total score. We do that just by summing up all its intersections with the other sentences in the text (not including itself).
Computer Science: We calculate the score for each node in our graph. We simply do that by summing the weights of all the edges that are connected to the node.
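The two steps can be sketched as follows. This is an illustration under assumptions rather than the original code: sentences are split with a naive regular expression, and the helper names `split_sentences`, `build_matrix` and `score_sentences` are hypothetical.

```python
import re

def intersection(sent1, sent2):
    s1, s2 = set(sent1.split()), set(sent2.split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def build_matrix(sentences):
    # Step 1: values[i][j] is the intersection score of sentences i and j,
    # i.e. the weight of the edge between nodes i and j.
    return [[intersection(a, b) for b in sentences] for a in sentences]

def score_sentences(sentences):
    # Step 2: a sentence's score is the sum of its row, excluding itself.
    values = build_matrix(sentences)
    n = len(sentences)
    return {
        sentences[i]: sum(values[i][j] for j in range(n) if j != i)
        for i in range(n)
    }

text = "The cat sat on the mat. The cat ate the fish. Dogs bark loudly."
sentences = split_sentences(text)
values = build_matrix(sentences)
scores = score_sentences(sentences)
```

In this toy text the first two sentences share tokens and get a positive score, while "Dogs bark loudly." shares nothing with the others and scores zero. Note that using the sentence itself as the dictionary key means a sentence repeated verbatim is stored only once.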

Building the summary:

Obviously, the final step of our algorithm is generating the final summary. We do that by splitting our text into paragraphs, and then we choose the best sentence from each paragraph according to our sentences dictionary.
Computer Science: The idea here is that every paragraph in the text represents some logical subset of our graph, so we just pick the most valuable node from each subset!
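Putting the pieces together, the whole pipeline could look roughly like this. It is a sketch under assumptions, not the original implementation: paragraphs are assumed to be separated by a blank line, sentences are split with a naive regular expression, and all function names are hypothetical.

```python
import re

def intersection(sent1, sent2):
    s1, s2 = set(sent1.split()), set(sent2.split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def score_sentences(sentences):
    # Score of a sentence = sum of its intersections with all other sentences.
    return {
        sentence: sum(
            intersection(sentence, other)
            for j, other in enumerate(sentences) if j != i
        )
        for i, sentence in enumerate(sentences)
    }

def summarize(text):
    # Split into paragraphs (blank-line separated), then into sentences.
    paragraphs = [split_sentences(p) for p in text.split("\n\n")]
    paragraphs = [p for p in paragraphs if p]

    # Score every sentence against the whole text, not just its paragraph.
    all_sentences = [s for p in paragraphs for s in p]
    scores = score_sentences(all_sentences)

    # Keep the best-scoring sentence of each paragraph.
    return "\n".join(max(p, key=scores.get) for p in paragraphs)

text = (
    "Cats are small pets. Cats and dogs are popular pets. Birds sing.\n\n"
    "Dogs are loyal pets. The weather is nice."
)
summary = summarize(text)
```

On this toy input the summary keeps "Cats are small pets." from the first paragraph and "Dogs are loyal pets." from the second, because those sentences overlap most with the rest of the text.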

Why this works

There are two main reasons why this algorithm works. The first (and obvious) reason is that a paragraph is a logical atomic unit of the text. In simple words, there is probably a very good reason why the author decided to split the text that way. The second (and maybe less obvious) reason is that if two sentences have a good intersection, they probably hold the same information. So if one sentence has a good intersection with many other sentences, it probably holds some information from each one of them. In other words, it is probably a key sentence in our text!

Your turn!

If you have read this far, you probably want to play with the algorithm a little. Here are a few things you can try:

1. In order to make it more useful, you may wrap it with some tool that extracts content from a URL. I personally like Goose, but you may use other tools like Readability, boilerpipe or others. It obviously won’t improve your algorithm, but it can be really nice just to pick a URL and see the resulting summary.

2. In my code I intentionally didn’t use any other packages. You can explore the NLTK or OpenNLP packages and use their methods for splitting and tokenizing text. They usually provide much better methods for that.

3. Play with the intersection function. For example, you may use stopwords or stemming (both of these tools are included in NLTK!) and see how they change the result. You can also play with the normalization part of the equation (try dividing the result by different factors).

4. Create a new variation of the algorithm. For example, instead of picking the best sentence from each paragraph, try to pick the 2-3 most important paragraphs (in this case, each node of your graph is a full paragraph instead of a single sentence!).

5. Use the title! – In my code I just print it at the top of the summary, but I’m sure it can be useful (For example – try to add it with some factor in your intersection function).

6. Check it on other input languages – Although I tested this code only on English, theoretically it should work in any other language!
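As a starting point for suggestion 3, here is one possible variation of the intersection function that ignores stopwords. It is only a sketch: the tiny `STOPWORDS` set is a hand-written placeholder (NLTK ships a much fuller list), and the function names are hypothetical.

```python
# Placeholder stopword list (assumption); replace with a real one, e.g. from NLTK.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "on", "to"}

def content_tokens(sentence):
    # Lowercase, split on whitespace, and drop stopwords.
    return {w for w in sentence.lower().split() if w not in STOPWORDS}

def intersection_without_stopwords(sent1, sent2):
    t1, t2 = content_tokens(sent1), content_tokens(sent2)
    if not t1 or not t2:
        return 0.0
    # Same normalization as before: common tokens over average length.
    return len(t1 & t2) / ((len(t1) + len(t2)) / 2)
```

With this variation, "the cat is on the mat" and "the dog is on the rug" no longer look similar just because they share "the", "is" and "on": their score drops to 0.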

Obviously this algorithm is just a simple proof of concept, but I hope it gave you some general knowledge about how summarization technologies work. I should also mention that this post introduces only one approach to this task, and there are several others!

Leave your comments below, or just contact me directly on Twitter.

Shlomi Babluki.

Tag Archives: python

Naive Language Detector

An Algorithm For Generating Automatic Hashtags

Build your own summary tool!
