Build your own summary tool!

After Yahoo! acquired Summly and Google acquired Wavii, there is no doubt that automatic summarization technologies are a hot topic in the industry. So, as an NLP freak, I decided to give a quick overview of how these technologies actually work, along with some hands-on experience.

Since some of you might not read the entire post, I decided to start with the bottom line – here you can find my Python implementation of a very naive summarization algorithm. The algorithm extracts the key sentence from each paragraph in the text. It works nicely on news and blog articles, and it usually cuts around 60-70% of the original text. As an example, I ran it on this article (about my company, Swayy) and got the following results:


Swayy is a beautiful new dashboard for discovering and curating online content [Invites]

Lior Degani, the Co-Founder and head of Marketing of Swayy, pinged me last week when I was in California to tell me about his startup and give me beta access.
One week later, I’m totally addicted to Swayy and glad I said nothing about the spam (it doesn’t send out spam tweets but I liked the line too much to not use it for this article).
What is Swayy? It’s like Percolate and LinkedIn recommended articles, mixed with trending keywords for the topics you find interesting, combined with an analytics dashboard that shows the trends of what you do and how people react to it.
After I decided that I trusted the service, I added my Facebook and LinkedIn accounts.
The analytics side isn’t as interesting for me right now, but that could be due to the fact that I’ve barely been online since I came back from the US last weekend.
It was the suggested content that impressed me the most.
Yes, we’re in the process of filing a patent for it.
Ohad Frankfurt is the CEO, Shlomi Babluki is the CTO and Oz Katz does Product and Engineering, and I [Lior Degani] do Marketing.
➤ Want to try Swayy out without having to wait? Go to this secret URL and enter the promotion code thenextweb


Original Length: 4529
Summary Length: 1307
Summary Ratio: 71.1415323471


Summarization Technologies:

Today there are two common approaches to “attacking” the summarization task. The first approach tries to analyze the text and rewrite or rephrase it in a shorter form. As far as I know, this approach hasn’t achieved any substantial results so far. The second approach, which is similar to my naive algorithm, tries to extract the key sentences from the text and then put them together properly. One famous algorithm that implements this approach is TextRank.

Our Summarization Algorithm

I’m going to explain my naive algorithm step by step, using both programming and computer science terminology. Before you continue, in case you haven’t done so already, I suggest you take a quick look at the code.

The intersection function:

This function receives two sentences and returns a score for the intersection between them.
We just split each sentence into words/tokens, count how many common tokens they share, and then normalize the result by the average length of the two sentences.
Computer Science: f(s1, s2) = |{w | w in s1 and w in s2}| / ((|s1| + |s2|) / 2)
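A minimal sketch of such a function follows. The name sentences_intersection and the naive whitespace tokenization are my illustrative assumptions, not necessarily the exact code from the linked implementation:

```python
def sentences_intersection(sent1, sent2):
    # Split each sentence into a set of tokens (naive whitespace split)
    s1 = set(sent1.split())
    s2 = set(sent2.split())

    # Avoid dividing by zero on empty sentences
    if not s1 or not s2:
        return 0.0

    # Normalize the overlap by the average length of the two sentences
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2.0)
```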

The sentences dictionary:

This part is actually the “heart” of the algorithm. It receives our text as input and calculates a score for each sentence. The calculation is composed of two steps:

In the first step we split the text into sentences, and store the intersection value between each pair of sentences in a matrix (two-dimensional array). So values[0][2] will hold the intersection score between sentence #1 and sentence #3.
Computer Science: In fact, we just converted our text into a fully-connected weighted graph! Each sentence is a node in the graph and the two-dimensional array holds the weight of each edge!
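Here is a sketch of this first step, assuming a naive regex-based sentence splitter (split_into_sentences and build_intersection_matrix are my own helper names):

```python
import re

# Naive sentence splitter; NLTK's sent_tokenize (mentioned later) does better.
def split_into_sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

# values[i][j] holds the edge weight between sentence i and sentence j.
def build_intersection_matrix(sentences):
    n = len(sentences)
    return [[sentences_intersection(sentences[i], sentences[j])
             for j in range(n)]
            for i in range(n)]
```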

In the second step we calculate an individual score for each sentence and store it in a key-value dictionary, where the sentence itself is the key and the value is the total score. We do that simply by summing up all of its intersections with the other sentences in the text (not including itself).
Computer Science: We calculate the score for each node in our graph. We simply do that by summing the weights of all the edges connected to the node.
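Building on the helpers above, a sketch of the whole scoring step might look like this (again, the names are assumptions):

```python
def get_sentences_ranks(text):
    sentences = split_into_sentences(text)
    values = build_intersection_matrix(sentences)
    n = len(sentences)

    # Score each sentence by summing its edges, skipping the
    # self-intersection on the diagonal.
    ranks = {}
    for i, sentence in enumerate(sentences):
        ranks[sentence] = sum(values[i][j] for j in range(n) if j != i)
    return ranks
```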

Building the summary:

Obviously, the final step of our algorithm is generating the final summary. We do that by splitting our text into paragraphs, and then choosing the best sentence from each paragraph according to our sentences dictionary.
Computer Science: The idea here is that every paragraph in the text represents some logical subset of our graph, so we just pick the most valuable node from each subset!
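A sketch of this last step, assuming paragraphs are separated by blank lines (the double-newline split and the function names are my assumptions):

```python
def get_best_sentence(paragraph, ranks):
    sentences = split_into_sentences(paragraph)
    if not sentences:
        return ""
    # Pick the sentence with the highest global score in this paragraph
    return max(sentences, key=lambda s: ranks.get(s, 0))

def get_summary(title, text):
    ranks = get_sentences_ranks(text)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    summary = [title.strip()]
    for paragraph in paragraphs:
        best = get_best_sentence(paragraph, ranks)
        if best:
            summary.append(best)
    return "\n".join(summary)
```

Calling get_summary(title, text) on an article should then produce the kind of output shown at the top of this post.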

Why this works

There are two main reasons why this algorithm works. The first (and obvious) reason is that a paragraph is a logical atomic unit of the text. In simple words – there is probably a very good reason why the author decided to split his text that way. The second (and maybe less obvious) reason is that if two sentences have a good intersection, they probably hold the same information. So if one sentence has a good intersection with many other sentences, it probably holds some information from each one of them – or in other words, this is probably a key sentence in our text!

Your turn!

If you’ve read this far, you probably want to play a little bit with the algorithm. Here are a few things you can try:

1. In order to make it more useful, you may wrap it with some tool that extracts content from a URL. I personally like Goose, but you may use other tools like Readability, boilerpipe or others. It obviously won’t improve your algorithm, but it can be really nice to just pick a URL and see the resulting summary (a combined sketch for suggestions 1-3 appears after this list).

2. In my code I intentionally didn’t use any other packages. You can explore the NLTK or OpenNLP packages and use their methods for splitting and tokenizing text; they usually provide much better methods for that.

3. Play with the intersection function. As I already wrote, you may use stopwords or stemming (both of these tools are included in NLTK!) and see how they change the result. You can also play with the normalization part of the equation (try dividing the result by different factors).

4. Create a new variation of the algorithm. For example, instead of picking the best sentence from each paragraph, try to pick the 2-3 most important paragraphs (in this case, each node of your graph is a full paragraph instead of a single sentence!).

5. Use the title! In my code I just print it at the top of the summary, but I’m sure it can be more useful than that (for example, try adding it with some weight factor to your intersection function).

6. Check it on other input languages – although I tested this code only on English, theoretically it should work in any other language!
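For suggestions 1-3, a hedged sketch might look like the following. It assumes the python-goose and NLTK packages are installed (NLTK additionally needs its punkt and stopwords data downloaded), and it reuses the get_summary function sketched earlier; adapt the names to your own code:

```python
from goose import Goose                  # pip install goose-extractor
from nltk.corpus import stopwords        # needs nltk.download('stopwords')
from nltk.tokenize import word_tokenize  # needs nltk.download('punkt')

# Suggestion 1: extract the article from a URL with Goose, then summarize it.
def summarize_url(url):
    article = Goose().extract(url=url)
    return get_summary(article.title, article.cleaned_text)

STOPWORDS = set(stopwords.words('english'))

# Suggestions 2-3: NLTK tokenization plus stopword filtering
# in the intersection function.
def sentences_intersection_nltk(sent1, sent2):
    s1 = {w.lower() for w in word_tokenize(sent1) if w.lower() not in STOPWORDS}
    s2 = {w.lower() for w in word_tokenize(sent2) if w.lower() not in STOPWORDS}
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2.0)
```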

Obviously this algorithm is just a simple proof of concept, but I hope it gave you some general knowledge about how summarization technologies work. I also have to mention that in this post I introduced only one approach to this task, while there are indeed a few others!

Leave your comments below, or just contact me directly on Twitter.

Shlomi Babluki.

28 thoughts on “Build your own summary tool!”

  1. Interesting! I didn’t know that the main idea was so simple. I’ve been playing with similar stuff with Solr and Mahout (http://bit.ly/14F3KLz). This all makes me wonder: if you took all the sentences in a large corpus (you name it, all of the New York Post, the complete works of Shakespeare, the Bible) and applied this strategy, what would be the “central sentence” in the entire collection? What’s more, I would be interested in different centrality metrics. (In particular, I’m thinking about the eigenvector centrality metric.)

    • Putting this over a complete corpus might lead to interesting results. Interesting, not necessarily useful, though.

      It might be fun to look at it on a chapter basis (some delimiter for chapters would have to be found).

      Iterating the whole thing to smaller and smaller sizes might also work.

      i.e. Summarize each chapter, then summarize the corpus of chapter summaries.

      Hope I am making sense.

    • Just a quick reply. I updated the Node.js module to combine the two “public” facing methods and renamed it. It’s also now fully asynchronous.

  2. Very cool article! The naïve algorithm you implemented is quite effective. I like the legibility of Python as well. Thanks for sharing, Shlomi!

  3. Hi, can I ask why you chose to normalize the sentence intersection, instead of considering just the intersection as a metric? What is the significance of dividing it by the average length of the sentences?

  4. I think there’s a third reason for “why this ‘works’” beyond the two mentioned. It looks like this ‘works’ because, as all picked sentences have the highest score in their paragraph (following the naive algorithm, not necessarily the suggested variations), they hold a lot of commonality among themselves. This makes the conglomeration of them into one sequential piece of output have a feel of context persistency, and thus the result reads easily, even if it isn’t a good semantic summary of the original text. I think that’s an important distinction to make about this algorithm.

    What do you think?

  5. Thanks for explaining this topic in such a brief and clear way. Now I feel as if I can look behind the curtain and actually see that there is no magic. 🙂

    • Essentially, the algorithm takes the sentences in every paragraph of the text and gives them a score based on how many of their words intersect with (i.e., also occur in) other sentences of the text. The highest-ranked sentences from each paragraph are chosen and put in order.
