Build your own summary tool!

After Yahoo! acquired Summly and Google acquired Wavii, there is no doubt that automatic summarization technologies are a hot topic in the industry. So, as an NLP freak, I decided to give a quick overview of how these technologies actually work, along with some hands-on experience.

Since some of you might not read the entire post, I decided to start with the bottom line – here you can find my Python implementation of a very naive summarization algorithm. The algorithm extracts the key sentence from each paragraph in the text. It works nicely on news and blog articles, and it usually cuts around 60-70% of the original text. As an example, I ran it on this article (about my company, Swayy) and got the following results:


Swayy is a beautiful new dashboard for discovering and curating online content [Invites]

Lior Degani, the Co-Founder and head of Marketing of Swayy, pinged me last week when I was in California to tell me about his startup and give me beta access.
One week later, I’m totally addicted to Swayy and glad I said nothing about the spam (it doesn’t send out spam tweets but I liked the line too much to not use it for this article).
What is Swayy? It’s like Percolate and LinkedIn recommended articles, mixed with trending keywords for the topics you find interesting, combined with an analytics dashboard that shows the trends of what you do and how people react to it.
After I decided that I trusted the service, I added my Facebook and LinkedIn accounts.
The analytics side isn’t as interesting for me right now, but that could be due to the fact that I’ve barely been online since I came back from the US last weekend.
It was the suggested content that impressed me the most.
Yes, we’re in the process of filing a patent for it.
Ohad Frankfurt is the CEO, Shlomi Babluki is the CTO and Oz Katz does Product and Engineering, and I [Lior Degani] do Marketing.
➤ Want to try Swayy out without having to wait? Go to this secret URL and enter the promotion code thenextweb


Original Length: 4529
Summary Length: 1307
Summary Ratio: 71.1415323471


Summarization Technologies:

Today there are two common approaches to “attacking” the summarization task. The first approach tries to analyze the text and rewrite or rephrase it in a shorter form. As far as I know, this approach hasn’t achieved any substantial results so far. The second approach, which is similar to my naive algorithm, tries to extract the key sentences from the text and then put them together properly. One famous algorithm that implements this approach is TextRank.

Our Summarization Algorithm

I’m going to explain my naive algorithm step by step, using both programming and computer science terminology. Before you continue, in case you haven’t done so already, I suggest you take a quick look at the code.

The intersection function:

This function receives two sentences and returns a score for the intersection between them.
We just split each sentence into words/tokens, count how many common tokens they share, and then normalize the result by the average length of the two sentences.
Computer Science: f(s1, s2) = |{w | w in s1 and w in s2}| / ((|s1| + |s2|) / 2)
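A minimal sketch of such a function follows. The name sentences_intersection and the naive whitespace tokenization are my illustrative assumptions, not necessarily the exact code from the linked implementation:

```python
def sentences_intersection(sent1, sent2):
    # Split each sentence into a set of tokens (naive whitespace split)
    s1 = set(sent1.split())
    s2 = set(sent2.split())

    # Avoid dividing by zero on empty sentences
    if not s1 or not s2:
        return 0.0

    # Normalize the overlap by the average length of the two sentences
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2.0)
```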

The sentences dictionary:

This part is actually the “heart” of the algorithm. It receives our text as input and calculates a score for each sentence. The calculation is composed of two steps:

In the first step we split the text into sentences, and store the intersection value between each pair of sentences in a matrix (two-dimensional array). So values[0][2] will hold the intersection score between sentence #1 and sentence #3.
Computer Science: In fact, we just converted our text into a fully-connected weighted graph! Each sentence is a node in the graph and the two-dimensional array holds the weight of each edge!
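Here is a sketch of this first step, assuming a naive regex-based sentence splitter (split_into_sentences and build_intersection_matrix are my own helper names):

```python
import re

# Naive sentence splitter; NLTK's sent_tokenize (mentioned later) does better.
def split_into_sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

# values[i][j] holds the edge weight between sentence i and sentence j.
def build_intersection_matrix(sentences):
    n = len(sentences)
    return [[sentences_intersection(sentences[i], sentences[j])
             for j in range(n)]
            for i in range(n)]
```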

In the second step we calculate an individual score for each sentence and store it in a key-value dictionary, where the sentence itself is the key and the value is the total score. We do that simply by summing up all of its intersections with the other sentences in the text (not including itself).
Computer Science: We calculate the score for each node in our graph. We simply do that by summing the weights of all the edges connected to the node.
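Building on the helpers above, a sketch of the whole scoring step might look like this (again, the names are assumptions):

```python
def get_sentences_ranks(text):
    sentences = split_into_sentences(text)
    values = build_intersection_matrix(sentences)
    n = len(sentences)

    # Score each sentence by summing its edges, skipping the
    # self-intersection on the diagonal.
    ranks = {}
    for i, sentence in enumerate(sentences):
        ranks[sentence] = sum(values[i][j] for j in range(n) if j != i)
    return ranks
```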

Building the summary:

Obviously, the final step of our algorithm is generating the final summary. We do that by splitting our text into paragraphs, and then choosing the best sentence from each paragraph according to our sentences dictionary.
Computer Science: The idea here is that every paragraph in the text represents some logical subset of our graph, so we just pick the most valuable node from each subset!
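A sketch of this last step, assuming paragraphs are separated by blank lines (the double-newline split and the function names are my assumptions):

```python
def get_best_sentence(paragraph, ranks):
    sentences = split_into_sentences(paragraph)
    if not sentences:
        return ""
    # Pick the sentence with the highest global score in this paragraph
    return max(sentences, key=lambda s: ranks.get(s, 0))

def get_summary(title, text):
    ranks = get_sentences_ranks(text)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    summary = [title.strip()]
    for paragraph in paragraphs:
        best = get_best_sentence(paragraph, ranks)
        if best:
            summary.append(best)
    return "\n".join(summary)
```

Calling get_summary(title, text) on an article should then produce the kind of output shown at the top of this post.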

Why this works

There are two main reasons why this algorithm works. The first (and obvious) reason is that a paragraph is a logical atomic unit of the text. In simple words – there is probably a very good reason why the author decided to split his text that way. The second (and maybe less obvious) reason is that if two sentences have a good intersection, they probably hold the same information. So if one sentence has a good intersection with many other sentences, it probably holds some information from each one of them – or in other words, this is probably a key sentence in our text!

Your turn!

If you’ve read this far, you probably want to play a little bit with the algorithm. Here are a few things you can try:

1. In order to make it more useful, you may wrap it with some tool that extracts content from a URL. I personally like Goose, but you may use other tools like Readability, boilerpipe or others. It obviously won’t improve your algorithm, but it can be really nice to just pick a URL and see the resulting summary (a combined sketch for suggestions 1-3 appears after this list).

2. In my code I intentionally didn’t use any other packages. You can explore the NLTK or OpenNLP packages and use their methods for splitting and tokenizing text; they usually provide much better methods for that.

3. Play with the intersection function. As I already wrote, you may use stopwords or stemming (both of these tools are included in NLTK!) and see how they change the result. You can also play with the normalization part of the equation (try dividing the result by different factors).

4. Create a new variation of the algorithm. For example, instead of picking the best sentence from each paragraph, try to pick the 2-3 most important paragraphs (in this case, each node of your graph is a full paragraph instead of a single sentence!).

5. Use the title! In my code I just print it at the top of the summary, but I’m sure it can be more useful than that (for example, try adding it with some weight factor to your intersection function).

6. Check it on other input languages – although I tested this code only on English, theoretically it should work in any other language!
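For suggestions 1-3, a hedged sketch might look like the following. It assumes the python-goose and NLTK packages are installed (NLTK additionally needs its punkt and stopwords data downloaded), and it reuses the get_summary function sketched earlier; adapt the names to your own code:

```python
from goose import Goose                  # pip install goose-extractor
from nltk.corpus import stopwords        # needs nltk.download('stopwords')
from nltk.tokenize import word_tokenize  # needs nltk.download('punkt')

# Suggestion 1: extract the article from a URL with Goose, then summarize it.
def summarize_url(url):
    article = Goose().extract(url=url)
    return get_summary(article.title, article.cleaned_text)

STOPWORDS = set(stopwords.words('english'))

# Suggestions 2-3: NLTK tokenization plus stopword filtering
# in the intersection function.
def sentences_intersection_nltk(sent1, sent2):
    s1 = {w.lower() for w in word_tokenize(sent1) if w.lower() not in STOPWORDS}
    s2 = {w.lower() for w in word_tokenize(sent2) if w.lower() not in STOPWORDS}
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2.0)
```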

Obviously this algorithm is just a simple proof of concept, but I hope it gave you some general knowledge about how summarization technologies work. I also have to mention that in this post I introduced only one approach to this task, while there are indeed a few others!

Leave your comments below, or just contact me directly on Twitter.

Shlomi Babluki.

28 thoughts on “Build your own summary tool!”

  1. Interesting! I didn’t know that the main idea was so simple. I’ve been playing with similar stuff with Solr and Mahout (http://bit.ly/14F3KLz). This all makes me wonder: if you took all the sentences in a large corpus (you name it, all of the New York Post, the complete works of Shakespeare, the Bible) and applied this strategy, what would be the “central sentence” in the entire collection? What’s more, I would be interested in different centrality metrics. (In particular, I’m thinking about the eigenvector centrality metric.)

    • Putting this over a complete corpus might lead to interesting results. Interesting, not necessarily useful, though.

      It might be fun to look at it on a chapter basis (some delimiter for chapters would have to be found).

      Iterating the whole thing to smaller and smaller sizes might also work.

      i.e. Summarize each chapter, then summarize the corpus of chapter summaries.

      Hope I am making sense.

    • Just a quick reply. I updated the Node.js module to combine the two “public” facing methods and renamed it. It’s also now fully asynchronous.

  2. Very cool article! The naïve algorithm you implemented is quite effective. I like the legibility of Python as well. Thanks for sharing, Shlomi!

  3. Hi, can I ask why you chose to normalize the sentence intersection, instead of considering just the intersection as a metric? What is the significance of dividing it by the average length of the sentences?

  4. I think there’s a third reason for “why this ‘works’” beyond the two mentioned. It looks like this ‘works’ because, as all picked sentences have the highest score in their paragraph (following the naive algorithm, not necessarily the suggested variations), they hold a lot of commonality among themselves. This makes the conglomeration of them into one sequential piece of output have a feel of context persistency, and thus the result reads easily, even if it isn’t a good semantic summary of the original text. I think that’s an important distinction to make about this algorithm.

    What do you think?

  5. Thanks for explaining this topic in such a brief and clear way. Now I feel as if I can look behind the curtain and actually see that there is no magic. 🙂

    • Essentially, the algorithm takes the sentences in every paragraph of the text and gives them a score based on how many of their words intersect with (i.e., also occur in) other sentences of the text. The highest-ranked sentences from each paragraph are chosen and put in order.
