Last week, while working on new features for our product, I had to find a quick and efficient way to extract the main topics/objects from a sentence. Since I’m using Python, I initially assumed this would be a very easy task to achieve with NLTK. However, when I tried its default tools (POS tagger, parser…), I did get quite accurate results, but the performance was pretty bad. So I had to find a better way.
Like I did in my previous post, I’ll start with the bottom line – here you can find my code for extracting the main topics/noun phrases from a given sentence. It works well on real sentences (from a blog or news article). It’s a bit less accurate than the default NLTK tools, but it runs much faster!
I ran it on this sentence –
“Swayy is a beautiful new dashboard for discovering and curating online content.”
And got this result –
This sentence is about: Swayy, beautiful new dashboard, online content
The first time you run the code, it loads the Brown corpus into memory, so it might take a few seconds.
From the linguistic aspect, we usually say that the main “building blocks” of a sentence are Noun Phrases (NP) and Verb Phrases (VP). The Noun Phrases are usually the topics or objects in the sentence, or in simple words – this is what the sentence is talking about, while Verb Phrases describe some action between the objects in the sentence. Take this example:
“Facebook acquired Instagram”
About who/what? – Facebook and Instagram → Noun Phrases
What happened? – acquired (=acquisition) → Verb Phrase
My goal was to extract only the Noun Phrases from the sentence, so I had to define some simple patterns which describe the structure of a Noun Phrase, for example:
NN = content
JJ+NN = visual content
NN+NN = content marketing
*NN = noun, JJ = adjective…
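To make these patterns concrete, here is a minimal, self-contained sketch of the merging idea – the identifiers and the exact pattern set are mine, not the ones from the linked code:

```python
# Illustrative "semi-CFG": pairs of tags that merge into a noun-phrase tag.
# NNI is a made-up tag meaning "merged noun phrase".
NP_PATTERNS = {
    ("JJ", "NN"): "NNI",    # adjective + noun: "visual content"
    ("NN", "NN"): "NNI",    # noun + noun: "content marketing"
    ("NNI", "NN"): "NNI",   # grow an existing phrase with another noun
    ("JJ", "NNI"): "NNI",   # adjective + merged phrase: "beautiful new dashboard"
}

def merge_noun_phrases(tagged):
    """tagged: list of (word, tag) pairs. Greedily merge matching pairs."""
    merged = True
    while merged:
        merged = False
        for i in range(len(tagged) - 1):
            (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
            if (t1, t2) in NP_PATTERNS:
                # Replace the pair with a single merged token.
                tagged = tagged[:i] + [(w1 + " " + w2, NP_PATTERNS[(t1, t2)])] + tagged[i + 2:]
                merged = True
                break
    return tagged

print(merge_noun_phrases([("beautiful", "JJ"), ("new", "JJ"), ("dashboard", "NN")]))
# → [('beautiful new dashboard', 'NNI')]
```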
Now, I believe some of you are probably asking – “Wait! What? Why don’t you use parsing?”
So, first – you’re right! The standard method to convert a sentence into noun and verb phrases (or in other words – a parse tree) is parsing. However, the problem with parsing algorithms is that their time complexity is quite bad. The CYK algorithm, for example, runs in O(n^3 · |G|), where n is the sentence length and |G| is the size of the grammar!
The second problem is that full-parsing was a bit of an overkill for what I wanted to achieve.
First, I decided to define my own Part of Speech tagger. Luckily I found this article which was very useful. Second, I decided to define some “Semi-CFG”, which holds the patterns of the Noun Phrases.
So in one sentence – My code just tags the sentence with my tagger, then searches for NP patterns in the sentence.
Here I’m going to give you a quick overview of my code:
bigram_tagger – I use NLTK’s tagger classes to define my own tagger. It is built from three different taggers, each backing off to the next, and is trained on the Brown corpus.
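Such a backoff chain can be sketched with NLTK like this – the post trains on brown.tagged_sents(); to keep the snippet runnable without downloading the corpus, I train it on a tiny toy corpus instead:

```python
import nltk

# Backoff chain: bigram -> unigram -> default "NN".
# The real code trains on nltk.corpus.brown.tagged_sents(); the toy
# corpus below only exists so this sketch runs without extra data.
train_sents = [
    [("Swayy", "NN"), ("is", "VB"), ("a", "AT"),
     ("beautiful", "JJ"), ("new", "JJ"), ("dashboard", "NN")],
]
default_tagger = nltk.DefaultTagger("NN")  # unknown words become NN
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)

print(bigram_tagger.tag(["beautiful", "new", "dashboard"]))
```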
cfg – This is my “Semi-CFG”. It includes the basic rules to match a regular Noun Phrase.
tokenize_sentence – Splits the sentence into tokens (single words).
normalize_tags – Since the Brown corpus uses many different tags, I rename some of them to a smaller, simpler set.
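As an example of what such a normalization might look like (this mapping is my own illustration of the idea, not the post’s exact table):

```python
# Collapse Brown tag variants (e.g. NNS, NN-TL, NPS) into a few coarse tags.
def normalize_tag(tag):
    if tag.startswith("NP"):   # proper nouns: NP, NPS, NP-TL, ...
        return "NNP"
    if tag.startswith("NN"):   # common nouns: NN, NNS, NN-TL, ...
        return "NN"
    if tag.startswith("JJ"):   # adjectives: JJ, JJR, JJT, ...
        return "JJ"
    return tag                 # leave everything else untouched
```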
extract – This is the main method. It splits the sentence, tags it, and searches for NP patterns.
Lines 96-97 – The difference between these lines is that line 97 also accepts single nouns. With this condition you’ll get more results per sentence, but some of them will be false positives! You can filter the false positives by using word frequencies or by defining a special dictionary according to your needs.
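In sketch form, the single-noun condition amounts to something like this (collect_phrases and its tags are my own illustration, not the actual code):

```python
# Collect noun phrases from a merged, tagged sentence.
# NNI = merged noun phrase, NNP = proper noun (illustrative tags).
def collect_phrases(tagged, accept_single_nouns=False):
    phrases = []
    for phrase, tag in tagged:
        if tag in ("NNI", "NNP"):                  # multi-word NPs, proper nouns
            phrases.append(phrase)
        elif accept_single_nouns and tag == "NN":  # more recall, more noise
            phrases.append(phrase)
    return phrases

tagged = [("Swayy", "NNP"), ("is", "VB"),
          ("beautiful new dashboard", "NNI"), ("content", "NN")]
print(collect_phrases(tagged))                            # → ['Swayy', 'beautiful new dashboard']
print(collect_phrases(tagged, accept_single_nouns=True))  # → ['Swayy', 'beautiful new dashboard', 'content']
```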
The bottom line
As I already said, the best way to extract noun/verb phrases from a sentence is parsing. However, if you need to do it fast and want to be able to process many sentences or full documents in a very short time – I suggest taking an approach like the one I illustrated above.