Last week, while working on new features for our product, I had to find a quick and efficient way to extract the main topics/objects from a sentence. Since I’m using Python, I initially assumed this would be a very easy task to achieve with NLTK. However, when I tried its default tools (POS tagger, parser…), I did get quite accurate results, but the performance was pretty bad. So I had to find a better way.
Like I did in my previous post, I’ll start with the bottom line – here you can find my code for extracting the main topics/noun phrases from a given sentence. It works well on real sentences (from a blog or news article). It’s a bit less accurate than the default NLTK tools, but it runs much faster!
I ran it on this sentence –
“Swayy is a beautiful new dashboard for discovering and curating online content.”
And got this result –
This sentence is about: Swayy, beautiful new dashboard, online content
The first time you run the code, it loads the Brown corpus into memory, so it might take a few seconds.
From the linguistic aspect, we usually say that the main “building blocks” of a sentence are Noun Phrases (NP) and Verb Phrases (VP). The Noun Phrases are usually the topics or objects in the sentence, or in simple words – this is what the sentence is talking about, while Verb Phrases describe some action between the objects in the sentence. Take this example:
“Facebook acquired Instagram”
About who/what? – Facebook and Instagram → Noun Phrases
What happened? – acquired (=acquisition) → Verb Phrase
My goal was to extract only the Noun Phrases from the sentence, so I had to define some simple patterns which describe the structure of a Noun Phrase, for example:
NN = content
JJ+NN = visual content
NN+NN = content marketing
*NN = noun, JJ = adjective…
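To make these patterns concrete, here is a minimal, self-contained sketch of the merging idea – the identifiers and the exact pattern set are mine, not the ones from the linked code:

```python
# Illustrative "semi-CFG": pairs of tags that merge into a noun-phrase tag.
# NNI is a made-up tag meaning "merged noun phrase".
NP_PATTERNS = {
    ("JJ", "NN"): "NNI",    # adjective + noun: "visual content"
    ("NN", "NN"): "NNI",    # noun + noun: "content marketing"
    ("NNI", "NN"): "NNI",   # grow an existing phrase with another noun
    ("JJ", "NNI"): "NNI",   # adjective + merged phrase: "beautiful new dashboard"
}

def merge_noun_phrases(tagged):
    """tagged: list of (word, tag) pairs. Greedily merge matching pairs."""
    merged = True
    while merged:
        merged = False
        for i in range(len(tagged) - 1):
            (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
            if (t1, t2) in NP_PATTERNS:
                # Replace the pair with a single merged token.
                tagged = tagged[:i] + [(w1 + " " + w2, NP_PATTERNS[(t1, t2)])] + tagged[i + 2:]
                merged = True
                break
    return tagged

print(merge_noun_phrases([("beautiful", "JJ"), ("new", "JJ"), ("dashboard", "NN")]))
# → [('beautiful new dashboard', 'NNI')]
```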
Now, I believe some of you are probably asking – “Wait! What? Why don’t you use parsing?”
So, first – you’re right! The standard method to convert a sentence into noun and verb phrases (or in other words – a parse tree) is parsing. However, the problem with parsing algorithms is that their time complexity is quite bad. The CYK algorithm, for example, runs in O(n^3 · |G|), where n is the sentence length and |G| is the size of the grammar!
The second problem is that full-parsing was a bit of an overkill for what I wanted to achieve.
First, I decided to define my own Part of Speech tagger. Luckily I found this article which was very useful. Second, I decided to define some “Semi-CFG”, which holds the patterns of the Noun Phrases.
So in one sentence – My code just tags the sentence with my tagger, then searches for NP patterns in the sentence.
Here I’m going to give you a quick overview of my code:
bigram_tagger – I use NLTK’s tagger classes to define my own tagger. It is built from three different taggers, each backing off to the next, and is trained on the Brown corpus.
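Such a backoff chain can be sketched with NLTK like this – the post trains on brown.tagged_sents(); to keep the snippet runnable without downloading the corpus, I train it on a tiny toy corpus instead:

```python
import nltk

# Backoff chain: bigram -> unigram -> default "NN".
# The real code trains on nltk.corpus.brown.tagged_sents(); the toy
# corpus below only exists so this sketch runs without extra data.
train_sents = [
    [("Swayy", "NN"), ("is", "VB"), ("a", "AT"),
     ("beautiful", "JJ"), ("new", "JJ"), ("dashboard", "NN")],
]
default_tagger = nltk.DefaultTagger("NN")  # unknown words become NN
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)

print(bigram_tagger.tag(["beautiful", "new", "dashboard"]))
```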
cfg – This is my “Semi-CFG”. It includes the basic rules to match a regular Noun Phrase.
tokenize_sentence – Splits the sentence into tokens (single words).
normalize_tags – Since the Brown corpus uses many different tags, I rename some of them to a smaller, simpler set.
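As an example of what such a normalization might look like (this mapping is my own illustration of the idea, not the post’s exact table):

```python
# Collapse Brown tag variants (e.g. NNS, NN-TL, NPS) into a few coarse tags.
def normalize_tag(tag):
    if tag.startswith("NP"):   # proper nouns: NP, NPS, NP-TL, ...
        return "NNP"
    if tag.startswith("NN"):   # common nouns: NN, NNS, NN-TL, ...
        return "NN"
    if tag.startswith("JJ"):   # adjectives: JJ, JJR, JJT, ...
        return "JJ"
    return tag                 # leave everything else untouched
```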
extract – This is the main method. It splits the sentence, tags it, and searches for NP patterns.
Lines 96-97 – The difference between these lines is that line 97 also accepts single nouns. With this condition you’ll get more results per sentence, but some of them will be false positives! You can filter the false positives by using word frequencies or by defining a special dictionary according to your needs.
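In sketch form, the single-noun condition amounts to something like this (collect_phrases and its tags are my own illustration, not the actual code):

```python
# Collect noun phrases from a merged, tagged sentence.
# NNI = merged noun phrase, NNP = proper noun (illustrative tags).
def collect_phrases(tagged, accept_single_nouns=False):
    phrases = []
    for phrase, tag in tagged:
        if tag in ("NNI", "NNP"):                  # multi-word NPs, proper nouns
            phrases.append(phrase)
        elif accept_single_nouns and tag == "NN":  # more recall, more noise
            phrases.append(phrase)
    return phrases

tagged = [("Swayy", "NNP"), ("is", "VB"),
          ("beautiful new dashboard", "NNI"), ("content", "NN")]
print(collect_phrases(tagged))                            # → ['Swayy', 'beautiful new dashboard']
print(collect_phrases(tagged, accept_single_nouns=True))  # → ['Swayy', 'beautiful new dashboard', 'content']
```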
The bottom line
As I already said, the best way to extract noun/verb phrases from a sentence is parsing. However, if you need to do it fast and want to be able to process many sentences or full documents in a very short time – I suggest taking an approach like the one I illustrated above.