Last week, while working on new features for our product, I had to find a quick and efficient way to extract the main topics/objects from a sentence. Since I'm using Python, I initially thought it would be a very easy task to achieve with NLTK. However, when I tried its default tools (POS tagger, parser…), the results were indeed quite accurate, but the performance was pretty bad. So I had to find a better way.
Like I did in my previous post, I'll start with the bottom line – here you can find my code for extracting the main topics/noun phrases from a given sentence. It works fine with real sentences (from a blog or news article). It's a bit less accurate than the default NLTK tools, but it works much faster!
I ran it on this sentence –
“Swayy is a beautiful new dashboard for discovering and curating online content.”
And got this result –
This sentence is about: Swayy, beautiful new dashboard, online content
The first time you run the code, it loads the Brown corpus into memory, so it might take a few seconds.
From a linguistic standpoint, we usually say that the main "building blocks" of a sentence are Noun Phrases (NP) and Verb Phrases (VP). The Noun Phrases are usually the topics or objects of the sentence – in simple words, what the sentence is talking about – while Verb Phrases describe some action between the objects in the sentence. Take this example:
“Facebook acquired Instagram”
About Who/What? – Facebook and Instagram > Noun Phrases
What happened? – acquired (=acquisition) > Verb Phrase
My goal was to extract only the Noun Phrases from the sentence, so I had to define some simple patterns which describe the structure of a Noun Phrase, for example:
NN = content
JJ+NN = visual content
NN+NN = content marketing
*NN = noun, JJ = adjective, etc.
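To make the pattern idea concrete, here is a minimal sketch (not the post's actual code) of how such rules can merge adjacent tags into longer NP chunks. The `cfg` dictionary, the merged tag name `NNI`, and the tag set are assumptions modeled on the description above:

```python
# A minimal sketch of the "Semi-CFG" idea: each key is a pair of adjacent
# tags, and the value is the tag assigned to the merged chunk.
# The rules and tag names below are illustrative assumptions.
cfg = {
    ("NNP", "NNP"): "NNP",  # proper noun + proper noun
    ("NN", "NN"): "NNI",    # "content marketing" -> NP
    ("NNI", "NN"): "NNI",   # extend an already-merged NP
    ("JJ", "JJ"): "JJ",     # stacked adjectives
    ("JJ", "NN"): "NNI",    # "visual content" -> NP
}

def merge_chunks(tagged):
    """Repeatedly merge adjacent (word, tag) pairs that match a cfg rule."""
    tokens = list(tagged)
    merged = True
    while merged:
        merged = False
        for i in range(len(tokens) - 1):
            (w1, t1), (w2, t2) = tokens[i], tokens[i + 1]
            if (t1, t2) in cfg:
                tokens[i:i + 2] = [(w1 + " " + w2, cfg[(t1, t2)])]
                merged = True
                break
    return tokens

# Pre-tagged toy input: "visual content marketing"
print(merge_chunks([("visual", "JJ"), ("content", "NN"), ("marketing", "NN")]))
# → [('visual content marketing', 'NNI')]
```

Note how a verb tag between two nouns (as in "Facebook acquired Instagram") matches no rule, so the nouns stay as two separate phrases.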
Now, I believe some of you are probably asking – "Wait! What? Why don't you use parsing?"
So, first – you're right! The standard method to convert a sentence into Noun and Verb Phrases (or in other words, a tree) is parsing. However, the problem with parsing algorithms is that their time complexity is quite bad. For example, the CYK algorithm runs in O(n^3 · |G|), where n is the sentence length and |G| is the size of the grammar!
The second problem is that full parsing was a bit of an overkill for what I wanted to achieve.
First, I decided to define my own Part of Speech tagger. Luckily I found this article which was very useful. Second, I decided to define some “Semi-CFG”, which holds the patterns of the Noun Phrases.
So in one sentence – My code just tags the sentence with my tagger, then searches for NP patterns in the sentence.
Here I’m going to give you a quick overview of my code:
bigram_tagger – I use NLTK's tagger classes to define my own tagger. As you can see, it's built from three different taggers chained together, and it's trained on the Brown corpus.
cfg – This is my “Semi-CFG”. It includes the basic rules to match a regular Noun Phrase.
tokenize_sentence – Split the sentence into tokens (single words).
normalize_tags – Since there are many tags in the Brown corpus, I just rename some of them.
extract – This is our main method. It splits the sentence, tags it, and searches for NP patterns.
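The overall flow can be sketched end to end without NLTK. In this illustrative sketch the hand-made `LEXICON` stands in for the Brown-trained backoff tagger (that lexicon, the default-to-noun fallback, and the run-collection rule are all assumptions, not the post's actual code):

```python
import re

# Toy stand-in for the Brown-trained tagger. The real code uses NLTK's
# chained taggers; this tiny lexicon is an assumption for illustration only.
LEXICON = {
    "swayy": "NNP", "is": "VBZ", "a": "DT", "beautiful": "JJ", "new": "JJ",
    "dashboard": "NN", "for": "IN", "discovering": "VBG", "and": "CC",
    "curating": "VBG", "online": "JJ", "content": "NN",
}

def tokenize_sentence(sentence):
    """Split the sentence into word tokens."""
    return re.findall(r"[A-Za-z]+", sentence)

def tag(tokens):
    """Look each token up in the lexicon; unknown words default to NN."""
    return [(t, LEXICON.get(t.lower(), "NN")) for t in tokens]

def extract(sentence):
    """Tag the sentence, then collect maximal runs of JJ/NN* tags as NPs."""
    tagged = tag(tokenize_sentence(sentence))
    phrases, current = [], []
    for word, t in tagged:
        if t.startswith("NN") or t == "JJ":
            current.append(word)          # still inside a noun phrase
        else:
            if current:
                phrases.append(" ".join(current))
            current = []                  # a non-NP tag breaks the run
    if current:
        phrases.append(" ".join(current))
    return phrases

print(extract("Swayy is a beautiful new dashboard for discovering and curating online content."))
# → ['Swayy', 'beautiful new dashboard', 'online content']
```

Even this toy version reproduces the example result from the top of the post; the real quality difference comes from the tagger, which is why training on the Brown corpus matters.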
Lines 96-97 – The difference between these lines is that line 97 also accepts single nouns. This condition means you'll get more results per sentence – but some of the results will be false positives! You can reduce the false positives by using word frequencies or by defining a special dictionary tailored to your needs.
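One way to apply the word-frequency idea is to keep a single-noun result only when it is rare enough in a background corpus to be informative. This is a hedged sketch: `BACKGROUND_FREQ`, the counts in it, and the `MAX_FREQ` threshold are all made-up assumptions; in practice you would count frequencies over a large corpus:

```python
# Sketch of filtering single-noun candidates by background frequency.
# The counts and threshold below are illustrative assumptions only.
BACKGROUND_FREQ = {"content": 950, "thing": 5000, "swayy": 0, "dashboard": 40}
MAX_FREQ = 100  # keep single nouns rarer than this

def filter_single_nouns(phrases):
    kept = []
    for p in phrases:
        if " " in p:                                  # multi-word NPs kept as-is
            kept.append(p)
        elif BACKGROUND_FREQ.get(p.lower(), 0) < MAX_FREQ:
            kept.append(p)                            # rare single noun: likely a topic
        # common single nouns are dropped as probable false positives
    return kept

print(filter_single_nouns(["Swayy", "beautiful new dashboard", "thing"]))
# → ['Swayy', 'beautiful new dashboard']
```

The same shape works for the "special dictionary" variant: replace the frequency test with a membership check against a whitelist or blacklist built for your domain.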
The bottom line
As I already said, the best way to extract Noun/Verb Phrases from a sentence is parsing. However, if you need to do it fast and you want to process many sentences or full documents in a very short time – I suggest taking an approach like the one I illustrated above.
Thank you for your post. Would you please show, step by step, what happens to
the sentence in the example?
("Swayy is a beautiful new dashboard for discovering and curating online content.")
What tag will be added to each token,
and which tokens do you keep?
Thank you very much.
I've used Prolog to capture noun phrases (and similar) for keyword analysis. It's pretty handy: defining this kind of CFGs is straightforward and damn fast in Prolog, even loading a corpus (I used the Brill tagged corpus and another I don't remember exactly; both had Prolog knowledge databases available.)
I had to do some tweaks to my grammar and add the Wordnet corpus, but got this:
tester3('Swayy is a beautiful new dashboard for discovering and curating online content', 's', X), write(X).
I know it doesn't look pretty; I didn't take the time to clean up the results. It is the raw output of the grammar tree. As you can see, it generates the tree and gets the same NPs as your code in this particular example (here it also shows the intermediate steps, i.e. adj + simple NP is NP):
(incidentally, it also recognizes that it is a valid sentence.)
Thanks for posting, you made me review my own code and make it better 😉 Oh, by the way, the "code" to generate this is 63 lines for the grammar, and the databases of word types (WordNet, which sadly only classifies words as noun, verb, adjective and adverb, plus the Brill part-of-speech tagged database, which adds conjunctions and the other requirements for complex phrases) are 300k lines. Of course these came for free 🙂
How would you compare this to the Stanford Named Entity Extractor (speed, accuracy, etc). Thanks, great demo! http://www-nlp.stanford.edu/software/CRF-NER.shtml
To get a rough sense of conventional parsing speed: parsing the sentence "Last week, while working on new features for our product, I had to find a quick and efficient way to extract the main topics/objects from a sentence" with the Stanford online parser at http://nlp.stanford.edu:8080/parser/index.jsp takes close to one second, and that duration doesn't even include the network round-trip… though full parsing may still be necessary for NER or topic guessing.
Could you give me JSP code to do the above extraction?
Thank you for your post. But is there any way to do this using Java?
Thanks so much.
Your algorithms, articles, and code are all useful in my project.