Naive Language Detector

Last week, as I was working on my new project ‘Complete’ (a personalized autocomplete extension for Gmail), I was searching for a solution that could correctly detect the language of a text. I thought finding one would be easy, since I only needed it to work on long texts.

The first solution I considered might have fit the project’s needs, had it not been based on the NLTK stopwords corpus and supported only 14 languages. Besides this solution, I found a few others, which were a bit too heavy or complex for my needs. Not being entirely satisfied with the available options, I set out to build my own. You can find my code here, and some more details about it throughout this post.

 1. ‘data.json’:

In my code there is a file called ‘data.json’, which is in fact the model for my solution. It was built by taking the 1,000 most common words for each language from here, and then filtering that list down to the 50 most common words that are unique to each language.
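
If you ever want to rebuild such a model yourself, the filtering step could be sketched roughly like this (the input format, function name and file handling here are my own assumptions, not the actual build script):

import json

def build_model(freq_lists, top_n=50):
    # freq_lists: {"english": [~1,000 most common words, most frequent first], ...}
    model = {}
    for lang, words in freq_lists.items():
        # Collect every word that appears in some other language's list.
        other_words = set()
        for other_lang, other in freq_lists.items():
            if other_lang != lang:
                other_words.update(other)
        # Keep the top_n most common words that are unique to this language.
        model[lang] = [w for w in words if w not in other_words][:top_n]
    return model

# with open("data.json", "w") as f:
#     json.dump(build_model(freq_lists), f)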

 2. ‘language_detector.py’:

This is my Python code that uses the model. Step A: the code converts the whole text into a set of tokens (by splitting the text into sentences and then each sentence into a list of tokens). Step B: the code computes the intersection between the text’s token set and each language’s word set in the model, and keeps the highest-scoring language as the result.
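
A minimal sketch of the idea, assuming ‘data.json’ maps each language name to its list of words (the function name and the plain .split() tokenization here are my own simplifications):

import json

def detect_language(text):
    with open("data.json") as f:
        model = json.load(f)  # e.g. {"english": ["the", "of", ...], ...}

    # Step A: turn the whole text into a set of lowercase tokens.
    tokens = set(text.lower().split())

    # Step B: score each language by its overlap with the text's tokens
    # and keep the best one.
    best_language, best_score = None, 0
    for language, words in model.items():
        score = len(tokens & set(words))
        if score > best_score:
            best_language, best_score = language, score

    # None means the language could not be identified (no false positives).
    return best_language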

 3. Comments about my solution:

  • As I already mentioned, this solution will work only with long texts (emails, news articles, full documents…), not phrases or single sentences.
  • Theoretically, this solution should not produce false positives. It returns ‘None’ if it cannot identify the language of the given text.
  • In my code, I used NLTK to split the text into tokens. If you don’t want to use it, you can just use Python’s built-in .split() method.
  • Feel free to use my ‘data.json’ model and implement my code in any other programming language.

Have fun,

Shlomi Babluki.

How To Easily Recognize People’s Names In A Text

Named Entity Recognition (or just NER) is one of the more traditional tasks done with Natural Language Processing. The definition of the task is very simple: build an automatic tool that can recognize and classify names in any piece of text.

For example:

Input: “Barack Obama is the president of the United States of America”

Output: “[Barack Obama] (Person) is the president of the [United States of America] (Location)”

Today NER is considered a “solved problem”, since the accuracy of modern NER systems is over 90%! However, I decided to take it as a case study to show you how important it is to have a good understanding of the NLP problem you want to solve before even starting to write a single line of code.

The “Problem” with Traditional NER systems

Traditional NER systems are quite heavy and complex. They are built using training data sets, statistical methods, heuristics and complex algorithms. Moreover, when people started working on these systems 30 or 40 years ago, having a simple dictionary with all the possible names in the world was not an option at all!

But today, in some cases, the story is quite different…

Case 1 – News articles

Let’s take the definition from the beginning of this post and change it a little bit:

Building an automatic tool that can recognize and classify people’s names in news articles.

So yes, any traditional NER system can solve this task, but in this case a much simpler solution might also work pretty well. Take a look at my Python code for dummy NER for news articles.

I just use a simple regular expression to extract strings that “look like names” and then validate them with the Freebase API.
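
Here is a rough sketch of that idea (this is not the actual code – the regular expression is simplified, and the validation step is a toy stand-in for the real Freebase lookup):

import re

# Two or more consecutive capitalized words "look like" a full name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")

# Toy stand-in for the real validation step (the original code asks the
# Freebase API whether the candidate is a known person).
KNOWN_PEOPLE = {"Barack Obama"}

def extract_people(text):
    names = []
    for candidate in NAME_PATTERN.findall(text):
        if candidate in KNOWN_PEOPLE and candidate not in names:
            names.append(candidate)
    return names

# extract_people("Barack Obama met reporters in New York.") -> ["Barack Obama"]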

I ran it on this article, and got these results:

Jon Lester, Alex Rodriguez, Ryan Dempster, Joe Girardi, Jake Peavy

Comments about my code:

  1. Obviously it works quite slowly, since it uses external API calls, but if you really want, you can find a way to download Freebase’s data (or something similar, like Wikipedia’s) and run it locally.
  2. I’m sure you can improve it a little bit by adding some special cases, like handling ’s at the end of a name or ignoring middle names, etc.

Why does it work?

  1. First, we only want to recognize people’s names.
  2. According to a little research I did over a year ago, the number of names mentioned in news articles at any single point in time is only around 20,000 (= a very small set of names!).
  3. Whenever a new name comes up in the news, someone will probably add it to Freebase / Wikipedia within just a few hours.
  4. Usually in every news article the full names of the people in the text (“Barack Obama” and not just “Obama”) are written at least once (most likely at the beginning).

Case 2 – Any Article?

As I already said, 30 or 40 years ago having a dictionary with all the possible names in the world wasn’t a real option, but today we actually do have a very good dictionary of names – Facebook users!

So I took my code and added another layer that uses Facebook API to validate names. You can grab my full code from here.

This time I ran it on this article, and got these results:

Oz Katz, Emily Engelson, Shlomi Babluki, Lior Degani, Ohad Frankfurt, Shira Abel, Kevin Systrom

So now we theoretically have: an automatic tool that can recognize and classify people’s names in any text.

Conclusion

The idea behind this post was to show you that sometimes a small change in the problem can lead to a much simpler and more naive solution. The key in such cases is to deeply analyze the problem before even starting to think about possible solutions.

An Efficient Way to Extract the Main Topics from a Sentence

Last week, while working on new features for our product, I had to find a quick and efficient way to extract the main topics/objects from a sentence. Since I’m using Python, I initially thought it would be a very easy task to achieve with NLTK. However, when I tried its default tools (POS tagger, parser…), I did get quite accurate results, but performance was pretty bad. So I had to find a better way.

Like I did in my previous post, I’ll start with the bottom line – Here you can find my code for extracting the main topics/noun phrases from a given sentence. It works fine with real sentences (from a blog/news article). It’s a bit less accurate compared to the default NLTK tools, but it works much faster!

I ran it on this sentence –

“Swayy is a beautiful new dashboard for discovering and curating online content.”

And got this result –

This sentence is about: Swayy, beautiful new dashboard, online content

The first time you run the code, it loads the Brown corpus into memory, so it might take a few seconds.

Linguistics

From the linguistic aspect, we usually say that the main “building blocks” of a sentence are Noun Phrases (NP) and Verb Phrases (VP). The Noun Phrases are usually the topics or objects in the sentence – in simple words, they are what the sentence is talking about – while the Verb Phrases describe some action between the objects in the sentence. Take this example:

“Facebook acquired Instagram”
About Who/What? – Facebook and Instagram > Noun Phrases
What happened? – acquired (=acquisition) > Verb Phrase

My goal was to extract only the Noun Phrases from the sentence, so I had to define some simple patterns which describe the structure of a Noun Phrase, for example:
NN = content
JJ+NN = visual content
NN+NN = content marketing

*NN = noun, JJ = adj…

Computer Science:

Now, I believe some of you are probably asking – “Wait! What? Why don’t you use parsing?”
So, first – you’re right! The known method for converting a sentence into Noun and Verb Phrases (or in other words – a tree…) is parsing. However, the problem with parsing algorithms is that their complexity is quite bad. For example, the CYK algorithm has a complexity of O(n^3 * |G|)!
The second problem is that full-parsing was a bit of an overkill for what I wanted to achieve.

My Solution:

First, I decided to define my own Part of Speech tagger. Luckily I found this article, which was very useful. Second, I decided to define a “Semi-CFG”, which holds the patterns of the Noun Phrases.

So in one sentence – My code just tags the sentence with my tagger, then searches for NP patterns in the sentence.
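
Here is a rough sketch of such a pipeline, assuming NLTK and the Brown corpus are available (the tag names and patterns are simplified compared to the real code, and training the taggers may take a little while):

import nltk
from nltk.corpus import brown

# A backoff chain of taggers trained on the Brown corpus
# (requires nltk.download("brown") the first time).
train_sents = brown.tagged_sents()
tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger("NN"))
tagger = nltk.BigramTagger(train_sents, backoff=tagger)

# "Semi-CFG": adjacent tag pairs that can be merged into a single noun phrase.
CFG = {("JJ", "NN"): "NN", ("NN", "NN"): "NN", ("NP", "NP"): "NP"}

def extract_topics(sentence):
    tagged = tagger.tag(sentence.split())
    merged = True
    while merged:
        merged = False
        for i in range(len(tagged) - 1):
            (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
            if (t1, t2) in CFG:
                # Merge the two tokens into one noun-phrase token.
                tagged[i:i + 2] = [(w1 + " " + w2, CFG[(t1, t2)])]
                merged = True
                break
    # Keep only the (possibly merged) noun tokens.
    return [w for w, t in tagged if t in ("NN", "NP")]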

Programming:

Here I’m going to give you a quick overview of my code:

bigram_tagger – I use the NLTK tagger classes to define my own tagger. As you can see, it’s built from 3 different taggers and it’s trained with the Brown corpus.

cfg – This is my “Semi-CFG”. It includes the basic rules to match a regular Noun Phrase.

tokenize_sentence – Split the sentence into tokens (single words).

normalize_tags – Since there are many tags in the Brown corpus, I just rename some of them.

extract – This is our main method. Split the sentence, tag it, and search for patterns.

Lines 96-97 – The difference between these lines is that line 97 also accepts single nouns. This means you’ll get more results per sentence – but some of the results will be false positives! You can overcome / ignore the false positives by using word frequencies or by defining a special dictionary according to your needs.

The bottom line

As I already said, the best way to extract Noun/Verb Phrases from a sentence is by using parsing. However, if you need to do it fast and you want to be able to process many sentences / full documents in a very short time – I suggest taking an approach like the one I illustrated above.

Build your own summary tool!

After Yahoo! acquired Summly and Google acquired Wavii, there is no doubt that auto-summarization technologies are a hot topic in the industry. So as an NLP freak, I decided to give a quick overview of, and some hands-on experience with, how these technologies actually work.

Since some of you might not read the entire post, I decided to start with the bottom line – here you can find my Python implementation of a very naive summarization algorithm. The algorithm extracts the key sentence from each paragraph in the text. It works nicely on news & blog articles, and it usually cuts around 60-70% of the original text. As an example, I ran it on this article (about my company, Swayy), and got the following results:


Swayy is a beautiful new dashboard for discovering and curating online content [Invites]

Lior Degani, the Co-Founder and head of Marketing of Swayy, pinged me last week when I was in California to tell me about his startup and give me beta access.
One week later, I’m totally addicted to Swayy and glad I said nothing about the spam (it doesn’t send out spam tweets but I liked the line too much to not use it for this article).
What is Swayy? It’s like Percolate and LinkedIn recommended articles, mixed with trending keywords for the topics you find interesting, combined with an analytics dashboard that shows the trends of what you do and how people react to it.
After I decided that I trusted the service, I added my Facebook and LinkedIn accounts.
The analytics side isn’t as interesting for me right now, but that could be due to the fact that I’ve barely been online since I came back from the US last weekend.
It was the suggested content that impressed me the most.
Yes, we’re in the process of filing a patent for it.
Ohad Frankfurt is the CEO, Shlomi Babluki is the CTO and Oz Katz does Product and Engineering, and I [Lior Degani] do Marketing.
➤ Want to try Swayy out without having to wait? Go to this secret URL and enter the promotion code thenextweb


Original Length 4529

Summary Length 1307
Summary Ratio: 71.1415323471


Summarization Technologies:

Today there are two common approaches to “attacking” the summarization mission. The first approach tries to analyze the text and to rewrite or rephrase it in a shorter way. As far as I know, this approach hasn’t achieved any substantial results so far. The second approach, which is similar to my naive algorithm, tries to extract the key sentences from the text and then put them together properly. One famous algorithm that implements this approach is TextRank.

Our Summarization Algorithm

I’m going to explain my naive algorithm step by step. I’ll use both programming and computer science terminology. Before you continue, in case you haven’t done it already, I suggest taking a quick look at the code.

The intersection function:

This function receives two sentences, and returns a score for the intersection between them.
We just split each sentence into words/tokens, count how many common tokens we have, and then we normalize the result with the average length of the two sentences.
Computer Science: f(s1, s2) = |{w | w in s1 and w in s2}| / ((|s1| + |s2|) / 2)
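
In Python, that intersection function might be sketched like this:

def sentence_intersection(s1, s2):
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    if not tokens1 or not tokens2:
        return 0.0
    # Common tokens, normalized by the average length of the two sentences.
    return len(tokens1 & tokens2) / ((len(tokens1) + len(tokens2)) / 2.0)

# sentence_intersection("the cat sat", "the cat ran") -> 2 / 3 = 0.66...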

The sentences dictionary:

This part is actually the “heart” of the algorithm. It receives our text as input and calculates a score for each sentence. The calculation is composed of two steps:

In the first step we split the text into sentences, and store the intersection value between each two sentences in a matrix (two-dimensional array). So values[0][2] will hold the intersection score between sentence #1 and sentence #3.
Computer Science: In fact, we just converted our text into a fully-connected weighted graph! Each sentence is a node in the graph and the two-dimensional array holds the weight of each edge!

In the second step we calculate an individual score for each sentence and store it in a key-value dictionary, where the sentence itself is the key and the value is the total score. We do that just by summing up all its intersections with the other sentences in the text (not including itself).
Computer Science: We calculate the score for each node in our graph. We simply do that by summing all the edges that are connected to the node.

Building the summary:

Obviously, the final step of our algorithm is generating the final summary. We do that by splitting our text into paragraphs, and then we choose the best sentence from each paragraph according to our sentences dictionary.
Computer Science: The Idea here is that every paragraph in the text represents some logical subset of our graph, so we just pick the most valuable node from each subset!
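
Putting the scoring and the summary-building steps together, a rough self-contained sketch might look like this (the sentence and paragraph splitting here is deliberately naive – the real code is more careful):

def sentence_intersection(s1, s2):
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    if not tokens1 or not tokens2:
        return 0.0
    return len(tokens1 & tokens2) / ((len(tokens1) + len(tokens2)) / 2.0)

def split_sentences(chunk):
    # Deliberately naive sentence splitting on periods.
    return [s.strip() for s in chunk.replace("\n", " ").split(".") if s.strip()]

def sentence_scores(text):
    sentences = split_sentences(text)
    n = len(sentences)
    # The "graph": values[i][j] is the edge weight between sentences i and j.
    values = [[sentence_intersection(sentences[i], sentences[j])
               for j in range(n)] for i in range(n)]
    # A sentence's score is the sum of its edges to every other sentence.
    return {sentences[i]: sum(values[i]) - values[i][i] for i in range(n)}

def summarize(text):
    scores = sentence_scores(text)
    summary = []
    for paragraph in text.split("\n\n"):
        candidates = split_sentences(paragraph)
        if candidates:
            # Pick the highest-scoring sentence from each paragraph.
            summary.append(max(candidates, key=lambda s: scores.get(s, 0.0)) + ".")
    return "\n".join(summary)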

Why this works

There are two main reasons why this algorithm works: The first (and obvious) reason is that a paragraph is a logical, atomic unit of the text. In simple words – there is probably a very good reason why the author decided to split his text that way. The second (and maybe less obvious…) reason is that if two sentences have a good intersection, they probably hold the same information. So if one sentence has a good intersection with many other sentences, it probably holds some information from each one of them – or in other words, it is probably a key sentence in our text!

Your turn!

If you’ve read this far, you probably want to play a little bit with the algorithm. Here are a few things you can try:

1. In order to make it more useful, you may wrap it with some tool that extracts content from a URL. I personally like Goose, but you may use other tools like Readability, boilerpipe or others. It obviously won’t improve your algorithm, but it can be really nice just to pick a URL and see the resulting summary.

2. In my code I intentionally didn’t use any other packages. You can explore the NLTK or OpenNLP packages and use their methods for splitting and tokenizing text. They usually provide much better methods for that.

3. Play with the intersection function. As I already wrote, you may use stopwords or stemming (both of these tools are included in NLTK!) and see how they change the result. You can also play with the normalization part of the equation (try dividing the result by different factors).

4. Create a new variation of the algorithm. For example, instead of picking the best sentence from each paragraph, try to pick the 2-3 most important paragraphs (in this case, each node of your graph is a full paragraph instead of a single sentence!)

5. Use the title! – In my code I just print it at the top of the summary, but I’m sure it can be useful (For example – try to add it with some factor in your intersection function).

6. Check it on other input languages – Although I tested this code only on English, theoretically it should work in any other language!

Obviously this algorithm is just a simple proof of concept, but I hope it gave you some general knowledge about how summarization technologies work. I should also mention that in this post I introduced only one approach to this task, while there are indeed a few others!

Leave your comments below, or just contact me directly on Twitter.

Shlomi Babluki.

Maze Solving Algorithm – Practicing Backtracking

Although it’s not directly connected to NLP, I decided to dedicate this post to a general algorithm called “Backtracking”. In simple words, Backtracking is a general way to solve a problem with a computer. The algorithm is considered “general” since it doesn’t describe how to solve a specific problem (like sorting numbers or searching for an element in a list), but rather gives some high-level principles for solving many different problems.

The main idea of Backtracking is to solve a problem by starting from point A and going through all the possible options/paths until reaching point B. When we find a path from A to B it means we found a solution and solved the problem!

The general form of a Backtracking algorithm:

The easiest way to implement a backtracking algorithm is by using recursion.

First, we need to define three cases:

  • Reject: We are on the wrong path → this path can’t lead us to the solution.
  • Accept: We found a solution to the problem.
  • Step: We are at some point between A and B; continue to the next step.

Second, we just put our cases in this “template” (pseudo code):

function backtracking (input_list, output_list):

	# Reject – this path doesn't lead to any solution
	if reject case = true:
		return false

	# Accept – We find a solution!
	if accept case = true:
		print output_list # Our solution
		return true

	# Step – go over all the possible options in the input
	for each (n in input_list):
		test = backtracking(input_list.remove(n), output_list.push(n)) # Recursion
		if test = true:
			return true

	# We didn't find any solution
	return false

# First call
call  backtracking({some_input}, {})

Example:

Let’s take this list of numbers: {3,4,1}, and try to sort it using Backtracking.

First, our cases:

Reject – if at the end of the output we have two unsorted numbers.

Accept – the input list is empty! (= we sorted all the numbers)

Step (Continue) – each time, take a number from the input and try to put it at the end of the output.

Running the algorithm:

First call: original list {3,4,1}, output {}

  1. pick 3 → original list {4,1}, output {3} – Continue
  2.   pick 4 → original list {1}, output {3,4} – Continue
  3.     pick 1 → original list {}, output {3,4,1} – Reject! no more options, go backward
  4.   pick 1 → original list {4}, output {3,1} – Reject! no more options, go backward
  5. pick 4 → original list {3,1}, output {4} – Continue
  6.   pick 3 → original list {1}, output {4,3} – Reject! try the next option
  7.   pick 1 → original list {3}, output {4,1} – Reject! no more options, go backward
  8. pick 1 → original list {3,4}, output {1} – Continue
  9.   pick 3 → original list {4}, output {1,3} – Continue
  10.     pick 4 → original list {}, output {1,3,4} – Accept! This is our solution!

Putting it all together – Java Implementation for sorting with backtracking:

import java.util.ArrayList;
public class BacktrackingSort {

	public static int counter = 0;

	public static boolean sortBacktracking
		(ArrayList<Integer> input_list, ArrayList<Integer> output_list, int level) {

		// Trace
		counter++;
		System.out.println(counter + "# Level " + level
				+ ": " + input_list + " " + output_list);

		// Reject – this path doesn't lead to any solution
		if (output_list.size() > 1 &&
				(output_list.get(output_list.size() - 2)
						> output_list.get(output_list.size() - 1))){
			return false;
		}

		// Accept – We find a solution!
		if (input_list.size() == 0) {
			System.out.println("Solution: " + output_list);
			return true;
		}

		// Step – go over all the possible options in the input
		for (Integer n : input_list) {
			ArrayList<Integer> clone_input_list
				= (ArrayList<Integer>) input_list.clone();
			clone_input_list.remove(n);
			ArrayList<Integer> clone_output_list
				= (ArrayList<Integer>) output_list.clone();
			clone_output_list.add(n);
			boolean test = sortBacktracking(clone_input_list,
					clone_output_list, level + 1);
			if (test){
				return true;
			}
		}

		// We didn't find any solution
		return false;
	}

	public static void main(String[] args) {
		ArrayList<Integer> input_list = new ArrayList<Integer>();
		input_list.add(3);
		input_list.add(4);
		input_list.add(1);
		ArrayList<Integer> output_list = new ArrayList<Integer>();

		// First call
		sortBacktracking(input_list, output_list, 0);
	}
}

Different variations of Backtracking:

Some problems have a single solution, but some others have multiple solutions. Moreover, some problems have good and bad solutions – for example: solving a maze.

Find any solution:

This variation of Backtracking stops once it encounters any solution to the problem. In the case of a maze, once we find a path from the starting point to the exit, we return it as the solution.

This variation of Backtracking is actually similar to the sorting example above.

I wrote a simple Backtracking code for solving a maze in Java. You can find the code here, and here is its solution:

# # # # # # # # # #
# X * * #   #     #
#     * #   #   # #
#     * * * * * * #
#               * #
# # #   * * * * * #
#       * # # # # #
# * * * *   #   # #
# S #             #
# # # # # # # # # #

Find the best solution:

Sometimes finding any solution is just not good enough and we need to find the best solution. Again, in the case of a maze, we’ll probably want to get the shortest path to the exit. In this case you need to change the Backtracking template a little bit. First, you should always check all the possible paths! Even if you find a solution, you should continue looking for a better one. The trick here is to have a dynamic Reject case. At the beginning of the algorithm you set some best_solution variable to a maximal value. During the runtime of the algorithm, every time you find a better solution, you update this variable. This variable gives you the ability to drop possible solutions without completely checking them.

For example, during the runtime of a maze-solving algorithm, say your best_solution = 11, which means you have already found a path of 11 steps to the exit. Then when you check some other solution, you can drop it immediately if it becomes longer than 11 steps!
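
Here is a rough Python sketch of that “dynamic Reject” idea on a tiny, made-up grid maze (the grid, start and exit below are just for illustration):

# 0 = open cell, 1 = wall (toy example).
MAZE = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
START, EXIT = (0, 0), (2, 2)

best_solution = None  # shortest path found so far

def solve(position, path):
    global best_solution

    # Dynamic Reject: drop this path as soon as it can no longer beat the best one.
    if best_solution is not None and len(path) >= len(best_solution):
        return

    if position == EXIT:
        best_solution = list(path)  # found a better (shorter) solution
        return

    row, col = position
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + dr, col + dc
        if (0 <= r < len(MAZE) and 0 <= c < len(MAZE[0])
                and MAZE[r][c] == 0 and (r, c) not in path):
            path.append((r, c))
            solve((r, c), path)
            path.pop()

solve(START, [START])
print(best_solution)  # e.g. [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]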

Here you can find a simple Backtracking code for getting the best solution to a maze, and here is its output:

# # # # # # # # # #
# X     #   #     #
# *     #   #   # #
# *               #
# * * *           #
# # # *           #
# * * *   # # # # #
# *         #   # #
# S #             #
# # # # # # # # # #

Find all solutions:

Of course, there is a variation of Backtracking for finding all the possible solutions to a problem. For example: you have this input of digits: {1,2,3,4,5} and you want to find all the permutations whose sum equals 12. This variation of the algorithm sits between the two other variations (it goes over all the possible options, with a regular Reject case):

import java.util.ArrayList;
public class BacktrackingDigitSum {

	// Find all permutations whose sum equals MATCH
	public final static int MATCH = 12;

	public static int counter = 0;

	public static void digitSumBacktracking(ArrayList<Integer> input_list,
											ArrayList<Integer> output_list,
											int sum, int level) {

		counter++;
		//Uncomment the line below to follow the algorithm runtime
		//System.out.println(counter + "# Level " + level
		//			+ ": " + input_list + " " + output_list);

		// Reject – this path doesn't lead to any solution
		if (sum > MATCH) {
			return;
		}

		// Accept – We find a solution!
		if (sum == MATCH) {
			System.out.println("Solution: " + output_list);
			return;
		}

		// Step – go over all the possible options in the input
		for (Integer n : input_list) {
			ArrayList<Integer> clone_input_list
				= (ArrayList<Integer>) input_list.clone();
			clone_input_list.remove(n);
			ArrayList<Integer> clone_output_list
				= (ArrayList<Integer>) output_list.clone();
			clone_output_list.add(n);
			digitSumBacktracking(clone_input_list,
					clone_output_list, sum + n, level + 1);
		}
	}

	public static void main(String[] args) {
		ArrayList<Integer> input_list = new ArrayList<Integer>();
		input_list.add(1);
		input_list.add(2);
		input_list.add(3);
		input_list.add(4);
		input_list.add(5);
		ArrayList<Integer> output_list = new ArrayList<Integer>();

		// First call
		digitSumBacktracking(input_list, output_list, 0, 0);
	}
}

Do and don’t do:

As a developer, I can tell you that once you understand the concept of Backtracking, it can be a really useful tool. It can help you quickly design and implement algorithms that solve quite complex problems. However, the biggest problem with Backtracking algorithms is that their efficiency is usually quite poor (brute force). Therefore you should be really careful when you choose to use it. I have my own set of rules for using Backtracking:

  1. Don’t use Backtracking to solve a problem that already has a known solution – for example, sorting. Sorting numbers with Backtracking is a very bad decision; you should definitely use Quicksort/Mergesort! Use Backtracking only when there is no other known/easy way to solve your particular problem.
  2. Use Backtracking to solve problems with a small input or with a small set of possible solutions, otherwise your CPU will really hate you!
  3. Trade-off – This is the most important one for me. Sometimes there is a known solution to a problem, but implementing & QA-ing it may take a long time. In this case you can consider using Backtracking, as long as you still get reasonable performance.

The bottom line: Backtracking is like a developer’s 5Kg hammer – don’t use it to hang a picture on the wall; use it if you need to break the wall!

My next post is going to be about Spellcheckers and it will include a nice example of how to build your own simple spellchecker by using Backtracking. Stay tuned and follow me on Twitter.

Shlomi Babluki

The Input of Natural Language systems

As I wrote in my previous post, a crucial part of building a good NLP system is properly defining the system’s main task. In my experience, the first step of defining that task is to deeply understand the input of the system.
Here is a list of questions and tips that should help you better understand your input and design your system accordingly:

Input Length:

Is your input a short text, like a search query or a short command, or a long text, like a news article or a Word document?

Short Text:

  1. In such cases you could potentially build a system that understands almost every word or phrase in your input.
  2. If your input is short enough, you may consider using a non-efficient algorithm, for example Backtracking. Implementing and maintaining non-efficient/naive algorithms is usually much easier, and if your input is really short, it won’t hurt performance very much – but be careful!
  3. From my own experience, when I have to deal with short inputs, I get better results by using older approaches like “Rule-Based Systems” and “Pattern Matching”.

Long Text:

  1. My first tip here is to not start analyzing text from the title, but from the body. I have several reasons for that:
    • The title is often not a full sentence, and sometimes it may be really short – even a single word!
    • The author can use the title as a teaser that doesn’t say much about the real content of the article.
    • The title may contain a word/phrase that could completely confuse your system. For example: “The flea meets the tiger, Who will win?” (Hint: this article is not about nature!)

    I usually analyze the body of the text first, and then try to compare it to the title.

  2. In most cases an article has a standard structure: 1-2 paragraphs for the introduction, the main content in the middle, and then 1-2 paragraphs for the summary / conclusions. Try to identify these special paragraphs, or just use this rule of thumb – what comes first is probably more important!
  3. In the case of a long text, your system may achieve good results without recognizing every word or phrase in the input. Define the cases in which your system can skip a word, a phrase or even a full sentence!
  4. If you’re dealing with a long text, I suggest taking the statistical approach.

Supervised Input:

Is your input supervised in some way? For example, an article from a big news site or a research/academic institution, etc. If so, you may assume the following:

  1. No Typos/Spelling mistakes.
  2. No Grammar mistakes.
  3. No punctuation mistakes.
  4. Topics and names will always start with an uppercase letter.
  5. The article probably has the standard structure I mentioned before.
  6. No slang.


Input Device:

How does the user (in case there is one) enter the input to your system? A keyboard? An iPhone touchscreen? Or maybe some old dialpad? This may have a big influence on your system!
Here are my notes (for short query systems):

  • Keyboard:
  1. Long input, usually up to 10-12 words.
  2. Probably a full sentence.
  3. Example: “I need the first flight tomorrow from London to Madrid”.
  4. Easy and comfortable typing in front of a large screen. Usually won’t contain a lot of mistakes, especially if the user knows that a computer needs to analyze his query.
  5. Typos, spelling mistakes and shortcuts frequency: Low.
  • iPhone (or any other smartphone) with a touchscreen:
  1. Medium input, up to 6-8 words.
  2. Not a full sentence.
  3. Example: “first flight tomorrow from London to Madrid”.
  4. Small screen and a “jumpy” keyboard. In addition, some people use auto-correct, which can completely change the input!
  5. Typos, spelling mistakes and shortcuts frequency: Medium.
  • Dialpad:
  1. Short input, up to 4-5 words!
  2. Missing prepositions (from, to, in, on…)
  3. Example: “London Madrid 1st 2moro”.
  4. Small screen, uncomfortable keyboard and slow typing. In this case you may also have to deal with some unusual mistakes, like switching a → b, a → c… etc.
  5. Typos, spelling mistakes and shortcuts frequency: High.


I’m sure there are many other issues related to the input of NLP systems. The idea behind this post was to raise your awareness of the topic – and to give you a few notes on why the input of these systems is not as obvious as it may seem at first.

My upcoming posts are going to talk about Backtracking and Spellcheckers, stay tuned and follow me on Twitter.

Shlomi Babluki

“Understand”

In the past few years I have developed a few NLP (Natural Language Processing) systems. Some were very simple, and others were extremely complicated. However, in all cases it started when one of my clients or partners came and requested “a system that understands… (something)”. So I decided to dedicate this post to sharing my opinion about the relationship between the term “understand” and NLP.

The Human Brain:

How does the human brain work? I guess nobody knows the real answer. However, I like the Dual Process theory and Daniel Kahneman’s approach, which simply states that the human brain is a combination of two systems:

#1 System:
“System 1 is fast; it’s intuitive, associative, metaphorical, automatic, impressionistic, and it can’t be switched off. Its operations involve no sense of intentional control, but it’s the “secret author of many of the choices and judgments you make””. [source]
“System 1 that decides whether you like a person, which thoughts or associations come to mind, and what you feel about something. All of this happens automatically. You can’t help it, and yet you often base your decisions on it.” [source]

#2 System:
“System 2 is slow, deliberate, effortful. Its operations require attention. To set it going now, ask yourself the question “What is 13 x 27?” And to see how it hogs attention” [source]
“System 2, on the other hand, is lazy and only becomes active when necessary. Slow, deliberate thinking is hard work.” [source]

The Computer “Brain”:

Obviously a computer doesn’t have a brain, but it has a very powerful (at least compared to humans) computing unit (CPU) and different types of memory. In some way we can say that a computer is a “Super #2 System”, but still – it doesn’t have a #1 System at all!

My Personal Opinion: (in relation to the NLP world)

#1 System gives us, as humans, the ability to understand the relations between objects in our world. If you ask a six-year-old to give you a list of words related to the word “Car”, he can do it quite easily. This is, of course, not true in the case of a computer. A computer can store in memory a long list of words under the category “Car”, but it doesn’t have any built-in ability to understand the relations between the words in the list; moreover, it doesn’t have the ability to take a completely new word/object and immediately understand if and how it is related to the list.

The bottom line, in my opinion, is that a computer can’t really understand anything! This is simply because it doesn’t have a #1 System. Although there are a few theories (see the Turing test) that state that a #1 System can be built out of a finite number of #2 Systems, I personally don’t agree with them. I truly believe that we need completely new technologies, models and approaches in order to build a machine/computer with a real #1 System.

So, What is NLP all about?

If a computer can’t understand anything, how can we build an NLP system – a system that takes a human text as input and returns a meaningful result?

The answer is quite simple. NLP is all about taking a very limited and well-defined task related to human language, and building a “Model” that tries to simulate the behavior of the #1 System under the circumstances of the given task. We build the “Model” using tools from Computer Science (data structures, algorithms…), Mathematics (statistics, probability…) and much more.

In my opinion, defining the task properly and building the right model is the “Art” of NLP.

For example, here is hashtagify.me, a visual graph that shows relations between hashtags on Twitter. On top of this graph it is possible to build a model that recognizes and ranks the relationships between tweets and Twitter users.

Timeout – My life outside the NLP world:
Besides being a computer geek, I’m also an amateur cross-fitter. Take a look at what I do for fun in my free time:

Follow me on Twitter or contact me: shlomibabluki@gmail.com.

Shlomi Babluki

Welcome to the NLP World

For some reason, I feel that NLP (Natural Language Processing) is considered an “academic” field. While I don’t have a degree in this field, I do have quite a bit of practical experience. In the past few years I have developed several NLP systems: a public transportation route planner, a remote television program recorder, an appointment scheduling system, as well as a few others. I am proud to say I have developed real products that thousands of people use every day!

First I want to apologize to the academics, as you may not agree with many of the things in this post (or the ones that follow).


The goal of NLP systems:

In simple words, the goal of an NLP system is to convert (or “translate”) a human text – such as a news article, text message, search request, Facebook status, etc. – into a well-defined data structure that is readable by a computer.

A very simple example – a system that recognizes flight search requests:

Possible inputs:

I’m looking for a flight
– from Madrid, Spain to London, England
– from Spain, Madrid to England, London
– to London, England, from Madrid, Spain
– to England, London, from Spain, Madrid

The Result:

<from>
<city>Madrid</city>
<country>Spain</country>
</from>
<to>
<city>London</city>
<country>England</country>
</to>

* Different inputs in a human language (above), all mapping to the same XML result (below them).

Usually the NLP system is not a standalone system, but one module of a larger system. In most cases the result of the NLP engine is used to retrieve some information (in our example: to search for a relevant flight in the schedule), and the final result is then sent back to the user.

The cycle of a standard system:

User → User input → NLP system → Database, Information center → Final Result → User

More NLP systems:

For further reading, here are some well-known uses of NLP:

Semantic role labeling
Named-entity recognition (NER)
Document classification
Language identification

Timeout – My life outside the NLP world:

I would like to introduce you to LEGO Mindstorms, a nice kit for learning about the robotics world:

Follow me on Twitter or contact me: shlomibabluki@gmail.com.

Shlomi Babluki