algorithms | The Tokenizer

As I wrote in my previous post, a crucial part of building a good NLP system is properly defining the system’s main task. In my experience, the first step in defining that task is to deeply understand the input of the system.
Here is a list of questions and tips that should help you better understand your input, and design your system accordingly:

The Input of Natural Language Systems

Input Length:

Is your input a short text, such as a search query or a short command, or a long text, such as a news article or a Word document?

Short Text:

In such cases you could potentially build a system that understands almost every word or phrase in your input.
If your input is short enough, you may consider using an inefficient algorithm, for example backtracking. Naive algorithms are usually much easier to implement and maintain, and if your input is really short they won’t hurt performance much. But be careful: the running time of such algorithms can grow very quickly as the input gets longer.
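To make the backtracking idea concrete, here is a minimal sketch that finds every way to split a short query into phrases from a small lexicon. The lexicon, the query and the function name `segmentations` are all made up for illustration; a real system would plug in its own vocabulary.

```python
def segmentations(tokens, lexicon, start=0):
    """Yield every way to split tokens[start:] into phrases found in lexicon."""
    if start == len(tokens):
        # Reached the end of the query: one valid (empty) continuation.
        yield []
        return
    for end in range(start + 1, len(tokens) + 1):
        phrase = " ".join(tokens[start:end])
        if phrase in lexicon:
            # Accept this phrase, then recurse on the rest of the query.
            # If the rest cannot be split, nothing is yielded and we
            # simply fall through to the next candidate (the backtrack).
            for rest in segmentations(tokens, lexicon, end):
                yield [phrase] + rest


# Hypothetical lexicon, for illustration only.
LEXICON = {"first", "flight", "first flight", "new", "york", "new york"}

query = "first flight new york".split()
results = list(segmentations(query, LEXICON))

for split in results:
    print(split)
```

This tries every possible split, so the work can grow exponentially with the number of words. For a query of a handful of words that is usually acceptable; for a full article it is not.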
In my own experience, when I have to deal with short inputs, I get better results from older approaches such as rule-based systems and pattern matching.
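As a rough illustration of what a pattern-matching rule can look like, the sketch below uses a single regular expression to pull an origin and a destination out of a short flight query. The rule, the field names and `parse_flight_query` are assumptions made for this example, not part of any particular system.

```python
import re

# One hand-written rule: "... from <origin> to <destination> ..."
# Single-word city names only; this is a deliberate simplification.
FLIGHT_RULE = re.compile(
    r"\bfrom\s+(?P<origin>[A-Za-z]+)\s+to\s+(?P<destination>[A-Za-z]+)\b",
    re.IGNORECASE,
)


def parse_flight_query(query):
    """Return the origin and destination if the rule matches, else None."""
    match = FLIGHT_RULE.search(query)
    if match is None:
        return None
    return {
        "origin": match.group("origin"),
        "destination": match.group("destination"),
    }


print(parse_flight_query("I need the first flight tomorrow from London to Madrid"))
print(parse_flight_query("London Madrid 1st 2moro"))
```

The second query returns `None` because it has no "from ... to ..." pattern. A rule-based system typically grows by adding more rules for such cases, one input style at a time.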

Long Text:

My first tip here is to start analyzing the text from the body, not from the title. I have several reasons for that:
- The title is often not a full sentence, and sometimes it may be really short, even a single word!
- The author can use the title as a teaser that doesn’t say much about the real content of the article.
- The title may contain a word/phrase that could completely confuse your system. For example: “The flea meets the tiger, Who will win?” (Hint: this article is not about nature!)

I usually analyze the body of the text first, and then try to compare it to the title.
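One simple way to compare the two is to check how many of the title’s content words actually show up in the body. The sketch below is only an assumption about how such a check could look: the function names, the "longer than three letters" filter (a crude stand-in for a stop-word list) and the body text are all invented for the example.

```python
import re


def content_words(text):
    """Lowercased words longer than three letters (crude stop-word filter)."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def title_support(title, body):
    """Fraction of the title's content words that also appear in the body."""
    title_words = content_words(title)
    if not title_words:
        return 0.0
    return len(title_words & content_words(body)) / len(title_words)


title = "The flea meets the tiger, Who will win?"
# Made-up body text, for illustration only.
body = "The small club will defend deep against the league leaders."

print(title_support(title, body))
```

A low score suggests the title is a teaser or a metaphor, so the system should trust what it learned from the body and treat the title with care.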

In most cases an article has a standard structure: 1-2 paragraphs of introduction, then the main content, and then 1-2 paragraphs of summary or conclusions. Try to identify these special paragraphs, or just use this rule of thumb: what comes first is probably more important!
In the case of a long text, your system may achieve good results without recognizing every word or phrase in the input. Define the cases in which your system can skip a word, phrase or even a full sentence!
If you’re dealing with a long text, I suggest taking a statistical approach.
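Here is a small sketch that combines the statistical approach with the "what comes first is more important" rule of thumb: it counts words, but words in earlier paragraphs count for more. The weight `1 / (position + 1)` and the length filter are arbitrary choices made for this example, and `weighted_terms` is a hypothetical name.

```python
import re
from collections import Counter


def weighted_terms(paragraphs, top_n=5):
    """Score words by frequency, giving earlier paragraphs a higher weight."""
    scores = Counter()
    for index, paragraph in enumerate(paragraphs):
        weight = 1.0 / (index + 1)  # assumed weighting: 1, 1/2, 1/3, ...
        for word in re.findall(r"[a-z']+", paragraph.lower()):
            if len(word) > 3:  # crude stand-in for a stop-word list
                scores[word] += weight
    return scores.most_common(top_n)


paragraphs = [
    "Chess engines beat grandmasters.",
    "The weather was nice.",
    "Chess again.",
]
print(weighted_terms(paragraphs))
```

Note that the system never needs to understand every word here. Short or unknown words are skipped, which matches the idea of defining cases where a word can safely be ignored.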

Supervised Input:

Is your input supervised in some way, for example an article from a big news site or from a research or academic institute? If so, you can usually assume the following:

No typos or spelling mistakes.
No grammar mistakes.
No punctuation mistakes.
Topics and names will always start with an uppercase letter.
The article probably has the standard structure I mentioned before.
No slang.
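If these assumptions hold, even a very naive heuristic can go a long way. The sketch below relies on correct punctuation and capitalization to collect candidate names: runs of capitalized words that do not start a sentence. The function name and the sample text are made up, and the heuristic would be unreliable on unsupervised input.

```python
import re

PUNCTUATION = ".,;:!?\"'()"


def candidate_names(text):
    """Collect runs of capitalized words, ignoring sentence-initial words."""
    names = []
    # Relies on correct punctuation to find sentence boundaries.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        current = []
        for word in sentence.split()[1:]:  # skip the first word of a sentence
            cleaned = word.strip(PUNCTUATION)
            if cleaned[:1].isupper():
                current.append(cleaned)
                if word[-1] not in PUNCTUATION:
                    continue  # the run may go on with the next word
            # A lowercase word or trailing punctuation ends the run.
            if current:
                names.append(" ".join(current))
            current = []
        if current:
            names.append(" ".join(current))
    return names


text = ("The flight from London to New York was delayed. "
        "Officials in Madrid blamed the weather.")
print(candidate_names(text))
```

A name that opens a sentence is missed by this sketch, which is the price of not confusing ordinary sentence-initial words with names.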

Input Device:

How does the user (if there is one) enter the input into your system? A keyboard? An iPhone touchscreen? Or maybe some old dialpad? It may have a big influence on your system!
Here are my notes (for short query systems):

Keyboard:

Long input, usually up to 10-12 words.
Probably a full sentence.
Example: “I need the first flight tomorrow from London to Madrid”.
Easy and comfortable typing in front of a large screen. The input usually won’t contain a lot of mistakes, especially if the user knows that a computer needs to analyze their query.
Typos, spelling mistakes and shortcuts frequency: Low.

iPhone (or any other smartphone) with a touchscreen:

Medium input, up to 6-8 words.
Not a full sentence.
Example: “first flight tomorrow from London to Madrid”.
Small screen and a “jumpy” keyboard. In addition, some people use auto-correct systems, which can completely change the input!
Typos, spelling mistakes and shortcuts frequency: Medium.

Dialpad:

Short input, up to 4-5 words!
Missing prepositions (from, to, in, on...)
Example: “London Madrid 1st 2moro”.
Small screen, uncomfortable keyboard and slow typing. In this case you may also have to deal with some unusual mistakes, like switching between letters that share the same key: a → b, a → c, etc.
Typos, spelling mistakes and shortcuts frequency: High.
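The a → b, a → c kind of mistake can be handled with a device-specific correction step. The sketch below assumes the standard phone keypad letter groups (abc, def, ghi, ...) and a tiny made-up vocabulary; it tries replacing each letter with the other letters on the same key and keeps the results that are known words. The function name is hypothetical.

```python
# Letter groups of a standard phone keypad (keys 2 to 9).
KEYPAD = ["abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]
SAME_KEY = {letter: group for group in KEYPAD for letter in group}


def same_key_corrections(word, vocabulary):
    """Suggest known words that differ from `word` by one same-key letter."""
    word = word.lower()
    if word in vocabulary:
        return [word]
    candidates = set()
    for i, letter in enumerate(word):
        for alternative in SAME_KEY.get(letter, ""):
            if alternative != letter:
                candidate = word[:i] + alternative + word[i + 1:]
                if candidate in vocabulary:
                    candidates.add(candidate)
    return sorted(candidates)


# Hypothetical vocabulary, for illustration only.
VOCABULARY = {"london", "madrid", "paris"}

print(same_key_corrections("londom", VOCABULARY))
print(same_key_corrections("madrie", VOCABULARY))
```

This only covers a single wrong letter per word. Shortcuts such as “2moro” are a different problem and would need their own lookup table or rules.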

I’m sure there are many other issues related to the input of NLP systems. The idea behind this post was to raise your awareness of the topic, and to give you a few notes on why the input of those systems is not as obvious as it may seem at first.

My upcoming posts are going to talk about backtracking and spellcheckers. Stay tuned and follow me on Twitter.

Shlomi Babluki
