Last week, while working on my new project ‘Complete’ (a personalized autocomplete extension for Gmail), I was looking for a solution that could correctly detect the language of a text. I thought finding one would be easy, since I only needed it to work on long texts.
The first solution I considered could have fit the project’s needs, had it not been based on the NLTK stopwords corpus and therefore supported only 14 languages. Besides this one, I found a few others, which were a bit too heavy or complex for my needs. Not being entirely satisfied with the available solutions, I set out to build my own. You can find my code here and some more details about it throughout this post.
In my code there is a file called ‘data.json’, which is in fact the model for my solution. I built it by taking the 1,000 most common words for each language from here, and then keeping only the 50 most common words that are unique to each language.
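To make the filtering step concrete, here is a minimal sketch of how such a model could be built. It is not the code that produced ‘data.json’: the function name `build_model`, the input shape (a dict mapping a language name to its word list, most frequent first) and the reading of "unique" as "appears in no other language’s list" are all assumptions made for illustration.

```python
def build_model(common_words, keep=50):
    """Build a {language: [words]} model from per-language frequency lists.

    common_words is assumed to map a language name to its words,
    ordered from most to least frequent (hypothetical input shape).
    """
    # Count in how many languages each word appears.
    seen_in = {}
    for words in common_words.values():
        for word in set(words):
            seen_in[word] = seen_in.get(word, 0) + 1

    model = {}
    for language, words in common_words.items():
        # Keep frequency order, drop words shared with any other language.
        unique = [w for w in words if seen_in[w] == 1]
        model[language] = unique[:keep]
    return model


# Tiny made-up lists, only to show the shape of the data.
sample = {
    "english": ["the", "of", "and", "a", "no"],
    "spanish": ["de", "la", "que", "a", "no"],
}
print(build_model(sample, keep=3))
```

With real 1,000-word lists and `keep=50`, the result could be dumped to a JSON file with `json.dump`.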
My Python code uses the model in two steps. Step A: it converts the whole text into a set of tokens (by splitting the text into sentences and then each sentence into a list of tokens). Step B: it computes the intersection between the token set of the text and each language’s word set in the model, and keeps the language with the largest intersection (which is the result).
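The two steps can be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact code from the project: the name `detect_language` is hypothetical, a simple regular expression stands in for the NLTK tokenizer, the model is assumed to be a dict mapping a language name to a list of words, and ties are resolved in favour of whichever language is checked first.

```python
import re


def detect_language(text, model):
    """Return the best-matching language name, or None if nothing matches."""
    # Step A: reduce the text to a set of lowercase word tokens.
    # (A regex stands in for the NLTK tokenizer here.)
    tokens = set(re.findall(r"\w+", text.lower()))

    # Step B: score each language by the size of the intersection
    # between the text's tokens and that language's word set.
    best_language, best_score = None, 0
    for language, words in model.items():
        score = len(tokens & set(words))
        if score > best_score:
            best_language, best_score = language, score
    return best_language


# Toy model, far smaller than the real 50 words per language.
toy_model = {
    "english": ["the", "and", "with"],
    "spanish": ["el", "los", "para"],
}
print(detect_language("The cat sat with the dog and the bird.", toy_model))
print(detect_language("12345 ???", toy_model))
```

The first call prints `english`; the second prints `None`, because no token overlaps with any language set.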
3. Comments about my solution:
- As I already mentioned, this solution will work only with long texts (emails, news articles, full documents…), not phrases or single sentences.
- Theoretically, this solution should not produce false positives. It returns ‘None’ if it cannot identify the language of the given text.
- In my code, I used NLTK to split the text into tokens. If you don’t want to use it, you can just use Python’s built-in str.split() method.
- Feel free to use my ‘data.json’ model and implement my code in any other programming language.
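For anyone going the str.split() route mentioned above, one thing to watch for is that plain whitespace splitting leaves punctuation attached to words ("hello," is not the same token as "hello"). A minimal sketch of a split-based tokenizer that strips punctuation from the edges of each token might look like this; the function name `split_tokens` and the extra punctuation characters are my own choices, not part of the original code.

```python
import string

# ASCII punctuation plus a few common typographic marks (an arbitrary selection).
EDGE_CHARS = string.punctuation + "“”‘’«»¡¿…"


def split_tokens(text):
    """Tokenize with str.split(), stripping punctuation from token edges."""
    tokens = set()
    for raw in text.lower().split():
        token = raw.strip(EDGE_CHARS)
        if token:  # skip tokens that were punctuation only
            tokens.add(token)
    return tokens


print(split_tokens("Hello, world! Hello again…"))
```

The resulting set can be intersected with the language word sets in the same way as tokens produced by NLTK.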