Webpage Language Identification
Posted on 15 March 2007
Many a time, while searching for information on the Internet, we come across webpages that are in some other language. Worse still, sometimes they are the only pages that have the information we need (Google returns only a single-digit number of results). This is common for people whose native language is not English.
While there are tools like Google Translate to convert a given webpage into a desired language, they help only if we know the source language of the content for sure. At times, that is difficult to work out.
The same happened to me while looking for information related to highway engineering. There was this one tutorial about the concept, and only one other link: my query fetched just 2 results in Google. Ah, my bad luck.
I tried every language combination Google offered but couldn’t retrieve any information. My curiosity led me to search for a language identification tool. The search fetched some 22 million results in Google, but most of the tools were either demos or could not identify the language of the tutorial. I dug deeper into the results.
Then I found FaganFinder, which identified the page's language in a few seconds. The power of the tool was quite compelling, and I felt I had to try it further. I tested more and more pages whose language I could not identify, until I found one that even FaganFinder wasn’t able to decipher. So my search for the ultimate language identification tool continued until I came across the Xerox PARC website for translation. This one identified the page for me immediately.
FaganFinder and the Xerox PARC tool are the two tools that can come in handy for identifying the language of a webpage.