Handling Internationalized Domain Names (IDN) in scripting

IDNs help many people, but they initially caused a lot of trouble for many programming languages, especially when parsing them and reading the content behind them.

This article focuses on scripting languages (bash, Perl, Python and Ruby) reading content from an IDN, or simply put, from a URL with Unicode characters in it.

ICANN has approved internationalized names under many TLDs; for this scenario I shall pick 'http://☃.net' (snowman!)

BASH :  curl 'http://☃.net' # Awesomeness of curl!

Perl :

Well, it's pretty easy to handle in Perl. Don't forget 'use utf8'.

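#!/usr/bin/env perl
 
use strict;
use warnings FATAL => 'all';
use feature ":5.10";
use utf8;              # the source file contains a literal ☃
use LWP::UserAgent;
use HTTP::Request;

# Under 'use strict' both variables have to be declared
my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://☃.net');
say $ua->request($req)->content;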

Python:

Not as easy as in Perl, but still manageable. Do check the related bug report.

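#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
import urllib2  # Python 2 only

domain = '☃.net'
# Convert the host name to its ASCII (Punycode) form before building the URL
url = 'http://' + unicode(domain, "utf8").encode("idna")
print urllib2.urlopen(url).read()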
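The snippet above is Python 2 only, since unicode() and urllib2 are gone in Python 3. As a rough sketch of the same idea on Python 3, the host part of a URL can be converted with the built-in "idna" codec and the URL rebuilt around it. The helper name to_ascii_url is my own invention, and the sketch assumes the URL has a host and carries no user:password part (it would be dropped).

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Return url with its host name converted to the ASCII (IDNA) form."""
    parts = urlsplit(url)
    # parts.hostname is the bare host: no port, no userinfo, lower-cased
    host = parts.hostname.encode("idna").decode("ascii")
    # Put an explicit port back if the original URL had one
    netloc = host if parts.port is None else "{}:{}".format(host, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(to_ascii_url("http://☃.net"))
```

The returned string is plain ASCII, so it can be handed to urllib.request.urlopen() in the same way the Python 2 snippet hands its URL to urllib2.urlopen().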

Before we proceed, it's a must to know about Punycode.

Punycode is an instance of Bootstring that uses the particular parameter values specified by RFC 3492. It is the transfer encoding used for Internationalized Domain Names in Applications (IDNA). It uniquely and reversibly transforms a Unicode string into an ASCII string. ASCII characters in the Unicode string are represented literally, and non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).

unicode(domain, "utf8").encode("idna") results in a Punycode-encoded name: 'xn--n3h.net'
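To see the "uniquely and reversibly" part in action, here is a small Python 3 sketch (Python 3 is my assumption; the article's own snippet is Python 2) using only the built-in "idna" and "punycode" codecs. It encodes the snowman domain, decodes it back, and shows that a plain ASCII name passes through untouched.

```python
domain = "☃.net"

# Unicode -> ASCII: each non-ASCII label gets the "xn--" prefix plus its Punycode form
ascii_form = domain.encode("idna")
print(ascii_form)

# ASCII -> Unicode: decoding restores the original name
round_trip = ascii_form.decode("idna")
print(round_trip == domain)

# A name that is already ASCII is left as it is
plain = "example.net".encode("idna")
print(plain)

# The raw Punycode of the single label, without the "xn--" prefix
raw_label = "☃".encode("punycode")
print(raw_label)
```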

Ruby : URI doesn't implement unicode domains! (Shame?) (BUG)

Workaround? Yes, you guessed it right: $ sudo gem install addressable, and don't forget the '# encoding: utf-8' line.

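#!/usr/bin/env ruby
# encoding: utf-8
require "rubygems"
require "addressable/uri"
require "open-uri"
 
# normalize converts the host to its Punycode form; site returns scheme + host
url = Addressable::URI.parse('http://☃.net').normalize.site
# URI#open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
puts URI.parse(url).open.read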

That's it from me. Do let me know if you find better ways of handling this! Happy Hacking!
