Handling Internationalized Domain Names (IDN) in Scripting
IDNs help many people, but they initially caused a lot of trouble for many programming languages, especially when parsing such URLs and reading their contents.
The focus of this article is on scripting languages: bash, perl, python and ruby, reading contents from an IDN, or simply put, a URL with Unicode characters in it.
ICANN has approved many internationalized TLDs, but for this scenario I shall use a domain with a Unicode label under a regular TLD: ‘http://☃.net’ (snowman!)
BASH:
curl 'http://☃.net' # Awesomeness of curl!
Perl:
Well, it’s pretty easy to handle in Perl. Don’t forget to use utf8, since the script contains a literal Unicode character.
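#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use feature ':5.10';
use utf8;            # the source below contains a literal Unicode character
use LWP::UserAgent;  # HTTP client
use HTTP::Request;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'http://☃.net');
say $ua->request($req)->content;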
Python:
Not as easy as Perl, but still manageable. Do check this BUG.
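#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 code: unicode() and urllib2 do not exist in Python 3.
import urllib2

domain = '☃.net'
# Encode the host name to its ASCII (IDNA) form before building the URL.
url = 'http://' + unicode(domain, 'utf8').encode('idna')
print urllib2.urlopen(url).read()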
Before we proceed, it is a must to know about Punycode:
Punycode is an instance of Bootstring that uses the particular parameter values specified by RFC 3492. It is a transfer encoding designed for Internationalized Domain Names in Applications (IDNA). It uniquely and reversibly transforms a Unicode string into an ASCII string. ASCII characters in the Unicode string are represented literally, and non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).
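To make the two properties above concrete (ASCII characters kept literally, and a reversible mapping), here is a minimal sketch using the "punycode" codec that ships with Python 3. The helper names to_punycode and from_punycode are made up for this illustration, and the label ‘bücher’ is just an extra sample that is not part of the snowman domain.

```python
# A small sketch of raw Punycode using Python 3's built-in "punycode" codec.
# to_punycode / from_punycode are hypothetical helper names for illustration.

def to_punycode(label: str) -> str:
    """Encode a single label with raw Punycode (no 'xn--' prefix is added)."""
    return label.encode("punycode").decode("ascii")


def from_punycode(encoded: str) -> str:
    """Decode a raw Punycode string back into the original Unicode label."""
    return encoded.encode("ascii").decode("punycode")


snowman = to_punycode("☃")      # a label with no ASCII characters at all
mixed = to_punycode("bücher")   # the ASCII letters stay readable up front

print(snowman)
print(mixed)
print(from_punycode(snowman) == "☃")  # the transformation is reversible
```

Note that this is the bare encoding only; in a real host name each encoded label additionally carries the ‘xn--’ prefix, which the IDNA layer adds.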
unicode(domain, "utf8").encode("idna") would result in a Punycode host name like 'xn--n3h.net'
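The unicode() built-in and urllib2 belong to Python 2. For Python 3, a rough equivalent might look like the sketch below, which converts only the host part of a URL with the built-in "idna" codec (per the Python documentation it follows the older RFC 3490 rules). The function name idn_to_ascii_url is an assumption made for this example, and the sketch deliberately drops any user:password part of the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def idn_to_ascii_url(url: str) -> str:
    """Return the URL with its host name converted to the ASCII (xn--) form.

    Hypothetical helper for illustration; userinfo (user:password@) is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # The "idna" codec encodes each non-ASCII label and adds the xn-- prefix.
    ascii_host = host.encode("idna").decode("ascii")
    # Rebuild the network location, keeping an explicit port if one was given.
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


ascii_url = idn_to_ascii_url("http://☃.net/")
print(ascii_url)
# To actually fetch the page, pass the result to urllib.request.urlopen():
#   import urllib.request
#   body = urllib.request.urlopen(ascii_url).read()
```

Splitting the URL first keeps the path, query and fragment untouched, so only the host name goes through the IDNA conversion.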
Ruby: URI doesn’t implement Unicode domains! (Shame?) (BUG)
The workaround, yes, you guessed it right: $ sudo gem install addressable, and don’t forget the # encoding: utf-8 comment at the top of the script.
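#!/usr/bin/env ruby
# encoding: utf-8
require "rubygems"
require "addressable/uri"
require "open-uri"

# normalize converts the host to its Punycode form; site keeps scheme and host.
url = Addressable::URI.parse('http://☃.net').normalize.site
# URI.open needs Ruby 2.5+; older versions used Kernel#open(url) instead.
puts URI.open(url).read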
That’s it from me, do let me know if you find better ways of handling this! Happy Hacking!
About Hemanth HM
Hemanth HM is a Sr. Machine Learning Manager at PayPal, Google Developer Expert, TC39 delegate, FOSS advocate, and community leader with a passion for programming, AI, and open-source contributions.