Hyperlink to URL && Extracting all the links from a site

This is a small experiment where I tried to convert, or rather extract, a URL from a hyperlink.

Consider an example:

href='<a href="http://www.w3schools.com/" target="_blank">Visit W3Schools!</a>'

To get the URL, we need only the part that starts with http and sits between the double quotes.

So I made a simple regular expression (RE) to strip the URL out of any given hyperlink. The RE is written for `sed`, so it can be used on stdout or on any file.

sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;'

As evident, the `href="?([^ ">]*)` part says: get me the thing after `href=` that is not a space, `"`, or `>` (that's the negated character class `[^ ">]`); the captured group `\2` is the URL.
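For instance, running the one-liner above over the sample anchor tag (GNU sed, which provides the `-r` flag) pulls out just the URL:

```shell
# Feed the sample hyperlink to the sed one-liner; only the
# captured URL (group \2) is printed, thanks to -n plus P.
echo '<a href="http://www.w3schools.com/" target="_blank">Visit W3Schools!</a>' \
  | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;'
# prints: http://www.w3schools.com/
```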

So, to get the URL, we can just echo the contents of the variable `href` and pipe the output through sed into a `URL` variable.

URL=$(echo "$href" | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;')

An echo on `URL`, `echo "$URL"`, gives us the URL part of the href, hence converting the hyperlink into a plain URL.
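Putting the two steps together, a minimal sketch (variable names `href` and `URL` as above, GNU sed assumed):

```shell
# Store a sample hyperlink, extract the URL with sed,
# and capture it into the URL variable.
href='<a href="http://www.w3schools.com/" target="_blank">Visit W3Schools!</a>'
URL=$(echo "$href" | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;')
echo "$URL"
# prints: http://www.w3schools.com/
```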

Digging further, if it's a file with loads of hrefs, you can just cat the file and use the same RE; it works well with multiple hrefs too.
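A quick sketch of the multi-href case (the file name `links.html` is made up for the demo; note that the `D` loop walks each line right to left, so URLs sharing a line come out in reverse order):

```shell
# A throwaway file with several hrefs, two of them on one line.
cat > links.html <<'EOF'
<a href="http://example.com/one">one</a> <a href="http://example.com/two">two</a>
<a href="http://example.com/three">three</a>
EOF

cat links.html | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;'
# prints:
# http://example.com/two
# http://example.com/one
# http://example.com/three
```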

You can also dump a page to an *.html file using lynx or curl, then cat the file and pipe it through sed to get your URLs, which can in turn be redirected to a file.
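A sketch of that pipeline (using a locally written page.html as a stand-in for a live `curl`/`lynx` fetch, so it runs offline; the file names are my own):

```shell
# Stand-in for: curl -s http://example.com > page.html
cat > page.html <<'EOF'
<p><a href="http://example.com/a">A</a></p>
<p><a href="http://example.com/b">B</a></p>
EOF

# Extract every URL and redirect the result to a file.
cat page.html | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;' > urls.txt
cat urls.txt
# prints:
# http://example.com/a
# http://example.com/b
```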

If you just want to collect all the links in a site:

lynx -dump "http://www.h3manth.com" | grep -o "http:.*" > links
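If lynx is not at hand, plain `grep -Eo` over any text dump does the same job; here is a sketch on an inline sample (the `https?://[^ ">]+` pattern is my own assumption, not lynx output):

```shell
# grep -o prints only the matching parts, one match per line;
# -E enables the extended regex with the optional "s" in https?.
printf '%s\n' 'see http://example.com/x and http://example.com/y here' \
  | grep -Eo 'https?://[^ ">]+'
# prints:
# http://example.com/x
# http://example.com/y
```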
