Hemanth's Scribes


Hyperlink to URL && Extracting all the links from a site


Hemanth HM


This is a small experiment I tried: converting, or rather extracting, a URL from a hyperlink.

Consider an example:

href='<a href="http://www.w3schools.com/" target="_blank">Visit W3Schools!</a>'

To get the URL, we need only the part that starts with http and sits between the double quotes.

So, I made a simple RE to strip the URL from any given hyperlink. The RE is written for sed, hence it can be used on stdout or on any file.

sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;'

As evident, the href="?([^ ">]*) part says: get me whatever follows href= that is not a space, " or >, i.e. the [^ ">] character class excludes exactly those characters.

So, to get the URL, we can just echo the contents of the variable href and sed the output into a URL variable.

echo "$href" | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;'

An echo of the variable, echo $URL, would give us the URL part of the href, hence converting href to URL.
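Putting the steps above together, the URL can be captured with command substitution (the sample href value is just the post's example):

```shell
# The hyperlink from the example, stored in a shell variable
href='<a href="http://www.w3schools.com/" target="_blank">Visit W3Schools!</a>'

# Run it through the sed expression; with a single href, the only
# line printed is the URL itself
URL=$(echo "$href" | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;')

echo "$URL"   # http://www.w3schools.com/
```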

Digging further, if it's a file with loads of hrefs, you can just cat the file and apply the same RE as above; it works well with multiple hrefs.
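For instance, with a throwaway file holding two anchors (the file name and URLs are made up for illustration), each line yields its URL:

```shell
# Create a small test file containing two hyperlinks
cat > links.html <<'EOF'
<a href="http://example.com/one">One</a>
<a href="http://example.com/two">Two</a>
EOF

# cat the file through the same sed expression: one URL per href
cat links.html | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;'
# http://example.com/one
# http://example.com/two
```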

You can also dump a URL to an *.html file using lynx or just curl, then cat the *.html and sed the output stream to get your URLs, which can again be redirected to a file.
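A minimal sketch of that pipeline; to keep it self-contained, a tiny dump is faked here in place of a real fetch, and the file names are arbitrary:

```shell
# A real run would first dump the page, e.g.:
#   curl -s "http://www.h3manth.com" -o page.html
# Here we fake a tiny dump so the pipeline runs standalone
printf '<a href="http://example.com/x">X</a>\n' > page.html

# cat the dump and sed the stream; redirect the extracted URLs to a file
cat page.html | sed -nr 's/(.*)href="?([^ ">]*).*/\2\n\1/; T; P; D;' > urls.txt

cat urls.txt   # http://example.com/x
```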

If you just want to collect all the links in a site:

lynx -dump "http://www.h3manth.com" | grep -o "http:.*" > links
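The same grep -o idea against canned input (the sample text is made up); note the pattern is tightened here to [^ ]* so each URL stops at the first space instead of running to end of line, and https links would need the pattern widened:

```shell
# Two lines of text, each containing one http URL
printf 'see http://example.com/a and\nalso http://example.com/b here\n' \
  | grep -o "http:[^ ]*"
# http://example.com/a
# http://example.com/b
```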

#javascript #linux

About Hemanth HM

Hemanth HM is a Sr. Machine Learning Manager at PayPal, Google Developer Expert, TC39 delegate, FOSS advocate, and community leader with a passion for programming, AI, and open-source contributions.