XML sitemap to URL list: a command-line tool in Python

I do a lot of technical SEO, and checking for broken links with Xenu is part of my routine. I wanted to pull the list of a website's URLs out of its sitemap.xml so that I could feed them to Xenu.

So I implemented it in my favorite scripting language, Python. PyQuery is a Python DOM-parsing library whose API is modeled on jQuery, the JavaScript library.

Let’s install the required dependency first, for example with pip install pyquery.

Verify pyquery setup

python
>>> import pyquery
>>> dir(pyquery)
['PyQuery','__builtins__','__doc__','__file__','__name__','__package__','__path__','cssselectpatch','openers','pyquery']

Now let’s look into pyquery basics.

In jQuery you select a DOM node as follows:

jQuery('element')

In PyQuery you first build a query object from the document you want to search:

from pyquery import PyQuery as pq
jQuery = pq(url=remote_sitemap)

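The statement above creates a PyQuery object for the document at remote_sitemap and assigns it to the variable jQuery. You can then call it with a selector, just as in jQuery, for example jQuery('loc').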

Now let’s code the sitemap grabber and parser.

Sample sitemap format

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/</loc><lastmod>2013-09-10</lastmod><priority>1.0</priority></url>
</urlset>
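
As a cross-check on the format above, here is a minimal sketch that pulls the loc values out of the sample using only the standard library's xml.etree.ElementTree instead of pyquery. The names SAMPLE_SITEMAP, SITEMAP_NS and extract_locs are made up for this illustration. Note that a strict XML parser needs the sitemap namespace spelled out when searching for elements.

```python
import xml.etree.ElementTree as ET

# The sample sitemap from above, kept as bytes so the XML declaration
# (with its encoding attribute) is handled by the parser itself.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/</loc><lastmod>2013-09-10</lastmod><priority>1.0</priority></url>
</urlset>
"""

# Namespace declared on <urlset>; every child element lives in it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_bytes):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in extract_locs(SAMPLE_SITEMAP):
        print(url)
```

This only illustrates the structure of the file; the rest of the post keeps using pyquery.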

Grabbing and parsing the URLs with pyquery

from pyquery import PyQuery as pq

# remote_sitemap holds the URL of the sitemap to fetch
jQuery = pq(url=remote_sitemap, headers={'user-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
# Each <loc> element holds one page URL
for i in jQuery('loc'):
    print(jQuery(i).text())

The loop above prints the text of every loc node in sitemap.xml.

Now let's do a small Python file-handling exercise to save the list of URLs.

# Mode "w" creates the file if it is missing and empties it if it exists,
# so lines left over from an earlier, longer run never survive.
with open(filename, "w") as f:
    for i in jQuery('loc'):
        url = jQuery(i).text()
        print(url)
        f.write(url + "\n")

The code above opens the file and writes the URL from each loc node on its own line.
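
The same file-writing step can be sketched as a small standalone function that does not depend on pyquery. save_urls is a hypothetical helper name, and skipping blank entries is an assumption made for this sketch. Opening with mode "w" means each run starts from an empty file.

```python
import os
import tempfile


def save_urls(urls, filename):
    """Write one URL per line to filename and return how many were written."""
    count = 0
    # "w" creates the file when missing and empties it when it already exists.
    with open(filename, "w", encoding="utf-8") as f:
        for url in urls:
            url = url.strip()
            if not url:
                continue  # skip blank entries
            f.write(url + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Demo only: write to a throwaway temporary directory.
    demo_path = os.path.join(tempfile.mkdtemp(), "urls.txt")
    written = save_urls(["http://www.example.com/"], demo_path)
    print(written, "URL(s) written to", demo_path)
```

Any iterable of strings works as input, so the text of the loc nodes could be passed in directly.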

Finally, let’s make it a command-line tool.

from pyquery import PyQuery as pq
import sys
from urllib.error import HTTPError

# Expect exactly two arguments: the sitemap URL and the output file
if len(sys.argv) != 3:
    print("Usage: sitemap_to_list.py remote_url localfile")
    sys.exit(1)

remote_sitemap = sys.argv[1]
local_file = sys.argv[2]

try:
    jQuery = pq(url=remote_sitemap, headers={'user-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
except HTTPError as err:
    # Assumes a recent pyquery on Python 3, which reports HTTP failures as urllib's HTTPError
    if err.code == 404:
        print("File " + remote_sitemap + " not found")
    elif err.code == 403:
        print("\n403 Access Denied\n****************\nFile " + remote_sitemap + " reading error")
    else:
        print("HTTP error " + str(err.code) + " while reading " + remote_sitemap)
    # Exit with a non-zero status so callers can detect the failure
    sys.exit(1)

# Mode "w" creates the file if missing and empties it if it already exists
with open(local_file, "w") as f:
    for i in jQuery('loc'):
        url = jQuery(i).text()
        print(url)
        f.write(url + "\n")
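
If the tool ever grows more options, the hand-rolled sys.argv length check could be swapped for the standard library's argparse. The following is a sketch of that alternative, not part of the script above. parse_args is a name chosen for the illustration, and the argument names simply mirror the usage string.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments the tool expects."""
    parser = argparse.ArgumentParser(
        prog="sitemap_to_list.py",
        description="Save the <loc> URLs of a remote sitemap.xml to a local file.",
    )
    parser.add_argument("remote_url", help="URL of the sitemap.xml to fetch")
    parser.add_argument("localfile", help="path of the text file to write")
    # With argv=None argparse reads sys.argv[1:], as a real script would.
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Example values only; a real script would call parse_args() with no list.
    args = parse_args(["http://www.example.com/sitemap.xml", "urls.txt"])
    print(args.remote_url, "->", args.localfile)
```

With argparse, a wrong number of arguments prints a usage message and exits with status 2, and -h comes for free.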