Generate a Sitemap for Web Output

Summary

Quadralay should make it easy to generate a single Google sitemap containing the URLs of all generated files. A sitemap is needed to get content indexed quickly by Google (Site Search, Custom Search, and regular Google search). The flat link files generated at wwhelp/books.htm don't work for this. The need is most pressing for WebWorks Help 5.0, but it applies to Reverb as well.

Detailed Description

When deploying large bodies of content to the Web and using Google to provide search over it, you often need to get new or updated content indexed as soon as possible. The only way Google allows this is by submitting a sitemap that lists the URLs of all the pages it should recrawl promptly. Submitting a sitemap that is merely an entry point to a flat set of link files doesn't work: Google crawls only the top-level file quickly and gets to the rest "whenever".

Generating these sitemap files is painful, or even impossible, with current tools. The main options are to install a script on your web server or to use fairly arcane desktop tools. Neither knows about the wwhelp/books.htm entry point, so you have to devise some hack to make the tool look at it.

It doesn't have to be built into the product, or even officially supported. As with some of the projects on the wiki, someone could write a small Python/sed/awk/VB script that works through wwhelp/books.htm and writes out a Sitemap 0.9 XML file as specified on sitemaps.org, then post it on the wiki. I think ePublisher has reached a level of maturity where adjunct tools like this add as much value as supporting new DITA formats and so on.
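
To give a feel for how small such a script could be, here is a rough Python sketch. It assumes that wwhelp/books.htm is ordinary HTML whose pages are reachable through plain <a href="..."> links, which I have not verified against real output. The names LinkCollector and build_sitemap and the example.com URL are made up for illustration. It only reads one page of links; if books.htm points at further link files, those would have to be followed as well.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
import xml.etree.ElementTree as ET

# Namespace required by the Sitemap 0.9 protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_sitemap(html_text, page_url):
    """Return sitemap XML for the links found in html_text.

    page_url is the public URL of the link file itself; relative
    links are resolved against it. Assumption: every link in the
    file is a page that should be listed.
    """
    collector = LinkCollector()
    collector.feed(html_text)

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for href in collector.hrefs:
        # Resolve to an absolute URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute in seen:
            continue
        seen.add(absolute)
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = absolute
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    sample = '<a href="guide/intro.html">Intro</a>'
    print(build_sitemap(sample, "https://example.com/help/wwhelp/books.htm"))
```

A real version would read books.htm from the output folder, take the deployment URL as a parameter, and write the result to sitemap.xml with an XML declaration.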

Use Cases

The use case is that when ePublisher generates WebWorks Help 5.0 or Reverb output, it also generates a sitemap. Even better would be a sitemap tailored to give new or changed files a higher priority so they get crawled first.
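
The "higher priority for new/changed files" idea could be sketched along these lines, using the optional <lastmod> and <priority> elements from the Sitemap 0.9 protocol. The function name build_prioritized_sitemap, the 1.0 and 0.3 priority values, and the idea of a "changed since" cutoff are my assumptions, not anything ePublisher provides. Crawlers treat these values as hints only and are free to ignore them.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_prioritized_sitemap(pages, changed_since):
    """Return sitemap XML with recently changed pages ranked first.

    pages: iterable of (url, modified) pairs, where modified is a
    datetime (for example, taken from the generated file's mtime).
    changed_since: datetime cutoff, e.g. the previous publish time.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # List the most recently modified pages first.
    for url, modified in sorted(pages, key=lambda p: p[1], reverse=True):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # The protocol accepts a plain YYYY-MM-DD date for lastmod.
        ET.SubElement(entry, "lastmod").text = modified.date().isoformat()
        # Illustrative values: anything from 0.0 to 1.0 is allowed.
        is_recent = modified >= changed_since
        ET.SubElement(entry, "priority").text = "1.0" if is_recent else "0.3"
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    cutoff = datetime(2011, 8, 1, tzinfo=timezone.utc)
    pages = [
        ("https://example.com/help/old.html",
         datetime(2011, 1, 5, tzinfo=timezone.utc)),
        ("https://example.com/help/new.html",
         datetime(2011, 8, 20, tzinfo=timezone.utc)),
    ]
    print(build_prioritized_sitemap(pages, cutoff))
```

If the generator already knows which topics it regenerated in a run, that list would be a better signal than file timestamps.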



Enhancements/Generate a sitemap for your web output (last edited 2011-08-24 14:45:17 by DaveTruman)