Get a list of URLs from a site

Web Crawler

Web Crawler Problem Overview


I'm deploying a replacement site for a client but they don't want all their old pages to end in 404s. Keeping the old URL structure wasn't possible because it was hideous.

So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. Problem is, I need a list of all the old page URLs.
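For illustration, here's a minimal sketch of such a handler, assuming a Python/Flask stack; the framework choice and the OLD_TO_NEW mapping are placeholders, not something given in the question:

```python
from flask import Flask, redirect, request

app = Flask(__name__)

# Hypothetical mapping from old relative URLs to their new locations.
OLD_TO_NEW = {
    "/old/about-us.html": "/about",
    "/old/products.php": "/products",
}

@app.errorhandler(404)
def handle_404(error):
    new_url = OLD_TO_NEW.get(request.path)
    if new_url is not None:
        # 301 marks the redirect as permanent, so browsers and
        # search engines update their records.
        return redirect(new_url, code=301)
    return "Sorry, that page wasn't found.", 404
```

The hard part is filling in that mapping, which is where the question below comes in.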

I could do this manually, but I'd be interested to know if there are any apps that would provide me a list of relative (e.g. /page/path, not http://.../page/path) URLs given just the home page. Like a spider, but one that doesn't care about the content other than to find deeper pages.

Web Crawler Solutions


Solution 1 - Web Crawler

I didn't mean to answer my own question, but I just thought of running a sitemap generator. The first one I found, http://www.xml-sitemaps.com, has a nice text output. Perfect for my needs.

Solution 2 - Web Crawler

Run `wget -r -l0 www.oldsite.com`

Then just `find www.oldsite.com` would reveal all the URLs, I believe (wget saves the mirror into a directory named after the host).
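If you go this route, a short script can turn the mirrored directory tree into the relative URL list the question asks for. A sketch in Python, assuming wget's default layout of a single directory named after the host (the MIRROR_ROOT path is illustrative):

```python
import os

MIRROR_ROOT = "www.oldsite.com"  # directory created by wget -r

urls = []
for dirpath, _dirnames, filenames in os.walk(MIRROR_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        # Strip the host directory to get a relative URL like /page/path.
        rel = "/" + os.path.relpath(path, MIRROR_ROOT).replace(os.sep, "/")
        urls.append(rel)

print("\n".join(sorted(urls)))
```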

Alternatively, just serve that custom not-found page on every 404 request! That is, if someone follows a broken link, they get a page saying the page wasn't found, with some hints about the site's content.

Solution 3 - Web Crawler

Here is a list of sitemap generators (from which obviously you can get the list of URLs from a site): http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators

> Web Sitemap Generators
>
> The following are links to tools that generate or maintain files in the XML Sitemaps format, an open standard defined on sitemaps.org and supported by the search engines such as Ask, Google, Microsoft Live Search and Yahoo!. Sitemap files generally contain a collection of URLs on a website along with some meta-data for these URLs. The following tools generally generate "web-type" XML Sitemap and URL-list files (some may also support other formats).
>
> Please Note: Google has not tested or verified the features or security of the third party software listed on this site. Please direct any questions regarding the software to the software's author. We hope you enjoy these tools!
>
> Server-side Programs
>
> - Enarion phpSitemapsNG (PHP)
> - Google Sitemap Generator (Linux/Windows, 32/64bit, open-source)
> - Outil en PHP (French, PHP)
> - Perl Sitemap Generator (Perl)
> - Python Sitemap Generator (Python)
> - Simple Sitemaps (PHP)
> - SiteMap XML Dynamic Sitemap Generator (PHP) $
> - Sitemap generator for OS/2 (REXX-script)
> - XML Sitemap Generator (PHP) $
>
> CMS and Other Plugins:
>
> - ASP.NET - Sitemaps.Net
> - DotClear (Spanish)
> - DotClear (2)
> - Drupal
> - ECommerce Templates (PHP) $
> - Ecommerce Templates (PHP or ASP) $
> - LifeType
> - MediaWiki Sitemap generator
> - mnoGoSearch
> - OS Commerce
> - phpWebSite
> - Plone
> - RapidWeaver
> - Textpattern
> - vBulletin
> - Wikka Wiki (PHP)
> - WordPress
>
> Downloadable Tools
>
> - GSiteCrawler (Windows)
> - GWebCrawler & Sitemap Creator (Windows)
> - G-Mapper (Windows)
> - Inspyder Sitemap Creator (Windows) $
> - IntelliMapper (Windows) $
> - Microsys A1 Sitemap Generator (Windows) $
> - Rage Google Sitemap Automator $ (OS-X)
> - Screaming Frog SEO Spider and Sitemap generator (Windows/Mac) $
> - Site Map Pro (Windows) $
> - Sitemap Writer (Windows) $
> - Sitemap Generator by DevIntelligence (Windows)
> - Sorrowmans Sitemap Tools (Windows)
> - TheSiteMapper (Windows) $
> - Vigos Gsitemap (Windows)
> - Visual SEO Studio (Windows)
> - WebDesignPros Sitemap Generator (Java Webstart Application)
> - Weblight (Windows/Mac) $
> - WonderWebWare Sitemap Generator (Windows)
>
> Online Generators/Services
>
> - AuditMyPc.com Sitemap Generator
> - AutoMapIt
> - Autositemap $
> - Enarion phpSitemapsNG
> - Free Sitemap Generator
> - Neuroticweb.com Sitemap Generator
> - ROR Sitemap Generator
> - ScriptSocket Sitemap Generator
> - SeoUtility Sitemap Generator (Italian)
> - SitemapDoc
> - Sitemapspal
> - SitemapSubmit
> - Smart-IT-Consulting Google Sitemaps XML Validator
> - XML Sitemap Generator
> - XML-Sitemaps Generator
>
> CMS with integrated Sitemap generators
>
> - Concrete5
>
> Google News Sitemap Generators
>
> The following plugins allow publishers to update Google News Sitemap files, a variant of the sitemaps.org protocol that we describe in our Help Center. In addition to the normal properties of Sitemap files, Google News Sitemaps allow publishers to describe the types of content they publish, along with specifying levels of access for individual articles. More information about Google News can be found in our Help Center and Help Forums.
>
> - WordPress Google News plugin
>
> Code Snippets / Libraries
>
> - ASP script
> - Emacs Lisp script
> - Java library
> - Perl script
> - PHP class
> - PHP generator script
>
> If you believe that a tool should be added or removed for a legitimate reason, please leave a comment in the Webmaster Help Forum.

Solution 4 - Web Crawler

The best one I have found is http://www.auditmypc.com/xml-sitemap.asp, which uses Java, has no limit on pages, and even lets you export the results as a raw URL list.

It also uses sessions, so if you are using a CMS, make sure you are logged out before you run the crawl.

Solution 5 - Web Crawler

So, in an ideal world you'd have a spec for all pages in your site. You would also have a test infrastructure that could hit all your pages to test them.

You're presumably not in an ideal world. Why not do this...?

  1. Create a mapping between the well-known old URLs and the new ones. Redirect when you see an old URL. I'd possibly consider presenting a "this page has moved; its new URL is XXX; you'll be redirected shortly" page.

  2. If you have no mapping, present a "sorry - this page has moved. Here's a link to the home page" message, and redirect them if you like.

  3. Log all redirects - especially the ones with no mapping. Over time, add mappings for the pages that are important (see the sketch below).
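As a sketch of steps 2 and 3 combined, again assuming the hypothetical Python/Flask stack from the earlier example (the handler name, log file, and mapping are all placeholders, not part of this answer):

```python
import logging
from flask import Flask, redirect, request

app = Flask(__name__)

OLD_TO_NEW = {"/old/about-us.html": "/about"}  # hypothetical mapping

logging.basicConfig(filename="unmapped_404s.log", level=logging.INFO)

@app.errorhandler(404)
def handle_404(error):
    new_url = OLD_TO_NEW.get(request.path)
    if new_url is not None:
        return redirect(new_url, code=301)
    # No mapping yet: log the path so a mapping can be added over time.
    logging.info("unmapped 404: %s (referrer: %s)",
                 request.path, request.referrer)
    return "Sorry - this page has moved. Here's a link to the home page: /", 404
```

Reviewing the unmapped entries periodically tells you which old pages still get traffic and deserve a proper redirect.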

Solution 6 - Web Crawler

wget from a Linux box might also be a good option, as there are switches to spider a site (`--spider`) and to change its output (`-o logfile`).

EDIT: wget is also available on Windows: http://gnuwin32.sourceforge.net/packages/wget.htm

Solution 7 - Web Crawler

Write a spider which reads every HTML file from disk and outputs every "href" attribute of an "a" element (this can be done with a parser). Keep track of which links belong to which page (a common task for a MultiMap data structure). After this you can produce a mapping file which acts as the input for the 404 handler.
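A sketch of that idea in Python, using the standard library's html.parser; a plain dict of lists stands in for the MultiMap, and the site_dump directory is a placeholder for wherever the saved HTML files live:

```python
import os
from collections import defaultdict
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> element."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# MultiMap: page path -> list of links found on that page.
links_by_page = defaultdict(list)

SITE_ROOT = "site_dump"  # hypothetical directory of saved HTML files
for dirpath, _dirs, files in os.walk(SITE_ROOT):
    for fname in files:
        if fname.endswith((".html", ".htm")):
            path = os.path.join(dirpath, fname)
            parser = LinkCollector()
            with open(path, encoding="utf-8", errors="ignore") as fh:
                parser.feed(fh.read())
            links_by_page[path].extend(parser.links)

# The collected links can then be flattened into the mapping file
# that feeds the 404 handler.
for page, links in links_by_page.items():
    print(page, "->", sorted(set(links)))
```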

Solution 8 - Web Crawler

I would look into any number of online sitemap generation tools. Personally, I've used this one (Java-based) in the past, but if you do a Google search for "sitemap builder" I'm sure you'll find lots of different options.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Oli | View Question on Stackoverflow |
| Solution 1 - Web Crawler | Oli | View Answer on Stackoverflow |
| Solution 2 - Web Crawler | alamar | View Answer on Stackoverflow |
| Solution 3 - Web Crawler | Franck Dernoncourt | View Answer on Stackoverflow |
| Solution 4 - Web Crawler | Collins | View Answer on Stackoverflow |
| Solution 5 - Web Crawler | Martin Peck | View Answer on Stackoverflow |
| Solution 6 - Web Crawler | Thomas Schultz | View Answer on Stackoverflow |
| Solution 7 - Web Crawler | Mork0075 | View Answer on Stackoverflow |
| Solution 8 - Web Crawler | Eric Petroelje | View Answer on Stackoverflow |