Sitemap.xml Support
The Site Search Crawler supports the Sitemap XML format. Refer to the Sitemaps documentation for the required and optional elements, character escaping, and other technical considerations and examples.
Using a sitemap can significantly speed up the crawl.
Instead of examining each page for new links to follow, the crawler reads the URLs listed in your sitemap file(s) and downloads them directly.
The Sitemap Format
The sitemap XML format specifies a list of URLs to index:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourdomain.com/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/faq/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/about/</loc>
  </url>
</urlset>
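To check what a urlset file exposes before pointing the crawler at it, something like the following may help. It is a minimal sketch using Python's standard library; the function name `extract_urls` is hypothetical and this is not the crawler's own parser.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Sitemap XML format (taken from the example above).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a urlset document.

    Hypothetical helper for illustration only.
    """
    # Encode to bytes so the XML declaration's encoding attribute is honored.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


if __name__ == "__main__":
    sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.yourdomain.com/</loc></url>
  <url><loc>http://www.yourdomain.com/faq/</loc></url>
</urlset>"""
    for url in extract_urls(sample):
        print(url)
```

Entries with an empty `<loc>` are skipped rather than reported, which keeps the sketch short; a stricter check might raise an error instead.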
A sitemap file can also link to a list of other sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yoursite.com/sitemap1.xml</loc>
    <lastmod>2012-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yoursite.com/sitemap2.xml</loc>
    <lastmod>2012-01-01</lastmod>
  </sitemap>
</sitemapindex>
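The two document types differ only in their root element, so a tool that reads sitemaps can branch on it. The sketch below assumes that approach; `read_sitemap` is a hypothetical name, and how the Site Search Crawler handles this internally is not documented here.

```python
import xml.etree.ElementTree as ET

# ElementTree writes namespaced tags as "{namespace}tag".
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def read_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Classify a sitemap document and return its <loc> values.

    Returns ("index", [...child sitemap URLs...]) for a <sitemapindex>,
    or ("urlset", [...page URLs...]) for a <urlset>.
    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag == NS + "sitemapindex":
        kind, path = "index", f"{NS}sitemap/{NS}loc"
    elif root.tag == NS + "urlset":
        kind, path = "urlset", f"{NS}url/{NS}loc"
    else:
        raise ValueError(f"not a sitemap document: root element is {root.tag!r}")
    locs = [
        loc.text.strip()
        for loc in root.findall(path)
        if loc.text and loc.text.strip()
    ]
    return kind, locs
```

A caller would fetch each URL returned for an "index" document and pass the response body back through `read_sitemap` to reach the page URLs.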
For full details, review the Sitemaps documentation.
Installing Your Sitemap
The Site Search Crawler supports specifying Sitemap files in your robots.txt file.
Example /robots.txt file with multiple Sitemap URLs:
User-agent: *
Sitemap: http://www.yourdomain.com/sitemap1.xml
Sitemap: http://www.yourdomain.com/sitemap2.xml
If no Sitemap files are found in the robots.txt file, the crawler will try to find one at /sitemap.xml.
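The lookup order described above (Sitemap entries in robots.txt first, then /sitemap.xml as a fallback) can be sketched as follows. This is an illustration under assumptions, not the crawler's code: `discover_sitemaps` is a hypothetical name, and the handling of comments and letter case is a guess at reasonable behavior.

```python
from urllib.parse import urljoin


def discover_sitemaps(robots_txt: str, site_root: str) -> list[str]:
    """Return sitemap URLs declared in robots.txt, else the default location.

    Hypothetical helper for illustration only.
    """
    found = []
    for raw_line in robots_txt.splitlines():
        # Assumption: anything after '#' is a comment and is ignored.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first ':' only, so the URL's own "http:" stays intact.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    # Fall back to /sitemap.xml at the site root when nothing is declared.
    return found or [urljoin(site_root, "/sitemap.xml")]


if __name__ == "__main__":
    robots = """User-agent: *
Sitemap: http://www.yourdomain.com/sitemap1.xml
Sitemap: http://www.yourdomain.com/sitemap2.xml
"""
    print(discover_sitemaps(robots, "http://www.yourdomain.com/"))
    print(discover_sitemaps("User-agent: *\n", "http://www.yourdomain.com/"))
```

Python 3.8 and later also provide `urllib.robotparser.RobotFileParser.site_maps()`, which may be preferable when the rest of the robots.txt rules need parsing too.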
Unsupported Features
Site Search does not currently support:
- Pinging to notify the crawler that a Sitemap exists.
- Page priority (the <priority> element).
- Last modification date (the <lastmod> element).
- Refresh frequency (the <changefreq> element).