Out of the box, the Site Search Crawler will crawl and index most websites with great speed. But depending on how the website is configured, one might run into snags that require custom configurations or troubleshooting.
For the fastest and most efficient crawl, use a sitemap.
tl;dr: Curate accurate sitemaps and restrict crawling to those sitemaps for the most expedient crawl.
On Effective Sitemaps
A sitemap is what it sounds like: a map of your website.
Depending on how you are hosting and building your website, it is likely that you already have one or can create one with minimal effort.
A sitemap is written in eXtensible Markup Language (XML) and looks like this:
<urlset>
  <url>
    <loc>https://swiftype.com/documentation/site-search/</loc>
  </url>
  <url>
    <loc>https://swiftype.com/documentation/site-search/guides/search-optimization</loc>
  </url>
</urlset>
Above is a trimmed version of the actual sitemap of this documentation, which you can see in full here: https://swiftype.com/documentation/sitemap.xml. It contains a <urlset> and lists URL locations for each individual page within the documentation. These are the pages that we crawl to fuel our own search.
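A minimal <urlset> like the one above is enough to follow, but a complete sitemap per the sitemaps.org protocol also declares the XML namespace and may include optional fields such as <lastmod>. Here is a sketch of a fuller entry; the date shown is purely illustrative:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://swiftype.com/documentation/site-search/</loc>
    <lastmod>2018-01-01</lastmod>
  </url>
</urlset>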
The Site Search Crawler will look for a sitemap at the default location: https://example.com/sitemap.xml. If a sitemap is at a non-standard location, you can place a Sitemap: directive within your robots.txt file that points to the location of your various sitemaps:
User-agent: Swiftbot
Sitemap: https://swiftype.com/documentation/sitemap.xml
Sitemap: https://swiftype.com/documentation/possible_second_sitemap.xml
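If you maintain many sitemaps, the sitemaps.org protocol also defines a sitemap index file that groups several sitemaps under a single URL, so one Sitemap: line can stand in for all of them. A sketch of such an index; the second entry mirrors the hypothetical second sitemap above:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://swiftype.com/documentation/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://swiftype.com/documentation/possible_second_sitemap.xml</loc>
  </sitemap>
</sitemapindex>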
The crawler will follow this map, indexing the pages that it finds and following the links within those pages, until it has crawled the entire surface area of the website. Given the crawler's natural inclination to follow links, the default crawl might include too many pages, reaching into content that you might not want indexed within your Site Search Engine.
A robots.txt file can be used to point to sitemaps in distant locations, but it will not restrict crawling to those pages. Instead, you can both define sitemap locations and provide restrictions within the Site Search dashboard using Advanced Settings.
There are two aspects of Advanced Settings: the Global Settings view, which is shared among all domains, and the domain-specific view, which allows you to configure each domain by selecting it from the dropdown menu.
Within this view you control whether crawling is manual or automatic, and whether or not it is restricted to sitemaps. Consider above where we specified Sitemap: within a robots.txt file. You can do the same thing within Advanced Settings: add your custom sitemap URLs for each domain, then activate the Restrict Crawling to Sitemaps switch.
Now the crawler will behave as follows: visit the website and look for the sitemap; upon finding the sitemap, follow it, crawling and indexing only the URLs that are listed, without following or indexing any additional URLs.
Remember that a sitemap is not a direct representation of your Engine's contents: removing a document from your sitemap will not remove that document from your Engine.
With this style of crawling, it is up to you to maintain concise and accurate sitemaps. But doing so will put the crawler on a much smaller circuit than if it were to crawl and follow all of your available pages. For you, that means speedier crawls and a more accurate search engine.