
Crawler Optimization

Out of the box, the Site Search Crawler will crawl and index most websites with great speed. But depending on how a website is configured, you might run into snags that require custom configuration or troubleshooting.

For the fastest, most concise, and most efficient crawl, you can use a sitemap.

tl;dr Curate accurate sitemaps and restrict crawling to those sitemaps for the most expedient crawl.

On Effective Sitemaps

A sitemap is what it sounds like: a map of your website.

Depending on how you are hosting and building your website, it is likely that you already have one or can create one with minimal effort.

A sitemap is written in eXtensible Markup Language (XML) and looks like this:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>
      https://swiftype.com/documentation/site-search/
    </loc>
  </url>
  <url>
    <loc>
      https://swiftype.com/documentation/site-search/guides/search-optimization
    </loc>
  </url>
</urlset>

Above is a trimmed version of the actual sitemap of this documentation, which you can see in full here: https://swiftype.com/documentation/sitemap.xml. It contains a <urlset> and lists URL locations for each individual page within the documentation. These are the pages that we crawl to fuel our own search.
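
If your site does not already produce a sitemap, generating one takes little effort. Below is a minimal sketch in Python (standard library only) that writes a sitemap.xml from a hard-coded list of page URLs; the URLs are copied from the example above, so substitute your own pages.

import xml.etree.ElementTree as ET

# Pages to include in the sitemap -- replace with your own URLs.
pages = [
    "https://swiftype.com/documentation/site-search/",
    "https://swiftype.com/documentation/site-search/guides/search-optimization",
]

# A valid sitemap declares the sitemaps.org namespace on its <urlset>.
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page

# Write the file with an XML declaration, ready to serve at /sitemap.xml.
ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)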

The Site Search Crawler will look for a sitemap at the default location: https://example.com/sitemap.xml. If a sitemap is at a non-standard location, you can add a Sitemap: directive to your robots.txt file that points to the location of each of your sitemaps:

User-agent: Swiftbot
Sitemap: https://swiftype.com/documentation/sitemap.xml
Sitemap: https://swiftype.com/documentation/possible_second_sitemap.xml
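
Before relying on robots.txt to advertise a sitemap, it is worth confirming that both the robots.txt file and the sitemap are reachable. Below is a minimal sketch in Python (standard library only); the domain is a placeholder, and the Swiftbot user agent matches the one addressed in the robots.txt example above.

from urllib.error import HTTPError
from urllib.request import Request, urlopen

SITE = "https://example.com"  # placeholder; substitute your own domain

# Fetch robots.txt and the default sitemap location with the crawler's
# user agent, and print the HTTP status for each.
for path in ("/robots.txt", "/sitemap.xml"):
    req = Request(SITE + path, headers={"User-Agent": "Swiftbot"})
    try:
        with urlopen(req) as resp:
            print(path, resp.status)
    except HTTPError as err:
        print(path, err.code)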

The crawler will follow this map and index the pages that it finds, following the links within those pages until it has crawled the entire surface area of the website. Given the crawler's natural inclination to follow links, the default crawl might include too many pages, reaching into content that you might not want indexed into your Site Search Engine.

A robots.txt file can point to sitemaps in non-standard locations, but it won't restrict crawling to the pages they list. Instead, you can both define sitemap locations and restrict crawling within the Site Search dashboard using Advanced Settings.

Advanced Settings are only available on Pro or Premium plans. Read the Advanced Settings documentation.
Optimization - A list of websites within the Domains section of the Site Search dashboard, with Advanced Settings selected.
After clicking on Domains, a list of your managed domains appears. Clicking the Manage dropdown reveals four options: Recrawl, Manage Crawl Rules, Advanced Settings, and Delete. Advanced Settings is selected.

There are two different aspects of Advanced Settings: the Global Settings view, which is shared among all domains, and the domain-specific view, which allows you to configure each domain by selecting it from the dropdown menu:

Optimization - The Advanced Settings, which aren't too complicated, really.
The advanced settings page, with no changes.

Within this view you control whether crawling is manual or automatic, and whether or not it is restricted to sitemaps. Recall how we specified Sitemap: within a robots.txt file above. You can do the same thing within Advanced Settings: add your custom sitemap URLs for each domain, then activate the Restrict Crawling to Sitemaps switch:

Optimization - Crawling restricted to only custom sitemaps.
The advanced settings page, the Restrict Crawling to Sitemaps switch is active.

Now the crawler will think like this:

Visit the website, look for the sitemap. Find the sitemap, follow it, index and crawl only the URLs that are listed, without following or indexing any additional URLs.

Remember that a sitemap is not a direct representation of your Engine's contents: removing a document from your sitemap will not remove that document from your Engine.

With this style of crawling, it is up to you to maintain concise and accurate sitemaps. But doing so will put the crawler on a much smaller circuit than if it were to crawl and follow all of your available pages. For you, that means speedier crawls and a more accurate search engine.
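
One way to keep a sitemap concise and accurate is to audit it periodically. The sketch below (Python, standard library only, with a placeholder sitemap URL) fetches a sitemap, assumes the standard sitemaps.org namespace, and reports any listed URLs that no longer respond with HTTP 200 so they can be pruned before the next crawl.

import xml.etree.ElementTree as ET
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Download and parse the sitemap.
with urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# HEAD-request every <loc> entry and report anything that isn't a 200.
for loc in tree.findall(".//sm:loc", NS):
    url = loc.text.strip()
    try:
        status = urlopen(Request(url, method="HEAD")).status
    except (HTTPError, URLError) as err:
        status = getattr(err, "code", err.reason)
    if status != 200:
        print(f"{url} -> {status}")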


Stuck? Looking for help? Contact support or check out the Site Search community forum!