
Crawler Optimization

How can I make crawls faster?

Out of the box, the Site Search Crawler will crawl and index most websites with great speed.

For the fastest, most concise, and efficient crawl, you can use a sitemap.

On Effective Sitemaps

A sitemap is what it sounds like: a map of a website.

You may already have one, or can create one with minimal effort.

A sitemap is written in Extensible Markup Language (XML) and looks like this:

<urlset>
  <url>
    <loc>
      https://swiftype.com/documentation/site-search/
    </loc>
  </url>
  <url>
    <loc>
      https://swiftype.com/documentation/site-search/guides/search-optimization
    </loc>
  </url>
</urlset>

Above is a trimmed version of the swiftype.com documentation sitemap.

It contains a <urlset> that lists a URL location (<loc>) for each individual page within the documentation.

These are the pages that we crawl to fuel our own documentation search.
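A sitemap like the one above is simple enough to generate programmatically. The following is a minimal sketch using only Python's standard library, with the documentation URLs from the example; note that a production sitemap would typically also declare the sitemaps.org XML namespace, which the trimmed example omits:

```python
# Minimal sketch: build a sitemap like the one above using only
# Python's standard library. A production sitemap would usually also
# declare the sitemaps.org namespace on <urlset>.
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    """Return a sitemap XML string for the given page URLs."""
    urlset = ET.Element("urlset")
    for page in urls:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = page
    return ET.tostring(urlset, encoding="unicode")

pages = [
    "https://swiftype.com/documentation/site-search/",
    "https://swiftype.com/documentation/site-search/guides/search-optimization",
]
print(build_sitemap(pages))
```

Regenerating the sitemap from your list of published pages, rather than editing it by hand, keeps it accurate as content changes.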

The Site Search Crawler will look for a sitemap at the default location and filename:

https://example.com/sitemap.xml

Sometimes a sitemap might be in a non-default location, or have a unique filename...

https://example.com/special_directory/sitemap.xml
https://example.com/different_sitemap_name.xml

No matter the location, you can make the crawler aware of it by altering your robots.txt file:

User-agent: Swiftbot
Sitemap: https://example.com/special_directory/sitemap.xml
Sitemap: https://example.com/different_sitemap_name.xml

The crawler will use the robots.txt file to follow this map and index the pages that it finds.

It will then follow the links within each page it discovers, until it has crawled the entire interlinked surface area of the website.
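You can verify that a robots.txt file advertises its sitemaps correctly before the crawler visits. A quick sketch with Python's standard urllib.robotparser, parsing the example robots.txt content from above:

```python
# Quick check that a robots.txt advertises sitemap locations, using
# Python's standard library. Here we parse the example content
# directly; robotparser can also fetch robots.txt from a live site.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Swiftbot
Sitemap: https://example.com/special_directory/sitemap.xml
Sitemap: https://example.com/different_sitemap_name.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() (Python 3.8+) returns the advertised sitemap URLs, or None.
print(parser.site_maps())
```

If `site_maps()` returns `None`, the crawler found no `Sitemap:` directives and will fall back to link-following and the default sitemap location.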

Given the crawler's natural inclination to follow links, the default crawl might include too many pages, bogging down performance.

It is this behavior that we would like to limit.

You can both define sitemap locations and provide restrictions using Advanced Settings...

Advanced Settings are only available within Pro or Premium plans.
Read Advanced Settings documentation.
Optimization - A list of websites within the Domains section of the Site Search dashboard, with Advanced Settings selected.
After clicking on domains, a list of your managed domains appears. Clicking the manage drop down reveals four options: recrawl, manage crawl rules, advanced settings, and delete. Advanced settings is selected.

There are two different aspects of Advanced Settings:

  1. The Global Settings view, which is shared among all domains.
  2. The Domain specific view, which allows you to configure each domain by selecting it via the dropdown menu:
Optimization - The Advanced Settings, which aren't too complicated, really.
The advanced settings page, with no changes.

You can control whether crawling is manual or automatic, and whether it is restricted to sitemaps.

Recall how we specified a Sitemap: directive within a robots.txt file above.

We can do a similar thing within Advanced Settings:

  1. Add custom sitemap URLs for each domain.
  2. Activate the Restrict Crawling to Sitemaps switch.
Optimization - Crawling restricted to only custom sitemaps.
The advanced settings page, the Restrict Crawling to Sitemaps switch is active.

Now the crawler will:

  1. Visit the website.
  2. Look for and find the sitemap as instructed.
  3. Index and crawl only the URLs that are listed.
    • It will not follow or index any additional URLs.
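The restricted behavior above can be illustrated in a few lines: parse the sitemap, treat the listed <loc> URLs as the complete crawl scope, and ignore everything else. This is a sketch of the idea, not Swiftbot's actual implementation, and it assumes a namespace-free sitemap like the trimmed example:

```python
# Illustrative sketch of sitemap-restricted crawling: only URLs listed
# in the sitemap's <loc> elements are in scope; links discovered on the
# pages themselves are ignored. Not Swiftbot's actual implementation.
import xml.etree.ElementTree as ET

sitemap_xml = """\
<urlset>
  <url><loc>https://example.com/docs/</loc></url>
  <url><loc>https://example.com/docs/guides/</loc></url>
</urlset>
"""

def sitemap_urls(xml_text):
    """Extract the listed page URLs from a (namespace-free) sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter("loc")]

allowed = set(sitemap_urls(sitemap_xml))

def should_index(url):
    """Under restricted crawling, index a URL only if the sitemap lists it."""
    return url in allowed

print(should_index("https://example.com/docs/"))        # listed in the sitemap
print(should_index("https://example.com/blog/launch"))  # discovered link, skipped
```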

With this style of crawling, it is up to you to maintain concise and accurate sitemaps.

Doing so will put the crawler on a smaller circuit than if it were to crawl all available webpages.

For you, that means speedier crawls and a more accurate search engine.

Remember: a sitemap is not a direct representation of your Engine contents.

Removing a URL from your sitemap will not remove the corresponding document from the Engine, nor stop the crawler from being aware of it.


Stuck? Looking for help? Contact support or check out the Site Search community forum!