
Site Search Crawler Overview

The easiest way to get started with Site Search is to let the Site Search Crawler 'crawl' your content. A crawler, or web crawler, is a robust application that scans the content and structure of publicly available webpages.

Swiftbot is a high-performance web crawler that will quickly index your webpages. In doing so, it will fill your Site Search Engine with documents, allowing you to then use that Engine to provide a robust and useful search experience.

There are other crawlers out in the wild, too: Google, DuckDuckGo, Bing, Yahoo -- every major search engine dispatches an intelligent crawler to build its rankings. You now have control over a crawler of your own.

This overview covers Engine types, crawler configuration, crawling multiple domains, crawl frequency, API control, and the page schema.

What is a Crawler-based Engine?

There are two possible Engine types within Site Search:

  1. API-based
  2. Crawler-based

If you use the crawler, then you have a Crawler-based Engine.

If your Engine was created as an API-based Engine during your initial signup, or created via the Engines API, then it is an API-based Engine -- you will know if that was the case!

If you are unsure, assume you are using a Crawler-based Engine.

Crawler Configuration

By default, the crawler will crawl each available page within your domain. In many cases, you will want to specify what will be crawled, what will be indexed, and more.

To control how your content is indexed, the crawler supports Meta Tags and Content Curation Rules.

The crawler also processes and respects the rules of your domain's robots.txt file, sitemap.xml, and RSS or Atom feeds. In addition, within your Site Search dashboard, you can use the Manage Path Rules page to specify the portions of your website that should or should not be crawled.
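
Because the crawler respects robots.txt, you can predict which paths it will visit before a crawl runs. The sketch below uses Python's standard-library robots.txt parser to check what a crawler identifying itself as Swiftbot would be allowed to fetch; the robots.txt contents and URLs are illustrative examples, not taken from a real site.

```python
# Sketch: previewing which paths a crawler such as Swiftbot may fetch,
# using Python's standard-library robots.txt parser.
import urllib.robotparser

# Illustrative robots.txt: a default group plus a Swiftbot-specific group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Swiftbot
Disallow: /drafts/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Swiftbot matches its own rule group, so the generic /private/ rule
# does not apply to it -- but its own /drafts/ rule does.
print(parser.can_fetch("Swiftbot", "https://example.com/private/x"))  # True
print(parser.can_fetch("Swiftbot", "https://example.com/drafts/x"))   # False
```

Note that a named user-agent group replaces the `*` group for that agent rather than adding to it, which is why `/private/` becomes fetchable for Swiftbot in this example.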

For more information, see Crawler Configuration.

Crawling Multiple Domains

The crawler will not cross domains while indexing content -- including subdomains! If you would like to index multiple domains, add them when first creating the Engine, or from the Manage Domains page on the Dashboard. You can also use the Crawler Operations endpoint.

Crawler Frequency

Site Search will re-index your content periodically. You can force a re-crawl by clicking the Recrawl button on the Domains page of the dashboard. Or, you can use the Crawler Operations endpoint.
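
Forcing a re-crawl from the API can be sketched as follows. The endpoint path and parameter name below are assumptions based on common Site Search API conventions -- check the Crawler Operations documentation for the exact shape -- and the engine slug and API key are placeholders.

```python
# Sketch: constructing a re-crawl request for the Crawler Operations
# endpoint. Endpoint path and auth parameter are assumptions; ENGINE
# and API_KEY are placeholders.
import json
import urllib.request

ENGINE = "my-engine"        # placeholder: your Engine's slug
API_KEY = "YOUR_API_KEY"    # placeholder: your private API key

url = f"https://api.swiftype.com/api/v1/engines/{ENGINE}/recrawl.json"
request = urllib.request.Request(
    url,
    data=json.dumps({"auth_token": API_KEY}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)

# The request is only constructed here, not sent. To send it:
#     urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```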

API Control

Crawler-based Engines can be controlled through a variety of robust and useful API endpoints. The one key difference is that a Crawler-based Engine indexes documents via the Crawler, while an API-based Engine indexes them via the Document Indexing API endpoint.

As such, to control how the Crawler behaves, see the Crawler Operations endpoint documentation.

Endpoint            Supported?
Document Indexing   No
Crawler Operations  Yes
Engines             Engines are account level.
Search              Yes
Autocomplete        Yes
Analytics           Yes

Page Schema

When your pages are crawled, the content is indexed according to a schema. When a page is indexed, a DocumentType called page is created by default, with the following schema:

Field         Type     Description
external_id   enum     For Crawler-based Engines, the hexadecimal MD5 digest of the normalized URL of the page.
updated_at    date     The date when the page was last indexed.
title         string   The title of the page, taken from the <title> tag or the title meta tag.
url           enum     The URL of the page.
sections      string   Sections of the page, determined by <h1>, <h2>, and <h3> tags or the sections field.
body          text     The text of the page.
type          enum     The page type (set by the type meta tag).
image         enum     A URL for an image associated with the page (set by the image meta tag), used as a thumbnail in your search result listing if present.
published_at  date     The date the page was published. It can be set with the published_at meta tag. If not specified, it defaults to the time when the page was crawled, which might not be useful for result sorting.
popularity    integer  The popularity score for a page. Specialized crawlers for content management systems like Tumblr may use this field, or it can be set with the popularity meta tag and used to change search result rankings with functional boosts. If not specified, the default value is 1.
info          string   Additional information about the page, returned with the results (set by the info meta tag).
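
The external_id field is described as the hexadecimal MD5 digest of the normalized URL. As a rough illustration, the digest can be computed as below; the normalize() step is a guess at what URL normalization might involve (the crawler's actual rules are not documented here and may differ).

```python
# Sketch: computing a hexadecimal MD5 digest of a URL, in the style of
# the external_id field. normalize() is illustrative only.
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase the scheme and host and drop any fragment (illustrative)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def external_id(url: str) -> str:
    """Hexadecimal MD5 digest of the normalized URL."""
    return hashlib.md5(normalize(url).encode("utf-8")).hexdigest()

# Under this normalization, case of the host and the #fragment do not
# change the digest, so both spellings map to the same document ID.
print(external_id("https://Example.com/docs#intro"))
print(external_id("https://example.com/docs") ==
      external_id("https://Example.com/docs#intro"))  # True
```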

Read more about Field Types and DocumentTypes.

You may use these field names to control which fields are returned with fetch_fields, to apply field boosts with functional_boosts, or to filter results with filters. See the search documentation for details.
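
Composing a search request that uses these schema fields can be sketched as follows. The nested parameter shapes (fetch_fields[page][], filters[page][...], functional_boosts[page][...]) follow common Site Search search API conventions, but treat the exact names and the endpoint URL as assumptions to verify against the search documentation; the engine key is a placeholder.

```python
# Sketch: building a search query string over the page schema fields.
# Parameter shapes and endpoint are assumptions; engine key is a
# placeholder.
from urllib.parse import urlencode

params = [
    ("engine_key", "YOUR_ENGINE_KEY"),          # placeholder
    ("q", "crawler"),
    ("fetch_fields[page][]", "title"),          # return only these fields
    ("fetch_fields[page][]", "url"),
    ("filters[page][type]", "blog"),            # filter on the type field
    ("functional_boosts[page][popularity]", "logarithmic"),  # boost by popularity
]

query = urlencode(params)
url = "https://api.swiftype.com/api/v1/public/engines/search.json?" + query
print(url)
```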

Stuck? Looking for help? Contact support or check out the Site Search forum!