
Crawler Features

To best ingest and index your webpages, the Site Search Crawler supports and abides by many different features of the modern web:

Content Inclusion & Exclusion

We have instructions on how you can include or exclude crawled content from your Search Engine:

Whitelist: Content Inclusion

Your webpages may include content that you do want ingested and content that you do not want ingested. In these scenarios, you can use the content inclusion feature to whitelist elements of your pages for indexing.

For example, if you would like to index a single content section of every page, you can set data-swiftype-index=true on its element and the crawler will only extract text from that element.

<body>

  This is content that will not be indexed by the Swiftype crawler.

  <div data-swiftype-index='true'>
    <p>
      All content under the above div tag will be indexed.
    </p>
    <p>
      Content in this paragraph tag will be included in the search index!
    </p>
  </div>

  This content will not be indexed, since it isn't surrounded by an include tag.

</body>

Blacklist: Content Exclusion

You may want to keep certain content from being indexed in your site search engine -- for example, your site header, footer, or menu bar. You can tell the Swiftype crawler to exclude these elements by adding the data-swiftype-index=false attribute to any element, as illustrated below.

<body>

  This is your page content, which will be indexed by the Swiftype crawler.

  <p data-swiftype-index='false'>
    Content in this paragraph tag will be excluded from the search index!
  </p>

  This content will be indexed, since it isn't surrounded by an excluded tag.

  <div id='footer' data-swiftype-index='false'>
    This footer content will be excluded as well.
  </div>

</body>

Nested Rules

Whitelist and Blacklist rules, when nested, will work as you might expect. If there are multiple rules present on the page, all text will inherit behavior from the nearest parent element that contains a rule. This way, you will be able to include and exclude elements within each other.

For example, if the first rule on the page is data-swiftype-index=true applied to a child element, any text outside that element will not be indexed; within that element, nested rules can exclude and re-include their own subtrees, as shown below.

<body>

  This is content that will not be indexed since the first rule is true.

  <div data-swiftype-index='true'>
    <p>
      All content under the above div tag will be indexed.
    </p>
    <p>
      Content in this paragraph tag will be included in the search index!
    </p>
    <p data-swiftype-index='false'>
      Content in this paragraph will be excluded because of the nested rule.
    </p>
    <span data-swiftype-index="false">
      <p>
        Content in this paragraph will be excluded because the parent span is false.
      </p>
      <p data-swiftype-index="true">
        Content in this paragraph will be INCLUDED because its own inclusion rule overrides the parent span.
      </p>
    </span>
  </div>

</body>

Content Inclusion & Exclusion, URL Paths

Whitelist and blacklist rules allow you to tell the Site Search Crawler to include or exclude parts of your domain when crawling. To configure these rules, visit the Manage Domain page of your Site Search dashboard.

As you type a whitelist or blacklist rule, you will see a sample of URLs that will be affected.

Path rule example



Whitelist - Including only certain paths

Whitelist rules allow you to specify which parts of your domain you want the Site Search Crawler to ingest. If you add rules to the whitelist, the Site Search Crawler will only include the parts of your domain that match these rules. Otherwise, the crawler will include every page on your domain that is not excluded by a blacklist rule or your robots.txt file.

Whitelist Options

Option | Description | Example
begin with | Include URLs that begin with this text. | Setting this to /doc would include only paths like /documents and /doctors, and would exclude paths like /down or /help unless another whitelist rule includes them.
contain | Include URLs that contain this text. | Setting this to doc would include paths like /example/docs/ and /my-doctor.
end with | Include URLs that end with this text. | Setting this to docs would include paths like /example/docs and /docs but exclude paths like /docs/example.
match regex | Include URLs that match a regular expression. Advanced users only. | Setting this to /archives/\d+/\d+ would include paths like /archives/2012/07 and /archives/123/9 but exclude paths like /archives/december-2009.

Blacklist - Excluding certain paths

Blacklist rules allow you to tell the Site Search Crawler not to index parts of your domain. Blacklist rules are applied to the paths that your whitelist rules allow; if the whitelist is empty, every page on the domain is eligible for crawling. Any path that matches a blacklist rule will not be crawled. A combined illustration follows the options table below.

Blacklist Options

Option | Description | Example
begin with | Exclude URLs that begin with this text. | Setting this to /doc would exclude paths like /documents and /docs/examples but would allow paths like /down.
contain | Exclude URLs that contain this text. | Setting this to doc would exclude paths like /example/docs/ and /my-doctor.
end with | Exclude URLs that end with this text. | Setting this to docs would exclude paths like /example/docs and /docs but allow paths like /docs/example.
match regex | Exclude URLs that match a regular expression. Advanced users only. | Setting this to /archives/\d+/\d+ would exclude paths like /archives/2012/07 but allow paths like /archives/december-2009. Be careful with regex exclusions: it is easy to exclude more than you intended.
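
As an illustration of how the two rule types combine (the paths below are hypothetical, not taken from your site), suppose the whitelist contains a begin with rule of /docs and the blacklist contains a contain rule of private:

/docs/getting-started     crawled (matches the whitelist; no blacklist match)
/docs/private/roadmap     not crawled (matches the blacklist)
/blog/announcement        not crawled (matches no whitelist rule)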

Site Search Meta Tags

The Site Search Crawler supports a very flexible set of <meta>-tags that allow you to deliver structured information to our web crawler. When our crawler visits your webpage, by default, we extract a standard set of fields (e.g. title, body) that are then indexed for searching in your search engine.

With these <meta>-tags, you can augment - or completely alter - the set of fields our crawler extracts in order to better fit the data you wish to be indexed on your website.

We have documentation on:

Meta Tags

Each field we extract must meet specific structure guidelines, with defined name, type, and content values. The field type, which is specified in the data-type attribute, must be a valid, Site Search-supported field type, which you may read about here.

The template for a Site Search Meta-Tag is as follows, and should be placed within the <head> of your webpage:

<meta class="swiftype" name="[field name]" data-type="[field type]" content="[field content]" />
Note: Choose your field's data-type carefully! Once a new meta-tag has been indexed and the custom field created, its data-type cannot be changed.


This is a more complex example, showing the creation of multiple fields. Note that the tags field is repeated; as a result, our crawler extracts an array of tags for this URL. All field types can be extracted as arrays.

<head>
  <title>page title | website name</title>
  <meta class="swiftype" name="title" data-type="string" content="page title" />
  <meta class="swiftype" name="body" data-type="text" content="this is the body content" />
  <meta class="swiftype" name="url" data-type="enum" content="http://www.swiftype.com" />
  <meta class="swiftype" name="price" data-type="float" content="3.99" />
  <meta class="swiftype" name="quantity" data-type="integer" content="12" />
  <meta class="swiftype" name="published_at" data-type="date" content="2013-10-31" />
  <meta class="swiftype" name="store_location" data-type="location" content="20,-10" />
  <meta class="swiftype" name="tags" data-type="string" content="tag1" />
  <meta class="swiftype" name="tags" data-type="string" content="tag2" />
</head>

Body-embedded Data Attribute Tags

If you would rather not duplicate large amounts of content in the <head> of your page, you can instead add data attributes to existing elements in the <body>:

<body>
  <h1 data-swiftype-name="title" data-swiftype-type="string">title here</h1>
  <div data-swiftype-name="body" data-swiftype-type="text">
    Lots of body content goes here...
    Other content goes here too, and can be of any type, like a price:
    $<span data-swiftype-name="price" data-swiftype-type="float">3.99</span>
  </div>
</body>

Upgrade your old Meta Tags

When Site Search originally launched, we supported a small set of Meta Tags that were intended for very specific use-cases. We will continue to support those tags if you already have them on your website, but we do not recommend using them for new projects. It would be a wise decision to upgrade.

For each original-style meta tag, simply replace the tag with an equivalent tag using the new format. In each example below, the first line shows an example original-style tag and the second line shows the correct replacement.

st:title
<!-- old, deprecated style -->
<meta property='st:title' content='[title value]' />

<!-- new, correct style -->
<meta class='swiftype' name='title' data-type='string' content='[title value]' />
st:section
<!-- old, deprecated style -->
<meta property='st:section' content='[section value]' />

<!-- new, correct style -->
<meta class='swiftype' name='sections' data-type='string' content='[sections field value]' />
st:image
<!-- old, deprecated style -->
<meta property='st:image' content='[image url]' />

<!-- new, correct style -->
<meta class='swiftype' name='image' data-type='enum' content='[image url]' />
st:type
<!-- old, deprecated style -->
<meta property='st:type' content='[type value]' />

<!-- new, correct style -->
<meta class='swiftype' name='type' data-type='enum' content='[type value]' />
st:info
<!-- old, deprecated style -->
<meta property='st:info' content='[info value]' />

<!-- new, correct style -->
<meta class='swiftype' name='info' data-type='string' content='[info value]' />
st:published_at
<!-- old, deprecated style -->
<meta property='st:published_at' content='[published_at date]' />

<!-- new, correct style -->
<meta class='swiftype' name='published_at' data-type='date' content='[published_at date]' />
st:popularity
<!-- old, deprecated style -->
<meta property='st:popularity' content='[popularity value]' />

<!-- new, correct style -->
<meta class='swiftype' name='popularity' data-type='integer' content='[popularity value]' />

robots.txt Support

The Site Search Crawler supports the features of the robots.txt file standard and will respect all rules issued to our User-agent. Among other uses, the robots.txt file is a good way to exclude certain portions of your site from your Site Search Engine.

The Swiftype Crawler's User-agent is: Swiftbot.


If you would like your robots.txt file rules to apply only to the Site Search Crawler, you should specify the Swiftbot User-agent in your file, as shown in the Disallow example below. We will also respect rules specified under the wildcard User-agent.

Example - robots.txt file disallowing the Site Search Crawler from indexing any content under the /mobile path.
User-agent: Swiftbot
Disallow: /mobile/

If the wildcard User-agent (*) has a Disallow rule covering your site, your site will not be crawled. If you would like to allow only Swiftbot to crawl your site, give Swiftbot its own blank Disallow rule, as shown in the example below.

Example - robots.txt file allowing the Swiftbot while disallowing all other User-agents.
User-agent: Swiftbot
Disallow:

User-agent: *
Disallow: /

You can also control the rate at which the crawler accesses your website by using the Crawl-delay directive, which expects a number of seconds to wait between requests. A delay of 5 seconds allows at most 17,280 fetches per day (86,400 seconds in a day divided by 5). Each fetch is web traffic, so a crawl delay can reduce bandwidth; too large a delay, however, slows the uptake of new documents.

Example - robots.txt file with a Crawl-delay of 5 seconds.
User-agent: Swiftbot
Crawl-delay: 5

For fine-grained control over how your pages are indexed, you may use robots Meta Tags. Read more about that below!

Robots Meta Tag Support

The Site Search Crawler supports the robots meta-tag standard. This allows you to control how the Site Search Crawler indexes the pages on your site.

We have instructions on:

Using the "robots" meta-tag

Place the robots meta-tag in the <head> section of your page:

Example - Place the robots meta tag in the head section
<!doctype html>
<html>
  <head>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>
    Page content here
  </body>
</html>

Robots meta tag content values

Site Search supports the NOFOLLOW, NOINDEX, and NONE values for the robots tag. FOLLOW and INDEX are the defaults and are not necessary unless you are overriding a robots meta tag for Swiftype (see below). Other values - such as NOARCHIVE - are ignored.

To tell the Site Search Crawler not to index a page, use NOINDEX:

<meta name="robots" content="noindex">

Links from an unindexed page will still be followed.

To tell the Site Search Crawler not to follow links from a page, use NOFOLLOW.

<meta name="robots" content="nofollow">

Content from a page that has NOFOLLOW will still be indexed.

To not follow links and not index content from a page, use NOINDEX, NOFOLLOW or NONE.

<meta name="robots" content="noindex, nofollow">

NONE is a synonym for the above:

<meta name="robots" content="none">

We recommend specifying the robots directives in a single tag, but multiple tags will be combined if present.
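
For example, a page that carries the two tags below (a minimal sketch of the combining behavior described above) is treated the same as a page with a single noindex, nofollow tag:

<meta name="robots" content="noindex">
<meta name="robots" content="nofollow">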

Directing instructions at the Site Search Crawler only

Using meta name="robots" applies your instructions to all web crawlers, including Swiftbot, the Site Search Crawler. If you would like to direct special instructions at the Site Search Crawler only, use st:robots as the name instead of robots.

Example - st:robots overrides robots for the Site Search Crawler
<meta name="robots" content="noindex, nofollow">
<meta name="st:robots" content="follow, index">

This example tells other crawlers not to index or follow links from the page, but allows the Site Search Crawler to index and follow links.

When any meta name of st:robots is present on the page, all other robots meta rules will be ignored in favor of the st:robots rule.

Repeated content values

If robots directives are repeated, the Site Search Crawler will use the most restrictive.

<meta name="robots" content="noindex">
<meta name="robots" content="index">

The above is equivalent to NOINDEX.

Casing, spacing, and ordering

Tags, attribute names, and attribute values are all case-insensitive.

Multiple attribute values must be separated by a comma, but whitespace is ignored. Order is not important (NOINDEX, NOFOLLOW is the same as NOFOLLOW, NOINDEX). The following are considered the same:

<meta name="robots" content="noindex, nofollow">
  <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  <META name="rOBOTs" content="     noIndex    ,     NOfollow   ">

Nofollow Support

The Site Search Crawler supports the rel="nofollow" standard.

<a href="/some/page" rel="nofollow">Do not follow this link</a>

Since the Site Search Crawler only follows links on the same domain, this is useful for crawl prioritization. For example, you could put rel="nofollow" on your sign in and create account links, and the Site Search Crawler will not try to follow them.
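
For example, you might mark account-related links like these (the /login and /signup paths here are hypothetical):

<a href="/login" rel="nofollow">Sign in</a>
<a href="/signup" rel="nofollow">Create account</a>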

Sitemap.xml Support

The Site Search Crawler supports the Sitemap XML format. Using a Sitemap can provide a significant speed boost to the crawl. Instead of examining each page for new links to follow, the Site Search Crawler will use your Sitemap file(s) to download the URLs directly.

We have instructions on:

The Sitemap Format

The Sitemap XML format specifies a list of URLs to index.

Example - Sitemap
<?xml version="1.0" encoding="UTF-8"?>

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.yourdomain.com/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/faq/</loc>
  </url>
  <url>
    <loc>http://www.yourdomain.com/about/</loc>
  </url>
</urlset>

A sitemap file can also link to a list of other sitemaps:

Example - Sitemap index
<?xml version="1.0" encoding="UTF-8"?>

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yoursite.com/sitemap1.xml</loc>
    <lastmod>2012-10-01T18:23:17+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yoursite.com/sitemap2.xml</loc>
    <lastmod>2012-01-01</lastmod>
  </sitemap>
</sitemapindex>

For full details, review the Sitemaps documentation.

Installing Your Sitemap

The Site Search Crawler supports specifying Sitemap files in your robots.txt file.

Example - /robots.txt file with multiple Sitemap URLs
User-agent: *
Sitemap: http://www.yourdomain.com/sitemap1.xml
Sitemap: http://www.yourdomain.com/sitemap2.xml

If no Sitemap files are found in the robots.txt file, the crawler will try to find one at /sitemap.xml.

Unsupported Features

Site Search does not currently support pinging to notify the crawler of Sitemap existence, page priority, last modification date, or refresh frequency.

RSS and Atom Feed Support

RSS and Atom are standards for content syndication. In short, they provide a machine-readable way of describing updated content on a website. Many blogging platforms and content management systems support either RSS or Atom (or both) to help readers stay up-to-date with their content.

The Site Search Crawler supports RSS and Atom feeds. If your website provides an RSS or Atom feed, our crawler will download it to find new links on your site to index first. This is particularly useful when Site Search is doing an incremental update of your website, as it gives us a good hint about where to find the most recently updated pages.

To tell the Site Search Crawler about your RSS or Atom feed, use auto-discovery in the <head> section of your template.

For example, if your website has an RSS feed at http://www.yoursite.com/index.rss, the auto-discovery code would look like this:

Example - RSS auto-discovery
<html>
  <head>
    <title>Your Site Title</title>
    <link rel="alternate" type="application/rss+xml" title="RSS feed" href="http://www.yoursite.com/index.rss">
  </head>
  <body>
    ...
  </body>
</html>

If your website has an Atom feed, you would use the application/atom+xml type for the link tag:

<link rel="alternate" type="application/atom+xml" title="Atom feed" href="http://www.yoursite.com/index.atom">

To verify that your RSS or Atom auto-discovery is properly configured, try subscribing to your site in a feed reader like NetNewsWire, FeedDemon, or Feedly.

Password Protected Content

The Site Search Crawler does not automatically support indexing content protected by HTTP authentication. You can work around this limitation by configuring your site to allow access to the Site Search Crawler agent, Swiftbot. For a more secure solution, our support team can also supply you with your account-specific user agent string.
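
For example, if your content sits behind HTTP Basic authentication on an Apache 2.4 server, a configuration along these lines could let requests carrying the Swiftbot User-agent through (a sketch only: the realm name and .htpasswd path are placeholders, and matching on the User-agent alone is not a strong security control, which is why the account-specific agent string mentioned above is the more secure option):

# Flag requests whose User-agent contains "Swiftbot"
SetEnvIfNoCase User-Agent "Swiftbot" allow_swiftbot

AuthType Basic
AuthName "Restricted content"
AuthUserFile /path/to/.htpasswd

# Grant access to either a valid login or the flagged crawler
<RequireAny>
  Require valid-user
  Require env allow_swiftbot
</RequireAny>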

If these options will not work for your content, you may want to consider implementing the Site Search API or trying out App Search.

Constant Crawl

For our Business and Premium customers, we offer our Constant Crawl feature, which is capable of detecting new and updated pages on your site and indexing them in near real time. This feature is most beneficial for ingesting new content that is added in-between crawls of your entire website.

If you think the Constant Crawl feature would be a good option for your site, but are not currently a Business or Premium customer, you can reach out to sales@swiftype.com to learn more.

The Site Search URL Inspector: FAQ

The URL Inspector is an invaluable tool! We have written this guide in FAQ format.

Review these steps to get up and running in no time.

What does the URL Inspector do?

If you are using the Site Search Crawler, you can use the URL Inspector to dive deeper into any URLs to learn more details. The Inspector will evaluate a URL and give you feedback on its findings.

Where is the URL Inspector?

There are two ways to get to the URL Inspector:

The first option is to go to the Content section of the Site Search Dashboard and paste a URL in the search bar. If the URL has no matches, there will be an option to 'Inspect this URL.'

Inspect this URL

The second option is to go to the Content section of the Site Search Dashboard and click on a content link. Once on the content's Properties page, you will find an Options drop-down at the top right. Click the drop-down and select Inspect.

Inspect

Both of these options take you to the URL Inspector landing page.

How do I use the URL Inspector?

When you get to the URL Inspector page, the URL previously selected will appear in the search bar and the Inspector's evaluation of the URL will appear below it.

You can use the URL Inspector landing page to search other URLs as well. Just make sure that when you type the URL into the search bar, you include http:// or https://. Then click Inspect and see what Site Search can find!

What will it be able to tell me about my URLs?

As mentioned earlier, the URL Inspector gives you more details about your URLs such as:

If the URL is not valid:

Invalid URL

If the URL is blocked by domain rules:

It even gives you specific details about which whitelist or blacklist rules are impacting the URL and guides you through resolution steps.

Blacklist Rules

If the URL doesn't have a confirmed domain for the engine:

Missing Domain

If the URL can't be reached, but the Site Search Crawler has crawled it before:

Cannot be reached with History

If the page can't be reached:

Cannot be reached

If the URL has not been visited by the current Crawl yet:

Not visited yet

If the URL can't be indexed because it duplicates another page:

Duplicate page

If the URL can't be indexed due to mismatched field types:

Mismatched field types

If indexing is in progress:

Indexing in progress

Or if everything looks great for that URL:

You will get a confirmation that the latest version of the page is available for search as well as details on when it was last updated.

Good to Go

Once you've completed these steps, you should be all set! The URL Inspector promises to take your Site Search indexing to the next level.

Canonical Crawling

Webpages may contain duplicate content. Search engines want to identify the original and reward pages that represent the original content clearly. To ensure good standing within search engine results, developers can include canonical link elements within the <head> of duplicate pages:

<link rel="canonical" href="https://example.com/tea/peppermint" />

These tags tell search engine crawlers that: "This is a duplicate. The original is found at this link: https://example.com/tea/peppermint." Thus, they preserve quality rankings.

The Site Search Crawler will obey these tags, too. However, an incorrect implementation will prevent your pages from being indexed...

When a canonical link element is attached to a page, it should contain that page's precise URI - https://example.com/tea/peppermint in the example above - or the page should carry no canonical link element at all.

For example, suppose the homepage URI - https://example.com - is used as the canonical link element on every page. The crawler will assume there is only one page - https://example.com - and the other pages will not be indexed. Each page should declare its own URI.
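
To illustrate (the /tea/green path here is hypothetical), a page other than the homepage should declare its own URI rather than the homepage's:

<!-- incorrect: every page points at the homepage -->
<link rel="canonical" href="https://example.com" />

<!-- correct: the page declares its own URI -->
<link rel="canonical" href="https://example.com/tea/green" />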

Re-direct loops!

Developers may choose to put a canonical link element within the original page, too, denoting that it is the source of truth. Consider a case where the element does not share an exact match with the actual page URI.

For example, suppose https://example.com/tea/peppermint/ is included within the canonical link element - note the trailing slash! When the crawler visits https://example.com/tea/peppermint, where the content is located, the canonical link element directs it to https://example.com/tea/peppermint/... but hitting that URL takes the session back to https://example.com/tea/peppermint -- argh ~ an infinite loop! This can also occur with other misconfigured 301 redirects.
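
In other words, the page served at https://example.com/tea/peppermint carries a canonical URI that differs only by the trailing slash (a sketch of the mismatch described above):

<!-- the page is served at https://example.com/tea/peppermint (no trailing slash) -->
<link rel="canonical" href="https://example.com/tea/peppermint/" />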

The crawler will give up after a while, exhausted. The page will not be indexed.


Stuck? Looking for help? Contact Support!